## importance sampling by kernel smoothing

Posted in Books, Statistics with tags , , , , , on September 27, 2016 by xi'an

As noted in an earlier post, Bernard Delyon and François Portier have recently published a paper in Bernoulli about improving the speed of convergence of an importance sampling estimator of

∫ φ(x) dx

when replacing the true importance distribution ƒ with a leave-one-out (!) kernel estimate in the importance sampling estimator… They also consider a debiased version that converges even faster at the rate

$n h_n^{d/2}$

where n is the sample size, h the bandwidth and d the dimension. There is however a caveat, namely a collection of restrictive assumptions on the components of this new estimator:

1. the integrand φ has a compact support, is bounded, and satisfies some Hölder-type regularity condition;
2. the importance distribution ƒ is upper and lower bounded, its r-th order derivatives are upper bounded;
3. the kernel K is order r, with exponential tails, and symmetric;
4. the leave-one-out correction for bias has a cost O(n²) compared with O(n) cost of the regular Monte-Carlo estimator;
5. the bandwidth h in the kernel estimator has a rate in n linked with the dimension d and the regularity indices of ƒ and φ

and this bandwidth needs to be evaluated as well. In the paper the authors rely on a control variate for which the integral is known, but which “looks like φ”, a strong requirement in appearance only since this new function is the convolution of φ with a kernel estimate of ƒ which expectation is the original importance estimate of the integral. This sounds convoluted but this is a generic control variate nonetheless! But this is also a costly step. Because of the kernel estimation aspect, the method deteriorates with the dimension of the variate x. However, since φ(x) is a real number, I wonder if running the non-parametric density estimate directly on the sample of φ(x)’s would lead to an improved estimator…

## maximum of a Dirichlet vector

Posted in Books, Statistics with tags , , , , , , , on September 26, 2016 by xi'an

An intriguing question on Stack Exchange this weekend, about the distribution of max{p¹,p²,…}the maximum component of a Dirichlet vector Dir(a¹,a²,…) with arbitrary hyper-parameters. Writing the density of this random variable is feasible, using its connection with a Gamma vector, but I could not find a closed-form expression. If there is such an expression, it may follow from the many properties of the Dirichlet distribution and I’d be interested in learning about it. (Very nice stamp, by the way! I wonder if the original formula was made with LaTeX…)

## an inverse permutation test

Posted in Books, Kids, R, Statistics with tags , , , , on September 23, 2016 by xi'an

A straightforward but probabilistic riddle this week in the Riddler, which is to find the expected order of integer i when the sequence {1,2,…,n} is partitioned at random into two sets, A and B, each of which is then sorted before both sets are merged. For instance, if {1,2,3,4} is divided in A={1,4} and B={2,3}, the order of 2 in {1,4,2,3} is 3. An R rendering of the experiment is

m=rbinom(1,n,.5)
if (m*(n-m)>0){
fist=sort(sample(1:n,m))
return(order(c(fist,sort((1:n)[-fist])))[i])
}else{
return(i)}
[\sourcecode]

It is rather easy to find that the probability that the order of i takes the value j is

${i-1 \choose j-1}(1/2)^i$

if j<i (i is in A) and

${n-i \choose n-j}(1/2)^{n-i+1}$

if $j>i$ (i is in B), the case i=j being the addition of both cases, but the mean can be found (almost) immediately by considering that, when i is in A, its average value is (i+1)/2 and when it is in B, its average value is (n+i)/2 [by symmetry]. Hence a global mean of (2i+n+1)/4….

## Monte Carlo with determinantal processes [reply from the authors]

Posted in Books, Statistics with tags , , , , , , , , , , , , , , on September 22, 2016 by xi'an

[Rémi Bardenet and Adrien Hardy have written a reply to my comments of today on their paper, which is more readable as a post than as comments, so here it is. I appreciate the intention, as well as the perfect editing of the reply, suited for a direct posting!]

Thanks for your comments, Xian. As a foreword, a few people we met also had the intuition that DPPs would be relevant for Monte Carlo, but no result so far was backing this claim. As it turns out, we had to work hard to prove a CLT for importance-reweighted DPPs, using some deep recent results on orthogonal polynomials. We are currently working on turning this probabilistic result into practical algorithms. For instance, efficient sampling of DPPs is indeed an important open question, to which most of your comments refer. Although this question is out of the scope of our paper, note however that our results do not depend on how you sample. Efficient sampling of DPPs, along with other natural computational questions, is actually the crux of an ANR grant we just got, so hopefully in a few years we can write a more detailed answer on this blog! We now answer some of your other points.

“one has to examine the conditions for the result to operate, from the support being within the unit hypercube,”
Any compactly supported measure would do, using dilations, for instance. Note that we don’t assume the support is the whole hypercube.

“to the existence of N orthogonal polynomials wrt the dominating measure, not discussed here”
As explained in Section 2.1.2, it is enough that the reference measure charges some open set of the hypercube, which is for instance the case if it has a density with respect to the Lebesgue measure.

“to the lack of relation between the point process and the integrand,”
Actually, our method depends heavily on the target measure μ. Unlike vanilla QMC, the repulsiveness between the quadrature nodes is tailored to the integration problem.

“changing N requires a new simulation of the entire vector unless I missed the point.”
You’re absolutely right. This is a well-known open issue in probability, see the discussion on Terence Tao’s blog.

“This requires figuring out the upper bounds on the acceptance ratios, a “problem-dependent” request that may prove impossible to implement”
We agree that in general this isn’t trivial. However, good bounds are available for all Jacobi polynomials, see Section 3.

“Even without this stumbling block, generating the N-sized sample for dimension d=N (why d=N, I wonder?)”
This is a misunderstanding: we do not say that d=N in any sense. We only say that sampling from a DPP using the algorithm of [Hough et al] requires the same number of operations as orthonormalizing N vectors of dimension N, hence the cubic cost.

1. “how does it relate to quasi-Monte Carlo?”
So far, the connection to QMC is only intuitive: both rely on well-spaced nodes, but using different mathematical tools.

2. “the marginals of the N-th order determinantal process are far from uniform (see Fig. 1), and seemingly concentrated on the boundaries”
This phenomenon is due to orthogonal polynomials. We are investigating more general constructions that give more flexibility.

3. “Is the variance of the resulting estimator (2.11) always finite?”
Yes. For instance, this follows from the inequality below (5.56) since ƒ(x)/K(x,x) is Lipschitz.

4. and 5. We are investigating concentration inequalities to answer these points.

6. “probabilistic numerics produce an epistemic assessment of uncertainty, contrary to the current proposal.”
A partial answer may be our Remark 2.12. You can interpret DPPs as putting a Gaussian process prior over ƒ and sequentially sampling from the posterior variance of the GP.

## Monte Carlo with determinantal processes

Posted in Books, Statistics with tags , , , , , , , , on September 21, 2016 by xi'an

Rémi Bardenet and Adrien Hardy have arXived this paper a few months ago but I was a bit daunted by the sheer size of the paper, until I found the perfect opportunity last week..! The approach relates to probabilistic numerics as well as Monte Carlo, in that it can be seen as a stochastic version of Gaussian quadrature. The authors mention in the early pages a striking and recent result by Delyon and Portier that using an importance weight where the sampling density is replaced with the leave-one-out kernel estimate produces faster convergence than the regular Monte Carlo √n! Which reminds me of quasi-Monte Carlo of course, discussed in the following section (§1.3), with the interesting [and new to me] comment that the theoretical rate (and thus the improvement) does not occur until the sample size N is exponential in the dimension. Bardenet and Hardy achieve similar super-efficient convergence by mixing quadrature with repulsive simulation. For almost every integrable function.

The fact that determinantal point processes (on the unit hypercube) and Gaussian quadrature methods are connected is not that surprising once one considers that such processes are associated with densities made of determinants, which matrices are kernel-based, K(x,y), with K expressed as a sum of orthogonal polynomials. An N-th order determinantal process in dimension d satisfies a generalised Central Limit Theorem in that the speed of convergence is

$\sqrt{N}^{(d-1)/d}$

which means faster than √N…  This is more surprising, of course, even though one has to examine the conditions Continue reading

## Bayesian model selection without evidence

Posted in Books, Statistics, University life with tags , , , , , , , on September 20, 2016 by xi'an

“The new method circumvents the challenges associated with accurate evidence calculations by computing posterior odds ratios using Bayesian parameter estimation”

One paper leading to another, I had a look at Hee et al. 2015 paper on Bayes factor estimation. The “novelty” stands in introducing the model index as an extra parameter in a single model encompassing all models under comparison, the “new” parameterisation being in (θ,n) rather than in θ. With the distinction that the parameter θ is now made of the union of all parameters across all models. Which reminds us very much of Carlin and Chib (1995) approach to the problem. (Peter Green in his Biometrika (1995) paper on reversible jump MCMC uses instead a direct sum of parameter spaces.) The authors indeed suggest simulating jointly (θ,n) in an MCMC or nested sampling scheme. Rather than being updated by arbitrary transforms as in Carlin and Chib (1995) the useless parameters from the other models are kept constant… The goal being to estimate P(n|D) the marginal posterior on the model index, aka the posterior probability of model n.

Now, I am quite not certain keeping the other parameter constants is a valid move: given a uniform prior on n and an equally uniform proposal, the acceptance probability simplifies into the regular Metropolis-Hastings ratio for model n. Hence the move is valid within model n. If not, I presume the previous pair (θ⁰,n⁰) is repeated. Wait!, actually, this is slightly more elaborate: if a new value of n, m, is proposed, then the acceptance ratio involves the posteriors for both n⁰ and m, possibly only the likelihoods when the proposal is the prior. So the move will directly depend on the likelihood ratio in this simplified case, which indicates the scheme could be correct after all. Except that this neglects the measure theoretic subtleties that led to reversible jump symmetry and hence makes me wonder. In other words, it follows exactly the same pattern as reversible jump without the constraints of the latter… Free lunch,  anyone?!

## Le bayésianisme aujourd’hui

Posted in Books, Statistics with tags , , , , , , , on September 19, 2016 by xi'an

A few years ago, I was asked by Isabelle Drouet to contribute a chapter to a multi-disciplinary book on the Bayesian paradigm, book that is now soon to appear. In French. It has this rather ugly title of Bayesianism today. Not that I had hear of Bayesianism or bayésianime previously. There are chapters on the Bayesian notion(s) of probability, game theory, statistics, on applications, and on the (potentially) Bayesian structure of human intelligence. Most of it is thus outside statistics, but I will certainly read through it when I receive my copy.