**F**elipe Medina-Aguayo, Antony Lee and Gareth Roberts (all at Warwick University) have recently published—even though the paper was accepted a year ago—a paper in Statistics and Computing about a variant to the pseudo-marginal Metropolis-Hastings algorithm. The modification is to simulate an estimate of the likelihood or posterior at the current value of the Markov chain at every iteration, rather than reproducing the current estimate. The reason for this refreshment of the weight estimate is to prevent stickiness in the chain, when a random weight leads to a very high value of the posterior. Unfortunately, this change leads to a Markov chain with the wrong stationary distribution. When this stationary exists! The paper actually produces examples of transient noisy chains, even in simple cases such as a geometric target distribution. And even when taking the average of a large number of weights. But the paper also contains sufficient conditions, like negative weight moments or uniform ergodicity of the proposal, for the noisy chain to be geometrically ergodic. Even though the applicability of those conditions to complex targets is not always obvious.

## Archive for the Statistics Category

## stability of noisy Metropolis-Hastings

Posted in Statistics with tags Markov chain, Markov chain Monte Carlo algorithm, MCMC convergence, particle filter, pseudo-marginal MCMC, sequential Monte Carlo, University of Warwick on September 28, 2016 by xi'an## importance sampling by kernel smoothing

Posted in Books, Statistics with tags Bernoulli, importance sampling, leave-one-out calibration, non-parametric kernel estimation, unbiased estimation, variance correction on September 27, 2016 by xi'an**A**s noted in an earlier post, Bernard Delyon and François Portier have recently published a paper in Bernoulli about improving the speed of convergence of an importance sampling estimator of

∫ φ(x) dx

when replacing the true importance distribution ƒ with a leave-one-out (!) kernel estimate in the importance sampling estimator… They also consider a debiased version that converges even faster at the rate

where n is the sample size, h the bandwidth and d the dimension. There is however a caveat, namely a collection of restrictive assumptions on the components of this new estimator:

- the integrand φ has a compact support, is bounded, and satisfies some Hölder-type regularity condition;
- the importance distribution ƒ is upper and lower bounded, its r-th order derivatives are upper bounded;
- the kernel K is order r, with exponential tails, and symmetric;
- the leave-one-out correction for bias has a cost O(n²) compared with O(n) cost of the regular Monte-Carlo estimator;
- the bandwidth h in the kernel estimator has a rate in n linked with the dimension d and the regularity indices of ƒ and φ

and this bandwidth needs to be evaluated as well. In the paper the authors rely on a control variate for which the integral is known, but which “looks like φ”, a strong requirement *in appearance only* since this new function is the convolution of φ with a kernel estimate of ƒ which expectation is the original importance estimate of the integral. This sounds convoluted but this is a generic control variate nonetheless! But this is also a costly step. Because of the kernel estimation aspect, the method deteriorates with the dimension of the variate x. However, since φ(x) is a real number, I wonder if running the non-parametric density estimate directly on the sample of φ(x)’s would lead to an improved estimator…

## maximum of a Dirichlet vector

Posted in Statistics, Books with tags LaTeX, cross validated, marginalisation, order statistics, Stack Exchange, Dirichlet distribution, stamp, Peter Dirichlet on September 26, 2016 by xi'an**A**n intriguing question on Stack Exchange this weekend, about the distribution of max{p¹,p²,…}the maximum component of a Dirichlet vector Dir(a¹,a²,…) with arbitrary hyper-parameters. Writing the density of this random variable is feasible, using its connection with a Gamma vector, but I could not find a closed-form expression. If there is such an expression, it may follow from the many properties of the Dirichlet distribution and I’d be interested in learning about it. (Very nice stamp, by the way! I wonder if the original formula was made with LaTeX…)

## contemporary issues in hypothesis testing

Posted in Statistics with tags Bayesian hypothesis testing, birthday problem, bounds, brain imaging, CRiSM, decision theory, E.T. Jaynes, England, fMRI, Genetics, Gregor Mendel, ISBA 2016, p-values, permutation tests, PNAS, podcast, psychology, University of Warwick, Zeeman building on September 26, 2016 by xi'an**T**his week [at Warwick], among other things, I attended the CRiSM workshop on hypothesis testing, giving the same talk as at ISBA last June. There was a most interesting and unusual talk by Nick Chater (from Warwick) about the psychological aspects of hypothesis testing, namely about the unnatural features of an hypothesis in everyday life, i.e., how far this formalism stands from human psychological functioning. Or what we know about it. And then my Warwick colleague Tom Nichols explained how his recent work on permutation tests for fMRIs, published in PNAS, testing hypotheses on what should be null if real data and getting a high rate of false positives, got the medical imaging community all up in arms due to over-simplified reports in the media questioning the validity of 15 years of research on fMRI and the related 40,000 papers! For instance, some of the headings questioned the entire research in the area. Or transformed a software bug missing the boundary effects into a major flaw. (See this podcast on Not So Standard Deviations for a thoughtful discussion on the issue.) One conclusion of this story is to be wary of assertions when submitting a hot story to journals with a substantial non-scientific readership! The afternoon talks were equally exciting, with Andrew explaining to us live from New York why he hates hypothesis testing and prefers model building. With the birthday model as an example. And David Draper gave an encompassing talk about the distinctions between inference and decision, proposing a Jaynes information criterion and illustrating it on Mendel‘s historical [and massaged!] pea dataset. The next morning, Jim Berger gave an overview on the frequentist properties of the Bayes factor, with in particular a novel [to me] upper bound on the Bayes factor associated with a p-value (Sellke, Bayarri and Berger, 2001)

B¹⁰(p) ≤ 1/-e p log p

with the specificity that B¹⁰(p) is not testing the original hypothesis [problem] but a substitute where the null is the hypothesis that p is uniformly distributed, versus a non-parametric alternative that p is more concentrated near zero. This reminded me of our PNAS paper on the impact of summary statistics upon Bayes factors. And of some forgotten reference studying Bayesian inference based solely on the p-value… It is too bad I had to rush back to Paris, as this made me miss the last talks of this fantastic workshop centred on maybe the most important aspect of statistics!

## Jim Harrison (1937-2016)

Posted in Statistics with tags A Good Day to Die, Arizona, Jim Harrison, Lake Michigan, Legends of the Fall, Upper Peninsula on September 25, 2016 by xi'an

“The wilderness does not make you forget your normal life as much as it removes the distractions for proper remembering.”J. Harrison

**O**ne of my favourite authors passed away earlier this year and I was not even aware of it! Jim Harrison died from a heart attack in Arizona on March 26. I read Legends of the Fall [for the first time] when I arrived in the US in 1987 and then other [if not all] novels like A good day to die or Wolf…

“Barring love, I’ll take my life in large doses alone: rivers, forests, fish, grouse, mountains. Dogs.”J. Harrison

What I liked in those novels was less the plot, which often is secondary—even though the Cervantesque story of the two guys trying to blow a dam in A good day to die is pure genius!—, than the depiction of the characters and their almost always bleak life, as well as the love of outdoors, in a northern Michigan that is at its heart undistinguishable from (eastern) Canada or central Finland. His tales told of eating and drinking, of womanising, fishing, and hunting, of failed promises and multiple capitulations, tales that are always bawdy and brimming with testosterone, but also with a gruff tenderness for those big hairy guys and their dogs. Especially their dogs. There is a lot of nostalgia seeping through these stories, a longing for a wild rural (almost feral) America that most people will never touch. Or even conceive. But expressed in a melancholic rather than reactionary way. In a superb prose that often sounded like a poem.

“I like grit, I like love and death, I am tired of irony…”J. Harrison

If anything, remembering those great novels makes me long for the most recent books of Harrison I have not [yet] read. Plus the non-fiction book The Raw and the Cooked.

## an inverse permutation test

Posted in Books, Kids, R, Statistics with tags order statistics, permutation tests, R, sorting, The Riddler on September 23, 2016 by xi'an**A** straightforward but probabilistic riddle this week in the Riddler, which is to find the expected order of integer i when the sequence {1,2,…,n} is partitioned at random into two sets, A and B, each of which is then sorted before both sets are merged. For instance, if {1,2,3,4} is divided in A={1,4} and B={2,3}, the order of 2 in {1,4,2,3} is 3. An R rendering of the experiment is

m=rbinom(1,n,.5)

if (m*(n-m)>0){

fist=sort(sample(1:n,m))

return(order(c(fist,sort((1:n)[-fist])))[i])

}else{

return(i)}

[\sourcecode]

It is rather easy to find that the probability that the order of i takes the value j is

if j<i (i is in A) and

if $j>i$ (i is in B), the case i=j being the addition of both cases, but the mean can be found (almost) immediately by considering that, when i is in A, its average value is (i+1)/2 and when it is in B, its average value is (n+i)/2 [by symmetry]. Hence a global mean of (2i+n+1)/4….

## Monte Carlo with determinantal processes [reply from the authors]

Posted in Books, Statistics with tags Bayesian inference, central limit theorem, determinantal point processes, Gaussian processes, Gaussian quadrature, hypercube, Lebesgue measure, Monte Carlo integration, Monte Carlo Statistical Methods, orthogonal polynomials, probabilistic numerics, quasi-Monte Carlo methods, super-efficiency, Terence Tao, zero variance importance sampling on September 22, 2016 by xi'an*[Rémi Bardenet and Adrien Hardy have written a reply to my comments of today on their paper, which is more readable as a post than as comments, so here it is. I appreciate the intention, as well as the perfect editing of the reply, suited for a direct posting!]*

**T**hanks for your comments, Xian. As a foreword, a few people we met also had the intuition that DPPs would be relevant for Monte Carlo, but no result so far was backing this claim. As it turns out, we had to work hard to prove a CLT for importance-reweighted DPPs, using some deep recent results on orthogonal polynomials. We are currently working on turning this probabilistic result into practical algorithms. For instance, efficient sampling of DPPs is indeed an important open question, to which most of your comments refer. Although this question is out of the scope of our paper, note however that our results do not depend on how you sample. Efficient sampling of DPPs, along with other natural computational questions, is actually the crux of an ANR grant we just got, so hopefully in a few years we can write a more detailed answer on this blog! We now answer some of your other points.

*“one has to examine the conditions for the result to operate, from the support being within the unit hypercube,”*

Any compactly supported measure would do, using dilations, for instance. Note that we don’t assume the support is the whole hypercube.

*“to the existence of N orthogonal polynomials wrt the dominating measure, not discussed here”*

As explained in Section 2.1.2, it is enough that the reference measure charges some open set of the hypercube, which is for instance the case if it has a density with respect to the Lebesgue measure.

*“to the lack of relation between the point process and the integrand,”*

Actually, our method depends heavily on the target measure μ. Unlike vanilla QMC, the repulsiveness between the quadrature nodes is tailored to the integration problem.

*“changing N requires a new simulation of the entire vector unless I missed the point.”*

You’re absolutely right. This is a well-known open issue in probability, see the discussion on Terence Tao’s blog.

*“This requires figuring out the upper bounds on the acceptance ratios, a “problem-dependent” request that may prove impossible to implement”*

We agree that in general this isn’t trivial. However, good bounds are available for all Jacobi polynomials, see Section 3.

*“Even without this stumbling block, generating the N-sized sample for dimension d=N (why d=N, I wonder?)”*

This is a misunderstanding: we do not say that d=N in any sense. We only say that sampling from a DPP using the algorithm of [Hough et al] requires the same number of operations as orthonormalizing N vectors of dimension N, hence the cubic cost.

*1. “how does it relate to quasi-Monte Carlo?”*

So far, the connection to QMC is only intuitive: both rely on well-spaced nodes, but using different mathematical tools.

*2. “the marginals of the N-th order determinantal process are far from uniform (see Fig. 1), and seemingly concentrated on the boundaries”*

This phenomenon is due to orthogonal polynomials. We are investigating more general constructions that give more flexibility.

*3. “Is the variance of the resulting estimator (2.11) always finite?”*

Yes. For instance, this follows from the inequality below (5.56) since ƒ(x)/K(x,x) is Lipschitz.

*4. and 5.* We are investigating concentration inequalities to answer these points.

*6. “probabilistic numerics produce an epistemic assessment of uncertainty, contrary to the current proposal.”*

A partial answer may be our Remark 2.12. You can interpret DPPs as putting a Gaussian process prior over ƒ and sequentially sampling from the posterior variance of the GP.