Archive for curse of dimensionality

new estimators of evidence

Posted in Books, Statistics with tags , , , , , , , , , , , , on June 19, 2018 by xi'an

In an incredible accumulation of coincidences, I came across yet another paper about evidence and the harmonic mean challenge, by Yu-Bo Wang, Ming-Hui Chen [same as in Chen, Shao, Ibrahim], Lynn Kuo, and Paul O. Lewis this time, published in Bayesian Analysis. (Disclaimer: I was not involved in the reviews of any of these papers!)  Authors who arelocated in Storrs, Connecticut, in geographic and thematic connection with the original Gelfand and Dey (1994) paper! (Private joke about the Old Man of Storr in above picture!)

“The working parameter space is essentially the constrained support considered by Robert and Wraith (2009) and Marin and Robert (2010).”

The central idea is to use a more general function than our HPD restricted prior but still with a known integral. Not in the sense of control variates, though. The function of choice is a weighted sum of indicators of terms of a finite partition, which implies a compact parameter set Ω. Or a form of HPD region, although it is unclear when the volume can be derived. While the consistency of the estimator of the inverse normalising constant [based on an MCMC sample] is unsurprising, the more advanced part of the paper is about finding the optimal sequence of weights, as in control variates. But it is also unsurprising in that the weights are proportional to the inverses of the inverse posteriors over the sets in the partition. Since these are hard to derive in practice, the authors come up with a fairly interesting alternative, which is to take the value of the posterior at an arbitrary point of the relevant set.

The paper also contains an extension replacing the weights with functions that are integrable and with known integrals. Which is hard for most choices, even though it contains the regular harmonic mean estimator as a special case. And should also suffer from the curse of dimension when the constraint to keep the target almost constant is implemented (as in Figure 1).

The method, when properly calibrated, does much better than harmonic mean (not a surprise) and than Petris and Tardella (2007) alternative, but no other technique, on toy problems like Normal, Normal mixture, and probit regression with three covariates (no Pima Indians this time!). As an aside I find it hard to understand how the regular harmonic mean estimator takes longer than this more advanced version, which should require more calibration. But I find it hard to see a general application of the principle, because the partition needs to be chosen in terms of the target. Embedded balls cannot work for every possible problem, even with ex-post standardisation.

 

Metropolis-Hastings importance sampling

Posted in Books, Statistics, University life with tags , , , , , , , , , on June 6, 2018 by xi'an

[Warning: As I first got the paper from the authors and sent them my comments, this paper read contains their reply as well.]

In a sort of crazy coincidence, Daniel Rudolf and Björn Sprungk arXived a paper on a Metropolis-Hastings importance sampling estimator that offers similarities with  the one by Ingmar Schuster and Ilja Klebanov posted on arXiv the same day. The major difference in the construction of the importance sampler is that Rudolf and Sprungk use the conditional distribution of the proposal in the denominator of their importance weight, while Schuster and Klebanov go for the marginal (or a Rao-Blackwell representation of the marginal), mostly in an independent Metropolis-Hastings setting (for convergence) and for a discretised Langevin version in the applications. The former use a very functional L² approach to convergence (which reminded me of the early Schervish and Carlin, 1990, paper on the convergence of MCMC algorithms), not all of it necessary in my opinion. As for instance the extension of convergence properties to the augmented chain, namely (current, proposed), is rather straightforward since the proposed chain is a random transform of the current chain. An interesting remark at the end of the proof of the CLT is that the asymptotic variance of the importance sampling estimator is the same as with iid realisations from the target. This is a point we also noticed when constructing population Monte Carlo techniques (more than ten years ago), namely that dependence on the past in sequential Monte Carlo does not impact the validation and the moments of the resulting estimators, simply because “everything cancels” in importance ratios. The mean square error bound on the Monte Carlo error (Theorem 20) is not very surprising as the term ρ(y)²/P(x,y) appears naturally in the variance of importance samplers.

The first illustration where the importance sampler does worse than the initial MCMC estimator for a wide range of acceptance probabilities (Figures 2 and 3, which is which?) and I do not understand the opposite conclusion from the authors.

[Here is an answer from Daniel and Björn about this point:]

Indeed the formulation in our paper is unfortunate. The point we want to stress is that we observed in the numerical experiments certain ranges of step-sizes for which MH importance sampling shows a better performance than the classical MH algorithm with optimal scaling. Meaning that the MH importance sampling with optimal step-size can outperform MH sampling, without using additional computational resources. Surprisingly, the optimal step-size for the MH importance sampling estimator seems to remain constant for an increasing dimension in contrast to the well-known optimal scaling of the MH algorithm (given by a constant optimal acceptance rate).

The second uses the Pima Indian diabetes benchmark, amusingly (?) referring to Chopin and Ridgway (2017) who warn against the recourse to this dataset and to this model! The loss in mean square error due to the importance sampling may again be massive (Figure 5) and setting for an optimisation of the scaling factor in Metropolis-Hastings algorithms sounds unrealistic.

[And another answer from Daniel and Björn about this point:]

Indeed, Chopin and Ridgway suggest more complex problems with a larger number of covariates as benchmarks. However, the well-studied PIMA data set is a sufficient example in order to illustrate the possible benefits but also the limitations of the MH importance sampling approach. The latter are clearly (a) the required knowledge about the optimal step-size—otherwise the performance can indeed be dramatically worse than for the MH algorithm—and (b) the restriction to a small or at most moderate number of covariates. As you are indicating, optimizing the scaling factor is a challenging task. However, the hope is to derive some simple rule of thumb for the MH importance sampler similar to the well-known acceptance rate tuning for the standard MCMC estimator.

the curse of large dimension [teaser]

Posted in Books, pictures, Statistics, University life with tags , , , , , , on January 11, 2018 by xi'an

MCM 2017

Posted in Statistics with tags , , , , , , , , , , , , on July 3, 2017 by xi'an

And thus I am back in Montréal, for MCM 2017, located in HEC Montréal, on the campus of Université de Montréal, for three days. My talk is predictably about ABC, what else?!, gathering diverse threads from different talks and papers:

marginal likelihoods from MCMC

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , , , , on April 26, 2017 by xi'an

A new arXiv entry on ways to approximate marginal likelihoods based on MCMC output, by astronomers (apparently). With an application to the 2015 Planck satellite analysis of cosmic microwave background radiation data, which reminded me of our joint work with the cosmologists of the Paris Institut d’Astrophysique ten years ago. In the literature review, the authors miss several surveys on the approximation of those marginals, including our San Antonio chapter, on Bayes factors approximations, but mention our ABC survey somewhat inappropriately since it is not advocating the use of ABC for such a purpose. (They mention as well variational Bayes approximations, INLA, powered likelihoods, if not nested sampling.)

The proposal of this paper is to identify the marginal m [actually denoted a there] as the normalising constant of an unnormalised posterior density. And to do so the authors estimate the posterior by a non-parametric approach, namely a k-nearest-neighbour estimate. With the additional twist of producing a sort of Bayesian posterior on the constant m. [And the unusual notion of number density, used for the unnormalised posterior.] The Bayesian estimation of m relies on a Poisson sampling assumption on the k-nearest neighbour distribution. (Sort of, since k is actually fixed, not random.)

If the above sounds confusing and imprecise it is because I am myself rather mystified by the whole approach and find it difficult to see the point in this alternative. The Bayesian numerics does not seem to have other purposes than producing a MAP estimate. And using a non-parametric density estimate opens a Pandora box of difficulties, the most obvious one being the curse of dimension(ality). This reminded me of the commented paper of Delyon and Portier where they achieve super-efficient convergence when using a kernel estimator, but with a considerable cost and a similar sensitivity to dimension.

seeking the error in nested sampling

Posted in pictures, Statistics, Travel with tags , , , , , , on April 13, 2017 by xi'an

A newly arXived paper on the error in nested sampling, written by Higson and co-authors, and read in Berlin, looks at the difficult task of evaluating the sampling error of nested sampling. The conclusion is essentially negative in that the authors recommend multiple runs of the method to assess the magnitude of the variability of the output by bootstrap, i.e. to call for the most empirical approach…

The core of this difficulty lies in the half-plug-in, half-quadrature, half-Monte Carlo (!) feature of the nested sampling algorithm, in that (i) the truncation of the unit interval is based on a expectation of the mass of each shell (i.e., the zone between two consecutive isoclines of the likelihood, (ii) the evidence estimator is a quadrature formula, and (iii) the level of the likelihood at the truncation is replaced with a simulated value that is not even unbiased (and correlated with the previous value in the case of an MCMC implementation). As discussed in our paper with Nicolas, the error in the evidence approximation is of the same order as other Monte Carlo methods in that it gets down like the square root of the number of terms at each iteration. Contrary to earlier intuitions that focussed on the error due to the quadrature.

But the situation is much less understood when the resulting sample is used for estimation of quantities related with the posterior distribution. With no clear approach to assess and even less correct the resulting error, since it is not solely a Monte Carlo error. As noted by the authors, the quadrature approximation to the univariate integral replaces the unknown prior weight of a shell with its Beta order statistic expectation and the average of the likelihood over the shell with a single (uniform???) realisation. Or the mean value of a transform of the parameter with a single (biased) realisation. Since most posterior expectations can be represented as integrals over likelihood levels of the average value over an iso-likelihood contour. The approach advocated in the paper involved multiple threads of an “unwoven nested sampling run”, which means launching n nested sampling runs with one living term from the n currents living points in the current nested sample. (Those threads may then later be recombined into a single nested sample.) This is the starting point to a nested flavour of bootstrapping, where threads are sampled with replacement, from which confidence intervals and error estimates can be constructed. (The original notion appears in Skilling’s 2006 paper, but I missed it.)

The above graphic is an attempt within the paper at representing the (marginal) posterior of a transform f(θ). That I do not fully understand… The notations are rather horrendous as X is not the data but the prior probability for the likelihood to be above a given bound which is actually the corresponding quantile. (There is no symbol for data and £ is used for the likelihood function as well as realisations of the likelihood function…) A vertical slice on the central panel gives the posterior distribution of f(θ) given the event that the likelihood is in the corresponding upper tail. Or given the corresponding shell (?).

high dimension Metropolis-Hastings algorithms

Posted in Books, Kids, Mountains, pictures, R, Statistics with tags , , , , , , on January 26, 2016 by xi'an

When discussing high dimension models with Ingmar Schüster Schuster [blame my fascination for accented characters!] the other day, we came across the following paradox with Metropolis-Hastings algorithms. If attempting to simulate from a multivariate standard normal distribution in a large dimension, when starting from the mode of the target, i.e., its mean γ, leaving the mode γis extremely unlikely, given the huge drop between the value of the density at the mode γ and at likely realisations (corresponding to the blue sequence). Even when relying on the very scale that makes the proposal identical to the target! Resorting to a tiny scale like Σ/p manages to escape the unhealthy neighbourhood of the highly unlikely mode (as shown with the brown sequence).

Here is the corresponding R code:

p=100
T=1e3
mh=mu #mode as starting value
vale=rep(0,T)
for (t in 1:T){
prop=mvrnorm(1,mh,sigma/p)
if (log(runif(1))<logdmvnorm(prop,mu,sigma)-
   logdmvnorm(mh,mu,sigma)) mh=prop
vale[t]=logdmvnorm(mh,mu,sigma)}