## importance sampling by kernel smoothing

Posted in Books, Statistics with tags , , , , , on September 27, 2016 by xi'an

As noted in an earlier post, Bernard Delyon and François Portier have recently published a paper in Bernoulli about improving the speed of convergence of an importance sampling estimator of

∫ φ(x) dx

when replacing the true importance distribution ƒ with a leave-one-out (!) kernel estimate in the importance sampling estimator… They also consider a debiased version that converges even faster at the rate

$n h_n^{d/2}$

where n is the sample size, h the bandwidth and d the dimension. There is however a caveat, namely a collection of restrictive assumptions on the components of this new estimator:

1. the integrand φ has a compact support, is bounded, and satisfies some Hölder-type regularity condition;
2. the importance distribution ƒ is upper and lower bounded, its r-th order derivatives are upper bounded;
3. the kernel K is order r, with exponential tails, and symmetric;
4. the leave-one-out correction for bias has a cost O(n²) compared with O(n) cost of the regular Monte-Carlo estimator;
5. the bandwidth h in the kernel estimator has a rate in n linked with the dimension d and the regularity indices of ƒ and φ

and this bandwidth needs to be evaluated as well. In the paper the authors rely on a control variate for which the integral is known, but which “looks like φ”, a strong requirement in appearance only since this new function is the convolution of φ with a kernel estimate of ƒ which expectation is the original importance estimate of the integral. This sounds convoluted but this is a generic control variate nonetheless! But this is also a costly step. Because of the kernel estimation aspect, the method deteriorates with the dimension of the variate x. However, since φ(x) is a real number, I wonder if running the non-parametric density estimate directly on the sample of φ(x)’s would lead to an improved estimator…

## merging MCMC subposteriors

Posted in Books, Statistics, University life with tags , , , , , , , on June 8, 2016 by xi'an

Christopher Nemeth and Chris Sherlock arXived a paper yesterday about an approach to distributed MCMC sampling via Gaussian processes. As in several other papers commented on the ‘Og, the issue is to merge MCMC samples from sub-posteriors into a sample or any sort of approximation of the complete (product) posterior. I am quite sympathetic to the approach adopted in this paper, namely to use a log-Gaussian process representation of each sub-posterior and then to replace each sub-posterior with its log-Gaussian process posterior expectation in an MCMC or importance scheme. And to assess its variability through the posterior variance of the sum of log-Gaussian processes. As pointed out by the authors the closed form representation of the posterior mean of the log-posterior is invaluable as it allows for an HMC implementation. And importance solutions as well. The probabilistic numerics behind this perspective are also highly relevant.

A few arguable (?) points:

1. The method often relies on importance sampling and hence on the choice of an importance function that is most likely influential but delicate to calibrate in complex settings as I presume the Gaussian estimates are not useful in this regard;
2. Using Monte Carlo to approximate the value of the approximate density at a given parameter value (by simulating from the posterior distribution) is natural but is it that efficient?
3. It could be that, by treating all sub-posterior samples as noisy versions of the same (true) posterior, a more accurate approximation of this posterior could be constructed;
4. The method relies on the exponentiation of a posterior expectation or simulation. As of yesterday, I am somehow wary of log-normal expectations!
5. If the purpose of the exercise is to approximate univariate integrals, it would seem more profitable to use the Gaussian processes at the univariate level;
6. The way the normalising missing constants and the duplicate simulations are processed (or not) could deserve further exploration;
7. Computing costs are in fine unclear when compared with the other methods in the toolbox.

## inefficiency of data augmentation for large samples

Posted in Books, pictures, Running, Statistics, Travel, University life with tags , , , , , , , , , , on May 31, 2016 by xi'an

On Monday, James Johndrow, Aaron Smith, Natesh Pillai, and David Dunson arXived a paper on the diminishing benefits of using data augmentation for large and highly imbalanced categorical data. They reconsider the data augmentation scheme of Tanner and Wong (1987), surprisingly not mentioned, used in the first occurrences of the Gibbs sampler like Albert and Chib’s (1993) or our mixture estimation paper with Jean Diebolt (1990). The central difficulty with data augmentation is that the distribution to be simulated operates on a space that is of order O(n), even when the original distribution covers a single parameter. As illustrated by the coalescent in population genetics (and the subsequent intrusion of the ABC methodology), there are well-known cases when the completion is near to impossible and clearly inefficient (as again illustrated by the failure of importance sampling strategies on the coalescent). The paper provides spectral gaps for the logistic and probit regression completions, which are of order a power of log(n) divided by √n, when all observations are equal to one. In a somewhat related paper with Jim Hobert and Vivek Roy, we studied the spectral gap for mixtures with a small number of observations: I wonder at the existence of a similar result in this setting, when all observations stem from one component of the mixture, when all observations are one. The result in this paper is theoretically appealing, the more because the posteriors associated with such models are highly regular and very close to Gaussian (and hence not that challenging as argued by Chopin and Ridgway). And because the data augmentation algorithm is uniformly ergodic in this setting (as we established with Jean Diebolt  and later explored with Richard Tweedie). As demonstrated in the  experiment produced in the paper, when comparing with HMC and Metropolis-Hastings (same computing times?), which produce much higher effective sample sizes.

## likelihood inflating sampling algorithm

Posted in Books, Statistics, University life with tags , , , , , , , , on May 24, 2016 by xi'an

My friends from Toronto Radu Craiu and Jeff Rosenthal have arXived a paper along with Reihaneh Entezari on MCMC scaling for large datasets, in the spirit of Scott et al.’s (2013) consensus Monte Carlo. They devised an likelihood inflated algorithm that brings a novel perspective to the problem of large datasets. This question relates to earlier approaches like consensus Monte Carlo, but also kernel and Weierstrass subsampling, already discussed on this blog, as well as current research I am conducting with my PhD student Changye Wu. The approach by Entezari et al. is somewhat similar to consensus Monte Carlo and the other solutions in that they consider an inflated (i.e., one taken to the right power) likelihood based on a subsample, with the full sample being recovered by importance sampling. Somewhat unsurprisingly this approach leads to a less dispersed estimator than consensus Monte Carlo (Theorem 1). And the paper only draws a comparison with that sub-sampling method, rather than covering other approaches to the problem, maybe because this is the most natural connection, one approach being the k-th power of the other approach.

“…we will show that [importance sampling] is unnecessary in many instances…” (p.6)

An obvious question that stems from the approach is the call for importance sampling, since the numerator of the importance sampler involves the full likelihood which is unavailable in most instances when sub-sampled MCMC is required. I may have missed the part of the paper where the above statement is discussed, but the only realistic example discussed therein is the Bayesian regression tree (BART) of Chipman et al. (1998). Which indeed constitutes a challenging if one-dimensional example, but also one that requires delicate tuning that leads to cancelling importance weights but which may prove delicate to extrapolate to other models.

## Monte Carlo methods for Potts models

Posted in pictures, Statistics, University life with tags , , , , on March 10, 2016 by xi'an

There will be a seminar talk by Mehdi Molkaraie (Pompeu Fabra) next week at Institut Henri Poincaré (IHP), Paris, on his paper with Vincent Gomez.

We consider the problem of estimating the partition function of the ferromagnetic q-state Potts model. We propose an importance sampling algorithm in the dual of the normal factor graph representing the model. The algorithm can efficiently compute an estimate of the partition function when the coupling parameters of the model are strong (corresponding to models at low temperature) or when the model contains a mixture of strong and weak couplings. We show that, in this setting, the proposed algorithm significantly outperforms the state of the art methods.

The talk is at 14:30, March 17. It is part of a trimester program on information and computation theories I was completely unaware of.

## multiple try Metropolis

Posted in Books, Statistics, University life with tags , , , , , , on February 18, 2016 by xi'an

Luca Martino and Francisco Louzada recently wrote a paper in Computational Statistics about some difficulties with the multiple try Metropolis algorithm. This version of Metropolis by Liu et al. (2000) makes several proposals in parallel and picks one among them by multinomial sampling where the weights are proportional to the corresponding importance weights. This is followed by a Metropolis acceptance step that requires simulating the same number of proposed moves from the selected value. While this is necessary to achieve detailed balance, this mixture of MCMC and importance sampling is inefficient in that it simulates a large number of particles and ends up using only one of them. By comparison, a particle filter for the same setting would propagate all N particles along iterations and only resamples occasionaly when the ESS is getting too small. (I also wonder if the method could be seen as a special kind of pseudo-marginal approach, given that the acceptance ratio is an empirical average with expectation the missing normalising constan [as I later realised the authors had pointed out!]… In which case efficiency comparisons by Christophe Andrieu and Matti Vihola could prove useful.)

The issue raised by Martino and Louzada is that the estimator of the normalising constant can be poor at times, especially when the chain is in low regions of the target, and hence get the chain stuck. The above graph illustrates this setting in the paper. However, the reason for the failure is mostly that the proposal distribution is inappropriate for the purpose of approximating the normalising constant, i.e., that importance sampling does not converge in this situation, since otherwise the average of the importance weights should a.s. converge to the normalising constant. And the method should not worsen when increasing the number of proposals at a given stage. (The solution proposed by the authors to have a random number of proposals seems unlikely to solve the issue in a generic situation. Changing the proposals towards different tail behaviours as in population Monte Carlo is more akin to defensive sampling and thus more likely to avoid trapping states. Interestingly, the authors eventually resort to a mixture denominator in the importance sampler following AMIS.)

## Bayesian Indirect Inference and the ABC of GMM

Posted in Books, Statistics, University life with tags , , , , , , , , , , on February 17, 2016 by xi'an

“The practicality of estimation of a complex model using ABC is illustrated by the fact that we have been able to perform 2000 Monte Carlo replications of estimation of this simple DSGE model, using a single 32 core computer, in less than 72 hours.” (p.15)

Earlier this week, Michael Creel and his coauthors arXived a long paper with the above title, where ABC relates to approximate Bayesian computation. In short, this paper provides deeper theoretical foundations for the local regression post-processing of Mark Beaumont and his coauthors (2002). And some natural extensions. But apparently considering one univariate transform η(θ) of interest at a time. The theoretical validation of the method is that the resulting estimators converge at speed √n under some regularity assumptions. Including the identifiability of the parameter θ in the mean of the summary statistics T, which relates to our consistency result for ABC model choice. And a CLT on an available (?) preliminary estimator of η(θ).

The paper also includes a GMM version of ABC which appeal is less clear to me as it seems to rely on a preliminary estimator of the univariate transform of interest η(θ). Which is then randomized by a normal random walk. While this sounds a wee bit like noisy ABC, it differs from this generic approach as the model is not assumed to be known, but rather available through an asymptotic Gaussian approximation. (When the preliminary estimator is available in closed form, I do not see the appeal of adding this superfluous noise. When it is unavailable, it is unclear why a normal perturbation can be produced.)

“[In] the method we study, the estimator is consistent, asymptotically normal, and asymptotically as efficient as a limited information maximum likelihood estimator. It does not require either optimization, or MCMC, or the complex evaluation of the likelihood function.” (p.3)

Overall, I have trouble relating the paper to (my?) regular ABC in that the outcome of the supported procedures is an estimator rather than a posterior distribution. Those estimators are demonstrably endowed with convergence properties, including quantile estimates that can be exploited for credible intervals, but this does not produce a posterior distribution in the classical Bayesian sense. For instance, how can one run model comparison in this framework? Furthermore, each of those inferential steps requires solving another possibly costly optimisation problem.

“Posterior quantiles can also be used to form valid confidence intervals under correct model specification.” (p.4)

Nitpicking(ly), this statement is not correct in that posterior quantiles produce valid credible intervals and only asymptotically correct confidence intervals!

“A remedy is to choose the prior π(θ) iteratively or adaptively as functions of initial estimates of θ, so that the “prior” becomes dependent on the data, which can be denoted as π(θ|T).” (p.6)

This modification of the basic ABC scheme relying on simulation from the prior π(θ) can be found in many earlier references and the iterative construction of a better fitted importance function rather closely resembles ABC-PMC. Once again nitpicking(ly), the importance weights are defined therein (p.6) as the inverse of what they should be.