Archive for SMC

EM degeneracy

Posted in pictures, Statistics, Travel, University life with tags , , , , , , , , , , , , , , , , , , , , , on June 16, 2021 by xi'an

At the MHC 2021 conference today (to which I biked to attend for real!, first time since BayesComp!) I listened to Christophe Biernacki exposing the dangers of EM applied to mixtures in the presence of missing data, namely that the algorithm has a rising probability to reach a degenerate solution, namely a single observation component. Rising in the proportion of missing data. This is not hugely surprising as there is a real (global) mode at this solution. If one observation components are prohibited, they should not be accepted in the EM update. Just as in Bayesian analyses with improper priors, the likelihood should bar single or double  observations components… Which of course makes EM harder to implement. Or not?! MCEM, SEM and Gibbs are obviously straightforward to modify in this case.

Judith Rousseau also gave a fascinating talk on the properties of non-parametric mixtures, from a surprisingly light set of conditions for identifiability to posterior consistency . With an interesting use of several priors simultaneously that is a particular case of the cut models. Namely a correct joint distribution that cannot be a posterior, although this does not impact simulation issues. And a nice trick turning a hidden Markov chain into a fully finite hidden Markov chain as it is sufficient to recover a Bernstein von Mises asymptotic. If inefficient. Sylvain LeCorff presented a pseudo-marginal sequential sampler for smoothing, when the transition densities are replaced by unbiased estimators. With connection with approximate Bayesian computation smoothing. This proves harder than I first imagined because of the backward-sampling operations…

Bayes factors revisited

Posted in Books, Mountains, pictures, Statistics, Travel, University life with tags , , , , , , , , , on March 22, 2021 by xi'an


“Bayes factor analyses are highly sensitive to and crucially depend on prior assumptions about model parameters (…) Note that the dependency of Bayes factors on the prior goes beyond the dependency of the posterior on the prior. Importantly, for most interesting problems and models, Bayes factors cannot be computed analytically.”

Daniel J. Schad, Bruno Nicenboim, Paul-Christian Bürkner, Michael Betancourt, Shravan Vasishth have just arXived a massive document on the Bayes factor, worrying about the computation of this common tool, but also at the variability of decisions based on Bayes factors, e.g., stressing correctly that

“…we should not confuse inferences with decisions. Bayes factors provide inference on hypotheses. However, to obtain discrete decisions (…) from continuous inferences in a principled way requires utility functions. Common decision heuristics (e.g., using Bayes factor larger than 10 as a discovery threshold) do not provide a principled way to perform decisions, but are merely heuristic conventions.”

The text is long and at times meandering (at least in the sections I read), while trying a wee bit too hard to bring up the advantages of using Bayes factors versus frequentist or likelihood solutions. (The likelihood ratio being presented as a “frequentist” solution, which I think is an incorrect characterisation.) For instance, the starting point of preferring a model with a higher marginal likelihood is presented as an evidence (oops!) rather than argumented. Since this quantity depends on both the prior and the likelihood, it being high or low is impacted by both. One could then argue that using its numerical value as an absolute criterion amounts to selecting the prior a posteriori as much as checking the fit to the data! The paper also resorts to the Occam’s razor argument, which I wish we could omit, as it is a vague criterion, wide open to misappropriation. It is also qualitative, rather than quantitative, hence does not help in calibrating the Bayes factor.

Concerning the actual computation of the Bayes factor, an issue that has always been a concern and a research topic for me, the authors consider only two “very common methods”, the Savage–Dickey density ratio method and bridge sampling. We discussed the shortcomings of the Savage–Dickey density ratio method with Jean-Michel Marin about ten years ago. And while bridge sampling is an efficient approach when comparing models of the same dimension, I have reservations about this efficiency in other settings. Alternative approaches like importance nested sampling, noise contrasting estimation or SMC samplers are often performing quite efficiently as normalising constant approximations. (Not to mention our version of harmonic mean estimator with HPD support.)

Simulation-based inference is based on the notion that simulated data can be produced from the predictive distributions. Reminding me of ABC model choice to some extent. But I am uncertain this approach can be used to calibrate the decision procedure to select the most appropriate model. We thought about using this approach in our testing by mixture paper and it is favouring the more complex of the two models. This seems also to occur for the example behind Figure 5 in the paper.

Two other points: first, the paper does not consider the important issue with improper priors, which are not rigorously compatible with Bayes factors, as I discussed often in the past. And second, Bayes factors are not truly Bayesian decision procedures, since they remove the prior weights on the models, thus the mention of utility functions therein seems inappropriate unless a genuine utility function can be produced.

sandwiching a marginal

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , on March 8, 2021 by xi'an

When working recently on a paper for estimating the marginal likelihood, I was pointed out this earlier 2015 paper by Roger Grosse, Zoubin Ghahramani and Ryan Adams, which had escaped till now. The beginning of the paper discusses the shortcomings of importance sampling (when simulating from the prior) and harmonic mean (when simulating from the posterior) as solution. And of anNealed importance sampling (when simulating from a sequence, which sequence?!, of targets). The authors are ending up proposing a sequential Monte Carlo or (posterior) particle learning solution. A remark on annealed importance sampling is that there exist both a forward and a backward version for estimating the marginal likelihood, either starting from a simulation from the prior (easy) or from a simulation from the posterior (hard!). As in, e.g., Nicolas Chopin’s thesis, the intermediate steps are constructed from a subsample of the entire sample.

In this context, unbiasedness can be misleading: because partition function estimates can vary over many orders of magnitude, it’s common for an unbiased estimator to drastically underestimate Ζ with overwhelming probability, yet occasionally return extremely large estimates. (An extreme example is likelihood weighting, which is unbiased, but is extremely unlikely to give an accurate answer for a high-dimensional model.) Unless the estimator is chosen very carefully, the variance is likely to be extremely large, or even infinite.”

One novel aspect of the paper is to advocate for the simultaneous use of different methods and for producing both lower and upper bounds on the marginal p(y) and wait for them to get close enough. It is however delicate to find upper bounds, except when using the dreaded harmonic mean estimator.  (A nice trick associated with reverse annealed importance sampling is that the reverse chain can be simulated exactly from the posterior if associated with simulated data, except I am rather lost at the connection between the actual and simulated data.) In a sequential harmonic mean version, the authors also look at the dangers of using an harmonic mean but argue the potential infinite variance of the weights does not matter so much for log p(y), without displaying any variance calculation… The paper also contains a substantial experimental section that compares the different solutions evoked so far, plus others like nested sampling. Which did not work poorly in the experiment (see below) but could not be trusted to provide a lower or an upper bound. The computing time to achieve some level of agreement is however rather daunting. An interesting read definitely (and I wonder what happened to the paper in the end).

population quasi-Monte Carlo

Posted in Books, Statistics with tags , , , , , , , , , , , , on January 28, 2021 by xi'an

“Population Monte Carlo (PMC) is an important class of Monte Carlo methods, which utilizes a population of proposals to generate weighted samples that approximate the target distribution”

A return of the prodigal son!, with this arXival by Huang, Joseph, and Mak, of a paper on population Monte Carlo using quasi-random sequences. The construct is based on an earlier notion of Joseph and Mak, support points, which are defined wrt a given target distribution F as minimising the variability of a sample from F away from these points. (I would have used instead my late friend Bernhard Flury’s principal points!) The proposal uses Owen-style scrambled Sobol points, followed by a deterministic mixture weighting à la PMC, followed by importance support resampling to find the next location parameters of the proposal mixture (which is why I included an unrelated mixture surface as my post picture!). This importance support resampling is obviously less variable than the more traditional ways of resampling but the cost moves from O(M) to O(M²).

“The main computational complexity of the algorithm is O(M²) from computing the pairwise distance of the M weighted samples”

The covariance parameters are updated as in our 2008 paper. This new proposal is interesting and reasonable, with apparent significant gains, albeit I would have liked to see a clearer discussion of the actual computing costs of PQMC.

averaged acceptance ratios

Posted in Statistics with tags , , , , , , , , , , , , , on January 15, 2021 by xi'an

In another recent arXival, Christophe Andrieu, Sinan Yıldırım, Arnaud Doucet, and Nicolas Chopin study the impact of averaging estimators of acceptance ratios in Metropolis-Hastings algorithms. (It is connected with the earlier arXival rephrasing Metropolis-Hastings in terms of involutions discussed here.)

“… it is possible to improve performance of this algorithm by using a modification where the acceptance ratio r(ξ) is integrated with respect to a subset of the proposed variables.”

This interpretation of the current proposal makes it a form of Rao-Blackwellisation, explicitly mentioned on p.18, where, using a mixture proposal, with an adapted acceptance probability, it depends on the integrated acceptance ratio only. Somewhat magically using this ratio and its inverse with probability ½. And it increases the average Metropolis-Hastings acceptance probability (albeit with a larger number of simulations). Since the ideal averaging is rarely available, the authors implement a Monte Carlo averaging version. With applications to the exchange algorithm and to reversible jump MCMC. The major application is to pseudo-marginal settings with a high complexity (in the number T of terms) and where the authors’ approach does scale efficiently with T. There is even an ABC side to the story as one illustration is made of the ABC approximation to the posterior of an α-stable sample. As an encompassing proposal for handling Metropolis-Hastings environments with latent variables and several versions of the acceptance ratios, this is quite an interesting paper that I think we will study in further detail with our students.