**W**ith Charly Andral (PhD, Paris Dauphine), Randal Douc, and Hugo Marival (PhD, Telecom SudParis), we just arXived a paper on importance Markov chains that merges importance sampling and MCMC. An idea already mentioned in Hastings (1970) and even earlier in Fodsick (1963), and later exploited in Liu et al. (2003) for instance. And somewhat dual of the vanilla Rao-Backwellisation paper Randal and I wrote a (long!) while ago. Given a target π with a dominating measure π⁰≥Mπ, using a Markov kernel to simulate from this dominating measure and subsampling by the importance weight ρ does produce a new Markov chain with the desired target measure as invariant distribution. However, the domination assumption is rather unrealistic and a generic approach can be implemented without it, by defining an extended Markov chain, with the addition of the number N of replicas as the supplementary term… And a transition kernel R(n|x) on N with expectation ρ, which is a minimal(ist) assumption for the validation of the algorithm.. While this initially defines a semi-Markov chain, an extended Markov representation is also feasible, by decreasing N one by one until reaching zero, and this is most helpful in deriving convergence properties for the resulting chain, including a CLT. While the choice of the kernel R is free, the optimal choice is associated with residual sampling, where only the fractional part of ρ is estimated by a Bernoulli simulation.

## Archive for arXiv

## important Markov chains

Posted in Books, Statistics, University life with tags arXiv, CLT, geometric ergodicity, importance sampling, Law of Large Numbers, Markov kernel, MCMC, PhD students, residual sampling, semi-Markov chain, Université Paris Dauphine, vanilla Rao-Blackwellisation on July 21, 2022 by xi'an## day one at ISBA 22

Posted in pictures, Statistics, Travel, University life with tags ABC, ABC-PMC, arXiv, BFF Statistics Conference, bistronomie, Bruno de Finetti, Canada, copulas, distilled importance sampling, evidence, exchangeability, fiducial inference, John Maynard Keynes, Korean food, Montréal, normalising flow, optimal transport, population demography, Québec, R.A. Fisher, Rademacher complexity, restaurant, Roe v. Wade, SMC-ABC, sufficiency, Supreme Court, United Nations Population Fund on June 29, 2022 by xi'anStarted the day with a much appreciated swimming practice in the [alas warm⁺⁺⁺] outdoor 50m pool on the Island with no one but me in the slooow lane. And had my first ride with the biXi system, surprised at having to queue behind other bikes at red lights! More significantly, it was a great feeling to reunite at last with so many friends I had not met for more than two years!!!

My friend Adrian Raftery gave the very first plenary lecture on his work on the Bayesian approach to long-term population projections, which was recently a work censored by some US States, then counter-censored by the Supreme Court [too busy to kill Roe v. Wade!]. Great to see the use of Bayesian methods validated by the UN Population Division [with at least one branch of the UN

Stephen Lauritzen returning to de Finetti notion of a model as something not real or true at all, back to exchangeability. Making me wonder when exchangeability is more than a convenient assumption leading to the Hewitt-Savage theorem. And sufficiency. I mean, without falling into a Keynesian fallacy, each point of the sample has unique specificities that cannot be taken into account in an exchangeable model. Nice to hear some measure theory, though!!! Plus a comment on the median never being sufficient, recouping an older (and presumably not original) point of mine. Stephen’s (or Fisher’s?) argument being that the median cannot be recursively computed!

Antonietta Mira and I had our ABC session this afternoon with Cecilia Viscardi, Sirio Legramanti, and Massimiliano Tamborino (Warwick) as speakers. Cecilia linked ABC with normalising flows, in collaboration with Dennis Prangle (whose earlier paper on this connection was presented as the first One World ABC seminar). Thus using past simulations to approximate the posterior by a neural network, possibly with a significant increase in computing time when compared with more rudimentary SMC-ABC methods in larger dimensions. Sirio considered summary-free ABC based on discrepancies like Rademacher complexity. Which more or less contains MMD, Kullback-Leibler, Wasserstein and more, although it seems to be dependent on the parameterisation of the observations. An interesting opening at the end was that this approach could apply to non iid settings. Massi presented a paper coauthored with Umberto that had just been arXived. On sequential ABC with a dependence on the summary statistic (hence guided). Further bringing copulas into the game, although this forces another choice [for the marginals] in the method.

Tamara Broderick talked about a puzzling leverage effect of some observations in economic studies where a tiny portion of individuals may modify the significance or the sign of a coefficient, for which I cannot tell whether the data or the reliance on statistical significance are to blame. Robert Kohn presented mixture-of-Gaussian copulas [not to be confused with mixture of Gaussian-copulas!] and Nancy Reid concluded my first [and somewhat exhausting!] day at ISBA with a BFF talk on the different statistical paradigms take on confidence (for which the notion of calibration seems to remain frequentist).

Side comments: First, most people in the conference are wearing masks, which is great! Also, I find it hard to read slides from the screen, which I presume is an age issue (?!) Even more aside, I had Korean lunch in a place that refused to serve me a glass of water, which I find amazing.

## prior sensitivity of the marginal likelihood

Posted in Books, pictures, Statistics, University life with tags arXiv, Bayes factors, Bayesian model selection, blogging, expected posterior prior, flat prior, fractional Bayes factor, Harold Jeffreys, improper priors, intrinsic Bayes factor, Madrid, marginal likelihood, parameterisation, Susie Bayarri on June 27, 2022 by xi'an**F**ernando Llorente and (Madrilene) coauthors have just arXived a paper on the safe use of prior densities for Bayesian model selection. Rather than blaming the Bayes factor, or excommunicating some improper priors, they consider in this survey solutions to design “objective” priors in model selection. (Writing this post made me realised I had forgotten to arXive a recent piece I wrote on the topic, based on short courses and blog pieces, for an incoming handbook on Bayesian advance(ment)s! Soon to be corrected.)

While intrinsically interested in the topic and hence with the study, I somewhat disagree with the perspective adopted by the authors. They for instance stick to the notion that a flat prior over the parameter space is appropriate as “the maximal expression of a non-informative prior” (despite depending on the parameterisation). Over bounded sets at least, while advocating priors “with great scale parameter” otherwise. They also refer to Jeffreys (1939) priors, by which they mean *estimation priors* rather than *testing priors*. As uncovered by Susie Bayarri and Gonzalo Garcia-Donato. Considering asymptotic consistency, they state that “in the asymptotic regime, Bayesian model selection is more sensitive to the sample size D than to the prior specifications”, which I find both imprecise and confusing, as my feeling is that the prior specification remains overly influential as the sample size increases. (In my view, consistency is a minimalist requirement, rather than “comforting”.) The argument therein that a flat prior is *informative* for model choice stems from the fact that the marginal likelihood goes to zero as the support of the prior goes to infinity, which may have been an earlier argument of Jeffreys’ (1939), but does not carry much weight as the property is shared by many other priors (as remarked later). Somehow, the penalisation aspect of the marginal is not exploited more deeply in the paper. In the “objective” Bayes section, they adhere to the (convenient but weakly supported) choice of a common prior on the nuisance parameters (shared by different models). Their main argument is to develop (heretic!) “data-based priors”, from Aitkin (1991, not cited) double use of the data (or setting the likelihood to the power two), all the way to the intrinsic and fractional Bayes factors of Tony O’Hagan (1995), Jim Berger and Luis Pericchi (1996), and to the *expected posterior priors* of Pérez and Berger (2002) on which I worked with Juan Cano and Diego Salmeròn. (While the presentation is made against a flat prior, nothing prevents the use of another reference, improper, prior.) A short section also mentions the X-validation approach(es) of Aki Vehtari and co-authors.

## evidence estimation in finite and infinite mixture models

Posted in Books, Statistics, University life with tags arXiv, Bayes factor, Bayesian model evaluation, bridge sampling, candidate's formula, Charlie Geyer, Chib's approximation, Dirichlet process mixture, nested sampling, noise contrasting estimation, sequential Monte Carlo, SIS, SMC, Université Paris Dauphine on May 20, 2022 by xi'an**A**drien Hairault (PhD student at Dauphine), Judith and I just arXived a new paper on evidence estimation for mixtures. This may sound like a well-trodden path that I have repeatedly explored in the past, but methinks that estimating the model evidence doth remain a notoriously difficult task for large sample or many component finite mixtures and even more for “infinite” mixture models corresponding to a Dirichlet process. When considering different Monte Carlo techniques advocated in the past, like Chib’s (1995) method, SMC, or bridge sampling, they exhibit a range of performances, in terms of computing time… One novel (?) approach in the paper is to write Chib’s (1995) identity for partitions rather than parameters as (a) it bypasses the label switching issue (as we already noted in Hurn et al., 2000), another one is to exploit Geyer (1991-1994) reverse logistic regression technique in the more challenging Dirichlet mixture setting, and yet another one a sequential importance sampling solution à la Kong et al. (1994), as also noticed by Carvalho et al. (2010). [We did not cover nested sampling as it quickly becomes onerous.]

Applications are numerous. In particular, testing for the number of components in a finite mixture model or against the fit of a finite mixture model for a given dataset has long been and still is an issue of much interest and diverging opinions, albeit yet missing a fully satisfactory resolution. Using a Bayes factor to find the right number of components K in a finite mixture model is known to provide a consistent procedure. We furthermore establish there the consistence of the Bayes factor when comparing a parametric family of finite mixtures against the nonparametric ‘strongly identifiable’ Dirichlet Process Mixture (DPM) model.

## efficiency of normalising over discrete parameters

Posted in Statistics with tags arXiv, Gibbs sampler, Hamiltonian Monte Carlo, JAGS, latent variable models, marginalisation, MCMC, mixtures of distributions, Monte Carlo experiment, STAN on May 1, 2022 by xi'an**Y**esterday, I noticed a new arXival entitled *Investigating the efficiency of marginalising over discrete parameters in Bayesian computations* written by Wen Wang and coauthors. The paper is actually comparing the simulation of a Gibbs sampler with an Hamiltonian Monte Carlo approach on Gaussian mixtures, when including and excluding latent variables, respectively. The authors missed the opposite marginalisation when the parameters are integrated.

*While marginalisation requires substantial mathematical effort, folk wisdom in the Stan community suggests that fitting models with marginalisation is more efficient than using Gibbs sampling.*

The comparison is purely experimental, though, which means it depends on the simulated data, the sample size, the prior selection, and of course the chosen algorithms. It also involves the [mostly] automated [off-the-shelf] choices made in the adopted software, JAGS and Stan. The outcome is only evaluated through ESS and the (old) R statistic. Which all depend on the parameterisation. But evacuates the label switching problem by imposing an ordering on the Gaussian means, which may have a different impact on marginalised and unmarginalised models. All in all, there is not much one can conclude about this experiment since the parameter values beyond the simulated data seem to impact the performances much more than the type of algorithm one implements.