## prior sensitivity of the marginal likelihood

Posted in Books, pictures, Statistics, University life on June 27, 2022 by xi'an

Fernando Llorente and (Madrilene) coauthors have just arXived a paper on the safe use of prior densities for Bayesian model selection. Rather than blaming the Bayes factor, or excommunicating some improper priors, they consider in this survey solutions to design “objective” priors in model selection. (Writing this post made me realise I had forgotten to arXive a recent piece I wrote on the topic, based on short courses and blog pieces, for an incoming handbook on Bayesian advance(ment)s! Soon to be corrected.)

While intrinsically interested in the topic and hence in the study, I somewhat disagree with the perspective adopted by the authors. They for instance stick to the notion that a flat prior over the parameter space, over bounded sets at least, is appropriate as “the maximal expression of a non-informative prior” (despite its depending on the parameterisation), while advocating priors “with great scale parameter” otherwise. They also refer to Jeffreys (1939) priors, by which they mean estimation priors rather than testing priors, as uncovered by Susie Bayarri and Gonzalo Garcia-Donato. Considering asymptotic consistency, they state that “in the asymptotic regime, Bayesian model selection is more sensitive to the sample size D than to the prior specifications”, which I find both imprecise and confusing, as my feeling is that the prior specification remains overly influential as the sample size increases. (In my view, consistency is a minimalist requirement, rather than “comforting”.) The argument therein that a flat prior is informative for model choice stems from the fact that the marginal likelihood goes to zero as the support of the prior goes to infinity, which may have been an earlier argument of Jeffreys’ (1939), but does not carry much weight as the property is shared by many other priors (as remarked later). Somehow, the penalisation aspect of the marginal is not exploited more deeply in the paper. In the “objective” Bayes section, they adhere to the (convenient but weakly supported) choice of a common prior on the nuisance parameters (shared by different models). Their main argument is to develop (heretic!) “data-based priors”, from Aitkin’s (1991, not cited) double use of the data (or setting the likelihood to the power two), all the way to the intrinsic and fractional Bayes factors of Tony O’Hagan (1995) and Jim Berger and Luis Pericchi (1996), and to the expected posterior priors of Pérez and Berger (2002), on which I worked with Juan Cano and Diego Salmeròn.
(While the presentation is made against a flat prior, nothing prevents the use of another reference, improper, prior.) A short section also mentions the X-validation approach(es) of Aki Vehtari and co-authors.
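The vanishing-marginal argument is easy to see numerically. In a toy sketch of my own (not from the paper), take a single observation x ~ N(θ,1) and a flat prior over (−L, L): the marginal likelihood behaves like 1/(2L) for large L, hence goes to zero as the support grows, and the same decay occurs for any prior spreading its mass over a growing region.

```python
import math

def marginal_uniform(x, L, n_grid=100_000):
    # m(x) = (2L)^{-1} * integral over (-L, L) of N(x | theta, 1) dtheta,
    # computed by midpoint quadrature
    h = 2 * L / n_grid
    total = 0.0
    for i in range(n_grid):
        theta = -L + (i + 0.5) * h
        total += math.exp(-0.5 * (x - theta) ** 2) / math.sqrt(2 * math.pi)
    return total * h / (2 * L)

for L in (1, 10, 100, 1000):
    print(L, marginal_uniform(0.0, L))   # decays like 1/(2L) once L is large
```

Since the decay rate 1/(2L) only reflects the spread of the prior mass, it does not single out the flat prior among all vague priors.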

## Another harmonic mean

Posted in Books, Statistics, University life on May 21, 2022 by xi'an

Yet another paper that addresses the approximation of the marginal likelihood by a truncated harmonic mean, a popular theme of mine: a 2020 paper by Johannes Reich, entitled Estimating marginal likelihoods from the posterior draws through a geometric identity, published in Monte Carlo Methods and Applications.

The geometric identity it aims at exploiting is that

$m(x) = \frac{\int_A \,\text d\theta}{\int_A \pi(\theta|x)\big/\pi(\theta)f(x|\theta)\,\text d\theta}$

for any (positive volume) compact set $A$. This is exactly the same identity as in an earlier and uncited 2017 paper by Ana Pajor, with the also quite similar (!) title Estimating the Marginal Likelihood Using the Arithmetic Mean Identity, which I discussed on the ‘Og in connection with another 2012 paper by Lenk, also discussed here. This geometric or arithmetic identity is again related to the harmonic mean correction based on an HPD region A that Darren Wraith and myself proposed at MaxEnt 2009, and that Jean-Michel and I presented at Frontiers of statistical decision making and Bayesian analysis in 2010.

In this avatar, the set A is chosen close to an HPD region, once more, with a structure that allows for an exact computation of its volume, namely an ellipsoid that contains roughly 50% of the simulations from the posterior (rather than our non-intersecting union of balls centered at the 50% HPD points), which assumes a Euclidean structure of the parameter space (or, in other words, depends on the parameterisation). In the mixture illustration, the author surprisingly omits Chib’s solution, despite symmetrised versions avoiding the label (un)switching issues. What I do not get is how this solution gets around the label switching challenge, in that the set A remains an ellipsoid for multimodal posteriors, which means it either corresponds to a single mode [but then how can a simulation be restricted to a “single permutation of the indicator labels“?] or it covers all modes but also the unlikely valleys in-between.
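For what it is worth, the identity is easy to check by simulation in a toy conjugate Normal model, where posterior and evidence are available in closed form, and where the set A is a central 50% posterior interval with exactly known volume (a sketch of mine with hypothetical variable names, not the author's code).

```python
import math
import random

random.seed(1)
x = 1.5
# toy conjugate model: x | theta ~ N(theta, 1), theta ~ N(0, 1)
# posterior: theta | x ~ N(x/2, 1/2); true evidence: m(x) = N(x | 0, 2)
true_m = math.exp(-x ** 2 / 4) / math.sqrt(4 * math.pi)

post_mean, post_sd = x / 2, math.sqrt(0.5)
draws = [random.gauss(post_mean, post_sd) for _ in range(200_000)]

# A: central interval holding about 50% of the posterior mass
half = 0.6745 * post_sd                  # 75% Normal quantile times posterior sd
lo, hi = post_mean - half, post_mean + half
vol_A = hi - lo

def log_prior_times_lik(t):
    lp = -0.5 * t ** 2 - 0.5 * math.log(2 * math.pi)        # log N(t | 0, 1)
    ll = -0.5 * (x - t) ** 2 - 0.5 * math.log(2 * math.pi)  # log N(x | t, 1)
    return lp + ll

# denominator of the identity: (1/N) sum of 1_A(theta_i) / [pi(theta_i) f(x|theta_i)]
# over posterior draws theta_i
inv = [math.exp(-log_prior_times_lik(t)) for t in draws if lo <= t <= hi]
m_hat = vol_A * len(draws) / sum(inv)
print(m_hat, true_m)   # the two values should roughly agree
```

Restricting the average to the compact A is precisely what keeps the variance finite, in contrast with the raw harmonic mean estimator.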

## robust inference using posterior bootstrap

Posted in Books, Statistics, University life on February 18, 2022 by xi'an

The famous 1994 Read Paper by Michael Newton and Adrian Raftery was entitled Approximate Bayesian inference with the weighted likelihood bootstrap, where the bootstrap aspect lies in randomly (exponentially) weighting each observation in the iid sample through a power of the corresponding density, a proposal that happened at about the same time as Tony O’Hagan suggested the related fractional Bayes factor. (The paper may also be equally famous for suggesting the harmonic mean estimator of the evidence (!), although it only appeared as an appendix to the paper.) What is unclear to me is the nature of the distribution g(θ) associated with the weighted bootstrap sample, conditional on the original sample, since the outcome is the result of a random Exponential sample and of an optimisation step, with no impact of the prior (which could have been used as a penalisation factor), a gap corrected by Michael and Adrian via an importance step involving the estimation of g(·).
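The weighting scheme itself is easy to sketch: draw iid Exp(1) weights, one per observation, and maximise the weighted log-likelihood; for a Normal mean, the weighted MLE is simply the weighted average (a minimal illustration of mine on hypothetical data, not the authors' code).

```python
import random

random.seed(0)
data = [random.gauss(2.0, 1.0) for _ in range(500)]   # hypothetical iid sample

def wlb_draw(sample):
    # one weighted likelihood bootstrap draw: Exp(1) weights, then the
    # weighted MLE, which for a Normal mean is the weighted average
    w = [random.expovariate(1.0) for _ in sample]
    return sum(wi * xi for wi, xi in zip(w, sample)) / sum(w)

draws = [wlb_draw(data) for _ in range(2_000)]
mean_hat = sum(draws) / len(draws)
print(mean_hat)   # close to the sample mean, with a posterior-like spread
```

The draws concentrate around the MLE at the usual root-n rate, mimicking a posterior sample but, as stressed above, with no role for the prior.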

At the Algorithm Seminar today in Warwick, Emilie Pompe presented recent research, including some written jointly with Pierre Jacob [which I have not yet read], that does exactly that inclusion of the log prior as penalisation factor, along with an extra weight different from one, as motivated by the possibility of a misspecification, and covering a new approach to cut models. An alternative mentioned during the talk, which reminds me of GANs, is to generate a pseudo-sample from the prior predictive and add it to the original sample. (Some attendees commented on the dependence of the latter version on the chosen parameterisation, an issue that had X’ed my mind as well.)

## distributed evidence

Posted in Books, pictures, Statistics, University life on December 16, 2021 by xi'an

Alexander Buchholz (who did his PhD at CREST with Nicolas Chopin), Daniel Ahfock, and my friend Sylvia Richardson published a great paper on the distributed computation of Bayesian evidence in Bayesian Analysis. The setting is one of distributed data from several sources with no communication between them, which relates to consensus Monte Carlo even though model choice has not been particularly studied from that perspective. The authors operate under the assumption of conditionally conjugate models, i.e., the existence of a data augmentation scheme into an exponential family so that conjugate priors can be used. For a division of the data into S blocks, the fundamental identity in the paper is

$p(y) = \alpha^S \prod_{s=1}^S \tilde p(y_s) \int \prod_{s=1}^S \tilde p(\theta|y_s)\,\text d\theta$

where α is the normalising constant of the sub-prior exp{log[p(θ)]/S} and the other terms are associated with this prior. Under the conditionally conjugate assumption, the integral can be approximated based on the latent variables. Most interestingly, the associated variance is directly connected with the variance of

$p(z_{1:S}|y)\Big/\prod_{s=1}^S \tilde p(z_s|y_s)$

under the joint:

“The variance of the ratio measures the quality of the product of the conditional sub-posterior as an importance sample proposal distribution.”

This assumes the variance is finite (which is likely). An approximate alternative is proposed, namely to replace the exact sub-posterior with a Normal distribution, as in consensus Monte Carlo, which should obviously require some consideration as to which parameterisation of the model produces the “most normal” (or the least abnormal!) posterior, a problem shared by the bridgesampling package, and which ensures a finite variance in the importance sampling approximation (as guaranteed by the strong bounds in Proposition 5).
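The fundamental identity can be verified in closed form for a conjugate Normal model split into S = 2 shards, where the sub-prior p(θ)^{1/S}, the sub-evidences, and the integral of the product of sub-posteriors are all Gaussian computations (a sanity check of mine, not taken from the paper).

```python
import math
import random

def log_evidence(y, v):
    # log of integral over theta of prod_i N(y_i | theta, 1) * N(theta | 0, v),
    # available in closed form for this conjugate Normal model
    n, a = len(y), len(y) + 1.0 / v
    s2, sy = sum(yi * yi for yi in y), sum(y)
    return (-0.5 * n * math.log(2 * math.pi) - 0.5 * math.log(v * a)
            - 0.5 * (s2 - sy ** 2 / a))

random.seed(2)
S = 2
y = [random.gauss(0.5, 1.0) for _ in range(20)]
blocks = [y[:10], y[10:]]

# alpha: normalising constant of the sub-prior p(theta)^{1/S}, here N(0, S)
log_alpha = -0.5 / S * math.log(2 * math.pi) + 0.5 * math.log(2 * math.pi * S)

# sub-evidences and sub-posterior moments under the N(0, S) sub-prior
log_sub_ev, mus, vs = 0.0, [], []
for ys in blocks:
    log_sub_ev += log_evidence(ys, S)
    a = len(ys) + 1.0 / S
    mus.append(sum(ys) / a)          # sub-posterior mean
    vs.append(1.0 / a)               # sub-posterior variance

# integral of the product of the two Gaussian sub-posteriors:
# N(mu_1 - mu_2 | 0, v_1 + v_2) when S = 2
vsum = vs[0] + vs[1]
log_int = (-0.5 * math.log(2 * math.pi * vsum)
           - 0.5 * (mus[0] - mus[1]) ** 2 / vsum)

lhs = log_evidence(y, 1.0)                     # full-data log evidence
rhs = S * log_alpha + log_sub_ev + log_int     # right-hand side of the identity
print(lhs, rhs)   # the two sides agree up to floating point error
```

Outside such conjugate settings the integral of the product of sub-posteriors is no longer available, which is where the latent-variable approximation of the paper comes in.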

“…if the error that comes from MCMC sampling is relatively small and that the shard sizes are large enough so that the quality of the subposterior normal approximation is reasonable, our suggested approach will result in good approximations of the full data set marginal likelihood.”

The resulting approximation can also prove handy in conjunction with reversible jump MCMC, in the sense that RJMCMC algorithms can be run in parallel on different chunks or shards of the entire dataset, although the computing gain may be reduced by the need for separate approximations.

## sandwiching a marginal

Posted in Books, pictures, Statistics, University life on March 8, 2021 by xi'an

When working recently on a paper for estimating the marginal likelihood, I was pointed to this earlier 2015 paper by Roger Grosse, Zoubin Ghahramani and Ryan Adams, which had escaped me till now. The beginning of the paper discusses the shortcomings of importance sampling (when simulating from the prior) and of the harmonic mean (when simulating from the posterior) as solutions, and of anNealed importance sampling (when simulating from a sequence, which sequence?!, of targets). The authors end up proposing a sequential Monte Carlo or (posterior) particle learning solution. A remark on annealed importance sampling is that there exist both a forward and a backward version for estimating the marginal likelihood, either starting from a simulation from the prior (easy) or from a simulation from the posterior (hard!). As in, e.g., Nicolas Chopin’s thesis, the intermediate steps are constructed from a subsample of the entire sample.

“In this context, unbiasedness can be misleading: because partition function estimates can vary over many orders of magnitude, it’s common for an unbiased estimator to drastically underestimate Z with overwhelming probability, yet occasionally return extremely large estimates. (An extreme example is likelihood weighting, which is unbiased, but is extremely unlikely to give an accurate answer for a high-dimensional model.) Unless the estimator is chosen very carefully, the variance is likely to be extremely large, or even infinite.”

One novel aspect of the paper is to advocate for the simultaneous use of different methods, producing both lower and upper bounds on the marginal p(y), and to wait for them to get close enough. It is however delicate to find upper bounds, except when using the dreaded harmonic mean estimator. (A nice trick associated with reverse annealed importance sampling is that the reverse chain can be simulated exactly from the posterior if associated with simulated data, except I am rather lost at the connection between the actual and simulated data.) In a sequential harmonic mean version, the authors also look at the dangers of using an harmonic mean, but argue the potential infinite variance of the weights does not matter so much for log p(y), without displaying any variance calculation… The paper also contains a substantial experimental section that compares the different solutions evoked so far, plus others like nested sampling, which did not perform poorly in the experiment (see below) but could not be trusted to provide a lower or an upper bound. The computing time to achieve some level of agreement is however rather daunting. An interesting read, definitely (and I wonder what happened to the paper in the end).
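The sandwiching principle is easy to mimic in a toy conjugate model where log p(y) is known: prior importance sampling is unbiased for the evidence, hence a stochastic lower bound on its logarithm by Jensen, while the harmonic mean over posterior draws is unbiased for the inverse evidence and hence typically an upper bound (a sketch of mine under assumed Gaussian conjugacy, not the paper's algorithms).

```python
import math
import random

random.seed(3)
x = 2.5
# toy model: x | theta ~ N(theta, 1), theta ~ N(0, 1), so m(x) = N(x | 0, 2)
log_true = -0.5 * math.log(4 * math.pi) - x ** 2 / 4

def log_lik(t):
    return -0.5 * math.log(2 * math.pi) - 0.5 * (x - t) ** 2

def log_mean_exp(vals):
    # numerically stable log of the average of exp(vals)
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals) / len(vals))

N = 5_000
prior_draws = [random.gauss(0.0, 1.0) for _ in range(N)]
post_draws = [random.gauss(x / 2, math.sqrt(0.5)) for _ in range(N)]

# prior importance sampling: unbiased for m(x), hence a stochastic lower
# bound on log m(x) after taking logarithms (Jensen)
lower = log_mean_exp([log_lik(t) for t in prior_draws])

# harmonic mean: unbiased for 1/m(x), hence typically an upper bound on
# log m(x), with the usual caveat of possibly infinite variance
upper = -log_mean_exp([-log_lik(t) for t in post_draws])

print(lower, log_true, upper)
```

The gap between the two bounds shrinks as the proposals improve, which is exactly the wait-until-close-enough strategy advocated in the paper; in this toy example, the harmonic mean side is the fragile one, as its weights already have infinite variance.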