## evidence estimation in finite and infinite mixture models

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , , on May 20, 2022 by xi'an

Adrien Hairault (PhD student at Dauphine), Judith and I just arXived a new paper on evidence estimation for mixtures. This may sound like a well-trodden path that I have repeatedly explored in the past, but methinks that estimating the model evidence doth remain a notoriously difficult task for large sample or many component finite mixtures and even more for “infinite” mixture models corresponding to a Dirichlet process. When considering different Monte Carlo techniques advocated in the past, like Chib’s (1995) method, SMC, or bridge sampling, they exhibit a range of performances, in terms of computing time… One novel (?) approach in the paper is to write Chib’s (1995) identity for partitions rather than parameters as (a) it bypasses the label switching issue (as we already noted in Hurn et al., 2000), another one is to exploit  Geyer (1991-1994) reverse logistic regression technique in the more challenging Dirichlet mixture setting, and yet another one a sequential importance sampling solution à la  Kong et al. (1994), as also noticed by Carvalho et al. (2010). [We did not cover nested sampling as it quickly becomes onerous.]

Applications are numerous. In particular, testing for the number of components in a finite mixture model or against the fit of a finite mixture model for a given dataset has long been and still is an issue of much interest and diverging opinions, albeit yet missing a fully satisfactory resolution. Using a Bayes factor to find the right number of components K in a finite mixture model is known to provide a consistent procedure. We furthermore establish there the consistence of the Bayes factor when comparing a parametric family of finite mixtures against the nonparametric ‘strongly identifiable’ Dirichlet Process Mixture (DPM) model.

## [more than] everything you always wanted to know about marginal likelihood

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , , , , , , , , , , on February 10, 2022 by xi'an

Earlier this year, F. Llorente, L. Martino, D. Delgado, and J. Lopez-Santiago have arXived an updated version of their massive survey on marginal likelihood computation. Which I can only warmly recommend to anyone interested in the matter! Or looking for a base camp to initiate a graduate project. They break the methods into four families

1. Deterministic approximations (e.g., Laplace approximations)
2. Methods based on density estimation (e.g., Chib’s method, aka the candidate’s formula)
3. Importance sampling, including sequential Monte Carlo, with a subsection connecting with MCMC
4. Vertical representations (mostly, nested sampling)

Besides sheer computation, the survey also broaches upon issues like improper priors and alternatives to Bayes factors. The parts I would have done in more details are reversible jump MCMC and the long-lasting impact of Geyer’s reverse logistic regression (with the noise contrasting extension), even though the link with bridge sampling is briefly mentioned there. There is even a table reporting on the coverage of earlier surveys. Of course, the following postnote of the manuscript

The Christian Robert’s blog deserves a special mention , since Professor C. Robert has devoted several entries of his blog with very interesting comments regarding the marginal likelihood estimation and related topics.

does not in the least make me less objective! Some of the final recommendations

• use of Naive Monte Carlo [simulate from the prior] should be always considered [assuming a proper prior!]
• a multiple-try method is a good choice within the MCMC schemes
• optimal umbrella sampling estimator is difficult and costly to implement , so its best performance may not be achieved in practice
• adaptive importance sampling uses the posterior samples to build a suitable normalized proposal, so it benefits from localizing samples in regions of high posterior probability while preserving the properties of standard importance sampling
• Chib’s method is a good alternative, that provide very good performances [but is not always available]
• the success [of nested sampling] in the literature is surprising.

## Easy computation of the Bayes Factor

Posted in Books, Statistics with tags , , , , , on August 21, 2021 by xi'an

“Choosing the ranges has been criticized as introducing subjectivity; however, the key point is that the ranges are given quantitatively and should be justified”

On arXiv, I came across a paper by physicists Dunstan, Crowne, and Drew, on computing the Bayes factor by linear regression. Paper that I found rather hard to read given that the method is never completely spelled out but rather described through some examples (or the captions of figures)… The magical formula (for the marginal likelihood)

$B=(2\pi)^{n/2}L_{\max}\dfrac{\text{Cov}_p}{\prod_{i=1}^n \Delta p_i}$

where n is the parameter dimension, Cov is the Fisher information matrix, and the denominator the volume of a flat prior on an hypercube (!), seems to come for a Laplace approximation. But it depends rather crucially (!) on the choice of this volume. A severe drawback the authors evacuate with the above quote… And by using an example where the parameters have a similar meaning under both models. The following ones compare several dimensions of parameters without justifying (enough) the support of the corresponding priors. In addition, using a flat prior over the hypercube seems to clash with the existence of a (Fisher) correlation between the components. (To be completely open as to why I discuss this paper, I was asked to review the paper, which I declined.)

## ABC on brain networks

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , , , , , , on April 16, 2021 by xi'an

Research Gate sent me an automated email pointing out a recent paper citing some of our ABC papers. The paper is written by Timothy West et al., neuroscientists in the UK, comparing models of Parkinsonian circuit dynamics. Using SMC-ABC. One novelty is the update of the tolerance by a fixed difference, unless the acceptance rate is too low, in which case the tolerance is reinitialised to a starting value.

“(…) the proposal density P(θ|D⁰) is formed from the accepted parameters sets. We use a density approximation to the marginals and a copula for the joint (…) [i.e.] a nonparametric estimation of the marginal densities overeach parameter [and] the t-copula(…) Data are transformed to the copula scale (unit-square) using the kernel density estimator of the cumulative distribution function of each parameter and then transformed to the joint space with the t-copula.”

The construct of the proposal is quite involved, as described in the above quote. The model choice approach is standard (à la Grelaud et al.) but uses the median distance as a tolerance.

“(…) test whether the ABC estimator will: a) yield parameter estimates that are unique to the data from which they have been optimized; and b) yield consistent estimation of parameters across multiple instances (…) test the face validity of the model comparison framework (…) [and] demonstrate the scalability of the optimization and model comparison framework.”

The paper runs a fairly extensive test of the above features, concluding that “the ABC optimized posteriors are consistent across multiple initializations and that the output is determined by differences in the underlying model generating the given data.” Concerning model comparison, the authors mix the ABC Bayes factor with a post-hoc analysis of divergence to discriminate against overfitting. And mention the potential impact of the summary statistics in the conclusion section, albeit briefly, and the remark that the statistics were “sufficient to recover known parameters” is not supporting their use for model comparison. The additional criticism of sampling strategies for approximating Bayes factors is somewhat irrelevant, the main issue with ABC model choice being a change of magnitude in the evidence.

“ABC has established itself as a key tool for parameter estimation in systems biology (…) but is yet to see wide adoption in systems neuroscience. It is known that ABC will not perform well under certain conditions (Sunnåker et al., 2013). Specifically, it has been shown that the
simplest form of ABC algorithm based upon an rejection-sampling approach is inefficient in the case where the prior densities lie far from the true posterior (…) This motivates the use of neurobiologically grounded models over phenomenological models where often the ranges of potential parameter values are unknown.”

## Bayes factors revisited

Posted in Books, Mountains, pictures, Statistics, Travel, University life with tags , , , , , , , , , on March 22, 2021 by xi'an

“Bayes factor analyses are highly sensitive to and crucially depend on prior assumptions about model parameters (…) Note that the dependency of Bayes factors on the prior goes beyond the dependency of the posterior on the prior. Importantly, for most interesting problems and models, Bayes factors cannot be computed analytically.”

Daniel J. Schad, Bruno Nicenboim, Paul-Christian Bürkner, Michael Betancourt, Shravan Vasishth have just arXived a massive document on the Bayes factor, worrying about the computation of this common tool, but also at the variability of decisions based on Bayes factors, e.g., stressing correctly that

“…we should not confuse inferences with decisions. Bayes factors provide inference on hypotheses. However, to obtain discrete decisions (…) from continuous inferences in a principled way requires utility functions. Common decision heuristics (e.g., using Bayes factor larger than 10 as a discovery threshold) do not provide a principled way to perform decisions, but are merely heuristic conventions.”

The text is long and at times meandering (at least in the sections I read), while trying a wee bit too hard to bring up the advantages of using Bayes factors versus frequentist or likelihood solutions. (The likelihood ratio being presented as a “frequentist” solution, which I think is an incorrect characterisation.) For instance, the starting point of preferring a model with a higher marginal likelihood is presented as an evidence (oops!) rather than argumented. Since this quantity depends on both the prior and the likelihood, it being high or low is impacted by both. One could then argue that using its numerical value as an absolute criterion amounts to selecting the prior a posteriori as much as checking the fit to the data! The paper also resorts to the Occam’s razor argument, which I wish we could omit, as it is a vague criterion, wide open to misappropriation. It is also qualitative, rather than quantitative, hence does not help in calibrating the Bayes factor.

Concerning the actual computation of the Bayes factor, an issue that has always been a concern and a research topic for me, the authors consider only two “very common methods”, the Savage–Dickey density ratio method and bridge sampling. We discussed the shortcomings of the Savage–Dickey density ratio method with Jean-Michel Marin about ten years ago. And while bridge sampling is an efficient approach when comparing models of the same dimension, I have reservations about this efficiency in other settings. Alternative approaches like importance nested sampling, noise contrasting estimation or SMC samplers are often performing quite efficiently as normalising constant approximations. (Not to mention our version of harmonic mean estimator with HPD support.)

Simulation-based inference is based on the notion that simulated data can be produced from the predictive distributions. Reminding me of ABC model choice to some extent. But I am uncertain this approach can be used to calibrate the decision procedure to select the most appropriate model. We thought about using this approach in our testing by mixture paper and it is favouring the more complex of the two models. This seems also to occur for the example behind Figure 5 in the paper.

Two other points: first, the paper does not consider the important issue with improper priors, which are not rigorously compatible with Bayes factors, as I discussed often in the past. And second, Bayes factors are not truly Bayesian decision procedures, since they remove the prior weights on the models, thus the mention of utility functions therein seems inappropriate unless a genuine utility function can be produced.