Archive for Radford Neal

Finite mixture models do not reliably learn the number of components

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , , on October 15, 2022 by xi'an

When preparing my talk for Padova, I found that Diana Cai, Trevor Campbell, and Tamara Broderick wrote this ICML / PLMR paper last year on the impossible estimation of the number of components in a mixture.

“A natural check on a Bayesian mixture analysis is to establish that the Bayesian posterior on the number of components increasingly concentrates near the truth as the number of data points becomes arbitrarily large.” Cai, Campbell & Broderick (2021)

Which seems to contradict [my formerly-Glaswegian friend] Agostino Nobile  who showed in his thesis that the posterior on the number of components does concentrate at the true number of components, provided the prior contains that number in its support. As well as numerous papers on the consistency of the Bayes factor, including the one against an infinite mixture alternative, as we discussed in our recent paper with Adrien and Judith. And reminded me of the rebuke I got in 2001 from the late David McKay when mentioning that I did not believe in estimating the number of components, both because of the impact of the prior modelling and of the tendency of the data to push for more clusters as the sample size increased. (This was a most lively workshop Mike Titterington and I organised at ICMS in Edinburgh, where Radford Neal also delivered an impromptu talk to argue against using the Galaxy dataset as a benchmark!)

“In principle, the Bayes factor for the MFM versus the DPM could be used as an empirical criterion for choosing between the two models, and in fact, it is quite easy to compute an approximation to the Bayes factor using importance sampling” Miller & Harrison (2018)

This is however a point made in Miller & Harrison (2018) that the estimation of k logically goes south if the data is not from the assumed mixture model. In this paper, Cai et al. demonstrate that the posterior diverges, even when it depends on the sample size. Or even the sample as in empirical Bayes solutions.

slice sampling revisited

Posted in Books, pictures, Statistics with tags , , , , , , , , on April 15, 2016 by xi'an

Figure 1 (c.) Neal, 2003Thanks to an X validated question, I re-read Radford Neal’s 2003 Slice sampling paper. Which is an Annals of Statistics discussion paper, and rightly so. While I was involved in the editorial processing of this massive paper (!), I had only vague memories left about it. Slice sampling has this appealing feature of being the equivalent of random walk Metropolis-Hastings for Gibbs sampling, without the drawback of setting a scale for the moves.

“These slice sampling methods can adaptively change the scale of changes made, which makes them easier to tune than Metropolis methods and also avoids problems that arise when the appropriate scale of changes varies over the distribution  (…) Slice sampling methods that improve sampling by suppressing random walks can also be constructed.” (p.706)

One major theme in the paper is fighting random walk behaviour, of which Radford is a strong proponent. Even at the present time, I am a bit surprised by this feature as component-wise slice sampling is exhibiting clear features of a random walk, exploring the subgraph of the target by random vertical and horizontal moves. Hence facing the potential drawback of backtracking to previously visited places.

“A Markov chain consisting solely of overrelaxed updates might not be ergodic.” (p.729)

Overrelaxation is presented as a mean to avoid the random walk behaviour by removing rejections. The proposal is actually deterministic projecting the current value to the “other side” of the approximate slice. If it stays within the slice it is accepted. This “reflection principle” [in that it takes the symmetric wrt the centre of the slice] is also connected with antithetic sampling in that it induces rather negative correlation between the successive simulations. The last methodological section covers reflective slice sampling, which appears as a slice version of Hamiltonian Monte Carlo (HMC). Given the difficulty in implementing exact HMC (reflected in the later literature), it is no wonder that Radford proposes an approximation scheme that is valid if somewhat involved.

“We can show invariance of this distribution by showing (…) detailed balance, which for a uniform distribution reduces to showing that the probability density for x¹ to be selected as the next state, given that the current state is x0, is the same as the probability density for x⁰ to be the next state, given that x¹ is the current state, for any states x⁰ and x¹ within [the slice] S.” (p.718)

In direct connection with the X validated question there is a whole section of the paper on implementing single-variable slice sampling that I had completely forgotten, with a collection of practical implementations when the slice

S={x; u < f(x) }

cannot be computed in an exact manner. Like the “stepping out” procedure. The resulting set (interval) where the uniform simulation in x takes place may well miss some connected component(s) of the slice. This quote may sound like a strange argument in that the move may well leave a part of the slice off and still satisfy this condition. Not really since it states that it must hold for any pair of states within S… The very positive side of this section is to allow for slice sampling in cases where the inversion of u < f(x) is intractable. Hence with a strong practical implication. The multivariate extension of the approximation procedure is more (potentially) fraught with danger in that it may fell victim to a curse of dimension, in that the box for the uniform simulation of x may be much too large when compared with the true slice (or slice of the slice). I had more of a memory of the “trail of crumbs” idea, mostly because of the name I am afraid!, which links with delayed rejection, as indicated in the paper, but seems awfully delicate to calibrate.

Sampling latent states for high-dimensional non-linear state space models with the embedded HMM method

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , on March 17, 2016 by xi'an

IMG_19390Previously, I posted a comment on a paper by Alex Shestopaloff and Radford Neal, after my visit to Toronto two years ago, using a particular version of ensemble Monte Carlo. A new paper by the same authors was recently arXived, as an refinement of the embedded HMM paper of Neal (2003), in that the authors propose a new and more efficient way to generate from the (artificial) embedded hidden Markov sampler that is central to their technique of propagating a set of pool states. The method exploits both forward and backward representations of HMMs in an alternating manner. And propagates the pool states from one observation time to the next. The paper also exploits latent Gaussian structures to make autoregressive proposals, as well as flip proposals from x to -x [which seem to only make sense when 0 is a central value for the target, i.e. when the observables y only depend on |x|]. All those modifications bring the proposal quite close to (backward) particle Gibbs, the difference being in using Metropolis rather than importance steps. And in an improvement brought by the embedded HMM approach, even though it is always delicate to generalise those comparisons when some amount of calibration is required by both algorithms under comparison. (Especially delicate when it is rather remote from my area of expertise!) Anyway, I am still intrigued [in a positive way] by the embedded HMM idea as it remains mysterious that a finite length HMM simulation can improve the convergence performances that much. And wonder at a potential connection with an earlier paper of Anthony Lee and Krys Latuszynski using a random number of auxiliary variables. Presumably a wrong impression from a superficial memory…

rediscovering the harmonic mean estimator

Posted in Kids, Statistics, University life with tags , , , , , , , on November 10, 2015 by xi'an

When looking at unanswered questions on X validated, I came across a question where the author wanted to approximate a normalising constant

N=\int g(x)\,\text{d}x\,,

while simulating from the associated density, g. While seemingly unaware of the (huge) literature in the area, he re-derived [a version of] the harmonic mean estimate by considering the [inverted importance sampling] identity

\int_\mathcal{X} \dfrac{\alpha(x)}{g(x)}p(x) \,\text{d}x=\int_\mathcal{X} \dfrac{\alpha(x)}{N} \,\text{d}x=\dfrac{1}{N}

when α is a probability density and by using for α the uniform over the whole range of the simulations from g. This choice of α obviously leads to an estimator with infinite variance when the support of g is unbounded, but the idea can be easily salvaged by using instead another uniform distribution, for instance on an highest density region, as we studied in our papers with Darren Wraith and Jean-Michel Marin. (Unfortunately, the originator of the question does not seem any longer interested in the problem.)

MCMC for non-linear state space models using ensembles of latent sequences

Posted in Statistics with tags , , , , on November 6, 2013 by xi'an

IMG_19390While visiting U of T, I had the opportunity to discuss the above paper MCMC for non-linear state space models using ensembles of latent sequences with both authors, Alex Shestopalo ff and Radford Neal, paper that I had completely missed during my hospital break. The paper borrows from the central idea of Neal (2003), which “is to temporarily reduce the state space of the model, which may be countably in finite or continuous, to a finite collection of randomly generated `pool’ states at each time point.” Several copies of the hidden Markov chain are generated from a proposal and weighted according to the posterior. This makes for a finite state space on which a forward-backward algorithm can be run, thus producing a latent sequence that is more likely than the ones originally produced from the proposal, as it borrows at different times from different chains.  (I alas had no lasting memory of this early paper of Radford’s. But this is essentially ensemble Monte Carlo.)

What Radford patiently explained to me in Toronto was why this method did not have the same drawback as an importance sampling method, as the weights were local rather than global and hence did not degenerate as the length of the chain/HMM increased. Which I find a pretty good argument! The trick of being able to rely on forward-backward simulation is also very appealing. This of course does not mean that the method is always converging quickly, as the proposal matters. A novelty of the current paper is the inclusion of parameter simulation steps as well, steps that are part of the ensemble Monte Carlo process (rather than a standard Gibbs implementation). There also is a delayed acceptance (as opposed to delayed rejection) step where a subset of the chain is used to check for early (and cheaper) rejection.

The paper is a bit short on describing the way pool states can be generated (see Section 7), but it seems local (time-wise) perturbations of the current state are considered. I wonder if an intermediate particle step could produce a more efficient proposal… It also seems possible to consider a different number of pool states at more uncertain or more sticky times, with a potential for adaptivity depending on the acceptance rate.

For the evaluation of the method, the authors consider the Ricker population dynamics model found in Wood (2003) and Fearnhead and Prangle (2012), where semi-automated ABC is used. In the experiment described therein, there are only 100 latent states, which is enough to hinder MCMC. The ensemble method does much better. While there is no comparison with ABC, I would presume this method, relying on a more precise knowledge of the probabilistic model, should do better. Maybe a good topic for a Master project?