observed vs. complete in EM algorithm

Posted in Statistics with tags , , , , , on November 17, 2022 by xi'an

While answering a question related with the EM  algorithm on X validated, I realised a global (or generic) feature of the (objective) E function, namely that

$E(\theta'|\theta)=\mathbb E_{\theta}[\log\,f_{X,Z}(x^\text{obs},Z|\theta')|X=x^\text{obs}]$

can always be written as

$\log\,f_X(x^\text{obs};\theta')+\mathbb E_{\theta}[\log\,f_{Z|X}(Z|x^\text{obs},\theta')|X=x^\text{obs}]$

therefore always includes the (log-) observed likelihood, at least in this formal representation. While the proof that EM is monotonous in the values of the observed likelihood uses this decomposition as well, in that

$\log\,f_X(x^\text{obs};\theta')=\log\,\mathbb E_{\theta}\left[\frac{f_{X,Z}(x^\text{obs},Z;\theta')}{f_{Z|X}(Z|x^\text{obs},\theta)}\big|X=x^\text{obs}\right]$

I wonder if the appearance of the actual target in the temporary target E(θ’|θ) can be exploited any further.

efficiency of normalising over discrete parameters

Posted in Statistics with tags , , , , , , , , , on May 1, 2022 by xi'an

Yesterday, I noticed a new arXival entitled Investigating the efficiency of marginalising over discrete parameters in Bayesian computations written by Wen Wang and coauthors. The paper is actually comparing the simulation of a Gibbs sampler with an Hamiltonian Monte Carlo approach on Gaussian mixtures, when including and excluding latent variables, respectively. The authors missed the opposite marginalisation when the parameters are integrated.

While marginalisation requires substantial mathematical effort, folk wisdom in the Stan community suggests that fitting models with marginalisation is more efficient than using Gibbs sampling.

The comparison is purely experimental, though, which means it depends on the simulated data, the sample size, the prior selection, and of course the chosen algorithms. It also involves the [mostly] automated [off-the-shelf] choices made in the adopted software, JAGS and Stan. The outcome is only evaluated through ESS and the (old) R statistic. Which all depend on the parameterisation. But evacuates the label switching problem by imposing an ordering on the Gaussian means, which may have a different impact on marginalised and unmarginalised models. All in all, there is not much one can conclude about this experiment since the parameter values beyond the simulated data seem to impact the performances much more than the type of algorithm one implements.

ordered allocation sampler

Posted in Books, Statistics with tags , , , , , , , , , , , on November 29, 2021 by xi'an

Recently, Pierpaolo De Blasi and María Gil-Leyva arXived a proposal for a novel Gibbs sampler for mixture models. In both finite and infinite mixture models. In connection with Pitman (1996) theory of species sampling and with interesting features in terms of removing the vexing label switching features.

The key idea is to work with the mixture components in the random order of appearance in an exchangeable sequence from the mixing distribution (…) In accordance with the order of appearance, we derive a new Gibbs sampling algorithm that we name the ordered allocation sampler. “

This central idea is thus a reinterpretation of the mixture model as the marginal of the component model when its parameter is distributed as a species sampling variate. An ensuing marginal algorithm is to integrate out the weights and the allocation variables to only consider the non-empty component parameters and the partition function, which are label invariant. Which reminded me of the proposal we made in our 2000 JASA paper with Gilles Celeux and Merrilee Hurn (one of my favourite papers!). And of the [first paper in Statistical Methodology] 2004 partitioned importance sampling version with George Casella and Marty Wells. As in the later, the solution seems to require the prior on the component parameters to be conjugate (as I do not see a way to produce an unbiased estimator of the partition allocation probabilities).

The ordered allocation sample considers the posterior distribution of the different object made of the parameters and of the sequence of allocations to the components for the sample written in a given order, ie y¹,y², &tc. Hence y¹ always gets associated with component 1, y² with either component 1 or component 2, and so on. For this distribution, the full conditionals are available, incl. the full posterior on the number m of components, only depending on the data through the partition sizes and the number m⁺ of non-empty components. (Which relates to the debate as to whether or not m is estimable…) This sequential allocation reminded me as well of an earlier 2007 JRSS paper by Nicolas Chopin. Albeit using particles rather than Gibbs and applied to a hidden Markov model. Funny enough, their synthetic dataset univ4 almost resembles the Galaxy dataset (as in the above picture of mine)!

latent variables for a hierarchical Poisson model

Posted in Books, Kids, pictures, Statistics, University life with tags , , , , , , , , on March 11, 2021 by xi'an

Answering a question on X validated about a rather standard hierarchical Poisson model, and its posterior Gibbs simulation, where observations are (d and w being a document and a word index, resp.)

$N_{w,d}\sim\mathcal P(\textstyle\sum_{1\le k\le K} \pi_{k,d}\varphi_{k,w})\qquad(1)$

I found myself dragged into an extended discussion on the validation of creating independent Poisson latent variables

$N_{k,w,d}\sim\mathcal P(\pi_{k,d}\varphi_{k,w})\qquad(2)$

since observing their sum in (1) was preventing the latent variables (2) from being independent. And then found out that the originator of the question had asked on X validated an unanswered and much more detailed question in 2016, even though the notations differ. The question does contain the solution I proposed above, including the Multinomial distribution on the Poisson latent variables given their sum (and the true parameters). As it should be since the derivation was done in a linked 2014 paper by Gopalan, Hofman, and Blei, later published in the Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence (UAI). I am thus bemused at the question resurfacing five years later in a much simplified version, but still exhibiting the same difficulty with the conditioning principles…

averaged acceptance ratios

Posted in Statistics with tags , , , , , , , , , , , , , on January 15, 2021 by xi'an

In another recent arXival, Christophe Andrieu, Sinan Yıldırım, Arnaud Doucet, and Nicolas Chopin study the impact of averaging estimators of acceptance ratios in Metropolis-Hastings algorithms. (It is connected with the earlier arXival rephrasing Metropolis-Hastings in terms of involutions discussed here.)

“… it is possible to improve performance of this algorithm by using a modification where the acceptance ratio r(ξ) is integrated with respect to a subset of the proposed variables.”

This interpretation of the current proposal makes it a form of Rao-Blackwellisation, explicitly mentioned on p.18, where, using a mixture proposal, with an adapted acceptance probability, it depends on the integrated acceptance ratio only. Somewhat magically using this ratio and its inverse with probability ½. And it increases the average Metropolis-Hastings acceptance probability (albeit with a larger number of simulations). Since the ideal averaging is rarely available, the authors implement a Monte Carlo averaging version. With applications to the exchange algorithm and to reversible jump MCMC. The major application is to pseudo-marginal settings with a high complexity (in the number T of terms) and where the authors’ approach does scale efficiently with T. There is even an ABC side to the story as one illustration is made of the ABC approximation to the posterior of an α-stable sample. As an encompassing proposal for handling Metropolis-Hastings environments with latent variables and several versions of the acceptance ratios, this is quite an interesting paper that I think we will study in further detail with our students.