Archive for missing data

a neat EM resolution

Posted in Books, Kids, Statistics, University life with tags , , , , , , on February 3, 2021 by xi'an

Read (and answered) this question on X validation about finding the maximum likelihood estimator of a 2×2 Gaussian covariance matrix when some observations are partly missing.  The neat thing is that, in this case, the maximisation step is identical to the maximum likelihood estimation of the 2×2 Gaussian covariance matrix by redefining the empirical covariance matrix into Z and maximising


in Σ. Nothing involved but fun to explain, nonetheless. (In my final exam this year, no student even approached the EM questions!)

postdoc on missing data at École Polytechnique

Posted in Kids, pictures, R, Statistics, Travel, University life with tags , , , , , , , , , , , on November 18, 2016 by xi'an

Julie Josse contacted me for advertising a postdoc position at École Polytechnique, in Palaiseau, south of Paris. “The fellowship is focusing on missing data. Interested graduates should apply as early as possible since the position will be filled when a suitable candidate is found. The Centre for Applied Mathematics (CMAP) is  looking for highly motivated individuals able to develop a general multiple imputation method for multivariate continuous and categorical variables and its implementation in the free R software. The successful candidate will be part of research group in the statistical team on missing values. The postdoc will also have excellent opportunities to collaborate with researcher in public health with partners on the analysis of a large register from the Paris Hospital (APHP) to model the decisions and events when severe trauma patients are handled by emergency doctors. Candidates should contact Julie Josse at”

MissData 2015 in Rennes [June 18-19]

Posted in R, Statistics, Travel, University life with tags , , , , , , on February 9, 2015 by xi'an

This (early) summer, a conference on missing data will be organised in Rennes, Brittany, with the support of the French Statistical Society [SFDS]. (Check the website if interested, Rennes is a mere two hours from Paris by fast train.)

label switching in Bayesian mixture models

Posted in Books, Statistics, University life with tags , , , , , , , , , , , on October 31, 2014 by xi'an

cover of Mixture Estimation and ApplicationsA referee of our paper on approximating evidence for mixture model with Jeong Eun Lee pointed out the recent paper by Carlos Rodríguez and Stephen Walker on label switching in Bayesian mixture models: deterministic relabelling strategies. Which appeared this year in JCGS and went beyond, below or above my radar.

Label switching is an issue with mixture estimation (and other latent variable models) because mixture models are ill-posed models where part of the parameter is not identifiable. Indeed, the density of a mixture being a sum of terms

\sum_{j=1}^k \omega_j f(y|\theta_i)

the parameter (vector) of the ω’s and of the θ’s is at best identifiable up to an arbitrary permutation of the components of the above sum. In other words, “component #1 of the mixture” is not a meaningful concept. And hence cannot be estimated.

This problem has been known for quite a while, much prior to EM and MCMC algorithms for mixtures, but it is only since mixtures have become truly estimable by Bayesian approaches that the debate has grown on this issue. In the very early days, Jean Diebolt and I proposed ordering the components in a unique way to give them a meaning. For instant, “component #1” would then be the component with the smallest mean or the smallest weight and so on… Later, in one of my favourite X papers, with Gilles Celeux and Merrilee Hurn, we exposed the convergence issues related with the non-identifiability of mixture models, namely that the posterior distributions were almost always multimodal, with a multiple of k! symmetric modes in the case of exchangeable priors, and therefore that Markov chains would have trouble to visit all those modes in a symmetric manner, despite the symmetry being guaranteed from the shape of the posterior. And we conclude with the slightly provocative statement that hardly any Markov chain inferring about mixture models had ever converged! In parallel, time-wise, Matthew Stephens had completed a thesis at Oxford on the same topic and proposed solutions for relabelling MCMC simulations in order to identify a single mode and hence produce meaningful estimators. Giving another meaning to the notion of “component #1”.

And then the topic began to attract more and more researchers, being both simple to describe and frustrating in its lack of definitive answer, both from simulation and inference perspectives. Rodriguez’s and Walker’s paper provides a survey on the label switching strategies in the Bayesian processing of mixtures, but its innovative part is in deriving a relabelling strategy. Which consists of finding the optimal permutation (at each iteration of the Markov chain) by minimising a loss function inspired from k-means clustering. Which is connected with both Stephens’ and our [JASA, 2000] loss functions. The performances of this new version are shown to be roughly comparable with those of other relabelling strategies, in the case of Gaussian mixtures. (Making me wonder if the choice of the loss function is not favourable to Gaussian mixtures.) And somehow faster than Stephens’ Kullback-Leibler loss approach.

“Hence, in an MCMC algorithm, the indices of the parameters can permute multiple times between iterations. As a result, we cannot identify the hidden groups that make [all] ergodic averages to estimate characteristics of the components useless.”

One section of the paper puzzles me, albeit it does not impact the methodology and the conclusions. In Section 2.1 (p.27), the authors consider the quantity

p(z_i=j|{\mathbf y})

which is the marginal probability of allocating observation i to cluster or component j. Under an exchangeable prior, this quantity is uniformly equal to 1/k for all observations i and all components j, by virtue of the invariance under permutation of the indices… So at best this can serve as a control variate. Later in Section 2.2 (p.28), the above sentence does signal a problem with those averages but it seem to attribute it to MCMC behaviour rather than to the invariance of the posterior (or to the non-identifiability of the components per se). At last, the paper mentions that “given the allocations, the likelihood is invariant under permutations of the parameters and the allocations” (p.28), which is not correct, since eqn. (8)

f(y_i|\theta_{\sigma(z_i)}) =f(y_i|\theta_{\tau(z_i)})

does not hold when the two permutations σ and τ give different images of zi

more typos in Monte Carlo statistical methods

Posted in Books, Statistics, University life with tags , , , , , , , , , on October 28, 2011 by xi'an

Jan Hanning kindly sent me this email about several difficulties with Chapters 3, Monte Carlo Integration, and  5, Monte Carlo Optimization, when teaching out of our book Monte Carlo Statistical Methods [my replies in italics between square brackets, apologies for the late reply and posting, as well as for the confusion thus created. Of course, the additional typos will soon be included in the typo lists on my book webpage.]:

  1. I seem to be unable to reproduce Table 3.3 on page 88 – especially the chi-square column does not look quite right. [No, they definitely are not right: the true χ² quantiles should be 2.70, 3.84, and 6.63, at the levels 0.1, 0.05, and 0.01, respectively. I actually fail to understand how we got this table that wrong…]
  2. The second question  I have is the choice of the U(0,1) in this Example 3.6. It  feels to me that a choice of Beta(23.5,18.5) for p1 and Beta(36.5,5.5) for p2 might give a better representation based on the data we have. Any comments? [I am plainly uncertain about this… Yours is the choice based on the posterior Beta coefficient distributions associated with Jeffreys prior, hence making the best use of the data. I wonder whether or not we should remove this example altogether… It is certainly “better” than the uniform. However, in my opinion, there is no proper choice for the distribution of the pi‘s because we are mixing there a likelihood-ratio solution with a Bayesian perspective on the predictive distribution of the likelihood-ratio. If anything, this exposes the shortcomings of a classical approach, but it is likely to confuse the students! Anyway, this is a very interesting problem.]
  3. My students discovered that Problem 5.19 has the following typos, copying from their e-mail: “x_x” should be “x_i” [sure!]. There are a few “( )”s missing here and there [yes!]. Most importantly, the likelihood/density seems incorrect. The normalizing constant should be the reciprocal of the one showed in the book [oh dear, indeed, the constant in the exponential density did not get to the denominator…]. As a result, all the formulas would differ except the ones in part (a). [they clearly need to be rewritten, sorry about this mess!]
  4. I am unsure about the if and only if part of the Theorem 5.15 [namely that the likelihood sequence is stationary if and only if the Q function in the E step has reached a stationary point]. It appears to me that a condition for the “if part” is missing [the “only if” part is a direct consequence of Jensen’s inequality]. Indeed Theorem 1 of Dempster et al 1977 has an extra condition [note that the original proof for convergence of EM has a flaw, as discussed here]. Am I missing something obvious? [maybe: it seems to me that, once Q reaches a fixed point, the likelihood L does not change… It is thus tautological, not a proof of convergence! But the theorem says a wee more, so this needs investigating. As Jan remarked, there is no symmetry in the Q function…]
  5. Should there be a (n-m) in the last term of formula (5.17)? [yes, indeed!, multiply the last term by (n-m)]
  6. Finally, I am a bit confused about the likelihood in Example 5.22 [which is a capture-recapture model]. Assume that Hij=k [meaning the animal i is in state k at time j]. Do you assume that you observed Xijr [which is the capture indicator for animal i at time j in zone k: it is equal to 1 for at most one k] as a Binomial B(n,pr) even for r≠k? [no, we observe all Xijr‘s with r≠k equal to zero]  The nature of the problem seems to suggest that the answer is no [for other indices, Xijr is always zero, indeed] If that is the case I do not see where the power on top of (1-pk) in the middle of the page 185 comes from [when the capture indices are zero, they do not contribute to the sum, which explains for this condensed formula. Therefore, I do not think there is anything wrong with this over-parameterised representation of the missing variables.]
  7. In Section 5.3.4, there seems to be a missing minus sign in the approximation formula for the variance [indeed, shame on us for missing the minus in the observed information matrix!]
  8. I could not find the definition of \mathbb{N}^* in Theorem 6.15. Is it all natural numbers or all integers? May be it would help to include it in Appendix B. [Surprising! This is the set of all positive integers, I thought this was a standard math notation…]
  9. In Definition 6.27, you probably want to say covering of A and not X. [Yes, we were already thinking of the next theorem, most likely!]
  10. In Proposition 6.33 –   all x in A instead of all x in X. [Yes, again! As shown in the proof. Even though it also holds for all x in X]

Thanks a ton to Jan and to his UNC students (and apologies for leading them astray with those typos!!!)