## slice sampling for Dirichlet mixture process

Posted in Books, Statistics, University life on June 21, 2017 by xi'an

When working with my PhD student Changye in Dauphine this morning I realised that slice sampling also applies to discrete support distributions and could even be of use in such settings. That it works is (now) straightforward in that the missing variable representation behind the slice sampler also applies to densities defined with respect to a discrete measure. That it is useful transpires from the short paper of Stephen Walker (2007), where Stephen relies on the slice sampler to sample from the Dirichlet mixture model by eliminating the tail problem associated with this distribution. (This paper appeared in Communications in Statistics and it is through Pati & Dunson (2014) taking advantage of this trick that Changye found out about its very existence. I may have known about it in an earlier life, but I had clearly forgotten everything!)

While the prior distribution (of the weights) of the Dirichlet mixture process is easy to generate via the stick breaking representation, the posterior distribution is trickier, as the weights are multiplied by the values of the sampling distribution (likelihood) at the corresponding parameter values and cannot be normalised. Introducing a uniform variable to replace each weight in the mixture with an indicator that the uniform is less than that weight corresponds to a (latent variable) completion [or a demarginalisation as we called this trick in Monte Carlo Statistical Methods]. As elaborated in the paper, the Gibbs steps corresponding to this completion are easy to implement, involving only a finite number of components, meaning the allocation to a component of the mixture can be conducted rather efficiently. Or not, when considering that the weights in the Dirichlet mixture are not monotone, hence that a large number of them may need to be computed before picking the next index in the mixture when the uniform draw happens to be quite small.
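A minimal sketch of the key step, assuming stick-breaking weights with concentration α: given a slice level u (drawn uniformly below the current weight), only finitely many weights can exceed u, and they can all be found by generating sticks until the leftover mass falls below u. The function name and parameters below are mine, not Walker's:

```python
import random

def active_components(u, alpha, rng=random):
    """List the mixture components whose stick-breaking weight exceeds the
    slice level u: once the leftover stick mass drops below u, no further
    weight can exceed u, so the search is guaranteed to be finite."""
    weights = []
    remaining = 1.0
    while remaining > u:
        v = rng.betavariate(1.0, alpha)   # stick proportion V_k ~ Beta(1, alpha)
        weights.append(v * remaining)     # w_k = V_k * prod_{l<k} (1 - V_l)
        remaining *= 1.0 - v
    return [k for k, w in enumerate(weights) if w > u], weights

rng = random.Random(42)
u = rng.uniform(0.0, 0.2)                 # slice level, as in u_i ~ U(0, w_{z_i})
indices, weights = active_components(u, alpha=1.0, rng=rng)
print(indices)                            # finitely many candidate components
```

The termination guarantee is the whole point of the trick: the allocation step only ever touches the components returned by this finite search, however small u happens to be.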

## comments on Watson and Holmes

Posted in Books, pictures, Statistics, Travel on April 1, 2016 by xi'an

“The world is full of obvious things which nobody by any chance ever observes.” The Hound of the Baskervilles

In connection with the upcoming publication of James Watson’s and Chris Holmes’ Approximating models and robust decisions in Statistical Science, Judith Rousseau and I wrote a discussion on the paper that was arXived yesterday.

“Overall, we consider that the calibration of the Kullback-Leibler divergence remains an open problem.” (p.18)

While the paper connects with earlier ones by Chris and coauthors, and possibly despite the overall critical tone of the comments!, I really appreciate the renewed interest in robustness advocated in this paper. I was going to write Bayesian robustness but to differ from the perspective adopted in the 90’s where robustness was mostly about the prior, I would say this is rather a Bayesian approach to model robustness from a decisional perspective. With definitive innovations like considering the impact of posterior uncertainty over the decision space, uncertainty being defined e.g. in terms of Kullback-Leibler neighbourhoods. Or with a Dirichlet process distribution on the posterior. This may step out of the standard Bayesian approach but it remains of definite interest! (And note that this discussion of ours [reluctantly!] refrained from capitalising on the names of the authors to build easy puns linked with the most Bayesian of all detectives!)

## Dirichlet process mixture inconsistency

Posted in Books, Statistics on February 15, 2016 by xi'an

Judith Rousseau pointed out to me this NIPS paper by Jeff Miller and Matthew Harrison on the possible inconsistency of Dirichlet mixture priors for estimating the (true) number of components in a (true) mixture model: the resulting posterior on the number of components does not concentrate on the right number of components. Which is not the case when setting a prior on the unknown number of components of a mixture, where consistency occurs. (The inconsistency results established in the paper are actually focussed on iid Gaussian observations, for which the estimated number of Gaussian components is almost never equal to 1.) In a more recent arXiv paper, they also show that a Dirichlet prior on the weights and a prior on the number of components can still produce the same features as a Dirichlet mixture prior. Even the stick breaking representation! (A paper that I already reviewed last Spring.)

## Conditional love [guest post]

Posted in Books, Kids, Statistics, University life on August 4, 2015 by xi'an

[When Dan Simpson told me he was reading Terenin’s and Draper’s latest arXival in a nice Bath pub—and not a nice bath tub!—, I asked him for a blog entry and he agreed. Here is his piece, read at your own risk! If you remember to skip the part about Céline Dion, you should enjoy it very much!!!]

Probability has traditionally been described, as per Kolmogorov and his ardent follower Katy Perry, unconditionally. This is, of course, excellent for those of us who really like measure theory, as the maths is identical. Unfortunately mathematical convenience is not necessarily enough and a large part of the applied statistical community is working with Bayesian methods. These are unavoidably conditional and, as such, it is natural to ask if there is a fundamentally conditional basis for probability.

Bruno de Finetti—and later Richard Cox and Edwin Jaynes—considered conditional bases for Bayesian probability that are, unfortunately, incomplete. The critical problem is that they mainly consider finite state spaces and construct finitely additive systems of conditional probability. For a variety of reasons, neither of these restrictions holds much sway in the modern world of statistics.

In a recently arXiv’d paper, Alexander Terenin and David Draper devise a set of axioms that make the Cox-Jaynes system of conditional probability rigorous. Furthermore, they show that the complete set of Kolmogorov axioms (including countable additivity) can be derived as theorems from their axioms by conditioning on the entire sample space.

This is a deep and fundamental paper, which unfortunately means that I most probably do not grasp its complexities (especially as, for some reason, I keep reading it in pubs!). However I’m going to have a shot at having some thoughts on it, because I feel like it’s the sort of paper one should have thoughts on.

## mixture models with a prior on the number of components

Posted in Books, Statistics, University life on March 6, 2015 by xi'an

“From a Bayesian perspective, perhaps the most natural approach is to treat the number of components like any other unknown parameter and put a prior on it.”

Another mixture paper on arXiv! Indeed, Jeffrey Miller and Matthew Harrison recently arXived a paper on estimating the number of components in a mixture model, comparing the parametric with the non-parametric Dirichlet prior approaches, since priors can be chosen towards agreement between those. This is an obviously interesting issue, as they are often opposed in modelling debates. The above graph shows a crystal clear agreement between finite component mixture modelling and Dirichlet process modelling. The same happens for classification. However, Dirichlet process priors do not return an estimate of the number of components, which may be considered a drawback if one considers this is an identifiable quantity in a mixture model… But the paper stresses that the number of estimated clusters under the Dirichlet process modelling tends to be larger than the number of components in the finite case, hence that the Dirichlet process mixture modelling is not consistent in that respect, producing parasitic extra clusters…

In the parametric modelling, the authors assume the same scale is used in all Dirichlet priors, that is, for all values of k, the number of components. Which means an incoherence when marginalising from k to (k-p) components. Mild incoherence, in fact, as the parameters of the different models do not have to share the same priors. And, as shown by Proposition 3.3 in the paper, this does not prevent coherence in the marginal distribution of the latent variables. The authors also draw a comparison between the distribution of the partition in the finite mixture case and the Chinese restaurant process associated with the partition in the infinite case. A further analogy is that the finite case allows for a stick breaking representation. A noteworthy difference between the two modellings concerns the size of the partitions

$$\mathbb{P}(s_1,\ldots,s_k)\propto\prod_{j=1}^k s_j^{\gamma-1}\quad\text{versus}\quad\mathbb{P}(s_1,\ldots,s_k)\propto\prod_{j=1}^k s_j^{-1}$$

in the finite (homogeneous partitions) and infinite (extreme partitions) cases.
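As a toy illustration of this contrast, one can compare the unnormalised probabilities given to a balanced and to an extreme partition of ten items under a power law on the block sizes, taking the finite-case exponent to be γ-1 and the Dirichlet process exponent to be -1 (the value of γ below is hypothetical, chosen for illustration only):

```python
from math import prod

def unnorm_partition_prob(sizes, exponent):
    """Unnormalised probability of a partition with block sizes s_1,...,s_k
    under the power law prod_j s_j**exponent."""
    return prod(s ** exponent for s in sizes)

balanced, extreme = [5, 5], [9, 1]  # two partitions of ten items
gamma = 2.0                         # hypothetical Dirichlet scale, illustration only
finite_ratio = (unnorm_partition_prob(balanced, gamma - 1)
                / unnorm_partition_prob(extreme, gamma - 1))
infinite_ratio = (unnorm_partition_prob(balanced, -1.0)
                  / unnorm_partition_prob(extreme, -1.0))
print(finite_ratio)    # > 1: the finite mixture favours the homogeneous partition
print(infinite_ratio)  # < 1: the Dirichlet process favours the extreme partition
```

With γ > 1 the balanced partition dominates in the finite case, while the -1 exponent of the infinite case always tips the balance towards partitions with tiny blocks.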

An interesting entry into the connections between “regular” mixture modelling and Dirichlet mixture models. Maybe not ultimately surprising given the past studies by Peter Green and Sylvia Richardson of both approaches (1997 in Series B and 2001 in JASA).