## dominating measure

Posted in Books, pictures, Statistics, Travel, University life on March 21, 2019 by xi'an

Yet another question on X validated reminded me of a discussion I once had with Jay Kadane when visiting Carnegie Mellon in Pittsburgh. Namely the fundamentally ill-posed nature of conjugate priors. Indeed, when considering the definition of a conjugate family as being a parameterised family Þ of distributions over the parameter space Θ stable under transform to the posterior distribution, this property is completely dependent (if there is such a notion as completely dependent!) on the dominating measure adopted on the parameter space Θ. Adopted is the word as there is no default, reference, natural, &tc. measure that promotes one specific measure on Θ as being the dominating measure. This is a well-known difficulty that also sticks out in most “objective Bayes” problems, as well as with maximum entropy priors. This means for instance that, while the Gamma distributions constitute a conjugate family for a Poisson likelihood, so do the truncated Gamma distributions. And so do the distributions whose density (against a Lebesgue measure over an arbitrary subset of (0,∞)) is the product of a Gamma density by an arbitrary function of θ. I readily acknowledge that the standard conjugate priors as introduced in every Bayesian textbook are standard because they facilitate (to a certain extent) posterior computations. But, just like there exists an infinity of MaxEnt priors associated with an infinity of dominating measures, there exists an infinity of conjugate families, once more associated with an infinity of dominating measures. And the fundamental reason is that the sampling model (which induces the shape of the conjugate family) does not provide a measure on the parameter space Θ.
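To make the Poisson example explicit, here is a short worked version. The function h, the hyperparameters (a,b), and the sample x₁,…,xₙ are notations introduced for illustration only, and h is assumed non-negative and such that the densities below are integrable:

$$
\pi(\theta\mid a,b) \propto h(\theta)\,\theta^{a-1}e^{-b\theta}
\quad\Longrightarrow\quad
\pi(\theta\mid x_1,\ldots,x_n) \propto h(\theta)\,\theta^{a+\sum_i x_i-1}e^{-(b+n)\theta}
$$

The update (a,b) → (a+Σxᵢ, b+n) is thus the same whatever h, which amounts to replacing the Lebesgue measure dθ with the dominating measure h(θ)dθ. Taking h as the indicator function of a subset of (0,∞) returns the truncated Gamma family.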

## I’m getting the point

Posted in Statistics on February 14, 2019 by xi'an

A long-winded X validated discussion on the [textbook] mean-variance conjugate posterior for the Normal model left me [mildly] depressed at the point and use of answering questions on this forum. Especially as it came at the same time as a catastrophic outcome for my mathematical statistics exam. Possibly an incentive to quit X validated as one quits smoking, although this is not the first attempt.

## inverse stable priors

Posted in Statistics on November 24, 2017 by xi'an

Dexter Cahoy and Joseph Sedransk just arXived a paper on so-called inverse stable priors. The starting point is the supposed deficiency of Gamma conjugate priors, which have explosive behaviour near zero. Albeit remaining proper. (This behaviour eventually vanishes for a large enough sample size.) The alternative involves a transform of alpha-stable random variables, with the consequence that the density of this alternative prior does not have a closed form. Neither does the posterior. When the likelihood can be written as exp(aθ + b log θ), modulo a reparameterisation, which covers a wide range of distributions, the posterior can be written in terms of the inverse stable density and of another (intractable) function called the generalized Mittag-Leffler function. (Which connects this post to an earlier post on Sofia Kovalevskaya.) For simulating this posterior, the authors suggest using an accept-reject algorithm based on the prior as proposal, which has the advantage of removing the intractable inverse stable density but the disadvantage of… simulating from the prior! (No mention is made of the acceptance rate.) I am thus reserved as to how appealing this new proposal is, despite “the inverse stable density (…) becoming increasingly popular in several areas of study”. And hence do not foresee a bright future for this class of prior…
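As a rough illustration of the accept-reject scheme with the prior as proposal, here is a minimal Python sketch. It is not the authors' code: the function names are hypothetical, the likelihood kernel exp(aθ + b log θ) is assumed to have a < 0 and b > 0 so that it is bounded (with maximum at θ = −b/a), and a Gamma prior stands in for the inverse stable prior, since only the ability to simulate from the prior matters here.

```python
import math
import random


def log_lik(theta, a, b):
    """Log-likelihood kernel a*theta + b*log(theta), for theta > 0."""
    return a * theta + b * math.log(theta)


def accept_reject_from_prior(sample_prior, a, b, n_draws, rng):
    """Simulate from the posterior using the prior as proposal.

    Assumes a < 0 and b > 0, so the kernel is maximal at theta = -b / a.
    Returns the accepted draws and the empirical acceptance rate.
    """
    if not (a < 0 and b > 0):
        raise ValueError("need a < 0 and b > 0 for a bounded likelihood")
    log_bound = log_lik(-b / a, a, b)
    draws, proposals = [], 0
    while len(draws) < n_draws:
        theta = sample_prior(rng)
        proposals += 1
        # Accept with probability L(theta) / sup L: the prior density
        # cancels and never needs to be evaluated.
        if math.log(1.0 - rng.random()) <= log_lik(theta, a, b) - log_bound:
            draws.append(theta)
    return draws, n_draws / proposals


if __name__ == "__main__":
    # Hypothetical Poisson sample with n = 5 and sum 12, i.e. a = -5, b = 12.
    # Stand-in prior (assumption): Gamma with shape 2 and scale 1.
    rng = random.Random(1)
    draws, rate = accept_reject_from_prior(
        lambda r: r.gammavariate(2.0, 1.0), a=-5.0, b=12.0, n_draws=2000, rng=rng
    )
    print("posterior mean estimate:", sum(draws) / len(draws))
    print("acceptance rate:", rate)
```

The acceptance rate equals the marginal likelihood divided by the bound on the likelihood, so it deteriorates when the prior and the likelihood disagree, which is why its absence from the paper matters.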

## Particle Gibbs for conjugate mixture posteriors

Posted in Books, Statistics, University life on September 8, 2015 by xi'an

Alexandre Bouchard-Côté, Arnaud Doucet, and Andrew Roth have arXived a paper “Particle Gibbs Split-Merge Sampling for Bayesian Inference in Mixture Models” that proposes an efficient algorithm to explore the posterior distribution of a mixture, when interpreted as a clustering model. (To clarify the previous sentence, this is a regular plain vanilla mixture model for which they explore the latent variable representation.)

I very much like the paper because it relates to an earlier paper of mine with George Casella and Marty Wells, a paper we wrote right after a memorable JSM in Baltimore (during what may have been my last visit to Cornell University as George left for Florida the following year). The starting point of this approach to mixture estimation is that the (true) parameters of a mixture can be (exactly) integrated out when using conjugate priors and a completion step. Namely, the marginal posterior distribution of the latent variables given the data is available in closed form. The latent variables being the component allocations for the observations. The joint posterior is then a product of the prior on the parameters times the prior on the latents times a product of simple (e.g., Gaussian) terms. This equivalently means the marginal likelihoods conditional on the allocations are available in closed form. Looking directly at those marginal likelihoods, a prior distribution on the allocations can be introduced (e.g., the Pitman-Yor process or the finite Dirichlet prior) and, together, they define a closed form target. Albeit complex. As often on a finite state space. In our paper with George and Marty, we proposed using importance sampling to explore the set, using for instance marginal distributions for the allocations, which are uniform in the case of exchangeable priors, but this is not very efficient, as exhibited by our experiments where very few partitions would get most of the weight.
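To illustrate what a closed-form target on the allocations looks like, here is a minimal Python sketch under assumptions chosen purely for brevity and not taken from either paper: Poisson components with independent Gamma(alpha, beta) priors on the rates (rather than Gaussian terms) and a symmetric Dirichlet(gamma) prior on the weights. The function names are hypothetical. Enumerating all allocations is only feasible for toy samples, which is exactly why smarter exploration is needed.

```python
import itertools
import math


def log_joint(z, x, K, alpha=1.0, beta=1.0, gamma=1.0):
    """log p(z, x) with component rates and weights integrated out."""
    n = len(x)
    # Dirichlet-multinomial term: prior on allocations, weights integrated out.
    out = math.lgamma(K * gamma) - math.lgamma(n + K * gamma)
    for k in range(K):
        members = [xi for xi, zi in zip(x, z) if zi == k]
        n_k, s_k = len(members), sum(members)
        out += math.lgamma(n_k + gamma) - math.lgamma(gamma)
        # Gamma-Poisson marginal likelihood of the observations in component k.
        out += alpha * math.log(beta) - math.lgamma(alpha)
        out += math.lgamma(alpha + s_k) - (alpha + s_k) * math.log(beta + n_k)
    # Poisson factorial terms, constant in z.
    out -= sum(math.lgamma(xi + 1) for xi in x)
    return out


def exact_allocation_posterior(x, K, **hyper):
    """Enumerate all K**n allocations and normalise (toy sample sizes only)."""
    allocations = list(itertools.product(range(K), repeat=len(x)))
    logs = [log_joint(z, x, K, **hyper) for z in allocations]
    top = max(logs)
    weights = [math.exp(value - top) for value in logs]
    total = sum(weights)
    return {z: w / total for z, w in zip(allocations, weights)}


if __name__ == "__main__":
    # Hypothetical toy sample with two well-separated groups.
    post = exact_allocation_posterior([0, 1, 1, 7, 8, 9], K=2)
    best = sorted(post.items(), key=lambda item: -item[1])[:4]
    for z, p in best:
        print(z, round(p, 4))
```

Even in this toy setting, the posterior mass concentrates on very few of the 64 allocations, in line with the weight degeneracy mentioned above.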

Even a Gibbs sampler on subsets of those indicators restricted to two components cannot be managed directly. The paper thus examines a specially designed particle Gibbs solution that implements a split and merge move on two clusters at a time. Merging or splitting the subset. With intermediate target distributions, SMC style. While this is quite an involved mechanism, which could be deemed excessive for the problem at hand, as well as inducing extra computing time, experiments in the paper demonstrate the mostly substantial improvement in efficiency brought by this algorithm.

## the worst possible proof [X’ed]

Posted in Books, Kids, Statistics, University life on July 18, 2015 by xi'an

Another surreal experience thanks to X validated! A user of the forum recently asked for an explanation of the above proof in Lynch’s (2007) book, Introduction to Applied Bayesian Statistics and Estimation for Social Scientists. No wonder this user was puzzled: the explanation makes no sense outside the univariate case… It is hard to fathom why on Earth the author would resort to this convoluted approach to conclude about the posterior conditional distribution being a normal centred at the least squares estimate and with σ⁻²X’X as precision matrix. Presumably, he has a poor opinion of the degree of matrix algebra numeracy of his readers [and thus should abstain from establishing the result]. As it seems unrealistic to postulate that the author is himself confused about matrix algebra, given his MSc in Statistics [the footnote ² seen above after “appropriately” acknowledges that “technically we cannot divide by” the matrix, but it goes on to suggest multiplying the numerator by the matrix

$(X^\text{T}X)^{-1} (X^\text{T}X)$

which does not make sense either, unless one introduces the trace tr(.) operator, presumably out of reach for most readers]. And this part of the explanation is unnecessarily confusing in that a basic matrix manipulation leads to the result. Or even simpler, a reference to Pythagoras’ theorem.
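For the record, a sketch of one such basic manipulation, assuming a flat prior on β and a full column rank X, with β̂ denoting the least squares estimate (the notations are introduced here and are not necessarily those of the book):

$$
\|y-X\beta\|^2 = \|y-X\hat\beta\|^2 + (\beta-\hat\beta)^\text{T}X^\text{T}X(\beta-\hat\beta),
\qquad \hat\beta=(X^\text{T}X)^{-1}X^\text{T}y
$$

This holds because the residual y−Xβ̂ is orthogonal to X(β−β̂), which is Pythagoras’ theorem. Conditional on σ², the posterior in β is therefore proportional to

$$
\exp\left\{-\frac{1}{2\sigma^2}(\beta-\hat\beta)^\text{T}X^\text{T}X(\beta-\hat\beta)\right\}
$$

that is, a normal distribution centred at β̂ with covariance matrix σ²(XᵀX)⁻¹.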

## JSM 2014, Boston [#2]

Posted in Statistics, Travel, University life on August 7, 2014 by xi'an

Day #2 at JSM started quite early as I had to be on site by 7am for the CHANCE editors breakfast. No running then, except to Porter metro station. Interesting exchange full of new ideas to keep the journal cruising. In particular, a call for proposals on special issues on sexy topics (reproducible research anyone? I already have some book reviews.). And directions to increase the international scope and readership. And possibly adding or reporting on a data challenge. After this great start, I attended the Bayesian Time Series and Dynamic Models session, where David Scott Matteson from Cornell University presented an extension of the Toronto ambulance data analysis Dawn Woodard had presented in Banff at an earlier workshop. The extension dealt with the spatio-temporal nature of the data, using a mixture model with time-dependent weights that revolved cyclically in an autoexponential manner. And rekindling the interest in the birth-and-death alternative to reversible jump. Plus another talk by Scott Holan mixing Bayesian analysis with frequency data, an issue that always puzzled me. The second session I attended was Multiscale Modeling for Complex Massive Data, with a modelling of brain connections through a non-parametric mixture by David Dunson. And a machine learning talk by Mauro Maggioni on a projection cum optimisation technique to fight the curse of dimensionality. Who proposed a solution to an optimal transport problem that is much more convincing than the one I discussed a while ago. Unfortunately, this made me miss the Biometrics showcase session, where Debashis Mondal presented a joint work with Julian Besag on Exact Goodness-of-Fit Tests for Markov Chains. And where both my friends Michael Newton and Peter Green were discussants… An idle question that came to me during this last talk was about the existence of particle filters for spatial Markov structures (rather than the usual ones on temporal Markov models).

After a [no] lunch break spent pondering over a conjecture put to me by Natesh Pillai yesterday, I eventually joined the Feature Allocation session. Eventually as I basically had to run the entire perimeter of the conference centre! The three talks by Finale Doshi-Velez, Tamara Broderick, and Yuan Ji were all impressive and this may have been my best session so far at JSM! Thanks to Peter Müller for organising it! Tamara Broderick focussed on a generic way to build conjugate priors for non-parametric models, with all talks involving Indian buffets. Maybe a suggestion for tonight’s meal..! (In the end, great local food on Harvard Square.)

## noninformative priors for mixtures

Posted in Books, Statistics, University life on May 26, 2014 by xi'an

“A novel formulation of the mixture model is introduced, which includes the prior constraint that each Gaussian component is always assigned a minimal number of data points. This enables noninformative improper priors such as the Jeffreys prior to be placed on the component parameters. We demonstrate difficulties involved in specifying a prior for the standard Gaussian mixture model, and show how the new model can be used to overcome these. MCMC methods are given for efficient sampling from the posterior of this model.” C. Stoneking

Following on the theme of the Jeffreys’ post of two weeks ago, I spotted today a newly arXived paper about using improper priors for mixtures… and surviving it! It is entitled “Bayesian inference of Gaussian mixture models with noninformative priors” and written by Colin Stoneking at ETH Zürich. As mentioned in the previous post, one specificity of our 1990-1994 paper on mixtures with Jean Diebolt was to allow for improper priors by imposing at least two observations per component. The above abstract thus puzzled me until I found on page 3 that the paper was indeed related to ours (and Larry’s 2000 validation)! Actually, I should not complain about citations of my earlier works on mixtures as they cover seven different papers, but the bibliography is somewhat missing the paper we wrote with George Casella and Marty Wells in Statistical Methodology in 2004 (this was actually the very first paper of this new journal!), where we show that conjugate priors allow for the integration of the weights, resulting in a closed-form expression for the distribution of the partition vector. (This was also extended in the chapter “Exact Bayesian Analysis of Mixtures” I wrote with Kerrie Mengersen in our book Mixtures: Estimation and Applications.)

“There is no well-founded, general method to choose the parameters of a given prior to make it weakly informative for Gaussian mixtures.” C. Stoneking

The first part of the paper shows why looking for weakly informative priors is doomed to fail in this mixture setting: there is no stabilisation as hyperparameters get towards the border (between proper-ness and improper-ness), and on the contrary the frequency of appearances of empty components grows steadily to 100%… The second part gets to the reassessment of our 1990 exclusion trick, first considering that it is not producing a true posterior, then criticising Larry’s 2000 analysis as building a data-dependent “prior”, and at last proposing a reformulation where the exclusion of the empty components and those with one allocated observation becomes part of the “prior” (albeit a prior on the allocation vector). In fine, the posterior thus constructed remains the same as ours, with a message that if we start our model as the likelihood of the sample excluding empty or single-observation terms, we can produce a proper Bayesian analysis. (Except for a missing if minor renormalisation.) This leads me to wonder about the conclusion that inference about the (unknown) number of components in the mixture is impossible from this perspective. For instance, we could define fractional Bayes factors à la O’Hagan (1995) this way, i.e. starting from the restricted likelihood and taking a fraction of the likelihood to make the posterior proper, then using the remaining fraction to compute a Bayes factor. (Fractional Bayes factors do not work for the regular likelihood of a Gaussian mixture, irrespective of the sample size.)
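For reference, the fractional Bayes factor idea can be written as follows. The notations are introduced here for illustration: L̃ᵢ stands for the restricted likelihood of model i (the one excluding empty and single-observation components), πᵢ for the improper prior, and b∈(0,1) for the fraction. This is a sketch of the suggestion rather than a worked-out proposal:

$$
B^F_{12}(b) = \frac{q_1(b,x)}{q_2(b,x)},\qquad
q_i(b,x) = \frac{\int \pi_i(\theta_i)\,\tilde L_i(\theta_i\mid x)\,\mathrm{d}\theta_i}{\int \pi_i(\theta_i)\,\tilde L_i(\theta_i\mid x)^{b}\,\mathrm{d}\theta_i}
$$

The arbitrary normalising constant of each πᵢ cancels within qᵢ, provided both integrals are finite, which is the property the restricted likelihood would need to deliver.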