## Bayesian inference with no likelihood

Posted in Books, Statistics, University life on January 28, 2020 by xi'an

This week I made a quick trip to Warwick for the defence (or viva) of the PhD thesis of Jack Jewson, containing novel perspectives on constructing Bayesian inference without likelihood or without complete trust in said likelihood. The thesis aimed at constructing minimum divergence posteriors in an M-open perspective and built a rather coherent framework from principles to implementation. There is a clear link with the earlier work of Bissiri et al. (2016), with further consistency constraints whereby the outcome must recover the true posterior in the M-closed scenario (even if this is not always the case with the procedures proposed in the thesis).

Although I am partial to the use of empirical likelihoods in this setting, I appreciated the position of the thesis and the discussion of the various divergences towards the posterior derivation (already discussed on this blog), with interesting perspectives on the calibration of the pseudo-posterior à la Bissiri et al. (2016). Among other things, the thesis pointed out a departure from the likelihood principle and some of its most established consequences, like Bayesian additivity. In that regard, there were connections with generative adversarial networks (GANs) and their Bayesian versions that could have been explored. And an impression that the type of Bayesian robustness explored in the thesis has more to do with outliers than with misspecification. Epsilon-contamination models are quite specific as it happens, in terms of tails and other things.
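As a reminder of the construction being calibrated, here is a sketch of the pseudo-posterior of Bissiri et al. (2016), where the loss ℓ and the weight w are generic symbols introduced here for illustration rather than the notation of the thesis:

$\pi_w(\theta\mid x) \propto \pi(\theta)\,\exp\{-w\,\ell(\theta,x)\}$

Taking ℓ(θ,x) = −log p(x|θ) and w = 1 returns the standard posterior, so calibration amounts to choosing w (and ℓ) when the likelihood is not fully trusted.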

The next chapter is somewhat “less” Bayesian in my view as it considers a generalised form of variational inference. I agree that the view of the posterior as a solution to an optimisation is tempting but changing the objective function makes the notion less precise. Which makes reading it somewhat delicate as it seems to dilute the meaning of both prior and posterior to the point of becoming irrelevant.
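For concreteness, the optimisation view mentioned above can be sketched as follows, under the assumption (not spelled out in the post) that the minimisation runs over all distributions q on Θ dominated by the prior π, with p(x|θ) denoting the likelihood. The standard posterior is then the minimiser of an expected loss plus a Kullback-Leibler penalty:

$\pi(\cdot\mid x) = \arg\min_{q} \left\{ \mathbb{E}_{q}\left[-\log p(x\mid\theta)\right] + \text{KL}(q\,\|\,\pi) \right\}$

A generalised version presumably swaps the log-loss, the divergence, or the set of admissible q for alternatives, which is where the meaning of prior and posterior starts to blur.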

The last chapter on change-point models is quite alluring in that it capitalises on the previous developments to analyse a fairly realistic if traditional problem, applied to traffic in London, prior and posterior to the congestion tax. However, there is always an issue with robustness and outliers in that the notion is somewhat vague or informal. Things start clarifying at the end but I find it surprising that conjugates are robust optimal solutions since the usual folk theorem from the 80’s is that they are not robust.

## dominating measure

Posted in Books, pictures, Statistics, Travel, University life on March 21, 2019 by xi'an

Yet another question on X validated reminded me of a discussion I had once with Jay Kadane when visiting Carnegie Mellon in Pittsburgh. Namely the fundamentally ill-posed nature of conjugate priors. Indeed, when considering the definition of a conjugate family as being a parameterised family Þ of distributions over the parameter space Θ stable under transform to the posterior distribution, this property is completely dependent (if there is such a notion as completely dependent!) on the dominating measure adopted on the parameter space Θ. Adopted is the word as there is no default, reference, natural, &tc. measure that promotes one specific measure on Θ as being the dominating measure. This is a well-known difficulty that also sticks out in most “objective Bayes” problems, as well as with maximum entropy priors. This means for instance that, while the Gamma distributions constitute a conjugate family for a Poisson likelihood, so do the truncated Gamma distributions. And so do the distributions whose density (against a Lebesgue measure over an arbitrary subset of (0,∞)) is the product of a Gamma density by an arbitrary function of θ. I readily acknowledge that the standard conjugate priors as introduced in every Bayesian textbook are standard because they facilitate (to a certain extent) posterior computations. But, just like there exist an infinity of MaxEnt priors associated with an infinity of dominating measures, there exist an infinity of conjugate families, once more associated with an infinity of dominating measures. And the fundamental reason is that the sampling model (which induces the shape of the conjugate family) does not provide a measure on the parameter space Θ.
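The Gamma versus truncated Gamma remark can be checked numerically. The following is a minimal sketch, assuming a midpoint grid as a stand-in for the dominating measure (Lebesgue measure restricted to an interval); all function names, the data, and the hyperparameters are hypothetical choices made for illustration.

```python
import math

def gamma_kernel(theta, shape, rate):
    """Unnormalised Gamma(shape, rate) density at theta > 0."""
    return theta ** (shape - 1.0) * math.exp(-rate * theta)

def poisson_likelihood(theta, data):
    """Likelihood of i.i.d. Poisson counts, up to the factorial constants."""
    return theta ** sum(data) * math.exp(-len(data) * theta)

def normalise(values):
    """Normalise over the grid, i.e. against the chosen dominating measure."""
    total = sum(values)
    return [v / total for v in values]

def make_grid(lower, upper, size=2000):
    """Midpoint grid on (lower, upper): a discrete stand-in for Lebesgue measure there."""
    step = (upper - lower) / size
    return [lower + (i + 0.5) * step for i in range(size)]

def max_gap(grid, shape, rate, data):
    """Largest gap between the posterior and the family member with updated hyperparameters."""
    posterior = normalise(
        [gamma_kernel(t, shape, rate) * poisson_likelihood(t, data) for t in grid]
    )
    updated = normalise(
        [gamma_kernel(t, shape + sum(data), rate + len(data)) for t in grid]
    )
    return max(abs(p - u) for p, u in zip(posterior, updated))

data = [3, 1, 4, 1, 5]                                     # hypothetical Poisson counts
gap_full = max_gap(make_grid(0.0, 30.0), 2.0, 1.0, data)   # stands in for (0, infinity)
gap_truncated = max_gap(make_grid(1.0, 2.5), 2.0, 1.0, data)  # arbitrary subset
```

Both gaps are zero up to rounding error: whatever the support imposed through the dominating measure, the update (shape, rate) → (shape + Σx, rate + n) stays within the family, which is the sense in which the truncated Gammas are just as conjugate as the Gammas.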

## I’m getting the point

Posted in Statistics on February 14, 2019 by xi'an

A long-winded X validated discussion on the [textbook] mean-variance conjugate posterior for the Normal model left me [mildly] depressed at the point and use of answering questions on this forum. Especially as it came at the same time as a catastrophic outcome for my mathematical statistics exam. Possibly an incentive to quit X validated as one quits smoking, although this is not the first attempt.

## inverse stable priors

Posted in Statistics on November 24, 2017 by xi'an

Dexter Cahoy and Joseph Sedransk just arXived a paper on so-called inverse stable priors. The starting point is the supposed deficiency of Gamma conjugate priors, which have explosive behaviour near zero. Albeit remaining proper. (This behaviour eventually vanishes for a large enough sample size.) The alternative involves a transform of alpha-stable random variables, with the consequence that the density of this alternative prior does not have a closed form. Neither does the posterior. When the likelihood can be written as exp(a.θ+b.log θ), modulo a reparameterisation, which covers a wide range of distributions, the posterior can be written in terms of the inverse stable density and of another (intractable) function called the generalized Mittag-Leffler function. (Which connects this post to an earlier post on Sofia Kovalevskaya.) For simulating this posterior, the authors suggest using an accept-reject algorithm based on the prior as proposal, which has the advantage of removing the intractable inverse stable density but the disadvantage of… simulating from the prior! (No mention is made of the acceptance rate.) I am thus reserved as to how appealing this new proposal is, despite “the inverse stable density (…) becoming increasingly popular in several areas of study”. And hence do not foresee a bright future for this class of prior…
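To make the acceptance-rate concern concrete, here is a minimal sketch of accept-reject with the prior as proposal for a likelihood of the form exp(a.θ+b.log θ) with a < 0 < b. It is not the authors' code: the prior sampler is a Gamma stand-in chosen purely so that the result can be checked (an inverse stable sampler would be plugged in instead), and all names and numerical values are hypothetical.

```python
import math
import random

def log_likelihood(theta, a, b):
    """Log-likelihood a*theta + b*log(theta), assuming a < 0 < b."""
    return a * theta + b * math.log(theta)

def accept_reject_from_prior(sample_prior, a, b, n_draws, rng):
    """Accept-reject with the prior as proposal.

    A prior draw theta is accepted with probability L(theta) / L(theta_hat),
    where theta_hat = -b / a maximises the likelihood. The prior density
    itself is never evaluated, only simulated from.
    """
    log_bound = log_likelihood(-b / a, a, b)
    accepted, proposed = [], 0
    while len(accepted) < n_draws:
        theta = sample_prior(rng)
        proposed += 1
        # 1 - random() lies in (0, 1], which keeps the logarithm finite
        log_u = math.log(1.0 - rng.random())
        if log_u < log_likelihood(theta, a, b) - log_bound:
            accepted.append(theta)
    return accepted, n_draws / proposed

def stand_in_prior(rng):
    """Assumption: Gamma(shape=2, scale=1) stand-in, NOT the inverse stable prior."""
    return rng.gammavariate(2.0, 1.0)

rng = random.Random(2017)
# Five Poisson counts summing to 12 correspond to a = -5 and b = 12.
draws, acceptance_rate = accept_reject_from_prior(stand_in_prior, -5.0, 12.0, 5000, rng)
posterior_mean = sum(draws) / len(draws)
```

With this stand-in prior the target is a Gamma distribution with shape 14 and rate 6, so the empirical mean can be compared with 14/6. The empirical acceptance rate estimates the ratio of the marginal likelihood to the maximised likelihood, which shrinks as the likelihood concentrates away from the bulk of the prior.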

## Particle Gibbs for conjugate mixture posteriors

Posted in Books, Statistics, University life on September 8, 2015 by xi'an

Alexandre Bouchard-Côté, Arnaud Doucet, and Andrew Roth have arXived a paper “Particle Gibbs Split-Merge Sampling for Bayesian Inference in Mixture Models” that proposes an efficient algorithm to explore the posterior distribution of a mixture, when interpreted as a clustering model. (To clarify the previous sentence, this is a regular plain vanilla mixture model for which they explore the latent variable representation.)

I very much like the paper because it relates to an earlier paper of mine with George Casella and Marty Wells, a paper we wrote right after a memorable JSM in Baltimore (during what may have been my last visit to Cornell University as George left for Florida the following year). The starting point of this approach to mixture estimation is that the (true) parameters of a mixture can be (exactly) integrated out when using conjugate priors and a completion step. Namely, the marginal posterior distribution of the latent variables given the data is available in closed form. The latent variables being the component allocations for the observations. The joint posterior is then a product of the prior on the parameters times the prior on the latents times a product of simple (e.g., Gaussian) terms. This equivalently means the marginal likelihoods conditional on the allocations are available in closed form. Looking directly at those marginal likelihoods, a prior distribution on the allocations can be introduced (e.g., the Pitman-Yor process or the finite Dirichlet prior) and, together, they define a closed form target. Albeit complex. As often on a finite state space. In our paper with George and Marty, we proposed using importance sampling to explore the set, using for instance marginal distributions for the allocations, which are uniform in the case of exchangeable priors, but this is not very efficient, as exhibited by our experiments where very few partitions would get most of the weight.
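The closed-form target on the allocations can be illustrated on a toy case. The sketch below assumes unit-variance Gaussian components with independent N(0, τ²) priors on the means and a symmetric Dirichlet(α) prior on the weights, all of which are illustrative choices rather than the setting of either paper, and it enumerates every allocation, which is only feasible for a handful of observations.

```python
import itertools
import math

def log_marginal_cluster(xs, tau2):
    """Log marginal likelihood of one cluster: x_i | mu ~ N(mu, 1), mu ~ N(0, tau2)."""
    n = len(xs)
    if n == 0:
        return 0.0  # an empty cluster contributes a factor of one
    s, ss = sum(xs), sum(x * x for x in xs)
    return (-0.5 * n * math.log(2.0 * math.pi)
            - 0.5 * math.log(1.0 + n * tau2)
            - 0.5 * (ss - tau2 * s * s / (1.0 + n * tau2)))

def log_prior_allocation(z, n_components, alpha):
    """Log prior of an allocation vector, the Dirichlet(alpha) weights being integrated out."""
    counts = [z.count(k) for k in range(n_components)]
    n = len(z)
    return (math.lgamma(n_components * alpha)
            - math.lgamma(n + n_components * alpha)
            + sum(math.lgamma(c + alpha) - math.lgamma(alpha) for c in counts))

def allocation_posterior(data, n_components=2, tau2=10.0, alpha=1.0):
    """Exact posterior over all allocation vectors, by brute-force enumeration."""
    log_target = {}
    for z in itertools.product(range(n_components), repeat=len(data)):
        log_lik = sum(
            log_marginal_cluster([x for x, zi in zip(data, z) if zi == k], tau2)
            for k in range(n_components)
        )
        log_target[z] = log_prior_allocation(z, n_components, alpha) + log_lik
    top = max(log_target.values())  # subtract the maximum for numerical stability
    weights = {z: math.exp(v - top) for z, v in log_target.items()}
    total = sum(weights.values())
    return {z: w / total for z, w in weights.items()}

data = [-2.1, -1.9, 2.0, 2.2]  # hypothetical observations
posterior = allocation_posterior(data)
best = max(posterior, key=posterior.get)
```

Even with four observations and two components there are sixteen allocation vectors, and the count grows as Kⁿ, which is why enumeration gives way to importance sampling or to the particle Gibbs moves discussed next.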

Even a Gibbs sampler on subsets of those indicators restricted to two components cannot be managed directly. The paper thus examines a specially designed particle Gibbs solution that implements a split and merge move on two clusters at a time. Merging or splitting the subset. With intermediate target distributions, SMC style. While this is quite an involved mechanism, which could be deemed excessive for the problem at hand, as well as inducing extra computing time, experiments in the paper demonstrate the generally large improvement in efficiency brought by this algorithm.

## the worst possible proof [X’ed]

Posted in Books, Kids, Statistics, University life on July 18, 2015 by xi'an

Another surreal experience thanks to X validated! A user of the forum recently asked for an explanation of the above proof in Lynch’s (2007) book, Introduction to Applied Bayesian Statistics and Estimation for Social Scientists. No wonder this user was puzzled: the explanation makes no sense outside the univariate case… It is hard to fathom why on Earth the author would resort to this convoluted approach to conclude about the posterior conditional distribution being a normal centred at the least squares estimate and with σ⁻²X’X as precision matrix. Presumably, he has a poor opinion of the degree of matrix algebra numeracy of his readers [and thus should abstain from establishing the result]. As it seems unrealistic to postulate that the author is himself confused about matrix algebra, given his MSc in Statistics [the footnote ² seen above after “appropriately” acknowledges that “technically we cannot divide by” the matrix, but it goes on to suggest multiplying the numerator by the matrix

$(X^\text{T}X)^{-1} (X^\text{T}X)$

which does not make sense either, unless one introduces the trace tr(.) operator, presumably out of reach for most readers]. And this part of the explanation is unnecessarily confusing in that a basic matrix manipulation leads to the result. Or even simpler, a reference to Pythagoras’ theorem.
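The basic matrix manipulation alluded to above can be sketched as follows, assuming the standard Gaussian linear model y = Xβ + ε with ε ~ N(0, σ²I), a flat prior on β, and an invertible XᵀX (assumptions and notation introduced here, not taken from the book). Writing the least squares estimate as

$\hat\beta = (X^\text{T}X)^{-1}X^\text{T}y$

the residual y − Xβ̂ is orthogonal to the columns of X, so that Pythagoras’ theorem gives

$\|y-X\beta\|^2 = \|y-X\hat\beta\|^2 + (\beta-\hat\beta)^\text{T}X^\text{T}X(\beta-\hat\beta)$

and therefore, as a function of β,

$\pi(\beta\mid\sigma^2,y) \propto \exp\left\{-(\beta-\hat\beta)^\text{T}X^\text{T}X(\beta-\hat\beta)/2\sigma^2\right\}$

which is a normal density centred at β̂ with precision matrix XᵀX/σ², with no division by a matrix required.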

## JSM 2014, Boston [#2]

Posted in Statistics, Travel, University life on August 7, 2014 by xi'an

Day #2 at JSM started quite early as I had to be on site by 7am for the CHANCE editors breakfast. No running then, except to Porter metro station. Interesting exchange full of new ideas to keep the journal cruising. In particular, a call for proposals on special issues on sexy topics (reproducible research anyone? I already have some book reviews.). And directions to increase the international scope and readership. And possibly adding or reporting on a data challenge. After this great start, I attended the Bayesian Time Series and Dynamic Models session, where David Scott Matteson from Cornell University presented an extension of the Toronto ambulance data analysis Dawn Woodard had exposed in Banff at an earlier workshop. The extension dealt with the spatio-temporal nature of the data, using a mixture model with time-dependent weights that revolved cyclically in an autoexponential manner. And rekindling the interest in the birth-and-death alternative to reversible jump. Plus another talk by Scott Holan mixing Bayesian analysis with frequency data, an issue that always puzzled me. The second session I attended was Multiscale Modeling for Complex Massive Data, with a modelling of brain connections through a non-parametric mixture by David Dunson. And a machine learning talk by Mauro Maggioni on a projection cum optimisation technique to fight the curse of dimension. Who proposed a solution to an optimal transport problem that is much more convincing than the one I discussed a while ago. Unfortunately, this made me miss the Biometrics showcase session, where Debashis Mondal presented a joint work with Julian Besag on Exact Goodness-of-Fit Tests for Markov Chains. And where both my friends Michael Newton and Peter Green were discussants… An idle question that came to me during this last talk was about the existence of particle filters for spatial Markov structures (rather than the usual ones on temporal Markov models).

After a [no] lunch break spent on pondering over a conjecture put to me by Natesh Pillai yesterday, I eventually joined the Feature Allocation session. Eventually as I basically had to run the entire perimeter of the conference centre! The three talks by Finale Doshi-Velez, Tamara Broderick, and Yuan Ji were all impressive and this may have been my best session so far at JSM! Thanks to Peter Müller for organising it! Tamara Broderick focussed on a generic way to build conjugate priors for non-parametric models, with all talks involving Indian buffets. Maybe a suggestion for tonight’s meal..! (In the end, great local food on Harvard Square.)