## more of the same!

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , on December 10, 2015 by xi'an

Daniel Seita, Haoyu Chen, and John Canny arXived last week a paper entitled “Fast parallel SAME Gibbs sampling on general discrete Bayesian networks“.  The distributions of the observables are defined by full conditional probability tables on the nodes of a graphical model. The distributions on the latent or missing nodes of the network are multinomial, with Dirichlet priors. To derive the MAP in such models, although this goal is not explicitly stated in the paper till the second page, the authors refer to the recent paper by Zhao et al. (2015), discussed on the ‘Og just as recently, which applies our SAME methodology. Since the paper is mostly computational (and submitted to ICLR 2016, which takes place juuust before AISTATS 2016), I do not have much to comment about it. Except to notice that the authors mention our paper as “Technical report, Statistics and Computing, 2002”. I am not sure the editor of Statistics and Computing will appreciate! The proper reference is in Statistics and Computing, 12:77-84, 2002.

“We argue that SAME is beneficial for Gibbs sampling because it helps to reduce excess variance.”

Still, I am a wee bit surprised at both the above statement and at the comparison with a JAGS implementation. Because SAME augments the number of latent vectors as the number of iterations increases, so should be slower by a mere curse of dimension,, slower than a regular Gibbs with a single latent vector. And because I do not get either the connection with JAGS: SAME could be programmed in JAGS, couldn’t it? If the authors means a regular Gibbs sampler with no latent vector augmentation, the comparison makes little sense as one algorithm aims at the MAP (with a modest five replicas), while the other encompasses the complete posterior distribution. But this sounds unlikely when considering that the larger the number m of replicas the better their alternative to JAGS. It would thus be interesting to understand what the authors mean by JAGS in this setup!

## two correlated pseudo-marginals for the price of one!

Posted in Books, Statistics, University life with tags , , , , , , , , , on November 30, 2015 by xi'an

Within two days, two papers on using correlated auxiliary random variables for pseudo-marginal Metropolis-Hastings, one by George Deligiannidis, Arnaud Doucet, Michael Pitt, and Robert Kohn, and another one by Johan Dahlin, Fredrik Lindsten, Joel Kronander, and Thomas Schön! They both make use of the Crank-Nicholson proposal on the auxiliary variables, which is a shrunk Gaussian random walk or an autoregressive model of sorts, and possibly transform these auxiliary variables by inverse cdf or something else.

“Experimentally, the efficiency of computations is increased relative to the standard pseudo-marginal algorithm by up to 180-fold in our simulations.” Deligiannidis et al.

The first paper by Deligiannidis et al. aims at reducing the variance of the Metropolis-Hastings acceptance ratio by correlating the auxiliary variables. While the auxiliary variable can be of any dimension, all that matters is its transform into a (univariate) estimate of the density, used as pseudo-marginal at each iteration. However, if a Markov kernel is used for proposing new auxiliary variables, the sequence of the pseudo-marginal is no longer a Markov chain. Which implies looking at the auxiliary variables. Nonetheless, they manage to derive a CLT on the log pseudo-marginal estimate, for a latent variable model with sample size growing to infinity. Based on this CLT, they control the correlation in the Crank-Nicholson proposal so that the asymptotic variance  is of order one:  this correlation has to converge to 1 as 1-exp(-χN/T), where N is the number of Monte Carlo samples for T time intervals. Those results extend to the bootstrap particle filter. Upon reflection, it makes much sense aiming for a high correlation since the unbiased estimator of the target hardly changes from one iteration to the next (but needs to move for the method to be validated by Metropolis-Hastings arguments). The two simulation experiments showed massive gains following this scheme, as reported in the above quote.

“[ABC] can be used to approximate the log-likelihood using importance sampling and particle filtering. However, in practice these estimates suffer from a large variance with results in bad mixing for the Markov chain.” Dahlin et al.

In the second paper, which came a day later, presumably induced by the first paper, acknowledged from the start, the authors also make use of the Crank-Nicholson proposal on the auxiliary variables, which is a shrunk Gaussian random walk, and possibly transform these auxiliary variables by inverse cdf or something else. The central part of the paper is about tuning the scale in the Crank-Nicholson proposal, in the spirit of Gelman, Gilks and Roberts (1996). Since all that matters is the (univariate) estimate of the density used as pseudo-marginal, The authors approximate the law of the log-density by a Gaussian distribution, despite the difficulty with the “projected” Markov chain, thus looking for the optimal scaling but only achieving a numerical optimisation rather than the equivalent of the golden number of MCMC acceptance probabilities, 0.234. Although in a sense the above should be the goal for the auxiliary variable acceptance rate, when those are of high enough dimension. One thing I could not find in this paper is how much (generic) improvement is gathered from this modification from an iid version. (Another is linked with the above quote, which I find difficult to understand as ABC is not primarily intended as a pseudo-marginal method. In a sense it is the worst possible pseudo-marginal estimator in that it uses estimators taking values in {0,1}…)

## Particle Gibbs for conjugate mixture posteriors

Posted in Books, Statistics, University life with tags , , , , , on September 8, 2015 by xi'an

Alexandre Bouchard-Coté, Arnaud Doucet, and Andrew Roth have arXived a paper “Particle Gibbs Split-Merge Sampling for Bayesian Inference in Mixture Models” that proposes an efficient algorithm to explore the posterior distribution of a mixture, when interpreted as a clustering model. (To clarify the previous sentence, this is a regular plain vanilla mixture model for which they explore the latent variable representation.)

I like very much the paper because it relates to an earlier paper of mine with George Casella and Marty Wells, paper we wrote right after a memorable JSM in Baltimore (during what may have been my last visit to Cornell University as George left for Florida the following year). The starting point of this approach to mixture estimation is that the (true) parameters of a mixture can be (exactly) integrated out when using conjugate priors and a completion step. Namely, the marginal posterior distribution of the latent variables given the data is available in closed form. The latent variables being the component allocations for the observations. The joint posterior is then a product of the prior on the parameters times the prior on the latents times a product of simple (e.g., Gaussian) terms. This equivalently means the marginal likelihoods conditional on the allocations are available in closed form. Looking directly at those marginal likelihoods, a prior distribution on the allocations can be introduced (e.g., the Pitman-Yor process or the finite Dirichlet prior) and, together, they define a closed form target. Albeit complex. As often on a finite state space. In our paper with George and Marty, we proposed using importance sampling to explore the set, using for instance marginal distributions for the allocations, which are uniform in the case of exchangeable priors, but this is not very efficient, as exhibited by our experiments where very few partitions would get most of the weight.

Even a Gibbs sampler on subsets of those indicators restricted to two components cannot be managed directly. The paper thus examines a specially designed particle Gibbs solution that implements a split and merge move on two clusters at a time. Merging or splitting the subset. With intermediate target distributions, SMC style. While this is quite an involved mechanism, that could be deemed as excessive for the problem at hand, as well as inducing extra computing time, experiments in the paper demonstrate the mostly big improvement in efficiency brought by this algorithm.

## JSM 2015 [day #4]

Posted in pictures, Running, Statistics, Travel, University life with tags , , , , , , , , , , , , , , , , , , on August 13, 2015 by xi'an

My first session today was Markov Chain Monte Carlo for Contemporary Statistical Applications with a heap of interesting directions in MCMC research! Now, without any possible bias (!), I would definitely nominate Murray Pollock (incidentally from Warwick) as the winner for best slides, funniest presentation, and most enjoyable accent! More seriously, the scalable Langevin algorithm he developed with Paul Fearnhead, Adam Johansen, and Gareth Roberts, is quite impressive in avoiding computing costly likelihoods. With of course caveats on which targets it applies to. Murali Haran showed a new proposal to handle high dimension random effect models by a projection trick that reduces the dimension. Natesh Pillai introduced us (or at least me!) to a spectral clustering that allowed for an automated partition of the target space, itself the starting point to his parallel MCMC algorithm. Quite exciting, even though I do not perceive partitions as an ideal solution to this problem. The final talk in the session was Galin Jones’ presentation of consistency results and conditions for multivariate quantities which is a surprisingly unexplored domain. MCMC is still alive and running!

The second MCMC session of the morning, Monte Carlo Methods Facing New Challenges in Statistics and Science, was equally diverse, with Lynn Kuo’s talk on the HAWK approach, where we discovered that harmonic mean estimators are still in use, e.g., in MrBayes software employed in phylogenetic inference. The proposal to replace this awful estimator that should never be seen again (!) was rather closely related to an earlier solution of us for marginal likelihood approximation, based there on a partition of the whole space rather than an HPD region in our case… Then, Michael Betancourt brilliantly acted as a proxy for Andrew to present the STAN language, with a flashy trailer he most recently designed. Featuring Andrew as the sole actor. And with great arguments for using it, including the potential to run expectation propagation (as a way of life). In fine, Faming Liang proposed a bootstrap subsampling version of the Metropolis-Hastings algorithm, where the likelihood acknowledging the resulting bias in the limiting distribution.

My first afternoon session was another entry on Statistical Phylogenetics, somewhat continued from yesterday’s session. Making me realised I had not seen a single talk on ABC for the entire meeting! The issues discussed in the session were linked with aligning sequences and comparing  many trees. Again in settings where likelihoods can be computed more or less explicitly. Without any expertise in the matter, I wondered at a construction that would turn all trees, like  into realizations of a continuous model. For instance by growing one branch at a time while removing the MRCA root… And maybe using a particle like method to grow trees. As an aside, Vladimir Minin told me yesterday night about genetic mutations that could switch on and off phenotypes repeatedly across generations… For instance  the ability to glow in the dark for species of deep sea fish.

When stating that I did not see a single talk about ABC, I omitted Steve Fienberg’s Fisher Lecture R.A. Fisher and the Statistical ABCs, keeping the morceau de choix for the end! Even though of course Steve did not mention the algorithm! A was for asymptotics, or ancilarity, B for Bayesian (or biducial??), C for causation (or cuffiency???)… Among other germs, I appreciated that Steve mentioned my great-grand father Darmois in connection with exponential families! And the connection with Jon Wellner’s LeCam Lecture from a few days ago. And reminding us that Savage was a Fisher lecturer himself. And that Fisher introduced fiducial distributions quite early. And for defending the Bayesian perspective. Steve also set some challenges like asymptotics for networks, Bayesian model assessment (I liked the notion of stepping out of the model), and randomization when experimenting with networks. And for big data issues. And for personalized medicine, building on his cancer treatment. No trace of the ABC algorithm, obviously, but a wonderful Fisher’s lecture, also most obviously!! Bravo, Steve, keep thriving!!!

## reflections on the probability space induced by moment conditions with implications for Bayesian Inference [refleXions]

Posted in Statistics, University life with tags , , , , , , , , , , on November 26, 2014 by xi'an

“The main finding is that if the moment functions have one of the properties of a pivotal, then the assertion of a distribution on moment functions coupled with a proper prior does permit Bayesian inference. Without the semi-pivotal condition, the assertion of a distribution for moment functions either partially or completely specifies the prior.” (p.1)

Ron Gallant will present this paper at the Conference in honour of Christian Gouréroux held next week at Dauphine and I have been asked to discuss it. What follows is a collection of notes I made while reading the paper , rather than a coherent discussion, to come later. Hopefully prior to the conference.

The difficulty I have with the approach presented therein stands as much with the presentation as with the contents. I find it difficult to grasp the assumptions behind the model(s) and the motivations for only considering a moment and its distribution. Does it all come down to linking fiducial distributions with Bayesian approaches? In which case I am as usual sceptical about the ability to impose an arbitrary distribution on an arbitrary transform of the pair (x,θ), where x denotes the data. Rather than a genuine prior x likelihood construct. But I bet this is mostly linked with my lack of understanding of the notion of structural models.

“We are concerned with situations where the structural model does not imply exogeneity of θ, or one prefers not to rely on an assumption of exogeneity, or one cannot construct a likelihood at all due to the complexity of the model, or one does not trust the numerical approximations needed to construct a likelihood.” (p.4)

As often with econometrics papers, this notion of structural model sets me astray: does this mean any latent variable model or an incompletely defined model, and if so why is it incompletely defined? From a frequentist perspective anything random is not a parameter. The term exogeneity also hints at this notion of the parameter being not truly a parameter, but including latent variables and maybe random effects. Reading further (p.7) drives me to understand the structural model as defined by a moment condition, in the sense that

$\mathbb{E}[m(\mathbf{x},\theta)]=0$

has a unique solution in θ under the true model. However the focus then seems to make a major switch as Gallant considers the distribution of a pivotal quantity like

$Z=\sqrt{n} W(\mathbf{x},\theta)^{-\frac{1}{2}} m(\mathbf{x},\theta)$

as induced by the joint distribution on (x,θ), hence conversely inducing constraints on this joint, as well as an associated conditional. Which is something I have trouble understanding, First, where does this assumed distribution on Z stem from? And, second, exchanging randomness of terms in a random variable as if it was a linear equation is a pretty sure way to produce paradoxes and measure theoretic difficulties.

The purely mathematical problem itself is puzzling: if one knows the distribution of the transform Z=Z(X,Λ), what does that imply on the joint distribution of (X,Λ)? It seems unlikely this will induce a single prior and/or a single likelihood… It is actually more probable that the distribution one arbitrarily selects on m(x,θ) is incompatible with a joint on (x,θ), isn’t it?

“The usual computational method is MCMC (Markov chain Monte Carlo) for which the best known reference in econometrics is Chernozhukov and Hong (2003).” (p.6)

While I never heard of this reference before, it looks like a 50 page survey and may be sufficient for an introduction to MCMC methods for econometricians. What I do not get though is the connection between this reference to MCMC and the overall discussion of constructing priors (or not) out of fiducial distributions. The author also suggests using MCMC to produce the MAP estimate but this always stroke me as inefficient (unless one uses our SAME algorithm of course).

“One can also compute the marginal likelihood from the chain (Newton and Raftery (1994)), which is used for Bayesian model comparison.” (p.22)

Not the best solution to rely on harmonic means for marginal likelihoods…. Definitely not. While the author actually uses the stabilised version (15) of Newton and Raftery (1994) estimator, which in retrospect looks much like a bridge sampling estimator of sorts, it remains dangerously close to the original [harmonic mean solution] especially for a vague prior. And it only works when the likelihood is available in closed form.

“The MCMC chains were comprised of 100,000 draws well past the point where transients died off.” (p.22)

I wonder if the second statement (with a very nice image of those dying transients!) is intended as a consequence of the first one or independently.

“A common situation that requires consideration of the notions that follow is that deriving the likelihood from a structural model is analytically intractable and one cannot verify that the numerical approximations one would have to make to circumvent the intractability are sufficiently accurate.” (p.7)

This then is a completely different business, namely that defining a joint distribution by mean of moment equations prevents regular Bayesian inference because the likelihood is not available. This is more exciting because (i) there are alternative available! From ABC to INLA (maybe) to EP to variational Bayes (maybe). And beyond. In particular, the moment equations are strongly and even insistently suggesting that empirical likelihood techniques could be well-suited to this setting. And (ii) it is no longer a mathematical worry: there exist a joint distribution on m(x,θ), induced by a (or many) joint distribution on (x,θ). So the question of finding whether or not it induces a single proper prior on θ becomes relevant. But, if I want to use ABC, being given the distribution of m(x,θ) seems to mean I can only generate new values of this transform while missing a natural distance between observations and pseudo-observations. Still, I entertain lingering doubts that this is the meaning of the study. Where does the joint distribution come from..?!

“Typically C is coarse in the sense that it does not contain all the Borel sets (…)  The probability space cannot be used for Bayesian inference”

My understanding of that part is that defining a joint on m(x,θ) is not always enough to deduce a (unique) posterior on θ, which is fine and correct, but rather anticlimactic. This sounds to be what Gallant calls a “partial specification of the prior” (p.9).

Overall, after this linear read, I remain very much puzzled by the statistical (or Bayesian) implications of the paper . The fact that the moment conditions are central to the approach would once again induce me to check the properties of an alternative approach like empirical likelihood.