## X divergence for approximate inference

Posted in Statistics with tags , , , , , , , on March 14, 2017 by xi'an

Dieng et al. arXived this morning a new version of their paper on using the Χ divergence for variational inference. The Χ divergence essentially is the expectation of the squared ratio of the target distribution over the approximation, under the approximation. It is somewhat related to Expectation Propagation (EP), which aims at the Kullback-Leibler divergence between the target distribution and the approximation, under the target. And to variational Bayes, which is the same thing just the opposite way! The authors also point a link to our [adaptive] population Monte Carlo paper of 2008. (I wonder at a possible version through Wasserstein distance.)

Some of the arguments in favour of this new version of variational Bayes approximations is that (a) the support of the approximation over-estimates the posterior support; (b) it produces over-dispersed versions; (c) it relates to a well-defined and global objective function; (d) it allows for a sandwich inequality on the model evidence; (e) the function of the [approximation] parameter to be minimised is under the approximation, rather than under the target. The latest allows for a gradient-based optimisation. While one of the applications is on a Bayesian probit model applied to the Pima Indian women dataset [and will thus make James and Nicolas cringe!], the experimental assessment shows lower error rates for this and other benchmarks. Which in my opinion does not tell so much about the original Bayesian approach.

Posted in Statistics with tags , , , , , on August 30, 2016 by xi'an

Following Paul Russell’s talk at MCqMC 2016, I took a look at his recently arXived paper. In the plane to Sydney. The pseudo-code representation of the method is identical to our population Monte Carlo algorithm as is the suggestion to approximate the posterior by a mixture, but one novel aspect is to use Reich’s ensemble transportation at the resampling stage, in order to maximise the correlation between the original and the resampled versions of the particle systems. (As in our later versions of PMC, the authors also use as importance denominator the entire mixture rather than conditioning on the selected last-step particle.)

“The output of the resampling algorithm gives us a set of evenly weighted samples that we believe represents the target distribution well”

I disagree with this statement: Reweighting does not improve the quality of the posterior approximation, since it introduces more variability. If the original sample is found missing in its adequation to the target, so is the resampled one. Worse, by producing a sample with equal weights, this step may give a false impression of adequate representation…

Another unclear point in the pape relates to tuning the parameters of the mixture importance sampler. The paper discusses tuning these parameters during a burn-in stage, referring to “due to the constraints on adaptive MCMC algorithms”, which indeed is only pertinent for MCMC algorithms, since importance sampling can be constantly modified while remaining valid. This was a major point for advocating PMC. I am thus unsure what the authors mean by a burn-in period in such a context. Actually, I am also unsure on how they use effective sample size to select the new value of the importance parameter, e.g., the variance β in a random walk mixture: the effective sample size involves this variance implicitly through the realised sample hence changing β means changing the realised sample… This seems too costly to contemplate so I wonder at the way Figure 4.2 is produced.

“A popular approach for adaptive MCMC algorithms is to view the scaling parameter as a random variable which we can sample during the course of the MCMC iterations.”

While this is indeed an attractive notion [that I played with in the early days of adaptive MCMC, with the short-lived notion of cyber-parameters], I do not think it is of much help in optimising an MCMC algorithm, since the scaling parameter need be optimised, resulting into a time-inhomogeneous target. A more appropriate tool is thus stochastic optimisation à la Robbins-Monro, as exemplified in Andrieu and Moulines (2006). The paper however remains unclear as to how the scales are updated (see e.g. Section 4.2).

“Ideally, we would like to use a resampling algorithm which is not prohibitively costly for moderately or large sized ensembles, which preserves the mean of the samples, and which makes it much harder for the new samples to forget a significant region in the density.”

The paper also misses on the developments of the early 2000’s about more sophisticated resampling steps, especially Paul Fearnhead’s contributions (see also Nicolas Chopin’s thesis). There exist valid resampling methods that require a single uniform (0,1) to be drawn, rather than m. The proposed method has a flavour similar to systematic resampling, but I wonder at the validity of returning values that are averages of earlier simulations, since this modifies their distribution into ones with slimmer tails. (And it is parameterisation dependent.) Producing xi with probability pi is not the same as returning the average of the pixi‘s.

## adaptive resample move for estimating constants

Posted in Books, Statistics, University life with tags , , , , , on April 4, 2016 by xi'an

“…adaptive resample-move allows us to reduce the variance of the estimate of normalizing constants.”

A few days before our Estimating Constants workshop, Marco Fraccaroa, Ulrich Paquet, and Ole Winthera arXived a paper on estimating normalising constants by resample-move sequential Monte Carlo. The main result of this note is a theorem that derives the optimal relative size of each particle system, based on the approximate variance of the associated importance weights. Which is of major importance when running a sequential algorithm under computing time or budget constraints. In practice this theorem cannot be applied in a sequential manner [since it depends on future steps] and the authors propose instead an adaptive algorithm, enlarging the current set of particles if the effective sample size per particle is not large enough. There may however be a danger of an endless step if the proposal is particularly ill-fitted to the target. Under a fixed total budget, this potential explosion in the number of particles may jeopardize the entire process. Unless some safeguarding mechanism is introduced towards getting back in time to recover more variety in earlier steps. The paper compares the adaptive method with other solutions, including an Riemanian manifold HMC, on Gaussian processes and restricted Boltzman machines. Both examples being associated with very specialised ways of building the sequence of tempered targets, it seems.

Posted in Books, pictures, Statistics, University life with tags , , , , , , on July 30, 2015 by xi'an

Ingmar Schuster, who visited Paris-Dauphine last Spring (and is soon to return here as a postdoc funded by Fondation des Sciences Mathématiques de Paris) has arXived last week a paper on gradient importance sampling. In this paper, he builds a sequential importance sampling (or population Monte Carlo) algorithm that exploits the additional information contained in the gradient of the target. The proposal or importance function being essentially the MALA move as its proposal, mixed across the elements of the previous population. When compared with our original PMC mixture of random walk proposals found in e.g. this paper, each term in the mixture thus involves an extra gradient, with a scale factor that decreases to zero as 1/t√t. Ingmar compares his proposal with an adaptive Metropolis, an adaptive MALTa and an HM algorithms, for two mixture distributions and the banana target of Haario et al. (1999) we also used in our paper. As well as a logistic regression. In each case, he finds both a smaller squared error and a smaller bias for the same computing time (evaluated as the number of likelihood evaluations). While we discussed this scheme when he visited, I remain intrigued as to why it works so well when compared with the other solutions. One possible explanation is that the use of the gradient drift is more efficient on a population of particles than on a single Markov chain, provided the population covers all modes of importance on the target surface: the “fatal” attraction of the local model is then much less of an issue…

## Bayesian model averaging in astrophysics

Posted in Books, Statistics, University life with tags , , , , , , , , , , on July 29, 2015 by xi'an

[A 2013 post that somewhat got lost in a pile of postponed entries and referee’s reports…]

In this review paper, now published in Statistical Analysis and Data Mining 6, 3 (2013), David Parkinson and Andrew R. Liddle go over the (Bayesian) model selection and model averaging perspectives. Their argument in favour of model averaging is that model selection via Bayes factors may simply be too inconclusive to favour one model and only one model. While this is a correct perspective, this is about it for the theoretical background provided therein. The authors then move to the computational aspects and the first difficulty is their approximation (6) to the evidence

$P(D|M) = E \approx \frac{1}{n} \sum_{i=1}^n L(\theta_i)Pr(\theta_i)\, ,$

where they average the likelihood x prior terms over simulations from the posterior, which does not provide a valid (either unbiased or converging) approximation. They surprisingly fail to account for the huge statistical literature on evidence and Bayes factor approximation, incl. Chen, Shao and Ibrahim (2000). Which covers earlier developments like bridge sampling (Gelman and Meng, 1998).

As often the case in astrophysics, at least since 2007, the authors’ description of nested sampling drifts away from perceiving it as a regular Monte Carlo technique, with the same convergence speed n1/2 as other Monte Carlo techniques and the same dependence on dimension. It is certainly not the only simulation method where the produced “samples, as well as contributing to the evidence integral, can also be used as posterior samples.” The authors then move to “population Monte Carlo [which] is an adaptive form of importance sampling designed to give a good estimate of the evidence”, a particularly restrictive description of a generic adaptive importance sampling method (Cappé et al., 2004). The approximation of the evidence (9) based on PMC also seems invalid:

$E \approx \frac{1}{n} \sum_{i=1}^n \dfrac{L(\theta_i)}{q(\theta_i)}\, ,$

is missing the prior in the numerator. (The switch from θ in Section 3.1 to X in Section 3.4 is  confusing.) Further, the sentence “PMC gives an unbiased estimator of the evidence in a very small number of such iterations” is misleading in that PMC is unbiased at each iteration. Reversible jump is not described at all (the supposedly higher efficiency of this algorithm is far from guaranteed when facing a small number of models, which is the case here, since the moves between models are governed by a random walk and the acceptance probabilities can be quite low).

The second quite unrelated part of the paper covers published applications in astrophysics. Unrelated because the three different methods exposed in the first part are not compared on the same dataset. Model averaging is obviously based on a computational device that explores the posteriors of the different models under comparison (or, rather, averaging), however no recommendation is found in the paper as to efficiently implement the averaging or anything of the kind. In conclusion, I thus find this review somehow anticlimactic.

## optimal mixture weights in multiple importance sampling

Posted in Statistics, University life with tags , , , , , , on December 12, 2014 by xi'an

Multiple importance sampling is back!!! I am always interested in this improvement upon regular importance sampling, even or especially after publishing a recent paper about our AMIS (for adaptive multiple importance sampling) algorithm, so I was quite eager to see what was in Hera He’s and Art Owen’s newly arXived paper. The paper is definitely exciting and set me on a new set of importance sampling improvements and experiments…

Some of the most interesting developments in the paper are that, (i) when using a collection of importance functions qi with the same target p, every ratio qi/p is a control variate function with expectation 1 [assuming each of the qi‘s has a support smaller than the support of p]; (ii) the weights of a mixture of the qi‘s can be chosen in an optimal way towards minimising the variance for a certain integrand; (iii) multiple importance sampling incorporates quite naturally stratified sampling, i.e. the qi‘s may have disjoint supports; )iv) control variates contribute little, esp. when compared with the optimisation over the weights [which does not surprise me that much, given that the control variates have little correlation with the integrands]; (v) Veach’s (1997) seminal PhD thesis remains a driving force behind those results [and in getting Eric Veach an Academy Oscar in 2014!].

One extension that I would find of the uttermost interest deals with unscaled densities, both for p and the qi‘s. In that case, the weights do not even sum up to a know value and I wonder at how much more difficult it is to analyse this realistic case. And unscaled densities led me to imagine using geometric mixtures instead. Or even harmonic mixtures! (Maybe not.)

Another one is more in tune with our adaptive multiple mixture paper. The paper works with regret, but one could also work with remorse! Besides the pun, this means that one could adapt the weights along iterations and even possible design new importance functions from the past outcome, i.e., be adaptive once again. He and Owen suggest mixing their approach with our adaptive sequential Monte Carlo model.

## Another history of MCMC

Posted in Books, Statistics, University life with tags , , , , , on April 20, 2011 by xi'an

In the most recent issue of Statistical Science, the special topic is “Celebrating the EM Algorithm’s Quandunciacentennial“. It contains an historical survey by Martin Tanner and Wing Wong on the emergence of MCMC Bayesian computation in the 1980’s, This survey is more focused and more informative than our global history (also to appear in Statistical Science). In particular, it provides the authors’ analysis as to why MCMC was delayed by ten years or so (or even more when considering that a Gibbs sampler as a simulation tool appears in both Hastings’ (1970) and Besag‘s (1974) papers). They dismiss [our] concerns about computing power (I was running Monte Carlo simulations on my Apple IIe by 1986 and a single mean square error curve evaluation for a James-Stein type estimator would then take close to a weekend!) and Markov innumeracy, rather attributing the reluctance to a lack of confidence into the method. This perspective remains debatable as, apart from Tony O’Hagan who was then fighting again Monte Carlo methods as being un-Bayesian (1987, JRSS D),  I do not remember any negative attitude at the time about simulation and the immediate spread of the MCMC methods from Alan Gelfand’s and Adrian Smith’s presentations of their 1990 paper shows on the opposite that the Bayesian community was ready for the move.

Another interesting point made in this historical survey is that Metropolis’ and other Markov chain methods were first presented outside simulation sections of books like Hammersley and Handscomb (1964), Rubinstein (1981) and Ripley (1987), perpetuating the impression that such methods were mostly optimisation or niche specific methods. This is also why Besag’s earlier works (not mentioned in this survey) did not get wider recognition, until later. Something I was not aware is the appearance of iterative adaptive importance sampling (i.e. population Monte Carlo) in the Bayesian literature of the 1980’s, with proposals from Herman van Dijk, Adrian Smith, and others. The appendix about Smith et al. (1985), the 1987 special issue of JRSS D, and the computation contents of Valencia 3 (that I sadly missed for being in the Army!) is also quite informative about the perception of computational Bayesian statistics at this time.

A missing connection in this survey is Gilles Celeux and Jean Diebolt’s stochastic EM (or SEM). As early as 1981, with Michel Broniatowski, they proposed a simulated version of EM  for mixtures where the latent variable z was simulated from its conditional distribution rather than replaced with its expectation. So this was the first half of the Gibbs sampler for mixtures we completed with Jean Diebolt about ten years later. (Also found in Gelman and King, 1990.) These authors did not get much recognition from the community, though, as they focused almost exclusively on mixtures, used simulation to produce a randomness that would escape the local mode attraction, rather than targeting the posterior distribution, and did not analyse the Markovian nature of their algorithm until later with the simulated annealing EM algorithm.