Archive for arXiv

adaptive Metropolis-Hastings sampling using reversible dependent mixture proposals

Posted in Statistics on May 23, 2013 by xi'an

In the plane to Birmingham, I was reading this recent arXived paper by Minh-Ngoc Tran, Michael K. Pitt, and Robert Kohn. The adaptive structure of their ACMH algorithm is based upon two parallel Markov chains, the former (called the trial chain) feeding the proposal densities of the latter (called the main chain), bypassing the more traditional diminishing adaptation conditions. (Even though convergence actually follows from a minorisation condition.) These proposals are mixtures of t distributions fitted by variational Bayes approximations. Furthermore, the proposals are (a) reversible and (b) mixing local [dependent] and global [independent] components. One nice aspect of the reversibility is that the proposals do not have to be evaluated at each step.

The convergence results in the paper indeed assume a uniform minorisation condition on all proposal densities: although this sounded restrictive at first (but allows for straightforward proofs), I realised this could be implemented by adding a specific component to the mixture as in Corollary 3. (I checked the proof to realise that the minorisation on the proposal extends to the minorisation on the Metropolis-Hastings transition kernel.) A reversible kernel is defined as satisfying the detailed balance condition, which means that a single Gibbs step is reversible even though the Gibbs sampler as a whole is not. If a reversible Markov kernel with stationary distribution ζ is used, the acceptance probability in the Metropolis-Hastings transition is

α(x,z) = min{1, π(z)ζ(x) / [π(x)ζ(z)]}

(a result I thought was already known). The sweet deal is that the transition kernel involves Dirac masses, but the acceptance probability bypasses the difficulty. The way mixtures of t distributions can be made reversible follows from the Pitt & Walker (2006) construction, with ζ a specific mixture of t distributions. This target is estimated by variational Bayes. The paper further bypasses my classical objection to the use of normal or t distributions, or mixtures thereof: this modelling assumes a sort of common Euclidean space for all components, which is (a) highly restrictive and (b) very inefficient in terms of acceptance rate. Instead, Tran & al. resort to Metropolis-within-Gibbs by constructing a partition of the components into subgroups.
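The reason the proposal density drops out of the ratio is detailed balance: if ζ(x)q(x,z) = ζ(z)q(z,x), then q(z,x)/q(x,z) = ζ(x)/ζ(z). A minimal sketch of this acceptance step, under assumptions that are not from the paper: a toy N(1, 0.7²) target, ζ taken as N(0,1), and an autoregressive move z = ρx + √(1−ρ²)ε, which is reversible with respect to N(0,1). The names log_pi, log_zeta and rho are illustrative only.

```r
set.seed(7)

# Toy target pi: N(1, 0.7^2) (an assumption made for illustration)
log_pi <- function(x) dnorm(x, mean = 1, sd = 0.7, log = TRUE)
# Stationary distribution zeta of the reversible proposal kernel: N(0, 1)
log_zeta <- function(x) dnorm(x, log = TRUE)

rho <- 0.5          # autocorrelation of the reversible AR(1) move
niter <- 5 * 10^4
chain <- rep(0, niter)

for (t in 2:niter) {
  x <- chain[t - 1]
  # AR(1) move, reversible with respect to N(0, 1)
  z <- rho * x + sqrt(1 - rho^2) * rnorm(1)
  # log of pi(z) zeta(x) / [pi(x) zeta(z)]; the proposal density is never evaluated
  log_alpha <- log_pi(z) + log_zeta(x) - log_pi(x) - log_zeta(z)
  chain[t] <- if (log(runif(1)) < log_alpha) z else x
}
```

The empirical mean and standard deviation of chain should then be close to 1 and 0.7, even though only π and ζ are evaluated along the way.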

Initializing adaptive importance sampling with Markov chains

Posted in Statistics on May 6, 2013 by xi'an

Another paper recently arXived by Beaujean and Caldwell elaborated on our population Monte Carlo papers (Cappé et al., 2005, Douc et al., 2007, Wraith et al., 2010) to design a more thorough starting distribution. Interestingly, the authors mention the fact that PMC is an EM-type algorithm to emphasize the importance of the starting distribution, as with “poor proposal, PMC fails as proposal updates lead to a consecutively poorer approximation of the target” (p.2). I had not thought of this possible feature of PMC, which indeed proceeds along integrated EM steps, and thus could converge to a local optimum (if not poorer than the start as the Kullback-Leibler divergence decreases).

The solution proposed in this paper is similar to the one we developed in our AMIS paper. An important part of the simulation is dedicated to the construction of the starting distribution, which is a mixture deduced from multiple Metropolis-Hastings runs. I find the method spends an unnecessarily long time on refining this mixture by culling the number of components: off-the-shelf clustering techniques should be sufficient, esp. if one considers that the value of the target is available at every simulated point. This has been my pet (if idle) theory for a long while: we do not take (enough) advantage of this informative feature in our simulation methods… I also find the Student’s t versus Gaussian kernel debate (p.6) somewhat superfluous: as we showed in Douc et al., 2007, we can process Student’s t distributions so we can as well work with those. And rather worry about the homogeneity assumption this choice implies: working with any elliptically symmetric kernel assumes a local Euclidean structure on the parameter space, for all components, and does not model properly highly curved spaces. Another pet theory of mine. As for picking the necessary number of simulations at each PMC iteration, I would add to the ESS and the survival rate of the components a measure of the Kullback-Leibler divergence, as it should decrease at each iteration (with an infinite number of particles).
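As a rough illustration of the last suggestion, both the ESS and a Kullback-Leibler estimate can be computed from the importance weights of a single iteration. This is only a sketch, not the criterion used in the paper: the function is_diagnostics is a hypothetical helper, the N(0,1) target and N(0,2²) proposal are toy assumptions, and the divergence is the self-normalised estimate of KL(target‖proposal), Σ w̄ᵢ log wᵢ − log(mean of the wᵢ).

```r
# Diagnostics from one importance sampling iteration (hypothetical helper).
# logw: unnormalised log-weights, log target(x_i) - log proposal(x_i)
is_diagnostics <- function(logw) {
  w <- exp(logw - max(logw))          # stabilised unnormalised weights
  wbar <- w / sum(w)                  # self-normalised weights
  ess <- 1 / sum(wbar^2)              # effective sample size
  logZ <- max(logw) + log(mean(w))    # log of the normalising constant estimate
  kl <- sum(wbar * logw) - logZ       # estimate of KL(target || proposal)
  c(ess = ess, kl = kl)
}

set.seed(1)
n <- 10^5
x <- rnorm(n, sd = 2)                 # toy proposal N(0, 2^2)
logw <- dnorm(x, log = TRUE) - dnorm(x, sd = 2, log = TRUE)  # toy target N(0, 1)
diag_wide <- is_diagnostics(logw)
```

In this toy case the exact divergence is log(2) + 1/8 − 1/2 ≈ 0.318, which gives a reference point for the estimate. Tracking the estimate across PMC iterations would show whether the proposal updates actually move towards the target.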

Another interesting feature is in the comparison with Multinest, the current version of nested sampling, developed by Farhan Feroz. This is the second time I have read a paper involving nested sampling in the past two days. While this PMC implementation does better than nested sampling on the examples processed in the paper, the Multinest outcome remains relevant, particularly because it handles multi-modality fairly well. The authors seem to think parallelisation is an issue with nested sampling, while I do not see why: at the most naïve stage, several nested samplers can be run in parallel and the outcomes pooled together.

Bayesian brittleness

Posted in Statistics on May 3, 2013 by xi'an

Here is the abstract of a recently arXived paper that attracted my attention:

Although it is known that Bayesian estimators may be inconsistent if the model is misspecified, it is also a popular belief that a “good” or “close” enough model should have good convergence properties. This paper shows that, contrary to popular belief, there is no such thing as a “close enough” model in Bayesian inference in the following sense: we derive optimal lower and upper bounds on posterior values obtained from models that exactly capture an arbitrarily large number of finite-dimensional marginals of the data-generating distribution and/or that are arbitrarily close to the data-generating distribution in the Prokhorov or total variation metrics; these bounds show that such models may still make the largest possible prediction error after conditioning on an arbitrarily large number of sample data. Therefore, under model misspecification, and without stronger assumptions than (arbitrary) closeness in Prokhorov or total variation metrics, Bayesian inference offers no better guarantee of accuracy than arbitrarily picking a value between the essential infimum and supremum of the quantity of interest. In particular, an unscrupulous practitioner could slightly perturb a given prior and model to achieve any desired posterior conclusions.

The paper is both too long and too theoretical for me to get into it deeply enough. The main point however is that, given the space of all possible measures, the set of (parametric) Bayes inferences constitutes a tiny finite-dimensional subset that may lie far far away from the true model. I do not find the result unreasonable, far from it!, but the fact that Bayesian (and other) inferences may be inconsistent for most misspecified models is not such a major issue in my opinion. (Witness my post on the Robins-Wasserman paradox.) I am not so much convinced either about this “popular belief that a “good” or “close” enough model should have good convergence properties”, as it is intuitively reasonable that the immensity of the space of all models can induce non-convergent behaviours. The statistical question is rather what can be done about it. Does it matter that the model is misspecified? If it does, is there any meaning in estimating parameters without a model? For a finite sample size, should we at all bother that the model is not “right” or “close enough” if discrepancies cannot be detected at this precision level? I think the answer to all those questions is negative and that we should proceed with our imperfect models and imperfect inference as long as our imperfect simulation tools do not exhibit strong divergences.

austerity in MCMC land (#2)

Posted in R, Statistics on April 29, 2013 by xi'an

[Figure: MCMC run with median instead of mean]

After reading the arXiv paper by Korattikara, Chen and Welling, I wondered about the expression of the acceptance step of the Metropolis-Hastings algorithm as a mean of log-likelihoods over the sample. More specifically, the long sleepless nights at the hospital led me to ponder the rather silly question of the impact of replacing mean by median. I thus tried running a Metropolis-Hastings algorithm with the substitute and it (of course!) led to a nonsensical answer, as shown by the above graph. The true posterior is the one for a normal model and the histogram indicates a lack of convergence of the Markov chain to this posterior even though it does converge to some posterior. Here is the R code for this tiny experiment:

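#data generation
N=100
x=rnorm(N)

#MH steps
T=10^5
theta=rep(0,T)
#log-likelihood terms at the starting value theta[1]=0
curlike=dnorm(x,mean=theta[1],log=TRUE)
for (t in 2:T){
  #random walk proposal
  prop=theta[t-1]+.1*rnorm(1)
  proplike=dnorm(x,mean=prop,log=TRUE)
  u=runif(1)
  #log(u) minus the log prior ratio, with a N(0,10^2) prior on theta
  bound=log(u)-dnorm(prop,sd=10,log=TRUE)+
         dnorm(theta[t-1],sd=10,log=TRUE)
  #median substituted for the mean of the log-likelihood terms
  if (median(proplike)-median(curlike)>bound/N){
    theta[t]=prop;curlike=proplike
  } else { theta[t]=theta[t-1]}
}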
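For comparison, here is a sketch of the same experiment with the standard acceptance step, where the mean of the log-likelihood terms is kept, so that the condition is equivalent to the usual Metropolis-Hastings ratio. The seed, the shorter chain length and the names niter, post_mean and post_var are choices made for this illustration. With the N(0,10²) prior used above, the true posterior is normal with variance 1/(N+1/100) and mean sum(x)/(N+1/100), which provides a direct check.

```r
set.seed(42)

# data generation, as in the original experiment
N <- 100
x <- rnorm(N)

# MH steps with the mean of the log-likelihood terms (standard acceptance)
niter <- 2 * 10^4
theta <- rep(0, niter)
curlike <- dnorm(x, mean = theta[1], log = TRUE)
for (t in 2:niter) {
  prop <- theta[t - 1] + 0.1 * rnorm(1)
  proplike <- dnorm(x, mean = prop, log = TRUE)
  # log(u) minus the log prior ratio, N(0, 10^2) prior on theta
  bound <- log(runif(1)) - dnorm(prop, sd = 10, log = TRUE) +
    dnorm(theta[t - 1], sd = 10, log = TRUE)
  if (mean(proplike) - mean(curlike) > bound / N) {
    theta[t] <- prop
    curlike <- proplike
  } else {
    theta[t] <- theta[t - 1]
  }
}

# exact conjugate posterior for the normal mean, for comparison
post_var <- 1 / (N + 1 / 100)
post_mean <- post_var * sum(x)
```

Comparing hist(theta) with the N(post_mean, post_var) density should show agreement here, in contrast with the median version.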

re-re-relevant statistics for ABC model choice

Posted in Books, Statistics, University life on March 18, 2013 by xi'an

[Figure: mad statistics performances]

After a very, very long delay, we eventually re-revised our paper about necessary and sufficient conditions on summary statistics to be relevant for model choice (i.e. to lead to consistent tests). Reasons, both good and bad, abound for this delay! Some (rather bad) were driven by the completion of a certain new edition… Some (fairly good) are connected with the requests from the Series B editorial team, towards improving our methodological input. As a result we put more emphasis on the post-ABC cross-checking for the relevance of the summary choice, via a predictive posterior evaluation of the means of the summary statistic under both models and a test for mean equality. And re-ran a series of experiments on a three-population population genetics example. Plus, on the side, simplified some of our assumptions. I dearly hope the paper can make it through but am also looking forward to the opinion of the Series B editorial team. The next version of Relevant statistics for Bayesian model choice should be arXived by now (meaning when this post appears!).
