## density normalization for MCMC algorithms

Posted in Statistics, University life with tags , , , , , , , , on November 6, 2014 by xi'an

Another paper addressing the estimation of the normalising constant and the wealth of available solutions just came out on arXiv, with the full title of “Target density normalization for Markov chain Monte Carlo algorithms“, written by Allen Caldwell and Chang Liu. (I became aware of it by courtesy of Ewan Cameron, as it appeared in the physics section of arXiv. It is actually a wee bit annoying that papers in the subcategory “Data Analysis, Statistics and Probability” of physics do not get an automated reposting on the statistics lists…)

In this paper, the authors compare three approaches to the problem of finding

$\mathfrak{I} = \int_\Omega f(\lambda)\,\text{d}\lambda$

when the density f is unormalised, i.e., in more formal terms, when f is proportional to a probability density (and available):

1. an “arithmetic mean”, which is an importance sampler based on (a) reducing the integration volume to a neighbourhood ω of the global mode. This neighbourhood is chosen as an hypercube and the importance function turns out to be the uniform over this hypercube. The corresponding estimator is then a rescaled version of the average of f over uniform simulations in ω.
2.  an “harmonic mean”, of all choices!, with again an integration over the neighbourhood ω of the global mode in order to avoid the almost sure infinite variance of harmonic mean estimators.
3. a Laplace approximation, using the target at the mode and the Hessian at the mode as well.

The paper then goes to comparing those three solutions on a few examples, demonstrating how the diameter of the hypercube can be calibrated towards a minimum (estimated) uncertainty. The rather anticlimactic conclusion is that the arithmetic mean is the most reliable solution as harmonic means may fail in larger dimension and more importantly fail to signal its failure, while Laplace approximations only approximate well quasi-Gaussian densities…

What I find most interesting in this paper is the idea of using only one part of the integration space to compute the integral, even though it is not exactly new. Focussing on a specific region ω has pros and cons, the pros being that the reduction to a modal region reduces needs for absolute MCMC convergence and helps in selecting alternative proposals and also prevents from the worst consequences of using a dreaded harmonic mean, the cons being that the region needs be well-identified, which means requirements on the MCMC kernel, and that the estimate is a product of two estimates, the frequency being driven by a Binomial noise.  I also like very much the idea of calibrating the diameter Δof the hypercube ex-post by estimating the uncertainty.

As an aside, the paper mentions most of the alternative solutions I just presented in my Monte Carlo graduate course two days ago (like nested or bridge or Rao-Blackwellised sampling, including our proposal with Darren Wraith), but dismisses them as not “directly applicable in an MCMC setting”, i.e., without modifying this setting. I unsurprisingly dispute this labelling, both because something like the Laplace approximation requires extra-work on the MCMC output (and once done this work can lead to advanced Laplace methods like INLA) and because other methods could be considered as well (for instance, bridge sampling over several hypercubes). As shown in the recent paper by Mathieu Gerber and Nicolas Chopin (soon to be discussed at the RSS!), MCqMC has also become a feasible alternative that would compete well with the methods studied in this paper.

Overall, this is a paper that comes in a long list of papers on constant approximations. I do not find the Markov chain of MCMC aspect particularly compelling or specific, once the effective sample size is accounted for. It would be nice to find generic ways of optimising the visit to the hypercube ω and to estimate efficiently the weight of ω. The comparison is solely run over examples, but they all rely on a proper characterisation of the hypercube and the ability to simulate efficiently f over that hypercube.

## postdoc in Paris?

Posted in Kids, Statistics, Travel, University life with tags , , , , , , , on November 4, 2014 by xi'an

There is an open call of the Fondation Sciences Mathématiques de Paris (FSMP) about a postdoctoral funding program with 18 position-years available for staying in Université Paris-Dauphine (and other participating universities). The net support is quite decent  (wrt French terms and academic salaries) and the application form easy to fill. So, if you are interested in coming to Paris to work on ABC, MCMC, Bayesian model choice, &tc., feel free to contact me (or another Parisian statistician) and to apply! The deadline is December 01, 2014.  And the decision will be made by January 15, 2015. The starting date for the postdoc is October 01, 2015.

## projective covariate selection

Posted in Mountains, pictures, Statistics, Travel, University life with tags , , , , , , , , , , , , , , on October 28, 2014 by xi'an

While I was in Warwick, Dan Simpson [newly arrived from Norway on a postdoc position] mentioned to me he had attended a talk by Aki Vehtari in Norway where my early work with Jérôme Dupuis on projective priors was used. He gave me the link to this paper by Peltola, Havulinna, Salomaa and Vehtari that indeed refers to the idea that a prior on a given Euclidean space defines priors by projections on all subspaces, despite the zero measure of all those subspaces. (This notion first appeared in a joint paper with my friend Costas Goutis, who alas died in a diving accident a few months later.) The projection further allowed for a simple expression of the Kullback-Leibler deviance between the corresponding models and for a Pythagorean theorem on the additivity of the deviances between embedded models. The weakest spot of this approach of ours was, in my opinion and unsurprisingly, about deciding when a submodel was too far from the full model. The lack of explanatory power introduced therein had no absolute scale and later discussions led me to think that the bound should depend on the sample size to ensure consistency. (The recent paper by Nott and Leng that was expanding on this projection has now appeared in CSDA.)

“Specifically, the models with subsets of covariates are found by maximizing the similarity of their predictions to this reference as proposed by Dupuis and Robert [12]. Notably, this approach does not require specifying priors for the submodels and one can instead focus on building a good reference model. Dupuis and Robert (2003) suggest choosing the size of the covariate subset based on an acceptable loss of explanatory power compared to the reference model. We examine using cross-validation based estimates of predictive performance as an alternative.” T. Peltola et al.

The paper also connects with the Bayesian Lasso literature, concluding on the horseshoe prior being more informative than the Laplace prior. It applies the selection approach to identify biomarkers with predictive performances in a study of diabetic patients. The authors rank model according to their (log) predictive density at the observed data, using cross-validation to avoid exploiting the data twice. On the MCMC front, the paper implements the NUTS version of HMC with STAN.

## delayed acceptance [alternative]

Posted in Books, Kids, Statistics, University life with tags , , , , , , on October 22, 2014 by xi'an

In a comment on our Accelerating Metropolis-Hastings algorithms: Delayed acceptance with prefetching paper, Philip commented that he had experimented with an alternative splitting technique retaining the right stationary measure: the idea behind his alternative acceleration is again (a) to divide the target into bits and (b) run the acceptance step by parts, towards a major reduction in computing time. The difference with our approach is to represent the  overall acceptance probability

$\min_{k=0,..,d}\left\{\prod_{j=1}^k \rho_j(\eta,\theta),1\right\}$

and, even more surprisingly than in our case, this representation remains associated with the right (posterior) target!!! Provided the ordering of the terms is random with a symmetric distribution on the permutation. This property can be directly checked via the detailed balance condition.

In a toy example, I compared the acceptance rates (acrat) for our delayed solution (letabin.R), for this alternative (letamin.R), and for a non-delayed reference (letabaz.R), when considering more and more fractured decompositions of a Bernoulli likelihood.

> system.time(source("letabin.R"))
user system elapsed
225.918 0.444 227.200
> acrat
[1] 0.3195 0.2424 0.2154 0.1917 0.1305 0.0958
> system.time(source("letamin.R"))
user system elapsed
340.677 0.512 345.389
> acrat
[1] 0.4045 0.4138 0.4194 0.4003 0.3998 0.4145
> system.time(source("letabaz.R"))
user system elapsed
49.271 0.080 49.862
> acrat
[1] 0.6078 0.6068 0.6103 0.6086 0.6040 0.6158


A very interesting outcome since the acceptance rate does not change with the number of terms in the decomposition for the alternative delayed acceptance method… Even though it logically takes longer than our solution. However, the drawback is that detailed balance implies picking the order at random, hence loosing on the gain in computing the cheap terms first. If reversibility could be bypassed, then this alternative would definitely get very appealing!

## Combining Particle MCMC with Rao-Blackwellized Monte Carlo Data Association

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , on October 10, 2014 by xi'an

This recently arXived paper by Juho Kokkala and Simo Särkkä mixes a whole lot of interesting topics, from particle MCMC and Rao-Blackwellisation to particle filters, Kalman filters, and even bear population estimation. The starting setup is the state-space hidden process models where particle filters are of use. And where Andrieu, Doucet and Hollenstein (2010) introduced their particle MCMC algorithms. Rao-Blackwellisation steps have been proposed in this setup in the original paper, as well as in the ensuing discussion, like recycling rejected parameters and associated particles. The beginning of the paper is a review of the literature in this area, in particular of the Rao-Blackwellized Monte Carlo Data Association algorithm developed by Särkkä et al. (2007), of which I was not aware previously. (I alas have not followed closely enough the filtering literature in the past years.) Targets evolve independently according to Gaussian dynamics.

In the description of the model (Section 3), I feel there are prerequisites on the model I did not have (and did not check in Särkkä et al., 2007), like the meaning of targets and measurements: it seems the model assumes each measurement corresponds to a given target. More details or an example would have helped. The extension against the existing appears to be the (major) step of including unknown parameters. Due to my lack of expertise in the domain, I have no notion of the existence of similar proposals in the literature, but handling unknown parameters is definitely of direct relevance for the statistical analysis of such problems!

The simulation experiment based on an Ornstein-Uhlenbeck model is somewhat anticlimactic in that the posterior on the mean reversion rate is essentially the prior, conveniently centred at the true value, while the others remain quite wide. It may be that the experiment was too ambitious in selecting 30 simultaneous targets with only a total of 150 observations. Without highly informative priors, my beotian reaction is to doubt the feasibility of the inference. In the case of the Finnish bear study, the huge discrepancy between priors and posteriors, as well as the significant difference between the forestry expert estimations and the model predictions should be discussed, if not addressed, possibly via a simulation using the posteriors as priors. Or maybe using a hierarchical Bayes model to gather a time-wise coherence in the number of bear families. (I wonder if this technique would apply to the type of data gathered by Mohan Delampady on the West Ghats tigers…)

Overall, I am slightly intrigued by the practice of running MCMC chains in parallel and merging the outcomes with no further processing. This assumes a lot in terms of convergence and mixing on all the chains. However, convergence is never directly addressed in the paper.

## hypothesis testing for MCMC

Posted in Books, Statistics, University life with tags , , , , on October 6, 2014 by xi'an

A recent arXival by Benjamin Gyori and Daniel Paulin considers sequential testing based on MCMC simulation. The test is about an expectation under the target and stationary distribution of the Markov chain (i.e., the posterior in a Bayesian setting). Hence testing whether or not the posterior expectation is below a certain bound is not directly relevant from a Bayesian perspective. One would test instead whether or not the parameter itself is below the bound… The paper is then more a study of sequential tests when the data is a Markov chain than in any clear connection with MCMC topics. Despite the paper including an example of a Metropolis-Hastings scheme for approximating the posterior on the parameters of an ODE. I am a bit puzzled by the purpose of the test, as I was rather expecting tests connected with the convergence of the Markov chain or of the empirical mean. (But, given the current hour, I may also have missed a crucial point!)

## future of computational statistics

Posted in Books, pictures, R, Statistics, University life with tags , , , , , , , , , , , , , , on September 29, 2014 by xi'an

I am currently preparing a survey paper on the present state of computational statistics, reflecting on the massive evolution of the field since my early Monte Carlo simulations on an Apple //e, which would take a few days to return a curve of approximate expected squared error losses… It seems to me that MCMC is attracting more attention nowadays than in the past decade, both because of methodological advances linked with better theoretical tools, as for instance in the handling of stochastic processes, and because of new forays in accelerated computing via parallel and cloud computing, The breadth and quality of talks at MCMski IV is testimony to this. A second trend that is not unrelated to the first one is the development of new and the rehabilitation of older techniques to handle complex models by approximations, witness ABC, Expectation-Propagation, variational Bayes, &tc. With a corollary being an healthy questioning of the models themselves. As illustrated for instance in Chris Holmes’ talk last week. While those simplifications are inevitable when faced with hardly imaginable levels of complexity, I still remain confident about the “inevitability” of turning statistics into an “optimize+penalize” tunnel vision…  A third characteristic is the emergence of new languages and meta-languages intended to handle complexity both of problems and of solutions towards a wider audience of users. STAN obviously comes to mind. And JAGS. But it may be that another scale of language is now required…

If you have any suggestion of novel directions in computational statistics or instead of dead ends, I would be most interested in hearing them! So please do comment or send emails to my gmail address bayesianstatistics