Archive for MCMSki IV

better together?

Posted in Mountains, Statistics, University life, Books, pictures with tags , , , , , , , , on August 31, 2017 by xi'an

Yesterday came out on arXiv a joint paper by Pierre Jacob, Lawrence Murray, Chris Holmes and myself, Better together? Statistical learning in models made of modules, paper that was conceived during the MCMski meeting in Chamonix, 2014! Indeed it is mostly due to Martyn Plummer‘s talk at this meeting about the cut issue that we started to work on this topic at the fringes of the [standard] Bayesian world. Fringes because a standard Bayesian approach to the problem would always lead to use the entire dataset and the entire model to infer about a parameter of interest. [Disclaimer: the use of the very slogan of the anti-secessionists during the Scottish Independence Referendum of 2014 in our title is by no means a measure of support of their position!] Comments and suggested applications most welcomed!

The setting of the paper is inspired by realistic situations where a model is made of several modules, connected within a graphical model that represents the statistical dependencies, each relating to a specific data modality. In a standard Bayesian analysis, given data, a conventional statistical update then allows for coherent uncertainty quantification and information propagation through and across the modules. However, misspecification of or even massive uncertainty about any module in the graph can contaminate the estimate and update of parameters of other modules, often in unpredictable ways. Particularly so when certain modules are trusted more than others. Hence the appearance of cut models, where practitioners  prefer skipping the full model and limit the information propagation between these modules, for example by restricting propagation to only one direction along the edges of the graph. (Which is sometimes represented as a diode on the edge.) The paper investigates in which situations and under which formalism such modular approaches can outperform the full model approach in misspecified settings. By developing the appropriate decision-theoretic framework. Meaning we can choose between [several] modular and full-model approaches.

unbiased MCMC

Posted in Books, pictures, Statistics, Travel, University life with tags , , , , , , , on August 25, 2017 by xi'an

Two weeks ago, Pierre Jacob, John O’Leary, and Yves F. Atchadé arXived a paper on unbiased MCMC with coupling. Associating MCMC with unbiasedness is rather challenging since MCMC are rarely producing simulations from the exact target, unless specific tools like renewal can be produced in an efficient manner. (I supported the use of such renewal techniques as early as 1995, but later experiments led me to think renewal control was too rare an occurrence to consider it as a generic convergence assessment method.)

This new paper makes me think I had given up too easily! Here the central idea is coupling of two (MCMC) chains, associated with the debiasing formula used by Glynn and Rhee (2014) and already discussed here. Having the coupled chains meet at some time with probability one implies that the debiasing formula does not need a (random) stopping time. The coupling time is sufficient. Furthermore, several estimators can be derived from the same coupled Markov chain simulations, obtained by starting the averaging at a later time than the first iteration. The average of these (unbiased) averages results into a weighted estimate that weights more the later differences. Although coupling is also at the basis of perfect simulation methods, the analogy between this debiasing technique and perfect sampling is hard to fathom, since the coupling of two chains is not a perfect sampling instant. (Something obvious only in retrospect for me is that the variance of the resulting unbiased estimator is at best the variance of the original MCMC estimator.)

When discussing the implementation of coupling in Metropolis and Gibbs settings, the authors give a simple optimal coupling algorithm I was not aware of. Which is a form of accept-reject also found in perfect sampling I believe. (Renewal based on small sets makes an appearance on page 11.) I did not fully understood the way two random walk Metropolis steps are coupled, in that the normal proposals seem at odds with the boundedness constraints. But coupling is clearly working in this setting, while renewal does not. In toy examples like the (Efron and Morris!) baseball data and the (Gelfand and Smith!) pump failure data, the parameters k and m of the algorithm can be optimised against the variance of the averaged averages. And this approach comes highly useful in the case of the cut distribution,  a problem which I became aware of during MCMskiii and on which we are currently working with Pierre and others.

future of computational statistics

Posted in Books, pictures, R, Statistics, University life with tags , , , , , , , , , , , , , , on September 29, 2014 by xi'an

I am currently preparing a survey paper on the present state of computational statistics, reflecting on the massive evolution of the field since my early Monte Carlo simulations on an Apple //e, which would take a few days to return a curve of approximate expected squared error losses… It seems to me that MCMC is attracting more attention nowadays than in the past decade, both because of methodological advances linked with better theoretical tools, as for instance in the handling of stochastic processes, and because of new forays in accelerated computing via parallel and cloud computing, The breadth and quality of talks at MCMski IV is testimony to this. A second trend that is not unrelated to the first one is the development of new and the rehabilitation of older techniques to handle complex models by approximations, witness ABC, Expectation-Propagation, variational Bayes, &tc. With a corollary being an healthy questioning of the models themselves. As illustrated for instance in Chris Holmes’ talk last week. While those simplifications are inevitable when faced with hardly imaginable levels of complexity, I still remain confident about the “inevitability” of turning statistics into an “optimize+penalize” tunnel vision…  A third characteristic is the emergence of new languages and meta-languages intended to handle complexity both of problems and of solutions towards a wider audience of users. STAN obviously comes to mind. And JAGS. But it may be that another scale of language is now required…

If you have any suggestion of novel directions in computational statistics or instead of dead ends, I would be most interested in hearing them! So please do comment or send emails to my gmail address bayesianstatistics

likelihood-free inference via classification

Posted in Books, Mountains, pictures, Statistics, Travel, University life with tags , , , , , on August 5, 2014 by xi'an

IMG_2268Last week, Michael Gutmann, Ritabrata Dutta, Samuel Kaski, and Jukka Corander posted on arXiv the last version of the paper they had presented at MCMSki 4. As indicated by its (above) title, it suggests implementing ABC based on classification tools. Thus making it somewhat connected to our recent random forest paper.

The starting idea in the paper is that datasets generated from distributions with different parameters should be easier to classify than datasets generated from distributions with the same parameters. And that classification accuracy naturally induces a distance between datasets and between the parameters behind those datasets. We had followed some of the same track when starting using random forests, before realising that for our model choice setting, proceeding the entire ABC way once the random forest procedure had been constructed was counter-productive. Random forests are just too deadly as efficient model choice machines to try to compete with them through an ABC postprocessing. Performances are just… Not. As. Good!

A side question: I have obviously never thought about that before but why is the naïve Bayes classification rule so called?! It never sounded very Bayesian to me to (a) use the true value of the parameter and (b) average the classification performances. Interestingly, the authors (i) show identical performances of other classification methods (Fig. 2) and (ii) an exception for MA time series: when we first experimented random forests, raw data from an MA(2) model was tested to select between MA(1) and  MA(2) models, and the performances of the resulting random forest were quite poor.

Now, an opposition between our two approaches is that Michael and his coauthors also include point estimation within the range of classification-based ABC inference. As we stressed in our paper, we restrict the range to classification and model choice because we do not think those machine learning tools are stable and powerful enough to perform regression and posterior probability approximation. I also see a practical weakness in the estimation scheme proposed in this new paper. Namely that the Monte Carlo gets into the way of the consistency theorem. And possibly of the simulation method itself. Another remark is that, while the authors compare the fit produced by different classification methods, there should be a way to aggregate them towards higher efficiency. Returning once more to our random forest paper, we saw improved performances each time we included a reference method, from LDA to SVMs. It would be interesting to see a (summary) variable selection version of the proposed method. A final remark is that computing time and effort do not seem to get mentioned in the paper (unless Indian jetlag confuses me more than usual). I wonder how fast the computing effort grows with the sample size to reach parametric and quadratic convergence rates.

a pseudo-marginal perspective on the ABC algorithm

Posted in Mountains, pictures, Statistics, University life with tags , , , , , , , , on May 5, 2014 by xi'an


My friends Luke Bornn, Natesh Pillai and Dawn Woodard just arXived along with Aaron Smith a short note on the convergence properties of ABC. When compared with acceptance-rejection or regular MCMC. Unsurprisingly, ABC does worse in both cases. What is central to this note is that ABC can be (re)interpreted as a pseudo-marginal method where the data comparison step acts like an unbiased estimator of the true ABC target (not of the original ABC target, mind!). From there, it is mostly an application of Christophe Andrieu’s and Matti Vihola’s results in this setup. The authors also argue that using a single pseudo-data simulation per parameter value is the optimal strategy (as compared with using several), when considering asymptotic variance. This makes sense in terms of simulating in a larger dimensional space but what of the cost of producing those pseudo-datasets against the cost of producing a new parameter? There are a few (rare) cases where the datasets are much cheaper to produce.

MCqMC 2014 [closup]

Posted in pictures, Running, Statistics, Travel, University life, Wines with tags , , , , , , , on April 16, 2014 by xi'an

Leuven6As mentioned earlier, this was my very first MCqMC conference and I really enjoyed it, even though (or because) there were many topics that did not fall within my areas of interest. (By comparison, WSC is a serie of conferences too remote from those areas for my taste, as I realised in Berlin where we hardly attended any talk and hardly anyone attended my session!) Here I appreciated the exposure to different mathematical visions on Monte Carlo, without being swamped by applications as at WSC… Obviously, our own Bayesian computational community was much less represented than at, say, MCMSki! Nonetheless, I learned a lot during this conference for instance from Peter Glynn‘s fantastic talk, and I came back home with new problems and useful references [as well as a two-hour delay in the train ride from Brussels]. I also obviously enjoyed the college-town atmosphere of Leuven, the many historical landmarks  and the easily-found running routes out of the town. I am thus quite eager to attend the next MCqMC 2016 meeting (in Stanford, an added bonus!) and even vaguely toying with the idea of organising MCqMC 2018 in Monaco (depending on the return for ISBA 2016 and ISBA 2018). In any case, thanks to the scientific committee for the invitation to give a plenary lecture in Leuven and to the local committee for a perfect organisation of the meeting.

Pre-processing for approximate Bayesian computation in image analysis

Posted in R, Statistics, University life with tags , , , , , , , , , , , , , on March 21, 2014 by xi'an

ridge6With Matt Moores and Kerrie Mengersen, from QUT, we wrote this short paper just in time for the MCMSki IV Special Issue of Statistics & Computing. And arXived it, as well. The global idea is to cut down on the cost of running an ABC experiment by removing the simulation of a humongous state-space vector, as in Potts and hidden Potts model, and replacing it by an approximate simulation of the 1-d sufficient (summary) statistics. In that case, we used a division of the 1-d parameter interval to simulate the distribution of the sufficient statistic for each of those parameter values and to compute the expectation and variance of the sufficient statistic. Then the conditional distribution of the sufficient statistic is approximated by a Gaussian with these two parameters. And those Gaussian approximations substitute for the true distributions within an ABC-SMC algorithm à la Del Moral, Doucet and Jasra (2012).


Across 20 125 × 125 pixels simulated images, Matt’s algorithm took an average of 21 minutes per image for between 39 and 70 SMC iterations, while resorting to pseudo-data and deriving the genuine sufficient statistic took an average of 46.5 hours for 44 to 85 SMC iterations. On a realistic Landsat image, with a total of 978,380 pixels, the precomputation of the mapping function took 50 minutes, while the total CPU time on 16 parallel threads was 10 hours 38 minutes. By comparison, it took 97 hours for 10,000 MCMC iterations on this image, with a poor effective sample size of 390 values. Regular SMC-ABC algorithms cannot handle this scale: It takes 89 hours to perform a single SMC iteration! (Note that path sampling also operates in this framework, thanks to the same precomputation: in that case it took 2.5 hours for 10⁵ iterations, with an effective sample size of 10⁴…)

Since my student’s paper on Seaman et al (2012) got promptly rejected by TAS for quoting too extensively from my post, we decided to include me as an extra author and submitted the paper to this special issue as well.