## Better together in Kolkata [slides]

Posted in Books, pictures, Statistics, Travel, University life with tags , , , , , , , , , , , , , , , on January 4, 2018 by xi'an

Here are the slides of the talk on modularisation I am giving today at the PC Mahalanobis 125 Conference in Kolkata, mostly borrowed from Pierre’s talk at O’Bayes 2018 last month:

[which made me realise Slideshare has discontinued the option to update one’s presentation, forcing users to create a new presentation for each update!] Incidentally, the amphitheatre at ISI is located right on top of a geological exhibit room with a reconstituted Barapasaurus tagorei so I will figuratively ride a dinosaur during my talk!

## the Hyvärinen score is back

Posted in pictures, Statistics, Travel with tags , , , , , , , , , , , , , on November 21, 2017 by xi'an

Stéphane Shao, Pierre Jacob and co-authors from Harvard have just posted on arXiv a new paper on Bayesian model comparison using the Hyvärinen score

$\mathcal{H}(y, p) = 2\Delta_y \log p(y) + ||\nabla_y \log p(y)||^2$

which thus uses the Laplacian as a natural and normalisation-free penalisation for the score test. (Score that I first met in Padova, a few weeks before moving from X to IX.) Which brings a decision-theoretic alternative to the Bayes factor and which delivers a coherent answer when using improper priors. Thus a very appealing proposal in my (biased) opinion! The paper is mostly computational in that it proposes SMC and SMC² solutions to handle the estimation of the Hyvärinen score for models with tractable likelihoods and tractable completed likelihoods, respectively. (Reminding me that Pierre worked on SMC² algorithms quite early during his Ph.D. thesis.)

A most interesting remark in the paper is to recall that the Hyvärinen score associated with a generic model on a series must be the prequential (predictive) version

$\mathcal{H}_T (M) = \sum_{t=1}^T \mathcal{H}(y_t; p_M(dy_t|y_{1:(t-1)}))$

rather than the version on the joint marginal density of the whole series. (Followed by a remark within the remark that the logarithm scoring rule does not make for this distinction. And I had to write down the cascading representation

$\log p(y_{1:T})=\sum_{t=1}^T \log p(y_t|y_{1:t-1})$

to convince myself that this unnatural decomposition, where the posterior on θ varies on each terms, is true!) For consistency reasons.

This prequential decomposition is however a plus in terms of computation when resorting to sequential Monte Carlo. Since each time step produces an evaluation of the associated marginal. In the case of state space models, another decomposition of the authors, based on measurement densities and partial conditional expectations of the latent states allows for another (SMC²) approximation. The paper also establishes that for non-nested models, the Hyvärinen score as a model selection tool asymptotically selects the closest model to the data generating process. For the divergence induced by the score. Even for state-space models, under some technical assumptions.  From this asymptotic perspective, the paper exhibits an example where the Bayes factor and the Hyvärinen factor disagree, even asymptotically in the number of observations, about which mis-specified model to select. And last but not least the authors propose and assess a discrete alternative relying on finite differences instead of derivatives. Which remains a proper scoring rule.

I am quite excited by this work (call me biased!) and I hope it can induce following works as a viable alternative to Bayes factors, if only for being more robust to the [unspecified] impact of the prior tails. As in the above picture where some realisations of the SMC² output and of the sequential decision process see the wrong model being almost acceptable for quite a long while…

## impressions from EcoSta2017 [guest post]

Posted in pictures, Statistics, Travel, University life with tags , , , , , , , , , on July 6, 2017 by xi'an

[This is a guest post on the recent EcoSta2017 (Econometrics and Statistics) conference in Hong Kong, contributed by Chris Drovandi from QUT, Brisbane.]

There were (at least) two sessions on Bayesian Computation at the recent EcoSta (Econometrics and Statistics) 2017 conference in Hong Kong. Below is my review of them. My overall impression of the conference is that there were lots of interesting talks, albeit a lot in financial time series, not my area. Even so I managed to pick up a few ideas/concepts that could be useful in my research. One criticism I had was that there were too many sessions in parallel, which made choosing quite difficult and some sessions very poorly attended. Another criticism of many participants I spoke to was that the location of the conference was relatively far from the city area.

In the first session (chaired by Robert Kohn), Minh-Ngoc Tran spoke about this paper on Bayesian estimation of high-dimensional Copula models with mixed discrete/continuous margins. Copula models with all continuous margins are relatively easy to deal with, but when the margins are discrete or mixed there are issues with computing the likelihood. The main idea of the paper is to re-write the intractable likelihood as an integral over a hypercube of ≤J dimensions (where J is the number of variables), which can then be estimated unbiasedly (with variance reduction by using randomised quasi-MC numbers). The paper develops advanced (correlated) pseudo-marginal and variational Bayes methods for inference.

In the following talk, Chris Carter spoke about different types of pseudo-marginal methods, particle marginal Metropolis-Hastings and particle Gibbs for state space models. Chris suggests that a combination of these methods into a single algorithm can further improve mixing. Continue reading

## SMC on a sequence of increasing dimension targets

Posted in Statistics with tags , , , , , , , , , on February 15, 2017 by xi'an

Richard Everitt and co-authors have arXived a preliminary version of a paper entitled Sequential Bayesian inference for mixture models and the coalescent using sequential Monte Carlo samplers with transformations. The central notion is an SMC version of the Carlin & Chib (1995) completion in the comparison of models in different dimensions. Namely to create auxiliary variables for each model in such a way that the dimension of the completed models are all the same. (Reversible jump MCMC à la Peter Green (1995) can also be interpreted this way, even though only relevant bits of the completion are used in the transitions.) I find the paper and the topic most interesting if only because it relates to earlier papers of us on population Monte Carlo. It also brought to my awareness the paper by Karagiannis and Andrieu (2013) on annealed reversible jump MCMC that I had missed at the time it appeared. The current paper exploits this annealed expansion in the devising of the moves. (Sequential Monte Carlo on a sequence of models with increasing dimension has been studied in the past.)

The way the SMC is described in the paper, namely, reweight-subsample-move, does not strike me as the most efficient as I would try to instead move-reweight-subsample, using a relevant move that incorporate the new model and hence enhance the chances of not rejecting.

One central application of the paper is mixture models with an unknown number of components. The SMC approach applied to this problem means creating a new component at each iteration t and moving the existing particles after adding the parameters of the new component. Since using the prior for this new part is unlikely to be at all efficient, a split move as in Richardson and Green (1997) can be considered, which brings back the dreaded Jacobian of RJMCMC into the picture! Here comes an interesting caveat of the method, namely that the split move forces a choice of the split component of the mixture. However, this does not appear as a strong difficulty, solved in the paper by auxiliary [index] variables, but possibly better solved by a mixture representation of the proposal, as in our PMC [population Monte Carlo] papers. Which also develop a family of SMC algorithms, incidentally. We found there that using a mixture representation of the proposal achieves a provable variance reduction.

“This puts a requirement on TSMC that the single transition it makes must be successful.”

As pointed by the authors, the transformation SMC they develop faces the drawback that a given model is only explored once in the algorithm, when moving to the next model. On principle, there would be nothing wrong in including regret steps, retracing earlier models in the light of the current one, since each step is an importance sampling step valid on its own right. But SMC also offers a natural albeit potentially high-varianced approximation to the marginal likelihood, which is quite appealing when comparing with an MCMC outcome. However, it would have been nice to see a comparison with alternative estimates of the marginal in the case of mixtures of distributions. I also wonder at the comparative performances of a dual approach that would be sequential in the number of observations as well, as in Chopin (2004) or our first population Monte Carlo paper (Cappé et al., 2005), since subsamples lead to tempered versions of the target and hence facilitate moves between models, being associated with flatter likelihoods.

## anytime algorithm

Posted in Books, Statistics with tags , , , , , , , , , on January 11, 2017 by xi'an

Lawrence Murray, Sumeet Singh, Pierre Jacob, and Anthony Lee (Warwick) recently arXived a paper on Anytime Monte Carlo. (The earlier post on this topic is no coincidence, as Lawrence had told me about this problem when he visited Paris last Spring. Including a forced extension when his passport got stolen.) The difficulty with anytime algorithms for MCMC is the lack of exchangeability of the MCMC sequence (except for formal settings where regeneration can be used).

When accounting for duration of computation between steps of an MCMC generation, the Markov chain turns into a Markov jump process, whose stationary distribution α is biased by the average delivery time. Unless it is constant. The authors manage this difficulty by interlocking the original chain with a secondary chain so that even- and odd-index chains are independent. The secondary chain is then discarded. This provides a way to run an anytime MCMC. The principle can be extended to K+1 chains, run one after the other, since only one of those chains need be discarded. It also applies to SMC and SMC². The appeal of anytime simulation in this particle setting is that resampling is no longer a bottleneck. Hence easily distributed among processors. One aspect I do not fully understand is how the computing budget is handled, since allocating the same real time to each iteration of SMC seems to envision each target in the sequence as requiring the same amount of time. (An interesting side remark made in this paper is the lack of exchangeability resulting from elaborate resampling mechanisms, lack I had not thought of before.)

## anytime!

Posted in Books, Mountains, pictures, Statistics, Travel with tags , , , , , on December 22, 2016 by xi'an

“An anytime algorithm is an algorithm that can be run continuously, generating progressively better solutions when afforded additional computation time. Traditional particle-based inference algorithms are not anytime in nature; all particles need to be propagated in lock-step to completion in order to compute expectations.”

Following a discussion with Lawrence Murray last week, I read Paige et al.  NIPS 2014 paper on their anytime sequential Monte Carlo algorithm. As explained above, an anytime algorithm is interruptible, meaning it can be stopped at any time without biasing the outcome of the algorithm. While MCMC algorithms can qualify as anytime (provided they are in stationary regime), it is not the case with sequential and particle Monte Carlo algorithms, which do not have an inbred growing mechanism preserving the target. In the case of Paige et al.’s proposal, the interruptible solution returns an unbiased estimator of the marginal likelihood at time n for any number of particles, even when this number is set or increased during the computation. The idea behind the solution is to create a particle cascade by going one particle at a time and creating children of this particle in proportion to the current average weight. An approach that can be run indefinitely. And since memory is not infinite, the authors explain how to cap the number of alive particles without putting the running distribution in jeopardy…

## rare events for ABC

Posted in Books, Mountains, pictures, Statistics, Travel, University life with tags , , , , , , , on November 24, 2016 by xi'an

Dennis Prangle, Richard G. Everitt and Theodore Kypraios just arXived a new paper on ABC, aiming at handling high dimensional data with latent variables, thanks to a cascading (or nested) approximation of the probability of a near coincidence between the observed data and the ABC simulated data. The approach amalgamates a rare event simulation method based on SMC, pseudo-marginal Metropolis-Hastings and of course ABC. The rare event is the near coincidence of the observed summary and of a simulated summary. This is so rare that regular ABC is forced to accept not so near coincidences. Especially as the dimension increases.  I mentioned nested above purposedly because I find that the rare event simulation method of Cérou et al. (2012) has a nested sampling flavour, in that each move of the particle system (in the sample space) is done according to a constrained MCMC move. Constraint derived from the distance between observed and simulated samples. Finding an efficient move of that kind may prove difficult or impossible. The authors opt for a slice sampler, proposed by Murray and Graham (2016), however they assume that the distribution of the latent variables is uniform over a unit hypercube, an assumption I do not fully understand. For the pseudo-marginal aspect, note that while the approach produces a better and faster evaluation of the likelihood, it remains an ABC likelihood and not the original likelihood. Because the estimate of the ABC likelihood is monotonic in the number of terms, a proposal can be terminated earlier without inducing a bias in the method.

This is certainly an innovative approach of clear interest and I hope we will discuss it at length at our BIRS ABC 15w5025 workshop next February. At this stage of light reading, I am slightly overwhelmed by the combination of so many computational techniques altogether towards a single algorithm. The authors argue there is very little calibration involved, but so many steps have to depend on as many configuration choices.