## controlled thermodynamic integral for Bayesian model comparison [reply]

Posted in Books, pictures, Running, Statistics, University life with tags , , , , , , , , , , , , on April 30, 2014 by xi'an

Chris Oates wrotes the following reply to my Icelandic comments on his paper with Theodore Papamarkou, and Mark Girolami, reply that is detailed enough to deserve a post on its own:

Thank you Christian for your discussion of our work on the Og, and also for your helpful thoughts in the early days of this project! It might be interesting to speculate on some aspects of this procedure:

(i) Quadrature error is present in all estimates of evidence that are based on thermodynamic integration. It remains unknown how to exactly compute the optimal (variance minimising) temperature ladder “on-the-fly”; indeed this may be impossible, since the optimum is defined via a boundary value problem rather than an initial value problem. Other proposals for approximating this optimum are compatible with control variates (e.g. Grosse et al, NIPS 2013, Friel and Wyse, 2014). In empirical experiments we have found that the second order quadrature rule proposed by Friel and Wyse 2014 leads to substantially reduced bias, regardless of the specific choice of ladder.

(ii) Our experiments considered first and second degree polynomials as ZV control variates. In fact, intuition specifically motivates the use of second degree polynomials: Let us presume a linear expansion of the log-likelihood in θ. Then the implied score function is constant, not depending on θ. The quadratic ZV control variates are, in effect, obtained by multiplying the score function by θ. Thus control variates can be chosen to perfectly correlate with the log-likelihood, leading to zero-variance estimators. Of course, there is an empirical question of whether higher-order polynomials are useful when this Taylor approximation is inappropriate, but they would require the estimation of many more coefficients and in practice may be less stable.

(iii) We require that the control variates are stored along the chain and that their sample covariance is computed after the MCMC has terminated. For the specific examples in the paper such additional computation is a negligible fraction of the total computational, so that we did not provide specific timings. When non-diffegeometric MCMC is used to obtain samples, or when the score is unavailable in closed-form and must be estimated, the computational cost of the procedure would necessarily increase.

For the wide class of statistical models with tractable likelihoods, employed in almost all areas of statistical application, the CTI we propose should provide state-of-the-art estimation performance with negligible increase in computational costs.

## controlled thermodynamic integral for Bayesian model comparison

Posted in Books, pictures, Running, Statistics, University life with tags , , , , , , , , , , on April 24, 2014 by xi'an

Chris Oates, Theodore Papamarkou, and Mark Girolami (all from the University of Warwick) just arXived a paper on a new form of thermodynamic integration for computing marginal likelihoods. (I had actually discussed this paper with the authors on a few occasions when visiting Warwick.) The other name of thermodynamic integration is path sampling (Gelman and Meng, 1998). In the current paper, the path goes from the prior to the posterior by a sequence of intermediary distributions using a power of the likelihood. While the path sampling technique is quite efficient a method, the authors propose to improve it through the recourse to control variates, in order to decrease the variance. The control variate is taken from Mira et al. (2013), namely a one-dimensional temperature-dependent transform of the score function. (Strictly speaking, this is an asymptotic control variate in that the mean is only asymptotically zero.) This control variate is then incorporated within the expectation inside the path sampling integral. Its arbitrary elements are then calibrated against the variance of the path sampling integral. Except for the temperature ladder where the authors use a standard geometric rate, as the approach does not account for Monte Carlo and quadrature errors. (The degree of the polynomials used in the control variates is also arbitrarily set.) Interestingly, the paper mixes a lot of recent advances, from the zero variance notion of Mira et al. (2013) to the manifold Metropolis-adjusted Langevin algorithm of Girolami and Calderhead (2011), uses as a base method pMCMC (Jasra et al., 2007). The examples processed in the paper are regression (where the controlled version truly has a zero variance!) and logistic regression (with the benchmarked Pima Indian dataset), with a counter-example of a PDE interestingly proposed in the discussion section. I quite agree with the authors that the method is difficult to envision in complex enough models. I also did not see mentions therein of the extra time involved in using this control variate idea.

## Vanilla Rao-Blackwellisation [re]revised

Posted in R, Statistics with tags , , , , , , , on June 1, 2010 by xi'an

Although the revision is quite minor, it took us two months to complete from the time I received the news in the Atlanta airport lounge… The vanilla Rao-Blackwellisation paper with Randal Douc has thus been resubmitted to the Annals of Statistics. And rearXived. The only significant change is the inclusion of two tables detailing computing time, like the one below

$\left| \begin{matrix} \tau &\text{median} &\text{mean }&q_{.8} &q_{.9} &\text{time}\\ 0.25 &0.0 &8.85 &4.9 &13 &4.2\\ 0.50 &0.0 &6.76 &4 &11 &2.25\\ 1.00 &0.25 &6.15 &4 &10 &2.5\\ 2.00 &0.20 &5.90 &3.5 &8.5 &4.5\\\end{matrix} \right|$

which provides different evaluations of the additional computing effort due to the use of the Rao–Blackwellisation: median and mean numbers of additional iterations, $80\%$ and $90\%$ quantiles for the additional iterations, and ratio of the average R computing times obtained over $10^5$ simulations. (Turning the above table into a formula acceptable by WordPress took me for ever, as any additional white space between the terms of the matrix is mis-interpreted!) Now, the mean time column does not look very supportive of the Rao-Blackwellisation technique, but this is due to the presence of a few outlying runs that required many iterations before hitting an acceptance probability of one. Excessive computing time can be curbed by using a pre-set number of iterations, as described in the paper…

## pimax(mcsm)

Posted in Books, R, Statistics with tags , , , , on May 13, 2010 by xi'an

The function pimax from our package mcsm is used in to reproduce Figure 5.11 of our book Introducing Monte Carlo Methods with R. (The name comes from using the Pima Indian R benchmark as the reference dataset.) I got this email from Josué

I ran the ‘pimax’ example from the mcsm manual, and it gave me the following message:

> pimax(Nsim = 10^3)
Error in raaj[t, ] = apply(as.matrix(aas), 1, margap) :
number of items to replace is not a multiple of replacement length
> pimax()
Error in raaj[t, ] = apply(as.matrix(aas), 1, margap) :
number of items to replace is not a multiple of replacement length

but when running pimax(10^2) on my machine I did get the following picture and no error message. So I wonder if this is a matter of version of R or something else…

## Vanilla Rao-Blackwellisation for revision

Posted in R, Statistics with tags , , , , , on March 18, 2010 by xi'an

The vanilla Rao-Blackwellisation paper with Randal Douc that had been resubmitted to the Annals of Statistics is now back for a revision, with quite encouraging comments:

The paper has been reviewed by two referees both of whom comment on the clear exposition and the novelty of the results. Both referees point to the empirical results as being suggestive of a more incremental improvement in practice rather than a major advance. However the approach the authors adopt is novel and I believe may motivate further developments in this area.

I cannot but agree on those comments! Since we are reducing the variance of the weights, the overall effect may be difficult to spot in practical applications. In the current version of the paper, we manage 20% reduction in the variance of those weights, but obviously this does not transfer to the same reduction of the variance of the overall estimator! Our vanilla Rao-Blackwellisation does not speed up the Markov chain.