## controlled thermodynamic integral for Bayesian model comparison

**C**hris Oates, Theodore Papamarkou, and Mark Girolami (all from the University of Warwick) just arXived a paper on a new form of thermodynamic integration for computing marginal likelihoods. (I had actually discussed this paper with the authors on a few occasions when visiting Warwick.) The other name of thermodynamic integration is path sampling (Gelman and Meng, 1998). In the current paper, the path goes from the prior to the posterior by a sequence of intermediary distributions using a power of the likelihood. While the path sampling technique is quite efficient a method, the authors propose to improve it through the recourse to control variates, in order to decrease the variance. The control variate is taken from Mira et al. (2013), namely a one-dimensional temperature-dependent transform of the score function. (Strictly speaking, this is an asymptotic control variate in that the mean is only asymptotically zero.) This control variate is then incorporated within the expectation inside the path sampling integral. Its arbitrary elements are then calibrated against the variance of the path sampling integral. Except for the temperature ladder where the authors use a standard geometric rate, as the approach does not account for Monte Carlo and quadrature errors. (The degree of the polynomials used in the control variates is also arbitrarily set.) Interestingly, the paper mixes a lot of recent advances, from the zero variance notion of Mira et al. (2013) to the manifold Metropolis-adjusted Langevin algorithm of Girolami and Calderhead (2011), uses as a base method pMCMC (Jasra et al., 2007). The examples processed in the paper are regression (where the controlled version truly has a zero variance!) and logistic regression (with the benchmarked Pima Indian dataset), with a counter-example of a PDE interestingly proposed in the discussion section. I quite agree with the authors that the method is difficult to envision in complex enough models. I also did not see mentions therein of the extra time involved in using this control variate idea.

April 27, 2014 at 7:05 pm

Thank you Christian for your discussion of our work on the Og, and also for your helpful thoughts in the early days of this project! It might be interesting to speculate on some aspects of this procedure:

(i) Quadrature error is present in all estimates of evidence that are based on thermodynamic integration. It remains unknown how to exactly compute the optimal (variance minimising) temperature ladder “on-the-fly”; indeed this may be impossible, since the optimum is defined via a boundary value problem rather than an initial value problem. Other proposals for approximating this optimum are compatible with control variates (e.g. Grosse et al, NIPS 2013, Friel and Wyse, 2014). In empirical experiments we have found that the second order quadrature rule proposed by Friel and Wyse 2014 leads to substantially reduced bias, regardless of the specific choice of ladder.

(ii) Our experiments considered first and second degree polynomials as ZV control variates. In fact, intuition specifically motivates the use of second degree polynomials: Let us presume a linear expansion of the log-likelihood in \theta. Then the implied score function is constant, not depending on \theta. The quadratic ZV control variates are, in effect, obtained by multiplying the score function by \theta. Thus control variates can be chosen to perfectly correlate with the log-likelihood, leading to zero-variance estimators. Of course, there is an empirical question of whether higher-order polynomials are useful when this Taylor approximation is inappropriate, but they would require the estimation of many more coefficients and in practice may be less stable.

(iii) We require that the control variates are stored along the chain and that their sample covariance is computed after the MCMC has terminated. For the specific examples in the paper such additional computation is a negligible fraction of the total computational, so that we did not provide specific timings. When non-diffegeometric MCMC is used to obtain samples, or when the score is unavailable in closed-form and must be estimated, the computational cost of the procedure would necessarily increase.

For the wide class of statistical models with tractable likelihoods, employed in almost all areas of statistical application, the CTI we propose should provide state-of-the-art estimation performance with negligible increase in computational costs.