Archive for simulated tempering

non-reversible jump MCMC

Posted in Books, pictures, Statistics with tags , , , , , , , on June 29, 2020 by xi'an

Philippe Gagnon and et Arnaud Doucet have recently arXived a paper on a non-reversible version of reversible jump MCMC, the methodology introduced by Peter Green in 1995 to tackle Bayesian model choice/comparison/exploration. Whom Philippe presented at BayesComp20.

“The objective of this paper is to propose sampling schemes which do not suffer from such a diffusive behaviour by exploiting the lifting idea (…)”

The idea is related to lifting, creating non-reversible behaviour by adding a direction index (a spin) to the exploration of the models, assumed to be totally ordered, as with nested models (mixtures, changepoints, &tc.).  As with earlier versions of lifting, the chain proceeds along one (spin) direction until the proposal is rejected in which case the spin spins. The acceptance probability in the event of a change of model (upwards or downwards) is essentially the same as the reversible one (meaning it includes the dreaded Jacobian!). The original difficulty with reversible jump remains active with non-reversible jump in that the move from one model to the next must produce plausible values. The paper recalls two methods proposed by Christophe Andrieu and his co-authors. One consists in buffering a tempering sequence, but this proves costly.  Pursuing the interesting underlying theme that both reversible and non-reversible versions are noisy approximations of the marginal ratio, the other one consists in marginalising out the parameter to approximate the marginal probability of moving between nearby models. Combined with multiple choice to preserve stationarity and select more likely moves at the same time. Still requiring a multiplication of the number of simulations but parallelisable. The paper contains an exact comparison result that non-reversible jump leads to a smaller asymptotic variance than reversible jump, but it is unclear to me whether or not this accounts for the extra computing time resulting from the multiple paths in the proposed algorithms. (Even though the numerical illustration shows an improvement brought by the non-reversible side for the same computational budget.)

thermodynamic integration plus temperings

Posted in Statistics, Travel, University life with tags , , , , , , , , , , , , on July 30, 2019 by xi'an

Biljana Stojkova and David Campbel recently arXived a paper on the used of parallel simulated tempering for thermodynamic integration towards producing estimates of marginal likelihoods. Resulting into a rather unwieldy acronym of PT-STWNC for “Parallel Tempering – Simulated Tempering Without Normalizing Constants”. Remember that parallel tempering runs T chains in parallel for T different powers of the likelihood (from 0 to 1), potentially swapping chain values at each iteration. Simulated tempering monitors a single chain that explores both the parameter space and the temperature range. Requiring a prior on the temperature. Whose optimal if unrealistic choice was found by Geyer and Thomson (1995) to be proportional to the inverse (and unknown) normalising constant (albeit over a finite set of temperatures). Proposing the new temperature instead via a random walk, the Metropolis within Gibbs update of the temperature τ then involves normalising constants.

“This approach is explored as proof of concept and not in a general sense because the precision of the approximation depends on the quality of the interpolator which in turn will be impacted by smoothness and continuity of the manifold, properties which are difficult to characterize or guarantee given the multi-modal nature of the likelihoods.”

To bypass this issue, the authors pick for their (formal) prior on the temperature τ, a prior such that the profile posterior distribution on τ is constant, i.e. the joint distribution at τ and at the mode [of the conditional posterior distribution of the parameter] is constant. This choice makes for a closed form prior, provided this mode of the tempered posterior can de facto be computed for each value of τ. (However it is unclear to me why the exact mode would need to be used.) The resulting Metropolis ratio becomes independent of the normalising constants. The final version of the algorithm runs an extra exchange step on both this simulated tempering version and the untempered version, i.e., the original unnormalised posterior. For the marginal likelihood, thermodynamic integration is invoked, following Friel and Pettitt (2008), using simulated tempering samples of (θ,τ) pairs (associated instead with the above constant profile posterior) and simple Riemann integration of the expected log posterior. The paper stresses the gain due to a continuous temperature scale, as it “removes the need for optimal temperature discretization schedule.” The method is applied to the Glaxy (mixture) dataset in order to compare it with the earlier approach of Friel and Pettitt (2008), resulting in (a) a selection of the mixture with five components and (b) much more variability between the estimated marginal  likelihoods for different numbers of components than in the earlier approach (where the estimates hardly move with k). And (c) a trimodal distribution on the means [and unimodal on the variances]. This example is however hard to interpret, since there are many contradicting interpretations for the various numbers of components in the model. (I recall Radford Neal giving an impromptu talks at an ICMS workshop in Edinburgh in 2001 to warn us we should not use the dataset without a clear(er) understanding of the astrophysics behind. If I remember well he was excluded all low values for the number of components as being inappropriate…. I also remember taking two days off with Peter Green to go climbing Craigh Meagaidh, as the only authorised climbing place around during the foot-and-mouth epidemics.) In conclusion, after presumably too light a read (I did not referee the paper!), it remains unclear to me why the combination of the various tempering schemes is bringing a noticeable improvement over the existing. At a given computational cost. As the temperature distribution does not seem to favour spending time in the regions where the target is most quickly changing. As such the algorithm rather appears as a special form of exchange algorithm.