Nested Sampling SMC [a reply]

Posted in Books, Statistics, University life with tags , , , , , , , , , on April 9, 2020 by xi'an
Here is a response from Robert Salomone following my comments of the earlier day (and pointing out I already commented the paper two years ago):
You may be interested to know that we are at the tail end of carrying out a major revision of the paper, which we hope will be done in the near future — there will be some new theory (we are in the final stages for a consistency proof of the ANS-SMC algorithm with new co-author Adam Johansen), as well as new numerics (including comparisons to Nested Sampling), and additional discussion that clarifies the overall narrative.
A few comments relating your post that may clear some things up:
• The method you describe with the auxiliary variable is actually one of three proposed algorithms. We call this one “Improved Nested Sampling” as it is the algorithm most similar to the original Nested Sampling. Two further extensions are the adaptive SMC sampler, and the fixed SMC sampler – the latter of which is provably consistent and unbiased for the model evidence (we also often see improvements over standard NS for similar computational effort when MCMC is used).
• Regarding computational effort – it is the same for Improved NS (in fact, you can obtain the standard Nested Sampling evidence estimate from the same computational run!). For the adaptive variant, the computational effort is roughly the same for ρ = e⁻¹. In the current version of the paper this is only discussed briefly (last page of p.23). However, in the revision we will include additional experiments comparing the practical performance.
• Regarding the question of “why not regular SMC”; we chose to focus more on why SMC is a good way to do Nested Sampling rather than why Nested Sampling is a good way to do SMC. Our main priority was to show there is a lot of opportunity to develop new nested sampling style algorithms by approaching it from a different angle. That said, Nested Sampling’s primary advantage over standard SMC seems to be in problems involving “phase transitions’’ such as our first example, for which temperature based methods are inherently ill-suited (and will often fail to detect so!).

thermodynamic integration plus temperings

Posted in Statistics, Travel, University life with tags , , , , , , , , , , , , on July 30, 2019 by xi'an

Biljana Stojkova and David Campbel recently arXived a paper on the used of parallel simulated tempering for thermodynamic integration towards producing estimates of marginal likelihoods. Resulting into a rather unwieldy acronym of PT-STWNC for “Parallel Tempering – Simulated Tempering Without Normalizing Constants”. Remember that parallel tempering runs T chains in parallel for T different powers of the likelihood (from 0 to 1), potentially swapping chain values at each iteration. Simulated tempering monitors a single chain that explores both the parameter space and the temperature range. Requiring a prior on the temperature. Whose optimal if unrealistic choice was found by Geyer and Thomson (1995) to be proportional to the inverse (and unknown) normalising constant (albeit over a finite set of temperatures). Proposing the new temperature instead via a random walk, the Metropolis within Gibbs update of the temperature τ then involves normalising constants.

“This approach is explored as proof of concept and not in a general sense because the precision of the approximation depends on the quality of the interpolator which in turn will be impacted by smoothness and continuity of the manifold, properties which are difficult to characterize or guarantee given the multi-modal nature of the likelihoods.”

To bypass this issue, the authors pick for their (formal) prior on the temperature τ, a prior such that the profile posterior distribution on τ is constant, i.e. the joint distribution at τ and at the mode [of the conditional posterior distribution of the parameter] is constant. This choice makes for a closed form prior, provided this mode of the tempered posterior can de facto be computed for each value of τ. (However it is unclear to me why the exact mode would need to be used.) The resulting Metropolis ratio becomes independent of the normalising constants. The final version of the algorithm runs an extra exchange step on both this simulated tempering version and the untempered version, i.e., the original unnormalised posterior. For the marginal likelihood, thermodynamic integration is invoked, following Friel and Pettitt (2008), using simulated tempering samples of (θ,τ) pairs (associated instead with the above constant profile posterior) and simple Riemann integration of the expected log posterior. The paper stresses the gain due to a continuous temperature scale, as it “removes the need for optimal temperature discretization schedule.” The method is applied to the Glaxy (mixture) dataset in order to compare it with the earlier approach of Friel and Pettitt (2008), resulting in (a) a selection of the mixture with five components and (b) much more variability between the estimated marginal  likelihoods for different numbers of components than in the earlier approach (where the estimates hardly move with k). And (c) a trimodal distribution on the means [and unimodal on the variances]. This example is however hard to interpret, since there are many contradicting interpretations for the various numbers of components in the model. (I recall Radford Neal giving an impromptu talks at an ICMS workshop in Edinburgh in 2001 to warn us we should not use the dataset without a clear(er) understanding of the astrophysics behind. If I remember well he was excluded all low values for the number of components as being inappropriate…. I also remember taking two days off with Peter Green to go climbing Craigh Meagaidh, as the only authorised climbing place around during the foot-and-mouth epidemics.) In conclusion, after presumably too light a read (I did not referee the paper!), it remains unclear to me why the combination of the various tempering schemes is bringing a noticeable improvement over the existing. At a given computational cost. As the temperature distribution does not seem to favour spending time in the regions where the target is most quickly changing. As such the algorithm rather appears as a special form of exchange algorithm.

QuanTA

Posted in Books, pictures, Running, Statistics, University life with tags , , , , , , , on September 17, 2018 by xi'an

My Warwick colleagues Nick Tawn [who also is my most frequent accomplice to running, climbing and currying in Warwick!] and Gareth Robert have just arXived a paper on QuanTA, a new parallel tempering algorithm that Nick designed during his thesis at Warwick, which he defended last semester. Parallel tempering targets in parallel several powered (or power-tempered) versions of the target distribution. With proposed switches between adjacent targets. An improved version transforms the local values before operating the switches. Ideally, the transform should be the composition of the cdf and inverse cdf, but this is impossible. Linearising the transform is feasible, but does not agree with multimodality, which calls for local transforms. Which themselves call for the identification of the different modes. In QuanTA, they are identified by N parallel runs of the standard, or rather N/2 to avoid dependence issues, and K-means estimates. The paper covers the construction of an optimal scaling of temperatures, in that the difference between the temperatures is scaled [with order 1/√d] so that the acceptance rate for swaps is 0.234. Which in turns induces a practical if costly calibration of the temperatures, especially when the size of the jump is depending on the current temperature. However, this cost issue is addressed in the paper, resorting to the acceptance rate as a proxy for effective sample size and the acceptance rate over run time to run the comparison with regular parallel tempering, leading to strong improvements in the mixture examples examined in the paper. The use of machine learning techniques like K-means or more involved solutions is a promising thread in this exciting area of tempering, where intuition about high temperatures can be actually misleading. Because using the wrong scale means missing the area of interest, which is not the mode!

optimisation via slice sampling

Posted in Statistics with tags , , , , on December 20, 2012 by xi'an

This morning, over breakfast; I read the paper recently arXived by John Birge et Nick Polson. I was intrigued by the combination of optimisation and of slice sampling, but got a wee disappointed by the paper, in that it proposes a simple form of simulated annealing, minimising f(x) by targeting a small collection of energy functions exp{-τf(x)}. Indeed, the slice sampler is used to explore each of those targets, i.e. for a fixed temperature τ. For the four functions considered in the paper, a slice sampler can indeed be implemented, but this feature could be seen as a marginalia, given that a random walk Metropolis-Hastings algorithm could be used as a proposal mechanism in other cases. The other intriguing fact is that annealing is not used in the traditional way of increasing the coefficient τ along iterations (as in our SAME algorithm), for which convergence issues are much more intricate, but instead stays at the level where a whole (Markov) sample is used for each temperature, the outcomes being then compared in terms of localisation (and maybe for starting at the next temperature value). I did not see any discussion about the selection of the successive temperatures, which usually is a delicate issue in realistic settings, nor of the stopping rule(s) used to decide the maximum has been reached.

threshold schedules for ABC-SMC

Posted in Statistics, University life with tags , , , , , , , , , on October 17, 2012 by xi'an

Daniel Silk, Sarah Filippi and Michael Stumpf just posted on arXiv a new paper on ABC-SMC. They attack therein a typical problem with SMC (and PMC, and even MCMC!) methods, namely the ability to miss (global) modes of the target due to a poor initial exploration. So, if a proposal is built on previous simulations and if  those simulations have missed an important mode, it is quite likely that this proposal will concentrate on other parts of the space and miss the important mode even more. This is also why simulated annealing and other stochastic optimisation algorithms are so fickle: a wrong parameterisation (like the temperature schedule) induces the sequence to converge to a local optimum rather than to the global optimum. Since sequential ABC is a form of simulated annealing, the decreasing tolerance (or threshold) playing the part of the temperature, it is no surprise that it suffers from this generic disease…

The method proposed in the paper aims at avoiding this difficulty by looking at sudden drops in the acceptance rate curve (as a function of the tolerance ε),

$\aleph_t(\epsilon)=\int p_t(x)\mathbb{I}(\Delta(x,x^\star)\le \epsilon)\,\text{d}x,$

suggesting for threshold the value of ε that maximises the second derivative of this curve. Now, before getting to (at?) the issue of turning this principle into a practical algorithm, let me linger at the motivation for it:

“To see this, one can imagine an ε-ball expanding about the true data; at first the ball only encompasses a small number of particles that were drawn from very close to the global maximum, corresponding to the low gradient at the foot of the shape. Once ε is large enough we are able to accept the relatively large number of particles sitting in the local maximum, which causes the increase in gradient.” (p.5)

Thus, the argument for looking at values of ε preceding fast increases in the acceptance rate ℵ is that we are thus avoiding flatter and lower regions of the posterior support corresponding to a local maximum. It clearly encompasses the case studied by the authors of a global highly concentrated global mode, surrounded by flatter lower modes, but it seems to me that this is not necessarily the only possible reason for changes in the shape of the acceptance probability ℵ. First, we are looking at an ABC acceptance rate, not at a genuine likelihood. Second, this acceptance rate function depends on the current (and necessarily rough) approximate to the genuine predictive, pt. Furthermore, when taking into account this rudimentary replacement of the true likelihood function, it is rather difficult to envision how it impacts the correspondence between a proximity in the data and a proximity in the parameter space. (The toy example is both illuminating and misleading, because it considers a setting where the data is a deterministic transform of the parameter.) I thus think the analysis should refer more strongly to the non-parametric literature and in particular to the k-nearest-neighbour approach recently reformulated by Biau et al.: there is no reason to push the tolerance ε all the way down to zero as this limit does not correspond to the optimal value of the tolerance. If one does not use a non-parametric determination of the “right” tolerance, the lingering question is when and why stopping the sequence of ABC simulations.

The acceptance rate function ℵ is approximated using an unscented transform. I understand neither the implementation of, nor the motivations for, this choice. Given that the function ℵ is a one-dimensional transform of the tolerance and that it actually corresponds to the cdf of the distance between the true data and the (current) pseudo-data, why couldn’t we use smooth versions of a cdf estimate based on the simulated distances, given that these distances are already computed..?  I must be missing something.