Ingmar Schuster, who visited Paris-Dauphine last Spring (and is soon to return here as a postdoc funded by Fondation des Sciences Mathématiques de Paris) has arXived last week a paper on gradient importance sampling. In this paper, he builds a sequential importance sampling (or population Monte Carlo) algorithm that exploits the additional information contained in the gradient of the target. The proposal or importance function being essentially the MALA move as its proposal, mixed across the elements of the previous population. When compared with our original PMC mixture of random walk proposals found in e.g. this paper, each term in the mixture thus involves an extra gradient, with a scale factor that decreases to zero as 1/t√t. Ingmar compares his proposal with an adaptive Metropolis, an adaptive MALTa and an HM algorithms, for two mixture distributions and the banana target of Haario et al. (1999) we also used in our paper. As well as a logistic regression. In each case, he finds both a smaller squared error and a smaller bias for the same computing time (evaluated as the number of likelihood evaluations). While we discussed this scheme when he visited, I remain intrigued as to why it works so well when compared with the other solutions. One possible explanation is that the use of the gradient drift is more efficient on a population of particles than on a single Markov chain, provided the population covers all modes of importance on the target surface: the “fatal” attraction of the local model is then much less of an issue…
Archive for sequential Monte Carlo
The paper “Bayesian Parameter Inference and Model Selection by Population Annealing in System Biology” by Yohei Murakami got published in PLoS One last August but I only became aware of it when ResearchGate pointed it out to me [by mentioning one of our ABC papers was quoted there].
“We are recommended to try a number of annealing schedules to check the influence of the schedules on the simulated data (…) As a whole, the simulations with the posterior parameter ensemble could, not only reproduce the data used for parameter inference, but also capture and predict the data which was not used for parameter inference.”
Population annealing is a notion introduced by Y Iba, the very same IBA who introduced the notion of population Monte Carlo that we studied in subsequent papers. It reproduces the setting found in many particle filter papers of a sequence of (annealed or rather tempered) targets ranging from an easy (i.e., almost flat) target to the genuine target, and of an update of a particle set by MCMC moves and reweighing. I actually have trouble perceiving the difference with other sequential Monte Carlo schemes as those exposed in Del Moral, Doucet and Jasra (2006, Series B). And the same is true of the ABC extension covered in this paper. (Where the annealed intermediate targets correspond to larger tolerances.) This sounds like a traditional ABC-SMC algorithm. Without the adaptive scheme on the tolerance ε found e.g. in Del Moral et al., since the sequence is set in advance. [However, the discussion about the implementation includes the above quote that suggests a vague form of cross-validated tolerance construction]. The approximation of the marginal likelihood also sounds standard, the marginal being approximated by the proportion of accepted pseudo-samples. Or more exactly by the sum of the SMC weights at the end of the annealing simulation. This actually raises several questions: (a) this estimator is always between 0 and 1, while the marginal likelihood is not restricted [but this is due to a missing 1/ε in the likelihood estimate that cancels from both numerator and denominator]; (b) seeing the kernel as a non-parametric estimate of the likelihood led me to wonder why different ε could not be used in different models, in that the pseudo-data used for each model under comparison differs. If we were in a genuine non-parametric setting the bandwidth would be derived from the pseudo-data.
“Thus, Bayesian model selection by population annealing is valid.”
The discussion about the use of ABC population annealing somewhat misses the point of using ABC, which is to approximate the genuine posterior distribution, to wit the above quote: that the ABC Bayes factors favour the correct model in the simulation does not tell anything about the degree of approximation wrt the original Bayes factor. [The issue of non-consistent Bayes factors does not apply here as there is no summary statistic applied to the few observations in the data.] Further, the magnitude of the variability of the values of this Bayes factor as ε varies, from 1.3 to 9.6, mostly indicates that the numerical value is difficult to trust. (I also fail to explain the huge jump in Monte Carlo variability from 0.09 to 1.17 in Table 1.) That this form of ABC-SMC improves upon the basic ABC rejection approach is clear. However it needs to build some self-control to avoid arbitrary calibration steps and reduce the instability of the final estimates.
“The weighting function is set to be large value when the observed data and the simulated data are ‘‘close’’, small value when they are ‘‘distant’’, and constant when they are ‘‘equal’’.”
The above quote is somewhat surprising as the estimated likelihood f(xobs|xobs,θ) is naturally constant when xobs=xsim… I also failed to understand how the model intervened in the indicator function used as a default ABC kernel
On December 10, I will alas not travel to London to attend the Read Paper on sequential quasi-Monte Carlo presented by Mathieu Gerber and Nicolas Chopin to The Society, as I fly instead to Montréal for the NIPS workshops… I am quite sorry to miss this event, as this is a major paper which brings quasi-Monte Carlo methods into mainstream statistics. I will most certainly write a discussion and remind Og’s readers that contributed (800 words) discussions are welcome from everyone, the deadline for submission being January 02.
This newly arXived paper by S. Golchi and D. Campbell from Vancouver (hence the above picture) considers the (quite) interesting problem of simulating from a target distribution defined by a constraint. This is a question that have bothered me for a long while as I could not come up with a satisfactory solution all those years… Namely, when considering a hard constraint on a density, how can we find a sequence of targets that end up with the restricted density? This is of course connected with the zero measure case posted a few months ago. For instance, how do we efficiently simulate a sample from a Student’s t distribution with a fixed sample mean and a fixed sample variance?
“The key component of SMC is the filtering sequence of distributions through which the particles evolve towards the target distribution.” (p.3)
This is indeed the main issue! The paper considers using a sequence of intermediate targets hardening progressively the constraint(s), along with an SMC sampler, but this recommendation remains rather vague and hence I am at loss as to how to make it work when the exact constraint implies a change of measure. The first example is monotone regression where y has mean f(x) and f is monotone. (Everything is unidimensional here.) The sequence is then defined by adding a multiplicative term that is a function of ∂f/∂x, for instance
with τ growing to infinity to make the constraint moving from soft to hard. An interesting introduction, even though the hard constraint does not imply a change of parameter space or of measure. The second example is about estimating the parameters of an ODE, with the constraint being the ODE being satisfied exactly. Again, not exactly what I was looking for. But with an exotic application to deaths from the 1666 Black (Death) plague.
And then the third example is about ABC and the choice of summary statistics! The sequence of constraints is designed to keep observed and simulated summary statistics close enough when the dimension of those summaries increases, which means they are considered simultaneously rather than jointly. (In the sense of Ratmann et al., 2009. That is, with a multidimensional distance.) The model used for the application of the SMC is the dynamic model of Wood (2010, Nature). The outcome of this specific implementation is not that clear compared with alternatives… And again sadly does not deal with the/my zero measure issue.
- Dan Simpson replied to my comments of last Tuesday about his PC construction;
- Arnaud Doucet precised some issues about his adaptive subsampling paper;
- Amandine Schreck clarified why I had missed some points in her Bayesian variable selection paper;
- Randal Douc defended the efficiency of using Carlin and Chib (1995) method for mixture simulation.
Thanks to them for taking the time to answer my musings…
“At equilibrium, we thus should not expect gains of several orders of magnitude.”
As was signaled to me several times during the MCqMC conference in Leuven, Rémi Bardenet, Arnaud Doucet and Chris Holmes (all from Oxford) just wrote a short paper for the proceedings of ICML on a way to speed up Metropolis-Hastings by reducing the number of terms one computes in the likelihood ratio involved in the acceptance probability, i.e.
The observations appearing in this likelihood ratio are a random subsample from the original sample. Even though this leads to an unbiased estimator of the true log-likelihood sum, this approach is not justified on a pseudo-marginal basis à la Andrieu-Roberts (2009). (Writing this in the train back to Paris, I am not convinced this approach is in fact applicable to this proposal as the likelihood itself is not estimated in an unbiased manner…)
In the paper, the quality of the approximation is evaluated by Hoeffding’s like inequalities, which serves as the basis for a stopping rule on the number of terms eventually evaluated in the random subsample. In fine, the method uses a sequential procedure to determine if enough terms are used to take the decision and the probability to take the same decision as with the whole sample is bounded from below. The sequential nature of the algorithm requires to either recompute the vector of likelihood terms for the previous value of the parameter or to store all of them for deriving the partial ratios. While the authors adress the issue of self-evaluating whether or not this complication is worth the effort, I wonder (from my train seat) why they focus so much on recovering the same decision as with the complete likelihood ratio and the same uniform. It would suffice to get the same distribution for the decision (an alternative that is easier to propose than to create of course). I also (idly) wonder if a Gibbs version would be manageable, i.e. by changing only some terms in the likelihood ratio at each iteration, in which case the method could be exact… (I found the above quote quite relevant as, in an alternative technique we are constructing with Marco Banterle, the speedup is particularly visible in the warmup stage.) Hence another direction in this recent flow of papers attempting to speed up MCMC methods against the incoming tsunami of “Big Data” problems.