Archive for sequential Monte Carlo

Posted in Mountains, Statistics, Travel, University life with tags , , , , , , , , , , , , , , , , , , , , , , , , , on April 21, 2014 by xi'an

As I was flying over Skye (with [maybe] a first if hazy perspective on the Cuillin ridge!) to Iceland, three long sets of replies to some of my posts appeared on the ‘Og:

Thanks to them for taking the time to answer my musings…

Posted in pictures, Statistics, Travel with tags , , , , , , , , , , , on April 15, 2014 by xi'an

“At equilibrium, we thus should not expect gains of several orders of magnitude.”

As was signaled to me several times during the MCqMC conference in Leuven, Rémi Bardenet, Arnaud Doucet and Chris Holmes (all from Oxford) just wrote a short paper for the proceedings of ICML on a way to speed up Metropolis-Hastings by reducing the number of terms one computes in the likelihood ratio involved in the acceptance probability, i.e.

$\prod_{i=1}^n\frac{L(\theta^\prime|x_i)}{L(\theta|x_i)}.$

The observations appearing in this likelihood ratio are a random subsample from the original sample. Even though this leads to an unbiased estimator of the true log-likelihood sum, this approach is not justified on a pseudo-marginal basis à la Andrieu-Roberts (2009). (Writing this in the train back to Paris, I am not convinced this approach is in fact applicable to this proposal as the likelihood itself is not estimated in an unbiased manner…)

In the paper, the quality of the approximation is evaluated by Hoeffding’s like inequalities, which serves as the basis for a stopping rule on the number of terms eventually evaluated in the random subsample. In fine, the method uses a sequential procedure to determine if enough terms are used to take the decision and the probability to take the same decision as with the whole sample is bounded from below. The sequential nature of the algorithm requires to either recompute the vector of likelihood terms for the previous value of the parameter or to store all of them for deriving the partial ratios. While the authors adress the issue of self-evaluating whether or not this complication is worth the effort, I wonder (from my train seat) why they focus so much on recovering the same decision as with the complete likelihood ratio and the same uniform. It would suffice to get the same distribution for the decision (an alternative that is easier to propose than to create of course). I also (idly) wonder if a Gibbs version would be manageable, i.e. by changing only some terms in the likelihood ratio at each iteration, in which case the method could be exact… (I found the above quote quite relevant as, in an alternative technique we are constructing with Marco Banterle, the speedup is particularly visible in the warmup stage.) Hence another direction in this recent flow of papers attempting to speed up MCMC methods against the incoming tsunami of “Big Data” problems.

Nonlinear Time Series just appeared

Posted in Books, R, Statistics, University life with tags , , , , , , , , , , , , , , , on February 26, 2014 by xi'an

My friends Randal Douc and Éric Moulines just published this new time series book with David Stoffer. (David also wrote Time Series Analysis and its Applications with Robert Shumway a year ago.) The books reflects well on the research of Randal and Éric over the past decade, namely convergence results on Markov chains for validating both inference in nonlinear time series and algorithms applied to those objects. The later includes MCMC, pMCMC, sequential Monte Carlo, particle filters, and the EM algorithm. While I am too close to the authors to write a balanced review for CHANCE (the book is under review by another researcher, before you ask!), I think this is an important book that reflects the state of the art in the rigorous study of those models. Obviously, the mathematical rigour advocated by the authors makes Nonlinear Time Series a rather advanced book (despite the authors’ reassuring statement that “nothing excessively deep is used”) more adequate for PhD students and researchers than starting graduates (and definitely not advised for self-study), but the availability of the R code (on the highly personal page of David Stoffer) comes to balance the mathematical bent of the book in the first and third parts. A great reference book!

particle efficient importance sampling

Posted in Statistics with tags , , , , , , on October 15, 2013 by xi'an

Marcel Scharth and Robert Kohn just arXived a new article entitled “particle efficient importance sampling“. What is—the efficiency—about?! The spectacular diminution in variance—(the authors mention a factor of 6,000 when compared with regular particle filters!—in a stochastic volatility simulation study.

If I got the details right, the improvement stems from a paper by Richard and Zhang (Journal of  Econometrics, 2007). In a state-space/hidden Markov model setting, (non-sequential) importance sampling tries to approximate the smoothing distribution one term at a time, ie p(xt|xt-1,y1:n), but Richard and Zhang (2007) modify the target by looking at

p(yt|xt)p(xt|xt-1(xt-1,y1:n),

where the last term χ(xt-1,y1:n) is the normalising constant of the proposal kernel for the previous (in t-1) target, k(xt-1|xt-2,y1:n). This kernel is actually parameterised as k(xt-1|xt-2,at(y1:n)) and the EIS algorithm optimises those parameters, one term at a time. The current paper expands Richard and Zhang (2007) by using particles to approximate the likelihood contribution and reduce the variance once the “optimal” EIS solution is obtained. (They also reproduce Richard’s and Zhang’s tricks of relying on the same common random numbers.

This approach sounds like a “miracle” to me, in the sense(s) that (a) the “normalising constant” is far from being uniquely defined (and just as far from being constant in the parameter at) and (b) it is unrelated with the target distribution (except for the optimisation step). In the extreme case when the normalising constant is also constant… in at, this step clearly is useless. (This also opens the potential for an optimisation in the choice of χ(xt-1,y1:n)…)

The simulation study starts from a univariate stochastic volatility model relying on two hidden correlated AR(1) models. (There may be a typo in the definition in Section 4.1, i.e. a Φi missing.) In those simulations, EIS brings a significant variance reduction when compared with standard particle filters and particle EIS further improves upon EIS by a factor of 2 to 20 (in the variance). I could not spot in the paper which choice had been made for χ()… which is annoying as I gathered from my reading that it must have a strong impact on the efficiency attached to the name of the method!

sampling from time-varying log-concave distributions

Posted in Statistics, University life with tags , , , , , on October 2, 2013 by xi'an

Sasha Rakhlin from Wharton sent me this paper he wrote (and arXived) with Hariharan Narayanan on a specific Markov chain algorithm that handles sequential Monte Carlo problems for log-concave targets. By relying on novel (by my standards) mathematical techniques, they manage to obtain geometric ergodicity results for random-walk based algorithms and log-concave targets. One of the new tools is the notion of self-concordant barrier, a sort of convex potential function F associated with a reference convex set and with Lipschitz properties. The second tool is a Gaussian distribution based on the metric induced by F. The third is the Dikin walk Markov chain, which uses this Gaussian as proposal and moves almost like the Metropolis-Hastings algorithm, except that it rejects with at least a probability of ½. The scale (or step size) of the Gaussian proposal is determined by the regularity of the log-concave target. In that setting, the total variation distance between the target at the t-th level and the distribution of the Markov chain can be fairly precisely approximated. Which leads in turn to a scaling of the number of random walk steps that are necessary to ensure convergence. Depending on the pace of the moving target, a single step of the random walk may be sufficient, which is quite an interesting feature.

Initializing adaptive importance sampling with Markov chains

Posted in Statistics with tags , , , , , , , , , , , on May 6, 2013 by xi'an

Another paper recently arXived by Beaujean and Caldwell elaborated on our population Monte Carlo papers (Cappé et al., 2005, Douc et al., 2007, Wraith et al., 2010) to design a more thorough starting distribution. Interestingly, the authors mention the fact that PMC is an EM-type algorithm to emphasize the importance of the starting distribution, as with “poor proposal, PMC fails as proposal updates lead to a consecutively poorer approximation of the target” (p.2). I had not thought of this possible feature of PMC, which indeed proceeds along integrated EM steps, and thus could converge to a local optimum (if not poorer than the start as the Kullback-Leibler divergence decreases).

The solution proposed in this paper is similar to the one we developed in our AMIS paper. An important part of the simulation is dedicated to the construction of the starting distribution, which is a mixture deduced from multiple Metropolis-Hastings runs. I find the method spends an unnecessary long time on refining this mixture by culling the number of components: down-the-shelf clustering techniques should be sufficient, esp. if one considers that the value of the target is available at every simulated point. This has been my pet (if idle) theory for a long while: we do not take (enough) advantage of this informative feature in our simulation methods… I also find the Student’s t versus Gaussian kernel debate (p.6) somehow superfluous: as we shown in Douc et al., 2007, we can process Student’s t distributions so we can as well work with those. And rather worry about the homogeneity assumption this choice implies: working with any elliptically symmetric kernel assumes a local Euclidean structure on the parameter space, for all components, and does not model properly highly curved spaces. Another pet theory of mine’s. As for picking the necessary number of simulations at each PMC iteration, I would add to the ESS and the survival rate of the components a measure of the Kullback-Leibler divergence, as it should decrease at each iteration (with an infinite number of particles).

Another interesting feature is in the comparison with Multinest, the current version of nested sampling, developed by Farhan Feroz. This is the second time I read a paper involving nested sampling in the past two days. While this PMC implementation does better than nested sampling on the examples processed in the paper, the Multinest outcome remains relevant, particularly because it handles multi-modality fairly well. The authors seem to think parallelisation is an issue with nested sampling, while I do see why: at the most naïve stage, several nested samplers can be run in parallel and the outcomes pulled together.

threshold schedules for ABC-SMC

Posted in Statistics, University life with tags , , , , , , , , , on October 17, 2012 by xi'an

Daniel Silk, Sarah Filippi and Michael Stumpf just posted on arXiv a new paper on ABC-SMC. They attack therein a typical problem with SMC (and PMC, and even MCMC!) methods, namely the ability to miss (global) modes of the target due to a poor initial exploration. So, if a proposal is built on previous simulations and if  those simulations have missed an important mode, it is quite likely that this proposal will concentrate on other parts of the space and miss the important mode even more. This is also why simulated annealing and other stochastic optimisation algorithms are so fickle: a wrong parameterisation (like the temperature schedule) induces the sequence to converge to a local optimum rather than to the global optimum. Since sequential ABC is a form of simulated annealing, the decreasing tolerance (or threshold) playing the part of the temperature, it is no surprise that it suffers from this generic disease…

The method proposed in the paper aims at avoiding this difficulty by looking at sudden drops in the acceptance rate curve (as a function of the tolerance ε),

$\aleph_t(\epsilon)=\int p_t(x)\mathbb{I}(\Delta(x,x^\star)\le \epsilon)\,\text{d}x,$

suggesting for threshold the value of ε that maximises the second derivative of this curve. Now, before getting to (at?) the issue of turning this principle into a practical algorithm, let me linger at the motivation for it:

“To see this, one can imagine an ε-ball expanding about the true data; at first the ball only encompasses a small number of particles that were drawn from very close to the global maximum, corresponding to the low gradient at the foot of the shape. Once ε is large enough we are able to accept the relatively large number of particles sitting in the local maximum, which causes the increase in gradient.” (p.5)

Thus, the argument for looking at values of ε preceding fast increases in the acceptance rate ℵ is that we are thus avoiding flatter and lower regions of the posterior support corresponding to a local maximum. It clearly encompasses the case studied by the authors of a global highly concentrated global mode, surrounded by flatter lower modes, but it seems to me that this is not necessarily the only possible reason for changes in the shape of the acceptance probability ℵ. First, we are looking at an ABC acceptance rate, not at a genuine likelihood. Second, this acceptance rate function depends on the current (and necessarily rough) approximate to the genuine predictive, pt. Furthermore, when taking into account this rudimentary replacement of the true likelihood function, it is rather difficult to envision how it impacts the correspondence between a proximity in the data and a proximity in the parameter space. (The toy example is both illuminating and misleading, because it considers a setting where the data is a deterministic transform of the parameter.) I thus think the analysis should refer more strongly to the non-parametric literature and in particular to the k-nearest-neighbour approach recently reformulated by Biau et al.: there is no reason to push the tolerance ε all the way down to zero as this limit does not correspond to the optimal value of the tolerance. If one does not use a non-parametric determination of the “right” tolerance, the lingering question is when and why stopping the sequence of ABC simulations.

The acceptance rate function ℵ is approximated using an unscented transform. I understand neither the implementation of, nor the motivations for, this choice. Given that the function ℵ is a one-dimensional transform of the tolerance and that it actually corresponds to the cdf of the distance between the true data and the (current) pseudo-data, why couldn’t we use smooth versions of a cdf estimate based on the simulated distances, given that these distances are already computed..?  I must be missing something.