Archive for stochastic volatility

accelerating HMC by learning the leapfrog scale

Posted in Books, Statistics with tags , , , , , , , , on October 12, 2018 by xi'an

In this new arXiv submission that was part of Changye Wu’s thesis [defended last week],  we try to reduce the high sensitivity of the HMC algorithm to its hand-tuned parameters, namely the step size ε  of the discretisation scheme, the number of steps L of the integrator, and the covariance matrix of the auxiliary variables. By calibrating the number of steps of the Leapfrog integrator towards avoiding both slow mixing chains and wasteful computation costs. We do so by learning from the No-U-Turn Sampler (NUTS) of Hoffman and Gelman (2014) which already automatically tunes both the step size and the number of leapfrogs.

The core idea behind NUTS is to pick the step size via primal-dual averaging in a burn-in (warmup, Andrew would say) phase and to build at each iteration a proposal based on following a locally longest path on a level set of the Hamiltonian. This is achieved by a recursive algorithm that, at each call to the leapfrog integrator, requires to evaluate both the gradient of the target distribution and the Hamiltonianitself. Roughly speaking an iteration of NUTS costs twice as much as regular HMC with the same number of calls to the integrator. Our approach is to learn from NUTS the scale of the leapfrog length and use the resulting empirical distribution of the longest leapfrog path to randomly pick the value of  L at each iteration of an HMC scheme. This obviously preserves the validity of the HMC algorithm.

While a theoretical comparison of the convergence performances of NUTS and this eHMC proposal seem beyond our reach, we ran a series of experiments to evaluate these performances, using as a criterion an ESS value that is calibrated by the evaluation cost of the logarithm of target density function and of its gradient, as this is usually the most costly part of the algorithms. As well as a similarly calibrated expected square jumping distance. Above is one such illustration for a stochastic volatility model, the first axis representing the targeted acceptance probability in the Metropolis step. Some of the gains in either ESS or ESJD are by a factor of ten, which relates to our argument that NUTS somewhat wastes computation effort using a uniformly distributed proposal over the candidate set, instead of being close to its end-points, which automatically reduces the distance between the current position and the proposal.

impressions from EcoSta2017 [guest post]

Posted in pictures, Statistics, Travel, University life with tags , , , , , , , , , on July 6, 2017 by xi'an

[This is a guest post on the recent EcoSta2017 (Econometrics and Statistics) conference in Hong Kong, contributed by Chris Drovandi from QUT, Brisbane.]

There were (at least) two sessions on Bayesian Computation at the recent EcoSta (Econometrics and Statistics) 2017 conference in Hong Kong. Below is my review of them. My overall impression of the conference is that there were lots of interesting talks, albeit a lot in financial time series, not my area. Even so I managed to pick up a few ideas/concepts that could be useful in my research. One criticism I had was that there were too many sessions in parallel, which made choosing quite difficult and some sessions very poorly attended. Another criticism of many participants I spoke to was that the location of the conference was relatively far from the city area.

In the first session (chaired by Robert Kohn), Minh-Ngoc Tran spoke about this paper on Bayesian estimation of high-dimensional Copula models with mixed discrete/continuous margins. Copula models with all continuous margins are relatively easy to deal with, but when the margins are discrete or mixed there are issues with computing the likelihood. The main idea of the paper is to re-write the intractable likelihood as an integral over a hypercube of ≤J dimensions (where J is the number of variables), which can then be estimated unbiasedly (with variance reduction by using randomised quasi-MC numbers). The paper develops advanced (correlated) pseudo-marginal and variational Bayes methods for inference.

In the following talk, Chris Carter spoke about different types of pseudo-marginal methods, particle marginal Metropolis-Hastings and particle Gibbs for state space models. Chris suggests that a combination of these methods into a single algorithm can further improve mixing. Continue reading

efficient approximate Bayesian inference for models with intractable likelihood

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , , , , , on July 6, 2015 by xi'an

Awalé board on my garden table, March 15, 2013Dalhin, Villani [Mattias, not Cédric] and Schön arXived a paper this week with the above title. The type of intractable likelihood they consider is a non-linear state-space (HMM) model and the SMC-ABC they propose is based on an optimised Laplace approximation. That is, replacing the posterior distribution on the parameter θ with a normal distribution obtained by a Taylor expansion of the log-likelihood. There is no obvious solution for deriving this approximation in the case of intractable likelihood functions and the authors make use of a Bayesian optimisation technique called Gaussian process optimisation (GPO). Meaning that the Laplace approximation is the Laplace approximation of a surrogate log-posterior. GPO is a Bayesian numerical method in the spirit of the probabilistic numerics discussed on the ‘Og a few weeks ago. In the current setting, this means iterating three steps

  1. derive an approximation of the log-posterior ξ at the current θ using SMC-ABC
  2. construct a surrogate log-posterior by a Gaussian process using the past (ξ,θ)’s
  3. determine the next value of θ

In the first step, a standard particle filter cannot be used to approximate the observed log-posterior at θ because the conditional density of observed given latent is intractable. The solution is to use ABC for the HMM model, in the spirit of many papers by Ajay Jasra and co-authors. However, I find the construction of the substitute model allowing for a particle filter very obscure… (A side effect of the heat wave?!) I can spot a noisy ABC feature in equation (7), but am at a loss as to how the reparameterisation by the transform τ is compatible with the observed-given-latent conditional being unavailable: if the pair (x,v) at time t has a closed form expression, so does (x,y), at least on principle, since y is a deterministic transform of (x,v). Another thing I do not catch is why having a particle filter available prevent the use of a pMCMC approximation.

The second step constructs a Gaussian process posterior on the log-likelihood, with Gaussian errors on the ξ’s. The Gaussian process mean is chosen as zero, while the covariance function is a Matérn function. With hyperparameters that are estimated by maximum likelihood estimators (based on the argument that the marginal likelihood is available in closed form). Turning the approach into an empirical Bayes version.

The next design point in the sequence of θ’s is the argument of the maximum of a certain acquisition function, which is chosen here as a sort of maximum regret associated with the posterior predictive associated with the Gaussian process. With possible jittering. At this stage, it reminded me of the Gaussian process approach proposed by Michael Gutmann in his NIPS poster last year.

Overall, the method is just too convoluted for me to assess its worth and efficiency without a practical implementation to… practice upon, for which I do not have time! Hence I would welcome any comment from readers having attempted such implementations. I also wonder at the lack of link with Simon Wood‘s Gaussian approximation that appeared in Nature (2010) and was well-discussed in the Read Paper of Fearnhead and Prangle (2012).

Stochastic volatility filtering with intractable likelihoods

Posted in Books, Statistics, University life with tags , , , , , , on May 23, 2014 by xi'an

“The contribution of our work is two-fold: first, we extend the SVM literature, by proposing a new method for obtaining the filtered volatility estimates. Second, we build upon the current ABC literature by introducing the ABC auxiliary particle filter, which can be easily applied not only to SVM, but to any hidden Markov model.”

Another ABC arXival: Emilian Vankov and Katherine B. Ensor posted a paper with the above title. They consider a stochastic volatility model with an α-stable distribution on the observables (or returns). Which makes the likelihood unavailable, even were the hidden Markov sequence known… Now, I find very surprising that the authors do not mention the highly relevant paper of Peters, Sisson and Fan, Likelihood-free Bayesian inference for α-stable models, published in CSDA, in 2012, where an ABC algorithm is specifically designed for handling α-stable likelihoods. (Commented on that earlier post.) Similarly, the use of a particle filter coupled to ABC seems to be advanced as a novelty when many researchers have implemented such filters, including Pierre Del Moral, Arnaud Doucet, Ajay Jasra, Sumeet Singh and others, in similar or more general settings. Furthermore, Simon Barthelmé and Nicolas Chopin analysed this very model by EP-ABC and ABC.  I thus find it a wee bit hard to pinpoint the degree of innovation contained in this new ABC paper

particle efficient importance sampling

Posted in Statistics with tags , , , , , , on October 15, 2013 by xi'an

Marcel Scharth and Robert Kohn just arXived a new article entitled “particle efficient importance sampling“. What is—the efficiency—about?! The spectacular diminution in variance—(the authors mention a factor of 6,000 when compared with regular particle filters!—in a stochastic volatility simulation study.

If I got the details right, the improvement stems from a paper by Richard and Zhang (Journal of  Econometrics, 2007). In a state-space/hidden Markov model setting, (non-sequential) importance sampling tries to approximate the smoothing distribution one term at a time, ie p(xt|xt-1,y1:n), but Richard and Zhang (2007) modify the target by looking at


where the last term χ(xt-1,y1:n) is the normalising constant of the proposal kernel for the previous (in t-1) target, k(xt-1|xt-2,y1:n). This kernel is actually parameterised as k(xt-1|xt-2,at(y1:n)) and the EIS algorithm optimises those parameters, one term at a time. The current paper expands Richard and Zhang (2007) by using particles to approximate the likelihood contribution and reduce the variance once the “optimal” EIS solution is obtained. (They also reproduce Richard’s and Zhang’s tricks of relying on the same common random numbers.

This approach sounds like a “miracle” to me, in the sense(s) that (a) the “normalising constant” is far from being uniquely defined (and just as far from being constant in the parameter at) and (b) it is unrelated with the target distribution (except for the optimisation step). In the extreme case when the normalising constant is also constant… in at, this step clearly is useless. (This also opens the potential for an optimisation in the choice of χ(xt-1,y1:n)…)

The simulation study starts from a univariate stochastic volatility model relying on two hidden correlated AR(1) models. (There may be a typo in the definition in Section 4.1, i.e. a Φi missing.) In those simulations, EIS brings a significant variance reduction when compared with standard particle filters and particle EIS further improves upon EIS by a factor of 2 to 20 (in the variance). I could not spot in the paper which choice had been made for χ()… which is annoying as I gathered from my reading that it must have a strong impact on the efficiency attached to the name of the method!

twisted filters

Posted in Statistics with tags , , , , , , , , , , on October 6, 2012 by xi'an

Nick Witheley (Bristol) and Anthony Lee (Warwick) just posted an interesting paper called ‘Twisted particle filters‘ on arXiv. (Presumably unintentionally, the title sounds like Twisted Sister, pictured above, even though I never listened to this [particularly] heavy kind of hard rock! Twisting is customarily used in the large deviation literature.)

The twisted particle paper studies the impact of the choice of something similar to, if subtly different from, an importance function on the approximation of the marginal density (or evidence) for HMMS. (In essence, the standard particle filter is modified for only one particle in the population.) The core of the paper is to compare those importance functions in a fixed N-large n setting. As in simpler importance sampling cases, there exists an optimal if impractical choice of importance function, leading to a zero variance estimator of the evidence. Nick and Anthony produce an upper bound on the general variance as well. One of the most appealing features of the paper is that the authors manage a convergence result in n rather than N. (Although the algorithms are obviously validated in the more standard large N sense.)

The paper is quite theoretical and dense (I was about to write heavy in connection with the above!), with half its length dedicated to proofs. It relies on operator theory, with eigen-functions behind the optimal filter, while not unrelated with Pierre Del Moral’s works. (It took me a while to realise that the notation ω was not the elemental draw from the σ-algebra but rather the realisation of the observed sequence! And I had to keep recalling that θ was the lag operator and not a model parameter [there is no model parameter].)

Andrew gone NUTS!

Posted in pictures, R, Statistics, University life with tags , , , , , , , , , , on November 24, 2011 by xi'an

Matthew Hoffman and Andrew Gelman have posted a paper on arXiv entitled “The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo” and developing an improvement on the Hamiltonian Monte Carlo algorithm called NUTS (!). Here is the abstract:

Hamiltonian Monte Carlo (HMC) is a Markov chain Monte Carlo (MCMC) algorithm that avoids the random walk behavior and sensitivity to correlated parameters that plague many MCMC methods by taking a series of steps informed by first-order gradient information. These features allow it to converge to high-dimensional target distributions much more quickly than simpler methods such as random walk Metropolis or Gibbs sampling. However, HMC’s performance is highly sensitive to two user-specified parameters: a step size ε and a desired number of steps L. In particular, if L is too small then the algorithm exhibits undesirable random walk behavior, while if L is too large the algorithm wastes computation. We introduce the No-U-Turn Sampler (NUTS), an extension to HMC that eliminates the need to set a number of steps L. NUTS uses a recursive algorithm to build a set of likely candidate points that spans a wide swath of the target distribution, stopping automatically when it starts to double back and retrace its steps. Empirically, NUTS perform at least as efficiently as and sometimes more efficiently than a well tuned standard HMC method, without requiring user intervention or costly tuning runs. We also derive a method for adapting the step size parameter ε on the fly based on primal-dual averaging. NUTS can thus be used with no hand-tuning at all. NUTS is also suitable for applications such as BUGS-style automatic inference engines that require efficient “turnkey” sampling algorithms.

Now, my suspicious and pessimistic nature always makes me wary of universality claims! I completely acknowledge the difficulty in picking the number of leapfrog steps L in the Hamiltonian algorithm, since the theory behind does not tell anything useful about L. And the paper is quite convincing in its description of the NUTS algorithm, being moreover beautifully written.  As indicated in the paper, the “doubling” solution adopted by NUTS is reminding me of Radford Neal’s procedure in his Annals of Statistics paper on slice sampling. (The NUTS algorithm also relies on a slice sampling step.) However, besides its ability to work as an automatic Hamiltonian methodology, I wonder about the computing time (and the “unacceptably large amount of memory”, p.12) required by the doubling procedure: 2j is growing fast with j! (If my intuition is right, the computing time should increase rather quickly with the dimension. And I do not get the argument within the paper that the costly part is the gradient computation: it seems to me the gradient must be computed for all of the 2j points.) The authors also mention delayed rejection à la Tierney and Mira (1999) and the scheme reminded me a wee of the pinball sampler we devised a while ago with Kerrie Mengersen. Choosing the discretisation step ε is more “traditional”, using the stochastic approximation approach we set in our unpublished-yet-often-quoted tech report with Christophe Andrieu. (I do not think I got the crux of the “dual averaging” for optimal calibration on p.11) The illustration through four benchmarks [incl. a stochastic volatility model!] is quite convincing as well, with (unsurprisingly) great graphical tools. A final grumble: that the code is “only” available in the proprietary language Matlab! Now, I bet some kind of Rao-Blackwellisation is feasible with all the intermediate simulations!