## MCMC for non-linear state space models using ensembles of latent sequences

**W**hile visiting U of T, I had the opportunity to discuss the above paper, MCMC for non-linear state space models using ensembles of latent sequences, with both authors, Alex Shestopaloff and Radford Neal, a paper that I had completely missed during my hospital break. The paper borrows from the central idea of Neal (2003), which "is to temporarily reduce the state space of the model, which may be countably infinite or continuous, to a finite collection of randomly generated 'pool' states at each time point." Several copies of the hidden Markov chain are generated from a proposal and weighted according to the posterior. This makes for a finite state space on which a forward-backward algorithm can be run, thus producing a latent sequence that is more likely than the ones originally produced from the proposal, as it borrows at different times from different chains. (I alas had no lasting memory of this early paper of Radford's. But this is essentially *ensemble Monte Carlo*.)
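To fix ideas, here is a minimal sketch of the pool-state device on a toy model of my own choosing (a Gaussian random-walk latent chain with Gaussian observations, nothing to do with the paper's examples): at each time point a small pool of states is drawn around the current sequence, the current state is kept in the pool, and a forward filtering / backward sampling pass over this finite state space produces a new sequence that mixes pieces of different pools. This omits the importance weights on pool states that the paper uses to make the update exactly invariant in general, so it is an illustration of the mechanism rather than the authors' algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (illustrative only): latent x_t ~ N(x_{t-1}, tau^2),
# observation y_t ~ N(x_t, sigma^2).
T, K = 50, 10          # chain length, pool states per time point
tau, sigma = 1.0, 0.5

# Simulate data and take some current latent sequence (e.g. from a
# previous MCMC iteration).
x_true = np.cumsum(rng.normal(0, tau, T))
y = x_true + rng.normal(0, sigma, T)
x_cur = np.zeros(T)

# Step 1: build a finite pool at each time point; the current state is
# kept as a member of its pool.
pool = x_cur[:, None] + rng.normal(0, 1.0, (T, K))
pool[:, 0] = x_cur

# Step 2: forward filtering on the finite (pool) state space.
log_obs = -0.5 * ((y[:, None] - pool) / sigma) ** 2

def log_trans(t):
    # K x K matrix of log transition densities from pool[t-1] to pool[t]
    d = pool[t][None, :] - pool[t - 1][:, None]
    return -0.5 * (d / tau) ** 2

alpha = np.zeros((T, K))
alpha[0] = log_obs[0]
for t in range(1, T):
    m = alpha[t - 1][:, None] + log_trans(t)
    alpha[t] = log_obs[t] + np.logaddexp.reduce(m, axis=0)

# Step 3: backward sampling of one new sequence through the pools.
idx = np.zeros(T, dtype=int)
w = np.exp(alpha[-1] - alpha[-1].max())
idx[-1] = rng.choice(K, p=w / w.sum())
for t in range(T - 2, -1, -1):
    lw = alpha[t] + log_trans(t + 1)[:, idx[t + 1]]
    w = np.exp(lw - lw.max())
    idx[t] = rng.choice(K, p=w / w.sum())

x_new = pool[np.arange(T), idx]   # the refreshed latent sequence
```

The point of the exercise is that `x_new` can pick its value at each time from a different proposed chain, which is exactly where the gain over a single-chain proposal comes from.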

**W**hat Radford patiently explained to me in Toronto was why this method does not have the same drawback as an importance sampling method: the weights are local rather than global and hence do not degenerate as the length of the chain/HMM increases. Which I find a pretty good argument! The trick of being able to rely on forward-backward simulation is also very appealing. This of course does not mean that the method always converges quickly, as the proposal matters. A novelty of the current paper is the inclusion of parameter simulation steps as well, steps that are part of the ensemble Monte Carlo process (rather than a standard Gibbs implementation). There also is a delayed acceptance (as opposed to delayed rejection) step where a subset of the chain is used to check for early (and cheaper) rejection.
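The delayed-acceptance idea is generic enough to sketch on its own: a cheap surrogate (here, imagine a likelihood computed on a subset of the chain) screens the proposal first, and only survivors pay for the full evaluation, with a second-stage correction preserving the target. The toy densities below are mine, not the paper's; this is the standard two-stage scheme of Christen and Fox (2005), not the authors' exact implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

def log_target(theta):
    # the "expensive" full log-posterior (toy: standard normal)
    return -0.5 * theta ** 2

def log_surrogate(theta):
    # a cheap approximation, standing in for a subset-of-the-chain evaluation
    return -0.45 * theta ** 2

def da_step(theta, step=1.0):
    prop = theta + rng.normal(0, step)
    # Stage 1: cheap screen -- early rejection costs only the surrogate
    log_a1 = log_surrogate(prop) - log_surrogate(theta)
    if np.log(rng.uniform()) >= log_a1:
        return theta
    # Stage 2: full evaluation, corrected for the stage-1 screen so that
    # the overall move leaves log_target invariant
    log_a2 = (log_target(prop) - log_target(theta)) - log_a1
    if np.log(rng.uniform()) < log_a2:
        return prop
    return theta
```

When the surrogate is decent, most bad proposals die at stage 1, which is where the computational saving comes from.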

**T**he paper is a bit short on describing the way pool states can be generated (see Section 7), but it seems local (time-wise) perturbations of the current state are considered. I wonder if an intermediate particle step could produce a more efficient proposal… It also seems possible to consider a different number of pool states at more uncertain or more sticky times, with a potential for adaptivity depending on the acceptance rate.

**F**or the evaluation of the method, the authors consider the Ricker population dynamics model found in Wood (2010) and Fearnhead and Prangle (2012), where semi-automatic ABC is used. In the experiment described therein, there are only 100 latent states, which is enough to hinder MCMC. The ensemble method does much better. While there is no comparison with ABC, I would presume this method, relying on a more precise knowledge of the probabilistic model, should do better. Maybe a good topic for a Master project?
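For readers who have not met it, the Ricker model is quick to simulate: a chaotic latent population recursion observed through Poisson counts. The parameter values below are illustrative choices, not those of the experiments in the paper.

```python
import numpy as np

def simulate_ricker(T=100, log_r=3.8, sigma_e=0.3, phi=10.0, seed=1):
    """Simulate the Ricker model of Wood (2010):
    latent population N_{t+1} = r * N_t * exp(-N_t + e_t),
    with e_t ~ N(0, sigma_e^2), observed as y_t ~ Poisson(phi * N_t)."""
    rng = np.random.default_rng(seed)
    N = np.empty(T)
    N[0] = 1.0
    for t in range(1, T):
        e = rng.normal(0.0, sigma_e)
        N[t] = np.exp(log_r) * N[t - 1] * np.exp(-N[t - 1] + e)
    y = rng.poisson(phi * N)
    return N, y
```

The near-chaotic dynamics are what make the likelihood so awkward and single-site MCMC so sticky on this model, hence its popularity as a test case for both ABC and ensemble methods.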

November 6, 2013 at 6:41 pm

It was great to talk with you here!

In the experiments in the paper, we only use a “global” distribution for pool states, which I think will generally be preferred when the state is just one-dimensional, since not-too-many states are enough to pretty-much cover the possibilities. The ensemble method then comes close to marginalizing over the hidden state sequence. When the state is higher-dimensional, though, I expect that keeping the pool states in the local vicinity of the current sequence may work better.

Comparing with ABC would be interesting. There is a conceptual difficulty, though. MCMC becomes exact in the limit as the number of iterations goes to infinity. ABC becomes exact in the limit as the data tolerance goes to zero (and the number of proposals to obtain an accepted value goes to infinity). To compare, you could set some accuracy requirement, and then see how much compute time is needed with each method. But accuracy of the sampled versus the true posterior distribution isn’t a one-dimensional quantity. So which method is best may depend on how you choose to measure accuracy.

Of course, it often turns out that one method is much, much better than another, in which case this may be just a theoretical point.

November 6, 2013 at 7:05 pm

Yes this marginalisation feature is something that took time for me to understand, if you remember our (most helpful) discussions, as I tend to reason as an importance (if unimportant) sampler. And I agree that the comparison with ABC is quite delicate. Ideally we would like to run them both for the same amount of computing time and receive a similar approximation to the posterior. If this happens it is a good sign. If it does not, it is delicate to understand which one has erred away from the true target. Maybe using the ABC output as a basis for your ensemble proposal could work in an iterative scheme. But, as you say, if one is much much better, we end up wasting our time. Unless we can run both methods in parallel (still requiring extra coding!).