## the alive particle filter

**A**s mentioned earlier on the ‘Og, this is a paper written by Ajay Jasra, Anthony Lee, Christopher Yau, and Xiaole Zhang that I missed when it got arXived (as I was also missing my thumb at the time…) The setting is a particle filtering one with a growing product of spaces and constraints on the moves between spaces. The motivating example is an ABC algorithm for an HMM where, at each time, the simulated (pseudo-)observation is forced to remain within a given distance of the true observation. The (one?) problem with this implementation is that the particle filter may well die out by only making proposals that fall outside the above ball. Based on an idea of François Le Gland and Nadia Oudjane, the authors define the alive filter by imposing a fixed number of moves onto the subset, which amounts to running a negative binomial number of proposals. By removing the very last accepted particle, they show that the negative binomial experiment allows for an unbiased estimator of the normalising constant(s). Most of the paper is dedicated to the theoretical study of this scheme, with results summarised as (p.2)

1. Time uniform L_p bounds for the particle filter estimates

2. A central limit theorem (CLT) for suitably normalized and centered particle filter estimates

3. An unbiased property of the particle filter estimates

4. The relative variance of the particle filter estimates, assuming N = O(n), is shown to grow linearly in n.
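The alive mechanism described above (keep proposing until N+1 particles satisfy the constraint, then drop the very last accepted one) can be illustrated in a few lines. This is my own minimal Python sketch, not the authors' code: `propagate` and `in_ball` are hypothetical stand-ins for the transition kernel and the ABC ball constraint.

```python
import numpy as np

def alive_step(particles, propagate, in_ball, N, rng):
    """One alive-filter step (sketch): resample-and-propagate until
    N + 1 proposals land in the constraint set, then discard the very
    last accepted particle. With T the total number of proposals (a
    negative binomial draw), N / (T - 1) is an unbiased estimator of
    the step's acceptance probability."""
    accepted, trials = [], 0
    while len(accepted) < N + 1:
        # pick an ancestor uniformly at random and move it
        x = propagate(particles[rng.integers(len(particles))])
        trials += 1
        if in_ball(x):
            accepted.append(x)
    # dropping the last accepted particle is what removes the bias
    return np.array(accepted[:-1]), N / (trials - 1)

# toy usage with made-up AR(1) dynamics and an ABC ball around y_obs
rng = np.random.default_rng(0)
y_obs, eps = 0.0, 1.0
particles = rng.normal(size=50)
survivors, p_hat = alive_step(
    particles,
    propagate=lambda x: 0.9 * x + rng.normal(),  # hypothetical dynamics
    in_ball=lambda x: abs(x - y_obs) < eps,      # ABC constraint
    N=50, rng=rng,
)
```

The product over time steps of the `p_hat` factors then gives the unbiased estimate of the normalising constant mentioned in the post.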

**T**he assumptions necessary to reach those impressive properties are fairly heavy (or “exceptionally strong” in the words of the authors, p.5): the original and constrained transition kernels are both uniformly ergodic, with equivalent coverage of the constrained subsets for all possible values of the particle at the previous step. I also find the proposed implementation of the ABC filtering inadequate for approximating the posterior on the parameters of the (HMM) model. Expecting every realisation of the simulated time series to be close to the corresponding observed value is too hard a constraint. The above results are scaled in terms of the number N of accepted particles, but it may well be that the number of generated particles, and hence the overall computing time, is much larger. In the examples completing the paper, the comparison is run with an earlier ABC sampler based on the very same stepwise proximity to the observed series, so the impact of this choice is difficult to assess; furthermore, the choice of the tolerances ε is difficult to calibrate: is 3, 6, or 12 a small or large value for ε? A last question that I heard from several sources during talks on this topic is why an ABC approach would be required in HMM settings where SMC applies. Given that the ABC reproduces a simulation on the pair (latent variables, parameters), there does not seem to be a gain there…

February 15, 2014 at 2:29 am

Hi

Just a couple of points with regards to your response, which might be of interest.

For `keeping the simulated sequence in constant proximity’, yes, this is both a problem for this method and, possibly, the reason why it can help. Essentially, at any given time step, the (conditional) expected waiting time is $N^2/\Phi_p(\eta_{p-1}^N)(A)$, so if one does not want to wait too long, the number of alive samples must be proportional to the (square root of the) success probability, which may be prohibitively small. However, if this probability is `not too small, but not too big’, the standard SMC may collapse, whereas this does not happen for the alive filter, which is again useful for PMCMC algorithms (and possibly filtering). If the success probability is around $10^{-30}$, it’s a very hard problem and the alive filter is not going to help much at all.
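The waiting-time cost behind this scaling is just a negative binomial draw, which is easy to simulate directly. A quick Monte Carlo sketch (the sizes and the per-step success probability `p` are arbitrary illustration values, not taken from the paper):

```python
import numpy as np

def trials_until(n_succ, p, rng):
    """Number of independent Bernoulli(p) trials needed to collect
    n_succ successes (a shifted negative binomial draw)."""
    t = s = 0
    while s < n_succ:
        t += 1
        s += rng.random() < p
    return t

# The mean number of proposals for N + 1 acceptances grows like
# (N + 1) / p, so a small per-step success probability makes each
# alive step prohibitively expensive.
rng = np.random.default_rng(1)
N, p = 100, 0.05
mean_T = np.mean([trials_until(N + 1, p, rng) for _ in range(200)])
```

With these values the average cost per step is around (N+1)/p ≈ 2000 proposals, which makes concrete why a success probability of order $10^{-30}$ puts the scheme out of reach.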

On the theory, the assumptions could be much weaker (but, as I say, I could not motivate myself to prove them). For the linear growth bound, one can appeal to the work of Whiteley, Kantas & J (2012, SPA) or Whiteley (2013, AAP), where the assumptions are weaker, but do not apply (directly) to the F-K approximations with indicator potential.

On epsilon, sorry, I had missed your point. On this, we would need to check with my (now completed) PhD student Xiaole Zhang; I cannot answer your question.

Best,

Ajay

February 14, 2014 at 6:38 am

Dear Christian,

Just one final comment this afternoon; sorry for the many remarks.

There is one situation where you may (although I am not advocating its use) use ABC for HMMs when `SMC can be used’ as I will now describe.

Consider an HMM with stochastic state dynamics where the effective dimension of the state vector (in $R^d$) is very large. I will assume that the transition and observation densities are known (parameters known) and may be evaluated. In such scenarios one could, in principle, apply SMC. The problem with such an approach is that, as is generally agreed at the moment (although I am not saying that this won’t change), SMC for filtering in such cases is very inefficient, because one often requires an exponential effort in the effective dimension (see the works of Bickel, van Handel and Chorin for example). Here, often what is done is approximate filtering with methods such as the ensemble Kalman filter (EnKF), and there have been other efforts such as by Ramon van Handel (although still biased). Here one could use an ABC approximation, and what I have proved in the technical report (arXiv version) of

Beskos, A., Crisan, D., Jasra, A. & Whiteley, N. (2014). `Error bounds and normalizing constants for sequential Monte Carlo samplers.’ Advances in Applied Probability (to appear).

is that the SMC approximation of the ABC approximation is generally reasonable (cost is O(Nd)), but that the bias is exponential in the dimension (similar to some results for the EnKF, see work by Stuart and Kody Law). Thus you could apply an SMC approximation of the ABC approximation, and that might be better than competing approximate methods such as the EnKF, 4D-Var etc., but I don’t claim that to be the case.

Kind Regards,

Ajay

February 14, 2014 at 5:56 am

Dear Christian,

Just to add a quote from page 2 of the alive particle filter paper (at least the page 2 of the version I have):

`We assume $g(y | x_n)$ is unknown (even up to an unbiased estimate), but one can sample from the associated distribution. In this scenario, one cannot apply a standard particle filter.’

So, I’m not clear on `A last question that I heard from several sources during talks on that topic is why an ABC approach would be required in HMM settings where SMC applies.’ Perhaps we have made an incorrect comment?

Kind Regards,

Ajay

February 14, 2014 at 3:16 am

Dear Christian,

One extra comment, which I forgot to mention. If one wants to prove results for adaptive epsilon, then it should be possible using the approach in a recent work with Beskos, Kantas and Thiery (http://arxiv.org/abs/1306.6462), although some modification is needed.

Kind Regards,

Ajay

February 14, 2014 at 3:10 am

Dear Christian,

Just to follow up with a couple of extra points.

What one can show (it is implicit in the proof) is that the relative variance of the alive estimate of the normalizing constant is less than that of the one without the alive mechanism. What I conjecture, on the basis of this result, the fact that the relative variance of the estimate of the normalizing constant plays an important role in the mixing properties of PMCMC (e.g., roughly, the work of Andrieu, Lee and Vihola on arXiv), and the numerical results, is the following. Even taking into account the cost of the alive part, the mixing time of PMCMC procedures with the alive mechanism should be better than those without it, but I do not have the proof. The alive PF gives one a way to produce an unbiased estimate of the target which should be subject to a lower variance and potentially a better-mixing MCMC. In the examples the target is the ABC approximation, which I would have thought is needed, as I mention below.

I would also like to add that neither I, nor any of my co-authors on ABC for HMMs (Sumeetpal Singh, Nikolas Kantas, Anthony Lee, Chris Yau, Adam Persing, James Martin), has *ever* suggested using ABC when there exists a non-negative unbiased estimator of the true HMM – that is, when the actual HMM can be fitted using `standard’ computational methods. There may be numerical examples in some of our articles where one can fit the exact HMM, but the point there is to illustrate computational aspects of the ideas introduced (and possible benefits) and not statistical inference, where one does not need to use an ABC approximation (or at least to my understanding – if I am wrong I would welcome any remarks on why). I cannot understand why anyone would suggest differently.

Kind Regards,

Ajay Jasra

February 14, 2014 at 1:24 am

Dear Christian,

Thanks for your comments. I have to say that I don’t agree with some of them, but point 2 below is certainly accurate, and I discuss why.

First, on the use of ABC for HMMs when SMC can be applied (as I’ll argue, you can’t use SMC in the context I’m interested in, and the paper isn’t really about that). If the likelihood of the observations (the conditional density of the observation given the state) is intractable (cannot be evaluated and has NO unbiased estimator), how do you propose to draw exact inference? The weights in SMC cannot be evaluated (even with optimal proposals etc., not just bootstrap) and there is no unbiased estimate of them. So then, at least to my knowledge, you cannot be exact. There is an example of an intractable model in the article. I find the comment very odd, unless one can evaluate the likelihood. I’d appreciate it if you could tell me why ABC is not useful with intractable likelihoods (as I have said) for HMMs.
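The point about unevaluable weights is worth making concrete: when $g(y|x)$ can only be sampled, the SMC weight is replaced by an indicator built from a pseudo-observation, which is exactly the hard-ball constraint discussed in the post. A minimal sketch, with all names being illustrative stand-ins rather than anything from the paper:

```python
def abc_weight(x, y_obs, simulate_obs, eps):
    """ABC surrogate for an intractable observation density g(y|x):
    draw a pseudo-observation from the model and use the indicator
    that it falls within eps of the data. `simulate_obs` is a
    hypothetical sampler from the observation distribution."""
    return float(abs(simulate_obs(x) - y_obs) < eps)

# toy check with a deterministic 'simulator' (identity map)
w_in = abc_weight(1.0, 1.05, lambda x: x, eps=0.1)   # pseudo-obs close
w_out = abc_weight(1.0, 5.0, lambda x: x, eps=0.1)   # pseudo-obs far
```

Since the weight is a 0/1 indicator, whole particle populations can die out at once, which is the failure mode the alive filter is designed to prevent.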

Second, on being inadequate for estimating parameters. The approach will never completely solve the problems of particle approximations of Feynman-Kac formulae. The idea is to ensure that the system does not die out, but that could take exponential time (in some parameter, such as the probability of success). The whole point is that in (e.g.) a PMMH proposal, when the standard SMC does not do so well, one can possibly avoid wasting a proposal. Empirically, this seems to help, as we saw. I suspect one can prove results on mixing time, but as I argue below, that’s not the point of the paper. The idea of the paper is a reasonably interesting (to me) increment in methodology, which was not ours to start with (please see a paper by Kunsch in TOMACS). We never claim it is the solution on its own, but there are some nice things that can be done which do add to the literature.

Third, on the theory. First, we were the first to look at this modification of Le Gland’s scheme (formally, Kunsch didn’t do this). So then the idea is: what can we prove that is important algorithmically? That is what we did. So, in my opinion, the idea is to obtain indicative results, not to spend many months proving something we know is pretty much true. My experience with such proofs (and I have several papers with weak mathematical assumptions) is that they don’t add much to the methodology, which was the idea of this paper. Of course, such results are important, but I must admit to being tired of proving results which are difficult to publish. Note that the CLT is scaled with $T_n$, and the $L_p$ bound can be in terms of the expected value of $T_n$ (well, its inverse square root), so the effect you mention can be quantified.

Fourthly, epsilon can be adaptively calibrated (please take a look at the filtering paper in 2012 with Sumeetpal Singh). We didn’t do that because our analysis doesn’t cover adaptive epsilon. It’s an easy modification.

I hope my clarifications this morning are useful; perhaps you will disagree with what I say, but my intention is to provide a scientific rebuttal of some of your points. That said, the article is far from ground-breaking and is just something we found interesting, rather than an attempt to completely solve a very challenging problem.

Kind Regards,

Ajay

February 14, 2014 at 9:07 am

Thanks, Ajay! As we say in French, “la critique est facile mais l’art est difficile”: it is much easier to write a few comments than to solve the real issues. And to miss points. Especially when losing touch with the particle filter literature, which grows much faster than I can follow… So I agree that (1) I do not have much of a solution when the observed-given-latent likelihood cannot be computed: the only alternative I can think of, when the dimension is small enough, is to use a single non-parametric kernel estimator throughout all time periods. Of course this is a biased solution and there may be good arguments against biased estimators; (2) the line about parameter estimation was poorly worded! What I find unappealing is this notion of keeping the simulated sequence in constant proximity to the original observed sequence. This sounds like a very hard constraint, quite likely to cause long simulation delays, even though the sequential implementation alleviates this to some extent; (3) again, being mostly from outside the field of particle filters, I do not have much to say about the techniques of proof and I certainly do not claim better proofs or lighter assumptions could be found. I also agree that the theoretical results do not add to the methodology and thus should not be a hindrance to completing and publishing a paper in this area! I actually find the linearity in n to be the most appealing result in the series (though I may wonder how much this depends on the assumptions); (4) my comment about the tolerance was not about adaptivity, but simply about whether or not the chosen numerical values were large or small in terms of proximity to the true posterior (a void question in fine since we cannot know the true posterior). And, to conclude, I am sorry if the post sounded overly negative. They often do.
I should have added, or started with, the point that this is indeed a very complex and challenging problem and that reaching this type of processing is clearly a success! Cheers!