## SMC on a sequence of increasing dimension targets

Posted in Statistics with tags , , , , , , , , , on February 15, 2017 by xi'an

Richard Everitt and co-authors have arXived a preliminary version of a paper entitled Sequential Bayesian inference for mixture models and the coalescent using sequential Monte Carlo samplers with transformations. The central notion is an SMC version of the Carlin & Chib (1995) completion in the comparison of models in different dimensions. Namely to create auxiliary variables for each model in such a way that the dimension of the completed models are all the same. (Reversible jump MCMC à la Peter Green (1995) can also be interpreted this way, even though only relevant bits of the completion are used in the transitions.) I find the paper and the topic most interesting if only because it relates to earlier papers of us on population Monte Carlo. It also brought to my awareness the paper by Karagiannis and Andrieu (2013) on annealed reversible jump MCMC that I had missed at the time it appeared. The current paper exploits this annealed expansion in the devising of the moves. (Sequential Monte Carlo on a sequence of models with increasing dimension has been studied in the past.)

The way the SMC is described in the paper, namely, reweight-subsample-move, does not strike me as the most efficient as I would try to instead move-reweight-subsample, using a relevant move that incorporate the new model and hence enhance the chances of not rejecting.

One central application of the paper is mixture models with an unknown number of components. The SMC approach applied to this problem means creating a new component at each iteration t and moving the existing particles after adding the parameters of the new component. Since using the prior for this new part is unlikely to be at all efficient, a split move as in Richardson and Green (1997) can be considered, which brings back the dreaded Jacobian of RJMCMC into the picture! Here comes an interesting caveat of the method, namely that the split move forces a choice of the split component of the mixture. However, this does not appear as a strong difficulty, solved in the paper by auxiliary [index] variables, but possibly better solved by a mixture representation of the proposal, as in our PMC [population Monte Carlo] papers. Which also develop a family of SMC algorithms, incidentally. We found there that using a mixture representation of the proposal achieves a provable variance reduction.

“This puts a requirement on TSMC that the single transition it makes must be successful.”

As pointed by the authors, the transformation SMC they develop faces the drawback that a given model is only explored once in the algorithm, when moving to the next model. On principle, there would be nothing wrong in including regret steps, retracing earlier models in the light of the current one, since each step is an importance sampling step valid on its own right. But SMC also offers a natural albeit potentially high-varianced approximation to the marginal likelihood, which is quite appealing when comparing with an MCMC outcome. However, it would have been nice to see a comparison with alternative estimates of the marginal in the case of mixtures of distributions. I also wonder at the comparative performances of a dual approach that would be sequential in the number of observations as well, as in Chopin (2004) or our first population Monte Carlo paper (Cappé et al., 2005), since subsamples lead to tempered versions of the target and hence facilitate moves between models, being associated with flatter likelihoods.

## anytime algorithm

Posted in Books, Statistics with tags , , , , , , , , , on January 11, 2017 by xi'an

Lawrence Murray, Sumeet Singh, Pierre Jacob, and Anthony Lee (Warwick) recently arXived a paper on Anytime Monte Carlo. (The earlier post on this topic is no coincidence, as Lawrence had told me about this problem when he visited Paris last Spring. Including a forced extension when his passport got stolen.) The difficulty with anytime algorithms for MCMC is the lack of exchangeability of the MCMC sequence (except for formal settings where regeneration can be used).

When accounting for duration of computation between steps of an MCMC generation, the Markov chain turns into a Markov jump process, whose stationary distribution α is biased by the average delivery time. Unless it is constant. The authors manage this difficulty by interlocking the original chain with a secondary chain so that even- and odd-index chains are independent. The secondary chain is then discarded. This provides a way to run an anytime MCMC. The principle can be extended to K+1 chains, run one after the other, since only one of those chains need be discarded. It also applies to SMC and SMC². The appeal of anytime simulation in this particle setting is that resampling is no longer a bottleneck. Hence easily distributed among processors. One aspect I do not fully understand is how the computing budget is handled, since allocating the same real time to each iteration of SMC seems to envision each target in the sequence as requiring the same amount of time. (An interesting side remark made in this paper is the lack of exchangeability resulting from elaborate resampling mechanisms, lack I had not thought of before.)

## anytime!

Posted in Books, Mountains, pictures, Statistics, Travel with tags , , , , , on December 22, 2016 by xi'an

“An anytime algorithm is an algorithm that can be run continuously, generating progressively better solutions when afforded additional computation time. Traditional particle-based inference algorithms are not anytime in nature; all particles need to be propagated in lock-step to completion in order to compute expectations.”

Following a discussion with Lawrence Murray last week, I read Paige et al.  NIPS 2014 paper on their anytime sequential Monte Carlo algorithm. As explained above, an anytime algorithm is interruptible, meaning it can be stopped at any time without biasing the outcome of the algorithm. While MCMC algorithms can qualify as anytime (provided they are in stationary regime), it is not the case with sequential and particle Monte Carlo algorithms, which do not have an inbred growing mechanism preserving the target. In the case of Paige et al.’s proposal, the interruptible solution returns an unbiased estimator of the marginal likelihood at time n for any number of particles, even when this number is set or increased during the computation. The idea behind the solution is to create a particle cascade by going one particle at a time and creating children of this particle in proportion to the current average weight. An approach that can be run indefinitely. And since memory is not infinite, the authors explain how to cap the number of alive particles without putting the running distribution in jeopardy…

## rare events for ABC

Posted in Books, Mountains, pictures, Statistics, Travel, University life with tags , , , , , , , on November 24, 2016 by xi'an

Dennis Prangle, Richard G. Everitt and Theodore Kypraios just arXived a new paper on ABC, aiming at handling high dimensional data with latent variables, thanks to a cascading (or nested) approximation of the probability of a near coincidence between the observed data and the ABC simulated data. The approach amalgamates a rare event simulation method based on SMC, pseudo-marginal Metropolis-Hastings and of course ABC. The rare event is the near coincidence of the observed summary and of a simulated summary. This is so rare that regular ABC is forced to accept not so near coincidences. Especially as the dimension increases.  I mentioned nested above purposedly because I find that the rare event simulation method of Cérou et al. (2012) has a nested sampling flavour, in that each move of the particle system (in the sample space) is done according to a constrained MCMC move. Constraint derived from the distance between observed and simulated samples. Finding an efficient move of that kind may prove difficult or impossible. The authors opt for a slice sampler, proposed by Murray and Graham (2016), however they assume that the distribution of the latent variables is uniform over a unit hypercube, an assumption I do not fully understand. For the pseudo-marginal aspect, note that while the approach produces a better and faster evaluation of the likelihood, it remains an ABC likelihood and not the original likelihood. Because the estimate of the ABC likelihood is monotonic in the number of terms, a proposal can be terminated earlier without inducing a bias in the method.

This is certainly an innovative approach of clear interest and I hope we will discuss it at length at our BIRS ABC 15w5025 workshop next February. At this stage of light reading, I am slightly overwhelmed by the combination of so many computational techniques altogether towards a single algorithm. The authors argue there is very little calibration involved, but so many steps have to depend on as many configuration choices.

## MCqMC 2016 [#2]

Posted in pictures, Running, Statistics, Travel, University life with tags , , , , , , , , , , , , , , , , , , , , on August 17, 2016 by xi'an

In her plenary talk this morning, Christine Lemieux discussed connections between quasi-Monte Carlo and copulas, covering a question I have been considering for a while. Namely, when provided with a (multivariate) joint cdf F, is there a generic way to invert a vector of uniforms [or quasi-uniforms] into a simulation from F? For Archimedian copulas (as we always can get back to copulas), there is a resolution by the Marshall-Olkin representation,  but this puts a restriction on the distributions F that can be considered. The session on synthetic likelihoods [as introduced by Simon Wood in 2010] put together by Scott Sisson was completely focussed on using normal approximations for the distribution of the vector of summary statistics, rather than the standard ABC non-parametric approximation. While there is a clear (?) advantage in using a normal pseudo-likelihood, since it stabilises with much less simulations than a non-parametric version, I find it difficult to compare both approaches, as they lead to different posterior distributions. In particular, I wonder at the impact of the dimension of the summary statistics on the approximation, in the sense that it is less and less likely that the joint is normal as this dimension increases. Whether this is damaging for the resulting inference is another issue, possibly handled by a supplementary ABC step that would take the first-step estimate as summary statistic. (As a side remark, I am intrigued at everyone being so concerned with unbiasedness of methods that are approximations with no assessment of the amount of approximation!) The last session of the day was about multimodality and MCMC solutions, with talks by Hyungsuk Tak, Pierre Jacob and Babak Shababa, plus mine. Hunsuk presented the RAM algorithm I discussed earlier under the title of “love-hate” algorithm, which was a kind reference to my post! (I remain puzzled by the ability of the algorithm to jump to another mode, given that the intermediary step aims at a low or even zero probability region with an infinite mass target.) And Pierre talked about using SMC for Wang-Landau algorithms, with a twist to the classical stochastic optimisation schedule that preserves convergence. And a terrific illustration on a distribution inspired from the Golden Gate Bridge that reminded me of my recent crossing! The discussion around my folded Markov chain talk focussed on the extension of the partition to more than two sets, the difficulty being in generating automated projections, with comments about connections with computer graphic tools. (Too bad that the parallel session saw talks by Mark Huber and Rémi Bardenet that I missed! Enjoying a terrific Burmese dinner with Rémi, Pierre and other friends also meant I could not post this entry on time for the customary 00:16. Not that it matters in the least…)

## Bayesian model comparison with intractable constants

Posted in Books, Kids, pictures, Statistics, Travel, University life with tags , , , , , , , , , , , on February 8, 2016 by xi'an

Richard Everitt, Adam Johansen (Warwick), Ellen Rowing and Melina Evdemon-Hogan have updated [on arXiv] a survey paper on the computation of Bayes factors in the presence of intractable normalising constants. Apparently destined for Statistics and Computing when considering the style. A great entry, in particular for those attending the CRiSM workshop Estimating Constants in a few months!

A question that came to me from reading the introduction to the paper is why a method like Møller et al.’s (2006) auxiliary variable trick should be considered more “exact” than the pseudo-marginal approach of Andrieu and Roberts (2009) since the later can equally be seen as an auxiliary variable approach. The answer was on the next page (!) as it is indeed a special case of Andrieu and Roberts (2009). Murray et al. (2006) also belongs to this group with a product-type importance sampling estimator, based on a sequence of tempered intermediaries… As noted by the authors, there is a whole spectrum of related methods in this area, some of which qualify as exact-approximate, inexact approximate and noisy versions.

Their main argument is to support importance sampling as the method of choice, including sequential Monte Carlo (SMC) for large dimensional parameters. The auxiliary variable of Møller et al.’s (2006) is then part of the importance scheme. In the first toy example, a Poisson is opposed to a Geometric distribution, as in our ABC model choice papers, for which a multiple auxiliary variable approach dominates both ABC and Simon Wood’s synthetic likelihood for a given computing cost. I did not spot which artificial choice was made for the Z(θ)’s in both models, since the constants are entirely known in those densities. A very interesting section of the paper is when envisioning biased approximations to the intractable density. If only because the importance weights are most often biased due to the renormalisation (possibly by resampling). And because the variance derivations are then intractable as well. However, due to this intractability, the paper can only approach the impact of those approximations via empirical experiments. This leads however to the interrogation on how to evaluate the validity of the approximation in settings where truth and even its magnitude are unknown… Cross-validation and bootstrap type evaluations may prove too costly in realistic problems. Using biased solutions thus mostly remains an open problem in my opinion.

The SMC part in the paper is equally interesting if only because it focuses on the data thinning idea studied by Chopin (2002) and many other papers in the recent years. This made me wonder why an alternative relying on a sequence of approximations to the target with tractable normalising constants could not be considered. A whole sequence of auxiliary variable completions sounds highly demanding in terms of computing budget and also requires a corresponding sequence of calibrations. (Now, ABC fares no better since it requires heavy simulations and repeated calibrations, while further exhibiting a damning missing link with the target density. ) Unfortunately, embarking upon a theoretical exploration of the properties of approximate SMC is quite difficult, as shown by the strong assumptions made in the paper to bound the total variation distance to the true target.

## Quasi-Monte Carlo sampling

Posted in Books, Kids, Statistics, Travel, University life, Wines with tags , , , , , , , , , , , , on December 10, 2014 by xi'an

“The QMC algorithm forces us to write any simulation as an explicit function of uniform samples.” (p.8)

As posted a few days ago, Mathieu Gerber and Nicolas Chopin will read this afternoon a Paper to the Royal Statistical Society on their sequential quasi-Monte Carlo sampling paper.  Here are some comments on the paper that are preliminaries to my written discussion (to be sent before the slightly awkward deadline of Jan 2, 2015).

Quasi-Monte Carlo methods are definitely not popular within the (mainstream) statistical community, despite regular attempts by respected researchers like Art Owen and Pierre L’Écuyer to induce more use of those methods. It is thus to be hoped that the current attempt will be more successful, it being Read to the Royal Statistical Society being a major step towards a wide diffusion. I am looking forward to the collection of discussions that will result from the incoming afternoon (and bemoan once again having to miss it!).

“It is also the resampling step that makes the introduction of QMC into SMC sampling non-trivial.” (p.3)

At a mathematical level, the fact that randomised low discrepancy sequences produce both unbiased estimators and error rates of order

$\mathfrak{O}(N^{-1}\log(N)^{d-}) \text{ at cost } \mathfrak{O}(N\log(N))$

means that randomised quasi-Monte Carlo methods should always be used, instead of regular Monte Carlo methods! So why is it not always used?! The difficulty stands [I think] in expressing the Monte Carlo estimators in terms of a deterministic function of a fixed number of uniforms (and possibly of past simulated values). At least this is why I never attempted at crossing the Rubicon into the quasi-Monte Carlo realm… And maybe also why the step had to appear in connection with particle filters, which can be seen as dynamic importance sampling methods and hence enjoy a local iid-ness that relates better to quasi-Monte Carlo integrators than single-chain MCMC algorithms.  For instance, each resampling step in a particle filter consists in a repeated multinomial generation, hence should have been turned into quasi-Monte Carlo ages ago. (However, rather than the basic solution drafted in Table 2, lower variance solutions like systematic and residual sampling have been proposed in the particle literature and I wonder if any of these is a special form of quasi-Monte Carlo.) In the present setting, the authors move further and apply quasi-Monte Carlo to the particles themselves. However, they still assume the deterministic transform

$\mathbf{x}_t^n = \Gamma_t(\mathbf{x}_{t-1}^n,\mathbf{u}_{t}^n)$

which the q-block on which I stumbled each time I contemplated quasi-Monte Carlo… So the fundamental difficulty with the whole proposal is that the generation from the Markov proposal

$m_t(\tilde{\mathbf{x}}_{t-1}^n,\cdot)$

has to be of the above form. Is the strength of this assumption discussed anywhere in the paper? All baseline distributions there are normal. And in the case it does not easily apply, what would the gain bw in only using the second step (i.e., quasi-Monte Carlo-ing the multinomial simulation from the empirical cdf)? In a sequential setting with unknown parameters θ, the transform is modified each time θ is modified and I wonder at the impact on computing cost if the inverse cdf is not available analytically. And I presume simulating the θ’s cannot benefit from quasi-Monte Carlo improvements.

The paper obviously cannot get into every detail, obviously, but I would also welcome indications on the cost of deriving the Hilbert curve, in particular in connection with the dimension d as it has to separate all of the N particles, and on the stopping rule on m that means only Hm is used.

Another question stands with the multiplicity of low discrepancy sequences and their impact on the overall convergence. If Art Owen’s (1997) nested scrambling leads to the best rate, as implied by Theorem 7, why should we ever consider another choice?

In connection with Lemma 1 and the sequential quasi-Monte Carlo approximation of the evidence, I wonder at any possible Rao-Blackwellisation using all proposed moves rather than only those accepted. I mean, from a quasi-Monte Carlo viewpoint, is Rao-Blackwellisation easier and is it of any significant interest?

What are the computing costs and gains for forward and backward sampling? They are not discussed there. I also fail to understand the trick at the end of 4.2.1, using SQMC on a single vector instead of (t+1) of them. Again assuming inverse cdfs are available? Any connection with the Polson et al.’s particle learning literature?

Last questions: what is the (learning) effort for lazy me to move to SQMC? Any hope of stepping outside particle filtering?