## multinomial resampling by Metropolis

Posted in Books, Statistics on December 28, 2017 by xi'an

A few years ago Lawrence Murray wrote a note on accelerating the resampling stage in particle filters by using a Metropolis step. And GPUs. The notion that Metropolis can be applied in this setting is at first puzzling since exact multinomial sampling is available. And Metropolis requires convergence guarantees. Which Lawrence covers by a Raftery and Lewis assessment, which has severe limitations in general but may well be adequate for this very case, although possibly too conservative in the number of recommended Metropolis iterations. The gain brought by Metropolis is that it does not require summing up all the particle weights, and as a result the gain is real in that Metropolis beats all other approaches (time-wise) when the number of particles is not too large and the heterogeneity of the weights not too high. (I did not know of this note until Richard Everitt brought it to my attention.)
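To make the point about not summing the weights concrete, here is a minimal sketch of a Metropolis resampler as I understand the idea: one short chain per particle, started at the particle's own index, with uniform proposals accepted on a ratio of two weights. The function name and the `n_steps` argument are mine, and the number of steps is left to the user rather than derived from a Raftery and Lewis type bound.

```python
import random

def metropolis_resample(weights, n_steps, rng=random):
    """Return one ancestor index per particle; weights need not be normalised."""
    n = len(weights)
    ancestors = []
    for i in range(n):
        k = i  # the chain attached to particle i starts at its own index
        for _ in range(n_steps):
            j = rng.randrange(n)  # uniform proposal on {0, ..., n-1}
            # accept with probability min(1, w_j / w_k): only a ratio is needed,
            # never the sum of all the weights
            if rng.random() * weights[k] <= weights[j]:
                k = j
        ancestors.append(k)
    return ancestors
```

Each of the n chains is independent of the others, which is presumably what makes the scheme attractive on a GPU, and the output is only approximately multinomial when `n_steps` is too small for the heterogeneity of the weights.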

## resampling methods

Posted in Books, pictures, Running, Statistics, Travel, University life on December 6, 2017 by xi'an

A paper that was arXived [and that I missed!] last summer is a work on resampling by Mathieu Gerber, Nicolas Chopin (CREST), and Nick Whiteley. Resampling is used to sample from a weighted empirical distribution and to correct for very small weights in a weighted sample that otherwise lead to degeneracy in sequential Monte Carlo (SMC). Since this step is based on random draws, it induces noise (while improving the estimation of the target); reducing this noise is preferable, hence the appeal of replacing plain multinomial sampling with more advanced schemes. The initial motivation is for sequential Monte Carlo where resampling is rife and seemingly compulsory, but this also applies to importance sampling when considering several schemes at once. I remember discussing alternative schemes with Nicolas, then completing his PhD, as well as Olivier Cappé, Randal Douc, and Eric Moulines at the time (circa 2004) we were working on the Hidden Markov book. And getting then a somewhat vague idea as to why systematic resampling failed to converge.

In this paper, Mathieu, Nicolas and Nick show that stratified sampling (where a uniform is generated on every interval of length 1/n) enjoys some form of consistency, while systematic sampling (where the “same” uniform is generated on every interval of length 1/n) does not necessarily enjoy this consistency. There actually exist cases where convergence does not occur. However, a residual version of systematic sampling (where systematic sampling is applied to the residuals of the decimal parts of the n-enlarged weights) is itself consistent.
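A rough sketch of the residual version described above, under my own reading of it: each particle first receives the integer part of its n-enlarged weight as deterministic copies, and the remaining slots are allocated by systematic sampling on the fractional parts. Function names are hypothetical and the code returns offspring counts rather than indices.

```python
import math
import random

def systematic_counts(probs, m, u):
    """Offspring counts when m points (u + j)/m are inverted through the cdf of probs."""
    counts = [0] * len(probs)
    cum, j = 0.0, 0
    for i, p in enumerate(probs):
        cum += p
        while j < m and (u + j) / m < cum:
            counts[i] += 1
            j += 1
    counts[-1] += m - j  # guard against rounding in the last cumulative sum
    return counts

def residual_systematic(weights, rng=random):
    """Offspring counts: integer parts first, systematic sampling on the residuals."""
    n = len(weights)
    total = sum(weights)
    scaled = [n * w / total for w in weights]   # the n-enlarged weights
    counts = [math.floor(s) for s in scaled]    # deterministic copies
    m = n - sum(counts)                         # slots still to be filled
    if m > 0:
        resid = [s - c for s, c in zip(scaled, counts)]
        r_total = sum(resid)
        probs = [r / r_total for r in resid]
        extra = systematic_counts(probs, m, rng.random())
        counts = [c + e for c, e in zip(counts, extra)]
    return counts
```

For weights (0.5, 0.25, 0.25) the first particle is copied once for sure and the two remaining slots are shared between the three particles, each count staying within one unit of n times its weight.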

The paper also studies the surprising feature uncovered by Kitagawa (1996) that stratified sampling applied to an ordered sample brings an error of O(1/n²) between the cdfs rather than the usual O(1/n). It took me a while to even understand the distinction between the original and the ordered version (maybe because Nicolas used the empirical cdf during his SAD (Stochastic Algorithm Day!) talk, ecdf that is the same for ordered and initial samples). And both systematic and deterministic sampling become consistent in this case. The result was shown in dimension one by Kitagawa (1996) but extends to larger dimensions via the magical trick of the Hilbert curve.
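For what the Hilbert trick amounts to in practice, here is a small sketch in dimension two only, using the classical bit-manipulation construction of the Hilbert index on a 2^order by 2^order grid. The function names and the assumption that particles live in the unit square are mine; the paper's construction handles general dimensions, which this does not attempt.

```python
def hilbert_index(x, y, order):
    """Position of grid cell (x, y) along a Hilbert curve on a 2**order by 2**order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # rotate or reflect so that the sub-curve has the standard orientation
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d

def hilbert_sort(points, order=10):
    """Sort points of [0,1)^2 by the Hilbert index of the grid cell they fall in."""
    n = 1 << order

    def key(p):
        cx = min(int(p[0] * n), n - 1)
        cy = min(int(p[1] * n), n - 1)
        return hilbert_index(cx, cy, order)

    return sorted(points, key=key)
```

Once the particles are sorted this way, a one-dimensional scheme such as stratified resampling can be run along the resulting order, neighbours in the order being neighbours in the plane.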

Posted in Statistics on August 30, 2016 by xi'an

Following Paul Russell’s talk at MCqMC 2016, I took a look at his recently arXived paper. In the plane to Sydney. The pseudo-code representation of the method is identical to our population Monte Carlo algorithm as is the suggestion to approximate the posterior by a mixture, but one novel aspect is to use Reich’s ensemble transportation at the resampling stage, in order to maximise the correlation between the original and the resampled versions of the particle systems. (As in our later versions of PMC, the authors also use as importance denominator the entire mixture rather than conditioning on the selected last-step particle.)

“The output of the resampling algorithm gives us a set of evenly weighted samples that we believe represents the target distribution well”

I disagree with this statement: Reweighting does not improve the quality of the posterior approximation, since it introduces more variability. If the original sample is found missing in its adequation to the target, so is the resampled one. Worse, by producing a sample with equal weights, this step may give a false impression of adequate representation…

Another unclear point in the paper relates to tuning the parameters of the mixture importance sampler. The paper discusses tuning these parameters during a burn-in stage, referring to “due to the constraints on adaptive MCMC algorithms”, which indeed is only pertinent for MCMC algorithms, since importance sampling can be constantly modified while remaining valid. This was a major point for advocating PMC. I am thus unsure what the authors mean by a burn-in period in such a context. Actually, I am also unsure about how they use effective sample size to select the new value of the importance parameter, e.g., the variance β in a random walk mixture: the effective sample size involves this variance implicitly through the realised sample hence changing β means changing the realised sample… This seems too costly to contemplate so I wonder at the way Figure 4.2 is produced.
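To spell out the cost issue, a toy sketch: the usual effective sample size is a function of the realised weights only, so evaluating it at a new β requires a new sample simulated under that β. The setting below (a N(0,1) target and a N(0,β) proposal) is purely hypothetical and is not the random walk mixture of the paper.

```python
import math
import random

def effective_sample_size(weights):
    """ESS = (sum w)^2 / sum w^2, for unnormalised importance weights."""
    s1 = sum(weights)
    s2 = sum(w * w for w in weights)
    return s1 * s1 / s2

def ess_for_scale(beta, n=2000, seed=0):
    """ESS when targeting N(0,1) with a N(0, beta) importance proposal."""
    rng = random.Random(seed)
    sd = math.sqrt(beta)
    xs = [rng.gauss(0.0, sd) for _ in range(n)]  # a fresh sample for each beta
    # log weight = log target - log proposal, up to an additive constant
    logw = [-0.5 * x * x + 0.5 * x * x / beta for x in xs]
    m = max(logw)
    return effective_sample_size([math.exp(l - m) for l in logw])
```

Scanning a grid of β values thus means one full simulation per value, which is the cost I am wondering about.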

“A popular approach for adaptive MCMC algorithms is to view the scaling parameter as a random variable which we can sample during the course of the MCMC iterations.”

While this is indeed an attractive notion [that I played with in the early days of adaptive MCMC, with the short-lived notion of cyber-parameters], I do not think it is of much help in optimising an MCMC algorithm, since the scaling parameter needs to be optimised, resulting in a time-inhomogeneous target. A more appropriate tool is thus stochastic optimisation à la Robbins-Monro, as exemplified in Andrieu and Moulines (2006). The paper however remains unclear as to how the scales are updated (see e.g. Section 4.2).
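As an illustration of what a Robbins-Monro update of the scale can look like, here is a generic one-dimensional sketch, not the specific algorithm of Andrieu and Moulines (2006) nor the one in the paper: the log-scale of a random walk Metropolis proposal is pushed up or down according to whether the current acceptance probability exceeds a target rate, with a decreasing gain. The target rate of 0.44 and the gain exponent of 0.6 are my own choices.

```python
import math
import random

def adaptive_rwm(log_target, x0, n_iter, target_rate=0.44, seed=0):
    """Random walk Metropolis in one dimension with a Robbins-Monro adapted scale."""
    rng = random.Random(seed)
    x, log_scale = x0, 0.0
    lp = log_target(x)
    chain = []
    for n in range(1, n_iter + 1):
        prop = x + math.exp(log_scale) * rng.gauss(0.0, 1.0)
        lp_prop = log_target(prop)
        accept_prob = math.exp(min(0.0, lp_prop - lp))
        if rng.random() < accept_prob:
            x, lp = prop, lp_prop
        # stochastic approximation step with decreasing gain n^(-0.6)
        log_scale += n ** -0.6 * (accept_prob - target_rate)
        chain.append(x)
    return chain, math.exp(log_scale)
```

For instance, `adaptive_rwm(lambda x: -0.5 * x * x, 0.0, 20000)` returns the chain and the final scale for a standard normal target. The scale is a tuning quantity being optimised here, not an extra random variable sampled along with the chain.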

“Ideally, we would like to use a resampling algorithm which is not prohibitively costly for moderately or large sized ensembles, which preserves the mean of the samples, and which makes it much harder for the new samples to forget a significant region in the density.”

The paper also misses out on the developments of the early 2000’s about more sophisticated resampling steps, especially Paul Fearnhead’s contributions (see also Nicolas Chopin’s thesis). There exist valid resampling methods that require a single uniform (0,1) to be drawn, rather than m. The proposed method has a flavour similar to systematic resampling, but I wonder at the validity of returning values that are averages of earlier simulations, since this modifies their distribution into ones with slimmer tails. (And it is parameterisation dependent.) Producing x_i with probability p_i is not the same as returning the average of the p_i x_i’s.
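A deliberately extreme toy example of this last remark, with three made-up particles: a draw from the weighted sample can land on either extreme value, while the weighted average is a single point with no spread at all. The method in the paper returns combinations of particles rather than one global average, so this only shows the direction of the effect.

```python
def resampled_variance(xs, ps):
    """Variance of a single draw X with P(X = x_i) = p_i."""
    mean = sum(p * x for x, p in zip(xs, ps))
    return sum(p * (x - mean) ** 2 for x, p in zip(xs, ps))

xs = [-3.0, 0.0, 3.0]    # hypothetical particle values
ps = [0.25, 0.5, 0.25]   # their normalised weights
mean = sum(p * x for x, p in zip(xs, ps))  # the weighted average, a single point
print(mean, resampled_variance(xs, ps))    # prints: 0.0 4.5
```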

## SPA 2015 Oxford [my day #2]

Posted in pictures, Statistics, Travel, University life on July 17, 2015 by xi'an

Today I [barely made it on a delayed train from Leamington Spa to Oxford as I] chaired my invited session at SPA 2015 on advanced MCMC methodology. The three speakers, Randal Douc, Mike Pitt and Matti Vihola, all gave talks related to the pseudo-marginal technique. For instance, Randal gave examples of guaranteed variance improvements by adding randomisation steps in the generation of the rv’s behind the unbiased estimation of the likelihood function. Mike Pitt presented the paper I discussed a little while ago about evaluating the computing performances of pseudo-marginal approximations, with a fairly compelling perspective [I may have missed from the paper] on approximating the distribution of the approximation to the log-likelihood as a normal. Which led me to ponder the ultimate version where the log-likelihood itself would get directly simulated in an MCMC algorithm bypassing the preliminary simulation of the parameters. Sounds a bit too fantasy-like to be of any use… Matti Vihola also presented recent results with Christophe Andrieu on comparing pseudo-marginal approximations, based on convex ordering properties. They included a domination result on ABC-MCMC algorithms, as noted in a recent post. Which got me musing about the overall importance of unbiasedness in the global picture, where all we need are converging approximations, in fine.

## discussions on Gerber and Chopin

Posted in Books, Kids, Statistics, University life on May 29, 2015 by xi'an

As a coincidence, I received my copy of JRSS Series B with the Read Paper by Mathieu Gerber and Nicolas Chopin on sequential quasi Monte Carlo just as I was preparing an arXival of a few discussions on the paper! Among the [numerous and diverse] discussions, a few were of particular interest to me [I highlighted members of the University of Warwick and of Université Paris-Dauphine to suggest potential biases!]:

1. Mike Pitt (Warwick), Murray Pollock et al. (Warwick) and Finke et al. (Warwick) all suggested combining quasi Monte Carlo with pseudomarginal Metropolis-Hastings, pMCMC (Pitt) and Rao-Blackwellisation (Finke et al.);
2. Arnaud Doucet pointed out that John Skilling had used the Hilbert (ordering) curve in a 2004 paper;
3. Chris Oates, Dan Simpson and Mark Girolami (Warwick) suggested combining quasi Monte Carlo with their functional control variate idea;
4. Richard Everitt wondered about the dimension barrier of d=6 and about possible slice extensions;
5. Zhijian He and Art Owen pointed out simple solutions to handle a random number of uniforms (for simulating each step in sequential Monte Carlo), namely to start with quasi Monte Carlo and end up with regular Monte Carlo, in a hybrid manner;
6. Hans Künsch points out the connection with systematic resampling à la Carpenter, Clifford and Fearnhead (1999) and wonders about separating the impact of quasi Monte Carlo between resampling and propagating [which vaguely links to one of my comments];
7. Pierre L’Ecuyer points out a possible improvement over the Hilbert curve by a preliminary sorting;
8. Frederik Lindsten and Sumeet Singh propose using ABC to extend the backward smoother to intractable cases [but still with a fixed number of uniforms to use at each step], as well as Mateu and Ryder (Paris-Dauphine) for a more general class of intractable models;
9. Omiros Papaspiliopoulos wonders at the possibility of a quasi Markov chain with “low discrepancy paths”;
10. Daniel Rudolf suggests linking the error rate of sequential quasi Monte Carlo with the bounds of Vapnik and Červonenkis (1977).

The arXiv document also includes the discussions by Julyan Arbel and Igor Prünster (Torino) on the Bayesian nonparametric side of sqMC and by Robin Ryder (Dauphine) on the potential of sqMC for ABC.

## Quasi-Monte Carlo sampling

Posted in Books, Kids, Statistics, Travel, University life, Wines on December 10, 2014 by xi'an

“The QMC algorithm forces us to write any simulation as an explicit function of uniform samples.” (p.8)

As posted a few days ago, Mathieu Gerber and Nicolas Chopin will read this afternoon a Paper to the Royal Statistical Society on their sequential quasi-Monte Carlo sampling paper.  Here are some comments on the paper that are preliminaries to my written discussion (to be sent before the slightly awkward deadline of Jan 2, 2015).

Quasi-Monte Carlo methods are definitely not popular within the (mainstream) statistical community, despite regular attempts by respected researchers like Art Owen and Pierre L’Écuyer to induce more use of those methods. It is thus to be hoped that the current attempt will be more successful, since being Read to the Royal Statistical Society is a major step towards a wide diffusion. I am looking forward to the collection of discussions that will result from this afternoon (and bemoan once again having to miss it!).

“It is also the resampling step that makes the introduction of QMC into SMC sampling non-trivial.” (p.3)

At a mathematical level, the fact that randomised low discrepancy sequences produce both unbiased estimators and error rates of order

$\mathfrak{O}(N^{-1}\log(N)^{d-1}) \text{ at cost } \mathfrak{O}(N\log(N))$

means that randomised quasi-Monte Carlo methods should always be used, instead of regular Monte Carlo methods! So why is it not always used?! The difficulty stands [I think] in expressing the Monte Carlo estimators in terms of a deterministic function of a fixed number of uniforms (and possibly of past simulated values). At least this is why I never attempted to cross the Rubicon into the quasi-Monte Carlo realm… And maybe also why the step had to appear in connection with particle filters, which can be seen as dynamic importance sampling methods and hence enjoy a local iid-ness that relates better to quasi-Monte Carlo integrators than single-chain MCMC algorithms. For instance, each resampling step in a particle filter consists in a repeated multinomial generation, hence should have been turned into quasi-Monte Carlo ages ago. (However, rather than the basic solution drafted in Table 2, lower variance solutions like systematic and residual sampling have been proposed in the particle literature and I wonder if any of these is a special form of quasi-Monte Carlo.) In the present setting, the authors move further and apply quasi-Monte Carlo to the particles themselves. However, they still assume the deterministic transform

$\mathbf{x}_t^n = \Gamma_t(\mathbf{x}_{t-1}^n,\mathbf{u}_{t}^n)$

which is the q-block on which I stumbled each time I contemplated quasi-Monte Carlo… So the fundamental difficulty with the whole proposal is that the generation from the Markov proposal

$m_t(\tilde{\mathbf{x}}_{t-1}^n,\cdot)$

has to be of the above form. Is the strength of this assumption discussed anywhere in the paper? All baseline distributions there are normal. And in the case it does not easily apply, what would the gain be in only using the second step (i.e., quasi-Monte Carlo-ing the multinomial simulation from the empirical cdf)? In a sequential setting with unknown parameters θ, the transform is modified each time θ is modified and I wonder at the impact on computing cost if the inverse cdf is not available analytically. And I presume simulating the θ’s cannot benefit from quasi-Monte Carlo improvements.
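To make the constraint tangible, here is a toy version of such a transform in the friendliest possible case, a Gaussian AR(1) kernel where the inverse cdf does all the work, fed with randomly shifted van der Corput points as a crude one-dimensional stand-in for a randomised low discrepancy sequence. Everything here (function names, the AR(1) kernel, the values of `rho` and `sigma`, the random shift) is my own choice for illustration; the actual algorithm sorts the particles and handles resampling and propagation jointly, which this sketch ignores.

```python
import random
from statistics import NormalDist

def van_der_corput(i, base=2):
    """i-th point of the van der Corput sequence (radical inverse of i in the given base)."""
    q, denom = 0.0, 1.0
    while i > 0:
        i, digit = divmod(i, base)
        denom *= base
        q += digit / denom
    return q

def shifted_points(n, rng):
    """First n van der Corput points, all shifted by one common uniform, modulo 1."""
    shift = rng.random()
    return [(van_der_corput(i) + shift) % 1.0 for i in range(n)]

def gamma_ar1(x_prev, u, rho=0.9, sigma=1.0):
    """Deterministic map (x_prev, u) -> x_t for a Gaussian AR(1) Markov kernel."""
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # keep the inverse cdf finite
    return rho * x_prev + sigma * NormalDist().inv_cdf(u)
```

A propagation step then reads `[gamma_ar1(x, u) for x, u in zip(particles, shifted_points(len(particles), rng))]`. When no closed-form inverse cdf exists, the body of `gamma_ar1` is exactly where the difficulty (and the computing cost) would sit.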

The paper obviously cannot get into every detail, but I would also welcome indications on the cost of deriving the Hilbert curve, in particular in connection with the dimension d as it has to separate all of the N particles, and on the stopping rule on m that means only H_m is used.

Another question stands with the multiplicity of low discrepancy sequences and their impact on the overall convergence. If Art Owen’s (1997) nested scrambling leads to the best rate, as implied by Theorem 7, why should we ever consider another choice?

In connection with Lemma 1 and the sequential quasi-Monte Carlo approximation of the evidence, I wonder at any possible Rao-Blackwellisation using all proposed moves rather than only those accepted. I mean, from a quasi-Monte Carlo viewpoint, is Rao-Blackwellisation easier and is it of any significant interest?

What are the computing costs and gains for forward and backward sampling? They are not discussed there. I also fail to understand the trick at the end of 4.2.1, using SQMC on a single vector instead of (t+1) of them. Again assuming inverse cdfs are available? Any connection with Polson et al.’s particle learning literature?

Last questions: what is the (learning) effort for lazy me to move to SQMC? Any hope of stepping outside particle filtering?

## resampling and [GPU] parallelism

Posted in Statistics, University life on March 13, 2012 by xi'an

In a recent note posted on arXiv, Lawrence Murray compares the implementation of resampling schemes for parallel systems like GPUs. Given a system of weighted particles, (x_i, ω_i), there are several ways of drawing a sample according to those weights:

1. regular multinomial resampling, where each point in the (new) sample is one of the x_i, with probability ω_i, meaning there is a uniform generated for each point;
2. stratified resampling, where the weights are added, divided into equal pieces and a uniform is sampled on each piece, which means that points with large weights are sampled at least once and those with small weights at most once;
3. systematic resampling, which is the same as the above except that the same uniform is used for each piece;
4. Metropolis resampling, where a Markov chain converges to the distribution (ω_1,…, ω_P) on {1,…,P}.
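The first three schemes in this list differ only in the uniforms that get inverted through the cumulative weights, as the following plain sketch tries to show (function names are mine, and nothing here is tuned for a GPU):

```python
import bisect
import itertools
import random

def _inverse_cdf(weights, points):
    """Map each point of [0,1) to the index whose cumulative-weight interval contains it."""
    total = sum(weights)
    cdf = list(itertools.accumulate(w / total for w in weights))
    last = len(weights) - 1  # guards against a final cdf value slightly below 1
    return [min(bisect.bisect_right(cdf, u), last) for u in points]

def multinomial(weights, rng=random):
    """One independent uniform per resampled point."""
    n = len(weights)
    return _inverse_cdf(weights, [rng.random() for _ in range(n)])

def stratified(weights, rng=random):
    """One independent uniform inside each interval of length 1/n."""
    n = len(weights)
    return _inverse_cdf(weights, [(i + rng.random()) / n for i in range(n)])

def systematic(weights, rng=random):
    """A single uniform, shifted to every interval of length 1/n."""
    n = len(weights)
    u = rng.random()
    return _inverse_cdf(weights, [(i + u) / n for i in range(n)])
```

All three need the full cumulative sum of the weights before any index can be returned.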

The first three resamplers are common in the particle system literature (incl. Nicolas Chopin’s PhD thesis), but difficult to adapt to GPUs (and I always feel uncomfortable with the fact that systematic uses a single uniform!), while the last one is more unusual, but actually well-fitted for a parallel implementation. While Lawrence Murray suggests using Raftery and Lewis’ (1992) assessment of the required number of Metropolis iterations to “achieve convergence”, I would instead suggest taking advantage of the toric nature of the space (as represented above) to run a random walk and wait for the equivalent of a complete cycle. In any case, this is a cool illustration of the new challenges posed by parallel implementations (like the development of proper random generators).