## sandwiching a marginal

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , on March 8, 2021 by xi'an When working recently on a paper for estimating the marginal likelihood, I was pointed out this earlier 2015 paper by Roger Grosse, Zoubin Ghahramani and Ryan Adams, which had escaped till now. The beginning of the paper discusses the shortcomings of importance sampling (when simulating from the prior) and harmonic mean (when simulating from the posterior) as solution. And of anNealed importance sampling (when simulating from a sequence, which sequence?!, of targets). The authors are ending up proposing a sequential Monte Carlo or (posterior) particle learning solution. A remark on annealed importance sampling is that there exist both a forward and a backward version for estimating the marginal likelihood, either starting from a simulation from the prior (easy) or from a simulation from the posterior (hard!). As in, e.g., Nicolas Chopin’s thesis, the intermediate steps are constructed from a subsample of the entire sample.

In this context, unbiasedness can be misleading: because partition function estimates can vary over many orders of magnitude, it’s common for an unbiased estimator to drastically underestimate Ζ with overwhelming probability, yet occasionally return extremely large estimates. (An extreme example is likelihood weighting, which is unbiased, but is extremely unlikely to give an accurate answer for a high-dimensional model.) Unless the estimator is chosen very carefully, the variance is likely to be extremely large, or even infinite.”

One novel aspect of the paper is to advocate for the simultaneous use of different methods and for producing both lower and upper bounds on the marginal p(y) and wait for them to get close enough. It is however delicate to find upper bounds, except when using the dreaded harmonic mean estimator.  (A nice trick associated with reverse annealed importance sampling is that the reverse chain can be simulated exactly from the posterior if associated with simulated data, except I am rather lost at the connection between the actual and simulated data.) In a sequential harmonic mean version, the authors also look at the dangers of using an harmonic mean but argue the potential infinite variance of the weights does not matter so much for log p(y), without displaying any variance calculation… The paper also contains a substantial experimental section that compares the different solutions evoked so far, plus others like nested sampling. Which did not work poorly in the experiment (see below) but could not be trusted to provide a lower or an upper bound. The computing time to achieve some level of agreement is however rather daunting. An interesting read definitely (and I wonder what happened to the paper in the end). ## Quasi-Monte Carlo sampling

Posted in Books, Kids, Statistics, Travel, University life, Wines with tags , , , , , , , , , , , , on December 10, 2014 by xi'an “The QMC algorithm forces us to write any simulation as an explicit function of uniform samples.” (p.8)

As posted a few days ago, Mathieu Gerber and Nicolas Chopin will read this afternoon a Paper to the Royal Statistical Society on their sequential quasi-Monte Carlo sampling paper.  Here are some comments on the paper that are preliminaries to my written discussion (to be sent before the slightly awkward deadline of Jan 2, 2015).

Quasi-Monte Carlo methods are definitely not popular within the (mainstream) statistical community, despite regular attempts by respected researchers like Art Owen and Pierre L’Écuyer to induce more use of those methods. It is thus to be hoped that the current attempt will be more successful, it being Read to the Royal Statistical Society being a major step towards a wide diffusion. I am looking forward to the collection of discussions that will result from the incoming afternoon (and bemoan once again having to miss it!).

“It is also the resampling step that makes the introduction of QMC into SMC sampling non-trivial.” (p.3)

At a mathematical level, the fact that randomised low discrepancy sequences produce both unbiased estimators and error rates of order $\mathfrak{O}(N^{-1}\log(N)^{d-}) \text{ at cost } \mathfrak{O}(N\log(N))$

means that randomised quasi-Monte Carlo methods should always be used, instead of regular Monte Carlo methods! So why is it not always used?! The difficulty stands [I think] in expressing the Monte Carlo estimators in terms of a deterministic function of a fixed number of uniforms (and possibly of past simulated values). At least this is why I never attempted at crossing the Rubicon into the quasi-Monte Carlo realm… And maybe also why the step had to appear in connection with particle filters, which can be seen as dynamic importance sampling methods and hence enjoy a local iid-ness that relates better to quasi-Monte Carlo integrators than single-chain MCMC algorithms.  For instance, each resampling step in a particle filter consists in a repeated multinomial generation, hence should have been turned into quasi-Monte Carlo ages ago. (However, rather than the basic solution drafted in Table 2, lower variance solutions like systematic and residual sampling have been proposed in the particle literature and I wonder if any of these is a special form of quasi-Monte Carlo.) In the present setting, the authors move further and apply quasi-Monte Carlo to the particles themselves. However, they still assume the deterministic transform $\mathbf{x}_t^n = \Gamma_t(\mathbf{x}_{t-1}^n,\mathbf{u}_{t}^n)$

which the q-block on which I stumbled each time I contemplated quasi-Monte Carlo… So the fundamental difficulty with the whole proposal is that the generation from the Markov proposal $m_t(\tilde{\mathbf{x}}_{t-1}^n,\cdot)$

has to be of the above form. Is the strength of this assumption discussed anywhere in the paper? All baseline distributions there are normal. And in the case it does not easily apply, what would the gain bw in only using the second step (i.e., quasi-Monte Carlo-ing the multinomial simulation from the empirical cdf)? In a sequential setting with unknown parameters θ, the transform is modified each time θ is modified and I wonder at the impact on computing cost if the inverse cdf is not available analytically. And I presume simulating the θ’s cannot benefit from quasi-Monte Carlo improvements.

The paper obviously cannot get into every detail, obviously, but I would also welcome indications on the cost of deriving the Hilbert curve, in particular in connection with the dimension d as it has to separate all of the N particles, and on the stopping rule on m that means only Hm is used.

Another question stands with the multiplicity of low discrepancy sequences and their impact on the overall convergence. If Art Owen’s (1997) nested scrambling leads to the best rate, as implied by Theorem 7, why should we ever consider another choice?

In connection with Lemma 1 and the sequential quasi-Monte Carlo approximation of the evidence, I wonder at any possible Rao-Blackwellisation using all proposed moves rather than only those accepted. I mean, from a quasi-Monte Carlo viewpoint, is Rao-Blackwellisation easier and is it of any significant interest?

What are the computing costs and gains for forward and backward sampling? They are not discussed there. I also fail to understand the trick at the end of 4.2.1, using SQMC on a single vector instead of (t+1) of them. Again assuming inverse cdfs are available? Any connection with the Polson et al.’s particle learning literature?

Last questions: what is the (learning) effort for lazy me to move to SQMC? Any hope of stepping outside particle filtering?

## Particle learning [rejoinder]

Posted in R, Statistics, University life with tags , , , , , , , , , , on November 10, 2010 by xi'an

Following the posting on arXiv of the Statistical Science paper of Carvalho et al., and the publication by the same authors in Bayesian Analysis of Particle Learning for general mixtures I noticed on Hedibert Lopes’ website his rejoinder to the discussion of his Valencia 9 paper has been posted. Since the discussion involved several points made by members of the CREST statistics lab (and covered the mixture paper as much as the Valencia 9 paper), I was quite eager to read Hedie’s reply. Unsurprisingly, this rejoinder is however unlikely to modify my reservations about particle learning. The following is a detailed examination of the arguments found in the rejoinder but requires a preliminary reading of the above papers as well as our discussion.. Continue reading

## On particle learning

Posted in R, Statistics, University life with tags , , on June 5, 2010 by xi'an

In connection with the Valencia 9 meeting that started yesterday, and with Hedie‘s talk there, we have posted on arXiv a set of comments on particle learning. The arXiv paper contains several discussions but they mostly focus on the inevitable degeneracy that accompanies particle systems. When Lopes et al. state that $p(Z^t|y^t)$ is not of interest as the filtered, low dimensional $p(Z_t|y^t)$ is sufficient for inference at time t, they seem to implicitly imply that the restriction of the simulation focus to a low dimensional vector is a way to avoid the degeneracy inherent to all particle filters. The particle learning algorithm therefore relies on an approximation of $p(Z^t|y^t)$ and the fact that this approximation quickly degenerates as t increases means that this approximation impacts the approximation of $p(Z_t|y^t)$. We show that, unless the size of the particle population exponentially increases with t, the sample of $Z_t$‘s will not be distributed as an iid sample from $p(Z_t|y^t)$. The graph above is an illustration of the degeneracy in the setup of a Poisson mixture with five components and 10,000 observations. The boxplots represent the variation of the evidence approximations based on a particle learning sample and Lopes et al. approximation, on a particle learning sample and Chib’s (1995) approximation, and on an MCMC sample and Chib’s (1995) approximation, for 250 replications. The differences are therefore quite severe when considering this number of observations. (I put the R code on my website for anyone who wants to check if I programmed things wrong.) There is no clear solution to the degeneracy problem, in my opinion, because the increase in the particle size overcoming degeneracy must be particularly high… We will be discussing that this morning.

On Monday, Paul Fearnhead and Benjamin Taylor reposted on arXiv a paper about adaptive SMC. It is as well since I had missed the first posting on Friday. While the method has some similarities with our earlier work on population Monte Carlo methods with Olivier Cappé, Randal Douc, Arnaud Guillin and Jean-Michel Marin, there are quite novel and interesting features in this paper!  First, the paper is firmly set within a sequential setup, as in Chopin (2002, Biometrika) and Del Moral, Doucet and Jasra (2006, JRSS B). This means considering a sequence of targets corresponding to likelihoods with increasing datasets. We mentioned this case as a possible implementation of population Monte Carlo but never truly experimented with this. Fearnhead and Taylor do set their method within this framework, using a sequence of populations (or particle systems) aimed at this moving sequence of targets. The second major difference is that, while they also use a mixture of transition kernels as their proposal (or importance functions) and while they also aim at optimising the parameters of those transitions (parameters that I would like to dub cyberparameters to distinguish them from the parameters of the statistical model), they do not update those cyberparameters in a deterministic way, as we do. On the opposite, they build a non-parametric approximation $\pi_t(h)$ to the distribution of those cyberparameters and simulate from those approximations at each step of the sequential algorithm, using a weight $f(\theta^{(j)}_{t-1},\theta^{(j)}_t)$ that assesses the quality of the move from $\theta^{(j)}_{t-1}$ to $\theta^{(j)}_{t}$, based on the simulated $h^{(j)}_t$. I like very much this aspect of the paper, in that the cyberparameters are part of the dynamics in the stochastic algorithm, a point I tried to implement since the (never published) controlled MCMC paper with Christophe Andrieu. As we do in our paper now published in Statistics and Computing, they further establish that this method is asymptotically improving the efficiency criterion at each step of the sequential procedure. The paper concludes with an extensive simulation study where Fearnhead and Taylor show that their implementation outperforms random walk with adaptive steps. (I am not very happy with their mixture example in that they resort to an ordering on the means…)