## Bayesian inference with intractable normalizing functions

Posted in Books, Statistics with tags , , , , , , , , , , , on December 13, 2018 by xi'an

In the latest September issue of JASA I received a few days ago, I spotted a review paper by Jaewoo Park & Murali Haran on intractable normalising constants Z(θ). There have been many proposals for solving this problem as well as several surveys, some conferences and even a book. The current survey focus on MCMC solutions, from auxiliary variable approaches to likelihood approximation algorithms (albeit without ABC entries, even though the 2006 auxiliary variable solutions of Møller et al. et of Murray et al. do simulate pseudo-observations and hence…). This includes the MCMC approximations to auxiliary sampling proposed by Faming Liang and co-authors across several papers. And the paper Yves Atchadé, Nicolas Lartillot and I wrote ten years ago on an adaptive MCMC targeting Z(θ) and using stochastic approximation à la Wang-Landau. Park & Haran stress the relevance of using sufficient statistics in this approach towards fighting computational costs, which makes me wonder if an ABC version could be envisioned.  The paper also includes pseudo-marginal techniques like Russian Roulette (once spelled Roullette) and noisy MCMC as proposed in Alquier et al.  (2016). These methods are compared on three examples: (1) the Ising model, (2) a social network model, the Florentine business dataset used in our original paper, and a larger one where most methods prove too costly, and (3) an attraction-repulsion point process model. In conclusion, an interesting survey, taking care to spell out the calibration requirements and the theoretical validation, if of course depending on the chosen benchmarks.

## Russian roulette still rolling

Posted in Statistics with tags , , , , , , , , , , , , on March 22, 2017 by xi'an

Colin Wei and Iain Murray arXived a new version of their paper on doubly-intractable distributions, which is to be presented at AISTATS. It builds upon the Russian roulette estimator of Lyne et al. (2015), which itself exploits the debiasing technique of McLeish et al. (2011) [found earlier in the physics literature as in Carter and Cashwell, 1975, according to the current paper]. Such an unbiased estimator of the inverse of the normalising constant can be used for pseudo-marginal MCMC, except that the estimator is sometimes negative and has to be so as proved by Pierre Jacob and co-authors. As I discussed in my post on the Russian roulette estimator, replacing the negative estimate with its absolute value does not seem right because a negative value indicates that the quantity is close to zero, hence replacing it with zero would sound more appropriate. Wei and Murray start from the property that, while the expectation of the importance weight is equal to the normalising constant, the expectation of the inverse of the importance weight converges to the inverse of the weight for an MCMC chain. This however sounds like an harmonic mean estimate because the property would also stand for any substitute to the importance density, as it only requires the density to integrate to one… As noted in the paper, the variance of the resulting Roulette estimator “will be high” or even infinite. Following Glynn et al. (2014), the authors build a coupled version of that solution, which key feature is to cut the higher order terms in the debiasing estimator. This does not guarantee finite variance or positivity of the estimate, though. In order to decrease the variance (assuming it is finite), backward coupling is introduced, with a Rao-Blackwellisation step using our 1996 Biometrika derivation. Which happens to be of lower cost than the standard Rao-Blackwellisation in that special case, O(N) versus O(N²), N being the stopping rule used in the debiasing estimator. Under the assumption that the inverse importance weight has finite expectation [wrt the importance density], the resulting backward-coupling Russian roulette estimator can be proven to be unbiased, as it enjoys a finite expectation. (As in the generalised harmonic mean case, the constraint imposes thinner tails on the importance function, which then hampers the convergence of the MCMC chain.) No mention is made of achieving finite variance for those estimators, which again is a serious concern due to the similarity with harmonic means…

## estimating constants [survey]

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , , , on February 2, 2017 by xi'an

A new survey on Bayesian inference with intractable normalising constants was posted on arXiv yesterday by Jaewoo Park and Murali Haran. A rather massive work of 58 pages, almost handy for a short course on the topic! In particular, it goes through the most common MCMC methods with a detailed description, followed by comments on components to be calibrated and the potential theoretical backup. This includes for instance the method of Liang et al. (2016) that I reviewed a few months ago. As well as the Wang-Landau technique we proposed with Yves Atchadé and Nicolas Lartillot. And the noisy MCMC of Alquier et al. (2016), also reviewed a few months ago. (The Russian Roulette solution is only mentioned very briefly as” computationally very expensive”. But still used in some illustrations. The whole area of pseudo-marginal MCMC is also missing from the picture.)

“…auxiliary variable approaches tend to be more efficient than likelihood approximation approaches, though efficiencies vary quite a bit…”

The authors distinguish between MCMC methods where the normalizing constant is approximated and those where it is omitted by an auxiliary representation. The survey also distinguishes between asymptotically exact and asymptotically inexact solutions. For instance, using a finite number of MCMC steps instead of the associated target results in an asymptotically inexact method. The question that remains open is what to do with the output, i.e., whether or not there is a way to correct for this error. In the illustration for the Ising model, the double Metropolis-Hastings version of Liang et al. (2010) achieves for instance massive computational gains, but also exhibits a persistent bias that would go undetected were it the sole method implemented. This aspect of approximate inference is not really explored in the paper, but constitutes a major issue for modern statistics (and machine learning as well, when inference is taken into account.)

In conclusion, this survey provides a serious exploration of recent MCMC methods. It begs for a second part involving particle filters, which have often proven to be faster and more efficient than MCMC methods, at least in state space models. In that regard, Nicolas Chopin and James Ridgway examined further techniques when calling to leave the Pima Indians [dataset] alone.

## pitfalls of nested Monte Carlo

Posted in Books, pictures, Statistics, University life with tags , , , , , on December 19, 2016 by xi'an

A few days ago, Tom Rainforth, Robert Cornish, Hongseok Yang, and Frank Wood from Oxford have arXived a paper on the limitations of nested Monte Carlo. By nested Monte Carlo [not nested sampling], they mean Monte Carlo techniques used to evaluate the expectation of a non-linear transform of an expectation, which often call for plug-in resolution. The main result is that this expectation cannot be evaluated by an unbiased estimator. Which is only mildly surprising. I do wonder if there still exist series solutions à la Glynn and Rhee, as in the Russian roulette version. Which is mentioned in a footnote. Or specially tuned versions, as suggested by some techniques found in Devroye’s book where the expectation of the exponential of another expectation is considered… (The paper is quite short, which may be correlated with the format imposed by some machine-learning conference proceedings like AISTATS.)

## a closed-form but intractable birthday

Posted in Books, Kids, pictures, Statistics with tags , , , , , on November 7, 2016 by xi'an

An interesting [at least for me!] question found on X Validated yesterday, namely how to simulate efficiently from the generalised birthday problem (or paradox) distribution, which provides the probability of finding exactly k different birthday dates as

$\mathbb{P}(V = k) = \binom{n}{k}\displaystyle\sum_{i=0}^k (-1)^i \binom{k}{i} \left(\frac{k-i}{n}\right)^m$

where m is the number of individuals with a random birthday and n the number of days (e.g., n=365). The paradox with this closed-form formula (found by the inclusion-exclusion rule) is that it is too unstable to use per se. While it is always possible to run m draws from a uniform over {1,…,n} and count the number of different values, e.g.,

x=length(unique(sample(1:n,m,rep=TRUE)))


this takes much more time than using the exact distribution, if available:

sample(1:m,1e6,rep=TRUE,prob=eff[-1])


I played a little bit with the notion of using an unbiased estimator of the said probability, but the alternating series means that the unbiased estimator may end up being negative, which is an issue met in recent related papers like the famous Russian Roulette.

## exact, unbiased, what else?!

Posted in Books, Statistics, University life with tags , , , , , , , , on April 13, 2016 by xi'an

Last week, Matias Quiroz, Mattias Villani, and Robert Kohn arXived a paper on exact subsampling MCMC, a paper that contributes to the current literature on approximating MCMC samplers for large datasets, in connection with an earlier paper of Quiroz et al. discussed here last week.

The “exact” in the title is to be understood in the Russian roulette sense. By using Rhee and Glynn debiaising device, the authors achieve an unbiased estimator of the likelihood as in Bardenet et al. (2015). The central tool for the derivation of an unbiased and positive estimator is to find a control variate for each component of the log likelihood that is good enough for the difference between the component and the control to be lower bounded. By the constant a in the screen capture above. When the individual terms d in the product are iid unbiased estimates of the log likelihood difference. And q is the sum of the control variates. Or maybe more accurately of the cheap substitutes to the exact log likelihood components. Thus still of complexity O(n), which makes the application to tall data more difficult to contemplate.

The \$64 question is obviously how to produce cheap and efficient control variates that kill the curse of the tall data. (It still irks to resort to this term of control variate, really!) Section 3.2 in the paper suggests clustering the data and building an approximation for each cluster, which seems to imply manipulating the whole dataset at this early stage. At a cost of O(Knd). Furthermore, because finding a correct lower bound a is close to impossible in practice, the authors use a “soft lower bound”, meaning that it is only an approximation and thus that (3.4) above can get negative from time to time, which cancels the validation of the method as a pseudo-marginal approach. The resolution of this difficulty is to resort to the same proxy as in the Russian roulette paper, replacing the unbiased estimator with its absolute value, an answer I already discussed for the Russian roulette paper. An additional step is proposed by Quiroz et al., namely correlating the random numbers between numerator and denominator in their final importance sampling estimator, via a Gaussian copula as in Deligiannidis et al.

This paper made me wonder (idly wonder, mind!) anew how to get rid of the vexing unbiasedness requirement. From a statistical and especially from a Bayesian perspective, unbiasedness is a second order property that cannot be achieved for most transforms of the parameter θ. And that does not keep under reparameterisation. It is thus vexing and perplexing that unbiased is so central to the validation of our Monte Carlo technique and that any divergence from this canon leaves us wandering blindly with no guarantee of ever reaching the target of the simulation experiment…

## more e’s [and R’s]

Posted in Kids, pictures, R, Statistics with tags , , , , , , , on February 22, 2016 by xi'an

Alex Thiéry suggested debiasing the biased estimate of e by Rhee and Glynn truncated series method, so I tried the method to see how much of an improvement (if any!) this would bring. I first attempted to naïvely implement the raw formula of Rhee and Glynn

$\hat{\mathfrak{e}} = \sum_{n=1}^N \{\hat{e}_{n+1}-\hat{e}_n\}\big/\mathbb{P}(N\ge n)$

with a (large) Poisson distribution on the stopping rule N, but this took ages. I then realised that the index n did not have to be absolute, i.e. to start at n=1 and proceed snailwise one integer at a time: the formula remains equally valid after a change of time, i.e. n=can start at an arbitrary value and proceeds by steps of arbitrary size, which obviously speeds things up! Continue reading