## unbiased estimation of log-normalising constants

Posted in Statistics with tags , , , , , , , on October 16, 2018 by xi'an

Maxime Rischard, Pierre Jacob, and Natesh Pillai [warning: both of whom are co-authors and friends of mine!] have just arXived a paper on the use of path sampling (a.k.a., thermodynamic integration) for log-constant unbiased approximation and the resulting consequences on Bayesian model comparison by X validation. If the goal is the estimation of the log of a ratio of two constants, creating an artificial path between the corresponding distributions and looking at the derivative at any point of this path of the log-density produces an unbiased estimator. Meaning that random sampling along the path, corrected by the distribution of the sampling still produces an unbiased estimator. From there the authors derive an unbiased estimator for any X validation objective function, CV(V,T)=-log p(V|T), taking m observations T in and leaving n-m observations T out… The marginal conditional log density in the criterion is indeed estimated by an unbiased path sampler, using a powered conditional likelihood. And unbiased MCMC schemes à la Jacob et al. for simulating unbiased MCMC realisations of the intermediary targets on the path. Tuning it towards an approximately constant cost for all powers.

So in all objectivity and fairness (!!!), I am quite excited by this new proposal within my favourite area! Or rather two areas since it brings together the estimation of constants and an alternative to Bayes factors for Bayesian testing. (Although the paper does not broach upon the calibration of the X validation values.)

## new estimators of evidence

Posted in Books, Statistics with tags , , , , , , , , , , , , on June 19, 2018 by xi'an

In an incredible accumulation of coincidences, I came across yet another paper about evidence and the harmonic mean challenge, by Yu-Bo Wang, Ming-Hui Chen [same as in Chen, Shao, Ibrahim], Lynn Kuo, and Paul O. Lewis this time, published in Bayesian Analysis. (Disclaimer: I was not involved in the reviews of any of these papers!)  Authors who arelocated in Storrs, Connecticut, in geographic and thematic connection with the original Gelfand and Dey (1994) paper! (Private joke about the Old Man of Storr in above picture!)

“The working parameter space is essentially the constrained support considered by Robert and Wraith (2009) and Marin and Robert (2010).”

The central idea is to use a more general function than our HPD restricted prior but still with a known integral. Not in the sense of control variates, though. The function of choice is a weighted sum of indicators of terms of a finite partition, which implies a compact parameter set Ω. Or a form of HPD region, although it is unclear when the volume can be derived. While the consistency of the estimator of the inverse normalising constant [based on an MCMC sample] is unsurprising, the more advanced part of the paper is about finding the optimal sequence of weights, as in control variates. But it is also unsurprising in that the weights are proportional to the inverses of the inverse posteriors over the sets in the partition. Since these are hard to derive in practice, the authors come up with a fairly interesting alternative, which is to take the value of the posterior at an arbitrary point of the relevant set.

The paper also contains an extension replacing the weights with functions that are integrable and with known integrals. Which is hard for most choices, even though it contains the regular harmonic mean estimator as a special case. And should also suffer from the curse of dimension when the constraint to keep the target almost constant is implemented (as in Figure 1).

The method, when properly calibrated, does much better than harmonic mean (not a surprise) and than Petris and Tardella (2007) alternative, but no other technique, on toy problems like Normal, Normal mixture, and probit regression with three covariates (no Pima Indians this time!). As an aside I find it hard to understand how the regular harmonic mean estimator takes longer than this more advanced version, which should require more calibration. But I find it hard to see a general application of the principle, because the partition needs to be chosen in terms of the target. Embedded balls cannot work for every possible problem, even with ex-post standardisation.

## afternoon on Bayesian computation

Posted in Statistics, Travel, University life with tags , , , , , , , , , , , , , on April 6, 2016 by xi'an

Richard Everitt organises an afternoon workshop on Bayesian computation in Reading, UK, on April 19, the day before the Estimating Constant workshop in Warwick, following a successful afternoon last year. Here is the programme:

1230-1315  Antonietta Mira, Università della Svizzera italiana
1315-1345  Ingmar Schuster, Université Paris-Dauphine
1345-1415  Francois-Xavier Briol, University of Warwick
1415-1445  Jack Baker, University of Lancaster
1445-1515  Alexander Mihailov, University of Reading
1515-1545  Coffee break
1545-1630  Arnaud Doucet, University of Oxford
1630-1700  Philip Maybank, University of Reading
1700-1730  Elske van der Vaart, University of Reading
1730-1800  Reham Badawy, Aston University
1815-late  Pub and food (SCR, UoR campus)

and the general abstract:

The Bayesian approach to statistical inference has seen major successes in the past twenty years, finding application in many areas of science, engineering, finance and elsewhere. The main drivers of these successes were developments in Monte Carlo methods and the wide availability of desktop computers. More recently, the use of standard Monte Carlo methods has become infeasible due the size and complexity of data now available. This has been countered by the development of next-generation Monte Carlo techniques, which are the topic of this meeting.

The meeting takes place in the Nike Lecture Theatre, Agriculture Building [building number 59].

## rediscovering the harmonic mean estimator

Posted in Kids, Statistics, University life with tags , , , , , , , on November 10, 2015 by xi'an

When looking at unanswered questions on X validated, I came across a question where the author wanted to approximate a normalising constant

$N=\int g(x)\,\text{d}x\,,$

while simulating from the associated density, g. While seemingly unaware of the (huge) literature in the area, he re-derived [a version of] the harmonic mean estimate by considering the [inverted importance sampling] identity

$\int_\mathcal{X} \dfrac{\alpha(x)}{g(x)}p(x) \,\text{d}x=\int_\mathcal{X} \dfrac{\alpha(x)}{N} \,\text{d}x=\dfrac{1}{N}$

when α is a probability density and by using for α the uniform over the whole range of the simulations from g. This choice of α obviously leads to an estimator with infinite variance when the support of g is unbounded, but the idea can be easily salvaged by using instead another uniform distribution, for instance on an highest density region, as we studied in our papers with Darren Wraith and Jean-Michel Marin. (Unfortunately, the originator of the question does not seem any longer interested in the problem.)

## probabilistic numerics and uncertainty in computations

Posted in Books, pictures, Statistics, University life with tags , , , , , , on June 10, 2015 by xi'an

“We deliver a call to arms for probabilistic numerical methods: algorithms for numerical tasks, including linear algebra, integration, optimization and solving differential equations, that return uncertainties in their calculations.” (p.1)

Philipp Hennig, Michael Osborne and Mark Girolami (Warwick) posted on arXiv a paper to appear in Proceedings A of the Royal Statistical Society that relates to the probabilistic numerics workshop they organised in Warwick with Chris Oates two months ago. The paper is both a survey and a tribune about the related questions the authors find of most interest. The overall perspective is proceeding along Persi Diaconis’ call for a principled Bayesian approach to numerical problems. One interesting argument made from the start of the paper is that numerical methods can be seen as inferential rules, in that a numerical approximation of a deterministic quantity like an integral can be interpreted as an estimate, even as a Bayes estimate if a prior is used on the space of integrals. I am always uncertain about this perspective, as for instance illustrated in the post about the missing constant in Larry Wasserman’s paradox. The approximation may look formally the same as an estimate, but there is a design aspect that is almost always attached to numerical approximations and rarely analysed as such. Not mentioning the somewhat philosophical issue that the integral itself is a constant with no uncertainty (while a statistical model should always entertain the notion that a model can be mis-specified). The distinction explains why there is a zero variance importance sampling estimator, while there is no uniformly zero variance estimator in most parametric models. At a possibly deeper level, the debate that still invades the use of Bayesian inference to solve statistical problems would most likely resurface in numerics, in that the significance of a probability statement surrounding a mathematical quantity can only be epistemic and relate to the knowledge (or lack thereof) about this quantity rather than to the quantity itself.

“(…) formulating quadrature as probabilistic regression precisely captures a trade-off between prior assumptions inherent in a computation and the computational effort required in that computation to achieve a certain precision. Computational rules arising from a strongly constrained hypothesis class can perform much better than less restrictive rules if the prior assumptions are valid.” (p.7)

Another general worry [repeating myself] about setting a prior in those functional spaces is that the posterior may then mostly reflect the choice of the prior rather than the information contained in the “data”. The above quote mentions prior assumptions that seem hard to build from prior opinion about the functional of interest. And even less about the function itself. Coming back from a gathering of “objective Bayesians“, it seems equally hard to agree upon a reference prior. However, since I like the alternative notion of using decision theory in conjunction with probabilistic numerics, it seems hard to object to the use of priors, given the “invariance” of prior x loss… But I would like to understand better how it is possible to check for prior assumption (p.7) without using the data. Or maybe it does not matter so much in this setting? Unlikely, as indicated in the remarks about the bias resulting from the active design (p.13).

A last issue I find related to the exploratory side of the paper is the “big world versus small worlds” debate, namely whether we can use the Bayesian approach to solve a sequence of small problems rather than trying to solve the big problem all at once. Which forces us to model the entirety of unknowns. And almost certainly fail. (This is was the point of the Robbins-Wasserman counterexample.) Adopting a sequence of solutions may be construed as incoherent in that the prior distribution is adapted to the problem rather than encompassing all problems. Although this would not shock the proponents of reference priors.