Archive for harmonic mean estimator

WBIC, practically

Posted in Statistics on October 20, 2017 by xi'an

“Thus far, WBIC has received no more than a cursory mention by Gelman et al. (2013)”

I had missed this 2015 paper by Nial Friel and co-authors on a practical investigation of Watanabe's WBIC. Where WBIC stands for widely applicable Bayesian information criterion. The thermodynamic integration approach explored by Nial and some co-authors for the approximation of the evidence, which expresses the log-evidence as an integral, between temperatures t=0 and t=1, of the expected log-likelihood under the power posterior, is eminently suited for WBIC, as the widely applicable Bayesian information criterion is associated with the specific temperature t⁰ that makes the power posterior equidistant, Kullback-Leibler-wise, from the prior and posterior distributions. And the expectation of the log-likelihood under this very power posterior equal to the (genuine) log-evidence. In fact, WBIC is often associated with the sub-optimal temperature 1/log(n), where n is the (effective?) sample size. (By comparison, in case my minimalist description is unclear, thermodynamic integration requires a whole range of temperatures and associated MCMC runs!) In an ideal Gaussian setting, WBIC improves considerably over thermodynamic integration, the larger the sample the better. In more realistic settings, though, including a simple regression and a logistic [Pima Indians!] model comparison, thermodynamic integration may do better for a given computational cost, although the paper is unclear about these costs. The paper also runs a comparison with harmonic mean and nested sampling approximations. Since the integral of interest involves a power of the likelihood, I wonder if a safe version of the harmonic mean resolution can be derived from simulations of the genuine posterior. Provided the exact temperature t⁰ is known…
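For concreteness, here is a minimal R sketch of the WBIC recipe in a toy conjugate Gaussian setting of my own (not one of the paper's examples), where the power posterior at temperature 1/log(n) is available in closed form and the exact log-evidence is computable for comparison:

# toy WBIC illustration (my own choices): x_i ~ N(theta,1), prior theta ~ N(0,1)
set.seed(1)
n <- 100
x <- rnorm(n, mean = 0.5)
beta <- 1 / log(n)  # the WBIC temperature
# power posterior prior(theta) x likelihood(theta)^beta is N(m, s2) here
m <- beta * sum(x) / (beta * n + 1)
s2 <- 1 / (beta * n + 1)
theta <- rnorm(1e4, m, sqrt(s2))  # exact draws from the power posterior
loglik <- sapply(theta, function(t) sum(dnorm(x, t, 1, log = TRUE)))
wbic <- mean(loglik)  # WBIC: expected log-likelihood under the power posterior
# exact log-evidence, since marginally x ~ N(0, identity + all-ones matrix)
exact <- -0.5 * (n * log(2 * pi) + log(1 + n) + sum(x^2) - sum(x)^2 / (1 + n))
c(wbic = wbic, exact = exact)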

Russian roulette still rolling

Posted in Statistics on March 22, 2017 by xi'an

Colin Wei and Iain Murray arXived a new version of their paper on doubly-intractable distributions, which is to be presented at AISTATS. It builds upon the Russian roulette estimator of Lyne et al. (2015), which itself exploits the debiasing technique of McLeish et al. (2011) [found earlier in the physics literature, as in Carter and Cashwell, 1975, according to the current paper]. Such an unbiased estimator of the inverse of the normalising constant can be used for pseudo-marginal MCMC, except that the estimator is sometimes negative, and unavoidably so, as proved by Pierre Jacob and co-authors. As I discussed in my post on the Russian roulette estimator, replacing the negative estimate with its absolute value does not seem right because a negative value indicates that the quantity is close to zero, hence replacing it with zero would sound more appropriate. Wei and Murray start from the property that, while the expectation of the importance weight is equal to the normalising constant, the average of the inverse importance weights along an MCMC chain targeting the posterior converges to the inverse of the normalising constant. This however sounds like an harmonic mean estimate, because the property would also stand for any substitute to the importance density, as it only requires the density to integrate to one… As noted in the paper, the variance of the resulting roulette estimator "will be high" or even infinite. Following Glynn et al. (2014), the authors build a coupled version of that solution, whose key feature is to cut the higher-order terms in the debiasing estimator. This does not guarantee finite variance or positivity of the estimate, though. In order to decrease the variance (assuming it is finite), backward coupling is introduced, with a Rao-Blackwellisation step using our 1996 Biometrika derivation. Which happens to be of lower cost than the standard Rao-Blackwellisation in that special case, O(N) versus O(N²), N being the stopping time used in the debiasing estimator. Under the assumption that the inverse importance weight has finite expectation [wrt the importance density], the resulting backward-coupling Russian roulette estimator can be proven to be unbiased, as it enjoys a finite expectation. (As in the generalised harmonic mean case, the constraint imposes thinner tails on the importance function, which then hampers the convergence of the MCMC chain.) No mention is made of achieving finite variance for those estimators, which again is a serious concern due to the similarity with harmonic means…
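To make the construction concrete, here is a bare-bones R sketch of a Russian roulette estimator of 1/Z built on the geometric series 1/Z = (1/c)∑ₖ(1−Z/c)ᵏ, with a fresh unbiased estimate of Z plugged into each factor; the scaling constant c, the survival probability q, and the toy noise model are arbitrary choices of mine, not the tuning used in the paper:

# Russian roulette estimator of 1/Z (toy sketch); needs |1 - Z/c| < 1, not checked
roulette_inverse <- function(draw_Zhat, c, q = 0.5) {
  total <- 1  # the k = 0 term of the series
  term <- 1
  w <- 1
  while (runif(1) < q) {  # survive another round of the roulette
    w <- w / q            # reweight by the survival probability
    term <- term * (1 - draw_Zhat() / c)  # fresh unbiased estimate of Z per factor
    total <- total + w * term
  }
  total / c
}
# toy check with Z = 2, noisy unbiased estimates of Z, and c = 4
set.seed(42)
est <- replicate(1e4, roulette_inverse(function() 2 + rnorm(1, 0, 0.1), c = 4))
mean(est)  # close to 1/Z = 0.5

With noisier estimates of Z, some factors (1−Ẑ/c) turn negative and so may the whole estimator, which is the sign issue discussed above.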

nested sampling for phylogenies

Posted in Statistics on March 3, 2017 by xi'an

“Methods to estimate the marginal likelihood should be sensitive to the prior choice. Non-informative priors should increase the contribution of low-likelihood regions of parameter space in the estimated marginal likelihood. Consequently, the prior choice should affect the estimated evidence.”

In a most recent arXival, Maturana, Brewer, and Klaere discuss the appeal of nested sampling for conducting model choice in phylogenetic models. In comparison with the "generalized steppingstone sampling" method, which represents the evidence as a product of ratios of evidences (Fan et al., 2011). And which I do not think I had previously met, with all references provided therein relating to Bayesian phylogenetics, apparently. The stepping stone approach relies on a sequence of tempered targets, moving from a reference distribution to the real target as a temperature β goes from zero to one. (The paper also mentions thermodynamic integration as too costly.) Nested sampling—much discussed on this blog!—is presented in this paper as having the ability to deal with partly convex likelihoods, although I do not really get how or why. (As there is nothing new in the fairly pedagogical presentation of nested sampling therein.) Nothing appears to be mentioned about the difficulty of handling multimodality, as isolated high-likelihood regions are unlikely to be sampled under poorly weighted priors (by which I mean that a region with significant likelihood mass is unlikely to get sampled if the prior distribution gives it little weight). The novelty in the paper is to compare nested sampling with generalized steppingstone sampling and path sampling on several phylogenetic examples. I did not spot computing times mentioned there. As usual with examples, my reservation is that conclusions drawn from one particular analysis of one (or three) particular example(s) do not convey a general message about the power and generality of a method. Even though I acknowledge that nested sampling is in principle operational in wide generality.
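For the record, here is a minimal R sketch of the (plain, prior-referenced) stepping stone identity in a conjugate Gaussian toy model of my own, each ratio of powered evidences being estimated by importance sampling from the previous tempered target; the generalized version of Fan et al. (2011) differs, as I understand it, in replacing the prior with a reference distribution closer to the posterior:

# stepping stone toy (mine): x_i ~ N(theta,1), prior theta ~ N(0,1), so the tempered
# target at inverse temperature b is N(b*sum(x)/(b*n+1), 1/(b*n+1)) in closed form
set.seed(1)
n <- 50
x <- rnorm(n, mean = 0.5)
loglik <- function(t) sum(dnorm(x, t, 1, log = TRUE))
betas <- seq(0, 1, length.out = 21)^3  # schedule crowded near zero, as is customary
M <- 1e4
log_ratio <- function(k) {  # log of Z(beta_k)/Z(beta_{k-1}) by importance sampling
  b <- betas[k - 1]
  th <- rnorm(M, b * sum(x) / (b * n + 1), sqrt(1 / (b * n + 1)))
  d <- (betas[k] - b) * sapply(th, loglik)
  max(d) + log(mean(exp(d - max(d))))  # stabilised log-mean-exp
}
ss <- sum(sapply(2:length(betas), log_ratio))  # stepping stone log-evidence
exact <- -0.5 * (n * log(2 * pi) + log(1 + n) + sum(x^2) - sum(x)^2 / (1 + n))
c(steppingstone = ss, exact = exact)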

puzzled by harmony [not!]

Posted in Books, Kids, Mountains, pictures, R, Running, Statistics, Travel on December 13, 2016 by xi'an

In answering yet another question on X validated about the numerical approximation of the marginal likelihood, I suggested using an harmonic mean estimate as a simple but worthless solution based on an MCMC posterior sample. This was on a toy example with a uniform prior on (0,π) and a “likelihood” equal to sin(θ) [really a toy problem!]. Simulating an MCMC chain by a random walk Metropolis-Hastings algorithm is straightforward, as is returning the harmonic mean of the sin(θ)’s.
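In case a reminder helps, the harmonic mean identity justifying the final line of the code below is simply, for a posterior with density sin(θ)/(πZ) on (0,π) and evidence Z,

$$\mathbb{E}^{\pi}\!\left[\frac{1}{\sin\theta}\right]=\int_0^\pi\frac{1}{\sin\theta}\,\frac{\sin\theta}{\pi Z}\,\mathrm d\theta=\frac{1}{Z},\qquad\text{so}\qquad \pi\,\frac{N}{\sum_{i=1}^N 1/\sin\theta_i}\;\approx\;\pi Z=\int_0^\pi\sin\theta\,\mathrm d\theta=2.$$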

# "likelihood" sin(theta) on (0, pi), zero outside
f <- function(x) if (0 < x && x < pi) sin(x) else 0

n <- 2000                    # number of MCMC iterations
sigma <- 0.5                 # random walk scale
x <- runif(1, 0, pi)         # initial value
fx <- f(x)
chain <- numeric(n)          # stores the f(x) values along the chain
chain[1] <- fx
rands <- rnorm(n, 0, sigma)  # random walk increments
# random walk Metropolis-Hastings, with flat prior on (0, pi)
for (i in 2:n) {
  can <- x + rands[i]        # candidate move
  fcan <- f(can)
  aprob <- fcan / fx         # acceptance probability (symmetric proposal)
  if (runif(1) < aprob) {
    x <- can
    fx <- fcan
  }
  chain[i] <- fx
}
I <- pi * n / sum(1 / chain) # harmonic mean approximation of the integral

However, the outcome looks remarkably stable and close to the expected value of 2 for the integral (that is, 2/π for the evidence), despite 1/sin(θ) having an infinite integral on (0,π), meaning that the average of the 1/sin(θ)'s has infinite variance. Hence I wonder why this specific example does not lead to an unreliable output… But re-running the chain with a smaller scale σ starts producing values of sin(θ) regularly closer to zero, which leads to an estimate of I both farther away from 2 and much more variable. No miracle, in the end!
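For the record, the infinite variance claim is a one-line computation, since the posterior density is sin(θ)/(πZ) with Z = 2/π:

$$\mathbb{E}^{\pi}\!\left[\frac{1}{\sin^2\theta}\right]=\int_0^\pi\frac{1}{\sin^2\theta}\,\frac{\sin\theta}{\pi Z}\,\mathrm d\theta=\frac{1}{\pi Z}\int_0^\pi\frac{\mathrm d\theta}{\sin\theta}=+\infty,$$

the integral diverging (logarithmically) at both endpoints, while the first moment 1/Z remains finite, which is all the harmonic mean identity requires.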

ISBA 2016 [#2]

Posted in Books, pictures, Running, Statistics, Travel, University life, Wines on June 15, 2016 by xi'an

Today I attended Persi Diaconis' de Finetti ISBA Lecture, and not only because I was an invited discussant, by all means!!! Persi was discussing his views on Bayesian numerical analysis. As already expressed in his 1988 paper. Which now appears as a foundational precursor to probabilistic numerics. And which is why I had a very easy time in preparing my discussion, as I mostly borrowed from my NIPS slides. With some degree of legitimacy, since I was already a discussant there. Anyway, here is the most novel slide in the discussion, built upon my realisation that the principle behind nested sampling is fairly generic for integral approximation, rather than being restricted to marginal likelihood approximation.

[slide from my discussion of Persi's talk]

Among many interesting things, Persi's talk made me think anew about infinite variance importance sampling. And about the paper by Sourav Chatterjee and Persi that I discussed a few months ago. In that some regularisation of those "useless" importance estimates can stem from prior modelling. Not as an aside, let me add I am very grateful to the ISBA 2016 organisers and to the chair of the de Finetti lecture committee for their invitation to discuss this talk!

commentaries in financial econometrics

Posted in Books, Statistics, University life on April 27, 2016 by xi'an

My comment(arie)s on the moment approach to Bayesian inference by Ron Gallant have appeared, along with other comment(arie)s:

Invited Article
Reflections on the Probability Space Induced by Moment Conditions with Implications for Bayesian Inference, by A. Ronald Gallant (p. 229)

Commentaries
Dante Amengual and Enrique Sentana (p. 248)
John Geweke (p. 253)
Jae-Young Kim (p. 258)
Oliver Linton and Ruochen Wu (p. 261)
Christian P. Robert (p. 265)
Christopher A. Sims (p. 272)
Wei Wei and Asger Lunde (p. 278)

Author Response
A. Ronald Gallant (p. 284)

[formula (4) in Gallant's paper]

While commenting on commentaries is formally bound to induce an infinite loop [or l∞p], I remain puzzled by the main point of the paper, which is that setting a structural distribution on a moment function Z(x,θ) plus a prior p(θ) induces a distribution on the pair (x,θ) in a possibly weaker σ-algebra. (The two distributions may actually be incompatible.) Handling this framework requires checking that a posterior exists, which sounds rather unnatural (even though we also have to check properness of the posterior). And the meaning of such a posterior remains unclear, as for instance in this assertion that (4) above is a likelihood, when it does not define a density in x but on the object inside the exponential.

“…it is typically difficult to determine whether there exists a p(x|θ) such that the implied distribution of m(x,θ) is the one stated, and if not, what damage is done thereby” J. Geweke (p.254)


estimating constants [impression soleil levant]

Posted in pictures, Running, Statistics, Travel, University life on April 25, 2016 by xi'an

The CRiSM workshop on estimating constants, which took place here in Warwick from April 20 till April 22, was quite enjoyable [says most objectively one of the organisers!], with all speakers present to deliver their talks (!) and around sixty participants, including 17 posters. It remains an exciting aspect of the field that so many and so different perspectives are available on the "doubly intractable" problem of estimating a normalising constant. Several talks and posters concentrated on Ising models, which always sound a bit artificial to me, but are also perfect testing grounds for approximations to classical algorithms.

On top of [clearly interesting!] talks associated with papers I had already read [and commented here], I had not previously heard about Pierre Jacob's coupling SMC sequence, whose paper is not yet out [no spoiler then!]. Or about Michael Betancourt's adiabatic Monte Carlo and its connection with the normalising constant. Nicolas Chopin talked about the unnormalised Poisson process I discussed a while ago, with this feature that the normalising constant itself becomes an additional parameter. And that integration can be replaced with (likelihood) maximisation. The approach, which is based on a reference distribution (and an artificial logistic regression à la Geyer), reminded me of bridge sampling. And indirectly of path sampling, especially when Merrilee Hurn gave us a very cool introduction to power posteriors in the following talk. Also mentioning the controlled thermodynamic integration of Chris Oates and co-authors I discussed a while ago. (Too bad that Chris Oates could not make it to this workshop!) And also pointing out that thermodynamic integration could be a feasible alternative to nested sampling.
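As an aside, the artificial logistic regression à la Geyer fits in a few lines of R: in the toy sketch below (my own choices of target and reference throughout, not the construction of any of the above talks), the intercept of a logistic fit with the log-ratio of unnormalised target to reference as an offset recovers −log Z:

# Geyer-style logistic regression for a normalising constant (toy sketch, mine):
# unnormalised target g(x) = exp(-x^2/2), true Z = sqrt(2*pi); reference q = N(0,2)
set.seed(1)
n <- 1e5
x1 <- rnorm(n)              # draws from the target (available here, for the toy)
x0 <- rnorm(n, 0, sqrt(2))  # draws from the reference
x <- c(x1, x0)
y <- rep(1:0, each = n)     # 1 = target draw, 0 = reference draw
log_g <- -x^2 / 2           # log unnormalised target
log_q <- dnorm(x, 0, sqrt(2), log = TRUE)
# P(y=1|x) = logistic(log g(x) - log Z - log q(x)): with offset log_g - log_q,
# the fitted intercept estimates -log Z
fit <- glm(y ~ 1, offset = log_g - log_q, family = binomial)
exp(-coef(fit))             # to compare with sqrt(2*pi) = 2.5066...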

Another novel aspect was found in Yves Atchadé's talk about sparse high-dimension matrices with priors made of mutually exclusive measures and quasi-likelihood approximations. A simplified version of the talk consists in having a non-identified, non-constrained matrix later projected onto one of those measure supports. While I was aware of his noise-contrastive estimation of normalising constants, I had not previously heard Michael Gutmann give a talk on that approach (linking to Geyer's 1994 mythical paper!). And I do remain nonplussed at the possibility of including the normalising constant as an additional parameter [in a computational and statistical sense]…! Both Chris Sherlock and Christophe Andrieu talked about novel aspects of pseudo-marginal techniques, Chris on the lack of variance reduction brought by averaging unbiased estimators of the likelihood, and Christophe on the case of large datasets, recovering better performances in latent variable models by estimating the ratio rather than taking a ratio of estimators. (With Christophe pointing out that this was an exceptional case when harmonic mean estimators could be considered!)