Archive for harmonic mean estimator

new estimators of evidence

Posted in Books, Statistics with tags , , , , , , , , , , , , on June 19, 2018 by xi'an

In an incredible accumulation of coincidences, I came across yet another paper about evidence and the harmonic mean challenge, by Yu-Bo Wang, Ming-Hui Chen [same as in Chen, Shao, Ibrahim], Lynn Kuo, and Paul O. Lewis this time, published in Bayesian Analysis. (Disclaimer: I was not involved in the reviews of any of these papers!)  Authors who arelocated in Storrs, Connecticut, in geographic and thematic connection with the original Gelfand and Dey (1994) paper! (Private joke about the Old Man of Storr in above picture!)

“The working parameter space is essentially the constrained support considered by Robert and Wraith (2009) and Marin and Robert (2010).”

The central idea is to use a more general function than our HPD restricted prior but still with a known integral. Not in the sense of control variates, though. The function of choice is a weighted sum of indicators of terms of a finite partition, which implies a compact parameter set Ω. Or a form of HPD region, although it is unclear when the volume can be derived. While the consistency of the estimator of the inverse normalising constant [based on an MCMC sample] is unsurprising, the more advanced part of the paper is about finding the optimal sequence of weights, as in control variates. But it is also unsurprising in that the weights are proportional to the inverses of the inverse posteriors over the sets in the partition. Since these are hard to derive in practice, the authors come up with a fairly interesting alternative, which is to take the value of the posterior at an arbitrary point of the relevant set.

The paper also contains an extension replacing the weights with functions that are integrable and with known integrals. Which is hard for most choices, even though it contains the regular harmonic mean estimator as a special case. And should also suffer from the curse of dimension when the constraint to keep the target almost constant is implemented (as in Figure 1).

The method, when properly calibrated, does much better than harmonic mean (not a surprise) and than Petris and Tardella (2007) alternative, but no other technique, on toy problems like Normal, Normal mixture, and probit regression with three covariates (no Pima Indians this time!). As an aside I find it hard to understand how the regular harmonic mean estimator takes longer than this more advanced version, which should require more calibration. But I find it hard to see a general application of the principle, because the partition needs to be chosen in terms of the target. Embedded balls cannot work for every possible problem, even with ex-post standardisation.

 

the [not so infamous] arithmetic mean estimator

Posted in Books, Statistics with tags , , , , , , , , , on June 15, 2018 by xi'an

“Unfortunately, no perfect solution exists.” Anna Pajor

Another paper about harmonic and not-so-harmonic mean estimators that I (also) missed came out last year in Bayesian Analysis. The author is Anna Pajor, whose earlier note with Osiewalski I also spotted on the same day. The idea behind the approach [which belongs to the branch of Monte Carlo methods requiring additional simulations after an MCMC run] is to start as the corrected harmonic mean estimator on a restricted set A as to avoid tails of the distributions and the connected infinite variance issues that plague the harmonic mean estimator (an old ‘Og tune!). The marginal density p(y) then satisfies an identity involving the prior expectation of the likelihood function restricted to A divided by the posterior coverage of A. Which makes the resulting estimator unbiased only when this posterior coverage of A is known, which does not seem realist or efficient, except if A is an HPD region, as suggested in our earlier “safe” harmonic mean paper. And efficient only when A is well-chosen in terms of the likelihood function. In practice, the author notes that P(A|y) is to be estimated from the MCMC sequence and that the set A should be chosen to return large values of the likelihood, p(y|θ), through importance sampling, hence missing somehow the double opportunity of using an HPD region. Hence using the same default choice as in Lenk (2009), an HPD region which lower bound is derived as the minimum likelihood in the MCMC sample, “range of the posterior sampler output”. Meaning P(A|y)=1. (As an aside, the paper does not produce optimality properties or even heuristics towards efficiently choosing the various parameters to be calibrated in the algorithm, like the set A itself. As another aside, the paper concludes with a simulation study on an AR(p) model where the marginal may be obtained in closed form if stationarity is not imposed, which I first balked at, before realising that even in this setting both the posterior and the marginal do exist for a finite sample size, and hence the later can be estimated consistently by Monte Carlo methods.) A last remark is that computing costs are not discussed in the comparison of methods.

The final experiment in the paper is aiming at the marginal of a mixture model posterior, operating on the galaxy benchmark used by Roeder (1990) and about every other paper on mixtures since then (incl. ours). The prior is pseudo-conjugate, as in Chib (1995). And label-switching is handled by a random permutation of indices at each iteration. Which may not be enough to fight the attraction of the current mode on a Gibbs sampler and hence does not automatically correct Chib’s solution. As shown in Table 7 by the divergence with Radford Neal’s (1999) computations of the marginals, which happen to be quite close to the approximation proposed by the author. (As an aside, the paper mentions poor performances of Chib’s method when centred at the posterior mean, but this is a setting where the posterior mean is meaningless because of the permutation invariance. As another, I do not understand how the RMSE can be computed in this real data situation.) The comparison is limited to Chib’s method and a few versions of arithmetic and harmonic means. Missing nested sampling (Skilling, 2006; Chopin and X, 2011), and attuned importance sampling as in Berkoff et al. (2003), Marin, Mengersen and X (2005), and the most recent Lee and X (2016) in Bayesian Analysis.

another version of the corrected harmonic mean estimator

Posted in Books, pictures, Statistics, University life with tags , , , , , on June 11, 2018 by xi'an

A few days ago I came across a short paper in the Central European Journal of Economic Modelling and Econometrics by Pajor and Osiewalski that proposes a correction to the infamous harmonic mean estimator that is essentially the one Darren and I made in 2009, namely to restrict the evaluations of the likelihood function to a subset A of the simulations from the posterior. Paper that relates to an earlier 2009 paper by Peter Lenk, which investigates the same object with this same proposal and that we had missed for all that time. The difference is that, while we examine an arbitrary HPD region at level 50% or 80% as the subset A, Lenk proposes to derive a minimum likelihood value from the MCMC run and to use the associated HPD region, which means using all simulations, hence producing the same object as the original harmonic mean estimator, except that it is corrected by a multiplicative factor P(A). Or rather an approximation. This correction thus maintains the infinite variance of the original, a point apparently missed in the paper.

Bayesian goodness of fit

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , , on April 10, 2018 by xi'an

 

Persi Diaconis and Guanyang Wang have just arXived an interesting reflection on the notion of Bayesian goodness of fit tests. Which is a notion that has always bothered me, in a rather positive sense (!), as

“I also have to confess at the outset to the zeal of a convert, a born again believer in stochastic methods. Last week, Dave Wright reminded me of the advice I had given a graduate student during my algebraic geometry days in the 70’s :`Good Grief, don’t waste your time studying statistics. It’s all cookbook nonsense.’ I take it back! …” David Mumford

The paper starts with a reference to David Mumford, whose paper with Wu and Zhou on exponential “maximum entropy” synthetic distributions is at the source (?) of this paper, and whose name appears in its very title: “A conversation for David Mumford”…, about his conversion from pure (algebraic) maths to applied maths. The issue of (Bayesian) goodness of fit is addressed, with card shuffling examples, the null hypothesis being that the permutation resulting from the shuffling is uniformly distributed if shuffling takes enough time. Interestingly, while the parameter space is compact as a distribution on a finite set, Lindley’s paradox still occurs, namely that the null (the permutation comes from a Uniform) is always accepted provided there is no repetition under a “flat prior”, which is the Dirichlet D(1,…,1) over all permutations. (In this finite setting an improper prior is definitely improper as it does not get proper after accounting for observations. Although I do not understand why the Jeffreys prior is not the Dirichlet(½,…,½) in this case…) When resorting to the exponential family of distributions entertained by Zhou, Wu and Mumford, including the uniform distribution as one of its members, Diaconis and Wang advocate the use of a conjugate prior (exponential family, right?!) to compute a Bayes factor that simplifies into a ratio of two intractable normalising constants. For which the authors suggest using importance sampling, thermodynamic integration, or the exchange algorithm. Except that they rely on the (dreaded) harmonic mean estimator for computing the Bayes factor in the following illustrative section! Due to the finite nature of the space, I presume this estimator still has a finite variance. (Remark 1 calls for convergence results on exchange algorithms, which can be found I think in the just as recent arXival by Christophe Andrieu and co-authors.) An interesting if rare feature of the example processed in the paper is that the sufficient statistic used for the permutation model can be directly simulated from a Multinomial distribution. This is rare as seen when considering the benchmark of Ising models, for which the summary and sufficient statistic cannot be directly simulated. (If only…!) In fine, while I enjoyed the paper a lot, I remain uncertain as to its bearings, since defining an objective alternative for the goodness-of-fit test becomes quickly challenging outside simple enough models.

divide & reconquer

Posted in Books, Statistics, University life with tags , , , , , , , , , , on February 5, 2018 by xi'an

Qi Liu, Anindya Bhadra, and William Cleveland from Purdue have arXived a paper entitled Divide and Recombine for Large and Complex Data: Model Likelihood Functions using MCMC. Which is a variation on the earlier divide & … papers attempting at handling large datasets. The beginning is quite similar to these earlier papers in that the likelihood is split into sub-likelihoods, approximated from MCMC samples and recombined into an approximate full likelihood. As in for instance Scott et al. one approximation use for the subsample is to replace the likelihood with a Normal approximation, or a skew Normal generalisation, which remains  a limited choice for heavy tailed likelihoods. Producing a Normal and skew-Normal approximation for the whole [data] likelihood, respectively. If I understand correctly, these approximations are missing a normalising constant to bring them to scale with the true likelihood, which I do not completely understand as the likelihood only needs to be defined up to a [constant] constant for most purposes, including Bayesian ones. The  method of estimation of this constant proposed therein is called the contour probability algorithm and it consists in using a highest density region to compare a likelihood and its approximation. (Nothing to do with our adaptation of Gelfand and Dey (1994) based on HPDs, with Darren Wright. Nor with nested sampling.) Returning a form of qq-plot. This is rather exploratory, while hardly addressing the issue of the precision of such approximations and the resolution of conflicting proposals. And the comparison with all these other recent proposals for splitting likelihoods into manageable bits (proposals that are mentioned in the final section, including our recentering scheme with my student Changye Wu).

WBIC, practically

Posted in Statistics with tags , , , , , , , , , on October 20, 2017 by xi'an

“Thus far, WBIC has received no more than a cursory mention by Gelman et al. (2013)”

I had missed this 2015  paper by Nial Friel and co-authors on a practical investigation of Watanabe’s WBIC. Where WBIC stands for widely applicable Bayesian information criterion. The thermodynamic integration approach explored by Nial and some co-authors for the approximation of the evidence, thermodynamic integration that produces the log-evidence as an integral between temperatures t=0 and t=1 of a powered evidence, is eminently suited for WBIC, as the widely applicable Bayesian information criterion is associated with the specific temperature t⁰ that makes the power posterior equidistant, Kullback-Leibler-wise, from the prior and posterior distributions. And the expectation of the log-likelihood under this very power posterior equal to the (genuine) evidence. In fact, WBIC is often associated with the sub-optimal temperature 1/log(n), where n is the (effective?) sample size. (By comparison, if my minimalist description is unclear!, thermodynamic integration requires a whole range of temperatures and associated MCMC runs.) In an ideal Gaussian setting, WBIC improves considerably over thermodynamic integration, the larger the sample the better. In more realistic settings, though, including a simple regression and a logistic [Pima Indians!] model comparison, thermodynamic integration may do better for a given computational cost although the paper is unclear about these costs. The paper also runs a comparison with harmonic mean and nested sampling approximations. Since the integral of interest involves a power of the likelihood, I wonder if a safe version of the harmonic mean resolution can be derived from simulations of the genuine posterior. Provided the exact temperature t⁰ is known…

Russian roulette still rolling

Posted in Statistics with tags , , , , , , , , , , , , on March 22, 2017 by xi'an

Colin Wei and Iain Murray arXived a new version of their paper on doubly-intractable distributions, which is to be presented at AISTATS. It builds upon the Russian roulette estimator of Lyne et al. (2015), which itself exploits the debiasing technique of McLeish et al. (2011) [found earlier in the physics literature as in Carter and Cashwell, 1975, according to the current paper]. Such an unbiased estimator of the inverse of the normalising constant can be used for pseudo-marginal MCMC, except that the estimator is sometimes negative and has to be so as proved by Pierre Jacob and co-authors. As I discussed in my post on the Russian roulette estimator, replacing the negative estimate with its absolute value does not seem right because a negative value indicates that the quantity is close to zero, hence replacing it with zero would sound more appropriate. Wei and Murray start from the property that, while the expectation of the importance weight is equal to the normalising constant, the expectation of the inverse of the importance weight converges to the inverse of the weight for an MCMC chain. This however sounds like an harmonic mean estimate because the property would also stand for any substitute to the importance density, as it only requires the density to integrate to one… As noted in the paper, the variance of the resulting Roulette estimator “will be high” or even infinite. Following Glynn et al. (2014), the authors build a coupled version of that solution, which key feature is to cut the higher order terms in the debiasing estimator. This does not guarantee finite variance or positivity of the estimate, though. In order to decrease the variance (assuming it is finite), backward coupling is introduced, with a Rao-Blackwellisation step using our 1996 Biometrika derivation. Which happens to be of lower cost than the standard Rao-Blackwellisation in that special case, O(N) versus O(N²), N being the stopping rule used in the debiasing estimator. Under the assumption that the inverse importance weight has finite expectation [wrt the importance density], the resulting backward-coupling Russian roulette estimator can be proven to be unbiased, as it enjoys a finite expectation. (As in the generalised harmonic mean case, the constraint imposes thinner tails on the importance function, which then hampers the convergence of the MCMC chain.) No mention is made of achieving finite variance for those estimators, which again is a serious concern due to the similarity with harmonic means…