Archive for HPD region

machine-learning harmonic mean

Posted in Books, Statistics with tags , , , , , , on February 25, 2022 by xi'an

In a recent arXival, Jason McEwen propose a resurrection of the “infamous” harmonic mean estimator. In Machine learning assisted Bayesian model comparison: learnt harmonic mean estimator, they propose to aim at the “optimal importance function”. The paper provides a fair coverage of the literature on that topic, incl. our 2009 paper with Darren Wraith (although I do not follow the criticism of using a uniform over an HPD region, esp. since one of the learnt targets is also a uniform over an hypersphere, presumably optimised in terms of the chosen parameterisation).

“…the learnt harmonic mean estimator, a variant of the original estimator that solves its large variance problem. This is achieved by interpreting the harmonic mean estimator as importance sampling and introducing a new target distribution (…) learned to approximate the optimal but inaccessible target, while minimising the variance of the resulting estimator. Since the estimator requires samples of the posterior only it is agnostic to the strategy used to generate posterior samples.”

The method thus builds upon Gelfand and Dey (1994) general proposal that is a form of inverse importance sampling since the numerator [the new target] is free while the denominator is the unnormalised posterior. The optimal target being the complete posterior (since it lead to a null variance), the authors propose to try to approximate this posterior by various means. (Note however that an almost Dirac mass at a value with positive posterior would work as well!, at least in principle…) as the sections on moment approximations sound rather standard (and assume the estimated variances are finite) while the reason for the inclusion of the Bayes factor approximation is rather unclear. However, I am rather skeptical at the proposals made therein towards approximating the posterior distribution, from a Gaussian mixture [for which parameterisation?] to KDEs, or worse ML tools like neural nets [not explored there, which makes one wonder about the title], as the estimands will prove very costly, and suffer from the curse of dimensionality (3 hours for d=2¹⁰…).The Pima Indian women’s diabetes dataset and its quasi-Normal posterior are used as a benchmark, meaning that James and Nicolas did not shout loud enough! And I find surprising that most examples include the original harmonic mean estimator despite its complete lack of trustworthiness.

an arithmetic mean identity

Posted in Books, pictures, R, Statistics, Travel, University life with tags , , , , , , , , , , , , on December 19, 2019 by xi'an

A 2017 paper by Ana Pajor published in Bayesian Analysis addresses my favourite problem [of computing the marginal likelihood] and which I discussed on the ‘Og, linking with another paper by Lenk published in 2012 in JCGS. That I already discussed here last year. Lenk’s (2009) paper is actually using a technique related to the harmonic mean correction based on HPD regions Darren Wraith and myself proposed at MaxEnt 2009. And which Jean-Michel and I presented at Frontiers of statistical decision making and Bayesian analysis in 2010. As I had only vague memories about the arithmetic mean version, we discussed the paper together with graduate students in Paris Dauphine.

The arithmetic mean solution, representing the marginal likelihood as the prior average of the likelihood, is a well-known approach used as well as the basis for nested sampling. With the improvement consisting in restricting the simulation to a set Ð with sufficiently high posterior probability. I am quite uneasy about P(Ð|y) estimated by 1 as the shape of the set containing all posterior simulations is completely arbitrary, parameterisation dependent, and very random since based on the extremes of this posterior sample. Plus, the set Ð converges to the entire parameter space with the number of posterior simulations. An alternative that we advocated in our earlier paper is to take Ð as the HPD region or a variational Bayes version . But the central issue with the HPD regions is how to construct these from an MCMC output and how to compute both P(Ð) and P(Ð|y). It does not seem like a good idea to set P(Ð|x) to the intended α level for the HPD coverage. Using a non-parametric version for estimating Ð could be in the end the only reasonable solution.

As a test, I reran the example of a conjugate normal model used in the paper, based on (exact) simulations from both the prior and  the posterior, and obtained approximations that were all close from the true marginal. With Chib’s being exact in that case (of course!), and an arithmetic mean surprisingly close without an importance correction:

> print(c(hame,chme,came,chib))
[1] -107.6821 -106.5968 -115.5950 -115.3610

Both harmonic versions are of the right order but not trustworthy, the truncation to such a set Ð as the one chosen in this paper having little impact.

new estimators of evidence

Posted in Books, Statistics with tags , , , , , , , , , , , , on June 19, 2018 by xi'an

In an incredible accumulation of coincidences, I came across yet another paper about evidence and the harmonic mean challenge, by Yu-Bo Wang, Ming-Hui Chen [same as in Chen, Shao, Ibrahim], Lynn Kuo, and Paul O. Lewis this time, published in Bayesian Analysis. (Disclaimer: I was not involved in the reviews of any of these papers!)  Authors who arelocated in Storrs, Connecticut, in geographic and thematic connection with the original Gelfand and Dey (1994) paper! (Private joke about the Old Man of Storr in above picture!)

“The working parameter space is essentially the constrained support considered by Robert and Wraith (2009) and Marin and Robert (2010).”

The central idea is to use a more general function than our HPD restricted prior but still with a known integral. Not in the sense of control variates, though. The function of choice is a weighted sum of indicators of terms of a finite partition, which implies a compact parameter set Ω. Or a form of HPD region, although it is unclear when the volume can be derived. While the consistency of the estimator of the inverse normalising constant [based on an MCMC sample] is unsurprising, the more advanced part of the paper is about finding the optimal sequence of weights, as in control variates. But it is also unsurprising in that the weights are proportional to the inverses of the inverse posteriors over the sets in the partition. Since these are hard to derive in practice, the authors come up with a fairly interesting alternative, which is to take the value of the posterior at an arbitrary point of the relevant set.

The paper also contains an extension replacing the weights with functions that are integrable and with known integrals. Which is hard for most choices, even though it contains the regular harmonic mean estimator as a special case. And should also suffer from the curse of dimension when the constraint to keep the target almost constant is implemented (as in Figure 1).

The method, when properly calibrated, does much better than harmonic mean (not a surprise) and than Petris and Tardella (2007) alternative, but no other technique, on toy problems like Normal, Normal mixture, and probit regression with three covariates (no Pima Indians this time!). As an aside I find it hard to understand how the regular harmonic mean estimator takes longer than this more advanced version, which should require more calibration. But I find it hard to see a general application of the principle, because the partition needs to be chosen in terms of the target. Embedded balls cannot work for every possible problem, even with ex-post standardisation.


the [not so infamous] arithmetic mean estimator

Posted in Books, Statistics with tags , , , , , , , , , on June 15, 2018 by xi'an

“Unfortunately, no perfect solution exists.” Anna Pajor

Another paper about harmonic and not-so-harmonic mean estimators that I (also) missed came out last year in Bayesian Analysis. The author is Anna Pajor, whose earlier note with Osiewalski I also spotted on the same day. The idea behind the approach [which belongs to the branch of Monte Carlo methods requiring additional simulations after an MCMC run] is to start as the corrected harmonic mean estimator on a restricted set A as to avoid tails of the distributions and the connected infinite variance issues that plague the harmonic mean estimator (an old ‘Og tune!). The marginal density p(y) then satisfies an identity involving the prior expectation of the likelihood function restricted to A divided by the posterior coverage of A. Which makes the resulting estimator unbiased only when this posterior coverage of A is known, which does not seem realist or efficient, except if A is an HPD region, as suggested in our earlier “safe” harmonic mean paper. And efficient only when A is well-chosen in terms of the likelihood function. In practice, the author notes that P(A|y) is to be estimated from the MCMC sequence and that the set A should be chosen to return large values of the likelihood, p(y|θ), through importance sampling, hence missing somehow the double opportunity of using an HPD region. Hence using the same default choice as in Lenk (2009), an HPD region which lower bound is derived as the minimum likelihood in the MCMC sample, “range of the posterior sampler output”. Meaning P(A|y)=1. (As an aside, the paper does not produce optimality properties or even heuristics towards efficiently choosing the various parameters to be calibrated in the algorithm, like the set A itself. As another aside, the paper concludes with a simulation study on an AR(p) model where the marginal may be obtained in closed form if stationarity is not imposed, which I first balked at, before realising that even in this setting both the posterior and the marginal do exist for a finite sample size, and hence the later can be estimated consistently by Monte Carlo methods.) A last remark is that computing costs are not discussed in the comparison of methods.

The final experiment in the paper is aiming at the marginal of a mixture model posterior, operating on the galaxy benchmark used by Roeder (1990) and about every other paper on mixtures since then (incl. ours). The prior is pseudo-conjugate, as in Chib (1995). And label-switching is handled by a random permutation of indices at each iteration. Which may not be enough to fight the attraction of the current mode on a Gibbs sampler and hence does not automatically correct Chib’s solution. As shown in Table 7 by the divergence with Radford Neal’s (1999) computations of the marginals, which happen to be quite close to the approximation proposed by the author. (As an aside, the paper mentions poor performances of Chib’s method when centred at the posterior mean, but this is a setting where the posterior mean is meaningless because of the permutation invariance. As another, I do not understand how the RMSE can be computed in this real data situation.) The comparison is limited to Chib’s method and a few versions of arithmetic and harmonic means. Missing nested sampling (Skilling, 2006; Chopin and X, 2011), and attuned importance sampling as in Berkoff et al. (2003), Marin, Mengersen and X (2005), and the most recent Lee and X (2016) in Bayesian Analysis.

another version of the corrected harmonic mean estimator

Posted in Books, pictures, Statistics, University life with tags , , , , , on June 11, 2018 by xi'an

A few days ago I came across a short paper in the Central European Journal of Economic Modelling and Econometrics by Pajor and Osiewalski that proposes a correction to the infamous harmonic mean estimator that is essentially the one Darren and I made in 2009, namely to restrict the evaluations of the likelihood function to a subset A of the simulations from the posterior. Paper that relates to an earlier 2009 paper by Peter Lenk, which investigates the same object with this same proposal and that we had missed for all that time. The difference is that, while we examine an arbitrary HPD region at level 50% or 80% as the subset A, Lenk proposes to derive a minimum likelihood value from the MCMC run and to use the associated HPD region, which means using all simulations, hence producing the same object as the original harmonic mean estimator, except that it is corrected by a multiplicative factor P(A). Or rather an approximation. This correction thus maintains the infinite variance of the original, a point apparently missed in the paper.

%d bloggers like this: