## machine-learning harmonic mean

Posted in Books, Statistics on February 25, 2022 by xi'an

In a recent arXival, Jason McEwen and co-authors propose a resurrection of the “infamous” harmonic mean estimator. In Machine learning assisted Bayesian model comparison: learnt harmonic mean estimator, they propose to aim at the “optimal importance function”. The paper provides a fair coverage of the literature on that topic, incl. our 2009 paper with Darren Wraith (although I do not follow the criticism of using a uniform over an HPD region, esp. since one of the learnt targets is also a uniform over a hypersphere, presumably optimised in terms of the chosen parameterisation).

“…the learnt harmonic mean estimator, a variant of the original estimator that solves its large variance problem. This is achieved by interpreting the harmonic mean estimator as importance sampling and introducing a new target distribution (…) learned to approximate the optimal but inaccessible target, while minimising the variance of the resulting estimator. Since the estimator requires samples of the posterior only it is agnostic to the strategy used to generate posterior samples.”

The method thus builds upon Gelfand and Dey's (1994) general proposal, which is a form of inverse importance sampling since the numerator [the new target] is free while the denominator is the unnormalised posterior. The optimal target being the complete posterior (since it leads to a null variance), the authors propose to try to approximate this posterior by various means. (Note however that an almost Dirac mass at a value with positive posterior would work as well!, at least in principle…) The sections on moment approximations sound rather standard (and assume the estimated variances are finite), while the reason for the inclusion of the Bayes factor approximation is rather unclear. However, I am rather skeptical of the proposals made therein towards approximating the posterior distribution, from a Gaussian mixture [for which parameterisation?] to KDEs, or worse ML tools like neural nets [not explored there, which makes one wonder about the title], as the estimands will prove very costly, and suffer from the curse of dimensionality (3 hours for d=2¹⁰…). The Pima Indian women’s diabetes dataset and its quasi-Normal posterior are used as a benchmark, meaning that James and Nicolas did not shout loud enough! And I find it surprising that most examples include the original harmonic mean estimator despite its complete lack of trustworthiness.
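As a minimal sketch of the inverse importance sampling identity at play, E[φ(θ)/L(θ)π(θ) | data] = 1/Z for any normalised density φ, here is a toy comparison of the original harmonic mean (φ equal to the prior) with a Gelfand and Dey version. Everything in it is an assumption made for illustration and not taken from the paper: the conjugate Normal model, the exact posterior draws, the train/evaluation split, and the choice of φ as a Normal fitted on the training half with its variance halved so that φ has lighter tails than the posterior.

```python
import math
import random

rng = random.Random(1)

# Toy conjugate model (assumed for illustration): y_i ~ N(theta, 1), theta ~ N(0, tau2).
n, tau2 = 20, 100.0
y = [rng.gauss(1.0, 1.0) for _ in range(n)]

def log_lik(theta):
    return -0.5 * n * math.log(2 * math.pi) - 0.5 * sum((yi - theta) ** 2 for yi in y)

def log_prior(theta):
    return -0.5 * math.log(2 * math.pi * tau2) - 0.5 * theta ** 2 / tau2

def log_normal_pdf(x, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

def log_mean_exp(values):
    # Numerically stable log of the average of exp(values).
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values) / len(values))

# Exact posterior N(post_mean, post_var); exact evidence from
# Z = likelihood * prior / posterior, evaluated at the posterior mean.
post_var = 1.0 / (n + 1.0 / tau2)
post_mean = post_var * sum(y)
log_z_true = (log_lik(post_mean) + log_prior(post_mean)
              - log_normal_pdf(post_mean, post_mean, post_var))

# Posterior draws (exact here; any sampler could be substituted).
draws = [rng.gauss(post_mean, math.sqrt(post_var)) for _ in range(5000)]
train, evaluate = draws[:2500], draws[2500:]

# Original harmonic mean: 1/Z estimated by the average of 1/L(theta_i).
log_z_hm = -log_mean_exp([-log_lik(t) for t in evaluate])

# Gelfand-Dey: 1/Z estimated by the average of phi(theta_i) / (L(theta_i) prior(theta_i)),
# with phi a Normal fitted on the training half and a halved variance.
fit_mean = sum(train) / len(train)
fit_var = sum((t - fit_mean) ** 2 for t in train) / (len(train) - 1)
phi_var = 0.5 * fit_var
log_z_gd = -log_mean_exp(
    [log_normal_pdf(t, fit_mean, phi_var) - log_lik(t) - log_prior(t) for t in evaluate]
)

print("exact log-evidence      :", log_z_true)
print("harmonic mean estimate  :", log_z_hm)
print("Gelfand-Dey estimate    :", log_z_gd)
```

With a target narrower than the posterior, the ratio φ/posterior is bounded and the estimator has finite variance, which is the whole point of moving away from the prior as the numerator.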

## unbalanced sampling

Posted in pictures, R, Statistics on May 17, 2021 by xi'an

A question from X validated on sampling from an unknown density f when given both a sample from the density f restricted to a (known) interval A, say, and a sample from f restricted to the complement of A, say. Or at least on producing an estimate of the mass of A under f, p(A).

The problem sounds impossible to solve without an ability to compute the density value at a given value, since any convex combination αf¹+(1-α)f² would return the same two samples. Assuming continuity of the density f at the boundary point a between A and its complement, a desperate solution for p(A)/(1-p(A)) is to take the ratio of the density estimates at the value a, which turns out not so poor an approximation if seemingly biased. This was surprising to me as kernel density estimates are notoriously bad at boundary points.
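The boundary trick can be sketched as follows. Since the restricted densities at a are f(a)/p(A) and f(a)/(1-p(A)), the ratio of the complement estimate over the A estimate targets the odds p(A)/(1-p(A)). The setting is an assumption chosen for illustration: f is a standard Normal, A is the half-line (-∞, a] with a = 0.5, both samples have the same size (so the sizes carry no information on p(A)), and the kernel estimate is a plain Gaussian KDE with a rule-of-thumb bandwidth. The helper names `draw_restricted` and `kde_at` are hypothetical.

```python
import math
import random
import statistics

rng = random.Random(0)
a = 0.5  # boundary between A = (-inf, a] and its complement

def draw_restricted(in_A, size):
    # Draw from the standard Normal restricted to A (in_A=True) or to its complement.
    out = []
    while len(out) < size:
        x = rng.gauss(0.0, 1.0)
        if (x <= a) == in_A:
            out.append(x)
    return out

sample_A = draw_restricted(True, 4000)
sample_B = draw_restricted(False, 4000)

def kde_at(point, sample):
    # Gaussian kernel density estimate at a single point, rule-of-thumb bandwidth.
    h = 1.06 * statistics.stdev(sample) * len(sample) ** (-0.2)
    total = sum(math.exp(-0.5 * ((point - x) / h) ** 2) for x in sample)
    return total / (len(sample) * h * math.sqrt(2 * math.pi))

# Ratio of the two restricted density estimates at the boundary.
odds_hat = kde_at(a, sample_B) / kde_at(a, sample_A)
p_hat = odds_hat / (1 + odds_hat)
p_true = 0.5 * (1 + math.erf(a / math.sqrt(2)))

print("estimated p(A):", p_hat, " true p(A):", p_true)
```

Both estimates lose about half of the kernel mass at the boundary, so the leading bias cancels in the ratio; what remains depends on the slope of f at a and on the two bandwidths.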

If f(x) can be computed [up to a constant] at an arbitrary x, it is obviously feasible to simulate from f and approximate p(A). But the problem is then moot as a resolution would not even need the initial samples. If exploiting those to construct a single kernel density estimate, this estimate can be used as a proposal in an MCMC algorithm. Surprisingly (?), using instead the empirical cdf as proposal does not work.
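A minimal sketch of the kernel proposal idea, under assumptions of my own choosing: f is a standard Normal known up to a constant, A = (-∞, 0.5], the two restricted samples are pooled into a single Gaussian KDE, and this KDE serves as the proposal of an independence Metropolis-Hastings sampler whose output estimates p(A). The names `log_f`, `log_q` and `draw_q` are hypothetical.

```python
import math
import random
import statistics

rng = random.Random(2)
a = 0.5

def log_f(x):
    # Target known up to a constant (standard Normal, assumed for illustration).
    return -0.5 * x * x

def draw_restricted(in_A, size):
    out = []
    while len(out) < size:
        x = rng.gauss(0.0, 1.0)
        if (x <= a) == in_A:
            out.append(x)
    return out

# Pool both restricted samples into one kernel density estimate.
pooled = draw_restricted(True, 200) + draw_restricted(False, 200)
h = 1.06 * statistics.stdev(pooled) * len(pooled) ** (-0.2)

def log_q(x):
    # Log-density of the Gaussian KDE built on the pooled sample.
    dens = sum(math.exp(-0.5 * ((x - c) / h) ** 2) for c in pooled)
    return math.log(dens / (len(pooled) * h * math.sqrt(2 * math.pi)))

def draw_q():
    # Simulating from the KDE: pick a sample point, add kernel noise.
    return rng.choice(pooled) + h * rng.gauss(0.0, 1.0)

# Independence Metropolis-Hastings with the KDE as proposal.
x = 0.0
lw_x = log_f(x) - log_q(x)
hits = 0
n_iter = 3000
for _ in range(n_iter):
    y = draw_q()
    lw_y = log_f(y) - log_q(y)
    if rng.random() < math.exp(min(0.0, lw_y - lw_x)):
        x, lw_x = y, lw_y
    hits += x <= a

p_hat = hits / n_iter
p_true = 0.5 * (1 + math.erf(a / math.sqrt(2)))
print("MCMC estimate of p(A):", p_hat, " true p(A):", p_true)
```

The pooled KDE gives the wrong relative weights to A and its complement, but the acceptance step corrects for this since f is evaluated at every proposed value.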

## likelihood-free and summary-free?

Posted in Books, Mountains, pictures, Statistics, Travel on March 30, 2021 by xi'an

My friends and coauthors Chris Drovandi and David Frazier have recently arXived a paper entitled A comparison of likelihood-free methods with and without summary statistics. In which they indeed compare these two perspectives on approximate Bayesian methods like ABC and Bayesian synthetic likelihoods.

“A criticism of summary statistic based approaches is that their choice is often ad hoc and there will generally be an inherent loss of information.”

In ABC methods, the recourse to a summary statistic is often advocated as a “necessary evil” against the greater evil of the curse of dimension, paradoxically providing a faster convergence of the ABC approximation (Fearnhead & Liu, 2018). The authors propose a somewhat generic selection of summary statistics based on [my undergrad mentors!] Gouriéroux’s and Monfort’s indirect inference, using a mixture of Gaussians as their auxiliary model. Summary-free solutions, as in our Wasserstein papers, rely on distances between distributions, hence are functional distances, that can be seen as dimension-free as well (or criticised as infinite dimensional). Chris and David consider energy distances (which sound very much like standard distances, except for averaging over all permutations), maximum mean discrepancy as in Gretton et al. (2012), Cramér-von Mises distances, and Kullback-Leibler divergences estimated via one-nearest-neighbour formulas, for a univariate sample. I am not aware of any degree of theoretical exploration of these functional approaches towards the precise speed of convergence of the ABC approximation…
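For a concrete picture of a summary-free approach, here is a minimal sketch of ABC rejection using the energy distance between the full observed and simulated univariate samples. The setting is assumed for illustration and is not one of the paper's examples: a Normal location model, a uniform prior on (-5, 5), the V-statistic version of the energy distance (zero diagonal terms included), and a tolerance set as the 5% quantile of the simulated distances.

```python
import random

rng = random.Random(3)

def mean_abs_diff(xs, ys):
    # Average of |x - y| over all pairs (x in xs, y in ys).
    return sum(abs(x - y) for x in xs for y in ys) / (len(xs) * len(ys))

def energy_distance(xs, ys, within_xs=None):
    # 2 E|X - Y| - E|X - X'| - E|Y - Y'|, estimated from the two samples.
    if within_xs is None:
        within_xs = mean_abs_diff(xs, xs)
    return 2 * mean_abs_diff(xs, ys) - within_xs - mean_abs_diff(ys, ys)

# Observed data from a Normal location model (true location 1.5, assumed).
n_obs = 30
observed = [rng.gauss(1.5, 1.0) for _ in range(n_obs)]
within_obs = mean_abs_diff(observed, observed)  # computed once, reused below

def simulate(theta):
    return [rng.gauss(theta, 1.0) for _ in range(n_obs)]

# Reference table: prior draws and their distances to the observed sample.
reference = []
for _ in range(1500):
    theta = rng.uniform(-5.0, 5.0)
    reference.append((energy_distance(observed, simulate(theta), within_obs), theta))

# Keep the 5% of simulations closest to the data.
reference.sort()
accepted = [theta for _, theta in reference[:75]]

abc_mean = sum(accepted) / len(accepted)
obs_mean = sum(observed) / n_obs
print("ABC posterior mean:", abc_mean, " sample mean of the data:", obs_mean)
```

No summary statistic is chosen anywhere: the whole simulated sample is compared with the whole observed sample, at a quadratic cost in the sample size for this naive implementation.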

“We found that at least one of the full data approaches was competitive with or outperforms ABC with summary statistics across all examples.”

The main part of the paper, besides a survey of the existing solutions, is to compare the performances of these over a few chosen (univariate) examples, with the exact posterior as the gold standard. In the g & k model, the Pima Indian benchmark of ABC studies!, Cramér-von Mises does somewhat better. While it does much worse in an M/G/1 example (where Wasserstein does better, and similarly for a stereological extremes example of Bortot et al., 2007). An ordering reversed again for a toad movement model I had not seen before. While the usual proviso applies, namely that this is a simulation study on unidimensional data and a small number of parameters, the design of the four comparison experiments is very careful, eliminating versions that are either too costly or too divergent, although this could be potentially criticised for being unrealistic (i.e., when the true posterior is unknown). The computing time is roughly the same across methods, which essentially removes the call to kernel based approximations of the likelihood. Another point of interest is that the distance methods are significantly impacted by transforms on the data, which should not be so for intrinsic distances! Demonstrating the distances are not intrinsic…

## frontier of simulation-based inference

Posted in Books, Statistics, University life on June 11, 2020 by xi'an

“This paper results from the Arthur M. Sackler Colloquium of the National Academy of Sciences, `The Science of Deep Learning,’ held March 13–14, 2019, at the National Academy of Sciences in Washington, DC.”

A paper by Kyle Cranmer, Johann Brehmer, and Gilles Louppe just appeared in PNAS on the frontier of simulation-based inference. Sounding more like a tribune than a research paper producing new input. Or at least like a review. Providing a quick introduction to simulators, inference, ABC. Stating the shortcomings of simulation-based inference as threefold:

1. costly, since requiring a large number of simulated samples
2. losing information through the use of insufficient summary statistics or poor non-parametric approximations of the sampling density.
3. wasteful as requiring new computational efforts for new datasets, primarily for ABC as learning the likelihood function (as a function of both the parameter θ and the data x) is only done once.

And the difficulties increase with the dimension of the data. While the points made above are correct, I want to note that ideally ABC (and Bayesian inference as a whole) only depends on a one-dimensional observation, which is the likelihood value. Or more practically that it only depends on the distance from the observed data to the simulated data. (Possibly the Wasserstein distance between the cdfs.) And that, somewhat unrealistically, ABC could store the reference table once and for all. Point 3 can also be debated in that the effort of learning an approximation can only be amortized when exactly the same model is re-employed with new data, which is likely in industrial applications but less in scientific investigations, I would think. About point 2, the paper misses part of the ABC literature on selecting summary statistics, e.g., the culling afforded by random forests ABC, or the earlier use of the score function in Martin et al. (2019).

The paper then makes a case for using machine-, active-, and deep-learning advances to overcome those blocks. Overlapping with other recent publications and talks (like Dennis on One World ABC’minar!). Once again presenting machine-learning techniques such as normalizing flows as more efficient than traditional non-parametric estimators. Of which I remain unconvinced without deeper arguments [than the repeated mention of powerful machine-learning techniques] on the convergence rates of these estimators (rather than extolling the super-powers of neural nets).

“A classifier is trained using supervised learning to discriminate two sets of data, although in this case both sets come from the simulator and are generated for different parameter points θ⁰ and θ¹. The classifier output function can be converted into an approximation of the likelihood ratio between θ⁰ and θ¹ (…) learning the likelihood or posterior is an unsupervised learning problem, whereas estimating the likelihood ratio through a classifier is an example of supervised learning and often a simpler task.”

The above comment is highly connected to the approach set by Geyer in 1994 and expanded by Gutmann and Hyvärinen in 2012. Interestingly, at least from my narrow statistician viewpoint!, the discussion about using these different types of approximation to the likelihood and hence to the resulting Bayesian inference never engages in a quantification of the approximation or even broaches the potential for inconsistent inference unlocked by using fake likelihoods. While insisting on the information loss brought by using summary statistics.
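The classifier trick described in the quote above can be sketched in a few lines. With balanced classes, the log-odds of a classifier discriminating simulations at θ¹ from simulations at θ⁰ approximate the log likelihood ratio. The example is an assumption made for illustration and not the paper's setup: the simulator is N(θ, 1), so the exact log-ratio is linear in x and a plain logistic regression fitted by gradient descent is a well-specified classifier, standing in for the neural networks of the paper.

```python
import math
import random

rng = random.Random(4)

# Two parameter points and simulated data from each (toy simulator N(theta, 1)).
theta0, theta1 = 0.0, 1.0
n = 2000
xs = ([rng.gauss(theta0, 1.0) for _ in range(n)]
      + [rng.gauss(theta1, 1.0) for _ in range(n)])
labels = [0] * n + [1] * n  # label 1 marks simulations produced at theta1

# Logistic regression P(label = 1 | x) = sigmoid(b + w * x), by gradient descent
# on the average cross-entropy loss.
b, w = 0.0, 0.0
rate = 1.0
for _ in range(200):
    grad_b = grad_w = 0.0
    for x, lab in zip(xs, labels):
        p = 1.0 / (1.0 + math.exp(-(b + w * x)))
        grad_b += p - lab
        grad_w += (p - lab) * x
    b -= rate * grad_b / len(xs)
    w -= rate * grad_w / len(xs)

def log_ratio_hat(x):
    # Classifier log-odds, read as an estimate of log p(x|theta1) / p(x|theta0).
    return b + w * x

def log_ratio_exact(x):
    return (theta1 - theta0) * x - 0.5 * (theta1 ** 2 - theta0 ** 2)

for x in (0.0, 1.5):
    print(x, log_ratio_hat(x), log_ratio_exact(x))
```

Here the approximation error is pure estimation noise since the classifier family contains the truth; with a misspecified or under-trained classifier, the resulting "likelihood ratio" carries an error that this sketch does not quantify either.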

“Can the outcome be trusted in the presence of imperfections such as limited sample size, insufficient network capacity, or inefficient optimization?”

Interestingly [the more because the paper is classified as statistics] the above shows that the statistical question is set instead in terms of numerical error(s). With proposals to address it ranging from (unrealistic) parametric bootstrap to some forms of GANs.

## optimal proposal for ABC

Posted in Statistics on October 8, 2018 by xi'an

As pointed out by Ewan Cameron in a recent c’Og’ment, Justin Alsing, Benjamin Wandelt, and Stephen Feeney have arXived last August a paper where they discuss an optimal proposal density for ABC-SMC and ABC-PMC. Optimality being understood as maximising the effective sample size.

“Previous studies have sought kernels that are optimal in the (…) Kullback-Leibler divergence between the proposal KDE and the target density.”

The effective sample size for ABC-SMC is actually the regular ESS multiplied by the fraction of accepted simulations. Which surprisingly converges to the ratio

E[q(θ)/π(θ)|D]/E[π(θ)/q(θ)|D]

under the (true) posterior. (Where q(θ) is the importance density and π(θ) the prior density.) When optimised in q, this usually produces an implicit equation which results in a form of geometric mean between posterior and prior. The paper looks at approximate ways to find this optimum. Especially at an upper bound on q. Something I do not understand from the simulations is that the starting point seems to be the plain geometric mean between posterior and prior, in a setting where the posterior is supposedly unavailable… Actually the paper is silent on how the optimum can be approximated in practice, for the very reason I just mentioned. Apart from using a non-parametric or mixture estimate of the posterior after each SMC iteration, which may prove extremely costly when processed through the optimisation steps. However, an interesting if side outcome of these simulations is that the above geometric mean does much better than the posterior itself when considering the effective sample size.
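To make the plain geometric mean between posterior and prior concrete, here is a worked special case under the assumption (mine, not the paper's) that both the prior π(θ) and the posterior p(θ|D) are Gaussian, with hypothetical parameters (μ₀, σ₀²) and (μ₁, σ₁²) respectively. Once normalised, the geometric mean is then Gaussian as well:

```latex
\sqrt{\pi(\theta)\,p(\theta\mid D)}
\;\propto\; \exp\left\{-\frac{1}{4}\left[\frac{(\theta-\mu_0)^2}{\sigma_0^2}
      +\frac{(\theta-\mu_1)^2}{\sigma_1^2}\right]\right\}
\;\propto\; \exp\left\{-\frac{(\theta-m)^2}{2s^2}\right\},
\qquad
\frac{1}{s^2}=\frac{1}{2}\left(\frac{1}{\sigma_0^2}+\frac{1}{\sigma_1^2}\right),
\quad
m=\frac{\mu_0/\sigma_0^2+\mu_1/\sigma_1^2}{1/\sigma_0^2+1/\sigma_1^2}.
```

In this special case the proposal variance s² is the harmonic mean of the prior and posterior variances, hence sits between the two: wider than the posterior, narrower than the prior.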