## martingale posteriors

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , on November 7, 2022 by xi'an

A new Royal Statistical Society Read Paper featuring Edwin Fong, Chris Holmes, and Steve Walker. Starting from the predictive

$p(y_{n+1:+\infty}|y_{1:n})\ \ \ (1)$

rather than from the posterior distribution on the parameter is a fairly novel idea, also pursued by Sonia Petrone and some of her coauthors. It thus adopts a de Finetti’s perspective while adding some substance to the rather metaphysical nature of the original. It however relies on the “existence” of an infinite sample in (1) that assumes a form of underlying model à la von Mises or at least an infinite population. The representation of a parameter θ as a function of an infinite sequence comes as a shock first but starts making sense when considering it as a functional of the underlying distribution. Of course, trading (modelling) a random “opaque” parameter θ for (envisioning) an infinite sequence of random (un)observations may sound like a sure loss rather than as a great deal, but it gives substance to the epistemic uncertainty about a distributional parameter, even when a model is assumed, as in Example 1, which defines θ in the usual parametric way (i.e., the mean of the iid variables). Furthermore, the link with bootstrap and even more Bayesian bootstrap becomes clear when θ is seen this way.

Always a fan of minimal loss approaches, but (2.4) defines either a moment or a true parameter value that depends on the parametric family indexed by θ. Hence does not exist outside the primary definition of said parametric family. The following construct of the empirical cdf based on the infinite sequence as providing the θ function is elegant but what is its Bayesian justification? (I did not read Appendix C.2. in full detail but could not spot the prior on F.)

“The resemblance of the martingale posterior to a bootstrap estimator should not have gone unnoticed”

I am always fan of minimal loss approaches, but I wonder at (2.4), as it defines either a moment or a true parameter value that depends on the parametric family indexed by θ. Hence it does not exist outside the primary definition of said parametric family, which limits its appeal. The following construct of the empirical cdf based on the infinite sequence as providing the θ function is elegant and connect with bootstrap, but I wonder at its Bayesian justification. (I did not read Appendix C.2. in full detail but could not spot a prior on F.)

While I completely missed the resemblance, it is indeed the case that, if the predictive at each step is build from the earlier “sample”, the support is not going to evolve. However, this is not particularly exciting as the Bayesian non-parametric estimator is most rudimentary. This seems to bring us back to Rubin (1981) ?! A Dirichlet prior is mentioned with no further detail. And I am getting confused at the complete lack of structure, prior, &tc. It seems to contradict the next section:

“While the prescription of (3.1) remains a subjective task, we find it to be no more subjective than the selection of a likelihood function”

Copulas!!! Again, I am very glad to see copulas involved in the analysis. However, I remain unclear as to why Corollary 1 implies that any sequence of copulas could do the job. Further, why does the Gaussian copula appear as the default choice? What is the computing cost of the update (4.4) after k steps? Similarly (4.7) is using a very special form of copula, with independent-across-dimension increments. I am also missing a guided tour on the implementation, as it sounds explosive in book-keeping and multiplying, while relying on a single hyperparameter in (4.5.2)?

In the illustration section, the use of the galaxy dataset may fail to appeal to Radford Neal, in a spirit similar to Chopin’s & Ridgway’s call to leave the Pima Indians alone, since he delivered a passionate lecture on the inappropriateness of a mixture model for this dataset (at ICMS in 2001). I am unclear as to where the number of modes is extracted from the infinite predictive. What is $\theta$ in this case?

Copulas!!! Although I am unclear why Corollary 1 implies that any sequence of copulas does the job. And why the Gaussian copula appears as the default choice. What is the computing cost of the update (4.4) after k steps? Similarly (4.7) is using a very special form of copula, with independent-across-dimension increments. Missing a guided tour on the implementation, as it sounds explosive in book-keeping and multiplying. A single hyperparameter (4.5.2)?

## machine-learning harmonic mean

Posted in Books, Statistics with tags , , , , , , on February 25, 2022 by xi'an

In a recent arXival, Jason McEwen propose a resurrection of the “infamous” harmonic mean estimator. In Machine learning assisted Bayesian model comparison: learnt harmonic mean estimator, they propose to aim at the “optimal importance function”. The paper provides a fair coverage of the literature on that topic, incl. our 2009 paper with Darren Wraith (although I do not follow the criticism of using a uniform over an HPD region, esp. since one of the learnt targets is also a uniform over an hypersphere, presumably optimised in terms of the chosen parameterisation).

“…the learnt harmonic mean estimator, a variant of the original estimator that solves its large variance problem. This is achieved by interpreting the harmonic mean estimator as importance sampling and introducing a new target distribution (…) learned to approximate the optimal but inaccessible target, while minimising the variance of the resulting estimator. Since the estimator requires samples of the posterior only it is agnostic to the strategy used to generate posterior samples.”

The method thus builds upon Gelfand and Dey (1994) general proposal that is a form of inverse importance sampling since the numerator [the new target] is free while the denominator is the unnormalised posterior. The optimal target being the complete posterior (since it lead to a null variance), the authors propose to try to approximate this posterior by various means. (Note however that an almost Dirac mass at a value with positive posterior would work as well!, at least in principle…) as the sections on moment approximations sound rather standard (and assume the estimated variances are finite) while the reason for the inclusion of the Bayes factor approximation is rather unclear. However, I am rather skeptical at the proposals made therein towards approximating the posterior distribution, from a Gaussian mixture [for which parameterisation?] to KDEs, or worse ML tools like neural nets [not explored there, which makes one wonder about the title], as the estimands will prove very costly, and suffer from the curse of dimensionality (3 hours for d=2¹⁰…).The Pima Indian women’s diabetes dataset and its quasi-Normal posterior are used as a benchmark, meaning that James and Nicolas did not shout loud enough! And I find surprising that most examples include the original harmonic mean estimator despite its complete lack of trustworthiness.

## new estimators of evidence

Posted in Books, Statistics with tags , , , , , , , , , , , , on June 19, 2018 by xi'an

In an incredible accumulation of coincidences, I came across yet another paper about evidence and the harmonic mean challenge, by Yu-Bo Wang, Ming-Hui Chen [same as in Chen, Shao, Ibrahim], Lynn Kuo, and Paul O. Lewis this time, published in Bayesian Analysis. (Disclaimer: I was not involved in the reviews of any of these papers!)  Authors who arelocated in Storrs, Connecticut, in geographic and thematic connection with the original Gelfand and Dey (1994) paper! (Private joke about the Old Man of Storr in above picture!)

“The working parameter space is essentially the constrained support considered by Robert and Wraith (2009) and Marin and Robert (2010).”

The central idea is to use a more general function than our HPD restricted prior but still with a known integral. Not in the sense of control variates, though. The function of choice is a weighted sum of indicators of terms of a finite partition, which implies a compact parameter set Ω. Or a form of HPD region, although it is unclear when the volume can be derived. While the consistency of the estimator of the inverse normalising constant [based on an MCMC sample] is unsurprising, the more advanced part of the paper is about finding the optimal sequence of weights, as in control variates. But it is also unsurprising in that the weights are proportional to the inverses of the inverse posteriors over the sets in the partition. Since these are hard to derive in practice, the authors come up with a fairly interesting alternative, which is to take the value of the posterior at an arbitrary point of the relevant set.

The paper also contains an extension replacing the weights with functions that are integrable and with known integrals. Which is hard for most choices, even though it contains the regular harmonic mean estimator as a special case. And should also suffer from the curse of dimension when the constraint to keep the target almost constant is implemented (as in Figure 1).

The method, when properly calibrated, does much better than harmonic mean (not a surprise) and than Petris and Tardella (2007) alternative, but no other technique, on toy problems like Normal, Normal mixture, and probit regression with three covariates (no Pima Indians this time!). As an aside I find it hard to understand how the regular harmonic mean estimator takes longer than this more advanced version, which should require more calibration. But I find it hard to see a general application of the principle, because the partition needs to be chosen in terms of the target. Embedded balls cannot work for every possible problem, even with ex-post standardisation.

## Metropolis-Hastings importance sampling

Posted in Books, Statistics, University life with tags , , , , , , , , , on June 6, 2018 by xi'an

[Warning: As I first got the paper from the authors and sent them my comments, this paper read contains their reply as well.]

In a sort of crazy coincidence, Daniel Rudolf and Björn Sprungk arXived a paper on a Metropolis-Hastings importance sampling estimator that offers similarities with  the one by Ingmar Schuster and Ilja Klebanov posted on arXiv the same day. The major difference in the construction of the importance sampler is that Rudolf and Sprungk use the conditional distribution of the proposal in the denominator of their importance weight, while Schuster and Klebanov go for the marginal (or a Rao-Blackwell representation of the marginal), mostly in an independent Metropolis-Hastings setting (for convergence) and for a discretised Langevin version in the applications. The former use a very functional L² approach to convergence (which reminded me of the early Schervish and Carlin, 1990, paper on the convergence of MCMC algorithms), not all of it necessary in my opinion. As for instance the extension of convergence properties to the augmented chain, namely (current, proposed), is rather straightforward since the proposed chain is a random transform of the current chain. An interesting remark at the end of the proof of the CLT is that the asymptotic variance of the importance sampling estimator is the same as with iid realisations from the target. This is a point we also noticed when constructing population Monte Carlo techniques (more than ten years ago), namely that dependence on the past in sequential Monte Carlo does not impact the validation and the moments of the resulting estimators, simply because “everything cancels” in importance ratios. The mean square error bound on the Monte Carlo error (Theorem 20) is not very surprising as the term ρ(y)²/P(x,y) appears naturally in the variance of importance samplers.

The first illustration where the importance sampler does worse than the initial MCMC estimator for a wide range of acceptance probabilities (Figures 2 and 3, which is which?) and I do not understand the opposite conclusion from the authors.

Indeed the formulation in our paper is unfortunate. The point we want to stress is that we observed in the numerical experiments certain ranges of step-sizes for which MH importance sampling shows a better performance than the classical MH algorithm with optimal scaling. Meaning that the MH importance sampling with optimal step-size can outperform MH sampling, without using additional computational resources. Surprisingly, the optimal step-size for the MH importance sampling estimator seems to remain constant for an increasing dimension in contrast to the well-known optimal scaling of the MH algorithm (given by a constant optimal acceptance rate).

The second uses the Pima Indian diabetes benchmark, amusingly (?) referring to Chopin and Ridgway (2017) who warn against the recourse to this dataset and to this model! The loss in mean square error due to the importance sampling may again be massive (Figure 5) and setting for an optimisation of the scaling factor in Metropolis-Hastings algorithms sounds unrealistic.