## martingale posteriors

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , on November 7, 2022 by xi'an

A new Royal Statistical Society Read Paper featuring Edwin Fong, Chris Holmes, and Steve Walker. Starting from the predictive

$p(y_{n+1:+\infty}|y_{1:n})\ \ \ (1)$

rather than from the posterior distribution on the parameter is a fairly novel idea, also pursued by Sonia Petrone and some of her coauthors. It thus adopts a de Finetti’s perspective while adding some substance to the rather metaphysical nature of the original. It however relies on the “existence” of an infinite sample in (1) that assumes a form of underlying model à la von Mises or at least an infinite population. The representation of a parameter θ as a function of an infinite sequence comes as a shock first but starts making sense when considering it as a functional of the underlying distribution. Of course, trading (modelling) a random “opaque” parameter θ for (envisioning) an infinite sequence of random (un)observations may sound like a sure loss rather than as a great deal, but it gives substance to the epistemic uncertainty about a distributional parameter, even when a model is assumed, as in Example 1, which defines θ in the usual parametric way (i.e., the mean of the iid variables). Furthermore, the link with bootstrap and even more Bayesian bootstrap becomes clear when θ is seen this way.

Always a fan of minimal loss approaches, but (2.4) defines either a moment or a true parameter value that depends on the parametric family indexed by θ. Hence does not exist outside the primary definition of said parametric family. The following construct of the empirical cdf based on the infinite sequence as providing the θ function is elegant but what is its Bayesian justification? (I did not read Appendix C.2. in full detail but could not spot the prior on F.)

“The resemblance of the martingale posterior to a bootstrap estimator should not have gone unnoticed”

I am always fan of minimal loss approaches, but I wonder at (2.4), as it defines either a moment or a true parameter value that depends on the parametric family indexed by θ. Hence it does not exist outside the primary definition of said parametric family, which limits its appeal. The following construct of the empirical cdf based on the infinite sequence as providing the θ function is elegant and connect with bootstrap, but I wonder at its Bayesian justification. (I did not read Appendix C.2. in full detail but could not spot a prior on F.)

While I completely missed the resemblance, it is indeed the case that, if the predictive at each step is build from the earlier “sample”, the support is not going to evolve. However, this is not particularly exciting as the Bayesian non-parametric estimator is most rudimentary. This seems to bring us back to Rubin (1981) ?! A Dirichlet prior is mentioned with no further detail. And I am getting confused at the complete lack of structure, prior, &tc. It seems to contradict the next section:

“While the prescription of (3.1) remains a subjective task, we find it to be no more subjective than the selection of a likelihood function”

Copulas!!! Again, I am very glad to see copulas involved in the analysis. However, I remain unclear as to why Corollary 1 implies that any sequence of copulas could do the job. Further, why does the Gaussian copula appear as the default choice? What is the computing cost of the update (4.4) after k steps? Similarly (4.7) is using a very special form of copula, with independent-across-dimension increments. I am also missing a guided tour on the implementation, as it sounds explosive in book-keeping and multiplying, while relying on a single hyperparameter in (4.5.2)?

In the illustration section, the use of the galaxy dataset may fail to appeal to Radford Neal, in a spirit similar to Chopin’s & Ridgway’s call to leave the Pima Indians alone, since he delivered a passionate lecture on the inappropriateness of a mixture model for this dataset (at ICMS in 2001). I am unclear as to where the number of modes is extracted from the infinite predictive. What is $\theta$ in this case?

Copulas!!! Although I am unclear why Corollary 1 implies that any sequence of copulas does the job. And why the Gaussian copula appears as the default choice. What is the computing cost of the update (4.4) after k steps? Similarly (4.7) is using a very special form of copula, with independent-across-dimension increments. Missing a guided tour on the implementation, as it sounds explosive in book-keeping and multiplying. A single hyperparameter (4.5.2)?

## day one at ISBA 22

Posted in pictures, Statistics, Travel, University life with tags , , , , , , , , , , , , , , , , , , , , , , , , , , on June 29, 2022 by xi'an

Started the day with a much appreciated swimming practice in the [alas warm⁺⁺⁺] outdoor 50m pool on the Island with no one but me in the slooow lane. And had my first ride with the biXi system, surprised at having to queue behind other bikes at red lights! More significantly, it was a great feeling to reunite at last with so many friends I had not met for more than two years!!!

My friend Adrian Raftery gave the very first plenary lecture on his work on the Bayesian approach to long-term population projections, which was recently  a work censored by some US States, then counter-censored by the Supreme Court [too busy to kill Roe v. Wade!]. Great to see the use of Bayesian methods validated by the UN Population Division [with at least one branch of the UN

Stephen Lauritzen returning to de Finetti notion of a model as something not real or true at all, back to exchangeability. Making me wonder when exchangeability is more than a convenient assumption leading to the Hewitt-Savage theorem. And sufficiency. I mean, without falling into a Keynesian fallacy, each point of the sample has unique specificities that cannot be taken into account in an exchangeable model. Nice to hear some measure theory, though!!! Plus a comment on the median never being sufficient, recouping an older (and presumably not original) point of mine. Stephen’s (or Fisher’s?) argument being that the median cannot be recursively computed!

Antonietta Mira and I had our ABC session this afternoon with Cecilia Viscardi, Sirio Legramanti, and Massimiliano Tamborino (Warwick) as speakers. Cecilia linked ABC with normalising flows, in collaboration with Dennis Prangle (whose earlier paper on this connection was presented as the first One World ABC seminar). Thus using past simulations to approximate the posterior by a neural network, possibly with a significant increase in computing time when compared with more rudimentary SMC-ABC methods in larger dimensions. Sirio considered summary-free ABC based on discrepancies like Rademacher complexity. Which more or less contains MMD, Kullback-Leibler, Wasserstein and more, although it seems to be dependent on the parameterisation of the observations. An interesting opening at the end was that this approach could apply to non iid settings. Massi presented a paper coauthored with Umberto that had just been arXived. On sequential ABC with a dependence on the summary statistic (hence guided). Further bringing copulas into the game, although this forces another choice [for the marginals] in the method.

Tamara Broderick talked about a puzzling leverage effect of some observations in economic studies where a tiny portion of individuals may modify the significance or the sign of a coefficient, for which I cannot tell whether the data or the reliance on statistical significance are to blame. Robert Kohn presented mixture-of-Gaussian copulas [not to be confused with mixture of Gaussian-copulas!] and Nancy Reid concluded my first [and somewhat exhausting!] day at ISBA with a BFF talk on the different statistical paradigms take on confidence (for which the notion of calibration seems to remain frequentist).

Side comments: First, most people in the conference are wearing masks, which is great! Also, I find it hard to read slides from the screen, which I presume is an age issue (?!) Even more aside, I had Korean lunch in a place that refused to serve me a glass of water, which I find amazing.

## marginal likelihood as exhaustive X validation

Posted in Statistics with tags , , , , , , , , on October 9, 2020 by xi'an

In the June issue of Biometrika (for which I am deputy editor) Edwin Fong and Chris Holmes have a short paper (that I did not process!) on the validation of the marginal likelihood as the unique coherent updating rule. Marginal in the general sense of Bissiri et al. (2016). Coherent in the sense of being invariant to the order of input of exchangeable data, if in a somewhat self-defining version (Definition 1). As a consequence, marginal likelihood arises as the unique prequential scoring rule under coherent belief updating in the Bayesian framework. (It is unique given the prior or its generalisation, obviously.)

“…we see that 10% of terms contributing to the marginal likelihood come from out-of-sample predictions, using on average less than 5% of the available training data.”

The paper also contains the interesting remark that the log marginal likelihood is the average leave-p-out X-validation score, across all values of p. Which shows that, provided the marginal can be approximated, the X validation assessment is feasible. Which leads to a highly relevant (imho) spotlight on how this expresses the (deadly) impact of the prior selection on the numerical value of the marginal likelihood. Leaving outsome of the least informative terms in the X-validation leads to exactly the log geometric intrinsic Bayes factor of Berger & Pericchi (1996). Most interesting connection with the Bayes factor community but one that depends on the choice of the dismissed fraction of p‘s.

## a pen for ABC

Posted in Books, pictures, Statistics, Travel, University life with tags , , , , , , , , , on February 13, 2019 by xi'an

Among the flury of papers arXived around the ICML 2019 deadline, I read on my way back from Oxford a paper by Wiqvist et al. on learning summary statistics for ABC by neural nets. Pointing out at another recent paper by Jiang et al. (2017, Statistica Sinica) which constructed a neural network for predicting each component of the parameter vector based on the input (raw) data, as an automated non-parametric regression of sorts. Creel (2017) does the same but with summary statistics. The current paper builds up from Jiang et al. (2017), by adding the constraint that exchangeability and partial exchangeability features should be reflected by the neural net prediction function. With applications to Markovian models. Due to a factorisation theorem for d-block invariant models, the authors impose partial exchangeability for order d Markov models by combining two neural networks that end up satisfying this factorisation. The concept is exemplified for one-dimension g-and-k distributions, alpha-stable distributions, both of which are made of independent observations, and the AR(2) and MA(2) models, as in our 2012 ABC survey paper. Since the later is not Markovian the authors experiment with different orders and reach the conclusion that an order of 10 is most appropriate, although this may be impacted by being a ble to handle the true likelihood.

## calibrating approximate credible sets

Posted in Books, Statistics with tags , , , , , , , on October 26, 2018 by xi'an

Earlier this week, Jeong Eun Lee, Geoff Nicholls, and Robin Ryder arXived a paper on the calibration of approximate Bayesian credible intervals. (Warning: all three authors are good friends of mine!) They start from the core observation that dates back to Monahan and Boos (1992) of exchangeability between θ being generated from the prior and φ being generated from the posterior associated with one observation generated from the prior predictive. (There is no name for this distribution, other than the prior, that is!) A setting amenable to ABC considerations! Actually, Prangle et al. (2014) relies on this property for assessing the ABC error, while pointing out that the test for exchangeability is not fool-proof since it works equally for two generations from the prior.

“The diagnostic tools we have described cannot be “fooled” in quite the same way checks based on the exchangeability can be.”

The paper thus proposes methods for computing the coverage [under the true posterior] of a credible set computed using an approximate posterior. (I had to fire up a few neurons to realise this was the right perspective, rather than the reverse!) A first solution to approximate the exact coverage of the approximate credible set is to use logistic regression, instead of the exact coverage, based on some summary statistics [not necessarily in an ABC framework]. And a simulation outcome that the parameter [simulated from the prior] at the source of the simulated data is within the credible set. Another approach is to use importance sampling when simulating from the pseudo-posterior. However this sounds dangerously close to resorting to an harmonic mean estimate, since the importance weight is the inverse of the approximate likelihood function. Not that anything unseemly transpires from the simulations.