## ABC in Lapland²

Posted in Mountains, pictures, Statistics, University life on March 16, 2023 by xi'an

On the second day of our workshop, Aki Vehtari gave a short talk about his recent work on speeding up post-processing by importance sampling a simulation of an imprecise version of the likelihood until the desired precision is attained, with the importance weights corrected by Pareto smoothing. A very interesting foray into the meaning of practical models and the hard constraints on computer precision. Grégoire Clarté (formerly a PhD student of ours at Dauphine) stayed on a similar ground of using sparse GP versions of the likelihood and post-processing by VB, then stir and repeat!
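To fix ideas on the importance correction step, here is a minimal sketch of self-normalised importance sampling with an effective sample size diagnostic. It is not Aki's algorithm: the Pareto smoothing of the largest weights is left out, and the Gaussian "imprecise" proposal and Gaussian target (as well as all function names) are toy assumptions of mine.

```python
import math
import random

def log_normal_pdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2

def importance_correct(draws, log_target, log_proposal):
    """Return self-normalised importance weights and the effective sample size."""
    log_w = [log_target(x) - log_proposal(x) for x in draws]
    shift = max(log_w)  # subtract the max before exponentiating, for stability
    w = [math.exp(lw - shift) for lw in log_w]
    total = sum(w)
    w = [wi / total for wi in w]
    ess = 1.0 / sum(wi * wi for wi in w)
    return w, ess

# Toy assumption: the cheap approximation is N(0.5, 1.5^2), the exact target is N(0, 1).
rng = random.Random(1)
draws = [rng.gauss(0.5, 1.5) for _ in range(20000)]
weights, ess = importance_correct(
    draws,
    lambda x: log_normal_pdf(x, 0.0, 1.0),
    lambda x: log_normal_pdf(x, 0.5, 1.5),
)
# Weighted average of the approximate draws, now targeting the exact distribution.
corrected_mean = sum(w * x for w, x in zip(weights, draws))
```

A low effective sample size relative to the number of draws would signal that the approximation is too crude for the correction to be trusted, which is where smoothing the weight tail comes in.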

Riccardo Corradin did model-based clustering when the nonparametric mixture kernel is missing a normalizing constant, using ABC with a Wasserstein distance and an adaptive proposal, with some flavour of ABC-Gibbs (and no issue of label switching since this is clustering). Mixtures of g&k models, yay! Tommaso Rigon reconsidered clustering via a (generalised Bayes à la Bissiri et al.) discrepancy measure rather than a true model, summing over all clusters and observations a discrepancy between said observation and said cluster. Very neat if possibly costly since involving distances to clusters or within clusters. Although she considered post-processing and Bayesian bootstrap, Judith (formerly [?] Dauphine) acknowledged that she somewhat drifted from the theme of the workshop by considering BvM theorems for functionals of unknown functions, with a form of Laplace correction. (Enjoying Lapland so much that I thought “Lap” in Judith’s talk was for Lapland rather than Laplace!!!) And applications to causality.
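As a reminder of what ABC with a Wasserstein distance looks like in its simplest form, here is a hedged sketch of plain rejection ABC on a toy Gaussian location model. This is far from Riccardo's setting: there is no mixture, no g&k kernel, and no adaptive proposal, and the uniform prior, sample sizes, and function names are assumptions made for the illustration. For univariate samples of equal size, the 1-Wasserstein distance reduces to the average absolute difference between order statistics.

```python
import random

def wasserstein1(x, y):
    """1-Wasserstein distance between two equal-size univariate samples."""
    assert len(x) == len(y)
    return sum(abs(a - b) for a, b in zip(sorted(x), sorted(y))) / len(x)

def abc_wasserstein(observed, n_proposals=3000, keep=100, seed=0):
    """Rejection ABC keeping the `keep` proposals closest to the data."""
    rng = random.Random(seed)
    n = len(observed)
    scored = []
    for _ in range(n_proposals):
        theta = rng.uniform(-5.0, 5.0)                      # assumed prior
        pseudo = [rng.gauss(theta, 1.0) for _ in range(n)]  # assumed model
        scored.append((wasserstein1(observed, pseudo), theta))
    scored.sort()
    return [theta for _, theta in scored[:keep]]

# Toy data simulated with location 2.
data_rng = random.Random(42)
observed = [data_rng.gauss(2.0, 1.0) for _ in range(50)]
accepted = abc_wasserstein(observed)
```

No summary statistic is needed here, since the distance compares the whole empirical distributions, which is the appeal of the Wasserstein version.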

After the (X country skiing) break, Lorenzo Pacchiardi presented his adversarial approach to ABC, differing from Ramesh et al. (2022) by the use of scoring rule minimisation, where unbiased estimators of gradients are available. Ayush Bharti argued for involving experts in selecting the summary statistics, esp. for misspecified models, and Ulpu Remes presented a Jensen-Shannon divergence for selecting models likelihood-freely, using a test statistic as summary statistic.
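For readers unfamiliar with scoring rules, the following sketch computes an unbiased estimate of the energy score (with power 1) of a univariate simulated predictive at an observation, under the convention 2 E|X − y| − E|X − X′| where lower is better. The energy score is my choice of example, not necessarily the rule used in Lorenzo's talk, and the function name is hypothetical.

```python
import random

def energy_score(samples, y):
    """Unbiased estimate of 2 E|X - y| - E|X - X'| from draws of X (lower is better)."""
    m = len(samples)
    # Average distance between the simulated draws and the observation.
    fit = sum(abs(x - y) for x in samples) / m
    # U-statistic over distinct pairs, hence no bias from the i == j terms.
    spread = sum(
        abs(samples[i] - samples[j])
        for i in range(m) for j in range(m) if i != j
    ) / (m * (m - 1))
    return 2.0 * fit - spread

rng = random.Random(3)
well_placed = [rng.gauss(0.0, 1.0) for _ in range(200)]
misplaced = [rng.gauss(3.0, 1.0) for _ in range(200)]
score_good = energy_score(well_placed, 0.0)
score_bad = energy_score(misplaced, 0.0)
```

Since the estimate only involves simulations from the model, it can be averaged over observations and minimised in the parameter without ever evaluating a likelihood.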

Sam Duffield made a case for generalised Bayesian inference in correcting errors in quantum computers, Joshua Bon went back to scoring rules for correcting the ABC approximation, with an importance step, while Trevor Campbell, Iuri Marocco and Hector McKimm nicely concluded the workshop with lightning-fast talks in place of the cancelled poster session. Great workshop, in my most objective opinion, with new directions!

## ABC in Lapland

Posted in Mountains, pictures, Statistics, University life on March 15, 2023 by xi'an

Greetings from Levi, Lapland! Sonia Petrone beautifully started the ABC workshop with a (the!) plenary Sunday night talk on quasi-Bayes in the spirit of both Fortini & Petrone (2020) and the more recent Fong, Holmes, and Walker (2023). The talk left me wondering about the nature of the convergence, in that it happens no matter what the underlying distribution (or lack thereof) of the data is, in that, even without any exchangeability structure, the predictive is converging. The quasi stems from a connection with the historical Smith and Makov (1978) sequential update approximation for the posterior attached with mixtures of distributions. Which itself relates to both Dirichlet posterior updates and Bayesian bootstrap à la Newton & Raftery. Appropriate link when the convergence seems to stem from the sequence of predictives instead of the underlying distribution, if any, pulling Bayes by its own bootstrap…! Chris Holmes also talked the next day about this approach, esp. about a Bayesian approach to causality that does not require counterfactuals, in connection with a recent arXival of his (on my reading list).

Carlo Alberto presented both his 2014 SABC (simulated annealing) algorithm with a neat idea of reducing waste in the tempering schedule and a recent summary selection approach based on an auto-encoder function of both y and noise to reduce to a sufficient statistic. A similar idea was found in Yannik Schälte’s talk (slide above), who was returning to Richard Wilkinson’s exact ABC with an adaptive sequential generator, also linking to simulated annealing, with ABC-SMC to the rescue. Notion of amortized inference. Seemingly approximating the data y with a NN and then learning the parameter by a normalising flow.

David Frazier talked on the Q-posterior approach, based on Fisher’s identity, for approximating the score function, which first seemed to require some exponential family structure on a completed model (but does not, after discussing with David!). Jack Jewson talked on beta divergence priors for uncertainty on likelihoods, better than the KL divergence in ε-contamination situations. Any impact on ABC? Masahiro Fujisawa went back to the impact of outliers on ABC, again with ε-contaminations (with me wondering at the impact of outliers on NN estimation).

In the afternoon session (due to two last minute cancellations, we skipped (or [MCMC] skied) one afternoon session, which coincided with a bright and crisp day, how convenient!), Massi Tamborrino (U of Warwick) discussed the FitzHugh-Nagumo process, where the inference problem is impossible to solve differently: for instance, Euler-Maruyama does not always work, as numerical schemes induce a bias. Back to ABC with the hunt for a summary that gets rid of the noise, as in Carlo Alberto’s work. Yuexi Wang talked about her work on adversarial ABC inspired by GANs. Another instance where noise is used as input. True data not used in training? Imke Botha discussed an improvement to ensemble Kalman inversion which, while biased, gains over both regular SMC timewise and ensemble Kalman inversion in precision, and Chaya Weerasinghe focussed on Bayesian forecasting in state space models under model misspecification, via approximate Bayesian computation, using an auxiliary model to produce summary statistics as in indirect inference.

## martingale posteriors

Posted in Books, Statistics, University life on November 7, 2022 by xi'an

A new Royal Statistical Society Read Paper featuring Edwin Fong, Chris Holmes, and Steve Walker. Starting from the predictive

$p(y_{n+1:+\infty}|y_{1:n})\ \ \ (1)$

rather than from the posterior distribution on the parameter is a fairly novel idea, also pursued by Sonia Petrone and some of her coauthors. It thus adopts de Finetti’s perspective while adding some substance to the rather metaphysical nature of the original. It however relies on the “existence” of an infinite sample in (1) that assumes a form of underlying model à la von Mises or at least an infinite population. The representation of a parameter θ as a function of an infinite sequence comes as a shock at first but starts making sense when considering it as a functional of the underlying distribution. Of course, trading (modelling) a random “opaque” parameter θ for (envisioning) an infinite sequence of random (un)observations may sound like a sure loss rather than a great deal, but it gives substance to the epistemic uncertainty about a distributional parameter, even when a model is assumed, as in Example 1, which defines θ in the usual parametric way (i.e., the mean of the iid variables). Furthermore, the link with bootstrap and even more Bayesian bootstrap becomes clear when θ is seen this way.
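The bootstrap connection can be made concrete with a small sketch of predictive resampling when θ is the mean, assuming the simplest possible predictive, namely the empirical distribution of all values seen so far (a Pólya urn). The infinite sequence in (1) is truncated at a finite horizon, and the function names and toy data are mine, not the paper's.

```python
import random
import statistics

def predictive_resample_mean(y, n_forward, rng):
    """One draw of theta: complete the sample by sampling from the running
    empirical predictive, then return the mean of the completed sequence."""
    urn = list(y)
    for _ in range(n_forward):
        # The next (un)observation is drawn uniformly among all values so far.
        urn.append(rng.choice(urn))
    return statistics.fmean(urn)

def martingale_posterior_mean(y, n_draws=300, n_forward=1000, seed=0):
    """Repeat the forward simulation to get a sample of theta values."""
    rng = random.Random(seed)
    return [predictive_resample_mean(y, n_forward, rng) for _ in range(n_draws)]

y = [1.2, 0.4, 2.3, 1.9, 0.7]  # toy data, sample mean 1.3
theta_draws = martingale_posterior_mean(y)
```

Each draw is a weighted average of the observed values with (approximately) Dirichlet weights, which is the Bayesian bootstrap, and the support never leaves the observed sample.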

I am always a fan of minimal loss approaches, but I wonder at (2.4), as it defines either a moment or a true parameter value that depends on the parametric family indexed by θ. Hence it does not exist outside the primary definition of said parametric family, which limits its appeal. The following construct of the empirical cdf based on the infinite sequence as providing the θ function is elegant and connects with the bootstrap, but I wonder at its Bayesian justification. (I did not read Appendix C.2 in full detail but could not spot a prior on F.)

“The resemblance of the martingale posterior to a bootstrap estimator should not have gone unnoticed”

While I completely missed the resemblance, it is indeed the case that, if the predictive at each step is built from the earlier “sample”, the support is not going to evolve. However, this is not particularly exciting as the Bayesian non-parametric estimator is most rudimentary. This seems to bring us back to Rubin (1981)?! A Dirichlet prior is mentioned with no further detail. And I am getting confused at the complete lack of structure, prior, &tc. It seems to contradict the next section:

“While the prescription of (3.1) remains a subjective task, we find it to be no more subjective than the selection of a likelihood function”

Copulas!!! Again, I am very glad to see copulas involved in the analysis. However, I remain unclear as to why Corollary 1 implies that any sequence of copulas could do the job. Further, why does the Gaussian copula appear as the default choice? What is the computing cost of the update (4.4) after k steps? Similarly (4.7) is using a very special form of copula, with independent-across-dimension increments. I am also missing a guided tour on the implementation, as it sounds explosive in book-keeping and multiplying, while relying on a single hyperparameter in (4.5.2)?

In the illustration section, the use of the galaxy dataset may fail to appeal to Radford Neal, in a spirit similar to Chopin & Ridgway’s call to leave the Pima Indians alone, since he delivered a passionate lecture on the inappropriateness of a mixture model for this dataset (at ICMS in 2001). I am unclear as to where the number of modes is extracted from the infinite predictive. What is θ in this case?
