Archive for Bayesian predictive

Conformal Bayesian Computation

Posted in Books, pictures, Statistics, University life with tags , , , , , on July 8, 2021 by xi'an

Edwin Fong and Chris Holmes (Oxford) just wrote a paper on Bayesian scalable methods from a M-open perspective. Borrowing from the conformal prediction framework of Vovk et al. (2005) to achieve frequentist coverage for prediction intervals. The method starts with the choice of a conformity measure that measures how well each observation in the sample agrees with the sample. Which is exchangeable and hence leads to a rank statistic from which a p-value can be derived. Which is the empirical cdf associated with the observed conformities. Following Vovk et al. (2005) and Wasserman (2011) Edwin and Chris note that the Bayesian predictive itself acts like a conformity measure. Predictive that can itself be approximated by MCMC and importance sampling (possibly smoothed by Pareto). The paper also expands the setting to partial exchangeable models, renamed group conformal predictions. While reluctant to engage into turning Bayesian solutions into frequentist ones, I can see some worth in deriving both in order to expose discrepancies and hence signal possible issues with models and priors.

focused Bayesian prediction

Posted in Books, pictures, Statistics, Travel, University life with tags , , , , , , , , , , , , , on June 3, 2020 by xi'an

In this fourth session of our One World ABC Seminar, my friend and coauthor Gael Martin, gave an after-dinner talk on focused Bayesian prediction, more in the spirit of Bissiri et al. than following a traditional ABC approach.  because along with Ruben Loaiza-Maya and [my friend and coauthor] David Frazier, they consider the possibility of a (mild?) misspecification of the model. Using thus scoring rules à la Gneiting and Raftery. Gael had in fact presented an earlier version at our workshop in Oaxaca, in November 2018. As in other solutions of that kind, difficulty in weighting the score into a distribution. Although asymptotic irrelevance, direct impact on the current predictions, at least for the early dates in the time series… Further calibration of the set of interest A. Or the focus of the prediction. As a side note the talk perfectly fits the One World likelihood-free seminar as it does not use the likelihood function!

“The very premise of this paper is that, in reality, any choice of predictive class is such that the truth is not contained therein, at which point there is no reason to presume that the expectation of any particular scoring rule will be maximized at the truth or, indeed, maximized by the same predictive distribution that maximizes a different (expected) score.”

This approach requires the proxy class to be close enough to the true data generating model. Or in the word of the authors to be plausible predictive models. And to produce the true distribution via the score as it is proper. Or the closest to the true model in the misspecified family. I thus wonder at a possible extension with a non-parametric version, the prior being thus on functionals rather than parameters, if I understand properly the meaning of Π(Pθ). (Could the score function be misspecified itself?!) Since the score is replaced with its empirical version, the implementation is  resorting to off-the-shelf MCMC. (I wonder for a few seconds if the approach could be seen as a pseudo-marginal MCMC but the estimation is always based on the same observed sample, hence does not directly fit the pseudo-marginal MCMC framework.)

[Notice: Next talk in the series is tomorrow, 11:30am GMT+1.]

ABCDE for approximate Bayesian conditional density estimation

Posted in Books, pictures, Statistics, Travel, University life with tags , , , , , , , , , , , , , on February 26, 2018 by xi'an

Another arXived paper I surprisingly (?) missed, by George Papamakarios and Iain Murray, on an ABCDE (my acronym!) substitute to ABC for generative models. The paper was reviewed [with reviews made available!] and accepted by NIPS 2016. (Most obviously, I was not one of the reviewers!)

“Conventional ABC algorithms such as the above suffer from three drawbacks. First, they only represent the parameter posterior as a set of (possibly weighted or correlated) samples [for which] it is not obvious how to perform some other computations using samples, such as combining posteriors from two separate analyses. Second, the parameter samples do not come from the correct Bayesian posterior (…) Third, as the ε-tolerance is reduced, it can become impractical to simulate the model enough times to match the observed data even once [when] simulations are expensive to perform”

The above criticisms are a wee bit overly harsh as, well…, Monte Carlo approximations remain a solution worth considering for all Bayesian purposes!, while the approximation [replacing the data with a ball] in ABC is replaced with an approximation of the true posterior as a mixture. Both requiring repeated [and likely expensive] simulations. The alternative is in iteratively simulating from pseudo-predictives towards learning better pseudo-posteriors, then used as new proposals at the next iteration modulo an importance sampling correction.  The approximation to the posterior chosen therein is a mixture density network, namely a mixture distribution with parameters obtained as neural networks based on the simulated pseudo-observations. Which the authors claim [p.4] requires no tuning. (Still, there are several aspects to tune, from the number of components to the hyper-parameter λ [p.11, eqn (35)], to the structure of the neural network [20 tanh? 50 tanh?], to the number of iterations, to the amount of X checking. As usual in NIPS papers, it is difficult to assess how arbitrary the choices made in the experiments are. Unless one starts experimenting with the codes provided.) All in all, I find the paper nonetheless exciting enough (!) to now start a summer student project on it in Dauphine and hope to check the performances of ABCDE on different models, as well as comparing this ABC implementation with a synthetic likelihood version.

 As an addendum, let me point out the very pertinent analysis of this paper by Dennis Prangle, 18 months ago!

a unified treatment of predictive model comparison

Posted in Books, Statistics, University life with tags , , , , , , , , , on June 16, 2015 by xi'an

“Applying various approximation strategies to the relative predictive performance derived from predictive distributions in frequentist and Bayesian inference yields many of the model comparison techniques ubiquitous in practice, from predictive log loss cross validation to the Bayesian evidence and Bayesian information criteria.”

Michael Betancourt (Warwick) just arXived a paper formalising predictive model comparison in an almost Bourbakian sense! Meaning that he adopts therein a very general representation of the issue, with minimal assumptions on the data generating process (excluding a specific metric and obviously the choice of a testing statistic). He opts for an M-open perspective, meaning that this generating process stands outside the hypothetical statistical model or, in Lindley’s terms, a small world. Within this paradigm, the only way to assess the fit of a model seems to be through the predictive performances of that model. Using for instance an f-divergence like the Kullback-Leibler divergence, based on the true generated process as the reference. I think this however puts a restriction on the choice of small worlds as the probability measure on that small world has to be absolutely continuous wrt the true data generating process for the distance to be finite. While there are arguments in favour of absolutely continuous small worlds, this assumes a knowledge about the true process that we simply cannot gather. Ignoring this difficulty, a relative Kullback-Leibler divergence can be defined in terms of an almost arbitrary reference measure. But as it still relies on the true measure, its evaluation proceeds via cross-validation “tricks” like jackknife and bootstrap. However, on the Bayesian side, using the prior predictive links the Kullback-Leibler divergence with the marginal likelihood. And Michael argues further that the posterior predictive can be seen as the unifying tool behind information criteria like DIC and WAIC (widely applicable information criterion). Which does not convince me towards the utility of those criteria as model selection tools, as there is too much freedom in the way approximations are used and a potential for using the data several times.

posterior predictive distributions of Bayes factors

Posted in Books, Kids, Statistics with tags , , , on October 8, 2014 by xi'an

Once a Bayes factor B(y)  is computed, one needs to assess its strength. As repeated many times here, Jeffreys’ scale has no validation whatsoever, it is simply a division of the (1,∞) range into regions of convenience. Following earlier proposals in the literature (Box, 1980; García-Donato and Chen, 2005; Geweke and Amisano, 2008), an evaluation of this strength within the issue at stake, i.e. the comparison of two models, can be based on the predictive distribution. While most authors (like García-Donato and Chen) consider the prior predictive, I think using the posterior predictive distribution is more relevant since

  1. it exploits the information contained in the data y, thus concentrates on a region of relevance in the parameter space(s), which is especially interesting in weakly informative settings (even though we should abstain from testing in those cases, dixit Andrew);
  2. it reproduces the behaviour of the Bayes factor B(x) for values x of the observation similar to the original observation y;
  3. it does not hide issues of indeterminacy linked with improper priors: the Bayes factor B(x) remains indeterminate, even with a well-defined predictive;
  4. it does not separate between errors of type I and errors of type II but instead uses the natural summary provided by the Bayesian analysis, namely the predictive distribution π(x|y);
  5. as long as the evaluation is not used to reach a decision, there is no issue of “using the data twice”, we are simply producing an estimator of the posterior loss, for instance the (posterior) probability of selecting the wrong model. The Bayes factor B(x) is thus functionally  independent of y, while x is probabilistically dependent on y.

Note that, even though probabilities of errors of type I and errors of type II can be computed, they fail to account for the posterior probabilities of both models. (This is the delicate issue with the solution of García-Donato and Chen.) Another nice feature is that the predictive distribution of the Bayes factor can be computed even in complex settings where ABC needs to be used.