Archive for Bayesian foundations

Why should I be Bayesian when my model is wrong?

Posted in Books, pictures, Running, Statistics, Travel, University life on May 9, 2017 by xi'an

Guillaume Dehaene posted the above question on X validated last Friday. Here is an excerpt from it:

However, as everybody knows, assuming that my model is correct is fairly arrogant: why should Nature fall neatly inside the box of the models which I have considered? It is much more realistic to assume that the real model of the data p(x) differs from p(x|θ) for all values of θ. This is usually called a “misspecified” model.

My problem is that, in this more realistic misspecified case, I don’t have any good arguments for being Bayesian (i.e: computing the posterior distribution) versus simply computing the Maximum Likelihood Estimator.

Indeed, according to Kleijn and van der Vaart (2012), in the misspecified case, the posterior distribution converges as n→∞ to a Dirac distribution centred at the MLE but does not have the correct variance (unless two values just happen to be the same) in order to ensure that credible intervals of the posterior match confidence intervals for θ.

Which is a very interesting question…that may not have an answer (but that does not make it less interesting!)
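To make the variance mismatch in the question concrete, here is a minimal simulation sketch. The setting is my own toy assumption, not taken from the question or from Kleijn and van der Vaart: the working model is N(θ,1) with a flat prior, while the data actually come from N(0,σ²) with σ=2. The posterior then concentrates around the sample mean with standard deviation 1/√n, whereas the sample mean itself fluctuates with standard deviation σ/√n, so the nominal 95% credible interval undercovers.

```python
import numpy as np

def misspecified_normal_demo(n=400, true_sd=2.0, reps=2000, seed=0):
    """Toy check: working model N(theta, 1) with flat prior, data from N(0, true_sd^2)."""
    rng = np.random.default_rng(seed)
    data = rng.normal(0.0, true_sd, size=(reps, n))
    # MLE of theta, which is also the posterior mean under the flat prior
    mle = data.mean(axis=1)
    # posterior standard deviation implied by the (wrong) unit-variance model
    post_sd = 1.0 / np.sqrt(n)
    # actual spread of the MLE across replicated datasets
    sampling_sd = mle.std(ddof=1)
    # frequentist coverage of the nominal 95% credible interval for the true mean 0
    coverage = np.mean(np.abs(mle) <= 1.96 * post_sd)
    return post_sd, sampling_sd, coverage

post_sd, sampling_sd, coverage = misspecified_normal_demo()
print(f"posterior sd {post_sd:.3f}, sampling sd of MLE {sampling_sd:.3f}, coverage {coverage:.2f}")
```

With these (hypothetical) settings the posterior is about half as wide as it should be for the credible interval to double as a confidence interval, and the two only agree when σ happens to equal the model value of 1.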

A few thoughts about that meme that all models are wrong (resonating from last week's discussion):

  1. While the hypothetical model is indeed almost invariably and irremediably wrong, it still makes sense to act in an efficient or coherent manner with respect to this model if this is the best one can do. The resulting inference produces an evaluation of the formal model that is the “closest” to the actual data generating model (if any);
  2. There exist Bayesian approaches that can do without the model, a most recent example being the papers by Bissiri et al. (with my comments) and by Watson and Holmes (which I discussed with Judith Rousseau);
  3. In a connected way, there exists a whole branch of Bayesian statistics dealing with M-open inference;
  4. And yet another direction I like a lot is the SafeBayes approach of Peter Grünwald, who takes into account model misspecification to replace the likelihood with a down-graded version expressed as a power of the original likelihood.
  5. The very recent Read Paper by Gelman and Hennig addresses this issue, albeit in a convoluted manner (and I added some comments on my blog).
  6. In a sense, Bayesians should be the least concerned among statisticians and modellers about this aspect since the sampling model is to be taken as one of several prior assumptions and the outcome is conditional or relative to all those prior assumptions.
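As an illustration of the fourth item, replacing the likelihood with a power of itself, here is a small sketch in the conjugate Normal case. It is only a sketch under my own assumptions: the function name and default prior are hypothetical, and the power η is fixed by hand, whereas SafeBayes chooses it from the data, a step not reproduced here. With a N(m₀,s₀²) prior and a N(θ,σ²) working model, raising the likelihood to the power η simply multiplies the data precision n/σ² by η.

```python
import numpy as np

def power_posterior_normal(x, eta, prior_mean=0.0, prior_sd=10.0, model_sd=1.0):
    """Posterior mean and sd for a Normal mean when the likelihood is raised to the power eta.

    Hypothetical helper: eta = 1 gives the standard posterior, eta = 0 returns the prior.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    # tempered posterior precision: prior precision + eta * data precision
    precision = 1.0 / prior_sd**2 + eta * n / model_sd**2
    mean = (prior_mean / prior_sd**2 + eta * x.sum() / model_sd**2) / precision
    return mean, 1.0 / np.sqrt(precision)

x = [1.0, 2.0, 3.0]
for eta in (1.0, 0.5, 0.1):
    m, s = power_posterior_normal(x, eta)
    print(f"eta={eta}: posterior mean {m:.3f}, posterior sd {s:.3f}")
```

Taking η<1 widens the posterior and pulls it towards the prior, which is the sense in which the likelihood is down-graded.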

Bayes is typically wrong…

Posted in pictures, Running, Statistics, Travel, University life on May 3, 2017 by xi'an

In Harvard, this morning, Don Fraser gave a talk at the Bayesian, Fiducial, and Frequentist conference where he repeated [as shown by the above quote] the rather harsh criticisms of Bayesian inference he published last year in Statistical Science. And which I discussed a few days ago. The “wrongness” of Bayes starts with the completely arbitrary choice of the prior, which Don sees as unacceptable, and then increases because the credible regions are not confidence regions, outside natural parameters from exponential families (Welch and Peers, 1963). And one-dimensional parameters using the profile likelihood (although I cannot find a proper definition of what the profile likelihood is in the paper, apparently a plug-in version that is not a genuine likelihood, hence somewhat falling under the same this-is-not-a-true-probability cleaver as the disputed Bayesian approach).

“I expect we’re all missing something, but I do not know what it is.” D.R. Cox, Statistical Science, 1994

And then Nancy Reid delivered a plenary lecture “Are we converging?” in the afternoon that compared most principles (including objective if not subjective Bayes) against different criteria, like consistency, nuisance elimination, calibration, meaning of probability, and so on. In a highly analytic if pessimistic panorama. (The talk should be available on line at some point soon.)

on Dutch book arguments

Posted in Books, Kids, pictures, Statistics, Travel, University life on May 1, 2017 by xi'an

“Reality is not always probable, or likely.”― Jorge Luis Borges

As I am supposed to discuss Teddy Seidenfeld’s talk at the Bayesian, Fiducial, and Frequentist conference in Harvard today [the snow happened last time!], I started last week [while driving to Wales] reading some related papers of his. Which is great as I had never managed to get through the Dutch book arguments, including those in Jim’s book.

The paper by Mark Schervish, Teddy Seidenfeld, and Jay Kadane defines coherence as the inability to bet against the predictive statements based on the procedure. A definition that sounds like a self-fulfilling prophecy to me as it involves a probability measure over the parameter space. Furthermore, the notion of turning inference, which aims at scientific validation, into a leisure, no-added-value, and somewhat ethically dodgy activity like gambling, does not agree with my notion of a validation for a theory. That is, not as a compelling reason for adopting a Bayesian approach. Not that I have suddenly switched to the other [darker] side, but I do not feel those arguments helping in any way, because of this dodgy image associated with gambling. (Pardon my French, but each time I read about escrows, I think of escrocs, or crooks, which reinforces this image! Actually, this name derives from the Old French escroue, but the modern meaning of écroué is sent to jail, which brings us back to the same feeling…)

Furthermore, it sounds like both a weak notion, since it implies an almost sure loss for the bookmaker, plus coherency holds for any prior distribution, including Dirac masses!, and a frequentist one, in that it looks at all possible values of the parameter (in a statistical framework). It also turns errors into monetary losses, taking them at face value. Which sounds also very formal to me.

But the most fundamental problem I have with this approach is that, from a Bayesian perspective, it does not bring any evaluation or ranking of priors, and in particular does not help in selecting or eliminating some. By behaving like a minimax principle, it does not condition on the data and hence does not evaluate the predictive properties of the model in terms of the data, e.g. by comparing pseudo-data with real data.

While I see no reason to argue in favour of p-values or minimax decision rules, I am at a loss in understanding the examples in How to not gamble if you must. In the first case, i.e., when dismissing the α-level most powerful test in the simple vs. simple hypothesis testing case, the argument (in Example 4) starts from the classical (Neyman-Pearsonist) statistician favouring the 0.05-level test over others. Which sounds absurd, as this level corresponds to a given loss function, which cannot be compared with another loss function. Even though the authors chose to rephrase the dilemma in terms of a single 0-1 loss function and then turn the classical solution into the choice of an implicit variance-dependent prior. Plus force the poor Pearsonist to make a wager represented by the risk difference. The whole sequence of choices sounds both very convoluted and far away from the usual practice of a classical statistician… Similarly, when attacking [in Section 5.2] the minimax estimator in the Bernoulli case (for the corresponding proper prior depending on the sample size n), this minimax estimator is admissible under quadratic loss and still a Dutch book argument applies, which in my opinion definitely argues against the Dutch book reasoning. The way to produce such a domination result is to mix two Bernoulli estimation problems for two different sample sizes but the same parameter value, in which case there exist [other] choices of Beta priors and a convex combination of the risk functions that lead to this domination. But this example [Example 6] mostly exposes the artificial nature of the argument: when estimating the very same probability θ, what is the relevance of adding the risks or errors resulting from using two estimators for two different sample sizes? Of the very same probability θ. I insist on the very same because when instead estimating two [independent] values of θ, there cannot be a Stein effect for the Bernoulli probability estimation problem, that is, any aggregation of admissible estimators remains admissible. (And yes it definitely sounds like an exercise in frequentist decision theory!)
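For readers who do not have the Bernoulli minimax estimator in mind, here is a short sketch of the standard textbook version (not the construction of Example 6, which I do not reproduce): with x successes out of n, the Bayes estimator under the Beta(√n/2, √n/2) prior is (x+√n/2)/(n+√n), and its quadratic risk is the constant 1/(4(1+√n)²). The prior thus depends on the sample size n, which is what the mixing of two sample sizes exploits. The function names below are mine.

```python
from math import comb, sqrt

def quadratic_risk(estimator, n, theta):
    """Exact quadratic risk E[(estimator(X, n) - theta)^2] for X ~ Binomial(n, theta)."""
    return sum(
        comb(n, x) * theta**x * (1 - theta) ** (n - x) * (estimator(x, n) - theta) ** 2
        for x in range(n + 1)
    )

def minimax_estimator(x, n):
    # posterior mean under the Beta(sqrt(n)/2, sqrt(n)/2) prior
    return (x + sqrt(n) / 2) / (n + sqrt(n))

def mle(x, n):
    return x / n

n = 16
for theta in (0.1, 0.3, 0.5):
    print(theta, quadratic_risk(minimax_estimator, n, theta), quadratic_risk(mle, n, theta))
```

For n=16 the minimax risk equals 1/(4×25)=0.01 for every θ, while the risk of the MLE, θ(1-θ)/n, is lower near the boundaries and higher around θ=½.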

en route to Boston!

Posted in pictures, Running, Travel, University life on April 29, 2017 by xi'an

Bayes, reproducibility and the Quest for Truth

Posted in Books, Statistics, University life on April 27, 2017 by xi'an

Don Fraser, Mylène Bédard, and three coauthors have written a paper with the above dramatic title in Statistical Science about the reproducibility of Bayesian inference in the framework of what they call a mathematical prior. Connecting with the earlier quick-and-dirty tag attributed by Don to Bayesian credible intervals.

“We provide simple (…) counter-examples to general claims that Bayes can offer accuracy for statistical inference. To obtain this accuracy with Bayes, more effort is required compared to recent likelihood methods (…) [and] accuracy beyond first order is routinely not available (…) An alternative is to view default Bayes as an exploratory technique and then ask does it do as it overtly claims? Is it reproducible as understood in contemporary science? (…) No one has answers although speculative claims abound.” (p. 1)

The early stages of the paper question the nature of a prior distribution in terms of objectivity and reproducibility, which strikes me as a return to older debates on the nature of probability. And of a dubious insistence on the reality of a prior when the said reality is customarily and implicitly assumed for the sampling distribution. While we “can certainly ask how [a posterior] quantile relates to the true value of the parameter”, I see no compelling reason why the associated quantile should be endowed with a frequentist coverage meaning, i.e., be more than a normative indication of the deviation from the true value. (Assuming there is such a parameter.) To consider that the credible interval of interest can be “objectively” assessed by simulation experiments evaluating its coverage is thus doomed from the start (since there is no reason for the nominal coverage) and situated on the wrong plane since it stems from the hypothetical frequentist model for a range of parameter values. Instead I find simulations from (generating) models useful in a general ABC sense, namely by producing realisations from the predictive one can assess at which degree of roughness the data is compatible with the formal construct. To bind reproducibility to the frequentist framework thus sounds wrong [to me] as being model-based. In other words, I do not find the definition of reproducibility used in the paper to be objective (literally bouncing back from the Gelman and Hennig Read Paper).

At several points in the paper, the legal consequences of using a subjective prior are evoked as legally binding and implicitly as dangerous. With the example of the L’Aquila expert trial. I have trouble seeing the relevance of this entry as an adverse lawyer is as entitled to attack the expert on her or his sampling model. More fundamentally, I feel quite uneasy about bringing this type of argument into the debate!

sleeping beauty

Posted in Books, Kids, Statistics on December 24, 2016 by xi'an

Through X validated, W. Huber made me aware of this probability paradox [or para-paradox] of which I had never heard before. One of many guises of this paradox goes as follows:

Shahrazad is put to sleep on Sunday night. Depending on the hidden toss of a fair coin, she is awakened either once (Heads) or twice (Tails). After each awakening, she gets back to sleep and forgets that awakening. When awakened, what should her probability of Heads be?

My first reaction is to argue that Shahrazad does not gain information between the time she goes to sleep when the coin is fair and the time(s) she is awakened, apart from being awakened, since she does not know how many times she has been awakened, so the probability of Heads remains ½. However, when thinking more about it on my bike ride to work, I thought of the problem as a decision theory or betting problem, which makes ⅓ the optimal answer.

I then read [if not the huge literature] a rather extensive analysis of the paradox by Cisewski, Kadane, Schervish, Seidenfeld, and Stern (CKS³), which reaches roughly the same conclusion, namely that, when Monday is completely exchangeable with Tuesday, meaning that no event can bring any indication to Shahrazad of which day it is, the posterior probability of Heads does not change (Corollary 1) but that a fair betting strategy is p=1/3, with the somewhat confusing remark by CKS³ that this may differ from her credence. But then what is the point of the experiment? Or what is the meaning of credence? If Shahrazad is asked for an answer, there must be a utility or a penalty involved otherwise she could as well reply with a probability of p=-3.14 or p=10.56… This makes for another ill-defined aspect of the “paradox”.

Another remark about this ill-posed nature of the experiment is that, when imagining running an ABC experiment, I could only come up with one where the fair coin is thrown (Heads or Tails) and a day (Monday or Tuesday) is chosen at random. Then every proposal (Heads or Tails) is accepted as an awakening, hence the posterior on Heads is the uniform prior. The same would not occur if we consider the pair of awakenings under Tails as two occurrences of (p,E), but this does not sound (as) correct since Shahrazad only knows of one E: to paraphrase Jeffreys, this is an unobservable result that may have not occurred. (Or in other words, Bayesian learning is not possible on Groundhog Day!)
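The two answers can be reproduced by a quick simulation, which is only a sketch of my reading of the two ways of counting: the frequency of Heads per experiment (one accepted proposal per coin toss, as in the ABC version above) versus the frequency of Heads per awakening (both Tails awakenings counted as separate occurrences, as in the betting version).

```python
import numpy as np

def sleeping_beauty(n_experiments=100_000, seed=1):
    """Frequencies of Heads counted per experiment and per awakening."""
    rng = np.random.default_rng(seed)
    heads = rng.random(n_experiments) < 0.5      # fair coin, True = Heads
    awakenings = np.where(heads, 1, 2)           # one awakening under Heads, two under Tails
    per_experiment = heads.mean()                # each coin toss counted once
    # each awakening counted once: Heads contributes one awakening per Heads toss
    per_awakening = heads.sum() / awakenings.sum()
    return per_experiment, per_awakening

per_experiment, per_awakening = sleeping_beauty()
print(f"per experiment: {per_experiment:.3f}, per awakening: {per_awakening:.3f}")
```

The first frequency settles near ½ and the second near ⅓, so the disagreement comes from the choice of the reference set rather than from any learning by Shahrazad.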

ISBA 2016 [#6]

Posted in Kids, Mountains, pictures, Statistics, Travel, University life, Wines on June 19, 2016 by xi'an

Fifth and final day of ISBA 2016, which was as full and intense as the previous ones. (Or even more if taking into account the late evening social activities pursued by most participants.) First thing in the morning, I managed to get very close to a hill top, thanks to the hints provided by Jeff Miller!, and with no further scratches from the nasty local thorn bushes. And I was back with plenty of time for a Bayesian robustness session with great talks. (Session organised by Judith Rousseau whom I crossed while running, rushing to the airport thanks to an Air France last-minute cancellation.) First talk by James Watson (on his paper with Chris Holmes on Kullback neighbourhoods on priors that Judith and I discussed recently in Statistical Science). Then as a contrapunto Peter Grünwald gave a neat geometric motivation for possible misbehaviour of Bayesian inference in non-convex misspecified environments and discussed his SafeBayes resolution that weights down the likelihood. In a sort of PAC-Bayesian way. And Erlis Ruli presented the ABC-R approach he developed with Laura Ventura and Nicola Sartori based on M-estimators and score functions. Making me wonder [idly, as usual] whether cumulating different M-estimators would make a difference in the performances of the ABC algorithm.

David Dunson delivered one of the plenary lectures on high-dimensional discrete parameter estimation, including for instance categorical data. This wide-range talk covered many aspects and papers of David’s work, including a use of tensors I had neither seen nor heard of before. With sparse modelling to resist the combinatoric explosion of contingency tables. However, and you may blame my Gallic pessimistic daemon for this remark, I have trouble picturing the meaning and relevance of a joint distribution on a space of hundreds and hundreds of dimensions and similarly the ability to check the adequacy of any modelling in terms of goodness of fit. For instance, to borrow a non-military example from David’s talk, handling genetic data on ACGT sequences to infer its distribution sounds unreasonable unless most of the bases are mono-allelic. And the only way I see to test the realism of a model in this framework would be to engineer realisations of this distribution to observe the outcome, a test that seems neither feasible nor desirable. Prediction based on such models may obviously operate satisfactorily without such realism requirements.

My first afternoon session (after the ISBA assembly that announced the location of ISBA 2020 in Yunnan, China!, home of Pu’er tea) was about accelerated MCMC schemes with talks by Sanvesh Srivastava on divide-and-conquer MCMC using Wasserstein barycentres, already discussed here, Minsuk Shin on a faster stochastic search variable selection which I could not understand, and Alex Beskos on the extension of Giles’ multilevel Monte Carlo to MCMC settings, which sounded worth investigating further even though I did not follow the notion all the way through. After listening to Luke Bornn explaining how to recalibrate grid data for climate science by accounting for correlation (with the fun title of “lost moments”), I rushed to my rental to [help] cook dinner for friends and… the ISBA 2016 conference was over!