Among the many comments (thanks!) I received when posting our Testing via mixture estimation paper came the suggestion to relate this approach to the notion of full Bayesian significance test (FBST) developed by (Julio, not Hal) Stern and Pereira, from São Paulo, Brazil. I thus had a look at this alternative and read the Bayesian Analysis paper they published in 2008, as well as a paper recently published in Logic Journal of IGPL. (I could not find what the IGPL stands for.) The central notion in these papers is the e-value, which provides the posterior probability that the posterior density is larger than the largest posterior density over the null set. This definition bothers me, first because the null set has a measure equal to zero under an absolutely continuous prior (BA, p.82). Hence the posterior density is defined in an arbitrary manner over the null set and the maximum is itself arbitrary. (An issue that invalidates my 1993 version of the Lindley-Jeffreys paradox!) And second because it considers the posterior probability of an event that does not exist a priori, being conditional on the data. This sounds in fact quite similar to Statistical Inference, Murray Aitkin’s (2009) book using a posterior distribution of the likelihood function. With the same drawback of using the data twice. And the other issues discussed in our commentary of the book. (As a side-much-on-the-side remark, the authors incidentally forgot me when citing our 1992 Annals of Statistics paper about decision theory on accuracy estimators..!)
Archive for Bayes factor
With Kaniav Kamary, Kerrie Mengersen, and Judith Rousseau, we have just arXived (and submitted) a paper entitled “Testing hypotheses via a mixture model”. (We actually presented some earlier version of this work in Cancũn, Vienna, and Gainesville, so you may have heard of it already.) The notion we advocate in this paper is to replace the posterior probability of a model or an hypothesis with the posterior distribution of the weights of a mixture of the models under comparison. That is, given two models under comparison,
we propose to estimate the (artificial) mixture model
and in particular derive the posterior distribution of α. One may object that the mixture model is neither of the two models under comparison but this is the case at the boundary, i.e., when α=0,1. Thus, if we use prior distributions on α that favour the neighbourhoods of 0 and 1, we should be able to see the posterior concentrate near 0 or 1, depending on which model is true. And indeed this is the case: for any given Beta prior on α, we observe a higher and higher concentration at the right boundary as the sample size increases. And establish a convergence result to this effect. Furthermore, the mixture approach offers numerous advantages, among which [verbatim from the paper]:
Once a Bayes factor B(y) is computed, one needs to assess its strength. As repeated many times here, Jeffreys’ scale has no validation whatsoever, it is simply a division of the (1,∞) range into regions of convenience. Following earlier proposals in the literature (Box, 1980; García-Donato and Chen, 2005; Geweke and Amisano, 2008), an evaluation of this strength within the issue at stake, i.e. the comparison of two models, can be based on the predictive distribution. While most authors (like García-Donato and Chen) consider the prior predictive, I think using the posterior predictive distribution is more relevant since
- it exploits the information contained in the data y, thus concentrates on a region of relevance in the parameter space(s), which is especially interesting in weakly informative settings (even though we should abstain from testing in those cases, dixit Andrew);
- it reproduces the behaviour of the Bayes factor B(x) for values x of the observation similar to the original observation y;
- it does not hide issues of indeterminacy linked with improper priors: the Bayes factor B(x) remains indeterminate, even with a well-defined predictive;
- it does not separate between errors of type I and errors of type II but instead uses the natural summary provided by the Bayesian analysis, namely the predictive distribution π(x|y);
- as long as the evaluation is not used to reach a decision, there is no issue of “using the data twice”, we are simply producing an estimator of the posterior loss, for instance the (posterior) probability of selecting the wrong model. The Bayes factor B(x) is thus functionally independent of y, while x is probabilistically dependent on y.
Note that, even though probabilities of errors of type I and errors of type II can be computed, they fail to account for the posterior probabilities of both models. (This is the delicate issue with the solution of García-Donato and Chen.) Another nice feature is that the predictive distribution of the Bayes factor can be computed even in complex settings where ABC needs to be used.
“Using ABC to evaluate competing models has various hazards and comes with recommended precautions (Robert et al. 2011), and unsurprisingly, many if not most researchers have a healthy scepticism as these tools continue to mature.”
Michael Hickerson just published an open-access letter with the above title in Molecular Ecology. (As in several earlier papers, incl. the (in)famous ones by Templeton, Hickerson confuses running an ABC algorithm with conducting Bayesian model comparison, but this is not the main point of this post.)
“Rather than using ABC with weighted model averaging to obtain the three corresponding posterior model probabilities while allowing for the handful of model parameters (θ, τ, γ, Μ) to be estimated under each model conditioned on each model’s posterior probability, these three models are sliced up into 143 ‘submodels’ according to various parameter ranges.”
The letter is in fact a supporting argument for the earlier paper of Pelletier and Carstens (2014, Molecular Ecology) which conducted the above splitting experiment. I could not read this paper so cannot judge of the relevance of splitting this way the parameter range. From what I understand it amounts to using mutually exclusive priors by using different supports.
“Specifically, they demonstrate that as greater numbers of the 143 sub-models are evaluated, the inference from their ABC model choice procedure becomes increasingly.”
An interestingly cut sentence. Increasingly unreliable? mediocre? weak?
“…with greater numbers of models being compared, the most probable models are assigned diminishing levels of posterior probability. This is an expected result…”
True, if the number of models under consideration increases, under a uniform prior over model indices, the posterior probability of a given model mechanically decreases. But the pairwise Bayes factors should not be impacted by the number of models under comparison and the letter by Hickerson states that Pelletier and Carstens found the opposite:
“…pairwise Bayes factor[s] will always be more conservative except in cases when the posterior probabilities are equal for all models that are less probable than the most probable model.”
Which means that the “Bayes factor” in this study is computed as the ratio of a marginal likelihood and of a compound (or super-marginal) likelihood, averaged over all models and hence incorporating the prior probabilities of the model indices as well. I had never encountered such a proposal before. Contrary to the letter’s claim:
“…using the Bayes factor, incorporating all models is perhaps more consistent with the Bayesian approach of incorporating all uncertainty associated with the ABC model choice procedure.”
Besides the needless inclusion of ABC in this sentence, a somewhat confusing sentence, as Bayes factors are not, stricto sensu, Bayesian procedures since they remove the prior probabilities from the picture.
“Although the outcome of model comparison with ABC or other similar likelihood-based methods will always be dependent on the composition of the model set, and parameter estimates will only be as good as the models that are used, model-based inference provides a number of benefits.”
All models are wrong but the very fact that they are models allows for producing pseudo-data from those models and for checking if the pseudo-data is similar enough to the observed data. In components that matters the most for the experimenter. Hence a loss function of sorts…
Next week I am giving a talk at BAYSM in Vienna. BAYSM is the Bayesian Young Statisticians meeting so one may wonder why, but with Chris Holmes and Mike West, we got invited as more… erm… senior speakers! So I decided to give a definitely senior talk on a thread pursued throughout my career so far, namely mixtures. Plus it also relates to works of the other senior speakers. Here is the abstract for the talk:
Mixtures of distributions are fascinating objects for statisticians in that they both constitute a straightforward extension of standard distributions and offer a complex benchmark for evaluating statistical procedures, with a likelihood both computable in a linear time and enjoying an exponential number of local models (and sometimes infinite modes). This fruitful playground appeals in particular to Bayesians as it constitutes an easily understood challenge to the use of improper priors and of objective Bayes solutions. This talk will review some ancient and some more recent works of mine on mixtures of distributions, from the 1990 Gibbs sampler to the 2000 label switching and to later studies of Bayes factor approximations, nested sampling performances, improper priors, improved importance samplers, ABC, and a inverse perspective on the Bayesian approach to testing of hypotheses.
I am very grateful to the scientific committee for this invitation, as it will give me the opportunity to meet the new generation, learn from them and in addition discover Vienna where I have never been, despite several visits to Austria. Including its top, the Großglockner. I will also give a seminar in Linz the day before. In the Institut für Angewandte Statistik.
Last morning at the neuroscience workshop Jean-François Cardoso presented independent component analysis though a highly pedagogical and enjoyable tutorial that stressed the geometric meaning of the approach, summarised by the notion that the (ICA) decomposition
of the data X seeks both independence between the columns of S and non-Gaussianity. That is, getting as away from Gaussianity as possible. The geometric bits came from looking at the Kullback-Leibler decomposition of the log likelihood
where the expectation is computed under the true distribution P of the data X. And Qθ is the hypothesised distribution. A fine property of this decomposition is a statistical version of Pythagoreas’ theorem, namely that when the family of Qθ‘s is an exponential family, the Kullback-Leibler distance decomposes into
where θ⁰ is the expected maximum likelihood estimator of θ. (We also noticed this possibility of a decomposition in our Kullback-projection variable-selection paper with Jérôme Dupuis.) The talk by Aapo Hyvärinen this morning was related to Jean-François’ in that it used ICA all the way to a three-level representation if oriented towards natural vision modelling in connection with his book and the paper on unormalised models recently discussed on the ‘Og.
On the afternoon, Eric-Jan Wagenmaker [who persistently and rationally fight the (ab)use of p-values and who frequently figures on Andrew’s blog] gave a warning tutorial talk about the dangers of trusting p-values and going fishing for significance in existing studies, much in the spirit of Andrew’s blog (except for the defence of Bayes factors). Arguing in favour of preregistration. The talk was full of illustrations from psychology. And included the line that ESP testing is the jester of academia, meaning that testing for whatever form of ESP should be encouraged as a way to check testing procedures. If a procedure finds a significant departure from the null in this setting, there is something wrong with it! I was then reminded that Eric-Jan was one of the authors having analysed Bem’s controversial (!) paper on the “anomalous processes of information or energy transfer that are currently unexplained in terms of known physical or biological mechanisms”… (And of the shocking talk by Jessica Utts on the same topic I attended in Australia two years ago.)
Today I gave a talk in the Advances in model selection session. Organised by Veronika Rockova and Ed George. (A bit of pre-talk stress: I actually attempted to change my slides at 5am and only managed to erase the current version! I thus left early enough to stop by the presentation room…) Here are the final slides, which have much in common with earlier versions, but also borrowed from Jean-Michel Marin’s talk in Cambridge. A posteriori, I think the talk missed one slide on the practical run of the ABC random forest algorithm, since later questions showed miscomprehension from the audience.
The other talks in this session were by Andreas Buja [whom I last met in Budapest last year] on valid post-modelling inference. A very relevant reflection on the fundamental bias in statistical modelling. Then by Nick Polson, about efficient ways to compute MAP for objective functions that are irregular. Great entry into optimisation methods I had never heard of earlier.! (The abstract is unrelated.) And last but not least by Veronika Rockova, on mixing Indian buffet processes with spike-and-slab priors for factor analysis with unknown numbers of factors. A definitely advanced contribution to factor analysis, with a very nice idea of introducing a non-identifiable rotation to align on orthogonal designs. (Here too the abstract is unrelated, a side effect of the ASA requiring abstracts sent very long in advance.)
Although discussions lasted well into the following Bayesian Inference: Theory and Foundations session, I managed to listen to a few talks there. In particular, a talk by Keli Liu on constructing non-informative priors. A question of direct relevance. The notion of objectivity is to achieve a frequentist distribution of the Bayes factor associated with the point null that is constant. Or has a constant quantile at a given level. The second talk by Alexandra Bolotskikh related to older interests of mine’s, namely the construction of improved confidence regions in the spirit of Stein. (Not that surprising, given that a coauthor is Marty Wells, who worked with George and I on the topic.) A third talk by Abhishek Pal Majumder (jointly with Jan Hanning) dealt on a new type of fiducial distributions, with matching prior properties. This sentence popped a lot over the past days, but this is yet another area where I remain puzzled by the very notion. I mean the notion of fiducial distribution. Esp. in this case where the matching prior gets even closer to being plain Bayesian.