Archive for Bayesian tests of hypotheses

A new approach to Bayesian hypothesis testing

Posted in Books, Statistics with tags , , , , , on September 8, 2016 by xi'an

“The main purpose of this paper is to develop a new Bayesian hypothesis testing approach for the point null hypothesis testing (…) based on the Bayesian deviance and constructed in a decision theoretical framework. It can be regarded as the Bayesian version of the likelihood ratio test.”

This paper got published in Journal of Econometrics two years ago but I only read it a few days ago when Kerrie Mengersen pointed it out to me. Here is an interesting criticism of Bayes factors.

“In the meantime, unfortunately, Bayes factors also suffers from several theoretical and practical difficulties. First, when improper prior distributions are used, Bayes factors contains undefined constants and takes arbitrary values (…) Second, when a proper but vague prior distribution with a large spread is used to represent prior ignorance, Bayes factors tends to favour the null hypothesis. The problem may persist even when the sample size is large (…) Third, the calculation of Bayes factors generally requires the evaluation of marginal likelihoods. In many models, the marginal likelihoods may be difficult to compute.”

I completely agree with these points, which are part of a longer list in our testing by mixture estimation paper. The authors also rightly blame the rigidity of the 0-1 loss function behind the derivation of the Bayes factor. An alternative decision-theoretic based on the Kullback-Leibler distance has been proposed by José Bernardo and Raúl Rueda, in a 2002 paper, evaluating the average divergence between the null and the full under the full, with the slight drawback that any nuisance parameter has the same prior under both hypotheses. (Which makes me think of the Savage-Dickey paradox, since everything here seems to take place under the alternative.) And the larger drawback of requiring a lower bound for rejecting the null. (Although it could be calibrated under the null prior predictive.)

This paper suggests using instead the difference of the Bayesian deviances, which is the expected log ratio integrated against the posterior. (With the possible embarrassment of the quantity having no prior expectation since the ratio depends on the data. But after all the evidence or marginal likelihood faces the same “criticism”.) So it is a sort of Bayes factor on the logarithms, with a strong similarity with Bernardo & Rueda’s solution since they are equal in expectation under the marginal. As in Dawid et al.’s recent paper, the logarithm removes the issue with the normalising constant and with the Lindley-Jeffreys paradox. The approach then needs to be calibrated in order to define a decision bound about the null. The asymptotic distribution of the criterion is  χ²(p)−p, where p is the dimension of the parameter to be tested, but this sounds like falling back on frequentist tests. And the deadly .05% bounds. I would rather favour a calibration of the criterion using prior or posterior predictives under both models…

MDL multiple hypothesis testing

Posted in Books, pictures, Statistics, Travel, University life with tags , , , , , , , , , on September 1, 2016 by xi'an

“This formulation reveals an interesting connection between multiple hypothesis testing and mixture modelling with the class labels corresponding to the accepted hypotheses in each test.”

After my seminar at Monash University last Friday, David Dowe pointed out to me the recent work by Enes Makalic and Daniel Schmidt on minimum description length (MDL) methods for multiple testing as somewhat related to our testing by mixture paper. Work which appeared in the proceedings of the 4th Workshop on Information Theoretic Methods in Science and Engineering (WITMSE-11), that took place in Helsinki, Finland, in 2011. Minimal encoding length approaches lead to choosing the model that enjoys the smallest coding length. Connected with, e.g., Rissannen‘s approach. The extension in this paper consists in considering K hypotheses at once on a collection of m datasets (the multiple then bears on the datasets rather than on the hypotheses). And to associate an hypothesis index to each dataset. When the objective function is the sum of (generalised) penalised likelihoods [as in BIC], it leads to selecting the “minimal length” model for each dataset. But the authors introduce weights or probabilities for each of the K hypotheses, which indeed then amounts to a mixture-like representation on the exponentiated codelengths. Which estimation by optimal coding was first proposed by Chris Wallace in his book. This approach eliminates the model parameters at an earlier stage, e.g. by maximum likelihood estimation, to return a quantity that only depends on the model index and the data. In fine, the purpose of the method differs from ours in that the former aims at identifying an appropriate hypothesis for each group of observations, rather than ranking those hypotheses for the entire dataset by considering the posterior distribution of the weights in the later. The mixture has somehow more of a substance in the first case, where separating the datasets into groups is part of the inference.

messages from Harvard

Posted in pictures, Statistics, Travel, University life with tags , , , , , , on March 24, 2016 by xi'an

As in Bristol two months ago, where I joined the statistics reading in the morning, I had the opportunity to discuss the paper on testing via mixtures prior to my talk with a group of Harvard graduate students. Which concentrated on the biasing effect of the Bayes factor against the more complex hypothesis/model. Arguing [if not in those terms!] that Occam’s razor was too sharp. With a neat remark that decomposing the log Bayes factor as

log(p¹(y¹,H))+log(p²(y²|y¹,H))+…

meant that the first marginal was immensely and uniquely impacted by the prior modelling, hence very likely to be very small for a larger model H, which would then take forever to recover from. And asking why there was such a difference with cross-validation

log(p¹(y¹|y⁻¹,H))+log(p²(y²|y⁻²,H))+…

where the leave-one out posterior predictor is indeed more stable. While the later leads to major overfitting in my opinion, I never spotted the former decomposition which does appear as a strong and maybe damning criticism of the Bayes factor in terms of long-term impact of the prior modelling.

Other points made during the talk or before when preparing the talk:

  1. additive mixtures are but one encompassing model, geometric mixtures could be fun too, if harder to process (e.g., missing normalising constant). Or Zellner’s mixtures (with again the normalising issue);
  2. if the final outcome of the “test” is the posterior on α itself, the impact of the hyper-parameter on α is quite relative since this posterior can be calibrated by simulation against limiting cases (α=0,1);
  3. for the same reason the different rate of accumulation near zero and one  when compared with a posterior probability is hardly worrying;
  4. what I see as a fundamental difference in processing improper priors for Bayes factors versus mixtures is not perceived as such by everyone;
  5. even a common parameter θ on both models does not mean both models are equally weighted a priori, which relates to an earlier remark in Amsterdam about the different Jeffreys priors one can use;
  6. the MCMC output also produces a sample of θ’s which behaviour is obviously different from single model outputs. It would be interesting to study further the behaviour of those samples, which are not to be confused with model averaging;
  7. the mixture setting has nothing intrinsically Bayesian in that the model can be processed in other ways.

Trip to Louvain (and back)

Posted in Kids, pictures, Running, Statistics, Travel, University life, Wines with tags , , , , , , , , , , on October 24, 2015 by xi'an

Rue des Wallons, during the 24 heures de vélo, Louvain-la-Neuve, Oct. 21, 2015Apart from the minor initial inconvenience that I missed my train to Brussels thanks to the SNCF train company dysfunctional automata [but managed to switch to one half-an-hour later], my Belgian trip to Louvain-la-Neuve was quite enjoyable! I met with several local faculty [UCL] members I had not seen for several years, I gave my talk for the World Statistics Day in front of a large audience, maybe not the most appropriate talk for that day since it was somewhat skeptical about the nature of statistical tests, I got sharp questions, comments, and suggestions on the mixture approach to testing [incl. a challenging one about the Bernoulli B(p) case], I had a superb and animated and friendly dinner in a local restaurant—where everyone kindly spoke French although I was the only native French speaker—, I met the next morning with two PhD students from KU Leuven (the “other” part of the former Leuven university, albeit in the Flemmish side of the border) about functional ABC and generalised Jeffreys priors, I had a few more interesting discussions, and I managed to grab a few bags of Belgian waffles in Brussels before heading home! (In case you wonder from the above pixture, the crowds in the pedestrian streets of Louvain-la-Neuve were not connected to my visit!, but to a student festival centred at beer a 24 hour bike relay that attracted around 50,000 students, for less than a hundred bikes!)

World statistics day

Posted in pictures, Statistics, Travel, University life with tags , , , , , , , , , on October 20, 2015 by xi'an

Today is October 20, World Statistics Day as launched by the UN. And supported by local and international societies. In connection with that day, among many events, the RSS will be hosting a reception, China will hold a seminar in… Xi’an, how appropriate!, my friend Kerrie Mengersen will give a talk at the Queensland University of Technology (QUT) on The power and promise of immersive virtual environments. (Bringing her pet crocodile to the talk, hopefully!)

Leuven9And I will also give a talk in Louvain-la-Neuve, Belgium, on Le délicat dilemme des tests d’hypothèse et de leur résolution bayésienne. At ISBA, which stands for Institute of Statistics, Biostatistics and Actuarial Sciences and not for the Bayesian society!. within UCL, which stands for Université Catholique de Louvain and not for University College London! (And which is not to be confused with the Katholieke Universiteit Leuven, in Leuven, where I was last year for MCqMC. About 25 kilometers away.) In case this is not confusing enough, here are my slides (in English, while the talk will be in French):