Today, I gave my course at the University of Sciences (Trường Đại học) here in Saigon in front of 40 students. It was a bit of a crash course and I covered between four and five chapters of The Bayesian Choice in about six hours. This was exhausting both for them and for me, but I managed to keep writing on the blackboard till the end and they bravely managed to keep their focus till the end as well. Since the students were of various backgrounds different from maths and stats (even though some were completing PhDs involving Bayesian tools), I do wonder how much they retained from this crash course, apart from my oft-repeated message that everyone has to pick a prior rather than go fishing for the prior… (No pho today but a spicy beef stew and banh xeo local pancakes!) Here are the slides for the students:
Archive for The Bayesian Choice
Here is a short bio of me written in Vietnamese in conjunction with the course I will give at CMS (Centre for Mathematical Sciences), Ho Chi Minh City, next week:
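Christian P. Robert là giáo sư tại Khoa Toán ứng dụng của ĐH Paris Dauphine từ năm 2000. GS Robert đã từng giảng dạy ở các ĐH Purdue, Cornell (Mỹ) và ĐH Canterbury (New-Zealand). Ông đã làm biên tập cho tạp chí Journal of the Royal Statistical Society Series B từ năm 2006 đến năm 2009 và là phó biên tập cho tạp chí Annals of Statistics. Năm 2008, ông làm Chủ tịch của Hiệp hội Thống kê Quốc tế về Thống kê Bayes (ISBA). Lĩnh vực nghiên cứu của GS Robert bao gồm Thống kê Bayes mà tập trung chính vào Lý thuyết quyết định (Decision theory) và Mô hình lựa chọn (Model selection), Lý thuyết về Xích Markov trong mô phỏng và Thống kê tính toán.
[In English: Christian P. Robert has been a professor in the Department of Applied Mathematics of Université Paris Dauphine since 2000. Prof. Robert has taught at Purdue and Cornell universities (USA) and at the University of Canterbury (New Zealand). He was editor of the Journal of the Royal Statistical Society Series B from 2006 to 2009 and is an associate editor of the Annals of Statistics. In 2008, he was President of the International Society for Bayesian Analysis (ISBA). Prof. Robert's research covers Bayesian statistics, with a main focus on decision theory and model selection, the theory of Markov chains in simulation, and computational statistics.]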
As I attended Jamie Robins’ session in Varanasi and did not have a clear enough idea of the Robins and Wasserman paradox to discuss it viva voce, here are my thoughts after reading Larry’s summary. My first reaction was to question whether or not this was a Bayesian statistical problem (meaning, why should I be concerned with the problem?), just as the normalising constant problem was not a statistical problem. We are estimating an integral given some censored realisations of a binomial depending on a covariate through an unknown function θ(x). There is not much of a parameter. However, the way Jamie presented it through clinical trials made the problem sound definitely statistical. So end of the silly objection. My second step is to consider the very point of estimating the entire function (or infinite dimensional parameter) θ(x) when only the integral ψ is of interest. This is presumably the reason why the Bayesian approach fails, as it sounds difficult to consistently estimate θ(x) under censored binomial observations, while ψ can be consistently estimated. Of course, if we want to estimate the probability of a success like ψ, going through functional estimation sounds like overshooting. But the Bayesian modelling of the problem appears to require considering all unknowns at once, including the function θ(x), and cannot forget about it. We encountered a somewhat similar problem with Jean-Michel Marin when working on the k-nearest neighbour classification problem. Considering all the points in the testing sample altogether as unknowns would dwarf the training sample and its information content to produce very poor inference. And so we ended up dealing with one point at a time after harsh and intense discussions! Now, back to the Robins and Wasserman paradox, I see no problem in acknowledging that a classical Bayesian approach cannot produce a convergent estimate of the integral ψ, simply because the classical Bayesian approach is a holistic system that cannot remove information to process a subset of the original problem. Call it the curse of marginalisation. Now, on a practical basis, would there be ways of running simulations of the missing Y’s when π(x) is known in order to achieve estimates of ψ? Presumably, but they would end up with a frequentist validation…
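On the practical question above, the frequentist benchmark usually quoted in this debate is the Horvitz-Thompson estimator, which uses only the known π(x) and never estimates θ(x). The following Python sketch is an illustration under assumptions that are not part of the discussion itself: a scalar covariate uniform on (0,1), and hypothetical functions theta and pi_known, with π(x) bounded away from zero.

```python
import math
import random

def theta(x):
    # hypothetical success probability, deliberately wiggly (an assumption)
    return 0.5 + 0.4 * math.sin(20.0 * x)

def pi_known(x):
    # hypothetical known probability of observing Y, bounded away from 0
    return 0.2 + 0.6 * x

def horvitz_thompson(n, rng):
    """Average of R*Y/pi(X), where R indicates whether Y is observed."""
    total = 0.0
    for _ in range(n):
        x = rng.random()                      # X ~ U(0,1)
        observed = rng.random() < pi_known(x)
        if observed:
            y = rng.random() < theta(x)       # Y ~ Bernoulli(theta(x))
            total += float(y) / pi_known(x)   # inverse-probability weighting
    return total / n

# psi is the integral of theta over (0,1), available in closed form here
psi_true = 0.5 + 0.4 * (1.0 - math.cos(20.0)) / 20.0
rng = random.Random(2012)
psi_hat = horvitz_thompson(200_000, rng)
print(psi_true, psi_hat)
```

The estimator is unbiased for ψ whatever the shape of θ(x), which is why its validation is frequentist rather than Bayesian.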
Måns Thulin released today an arXiv document on some decision-theoretic justifications for [running] Bayesian hypothesis testing through credible sets. His main point is that using the unnatural prior setting mass on a point-null hypothesis can be avoided by rejecting the null when the point-null value of the parameter does not belong to the credible interval, and that this decision procedure can be validated through the use of special loss functions. While I stress to my students that point-null hypotheses are very unnatural and should be avoided at all cost, and also that constructing a confidence interval is not the same as designing a test (the former assesses the precision in the estimation, while the latter opposes two different and even incompatible models), let us consider Måns’ arguments for their own sake.
The idea of the paper is that there exist loss functions for testing point-null hypotheses that lead to HPD, symmetric and one-sided intervals as acceptance regions, depending on the loss function. This was already found in Pereira & Stern (1999). The issue with these loss functions is that they involve the corresponding credible sets in their definition, hence are somehow tautological. For instance, when considering the HPD set and T(x) as the largest HPD set not containing the point-null value of the parameter, the corresponding loss function is
parameterised by a, b, c, and depending on the HPD region.
Måns then introduces new loss functions that do not depend on x and still lead to either the symmetric or the one-sided credible intervals as acceptance regions. However, one test actually has two different alternatives (Theorem 2), which makes it essentially a composition of two one-sided tests, while the other test reduces to a one-sided test (Theorem 3), so even at this face-value level, I do not find the result that convincing. (For the one-sided test, George Casella and Roger Berger (1986) established links between Bayesian posterior probabilities and frequentist p-values.) Both Theorem 3 and the last result of the paper (Theorem 4) use a generic, set-free and observation-free loss function (related to eqn. (5.2.1) in my book!, as quoted by the paper) but (and this is a big but) they only hold for prior distributions setting (prior) mass on both the null and the alternative. Otherwise, the solution is to always reject the hypothesis with the zero probability… This is actually an interesting argument in the why-are-credible-sets-unsuitable-for-testing debate, as it cannot bypass the introduction of a prior mass on Θ0!
Overall, I furthermore consider that a decision-theoretic approach to testing should encompass future steps rather than focussing on the reply to the (admittedly dumb) question is θ zero? Therefore, it must have both plan A and plan B at the ready, which means preparing (and using!) prior distributions under both hypotheses. Even on point-null hypotheses.
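The decision rule under discussion (reject the point null when θ0 falls outside a credible interval) can be sketched in a few lines. The Python example below assumes a normal posterior on θ, for which the equal-tailed and HPD intervals coincide; the function names and the numerical values are hypothetical and chosen only for illustration. No prior mass is placed on θ0 in this sketch, which is precisely the feature debated above.

```python
from statistics import NormalDist

def equal_tailed_interval(post_mean, post_sd, alpha=0.05):
    # (1 - alpha) credible interval for a N(post_mean, post_sd^2) posterior
    post = NormalDist(post_mean, post_sd)
    return post.inv_cdf(alpha / 2), post.inv_cdf(1 - alpha / 2)

def reject_point_null(theta0, post_mean, post_sd, alpha=0.05):
    # reject H0: theta = theta0 when theta0 lies outside the credible interval
    lo, hi = equal_tailed_interval(post_mean, post_sd, alpha)
    return not (lo <= theta0 <= hi)

# hypothetical posteriors: the first interval excludes 0, the second covers it
far_from_null = reject_point_null(0.0, post_mean=0.8, post_sd=0.3)
close_to_null = reject_point_null(0.0, post_mean=0.4, post_sd=0.3)
print(far_from_null, close_to_null)
```

The output of such a rule is a bare accept/reject decision: no posterior probability is attached to the null, since the continuous posterior gives it probability zero.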
Now, after I wrote the above, I came upon a Stack Exchange page initiated by Måns last July. This is presumably not the first time a paper stems from Stack Exchange, but this is a fairly interesting outcome: thanks to the debate on his question, Måns managed to get a coherent manuscript written. Great! (In a sense, this reminded me of the polymath experiments of Terry Tao, Timothy Gowers and others. Meaning that maybe most contributors could have become coauthors of the paper!)
This morning I attended Alan Gelfand’s talk on directional data, i.e. on the torus (0,2π), and found his modeling via wrapped normals (i.e. normal reprojected onto the unit sphere) quite interesting and raising lots of probabilistic questions. For instance, usual moments like mean and variance have no meaning in this space. The variance matrix of the underlying normal, as well as its mean, obviously matter. One thing I am wondering about is how restrictive the normal assumption is. Because of the projection, any random change to the scale of the normal vector does not impact this wrapped normal distribution, but there are certainly features that are not covered by this family. For instance, I suspect it can only offer at most two modes over the range (0,2π). And that it cannot be explosive at any point.
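The construction described above (a bivariate normal vector projected radially onto the unit circle, often called the projected normal) is easy to simulate. The Python sketch below checks the scale-invariance remark numerically and replaces the meaningless linear mean with a circular mean direction. The mean vector, the identity covariance and the seeds are arbitrary assumptions made for illustration, not values from the talk.

```python
import math
import random

TWO_PI = 2.0 * math.pi

def projected_normal_angle(mu, rng, scale=1.0):
    # draw Z ~ N(mu, I_2), rescale it, and return the angle of Z in [0, 2*pi]
    z1 = scale * rng.gauss(mu[0], 1.0)
    z2 = scale * rng.gauss(mu[1], 1.0)
    return math.atan2(z2, z1) % TWO_PI

def mean_direction(angles):
    # circular mean: direction of the average unit vector
    s = sum(math.sin(a) for a in angles)
    c = sum(math.cos(a) for a in angles)
    return math.atan2(s, c) % TWO_PI

mu = (1.0, 0.5)                      # hypothetical mean of the underlying normal
rng_a, rng_b = random.Random(1), random.Random(1)
rng_scale = random.Random(99)        # separate stream for the random scales

angles = [projected_normal_angle(mu, rng_a) for _ in range(5000)]
scaled = []
for _ in range(5000):
    s = rng_scale.uniform(0.1, 10.0)  # random positive rescaling of the vector
    scaled.append(projected_normal_angle(mu, rng_b, scale=s))

max_gap = max(abs(a - b) for a, b in zip(angles, scaled))
print(max_gap, mean_direction(angles), math.atan2(mu[1], mu[0]))
```

With an identity covariance the density is symmetric around the direction of mu, so the circular mean should land close to atan2(mu[1], mu[0]); other covariance matrices are what may produce a second mode.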
The keynote lecture this afternoon was delivered by Roderick Little in a highly entertaining way, about calibrated Bayesian inference in official statistics. For instance, he mentioned the inferential “schizophrenia” in this field due to the divide between design-based and model-based inferences. Although he did not define what he meant by “calibrated Bayesian” in the most explicit manner, he had this nice list of eight good reasons to be Bayesian (that came close to my own list at the end of The Bayesian Choice):
- conceptual simplicity (Bayes is prescriptive, frequentism is not), “having a model is an advantage!”
- avoiding ancillarity angst (Bayes conditions on everything)
- avoiding confidence cons (confidence is not probability)
- nails nuisance parameters (frequentists are either wrong or have a really hard time)
- escapes from asymptotia
- incorporates prior information and, if not, weak priors work fine
- Bayes is useful (25 of the top 30 cited are statisticians out of which … are Bayesians)
- Bayesians go to Valencia! [joke! Actually it should have been Bayesians go MCMskiing!]
- Calibrated Bayes gets better frequentist answers
He however insisted that frequentists should be Bayesians and also that Bayesians should be frequentists, hence the calibration qualification.
After an interesting session on Bayesian statistics, with (adaptive or not) mixtures and variational Bayes tools, I actually joined the “young statistician dinner” (without any pretense at being a young statistician, obviously) and had interesting exchanges on a whole variety of topics, esp. as Kerrie Mengersen adopted (reinvented) my dinner table switch strategy (w/o my R simulated annealing code). Until jetlag caught up with me.
As I was writing my next column for CHANCE, I decided to include a methodology box about “using the data twice”. Here is the draft. (The second part is reproduced verbatim from an earlier post on Error and Inference.)
Several aspects of the books covered in this CHANCE review [i.e., Bayesian ideas and data analysis, and Bayesian modeling using WinBUGS] face the problem of “using the data twice”. What does that mean? Nothing really precise, actually. The accusation of “using the data twice” found in the Bayesian literature can be thrown at most procedures exploiting the Bayesian machinery without actually being Bayesian, i.e., which cannot be derived from the posterior distribution. For instance, the integrated likelihood approach in Murray Aitkin’s Statistical Inference avoids the difficulties related to improper priors π_i by first using the data x to construct (proper) posteriors π_i(θ_i|x) and then secondly using the data in a Bayes factor
as if the posteriors were priors. This obviously solves the impropriety difficulty (see, e.g., The Bayesian Choice), but it creates a statistical procedure outside the Bayesian domain, hence requiring a separate validation since the usual properties of Bayesian procedures do not apply. Similarly, the whole empirical Bayes approach falls under this category, even though some empirical Bayes procedures are asymptotically convergent. The pseudo-marginal likelihood of Geisser and Eddy (1979), used in Bayesian ideas and data analysis, is defined by
through the marginal posterior likelihoods. While it also allows for improper priors, it does use the same data in each term of the product and, again, it is not a Bayesian procedure.
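For reference, the two quantities discussed in this paragraph are usually written as follows. This is a sketch under standard definitions rather than a quotation: f_1 and f_2 denote the sampling densities of the two models (a notation assumed here), x_{-i} denotes the sample without its i-th observation, and the second expression of the pseudo-marginal likelihood assumes conditionally independent observations.

```latex
% posterior Bayes factor (Aitkin): the posteriors are used in place of the priors
B_{12}^{\mathrm{post}}(x) =
  \frac{\int f_1(x \mid \theta_1)\, \pi_1(\theta_1 \mid x)\, \mathrm{d}\theta_1}
       {\int f_2(x \mid \theta_2)\, \pi_2(\theta_2 \mid x)\, \mathrm{d}\theta_2}

% pseudo-marginal likelihood (Geisser and Eddy, 1979): product of
% leave-one-out predictive densities
\hat{m}(x) = \prod_{i=1}^{n} f(x_i \mid x_{-i}),
\qquad
f(x_i \mid x_{-i}) = \int f(x_i \mid \theta)\, \pi(\theta \mid x_{-i})\, \mathrm{d}\theta
```

In the first ratio the data x enters both the likelihood and the "prior" of each integral; in the second, every observation appears n - 1 times as conditioning data and once as predicted data.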
Once again, from first principles, a Bayesian approach should use the data only once, namely when constructing the posterior distribution on every unknown component of the model(s). Based on this all-encompassing posterior, all inferential aspects should be the consequences of a sequence of decision-theoretic steps in order to select optimal procedures. This is the ideal setting while, in practice, relying on a sequence of posterior distributions is often necessary, each posterior being a consequence of earlier decisions, which makes it the result of a multiple (improper) use of the data… For instance, the process of Bayesian variable selection is in principle clean from the sin of “using the data twice”: one simply computes the posterior probability of each of the variable subsets and this is over. However, in a case involving many (many) variables, there are two difficulties: one is about building the prior distributions for all possible models, a task that needs to be automatised to some extent; another is about exploring the set of potential models. First, resorting to projection priors as in the intrinsic solution of Pérez and Berger (2002, Biometrika, a most valuable article!), while unavoidable and a “least worst” solution, means switching priors/posteriors based on earlier acceptances/rejections, i.e. on the data. Second, the path of models truly explored by a computational algorithm [which will be a minuscule subset of the set of all models] will depend on the models rejected so far, either when relying on a stepwise exploration or when using a random walk MCMC algorithm. Although this is not crystal clear (there is actually plenty of room for supporting the opposite view!), it could be argued that the data is thus used several times in this process…
“Frequentist methods achieve an objective connection to hypotheses about the data-generating process by being constrained and calibrated by the method’s error probabilities in relation to these models.”—D. Cox and D. Mayo, p.277, Error and Inference, 2010
The second part of the seventh chapter of Error and Inference is David Cox’s and Deborah Mayo’s “Objectivity and conditionality in frequentist inference”. (Part of the section is available on Google books.) The purpose is clear and the chapter quite readable from a statistician’s perspective. I however find it difficult to quantify objectivity by first conditioning on “a statistical model postulated to have generated data”, as again this assumes the existence of a “true” probability model where “probabilities (…) are equal or close to the actual relative frequencies”. As earlier stressed by Andrew:
“I don’t think it’s helpful to speak of “objective priors.” As a scientist, I try to be objective as much as possible, but I think the objectivity comes in the principle, not the prior itself. A prior distribution–any statistical model–reflects information, and the appropriate objective procedure will depend on what information you have.”
The paper opposes the likelihood, Bayesian, and frequentist methods, reproducing what Gigerenzer called the “superego, the ego, and the id” in his paper on statistical significance. Cox and Mayo stress from the start that the frequentist approach is (more) objective because it is based on the sampling distribution of the test. My primary problem with this thesis is that the “hypothetical long run” (p.282) does not hold in realistic settings. Even in the event of a reproduction of similar or identical tests, a sequential procedure exploiting everything that has been observed so far is more efficient than the mere replication of the same procedure solely based on the current observation.
“Virtually all (…) models are to some extent provisional, which is precisely what is expected in the building up of knowledge.”—D. Cox and D. Mayo, p.283, Error and Inference, 2010
The above quote is something I completely agree with, being another phrasing of George Box’s “all models are wrong”, but this transience of working models is a good reason in my opinion to account for the possibility of alternative working models from the start of the statistical analysis. Hence for an inclusion of those models in the statistical analysis equally from the start. Which leads almost inevitably to a Bayesian formulation of the testing problem.
“Perhaps the confusion [over the role of sufficient statistics] stems in part because the various inference schools accept the broad, but not the detailed, implications of sufficiency.”—D. Cox and D. Mayo, p.286, Error and Inference, 2010
The discussion over the sufficiency principle is interesting, as always. The authors propose to solve the confusion between the sufficiency principle and the frequentist approach by assuming that inference “is relative to the particular experiment, the type of inference, and the overall statistical approach” (p.287). This creates a barrier between sampling distributions that avoids the binomial versus negative binomial paradox always stressed in the Bayesian literature. But the solution is somehow tautological: by conditioning on the sampling distribution, it avoids the difficulties linked with several sampling distributions all producing the same likelihood. After my recent work on ABC model choice, I am however less excited about the sufficiency principle, as the existence of [non-trivial] sufficient statistics is quite the rare event. Especially across models. The section (pp. 288-289) is also revealing about the above “objectivity” of the frequentist approach, in that the derivation of a test statistic taking large values away from the null, with a well-known distribution under the null, is not an automated process, esp. when nuisance parameters cannot be escaped from (pp. 291-294). Achieving separation from nuisance parameters, i.e. finding statistics that can be conditioned upon to eliminate those nuisance parameters, does not seem feasible outside well-formalised models related to exponential families. Even in such formalised models, a (clear?) element of arbitrariness is involved in the construction of the separations, which implies that the objectivity is under clear threat. The chapter recognises this limitation in Section 9.2 (pp.293-294); however, it argues that separation is much more common in the asymptotic sense and opposes the approach to the Bayesian averaging over the nuisance parameters, which “may be vitiated by faulty priors” (p.294). I am not convinced by the argument, given that the (approximate) conditional approach amounts to replacing the unknown nuisance parameter by an estimator, without accounting for the variability of this estimator. Averaging brings the right (in a consistency sense) penalty.
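The binomial versus negative binomial paradox mentioned above can be made concrete with the textbook coin-tossing illustration (9 heads and 3 tails, testing θ = 1/2 against θ > 1/2, where θ is the probability of heads). These numbers are the classical example from the literature, not taken from the chapter under review, and the function names in this Python sketch are hypothetical.

```python
from math import comb

def binomial_pvalue(heads, n, theta0=0.5):
    # P(X >= heads) when the number of tosses n is fixed in advance
    return sum(comb(n, k) * theta0**k * (1 - theta0)**(n - k)
               for k in range(heads, n + 1))

def negative_binomial_pvalue(heads, tails, theta0=0.5):
    # P(X >= heads) when tossing until `tails` tails have been observed,
    # computed as the complement of P(X <= heads - 1)
    below = sum(comb(k + tails - 1, k) * theta0**k * (1 - theta0)**tails
                for k in range(heads))
    return 1.0 - below

heads, tails = 9, 3
p_bin = binomial_pvalue(heads, heads + tails)
p_nb = negative_binomial_pvalue(heads, tails)

# both likelihoods are proportional to theta^9 (1 - theta)^3:
# their ratio is a constant that does not involve theta
likelihood_ratio = comb(12, 9) / comb(11, 9)
print(p_bin, p_nb, likelihood_ratio)
```

The one-sided p-value is about 0.073 under binomial sampling and about 0.033 under negative binomial sampling, so the two designs fall on opposite sides of the 5% level even though the likelihoods differ only by the constant factor 4.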
A compelling section is the one about the weak conditionality principle (pp. 294-298), as it objects to the usual statement that a frequency approach breaks this principle. In a mixture experiment about the same parameter θ, inferences made conditional on the experiment “are appropriately drawn in terms of the sampling behavior in the experiment known to have been performed” (p. 296). This seems hardly objectionable, as stated. And I must confess the sin of stating the opposite as The Bayesian Choice has this remark (Example 1.3.7, p.18) that the classical confidence interval averages over the experiments… Mea culpa! The term experiment validates the above conditioning in that several experiments could be used to measure θ, each with a different p-value. I will not argue with this. I could however argue about “conditioning is warranted to achieve objective frequentist goals” (p. 298) in that the choice of the conditioning, among other things, weakens the objectivity of the analysis. In a sense the above pirouette out of the conditioning principle paradox suffers from the same weakness, namely that when two distributions characterise the same data (the mixture and the conditional distributions), there is a choice to be made between “good” and “bad”. Nonetheless, an approach based on the mixture remains frequentist if non-optimal… (The chapter later attacks the derivation of the likelihood principle, I will come back to it in a later post.)
“Many seem to regard reference Bayesian theory to be a resting point until satisfactory subjective or informative priors are available. It is hard to see how this gives strong support to the reference prior research program.”—D. Cox and D. Mayo, p.302, Error and Inference, 2010
A section also worth commenting on is (unsurprisingly!) the one addressing the limitations of the Bayesian alternatives (pp. 298–302). It however dismisses right away the personalistic approach to priors by (predictably if hastily) considering it fails the objectivity canons. This seems a wee quick to me, as the choice of a prior is (a) the choice of a reference probability measure against which to assess the information brought by the data, not clearly less objective than picking one frequentist estimator or another, and (b) a personal construction of the prior can also be defended on objective grounds, based on the past experience of the modeler. That it varies from one modeler to the next is not an indication of subjectivity per se, simply of different past experiences. Cox and Mayo then focus on reference priors, à la Bernardo-Berger, once again pointing out the lack of uniqueness of those priors as a major flaw. While the sub-chapter agrees on the understanding of those priors as convention or reference priors, aiming at maximising the input from the data, it gets stuck on the impropriety of such priors: “if priors are not probabilities, what then is the interpretation of a posterior?” (p.299). This seems like a strange comment to me: the interpretation of a posterior is that it is a probability distribution and this is the only mathematical constraint one has to impose on a prior. (Which may be a problem in the derivation of reference priors.) As detailed in The Bayesian Choice among other books, there are many compelling reasons to invite improper priors into the game. (And one not to, namely the difficulty with point null hypotheses.) While I agree that the fact that some reference priors (like matching priors, whose discussion p. 302 escapes me) have good frequentist properties is not compelling within a Bayesian framework, it seems a good enough answer to the more general criticism about the lack of objectivity: in that sense, frequency-validated reference priors are part of the huge package of frequentist procedures and cannot be dismissed on the basis of being Bayesian. That reference priors are possibly at odds with the likelihood principle does not matter very much: the shape of the sampling distribution is part of the prior information, not of the likelihood per se. The final argument (Section 12) that Bayesian model choice requires the preliminary derivation of “the possible departures that might arise” (p.302) has been made at several points in Error and Inference. Besides being in my opinion a valid working principle, i.e. selecting the most appropriate albeit false model, this definition of well-defined alternatives is mimicked by the assumption of “statistics whose distribution does not depend on the model assumption” (p. 302) found in the same last paragraph.
In conclusion, this (sub-)chapter by David Cox and Deborah Mayo is (as could be expected!) a deep and thorough treatment of the frequentist approach to the sufficiency and (weak) conditionality principles. It however fails to convince me that there exists a “unique and unambiguous” frequentist approach to all but the most simple problems. At least, from reading this chapter, I cannot find a working principle that would lead me to this single unambiguous frequentist procedure.