With my friends Peter Green (Bristol), Krzysztof Łatuszyński (Warwick) and Marcello Pereyra (Bristol), we just arXived the first version of “Bayesian computation: a perspective on the current state, and sampling backwards and forwards”, which first title was the title of this post. This is a survey of our own perspective on Bayesian computation, from what occurred in the last 25 years [a lot!] to what could occur in the near future [a lot as well!]. Submitted to Statistics and Computing towards the special 25th anniversary issue, as announced in an earlier post.. Pulling strength and breadth from each other’s opinion, we have certainly attained more than the sum of our initial respective contributions, but we are welcoming comments about bits and pieces of importance that we miss and even more about promising new directions that are not posted in this survey. (A warning that is should go with most of my surveys is that my input in this paper will not differ by a large margin from ideas expressed here or in previous surveys.)
Archive for Bayesian Analysis
Our random forest paper was alas rejected last week. Alas because I think the approach is a significant advance in ABC methodology when implemented for model choice, avoiding the delicate selection of summary statistics and the report of shaky posterior probability approximation. Alas also because the referees somewhat missed the point, apparently perceiving random forests as a way to project a large collection of summary statistics on a limited dimensional vector as in the Read Paper of Paul Fearnhead and Dennis Prarngle, while the central point in using random forests is the avoidance of a selection or projection of summary statistics. They also dismissed ou approach based on the argument that the reduction in error rate brought by random forests over LDA or standard (k-nn) ABC is “marginal”, which indicates a degree of misunderstanding of what the classification error stand for in machine learning: the maximum possible gain in supervised learning with a large number of classes cannot be brought arbitrarily close to zero. Last but not least, the referees did not appreciate why we mostly cannot trust posterior probabilities produced by ABC model choice and hence why the posterior error loss is a valuable and almost inevitable machine learning alternative, dismissing the posterior expected loss as being not Bayesian enough (or at all), for “averaging over hypothetical datasets” (which is a replicate of Jeffreys‘ famous criticism of p-values)! Certainly a first time for me to be rejected based on this argument!
This blog post was contributed by my friend Julien Cornebise, as a reprint of a column he wrote for the latest ISBA Bulletin.
This article is an occasion to pay forward ever so slightly, by encouraging current Ph.D. candidates on their path, the support ISBA gave me. Four years ago, I was honored and humbled to receive the ISBA 2010 Savage Award, category Theory and Methods, for my Ph.D. dissertation defended in 2009. Looking back, I can now testify how much this brought to me both inside and outside of Academia.
Inside Academia: confirming and mitigating the widely-shared post-graduate’s impostor syndrome
Upon hearing of the great news, a brilliant multi-awarded senior researcher in my lab very kindly wrote to me that such awards meant never having to prove one’s worth again. Although genuinely touched by her congratulations, being far less accomplished and more junior than her, I felt all the more responsible to prove myself worth of this show of confidence from ISBA. It would be rather awkward to receive such an award only to fail miserably shortly after.
This resonated deeply with the shared secret of recent PhDs, discovered during my year at SAMSI, a vibrant institution where half a dozen new postdocs arrive each year: each and every one of us, fresh Ph.D.s from some of the best institutions (Cambridge, Duke, Waterloo, Paris…) secretly suffered the very same impostor syndrome. We were looking at each other’s CV/website and thinking “jeez! this guy/girl across the door is an expert of his/her field, look at all he/she has done, whereas I just barely scrape by on my own research!” – all the while putting up a convincing façade of self-assurance in front of audiences and whiteboards, to the point of apparent cockiness. Only after candid exchanges in SAMSI’s very open environment did we all discover being in the very same mindset.
In hindsight the explanation is simple: each young researcher in his/her own domain has the very expertise to measure how much he/she still does not know and has yet to learn, while he/she hears other young researchers, experts in their own other field, present results not as familiar to him/her, thus sounding so much more advanced. This take-away from SAMSI was perfectly confirmed by the Savage Award: yes, maybe indeed, I, just like my other colleagues, might actually know something relatively valuable, and my scraping by might just be not so bad – as is also the case of so many of my young colleagues.
Of course, impostor syndrome is a clingy beast and, healthily, I hope to never get entirely over it – merely overcoming it enough to say “Do not worry, thee young candidate, thy doubts pave a path well trodden”.
A similar message is also part of the little-known yet gem of a guide “How to do Research at MIT AI Lab – Emotional Factors”, relevant far beyond its original lab. I recommend it to any Ph.D. student; the feedback from readers is unanimous.
Outside Academia: incredibly increased readability
After two post-docs, and curious to see what was out there in atypical paths, I took a turn out of purely academic research, first as an independent consultant, then recruited out of the blue by a start-up’s recruiter, and eventually doing my small share to help convince investors. I discovered there another facet of ISBA’s Savage Award: tremendous readability.
In Academia, the dominating metric of quality is the length of the publication list – a debate for another day. Outside of Academia, however, not all interlocutors know how remarkable is a JRSSB Read Paper, or an oral presentation at NIPS, or a publication in Nature.
This is where international learned societies, like ISBA, come into play: the awards they bestow can serve as headline-grabbing material in a biography, easily spotted. The interlocutors do not need to be familiar with the subtleties of Bayesian Analysis. All they see is a stamp of approval from an official association of this researcher’s peers. That, in itself, is enough of a quality metric to pass the first round of contact, raise interest, and get the chance to further the conversation.
First concrete example: the recruiter who contacted me for the start-up I joined in 2011 was tasked to find profiles for an Applied position. The Savage Award on the CV grabbed his attention, even though he had no inkling what Adaptive Sequential Monte Carlo Methods were, nor if they were immediately relevant to the start-up. Passing it to the start-up’s managers, they immediately changed focus and interviewed me for their Research track instead: a profile that was not what they were looking for originally, yet stood out enough to interest them for a position they had not thought of filling via a recruiter – and indeed a unique position that I would never have thought to find this way either!
Second concrete example, years later, hard at work in this start-up’s amazing team: investors were coming for a round of technical due diligence. Venture capitals sent their best scientists-in-residence to dive deeply into the technical details of our research. Of course what matters in the end is, and forever will be, the work that is done and presented. Yet, the Savage Award was mentioned in the first line of the biography that was sent ahead of time, as a salient point to give a strong first impression of our research team.
Advices to Ph.D. Candidates: apply, you are the world best expert on your topic
That may sound trivial, but the first advice: apply. Discuss with your advisor the possibility to put your dissertation up for consideration. This might sound obvious to North-American students, whose educative system is rife with awards for high-performing students. Not so much in France, where those would be at odds with the sometimes over-present culture of égalité in the younger-age public education system. As a cultural consequence, few French Ph.D. students, even the most brilliant, would consider putting up their dissertation for consideration. I have been very lucky in that regard to benefit from the advice of a long-term Bayesian, who offered to send it for me – thanks again Xi’an! Not all students, regardless how brilliant their work, are made aware of this possibility.
The second advice, closely linked: do not underestimate the quality of your work. You are the foremost expert in the entire world on your Ph.D. topic. As discussed above, it is all too easy to see how advanced are the maths wielded by your office-mate, yet oversee the as-much-advanced maths you are juggling on a day-to-day basis, more familiar to you, and whose limitations you know better than anyone else. Actually, knowing these very limitations is what proves you are an expert.
A word of thanks and final advice
Finally, a word of thanks. I have been incredibly lucky, throughout my career so far, to meet great people. My dissertation already had four pages of acknowledgements: I doubt the Bulletin’s editor would appreciate me renewing (and extending!) them here. They are just as heartfelt today as they were then. I must, of course, add ISBA and the Savage Award committee for their support, as well as all those who, by their generous donations, allow the Savage Fund to stay alive throughout the years.
Of interest to Ph.D. candidates, though, one special mention of a dual tutelage system, that I have seen successfully at work many times. The most senior, a professor with the deep knowledge necessary to steer the project brings his endless fonts of knowledge collected over decades, wrapped in hardened tough-love. The youngest, a postdoc or fresh assistant professor, brings virtuosity, emulation and day-to-day patience. In my case they were Pr. Éric Moulines and Dr. Jimmy Olsson. That might be the final advice to a student: if you ever stumble, as many do, as I most surely did, because Ph.D. studies can be a hell of a roller-coaster to go through, reach out to the people around you and the joint set of skills they want to offer you. In combination, they can be amazing, and help you open doors that, in retrospect, can be worth all the efforts.
Julien Cornebise, Ph.D.
Among the many comments (thanks!) I received when posting our Testing via mixture estimation paper came the suggestion to relate this approach to the notion of full Bayesian significance test (FBST) developed by (Julio, not Hal) Stern and Pereira, from São Paulo, Brazil. I thus had a look at this alternative and read the Bayesian Analysis paper they published in 2008, as well as a paper recently published in Logic Journal of IGPL. (I could not find what the IGPL stands for.) The central notion in these papers is the e-value, which provides the posterior probability that the posterior density is larger than the largest posterior density over the null set. This definition bothers me, first because the null set has a measure equal to zero under an absolutely continuous prior (BA, p.82). Hence the posterior density is defined in an arbitrary manner over the null set and the maximum is itself arbitrary. (An issue that invalidates my 1993 version of the Lindley-Jeffreys paradox!) And second because it considers the posterior probability of an event that does not exist a priori, being conditional on the data. This sounds in fact quite similar to Statistical Inference, Murray Aitkin’s (2009) book using a posterior distribution of the likelihood function. With the same drawback of using the data twice. And the other issues discussed in our commentary of the book. (As a side-much-on-the-side remark, the authors incidentally forgot me when citing our 1992 Annals of Statistics paper about decision theory on accuracy estimators..!)
Deborah Mayo arXived a Statistical Science paper a few days ago, along with discussions by Jan Bjørnstad, Phil Dawid, Don Fraser, Michael Evans, Jan Hanning, R. Martin and C. Liu. I am very glad that this discussion paper came out and that it came out in Statistical Science, although I am rather surprised to find no discussion by Jim Berger or Robert Wolpert, and even though I still cannot entirely follow the deductive argument in the rejection of Birnbaum’s proof, just as in the earlier version in Error & Inference. But I somehow do not feel like going again into a new debate about this critique of Birnbaum’s derivation. (Even though statements like the fact that the SLP “would preclude the use of sampling distributions” (p.227) would call for contradiction.)
“It is the imprecision in Birnbaum’s formulation that leads to a faulty impression of exactly what is proved.” M. Evans
Indeed, at this stage, I fear that [for me] a more relevant issue is whether or not the debate does matter… At a logical cum foundational [and maybe cum historical] level, it makes perfect sense to uncover if and which if any of the myriad of Birnbaum’s likelihood Principles holds. [Although trying to uncover Birnbaum’s motives and positions over time may not be so relevant.] I think the paper and the discussions acknowledge that some version of the weak conditionality Principle does not imply some version of the strong likelihood Principle. With other logical implications remaining true. At a methodological level, I am less much less sure it matters. Each time I taught this notion, I got blank stares and incomprehension from my students, to the point I have now stopped altogether teaching the likelihood Principle in class. And most of my co-authors do not seem to care very much about it. At a purely mathematical level, I wonder if there even is ground for a debate since the notions involved can be defined in various imprecise ways, as pointed out by Michael Evans above and in his discussion. At a statistical level, sufficiency eventually is a strange notion in that it seems to make plenty of sense until one realises there is no interesting sufficiency outside exponential families. Just as there are very few parameter transforms for which unbiased estimators can be found. So I also spend very little time teaching and even less worrying about sufficiency. (As it happens, I taught the notion this morning!) At another and presumably more significant statistical level, what matters is information, e.g., conditioning means adding information (i.e., about which experiment has been used). While complex settings may prohibit the use of the entire information provided by the data, at a formal level there is no argument for not using the entire information, i.e. conditioning upon the entire data. (At a computational level, this is no longer true, witness ABC and similar limited information techniques. By the way, ABC demonstrates if needed why sampling distributions matter so much to Bayesian analysis.)
“Non-subjective Bayesians who (…) have to live with some violations of the likelihood principle (…) since their prior probability distributions are influenced by the sampling distribution.” D. Mayo (p.229)
In the end, the fact that the prior may depend on the form of the sampling distribution and hence does violate the likelihood Principle does not worry me so much. In most models I consider, the parameters are endogenous to those sampling distributions and do not live an ethereal existence independently from the model: they are substantiated and calibrated by the model itself, which makes the discussion about the LP rather vacuous. See, e.g., the coefficients of a linear model. In complex models, or in large datasets, it is even impossible to handle the whole data or the whole model and proxies have to be used instead, making worries about the structure of the (original) likelihood vacuous. I think we have now reached a stage of statistical inference where models are no longer accepted as ideal truth and where approximation is the hard reality, imposed by the massive amounts of data relentlessly calling for immediate processing. Hence, where the self-validation or invalidation of such approximations in terms of predictive performances is the relevant issue. Provided we can at all face the challenge…
Here are some entries I spotted in the past days as of potential interest, for which I will have not enough time to comment:
- arXiv:1410.0163: Instrumental Variables: An Econometrician’s Perspective by Guido Imbens
- arXiv:1410.0123: Deep Tempering by Guillaume Desjardins, Heng Luo, Aaron Courville, Yoshua Bengio
- arXiv:1410.0255: Variance reduction for irreversible Langevin samplers and diffusion on graphs by Luc Rey-Bellet, Konstantinos Spiliopoulos
- arXiv:1409.8502: Combining Particle MCMC with Rao-Blackwellized Monte Carlo Data Association for Parameter Estimation in Multiple Target Tracking by Juho Kokkala, Simo Särkkä
- arXiv:1409.8185: Adaptive Low-Complexity Sequential Inference for Dirichlet Process Mixture Models by Theodoros Tsiligkaridis, Keith W. Forsythe
- arXiv:1409.7986: Hypothesis testing for Markov chain Monte Carlo by Benjamin M. Gyori, Daniel Paulin
- arXiv:1409.7672: Order-invariant prior specification in Bayesian factor analysis by Dennis Leung, Mathias Drton
- arXiv:1409.7458: Beyond Maximum Likelihood: from Theory to Practice by Jiantao Jiao, Kartik Venkat, Yanjun Han, Tsachy Weissman
- arXiv:1409.7419: Identifying the number of clusters in discrete mixture models by Cláudia Silvestre, Margarida G. M. S. Cardoso, Mário A. T. Figueiredo
- arXiv:1409.7287: Identification of jump Markov linear models using particle filters by Andreas Svensson, Thomas B. Schön, Fredrik Lindsten
- arXiv:1409.7074: Variational Pseudolikelihood for Regularized Ising Inference by Charles K. Fisher
This introduction to Bayesian Analysis, Bayes’ Rule, was written by James Stone from the University of Sheffield, who contacted CHANCE suggesting a review of his book. I thus bought it from amazon to check the contents. And write a review.
First, the format of the book. It is a short paper of 127 pages, plus 40 pages of glossary, appendices, references and index. I eventually found the name of the publisher, Sebtel Press, but for a while thought the book was self-produced. While the LaTeX output is fine and the (Matlab) graphs readable, pictures are not of the best quality and the display editing is minimal in that there are several huge white spaces between pages. Nothing major there, obviously, it simply makes the book look like course notes, but this is in no way detrimental to its potential appeal. (I will not comment on the numerous appearances of Bayes’ alleged portrait in the book.)
“… (on average) the adjusted value θMAP is more accurate than θMLE.” (p.82)
Bayes’ Rule has the interesting feature that, in the very first chapter, after spending a rather long time on Bayes’ formula, it introduces Bayes factors (p.15). With the somewhat confusing choice of calling the prior probabilities of hypotheses marginal probabilities. Even though they are indeed marginal given the joint, marginal is usually reserved for the sample, as in marginal likelihood. Before returning to more (binary) applications of Bayes’ formula for the rest of the chapter. The second chapter is about probability theory, which means here introducing the three axioms of probability and discussing geometric interpretations of those axioms and Bayes’ rule. Chapter 3 moves to the case of discrete random variables with more than two values, i.e. contingency tables, on which the range of probability distributions is (re-)defined and produces a new entry to Bayes’ rule. And to the MAP. Given this pattern, it is not surprising that Chapter 4 does the same for continuous parameters. The parameter of a coin flip. This allows for discussion of uniform and reference priors. Including maximum entropy priors à la Jaynes. And bootstrap samples presented as approximating the posterior distribution under the “fairest prior”. And even two pages on standard loss functions. This chapter is followed by a short chapter dedicated to estimating a normal mean, then another short one on exploring the notion of a continuous joint (Gaussian) density.
“To some people the word Bayesian is like a red rag to a bull.” (p.119)
Bayes’ Rule concludes with a chapter entitled Bayesian wars. A rather surprising choice, given the intended audience. Which is rather bound to confuse this audience… The first part is about probabilistic ways of representing information, leading to subjective probability. The discussion goes on for a few pages to justify the use of priors but I find completely unfair the argument that because Bayes’ rule is a mathematical theorem, it “has been proven to be true”. It is indeed a maths theorem, however that does not imply that any inference based on this theorem is correct! (A surprising parallel is Kadane’s Principles of Uncertainty with its anti-objective final chapter.)
All in all, I remain puzzled after reading Bayes’ Rule. Puzzled by the intended audience, as contrary to other books I recently reviewed, the author does not shy away from mathematical notations and concepts, even though he proceeds quite gently through the basics of probability. Therefore, potential readers need some modicum of mathematical background that some students may miss (although it actually corresponds to what my kids would have learned in high school). It could thus constitute a soft entry to Bayesian concepts, before taking a formal course on Bayesian analysis. Hence doing no harm to the perception of the field.