## Nonparametric hierarchical Bayesian quantiles

Posted in Books, Statistics, University life on June 9, 2016 by xi'an

Luke Bornn, Neil Shephard and Reza Solgi have recently arXived a research report on non-parametric Bayesian quantiles. This work relates to their earlier paper that combines Bayesian inference with moment estimators, in that the quantiles do not entirely define the distribution of the data, which then needs to be completed by Bayesian means. But contrary to this previous paper, it does not require MCMC simulation for distributions defined on a variety such as, e.g., a curve.

Here a quantile is defined as minimising an asymmetric absolute risk, i.e., an expected loss. It is therefore a deterministic function of the model parameters for a parametric model and a functional of the model otherwise. And connected to a moment if not a moment per se. In the case of a model with a discrete support, the unconstrained model is parameterised by the probability vector θ and β=t(θ). However, the authors study the opposite approach, namely to set a prior on β, p(β), and then complement this prior with a conditional prior on θ, p(θ|β), the joint prior p(β)p(θ|β) being also the marginal p(θ) because of the deterministic relation. However, I am getting slightly lost in the motivation for the derivation of the conditional when the authors pick an arbitrary prior on θ and use it to derive a conditional on β which, along with an arbitrary (“scientific”) prior on β defines a new prior on θ. This works out in the discrete case because β has a finite support. But it is unclear (to me) why it should work in the continuous case [not covered in the paper].
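To make the definition concrete, here is a toy sketch (not taken from the paper) of the quantile as a deterministic transform β=t(θ) in the discrete case: the τ-quantile is taken as the support point minimising the expected asymmetric absolute (check) loss under the probability vector θ. The support values, the vector θ and the function names are assumptions made for the illustration.

```python
import numpy as np

def check_loss(u, tau):
    """Asymmetric absolute loss: tau*u when u >= 0, (tau - 1)*u when u < 0."""
    return u * (tau - (u < 0))

def quantile_from_theta(support, theta, tau):
    """Return beta = t(theta), the support point minimising the expected check loss."""
    support = np.asarray(support, dtype=float)
    theta = np.asarray(theta, dtype=float)
    # diffs[k, j] = s_k - s_j, so risk[j] = sum_k theta[k] * loss(s_k - s_j)
    diffs = support[:, None] - support[None, :]
    risk = theta @ check_loss(diffs, tau)
    return support[np.argmin(risk)]

# Hypothetical five-point support and probability vector
support = [0, 1, 2, 3, 4]
theta = [0.1, 0.2, 0.4, 0.2, 0.1]
beta = quantile_from_theta(support, theta, tau=0.5)  # the median of theta
```

Changing θ changes β deterministically, which is why a prior on θ induces one on β. When the level τ coincides exactly with a value of the cdf, the minimiser is not unique and `argmin` simply returns the first one.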

Getting back to the central idea of defining first the distribution on the quantile β, a further motivation is provided in the hierarchical extension of Section 3, where the same quantile distribution is shared by all individuals (e.g., cricket players) in the population, while the underlying distributions for the individuals are otherwise disconnected and unconstrained. (Obviously, a part of the cricket example went far above my head. But one may always idly wonder why all players should share the same distribution. And about what would happen when imposing no quantile constraint but picking instead a direct hierarchical modelling on the θ’s.) This common distribution on β can then be modelled by a Dirichlet hyperprior.

The paper also contains a section on estimating the entire quantile function, which is a wee paradox in that this function is again a deterministic transform of the original parameter θ, while the authors instead use pointwise estimation, i.e., for each level τ. I find the exercise furthermore paradoxical in that the hierarchical modelling with a common distribution on the quantile β(τ) is repeated for each τ, but separately, whereas the entire parameter should share a common distribution, given the equivalence between the quantile function and the entire parameter θ.

## covariant priors, Jeffreys and paradoxes

Posted in Books, Statistics, University life on February 9, 2016 by xi'an

“If no information is available, π(α|M) must not deliver information about α.”

In a recent arXival apparently submitted to Bayesian Analysis, Giovanni Mana and Carlo Palmisano discuss the choice of priors in metrology. Which reminded me of this meeting I attended at the Bureau des Poids et Mesures in Sèvres where similar debates took place, albeit being led by ferocious anti-Bayesians! Their reference prior appears to be the Jeffreys prior, because of its reparameterisation invariance.

“The relevance of the Jeffreys rule in metrology and in expressing uncertainties in measurements resides in the metric invariance.”

This, along with a second order approximation to the Kullback-Leibler divergence, is indeed one reason for advocating the use of a Jeffreys prior. I at first found it surprising that the (usually improper) prior is used in a marginal likelihood, as it cannot be normalised. A source of much debate [and of our alternative proposal].
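For the record, the reparameterisation invariance of the Jeffreys prior is a one-line computation in the scalar case. The notation is introduced here for illustration and is not taken from the paper: φ=g(α) is assumed to be a smooth one-to-one transform, I(·) denotes the Fisher information, and π_J the Jeffreys prior.

$$
I(\varphi) = I(\alpha)\left(\frac{\mathrm{d}\alpha}{\mathrm{d}\varphi}\right)^{2}
\quad\Longrightarrow\quad
\pi_J(\varphi) \propto \sqrt{I(\varphi)}
= \sqrt{I(\alpha)}\,\left|\frac{\mathrm{d}\alpha}{\mathrm{d}\varphi}\right|
\propto \pi_J(\alpha)\,\left|\frac{\mathrm{d}\alpha}{\mathrm{d}\varphi}\right|
$$

The right-hand side is the usual change-of-variables formula for densities, so deriving the prior in either parameterisation gives the same answer.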

“To make a meaningful posterior distribution and uncertainty assessment, the prior density must be covariant; that is, the prior distributions of different parameterizations must be obtained by transformations of variables. Furthermore, it is necessary that the prior densities are proper.”

The above quote is quite interesting both in that the notion of covariant is used rather than invariant or equivariant. And in that properness is indicated as a requirement. (Even more surprising is the noun associated with covariant, since it clashes with the usual notion of covariance!) They conclude that the marginal associated with an improper prior is null because the normalising constant of the prior is infinite.

“…the posterior probability of a selected model must not be null; therefore, improper priors are not allowed.”

## Bayesian Data Analysis [BDA3 – part #2]

Posted in Books, Kids, R, Statistics, University life on March 31, 2014 by xi'an

Here is the second part of my review of Gelman et al.’s Bayesian Data Analysis (third edition):

“When an iterative simulation algorithm is “tuned” (…) the iterations will not in general converge to the target distribution.” (p.297)

Part III covers advanced computation, obviously including MCMC but also model approximations like variational Bayes and expectation propagation (EP), with even a few words on ABC. The novelties in this part are centred on Stan, the language Andrew is developing around Hamiltonian Monte Carlo techniques, a sort of BUGS of the 10’s! (And of course Hamiltonian Monte Carlo techniques themselves.) A few (nit)pickings: the book advises importance resampling without replacement (p.266), which makes some sense when using a poor importance function but ruins the fundamentals of importance sampling. Plus, no trace of infinite variance importance sampling? of harmonic means and their dangers? In the Metropolis-Hastings algorithm, the proposal is called the jumping rule and denoted by Jt, which, besides giving the impression of a Jacobian, seems to allow for time-varying proposals and hence time-inhomogeneous Markov chains, whose convergence properties are much hairier. (The warning comes much later, as exemplified in the above quote.) Moving from “burn-in” to “warm-up” to describe the beginning of an MCMC simulation. Being somewhat 90’s about convergence diagnoses (as shown by the references in Section 11.7), although the book also proposes new diagnoses and relies much more on effective sample sizes. Particle filters are evacuated in hardly half-a-page. Maybe because Stan does not handle particle filters. A lack of intuition about the Hamiltonian Monte Carlo algorithms, as the book plunges immediately into a two-page pseudo-code description. Still using physics vocabulary that put me (and maybe only me) off. Although I appreciated the advice to check analytical gradients against their numerical counterpart.
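As an illustration of that last piece of advice, here is a minimal sketch of a gradient check, comparing a hand-coded gradient with central finite differences. The target (a correlated bivariate normal log-density), the evaluation point and all function names are assumptions chosen for the example, not code from the book or from Stan.

```python
import numpy as np

# Hypothetical target: zero-mean bivariate normal with correlation 0.8
PREC = np.linalg.inv(np.array([[1.0, 0.8], [0.8, 1.0]]))

def log_post(q):
    """Log-density up to an additive constant."""
    return -0.5 * q @ PREC @ q

def grad_log_post(q):
    """Analytical gradient of log_post, as would be fed to an HMC sampler."""
    return -PREC @ q

def numerical_grad(f, q, eps=1e-6):
    """Central finite-difference approximation of the gradient of f at q."""
    q = np.asarray(q, dtype=float)
    g = np.zeros_like(q)
    for i in range(q.size):
        step = np.zeros_like(q)
        step[i] = eps
        g[i] = (f(q + step) - f(q - step)) / (2 * eps)
    return g

q0 = np.array([0.3, -1.2])
max_abs_diff = np.max(np.abs(grad_log_post(q0) - numerical_grad(log_post, q0)))
```

A discrepancy well above the finite-difference error at a few test points would usually point to a mistake in the analytical gradient, which is worth knowing before launching the leapfrog integrator.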

“In principle there is no limit to the number of levels of variation that can be handled in this way. Bayesian methods provide ready guidance in handling the estimation of the unknown parameters.” (p.381)

I also enjoyed reading the part about modes that stand at the boundary of the parameter space (Section 13.2), even though I do not think modes are great summaries in Bayesian frameworks and while I do not see how picking the prior to avoid modes at the boundary avoids the data impacting the prior, in fine. The variational Bayes section (13.7) is equally enjoyable, with a proper spelled-out illustration, introducing an unusual feature for Bayesian textbooks.  (Except that sampling without replacement is back!) Same comments for the Expectation Propagation (EP) section (13.8) that covers brand new notions. (Will they stand the test of time?!)

“Geometrically, if β-space is thought of as a room, the model implied by classical model selection claims that the true β has certain prior probabilities of being in the room, on the floor, on the walls, in the edge of the room, or in a corner.” (p.368)

“You can use MCMC, normal approximation, variational Bayes, expectation propagation, Stan, or any other method. But your fit must be Bayesian.” (p.517)

Part V concentrates the most advanced material, with Chapter 19 being mostly an illustration of a few complex models, slightly superfluous in my opinion, Chapter 20 a very short introduction to functional bases, including a basis selection section (20.2) that implements the “zero coefficient” variable selection principle refuted in the regression chapter(s), and does not go beyond splines (what about wavelets?), Chapter 21 a (quick) coverage of Gaussian processes with the motivating birth-date example (and two mixture datasets I used eons ago…), Chapter 22 a more (too much?) detailed study of finite mixture models, with no coverage of reversible-jump MCMC, and Chapter 23 an entry on Bayesian non-parametrics through Dirichlet processes.

“In practice, for well separated components, it is common to remain stuck in one labelling across all the samples that are collected. One could argue that the Gibbs sampler has failed in such a case.” (p.535)

To get back to mixtures, I liked the quote about the label switching issue above, as I was “one” who argued that the Gibbs sampler fails to converge! The corresponding section seems to favour providing a density estimate for mixture models, rather than component-wise evaluations, but it nonetheless mentions the relabelling by permutation approach (if missing our 2000 JASA paper). The section about inferring on the unknown number of components suggests conducting a regular Gibbs sampler on a model with an upper bound on the number of components and then checking for empty components, an idea I (briefly) considered in the mid-1990’s before the occurrence of RJMCMC. Of course, the prior on the components matters and the book suggests using a Dirichlet with fixed sum like 1 on the coefficients for all numbers of components.

“14. Objectivity and subjectivity: discuss the statement ‘People tend to believe results that support their preconceptions and disbelieve results that surprise them. Bayesian methods tend to encourage this undisciplined mode of thinking.’” (p.100)

Obviously, this being a third edition raises the question, what’s up, doc?!, i.e., what’s new [when compared with the second edition]? Quite a lot, even though I am not enough of a Gelmanian exegete to produce a comparison table. Well, for starters, David Dunson and Aki Vehtari joined the authorship, mostly contributing to the advanced section on non-parametrics, Gaussian processes, EP algorithms. Then the Hamiltonian Monte Carlo methodology and Stan of course, which is now central to Andrew’s interests. The book does include a short Appendix on running computations in R and in Stan. Further novelties were mentioned above, like the vision of weakly informative priors taking over noninformative priors, but I think this edition of Bayesian Data Analysis puts more stress on clever and critical model construction and on the fact that it can be done in a Bayesian manner. Hence the insistence on predictive and cross-validation tools. The book may be deemed somewhat short on exercises, providing between 3 and 20 mostly well-developed problems per chapter, often associated with datasets, rather than the less exciting counter-example above. Even though Andrew disagrees and his students at ENSAE this year certainly did not complain, I personally feel a total of 220 exercises is not enough for instructors and self-study readers. (At least, this reduces the number of email requests for solutions! Esp. when 50 of those are solved on the book website.) But this aspect is a minor quibble: overall this is truly the reference book for a graduate course on Bayesian statistics and not only Bayesian data analysis.

## Bayesian Data Analysis [BDA3]

Posted in Books, Kids, R, Statistics, University life on March 28, 2014 by xi'an

Andrew Gelman and his coauthors, John Carlin, Hal Stern, David Dunson, Aki Vehtari, and Don Rubin, have now published the latest edition of their book Bayesian Data Analysis. David and Aki are newcomers to the authors’ list, with an extended section on non-linear and non-parametric models. I have been asked by Sam Behseta to write a review of this new edition for JASA (since Sam is now the JASA book review editor). After wondering about my ability to produce an objective review (on the one hand, this is The Competition to Bayesian Essentials!, on the other hand Andrew is a good friend spending the year with me in Paris), I decided to jump for it and write a most subjective review, with the help of Clara Grazian who was Andrew’s teaching assistant this year in Paris and maybe some of my Master students who took Andrew’s course. The second edition was reviewed in the September 2004 issue of JASA and we now stand ten years later with an even more impressive textbook. Which is truly what Bayesian data analysis should be.

This edition has five parts, Fundamentals of Bayesian Inference, Fundamentals of Bayesian Data Analysis, Advanced Computation, Regression Models, and Non-linear and Non-parametric Models, plus three appendices. For a total of xiv+662 pages. And a weight of 2.9 pounds (1395g on my kitchen scale!) that makes it hard to carry around in the metro…. I took it to Warwick (and then Nottingham and Oxford and back to Paris) instead.

“We could avoid the mathematical effort of checking the integrability of the posterior density (…) The result would clearly show the posterior contour drifting off toward infinity.” (p.111)

While I cannot go into a detailed reading of those 662 pages (!), I want to highlight a few gems. (I already wrote a detailed and critical analysis of Chapter 6 on model checking in that post.) The very first chapter provides all the necessary items for understanding Bayesian Data Analysis without getting bogged down in propaganda or pseudo-philosophy. Then the other chapters of the first part unroll in a smooth way, cruising on the B highway… With the unique feature of introducing weakly informative priors (Sections 2.9 and 5.7), like the half-Cauchy distribution on scale parameters. It may not be completely clear how weak a weakly informative prior is, but this novel notion is worth including in a textbook. Maybe a mild reproach at this stage: Chapter 5 on hierarchical models is too verbose for my taste, as it essentially focuses on the hierarchical linear model. Of course, this is an essential chapter as it links exchangeability, the “atom” of Bayesian reasoning used by de Finetti, with hierarchical models. Still. Another comment on that chapter: it broaches the topic of improper posteriors by suggesting running a Markov chain that can exhibit improperness by enjoying an improper behaviour. When it happens as in the quote above, fine!, but there is no guarantee this is always the case! For instance, improperness may be due to regions near zero rather than infinity. And a last barb: there is a dense table (Table 5.4, p.124) that seems to run contrariwise to Andrew’s avowed dislike of tables. I could also object to the idea of a “true prior distribution” (p.128), or comment on the trivia that hierarchical chapters seem to attract rats (as I also included a rat example in the hierarchical Bayes chapter of Bayesian Choice and so does the BUGS Book! Hence, a conclusion that Bayesian textbooks are better avoided by muriphobiacs…)

“Bayes factors do not work well for models that are inherently continuous (…) Because we emphasize continuous families of models rather than discrete choices, Bayes factors are rarely relevant in our approach to Bayesian statistics.” (p.183 & p.193)

Part II is about “the creative choices that are required, first to set up a Bayesian model in a complex problem, then to perform the model checking and confidence building that is typically necessary to make posterior inferences scientifically defensible” (p.139). It is certainly one of the strengths of the book that it allows for a critical look at models and tools that are rarely discussed in more theoretical Bayesian books. As detailed in my earlier post on Chapter 6, model checking is strongly advocated, via posterior predictive checks and… posterior predictive p-values, which are at best empirical indicators that something could be wrong, definitely not that everything’s all right! Chapter 7 is the model comparison equivalent of Chapter 6, starting with the predictive density (aka the evidence or the marginal likelihood), but completely bypassing the Bayes factor for information criteria like the Watanabe-Akaike or widely applicable information criterion (WAIC), and advocating cross-validation, which is empirically satisfying but formally hard to integrate within a full Bayesian perspective. Chapter 8 is about data collection, sample surveys, randomization and related topics, another entry that is missing from most Bayesian textbooks, maybe not that surprising given the research topics of some of the authors. And Chapter 9 is the symmetric in that it focuses on the post-modelling step of decision making.

(Second part of the review to appear on Monday, leaving readers the weekend to recover!)

## Statistics for spatio-temporal data [book review]

Posted in Books, Statistics, University life on October 14, 2013 by xi'an

Here is the new reference book about spatial and spatio-temporal statistical modelling! Noel Cressie wrote the earlier classic Statistics for Spatial Data in 1993 and he has now co-authored with Christopher Wikle (a plenary speaker at ISBA 2014 in Cancún) the new bible on the topic. And with a very nice cover of a Guatemalan lienzo about the Spanish conquest. (Disclaimer: as I am a good friend of Noel, do not expect this review to remain unbiased!)

“…we state the obvious, that political boundaries cannot hold back a one-meter rise in sea level; our environment is ultimately a global resource and its stewardship is an international responsibility.” (p.11)

The book is a sum (in the French/Latin meaning of somme/summa when applied to books; I am not sure this explanation makes any sense!) and, like its predecessor, it covers an enormous range of topics and methods. So do not expect a textbook coverage of most notions and prepare to read further articles referenced in the text. One of the many differences with the earlier book is that MCMC appears from the start as a stepping stone that is necessary to handle

“…there are model-selection criteria that could be invoked (e.g., AIC, BIC, DIC, etc.), which concentrate on the twin pillars of predictability and parsimony. But they do not address the third pillar, namely scientific interpretability (i.e., knowledge).” (p.33)

The first chapter of the book is actually a preface motivating the topics covered by the book, which may be confusing on a first read, esp. for a graduate student, as there is no math formula and no model introduced at this stage. Anyway, this is not really a book made for a linear read. It is quite  witty (with too many quotes to report here!) and often funny (I learned for instance that Einstein’s quote “Everything should be made as simple as possible, but not simpler” was a paraphrase of an earlier lecture, invented by the Reader’s Digest!).

“Thus, we believe that it is not helpful to try to classify probability distributions that determine the statistical models, as subjective or objective. Better questions to ask are about the sensitivity of inferences to model choices and whether such choices make sense scientifically.” (p.32)

The overall tone of the book is mostly Bayesian, in a non-conflictual conditional probability way, insisting on hierarchical (Bayesian) model building. Incidentally, it uses the same bracket notation for generic distributions (densities) as in Gelfand and Smith (JASA, 1990), i.e. [X|Y] and [X|Z,y][Z|y,θ], notation that did not get much of a fan club. (I actually do not know where it stemmed from.) The second chapter contains an illustration of the search for the USS Scorpion using a Bayesian model (including priors built from experts’ opinions), example which is also covered [without the maths!] in Sharon McGrayne’s Theory that would not die.

The book is too rich and my time is too tight (!) to cover each chapter in detail. (For instance, I am not so happy with the temporal chapter in that it moves away from the Bayesian perspective without much of a justification.) Suffice it to say then that it appears like an updated and improved version of its predecessor, with 45 pages of references, some of them quite recent. If I were to teach from this book at a Master level, it would take the whole academic year and then some, assuming enough mathematical culture from the student audience.

As an addendum, I noticed several negative reviews on Amazon due to the poor quality of the printing, but the copy I received from John Wiley was quite fine, with the many colour graphs well-rendered. Maybe an earlier printing or a different printing agreement?