Archive for DIC

statistical modeling with R [book review]

Posted in Books, Statistics on June 10, 2023 by xi'an

Statistical Modeling with R (A dual frequentist and Bayesian approach for life scientists) is a recent book written by Pablo Inchausti, from Uruguay. In a highly personal and congenial style (witness the preface), with references to (fiction) books that enticed me to buy them. The book was sent to me by the JASA book editor for review and I went through the whole of it during my flight back from Jeddah. [Disclaimer about potential self-plagiarism: this post or a likely edited version of it will eventually appear in JASA. If not CHANCE, for once.]

The very first sentence (after the preface) quotes my late friend Steve Fienberg, which is definitely starting on the right foot. The exposition of the motivations for writing the book is quite convincing, with more emphasis than usual put on the notion and limitations of modeling. The discourse is overall inspirational and contains many relevant remarks and links that make it worth reading as a whole. While heavily connected with a few R packages and functions like fitdist, fitdistrplus, brms (a front for Stan), glm, glmer, the book is wisely bypassing the perilous reef of recalling R basics. Similarly for the foundations of probability and statistics. While lacking in formal definitions, in my opinion, it reads well enough to somehow compensate for this very lack. I also appreciate the coherent and thorough continuation of the parallel description of Bayesian and non-Bayesian analyses, an attempt that too often quickly disappears in other books. (As an aside, note that hardly anyone claims to be a frequentist, except maybe Deborah Mayo.) A new model is almost invariably backed by a new dataset, if with a few being somewhat inappropriate as in the mammal sleep patterns of Chapter 5. Or in Fig. 6.1.

Given that the main motivation for the book (when compared with references like BDA) is heavily towards the practical implementation of statistical modelling via R packages, it is inevitable that a large fraction of Statistical Modeling with R is spent on the analysis of R outputs, even though it sometimes feels a wee bit too heavy for yours truly. The R screen-copies are however produced in moderate quantity and size, even though the variations in typography/fonts (at least on my copy?!) may prove confusing. Obviously the high (explosive?) distinction between regression models may eventually prove challenging for the novice reader. The specific issue of prior input (or “defining priors”) is briefly addressed in a non-chapter (p.323), although mentions are made throughout preceding chapters. I note the nice appearance of hierarchical models and experimental designs towards the end, but would have appreciated some discussions on missing topics such as time series, causality, connections with machine learning, non-parametrics, model misspecification. As an aside, I appreciated being reminded about the apocryphal nature of Ockham’s much cited quote “Pluralitas non est ponenda sine necessitate”.

Typo Jeffries found in Fig. 2.1, along with a rather sketchy representation of the history of both frequentist and Bayesian statistics. And Jon Wakefield’s book (with related purpose of presenting both versions of parametric inference) was mistakenly entered as Wakenfield’s in the bibliography file. Some repetitions occur. I do not like the use of the equivalence symbol ≈ for proportionality. And I found two occurrences of the unavoidable “the the” typo (p.174 and p.422). I also had trouble with some sentences like “long-run, hypothetical distribution of parameter estimates known as the sampling distribution” (p.27), “maximum likelihood estimates [being] sufficient” (p.28), “Jeffreys’ (1939) conjugate priors” [which were introduced by Raiffa and Schlaifer] (p.35), “A posteriori tests in frequentist models” (p.130), “exponential families [having] limited practical implications for non-statisticians” (p.190), “choice of priors being correct” (p.339), or calling MCMC sample terms “estimates” (p.42), and issues with some repetitions, missing indices for acronyms, packages, datasets, but did not bemoan the lack of homework sections (beyond suggesting new datasets for analysis).

A problematic MCMC entry is found when calibrating the choice of the Metropolis-Hastings proposal towards avoiding negative values “that will generate an error when calculating the log-likelihood” (p.43) since it suggests proposed values should not exceed the support of the posterior (and indicates a poor coding of the log-likelihood!). I also find the motivation for the full conditional decomposition behind the Gibbs sampler (p.47) unnecessarily confusing. (And automatically having a Metropolis-Hastings step within Gibbs as on Fig. 3.9 brings another magnitude of confusion.) The Bayes factor section is very terse. The derivation of the Kullback-Leibler representation (7.3) as an expected log likelihood ratio seems to be missing a reference measure. Of course, seeing a detailed coverage of DIC (Section 7.4) did not suit me either, even though the issue with mixtures was alluded to (with no detail whatsoever). The Nelder presentation of the generalised linear models felt somewhat antiquated, since the addition of the scale factor a(φ) sounds over-parameterized.
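The remark on the Metropolis-Hastings proposal can be made concrete with a minimal sketch, which is not taken from the book: the model (an Exponential sample with an Exp(1) prior on the rate), the function names and the tuning values are all assumptions chosen for illustration. The point is only that a log-posterior returning minus infinity outside the support lets a random walk proposal wander into negative values without any error, such proposals being simply rejected.

```python
import numpy as np

def log_post(theta, data):
    """Log-posterior (up to a constant) of an Exponential(rate=theta) sample
    under an Exp(1) prior on theta; equal to -inf outside the support."""
    if theta <= 0:
        return -np.inf                      # outside the support: always rejected
    return len(data) * np.log(theta) - theta * data.sum() - theta

def rw_metropolis(data, n_iter=20_000, scale=1.0, seed=0):
    """Random walk Metropolis with an unconstrained Normal proposal."""
    rng = np.random.default_rng(seed)
    theta = 1.0
    lp = log_post(theta, data)
    chain = np.empty(n_iter)
    for i in range(n_iter):
        prop = theta + scale * rng.normal()  # may well be negative
        lp_prop = log_post(prop, data)
        if np.log(rng.uniform()) < lp_prop - lp:
            theta, lp = prop, lp_prop
        chain[i] = theta
    return chain

rng = np.random.default_rng(1)
data = rng.exponential(scale=1 / 2.0, size=50)   # hypothetical data, true rate 2
chain = rw_metropolis(data)
post_mean = (len(data) + 1) / (data.sum() + 1)   # exact Gamma posterior mean
```

The chain never leaves the positive half-line and its average can be compared with the exact Gamma posterior mean `post_mean`, with no need to calibrate the proposal scale towards avoiding negative values.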

But those are minor quibbles in relation to a book that should attract curious minds of various background knowledge and expertise in statistics, as well as work nicely to support an enthusiastic teacher of statistical modelling. I thus recommend this book most enthusiastically.

mathematical theory of Bayesian statistics [book review]

Posted in Books, Statistics, Travel, University life on May 6, 2021 by xi'an

I came by chance (and not by CHANCE) upon this 2018 CRC Press book by Sumio Watanabe and ordered it myself to gather which material it really covered. As the back-cover blurb was not particularly clear and the title sounded quite general. After reading it, I found out that this is a mathematical treatise on some aspects of Bayesian information criteria, in particular on the Widely Applicable Information Criterion (WAIC) that was introduced by the author in 2010. The result is a rather technical and highly focussed book with little motivation or intuition surrounding the mathematical results, which may make the reading arduous for readers. Some background on mathematical statistics and Bayesian inference is clearly preferable and the book cannot be used as a textbook for most audiences, as opposed to e.g. An Introduction to Bayesian Analysis by J.K. Ghosh et al. or even more to Principles of Uncertainty by J. Kadane. In connection with this remark, the exercises found in the book are closer to the delivery of additional material than to textbook-style exercises.

“posterior distributions are often far from any normal distribution, showing that Bayesian estimation gives the more accurate inference than other estimation methods.”

The overall setting is one where both the sampling and the prior distributions are different from respective “true” distributions. Requiring a tool to assess the discrepancy when utilising a specific pair of such distributions. Especially when the posterior distribution cannot be approximated by a Normal distribution. (Lindley’s paradox makes an interesting incognito incursion on p.238.) The WAIC is supported for the determination of the “true” model, in opposition to AIC and DIC, incl. on a mixture example that reminded me of our eight versions of DIC paper. In the “Basic Bayesian Theory” chapter (§3), the “basic theorem of Bayesian statistics” (p.85) states that the various losses related with WAIC can be expressed as second-order Taylor expansions of some cumulant generating functions, with order o(n⁻¹), “even if the posterior distribution cannot be approximated by any normal distribution” (p.87). With the intuition that

“if a log density ratio function has a relatively finite variance then the generalization loss, the cross validation loss, the training loss and WAIC have the same asymptotic behaviors.”
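As a rough numerical illustration of the quoted statement, and not as a rendering of the book's proofs, the sketch below computes the training loss, WAIC and an importance-sampling version of the cross-validation loss from posterior draws. The Normal mean model, the N(0, 10²) prior, the sample sizes and the helper `log_mean_exp` are assumptions made for the example, and the formulas follow the usual definitions of these quantities as I understand them.

```python
import numpy as np

def log_mean_exp(a, axis=0):
    """Numerically stable log of the average of exp(a) along an axis."""
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(a - m).mean(axis=axis, keepdims=True))).squeeze(axis)

rng = np.random.default_rng(0)
n, S = 200, 4000
x = rng.normal(0.3, 1.0, size=n)              # hypothetical data from N(0.3, 1)

# Model x_i ~ N(theta, 1) with prior theta ~ N(0, 10^2): conjugate posterior
post_var = 1.0 / (n + 1.0 / 100.0)
post_mean = post_var * x.sum()
theta = rng.normal(post_mean, np.sqrt(post_var), size=S)

# S x n matrix of pointwise log-likelihoods log p(x_i | theta_s)
loglik = -0.5 * np.log(2 * np.pi) - 0.5 * (x[None, :] - theta[:, None]) ** 2

training_loss = -log_mean_exp(loglik, axis=0).mean()
functional_var = loglik.var(axis=0).mean()    # posterior variance of log p(x_i | theta)
waic = training_loss + functional_var
cv_loss = log_mean_exp(-loglik, axis=0).mean()  # importance-sampling leave-one-out
```

In this regular one-parameter example `waic` and `cv_loss` should nearly coincide, both sitting above `training_loss` by about `functional_var`, which is itself close to 1/n.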

Obviously, these “basic” aspects should come as a surprise to a fair percentage of Bayesians (in the sense of not being particularly basic). Myself included. Chapter 4 exposes why, for regular models, the posterior distribution accumulates in an ε neighbourhood of the optimal parameter at a speed O(n^{2/5}). With the normalised partition function being of order n^{-d/2} in the neighbourhood and exponentially negligible outside. A consequence of this regular asymptotic theory is that all above losses are asymptotically equivalent to the negative log likelihood plus similar order n⁻¹ terms that can be ordered. Chapters 5 and 6 deal with “standard” [the likelihood ratio is a multi-index power of the parameter ω] and general posterior distributions that can be written as mixtures of standard distributions, with expressions of the above losses in terms of new universal constants. Again, a rather remote concern of mine. The book also includes a chapter (§7) on MCMC, with a rather involved proof that a Metropolis algorithm satisfies detailed balance (p.210). The Gibbs sampling section contains an extensive example on a two-dimensional two-component unit-variance Normal mixture, with an unusual perspective on the posterior, which is considered as “singular” when the true means are close. (Label switching or the absence thereof is not mentioned.) In terms of approximating the normalising constant (or free energy), the only method discussed there is path sampling, with a cryptic remark about harmonic mean estimators (not identified as such). In a final knapsack chapter (§9), Bayes factors (confusingly denoted as L(x)) are shown to be most powerful tests in a Bayesian sense when comparing hypotheses without prior weights on said hypotheses, while posterior probability ratios are the natural statistics for comparing models with prior weights on said models. (With Lindley’s paradox making another appearance, still incognito!)
And a notion of phase transition for hyperparameters is introduced, with the meaning of a radical change of behaviour at a critical value of said hyperparameter. For instance, for a simple normal-mixture outlier model, the critical value of the Beta hyperparameter is α=2. Which is a wee bit of a surprise when considering Rousseau and Mengersen (2011) since their bound for consistency was α=d/2.
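On the detailed balance of the Metropolis algorithm, the argument fits in one line in the finite case: with a symmetric proposal q, the flow π(i)q(i,j)min{1,π(j)/π(i)} equals q(i,j)min{π(i),π(j)}, which is symmetric in (i,j). The sketch below checks this numerically on a four-state target; the target, the uniform proposal and the function name are assumptions for illustration and bear no relation to the notation of the book.

```python
import numpy as np

def metropolis_kernel(pi, q):
    """Transition matrix of the Metropolis algorithm on a finite state space,
    for a target pi and a symmetric proposal matrix q (rows summing to one)."""
    k = len(pi)
    P = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            if i != j:
                P[i, j] = q[i, j] * min(1.0, pi[j] / pi[i])
        P[i, i] = 1.0 - P[i].sum()       # rejected proposals stay in place
    return P

pi = np.array([0.1, 0.2, 0.3, 0.4])      # hypothetical target
q = np.full((4, 4), 0.25)                # symmetric (uniform) proposal
P = metropolis_kernel(pi, q)

flow = pi[:, None] * P                   # flow[i, j] = pi_i * P(i, j)
detailed_balance = np.allclose(flow, flow.T)
stationary = np.allclose(pi @ P, pi)
```

Symmetry of `flow` is detailed balance, and summing it over one index gives the stationarity of π.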

In conclusion, this is quite an original perspective on Bayesian models, covering the somewhat unusual (and potentially controversial) issue of misspecified priors and centered on the use of information criteria. I find the book could have benefited from further editing as I noticed many typos and somewhat unusual sentences (at least unusual to me).

[Disclaimer about potential self-plagiarism: this post or an edited version should eventually appear in my Books Review section in CHANCE.]

Probability and Bayesian modeling [book review]

Posted in Books, Kids, R, Statistics, University life on March 26, 2020 by xi'an

Probability and Bayesian modeling is a textbook by Jim Albert [whose reply is included at the end of this entry] and Jingchen Hu that CRC Press sent me for review in CHANCE. (The book is also freely available in bookdown format.) The level of the textbook is definitely most introductory as it dedicates its first half to probability concepts (with no measure theory involved), meaning mostly focusing on counting and finite sample space models. The second half moves to Bayesian inference(s) with a strong reliance on JAGS for the processing of more realistic models. And R vignettes for the simplest cases (where I discovered R commands I was unaware of, like dplyr::mutate()!).

As a preliminary warning about my biases, I am always reserved about mixing introductions to probability theory and to (Bayesian) statistics in the same book, as I feel they should be separated to avoid confusion. As for instance between histograms and densities, or between (theoretical) expectation and (empirical) mean. I therefore fail to relate to the pace and tone adopted in the book which, in my opinion, seems to dally on overly simple examples [far too often concerned with food or baseball] while skipping over the concepts and background theory. For instance, introducing the concept of subjective probability as early as page 6 is laudable but I doubt it will engage fresh readers when describing it as a measurement of one’s “belief about the truth of an event”, then stressing that “make any kind of measurement, one needs a tool like a scale or ruler”. Overall, I have no particularly focused criticisms on the probability part except for the discrete vs continuous imbalance. (With the Poisson distribution not covered in the Discrete Distributions chapter. And the “bell curve” making a weird and unrigorous appearance there.) Galton’s board (no mention found of quincunx) could have been better exploited towards the physical definition of a prior, following Stephen Stigler’s analysis, by adding a second level. Or turned into an R coding exercise. In the continuous distributions chapter, I would have seen the cdf coming before the pdf, rather than the opposite. And disliked the notion that a Normal distribution was supported by a histogram of (marathon) running times, i.e. values lower bounded by 122 (at the moment). Or later (in Chapter 8) for Roger Federer’s serving times. Incidentally, a fun typo on p.191, at least fun for LaTeX users, as

f_{Y\ mid X}

with an extra space between `\’ and `mid’! (I also noticed several occurrences of the unavoidable “the the” typo in the last chapters.) The simulation from a bivariate Normal distribution is hidden behind a customised R function sim_binom() when it could have been easily described as a two-stage hierarchy. And no comment on the fact that a sample from Y-1.5X could be directly derived from the joint sample. (Too unconscious a statistician?)
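For what it is worth, the two-stage description takes a few lines, here as a sketch in Python rather than R, with made-up means, standard deviations and correlation since I am not reproducing the values of the book: X is simulated from its marginal, Y from its conditional given X, and the sample from Y-1.5X follows from the joint sample at no extra cost.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical parameters, chosen for illustration only
mu_x, mu_y, s_x, s_y, rho = 17.0, 23.0, 2.0, 3.0, 0.4
n = 100_000

# Stage 1: X from its marginal Normal distribution
x = rng.normal(mu_x, s_x, size=n)
# Stage 2: Y | X = x is Normal with shifted mean and reduced variance
cond_mean = mu_y + rho * s_y / s_x * (x - mu_x)
cond_sd = s_y * np.sqrt(1 - rho**2)
y = rng.normal(cond_mean, cond_sd)

# Any function of (X, Y) is then available from the joint sample
d = y - 1.5 * x
prob_positive = (d > 0).mean()            # Monte Carlo estimate of P(Y > 1.5 X)
```

The empirical correlation of `x` and `y` and the mean of `d` can be checked against ρ and μ_Y-1.5μ_X, respectively.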

When moving to Bayesian inference, a large section is spent on very simple models like estimating a proportion or a mean, covering both discrete and continuous priors. And strongly focusing on conjugate priors despite giving warnings that they do not necessarily reflect prior information or prior belief. With some debatable recommendation for “large” prior variances as weakly informative or (worse) for Exp(1) as a reference prior for sample precision in the linear model (p.415). But also covering Bayesian model checking either via prior predictive (hence Bayes factors) or posterior predictive (with no mention of using the data twice). A very marginalia in introducing a sufficient statistic for the Normal model. In the Normal model checking section, an estimate of the posterior density of the mean is used without (apparent) explanation.

“It is interesting to note the strong negative correlation in these parameters. If one assigned informative independent priors on and , these prior beliefs would be counter to the correlation between the two parameters observed in the data.”

For the same reasons of having to cut on mathematical validation and rigour, Chapter 9 on MCMC is not explaining why MCMC algorithms are converging outside of the finite state space case. The proposal in the algorithmic representation is chosen as a Uniform one, since larger dimension problems are handled by either Gibbs or JAGS. The recommendations about running MCMC do not include how many iterations one “should” run (or other common queries on Stack eXchange), albeit they do include the sensible running of multiple chains and comparing simulated predictive samples with the actual data as a model check. However, the MCMC chapter very quickly and inevitably turns into commented JAGS code. Which I presume would require more from the students than just reading the available code. Like the JAGS manual. Chapter 10 is mostly a series of examples of Bayesian hierarchical modeling, with illustrations of the shrinkage effect like the one on the book cover. Chapter 11 covers simple linear regression with some mentions of weakly informative priors, although in a BUGS spirit of using large [enough?!] variances: “If one has little information about the location of a regression parameter, then the choice of the prior guess is not that important and one chooses a large value for the prior standard deviation . So the regression intercept and slope are each assigned a Normal prior with a mean of 0 and standard deviation equal to the large value of 100.” (p.415). Regardless of the scale of y? Standardisation is covered later in the chapter (with the use of the R function scale()) as part of constructing more informative priors, although this sounds more like data-dependent priors to me in the sense that the scale and location are summarily estimated by empirical means from the data.
The above quote also strikes me as potentially confusing to the students, as it does not spell out at all how to design a joint distribution on the linear regression coefficients that translates the concentration of these coefficients along y̅=β⁰+β¹x̄. Chapter 12 expands the setting to multiple regression and generalised linear models, mostly consisting of examples. It however suggests using cross-validation for model checking and then advocates DIC (deviance information criterion) as “to approximate a model’s out-of-sample predictive performance” (p.463). If only because it is covered in JAGS, the definition of the criterion being relegated to the last page of the book. Chapter 13 concludes with two case studies, the (often used) Federalist Papers analysis and a baseball career hierarchical model. Which may sound far-reaching considering the modest prerequisites the book started with.
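The concentration of the coefficients along that line is easy to exhibit, as in the following sketch, which uses simulated data and a flat prior with known error variance as simplifying assumptions (none of this comes from the book). In that setting the posterior covariance of (β⁰,β¹) is proportional to (XᵀX)⁻¹, so the correlation between intercept and slope is -x̄ divided by the root mean square of x, close to -1 when the covariate is far from zero and exactly zero once the covariate is centred.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(10, 20, size=n)                  # covariate far from zero
y = 1.0 + 0.5 * x + rng.normal(0, 1, size=n)     # hypothetical data

def coef_correlation(x):
    """Correlation of (intercept, slope) read off (X'X)^{-1}, i.e. under a
    flat prior and a known error variance (assumptions of this sketch)."""
    X = np.column_stack([np.ones_like(x), x])
    cov = np.linalg.inv(X.T @ X)
    return cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])

raw = coef_correlation(x)                        # strongly negative
centred = coef_correlation(x - x.mean())         # zero after centring

# The least squares fit goes through the point of means (x-bar, y-bar)
X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
gap = y.mean() - (beta[0] + beta[1] * x.mean())
```

Independent priors on the raw coefficients thus fight against a likelihood that is nearly degenerate along one direction, while independence is a harmless choice after centring.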

In conclusion of this rambling [lazy Sunday] review, this is not a textbook I would have the opportunity to use in Paris-Dauphine but I can easily conceive its adoption for students with limited maths exposure. As such it offers a decent entry to the use of Bayesian modelling, supported by a specific software (JAGS), and rightly stresses the call to model checking and comparison with pseudo-observations. Provided the course is reinforced with a fair amount of computer labs and projects, the book can indeed succeed in properly introducing students to Bayesian thinking. Hopefully leading them to seek more advanced courses on the topic.

Update: Jim Albert sent me the following clarifications after this review got on-line:

Thanks for your review of our recent book. We had a particular audience in mind, specifically undergraduate American students with some calculus background who are taking their first course in probability and statistics. The traditional approach (which I took many years ago) teaches some probability one semester and then traditional inference (focusing on unbiasedness, sampling distributions, tests and confidence intervals) in the second semester. There didn’t appear to be any Bayesian books at that calculus-based undergraduate level and that motivated the writing of this book. Anyway, I think your comments were certainly fair and we’ve already made some additions to our errata list based on your comments.

[Disclaimer about potential self-plagiarism: this post or an edited version will eventually appear in my Books Review section in CHANCE. As appropriate for a book about Chance!]

estimating the marginal likelihood (or an information criterion)

Posted in Books, pictures, Statistics, University life on December 28, 2019 by xi'an

Toru Imai (from Kyoto University) arXived a paper last summer on what first looked like a novel approximation of the marginal likelihood. Based on the variance of thermodynamic integration. The starting argument is that there exists a power 0<t⁰<1 such that the expectation of the log-likelihood is equal to the standard log-marginal

\log m(x) = \mathbb{E}^{t^0}[ \log f(X|\theta) ]

when the expectation is under the posterior corresponding to the t⁰-powered likelihood (rather than the full likelihood). By an application of the mean value theorem. Watanabe’s (2013) WBIC replaces the optimum t⁰ with 1/log(n), n being the sample size. The issue in terms of computational statistics is of course that the error of WBIC (against the true log m(x)) is only characterised as an order of n.
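The existence of t⁰ can be checked on a toy case where everything is available in closed form. The sketch below is not from the paper: it assumes a Normal mean model with a N(0, τ²) prior, hypothetical data, and function names of my own. It verifies the thermodynamic integration identity (log m(x) is the integral over t in (0,1) of the power-posterior expectation of the log-likelihood), finds t⁰ by bisection since that expectation is increasing in t, and evaluates the same expectation at the WBIC temperature 1/log(n).

```python
import numpy as np

rng = np.random.default_rng(0)
n, tau2 = 50, 1.0                        # sample size and prior variance (assumed)
x = rng.normal(1.0, 1.0, size=n)         # hypothetical data, x_i ~ N(theta, 1)
xbar, ss = x.mean(), ((x - x.mean()) ** 2).sum()

def expected_loglik(t):
    """E_t[log f(x|theta)] under the power posterior, proportional to
    prior(theta) * likelihood(theta)^t, which is Normal in this model."""
    t = np.asarray(t, dtype=float)
    prec = 1.0 / tau2 + t * n
    mean, var = t * n * xbar / prec, 1.0 / prec
    return -0.5 * n * np.log(2 * np.pi) - 0.5 * (ss + n * ((xbar - mean) ** 2 + var))

# Exact log marginal likelihood, x ~ N(0, I + tau2 * 11')
log_m = (-0.5 * n * np.log(2 * np.pi) - 0.5 * np.log(1 + n * tau2)
         - 0.5 * ((x ** 2).sum() - tau2 * (n * xbar) ** 2 / (1 + n * tau2)))

# Thermodynamic integration by the trapezoidal rule over t in [0, 1]
grid = np.linspace(0.0, 1.0, 20_001)
vals = expected_loglik(grid)
ti = np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(grid))

# Bisection for the power t0 such that E_t0[log f] = log m(x)
lo, hi = 0.0, 1.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if expected_loglik(mid) < log_m:
        lo = mid
    else:
        hi = mid
t0 = 0.5 * (lo + hi)

wbic_style = expected_loglik(1.0 / np.log(n))   # same expectation at t = 1/log(n)
```

Comparing `t0` with `1/np.log(n)` and `wbic_style` with `log_m` gives a feeling for the error incurred in this (regular) example, with no claim of generality.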

The second part of the paper is rather obscure to me, as the motivation for the real log canonical threshold is missing, even though the quantity is connected with the power likelihood. And the DIC effective dimension. It then goes on to propose a new approximation of sBIC, where s stands for singular, of Drton and Plummer (2017) which I had missed (and may ask my colleague Martin later today at Warwick!). Quickly reading through the latter however brings explanations about the real log canonical threshold being simply the effective dimension in Schwarz’s BIC approximation to the log marginal,

\log m(x) \approx \log f(x|\hat{\theta}_n) - \lambda \log n +(m-1)\log\log n

(as derived by Watanabe), where m is called the multiplicity of the real log canonical threshold. Both λ and m being unknown, Drton and Plummer (2017) estimate the above approximation in a Bayesian fashion, which leads to a double indexed marginal approximation for a collection of models. Since this thread leads me further and further from a numerical resolution of the marginal estimation, but brings in a different perspective on mixture Bayesian estimation, I will return to this in a later post. The paper of Imai discusses a different numerical approximation to sBIC, with a potential improvement in computing sBIC. (The paper was proposed as a poster to BayesComp 2020, so I am looking forward to discussing it with the author.)
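To fix ideas on the displayed approximation, here is a small sketch with hypothetical function names and made-up numerical inputs. It only evaluates the formula for given λ and m (which are unknown in practice, hence the need for the sBIC machinery) and checks that a regular model of dimension d, for which λ=d/2 and m=1, brings back Schwarz's BIC.

```python
import math

def singular_bic(max_loglik, n, lam, mult=1):
    """Approximation log f(x|theta_hat) - lam*log(n) + (mult-1)*log(log(n)),
    with lam the real log canonical threshold and mult its multiplicity."""
    return max_loglik - lam * math.log(n) + (mult - 1) * math.log(math.log(n))

def schwarz_bic(max_loglik, n, dim):
    """Schwarz's approximation: maximised log-likelihood minus (dim/2)*log(n)."""
    return max_loglik - 0.5 * dim * math.log(n)

# Made-up values for illustration: maximised log-likelihood -120, n = 500, d = 3
regular = singular_bic(-120.0, 500, lam=1.5, mult=1)
bic = schwarz_bic(-120.0, 500, dim=3)
singular = singular_bic(-120.0, 500, lam=1.0, mult=2)   # smaller threshold
```

A threshold λ below d/2, as in singular models, translates into a lighter penalty than the one of BIC.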

over-confident about mis-specified models?

Posted in Books, pictures, Statistics, University life on April 30, 2019 by xi'an

Ziheng Yang and Tianqi Zhu published a paper in PNAS last year that criticises Bayesian posterior probabilities used in the comparison of models under misspecification as “overconfident”. The paper is written from a phylogeneticist point of view, rather than from a statistician’s perspective, as shown by the Editor in charge of the paper [although I thought that, after Steve Fienberg’s intervention!, a statistician had to be involved in a submission relying on statistics!], but the analysis is rather problematic, at least seen through my own lenses… With no statistical novelty, apart from looking at the distribution of posterior probabilities in toy examples. The starting argument is that Bayesian model comparison is often reporting posterior probabilities in favour of a particular model that are close or even equal to 1.

“The Bayesian method is widely used to estimate species phylogenies using molecular sequence data. While it has long been noted to produce spuriously high posterior probabilities for trees or clades, the precise reasons for this over confidence are unknown. Here we characterize the behavior of Bayesian model selection when the compared models are misspecified and demonstrate that when the models are nearly equally wrong, the method exhibits unpleasant polarized behaviors, supporting one model with high confidence while rejecting others. This provides an explanation for the empirical observation of spuriously high posterior probabilities in molecular phylogenetics.”

The paper focuses on the behaviour of posterior probabilities to strongly support a model against others when the sample size is large enough, “even when” all models are wrong, the argument being apparently that the correct output should be one of equal probability between models, or maybe a uniform distribution of these model probabilities over the probability simplex. Why should it be so?! The construction of the posterior probabilities is based on a meta-model that assumes the generating model to be part of a list of mutually exclusive models. It does not account for cases where “all models are wrong” or cases where “all models are right”. The reported probability is furthermore epistemic, in that it is relative to the measure defined by the prior modelling, not to a promise of a frequentist stabilisation in an ill-defined asymptotia. By which I mean that a 99.3% probability of model M¹ being “true” does not have a universal and objective meaning. (Moderation note: the high polarisation of posterior probabilities was instrumental in our investigation of model choice with ABC tools and in proposing instead error rates in ABC random forests.)
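The polarisation is straightforward to reproduce in a toy setting of my own choosing (not necessarily one of the examples in the PNAS paper): data from N(0,1), two parameter-free models N(-0.5,1) and N(0.5,1) at the same Kullback-Leibler distance from the truth, and equal prior weights. The log Bayes factor is then minus the sum of the observations, hence a N(0,n) variate, and the posterior probability of either model drifts towards 0 or 1 as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def posterior_prob_m1(n, reps=500):
    """Posterior probability of M1: N(-0.5, 1) against M2: N(0.5, 1), with equal
    prior weights, over replicated samples actually drawn from N(0, 1)."""
    x = rng.normal(0.0, 1.0, size=(reps, n))
    log_bf = -x.sum(axis=1)                       # log f1(x) - log f2(x)
    return np.exp(-np.logaddexp(0.0, -log_bf))    # 1 / (1 + exp(-log_bf)), stably

frac = {}
for n in (10, 100, 2000):
    p = posterior_prob_m1(n)
    # share of replicates where neither model gets more than 99% support
    frac[n] = np.mean((p > 0.01) & (p < 0.99))
```

The share of undecided replicates in `frac` shrinks with n, which is nothing but the random walk behaviour of the log Bayes factor and says little about either model being “true”.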

The notion that two models are equally wrong because they are both exactly at the same Kullback-Leibler distance from the generating process (when optimised over the parameter) is such a formal [or cartoonesque] notion that it does not make much sense. There is always one model that is slightly closer and eventually takes over. It is also bizarre that the argument does not account for the complexity of each model and the resulting (Occam’s razor) penalty. Even two models with a single parameter are not necessarily of intrinsic dimension one, as shown by DIC. And thus it is not a surprise if the posterior probability mostly favours one versus the other. In any case, a healthily sceptical approach to Bayesian model choice means looking at the behaviour of the procedure (Bayes factor, posterior probability, posterior predictive, mixture weight, &tc.) under various assumptions (model M¹, M², &tc.) to calibrate the numerical value, rather than taking it at face value. By which I do not mean a frequentist evaluation of this procedure. Actually, it is rather surprising that the authors of the PNAS paper do not jump on the case when the posterior probability of model M¹ say is uniformly distributed, since this would be a perfect setting when the posterior probability is a p-value. (This is also what happens to the bootstrapped version, see the last paragraph of the paper on p.1859, the year Darwin published his Origin of Species.)