## mathematical theory of Bayesian statistics [book review]

**I** came by chance (and not by CHANCE) upon this 2018 CRC Press book by Sumio Watanabe and ordered it myself to find out which material it really covered, as the back-cover blurb was not particularly clear and the title sounded quite general. After reading it, I found out that this is a mathematical treatise on some aspects of Bayesian information criteria, in particular on the Widely Applicable Information Criterion (WAIC) introduced by the author in 2010. The result is a rather technical and highly focussed book, with little motivation or intuition surrounding the mathematical results, which may make the reading arduous. Some background in mathematical statistics and Bayesian inference is clearly preferable, and the book cannot be used as a textbook for most audiences, as opposed to, e.g., An Introduction to Bayesian Analysis by J.K. Ghosh et al., or even more so to Principles of Uncertainty by J. Kadane. In connection with this remark, the exercises found in the book are closer to the delivery of additional material than to textbook-style exercises.

“posterior distributions are often far from any normal distribution, showing that Bayesian estimation gives the more accurate inference than other estimation methods.”

The overall setting is one where both the sampling and the prior distributions differ from the respective “true” distributions, which calls for a tool to assess the discrepancy when utilising a specific pair of such distributions, especially when the posterior distribution cannot be approximated by a Normal distribution. (Lindley’s paradox makes an interesting *incognito* incursion on p.238.) The WAIC is supported for the determination of the “true” model, in opposition to AIC and DIC, including on a mixture example that reminded me of our eight versions of DIC paper. In the “Basic Bayesian Theory” chapter (§3), the “basic theorem of Bayesian statistics” (p.85) states that the various losses related with WAIC can be expressed as second-order Taylor expansions of some cumulant generating functions, with order o(n⁻¹), “even if the posterior distribution cannot be approximated by any normal distribution” (p.87), with the intuition that

“if a log density ratio function has a relatively finite variance then the generalization loss, the cross validation loss, the training loss and WAIC have the same asymptotic behaviors.”

Obviously, these “basic” aspects should come as a surprise to a fair percentage of Bayesians (in the sense of not being particularly *basic*), myself included. Chapter 4 exposes why, for regular models, the posterior distribution accumulates in an ε neighbourhood of the optimal parameter at a speed O(n^{2/5}), with the normalised partition function being of order n^{-d/2} in the neighbourhood and exponentially negligible outside it. A consequence of this regular asymptotic theory is that all the above losses are asymptotically equivalent to the negative log-likelihood, plus similar order-n⁻¹ terms that can be ordered. Chapters 5 and 6 deal with “standard” posterior distributions [for which the likelihood ratio is a multi-index power of the parameter ω] and with general posterior distributions that can be written as mixtures of standard distributions, with expressions of the above losses in terms of new universal constants. Again, a rather remote concern of mine. The book also includes a chapter (§7) on MCMC, with a rather involved proof that a Metropolis algorithm satisfies detailed balance (p.210). The Gibbs sampling section contains an extensive example on a two-dimensional, two-component, unit-variance Normal mixture, with an unusual perspective on the posterior, which is considered “singular” when the true means are close. (Label switching, or the absence thereof, is not mentioned.) In terms of approximating the normalising constant (or free energy), the only method discussed there is path sampling, with a cryptic remark about harmonic mean estimators (not identified as such). In a final knapsack chapter (§9), Bayes factors (confusingly denoted as L(x)) are shown to be most powerful tests in a Bayesian sense when comparing hypotheses without prior weights on said hypotheses, while posterior probability ratios are the natural statistics for comparing models with prior weights on said models.
(With Lindley’s paradox making another appearance, still *incognito*!) And a notion of *phase transition* for hyperparameters is introduced, meaning a radical change of behaviour at a critical value of said hyperparameter. For instance, for a simple normal-mixture outlier model, the critical value of the Beta hyperparameter is α=2. Which is a wee bit of a surprise when considering Rousseau and Mengersen (2011), since their bound for consistency was α=d/2.
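The asymptotic equivalence of the training, cross-validation, and WAIC losses discussed above can be checked numerically on a toy regular model. The sketch below is my own construction, not an example from the book: a Normal mean with a flat prior, so the posterior is exact and can be sampled directly; WAIC and the importance-sampling leave-one-out loss follow the standard definitions.

```python
import numpy as np

# Toy data and model (not from the book): x_i ~ N(0.5, 1), fitted model
# N(mu, 1) with a flat prior on mu, so the posterior of mu is exactly
# N(xbar, 1/n) and can be sampled without MCMC.
rng = np.random.default_rng(0)
n, S = 200, 5000
x = rng.normal(0.5, 1.0, size=n)
mu = rng.normal(x.mean(), 1 / np.sqrt(n), size=S)      # posterior draws

# Matrix of log densities log p(x_i | mu_s), shape (S, n)
ll = -0.5 * np.log(2 * np.pi) - 0.5 * (x[None, :] - mu[:, None]) ** 2

# Training loss: T_n = -(1/n) sum_i log( (1/S) sum_s p(x_i | mu_s) )
T = -np.mean(np.log(np.exp(ll).mean(axis=0)))

# Functional variance term: (1/n) sum_i Var_s[ log p(x_i | mu_s) ]
V = np.mean(ll.var(axis=0, ddof=1))

waic = T + V                                           # WAIC on the loss scale

# Leave-one-out CV loss, using the exact importance weights
# 1 / p(x_i | mu_s) to remove observation x_i from the posterior:
cv = np.mean(np.log(np.mean(np.exp(-ll), axis=0)))

print(waic, cv)   # the two losses agree closely, as the theory predicts
```

On this regular model the two criteria differ only by Monte Carlo noise and higher-order terms, while the training loss T alone sits slightly below them, in line with the bias correction V.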

In conclusion, this is quite an original perspective on Bayesian models, covering the somewhat unusual (and potentially controversial) issue of misspecified priors, and centered on the use of information criteria. I find the book could have benefited from further editing, as I noticed many typos and somewhat unusual sentences (at least unusual to me).

*[Disclaimer about potential self-plagiarism: this post or an edited version should eventually appear in my Books Review section in CHANCE.]*

May 6, 2021 at 9:34 am

I’m the author of the book (Sumio Watanabe). You have misread the book. Lindley’s paradox is not a paradox; please read p.270-277 once more. A Bayesian hypothesis test is a different procedure from Bayesian model comparison. They are explained in Sections 9.2 and 9.3, respectively, and the difference is illustrated in Example 64.

May 6, 2021 at 5:09 pm

Interestingly, you differentiate between hypothesis testing and model choice through essentially not setting prior probabilities on the hypotheses and setting prior probabilities on the models. Which makes the Bayes factor only adequate in the second situation, if I am not confused.

May 6, 2021 at 5:47 pm

Thank you for your question.

(1) In a hypothesis test, we prepare the Null (prior_0, model_0) and the Alternative (prior_1, model_1). The ratio of the two marginal likelihoods is then equal to the statistic of the most powerful test, and we can determine the rejection region for a given level, based on the assumption that the sample is generated from the Null.

(2) In Bayesian model comparison, the two pairs (prior_0, model_0) and (prior_1, model_1) are compared through their posterior probabilities, using a sample and prior weights on the pairs. This also results in the analysis of the ratio of the two marginal likelihoods.

Both the test and the comparison result in the ratio of marginal likelihoods; however, the determined regions are different. An example is given in Example 64.
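As a toy numerical sketch of this two-step use of the marginal-likelihood ratio (the Gaussian pair below is purely illustrative and is not the book’s Example 64): Null x_i ~ N(0,1) against the Alternative x_i ~ N(μ,1) with prior μ ~ N(0,1), both marginal likelihoods being available in closed form.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_m0(x):
    # Null marginal: x_i ~ N(0,1), no free parameter
    return -0.5 * x.size * np.log(2 * np.pi) - 0.5 * np.sum(x ** 2)

def log_m1(x):
    # Alternative marginal: x_i ~ N(mu,1) with mu ~ N(0,1), closed form
    n, S, xb = x.size, np.sum(x ** 2), x.mean()
    return (-0.5 * n * np.log(2 * np.pi) - 0.5 * np.log(n + 1)
            - 0.5 * (S - n ** 2 * xb ** 2 / (n + 1)))

def bf01(x):
    # Ratio of the two marginal likelihoods, m0 / m1
    return np.exp(log_m0(x) - log_m1(x))

n, level = 30, 0.05

# (1) Hypothesis test: calibrate a rejection region under the Null,
#     i.e. reject the Null when bf01 falls below its level-5% quantile
#     computed from samples generated under the Null.
b_null = np.array([bf01(rng.normal(0, 1, n)) for _ in range(2000)])
threshold = np.quantile(b_null, level)

# (2) Model comparison: with equal prior weights 1/2 on both pairs, the
#     posterior probability of the Null follows from the same ratio.
x = rng.normal(0.4, 1, n)
post_prob_null = bf01(x) / (1 + bf01(x))
```

The same statistic bf01 drives both procedures, but the regions differ: the test compares it to a Null-calibrated threshold, while the comparison converts it directly into posterior odds.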

May 6, 2021 at 9:17 am

I’m the author of the book (Sumio Watanabe). You have misread the book. The phase transition structure and the critical point strongly depend on the statistical model, and the model in Example 67 is different from that in Rousseau and Mengersen (2011).

May 6, 2021 at 4:50 pm

Thanks, I was looking at the general location Normal mixture on p.282, in the case when there is a single Normal component, which corresponds to case (2) in the discussion. This would be closer to the case covered by Rousseau & Mengersen (2011), wouldn’t it?

May 6, 2021 at 7:04 pm

Thank you again for your reading.

The model on p.282 is the same as in Rousseau & Mengersen (2011); however, on p.282 we study another phase transition, caused by the increasing sample size n. Assume that the true distribution is a mixture of two nearby normal distributions. If n is small, the posterior is almost the same as in the case where the true distribution is a single normal distribution. As n becomes large, the posterior moves to the case where the true distribution is a mixture of two distributions. Hence a phase transition is caused by increasing n, which is illustrated in Fig.9.6. It can be observed through the generalization loss, cross validation, and WAIC.

In this model (a mixture of two free normal distributions), the critical point with respect to the index a of the Dirichlet prior is still unknown. However, it was proved by Takumi Watanabe in (1) that the critical point of a mixture of two L-dimensional multinomial distributions is a=(L-1)/2. Since the parameter dimension of one multinomial distribution is (L-1), this result is formally equal to the consistency condition of Rousseau & Mengersen (2011). Takumi also proved that the real log canonical threshold is (L-1)/2+min(a/2,(L-1)/4), from which the asymptotic free energy and the generalization error also follow.

(1) Takumi Watanabe, Asymptotic Behavior of Bayesian Generalization Error in Multinomial Mixtures. IEICE Technical report, IBISML2019-18, pp. 1-8, 2020.

May 6, 2021 at 7:15 pm

Thanks again for taking the trouble to reply to my questions! I remain however puzzled because Rousseau & Mengersen show that, as n goes to infinity, “quite generally the posterior distribution has a stable and interesting behaviour, since it tends to empty the extra component” when the Dirichlet weight is smaller than d/2.

May 7, 2021 at 8:06 am

Actually, when discussing with Judith Rousseau, she pointed out to me that the result in their 2011 paper does not apply to the location mixture.

May 7, 2021 at 1:19 am

Thank you for your interest in singular cases.

This is the answer to your comment of May 6, 2021 at 7:15 pm.

Let a sample be generated from one L-dimensional multinomial distribution M(x), and let the statistical model be p(x)=(1-a)M(x-b)+aM(x-c). Then M(x)=p(x) if and only if a=0 or 1, or b=c=0. This is a singular case. If the index of the Dirichlet prior is >(L-1)/2, then the posterior becomes a=free, (b,c)~(0,0). If the index is <(L-1)/2, then the posterior becomes a~0 or a~1, with b,c=free. Both cases are rather stable. If the index is =(L-1)/2 (the critical point), then a~0 or a~1 and b,c~0, but the posterior is unstable (convergence of MCMC becomes very slow). We expect that a normal mixture has almost the same behavior. The generalization loss depends on the index of the prior. In a numerical experiment, the critical point can be found by cross validation or WAIC; at the critical point, their variances become very large.

May 7, 2021 at 9:50 am

Thank you for your comment. This is the answer to your comment of May 7, 2021 at 8:06 am.

I cannot determine the statistical model you mention there, hence I cannot say anything about your comment. However, we have at least the following theoretical and numerical results.

If the statistical model is given by p(x)=aN(x|0)+(1-a)N(x|b), where the dimension of b is 2, then its real log canonical threshold and a numerical experiment are given in the book, Example 67.

In Takumi’s multinomial paper (1), with p(x)=aM(x|b)+(1-a)M(x|c), the theoretical result on the real log canonical threshold is derived using resolution of singularities, and it coincided with the numerical experiments.

For a general normal mixture with a prior that is positive and finite (Dirichlet index = 1), the real log canonical threshold was studied in Yamazaki’s pioneering work (2).

In the book, Example 52 and Fig.7.2 on p.222 show that the posterior of a normal mixture also has a phase transition according to the locations of the true distributions.

A model selection phenomenon in a normal mixture is shown in Fig.2.8 on p.58 (DIC fails). Numerical experiments on model selection phenomena with different Dirichlet indices are shown in Tables 1 and 2 of (3).

(1) Takumi Watanabe, Asymptotic Behavior of Bayesian Generalization Error in Multinomial Mixtures. IEICE Technical report, IBISML2019-18, pp. 1-8, 2020.

(2) Keisuke Yamazaki, et al., Singularities in mixture models and upper bounds of stochastic complexity, Neural Networks, Vol. 16, pp. 1029-1038, 2003.

(3) S. Watanabe, WAIC and WBIC for mixture models. Behaviormetrika vol. 48, pp.5–21, 2021.

May 6, 2021 at 8:54 am

I’m the author of the book (Sumio Watanabe). You have misread the book. Please read p.85-87 once more. The cross validation (CV) and WAIC are asymptotically equivalent as random variables; however, the training loss is not. The averages of the generalization loss, CV, and WAIC are asymptotically equivalent; however, they are not equivalent as random variables.

May 6, 2021 at 3:58 pm

Concerning the three expressions at the bottom of page 88, is there a typo in C_n? I would have expected a minus sign in front of the sum, to be coherent with the other terms.

May 6, 2021 at 4:21 pm

Thank you again for your reading.

C_n on p.88 is derived from (3.15), (3.17), and (3.22).

May 6, 2021 at 4:36 pm

Thank you, I can now see the minus (-) in the earlier equations leading to a plus (+) there.