## Bayesian astrostats under Laplace’s gaze

Posted in Books, Kids, pictures, Statistics, Travel, University life, Wines on October 11, 2016 by xi'an

This afternoon, I was part of the jury of an astrostatistics thesis, where the astronomy part was about binary objects in the Solar System, and the statistics part about detecting patterns in those objects, unsurprisingly. The first part was highly classical, using several non-parametric tests like Kolmogorov-Smirnov to test whether those binary objects were different from single objects. While the p-values were very tiny, I felt these values were over-interpreted in the thesis, because the sample size of N=30 leads to some scepticism about numerical quantities like 0.0008. While I do not want to sound as if I am pushing for Bayesian solutions in every setting, this case is a good illustration of the nefarious power of p-values, which are almost always taken at face value, i.e., where 0.0008 is understood in terms of the null hypothesis and not in terms of the observed realisation of the p-value. Even within a frequentist framework, the distribution of this p-value should be evaluated or estimated one way or another, as there is no reason to believe it is anywhere near a Uniform(0,1) distribution.

The second part of the thesis was about the estimation of some parameters of the laws of the orbits of those dual objects, and the point of interest for me was the purely mechanical construction of a likelihood function that was an exponential transform of a sum of residuals, made of squared differences between the observations and their expectations. Or a power of such differences. This was called the “statistical model” in the thesis and, I presume, in part of the astrostats literature. This reminded me of the first meeting I had with my colleagues from Besançon, where they could not use such mechanical versions because of intractable expectations and instead used simulations from their physical model, literally reinventing ABC. This resolution had the same feeling, closer to indirect inference than regular inference, although it took me half the defence to realise it.
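To make the remark on the variability of the p-value concrete, here is a minimal simulation sketch. Everything in it is an assumption of mine and not taken from the thesis: two groups of 15 Normal observations (for a total of N=30), a hypothetical location shift of 1.5 between the groups, and the asymptotic Kolmogorov series for the p-value (which is only approximate at this sample size). It simply shows how widely the realised p-value can vary across replications of the same experiment.

```python
import math
import random

def ks_two_sample(x, y):
    """Two-sample Kolmogorov-Smirnov statistic D = sup |F_x - F_y|."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        v = min(xs[i], ys[j])
        # step both empirical cdfs past the current value v
        while i < n and xs[i] <= v:
            i += 1
        while j < m and ys[j] <= v:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d

def ks_pvalue(d, n, m, terms=100):
    """Asymptotic p-value from the Kolmogorov series (approximate for small samples)."""
    if d <= 0.0:
        return 1.0
    ne = n * m / (n + m)  # effective sample size
    lam = (math.sqrt(ne) + 0.12 + 0.11 / math.sqrt(ne)) * d
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
            for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * s))

def simulate_pvalues(shift, n=15, m=15, reps=2000, seed=1):
    """Sorted p-values over replications of a hypothetical two-group experiment."""
    rng = random.Random(seed)
    out = []
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = [rng.gauss(shift, 1.0) for _ in range(m)]
        out.append(ks_pvalue(ks_two_sample(x, y), n, m))
    return sorted(out)

pvals = simulate_pvalues(shift=1.5)  # the shift of 1.5 is purely illustrative
for q in (0.05, 0.5, 0.95):
    print(q, pvals[int(q * (len(pvals) - 1))])
```

Comparing the 5% and 95% quantiles of the simulated p-values gives a rough idea of how much weight a single reported value can bear.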

The defence actually took place in the beautiful historical Perrault building of the Observatoire de Paris, in downtown Paris, where Cassini, Arago and Le Verrier once ruled! It was held in the council room, under paintings of major French astronomers, including Laplace himself, looking quite smug in his academician costume. The building is built around the Paris Zero Meridian (which got dethroned in 1911 by the Greenwich Zero Meridian, which I contemplated as a kid since my childhood church had the Greenwich meridian drawn on the nave stones). The customary “pot” after the thesis and its validation by the jury was in the less historical cafeteria of the Observatoire, but it included a jazz big band, which made this thesis defence quite unique in many ways!

## a Bayesian criterion for singular models [discussion]

Posted in Books, Statistics, University life on October 10, 2016 by xi'an

[Here is the discussion Judith Rousseau and I wrote about the paper by Mathias Drton and Martyn Plummer, a Bayesian criterion for singular models, which was discussed last week at the Royal Statistical Society. There is still time to send a written discussion! Note: This post was written using the latex2wp converter.]

It is a well-known fact that the BIC approximation of the marginal likelihood in a given irregular model ${\mathcal M_k}$ fails or may fail. The BIC approximation has the form

$\displaystyle BIC_k = \log p(\mathbf Y_n| \hat \pi_k, \mathcal M_k) - d_k \log n /2$

where ${d_k }$ corresponds to the number of parameters to be estimated in model ${\mathcal M_k}$. In irregular models the dimension ${d_k}$ typically does not provide a good measure of complexity for model ${\mathcal M_k}$, at least in the sense that it does not lead to an approximation of

$\displaystyle \log m(\mathbf Y_n |\mathcal M_k) = \log \left( \int_{\mathcal M_k} p(\mathbf Y_n| \pi_k, \mathcal M_k) dP(\pi_k|k )\right) \,.$
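For contrast with the irregular case, the following sketch compares the BIC form above with an exact log marginal likelihood in a regular toy model. The model is an assumption chosen for tractability and does not come from the paper: observations that are N(θ,1) with a N(0,τ²) prior on θ, so that the marginal is available in closed form. In that setting the gap between the two quantities stays bounded as n grows.

```python
import math
import random

def log_lik_at_mle(y):
    """Log-likelihood of a N(theta, 1) sample evaluated at the MLE (the sample mean)."""
    n = len(y)
    ybar = sum(y) / n
    return -0.5 * n * math.log(2 * math.pi) - 0.5 * sum((v - ybar) ** 2 for v in y)

def bic(y, d=1):
    """BIC in the form used above: maximised log-likelihood minus d log(n) / 2."""
    return log_lik_at_mle(y) - 0.5 * d * math.log(len(y))

def log_marginal(y, tau2=1.0):
    """Exact log marginal likelihood under the assumed N(0, tau2) prior on theta."""
    n = len(y)
    s = sum(y)
    ss = sum(v * v for v in y)
    # y ~ N(0, I + tau2 * 11'), with determinant 1 + n * tau2
    quad = ss - tau2 * s * s / (1 + n * tau2)
    return -0.5 * n * math.log(2 * math.pi) - 0.5 * math.log(1 + n * tau2) - 0.5 * quad

rng = random.Random(0)
for n in (10, 100, 1000, 10000):
    y = [rng.gauss(0.5, 1.0) for _ in range(n)]  # hypothetical true mean of 0.5
    print(n, round(bic(y), 3), round(log_marginal(y), 3),
          round(log_marginal(y) - bic(y), 3))
```

The last printed column is the difference between the exact log marginal and BIC, which is the term that remains of order one in a regular model.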

A way to understand the behaviour of ${\log m(\mathbf Y_n |\mathcal M_k) }$ is through the effective dimension

$\displaystyle \tilde d_k = -\lim_n \frac{ \log P( \{ KL(p(\mathbf Y_n| \pi_0, \mathcal M_k) , p(\mathbf Y_n| \pi_k, \mathcal M_k) ) \leq 1/n \} | k ) }{ \log n}$

when it exists, see for instance the discussions in Chambaz and Rousseau (2008) and Rousseau (2007). Watanabe (2009) provided a more precise formula, which is the starting point of the approach of Drton and Plummer:

$\displaystyle \log m(\mathbf Y_n |\mathcal M_k) = \log p(\mathbf Y_n| \hat \pi_k, \mathcal M_k) - \lambda_k(\pi_0) \log n + [m_k(\pi_0) - 1] \log \log n + O_p(1)$

where ${\pi_0}$ is the true parameter. The authors propose a clever algorithm to approximate the marginal likelihood. Given the popularity of the BIC criterion for model choice, obtaining a relevant penalized likelihood when the models are singular is an important issue and we congratulate the authors for it. Indeed a major advantage of the BIC formula is that it is an off-the-shelf criterion which is implemented in many software packages, and thus can be used easily by non-statisticians. In the context of singular models, a more refined approach needs to be considered and, although the algorithm proposed by the authors remains quite simple, it requires that the functions ${ \lambda_k(\pi)}$ and ${m_k(\pi)}$ be known in advance, which so far limits the number of problems that can be thus processed. In this regard their equation (3.2) is both puzzling and attractive. Attractive because it invokes nonparametric principles to estimate the underlying distribution; puzzling because why should we engage in deriving an approximation like (3.1) and call for Bayesian principles when (3.1) is at best an approximation? In this case why not just use a true marginal likelihood?
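As a sanity check on Watanabe's expansion above (a standard fact about regular models, stated here under the assumption that ${\mathcal M_k}$ is regular with ${d_k}$ free parameters), the learning coefficient and its multiplicity are then

$\displaystyle \lambda_k(\pi_0) = d_k/2\,, \qquad m_k(\pi_0) = 1\,,$

so that the ${\log \log n}$ term vanishes and the expansion reduces to

$\displaystyle \log m(\mathbf Y_n |\mathcal M_k) = \log p(\mathbf Y_n| \hat \pi_k, \mathcal M_k) - \frac{d_k}{2} \log n + O_p(1) = BIC_k + O_p(1)\,.$

In singular models ${\lambda_k(\pi_0)}$ can be strictly smaller than ${d_k/2}$, in which case the BIC penalty is too large.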

1. Why do we want to use a BIC type formula?

The BIC formula can be viewed from a purely frequentist perspective, as an example of penalised likelihood. The difficulty then lies in choosing the penalty, and a common view on these approaches is to choose the smallest possible penalty that still leads to consistency of the model choice procedure, since it then enjoys better separation rates. In this case a ${\log \log n}$ penalty is sufficient, as proved in Gassiat et al. (2013). Now whether or not this is a desirable property is entirely debatable, and one might advocate that for a given sample size, if the data fits the smallest model (almost) equally well, then this model should be chosen. But unless one specifies what equally well means, it does not add much to the debate. This also explains the popularity of the BIC formula (in regular models), since it approximates the marginal likelihood and thus benefits from the Bayesian justification of the measure of fit of a model for a given data set, often described as a Bayesian Ockham’s razor. But then why should we not compute the marginal likelihood instead? Typical answers to this question that are in favour of BIC-type formulas include: (1) BIC is supposedly easier to compute and (2) BIC does not call for a specification of the prior on the parameters within each model. Given that the latter is a difficult task and that the prior can be highly influential in non-regular models, this may sound like a good argument. However, it is only apparently so, since the only justification of BIC is purely asymptotic, namely, in such a regime the difficulties linked to the choice of the prior disappear. This is even more the case for the sBIC criterion, since it is only valid if the parameter space is compact. Then the impact of the prior becomes less of an issue as non-informative priors can typically be used.
With all due respect, the solution proposed by the authors, namely to use the posterior mean or the posterior mode to allow for non-compact parameter spaces, does not seem to make sense in this regard since they depend on the prior. The same comments apply to the authors’ discussion on how priors matter for sBIC. Indeed variations of the sBIC could be obtained by penalizing for bigger models via the prior on the weights, for instance as in Mengersen and Rousseau (2011), or by considering repulsive priors as in Petralia et al. (2012), but then it becomes more meaningful to (again) directly compute the marginal likelihood. What remains (as an argument in its favour) is the relative computational ease of use of sBIC, when compared with the marginal likelihood. This simplification is however achieved at the expense of requiring a deeper knowledge of the behaviour of the models, and it therefore loses the off-the-shelf appeal of the BIC formula and the range of applications of the method, at least so far. Although the dependence of the approximation of ${\log m(\mathbf Y_n |\mathcal M_k)}$ on ${\mathcal M_j }$, ${j \leq k}$, is strange, this does not seem crucial, since marginal likelihoods in themselves bring little information and they are only meaningful when compared to other marginal likelihoods. It becomes much more of an issue in the context of a large number of models.

2. Should we care so much about penalized or marginal likelihoods?

Marginal or penalized likelihoods are exploratory tools in a statistical analysis, as one is trying to define a reasonable model to fit the data. An unpleasant feature of these tools is that they provide numbers which in themselves do not have much meaning and can only be used in comparison with others and without any notion of uncertainty attached to them. A somewhat richer approach to exploratory analysis is to interrogate the posterior distributions by either varying the priors or by varying the loss functions. The former has been proposed in van Havre et al. (2016) in mixture models using the prior tempering algorithm. The latter has been used for instance by Yau and Holmes (2013) for segmentation based on Hidden Markov models. Introducing a decision-analytic perspective in the construction of information criteria sounds to us like a reasonable requirement, especially when accounting for the current surge in studies of such aspects.

[Posted as arXiv:1610.02503]

## Ted Benoît (1947-2016)

Posted in Books, Kids, pictures on October 9, 2016 by xi'an

While not the most famous of French comics artists, Ted Benoît was a significant contributor to the “ligne claire” school in the continuation of Hergé and Jacobs. His masterpiece is called Berceuses électriques (1982), published at a time when I regularly read comics magazines like A Suivre, l’Écho des Savanes or Métal Hurlant (besides Charlie). The story itself is surrealistic or just plainly irrelevant, while the dialogues and drawings are both brilliant, set in an America borrowed from the 1950’s Noir novels, plus a dose of cynicism from the 1980’s. A decade later, Benoît also contributed to the “Blake and Mortimer” series, after Jacobs’ death, drawing “L’Affaire Francis Blake” (1996) and “L’Étrange Rendez-vous” (2002). Both are impeccable graphical outcomes, if somewhat weak in the plots.

## speaker for the dead [book review]

Posted in Books, Kids, Travel on October 8, 2016 by xi'an

## improved convergence of regression-adjusted ABC

Posted in Books, Statistics on October 7, 2016 by xi'an

“These results highlight the tension in ABC between choices of the summary statistics and bandwidth that will lead to more accurate inferences when using the ABC posterior, against choices that will reduce the computational cost or Monte Carlo error of algorithms for sampling from the ABC posterior.”

Wentao Li and Paul Fearnhead have arXived a new paper on the asymptotics of ABC that shows the benefit of including the Beaumont et al. (2002) post-processing. This step modifies the simulated values of the parameter θ by a regression step bringing the value of the corresponding summary statistic closer to the observed value of that summary. Under some assumptions on the model and summary statistics, the regression step allows for a tolerance ε that is of order O(1/√n), n being the sample size, or even ε⁵=o(1/√n³), while keeping the Monte Carlo noise to a negligible level and improving the acceptance probability all the way to 1 as n grows, in the sense that the adjusted ABC estimator satisfies the same CLT as the true Bayes estimate (Theorem 3.1). As such this is a true improvement over our respective recent results (which both lead to Proposition 3.1 in the current paper), provided the implementation does not require a significant additional computing time (but I do not see why it should). Surprisingly the Monte Carlo effort (or sample size N) does not seem to matter at all if the tolerance is indeed of order O(1/√n), while I am under the impression that it should increase with n, as otherwise the Monte Carlo error dominates. Note also that the regression adjustment is linear here, instead of the local or non-parametric version of the original Beaumont et al. (2002).
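For readers unfamiliar with the regression step, here is a minimal sketch of rejection ABC followed by a linear adjustment. It is a toy under my own assumptions and not the authors' implementation: a N(θ,1) sample summarised by its mean, a N(0,3²) prior, and an unweighted least-squares fit on the accepted draws (rather than the local-linear weights of Beaumont et al.). The function name and its parameters are hypothetical.

```python
import math
import random
import statistics

def abc_with_regression(y_obs, eps, n_sims=20000, prior_sd=3.0, seed=0):
    """Rejection ABC for the mean of a N(theta, 1) sample, then a linear adjustment.

    Returns (raw, adjusted): the accepted draws before and after the regression step.
    """
    rng = random.Random(seed)
    n = len(y_obs)
    s_obs = sum(y_obs) / n                        # observed summary: the sample mean
    thetas, diffs = [], []
    for _ in range(n_sims):
        theta = rng.gauss(0.0, prior_sd)          # draw from the assumed prior
        s = rng.gauss(theta, 1.0 / math.sqrt(n))  # mean of n N(theta, 1) draws, simulated directly
        if abs(s - s_obs) <= eps:                 # keep draws whose summary is within eps
            thetas.append(theta)
            diffs.append(s - s_obs)
    if len(thetas) < 2:
        raise ValueError("too few accepted draws; increase eps or n_sims")
    # least-squares slope of theta on (s - s_obs) among the accepted draws
    t_bar = statistics.fmean(thetas)
    d_bar = statistics.fmean(diffs)
    sxy = sum((d - d_bar) * (t - t_bar) for d, t in zip(diffs, thetas))
    sxx = sum((d - d_bar) ** 2 for d in diffs)
    beta = sxy / sxx
    # move each accepted theta to where it would sit if its summary equalled s_obs
    adjusted = [t - beta * d for t, d in zip(thetas, diffs)]
    return thetas, adjusted

rng = random.Random(42)
y_obs = [rng.gauss(1.0, 1.0) for _ in range(50)]  # hypothetical data with true mean 1
raw, adj = abc_with_regression(y_obs, eps=0.5)
print(len(raw), statistics.pstdev(raw), statistics.pstdev(adj))
```

In this conjugate toy case the exact posterior is available, so the spread of the adjusted draws can be compared with the true posterior standard deviation, while the unadjusted draws are visibly over-dispersed for a tolerance this large.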

## inferential models: reasoning with uncertainty [book review]

Posted in Books, Statistics, University life on October 6, 2016 by xi'an

“the field of statistics (…) is still surprisingly underdeveloped (…) the subject lacks a solid theory for reasoning with uncertainty [and] there has been very little progress on the foundations of statistical inference” (p.xvi)

A book that starts with such massive assertions is certainly hoping to attract some degree of attention from the field and likely to induce strong reactions to this dismissal of the not inconsiderable amount of research dedicated so far to statistical inference and in particular to its foundations. Or even attracting flak for not accounting (in this introduction) for the past work of major statisticians, like Fisher, Kiefer, Lindley, Cox, Berger, Efron, Fraser and many many others… Judging from the references and the tone of this 254-page book, it seems like the two authors, Ryan Martin and Chuanhai Liu, truly aim at single-handedly resetting the foundations of statistics to their own tune, which sounds like a new kind of fiducial inference augmented with calibrated belief functions. Be warned that five chapters of this book are built on as many papers written by the authors in the past three years. Which makes me question, if I may, the relevance of publishing a book on a brand-new approach to statistics without further backup from a wider community.

“…it is possible to calibrate our belief probabilities for a common interpretation by intelligent minds.” (p.14)

Chapter 1 contains a description of the new perspective in Section 1.4.2, which I find useful to detail here. When given an observation x from a Normal N(θ,1) model, the authors rewrite X as θ+Z, with Z~N(0,1), as in fiducial inference, and then want to find a “meaningful prediction of Z independently of X”. This seems difficult to accept given that, once X=x is observed, Z=X-θ⁰, θ⁰ being the true value of θ, which belies the independence assumption. The next step is to replace Z~N(0,1) by a random set S(Z) containing Z and to define a belief function bel() on the parameter space Θ by

bel(A|X) = P(X-S(Z)⊆A)

which induces a pseudo-measure on Θ derived from the distribution of an independent Z, since X is already observed. When Z~N(0,1), this distribution does not depend on θ⁰ the true value of θ… The next step is to choose the belief function towards a proper frequentist coverage, in the approximate sense that the probability that bel(A|X) be more than 1-α is less than α when the [arbitrary] parameter θ is not in A. And conversely. This property (satisfied when bel(A|X) is uniform) is called validity or exact inference by the authors: in my opinion, restricted frequentist calibration would certainly sound more adequate.
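To see what the calibration property amounts to in the Normal example, here is a small sketch under an assumption that is mine and not spelled out above: the random set is the symmetric interval S = {z : |z| ≤ |Z'|} with Z' ~ N(0,1). Then X−S is the interval [x−|Z'|, x+|Z'|], and the belief in an interval A=[a,b] has a closed form. The simulation estimates how often bel(A|X) reaches 1−α when the true θ lies outside A.

```python
import math
import random

def phi(z):
    """Standard Normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def belief(x, a, b):
    """bel([a, b] | x) under the assumed random set S = {z : |z| <= |Z'|}, Z' ~ N(0, 1).

    [x - |Z'|, x + |Z'|] is inside [a, b] exactly when |Z'| <= min(x - a, b - x).
    """
    margin = min(x - a, b - x)
    if margin <= 0.0:
        return 0.0
    return 2.0 * phi(margin) - 1.0

def validity_rate(theta0, a, b, alpha=0.05, reps=20000, seed=2):
    """Frequency of bel(A | X) >= 1 - alpha when X ~ N(theta0, 1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        x = rng.gauss(theta0, 1.0)
        if belief(x, a, b) >= 1.0 - alpha:
            hits += 1
    return hits / reps

# hypothetical setting: true theta0 = 0 and the assertion A = [0.5, +inf) is false
print(validity_rate(0.0, 0.5, math.inf))
```

With this choice of random set, the estimated frequency should stay below α for any false assertion of this interval form, which is the frequentist calibration discussed above.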

“When there is no prior information available, [the philosophical justifications for Bayesian analysis] are less than fully convincing.” (p.30)

“Is it logical that an improper “ignorance” prior turns into a proper “non-ignorance” prior when combined with some incomplete information on the whereabouts of θ?” (p.44)

## Greek variations on power-expected-posterior priors

Posted in Books, Statistics, University life on October 5, 2016 by xi'an

Dimitris Fouskakis, Ioannis Ntzoufras and Konstantinos Perrakis, from Athens, have just arXived a paper on power-expected-posterior priors. Just like the power prior and the expected-posterior prior, this approach aims at avoiding improper priors by the use of imaginary data, whose distribution is itself the marginal against another prior. (In the papers I wrote on that topic with Juan Antonio Cano and Diego Salmerón, we used MCMC to figure out a fixed point for such priors.)

The current paper (which I only perused) studies properties of two versions of power-expected-posterior priors proposed in an earlier paper by the same authors, for the normal linear model. Both use a posterior derived from an unnormalised powered likelihood, either (DR) integrated in the imaginary data against the prior predictive distribution of the reference model based on the powered likelihood, or (CR) integrated in the imaginary data against the prior predictive distribution of the reference model based on the actual likelihood. The baseline model is the G-prior with g=n². Both versions lead to a marginal likelihood that is similar to BIC and hence consistent. The DR version coincides with the original power-expected-posterior prior in the linear case. The CR version involves a change of covariance matrix. All in all, the CR version tends to favour less complex models, but is less parsimonious as a variable selection tool, which sounds a wee bit contradictory. Overall, I thus feel (possibly incorrectly) that the paper is more an appendix to the earlier paper than a paper in itself, as I do not get in the end a clear impression of which method should be preferred.