## 21w5107 [day 1]

Posted in pictures, Statistics, Travel, University life with tags , , , , , , , , , , , , , , on November 30, 2021 by xi'an

The workshop started by the bad news of our friend Michele Guindani being hit and mugged upon arrival in Oaxaca, Saturday night. Fortunately, he was not hurt, but lost both phone and wallet, always a major bummer when abroad… Still this did not cast a lasting pall on the gathering of long-time no-see friends, whom I had indeed not seen for at least two years. Except for those who came to the CIRMirror!

A few hours later, we got woken up by fairly loud firecrackers (palomas? cohetes?) at 5am, for no reason I can fathom (the Mexican Revolution day was a week ago) although it seemed correlated with the nearby church bells going on at full blast (for Lauds? Hanukkah? Cyber Monday? Chirac’s birthdate?). The above picture was taken the Santa María del Tule town with its super-massive Montezuma cypress tree, with remaining decorations from the Día de los Muertos.

Without launching (much) the debate on whether or not Bayesian non-parametrics qualified as “objective Bayesian” methods, Igor Prünster started the day with a non-parametric presentation of dependent random probability measures. With the always fascinating notion that a random discrete non-parametric prior is inducing a distribution on the partitions (EPPF). And applicability in mixtures and their generalisations. Realising that the highly discrete nature of such measures is not such an issue for a given sample size n, since there are at most n elements in the partition. Beatrice Franzolini discussed of specific ways to create dependent distributions based on independent samples, although her practical example based on one N(-10,1) sample and another (independently) N(10,1) sample seemed to fit in several of the dependent random measures she compared. And Marta Catalano (Warwick) presented her work on partial exchangeability and optimal transportation (which I had also heard in CIRM last June and in Warwick last week). One thing I had not realised earlier was the dependence of the Wasserstein distance on the parameterisation, although it now makes perfect sense. If only for the coupling.  I had alas to miss Isadora Antoniano-Villalobos’ talk as I had to teach my undergrad class in Paris Dauphine at the same time… This non-parametric session was quite homogeneous and rich in perspectives.

In an all-MCMC afternoon, Julyan Arbel talked about reference priors for extreme value distributions, with the “shocking” case of a restriction on the support of one parameter, ξ. Which means in fact that the Jeffreys prior is then undefined. This reminded me somewhat of the work of Clara Grazian on Jeffreys priors for mixtures, where some models were not allowing for Fisher information to exist. The second part of this talk was about modified local versions of Gelman & Rubin (1992) R hats. And the recent modification proposed by Aki and co-authors. Where I thought that a simplification of the multivariate challenge of defining ranks could be alleviated by considering directly the likelihood values of the chains. And Trevor Campbell gradually built an involved parallel tempering method where the powers of a geometric mixture are optimised as spline functions of the temperature. Next, María Gil-Leyva presented her original and ordered approach to mixture estimation, which I discussed in a blog published two days ago (!). She corrected my impressions that (i) the methods were all impervious to label switching and (ii) required some conjugacy to operate. The final talk of the day was by Anirban Bhattacharya on high-D Bayesian regression and coupling techniques for checking convergence, a paper that had been on my reading list for a long while. A very elaborate construct of coupling strategies within a Gibbs sampler, with some steps relying on optimal coupling and others on the use of common random generators.

## Easy computation of the Bayes Factor

Posted in Books, Statistics with tags , , , , , on August 21, 2021 by xi'an

“Choosing the ranges has been criticized as introducing subjectivity; however, the key point is that the ranges are given quantitatively and should be justified”

On arXiv, I came across a paper by physicists Dunstan, Crowne, and Drew, on computing the Bayes factor by linear regression. Paper that I found rather hard to read given that the method is never completely spelled out but rather described through some examples (or the captions of figures)… The magical formula (for the marginal likelihood)

$B=(2\pi)^{n/2}L_{\max}\dfrac{\text{Cov}_p}{\prod_{i=1}^n \Delta p_i}$

where n is the parameter dimension, Cov is the Fisher information matrix, and the denominator the volume of a flat prior on an hypercube (!), seems to come for a Laplace approximation. But it depends rather crucially (!) on the choice of this volume. A severe drawback the authors evacuate with the above quote… And by using an example where the parameters have a similar meaning under both models. The following ones compare several dimensions of parameters without justifying (enough) the support of the corresponding priors. In addition, using a flat prior over the hypercube seems to clash with the existence of a (Fisher) correlation between the components. (To be completely open as to why I discuss this paper, I was asked to review the paper, which I declined.)

## Fisher’s lost information

Posted in Books, Kids, pictures, Statistics, Travel with tags , , , , , , , on February 11, 2019 by xi'an

After a post on X validated and a good discussion at work, I came to the conclusion [after many years of sweeping the puzzle under the carpet] that the (a?) Fisher information obtained for the Uniform distribution U(0,θ) as θ⁻¹ is meaningless. Indeed, there are many arguments:

1. The lack of derivability of the indicator function for x=θ is a non-issue since the derivative is defined almost everywhere.
2. In many textbooks, the Fisher information θ⁻² is derived from the Fréchet-Darmois-Cramèr-Rao inequality, which does not apply for the Uniform U(0,θ) distribution.
3. One connected argument for the expression of the Fisher information as the expectation of the squared score is that it is the variance of the score, since its expectation is zero. Except that it is not zero for the Uniform U(0,θ) distribution.
4. For the same reason, the opposite of the second derivative of the log-likelihood is not equal to the expectation of the squared score. It is actually -θ⁻²!
5. Looking at the Taylor expansion justification of the (observed) Fisher information, expanding the log-likelihood around the maximum likelihood estimator does not work since the maximum likelihood estimator does not cancel the score.
6. When computing the Fisher information for an n-sample rather than a 1-sample, the information is n²θ⁻², rather than nθ⁻².
7. Since the speed of convergence of the maximum likelihood estimator is of order n⁻², the central limit theorem does not apply and the limiting variance of the maximum likelihood estimator is not the Fisher information.

## information maximising neural networks summaries

Posted in pictures, Statistics with tags , , , , , , , , on February 6, 2019 by xi'an

After missing the blood moon eclipse last night, I had a meeting today at the Paris observatory (IAP), where we discussed an ABC proposal made by Tom Charnock, Guilhem Lavaux, and Benjamin Wandelt from this institute.

“We introduce a simulation-based machine learning technique that trains artificial neural networks to find non-linear functionals of data that maximise Fisher information : information maximising neural networks.” T. Charnock et al., 2018
The paper is centred on the determination of “optimal” summary statistics. With the goal of finding “transformation which maps the data to compressed summaries whilst conserving Fisher information [of the original data]”. Which sounds like looking for an efficient summary and hence impossible in non-exponential cases. As seen from the description in (2.1), the assumed distribution of the summary is Normal, with mean μ(θ) and covariance matrix C(θ) that are implicit transforms of the parameter θ. In that respect, the approach looks similar to the synthetic likelihood proposal of Wood (2010). From which an unusual form of Fisher information can be derived, as μ(θ)’C(θ)⁻¹μ(θ)… A neural net is trained to optimise this information criterion at a given (so-called fiducial) value of θ, in terms of a set of summaries of the same dimension as the data. Which means the information contained in the whole data (likelihood) is not necessarily recovered, linking with this comment from Edward Ionides (in a set of lectures at Wharton).
“Even summary statistics derived by careful scientific or statistical reasoning have been found surprisingly uninformative compared to the whole data likelihood in both scientific investigations (Shrestha et al., 2011) and simulation experiments (Fasiolo et al., 2016)” E. Ionides, slides, 2017
The maximal Fisher information obtained in this manner is then used in a subsequent ABC step as the natural metric for the distance between the observed and simulated data. (Begging the question as to why being maximal is necessarily optimal.) Another question is about the choice of the fiducial parameter, which choice should be tested by for instance iterating the algorithm a few steps. But having to run simulations for a single value of the parameter is certainly a great selling point!

## penalising model component complexity

Posted in Books, Mountains, pictures, Statistics, University life with tags , , , , , , , , , , on April 1, 2014 by xi'an

“Prior selection is the fundamental issue in Bayesian statistics. Priors are the Bayesian’s greatest tool, but they are also the greatest point for criticism: the arbitrariness of prior selection procedures and the lack of realistic sensitivity analysis (…) are a serious argument against current Bayesian practice.” (p.23)

A paper that I first read and annotated in the very early hours of the morning in Banff, when temperatures were down in the mid minus 20’s now appeared on arXiv, “Penalising model component complexity: A principled, practical approach to constructing priors” by Thiago Martins, Dan Simpson, Andrea Riebler, Håvard Rue, and Sigrunn Sørbye. It is a highly timely and pertinent paper on the selection of default priors! Which shows that the field of “objective” Bayes is still full of open problems and significant advances and makes a great argument for the future president [that I am] of the O’Bayes section of ISBA to encourage young Bayesian researchers to consider this branch of the field.

“On the other end of the hunt for the holy grail, “objective” priors are data-dependent and are not uniformly accepted among Bayesians on philosophical grounds.” (p.2)

Apart from the above quote, as objective priors are not data-dependent! (this is presumably a typo, used instead of model-dependent), I like very much the introduction (appreciating the reference to the very recent Kamary (2014) that just got rejected by TAS for quoting my blog post way too much… and that we jointly resubmitted to Statistics and Computing). Maybe missing the alternative solution of going hierarchical as far as needed and ending up with default priors [at the top of the ladder]. And not discussing the difficulty in specifying the sensitivity of weakly informative priors.

“Most model components can be naturally regarded as a flexible version of a base model.” (p.3)

The starting point for the modelling is the base model. How easy is it to define this base model? Does it [always?] translate into a null hypothesis formulation? Is there an automated derivation? I assume this somewhat follows from the “block” idea that I do like but how generic is model construction by blocks?

“Occam’s razor is the principle of parsimony, for which simpler model formulations should be preferred until there is enough support for a more complex model.” (p.4)

I also like this idea of putting a prior on the distance from the base! Even more because it is parameterisation invariant (at least at the hyperparameter level). (This vaguely reminded me of a paper we wrote with George a while ago replacing tests with distance evaluations.) And because it gives a definitive meaning to Occam’s razor. However, unless the hyperparameter ξ is one-dimensional this does not define a prior on ξ per se. I equally like Eqn (2) as it shows how the base constraint takes one away from Jeffrey’s prior. Plus, if one takes the Kullback as an intrinsic loss function, this also sounds related to Holmes’s and Walker’s substitute loss pseudopriors, no? Now, eqn (2) does not sound right in the general case. Unless one implicitly takes a uniform prior on the Kullback sphere of radius d? There is a feeling of one-d-ness in the description of the paper (at least till page 6) and I wanted to see how it extends to models with many (≥2) hyperparameters. Until I reached Section 6 where the authors state exactly that! There is also a potential difficulty in that d(ξ) cannot be computed in a general setting. (Assuming that d(ξ) has a non-vanishing Jacobian as on page 19 sounds rather unrealistic.) Still about Section 6, handling reference priors on correlation matrices is a major endeavour, which should produce a steady flow of followers..!

“The current practice of prior specification is, to be honest, not in a good shape. While there has been a strong growth of Bayesian analysis in science, the research field of “practical prior specification” has been left behind.” (*p.23)

There are still quantities to specify and calibrate in the PC priors, which may actually be deemed a good thing by Bayesians (and some modellers). But overall I think this paper and its message constitute a terrific step for Bayesian statistics and I hope the paper can make it to a major journal.