## about the strong likelihood principle

Posted in Books, Statistics, University life on November 13, 2014 by xi'an

Deborah Mayo arXived a Statistical Science paper a few days ago, along with discussions by Jan Bjørnstad, Phil Dawid, Don Fraser, Michael Evans, Jan Hannig, R. Martin and C. Liu. I am very glad that this discussion paper came out, and that it came out in Statistical Science, although I am rather surprised to find no discussion by Jim Berger or Robert Wolpert, and even though I still cannot entirely follow the deductive argument in the rejection of Birnbaum’s proof, just as in the earlier version in Error & Inference. But I somehow do not feel like going again into a new debate about this critique of Birnbaum’s derivation. (Even though statements such as the claim that the SLP “would preclude the use of sampling distributions” (p.227) would call for a rebuttal.)

“It is the imprecision in Birnbaum’s formulation that leads to a faulty impression of exactly what is proved.” M. Evans

Indeed, at this stage, I fear that [for me] a more relevant issue is whether or not the debate matters… At a logical cum foundational [and maybe cum historical] level, it makes perfect sense to uncover which, if any, of the myriad versions of Birnbaum’s likelihood Principle holds. [Although trying to uncover Birnbaum’s motives and positions over time may not be so relevant.] I think the paper and the discussions acknowledge that some versions of the weak conditionality Principle do not imply some versions of the strong likelihood Principle, while other logical implications remain true. At a methodological level, I am much less sure it matters. Each time I taught this notion, I got blank stares and incomprehension from my students, to the point that I have now stopped teaching the likelihood Principle in class altogether. And most of my co-authors do not seem to care very much about it. At a purely mathematical level, I wonder if there even is ground for a debate, since the notions involved can be defined in various imprecise ways, as pointed out by Michael Evans above and in his discussion. At a statistical level, sufficiency is ultimately a strange notion, in that it seems to make plenty of sense until one realises there is no interesting sufficiency outside exponential families. Just as there are very few parameter transforms for which unbiased estimators can be found. So I also spend very little time teaching and even less time worrying about sufficiency. (As it happens, I taught the notion this morning!) At another and presumably more significant statistical level, what matters is information: e.g., conditioning means adding information (i.e., about which experiment has been used). While complex settings may prohibit the use of the entire information provided by the data, at a formal level there is no argument for not using the entire information, i.e. for not conditioning upon the entire data.
(At a computational level, this is no longer true, witness ABC and similar limited-information techniques. By the way, ABC demonstrates, if needed, why sampling distributions matter so much to Bayesian analysis.)

“Non-subjective Bayesians who (…) have to live with some violations of the likelihood principle (…) since their prior probability distributions are influenced by the sampling distribution.” D. Mayo (p.229)

In the end, the fact that the prior may depend on the form of the sampling distribution, and hence does violate the likelihood Principle, does not worry me so much. In most models I consider, the parameters are endogenous to those sampling distributions and do not live an ethereal existence independently from the model: they are substantiated and calibrated by the model itself, which makes the discussion about the LP rather vacuous. See, e.g., the coefficients of a linear model. In complex models, or with large datasets, it is even impossible to handle the whole data or the whole model, and proxies have to be used instead, making worries about the structure of the (original) likelihood moot. I think we have now reached a stage of statistical inference where models are no longer accepted as ideal truth and where approximation is the hard reality, imposed by the massive amounts of data relentlessly calling for immediate processing. Hence a stage where the self-validation or invalidation of such approximations in terms of predictive performance is the relevant issue. Provided we can face the challenge at all…

## Pre-processing for approximate Bayesian computation in image analysis

Posted in R, Statistics, University life on March 21, 2014 by xi'an

With Matt Moores and Kerrie Mengersen, from QUT, we wrote this short paper just in time for the MCMSki IV Special Issue of Statistics & Computing. And arXived it, as well. The global idea is to cut down on the cost of running an ABC experiment by removing the simulation of a humongous state-space vector, as in Potts and hidden Potts models, and replacing it with an approximate simulation of the 1-d sufficient (summary) statistic. In that case, we used a partition of the 1-d parameter interval to simulate the distribution of the sufficient statistic for each of those parameter values and to compute the expectation and variance of the sufficient statistic. Then the conditional distribution of the sufficient statistic is approximated by a Gaussian with these two parameters. And those Gaussian approximations substitute for the true distributions within an ABC-SMC algorithm à la Del Moral, Doucet and Jasra (2012).
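
The precompute-then-surrogate idea can be sketched in a few lines. The snippet below is a hypothetical stand-in, not the paper's code: the expensive Potts simulation is replaced by a cheap toy simulator, and the grid size, number of pilot runs and ABC tolerance are made-up values for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the Potts sufficient statistic: we pretend the summary
# statistic of an image simulated at parameter theta is a scaled binomial.
# In the actual paper this step requires costly Potts-model simulations.
def simulate_statistic(theta, n_pixels=125 * 125):
    return rng.binomial(n_pixels, 1 / (1 + np.exp(-theta))) / n_pixels

# 1. Precompute mean and standard deviation of the statistic on a grid.
grid = np.linspace(-2.0, 2.0, 41)
moments = {}
for theta in grid:
    draws = np.array([simulate_statistic(theta) for _ in range(200)])
    moments[theta] = (draws.mean(), draws.std())

mus = np.array([moments[t][0] for t in grid])
sds = np.array([moments[t][1] for t in grid])

# 2. Inside ABC, replace simulating a whole image by one Gaussian draw
#    with (interpolated) precomputed moments.
def approx_statistic(theta):
    return rng.normal(np.interp(theta, grid, mus), np.interp(theta, grid, sds))

# Plain ABC rejection step using the Gaussian surrogate:
s_obs = simulate_statistic(0.5)          # pretend this is the observed summary
prior = rng.uniform(-2, 2, size=5000)    # flat prior on theta
accepted = [th for th in prior if abs(approx_statistic(th) - s_obs) < 0.01]
```

Once the grid of moments is built, each surrogate draw costs a single Gaussian simulation, which is where the reported speed-up over simulating pseudo-images comes from.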

Across 20 simulated 125 × 125 pixel images, Matt’s algorithm took an average of 21 minutes per image for between 39 and 70 SMC iterations, while resorting to pseudo-data and deriving the genuine sufficient statistic took an average of 46.5 hours for 44 to 85 SMC iterations. On a realistic Landsat image, with a total of 978,380 pixels, the precomputation of the mapping function took 50 minutes, while the total CPU time on 16 parallel threads was 10 hours 38 minutes. By comparison, it took 97 hours for 10,000 MCMC iterations on this image, with a poor effective sample size of 390 values. Regular SMC-ABC algorithms cannot handle this scale: it takes 89 hours to perform a single SMC iteration! (Note that path sampling also operates in this framework, thanks to the same precomputation: in that case it took 2.5 hours for 10⁵ iterations, with an effective sample size of 10⁴…)

Since my student’s paper on Seaman et al (2012) got promptly rejected by TAS for quoting too extensively from my post, we decided to include me as an extra author and submitted the paper to this special issue as well.

## ABC with indirect summary statistics

Posted in Statistics, University life on February 3, 2014 by xi'an

After reading Drovandi and Pettitt’s Bayesian Indirect Inference, I checked (on the plane to Birmingham) the earlier Approximate Bayesian Computation with indirect summary statistics by Gleim and Pigorsch. The setting is indeed quite similar to the above, with a description of three ways of connecting indirect inference with ABC, albeit with a different range of illustrations. This preprint states most clearly its assumption that the generating model is a particular case of the auxiliary model, which sounds anticlimactic since the auxiliary model is used precisely because the original one is mostly out of reach! This certainly was the original motivation for using indirect inference.

The part of the paper that I find the most intriguing is the argument that the indirect approach leads to sufficient summary statistics, in the sense that they “are sufficient for the parameters of the auxiliary model and (…) sufficiency carries over to the model of interest” (p.31). Looking at the details in the Appendix, I found the argument wanting: the likelihood as a functional is shown to be a (sufficient) statistic, which seems both a tautology and beside the point, since this functional differs from the likelihood evaluated at the (auxiliary) MLE, the summary statistic actually used in fine.

“…we expand the square root of an innovation density h in a Hermite expansion and truncate the infinite polynomial at some integer K which, together with other tuning parameters of the SNP density, has to be determined through a model selection criterion (such as BIC). Now we take the leading term of the Hermite expansion to follow a Gaussian GARCH model.”

As in Drovandi and Pettitt, the performances of the ABC-I schemes are tested on a toy example, namely a very basic exponential iid sample with a conjugate prior, with a gamma model as auxiliary. The authors use a standard ABC based on the first two moments as their benchmark; however, they do not calibrate those moments in the distance and end up with poor performances of ABC (in a setting where there is a sufficient statistic!). The best choice in this experiment appears to be the solution based on the score, but the variances of the distances are not included in the comparison tables. The second implementation considered in the paper is a rather daunting continuous-time non-Gaussian Ornstein-Uhlenbeck stochastic volatility model à la Barndorff-Nielsen and Shephard (2001). The construction of the semi-nonparametric (why not semi-parametric?) auxiliary model is quite involved as well, as illustrated by the quote above. The approach provides an answer, with posterior ABC-IS distributions on all parameters of the original model, which rekindles the question of the validation of this answer in terms of the original posterior. Handling several approximation processes simultaneously would help in this regard.
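
Calibrating the moments in the distance, which the benchmark above skips, can be done for instance by rescaling each summary by a robust spread estimate under the prior predictive. The sketch below reproduces the toy setting (exponential iid sample, conjugate gamma prior) with made-up sizes; the MAD-based weighting is one common calibration, not necessarily the authors' choice.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy exponential iid sample with the first two empirical moments as
# summaries. The two moments live on different scales, so the ABC
# distance should weigh them, e.g. by the median absolute deviation
# (MAD) of each summary under the prior predictive.
n = 50
x_obs = rng.exponential(scale=1 / 2.0, size=n)       # true rate 2
s_obs = np.array([x_obs.mean(), (x_obs ** 2).mean()])

def summaries(rate):
    x = rng.exponential(scale=1 / rate, size=n)
    return np.array([x.mean(), (x ** 2).mean()])

# Prior predictive draws, used both as ABC proposals and to calibrate scales.
rates = rng.gamma(2.0, 1.0, size=2000)               # conjugate Gamma(2,1) prior
sims = np.array([summaries(r) for r in rates])
mad = np.median(np.abs(sims - np.median(sims, axis=0)), axis=0)

# Weighted ABC rejection: each summary coordinate divided by its MAD.
d = np.abs((sims - s_obs) / mad).sum(axis=1)
keep = rates[d < np.quantile(d, 0.01)]
```

The accepted `keep` values then concentrate near the true rate, whereas an unweighted distance lets the second moment, whose scale is larger, dominate the comparison.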

## ABC model choice [slides]

Posted in pictures, Statistics, Travel, University life on November 7, 2011 by xi'an

Here are the slides for my talks both at CREST this afternoon (in ½ an hour!) and in Madrid [on Friday 11/11/11, magical day of the year, especially since I will be speaking at 11:11 CET…] for the Workshop Métodos Bayesianos 11 (no major difference with the slides from Zürich, hey!, except for the quantile distribution example).

## workshop in Columbia [talk]

Posted in Statistics, Travel, University life on September 25, 2011 by xi'an

Here are the slides of my talk yesterday at the Computational Methods in Applied Sciences workshop in Columbia:

The last section of the talk covers our new results with Jean-Michel Marin, Natesh Pillai and Judith Rousseau on the necessary and sufficient conditions for a summary statistic to be used in ABC model choice. (The paper is about to be completed.) This obviously comes as the continuation of our reflexions on ABC model choice started last January. The major message of the paper is that the statistics used for running model choice cannot have a mean value common to both models, which strongly implies using ancillary statistics with different means under each model. (I am afraid that, thanks to the mixture of no-jetlag fatigue, of slide inflation [95 slides vs. 40mn], and of asymptotics technicalities in the last part, the talk was far from comprehensible. I started on the wrong foot by not getting an XL [Xiao-Li’s] comment on the measure-theory problem with the limit in ε going to zero. A pity, given the great debate we had in Banff with Jean-Michel, David Balding, and Mark Beaumont, years ago. And our more recent paper about the arbitrariness of the density value in the Savage-Dickey paradox. I then compounded the confusion by stating that the empirical mean was sufficient in the Laplace case… which is not even an exponential family. I hope I will be more articulate next week in Zürich, where at least I will not be speaking past my bedtime!)

## ABC and sufficient statistics

Posted in Statistics, University life on July 8, 2011 by xi'an

Chris Barnes, Sarah Filippi, Michael P.H. Stumpf, and Thomas Thorne posted a paper on arXiv on the selection of sufficient statistics towards ABC model choice. This paper, called Considerate Approaches to Achieving Sufficiency for ABC model selection, was presented by Chris Barnes during ABC in London two months ago. (Note that all talks of the meeting are now available on Nature Precedings. A neat concept, by the way!) This paper of theirs builds on our earlier warning about (unfounded) ABC model selection to propose a selection of summary statistics that partly alleviates the original problem. (The part about the discrepancy with the true posterior probability remains to be addressed. As does the issue of whether or not the selected collection of statistics provides a convergent model choice inference. We are currently working on it…) Their section “Resuscitating ABC model choice” states quite clearly the goal of the paper:

- this [use of inadequate summary statistics] mirrors problems that can also be observed in the parameter estimation context,
- for many important, and arguably the most important applications of ABC, this problem can in principle be avoided by using the whole data rather than summary statistics,
- in cases where summary statistics are required, we argue that we can construct approximately sufficient statistics in a disciplined manner,
- when all else fails, a change in perspective allows us to nevertheless make use of the flexibility of the ABC framework

The driving idea in the paper is to use an entropy approximation to measure the lack of information due to the use of a given set of summary statistics. The corresponding algorithm then proceeds from a starting pool of summary statistics to sequentially build a collection of the most informative summary statistics (which, in a sense, reminded me of a variable selection procedure based on the Kullback-Leibler divergence that we developed with Costas Goutis and Jérôme Dupuis). It is a very interesting advance on the issue of ABC model selection, even though it cannot eliminate all stumbling blocks. The interpretation that ABC should be processed as an inferential method in its own right, rather than as an approximation to Bayesian inference, is clearly appealing. (Fearnhead and Prangle, and Dean, Singh, Jasra and Peters could be quoted as well.)
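
A minimal sketch of such a greedy, entropy-driven selection follows. Everything in it (the toy model, the candidate pool, the Gaussian plug-in entropy estimate) is invented for illustration and is not taken from the paper, which uses its own entropy approximation.

```python
import numpy as np

rng = np.random.default_rng(2)

def abc_posterior(stats_idx, sims, s_obs, thetas, q=0.05):
    # ABC rejection using only the summaries indexed by stats_idx
    d = np.linalg.norm(sims[:, stats_idx] - s_obs[stats_idx], axis=1)
    return thetas[d <= np.quantile(d, q)]

def entropy_estimate(sample):
    # crude Gaussian plug-in: entropy of a normal with the sample variance
    return 0.5 * np.log(2 * np.pi * np.e * sample.var())

# Toy setting: theta ~ U(0,2); three candidate summaries of the data.
thetas = rng.uniform(0, 2, 3000)
sims = np.column_stack([
    thetas + rng.normal(0, 0.1, 3000),   # highly informative summary
    thetas + rng.normal(0, 0.5, 3000),   # weakly informative summary
    rng.normal(0, 1, 3000),              # pure noise summary
])
s_obs = np.array([1.0, 1.0, 0.0])        # pretend observed summaries

# Greedy forward selection: at each step, add the candidate whose
# inclusion lowers the entropy of the ABC posterior the most.
selected, pool = [], [0, 1, 2]
while pool:
    scores = {j: entropy_estimate(abc_posterior(selected + [j], sims, s_obs, thetas))
              for j in pool}
    best = min(scores, key=scores.get)
    if selected and scores[best] >= entropy_estimate(
            abc_posterior(selected, sims, s_obs, thetas)):
        break                            # no candidate reduces the entropy further
    selected.append(best)
    pool.remove(best)
```

On this toy example the informative summary is picked first, and the procedure stops once the remaining candidates no longer shrink the approximate posterior.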

## Lack of confidence [revised]

Posted in R, Statistics, University life on April 22, 2011 by xi'an

Following the comments on our earlier submission to PNAS, we have written (and re-arXived) a revised version where we try to spell out (better) the distinction between ABC point (and confidence) estimation and ABC model choice, namely that the problem sits at another level for Bayesian model choice (using posterior probabilities). When doing point estimation with non-sufficient summary statistics, the information content is poorer, but unless one uses very degraded summary statistics, inference still converges. We completely agree with the reviewers that the posterior distribution differs from the true posterior in this case but, at least, gathering more observations brings more information about the parameter (and convergence as the number of observations goes to infinity). For model choice, this is not guaranteed if we use summary statistics that are not inter-model sufficient, as shown by the Poisson and normal examples. Furthermore, except for very specific cases such as Gibbs random fields, it is almost always impossible to derive inter-model sufficient statistics beyond the raw sample. This is why we consider there is a fundamental difference between point estimation and model choice.
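
The inter-model sufficiency issue can be made concrete with a toy experiment. The post refers to Poisson and normal examples; the Poisson versus geometric pairing below is a stand-in of my own, with made-up priors and tolerances: the sample mean is sufficient within each model, yet an ABC "posterior probability" based on the mean alone only compares the prior predictives of the mean and need not match the exact posterior probability.

```python
import numpy as np

rng = np.random.default_rng(3)

# Data actually generated from the Poisson model.
n = 100
x_obs = rng.poisson(1.0, size=n)
s_obs = x_obs.mean()

def abc_model_choice(n_sim=20000, eps=0.02):
    """ABC model choice with the sample mean as the only summary."""
    votes = []
    for _ in range(n_sim):
        m = rng.integers(2)              # uniform prior on the two models
        lam = rng.exponential(1.0)       # Exp(1) prior on the mean parameter
        if m == 0:
            x = rng.poisson(lam, n)                      # Poisson(lam)
        else:
            x = rng.geometric(1 / (1 + lam), n) - 1      # geometric, mean lam
        if abs(x.mean() - s_obs) < eps:
            votes.append(m)
    return np.mean(np.array(votes) == 0)  # ABC estimate of P(Poisson | s_obs)

p_hat = abc_model_choice()
```

Because both models can reproduce the observed mean equally easily, `p_hat` stays strictly between 0 and 1 however large `n_sim` gets, even though the variance of the data would strongly discriminate between the two models.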

Following the request of a referee, we also ran a more extensive simulation experiment comparing two scenarios with 3 populations, 100 diploid individuals per population, and 50 loci/markers. However, the results are somewhat less conclusive, in the sense that, since we use 50 loci, the data is much more informative about the model, and therefore both the importance sampling and the ABC approximations provide a value of the posterior probability that is close to one, hence both conclude with the validation of the true model. Because both approximations are very close to one, it is difficult to assess the worth of the ABC approximation per se, i.e. in numerical terms. (The fact that the statistical conclusion is the same for both approaches is of course satisfying from an inferential perspective, but it is an altogether separate issue from our argument about the possible lack of convergence of the ABC Bayes factor approximation to the true Bayes factor.) Furthermore, this experiment may be beyond the manageable/reasonable, in the sense that the importance sampling approximation cannot be taken for granted, nor can it be checked empirically. Indeed, with 50 markers and 100 individuals, the product likelihood suffers from an enormous variability that 100,000 particles and 100 trees per locus have trouble addressing (despite a huge computing cost of more than 12 days on a powerful cluster).

Incidentally, I had a problem with natbib, when using the pnas style:

```
! Package natbib Error: Bibliography not compatible with author-year citations.
Press <return> to continue in numerical citation style.
See the natbib package documentation for explanation.
```

but it vanished with the option

`\usepackage[round,numbers]{natbib}`

which is an easy fix.