## Naturally amazed at non-identifiability

Posted in Books, Statistics, University life with tags , , , , , , , , , , , on May 27, 2020 by xi'an

A Nature paper by Stilianos Louca and Matthew W. Pennell,  Extant time trees are consistent with a myriad of diversification histories, comes to the extraordinary conclusion that birth-&-death evolutionary models cannot distinguish between several scenarios given the available data! Namely, stem ages and daughter lineage ages cannot identify the speciation rate function λ(.), the extinction rate function μ(.)  and the sampling fraction ρ inherently defining the deterministic ODE leading to the number of species predicted at any point τ in time, N(τ). The Nature paper does not seem to make a point beyond the obvious and I am rather perplexed at why it got published [and even highlighted]. A while ago, under the leadership of Steve, PNAS decided to include statistician reviewers for papers relying on statistical arguments. It could time for Nature to move there as well.

“We thus conclude that two birth-death models are congruent if and only if they have the same rp and the same λp at some time point in the present or past.” [S.1.1, p.4]

Or, stated otherwise, that a tree structured dataset made of branch lengths are not enough to identify two functions that parameterise the model. The likelihood looks like

$\frac{\rho^{n-1}\Psi(\tau_1,\tau_0)}{1-E(\tau)}\prod_{i=1}^n \lambda(\tau_i)\Psi(s_{i,1},\tau_i)\Psi(s_{i,2},\tau_i)$\$

where E(.) is the probability to survive to the present and ψ(s,t) the probability to survive and be sampled between times s and t. Sort of. Both functions depending on functions λ(.) and  μ(.). (When the stem age is unknown, the likelihood changes a wee bit, but with no changes in the qualitative conclusions. Another way to write this likelihood is in term of the speciation rate λp

$e^{-\Lambda_p(\tau_0)}\prod_{i=1}^n\lambda_p(\tau_I)e^{-\Lambda_p(\tau_i)}$

where Λp is the integrated rate, but which shares the same characteristic of being unable to identify the functions λ(.) and μ(.). While this sounds quite obvious the paper (or rather the supplementary material) goes into fairly extensive mode, including “abstract” algebra to define congruence.

“…we explain why model selection methods based on parsimony or “Occam’s razor”, such as the Akaike Information Criterion and the Bayesian Information Criterion that penalize excessive parameters, generally cannot resolve the identifiability issue…” [S.2, p15]

As illustrated by the above quote, the supplementary material also includes a section about statistical model selections techniques failing to capture the issue, section that seems superfluous or even absurd once the fact that the likelihood is constant across a congruence class has been stated.

## priors without likelihoods are like sloths without…

Posted in Books, Statistics with tags , , , , , , , , , , , , on September 11, 2017 by xi'an

“The idea of building priors that generate reasonable data may seem like an unusual idea…”

Andrew, Dan, and Michael arXived a opinion piece last week entitled “The prior can generally only be understood in the context of the likelihood”. Which connects to the earlier Read Paper of Gelman and Hennig I discussed last year. I cannot state strong disagreement with the positions taken in this piece, actually, in that I do not think prior distributions ever occur as a given but are rather chosen as a reference measure to probabilise the parameter space and eventually prioritise regions over others. If anything I find myself even further on the prior agnosticism gradation.  (Of course, this lack of disagreement applies to the likelihood understood as a function of both the data and the parameter, rather than of the parameter only, conditional on the data. Priors cannot be depending on the data without incurring disastrous consequences!)

“…it contradicts the conceptual principle that the prior distribution should convey only information that is available before the data have been collected.”

The first example is somewhat disappointing in that it revolves as so many Bayesian textbooks (since Laplace!) around the [sex ratio] Binomial probability parameter and concludes at the strong or long-lasting impact of the Uniform prior. I do not see much of a contradiction between the use of a Uniform prior and the collection of prior information, if only because there is not standardised way to transfer prior information into prior construction. And more fundamentally because a parameter rarely makes sense by itself, alone, without a model that relates it to potential data. As for instance in a regression model. More, following my epiphany of last semester, about the relativity of the prior, I see no damage in the prior being relevant, as I only attach a relative meaning to statements based on the posterior. Rather than trying to limit the impact of a prior, we should rather build assessment tools to measure this impact, for instance by prior predictive simulations. And this is where I come to quite agree with the authors.

“…non-identifiabilities, and near nonidentifiabilites, of complex models can lead to unexpected amounts of weight being given to certain aspects of the prior.”

Another rather straightforward remark is that non-identifiable models see the impact of a prior remain as the sample size grows. And I still see no issue with this fact in a relative approach. When the authors mention (p.7) that purely mathematical priors perform more poorly than weakly informative priors it is hard to see what they mean by this “performance”.

“…judge a prior by examining the data generating processes it favors and disfavors.”

Besides those points, I completely agree with them about the fundamental relevance of the prior as a generative process, only when the likelihood becomes available. And simulatable. (This point is found in many references, including our response to the American Statistician paper Hidden dangers of specifying noninformative priors, with Kaniav Kamary. With the same illustration on a logistic regression.) I also agree to their criticism of the marginal likelihood and Bayes factors as being so strongly impacted by the choice of a prior, if treated as absolute quantities. I also if more reluctantly and somewhat heretically see a point in using the posterior predictive for assessing whether a prior is relevant for the data at hand. At least at a conceptual level. I am however less certain about how to handle improper priors based on their recommendations. In conclusion, it would be great to see one [or more] of the authors at O-Bayes 2017 in Austin as I am sure it would stem nice discussions there! (And by the way I have no prior idea on how to conclude the comparison in the title!)

## non-identifiability in Venezia

Posted in Books, pictures, Statistics, Travel, University life with tags , , , , , , , , , on November 2, 2016 by xi'an

Last Wednesday, I attended a seminar by T. Kitagawa at the economics seminar of the University Ca’ Foscari, in Venice, which was about (uncertain) identifiability and a sort of meta-Bayesian approach to the problem. Just to give an intuition about the setting, a toy example is a simultaneous equation model Ax=ξ, where x and ξ are two-dimensional vectors, ξ being a standard bivariate Normal noise. In that case, A is not completely identifiable. The argument in the talk (and the paper) is that the common Bayesian answer that sets a prior on the non-identifiable part (which is an orthogonal matrix in the current setting) is debatable as it impacts inference on the non-identifiable parts, even in the long run. Which seems fine from my viewpoint. The authors propose to instead consider the range of possible priors that are compatible with the set restrictions on the non-identifiable parts and to introduce a mixture between a regular prior on the whole parameter A and this collection of priors, which can be seen as a set-valued prior although this does not fit within the Bayesian framework in my opinion. Once this mixture is constructed, a formal posterior weight on the regular prior can be derived. As well as a range of posterior values for all quantities of interest. While this approach connects with imprecise probabilities à la Walley (?) and links with robust Bayesian studies of the 1980’s, I always have difficulties with the global setting of such models, which do not come under criticism while being inadequate. (Of course, there are many more things I do not understand in econometrics!)

## asymptotic properties of Approximate Bayesian Computation

Posted in pictures, Statistics, Travel, University life with tags , , , , , , , , , , on July 26, 2016 by xi'an

With David Frazier and Gael Martin from Monash University, and with Judith Rousseau (Paris-Dauphine), we have now completed and arXived a paper entitled Asymptotic Properties of Approximate Bayesian Computation. This paper undertakes a fairly complete study of the large sample properties of ABC under weak regularity conditions. We produce therein sufficient conditions for posterior concentration, asymptotic normality of the ABC posterior estimate, and asymptotic normality of the ABC posterior mean. Moreover, those (theoretical) results are of significant import for practitioners of ABC as they pertain to the choice of tolerance ε used within ABC for selecting parameter draws. In particular, they [the results] contradict the conventional ABC wisdom that this tolerance should always be taken as small as the computing budget allows.

Now, this paper bears some similarities with our earlier paper on the consistency of ABC, written with David and Gael. As it happens, the paper was rejected after submission and I then discussed it in an internal seminar in Paris-Dauphine, with Judith taking part in the discussion and quickly suggesting some alternative approach that is now central to the current paper. The previous version analysed Bayesian consistency of ABC under specific uniformity conditions on the summary statistics used within ABC. But conditions for consistency are now much weaker conditions than earlier, thanks to Judith’s input!

There are also similarities with Li and Fearnhead (2015). Previously discussed here. However, while similar in spirit, the results contained in the two papers strongly differ on several fronts:

1. Li and Fearnhead (2015) considers an ABC algorithm based on kernel smoothing, whereas our interest is the original ABC accept-reject and its many derivatives
2. our theoretical approach permits a complete study of the asymptotic properties of ABC, posterior concentration, asymptotic normality of ABC posteriors, and asymptotic normality of the ABC posterior mean, whereas Li and Fearnhead (2015) is only concerned with asymptotic normality of the ABC posterior mean estimator (and various related point estimators);
3. the results of Li and Fearnhead (2015) are derived under very strict uniformity and continuity/differentiability conditions, which bear a strong resemblance to those conditions in Yuan and Clark (2004) and Creel et al. (2015), while the result herein do not rely on such conditions and only assume very weak regularity conditions on the summaries statistics themselves; this difference allows us to characterise the behaviour of ABC in situations not covered by the approach taken in Li and Fearnhead (2015);

## at CIRM [#3]

Posted in Kids, Mountains, pictures, Running, Statistics, Travel, University life with tags , , , , , , , , , , , , , , , , on March 4, 2016 by xi'an

Simon Barthelmé gave his mini-course on EP, with loads of details on the implementation of the method. Focussing on the EP-ABC and MCMC-EP versions today. Leaving open the difficulty of assessing to which limit EP is converging. But mentioning the potential for asynchronous EP (on which I would like to hear more). Ironically using several times a logistic regression example, if not on the Pima Indians benchmark! He also talked about approximate EP solutions that relate to consensus MCMC. With a connection to Mark Beaumont’s talk at NIPS [at the time as mine!] on the comparison with ABC. While we saw several talks on EP during this week, I am still agnostic about the potential of the approach. It certainly produces a fast proxy to the true posterior and hence can be exploited ad nauseam in inference methods based on pseudo-models like indirect inference. In conjunction with other quick and dirty approximations when available. As in ABC, it would be most useful to know how far from the (ideal) posterior distribution does the approximation stands. Machine learning approaches presumably allow for an evaluation of the predictive performances, but less so for the modelling accuracy, even with new sampling steps. [But I know nothing, I know!]

Dennis Prangle presented some on-going research on high dimension [data] ABC. Raising the question of what is the true meaning of dimension in ABC algorithms. Or of sample size. Because the inference relies on the event d(s(y),s(y’))≤ξ or on the likelihood l(θ|x). Both one-dimensional. Mentioning Iain Murray’s talk at NIPS [that I also missed]. Re-expressing as well the perspective that ABC can be seen as a missing or estimated normalising constant problem as in Bornn et al. (2015) I discussed earlier. The central idea is to use SMC to simulate a particle cloud evolving as the target tolerance ξ decreases. Which supposes a latent variable structure lurking in the background.

Judith Rousseau gave her talk on non-parametric mixtures and the possibility to learn parametrically about the component weights. Starting with a rather “magic” result by Allman et al. (2009) that three repeated observations per individual, all terms in a mixture are identifiable. Maybe related to that simpler fact that mixtures of Bernoullis are not identifiable while mixtures of Binomial are identifiable, even when n=2. As “shown” in this plot made for X validated. Actually truly related because Allman et al. (2009) prove identifiability through a finite dimensional model. (I am surprised I missed this most interesting paper!) With the side condition that a mixture of p components made of r Bernoulli products is identifiable when p ≥ 2[log² r] +1, when log² is base 2-logarithm. And [x] the upper rounding. I also find most relevant this distinction between the weights and the remainder of the mixture as weights behave quite differently, hardly parameters in a sense.