Archive for identifiability

multilevel linear models, Gibbs samplers, and multigrid decompositions

Posted in Books, Statistics, University life on October 22, 2021 by xi'an

A paper by Giacomo Zanella (formerly Warwick) and Gareth Roberts (Warwick) is about to appear in Bayesian Analysis and is (still) open for discussion. It examines in great detail the convergence properties of several Gibbs versions of the same hierarchical posterior for an ANOVA type linear model. Although this may sound like an old-timer opinion, I find it good to have Gibbs sampling back on track! And to have further attention paid to diagnosing convergence! Also, even after all these years (!), it is always a surprise for me to (re-)realise that different versions of Gibbs samplers may hugely differ in convergence properties.
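As a minimal sketch of how much two Gibbs versions of the same posterior can differ, here is a toy comparison that is not the crossed ANOVA model of the paper: a one-way nested model y_ij ~ N(μ + a_i, σ_e²), a_i ~ N(0, σ_a²), with known variances and a flat prior on μ (all of these being assumptions made for the illustration). The "centred" version updates η_i = μ + a_i, the "non-centred" version updates a_i, and the function and variable names are hypothetical.

```python
import numpy as np

def gibbs_mu_chain(y, sigma_a, sigma_e, centred, n_iter=5000, seed=0):
    """Two-block Gibbs sampler for a toy one-way model; returns the chain of mu."""
    rng = np.random.default_rng(seed)
    I, J = y.shape
    ybar_i = y.mean(axis=1)
    prec = J / sigma_e**2 + 1 / sigma_a**2  # conditional precision of each group effect
    mu = 0.0
    chain = np.empty(n_iter)
    for t in range(n_iter):
        if centred:
            # eta_i | mu, y, then mu | eta (flat prior on mu)
            mean_eta = (J * ybar_i / sigma_e**2 + mu / sigma_a**2) / prec
            eta = mean_eta + rng.standard_normal(I) / np.sqrt(prec)
            mu = eta.mean() + sigma_a / np.sqrt(I) * rng.standard_normal()
        else:
            # a_i | mu, y, then mu | a, y (flat prior on mu)
            mean_a = (J * (ybar_i - mu) / sigma_e**2) / prec
            a = mean_a + rng.standard_normal(I) / np.sqrt(prec)
            mu = ybar_i.mean() - a.mean() + sigma_e / np.sqrt(I * J) * rng.standard_normal()
        chain[t] = mu
    return chain

def lag1_autocorr(x):
    """Empirical lag-1 autocorrelation of a scalar chain."""
    x = x - x.mean()
    return float(np.dot(x[:-1], x[1:]) / np.dot(x, x))

# Simulated data with informative groups (J * sigma_a^2 much larger than sigma_e^2)
rng = np.random.default_rng(1)
I, J, sigma_a, sigma_e = 10, 20, 1.0, 1.0
y = 2.0 + sigma_a * rng.standard_normal((I, 1)) + sigma_e * rng.standard_normal((I, J))

rho_c = lag1_autocorr(gibbs_mu_chain(y, sigma_a, sigma_e, centred=True))
rho_nc = lag1_autocorr(gibbs_mu_chain(y, sigma_a, sigma_e, centred=False))
print(f"lag-1 autocorrelation of mu: centred {rho_c:.2f}, non-centred {rho_nc:.2f}")
```

In this toy setting the μ chain is an AR(1) process with coefficient w = σ_e²/(σ_e² + Jσ_a²) under the centred version and 1 - w under the non-centred one, so the ranking of the two samplers flips when the data within groups become uninformative.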

At first, intuitively, I thought the options (1,0) (c) and (0,1) (d) should perform similarly. But one is “more” hierarchical than the other. While the results exhibiting a theoretical ordering of these choices are impressive, I would suggest pursuing a random exploration of the various parameterisations in order to handle cases where an analytical ordering proves impossible. It would most likely produce a superior performance, as hinted at by Figure 4. (This alternative happens to be briefly mentioned in the Conclusion section.) The notion of choosing the optimal parameterisation at each step is indeed somewhat unrealistic in that the optimality zones exhibited in Figure 4 are unknown in a more general model than the Gaussian ANOVA model. Especially with a high number of parameters, parameterisations, and recombinations in the model (Section 7).

An idle question is about the extension to a more general hierarchical model where recentring is not feasible because of the non-linear nature of the parameters. Even though Gaussianity may not be such a restriction in that other exponential (if artificial) families keeping the ANOVA structure should work as well.

Theorem 1 is quite impressive and wide-ranging. It also reminded (old) me of the interleaving properties and data augmentation versions of the early-day Gibbs. More to the point and to the current era, it offers more possibilities for coupling, parallelism, and speeding up convergence. And for fighting dimension curses.

“in this context, imposing identifiability always improves the convergence properties of the Gibbs Sampler”

Another idle thought of mine is to wonder whether or not there is a limited number of reparameterisations. I think that by creating unidentifiable decompositions of (some) parameters, e.g., μ=μ¹+μ²+.., one can unrestrictedly multiply the number of parameterisations. Instead of imposing hard identifiability constraints as in Section 4.2, my intuition was that this de-identification would improve the mixing behaviour, but this somewhat clashes with the above (rigorous) statement from the authors. So I am proven wrong there!

Unless I missed something, I also wonder at different possible implementations of HMC depending on different parameterisations and whether or not the impact of parameterisation has been studied for HMC. (Which may be linked with Remark 2?)

Naturally amazed at non-identifiability

Posted in Books, Statistics, University life on May 27, 2020 by xi'an

A Nature paper by Stilianos Louca and Matthew W. Pennell, Extant time trees are consistent with a myriad of diversification histories, comes to the extraordinary conclusion that birth-&-death evolutionary models cannot distinguish between several scenarios given the available data! Namely, stem ages and daughter lineage ages cannot identify the speciation rate function λ(.), the extinction rate function μ(.) and the sampling fraction ρ inherently defining the deterministic ODE leading to the number of species predicted at any point τ in time, N(τ). The Nature paper does not seem to make a point beyond the obvious and I am rather perplexed at why it got published [and even highlighted]. A while ago, under the leadership of Steve, PNAS decided to include statistician reviewers for papers relying on statistical arguments. It could be time for Nature to move there as well.

“We thus conclude that two birth-death models are congruent if and only if they have the same rp and the same λp at some time point in the present or past.” [S.1.1, p.4]

Or, stated otherwise, that a tree-structured dataset made of branch lengths is not enough to identify the two functions that parameterise the model. The likelihood looks like

$$\frac{\rho^{n-1}\Psi(\tau_1,\tau_0)}{1-E(\tau)}\prod_{i=1}^n \lambda(\tau_i)\Psi(s_{i,1},\tau_i)\Psi(s_{i,2},\tau_i)$$

where E(.) is the probability to survive to the present and Ψ(s,t) the probability to survive and be sampled between times s and t. Sort of. Both functions depend on the functions λ(.) and μ(.). (When the stem age is unknown, the likelihood changes a wee bit, but with no changes in the qualitative conclusions.) Another way to write this likelihood is in terms of the speciation rate λp


where Λp is the integrated rate, but which shares the same characteristic of being unable to identify the functions λ(.) and μ(.). While this sounds quite obvious, the paper (or rather the supplementary material) goes into fairly extensive mode, including “abstract” algebra to define congruence.


“…we explain why model selection methods based on parsimony or “Occam’s razor”, such as the Akaike Information Criterion and the Bayesian Information Criterion that penalize excessive parameters, generally cannot resolve the identifiability issue…” [S.2, p15]

As illustrated by the above quote, the supplementary material also includes a section about statistical model selection techniques failing to capture the issue, a section that seems superfluous or even absurd once the fact that the likelihood is constant across a congruence class has been stated.
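To see why a constant likelihood makes penalised criteria powerless, here is a toy sketch that stands in for (and is much simpler than) the birth-death likelihood: a Gaussian model whose mean depends on θ1 + θ2 only, so that all pairs with the same sum form a "congruence class". The model, the function names, and the parameter values are assumptions chosen for the illustration, not anything taken from the paper.

```python
import numpy as np

def log_lik(theta1, theta2, y):
    """Gaussian log-likelihood (unit variance) whose mean depends on theta1 + theta2 only."""
    mean = theta1 + theta2
    return float(-0.5 * np.sum((y - mean) ** 2) - 0.5 * y.size * np.log(2 * np.pi))

def aic(loglik, k):
    """Akaike Information Criterion for k free parameters."""
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    """Bayesian Information Criterion for k free parameters and n observations."""
    return k * np.log(n) - 2 * loglik

rng = np.random.default_rng(0)
y = rng.normal(3.0, 1.0, size=200)

# Two members of the same toy congruence class: identical sums, different components
ll_a = log_lik(1.0, 2.0, y)
ll_b = log_lik(-4.0, 7.0, y)
print(ll_a, ll_b)
print(aic(ll_a, 2), aic(ll_b, 2))
print(bic(ll_a, 2, y.size), bic(ll_b, 2, y.size))
```

Both scenarios involve the same number of parameters and the same likelihood value, hence the same AIC and BIC, whatever the sample size.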

priors without likelihoods are like sloths without…

Posted in Books, Statistics on September 11, 2017 by xi'an

“The idea of building priors that generate reasonable data may seem like an unusual idea…”

Andrew, Dan, and Michael arXived an opinion piece last week entitled “The prior can generally only be understood in the context of the likelihood”. Which connects to the earlier Read Paper of Gelman and Hennig I discussed last year. I cannot state strong disagreement with the positions taken in this piece, actually, in that I do not think prior distributions ever occur as a given but are rather chosen as a reference measure to probabilise the parameter space and eventually prioritise regions over others. If anything I find myself even further on the prior agnosticism gradation. (Of course, this lack of disagreement applies to the likelihood understood as a function of both the data and the parameter, rather than of the parameter only, conditional on the data. Priors cannot depend on the data without incurring disastrous consequences!)

“…it contradicts the conceptual principle that the prior distribution should convey only information that is available before the data have been collected.”

The first example is somewhat disappointing in that it revolves, as in so many Bayesian textbooks (since Laplace!), around the [sex ratio] Binomial probability parameter and concludes with the strong or long-lasting impact of the Uniform prior. I do not see much of a contradiction between the use of a Uniform prior and the collection of prior information, if only because there is no standardised way to transfer prior information into prior construction. And more fundamentally because a parameter rarely makes sense by itself, alone, without a model that relates it to potential data. As for instance in a regression model. Moreover, following my epiphany of last semester about the relativity of the prior, I see no damage in the prior being relevant, as I only attach a relative meaning to statements based on the posterior. Rather than trying to limit the impact of a prior, we should instead build assessment tools to measure this impact, for instance by prior predictive simulations. And this is where I come to quite agree with the authors.
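A prior predictive simulation for the Binomial example can be sketched as follows, assuming a Beta(a,b) prior on the probability parameter (the Uniform being Beta(1,1)). The function name, the number of trials, and the peaked Beta(50,50) alternative are hypothetical choices made for the illustration.

```python
import numpy as np

def prior_predictive_freqs(a, b, n_trials, n_sims=200_000, seed=0):
    """Simulate p ~ Beta(a, b), then x ~ Binomial(n_trials, p); return frequencies of x."""
    rng = np.random.default_rng(seed)
    p = rng.beta(a, b, size=n_sims)
    x = rng.binomial(n_trials, p)
    return np.bincount(x, minlength=n_trials + 1) / n_sims

# Uniform prior: the prior predictive is uniform over {0, ..., n_trials}
freq_uniform = prior_predictive_freqs(1.0, 1.0, n_trials=10)
# Hypothetical informative prior concentrated around 1/2
freq_peaked = prior_predictive_freqs(50.0, 50.0, n_trials=10)

print(np.round(freq_uniform, 3))
print(np.round(freq_peaked, 3))
```

Comparing the two frequency vectors shows which datasets each prior considers plausible before any observation is made, which is one way of measuring the impact of the prior rather than trying to suppress it.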

“…non-identifiabilities, and near nonidentifiabilites, of complex models can lead to unexpected amounts of weight being given to certain aspects of the prior.”

Another rather straightforward remark is that non-identifiable models see the impact of a prior remain as the sample size grows. And I still see no issue with this fact in a relative approach. When the authors mention (p.7) that purely mathematical priors perform more poorly than weakly informative priors, it is hard to see what they mean by this “performance”.
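The persistence of the prior can be made explicit in a toy derivation, under assumptions that are not in the paper: Normal observations whose mean only involves θ1+θ2, with independent standard Normal priors on both components.

```latex
% Toy model (an assumption for illustration, not taken from the paper)
y_1,\dots,y_n \mid \theta_1,\theta_2 \overset{\text{iid}}{\sim} \mathcal{N}(\theta_1+\theta_2,\,1),
\qquad \theta_1,\theta_2 \overset{\text{iid}}{\sim} \mathcal{N}(0,1).

% Reparameterise as s = \theta_1+\theta_2 (identified) and d = \theta_1-\theta_2 (not identified).
% A priori, s and d are independent \mathcal{N}(0,2) variates, and the likelihood involves s only, hence
\pi(s,d \mid y) = \pi(s \mid y)\,\pi(d), \qquad
s \mid y \sim \mathcal{N}\!\left(\frac{n\bar{y}}{n+1/2},\ \frac{1}{n+1/2}\right), \qquad
d \mid y \sim \mathcal{N}(0,2)\ \text{for every } n.
```

The posterior on the identified part s concentrates at rate 1/n, while the posterior on d is the prior, unchanged, no matter how many observations are collected.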

“…judge a prior by examining the data generating processes it favors and disfavors.”

Besides those points, I completely agree with them about the fundamental relevance of the prior as a generative process, only when the likelihood becomes available. And simulatable. (This point is found in many references, including our response to the American Statistician paper Hidden dangers of specifying noninformative priors, with Kaniav Kamary. With the same illustration on a logistic regression.) I also agree with their criticism of the marginal likelihood and Bayes factors as being so strongly impacted by the choice of a prior, if treated as absolute quantities. I also, if more reluctantly and somewhat heretically, see a point in using the posterior predictive for assessing whether a prior is relevant for the data at hand. At least at a conceptual level. I am however less certain about how to handle improper priors based on their recommendations. In conclusion, it would be great to see one [or more] of the authors at O-Bayes 2017 in Austin as I am sure it would stimulate nice discussions there! (And by the way I have no prior idea on how to conclude the comparison in the title!)

non-identifiability in Venezia

Posted in Books, pictures, Statistics, Travel, University life on November 2, 2016 by xi'an

Last Wednesday, I attended a seminar by T. Kitagawa at the economics seminar of the University Ca’ Foscari, in Venice, which was about (uncertain) identifiability and a sort of meta-Bayesian approach to the problem. Just to give an intuition about the setting, a toy example is a simultaneous equation model Ax=ξ, where x and ξ are two-dimensional vectors, ξ being a standard bivariate Normal noise. In that case, A is not completely identifiable. The argument in the talk (and the paper) is that the common Bayesian answer that sets a prior on the non-identifiable part (which is an orthogonal matrix in the current setting) is debatable as it impacts inference on the non-identifiable parts, even in the long run. Which seems fine from my viewpoint. The authors propose to instead consider the range of possible priors that are compatible with the set restrictions on the non-identifiable parts and to introduce a mixture between a regular prior on the whole parameter A and this collection of priors, which can be seen as a set-valued prior although this does not fit within the Bayesian framework in my opinion. Once this mixture is constructed, a formal posterior weight on the regular prior can be derived. As well as a range of posterior values for all quantities of interest. While this approach connects with imprecise probabilities à la Walley (?) and links with robust Bayesian studies of the 1980s, I always have difficulties with the global setting of such models, which do not come under criticism while being inadequate. (Of course, there are many more things I do not understand in econometrics!)
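The lack of identifiability in the toy example can be checked numerically: since x = A⁻¹ξ with ξ standard Normal, x is Normal with covariance (AᵀA)⁻¹, which is unchanged when A is replaced with QA for any orthogonal Q. The sketch below assumes an arbitrary invertible A and a rotation Q picked for the illustration; the function names are hypothetical.

```python
import numpy as np

def implied_cov(A):
    """Covariance of x when A x = xi and xi is standard bivariate Normal."""
    A_inv = np.linalg.inv(A)
    return A_inv @ A_inv.T

def rotation(angle):
    """2x2 rotation matrix, a special case of an orthogonal matrix."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

A = np.array([[1.0, 0.5], [-0.3, 2.0]])  # arbitrary invertible matrix
Q = rotation(0.7)                        # arbitrary rotation angle

cov_A = implied_cov(A)
cov_QA = implied_cov(Q @ A)
print(cov_A)
print(cov_QA)
```

Both matrices induce the same distribution on x, so no amount of data can tell A from QA and only the prior separates them.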

asymptotic properties of Approximate Bayesian Computation

Posted in pictures, Statistics, Travel, University life on July 26, 2016 by xi'an

[Picture: Street light near the St Kilda Road bridge, Melbourne, July 21, 2012]

With David Frazier and Gael Martin from Monash University, and with Judith Rousseau (Paris-Dauphine), we have now completed and arXived a paper entitled Asymptotic Properties of Approximate Bayesian Computation. This paper undertakes a fairly complete study of the large sample properties of ABC under weak regularity conditions. We produce therein sufficient conditions for posterior concentration, asymptotic normality of the ABC posterior estimate, and asymptotic normality of the ABC posterior mean. Moreover, those (theoretical) results are of significant import for practitioners of ABC as they pertain to the choice of tolerance ε used within ABC for selecting parameter draws. In particular, they [the results] contradict the conventional ABC wisdom that this tolerance should always be taken as small as the computing budget allows.
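For readers less familiar with the role of the tolerance, the basic ABC accept-reject step can be sketched as below, on a toy Normal mean problem with the sample mean as summary statistic. The model, the Normal prior, the tolerance values, and the function name are all assumptions made for the illustration and are not taken from the paper.

```python
import numpy as np

def abc_rejection(y_obs, eps, n_draws=20_000, prior_sd=3.0, seed=0):
    """ABC accept-reject: keep prior draws whose simulated summary is within eps of the observed one."""
    rng = np.random.default_rng(seed)
    s_obs = y_obs.mean()
    theta = rng.normal(0.0, prior_sd, size=n_draws)            # draws from the prior
    pseudo = rng.normal(theta[:, None], 1.0, size=(n_draws, y_obs.size))  # pseudo-data
    s_sim = pseudo.mean(axis=1)                                # simulated summaries
    return theta[np.abs(s_sim - s_obs) <= eps]

y_obs = np.random.default_rng(42).normal(1.0, 1.0, size=50)
acc_large = abc_rejection(y_obs, eps=0.5)
acc_small = abc_rejection(y_obs, eps=0.1)

print(acc_large.size, acc_large.mean(), acc_large.std())
print(acc_small.size, acc_small.mean(), acc_small.std())
```

A smaller ε returns a tighter ABC posterior sample at the price of fewer accepted draws, which is the trade-off the asymptotic results speak to.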

Now, this paper bears some similarities with our earlier paper on the consistency of ABC, written with David and Gael. As it happens, the paper was rejected after submission and I then discussed it in an internal seminar in Paris-Dauphine, with Judith taking part in the discussion and quickly suggesting some alternative approach that is now central to the current paper. The previous version analysed Bayesian consistency of ABC under specific uniformity conditions on the summary statistics used within ABC. But the conditions for consistency are now much weaker than earlier, thanks to Judith’s input!

There are also similarities with Li and Fearnhead (2015). Previously discussed here. However, while similar in spirit, the results contained in the two papers strongly differ on several fronts:

  1. Li and Fearnhead (2015) considers an ABC algorithm based on kernel smoothing, whereas our interest is the original ABC accept-reject and its many derivatives;
  2. our theoretical approach permits a complete study of the asymptotic properties of ABC, posterior concentration, asymptotic normality of ABC posteriors, and asymptotic normality of the ABC posterior mean, whereas Li and Fearnhead (2015) is only concerned with asymptotic normality of the ABC posterior mean estimator (and various related point estimators);
  3. the results of Li and Fearnhead (2015) are derived under very strict uniformity and continuity/differentiability conditions, which bear a strong resemblance to those conditions in Yuan and Clark (2004) and Creel et al. (2015), while the results herein do not rely on such conditions and only assume very weak regularity conditions on the summary statistics themselves; this difference allows us to characterise the behaviour of ABC in situations not covered by the approach taken in Li and Fearnhead (2015);