Archive for identifiability

priors without likelihoods are like sloths without…

Posted in Books, Statistics on September 11, 2017 by xi'an

“The idea of building priors that generate reasonable data may seem like an unusual idea…”

Andrew, Dan, and Michael arXived an opinion piece last week entitled “The prior can generally only be understood in the context of the likelihood”. Which connects to the earlier Read Paper of Gelman and Hennig I discussed last year. I cannot state strong disagreement with the positions taken in this piece, actually, in that I do not think prior distributions ever occur as a given but are rather chosen as a reference measure to probabilise the parameter space and eventually prioritise regions over others. If anything, I find myself even further along the prior agnosticism gradation. (Of course, this lack of disagreement applies to the likelihood understood as a function of both the data and the parameter, rather than of the parameter only, conditional on the data. Priors cannot depend on the data without incurring disastrous consequences!)

“…it contradicts the conceptual principle that the prior distribution should convey only information that is available before the data have been collected.”

The first example is somewhat disappointing in that it revolves, as so many Bayesian textbooks (since Laplace!) do, around the [sex ratio] Binomial probability parameter and concludes on the strong or long-lasting impact of the Uniform prior. I do not see much of a contradiction between the use of a Uniform prior and the collection of prior information, if only because there is no standardised way to transfer prior information into prior construction. And more fundamentally because a parameter rarely makes sense by itself, alone, without a model that relates it to potential data. As for instance in a regression model. Moreover, following my epiphany of last semester about the relativity of the prior, I see no damage in the prior remaining relevant, as I only attach a relative meaning to statements based on the posterior. Rather than trying to limit the impact of a prior, we should instead build assessment tools to measure this impact, for instance by prior predictive simulations. And this is where I come to quite agree with the authors.
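
For the sake of illustration, here is a minimal prior predictive simulation in R for this very Binomial example, with a Uniform prior on the probability of a male birth; the sample size of 1,000 births is an arbitrary choice of mine, not a figure from the paper:

```r
set.seed(101)
n <- 1e3                                  # hypothetical number of recorded births
p <- runif(1e4)                           # draws from the Uniform prior on the sex ratio
y <- rbinom(1e4, size = n, prob = p)      # prior predictive counts of male births
hist(y / n, breaks = 50,
     main = "prior predictive sex ratios under a Uniform prior",
     xlab = "proportion of male births")
```

The generated sex ratios spread evenly over (0,1), which is the kind of output the authors suggest examining when judging whether a prior produces reasonable data.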

“…non-identifiabilities, and near non-identifiabilities, of complex models can lead to unexpected amounts of weight being given to certain aspects of the prior.”

Another rather straightforward remark is that, for non-identifiable models, the impact of the prior remains as the sample size grows. And I still see no issue with this fact in a relative approach. When the authors mention (p.7) that purely mathematical priors perform more poorly than weakly informative priors, it is hard to see what they mean by this “performance”.

“…judge a prior by examining the data generating processes it favors and disfavors.”

Besides those points, I completely agree with them about the fundamental relevance of the prior as a generative process, but only once the likelihood becomes available. And simulatable. (This point is found in many references, including our response, written with Kaniav Kamary, to the American Statistician paper Hidden dangers of specifying noninformative priors. With the same illustration on a logistic regression.) I also agree with their criticism of the marginal likelihood and Bayes factors as being so strongly impacted by the choice of a prior, if treated as absolute quantities. I also, if more reluctantly and somewhat heretically, see a point in using the posterior predictive for assessing whether a prior is relevant for the data at hand. At least at a conceptual level. I am however less certain about how to handle improper priors based on their recommendations. In conclusion, it would be great to see one [or more] of the authors at O-Bayes 2017 in Austin, as I am sure it would spur nice discussions there! (And by the way I have no prior idea on how to conclude the comparison in the title!)

non-identifiability in Venezia

Posted in Books, pictures, Statistics, Travel, University life on November 2, 2016 by xi'an

Last Wednesday, I attended a talk by T. Kitagawa at the economics seminar of the University Ca’ Foscari, in Venice, which was about (uncertain) identifiability and a sort of meta-Bayesian approach to the problem. Just to give an intuition about the setting, a toy example is a simultaneous equation model Ax=ξ, where x and ξ are two-dimensional vectors, ξ being a standard bivariate Normal noise. In that case, A is not completely identifiable. The argument in the talk (and the paper) is that the common Bayesian answer of setting a prior on the non-identifiable part (which is an orthogonal matrix in the current setting) is debatable, as it impacts inference on the non-identifiable parts even in the long run. Which seems fine from my viewpoint. The authors propose to instead consider the range of possible priors that are compatible with the set restrictions on the non-identifiable parts and to introduce a mixture between a regular prior on the whole parameter A and this collection of priors, which can be seen as a set-valued prior, although this does not fit within the Bayesian framework in my opinion. Once this mixture is constructed, a formal posterior weight on the regular prior can be derived. As well as a range of posterior values for all quantities of interest. While this approach connects with imprecise probabilities à la Walley (?) and links with the robust Bayesian studies of the 1980’s, I always have difficulties with the global setting of such models, which escape criticism while possibly being inadequate. (Of course, there are many more things I do not understand in econometrics!)
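
To fix ideas on the toy example, here is a small R check of my own (not taken from the talk or the paper) that A and QA, with Q orthogonal, generate the same distribution for x when Ax=ξ, so that the data can at best identify A up to such a factor; the matrix A and the rotation angle are arbitrary choices:

```r
set.seed(42)
A  <- matrix(c(2, 0.5, -1, 1.5), 2, 2)                     # arbitrary invertible matrix
th <- pi / 3                                               # arbitrary rotation angle
Q  <- matrix(c(cos(th), sin(th), -sin(th), cos(th)), 2, 2) # orthogonal (rotation) matrix
xi <- matrix(rnorm(2e4), nrow = 2)                         # standard bivariate Normal noise
x1 <- solve(A) %*% xi                                      # x solving A x = xi
x2 <- solve(Q %*% A) %*% xi                                # x solving (QA) x = xi
round(cov(t(x1)), 2)                                       # both empirical covariances are
round(cov(t(x2)), 2)                                       # close to (A'A)^{-1}
```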

asymptotic properties of Approximate Bayesian Computation

Posted in pictures, Statistics, Travel, University life on July 26, 2016 by xi'an

[Street light near the St Kilda Road bridge, Melbourne, July 21, 2012]

With David Frazier and Gael Martin from Monash University, and with Judith Rousseau (Paris-Dauphine), we have now completed and arXived a paper entitled Asymptotic Properties of Approximate Bayesian Computation. This paper undertakes a fairly complete study of the large sample properties of ABC under weak regularity conditions. We produce therein sufficient conditions for posterior concentration, asymptotic normality of the ABC posterior, and asymptotic normality of the ABC posterior mean. Moreover, those (theoretical) results are of significant import for practitioners of ABC as they pertain to the choice of tolerance ε used within ABC for selecting parameter draws. In particular, they [the results] contradict the conventional ABC wisdom that this tolerance should always be taken as small as the computing budget allows.
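
For readers unfamiliar with the method, here is a bare-bones accept-reject ABC sketch in R on a Normal mean toy model of my own choosing (not one of the paper's experiments), which makes explicit where the tolerance ε enters, here set as a small quantile of the simulated distances:

```r
set.seed(7)
yobs  <- rnorm(100, mean = 2)               # observed sample, true mean 2
sobs  <- mean(yobs)                         # summary statistic
N     <- 1e5
theta <- rnorm(N, 0, 10)                    # draws from a N(0, 10^2) prior
ssim  <- rnorm(N, theta, 1 / sqrt(100))     # summary of a pseudo-sample of size 100
eps   <- quantile(abs(ssim - sobs), 0.01)   # tolerance = 1% quantile of the distances
abc   <- theta[abs(ssim - sobs) <= eps]     # accepted parameter values
c(mean(abc), sd(abc))                       # ABC posterior mean and spread
```

Whether shrinking ε further (at a higher computing cost) actually improves the approximation is precisely the type of question the asymptotic results address.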

Now, this paper bears some similarities with our earlier paper on the consistency of ABC, written with David and Gael. As it happens, that earlier paper was rejected after submission and I then discussed it in an internal seminar in Paris-Dauphine, with Judith taking part in the discussion and quickly suggesting an alternative approach that is now central to the current paper. The previous version analysed Bayesian consistency of ABC under specific uniformity conditions on the summary statistics used within ABC. But the conditions for consistency are now much weaker than earlier, thanks to Judith’s input!

There are also similarities with Li and Fearnhead (2015). Previously discussed here. However, while similar in spirit, the results contained in the two papers strongly differ on several fronts:

  1. Li and Fearnhead (2015) consider an ABC algorithm based on kernel smoothing, whereas our interest is in the original ABC accept-reject and its many derivatives;
  2. our theoretical approach permits a complete study of the asymptotic properties of ABC, namely posterior concentration, asymptotic normality of ABC posteriors, and asymptotic normality of the ABC posterior mean, whereas Li and Fearnhead (2015) are only concerned with asymptotic normality of the ABC posterior mean estimator (and various related point estimators);
  3. the results of Li and Fearnhead (2015) are derived under very strict uniformity and continuity/differentiability conditions, which bear a strong resemblance to those in Yuan and Clark (2004) and Creel et al. (2015), while the results herein do not rely on such conditions and only assume very weak regularity conditions on the summary statistics themselves; this difference allows us to characterise the behaviour of ABC in situations not covered by the approach taken in Li and Fearnhead (2015).

at CIRM [#3]

Posted in Kids, Mountains, pictures, Running, Statistics, Travel, University life on March 4, 2016 by xi'an

Simon Barthelmé gave his mini-course on EP, with loads of details on the implementation of the method. Focussing on the EP-ABC and MCMC-EP versions today. Leaving open the difficulty of assessing to which limit EP is converging. But mentioning the potential for asynchronous EP (on which I would like to hear more). Ironically using a logistic regression example several times, if not on the Pima Indians benchmark! He also talked about approximate EP solutions that relate to consensus MCMC. With a connection to Mark Beaumont’s talk at NIPS [at the same time as mine!] on the comparison with ABC. While we saw several talks on EP during this week, I am still agnostic about the potential of the approach. It certainly produces a fast proxy to the true posterior and hence can be exploited ad nauseam in inference methods based on pseudo-models like indirect inference. In conjunction with other quick and dirty approximations when available. As in ABC, it would be most useful to know how far from the (ideal) posterior distribution the approximation stands. Machine learning approaches presumably allow for an evaluation of the predictive performances, but less so for the modelling accuracy, even with new sampling steps. [But I know nothing, I know!]

Dennis Prangle presented some on-going research on high-dimension [data] ABC. Raising the question of what is the true meaning of dimension in ABC algorithms. Or of sample size. Because the inference relies on the event d(s(y),s(y’))≤ξ or on the likelihood l(θ|x). Both one-dimensional. Mentioning Iain Murray’s talk at NIPS [that I also missed]. Re-expressing as well the perspective that ABC can be seen as a missing or estimated normalising constant problem, as in Bornn et al. (2015), which I discussed earlier. The central idea is to use SMC to simulate a particle cloud evolving as the target tolerance ξ decreases. Which supposes a latent variable structure lurking in the background.
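
As a crude rendering of this idea (mine, not Dennis's algorithm), here is an ABC-SMC sketch in R for a Normal mean toy model, with the cloud reweighted as in ABC population Monte Carlo and the tolerance halved at each step; all constants are arbitrary:

```r
set.seed(7)
yobs  <- rnorm(100, 2); sobs <- mean(yobs)   # observed summary
M     <- 1000
theta <- rnorm(M, 0, 10)                     # initial cloud from the N(0, 10^2) prior
wts   <- rep(1 / M, M)
eps   <- 5
for (t in 1:6) {
  eps <- eps / 2                             # shrink the tolerance
  tau <- 2 * sd(theta)                       # perturbation scale from the current cloud
  new_theta <- new_wts <- numeric(M)
  for (i in 1:M) {
    repeat {                                 # move a resampled particle until accepted
      cand <- rnorm(1, sample(theta, 1, prob = wts), tau)
      if (abs(rnorm(1, cand, 0.1) - sobs) <= eps) break
    }
    new_theta[i] <- cand                     # importance weight correcting the proposal
    new_wts[i]   <- dnorm(cand, 0, 10) / sum(wts * dnorm(cand, theta, tau))
  }
  theta <- new_theta
  wts   <- new_wts / sum(new_wts)
}
weighted.mean(theta, wts)                    # ABC posterior mean at the final tolerance
```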

Judith Rousseau gave her talk on non-parametric mixtures and the possibility of learning parametrically about the component weights. Starting with a rather “magic” result by Allman et al. (2009) that, with three repeated observations per individual, all terms in a mixture are identifiable. Maybe related to the simpler fact that mixtures of Bernoullis are not identifiable while mixtures of Binomials are identifiable, even when n=2. As “shown” in this plot made for X validated. Actually truly related because Allman et al. (2009) prove identifiability through a finite dimensional model. (I am surprised I missed this most interesting paper!) With the side condition that a mixture of p components made of r Bernoulli products is identifiable when r ≥ 2⌈log₂ p⌉ + 1, where log₂ is the base-2 logarithm and ⌈·⌉ the upper rounding. I also find most relevant this distinction between the weights and the remainder of the mixture, as the weights behave quite differently, hardly being parameters in a sense.
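
As a quick numerical check of my own (in the spirit of that X validated plot), here are two parameter configurations that share the same Bernoulli success rate, hence the same Bernoulli mixture, yet lead to distinct Binomial(2) mixtures; the weights and probabilities are arbitrary:

```r
w1 <- 0.3; p1 <- c(0.2, 0.8)                 # first two-component mixture
w2 <- 0.5; p2 <- c(0.38, 0.86)               # second mixture, tuned to match the mean
c(w1 * p1[1] + (1 - w1) * p1[2],
  w2 * p2[1] + (1 - w2) * p2[2])             # identical Bernoulli success rates (0.62)
binmix <- function(w, p)                     # P(X = 0, 1, 2) for a Binomial(2) mixture
  w * dbinom(0:2, 2, p[1]) + (1 - w) * dbinom(0:2, 2, p[2])
rbind(binmix(w1, p1), binmix(w2, p2))        # two distinct trinomial distributions
```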

twilight zone [of statistics]

Posted in Books, pictures, R, Statistics, University life on February 26, 2016 by xi'an

[mixture with unknown means]

“I have decided that mixtures, like tequila, are inherently evil and should be avoided at all costs.” L. Wasserman

Larry Wasserman once remarked that finite mixtures were like the twilight zone of statistics, thanks to the numerous idiosyncrasies associated with such models. And George Casella had similar strong reservations about mixture estimation. Avi Feller and co-authors [including Natesh Pillai] have just arXived a paper on this topic, exhibiting shocking (!) properties of the MLE! Their core example is a mixture of two normal distributions with known common variance and known weight different from 0.5, which ensures identifiability. This is a favourite example of mine that we used for instance in our book Introducing Monte Carlo methods with R. If only because we can plot the likelihood and posterior surfaces. (Warning: I wrote those notes on an earlier version of the paper, so mileage may vary in terms of accuracy!)

The “shocking” discovery in the paper is that the MLE is wrong as often as not in selecting the sign of the difference Δ between both means, with an additional accumulation point at zero. The global mode may thus be in the wrong place for small enough sample sizes. And even for larger sizes: when the difference between the means is small, the likelihood is likely to be unimodal with a mode quite close to zero. (An interesting remark is that the likelihood derivative is always zero at Δ=0 when considering the special case of the two means being equal to -Δ and to πΔ/(1-π), respectively, which implies that the overall mean of the mixture is equal to zero. A potential connection with our reparameterisation paper, maybe?)
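
Here is a quick R simulation of my own (not reproducing the paper's experiments; the weight 0.7, Δ=0.5, and sample size 50 are arbitrary) of how often the global mode picks the wrong sign of Δ, approximating the global MLE by the better of two local searches:

```r
set.seed(123)
w <- 0.7; Delta <- 0.5; n <- 50; nrep <- 200
loglik <- function(mu, x)                    # known weight w and unit variance
  sum(log(w * dnorm(x, mu[1]) + (1 - w) * dnorm(x, mu[2])))
signs <- replicate(nrep, {
  z <- rbinom(n, 1, w)
  x <- rnorm(n, ifelse(z == 1, Delta / 2, -Delta / 2))  # true Delta = mu1 - mu2 > 0
  f1 <- optim(c(1, -1), loglik, x = x, control = list(fnscale = -1))
  f2 <- optim(c(-1, 1), loglik, x = x, control = list(fnscale = -1))
  best <- if (f1$value >= f2$value) f1 else f2          # crude global mode
  sign(best$par[1] - best$par[2])
})
mean(signs < 0)                              # frequency of a wrong sign for Delta
```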

The alternative proposed by Avi and his co-authors is to proceed through moments, i.e., to revert to Pearson (1894). There are however difficulties with this approach, first and foremost the non-uniqueness of the moment equations used to estimate Δ. For instance, the second cumulant equation chosen by the authors is not always defined, as opposed to the third cumulant equation (why not use this third cumulant then?). Which does not always produce the right sign… But, in a strange twist, the authors turn those deficiencies into signals for both pathologies (wrong sign and “pile-up” at zero).
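
To make the two equations concrete, here is my own rendering (not necessarily the exact equations of the paper) of the second- and third-cumulant estimates of Δ for the mixture πN(m₁,1) + (1−π)N(m₂,1) with known weight π ≠ ½, based on Var(X) = 1 + π(1−π)Δ² and on the third central moment μ₃ = π(1−π)(1−2π)Δ³:

```r
moment_estimates <- function(x, w) {         # w = known weight, unit component variance
  v  <- mean((x - mean(x))^2)                # second central moment
  m3 <- mean((x - mean(x))^3)                # third central moment
  d2 <- sqrt((v - 1) / (w * (1 - w)))        # from v = 1 + w(1-w) Delta^2; NaN when v < 1
  r3 <- m3 / (w * (1 - w) * (1 - 2 * w))     # from m3 = w(1-w)(1-2w) Delta^3
  d3 <- sign(r3) * abs(r3)^(1 / 3)           # signed cube root, keeps a sign for Delta
  c(second = d2, third = d3)
}
set.seed(1)
w <- 0.7; Delta <- 0.5
z <- rbinom(500, 1, w)
x <- rnorm(500, ifelse(z == 1, Delta, 0))    # component means Delta and 0
moment_estimates(x, w)
```

On repeated samples the square root is occasionally undefined and the sign of the sample third moment is quite noisy, which is where the two deficiencies mentioned above come from.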

“…the grid bootstrap yields an exact p-value for any valid test statistic.”

The most important issue in this framework being the estimation of the parameters, the authors opt for an approach based on tests, which is definitely surprising given the well-known deficiencies of standard tests in mixtures. The test chosen here is a Wald test with a statistic equal to the χ² version of the first cumulant differences. I am surprised that the χ² approximation works in such an unfriendly setting. And I do not understand how the grid is used, unless a certain degree of approximation is accepted, which takes us back to the “dark ages” of imposing a minimal distance Δ to achieve consistency, as in Ghosh and Sen (1985).


“…our concern about sign error is trivial in the Bayesian setting: the global mode is simply a poor summary of a multi-modal posterior. More broadly, the weak identification issues we highlight in this paper are not necessarily relevant to a strict Bayesian.”

A priori, I do not think pathologies of the MLE always transfer to Bayes estimators, unless one uses the MAP as a [poor] estimator. But using the MAP is not necessary since posterior means are meaningful in this identified setting, where label switching should not occur. However, running the same experiments with a Gaussian prior on both means and using the posterior mean as my estimator, I did obtain the same pathology of Bayes estimates [also produced in the supplementary material] not concentrating on the true value of the difference, but putting weight on the opposite value and at zero. Using a less standard prior inspired by David Rossell’s talk on non-local priors two weeks ago, which avoids a neighbourhood of zero, I did not get a much different picture, as illustrated below:

[muminusmux0]
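
For the record, the kind of experiment described above only takes a few lines of R; this is my own bare-bones Gibbs sampler (not the code behind the figures), with N(0, 5²) priors on both means, known weight and unit variance, returning the posterior mean of Δ:

```r
gibbs_delta <- function(x, w, tau = 5, niter = 2000, burn = 500) {
  n <- length(x)
  mu <- c(1, -1)                               # arbitrary starting values
  draws <- numeric(niter)
  for (t in 1:niter) {
    p1 <- w * dnorm(x, mu[1])
    z  <- rbinom(n, 1, p1 / (p1 + (1 - w) * dnorm(x, mu[2])))   # latent allocations
    for (k in 1:2) {
      idx   <- if (k == 1) z == 1 else z == 0
      prec  <- sum(idx) + 1 / tau^2            # posterior precision, sigma = 1
      mu[k] <- rnorm(1, sum(x[idx]) / prec, 1 / sqrt(prec))
    }
    draws[t] <- mu[1] - mu[2]
  }
  mean(draws[-(1:burn)])                       # posterior mean of Delta
}
set.seed(2)                                    # a simulated sample with true Delta = 0.5
z <- rbinom(50, 1, 0.7)
x <- rnorm(50, ifelse(z == 1, 0.25, -0.25))
gibbs_delta(x, w = 0.7)
```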

Overall, I remain somewhat uncertain as to what to conclude from this pathological behaviour. When both means are close enough, the sign of the difference is often estimated wrongly. But that could simply mean that the means are not significantly different, for that sample size…

consistency of ABC

Posted in pictures, Statistics, Travel, University life on August 25, 2015 by xi'an

Along with David Frazier and Gael Martin from Monash University, Melbourne, we have just completed (and arXived) a paper on the (Bayesian) consistency of ABC methods, producing sufficient conditions on the summary statistics to ensure consistency of the ABC posterior. Consistency in the sense of the posterior concentrating at the true value of the parameter when the sample size and the inverse tolerance (intolerance?!) go to infinity. The conditions are essentially that the summary statistic concentrates around its mean and that this mean identifies the parameter. They are thus weaker conditions than those found in earlier consistency results, where the authors considered convergence to the genuine posterior distribution (given the summary), as for instance in Biau et al. (2014) or Li and Fearnhead (2015). We do not require here a specific rate of decrease to zero for the tolerance ε. But still they do not hold all the time, as shown for the MA(2) example and its first two autocorrelation summaries, an example we started using in the Marin et al. (2011) survey. We further propose a consistency assessment based on the main consistency theorem, namely that the ABC-based estimates of the marginal posterior densities for the parameters should vary little when adding extra components to the summary statistic, with these densities estimated from simulated data. And that the mean of the resulting summary statistic is indeed one-to-one. This may sound somewhat similar to the stepwise search algorithm of Joyce and Marjoram (2008), but those authors aim at obtaining a vector of summary statistics that is as informative as possible. We also examine the consistency conditions when using an auxiliary model as in indirect inference. For instance, when using an AR(2) auxiliary model for estimating an MA(2) model. And ODEs.
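
A crude version of this assessment, on a Normal mean toy model of my own choosing rather than the MA(2) example, compares the ABC marginal density based on the sample mean alone with the one obtained when adding the sample median as an extra summary; the two density estimates should barely differ when the first summary already identifies the parameter:

```r
set.seed(7)
yobs  <- rnorm(100, 2)
s_obs <- c(mean(yobs), median(yobs))
N     <- 2e4
theta <- rnorm(N, 0, 10)                           # prior draws
sim   <- matrix(rnorm(N * 100, theta), N, 100)     # one pseudo-sample per row
s_sim <- cbind(rowMeans(sim), apply(sim, 1, median))
d1 <- abs(s_sim[, 1] - s_obs[1])                   # distance using the mean alone
d2 <- sqrt((s_sim[, 1] - s_obs[1])^2 + (s_sim[, 2] - s_obs[2])^2)
abc1 <- theta[d1 <= quantile(d1, 0.01)]            # ABC samples, one summary
abc2 <- theta[d2 <= quantile(d2, 0.01)]            # ABC samples, two summaries
plot(density(abc1), main = "ABC marginals with and without the extra summary")
lines(density(abc2), lty = 2)                      # near-identical curves expected
```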

Bayesian inference for partially identified models [book review]

Posted in Books, Statistics, University life on July 9, 2015 by xi'an

“The crux of the situation is that we lack theoretical insight into even quite basic questions about what is going on. More particularly, we cannot say anything about the limiting posterior marginal distribution of α compared to the prior marginal distribution of α.” (p.142)

Bayesian inference for partially identified models is a recent CRC Press book by Paul Gustafson that I received with keen interest for a review in CHANCE! If only because the concept of unidentifiability has always puzzled me. And because I have never fully understood what I felt was a sort of joker card, namely that a Bayesian model was the easy solution to the problem since the prior compensates for the components of the parameter not identified by the data. As defended by Dennis Lindley, for whom “unidentifiability causes no real difficulties in the Bayesian approach”. However, after reading the book, I am less excited, in that I do not feel it answers this type of question about non-identifiable models and in that it is exclusively centred on the [undoubtedly long-term and multifaceted] research of the author on the topic.

“Without Bayes, the feeling is that all the data can do is locate the identification region, without conveying any sense that some values in the region are more plausible than others.” (p.47)

Overall, the book is pleasant to read, with a light and witty style. The notational conventions are somewhat unconventional but well explained, to distinguish θ from θ* from θ. The format of the chapters is quite similar: a definition of the partially identified model, an exhibition of the transparent reparameterisation, the computation of the limiting posterior distribution [of the non-identified part], and a demonstration [which took me several iterations to read as the English exhibition rather than the French proof, pardon my French!]. Chapter titles suffer from an excess of the “further” denomination… The models themselves are mostly of one kind, namely binary observables and non-observables leading to partially observed multinomials with some non-identifiable probabilities. As in missing-at-random models (Chapter 3). In my opinion, it is only in the final chapters that the important questions are spelled out, not always faced with a definitive answer. In essence, I did not get from the book (i) a characterisation of the non-identifiable parts of a model, of the identifiability of unidentifiability, and of the universality of the transparent reparameterisation, (ii) a tool to assess the impact of a particular prior and possibly to set it aside, and (iii) a limitation on the amount of unidentifiability still allowing for coherent inference. Hence, when closing the book, I still remain in the dark (or at least in the grey) on how to handle partially identified models. The author convincingly argues that there is no special advantage to using a misspecified if identifiable model over a partially identified model, for this imbues false confidence (p.162); however, we also need the toolbox to verify this is indeed the case.
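
As a toy illustration of a transparent reparameterisation and of its limiting posterior (my own example, not one of the book's models): with observations x_i ~ N(θ₁+θ₂, 1), only φ = θ₁+θ₂ is identified, and under independent N(0,1) priors the marginal posterior of θ₁ converges to the conditional prior of θ₁ given φ, hence keeps a variance of ½ however large the sample:

```r
set.seed(3)
for (n in c(10, 1e3, 1e5)) {
  x <- rnorm(n, mean = 1.5)                      # data depend on theta1 + theta2 only
  post_prec <- 1 / 2 + n                         # posterior precision of phi (prior N(0,2))
  phi    <- rnorm(1e4, n * mean(x) / post_prec, 1 / sqrt(post_prec))
  theta1 <- rnorm(1e4, phi / 2, sqrt(1 / 2))     # theta1 | phi under the N(0,1) priors
  cat("n =", n, " posterior sd of theta1 =", round(sd(theta1), 3), "\n")
}
```

The posterior standard deviation of θ₁ settles around √½ ≈ 0.71, the “posterior variance tending to a positive limit” phenomenon stressed in the author's final points below.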

“Given the data we can turn the Bayesian computational crank nonetheless and see what comes out.” (p.xix)

“It is this author’s contention that computation with partially identified models is a “bottleneck” issue.” (p.141)

Bayesian inference for partially identified models is particularly concerned with computational issues, and rightly so. It is however unclear to me (without more time to invest in the topic) why the “use of general-purpose software is limited to the [original] parametrisation” (p.24) and why importance sampling would do better than MCMC on a general basis. I would definitely have liked more details on this aspect. There is a computational considerations section at the end of the book, but it remains too allusive for my taste. My naïve intuition would be that the lack of identifiability leads to a flatter posterior and hence to easier MCMC moves, but Paul Gustafson reports instead bad mixing from standard MCMC schemes (like WinBUGS).

In conclusion, the book opens a new perspective on the relevance of partially identifiable models, trying to lift the stigma associated with them, and calls for further theory and methodology to deal with those. Here are the author’s final points (p.162):

  • “Identification is nuanced. Its absence does not preclude a parameter being well estimated, nor does its presence guarantee a parameter can be well estimated.”
  • “If we really took limitations of study designs and data quality seriously, then partially identifiable models would crop up all the time in a variety of scientific fields.”
  • “Making modeling assumptions for the sole purpose of gaining full identification can be a mug’s game (…)”
  • “If we accept partial identifiability, then consequently we need to regard sample size differently. There are profound implications of posterior variance tending to a positive limit as the sample size grows.”

These points may be challenging enough to prompt one to read Bayesian inference for partially identified models and to make up one’s mind about their eventual relevance in statistical modelling.

[Disclaimer about potential self-plagiarism: this post will also be published as a book review in my CHANCE column.]