Archive for misspecified model

conditioning on insufficient statistics in Bayesian regression

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , on October 23, 2021 by xi'an

“…the prior distribution, the loss function, and the likelihood or sampling density (…) a healthy skepticism encourages us to question each of them”

A paper by John Lewis, Steven MacEachern, and Yoonkyung Lee has recently appeared in Bayesian Analysis. Starting with the great motivation of a misspecified model requiring the use of a (thus necessarily) insufficient statistic and moving to their central concern of simulating the posterior based on that statistic.

Model misspecification remains understudied from a B perspective and this paper is thus most welcome in addressing the issue. However, when reading through, one of my criticisms is in defining misspecification as equivalent to outliers in the sample. An outlier model is an easy case of misspecification, in the end, since the original model remains meaningful. (Why should there be “good” versus “bad” data) Furthermore, adding a non-parametric component for the unspecified part of the data would sound like a “more Bayesian” alternative. Unrelated, I also idly wondered at whether or not normalising flows could be used in this instance..

The problem in selecting a T (Darjeeling of course!) is not really discussed there, while each choice of a statistic T leads to a different signification to what misspecified means and suggests a comparison with Bayesian empirical likelihood.

“Acceptance rates of this [ABC] algorithm can be intolerably low”

Erm, this is not really the issue with ABC, is it?! Especially when the tolerance is induced by the simulations themselves.

When I reached the MCMC (Gibbs?) part of the paper, I first wondered at its relevance for the mispecification issues before realising it had become the focus of the paper. Now, simulating the observations conditional on a value of the summary statistic T is a true challenge. I remember for instance George Casella mentioning it in association with a Student’s t sample in the 1990’s and Kerrie and I having an unsuccessful attempt at it in the same period. Persi Diaconis has written several papers on the problem and I am thus surprised at the dearth of references here, like the rather recent Byrne and Girolami (2013), Florens and Simoni (2015), or Bornn et al. (2019). In the present case, the  linear model assumed as the true model has the exceptional feature that it leads to a feasible transform of an unconstrained simulation into a simulation with fixed statistics, with no measure theoretic worries if not free from considerable efforts to establish the operation is truly valid… And, while simulating (θ,y) makes perfect sense in an insufficient setting, the cost is then precisely the same as when running a vanilla ABC. Which brings us to the natural comparison with ABC. While taking ε=0 may sound as optimal for being “exact”, it is not from an ABC perspective since the convergence rate of the (summary) statistic should be roughly the one of the tolerance (Fearnhead and Liu, Frazier et al., 2018).

“[The Borel Paradox] shows that the concept of a conditional probability with regard to an isolated given hypothesis whose probability equals 0 is inadmissible.” A. Колмого́ров (1933)

As a side note for measure-theoretic purists, the derivation of the conditional of y given T(y)=T⁰ is arbitrary since the event has probability zero (ie, the conditioning set is of measure zero). See the Borel-Kolmogorov paradox. The computations in the paper are undoubtedly correct, but this is only one arbitrary choice of a transform (or conditioning σ-algebra).

a computational approach to statistical learning [book review]

Posted in Books, R, Statistics, University life with tags , , , , , , , , , , , , , , , , on April 15, 2020 by xi'an

This book was sent to me by CRC Press for review for CHANCE. I read it over a few mornings while [confined] at home and found it much more computational than statistical. In the sense that the authors go quite thoroughly into the construction of standard learning procedures, including home-made R codes that obviously help in understanding the nitty-gritty of these procedures, what they call try and tell, but that the statistical meaning and uncertainty of these procedures remain barely touched by the book. This is not uncommon to the machine-learning literature where prediction error on the testing data often appears to be the final goal but this is not so traditionally statistical. The authors introduce their work as (a computational?) supplementary to Elements of Statistical Learning, although I would find it hard to either squeeze both books into one semester or dedicate two semesters on the topic, especially at the undergraduate level.

Each chapter includes an extended analysis of a specific dataset and this is an asset of the book. If sometimes over-reaching in selling the predictive power of the procedures. Printed extensive R scripts may prove tiresome in the long run, at least to me, but this may simply be a generational gap! And the learning models are mostly unidimensional, see eg the chapter on linear smoothers with imho a profusion of methods. (Could someone please explain the point of Figure 4.9 to me?) The chapter on neural networks has a fairly intuitive introduction that should reach fresh readers. Although meeting the handwritten digit data made me shift back to the late 1980’s, when my wife was working on automatic character recognition. But I found the visualisation of the learning weights for character classification hinting at their shape (p.254) most alluring!

Among the things I am missing when reading through this book, a life-line on the meaning of a statistical model beyond prediction, attention to misspecification, uncertainty and variability, especially when reaching outside the range of the learning data, and further especially when returning regression outputs with significance stars, discussions on the assessment tools like the distance used in the objective function (for instance lacking in scale invariance when adding errors on the regression coefficients) or the unprincipled multiplication of calibration parameters, some asymptotics, at least one remark on the information loss due to splitting the data into chunks, giving some (asymptotic) substance when using “consistent”, waiting for a single page 319 to see the “data quality issues” being mentioned. While the methodology is defended by algebraic and calculus arguments, there is very little on the probability side, which explains why the authors consider that the students need “be familiar  with the concepts of expectation, bias and variance”. And only that. A few paragraphs on the Bayesian approach are doing more harm than well, especially with so little background in probability and statistics.

The book possibly contains the most unusual introduction to the linear model I can remember reading: Coefficients as derivatives… Followed by a very detailed coverage of matrix inversion and singular value decomposition. (Would not sound like the #1 priority were I to give such a course.)

The inevitable typo “the the” was found on page 37! A less common typo was Jensen’s inequality spelled as “Jenson’s inequality”. Both in the text (p.157) and in the index, followed by a repetition of the same formula in (6.8) and (6.9). A “stwart” (p.179) that made me search a while for this unknown verb. Another typo in the Nadaraya-Watson kernel regression, when the bandwidth h suddenly turns into n (and I had to check twice because of my poor eyesight!). An unusual use of partition where the sets in the partition are called partitions themselves. Similarly, fluctuating use of dots for products in dimension one, including a form of ⊗ for matricial product (in equation (8.25)) followed next page by the notation for the Hadamard product. I also suspect the matrix K in (8.68) is missing 1’s or am missing the point, since K is the number of kernels on the next page, just after a picture of the Eiffel Tower…) A surprising number of references for an undergraduate textbook, with authors sometimes cited with full name and sometimes cited with last name. And technical reports that do not belong to this level of books. Let me add the pedant remark that Conan Doyle wrote more novels “that do not include his character Sherlock Holmes” than novels which do include Sherlock.

[Disclaimer about potential self-plagiarism: this post or an edited version will eventually appear in my Books Review section in CHANCE.]

misspecified [but published!]

Posted in Statistics with tags , , , , , on April 1, 2020 by xi'an

on-line parameter estimation with Wasserstein

Posted in Books, Statistics, University life with tags , , , , , , , , on November 27, 2019 by xi'an

Just found out that our paper On parameter estimation with the Wasserstein distance with Espen Bernton, Pierre Jacob, and Mathieu Gerber, has now appeared on-line on Information and Inference: A Journal of the IMA,

uncertainty in the ABC posterior

Posted in Statistics with tags , , , , , , on July 24, 2019 by xi'an

In the most recent Bayesian Analysis, Marko Järvenpää et al. (including my coauthor Aki Vehtari) consider an ABC setting where the number of available simulations of pseudo-samples  is limited. And where they want to quantify the amount of uncertainty resulting from the estimation of the ABC posterior density. Which is a version of the Monte Carlo error in practical ABC, in that this is the difference between the ABC posterior density for a given choice of summaries and a given choice of tolerance, and the actual approximation based on a finite number of simulations from the prior predictive. As in earlier works by Michael Gutmann and co-authors, the focus stands in designing a sequential strategy to decide where to sample the next parameter value towards minimising a certain expected loss. And in adopting a Gaussian process modelling for the discrepancy between observed data and simulated data, hence generalising the synthetic likelihood approach. This allows them to compute the expectation and the variance of the unnormalised ABC posterior, based on plugged-in estimators. From where the authors derive a loss as the expected variance of the acceptance probability (although it is not parameterisation invariant). I am unsure I see the point for this choice in that there is no clear reason for the resulting sequence of parameter choices to explore the support of the posterior distribution in a relatively exhaustive manner. The paper also mentions alternatives where the next parameter is chosen at the location where “the uncertainty of the unnormalised ABC posterior is highest”. Which sounds more pertinent to me. And further avoids integrating out the parameter. I also wonder if ABC mis-specification analysis could apply in this framework since the Gaussian process is most certainly a “wrong” model. (When concluding this post, I realised I had written a similar entry two years ago about the earlier version of the paper!)

%d bloggers like this: