Archive for Bayesian inference

ACDC versus ABC

Posted in Books, Kids, pictures, Statistics, Travel on June 12, 2017 by xi'an

At the Bayes, Fiducial and Frequentist workshop last month, I had a discussion with Suzanne Thornton and Min-ge Xie, the authors of this newly arXived paper, Approximate confidence distribution computing. Which they abbreviate as ACC and not as ACDC. While I have discussed the notion of confidence distribution in some earlier posts, this paper aims at producing proper frequentist coverage within a likelihood-free setting. Given the proximity to our recent paper on the asymptotics of ABC, as well as to Li and Fearnhead's (2016) parallel endeavour, it is difficult (for me) to spot the actual distinction between ACC and ABC, given that we also achieve (asymptotically) proper coverage when the limiting ABC distribution is Gaussian, which is the case for a tolerance decreasing quickly enough to zero (in the sample size).

“Inference from the ABC posterior will always be difficult to justify within a Bayesian framework.”

Indeed the ACC setting is eerily similar to ABC apart from the potential of the generating distribution to be data dependent. (Which is fine when considering that the confidence distributions have no Bayesian motivation but are a tool to ensure proper frequentist coverage.) That it is “able to offer theoretical support for ABC” (p.5) is unclear to me, given both this data dependence and the constraints it imposes on the [sampling and algorithmic] setting. Similarly, I do not understand how the authors “are not committing the error of doubly using the data” (p.5) and why they should be concerned about it, standing outside the Bayesian framework. If the prior involves the data as in the Cauchy location example, it literally uses the data [once], followed by an ABC comparison between simulated and actual data, that uses the data [a second time].

“Rather than engaging in a pursuit to define a moving target such as [a range of posterior distributions], ACC maintains a consistently clear frequentist interpretation (…) and thereby offers a consistently cohesive interpretation of likelihood-free methods.”

The frequentist coverage guarantee comes from a bootstrap-like assumption that [with tolerance equal to zero] the distribution of the ABC/ACC/ACDC random parameter around an estimate of the parameter given the summary statistic is identical to the [frequentist] distribution of this estimate around the true parameter [given the true parameter, although this conditioning makes no sense outside a Bayesian framework]. (There must be a typo in the paper when the authors define [p.10] the estimator as minimising the derivative of the density of the summary statistic, while still calling it an MLE.) That this bootstrap-like assumption holds is established (in Theorem 1) under a CLT on this MLE and assumptions on the data-dependent proposal that connect it to the density of the summary statistic. A connection that seems to imply a data-dependence as well as a certain knowledge about this density. What I find most surprising in this derivation is the total absence of conditions or even discussion on the tolerance level which, as we have shown, is paramount to the validation or invalidation of ABC inference. It sounds like the authors of Approximate confidence distribution computing are setting ε equal to zero for those theoretical derivations. While in practice they apply rules [for choosing ε] that they do not spell out, but which result in very different acceptance rates for the ACC version they oppose to an ABC version. (In all illustrations, it seems that ε=0.1, which does not make much sense.) All in all, I am thus rather skeptical about the practical implications of the paper in that it seems to achieve confidence guarantees by first assuming proper if implicit choices of summary statistics and parameter generating distribution.
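To make the role of the tolerance concrete, here is a minimal sketch of plain ABC rejection on a toy problem chosen purely for illustration (it is not an example from the paper): a Normal mean with unit variance, a N(0,10²) prior, and the sample mean as summary statistic. The function name abc_rejection, the seeds and every numerical setting are assumptions.

```python
import random
import statistics

def abc_rejection(y_obs, n_sims, eps, rng):
    """Toy ABC rejection sampler: keep prior draws whose simulated
    sample mean falls within eps of the observed sample mean."""
    n = len(y_obs)
    s_obs = statistics.fmean(y_obs)
    accepted = []
    for _ in range(n_sims):
        theta = rng.gauss(0.0, 10.0)                        # draw from the N(0, 10^2) prior
        y_sim = [rng.gauss(theta, 1.0) for _ in range(n)]   # pseudo-data given theta
        if abs(statistics.fmean(y_sim) - s_obs) <= eps:     # compare summaries
            accepted.append(theta)
    return accepted

data_rng = random.Random(1)
y_obs = [data_rng.gauss(2.0, 1.0) for _ in range(10)]       # "observed" data, true mean 2

results = {}
for eps in (1.0, 0.1, 0.01):
    # the same seed is reused so that the three runs share their simulations
    results[eps] = abc_rejection(y_obs, 20000, eps, random.Random(2))
    print(eps, len(results[eps]) / 20000)                   # acceptance rate
```

Since the three runs share the same simulations, the accepted sets are nested: shrinking ε lowers the acceptance rate and tightens the accepted values, which is why a statement about coverage can hardly be made without saying how ε is chosen.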

efficient acquisition rules for ABC

Posted in pictures, Statistics, University life on June 5, 2017 by xi'an

A few weeks ago, Marko Järvenpää, Michael Gutmann, Aki Vehtari and Pekka Marttinen arXived a paper on sampling design for ABC that reminded me of presentations Michael gave at NIPS 2014 and in Banff last February. The main notion is that, when the simulation from the model is hugely expensive, random sampling does not make sense.

“While probabilistic modelling has been used to accelerate ABC inference, and strategies have been proposed for selecting which parameter to simulate next, little work has focused on trying to quantify the amount of uncertainty in the estimator of the ABC posterior density itself.”

The above question is obviously interesting, if already considered in the literature, for it seems to focus on the Monte Carlo error in ABC, addressed for instance in Fearnhead and Prangle (2012), Li and Fearnhead (2016) and our paper with David Frazier, Gael Martin, and Judith Rousseau. With corresponding conditions on the tolerance and the number of simulations to relegate Monte Carlo error to a secondary level. And the additional remark that the (error free) ABC distribution itself is not the ultimate quantity of interest. Or the equivalent (?) one that ABC is actually an exact Bayesian method on a completed space.

The paper initially confused me, as it opens with a section on the very general formulation of the ABC posterior approximation, the error in this approximation, and the simulation design for minimising this error. This section sounded too vague to me, but the confusion only lasted for a while, as the remaining sections appear to be independent of it. The operational concept of the paper is to assume that the discrepancy between observed and simulated data, when perceived as a random function of the parameter θ, is a Gaussian process [over the parameter space]. This modelling allows for a prediction of the discrepancy at a new value of θ, which can be chosen as maximising the variance of the likelihood approximation. Or more precisely of the acceptance probability. While the authors report improved estimation of the exact posterior, I find no intuition as to why this should be the case when focussing on the discrepancy, especially because small discrepancies are associated with parameters approximately generated from the posterior.
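As a rough illustration of the idea, and not of the authors' actual acquisition rule, here is a sketch that assumes the discrepancy at θ is a latent value f plus Gaussian noise, with f following a Normal distribution given by the Gaussian process posterior at θ. The acceptance probability given f is then Φ((ε−f)/σ), and its variance under the uncertainty on f can be approximated by Monte Carlo. The candidate values, their (mean, variance) summaries, ε and σ are all hypothetical numbers.

```python
import math
import random

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def acceptance_prob_variance(mu, var, eps, noise_sd, rng, n_draws=5000):
    """Monte Carlo variance of P(discrepancy <= eps | f) when f ~ N(mu, var)."""
    sd = math.sqrt(var)
    probs = []
    for _ in range(n_draws):
        f = rng.gauss(mu, sd)                              # plausible latent discrepancy
        probs.append(normal_cdf((eps - f) / noise_sd))     # acceptance probability given f
    mean = sum(probs) / n_draws
    return sum((p - mean) ** 2 for p in probs) / n_draws

# hypothetical GP posterior (mean, variance) of the discrepancy at three candidate thetas
candidates = {0.0: (0.5, 0.04), 1.0: (0.1, 0.04), 2.0: (3.0, 0.04)}
eps, noise_sd = 0.1, 0.2

rng = random.Random(0)
variances = {theta: acceptance_prob_variance(m, v, eps, noise_sd, rng)
             for theta, (m, v) in candidates.items()}
next_theta = max(variances, key=variances.get)             # where to simulate next
print(variances, next_theta)
```

Under these assumptions the rule favours candidates whose predicted discrepancy sits near the tolerance, where acceptance is most uncertain, and ignores those where the discrepancy is almost surely far above ε.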

La déraisonnable efficacité des mathématiques

Posted in Books, pictures, Statistics, University life on May 11, 2017 by xi'an

Although it went completely out of my mind, thanks to a rather heavy travel schedule, I gave last week a short interview about the notion of mathematical models, which got broadcast this week on France Culture, one of the French public radio channels. Within the daily La Méthode Scientifique show, which is a one-hour programme on scientific issues, always a [rare] pleasure to listen to. (Including the day they invited Claire Voisin.) The theme of the show that day was the unreasonable effectiveness of mathematics, with the [classical] questioning of whether it is an efficient tool towards solving scientific (and inference?) problems because the mathematical objects pre-existed their use or because we are (pre-)conditioned to use mathematics to solve problems. I somewhat sounded like a dog in a game of skittles, but it was interesting to listen to the philosopher discussing my relativistic perspective [provided you understand French!]. And I appreciated very much the way Céline Loozen, the journalist who interviewed me, sorted the wheat from the chaff in the original interview to make me sound mostly coherent! (A coincidence: Jean-Michel Marin got interviewed this morning on France Inter, the major public radio, about the Grothendieck papers.)

HMC sampling in Bayesian empirical likelihood computation

Posted in Statistics on March 31, 2017 by xi'an

While working on the Series B’log the other day I noticed this paper by Chaudhuri et al. on Hamiltonian Monte Carlo and empirical likelihood: how exciting!!! Here is the abstract of the paper:

We consider Bayesian empirical likelihood estimation and develop an efficient Hamiltonian Monte Carlo method for sampling from the posterior distribution of the parameters of interest. The method proposed uses hitherto unknown properties of the gradient of the underlying log-empirical-likelihood function. We use results from convex analysis to show that these properties hold under minimal assumptions on the parameter space, prior density and the functions used in the estimating equations determining the empirical likelihood. Our method employs a finite number of estimating equations and observations but produces valid semi-parametric inference for a large class of statistical models including mixed effects models, generalized linear models and hierarchical Bayes models. We overcome major challenges posed by complex, non-convex boundaries of the support routinely observed for empirical likelihood which prevent efficient implementation of traditional Markov chain Monte Carlo methods like random-walk Metropolis–Hastings sampling etc. with or without parallel tempering. A simulation study confirms that our method converges quickly and draws samples from the posterior support efficiently. We further illustrate its utility through an analysis of a discrete data set in small area estimation.

[The comment is reposted from Series B’log, where I wrote it first.]

It is of particular interest for me [disclaimer: I was not involved in the review of this paper!] as we worked on ABC through empirical likelihood, which is about the reverse of the current paper in terms of motivation: when faced with a complex model, we substitute an empirical likelihood version for the real thing, run simulations from the prior distribution and use the empirical likelihood as a proxy. With possible intricacies when the data is not iid (an issue we also met with Wasserstein distances). In this paper the authors instead consider working on an empirical likelihood as their starting point and derive an HMC algorithm to do so. The idea is striking in that, by nature, an empirical likelihood is not a very smooth object and hence does not seem open to producing gradients and Hessians. As illustrated by Figure 1 in the paper. Which is so spiky at places that one may wonder at the representativity of such graphs.

I have always had a persistent worry about the ultimate validity of treating the empirical likelihood as a genuine likelihood, from the fact that it is the result of an optimisation problem to the issue that the approximate empirical distribution has a finite (data-dependent) support, hence is completely orthogonal to the true distribution. And to the one that the likelihood function is zero outside the convex hull of the defining equations…(For one thing, this empirical likelihood is always bounded by one but this may be irrelevant after all!)

The computational difficulty in handling the empirical likelihood starts with its support. Eliminating values of the parameter for which this empirical likelihood is zero amounts to checking whether zero belongs to the above convex hull. A hard (NP hard?) problem. (Although I do not understand why the authors dismiss the token observations of Owen and others. The argument that Bayesian analysis does more than maximising a likelihood seems to confuse the empirical likelihood as a product of a maximisation step with the empirical likelihood as a function of the parameter that can be used as any other function.)
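For intuition about this support issue, here is a minimal sketch in the simplest possible case, a scalar mean with the single estimating equation Σ wᵢ(xᵢ−μ)=0, where checking that zero belongs to the convex hull reduces to checking that μ lies strictly between the smallest and largest observations. This is an assumption-laden toy, not the setting of the paper; the function name log_el_ratio and the data are made up.

```python
import math

def log_el_ratio(xs, mu, n_iter=100):
    """Log empirical likelihood ratio sum(log(n * w_i)) for a scalar mean mu.
    Returns -inf when zero is outside the convex hull of the x_i - mu."""
    z = [x - mu for x in xs]
    if not min(z) < 0.0 < max(z):
        return -math.inf
    # the multiplier lam must keep every 1 + lam * z_i positive
    lo, hi = -1.0 / max(z), -1.0 / min(z)
    # sum(z_i / (1 + lam * z_i)) is strictly decreasing in lam: bisection finds its root
    for _ in range(n_iter):
        lam = 0.5 * (lo + hi)
        if sum(zi / (1.0 + lam * zi) for zi in z) > 0.0:
            lo = lam
        else:
            hi = lam
    lam = 0.5 * (lo + hi)
    # optimal weights are w_i = 1 / (n * (1 + lam * z_i))
    return -sum(math.log(1.0 + lam * zi) for zi in z)

xs = [1.2, 0.4, 2.5, 1.9, 0.7, 3.1]
x_bar = sum(xs) / len(xs)
for mu in (x_bar, 1.0, 5.0):
    print(mu, log_el_ratio(xs, mu))
```

Even in this toy case each evaluation hides an inner optimisation, and the function drops to minus infinity as soon as μ leaves the range of the data, which is the one-dimensional shadow of the boundary problem discussed above.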

In the simple regression example (pp.297-299), I find the choice of the moment constraints puzzling, in that they address the mean of the white noise (zero) and the covariance with the regressors (zero too). Puzzling because my definition of the regression model is conditional on the regressors and hence does not imply anything on their distribution. In a sense this is another model. But I also note that the approach focuses on the distribution of the reconstituted white noises, as we did in the PNAS paper. (The three examples processed in the paper are all simple and could be processed by regular MCMC, thus making the preliminary step of calling for an empirical likelihood somewhat artificial unless I missed the motivation. The paper also does not seem to discuss the impact of the choice of the moment constraints or the computing constraints involved by a function that is itself the result of a maximisation problem.)

A significant part of the paper is dedicated to the optimisation problem and the exclusion of the points on the boundary. Which sounds like a non-problem in continuous settings. However, this appears to be of importance for running an HMC as it cannot evade the support (without token observations). In principle, HMC should not leave this support since the gradient diverges at the boundary, but in practice the leapfrog approximation may lead the path outside. I would have (naïvely?) suggested rejecting moves when this happens and starting again, but the authors consider that proper choices of the calibration factors of HMC can avoid this problem. Which seems to induce a practical issue by turning the algorithm into an adaptive version.
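The naïve rejection suggestion can be sketched as follows, on a stand-in target rather than on an empirical likelihood: a Beta(2,2) density on (0,1), whose log-gradient also diverges at the boundary. Whenever the leapfrog path steps outside the support, the move is rejected and the chain stays put. The target, step size, number of leapfrog steps and function names are all assumptions made for the illustration, and this is not the algorithm of the paper.

```python
import math
import random

def log_target(x):
    # Beta(2, 2) log-density up to a constant, only evaluated inside (0, 1)
    return math.log(x) + math.log(1.0 - x)

def grad_log_target(x):
    return 1.0 / x - 1.0 / (1.0 - x)

def hmc_step(x, step, n_leapfrog, rng):
    """One HMC move; returns (new state, whether the path left the support)."""
    p0 = rng.gauss(0.0, 1.0)
    x_new = x
    p = p0 + 0.5 * step * grad_log_target(x)       # initial half step on the momentum
    for i in range(n_leapfrog):
        x_new += step * p                          # full step on the position
        if not 0.0 < x_new < 1.0:
            return x, True                         # outside the support: reject and stay
        # full momentum step, except for a half step at the very end
        scale = 1.0 if i < n_leapfrog - 1 else 0.5
        p += scale * step * grad_log_target(x_new)
    log_ratio = (log_target(x_new) - 0.5 * p * p) - (log_target(x) - 0.5 * p0 * p0)
    if rng.random() < math.exp(min(0.0, log_ratio)):   # Metropolis correction
        return x_new, False
    return x, False

rng = random.Random(0)
x, samples, exits = 0.5, [], 0
for _ in range(4000):
    x, left_support = hmc_step(x, 0.05, 10, rng)
    exits += left_support
    samples.append(x)
print(sum(samples) / len(samples), exits)
```

Rejecting such paths amounts to treating the target as zero outside its support, so the chain keeps the right stationary distribution; the price is that it may linger near the boundary when the step size is too large, which is presumably why calibration matters.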

As a last point, I would have enjoyed seeing a comparison of the performances against our (A)BCel version, which would have been straightforward to implement in the simple examples handled by the paper. (This could be a neat undergraduate project for next year!)

estimation versus testing [again!]

Posted in Books, Statistics, University life on March 30, 2017 by xi'an

The following text is a review I wrote of the paper “Parameter estimation and Bayes factors”, written by J. Rouder, J. Haaf, and J. Vandekerckhove. (As the journal to which it is submitted gave me the option to sign my review.)

The opposition between estimation and testing as a matter of prior modelling rather than inferential goals is quite unusual in the Bayesian literature. In particular, if one follows Bayesian decision theory as in Berger (1985) there is no such opposition, but rather the use of different loss functions for different inference purposes, while the Bayesian model remains single and unitarian.

Following Jeffreys (1939), it sounds more congenial to the Bayesian spirit to return the posterior probability of a hypothesis H⁰ as an answer to the question whether this hypothesis holds or does not hold. This however proves impossible when the “null” hypothesis H⁰ has prior mass equal to zero (or is not measurable under the prior). In such a case the mathematical answer is a probability of zero, which may not satisfy the experimenter who asked the question. More fundamentally, the said prior proves inadequate to answer the question and hence to incorporate the information contained in this very question. This is how Jeffreys (1939) justifies the move from the original (and deficient) prior to one that puts some weight on the null (hypothesis) space. It is often argued that the move is unnatural and that the null space does not make sense, but this only applies when believing very strongly in the model itself. When considering the issue from a modelling perspective, accepting the null H⁰ means using a new model to represent the data and hence testing becomes a model choice problem, namely whether one should use a complex or a simplified model to represent the generation of the data. This is somehow the “unification” advanced in the current paper, although it does appear originally in Jeffreys (1939) [and then numerous others] rather than in the relatively recent Mitchell & Beauchamp (1988). Who may have launched the spike & slab denomination.

I have trouble with the analogy drawn in the paper between the spike & slab estimate and the Stein effect. While the posterior mean derived from the spike & slab posterior is indeed a quantity drawn towards zero by the Dirac mass at zero, it is rarely the point in using a spike & slab prior, since this point estimate does not lead to a conclusion about the hypothesis: for one thing it is never exactly zero (if zero corresponds to the null). For another thing, the construction of the spike & slab prior is both artificial and dependent on the weights given to the spike and to the slab, respectively, to borrow expressions from the paper. This approach thus leads to model averaging rather than hypothesis testing or model choice and therefore fails to answer the (possibly absurd) question as to which model to choose. Or refuse to choose. But there are cases when a decision must be made, like continuing a clinical trial or putting a new product on the market. Or not.
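A minimal sketch of this point, in the simplest conjugate setting I can assume for illustration (a single observation x ~ N(θ,1), a spike at zero with prior weight w and a N(0,τ²) slab; none of these choices come from the paper under review):

```python
import math

def normal_pdf(x, var):
    return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)

def spike_slab_summary(x, w, tau2):
    """x ~ N(theta, 1); prior puts mass w on theta = 0 and 1 - w on N(0, tau2).
    Returns (posterior probability of the spike, posterior mean of theta)."""
    m0 = normal_pdf(x, 1.0)              # marginal density of x under the spike
    m1 = normal_pdf(x, 1.0 + tau2)       # marginal density of x under the slab
    post_spike = w * m0 / (w * m0 + (1.0 - w) * m1)
    # under the slab, the posterior mean is the usual shrinkage tau2 / (1 + tau2) * x
    post_mean = (1.0 - post_spike) * tau2 / (1.0 + tau2) * x
    return post_spike, post_mean

x, tau2 = 1.0, 4.0
summaries = {w: spike_slab_summary(x, w, tau2) for w in (0.2, 0.5, 0.8)}
for w, (p0, mean) in summaries.items():
    print(w, p0, mean)
```

The posterior mean moves towards zero as the spike weight grows but never reaches it, and both outputs change with w and τ², so reporting the estimate alone does not settle which model to choose.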

In conclusion, the paper surprisingly bypasses the decision-making aspect of testing and hence ends up with an inconclusive setting, staying midstream between Bayes factors and credible intervals. And failing to provide a tool for decision making. The paper also fails to acknowledge the strong dependence of the Bayes factor on the tail behaviour of the prior(s), which cannot be [completely] corrected by a finite sample, hence its relativity and the unreasonableness of a fixed scale like Jeffreys’ (1939).

a concise introduction to statistical inference [book review]

Posted in Statistics on February 16, 2017 by xi'an

[Just to warn readers and avoid emails about Xi’an plagiarising Christian!, this book was sent to me by CRC Press for a review. To be published in CHANCE.]

This is an introduction to statistical inference. And with 180 pages, it indeed is concise! I could actually stop the review at this point as a concise review of a concise introduction to statistical inference, as I do not find much originality in this introduction, intended for “mathematically sophisticated first-time student of statistics”. Although sophistication is in the eye of the sophist, of course, as this book has margin symbols in the guise of integrals to warn of sections using “differential or integral calculus” and a remark that the book is still accessible without calculus… (Integral calculus as in Riemann integrals, not Lebesgue integrals, mind you!) It even includes appendices with the Greek alphabet, summation notations, and exponential/logarithms.

“In statistics we often bypass the probability model altogether and simply specify the random variable directly. In fact, there is a result (that we won’t cover in detail) that tells us that, for any random variable, we can find an appropriate probability model.” (p.17)

Given its limited mathematical requirements, the book does not get very far in the probabilistic background of statistical methods, which makes the corresponding chapter not particularly helpful as opposed to a prerequisite on probability basics. Since not much can be proven without “all that complicated stuff about for any ε>0” (p.29). And this makes correctly defining notions like the Central Limit Theorem impossible. For instance, Chebychev’s inequality comes within a list of admitted results. There is no major mistake in the chapter, even though mentioning that two correlated Normal variables are jointly Normal (p.27) is inexact.
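A standard counterexample to the joint Normality claim, sketched by simulation (the threshold of one and the sample size are arbitrary choices of mine): take X standard Normal and set Y=X when |X|≤1 and Y=−X otherwise. By symmetry Y is also standard Normal, the pair is correlated, and yet X+Y is exactly zero with positive probability, which cannot happen for a jointly Normal pair.

```python
import math
import random

rng = random.Random(0)
n = 20000
xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
# Y equals X inside [-1, 1] and -X outside; by symmetry Y is still N(0, 1)
ys = [x if abs(x) <= 1.0 else -x for x in xs]

mean_x, mean_y = sum(xs) / n, sum(ys) / n
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
var_x = sum((x - mean_x) ** 2 for x in xs) / n
var_y = sum((y - mean_y) ** 2 for y in ys) / n
corr = cov / math.sqrt(var_x * var_y)

# a jointly Normal pair would make X + Y Normal (or constant), never a mix of both
frac_zero = sum(1 for x, y in zip(xs, ys) if x + y == 0.0) / n
print(corr, frac_zero)
```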

“The power of a test is the probability that you do not reject a null that is in fact correct.” (p.120)

Most of the book follows the same pattern as other textbooks at that level, covering inference on a mean and a probability, confidence intervals, hypothesis testing, p-values, and linear regression. With some words of caution about the interpretation of p-values. (And the unfortunate inversion of the interpretation of power above.) Even mentioning the Cult [of Significance] I reviewed a while ago.

Given all that, the final chapter comes as a surprise, being about Bayesian inference! Which should make me rejoice, obviously, but I remain skeptical of introducing the concept to readers with so little mathematical background. And hence a very shaky understanding of a notion like conditional distributions. (Which reminds me of repeated occurrences on X validated when newcomers hope to bypass textbooks and courses to grasp the meaning of posteriors and such. Like when asking why Bayes Theorem does not apply for expectations.) I can feel the enthusiasm of the author for this perspective and it may diffuse to some readers, but apart from being aware of the approach, I wonder how much they carry away from this brief (decent) exposure. The chapter borrows from Lee (2012, 4th edition) and from Berger (1985) for the decision-theoretic part. The limitations of the exercise are shown for hypothesis testing (or comparison) by the need to restrict the parameter space to two possible values. And for decision making. Similarly, introducing improper priors and the likelihood principle [distinguished there from the law of likelihood] is likely to go over the head of most readers and clashes with the level of the previous chapters. (And I do not think this is the most efficient way to argue in favour of a Bayesian approach to the problem of statistical inference: I have now dropped all references to the likelihood principle from my lectures. Not because of the controversy, but simply because the students do not get it.) By the end of the chapter, it is unclear whether a neophyte would be able to spell out how one could specify a prior for one of the problems processed in the earlier chapters. The appendix on de Finetti’s formalism on personal probabilities is very unlikely to help in this regard. While it sounds so far beyond the level of the remainder of the book.

MAP as Bayes estimators

Posted in Books, Kids, Statistics on November 30, 2016 by xi'an

Robert Bassett and Julio Deride just arXived a paper discussing the position of MAPs within Bayesian decision theory. A point I have discussed extensively on the ‘Og!

“…we provide a counterexample to the commonly accepted notion of MAP estimators as a limit of Bayes estimators having 0-1 loss.”

The authors mention The Bayesian Choice stating this property without further precautions and I completely agree to being careless in this regard! The difficulty stands with the limit of the maximisers being not necessarily the maximiser of the limit. The paper includes an example to this effect, with a prior as above, associated with a sampling distribution that does not depend on the parameter. The sufficient conditions proposed therein are that the posterior density is almost surely proper or quasiconcave.
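To fix ideas about the folk theorem itself, here is a sketch on a well-behaved case where the limit does hold, and emphatically not the counterexample of the paper. The posterior is assumed to be a two-component Normal mixture (weights, means and scales are made up), the loss is the indicator that the estimate misses θ by more than c, and the Bayes estimate under that loss is the centre d of the interval [d−c, d+c] with the largest posterior mass, found here by grid search.

```python
import math

def normal_cdf(x, mean, sd):
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

# hypothetical posterior: 0.65 N(0, 1) + 0.35 N(4, 0.25^2), whose highest mode is near 4
COMPONENTS = ((0.65, 0.0, 1.0), (0.35, 4.0, 0.25))

def interval_mass(d, c):
    """Posterior probability of the interval [d - c, d + c]."""
    return sum(w * (normal_cdf(d + c, m, s) - normal_cdf(d - c, m, s))
               for w, m, s in COMPONENTS)

def bayes_estimate(c):
    """Bayes estimate under the loss 1(|theta - d| > c), by grid search on [-3, 6]."""
    grid = [-3.0 + 0.01 * i for i in range(901)]
    return max(grid, key=lambda d: interval_mass(d, c))

estimates = {c: bayes_estimate(c) for c in (1.0, 0.5, 0.05)}
print(estimates)
```

For a wide window the estimate sits in the bulky component around zero, and it only jumps to the neighbourhood of the MAP once c gets small, which shows how much the answer depends on where one stops along the sequence of losses.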

This is a neat mathematical characterisation that cleans up this “folk theorem” about MAP estimators. And for which the authors are to be congratulated! However, I am not very excited by the limiting property, whether it holds or not, as I have difficulties conceiving the use of a sequence of losses in a mildly realistic case. I rather prefer the alternate characterisation of MAP estimators by Burger and Lucka as proper Bayes estimators under another type of loss function, albeit a rather artificial one.