Archive for fiducial distribution

ABC with no prior

Posted in Books, Kids, pictures on April 30, 2018 by xi'an

“I’m trying to fit a complex model to some data that take a large amount of time to run. I’m also unable to write down a Likelihood function to this problem and so I turned to approximate Bayesian computation (ABC). Now, given the slowness of my simulations, I used Sequential ABC (…) In fact, contrary to the concept of Bayesian statistics (new knowledge updating old knowledge) I would like to remove all the influence of the priors from my estimates.”

A question from X validated where I have little to contribute as the originator of the problem had the utmost difficulty understanding that ABC could not be run without a probability structure on the parameter space. Maybe a fiducialist in disguise?! To this purpose this person simulated from a collection of priors and took the best 5% across the priors, which is akin to either running a mixture prior or using ABC for conducting prior choice, which reminds me of a paper of Toni et al. Not that it helps in removing “all the influence of the priors”, of course…
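To make the mixture reading concrete, here is a minimal sketch of rejection ABC where draws from several priors are pooled and the closest 5% are kept. Everything in it is an assumption made for illustration (the toy Normal simulator, the two priors, the names `simulate` and `abc_pooled_priors`); it is not the code of the person asking the question.

```python
import random

def simulate(theta, n, rng):
    # Toy simulator (an assumption): mean of n Normal(theta, 1) draws.
    return sum(rng.gauss(theta, 1.0) for _ in range(n)) / n

def abc_pooled_priors(observed, priors, n_obs, draws_per_prior, keep_frac, rng):
    # Simulate from every candidate prior, pool all draws,
    # then keep the fraction whose summaries are closest to the data.
    pooled = []
    for label, sampler in priors.items():
        for _ in range(draws_per_prior):
            theta = sampler(rng)
            distance = abs(simulate(theta, n_obs, rng) - observed)
            pooled.append((distance, label, theta))
    pooled.sort(key=lambda item: item[0])
    n_keep = max(1, int(keep_frac * len(pooled)))
    return pooled[:n_keep]

rng = random.Random(0)
# Two hypothetical priors; equal numbers of draws make this an equal-weight mixture.
priors = {
    "narrow": lambda r: r.gauss(0.0, 1.0),
    "wide": lambda r: r.uniform(-10.0, 10.0),
}
accepted = abc_pooled_priors(observed=1.5, priors=priors, n_obs=20,
                             draws_per_prior=2000, keep_frac=0.05, rng=rng)
# How many accepted draws came from each prior.
counts = {label: sum(1 for (_d, lab, _t) in accepted if lab == label)
          for label in priors}
accepted_mean = sum(theta for (_d, _lab, theta) in accepted) / len(accepted)
```

The `counts` dictionary is what makes the procedure look like prior choice: the accepted sample is an ABC output under the equal-weight mixture of the priors, and the share contributed by each prior acts as a crude weight on that prior. The priors are still there, only hidden in the pooling.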

An unrelated item of uninteresting trivia is that a question I posted in 2012 on behalf of my former student Gholamossein Gholami about the possibility of using EM to derive a Weibull maximum likelihood estimator (instead of sheer numerical optimisation) has passed 10⁴ views. But no answer so far!

look, look, confidence! [book review]

Posted in Books, Statistics, University life on April 23, 2018 by xi'an

As it happens, I recently bought [with Amazon Associate earnings] a (used) copy of Confidence, Likelihood, Probability (Statistical Inference with Confidence Distributions), by Tore Schweder and Nils Hjort, to try to understand this confusing notion of confidence distributions. (And hence did not get the book from CUP or anyone else towards purposely writing a review. Or a ½-review like the one below.)

“Fisher squared the circle and obtained a posterior without a prior.” (p.419)

Now that I have gone through a few chapters, I am no less confused about the point of this notion. Which seems to rely on the availability of confidence intervals. Exact or asymptotic ones. The authors plainly recognise (p.61) that a confidence distribution is neither a posterior distribution nor a fiducial distribution, hence cutting off any possible Bayesian usage of the approach. Which seems right in that there is no coherence behind the construct, meaning for instance there is no joint distribution corresponding to the resulting marginals. Or even a specific dominating measure in the parameter space. (Always go looking for the dominating measure!) As usual with frequentist procedures, there is always a feeling of arbitrariness in the resolution, as for instance in the Neyman-Scott problem (p.112) where the profile likelihood and the deviance do not work, but considering directly the distribution of the (inconsistent) MLE of the variance “saves the day”, which sounds a bit like starting from the solution. Another statistical freak, the Fieller-Creasy problem (p.116) remains a freak in this context as it does not seem to allow for a confidence distribution. I also notice an ambivalence in the discourse of the authors of this book, namely that they claim confidence distributions are both outside a probabilisation of the parameter and inside it, “producing distributions for parameters of interest given the data (…) with fewer philosophical and interpretational obstacles” (p.428).

“Bias is particularly difficult to discuss for Bayesian methods, and seems not to be a worry for most Bayesian statisticians.” (p.10)

The discussions as to whether or not confidence distributions form a synthesis of Bayesianism and frequentism always fall short of being convincing, the choice of (or the dependence on) a prior distribution appearing to the authors as a failure of the former approach. Or unnecessarily complicated when there are nuisance parameters. Apparently missing on the (high) degree of subjectivity involved in creating the confidence procedures. Chapter 1 contains a section on “Why not go Bayesian?” that starts from Chris Sims’ Nobel Lecture on the appeal of Bayesian methods and goes [softly] rampaging through each item. One point (3) is recurrent in many criticisms of B and I always wonder whether or not it is tongue-in-cheek-y… Namely the fact that parameters of a model are rarely if ever stochastic. This is a misrepresentation of the use of prior and posterior distributions [which are in fact used] as summaries of information cum uncertainty. About a true fixed parameter. Refusing as does the book to endow posteriors with an epistemic meaning (except for “Bayesian of the Lindley breed”, p.419) is thus most curious. (The debate is repeated in the final(e) chapter as “why the world need not be Bayesian after all”.)

“To obtain frequentist unbiasedness, the Bayesian will have to choose her prior with unbiasedness in mind. Is she then a Bayesian?” (p.430)

A general puzzling feature of the book is that notions are not always immediately defined, but rather discussed and illustrated first. As for instance for the central notion of fiducial probability (Section 1.7, then Chapter 6), maybe because Fisher himself did not have a general principle to advance. The construction of a confidence distribution most often keeps a measure of mystery (and arbitrariness), outside the rather stylised setting of exponential families and sufficient (conditionally so) statistics. (Incidentally, our 2012 ABC survey is [kindly] quoted in relation to approximate sufficiency (p.180), while it does not sound particularly related to this part of the book. Now, is there an ABC version of confidence distributions? Or an ABC derivation?) This is not to imply that the book is uninteresting!, as I found reading it quite entertaining, with many humorous and tongue-in-cheek remarks, like “From Fraser (1961a) and until Fraser (2011), and hopefully even further” (p.92), and great datasets. (Including one entitled Pornoscope, which is about drosophila mating.) And also datasets with lesser greatness, like the 3000 minke whales that were killed for Example 8.5, where the authors if not the whales “are saved by a large and informative dataset”… (Whaling is a recurrent [national?] theme throughout the book, along with sport statistics usually involving Norway!)

Miscellanea: The interest of the authors in the topic is credited to bowhead whales, more precisely to Adrian Raftery’s geometric merging (or melding) of two priors and to the resulting Borel paradox (p.xiii). A proposal that I remember Adrian presenting in Luminy, presumably in 1994. Or maybe in Aussois the year after. The book also repeats Don Fraser’s notion that the likelihood is a sufficient statistic, a point that still bothers me. (On the side, I realised while reading Confidence, &tc., that ABC cannot comply with the likelihood principle.) To end up on a French nitpicking note (!), Quenouille is typ(o)ed Quenoille in the main text, the references and the index. (Blame the .bib file!)

minibatch acceptance for Metropolis-Hastings

Posted in Books, Statistics on January 12, 2018 by xi'an

An arXival that appeared last July by Seita, Pan, Chen, and Canny, and that relates to my current interest in speeding up MCMC. And to 2014 papers by Korattikara et al., and Bardenet et al. Published in Uncertainty in AI by now. The authors claim that their method requires less data per iteration than earlier ones…

“Our test is applicable when the variance (over data samples) of the log probability ratio between the proposal and the current state is less than one.”

By test, the authors mean a mini-batch formulation of the Metropolis-Hastings acceptance ratio in the (special) setting of iid data. First they use Barker’s version of the acceptance probability instead of Metropolis’. Second, they use a Gaussian approximation to the distribution of the logarithm of the Metropolis ratio for the minibatch, while the Barker acceptance step corresponds to comparing a logistic perturbation of the logarithm of the Metropolis ratio against zero. Which amounts to comparing the logarithm of the Metropolis ratio for the minibatch, perturbed by a logistic minus Normal variate, against zero. (The cancellation of the Normal in eqn (13) is a form of fiducial fallacy, where the Normal variate has two different meanings. In other words, the difference of two Normal variates is not equal to zero.) However, the next step escapes me as the authors seek to optimise the distribution of this logistic minus Normal variate. Which I thought was uniquely defined as such a difference. Another constraint is that the estimated variance of the log-likelihood ratio gets below one. (Why one?) The argument is that the average of the individual log-likelihoods is approximately Normal by virtue of the Central Limit Theorem. Even when randomised. While the illustrations on a Gaussian mixture and on a logistic regression demonstrate huge gains in computational time, it is unclear to me to what extent one can trust the approximation for a given model and sample size…
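As a reminder of why the Barker step can be written as a comparison against zero, here is a small sketch (full-data version only, not the authors' minibatch test). It checks by simulation that accepting when the log ratio plus a standard logistic variate is positive reproduces the Barker probability 1/(1+exp(-Δ)). The function names and the value Δ=0.7 are illustrative assumptions.

```python
import math
import random

def barker_accept(log_ratio, rng):
    # Accept when log_ratio plus a standard logistic variate is positive.
    u = rng.random()
    while u == 0.0:          # avoid log(0) in the inverse-CDF transform
        u = rng.random()
    logistic = math.log(u) - math.log1p(-u)   # standard logistic draw
    return log_ratio + logistic > 0.0

def barker_probability(log_ratio):
    # Barker's acceptance probability for a given log Metropolis ratio.
    return 1.0 / (1.0 + math.exp(-log_ratio))

rng = random.Random(1)
log_ratio = 0.7              # arbitrary value of the log ratio, for illustration
n = 200_000
rate = sum(barker_accept(log_ratio, rng) for _ in range(n)) / n
target = barker_probability(log_ratio)
```

In the minibatch setting the exact log ratio is replaced by a noisy estimate, treated as the true value plus a Normal error, so the extra perturbation added to it should be such that Normal error plus perturbation is again logistic. This is where the “logistic minus Normal” variate discussed above comes from.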

on confidence distributions

Posted in Books, pictures, Statistics, Travel, University life on January 10, 2018 by xi'an

As Regina Liu gave her talk at ISI this morning on fusion learning and confidence distributions, this led me to think anew about this strange notion of confidence distributions, building a distribution on the parameter space without a prior to go with it, implicitly or explicitly, and vaguely differing from fiducial inference. (As an aside, the Wikipedia page on confidence distributions is rather heavily supporting the concept and was primarily written by someone from Rutgers, where the modern version was developed. [And as an aside inside the aside, Schweder and Hjort’s book is sitting in my office, waiting for me!])

Recall that a confidence distribution is a sample dependent distribution on the parameter space, which is uniform U(0,1) [in the sample] at the “true” value of the parameter. Used thereafter as a posterior distribution. (Again, almost always without a prior to go with it. Which is an incoherence from a probabilistic perspective. Not to mention the issue of operating without a pre-defined dominating measure. This measure issue is truly bothering me!) This seems to include fiducial distributions based on a pivot, unless I am confused. As noted in the review by Nadarajah et al. Moreover, the concept of creating a pseudo-posterior out of an existing (frequentist) confidence interval procedure to create a new (frequentist) procedure does not carry an additional validation per se, as it clearly depends on the choice of the initialising procedure. (Not even mentioning the lack of invariance and the intricacy of multidimensional extensions.)
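The uniformity requirement is easy to check by simulation in the simplest case. The sketch below assumes a Normal sample with known σ and the pivot-based confidence distribution C(θ)=Φ(√n(θ−x̄)/σ), which is a standard textbook choice and not one taken from the talk; the function name and the numerical settings are illustrative.

```python
import math
import random
from statistics import NormalDist, fmean

def confidence_cdf(theta, sample, sigma=1.0):
    # C(theta) = Phi(sqrt(n) * (theta - xbar) / sigma), built from the Normal pivot.
    n = len(sample)
    xbar = fmean(sample)
    return NormalDist().cdf(math.sqrt(n) * (theta - xbar) / sigma)

rng = random.Random(2)
theta_true = 3.0             # assumed "true" parameter for the experiment
values = []
for _ in range(20_000):
    sample = [rng.gauss(theta_true, 1.0) for _ in range(10)]
    values.append(confidence_cdf(theta_true, sample))

# If C(theta_true) is U(0,1) over repeated samples, these should be near 0.5 and 0.1.
mean_value = fmean(values)
frac_below_10pct = sum(v < 0.1 for v in values) / len(values)
```

Note that the check is entirely about the sampling distribution of x̄ at a fixed θ. Nothing in it says that C(·) should then be read as a distribution on θ given the observed sample, which is the step I find unsupported.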

fiducial inference

Posted in Books, Mountains, pictures, Running, Statistics, Travel on October 30, 2017 by xi'an

In connection with my recent tale of the many ε’s, I received from Gunnar Taraldsen [from Trondheim, Norge] a paper [jointly written with Bo Lindqvist and just appeared on-line in JSPI] on conditional fiducial models.

“The role of the prior and the statistical model in Bayesian analysis is replaced by the use of the fiducial model x=R(θ,ε) in fiducial inference. The fiducial is obtained in this case without a prior distribution for the parameter.”

Reading this paper after addressing the X validated question made me better understand the fundamental wrongness of fiducial analysis! If I may herein object to Fisher himself… Indeed, when writing x=R(θ,ε), as the representation of the [observed] random variable x as a deterministic transform of a parameter θ and of an [unobserved] random factor ε, the two random variables x and ε are based on the same random preimage ω, i.e., x=x(ω) and ε=ε(ω). Observing x hence sets a massive constraint on the preimage ω and on the conditional distribution of ε=ε(ω). When the fiducial inference incorporates another level of randomness via an independent random variable ε’ and inverts x=R(θ,ε’) into θ=θ(x,ε’), assuming there is only one solution to the inversion, it modifies the nature of the underlying σ-algebra into something that is incompatible with the original model. Because of this sudden duplication of the random variates. While the inversion of this equation x=R(θ,ε’) gives an idea of the possible values of θ when ε varies according to its [prior] distribution, it does not account for the connection between x and ε. And does not turn the original parameter into a random variable with an implicit prior distribution.
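For the record, the inversion step is trivial to code in the simplest case, which is part of what makes it tempting. The sketch below assumes the location model x=R(θ,ε)=θ+ε with ε standard Normal, a toy choice of mine and not an example from the paper; `fiducial_draws` is a hypothetical name.

```python
import random
from statistics import fmean, pstdev

def fiducial_draws(x_obs, n_draws, rng):
    # Invert x = theta + eps' for fresh eps' ~ N(0, 1), drawn independently
    # of the eps that actually generated x_obs.
    return [x_obs - rng.gauss(0.0, 1.0) for _ in range(n_draws)]

rng = random.Random(3)
x_obs = 2.0                  # a single observed value, fixed for illustration
draws = fiducial_draws(x_obs, n_draws=50_000, rng=rng)

# In this toy model the fiducial draws are N(x_obs, 1).
draws_mean = fmean(draws)
draws_sd = pstdev(draws)
```

The code makes the duplication explicit: the ε’ fed to `rng.gauss` has no connection with the ε behind `x_obs`, even though the model ties x and ε to the same ω.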

As to conditional fiducial distributions, they are defined by inversion of x=R(θ,ε), under a certain constraint on θ, like C(θ)=0, which immediately raises a Pavlovian reaction in me, namely that since the curve C(θ)=0 has measure zero under the original fiducial distribution, how can this conditional solution be uniquely or at all defined? Or avoid the Borel paradox mentioned in the paper? If I get the meaning of the authors in this section, the resulting fiducial distribution will actually depend on the choice of σ-algebra governing the projection.

“A further advantage of the fiducial approach in the case of a simple fiducial model is that independent samples are produced directly from independent sampling from [the fiducial distribution]. Bayesian simulations most often come as dependent samples from a Markov chain.”

This side argument in “favour” of the fiducial approach is most curious as it brings into the picture computational aspects that do not have any reason to be there. (The core of the paper is concerned with the uniqueness of the fiducial distribution in some univariate settings. Not with computational issues.)

Validity and the foundations of statistical inference

Posted in Statistics on July 29, 2016 by xi'an

Natesh pointed out to me this recent arXival with a somewhat grandiose abstract:

In this paper, we argue that the primary goal of the foundations of statistics is to provide data analysts with a set of guiding principles that are guaranteed to lead to valid statistical inference. This leads to two new questions: “what is valid statistical inference?” and “do existing methods achieve this?” Towards answering these questions, this paper makes three contributions. First, we express statistical inference as a process of converting observations into degrees of belief, and we give a clear mathematical definition of what it means for statistical inference to be valid. Second, we evaluate existing approaches Bayesian and frequentist approaches relative to this definition and conclude that, in general, these fail to provide valid statistical inference. This motivates a new way of thinking, and our third contribution is a demonstration that the inferential model framework meets the proposed criteria for valid and prior-free statistical inference, thereby solving perhaps the most important unsolved problem in statistics.

Since solving the “most important unsolved problem in statistics” sounds worth pursuing, I went and checked the paper’s contents.

“To us, the primary goal of the foundations of statistics is to provide a set of guiding principles that, if followed, will guarantee validity of the resulting inference. Our motivation for writing this paper is to be clear about what is meant by valid inference and to provide the necessary principles to help data analysts achieve validity.”

Which can be interpreted in so many ways that it is somewhat meaningless…

“…if real subjective prior information is available, we recommend using it. However, there is an expanding collection of work (e.g., machine learning, etc) that takes the perspective that no real prior information is available. Even a large part of the literature claiming to be Bayesian has abandoned the interpretation of the prior as a serious part of the model, opting for “default” prior that “works.” Our choice to omit a prior from the model is not for the (misleading) purpose of being “objective”—subjectivity is necessary—but, rather, for the purpose of exploring what can be done in cases where a fully satisfactory prior is not available, to see what improvements can be made over the status quo.”

This is a pretty traditional criticism of the Bayesian approach, namely that if a “true” prior is provided (by whom?) then it is optimal to use it. But this amounts to turning the prior into another piece of the sampling distribution and is not in my opinion a Bayesian argument! Most of the criticisms in the paper are directed at objective Bayes approaches, with the surprising conclusion that, because there exist cases where no matching prior is available, “the objective Bayesian approach [cannot] be considered as a general framework for scientific inference.” (p.9)

Another section argues that Bayesian modelling cannot describe a state of total ignorance. This is formally correct, which is why there is no such thing as a non-informative or the non-informative prior, as often discussed here, but is this truly relevant, in that the inference problem contains one way or another information about the parameter, for instance through a loss function or a pseudo-likelihood?

“This is a desirable property that most existing methods lack.”

The proposal central to the paper’s thesis is to replace posterior probabilities by belief functions b(.|X), called statistical inference, that are interpreted as measures of evidence about subsets A of the parameter space. If not necessarily as probabilities. This is not very novel, witness the works of Dempster, Shafer and subsequent researchers. And not very much used outside Bayesian and fiducial statistics because of the mostly impossible task of defining a function over all subsets of the parameter space. Because of the subjectivity of such “beliefs”, they will be “valid” only if they are well-calibrated in the sense of b(A|X) being sub-uniform, that is, more concentrated near zero than a uniform variate (i.e., small) under the alternative, i.e. when θ is not in A. At this stage, since this is a mix of a minimax and proper coverage condition, my interest started to quickly wane… Especially because the sub-uniformity condition is highly demanding, if leading to controls over the Type I error and the frequentist coverage. As often, I wonder at the meaning of a calibration property obtained over all realisations of the random variable and all values of the parameter. So for me stability is neither “desirable” nor “essential”. Overall, I have increasing difficulties in perceiving proper coverage as a relevant property. Which has no stronger or weaker meaning than the coverage derived from a Bayesian construction.

“…frequentism does not provide any guidance for selecting a particular rule or procedure.”

I agree with this assessment, which means that there is no such thing as frequentist inference, but rather a philosophy for assessing procedures. That the Gleser-Hwang paradox invalidates this philosophy sounds a bit excessive, however. Especially when the bounded nature of Bayesian credible intervals is also analysed as a failure. A more relevant criticism is the lack of directives for picking procedures.

“…we are the first to recognize that the belief function’s properties are necessary in order for the inferential output to satisfy the required validity property”

The construction of the “inferential model” proposed by the authors offers similarities with fiducial inference, in that it builds upon the representation of the observable X as X=a(θ,U). With further constraints on the function a() to ensure the validity condition holds… An interesting point is that the functional connection X=a(θ,U) means that the nature of U changes once X is observed, albeit in a delicate manner outside a Bayesian framework. When illustrated on the Gleser-Hwang paradox, the resolution proceeds from an arbitrary choice of a one-dimensional summary, though. (As I am reading the paper, I realise it builds on other and earlier papers by the authors, papers that I cannot read for lack of time. I must have listened to a talk by one of the authors last year at JSM as this rings a bell. Somewhat.) In conclusion of a quick Sunday afternoon read, I am not convinced by the arguments in the paper and even less by the impression of a remaining arbitrariness in setting the resulting procedure.

ISBA 2016 [#5]

Posted in Mountains, pictures, Running, Statistics, Travel on June 18, 2016 by xi'an

from above Forte Village, Santa Margherita di Pula, Sardinia, June 17, 2016

On Thursday, I started the day by a rather masochistic run to the nearby hills, not only because of the very early hour but also because, by following rabbit trails that were not intended for my size, I ended up being scratched by thorns and bramble all over!, but also with neat views of the coast around Pula. From there, it was all downhill [joke]. The first morning talk I attended was by Paul Fearnhead and about efficient change point estimation (which is an NP hard problem or close to). The method relies on dynamic programming [which reminded me of one of my earliest Pascal codes about optimising a dam debit]. From my spectator’s perspective, I wonder[ed] at easier models, from Lasso optimisation to spline modelling followed by testing equality between bits. Later that morning, James Scott delivered the first Bayarri Lecture, created in honour of our friend Susie who passed away between the previous ISBA meeting and this one. James gave an impressive coverage of regularisation through three complex models, with the [hopefully not degraded by my translation] message that we should [as Bayesians] focus on important parts of those models and use non-Bayesian tools like regularisation. I can understand the practical constraints for doing so, but optimisation leads us away from a Bayesian handling of inference problems, by removing the ascertainment of uncertainty…

Later in the afternoon, I took part in the Bayesian foundations session, discussing the shortcomings of the Bayes factor and suggesting the use of mixtures instead. With rebuttals from [friends in] the audience!

This session also included a talk by Victor Peña and Jim Berger analysing and answering the recent criticisms of the Likelihood principle. I am not sure this answer will convince the critics, but I won’t comment further as I now see the debate as resulting from a vague notion of inference in Birnbaum’s expression of the principle. Jan Hannig gave another foundation talk introducing fiducial distributions (a.k.a., Fisher’s Bayesian mimicry) but failing to provide a foundational argument for replacing Bayesian modelling. (Obviously, I am definitely prejudiced in this regard.)

The last session of the day was sponsored by BayesComp and saw talks by Natesh Pillai, Pierre Jacob, and Eric Xing. Natesh talked about his paper on accelerated MCMC recently published in JASA. Which surprisingly did not get discussed here, but would definitely deserve to be! As hopefully corrected within a few days, once I have recovered from conference burnout!!! Pierre Jacob presented a work we are currently completing with Chris Holmes and Lawrence Murray on modularisation, inspired by the cut problem (as exposed by Plummer at MCMski IV in Chamonix). And Eric Xing spoke about embarrassingly parallel solutions, discussed a while ago here.