Archive for expectation-propagation

Leave the Pima Indians alone!

Posted in Books, R, Statistics, University life with tags , , , , , , , , , , , , , , , , on July 15, 2015 by xi'an

“…our findings shall lead to us be critical of certain current practices. Specifically, most papers seem content with comparing some new algorithm with Gibbs sampling, on a few small datasets, such as the well-known Pima Indians diabetes dataset (8 covariates). But we shall see that, for such datasets, approaches that are even more basic than Gibbs sampling are actually hard to beat. In other words, datasets considered in the literature may be too toy-like to be used as a relevant benchmark. On the other hand, if ones considers larger datasets (with say 100 covariates), then not so many approaches seem to remain competitive” (p.1)

Nicolas Chopin and James Ridgway (CREST, Paris) completed and arXived a paper they had “threatened” to publish for a while now, namely why using the Pima Indian R logistic or probit regression benchmark for checking a computational algorithm is not such a great idea! Given that I am definitely guilty of such a sin (in papers not reported in the survey), I was quite eager to read the reasons why! Beyond the debate on the worth of such a benchmark, the paper considers a wider perspective as to how Bayesian computation algorithms should be compared, including the murky waters of CPU time versus designer or programmer time. Which plays against most MCMC sampler.

As a first entry, Nicolas and James point out that the MAP can be derived by standard a Newton-Raphson algorithm when the prior is Gaussian, and even when the prior is Cauchy as it seems most datasets allow for Newton-Raphson convergence. As well as the Hessian. We actually took advantage of this property in our comparison of evidence approximations published in the Festschrift for Jim Berger. Where we also noticed the awesome performances of an importance sampler based on the Gaussian or Laplace approximation. The authors call this proposal their gold standard. Because they also find it hard to beat. They also pursue this approximation to its logical (?) end by proposing an evidence approximation based on the above and Chib’s formula. Two close approximations are provided by INLA for posterior marginals and by a Laplace-EM for a Cauchy prior. Unsurprisingly, the expectation-propagation (EP) approach is also implemented. What EP lacks in theoretical backup, it seems to recover in sheer precision (in the examples analysed in the paper). And unsurprisingly as well the paper includes a randomised quasi-Monte Carlo version of the Gaussian importance sampler. (The authors report that “the improvement brought by RQMC varies strongly across datasets” without elaborating for the reasons behind this variability. They also do not report the CPU time of the IS-QMC, maybe identical to the one for the regular importance sampling.) Maybe more surprising is the absence of a nested sampling version.

pimcisIn the Markov chain Monte Carlo solutions, Nicolas and James compare Gibbs, Metropolis-Hastings, Hamiltonian Monte Carlo, and NUTS. Plus a tempering SMC, All of which are outperformed by importance sampling for small enough datasets. But get back to competing grounds for large enough ones, since importance sampling then fails.

“…let’s all refrain from now on from using datasets and models that are too simple to serve as a reasonable benchmark.” (p.25)

This is a very nice survey on the theme of binary data (more than on the comparison of algorithms in that the authors do not really take into account design and complexity, but resort to MSEs versus CPus). I however do not agree with their overall message to leave the Pima Indians alone. Or at least not for the reason provided therein, namely that faster and more accurate approximations methods are available and cannot be beaten. Benchmarks always have the limitation of “what you get is what you see”, i.e., the output associated with a single dataset that only has that many idiosyncrasies. Plus, the closeness to a perfect normal posterior makes the logistic posterior too regular to pause a real challenge (even though MCMC algorithms are as usual slower than iid sampling). But having faster and more precise resolutions should on the opposite be  cause for cheers, as this provides a reference value, a golden standard, to check against. In a sense, for every Monte Carlo method, there is a much better answer, namely the exact value of the integral or of the optimum! And one is hardly aiming at a more precise inference for the benchmark itself: those Pima Indians [whose actual name is Akimel O’odham] with diabetes involved in the original study are definitely beyond help from statisticians and the model is unlikely to carry out to current populations. When the goal is to compare methods, as in our 2009 paper for Jim Berger’s 60th birthday, what matters is relative speed and relative ease of implementation (besides the obvious convergence to the proper target). In that sense bigger and larger is not always relevant. Unless one tackles really big or really large datasets, for which there is neither benchmark method nor reference value.

Bayesian computation: fore and aft

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , on February 6, 2015 by xi'an

BagneuxWith my friends Peter Green (Bristol), Krzysztof Łatuszyński (Warwick) and Marcello Pereyra (Bristol), we just arXived the first version of “Bayesian computation: a perspective on the current state, and sampling backwards and forwards”, which first title was the title of this post. This is a survey of our own perspective on Bayesian computation, from what occurred in the last 25 years [a  lot!] to what could occur in the near future [a lot as well!]. Submitted to Statistics and Computing towards the special 25th anniversary issue, as announced in an earlier post.. Pulling strength and breadth from each other’s opinion, we have certainly attained more than the sum of our initial respective contributions, but we are welcoming comments about bits and pieces of importance that we miss and even more about promising new directions that are not posted in this survey. (A warning that is should go with most of my surveys is that my input in this paper will not differ by a large margin from ideas expressed here or in previous surveys.)

EP as a way of life (aka Life of EP)

Posted in Books, Statistics, University life with tags , , , , , , , on December 24, 2014 by xi'an

When Andrew was in Paris, we discussed at length about using EP for handling big datasets in a different way than running parallel MCMC. A related preprint came out on arXiv a few days ago, with an introduction on Andrews’ blog. (Not written two months in advance as most of his entries!)

The major argument in using EP in a large data setting is that the approximation to the true posterior can be build using one part of the data at a time and thus avoids handling the entire likelihood function. Nonetheless, I still remain mostly agnostic about using EP and a seminar this morning at CREST by Guillaume Dehaene and Simon Barthelmé (re)generated self-interrogations about the method that hopefully can be exploited towards the future version of the paper.

One of the major difficulties I have with EP is about the nature of the resulting approximation. Since it is chosen out of a “nice” family of distributions, presumably restricted to an exponential family, the optimal approximation will remain within this family, which further makes EP sound like a specific variational Bayes method since the goal is to find the family member the closest to the posterior in terms of Kullback-Leibler divergence. (Except that the divergence is the opposite one.) I remain uncertain about what to do with the resulting solution, as the algorithm does not tell me how close this solution will be from the true posterior. Unless one can use it as a pseudo-distribution for indirect inference (a.k.a., ABC)..?

Another thing that became clear during this seminar is that the decomposition of the target as a product is completely arbitrary, i.e., does not correspond to an feature of the target other than the later being the product of those components. Hence, the EP partition could be adapted or even optimised within the algorithm. Similarly, the parametrisation could be optimised towards a “more Gaussian” posterior. This is something that makes EP both exciting as opening many avenues for experimentation and fuzzy as its perceived lack of goal makes comparing approaches delicate. For instance, using MCMC or HMC steps to estimate the parameters of the tilted distribution is quite natural in complex settings but the impact of the additional approximation must be gauged against the overall purpose of the approach.

 

reflections on the probability space induced by moment conditions with implications for Bayesian Inference [refleXions]

Posted in Statistics, University life with tags , , , , , , , , , , on November 26, 2014 by xi'an

“The main finding is that if the moment functions have one of the properties of a pivotal, then the assertion of a distribution on moment functions coupled with a proper prior does permit Bayesian inference. Without the semi-pivotal condition, the assertion of a distribution for moment functions either partially or completely specifies the prior.” (p.1)

Ron Gallant will present this paper at the Conference in honour of Christian Gouréroux held next week at Dauphine and I have been asked to discuss it. What follows is a collection of notes I made while reading the paper , rather than a coherent discussion, to come later. Hopefully prior to the conference.

The difficulty I have with the approach presented therein stands as much with the presentation as with the contents. I find it difficult to grasp the assumptions behind the model(s) and the motivations for only considering a moment and its distribution. Does it all come down to linking fiducial distributions with Bayesian approaches? In which case I am as usual sceptical about the ability to impose an arbitrary distribution on an arbitrary transform of the pair (x,θ), where x denotes the data. Rather than a genuine prior x likelihood construct. But I bet this is mostly linked with my lack of understanding of the notion of structural models.

“We are concerned with situations where the structural model does not imply exogeneity of θ, or one prefers not to rely on an assumption of exogeneity, or one cannot construct a likelihood at all due to the complexity of the model, or one does not trust the numerical approximations needed to construct a likelihood.” (p.4)

As often with econometrics papers, this notion of structural model sets me astray: does this mean any latent variable model or an incompletely defined model, and if so why is it incompletely defined? From a frequentist perspective anything random is not a parameter. The term exogeneity also hints at this notion of the parameter being not truly a parameter, but including latent variables and maybe random effects. Reading further (p.7) drives me to understand the structural model as defined by a moment condition, in the sense that

\mathbb{E}[m(\mathbf{x},\theta)]=0

has a unique solution in θ under the true model. However the focus then seems to make a major switch as Gallant considers the distribution of a pivotal quantity like

Z=\sqrt{n} W(\mathbf{x},\theta)^{-\frac{1}{2}} m(\mathbf{x},\theta)

as induced by the joint distribution on (x,θ), hence conversely inducing constraints on this joint, as well as an associated conditional. Which is something I have trouble understanding, First, where does this assumed distribution on Z stem from? And, second, exchanging randomness of terms in a random variable as if it was a linear equation is a pretty sure way to produce paradoxes and measure theoretic difficulties.

The purely mathematical problem itself is puzzling: if one knows the distribution of the transform Z=Z(X,Λ), what does that imply on the joint distribution of (X,Λ)? It seems unlikely this will induce a single prior and/or a single likelihood… It is actually more probable that the distribution one arbitrarily selects on m(x,θ) is incompatible with a joint on (x,θ), isn’t it?

“The usual computational method is MCMC (Markov chain Monte Carlo) for which the best known reference in econometrics is Chernozhukov and Hong (2003).” (p.6)

While I never heard of this reference before, it looks like a 50 page survey and may be sufficient for an introduction to MCMC methods for econometricians. What I do not get though is the connection between this reference to MCMC and the overall discussion of constructing priors (or not) out of fiducial distributions. The author also suggests using MCMC to produce the MAP estimate but this always stroke me as inefficient (unless one uses our SAME algorithm of course).

“One can also compute the marginal likelihood from the chain (Newton and Raftery (1994)), which is used for Bayesian model comparison.” (p.22)

Not the best solution to rely on harmonic means for marginal likelihoods…. Definitely not. While the author actually uses the stabilised version (15) of Newton and Raftery (1994) estimator, which in retrospect looks much like a bridge sampling estimator of sorts, it remains dangerously close to the original [harmonic mean solution] especially for a vague prior. And it only works when the likelihood is available in closed form.

“The MCMC chains were comprised of 100,000 draws well past the point where transients died off.” (p.22)

I wonder if the second statement (with a very nice image of those dying transients!) is intended as a consequence of the first one or independently.

“A common situation that requires consideration of the notions that follow is that deriving the likelihood from a structural model is analytically intractable and one cannot verify that the numerical approximations one would have to make to circumvent the intractability are sufficiently accurate.” (p.7)

This then is a completely different business, namely that defining a joint distribution by mean of moment equations prevents regular Bayesian inference because the likelihood is not available. This is more exciting because (i) there are alternative available! From ABC to INLA (maybe) to EP to variational Bayes (maybe). And beyond. In particular, the moment equations are strongly and even insistently suggesting that empirical likelihood techniques could be well-suited to this setting. And (ii) it is no longer a mathematical worry: there exist a joint distribution on m(x,θ), induced by a (or many) joint distribution on (x,θ). So the question of finding whether or not it induces a single proper prior on θ becomes relevant. But, if I want to use ABC, being given the distribution of m(x,θ) seems to mean I can only generate new values of this transform while missing a natural distance between observations and pseudo-observations. Still, I entertain lingering doubts that this is the meaning of the study. Where does the joint distribution come from..?!

“Typically C is coarse in the sense that it does not contain all the Borel sets (…)  The probability space cannot be used for Bayesian inference”

My understanding of that part is that defining a joint on m(x,θ) is not always enough to deduce a (unique) posterior on θ, which is fine and correct, but rather anticlimactic. This sounds to be what Gallant calls a “partial specification of the prior” (p.9).

Overall, after this linear read, I remain very much puzzled by the statistical (or Bayesian) implications of the paper . The fact that the moment conditions are central to the approach would once again induce me to check the properties of an alternative approach like empirical likelihood.

future of computational statistics

Posted in Books, pictures, R, Statistics, University life with tags , , , , , , , , , , , , , , on September 29, 2014 by xi'an

I am currently preparing a survey paper on the present state of computational statistics, reflecting on the massive evolution of the field since my early Monte Carlo simulations on an Apple //e, which would take a few days to return a curve of approximate expected squared error losses… It seems to me that MCMC is attracting more attention nowadays than in the past decade, both because of methodological advances linked with better theoretical tools, as for instance in the handling of stochastic processes, and because of new forays in accelerated computing via parallel and cloud computing, The breadth and quality of talks at MCMski IV is testimony to this. A second trend that is not unrelated to the first one is the development of new and the rehabilitation of older techniques to handle complex models by approximations, witness ABC, Expectation-Propagation, variational Bayes, &tc. With a corollary being an healthy questioning of the models themselves. As illustrated for instance in Chris Holmes’ talk last week. While those simplifications are inevitable when faced with hardly imaginable levels of complexity, I still remain confident about the “inevitability” of turning statistics into an “optimize+penalize” tunnel vision…  A third characteristic is the emergence of new languages and meta-languages intended to handle complexity both of problems and of solutions towards a wider audience of users. STAN obviously comes to mind. And JAGS. But it may be that another scale of language is now required…

If you have any suggestion of novel directions in computational statistics or instead of dead ends, I would be most interested in hearing them! So please do comment or send emails to my gmail address bayesianstatistics

ABC and Monte Carlo seminar in CREST

Posted in Statistics, University life with tags , , , , , , , on January 13, 2012 by xi'an

On Monday (Jan. 16, 3pm, CRESTENSAE, Room S08), Nicolas Chopin will present a talk on:

Dealing with intractability: recent advances in Bayesian Monte-Carlo methods for intractable likelihoods
(joint works with P. Jacob, O. Papaspiliopoulos and S. Barthelmé)

This talk will start with a review of recent advancements in Monte Carlo methodology for intractable problems; that is problems involving intractable quantities, typically intractable likelihoods. I will discuss in turn ABC type methods (a.k.a. likelihood-free), auxiliary variable methods for dealing with intractable normalising constants (e.g. the exchange algorithm), and MC² type of algorithms, a recent extension of which being the PMCMC algorithm (Andrieu et al., 2010). Then, I will present two recent pieces of work in these direction. First, and more briefly briefly, I’ll present the ABC-EP algorithm (Chopin and Barthelmé, 2011). I’ll also discuss some possible future research in ABC theory. Second, I’ll discuss the SMC² algorithm (Chopin, Jacob and Papaspiliopoulos, 2011), a new type of MC² algorithm that makes it possible to perform sequential analysis for virtually any state-space models, including models with an intractable Markov transition.

Bayes-250, Edinburgh [day 2]

Posted in Books, Mountains, pictures, R, Running, Statistics, Travel, University life with tags , , , , , , , , , on September 7, 2011 by xi'an

After a terrific run this morning to the top of Arthur’s Seat, and then around (the ribs are feeling fine, now!), the Bayes-250 talks were exhilarating and challenging. Jim Smith gave an introduction to the challenges of getting different experts to collaborate on a complex risk assessment, much in the spirit of his book, that got me wondering about experts with their own agenda/utility function. For instance, in the case of the recent Fukushima disaster, experts from the electricity company could not be expected to provide trustworthy answers… This meant the assessor’s loss function had to account for this bias in the experts’  opinions. John Winn (from Microsoft Cambridge) argued about developing probabilistic programming, which meant incorporating functions and types like

  • random (along a specific probability distribution);
  • constrain (meaning including data constraints);
  • infer (deriving the posterior distribution);

to the standard programming languages. The idea is conceptually interesting, and the notion of linking Thomas Bayes with Ada Byron Lovelace promising in a steampunk universe, however I remain unconvinced by the universality of the target, as approximations such as EP and variational Bayes need to be introduced for the fast computation of the posterior distribution. Peter Green presented an acceleration device for simulating over decomposable graphs, thanks to the use of the junction tree. Zoubin Gharhamani recalled the origins of the Indian buffet process and provided on the go a list of anti-Bayesian myths that was quite worth posting. Neil Lawrence showed us an interesting work on latent forces, in the mechanical sense, even though I could not keep track with the mechanics behind it! In the afternoon, Michael Goldstein told us why Bayes theorem does not work (not that I agree with him on that point!), Peggy Series explained how, fascinatingly, the brain processes information in a Bayesian manner with an “optimal” prior, and Nicolas Chopin talked about his expectation-propagation summary-less likelihood-free algorithm I discussed a few days ago. We also had a lively discussion about ABC model choice, from the choice of the metric distance to the impact of the summary statistics. The meeting ended with Andrew Fraser telling us about Thomas Bayes’ studies at Edinburgh in 1719-1721 and a quick stroll through the Old College. The day ended in a rather surprising pub, The Blind Poet, and a so-so south Indian restaurant, when we most unexpectedly bumped into Marc Suchard, visiting collaborators in Edinburgh!