When playing with Peter Rossi’s bayesm R package during a visit of Jean-Michel Marin to Paris, last week, we came up with the above Gibbs outcome. The setting is a Gaussian mixture model with three components in dimension 5 and the prior distributions are standard conjugate. In this case, with 500 observations and 5000 Gibbs iterations, the Markov chain (for one component of one mean of the mixture) has two highly distinct regimes: one that revolves around the true value of the parameter, 2.5, and one that explores a much broader area (which is associated with a much smaller value of the component weight). What we found amazing is the Gibbs ability to entertain both regimes, simultaneously.
Archive for the Books Category
When preparing my OxWaSP projects a few weeks ago, I came perchance on a set of slides, entitled “Hierarchical models are not Bayesian”, written by Brian Dennis (University of Idaho), where the author argues against Bayesian inference in hierarchical models in ecology, much in relation with the previously discussed paper of Subhash Lele. The argument is the same, namely a possibly major impact of the prior modelling on the resulting inference, in particular when some parameters are hardly identifiable, all the more so when the model is complex and when there are many parameters. And that, “data cloning” being available since 2007, frequentist methods have “caught up” with Bayesian computational abilities.
Let me remind the reader that “data cloning” means constructing a sequence of Bayes estimators corresponding to the data being duplicated (or cloned) once, twice, etc., until the point estimator stabilises. Since this corresponds to using increasing powers of the likelihood, the posteriors concentrate more and more around the maximum likelihood estimator. And even recover the Hessian matrix. This technique is actually older than 2007 since I proposed it in the early 1990s under the name of prior feedback, with earlier occurrences in the literature like D’Epifanio (1989) and even the discussion of Aitkin (1991). A more efficient version of this approach is the SAME algorithm we developed in 2002 with Arnaud Doucet and Simon Godsill where the power of the likelihood is increased during iterations in a simulated annealing version (with a preliminary version found in Duflo, 1996).
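To make the mechanism concrete, here is a minimal sketch of data cloning in a case where everything is available in closed form. The Bernoulli sample, the Beta(5,1) prior and the function names are illustrative assumptions of mine, not taken from the slides. Cloning the data k times raises the likelihood to the power k, so the posterior remains a Beta distribution, its mean drifts to the MLE, and k times its variance approaches the inverse Fisher information.

```python
def cloned_posterior(successes, n, k, a, b):
    """Beta posterior parameters when the Bernoulli sample is cloned k times.

    Cloning k times means the likelihood is raised to the power k,
    so the Beta(a, b) prior is updated with k * successes and k * failures.
    """
    return a + k * successes, b + k * (n - successes)


def beta_mean_var(alpha, beta):
    """Mean and variance of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    mean = alpha / total
    var = alpha * beta / (total ** 2 * (total + 1.0))
    return mean, var


# Hypothetical data: 7 successes out of 20 trials, with a deliberately
# informative Beta(5, 1) prior that pulls the estimate upwards.
successes, n = 7, 20
a, b = 5.0, 1.0

mle = successes / n
mle_var = mle * (1.0 - mle) / n  # inverse Fisher information at the MLE

results = []
for k in (1, 10, 100, 10_000):
    alpha, beta = cloned_posterior(successes, n, k, a, b)
    mean, var = beta_mean_var(alpha, beta)
    # k * var is the rescaled posterior variance used to approximate
    # the asymptotic variance of the MLE.
    results.append((k, mean, k * var))
    print(f"k={k:6d}  posterior mean={mean:.5f}  k*variance={k * var:.6f}")

print(f"MLE={mle:.5f}  asymptotic variance={mle_var:.6f}")
```

In this conjugate toy case no MCMC is needed; in a realistic hierarchical model each value of k requires its own posterior simulation, which is where the mode-trapping issue comes in.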
I completely agree with the author that a hierarchical model does not have to be Bayesian: when the random parameters in the model are analysed as sources of additional variations, as for instance in animal breeding or ecology, and integrated out, the resulting model can be analysed by any statistical method. Even though one may wonder at the motivations for selecting this particular randomness structure in the model. And at an increasing blurring between what is prior modelling and what is sampling modelling as the number of levels in the hierarchy goes up. This rather amusing set of slides somewhat misses a few points, in particular about the ability of data cloning to overcome identifiability and multimodality issues. Indeed, as with all simulated annealing techniques, there is a practical difficulty in avoiding the fatal attraction of a local mode using MCMC techniques. There are thus high chances data cloning ends up in the “wrong” mode. Moreover, when the likelihood is multimodal, it is a general issue to decide which of the modes is most relevant for inference. In which sense is the MLE more objective than a Bayes estimate, then? Further, the impact of a prior on some aspects of the posterior distribution can be tested by re-running a Bayesian analysis with different priors, including empirical Bayes versions or, why not?!, data cloning, in order to understand where and why huge discrepancies occur. This is part of model building, in the end.
Bayesian optimization for likelihood-free inference of simulator-based statistical models [guest post]
Posted in Books, Statistics, University life with tags ABC, arXiv, Dennis Prangle, dimension curse, Gaussian processes, guest post, NIPS, nonparametric probability density estimation on February 17, 2015 by xi'an
Here are some comments on the paper of Gutmann and Corander. My brief skim read through this concentrated on the second half of the paper, the applied methodology. So my comments should be quite complementary to Christian’s on the theoretical part!
ABC algorithms generally follow the template of proposing parameter values, simulating datasets and accepting/rejecting/weighting the results based on similarity to the observations. The output is a Monte Carlo sample from a target distribution, an approximation to the posterior. The most naive proposal distribution for the parameters is simply the prior, but this is inefficient if the prior is highly diffuse compared to the posterior. MCMC and SMC methods can be used to provide better proposal distributions. Nevertheless they often still seem quite inefficient, requiring repeated simulations in parts of parameter space which have already been well explored.
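As an illustration of this template, here is a minimal sketch of the most naive version, ABC rejection sampling with the prior as proposal. The toy model (normal observations with unknown mean), the uniform prior on (-10,10), the sample mean as summary statistic and the tolerance are all assumptions chosen for the example, not anything taken from the paper.

```python
import random
import statistics


def simulate(theta, n, rng):
    """Toy simulator: n draws from a normal with mean theta and unit variance."""
    return [rng.gauss(theta, 1.0) for _ in range(n)]


def abc_rejection(s_obs, n, n_proposals, eps, rng, prior_low=-10.0, prior_high=10.0):
    """Propose from the prior, simulate a dataset, keep theta if the summaries are close."""
    accepted = []
    for _ in range(n_proposals):
        theta = rng.uniform(prior_low, prior_high)  # proposal = diffuse prior
        s_sim = statistics.fmean(simulate(theta, n, rng))  # summary statistic
        if abs(s_sim - s_obs) < eps:  # accept/reject on the discrepancy
            accepted.append(theta)
    return accepted


rng = random.Random(0)
n_obs, n_proposals = 20, 10_000
observed = simulate(2.0, n_obs, rng)  # pseudo-observed data, true mean 2
s_obs = statistics.fmean(observed)

accepted = abc_rejection(s_obs, n_obs, n_proposals, eps=0.2, rng=rng)
acceptance_rate = len(accepted) / n_proposals

print(f"observed summary: {s_obs:.3f}")
print(f"accepted {len(accepted)} of {n_proposals} proposals")
print(f"ABC posterior mean: {statistics.fmean(accepted):.3f}")
```

With a prior this diffuse only a small fraction of the simulations is accepted, which is the inefficiency that better proposals, or the surrogate modelling discussed next, try to address.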
The strategy of this paper is to instead attempt to fit a non-parametric model to the target distribution (or in fact to a slight variation of it). Hopefully this will require many fewer simulations. This approach is quite similar to Richard Wilkinson’s recent paper. Richard fitted a Gaussian process to the ABC analogue of the log-likelihood. Gutmann and Corander introduce two main novelties:
1. They model the expected discrepancy (i.e. distance) Δθ between the simulated and observed summary statistics. This is then transformed to estimate the likelihood. This is in contrast to Richard, who transformed the discrepancy before modelling, which is the standard ABC approach of weighting the discrepancy depending on how close to 0 it is. The drawback of the latter approach is that it requires picking a tuning parameter (the ABC acceptance threshold or bandwidth) in advance of the algorithm. The new approach still requires a tuning parameter but its choice can be delayed until the transformation is performed.
2. They generate the θ values on-line using “Bayesian optimisation”. The idea is to pick θ to concentrate on the region near the minimum of the objective function, and also to reduce uncertainty in the Gaussian process. Thus well explored regions can usually be neglected. This is in contrast to Richard, who chose the θs using a space-filling design prior to performing any simulations.
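The two ingredients can be sketched together on a toy problem. Everything below is an assumption made for illustration: the simulator, the squared-exponential kernel and its fixed hyperparameters, and the lower confidence bound rule used to pick the next θ. I do not claim this rule is the criterion of the paper's Equation (45).

```python
import numpy as np


def sq_exp_kernel(x1, x2, length, signal_var):
    """Squared-exponential covariance between two sets of scalar inputs."""
    d = x1[:, None] - x2[None, :]
    return signal_var * np.exp(-0.5 * (d / length) ** 2)


def gp_posterior(x_train, y_train, x_grid, length=3.0, signal_var=25.0, noise_var=0.05):
    """GP regression of the discrepancy on theta: posterior mean and sd on a grid."""
    y_mean = y_train.mean()  # constant prior mean, set to the empirical mean
    K = sq_exp_kernel(x_train, x_train, length, signal_var)
    K += noise_var * np.eye(len(x_train))  # simulator noise on the diagonal
    K_s = sq_exp_kernel(x_grid, x_train, length, signal_var)
    mean = y_mean + K_s @ np.linalg.solve(K, y_train - y_mean)
    v = np.linalg.solve(K, K_s.T)
    var = signal_var - np.sum(K_s * v.T, axis=1)
    return mean, np.sqrt(np.maximum(var, 0.0))


def discrepancy(theta, s_obs, n, rng):
    """Distance between simulated and observed summaries (toy normal-mean model)."""
    return abs(rng.normal(theta, 1.0, size=n).mean() - s_obs)


rng = np.random.default_rng(1)
s_obs, n = 2.0, 20  # hypothetical observed summary and sample size
grid = np.linspace(-10.0, 10.0, 401)

# A handful of initial simulations at random parameter values.
thetas = rng.uniform(-10.0, 10.0, size=5)
deltas = np.array([discrepancy(t, s_obs, n, rng) for t in thetas])

for _ in range(25):
    mean, sd = gp_posterior(thetas, deltas, grid)
    # Lower confidence bound: favours a low predicted discrepancy (exploitation)
    # and a high predictive uncertainty (exploration).
    theta_next = grid[np.argmin(mean - 2.0 * sd)]
    thetas = np.append(thetas, theta_next)
    deltas = np.append(deltas, discrepancy(theta_next, s_obs, n, rng))

mean, sd = gp_posterior(thetas, deltas, grid)
theta_hat = grid[np.argmin(mean)]
print(f"{len(thetas)} simulations, minimiser of the fitted discrepancy: {theta_hat:.2f}")
```

Only 30 simulations are used here, against thousands for a rejection sampler on the same toy problem. The price is the reliance on the surrogate model, and turning the fitted discrepancy into a likelihood approximation still requires the threshold choice mentioned in the first point.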
I didn’t read the paper’s theory closely enough to decide whether (1) is a good idea. Certainly the results for the paper’s examples look convincing. Also, one issue with Richard’s approach was that because the log-likelihood varied over such a wide range of magnitudes, he needed to fit several “waves” of GPs. It would be nice to know if the approach of modelling the discrepancy has removed this problem, or if a single GP is still sometimes an insufficiently flexible model.
Novelty (2) is a very nice and natural approach to take here. I did wonder why the particular criterion in Equation (45) was used to decide on the next θ. Does this correspond to optimising some information theoretic quantity? Other practical questions were whether it’s possible to parallelise the method (I seem to remember talking to Michael Gutmann about this at NIPS but can’t remember his answer!), and how well the approach scales up with the dimension of the parameters.
I read this book by Albert Camus over my week in Oxford, having found it on my daughter’s bookshelf (as she had presumably read it in high school…). It is a very special book in that (a) Camus was working on it when he died in a car accident, (b) the manuscript was found among the wreckage, and (c) it differs very much from Camus’ other books. Indeed, the book is partly autobiographical and written with an unsentimental realism that is raw and brutal. It describes the youth of Jacques, the son of French colons in Algiers, whose father had died in the first days of WW I and whose family lives in the uttermost poverty, with both his mother and grandmother doing menial jobs to simply survive. Thanks to a supportive teacher, he manages to get a grant to attend secondary school. What is most moving about the book is how Camus describes the numbing effects of poverty, namely how his relatives see their universe shrinking so much that notions like the Mother Country (France) or books lose meaning for them. Without moving them towards or against native Algerians, who never penetrate the inner circles in the novel, moving behind a sort of glass screen. It is not that the tensions and horrors of the colonisation and of the resistance to colonisation are hidden, quite the opposite, but the narrator considers those with a sort of fatalism without questioning the colonisation itself. (The book reminded me very much of my grandfather’s childhood, with a father also among the dead soldiers of WW I, being raised by a single mother in harsh conditions. With the major difference that my grandfather decided to stop school very early to become a gardener…) There are also obvious parallels with Pagnol’s autobiographical novels like My Father’s Glory, written at about the same time, from the boy friendship to the major role of the instituteur, to the hunting party, to the funny uncle, but everything opposes the two authors, from Pagnol’s light truculence to Camus’ tragic depiction.
Pagnol’s books are great teen books (and I still remember my mother buying the first one on a vacation road trip) but nothing more. Camus’ book could have been his greatest book, had he survived the car accident of January 1960.
In the common room of the Department of Mathematics at the University of Warwick [same building as the Department of Statistics], there is a box for book exchanges and I usually take a look at each visit for a possible exchange. In October, I thus picked Jo Nesbø’s The Redbreast in exchange for maybe The Rogue Male. However, it stood on my office bookcase for another three months before I found time to read this early (2000) instalment in the Harry Hole series. With connections with the earliest Redeemer.
This is a fairly good if not perfect book, with a large opening into Norway’s WW II history and the volunteers who joined Nazi Germany to fight on the Eastern Front. And the collaborationist government of Vidkun Quisling. I found most interesting this entry into this period and the many parallels with French history at the same time. (To the point that quisling is now a synonym for collaborator, similar to pétainiste in French.) This historical background has some similarities with Camilla Läckberg’s Hidden Child I read a while ago but on a larger and broader scale. Reminiscences and episodes from 1940-1944 take a large part of the book. And rightly so, as the story during WW II explains a lot of the current plot. While this may sound like an easy story-line, the plot also dwells a lot on skinheads and neo-Nazis in Oslo. While Hole’s recurrent alcoholism irks me in the long run (more than Rebus’ own alcohol problem, for some reason!), the construction of the character is quite well-done, along with a reasonable police force, even though both Hole’s inquest and the central crime of the story are stretching on and beyond belief, with too many coincidences. And a fatal shot by the police leads to very little noise and investigation, in a country where the murder rate is one of the lowest in the World and police officers do not carry guns. Except in Nesbø’s novels! Still, I did like the novel to the point of spending most of a Sunday afternoon on it, with the additional appeal of most of it taking place in Oslo. Definitely a page turner.
Hartig et al. published a while ago (2011) a paper in Ecology Letters entitled “Statistical inference for stochastic simulation models – theory and application”, which is mostly about ABC. (Florian Hartig pointed out the paper to me in a recent blog comment about my discussion of the early parts of Gutmann and Corander’s paper.) The paper is largely a tutorial and it reminds the reader about related methods like indirect inference and the method of moments. The authors also insist on presenting ABC as a particular case of likelihood approximation, whether non-parametric or parametric. Making connections with pseudo-likelihood and pseudo-marginal approaches. And including a discussion of the possible misfit of the assumed model, handled by an external error model. And also introducing the notion of informal likelihood (which could have been nicely linked with empirical likelihood). A last class of approximations presented therein is called rejection filters and reminds me very much of Ollie Ratmann’s papers.
“Our general aim is to find sufficient statistics that are as close to minimal sufficiency as possible.” (p.819)
As in other ABC papers, and as often reported on this blog, I find the stress on sufficiency a wee bit too heavy as those models calling for approximation almost invariably do not allow for any form of useful sufficiency. Hence the mathematical statistics notion of sufficiency is mostly useless in such settings.
“A basic requirement is that the expectation value of the point-wise approximation of p(Sobs|φ) must be unbiased” (p.823)
As stated above, the paper is mostly in tutorial mode, for instance explaining what MCMC and SMC methods are. As illustrated by the above figure. There is however a final and interesting discussion section on the impact of estimating the likelihood function at different values of the parameter. However, the authors seem to focus solely on pseudo-marginal results to validate this approximation, hence on unbiasedness, which does not hold for most ABC approaches that I know. Nor for the approximations listed in the survey. Actually, it would be quite beneficial to devise a cheap tool to assess the bias or extra-variation due to the use of approximate techniques like ABC… A sort of 21st Century bootstrap?!
Subhash Lele recently arXived a short paper entitled “Is non-informative Bayesian analysis appropriate for wildlife management: survival of San Joaquin Kit fox and declines in amphibian populations”. (Lele has been mentioned several times on this blog in connection with his data-cloning approach that mostly clones our own SAME algorithm.)
“The most commonly used non-informative priors are either the uniform priors or the priors with very large variances spreading the probability mass almost uniformly over the entire parameter space.”
The main goal of the paper is to warn, even better “to disabuse the ecologists of the notion that there is no difference between non-informative Bayesian inference and likelihood-based inference and that the philosophical underpinnings of statistical inference are irrelevant to practice.” The argument advanced by Lele is simply that two different parametrisations should lead to two compatible priors and that, if they do not, this exhibits an unacceptable impact of the prior modelling on the resulting inference, while likelihood-based inference [obviously] does not depend on parametrisation.
The first example in the paper is a dynamic linear model of a fox population series when using a uniform U(0,1) prior on a parameter b against a Ga(100,100) prior on -a/b. (The normal prior on a is the same in both cases.) I do not find the opposition between the two posteriors in the least surprising as the modelling starts by assuming different supports on the parameter b. And both are highly “informative” in that there is no intrinsic constraint on b that could justify the (0,1) support, as illustrated by the second choice when b is unconstrained, varying on (-15,15) or (-0.0015,0.0015) depending on how the Ga(100,100) prior is parametrised.
and the paper opposes a uniform prior on p,q to a normal N(0,10^3) prior on the logit transforms of p and q. [With an obvious typo at the top of page 10.] As shown on the above graph, the two priors on p are immensely different, so should lead to different posteriors in a weakly informative setting such as a Bernoulli experiment. Even with a few hundred individuals. A somewhat funny aspect of this study is that Lele opposes the uniform prior to the Jeffreys Be(.5,.5) prior as being “nowhere close to looking like what one would consider a non-informative prior”, without noticing that the logit parametrisation normal prior leads to an even more peaked prior…
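The difference between these priors can be checked numerically without any simulation, as in the following sketch, which compares the prior mass each of them puts on the central interval (0.01,0.99) for p. I read 10^3 as the variance of the normal prior, which is an assumption on my part (a standard deviation of 10^3 would give an even more extreme answer), and the interval itself is an arbitrary choice.

```python
import math


def normal_cdf(x, sd):
    """CDF of a centred normal distribution with standard deviation sd."""
    return 0.5 * (1.0 + math.erf(x / (sd * math.sqrt(2.0))))


def logit(p):
    return math.log(p / (1.0 - p))


def mass_logit_normal(lo, hi, sd):
    """P(lo < p < hi) when logit(p) ~ N(0, sd^2): a change of variable on the CDF."""
    return normal_cdf(logit(hi), sd) - normal_cdf(logit(lo), sd)


def mass_uniform(lo, hi):
    """P(lo < p < hi) under the uniform prior on (0, 1)."""
    return hi - lo


def mass_jeffreys(lo, hi):
    """P(lo < p < hi) under Be(1/2, 1/2), whose CDF is (2/pi) * asin(sqrt(p))."""
    def cdf(p):
        return 2.0 / math.pi * math.asin(math.sqrt(p))
    return cdf(hi) - cdf(lo)


lo, hi = 0.01, 0.99
sd = math.sqrt(1e3)  # assumption: 10^3 is the prior variance on the logit scale

central_mass = {
    "uniform": mass_uniform(lo, hi),
    "Jeffreys Be(.5,.5)": mass_jeffreys(lo, hi),
    "normal on logit": mass_logit_normal(lo, hi, sd),
}
for name, mass in central_mass.items():
    print(f"{name:20s} P({lo} < p < {hi}) = {mass:.3f}")
```

Under this reading, the normal prior on the logit scale puts most of its mass on p being within 0.01 of either 0 or 1, hence is far more concentrated at the boundaries than the Jeffreys prior.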
“Even when Jeffreys prior can be computed, it will be difficult to sell this prior as an objective prior to the jurors or the senators on the committee. The construction of Jeffreys and other objective priors for multi-parameter models poses substantial mathematical difficulties.”
I find it rather surprising that a paper can be dedicated to the comparison of two arbitrary prior distributions on two fairly simplistic models towards the global conclusion that “non-informative priors neither ‘let the data speak’ nor do they correspond (even roughly) to likelihood analysis.” In this regard, the earlier critical analysis of Seaman et al., to which my PhD student Kaniav Kamary and I replied, had a broader scope.