Archive for Pima Indians

X divergence for approximate inference

Posted in Statistics with tags , , , , , , , on March 14, 2017 by xi'an

Dieng et al. arXived this morning a new version of their paper on using the Χ divergence for variational inference. The Χ divergence essentially is the expectation of the squared ratio of the target distribution over the approximation, under the approximation. It is somewhat related to Expectation Propagation (EP), which aims at the Kullback-Leibler divergence between the target distribution and the approximation, under the target. And to variational Bayes, which is the same thing just the opposite way! The authors also point a link to our [adaptive] population Monte Carlo paper of 2008. (I wonder at a possible version through Wasserstein distance.)

Some of the arguments in favour of this new version of variational Bayes approximations is that (a) the support of the approximation over-estimates the posterior support; (b) it produces over-dispersed versions; (c) it relates to a well-defined and global objective function; (d) it allows for a sandwich inequality on the model evidence; (e) the function of the [approximation] parameter to be minimised is under the approximation, rather than under the target. The latest allows for a gradient-based optimisation. While one of the applications is on a Bayesian probit model applied to the Pima Indian women dataset [and will thus make James and Nicolas cringe!], the experimental assessment shows lower error rates for this and other benchmarks. Which in my opinion does not tell so much about the original Bayesian approach.

read paper [in Bristol]

Posted in Books, pictures, Statistics, Travel, University life with tags , , , , , , , , , , , , , , on January 29, 2016 by xi'an

Clifton & Durdham Downs, Bristol, Sept. 25, 2012I went to give a seminar in Bristol last Friday and I chose to present the testing with mixture paper. As we are busy working on the revision, I was eagerly looking for comments and criticisms that could strengthen this new version. As it happened, the (Bristol) Bayesian Cake (Reading) Club had chosen our paper for discussion, two weeks in a row!, hence the title!, and I got invited to join the group the morning prior to the seminar! This was, of course, most enjoyable and relaxed, including an home-made cake!, but also quite helpful in assessing our arguments in the paper. One point of contention or at least of discussion was the common parametrisation between the components of the mixture. Although all parametrisations are equivalent from a single component point of view, I can [almost] see why using a mixture with the same parameter value on all components may impose some unsuspected constraint on that parameter. Even when the parameter is the same moment for both components. This still sounds like a minor counterpoint in that the weight should converge to either zero or one and hence eventually favour the posterior on the parameter corresponding to the “true” model.

Another point that was raised during the discussion is the behaviour of the method under misspecification or for an M-open framework: when neither model is correct does the weight still converge to the boundary associated with the closest model (as I believe) or does a convexity argument produce a non-zero weight as it limit (as hinted by one example in the paper)? I had thought very little about this and hence had just as little to argue though as this does not sound to me like the primary reason for conducting tests. Especially in a Bayesian framework. If one is uncertain about both models to be compared, one should have an alternative at the ready! Or use a non-parametric version, which is a direction we need to explore deeper before deciding it is coherent and convergent!

A third point of discussion was my argument that mixtures allow us to rely on the same parameter and hence the same prior, whether proper or not, while Bayes factors are less clearly open to this interpretation. This was not uniformly accepted!

Thinking afresh about this approach also led me to broaden my perspective on the use of the posterior distribution of the weight(s) α: while previously I had taken those weights mostly as a proxy to the posterior probabilities, to be calibrated by pseudo-data experiments, as for instance in Figure 9, I now perceive them primarily as the portion of the data in agreement with the corresponding model [or hypothesis] and more importantly as a solution for staying away from a Neyman-Pearson-like decision. Or error evaluation. Usually, when asked about the interpretation of the output, my answer is to compare the behaviour of the posterior on the weight(s) with a posterior associated with a sample from each model. Which does sound somewhat similar to posterior predictives if the samples are simulated from the associated predictives. But the issue was not raised during the visit to Bristol, which possibly reflects on how unfrequentist the audience was [the Statistics group is], as it apparently accepted with no further ado the use of a posterior distribution as a soft assessment of the comparative fits of the different models. If not necessarily agreeing the need of conducting hypothesis testing (especially in the case of the Pima Indian dataset!).

approximating evidence with missing data

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , , , , , , , , on December 23, 2015 by xi'an

University of Warwick, May 31 2010Panayiota Touloupou (Warwick), Naif Alzahrani, Peter Neal, Simon Spencer (Warwick) and Trevelyan McKinley arXived a paper yesterday on Model comparison with missing data using MCMC and importance sampling, where they proposed an importance sampling strategy based on an early MCMC run to approximate the marginal likelihood a.k.a. the evidence. Another instance of estimating a constant. It is thus similar to our Frontier paper with Jean-Michel, as well as to the recent Pima Indian survey of James and Nicolas. The authors give the difficulty to calibrate reversible jump MCMC as the starting point to their research. The importance sampler they use is the natural choice of a Gaussian or t distribution centred at some estimate of θ and with covariance matrix associated with Fisher’s information. Or derived from the warmup MCMC run. The comparison between the different approximations to the evidence are done first over longitudinal epidemiological models. Involving 11 parameters in the example processed therein. The competitors to the 9 versions of importance samplers investigated in the paper are the raw harmonic mean [rather than our HPD truncated version], Chib’s, path sampling and RJMCMC [which does not make much sense when comparing two models]. But neither bridge sampling, nor nested sampling. Without any surprise (!) harmonic means do not converge to the right value, but more surprisingly Chib’s method happens to be less accurate than most importance solutions studied therein. It may be due to the fact that Chib’s approximation requires three MCMC runs and hence is quite costly. The fact that the mixture (or defensive) importance sampling [with 5% weight on the prior] did best begs for a comparison with bridge sampling, no? The difficulty with such study is obviously that the results only apply in the setting of the simulation, hence that e.g. another mixture importance sampler or Chib’s solution would behave differently in another model. In particular, it is hard to judge of the impact of the dimensions of the parameter and of the missing data.

the problem of assessing statistical methods

Posted in Books, pictures, Statistics, University life with tags , , , , , , on November 4, 2015 by xi'an

A new arXival today by Abigail Arnold and Jason Loeppky that discusses how simulations studies are and should be conducted when assessing statistical methods.

“Obviously there is no one model that will universally outperform the rest. Recognizing the “No Free Lunch” theorem, the logical question to ask is whether one model will perform best over a given class of problems. Again, we feel that the answer to this question is of course no. But we do feel that there are certain methods that will have a better chance than other methods.”

I find the assumptions or prerequisites of the paper arguable [in the sense of 2. open to disagreement; not obviously correct]—not even mentioning the switch from models to methods in the above—in that I will not be convinced that a method outperforms another method by simply looking at a series of simulation experiments. (Which is why I find some machine learning papers unconvincing, when they introduce a new methodology and run it through a couple benchmarks.) This also reminds me of Samaniego’s Comparison of the Bayesian and frequentist approaches, which requires a secondary prior to run the comparison. (And hence is inconclusive.)

“The papers above typically show the results as a series of side-by-side boxplots (…) for each method, with one plot for each test function and sample size. Conclusions are then drawn from looking at a handful of boxplots which often look very cluttered and usually do not provide clear evidence as to the best method(s). Alternatively, the results will be summarized in a table of average performance (…) These tables are usually overwhelming to look at and interpretations are incredibly inefficient.”

Agreed boxplots are terrible (my friend Jean-Michel is forever arguing against them!). Tables are worse. But why don’t we question RMSE as well? This is most often a very reductive way of comparing methods. I also agree with the point that the design of the simulation studies is almost always overlooked and induces a false sense of precision, while failing to cover a wide enough range of cases. However, and once more, I question the prerequisites for comparing methods through simulations for the purpose of ranking those methods. (Which is not the perspective adopted by James and Nicolas when criticising the use of the Pima Indian dataset.)

“The ECDF allows for quick assessments of methods over a large array of problems to get an overall view while of course not precluding comparisons on individual functions (…) We hope that readers of this paper agree with our opinions and strongly encourage everyone to rely on the ECDF, at least as a starting point, to display relevant statistical information from simulations.”

Drawing a comparison with the benchmarking of optimisation methods, the authors suggest to rank statistical methods via the empirical cdf of their performances or accuracy across (benchmark) problems. Arguing that “significant benefit is gained by [this] collapsing”. I am quite sceptical [as often] of the argument, first because using a (e)cdf means the comparison is unidimensional, second because I see no reason why two cdfs should be easily comparable, third because the collapsing over several problems only operates when the errors for those different problems do not overlap.


Posted in Books, Kids, pictures, Statistics, Travel, University life with tags , , , , , , , , , on September 22, 2015 by xi'an

Yesterday, I took part in the thesis defence of James Ridgway [soon to move to the University of Bristol[ at Université Paris-Dauphine. While I have already commented on his joint paper with Nicolas on the Pima Indians, I had not read in any depth another paper in the thesis, “On the properties of variational approximations of Gibbs posteriors” written jointly with Pierre Alquier and Nicolas Chopin.

PAC stands for probably approximately correct and starts with an empirical form of posterior, called the Gibbs posterior, where the log-likelihood is replaced with an empirical error

\pi(\theta|x_1,\ldots,x_n) \propto \exp\{-\lambda r_n(\theta)\}\pi(\theta)

that is rescaled by a factor λ. Factor that is called the learning rate, to be optimised as the (Kullback) closest  approximation to the true unknown distribution, by Peter Grünwald (2012) in his SafeBayes approach. In the paper of James, Pierre and Nicolas, there is no visible Bayesian perspective, since the pseudo-posterior is used to define a randomised estimator that achieves optimal oracle bounds. When λ is of order n. The purpose of the paper is rather to produce an efficient approximation to the Gibbs posterior, by using variational Bayes techniques. And to derive point estimators. With the added appeal that the approximation also achieves the oracle bounds. (Surprisingly, the authors do not leave the Pima Indians alone as they use this benchmark for a ranking model.) Since there is no discussion on the choice of the learning rate λ, as opposed to Bissiri et al. (2013) I discussed around Bayes.250, I have difficulties perceiving the possible impact of this representation on Bayesian analysis. Except maybe as an ABC device, as suggested by Christophe Andrieu.

Leave the Pima Indians alone!

Posted in Books, R, Statistics, University life with tags , , , , , , , , , , , , , , , , on July 15, 2015 by xi'an

“…our findings shall lead to us be critical of certain current practices. Specifically, most papers seem content with comparing some new algorithm with Gibbs sampling, on a few small datasets, such as the well-known Pima Indians diabetes dataset (8 covariates). But we shall see that, for such datasets, approaches that are even more basic than Gibbs sampling are actually hard to beat. In other words, datasets considered in the literature may be too toy-like to be used as a relevant benchmark. On the other hand, if ones considers larger datasets (with say 100 covariates), then not so many approaches seem to remain competitive” (p.1)

Nicolas Chopin and James Ridgway (CREST, Paris) completed and arXived a paper they had “threatened” to publish for a while now, namely why using the Pima Indian R logistic or probit regression benchmark for checking a computational algorithm is not such a great idea! Given that I am definitely guilty of such a sin (in papers not reported in the survey), I was quite eager to read the reasons why! Beyond the debate on the worth of such a benchmark, the paper considers a wider perspective as to how Bayesian computation algorithms should be compared, including the murky waters of CPU time versus designer or programmer time. Which plays against most MCMC sampler.

As a first entry, Nicolas and James point out that the MAP can be derived by standard a Newton-Raphson algorithm when the prior is Gaussian, and even when the prior is Cauchy as it seems most datasets allow for Newton-Raphson convergence. As well as the Hessian. We actually took advantage of this property in our comparison of evidence approximations published in the Festschrift for Jim Berger. Where we also noticed the awesome performances of an importance sampler based on the Gaussian or Laplace approximation. The authors call this proposal their gold standard. Because they also find it hard to beat. They also pursue this approximation to its logical (?) end by proposing an evidence approximation based on the above and Chib’s formula. Two close approximations are provided by INLA for posterior marginals and by a Laplace-EM for a Cauchy prior. Unsurprisingly, the expectation-propagation (EP) approach is also implemented. What EP lacks in theoretical backup, it seems to recover in sheer precision (in the examples analysed in the paper). And unsurprisingly as well the paper includes a randomised quasi-Monte Carlo version of the Gaussian importance sampler. (The authors report that “the improvement brought by RQMC varies strongly across datasets” without elaborating for the reasons behind this variability. They also do not report the CPU time of the IS-QMC, maybe identical to the one for the regular importance sampling.) Maybe more surprising is the absence of a nested sampling version.

pimcisIn the Markov chain Monte Carlo solutions, Nicolas and James compare Gibbs, Metropolis-Hastings, Hamiltonian Monte Carlo, and NUTS. Plus a tempering SMC, All of which are outperformed by importance sampling for small enough datasets. But get back to competing grounds for large enough ones, since importance sampling then fails.

“…let’s all refrain from now on from using datasets and models that are too simple to serve as a reasonable benchmark.” (p.25)

This is a very nice survey on the theme of binary data (more than on the comparison of algorithms in that the authors do not really take into account design and complexity, but resort to MSEs versus CPus). I however do not agree with their overall message to leave the Pima Indians alone. Or at least not for the reason provided therein, namely that faster and more accurate approximations methods are available and cannot be beaten. Benchmarks always have the limitation of “what you get is what you see”, i.e., the output associated with a single dataset that only has that many idiosyncrasies. Plus, the closeness to a perfect normal posterior makes the logistic posterior too regular to pause a real challenge (even though MCMC algorithms are as usual slower than iid sampling). But having faster and more precise resolutions should on the opposite be  cause for cheers, as this provides a reference value, a golden standard, to check against. In a sense, for every Monte Carlo method, there is a much better answer, namely the exact value of the integral or of the optimum! And one is hardly aiming at a more precise inference for the benchmark itself: those Pima Indians [whose actual name is Akimel O’odham] with diabetes involved in the original study are definitely beyond help from statisticians and the model is unlikely to carry out to current populations. When the goal is to compare methods, as in our 2009 paper for Jim Berger’s 60th birthday, what matters is relative speed and relative ease of implementation (besides the obvious convergence to the proper target). In that sense bigger and larger is not always relevant. Unless one tackles really big or really large datasets, for which there is neither benchmark method nor reference value.

controlled thermodynamic integral for Bayesian model comparison [reply]

Posted in Books, pictures, Running, Statistics, University life with tags , , , , , , , , , , , , on April 30, 2014 by xi'an

Reykjavik1Chris Oates wrotes the following reply to my Icelandic comments on his paper with Theodore Papamarkou, and Mark Girolami, reply that is detailed enough to deserve a post on its own:

Thank you Christian for your discussion of our work on the Og, and also for your helpful thoughts in the early days of this project! It might be interesting to speculate on some aspects of this procedure:

(i) Quadrature error is present in all estimates of evidence that are based on thermodynamic integration. It remains unknown how to exactly compute the optimal (variance minimising) temperature ladder “on-the-fly”; indeed this may be impossible, since the optimum is defined via a boundary value problem rather than an initial value problem. Other proposals for approximating this optimum are compatible with control variates (e.g. Grosse et al, NIPS 2013, Friel and Wyse, 2014). In empirical experiments we have found that the second order quadrature rule proposed by Friel and Wyse 2014 leads to substantially reduced bias, regardless of the specific choice of ladder.

(ii) Our experiments considered first and second degree polynomials as ZV control variates. In fact, intuition specifically motivates the use of second degree polynomials: Let us presume a linear expansion of the log-likelihood in θ. Then the implied score function is constant, not depending on θ. The quadratic ZV control variates are, in effect, obtained by multiplying the score function by θ. Thus control variates can be chosen to perfectly correlate with the log-likelihood, leading to zero-variance estimators. Of course, there is an empirical question of whether higher-order polynomials are useful when this Taylor approximation is inappropriate, but they would require the estimation of many more coefficients and in practice may be less stable.

(iii) We require that the control variates are stored along the chain and that their sample covariance is computed after the MCMC has terminated. For the specific examples in the paper such additional computation is a negligible fraction of the total computational, so that we did not provide specific timings. When non-diffegeometric MCMC is used to obtain samples, or when the score is unavailable in closed-form and must be estimated, the computational cost of the procedure would necessarily increase.

For the wide class of statistical models with tractable likelihoods, employed in almost all areas of statistical application, the CTI we propose should provide state-of-the-art estimation performance with negligible increase in computational costs.