## the problem of assessing statistical methods

Posted in Books, pictures, Statistics, University life with tags , , , , , , on November 4, 2015 by xi'an

A new arXival today by Abigail Arnold and Jason Loeppky that discusses how simulations studies are and should be conducted when assessing statistical methods.

“Obviously there is no one model that will universally outperform the rest. Recognizing the “No Free Lunch” theorem, the logical question to ask is whether one model will perform best over a given class of problems. Again, we feel that the answer to this question is of course no. But we do feel that there are certain methods that will have a better chance than other methods.”

I find the assumptions or prerequisites of the paper arguable [in the sense of 2. open to disagreement; not obviously correct]—not even mentioning the switch from models to methods in the above—in that I will not be convinced that a method outperforms another method by simply looking at a series of simulation experiments. (Which is why I find some machine learning papers unconvincing, when they introduce a new methodology and run it through a couple benchmarks.) This also reminds me of Samaniego’s Comparison of the Bayesian and frequentist approaches, which requires a secondary prior to run the comparison. (And hence is inconclusive.)

“The papers above typically show the results as a series of side-by-side boxplots (…) for each method, with one plot for each test function and sample size. Conclusions are then drawn from looking at a handful of boxplots which often look very cluttered and usually do not provide clear evidence as to the best method(s). Alternatively, the results will be summarized in a table of average performance (…) These tables are usually overwhelming to look at and interpretations are incredibly inefficient.”

Agreed boxplots are terrible (my friend Jean-Michel is forever arguing against them!). Tables are worse. But why don’t we question RMSE as well? This is most often a very reductive way of comparing methods. I also agree with the point that the design of the simulation studies is almost always overlooked and induces a false sense of precision, while failing to cover a wide enough range of cases. However, and once more, I question the prerequisites for comparing methods through simulations for the purpose of ranking those methods. (Which is not the perspective adopted by James and Nicolas when criticising the use of the Pima Indian dataset.)

“The ECDF allows for quick assessments of methods over a large array of problems to get an overall view while of course not precluding comparisons on individual functions (…) We hope that readers of this paper agree with our opinions and strongly encourage everyone to rely on the ECDF, at least as a starting point, to display relevant statistical information from simulations.”

Drawing a comparison with the benchmarking of optimisation methods, the authors suggest to rank statistical methods via the empirical cdf of their performances or accuracy across (benchmark) problems. Arguing that “significant benefit is gained by [this] collapsing”. I am quite sceptical [as often] of the argument, first because using a (e)cdf means the comparison is unidimensional, second because I see no reason why two cdfs should be easily comparable, third because the collapsing over several problems only operates when the errors for those different problems do not overlap.

## PAC-Bayesians

Posted in Kids, Statistics, University life, Books, Travel, pictures with tags , , , , , , , , , on September 22, 2015 by xi'an

Yesterday, I took part in the thesis defence of James Ridgway [soon to move to the University of Bristol[ at Université Paris-Dauphine. While I have already commented on his joint paper with Nicolas on the Pima Indians, I had not read in any depth another paper in the thesis, “On the properties of variational approximations of Gibbs posteriors” written jointly with Pierre Alquier and Nicolas Chopin.

PAC stands for probably approximately correct and starts with an empirical form of posterior, called the Gibbs posterior, where the log-likelihood is replaced with an empirical error

$\pi(\theta|x_1,\ldots,x_n) \propto \exp\{-\lambda r_n(\theta)\}\pi(\theta)$

that is rescaled by a factor λ. Factor that is called the learning rate, to be optimised as the (Kullback) closest  approximation to the true unknown distribution, by Peter Grünwald (2012) in his SafeBayes approach. In the paper of James, Pierre and Nicolas, there is no visible Bayesian perspective, since the pseudo-posterior is used to define a randomised estimator that achieves optimal oracle bounds. When λ is of order n. The purpose of the paper is rather to produce an efficient approximation to the Gibbs posterior, by using variational Bayes techniques. And to derive point estimators. With the added appeal that the approximation also achieves the oracle bounds. (Surprisingly, the authors do not leave the Pima Indians alone as they use this benchmark for a ranking model.) Since there is no discussion on the choice of the learning rate λ, as opposed to Bissiri et al. (2013) I discussed around Bayes.250, I have difficulties perceiving the possible impact of this representation on Bayesian analysis. Except maybe as an ABC device, as suggested by Christophe Andrieu.

## Leave the Pima Indians alone!

Posted in Books, R, Statistics, University life with tags , , , , , , , , , , , , , , , , , on July 15, 2015 by xi'an

“…our findings shall lead to us be critical of certain current practices. Specifically, most papers seem content with comparing some new algorithm with Gibbs sampling, on a few small datasets, such as the well-known Pima Indians diabetes dataset (8 covariates). But we shall see that, for such datasets, approaches that are even more basic than Gibbs sampling are actually hard to beat. In other words, datasets considered in the literature may be too toy-like to be used as a relevant benchmark. On the other hand, if ones considers larger datasets (with say 100 covariates), then not so many approaches seem to remain competitive” (p.1)

Nicolas Chopin and James Ridgway (CREST, Paris) completed and arXived a paper they had “threatened” to publish for a while now, namely why using the Pima Indian R logistic or probit regression benchmark for checking a computational algorithm is not such a great idea! Given that I am definitely guilty of such a sin (in papers not reported in the survey), I was quite eager to read the reasons why! Beyond the debate on the worth of such a benchmark, the paper considers a wider perspective as to how Bayesian computation algorithms should be compared, including the murky waters of CPU time versus designer or programmer time. Which plays against most MCMC sampler.

As a first entry, Nicolas and James point out that the MAP can be derived by standard a Newton-Raphson algorithm when the prior is Gaussian, and even when the prior is Cauchy as it seems most datasets allow for Newton-Raphson convergence. As well as the Hessian. We actually took advantage of this property in our comparison of evidence approximations published in the Festschrift for Jim Berger. Where we also noticed the awesome performances of an importance sampler based on the Gaussian or Laplace approximation. The authors call this proposal their gold standard. Because they also find it hard to beat. They also pursue this approximation to its logical (?) end by proposing an evidence approximation based on the above and Chib’s formula. Two close approximations are provided by INLA for posterior marginals and by a Laplace-EM for a Cauchy prior. Unsurprisingly, the expectation-propagation (EP) approach is also implemented. What EP lacks in theoretical backup, it seems to recover in sheer precision (in the examples analysed in the paper). And unsurprisingly as well the paper includes a randomised quasi-Monte Carlo version of the Gaussian importance sampler. (The authors report that “the improvement brought by RQMC varies strongly across datasets” without elaborating for the reasons behind this variability. They also do not report the CPU time of the IS-QMC, maybe identical to the one for the regular importance sampling.) Maybe more surprising is the absence of a nested sampling version.

In the Markov chain Monte Carlo solutions, Nicolas and James compare Gibbs, Metropolis-Hastings, Hamiltonian Monte Carlo, and NUTS. Plus a tempering SMC, All of which are outperformed by importance sampling for small enough datasets. But get back to competing grounds for large enough ones, since importance sampling then fails.

“…let’s all refrain from now on from using datasets and models that are too simple to serve as a reasonable benchmark.” (p.25)

This is a very nice survey on the theme of binary data (more than on the comparison of algorithms in that the authors do not really take into account design and complexity, but resort to MSEs versus CPus). I however do not agree with their overall message to leave the Pima Indians alone. Or at least not for the reason provided therein, namely that faster and more accurate approximations methods are available and cannot be beaten. Benchmarks always have the limitation of “what you get is what you see”, i.e., the output associated with a single dataset that only has that many idiosyncrasies. Plus, the closeness to a perfect normal posterior makes the logistic posterior too regular to pause a real challenge (even though MCMC algorithms are as usual slower than iid sampling). But having faster and more precise resolutions should on the opposite be  cause for cheers, as this provides a reference value, a golden standard, to check against. In a sense, for every Monte Carlo method, there is a much better answer, namely the exact value of the integral or of the optimum! And one is hardly aiming at a more precise inference for the benchmark itself: those Pima Indians [whose actual name is Akimel O’odham] with diabetes involved in the original study are definitely beyond help from statisticians and the model is unlikely to carry out to current populations. When the goal is to compare methods, as in our 2009 paper for Jim Berger’s 60th birthday, what matters is relative speed and relative ease of implementation (besides the obvious convergence to the proper target). In that sense bigger and larger is not always relevant. Unless one tackles really big or really large datasets, for which there is neither benchmark method nor reference value.

## controlled thermodynamic integral for Bayesian model comparison [reply]

Posted in Statistics, University life, Running, Books, pictures with tags , , , , , , , , , , , , on April 30, 2014 by xi'an

Chris Oates wrotes the following reply to my Icelandic comments on his paper with Theodore Papamarkou, and Mark Girolami, reply that is detailed enough to deserve a post on its own:

Thank you Christian for your discussion of our work on the Og, and also for your helpful thoughts in the early days of this project! It might be interesting to speculate on some aspects of this procedure:

(i) Quadrature error is present in all estimates of evidence that are based on thermodynamic integration. It remains unknown how to exactly compute the optimal (variance minimising) temperature ladder “on-the-fly”; indeed this may be impossible, since the optimum is defined via a boundary value problem rather than an initial value problem. Other proposals for approximating this optimum are compatible with control variates (e.g. Grosse et al, NIPS 2013, Friel and Wyse, 2014). In empirical experiments we have found that the second order quadrature rule proposed by Friel and Wyse 2014 leads to substantially reduced bias, regardless of the specific choice of ladder.

(ii) Our experiments considered first and second degree polynomials as ZV control variates. In fact, intuition specifically motivates the use of second degree polynomials: Let us presume a linear expansion of the log-likelihood in θ. Then the implied score function is constant, not depending on θ. The quadratic ZV control variates are, in effect, obtained by multiplying the score function by θ. Thus control variates can be chosen to perfectly correlate with the log-likelihood, leading to zero-variance estimators. Of course, there is an empirical question of whether higher-order polynomials are useful when this Taylor approximation is inappropriate, but they would require the estimation of many more coefficients and in practice may be less stable.

(iii) We require that the control variates are stored along the chain and that their sample covariance is computed after the MCMC has terminated. For the specific examples in the paper such additional computation is a negligible fraction of the total computational, so that we did not provide specific timings. When non-diffegeometric MCMC is used to obtain samples, or when the score is unavailable in closed-form and must be estimated, the computational cost of the procedure would necessarily increase.

For the wide class of statistical models with tractable likelihoods, employed in almost all areas of statistical application, the CTI we propose should provide state-of-the-art estimation performance with negligible increase in computational costs.

## controlled thermodynamic integral for Bayesian model comparison

Posted in Books, pictures, Running, Statistics, University life with tags , , , , , , , , , , on April 24, 2014 by xi'an

Chris Oates, Theodore Papamarkou, and Mark Girolami (all from the University of Warwick) just arXived a paper on a new form of thermodynamic integration for computing marginal likelihoods. (I had actually discussed this paper with the authors on a few occasions when visiting Warwick.) The other name of thermodynamic integration is path sampling (Gelman and Meng, 1998). In the current paper, the path goes from the prior to the posterior by a sequence of intermediary distributions using a power of the likelihood. While the path sampling technique is quite efficient a method, the authors propose to improve it through the recourse to control variates, in order to decrease the variance. The control variate is taken from Mira et al. (2013), namely a one-dimensional temperature-dependent transform of the score function. (Strictly speaking, this is an asymptotic control variate in that the mean is only asymptotically zero.) This control variate is then incorporated within the expectation inside the path sampling integral. Its arbitrary elements are then calibrated against the variance of the path sampling integral. Except for the temperature ladder where the authors use a standard geometric rate, as the approach does not account for Monte Carlo and quadrature errors. (The degree of the polynomials used in the control variates is also arbitrarily set.) Interestingly, the paper mixes a lot of recent advances, from the zero variance notion of Mira et al. (2013) to the manifold Metropolis-adjusted Langevin algorithm of Girolami and Calderhead (2011), uses as a base method pMCMC (Jasra et al., 2007). The examples processed in the paper are regression (where the controlled version truly has a zero variance!) and logistic regression (with the benchmarked Pima Indian dataset), with a counter-example of a PDE interestingly proposed in the discussion section. I quite agree with the authors that the method is difficult to envision in complex enough models. I also did not see mentions therein of the extra time involved in using this control variate idea.

## Vanilla Rao-Blackwellisation [re]revised

Posted in Statistics, R with tags , , , , , , , on June 1, 2010 by xi'an

Although the revision is quite minor, it took us two months to complete from the time I received the news in the Atlanta airport lounge… The vanilla Rao-Blackwellisation paper with Randal Douc has thus been resubmitted to the Annals of Statistics. And rearXived. The only significant change is the inclusion of two tables detailing computing time, like the one below

$\left| \begin{matrix} \tau &\text{median} &\text{mean }&q_{.8} &q_{.9} &\text{time}\\ 0.25 &0.0 &8.85 &4.9 &13 &4.2\\ 0.50 &0.0 &6.76 &4 &11 &2.25\\ 1.00 &0.25 &6.15 &4 &10 &2.5\\ 2.00 &0.20 &5.90 &3.5 &8.5 &4.5\\\end{matrix} \right|$

which provides different evaluations of the additional computing effort due to the use of the Rao–Blackwellisation: median and mean numbers of additional iterations, $80\%$ and $90\%$ quantiles for the additional iterations, and ratio of the average R computing times obtained over $10^5$ simulations. (Turning the above table into a formula acceptable by WordPress took me for ever, as any additional white space between the terms of the matrix is mis-interpreted!) Now, the mean time column does not look very supportive of the Rao-Blackwellisation technique, but this is due to the presence of a few outlying runs that required many iterations before hitting an acceptance probability of one. Excessive computing time can be curbed by using a pre-set number of iterations, as described in the paper…

## pimax(mcsm)

Posted in Statistics, Books, R with tags , , , , on May 13, 2010 by xi'an

The function pimax from our package mcsm is used in to reproduce Figure 5.11 of our book Introducing Monte Carlo Methods with R. (The name comes from using the Pima Indian R benchmark as the reference dataset.) I got this email from Josué

I ran the ‘pimax’ example from the mcsm manual, and it gave me the following message:

> pimax(Nsim = 10^3)
Error in raaj[t, ] = apply(as.matrix(aas), 1, margap) :
number of items to replace is not a multiple of replacement length
> pimax()
Error in raaj[t, ] = apply(as.matrix(aas), 1, margap) :
number of items to replace is not a multiple of replacement length

but when running pimax(10^2) on my machine I did get the following picture and no error message. So I wonder if this is a matter of version of R or something else…