Archive for Jim Berger

the buzz about nuzz

Posted in Books, Mountains, pictures, Statistics with tags , , , , , , , , , , , , , on April 6, 2020 by xi'an

“…expensive in these terms, as for each root, Λ(x(s),v) (at the cost of one epoch) has to be evaluated for each root finding iteration, for each node of the numerical integral

When using the ZigZag sampler, the main (?) difficulty is in producing velocity switch as the switches are produced as interarrival times of an inhomogeneous Poisson process. When the rate of this process cannot be integrated out in an analytical manner, the only generic approach I know is in using Poisson thinning, obtained by finding an integrable upper bound on this rate, generating from this new process and subsampling. Finding the bound is however far from straightforward and may anyway result in an inefficient sampler. This new paper by Simon Cotter, Thomas House and Filippo Pagani makes several proposals to simplify this simulation, Nuzz standing for numerical ZigZag. Even better (!), their approach is based on what they call the Sellke construction, with Tom Sellke being a probabilist and statistician at Purdue University (trivia: whom I met when spending a postdoctoral year there in 1987-1988) who also wrote a fundamental paper on the opposition between Bayes factors and p-values with Jim Berger.

“We chose as a measure of algorithm performance the largest Kolmogorov-Smirnov (KS) distance between the MCMC sample and true distribution amongst all the marginal distributions.”

The practical trick is rather straightforward in that it sums up as the exponentiation of the inverse cdf method, completed with a numerical resolution of the inversion. Based on the QAGS (Quadrature Adaptive Gauss-Kronrod Singularities) integration routine. In order to save time Kingman’s superposition trick only requires one inversion rather than d, the dimension of the variable of interest. This nuzzled version of ZIgZag can furthermore be interpreted as a PDMP per se. Except that it retains a numerical error, whose impact on convergence is analysed in the paper. In terms of Wasserstein distance between the invariant measures. The paper concludes with a numerical comparison between Nuzz and random walk Metropolis-Hastings, HMC, and manifold MALA, using the number of evaluations of the likelihood as a measure of time requirement. Tuning for Nuzz is described, but not for the competition. Rather dramatically the Nuzz algorithm performs worse than this competition when counting one epoch for each likelihood computation and better when counting one epoch for each integral inversion. Which amounts to perfect inversion, unsurprisingly. As a final remark, all models are more or less Normal, with very smooth level sets, maybe not an ideal range


same risk, different estimators

Posted in Statistics with tags , , , , , on November 10, 2017 by xi'an

An interesting question on X validated reminded me of the epiphany I had some twenty years ago when reading a Annals of Statistics paper by Anirban Das Gupta and Bill Strawderman on shrinkage estimators, namely that some estimators shared the same risk function, meaning their integrated loss was the same for all values of the parameter. As indicated in this question, Stefan‘s instructor seems to believe that two estimators having the same risk function must be a.s. identical. Which is not true as exemplified by the James-Stein (1960) estimator with scale 2(p-2), which has constant risk p, just like the maximum likelihood estimator. I presume the confusion stemmed from the concept of completeness, where having a function with constant expectation under all values of the parameter implies that this function is constant. But, for loss functions, the concept does not apply since the loss depends both on the observation (that is complete in a Normal model) and on the parameter.

on Dutch book arguments

Posted in Books, Kids, pictures, Statistics, Travel, University life with tags , , , , , , , , , on May 1, 2017 by xi'an

“Reality is not always probable, or likely.”― Jorge Luis Borges

As I am supposed to discuss Teddy Seidenfeld‘s talk at the Bayes, Fiducial and Frequentist conference in Harvard today [the snow happened last time!], I started last week [while driving to Wales] reading some related papers of his. Which is great as I had never managed to get through the Dutch book arguments, including those in Jim’s book.

The paper by Mark Schervish, Teddy Seidenfeld, and Jay Kadane is defining coherence as the inability to bet against the predictive statements based on the procedure. A definition that sounds like a self-fulfilling prophecy to me as it involves a probability measure over the parameter space. Furthermore, the notion of turning inference, which aims at scientific validation, into a leisure, no-added-value, and somewhat ethically dodgy like gambling, does not agree with my notion of a validation for a theory. That is, not as a compelling reason for adopting a Bayesian approach. Not that I have suddenly switched to the other [darker] side, but I do not feel those arguments helping in any way, because of this dodgy image associated with gambling. (Pardon my French, but each time I read about escrows, I think of escrocs, or crooks, which reinforces this image! Actually, this name derives from the Old French escroue, but the modern meaning of écroué is sent to jail, which brings us back to the same feeling…)

Furthermore, it sounds like both a weak notion, since it implies an almost sure loss for the bookmaker, plus coherency holds for any prior distribution, including Dirac masses!, and a frequentist one, in that it looks at all possible values of the parameter (in a statistical framework). It also turns errors into monetary losses, taking them at face value. Which sounds also very formal to me.

But the most fundamental problem I have with this approach is that, from a Bayesian perspective, it does not bring any evaluation or ranking of priors, and in particular does not help in selecting or eliminating some. By behaving like a minimax principle, it does not condition on the data and hence does not evaluate the predictive properties of the model in terms of the data, e.g. by comparing pseudo-data with real data.

 While I see no reason to argue in favour of p-values or minimax decision rules, I am at a loss in understanding the examples in How to not gamble if you must. In the first case, i.e., when dismissing the α-level most powerful test in the simple vs. simple hypothesis testing case, the argument (in Example 4) starts from the classical (Neyman-Pearsonist) statistician favouring the 0.05-level test over others. Which sounds absurd, as this level corresponds to a given loss function, which cannot be compared with another loss function. Even though the authors chose to rephrase the dilemma in terms of a single 0-1 loss function and then turn the classical solution into the choice of an implicit variance-dependent prior. Plus force the poor Pearsonist to make a wager represented by the risk difference. The whole sequence of choices sounds both very convoluted and far away from the usual practice of a classical statistician… Similarly, when attacking [in Section 5.2] the minimax estimator in the Bernoulli case (for the corresponding proper prior depending on the sample size n), this minimax estimator is admissible under quadratic loss and still a Dutch book argument applies, which in my opinion definitely argues against the Dutch book reasoning. The way to produce such a domination result is to mix two Bernoulli estimation problems for two different sample sizes but the same parameter value, in which case there exist [other] choices of Beta priors and a convex combination of the risks functions that lead to this domination. But this example [Example 6] mostly exposes the artificial nature of the argument: when estimating the very same probability θ, what is the relevance of adding the risks or errors resulting from using two estimators for two different sample sizes. Of the very same probability θ. I insist on the very same because when instead estimating two [independent] values of θ, there cannot be a Stein effect for the Bernoulli probability estimation problem, that is, any aggregation of admissible estimators remains admissible. (And yes it definitely sounds like an exercise in frequentist decision theory!)

contemporary issues in hypothesis testing

Posted in pictures, Statistics, Travel, University life with tags , , , , , , , , , , , , , on May 3, 2016 by xi'an

hipocontemptNext Fall, on 15-16 September, I will take part in a CRiSM workshop on hypothesis testing. In our department in Warwick. The registration is now open [until Sept 2] with a moderate registration free of £40 and a call for posters. Jim Berger and Joris Mulder will both deliver a plenary talk there, while Andrew Gelman will alas give a remote talk from New York. (A terrific poster by the way!)

Leave the Pima Indians alone!

Posted in Books, R, Statistics, University life with tags , , , , , , , , , , , , , , , , on July 15, 2015 by xi'an

“…our findings shall lead to us be critical of certain current practices. Specifically, most papers seem content with comparing some new algorithm with Gibbs sampling, on a few small datasets, such as the well-known Pima Indians diabetes dataset (8 covariates). But we shall see that, for such datasets, approaches that are even more basic than Gibbs sampling are actually hard to beat. In other words, datasets considered in the literature may be too toy-like to be used as a relevant benchmark. On the other hand, if ones considers larger datasets (with say 100 covariates), then not so many approaches seem to remain competitive” (p.1)

Nicolas Chopin and James Ridgway (CREST, Paris) completed and arXived a paper they had “threatened” to publish for a while now, namely why using the Pima Indian R logistic or probit regression benchmark for checking a computational algorithm is not such a great idea! Given that I am definitely guilty of such a sin (in papers not reported in the survey), I was quite eager to read the reasons why! Beyond the debate on the worth of such a benchmark, the paper considers a wider perspective as to how Bayesian computation algorithms should be compared, including the murky waters of CPU time versus designer or programmer time. Which plays against most MCMC sampler.

As a first entry, Nicolas and James point out that the MAP can be derived by standard a Newton-Raphson algorithm when the prior is Gaussian, and even when the prior is Cauchy as it seems most datasets allow for Newton-Raphson convergence. As well as the Hessian. We actually took advantage of this property in our comparison of evidence approximations published in the Festschrift for Jim Berger. Where we also noticed the awesome performances of an importance sampler based on the Gaussian or Laplace approximation. The authors call this proposal their gold standard. Because they also find it hard to beat. They also pursue this approximation to its logical (?) end by proposing an evidence approximation based on the above and Chib’s formula. Two close approximations are provided by INLA for posterior marginals and by a Laplace-EM for a Cauchy prior. Unsurprisingly, the expectation-propagation (EP) approach is also implemented. What EP lacks in theoretical backup, it seems to recover in sheer precision (in the examples analysed in the paper). And unsurprisingly as well the paper includes a randomised quasi-Monte Carlo version of the Gaussian importance sampler. (The authors report that “the improvement brought by RQMC varies strongly across datasets” without elaborating for the reasons behind this variability. They also do not report the CPU time of the IS-QMC, maybe identical to the one for the regular importance sampling.) Maybe more surprising is the absence of a nested sampling version.

pimcisIn the Markov chain Monte Carlo solutions, Nicolas and James compare Gibbs, Metropolis-Hastings, Hamiltonian Monte Carlo, and NUTS. Plus a tempering SMC, All of which are outperformed by importance sampling for small enough datasets. But get back to competing grounds for large enough ones, since importance sampling then fails.

“…let’s all refrain from now on from using datasets and models that are too simple to serve as a reasonable benchmark.” (p.25)

This is a very nice survey on the theme of binary data (more than on the comparison of algorithms in that the authors do not really take into account design and complexity, but resort to MSEs versus CPus). I however do not agree with their overall message to leave the Pima Indians alone. Or at least not for the reason provided therein, namely that faster and more accurate approximations methods are available and cannot be beaten. Benchmarks always have the limitation of “what you get is what you see”, i.e., the output associated with a single dataset that only has that many idiosyncrasies. Plus, the closeness to a perfect normal posterior makes the logistic posterior too regular to pause a real challenge (even though MCMC algorithms are as usual slower than iid sampling). But having faster and more precise resolutions should on the opposite be  cause for cheers, as this provides a reference value, a golden standard, to check against. In a sense, for every Monte Carlo method, there is a much better answer, namely the exact value of the integral or of the optimum! And one is hardly aiming at a more precise inference for the benchmark itself: those Pima Indians [whose actual name is Akimel O’odham] with diabetes involved in the original study are definitely beyond help from statisticians and the model is unlikely to carry out to current populations. When the goal is to compare methods, as in our 2009 paper for Jim Berger’s 60th birthday, what matters is relative speed and relative ease of implementation (besides the obvious convergence to the proper target). In that sense bigger and larger is not always relevant. Unless one tackles really big or really large datasets, for which there is neither benchmark method nor reference value.