## the Hyvärinen score is back

Posted in pictures, Statistics, Travel with tags , , , , , , , , , , , , , on November 21, 2017 by xi'an

Stéphane Shao, Pierre Jacob and co-authors from Harvard have just posted on arXiv a new paper on Bayesian model comparison using the Hyvärinen score

$\mathcal{H}(y, p) = 2\Delta_y \log p(y) + ||\nabla_y \log p(y)||^2$

which thus uses the Laplacian as a natural and normalisation-free penalisation for the score test. (Score that I first met in Padova, a few weeks before moving from X to IX.) Which brings a decision-theoretic alternative to the Bayes factor and which delivers a coherent answer when using improper priors. Thus a very appealing proposal in my (biased) opinion! The paper is mostly computational in that it proposes SMC and SMC² solutions to handle the estimation of the Hyvärinen score for models with tractable likelihoods and tractable completed likelihoods, respectively. (Reminding me that Pierre worked on SMC² algorithms quite early during his Ph.D. thesis.)

A most interesting remark in the paper is to recall that the Hyvärinen score associated with a generic model on a series must be the prequential (predictive) version

$\mathcal{H}_T (M) = \sum_{t=1}^T \mathcal{H}(y_t; p_M(dy_t|y_{1:(t-1)}))$

rather than the version on the joint marginal density of the whole series. (Followed by a remark within the remark that the logarithm scoring rule does not make for this distinction. And I had to write down the cascading representation

$\log p(y_{1:T})=\sum_{t=1}^T \log p(y_t|y_{1:t-1})$

to convince myself that this unnatural decomposition, where the posterior on θ varies on each terms, is true!) For consistency reasons.

This prequential decomposition is however a plus in terms of computation when resorting to sequential Monte Carlo. Since each time step produces an evaluation of the associated marginal. In the case of state space models, another decomposition of the authors, based on measurement densities and partial conditional expectations of the latent states allows for another (SMC²) approximation. The paper also establishes that for non-nested models, the Hyvärinen score as a model selection tool asymptotically selects the closest model to the data generating process. For the divergence induced by the score. Even for state-space models, under some technical assumptions.  From this asymptotic perspective, the paper exhibits an example where the Bayes factor and the Hyvärinen factor disagree, even asymptotically in the number of observations, about which mis-specified model to select. And last but not least the authors propose and assess a discrete alternative relying on finite differences instead of derivatives. Which remains a proper scoring rule.

I am quite excited by this work (call me biased!) and I hope it can induce following works as a viable alternative to Bayes factors, if only for being more robust to the [unspecified] impact of the prior tails. As in the above picture where some realisations of the SMC² output and of the sequential decision process see the wrong model being almost acceptable for quite a long while…

## positions in North-East America

Posted in Kids, pictures, Statistics, Travel, University life with tags , , , , , , , , on September 14, 2017 by xi'an

• Professor in Statistics, Biostatistics or Data Science at U de M, deadline October 30th, 2017, a requirement being proficiency in the French language;
• Tenure-Track Professorship in Statistics at Harvard University, Department of Statistics, details there.

## fiducial on a string

Posted in Books, pictures, Statistics, Travel, University life with tags , , , , , , on June 26, 2017 by xi'an

A very short note in arXiv today by Gunnar Taraldsen and Bo Henry Lindqvist (NTU, Norway). With the above title. I find the note close to unreadable, I must say, as the notations are not all or well- defined. The problem starts from Teddy Seidenfeld [whom I met in Harvard around Dutch book arguments] arguing about the lack of unicity of fiducial distributions in a relatively simple setting. Actually the note is also inspired from Bayes, Fiducial and Frequentist, and comments from Teddy, a talk I apparently missed by taking a flight back home too early!

What I find surprising in this note is that the “fiducial on a string” is a conditional distribution on the parameter space restricted to a  curve, derived from the original fiducial distribution by a conditioning argument. Except that since the conditioning is on a set of measure zero, this conditional is not only not-unique, but it is completely undefined and arbitrary, since changing it does not modify the properties of the joint distribution.

## ACDC versus ABC

Posted in Books, Kids, pictures, Statistics, Travel with tags , , , , , on June 12, 2017 by xi'an

At the Bayes, Fiducial and Frequentist workshop last month, I discussed with the authors of this newly arXived paper, Approximate confidence distribution computing, Suzanne Thornton and Min-ge Xie. Which they abbreviate as ACC and not as ACDC. While I have discussed the notion of confidence distribution in some earlier posts, this paper aims at producing proper frequentist coverage within a likelihood-free setting. Given the proximity with our recent paper on the asymptotics of ABC, as well as with Li and Fearnhead (2016) parallel endeavour, it is difficult (for me) to spot the actual distinction between ACC and ABC given that we also achieve (asymptotically) proper coverage when the limiting ABC distribution is Gaussian, which is the case for a tolerance decreasing quickly enough to zero (in the sample size).

“Inference from the ABC posterior will always be difficult to justify within a Bayesian framework.”

Indeed the ACC setting is eerily similar to ABC apart from the potential of the generating distribution to be data dependent. (Which is fine when considering that the confidence distributions have no Bayesian motivation but are a tool to ensure proper frequentist coverage.) That it is “able to offer theoretical support for ABC” (p.5) is unclear to me, given both this data dependence and the constraints it imposes on the [sampling and algorithmic] setting. Similarly, I do not understand how the authors “are not committing the error of doubly using the data” (p.5) and why they should be concerned about it, standing outside the Bayesian framework. If the prior involves the data as in the Cauchy location example, it literally uses the data [once], followed by an ABC comparison between simulated and actual data, that uses the data [a second time].

“Rather than engaging in a pursuit to define a moving target such as [a range of posterior distributions], ACC maintains a consistently clear frequentist interpretation (…) and thereby offers a consistently cohesive interpretation of likelihood-free methods.”

The frequentist coverage guarantee comes from a bootstrap-like assumption that [with tolerance equal to zero] the distribution of the ABC/ACC/ACDC random parameter around an estimate of the parameter given the summary statistic is identical to the [frequentist] distribution of this estimate around the true parameter [given the true parameter, although this conditioning makes no sense outside a Bayesian framework]. (There must be a typo in the paper when the authors define [p.10] the estimator as minimising the derivative of the density of the summary statistic, while still calling it an MLE.) That this bootstrap-like assumption holds is established (in Theorem 1) under a CLT on this MLE and assumptions on the data-dependent proposal that connect it to the density of the summary statistic. Connection that seem to imply a data-dependence as well as a certain knowledge about this density. What I find most surprising in this derivation is the total absence of conditions or even discussion on the tolerance level which, as we have shown, is paramount to the validation or invalidation of ABC inference. It sounds like the authors of Approximate confidence distribution computing are setting ε equal to zero for those theoretical derivations. While in practice they apply rules [for choosing ε] they do not voice out, but which result in very different acceptance rates for the ACC version they oppose to an ABC version. (In all illustrations, it seems that ε=0.1, which does not make much sense.) All in all, I am thus rather skeptical about the practical implications of the paper in that it seems to achieve confidence guarantees by first assuming proper if implicit choices of summary statistics and parameter generating distribution.

## Dutch book for sleeping beauty

Posted in Books, Kids, Statistics, University life with tags , , , , , , , , on May 15, 2017 by xi'an

After my short foray in Dutch book arguments two weeks ago in Harvard, I spotted a recent arXival by Vincent Conitzer analysing the sleeping beauty paradox from a Dutch book perspective. (The paper “A Dutch book against sleeping beauties who are evidential decision theorists” actually appeared in Synthese two years ago, which makes me wonder why it comes out only now on arXiv. And yes I am aware the above picture is about Bansky’s Cindirella and not sleeping beauty!)

“if Beauty is an evidential decision theorist, then in variants where she does not always have the same information available to her upon waking, she is vulnerable to Dutch books, regardless of whether she is a halfer or a thirder.”

As recalled in the introduction of the paper, there exist ways to construct Dutch book arguments against thirders and halfers alike. Conitzer constructs a variant that also distinguishes between a causal and an evidential decision theorist (sleeping beauty), the later being susceptible to another Dutch book. Which is where I get lost as I have no idea of a distinction between those two types of decision theory. Quickly checking on Wikipedia returned the notion that the latter decision theory maximises the expected utility conditional on the decision, but this does not clarify the issue in that it seems to imply the decision impacts the probability of the event… Hence keeping me unable to judge of the relevance of the arguments therein (which is no surprise since only based on a cursory read).

## Bayes is typically wrong…

Posted in pictures, Running, Statistics, Travel, University life with tags , , , , , , , , , , , on May 3, 2017 by xi'an

In Harvard, this morning, Don Fraser gave a talk at the Bayesian, Fiducial, and Frequentist conference where he repeated [as shown by the above quote] the rather harsh criticisms on Bayesian inference he published last year in Statistical Science. And which I discussed a few days ago. The “wrongness” of Bayes starts with the completely arbitrary choice of the prior, which Don sees as unacceptable, and then increases because the credible regions are not confident regions, outside natural parameters from exponential families (Welch and Peers, 1963). And one-dimensional parameters using the profile likelihood (although I cannot find a proper definition of what the profile likelihood is in the paper, apparently a plug-in version that is not a genuine likelihood, hence somewhat falling under the same this-is-not-a-true-probability cleaver as the disputed Bayesian approach).

“I expect we’re all missing something, but I do not know what it is.” D.R. Cox, Statistical Science, 1994

And then Nancy Reid delivered a plenary lecture “Are we converging?” on the afternoon that compared most principles (including objective if not subjective Bayes) against different criteria, like consistency, nuisance elimination, calibration, meaning of probability, and so on.  In an highly analytic if pessimistic panorama. (The talk should be available on line at some point soon.)

## on Dutch book arguments

Posted in Books, Kids, pictures, Statistics, Travel, University life with tags , , , , , , , , , on May 1, 2017 by xi'an

“Reality is not always probable, or likely.”― Jorge Luis Borges

As I am supposed to discuss Teddy Seidenfeld‘s talk at the Bayes, Fiducial and Frequentist conference in Harvard today [the snow happened last time!], I started last week [while driving to Wales] reading some related papers of his. Which is great as I had never managed to get through the Dutch book arguments, including those in Jim’s book.

The paper by Mark Schervish, Teddy Seidenfeld, and Jay Kadane is defining coherence as the inability to bet against the predictive statements based on the procedure. A definition that sounds like a self-fulfilling prophecy to me as it involves a probability measure over the parameter space. Furthermore, the notion of turning inference, which aims at scientific validation, into a leisure, no-added-value, and somewhat ethically dodgy like gambling, does not agree with my notion of a validation for a theory. That is, not as a compelling reason for adopting a Bayesian approach. Not that I have suddenly switched to the other [darker] side, but I do not feel those arguments helping in any way, because of this dodgy image associated with gambling. (Pardon my French, but each time I read about escrows, I think of escrocs, or crooks, which reinforces this image! Actually, this name derives from the Old French escroue, but the modern meaning of écroué is sent to jail, which brings us back to the same feeling…)

Furthermore, it sounds like both a weak notion, since it implies an almost sure loss for the bookmaker, plus coherency holds for any prior distribution, including Dirac masses!, and a frequentist one, in that it looks at all possible values of the parameter (in a statistical framework). It also turns errors into monetary losses, taking them at face value. Which sounds also very formal to me.

But the most fundamental problem I have with this approach is that, from a Bayesian perspective, it does not bring any evaluation or ranking of priors, and in particular does not help in selecting or eliminating some. By behaving like a minimax principle, it does not condition on the data and hence does not evaluate the predictive properties of the model in terms of the data, e.g. by comparing pseudo-data with real data.

While I see no reason to argue in favour of p-values or minimax decision rules, I am at a loss in understanding the examples in How to not gamble if you must. In the first case, i.e., when dismissing the α-level most powerful test in the simple vs. simple hypothesis testing case, the argument (in Example 4) starts from the classical (Neyman-Pearsonist) statistician favouring the 0.05-level test over others. Which sounds absurd, as this level corresponds to a given loss function, which cannot be compared with another loss function. Even though the authors chose to rephrase the dilemma in terms of a single 0-1 loss function and then turn the classical solution into the choice of an implicit variance-dependent prior. Plus force the poor Pearsonist to make a wager represented by the risk difference. The whole sequence of choices sounds both very convoluted and far away from the usual practice of a classical statistician… Similarly, when attacking [in Section 5.2] the minimax estimator in the Bernoulli case (for the corresponding proper prior depending on the sample size n), this minimax estimator is admissible under quadratic loss and still a Dutch book argument applies, which in my opinion definitely argues against the Dutch book reasoning. The way to produce such a domination result is to mix two Bernoulli estimation problems for two different sample sizes but the same parameter value, in which case there exist [other] choices of Beta priors and a convex combination of the risks functions that lead to this domination. But this example [Example 6] mostly exposes the artificial nature of the argument: when estimating the very same probability θ, what is the relevance of adding the risks or errors resulting from using two estimators for two different sample sizes. Of the very same probability θ. I insist on the very same because when instead estimating two [independent] values of θ, there cannot be a Stein effect for the Bernoulli probability estimation problem, that is, any aggregation of admissible estimators remains admissible. (And yes it definitely sounds like an exercise in frequentist decision theory!)