the Hyvärinen score is back
Stéphane Shao, Pierre Jacob, and co-authors from Harvard have just posted on arXiv a new paper on Bayesian model comparison using the Hyvärinen score, which uses the Laplacian (of the log density) as a natural and normalisation-free penalisation for the score test. (A score I first met in Padova, a few weeks before moving from X to IX.) This brings a decision-theoretic alternative to the Bayes factor and delivers a coherent answer when using improper priors, hence a very appealing proposal in my (biased) opinion! The paper is mostly computational, in that it proposes SMC and SMC² solutions to handle the estimation of the Hyvärinen score, for models with tractable likelihoods and for models where only the completed likelihood is tractable, respectively. (Reminding me that Pierre worked on SMC² algorithms quite early during his Ph.D. thesis.)
A most interesting remark in the paper is the recollection that the Hyvärinen score associated with a generic model on a series must, for consistency reasons, be the prequential (predictive) version, built from the successive predictive densities, rather than the version based on the joint marginal density of the whole series. (Followed by a remark within the remark that the logarithmic scoring rule does not make this distinction, since the log joint density is exactly the sum of the log predictive densities. And I had to write down the cascading representation of the score, where the posterior on θ varies with each term, to convince myself that this unnatural decomposition is true!)
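For the record, here is a hedged sketch of that prequential decomposition in LaTeX; the notation and sign conventions are mine and may differ from the paper's:

```latex
% Prequential Hyvarinen score of a series y_{1:T}: a sum of per-observation
% scores, each computed under the predictive density given the past, so the
% posterior on theta implicitly changes from one term to the next.
\[
  \mathcal{H}(y_{1:T})
  \;=\; \sum_{t=1}^{T}
  \left\{ 2\,\Delta_{y_t} \log p(y_t \mid y_{1:t-1})
        + \big\| \nabla_{y_t} \log p(y_t \mid y_{1:t-1}) \big\|^2 \right\}
\]
% By contrast the logarithmic scoring rule cannot tell the two versions
% apart, since \log p(y_{1:T}) = \sum_t \log p(y_t \mid y_{1:t-1}) exactly.
```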
This prequential decomposition is however a plus in terms of computation when resorting to sequential Monte Carlo, since each time step produces an evaluation of the associated predictive. In the case of state-space models, another decomposition by the authors, based on measurement densities and partial conditional expectations of the latent states, allows for another (SMC²) approximation. The paper also establishes that, for non-nested models, the Hyvärinen score used as a model selection tool asymptotically selects the model closest to the data generating process, in terms of the divergence induced by the score, and this even for state-space models, under some technical assumptions. From this asymptotic perspective, the paper exhibits an example where the Bayes factor and the Hyvärinen factor disagree, even asymptotically in the number of observations, about which mis-specified model to select. And last but not least, the authors propose and assess a discrete alternative relying on finite differences instead of derivatives, which remains a proper scoring rule.
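The finite-difference alternative is easy to illustrate in one dimension. Below is a minimal toy sketch of mine (not code from the paper) that approximates the score H(y) = 2 ∂²/∂y² log p(y) + (∂/∂y log p(y))² by central differences and checks it against the Gaussian closed form; note that the normalising constant of p can be dropped throughout, which is the whole point of the score.

```python
import math

# Toy sketch: approximate the one-dimensional Hyvarinen score
#   H(y, p) = 2 * d^2/dy^2 log p(y) + (d/dy log p(y))^2
# by central finite differences (illustrative only, not the paper's scheme).

def hyvarinen_score_fd(log_p, y, h=1e-4):
    """Finite-difference approximation of the 1-d Hyvarinen score."""
    lp_m, lp_0, lp_p = log_p(y - h), log_p(y), log_p(y + h)
    grad = (lp_p - lp_m) / (2 * h)           # first derivative of log p
    lap = (lp_p - 2 * lp_0 + lp_m) / h ** 2  # second derivative of log p
    return 2 * lap + grad ** 2

# Check against the closed form for an N(m, v) density:
#   H(y) = (y - m)^2 / v^2 - 2 / v
m, v, y = 1.0, 4.0, 2.5
log_p = lambda x: -0.5 * (x - m) ** 2 / v    # normalising constant omitted on purpose
exact = (y - m) ** 2 / v ** 2 - 2.0 / v
approx = hyvarinen_score_fd(log_p, y)
print(exact, approx)
```

For a Gaussian log density the central differences are exact up to rounding, so the two printed values agree to many digits even though `log_p` is unnormalised.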
I am quite excited by this work (call me biased!) and I hope it can spur follow-up work as a viable alternative to Bayes factors, if only for being more robust to the [unspecified] impact of the prior tails. As in the above picture, where some realisations of the SMC² output and of the sequential decision process see the wrong model remain almost acceptable for quite a long while…
November 24, 2017 at 7:59 pm
Funny that you may think all proper priors acceptable as true distributions! For instance, some proper priors lead to inadmissible estimators while some improper priors produce admissible ones…
November 27, 2017 at 12:49 am
I obviously don’t think that! A prior is appropriate for a given circumstance. So it’s never true that all proper priors could be used for any problem.
Regarding admissibility, my understanding is that there are essentially two types of inadmissible priors: the howlingly inadmissible and the ones that are only just beaten.
I guess this view comes from my background in numerics. I just don’t trust computers with very big or very small numbers, so if the maths is dichotomizing estimators based on properties way out in the tails, then I find it hard to take the dichotomy seriously.
For instance, if you can beat my prior when |theta| > 10^10, but it performs better for more reasonable values, I’m not enormously concerned with the resulting estimator being inadmissible.
You know the maths of this much better than I do, but it was always my impression that the improper priors come from playing those sorts of games with infinity and infinitesimals.
For a lot of models, such as a Poisson GLM with a log-link, parameter values in [-20,20] cover well beyond the reasonable range (if your data has a mean of 10^8, you should rescale it before doing things in a computer!), so arguments around infinity are not very meaningful. And priors that put non-trivial mass outside that interval (or that are defined as the limit of things that put non-trivial mass outside that interval) seem a-statistical.
November 24, 2017 at 10:06 am
First of all, congratulations to Pierre and coauthors for the nice paper. As usual, Pierre writes interesting papers. Apparently, scoring rules are stalking me recently :) . At a first quick look, I don’t see problems with the limiting argument, but I have to admit that my reading was superficial. I hope Pierre will answer my emails once I’ve properly read the paper. :-)
November 24, 2017 at 3:40 pm
Thanks! Looking forward to your emails.
November 22, 2017 at 7:09 am
In a nutshell, the marginal log-likelihood might indeed be ill-defined when a prior becomes vaguer, while its derivatives with respect to the observation ‘y’ might admit a limit. This explains the appeal of using a scoring rule based on derivatives of the (incremental) log-likelihoods, such as the Hyvärinen scoring rule.
This is well-explained in Dawid and Musio’s 2015 paper.
November 22, 2017 at 9:02 pm
All Dawid and Musio do is say that you can still compute it even if the distribution is improper; they don’t seem to discuss what that means. The theory in Hyvärinen’s paper (which deals with unknown normalisations, not with unnormalisable densities) that shows propriety of the score assumes throughout that the density is normalisable. It might still be proper against sigma-finite measures, but someone needs to prove it somewhere…
November 23, 2017 at 4:27 pm
If you look at their case study in linear models, you’ll see that they consider improper priors as limits of sequences of proper priors. Then, if you agree that the score has a meaning for any proper prior, and that its limit (as the prior becomes improper) is well-defined, this immediately “gives a meaning” to this score when using improper priors.
We have a similar toy example on page 2 of our paper, with a simple Normal-Normal conjugate model.
Or maybe it’s the idea of giving meaning through a limit argument that makes you uncomfortable? If so, then there’s not much to argue about: it’s a conceptual disagreement. I’m happy, e.g., with derivatives being defined as limits of finite differences, and integrals as limits of averages.
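The Normal-Normal toy example mentioned above is easy to sketch numerically. The snippet below uses my own illustrative numbers (not the paper's): with y | θ ~ N(θ, s²) and θ ~ N(0, λ), the prior predictive is N(0, s² + λ), so its log density diverges as λ grows, while the Hyvärinen score of the predictive converges, which is the sense in which the score keeps a meaning in the improper-prior limit.

```python
import math

# Normal-Normal conjugate model: y | theta ~ N(theta, s2), theta ~ N(0, lam).
# Prior predictive: y ~ N(0, s2 + lam).  As lam -> infinity the log predictive
# density goes to -infinity, but the Hyvarinen score
#   H(y) = y^2 / (s2 + lam)^2 - 2 / (s2 + lam)
# admits a well-defined limit (here 0).  Numbers are illustrative only.

def log_predictive(y, s2, lam):
    v = s2 + lam
    return -0.5 * math.log(2 * math.pi * v) - y ** 2 / (2 * v)

def hyvarinen_score(y, s2, lam):
    v = s2 + lam
    return y ** 2 / v ** 2 - 2.0 / v

y, s2 = 1.3, 1.0
for lam in (1e0, 1e4, 1e8):
    # log predictive drifts towards -infinity, the Hyvarinen score towards 0
    print(lam, log_predictive(y, s2, lam), hyvarinen_score(y, s2, lam))
```

Whether this limit "gives a meaning" to the score under the improper prior itself is exactly the conceptual point being debated here.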
November 24, 2017 at 5:39 am
I don’t agree that, for complicated nonlinear functions of the prior, things that work when the prior is proper still work when the prior is at its improper limit. If that were true we could safely use gamma(epsilon,epsilon) priors.
I really don’t want to say it doesn’t work. I have literally no idea. But I’ve been burnt before. That’s why we have maths. And no one seems to have done it.
November 21, 2017 at 6:42 am
I didn’t think p(y) was well defined for improper priors, so I’m not sure how this can be interpreted as a score (which would require p to be a probability distribution). Isn’t it only known up to a multiplicative constant? In that case how does the scoring rule stuff work? Or am I missing something obvious?
Full disclosure: I haven’t gotten to this paper in my pile yet and I am looking forward to it. But that seems like a key challenge.
November 21, 2017 at 10:18 am
The first and second derivatives of log p(y) are clearly independent of the constant.
November 21, 2017 at 3:24 pm
Thanks Dan for putting this on your pile, and thanks Christian for the supportive comments!
November 21, 2017 at 5:34 pm
Yes, but what does this mean for the score? p(y) is clearly not a probability density in general, it’s just an integral of things. Does the whole scoring-rule machinery still run?
November 21, 2017 at 5:39 pm
Just to be clear, it’s not the case that there is a constant and we know it (which would be fine for this score), it’s that any constant could be used.
This problem goes away if you use the posterior predictive density rather than the prior predictive density.
November 24, 2017 at 8:16 am
I still do not get the argument why a finite measure has more meaning than a sigma-finite measure. Limits are necessary “evils” in a topological world…
November 24, 2017 at 5:37 pm
The problem (or potential problem) is that the set of sigma-finite measures is strictly bigger than the set of finite measures. So you might be able to “game” the scoring rule on the larger set (find a minimum that isn’t at the true distribution), or the rule may no longer be strictly proper (because there is an unnormalisable measure that has the same score as the true distribution).