the Hyvärinen score is back

Posted in pictures, Statistics, Travel on November 21, 2017 by xi'an

Stéphane Shao, Pierre Jacob and co-authors from Harvard have just posted on arXiv a new paper on Bayesian model comparison using the Hyvärinen score

$\mathcal{H}(y, p) = 2\Delta_y \log p(y) + ||\nabla_y \log p(y)||^2$

which thus uses the Laplacian as a natural and normalisation-free penalisation for the score test. (Score that I first met in Padova, a few weeks before moving from X to IX.) Which brings a decision-theoretic alternative to the Bayes factor and which delivers a coherent answer when using improper priors. Thus a very appealing proposal in my (biased) opinion! The paper is mostly computational in that it proposes SMC and SMC² solutions to handle the estimation of the Hyvärinen score for models with tractable likelihoods and tractable completed likelihoods, respectively. (Reminding me that Pierre worked on SMC² algorithms quite early during his Ph.D. thesis.)
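To make the "normalisation-free" claim concrete, here is a minimal sketch (not taken from the paper) that evaluates the univariate version of the score above for a Gaussian density, both by central finite differences and in closed form. The function names, the Gaussian example and the step size `h` are assumptions made for illustration only.

```python
import math

def hyvarinen_score(log_p, y, h=1e-3):
    """Univariate score 2 (log p)''(y) + ((log p)'(y))^2 via central differences."""
    d1 = (log_p(y + h) - log_p(y - h)) / (2 * h)
    d2 = (log_p(y + h) - 2 * log_p(y) + log_p(y - h)) / h**2
    return 2 * d2 + d1**2

def gaussian_log_density(mu, sigma, log_const=0.0):
    """Log-density of N(mu, sigma^2), optionally shifted by an arbitrary constant.

    A non-zero log_const mimics an unknown normalising constant
    (illustrative assumption, not part of the paper).
    """
    def log_p(y):
        return (-0.5 * ((y - mu) / sigma) ** 2
                - math.log(sigma * math.sqrt(2 * math.pi))
                + log_const)
    return log_p

def gaussian_score_exact(y, mu, sigma):
    """Closed form for the Gaussian case: -2/sigma^2 + (y - mu)^2 / sigma^4."""
    return -2 / sigma**2 + (y - mu) ** 2 / sigma**4

y, mu, sigma = 1.3, 0.5, 2.0
numeric = hyvarinen_score(gaussian_log_density(mu, sigma), y)
shifted = hyvarinen_score(gaussian_log_density(mu, sigma, log_const=7.0), y)
exact = gaussian_score_exact(y, mu, sigma)
print(numeric, shifted, exact)
```

Since only derivatives of log p enter the score, the additive constant drops out and the two numerical values agree with the closed form up to rounding.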

A most interesting remark in the paper is to recall that the Hyvärinen score associated with a generic model on a series must be the prequential (predictive) version

$\mathcal{H}_T (M) = \sum_{t=1}^T \mathcal{H}(y_t; p_M(dy_t|y_{1:(t-1)}))$

rather than the version on the joint marginal density of the whole series. (Followed by a remark within the remark that the logarithm scoring rule does not make for this distinction. And I had to write down the cascading representation

$\log p(y_{1:T})=\sum_{t=1}^T \log p(y_t|y_{1:t-1})$

to convince myself that this unnatural decomposition, where the posterior on θ varies with each term, is true!) For consistency reasons.
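The cascading representation can be checked numerically on a toy conjugate model where everything is available in closed form. The sketch below assumes y_t | θ ~ N(θ, 1) with θ ~ N(0, τ²); this model, the data and the function names are illustrative choices, not taken from the paper.

```python
import math

def log_normal_pdf(x, mean, var):
    """Log-density of N(mean, var) at x."""
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

def log_joint_marginal(ys, tau2):
    """log p(y_{1:T}) when y_t | theta ~ N(theta, 1) and theta ~ N(0, tau2)."""
    n, s = len(ys), sum(ys)
    ss = sum(y * y for y in ys)
    return (-0.5 * n * math.log(2 * math.pi)
            - 0.5 * math.log(1 + n * tau2)
            - 0.5 * (ss - tau2 * s * s / (1 + n * tau2)))

def log_predictive_terms(ys, tau2):
    """Terms log p(y_t | y_{1:t-1}); the posterior on theta is updated at each step."""
    terms, s = [], 0.0
    for t, y in enumerate(ys):            # t observations have been seen so far
        prec = 1.0 / tau2 + t             # posterior precision of theta
        post_mean, post_var = s / prec, 1.0 / prec
        terms.append(log_normal_pdf(y, post_mean, 1.0 + post_var))
        s += y
    return terms

ys = [0.3, -1.1, 0.8, 2.0, 0.1, -0.4]
tau2 = 4.0
joint = log_joint_marginal(ys, tau2)
cascade = sum(log_predictive_terms(ys, tau2))
print(joint, cascade)
```

Both numbers coincide up to rounding, even though each predictive term relies on a different posterior on θ.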

This prequential decomposition is however a plus in terms of computation when resorting to sequential Monte Carlo. Since each time step produces an evaluation of the associated marginal. In the case of state space models, another decomposition of the authors, based on measurement densities and partial conditional expectations of the latent states allows for another (SMC²) approximation. The paper also establishes that for non-nested models, the Hyvärinen score as a model selection tool asymptotically selects the closest model to the data generating process. For the divergence induced by the score. Even for state-space models, under some technical assumptions.  From this asymptotic perspective, the paper exhibits an example where the Bayes factor and the Hyvärinen factor disagree, even asymptotically in the number of observations, about which mis-specified model to select. And last but not least the authors propose and assess a discrete alternative relying on finite differences instead of derivatives. Which remains a proper scoring rule.
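As a rough illustration of the sequential accumulation, the sketch below computes the prequential Hyvärinen score one predictive at a time. It assumes a toy conjugate model, y_t | θ ~ N(θ, 1) with θ ~ N(0, τ²), where the predictive densities are Gaussian and available in closed form; these closed forms merely stand in for the SMC approximations of the paper, and all names are hypothetical.

```python
import math

def prequential_hyvarinen(ys, tau2):
    """Sum over t of the Hyvärinen score of the predictive p(y_t | y_{1:t-1})."""
    total, s = 0.0, 0.0
    for t, y in enumerate(ys):            # t observations have been seen so far
        prec = 1.0 / tau2 + t             # posterior precision of theta
        mean = s / prec                   # predictive mean
        var = 1.0 + 1.0 / prec            # predictive variance
        # Gaussian predictive: 2 (log p)'' + ((log p)')^2
        total += -2.0 / var + (y - mean) ** 2 / var**2
        s += y
    return total

def log_evidence(ys, tau2):
    """Closed-form log marginal likelihood under the same toy model."""
    n, s = len(ys), sum(ys)
    ss = sum(y * y for y in ys)
    return (-0.5 * n * math.log(2 * math.pi)
            - 0.5 * math.log(1 + n * tau2)
            - 0.5 * (ss - tau2 * s * s / (1 + n * tau2)))

ys = [0.3, -1.1, 0.8, 2.0, 0.1, -0.4, 1.2, 0.6]
for tau2 in (1.0, 1e6, 1e12):
    print(tau2, prequential_hyvarinen(ys, tau2), log_evidence(ys, tau2))
```

In this toy case the prequential score stabilises as the prior variance τ² grows, while the log-evidence keeps drifting towards minus infinity, which is one way of seeing why the score remains usable with very vague priors.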

I am quite excited by this work (call me biased!) and I hope it can induce follow-up work as a viable alternative to Bayes factors, if only for being more robust to the [unspecified] impact of the prior tails. As in the above picture where some realisations of the SMC² output and of the sequential decision process see the wrong model being almost acceptable for quite a long while…

WBIC, practically

Posted in Statistics on October 20, 2017 by xi'an

“Thus far, WBIC has received no more than a cursory mention by Gelman et al. (2013)”

I had missed this 2015 paper by Nial Friel and co-authors on a practical investigation of Watanabe’s WBIC. Where WBIC stands for widely applicable Bayesian information criterion. The thermodynamic integration approach explored by Nial and some co-authors for the approximation of the evidence, thermodynamic integration that produces the log-evidence as an integral between temperatures t=0 and t=1 of a powered evidence, is eminently suited for WBIC, as the widely applicable Bayesian information criterion is associated with the specific temperature t⁰ that makes the power posterior equidistant, Kullback-Leibler-wise, from the prior and posterior distributions. And that makes the expectation of the log-likelihood under this very power posterior equal to the (genuine) log-evidence. In fact, WBIC is often associated with the sub-optimal temperature 1/log(n), where n is the (effective?) sample size. (By comparison, if my minimalist description is unclear!, thermodynamic integration requires a whole range of temperatures and associated MCMC runs.) In an ideal Gaussian setting, WBIC improves considerably over thermodynamic integration, the larger the sample the better. In more realistic settings, though, including a simple regression and a logistic [Pima Indians!] model comparison, thermodynamic integration may do better for a given computational cost although the paper is unclear about these costs. The paper also runs a comparison with harmonic mean and nested sampling approximations. Since the integral of interest involves a power of the likelihood, I wonder if a safe version of the harmonic mean resolution can be derived from simulations of the genuine posterior. Provided the exact temperature t⁰ is known…
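The relation between thermodynamic integration and the WBIC temperature can be sketched on a toy Gaussian model where the power posterior is available in closed form, so that no MCMC run is needed. The model (y_i | θ ~ N(θ, 1), θ ~ N(0, τ²)), the simulated data, the temperature ladder and all names below are assumptions made for illustration; this is not the experiment of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, tau2 = 50, 10.0
y = rng.normal(0.5, 1.0, size=n)          # simulated data (illustrative)
s, ss = y.sum(), (y**2).sum()

def expected_log_lik(t):
    """E[log L(theta)] under the power posterior prop. to L(theta)^t pi(theta)."""
    prec = 1.0 / tau2 + t * n             # power-posterior precision
    m, v = t * s / prec, 1.0 / prec       # power-posterior mean and variance
    # sum_i E[(y_i - theta)^2] = ss - 2 m s + n (m^2 + v)
    return -0.5 * n * np.log(2 * np.pi) - 0.5 * (ss - 2 * m * s + n * (m**2 + v))

# exact log-evidence for this conjugate model
exact = (-0.5 * n * np.log(2 * np.pi) - 0.5 * np.log(1 + n * tau2)
         - 0.5 * (ss - tau2 * s**2 / (1 + n * tau2)))

# thermodynamic integration: trapezoidal rule over a ladder dense near t = 0
temps = np.linspace(0.0, 1.0, 2001) ** 5
vals = expected_log_lik(temps)
ti_estimate = np.sum(np.diff(temps) * 0.5 * (vals[1:] + vals[:-1]))

# temperature at which the expected log-likelihood equals the log-evidence,
# found by bisection (the expectation is non-decreasing in t)
lo, hi = 0.0, 1.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if expected_log_lik(mid) < exact:
        lo = mid
    else:
        hi = mid
t_star = 0.5 * (lo + hi)

wbic_value = expected_log_lik(1.0 / np.log(n))   # default WBIC temperature
print(exact, ti_estimate, t_star, 1.0 / np.log(n), wbic_value)
```

The printout puts the exact log-evidence next to the integral over the whole ladder and next to the single evaluation at 1/log(n), which makes the computational trade-off discussed above visible on one line.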

how many academics does it take to change… a p-value threshold?

Posted in Books, pictures, Running, Statistics, Travel on August 22, 2017 by xi'an

“…a critical mass of researchers now endorse this change.”

The answer to the lightbulb question seems to be 72: Andrew sent me a short paper recently PsyarXived and to appear in Nature Human Behaviour following on the .005 not .05 tune we criticised in PNAS a while ago. (Actually a very short paper once the names and affiliations of all authors are taken away.) With indeed 72 authors, many of them my Bayesian friends! I figure the mass signature is aimed at convincing users of p-values of a consensus among statisticians. Or a “critical mass” as stated in the note. The following week, Nature had an entry on this proposal. (With a survey on whether the p-value threshold should change!)

The argument therein [and hence my reservations] is about the same as in Val Johnson’s original PNAS paper, namely that .005 should become the reference cutoff when using p-values for discovering new effects. The tone of the note is mostly Bayesian in that it defends the Bayes factor as a better alternative I would call the b-value. And produces graphs that relate p-values to some minimax Bayes factors. In the simplest possible case of testing for the nullity of a normal mean. Which I do not think is particularly convincing when considering more realistic settings with (many) nuisance parameters and possible latent variables where numerical answers diverge between p-values and [an infinity of] b-values. And of course the unsolved issue of scaling the Bayes factor. (This without embarking anew upon a full-fledged criticism of the Bayes factor.) As usual, I am also skeptical of mentions of power, since I never truly understood the point of power, which depends on the alternative model, increasingly so with the complexity of this alternative. As argued in our letter to PNAS, the central issue that this proposal fails to address is the urgency in abandoning the notion [indoctrinated in generations of students] that a single quantity and a single bound are the answers to testing issues. Changing the bound sounds like suggesting to paint afresh a building on the verge of collapsing.
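For the normal mean case, one simple way of relating a p-value to a Bayes factor is the classical upper bound obtained when the alternative puts all its mass at the observed value, which gives exp(z²/2) for a two-sided z-test. The sketch below uses this bound as an illustration; it is an assumption that it matches any of the curves in the note, which may rely on other bounds.

```python
import math
from statistics import NormalDist

def max_bayes_factor(p_value):
    """Upper bound on the Bayes factor against H0: mu = 0 in a two-sided z-test.

    The bound is the likelihood ratio at the observed mean, exp(z^2 / 2),
    where z is the normal quantile matching the two-sided p-value.
    """
    z = NormalDist().inv_cdf(1 - p_value / 2)
    return math.exp(0.5 * z * z)

for p in (0.05, 0.005):
    print(p, max_bayes_factor(p))
```

Being a maximum over all possible alternatives, this number is the most favourable case against the null in this toy setting, and says nothing about the settings with nuisance parameters mentioned above.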

a response by Ly, Verhagen, and Wagenmakers

Posted in Statistics on March 9, 2017 by xi'an

Following my demise [of the Bayes factor], Alexander Ly, Josine Verhagen, and Eric-Jan Wagenmakers wrote a very detailed response. Which I just saw the other day while in Banff. (If not in Schiphol, which would have been more appropriate!)

“In this rejoinder we argue that Robert’s (2016) alternative view on testing has more in common with Jeffreys’s Bayes factor than he suggests, as they share the same ‘‘shortcomings’’.”

Rather unsurprisingly (!), the authors agree with my position on the dangers of ignoring decisional aspects when using the Bayes factor. A point of dissension is the resolution of the Jeffreys[-Lindley-Bartlett] paradox. One consequence derived by Alexander and co-authors is that priors should change between testing and estimating. Because the parameters have a different meaning under the null and under the alternative, a point I agree with in that these parameters are indexed by the model [index!]. But with which I disagree when arguing that the same parameter (e.g., a mean under model M¹) should have two priors when moving from testing to estimation. To state that the priors within the marginal likelihoods “are not designed to yield posteriors that are good for estimation” (p.45) amounts to wishful thinking. I also do not find a strong justification within the paper or the response about choosing an improper prior on the nuisance parameter, e.g. σ, with the same constant. Another a posteriori validation in my opinion. However, I agree with the conclusion that the Jeffreys paradox prohibits the use of an improper prior on the parameter being tested (or of the test itself). A second point made by the authors is that Jeffreys’ Bayes factor is information consistent, which is correct but does not solve my quandary with the lack of precise calibration of the object, namely that alternatives abound in a non-informative situation.

“…the work by Kamary et al. (2014) impressively introduces an alternative view on testing, an algorithmic resolution, and a theoretical justification.”

The second part of the comments is highly supportive of our mixture approach and I obviously appreciate very much this support! Especially if we ever manage to turn the paper into a discussion paper! The authors also draw a connection with Harold Jeffreys’ distinction between testing and estimation, based upon Laplace’s succession rule. Unbearably slow succession law. Which is well-taken if somewhat specious since this is a testing framework where a single observation can send the Bayes factor to zero or +∞. (I further enjoyed the connection of the Poisson-versus-Negative Binomial test with Jeffreys’ call for common parameters. And the supportive comments on our recent mixture reparameterisation paper with Kaniav Kamary and Kate Lee.) The other point that the Bayes factor is more sensitive to the choice of the prior (beware the tails!) can be viewed as a plus for mixture estimation, as acknowledged there. (The final paragraph about the faster convergence of the weight α is not strongly

relativity is the keyword

Posted in Books, Statistics, University life on February 1, 2017 by xi'an

As I was teaching my introduction to Bayesian Statistics this morning, ending up with the chapter on tests of hypotheses, I found myself reflecting [out loud] on the relative nature of posterior quantities. Just like when I introduced the role of priors in Bayesian analysis the day before, I stressed the relativity of quantities coming out of the BBB [Big Bayesian Black Box], namely that whatever happens as a Bayesian procedure is to be understood, scaled, and relativised against the prior equivalent, i.e., that the reference measure or gauge is the prior. This is sort of obvious, clearly, but bringing the argument forward from the start avoids all sorts of misunderstanding and disagreement, in that it excludes the claims of absoluteness and certainty that may come with the production of a posterior distribution. It also removes the endless debate about the determination of the prior, by making each prior a reference on its own. With an additional possibility of calibration by simulation under the assumed model. Or an alternative. Again nothing new there, but I got rather excited by this presentation choice, as it seems to clarify the path to Bayesian modelling and avoid misapprehensions.

Further, the curious case of the Bayes factor (or of the posterior probability) could possibly be resolved most satisfactorily in this framework, as the [dreaded] dependence on the model prior probabilities then becomes a matter of relativity! Those posterior probabilities depend directly and almost linearly on the prior probabilities, but they should not be interpreted in an absolute sense as the ultimate and unique probability of the hypothesis (which anyway does not mean anything in terms of the observed experiment). In other words, this posterior probability does not need to be scaled against a U(0,1) distribution. Or against the p-value if anyone wishes to do so. By the end of the lecture, I was even wondering [not so loudly] whether or not this perspective was allowing for a resolution of the Lindley-Jeffreys paradox, as the resulting number could be set relative to the choice of the [arbitrary] normalising constant.
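The dependence of the posterior probability on the prior probability can be spelled out with the standard identity P(H₀|y) = ρB/(ρB + 1 − ρ), where ρ is the prior probability of H₀ and B the Bayes factor in favour of H₀. The short sketch below tabulates it; the value of B and the grid of prior probabilities are arbitrary illustrative choices.

```python
def posterior_probability(prior_prob, bayes_factor):
    """P(H0 | y) from the prior probability of H0 and the Bayes factor for H0."""
    num = prior_prob * bayes_factor
    return num / (num + 1.0 - prior_prob)

bayes_factor = 3.0                        # hypothetical value of the Bayes factor
for rho in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(rho, round(posterior_probability(rho, bayes_factor), 3))
```

With the data entering only through B, the same experiment returns quite different posterior probabilities across the grid, which is the sense in which the output is relative to the prior weight.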

Bayesian model selection without evidence

Posted in Books, Statistics, University life on September 20, 2016 by xi'an

“The new method circumvents the challenges associated with accurate evidence calculations by computing posterior odds ratios using Bayesian parameter estimation”

One paper leading to another, I had a look at the Hee et al. (2015) paper on Bayes factor estimation. The “novelty” stands in introducing the model index as an extra parameter in a single model encompassing all models under comparison, the “new” parameterisation being in (θ,n) rather than in θ. With the distinction that the parameter θ is now made of the union of all parameters across all models. Which reminds us very much of the Carlin and Chib (1995) approach to the problem. (Peter Green in his Biometrika (1995) paper on reversible jump MCMC uses instead a direct sum of parameter spaces.) The authors indeed suggest simulating jointly (θ,n) in an MCMC or nested sampling scheme. Rather than being updated by arbitrary transforms as in Carlin and Chib (1995) the useless parameters from the other models are kept constant… The goal being to estimate P(n|D) the marginal posterior on the model index, aka the posterior probability of model n.

Now, I am not quite certain that keeping the other parameters constant is a valid move: given a uniform prior on n and an equally uniform proposal, the acceptance probability simplifies into the regular Metropolis-Hastings ratio for model n. Hence the move is valid within model n. If not, I presume the previous pair (θ⁰,n⁰) is repeated. Wait!, actually, this is slightly more elaborate: if a new value of n, m, is proposed, then the acceptance ratio involves the posteriors for both n⁰ and m, possibly only the likelihoods when the proposal is the prior. So the move will directly depend on the likelihood ratio in this simplified case, which indicates the scheme could be correct after all. Except that this neglects the measure theoretic subtleties that led to reversible jump symmetry and hence makes me wonder. In other words, it follows exactly the same pattern as reversible jump without the constraints of the latter… Free lunch, anyone?!
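The simplified case, where a model switch with θ kept constant is accepted according to the likelihood ratio alone, can be sketched on a toy pair of models. Everything below is a hypothetical illustration under the stated assumptions (uniform prior on the index, symmetric proposal on the index, θ untouched), not the implementation of Hee et al.: model 0 is N(0,1) with no free parameter and model 1 is N(θ,1).

```python
import math

def log_lik(y, theta, model):
    """Log-likelihood of the toy models: model 0 is N(0,1), model 1 is N(theta,1)."""
    mean = 0.0 if model == 0 else theta
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (yi - mean) ** 2 for yi in y)

def switch_acceptance(y, theta, current, proposed):
    """Acceptance probability of an index switch with theta kept constant.

    With a uniform prior on the index and a symmetric proposal, the prior on
    theta cancels and only the likelihood ratio remains.
    """
    log_ratio = log_lik(y, theta, proposed) - log_lik(y, theta, current)
    return math.exp(min(0.0, log_ratio))

y = [0.2, 0.5, -0.1, 0.9]
theta_good = sum(y) / len(y)              # theta near the data
theta_poor = -5.0                         # theta far from the data
print(switch_acceptance(y, theta_good, 0, 1))
print(switch_acceptance(y, theta_poor, 0, 1))
```

In this sketch the fate of a proposed switch rests entirely on the value at which θ happens to be frozen, since no dimension-matching or Jacobian term enters the ratio.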

non-local priors for mixtures

Posted in Statistics, University life on September 15, 2016 by xi'an

[For some unknown reason, this commentary on the paper by Jairo Fúquene, Mark Steel, David Rossell —all colleagues at Warwick— on choosing mixture components by non-local priors remained untouched in my draft box…]

Choosing the number of components in a mixture of (e.g., Gaussian) distributions is a hard problem. It may actually be an altogether impossible problem, even when abstaining from moral judgements on mixtures. I do realise that the components can eventually be identified as the number of observations grows to infinity, as demonstrated for instance by Judith Rousseau and Kerrie Mengersen (2011). But for a finite and given number of observations, how much can we trust any conclusion about the number of components?! It seems to me that the criticism about the vacuity of point null hypotheses, namely the logical absurdity of trying to differentiate θ=0 from any other value of θ, applies to the estimation or test on the number of components of a mixture. Doubly so, one might argue, since a very small or a very close component is undistinguishable from a non-existing one. For instance, Definition 2 is correct from a mathematical viewpoint, but it does not spell out the multiple contiguities between k and k’ component mixtures.

The paper starts with a comprehensive coverage of l’état de l’art… When using a Bayes factor to compare a k-component and an h-component mixture, the behaviour of the factor is quite different depending on which model is correct. Essentially overfitted mixtures take much longer to detect than underfitted ones, which makes intuitive sense. And BIC should be corrected for overfitted mixtures by a canonical dimension λ between the true and the (larger) assumed number of parameters  into

$2 \log m(y) = 2 \log p(y|\theta) - \lambda \log O(n) + O(\log\log n)$

I would argue that this purely invalidates BIC in mixture settings since the canonical dimension λ is unavailable (and DIC does not provide a useful substitute as we illustrated a decade ago…) The criticism about Rousseau and Mengersen (2011) over-fitted mixture that their approach shrinks less than a model averaging over several numbers of components relates to minimaxity and hence sounds both overly technical and reverting to some frequentist approach to testing. Replacing testing with estimating sounds like the right idea. And I am also unconvinced that a faster rate of convergence of the posterior probability or of the Bayes factor is a relevant factor when conducting

As for non local priors, the notion seems to rely on a specific topology for the parameter space since a k-component mixture can approach a k’-component mixture (when k'<k) in a continuum of ways (even for a given parameterisation). This topology seems to be summarised by the penalty (distance?) d(θ) in the paper. Is there an intrinsic version of d(θ), given the weird parameter space? Like one derived from the Kullback-Leibler distance between the models? The choice of how zero is approached clearly has an impact on how easily the “null” is detected, the more because of the somewhat discontinuous nature of the parameter space. Incidentally, I find it curious that only the distance between means is penalised… The prior also assumes independence between component parameters and component weights, which I think is suboptimal in dealing with mixtures, maybe suboptimal in a poetic sense!, as we discussed in our reparameterisation paper. I am not sure either that the speed at which the distance converges to zero (in Theorem 1) helps me to understand whether the mixture has too many components for the data’s own good when I can run a calibration experiment under both assumptions.
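The several ways in which a two-component mixture can collapse to a single component are easy to visualise through a Kullback-Leibler divergence computed numerically. The sketch below compares N(0,1) with the mixture (1−w) N(0,1) + w N(δ,1) and lets either the weight w or the shift δ go to zero; the specific densities, the direction of the divergence and the integration grid are illustrative assumptions, unrelated to the penalty d(θ) of the paper.

```python
import numpy as np

GRID = np.linspace(-12.0, 12.0, 20001)    # integration grid (illustrative)

def normal_pdf(x, mean, sd=1.0):
    """Density of N(mean, sd^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def kl_single_vs_mixture(weight, shift):
    """KL divergence from N(0,1) to (1-w) N(0,1) + w N(shift,1), by Riemann sum."""
    f0 = normal_pdf(GRID, 0.0)
    fmix = (1.0 - weight) * f0 + weight * normal_pdf(GRID, shift)
    dx = GRID[1] - GRID[0]
    return float(np.sum(f0 * (np.log(f0) - np.log(fmix))) * dx)

print(kl_single_vs_mixture(0.5, 3.0))     # two well-separated components
print(kl_single_vs_mixture(1e-3, 3.0))    # vanishing weight
print(kl_single_vs_mixture(0.5, 1e-3))    # merging means
```

Both the vanishing-weight and the merging-means paths bring the divergence close to zero, while a prior that penalises only the distance between means reacts to the second path alone.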

While I appreciate the derivation of a closed form non-local prior, I wonder at the importance of the result. Is it because this leads to an easier derivation of the posterior probability? I do not see the connection in Section 3, except maybe that the importance weight indeed involves this normalising constant when considering several k’s in parallel. Is there any convergence issue in the importance sampling solution of (3.1) and (3.3) since the simulations are run under the local posterior? While I appreciate the availability of an EM version for deriving the MAP, a fact I became aware of only recently, is it truly bringing an improvement when compared with picking the MCMC simulation with the highest completed posterior?

The section on prior elicitation is obviously of central interest to me! It however seems to be restricted to the derivation of the scale factor g, in the distance, and of the parameter q in the Dirichlet prior on the weights. While the other parameters suffer from being allocated the conjugate-like priors. I would obviously enjoy seeing how this approach proceeds with our non-informative prior(s). In this regard, the illustration section is nice, but one always wonders at the representative nature of the examples and the possible interpretations of real datasets. For instance, when considering that the Old Faithful is more of an HMM than a mixture.