## Jeffreys priors for hypothesis testing [Bayesian reads #2]

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , , , , , on February 9, 2019 by xi'an

A second (re)visit to a reference paper I gave to my OxWaSP students for the last round of this CDT joint program. Indeed, this may be my first complete read of Susie Bayarri and Gonzalo Garcia-Donato 2008 Series B paper, inspired by Jeffreys’, Zellner’s and Siow’s proposals in the Normal case. (Disclaimer: I was not the JRSS B editor for this paper.) Which I saw as a talk at the O’Bayes 2009 meeting in Phillie.

The paper aims at constructing formal rules for objective proper priors in testing embedded hypotheses, in the spirit of Jeffreys’ Theory of Probability “hidden gem” (Chapter 3). The proposal is based on symmetrised versions of the Kullback-Leibler divergence κ between null and alternative used in a transform like an inverse power of 1+κ. With a power large enough to make the prior proper. Eventually multiplied by a reference measure (i.e., the arbitrary choice of a dominating measure.) Can be generalised to any intrinsic loss (not to be confused with an intrinsic prior à la Berger and Pericchi!). Approximately Cauchy or Student’s t by a Taylor expansion. To be compared with Jeffreys’ original prior equal to the derivative of the atan transform of the root divergence (!). A delicate calibration by an effective sample size, lacking a general definition.

At the start the authors rightly insist on having the nuisance parameter v to differ for each model but… as we all often do they relapse back to having the “same ν” in both models for integrability reasons. Nuisance parameters make the definition of the divergence prior somewhat harder. Or somewhat arbitrary. Indeed, as in reference prior settings, the authors work first conditional on the nuisance then use a prior on ν that may be improper by the “same” argument. (Although conditioning is not the proper term if the marginal prior on ν is improper.)

The paper also contains an interesting case of the translated Exponential, where the prior is L¹ Student’s t with 2 degrees of freedom. And another one of mixture models albeit in the simple case of a location parameter on one component only.

## ISBA 2016 [#4]

Posted in pictures, Running, Statistics, Travel with tags , , , , , , , , , , on June 17, 2016 by xi'an

As an organiser of the ABC session (along with Paul Fearnhead), I was already aware of most results behind the talks, but nonetheless got some new perspectives from the presentations. Ewan Cameron discussed a two-stage ABC where the first step is actually an indirect inference inference, which leads to a more efficient ABC step. With applications to epidemiology. Lu presented extensions of his work with Paul Fearnhead, incorporating regression correction à la Beaumont to demonstrate consistency and using defensive sampling to control importance sampling variance. (While we are working on a similar approach, I do not want to comment on the consistency part, but I missed how defensive sampling can operate in complex ABC settings, as it requires advanced knowledge on the target to be effective.) And Ted Meeds spoke about two directions for automatising ABC (as in the ABcruise), from incorporating the pseudo-random generator into the representation of the ABC target, to calling for deep learning advances. The inclusion of random generators in the transform is great, provided they can remain black boxes as otherwise they require recoding. (This differs from quasi-Monte Carlo ABC, which aims at reducing the variability due to sheer noise.) It took me a little while, but I eventually understood why Jan Haning saw this inclusion as a return to fiducial inference!

Merlise Clyde gave a wide-ranging plenary talk on (linear) model selection that looked at a large range of priors under the hat of generalised confluent hypergeometric priors over the mixing scale in Zellner’s g-prior. Some were consistent under one or both models, maybe even for misspecified models. Some parts paralleled my own talk on the foundations of Bayesian tests, no wonder since I mostly give a review before launching into a criticism of the Bayes factor. Since I think this may be a more productive perspective than trying to over-come the shortcomings of Bayes factors in weakly informative settings. Some comments at the end of Merlise’s talk were loosely connected to this view in that they called for an unitarian perspective [rather than adapting a prior to a specific inference problem] with decision-theoretic backup. Conveniently the next session was about priors and testing, obviously connected!, with Leo Knorr-Held considering g-priors for the Cox model, Kerrie Mengersen discussing priors for over-fitted mixtures and HMMs, and Dan Simpson entertaining us with his quest of a prior for a point process, eventually reaching PC priors.

## messages from Harvard

Posted in pictures, Statistics, Travel, University life with tags , , , , , , on March 24, 2016 by xi'an

As in Bristol two months ago, where I joined the statistics reading in the morning, I had the opportunity to discuss the paper on testing via mixtures prior to my talk with a group of Harvard graduate students. Which concentrated on the biasing effect of the Bayes factor against the more complex hypothesis/model. Arguing [if not in those terms!] that Occam’s razor was too sharp. With a neat remark that decomposing the log Bayes factor as

log(p¹(y¹,H))+log(p²(y²|y¹,H))+…

meant that the first marginal was immensely and uniquely impacted by the prior modelling, hence very likely to be very small for a larger model H, which would then take forever to recover from. And asking why there was such a difference with cross-validation

log(p¹(y¹|y⁻¹,H))+log(p²(y²|y⁻²,H))+…

where the leave-one out posterior predictor is indeed more stable. While the later leads to major overfitting in my opinion, I never spotted the former decomposition which does appear as a strong and maybe damning criticism of the Bayes factor in terms of long-term impact of the prior modelling.

Other points made during the talk or before when preparing the talk:

1. additive mixtures are but one encompassing model, geometric mixtures could be fun too, if harder to process (e.g., missing normalising constant). Or Zellner’s mixtures (with again the normalising issue);
2. if the final outcome of the “test” is the posterior on α itself, the impact of the hyper-parameter on α is quite relative since this posterior can be calibrated by simulation against limiting cases (α=0,1);
3. for the same reason the different rate of accumulation near zero and one  when compared with a posterior probability is hardly worrying;
4. what I see as a fundamental difference in processing improper priors for Bayes factors versus mixtures is not perceived as such by everyone;
5. even a common parameter θ on both models does not mean both models are equally weighted a priori, which relates to an earlier remark in Amsterdam about the different Jeffreys priors one can use;
6. the MCMC output also produces a sample of θ’s which behaviour is obviously different from single model outputs. It would be interesting to study further the behaviour of those samples, which are not to be confused with model averaging;
7. the mixture setting has nothing intrinsically Bayesian in that the model can be processed in other ways.

## reflections on the probability space induced by moment conditions with implications for Bayesian Inference [slides]

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , on December 4, 2014 by xi'an

Here are the slides of my incoming discussion of Ron Gallant’s paper, tomorrow.

## reflections on the probability space induced by moment conditions with implications for Bayesian Inference [discussion]

Posted in Books, Statistics, University life with tags , , , , , , on December 1, 2014 by xi'an

[Following my earlier reflections on Ron Gallant’s paper, here is a more condensed set of questions towards my discussion of next Friday.]

“If one specifies a set of moment functions collected together into a vector m(x,θ) of dimension M, regards θ as random and asserts that some transformation Z(x,θ) has distribution ψ then what is required to use this information and then possibly a prior to make valid inference?” (p.4)

The central question in the paper is whether or not given a set of moment equations

$\mathbb{E}[m(X_1,\ldots,X_n,\theta)]=0$

(where both the Xi‘s and θ are random), one can derive a likelihood function and a prior distribution compatible with those. It sounds to me like a highly complex question since it implies the integral equation

$\int_{\Theta\times\mathcal{X}^n} m(x_1,\ldots,x_n,\theta)\,\pi(\theta)f(x_1|\theta)\cdots f(x_n|\theta) \text{d}\theta\text{d}x_1\cdots\text{d}x_n=0$

must have a solution for all n’s. A related question that was also remanent with fiducial distributions is how on Earth (or Middle Earth) the concept of a random theta could arise outside Bayesian analysis. And another one is how could the equations make sense outside the existence of the pair (prior,likelihood). A question that may exhibit my ignorance of structural models. But which may also relate to the inconsistency of Zellner’s (1996) Bayesian method of moments as exposed by Geisser and Seidenfeld (1999).

For instance, the paper starts (why?) with the Fisherian example of the t distribution of

$Z(x,\theta) = \frac{\bar{x}_n-\theta}{s/\sqrt{n}}$

which is truly is a t variable when θ is fixed at the true mean value. Now, if we assume that the joint distribution of the Xi‘s and θ is such that this projection is a t variable, is there any other case than the Dirac mass on θ? For all (large enough) sample sizes n? I cannot tell and the paper does not bring [me] an answer either.

When I look at the analysis made in the abstraction part of the paper, I am puzzled by the starting point (17), where

$p(x|\theta) = \psi(Z(x,\theta))$

since the lhs and rhs operate on different spaces. In Fisher’s example, x is an n-dimensional vector, while Z is unidimensional. If I apply blindly the formula on this example, the t density does not integrate against the Lebesgue measure in the n-dimension Euclidean space… If a change of measure allows for this representation, I do not see so much appeal in using this new measure and anyway wonder in which sense this defines a likelihood function, i.e. the product of n densities of the Xi‘s conditional on θ. To me this is the central issue, which remains unsolved by the paper.

## Regularisation

Posted in Statistics, University life with tags , , , , , , , , on October 5, 2010 by xi'an

After a huge delay, since the project started in 2006 and was first presented in Banff in 2007 (as well as included in the Bayesian Core), Gilles Celeux,  Mohammed El Anbari, Jean-Michel Marin, and myself have eventually completed our paper on using hyper-g priors variable selection and regularisation in linear models . The redaction of this paper was mostly delayed due to the publication of the 2007 JASA paper by Feng Liang, Rui Paulo, German Molina, Jim Berger, and Merlise Clyde, Mixtures of g-priors for Bayesian variable selection. We had indeed (independently) obtained very similar derivations based on hypergeometric function representations but, once the above paper was published, we needed to add material to our derivation and chose to run a comparison study between Bayesian and non-Bayesian methods for a series of simulated and true examples. It took a while to Mohammed El Anbari to complete this simulation study and even longer for the four of us to convene and agree on the presentation of the paper. The only difference between Liang et al.’s (2007) modelling and ours is that we do not distinguish between the intercept and the other regression coefficients in the linear model. On the one hand, this gives us one degree of freedom that allows us to pick an improper prior on the variance parameter. On the other hand, our posterior distribution is not invariant under location transforms, which was a point we heavily debated in Banff… The simulation part shows that all “standard” Bayesian solutions lead to very similar decisions and that they are much more parsimonious than regularisation techniques.

Two other papers posted on arXiv today address the model choice issue. The first one by Bruce Lindsay and Jiawei Liu introduces a credibility index, and the second one by Bazerque, Mateos, and Giannakis considers group-lasso on splines for spectrum cartography.

## Hyper-g priors

Posted in Books, R, Statistics with tags , , , , , , on August 31, 2010 by xi'an

Earlier this month, Daniel Sabanés Bové and Leo Held posted a paper about g-priors on arXiv. While I glanced at it for a few minutes, I did not have the chance to get a proper look at it till last Sunday. The g-prior was first introduced by the late Arnold Zellner for (standard) linear models, but they can be extended to generalised linear models (formalised by the late John Nelder) at little cost. In Bayesian Core, Jean-Michel Marin and I do centre the prior modelling in both linear and generalised linear models around g-priors, using the naïve extension for generalised linear models,

$\beta \sim \mathcal{N}(0,g \sigma^2 (\mathbf{X}^\text{T}\mathbf{X})^{-1})$

as in the linear case. Indeed, the reasonable alternative would be to include the true information matrix but since it depends on the parameter $\beta$ outside the normal case this is not truly an alternative. Bové and Held propose a slightly different version

$\beta \sim \mathcal{N}(0,g \sigma^2 c (\mathbf{X}^\text{T}\mathbf{W}\mathbf{X})^{-1})$

where W is a diagonal weight matrix and c is a family dependent scale factor evaluated at the mode 0. As in Liang et al. (2008, JASA) and most of the current literature, they also separate the intercept $\beta_0$ from the other regression coefficients. They also burn their “improperness joker” by choosing a flat prior on $\beta_0$, which means they need to use a proper prior on g, again as Liang et al. (2008, JASA), for the corresponding Bayesian model comparison to be valid. In Bayesian Core, we do not separate $\beta_0$ from the other regression coefficients and hence are left with one degree of freedom that we spend in choosing an improper prior on g instead. (Hence I do not get the remark of Bové and Held that our choice “prohibits Bayes factor comparisons with the null model“. As argued in Bayesian Core, the factor g being an hyperparameter shared by all models, we can use the same improper prior on g in all models and hence use standard Bayes factors.) In order to achieve closed form expressions, the authors use Cui and George ‘s (2008) prior

$\pi(g) \propto (1+g)^{1+a}\exp\{-b/(1+g)\}$

which requires the two hyper-hyper-parameters a and b to be specified.

The second part of the paper considers computational issues. It compares the ILA solution of Rue, Martino and Chopin (2009, Series B) with an MCMC solution based on an independent proposal on g resulting from linear interpolations (?). The marginal likelihoods are approximated by Chib and Jeliazkov (2001, JASA) for the MCMC part. Unsurprisingly, ILA does much better, even with a 97% acceptance rate in the MCMC algorithm.

The paper is very well-written and quite informative about the existing literature. It also uses the Pima Indian dataset  (The authors even dug out a 1991 paper of mine I had completely forgotten!) I am actually thinking of using the review in our revision of Bayesian Core, even though I think we should stick to our choice of including $\beta_0$ within the set of parameters…