full Bayesian significance test

Among the many comments (thanks!) I received when posting our Testing via mixture estimation paper came the suggestion to relate this approach to the notion of full Bayesian significance test (FBST) developed by (Julio, not Hal) Stern and Pereira, from São Paulo, Brazil. I thus had a look at this alternative and read the Bayesian Analysis paper they published in 2008, as well as a paper recently published in the Logic Journal of the IGPL. (I could not find what IGPL stands for.) The central notion in these papers is the e-value, which provides the posterior probability that the posterior density is larger than the largest posterior density over the null set. This definition bothers me, first because the null set has measure zero under an absolutely continuous prior (BA, p.82). Hence the posterior density is defined in an arbitrary manner over the null set and the maximum is itself arbitrary. (An issue that invalidates my 1993 version of the Lindley-Jeffreys paradox!) And second because it considers the posterior probability of an event that does not exist a priori, being conditional on the data. This sounds in fact quite similar to Statistical Inference, Murray Aitkin's (2009) book using a posterior distribution of the likelihood function, with the same drawback of using the data twice, plus the other issues discussed in our commentary on the book. (As a side-much-on-the-side remark, the authors incidentally forgot me when citing our 1992 Annals of Statistics paper about decision theory on accuracy estimators…!)
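To fix ideas, here is a minimal editorial sketch of the e-value as I read the definition, with toy ingredients chosen purely for illustration (a normal posterior and a point null, neither taken from the FBST papers):

    # Toy sketch of the e-value: stand-in posterior theta ~ N(1, 0.5^2),
    # sharp (point) null H: theta = 0; both choices are illustrative only.
    import numpy as np
    from scipy import stats

    posterior = stats.norm(loc=1.0, scale=0.5)   # stand-in posterior p(theta|x)
    theta_null = 0.0                             # sharp null H: theta = 0

    # "largest posterior density over the null set": H is a single point
    # here, so this is simply the posterior density evaluated at 0.
    v_star = posterior.pdf(theta_null)

    # Posterior probability that the posterior density exceeds v_star,
    # estimated by Monte Carlo; this is the evidence against H, and the
    # e-value in favour of H is its complement.
    rng = np.random.default_rng(0)
    draws = posterior.rvs(size=100_000, random_state=rng)
    against = np.mean(posterior.pdf(draws) > v_star)
    print(f"evidence against H: {against:.3f}, e-value: {1 - against:.3f}")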

19 Responses to “full Bayesian significance test”

  1. My dear Professor Robert
    I promise that this is my final comment about our FBST. The objective is to use the definition from our BA paper in a very simple problem, to clarify how we see the problem. I hope I am not upsetting you by taking up your precious time.
    Suppose we want to estimate a proportion p in a population. We choose as prior a Beta density with parameters (a,b), obtaining a posterior Beta density with parameters (A,B). As reference density (not reference prior) on the parameter space, we choose the uniform density, Beta(1,1). Clearly, the posterior density for the odds is then a Beta of the second kind with parameters (A,B), and the reference density should likewise be a Beta of the second kind with parameters (1,1). With this set-up the e-value is the same under both posteriors, hence invariant to the parameterization.
    As you can see, we have used the data only to update the information. The reference density could be any density one cares to define. Our objective was to satisfy a referee who criticized our original e-value for depending on the parameterization.
    Hope this clarifies our method; a small numerical sketch follows below.
    With my best to all of you: A good new year to all your community!
    Carlos
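    A small numerical check of Carlos's invariance point above, with illustrative numbers of my own (posterior Beta(A=8, B=4), uniform reference Beta(1,1), sharp null p = 1/2, i.e. odds psi = 1): the posterior and reference densities pick up the same Jacobian under the odds transform, so the surprise function, and with it the e-value, is unchanged.

        import numpy as np
        from scipy import stats

        A, B = 8.0, 4.0
        post_p = stats.beta(A, B)            # posterior for the proportion p
        ref_p = stats.beta(1.0, 1.0)         # uniform reference density on (0,1)

        def s_p(t):                          # surprise function in p
            return post_p.pdf(t) / ref_p.pdf(t)

        def s_psi(u):                        # surprise in the odds psi = p/(1-p)
            t = u / (1.0 + u)                # back-transform p = psi/(1+psi)
            jac = 1.0 / (1.0 + u) ** 2       # |dp/dpsi|, common to both densities
            return (post_p.pdf(t) * jac) / (ref_p.pdf(t) * jac)  # Jacobians cancel

        rng = np.random.default_rng(1)
        p = post_p.rvs(size=200_000, random_state=rng)
        psi = p / (1.0 - p)                  # the same draws, expressed as odds

        ev_p = np.mean(s_p(p) <= s_p(0.5))          # e-value computed in p
        ev_psi = np.mean(s_psi(psi) <= s_psi(1.0))  # e-value computed in psi
        print(f"{ev_p:.4f} vs {ev_psi:.4f}")        # identical: invariance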

    • Dear Carlos, thanks a lot for the further comments! And please feel free to keep discussing the FBST: this is the point of posting my own comments, namely that we can engage in some light discussion! Best wishes for the New Year! Xi’an

      • Dear all,

        I don’t see why the prior e-value does not satisfy Christian as the “prior probability of the event…”. I believe this is what Carlos and Paulo are trying to say. Any density on the parameter space yields its e-value, regardless of what data it is conditioned on (if any).

        I personally prefer to treat the FBST as indeed a test, i.e., as the solution of a decision problem. The mathematical trouble (if any) of “using data twice” then becomes encapsulated in the loss function having x as an argument. Nevertheless, I think Fubini’s equivalence of the normal and extensive forms still holds and probability rules are never violated. So why should using the data twice be a “drawback”?

        I will explore this issue further.

        Merry Christmas to all.

        sergio

      • Dear Robert: I would like to make an additional comment. In the article M.R. Madruga, L.G. Esteves and S. Wechsler (2001), On the Bayesianity of Pereira-Stern Tests, Test, 10, 2, 291-299, the authors prove that the FBST can be derived from a well-specified loss function. I am always looking for alternative interpretations and different epistemological frameworks, but Sergio’s (and his co-authors’) theorem demonstrates that this is not strictly necessary: the FBST can be fully understood within the scope of standard Bayesian statistics. As for using the data twice, I believe that, once again, Sergio is right. Fubini’s theorem, allowing the change of integration order, can be used to show the equivalence of the normal and extensive forms of the implied decision rules. I believe that this is enough to answer your questions and resolve your concerns.

      • My dear colleagues,
        Let me enter the discussion with a little pragmatic opinion.
        Professor Kemp once told me that we would be successful with Bayesianism only if we could provide an alternative to p-values.
        My first try was the P-value that Sergio and I published in our Brazilian journal some years ago. Later, once we understood the problems related to Lindley’s paradox, Julio and I wrote about the e-value in 1999. To compute this index, one only needs a density function; there is no need to speak of data at that step. After that, Sergio and his colleagues wrote about the decision-theoretic method for significance testing, which maybe has to use the data a second time. Julio, on the other side, wrote about the philosophical aspects of the use of e-values, including the composition of different hypotheses on the same space. Both did a superb job in disseminating our e-value. However, I myself have not paid much attention to those other aspects when using e-values in real problems. I leave the decision to the scientists, as Cox and Kemp did in their work. My problem these days is how to define the significance level, and I am trying to define it just in the way Pericchi and I have done recently. I still cannot see an alternative to e-values along the lines of that last paper with Luis. In fact, Julio and Marcelo found a nice way in a specific problem on Hardy-Weinberg equilibrium. So, I do not use the data for anything after obtaining my main object: a density that represents the opinion of a scientist at the moment they request my work (clearly, I help them to get posteriors).
        A good new year to all of you.

  2. Dear All:
    I am sorry to enter the discussion. I am very happy to see our e-value being the subject of a discussion among VIP statisticians.
    If I understood correctly, the host of this important page, Dr Robert, objects to our Bayesian method because he does not agree with computing probabilities of events under the posterior density.
    I would like to say only that I do not necessarily need data to compute e-values. If one gives me a density and a hypothesis, I can compute the e-value: let E be the set of points whose density is smaller than the density of at least one point of the set that defines the hypothesis; the e-value is the probability of E! So I did not use any data, I only used the well-known probability space. Our host is a foremost expert in MCMC, which uses frequencies a very large number of times. Can I then say that he is a frequentist?

    • Dear Carlos, I hope you are not truly sorry to enter the discussion, otherwise you should not feel any pressure to do so! And by all means feel free to call me a frequentist!!! Now, I am certainly not trying to decide who’s Bayesian and who isn’t, or who’s more and who’s less! I am far from the keeper of orthodoxy there and do not care much about orthodoxy anyway!!!

      My remark is solely based on the definition found in your paper in BA, p.81-82, of the Bayesian evidence against H: it considers the posterior probability of the event that the posterior density is larger than the maximum posterior density under the null H. Leaving aside that this event is not parametrisation-invariant, let us consider the special and simpler case when the prior is flat. The event is then that the likelihood is larger than the maximum likelihood under the null H. Hence this definition of evidence uses the likelihood function twice, both in the posterior and in the event. And that is all I mean by this remark!

    • Dear All, if you are still following this thread, I would like to hear your opinion about the first difficulty I have, namely the use of a density (towards getting the maximum posterior) on a measure zero set corresponding to the null hypothesis. Thanks!

      • About the difficulties of performing the optimization step of the e-value, namely maximizing a proper density (or surprise) function over a sharp hypothesis H (presented as a regular proper sub-manifold of the parameter space):
        Technical difficulty: easy to moderate. Optimizing a proper density or surprise function is easy, as long as one has good constrained optimization software available; several of our articles with applications of the FBST discuss good optimization techniques (a sketch of this step follows below). Note: obtaining a non-zero measure over the sharp H would be very tricky indeed (leading to all kinds of problems, like Lindley’s paradox), but that is precisely the difficulty that the e-value avoids!
        Theoretical difficulty: moderate. It requires the user to understand that the e-value is NOT related to the posterior probability of H, nor to a ratio of probabilities of H and its complement, etc. Instead, the e-value is a Possibility measure of H. Furthermore, the user must understand that the e-value, ev(H|X), has the desired properties of Consistency with its underlying posterior probability measure, p_n(t), and Conformity with its underlying surprise function, s(t) = p_n(t) / r(t). Our two papers in the Logic Journal of the IGPL cover these topics with great care.
        Epistemological difficulty: moderate to great. Each well-constructed statistical significance measure must be accompanied by a suitable epistemological framework: for example, p-values are escorted by Popperian falsificationism and the corresponding metaphor of the scientific tribunal, while Bayes factors are escorted by de Finettian decision theory and the corresponding betting-odds metaphor. Several articles cited in our Logic Journal of the IGPL papers discuss the cognitive constructivism epistemological framework and the corresponding metaphor of objects as eigen-solutions. Note: usually, the practice of statistics only requires very superficial epistemological discussions, if any, although sparse epistemological arguments are often used to justify some key theoretical properties required of statistical procedures.
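        As to the technical difficulty above, a minimal editorial sketch of the optimization step, with stand-in ingredients (a bivariate normal posterior and the sharp hypothesis H: theta_1 = theta_2, a line in R^2; neither comes from the papers), using an off-the-shelf constrained optimizer:

            import numpy as np
            from scipy import optimize, stats

            # stand-in posterior density p_n(t) on R^2 (illustrative only)
            post = stats.multivariate_normal(mean=[1.0, 0.0],
                                             cov=[[1.0, 0.3], [0.3, 0.5]])

            def neg_log_post(t):             # maximize the density by
                return -post.logpdf(t)       # minimizing its negative log

            # sharp hypothesis H: t1 = t2, written as an equality constraint
            constraint = {"type": "eq", "fun": lambda t: t[0] - t[1]}

            res = optimize.minimize(neg_log_post, x0=np.zeros(2),
                                    method="SLSQP", constraints=[constraint])
            v_star = np.exp(-res.fun)        # sup of the posterior density on H
            print("argmax on H:", res.x, " v* =", v_star)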

      • Crowd,

        I believe we are dancing around frequentist superstitions here. Let’s be Bayesian at least during these last two days of the year.

        If H has prior probability zero, it also has probability zero posterior to any data used once, twice, or any finite number of times.

        On the other hand, if H has positive probability, one a fortiori sticks to the probability calculus and avoids the FBST. There will not be any P(theta given x and x) different from P(theta given x). So even if one uses the data “twice”, the definition of conditional probability keeps one coherent.

        A complication may arise if opinion updating is modelled as probability kinematics or some other non-static rule. This could deserve further research, one of course being advised that no coherence theorem will ever arise (see Phil Dawid on robots).

        Another issue is whether the Likelihood Principle is violated by the use of model-dependent official priors. It may be argued that, technically, the LP is not violated, as it says that inferences under proportional likelihoods must be the same *for any fixed prior*.

        So what? One can then follow the LP while being incoherent. De Finetti sort of made this point when he snubbed the Likelihood Principle as a trivial and “obvious” property of Bayesian inference.

        Happy New Year to all.

        sw

      • Thanks for your additional comments. I am writing a second post to try to express my “concerns” more clearly. In the meantime, Happy New Year!

  3. Dear Christian:

    (1) IGPL stands for Interest Group in Pure and Applied Logics (as stated in Arnold’s comment). We have published two articles in the Logic Journal of the IGPL: Borges and Stern (2007) and Stern and Pereira (2014).

    (2) I wonder why you say that the e-value uses the data twice (?!)

    (2a) The prior-posterior update, p_n(t) = c_n * p_0(t) * L(t|x_1,…,x_n), incorporates (once) the information in the likelihood function into the posterior distribution, p_n(t). Hence, a distribution p_2n(t) = c_2n * p_n(t) * L(t|x_1,…,x_n) would indeed incorporate the same information twice. In this sense, I fully agree with you that Murray Aitkin’s procedure uses the data twice. However, in the construction of the e-value, we do nothing even remotely related to this double use of the information contained in the likelihood function.

    (2b) Do you say that the e-value uses the data twice just because the e-value uses a reference density and a prior distribution? Please note that these two distributions have very distinct roles. As in any standard Bayesian model, the prior distribution, p_0(t), represents the initial information available to the statistician in his or her modeling context. In contrast, the reference density, r(t), has a very different role: it is used to define the surprise function, s(t) = p_n(t) / r(t). The reference density represents the standard geometry of the parameter space (in its ground-level, minimum-information or maximum-entropy state). Please note that most statistical models assume a standard metric for the parameter space, dl^2 = dt' G(t) dt, where G(t) is a metric tensor. The e-value explicitly takes this fact into account, in order to build an invariant procedure. Once again, we are not using any data twice (no pun intended).

    (2c) Do you say that the e-value uses the data twice just because the e-value is built in a two-step procedure? Indeed, the e-value comprises an Optimization step, finding v* = max_{t in H} s(t), and an Integration step, ev(H|X) = Int_{t : s(t) <= v*} p_n(t) dt. This two-step procedure achieves a significance measure that: (i) is fully compliant with the Likelihood Principle (as implied by Paulo's comment) and therefore cannot be accused of using the data twice; (ii) is fully invariant under re-parameterizations of the parameter space or of the hypothesis set; and (iii) has powerful compositional properties (logical rules for combining truth values). (iv) Moreover, although it was not initially conceived in the decision-theoretic framework, the FBST can be derived from an appropriate loss function; see Madruga et al. (2001).
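    To make (2c) concrete, a compact editorial sketch of the two steps, with assumed stand-ins (a Gamma posterior p_n, an exponential reference r chosen non-flat so that the surprise differs from the posterior density, and a sharp null H: t = 1; all illustrative):

        import numpy as np
        from scipy import stats

        p_n = stats.gamma(a=5.0, scale=0.4)  # stand-in posterior p_n(t)
        r = stats.expon(scale=2.0)           # stand-in reference density r(t)

        def s(t):                            # surprise s(t) = p_n(t) / r(t)
            return p_n.pdf(t) / r.pdf(t)

        v_star = s(1.0)                      # Optimization step: max of s over H = {1}

        rng = np.random.default_rng(2)
        t = p_n.rvs(size=100_000, random_state=rng)
        ev = np.mean(s(t) <= v_star)         # Integration step: ev(H|X) by Monte Carlo
        print(f"ev(H|X) = {ev:.3f}")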

    (3) For further discussion: this correspondence of ours started by discussing the technical similarities between the approaches for model choice in Kamary et al. (2014) and Lauretto et al. (2007). In your paper Robert et al. (2011), you explain some difficulties with using ABC for model choice while relying on Bayes factors. I believe that both of our aforementioned approaches could be used to overcome these difficulties. What do you say?

    References:
    – Borges, Stern (2007). The Rules of Logic Composition for the Bayesian Epistemic e-Values. Logic Journal of the IGPL, 15(5-6), 401-420.
    – Lauretto, Faria, Pereira, Stern (2007). The Problem of Separate Hypotheses via Mixture Models. AIP Conference Proceedings, 954, 268-275.
    – Kamary, Mengersen, Robert, Rousseau (2014). Testing Hypotheses via a Mixture Estimation Model. arXiv preprint.
    – Robert, Cornuet, Marin, Pillai (2011). Lack of Confidence in Approximate Bayesian Computation Model Choice. PNAS, 108(37), 15112-15117.
    – Stern, Pereira (2014). Bayesian Epistemic Values: Focus on Surprise, Measure Probability! Logic Journal of the IGPL, 22(2), 236-254.
    • Thanks, Julio. My feeling is that, by using the posterior probability that the surprise is larger than its largest value over the smaller space, the evidence computes the posterior [hence data-based] probability of an event that depends on the data and the likelihood function. There is no prior equivalent of this quantity.

      • Dear Robert:
        When you say that the procedure “…depends on the data AND the likelihood function”, it gives the impression that it depends on the data in a second way, beyond the likelihood function. That would be a violation of the Likelihood Principle, and it is not the case: the e-value (and the FBST) are strictly compliant with the Likelihood Principle (something that many pseudo-Bayesian procedures floating around fail to accomplish). Hence, it seems that your objection boils down to my point (2c): …just because the e-value is built in a two-step procedure, a maximization step followed by an integration step.
        In this case, I am afraid that the commandment “thou shalt use the data once” (in a single integration step) becomes so restrictive that it precludes anything outside the established orthodoxy; that is, it cannot be considered a foundational principle, becoming instead a normative law or a requirement for canonical procedures. The e-value is indeed very different from a Bayes factor. If that is the bulk of your objection, we are in full agreement!

  4. Arnold Baise:

    IGPL stands for Interest Group in Pure and Applied Logics.

  5. Hello Professor Robert,

    Regarding your second comment, that computing the Pereira-Stern evidence seems to use the data twice: suppose that I give you some version of the posterior density and tell you the subset of the parameter space that corresponds to the null hypothesis, but do not show you any data. Using only this information it is possible to compute the Pereira-Stern evidence.

    Cheers,

    Paulo.

    • Thanks, Paulo: there is still data-dependent information in your communicating this event to me. In other words, I do not think I can compute a prior probability of this event.
