MAP estimators are not truly Bayesian estimators

This morning, I found this question from Ronald in my mailbox

“I have been reading the Bayesian Choice (2nd Edition).  On page 166 you note that, for continuous parameter spaces, the MAP estimator must be defined as the limit of a sequence of estimators  corresponding to a sequence of 0-1 loss functions with increasingly smaller nonzero epsilons.

This would appear to mean that the MAP estimator for continuous spaces is not Bayes-optimal, as is  frequently claimed.  To be Bayes-optimal, it would have to minimize the Bayes risk for a single, fixed loss function.  Instead, it must be defined using an infinite sequence of loss functions.

Does there exist a formal proof of the Bayes-optimality of the continuous-space MAP estimator, meaning one that is consistent with the usual definition assuming a fixed loss function?  I don’t see how there could be.  If a fixed loss function actually existed, then a definition requiring a limit would be unnecessary.”

which is really making a good point against MAP estimators. In fact, I have never found MAP estimators very appealing, for many reasons, one being indeed that the MAP estimator cannot correctly be expressed as the solution to a minimisation problem. I also find the pointwise nature of the estimator quite a drawback: the estimator is only associated with a local property of the posterior density, not with a global property of the posterior distribution. This is particularly striking when considering the MAP estimates for two different parameterisations: the estimates are often quite different, just because of the Jacobian in the change of parameterisation. For instance, the MAP of the usual normal mean \mu under a flat prior is x, say x=2, but if one uses a logit parameterisation instead

\mu = \log\{\eta/(1-\eta)\}

the MAP in \eta can be quite distinct from 1/(1+\exp(-x)), for instance leading back to a value close to \mu=3 when x=2… Another bad feature is the difference between the marginal MAP and the joint MAP estimates. This is not to say that the MAP cannot be optimal in any sense, as I suspect it could be admissible as a limit of Bayes estimates (under a sequence of loss functions). But a Bayes estimate itself?!
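Here is a quick numerical check of this example, as a minimal sketch only: it assumes a unit-variance normal model, so that the posterior on \mu under the flat prior is N(x,1), and applies the logit reparameterisation above (numpy and scipy handle the optimisation).

    import numpy as np
    from scipy.optimize import minimize_scalar
    from scipy.stats import norm

    x = 2.0   # observation; under the flat prior the posterior on mu is N(x, 1)

    # MAP in the mu parameterisation: the posterior mode is x itself
    map_mu = x

    # posterior density of eta = 1/(1+exp(-mu)) includes the Jacobian 1/(eta(1-eta))
    def neg_log_post_eta(eta):
        mu = np.log(eta / (1.0 - eta))
        log_jacobian = -np.log(eta * (1.0 - eta))
        return -(norm.logpdf(mu, loc=x, scale=1.0) + log_jacobian)

    res = minimize_scalar(neg_log_post_eta, bounds=(1e-9, 1 - 1e-9), method="bounded")
    map_eta = res.x

    print("MAP of mu:               ", map_mu)                           # 2.0
    print("transform of the mu MAP: ", 1.0 / (1.0 + np.exp(-map_mu)))    # about 0.881
    print("MAP of eta:              ", map_eta)                          # about 0.948
    print("logit of the eta MAP:    ", np.log(map_eta / (1 - map_eta)))  # about 2.9

The logit of the \eta-MAP comes out near 2.9 rather than x=2, matching the rough \mu=3 figure above: the two parameterisations do not return the same answer.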

19 Responses to “MAP estimators are not truly Bayesian estimators”

  1. Paulo C. Marques F. Says:

    Thanks for your quick answer, Xi’an.

    With that “argument” I was trying to address just the
    measure-theoretic issue of the MAP depending on a
    particular version of the posterior density.

    In other words, the maximum of a particular smooth
    version of the density may (?) approximate a number
    (the posterior expectation) which does not depend on
    a particular version of the density, and is a legitimate
    Bayes estimate.

    Of course, the issue of invariance with respect to
    reparameterizations is left untouched.

    Regards.

    • True, you first have to replace the version of the density with something like a continuous function, in which case the maximum is uniquely defined. You could also look at it from a simulation point of view, in which case the result depends on the random sequence but not on the version of the density, I believe.

  2. Paulo C. Marques F. Says:

    Hi Xi’an,

    Yesterday a friend of mine (he is a computer scientist) asked me
    about the justification of the MAP (which he uses a lot) as a
    valid Bayes estimate.

    I improvised the following “justification” (!):

    “… under suitable regularity conditions, the MAP, evaluated with
    the help of a particular smooth version of the posterior density,
    approximates the posterior mean when the sample size increases,
    so …”

    Can you buy that? Or is it too nasty?

    Cheers and I hope you and your family are doing well.

    Paulo C. Marques F.

    • Hi Paulo, the argument is alas rather weak because, for instance, it does not work well under a change of variable, i.e., a reparameterisation… The best argument is that it is a weighted maximum likelihood estimate or, from a computational viewpoint, a first-order approximation. But the connection with the mean may be very, very weak…
      Cheers, Xi’an
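      In formulas, and only as a sketch for a dominated model with likelihood f(x\mid\theta) and prior density \pi(\theta), the “weighted maximum likelihood” reading is the identity

      \hat\theta_{\mathrm{MAP}} = \arg\max_\theta \pi(\theta\mid x) = \arg\max_\theta f(x\mid\theta)\,\pi(\theta) = \arg\max_\theta \left\{ \log f(x\mid\theta) + \log \pi(\theta) \right\},

      i.e. a maximum likelihood estimate with the likelihood reweighted by the prior, or equivalently a log-likelihood penalised by \log\pi(\theta).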

  3. Hi Xian,

    Could you explain “because Jeffreys’ inference only depends on the shape of the sampling density” a little more? I think it is an important point.

    Equivariance seems a better choice. I had seen it used for
    symmetry properties of estimators under transformations
    of the sample space, but not for transformations of
    the parameter space (so-called reparameterizations). I will check
    your reference.

    Thank you!

    • My feeling is that the dependence of Jeffreys’ prior on the whole density is not of the same nature as the p-value’s dependence on the data probability distribution, the case that prompted Jeffreys’ famous comment. Overall, I am not utterly convinced by the “Likelihood Principle” that everything should only depend on f(x_0|\theta), if x_0 is the data. Indeed, Bayesian inference is based on Bayes’ theorem and the principles of (conditional) probability, which involve the whole of f(\cdot|\theta), which is what I meant by “shape of the sampling density” in loose terms…

  4. You are right. I’ve just pointed out, in the last post, that even
    estimators that can be understood as Bayes estimates, such as the
    posterior mean, may also exhibit the same lack-of-invariance
    issues as the MAP.

    (By the way, invariance is not the best word here. It would
    be great to say “co-variance”, but of course the word has
    another well established meaning for us statisticians.)

    On the existence of intrinsic loss functions, when I was studying
    Jeffreys’ book (ToP) — since I have a background in differential
    geometry — I tried to push the geometric interpretation of his
    prior as far as I could. Summarizing, what I know about it is:

    1. The parametric model becomes a manifold.

    2. Each parameterization we normally use is just a possible local
    chart of that manifold.

    3. Starting from the Kullback-Leibler and Hellinger distances, Jeffreys
    induced a Riemannian metric g_{ij} on the manifold.

    4. From this metric we get a natural notion of “volume”, given by
    \sqrt{\det(g_{ij})}. The full Jeffreys prior is then a measure
    which is intrinsically uniform on the manifold. Invariance
    (co-variance) of the expression of the prior in particular
    parameterizations is automatic (a one-parameter sketch of this
    invariance is given after this thread).

    5. A loss function given, for example, by the square of the geodesic
    distance between two points of the manifold (which, in many cases,
    will have a nice expression in a particular chart) is one simple
    way to construct an invariant (co-variant) Bayes estimate.

    Browsing the literature, I’ve found out that these things are known
    one way or another. So, what is the problem with this program of
    intrinsic Bayesian inference? What I don’t like about it is that
    the metric g_{ij} depends on the whole structure of the sample space,
    and so all of our inference statements depend on sample space
    points which have not been observed.

    About “how good is this g_{ij}?”, there is a (very difficult
    to understand) theorem of Chentsov which says that this
    “information metric” is the only one satisfying a reasonable
    set of desiderata.

    So, despite the mathematical beauty of these ideas, I’m not sure
    that they must be at the foundation of our theory of inference.

    • I do not think there can be an agreement on the choice of the reference prior, ever. Not only for the reason you mention, namely the incompatibility of the Jeffreys prior with the likelihood principle. I however disagree with the statement that “our inference statements depend on sample space points which have not been observed”, because Jeffreys’ inference only depends on the shape of the sampling density, which is not the same (and not a drawback to me).
      Note that Lehmann and Casella (as well as The Bayesian Choice) use the term equivariance instead of invariance.
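      As a one-parameter sketch of the invariance claimed in point 4 of the comment above (assuming a smooth reparameterisation \eta = h(\theta) and Fisher information I(\cdot)), the information transforms as

      I_\eta(\eta) = I_\theta(\theta)\left(\frac{d\theta}{d\eta}\right)^2,

      so that

      \sqrt{I_\eta(\eta)}\,d\eta = \sqrt{I_\theta(\theta)}\left|\frac{d\theta}{d\eta}\right|d\eta = \sqrt{I_\theta(\theta)}\,d\theta,

      and the Jeffreys measure \sqrt{\det(g_{ij})}\,d\theta is the same object in every chart, even though its density takes a different expression in each parameterisation.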

  5. On the lack of invariance of the MAP, the same happens with
    the so-called Bayes estimator, the posterior mean E[\theta\mid x],
    which, as we know, is the solution of a decision problem with a
    quadratic loss function.

    To me, the point is that my loss function is an expression of
    my ordering of preferences, and so, to be coherent, it doesn’t
    make sense to always use the same quadratic loss function in
    two different parameterizations of the problem.

    A simple way to see this is that if I want to write the posterior
    loss expectation

    \int_\Theta L(a,\theta) \pi(\theta\mid x) d\theta

    in a different parameterization, I need, to keep coherence,
    to transform the density using the Jacobian of the transformation
    considered, but I also have to transform the loss function
    L: if I take it to be quadratic in one parameterization, it will
    generally not be quadratic in another parameterization. If I do
    transform the “whole integrand,” the Bayes decision in one
    parameterization will be the transformed Bayes decision of the
    other parameterization (a short change-of-variables sketch is
    given after this thread).

    My impression, at this stage of my studies, is that a quadratic
    loss function, as well as some sort of default prior, is a kind
    of approximation to a fully elicited pair of loss function
    + prior.

    • Certainly, once a loss function is specified, there is no need for invariance under reparameterisation. But the MAP, being presented as a default Bayes estimate, should not depend on the parameterisation. Note that there exist families of functional loss functions, called intrinsic losses, that do enjoy this invariance.
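      As a sketch of the transformation argument made in the comment above (assuming a smooth one-to-one reparameterisation \eta = h(\theta) and writing b = h(a) for the transformed action), the change of variables \theta = h^{-1}(\eta) gives

      \int_\Theta L(a,\theta)\,\pi(\theta\mid x)\,d\theta = \int_{h(\Theta)} L\big(h^{-1}(b),h^{-1}(\eta)\big)\,\pi\big(h^{-1}(\eta)\mid x\big)\left|\frac{dh^{-1}(\eta)}{d\eta}\right|d\eta = \int_{h(\Theta)} \tilde L(b,\eta)\,\tilde\pi(\eta\mid x)\,d\eta,

      so minimising over b in the \eta parameterisation returns b^\star = h(a^\star): the Bayes decision is equivariant once the loss is transported along with the posterior density, which is exactly the coherence requirement described above.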

  6. I missed a d\theta in the integral!

    I see your point: since the posterior density can be redefined
    arbitrarily on a set of measure zero, we would have problems
    even with a Dirac loss. This “rebellion of the sets of measure zero”
    is always a nasty thing. And using a sequence of loss functions
    will not get around this problem either, because one can, formally,
    always take any point \theta' in the parameter space and redefine
    the density there as

    \pi(\theta' \mid x) = 2 \sup_\theta \pi(\theta \mid x).

    So, under that version of the density, \theta' would be
    the MAP, which is weird from an inferential viewpoint. Of course,
    I’m thinking about problems with a “continuous” parameter.

    You are right. It’s not so easy to formalize the MAP as a
    Bayes estimate.

  7. One way to formalize the MAP as a Bayes decision is to use
    the loss function

    L(a,\theta) = - \delta(\theta - a)

    where \delta(\cdot) is Dirac’s delta. Then the posterior expected loss
    is just

    - \int_\Theta \delta(\theta - a) \pi(\theta \mid x) = - \pi(a \mid x) .

    So, the choice that minimizes the posterior expected loss is
    exactly the posterior mode.

    But is this just physicist/engineer’s sloppy math? Not quite, since
    there are rigorous formalizations of the concept of a generalized
    function, such as Dirac’s delta, out there.

    Also, I think it is possible to formalize the MAP without a delta
    loss, using a sequence of decision problems.

    On the lack of invariance of the MAP, I will make a specific comment
    later.

    Thanks for the blog, Xian! Excellent stuff!

    • This is indeed the way the MAP is often justified, but it is not rigorous: when using a delta function, the quantity to maximise in a is the posterior probability

      \pi(a=\theta|x),

      which is zero in continuous settings. As you mention, there is a formal approach to Dirac’s delta functions, namely Laurent Schwartz’s theory of distributions, but I do not think it helps in solving the problem, because of measurability issues, like the version of the posterior density not being uniquely defined. As for the MAP being the limit of a sequence of Bayes estimates, this was the starting point of the discussion with Ronald: it means the MAP enjoys some optimality, even if it is not a true Bayes estimate.
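      Here is a minimal numerical sketch of that limiting argument, using an arbitrary asymmetric Gamma(2,1) density as a stand-in posterior (so that mode and mean differ): under the loss L_\epsilon(a,\theta)=\mathbb{I}\{|a-\theta|>\epsilon\}, the Bayes action maximises the posterior probability of (a-\epsilon,a+\epsilon), and it moves towards the posterior mode as \epsilon shrinks.

      import numpy as np
      from scipy.stats import gamma

      post = gamma(a=2, scale=1)              # stand-in posterior: mode = 1, mean = 2
      grid = np.linspace(0.0, 10.0, 200001)   # fine grid of candidate actions a

      for eps in [2.0, 1.0, 0.5, 0.25, 0.1]:
          # Bayes action under the 0-1 loss with tolerance eps:
          # maximise P(a - eps < theta < a + eps | x) over a
          coverage = post.cdf(grid + eps) - post.cdf(grid - eps)
          a_bayes = grid[np.argmax(coverage)]
          print(f"eps = {eps:5.2f}   Bayes action = {a_bayes:.4f}")

      # the printed actions decrease towards the posterior mode (1.0) as eps -> 0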

  8. Pingback: “MAP estimators (cont’d)”, a follow-up post giving the details for the normal example, in connection with Anthony’s comments.

  9. So yes, everything I wrote is basically the opposite of being right! Indeed, this is very unimpressive behaviour for an estimator. I hadn’t realized how weird the effect of reparametrization was until just now…

  10. I have struggled with the idea of MAP estimation myself, although mainly because of your second point: that the estimator is associated with a local as opposed to a global property of the posterior. In fact, I don’t really understand the third point, since this seems not to be an issue with reparametrization but rather an issue with the choice of prior. Putting a flat prior on \eta is different from putting a flat prior on \mu, so naturally we obtain a different posterior, and hence a different MAP estimate. I think that if one were to use the logit parametrization but with the prior f(\eta) = 1/(\eta-\eta^2), which amounts to the same improper prior (the Jeffreys prior for \eta in this context, since we used the Jeffreys prior for \mu), then we would get the same posterior.

    • This was presumably poorly worded, but I meant using one given prior, like the Jeffreys prior, and then simply changing the parameterisation: because of the Jacobian, the maximum of the transformed density may be quite different from the transform of the MAP. This is the general lack of invariance under reparameterisation, but it is quite an issue when defining an estimator…
