## MAP estimators are not truly Bayesian estimators

**T**his morning, I found this question from Ronald in my mailbox

“I have been reading the Bayesian Choice (2nd Edition). On page 166 you note that, for continuous parameter spaces, the MAP estimator must be defined as the limit of a sequence of estimators corresponding to a sequence of 0-1 loss functions with increasingly smaller nonzero epsilons.

This would appear to mean that the MAP estimator for continuous spaces is not Bayes-optimal, as is frequently claimed. To be Bayes-optimal, it would have to minimize the Bayes risk for a single, fixed loss function. Instead, it must be defined using an infinite sequence of loss functions.

Does there exist a formal proof of the Bayes-optimality of the continuous-space MAP estimator, meaning one that is consistent with the usual definition assuming a fixed loss function? I don’t see how there could be. If a fixed loss function actually existed, then a definition requiring a limit would be unnecessary.”

which is really making a good point against MAP estimators. In fact, I have never found MAP estimators very appealing for many reasons, one being indeed that the MAP estimator cannot correctly be expressed as the solution to a minimisation problem. I also find the pointwise nature of the estimator quite a drawback: the estimator is only associated with a local property of the posterior density, not with a global property of the posterior distribution. This is particularly striking when considering the MAP estimates for two different parameterisations. The estimates often are quite different, just due to the Jacobian in the change of parameterisation. For instance, the MAP of the usual normal mean $\mu$ under a flat prior is $x$, but if one uses a logit parameterisation instead

$\mu = \log\{\eta/(1-\eta)\}$

the MAP in $\eta$ can be quite distinct from $1/(1+e^{-x})$ … Another bad feature is the difference between the marginal MAP and the joint MAP estimates. This is not to say that the MAP cannot be optimal in any sense, as I suspect it could be admissible as a limit of Bayes estimates (under a sequence of loss functions). But a Bayes estimate itself?!
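To make the Jacobian effect concrete, here is a minimal numerical sketch. It assumes a single observation $x \sim \mathcal{N}(\mu, 1)$ with a flat prior on $\mu$, takes $x = 2$ purely for illustration, and maximises the induced posterior density of $\eta = 1/(1+e^{-\mu})$. The helper names (`log_post_eta`, `golden_max`) are hypothetical, not taken from the post.

```python
import math

def logistic(mu):
    return 1.0 / (1.0 + math.exp(-mu))

def logit(eta):
    return math.log(eta / (1.0 - eta))

def log_post_eta(eta, x):
    # Posterior of mu is N(x, 1); the change of variable mu = logit(eta)
    # multiplies the density by the Jacobian 1 / (eta * (1 - eta)).
    mu = logit(eta)
    return -0.5 * (mu - x) ** 2 - math.log(eta) - math.log(1.0 - eta)

def golden_max(f, lo, hi, tol=1e-12):
    # Golden-section search for the maximiser of a unimodal function.
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) > f(d):
            b = d  # the maximiser lies in [a, d]
        else:
            a = c  # the maximiser lies in [c, b]
    return 0.5 * (a + b)

x = 2.0       # illustrative observation (an assumption, not a value from the post)
map_mu = x    # mode of the N(x, 1) posterior on mu
map_eta = golden_max(lambda e: log_post_eta(e, x), 1e-9, 1.0 - 1e-9)

print("transform of the MAP in mu:", logistic(map_mu))
print("MAP in eta                :", map_eta)
print("MAP in eta, mapped to mu  :", logit(map_eta))
```

Under these assumptions the mode in $\eta$ maps back to $\mu \approx 2.9$ rather than to $x = 2$, even though the posterior distribution is the same in both parameterisations.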

March 29, 2010 at 6:44 pm

Thanks for your quick answer, Xi’an.

With that “argument” I was trying to address just the measure-theoretic issue of the MAP depending on a particular version of the posterior density. In other words, the maximum of a particular smooth version of the density may (?) approximate a number (the posterior expectation) which does not depend on a particular version of the density, and is a legitimate Bayes estimate.

Of course, the issue of invariance with respect to reparameterizations is left untouched.

Regards.

March 29, 2010 at 8:03 pm

True, you first have to replace the version of the density with something like a continuous function, in which case the maximum is uniquely defined. You could also look at it from a simulation point of view, in which case the result depends on the random sequence but not on the version of the density, I believe.
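As one possible reading of the simulation viewpoint (my interpretation, not a method spelled out in the reply), the mode can be estimated from posterior draws alone, for instance as the midpoint of the shortest interval holding a fixed fraction of the draws. The sketch assumes a Gamma(3, 1) posterior, whose mode is 2; the function name and the 20% fraction are arbitrary choices.

```python
import random

def shortest_window_mode(draws, frac=0.2):
    # Midpoint of the shortest interval containing a fraction `frac`
    # of the draws; only the sample is used, never a density version.
    xs = sorted(draws)
    n = len(xs)
    k = max(2, int(frac * n))
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return 0.5 * (xs[start] + xs[start + k - 1])

rng = random.Random(0)  # the estimate depends on this random sequence
draws = [rng.gammavariate(3.0, 1.0) for _ in range(100_000)]
print(shortest_window_mode(draws))
```

Changing the seed changes the estimate slightly, but redefining the density on a set of measure zero cannot change it, since the draws are unaffected.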

March 29, 2010 at 6:00 pm

Hi Xi’an,

Yesterday a friend of mine (he is a computer scientist) asked me about the justification of the MAP (which he uses a lot) as a valid Bayes estimate.

I improvised the following “justification” (!):

“… under suitable regularity conditions, the MAP, evaluated with the help of a particular smooth version of the posterior density, approximates the posterior mean when the sample size increases, so …”

Can you buy that? Or is it too nasty?

Cheers and I hope you and your family are doing well.

Paulo C. Marques F.

March 29, 2010 at 6:32 pm

Hi Paulo, the argument is alas rather weak because, for instance, it does not work well under a change of variable, i.e., a reparameterisation… The best argument is that it is a weighted maximum likelihood estimate or, from a computational viewpoint, that it is a first-order approximation. But the connection with the mean may be very, very weak…

Cheers, Xi’an

September 15, 2009 at 8:09 am

Hi Xian,

Could you explain “because Jeffreys’ inference only depends on the shape of the sampling density” a little more? I think it is an important point.

Equivariance seems a better choice. I had seen it used with respect to symmetry properties of estimators under transformations of the sample space, but not in the case of transformations of the parameter space (so-called reparameterizations). I will check your reference.

Thank you!

September 15, 2009 at 9:24 am

My feeling is that the dependence of Jeffreys’ prior on the whole density is not of the same nature as the dependence of the p-value on the probability distribution of the data, the case that prompted Jeffreys’ famous comment. Overall, I am not utterly convinced by the “Likelihood Principle” that everything should only depend on $f(x \mid \theta)$, if $x$ is the data. Indeed, Bayesian inference is based on Bayes’ theorem and the principles of (conditional) probability, which involve the whole of $f(\cdot \mid \theta)$, which is what I meant by “shape of the sampling density” in loose terms…

September 14, 2009 at 5:04 pm

You are right. I’ve just pointed out, in the last post, that even guys that can be understood as Bayes estimates, such as the posterior mean, may also exhibit the same lack of invariance as the MAP.

(By the way, invariance is not the best word here. It would be great to say “co-variance”, but of course the word has another well-established meaning for us statisticians.)

On the existence of intrinsic loss functions: when I was studying Jeffreys’ book (TOP), and since I have a background in differential geometry, I tried to push the geometric interpretation of his prior as far as I could. Summarizing, what I know about it is:

1. The parametric model becomes a manifold.

2. Each parameterization we normally use is just a possible local chart of that manifold.

3. Starting from the Kullback-Leibler and Hellinger distances, Jeffreys induced a Riemannian metric $g_{ij}$ on the manifold.

4. From this metric we get a natural notion of “volume”, given by $\sqrt{\det(g_{ij}(\theta))}\,d\theta$. The full Jeffreys prior is then a measure which is intrinsically uniform on the manifold. Invariance (co-variance) of the expression of the prior in particular parameterizations is automatic.

5. A loss function given, for example, by the square of the geodesic distance between two points of the manifold (which, in many cases, will have a nice expression in a particular chart) is one simple way to construct an invariant (co-variant) Bayes estimate.

Browsing the literature, I’ve found out that these things are known one way or another. So, what is the problem with this program of intrinsic Bayesian inference? What I don’t like about it is that the metric depends on the whole structure of the sample space, and so all of our inference statements depend on sample space points which have not been observed.

About “how good is this $g_{ij}$?”, there is a (very difficult to understand) theorem of Chentsov which says that this “information metric” is the only one satisfying a reasonable set of desiderata.

So, despite the mathematical beauty of these ideas, I’m not sure that they must be at the foundation of our theory of inference.
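As a small worked instance of points 3 to 5, consider the Bernoulli model, which is my choice of illustration and not an example taken from the comment. The Fisher information gives the metric, its square root gives the volume element (the Jeffreys prior), and the geodesic distance has a closed form in the chart $\psi_t = \arcsin\sqrt{t}$:

```latex
\begin{align*}
x \sim \mathrm{Bernoulli}(\theta): \quad
  g(\theta) &= \frac{1}{\theta(1-\theta)}, \\
\sqrt{g(\theta)}\,d\theta &= \frac{d\theta}{\sqrt{\theta(1-\theta)}}
  \quad\text{(Jeffreys prior, a } \mathrm{Beta}(1/2,1/2) \text{ once normalised)}, \\
d(\theta,\theta') &= \left|\int_{\theta}^{\theta'} \frac{dt}{\sqrt{t(1-t)}}\right|
  = 2\,\bigl|\arcsin\sqrt{\theta'} - \arcsin\sqrt{\theta}\bigr|, \\
L(\theta,a) = d(\theta,a)^2 &= 4\,(\psi_\theta - \psi_a)^2,
  \qquad \psi_t = \arcsin\sqrt{t}.
\end{align*}
```

In the chart $\psi$ this loss is plain quadratic, so the Bayes estimate is $\hat\psi = \mathbb{E}[\psi_\theta \mid x]$ and $\hat\theta = \sin^2\hat\psi$. Because the loss is defined on the manifold rather than in a chart, the estimate of any one-to-one transform of $\theta$ is the transform of $\hat\theta$.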

September 15, 2009 at 6:02 am

I do not think there can be an agreement on the choice of the reference prior, ever. Not only for the reason you mention, namely the incompatibility of the Jeffreys prior with the likelihood principle. I however disagree with the statement that “our inference statements depend on sample space points which have not been observed”, because Jeffreys’ inference only depends on the shape of the sampling density, which is not the same (and not a drawback to me).

Note that Lehmann and Casella (as well as The Bayesian Choice) use the term *equivariance* instead of *invariance*.

September 14, 2009 at 7:20 am

On the lack of invariance of the MAP, the same happens with the so-called Bayes estimator, the posterior mean, which, as we know, is the solution of a decision problem with a quadratic loss function.

To me, the point is that my loss function is an expression of my ordering of preferences, and so, to be coherent, it doesn’t make sense to always use the same quadratic loss function in two different parameterizations of the problem.

A simple way to see this is that if I want to write the posterior expected loss

$\int L(\theta, a)\,\pi(\theta \mid x)\,d\theta$

in a different parameterization, I need, to keep coherence, to transform the density, using the Jacobian of the transformation considered, but I also have to transform the loss function $L$: if I consider it to be quadratic in one parameterization, it will generally not be so in another parameterization of the problem. If I do transform the “whole integrand,” the Bayes decision in one parameterization will be the transformed Bayes decision of the other parameterization.

My impression, at this stage of my studies, is that a quadratic loss function, as well as some sort of default prior, is a kind of approximation to a fully elicited pair of loss function + prior.
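The “whole integrand” argument can be written out in two lines. The notation below is an assumption on my part: $h$ is a one-to-one differentiable reparameterisation $\phi = h(\theta)$, and $\tilde\pi$ and $\tilde L$ denote the transformed posterior density and loss.

```latex
\begin{align*}
\tilde\pi(\phi \mid x) &= \pi\bigl(h^{-1}(\phi) \mid x\bigr)\,
   \left|\frac{d h^{-1}(\phi)}{d\phi}\right|,
\qquad
\tilde L(\phi, b) = L\bigl(h^{-1}(\phi), h^{-1}(b)\bigr), \\
\int \tilde L(\phi, b)\,\tilde\pi(\phi \mid x)\,d\phi
  &= \int L\bigl(\theta, h^{-1}(b)\bigr)\,\pi(\theta \mid x)\,d\theta .
\end{align*}
```

The right-hand side is minimised when $h^{-1}(b)$ equals the Bayes decision $a^\star$ in the $\theta$ parameterisation, hence $b^\star = h(a^\star)$. Keeping the quadratic form $(\phi - b)^2$ instead of $\tilde L$ amounts to changing the loss, which is why the posterior mean of $\phi$ is in general not $h$ applied to the posterior mean of $\theta$.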

September 14, 2009 at 9:19 am

Certainly, with a loss function, there is no need for invariance under reparameterisation. But MAP estimates, being presented as default Bayes estimates, should not depend on the parameterisation. Note that there exist families of functional loss functions called intrinsic losses that enjoy invariance.

September 14, 2009 at 6:22 am

I’ve missed a $d\theta$ on the integral!

I see your point: since the posterior density can be defined arbitrarily on a set of zero measure, we would have problems even using a Dirac loss. This “rebellion of the sets of zero measure” is always a nasty thing. And using a sequence of loss functions will not solve this problem either, because you can, formally, always take any point $\theta_0$ in the parameter space and redefine the density at that point to be larger than its value everywhere else.

So, whatever version of the density we start from, $\theta_0$ would always be the MAP, which is weird from an inferential viewpoint. Of course, I’m thinking about problems with a “continuous” parameter.

You are right. It’s not so easy to formalize the MAP as a Bayes estimate.

September 14, 2009 at 5:33 am

One way to formalize the MAP as a Bayes decision is to use the loss function

$L(\theta, a) = -\delta(\theta - a)$

where $\delta$ is Dirac’s delta. Then, the posterior expected loss is just

$\int L(\theta, a)\,\pi(\theta \mid x) = -\pi(a \mid x)$

So, the choice that minimizes the posterior expected loss is exactly the posterior mode.

But is this just physicist/engineer’s sloppy math? Not quite, since there are rigorous formalizations of the concept of a generalized function, such as Dirac’s delta, out there.

Also, I think it is possible to formalize the MAP without a delta loss, using a sequence of decision problems.

On the lack of invariance of the MAP, I will make a specific comment later.

Thanks for the blog, Xian! Excellent stuff!
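The “sequence of decision problems” route can be illustrated numerically. The sketch below assumes, purely for illustration, a Gamma(3, 1) posterior (mode 2, mean 3) and the loss $L_\epsilon(\theta, a) = \mathbb{I}(|\theta - a| > \epsilon)$. For each fixed $\epsilon$ the Bayes estimate maximises the posterior probability of $[a - \epsilon, a + \epsilon]$, and these estimates drift towards the mode as $\epsilon$ shrinks. All function names are hypothetical.

```python
import math

def gamma3_cdf(t):
    # Closed-form CDF of a Gamma(shape=3, rate=1) distribution.
    if t <= 0.0:
        return 0.0
    return 1.0 - math.exp(-t) * (1.0 + t + 0.5 * t * t)

def coverage(a, eps):
    # Posterior probability of [a - eps, a + eps]; the posterior expected
    # 0-1 loss with tolerance eps equals 1 - coverage(a, eps).
    return gamma3_cdf(a + eps) - gamma3_cdf(a - eps)

def golden_max(f, lo, hi, tol=1e-10):
    # Golden-section search for the maximiser of a unimodal function.
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) > f(d):
            b = d  # the maximiser lies in [a, d]
        else:
            a = c  # the maximiser lies in [c, b]
    return 0.5 * (a + b)

def bayes_estimate(eps):
    # Bayes estimate under the 0-1 loss with tolerance eps.
    return golden_max(lambda a: coverage(a, eps), 0.0, 20.0)

for eps in (1.0, 0.5, 0.1, 0.01):
    print(eps, bayes_estimate(eps))
```

For this particular posterior the fixed-$\epsilon$ estimate has the closed form $\epsilon \coth(\epsilon/2)$, which tends to the mode 2 as $\epsilon \to 0$, while no single $\epsilon$ in the sequence returns the mode itself.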

September 14, 2009 at 5:43 am

This is indeed the way MAP is often justified but this is not rigorous: when using a delta function, the quantity to maximise in $a$ is $P(\theta = a \mid x)$, which is zero in continuous settings. As you mention, there is a formal approach to Dirac’s delta functions, which is Laurent Schwartz’s *Theory of Distributions*, but I do not think this helps with solving the problem because of measurability issues like the posterior density $\pi(a \mid x)$ not being uniquely defined. As for the MAP being the limit of a sequence of Bayes estimates, this was the starting point of the discussion with Ronald, which means it has some optimality even if it is not a true Bayes estimate.

September 13, 2009 at 6:37 am

[…] estimators (cont’d) In connection with Anthony’s comments, here are the details for the normal example. I am using a flat prior on $\mu$ when $x \sim \mathcal{N}(\mu, 1)$. The MAP estimator […]

September 13, 2009 at 12:00 am

So yes, everything I wrote is basically the opposite of being right! Indeed, this is very unimpressive behaviour for an estimator. I hadn’t realized how weird the effect of reparametrization was until just now…

September 12, 2009 at 2:31 pm

I have struggled with the idea of MAP estimation myself, although mainly because of your second point: that the estimator is associated with a local as opposed to a global property of the posterior. In fact, I don’t really understand the third point since this seems not to be an issue with reparametrization but instead an issue with the choice of prior. Putting a flat prior on $\eta$ is different to putting a flat prior on $\mu$ so naturally we obtain a different posterior, and hence a different MAP estimate. I think that if one was to use the logit parametrization but with the prior $f(\eta) = 1/(\eta - \eta^2)$, which amounts to the same improper prior (the Jeffreys prior for $\eta$ in this context, since we used the Jeffreys prior for $\mu$), then we would get the same posterior.

September 12, 2009 at 10:49 pm

This was presumably poorly worded, but I meant using one given prior, like Jeffreys prior and then just changing the parameterisation: because of the Jacobian the maximum of the transform may be quite different from the transform of the MAP, which is a general property of lack of invariance under reparameterisation but is quite an issue for defining an estimator…
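The effect of the Jacobian on the mode can be made explicit. In the sketch below the notation is mine: $\pi$ is a differentiable posterior density for $\theta$ with mode $\hat\theta$, $\phi = h(\theta)$ is a smooth one-to-one reparameterisation, and $\tilde\pi$ is the induced density of $\phi$.

```latex
\begin{align*}
\log\tilde\pi(\phi \mid x)
  &= \log\pi\bigl(h^{-1}(\phi) \mid x\bigr)
   + \log\left|\frac{d h^{-1}(\phi)}{d\phi}\right|, \\
\frac{d}{d\phi}\log\tilde\pi(\phi \mid x)\Big|_{\phi = h(\hat\theta)}
  &= \underbrace{\frac{\partial_\theta \pi(\hat\theta \mid x)}{\pi(\hat\theta \mid x)}}_{=\,0}
     \,\frac{d h^{-1}(\phi)}{d\phi}\Big|_{\phi = h(\hat\theta)}
   + \frac{d}{d\phi}\log\left|\frac{d h^{-1}(\phi)}{d\phi}\right|
     \Big|_{\phi = h(\hat\theta)} .
\end{align*}
```

The first term vanishes because $\hat\theta$ is the MAP in $\theta$, but the second does not unless the derivative of the log-Jacobian happens to vanish there, as it does for affine $h$. So $h(\hat\theta)$ is in general not a stationary point of $\tilde\pi$, even though the prior and the posterior distribution are unchanged.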