Archive for MAP estimators

Bayesian filtering and smoothing [book review]

Posted in Books, Statistics, Travel, University life on February 25, 2015 by xi'an

When in Warwick last October, I met Simo Särkkä, who told me he had published an IMS monograph on Bayesian filtering and smoothing the year before. I thought it would be an appropriate book to review for CHANCE and tried to get a copy from Oxford University Press, unsuccessfully. I thus bought my own copy, which I received two weeks ago, and took the opportunity of my Czech vacation to read it… [A warning pre-empting accusations of self-plagiarism: this is a preliminary draft for a review to appear in CHANCE under my true name!]

“From the Bayesian estimation point of view both the states and the static parameters are unknown (random) parameters of the system.” (p.20)

Bayesian filtering and smoothing is an introduction to the topic that essentially starts from ground zero. Chapter 1 motivates the use of filtering and smoothing through examples and highlights the naturally Bayesian approach to the problem(s). Two graphs illustrate the difference between filtering and smoothing by plotting, for the same series of observations, the successive confidence bands. The performances are obviously poorer with filtering, but it should be noted that those intervals are point-wise rather than joint, i.e., that the graphs do not provide a confidence band. (The exercise section of that chapter is superfluous in that it suggests re-reading Kalman’s original paper and rephrases the Monty Hall paradox in a story unconnected with filtering!)

Chapter 2 gives an introduction to Bayesian statistics in general, with a few pages on Bayesian computational methods. A first remark is that the above quote is both correct and mildly confusing in that the parameters can be consistently estimated, while the latent states cannot. A second remark is that justifying the MAP as associated with the 0-1 loss is incorrect in continuous settings.

The third chapter deals with the batch updating of the posterior distribution, i.e., the fact that the posterior at time t is the prior at time t+1, with applications to state-space systems including the Kalman filter. The fourth to sixth chapters concentrate on this Kalman filter and its extensions, and I find them somewhat unsatisfactory in that the collection of such filters is overwhelming for a neophyte. And no assessment of the estimation error when the model is misspecified appears at this stage. And, as usual, I find the unscented Kalman filter hard to fathom! The same feeling applies to the smoothing chapters, from Chapter 8 to Chapter 10, which mimic the earlier ones.

Continue reading

MAP or mean?!

Posted in Statistics, Travel, University life on March 5, 2014 by xi'an

“A frequent matter of debate in Bayesian inversion is the question, which of the two principle point-estimators, the maximum-a-posteriori (MAP) or the conditional mean (CM) estimate is to be preferred.”

An interesting topic for this arXived paper by Burger and Lucka that I (also) read on the plane to Montréal, even though I do not share the concern that we should pick between those two estimators (only or at all), since what matters is the posterior distribution and the use one makes of it. I thus disagree that there is any kind of a “debate concerning the choice of point estimates”. If Bayesian inference reduces to producing a point estimate, this is a regularisation technique and the Bayesian interpretation is both incidental and superfluous.

Maybe the most interesting result in the paper is that the MAP is expressed as a proper Bayes estimator! I was under the opposite impression, mostly because the folklore (found even in The Bayesian Core) that it corresponds to a 0-1 loss function does not hold for continuous parameter spaces, and also because it seems to conflict with the results of Druilhet and Marin (BA, 2007), who point out that the MAP ultimately depends on the choice of the dominating measure. (Even though the Lebesgue measure is implicitly chosen as the default.) The authors of this arXived paper start with a distance based on the prior, called the Bregman distance, which may be the quadratic or the entropy distance depending on the prior. Defining a loss function that is a mix of this Bregman distance and of the quadratic distance

||K(\hat u-u)||^2+2D_\pi(\hat u,u)

produces the MAP as the Bayes estimator. So where did the dominating measure go? In fact, nowhere: both the loss function and the resulting estimator are clearly dependent on the choice of the dominating measure… (The loss depends on the prior but this is not a drawback per se!)
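As a reminder of what a Bregman distance looks like, here is the standard definition, sketched under the assumption (notation introduced here for illustration, not necessarily the paper's) that the prior density is \pi(u)\propto\exp\{-J(u)\} with J convex and differentiable:

D_\pi(\hat u,u) = J(\hat u) - J(u) - \langle \nabla J(u), \hat u-u \rangle

With J(u)=||u||^2/2, i.e., a Gaussian prior, this reduces to the quadratic distance

D_\pi(\hat u,u) = ||\hat u-u||^2/2

while the entropy J(u)=\sum_i (u_i\log u_i-u_i) leads to a Kullback-Leibler type distance

D_\pi(\hat u,u) = \sum_i \{ \hat u_i\log(\hat u_i/u_i) - \hat u_i + u_i \}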

machine learning [book review]

Posted in Books, R, Statistics, University life on October 21, 2013 by xi'an

I have to admit the rather embarrassing fact that Machine Learning, A probabilistic perspective by Kevin P. Murphy is the first machine learning book I really read in detail…! It is a massive book with close to 1,100 pages and I thus hesitated to carry it around with me, until I packed it in my bag for Warwick. (And on the train to Argentan.) It is also massive in its contents as it covers most (all?) of what I call statistics (but which visibly corresponds to machine learning as well!). With a Bayesian bent most of the time (which is the secret meaning of probabilistic in the title).

“…we define machine learning as a set of methods that can automatically detect patterns in data, and then use the uncovered patterns to predict future data, or to perform other kinds of decision making under uncertainty (such as planning how to collect more data!).” (p.1)

Apart from the Introduction, which I find rather confusing for not dwelling on the nature of errors and randomness and on the reason for using probabilistic models (since they are all wrong), and charming for including a picture of the author’s family as an illustration of face recognition algorithms, I cannot say I found the book more lacking in foundations or in the breadth of methods and concepts it covers than a “standard” statistics book. In short, this is a perfectly acceptable statistics book! Furthermore, it has a very relevant and comprehensive selection of references (sometimes favouring “machine learning” references over “statistics” references!). Even the vocabulary seems pretty standard to me. All this makes me wonder why we distinguish at all between the two domains, following Larry Wasserman’s views (for once!) that the difference is mostly in the eye of the beholder, i.e. in which department one teaches… This was already my perspective before I read the book, but reading it reinforced that view even further. And the author agrees as well (“The probabilistic approach to machine learning is closely related to the field of statistics, but differs slightly in terms of its emphasis and terminology”, p.1). Let us all unite!

[..part 2 of the book review to appear tomorrow…]

[weak] information paradox

Posted in pictures, Running, Statistics, University life on December 2, 2011 by xi'an

While (still!) looking at questions on Cross Validated on Saturday morning, just before going out for a chilly run in the park, I noticed an interesting question about a light bulb problem. Once you get the story out of the way, it boils down to the fact that, when comparing two binomial probabilities, p1 and p2, based on a Bernoulli sample of size n, and when selecting the MAP probability, having either n=2k-1 or n=2k observations leads to the same (frequentist) probability of making the right choice. The details are provided in my answers here and there. It is a rather simple combinatorial proof, once you have the starting identity [W. Feller, An Introduction to Probability Theory and Its Applications, vol. 1, 1968, [II.8], eqn (8.6)]

{2k-1 \choose i-1} + {2k-1 \choose i} = {2k \choose i}

but I wonder if there exists a more statistical explanation to this weak information paradox…
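A quick numerical check of this equality can be sketched in R, under assumptions the post does not spell out: the symmetric case p2 = 1 - p1 with p1 > 1/2, equal prior weights on the two hypotheses (so that the MAP choice is a majority vote), and ties for even n broken by a fair coin. The function name prob_correct is hypothetical.

```r
# Probability that the MAP (majority) rule picks the true hypothesis after n draws.
# Assumptions: symmetric case p2 = 1 - p1 = 1 - p with p > 1/2, equal prior weights,
# ties (only possible for even n) broken by a fair coin.
prob_correct <- function(n, p) {
  s <- 0:n                                  # number of draws favouring the truth
  w <- ifelse(2 * s > n, 1, ifelse(2 * s == n, 0.5, 0))
  sum(w * dbinom(s, size = n, prob = p))
}

p <- 0.8
for (k in 1:5) {
  n_odd  <- 2L * k - 1L
  n_even <- 2L * k
  cat(sprintf("k=%d  n=%d: %.6f  n=%d: %.6f\n",
              k, n_odd, prob_correct(n_odd, p), n_even, prob_correct(n_even, p)))
}
```

For instance, with p=0.8 a single observation is right with probability 0.8, and two observations are right with probability 0.64+0.32/2=0.8 as well: the extra observation only matters through the ties it creates.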

MAP, MLE and loss

Posted in Statistics on April 25, 2011 by xi'an

Michael Evans and Gun Ho Jang posted an arXiv paper where they discuss the connection between MAP, least relative surprise (or maximum profile likelihood) estimators, and loss functions. I posted a while ago my perspective on MAP estimators, followed by several comments on the Bayesian nature of those estimators, hence will not reproduce them here, but the core of the matter is that neither MAP estimators nor MLEs are really justified by a decision-theoretic approach, at least in a continuous parameter space. And that the dominating measure [arbitrarily] chosen on the parameter space impacts the value of the MAP, as demonstrated by Druilhet and Marin in 2007.

Continue reading

MAP estimators (cont’d)

Posted in Statistics on September 13, 2009 by xi'an

In connection with Anthony’s comments, here are the details for the normal example. I am using a flat prior on \mu when x\sim\mathcal{N}(\mu,1). The MAP estimator of \mu is then \hat\mu=x. If I consider the change of variable \mu=\text{logit}(\eta), the posterior distribution on \eta is

\pi(\eta|x) = \frac{\exp[ -(\text{logit}(\eta)-x)^2/2 ]}{\sqrt{2\pi}\,\eta(1-\eta)}

and the MAP in \eta is then obtained numerically. For instance, the R code

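# posterior density of eta when mu=logit(eta) and mu|x ~ N(x,1)
f=function(eta,mea) dnorm(log(eta/(1-eta)),mean=mea)/(eta*(1-eta))
# MAP of eta found numerically, then mapped back to the mu scale
g=function(x){ a=optimise(f,interval=c(0,1),maximum=TRUE,mea=x)$maximum;log(a/(1-a))}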

shows the divergence between the MAP estimator \hat\mu and the reverse transform of the MAP estimator \hat\eta of the transform… The second estimator is asymptotically (in x) equivalent to x+1.
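The x+1 asymptotics can be checked by hand; what follows is a sketch of the computation rather than part of the original exchange. Writing \mu=\text{logit}(\eta), so that 1/\eta(1-\eta)=e^{\mu}+2+e^{-\mu}, the log-posterior of \eta expressed in \mu is, up to a constant,

-(\mu-x)^2/2 + \log(e^{\mu}+2+e^{-\mu})

and setting its derivative in \mu to zero gives the fixed-point equation

\mu = x + (e^{\mu}-e^{-\mu})/(e^{\mu}+2+e^{-\mu}) = x + \tanh(\mu/2)

Since \tanh(\mu/2) goes to 1 as \mu grows, the back-transformed MAP behaves like x+1 for large positive x (and like x-1 for large negative x).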

An example I like very much in The Bayesian Choice is Example 4.1.2, when observing x\sim\text{Cauchy}(\theta,1) with a double exponential prior on \theta, \pi(\theta)=\exp\{-|\theta|\}/2. The MAP is then always \hat\theta=0!
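A short sketch of why this holds (a reconstruction for illustration, not a quote from the book): up to a constant, the log-posterior is

\log\pi(\theta|x) = -|\theta| - \log\{1+(x-\theta)^2\}

For \theta>0 its derivative is

-1 + 2(x-\theta)/\{1+(x-\theta)^2\} \le 0

because 2u\le 1+u^2 for every u. The log-posterior is thus non-increasing on (0,\infty) and, by the same argument with the signs flipped, non-decreasing on (-\infty,0), so the mode sits at \theta=0 whatever the value of x.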

The dependence of the MAP estimator on the dominating measure is also studied in a BA paper by Pierre Druilhet and Jean-Michel Marin, who propose a solution that relies on Jeffreys’ prior as the reference measure.

MAP estimators are not truly Bayesian estimators

Posted in Statistics on September 12, 2009 by xi'an

This morning, I found this question from Ronald in my mailbox

“I have been reading the Bayesian Choice (2nd Edition).  On page 166 you note that, for continuous parameter spaces, the MAP estimator must be defined as the limit of a sequence of estimators  corresponding to a sequence of 0-1 loss functions with increasingly smaller nonzero epsilons.

This would appear to mean that the MAP estimator for continuous spaces is not Bayes-optimal, as is  frequently claimed.  To be Bayes-optimal, it would have to minimize the Bayes risk for a single, fixed loss function.  Instead, it must be defined using an infinite sequence of loss functions.

Does there exist a formal proof of the Bayes-optimality of the continuous-space MAP estimator-meaning one that is consistent with the usual definition assuming a fixed loss function?  I don’t see how there could be.  If a fixed loss function actually existed, then a definition requiring a limit would be unnecessary.”

which is really making a good point against MAP estimators. In fact, I have never found MAP estimators very appealing for many reasons, one being indeed that the MAP estimator cannot correctly be expressed as the solution to a minimisation problem. I also find the pointwise nature of the estimator quite a drawback: the estimator is only associated with a local property of the posterior density, not with a global property of the posterior distribution. This is in particular striking when considering the MAP estimates for two different parameterisations. The estimates often are quite different, just due to the Jacobian in the change of parameterisation. For instance, the MAP of the usual normal mean \mu under a flat prior is x, for instance x=2, but if one uses a logit parameterisation instead

\mu = \log\{\eta/(1-\eta)\}

the MAP in \eta can be quite distinct from 1/(1+\exp\{-x\}), for instance leading to \mu=3 when x=2… Another bad feature is the difference between the marginal MAP and the joint MAP estimates. This is not to say that the MAP cannot be optimal in any sense, as I suspect it could be admissible as a limit of Bayes estimates (under a sequence of loss functions). But a Bayes estimate itself?!
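The limiting construction mentioned in the question can be written out as follows; this is a sketch for a scalar \theta, in notation chosen here, with \delta denoting a candidate estimate. For a fixed \epsilon>0, the 0-1 loss

L_\epsilon(\theta,\delta) = \mathbb{I}\{|\theta-\delta|>\epsilon\}

has posterior expected loss 1-P(|\theta-\delta|\le\epsilon\,|\,x), so the associated Bayes estimate is the centre of the interval of half-width \epsilon with the highest posterior probability:

\delta_\epsilon(x) = \arg\max_\delta P(|\theta-\delta|\le\epsilon\,|\,x)

For a continuous posterior density, P(|\theta-\delta|\le\epsilon\,|\,x)\approx 2\epsilon\,\pi(\delta|x) when \epsilon is small, which is why, under suitable regularity conditions, \delta_\epsilon(x) approaches the posterior mode as \epsilon goes to zero, without the MAP being, in general, the Bayes estimate for any fixed \epsilon.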

