## MAP as Bayes estimators

Posted in Books, Kids, Statistics with tags , , , , on November 30, 2016 by xi'an

Robert Bassett and Julio Deride just arXived a paper discussing the position of MAPs within Bayesian decision theory. A point I have discussed extensively on the ‘Og!

“…we provide a counterexample to the commonly accepted notion of MAP estimators as a limit of Bayes estimators having 0-1 loss.”

The authors mention The Bayesian Choice stating this property without further precautions and I completely agree to being careless in this regard! The difficulty stands with the limit of the maximisers being not necessarily the maximiser of the limit. The paper includes an example to this effect, with a prior as above,  associated with a sampling distribution that does not depend on the parameter. The sufficient conditions proposed therein are that the posterior density is almost surely proper or quasiconcave.

This is a neat mathematical characterisation that cleans this “folk theorem” about MAP estimators. And for which the authors are to be congratulated! However, I am not very excited by the limiting property, whether it holds or not, as I have difficulties conceiving the use of a sequence of losses in a mildly realistic case. I rather prefer the alternate characterisation of MAP estimators by Burger and Lucka as proper Bayes estimators under another type of loss function, albeit a rather artificial one.

## non-local priors for mixtures

Posted in Statistics, University life with tags , , , , , , , , , , , , , , , on September 15, 2016 by xi'an

[For some unknown reason, this commentary on the paper by Jairo Fúquene, Mark Steel, David Rossell —all colleagues at Warwick— on choosing mixture components by non-local priors remained untouched in my draft box…]

Choosing the number of components in a mixture of (e.g., Gaussian) distributions is a hard problem. It may actually be an altogether impossible problem, even when abstaining from moral judgements on mixtures. I do realise that the components can eventually be identified as the number of observations grows to infinity, as demonstrated for instance by Judith Rousseau and Kerrie Mengersen (2011). But for a finite and given number of observations, how much can we trust any conclusion about the number of components?! It seems to me that the criticism about the vacuity of point null hypotheses, namely the logical absurdity of trying to differentiate θ=0 from any other value of θ, applies to the estimation or test on the number of components of a mixture. Doubly so, one might argue, since a very small or a very close component is undistinguishable from a non-existing one. For instance, Definition 2 is correct from a mathematical viewpoint, but it does not spell out the multiple contiguities between k and k’ component mixtures.

The paper starts with a comprehensive coverage of l’état de l’art… When using a Bayes factor to compare a k-component and an h-component mixture, the behaviour of the factor is quite different depending on which model is correct. Essentially overfitted mixtures take much longer to detect than underfitted ones, which makes intuitive sense. And BIC should be corrected for overfitted mixtures by a canonical dimension λ between the true and the (larger) assumed number of parameters  into

2 log m(y) = 2 log p(y|θ) – λ log O(n) + O(log log n)

I would argue that this purely invalidates BIG in mixture settings since the canonical dimension λ is unavailable (and DIC does not provide a useful substitute as we illustrated a decade ago…) The criticism about Rousseau and Mengersen (2011) over-fitted mixture that their approach shrinks less than a model averaging over several numbers of components relates to minimaxity and hence sounds both overly technical and reverting to some frequentist approach to testing. Replacing testing with estimating sounds like the right idea.  And I am also unconvinced that a faster rate of convergence of the posterior probability or of the Bayes factor is a relevant factor when conducting

As for non local priors, the notion seems to rely on a specific topology for the parameter space since a k-component mixture can approach a k’-component mixture (when k'<k) in a continuum of ways (even for a given parameterisation). This topology seems to be summarised by the penalty (distance?) d(θ) in the paper. Is there an intrinsic version of d(θ), given the weird parameter space? Like one derived from the Kullback-Leibler distance between the models? The choice of how zero is approached clearly has an impact on how easily the “null” is detected, the more because of the somewhat discontinuous nature of the parameter space. Incidentally, I find it curious that only the distance between means is penalised… The prior also assumes independence between component parameters and component weights, which I think is suboptimal in dealing with mixtures, maybe suboptimal in a poetic sense!, as we discussed in our reparameterisation paper. I am not sure either than the speed the distance converges to zero (in Theorem 1) helps me to understand whether the mixture has too many components for the data’s own good when I can run a calibration experiment under both assumptions.

While I appreciate the derivation of a closed form non-local prior, I wonder at the importance of the result. Is it because this leads to an easier derivation of the posterior probability? I do not see the connection in Section 3, except maybe that the importance weight indeed involves this normalising constant when considering several k’s in parallel. Is there any convergence issue in the importance sampling solution of (3.1) and (3.3) since the simulations are run under the local posterior? While I appreciate the availability of an EM version for deriving the MAP, a fact I became aware of only recently, is it truly bringing an improvement when compared with picking the MCMC simulation with the highest completed posterior?

The section on prior elicitation is obviously of central interest to me! It however seems to be restricted to the derivation of the scale factor g, in the distance, and of the parameter q in the Dirichlet prior on the weights. While the other parameters suffer from being allocated the conjugate-like priors. I would obviously enjoy seeing how this approach proceeds with our non-informative prior(s). In this regard, the illustration section is nice, but one always wonders at the representative nature of the examples and the possible interpretations of real datasets. For instance, when considering that the Old Faithful is more of an HMM than a mixture.

## variational Bayes for variable selection

Posted in Books, Statistics, University life with tags , , , , , , , on March 30, 2016 by xi'an

Xichen Huang, Jin Wang and Feng Liang have recently arXived a paper where they rely on variational Bayes in conjunction with a spike-and-slab prior modelling. This actually stems from an earlier paper by Carbonetto and Stephens (2012), the difference being in the implementation of the method, which is less Gibbs-like for the current paper. The approach is not fully Bayesian in that, not only an approximate (variational) representation is used for the parameters of interest (regression coefficient and presence-absence indicators) but also the nuisance parameters are replaced with MAPs. The variational approximation on the regression parameters is an independent product of spike-and-slab distributions. The authors show the approximate approach is consistent in both frequentist and Bayesian terms (under identifiability assumptions). The method is undoubtedly faster than MCMC since it shares many features with EM but I still wonder at the Bayesian interpretability of the outcome, which writes out as a product of estimated spike-and-slab mixtures. First, the weights in the mixtures are estimated by EM, hence fixed. Second, the fact that the variational approximation is a product is confusing in that the posterior distribution on the regression coefficients is unlikely to produce posterior independence.

## more of the same!

Posted in Statistics, University life, Books, pictures with tags , , , , , , , , on December 10, 2015 by xi'an

Daniel Seita, Haoyu Chen, and John Canny arXived last week a paper entitled “Fast parallel SAME Gibbs sampling on general discrete Bayesian networks“.  The distributions of the observables are defined by full conditional probability tables on the nodes of a graphical model. The distributions on the latent or missing nodes of the network are multinomial, with Dirichlet priors. To derive the MAP in such models, although this goal is not explicitly stated in the paper till the second page, the authors refer to the recent paper by Zhao et al. (2015), discussed on the ‘Og just as recently, which applies our SAME methodology. Since the paper is mostly computational (and submitted to ICLR 2016, which takes place juuust before AISTATS 2016), I do not have much to comment about it. Except to notice that the authors mention our paper as “Technical report, Statistics and Computing, 2002”. I am not sure the editor of Statistics and Computing will appreciate! The proper reference is in Statistics and Computing, 12:77-84, 2002.

“We argue that SAME is beneficial for Gibbs sampling because it helps to reduce excess variance.”

Still, I am a wee bit surprised at both the above statement and at the comparison with a JAGS implementation. Because SAME augments the number of latent vectors as the number of iterations increases, so should be slower by a mere curse of dimension,, slower than a regular Gibbs with a single latent vector. And because I do not get either the connection with JAGS: SAME could be programmed in JAGS, couldn’t it? If the authors means a regular Gibbs sampler with no latent vector augmentation, the comparison makes little sense as one algorithm aims at the MAP (with a modest five replicas), while the other encompasses the complete posterior distribution. But this sounds unlikely when considering that the larger the number m of replicas the better their alternative to JAGS. It would thus be interesting to understand what the authors mean by JAGS in this setup!

## comparison of Bayesian predictive methods for model selection

Posted in Books, Statistics, University life with tags , , , , , , , , , on April 9, 2015 by xi'an

“Dupuis and Robert (2003) proposed choosing the simplest model with enough explanatory power, for example 90%, but did not discuss the effect of this threshold for the predictive performance of the selected models. We note that, in general, the relative explanatory power is an unreliable indicator of the predictive performance of the submodel,”

Juho Piironen and Aki Vehtari arXived a survey on Bayesian model selection methods that is a sequel to the extensive survey of Vehtari and Ojanen (2012). Because most of the methods described in this survey stem from Kullback-Leibler proximity calculations, it includes some description of our posterior projection method with Costas Goutis and Jérôme Dupuis. We indeed did not consider prediction in our papers and even failed to include consistency result, as I was pointed out by my discussant in a model choice meeting in Cagliari, in … 1999! Still, I remain fond of the notion of defining a prior on the embedding model and of deducing priors on the parameters of the submodels by Kullback-Leibler projections. It obviously relies on the notion that the embedding model is “true” and that the submodels are only approximations. In the simulation experiments included in this survey, the projection method “performs best in terms of the predictive ability” (p.15) and “is much less vulnerable to the selection induced bias” (p.16).

Reading the other parts of the survey, I also came to the perspective that model averaging makes much more sense than model choice in predictive terms. Sounds obvious stated that way but it took me a while to come to this conclusion. Now, with our mixture representation, model averaging also comes as a natural consequence of the modelling, a point presumably not stressed enough in the current version of the paper. On the other hand, the MAP model now strikes me as artificial and linked to a very rudimentary loss function. A loss that does not account for the final purpose(s) of the model. And does not connect to the “all models are wrong” theorem.

## Bayesian filtering and smoothing [book review]

Posted in Books, Statistics, Travel, University life with tags , , , , , , , , , , , , on February 25, 2015 by xi'an

When in Warwick last October, I met Simo Särkkä, who told me he had published an IMS monograph on Bayesian filtering and smoothing the year before. I thought it would be an appropriate book to review for CHANCE and tried to get a copy from Oxford University Press, unsuccessfully. I thus bought my own book that I received two weeks ago and took the opportunity of my Czech vacations to read it… [A warning pre-empting accusations of self-plagiarism: this is a preliminary draft for a review to appear in CHANCE under my true name!]

“From the Bayesian estimation point of view both the states and the static parameters are unknown (random) parameters of the system.” (p.20)

Bayesian filtering and smoothing is an introduction to the topic that essentially starts from ground zero. Chapter 1 motivates the use of filtering and smoothing through examples and highlights the naturally Bayesian approach to the problem(s). Two graphs illustrate the difference between filtering and smoothing by plotting for the same series of observations the successive confidence bands. The performances are obviously poorer with filtering but the fact that those intervals are point-wise rather than joint, i.e., that the graphs do not provide a confidence band. (The exercise section of that chapter is superfluous in that it suggests re-reading Kalman’s original paper and rephrases the Monty Hall paradox in a story unconnected with filtering!) Chapter 2 gives an introduction to Bayesian statistics in general, with a few pages on Bayesian computational methods. A first remark is that the above quote is both correct and mildly confusing in that the parameters can be consistently estimated, while the latent states cannot. A second remark is that justifying the MAP as associated with the 0-1 loss is incorrect in continuous settings.  The third chapter deals with the batch updating of the posterior distribution, i.e., that the posterior at time t is the prior at time t+1. With applications to state-space systems including the Kalman filter. The fourth to sixth chapters concentrate on this Kalman filter and its extension, and I find it somewhat unsatisfactory in that the collection of such filters is overwhelming for a neophyte. And no assessment of the estimation error when the model is misspecified appears at this stage. And, as usual, I find the unscented Kalman filter hard to fathom! The same feeling applies to the smoothing chapters, from Chapter 8 to Chapter 10. Which mimic the earlier ones. Continue reading

## MAP or mean?!

Posted in Statistics, Travel, University life with tags , , , on March 5, 2014 by xi'an

“A frequent matter of debate in Bayesian inversion is the question, which of the two principle point-estimators, the maximum-a-posteriori (MAP) or the conditional mean (CM) estimate is to be preferred.”

An interesting topic for this arXived paper by Burger and Lucka that I (also) read in the plane to Montréal, even though I do not share the concern that we should pick between those two estimators (only or at all), since what matters is the posterior distribution and the use one makes of it. I thus disagree there is any kind of a “debate concerning the choice of point estimates”. If Bayesian inference reduces to producing a point estimate, this is a regularisation technique and the Bayesian interpretation is both incidental and superfluous.

Maybe the most interesting result in the paper is that the MAP is expressed as a proper Bayes estimator! I was under the opposite impression, mostly because the folklore (and even The Bayesian Core)  have it that it corresponds to a 0-1 loss function does not hold for continuous parameter spaces and also because it seems to conflict with the results of Druihlet and Marin (BA, 2007), who point out that the MAP ultimately depends on the choice of the dominating measure. (Even though the Lebesgue measure is implicitly chosen as the default.) The authors of this arXived paper start with a distance based on the prior; called the Bregman distance. Which may be the quadratic or the entropy distance depending on the prior. Defining a loss function that is a mix of this Bregman distance and of the quadratic distance

$||K(\hat u-u)||^2+2D_\pi(\hat u,u)$

produces the MAP as the Bayes estimator. So where did the dominating measure go? In fact, nowhere: both the loss function and the resulting estimator are clearly dependent on the choice of the dominating measure… (The loss depends on the prior but this is not a drawback per se!)