## prior against truth!

Posted in Books, Kids, Statistics on June 4, 2018 by xi'an

A question from X validated had interesting ramifications about what happens when the prior does not cover the true value of the parameter (assuming there is one). In fact, not so much, in that, from a decision theoretic perspective, the fact that π(θ⁰)=0, or even that π(θ)=0 in a neighbourhood of θ⁰, does not matter [too much]. Indeed, the formal derivation of a Bayes estimator as minimising the posterior loss means that the resulting estimator may take values that were “impossible” from a prior perspective! Taking for example the posterior mean, the convex combination of all possible values of θ under π may well escape the support of π when this support is not convex. Of course, one could argue that estimators should further be restricted to possible values of θ under π, but that would reduce their decision theoretic efficiency.

An example is the brilliant minimaxity result by George Casella and Bill Strawderman from 1981: when estimating a Normal mean μ based on a single observation x, with the additional constraint that |μ|<ρ, and when ρ is small enough, ρ≤1.0567 quite specifically, the minimax estimator for this problem under squared error loss corresponds to a (least favourable) uniform prior on the pair {−ρ,ρ}, meaning that π gives equal weight to −ρ and ρ (and none to any other value of the mean μ). When ρ increases above this bound, the least favourable prior sees its support growing one point at a time, but remaining a finite set of possible values. However, the posterior expectation, 𝔼[μ|x], can take any value in (−ρ,ρ).
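As a quick illustration of both points, here is a minimal R sketch (mine, not from the original papers) of the Bayes estimator under the two-point prior: for a single x∼N(μ,1) and equal prior weights on ±ρ, a direct computation gives 𝔼[μ|x]=ρ·tanh(ρx), which sweeps the whole open interval (−ρ,ρ) while the prior only supports its two endpoints.

```r
# posterior mean of mu under the two-point prior {-rho, rho} with equal
# weights, for a single x ~ N(mu, 1): E[mu|x] = rho * tanh(rho * x)
rho <- 1.0567
bayes_est <- function(x) rho * tanh(rho * x)
x <- seq(-5, 5, length.out = 201)
plot(x, bayes_est(x), type = "l", ylab = expression(E(mu ~ "|" ~ x)))
abline(h = c(-rho, rho), lty = 2)  # prior support, approached but never attained
```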

In an even broader suspension of belief (in the prior), it may be that the prior has such a restricted support that the resulting procedure cannot consistently estimate the (true value of the) parameter, and yet the associated estimator may remain admissible or minimax.

## twilight zone [of statistics]

Posted in Books, pictures, R, Statistics, University life on February 26, 2016 by xi'an

“I have decided that mixtures, like tequila, are inherently evil and should be avoided at all costs.” L. Wasserman

Larry Wasserman once remarked that finite mixtures were like the twilight zone of statistics, thanks to the numerous idiosyncrasies associated with such models. And George Casella had similar strong reservations about mixture estimation. Avi Feller and co-authors [including Natesh Pillai] have just arXived a paper on this topic, exhibiting shocking (!) properties of the MLE! Their core example is a mixture of two normal distributions with known common variance and known weight different from 0.5, which ensures identifiability. This is a favourite example of mine, which we used for instance in our book Introducing Monte Carlo Methods with R, if only because we can plot the likelihood and posterior surfaces. (Warning: I wrote those notes on an earlier version of the paper, so mileage may vary in terms of accuracy!)
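For concreteness, here is a minimal R sketch (in the spirit of the book, not taken from the paper) that simulates such a two-component mixture, with an assumed weight p=0.3, and draws the log-likelihood surface over the two means:

```r
set.seed(1)
p  <- 0.3                                  # known weight (assumed value)
mu <- c(-1, 1)                             # true means
n  <- 100
x  <- rnorm(n, mean = ifelse(rbinom(n, 1, p) == 1, mu[1], mu[2]))
loglik <- function(m1, m2)
  sum(log(p * dnorm(x, m1) + (1 - p) * dnorm(x, m2)))
grid <- seq(-3, 3, length.out = 101)
surf <- outer(grid, grid, Vectorize(loglik))
image(grid, grid, surf, xlab = expression(mu[1]), ylab = expression(mu[2]))
contour(grid, grid, surf, add = TRUE)  # a secondary mode may appear near the swapped means
```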

The “shocking” discovery in the paper is that the MLE is wrong as often as not in selecting the sign of the difference Δ between both means, with an additional accumulation point at zero. The global mode may thus be in the wrong place for small enough sample sizes. And even for larger sizes: when the difference between the means is small, the likelihood is likely to be unimodal with a mode quite close to zero. (An interesting remark is that the likelihood derivative is always zero at Δ=0 when considering the special case of both means equal to −Δ and to πΔ/(1−π), respectively, which implies that the overall mean of the mixture is equal to zero. A potential connection with our reparameterisation paper, maybe?)
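This sign pathology is easy to reproduce; the following sketch (my own code, not the authors') fits the two means by maximum likelihood over repeated small samples and records the frequency of sign errors on Δ:

```r
set.seed(3)
p <- 0.3; n <- 50; mu <- c(-0.3, 0.3)      # true Delta = mu[1] - mu[2] = -0.6
negll <- function(m, x)
  -sum(log(p * dnorm(x, m[1]) + (1 - p) * dnorm(x, m[2])))
wrong <- replicate(200, {
  x   <- rnorm(n, mean = ifelse(rbinom(n, 1, p) == 1, mu[1], mu[2]))
  mle <- optim(c(-1, 1), negll, x = x)$par
  sign(mle[1] - mle[2]) != sign(mu[1] - mu[2])
})
mean(wrong)                                # proportion of wrong-sign MLEs
```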

The alternative proposed by Avi and his co-authors is to proceed through moments, i.e., to revert to Pearson (1894). There are however difficulties with this approach, first and foremost the non-uniqueness of the moment equations used to estimate Δ. For instance, the second cumulant equation chosen by the authors is not always defined, as opposed to the third cumulant equation (why not use this third cumulant then?). Which does not always produce the right sign… But, in a strange twist, the authors turn those deficiencies into signals for both pathologies (wrong sign and “pile-up” at zero).
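To make the two equations explicit (my own rendering, assuming unit component variance and known weight p): the second cumulant satisfies var(X)=1+p(1−p)Δ², which has no real solution whenever the sample variance falls below 1, while the third cumulant satisfies κ₃=p(1−p)(1−2p)Δ³, which is always solvable but may get the sign wrong. A corresponding R sketch:

```r
p <- 0.3                                   # known weight, p != 1/2
delta_from_var <- function(x) {
  v <- var(x) - 1
  if (v < 0) return(NA)                    # second cumulant equation undefined
  sqrt(v / (p * (1 - p)))                  # and silent on the sign of Delta
}
delta_from_k3 <- function(x) {
  k3 <- mean((x - mean(x))^3)              # empirical third cumulant
  r  <- k3 / (p * (1 - p) * (1 - 2 * p))
  sign(r) * abs(r)^(1/3)                   # signed cube root, always defined
}
```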

“…the grid bootstrap yields an exact p-value for any valid test statistic.”

The most important issue in this framework being the estimation of the parameters, the authors opt for an approach based on tests, which is definitely surprising given the well-known deficiencies of standard tests in mixtures. The test chosen here is a Wald test with a statistic equal to the χ² version of the first cumulant differences. I am surprised that the χ² approximation works in such an unfriendly setting. And I do not understand how the grid is used, unless a certain degree of approximation is accepted, which takes us back to the “dark ages” of imposing a minimal distance Δ to achieve consistency, as in Ghosh and Sen (1985).

“…our concern about sign error is trivial in the Bayesian setting: the global mode is simply a poor summary of a multi-modal posterior. More broadly, the weak identification issues we highlight in this paper are not necessarily relevant to a strict Bayesian.”

A priori, I do not think pathologies of the MLE always transfer to Bayes estimators, unless one uses the MAP as a [poor] estimator. But using the MAP is not necessary since posterior means are meaningful in this identified setting, where label switching should not occur. However, running the same experiments with a Gaussian prior on both means and using the posterior mean as my estimator, I did obtain the same pathology of Bayes estimates [also produced in the supplementary material]: they do not concentrate on the true value of the difference, but put weight on the opposite value and at zero. Using a less standard prior inspired by David Rossell’s talk on non-local priors two weeks ago, which avoids a neighbourhood of zero, I did not get a much different picture, as illustrated below:

[figure from the original post]
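For reference, a minimal sketch of the kind of experiment described above (my own code, with assumed N(0,10) priors on both means), using a random-walk Metropolis sampler and the posterior mean of Δ as estimator:

```r
set.seed(2)
p <- 0.3; n <- 100
x <- rnorm(n, mean = ifelse(rbinom(n, 1, p) == 1, -0.5, 0.5))  # true Delta = -1
logpost <- function(m)                      # mixture likelihood + N(0,10) priors
  sum(log(p * dnorm(x, m[1]) + (1 - p) * dnorm(x, m[2]))) +
    sum(dnorm(m, 0, sqrt(10), log = TRUE))
niter <- 1e4
chain <- matrix(0, niter, 2)
for (t in 2:niter) {
  prop <- chain[t - 1, ] + rnorm(2, 0, 0.2)  # random-walk proposal
  chain[t, ] <- if (log(runif(1)) < logpost(prop) - logpost(chain[t - 1, ]))
    prop else chain[t - 1, ]
}
mean(chain[, 1] - chain[, 2])               # posterior mean of Delta
```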

Overall, I remain somewhat uncertain as to what to conclude from this pathological behaviour. When both means are close enough, the sign of the difference is often estimated wrongly. But that could simply mean that the means are not significantly different, for that sample size…

## relabelling mixtures

Posted in Books, Statistics on January 30, 2015 by xi'an

Another short paper about relabelling in mixtures was arXived last week by Pauli and Torelli. They refer rather extensively to a previous paper by Puolamäki and Kaski (2009) of which I was not aware, a paper attempting to construct a sampler that does not exhibit any label switching, a concept I find most curious as I see no rigorous way to state that a sampler is not switching! This would imply spotting the low posterior probability regions that the chain would have to cross. But I should check the paper nonetheless.

Because the G component mixture posterior is invariant under the G! possible permutations, I am somewhat undecided as to what the authors of the current paper mean by estimating the difference between two means, like μ₁−μ₂. Especially since they object to using the output of a perfectly mixing MCMC algorithm and seem to prefer the one associated with a non-switching chain. Or by estimating the probability that a given observation is from a given component, since this is exactly 1/G by the permutation invariance property. In order to identify a partition of the data, they introduce a loss function on the joint allocations of pairs of observations, a loss function that sounds quite similar to the one we used in our 2000 JASA paper on the label switching deficiencies of MCMC algorithms. (And makes me wonder why this work of ours is not deemed relevant for the approach advocated in the paper!) Still, having read this paper, which I find rather poorly written, I have no clear understanding of how the authors give a precise meaning to a specific component of the mixture distribution. Or how the relabelling has to be conducted to avoid switching. That is, how the authors define their parameter space. Or their loss function. Unless one falls back onto the ordering of the means or the weights, which has the drawback of not connecting with the level sets of a particular mode of the posterior distribution, meaning that imposing the constraints results in a region that contains bits of several modes.
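As an aside, pairwise allocation quantities of this kind are label invariant and easy to compute; here is a minimal R sketch (my own interpretation, not the authors' code) of the posterior co-allocation matrix, i.e. the MCMC frequency with which two observations share a component:

```r
set.seed(4)
G <- 2; n <- 5
alloc <- matrix(sample(1:G, 100 * n, replace = TRUE), 100, n)  # toy labels
coalloc <- function(alloc) {               # alloc: iterations x observations
  S <- matrix(0, ncol(alloc), ncol(alloc))
  for (t in seq_len(nrow(alloc)))
    S <- S + outer(alloc[t, ], alloc[t, ], "==")
  S / nrow(alloc)                          # invariant under relabelling
}
round(coalloc(alloc), 2)
```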

At some point the authors assume the data can be partitioned into K≤G groups such that there is a representative observation within each group never sharing a component (across MCMC iterations) with any of the other representatives. While this notion is label invariant, I wonder whether (a) this is possible on any MCMC outcome; (b) it indicates a positive or negative feature of the MCMC sampler; and (c) what prevents the representatives from switching in harmony from one component to the next while preserving their perfect mutual exclusion… This however constitutes the advance in the paper, namely that component-dependent quantities are estimated as those associated with a particular representative. Note that the paper contains no illustration, hence the method may prove hard or even impossible to implement!
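To see why point (a) is a genuine worry, here is a self-contained sketch (my reading of the mutual exclusion condition, not the authors' algorithm) that greedily searches for representatives in a toy allocation matrix; with exchangeable labels and any reasonable number of iterations, the search rarely goes beyond a single observation:

```r
set.seed(5)
alloc <- matrix(sample(1:3, 100 * 6, replace = TRUE), 100, 6)  # toy labels
never <- function(i, j) all(alloc[, i] != alloc[, j])  # never co-allocated?
reps <- integer(0)
for (i in seq_len(ncol(alloc)))
  if (all(vapply(reps, never, logical(1), j = i))) reps <- c(reps, i)
reps   # candidate representatives: typically only one survives
```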