## ABC-SAEM

Posted in Books, Statistics, University life on October 8, 2019 by xi'an

In connection with the recent PhD thesis defence of Juliette Chevallier, in which I took a somewhat virtual part, being physically in Warwick, I read a paper she wrote with Stéphanie Allassonnière on stochastic approximation versions of the EM algorithm. Computing the MAP estimator can be done via versions of EM adapted for simulated annealing, possibly using MCMC, as for instance in the Monolix software and its MCMC-SAEM algorithm. Here SA stands sometimes for stochastic approximation and sometimes for simulated annealing, originally developed by Gilles Celeux and Jean Diebolt, then reframed by Marc Lavielle and Eric Moulines [friends and coauthors]. With an MCMC step because the simulation of the latent variables involves an intractable normalising constant. (Contrary to this paper, Umberto Picchini and Adeline Samson proposed in 2015 a genuine ABC version of this approach, a paper that I thought I had missed, although I now remember discussing it with Adeline at JSM in Seattle, in which ABC is used as a substitute for the conditional distribution of the latent variables given data and parameter. To be used as a substitute for the Q step of the (SA)EM algorithm. One more approximation step and one more simulation step and we would reach a form of ABC-Gibbs!) In this version, there are very few assumptions made on the approximation sequence, except that it converges with the iteration index to the true distribution (for a fixed observed sample) if convergence of ABC-SAEM is to happen. The paper takes as an illustrative sequence a collection of tempered versions of the true conditionals, but this is quite formal as I cannot fathom a feasible simulation from the tempered version and not from the untempered one. It is thus much more a version of tempered SAEM than truly connected with ABC (although a genuine ABC-EM version could be envisioned).
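
To make the structure of a tempered SAEM iteration concrete, here is a minimal sketch on a toy model that is not taken from the paper: latent z_i ~ N(θ, 1) and observations y_i | z_i ~ N(z_i, 1), so that the exact conditional of z_i is N((y_i + θ)/2, 1/2) and the maximum likelihood estimate of θ is the sample mean of y. The function name `tempered_saem`, the temperature schedule 1 + temp0/k and the step sizes are all assumptions made for illustration. In this Gaussian toy both tempered and untempered conditionals are trivially simulated, so the sketch shows the mechanics only and says nothing about cases where the untempered conditional is out of reach.

```python
import numpy as np

def tempered_saem(y, n_iter=2000, burn_in=100, temp0=5.0, seed=0):
    """Toy tempered SAEM for z_i ~ N(theta, 1), y_i | z_i ~ N(z_i, 1)."""
    rng = np.random.default_rng(seed)
    theta, stat = 0.0, 0.0
    for k in range(1, n_iter + 1):
        # Temperature schedule (an assumption): decreases to 1 as k grows.
        temp = 1.0 + temp0 / k
        # Simulation step: the exact conditional is N((y_i + theta)/2, 1/2);
        # raising its density to the power 1/temp inflates the variance by temp.
        z = rng.normal((y + theta) / 2.0, np.sqrt(temp / 2.0))
        # Stochastic approximation step on the sufficient statistic mean(z).
        gamma = 1.0 if k <= burn_in else 1.0 / (k - burn_in)
        stat += gamma * (z.mean() - stat)
        # Maximisation step: the complete-data MLE of theta is the statistic.
        theta = stat
    return theta

# Simulated data: marginally y_i ~ N(2, 2) under the toy model.
y = np.random.default_rng(1).normal(2.0, np.sqrt(2.0), size=200)
theta_hat = tempered_saem(y)
print(theta_hat, y.mean())  # the two values should be close
```

Since the temperature only inflates the variance of a Gaussian conditional here, the tempered iterations remain centred on the right target and the approximation sequence converges to the true conditional as required.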

## the most probable cluster

Posted in Books, Statistics on July 11, 2019 by xi'an

In the last issue of Bayesian Analysis, Lukasz Rajkowski studies the most likely (MAP) cluster associated with the Dirichlet process mixture model. Reminding me that most Bayesian estimates of the number of clusters are not consistent (when the sample size grows to infinity). I am always puzzled by this problem, as estimating the number of clusters sounds like an ill-posed problem, since it is growing with the number of observations, by definition of the Dirichlet process. For instance, the current paper establishes that the number of clusters intersecting a given compact set remains bounded. (The setup is one of a Normal Dirichlet process mixture with constant and known covariance matrix.)

Since the posterior probability of a given partition of {1,2,…,n} can be (formally) computed, the MAP estimate can be (formally) derived. I inserted formally in the previous sentence as the derivation of the exact MAP is an NP-hard problem in the number n of observations. As an aside, I have trouble with the author’s argument that the convex hulls of the clusters should be disjoint: I do not see why they should when the mixture components are overlapping. (More generally, I fail to relate to notions like “bad clusters” or “overestimation of the number of clusters” or a “sensible choice” of the covariance matrix.) More globally, I am somewhat perplexed by the purpose of the paper and the relevance of the MAP estimate, even putting aside my generic criticisms of the MAP approach. No uncertainty is attached to the estimator, which thus appears as a form of penalised likelihood strategy rather than a genuinely Bayesian (Analysis) solution.
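
The gap between "formally computed" and "feasible" can be seen in a brute-force sketch that enumerates every partition of a tiny sample. It assumes a one-dimensional Normal Dirichlet process mixture with known variance `sigma2`, a N(0, `tau2`) base measure on the cluster means and concentration `alpha`; these hyperparameter names and values are illustrative choices, not the paper's. The posterior of a partition is then proportional to the product over clusters of alpha × (n_c − 1)! × the marginal likelihood of the cluster.

```python
import math
import numpy as np

def set_partitions(n):
    """Yield all partitions of range(n) as label lists (restricted growth strings)."""
    def rec(labels, n_blocks):
        if len(labels) == n:
            yield list(labels)
            return
        for b in range(n_blocks + 1):
            labels.append(b)
            yield from rec(labels, max(n_blocks, b + 1))
            labels.pop()
    yield from rec([], 0)

def log_marginal(x, sigma2, tau2):
    """Log marginal of one cluster: x ~ N(0, sigma2 * I + tau2 * 11')."""
    n, s, ss = len(x), x.sum(), (x ** 2).sum()
    return (-0.5 * n * math.log(2 * math.pi * sigma2)
            - 0.5 * math.log(1 + n * tau2 / sigma2)
            - 0.5 / sigma2 * (ss - tau2 * s ** 2 / (sigma2 + n * tau2)))

def log_posterior(labels, x, alpha=1.0, sigma2=1.0, tau2=100.0):
    """Unnormalised log posterior of a partition (Chinese restaurant prior)."""
    labels = np.asarray(labels)
    total = 0.0
    for c in np.unique(labels):
        xc = x[labels == c]
        total += math.log(alpha) + math.lgamma(len(xc)) + log_marginal(xc, sigma2, tau2)
    return total

def map_partition(x, **kwargs):
    """Exhaustive search: the number of partitions is the Bell number of len(x)."""
    return max(set_partitions(len(x)), key=lambda lab: log_posterior(lab, x, **kwargs))

x = np.array([-5.1, -4.9, -5.0, 5.0, 5.2, 4.8])
print(map_partition(x))
```

Six observations already mean 203 partitions and ten mean 115,975, so the exhaustive search is hopeless beyond toy sizes, in line with the NP-hardness mentioned above.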

The first example in the paper is using data from a Uniform over (-1,1), concluding at a “misleading” partition by the MAP since it produces more than one cluster. I find this statement flabbergasting as the generative model is not the estimated model. To wit, the case of an exponential Exp(1) sample that cannot reach a maximum of the target function with a finite number of sample. Which brings me back full-circle to my general unease about clustering in that much more seems to be assumed about this notion than what the statistical model delivers.

## risk-averse Bayes estimators

Posted in Books, pictures, Statistics on January 28, 2019 by xi'an

An interesting paper came out on arXiv in early December, written by Michael Brand from Monash. It is about risk-averse Bayes estimators, which are defined as avoiding the use of loss functions (although why loss functions should be avoided is not made very clear in the paper). Close to MAP estimates, they bypass the dependence of said MAPs on parameterisation by maximising instead π(θ|x)/√I(θ), which is invariant under reparameterisation if not under a change of dominating measure. This form of MAP estimate is called the Wallace-Freeman (1987) estimator [of which I never heard].
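
A small numerical check may help to see the invariance. The sketch below assumes a Binomial(n, p) observation with a uniform prior on p (an illustrative setting, not one from the paper) and maximises both π(θ|x) and π(θ|x)/√I(θ) on a grid, first in the parameterisation p and then in the logit η = log(p/(1−p)). All variable names are hypothetical.

```python
import numpy as np

n, x = 10, 3  # illustrative data: 3 successes out of 10 trials

def log_post_p(p):
    # Posterior of p under a uniform prior, up to an additive constant.
    return x * np.log(p) + (n - x) * np.log1p(-p)

# Parameterisation in p: Fisher information is n / (p (1 - p)).
p_grid = np.linspace(1e-4, 1 - 1e-4, 200001)
log_fisher_p = np.log(n / (p_grid * (1 - p_grid)))
map_p = p_grid[np.argmax(log_post_p(p_grid))]
wf_p = p_grid[np.argmax(log_post_p(p_grid) - 0.5 * log_fisher_p)]

# Parameterisation in eta = logit(p): the density picks up the Jacobian
# dp/d(eta) = p (1 - p), and the Fisher information becomes n p (1 - p).
eta = np.linspace(-8.0, 8.0, 200001)
p_eta = 1.0 / (1.0 + np.exp(-eta))
log_post_eta = log_post_p(p_eta) + np.log(p_eta * (1 - p_eta))
log_fisher_eta = np.log(n * p_eta * (1 - p_eta))
map_eta = p_eta[np.argmax(log_post_eta)]
wf_eta = p_eta[np.argmax(log_post_eta - 0.5 * log_fisher_eta)]

# Both estimates are reported on the p scale for comparison.
print("MAP:", map_p, map_eta)
print("Wallace-Freeman:", wf_p, wf_eta)
```

The MAP moves from x/n to (x+1)/(n+2) when switching to the logit scale, while the Wallace-Freeman maximiser stays at (x+1/2)/(n+1) in both, since the Jacobian cancels between the density and the square root of the Fisher information.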

The formal definition of a risk-averse estimator is still based on a loss function in order to produce a proper version of the probability to be “wrong” in a continuous environment. The difference between estimator and true value θ, as expressed by the loss, is enlarged by a scale factor k pushed to infinity. Meaning that differences not in the immediate neighbourhood of zero are not relevant. In the case of a countable parameter space, this is essentially producing the MAP estimator. In the continuous case, for “well-defined” and “well-behaved” loss functions, estimators, and densities, including an invariance to parameterisation as in my own intrinsic losses of old! (which the author calls likelihood-based loss functions, mentioning f-divergences), the resulting estimator is a Wallace-Freeman estimator (of which there may be several). I did not get very deep into the study of the convergence proof, which seems to borrow more from real analysis à la Rudin than from functional analysis or measure theory, but keep returning to the apparent dependence of the notion on the dominating measure, which bothers me.

## MAP as Bayes estimators

Posted in Books, Kids, Statistics on November 30, 2016 by xi'an

Robert Bassett and Julio Deride just arXived a paper discussing the position of MAPs within Bayesian decision theory. A point I have discussed extensively on the ‘Og!

“…we provide a counterexample to the commonly accepted notion of MAP estimators as a limit of Bayes estimators having 0-1 loss.”

The authors mention The Bayesian Choice stating this property without further precautions and I completely agree that I was careless in this regard! The difficulty stands with the limit of the maximisers not necessarily being the maximiser of the limit. The paper includes an example to this effect, with a prior as above, associated with a sampling distribution that does not depend on the parameter. The sufficient conditions proposed therein are that the posterior density is almost surely proper or quasiconcave.
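
For a well-behaved posterior the folk theorem does hold, and it is easy to watch it at work. The sketch below takes a Beta(3, 5) posterior as an assumed unimodal, quasiconcave example (not the counterexample of the paper) and computes on a grid the Bayes estimator under the loss 1{|θ−δ|>ε}, namely the centre δ of the interval of half-width ε with maximal posterior mass. The function name `bayes_estimate_01` is hypothetical.

```python
import numpy as np

theta = np.linspace(0.0, 1.0, 10001)
dens = theta ** 2 * (1.0 - theta) ** 4  # Beta(3, 5) density up to a constant

# Posterior cdf on the grid by the trapezoidal rule, then normalised.
cdf = np.concatenate(([0.0], np.cumsum((dens[1:] + dens[:-1]) / 2 * np.diff(theta))))
cdf /= cdf[-1]

def bayes_estimate_01(eps):
    """Centre of the interval [d - eps, d + eps] with the largest posterior mass."""
    upper = np.interp(np.clip(theta + eps, 0.0, 1.0), theta, cdf)
    lower = np.interp(np.clip(theta - eps, 0.0, 1.0), theta, cdf)
    return theta[np.argmax(upper - lower)]

posterior_mode = (3 - 1) / (3 + 5 - 2)  # MAP of a Beta(3, 5), equal to 1/3
for eps in (0.2, 0.1, 0.05, 0.01):
    print(eps, bayes_estimate_01(eps), posterior_mode)
```

The estimates drift towards the mode 1/3 as ε decreases, which is the behaviour the sufficient conditions are meant to guarantee and which the counterexample breaks.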

This is a neat mathematical characterisation that cleans up this “folk theorem” about MAP estimators. And for which the authors are to be congratulated! However, I am not very excited by the limiting property, whether it holds or not, as I have difficulties conceiving the use of a sequence of losses in a mildly realistic case. I rather prefer the alternate characterisation of MAP estimators by Burger and Lucka as proper Bayes estimators under another type of loss function, albeit a rather artificial one.

## non-local priors for mixtures

Posted in Statistics, University life on September 15, 2016 by xi'an

[For some unknown reason, this commentary on the paper by Jairo Fúquene, Mark Steel, David Rossell —all colleagues at Warwick— on choosing mixture components by non-local priors remained untouched in my draft box…]

Choosing the number of components in a mixture of (e.g., Gaussian) distributions is a hard problem. It may actually be an altogether impossible problem, even when abstaining from moral judgements on mixtures. I do realise that the components can eventually be identified as the number of observations grows to infinity, as demonstrated for instance by Judith Rousseau and Kerrie Mengersen (2011). But for a finite and given number of observations, how much can we trust any conclusion about the number of components?! It seems to me that the criticism about the vacuity of point null hypotheses, namely the logical absurdity of trying to differentiate θ=0 from any other value of θ, applies to the estimation or test on the number of components of a mixture. Doubly so, one might argue, since a very small or a very close component is indistinguishable from a non-existing one. For instance, Definition 2 is correct from a mathematical viewpoint, but it does not spell out the multiple contiguities between k- and k’-component mixtures.

The paper starts with a comprehensive coverage of l’état de l’art… When using a Bayes factor to compare a k-component and an h-component mixture, the behaviour of the factor is quite different depending on which model is correct. Essentially, overfitted mixtures take much longer to detect than underfitted ones, which makes intuitive sense. And BIC should be corrected for overfitted mixtures by a canonical dimension λ between the true and the (larger) assumed number of parameters into

2 log m(y) = 2 log p(y|θ) – λ log O(n) + O(log log n)

I would argue that this purely invalidates BIC in mixture settings, since the canonical dimension λ is unavailable (and DIC does not provide a useful substitute, as we illustrated a decade ago…). The criticism of the Rousseau and Mengersen (2011) overfitted mixtures, namely that their approach shrinks less than model averaging over several numbers of components, relates to minimaxity and hence sounds both overly technical and like a reversion to some frequentist approach to testing. Replacing testing with estimating sounds like the right idea. And I am also unconvinced that a faster rate of convergence of the posterior probability or of the Bayes factor is a relevant factor when conducting

As for non-local priors, the notion seems to rely on a specific topology for the parameter space since a k-component mixture can approach a k’-component mixture (when k'<k) in a continuum of ways (even for a given parameterisation). This topology seems to be summarised by the penalty (distance?) d(θ) in the paper. Is there an intrinsic version of d(θ), given the weird parameter space? Like one derived from the Kullback-Leibler distance between the models? The choice of how zero is approached clearly has an impact on how easily the “null” is detected, all the more because of the somewhat discontinuous nature of the parameter space. Incidentally, I find it curious that only the distance between means is penalised… The prior also assumes independence between component parameters and component weights, which I think is suboptimal in dealing with mixtures, maybe suboptimal in a poetic sense!, as we discussed in our reparameterisation paper. I am not sure either that the speed at which the distance converges to zero (in Theorem 1) helps me to understand whether the mixture has too many components for the data’s own good when I can run a calibration experiment under both assumptions.
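
The multiple ways of collapsing onto a smaller mixture can be illustrated numerically. The sketch below is an assumption-laden toy, not the penalty d(θ) of the paper: it computes by a Riemann sum the Kullback-Leibler divergence from the two-component mixture (1−w) N(0,1) + w N(δ,1) to the single component N(0,1), along two paths, one where the weight w vanishes at fixed δ and one where the means merge at fixed w. The function names are hypothetical.

```python
import numpy as np

x = np.linspace(-12.0, 12.0, 48001)  # integration grid
dx = x[1] - x[0]

def normal_pdf(x, mean):
    """Density of N(mean, 1)."""
    return np.exp(-0.5 * (x - mean) ** 2) / np.sqrt(2.0 * np.pi)

def kl_to_single_component(weight, delta):
    """KL( (1-w) N(0,1) + w N(delta,1) || N(0,1) ) by a Riemann sum."""
    mix = (1.0 - weight) * normal_pdf(x, 0.0) + weight * normal_pdf(x, delta)
    ref = normal_pdf(x, 0.0)
    return np.sum(mix * (np.log(mix) - np.log(ref))) * dx

# Path A: the second weight vanishes while the means stay two units apart.
for w in (0.1, 0.01, 0.001):
    print("w =", w, "KL =", kl_to_single_component(w, 2.0))

# Path B: the two means merge while the weights stay equal.
for d in (1.0, 0.1, 0.01):
    print("delta =", d, "KL =", kl_to_single_component(0.5, d))
```

Both paths drive the divergence to zero, yet only the second one shrinks the distance between means, which is why a penalty based on that distance alone treats the two collapses so differently.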

While I appreciate the derivation of a closed form non-local prior, I wonder at the importance of the result. Is it because this leads to an easier derivation of the posterior probability? I do not see the connection in Section 3, except maybe that the importance weight indeed involves this normalising constant when considering several k’s in parallel. Is there any convergence issue in the importance sampling solution of (3.1) and (3.3) since the simulations are run under the local posterior? While I appreciate the availability of an EM version for deriving the MAP, a fact I became aware of only recently, is it truly bringing an improvement when compared with picking the MCMC simulation with the highest completed posterior?

The section on prior elicitation is obviously of central interest to me! It however seems to be restricted to the derivation of the scale factor g, in the distance, and of the parameter q in the Dirichlet prior on the weights. While the other parameters suffer from being allocated the conjugate-like priors. I would obviously enjoy seeing how this approach proceeds with our non-informative prior(s). In this regard, the illustration section is nice, but one always wonders at the representative nature of the examples and the possible interpretations of real datasets. For instance, when considering that the Old Faithful is more of an HMM than a mixture.