## 21w5107 [½day 4]

Posted in Statistics with tags , , , , , , , , , , , , , , on December 3, 2021 by xi'an

Final ½ day of the 21w5107 workshop for me, as our initial plans were to stop today due to the small number of participants on site. And I had booked plane tickets early, too early. I will thus sadly miss the four afternoon talks, mea culpa! However I did attend Noiritt Chandra’s talk on Bayesian factor analysis. Which has always been a bit of a mystery to me in the sense that the number q of factors need be specified, which is a prior input one rarely controls. Here the goal is to estimate a covariance matrix with a sparse representation. And q is estimated by empirical likelihood ahead of the estimation of the matrix. The focus was on minimaxity and MCMC implementation rather than objective Bayes per se! Then, Daniele Durante spoke about analytical posteriors for probit models using unified skew-Normal priors (following a 2019 Biometrika paper). Including marginal posteriors and marginal likelihood. And for various extensions like dynamic probit models. Opening other computational issues such as simulating high dimensional truncated Normal distributions. (Potential use of delayed acceptance there?) This second talk was also drifting away from objective Bayes! In the first half of his talk, Filippo Ascolani introduced us to trees of random probability measures, each mother node being the distribution of the atoms of the children nodes. (Interestingly, Kingman is both connected to (coalescent) trees and to completely random measures.) My naïve first impression was that the distributions would get more and more degenerate as the number of levels in the tree would increase, however I am unsure this is correct as Filippo mentioned getting observations on all nodes. The talk also made me wonder at how this could be related Radford Neal’s Dirichlet trees. (Which I discovered at my first ICMS workshop about 20 years ago.) Yang Ni concluded the morning with a talk on causality that provided (to me) a very smooth (re)introduction to Bayesian causal graphs.

Even more than last time, I enormously enjoyed the workshop, its location, the fantastic staff at the hotel, and the reconnection with dear friends!, just regretting we could not be a few more. I appreciate the efforts made by on-line participants to stay connected and intervene (thanks, Ed!), but the quality of interactions is sadly of another magnitude when spending all our time together. Hopefully there will be a next time and hopefully we’ll then be back to larger size (and hopefully the location will remain the same). Hasta luego, Oaxaca!

## divide & reconquer

Posted in Books, Statistics, University life with tags , , , , , , , , , , on February 5, 2018 by xi'an

Qi Liu, Anindya Bhadra, and William Cleveland from Purdue have arXived a paper entitled Divide and Recombine for Large and Complex Data: Model Likelihood Functions using MCMC. Which is a variation on the earlier divide & … papers attempting at handling large datasets. The beginning is quite similar to these earlier papers in that the likelihood is split into sub-likelihoods, approximated from MCMC samples and recombined into an approximate full likelihood. As in for instance Scott et al. one approximation use for the subsample is to replace the likelihood with a Normal approximation, or a skew Normal generalisation, which remains  a limited choice for heavy tailed likelihoods. Producing a Normal and skew-Normal approximation for the whole [data] likelihood, respectively. If I understand correctly, these approximations are missing a normalising constant to bring them to scale with the true likelihood, which I do not completely understand as the likelihood only needs to be defined up to a [constant] constant for most purposes, including Bayesian ones. The  method of estimation of this constant proposed therein is called the contour probability algorithm and it consists in using a highest density region to compare a likelihood and its approximation. (Nothing to do with our adaptation of Gelfand and Dey (1994) based on HPDs, with Darren Wright. Nor with nested sampling.) Returning a form of qq-plot. This is rather exploratory, while hardly addressing the issue of the precision of such approximations and the resolution of conflicting proposals. And the comparison with all these other recent proposals for splitting likelihoods into manageable bits (proposals that are mentioned in the final section, including our recentering scheme with my student Changye Wu).

## a discovery that mean can be impacted by extreme values

Posted in University life with tags , , , , , , on August 6, 2016 by xi'an

A surprising editorial in Nature about the misleading uses of impact factors, since as means they are heavily impacted by extreme values. With the realisation that the mean is not the median for skewed distributions…

To be fair(er), Nature published a subsequent paper this week about publishing additional metrics like the two-year median.

## the random variable that was always less than its mean…

Posted in Books, Kids, R, Statistics with tags , , , , , on May 30, 2016 by xi'an

Although this is far from a paradox when realising why the phenomenon occurs, it took me a few lines to understand why the empirical average of a log-normal sample is apparently a biased estimator of its mean. And why conversely the biased plug-in estimator does not appear to present a bias. To illustrate this “paradox” consider the picture below which compares both estimators of the mean of a log-normal LN(0,σ²) distribution as σ² increases: blue stands for the empirical mean, while gold corresponds to the plug-in estimator exp(σ²/2) when σ² is estimated from the log-sample, as in a normal sample. (The sample is of size 10⁶.) The gold sequence remains around one, while the blue one drifts away towards zero…

The question came on X validated and my first reaction was to doubt an implementation which outcome was so counter-intuitive. But then I thought further about the representation of a log-normal variate as exp(σξ) when ξ is a standard Normal variate. When σ grows large enough, it is near impossible for σξ to be larger than σ². More precisely,

P(X>E[X])=P(σξ>σ²/2)=1-Φ(σ/2)

which can be arbitrarily small.

## Jeffreys prior with improper posterior

Posted in Books, Statistics, University life with tags , , , , , , , , , , on May 12, 2014 by xi'an

In a complete coincidence with my visit to Warwick this week, I became aware of the paper “Inference in two-piece location-scale models with Jeffreys priors” recently published in Bayesian Analysis by Francisco Rubio and Mark Steel, both from Warwick. Paper where they exhibit a closed-form Jeffreys prior for the skewed distribution

$\dfrac{2\epsilon}{\sigma_1}f(\{x-\mu\}/\sigma_1)\mathbb{I}_{x<\mu}+\dfrac{2(1-\epsilon)}{\sigma_2}f(\{x-\mu\}/\sigma_2) \mathbb{I}_{x>\mu}$

where f is a symmetric density, namely

$\pi(\mu,\sigma_1,\sigma_2) \propto 1 \big/ \sigma_1\sigma_2\{\sigma_1+\sigma_2\}\,,$

where

$\epsilon=\sigma_1/\{\sigma_1+\sigma_2\}\,.$

only to show  immediately after that this prior does not allow for a proper posterior, no matter what the sample size is. While the above skewed distribution can always be interpreted as a mixture, being a weighted sum of two terms, it is not strictly speaking a mixture, if only because the “component” can be identified from the observation (depending on which side of μ is stands). The likelihood is therefore a product of simple terms rather than a product of a sum of two terms.

As a solution to this conundrum, the authors consider the alternative of the “independent Jeffreys priors”, which are made of a product of conditional Jeffreys priors, i.e., by computing the Jeffreys prior one parameter at a time with all other parameters considered to be fixed. Which differs from the reference prior, of course, but would have been my second choice as well. Despite criticisms expressed by José Bernardo in the discussion of the paper… The difficulty (in my opinion) resides in the choice (and difficulty) of the parameterisation of the model, since those priors are not parameterisation-invariant. (Xinyi Xu makes the important comment that even those priors incorporate strong if hidden information. Which relates to our earlier discussion with Kaniav Kamari on the “dangers” of prior modelling.)

Although the outcome is puzzling, I remain just slightly sceptical of the income, namely Jeffreys prior and the corresponding Fisher information: the fact that the density involves an indicator function and is thus discontinuous in the location μ at the observation x makes the likelihood function not differentiable and hence the derivation of the Fisher information not strictly valid. Since the indicator part cannot be differentiated. Not that I am seeing the Jeffreys prior as the ultimate grail for non-informative priors, far from it, but there is definitely something specific in the discontinuity in the density. (In connection with the later point, Weiss and Suchard deliver a highly critical commentary on the non-need for reference priors and the preference given to a non-parametric Bayes primary analysis. Maybe making the point towards a greater convergence of the two perspectives, objective Bayes and non-parametric Bayes.)

This paper and the ensuing discussion about the properness of the Jeffreys posterior reminded me of our earliest paper on the topic with Jean Diebolt. Where we used improper priors on location and scale parameters but prohibited allocations (in the Gibbs sampler) that would lead to less than two observations per components, thereby ensuring that the (truncated) posterior was well-defined. (This feature also remained in the Series B paper, submitted at the same time, namely mid-1990, but only published in 1994!)  Larry Wasserman proved ten years later that this truncation led to consistent estimators, but I had not thought about it in very long while. I still like this notion of forcing some (enough) datapoints into each component for an allocation (of the latent indicator variables) to be an acceptable Gibbs move. This is obviously not compatible with the iid representation of a mixture model, but it expresses the requirement that components all have a meaning in terms of the data, namely that all components contributed to generating a part of the data. This translates as a form of weak prior information on how much we trust the model and how meaningful each component is (in opposition to adding meaningless extra-components with almost zero weights or almost identical parameters).

As a marginalia, the insistence in Rubio and Steel’s paper that all observations in the sample be different also reminded me of a discussion I wrote for one of the Valencia proceedings (Valencia 6 in 1998) where Mark presented a paper with Carmen Fernández on this issue of handling duplicated observations modelled by absolutely continuous distributions. (I am afraid my discussion is not worth the \$250 price tag given by amazon!)