## Bayesian goodness of fit

Posted in Books, pictures, Statistics, University life on April 10, 2018 by xi'an

Persi Diaconis and Guanyang Wang have just arXived an interesting reflection on the notion of Bayesian goodness of fit tests. Which is a notion that has always bothered me, in a rather positive sense (!), as

“I also have to confess at the outset to the zeal of a convert, a born again believer in stochastic methods. Last week, Dave Wright reminded me of the advice I had given a graduate student during my algebraic geometry days in the 70’s: ‘Good Grief, don’t waste your time studying statistics. It’s all cookbook nonsense.’ I take it back! …” David Mumford

The paper starts with a reference to David Mumford, whose paper with Wu and Zhu on exponential “maximum entropy” synthetic distributions is at the source (?) of this paper, and whose name appears in its very title: “A conversation for David Mumford”…, about his conversion from pure (algebraic) maths to applied maths. The issue of (Bayesian) goodness of fit is addressed, with card shuffling examples, the null hypothesis being that the permutation resulting from the shuffling is uniformly distributed if shuffling takes enough time. Interestingly, while the parameter space is compact as a distribution on a finite set, Lindley’s paradox still occurs, namely that the null (the permutation comes from a Uniform) is always accepted provided there is no repetition under a “flat prior”, which is the Dirichlet D(1,…,1) over all permutations. (In this finite setting an improper prior is definitely improper as it does not get proper after accounting for observations. Although I do not understand why the Jeffreys prior is not the Dirichlet(½,…,½) in this case…) When resorting to the exponential family of distributions entertained by Zhu, Wu and Mumford, including the uniform distribution as one of its members, Diaconis and Wang advocate the use of a conjugate prior (exponential family, right?!) to compute a Bayes factor that simplifies into a ratio of two intractable normalising constants. For which the authors suggest using importance sampling, thermodynamic integration, or the exchange algorithm. Except that they rely on the (dreaded) harmonic mean estimator for computing the Bayes factor in the following illustrative section! Due to the finite nature of the space, I presume this estimator still has a finite variance. (Remark 1 calls for convergence results on exchange algorithms, which can be found I think in the just as recent arXival by Christophe Andrieu and co-authors.)

An interesting if rare feature of the example processed in the paper is that the sufficient statistic used for the permutation model can be directly simulated from a Multinomial distribution. This is rare as seen when considering the benchmark of Ising models, for which the summary and sufficient statistic cannot be directly simulated. (If only…!) In fine, while I enjoyed the paper a lot, I remain uncertain as to its bearings, since defining an objective alternative for the goodness-of-fit test quickly becomes challenging outside simple enough models.
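
To make the “always accepted without repetition” remark concrete, here is a minimal Python sketch of the Bayes factor of the uniform null against a symmetric Dirichlet alternative, using the standard Dirichlet-multinomial marginal likelihood for an ordered sample. The function name, the five-card deck and the sample sizes are illustrative assumptions of mine, not taken from the paper.

```python
import math

def log_bayes_factor_uniform(counts, n_categories, alpha=1.0):
    """Log Bayes factor B01 of H0 (uniform over n_categories outcomes)
    against H1 (unknown probabilities with a Dirichlet(alpha,...,alpha) prior).

    `counts` lists the counts of the outcomes that were actually observed;
    unobserved outcomes contribute nothing to the marginal likelihood.
    Hypothetical helper: fine for moderate n_categories, numerically
    unstable for something as large as 52!.
    """
    n = sum(counts)
    # Marginal likelihood of the ordered sample under the uniform null.
    log_m0 = -n * math.log(n_categories)
    # Dirichlet-multinomial marginal likelihood under the alternative.
    log_m1 = math.lgamma(n_categories * alpha) - math.lgamma(n_categories * alpha + n)
    log_m1 += sum(math.lgamma(alpha + c) - math.lgamma(alpha) for c in counts)
    return log_m0 - log_m1

K = math.factorial(5)          # permutations of a (toy) five-card deck
no_repeat = [1] * 10           # ten shuffles, all distinct permutations
one_repeat = [2]               # two shuffles, same permutation twice

log_b01_no_repeat = log_bayes_factor_uniform(no_repeat, K)
log_b01_one_repeat = log_bayes_factor_uniform(one_repeat, K)
print(log_b01_no_repeat > 0, log_b01_one_repeat < 0)
```

With alpha = 1 and no repeated permutation, the Bayes factor reduces to the product of (K + j)/K for j = 0, …, n − 1, which is at least one whatever n, so the flat prior can never speak against uniformity until a repetition shows up.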

## Darmois, Koopman, and Pitman

Posted in Books, Statistics on November 15, 2017 by xi'an

When [X’ed] seeking a simple proof of the Pitman-Koopman-Darmois lemma [that exponential families are the only types of distributions with constant support allowing for a fixed dimension sufficient statistic], I came across a 1962 Stanford technical report by Don Fraser containing a short proof of the result. Proof that I do not fully understand as it relies on the notion that the likelihood function itself is a minimal sufficient statistic.
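
For reference, the objects involved in the lemma can be written as follows, in generic textbook notation (the symbols h, η, T and A are my own choice here, not Fraser's): an exponential family density and the fixed-dimension sufficient statistic it admits for an i.i.d. sample of any size n.

$$
f(x\mid\theta) = h(x)\,\exp\{\eta(\theta)^{\mathsf T} T(x) - A(\theta)\},
\qquad
S_n(x_1,\ldots,x_n) = \sum_{i=1}^{n} T(x_i).
$$

The lemma is the converse statement: under constant support (and regularity conditions), a sufficient statistic whose dimension does not grow with n forces the density to take the form on the left.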

## “Stein deviates from the statistical norm”

Posted in Books, pictures, Statistics, University life on November 27, 2016 by xi'an

## Charles M. Stein [1920-2016]

Posted in Books, pictures, Statistics, University life on November 26, 2016 by xi'an

I have just heard that Charles Stein, Professor at Stanford University, passed away last night. Although the following image is definitely over-used, I truly feel this is the departure of a giant of statistics. He has been deeply influential on the fields of probability and mathematical statistics, primarily in decision theory and approximation techniques. In the first field, he brought about considerable changes in the perception of optimality by exhibiting the Stein phenomenon, where the aggregation of several admissible estimators of unrelated quantities may (and will) become inadmissible for the joint estimation of those quantities! Although the result can be explained by mathematical and statistical reasoning, it was still dubbed a paradox due to its counter-intuitive nature. More foundationally, it exposed the ill-posed nature of frequentist optimality criteria and certainly contributed to the Bayesian renewal of the 1980’s, before the MCMC revolution. (It definitely contributed to my own move, as I started working on the Stein phenomenon during my thesis, before realising the fundamentally Bayesian nature of the domination results.)

“…the Bayesian point of view is often accompanied by an insistence that people ought to agree to a certain doctrine even without really knowing what this doctrine is.” (Statistical Science, 1986)

The second major contribution of Charles Stein was the introduction of a new technique for normal approximation that is now called the Stein method. It relies on a differential operator and produces estimates of approximation error in Central Limit theorems, even in dependent settings. While I am much less familiar with this aspect of Charles Stein’s work, I believe the impact it has had on the field is much more profound and durable than the Stein effect in Normal mean estimation.

(During the Vietnam War, he was quite active in the anti-war movement and the above picture from 2003 shows that his opinions had not shifted over time!) A giant truly has gone.

## warp-U bridge sampling

Posted in Books, Statistics, Travel, University life on October 12, 2016 by xi'an

[I wrote this set of comments right after MCqMC 2016 on a preliminary version of the paper so mileage may vary in terms of relevance to the current version!]

In warp-U bridge sampling, newly arXived and first presented at MCqMC 16, Xiao-Li Meng continues (in collaboration with Lazhi Wang) his exploration of bridge sampling techniques towards improving the estimation of normalising constants and ratios thereof. The bridge sampling estimator of Meng and Wong (1996) is a harmonic mean importance sampler that requires iterations as it depends on the ratio of interest. Given that the normalising constant of a density does not depend on the chosen parameterisation, in the sense that the Jacobian transform preserves this constant, a degree of freedom lies in the choice of the parameterisation. This is the idea behind warp transformations. The initial version of Meng and Schilling (2002) used location-scale transforms, while the warp-U solution goes for a multiple location-scale transform that can be seen as based on a location-scale mixture representation of the target, with K components. This approach can also be seen as a sort of artificial reversible jump algorithm when one model is fully known, a strategy Nicolas and I also proposed in our nested sampling Biometrika paper.
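
As a reminder of why the estimator needs iterations, here is a minimal Python sketch of the iterative bridge sampling scheme with the asymptotically optimal bridge, estimating the ratio r = c1/c2 of two normalising constants from draws of both normalised densities. The function name and the two Gaussian toy targets are assumptions of mine for illustration; this is plain bridge sampling, not the warp-U version.

```python
import numpy as np

def bridge_ratio(log_q1, log_q2, x1, x2, n_iter=50):
    """Iterative bridge sampling estimate of c1/c2, where q_i = c_i * p_i.

    x1 are draws from p1 and x2 are draws from p2 (hypothetical helper).
    """
    n1, n2 = len(x1), len(x2)
    s1, s2 = n1 / (n1 + n2), n2 / (n1 + n2)
    # Ratios l = q1/q2 evaluated on each sample.
    l1 = np.exp(log_q1(x1) - log_q2(x1))
    l2 = np.exp(log_q1(x2) - log_q2(x2))
    r = 1.0  # starting value; the update below depends on the current r
    for _ in range(n_iter):
        num = np.mean(l2 / (s1 * l2 + s2 * r))
        den = np.mean(1.0 / (s1 * l1 + s2 * r))
        r = num / den
    return r

# Toy targets: unnormalised N(0, 1) and N(1, 2^2) densities.
log_q1 = lambda x: -0.5 * x**2
log_q2 = lambda x: -0.5 * ((x - 1.0) / 2.0) ** 2

rng = np.random.default_rng(0)
x1 = rng.normal(0.0, 1.0, size=20000)
x2 = rng.normal(1.0, 2.0, size=20000)

r_hat = bridge_ratio(log_q1, log_q2, x1, x2)
print(r_hat)  # true ratio is sqrt(2*pi) / (2*sqrt(2*pi)) = 0.5
```

The better the two densities overlap, the smaller the variance of this estimate, which is the motivation for warping the target towards something close to a standard reference before bridging.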

Once such a mixture approximation is obtained, each and every component of the mixture can be turned into the standard version of the location-scale family by the appropriate location-scale transform. Since the component index k is unknown for a given X, they call this transform a random transform, which I find somewhat more confusing than helpful. The conditional distribution of the index given the observable x is well-known for mixtures and it is used here to weight the component-wise location-scale transforms of the original distribution p into something that looks rather similar to the standard version of the location-scale family, provided no mode has been forgotten by the mixture. The simulations from the original p are then rescaled by one of those transforms, whose index k is picked according to the conditional distribution. As explained later to me by XL, the random[ness] in the picture is due to the inclusion of a random ± sign. Still, in the notation introduced in (13), I do not get how the distribution Þ [sorry for using different symbols, I cannot render a tilde on a p] is defined since both ψ and W are random. Is it the marginal? In which case it would read as a weighted average of rescaled versions of p. I have the same problem with Theorem 1 in that I do not understand how one equates Þ with the joint distribution.
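
My reading of the transform described above can be sketched in Python as follows: compute the conditional probabilities of the component index given x, draw an index, standardise x with that component's location and scale, and attach a random sign. This is a sketch under my own assumptions (a Gaussian mixture, a function name of my choosing, and a target taken equal to the mixture so that the outcome can be checked), not the authors' implementation.

```python
import numpy as np

def warp_u_transform(x, weights, locs, scales, rng):
    """Randomly standardise draws x using a Gaussian mixture approximation.

    Hypothetical helper: index k ~ P(k | x), then z = +/- (x - loc_k) / scale_k.
    """
    x = np.asarray(x, dtype=float)
    z = (x[:, None] - locs[None, :]) / scales[None, :]
    # Log of weight_k * N(x; loc_k, scale_k^2), up to a constant common to all k.
    log_comp = np.log(weights) - np.log(scales) - 0.5 * z**2
    log_comp -= log_comp.max(axis=1, keepdims=True)
    resp = np.exp(log_comp)
    resp /= resp.sum(axis=1, keepdims=True)
    # Draw one index per observation by inverting the cumulative probabilities.
    cum = np.cumsum(resp, axis=1)
    u = rng.random(len(x))
    k = np.minimum((cum < u[:, None]).sum(axis=1), len(weights) - 1)
    sign = rng.choice([-1.0, 1.0], size=len(x))
    return sign * z[np.arange(len(x)), k]

rng = np.random.default_rng(1)
weights = np.array([0.3, 0.7])
locs = np.array([-3.0, 2.0])
scales = np.array([0.5, 1.5])

# Toy target: the mixture itself, so the warped draws should look standard normal.
comp = rng.choice(2, size=50000, p=weights)
x = rng.normal(locs[comp], scales[comp])
z = warp_u_transform(x, weights, locs, scales, rng)
print(z.mean(), z.std())
```

When the mixture only approximates the target, the warped draws are no longer exactly standard normal, but the bimodal target has still been folded into a unimodal distribution that overlaps far better with a standard reference.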

Equation (21) is much more illuminating (I find) than the previous explanation in that it exposes the fact that the principle is one of aiming at a new distribution for both the target and the importance function, with hopes that the fit will get better. It could have been better to avoid the notion of random transform, then, but this is mostly a matter of conveying the notion.

On more specific points (or minutiae), the unboundedness of the likelihood is rarely if ever a problem when using EM. An alternative to the multiple start EM proposal would then be to get sequential and estimate the mixture in a sequential manner, only adding a component when it seems worth it. See e.g. Chopin and Pelgrin (2004) and Chopin (2007). This could also help with the bias mentioned therein since only a (tiny?) fraction of the data would be used. And the number of components K has an impact on the accuracy of the approximation, as in not missing a mode, and on the computing time. However, my suggestion would be to avoid estimating K as this must be immensely costly.

Section 6 obviously relates to my folded Markov interests. If I understand correctly, the paper argues that the transformed density Þ does not need to be computed when considering the folding-move-unfolding step as a single step rather than three steps. I fear the description between equations (30) and (31) is missing the move step over the transformed space. Also, on a personal level, I still do not see how to add this approach to our folding methodology, even though the different transforms act as so many replicas of the original Markov chain.