Archive for Costas Goutis

revisiting marginalisation paradoxes [Bayesian reads #1]

Posted in Books, Kids, pictures, Statistics, Travel, University life on February 8, 2019 by xi'an

As a reading suggestion for my (last) OxWaSP Bayesian course at Oxford, I included the classic 1973 Marginalisation paradoxes by Phil Dawid, Mervyn Stone [whom I met when visiting UCL in 1992 since he was sharing an office with my friend Costas Goutis], and Jim Zidek. A paper that also appears in my (recent) slides as an exercise. And has been discussed many times on this ‘Og.

Reading the paper on the train to Oxford was quite pleasant, with a few discoveries like an interesting dig at Fraser’s structural (crypto-fiducial?!) distributions that “do not need Bayesian improper priors to fall into the same paradoxes”. And a most fascinating if surprising inclusion of the Box-Müller random generator in an argument, something of a precursor to perfect sampling (?). And a clear declaration that (right-Haar) invariant priors are at the source of the resolution of the paradox. With a much less clear notion of “un-Bayesian priors” as those leading to a paradox. Especially when the authors exhibit a red herring where the paradox cannot disappear, no matter what the prior is. Rich discussion (with none of the current 400 word length constraint), including the suggestion of neutral points, namely those that do identify a posterior, whatever that means. Funny conclusion, as well:

“In Stone and Dawid’s Biometrika paper, B1 promised never to use improper priors again. That resolution was short-lived and let us hope that these two blinkered Bayesians will find a way out of their present confusion and make another comeback.” D.J. Bartholomew (LSE)

and another

“An eminent Oxford statistician with decidedly mathematical inclinations once remarked to me that he was in favour of Bayesian theory because it made statisticians learn about Haar measure.” A.D. McLaren (Glasgow)

and yet another

“The fundamentals of statistical inference lie beneath a sea of mathematics and scientific opinion that is polluted with red herrings, not all spawned by Bayesians of course.” G.N. Wilkinson (Rothamsted Station)

Lindley’s discussion is more serious if not unkind. Dennis Lindley essentially follows the lead of the authors to conclude that “improper priors must go”. To the point of retracting what was written in his book! Although concluding about the consequences for standard statistics, since they allow for admissible procedures that are associated with improper priors. If the latter must go, the former must go as well!!! (A bit of sophistry involved in this argument…) Efron’s point is more constructive in this regard since he recalls the dangers of using proper priors with huge variance. And the little hope one can hold about having a prior that is uninformative in every dimension. (A point much more blatantly expressed by Dickey mocking “magic unique prior distributions”.) And Dempster points out even more clearly that the fundamental difficulty with these paradoxes is that the prior marginal does not exist. Don Fraser may be the most brutal discussant of all, stating that the paradoxes are not new and that “the conclusions are erroneous or unfounded”. Also complaining about Lindley’s review of his book [suggesting prior integration could save the day] in Biometrika, where he was not allowed a rejoinder. It reflects on the then intense opposition between Bayesians and fiducialist Fisherians. (Funnily enough, given the place of these marginalisation paradoxes in his book, I was mistakenly convinced that Jaynes was one of the discussants of this historical paper. He is mentioned in the reply by the authors.)

projection predictive input variable selection

Posted in Books, Statistics, University life on November 2, 2015 by xi'an

Juho Piironen and Aki Vehtari just arXived a paper on variable selection that relates to two projection papers we wrote in the 1990s with Costas Goutis (who died near Seattle in a diving accident in July 1996) and Jérôme Dupuis… Except that they move to the functional space of Gaussian processes. The covariance function in a Gaussian process is indeed based on a distance between observations, which are themselves defined as a vector of inputs. Some of which matter and some of which do not matter in the kernel value. When rescaling the distance with “length-scales” for all variables, one could think that non-significant variates have very small scales and hence bypass the need for variable selection but this is not the case as those coefficients react poorly to non-linearities in the variates… The paper thus builds a projective structure from a reference model involving all input variables.
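To make the role of the length-scales concrete, here is a minimal sketch of a squared-exponential covariance with one length-scale per input. The function name, the `signal_var` argument and the parameterisation (each coordinate difference is divided by its length-scale) are assumptions made for illustration, not taken from the paper. Under this particular convention an input is muted when its length-scale is large, that is, when the multiplicative scale 1/ℓ applied to its coordinate is very small.

```python
import math

def ard_sq_exp_kernel(x, y, length_scales, signal_var=1.0):
    """Squared-exponential covariance with one length-scale per input.

    Hypothetical helper: each coordinate difference is divided by its own
    length-scale before entering the squared distance.
    """
    sq_dist = sum(((a - b) / ell) ** 2
                  for a, b, ell in zip(x, y, length_scales))
    return signal_var * math.exp(-0.5 * sq_dist)

# Two observations that differ in both inputs.
x = [0.0, 0.0]
y = [1.0, 5.0]

# Both inputs contribute to the distance.
k_both = ard_sq_exp_kernel(x, y, [1.0, 1.0])

# A huge length-scale on the second input all but removes it from the kernel.
k_second_muted = ard_sq_exp_kernel(x, y, [1.0, 1e6])

# Kernel computed on the first input alone, for comparison.
k_first_only = ard_sq_exp_kernel(x[:1], y[:1], [1.0])

print(k_both, k_second_muted, k_first_only)
```

With the second input muted, the kernel value essentially coincides with the one computed on the first input alone. The sketch only shows the mechanics of the rescaling; it says nothing about how well estimated length-scales rank inputs when the response is non-linear in them, which is the concern raised above.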

“…adding some irrelevant inputs is not disastrous if the model contains a sparsifying prior structure, and therefore, one can expect to lose less by using all the inputs than by trying to differentiate between the relevant and irrelevant ones and ignoring the uncertainty related to the left-out inputs.”

While I of course appreciate this avatar of our original idea (with some borrowing from McCulloch and Rossi, 1992), the paper reminds me of some of the discussions and doubts we had about the role of the reference or super model that “anchors” the projections, as there is no reason for that reference model to be a better one. It could be that an iterative process where the selected submodel becomes the reference for the next iteration could enjoy better performance. When I first presented this work in Cagliari, in the late 1990s, one comment was that the method had no theoretical guarantee like consistency. Which is correct if the minimum distance is not evolving (how quickly?!) with the sample size n. I also remember the difficulty Jérôme and I had in figuring out a manageable forward-backward exploration of the (huge) set of acceptable subsets of variables. Random walk exploration and RJMCMC are unlikely to solve this problem.

why do we maximise the weights in empirical likelihood?

Posted in Books, Statistics, University life on October 29, 2013 by xi'an

Mark Johnson sent me the following question a few days ago:

I have one question about EL: how important is it to maximise the probabilities πi on the data items in the formula (stolen from the Wikipedia page on EL)?

\max_{\pi,\theta} \sum_{i=1}^n \ln\pi_i

You’re already replacing the max over θ with a distribution over θ. What about the πi?

It would seem to be “more Bayesian” to put a prior on the data item probabilities πi, and it would also seem to “do the right thing” in situations where there are several different π that have the same empirical likelihood.

This is a fairly reasonable question, which first reminds me of an issue we had examined with Costas Goutis, on his very last trip to Paris in 1996, a few months before he died in a diving accident near Seattle. We were wondering if treating the bandwidth in a non-parametric density estimator as a regular parameter was making sense. After experimenting for a few days with different priors we found that it was not such a great idea and that, instead, the prior on the bandwidth needed to depend on the sample size. This led to Costas’ posthumous paper, Nonparametric Estimation of a Mixing Density via the Kernel Method, in JASA in 1997 (with the kind help of Jianqing Fan).

Now, more to the point (of empirical likelihood), I am afraid that putting (almost) any kind of prior on the weights πi would be hopeless. For one thing, the πi are of the same size as the sample (modulo the identifying equation constraints) so estimating them based on a prior that does not depend on the sample size does not produce consistent estimators of the weights. (Search Bayesian nonparametric likelihood estimation for more advanced reasons.) Intuitively, it seems to me that the (true) parameter θ of the (unknown or unavailable) distribution of the data does not make sense in the non-parametric setting or, conversely, that the weights πi have no meaning for the inference on θ. It thus sounds difficult to treat them together and on an equal footing. The approximation

\max_{\pi} \sum_{i=1}^n \ln\pi_i

is a function of θ that replaces the unknown or unavailable likelihood, in which the weights have no statistical meaning. But this is a wee bit of a weak argument as other solutions than the maximisation of the entropy could be used to determine the weights.
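As an illustration of how this maximisation over the weights yields a function of θ, here is a minimal sketch for the simplest case, where θ is a mean and the identifying constraint is Σ πi (xi − θ) = 0 together with Σ πi = 1. The choice of the mean constraint, the function name `el_log_profile` and the bisection solver are assumptions made for the example. The maximising weights then take the standard form πi = 1 / (n (1 + λ (xi − θ))), with the Lagrange multiplier λ solving Σ (xi − θ) / (1 + λ (xi − θ)) = 0.

```python
import math

def el_log_profile(xs, theta, tol=1e-12, max_iter=200):
    """Return (max over pi of sum_i log pi_i, maximising weights) for a mean theta.

    Constraints assumed here: sum_i pi_i = 1 and sum_i pi_i (x_i - theta) = 0.
    """
    n = len(xs)
    z = [x - theta for x in xs]
    zmin, zmax = min(z), max(z)
    if zmin >= 0 or zmax <= 0:
        # theta lies outside the range of the data: no positive weights exist.
        return float("-inf"), []

    # lambda must keep every 1 + lambda * z_i positive.
    lo, hi = -1.0 / zmax, -1.0 / zmin
    eps = 1e-10 * (hi - lo)
    lo, hi = lo + eps, hi - eps

    def g(lam):
        # Decreasing in lam; its root gives the constrained maximum.
        return sum(zi / (1.0 + lam * zi) for zi in z)

    # Plain bisection on the root of g.
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break

    lam = 0.5 * (lo + hi)
    weights = [1.0 / (n * (1.0 + lam * zi)) for zi in z]
    return sum(math.log(w) for w in weights), weights

# Usage: the profile is largest at the sample mean and decreases away from it.
data = [1.0, 2.0, 4.0, 7.0]
for theta in (2.5, 3.0, 3.5, 4.0):
    value, _ = el_log_profile(data, theta)
    print(theta, value)
```

At the sample mean the weights are uniform and the value equals −n log n; any other θ forces the weights away from 1/n and lowers the value. The weights appear here purely as a device to produce a number for each θ, which is the sense in which they carry no statistical meaning of their own.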

In the end, this remains a puzzling issue (and hence a great question), hinting at the difficulty of replacing the true model with an approximation on the one hand and aiming at estimating the true parameter(s) on the other hand.