Last evening, I attended the RSS Midlands seminar here in Warwick. The theme was chain event graphs (CEGs). As I knew nothing about them, it was worth my time listening to both speakers and discussing with Jim Smith afterwards. CEGs are extensions of Bayes nets that initially have many more nodes, since they start from the probability tree involving all modalities of all variables. Intensive Bayesian model comparison is then used to reduce the number of nodes, by merging modalities having the same children or removing variables with no impact on the variable of interest. So this is not exactly a new Bayes net based on modality dummies as nodes (my original question). This is quite interesting, esp. in the first talk's illustration of using missing value indicators as a supplementary variable (to determine whether or not data is missing at random). I also wonder how much of a connection there is with variable length Markov chains (either as a model or as a way to prune the tree). A last vague idea is a potential connection with lumpable Markov chains, a concept I learned from Kemeny & Snell (1960): a finite Markov chain is lumpable if, after merging two or more of its states, it remains a Markov chain. I do not know if this has ever been studied from a statistical point of view, i.e. testing for lumpability, but this sounds related to the idea of merging modalities of some variables in the probability tree…
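To make the lumpability condition concrete, here is a minimal sketch in Python. It assumes the transition matrix is known exactly (so this is a deterministic check, not the statistical test wondered about above) and uses the standard strong-lumpability criterion: for every pair of blocks in the partition, all states of the first block must have the same total probability of jumping into the second. The function name `lumped_matrix` and the toy matrix `P` are illustrative choices, not taken from Kemeny & Snell.

```python
from math import isclose


def lumped_matrix(P, partition, tol=1e-12):
    """Return the lumped transition matrix if the chain with transition
    matrix P is strongly lumpable with respect to `partition`, else None.

    P is a list of rows; partition is a list of lists of state indices.
    """
    lumped = []
    for block in partition:
        row = []
        for target in partition:
            # Probability of moving into `target` from each state of `block`.
            probs = [sum(P[s][t] for t in target) for s in block]
            # Lumpability requires these to agree across the whole block.
            if any(not isclose(p, probs[0], abs_tol=tol) for p in probs):
                return None
            row.append(probs[0])
        lumped.append(row)
    return lumped


# Hypothetical 3-state chain: states 1 and 2 can be merged, states 0 and 1 cannot.
P = [
    [0.5, 0.25, 0.25],
    [0.2, 0.4, 0.4],
    [0.2, 0.3, 0.5],
]
merged_ok = lumped_matrix(P, [[0], [1, 2]])   # a 2x2 transition matrix
merged_bad = lumped_matrix(P, [[0, 1], [2]])  # None: the rows disagree
```

With estimated rather than known transition probabilities, the exact equality above would have to be replaced by a proper test or a Bayesian model comparison, which is precisely the open question.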
Archive for RSS
Although I could not stay at the RSS Annual Conference for the three days, I would have liked to do so, as there were several interesting sessions, from MCMC talks by Axel Finke, Din-Houn Lau, Anthony Lee and Michael Betancourt, to the session on Anti-fragility, the concept put forward by Nassim Taleb in his latest book (reviewed before completion by Larry Wasserman). I find it rather surprising that the RSS is dedicating a whole session to this, but the usually anti-statistics stance of Taleb (esp. in The Black Swan) may explain it (and the equally surprising debate between a “pro-Taleb” and a “pro-Silver”). I will also miss Sharon McGrayne’s talk on the Bayesian revolution, but look forward to hearing it at the Bayes-250 day in Duke next December. And I could certainly have benefited from the training session about building a package in R. It seemed, however, that one-day attendance was a choice made by many participants in the conference, judging from the ability to register for one or two days and from the (biased) sample of my friends.
Incidentally, the conference gave me the opportunity to discover Newcastle and Tynemouth, enjoying the architecture of Grey Street and running on the huge meadows almost at the city centre, among herds of cows in the morning fog. (I wish I had had more time to reach the neighbouring Hadrian’s Wall and Durham, which I only spotted from the train to B’ham!)
Today, I attended the RSS Annual Conference in Newcastle-upon-Tyne. For one thing, I ran a Memorial session in memory of George Casella, with my (and his) friends Jim Hobert and Elias Moreno as speakers. (The session was well-attended if not overwhelmingly so.) For another thing, the RSS decided to have the DIC Read Paper by David Spiegelhalter, Nicky Best, Brad Carlin and Angelika van der Linde, “Bayesian measures of model complexity and fit”, re-Read, and I was asked to re-discuss the 2002 paper. Here are the slides of my discussion, borrowing from the 2006 Bayesian Analysis paper with Gilles Celeux, Florence Forbes, and Mike Titterington where we examined eight different versions of DIC for mixture models. (I refrained from using the title “snow white and the seven DICs” for a slide…) I also borrowed from our recent discussion of Murray Aitkin’s (2009) book. The other discussant was Elias Moreno, who focussed on consistency issues. (More on this and David Spiegelhalter’s defence in a few posts!) This was the first time I was giving a talk on a basketball court (I once gave an exam there!)
The great discussion Tony O’Hagan had with Dennis Lindley last March for the Bayes 250 meeting at the RSS is now available on line.
Since this is still close to Dennis’s birthday, I take the opportunity to wish him the best for his 90th birthday.
Here is the reply by Chris and Steve about my comments from yesterday:
Thanks to Christian for the comments and feedback on our paper “A General Framework for Updating Belief Distributions”. We agree with Christian that starting with a summary statistic, or statistics, is an anchor for inference or learning, providing direction and guidance for models, avoiding the alternative vague notion of attempting to model a complete data set. The latter idea has dominated the Bayesian methodology for decades, but with the advent of large and complex data sets, this is becoming increasingly challenging, if not impossible.
However, in order to work with statistics of interest, we need to find a framework in which this direct approach can be supported by a learning strategy when the formal use of Bayes theorem is not applicable. We achieve this in the paper for a general class of loss functions, which connect observations with a target of interest. A point raised by Christian is how arbitrary these loss functions are. We do not see this at all; for if a target has been properly identified then the most primitive construct between observations informing about a target and the target would come in the form of a loss function. One should always be able to assess the loss of ascertaining a value of $\theta$ as an action and providing the loss in the presence of observation $x$. The question to be discussed is whether loss functions are objective, as in the case of the median loss,
$$l(\theta,x)=|\theta-x|,$$
or subjective, as in the case of the choice between loss functions for estimating a location of a distribution: mean, median or mode? But our work is situated in the former position.
Previous work on loss functions, mostly in the classical literature, has spent a lot of space working out what are optimal loss functions for targets of interest. We are not really dealing with novel targets and so we can draw on the classic literature here. The work can be thought of as the Bayesian version of the M-estimator and associated ideas. In this respect we are dealing with two loss functions for updating belief distributions, one for the data, which we have just discussed, and one for the prior information, which, due to coherence principles, must be the Kullback-Leibler divergence. This raises the thorny issue of how to calibrate the two loss functions. We discuss this in the paper.
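As a rough illustration of what such a loss-based update looks like in practice, here is a minimal Python sketch on a grid. It assumes the update takes the exponentiated-loss form, posterior ∝ exp(−w × cumulative loss) × prior, and uses the absolute loss so that the target is a median. The function names, the grid, the toy data, the flat prior and the weight `w=1.0` are all illustrative assumptions; in particular, the choice of `w` is exactly the calibration issue mentioned above and is not settled by this sketch.

```python
from math import exp


def loss_based_update(grid, prior, data, loss, w=1.0):
    """Grid approximation of an update proportional to
    exp(-w * sum_i loss(theta, x_i)) * prior(theta).

    `w` weights the loss-to-data term against the prior (an assumption here).
    """
    cum = [w * sum(loss(theta, x) for x in data) for theta in grid]
    shift = min(cum)  # subtract the minimum for numerical stability
    unnorm = [p * exp(-(c - shift)) for p, c in zip(prior, cum)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def absolute_loss(theta, x):
    """Loss whose minimiser over theta is a median of the observations."""
    return abs(theta - x)


# Hypothetical data with one outlier; flat prior on a grid over [-5, 5].
grid = [i / 10 for i in range(-50, 51)]
prior = [1 / len(grid)] * len(grid)
data = [0.2, 1.1, 0.9, 1.4, 7.5]

post = loss_based_update(grid, prior, data, absolute_loss, w=1.0)
mode = grid[max(range(len(grid)), key=post.__getitem__)]
```

With a flat prior, the updated distribution peaks at the sample median of the toy data, unaffected by the outlier, which is the M-estimator behaviour alluded to above. Increasing `w` concentrates the update around that value, while decreasing it lets the prior dominate.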
To then deal with the statistic problem, mentioned at the start of this discussion, we have found a nice way to proceed by using the loss function . How this loss function, combined with the use of the exponential family, can be used to estimate functionals of the type
is provided in the Walker talk at Bayes 250 in London, titled “The Misspecified Bayesian”, since the “model” is designed to be misspecified, a tool to estimate and learn about I only. The basic idea is to evaluate I by ensuring that we learn about the for which
This is the story of the background; we would now like to pick up in more detail on three important points that you raise in your post:
- The arbitrariness in selecting the loss function.
- The relative weighting of loss-to-data vs. loss-to-prior.
- The selection of the loss in the M-free case.
In the absence of complete knowledge of the data generating mechanism, i.e. outside of M-closed,
- The arbitrariness in selecting the loss function. We believe the statistician should weigh up the relative arbitrariness in selecting a loss function targeting the statistic of interest versus the arbitrariness of selecting a misspecified model, known not to be true, for the complete data generating mechanism. There is a wealth of literature on how to select optimal loss functions that target specific statistics, e.g. Huber (2009) provides a comprehensive overview of how this should be done. As far as we are aware, there are no formal procedures (that do not rely on loss functions) to select a false sampling distribution for the whole of x; see Key, Pericchi and Smith (1999).
- The relative weighting of loss-to-data vs. loss-to-prior. This is an interesting open problem. Our framework shows that, in the absence of M-closed or the use of self-information loss, the analyst must select this weighting. In our paper we suggest some default procedures. We have nowhere claimed these were “correct”. You raise concerns regarding parameterisation and we agree with you that care is needed, but many of these issues equally hold for existing “Objective” or “Default” Bayes procedures, such as unit-information priors.
- The selection of the loss in M-free. You say “….there is no optimal choice for the substitute to the loss function…”. We disagree. Our approach is to select an established loss function that directly targets the statistic of interest, and elicit prior beliefs directly on the unknown value of this statistic. There is no notion here of a pseudo-likelihood or where this converges to.
Thank you again to Christian for his critical observations!