## a Bayesian criterion for singular models [discussion]

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , , , , , on October 10, 2016 by xi'an

[Here is the discussion Judith Rousseau and I wrote about the paper by Mathias Drton and Martyn Plummer, a Bayesian criterion for singular models, which was discussed last week at the Royal Statistical Society. There is still time to send a written discussion! Note: This post was written using the latex2wp converter.]

It is a well-known fact that the BIC approximation of the marginal likelihood in a given irregular model ${\mathcal M_k}$ fails or may fail. The BIC approximation has the form

$\displaystyle BIC_k = \log p(\mathbf Y_n| \hat \pi_k, \mathcal M_k) - d_k \log n /2$

where ${d_k }$ corresponds on the number of parameters to be estimated in model ${\mathcal M_k}$. In irregular models the dimension ${d_k}$ typically does not provide a good measure of complexity for model ${\mathcal M_k}$, at least in the sense that it does not lead to an approximation of

$\displaystyle \log m(\mathbf Y_n |\mathcal M_k) = \log \left( \int_{\mathcal M_k} p(\mathbf Y_n| \pi_k, \mathcal M_k) dP(\pi_k|k )\right) \,.$

A way to understand the behaviour of ${\log m(\mathbf Y_n |\mathcal M_k) }$ is through the effective dimension

$\displaystyle \tilde d_k = -\lim_n \frac{ \log P( \{ KL(p(\mathbf Y_n| \pi_0, \mathcal M_k) , p(\mathbf Y_n| \pi_k, \mathcal M_k) ) \leq 1/n | k ) }{ \log n}$

when it exists, see for instance the discussions in Chambaz and Rousseau (2008) and Rousseau (2007). Watanabe (2009} provided a more precise formula, which is the starting point of the approach of Drton and Plummer:

$\displaystyle \log m(\mathbf Y_n |\mathcal M_k) = \log p(\mathbf Y_n| \hat \pi_k, \mathcal M_k) - \lambda_k(\pi_0) \log n + [m_k(\pi_0) - 1] \log \log n + O_p(1)$

where ${\pi_0}$ is the true parameter. The authors propose a clever algorithm to approximate of the marginal likelihood. Given the popularity of the BIC criterion for model choice, obtaining a relevant penalized likelihood when the models are singular is an important issue and we congratulate the authors for it. Indeed a major advantage of the BIC formula is that it is an off-the-shelf crierion which is implemented in many softwares, thus can be used easily by non statisticians. In the context of singular models, a more refined approach needs to be considered and although the algorithm proposed by the authors remains quite simple, it requires that the functions ${ \lambda_k(\pi)}$ and ${m_k(\pi)}$ need be known in advance, which so far limitates the number of problems that can be thus processed. In this regard their equation (3.2) is both puzzling and attractive. Attractive because it invokes nonparametric principles to estimate the underlying distribution; puzzling because why should we engage into deriving an approximation like (3.1) and call for Bayesian principles when (3.1) is at best an approximation. In this case why not just use a true marginal likelihood?

1. Why do we want to use a BIC type formula?

The BIC formula can be viewed from a purely frequentist perspective, as an example of penalised likelihood. The difficulty then stands into choosing the penalty and a common view on these approaches is to choose the smallest possible penalty that still leads to consistency of the model choice procedure, since it then enjoys better separation rates. In this case a ${\log \log n}$ penalty is sufficient, as proved in Gassiat et al. (2013). Now whether or not this is a desirable property is entirely debatable, and one might advocate that for a given sample size, if the data fits the smallest model (almost) equally well, then this model should be chosen. But unless one is specifying what equally well means, it does not add much to the debate. This also explains the popularity of the BIC formula (in regular models), since it approximates the marginal likelihood and thus benefits from the Bayesian justification of the measure of fit of a model for a given data set, often qualified of being a Bayesian Ockham’s razor. But then why should we not compute instead the marginal likelihood? Typical answers to this question that are in favour of BIC-type formula include: (1) BIC is supposingly easier to compute and (2) BIC does not call for a specification of the prior on the parameters within each model. Given that the latter is a difficult task and that the prior can be highly influential in non-regular models, this may sound like a good argument. However, it is only apparently so, since the only justification of BIC is purely asymptotic, namely, in such a regime the difficulties linked to the choice of the prior disappear. This is even more the case for the sBIC criterion, since it is only valid if the parameter space is compact. Then the impact of the prior becomes less of an issue as non informative priors can typically be used. With all due respect, the solution proposed by the authors, namely to use the posterior mean or the posterior mode to allow for non compact parameter spaces, does not seem to make sense in this regard since they depend on the prior. The same comments apply to the author’s discussion on Prior’s matter for sBIC. Indeed variations of the sBIC could be obtained by penalizing for bigger models via the prior on the weights, for instance as in Mengersen and Rousseau (2011) or by, considering repulsive priors as in Petralia et al. (20120, but then it becomes more meaningful to (again) directly compute the marginal likelihood. Remains (as an argument in its favour) the relative computational ease of use of sBIC, when compared with the marginal likelihood. This simplification is however achieved at the expense of requiring a deeper knowledge on the behaviour of the models and it therefore looses the off-the-shelf appeal of the BIC formula and the range of applications of the method, at least so far. Although the dependence of the approximation of ${\log m(\mathbf Y_n |\mathcal M_k)}$ on ${\mathcal M_j }$, \$latex {j \leq k} is strange, this does not seem crucial, since marginal likelihoods in themselves bring little information and they are only meaningful when compared to other marginal likelihoods. It becomes much more of an issue in the context of a large number of models.

2. Should we care so much about penalized or marginal likelihoods ?

Marginal or penalized likelihoods are exploratory tools in a statistical analysis, as one is trying to define a reasonable model to fit the data. An unpleasant feature of these tools is that they provide numbers which in themselves do not have much meaning and can only be used in comparison with others and without any notion of uncertainty attached to them. A somewhat richer approach of exploratory analysis is to interrogate the posterior distributions by either varying the priors or by varying the loss functions. The former has been proposed in van Havre et l. (2016) in mixture models using the prior tempering algorithm. The latter has been used for instance by Yau and Holmes (2013) for segmentation based on Hidden Markov models. Introducing a decision-analytic perspective in the construction of information criteria sounds to us like a reasonable requirement, especially when accounting for the current surge in studies of such aspects.

[Posted as arXiv:1610.02503]

## read paper [in Bristol]

Posted in Books, pictures, Statistics, Travel, University life with tags , , , , , , , , , , , , , , on January 29, 2016 by xi'an

I went to give a seminar in Bristol last Friday and I chose to present the testing with mixture paper. As we are busy working on the revision, I was eagerly looking for comments and criticisms that could strengthen this new version. As it happened, the (Bristol) Bayesian Cake (Reading) Club had chosen our paper for discussion, two weeks in a row!, hence the title!, and I got invited to join the group the morning prior to the seminar! This was, of course, most enjoyable and relaxed, including an home-made cake!, but also quite helpful in assessing our arguments in the paper. One point of contention or at least of discussion was the common parametrisation between the components of the mixture. Although all parametrisations are equivalent from a single component point of view, I can [almost] see why using a mixture with the same parameter value on all components may impose some unsuspected constraint on that parameter. Even when the parameter is the same moment for both components. This still sounds like a minor counterpoint in that the weight should converge to either zero or one and hence eventually favour the posterior on the parameter corresponding to the “true” model.

Another point that was raised during the discussion is the behaviour of the method under misspecification or for an M-open framework: when neither model is correct does the weight still converge to the boundary associated with the closest model (as I believe) or does a convexity argument produce a non-zero weight as it limit (as hinted by one example in the paper)? I had thought very little about this and hence had just as little to argue though as this does not sound to me like the primary reason for conducting tests. Especially in a Bayesian framework. If one is uncertain about both models to be compared, one should have an alternative at the ready! Or use a non-parametric version, which is a direction we need to explore deeper before deciding it is coherent and convergent!

A third point of discussion was my argument that mixtures allow us to rely on the same parameter and hence the same prior, whether proper or not, while Bayes factors are less clearly open to this interpretation. This was not uniformly accepted!

Thinking afresh about this approach also led me to broaden my perspective on the use of the posterior distribution of the weight(s) α: while previously I had taken those weights mostly as a proxy to the posterior probabilities, to be calibrated by pseudo-data experiments, as for instance in Figure 9, I now perceive them primarily as the portion of the data in agreement with the corresponding model [or hypothesis] and more importantly as a solution for staying away from a Neyman-Pearson-like decision. Or error evaluation. Usually, when asked about the interpretation of the output, my answer is to compare the behaviour of the posterior on the weight(s) with a posterior associated with a sample from each model. Which does sound somewhat similar to posterior predictives if the samples are simulated from the associated predictives. But the issue was not raised during the visit to Bristol, which possibly reflects on how unfrequentist the audience was [the Statistics group is], as it apparently accepted with no further ado the use of a posterior distribution as a soft assessment of the comparative fits of the different models. If not necessarily agreeing the need of conducting hypothesis testing (especially in the case of the Pima Indian dataset!).

## discussions on Gerber and Chopin

Posted in Books, Kids, Statistics, University life with tags , , , , , , , , , , , , , , , on May 29, 2015 by xi'an

As a coincidence, I received my copy of JRSS Series B with the Read Paper by Mathieu Gerber and Nicolas Chopin on sequential quasi Monte Carlo just as I was preparing an arXival of a few discussions on the paper! Among the [numerous and diverse] discussions, a few were of particular interest to me [I highlighted members of the University of Warwick and of Université Paris-Dauphine to suggest potential biases!]:

1. Mike Pitt (Warwick), Murray Pollock et al.  (Warwick) and Finke et al. (Warwick) all suggested combining quasi Monte Carlo with pseudomarginal Metropolis-Hastings, pMCMC (Pitt) and Rao-Bklackwellisation (Finke et al.);
2. Arnaud Doucet pointed out that John Skilling had used the Hilbert (ordering) curve in a 2004 paper;
3. Chris Oates, Dan Simpson and Mark Girolami (Warwick) suggested combining quasi Monte Carlo with their functional control variate idea;
4. Richard Everitt wondered about the dimension barrier of d=6 and about possible slice extensions;
5. Zhijian He and Art Owen pointed out simple solutions to handle a random number of uniforms (for simulating each step in sequential Monte Carlo), namely to start with quasi Monte Carlo and end up with regular Monte Carlo, in an hybrid manner;
6. Hans Künsch points out the connection with systematic resampling à la Carpenter, Clifford and Fearnhead (1999) and wonders about separating the impact of quasi Monte Carlo between resampling and propagating [which vaguely links to one of my comments];
7. Pierre L’Ecuyer points out a possible improvement over the Hilbert curve by a preliminary sorting;
8. Frederik Lindsten and Sumeet Singh propose using ABC to extend the backward smoother to intractable cases [but still with a fixed number of uniforms to use at each step], as well as Mateu and Ryder (Paris-Dauphine) for a more general class of intractable models;
9. Omiros Papaspiliopoulos wonders at the possibility of a quasi Markov chain with “low discrepancy paths”;
10. Daniel Rudolf suggest linking the error rate of sequential quasi Monte Carlo with the bounds of Vapnik and Ĉervonenkis (1977).

The arXiv document also includes the discussions by Julyan Arbel and Igor Prünster (Turino) on the Bayesian nonparametric side of sqMC and by Robin Ryder (Dauphine) on the potential of sqMC for ABC.

## statistical modelling of citation exchange between statistics journals

Posted in Books, Statistics, University life with tags , , , , , on April 10, 2015 by xi'an

Cristiano Varin, Manuela Cattelan and David Firth (Warwick) have written a paper on the statistical analysis of citations and index factors, paper that is going to be Read at the Royal Statistical Society next May the 13th. And hence is completely open to contributed discussions. Now, I have written several entries on the ‘Og about the limited trust I set to citation indicators, as well as about the abuse made of those. However I do not think I will contribute to the discussion as my reservations are about the whole bibliometrics excesses and not about the methodology used in the paper.

The paper builds several models on the citation data provided by the “Web of Science” compiled by Thompson Reuters. The focus is on 47 Statistics journals, with a citation horizon of ten years, which is much more reasonable than the two years in the regular impact factor. A first feature of interest in the descriptive analysis of the data is that all journals have a majority of citations from and to journals outside statistics or at least outside the list. Which I find quite surprising. The authors also build a cluster based on the exchange of citations, resulting in rather predictable clusters, even though JCGS and Statistics and Computing escape the computational cluster to end up in theory and methods along Annals of Statistics and JRSS Series B.

In addition to the unsavoury impact factor, a ranking method discussed in the paper is the eigenfactor score that starts with a Markov exploration of articles by going at random to one of the papers in the reference list and so on. (Which shares drawbacks with the impact factor, e.g., in that it does not account for the good or bad reason the paper is cited.) Most methods produce the Big Four at the top, with Series B ranked #1, and Communications in Statistics A and B at the bottom, along with Journal of Applied Statistics. Again, rather anticlimactic.

The major modelling input is based on Stephen Stigler’s model, a generalised linear model on the log-odds of cross citations. The Big Four once again receive high scores, with Series B still much ahead. (The authors later question the bias due to the Read Paper effect, but cannot easily evaluate this impact. While some Read Papers like Spiegelhalter et al. 2002 DIC do generate enormous citation traffic, to the point of getting re-read!, other journals also contain discussion papers. And are free to include an on-line contributed discussion section if they wish.) Using an extra ranking lasso step does not change things.

In order to check the relevance of such rankings, the authors also look at the connection with the conclusions of the (UK) 2008 Research Assessment Exercise. They conclude that the normalised eigenfactor score and Stigler model are more correlated with the RAE ranking than the other indicators.  Which means either that the scores are good predictors or that the RAE panel relied too heavily on bibliometrics! The more global conclusion is that clusters of journals or researchers have very close indicators, hence that ranking should be conducted with more caution that it is currently. And, more importantly, that reverting the indices from journals to researchers has no validation and little information.

## not Bayesian enough?!

Posted in Books, Statistics, University life with tags , , , , , , , on January 23, 2015 by xi'an

Our random forest paper was alas rejected last week. Alas because I think the approach is a significant advance in ABC methodology when implemented for model choice, avoiding the delicate selection of summary statistics and the report of shaky posterior probability approximation. Alas also because the referees somewhat missed the point, apparently perceiving random forests as a way to project a large collection of summary statistics on a limited dimensional vector as in the Read Paper of Paul Fearnhead and Dennis Prarngle, while the central point in using random forests is the avoidance of a selection or projection of summary statistics.  They also dismissed ou approach based on the argument that the reduction in error rate brought by random forests over LDA or standard (k-nn) ABC is “marginal”, which indicates a degree of misunderstanding of what the classification error stand for in machine learning: the maximum possible gain in supervised learning with a large number of classes cannot be brought arbitrarily close to zero. Last but not least, the referees did not appreciate why we mostly cannot trust posterior probabilities produced by ABC model choice and hence why the posterior error loss is a valuable and almost inevitable machine learning alternative, dismissing the posterior expected loss as being not Bayesian enough (or at all), for “averaging over hypothetical datasets” (which is a replicate of Jeffreys‘ famous criticism of p-values)! Certainly a first time for me to be rejected based on this argument!

## not converging to London for an [extra]ordinary Read Paper

Posted in Books, Kids, pictures, Statistics, Travel, University life with tags , , , , , , , on November 21, 2014 by xi'an

On December 10, I will alas not travel to London to attend the Read Paper on sequential quasi-Monte Carlo presented by Mathieu Gerber and Nicolas Chopin to The Society, as I fly instead to Montréal for the NIPS workshops… I am quite sorry to miss this event, as this is a major paper which brings quasi-Monte Carlo methods into mainstream statistics. I will most certainly write a discussion and remind Og’s readers that contributed (800 words) discussions are welcome from everyone, the deadline for submission being January 02.

## Statistics and Computing special MCMSk’issue [call for papers]

Posted in Books, Mountains, R, Statistics, University life with tags , , , , , , , , , , , on February 7, 2014 by xi'an

Following the exciting and innovative talks, posters and discussions at MCMski IV, the editor of Statistics and Computing, Mark Girolami (who also happens to be the new president-elect of the BayesComp section of ISBA, which is taking over the management of future MCMski meetings), kindly proposed to publish a special issue of the journal open to all participants to the meeting. Not only to speakers, mind, but to all participants.

So if you are interested in submitting a paper to this special issue of a computational statistics journal that is very close to our MCMski themes, I encourage you to do so. (Especially if you missed the COLT 2014 deadline!) The deadline for submissions is set on March 15 (a wee bit tight but we would dearly like to publish the issue in 2014, namely the same year as the meeting.) Submissions are to be made through the Statistics and Computing portal, with a mention that they are intended for the special issue.

An editorial committee chaired by Antonietta Mira and composed of Christophe Andrieu, Brad Carlin, Nicolas Chopin, Jukka Corander, Colin Fox, Nial Friel, Chris Holmes, Gareth Jones, Peter Müller, Antonietta Mira, Geoff Nicholls, Gareth Roberts, Håvård Rue, Robin Ryder, and myself, will examine the submissions and get back within a few weeks to the authors. In a spirit similar to the JRSS Read Paper procedure, submissions will first be examined collectively, before being sent to referees. We plan to publish the reviews as well, in order to include a global set of comments on the accepted papers. We intend to do it in The Economist style, i.e. as a set of edited anonymous comments. Usual instructions for Statistics and Computing apply, with the additional requirements that the paper should be around 10 pages and include at least one author who took part in MCMski IV.