## statistical modelling of citation exchange between statistics journals

Posted in Books, Statistics, University life on April 10, 2015 by xi'an

Cristiano Varin, Manuela Cattelan and David Firth (Warwick) have written a paper on the statistical analysis of citations and index factors, a paper that is going to be Read at the Royal Statistical Society next May the 13th, and hence is completely open to contributed discussions. Now, I have written several entries on the ‘Og about the limited trust I place in citation indicators, as well as about the abuse made of those. However, I do not think I will contribute to the discussion, as my reservations are about bibliometric excesses as a whole and not about the methodology used in the paper.

The paper builds several models on the citation data provided by the “Web of Science” compiled by Thomson Reuters. The focus is on 47 Statistics journals, with a citation horizon of ten years, which is much more reasonable than the two years in the regular impact factor. A first feature of interest in the descriptive analysis of the data is that all journals have a majority of citations from and to journals outside statistics or at least outside the list, which I find quite surprising. The authors also build a clustering based on the exchange of citations, resulting in rather predictable clusters, even though JCGS and Statistics and Computing escape the computational cluster to end up in theory and methods alongside Annals of Statistics and JRSS Series B.

In addition to the unsavoury impact factor, a ranking method discussed in the paper is the eigenfactor score that starts with a Markov exploration of articles by going at random to one of the papers in the reference list and so on. (Which shares drawbacks with the impact factor, e.g., in that it does not account for the good or bad reason the paper is cited.) Most methods produce the Big Four at the top, with Series B ranked #1, and Communications in Statistics A and B at the bottom, along with Journal of Applied Statistics. Again, rather anticlimactic.
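To make the Markov exploration concrete, here is a minimal sketch of a random walk over a toy citation table. Everything in it is an assumption made for illustration: the three-journal count matrix is made up, and the damping factor of 0.85 and the uniform teleportation step are PageRank-style choices. The actual eigenfactor computation involves further refinements that are not reproduced here.

```python
def stationary_scores(cites, damping=0.85, n_iter=200):
    """Power iteration for a random walk that follows references.

    cites[i][j] = number of citations from journal i to journal j
    (hypothetical toy counts). A reader in journal i moves to journal j
    with probability proportional to cites[i][j], and jumps to a
    uniformly chosen journal with probability 1 - damping.
    """
    n = len(cites)
    # Row-normalise the counts into transition probabilities.
    trans = []
    for row in cites:
        total = sum(row)
        trans.append([c / total for c in row] if total > 0 else [1.0 / n] * n)
    scores = [1.0 / n] * n
    for _ in range(n_iter):
        # Teleportation mass first, then mass carried along citations.
        new = [(1.0 - damping) / n] * n
        for i in range(n):
            for j in range(n):
                new[j] += damping * scores[i] * trans[i][j]
        scores = new
    return scores


# Hypothetical counts: row = citing journal, column = cited journal.
toy_cites = [
    [0, 30, 10],
    [20, 0, 5],
    [40, 25, 0],
]
toy_scores = stationary_scores(toy_cites)
```

The resulting scores sum to one and only reflect how often the walk visits each journal, which is exactly why such a score cannot tell whether a paper is cited for a good or a bad reason.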

The major modelling input is based on Stephen Stigler’s model, a generalised linear model on the log-odds of cross citations. The Big Four once again receive high scores, with Series B still much ahead. (The authors later question the bias due to the Read Paper effect, but cannot easily evaluate this impact. While some Read Papers like Spiegelhalter et al. 2002 DIC do generate enormous citation traffic, to the point of getting re-read!, other journals also contain discussion papers. And are free to include an on-line contributed discussion section if they wish.) Using an extra ranking lasso step does not change things.
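As a rough illustration of a model on the log-odds of cross citations, the sketch below fits a Bradley-Terry-type model in which the log-odds that journal i is cited by journal j, rather than the reverse, is mu_i - mu_j. The count table is invented, the function name `fit_export_scores` is hypothetical, and the fit uses a plain minorisation-maximisation iteration for maximum likelihood, with no claim that this matches the estimation procedure of the paper.

```python
import math


def fit_export_scores(cited_by, n_iter=500):
    """Fit centred scores mu with log-odds(i cited by j vs. j cited by i) = mu_i - mu_j.

    cited_by[i][j] = number of times journal i is cited by journal j
    (hypothetical toy counts). Uses the standard MM update for the
    Bradley-Terry likelihood.
    """
    n = len(cited_by)
    # Total "exports" of each journal, i.e. citations received from the others.
    wins = [sum(cited_by[i][j] for j in range(n) if j != i) for i in range(n)]
    p = [1.0] * n
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            denom = sum(
                (cited_by[i][j] + cited_by[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(wins[i] / denom)
        # Rescale to fix the arbitrary multiplicative constant.
        total = sum(new_p)
        p = [v * n / total for v in new_p]
    mu = [math.log(v) for v in p]
    mean_mu = sum(mu) / n
    return [m - mean_mu for m in mu]


# Hypothetical cross-citation counts between three journals.
toy_cited_by = [
    [0, 60, 80],
    [30, 0, 50],
    [20, 25, 0],
]
toy_mu = fit_export_scores(toy_cited_by)
```

In this toy table the first journal is cited by the other two more often than it cites them, so it ends up with the largest score. Only differences between scores are identified, hence the centring.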

In order to check the relevance of such rankings, the authors also look at the connection with the conclusions of the (UK) 2008 Research Assessment Exercise. They conclude that the normalised eigenfactor score and Stigler model are more correlated with the RAE ranking than the other indicators. Which means either that the scores are good predictors or that the RAE panel relied too heavily on bibliometrics! The more global conclusion is that clusters of journals or researchers have very close indicators, hence that ranking should be conducted with more caution than it is currently. And, more importantly, that transferring the indices from journals to researchers has no validation and little information.

## comparison of Bayesian predictive methods for model selection

Posted in Books, Statistics, University life on April 9, 2015 by xi'an

“Dupuis and Robert (2003) proposed choosing the simplest model with enough explanatory power, for example 90%, but did not discuss the effect of this threshold for the predictive performance of the selected models. We note that, in general, the relative explanatory power is an unreliable indicator of the predictive performance of the submodel,”

Juho Piironen and Aki Vehtari arXived a survey on Bayesian model selection methods that is a sequel to the extensive survey of Vehtari and Ojanen (2012). Because most of the methods described in this survey stem from Kullback-Leibler proximity calculations, it includes some description of our posterior projection method with Costas Goutis and Jérôme Dupuis. We indeed did not consider prediction in our papers and even failed to include a consistency result, as was pointed out to me by my discussant in a model choice meeting in Cagliari, in … 1999! Still, I remain fond of the notion of defining a prior on the embedding model and of deducing priors on the parameters of the submodels by Kullback-Leibler projections. It obviously relies on the notion that the embedding model is “true” and that the submodels are only approximations. In the simulation experiments included in this survey, the projection method “performs best in terms of the predictive ability” (p.15) and “is much less vulnerable to the selection induced bias” (p.16).

Reading the other parts of the survey, I also came to the perspective that model averaging makes much more sense than model choice in predictive terms. Sounds obvious stated that way but it took me a while to come to this conclusion. Now, with our mixture representation, model averaging also comes as a natural consequence of the modelling, a point presumably not stressed enough in the current version of the paper. On the other hand, the MAP model now strikes me as artificial and linked to a very rudimentary loss function. A loss that does not account for the final purpose(s) of the model. And does not connect to the “all models are wrong” theorem.

## an email exchange about integral representations

Posted in Books, R, Statistics, University life on April 8, 2015 by xi'an

I had an interesting email exchange [or rather exchange of emails] with a (German) reader of Introducing Monte Carlo Methods with R in the past days, as he had difficulties with the validation of the accept-reject algorithm via the integral

$\mathbb{P}(Y\in \mathcal{A},U\le f(Y)/Mg(Y)) = \int_\mathcal{A} \int_0^{f(y)/Mg(y)}\,\text{d}u\,g(y)\,\text{d}y\,,$

in that it took me several iterations [as shown in the above] to realise the issue was with the notation

$\int_0^a \,\text{d}u\,,$

which seemed to be missing a density term or, in other words, be different from

$\int_0^1 \,\mathbb{I}_{(0,a)}(u)\,\text{d}u\,.$

What is surprising for me is that the integral

$\int_0^a \,\text{d}u$

has a clear meaning as a Riemann integral, hence should be more intuitive….
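Since the uniform density is equal to one on (0,1), both integrals are equal to a whenever 0 ≤ a ≤ 1, and the accept-reject identity reduces to the integral of f over the set, divided by M. The following sketch checks this by simulation under assumptions chosen purely for illustration: f is the Beta(2,2) density, g is the uniform density on (0,1), M = 1.5, and the set is an interval.

```python
import random


def f(y):
    # Target density, assumed here to be Beta(2, 2) for illustration.
    return 6.0 * y * (1.0 - y)


# Bound such that f(y) <= M * g(y), with g the uniform density on (0, 1);
# the maximum of f is 1.5, reached at y = 0.5.
M = 1.5


def joint_frequency(a_low, a_high, n_draws=200_000, seed=1):
    """Monte Carlo frequency of the event {Y in (a_low, a_high), U <= f(Y) / (M g(Y))}."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_draws):
        y = rng.random()  # Y ~ g
        u = rng.random()  # U ~ U(0, 1), independent of Y
        if a_low < y < a_high and u <= f(y) / M:
            hits += 1
    return hits / n_draws


def exact_probability(a_low, a_high):
    """Right-hand side of the identity: the integral of f over the interval, divided by M."""
    def cdf(y):
        return 3.0 * y**2 - 2.0 * y**3

    return (cdf(a_high) - cdf(a_low)) / M
```

For instance, `joint_frequency(0.0, 0.3)` should land close to `exact_probability(0.0, 0.3)`, which is about 0.144, and taking the whole unit interval recovers the overall acceptance probability 1/M.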

## scalable Bayesian inference for the inverse temperature of a hidden Potts model

Posted in Books, R, Statistics, University life on April 7, 2015 by xi'an

Matt Moores, Tony Pettitt, and Kerrie Mengersen arXived a paper yesterday comparing different computational approaches to the processing of hidden Potts models and of the intractable normalising constant in the Potts model. This is a very interesting paper, first because it provides a comprehensive survey of the main methods used in handling this annoying normalising constant Z(β), namely pseudo-likelihood, the exchange algorithm, path sampling (a.k.a. thermodynamic integration), and ABC. A massive simulation experiment with individual simulation times up to 400 hours leads to selecting path sampling (what else?!) as the (XL) method of choice, thanks to a pre-computation of the expectation of the sufficient statistic E[S(Z)|β]. I just wonder why the same was not done for ABC, as in the recent Statistics and Computing paper we wrote with Matt and Kerrie. As it happens, I was actually discussing yesterday at Columbia potential, if huge, improvements in processing Ising and Potts models by approximating first the distribution of S(X) for some or all β before launching ABC or the exchange algorithm. (In fact, this is a more generic desideratum for all ABC methods, in that simulating directly, if approximately, the summary statistics would bring huge gains in computing time, and thus possibly in final precision.) Simulating the distribution of the summary and sufficient Potts statistic S(X) reduces to simulating this distribution with a null correlation, as exploited in Cucala and Marin (2013, JCGS, Special ICMS issue). However, there does not seem to be an efficient way to do so, i.e. without reverting to simulating the entire grid X…
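Path sampling rests on the identity that the derivative of log Z(β) in β is the expectation of the sufficient statistic, so that log Z(β) is log Z(0) plus the integral of E[S(X)|b] over b between 0 and β. The sketch below illustrates this on a deliberately tiny example, a 2×3 grid with three labels, where the expectation can be tabulated exactly by brute-force enumeration. This is an assumption made for the sake of a checkable toy; on realistic grids the expectation has to be estimated by simulation, and all function names are hypothetical.

```python
import itertools
import math


def grid_edges(n_rows, n_cols):
    """First-order (horizontal and vertical) neighbour pairs of a regular grid."""
    edges = []
    for r in range(n_rows):
        for c in range(n_cols):
            site = r * n_cols + c
            if c + 1 < n_cols:
                edges.append((site, site + 1))
            if r + 1 < n_rows:
                edges.append((site, site + n_cols))
    return edges


def sufficient_stat_counts(n_rows, n_cols, q):
    """Number of labellings for each value of S(x), the count of like neighbour pairs."""
    edges = grid_edges(n_rows, n_cols)
    counts = [0] * (len(edges) + 1)
    for x in itertools.product(range(q), repeat=n_rows * n_cols):
        s = sum(1 for i, j in edges if x[i] == x[j])
        counts[s] += 1
    return counts


def log_z_exact(counts, beta):
    return math.log(sum(c * math.exp(beta * s) for s, c in enumerate(counts)))


def expected_stat(counts, beta):
    weights = [c * math.exp(beta * s) for s, c in enumerate(counts)]
    return sum(s * w for s, w in enumerate(weights)) / sum(weights)


def log_z_path_sampling(counts, beta, n_grid=200):
    """Trapezoidal rule on E[S | b] for b in [0, beta], added to log Z(0)."""
    log_z0 = math.log(sum(counts))  # Z(0) is the number of labellings
    h = beta / n_grid
    values = [expected_stat(counts, k * h) for k in range(n_grid + 1)]
    integral = h * (sum(values) - 0.5 * (values[0] + values[-1]))
    return log_z0 + integral


counts = sufficient_stat_counts(2, 3, q=3)
```

The curve of E[S(X)|b] only needs to be computed once on a grid of values of b and can then be reused for every β visited by the sampler, which is where the pre-computation pays off.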

## True Detective [review]

Posted in Books, pictures on April 4, 2015 by xi'an

Even though I wrote before that I do not watch TV series, I made a second exception this year with True Detective. This series was recommended to me by Judith and this was truly a good recommendation!

Contrary to my old-fashioned idea of TV series, where the same group of caricatural characters repeatedly meet new settings that are solved within the 50 minutes each show lasts, the whole season of True Detective is a single story, much more like a very long movie with a unified plot that smoothly unfolds and gets mostly solved in the last episode. It obviously brings more strength and depth to the characters, the two investigators Rust and Marty, with the side drawback that most of the other characters, except maybe Marty’s wife, get little space. The opposition between those two investigators is central to the coherence of the story, with Rust being the most intriguing one, very intellectual, almost otherworldly, with a nihilistic discourse, and a self-destructive bent, while Marty sounds more down-to-earth, although he also caters to his own self-destructive demons… Both actors are very impressive in giving a life and a history to their characters. The story takes place in Louisiana, with great landscapes and oppressive swamps where everything seems doomed to vanish, eventually, making detective work almost useless. And where clamminess applies to moral values as much as to the weather. The core of the plot is the search for a serial killer, whose murders of women are incorporated within a pagan cult. Although this sounds rather standard for a US murder story (!), and while there are unnecessary sub-plots and unconvincing developments, the overall storyboard is quite coherent, with a literary feel, even though its writer, Nic Pizzolatto, never completed the corresponding novel and the unfolding of the plot is anything but conventional, with well-done flashbacks and multi-layered takes on the same events. (With none of the subtlety of Rashômon, where one ends up mistrusting every POV.) Most of the series takes place in current time, when the two former detectives are interrogated by detectives reopening an unsolved murder case.
The transformation of Rust over 15 years is an impressive piece of acting, by itself worth watching the show for! The final episode, while impressive from an aesthetic perspective as a descent into darkness, is somewhat disappointing at the story level for not exploring the killer’s perspective much further and for resorting to a fairly conventional (in the Psycho sense!) fighting scene.

## stability of noisy Metropolis-Hastings

Posted in Books, Statistics, Travel, University life on April 3, 2015 by xi'an

Felipe Medina-Aguayo, Anthony Lee and Gareth Roberts, all from Warwick, arXived last Thursday a paper on the stability properties of noisy Metropolis-Hastings algorithms. The validation of unbiased estimators of the target à la Andrieu and Roberts (2009, AoS), often discussed here, is in fact obvious when following the auxiliary variable representation of Andrieu and Vihola (2015, AoAP), assuming the unbiased estimator of the target is generated conditional on the proposed value in the original Markov chain. The noisy version of the above means refreshing the unbiased estimator at the current value at each iteration. It also goes under the name of Monte Carlo within Metropolis. The difficulty with this noisy version is that it is not exact, i.e., does not enjoy the true target as its marginal stationary distribution. The paper by Medina-Aguayo, Lee and Roberts focusses on its validation or invalidation (with examples of transient noisy versions). Under geometric ergodicity of the marginal chain, plus some stability in the weights, the noisy version is also geometrically ergodic. A drift condition on the proposal kernel is also sufficient. Under (much?) harder conditions, the limiting distribution of the noisy chain converges to the true target as the number of unbiased estimators grows. The result is thus quite interesting in that it provides sufficient convergence conditions, albeit not always easy to check in realistic settings.
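The difference between the exact (pseudo-marginal) and the noisy versions fits in one line of code, as in the minimal sketch below. The setting is entirely made up for illustration: a standard normal target, a random-walk proposal, and an unbiased estimator obtained by multiplying the target by an average of log-normal weights with mean one. The flag `refresh_current` is a hypothetical name that switches between recycling the estimate attached to the current value and redrawing it at each iteration.

```python
import math
import random


def log_target(x):
    # Unnormalised log-density of the toy target, a standard normal.
    return -0.5 * x * x


def log_noisy_weight(rng, n_particles, sigma):
    # Log of an average of n_particles log-normal weights with mean one, so that
    # exp(log_target(x)) times this weight is an unbiased estimator of the target.
    draws = [
        math.exp(sigma * rng.gauss(0.0, 1.0) - 0.5 * sigma**2)
        for _ in range(n_particles)
    ]
    return math.log(sum(draws) / n_particles)


def run_chain(n_iter, refresh_current, n_particles=5, sigma=0.5, step=1.0, seed=0):
    """Random-walk Metropolis-Hastings driven by estimated target values.

    refresh_current=False: pseudo-marginal chain, the estimate at the current
    value is recycled, and the marginal chain targets the true distribution.
    refresh_current=True: noisy chain, the estimate at the current value is
    redrawn at every iteration, and exactness is lost in general.
    """
    rng = random.Random(seed)
    x = 0.0
    log_w = log_noisy_weight(rng, n_particles, sigma)
    chain = []
    for _ in range(n_iter):
        if refresh_current:
            log_w = log_noisy_weight(rng, n_particles, sigma)
        y = x + step * rng.gauss(0.0, 1.0)
        log_w_prop = log_noisy_weight(rng, n_particles, sigma)
        log_ratio = log_target(y) + log_w_prop - log_target(x) - log_w
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x, log_w = y, log_w_prop
        chain.append(x)
    return chain


exact_chain = run_chain(20_000, refresh_current=False)
noisy_chain = run_chain(20_000, refresh_current=True)
```

In such a benign toy the two chains look much alike, and this says nothing about the transient examples of the paper. The sketch is only meant to show where the refreshing step enters the algorithm.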

## the unbounded likelihood problem

Posted in Books, Statistics, Travel, University life on April 2, 2015 by xi'an

Following my maths of the Lindley-Jeffreys paradox post, Javier (from Warwick) pointed out a recent American Statistician paper by Liu, Wu and Meeker about Understanding and addressing the unbounded likelihood problem. (I remember meeting some of the authors when visiting Ames three years ago.) As often when reading articles in The American Statistician, I easily find reasons to disagree with the authors. Here are some.

“Fisher (1912) suggest that a likelihood defined by a product of densities should be proportional to the probability of the data.”

First, I fail to understand why an unbounded likelihood is an issue. (I also fail to understand the above quote: in a continuous setting, there is no such thing as the probability of the data. Only its density.) Especially when avoiding maximum likelihood estimation. The paper is quite vague as to why this is a statistical problem. They take as one category discrete mixture models. While the likelihood explodes around each observation (in the mean direction) this does not prevent the existence of convergent solutions to the likelihood equations. Or of Bayes estimators. Nested sampling itself manages this difficulty.

Second, I deeply dislike the baseline that everything is discrete or even finite, including measurement, and hence that continuous densities should be replaced with probabilities, called the correct likelihood in the paper. Of course, using probabilities removes any danger of hitting an infinite likelihood. But it also introduces many layers of arbitrary calibration, including the scale of the discretisation. Like, I do not think there is any stability of the solution when the discretisation range Δ goes to zero, if the limiting theorem of the authors holds. But they do not seem to see this as an issue. I think it would make more sense to treat Δ as another parameter.
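Both phenomena are easy to reproduce numerically, as in the sketch below. It assumes a two-component normal mixture with equal weights, one component being a fixed standard normal and the other centred at the first observation with a shrinking scale. The data points and the value of Δ are made up. The density-based log-likelihood grows without bound as the scale goes to zero, while the interval-based version, which replaces each density value with the probability of an interval of width Δ around the observation, can never exceed zero.

```python
import math


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def density_loglik(data, mu, sigma):
    """Log-likelihood of 0.5 N(mu, sigma^2) + 0.5 N(0, 1) based on densities."""
    return sum(
        math.log(0.5 * normal_pdf(x, mu, sigma) + 0.5 * normal_pdf(x, 0.0, 1.0))
        for x in data
    )


def interval_loglik(data, mu, sigma, delta):
    """Same mixture, each observation replaced by an interval of width delta."""
    total = 0.0
    for x in data:
        lo, hi = x - delta / 2.0, x + delta / 2.0
        prob = 0.5 * (normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma))
        prob += 0.5 * (normal_cdf(hi, 0.0, 1.0) - normal_cdf(lo, 0.0, 1.0))
        total += math.log(prob)
    return total


# Hypothetical observations; the first component is centred at the first one.
toy_data = [-1.3, -0.4, 0.2, 0.9, 1.7]
sigmas = [1.0, 0.1, 0.01, 0.001]
density_values = [density_loglik(toy_data, toy_data[0], s) for s in sigmas]
interval_values = [interval_loglik(toy_data, toy_data[0], s, delta=0.1) for s in sigmas]
```

Rerunning the interval version with other values of `delta` shows how much the resulting likelihood depends on this arbitrary calibration.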

As an aside, I also find surprising the classification of the unbounded likelihood models in three categories, one being those “with three or four parameters, including a threshold parameter”. Why on Earth 3 or 4?! As if it was not possible to find infinite likelihoods with more than four parameters…