As we were working on the Handbook of mixture analysis with Sylvia Früwirth-Schnatter and Gilles Celeux today, near Saint-Germain des Près, I realised that there was a mistake in our 1990 mixture paper with Jean Diebolt [published in 1994], in that when we are proposing to use improper “Jeffreys” priors under the restriction that no component of the Gaussian mixture is “empty”, meaning that there are at least two observations generated from each component, the likelihood needs to be renormalised to be a density for the sample. This normalisation constant only depends on the weights of the mixture, which means that, when simulating from the full conditional distribution of the weights, there should be an extra-acceptance step to account for this correction. Of course, the term is essentially equal to one for a large enough sample but this remains a mistake nonetheless! It is funny that it remained undetected for so long in my most cited paper. Checking on Larry’s 1999 paper exploring the idea of excluding terms from the likelihood to allow for improper priors, I did not spot him using a correction either.
Archive for JRSSB
A paper on control variates by Chris Oates, Mark Girolami (Warwick) and Nicolas Chopin (CREST) appeared in a recent issue of Series B. I had read and discussed the paper with them previously and the following is a set of comments I wrote at some stage, to be taken with enough gains of salt since Chris, Mark and Nicolas answered them either orally or in the paper. Note also that I already discussed an earlier version, with comments that are not necessarily coherent with the following ones! [Thanks to the busy softshop this week, I resorted to publish some older drafts, so mileage can vary in the coming days.]
First, it took me quite a while to get over the paper, mostly because I have never worked with reproducible kernel Hilbert spaces (RKHS) before. I looked at some proofs in the appendix and at the whole paper but could not spot anything amiss. It is obviously a major step to uncover a manageable method with a rate that is lower than √n. When I set my PhD student Anne Philippe on the approach via Riemann sums, we were quickly hindered by the dimension issue and could not find a way out. In the first versions of the nested sampling approach, John Skilling had also thought he could get higher convergence rates before realising the Monte Carlo error had not disappeared and hence was keeping the rate at the same √n speed.
The core proof in the paper leading to the 7/12 convergence rate relies on a mathematical result of Sun and Wu (2009) that a certain rate of regularisation of the function of interest leads to an average variance of order 1/6. I have no reason to mistrust the result (and anyway did not check the original paper), but I am still puzzled by the fact that it almost immediately leads to the control variate estimator having a smaller order variance (or at least variability). On average or in probability. (I am also uncertain on the possibility to interpret the boxplot figures as establishing super-√n speed.)
Another thing I cannot truly grasp is how the control functional estimator of (7) can be both a mere linear recombination of individual unbiased estimators of the target expectation and an improvement in the variance rate. I acknowledge that the coefficients of the matrices are functions of the sample simulated from the target density but still…
Another source of inner puzzlement is the choice of the kernel in the paper, which seems too simple to be able to cover all problems despite being used in every illustration there. I see the kernel as centred at zero, which means a central location must be know, decreasing to zero away from this centre, so possibly missing aspects of the integrand that are too far away, and isotonic in the reference norm, which also seems to preclude some settings where the integrand is not that compatible with the geometry.
I am equally nonplussed by the existence of a deterministic bound on the error, although it is not completely deterministic, depending on the values of the reproducible kernel at the points of the sample. Does it imply anything restrictive on the function to be integrated?
A side remark about the use of intractable in the paper is that, given the development of a whole new branch of computational statistics handling likelihoods that cannot be computed at all, intractable should possibly be reserved for such higher complexity models.
I received this news from the RSS today that all the RSS journals are turning 100% electronic. No paper version any longer! I deeply regret this move on which, as an RSS member, I would have appreciated to be consulted as I find much easier to browse through the current issue when it arrives in my mailbox, rather than being t best reminded by an email that I will most likely ignore and erase. And as I consider the production of the journals the prime goal of the Royal Statistical Society. And as I read that only 25% of the members had opted so far for the electronic format, which does not sound to me like a majority. In addition, moving to electronic-only journals does not bring the perks one would expect from electronic journals:
- no bonuses like supplementary material, code, open or edited comments
- no reduction in the subscription rate of the journals and penalty fees if one still wants a paper version, which amounts to a massive increase in the subscription price
- no disengagement from the commercial publisher, whose role become even less relevant
- no access to the issues of the years one has paid for, once one stops subscribing.
“The benefits of electronic publishing include: faster publishing speeds; increased content; instant access from a range of electronic devices; additional functionality; and of course, environmental sustainability.”
The move is sold with typical marketing noise. But I do not buy it: publishing speeds will remain the same as driven by the reviewing part, I do not see where the contents are increased, and I cannot seriously read a journal article from my phone, so this range of electronic devices remains a gadget. Not happy!
Cristiano Varin, Manuela Cattelan and David Firth (Warwick) have written a paper on the statistical analysis of citations and index factors, paper that is going to be Read at the Royal Statistical Society next May the 13th. And hence is completely open to contributed discussions. Now, I have written several entries on the ‘Og about the limited trust I set to citation indicators, as well as about the abuse made of those. However I do not think I will contribute to the discussion as my reservations are about the whole bibliometrics excesses and not about the methodology used in the paper.
The paper builds several models on the citation data provided by the “Web of Science” compiled by Thompson Reuters. The focus is on 47 Statistics journals, with a citation horizon of ten years, which is much more reasonable than the two years in the regular impact factor. A first feature of interest in the descriptive analysis of the data is that all journals have a majority of citations from and to journals outside statistics or at least outside the list. Which I find quite surprising. The authors also build a cluster based on the exchange of citations, resulting in rather predictable clusters, even though JCGS and Statistics and Computing escape the computational cluster to end up in theory and methods along Annals of Statistics and JRSS Series B.
In addition to the unsavoury impact factor, a ranking method discussed in the paper is the eigenfactor score that starts with a Markov exploration of articles by going at random to one of the papers in the reference list and so on. (Which shares drawbacks with the impact factor, e.g., in that it does not account for the good or bad reason the paper is cited.) Most methods produce the Big Four at the top, with Series B ranked #1, and Communications in Statistics A and B at the bottom, along with Journal of Applied Statistics. Again, rather anticlimactic.
The major modelling input is based on Stephen Stigler’s model, a generalised linear model on the log-odds of cross citations. The Big Four once again receive high scores, with Series B still much ahead. (The authors later question the bias due to the Read Paper effect, but cannot easily evaluate this impact. While some Read Papers like Spiegelhalter et al. 2002 DIC do generate enormous citation traffic, to the point of getting re-read!, other journals also contain discussion papers. And are free to include an on-line contributed discussion section if they wish.) Using an extra ranking lasso step does not change things.
In order to check the relevance of such rankings, the authors also look at the connection with the conclusions of the (UK) 2008 Research Assessment Exercise. They conclude that the normalised eigenfactor score and Stigler model are more correlated with the RAE ranking than the other indicators. Which means either that the scores are good predictors or that the RAE panel relied too heavily on bibliometrics! The more global conclusion is that clusters of journals or researchers have very close indicators, hence that ranking should be conducted with more caution that it is currently. And, more importantly, that reverting the indices from journals to researchers has no validation and little information.
As posted a few days ago, Mathieu Gerber and Nicolas Chopin will read this afternoon a Paper to the Royal Statistical Society on their sequential quasi-Monte Carlo sampling paper. Here are some comments on the paper that are preliminaries to my written discussion (to be sent before the slightly awkward deadline of Jan 2, 2015).
Quasi-Monte Carlo methods are definitely not popular within the (mainstream) statistical community, despite regular attempts by respected researchers like Art Owen and Pierre L’Écuyer to induce more use of those methods. It is thus to be hoped that the current attempt will be more successful, it being Read to the Royal Statistical Society being a major step towards a wide diffusion. I am looking forward to the collection of discussions that will result from the incoming afternoon (and bemoan once again having to miss it!).
“It is also the resampling step that makes the introduction of QMC into SMC sampling non-trivial.” (p.3)
At a mathematical level, the fact that randomised low discrepancy sequences produce both unbiased estimators and error rates of order
means that randomised quasi-Monte Carlo methods should always be used, instead of regular Monte Carlo methods! So why is it not always used?! The difficulty stands [I think] in expressing the Monte Carlo estimators in terms of a deterministic function of a fixed number of uniforms (and possibly of past simulated values). At least this is why I never attempted at crossing the Rubicon into the quasi-Monte Carlo realm… And maybe also why the step had to appear in connection with particle filters, which can be seen as dynamic importance sampling methods and hence enjoy a local iid-ness that relates better to quasi-Monte Carlo integrators than single-chain MCMC algorithms. For instance, each resampling step in a particle filter consists in a repeated multinomial generation, hence should have been turned into quasi-Monte Carlo ages ago. (However, rather than the basic solution drafted in Table 2, lower variance solutions like systematic and residual sampling have been proposed in the particle literature and I wonder if any of these is a special form of quasi-Monte Carlo.) In the present setting, the authors move further and apply quasi-Monte Carlo to the particles themselves. However, they still assume the deterministic transform
which the q-block on which I stumbled each time I contemplated quasi-Monte Carlo… So the fundamental difficulty with the whole proposal is that the generation from the Markov proposal
has to be of the above form. Is the strength of this assumption discussed anywhere in the paper? All baseline distributions there are normal. And in the case it does not easily apply, what would the gain bw in only using the second step (i.e., quasi-Monte Carlo-ing the multinomial simulation from the empirical cdf)? In a sequential setting with unknown parameters θ, the transform is modified each time θ is modified and I wonder at the impact on computing cost if the inverse cdf is not available analytically. And I presume simulating the θ’s cannot benefit from quasi-Monte Carlo improvements.
The paper obviously cannot get into every detail, obviously, but I would also welcome indications on the cost of deriving the Hilbert curve, in particular in connection with the dimension d as it has to separate all of the N particles, and on the stopping rule on m that means only Hm is used.
Another question stands with the multiplicity of low discrepancy sequences and their impact on the overall convergence. If Art Owen’s (1997) nested scrambling leads to the best rate, as implied by Theorem 7, why should we ever consider another choice?
In connection with Lemma 1 and the sequential quasi-Monte Carlo approximation of the evidence, I wonder at any possible Rao-Blackwellisation using all proposed moves rather than only those accepted. I mean, from a quasi-Monte Carlo viewpoint, is Rao-Blackwellisation easier and is it of any significant interest?
What are the computing costs and gains for forward and backward sampling? They are not discussed there. I also fail to understand the trick at the end of 4.2.1, using SQMC on a single vector instead of (t+1) of them. Again assuming inverse cdfs are available? Any connection with the Polson et al.’s particle learning literature?
Last questions: what is the (learning) effort for lazy me to move to SQMC? Any hope of stepping outside particle filtering?
Our paper about evaluating statistics used for ABC model choice has just appeared in Series B! It somewhat paradoxical that it comes out just a few days after we submitted our paper on using random forests for Bayesian model choice, thus bypassing the need for selecting those summary statistics by incorporating all statistics available and letting the trees automatically rank those statistics in term of their discriminating power. Nonetheless, this paper remains an exciting piece of work (!) as it addresses the more general and pressing question of the validity of running a Bayesian analysis with only part of the information contained in the data. Quite usefull in my (biased) opinion when considering the emergence of approximate inference already discussed on this ‘Og…
[As a trivial aside, I had first used fresh from the press(es) as the bracketted comment, before I realised the meaning was not necessarily the same in English and in French.]
I received this email from Wiley with the great figure that JRSS Series B has now reached a 5.721 impact factor. Which makes it the first journal in Statistics from this perspective. Congrats to editors Gareth Roberts, Piotr Fryzlewicz and Ingrid Van Keilegom for this achievement! An amazing jump from the 2009 figure of 2.84…!