On October 11, at Bletchley Park, the Suffrage Science awards in mathematics and computer sciences were awarded for the first time to 12 senior female researchers. Among whom three statisticians, Professor Christl Donnelly from Imperial College London, my colleague at Warwick, Jane Hutton, and my friend and co-author, Sylvia Richardson, from MRC, Cambridge University. This initiative was started by the Medical Research Council in 2011 by Suffrage Science awards for life sciences, followed in 2013 by one for engineering and physics, and this year for maths and computing. The name of the award aims to connect with the Suffragette movement of the late 19th and early 20th Centuries, which were particularly active in Britain. One peculiar aspect of this award is that the recipients are given pieces of jewellery, created for each field, pieces that they will themselves give two years later to a new recipient of their choice, and so on in an infinite regress! (Which suggests a related puzzle, namely to figure out how many years it should take until all female scientists have received the award. But since the number increases as the square of the number of years, this is not going to happen unless the field proves particularly hostile to women scientists!) This jewellery award also relates to the history of the Suffragette movement since the WPSU commissioned their own jewellery awards. A clever additional touch was that the awards were delivered on Ada Lovelace Day, October 11.
Archive for University of Warwick
“By employing a modification to existing naïve subsampling techniques we can obtain an algorithm which is still exact but has sub-linear iterative cost as a function of data size.”
A few weeks ago Murray Pollock, Paul Fearnhead, Adam Johansen and Gareth Roberts (all from Warwick except for Paul) arXived a paper The Scalable Langevin Exact Algorithm: Bayesian Inference for Big Data. (This was also the topic of Murray’s talk last year at JSM in Seattle.) One major advance found in the paper is the derivation of an “exact” algorithm that is sub-linear in the data size. As discussed in the introduction, the current approaches to large data problems either suffer from being approximate (like divide-and-conquer methods) or do not achieve significant reduction in the computing time, being of order O(n). The authors mention Teh and Welling (2011) sand their tochastic gradient approximation to the Langevin diffusion, when the gradient is based on a subsample. Without the Metropolis correction that would ensure an exact target but at a cost of order O(n). (Which makes the technique rather difficult to recommend.)
A novel [for me] notion at the core of this paper is the concept of quasi-stationary distribution, which is the limiting distribution of a Markov chain X[t] conditional on a Markov stopping time [being larger than t]. The approach is based on diffusions with appropriate stationary distributions like the Langevin diffusion. (Actually, as in most papers I have read and remember, the current paper only considers the Langevin diffusion.) In order to avoid the issues with unadjusted and Metropolis-adjusted Langevin schemes, a killed Brownian motion is created, which means a Brownian motion conditional of being alive till time T when the instantaneous killing rate is a given function of the chain, Φ(X[t]), related with the stationary measure of the Langevin diffusion ν. Under appropriate conditions, the density of this killed Brownian motion converges [in T] to √ν. Which immediately hints at creating a new Langevin diffusion targeting ν² instead of ν. And killing it with the proper rate, which can be done by thinning a Poisson process. Simulating the original path can be done by path-space rejection sampling, following the technique set by Gareth Roberts and co-authors more than ten years ago. Based on finite dimensional realisations of the path on [0,T]. And including the killing part can be done by importance sampling and checking that the simulated killing time is larger than the current (exponentially simulated) time.
One practical difficulty in the implementation of this neat principle is the derivation of the normalising constant, which evaluation degrades with the time horizon T. The solution adopted in the paper is through a sequential Monte Carlo method, using another discretisation of the time interval [0,T] (relying on the original one would get too costly?). As for subsampling, since the survival probability for the Brownian motion is based on an unbiased estimator, subsampling does not hurt if conducted in a random manner. Although this increases the variance on principle, the use of a control variate computed just once helps in reducing the complexity to O(1).
This is a tough paper and I have not gone through the effort of trying to implement it, but this is an original and innovative construct I would like to monitor in further details on a toy example, maybe next week while in Warwick. Or at least to discuss it with the authors.
David Rossell and Francisco Rubio (both from Warwick) arXived a month ago a paper on non-normal variable selection. They use two-piece error models that preserve manageable inference and allow for simple computational algorithms, but also characterise the behaviour of the resulting variable selection process under model misspecification. Interestingly, they show that the existence of asymmetries or heavy tails leads to power losses when using the Normal model. The two-piece error distribution is made of two halves of location-scale transforms of the same reference density on the two sides of the common location parameter. In this paper, the density is either Gaussian or Laplace (i.e., exponential?). In both cases the (log-)likelihood has a nice compact expression (although it does not allow for a useful sufficient statistic). One is the L¹ version versus the other which is the L² version. Which is the main reason for using this formalism based on only two families of parametric distributions, I presume. (As mentioned in an earlier post, I do not consider those distributions as mixtures because the component of a given observation can always be identified. And because as shown in the current paper, maximum likelihood estimates can be easily derived.) The prior construction follows the non-local prior principles of Johnson and Rossell (2010, 2012) also discussed in earlier posts. The construction is very detailed and hence highlights how many calibration steps are needed in the process.
“Bayes factor rates are the same as when the correct model is assumed [but] model misspecification often causes a decrease in the power to detect truly active variables.”
When there are too many models to compare at once, the authors propose a random walk on the finite set of models (which does not require advanced measure-theoretic tools like reversible jump MCMC). One interesting aspect is that moving away from the normal to another member of this small family is driven by the density of the data under the marginal densities, which means moving only to interesting alternatives. But also sticking to the normal only for adequate datasets. In a sense this is not extremely surprising given that the marginal likelihoods (model-wise) are available. It is also interesting that on real datasets, one of the four models is heavily favoured against the others, be it Normal (6.3) or Laplace (6.4). And that the four model framework returns almost identical values when compared with a single (most likely) model. Although not immensely surprising when acknowledging that the frequency of the most likely model is 0.998 and 0.998, respectively.
“Our framework represents a middle-ground to add flexibility in a parsimonious manner that remains analytically and computationally tractable, facilitating applications where either p is large or n is too moderate to fit more flexible models accurately.”
Overall, I find the experiment quite conclusive and do not object [much] to this choice of parametric family in that it is always more general and generic than the sempiternal Gaussian model. That we picked in our Bayesian Essentials, following tradition. In a sense, it would be natural to pick the most general possible parametric family that allows for fast computations, if this notion does make any sense…
Today the Minister for the Constitution, Chris Skidmore, presented WMG, at the University of Warwick, with a Royal Warrant signed by Her Majesty the Queen, officially conferring her Majesty’s recognition with the title of the Regius Professor of Manufacturing (Engineering).
The title of Regius Professorship is a rare and prestigious award given by Her Majesty the Queen to recognise exceptionally high quality research at an institution. The University of Warwick was one of 12 universities honoured to mark Her Majesty’s 90th Birthday. Previous to this, only 14 had been granted since the reign of Queen Victoria. It is believed that the first Regius Professorship was conferred to Aberdeen University in 1497 by King James IV.
Felipe Medina-Aguayo, Antony Lee and Gareth Roberts (all at Warwick University) have recently published—even though the paper was accepted a year ago—a paper in Statistics and Computing about a variant to the pseudo-marginal Metropolis-Hastings algorithm. The modification is to simulate an estimate of the likelihood or posterior at the current value of the Markov chain at every iteration, rather than reproducing the current estimate. The reason for this refreshment of the weight estimate is to prevent stickiness in the chain, when a random weight leads to a very high value of the posterior. Unfortunately, this change leads to a Markov chain with the wrong stationary distribution. When this stationary exists! The paper actually produces examples of transient noisy chains, even in simple cases such as a geometric target distribution. And even when taking the average of a large number of weights. But the paper also contains sufficient conditions, like negative weight moments or uniform ergodicity of the proposal, for the noisy chain to be geometrically ergodic. Even though the applicability of those conditions to complex targets is not always obvious.
This week [at Warwick], among other things, I attended the CRiSM workshop on hypothesis testing, giving the same talk as at ISBA last June. There was a most interesting and unusual talk by Nick Chater (from Warwick) about the psychological aspects of hypothesis testing, namely about the unnatural features of an hypothesis in everyday life, i.e., how far this formalism stands from human psychological functioning. Or what we know about it. And then my Warwick colleague Tom Nichols explained how his recent work on permutation tests for fMRIs, published in PNAS, testing hypotheses on what should be null if real data and getting a high rate of false positives, got the medical imaging community all up in arms due to over-simplified reports in the media questioning the validity of 15 years of research on fMRI and the related 40,000 papers! For instance, some of the headings questioned the entire research in the area. Or transformed a software bug missing the boundary effects into a major flaw. (See this podcast on Not So Standard Deviations for a thoughtful discussion on the issue.) One conclusion of this story is to be wary of assertions when submitting a hot story to journals with a substantial non-scientific readership! The afternoon talks were equally exciting, with Andrew explaining to us live from New York why he hates hypothesis testing and prefers model building. With the birthday model as an example. And David Draper gave an encompassing talk about the distinctions between inference and decision, proposing a Jaynes information criterion and illustrating it on Mendel‘s historical [and massaged!] pea dataset. The next morning, Jim Berger gave an overview on the frequentist properties of the Bayes factor, with in particular a novel [to me] upper bound on the Bayes factor associated with a p-value (Sellke, Bayarri and Berger, 2001)
B¹⁰(p) ≤ 1/-e p log p
with the specificity that B¹⁰(p) is not testing the original hypothesis [problem] but a substitute where the null is the hypothesis that p is uniformly distributed, versus a non-parametric alternative that p is more concentrated near zero. This reminded me of our PNAS paper on the impact of summary statistics upon Bayes factors. And of some forgotten reference studying Bayesian inference based solely on the p-value… It is too bad I had to rush back to Paris, as this made me miss the last talks of this fantastic workshop centred on maybe the most important aspect of statistics!
In the plane to Warwick on Monday, I was reading my latest issue of Nature and found an interesting editorial on the financial plight of many graduates and post-docs in both the US and the UK (and certainly elsewhere). Who, despite having a fellowship, cannot make ends meet. This is particularly true in expensive cities like London, Oxford or even Paris, where rents force those new researchers to face long commuting hours. The editorial suggests taking extra-jobs to make up for financial difficulties, but this does not sound to me like a particularly pertinent recommendation if it means taking time off one’s research, at the period in a researcher’s career where one’s energy should be mostly directed at the production of papers towards securing a (more) permanent job. Even teaching can prove too time consuming for finishing PhD students. An adequation between the needs of those young researchers and the institutional support they receive would sound like a natural requirement, while graduates looking for fellowship should truly assess the adequation in detail before accepting an offer.Which of course is not always easy. In countries where post-doctoral contracts are not negotiable and are set at a national level (like, e.g., France), checking with earlier fellows is a must. (As it happens or happened, I was quite lucky to spend my post-doctoral years in cheap places with decent support from the local universities, but this is not relevant in today’s environment!)