## down with Galton (and Pearson and Fisher…)

Posted in Books, Statistics, University life on July 22, 2019 by xi'an

In the last issue of Significance, which I read in Warwick prior to the conference, there is a most interesting article on Galton’s eugenics, his heritage at University College London (UCL), and the overall trouble with honouring prominent figures of the past with memorials like named buildings or lectures… The starting point of this debate is a protest from some UCL students and faculty about UCL having a lecture room named after the late Francis Galton, who was a professor there and who further donated most of his fortune to the university at his death, towards creating a professorship in eugenics. The protests are about Galton’s involvement in the eugenics movement of the late 19th and early 20th century, as well as about his professing racist opinions.

My first reaction after reading about these protests was why not?! Named places or lectures, as well as statues and other memorials, have a limited utility, especially when the named person is long dead, and they certainly do not contribute to making a scientific theory [associated with the said individual] more appealing or more valid. And since “humans are [only] humans”, to quote Stephen Stigler speaking in this article, it is unrealistic to expect great scientists to be perfect, all the more so if one multiplies the codes for ethical or acceptable behaviours across ages and cultures. It is also more rational to use amphitheater MS.02 and lecture room AC.18 rather than associate them with one name chosen out of many alumni’s or former professors’.

Predictably, another reaction of mine was why bother?!, as removing Galton’s name from the items it is attached to is highly unlikely to change current views on eugenics or racism. On the contrary, it seems to detract from opposing the present versions of these ideologies, such as some recent proposals linking genes and some form of academic success. Another of my (multiple) reactions was that, as stated in the article, these views of Galton’s reflected the views and prejudices of the time, when the notions of races and of inequalities between races (as well as genders and social classes) were almost universally accepted, including in scientific publications like the proceedings of the Royal Society and Nature. This was a time when Karl Pearson launched the Annals of Eugenics in 1925 (after he started Biometrika) with the very purpose of establishing a scientific basis for eugenics. (An editorship that Ronald Fisher would later take over, along with his views on the differences between races, believing that “human groups differ profoundly in their innate capacity for intellectual and emotional development”.) Starting from these prejudiced views, Galton set up a scientific and statistical approach to support them, by accumulating data and possibly modifying some of these views. But without much empathy for the consequences, as shown in this terrible quote I found when looking for more material:

“I should feel but little compassion if I saw all the Damaras in the hand of a slave-owner, for they could hardly become more wretched than they are now…”

As it happens, my first exposure to Galton was in my first probability course at ENSAE, when a terrific professor was peppering his lectures with historical anecdotes and used to mention Galton’s data-gathering trip to Namibia, literally measuring local inhabitants in support of his physiognomical views, also reflected in the above attempt of his to superpose photographs to achieve the “ideal” thief…

## A precursor of ABC-Gibbs

Posted in Books, R, Statistics on June 7, 2019 by xi'an

All ABC algorithms, including ABC-PaSS introduced here, require that statistics are sufficient for estimating the parameters of a given model. As mentioned above, parameter-wise sufficient statistics as required by ABC-PaSS are trivial to find for distributions of the exponential family. Since many population genetics models do not follow such distributions, sufficient statistics are known for the most simple models only. For more realistic models involving multiple populations or population size changes, only approximately-sufficient statistics can be found.

While Gibbs sampling is not mentioned in the paper, this is indeed a form of ABC-Gibbs, with the advantage of not facing convergence issues thanks to the sufficiency. The drawback is that this setting is restricted to exponential families and hence difficult to extrapolate to non-exponential distributions, as using almost-sufficient (or not) summary statistics leads to incompatible conditionals and thus jeopardises the convergence of the sampler. When thinking a wee bit more about the case treated by Kousathanas et al., I am actually uncertain about the validity of the sampler. When the tolerance is equal to zero, this is not an issue as it reproduces the regular Gibbs sampler. Otherwise, each conditional ABC step amounts to introducing an auxiliary variable represented by the simulated summary statistic. Since the distribution of this summary statistic depends, in general, on more than the parameter for which it is sufficient, it should also appear in the conditional distributions of the other parameters. At least from this Gibbs perspective, the sampler thus relies on incompatible conditionals, which makes the conditions proposed in our own paper all the more relevant.
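To make the componentwise structure concrete, here is a minimal toy sketch of an ABC-within-Gibbs sampler for a Normal sample with unknown mean and standard deviation. It is neither the ABC-PaSS algorithm of Kousathanas et al. nor the algorithm of our own paper: the function names `abc_conditional` and `abc_gibbs`, the uniform priors, the tolerance of 0.3 and the simulated dataset are all assumptions made for illustration. Each parameter is updated in turn by rejection ABC on its own summary statistic (sample mean for the mean, sample standard deviation for the scale).

```python
import random
import statistics

def abc_conditional(draw_prior, simulate_stat, observed_stat, eps, rng):
    """Rejection ABC for a single parameter: draw from its prior until the
    simulated summary falls within eps of the observed one."""
    while True:
        theta = draw_prior(rng)
        if abs(simulate_stat(theta, rng) - observed_stat) <= eps:
            return theta

def abc_gibbs(data, n_iter=300, eps=0.3, seed=1):
    """Toy ABC-within-Gibbs chain over (mu, sigma) for an assumed Normal model."""
    rng = random.Random(seed)
    n = len(data)
    s_mean = statistics.fmean(data)   # summary attached to mu
    s_sd = statistics.stdev(data)     # summary attached to sigma
    mu, sigma = 0.0, 1.0              # arbitrary starting values
    chain = []
    for _ in range(n_iter):
        # Update mu given the current sigma. Note that the simulated sample
        # mean has a distribution that depends on sigma as well as on mu.
        mu = abc_conditional(
            lambda r: r.uniform(-5.0, 5.0),   # assumed prior on mu
            lambda m, r: statistics.fmean([r.gauss(m, sigma) for _ in range(n)]),
            s_mean, eps, rng)
        # Update sigma given the freshly updated mu.
        sigma = abc_conditional(
            lambda r: r.uniform(0.1, 5.0),    # assumed prior on sigma
            lambda s, r: statistics.stdev([r.gauss(mu, s) for _ in range(n)]),
            s_sd, eps, rng)
        chain.append((mu, sigma))
    return chain

# Made-up dataset: 20 draws from N(2, 1.5^2) with a fixed seed.
data_rng = random.Random(0)
data = [data_rng.gauss(2.0, 1.5) for _ in range(20)]
chain = abc_gibbs(data)
print("average mu:", statistics.fmean(m for m, _ in chain))
print("average sigma:", statistics.fmean(s for _, s in chain))
```

With a positive tolerance, the simulated summary in each step acts as the auxiliary variable discussed above, and nothing in this sketch guarantees that the two conditionals are compatible with a joint distribution.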

## contemporary issues in hypothesis testing

Posted in Statistics on September 26, 2016 by xi'an

This week [at Warwick], among other things, I attended the CRiSM workshop on hypothesis testing, giving the same talk as at ISBA last June. There was a most interesting and unusual talk by Nick Chater (from Warwick) about the psychological aspects of hypothesis testing, namely about the unnatural features of a hypothesis in everyday life, i.e., how far this formalism stands from human psychological functioning. Or what we know about it. And then my Warwick colleague Tom Nichols explained how his recent work on permutation tests for fMRIs, published in PNAS, testing hypotheses on real data for which the null should hold and getting a high rate of false positives, got the medical imaging community all up in arms due to over-simplified reports in the media questioning the validity of 15 years of research on fMRI and the related 40,000 papers! For instance, some of the headlines questioned the entire research in the area. Or transformed a software bug missing the boundary effects into a major flaw. (See this podcast on Not So Standard Deviations for a thoughtful discussion on the issue.) One conclusion of this story is to be wary of assertions when submitting a hot story to journals with a substantial non-scientific readership! The afternoon talks were equally exciting, with Andrew explaining to us live from New York why he hates hypothesis testing and prefers model building. With the birthday model as an example. And David Draper gave an encompassing talk about the distinctions between inference and decision, proposing a Jaynes information criterion and illustrating it on Mendel’s historical [and massaged!] pea dataset. The next morning, Jim Berger gave an overview of the frequentist properties of the Bayes factor, with in particular a novel [to me] upper bound on the Bayes factor associated with a p-value (Sellke, Bayarri and Berger, 2001)

$B_{10}(p) \le \dfrac{1}{-e\,p\,\log p}$

with the specificity that $B_{10}(p)$ is not testing the original hypothesis [problem] but a substitute where the null is the hypothesis that p is uniformly distributed, versus a non-parametric alternative that p is more concentrated near zero. This reminded me of our PNAS paper on the impact of summary statistics upon Bayes factors. And of some forgotten reference studying Bayesian inference based solely on the p-value… It is too bad I had to rush back to Paris, as this made me miss the last talks of this fantastic workshop centred on maybe the most important aspect of statistics!
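To get a feel for how weak the evidence carried by a p-value can be under this bound, here is a small numerical sketch evaluating $1/(-e\,p\log p)$ for a few conventional p-values. The function name `bayes_factor_upper_bound` is mine, and the restriction to p < 1/e is the usual range of validity of the Sellke, Bayarri and Berger bound, stated here as an assumption rather than taken from the talk.

```python
import math

def bayes_factor_upper_bound(p: float) -> float:
    """Upper bound 1 / (-e * p * log p) on the Bayes factor in favour of the
    alternative, assumed to hold for 0 < p < 1/e."""
    if not 0.0 < p < 1.0 / math.e:
        raise ValueError("the bound is only used here for 0 < p < 1/e")
    return 1.0 / (-math.e * p * math.log(p))

# Evaluate the bound at a few conventional significance levels.
for p in (0.05, 0.01, 0.001):
    print(f"p = {p}: bound = {bayes_factor_upper_bound(p):.2f}")
```

For p = 0.05 the bound is roughly 2.5, i.e., odds of at most about 5 to 2 in favour of the alternative.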

## a general framework for updating belief functions

Posted in Books, Statistics, University life on July 15, 2013 by xi'an

Pier Giovanni Bissiri, Chris Holmes and Stephen Walker have recently arXived the paper related to Stephen’s talk in London for Bayes 250. When I heard the talk (of which some slides are included below), my interest was aroused by the facts that (a) the approach they investigated could start from a statistic, rather than from a full model, with obvious implications for ABC, & (b) the starting point could be the dual to the prior x likelihood pair, namely the loss function. I thus read the paper with this in mind. (And rather quickly, which may mean I skipped important aspects. For instance, I did not get into Section 4 to any depth. Disclaimer: I was not and am not a referee for this paper!)

The core idea is to stick to a Bayesian (hardcore?) line when missing the full model, i.e. the likelihood of the data, but wishing to infer about a well-defined parameter like the median of the observations. This parameter is model-free in that some degree of prior information is available in the form of a prior distribution. (This is thus the dual of frequentist inference: instead of a likelihood w/o a prior, they have a prior w/o a likelihood!) The approach in the paper is to define a “posterior” by using a functional type of loss function that balances fidelity to prior and fidelity to data. The prior part (of the loss) ends up with a Kullback-Leibler loss, while the data part (of the loss) is an expected loss wrt $l(\theta,x)$, ending up with the definition of a “posterior” that is

$\exp\{ -l(\theta,x)\} \pi(\theta)$

the loss thus playing the role of the log-likelihood.
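As an illustration of this construction, here is a minimal sketch of such a loss-based “posterior” for the median, computed on a grid. Everything specific in it is an assumption of mine rather than something taken from the paper: the absolute-error loss summed over observations, the N(0, 10²) prior, the made-up data, the grid, and the function name `loss_posterior`. No scaling weight is applied to the loss.

```python
import math

def loss_posterior(data, grid, log_prior, loss):
    """Normalised grid weights proportional to
    exp(-sum_i loss(theta, x_i)) * prior(theta)."""
    log_w = [log_prior(t) - sum(loss(t, x) for x in data) for t in grid]
    top = max(log_w)                       # subtract the max for stability
    w = [math.exp(v - top) for v in log_w]
    total = sum(w)
    return [v / total for v in w]

data = [1.2, 0.7, 3.4, 2.1, 1.9, 2.5, 0.3]       # made-up observations
grid = [i / 100 for i in range(-200, 601)]        # theta between -2 and 6

def log_prior(t):
    return -0.5 * (t / 10.0) ** 2                 # N(0, 10^2), up to a constant

def abs_loss(t, x):
    return abs(t - x)                             # loss targeting the median

weights = loss_posterior(data, grid, log_prior, abs_loss)
mode = grid[weights.index(max(weights))]
print("pseudo-posterior mode:", mode)
```

With a prior this flat, the mode of the pseudo-posterior sits at the sample median, and multiplying the loss by a weight w would sharpen or flatten the weights around it.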

I like very much the problem addressed in the paper, as I think it is connected with the real world and the complex modelling issues we face nowadays. I also like the insistence on coherence, like the updating principle when switching former posterior for new prior (a point sorely missed in this book!). The distinction between M-closed, M-open, and M-free scenarios is worth mentioning, if only as an entry to the Bayesian processing of pseudo-likelihood and proxy models. I am however not entirely convinced by the solution presented therein, in that it involves a rather large degree of arbitrariness. In other words, while I agree on using the loss function as a pivot for defining the pseudo-posterior, I am reluctant to put the same faith in the loss as in the log-likelihood (maybe a frequentist atavistic gene somewhere…). In particular, I think some of the choices are either hard or impossible to make and remain unprincipled (despite a call to the LP on page 7). I also consider the M-open case as remaining unsolved, as finding a convergent assessment about the pseudo-true parameter brings little information about the real parameter and the lack of fit of the superimposed model. Given my great expectations, I ended up being disappointed by the M-free case: there is no optimal choice for the substitute to the loss function that sounds very much like a pseudo-likelihood (or log thereof). (I thought the talk was more conclusive about this, I presumably missed a slide there!) Another great expectation was to read about the proper scaling of the loss function (since L and wL are difficult to separate, except for monetary losses). The authors propose a “correct” scaling based on balancing faithfulness to the prior and faithfulness to the data for a single observation, but this is not a completely tight argument (dependence on parametrisation and prior, notion of a single observation, etc.).

The illustration section contains two examples, one of which is a full-size or at least challenging genetic data analysis. The loss function is based on a logistic pseudo-likelihood and it provides results where the Bayes factor is in agreement with a likelihood ratio test using Cox’s proportional hazards model. The issue about keeping the baseline function as unknown reminded me of the Robbins-Wasserman paradox Jamie discussed in Varanasi. The second example offers a nice feature of putting uncertainties onto box-plots, although I cannot trust very much the 95% coverage of the credible sets. (And I do not understand why a unique loss would come to be associated with the median parameter, see p.25.)

Watch out: Tomorrow’s post contains a reply from the authors!

## top model choice week (#3)

Posted in Statistics, University life on June 19, 2013 by xi'an

To conclude this exciting week, there will be a final seminar by Veronika Rockovà (Erasmus University) on Friday, June 21, at 11am at ENSAE in Room 14. Here is her abstract:

11am: Fast Dynamic Posterior Exploration for Factor Augmented Multivariate Regression by Veronika Rockovà

Advancements in high-throughput experimental techniques have facilitated the availability of diverse genomic data, which provide complementary information regarding the function and organization of gene regulatory mechanisms. The massive accumulation of data has increased demands for more elaborate modeling approaches that combine the multiple data platforms. We consider a sparse factor regression model, which augments the multivariate regression approach by adding a latent factor structure, thereby allowing for dependent patterns of marginal covariance between the responses. In order to enable the identification of parsimonious structure, we impose spike and slab priors on the individual entries in the factor loading and regression matrices. The continuous relaxation of the point mass spike and slab enables the implementation of a rapid EM inferential procedure for dynamic posterior model exploration. This is accomplished by considering a nested sequence of spike and slab priors and various factor space cardinalities. Identified candidate models are evaluated by a conditional posterior model probability criterion, permitting trans-dimensional comparisons. Patterned sparsity manifestations such as an orthogonal allocation of zeros in factor loadings are facilitated by structured priors on the binary inclusion matrix. The model is applied to a problem of integrating two genomic datasets, where expression of microRNAs is related to the expression of genes with an underlying connectivity pathway network.

## The Windup Girl

Posted in Books on February 23, 2013 by xi'an

“The scientists here carry the haunted look of people who know they are under siege. They know that beyond a few doors, all manners of apocalyptic terrors wait to swallow them.”

The book by Paolo Bacigalupi was standing on a shelf of recommended reads at Waterstones near UCL, during my last visit there, and the connection with William Gibson made on the cover pushed me to buy the book. Plus the Hugo and Nebula Awards. And the cover, of course. I took advantage of this trip to Hamburg to read The Windup Girl and I found the book definitely a great read.

“Flotsam of the Old Expansion. An ancient piece of driftwood left at high tide, from the time petroleum was cheap and men and women crossed the globe in hours instead of weeks.”

The Windup Girl has indeed some flavour of Gibson’s Neuromancer and Stephenson’s Snow Crash, although the story is more psychological and less technological than those two classics. There is a darker tone to the novel, as Earth is suffering both from the end of oil and from various food plagues that destroyed most crops, not to mention deadly new viruses. The new powers are the big genetically-engineered-seed producers, while part of the World has been eradicated. (The power is now produced by genetically engineered mammoths called megodonts.) And pollution is strictly kept under control.

“It has the markings of an engineering virus. DNA shifts don’t look like ones that would reproduce in the wild. Blister rust has no reason to jump the animal kingdom barrier. Nothing is encouraging it, it is not easily transferred. The differences are marked. It’s as though we’re looking into its future.”

The story is set in Thailand, which has somehow miraculously salvaged a huge seed bank and which manages to keep those crop companies at bay. Of course, things are deteriorating as the book begins, otherwise there would be no story. What I like the most about The Windup Girl is this bleak vision of a harsh future, set in Asia and told through four different story threads belonging to completely separate cultures (Thai, Chinese, American, and new-Japanese), thus avoiding the usual ethnocentrism of such novels. As mentioned above, the story is definitely not as technological or geeky as cyberpunk novels and it does not even qualify as genepunk, as the amount of genetics involved in the story is somewhat limited (except for three newly created races all impacting the plot). But the dystopian universe created by Paolo Bacigalupi is definitely both convincing and mesmerising, while not requiring so many suspensions of disbelief. The characters are all well-set, with the proper degree of greyness in their ethics, and the political manoeuvring is realistic. I also feel The Windup Girl is quite in tune with (my) current worries about the future fate of humanity faced with rapid climate change, an increasing frequency of natural disasters, and correlated insect invasions. Lastly, the relation of some of the characters to (Thai) Buddhism is an interesting peculiarity of the novel. So a book truly worth recommending! (In Spanish, the title of the book is La Chica Mecánica, which I find less appealing than the multilayered Windup Girl! The multiple covers on this ‘Og page are actually virtual covers suggested by fans, follow the links to get the whole story.)

## genetics

Posted in Books, Kids, Travel, University life on April 9, 2012 by xi'an

Today, I was reading in the science leaflet of Le Monde about a new magnitude in sequencing cancerous tumors (wrong link, I know…). This made me wonder whether the sequence of (hundreds of) mutations leading from a normal cell to a cancerous one could be reconstituted in the way a genealogy is. (This reminds me of another exciting genetic article I read on the Eurostar back from London on Thursday, in the Economist, about the colonization of Madagascar by 30 women from the Malay archipelago: “The island was one of the last places on Earth to be settled, receiving its earliest migrants in the middle of the first millennium AD…”)

As a double coincidence, I was reading La Recherche yesterday in the métro to Dauphine, whose central theme this month is heredity beyond genetics. (Double because this also connected with the meeting in London.) The keyword is epigenetics, namely the activation or inactivation of a gene and the hereditary transmission of this character w/o a genetic mutation. This is quite interesting as it implies the heritability of some acquired traits, i.e. forces one to reconsider the nature versus nurture debate. (This phrase is another input due to Galton!) It also implies that a much faster rate of species differentiation due to environmental changes (than the purely genetic one) is possible, which may sound promising in the light of the fast climate changes we are currently facing. However, what I do not understand is why the journal included a paper on the consequences of epigenetics on the Darwinian theory of evolution and… intelligent design. Indeed, I do not see why the inclusion of different vectors in the hereditary process would contradict Darwin’s notion of natural selection. Or even why considering a scientific modification or replacement of the current Darwinian theory of evolution would be an issue. Charles Darwin wrote his book in 1859, prior to the start of genetics, and the immense advances made since then led to modifications and adjustments of his original views. Without involving any irrational belief in the process.