## 39% anglo-irish!

Posted in Kids, Statistics, Travel on May 24, 2015 by xi'an

As I have always been curious about my ancestry, I took a DNA test with 23andMe. While the company no longer provides statistics about potential medical conditions because of a lawsuit, it does return an ancestry analysis of sorts. In my case, my major ancestry composition is Anglo-Irish! (with 39% of my DNA) and northern European (with 32%), while only 19% is Franco-German… In retrospect, this is not so much of a surprise, not because of my well-known Anglophilia, but because my known family roots (known at least for the direct ancestral branches) are in Normandy, whose duke invaded Britain in 1066, and in Brittany, which was invaded by British Celts fleeing the Anglo-Saxons in the 400s. What is maybe more surprising to me is that the database contained 23 people identified as 4th degree cousins and a total of 652 relatives… While the number of my potential 4th degree cousins stands in the tens of thousands, and hence there may indeed be a few ending up as (mostly American) 23andMe customers, I am surprised that a .37% coincidence in our genes qualifies for being 4th degree cousins! But given that I only share 3.1% with my great³-grandfather, it actually makes sense that I share about .1% to .4% with such remote cousins. However, I wonder at the precision of such an allocation: could those cousins be even more remotely related? Not related at all? [Warning: All the links to 23andMe in this post are part of their referral program.]
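The percentages above can be checked with a back-of-the-envelope calculation. The sketch below assumes the simplest pedigree model, in which each generation halves the expected share of autosomal DNA and n-th cousins descend from one shared couple. It ignores the randomness of recombination and any inbreeding, and it is not 23andMe's actual relative-matching method. The function names `ancestor_share` and `cousin_share` are made up for the illustration.

```python
def ancestor_share(generations: int) -> float:
    """Expected fraction of DNA inherited from one ancestor `generations` back
    (1 = parent, 2 = grandparent, 5 = great-great-great-grandparent)."""
    return 0.5 ** generations


def cousin_share(degree: int, common_ancestors: int = 2) -> float:
    """Expected fraction of DNA shared by two `degree`-th cousins.

    Each cousin sits degree + 1 generations below the shared ancestors, so
    2 * (degree + 1) transmissions separate them through each common ancestor.
    Full cousins share a couple (2 ancestors), half cousins a single ancestor.
    """
    return common_ancestors * 0.5 ** (2 * (degree + 1))


if __name__ == "__main__":
    print(f"great-great-great-grandparent: {ancestor_share(5):.2%}")
    for n in range(1, 6):
        print(f"cousins of degree {n}: {cousin_share(n):.3%}")
```

Under these assumptions a great-great-great-grandparent contributes 1/32, about 3.1%, and full 4th cousins are expected to share 2/1024, about 0.2%, which sits inside the .1% to .4% range quoted above. Since actual sharing fluctuates widely around this expectation at such distances, a single observed .37% cannot pin down the degree of cousinship very precisely.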

## Evidence and evolution (5)

Posted in Books, Statistics on April 29, 2010 by xi'an

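“Tout étant fait pour une fin, tout est nécessairement pour la meilleure fin. Remarquez bien que les nez ont été faits pour porter des lunettes, aussi avons-nous des lunettes.” [“Everything being made for an end, everything is necessarily for the best end. Observe that noses were made to bear spectacles, and so we have spectacles.”] Voltaire, Candide, Chapter 1.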

I am now done with my review of Sober’s Evidence and Evolution: The Logic Behind the Science. Posting about each chapter along the way helped me a lot in writing up the review over the past few days. Its conclusion is that

Evidence and Evolution is very well-written, with hardly any typos (the unbiasedness property of AIC is stated at the bottom of page 101 with the expectation symbol E on the wrong side of the equation, Figure 3.8c is used instead of Figure 3.7c on page 204, Figure 4.7 is used instead of Figure 4.8 on page 293, Simon Tavaré’s name is always spelled Taveré, vaules rather than values is repeated four times on page 339). The style is sometimes too light and often too verbose, with an abundance of analogies that I regard as sidetracking, but this makes for easier reading (except for the sentence “the key to answering the second question is that the observation that X = 1 and Y = 1 produces stronger evidence favoring CA over SA the lower the probability is that the ancestors postulated by the two hypotheses were in state 1”, on page 314, that still eludes me!). As detailed in this review, I have points of contention with the philosophical views about testing in Evidence and Evolution as well as with the methods presented therein, but this does not detract from the appeal of reading the book. (The lack of completely worked out statistical hypotheses in realistic settings remains the major issue in my criticism of the book.) While the criticisms of the Bayesian paradigm are often shallow (like the one on page 97 ridiculing Bayesians drawing inference based on a single observation), there is nothing fundamentally wrong with the statistical foundations of the book. I therefore repeat my earlier recommendation in favour of Evidence and Evolution, Chapters 1 and (paradoxically) 4 being the easier entries. Obviously, readers familiar with Sober’s earlier papers and books will most likely find a huge overlap with those, but others will gather Sober’s viewpoints on the notion of testing hypotheses in a (mostly) unified perspective.

And, as illustrated by the above quote, I found the sentence from Voltaire’s Candide I wanted to include. Of course, this 12-page review may be overly long for the journal it was intended for, Human Genetics, in which case I will have to find another outlet for the current arXived version. But I enjoyed reading this book with a pencil and gathered enough remarks along the way to fill those twelve pages.

## Evidence and evolution (4)

Posted in Books, Statistics on April 26, 2010 by xi'an

“Darwinians would not be satisfied if all life on Earth derived from the same large slab of rock.” (E&E, p.269)

Thanks to Eyjafjallajökull, I used the three and a half hours in the train back from Marseille to finish my reading of Sober’s Evidence and Evolution: The Logic Behind the Science. The final chapter (apart from the concluding summary) is about “Common ancestry” and may be the most statistically oriented of the three chapters about evolution. This is not to say the chapter is without flaws, including in particular a certain tendency to repeat the same arguments, but this is somehow the chapter I appreciated the most. The chapter starts with a detailed analysis of how the hypothesis of common ancestry should be set, the main distinction being between one organism and several, while pointing out the confusing effect of lateral gene transfer. Inference about phylogenetic trees and the use of genetic sequences rather than simplistic traits gets us closer to the true issues at stake. Another interesting feature of this chapter is the relation to Darwin’s reflections on the common origin of life on Earth through many quotes.

“If those prior probabilities are obscure, the same will be true of the posterior probabilities.” (E&E, p.277)

The statistical issue is thus one of testing for a common ancestor versus separate ancestors for a set of organisms. The nature of the information contained in the data is never made precise enough to understand whether this fits the principle of total evidence stressed throughout the book. The chapter also shows a more lenient disposition towards Bayesian solutions, but Section 4.3 ends with an impossibility statement: no objective prior can be defined, because Sober wants prior probabilities that have some authority. This is a self-defeating constraint, since it leaves only empirically well-grounded priors as admissible.

“Those propositions suffice for similarity to be evidence for common ancestry, and they have broad applicability.” (E&E, p.283)

The part about Reichenbach’s (1956) sufficient condition for a common trait to induce a likelihood ratio larger than one in favour of the common ancestor hypothesis needs to be discussed, as this is the point I find the most puzzling in the chapter. Indeed, most of the nine assumptions of Reichenbach (1956) relate both models under comparison, i.e. common ancestry versus separate ancestry. This seems to me to be a weird thing to do, as models under comparison should not share all of their parameters! For instance, if we build a Bayesian model to compare those models, we would use a prior distribution on each group of parameters. Having a common parameter does not make sense since we end up selecting one of the two models. I wonder if this is the result of a reluctance to have true parameters as in a regular statistical analysis. (See, e.g., the lament that “until values for adjustable parameters are specified, we cannot talk about the probability of the data under different hypotheses”, p.338.) What is striking is the reliance of the whole chapter on this unnatural set of hypotheses, since it keeps resurfacing throughout the chapter. Sober writes that Propositions 1-9 are not consequences of the axioms of probability. Neither are they necessary conditions for common ancestry to have a higher likelihood than separate ancestry (p.283). Nonetheless, this is creating an unnecessary bias in the perception of the problem, which may induce critics of evolution to reject the whole approach.

“If there was no such common ancestor, what would alignment ever mean?” (E&E, p.291)

The theme of the missing model I have alluded to in the previous posts is also recurrent in this chapter. There are a lot of paragraphs about the choice of the representation of the difference between two species, from trait to gene sequence, and the author acknowledges that the difficulty in this choice has to do with a requirement for a more advanced theoretical representation (model) adapted to more complex data. This sounds rather obvious stated that way, but the book wanders around this point for pages! (An example is the above quote, which misses the point about sequence alignment: this is a perfectly well-defined measure of distance, common ancestor or not.) And the overall conclusion is a vague call for the principle of total evidence (which is a rephrasing of the likelihood principle). As illustrated in the section on multiple characters, the discussion is confusing without a proper model. It is only on page 300 of the book that a completely defined model for the evolution of a dichotomous trait (i.e. the simplest possible case) appears. This model is a rather crude tool, as it depends on arbitrary calibration factors like $P(Z=0)=0.99$ instead of 1 and, more importantly, on an unspecified time (as in “what time is it on the evolution clock?”). The corresponding likelihood ratio is then (under one of the selection schemes)

$\dfrac{0.01b_t^2 + 0.99}{[0.01b_t+0.99]^2}$

where the dependence on those factors is obvious. This illustrates the impossibility of reaching a satisfactory conclusion without first going through a statistical analysis of the problem.
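To make this dependence concrete, here is a minimal numerical sketch of the ratio displayed above. It assumes that $b_t$ is a probability in $[0,1]$, whose value depends on the unspecified time $t$, and that the weights 0.01 and 0.99 are the calibration factors $P(Z=1)$ and $P(Z=0)$. The function name `likelihood_ratio` and the argument `p1` are introduced only for this illustration and do not come from the book.

```python
def likelihood_ratio(b_t: float, p1: float = 0.01) -> float:
    """Evaluate (p1 * b_t**2 + p0) / (p1 * b_t + p0)**2 with p0 = 1 - p1.

    With the default p1 = 0.01 this is the ratio displayed in the post;
    b_t is treated as a probability in [0, 1] (an assumption of this sketch).
    """
    p0 = 1.0 - p1
    numerator = p1 * b_t ** 2 + p0
    denominator = (p1 * b_t + p0) ** 2
    return numerator / denominator


if __name__ == "__main__":
    # Vary the unknown b_t (the "time on the evolution clock").
    for b_t in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"b_t = {b_t:.2f}  ratio = {likelihood_ratio(b_t):.5f}")
    # Vary the arbitrary calibration factor at a fixed b_t.
    for p1 in (0.01, 0.1, 0.5):
        print(f"p1 = {p1:.2f}  ratio at b_t = 0: {likelihood_ratio(0.0, p1):.5f}")
```

With the 0.01/0.99 calibration the ratio never exceeds 1/0.99, about 1.0101, and it drops to 1 when $b_t=1$, so the evidence in favour of common ancestry is at best very weak and its strength is entirely driven by the unknown $b_t$. Moving the calibration weight to 0.5 raises the ratio at $b_t=0$ to 2, which shows how much the conclusion hinges on those arbitrary factors.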

“It is possible for data to discriminate among a set of hypotheses without saying anything about a proposition that is common to all the alternatives considered” (E&E, p.315)

The debate about the phylogenetic tree reconstruction versus the test for common ancestry (Sections 4.7 and 4.8) lacks appeal for the very reason given above. The tree structure may be incorporated within the model(s) and integrated out in a Bayesian fashion to provide the marginal likelihood of the model(s). Although this seems to be an important issue, as illustrated by the controversy with Templeton, the opposition between likelihood inference and “cladistic” parsimony is not properly conducted in that, as a naïve reader, I cannot understand Sober’s presentation of the latter. This section is much more open to Bayesian processing by abstaining from the usual criticism about the lack of objectivity of the prior selection, but it entirely misses the ability of the Bayesian approach to integrate out the nuisance parameters, whether they are the tree topology (standard marginalisation) or the model index (model averaging). The debate about the limited meaning of statistical consistency makes the valid point that consistency only sheds light on the case when the hypothesised model is true, but extended consistency could have been considered as well, namely that the procedure will bring the hypothesised model as close as possible to the “true” model within the hypothesised family of models. What I gather from this final section is that cladistic parsimony tries to do without models (if not without assumptions), which seems to relate to Templeton’s views about Bayesian inference.
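As a toy illustration of integrating out the tree topology, the sketch below averages per-topology likelihoods against a prior on topologies to get one marginal likelihood per hypothesis, then turns the two marginals into a posterior probability. All numbers are hypothetical and chosen only to make the code run. The per-topology log-likelihoods are assumed to be already integrated over any continuous parameters, and the helper names are made up for this example.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_marginal(log_liks, priors):
    """log of sum_k priors[k] * exp(log_liks[k]): the topology is summed out."""
    return log_sum_exp([math.log(p) + ll for ll, p in zip(log_liks, priors)])


# Hypothetical log-likelihoods of the data under three candidate topologies
# of the common-ancestry model, with a uniform prior over topologies.
ca_log_liks = [-102.3, -104.1, -110.8]
ca_priors = [1 / 3, 1 / 3, 1 / 3]

# Hypothetical separate-ancestry model with a single configuration.
sa_log_liks = [-108.0]
sa_priors = [1.0]

log_m_ca = log_marginal(ca_log_liks, ca_priors)
log_m_sa = log_marginal(sa_log_liks, sa_priors)

# Log Bayes factor of CA against SA, and the posterior probability of CA
# when both hypotheses get prior weight 1/2.
log_bf = log_m_ca - log_m_sa
post_ca = 1.0 / (1.0 + math.exp(-log_bf))

if __name__ == "__main__":
    print(f"log marginal (CA) = {log_m_ca:.3f}")
    print(f"log marginal (SA) = {log_m_sa:.3f}")
    print(f"log Bayes factor  = {log_bf:.3f}")
    print(f"P(CA | data)      = {post_ca:.4f}")
```

The same weighted sum, applied one level up with the model index in place of the topology, gives model averaging. In both cases the nuisance quantity disappears from the final answer without having to be estimated first.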

Again, this is certainly the most enjoyable chapter of the book from my point of view (besides the nice recap about methods of inference in Chapter 1), even though the lack of real illustrations makes it less potent than it could be. It also shows the limitation of a philosophical debate on simplistic idealisations of the real model. The book only acknowledges on page 334 that genealogical hypotheses are composite. Better late than never, but I think that an incorporation of the parameter estimation in the inferential process would not have hurt the quality of the debate.