## Le Monde puzzle [#887quater]

Posted in Books, Kids, R, Statistics, University life with tags , , on November 28, 2014 by xi'an

And yet another resolution of this combinatorics Le Monde mathematical puzzle: that puzzle puzzled many more people than usual! This solution is by Marco F, using a travelling salesman representation and existing TSP software.

N is a golden number if the sequence {1,2,…,N} can be reordered so that the sum of any consecutive pair is a perfect square. What are the golden numbers between 1 and 25?

For instance, take n=199, you should first calculate the “friends”. Save them on a symmetric square matrix:

m1 <- matrix(Inf, nrow=199, ncol=199)
diag(m1) <- 0
for (i in 1:199) m1[i,friends[i]] <- 1


Export the distance matrix to a file (in TSPlib format):

library(TSP)
tsp <- TSP(m1)
tsp
image(tsp)
write_TSPLIB(tsp, "f199.TSPLIB")


And use a solver to obtain the results. The best solver for TSP is Concorde. There are online versions where you can submit jobs:

0 2 1000000
2 96 1000000
96 191 1000000
191 168 1000000
...


The numbers of the solution are in the second column (2, 96, 191, 168…). And they are 0-indexed, so you have to add 1 to them:

3  97 192 169 155 101 188 136 120  49 176 148 108 181 143 113 112  84  37  63 18  31  33  88168 193  96 160 129 127 162 199  90  79 177 147  78  22 122 167 194 130  39 157  99 190 13491 198  58  23  41 128 196  60  21 100 189 172 152 73 183 106  38 131 125 164 197  59 110 146178 111 145  80  20  61 135 121  75  6  94 195166 123 133 156  69  52 144  81  40   9  72 184  12  24  57  87  82 62  19  45  76 180 109 116 173 151  74  26  95 161 163 126  43 153 17154  27 117 139  30  70  11  89 107 118 138 186103  66 159 165 124 132  93  28   8  17  32  45  44  77 179 182 142  83  86  14  50 175 114 55 141 115  29  92 104 185  71  10  15  34   27  42 154 170 191  98 158  67 102 187 137 119 25  56 65  35  46 150 174  51  13  68  53  47 149 140  85  36  64 105  16  48


## Le Monde puzzle [#887ter]

Posted in Books, Kids, Statistics, University life with tags , , , , on November 27, 2014 by xi'an

Here is a graph solution to the recent combinatorics Le Monde mathematical puzzle, proposed by John Shonder:

N is a golden number if the sequence {1,2,…,N} can be reordered so that the sum of any consecutive pair is a perfect square. What are the golden numbers between 1 and 25?

Consider an undirected graph GN with N vertices labelled 1 through N. Draw an edge between vertices i and j if and only if i + j is a perfect square. Then N is golden if GN contains a Hamiltonian path — that is, if there is a connected path that visits all of the vertices exactly once.I wrote a program (using Mathematica, though I’m sure there must be an R library with similar functionality) that builds up G sequentially and checks at each step whether the graph contains a Hamiltonian path. The program starts with G1 — a single vertex and no edges. Then it adds vertex 2. G2 has no edges, so 2 isn’t golden.

Adding vertex 3, there is an edge between 1 and 3. But vertex 2 is unconnected, so we’re still not golden.

The results are identical to yours, but I imagine my program runs a bit faster. Mathematica contains a built-in function to test for the existence of a Hamiltonian path.

Some of the graphs are interesting. I include representations of G25 and G36. Note that G36 contains a Hamiltonian cycle, so you could arrange the integers 1 … 36 on a roulette wheel such that each consecutive pair adds to a perfect square.

A somewhat similar problem:

Call N a “leaden” number if the sequence {1,2, …, N} can be reordered so that the sum of any consecutive pair is a prime number. What are the leaden numbers between 1 and 100? What about an arrangement such that the absolute value of the difference between any two consecutive numbers is prime?

[The determination of the leaden numbers was discussed in a previous Le Monde puzzle post.]

## Methodological developments in evolutionary genomic [3 years postdoc in Montpellier]

Posted in pictures, Statistics, Travel, University life, Wines with tags , , , , , , , , , on November 26, 2014 by xi'an

[Here is a call for a post-doctoral position in Montpellier, South of France, not Montpelier, Vermont!, in a population genetics group with whom I am working. Highly recommended if you are currently looking for a postdoc!]

#### Three-year post-doctoral position at the Institute of Computational Biology (IBC), Montpellier (France) : Methodological developments in evolutionary genomics.

One young investigator position opens immediately at the Institute for Computational Biology (IBC) of Montpellier (France) to work on the development of innovative inference methods and software in population genomics or phylogenetics to analyze large-scale genomic data in the fields of health, agronomy and environment (Work Package 2 « evolutionary genomics » of the IBC). The candidate will develop its own research on some of the following topics : selective processes, demographic history, spatial genetic processes, very large phylogenies reconstruction, gene/species tree reconciliation, using maximum likelihood, Bayesian and simulation-based inference. We are seeking a candidate with a strong background in mathematical and computational evolutionary biology, with interest in applications and software development. The successfull candidate will work on his own project, build in collaboration with any researcher involved in the WP2 project and working at the IBC labs (AGAP, CBGP, ISEM, I3M, LIRMM, MIVEGEC).

IBC hires young investigators, typically with a PhD plus some post-doc experience, a high level of publishing, strong communication abilities, and a taste for multidisciplinary research. Working full-time at IBC, these young researchers will play a key role in Institute life. Most of their time will be devoted to scientific projects. In addition, they are expected to actively participate in the coordination of workpackages, in the hosting of foreign researchers and in the organization of seminars and events (summer schools, conferences…). In exchange, these young researchers will benefit from an exceptional environment thanks to the presence of numerous leading international researchers, not to mention significant autonomy for their work. Montpellier hosts one of the most vibrant communities of biodiversity research in Europe with several research centers of excellence in the field. This positions is open for up to 3 years with a salary well above the French post-doc standards. Starting date is open to discussion.

The application deadline is January 31, 2015.

Living in Montpellier: http://www.agropolis.org/english/guide/index.html

#### Contacts at WP2 « Evolutionary Genetics » :

Jean-Michel Marin : http://www.math.univ-montp2.fr/~marin/

Vincent Ranwez : https://sites.google.com/site/ranwez/

Olivier Gascuel : http://www.lirmm.fr/~gascuel/

Submit my application : http://www.ibc-montpellier.fr/open-positions/young-investigators#wp2-evolution

## reflections on the probability space induced by moment conditions with implications for Bayesian Inference [refleXions]

Posted in Statistics, University life with tags , , , , , , , , , , on November 26, 2014 by xi'an

“The main finding is that if the moment functions have one of the properties of a pivotal, then the assertion of a distribution on moment functions coupled with a proper prior does permit Bayesian inference. Without the semi-pivotal condition, the assertion of a distribution for moment functions either partially or completely specifies the prior.” (p.1)

Ron Gallant will present this paper at the Conference in honour of Christian Gouréroux held next week at Dauphine and I have been asked to discuss it. What follows is a collection of notes I made while reading the paper , rather than a coherent discussion, to come later. Hopefully prior to the conference.

The difficulty I have with the approach presented therein stands as much with the presentation as with the contents. I find it difficult to grasp the assumptions behind the model(s) and the motivations for only considering a moment and its distribution. Does it all come down to linking fiducial distributions with Bayesian approaches? In which case I am as usual sceptical about the ability to impose an arbitrary distribution on an arbitrary transform of the pair (x,θ), where x denotes the data. Rather than a genuine prior x likelihood construct. But I bet this is mostly linked with my lack of understanding of the notion of structural models.

“We are concerned with situations where the structural model does not imply exogeneity of θ, or one prefers not to rely on an assumption of exogeneity, or one cannot construct a likelihood at all due to the complexity of the model, or one does not trust the numerical approximations needed to construct a likelihood.” (p.4)

As often with econometrics papers, this notion of structural model sets me astray: does this mean any latent variable model or an incompletely defined model, and if so why is it incompletely defined? From a frequentist perspective anything random is not a parameter. The term exogeneity also hints at this notion of the parameter being not truly a parameter, but including latent variables and maybe random effects. Reading further (p.7) drives me to understand the structural model as defined by a moment condition, in the sense that

$\mathbb{E}[m(\mathbf{x},\theta)]=0$

has a unique solution in θ under the true model. However the focus then seems to make a major switch as Gallant considers the distribution of a pivotal quantity like

$Z=\sqrt{n} W(\mathbf{x},\theta)^{-\frac{1}{2}} m(\mathbf{x},\theta)$

as induced by the joint distribution on (x,θ), hence conversely inducing constraints on this joint, as well as an associated conditional. Which is something I have trouble understanding, First, where does this assumed distribution on Z stem from? And, second, exchanging randomness of terms in a random variable as if it was a linear equation is a pretty sure way to produce paradoxes and measure theoretic difficulties.

The purely mathematical problem itself is puzzling: if one knows the distribution of the transform Z=Z(X,Λ), what does that imply on the joint distribution of (X,Λ)? It seems unlikely this will induce a single prior and/or a single likelihood… It is actually more probable that the distribution one arbitrarily selects on m(x,θ) is incompatible with a joint on (x,θ), isn’t it?

“The usual computational method is MCMC (Markov chain Monte Carlo) for which the best known reference in econometrics is Chernozhukov and Hong (2003).” (p.6)

While I never heard of this reference before, it looks like a 50 page survey and may be sufficient for an introduction to MCMC methods for econometricians. What I do not get though is the connection between this reference to MCMC and the overall discussion of constructing priors (or not) out of fiducial distributions. The author also suggests using MCMC to produce the MAP estimate but this always stroke me as inefficient (unless one uses our SAME algorithm of course).

“One can also compute the marginal likelihood from the chain (Newton and Raftery (1994)), which is used for Bayesian model comparison.” (p.22)

Not the best solution to rely on harmonic means for marginal likelihoods…. Definitely not. While the author actually uses the stabilised version (15) of Newton and Raftery (1994) estimator, which in retrospect looks much like a bridge sampling estimator of sorts, it remains dangerously close to the original [harmonic mean solution] especially for a vague prior. And it only works when the likelihood is available in closed form.

“The MCMC chains were comprised of 100,000 draws well past the point where transients died off.” (p.22)

I wonder if the second statement (with a very nice image of those dying transients!) is intended as a consequence of the first one or independently.

“A common situation that requires consideration of the notions that follow is that deriving the likelihood from a structural model is analytically intractable and one cannot verify that the numerical approximations one would have to make to circumvent the intractability are sufficiently accurate.” (p.7)

This then is a completely different business, namely that defining a joint distribution by mean of moment equations prevents regular Bayesian inference because the likelihood is not available. This is more exciting because (i) there are alternative available! From ABC to INLA (maybe) to EP to variational Bayes (maybe). And beyond. In particular, the moment equations are strongly and even insistently suggesting that empirical likelihood techniques could be well-suited to this setting. And (ii) it is no longer a mathematical worry: there exist a joint distribution on m(x,θ), induced by a (or many) joint distribution on (x,θ). So the question of finding whether or not it induces a single proper prior on θ becomes relevant. But, if I want to use ABC, being given the distribution of m(x,θ) seems to mean I can only generate new values of this transform while missing a natural distance between observations and pseudo-observations. Still, I entertain lingering doubts that this is the meaning of the study. Where does the joint distribution come from..?!

“Typically C is coarse in the sense that it does not contain all the Borel sets (…)  The probability space cannot be used for Bayesian inference”

My understanding of that part is that defining a joint on m(x,θ) is not always enough to deduce a (unique) posterior on θ, which is fine and correct, but rather anticlimactic. This sounds to be what Gallant calls a “partial specification of the prior” (p.9).

Overall, after this linear read, I remain very much puzzled by the statistical (or Bayesian) implications of the paper . The fact that the moment conditions are central to the approach would once again induce me to check the properties of an alternative approach like empirical likelihood.

## prayers and chi-square

Posted in Books, Kids, Statistics, University life with tags , , , , , , on November 25, 2014 by xi'an

One study I spotted in Richard Dawkins’ The God delusion this summer by the lake is a study of the (im)possible impact of prayer over patient’s recovery. As a coincidence, my daughter got this problem in her statistics class of last week (my translation):

1802 patients in 6 US hospitals have been divided into three groups. Members in group A was told that unspecified religious communities would pray for them nominally, while patients in groups B and C did not know if anyone prayed for them. Those in group B had communities praying for them while those in group C did not. After 14 days of prayer, the conditions of the patients were as follows:

• out of 604 patients in group A, the condition of 249 had significantly worsened;
• out of 601 patients in group B, the condition of 289 had significantly worsened;
• out of 597 patients in group C, the condition of 293 had significantly worsened.

Use a chi-square procedure to test the homogeneity between the three groups, a significant impact of prayers, and a placebo effect of prayer.

This may sound a wee bit weird for a school test, but she is in medical school after all so it is a good way to enforce rational thinking while learning about the chi-square test! (Answers: [even though the data is too sparse to clearly support a decision, esp. when using the chi-square test!] homogeneity and placebo effect are acceptable assumptions at level 5%, while the prayer effect is not [if barely].)

## an ABC experiment

Posted in Books, pictures, R, Statistics, University life with tags , , , , , , , , on November 24, 2014 by xi'an

In a cross-validated forum exchange, I used the code below to illustrate the working of an ABC algorithm:

#normal data with 100 observations
n=100
x=rnorm(n)
#observed summaries

#normal x gamma prior
priori=function(N){
return(cbind(rnorm(N,sd=10),
1/sqrt(rgamma(N,shape=2,scale=5))))
}

ABC=function(N,alpha=.05){

prior=priori(N) #reference table

#pseudo-data
summ=matrix(0,N,2)
for (i in 1:N){
xi=rnorm(n)*prior[i,2]+prior[i,1]
}

#normalisation factor for the distance
#distance
#selection
posterior=prior[dist<quantile(dist,alpha),]}


Hence I used the median and the mad as my summary statistics. And the outcome is rather surprising, for two reasons: the first one is that the posterior on the mean μ is much wider than when using the mean and the variance as summary statistics. This is not completely surprising in that the latter are sufficient, while the former are not. Still, the (-10,10) range on the mean is way larger… The second reason for surprise is that the true posterior distribution cannot be derived since the joint density of med and mad is unavailable.

After thinking about this for a while, I went back to my workbench to check the difference with using mean and variance. To my greater surprise, I found hardly any difference! Using the almost exact ABC with 10⁶ simulations and a 5% subsampling rate returns exactly the same outcome. (The first row above is for the sufficient statistics (mean,standard deviation) while the second row is for the (median,mad) pair.) Playing with the distance does not help. The genuine posterior output is quite different, as exposed on the last row of the above, using a basic Gibbs sampler since the posterior is not truly conjugate.

## Challis Lectures

Posted in Books, pictures, Statistics, Travel, University life, Wines with tags , , , , , , , on November 23, 2014 by xi'an

I had a great time during this short visit in the Department of Statistics, University of Florida, Gainesville. First, it was a major honour to be the 2014 recipient of the George H. Challis Award and I considerably enjoyed delivering my lectures on mixtures and on ABC with random forests, And chatting with members of the audience about the contents afterwards. Here is the physical award I brought back to my office:

More as a piece of trivia, here is the amount of information about the George H. Challis Award I found on the UF website:

This fund was established in 2000 by Jack M. and Linda Challis Gill and the Gill Foundation of Texas, in memory of Linda’s father, to support faculty and student conference travel awards and the George Challis Biostatistics Lecture Series. George H. Challis was born on December 8, 1911 and was raised in Italy and Indiana. He was the first cousin of Indiana composer Cole Porter. George earned a degree in 1933 from the School of Business at Indiana University in Bloomington. George passed away on May 6, 2000. His wife, Madeline, passed away on December 14, 2009.

Cole Porter, indeed!

On top of this lecturing activity, I had a full academic agenda, discussing with most faculty members and PhD students of the Department, on our respective research themes over the two days I was there and it felt like there was not enough time! And then, during the few remaining hours where I did not try to stay on French time (!), I had a great time with my friends Jim and Maria in Gainesville, tasting a fantastic local IPA beer from Cigar City Brewery and several great (non-local) red wines… Adding to that a pile of new books, a smooth trip both ways, and a chance encounter with Alicia in Atlanta airport, it was a brilliant extended weekend!