## R puzzle

Posted in Statistics with tags , , on September 12, 2019 by xi'an

Can you guess the meaning of the following R code

"?"=u\164f8ToI\x6Et;'!'=prod;!{
z<-y}&z>T##&[]>~48bEfILpu

If not (!), the explanation is provided in Robin’s answer to a codegolf puzzle.

## who’s that travelling salesman path?!

Posted in Statistics with tags , , , on July 18, 2017 by xi'an

## Asher’s enigma

Posted in R, Statistics with tags , , , on July 26, 2010 by xi'an

On his Probability and statistics blog, Matt Asher put a funny question (with my rephrasing):

Take a unit square. Now pick two spots at random along the perimeter, uniformly. For each of these two locations, pick another random point from one of the three other sides of the square and draw the segment. What is the probability the two segments intersect? And what is the distribution for the intersection points?

The (my) intuition for the first question was 1/2, but a quick computation led to another answer. The key to the computation is to distinguish whether or not both segments share one side of the square. They do with probability $\dfrac{2}{4}\times 1 + \dfrac{2}{4}\times\dfrac{2}{3} = \dfrac{5}{6},$

in which case they intersect with probability 1/2. They occupy the four sides with probability 1/6, in which case they intersect with probability 1/3. So the final answer is 17/36 (as posted by several readers and empirically found by Matt). The second question is much more tricky: the histogram of the distribution of the coordinates is peaked towards the boundaries, thus reminding me of an arc-sine distribution, but there is a bump in the middle as well. Computing the coordinates of the intersection depending on the respective positions of the endpoints of both segments and simulating those distributions led me to histograms that looked either like beta B(a,a) distributions, or like beta B(1,a) distributions, or like beta B(a,1) distributions… Not exactly, though. So not even a mixture of beta distributions is enough to explain the distribution of the intersection points… For instance, the intersection points corresponding to segments were both segments start from the same side and end up in the opposite side are distributed as $\dfrac{u_1(u_4-u_3)-u_3(u_2-u_1)}{u_4-u_3-u_2+u_1}$

where all u‘s are uniform on (0,1) and under the constraint $(u_2-u_1)(u_4-u_3)<0$. The following graph shows how well a beta distribution fits in that case. (Not perfectly, though!) The R code is

u=matrix(runif(4*10^5),ncol=4)
u[,c(1,3)]=t(apply(u[,c(1,3)],1,sort))
u[,c(2,4)]=-t(apply(-u[,c(2,4)],1,sort))
y=(u[,1]*(u[,4]-u[,3])-u[,3]*(u[,2]-u[,1]))/(u[,1]+u[,4]-u[,2]-u[,3])

Similarly, if the two segments start from the same side but end up on different sides, the distribution of one coordinate is given by $\dfrac{u_1(1-u_3)-u_3u_4(u_2-u_1)}{1-u_3-u_4(u_2-u_1)}$

under the constraint $u_3. The outcome is once again almost distributed as a beta: The corresponding R code is

u=matrix(runif(4*10^5),ncol=4)
u[,c(1,3)]=-t(apply(-u[,c(1,3)],1,sort))
y=(u[,1]*(1-u[,3])-u[,3]*u[,4]*(u[,2]-u[,1]))/(1-u[,3]-u[,4]*(u[,2]-u[,1]))

## Le Monde rank test (corr’d)

Posted in R, Statistics with tags , , , on April 7, 2010 by xi'an

Since my first representation of the rank statistic as paired was incorrect, here is the histogram produced by the simulation

perm=sample(1:20)
saple[t]=sum(abs(sort(perm[1:10])-sort(perm[11:20])))

when $n=20$. It is obviously much closer to zero than previously. An interesting change is that the regression of the log-mean on $log(n)$ produces

> lm(log(memean)~log(enn))
Call:
lm(formula = log(memean) ~ log(enn))
Coefficients:
(Intercept)     log(enn)
-1.162        1.499

meaning that the mean is in $n^{3/2}$ rather than in $n$ or $n^2$:

> summary(lm(memean~eth-1))
Coefficients:
Estimate Std. Error t value Pr(>|t|)
eth 0.3117990  0.0002719    1147   <2e-16 ***

with a very good fit. ## Le Monde rank test

Posted in R, Statistics with tags , , , , , , , , on April 5, 2010 by xi'an

In the puzzle found in Le Monde of this weekend, the mathematical object behind the silly story is defined as a pseudo-Spearman rank correlation test statistic, $\mathfrak{M}_n = \sum_{i=1}^n |r^x_i-r^y_i|\,,$

where the difference between the ranks of the paired random variables $x_i$ and $y_i$ is in absolute value instead of being squared as in the Spearman rank test statistic. I don’t know whether or not this measure of distance has been studied in the statistics literature (although I’d be surprised has it not been studied!). Here is an histogram of the distribution of the new statistics for $n=20$ under the null hypothesis that both samples are uncorrelated (i.e. that the sequence of ranks is a random permutation). Each point in the sample was obtained by

perm=sample(1:20)
saple[t]=sum(abs(perm[1:10]-perm[11:20])) When regressing the mean of this statistic $\mathfrak{M}_n$ against the covariates $n$ and $n^2$, I obtain the uninspiring formula $\mathbb{E} [\mathfrak{M}_n] \approx 0.1681 n^2 - 0.3769 n + 11.1921$

which does not translate into a nice polynomial in $n$!

Another interesting probabilistic/combinatorial problem issued from an earlier Le Monde puzzle: given an urn with $n$ white balls and $n$ black balls that is sampled without replacement, what is the probability that there exists a sequence of length $2k$ with the same number of white and black balls for $k=1,\ldots,n$? If $k=1,n$, the answer is obviously one (1), but for some values of $k$, it is less than one. When $n$ goes to infinity, this is somehow related to the probability that a Brownian bridge crosses the axis in-between $0$ and $1$ but I have no clue whether this helps or not! Robin Ryder solved the question for the values $n=50$ and $k=24,25$ by establishing that the probability is still one.

Ps- The same math tribune in Le Monde coincidently advertises a book, Le Mythe Climatique, by Benoît Rittaud that adresses … climate change issues and the “statistical mistakes made by climatologists”. The interesting point (if any) is that Benoît Rittaud is a “mathematician not a statistician”, with a few papers in ergodic theory, but this advocated climatoskeptic nonetheless criticises the use of both statistical and simulation tools in climate modeling. (“Simulation has only been around for a few dozen years, a very short span in the history of sciences. The climate debate may be an opportunity to reassess the role of simulation in the scientific process.”)

## New Le Monde puzzle

Posted in Kids, R with tags , on March 4, 2010 by xi'an

When I first read Le Monde puzzle this weekend, I though it was even less exciting than the previous one: find $Y>2010$ and $N<(Y-N)$,

such that $N(Y-N)$ is a multiple of $Y$.

The solution is obtained by brute-force checking through an R program: $Y=2012, N=1006$

and then the a next solution is $Y=2016, N=504$ (with several values for N).

However, while waiting in the plane to Edinburgh, I thought more about it and found that the problem can be solved with paper and pencil. It goes like this. There exists an integer $c$ such that $N(Y-N)=cY.$

Hence, solving the second degree equation, $N = \left(Y\pm\sqrt{Y^2-4cY}\right)/2 = Y \left(1\pm\sqrt{1-4c/Y}\right)/2$

which implies that $2 / \left(1\pm\sqrt{1-4c/Y}\right)$

is one of the integral factors of $Y$. If we write $Y=qk$ with $1\pm\sqrt{1-4c/Y}= \dfrac{2}{k},$

we get $\dfrac{4c}{Y} = \dfrac{4c}{qk} = 1 - \left( 1 - \dfrac{2}{k}\right)^2 = \frac{4}{k}\left(1-\dfrac{1}{k}\right)$

and thus $c=q-\dfrac{q}{k} = q\,\dfrac{k-1}{k}$.

We thus deduce that $q/k$ is an integer, meaning that $q=pk$ and thus that $k^2$ is an integer factor of $Y$. This obviously restricts the choice of $Y$, especially when considering that $4c\le Y$ implies $k\ge 4$. Furthermore, the solutions to the second degree equations are then given by $\dfrac{Y}{2} \left( 1 \pm \dfrac{k-2}{k}\right) = \left\{ \begin{matrix} pk\\ pk(k-1)\end{matrix}\right.$.

The conclusion is thus that any $Y$ which can be divided by a squared integer larger than or equal to $4^2$ provides a solution. Now, if we look at the decomposition of $2010=2\times 3\times 5\times 67$, $2011=2011\times 1$, $2012=2^4\times 503$, $2013=3\times 11\times 61$, $2014=2\times 19\times 53$, $2015=5\times 13\times 31$, $2016=2^5\times 3^2\times 7$,

we see (without any R programming) that $(Y,N)=(2016,504)$ is the smallest solution (in $Y$). $Y=2016$ is the smallest solution with $N=504$ a possible solution in $N$ ( $N=168$ being another one). Which makes (at least) for a more bearable diversion in the plane…

An approach avoiding the second degree equation is to notice that $N(Y-N)=cY$ implies that $N$ and $Y$ share a largest common divider $h>1$, i.e. $Y=hY^\prime\,,\quad N=hN^\prime$

which implies $N^\prime(Y^\prime-N^\prime)h=cY^\prime$,

thus that $Y^\prime$ divides $k$ (because both other terms are prime with $Y^\prime$). Eliminating dividers common to $c$ and $N^\prime$ actually leads to $Y^\prime=k$, hence to the same conclusion as before.

## Le Monde puzzle

Posted in Statistics with tags , on February 21, 2010 by xi'an

The puzzle in Le Monde is quite straightforward (!) this weekend: it can be rewritten as to figure out the values of the sums $10! \displaystyle{\left.\left\{1 - \sum_{j_1=1}^{10} \frac{1}{j_1}\left\{1-\sum_{\stackrel{j_2=1}{j_2\ne j_1}}^{10}\frac{1}{j_2}\right\{1-\ldots\right.\right.}$ $\displaystyle{\qquad\ldots\left.\left.\left.\left\{1-\sum_{\stackrel{j_9=1}{j_9\ne j_1,\ldots,j_8}}^{10} \frac{1}{j_9}\right\}\ldots\right\}\right\}\right\}}$

and $\displaystyle{\left.(49!)^2 \left\{1 - \sum_{j_1=1}^{49} \frac{1}{j_1^2}\left\{1-\sum_{\stackrel{j_2=1}{j_2\ne j_1}}^{49}\frac{1}{j_2^2}\right\{1-\ldots\qquad\right.\right.}$ $\displaystyle{\qquad\ldots\left.\left.\left.\left\{1-\sum_{\stackrel{j_{48}=1}{j_{48}\ne j_1,\ldots,j_{47}}}^{49} \frac{1}{j_{48}^2}\right\}\ldots\right\}\right\}\right\}}$

which are easily displayed and as easily solved.

The first sum can indeed be written as $\displaystyle{\sum_{i=1}^k (-1)^{k-i} \sum_{\stackrel{j_1\ne\cdots\ne j_i}{\in\{1,\ldots,k\}}} \prod_{u=1}^i j_u}$

for $k=10$. This is simply $\displaystyle{\sum_{i=0}^k (-1)^{k-i} \sum_{\stackrel{j_1\ne\cdots\ne j_i}{\in\{1,\ldots,k\}}} \prod_{u=1}^i j_u}- (-1)^k = \prod_{i=1}^k (i-1) - (-1)^k$

and the solution is thus $(-1)^{k+1}$, equal to -1 for $k=10$. Once we realise the fact this is a product with one missing term, the second sum is similar: it can be written as $\displaystyle{\sum_{i=0}^k (-1)^{k-i} \sum_{\stackrel{j_1\ne\cdots\ne j_i}{\in\{1,\ldots,k\}}} \prod_{u=1}^i j_u^2}- (-1)^k$ $\qquad = \prod_{i=1}^k (i^2-1) - (-1)^k = - (-1)^k$

This is equal to 1 for $k=49$. A bit disappointing because it amounts to reformulate the question with the proper algebraic formula…