folded Normals

Posted in Books, Kids, pictures, R, Running, Statistics with tags , , , , , , , , , , , , on February 25, 2021 by xi'an

While having breakfast (after an early morn swim at the vintage La Butte aux Cailles pool, which let me in free!), I noticed a letter to the Editor in the Annals of Applied Statistics, which I was unaware existed. (The concept, not this specific letter!) The point of the letter was to indicate that finding the MLE for the mean and variance of a folded normal distribution was feasible without resorting to the EM algorithm. Since the folded normal distribution is a special case of mixture (with fixed weights), using EM is indeed quite natural, but the author, Iain MacDonald, remarked that an optimiser such as R nlm() could be called instead. The few lines of relevant R code were even included. While this is a correct if minor remark, I am a wee bit surprised at seeing it included in the journal, the more because the authors of the original paper using the EM approach were given the opportunity to respond, noticing EM is much faster than nlm in the cases they tested, and Iain MacDonald had a further rejoinder! The more because the Wikipedia page mentioned the use of optimisers much earlier (and pointed out at the R package Rfast as producing MLEs for the distribution).

love thy command line [Bourne again]

Posted in Books, Kids, Linux, R, University life with tags , , , , , , , , on February 15, 2021 by xi'an

“Prebuilt into macOS and Unix systems (…) the command line (also called the shell) is a powerful text-based interface in which users issue terse instructions to create, find, sort and manipulate files, all without using the mouse. There are actually several distinct (…) shell systems, among the most popular of which [sic?] is Bash, an acronym for the ‘Bourne again shell’ (a reference to the Bourne shell, which it replaced in 1989).”

An hilarious rediscovery of the joys of shell (line) commands in Nature! Which I use by default for most operations on my computer, albeit far from expertly (despite the use of a cheat tee, from time to time!). One of the arguments in the article, “The mouse doesn’t scale,” is definitely mine as well. Among other marketing lines, wrangling files with no software interference (check), handling huge files (very rarely), manipulating spreadsheets (I don’t), parallelising work on remote servers (check), automate via cron (not anymore)…. Unsurprisingly, most of our students are never using terminals of command lines.

new order

Posted in Books, Kids, R, Statistics with tags , , , , on February 5, 2021 by xi'an

The latest riddle from The Riddler was both straightforward: given four iid Normal variates, X¹,X²,X³,X⁴, what is the probability that X¹+X²<X³+X⁴ given that X¹<X³ ? The answer is ¾ and it actually does not depend on the distribution of the variates. The bonus question is however much harder: what is this probability when there are 2N iid Normal variates?

I posted the question on math.stackexchange, then on X validated, but received no hint at a possible simplification of the probability. And then erased the questions. Given the shape of the domain where the bivariate Normal density is integrated, it sounds most likely there is no closed-form expression. (None was proposed by the Riddler.) The probability decreases roughly in N³ when computing this probability by simulation and fitting a regression.

> summary(lm(log(p)~log(r)))

Residuals:
Min        1Q    Median        3Q       Max
-0.013283 -0.010362 -0.000606  0.007835  0.039915

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.111235   0.008577  -12.97 4.11e-13 ***
log(r)      -0.311361   0.003212  -96.94  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.01226 on 27 degrees of freedom
Multiple R-squared:  0.9971,	Adjusted R-squared:  0.997
F-statistic:  9397 on 1 and 27 DF,  p-value: < 2.2e-16


hard birthday problem

Posted in Books, Kids, R, Statistics with tags , , , , , , , , , on February 4, 2021 by xi'an

From an X validated question, found that WordPress now allows for direct link to pdf documents, like the above paper by my old friend Anirban Das Gupta! The question is about estimating a number M of individuals with N distinct birth dates over a year of T days. After looking around I could not find a simpler representation of the probability for N=r other than (1) in my answer,

$\frac{T!}{(\bar N-r)!}\frac{m!}{T^m} \sum_{(r_1,\ldots,r_m);\\\sum_1^m r_i=r\ \&\\\sum_1^m ir_i=m}1\Big/\prod_{j=1}^m r_j! (j!)^{r_j}$

borrowed from a paper by Fisher et al. (Another Fisher!) Checking Feller leads to the probability (p.102)

${T \choose r}\sum_{\nu=0}^r (-1)^{\nu}{r\choose\nu}\left(1-\frac{T-r+\nu}T \right)^m$

which fits rather nicely simulation frequencies, as shown using

apply(!apply(matrix(sample(1:Nb,T*M,rep=TRUE),T,M),1,duplicated),2,sum)


Further, Feller (1970, pp.103-104) justifies an asymptotic Poisson approximation with parameter$$\lambda(M)=\bar{N}\exp\{-M/\bar N\}$ from which an estimate of$M\$ can be derived. With the birthday problem as illustration (pp.105-106)!

It may be that a completion from N to (R¹,R²,…) where the components are the number of days with one birthdate, two birthdates, &tc. could help design an EM algorithm that would remove the summation in (1) but I did not spend more time on the problem (than finding a SAS approximation to the probability!).

easy and uneasy riddles

Posted in Books, Kids, R with tags , , , , , on February 2, 2021 by xi'an

On 15 January, The Riddler had both a straightforward and a challenging riddles. The first one was to optimise the choice of a real number d with the utility function U(d,θ)=d ℑ(θ>d), when θ is Uniform(0,100). Leading unsurprisingly to d=50…

The tough(er) one was to solve a form of sudoku where the 24 entries of a 8×3 table are integers in {1,…,9} and the information is provided by the row-wise and column-wise products of these integers. The vertical margins are

294, 216, 135, 98, 112, 84, 245, 40

and the horizontal margins are

8 890 560, 156 800, 55 566

After an unsuccessful brute-force (and pseudo-annealed) attempt achieving a minimum error of 127, although using the prime factor decompositions of these 11 margins, I realised that some entries were known: e.g., 7 at (1,2), 5 at (3,2), and 7 at (7,3), and (much later) that the (huge) product value for the first column implied that each term in that column had to be the maximal possible value for the corresponding rows, except for 5 on row 7. This leads to the starting grid

    [,1] [,2] [,3]
[1,]    7    7    6
[2,]    9    0    0
[3,]    9    5    3
[4,]    7    0    0
[5,]    8    0    0
[6,]    7    0    0
[7,]    5    7    7
[8,]    8    0    0


and an additional and obvious exclusion based on the absence of 3’s in the second column, of 5’s and 2’s in the third column shows there was a unique solution

    [,1] [,2] [,3]
[1,]    7    7    6
[2,]    9    8    3
[3,]    9    5    3
[4,]    7    2    7
[5,]    8    2    7
[6,]    7    4    3
[7,]    5    7    7
[8,]    8    5    1


as also demonstrated by a complete exploration with R:

Try it online!