## why does rbinom(1,1) differ from sample(0:1,1) with the same seed?

Posted in Statistics with tags , , , , , , , , , on February 17, 2021 by xi'an
> set.seed(1)
> rbinom(10,1,0.5)
[1] 0 0 1 1 0 1 1 1 1 0
> set.seed(1)
> sample(c(0,1), 10, replace = TRUE)
[1] 0 1 0 1 0 0 1 0 1 1

This rather legitimate question was posted on X validated last week, the answer being that the C codes behind both functions do not use pseudo-random generators in the same manner. For instance, rbinom does get involved beyond a mean value of 30 (and otherwise resorts to the inverse cdf approach). And following worries about sample biases, sample was updated in 2019 (and also seems to resort to the inverse cdf when the mean is less than 200). However, when running the above code on my machine, still using the 2018 R version 3.4.4!, I recover the same outcome:

> set.seed(1)
> rbinom(10,1,0.5)
[1] 0 0 1 1 0 1 1 1 1 0

> set.seed(1)
> sample(c(0,1), 10, replace = TRUE)
[1] 0 0 1 1 0 1 1 1 1 0> set.seed(1)
> qbinom(runif(10),1,0.5)
[1] 0 0 1 1 0 1 1 1 1 0
> set.seed(1)
> 1*(runif(10)>.5)
[1] 0 0 1 1 0 1 1 1 1 0


## Bayesian sufficiency

Posted in Books, Kids, Statistics with tags , , , , , , , , , on February 12, 2021 by xi'an

“During the past seven decades, an astonishingly large amount of effort and ingenuity has gone into the search fpr resonable answers to this question.” D. Basu

Induced by a vaguely related question on X validated, I re-read Basu’s 1977 great JASA paper on the elimination of nuisance parameters. Besides the limitations of competing definitions of conditional, partial, marginal sufficiency for the parameter of interest,  Basu discusses various notions of Bayesian (partial) sufficiency.

“After a long journey through a forest of confusing ideas and examples, we seem to have lost our way.” D. Basu

Starting with Kolmogorov’s idea (published during WW II) to impose to all marginal posteriors on the parameter of interest θ to only depend on a statistic S(x). But having to hold for all priors cancels the notion as the statistic need be sufficient jointly for θ and σ, as shown by Hájek in the early 1960’s. Following this attempt, Raiffa and Schlaifer then introduced a more restricted class of priors, namely where nuisance and interest are a priori independent. In which case a conditional factorisation theorem is a sufficient (!) condition for this Q-sufficiency.  But not necessary as shown by the N(θ·σ, 1) counter-example (when σ=±1 and θ>0). [When the prior on σ is uniform, the absolute average is Q-sufficient but is this a positive feature?] This choice of prior separation is somewhat perplexing in that it does not hold under reparameterisation.

Basu ends up with three challenges, including the multinomial M(θ·σ,½(1-θ)·(1+σ),½(1+θ)·(1-σ)), with (n¹,n²,n³) as a minimal sufficient statistic. And the joint observation of an Exponential Exp(θ) translated by σ and of an Exponential Exp(σ) translated by -θ, where the prior on σ gets eliminated in the marginal on θ.

## hard birthday problem

Posted in Books, Kids, R, Statistics with tags , , , , , , , , , on February 4, 2021 by xi'an

From an X validated question, found that WordPress now allows for direct link to pdf documents, like the above paper by my old friend Anirban Das Gupta! The question is about estimating a number M of individuals with N distinct birth dates over a year of T days. After looking around I could not find a simpler representation of the probability for N=r other than (1) in my answer,

$\frac{T!}{(\bar N-r)!}\frac{m!}{T^m} \sum_{(r_1,\ldots,r_m);\\\sum_1^m r_i=r\ \&\\\sum_1^m ir_i=m}1\Big/\prod_{j=1}^m r_j! (j!)^{r_j}$

borrowed from a paper by Fisher et al. (Another Fisher!) Checking Feller leads to the probability (p.102)

${T \choose r}\sum_{\nu=0}^r (-1)^{\nu}{r\choose\nu}\left(1-\frac{T-r+\nu}T \right)^m$

which fits rather nicely simulation frequencies, as shown using

apply(!apply(matrix(sample(1:Nb,T*M,rep=TRUE),T,M),1,duplicated),2,sum)


Further, Feller (1970, pp.103-104) justifies an asymptotic Poisson approximation with parameter$$\lambda(M)=\bar{N}\exp\{-M/\bar N\}$ from which an estimate of$M\$ can be derived. With the birthday problem as illustration (pp.105-106)!

It may be that a completion from N to (R¹,R²,…) where the components are the number of days with one birthdate, two birthdates, &tc. could help design an EM algorithm that would remove the summation in (1) but I did not spend more time on the problem (than finding a SAS approximation to the probability!).

## a neat EM resolution

Posted in Books, Kids, Statistics, University life with tags , , , , , , on February 3, 2021 by xi'an

Read (and answered) this question on X validation about finding the maximum likelihood estimator of a 2×2 Gaussian covariance matrix when some observations are partly missing.  The neat thing is that, in this case, the maximisation step is identical to the maximum likelihood estimation of the 2×2 Gaussian covariance matrix by redefining the empirical covariance matrix into Z and maximising

$-n\log|\Sigma|-\text{trace}(Z\Sigma^{-1})$

in Σ. Nothing involved but fun to explain, nonetheless. (In my final exam this year, no student even approached the EM questions!)

## wrong algebra for slice sampler

Posted in Books, Kids, R, Statistics with tags , , , , , , , , , , , , on January 27, 2021 by xi'an

Once more, and thrice alas!, I became aware of a typo in our “Use R!” book through a question on X validated from a reader unable to reproduce the slice of a basic 2D slice sampler for a logistic regression with coefficients (a,b). Indeed, our slice reads as the incorrect set (missing the i=1,…,n)

$\left\{ (a,b): y_i(a+bx_i) > \log \frac{u_i}{1-u_i} \right\}$

when it should have been

$\bigcap_{i=1} \left\{ (a,b)\,:\ (-1)^{y_i}(a+bx_i) > \log\frac{u_i}{1-u_i} \right\}$

which is the version I found in my LaTeX file. So I do not know what happened (unless I corrected the LaTeX file at a later date and cannot remember it, but the latest chance on the file reads October 2011…). Fortunately, the resulting slices in a and b and the following R code remain correct. Unfortunately, both French and Japanese translations reproduce the mistake…