## sampling the mean

Posted in Kids, R, Statistics with tags , , , , , on December 12, 2019 by xi'an

A challenge found on the board of the coffee room at CEREMADE, Université Paris Dauphine:

When sampling with replacement three numbers in {0,1,…,N}, what is the probability that their average is (at least) one of the three?

With a (code-golfed!) brute force solution of

mean(!apply((a<-matrix(sample(0:n,3e6,rep=T),3)),2,mean)-apply(a,2,median))


producing a graph pretty close to 3N/2(N+1)² (which coincides with a back-of-the-envelope computation):

## from tramway to Panzer (or back!)…

Posted in Books, pictures, Statistics with tags , , , , , , on June 14, 2019 by xi'an

Although it is usually presented as the tramway problem, namely estimating the number of tram or bus lines in a city given observing one line number, including The Bayesian Choice by yours truly, the original version of the problem is about German tanks, Panzer V tanks to be precise, which total number M was to be estimated by the Allies from their observation of serial numbers of a number k of tanks. The Riddler is restating the problem when the only available information is made of the smallest, 22, and largest, 144, numbers, with no information about the number k itself. I am unsure what the Riddler means by “best” estimate, but a posterior distribution on M (and k) can be certainly be constructed for a prior like 1/k x 1/M² on (k,M). (Using M² to make sure the posterior mean does exist.) The joint distribution of the order statistics is

$\frac{k!}{(k-2)!} M^{-k} (144-22)^{k-2}\, \Bbb I_{2\le k\le M\ge 144}$

which makes the computation of the posterior distribution rather straightforward. Here is the posterior surface (with an unfortunate rendering of an artefactual horizontal line at 237!), showing a concentration near the lower bound M=144. The posterior mode is actually achieved for M=144 and k=7, while the posterior means are (rounded as) M=169 and k=9.

## easy Riddler

Posted in Kids, R with tags , , , on May 10, 2019 by xi'an

The riddle of the week is rather standard probability calculus

If N points are generated at random places on the perimeter of a circle, what is the probability that you can pick a diameter such that all of those points are on only one side of the newly halved circle?

Since it is equivalent to finding the range of N Uniform variates less than ½. And since the range of N Uniform variates is distributed as a Be(N-1,2) random variate. The resulting probability, which happens to be exactly $N/2^{N-1}$, is decreasing exponentially, as shown below…

## survivalists [a Riddler’s riddle]

Posted in Books, Kids, R, Statistics with tags , , , , , , on April 22, 2019 by xi'an

A neat question from The Riddler on a multi-probability survival rate:

Nine processes are running in a loop with fixed survivals rates .99,….,.91. What is the probability that the first process is the last one to die? Same question with probabilities .91,…,.99 and the probability that the last process is the last one to die.

The first question means that the realisation of a Geometric G(.99) has to be strictly larger than the largest of eight Geometric G(.98),…,G(.91). Given that the cdf of a Geometric G(a) is [when counting the number of attempts till failure, included, i.e. the Geometric with support the positive integers]

$F(x)=\Bbb P(X\le x)=1-a^{x}$

the probability that this happens has the nice (?!) representation

$\sum_{x=2}^\infty a_1^{x-1}(1-a_1)\prod_{j\ge 2}(1-a_j^{x-1})=(1-a_1)G(a_1,\ldots,a_9)$

which leads to an easy resolution by recursion since

$G(a_1,\ldots,a_9)=G(a_1,\ldots,a_8)-G(a_1a_9,\ldots,a_8)$

and $G(a)=a/(1-a)$

and a value of 0.5207 returned by R (Monte Carlo evaluation of 0.5207 based on 10⁷ replications). The second question is quite similar, with solution

$\sum_{x=2}^\infty a_1^{x-1}(1-a_1)\prod_{j\ge 1}(1-a_j^{x})=a^{-1}(1-a_1)G(a_1,\ldots,a_9)$

and value 0.52596 (Monte Carlo evaluation of 0.52581 based on 10⁷ replications).

## dynamic nested sampling for stars

Posted in Books, pictures, Statistics, Travel with tags , , , , , , , , , , , , , , , , , on April 12, 2019 by xi'an

In the sequel of earlier nested sampling packages, like MultiNest, Joshua Speagle has written a new package called dynesty that manages dynamic nested sampling, primarily intended for astronomical applications. Which is the field where nested sampling is the most popular. One of the first remarks in the paper is that nested sampling can be more easily implemented by using a Uniform reparameterisation of the prior, that is, a reparameterisation that turns the prior into a Uniform over the unit hypercube. Which means in fine that the prior distribution can be generated from a fixed vector of uniforms and known transforms. Maybe not such an issue given that this is the prior after all.  The author considers this makes sampling under the likelihood constraint a much simpler problem but it all depends in the end on the concentration of the likelihood within the unit hypercube. And on the ability to reach the higher likelihood slices. I did not see any special trick when looking at the documentation, but reflected on the fundamental connection between nested sampling and this ability. As in the original proposal by John Skilling (2006), the slice volumes are “estimated” by simulated Beta order statistics, with no connection with the actual sequence of simulation or the problem at hand. We did point out our incomprehension for such a scheme in our Biometrika paper with Nicolas Chopin. As in earlier versions, the algorithm attempts at visualising the slices by different bounding techniques, before proceeding to explore the bounded regions by several exploration algorithms, including HMC.

“As with any sampling method, we strongly advocate that Nested Sampling should not be viewed as being strictly“better” or “worse” than MCMC, but rather as a tool that can be more or less useful in certain problems. There is no “One True Method to Rule Them All”, even though it can be tempting to look for one.”

When introducing the dynamic version, the author lists three drawbacks for the static (original) version. One is the reliance on this transform of a Uniform vector over an hypercube. Another one is that the overall runtime is highly sensitive to the choice the prior. (If simulating from the prior rather than an importance function, as suggested in our paper.) A third one is the issue that nested sampling is impervious to the final goal, evidence approximation versus posterior simulation, i.e., uses a constant rate of prior integration. The dynamic version simply modifies the number of point simulated in each slice. According to the (relative) increase in evidence provided by the current slice, estimated through iterations. This makes nested sampling a sort of inversted Wang-Landau since it sharpens the difference between slices. (The dynamic aspects for estimating the volumes of the slices and the stopping rule may hinder convergence in unclear ways, which is not discussed by the paper.) Among the many examples produced in the paper, a 200 dimension Normal target, which is an interesting object for posterior simulation in that most of the posterior mass rests on a ring away from the maximum of the likelihood. But does not seem to merit a mention in the discussion. Another example of heterogeneous regression favourably compares dynesty with MCMC in terms of ESS (but fails to include an HMC version).

[Breaking News: Although I wrote this post before the exciting first image of the black hole in M87 was made public and hence before I was aware of it, the associated AJL paper points out relying on dynesty for comparing several physical models of the phenomenon by nested sampling.]

## almost uniform but far from straightforward

Posted in Books, Kids, Statistics with tags , , , , , , , on October 24, 2018 by xi'an

A question on X validated about a [not exactly trivial] maximum likelihood for a triangular function led me to a fascinating case, as exposed by Olver in 1972 in The American Statistician. When considering an asymmetric triangle distribution on (0,þ), þ being fixed, the MLE for the location of the tip of the triangle is necessarily one of the observations [which was not the case in the original question on X validated ]. And not in an order statistic of rank j that does not stand in the j-th uniform partition of (0,þ). Furthermore there are opportunities for observing several global modes… In the X validated case of the symmetric triangular distribution over (0,θ), with ½θ as tip of the triangle, I could not figure an alternative to the pedestrian solution of looking separately at each of the (n+1) intervals where θ can stand and returning the associated maximum on that interval. Definitely a good (counter-)example about (in)sufficiency for class or exam!

## maximal spacing around order statistics [#2]

Posted in Books, R, Statistics, University life with tags , , , , , , , , on June 8, 2018 by xi'an

The proposed solution of the riddle from the Riddler discussed here a few weeks ago is rather approximative, in that the distribution of

$\Delta_n=\max_i\,\min_j\,|X_{i}-X_{j}|$

when the n-sample is made of iid Normal variates is (a) replaced with the distribution of one arbitrary minimum and (b) the distribution of the minimum is based on an assumption of independence between the absolute differences. Which does not hold, as shown by the above correlation matrix (plotted via corrplot) for N=11 and 10⁴ simulations. One could think that this correlation decreases with N, but it remains essentially 0.2 for larger values of N. (On the other hand, the minima are essentially independent.)