Archive for the R Category

maximal spacing around order statistics [#2]

Posted in Books, R, Statistics, University life with tags , , , , , , , , on June 8, 2018 by xi'an

The proposed solution of the riddle from the Riddler discussed here a few weeks ago is rather approximative, in that the distribution of


when the n-sample is made of iid Normal variates is (a) replaced with the distribution of one arbitrary minimum and (b) the distribution of the minimum is based on an assumption of independence between the absolute differences. Which does not hold, as shown by the above correlation matrix (plotted via corrplot) for N=11 and 10⁴ simulations. One could think that this correlation decreases with N, but it remains essentially 0.2 for larger values of N. (On the other hand, the minima are essentially independent.)

Le Monde puzzle [#1051]

Posted in Books, Kids, R with tags , , , , , , on May 18, 2018 by xi'an

A combinatoric Le Monde mathematical puzzle of limited size:
When the only allowed move is to switch two balls from adjacent boxes, what is the minimal number of moves to return all balls in the above picture to their respective boxes? Same question with six boxes and 12 balls.

The question is rather interesting to code as I decided to use recursion (as usual!) but wanted to gain time by storing the number of steps needed by any configuration to reach its ordered recombination. Meaning I had to update an external vector within the recursive function for each new configuration I met. With help from Julien Stoehr, who presented me with the following code, a simplification of a common R function

v.assign <- function (i,value,...) {
  temp <- get(i, pos = 1)
  temp[...] <- value
  assign(i, temp, pos = 1)}

which assigns one or several entries to the external vector i. I thus used this trick in the following R code, where cosz is a vector of size 5¹⁰, much larger than the less than 10! values I need but easier to code. While n≤5.

swee <- function(balz){
  indz <- sum((balz-1)*baz)
  if (cosz[indz]==-1){ 
  if (min(diff(balz))==0){ #ordered
       val <- n^tn
       for (i in 2:n)
       for (j in (2*i-1):(2*i))
       for (k in (2*i-3):(2*i-2)){
         calz <- balz
         calz[k] <- balz[j];calz[j] 0) 
           val <- min(val,1+swee(calz))}

which returns 2 for n=2, 6 for n=3, 11 for n=4, 15 for n=5. In the case n=6, I need a much better coding of the permutations of interest. Which is akin to ranking all words within a dictionary with letters (1,1,…,6,6). After some thinking (!) and searching, I came up with a new version, defining

    for (i in 1:(n-1)){
      if (x>1){
        for (j in 1:(x-1)){


for finding the index of a given permutation, between 1 and (2n)!/2!..2!, and then calling the initial swee(p) with this modified allocation. The R code was still running when I posted this entry… and six days later, it returned the answer of 23.

maximal spacing around order statistics

Posted in Books, R, Statistics, University life with tags , , , , , , , on May 17, 2018 by xi'an

The riddle from the Riddler for the coming weeks is extremely simple to express in mathematical terms, as it summarises into characterising the distribution of


when the n-sample is made of iid Normal variates. I however had a hard time finding a result connected with this quantity since most available characterisations are for either Uniform or Exponential variates. I eventually found a 2017 arXival by Nagaraya et al.  covering the issue. Since the Normal distribution belongs to the Gumbel domain of attraction, the extreme spacings, that is the spacings between the most extreme orders statistics [rescaled by nφ(Φ⁻¹{1-n⁻¹})] are asymptotically independent and asymptotically distributed as (Theorem 5, p.15, after correcting a typo):


where the ξ’s are Exp(1) variates. A crude approximation is thus to consider that the above Δ is distributed as the maximum of two standard and independent exponential distributions, modulo the rescaling by  nφ(Φ⁻¹{1-n⁻¹})… But a more adequate result was pointed out to me by Gérard Biau, namely a 1986 Annals of Probability paper by Paul Deheuvels, my former head at ISUP, Université Pierre and Marie Curie. In this paper, Paul Deheuvels establishes that the largest spacing in a normal sample, M¹, satisfies

\mathbb{P}(\sqrt{2\log\,n}\,M^1\le x) \to \prod_{i=1}^{\infty} (1-e^{-ix})^2

from which a conservative upper bound on the value of n required for a given bound x⁰ can be derived. The simulation below compares the limiting cdf (in red) with the empirical cdf of the above Δ based on 10⁴ samples of size n=10³.The limiting cdf is the cdf of the maximum of an infinite sequence of independent exponentials with scales 1,½,…. Which connects with the above result, in fine. For a practical application, the 99% quantile of this distribution is 4.71. To achieve a maximum spacing of, say 0.1, with probability 0.99, one would need 2 log(n) > 5.29²/0.1², i.e., log(n)>1402, which is a pretty large number…


the riddle of the stands

Posted in Books, Kids, R with tags , , , , , on May 11, 2018 by xi'an

The simple riddle of last week on The Riddler, about the minimum number of urinals needed for n men to pee if the occupation rule is to stay as far as possible from anyone there and never to stand next to another man,  is quickly solved by an R code:

 while (max(diff((1:M)[ok!=0])>2)){

with maximal occupation illustrated by the graph below:

Meaning that the efficiency of the positioning scheme is not optimal when following the sequential positioning, requiring N+2^{\lceil log_2(N-1) \rceil} urinals. Rather than one out of two, requiring 2N-1 urinals. What is most funny in this simple exercise is the connection exposed in the Riddler with an Xkcd blag written a few years go about the topic.

Imperial postdoc in Bayesian nonparametrics

Posted in pictures, R with tags , , , , , , , , on April 27, 2018 by xi'an

Here is another announcement for a post-doctoral position in London (UK) to work with Sarah Filippi. In the Department of Mathematics at Imperial College London. (More details on the site or in this document. Hopefully, the salary is sufficient for staying in London, if not in South Kensington!)

The post holder will work on developing a novel Bayesian Non-Parametric Test for Conditional Independence. This is at the core of modern causal discovery, itself of paramount importance throughout the sciences and in Machine Learning. As part of this project, the post holder will derive a Bayesian non-parametric testing procedure for conditional independence, scalable to high-dimensional conditioning variable. To ensure maximum impact and allow experimenters in different fields to easily apply this new methodology, the post holder will then create an open-source software package available on the R statistical programming platform. Doing so, the post holder will investigate applying this approach to real-world data from our established partners who have a track record of informing national and international bodies such as Public Health England and the World Health Organisation.

practical Bayesian inference [book review]

Posted in Books, Kids, R, Statistics, University life with tags , , , , , , , , , on April 26, 2018 by xi'an

[Disclaimer: I received this book of Coryn Bailer-Jones for a review in the International Statistical Review and intend to submit a revised version of this post as my review. As usual, book reviews on the ‘Og are reflecting my own definitely personal and highly subjective views on the topic!]

It is always a bit of a challenge to review introductory textbooks as, on the one hand, they are rarely written at the level and with the focus one would personally choose to write them. And, on the other hand, it is all too easy to find issues with the material presented and the way it is presented… So be warned and proceed cautiously! In the current case, Practical Bayesian Inference tries to embrace too much, methinks, by starting from basic probability notions (that should not be unknown to physical scientists, I believe, and which would avoid introducing a flat measure as a uniform distribution over the real line!, p.20). All the way to running MCMC for parameter estimation, to compare models by Bayesian evidence, and to cover non-parametric regression and bootstrap resampling. For instance, priors only make their apparition on page 71. With a puzzling choice of an improper prior (?) leading to an improper posterior (??), which is certainly not the smoothest entry on the topic. “Improper posteriors are a bad thing“, indeed! And using truncation to turn them into proper distributions is not a clear improvement as the truncation point will significantly impact the inference. Discussing about the choice of priors from the beginning has some appeal, but it may also create confusion in the novice reader (although one never knows!). Even asking about “what is a good prior?” (p.73) is not necessarily the best (and my recommended) approach to a proper understanding of the Bayesian paradigm. And arguing about the unicity of the prior (p.119) clashes with my own view of the prior being primarily a reference measure rather than an ideal summary of the available information. (The book argues at some point that there is no fixed model parameter, another and connected source of disagreement.) There is a section on assigning priors (p.113), but it only covers the case of a possibly biased coin without much realism. A feature common to many Bayesian textbooks though. To return to the issue of improper priors (and posteriors), the book includes several warnings about the danger of hitting an undefined posterior (still called a distribution), without providing real guidance on checking for its definition. (A tough question, to be sure.)

“One big drawback of the Metropolis algorithm is that it uses a fixed step size, the magnitude of which can hardly be determined in advance…”(p.165)

When introducing computational techniques, quadratic (or Laplace) approximation of the likelihood is mingled with kernel estimators, which does not seem appropriate. Proposing to check convergence and calibrate MCMC via ACF graphs is helpful in low dimensions, but not in larger dimensions. And while warning about the dangers of forgetting the Jacobians in the Metropolis-Hastings acceptance probability when using a transform like η=ln θ is well-taken, the loose handling of changes of variables may be more confusing than helpful (p.167). Discussing and providing two R codes for the (standard) Metropolis algorithm may prove too much. Or not. But using a four page R code for fitting a simple linear regression with a flat prior (pp.182-186) may definitely put the reader off! Even though I deem the example a proper experiment in setting a Metropolis algorithm and appreciate the detailed description around the R code itself. (I just take exception at the paragraph on running the code with two or even one observation, as the fact that “the Bayesian solution always exists” (p.188) [under a proper prior] is not necessarily convincing…)

“In the real world we cannot falsify a hypothesis or model any more than we “truthify” it (…) All we can do is ask which of the available models explains the data best.” (p.224)

In a similar format, the discussion on testing of hypotheses starts with a lengthy presentation of classical tests and p-values, the chapter ending up with a list of issues. Most of them reasonable in my own referential. I also concur with the conclusive remarks quoted above that what matters is a comparison of (all relatively false) models. What I less agree [as predictable from earlier posts and papers] with is the (standard) notion that comparing two models with a Bayes factor follows from the no information (in order to avoid the heavily loaded non-informative) prior weights of ½ and ½. Or similarly that the evidence is uniquely calibrated. Or, again, using a truncated improper prior under one of the assumptions (with the ghost of the Jeffreys-Lindley paradox lurking nearby…).  While the Savage-Dickey approximation is mentioned, the first numerical resolution of the approximation to the Bayes factor is via simulations from the priors. Which may be very poor in the situation of vague and uninformative priors. And then the deadly harmonic mean makes an entry (p.242), along with nested sampling… There is also a list of issues about Bayesian model comparison, including (strong) dependence on the prior, dependence on irrelevant alternatives, lack of goodness of fit tests, computational costs, including calls to possibly intractable likelihood function, ABC being then mentioned as a solution (which it is not, mostly).

Continue reading

Le Monde puzzle [#1049]

Posted in Books, Kids, R with tags , , , on April 18, 2018 by xi'an

An algorithmic Le Monde mathematical puzzle with a direct

Alice and Bob play a game by picking alternatively one of the remaining digits between 1 and 10 and putting it in either one of two available stacks, 1 or 2. Their respective gains are the products of the piles (1 for Alice and 2 for Bob).

The problem is manageable by a recursive function

 if ((min(remz[1,])>0)||(min(remz[2,])>0)){#finale
   for (i in (1:10)[!(1:10)%in%remz]){
   for (i in (1:10)[!(1:10)%in%remz]){

that shows the optimal gain for Alice is 3360=2x5x6x7x 8, versus Bob getting 1080=1x3x4x9x10. The moves ensuring the gain are 2-10-…