## a programming bug with weird consequences

Posted in Kids, pictures, R, Statistics, University life with tags , , , , , , on November 25, 2015 by xi'an

One student of mine coded by mistake an independent Metropolis-Hastings algorithm with too small a variance in the proposal when compared with the target variance. Here is the R code of this implementation:

#target is N(0,1)
#proposal is N(0,.01)
T=1e5
prop=x=rnorm(T,sd=.01)
ratop=dnorm(prop,log=TRUE)-dnorm(prop,sd=.01,log=TRUE)
ratav=ratop[1]
logu=ratop-log(runif(T))
for (t in 2:T){
if (logu[t]>ratav){
x[t]=prop[t];ratav=ratop[t]}else{x[t]=x[t-1]}
}


It produces outputs of the following shape
which is quite amazing because of the small variance. The reason for the lengthy freezes of the chain is the occurrence with positive probability of realisations from the proposal with very small proposal density values, as they induce very small Metropolis-Hastings acceptance probabilities and are almost “impossible” to leave. This is due to the lack of control of the target, which is flat over the domain of the proposal for all practical purposes. Obviously, in such a setting, the outcome is unrelated with the N(0,1) target!

It is also unrelated with the normal proposal in that switching to a t distribution with 3 degrees of freedom produces a similar outcome:

It is only when using a Cauchy proposal that the pattern vanishes:

## Sunday morning puzzle

Posted in Books, Kids, R with tags , , , on November 22, 2015 by xi'an

A question from X validated that took me quite a while to fathom and then the solution suddenly became quite obvious:

If a sample taken from an arbitrary distribution on {0,1}⁶ is censored from its (0,0,0,0,0,0) elements, and if the marginal probabilities are know for all six components of the random vector, what is an estimate of the proportion of (missing) (0,0,0,0,0,0) elements?

Since the censoring modifies all probabilities by the same renormalisation, i.e. divides them by the probability to be different from (0,0,0,0,0,0), ρ, this probability can be estimated by looking at the marginal probabilities to be equal to 1, which equal the original and known marginal probabilities divided by ρ. Here is a short R code illustrating the approach that I wrote in the taxi home yesterday night:

#generate vectors
N=1e5
zprobs=c(.1,.9) #iid example
smpl=matrix(sample(0:1,6*N,rep=TRUE,prob=zprobs),ncol=6)
pty=apply(smpl,1,sum)
smpl=smpl[pty>0,]
ps=apply(smpl,2,mean)
cor=mean(ps/rep(zprobs[2],6))
#estimated original size
length(smpl[,1])*cor


A broader question is how many values (and which values) of the sample can be removed before this recovery gets impossible (with the same amount of information).

## Paret’oothed importance sampling and infinite variance [guest post]

Posted in Kids, pictures, R, Statistics, University life with tags , , , , , , on November 17, 2015 by xi'an

[Here are some comments sent to me by Aki Vehtari in the sequel of the previous posts.]

The following is mostly based on our arXived paper with Andrew Gelman and the references mentioned  there.

Koopman, Shephard, and Creal (2009) proposed to make a sample based estimate of the existence of the moments using generalized Pareto distribution fitted to the tail of the weight distribution. The number of existing moments is less than 1/k (when k>0), where k is the shape parameter of generalized Pareto distribution.

When k<1/2, the variance exists and the central limit theorem holds. Chen and Shao (2004) show further that the rate of convergence to normality is faster when higher moments exist. When 1/2≤k<1, the variance does not exist (but mean exists), the generalized central limit theorem holds, and we may assume the rate of convergence is faster when k is closer to 1/2.

In the example with “Exp(1) proposal for an Exp(1/2) target”, k=1/2 and we are truly on the border.

In our experiments in the arXived paper and in Vehtari, Gelman, and Gabry (2015), we have observed that Pareto smoothed importance sampling (PSIS) usually converges well also with k>1/2 but k close to 1/2 (let’s say k<0.7). But if k<1 and k is close to 1 (let’s say k>0.7) the convergence is much worse and both naïve importance sampling and PSIS are unreliable.

Two figures are attached, which show the results comparing IS and PSIS in the Exp(1/2) and Exp(1/10) examples. The results were computed with repeating 1000 times a simulation with 10000 samples in each. We can see the bad performance of IS in both examples as you also illustrated. In Exp(1/2) case, PSIS is also to produce much more stable results. In Exp(1/10) case, PSIS is able to reduce the variance of the estimate, but it is not enough to avoid a big bias.

It would be interesting to have more theoretical justification why infinite variance is not so big problem if k is close to 1/2 (e.g. how the convergence rate is related to the amount of fractional moments).

I guess that max ω[t] / ∑ ω[t] in Chaterjee and Diaconis has some connection to the tail shape parameter of the generalized Pareto distribution, but it is likely to be much noisier as it depends on the maximum value instead of a larger number of tail samples as in the approach by Koopman, Shephard, and Creal (2009).A third figure shows an example where the variance is finite, with “an Exp(1) proposal for an Exp(1/1.9) target”, which corresponds to k≈0.475 < 1/2. Although the variance is finite, we are close to the border and the performance of basic IS is bad. There is no sharp change in the practical behaviour with a finite number of draws when going from finite variance to infinite variance. Thus, I think it is not enough to focus on the discrete number of moments, but for example, the Pareto shape parameter k gives us more information. Koopman, Shephard, and Creal (2009) also estimated the Pareto shape k, but they formed a hypothesis test whether the variance is finite and thus discretising the information in k, and assuming that finite variance is enough to get good performance.

## importance sampling with infinite variance

Posted in pictures, R, Statistics, University life with tags , , , , , , , on November 13, 2015 by xi'an

“In this article it is shown that in a fairly general setting, a sample of size approximately exp(D(μ|ν)) is necessary and sufficient for accurate estimation by importance sampling.”

Sourav Chatterjee and Persi Diaconis arXived yesterday an exciting paper where they study the proper sample size in an importance sampling setting with no variance. That’s right, with no variance. They give as a starting toy example the use of an Exp(1) proposal for an Exp(1/2) target, where the importance ratio exp(x/2)/2 has no ξ order moment (for ξ≥2). So the infinity in the variance is somehow borderline in this example, which may explain why the estimator could be considered to “work”. However, I disagree with the statement “that a sample size a few thousand suffices” for the estimator of the mean to be close to the true value, that is, 2. For instance, the picture I drew above is the superposition of 250 sequences of importance sampling estimators across 10⁵ iterations: several sequences show huge jumps, even for a large number of iterations, which are characteristic of infinite variance estimates. Thus, while the expected distance to the true value can be closely evaluated via the Kullback-Leibler divergence between the target and the proposal (which by the way is infinite when using a Normal as proposal and a Cauchy as target), there are realisations of the simulation path that can remain far from the true value and this for an arbitrary number of simulations. (I even wonder if, for a given simulation path, waiting long enough should not lead to those unbounded jumps.) The first result is frequentist, while the second is conditional, i.e., can occur for the single path we have just simulated… As I taught in class this very morning, I thus remain wary about using an infinite variance estimator. (And not only in connection with the harmonic mean quagmire. As shown below by the more extreme case of simulating an Exp(1) proposal for an Exp(1/10) target, where the mean is completely outside the range of estimates.) Wary, then, even though I find the enclosed result about the existence of a cut-off sample size associated with this L¹ measure quite astounding. Continue reading

## Le Monde puzzle [#937]

Posted in Books, R with tags , , , , on November 11, 2015 by xi'an

A combinatoric Le Monde mathematical puzzle that resembles many earlier ones:

Given a pool of 30 interns allocated to three person night-shifts, is it possible to see 31 consecutive nights such that (a) all the shifts differ and (b) there are no pair of shifts with a single common intern?

In fact, the constraint there is very strong: two pairs of shift can only share zero or two interns. For one given shift, there are 26 other shifts with which it can share two interns, but then any two of those 26 others must share zero or two, which makes the two common to all and exclude any additional shift. But this is not the only approach to allocate the interns over the shifts since, as pointed out by Jean-Louis and checking with the following R code, 28 and not 27 is the maximum possible number of shifts under those conditions.

poss=combn(30,3)
shft=matrix(NA,31,3)
shft[1,]=sample(1:30,3)

poss=poss[,sample(4060)]
prop=poss[,1];k=2
acpt=intersect(prop,shft[1,])
while (((length(acpt)==1)+(length(acpt==3)))>0){
prop=poss[,k];k=k+1
acpt=intersect(prop,shft[1,])}
shft[2,]=prop
for(i in 3:31){
poss=poss[,sample(4060)]
prop=poss[,1];k=2
acpt=(length(intersect(prop,shft[1,]))==1)+(length(intersect(prop,shft[1,]))==3)
for (j in 2:(i-1))
acpt=acpt+(length(intersect(prop,shft[j,]))==1)+(length(intersect(prop,shft[j,]))==3)
while ((acpt>0)&(k<4061)){
prop=poss[,k];k=k+1
acpt=(length(intersect(prop,shft[1,]))==1)+(length(intersect(prop,shft[1,]))==3)
for (j in 2:(i-1))
acpt=acpt+(length(intersect(prop,shft[j,]))==1)+(length(intersect(prop,shft[j,]))==3)}
if (k==4061) break()
shft[i,]=prop}


## Think Bayes: Bayesian Statistics Made Simple

Posted in Books, Kids, R, Statistics, University life with tags , , , , , , , , on October 27, 2015 by xi'an

By some piece of luck, I came upon the book Think Bayes: Bayesian Statistics Made Simple, written by Allen B. Downey and published by Green Tea Press [which I could relate to No Starch Press, focussing on coffee!, which published Statistics Done Wrong that I reviewed a while ago] which usually publishes programming books with fun covers. The book is available on-line for free in pdf and html formats, and I went through it during a particularly exciting administrative meeting…

“Most books on Bayesian statistics use mathematical notation and present ideas in terms of mathematical concepts like calculus. This book uses Python code instead of math, and discrete approximations instead of continuous mathematics. As a result, what would be an integral in a math book becomes a summation, and most operations on probability distributions are simple loops.”

The book is most appropriately published in this collection as most of it concentrates on Python programming, with hardly any maths formula. In some sense similar to Jim Albert’s R book. Obviously, coming from maths, and having never programmed in Python, I find the approach puzzling, But just as obviously, I am aware—both from the comments on my books and from my experience on X validated—that a large group (majority?) of newcomers to the Bayesian realm find the mathematical approach to the topic a major hindrance. Hence I am quite open to this editorial choice as it is bound to include more people to think Bayes, or to think they can think Bayes.

“…in fewer than 200 pages we have made it from the basics of probability to the research frontier. I’m very happy about that.”

The choice made of operating almost exclusively through motivating examples is rather traditional in US textbooks. See e.g. Albert’s book. While it goes against my French inclination to start from theory and concepts and end up with illustrations, I can see how it operates in a programming book. But as always I fear it makes generalisations uncertain and understanding more shaky… The examples are per force simple and far from realistic statistics issues. Hence illustrates more the use of Bayesian thinking for decision making than for data analysis. To wit, those examples are about the Monty Hall problem and other TV games, some urn, dice, and coin models, blood testing, sport predictions, subway waiting times, height variability between men and women, SAT scores, cancer causality, a Geiger counter hierarchical model inspired by Jaynes, …, the exception being the final Belly Button Biodiversity dataset in the final chapter, dealing with the (exciting) unseen species problem in an equally exciting way. This may explain why the book does not cover MCMC algorithms. And why ABC is covered through a rather artificial normal example. Which also hides some of the maths computations under the carpet.

“The underlying idea of ABC is that two datasets are alike if they yield the same summary statistics. But in some cases, like the example in this chapter, it is not obvious which summary statistics to choose.¨

In conclusion, this is a very original introduction to Bayesian analysis, which I welcome for the reasons above. Of course, it is only an introduction, which should be followed by a deeper entry into the topic, and with [more] maths. In order to handle more realistic models and datasets.

## Le Monde puzzle [#929]

Posted in Books, Kids, R with tags , on September 29, 2015 by xi'an

A combinatorics Le Monde mathematical puzzle:

In the set {1,…,12}, numbers adjacent to i are called friends of i. How many distinct subsets of size 5 can be chosen under the constraint that each number in the subset has at least a friend with him?

In a brute force approach, I tried a quintuple loop to check all possible cases:

case=0
for (a in 1:(12-4))
for (b in (a+1):(12-3))
for (c in (b+1):(12-2))
for (d in (c+1):(12-1))
for (e in (d+1):12)
case=case+((b-a<2)&(min(c-b,d-c)<2)
&(min(d-c,e-d)<2)&(e-d<2))


which returns 64 possible cases. Note that the second and last loop are useless since b=a+1 and e=d+1, necessarily. And c is either (b+1) or (d-1), which means 2 choices for c, except when e=a+4. This all adds up to

$8 + 2\sum_{a=1}^7\sum_{e=a+5}^{12} = 8+2.7.8-2.7.8/2=8.8=64$

A related R question: is there a generic way of programming a sequence of embedded loops like the one above without listing all of the loops one by one?