Archive for the R Category

coauthorship and citation networks

Posted in Books, pictures, R, Statistics, University life on February 21, 2017 by xi'an

As I discovered (!) the Annals of Applied Statistics in my mailbox just prior to taking the local train to Dauphine for the first time in 2017 (!), I started reading it on the way, but did not get any further than the first discussion paper by Pengsheng Ji and Jiashun Jin on coauthorship and citation networks for statisticians. I found the whole exercise intriguing, I must confess, with little to support a whole discussion on the topic. I may have read the paper too superficially as a métro pastime, but to me it sounded more like a post-hoc analysis than a statistical exercise, something like looking at the network, or rather at the output of a software package representing networks, and making sense of clumps and sub-networks a posteriori. (In a way this reminded me of my first SAS project at school, on the patterns of vacations in France. It was in 1983, on punched cards. And we spent a while cutting & pasting, in a literal sense, the 80-column graphs produced by SAS on endless listings.)

It may be that part of the interest in the paper is self-centred. I do not think analysing a similar dataset in another field like deconstructionist philosophy or Korean raku would have attracted the same attention. Looking at the clusters and the names on the pictures obviously makes sense, if more at a curiosity than at a scientific level, as I do not think this brings much in terms of ranking and evaluating research (despite what Bernard Silverman suggests in his preface) or understanding collaborations (beyond the fact that people in the same subfield or the same active place like Duke tend to collaborate). Speaking of curiosity, I was quite surprised to spot my name in one network and even more to see that I was part of the “High-Dimensional Data Analysis” cluster, rather than of the “Bayes” cluster. I cannot fathom how I ended up in that theme, as I cannot think of a single paper of mine pertaining to either high dimensions or data analysis [to force the trait just a wee bit!]. Maybe thanks to my joint paper with Peter Mueller. (I tried to check the data itself but could not trace my own papers in the raw data files.)

I also wonder about the point of looking solely at four major journals in the field, missing for instance most of computational statistics and biostatistics, not to mention machine learning or econometrics. This results in a somewhat narrow niche, even if it obviously recovers the main authors in the [corresponding] field. Some major players in computational stats still make it to the lists, like Gareth Roberts or Håvard Rue, but under the wrong categorisation of spatial statistics.

A knapsack riddle [#2]?

Posted in Kids, R, Statistics on February 17, 2017 by xi'an


Still about this allocation riddle of the past week, and still with my confusion about the phrasing of the puzzle: when looking at a probabilistic interpretation of the game, rather than at a given adversary's y, the problem turns into searching for the maximum of

\mathbb{E}[L(x,Y)]=\sum_{i=1}^{10} i\{P(Y_i<x_i)-P(Y_i>x_i)\}

where the Y_i's are Binomial B(100,p_i) variates. Given those p_i's, this function of x is available in closed form and can thus be maximised by a simulated annealing procedure, coded as

#expected utility of x against Y_i~B(100,p[i]), shifted by the
#constant sum(1:10), since i{P(Y_i<x_i)-P(Y_i>x_i)} equals
#i{2P(Y_i<x_i)+P(Y_i=x_i)-1}
utility=function(x,p){
  ute=2*pbinom(x[1]-1,100,prob=p[1])+
   dbinom(x[1],100,p[1])
  for (i in 2:10)
   ute=ute+2*i*pbinom(x[i]-1,100,prob=p[i])+
    i*dbinom(x[i],100,p[i])
  return(ute)}
#basic term in utility: gain from the x[i]-th unit in box i
baz=function(i,x,p){
  return(i*dbinom(x[i],100,p[i])+
   i*dbinom(x[i]-1,100,p[i]))}
#relies on a given or estimated p
x=rmultinom(n=1,size=100,prob=p)
maxloz=loss=0
newloss=losref=utility(x,p)
#random search
T=1e3
Te=1e2
baza=rep(0,10)
t=1
while ((t<T)||(newloss>loss)){
 loss=newloss
#pick a box to remove a unit from, favouring low-value boxes
 i=sample(1:10,1,prob=(10:1)*(x>0))
#moving all other counters by one
 xp=x+1;xp[i]=x[i]
#corresponding utility change
 for (j in 1:10) baza[j]=baz(j,xp,p)
 proz=exp(log(t)*(baza-baza[i])/Te)
#soft annealing move
 j=sample(1:10,1,prob=proz)
 if (i!=j){x[i]=x[i]-1;x[j]=x[j]+1}
 newloss=loss+baza[j]-baza[i]
#keep track of the best allocation met so far
 if (newloss>maxloz){
  maxloz=newloss;argz=x}
 t=t+1
#restart from a fresh allocation when doing worse than the start
 if ((t>T-10)&&(newloss<losref)){
  t=1;loss=0
  x=rmultinom(n=1,size=100,prob=p)
  newloss=losref=utility(x,p)}}

which seems to work, albeit not always returning the same utility. For instance,

> p=cy/sum(cy)
> utility(argz,p)
[1] 78.02
> utility(cy,p)
[1] 57.89

or

> p=sy/sum(sy)
> utility(argz,p)
[1] 82.04
> utility(sy,p)
[1] 57.78

Of course, this does not answer the question as intended and reworking the code to that purpose is not worth the time!

a knapsack riddle?

Posted in Books, pictures, R, Statistics, Travel on February 13, 2017 by xi'an


The [then current now past] riddle of the week is a sort of multi-armed bandit optimisation. Of sorts. Or rather a generalised knapsack problem. The question is about optimising the allocation of 100 indistinguishable units to 10 distinct boxes against a similarly endowed adversary, when the loss function is

L(x,y)=\mathbb{I}(x_1>y_1)-\mathbb{I}(x_1<y_1)+\ldots+10\left\{\mathbb{I}(x_{10}>y_{10})-\mathbb{I}(x_{10}<y_{10})\right\}

and the distribution q of the adversary is unknown. As usual (!), the phrasing of the riddle is somewhat ambiguous but I am under the impression that the game is played sequentially, hence that one can learn about the distribution of the adversary, at least when assuming this adversary keeps the same distribution q at all times.
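
For the record, a minimal sketch of this loss in R, assuming x and y are the two allocation vectors of length 10:

#one round of the game: box i is worth i points to whichever
#player put strictly more units in it, ties being worth nothing
loss=function(x,y) sum((1:10)*sign(x-y))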

an accurate variance approximation

Posted in Books, Kids, pictures, R, Statistics on February 7, 2017 by xi'an

In answering a simple question on X validated about producing Monte Carlo estimates of the variance of estimators of exp(-θ) in a Poisson model, I wanted to illustrate the accuracy of these estimates against the theoretical values. While one case was easy, since the estimator was a Binomial B(n,exp(-θ)) variate [in yellow on the graph], the other one, the exponential of the negative Poisson sample average, did not enjoy a closed-form variance and I instead used a first-order (δ-method) approximation for this variance, which ended up working surprisingly well [in brown] given that the experiment is based on an n=20 sample size.

Thanks to the comments of George Henry, I stand corrected: the variance of the exponential version is easily manageable with two lines of summation! As

\text{var}(\exp\{-\bar{X}_n\})=\exp\left\{-n\theta[1-\exp\{-2/n\}]\right\}-\exp\left\{-2n\theta[1-\exp\{-1/n\}]\right\}

which allows for a comparison with its second-order Taylor approximation:

[Figure: exact variance against its second-order Taylor approximation]
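
For the record, here is a minimal R sketch of such a comparison (my own check, not the code behind the graph), pitting the exact variance against the first-order δ-method approximation mentioned above and a brute-force Monte Carlo estimate, for the n=20, θ=1 setting assumed here:

#exact variance of exp(-mean(X)) for X a Poisson(theta) n-sample
n=20;theta=1
exvar=exp(-n*theta*(1-exp(-2/n)))-exp(-2*n*theta*(1-exp(-1/n)))
#first-order delta-method approximation
delvar=theta*exp(-2*theta)/n
#Monte Carlo check, the sample sum being Poisson(n*theta)
mcvar=var(exp(-rpois(1e6,n*theta)/n))
round(c(exvar,delvar,mcvar),6)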

a well-hidden E step

Posted in Books, Kids, pictures, R, Statistics on February 3, 2017 by xi'an

A recent question on X validated ended up being quite interesting! The model under consideration is made of parallel Markov chains on a finite state space, all with the same Markov transition matrix, M, which turns into a hidden Markov model when the only summary available is the number of chains in a given state at a given time. When writing down the EM algorithm, the E step involves the expected number of moves from a given state to a given state at a given time. The conditional distribution of those numbers of chains is a product of multinomials across times and starting states, with no Markov structure since the number of chains starting from a given state is known at each instant. Except that those multinomials are constrained by the number of “arrivals” in each state at the next instant and that this makes the computation of the expectation intractable, as far as I can see.

A solution by Monte Carlo EM means simulating the moves at each instant under the above constraints, which amounts to a sort of multinomial distribution with fixed margins, enjoying a closed-form expression but for the normalising constant. The direct simulation soon gets too costly as the number of states increases and I thus considered a basic Metropolis move, using one margin (row or column) or the other as proposal, with the correction taken on another margin. This is very basic but apparently enough for the purpose of the exercise, as in the sketch below. If I find time in the coming days, I will try to look at the ABC resolution of this problem, a logical move when starting from non-sufficient statistics!
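
To make this more concrete, here is a minimal sketch of a Metropolis step over count matrices with both margins fixed, a stand-in of my own based on the classical ±1 checkerboard move rather than on the margin-based proposal above, assuming n is the matrix of moves and M the transition matrix, the target being proportional to prod M[i,j]^n[i,j]/n[i,j]!:

#one Metropolis step: pick two rows and two columns, propose
#the +1/-1 checkerboard move, accept with the Metropolis ratio
swapstep=function(n,M){
 i=sample(1:nrow(n),2);j=sample(1:ncol(n),2)
 if ((n[i[1],j[2]]>0)&&(n[i[2],j[1]]>0)){
  rat=M[i[1],j[1]]*M[i[2],j[2]]/(M[i[1],j[2]]*M[i[2],j[1]])*
   n[i[1],j[2]]*n[i[2],j[1]]/((n[i[1],j[1]]+1)*(n[i[2],j[2]]+1))
  if (runif(1)<rat){
   n[i[1],j[1]]=n[i[1],j[1]]+1;n[i[2],j[2]]=n[i[2],j[2]]+1
   n[i[1],j[2]]=n[i[1],j[2]]-1;n[i[2],j[1]]=n[i[2],j[1]]-1}}
 return(n)}

The move preserves both row sums (chains per state at the current instant) and column sums (arrivals per state at the next instant) by construction, the ratio accounting for both the transition weights and the factorial terms.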

a typo that went under the radar

Posted in Books, R, Statistics, University life on January 25, 2017 by xi'an

A chance occurrence on X validated: a question about an incomprehensible formula for Bayesian model choice, which, most unfortunately!, appeared in Bayesian Essentials with R! Eeech! It looks like one line in our LaTeX file got erased and the likelihood part in the denominator altogether vanished. Apologies to all readers confused by this nonsensical formula!

an express riddle

Posted in Books, Kids, R on January 20, 2017 by xi'an

A quick puzzle on The Riddler this week that enjoys a quick solution once one writes it out. The core of the puzzle is about finding the average number of draws one needs to empty a population of size T when each draw removes a uniform number of individuals, between one and the number that remain. It is indeed easy to see that this average satisfies

\epsilon^T=1+\frac{1}{T}\sum_{i=1}^{T-1} \epsilon^i

since the first draw always occurs and leaves a remainder uniformly distributed over {0,1,…,T-1}, every value but zero requiring further draws. A recursion then leads by elimination [subtracting (T-1)\epsilon^{T-1} from T\epsilon^T gives \epsilon^T=\epsilon^{T-1}+1/T] to deduce that

\epsilon^T=\frac{1}{T}+\frac{1}{T-1}+\ldots+\frac{1}{2}+1

which is the beginning of the (divergent) harmonic series. In the case T=30, the solution is (almost) equal to 4.

> sum(1/(1:30))*1e10
[1] 39949871309
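
And a rudimentary Monte Carlo confirmation of this harmonic representation, a sketch of my own rather than part of the original solution:

#simulating the emptying of a population of size T when each
#draw removes a uniform number of the remaining individuals
emptyme=function(T){
 draws=0
 while (T>0){
  T=T-sample(1:T,1)
  draws=draws+1}
 return(draws)}
mean(replicate(1e5,emptyme(30))) #close to sum(1/(1:30))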

A second riddle the same week reminded me of a result in Devroye's Non-Uniform Random Variate Generation, namely finding the average number of draws from a Uniform until the sequence goes down. Actually, the real riddle operates with a finite-support Uniform, but I find the solution with the continuous Uniform more elegant. And it only took a few métro stops to solve. The solution goes as follows: the probability of stopping after two Uniform draws is 1/2 and, after n Uniform draws, it is (n-1)/n!, which does sum up to 1:

\sum_{n=2}^\infty \frac{n-1}{n!} = \sum_{n=2}^\infty \frac{n}{n!} - \sum_{n=2}^\infty \frac{1}{n!} = \sum_{n=1}^\infty \frac{1}{n!} - \sum_{n=2}^\infty \frac{1}{n!}=1

and the expectation of this distribution is e, by a very similar telescoping argument, which translates into an average of e-1 draws before the sequence goes down, as can be checked by a rudimentary Monte Carlo experiment

> over(1e7) #my implementation of the puzzle
[1] 1.7185152
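
Since over() is my own implementation and does not appear above, here is a minimal stand-in, counting the draws before the first descent and hence targeting e-1≈1.71828:

#average length of the initial increasing run of Uniforms,
#i.e., number of draws before the sequence first goes down
over=function(N){
 lenz=rep(0,N)
 for (t in 1:N){
  n=1;u=runif(1)
  repeat{
   v=runif(1)
   if (v<u) break
   u=v;n=n+1}
  lenz[t]=n}
 return(mean(lenz))} #slow in pure R for N=1e7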