## 6th French Econometrics Conference in Dauphine

Posted in Books, Kids, pictures, Statistics, University life on October 15, 2014 by xi'an

On December 4-5, Université Paris-Dauphine will host the 6th French Econometrics Conference, which celebrates Christian Gouriéroux and his contributions to econometrics. (Christian was my statistics professor during my graduate years at ENSAE and then Head of CREST when I joined this research unit, first as a PhD student and later as Head of the statistics group. And he has always been a tremendous support for me.)

Not only is the program quite impressive, with co-authors of Christian Gouriéroux and a few Nobel laureates (if not the latest, Jean Tirole, who taught economics at ENSAE when I was a student there), but registration is free. I will most definitely attend the talks, as I am in Paris-Dauphine at this time of year (the week before NIPS). In particular, I am looking forward to Gallant’s views on Bayesian statistics.

## Statistics slides (3)

Posted in Books, Kids, Statistics, University life on October 9, 2014 by xi'an

Here is the third set of slides for my third year statistics course. Nothing out of the ordinary, but the opportunity to link statistics and simulation for students not yet exposed to Monte Carlo methods. (No ABC yet, but who knows?, I may use ABC as an entry to Bayesian statistics, following Don Rubin’s example! Surprising typo on the Project Euclid page for this 1984 paper, by the way…) On Monday, I had the pleasant surprise to see Shravan Vasishth in the audience, as he is visiting Université Denis Diderot (Paris 7) this month.

## a weird beamer feature…

Posted in Books, Kids, Linux, R, Statistics, University life on September 24, 2014 by xi'an

As I was preparing my slides for my third year undergraduate stat course, I got a weird error that took a search on the Web to unravel:

! Extra }, or forgotten \endgroup.
\endframe ->\egroup
\begingroup \def \@currenvir {frame}
l.23 \end{frame}
\begin{slide}
?


which was related to a fragile environment

\begin{frame}[fragile]
\frametitle{simulation in practice}
\begin{itemize}
\item For a given distribution $F$, call the corresponding
pseudo-random generator in an arbitrary computer language
\begin{verbatim}
> x=rnorm(10)
> x
[1] -0.021573 -1.134735  1.359812 -0.887579
[7] -0.749418  0.506298  0.835791  0.472144
\end{verbatim}
\item use the sample as a statistician would
\begin{verbatim}
> mean(x)
[1] 0.004892123
> var(x)
[1] 0.8034657
\end{verbatim}
to approximate quantities related with $F$
\end{itemize}
\end{frame}\begin{frame}


but not directly to the verbatim part: the reason for the bug was that the \end{frame} command was not on a line by itself! This is one of the rare occurrences where a line break has an impact in LaTeX, as far as I know… (The same bug appears when the line starts with an indentation. Weird!) [Another annoying feature is wordpress turning > into &gt; in the sourcecode environment...]
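
Since the error message gives little hint about the cause, a quick scan of the source can help. Below is a minimal Python sketch (not part of the original slides; the function name `misplaced_frame_ends` is made up for illustration) that reports the lines where an `\end{frame}` closing a `[fragile]` frame shares its line with anything else, including leading indentation. It assumes the only requirement is the one described above, and it does not try to skip LaTeX comments or verbatim content.

```python
import re

# Matches \begin{frame} or \end{frame}, with an optional [options] block.
FRAME_TOKEN = re.compile(r"\\(begin|end)\{frame\}(?:\[([^\]]*)\])?")

def misplaced_frame_ends(tex_source):
    # Return the 1-based line numbers where \end{frame} closes a fragile
    # frame but is not alone (and unindented) on its line.
    problems = []
    in_fragile = False
    for number, line in enumerate(tex_source.splitlines(), start=1):
        for match in FRAME_TOKEN.finditer(line):
            kind, options = match.group(1), match.group(2) or ""
            if kind == "begin":
                in_fragile = "fragile" in options
            else:
                # Trailing spaces are ignored; anything else on the line is flagged.
                if in_fragile and line.rstrip() != r"\end{frame}":
                    problems.append(number)
                in_fragile = False
    return problems

if __name__ == "__main__":
    demo = "\\begin{frame}[fragile]\nsome text\n\\end{frame}\\begin{frame}\nmore\n\\end{frame}\n"
    print(misplaced_frame_ends(demo))  # prints [3]
```

Only the third line of the demo is reported: the final `\end{frame}` closes a non-fragile frame, for which the issue described above does not arise.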

## Statistics second slides

Posted in Books, Kids, Statistics, University life on September 24, 2014 by xi'an

This is the next chapter of my Statistics course, definitely more standard, with some notions on statistical models, limit theorems, and exponential families. In the first class, I recalled the convergence notions with no proofs but with counterexamples and spent some time on a slide not included here, borrowed from Chris Holmes’ talk last Friday, on the linear relation between blood pressure and the log odds ratio of a heart condition. This was a great example, both to illustrate the power of increasing the number of observations and of using a logistic regression model. Students kept asking questions about it.

## another R new trick [new for me!]

Posted in Books, Kids, R, Statistics, University life on July 16, 2014 by xi'an

While working with Andrew and a student from Dauphine on importance sampling, we wanted to assess the distribution of the resulting sample via the Kolmogorov-Smirnov measure

$\max_x |\hat{F}_n(x)-F(x)|$

where F is the target. This distance (times √n) has an asymptotic distribution that does not depend on n, called the Kolmogorov distribution. After searching for a little while, we could not figure out where this distribution was available in R. It had to be, since ks.test was returning a p-value. Hopefully correct! So I looked into the ks.test function, which happens not to be entirely programmed in C, and found the line

PVAL <- 1 - if (alternative == "two.sided")
.Call(C_pKolmogorov2x, STATISTIC, n)


which means that the Kolmogorov distribution is coded as a C function C_pKolmogorov2x in R. However, I could not call the function myself.

> .Call(C_pKolmogorov2x,.3,4)


Hence, as I did not want to recode the cdf of this distribution, I posted the question on stackoverflow (long time no see!) and got a reply almost immediately, suggesting the package kolmim. Followed by an extra comment from the same person that calling the C code only required adding the namespace path to its name, as in

> .Call(stats:::C_pKolmogorov2x,STAT=.3,n=4)
[1] 0.2292
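
For readers who want to see what is being computed, here is a minimal Python sketch (not the R internals, and with made-up function names) of the Kolmogorov-Smirnov distance and of the limiting Kolmogorov cdf, using the classical series $1-2\sum_{k\ge 1}(-1)^{k-1}e^{-2k^2x^2}$. Note the assumption: this is the asymptotic distribution of √n times the distance, whereas the C function called above takes n as an argument, so for a small sample like n=4 the two need not agree closely.

```python
import math

def ks_statistic(sample, cdf):
    # sup_x |F_n(x) - F(x)| for a continuous target cdf F.
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical cdf jumps from i/n to (i+1)/n at x.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

def kolmogorov_cdf(x):
    # Limiting cdf of sqrt(n) * D_n: 1 - 2 * sum_{k>=1} (-1)^(k-1) exp(-2 k^2 x^2).
    if x < 0.1:
        return 0.0  # the cdf is numerically zero this close to the origin
    total, k = 0.0, 1
    while True:
        term = math.exp(-2.0 * k * k * x * x)
        if term < 1e-17:
            break
        total += term if k % 2 == 1 else -term
        k += 1
    return min(1.0, max(0.0, 1.0 - 2.0 * total))

def asymptotic_p_value(sample, cdf):
    # Approximate two-sided p-value based on the limiting distribution.
    d = ks_statistic(sample, cdf)
    return 1.0 - kolmogorov_cdf(math.sqrt(len(sample)) * d)
```

For an importance sample of reasonable size, `asymptotic_p_value(sample, F)` should give a usable approximation, with `F` any Python function evaluating the target cdf.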


## implementing reproducible research [short book review]

Posted in Books, Kids, pictures, R, Statistics, Travel, University life on July 15, 2014 by xi'an

As promised, I got back to this book, Implementing reproducible research (after the pigeons had their say). I looked at it this morning while monitoring my students taking their last-chance R exam (definitely last chance as my undergraduate R course is not renewed next year). The book is in fact an edited collection of papers on tools, principles, and platforms around the theme of reproducible research. It obviously links with other themes like open access, open data, and open software. All positive directions that need more active support from the scientific community. In particular the solutions advocated through this volume are mostly Linux-based. Among the tools described in the first chapter, knitr appears as an alternative to sweave. I used the latter a while ago and, while I like its philosophy, it does not extend to situations where the R code within takes too long to run… (Or maybe I did not invest enough time to grasp the entire spectrum of sweave.) Note that, even though the book is part of the R Series of CRC Press, many chapters are unrelated to R. And even more [unrelated] to statistics.

This limitation is somewhat my difficulty with [adhering to] the global message proposed by the book. It is great to construct such tools that monitor and archive successive versions of code and research, as anyone can trace back the research steps leading to the published result(s). Using some of the platforms covered by the book establishes for instance a superb documentation principle, going much further than just providing an “easy” verification tool against fraudulent experiments. The notion of a super-wiki where notes and preliminary versions and calculations (and dead ends and failures) would be preserved for open access is just as great. However this type of research processing and discipline takes time and space and human investment, i.e. resources that are scarce and costly. Complex studies may involve enormous amounts of data and, neglecting the notions of confidentiality and privacy, the cost of storing such amounts is significant. Similarly for experiments that require days and weeks of huge clusters. I thus wonder where those resources would be found (journals, universities, high tech companies, …?) for the principle to hold in full generality and how transient they could prove. One cannot expect the research team to guarantee availability of those meta-documents for remote time horizons. Just as a biased illustration, checking the available Bayes’ notebooks meant going to a remote part of London at a specific time and with a preliminary appointment. Those notebooks are now available on line for free. But for how long?

“So far, Bob has been using Charlie’s old computer, using Ubuntu 10.04. The next day, he is excited to find the new computer Alice has ordered for him has arrived. He installs Ubuntu 12.04” A. Davison et al.

Putting their principles into practice, the authors of Implementing reproducible research have made all chapters available for free on the Open Science Framework. I thus encourage anyone interested in those principles (and who would not be?!) to peruse the chapters and see how they can benefit from and contribute to open and reproducible research.

## vector quantile regression

Posted in pictures, Statistics, University life on July 4, 2014 by xi'an

My Paris-Dauphine colleague Guillaume Carlier recently arXived a statistics paper entitled Vector quantile regression, co-written with Chernozhukov and Galichon. I was most curious to read the paper as Guillaume is primarily a mathematical analyst working on optimisation problems like optimal transport. And also because I find quantile regression difficult to fathom as a statistical problem. (As it happens, both his co-authors are from econometrics.) The results in the paper are (i) to show that a d-dimensional (Lebesgue) absolutely continuous random variable Y can always be represented as the deterministic transform Y=Q(U), where U is uniform on the d-dimensional unit cube [0,1]^d (the paper expresses this transform as conditional on a set of regressors Z, but those essentially play no role) and Q is monotone in the sense of being the gradient of a convex function,

$Q(u) = \nabla q(u)$ and $\{Q(u)-Q(v)\}^\text{T}(u-v)\ge 0;$

(ii) to deduce from this representation a unique notion of multivariate quantile function; and (iii) to consider the special case when the quantile function Q can be written as the linear function

$\beta(U)^\text{T}Z$

where β(U) is a matrix. Hence leading to an estimation problem.

While unsurprising from a measure theoretic viewpoint, the representation theorem (i) is most interesting both for statistical and simulation reasons. Provided the function Q can be easily estimated and derived, respectively. The paper however does not provide a constructive tool for this derivation, besides indicating several characterisations as solutions of optimisation problems. From a statistical perspective, a non-parametric estimation of β(.) would have useful implications in multivariate regression, although the paper only considers the specific linear case above. This solution is obtained by a discretisation of all variables and linear programming.
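
To make the representation more concrete, here is a toy Python sketch of the simplest possible case, d=1 with no regressors Z, which is my own illustration rather than the algorithm of the paper. In one dimension the gradient of a convex function is just a nondecreasing function, so Q is the usual quantile function and its discretised version is obtained by sorting. The sketch also relies on a standard fact that the post does not state, namely that the monotone map is the coupling maximising the sum of the products u·y. With equal weights on n points the corresponding linear program reduces to an assignment problem, solved here by brute force for a tiny n. All function names are hypothetical.

```python
from itertools import permutations

def empirical_quantile_map(ys):
    # Grid points u_i = (i + 0.5) / n paired with the sorted observations Q(u_i).
    n = len(ys)
    us = [(i + 0.5) / n for i in range(n)]
    return us, sorted(ys)

def is_monotone(us, qs):
    # Check {Q(u) - Q(v)} (u - v) >= 0 for every pair of grid points.
    m = len(us)
    return all((qs[i] - qs[j]) * (us[i] - us[j]) >= 0
               for i in range(m) for j in range(m))

def best_assignment_value(us, ys):
    # Brute-force maximum of sum_i u_i * y_pi(i) over all permutations pi.
    return max(sum(u * y for u, y in zip(us, perm))
               for perm in permutations(ys))

if __name__ == "__main__":
    ys = [2.0, -1.0, 0.5, 3.0]
    us, qs = empirical_quantile_map(ys)
    sorted_value = sum(u * q for u, q in zip(us, qs))
    print(is_monotone(us, qs))
    print(sorted_value, best_assignment_value(us, ys))
```

The two printed values coincide, meaning the sorted (monotone) pairing attains the maximum. In higher dimensions sorting is no longer available, which is where an actual linear programming solver would be needed.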