Archive for Python

data scientist position

Posted in R, Statistics, University life with tags , , , , , , , , , , on April 8, 2014 by xi'an

Université Paris-DauphineOur newly created Chaire “Economie et gestion des nouvelles données” in Paris-Dauphine, ENS Ulm, École Polytechnique and ENSAE is recruiting a data scientist starting as early as May 1, the call remaining open till the position is filled. The location is in one of the above labs in Paris, the duration for at least one year, salary is varying, based on the applicant’s profile, and the contacts are Stephane Gaiffas (stephane.gaiffas AT cmap DOT polytechnique.fr), Robin Ryder (ryder AT ceremade DOT dauphine.fr). and Gabriel Peyré (peyre AT ceremade DOT dauphine.fr). Here are more details:

Job description

The chaire “Economie et gestion des nouvelles données” is recruiting a talented young engineer specialized in large scale computing and data processing. The targeted applications include machine learning, imaging sciences and finance. This is a unique opportunity to join a newly created research group between the best Parisian labs in applied mathematics and computer science (ParisDauphine, ENS Ulm, Ecole Polytechnique and ENSAE) working hand in hand with major industrial companies (Havas, BNP Paribas, Warner Bros.). The proposed position consists in helping researchers of the group to develop and implement large scale data processing methods, and applying these methods on real life problems in collaboration with the industrial partners.

A non exhaustive list of methods that are currently investigated by researchers of the group, and that will play a key role in the computational framework developed by the recruited engineer, includes :
● Large scale non smooth optimization methods (proximal schemes, interior points, optimization on manifolds).
● Machine learning problems (kernelized methods, Lasso, collaborative filtering, deep learning, learning for graphs, learning for timedependent systems), with a particular focus on large scale problems and stochastic methods.
● Imaging problems (compressed sensing, superresolution).
● Approximate Bayesian Computation (ABC) methods.
● Particle and Sequential Monte Carlo methods

Candidate profile

The candidate should have a very good background in computer science with various programming environments (e.g. Matlab, Python, C++) and knowledge of high performance computing methods (e.g. GPU, parallelization, cloud computing). He/she should adhere to the open source philosophy and possibly be able to interact with the relevant communities (e.g. scikitlearn initiative). Typical curriculum includes engineering school or Master studies in computer science / applied maths / physics, and possibly a PhD (not required).

Working environment

The recruited engineer will work within one of the labs of the chaire. He will benefit from a very stimulating working environment and all required computing resources. He will work in close interaction with the 4 research labs of the chaire, and will also have regular meetings with the industrial partners. More information about the chaire can be found online at http://www.di.ens.fr/~aspremon/chaire/

Bayesian programming [book review]

Posted in Books, Kids, pictures, Statistics, University life with tags , , , , , , , , , , on March 3, 2014 by xi'an

“We now think the Bayesian Programming methodology and tools are reaching maturity. The goal of this book is to present them so that anyone is able to use them. We will, of course, continue to improve tools and develop new models. However, pursuing the idea that probability is an alternative to Boolean logic, we now have a new important research objective, which is to design specific hsrdware, inspired from biology, to build a Bayesian computer.”(p.xviii)

On the plane to and from Montpellier, I took an extended look at Bayesian Programming a CRC Press book recently written by Pierre Bessière, Emmanuel Mazer, Juan-Manuel Ahuactzin, and Kamel Mekhnacha. (Very nice picture of a fishing net on the cover, by the way!) Despite the initial excitement at seeing a book which final goal was to achieve a Bayesian computer, as demonstrated by the above quote, I however soon found the book too arid to read due to its highly formalised presentation… The contents are clear indications that the approach is useful as they illustrate the use of Bayesian programming in different decision-making settings, including a collection of Python codes, so it brings an answer to the what but it somehow misses the how in that the construction of the priors and the derivation of the posteriors is not explained in a way one could replicate.

“A modeling methodology is not sufficient to run Bayesian programs. We also require an efficient Bayesian inference engine to automate the probabilistic calculus. This assumes we have a collection of inference algorithms adapted and tuned to more or less specific models and a software architecture to combine them in a coherent and unique tool.” (p.9)

For instance, all models therein are described via the curly brace formalism summarised by

phdthesis28xwhich quickly turns into an unpalatable object, as in this example taken from the online PhD thesis of Gabriel Synnaeve (where he applied Bayesian programming principles to a MMORPG called StarCraft and developed an AI (or bot) able to play BroodwarBotQ)

phdthesis37xthesis that I found most interesting!

“Consequently, we have 21 × 16 = 336 bell-shaped distributions and we have 2 × 21 × 16 = 772 free parameters: 336 means and 336 standard deviations.¨(p.51)

Now, getting back to the topic of the book, I can see connections with statistical problems and models, and not only via the application of Bayes’ theorem, when the purpose (or Question) is to take a decision, for instance in a robotic action. I still remain puzzled by the purpose of the book, since it starts with very low expectations on the reader, but hurries past notions like Kalman filters and Metropolis-Hastings algorithms in a few paragraphs. I do not get some of the details, like this notion of a discretised Gaussian distribution (I eventually found the place where the 772 prior parameters are “learned” in a phase called “identification”.)

“Thanks to conditional independence the curse of dimensionality has been broken! What has been shown to be true here for the required memory space is also true for the complexity of inferences. Conditional independence is the principal tool to keep the calculation tractable. Tractability of Bayesian inference computation is of course a major concern as it has been proved NP-hard (Cooper, 1990).”(p.74)

The final chapters (Chap. 14 on “Bayesian inference algorithms revisited”, Chap. 15 on “Bayesian learning revisited” and  Chap. 16 on “Frequently asked questions and frequently argued matters” [!]) are definitely those I found easiest to read and relate to. With mentions made of conjugate priors and of the EM algorithm as a (Bayes) classifier. The final chapter mentions BUGS, Hugin and… Stan! Plus a sequence of 23 PhD theses defended on Bayesian programming for robotics in the past 20 years. And explains the authors’ views on the difference between Bayesian programming and Bayesian networks (“any Bayesian network can be represented in the Bayesian programming formalism, but the opposite is not true”, p.316), between Bayesian programming and probabilistic programming (“we do not search to extend classical languages but rather to replace them by a new programming approach based on probability”, p.319), between Bayesian programming and Bayesian modelling (“Bayesian programming goes one step further”, p.317), with a further (self-)justification of why the book sticks to discrete variables, and further more philosophical sections referring to Jaynes and the principle of maximum entropy.

“The “objectivity” of the subjectivist approach then lies in the fact that two different subjects with same preliminary knowledge and same observations will inevitably reach the same conclusions.”(p.327)

Bayesian Programming thus provides a good snapshot of (or window on) what one can achieve in uncertain environment decision-making with Bayesian techniques. It shows a long-term reflection on those notions by Pierre Bessière, his colleagues and students. The topic is most likely too remote from my own interests for the above review to be complete. Therefore, if anyone is interested in reviewing any further this book for CHANCE, before I send the above to the journal, please contact me. (Usual provisions apply.)

speed of R, C, &tc.

Posted in R, Running, Statistics, University life with tags , , , , , , , , , on February 3, 2012 by xi'an

My Paris colleague (and fellow-runner) Aurélien Garivier has produced an interesting comparison of 4 (or 6 if you consider scilab and octave as different from matlab) computer languages in terms of speed for producing the MLE in a hidden Markov model, using EM and the Baum-Welch algorithms. His conclusions are that

  • matlab is a lot faster than R and python, especially when vectorization is important : this is why the difference is spectacular on filtering/smoothing, not so much on the creation of the sample;
  • octave is a good matlab emulator, if no special attention is payed to execution speed…;
  • scilab appears as a credible, efficient alternative to matlab;
  • still, C is a lot faster; the inefficiency of matlab in loops is well-known, and clearly shown in the creation of the sample.

(In this implementation, R is “only” three times slower than matlab, so this is not so damning…) All the codes are available and you are free to make suggestions to improve the speed of of your favourite language!

The foundations of Statistics: a simulation-based approach

Posted in Books, R, Statistics, University life with tags , , , , , , , , , , , on July 12, 2011 by xi'an

“We have seen that a perfect correlation is perfectly linear, so an imperfect correlation will be `imperfectly linear’.” page 128

This book has been written by two linguists, Shravan Vasishth and Michael Broe, in order to teach statistics “in  areas that are traditionally not mathematically demanding” at a deeper level than traditional textbooks “without using too much mathematics”, towards building “the confidence necessary for carrying more sophisticated analyses” through R simulation. This is a praiseworthy goal, bound to produce a great book. However, and most sadly, I find the book does not live up to expectations. As in Radford Neal’s recent coverage of introductory probability books with R, there are statements there that show a deep misunderstanding of the topic… (This post has also been published on the Statistics Forum.) Continue reading

Parallel computation [revised]

Posted in R, Statistics, University life with tags , , , , , , on March 15, 2011 by xi'an

We have now completed our revision of the parallel computation paper and hope to send it to JCGS within a few days. As seen on the arXiv version, and given the very positive reviews we received, the changes are minor, mostly focusing on the explanation of the principle and on the argument that it comes essentially free. Pierre also redrew the graphs in a more compact and nicer way, thanks to the ggplot2 package abilities. In addition, Pierre put the R and python programs used in the paper on a public depository.

Julien on R shortcomings

Posted in Books, R, Statistics, University life with tags , , , , , , , on September 8, 2010 by xi'an

Julien Cornebise posted a rather detailed set of comments (from Jasper!) that I thought was interesting and thought-provoking enough (!) to promote to a guest post. Here it is , then, to keep the debate rolling (with my only censoring being the removal of smileys!). (Please keep in mind that I do not endorse everything stated in this guest post! Especially the point on “Use R!“)

On C vs R
As a reply to Duncan: indeed C (at least for the bottlenecks) will probably always be faster for the final, mainstream use of an algorithm [e.g. as a distributed R library, or a standalone program]. Machine-level, smart compilers, etc etc. The same goes for Matlab, and even for Python: e.g. Pierre Jacob (Xian’s great PhD student) uses Weave to inline C in his Python code for the bottlenecks — simple, and fast. Some hedge funds even hire coders to recode the Matlab code of their consulting academic statisticians.

Point taken. But, as Radford Neal points out, that doesn’t justify R to be much slower that it could be:

  • When statisticians (cf Xian) want to develop/prototype new algorithms and methods while focussing on the math/stat/algo more than on the language-dependent implementation, it is still a shame to waste 50% (or even 25%). Same goes for the memory management, or even for some language features[1]
  • Even less computer-savvy users of R for real-case data, willing to use existing algorithms (not developing new algos) but on big/intricate datasets can be put off by slow speed — or even by memory failures.
  • And the library is BRILLIANT.

On Future Language vs R
Thanks David and Martyn for the link to Ihaka’s great paper on R-like lisp-based. Says things better than I could, and with an expertise on R that I haven’t. I also didn’t know about Robert Gentleman and his success at Harvard (but he *invented* the thing, not merely tuned it up).

Developing a whole new language and concept, as advocated in Ihaka’s paper and as suggested by gappy3000 would be a great leap forward, and a needed breakthrough to change the way we use computational stats. I would *love* to see that, as I personally think (as Ihaka advocates in the paper you link to) that R, as a language, is a hell of a pain [2] and I am saddened to see a lot of “Use R” books who will root its inadequate use for needs where the language hardly fits the bill — although the library does.

But R is here and in everyday use, and the matter is more of making it worth using, to its full potential. I have no special attachment to R, but any breakthrough language that would not be entirely compatible with the massive library contributed over the years would be doomed to fail to pick-up the everyday statistician—and we’re talking here about far-fetched long-term moves. Sanitary breakthrough, but harder to make happen when such an anchor is here.
I would say that R has turned into the Fortran of statistics: here to stay, anchored by the inertia that stems from its intrinsic (and widely acknowledged) merits  (I’ve been nice, I didn’t say Cobol.).

So until of the great leap forward comes (or until we make it happen as a community), I second Radford Neal‘s call for optimization of the existing core of R.

Rejoinder
As a rejoinder to the comments here, I think we need to consider separately

  1. R’s brilliant library
  2. R’s not-so-brilliant language and/or interpreter.

It seems to me from this topic that the community needs/should push for, in chronological order.

  1. First, a speed-up of R’s existing interpreter as called for by Radford Neal.  “Easy” and short-term task, by good-willing amateur coders, or, better, by solid CS people.
  2. Team-up with CS experts interested in developing computational stat-related tools.
  3. With them, get out of the now dead-ended R language and embark on a new stat framework based on an *existing*, proven, language. *Must*  be able to reuse the brilliant R library/codes brought up by the community. Failing so would fail to pick up the userbase = die in limbo.  That’s more or less what is called for by Ihaka (except for his doubts on the backward compatibility, see Section 7 of his paper).  Much harder and longer term, but worth it.

From then on
Who knows the R community enough to relay this call, and make it happen ? I’m out of my league.

Uninteresting footnotes:
[1] I have twitched several times when trying R, feeling the coding was somewhat unnatural from a CS point of view. [Mind, I twitch all the same, although on other points, with Matlab]
[2] again, I speak only out of the few tries I gave it, as I gave up using it for my everyday work, I am biased — and ignorant

Neal
Follow

Get every new post delivered to your Inbox.

Join 671 other followers