Archive for CREST

Korean trip

Posted in Mountains, Running, Statistics, Travel, University life with tags , , , , , , , , , , , , , on November 24, 2019 by xi'an

A fairly short but exciting trip to Seoul and to the Fall meeting of the Korean Statistical Society there. Plus giving a seminar at Seoul National University, where I stayed and enjoyed its beautiful campus surrounded by hills painted in the flamboyant reds and yellows of trees. Running to the top of Gwanaksan in the early morning, with some scrambling moments, was a fantastic beginning for the day! Although it was quite unintentional Sacha Tsybakov from CREST happened to be another invited speaker at the meeting (along with Regina Liu from Rutgers, whom I was also met in Salzburg two months ago) and we had a nice stroll together on the University of Seoul campus during a break in the sessions, gaining another view of the city from the top of the Bukhasan mountain. The talk I gave there on the asymptotics of ABC happened to be more attended than my tutorial lecture delivered at the beginning of JSM in Denver this summer. I am thus quite grateful to the organisers for their invitation and this opportunity to meet Korean statisticians and to get a glimpse of Korean culture and cuisine!

 

Christian Robert is giving a talk in Jussieu tomorrow

Posted in Statistics, University life with tags , , , , , , , on September 26, 2019 by xi'an

My namesake Christian (Yann) Robert (CREST) is giving a seminar tomorrow in Jussieu (Université Pierre & Marie Curie, couloir 16-26, salle 209), between 2 and 3, on composite likelihood estimation method for hierarchical Archimedean copulas defined with multivariate compound distributions. Here is the abstract:

We consider the family of hierarchical Archimedean copulas obtained from multivariate exponential mixture distributions through compounding, as introduced by Cossette et al. (2017). We investigate ways of determining the structure of these copulas and estimating their parameters. An agglomerative clustering technique based on the matrix of Spearman’s rhos, combined with a bootstrap procedure, is used to identify the tree structure. Parameters are estimated through a top-down composite likelihood. The validity of the approach is illustrated through two simulation studies in which the procedure is explained step by step. The composite likelihood method is also compared to the full likelihood method in a simple case where the latter is computable.

on anonymisation

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , , , , on August 2, 2019 by xi'an

An article in the New York Times covering a recent publication in Nature Communications on the ability to identify 99.98% of Americans from almost any dataset with fifteen covariates. And mentioning the French approach of INSEE, more precisely CASD (a branch of GENES, as ENSAE and CREST to which I am affiliated), where my friend Antoine worked for a few years, and whose approach is to vet researchers who want access to non-anonymised data, by creating local working environments on the CASD machines  so that data does not leave the site. The approach is to provide the researcher with a dedicated interface, which “enables access remotely to a secure infrastructure where confidential data is safe from harm”. It further delivers reproducibility certificates for publications, a point apparently missed by the New York Times which advances the lack of reproducibility as a drawback of the method. It also mentions the possibility of doing cryptographic data analysis, again missing the finer details with a lame objection.

“Our paper shows how the likelihood of a specific individual to have been correctly re-identified can be estimated with high accuracy even when the anonymized dataset is heavily incomplete.”

The Nature paper is actually about the probability for an individual to be uniquely identified from the given dataset, which somewhat different from the NYT headlines. Using a copula for the distribution of the covariates. And assessing the model with a mean square error evaluation when what matters are false positives and false negatives. Note that the model need be trained for each new dataset, which reduces the appeal of the claim, especially when considering that individuals tagged as uniquely identified about 6% are not. The statistic of 99.98% posted in the NYT is actually a count on a specific dataset,  the 5% Public Use Microdata Sample files, and Massachusetts residents, and not a general statistic [which would not make much sense!, as I can easily imagine 15 useless covariates] or prediction from the authors’ model. And a wee bit anticlimactic.

noise contrastive estimation

Posted in Statistics with tags , , , , , , , , , on July 15, 2019 by xi'an

As I was attending Lionel Riou-Durand’s PhD thesis defence in ENSAE-CREST last week, I had a look at his papers (!). The 2018 noise contrastive paper is written with Nicolas Chopin (both authors share the CREST affiliation with me). Which compares Charlie Geyer’s 1994 bypassing the intractable normalising constant problem by virtue of an artificial logit model with additional simulated data from another distribution ψ.

“Geyer (1994) established the asymptotic properties of the MC-MLE estimates under general conditions; in particular that the x’s are realisations of an ergodic process. This is remarkable, given that most of the theory on M-estimation (i.e.estimation obtained by maximising functions) is restricted to iid data.”

Michael Guttman and Aapo Hyvärinen also use additional simulated data in another likelihood of a logistic classifier, called noise contrastive estimation. Both methods replace the unknown ratio of normalising constants with an unbiased estimate based on the additional simulated data. The major and impressive result in this paper [now published in the Electronic Journal of Statistics] is that the noise contrastive estimation approach always enjoys a smaller variance than Geyer’s solution, at an equivalent computational cost when the actual data observations are iid. And the artificial data simulations ergodic. The difference between both estimators is however negligible against the Monte Carlo error (Theorem 2).

This may be a rather naïve question, but I wonder at the choice of the alternative distribution ψ. With a vague notion that it could be optimised in a GANs perspective. A side result of interest in the paper is to provide a minimal (re)parameterisation of the truncated multivariate Gaussian distribution, if only as an exercise for future exams. Truncated multivariate Gaussian for which the normalising constant is of course unknown.

Ph.D. scholarships at ENSAE ParisTech‐CREST

Posted in Statistics with tags , , , , , , , , on April 2, 2019 by xi'an

ENSAE ParisTech and CREST are currently inviting applications for 3-year PhD scholarships in statistics (and economics, finance, and sociology). There is no constraint of nationality or curriculum, but the supervisor must be from ENSAE (Paris-Saclay) or ENSAI (Rennes-Bruz).  The deadline is May 1, to be sent to Mrs Fanda Traore, at ensae.fr.

Applications should submitted (in French or in English), including :
– Curriculum vitae;
– Statement of research and teaching interests (10 pages);
– a cover letter
– the official transcripts of all higher education institutions from which you get a degree
– recommendation letters from professors, including a letter from the Ph.D. supervisor.

Selected candidates will be most likely interviewed at ENSAE‐CREST.

position in statistics and/or machine learning at ENSAE ParisTech‐CREST

Posted in pictures, University life with tags , , , , , , , on March 28, 2019 by xi'an

ENSAE ParisTech and CREST are currently inviting applications for a position of Assistant or Associate Professor in Statistics or Machine Learning.

The appointment starts in September, 2019, at the earliest. At the level of Assistant Professor, the position is for an initial three-year term renewable for another three years before the tenure evaluation. Salary is competitive according to qualifications. The teaching duties are reduced compared to French university standards. At the time of appointment, knowledge of French is not required but it is expected that the appointee will acquire a workable knowledge of French within a reasonable time.

Candidate Profile

– PhD in Statistics or Machine Learning.
– Outstanding research, including subjects in high-dimensional statistics and machine learning.
– Publications in leading international journals in Statistics or leading outlets in Machine Learning.

Demonstrated ability to teach courses in Mathematics, Statistics and Machine Learning for engineers and to supervise projects in Applied Statistics. The successful candidate is expected to teach at least one course in mathematics, applied mathematics or introductory statistics at the undergraduate level, and one course in the “Data Science, Statistics and Machine Learning”’ specialization track during the third year of ENSAE (Master level).

Applications should submitted (in French or in English) by email to recruitment@ensae.fr :
– Curriculum vitae;
– Statement of research and teaching interests (2-4 pages);
– Names and addresses of three or more individuals willing to provide letters of reference.

Deadline for applications : April 29, 2019.
Selected candidates will be invited to present their work and project at ENSAE‐CREST.

Roberto Casarin’s talk at CREST tomorrow

Posted in Statistics with tags , , , , , , , , , , , on March 13, 2019 by xi'an

My former student and friend Roberto Casarin (University Ca’Foscari, Venice) will talk tomorrow at the CREST Financial Econometrics seminar on

“Bayesian Markov Switching Tensor Regression for Time-varying Networks”

Time: 10:30
Date: 14 March 2019
Place: Room 3001, ENSAE, Université Paris-Saclay

Abstract : We propose a new Bayesian Markov switching regression model for multi-dimensional arrays (tensors) of binary time series. We assume a zero-inflated logit dynamics with time-varying parameters and apply it to multi-layer temporal networks. The original contribution is threefold. First, in order to avoid over-fitting we propose a parsimonious parameterisation of the model, based on a low-rank decomposition of the tensor of regression coefficients. Second, the parameters of the tensor model are driven by a hidden Markov chain, thus allowing for structural changes. The regimes are identified through prior constraints on the mixing probability of the zero-inflated model. Finally, we model the jointly dynamics of the network and of a set of variables of interest. We follow a Bayesian approach to inference, exploiting the Pólya-Gamma data augmentation scheme for logit models in order to provide an efficient Gibbs sampler for posterior approximation. We show the effectiveness of the sampler on simulated datasets of medium-big sizes, finally we apply the methodology to a real dataset of financial networks.