Archive for CREST

Christian Robert is giving a talk in Jussieu tomorrow

Posted in Statistics, University life on September 26, 2019 by xi'an

My namesake Christian (Yann) Robert (CREST) is giving a seminar tomorrow in Jussieu (Université Pierre & Marie Curie, couloir 16-26, salle 209), between 2 and 3, on a composite likelihood estimation method for hierarchical Archimedean copulas defined with multivariate compound distributions. Here is the abstract:

We consider the family of hierarchical Archimedean copulas obtained from multivariate exponential mixture distributions through compounding, as introduced by Cossette et al. (2017). We investigate ways of determining the structure of these copulas and estimating their parameters. An agglomerative clustering technique based on the matrix of Spearman’s rhos, combined with a bootstrap procedure, is used to identify the tree structure. Parameters are estimated through a top-down composite likelihood. The validity of the approach is illustrated through two simulation studies in which the procedure is explained step by step. The composite likelihood method is also compared to the full likelihood method in a simple case where the latter is computable.
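As a rough illustration of the clustering step mentioned in the abstract, here is a minimal sketch of agglomerative clustering driven by a matrix of Spearman's rhos. It is not the authors' procedure: the average-linkage rule, the function names (`ranks`, `spearman_rho`, `agglomerate`) and the simulated data are all assumptions made for the example, and the bootstrap and composite likelihood steps are left out.

```python
import random


def ranks(values):
    # Ranks 1..n; ties are not handled (continuous data assumed).
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r


def spearman_rho(x, y):
    # Pearson correlation of the ranks (both rank vectors share the same variance).
    rx, ry = ranks(x), ranks(y)
    mean = (len(x) + 1) / 2.0
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)
    return cov / var


def agglomerate(columns):
    """Average-linkage clustering on Spearman's rho; returns a nested tuple tree."""
    d = len(columns)
    rho = [[spearman_rho(columns[i], columns[j]) for j in range(d)] for i in range(d)]
    clusters = [((i,), i) for i in range(d)]  # (member indices, subtree)
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ma, mb = clusters[a][0], clusters[b][0]
                avg = sum(rho[i][j] for i in ma for j in mb) / (len(ma) * len(mb))
                if best is None or avg > best[0]:
                    best = (avg, a, b)
        _, a, b = best
        merged = (clusters[a][0] + clusters[b][0], (clusters[a][1], clusters[b][1]))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters[0][1]


# Made-up data: variables 0 and 1 share one latent factor, variables 2 and 3 another.
rng = random.Random(0)
n = 400
z1 = [rng.gauss(0, 1) for _ in range(n)]
z2 = [rng.gauss(0, 1) for _ in range(n)]
columns = [
    [z + 0.5 * rng.gauss(0, 1) for z in z1],
    [z + 0.5 * rng.gauss(0, 1) for z in z1],
    [z + 0.5 * rng.gauss(0, 1) for z in z2],
    [z + 0.5 * rng.gauss(0, 1) for z in z2],
]
tree = agglomerate(columns)
print(tree)  # nested tuples of variable indices
```

The most strongly rank-correlated variables are merged first, so the nested tuples give a candidate tree structure, which a bootstrap could then be used to assess.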

on anonymisation

Posted in Books, pictures, Statistics, University life on August 2, 2019 by xi'an

An article in the New York Times covering a recent publication in Nature Communications on the ability to identify 99.98% of Americans from almost any dataset with fifteen covariates. And mentioning the French approach of INSEE, more precisely CASD (a branch of GENES, like ENSAE and CREST, with which I am affiliated), where my friend Antoine worked for a few years, and whose approach is to vet researchers who want access to non-anonymised data, by creating local working environments on the CASD machines so that data does not leave the site. The approach is to provide the researcher with a dedicated interface, which “enables access remotely to a secure infrastructure where confidential data is safe from harm”. It further delivers reproducibility certificates for publications, a point apparently missed by the New York Times, which advances the lack of reproducibility as a drawback of the method. The article also mentions the possibility of doing cryptographic data analysis, again missing the finer details with a lame objection.

“Our paper shows how the likelihood of a specific individual to have been correctly re-identified can be estimated with high accuracy even when the anonymized dataset is heavily incomplete.”

The Nature paper is actually about the probability for an individual to be uniquely identified from the given dataset, which is somewhat different from the NYT headlines. Using a copula for the distribution of the covariates. And assessing the model with a mean square error evaluation when what matters are false positives and false negatives. Note that the model needs to be trained for each new dataset, which reduces the appeal of the claim, especially when considering that about 6% of the individuals tagged as uniquely identified are not. The statistic of 99.98% posted in the NYT is actually a count on a specific dataset, the 5% Public Use Microdata Sample files, and Massachusetts residents, and not a general statistic [which would not make much sense!, as I can easily imagine 15 useless covariates] or prediction from the authors’ model. And a wee bit anticlimactic.
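To make the distinction concrete, a count like the 99.98% figure amounts to the share of records that are the only ones with their combination of covariate values in a given file. Here is a minimal sketch of such a count, on a made-up four-record dataset; the function name `unique_fraction` and the column names are purely illustrative and have nothing to do with the paper's copula model.

```python
from collections import Counter


def unique_fraction(records, columns):
    """Share of records whose values on `columns` appear exactly once in the dataset."""
    keys = [tuple(r[c] for c in columns) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(keys)


# Made-up toy records, for illustration only.
records = [
    {"zip": "02139", "sex": "F", "birth_year": 1980},
    {"zip": "02139", "sex": "F", "birth_year": 1980},
    {"zip": "02139", "sex": "M", "birth_year": 1975},
    {"zip": "02140", "sex": "F", "birth_year": 1980},
]

print(unique_fraction(records, ["sex"]))                       # 0.25
print(unique_fraction(records, ["zip", "sex", "birth_year"]))  # 0.5
```

The count depends entirely on the dataset and on which covariates are used, which is why it cannot stand as a general statistic.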

noise contrastive estimation

Posted in Statistics on July 15, 2019 by xi'an

As I was attending Lionel Riou-Durand’s PhD thesis defence in ENSAE-CREST last week, I had a look at his papers (!). The 2018 noise contrastive paper is written with Nicolas Chopin (both authors share the CREST affiliation with me). The paper takes as its point of comparison Charlie Geyer’s 1994 approach of bypassing the intractable normalising constant problem by virtue of an artificial logit model with additional simulated data from another distribution ψ.

“Geyer (1994) established the asymptotic properties of the MC-MLE estimates under general conditions; in particular that the x’s are realisations of an ergodic process. This is remarkable, given that most of the theory on M-estimation (i.e. estimation obtained by maximising functions) is restricted to iid data.”

Michael Gutmann and Aapo Hyvärinen also use additional simulated data in another likelihood of a logistic classifier, called noise contrastive estimation. Both methods replace the unknown ratio of normalising constants with an unbiased estimate based on the additional simulated data. The major and impressive result in this paper [now published in the Electronic Journal of Statistics] is that the noise contrastive estimation approach always enjoys a smaller variance than Geyer’s solution, at an equivalent computational cost, when the actual data observations are iid and the artificial data simulations ergodic. The difference between both estimators is however negligible against the Monte Carlo error (Theorem 2).
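For readers unfamiliar with noise contrastive estimation, here is a minimal sketch on a toy problem, not the paper's setting: the unnormalised model is log f(x) = -θx²/2 + c, where c stands in for minus the log normalising constant and is estimated as a free parameter, and ψ is taken to be a N(0, 4) density. The names (`nce_fit`, `log_psi`), the choice of ψ and the damped Newton solver are all assumptions made for the illustration.

```python
import math
import random


def sigmoid(g):
    # Numerically stable logistic function.
    if g >= 0:
        return 1.0 / (1.0 + math.exp(-g))
    e = math.exp(g)
    return e / (1.0 + e)


def log_sigmoid(g):
    # Numerically stable log(sigmoid(g)).
    return -math.log1p(math.exp(-g)) if g >= 0 else g - math.log1p(math.exp(g))


def nce_fit(data, noise, log_psi, iters=50):
    """Estimate (theta, c) by logistic classification of data versus noise."""
    offset = math.log(len(data) / len(noise))
    # Log-odds of "real data": theta * u + c + off, with u = -x^2 / 2.
    pts = [(-0.5 * x * x, offset - log_psi(x), 1.0) for x in data]
    pts += [(-0.5 * y * y, offset - log_psi(y), 0.0) for y in noise]

    def objective(theta, c):
        total = 0.0
        for u, off, label in pts:
            g = theta * u + c + off
            total += log_sigmoid(g) if label == 1.0 else log_sigmoid(-g)
        return total

    theta, c = 0.5, 0.0  # arbitrary starting point
    for _ in range(iters):
        g1 = g2 = h11 = h12 = h22 = 0.0
        for u, off, label in pts:
            p = sigmoid(theta * u + c + off)
            w = p * (1.0 - p)
            g1 += (label - p) * u
            g2 += label - p
            h11 += w * u * u
            h12 += w * u
            h22 += w
        det = h11 * h22 - h12 * h12
        if det <= 1e-12:
            break
        # Newton direction for the concave logistic log-likelihood.
        d1 = (h22 * g1 - h12 * g2) / det
        d2 = (h11 * g2 - h12 * g1) / det
        step, base = 1.0, objective(theta, c)
        while objective(theta + step * d1, c + step * d2) < base and step > 1e-8:
            step *= 0.5  # backtrack until the objective does not decrease
        theta, c = theta + step * d1, c + step * d2
        if abs(step * d1) + abs(step * d2) < 1e-10:
            break
    return theta, c


def log_psi(x):
    # Log-density of the N(0, 4) noise distribution (an arbitrary choice).
    return -x * x / 8.0 - math.log(2.0 * math.sqrt(2.0 * math.pi))


random.seed(1)
data = [random.gauss(0.0, 1.0) for _ in range(3000)]   # "observed" sample
noise = [random.gauss(0.0, 2.0) for _ in range(3000)]  # simulated from psi
theta_hat, c_hat = nce_fit(data, noise, log_psi)
print(theta_hat, c_hat)  # true values: theta = 1, c = -log(sqrt(2 * pi))
```

Since the log-odds are linear in (θ, c), the whole procedure reduces in this toy case to a logistic regression with a known offset, which is what makes the approach so convenient when the normalising constant is intractable.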

This may be a rather naïve question, but I wonder at the choice of the alternative distribution ψ. With a vague notion that it could be optimised in a GANs perspective. A side result of interest in the paper is to provide a minimal (re)parameterisation of the truncated multivariate Gaussian distribution, if only as an exercise for future exams. Truncated multivariate Gaussian for which the normalising constant is of course unknown.

Ph.D. scholarships at ENSAE ParisTech‐CREST

Posted in Statistics on April 2, 2019 by xi'an

ENSAE ParisTech and CREST are currently inviting applications for 3-year PhD scholarships in statistics (and economics, finance, and sociology). There is no constraint of nationality or curriculum, but the supervisor must be from ENSAE (Paris-Saclay) or ENSAI (Rennes-Bruz). The deadline is May 1, with applications to be sent to Mrs Fanda Traore, at ensae.fr.

Applications should be submitted (in French or in English), including:
– Curriculum vitae;
– Statement of research and teaching interests (10 pages);
– A cover letter;
– The official transcripts of all higher education institutions from which you received a degree;
– Recommendation letters from professors, including a letter from the Ph.D. supervisor.

Selected candidates will most likely be interviewed at ENSAE‐CREST.

position in statistics and/or machine learning at ENSAE ParisTech‐CREST

Posted in pictures, University life on March 28, 2019 by xi'an

ENSAE ParisTech and CREST are currently inviting applications for a position of Assistant or Associate Professor in Statistics or Machine Learning.

The appointment starts in September 2019, at the earliest. At the level of Assistant Professor, the position is for an initial three-year term renewable for another three years before the tenure evaluation. Salary is competitive according to qualifications. The teaching duties are reduced compared to French university standards. At the time of appointment, knowledge of French is not required but it is expected that the appointee will acquire a workable knowledge of French within a reasonable time.

Candidate Profile

– PhD in Statistics or Machine Learning.
– Outstanding research, including subjects in high-dimensional statistics and machine learning.
– Publications in leading international journals in Statistics or leading outlets in Machine Learning.

Demonstrated ability to teach courses in Mathematics, Statistics and Machine Learning for engineers and to supervise projects in Applied Statistics. The successful candidate is expected to teach at least one course in mathematics, applied mathematics or introductory statistics at the undergraduate level, and one course in the “Data Science, Statistics and Machine Learning” specialization track during the third year of ENSAE (Master level).

Applications should be submitted (in French or in English) by email to recruitment@ensae.fr:
– Curriculum vitae;
– Statement of research and teaching interests (2-4 pages);
– Names and addresses of three or more individuals willing to provide letters of reference.

Deadline for applications: April 29, 2019.
Selected candidates will be invited to present their work and project at ENSAE‐CREST.

Roberto Casarin’s talk at CREST tomorrow

Posted in Statistics on March 13, 2019 by xi'an

My former student and friend Roberto Casarin (University Ca’Foscari, Venice) will talk tomorrow at the CREST Financial Econometrics seminar on

“Bayesian Markov Switching Tensor Regression for Time-varying Networks”

Time: 10:30
Date: 14 March 2019
Place: Room 3001, ENSAE, Université Paris-Saclay

Abstract: We propose a new Bayesian Markov switching regression model for multi-dimensional arrays (tensors) of binary time series. We assume zero-inflated logit dynamics with time-varying parameters and apply the model to multi-layer temporal networks. The original contribution is threefold. First, in order to avoid over-fitting we propose a parsimonious parameterisation of the model, based on a low-rank decomposition of the tensor of regression coefficients. Second, the parameters of the tensor model are driven by a hidden Markov chain, thus allowing for structural changes. The regimes are identified through prior constraints on the mixing probability of the zero-inflated model. Finally, we jointly model the dynamics of the network and of a set of variables of interest. We follow a Bayesian approach to inference, exploiting the Pólya-Gamma data augmentation scheme for logit models in order to provide an efficient Gibbs sampler for posterior approximation. We show the effectiveness of the sampler on simulated datasets of medium to large sizes. Finally, we apply the methodology to a real dataset of financial networks.

Siem Reap conference

Posted in Kids, pictures, Travel, University life on March 8, 2019 by xi'an

As I returned from the conference in Siem Reap, on a flight avoiding India and Pakistan and their [brittle and bristling!] boundary on the way back, instead flying far far north, near Arkhangelsk (but with nothing to show for it, as the flight back was fully in the dark), I reflected how enjoyable this conference had been, within a highly friendly atmosphere, meeting again with many old friends (some met prior to the creation of CREST) and new ones, a pleasure not hindered by the fabulous location near Angkor of course. (The above picture is the “last hour” group picture, missing a major part of the participants, already gone!)

Among the many talks, Stéphane Shao gave a great presentation on a paper [to appear in JASA] jointly written with Pierre Jacob, Jie Ding, and Vahid Tarokh on the Hyvärinen score and its use for Bayesian model choice, with a highly intuitive representation of this divergence function (which I first met in Padua when Phil Dawid gave a talk on this approach to Bayesian model comparison). The score derives from a divergence function built on the squared error difference between the gradients of the true log-score and of the model log-score functions. Providing an alternative to the Bayes factor that can be shown to be consistent, even for some non-iid data, with some gains in the experiments represented by the above graph.
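For reference, the divergence and the score mentioned above can be written as follows, in one common normalisation that may differ from the paper's by constant factors, with p⋆ denoting the true density and p the model density (my notation):

```latex
% Divergence between the true density p^\star and the model density p
D(p^\star, p) = \int \left\| \nabla_y \log p^\star(y) - \nabla_y \log p(y) \right\|^2 p^\star(y)\,\mathrm{d}y

% Hyv\"arinen score of the model density p at an observation y
\mathcal{H}(y, p) = 2\,\Delta_y \log p(y) + \left\| \nabla_y \log p(y) \right\|^2

% Under regularity conditions allowing integration by parts,
\mathbb{E}_{p^\star}\!\left[ \mathcal{H}(Y, p) \right]
  = D(p^\star, p) - \int \left\| \nabla_y \log p^\star(y) \right\|^2 p^\star(y)\,\mathrm{d}y
```

The last term does not depend on the model p, so ranking models by their expected score is the same as ranking them by divergence. Since only derivatives of log p appear, the score is also unaffected by the normalising constant of p.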

Arnak Dalalyan (CREST) presented a paper written with Lionel Riou-Durand on the convergence of non-Metropolised Langevin Monte Carlo methods, with a new discretization which leads to a substantial improvement of the upper bound on the sampling error rate measured in Wasserstein distance. Moving from p/ε to √p/√ε in the requested number of steps when p is the dimension and ε the target precision, for smooth and strongly log-concave targets.
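As a point of reference, here is a minimal sketch of a non-Metropolised Langevin sampler using the standard Euler discretisation, not the new discretization of the paper. The target is a standard Gaussian in dimension p = 2, hence smooth and strongly log-concave; the function name `ula`, the step size and the chain length are arbitrary choices for the illustration.

```python
import math
import random


def ula(grad_u, x0, step, n_steps, rng):
    """Unadjusted Langevin algorithm for a target proportional to exp(-U(x))."""
    x = list(x0)
    chain = []
    scale = math.sqrt(2.0 * step)
    for _ in range(n_steps):
        g = grad_u(x)
        # Euler step: gradient drift plus Gaussian noise, with no accept/reject step.
        x = [xi - step * gi + scale * rng.gauss(0.0, 1.0) for xi, gi in zip(x, g)]
        chain.append(x)
    return chain


# Standard Gaussian target: U(x) = ||x||^2 / 2, so grad U(x) = x.
rng = random.Random(0)
h = 0.1
chain = ula(lambda x: x, [3.0, -3.0], step=h, n_steps=20000, rng=rng)
kept = chain[1000:]  # drop a burn-in period
mean0 = sum(x[0] for x in kept) / len(kept)
var0 = sum((x[0] - mean0) ** 2 for x in kept) / len(kept)
# Without a Metropolis correction the chain is biased: for this target its
# stationary variance is 1 / (1 - h / 2) rather than 1.
print(mean0, var0)
```

The absence of a Metropolis step leaves a discretisation bias that shrinks with the step size, which is why the number of steps needed to reach a precision ε in Wasserstein distance is the quantity of interest.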

This post gives me the opportunity to advertise for the NGO Sala Baï hospitality school, which the whole conference visited for lunch and which trains youths from underprivileged backgrounds towards jobs in hospitality, supported by donations, by companies (like Krama Krama), or by visits to the Sala Baï restaurant and/or hotel while in Siem Reap.