## Archive for course

## I thought I did make a mistake but I was wrong…

Posted in Books, Kids, Statistics with tags Charles M. Schulz, confluent hypergeometric function, course, ENSAE, exercises, Gradsteyn, inverse normal distribution, MCMC, mixtures, Monte Carlo Statistical Methods, Peanuts, Ryzhik, typos on November 14, 2018 by xi'an

**O**ne of my students in my MCMC course at ENSAE seems to specialise in spotting typos in the Monte Carlo Statistical Methods book, as he found an issue in every problem he solved! He even went back to a 1991 paper of mine on Inverse Normal distributions, inspired by a discussion with an astronomer, Caroline Soubiran, and my two colleagues, Gilles Celeux and Jean Diebolt. The above derivation from the massive Gradshteyn and Ryzhik (which I discovered thanks to Mary Ellen Bock when arriving at Purdue) is indeed incorrect as the final term should be the square root of 2β rather than 8β. However, this typo does not impact the normalising constant of the density, K(α,μ,τ), unless I am further confused.

## annual visit to Oxford

Posted in Kids, pictures, Statistics, Travel, University life with tags Avalon, Bayesian statistics, course, Oxford, OxWaSP, Roxy Music, segregation, St. Hugh's College, University of Oxford, University of Warwick on February 1, 2018 by xi'an

**A**s in every year since 2014, I am spending a few days in Oxford to teach a module on Bayesian Statistics to our Oxford-Warwick PhD students. This time I was a wee bit under the weather due to a mild case of food poisoning and I can only hope that my more than sedate delivery did not turn the students away from Bayesian pursuits for good!

The above picture is at St. Hugh’s College, where I was staying. Or should it be Saint Hughes, since this 12th century bishop was a pre-Brexit European worker from Avalon, France… (This college was created in 1886 for young women of poorer background. And only opened to male students a century later. The 1924 rules posted in one corridor show how these women were considered to be so “dangerous” by the institution that they had to be kept segregated from men, except their brothers!, at all times…)

## Why is it necessary to sample from the posterior distribution if we already KNOW the posterior distribution?

Posted in Statistics with tags accept-reject algorithm, course, cross validated, ENSAE, MCMC, Monte Carlo Statistical Methods, posterior distribution, probability, simulation on October 27, 2017 by xi'an

**I** found this question on X validated somewhat hilarious, all the more because of the shouted KNOW! And the confused impression that because one can write down π(θ|x) up to a constant, one KNOWS this distribution… It is actually one of the paradoxes of simulation that, from a mathematical perspective, once π(θ|x) is available as a function of (θ,x), all other quantities related to this distribution are mathematically perfectly and uniquely defined. From a numerical perspective, this does not help. Actually, when starting my MCMC course at ENSAE a few days later, I had the same question from a student who thought that facing a density function like

f(x) ∝ exp{-||x||²-||x||⁴-||x||⁶}

was enough to immediately produce simulations from this distribution. (I also used this example to show the degeneracy of accept-reject as the dimension d of x increases, using for instance a Gamma proposal on y=||x||. The acceptance probability plunges to zero with d, with 9 acceptances out of 10⁷ for d=20.)
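For readers who want to see this degeneracy at work, here is a minimal sketch of an accept-reject sampler on the radius y=||x||, under assumptions of my own choosing rather than the exact proposal used in class: the Gamma proposal takes shape d (so that the y^(d-1) Jacobian term cancels in the ratio) and a free `rate` parameter, the bound is located on a grid over [0,5], and the direction is drawn uniformly on the sphere. The names `log_ratio`, `sample_radius` and `sample_x` are hypothetical, and the acceptance counts will not reproduce the 9 out of 10⁷ quoted above.

```python
import numpy as np


def log_ratio(r, rate):
    # Log of (unnormalised target) / (unnormalised Gamma(d, rate) proposal) on r = ||x||.
    # With Gamma shape equal to d, the r^(d-1) factors cancel and the ratio is free of d.
    return rate * r - r**2 - r**4 - r**6


def sample_radius(n, d, rate, rng):
    # The log-ratio is strictly concave in r, so its maximum is unique; we locate it
    # on a fine grid (assumed to contain the maximiser) and add a small safety margin.
    grid = np.linspace(0.0, 5.0, 500_001)
    log_bound = log_ratio(grid, rate).max() + 1e-8
    # Propose n radii from the Gamma(shape=d, rate) distribution.
    y = rng.gamma(shape=d, scale=1.0 / rate, size=n)
    # Uniforms on (0, 1], to avoid log(0).
    log_u = np.log(1.0 - rng.random(n))
    return y[log_u <= log_ratio(y, rate) - log_bound]


def sample_x(n, d, rate, seed=0):
    """Return accepted draws of x in R^d and the empirical acceptance rate."""
    rng = np.random.default_rng(seed)
    r = sample_radius(n, d, rate, rng)
    # Since the density depends on x only through ||x||, the direction is uniform
    # on the unit sphere and independent of the radius.
    z = rng.standard_normal((r.size, d))
    u = z / np.linalg.norm(z, axis=1, keepdims=True)
    return r[:, None] * u, r.size / n


if __name__ == "__main__":
    for d in (1, 2, 5, 10, 20):
        _, acc = sample_x(100_000, d, rate=1.0)
        print(d, acc)
```

With a fixed `rate` the acceptance rate collapses as d grows, since the Gamma(d, rate) proposal drifts away from the region where the target concentrates. Tuning `rate` with d delays the collapse but the sketch makes no claim about how well.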

## miXed distributions

Posted in Books, Kids, Statistics, University life with tags Brittany, course, cross validated, density, Dirac mass, Lebesgue measure, mixed distribution, moustache, Nantes, Rennes on November 3, 2015 by xi'an

**A** couple of questions on X validated showed the difficulty students have with mixed measures and their density. Actually, my students always react with incredulity to the likelihood of a censored normal sample or to the derivation of a Bayes factor associated with the null (and atomic) hypothesis μ=0…

I attribute this difficulty to a poor understanding of the notion of density and hence to a deficiency in the training in measure theory, since the density f of the distribution F is always relative to a reference measure dμ, i.e.

f(x) = dF/dμ(x)

(Hence Lebesgue’s moustache on the attached poster!) To handle atoms in the distribution requires introducing a dominating measure dμ with atomic components, i.e., usually a sum of the Lebesgue measure and of the counting measure on the appropriate set. Which is not entirely obvious: while the first question had {0,1} as atoms, the second question introduced atoms on {-θ,θ} and required a change of variable to consider a counting measure on {-1,1}. I found this second question actually of genuine interest and a great toy example for class and exams.
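As an illustration of a density against such a mixed dominating measure, here is a small sketch for the censored normal case mentioned earlier. It assumes right-censoring at a known point c, i.e., X=min(Y,c) with Y~N(μ,σ²), and takes the sum of the Lebesgue measure and of the counting measure on {c} as dominating measure. The function names are hypothetical, and this is one possible setting, not the one of either X validated question.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def normal_cdf(x, mu=0.0, sigma=1.0):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def censored_density(x, c, mu=0.0, sigma=1.0):
    """Density of X = min(Y, c), Y ~ N(mu, sigma^2), with respect to
    Lebesgue measure plus counting measure on {c}."""
    if x < c:
        # Continuous part: the usual normal density against Lebesgue measure.
        return normal_pdf(x, mu, sigma)
    if x == c:
        # Atomic part: the probability mass P(Y >= c) against the counting measure.
        return 1.0 - normal_cdf(c, mu, sigma)
    return 0.0


def censored_log_likelihood(sample, c, mu, sigma):
    # Assumes every observation is <= c, as implied by the censoring mechanism.
    return sum(math.log(censored_density(x, c, mu, sigma)) for x in sample)
```

Each censored observation thus contributes a probability, not a density value, to the likelihood, which is precisely what tends to be met with incredulity.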

## methods for quantifying conflict casualties in Syria

Posted in Books, Statistics, University life with tags Carnegie Mellon University, CEREMADE, course, data science, MASH, privacy, PSL, Rebecca Steorts, seminar, Syria, Université Paris Dauphine on November 3, 2014 by xi'an

On Monday November 17, 11am, Amphi 10, Université Paris-Dauphine, Rebecca Steorts from CMU will give a talk at the GT Statistique et imagerie seminar:

Information about social entities is often spread across multiple large databases, each degraded by noise, and without unique identifiers shared across databases. Entity resolution, reconstructing the actual entities and their attributes, is essential to using big data and is challenging not only for inference but also for computation.

In this talk, I motivate entity resolution by the current conflict in Syria. It has been tremendously well documented, however, we still do not know how many people have been killed from conflict-related violence. We describe a novel approach towards estimating death counts in Syria and challenges that are unique to this database. We first introduce computational speed-ups to avoid all-to-all record comparisons based upon locality-sensitive hashing from the computer science literature. We then introduce a novel approach to entity resolution by discovering a bipartite graph, which links manifest records to a common set of latent entities. Our model quantifies the uncertainty in the inference and propagates this uncertainty into subsequent analyses. Finally, we speak to the success and challenges of solving a problem that is at the forefront of national headlines and news.

This is joint work with Rob Hall (Etsy), Steve Fienberg (CMU), and Anshu Shrivastava (Cornell University).

[Note that Rebecca will visit the maths department in Paris-Dauphine for two weeks and give a short course in our data science Master on data confidentiality, privacy and statistical disclosure (syllabus).]