## Archive for the Statistics Category

## xi’an’s number [repost]

Posted in Statistics with tags blogging, github, Gothenburg, Sweden, Xi'an on August 18, 2019 by xi'an

## deadlines for BayesComp’2020

Posted in pictures, Statistics, Travel, University life with tags AutoStat, BayesComp 2020, Bayesian computing, conference, Florida, Gainesville, ISBA, Nimble, SAS, STAN, tutorial, University of Florida on August 17, 2019 by xi'an

While I forgot to send a reminder that August 15 was the first deadline of BayesComp 2020, for early registration, here are the remaining deadlines and dates:

- BayesComp 2020 occurs on January 7-10 2020 in Gainesville, Florida, USA
- Registration is open with regular rates till October 14, 2019
- Deadline for submission of poster proposals is December 15, 2019
- Deadline for travel support applications is September 20, 2019
- Four free tutorials take place on January 7, 2020, on Stan, NIMBLE, SAS, and AutoStat

## delayed-acceptance. ADA boosted

Posted in Statistics with tags ABC, delayed acceptance, Gaussian processes, MCMC, particle MCMC, Scandinavia on August 11, 2019 by xi'an

Samuel Wiqvist and co-authors from Scandinavia have recently arXived a paper on a new version of delayed-acceptance MCMC. The ADA in the novel algorithm stands for approximate and accelerated, the approximation in the first stage being the replacement of the likelihood with a Gaussian process surrogate. (In our own approach, we used subsets for partial likelihoods, ordering them so that the most varying sub-likelihoods were evaluated first.) Furthermore, if a proposed parameter reaches the second stage, the exact likelihood is not necessarily evaluated there either, the decision being based on the overall probability that the second stage accepts or rejects. Which of course creates a further approximation, even when using a local predictor of that probability. The outcome of a comparison on two complex models is that the delayed approach does not necessarily do better than particle MCMC in terms of effective sample size per second, since it rejects significantly more. Using other types of surrogate likelihoods, and assessing the approximation effect, could boost the appeal of the method. Maybe using ABC first could suggest another surrogate?
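
The generic two-stage delayed-acceptance kernel that the paper builds upon can be sketched in a few lines (a minimal sketch with a Gaussian toy target and a deliberately mis-scaled Gaussian standing in for the authors' Gaussian-process surrogate; the function and variable names are my own, not theirs):

```python
import numpy as np

rng = np.random.default_rng(1)

def delayed_acceptance_mh(x0, n_iter, log_surrogate, log_target, step=1.0):
    """Two-stage (delayed-acceptance) random-walk Metropolis sketch:
    stage 1 screens proposals with a cheap surrogate, and only survivors
    pay for the expensive target evaluation in stage 2."""
    x = x0
    chain = np.empty(n_iter)
    for t in range(n_iter):
        y = x + step * rng.normal()
        # Stage 1: accept/reject on the cheap surrogate ratio
        if np.log(rng.uniform()) < min(0.0, log_surrogate(y) - log_surrogate(x)):
            # Stage 2: correct with the target/surrogate ratio, so that
            # the combined kernel still targets the exact distribution
            log_a2 = (log_target(y) - log_target(x)) \
                     - (log_surrogate(y) - log_surrogate(x))
            if np.log(rng.uniform()) < min(0.0, log_a2):
                x = y
        chain[t] = x
    return chain

# Toy run: exact N(0,1) target, a mis-scaled Gaussian as surrogate
chain = delayed_acceptance_mh(
    0.0, 20_000,
    log_surrogate=lambda v: -0.5 * v**2 / 1.2,
    log_target=lambda v: -0.5 * v**2,
)
```

The product of the two acceptance probabilities satisfies detailed balance for the exact target, which is why the surrogate only affects efficiency, not correctness, in this plain two-stage version.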

## a problem that did not need ABC in the end

Posted in Books, pictures, Statistics, Travel with tags ABC, Approximate Bayesian computation, Colorado, cross validated, dawn, Denver, high rise, introductory opening lecture, jatp, JSM 2019, law of the hammer, multinomial distribution, predictive on August 8, 2019 by xi'an

While in Denver, at JSM, I came across [across validated!] the seemingly challenging problem of finding the posterior of the 10³-long probability vector of a Multinomial M(10⁶,p) when only observing the extremes of a realisation of M(10⁶,p). This sounded challenging because the distribution of the pair (min,max) is not available in closed form. (Although it allowed me to find a paper on the topic by the late Shanti Gupta, who was department chair at Purdue University when I visited 32 years ago…) It seemed to call for ABC (especially since I was about to give an introductory lecture on the topic!, law of the hammer…), but simulating datasets compatible with the observed extremes, m=80 and M=12000, proved difficult under a uniform Dirichlet prior on the probability vector, since these extremes call for both small and large values of the probabilities. However, I later realised that the problem could be reduced to a Multinomial with only three categories and the observation (m, M, n−m−M), leading to an obvious Dirichlet posterior and a predictive for the remaining 10³−2 counts.
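
The collapsing argument can be checked in a few lines (a sketch under the uniform Dirichlet prior, relying on the aggregation property of the Dirichlet distribution; variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

K, n = 10**3, 10**6     # number of cells and multinomial size
m, M = 80, 12000        # observed minimum and maximum counts

# Collapsing the K cells into (min cell, max cell, everything else) turns
# the problem into a 3-category Multinomial with observation (m, M, n-m-M).
# A uniform Dirichlet(1,...,1) prior on the K-vector aggregates into a
# Dirichlet(1, 1, K-2) prior, hence a conjugate Dirichlet posterior:
post = rng.dirichlet([1 + m, 1 + M, (K - 2) + (n - m - M)], size=5000)

# Predictive for the K-2 remaining cells: split the aggregated mass
# uniformly, i.e. by a Dirichlet(1,...,1) draw scaled by the third component
rest = post[:, 2:3] * rng.dirichlet(np.ones(K - 2), size=5000)
```

No ABC needed: the posterior on the two extreme probabilities and the predictive on the other 998 cells come straight from conjugacy.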

## unbiased product of expectations

Posted in Books, Statistics, University life with tags ABC, g-and-k distributions, importance sampling, latent variable models, NP-complete problem, permanent of a matrix, pseudo-marginal MCMC, recycling, University of Warwick on August 5, 2019 by xi'an

While I was not involved in any way, or even aware of this research, Anthony Lee, Simone Tiberi, and Giacomo Zanella have a forthcoming paper in Biometrika, partly written while all three authors were at the University of Warwick. The purpose is to design an efficient way to approximate the product of n unidimensional expectations (or integrals), all computed against the same reference density. Which is not a real constraint. A neat remark motivating the method in the paper is that an improved, unbiased estimator can be connected with the permanent of the n×N matrix A made of the values of the n functions computed at N different simulations from the reference density, involving N!/(N−n)! terms rather than Nⁿ. Since the permanent is NP-hard to compute, a manageable alternative uses random draws from constrained permutations, which are reasonably easy to simulate. Especially since, given that the estimator recycles most of the particles, it requires a much smaller value of N, essentially N=O(n) in this scenario instead of O(n²) for the basic Monte Carlo solution, for a similar variance.
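
A toy sketch may help contrast the naive product of empirical means (biased when every factor recycles the same draws) with the permanent-based unbiased estimator; I enumerate all injections exhaustively here, whereas the paper samples constrained permutations instead, and the code is my own illustration rather than the authors' implementation:

```python
import itertools
import math
import numpy as np

rng = np.random.default_rng(2)

def product_of_means(fs, x):
    """Naive estimator: product of the n empirical means computed on the
    same draws x, which is biased because every factor recycles the same
    samples."""
    return float(np.prod([np.mean(f(x)) for f in fs]))

def permanent_estimator(fs, x):
    """Unbiased estimator: average of prod_i f_i(x[s(i)]) over all
    injections s from {1..n} into {1..N}, i.e. the permanent of the
    n x N matrix A = (f_i(x_j)) divided by N!/(N-n)!.  Each term uses
    distinct draws, so its expectation is the product of expectations."""
    n, N = len(fs), len(x)
    A = np.array([f(x) for f in fs])            # n x N matrix of f_i(x_j)
    total = sum(math.prod(A[i, s[i]] for i in range(n))
                for s in itertools.permutations(range(N), n))
    return total / (math.factorial(N) // math.factorial(N - n))

# Toy check: with a single function the estimator reduces to the plain
# empirical mean, since the average runs over the N singleton injections
x = rng.normal(size=6)
est_one = permanent_estimator([lambda v: v**2], x)
mean_one = float(np.mean(x**2))
```

The exhaustive enumeration obviously explodes with N!/(N−n)! terms, which is precisely why sampling a manageable number of constrained permutations is the practical route.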

This framework offers many applications in latent variable models, including pseudo-marginal MCMC of course, but also in ABC, since an ABC posterior based on getting each simulated observation close enough to the corresponding actual observation fits this pattern (although the dependence on the chosen ordering of the data is an issue that can make the example somewhat artificial).

## R wins COPSS Award!

Posted in Statistics with tags Bayesian time series analysis, COPSS Award, Denver, IMS Medallion, JSM 2019, Likelihood Principle, open source, R, RStudio on August 4, 2019 by xi'an

Hadley Wickham from RStudio has won the 2019 COPSS Award, which marks a rather radical switch from the traditional profile of recipients of this award, in that it recognises his many contributions to the R language and in particular to RStudio. The citation for the nomination is his “influential work in statistical computing, visualisation, graphics, and data analysis”, including “making statistical thinking and computing accessible to a large audience”. With the last part possibly a recognition of the appeal of open source… (I was not in Denver for the award ceremony, having left after the ABC session on Monday morning. Unfortunately, this session attracted only a few souls, due to the competition from twenty-some other sessions, including, *excusez du peu!*, David Dunson’s Medallion Lecture, Michael Lavine’s IOL on the likelihood principle, and Marco Ferreira’s short course on Bayesian time series. This is the way the joint meeting goes, but it is disappointing to reach so few people.)

## on anonymisation

Posted in Books, pictures, Statistics, University life with tags agence CAB, CASD, CREST, cryptography, encryption, ENSAE, genes, INSEE, Massachusetts, Nature, NYT, reproducibility on August 2, 2019 by xi'an

An article in the New York Times covers a recent publication in Nature Communications on the ability to identify 99.98% of Americans from almost any dataset with fifteen covariates. It mentions the French approach of INSEE, more precisely of CASD (a branch of GENES, as are ENSAE and CREST, with which I am affiliated), where my friend Antoine worked for a few years, and whose solution is to vet researchers who want access to non-anonymised data, creating local working environments on the CASD machines so that the data never leaves the site. The researcher is provided with a dedicated interface, which “enables access remotely to a secure infrastructure where confidential data is safe from harm”. CASD further delivers reproducibility certificates for publications, a point apparently missed by the New York Times, which advances the lack of reproducibility as a drawback of the method. The article also mentions the possibility of cryptographic data analysis, again missing the finer details with a lame objection.

*“Our paper shows how the likelihood of a specific individual to have been correctly re-identified can be estimated with high accuracy even when the anonymized dataset is heavily incomplete.”*

The Nature paper is actually about the probability for an individual to be uniquely identified from the given dataset, which is somewhat different from the NYT headline. Using a copula for the distribution of the covariates. And assessing the model by a mean squared error evaluation, when what matters are false positives and false negatives. Note that the model needs to be trained anew for each dataset, which reduces the appeal of the claim, especially when considering that about 6% of the individuals tagged as uniquely identified are not. The statistic of 99.98% posted in the NYT is actually a count on a specific dataset, the 5% Public Use Microdata Sample files restricted to Massachusetts residents, and not a general statistic [which would not make much sense!, as I can easily imagine 15 useless covariates] or a prediction from the authors’ model. And a wee bit anticlimactic.
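
The notion of uniqueness within a dataset is easy to illustrate empirically (a toy simulation with uniform categorical covariates, nothing like the paper's copula model; all numbers below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Empirical uniqueness: the fraction of records whose combination of
# covariate values occurs exactly once in the dataset
n, k = 100_000, 15
data = rng.integers(0, 4, size=(n, k))      # 15 four-level covariates

_, inv, counts = np.unique(data, axis=0, return_inverse=True,
                           return_counts=True)
unique_fraction = float((counts[inv.ravel()] == 1).mean())

# With only 5 of the 15 covariates, almost nobody is unique any more:
# 4^5 = 1024 cells cannot separate 100,000 records
_, inv5, counts5 = np.unique(data[:, :5], axis=0, return_inverse=True,
                             return_counts=True)
unique_fraction5 = float((counts5[inv5.ravel()] == 1).mean())
```

With 15 covariates the number of cells (4¹⁵ ≈ 10⁹) dwarfs the sample size, so nearly everyone is unique; dropping to 5 covariates collapses uniqueness to essentially zero, which is the basic mechanism behind the headline figure.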