Archive for cryptography

on anonymisation

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , , , , on August 2, 2019 by xi'an

An article in the New York Times covering a recent publication in Nature Communications on the ability to identify 99.98% of Americans from almost any dataset with fifteen covariates. And mentioning the French approach of INSEE, more precisely CASD (a branch of GENES, as ENSAE and CREST to which I am affiliated), where my friend Antoine worked for a few years, and whose approach is to vet researchers who want access to non-anonymised data, by creating local working environments on the CASD machines  so that data does not leave the site. The approach is to provide the researcher with a dedicated interface, which “enables access remotely to a secure infrastructure where confidential data is safe from harm”. It further delivers reproducibility certificates for publications, a point apparently missed by the New York Times which advances the lack of reproducibility as a drawback of the method. It also mentions the possibility of doing cryptographic data analysis, again missing the finer details with a lame objection.

“Our paper shows how the likelihood of a specific individual to have been correctly re-identified can be estimated with high accuracy even when the anonymized dataset is heavily incomplete.”

The Nature paper is actually about the probability for an individual to be uniquely identified from the given dataset, which somewhat different from the NYT headlines. Using a copula for the distribution of the covariates. And assessing the model with a mean square error evaluation when what matters are false positives and false negatives. Note that the model need be trained for each new dataset, which reduces the appeal of the claim, especially when considering that individuals tagged as uniquely identified about 6% are not. The statistic of 99.98% posted in the NYT is actually a count on a specific dataset,  the 5% Public Use Microdata Sample files, and Massachusetts residents, and not a general statistic [which would not make much sense!, as I can easily imagine 15 useless covariates] or prediction from the authors’ model. And a wee bit anticlimactic.

the beauty of maths in computer science [book review]

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , , , , , , , , , on January 17, 2019 by xi'an

CRC Press sent me this book for review in CHANCE: Written by Jun Wu, “staff research scientist in Google who invented Google’s Chinese, Japanese, and Korean Web search algorithms”, and translated from the Chinese, 数学之美, originating from Google blog entries. (Meaning most references are pre-2010.) A large part of the book is about word processing and web navigation, which is the author’s research specialty. And not so much about mathematics. (When rereading the first chapters to start this review I then realised why the part about language processing in AIQ sounded familiar: I had read it in the Beauty of Mathematics in Computer Science.)

In the first chapter, about the history of languages, I found out, among other things, that ancient Jewish copists of the Bible had an error correcting algorithm consisting in giving each character a numerical equivalent, summing up each row, then all rows, and  checking the sum at the end of the page was the original one. The second chapter explains why the early attempts at language computer processing, based on grammar rules, were unsuccessful and how a statistical approach had broken the blockade. Explained via Markov chains in the following chapter. Along with the Good-Turing [Bayesian] estimate of the transition probabilities. Next comes a short and low-tech chapter on word segmentation. And then an introduction to hidden Markov models. Mentioning the Baum-Welch algorithm as a special case of EM, which makes a return by Chapter 26. Plus a chapter on entropies and Kullback-Leibler divergence.

A first intermede is provided by a chapter dedicated to the late Frederick Jelinek, the author’s mentor (including what I find a rather unfortunate equivalent drawn between the Nazi and Communist eras in Czechoslovakia, p.64). Chapter that sounds a wee bit too much like an extended obituary.

The next section of chapters is about search engines, with a few pages on Boolean logic, dynamic programming, graph theory, Google’s PageRank and TF-IDF (term frequency/inverse document frequency). Unsurprisingly, given that the entries were originally written for Google’s blog, Google’s tools and concepts keep popping throughout the entire book.

Another intermede about Amit Singhal, the designer of Google’s internal search ranking system, Ascorer. With another unfortunate equivalent with the AK-47 Kalashnikov rifle as “elegantly simple”, “effective, reliable, uncomplicated, and easy to implement or operate” (p.105). Even though I do get the (reason for the) analogy, using an equivalent tool which purpose is not to kill other people would have been just decent…

Then chapters on measuring proximity between news articles by (vectors in a 64,000 dimension vocabulary space and) their angle, and singular value decomposition, and turning URLs as long integers into 16 bytes random numbers by the Mersenne Twister (why random, except for encryption?), missing both the square in von Neumann’s first PRNG (p.124) and the opportunity to link the probability of overlap with the birthday problem (p.129). Followed by another chapter on cryptography, always a favourite in maths vulgarisation books (but with no mention made of the originators of public key cryptography, like James Hellis or the RSA trio, or of the impact of quantum computers on the reliability of these methods). And by an a-mathematic chapter on spam detection.

Another sequence of chapters cover maximum entropy models (in a rather incomprehensible way, I think, see p.159), continued with an interesting argument how Shannon’s first theorem predicts that it should be faster to type Chinese characters than Roman characters. Followed by the Bloom filter, which operates as an approximate Poisson variate. Then Bayesian networks where the “probability of any node is computed by Bayes’ formula” [not really]. With a slightly more advanced discussion on providing the highest posterior probability network. And conditional random fields, where the conditioning is not clearly discussed (p.192). Next are chapters about Viterbi’s algorithm (and successful career) and the EM algorithm, nicknamed “God’s algorithm” in the book (Chapter 26) although I never heard of this nickname previously.

The final two chapters are on neural networks and Big Data, clearly written later than the rest of the book, with the predictable illustration of AlphaGo (but without technical details). The twenty page chapter on Big Data does not contain a larger amount of mathematics, with no equation apart from Chebyshev’s inequality, and a frequency estimate for a conditional probability. But I learned about 23&me running genetic tests at a loss to build a huge (if biased) genetic database. (The bias in “Big Data” issues is actually not covered by this chapter.)

“One of my main objectives for writing the book is to introduce some mathematical knowledge related to the IT industry to people who do not work in the industry.”

To conclude, I found the book a fairly interesting insight on the vision of his field and job experience by a senior scientist at Google, with loads of anecdotes and some historical backgrounds, but very Google-centric and what I felt like an excessive amount of name dropping and of I did, I solved, I &tc. The title is rather misleading in my opinion as the amount of maths is very limited and rarely sufficient to connect with the subject at hand. Although this is quite a relative concept, I did not spot beauty therein but rather technical advances and trick, allowing the author and Google to beat the competition.

Nature Outlook on AI

Posted in Statistics with tags , , , , , , , , , , , , , , , on January 13, 2019 by xi'an

The 29 November 2018 issue of Nature had a series of papers on AIs (in its Outlook section). At the general public (awareness) level than in-depth machine-learning article. Including one on the forecasted consequences of ever-growing automation on jobs, quoting from a 2013 paper by Carl Frey and Michael Osborne [of probabilistic numerics fame!] that up to 47% of US jobs could become automated. The paper is inconclusive on how taxations could help in or deter from transfering jobs to other branches, although mentioning the cascading effect of taxing labour and subsidizing capital. Another article covers the progresses in digital government, with Estonia as a role model, including the risks of hacking (but not mentioning Russia’s state driven attacks). Differential privacy is discussed as a way to keep data “secure” (but not cryptography à la Louis Aslett!). With another surprising entry that COBOL is still in use in some administrative systems. Followed by a paper on the apparently limited impact of digital technologies on mental health, despite the advertising efforts of big tech companies being described as a “race to the bottom of the brain stem”! And another one on (overblown) public expectations on AIs, although the New York Time had an entry yesterday on people in Arizona attacking self-driving cars with stones and pipes… Plus a paper on the growing difficulties of saving online documents and culture for the future (although saving all tweets ever published does not sound like a major priority to me!).

Interesting (?) aside, the same issue contains a general public article on the use of AIs for peer reviews (of submitted papers). The claim being that “peer review by artificial intelligence (AI) is promising to improve the process, boost the quality of published papers — and save reviewers time.” A wee bit over-optimistic, I would say, as the developed AI’s are at best “that statistics and methods in manuscripts are sound”. For instance, producing “key concepts to summarize what the paper is about” is not particularly useful. A degree of innovation compared with the existing would be. Or an automated way to adapt the paper style to the strict and somewhat elusive Biometrika style!

Big Bayes goes South

Posted in Books, Mountains, pictures, Running, Statistics, Travel, University life with tags , , , , , , , , , , , , , , , , , , , , , , on December 5, 2018 by xi'an

At the Big [Data] Bayes conference this week [which I found quite exciting despite a few last minute cancellations by speakers] there were a lot of clustering talks including the ones by Amy Herring (Duke), using a notion of centering that should soon appear on arXiv. By Peter Müller (UT, Austin) towards handling large datasets. Based on a predictive recursion that takes one value at a time, unsurprisingly similar to the update of Dirichlet process mixtures. (Inspired by a 1998 paper by Michael Newton and co-authors.) The recursion doubles in size at each observation, requiring culling of negligible components. Order matters? Links with Malsiner-Walli et al. (2017) mixtures of mixtures. Also talks by Antonio Lijoi and Igor Pruenster (Boconni Milano) on completely random measures that are used in creating clusters. And by Sylvia Frühwirth-Schnatter (WU Wien) on creating clusters for the Austrian labor market of the impact of company closure. And by Gregor Kastner (WU Wien) on multivariate factor stochastic models, with a video of a large covariance matrix evolving over time and catching economic crises. And by David Dunson (Duke) on distance clustering. Reflecting like myself on the definitely ill-defined nature of the [clustering] object. As the sample size increases, spurious clusters appear. (Which reminded me of a disagreement I had had with David McKay at an ICMS conference on mixtures twenty years ago.) Making me realise I missed the recent JASA paper by Miller and Dunson on that perspective.

Some further snapshots (with short comments visible by hovering on the picture) of a very high quality meeting [says one of the organisers!]. Following suggestions from several participants, it would be great to hold another meeting at CIRM in a near future. Continue reading

bitcoin and cryptography for statistical inference and AI

Posted in Books, Mountains, pictures, Running, Statistics, Travel, University life with tags , , , , , , , , , , , , on April 16, 2018 by xi'an

A recent news editorial in Nature (15 March issue) reminded me of the lectures Louis Aslett gave at the Gregynog Statistical Conference last week, on the advanced use of cryptography tools to analyse sensitive and private data. Lectures that reminded me of a graduate course I took on cryptography and coding, in Paris 6, and which led me to visit a lab at the Université de Limoges during my conscripted year in the French Navy. With no research outcome. Now, the notion of using encrypted data towards statistical analysis is fascinating in that it may allow for efficient inference and personal data protection at the same time. As opposed to earlier solutions of anonymisation that introduced noise and data degradation, not always providing sufficient protection of privacy. Encryption that is also the notion at the basis of the Nature editorial. An issue completely missing from the paper, while stressed by Louis, is that this encryption (like Bitcoin) is costly, in order to deter hacking, and hence energy inefficient. Or limiting the amount of data that can be used in such studies, which would turn the idea into a stillborn notion.