By coincidence [or not], very similar papers appeared in Le Monde and The Guardian within days of one another. I already reported on the Doomsday tone of The Guardian tribune. The point of the other paper is essentially the same, namely that the public has lost trust in quantitative arguments: from the explosion of statistical entries in political debates, to the general defiance against experts, media, government, and parties, including the Institute of Official Statistics (INSEE), to a feeling of disconnection between statistical entities and the daily problems of the average citizen, to the lack of guidance and warnings in the publication of such statistics, to the rejection of anything technocratic… With the missing addendum that politicians and governments too readily credit good figures to their own policies and poor ones to their opponents’. (Just no blame for big data analytics in this case.)
Last week I spotted this tribune in The Guardian, with the witty title of statistics losing its power, and sort of over-reacted by trying to gather enough momentum from colleagues towards writing a counter-column. After a few days of decantation and a few more readings of the tribune, I cooled down towards a more lenient perspective, even though I still dislike the [catastrophic and journalistic] title. The paper is actually mostly right (!), from its historical recap of the evolution of (official) statistics across centuries, to the different nature of “big data” statistics. (The author is “William Davies, a sociologist and political economist. His books include The Limits of Neoliberalism and The Happiness Industry.”)
“Despite these criticisms, the aspiration to depict a society in its entirety, and to do so in an objective fashion, has meant that various progressive ideals have been attached to statistics.”
A central point is that public opinion has less confidence in (official) statistics than it used to. (Warning: major understatement here!) The reasons are many: numbers being used to support any argument and its opposite; statistics (and statisticians) being associated with experts, found at every corner of news and media, and hence with the “elite” arch-enemy; a growing innumeracy of both the general public and the said “elites”—like this “expert” in a debate about the 15th anniversary of the Euro currency on the French NPR last week, equating a rise from 2.4 francs to 6.5 francs with a 700% increase, when it is closer to 170%—favouring rhetoric over facts; and a disintegration of the social structure that elevates one’s community over others and dismisses arguments from those others, especially arguments addressed at society as a whole. The current debate—and the very fact there can even be a debate about it!—about post-truths and alternative facts is a sad illustration of this regression in the public discourse. The overall perspective in the tribune is that of a sociologist on statistics, but there is nothing to strongly object to.
“These data analysts are often physicists or mathematicians, whose skills are not developed for the study of society at all.”
The second part of the paper is about the perceived shift from (official) statistics to another and much more dangerous type of data analysis. Which is not a new view on the field, as shown by Weapons of Math Destruction. I tend to disagree with the perception that data handled by private companies for private purposes is inherently evil. The reticence to trust conclusions drawn from such datasets also extends to publicly available datasets, and it is not primarily linked to the lack of reproducibility of such analyses (which would be a perfectly rational argument!). Nor is it due to physicists or mathematicians running those analyses instead of quantitative sociologists! The roots of the mistrust are rather to be found in an anti-scientism that has been growing over the past decades, paradoxically within an equally growing technological society fuelled by scientific advances. Hence, calling for a governmental office of big data or some similar institution is very unlikely to solve the issue. I do not know what could, actually, but continuing to develop better statistical methodology cannot hurt!
As I had read many comments and reviews about this book, including one by Arthur Charpentier on Freakonometrics, I eventually decided to buy it from my Amazon Associate savings (!). With a strong a priori bias, I am afraid, gathered from reading some excerpts, comments, and the overall advertising about it. And also because the book reminded me of another quantic swan. Not to mention the title. After reading it, I am afraid I cannot say my assessment has changed much.
“Models are opinions embedded in mathematics.” (p.21)
The core message of this book is that the use of algorithms and AI methods to evaluate and rank people is unsatisfactory and unfair. From predicting recidivism to firing high-school teachers, from rejecting loan applications to enticing the most challenged categories to enrol in for-profit colleges. Which is indeed unsatisfactory and unfair. Just like using the h-index and citation rankings for promotion or hiring. (The book mentions the controversial hiring of many adjunct faculty by KAU to boost its ranking.) But this conclusion is not enough of an argument to write a whole book. Or even to blame mathematics for the unfairness: as far as I can tell, mathematics has nothing to do with unfairness. Some analysts crunch numbers, produce a score, and then managers make poor decisions. The use of mathematics throughout the book is thus completely inappropriate, when the author means statistics, machine learning, data mining, predictive algorithms, neural networks, &tc. (OK, there is a small section on Operations Research on p.127, but I figure deep learning can bypass the maths.)
Michael Jordan, Jason Lee, and Yun Yang just arXived a paper with their proposal for handling large datasets through distributed computing, thus contributing to the currently very active research topic of approximate solutions in large Bayesian models. The core of the proposal is summarised by the screenshot above, where the approximate likelihood replaces the exact likelihood with a first-order Taylor expansion. The first term is the likelihood computed for a given subsample (or a given thread), rescaled by the inverse of the sampling fraction, and the difference of the gradients is computed only once, at a good enough guess. While the paper also considers M-estimators and non-Bayesian settings, the Bayesian part thus consists in running a regular MCMC when the log-target is approximated as above. I first thought this proposal amounted to a Gaussian approximation à la Simon Wood or to an INLA approach, but this is not the case: the first term of the approximate likelihood is exact, and hence can be of any form, while the scalar product is linear in θ, providing a sort of first-order approximation, albeit frozen at the chosen starting value.
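To make the construction concrete, here is a minimal sketch of such a surrogate log-likelihood, written for a logistic regression: the simulated data, the starting guess theta_bar, the N/n rescaling, and the random-walk Metropolis settings are all my own illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, n = 10_000, 3, 500                       # full size, dimension, subsample size
theta_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(N, d))
y = (rng.random(N) < 1 / (1 + np.exp(-X @ theta_true))).astype(float)

def loglik(theta, X, y):
    """Exact logistic log-likelihood on (X, y)."""
    z = X @ theta
    return np.sum(y * z - np.logaddexp(0.0, z))

def grad_loglik(theta, X, y):
    """Gradient of the logistic log-likelihood on (X, y)."""
    p = 1 / (1 + np.exp(-(X @ theta)))
    return X.T @ (y - p)

subset = rng.choice(N, size=n, replace=False)  # one machine / one thread
theta_bar = np.zeros(d)                        # "good enough" starting guess

# difference of gradients, computed only once, at theta_bar
correction = grad_loglik(theta_bar, X, y) - (N / n) * grad_loglik(
    theta_bar, X[subset], y[subset])

def surrogate_loglik(theta):
    """Exact (rescaled) subsample term plus a linear correction frozen
    at theta_bar, so the gradient at theta_bar matches the full data."""
    return (N / n) * loglik(theta, X[subset], y[subset]) + correction @ (theta - theta_bar)

# regular random-walk Metropolis on the surrogate target (flat prior)
T, scale = 2_000, 0.02
chain = np.empty((T, d))
theta, logp = theta_bar.copy(), surrogate_loglik(theta_bar)
accepts = 0
for t in range(T):
    prop = theta + scale * rng.normal(size=d)
    logp_prop = surrogate_loglik(prop)
    if np.log(rng.random()) < logp_prop - logp:
        theta, logp = prop, logp_prop
        accepts += 1
    chain[t] = theta
```

The point of the linear term is visible in the code: at theta_bar the correction vanishes, while its gradient restores the full-data gradient, so the surrogate agrees with the exact log-likelihood to first order at the chosen starting value and costs only one pass over the full data.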
Assuming each block of the dataset is stored on a separate machine, I think the approach could further be implemented in parallel, running N MCMC chains and comparing their outputs, with a post-simulation summary stemming from the N empirical distributions thus produced. I also wonder how the method would perform outside the fairly smooth logistic regression case, where the single sample captures the target well enough. The picture above shows a minor gain in a misclassification rate that is already essentially zero.
Next week, Rémi Bardenet is giving a seminar in Paris, Thursday April 14, 2pm, at ENSAE [room 15] on MCMC methods for tall data. Unfortunately, I will miss this opportunity to discuss with Rémi, as I will be heading to La Sapienza, Roma, for Clara Grazian’s PhD defence the next day. And on Monday afternoon, April 11, Nicolas Chopin will give a talk on quasi-Monte Carlo for sequential problems at Institut Henri Poincaré.
I came by chance upon the web service Adzuna, which runs CVs through text mining and returns an estimate of the salary this experience is worth. Here is the summary it produced, along with an automated word cloud (food safety?! millennium?! How come these appear in my skills?).
Christian Robert’s experience appears to be concentrated in Information Technology / Big Data, with exposure to Business Operations and General Business / General Skills and Activities. Christian Robert has 29 years of work experience, with 22 years of management experience, including a high-level position.
The most positive thing one can state about this summary is that the algorithm does not seem very adequate for an academic. Exposure to Business Operations? Me?! Statistics does not seem to be a catchy enough skill for those analysts.