Archive for big data

Calling Bullshit: The Art of Scepticism in a Data‑Driven World [EJ’s book review]

Posted in Books, Statistics on August 26, 2020 by xi'an

“…this book will train readers to be statistically savvy at a time when immunity to misinformation is essential: not just for the survival of liberal democracy, as the authors assert, but for survival itself. Perhaps a crash course on bullshit detection should be a mandatory part of the school curriculum.”

In the latest issue of Nature, E.J. Wagenmakers has written a review of Calling Bullshit, by Carl Bergstrom and Jevin West. The book grew out of a course taught by the authors at the University of Washington during Spring Quarter 2017, aimed at teaching students how to debunk bullshit, that is, the misleading exploitation of statistics and machine learning. A book which I have not read. In his overall positive review, EJ regrets the poor data-visualisation scholarship of the authors, who could have demonstrated and supported the case for a visual debunking of the original data, and the lack of alternative solutions, like Bayesian analysis, to counteract p-fishing. Of course, the need to debunk and expose statistical-sounding misinformation has never been more pressing.

Expectation Propagation as a Way of Life on-line

Posted in pictures, Statistics, University life on March 18, 2020 by xi'an

After a rather extended shelf-life, our paper expectation propagation as a way of life: a framework for Bayesian inference on partitioned data, which was started when Andrew visited Paris in… 2014!, and to which I only marginally contributed, has now appeared in JMLR! Which happens to be my very first paper in this journal.

Michael dans le Monde [#2]

Posted in Books, pictures, Statistics, University life on January 5, 2020 by xi'an

A (second) back-page interview of Mike in Le Monde, on the limits of academics working with the major high-tech companies, and on fatal attractions that are difficult to resist, given the monetary rewards. As with his previous interview, this is quite an interesting read (in French), although it obviously reflects a US perspective rather than a French one (with the same comment applying to the recent interview of Yann LeCun on France Inter).

“…les chercheurs académiques français, qui sont vraiment très peu payés.” [“…French academic researchers, who are really very poorly paid.”]

The first part is a prediction that the GAFAs will not keep hiring (full-time or part-time) academic researchers to pursue their own academic research, as the quest for more immediate profits will eventually win out over the image these collaborations produce. But maybe DeepMind is not the best example, as e.g. Amazon seems to be making immediate gains from such collaborations.

“…le modèle économique [de Amazon, Ali Baba, Uber, &tc] cherche à créer des marchés nouveaux avec à la source, on peut l’espérer, de nouveaux emplois.” [“…the business model (of Amazon, Ali Baba, Uber, &tc) seeks to create new markets with, one may hope, new jobs at the source.”]

One stronger point of disagreement is with the above quote, namely that Uber or Amazon do create jobs, as I am uncertain that all job creation is worthwhile. Indeed, what kind of freedom is there in working after-hours for a reward so far below the minimum wage (in countries where there is a true minimum wage) that the workers [renamed entrepreneurs] fall below the poverty line? Similarly, unless stronger regulations are imposed by states or by unions of states like the EU, it seems difficult to imagine how society, as an aggregate of individuals, can curb the hegemonic tendencies of the high-tech leviathans…

sampling and imbalanced

Posted in Statistics on June 21, 2019 by xi'an

Deborshee Sen, Matthias Sachs, Jianfeng Lu and David Dunson have recently arXived a sub-sampling paper for classification (logistic) models where some covariates or some responses are imbalanced. A PDMP, namely the zig-zag process, is used to preserve the correct invariant distribution (as already mentioned in an earlier post on the zig-zag zampler and in a recent Annals paper by Joris Bierkens, Paul Fearnhead, and Gareth Roberts (Warwick)). The current paper is thus an improvement on the above, using (non-uniform) importance sub-sampling across observations and simpler upper bounds for the Poisson process, a rather practical form of Poisson thinning, and proposing unbiased estimates of the sub-sample log-posterior as well as stratified sub-sampling.
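To make the thinning mechanics concrete, here is a minimal Python sketch, not the authors' implementation, of a zig-zag sampler with sub-sampling for a logistic likelihood under a flat prior, using crude constant per-coordinate dominating rates; the simulated data (A, y), the horizon T_max and the helper grad_term are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# simulated, non-separable logistic data: responses y_j in {-1, +1}, covariates A[j]
N, d = 1000, 3
A = rng.normal(size=(N, d))
beta_true = np.array([1.0, -0.5, 0.25])
y = np.where(rng.random(N) < 1.0 / (1.0 + np.exp(-A @ beta_true)), 1.0, -1.0)

def grad_term(j, x):
    """Gradient of the j-th potential psi_j(x) = log(1 + exp(-y_j a_j.x))."""
    s = 1.0 / (1.0 + np.exp(y[j] * (A[j] @ x)))   # = sigmoid(-y_j a_j.x)
    return -y[j] * A[j] * s

# constant dominating rates: |N d/dx_i psi_j(x)| <= N max_j |A[j, i]| for all x and j
Lambda = N * np.abs(A).max(axis=0)

def zigzag_subsample(T_max, x0):
    """Zig-zag with sub-sampling: one observation per proposed switch (Poisson thinning)."""
    x, theta, t = x0.copy(), np.ones(d), 0.0
    skeleton = [(t, x.copy(), theta.copy())]
    while t < T_max:
        # propose the next event from the superposition of the constant bounding rates
        taus = rng.exponential(1.0 / Lambda)
        i = int(np.argmin(taus))
        tau = taus[i]
        x = x + theta * tau                      # linear drift between events
        t += tau
        # thinning: unbiased rate estimate from a single random observation
        j = rng.integers(N)
        m_ij = max(0.0, theta[i] * N * grad_term(j, x)[i])
        if rng.random() < m_ij / Lambda[i]:
            theta[i] = -theta[i]                 # flip the i-th velocity
            skeleton.append((t, x.copy(), theta.copy()))
    return skeleton

# posterior expectations are then computed by integrating along the piecewise-linear path
skel = zigzag_subsample(T_max=500.0, x0=np.zeros(d))
```

The constant bound keeps the accept/flip step cheap, at the price of many rejected proposals when the bound is loose; tighter, trajectory-dependent bounds (or the control variates of Bierkens, Fearnhead and Roberts) trade implementation effort for fewer wasted events.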

I idly wondered if the zig-zag sampler could itself be improved by not switching the bouncing directions at random, since directions associated with almost certainly null coefficients should be neglected as much as possible, but the intensity functions associated with the directions already incorporate this feature. Except that this requires computing the intensities for all directions, which is especially costly when facing many covariates.

Thinking of the logistic regression model itself, it is sort of frustrating that something so close to an exponential family causes so many headaches! Formally, it is an exponential family, but the normalising constant is rather unwieldy, especially when there are many observations and many covariates. The Pólya-Gamma completion is a way around this, but it proves highly costly when the dimension is large…
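For reference, a minimal sketch of the Pólya-Gamma data augmentation of Polson, Scott and Windle (2013) for Bayesian logistic regression, assuming a Gaussian N(b, B) prior and responses in {0, 1}; the PG(1, c) draw below relies on a truncated sum-of-exponentials representation and is therefore only approximate, and all names and the truncation level are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def rpg_approx(c, K=200):
    """Approximate PG(1, c) draw via a truncated infinite convolution of gammas."""
    k = np.arange(1, K + 1)
    g = rng.exponential(size=K)                  # g_k ~ Gamma(1, 1)
    return np.sum(g / ((k - 0.5) ** 2 + (c / (2 * np.pi)) ** 2)) / (2 * np.pi ** 2)

def pg_gibbs(A, y, B_inv, b, n_iter=2000):
    """Gibbs sampler for beta given y in {0, 1}, with prior beta ~ N(b, B)."""
    n, d = A.shape
    beta = np.zeros(d)
    kappa = y - 0.5
    draws = np.empty((n_iter, d))
    for it in range(n_iter):
        # 1. augmentation: omega_j | beta ~ PG(1, a_j . beta), one draw per observation
        omega = np.array([rpg_approx(A[j] @ beta) for j in range(n)])
        # 2. conditionally Gaussian update: V = (A' Omega A + B^-1)^-1, m = V (A' kappa + B^-1 b)
        V = np.linalg.inv(A.T @ (A * omega[:, None]) + B_inv)
        m = V @ (A.T @ kappa + B_inv @ b)
        beta = rng.multivariate_normal(m, V)
        draws[it] = beta
    return draws
```

Each sweep requires one PG draw per observation plus a d×d solve, which is precisely the cost that becomes prohibitive when both the number of observations and the dimension grow.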

MASH in Le Monde

Posted in Statistics on January 25, 2019 by xi'an