Archive for big data

MASH in Le Monde

Posted in Statistics on January 25, 2019 by xi'an

Big Bayes goes South

Posted in Books, Mountains, pictures, Running, Statistics, Travel, University life on December 5, 2018 by xi'an

At the Big [Data] Bayes conference this week [which I found quite exciting despite a few last-minute cancellations by speakers] there were a lot of clustering talks, including the one by Amy Herring (Duke), using a notion of centering that should soon appear on arXiv. And the one by Peter Müller (UT Austin) on handling large datasets, based on a predictive recursion that takes one value at a time, unsurprisingly similar to the update of Dirichlet process mixtures. (Inspired by a 1998 paper by Michael Newton and co-authors.) The recursion doubles in size at each observation, requiring culling of negligible components. Does the order matter? There are links with the mixtures of mixtures of Malsiner-Walli et al. (2017). Also talks by Antonio Lijoi and Igor Pruenster (Bocconi, Milano) on completely random measures that are used in creating clusters. And by Sylvia Frühwirth-Schnatter (WU Wien) on clustering the impact of company closure in the Austrian labor market. And by Gregor Kastner (WU Wien) on multivariate factor stochastic volatility models, with a video of a large covariance matrix evolving over time and catching economic crises. And by David Dunson (Duke) on distance clustering, reflecting like myself on the definitely ill-defined nature of the [clustering] object: as the sample size increases, spurious clusters appear. (Which reminded me of a disagreement I had had with David McKay at an ICMS conference on mixtures twenty years ago.) Making me realise I missed the recent JASA paper by Miller and Dunson on that perspective.
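For readers unfamiliar with predictive recursion, here is a minimal sketch of the generic algorithm in the spirit of Newton and co-authors, and not a reconstruction of what was presented in the talk. Everything specific in it is my assumption: a normal kernel with known unit variance, a mixing density discretised on a fixed grid (which sidesteps the growth in the number of components mentioned above), weights 1/(i+1), and the function name itself. It processes one observation at a time and can be used to check that the order of the data does matter.

```python
import numpy as np

def predictive_recursion(x, grid, sigma=1.0):
    """Grid-based predictive recursion for a normal location mixture.

    Assumptions (illustrative only): known kernel scale `sigma`,
    uniform starting guess on an equally spaced `grid`, weights 1/(i+1).
    """
    dtheta = grid[1] - grid[0]
    # uniform initial mixing density, normalised for the Riemann sum
    f = np.full(grid.shape, 1.0 / (grid.size * dtheta))
    for i, xi in enumerate(x, start=1):
        w = 1.0 / (i + 1)
        # normal kernel k(x_i | theta) evaluated over the grid
        k = np.exp(-0.5 * ((xi - grid) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        # current predictive density at x_i
        m = np.sum(k * f) * dtheta
        # convex combination of the old guess and its one-point update
        f = (1 - w) * f + w * k * f / m
    return f

rng = np.random.default_rng(0)
# simulated two-component data, purely for illustration
x = rng.choice([-2.0, 2.0], size=200) + rng.normal(size=200)
grid = np.linspace(-6.0, 6.0, 241)

f_forward = predictive_recursion(x, grid)
f_reverse = predictive_recursion(x[::-1], grid)

# the estimate depends on the order in which observations are processed
print("max difference between orderings:", np.max(np.abs(f_forward - f_reverse)))
```

A common remedy for the order dependence, which I mention only as a general device, is to average the estimate over a few random permutations of the sample.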

Some further snapshots (with short comments visible by hovering on the picture) of a very high quality meeting [says one of the organisers!]. Following suggestions from several participants, it would be great to hold another meeting at CIRM in the near future.

phishing alert at CIRM!

Posted in Mountains, pictures, Statistics, Travel, University life on September 13, 2018 by xi'an

A strong and loud warning to all participants in one of the three events organised at CIRM, Luminy, Marseilles, in conjunction with Kerrie Mengersen’s Jean Morlet visiting chair, namely that some participants have received calls from crooks posing as CIRM admins, asking for credit card details towards supporting their stay at CIRM. This is a phishing attempt, as self-supported participants in these events will be asked to pay at the end of their stay and never by phone or mail. In the meantime, a few places remain for both

  1. Bayesian Statistics in the Big Data Era (26-30 Nov., 2018)
  2. Young Bayesians and Big Data for Social Good (23-26 Nov., 2018)

for which registration is free but compulsory.

Bayesian statistics in the big data era

Posted in Mountains, pictures, Running, Statistics, Travel, University life on May 7, 2018 by xi'an

In conjunction with Kerrie Mengersen’s Jean Morlet Chair at CIRM, Luminy, Marseilles, we are organising a special conference “Bayesian statistics in the big data era” on November 26-30, 2018, with the following speakers having already confirmed attendance:

Louis Aslett (Durham, UK)
Sudipto Banerjee (UCLA, US)
Tamara Broderick (MIT, US)
Noël Cressie (Wollongong, OZ)
Marco Cuturi (ENSAE, FR)
David Dunson (Duke, US)
Sylvia Frühwirth-Schnatter (WU, AT)
Amy Herring (Duke, US)
Gregor Kastner (WU, AT)
Ruth King (Edinburgh, UK)
Gary Koop (Edinburgh, UK)
Antonio Lijoi (Bocconi, IT)
Jean-Michel Marin (Montpellier, FR)
Antonietta Mira (Lugano, CH)
Peter Müller (UT Austin, US)
Igor Pruenster (Bocconi, IT)
Stéphane Robin (INRA, FR)
Heejung Shim (U Melbourne, OZ)
Minh-Ngoc Tran (UNSW, OZ)
Darren Wilkinson (Newcastle, UK)


Registration is free but compulsory, and we encourage all interested data scientists (and beyond) to apply and to contribute a talk or a poster. The size of the audience is limited to a maximum of 80 participants, on a first-come, first-served basis. (Cheap housing is available on the campus, located in the gorgeous Calanques national park south of Marseilles.)

In connection with this conference, there will be a workshop the previous weekend on “Young Bayesians and Big Data for social good”, to get junior researchers interested in the analysis of data related to social issues and human rights to work with a few senior researchers. More details soon, here and on the CIRM website.

le bayésianisme aujourd’hui [book review]

Posted in Books, pictures, Statistics, University life on March 4, 2017 by xi'an

It is quite rare to see a book published in French about Bayesian statistics and even rarer to find one that connects philosophy of science, foundations of probability, statistics, and applications in neurosciences and artificial intelligence. Le bayésianisme aujourd’hui (Bayesianism today) was edited by Isabelle Drouet, a Reader in Philosophy at La Sorbonne, and includes a chapter of mine on the basics of Bayesian inference (à la Bayesian Choice), written in French like the rest of the book.

The title of the book is rather surprising (to me) as I had never heard the term Bayesianism mentioned before. As shown by this link, the term apparently exists. (Even though I dislike the sound of it!) The notion is one of a probabilistic structure of knowledge and learning, à la Poincaré, as described in the beginning of the book. But I fear the arguments minimising the subjectivity of the Bayesian approach should not be advanced, following my new stance on the relativity of probabilistic statements, if only because they are defensive and open the path all too easily to counterarguments. Similarly, the argument according to which the “Big Data” era makes the impact of the prior negligible and paradoxically justifies the use of Bayesian methods is limited to the case of little Big Data, i.e., when the observations are more or less iid with a limited number of parameters, and not when the number of parameters explodes. Another set of arguments that I find both more modern and compelling [for being modern is not necessarily a plus!] is the ease with which the Bayesian framework allows for integrative and cooperative learning, along with its ultimate modularity, since each component of the learning mechanism can be extracted and replaced with an alternative.
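To make the “little Big Data” caveat concrete, here is a toy sketch under assumptions that are entirely mine and not from the book: a conjugate normal model with unit noise variance, two deliberately opposed priors N(-5,1) and N(5,1), and a hypothetical helper `posterior_mean`. With a single parameter and n iid observations, the gap between the two posterior means vanishes as n grows. With as many parameters as observations, it does not move, however large the dataset.

```python
import numpy as np

def posterior_mean(xbar, n, prior_mean, prior_var, noise_var=1.0):
    """Posterior mean of a normal mean under a conjugate normal prior."""
    precision = n / noise_var + 1.0 / prior_var
    return (n * xbar / noise_var + prior_mean / prior_var) / precision

rng = np.random.default_rng(1)
prior_a = (-5.0, 1.0)  # (mean, variance), illustrative values
prior_b = (5.0, 1.0)
sizes = (10, 1_000, 100_000)

# Case 1: one parameter, n iid observations.
gaps_iid = []
for n in sizes:
    xbar = rng.normal(size=n).mean()
    gap = abs(posterior_mean(xbar, n, *prior_a) - posterior_mean(xbar, n, *prior_b))
    gaps_iid.append(gap)
    print(f"n={n:>6}, one parameter:    gap between posterior means = {gap:.5f}")

# Case 2: n parameters, one observation each (x[j] informs theta[j] only).
gaps_many = []
for n in sizes:
    x = rng.normal(size=n)
    gap = np.max(np.abs(posterior_mean(x, 1, *prior_a) - posterior_mean(x, 1, *prior_b)))
    gaps_many.append(gap)
    print(f"n={n:>6}, n parameters:     largest per-parameter gap  = {gap:.5f}")
```

In this conjugate setting the gap is available in closed form, 10/(n+1) in the first case and 5 in the second, so the simulation only serves as a sanity check.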

the other end of statistics

Posted in Books, pictures, Statistics on February 8, 2017 by xi'an

A coincidence [or not] saw very similar papers appear in Le Monde and The Guardian within days. I already reported on the Doomsday tone of The Guardian tribune. The point of the other paper is essentially the same, namely that the public has lost trust in quantitative arguments, from the explosion of statistical entries in political debates, to the general distrust of experts, media, government, and parties, including the Institute of Official Statistics (INSEE), to a feeling of disconnection between statistical entities and the daily problems of the average citizen, to the lack of guidance and warnings in the publication of such statistics, to the rejection of anything technocratic… With the missing addendum that politicians and governments too readily correlate good figures with their policies and poor ones with their opponents’. (Just no blame for big data analytics in this case.)

the end of statistics [not!]

Posted in Statistics on January 31, 2017 by xi'an

Last week I spotted this tribune in The Guardian, with the witty title of statistics losing its power, and sort of over-reacted by trying to gather enough momentum from colleagues towards writing a counter-column. After a few days of decantation and a few more readings (reads?) of the tribune, I cooled down towards a more lenient perspective, even though I still dislike the [catastrophic and journalistic] title. The paper is actually mostly right (!), from its historical recap of the evolution of (official) statistics across centuries, to the different nature of the “big data” statistics. (The author is “William Davies, a sociologist and political economist. His books include The Limits of Neoliberalism and The Happiness Industry.”)

“Despite these criticisms, the aspiration to depict a society in its entirety, and to do so in an objective fashion, has meant that various progressive ideals have been attached to statistics.”

A central point is that public opinion has less confidence in (official) statistics than it used to. (Warning: major understatement here!) For many reasons, from numbers used to support any argument and its opposite, to statistics (-ians) being associated with experts, found at every corner of news and media, hence with the “elite” arch-enemy, to a growing innumeracy of both the general public and of the said “elites” (like this “expert” in a debate about the 15th anniversary of the Euro currency on the French NPR last week equating a rise from 2.4 Francs to 6.5 Francs to 700%…) favouring rhetoric over facts, to a disintegration of the social structure that elevates one’s community over others and dismisses arguments from those others, especially those addressed at the entire society. The current debate about post-truths and alternative facts (and the very fact there can even be a debate about it!) is a sad illustration of this regression in the public discourse. The overall perspective in the tribune is one of a sociologist on statistics, but there is nothing to strongly object to.
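For the record, the arithmetic behind the Franc example above takes two lines. The helper name `percent_increase` is mine and the figures are the ones quoted in the debate.

```python
def percent_increase(old, new):
    """Relative increase from `old` to `new`, in percent."""
    return 100.0 * (new - old) / old

old_price, new_price = 2.4, 6.5  # in Francs, as quoted above

increase = percent_increase(old_price, new_price)
print(round(increase, 1))  # 170.8

# price that an actual 700% increase from 2.4 Francs would lead to
price_at_700 = old_price * (1 + 700 / 100)
print(round(price_at_700, 1))  # 19.2
```

In other words, the price was multiplied by about 2.7, an increase of roughly 171%, far from 700%.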

“These data analysts are often physicists or mathematicians, whose skills are not developed for the study of society at all.”

The second part of the paper is about the perceived shift from (official) statistics to another and much more dangerous type of data analysis. Which is not a new view on the field, as shown by Weapons of Math Destruction. I tend to disagree with this perception that data handled by private companies for private purposes is inherently evil. The reticence in trusting the conclusions drawn from such datasets also extends to publicly available datasets and is not primarily linked to the lack of reproducibility of such analyses (which would be a perfectly rational argument!). Nor is it due to physicists or mathematicians running those, instead of quantitative sociologists! The roots of the mistrust are rather to be found in an anti-scientism that has been growing in the past decades, in a paradox of an equally growing technological society fuelled by scientific advances. Hence, calling for a governmental office of big data or some similar institution is very unlikely to solve the issue. I do not know what could, actually, but continuing to develop better statistical methodology cannot hurt!