This week at the Reading Classics student seminar, Thomas Ounas presented a paper, Statistical inference on massive datasets, written by Li, Lin, and Li, a paper out of The List. (This paper was recently published as Applied Stochastic Models in Business and Industry, 29, 399-409..) I accepted this unorthodox proposal as (a) it was unusual, i.e., this was the very first time a student made this request, and (b) the topic of large datasets and their statistical processing definitely was interesting even though the authors of the paper were unknown to me. The presentation by Thomas was very power-pointish (or power[-point]ful!), with plenty of dazzling transition effects… Even including (a) a Python software replicating the method and (b) a nice little video on internet data transfer protocols. And on a Linux machine! Hence the experiment was worth the try! Even though the paper is a rather unlikely candidate for the list of classics… (And the rendering in static power point no so impressive. Hence a video version available as well…)
The solution adopted by the authors of the paper is one of breaking a massive dataset into blocks so that each fits into the computer(s) memory and of computing a separate estimate for each block. Those estimates are then averaged (and standard-deviationed) without a clear assessment of the impact of this multi-tiered handling of the data. Thomas then built a software to illustrate this approach, with mean and variance and quantiles and densities as quantities of interest. Definitely original! The proposal itself sounds rather basic from a statistical viewpoint: for instance, evaluating the loss in information due to using this blocking procedure requires repeated sampling, which is unrealistic. Or using solely the inter-variance estimates which seems to be missing the intra-variability. Hence to be overly optimistic. Further, strictly speaking, the method does not asymptotically apply to biased estimators, hence neither to Bayes estimators (nor to density estimators). Convergence results are thus somehow formal, in that the asymptotics cannot apply to a finite memory computer. In practice, the difficulty of the splitting technique is rather in breaking the data into blocks since Big Data is rarely made of iid observations. Think of amazon data, for instance. A question actually asked by the class. The method of Li et al. should also include some boostrapping connection. E.g., to Michael’s bag of little bootstraps.