## reading classics (#3)

**T**his week at the Reading Classics student seminar, Thomas Ounas presented a paper, *Statistical inference on massive datasets*, written by Li, Lin, and Li, a paper from outside The List. (The paper was recently published in *Applied Stochastic Models in Business and Industry*, 29, 399–409.) I accepted this unorthodox proposal as (a) it was unusual, i.e., this was the very first time a student made such a request, and (b) the topic of large datasets and their statistical processing was definitely interesting, even though the authors of the paper were unknown to me. The presentation by Thomas was very power-pointish *(or power[-point]ful!)*, with plenty of dazzling transition effects… even including (a) a Python program replicating the method and (b) a nice little video on internet data transfer protocols. And on a Linux machine! Hence the experiment was worth the try, even though the paper is a rather unlikely candidate for the list of classics… (And the rendering in a static PowerPoint file is not so impressive, hence a video version is available as well…)

**T**he solution adopted by the authors of the paper is to break a massive dataset into blocks, so that each fits into the computer's memory, and to compute a separate estimate for each block. Those estimates are then averaged (and standard-deviationed) without a clear assessment of the impact of this multi-tiered handling of the data. Thomas then built a program to illustrate this approach, with mean, variance, quantiles, and densities as quantities of interest. Definitely original! The proposal itself sounds rather basic from a statistical viewpoint: for instance, evaluating the loss in information due to using this blocking procedure requires repeated sampling, which is unrealistic. Or relying solely on the inter-block variance estimates, which seems to miss the intra-block variability, hence to be overly optimistic. Further, strictly speaking, the method does not asymptotically apply to biased estimators, hence neither to Bayes estimators (nor to density estimators). Convergence results are thus somewhat formal, in that the asymptotics cannot apply to a finite-memory computer. In practice, the difficulty with the splitting technique rather lies in breaking the data into blocks, since Big Data is rarely made of iid observations. Think of Amazon data, for instance. (A question actually asked by the class.) The method of Li et al. should also include some bootstrapping connection, e.g., to Michael's bag of little bootstraps.
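For the record, the block-and-average scheme can be sketched in a few lines of Python. (A toy version, with simulated iid normals standing in for the massive dataset; the plain average and the inter-block-only standard error are exactly the quantities discussed above.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a massive dataset: one million iid standard normals.
data = rng.standard_normal(1_000_000)

def block_estimates(x, n_blocks):
    """Split x into blocks and return the per-block estimates (here, means)."""
    blocks = np.array_split(x, n_blocks)
    return np.array([b.mean() for b in blocks])

ests = block_estimates(data, n_blocks=100)

# Combined estimate: plain average of the block estimates...
combined = ests.mean()

# ...and an inter-block-only standard error, which ignores the
# intra-block variability, hence the over-optimism noted above.
se = ests.std(ddof=1) / np.sqrt(len(ests))
print(combined, se)
```

With equal-sized blocks and the sample mean as estimator, the combined estimate coincides with the full-data mean; for non-linear quantities (quantiles, densities) this equality no longer holds, which is where the loss of information creeps in.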


November 29, 2013 at 7:03 am

What about “pulling together” the \theta_i ‘s with a (hierarchical) mixed model, or some other meta-analytic techniques?

November 29, 2013 at 10:24 am

Almost anything would be better than using this plain average! Seen my early comments this week on the embarrassingly parallel method? This would be a much better start, for sure!

November 29, 2013 at 10:29 am

Yes I did – just didn’t get around to asking you about one thing that bugged me: distribute a job to too many nodes and your informative prior, raised to the power 1/(number of nodes), becomes spread out and pretty non-informative, doesn’t it?

November 29, 2013 at 10:59 am

I first thought this was a poor idea. But then, assuming the subposterior remains proper (and this is indeed an issue), the method is simply a way to simulate from a product of densities by mixing the estimates of the subposteriors together. There is no statistical issue of informative versus non-informative prior at this stage, as the technique is merely computational. The components of the products are not to be interpreted as posteriors on their own.
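[In the Gaussian case with known variance and a flat prior, this product-of-subposteriors combination can be checked exactly: each subposterior is Gaussian, and the precision-weighted combination recovers the full-data posterior. A minimal numerical check, with hypothetical values for the number of nodes and data variance:]

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 10_000, 10
sigma2 = 1.0                       # known data variance (hypothetical)
data = rng.normal(0.3, 1.0, n)     # simulated data, true mean 0.3
blocks = np.array_split(data, K)

# Flat prior: the subposterior on block k for the mean is
# N(block mean, sigma2 / block size).
means = np.array([b.mean() for b in blocks])
precisions = np.array([len(b) / sigma2 for b in blocks])

# A product of Gaussian densities is Gaussian, with summed precisions
# and a precision-weighted mean.
post_prec = precisions.sum()
post_mean = (precisions * means).sum() / post_prec

# This matches the full-data posterior N(data.mean(), sigma2 / n).
assert np.isclose(post_mean, data.mean())
assert np.isclose(post_prec, n / sigma2)
```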

November 29, 2013 at 10:34 am

I meant the prior used by each node
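[As a quick check of the flattening effect raised above: for a N(0, τ²) prior, raising the density to the power 1/K yields, up to a normalising constant, a N(0, Kτ²) density, so the per-node prior variance is indeed inflated by the number of nodes. A minimal numerical verification, with hypothetical values:]

```python
import numpy as np

tau2 = 1.0   # prior variance on each node's parameter (hypothetical)
K = 10       # number of nodes

def log_kernel(x, var):
    """Log of a centred Gaussian density, up to its normalising constant."""
    return -0.5 * x**2 / var

x = np.linspace(-3, 3, 7)

# (1/K) times the log-kernel of N(0, tau2) equals the log-kernel
# of N(0, K * tau2): the fractionated prior is K times flatter.
assert np.allclose(log_kernel(x, tau2) / K, log_kernel(x, K * tau2))
```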
