Archive for large data problems

divide & reconquer

Posted in Books, Statistics, University life on February 5, 2018 by xi'an

Qi Liu, Anindya Bhadra, and William Cleveland from Purdue have arXived a paper entitled Divide and Recombine for Large and Complex Data: Model Likelihood Functions using MCMC. Which is a variation on the earlier divide & … papers attempting to handle large datasets. The beginning is quite similar to these earlier papers in that the likelihood is split into sub-likelihoods, approximated from MCMC samples and recombined into an approximate full likelihood. As in for instance Scott et al., one approximation used for the subsample is to replace the likelihood with a Normal approximation, or a skew Normal generalisation, which remains a limited choice for heavy tailed likelihoods. Producing a Normal and skew-Normal approximation for the whole [data] likelihood, respectively. If I understand correctly, these approximations are missing a normalising constant to bring them to scale with the true likelihood, which I do not completely understand as the likelihood only needs to be defined up to a constant for most purposes, including Bayesian ones. The method of estimation of this constant proposed therein is called the contour probability algorithm and it consists in using a highest density region to compare a likelihood and its approximation. (Nothing to do with our adaptation of Gelfand and Dey (1994) based on HPDs, with Darren Wraith. Nor with nested sampling.) Returning a form of qq-plot. This is rather exploratory, while hardly addressing the issue of the precision of such approximations and the resolution of conflicting proposals. And the comparison with all these other recent proposals for splitting likelihoods into manageable bits (proposals that are mentioned in the final section, including our recentering scheme with my student Changye Wu).
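To fix ideas on the Normal recombination step, here is a minimal sketch for a scalar parameter. It assumes that each batch of MCMC draws targets one sub-likelihood (under a flat prior) and that a Normal approximation is adequate for each batch. The function name recombine_normal and the use of NumPy are hypothetical choices for illustration, not the authors' implementation, and neither the skew-Normal version nor the contour probability algorithm is covered.

```python
import numpy as np


def recombine_normal(sub_draws):
    """Recombine Normal approximations of sub-likelihoods (scalar parameter).

    sub_draws: list of 1-d arrays, one array of MCMC draws per data subset.
    Returns the mean and variance of the Normal approximation to the
    product of the sub-likelihoods.
    """
    # Moment-match a Normal to each batch of draws
    means = np.array([np.mean(d) for d in sub_draws])
    precisions = np.array([1.0 / np.var(d, ddof=1) for d in sub_draws])
    # A product of Normal densities is Normal (up to a constant):
    # precisions add up, and the mean is the precision-weighted average
    total_precision = precisions.sum()
    mean = (precisions * means).sum() / total_precision
    return mean, 1.0 / total_precision
```

The constant lost when multiplying the Normal densities is exactly the kind of normalising factor discussed above, and it plays no role in the recombined mean and variance.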

variational consensus Monte Carlo

Posted in Books, Statistics, University life on July 2, 2015 by xi'an

“Unfortunately, the factorization does not make it immediately clear how to aggregate on the level of samples without first having to obtain an estimate of the densities themselves.” (p.2)

The recently arXived variational consensus Monte Carlo is a paper by Maxim Rabinovich, Elaine Angelino, and Michael Jordan that approaches the consensus Monte Carlo principle from a variational perspective. As in the embarrassingly parallel version, the target is split into a product of K terms, each being interpreted as an unnormalised density and being fed to a different parallel processor. The most natural partition is to break the data into K subsamples and to raise the prior to the power 1/K in each term. While this decomposition makes sense from a storage perspective, since each bit corresponds to a different subsample of the data, it raises the question of the statistical pertinence of splitting the prior and my feelings about it are now more lukewarm than when I commented on the embarrassingly parallel version, mainly for the reason that it is not reparameterisation invariant, getting different targets if one does the reparameterisation before or after the partition, and hence does not treat the prior as the reference measure it should be. I therefore prefer the version where the same original prior is attached to each part of the partitioned likelihood (and even more the random subsampling approaches discussed in the recent paper of Bardenet, Doucet, and Holmes). Another difficulty with the decomposition is that a product of densities is not a density in most cases (it may even be of infinite mass) and does not offer a natural path to the analysis of samples generated from each term in the product. Nor an explanation as to why those samples should be relevant to construct a sample for the original target.
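In symbols, and assuming observations that are independent given θ, the partition and the lack of reparameterisation invariance can be written as follows. The notation (I_k for the indices of the k-th subsample, L_k for the corresponding sub-likelihood, g for the change of variable) is introduced here for illustration and is not taken from the paper.

```latex
% Full posterior as a product of K sub-posteriors
\pi(\theta \mid x_{1:n}) \;\propto\; \prod_{k=1}^{K} \Big\{ \pi(\theta)^{1/K} L_k(\theta) \Big\},
\qquad L_k(\theta) = \prod_{i \in I_k} f(x_i \mid \theta)

% Reparameterise as \eta = g(\theta), with \theta(\eta) = g^{-1}(\eta).
% (a) Partition first, then transform the k-th sub-posterior as a density:
\pi(\theta(\eta))^{1/K} \, L_k(\theta(\eta)) \, \Big| \frac{\mathrm{d}\theta}{\mathrm{d}\eta} \Big|

% (b) Transform the prior first, then partition:
\Big\{ \pi(\theta(\eta)) \, \Big| \frac{\mathrm{d}\theta}{\mathrm{d}\eta} \Big| \Big\}^{1/K} L_k(\theta(\eta))
```

The two sub-targets differ by a factor |dθ/dη|^{1-1/K}, which is constant in η only when K=1 or when the transform is affine.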

“The performance of our algorithm depends critically on the choice of aggregation function family.” (p.5)

Since the variational Bayes approach is a common answer to complex product models, Rabinovich et al. explore the use of variational Bayes techniques to build the consensus distribution out of the separate samples. As in Scott et al., and Neiswanger et al., the simulation from the consensus distribution is a transform of simulations from each of the terms in the product, e.g., a weighted average. Which determines the consensus distribution as a member of an aggregation family defined loosely by a Dirac mass. When the transform is a sum of individual terms, variational Bayes solutions get much easier to find and the authors work under this restriction… In the empirical evaluation of this variational Bayes approach as opposed to the uniform and Gaussian averaging options in Scott et al., it improves upon those, except in a mixture example with a large enough common variance.
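As an illustration of the weighted-average transform, here is a minimal sketch for a scalar parameter, assuming K workers that each return the same number S of draws from their own term. The function name consensus_draws is hypothetical. The default inverse-variance weights mimic the Gaussian averaging option, equal weights give the uniform one, and the variational choice of the aggregation function is not attempted here.

```python
import numpy as np


def consensus_draws(sub_draws, weights=None):
    """Weighted-average consensus of draws from K sub-posteriors.

    sub_draws: array of shape (K, S), row k holding the S draws of worker k.
    weights:   K positive weights; defaults to inverse sample variances.
    Returns S consensus draws, the s-th one averaging the s-th draws
    of all workers.
    """
    sub_draws = np.asarray(sub_draws, dtype=float)
    if weights is None:
        # Gaussian-type averaging: weight each worker by its precision
        weights = 1.0 / sub_draws.var(axis=1, ddof=1)
    weights = np.asarray(weights, dtype=float)
    return weights @ sub_draws / weights.sum()
```

Calling consensus_draws(draws, weights=np.ones(K)) returns the plain average across workers.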

In fine, despite the relevance of variational Bayes to improve the consensus approximation, I still remain unconvinced about the use of the product of (pseudo-)densities and the subsequent mix of simulations from those components, for the reason mentioned above and also because the tail behaviour of those components is not related to the tail behaviour of the target. Still, this is a working solution to a real problem and as such is a reference for future work.

EP as a way of life (aka Life of EP)

Posted in Books, Statistics, University life on December 24, 2014 by xi'an

When Andrew was in Paris, we discussed at length using EP for handling big datasets in a different way than running parallel MCMC. A related preprint came out on arXiv a few days ago, with an introduction on Andrew’s blog. (Not written two months in advance as most of his entries!)

The major argument in using EP in a large data setting is that the approximation to the true posterior can be built using one part of the data at a time and thus avoids handling the entire likelihood function. Nonetheless, I still remain mostly agnostic about using EP and a seminar this morning at CREST by Guillaume Dehaene and Simon Barthelmé (re)generated self-interrogations about the method that hopefully can be exploited towards the future version of the paper.

One of the major difficulties I have with EP is about the nature of the resulting approximation. Since it is chosen out of a “nice” family of distributions, presumably restricted to an exponential family, the optimal approximation will remain within this family, which further makes EP sound like a specific variational Bayes method since the goal is to find the family member closest to the posterior in terms of Kullback-Leibler divergence. (Except that the divergence is the opposite one.) I remain uncertain about what to do with the resulting solution, as the algorithm does not tell me how close this solution will be to the true posterior. Unless one can use it as a pseudo-distribution for indirect inference (a.k.a., ABC)..?
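To spell out the "opposite divergence" remark, with p the target (for EP, the current tilted distribution at each site update rather than the full posterior) and q a member of the approximating family Q; the notation is introduced here for illustration only.

```latex
% Variational Bayes: q chosen in Q to minimise
\mathrm{KL}(q \,\|\, p) = \int q(\theta) \log \frac{q(\theta)}{p(\theta)} \, \mathrm{d}\theta

% EP: each site update minimises the divergence taken in the other direction
\mathrm{KL}(p \,\|\, q) = \int p(\theta) \log \frac{p(\theta)}{q(\theta)} \, \mathrm{d}\theta

% When Q is an exponential family with sufficient statistic T,
% minimising KL(p || q) over q amounts to moment matching
\mathbb{E}_q[T(\theta)] = \mathbb{E}_p[T(\theta)]
```

Neither criterion returns the value of the divergence at the optimum in a usable form, which is one way of reading the complaint that the algorithm does not say how close the solution is to the true posterior.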

Another thing that became clear during this seminar is that the decomposition of the target as a product is completely arbitrary, i.e., does not correspond to any feature of the target other than the latter being the product of those components. Hence, the EP partition could be adapted or even optimised within the algorithm. Similarly, the parametrisation could be optimised towards a “more Gaussian” posterior. This is something that makes EP both exciting as opening many avenues for experimentation and fuzzy as its perceived lack of goal makes comparing approaches delicate. For instance, using MCMC or HMC steps to estimate the parameters of the tilted distribution is quite natural in complex settings but the impact of the additional approximation must be gauged against the overall purpose of the approach.


Large-scale Inference

Posted in Books, R, Statistics, University life on February 24, 2012 by xi'an

Large-scale Inference by Brad Efron is the first IMS Monograph in this new series, coordinated by David Cox and published by Cambridge University Press. Since I read this book immediately after Cox and Donnelly’s Principles of Applied Statistics, I was thinking of drawing a parallel between the two books. However, while neither of them can be classified as a textbook [even though Efron’s has exercises], they differ very much in their intended audience and their purpose. As I wrote in the review of Principles of Applied Statistics, that book has an encompassing scope with the goal of covering all the methodological steps required by a statistical study. In Large-scale Inference, Efron focuses on empirical Bayes methodology for large-scale inference, by which he mostly means multiple testing (rather than, say, data mining). As a result, the book is centred on mathematical statistics and is more technical. (Which does not mean it is less of an exciting read!) The book was recently reviewed by Jordi Prats for Significance. Like the previous reviewer, and unsurprisingly, I found the book nicely written, with a wealth of R (colour!) graphs (the R programs and dataset are available on Brad Efron’s home page).

“I have perhaps abused the “mono” in monograph by featuring methods from my own work of the past decade.” (p.xi)

Sadly, I cannot remember if I read my first Efron paper via his 1977 introduction to the Stein phenomenon with Carl Morris in Pour la Science (the French translation of Scientific American) or through his 1983 Pour la Science paper with Persi Diaconis on computer intensive methods. (I would bet on the latter though.) In any case, I certainly read a lot of Efron’s papers on the Stein phenomenon during my thesis and it was thus with great pleasure that I saw he introduced empirical Bayes notions through the Stein phenomenon (Chapter 1). It actually took me a while but I eventually (by page 90) realised that empirical Bayes was a proper subtitle to Large-Scale Inference in that the large samples were giving some weight to the validation of empirical Bayes analyses. In the sense of reducing the importance of a genuine Bayesian modelling (even though I do not see why this genuine Bayesian modelling could not be implemented in the cases covered in the book).

“Large N isn’t infinity and empirical Bayes isn’t Bayes.” (p.90)

The core of Large-scale Inference is multiple testing and the empirical Bayes justification/construction of Fdr’s (false discovery rates). Efron wrote more than a dozen papers on this topic, covered in the book and building on the groundbreaking and highly cited Series B 1995 paper by Benjamini and Hochberg. (In retrospect, it should have been a Read Paper and so was made a “retrospective read paper” by the Research Section of the RSS.) Fdr’s are essentially posterior probabilities and therefore open to empirical Bayes approximations when priors are not selected. Before reaching the concept of Fdr’s in Chapter 4, Efron goes over earlier procedures for removing multiple testing biases. As shown by a section title (“Is FDR Control “Hypothesis Testing”?”, p.58), one major point in the book is that an Fdr is more of an estimation procedure than a significance-testing object. (This is not a surprise from a Bayesian perspective since the posterior probability is an estimate as well.)
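For readers who have not met it, the Benjamini and Hochberg step-up rule can be sketched in a few lines. This is a minimal illustration, not code from the book: the function name benjamini_hochberg and the default level q=0.1 are hypothetical choices. With N p-values sorted in increasing order, the rule finds the largest k such that the k-th smallest p-value is at most kq/N and rejects the k hypotheses with the smallest p-values.

```python
import numpy as np


def benjamini_hochberg(pvalues, q=0.1):
    """Return a boolean array flagging the hypotheses rejected at level q."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Compare the k-th smallest p-value with k * q / n, for k = 1, ..., n
    below = p[order] <= q * np.arange(1, n + 1) / n
    reject = np.zeros(n, dtype=bool)
    if below.any():
        # Step-up: take the largest k passing the test and reject
        # all hypotheses with smaller or equal p-values
        k = np.nonzero(below)[0].max()
        reject[order[:k + 1]] = True
    return reject
```

The step-up nature means that a p-value failing its own threshold can still be rejected when a larger one passes, as in benjamini_hochberg([0.03, 0.04, 0.044, 0.9], q=0.06), where the first three hypotheses are rejected.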

“Scientific applications of single-test theory most often suppose, or hope for rejection of the null hypothesis (…) Large-scale studies are usually carried out with the expectation that most of the N cases will accept the null hypothesis.” (p.89)

On the innovations proposed by Efron and described in Large-scale Inference, I particularly enjoyed the notions of local Fdrs in Chapter 5 (essentially plug-in posterior probabilities that a given observation stems from the null component of the mixture) and of the (Bayesian) improvement brought by empirical null estimation in Chapter 6 (“not something one estimates in classical hypothesis testing”, p.97) and the explanation for the inaccuracy of the bootstrap (which “stems from a simpler cause”, p.139), but found less crystal-clear the empirical evaluation of the accuracy of Fdr estimates (Chapter 7, “independence is only a dream”, p.113), maybe in relation with my early career inability to explain Morris’s (1983) correction for empirical Bayes confidence intervals (pp. 12-13). I also discovered the notion of enrichment in Chapter 9, with permutation tests resembling some low-key bootstrap, and multiclass models in Chapter 10, which appear as if they could benefit from a hierarchical Bayes perspective. The last chapter happily concludes with one of my preferred stories, namely the missing species problem (on which I hope to work this very Spring).
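For reference, the posterior-probability reading of both the tail-area and the local versions follows from the two-groups mixture model. The symbols below (π₀ for the prior null probability, f₀ and f₁ for the null and non-null densities, F₀ and F for the null and mixture cdfs) are standard notation recalled here as a reminder, not a quote from the book.

```latex
% Two-groups model: each z-value is null with probability \pi_0
f(z) = \pi_0 f_0(z) + (1 - \pi_0) f_1(z)

% Local false discovery rate: posterior probability of the null given z
\mathrm{fdr}(z) = \Pr(\text{null} \mid z) = \frac{\pi_0 f_0(z)}{f(z)}

% Tail-area version: posterior probability of the null given Z \le z
\mathrm{Fdr}(z) = \Pr(\text{null} \mid Z \le z) = \frac{\pi_0 F_0(z)}{F(z)}
```

The plug-in aspect comes from replacing the unknown mixture density f (or cdf F) by an estimate built from the N observed z-values.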

JSM 2010 [talk]

Posted in Statistics, University life on August 2, 2010 by xi'an

Here are the slides of the talk I am presenting on Monday at JSM 2010 in Vancouver, in the Bayesian Inference in Massive Data Problems session organised by Alexandra Schmidt: it is about Bayesian model choice in cosmology.
It borrows from Darren Wraith’s presentation at JSM 2009 by adding more items on evidence estimation, in connection with our latest cosmology paper. Obviously, 64 slides for 25 minutes is a bit extra-galactic, even at the speed of light, so I will scrap all the technical sections on PMC, AMIS and focus on the implementation for the cosmology problem.

PS: If anyone has an idea as to why two pages are rotated during the ps2pdf process, thanks for sharing your thoughts!