Archive for tall data

simple, scalable and accurate posterior interval estimation

Posted in Statistics with tags , , , , , , , on July 6, 2016 by xi'an

“There is a lack of simple and scalable algorithms for uncertainty quantification.”

A paper by Cheng Li , Sanvesh Srivastava, and David Dunson that I had missed and which was pointed out on Andrew’s blog two days ago. As recalled in the very first sentence of the paper, above, the existing scalable MCMC algorithms somewhat fail to account for confidence (credible) intervals. In the sense that handling parallel samples does not naturally produce credible intervals.Since the approach is limited to one-dimensional quantity of interest, ζ=h(θ), the authors of the paper consider the MCMC approximations of the cdf of the said quantity ζ based on the manageable subsets like as many different approximations of the same genuine posterior distribution of that quantity ζ. (Corrected by a power of the likelihood but dependent on the particular subset used for the estimation.) The estimate proposed in the paper is a Wasserstein barycentre of the available estimations, barycentre that is defined as minimising the sum of the Wasserstein distances to all estimates. (Why should this measure be relevant: the different estimates may be of different quality). Interestingly (at least at a computational level), the solution is such that the quantile function of the Wasserstein barycentre is the average of the estimated quantiles functions. (And is there an alternative loss returning the median cdf?) A confidence interval based on the quantile function can then be directly derived. The paper shows that this Wasserstein barycentre converges to the true (marginal) posterior as the sample size m of each sample grows to infinity (and faster than 1/√m), with the strange side-result that the convergence is in 1/√n when the MLE of the global parameter θ is unbiased. Strange to me because unbiasedness is highly dependent on parametrisation while the performances of this estimator should not be, i.e., should be invariant under reparameterisation. Maybe this is due to ζ being a linear transform of θ in the convergence theorem… In any case, I find this question of merging cdf’s from poorly defined approximations to an unknown cdf of the highest interest and look forward any further proposal to this effect!

communication-efficient distributed statistical learning

Posted in Books, Statistics, University life with tags , , , , , , , , on June 10, 2016 by xi'an

mikecemMichael Jordan, Jason Lee, and Yun Yang just arXived a paper with their proposal on handling large datasets through distributed computing, thus contributing to the currently very active research topic of approximate solutions in large Bayesian models. The core of the proposal is summarised by the screenshot above, where the approximate likelihood replaces the exact likelihood with a first order Taylor expansion. The first term is the likelihood computed for a given subsample (or a given thread) at a ratio of one to N and the difference of the gradients is only computed once at a good enough guess. While the paper also considers M-estimators and non-Bayesian settings, the Bayesian part thus consists in running a regular MCMC when the log-target is approximated by the above. I first thought this proposal amounted to a Gaussian approximation à la Simon Wood or to an INLA approach but this is not the case: the first term of the approximate likelihood is exact and hence can be of any form, while the scalar product is linear in θ, providing a sort of first order approximation, albeit frozen at the chosen starting value.

mikecem2Assuming that each block of the dataset is stored on a separate machine, I think the approach could further be implemented in parallel, running N MCMC chains and comparing the output. With a post-simulation summary stemming from the N empirical distributions thus produced. I also wonder how the method would perform outside the fairly smooth logistic regression case, where the single sample captures well-enough the target. The picture above shows a minor gain in a misclassification rate that is already essentially zero.

Rémi Bardenet’s seminar

Posted in Kids, pictures, Statistics, Travel, University life with tags , , , , , , , , , , , , , on April 7, 2016 by xi'an

Grand Palais from Esplanade des Invalides, Paris, Dec. 07, 2012Next week, Rémi Bardenet is giving a seminar in Paris, Thursday April 14, 2pm, in ENSAE [room 15] on MCMC methods for tall data. Unfortunately, I will miss this opportunity to discuss with Rémi as I will be heading to La Sapienza, Roma, for Clara Grazian‘s PhD defence the next day.  And on Monday afternoon, April 11, Nicolas Chopin will give a talk on quasi-Monte Carlo for sequential problems at Institut Henri Poincaré.

parallelizing MCMC with random partition trees

Posted in Books, pictures, Statistics, University life with tags , , , , , , , on July 7, 2015 by xi'an

Another arXived paper in the recent series about big or tall data and how to deal with it by MCMC. Which pertains to the embarrassingly parallel category. As in the previously discussed paper, the authors (Xiangyu Wang, Fangjian Guo, Katherine Heller, and David Dunson) chose to break the prior itself into m bits… (An additional point from last week criticism is that, were an unbiased estimator of each term in the product available in an independent manner, the product of the estimators would be the estimator of the product.) In this approach, the kernel estimator of Neiswanger et al. is replaced with a random partition tree histogram. Which uses the same block partition across all terms in the product representation of the posterior. And hence ends up with a smaller number of terms in the approximation, since it does not explode with m. (They could have used Mondrian forests as well! However I think their quantification of the regular kernel method cost as an O(Tm) approach does not account for Neiswanger et al.’s trick in exploiting the product of kernels…) The so-called tree estimate can be turned into a random forest by repeating the procedure several times and averaging. The simulation comparison runs in favour of the current method when compared with other consensus or non-parametric methods. Except in the final graph (Figure 5) which shows several methods achieving the same prediction accuracy against running time.

variational consensus Monte Carlo

Posted in Books, Statistics, University life with tags , , , , , , on July 2, 2015 by xi'an

“Unfortunately, the factorization does not make it immediately clear how to aggregate on the level of samples without first having to obtain an estimate of the densities themselves.” (p.2)

The recently arXived variational consensus Monte Carlo is a paper by Maxim Rabinovich, Elaine Angelino, and Michael Jordan that approaches the consensus Monte Carlo principle from a variational perspective. As in the embarrassingly parallel version,  the target is split into a product of K terms, each being interpreted as an unnormalised density and being fed to a different parallel processor. The most natural partition is to break the data into K subsamples and to raise the prior to the power 1/K in each term. While this decomposition makes sense from a storage perspective, since each bit corresponds to a different subsample of the data, it raises the question of the statistical pertinence of splitting the prior and my feelings about it are now more lukewarm than when I commented on the embarrassingly parallel version,  mainly for the reason that it is not reparameterisation invariant—getting different targets if one does the reparameterisation before or after the partition—and hence does not treat the prior as the reference measure it should be. I therefore prefer the version where the same original prior is attached to each part of the partitioned likelihood (and even more the random subsampling approaches discussed in the recent paper of Bardenet, Doucet, and Holmes). Another difficulty with the decomposition is that a product of densities is not a density in most cases (it may even be of infinite mass) and does not offer a natural path to the analysis of samples generated from each term in the product. Nor an explanation as to why those samples should be relevant to construct a sample for the original target.

“The performance of our algorithm depends critically on the choice of aggregation function family.” (p.5)

Since the variational Bayes approach is a common answer to complex products models, Rabinovich et al. explore the use of variational Bayes techniques to build the consensus distribution out of the separate samples. As in Scott et al., and Neiswanger et al., the simulation from the consensus distribution is a transform of simulations from each of the terms in the product, e.g., a weighted average. Which determines the consensus distribution as a member of an aggregation family defined loosely by a Dirac mass. When the transform is a sum of individual terms, variational Bayes solutions get much easier to find and the authors work under this restriction… In the empirical evaluation of this variational Bayes approach as opposed to the uniform and Gaussian averaging options in Scott et al., it improves upon those, except in a mixture example with a large enough common variance.

In fine, despite the relevance of variational Bayes to improve the consensus approximation, I still remain unconvinced about the use of the product of (pseudo-)densities and the subsequent mix of simulations from those components, for the reason mentioned above and also because the tail behaviour of those components is not related with the tail behaviour of the target. Still, this is a working solution to a real problem and as such is a reference for future works.

on Markov chain Monte Carlo methods for tall data

Posted in Books, Statistics, University life with tags , , , , , on June 22, 2015 by xi'an

Rémi Bardenet, Arnaud Doucet, and Chris Holmes arXived a long paper (with the above title) a month ago, paper that I did not have time to read in detail till today. The paper is quite comprehensive in its analysis of the current literature on MCMC for huge, tall, or big data. Even including our delayed acceptance paper! Now, it is indeed the case that we are all still struggling with this size difficulty. Making proposals in a wide range of directions, hopefully improving the efficiency of dealing with tall data. However, we are not there yet in that the outcome is either about as costly as the original MCMC implementation or its degree of approximation is unknown, even when bounds are available.

Most of the paper proposal is based on aiming at an unbiased estimator of the likelihood function in a pseudo-marginal manner à la Andrieu and Roberts (2009) and on a random subsampling scheme that presumes (a) iid-ness and (b) a lower bound on each term in the likelihood. It seems to me slightly unrealistic to assume that a much cheaper and tight lower bound on those terms could be available. Firmly set in the iid framework, the problem itself is unclear: do we need 10⁸ observations of a logistic model with a few parameters? The real challenge is rather in non-iid hierarchical models with random effects and complex dependence structures. For which subsampling gets much more delicate. None of the methods surveyed in the paper broaches upon such situations where the entire data cannot be explored at once.

An interesting experiment therein, based on the Glynn and Rhee (2014) unbiased representation, shows that the approach does not work well. This could lead the community to reconsider the focus on unbiasedness by coming full circle to the opposition  between bias and variance. And between intractable likelihood and representative subsample likelihood.

Reading the (superb) coverage of earlier proposals made me trace back on the perceived appeal of the decomposition of Neiswanger et al. (2014) as I came to realise that the product of functions renormalised into densities has no immediate probabilistic connection with its components. As an extreme example, terms may fail to integrate. (Of course, there are many Monte Carlo features that exploit such a decomposition, from the pseudo-marginal to accept-reject algorithms. And more to come.) Taking samples from terms in the product is thus not directly related to taking samples from each term, in opposition with the arithmetic mixture representation. I was first convinced by using a fraction of the prior in each term but now find it unappealing because there is no reason the prior should change for a smaller sampler and no equivalent to the prohibition of using the data several times. At this stage, I would be much more in favour of raising a random portion of the likelihood function to the right power. An approach that I suggested to a graduate student earlier this year and which is also discussed in the paper. And considered too naïve and a “very poor approach” (Section 6, p.18), even though there must be versions that do not run afoul of the non-Gaussian nature of the log likelihood ratio. I am certainly going to peruse more thoroughly this Section 6 of the paper.

Another interesting suggestion in this definitely rich paper is the foray into an alternative bypassing the uniform sampling in the Metropolis-Hastings step, using instead the subsampled likelihood ratio. The authors call this “exchanging acceptance noise for subsampling noise” (p.22). However, there is no indication about the resulting stationary and I find the notion of only moving to higher likelihoods (or estimates of) counter to the spirit of Metropolis-Hastings algorithms. (I have also eventually realised the meaning of the log-normal “difficult” benchmark that I missed in the earlier : it means log-normal data is modelled by a normal density.)  And yet another innovation along the lines of a control variate for the log likelihood ratio, no matter it sounds somewhat surrealistic.