Archive for bag of little bootstraps

JSM 2015 [day #4]

Posted in pictures, Running, Statistics, Travel, University life with tags , , , , , , , , , , , , , , , , , , on August 13, 2015 by xi'an

My first session today was Markov Chain Monte Carlo for Contemporary Statistical Applications with a heap of interesting directions in MCMC research! Now, without any possible bias (!), I would definitely nominate Murray Pollock (incidentally from Warwick) as the winner for best slides, funniest presentation, and most enjoyable accent! More seriously, the scalable Langevin algorithm he developed with Paul Fearnhead, Adam Johansen, and Gareth Roberts, is quite impressive in avoiding computing costly likelihoods. With of course caveats on which targets it applies to. Murali Haran showed a new proposal to handle high dimension random effect models by a projection trick that reduces the dimension. Natesh Pillai introduced us (or at least me!) to a spectral clustering that allowed for an automated partition of the target space, itself the starting point to his parallel MCMC algorithm. Quite exciting, even though I do not perceive partitions as an ideal solution to this problem. The final talk in the session was Galin Jones’ presentation of consistency results and conditions for multivariate quantities which is a surprisingly unexplored domain. MCMC is still alive and running!

The second MCMC session of the morning, Monte Carlo Methods Facing New Challenges in Statistics and Science, was equally diverse, with Lynn Kuo’s talk on the HAWK approach, where we discovered that harmonic mean estimators are still in use, e.g., in MrBayes software employed in phylogenetic inference. The proposal to replace this awful estimator that should never be seen again (!) was rather closely related to an earlier solution of us for marginal likelihood approximation, based there on a partition of the whole space rather than an HPD region in our case… Then, Michael Betancourt brilliantly acted as a proxy for Andrew to present the STAN language, with a flashy trailer he most recently designed. Featuring Andrew as the sole actor. And with great arguments for using it, including the potential to run expectation propagation (as a way of life). In fine, Faming Liang proposed a bootstrap subsampling version of the Metropolis-Hastings algorithm, where the likelihood acknowledging the resulting bias in the limiting distribution.

My first afternoon session was another entry on Statistical Phylogenetics, somewhat continued from yesterday’s session. Making me realised I had not seen a single talk on ABC for the entire meeting! The issues discussed in the session were linked with aligning sequences and comparing  many trees. Again in settings where likelihoods can be computed more or less explicitly. Without any expertise in the matter, I wondered at a construction that would turn all trees, like  into realizations of a continuous model. For instance by growing one branch at a time while removing the MRCA root… And maybe using a particle like method to grow trees. As an aside, Vladimir Minin told me yesterday night about genetic mutations that could switch on and off phenotypes repeatedly across generations… For instance  the ability to glow in the dark for species of deep sea fish.

When stating that I did not see a single talk about ABC, I omitted Steve Fienberg’s Fisher Lecture R.A. Fisher and the Statistical ABCs, keeping the morceau de choix for the end! Even though of course Steve did not mention the algorithm! A was for asymptotics, or ancilarity, B for Bayesian (or biducial??), C for causation (or cuffiency???)… Among other germs, I appreciated that Steve mentioned my great-grand father Darmois in connection with exponential families! And the connection with Jon Wellner’s LeCam Lecture from a few days ago. And reminding us that Savage was a Fisher lecturer himself. And that Fisher introduced fiducial distributions quite early. And for defending the Bayesian perspective. Steve also set some challenges like asymptotics for networks, Bayesian model assessment (I liked the notion of stepping out of the model), and randomization when experimenting with networks. And for big data issues. And for personalized medicine, building on his cancer treatment. No trace of the ABC algorithm, obviously, but a wonderful Fisher’s lecture, also most obviously!! Bravo, Steve, keep thriving!!!

Unbiased Bayes for Big Data: Path of partial posteriors

Posted in Statistics, University life with tags , , , , , , , , , on February 26, 2015 by xi'an

“Data complexity is sub-linear in N, no bias is introduced, variance is finite.”

Heiko Strathman, Dino Sejdinovic and Mark Girolami have arXived a few weeks ago a paper on the use of a telescoping estimator to achieve an unbiased estimator of a Bayes estimator relying on the entire dataset, while using only a small proportion of the dataset. The idea is that a sequence  converging—to an unbiased estimator—of estimators φt can be turned into an unbiased estimator by a stopping rule T:

\sum_{t=1}^T \dfrac{\varphi_t-\varphi_{t-1}}{\mathbb{P}(T\ge t)}

is indeed unbiased. In a “Big Data” framework, the components φt are MCMC versions of posterior expectations based on a proportion αt of the data. And the stopping rule cannot exceed αt=1. The authors further propose to replicate this unbiased estimator R times on R parallel processors. They further claim a reduction in the computing cost of

\mathcal{O}(N^{1-\alpha})\qquad\text{if}\qquad\mathbb{P}(T=t)\approx e^{-\alpha t}

which means that a sub-linear cost can be achieved. However, the gain in computing time means higher variance than for the full MCMC solution:

“It is clear that running an MCMC chain on the full posterior, for any statistic, produces more accurate estimates than the debiasing approach, which by construction has an additional intrinsic source of variance. This means that if it is possible to produce even only a single MCMC sample (…), the resulting posterior expectation can be estimated with less expected error. It is therefore not instructive to compare approaches in that region. “

I first got a “free lunch” impression when reading the paper, namely it sounded like using a random stopping rule was enough to overcome unbiasedness and large size jams. This is not the message of the paper, but I remain both intrigued by the possibilities the unbiasedness offers and bemused by the claims therein, for several reasons: Continue reading

this issue of Series B

Posted in Books, Statistics, Travel, University life with tags , , , , , , , , , , on September 5, 2014 by xi'an

The September issue of [JRSS] Series B I received a few days ago is of particular interest to me. (And not as an ex-co-editor since I was never involved in any of those papers!) To wit: a paper by Hani Doss and Aixin Tan on evaluating normalising constants based on MCMC output, a preliminary version I had seen at a previous JSM meeting, a paper by Nick Polson, James Scott and Jesse Windle on the Bayesian bridge, connected with Nick’s talk in Boston earlier this month, yet another paper by Ariel Kleiner, Ameet Talwalkar, Purnamrita Sarkar and Michael Jordan on the bag of little bootstraps, which presentation I heard Michael deliver a few times when he was in Paris. (Obviously, this does not imply any negative judgement on the other papers of this issue!)

For instance, Doss and Tan consider the multiple mixture estimator [my wording, the authors do not give the method a name, referring to Vardi (1985) but missing the connection with Owen and Zhou (2000)] of k ratios of normalising constants, namely

\sum_{l=1}^k \frac{1}{n_l} \sum_{t=1}^{n_l} \dfrac{n_l g_j(x_t^l)}{\sum_{s=1}^k n_s g_s(x_t^l) z_1/z_s } \longrightarrow \dfrac{z_j}{z_1}

where the z’s are the normalising constants and with possible different numbers of iterations of each Markov chain. An interesting starting point (that Hans Künsch had mentioned to me a while ago but that I had since then forgotten) is that the problem was reformulated by Charlie Geyer (1994) as a quasi-likelihood estimation where the ratios of all z’s relative to one reference density are the unknowns. This is doubling interesting, actually, because it restates the constant estimation problem into a statistical light and thus somewhat relates to the infamous “paradox” raised by Larry Wasserman a while ago. The novelty in the paper is (a) to derive an optimal estimator of the ratios of normalising constants in the Markov case, essentially accounting for possibly different lengths of the Markov chains, and (b) to estimate the variance matrix of the ratio estimate by regeneration arguments. A favourite tool of mine, at least theoretically as practically useful minorising conditions are hard to come by, if at all available.

reading classics (#3)

Posted in Books, Kids, Statistics, University life with tags , , , , , , on November 29, 2013 by xi'an

La Défense from Paris-Dauphine, Nov. 15, 2012This week at the Reading Classics student seminar, Thomas Ounas presented a paper, Statistical inference on massive datasets, written by Li, Lin, and Li, a paper out of The List. (This paper was recently published as Applied Stochastic Models in Business and Industry, 29, 399-409..) I accepted this unorthodox proposal as (a) it was unusual, i.e., this was the very first time a student made this request, and (b) the topic of large datasets and their statistical processing definitely was interesting even though the authors of the paper were unknown to me. The presentation by Thomas was very power-pointish (or power[-point]ful!), with plenty of dazzling transition effects… Even including (a) a Python software replicating the method and (b) a nice little video on internet data transfer protocols. And on a Linux machine! Hence the experiment was worth the try! Even though the paper is a rather unlikely candidate for the list of classics… (And the rendering in static power point no so impressive. Hence a video version available as well…)

The solution adopted by the authors of the paper is one of breaking a massive dataset into blocks so that each fits into the computer(s) memory and of computing a separate estimate for each block. Those estimates are then averaged (and standard-deviationed) without a clear assessment of the impact of this multi-tiered handling of the data. Thomas then built a software to illustrate this approach, with mean and variance and quantiles and densities as quantities of interest. Definitely original! The proposal itself sounds rather basic from a statistical viewpoint: for instance, evaluating the loss in information due to using this blocking procedure requires repeated sampling, which is unrealistic. Or using solely the inter-variance estimates which seems to be missing the intra-variability. Hence to be overly optimistic. Further, strictly speaking, the method does not asymptotically apply to biased estimators, hence neither to Bayes estimators (nor to density estimators). Convergence results are thus somehow formal, in that the asymptotics cannot apply to a finite memory computer. In practice, the difficulty of the splitting technique is rather in breaking the data into blocks since Big Data is rarely made of iid observations. Think of amazon data, for instance. A question actually asked by the class. The method of Li et al. should also include some boostrapping connection. E.g., to Michael’s bag of little bootstraps.

reading classics (#4)

Posted in Statistics, University life with tags , , , , , , , , , , on November 29, 2012 by xi'an

Another read today and not from JRSS B for once, namely,  Efron‘s (an)other look at the Jackknife, i.e. the 1979 bootstrap classic published in the Annals of Statistics. My Master students in the Reading Classics Seminar course thus listened today to Marco Brandi’s presentation, whose (Beamer) slides are here:

In my opinion this was an easier paper to discuss, more because of its visible impact than because of the paper itself, where the comparison with the jackknife procedure does not sound so relevant nowadays. again mostly algorithmic and requiring some background on how it impacted the field. Even though Marco also went through Don Rubin’s Bayesian bootstrap and Michael Jordan bag of little bootstraps, he struggled to get away from the technicality towards the intuition and the relevance of the method. The Bayesian bootstrap extension was quite interesting in that we discussed a lot the connections with Dirichlet priors and the lack of parameters that sounded quite antagonistic with the Bayesian principles. However, at the end of the day, I feel that this foundational paper was not explored in proportion to its depth and that it would be worth another visit.

Michael Jordan à Normale demain

Posted in Statistics, University life with tags , , , , , on October 1, 2012 by xi'an

Tomorrow (Tuesday, Oct. 2), Michael Jordan (Berkeley) will give a seminar at École Normale Supérieure, in French:

Mardi 2 octobre 2012 à 11h15. (Le séminaire sera précédé d’un café à 11 heures)
Salle W, Toits du DMA, Ecole Normale Supérieure, 45 rue d’Ulm, Paris 5e.

Diviser-pour-Régner and Inférence Statistique pour “Big Data”

Michael I. Jordan
Fondation Sciences Mathematiques de Paris & University of California, Berkeley

Dans cet exposé nous présentons quelques résultats récents dans le domaine d’inférence pour “Big Data”. Diviser-pour-régner est un outil essentiel du point de vue computationel pour aborder des problèmes de traitement de données à grande échelle, surtout vu la croissance récente de systèmes distribués, mais ce paradigme présente des difficultés lorsqu’il s’agit d’inférence statistique. Considérons, par exemple, le problème fondamentale d’obtenir des intervalles de confiance pour les estimateurs. Le principe du “bootstrap” suggère d’échantilloner les données à plusieurs reprises pour obtenir des fluctuations et donc des intervalles de confiance, mais ceci est infaisable à grande échelle. Si on fait appel à des sous-échantillons on obtient des fluctuations qui ne sont pas à l’échelle correcte. Nous présentons une approche nouvelle, la “bag of little bootstraps,” qui circonvient ce problème et qui peut être appliquée à des données à grande échelle. Nous parlerons aussi du problème de complétion de matrice à grande échelle, où l’approche de diviser-pour-régner est un outil practique mais qui soulève des problèmes théoriques. Le soutient théorique est fournit par des théorèmes de concentration de matrices aléatoires, et à ce propos nous présentons une approche nouvelle à la concentration de matrices aléatoires basée sur la méthode de Stein. [En collaboration avec Ariel Kleiner, Lester Mackey, Purna Sarkar, Ameet Talwalkar, Richard Chen, Brendan Farrell et Joel Tropp].