Anoop Korattikara, Yutian Chen and Max Welling recently arXived a paper on the appeal of using only part of the data to speed up MCMC. This is different from the growing literature on unbiased estimators of the likelihood exemplified by Andrieu & Roberts (2009). Here, the approximation to the true target is akin to the approximation in ABC algorithms in that a value of the parameter is accepted if the difference in the likelihoods is larger than a given bound. Expressing this perspective as a test on the mean of the log likelihood leads the authors to use instead a subsample from the whole sample. (The approximation level ε is then a bound on the p-value.) While this idea only applies to iid settings, it is quite interesting and sounds a wee bit like a bootstrapped version of MCMC. Especially since it sounds as if it could provide an auto-evaluation of its error.
Archive for bootstrap
Next month, Michael Jordan will give an advanced course at CREST-ENSAE, Paris, on Recent Advances at the Interface of Computation and Statistics. The course will take place on April 4 (14:00, ENSAE, Room #11), 11 (14:00, ENSAE, Room #11), 15 (11:00, ENSAE, Room #11) and 18 (14:00, ENSAE, Room #11). It is open to everyone and attendance is free. The only constraint is a compulsory registration with Nadine Guedj (email: guedj[AT]ensae.fr) for security issues. I strongly advise all graduate students who can take advantage of this fantastic opportunity to grasp it! Here is the abstract to the course:
“I will discuss several recent developments in areas where statistical science meets computational science, with particular concern for bringing statistical inference into contact with distributed computing architectures and with recursive data structures :
How does one obtain confidence intervals in massive data sets? The bootstrap principle suggests resampling data to obtain fluctuations in the values of estimators, and thereby confidence intervals, but this is infeasible computationally with massive data. Subsampling the data yields fluctuations on the wrong scale, which have to be corrected to provide calibrated statistical inferences. I present a new procedure, the “bag of little bootstraps,” which circumvents this problem, inheriting the favorable theoretical properties of the bootstrap but also having a much more favorable computational profile.
The problem of matrix completion has been the focus of much recent work, both theoretical and practical. To take advantage of distributed computing architectures in this setting, it is natural to consider divide-and-conquer algorithms for matrix completion. I show that these work well in practice, but also note that new theoretical problems arise when attempting to characterize the statistical performance of these algorithms. Here the theoretical support is provided by concentration theorems for random matrices, and I present a new approach to matrix concentration based on Stein’s method.
Bayesian nonparametrics involves replacing the “prior distributions” of classical Bayesian analysis with “prior stochastic processes.” Of particular value are the class of “combinatorial stochastic processes,” which make it possible to express uncertainty (and perform inference) over combinatorial objects that are familiar as data structures in computer science.”
References are available on Michael’s homepage.
I was attending a lecture this morning at CREST by Patrice Bertail where he was using estimated renewal parameters on a Markov chain to build (asymptotically) convergent bootstrap procedures. Estimating renewal parameters is obviously of interest in MCMC algorithms as they can be used to assess the convergence of the associated Markov chain: That is, if the estimation does not induce a significant bias. Another question that came to me during the talk is that; since those convergence assessments techniques are formally holding for any small set, choosing the small set in order to maximise the renewal rate also maximises the number of renewal events and hence the number of terms in the control sequence: Thus, the maximal renewal rate þ is definitely a quantity of interest: Now, is this quantity þ an intrinsic parameter of the chain, i.e. a quantity that drives its mixing and/or converging behaviour(s)? For instance; an iid sequence has a renewal rate of 1; because the whole set is a “small” set. Informally, the time between two consecutive renewal events is akin to the time between two simulations from the target and stationary distribution, according to the Kac’s representation we used in our AAP paper with Jim Hobert. So it could be that þ is directly related with the effective sample size of the chain, hence the autocorrelation. (A quick web search did not produce anything relevant:) Too bad this question did not pop up last week when I had the opportunity to discuss it with Sean Meyn in Gainesville!
On the morning I returned from Varanasi and the ISBA meeting there, I had to give my R final exam (along with three of my colleagues in Paris-Dauphine). This year, the R course was completely in English, exam included, which means I can post it here as it may attract more interest than the French examens of past years…
I just completed grading my 32 copies, all from exam A, which takes a while as I have to check (and sometimes recover) the R code, and often to correct the obvious mistakes to see if the deeper understanding of the concepts is there. This year student cohort is surprisingly homogeneous: I did not spot any of the horrors I may have mentioned in previous posts.
I must alas acknowledge a grievous typo in the version of Exam B that was used the day of the final: cutting-and-pasting from A to B, I forgot to change the parameters in Exercise 2, asking them to simulate a Gamma(0,1). It is only after half an hour that a bright student pointed out the impossibility… We had tested the exams prior to printing them but this somehow escaped the four of us!
Now, as I was entering my grades into the global spreadsheet, I noticed a perfect… lack of correlation between those and the grades at the midterm exam. I wonder what that means: I could be grading at random, the levels in November and in January could be uncorrelated, some students could have cheated in November and others in January, student’s names or file names got mixed up, …? A rather surprising outcome!
Another read today and not from JRSS B for once, namely, Efron‘s (an)other look at the Jackknife, i.e. the 1979 bootstrap classic published in the Annals of Statistics. My Master students in the Reading Classics Seminar course thus listened today to Marco Brandi’s presentation, whose (Beamer) slides are here:
In my opinion this was an easier paper to discuss, more because of its visible impact than because of the paper itself, where the comparison with the jackknife procedure does not sound so relevant nowadays. again mostly algorithmic and requiring some background on how it impacted the field. Even though Marco also went through Don Rubin’s Bayesian bootstrap and Michael Jordan bag of little bootstraps, he struggled to get away from the technicality towards the intuition and the relevance of the method. The Bayesian bootstrap extension was quite interesting in that we discussed a lot the connections with Dirichlet priors and the lack of parameters that sounded quite antagonistic with the Bayesian principles. However, at the end of the day, I feel that this foundational paper was not explored in proportion to its depth and that it would be worth another visit.
Menéndez, Fan, Garthwaite, and Sisson—whom I heard in Adelaide on that subject—posted yesterday a paper on arXiv about correcting the frequentist coverage of default intervals toward their nominal level. Given such an interval [L(x),U(x)], the correction for proper frequentist coverage is done by parametric bootstrap, i.e. by simulating n replicas of the original sample from the pluggin density f(.|θ*) and deriving the empirical cdf of L(y)-θ*. And of U(y)-θ*. Under the assumption of consistency of the estimate θ*, this ensures convergence (in the original sampled size) of the corrected bounds.
Since ABC is based on the idea that pseudo data can be simulated from f(.|θ) for any value of θ, the concept “naturally” applies to ABC outcomes, as illustrated in the paper by a g-and-k noise MA(1) model. (As noted by the authors, there always is some uncertainty with the consistency of the ABC estimator.) However, there are a few caveats:
- ABC usually aims at approximating the posterior distribution (given the summary statistics), of which the credible intervals are an inherent constituent. Hence, attempts at recovering a frequentist coverage seem contradictory with the original purpose of the method. Obviously, if ABC is instead seen as an inference method per se, like indirect inference, this objection does not hold.
- Then, once the (umbilical) link with Bayesian inference is partly severed, there is no particular reason to stick to credible sets for [L(x),U(x)]. A more standard parametric bootstrap approach, based on the bootstrap distribution of θ*, should work as well. This means that a comparison with other frequentist methods like indirect inference could be relevant.
- At last, and this is also noted by the authors, the method may prove extremely expensive. If the bounds L(x) and U(x) are obtained empirically from an ABC sample, a new ABC computation must be associated with each one of the n replicas of the original sample. It would be interesting to compare the actual coverages of this ABC-corrected method with a more direct parametric bootstrap approach.
Tomorrow (Tuesday, Oct. 2), Michael Jordan (Berkeley) will give a seminar at École Normale Supérieure, in French:
Mardi 2 octobre 2012 à 11h15. (Le séminaire sera précédé d’un café à 11 heures)
Salle W, Toits du DMA, Ecole Normale Supérieure, 45 rue d’Ulm, Paris 5e.
Diviser-pour-Régner and Inférence Statistique pour “Big Data”
Michael I. Jordan
Fondation Sciences Mathematiques de Paris & University of California, Berkeley
Dans cet exposé nous présentons quelques résultats récents dans le domaine d’inférence pour “Big Data”. Diviser-pour-régner est un outil essentiel du point de vue computationel pour aborder des problèmes de traitement de données à grande échelle, surtout vu la croissance récente de systèmes distribués, mais ce paradigme présente des difficultés lorsqu’il s’agit d’inférence statistique. Considérons, par exemple, le problème fondamentale d’obtenir des intervalles de confiance pour les estimateurs. Le principe du “bootstrap” suggère d’échantilloner les données à plusieurs reprises pour obtenir des fluctuations et donc des intervalles de confiance, mais ceci est infaisable à grande échelle. Si on fait appel à des sous-échantillons on obtient des fluctuations qui ne sont pas à l’échelle correcte. Nous présentons une approche nouvelle, la “bag of little bootstraps,” qui circonvient ce problème et qui peut être appliquée à des données à grande échelle. Nous parlerons aussi du problème de complétion de matrice à grande échelle, où l’approche de diviser-pour-régner est un outil practique mais qui soulève des problèmes théoriques. Le soutient théorique est fournit par des théorèmes de concentration de matrices aléatoires, et à ce propos nous présentons une approche nouvelle à la concentration de matrices aléatoires basée sur la méthode de Stein. [En collaboration avec Ariel Kleiner, Lester Mackey, Purna Sarkar, Ameet Talwalkar, Richard Chen, Brendan Farrell et Joel Tropp].