In the plane to Montréal, today, I read this paper by Kulkarni, Saeedi and Gershman, which will be presented at AISTATS. The main idea is to create a mix between particle Monte Carlo and a kind of quasiMonte Carlo technique (qNC is not mentionned in the paper), using variational inference (and coordinate ascent) to optimise the location and weight of the particles. It is however restricted to cases with finite support (as a product of N latent variables) as in an HMM with a finite state space. There is also something I do not get in the HMM case, which is that the variational approximation to the filtering is contracted sequentially. This means that at time the K highest weight current particles are selected while the past remains unchanged. Is this due to the Markovian nature of the hidden model? (Blame oxygen deprivation, altitude dizziness or travelling stress, then!) I also fail to understand how for filtering, “at each time step, the algorithm selects the K continuations (new variable assignments of the current particle set) that maximize the variational free energy.” Because the weight function to be optimised (eqn (11)) seems to freeze the whole past path of particles… I presume I will find an opportunity while in Reykjavik to discuss those issues with the authors.
Archive for optimisation
variational particle approximations
Posted in Mountains, pictures, Statistics, Travel, University life with tags AISTATS, Canada, Montréal, Newfoundland, optimisation, particle methods, Reykjavik, variational Bayes methods on February 28, 2014 by xi'anFoundations of Statistical Algorithms [book review]
Posted in Statistics, University life, Books, Linux, R with tags simulation, R, optimisation, randomness, parallelisation, R package, statistical computing, random number generation, GPU, big data, statistical algorithms, scalability, MapReduce, Hadoop on February 28, 2014 by xi'anThere is computational statistics and there is statistical computing. And then there is statistical algorithmic. Not the same thing, by far. This 2014 book by Weihs, Mersman and Ligges, from TU Dortmund, the later being also a member of the R Core team, stands at one end of this wide spectrum of techniques required by modern statistical analysis. In short, it provides the necessary skills to construct statistical algorithms and hence to contribute to statistical computing. And I wish I had the luxury to teach from Foundations of Statistical Algorithms to my graduate students, if only we could afford an extra yearly course…
“Our aim is to enable the reader (…) to quickly understand the main ideas of modern numerical algorithms [rather] than having to memorize the current, and soon to be outdated, set of popular algorithms from computational statistics.”(p.1)
The book is built around the above aim, first presenting the reasons why computers can produce answers different from what we want, using least squares as a mean to check for (in)stability, then second establishing the ground forFishman Monte Carlo methods by discussing (pseudo)random generation, including MCMC algorithms, before moving in third to bootstrap and resampling techniques, and concluding with parallelisation and scalability. The text is highly structured, with frequent summaries, a division of chapters all the way down to subsubsubsections, an R implementation section in each chapter, and a few exercises. Continue reading
optimisation via slice sampling
Posted in Statistics with tags optimisation, SAME algorithm, simulated annealing, slice sampling, temperature schedule on December 20, 2012 by xi'anThis morning, over breakfast; I read the paper recently arXived by John Birge et Nick Polson. I was intrigued by the combination of optimisation and of slice sampling, but got a wee disappointed by the paper, in that it proposes a simple form of simulated annealing, minimising f(x) by targeting a small collection of energy functions exp{τf(x)}. Indeed, the slice sampler is used to explore each of those targets, i.e. for a fixed temperature τ. For the four functions considered in the paper, a slice sampler can indeed be implemented, but this feature could be seen as a marginalia, given that a random walk MetropolisHastings algorithm could be used as a proposal mechanism in other cases. The other intriguing fact is that annealing is not used in the traditional way of increasing the coefficient τ along iterations (as in our SAME algorithm), for which convergence issues are much more intricate, but instead stays at the level where a whole (Markov) sample is used for each temperature, the outcomes being then compared in terms of localisation (and maybe for starting at the next temperature value). I did not see any discussion about the selection of the successive temperatures, which usually is a delicate issue in realistic settings, nor of the stopping rule(s) used to decide the maximum has been reached.
Special Issue of ACM TOMACS on Monte Carlo Methods in Statistics
Posted in Books, R, Statistics, University life with tags ACM Transactions on Modeling and Computer Simulation, Berlin, EM algorithm, importance sampling, integer valued functions, MCMC, Monte Carlos Statistical Methods, optimisation, parallelisation, particle filters, rare events, simulation, WSC 2012 on December 10, 2012 by xi'anAs posted here a long, long while ago, following a suggestion from the editor (and North America Cycling Champion!) Pierre Lécuyer (Université de Montréal), Arnaud Doucet (University of Oxford) and myself acted as guest editors for a special issue of ACM TOMACS on Monte Carlo Methods in Statistics. (Coincidentally, I am attending a board meeting for TOMACS tonight in Berlin!) The issue is now ready for publication (next February unless I am confused!) and made of the following papers:
* Massive parallelization of serial inference algorithms for a complex generalized linear model MARC A. SUCHARD, IVAN ZORYCH, PATRICK RYAN, DAVID MADIGAN 

*Convergence of a Particlebased Approximation of the Block Online Expectation Maximization Algorithm SYLVAIN LE CORFF and GERSENDE FORT 

* Efficient MCMC for Binomial Logit Models AGNES FUSSL, SYLVIA FRÜHWIRTHSCHNATTER, RUDOLF FRÜHWIRTH 

* Adaptive EquiEnergy Sampler: Convergence and Illustration AMANDINE SCHRECK and GERSENDE FORT and ERIC MOULINES 

* Particle algorithms for optimization on binary spaces CHRISTIAN SCHÄFER 

* Posterior expectation of regularly paved random histograms RAAZESH SAINUDIIN, GLORIA TENG, JENNIFER HARLOW, and DOMINIC LEE 

* Small variance estimators for rare event probabilities MICHEL BRONIATOWSKI and VIRGILE CARON 

* SelfAvoiding Random Dynamics on Integer Complex Systems FIRAS HAMZE, ZIYU WANG, and NANDO DE FREITAS 

* Bayesian learning of noisy Markov decision processes SUMEETPAL S. SINGH, NICOLAS CHOPIN, and NICK WHITELEY 
Here is the draft of the editorial that will appear at the beginning of this special issue. (All faults are mine, of course!) Continue reading
more typos in Monte Carlo statistical methods
Posted in Books, Statistics, University life with tags capturerecapture, EM algorithm, frequentist inference, integer set, Jensen's inequality, missing data, Monte Carlo Statistical Methods, optimisation, typos, UNC on October 28, 2011 by xi'anJan Hanning kindly sent me this email about several difficulties with Chapters 3, Monte Carlo Integration, and 5, Monte Carlo Optimization, when teaching out of our book Monte Carlo Statistical Methods [my replies in italics between square brackets, apologies for the late reply and posting, as well as for the confusion thus created. Of course, the additional typos will soon be included in the typo lists on my book webpage.]:
 I seem to be unable to reproduce Table 3.3 on page 88 – especially the chisquare column does not look quite right. [No, they definitely are not right: the true χ² quantiles should be 2.70, 3.84, and 6.63, at the levels 0.1, 0.05, and 0.01, respectively. I actually fail to understand how we got this table that wrong...]
 The second question I have is the choice of the U(0,1) in this Example 3.6. It feels to me that a choice of Beta(23.5,18.5) for p_{1} and Beta(36.5,5.5) for p_{2} might give a better representation based on the data we have. Any comments? [I am plainly uncertain about this... Yours is the choice based on the posterior Beta coefficient distributions associated with Jeffreys prior, hence making the best use of the data. I wonder whether or not we should remove this example altogether... It is certainly "better" than the uniform. However, in my opinion, there is no proper choice for the distribution of the p_{i}'s because we are mixing there a likelihoodratio solution with a Bayesian perspective on the predictive distribution of the likelihoodratio. If anything, this exposes the shortcomings of a classical approach, but it is likely to confuse the students! Anyway, this is a very interesting problem.]
 My students discovered that Problem 5.19 has the following typos, copying from their email: “x_x” should be “x_i” [sure!]. There are a few “( )”s missing here and there [yes!]. Most importantly, the likelihood/density seems incorrect. The normalizing constant should be the reciprocal of the one showed in the book [oh dear, indeed, the constant in the exponential density did not get to the denominator...]. As a result, all the formulas would differ except the ones in part (a). [they clearly need to be rewritten, sorry about this mess!]
 I am unsure about the if and only if part of the Theorem 5.15 [namely that the likelihood sequence is stationary if and only if the Q function in the E step has reached a stationary point]. It appears to me that a condition for the “if part” is missing [the "only if" part is a direct consequence of Jensen's inequality]. Indeed Theorem 1 of Dempster et al 1977 has an extra condition [note that the original proof for convergence of EM has a flaw, as discussed here]. Am I missing something obvious? [maybe: it seems to me that, once Q reaches a fixed point, the likelihood L does not change... It is thus tautological, not a proof of convergence! But the theorem says a wee more, so this needs investigating. As Jan remarked, there is no symmetry in the Q function...]
 Should there be a (nm) in the last term of formula (5.17)? [yes, indeed!, multiply the last term by (nm)]
 Finally, I am a bit confused about the likelihood in Example 5.22 [which is a capturerecapture model]. Assume that H_{ij}=k [meaning the animal i is in state k at time j]. Do you assume that you observed X_{ijr} [which is the capture indicator for animal i at time j in zone k: it is equal to 1 for at most one k] as a Binomial B(n,p_{r}) even for r≠k? [no, we observe all X_{ijr}'s with r≠k equal to zero] The nature of the problem seems to suggest that the answer is no [for other indices, X_{ijr} is always zero, indeed] If that is the case I do not see where the power on top of (1p_{k}) in the middle of the page 185 comes from [when the capture indices are zero, they do not contribute to the sum, which explains for this condensed formula. Therefore, I do not think there is anything wrong with this overparameterised representation of the missing variables.]
 In Section 5.3.4, there seems to be a missing minus sign in the approximation formula for the variance [indeed, shame on us for missing the minus in the observed information matrix!]
 I could not find the definition of in Theorem 6.15. Is it all natural numbers or all integers? May be it would help to include it in Appendix B. [Surprising! This is the set of all positive integers, I thought this was a standard math notation...]
 In Definition 6.27, you probably want to say covering of A and not X. [Yes, we were already thinking of the next theorem, most likely!]
 In Proposition 6.33  all x in A instead of all x in X. [Yes, again! As shown in the proof. Even though it also holds for all x in X]
Thanks a ton to Jan and to his UNC students (and apologies for leading them astray with those typos!!!)