Archive for the Books Category

Le Monde puzzle [#860]

Posted in Books, Kids, R with tags , , , , on April 4, 2014 by xi'an

A Le Monde mathematical puzzle that connects to my awalé post of last year:

For N≤18, N balls are placed in N consecutive holes. Two players, Alice and Bob, consecutively take two balls at a time provided those balls are in contiguous holes. The loser is left with orphaned balls. What is the values of N such that Bob can win, no matter what is Alice’s strategy?

I solved this puzzle by the following R code that works recursively on N by eliminating all possible adjacent pairs of balls and checking whether or not there is a winning strategy for the other player.

# return 1 if current player can win, 0 otherwise

  if (max(awale[-1]*awale[-N])==1){
  #there are adjacent balls remaining

   for (i in (1:(N-1))[awale[1:(N-1)]==1]){

    if (awale[i+1]==1){

for (N in 2:18) print(topA(rep(1,N)))

which returns the solution

[1] 1
[1] 1
[1] 1
[1] 0
[1] 1
[1] 1
[1] 1
[1] 0
[1] 1
[1] 1
[1] 1
[1] 1
[1] 1
[1] 0
[1] 1
[1] 1
[1] 1

(brute-force) answering the question that N=5,9,15 are the values where Alice has no winning strategy if Bob plays in an optimal manner. (The case N=5 is obvious as there always remains two adjacent 1′s once Alice removed any adjacent pair. The case N=9 can also be shown to be a lost cause by enumeration of Alice’s options.)

[more] parallel MCMC

Posted in Books, Mountains with tags , , , , , , , , , , on April 3, 2014 by xi'an

Scott Schmidler and his Ph.D. student Douglas VanDerwerken have arXived a paper on parallel MCMC the very day I left for Chamonix, prior to MCMSki IV, so it is no wonder I missed it at the time. This work is somewhat in the spirit of the parallel papers Scott et al.’s consensus Bayes,  Neiswanger et al.’s embarrassingly parallel MCMC, Wang and Dunson’s Weierstrassed MCMC (and even White et al.’s parallel ABC), namely that the computation of the likelihood can be broken into batches and MCMC run over those batches independently. In their short survey of previous works on parallelization, VanDerwerken and Schmidler overlooked our neat (!) JCGS Rao-Blackwellisation with Pierre Jacob and Murray Smith, maybe because it sounds more like post-processing than genuine parallelization (in that it does not speed up the convergence of the chain but rather improves the Monte Carlo usages one can make of this chain), maybe because they did not know of it.

“This approach has two shortcomings: first, it requires a number of independent simulations, and thus processors, equal to the size of the partition; this may grow exponentially in dim(Θ). Second, the rejection often needed for the restriction doesn’t permit easy evaluation of transition kernel densities, required below. In addition, estimating the relative weights wi with which they should be combined requires care.” (p.3)

The idea of the authors is to replace an exploration of the whole space operated via a single Markov chain (or by parallel chains acting independently which all have to “converge”) with parallel and independent explorations of parts of the space by separate Markov chains. “Small is beautiful”: it takes a shorter while to explore each set of the partition, hence to converge, and, more importantly, each chain can work in parallel to the others. More specifically, given a partition of the space, into sets Ai with posterior weights wi, parallel chains are associated with targets equal to the original target restricted to those Ai‘s. This is therefore an MCMC version of partitioned sampling. With regard to the shortcomings listed in the quote above, the authors consider that there does not need to be a bijection between the partition sets and the chains, in that a chain can move across partitions and thus contribute to several integral evaluations simultaneously. I am a bit worried about this argument since it amounts to getting a random number of simulations within each partition set Ai. In my (maybe biased) perception of partitioned sampling, this sounds somewhat counter-productive, as it increases the variance of the overall estimator. (Of course, not restricting a chain to a given partition set Ai has the incentive of avoiding a possibly massive amount of rejection steps. It is however unclear (a) whether or not it impacts ergodicity (it all depends on the way the chain is constructed, i.e. against which target(s)…) as it could lead to an over-representation of some boundaries and (b) whether or not it improves the overall convergence properties of the chain(s).)

“The approach presented here represents a solution to this problem which can completely remove the waiting times for crossing between modes, leaving only the relatively short within-mode equilibration times.” (p.4)

A more delicate issue with the partitioned MCMC approach (in my opinion!) stands with the partitioning. Indeed, in a complex and high-dimension model, the construction of the appropriate partition is a challenge in itself as we often have no prior idea where the modal areas are. Waiting for a correct exploration of the modes is indeed faster than waiting for crossing between modes, provided all modes are represented and the chain for each partition set Ai has enough energy to explore this set. It actually sounds (slightly?) unlikely that a target with huge gaps between modes will see a considerable improvement from the partioned version when the partition sets Ai are selected on the go, because some of the boundaries between the partition sets may be hard to reach with a off-the-shelf proposal. (Obviously, the second part of the method on the adaptive construction of partitions is yet in the writing and I am looking forward its aXival!)

Furthermore, as noted by Pierre Jacob (of Statisfaction fame!), the adaptive construction of the partition has a lot in common with Wang-Landau schemes. Which goal is to produce a flat histogram proposal from the current exploration of the state space. Connections with Atchadé’s and Liu’s (2010, Statistical Sinica) extension of the original Wang-Landau algorithm could have been spelled out. Esp. as the Voronoï tessellation construct seems quite innovative in this respect.

firefly Monte Carlo

Posted in Books, Statistics, University life with tags , , , on April 2, 2014 by xi'an

And here is yet another arXived paper using a decomposition of the posterior distirbution as a product of terms to run faster, better and higher MCMC algorithms! This one is by Douglas Maclaurin and Ryan Adams: “Firefly Monte Carlo: Exact MCMC with Subsets of Data“. (While a swarm of fireflies make sense to explain the name, I may miss some cultural subliminal meaning in the title as Firefly and Monte Carlo seem to be places in Las Vegas (?), and a car brand, Firefly is a TV series, a clothes brand, and maybe other things…)

“The evolution of the chain evokes an image of fireflies, as the individual data blink on and out due to updates of the zn.”

The fundamental assumption of Maclaurin’s and Adams’ approach is that each product term in the likelihood (expressed as a product) can be bounded by a cheaper lower bound. This lower bound is used to create a Bernoulli auxiliary variable with probability equal to the ratio of the lower bound to the likelihood term, auxiliary variable that helps to reduce the number of evaluations of the original likelihood terms. Obviously, there is a gain only if (a) the lower bound is close or tight enough and (b) simulating the auxiliary variables is cheap enough.

About (a), the paper gives the tight example of a logistic, with a case of a 98% tightness. How generic is that and how those bounds can be derived in a cheap or automated manner? If one needs to run a variational Bayes approximation first, the gain in efficiency is unlikely to hold. About (b), I do not fully get it: if generating zn requires the evaluation of the original likelihood we loose the entire appeal of the method. Admittedly, I can see the point in changing a very small portion α of the zn‘s between moves on the parameter θ, since the number of likelihood evaluations is the same portion α of the total number of terms N. But decreasing the portion α is also reducing the mixing efficiency of the algorithm. In the efficient ways of updating the auxiliary brightness variables (ways proposed in the paper), I get the idea of making a proposal first before eventually computing the true probability of a Bernoulli. A proposal making use of the previous value of the probability (i.e. for the previous value of the parameter θ) could also reduce the number of evaluations of likelihood terms. However, using a “cached” version of the likelihood is only relevant within the same simulation step since a change in θ requires recomputing the likelihood.

“In each experiment we compared FlyMC, with two choices of bound selection, to regular full-posterior MCMC. We looked at the average number of likelihoods queried at each iteration and the number of effective samples generated per iteration, accounting for autocorrelation.”

This comparison does not seem adequate to me: by construction, the algorithm in the paper reduces the number of likelihood evaluations, so this is not a proper comparative instrument. The effective sample size is a transform of the correlation, not an indicator of convergence. For instance, if the zn‘s were hardly to change between iterations, thus the overall sampler was definitely far from converging, we would get θ’s simulated from almost the same distribution, hence being uncorrelated. In other words, if the joint chain in (θ,zn) does not converge, it is harder to establish that the subchain in θ converges at all. Indeed, in this logistic example where the computation of the likelihood is not a massive constraint, I am surprised there is any possibility of a huge gain in using the method, unless the lower bound is essentially the likelihood, which is actually  the case for logistic regression models. Another point made by Dan Simpson is that the whole dataset needs to remain on-hold, full-time, which may be a challenge to the computer memory. And stops short of providing really Big Data solutions.

penalising model component complexity

Posted in Books, Mountains, pictures, Statistics, University life with tags , , , , , , , , , , on April 1, 2014 by xi'an

“Prior selection is the fundamental issue in Bayesian statistics. Priors are the Bayesian’s greatest tool, but they are also the greatest point for criticism: the arbitrariness of prior selection procedures and the lack of realistic sensitivity analysis (…) are a serious argument against current Bayesian practice.” (p.23)

A paper that I first read and annotated in the very early hours of the morning in Banff, when temperatures were down in the mid minus 20′s now appeared on arXiv, “Penalising model component complexity: A principled, practical approach to constructing priors” by Thiago Martins, Dan Simpson, Andrea Riebler, Håvard Rue, and Sigrunn Sørbye. It is a highly timely and pertinent paper on the selection of default priors! Which shows that the field of “objective” Bayes is still full of open problems and significant advances and makes a great argument for the future president [that I am] of the O’Bayes section of ISBA to encourage young Bayesian researchers to consider this branch of the field.

“On the other end of the hunt for the holy grail, “objective” priors are data-dependent and are not uniformly accepted among Bayesians on philosophical grounds.” (p.2)

Apart from the above quote, as objective priors are not data-dependent! (this is presumably a typo, used instead of model-dependent), I like very much the introduction (appreciating the reference to the very recent Kamary (2014) that just got rejected by TAS for quoting my blog post way too much… and that we jointly resubmitted to Statistics and Computing). Maybe missing the alternative solution of going hierarchical as far as needed and ending up with default priors [at the top of the ladder]. And not discussing the difficulty in specifying the sensitivity of weakly informative priors.

“Most model components can be naturally regarded as a flexible version of a base model.” (p.3)

The starting point for the modelling is the base model. How easy is it to define this base model? Does it [always?] translate into a null hypothesis formulation? Is there an automated derivation? I assume this somewhat follows from the “block” idea that I do like but how generic is model construction by blocks?


“Occam’s razor is the principle of parsimony, for which simpler model formulations should be preferred until there is enough support for a more complex model.” (p.4)

I also like this idea of putting a prior on the distance from the base! Even more because it is parameterisation invariant (at least at the hyperparameter level). (This vaguely reminded me of a paper we wrote with George a while ago replacing tests with distance evaluations.) And because it gives a definitive meaning to Occam’s razor. However, unless the hyperparameter ξ is one-dimensional this does not define a prior on ξ per se. I equally like Eqn (2) as it shows how the base constraint takes one away from Jeffrey’s prior. Plus, if one takes the Kullback as an intrinsic loss function, this also sounds related to Holmes’s and Walker’s substitute loss pseudopriors, no? Now, eqn (2) does not sound right in the general case. Unless one implicitly takes a uniform prior on the Kullback sphere of radius d? There is a feeling of one-d-ness in the description of the paper (at least till page 6) and I wanted to see how it extends to models with many (≥2) hyperparameters. Until I reached Section 6 where the authors state exactly that! There is also a potential difficulty in that d(ξ) cannot be computed in a general setting. (Assuming that d(ξ) has a non-vanishing Jacobian as on page 19 sounds rather unrealistic.) Still about Section 6, handling reference priors on correlation matrices is a major endeavour, which should produce a steady flow of followers..!

“The current practice of prior specification is, to be honest, not in a good shape. While there has been a strong growth of Bayesian analysis in science, the research field of “practical prior specification” has been left behind.” (*p.23)

There are still quantities to specify and calibrate in the PC priors, which may actually be deemed a good thing by Bayesians (and some modellers). But overall I think this paper and its message constitute a terrific step for Bayesian statistics and I hope the paper can make it to a major journal.

Bayesian Data Analysis [BDA3 - part #2]

Posted in Books, Kids, R, Statistics, University life with tags , , , , , , , , on March 31, 2014 by xi'an

Here is the second part of my review of Gelman et al.’ Bayesian Data Analysis (third edition):

“When an iterative simulation algorithm is “tuned” (…) the iterations will not in general converge to the target distribution.” (p.297)

Part III covers advanced computation, obviously including MCMC but also model approximations like variational Bayes and expectation propagation (EP), with even a few words on ABC. The novelties in this part are centred at Stan, the language Andrew is developing around Hamiltonian Monte Carlo techniques, a sort of BUGS of the 10′s! (And of course Hamiltonian Monte Carlo techniques themselves. A few (nit)pickings: the book advises important resampling without replacement (p.266) which makes some sense when using a poor importance function but ruins the fundamentals of importance sampling. Plus, no trace of infinite variance importance sampling? of harmonic means and their dangers? In the Metropolis-Hastings algorithm, the proposal is called the jumping rule and denoted by Jt, which, besides giving the impression of a Jacobian, seems to allow for time-varying proposals and hence time-inhomogeneous Markov chains, which convergence properties are much hairier. (The warning comes much later, as exemplified in the above quote.) Moving from “burn-in” to “warm-up” to describe the beginning of an MCMC simulation. Being somewhat 90′s about convergence diagnoses (as shown by the references in Section 11.7), although the book also proposes new diagnoses and relies much more on effective sample sizes. Particle filters are evacuated in hardly half-a-page. Maybe because Stan does not handle particle filters. A lack of intuition about the Hamiltonian Monte Carlo algorithms, as the book plunges immediately into a two-page pseudo-code description. Still using physics vocabulary that put me (and maybe only me) off. Although I appreciated the advice to check analytical gradients against their numerical counterpart.

“In principle there is no limit to the number of levels of variation that can be handled in this way. Bayesian methods provide ready guidance in handling the estimation of the unknown parameters.” (p.381)

I also enjoyed reading the part about modes that stand at the boundary of the parameter space (Section 13.2), even though I do not think modes are great summaries in Bayesian frameworks and while I do not see how picking the prior to avoid modes at the boundary avoids the data impacting the prior, in fine. The variational Bayes section (13.7) is equally enjoyable, with a proper spelled-out illustration, introducing an unusual feature for Bayesian textbooks.  (Except that sampling without replacement is back!) Same comments for the Expectation Propagation (EP) section (13.8) that covers brand new notions. (Will they stand the test of time?!)

“Geometrically, if β-space is thought of as a room, the model implied by classical model selection claims that the true β has certain prior probabilities of being in the room, on the floor, on the walls, in the edge of the room, or in a corner.” (p.368)

Part IV is a series of five chapters about regression(s). This is somewhat of a classic, nonetheless  Chapter 14 surprised me with an elaborate election example that dabbles in advanced topics like causality and counterfactuals. I did not spot any reference to the g-prior or to its intuitive justifications and the chapter mentions the lasso as a regularisation technique, but without any proper definition of this “popular non-Bayesian form of regularisation” (p.368). In French: with not a single equation! Additional novelty may lie in the numerical prior information about the correlations. What is rather crucially (cruelly?) missing though is a clearer processing of variable selection in regression models. I know Andrew opposes any notion of a coefficient being exactly equal to zero, as ridiculed through the above quote, but the book does not reject model selection, so why not in this context?! Chapter 15 on hierarchical extensions stresses the link with exchangeability, once again. With another neat election example justifying the progressive complexification of the model and the cranks and toggles of model building. (I am not certain the reparameterisation advice on p.394 is easily ingested by a newcomer.) The chapters on robustness (Chap. 17) and missing data (Chap. 18) sound slightly less convincing to me, esp. the one about robustness as I never got how to make robustness agree with my Bayesian perspective. The book states “we do not have to abandon Bayesian principles to handle outliers” (p.436), but I would object that the Bayesian paradigm compels us to define an alternative model for those outliers and the way they are produced. One can always resort to a drudging exploration of which subsample of the dataset is at odds with the model but this may be unrealistic for large datasets and further tells us nothing about how to handle those datapoints. The missing data chapter is certainly relevant to such a comprehensive textbook and I liked the survey illustration where the missing data was in fact made of missing questions. However, I felt the multiple imputation part was not well-presented, fearing readers would not understand how to handle it…

“You can use MCMC, normal approximation, variational Bayes, expectation propagation, Stan, or any other method. But your fit must be Bayesian.” (p.517)

Part V concentrates the most advanced material, with Chapter 19 being mostly an illustration of a few complex models, slightly superfluous in my opinion, Chapter 20 a very short introduction to functional bases, including a basis selection section (20.2) that implements the “zero coefficient” variable selection principle refuted in the regression chapter(s), and does not go beyond splines (what about wavelets?), Chapter 21 a (quick) coverage of Gaussian processes with the motivating birth-date example (and two mixture datasets I used eons ago…), Chapter 22 a more (too much?) detailed study of finite mixture models, with no coverage of reversible-jump MCMC, and Chapter 23 an entry on Bayesian non-parametrics through Dirichlet processes.

“In practice, for well separated components, it is common to remain stuck in one labelling across all the samples that are collected. One could argue that the Gibbs sampler has failed in such a case.” (p.535)

To get back to mixtures, I liked the quote about the label switching issue above, as I was “one” who argued that the Gibbs sampler fails to converge! The corresponding section seems to favour providing a density estimate for mixture models, rather than component-wise evaluations, but it nonetheless mentions the relabelling by permutation approach (if missing our 2000 JASA paper). The section about inferring on the unknown number of components suggests conducting a regular Gibbs sampler on a model with an upper bound on the number of components and then checking for empty components, an idea I (briefly) considered in the mid-1990′s before the occurrence of RJMCMC. Of course, the prior on the components matters and the book suggests using a Dirichlet with fixed sum like 1 on the coefficients for all numbers of components.

“14. Objectivity and subjectivity: discuss the statement `People tend to believe results that support their preconceptions and disbelieve results that surprise them. Bayesian methods tend to encourage this undisciplined mode of thinking.’¨ (p.100)

Obviously, this being a third edition begets the question, what’s up, doc?!, i.e., what’s new [when compared with the second edition]? Quite a lot, even though I am not enough of a Gelmanian exegist to produce a comparision table. Well, for a starter, David Dunson and Aki Vethtari joined the authorship, mostly contributing to the advanced section on non-parametrics, Gaussian processes, EP algorithms. Then the Hamiltonian Monte Carlo methodology and Stan of course, which is now central to Andrew’s interests. The book does include a short Appendix on running computations in R and in Stan. Further novelties were mentioned above, like the vision of weakly informative priors taking over noninformative priors but I think this edition of Bayesian Data Analysis puts more stress on clever and critical model construction and on the fact that it can be done in a Bayesian manner. Hence the insistence on predictive and cross-validation tools. The book may be deemed somewhat short on exercices, providing between 3 and 20 mostly well-developed problems per chapter, often associated with datasets, rather than the less exciting counter-example above. Even though Andrew disagrees and his students at ENSAE this year certainly did not complain, I personally feel a total of 220 exercices is not enough for instructors and self-study readers. (At least, this reduces the number of email requests for solutions! Esp. when 50 of those are solved on the book website.) But this aspect is a minor quip: overall this is truly the reference book for a graduate course on Bayesian statistics and not only Bayesian data analysis.


Get every new post delivered to your Inbox.

Join 551 other followers