## Hamming Ball Sampler

Posted in Books, Statistics, University life with tags , , , , , on May 7, 2015 by xi'an

Michalis Titsias and Christopher Yau just arXived a paper entitled the Hamming Ball sampler. Aimed at large and complex discrete latent variable models. The completion method is called after Richard Hamming, who is associated with code correcting methods (reminding me of one of the Master courses I took on coding, 30 years ago…), because it uses the Hamming distance in a discrete version of the slice sampler. One of the reasons for this proposal is that conditioning upon the auxiliary slice variable allows for the derivation of normalisation constants otherwise unavailable. The method still needs some calibration in the choice of blocks that partition the auxiliary variable and in the size of the ball. One of the examples assessed in the paper is a variable selection problem with 1200 covariates, out of which only 2 are relevant, while another example deals with a factorial HMM, involving 10 hidden chains. Since the paper compares each example with the corresponding block Gibbs sampling solution, it means this Gibbs sampling version is not intractable. It would be interesting to see a case where the alternative is not available…

## the most patronizing start to an answer I have ever received

Posted in Books, Kids, R, Statistics, University life with tags , , , , , , on April 30, 2015 by xi'an

Another occurrence [out of many!] of a question on X validated where the originator (primitivus petitor) was trying to get an explanation without the proper background. On either Bayesian statistics or simulation. The introductory sentence to the question was about “trying to understand how the choice of priors affects a Bayesian model estimated using MCMC” but the bulk of the question was in fact failing to understand an R code for a random-walk Metropolis-Hastings algorithm for a simple regression model provided in a introductory blog by Florian Hartig. And even more precisely about confusing the R code dnorm(b, sd = 5, log = T) in the prior with rnorm(1,mean=b, sd = 5, log = T) in the proposal…

“You should definitely invest some time in learning the bases of Bayesian statistics and MCMC methods from textbooks or on-line courses.” X

So I started my answer with the above warning. Which sums up my feelings about many of those X validated questions, namely that primitivi petitores lack the most basic background to consider such questions. Obviously, I should not have bothered with an answer, but it was late at night after a long day, a good meal at the pub in Kenilworth, and a broken toe still bothering me. So I got this reply from the primitivus petitor that it was a patronizing piece of advice and he prefers to learn from R code than from textbooks and on-line courses, having “looked through a number of textbooks”. Good luck with this endeavour then!

## mixtures of mixtures

Posted in pictures, Statistics, University life with tags , , , , , , , , , on March 9, 2015 by xi'an

And yet another arXival of a paper on mixtures! This one is written by Gertraud Malsiner-Walli, Sylvia Frühwirth-Schnatter, and Bettina Grün, from the Johannes Kepler University Linz and the Wirtschaftsuniversitat Wien I visited last September. With the exact title being Identifying mixtures of mixtures using Bayesian estimation.

So, what is a mixture of mixtures if not a mixture?! Or if not only a mixture. The upper mixture level is associated with clusters, while the lower mixture level is used for modelling the distribution of a given cluster. Because the cluster needs to be real enough, the components of the mixture are assumed to be heavily overlapping. The paper thus spends a large amount of space on detailing the construction of the associated hierarchical prior. Which in particular implies defining through the prior what a cluster means. The paper also connects with the overfitting mixture idea of Rousseau and Mengersen (2011, Series B). At the cluster level, the Dirichlet hyperparameter is chosen to be very small, 0.001, which empties superfluous clusters but sounds rather arbitrary (which is the reason why we did not go for such small values in our testing/mixture modelling). On the opposite, the mixture weights have an hyperparameter staying (far) away from zero. The MCMC implementation is based on a standard Gibbs sampler and the outcome is analysed and sorted by estimating the “true” number of clusters as the MAP and by selecting MCMC simulations conditional on that value. From there clusters are identified via the point process representation of a mixture posterior. Using a standard k-means algorithm.

The remainder of the paper illustrates the approach on simulated and real datasets. Recovering in those small dimension setups the number of clusters used in the simulation or found in other studies. As noted in the conclusion, using solely a Gibbs sampler with such a large number of components is rather perilous since it may get stuck close to suboptimal configurations. Especially with very small Dirichlet hyperparameters.

## Unbiased Bayes for Big Data: Path of partial posteriors [a reply from the authors]

Posted in Statistics, University life with tags , , , , , , , , , on February 27, 2015 by xi'an

[Here is a reply by Heiko Strathmann to my post of yesterday. Along with the slides of a talk in Oxford mentioned in the discussion.]

Thanks for putting this up, and thanks for the discussion. Christian, as already exchanged via email, here are some answers to the points you make.

First of all, we don’t claim a free lunch — and are honest with the limitations of the method (see negative examples). Rather, we make the point that we can achieve computational savings in certain situations — essentially exploiting redundancy (what Michael called “tall” data in his note on subsampling & HMC) leading to fast convergence of posterior statistics.

Dan is of course correct noticing that if the posterior statistic does not converge nicely (i.e. all data counts), then truncation time is “mammoth”. It is also correct that it might be questionable to aim for an unbiased Bayesian method in the presence of such redundancies. However, these are the two extreme perspectives on the topic. The message that we want to get along is that there is a trade-off in between these extremes. In particular the GP examples illustrate this nicely as we are able to reduce MSE in a regime where posterior statistics have *not* yet stabilised, see e.g. figure 6.

“And the following paragraph is further confusing me as it seems to imply that convergence is not that important thanks to the de-biasing equation.”

To clarify, the paragraph refers to the additional convergence issues induced by alternative Markov transition kernels of mini-batch-based full posterior sampling methods by Welling, Bardenet, Dougal & co. For example, Firefly MC’s mixing time is increased by a factor of 1/q where q*N is the mini-batch size. Mixing of stochastic gradient Langevin gets worse over time. This is not true for our scheme as we can use standard transition kernels. It is still essential for the partial posterior Markov chains to converge (if MCMC is used). However, as this is a well studied problem, we omit the topic in our paper and refer to standard tools for diagnosis. All this is independent of the debiasing device.

Yesterday in Oxford, Pierre Jacob pointed out that if MCMC is used for estimating partial posterior statistics, the overall result is not unbiased. We had a nice discussion how this bias could be addressed via a two-stage debiasing procedure: debiasing the MC estimates as described in the “Unbiased Monte Carlo” paper by Agapiou et al, and then plugging those into the path estimators — though it is (yet) not so clear how (and whether) this would work in our case.
In the current version of the paper, we do not address the bias present due to MCMC. We have a paragraph on this in section 3.2. Rather, we start from a premise that full posterior MCMC samples are a gold standard. Furthermore, the framework we study is not necessarily linked to MCMC – it could be that the posterior expectation is available in closed form, but simply costly in N. In this case, we can still unbiasedly estimate this posterior expectation – see GP regression.

“The choice of the tail rate is thus quite delicate to validate against the variance constraints (2) and (3).”

It is true that the choice is crucial in order to control the variance. However, provided that partial posterior expectations converge at a rate n with n the size of a minibatch, computational complexity can be reduced to N1-α (α<β) without variance exploding. There is a trade-off: the faster the posterior expectations converge, more computation can be saved; β is in general unknown, but can be roughly estimated with the “direct approach” as we describe in appendix.

About the “direct approach”
It is true that for certain classes of models and φ functionals, the direct averaging of expectations for increasing data sizes yields good results (see log-normal example), and we state this. However, the GP regression experiments show that the direct averaging gives a larger MSE as with debiasing applied. This is exactly the trade-off mentioned earlier.

I also wonder what people think about the comparison to stochastic variational inference (GP for Big Data), as this hasn’t appeared in discussions yet. It is the comparison to “non-unbiased” schemes that Christian and Dan asked for.

## Unbiased Bayes for Big Data: Path of partial posteriors

Posted in Statistics, University life with tags , , , , , , , , , on February 26, 2015 by xi'an

“Data complexity is sub-linear in N, no bias is introduced, variance is finite.”

Heiko Strathman, Dino Sejdinovic and Mark Girolami have arXived a few weeks ago a paper on the use of a telescoping estimator to achieve an unbiased estimator of a Bayes estimator relying on the entire dataset, while using only a small proportion of the dataset. The idea is that a sequence  converging—to an unbiased estimator—of estimators φt can be turned into an unbiased estimator by a stopping rule T:

$\sum_{t=1}^T \dfrac{\varphi_t-\varphi_{t-1}}{\mathbb{P}(T\ge t)}$

is indeed unbiased. In a “Big Data” framework, the components φt are MCMC versions of posterior expectations based on a proportion αt of the data. And the stopping rule cannot exceed αt=1. The authors further propose to replicate this unbiased estimator R times on R parallel processors. They further claim a reduction in the computing cost of

$\mathcal{O}(N^{1-\alpha})\qquad\text{if}\qquad\mathbb{P}(T=t)\approx e^{-\alpha t}$

which means that a sub-linear cost can be achieved. However, the gain in computing time means higher variance than for the full MCMC solution:

“It is clear that running an MCMC chain on the full posterior, for any statistic, produces more accurate estimates than the debiasing approach, which by construction has an additional intrinsic source of variance. This means that if it is possible to produce even only a single MCMC sample (…), the resulting posterior expectation can be estimated with less expected error. It is therefore not instructive to compare approaches in that region. “

I first got a “free lunch” impression when reading the paper, namely it sounded like using a random stopping rule was enough to overcome unbiasedness and large size jams. This is not the message of the paper, but I remain both intrigued by the possibilities the unbiasedness offers and bemused by the claims therein, for several reasons: Continue reading

## the travelling salesman

Posted in Statistics with tags , , , , , , , , on January 3, 2015 by xi'an

A few days ago, I was grading my last set of homeworks for the MCMC graduate course I teach to both Dauphine and ENSAE graduate students. A few students had chosen to write a travelling salesman simulated annealing code (Exercice 7.22 in Monte Carlo Statistical Methods) and one of them included this quote

“And when I saw that, I realized that selling was the greatest career a man could want. ‘Cause what could be more satisfying than to be able to go, at the age of eighty-four, into twenty or thirty different cities, and pick up a phone, and be remembered and loved and helped by so many different people ?”
Arthur Miller, Death of a Salesman

which was a first!

## testing MCMC code

Posted in Books, Statistics, University life with tags , , , , , , , , on December 26, 2014 by xi'an

A title that caught my attention on arXiv: testing MCMC code by Roger Grosse and David Duvenaud. The paper is in fact a tutorial adapted from blog posts written by Grosse and Duvenaud, on the blog of the Harvard Intelligent Probabilistic Systems group. The purpose is to write code in such a modular way that (some) conditional probability computations can be tested. Using my favourite Gibbs sampler for the mixture model, they advocate computing the ratios

$\dfrac{p(x'|z)}{p(x|z)}\quad\text{and}\quad \dfrac{p(x',z)}{p(x,z)}$

to make sure they are exactly identical. (Where x denotes the part of the parameter being simulated and z anything else.) The paper also mentions an older paper by John Geweke—of which I was curiously unaware!—leading to another test: consider iterating the following two steps:

1. update the parameter θ given the current data x by an MCMC step that preserves the posterior p(θ|x);
2. update the data x given the current parameter value θ from the sampling distribution p(x|θ).

Since both steps preserve the joint distribution p(x,θ), values simulated from those steps should exhibit the same properties as a forward production of (x,θ), i.e., simulating from p(θ) and then from p(x|θ). So with enough simulations, comparison tests can be run. (Andrew has a very similar proposal at about the same time.) There are potential limitations to the first approach, obviously, from being unable to write the full conditionals [an ABC version anyone?!] to making a programming mistake that keep both ratios equal [as it would occur if a Metropolis-within-Gibbs was run by using the ratio of the joints in the acceptance probability]. Further, as noted by the authors it only addresses the mathematical correctness of the code, rather than the issue of whether the MCMC algorithm mixes well enough to provide a pseudo-iid-sample from p(θ|x). (Lack of mixing that could be spotted by Geweke’s test.) But it is so immediately available that it can indeed be added to every and all simulations involving a conditional step. While Geweke’s test requires re-running the MCMC algorithm altogether. Although clear divergence between an iid sampling from p(x,θ) and the Gibbs version above could appear fast enough for a stopping rule to be used. In fine, a worthwhile addition to the collection of checkings and tests built across the years for MCMC algorithms! (Of which the trick proposed by my friend Tobias Rydén to run first the MCMC code with n=0 observations in order to recover the prior p(θ) remains my favourite!)