single variable transformation approach to MCMC

Posted in Books, Statistics, Travel with tags , , , , on September 9, 2014 by xi'an

I read the newly arXived paper “On Single Variable Transformation Approach to Markov Chain Monte Carlo” by Dey and Bhattacharya on the pleasant train ride between Bristol and Coventry last weekend. The paper actually follows several earlier papers by the authors that I have not read in detail. The notion of single variable transform is to add plus or minus the same random noise to all components of the current value of the Markov chain, instead of the standard d-dimensional random walk proposal of the reference Metropolis-Hastings algorithm, namely all proposals are of the form

$x_i'=x_i\pm \epsilon\ i=1,\cdots,d$

meaning the chain proceeds [after acceptance] along one and only one of the d diagonals. The authors’ arguments are that (a) the proposal is cheaper and (b) the acceptance rate is higher. What I find questionable in this argument is that this does not directly matter in the evaluation of the performances of the algorithm. For instance, higher acceptance in a Metropolis-Hasting algorithm does not imply faster convergence and smaller asymptotic variance. (This goes without mentioning the fact that the comparative Figure 1 is so variable with the dimension as to be of limited worth. Figure 1 and 2 are also found in an earlier arXived paper of the authors.) For instance, restricting the moves along the diagonals of the Euclidean space implies that there is a positive probability to make two successive proposals along the same diagonal, which is a waste of time. When considering the two-dimensional case, joining two arbitrary points using an everywhere positive density g upon ε means generating two successive values from g, which is equivalent cost-wise to generating a single noise from a two-dimensional proposal. Without the intermediate step of checking the one-dimensional move along one diagonal. So much for a gain. In fine, the proposal found in this paper sums up as being a one-at-a-time version of a standard random walk Metropolis-Hastings algorithm.

ABC model choice by random forests [guest post]

Posted in pictures, R, Statistics, University life with tags , , , , , , , , , , on August 11, 2014 by xi'an

[Dennis Prangle sent me his comments on our ABC model choice by random forests paper. Here they are! And I appreciate very much contributors commenting on my paper or others, so please feel free to join.]

This paper proposes a new approach to likelihood-free model choice based on random forest classifiers. These are fit to simulated model/data pairs and then run on the observed data to produce a predicted model. A novel “posterior predictive error rate” is proposed to quantify the degree of uncertainty placed on this prediction. Another interesting use of this is to tune the threshold of the standard ABC rejection approach, which is outperformed by random forests.

The paper has lots of thought-provoking new ideas and was an enjoyable read, as well as giving me the encouragement I needed to read another chapter of the indispensable Elements of Statistical Learning However I’m not fully convinced by the approach yet for a few reasons which are below along with other comments.

Alternative schemes

The paper shows that random forests outperform rejection based ABC. I’d like to see a comparison to more efficient ABC model choice algorithms such as that of Toni et al 2009. Also I’d like to see if the output of random forests could be used as summary statistics within ABC rather than as a separate inference method.

Posterior predictive error rate (PPER)

This is proposed to quantify the performance of a classifier given a particular data set. The PPER is the proportion of times the classifier’s most favoured model is incorrect for simulated model/data pairs drawn from an approximation to the posterior predictive. The approximation is produced by a standard ABC analysis.

Misclassification could be due to (a) a poor classifier or (b) uninformative data, so the PPER aggregrates these two sources of uncertainty. I think it is still very desirable to have an estimate of the uncertainty due to (b) only i.e. a posterior weight estimate. However the PPER is useful. Firstly end users may sometimes only care about the aggregated uncertainty. Secondly relative PPER values for a fixed dataset are a useful measure of uncertainty due to (a), for example in tuning the ABC threshold. Finally, one drawback of the PPER is the dependence on an ABC estimate of the posterior: how robust are the results to the details of how this is obtained?

Classification

This paper illustrates an important link between ABC and machine learning classification methods: model choice can be viewed as a classification problem. There are some other links: some classifiers make good model choice summary statistics (Prangle et al 2014) or good estimates of ABC-MCMC acceptance ratios for parameter inference problems (Pham et al 2014). So the good performance random forests makes them seem a generally useful tool for ABC (indeed they are used in the Pham et al al paper).

posterior predictive checks for admixture models

Posted in pictures, Statistics, Travel, University life with tags , , , , , , , on July 8, 2014 by xi'an

In a posting coincidence, just a few days after we arXived our paper on ABC model choice with random forests, where we use posterior predictive errors for assessing the variability of the random forest procedure, David Mimno, David Blei, and Barbara Engelhardt arXived a paper on posterior predictive checks to quantify lack-of-fit in admixture models of latent population structure, which deals with similar data and models, while also using the posterior predictive as a central tool. (Marginalia: the paper is a wee bit difficult to read [esp. with France-Germany playing in the airport bar!] as the modelling is only clearly described at the very end. I suspect this arXived version was put together out of a submission to a journal like Nature or PNAS, with mentions of a Methods section that does not appear here and of Supplementary Material that turned into subsections of the Discussion section.)

The dataset are genomic datasets made of SNPs (single nucleotide polymorphisms). For instance, the first (HapMap) dataset corresponds to 1,043 individuals and 468,167 SNPs. The model is simpler than Kingman’s coalescent, hence its likelihood does not require ABC steps to run inference. The admixture model in the paper is essentially a mixture model over ancestry indices with individual dependent weights with Bernoulli observations, hence resulting into a completed likelihood of the form

$\prod_{i=1}^n\prod_{\ell=1}^L\prod_j \phi_{\ell,z_{i,\ell,j}}^{x_{i,\ell,j}}(1-\phi_{\ell,z_{i,\ell,j}})^{1-x_{i,\ell,j}}\theta_{i,z_{i,\ell,j}}$

(which looks more formidable than it truly is!). Regular Bayesian inference is thus possible in this setting, implementing e.g. Gibbs sampling. The authors chose instead to rely on EM and thus derived the maximum likelihood estimators of the (many) parameters of the admixture. And of the latent variables z. Their posterior predictive check is based on the simulation of pseudo-observations (as in ABC!) from the above likelihood, with parameters and latent variables replaced with their EM estimates (unlike ABC). There is obviously some computational reason in doing this instead of simulating from the posterior, albeit implicit in the paper. I am however slightly puzzled by the conditioning on the latent variable estimate , as its simulation is straightforward and as a latent variable is more a missing observation than a parameter. Given those 30 to 100 replications of the data, an empirical distribution of a discrepancy function is used to assess whether or not the equivalent discrepancy for the observation is an outlier. If so, the model is not appropriate for the data. (Interestingly, the discrepancy is measured via the Bayes factor of z-scores.)

The connection with our own work is that the construction of discrepancy measures proposed in this paper could be added to our already large collection of summary statistics to check to potential impact in model comparison, i.e. for a significant contribution to the random forest nodes.  Conversely, the most significant summary statistics could then be tested as discrepancy measures. Or, more in tune with our Series B paper on the proper selection of summary variables, the distribution of those discrepancy measures could be compared across potential models. Assuming this does not take too much computing power…

recycling accept-reject rejections

Posted in Statistics, University life with tags , , , , , , , , , on July 1, 2014 by xi'an

Vinayak Rao, Lizhen Lin and David Dunson just arXived a paper which proposes anew technique to handle intractable normalising constants. And which exact title is Data augmentation for models based on rejection sampling. (Paper that I read in the morning plane to B’ham, since this is one of my weeks in Warwick.) The central idea therein is that, if the sample density (aka likelihood) satisfies

$p(x|\theta) \propto f(x|\theta) \le q(x|\theta) M\,,$

where all terms but p are known in closed form, then completion by the rejected values of an hypothetical accept-reject algorithm−hypothetical in the sense that the data does not have to be produced by an accept-reject scheme but simply the above domination condition to hold−allows for a data augmentation scheme. Without requiring the missing normalising constant. Since the completed likelihood is

$\prod_{i=1}^n \dfrac{f(x_i|\theta)}{M} \prod_{j=1}^{m_i} \left\{q(y_{ij}|\theta) -\dfrac{f(y_{ij}|\theta)}{M}\right\}$

A closed-form, if not necessarily congenial, function.

Now this is quite a different use of the “rejected values” from the accept reject algorithm when compared with our 1996 Biometrika paper on the Rao-Blackwellisation of accept-reject schemes (which, still, could have been mentioned there… Or Section 4.2 of Monte Carlo Statistical Methods. Rather than re-deriving the joint density of the augmented sample, “accepted+rejected”.)

It is a neat idea in that it completely bypasses the approximation of the normalising constant. And avoids the somewhat delicate tuning of the auxiliary solution of Moller et al. (2006)  The difficulty with this algorithm is however in finding an upper bound M on the unnormalised density f that is

1. in closed form;
2. with a manageable and tight enough “constant” M;
3. compatible with running a posterior simulation conditional on the added rejections.

The paper seems to assume further that the bound M is independent from the current parameter value θ, at least as suggested by the notation (and Theorem 2), but this is not in the least necessary for the validation of the formal algorithm. Such a constraint would pull M higher, hence reducing the efficiency of the method. Actually the matrix Langevin distribution considered in the first example involves a bound that depends on the parameter κ.

The paper includes a result (Theorem 2) on the uniform ergodicity that relies on heavy assumptions on the proposal distribution. And a rather surprising one, namely that the probability of rejection is bounded from below, i.e. calling for a less efficient proposal. Now it seems to me that a uniform ergodicity result holds as well when the probability of acceptance is bounded from below since, then, the event when no rejection occurs constitutes an atom from the augmented Markov chain viewpoint. There therefore occurs a renewal each time the rejected variable set ϒ is empty, and ergodicity ensues (Robert, 1995, Statistical Science).

Note also that, despite the opposition raised by the authors, the method per se does constitute a pseudo-marginal technique à la Andrieu-Roberts (2009) since the independent completion by the (pseudo) rejected variables produces an unbiased estimator of the likelihood. It would thus be of interest to see how the recent evaluation tools of Andrieu and Vihola can assess the loss in efficiency induced by this estimation of the likelihood.

Maybe some further experimental evidence tomorrow…

ABC model choice by random forests

Posted in pictures, R, Statistics, Travel, University life with tags , , , , , , , , , , , , , on June 25, 2014 by xi'an

After more than a year of collaboration, meetings, simulations, delays, switches,  visits, more delays, more simulations, discussions, and a final marathon wrapping day last Friday, Jean-Michel Marin, Pierre Pudlo,  and I at last completed our latest collaboration on ABC, with the central arguments that (a) using random forests is a good tool for choosing the most appropriate model and (b) evaluating the posterior misclassification error rather than the posterior probability of a model is an appropriate paradigm shift. The paper has been co-signed with our population genetics colleagues, Jean-Marie Cornuet and Arnaud Estoup, as they provided helpful advice on the tools and on the genetic illustrations and as they plan to include those new tools in their future analyses and DIYABC software.  ABC model choice via random forests is now arXived and very soon to be submitted…

One scientific reason for this fairly long conception is that it took us several iterations to understand the intrinsic nature of the random forest tool and how it could be most naturally embedded in ABC schemes. We first imagined it as a filter from a set of summary statistics to a subset of significant statistics (hence the automated ABC advertised in some of my past or future talks!), with the additional appeal of an associated distance induced by the forest. However, we later realised that (a) further ABC steps were counterproductive once the model was selected by the random forest and (b) including more summary statistics was always beneficial to the performances of the forest and (c) the connections between (i) the true posterior probability of a model, (ii) the ABC version of this probability, (iii) the random forest version of the above, were at best very loose. The above picture is taken from the paper: it shows how the true and the ABC probabilities (do not) relate in the example of an MA(q) model… We thus had another round of discussions and experiments before deciding the unthinkable, namely to give up the attempts to approximate the posterior probability in this setting and to come up with another assessment of the uncertainty associated with the decision. This led us to propose to compute a posterior predictive error as the error assessment for ABC model choice. This is mostly a classification error but (a) it is based on the ABC posterior distribution rather than on the prior and (b) it does not require extra-computations when compared with other empirical measures such as cross-validation, while avoiding the sin of using the data twice!

Big Bayes stories in print [and in force]

Posted in Books, Statistics, University life with tags , , , , , on May 20, 2014 by xi'an

The special issue of Statistical Science Kerrie Mengersen and I edited over the past three (four?) years is now out in print! Even though many ‘Og readers may have already seen the table of contents, here it is once again. We hope you will enjoy this 100 page long excursion in big Bayesiana. The papers are not freely accessible as “current papers” on the journal website but can yet be found in the “future papers” section. (If a sponsor wants to support turning the papers into open access version, he or she is most welcome to contact us or the IMS!) And, thanks to Larry for reminding me!, available on arXiv. Thanks to all authors, discussants, reviewers and special kudos to Jon Wellner for his constant help and support in putting the special issue together!

an a-statistical proof of a binomial identity

Posted in Books, Statistics, University life with tags , , , , on May 15, 2014 by xi'an

When waiting for Andrew this morning, I was browsing through the arXived papers of the day and came across this “Simple Statistical Proof of a Binomial Identity” by P. Vellaisamy. The said identity is that, for all s>0,

$\sum_{k=0}^n (-1)^k {n \choose k}\dfrac{s}{s+k} = \prod_{k=1}^n \dfrac{k}{k+s}$

Nothing wrong with the maths in this paper (except for a minor typo using Exp(1) instead of Exp(s), p.2).  But I am perplexed by the label “statistical” used by the author, as this proof is an entirely analytic argument, based on two different integrations of the same integral. Nothing connected with data or any statistical  technique: this is sheer combinatorics, of the kind one could find in William Feller‘s volume I.