## Hamiltonian MC on discrete spaces

Posted in Statistics, Travel, University life with tags , , , , , , , , on July 3, 2017 by xi'an

Following a lively discussion with Akihiko Nishimura during a BNP11 poster session last Tuesday, I took the opportunity of the flight to Montréal to read through the arXived paper (written jointly with David Dunson and Jianfeng Liu). The issue is thus one of handling discrete valued parameters in Hamiltonian Monte Carlo. The basic “trick” in handling this complexity goes by turning the discrete support via the inclusion of an auxiliary continuous variable whose discretisation is the discrete parameter, hence resembling to some extent the slice sampler. This removes the discreteness blockage but creates another difficulty, namely handling a discontinuous target density. (I idly wonder why the trick cannot be iterated to second or higher order so that to achieve the right amount of smoothness. Of course, the maths behind would be less cool!) The extension of the Hamiltonian to this setting by a  convolution is a trick I had not seen since the derivation of the Central Limit Theorem during Neveu’s course at Polytechnique.  What I find most exciting in the resolution is the move from a Gaussian momentum to a Laplace momentum, for the reason that I always wondered at alternatives [without trying anything myself!]. The Laplace version is indeed most appropriate here in that it avoids a computation of all discontinuity points and associated values along a trajectory. Since the moves are done component-wise, the method has a Metropolis-within-Gibbs flavour, which actually happens to be a special case. What is also striking is that the approach is both rejection-free and exact, provided ergodicity occurs, which is the case when the stepsize is random.

In addition to this resolution of the discrete parameter problem, the paper presents the further appeal of (re-)running an analysis of the Jolly-Seber capture-recapture model. Where the discrete parameter is the latent number of live animals [or whatever] in the system at any observed time. (Which we cover in Bayesian essentials with R as a neat entry to both dynamic and latent variable models.) I would have liked to see a comparison with the completion approach of Jérôme Dupuis (1995, Biometrika), since I figure the Metropolis version implemented here differs from Jérôme’s. The second example is built on Bissiri et al. (2016) surrogate likelihood (discussed earlier here) and Chopin and Ridgway (2017) catalogue of solutions for not analysing the Pima Indian dataset. (Replaced by another dataset here.)

## ergodicity of approximate MCMC chains with applications to large datasets

Posted in pictures, Statistics, Travel, University life with tags , , , , , , , , , , on August 31, 2015 by xi'an

Another arXived paper I read on my way to Warwick! And yet another paper written by my friend Natesh Pillai (and his co-author Aaron Smith, from Ottawa). The goal of the paper is to study the ergodicity and the degree of approximation of the true posterior distribution of approximate MCMC algorithms that recently flourished as an answer to “Big Data” issues… [Comments below are about the second version of this paper.] One of the most curious results in the paper is the fact that the approximation may prove better than the original kernel, in terms of computing costs! If asymptotically in the computing cost. There also are acknowledged connections with the approximative MCMC kernel of Pierre Alquier, Neal Friel, Richard Everitt and A Boland, briefly mentioned in an earlier post.

The paper starts with a fairly theoretical part, to follow with an application to austerity sampling [and, in the earlier version of the paper, to the Hoeffding bounds of Bardenet et al., both discussed earlier on the ‘Og, to exponential random graphs (the paper being rather terse on the description of the subsampling mechanism), to stochastic gradient Langevin dynamics (by Max Welling and Yee-Whye Teh), and to ABC-MCMC]. The assumptions are about the transition kernels of a reference Markov kernel and of one associated with the approximation, imposing some bounds on the Wasserstein distance between those kernels, K and K’. Results being generic, there is no constraint as to how K is chosen or on how K’ is derived from K. Except in Lemma 3.6 and in the application section, where the same proposal kernel L is used for both Metropolis-Hastings algorithms K and K’. While I understand this makes for an easier coupling of the kernels, this also sounds like a restriction to me in that modifying the target begs for a similar modification in the proposal, if only because the tails they are a-changin’

In the case of subsampling the likelihood to gain computation time (as discussed by Korattikara et al. and by Bardenet et al.), the austerity algorithm as described in Algorithm 2 is surprising as the average of the sampled data log-densities and the log-transform of the remainder of the Metropolis-Hastings probability, which seem unrelated, are compared until they are close enough.  I also find hard to derive from the different approximation theorems bounding exceedance probabilities a rule to decide on the subsampling rate as a function of the overall sample size and of the computing cost. (As a side if general remark, I remain somewhat reserved about the subsampling idea, given that it requires the entire dataset to be available at every iteration. This makes parallel implementations rather difficult to contemplate.)

## can we trust computer simulations? [day #2]

Posted in Books, pictures, Statistics, Travel, University life with tags , , , , , , , , , , , , , , , , , , , , , , on July 13, 2015 by xi'an

“Sometimes the models are better than the data.” G. Krinner

Second day at the conference on building trust in computer simulations. Starting with a highly debated issue, climate change projections. Since so many criticisms are addressed to climate models as being not only wrong but also unverifiable. And uncheckable. As explained by Gerhart Krinner, the IPCC has developed methodologies to compare models and evaluate predictions. However, from what I understood, this validation does not say anything about the future, which is the part of the predictions that matters. And that is attacked by critics and feeds climatic-skeptics. Because it is so easy to argue against the homogeneity of the climate evolution and for “what you’ve seen is not what you’ll get“! (Even though climatic-skeptics are the least likely to use this time-heterogeneity argument, being convinced as they are of the lack of human impact over the climate.)  The second talk was by Viktoria Radchuk about validation in ecology. Defined here as a test of predictions against independent data (and designs). And mentioning Simon Wood’s synthetic likelihood as the Bayesian reference for conducting model choice (as a synthetic likelihoods ratio). I had never thought of this use (found in Wood’s original paper) for synthetic likelihood, I feel a bit queasy about using a synthetic likelihood ratio as a genuine likelihood ratio. Which led to a lively discussion at the end of her talk. The next talk was about validation in economics by Matteo Richiardi, who discussed state-space models where the hidden state is observed through a summary statistic, perfect playground for ABC! But Matteo opted instead for a non-parametric approach that seems to increase imprecision and that I have never seen used in state-space models. The last part of the talk was about non-ergodic models, for which checking for validity becomes much more problematic, in my opinion. Unless one manages multiple observations of the non-ergodic path. Nicole Saam concluded this “Validation in…” morning with Validation in Sociology. With a more pessimistic approach to the possibility of finding a falsifying strategy, because of the vague nature of sociology models. For which data can never be fully informative. She illustrated the issue with an EU negotiation analysis. Where most hypotheses could hardly be tested.

“Bayesians persist with poor examples of randomness.” L. Smith

“Bayesians can be extremely reasonable.” L. Smith

The afternoon session was dedicated to methodology, mostly statistics! Andrew Robinson started with a talk on (frequentist) model validation. Called splitters and lumpers. Illustrated by a forest growth model. He went through traditional hypothesis tests like Neyman-Pearson’s that try to split between samples. And (bio)equivalence tests that take difference as the null. Using his equivalence R package. Then Leonard Smith took over [in a literal way!] from a sort-of-Bayesian perspective, in a work joint with Jim Berger and Gary Rosner on pragmatic Bayes which was mostly negative about Bayesian modelling. Introducing (to me) the compelling notion of structural model error as a representation of the inadequacy of the model. With illustrations from weather and climate models. His criticism of the Bayesian approach is that it cannot be holistic while pretending to be [my wording]. And being inadequate to measure model inadequacy, to the point of making prior choice meaningless. Funny enough, he went back to the ball dropping experiment David Higdon discussed at one JSM I attended a while ago, with the unexpected outcome that one ball did not make it to the bottom of the shaft. A more positive side was that posteriors are useful models but should not be interpreted from a probabilistic perspective. Move beyond probability was his final message. (For most of the talk, I misunderstood P(BS), the probability of a big surprise, for something else…) This was certainly the most provocative talk of the conference  and the discussion could have gone on for the rest of day! Somewhat, Lenny was voluntarily provocative in piling the responsibility upon the Bayesian’s head for being overconfident and not accounting for the physicist’ limitations in modelling the phenomenon of interest. Next talk was by Edward Dougherty on methods used in biology. He separated within-model uncertainty from outside-model inadequacy. The within model part is mostly easy to agree upon. Even though difficulties in estimating parameters creates uncertainty classes of models. Especially because of being from a small data discipline. He analysed the impact of machine learning techniques like classification as being useless without prior knowledge. And argued in favour of the Bayesian minimum mean square error estimator. Which can also lead to a classifier. And experimental design. (Using MSE seems rather reductive when facing large dimensional parameters.) Last talk of the day was by Nicolas Becu, a geographer, with a surprising approach to validation via stakeholders. A priori not too enticing a name! The discussion was of a more philosophical nature, going back to (re)define validation against reality and imperfect models. And including social aspects of validation, e.g., reality being socially constructed. This led to the stakeholders, because a model is then a shared representation. Nicolas illustrated the construction by simulation “games” of a collective model in a community of Thai farmers and in a group of water users.

In a rather unique fashion, we also had an evening discussion on points we share and points we disagreed upon. After dinner (and wine), which did not help I fear! Bill Oberkampf mentioned the use of manufactured solutions to check code, which seemed very much related to physics. But then we got mired into the necessity of dividing between verification and validation. Which sounded very and too much engineering-like to me. Maybe because I do not usually integrate coding errors and algorithmic errors into my reasoning (verification)… Although sharing code and making it available makes a big difference. Or maybe because considering all models are wrong is neither part of my methodology (validation). This part ended up in a fairly pessimistic conclusion on the lack of trust in most published articles. At least in the biological sciences.

## shrinkage-thresholding MALA for Bayesian variable selection

Posted in Statistics, University life with tags , , , , , on March 10, 2014 by xi'an

Amandine Shreck along with her co-authors Gersende Fort, Sylvain LeCorff, and Eric Moulines, all from Telecom Paristech, has undertaken to revisit the problem of large p small n variable selection. The approach they advocate mixes Langevin algorithms with trans-model moves with shrinkage thresholding. The corresponding Markov sampler is shown to be geometrically ergodic, which may be a première in that area. The paper was arXived in December but I only read it on my flight to Calgary, not overly distracted by the frozen plains of Manitoba and Saskatchewan. Nor by my neighbour watching Hunger Games II.)

A shrinkage-thresholding operator is defined as acting on the regressor matrix towards producing sparse versions of this matrix. (I actually had trouble picturing the model until Section 2.2 where the authors define the multivariate regression model, making the regressors a matrix indeed. With a rather unrealistic iid Gaussian noise. And with an unknown number of relevant rows, hence a varying dimension model. Note that this is a strange regression in that the regression coefficients are known and constant across all models.) Because the Langevin algorithm requires a gradient to operate, the log target is divided between a differentiable and a non-differentiable parts, the later accommodating the Dirac masses in the dominating measure. The new MALA moves involve applying the above shrinkage-thresholding operator to a regular Langevin proposal, hence moving to sub-spaces and sparser representations.

The thresholding functions are based on positive part operators, which means that the Markov chain does not visit some neighbourhoods of zero in the embedding and in the sparser spaces. In other words, the proposal operates between models of varying dimensions without further ado because the point null hypotheses are replaced with those neighbourhoods. Hence it is not exactly simulating from the “original” posterior, which may be a minor caveat or not. Not if defining the neighbourhoods is driven by an informed or at least spelled-out choice of a neighbourhood of zero where the coefficients are essentially identified with zero. The difficulty is then in defining how close is close enough. Especially since the thresholding functions seem to all depend on a single number which does not depend on the regressor matrix. It would be interesting to see if the g-prior version could be developed as well… Actually, I would have also included a dose of g-prior in the Langevin move, rather than using an homogeneous normal noise.

The paper contains a large experimental part where the performances of the method are evaluated on various simulated datasets. It includes a comparison with reversible jump MCMC, which slightly puzzles me: (a) I cannot see from the paper whether or not the RJMCMC is applied to the modified (thresholded) posterior, as a regular RJMCMC would not aim at the same target, but the appendix does not indicate a change of target; (b) the mean error criterion for which STMALA does better than RJMCMC is not defined, but the decrease of this criterion along iterations seems to indicate that convergence has not yet occured, since it does not completely level up after 3 10⁵ iterations.

I must have mentioned it in another earlier post, but I find somewhat ironical to see those thresholding functions making a comeback after seeing the James-Stein and smooth shrinkage estimators taking over the then so-called pre-test versions in the 1970’s (Judge and Bock, 1978) and 1980’s. There are obvious reasons for this return, moving away from quadratic loss being one.

## bounded target support [#2]

Posted in Books, Kids, Statistics, University life with tags , , , , , on July 8, 2013 by xi'an

In a sort of echo from an earlier post, I received this emailed question from Gabriel:

I am contacting you in connection with my internship and your book «Le choix bayésien» where I cannot find an answer to my question. Given a constrained parameter space and an unconstrained Markov chain, is it correct to subsample the chain in order to keep only those points that satisfy the constraint?

To which I replied that this would induce a bias in the outcome, even though this is a marginally valid argument (if  the Markov chain is in its stationary regime, picking one value at random from those satisfying the constraint is akin to accept-reject). The unbiased approach is to resort to Metropolis-Hastings steps in Gabriel’s Gibbs sampler to check whether or not each proposed move stays within the constrained space. If not, one need to replicate the current value of the chain…

Update: Following comments by Ajay and David, I withdraw the term “bias”. The method works as a low key accept-reject but can be very inefficient, to the extent of having no visit to the constrained set. (However, in the event of a practically/numerically disconnected support with a huge gap between the connected components, it may also be more efficient than a low energy Metropolis-Hastings algorithm. Mileage may vary!)

## lemma 7.3

Posted in Statistics with tags , , , , , , , , , , , on November 14, 2012 by xi'an

As Xiao-Li Meng accepted to review—and I am quite grateful he managed to fit this review in an already overflowing deanesque schedule!— our 2004 book  Monte Carlo Statistical Methods as part of a special book review issue of CHANCE honouring the memory of George thru his books—thanks to Sam Behseta for suggesting this!—, he sent me the following email about one of our proofs—demonstrating how much efforts he had put into this review!—:

```I however have a question about the proof of Lemma 7.3
on page 273. After the expression of
E[h(x^(1)|x_0], the proof stated "and substitute
Eh(x) for h(x_1)".  I cannot think of any
justification for this substitution, given the whole
purpose is to show h(x) is a constant.```

I put it on hold for a while and only looked at it in the (long) flight to Chicago. Lemma 7.3 in Monte Carlo Statistical Methods is the result that the Metropolis-Hastings algorithm is Harris recurrent (and not only recurrent). The proof is based on the characterisation of Harris recurrence as having only constants for harmonic functions, i.e. those satisfying the identity

$h(x) = \mathbb{E}[h(X_t)|X_{t-1}=x]$

The chain being recurrent, the above implies that harmonic functions are almost everywhere constant and the proof steps from almost everywhere to everywhere. The fact that the substitution above—and I also stumbled upon that very subtlety when re-reading the proof in my plane seat!—is valid is due to the fact that it occurs within an integral: despite sounding like using the result to prove the result, the argument is thus valid! Needless to say, we did not invent this (elegant) proof but took it from one of the early works on the theory of Metropolis-Hastings algorithms, presumably Luke Tierney’s foundational Annals paper work that we should have quoted…

As pointed out by Xiao-Li, the proof is also confusing for the use of two notations for the expectation (one of which is indexed by f and the other corresponding to the Markov transition) and for the change in the meaning of f, now the stationary density, when compared with Theorem 6.80.

## A slice of infinity

Posted in R, Statistics, University life with tags , , , , , , , on July 28, 2011 by xi'an

Peng Yu sent me an email about the conditions for convergence of a Gibbs sampler:

The following statement mentions convergence. But I’m not familiar what the regularity condition is.

“But it is necessary to have a finite probability of moving away from the current state at all times in order to satisfy the regularity conditions on which the whole MCMC theory depends.”

Slice sampler is discussed in your book Monte Carlo Statistical Methods. I think that the “regularity condition” may have been discussed in your book. If so, would you please let me know where it is? Thanks and look forward to hearing from you!

The quote is from Martyn Plummer and deals with a stopping rule in JAGS implementation of the slice sampler. (The correct wording should be “strictly positive probability” rather than “finite probability”, I think.) However, this has nothing to do with a “regularity condition” on the irreducibility of a Markov chain: if a slice sampler is implemented for an unbounded density target, say a Beta(1/2,1/2), there is no irreducibility condition connected with the infiniteness of the density. In theory, (a) the chain never visits the “state” where the density is infinite (if only because we are dealing with a continuous state space) and (b) after visiting a value x with a large density f(x), the slice sampler allows for a move away from it since the slice involves a uniform simulation over (0,f(x)). Deeper properties of the slice sampler (like geometric ergodicity) are explored in, e.g., this JRSS B paper by Gareth Roberts and Jeff Rosenthal and this one in the Annals of Statistics by Radford Neal. In practice, the problem is caused by values of f(x) that cannot be computed and hence produce an error message like

`Singularity in likelihood found by Slicer.`

If those singularities can be localised, a neighbourhood excluding them should be introduced. (More easily said than done, obviously!)

Here is an example of a slice sampler with the Beta(1/2,1/2) distribution:

```#graphics
dote=function(x,y) points(x,y,col="gold",pch=19,cex=.4)
mote=function(x,y,z,w) lines(c(x,z),c(y,w),col="gold",lwd=.5)
cst=dbeta(.5,.5,.5)*.5 #normalising constant
#inverting f(x)=d, 2nd degree equation
hitden=function(d) .5+.5*sqrt(1-4*( cst/ max(d,dbeta(.5,.5,.5)))^2)*c(-1,1)
#output
curve(dbeta(x,.5,.5),0,1,ylab="density",lwd=2,col="steelblue",n=1001)
x=runif(1);u=runif(1)*dbeta(x,.5,.5);dote(x,u)
for (t in 1:100){ #100 slice steps
bo=hitden(u)
nx=sample(c(runif(1,0,bo[1]),runif(1,bo[2],1)),1)
nu=runif(1)*dbeta(nx,.5,.5)
mote(x,u,nx,nu)
x=nx;u=nu;dote(x,u)
}
```

which clearly explores the whole area under the Beta(1/2,1/2) density. Even when started at a large density value like f(.999999), it eventually leaves the vicinity of this highly improbable value.