Archive for Monte Carlo Statistical Methods

simulation under zero measure constraints

Posted in Books, Kids, R, Statistics, University life on November 17, 2016 by xi'an

A theme that comes up fairly regularly on X validated is the production of a sample with given moments, either for calibration motives or from a misunderstanding of the difference between a distribution mean and a sample average. Here are some entries on that topic:

In most of those questions, the constraint is on the sum or mean of the sample, which allows for an easy resolution by a change of variables. It however gets somewhat harder when the constraint involves more moments or, worse, an implicit solution to an equation. A good example of the latter is the quest for a sample with a given maximum likelihood estimate in the case where this MLE cannot be derived analytically. As for instance with a location-scale t sample…
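
For the simplest case when only the average is fixed, here is a minimal sketch of the change-of-variables resolution (with a Gaussian sample as my own toy choice, not taken from the above entries):

n=10;s0=.5   #sample size and fixed average
x=rnorm(n)   #unconstrained N(0,1) sample
x=x-mean(x)+s0 #recentred sample with average exactly s0

For a Gaussian sample the recentring is exact in distribution, since the sample average is independent of the deviations from it; for other distributions it only enforces the constraint, not the conditional law.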

Actually, even when the constraint is solely on the sum, a relevant question is the production of an efficient simulation mechanism. Using a Gibbs sampler that changes one component of the sample at each iteration does not qualify, even though it eventually produces the proper sample. Except for small samples. As in this example

n=3;T=1e4
s0=.5 #fixed average
sampl=matrix(s0,T,n)
for (t in 2:T){
  sampl[t,]=sampl[t-1,]
  for (i in 1:(n-1)){
    #uniform draw within the bounds keeping all entries in [0,1]
    sampl[t,i]=runif(1,
      min=max(0,n*s0-sum(sampl[t,c(-i,-n)])-1),
      max=min(1,n*s0-sum(sampl[t,c(-i,-n)])))
    #last component absorbs the constraint on the sum
    sampl[t,n]=n*s0-sum(sampl[t,-n])}}

For very large samples, I figure that proposing from the unconstrained density can achieve sufficient efficiency, but the in-between setting remains an interesting problem.

pseudo-marginal MCMC with minimal replicas

Posted in Books, Statistics, University life on November 16, 2016 by xi'an

A week ago, Chris Sherlock, Alexandre Thiery and Anthony Lee (Warwick) arXived a short paper where they show that it is most efficient to use a single random substitute to the likelihood when running a pseudo-marginal algorithm, thus generalising an earlier paper by Luke Bornn and co-authors on which I commented a while ago. A neat side result in the paper is that the acceptance probability for m replicas [in the pseudo-marginal approximation] is at most m/s times the acceptance probability for s replicas when s<m. The main result is as follows:

[screenshot of the paper's main result]

There is a (superficial?) connection with the fact that when running Metropolis-within-Gibbs there is no advantage in doing more than a single Metropolis iteration, as the goal is to converge to the joint posterior, rather than to better approximate the full conditional…
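
For intuition, here is a minimal pseudo-marginal Metropolis-Hastings sketch on a toy model of my own choosing (not the paper's): the latent-variable likelihood f(y|θ)=∫N(y;x,1)N(x;θ,1)dx is replaced by an average of m unbiased estimates dnorm(y,x_j,1) with x_j~N(θ,1), so that m=1 corresponds to the single-replica recommendation:

pmmh=function(y,m=1,T=1e4,scale=.5){
 theta=rep(0,T)
 lhat=mean(dnorm(y,rnorm(m,theta[1],1),1)) #unbiased likelihood estimate at current value
 for (t in 2:T){
  prop=theta[t-1]+scale*rnorm(1)
  lprop=mean(dnorm(y,rnorm(m,prop,1),1)) #m replicas averaged into one estimate
  #Metropolis-Hastings step with estimated likelihoods and a N(0,10) prior
  if (runif(1)*lhat*dnorm(theta[t-1],0,sqrt(10))<lprop*dnorm(prop,0,sqrt(10))){
   theta[t]=prop;lhat=lprop}else{theta[t]=theta[t-1]}}
 theta}
out=pmmh(y=1.7,m=1)

Whatever the value of m, the unbiasedness of the estimate means the chain targets the exact posterior; the paper's point is about the efficiency trade-off in m.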

Nature snapshot [Volume 539 Number 7627]

Posted in Books, Statistics, University life on November 15, 2016 by xi'an

A number of entries of interest [to me] in that Nature issue: from the Capuchin monkeys that break stones in a way that resembles early hominin biface tools, to the persistent association between some sounds and some meanings across numerous languages, to the use of infected mosquitoes in South America to fight Zika, to the call for more maths in psychiatry by the NIMH director, where since prediction is mentioned I presumed stats is included, to the potentially earthshaking green power revolution in Africa, to the reconstruction of the first HIV strains in North America, along with the deconstruction of the “Patient 0” myth, helped by Bayesian phylogenetic analyses, to the coverage of the Open Syllabus Project, with Monte Carlo Statistical Methods arriving first [in the Monte Carlo list]….

“Observations should not converge on one model but aim to find anomalies that carry clues about the nature of dark matter, dark energy or initial conditions of the Universe. Further observations should be motivated by testing unconventional interpretations of those anomalies (such as exotic forms of dark matter or modified theories of gravity). Vast data sets may contain evidence for unusual behaviour that was unanticipated when the projects were conceived.” Avi Loeb

One editorial particularly drew my attention, Good data are not enough, by the astronomer Avi Loeb. As illustrated by the quote above, Loeb objects to data being interpreted and even to data being collected towards the assessment of the standard model. While I agree that this model contains a lot of fudge factors like dark matter and dark energy, which apparently constitute most of the contents of the Universe, the discussion is quite curious, in that interpreting data according to alternative theories sounds impossible and certainly beyond the reach of most PhD students [as Loeb criticises the analysis of some data in a recent thesis he evaluated].

“modern cosmology is augmented by unsubstantiated, mathematically sophisticated ideas — of the multiverse, anthropic reasoning and string theory.”

The author argues for always allowing alternative interpretations of the data, which sounds fine at a primary level but again calls for the conception of such alternative models. When discrepancies are found between the standard model and the data, they can be due to errors in the measurement itself, in the measurement model, or in the theoretical model. However, they may be impossible to analyse outside the model, in the neutral way called for and wished for by Loeb. Designing neutral experiments sounds even less meaningful. Which is why I am fairly taken aback by the call to “a research frontier [that] should maintain at least two ways of interpreting data so that new experiments will aim to select the correct one”! Why two and not more?! And which ones?! I am not aware of fully developed alternative theories and cannot see how experiments designed under one model could produce indications about a new and incomplete model.

“Such simple, off-the-shelf remedies could help us to avoid the scientific fate of the otherwise admirable Mayan civilization.”

Hence I am bemused by the whole exercise, whose deepest arguments seem to be a paper written by the author last year and an interdisciplinary centre on black holes recently launched by the same author.

variance of an exponential order statistic

Posted in Books, Kids, pictures, R, Statistics, University life on November 10, 2016 by xi'an

This afternoon, one of my Monte Carlo students at ENSAE came to me with an exercise from Monte Carlo Statistical Methods that I did not remember having written. And I thus “charged” George Casella with authorship for that exercise!

Exercise 3.3 starts with the usual question (a) about the (Binomial) precision of a tail probability estimator, which is easy to answer by iterating simulation batches. Expressed via the empirical cdf, it is concerned with the vertical variability of this empirical cdf. The second part (b) is more unusual in that it again starts with the evaluation of a tail probability, but then switches to finding the .995 quantile by simulation and producing a precise enough [to three digits] estimate. Which amounts to assessing the horizontal variability of this empirical cdf.
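
Here is a minimal numerical companion [my own reading of the exercise, with an Exp(1) reference distribution], contrasting the two notions of variability:

N=1e4;x=rexp(N)            #Exp(1) sample
t0=qexp(.995)              #true .995 quantile, for reference
phat=mean(x>t0)            #empirical tail probability
sder=sqrt(phat*(1-phat)/N) #Binomial standard error [vertical variability]
qhat=quantile(x,.995)      #empirical .995 quantile [horizontal variability]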

As we discussed about this question, my first suggestion was to aim at a value of N, number of Monte Carlo simulations, such that the .995 x N-th spacing had a length of less than one thousandth of the .995 x N-th order statistic. In the case of the Exponential distribution suggested in the exercise, generating order statistics is straightforward, since, as suggested by Devroye, see Section V.3.3, the i-th spacing is an Exponential variate with rate (N-i+1). This is so fast that Devroye suggests simulating Uniform order statistics by inverting Exponential order statistics (p.220)!
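
In R, this spacing representation delivers the entire ordered sample in one pass (a quick sketch, with the last line being Devroye's inversion trick for Uniform order statistics):

N=1e5
spacings=rexp(N)/(N:1)   #i-th spacing X(i)-X(i-1) is Exp(N-i+1)
ordstat=cumsum(spacings) #ordered Exp(1) sample of size N
ordunif=1-exp(-ordstat)  #ordered Uniform(0,1) sample, by monotone inversion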

However, while still discussing the problem with my student, I came to a better expression of the question, which was to figure out the variance of the .995 x N-th order statistic in the Exponential case. Working with the density of this order statistic however led nowhere useful. A bit later, after Google-ing the problem, I came upon this Stack Exchange solution that made use of the spacing result mentioned above, namely that the expectation and variance of the k-th order statistic are

\mathbb{E}[X_{(k)}]=\sum\limits_{i=N-k+1}^N\frac1i,\qquad \mbox{Var}(X_{(k)})=\sum\limits_{i=N-k+1}^N\frac1{i^2}

which leads to the proper condition on N when imposing the variability constraint.
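
As a sketch of that final step [my own implementation, using digamma and trigamma to evaluate the two sums, and reading the three-digit requirement as a relative one], one can check when the standard deviation of the ⌈.995 N⌉-th order statistic drops below one thousandth of its expectation:

musd=function(N,k=ceiling(.995*N)){
 a=N-k+1
 c(mean=digamma(N+1)-digamma(a), #sum of 1/i over i=N-k+1,...,N
 sdev=sqrt(trigamma(a)-trigamma(N+1)))} #square root of the sum of 1/i^2
for (N in 10^(3:7)) print(c(N,musd(N)["sdev"]/musd(N)["mean"]))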

je reviendrai à Montréal [MCM 2017]

Posted in pictures, R, Running, Statistics, Travel on November 3, 2016 by xi'an

In the summer of 2017, the biennial International Conference on Monte Carlo Methods and Applications (MCM) will take place in Montréal, Québec, Canada, on July 3-7. This is a mathematically oriented meeting that alternates with MCqMC and that is “devoted to the study of stochastic simulation and Monte Carlo methods in general, from the theoretical viewpoint and in terms of their effective applications in different areas such as finance, statistics, machine learning, computer graphics, computational physics, biology, chemistry, and scientific computing in general. It is one of the most prominent conference series devoted to research on the mathematical aspects of stochastic simulation and Monte Carlo methods.” I attended one edition in Annecy three years ago and very much enjoyed the range of topics and backgrounds. The program is under construction and everyone is warmly invited to contribute talks or special sessions, with a deadline on January 20, 2017. In addition, Montréal is a Monte Carlo Mecca of sorts, with leading researchers in the field like Luc Devroye and Pierre Lécuyer working there. (And a great place to visit in the summer!)

adaptive exchange

Posted in Books, Statistics, University life on October 27, 2016 by xi'an

In the March 2016 issue of JASA that currently sits on my desk, there is a paper by Liang, Jin, Song and Liu on the adaptive exchange algorithm, which aims at handling posteriors for sampling distributions with intractable normalising constants. The concept behind the algorithm is the exchange principle initiated by Jesper Møller and co-authors in 2006, where an auxiliary pseudo-observation is simulated so that the missing constants vanish in the Metropolis-Hastings ratio. (The name exchange algorithm was introduced in a subsequent paper by Iain Murray, Zoubin Ghahramani and David MacKay, also in 2006.)

 The crux of the method is to run each iteration as follows [where y denotes the observation]:

  1. Propose a new value θ’ of the parameter from a proposal q(θ’|θ);
  2. Generate a pseudo-observation z~ƒ(z|θ’);
  3. Accept with probability

\dfrac{\pi(\theta')f(y|\theta')}{\pi(\theta)f(y|\theta)}\dfrac{q(\theta|\theta')f(z|\theta)}{q(\theta'|\theta)f(z|\theta')}

which has the appeal of cancelling all normalising constants. And the repeal of requiring an exact simulation from the very distribution with the missing constant, ƒ(.|θ). Which means that in practice a finite number of MCMC steps will be used instead and will bias the outcome. The algorithm is unusual in that it replaces the exact proposal q(θ’|θ) with an unbiased random version q(θ’|θ)ƒ(z|θ’), z being just an augmentation of the proposal. (The current JASA paper by Liang et al. seems to confuse augment and argument, see p.378.)
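
To make the above steps concrete, here is a toy illustration of my own [not from the paper], with a Laplace location model standing in for a truly intractable one: only the unnormalised density exp(-|y-θ|) is ever evaluated, while exact simulation from ƒ(.|θ’) is available, as the plain exchange algorithm requires:

uf=function(y,theta) exp(-abs(y-theta)) #unnormalised density, constant treated as unknown
rf=function(theta) theta+sample(c(-1,1),1)*rexp(1) #exact draw from f(.|theta)
y=1.2 #single observation
T=1e4;theta=rep(0,T)
for (t in 2:T){
 prop=theta[t-1]+.5*rnorm(1) #symmetric random walk proposal, q terms cancel
 z=rf(prop) #auxiliary pseudo-observation z~f(z|theta')
 rho=dnorm(prop,0,sqrt(10))*uf(y,prop)*uf(z,theta[t-1])/
  (dnorm(theta[t-1],0,sqrt(10))*uf(y,theta[t-1])*uf(z,prop)) #ratio above, with a N(0,10) prior
 theta[t]=ifelse(runif(1)<rho,prop,theta[t-1])}

Of course, the whole point of the adaptive exchange version is that such an exact simulation from ƒ(.|θ’) is usually unavailable.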

To avoid the difficulty of simulating from ƒ(.|θ), the authors draw pseudo-observations from sampling distributions with a finite number m of parameter values, under the [unrealistic] assumption (A⁰) that this collection of values provides an almost complete cover of the posterior support. One of the tricks is an auxiliary [time-heterogeneous] chain of pseudo-observations generated by single Metropolis steps from one of these m fixed targets. These pseudo-observations are then used in the main (or target) chain to define the above exchange probability. The auxiliary chain is Markov but time-heterogeneous since the probabilities of accepting a move evolve with time according to a simulated annealing schedule. Which produces a convergent estimate of the m normalising constants. The main chain is not Markov in that it depends on the whole history of the auxiliary chain [see Step 5, p.380]. Even jointly, the collection of both chains is not Markov. The paper prefers to consider the process as an adaptive Markov chain. I did not check the rather intricate details, so cannot judge the validity of the overall algorithm; I simply note that one condition (A², p.383) is incredibly strong in that it assumes the Markov transition kernel to be uniformly Doeblin on any compact set of the calibration parameters.

However, the major difficulty with this approach seems to lie in its delicate calibration. From providing a reference set of m parameter values scanning the posterior support, to picking transition kernels on both the parameter and the sample spaces, to properly cooling the annealing schedule [always a fun part!], there seems to be [from my armchair expert’s perspective, of course!] a wide range of opportunities for missing the target or running into zero acceptance problems. Both examples analysed in the paper, the auto-logistic and the auto-normal models, are actually of limited complexity in that they depend on a few parameters, 2 and 4 resp., and enjoy sufficient statistics, of dimensions 2 and 4 as well. Hence simulating (pseudo-)realisations of those sufficient statistics should be less challenging than in the original approach, which replicates an entire vector of thousands of dimensions.

an attempt at EP-ABC from scratch, nothing more… [except for a few bugs]

Posted in R, Statistics, University life on October 19, 2016 by xi'an

Following a request from one of the reviewers of our chapter Likelihood-free model choice, I tried to run EP-ABC on a toy problem and to compare it with the outcome of a random forest ABC. Literally starting from scratch, namely from the description found in Simon and Nicolas’ JASA paper.  To run my test, I chose as my modelled data an exponential Exp(λ) sample of size 20, with a Gaussian N(0,1) prior on the log parameter (here replicated 100 times):

n=20
trueda=matrix(rexp(100*n),ncol=n) #100 replicated Exp(1) datasets of size n
#natural parameters (precision x mean, precision) of the N(0,1) prior on the log parameter
er0=0
Q0=1

Then I applied the formulas found in the paper for approximating the evidence, first defining the log normalising constant for an unnormalised Gaussian density as on the top of p.318

Psi=function(er,Q){
  -.5*(log(Q/2/pi)-Q*er*er)}

[which exhibited a typo in the paper, as Simon Barthelmé figured out after a few email exchanges: the correct formulation involves the inverse of Q]

Psi=function(er,Q){
  -.5*(log(Q/2/pi)-er*er/Q)}
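
[indeed, as a quick check of the corrected version, mine and not from the paper: for an unnormalised Gaussian with natural parameters (r,Q), i.e. exp{rx-Qx²/2}, the normalising constant is

\int_{-\infty}^{+\infty} \exp\{rx-Qx^2/2\}\,\text{d}x=\sqrt{2\pi/Q}\,\exp\{r^2/2Q\}

whose logarithm is the corrected Psi(er,Q) above with r=er]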

and iterating the site updates as described in Sections 2.2, 3.1 and Algorithms 1 and 3:

ep=function(dat,M=1e4,eps,Nep=10){
  n=length(dat)
  #global and ith natural parameters for Gaussian approximations
  #initialisation:
  Qi=rep(0,n)
  Q=Q0+sum(Qi)
  eri=rep(0,n)
  er=er0+sum(eri)
  #constants Ci in (2.6)
  normz=rep(1,n)
  for (t in 1:Nep){
   for (i in sample(1:n)){
    #site i update
    Qm=Q-Qi[i] #m for minus i
    erm=er-eri[i]
    #ABC sample
    thetas=rnorm(M,me=erm/Qm,sd=1/sqrt(Qm))
    dati=rexp(M,rate=exp(thetas))
    #update by formula (3.3)
    Mh=sum((abs(dati-dat[i])<eps))
    mu=mean(thetas[abs(dati-dat[i])<eps])
    sig=var(thetas[abs(dati-dat[i])<eps])
    if (Mh>1e3){
     #enough proposals accepted
     Qi[i]=1/sig-Qm
     eri[i]=mu/sig-erm
     Q=Qm+Qi[i]
     er=erm+eri[i]
     #update of Ci as in formula (2.7)
     #with correction bottom of p.319
     normz[i]=log(Mh/M/2/eps)-Psi(er,Q)+Psi(erm,Qm)
  }}}
  #approximation of the log evidence as on p.318
  return(sum(normz)+Psi(er,Q)-Psi(er0,Q0))}

except that I made an R beginner’s mistake (!) by writing the normal simulation as

thetas=rnorm(M,me=erm/Qm)/sqrt(Qm)

The resulting EP-ABC approximation is to be compared with the genuine evidence under a conjugate Exp(1) prior [the values of the evidence should differ, but should not differ that much]:

ev=function(dat){
 n=length(dat)
 lgamma(n+1)-(n+1)*log(1+sum(dat))
 }
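
For the record [a one-line derivation of mine, not spelled out in the post], the conjugate evidence follows from

\int_0^\infty \lambda^n e^{-\lambda\sum_i x_i}\,e^{-\lambda}\,\text{d}\lambda=\frac{n!}{(1+\sum_i x_i)^{n+1}}

whose logarithm is what ev() returns.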

After correcting those bugs and the too-small sample sizes, thanks to Simon’s kind-hearted help(!), running this code results in minor discrepancies between both quantities:

evid=ovid=rep(0,100)
M=1e4 #number of ABC samples
propep=.1 #tolerance factor
for (t in 1:100){
 #tolerance scaled by data
 epsi=propep*sd(trueda[t,])
 evid[t]=ep(trueda[t,],M,epsi)
 ovid[t]=ev(trueda[t,])
 }
plot(evid,ovid,cex=.6,pch=19)

as shown by the comparison below between the evidence and the EP-ABC approximations to the evidence, called epabence (the dashed line is the diagonal):

[plot of the evidence against the EP-ABC approximation, with the diagonal as a dashed line]

Obviously, this short (if lengthy at my scale) experiment is not intended to conclude that EP-ABC works or does not work. (Simon also provided an additional comparison with the genuine evidence, that is, under the right prior, and with the zero Monte Carlo version of the EP-ABC evidence, achieving a high concentration over the diagonal.) I just conclude that the method does require a certain amount of calibration to become stable. Calibration steps that are clearly spelled out in the paper (adaptive choice of M, stabilisation by quasi-random samples, stabilisation by stochastic optimisation steps and importance sampling), but that also imply that EP-ABC is “far from routine use because it may take days to tune on real problems” (pp.315-316). Thinking about the reasons for such discrepancies over the past days, one possible explanation I see for the impact of the calibration parameters is that the validation of EP, if any, occurs at a numerical rather than Monte Carlo level, hence that the Monte Carlo error must be really small to avoid interfering with the numerical aspects.