Archive for Monte Carlo Statistical Methods

weakly informative reparameterisations for location-scale mixtures

Posted in Books, pictures, R, Statistics, University life on January 19, 2017 by xi'an

[figure: fitted_density_galaxy_data_500iters]

We have been working towards a revision of our reparameterisation paper for quite a while now and took advantage of Kate Lee visiting Paris this fortnight to make a final round: we have now arXived (and submitted) the new version. The major change against the earlier version is the extension of the approach to a large class of models that include infinitely divisible distributions, compound Gaussian, Poisson, and exponential distributions, and completely monotonic densities. The concept remains identical: change the parameterisation of a mixture from a component-wise decomposition to a construct made of the first moment(s) of the distribution and of component-wise objects constrained by the moment equation(s). There is of course a bijection between both parameterisations, but the constraints appearing in the latter produce compact parameter spaces for which (different) uniform priors can be proposed. While the resulting posteriors are no longer conjugate, even conditional on the latent variables, standard Metropolis algorithms can be implemented to produce Monte Carlo approximations of these posteriors.
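As a small illustration of the change of parameterisation described above, here is a minimal R sketch for a two-component Gaussian mixture. The numerical values and the names `gam` and `eta` are assumptions made for the example, not taken from the paper: the sketch only checks that, once the components are standardised by the global mean and standard deviation of the mixture, the component-wise objects satisfy two moment equations and hence live in a bounded set.

```r
# Two-component Gaussian mixture in its usual component-wise form
# (all numerical values are arbitrary, chosen for illustration only)
p    <- c(0.3, 0.7)    # weights
mu_k <- c(-1, 2)       # component means
sd_k <- c(0.5, 1.5)    # component standard deviations

# Global mean and standard deviation of the mixture
mu    <- sum(p * mu_k)
sigma <- sqrt(sum(p * (sd_k^2 + mu_k^2)) - mu^2)

# Standardised component-wise objects (hypothetical names)
gam <- (mu_k - mu) / sigma   # relative locations
eta <- sd_k / sigma          # relative scales

# Moment equations constraining the new parameters
c1 <- sum(p * gam)               # first-moment constraint, equal to 0
c2 <- sum(p * (gam^2 + eta^2))   # second-moment constraint, equal to 1

# The map is a bijection: the original parameters are recovered
mu_back <- mu + sigma * gam
sd_back <- sigma * eta
```

Since each term `p * gam^2` and `p * eta^2` is bounded by one, the constrained objects range over a compact set for given weights, which is what makes uniform priors possible.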

recycling Gibbs auxiliaries [a reply]

Posted in Books, pictures, Statistics, University life on January 3, 2017 by xi'an

[Here is a reply sent to me by Luca Martino, Victor Elvira, and Gustau Camps-Valls, after my earlier comments on their paper.]

We provide our contribution to the discussion, reporting our experience with the application of Metropolis-within-Gibbs schemes. Since in literature there are miscellaneous opinions, we want to point out the following considerations:

– according to our experience, the use of M>1 steps of the Metropolis-Hastings (MH) method for drawing from each full-conditional (with or without recycling) decreases the MSE of the estimation (see code Ex1-Ex2 and related Figure 7(b) and Figures 8). If the corresponding full conditional is very concentrated, one possible solution is to apply an adaptive or automatic MH for drawing from this full-conditional (it can require the use of M internal steps; see references in Section 3.2).

– Fixing the number of evaluations of the posterior, a comparison is required between a longer Gibbs chain with a single step of MH and a shorter Gibbs chain with M>1 steps of MH for each full-conditional. Generally, there is no clear winner. The better performance depends on different aspects: the specific scenario, whether an adaptive MH is employed or not, whether the recycling is applied or not (see Figure 10(a) and the corresponding code Ex2).

The previous considerations are supported/endorsed by several authors (see the references in Section 3.2). In order to highlight the number of controversial opinions about the MH-within-Gibbs implementation, we report a last observation:

– If it is possible to draw directly from the full-conditionals, of course this is the best scenario (this is our belief). Remarkably, as also reported in Chapter 1, page 393 of the book “Monte Carlo Statistical Methods”, C. Robert and Casella, 2004, some authors have found that a “bad” choice of the proposal function in the MH step (i.e., different from the full conditional, or a poor approximation of it) can improve the performance of the MH-within-Gibbs sampler. Namely, they assert that a more “precise” approximation of the full-conditional does not necessarily improve the overall performance. In our opinion, this is possibly due to the fact that the acceptance rate in the MH step (lower than 1) induces an “accidental” random scan of the components of the target pdf in the Gibbs sampler, which can improve the performance in some cases. In our work, for simplicity, we only focus on the deterministic scan. However, a random scan could also be considered.

zig, zag, and subsampling

Posted in Books, Statistics, University life on December 29, 2016 by xi'an

[picture: ENSAE, Nov. 17, 2010]

Today, I alas missed a seminar at BiPS on the Zig-Zag (sub-)sampler of Joris Bierkens, Paul Fearnhead and Gareth Roberts, presented here in Paris by James Ridgway. Fortunately for me, I had some discussions with Murray Pollock in Warwick and then again with Changye Wu in Dauphine that shed some light on this complex but highly innovative approach to simulating in Big Data settings thanks to a correct subsampling mechanism.

The zig-zag process runs a continuous process made of segments that turn from one diagonal to the next at random times driven by a generator connected with the components of the gradient of the target log-density. Plus a symmetric term. Provided those random times can be generated, this process is truly available and associated with the right target distribution. When the components of the parameter are independent (an unlikely setting), those random times can be associated with an inhomogeneous Poisson process. In the general case, one needs to bound the gradients by more manageable functions that create a Poisson process that can later be thinned. Next, one needs to simulate the process for the upper bound, a task that seems hard to achieve apart from linear and piecewise constant upper bounds. The process has a bit of a slice sampling taste, except that it cannot be used as a slice sampler but requires continuous time integration, given that the length of each segment matters. (Or maybe random time subsampling?)
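To make the mechanism above more concrete, here is a minimal R sketch of a one-dimensional zig-zag process, under assumptions that are mine rather than the paper's: the target is a standard normal, the additional symmetric term is set to zero, and all variable names are hypothetical. In this toy case the switching rate along a segment is max(0, θx + s), so the event times can be simulated exactly by inverting the integrated rate, with no bounding or thinning needed. The estimate integrates along the continuous path, since the length of each segment matters.

```r
# One-dimensional zig-zag sampler for a standard normal target (toy sketch)
set.seed(123)
nev <- 1e4         # number of switching events (assumed value)
x <- 0             # current position
theta <- 1         # current velocity, in {-1, +1}
tot_time <- 0      # total elapsed time
m2 <- 0            # running integral of x(t)^2 along the path

for (k in 1:nev) {
  a <- theta * x   # switching rate at time s along the segment: max(0, a + s)
  # solve integral_0^tau max(0, a + s) ds = E with E ~ Exp(1)
  tau <- -a + sqrt(max(a, 0)^2 + 2 * rexp(1))
  xnew <- x + theta * tau
  # exact integral of (x + theta * s)^2 over the segment
  m2 <- m2 + (xnew^3 - x^3) / (3 * theta)
  tot_time <- tot_time + tau
  x <- xnew
  theta <- -theta  # flip the velocity at the event
}

est <- m2 / tot_time   # continuous-time estimate of E[X^2]
```

For a general target the inversion step is not available in closed form, which is where the upper bounds and Poisson thinning mentioned above come in.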

A highly innovative part of the paper concentrates on Big Data likelihoods and on the possibility to subsample properly and exactly the original dataset. The authors propose Zig-Zag with subsampling by turning the gradients into random parts of the gradients. While remaining unbiased. There may be a cost associated with this gain of one to n, namely that the upper bounds may turn larger as they handle all elements in the likelihood at once, hence become (even) less efficient. (I am more uncertain about the case of the control variates, as it relies on a Lipschitz assumption.) While I still miss an easy way to implement the approach in a specific model, I remain hopeful for this new approach to make a major dent in the current methodologies!


Posted in Books, Mountains, pictures, Statistics, Travel on December 22, 2016 by xi'an

“An anytime algorithm is an algorithm that can be run continuously, generating progressively better solutions when afforded additional computation time. Traditional particle-based inference algorithms are not anytime in nature; all particles need to be propagated in lock-step to completion in order to compute expectations.”

Following a discussion with Lawrence Murray last week, I read Paige et al.’s NIPS 2014 paper on their anytime sequential Monte Carlo algorithm. As explained above, an anytime algorithm is interruptible, meaning it can be stopped at any time without biasing the outcome of the algorithm. While MCMC algorithms can qualify as anytime (provided they are in stationary regime), it is not the case with sequential and particle Monte Carlo algorithms, which do not have an inbred growing mechanism preserving the target. In the case of Paige et al.’s proposal, the interruptible solution returns an unbiased estimator of the marginal likelihood at time n for any number of particles, even when this number is set or increased during the computation. The idea behind the solution is to create a particle cascade by going one particle at a time and creating children of this particle in proportion to the current average weight. An approach that can be run indefinitely. And since memory is not infinite, the authors explain how to cap the number of alive particles without putting the running distribution in jeopardy…

puzzled by harmony [not!]

Posted in Books, Kids, Mountains, pictures, R, Running, Statistics, Travel on December 13, 2016 by xi'an

In answering yet another question on X validated about the numerical approximation of the marginal likelihood, I suggested using a harmonic mean estimate as a simple but worthless solution based on an MCMC posterior sample. This was on a toy example with a uniform prior on (0,π) and a “likelihood” equal to sin(θ) [really a toy problem!]. Simulating an MCMC chain by a random walk Metropolis-Hastings algorithm is straightforward, as is returning the harmonic mean of the sin(θ)’s.

#target: sin(x) on (0,pi), zero elsewhere
f <- function(x){
    if ((0<x)&(x<pi)){
        return(sin(x))
    }else{
        return(0)}
}

n = 2000 #number of iterations
sigma = 0.5
x = runif(1,0,pi) #initial x value
chain = fx = f(x) #stores the values of sin(x) along the chain
#generates an array of random x values from norm distribution
rands = rnorm(n,0, sigma)
#Metropolis - Hastings algorithm
for (i in 2:n){
    can = x + rands[i]  #candidate for jump
    fcan = f(can)
    aprob = fcan/fx #acceptance probability
    if (runif(1) < aprob){
        x = can
        fx = fcan}
    chain = c(chain, fx) #current value, whether or not the move is accepted
}
I = pi*length(chain)/sum(1/chain) #integral harmonic approximation

However, the outcome looks remarkably stable and close to the expected value (2/π for the marginal likelihood, hence 2 for I), despite 1/sin(θ) having an infinite integral on (0,π). Meaning that the average of the 1/sin(θ)’s has no finite variance. Hence I wonder why this specific example does not lead to an unreliable output… But re-running the chain with a smaller scale σ starts producing values of sin(θ) regularly closer to zero, which leads to an estimate of I both farther away from 2 and much more variable. No miracle, in the end!
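The comparison between scales can be sketched by wrapping the sampler in a function and repeating it. The function name `harmonic`, the number of replications, and the second scale 0.05 are assumptions made for this illustration, and no particular output is claimed, since the estimator has no finite variance.

```r
# sin(x) on (0, pi), zero elsewhere
f <- function(x) if (0 < x && x < pi) sin(x) else 0

# Harmonic mean approximation of the integral of sin over (0, pi),
# based on a random walk Metropolis-Hastings chain with scale sigma
harmonic <- function(sigma, n = 2000) {
  x <- runif(1, 0, pi)
  fx <- f(x)
  chain <- numeric(n)
  chain[1] <- fx
  for (i in 2:n) {
    can <- x + rnorm(1, 0, sigma)   # candidate for jump
    fcan <- f(can)
    if (runif(1) < fcan / fx) {     # accept with probability min(1, fcan/fx)
      x <- can
      fx <- fcan
    }
    chain[i] <- fx
  }
  pi * n / sum(1 / chain)
}

set.seed(2016)
est_wide   <- replicate(50, harmonic(0.5))    # scale used in the post
est_narrow <- replicate(50, harmonic(0.05))   # smaller scale (assumed value)
rbind(wide = summary(est_wide), narrow = summary(est_narrow))
```

Comparing the two rows of the summary gives a rough idea of how the spread of the estimate changes with σ for a given seed.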

recycling Gibbs auxiliaries

Posted in Books, pictures, Statistics, University life on December 6, 2016 by xi'an

[picture: wreck of the S.S. Dicky, Caloundra beach, Qld, Australia, Aug. 19, 2012]

Luca Martino, Victor Elvira and Gustau Camps-Valls have arXived a paper on recycling for Gibbs sampling. The argument therein is to take advantage of all simulations induced by MCMC simulation for one full conditional, towards improving estimation if not convergence. The context is thus one when Metropolis-within-Gibbs operates, with several (M) iterations of the corresponding Metropolis being run instead of only one (which is still valid from a theoretical perspective). While there are arguments for augmenting those iterations, as recalled in the paper, I am not a big fan of running a fixed number M of iterations as this does not approximate better the simulation from the exact full conditional and, even if this approximation was perfect, the goal remains simulating from the joint distribution. As such, multiplying the number of Metropolis iterations does not necessarily impact the convergence rate, only brings it closer to the standard Gibbs rate. Moreover, the improvement does vary with the chosen component, meaning that the different full conditionals have different characteristics that produce various levels of variance reduction:

  • if the targeted expectation only depends on one component of the Markov chain, multiplying the number of simulations for the other components has no clear impact, except in increasing time;
  • if the corresponding full conditional is very concentrated, repeating simulations should produce quasi-repetitions, and no gain.

The only advantage in computing time that I can see at this stage is when constructing the MCMC sampler for the full proposal is much more costly than repeating MCMC iterations, which are then almost free and contribute to the reduction of the variance of the estimator.
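For concreteness, here is a minimal R sketch of a Metropolis-within-Gibbs sampler with M inner steps per full conditional. The toy target (a bivariate normal with correlation `rho`), the random-walk scale `tau`, and the reading of "recycling" as simply keeping every inner state are all assumptions of this sketch, not necessarily the estimator of the paper.

```r
set.seed(42)
rho   <- 0.8    # correlation of the toy bivariate normal target (assumed)
M     <- 5      # inner Metropolis steps per full conditional
niter <- 5000   # number of Gibbs sweeps
tau   <- 1      # random-walk scale (assumed)

# log full conditional of one component given the other
logcond <- function(u, v) dnorm(u, rho * v, sqrt(1 - rho^2), log = TRUE)

x <- c(0, 0)
last <- matrix(NA_real_, niter, 2)           # standard output: state after each sweep
rec  <- matrix(NA_real_, niter * 2 * M, 2)   # recycled output: state after each inner step
r <- 0
for (t in 1:niter) {
  for (j in 1:2) {
    for (m in 1:M) {
      prop <- x[j] + tau * rnorm(1)
      # Metropolis step targeting the full conditional of component j
      if (log(runif(1)) < logcond(prop, x[3 - j]) - logcond(x[j], x[3 - j])) {
        x[j] <- prop
      }
      r <- r + 1
      rec[r, ] <- x
    }
  }
  last[t, ] <- x
}

est_last <- colMeans(last)   # estimate from the end-of-sweep states only
est_rec  <- colMeans(rec)    # estimate using all inner states
```

Both averages target the same expectation, so the comparison of interest is their variability across repeated runs for a fixed number of target evaluations.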

This analysis of MCMC-within-Gibbs strategies reminds me of a recent X validated question, which was about the proper degree of splitting simulations from a marginal and from a corresponding conditional in the chain rule, the optimal balance being in my opinion dependent on the relative variances of the conditional expectations.

A last point is that recycling in the context of simulation and Monte Carlo methodology makes me immediately think of Rao-Blackwellisation, which is surprisingly absent from the current paper. Rao-Blackwellisation was introduced in the MCMC literature and to the MCMC community in the first papers of Alan Gelfand and Adrian Smith, in 1990. While it does not always produce a major gain in Monte Carlo variability, it remains a generic way of recycling auxiliary variables as shown, e.g., in the recycling paper we wrote with George Casella in 1996, one of my favourite papers.
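As a reminder of what Rao-Blackwellisation looks like in the simplest Gibbs setting, here is a minimal R sketch on a toy bivariate normal target with correlation `rho` (an assumption chosen for the example, unrelated to the paper under discussion). The empirical average of the simulated first component is replaced by the average of its conditional expectations given the second component.

```r
set.seed(7)
rho   <- 0.8     # correlation of the toy bivariate normal target (assumed)
niter <- 5000    # number of Gibbs iterations
x1 <- x2 <- numeric(niter)

# Two-stage Gibbs sampler with exact full conditionals
for (t in 2:niter) {
  x1[t] <- rnorm(1, rho * x2[t - 1], sqrt(1 - rho^2))
  x2[t] <- rnorm(1, rho * x1[t], sqrt(1 - rho^2))
}

# Estimating E[X1] = 0 in two ways
emp <- mean(x1[-1])              # empirical average of the simulated x1's
rb  <- mean(rho * x2[-niter])    # average of E[X1 | x2] = rho * x2 used to draw each x1
```

The second estimator reuses the auxiliary draws of the other component at no extra simulation cost.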

simulation under zero measure constraints

Posted in Books, Kids, R, Statistics, University life on November 17, 2016 by xi'an

A theme that comes up fairly regularly on X validated is the production of a sample with given moments, either for calibration motives or from a misunderstanding of the difference between a distribution mean and a sample average. Here are some entries on that topic:

In most of those questions, the constraint is on the sum or mean of the sample, which allows for an easy resolution by a change of variables. It however gets somewhat harder when the constraint involves more moments or, worse, an implicit solution to an equation. A good example of the latter is the quest for a sample with a given maximum likelihood estimate in the case this MLE cannot be derived analytically. As for instance with a location-scale t sample…

Actually, even when the constraint is solely on the sum, a relevant question is the production of an efficient simulation mechanism. Using a Gibbs sampler that changes one component of the sample at each iteration does not qualify, even though it eventually produces the proper sample. Except for small samples. As in this example

s0=.5 #fixed average
for (t in 2:T){
 for (i in 1:(n-1)){

For very large samples, I figure that proposing from the unconstrained density can achieve a sufficient efficiency, but the in-between setting remains an interesting problem.
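Since the snippet above is truncated, here is a minimal self-contained R sketch of a one-component-at-a-time Gibbs sampler under a fixed average. It assumes a Uniform(0,1) sample, which is my assumption rather than something stated in the text, and the variable names other than `s0`, `T` and `n` are hypothetical. Each step updates one component and adjusts the last one so that the sum is preserved.

```r
# Gibbs sampler for a Uniform(0,1) sample constrained to a fixed average (sketch)
set.seed(1)
n  <- 3      # sample size (small, where this sampler remains workable)
T  <- 1e4    # number of Gibbs sweeps (name kept from the snippet, it masks TRUE's shorthand)
s0 <- .5     # fixed average

sampl <- matrix(s0, T, n)   # start from the constant sample, which meets the constraint
for (t in 2:T) {
  sampl[t, ] <- sampl[t - 1, ]
  for (i in 1:(n - 1)) {
    # total left for components i and n once all others are fixed
    rest <- n * s0 - sum(sampl[t, -c(i, n)])
    # component i is uniform over the values keeping both components in [0,1]
    sampl[t, i] <- runif(1, min = max(0, rest - 1), max = min(1, rest))
    sampl[t, n] <- rest - sampl[t, i]
  }
}
```

Every row of `sampl` then has average `s0`, and the slow mixing for larger `n` comes from only two components moving at each step.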