## Poisson process model for Monte Carlo methods

Posted in Books with tags , , , , , , , on February 25, 2016 by xi'an

“Taken together this view of Monte Carlo simulation as a maximization problem is a promising direction, because it connects Monte Carlo research with the literature on optimization.”

Chris Maddison arXived today a paper on the use of Poisson processes in Monte Carlo simulation. based on the so-called Gumbel-max trick, which amounts to add to the log-probabilities log p(i) of the discrete target, iid Gumbel variables, and to take the argmax as the result of the simulation. A neat trick as it does not require the probability distribution to be normalised. And as indicated in the above quote to relate simulation and optimisation. The generalisation considered here replaces the iid Gumbel variates by a Gumbel process, which is constructed as an “exponential race”, i.e., a Poisson process with an exponential auxiliary variable. The underlying variates can be generated from a substitute density, à la accept-reject, which means this alternative bounds the true target.  As illustrated in the plot above.

The paper discusses two implementations of the principle found in an earlier NIPS 2014 paper [paper that contains most of the novelty about this method], one that refines the partition and the associated choice of proposals, and another one that exploits a branch-and-bound tree structure to optimise the Gumbel process. With apparently higher performances. Overall, I wonder at the applicability of the approach because of the accept-reject structure: it seems unlikely to apply to high dimensional problems.

While this is quite exciting, I find it surprising that this paper completely omits references to Brian Ripley’s considerable input on simulation and point processes. As well as the relevant Geyer and Møller (1994). (I am obviously extremely pleased to see that our 2004 paper with George Casella and Marty Wells is quoted there. We had written this paper in Cornell, a few years earlier, right after the 1999 JSM in Baltimore, but it has hardly been mentioned since then!)

## difference between Metropolis, Gibbs, importance, and rejection sampling

Posted in Books, Kids, Statistics with tags , , , , , on December 14, 2015 by xi'an

Last week, while I was preparing my talk for the NIPS workshop, I spotted this fairly generic question on X validated. And decided to procrastinate by answering through generic comments on the pros and cons of each method. This is a challenging if probably empty question as it lacks a measure of evaluation for those different approaches.  And this is another reason why I replied, in that it relates to my pondering the a-statistical nature of simulation-based approximation methods. Also called probabilistic numerics, not statistical numerics, eh! It is indeed close to impossible to compare such approaches and others on a general basis. For instance, the comparative analysis greatly differs when dealing with a once-in-a-lifetime problem and with an everyday issue, e.g. when building a package for a sufficiently standard model. In the former case, a quick-and-dirty off-the-shelf solution is recommended, while in the latter, designing an efficient and fine-tuned approach makes sense. (The pros and cons I discussed in my X validated answer thus do not apply in most settings!) If anything, using several approaches, whenever possible, is the best advice to give. If not on the targeted problem, at least on a toy or simulated version, to check for performances of those different tools. But this brings back the issue of cost and time… An endless garden of forking paths, one would say [in another setting].

## adaptive rejection sampling with fixed number of nodes

Posted in Books, Statistics, University life with tags , , , , , on October 8, 2015 by xi'an

This paper was arXived today by Martino and Louzada. It starts from the adaptive rejection sampling algorithm of Gilks and Wild (1993), which provided an almost automatic random generator for univariate log-concave target probability densities. By creating an envelope of the true target based on convexity. This algorithm was at the core of BUGS in its early days. And makes a good example of accept reject algorithm that I often use in class. It is however not so popular nowadays because of the unidimensional constraint. And because it is not particularly adequate for simulating a single realisation from a given target. As is the case when used in a Gibbs sampler. The authors only consider here the issue of the growing cost of running the adaptive rejection sampling algorithm, which is due to the update of the envelope at each rejection. They however fail to quantify this increase, which involves (a) finding the interval where the current rejection lies, among n existing intervals, which is of order O(n), and (b) creating both modified envelopes on both new intervals, which is of order O(1). The proposal of the authors, called cheap adaptive rejection sampling, settles for a fixed number τ of support points (and intervals), solely swapping a rejected point with the closest support point when this improves the acceptance rate. The test for swapping involves first computing the alternative target (on a pair of intervals) and the corresponding normalising constant, while swapping the rejected point with the closest support point involves finding the closest point, which is of order O(log τ). I am rather surprised that the authors do not even mention the execution time orders, resorting instead to a simulation where the gain brought by their proposal is not overwhelming. There is also an additional cost for figuring out the fixed number τ of support points, another issue not mentioned in the paper.

## an email exchange about integral representations

Posted in Books, R, Statistics, University life with tags , , , , on April 8, 2015 by xi'an

I had an interesting email exchange [or rather exchange of emails] with a (German) reader of Introducing Monte Carlo Methods with R in the past days, as he had difficulties with the validation of the accept-reject algorithm via the integral

$\mathbb{P}(Y\in \mathcal{A},U\le f(Y)/Mg(Y)) = \int_\mathcal{A} \int_0^{f(y)/Mg(y)}\,\text{d}u\,g(y)\,\text{d}y\,,$

in that it took me several iterations [as shown in the above] to realise the issue was with the notation

$\int_0^a \,\text{d}u\,,$

which seemed to be missing a density term or, in other words, be different from

$\int_0^1 \,\mathbb{I}_{(0,a)}(u)\,\text{d}u\,,$

What is surprising for me is that the integral

$\int_0^a \,\text{d}u$

has a clear meaning as a Riemann integral, hence should be more intuitive….

## understanding the Hastings algorithm

Posted in Books, Statistics with tags , , , , , on August 26, 2014 by xi'an

David Minh and Paul Minh [who wrote a 2001 Applied Probability Models] have recently arXived a paper on “understanding the Hastings algorithm”. They revert to the form of the acceptance probability suggested by Hastings (1970):

$\rho(x,y) = s(x,y) \left(1+\dfrac{\pi(x) q(y|x)}{\pi(y) q(x|y)}\right)^{-1}$

where s(x,y) is a symmetric function keeping the above between 0 and 1, and q is the proposal. This obviously includes the standard Metropolis-Hastings form of the ratio, as well as Barker’s (1965):

$\rho(x,y) = \left(1+\dfrac{\pi(x) q(y|x)}{\pi(y) q(x|y)}\right)^{-1}$

which is known to be less efficient by accepting less often (see, e.g., Antonietta Mira’s PhD thesis). The authors also consider the alternative

$\rho(x,y) = \min(\pi(y)/ q(y|x),1)\,\min(q(x|y)/\pi(x),1)$

which I had not seen earlier. It is a rather intriguing quantity in that it can be interpreted as (a) a simulation of y from the cutoff target corrected by reweighing the previous x into a simulation from q(x|y); (b) a sequence of two acceptance-rejection steps, each concerned with a correspondence between target and proposal for x or y. There is an obvious caveat in this representation when the target is unnormalised since the ratio may then be arbitrarily small… Yet another alternative could be proposed in this framework, namely the delayed acceptance probability of our paper with Marco and Clara, one special case being

$\rho(x,y) = \min(\pi_1(y)q(x|y)/\pi_1(x) q(y|x),1)\,\min(\pi_2(y)/\pi_1(x),1)$

where

$\pi(x)\propto\pi_1(x)\pi_2(x)$

is an arbitrary decomposition of the target. An interesting remark in the paper is that any Hastings representation can alternatively be written as

$\rho(x,y) = \min(\pi(y)/k(x,y)q(y|x),1)\,\min(k(x,y)q(x|y)/\pi(x),1)$

where k(x,y) is a (positive) symmetric function. Hence every single Metropolis-Hastings is also a delayed acceptance in the sense that it can be interpreted as a two-stage decision.

The second part of the paper considers an extension of the accept-reject algorithm where a value y proposed from a density q(y) is accepted with probability

$\min(\pi(y)/ Mq(y),1)$

and else the current x is repeated, where M is an arbitrary constant (incl. of course the case where it is a proper constant for the original accept-reject algorithm). Curiouser and curiouser, as Alice would say! While I think I have read some similar proposal in the past, I am a wee intrigued at the appear of using only the proposed quantity y to decide about acceptance, since it does not provide the benefit of avoiding generations that are rejected. In this sense, it appears as the opposite of our vanilla Rao-Blackwellisation. (The paper however considers the symmetric version called the independent Markovian minorizing algorithm that only depends on the current x.) In the extension to proposals that depend on the current value x, the authors establish that this Markovian AR is in fine equivalent to the generic Hastings algorithm, hence providing an interpretation of the “mysterious” s(x,y) through a local maximising “constant” M(x,y). A possibly missing section in the paper is the comparison of the alternatives, albeit the authors mention Peskun’s (1973) result that exhibits the Metropolis-Hastings form as the optimum.

## recycling accept-reject rejections (#2)

Posted in R, Statistics, University life with tags , , , , , , on July 2, 2014 by xi'an

Following yesterday’s post on Rao’s, Liu’s, and Dunson’s paper on a new approach to intractable normalising constants, and taking advantage of being in Warwick, I tested the method on a toy model, namely the posterior associated with n Student’s t observations with unknown location parameter μ and a flat prior,

$x_1,\ldots,x_n \sim p(x|\mu) \propto \left[ 1+(x-\mu)^2/\nu \right]^{-(\nu+1)/2}$

which is “naturally” bounded by a Cauchy density with scale √ν. The constant M is then easily derived and running the new algorithm follows from a normal random walk proposal targeting the augmented likelihood (R code below).

As shown by the above graph, the completion-by-rejection scheme produces a similar outcome (tomato) as the one based on the sole observations (steelblue). With a similar acceptance rate. However, the computing time is much much degraded:

> system.time(g8())
user  system elapsed
53.751   0.056  54.103
> system.time(g9())
user  system elapsed
1.156   0.000   1.161


when compared with the no-completion version. Here is the entire R code that produced both MCMC samples: Continue reading

## recycling accept-reject rejections

Posted in Statistics, University life with tags , , , , , , , , , on July 1, 2014 by xi'an

Vinayak Rao, Lizhen Lin and David Dunson just arXived a paper which proposes anew technique to handle intractable normalising constants. And which exact title is Data augmentation for models based on rejection sampling. (Paper that I read in the morning plane to B’ham, since this is one of my weeks in Warwick.) The central idea therein is that, if the sample density (aka likelihood) satisfies

$p(x|\theta) \propto f(x|\theta) \le q(x|\theta) M\,,$

where all terms but p are known in closed form, then completion by the rejected values of an hypothetical accept-reject algorithm−hypothetical in the sense that the data does not have to be produced by an accept-reject scheme but simply the above domination condition to hold−allows for a data augmentation scheme. Without requiring the missing normalising constant. Since the completed likelihood is

$\prod_{i=1}^n \dfrac{f(x_i|\theta)}{M} \prod_{j=1}^{m_i} \left\{q(y_{ij}|\theta) -\dfrac{f(y_{ij}|\theta)}{M}\right\}$

A closed-form, if not necessarily congenial, function.

Now this is quite a different use of the “rejected values” from the accept reject algorithm when compared with our 1996 Biometrika paper on the Rao-Blackwellisation of accept-reject schemes (which, still, could have been mentioned there… Or Section 4.2 of Monte Carlo Statistical Methods. Rather than re-deriving the joint density of the augmented sample, “accepted+rejected”.)

It is a neat idea in that it completely bypasses the approximation of the normalising constant. And avoids the somewhat delicate tuning of the auxiliary solution of Moller et al. (2006)  The difficulty with this algorithm is however in finding an upper bound M on the unnormalised density f that is

1. in closed form;
2. with a manageable and tight enough “constant” M;
3. compatible with running a posterior simulation conditional on the added rejections.

The paper seems to assume further that the bound M is independent from the current parameter value θ, at least as suggested by the notation (and Theorem 2), but this is not in the least necessary for the validation of the formal algorithm. Such a constraint would pull M higher, hence reducing the efficiency of the method. Actually the matrix Langevin distribution considered in the first example involves a bound that depends on the parameter κ.

The paper includes a result (Theorem 2) on the uniform ergodicity that relies on heavy assumptions on the proposal distribution. And a rather surprising one, namely that the probability of rejection is bounded from below, i.e. calling for a less efficient proposal. Now it seems to me that a uniform ergodicity result holds as well when the probability of acceptance is bounded from below since, then, the event when no rejection occurs constitutes an atom from the augmented Markov chain viewpoint. There therefore occurs a renewal each time the rejected variable set ϒ is empty, and ergodicity ensues (Robert, 1995, Statistical Science).

Note also that, despite the opposition raised by the authors, the method per se does constitute a pseudo-marginal technique à la Andrieu-Roberts (2009) since the independent completion by the (pseudo) rejected variables produces an unbiased estimator of the likelihood. It would thus be of interest to see how the recent evaluation tools of Andrieu and Vihola can assess the loss in efficiency induced by this estimation of the likelihood.

Maybe some further experimental evidence tomorrow…