## understanding the Hastings algorithm

Posted in Books, Statistics with tags , , , , , on August 26, 2014 by xi'an

David Minh and Paul Minh [who wrote a 2001 Applied Probability Models] have recently arXived a paper on “understanding the Hastings algorithm”. They revert to the form of the acceptance probability suggested by Hastings (1970):

$\rho(x,y) = s(x,y) \left(1+\dfrac{\pi(x) q(y|x)}{\pi(y) q(x|y)}\right)^{-1}$

where s(x,y) is a symmetric function keeping the above between 0 and 1, and q is the proposal. This obviously includes the standard Metropolis-Hastings form of the ratio, as well as Barker’s (1965):

$\rho(x,y) = \left(1+\dfrac{\pi(x) q(y|x)}{\pi(y) q(x|y)}\right)^{-1}$

which is known to be less efficient by accepting less often (see, e.g., Antonietta Mira’s PhD thesis). The authors also consider the alternative

$\rho(x,y) = \min(\pi(y)/ q(y|x),1)\,\min(q(x|y)/\pi(x),1)$

which I had not seen earlier. It is a rather intriguing quantity in that it can be interpreted as (a) a simulation of y from the cutoff target corrected by reweighing the previous x into a simulation from q(x|y); (b) a sequence of two acceptance-rejection steps, each concerned with a correspondence between target and proposal for x or y. There is an obvious caveat in this representation when the target is unnormalised since the ratio may then be arbitrarily small… Yet another alternative could be proposed in this framework, namely the delayed acceptance probability of our paper with Marco and Clara, one special case being

$\rho(x,y) = \min(\pi_1(y)q(x|y)/\pi_1(x) q(y|x),1)\,\min(\pi_2(y)/\pi_1(x),1)$

where

$\pi(x)\propto\pi_1(x)\pi_2(x)$

is an arbitrary decomposition of the target. An interesting remark in the paper is that any Hastings representation can alternatively be written as

$\rho(x,y) = \min(\pi(y)/k(x,y)q(y|x),1)\,\min(k(x,y)q(x|y)/\pi(x),1)$

where k(x,y) is a (positive) symmetric function. Hence every single Metropolis-Hastings is also a delayed acceptance in the sense that it can be interpreted as a two-stage decision.

The second part of the paper considers an extension of the accept-reject algorithm where a value y proposed from a density q(y) is accepted with probability

$\min(\pi(y)/ Mq(y),1)$

and else the current x is repeated, where M is an arbitrary constant (incl. of course the case where it is a proper constant for the original accept-reject algorithm). Curiouser and curiouser, as Alice would say! While I think I have read some similar proposal in the past, I am a wee intrigued at the appear of using only the proposed quantity y to decide about acceptance, since it does not provide the benefit of avoiding generations that are rejected. In this sense, it appears as the opposite of our vanilla Rao-Blackwellisation. (The paper however considers the symmetric version called the independent Markovian minorizing algorithm that only depends on the current x.) In the extension to proposals that depend on the current value x, the authors establish that this Markovian AR is in fine equivalent to the generic Hastings algorithm, hence providing an interpretation of the “mysterious” s(x,y) through a local maximising “constant” M(x,y). A possibly missing section in the paper is the comparison of the alternatives, albeit the authors mention Peskun’s (1973) result that exhibits the Metropolis-Hastings form as the optimum.

## recycling accept-reject rejections (#2)

Posted in R, Statistics, University life with tags , , , , , , on July 2, 2014 by xi'an

Following yesterday’s post on Rao’s, Liu’s, and Dunson’s paper on a new approach to intractable normalising constants, and taking advantage of being in Warwick, I tested the method on a toy model, namely the posterior associated with n Student’s t observations with unknown location parameter μ and a flat prior,

$x_1,\ldots,x_n \sim p(x|\mu) \propto \left[ 1+(x-\mu)^2/\nu \right]^{-(\nu+1)/2}$

which is “naturally” bounded by a Cauchy density with scale √ν. The constant M is then easily derived and running the new algorithm follows from a normal random walk proposal targeting the augmented likelihood (R code below).

As shown by the above graph, the completion-by-rejection scheme produces a similar outcome (tomato) as the one based on the sole observations (steelblue). With a similar acceptance rate. However, the computing time is much much degraded:

> system.time(g8())
user  system elapsed
53.751   0.056  54.103
> system.time(g9())
user  system elapsed
1.156   0.000   1.161


when compared with the no-completion version. Here is the entire R code that produced both MCMC samples: Continue reading

## recycling accept-reject rejections

Posted in Statistics, University life with tags , , , , , , , , , on July 1, 2014 by xi'an

Vinayak Rao, Lizhen Lin and David Dunson just arXived a paper which proposes anew technique to handle intractable normalising constants. And which exact title is Data augmentation for models based on rejection sampling. (Paper that I read in the morning plane to B’ham, since this is one of my weeks in Warwick.) The central idea therein is that, if the sample density (aka likelihood) satisfies

$p(x|\theta) \propto f(x|\theta) \le q(x|\theta) M\,,$

where all terms but p are known in closed form, then completion by the rejected values of an hypothetical accept-reject algorithm−hypothetical in the sense that the data does not have to be produced by an accept-reject scheme but simply the above domination condition to hold−allows for a data augmentation scheme. Without requiring the missing normalising constant. Since the completed likelihood is

$\prod_{i=1}^n \dfrac{f(x_i|\theta)}{M} \prod_{j=1}^{m_i} \left\{q(y_{ij}|\theta) -\dfrac{f(y_{ij}|\theta)}{M}\right\}$

A closed-form, if not necessarily congenial, function.

Now this is quite a different use of the “rejected values” from the accept reject algorithm when compared with our 1996 Biometrika paper on the Rao-Blackwellisation of accept-reject schemes (which, still, could have been mentioned there… Or Section 4.2 of Monte Carlo Statistical Methods. Rather than re-deriving the joint density of the augmented sample, “accepted+rejected”.)

It is a neat idea in that it completely bypasses the approximation of the normalising constant. And avoids the somewhat delicate tuning of the auxiliary solution of Moller et al. (2006)  The difficulty with this algorithm is however in finding an upper bound M on the unnormalised density f that is

1. in closed form;
2. with a manageable and tight enough “constant” M;
3. compatible with running a posterior simulation conditional on the added rejections.

The paper seems to assume further that the bound M is independent from the current parameter value θ, at least as suggested by the notation (and Theorem 2), but this is not in the least necessary for the validation of the formal algorithm. Such a constraint would pull M higher, hence reducing the efficiency of the method. Actually the matrix Langevin distribution considered in the first example involves a bound that depends on the parameter κ.

The paper includes a result (Theorem 2) on the uniform ergodicity that relies on heavy assumptions on the proposal distribution. And a rather surprising one, namely that the probability of rejection is bounded from below, i.e. calling for a less efficient proposal. Now it seems to me that a uniform ergodicity result holds as well when the probability of acceptance is bounded from below since, then, the event when no rejection occurs constitutes an atom from the augmented Markov chain viewpoint. There therefore occurs a renewal each time the rejected variable set ϒ is empty, and ergodicity ensues (Robert, 1995, Statistical Science).

Note also that, despite the opposition raised by the authors, the method per se does constitute a pseudo-marginal technique à la Andrieu-Roberts (2009) since the independent completion by the (pseudo) rejected variables produces an unbiased estimator of the likelihood. It would thus be of interest to see how the recent evaluation tools of Andrieu and Vihola can assess the loss in efficiency induced by this estimation of the likelihood.

Maybe some further experimental evidence tomorrow…

## R finals

Posted in R, Statistics, University life with tags , , , , , , , , on January 31, 2013 by xi'an

On the morning I returned from Varanasi and the ISBA meeting there, I had to give my R final exam (along with three of my colleagues in Paris-Dauphine). This year, the R course was completely in English, exam included, which means I can post it here as it may attract more interest than the French examens of past years…

I just completed grading my 32 copies, all from exam A, which takes a while as I have to check (and sometimes recover) the R code, and often to correct the obvious mistakes to see if the deeper understanding of the concepts is there. This year student cohort is surprisingly homogeneous: I did not spot any of the horrors I may have mentioned in previous posts.

I must alas acknowledge a grievous typo in the version of Exam B that was used the day of the final: cutting-and-pasting from A to B, I forgot to change the parameters in Exercise 2, asking them to simulate a Gamma(0,1). It is only after half an hour that a bright student pointed out the impossibility… We had tested the exams prior to printing them but this somehow escaped the four of us!

Now, as I was entering my grades into the global spreadsheet, I noticed a perfect… lack of correlation between those and the grades at the midterm exam. I wonder what that means: I could be grading at random, the levels in November and in January could be uncorrelated, some students could have cheated in November and others in January, student’s names or file names got mixed up, …? A rather surprising outcome!

## optimising accept-reject

Posted in R, Statistics, University life with tags , , , , , , on November 21, 2012 by xi'an

I spotted on R-bloggers a post discussing optimising the efficiency of programming accept-reject algorithms. While it is about SAS programming, and apparently supported by the SAS company, there are two interesting features with this discussion. The first one is about avoiding the dreaded loop in accept-reject algorithms. For instance, taking the case of the truncated-at-one Poisson distribution, the code

rtpois=function(n,lambda){
sampl=c()
while (length(sampl)<n){
x=rpois(1,lambda)
if (x!=0) sampl=c(sampl,x)}
return(sampl)
}


is favoured by my R course students but highly inefficient:

> system.time(rtpois(10^5,.5))
user  system elapsed
61.600  27.781  98.727


both for the stepwise increase in the size of the vector and for the loop. For instance, defining the vector sampl first requires a tenth of the above time (note the switch from 10⁵ to 10⁶):

> system.time(rtpois(10^6,.5))
user  system elapsed
54.155   0.200  62.635


As discussed by the author of the post, a more efficient programming should aim at avoiding the loop by predicting the number of proposals necessary to accept a given number of values. Since the bound M used in accept-reject algorithms is also the expected number of attempts for one acceptance, one should start with something around Mn proposed values. (Assuming of course all densities are properly normalised.) For instance, in the case of the truncated-at-one Poisson distribution based on proposals from the regular Poisson, the bound is 1/1-e. A first implementation of this principle is to build the sample via a few loops:

rtpois=function(n,lambda){
propal=rpois(ceiling(n/(1-exp(-lambda))),lambda)
propal=propal[propal>0]
n0=length(propal)
if (n0>=n)
return(propal[1:n])
else return(c(propal,rtpois(n-n0,lambda)))
}


with a higher efficiency:

> system.time(rtpois(10^6,.5))
user  system elapsed
0.816   0.092   0.928


Replacing the expectation with an upper bound using the variance of the negative binomial distribution does not make a significant dent in the computing time

rtpois=function(n,lambda){
M=1/(1-exp(-lambda))
propal=rpois(ceiling(M*(n+2*sqrt(n/M)/(M-1))),lambda)
propal=propal[propal>0]
n0=length(propal)
if (n0>=n)
return(propal[1:n])
else return(c(propal,rtpois(n-n0,lambda)))}


since we get

> system.time(rtpois(10^6,.5))
user  system elapsed
0.824   0.048   0.877


The second point about this Poisson example is that simulating a distribution with a restricted support using another distribution with a larger support is quite inefficient. Especially when λ goes to zero By comparison, using a Poisson proposal with parameter μ and translating it by 1 may bring a considerable improvement: without getting into the gory details, it can be shown that the optimal value of μ (in terms of maximal acceptance probability) is λ and that the corresponding probability of acceptance is

$\dfrac{1-\exp\{-\lambda\}}{\lambda}$

which is larger than the probability of the original approach when λ is less than one. As shown by the graph below, this allows for a lower bound in the probability of acceptance that remains tolerable.

## R midterms

Posted in Kids, Linux, R, Statistics, University life with tags , , , , , , , , , , , on November 9, 2012 by xi'an

Here are my R midterm exams, version A and version B in English (as students are sitting next to one another in the computer rooms), on simulation methods for my undergrad exploratory statistics course. Nothing particularly exciting or innovative! Dedicated ‘Og‘s readers may spot a few Le Monde puzzles in the lot…

Two rather entertaining if mundane occurences related to this R exam: one hour prior to the exam, a student came to my office to beg for being allowed to take the solution manual with her (as those midterm exercises are actually picked from an exercise booklet, some students cooperated towards producing a complete solution manual and this within a week!), kind of missing the main point of having an exam. (I have not seen yet this manual but I’d be quite interested in checking the code they produced on that occasion…) During the exam, another student asked me what was the R command to turn any density into a random generator: he had written a density function called mydens and could not fathom why rmydens(n) was not working. The same student later called me as his computer was “stuck”: he was not aware that a “+” prompt on the command line meant R was waiting for him to complete the command… A less comical event that ended well is that a student failed to save her R code (periodically and) at the end of the exam and we had to dig very deep into the machine to salvage her R commands from \tmp as rkward safeguards, as only the .RData file was available at first. I am glad we found this before turning the machine off, otherwise it would have been lost.

## generalised ratio of uniforms

Posted in R, Statistics, University life with tags , , , on May 15, 2012 by xi'an

A recent arXiv posting of the paper “On the Generalized Ratio of Uniforms as a Combination of Transformed Rejection and Extended Inverse of Density Sampling” by Martino, Luengo, and Míguez from Madrid rekindled my interest in this rather peculiar simulation method. The ratio of uniforms samples uniformly on the subgraph

$\mathcal{A}=\{(v,u);\,0\le u\le\sqrt{p(v/u)}\}$

to produce simulations from p as the ratio v/u. The proof is straightforward first year calculus but I do not find the method intuitive as, say, accept/reject…. The paper gives a very detailed background on those methods, as well as on the “inverse of density method”, which is like looking at the uniform simulation over the subgraph, but with both axes inverted (slice sampling is the same on both). (A minor point of contention or at least misunderstanding: when using the inverse of density method, the authors claim that using the unormalised and the normalised versions of the target leads to the same outcome. While it is true for the direct method, I have trouble seeing the equivalent in the inverse case…) The paper also stresses that the optimal case for accept-reject is when the target is bounded because the uniform can then be used as a proposal. I agree this is a simpler solution but fail to see any optimality in the matter. The authors then study ways of transforming unbounded subgraphs into bounded domains (i.e. bounded pdfs and supports). This imposes conditions on the transform f, which must have finite limits for p(x)/f'(x) or p-1(x)/f'(x) at the boundaries. (An optimal choice is when f is the cdf of p, since then the transform is uniform.)

The remainder (and more innovative) part of the paper is less clear in that I do not get a generic feeling on what it is about! The generalisation of the above is to consider uniform sampling from

$\mathcal{A}_g=\big\{(v,u);\,0\le u\le g^{-1}\{c p[v/g'(u)]\}\big\}$

for a generic increasing function g such that g(0)=0. And c a positive constant. (Any positive constant?!) But this is from a 1991 paper by Jon Wakefield, Alan Gelfand, and Adrian Smith. The extension is thus in finding g such that the above region is bounded and can be explored by uniform sampling over a box.. And in noticing that “the generalized Ratio-of-Uniform method is a combination of the transformed rejection method applied to the inverse density with the extended inverse-of-density method” (p.27).

I wonder at the applicability of the approach for costly target functions p. And at the extension to larger dimensions. And wish I had more time (or more graduate students) to look at possible adaptive constructions of the transform g. An interesting and fruitful read, nonetheless!