On Monday, James Johndrow, Aaron Smith, Natesh Pillai, and David Dunson arXived a paper on the diminishing benefits of using data augmentation for large and highly imbalanced categorical data. They reconsider the data augmentation scheme of Tanner and Wong (1987), surprisingly not mentioned, used in the first occurrences of the Gibbs sampler like Albert and Chib’s (1993) or our mixture estimation paper with Jean Diebolt (1990). The central difficulty with data augmentation is that the distribution to be simulated operates on a space that is of order O(n), even when the original distribution covers a single parameter. As illustrated by the coalescent in population genetics (and the subsequent intrusion of the ABC methodology), there are well-known cases when the completion is near to impossible and clearly inefficient (as again illustrated by the failure of importance sampling strategies on the coalescent). The paper provides spectral gaps for the logistic and probit regression completions, which are of order a power of log(n) divided by √n, when all observations are equal to one. In a somewhat related paper with Jim Hobert and Vivek Roy, we studied the spectral gap for mixtures with a small number of observations: I wonder at the existence of a similar result in this setting, when all observations stem from one component of the mixture, when all observations are one. The result in this paper is theoretically appealing, the more because the posteriors associated with such models are highly regular and very close to Gaussian (and hence not that challenging as argued by Chopin and Ridgway). And because the data augmentation algorithm is uniformly ergodic in this setting (as we established with Jean Diebolt and later explored with Richard Tweedie). As demonstrated in the experiment produced in the paper, when comparing with HMC and Metropolis-Hastings (same computing times?), which produce much higher effective sample sizes.
Archive for simulation
A nice question was posted on X validated as to figure out a way to simulate a Bernoulli B(q) variate when using only a Bernoulli B(p) generator. With the additional question of handling the special case q=a/b, a rational probability. This is not exactly a Bernoulli factory problem in that q does not write as f(p), but still a neat challenge. My solution would have been similar to the one posted by William Huber, namely to simulate a sequence of B(p) or B(1-p) towards zooming on q until the simulation of the underlying uniforms U allows us to conclude at the position of U wrt q. For instance, if p>q and X~B(p) is equal to zero, the underlying uniform is more than p, hence more than q, leading to returning zero for the B(q) generation. Else, a second B(p) or B(1-p) generation means breaking the interval (0,p) into two parts, one of which allows for stopping the generation, and so on. The solution posted by William Huber contains an R code that could be easily improved by choosing for each interval between p and (1-p) towards the maximal probability of stopping. I still wonder at the ultimate optimal solution that would minimise the (average or median) number of calls to the Bernoulli(p) generator.
Alex Thiéry suggested debiasing the biased estimate of e by Rhee and Glynn truncated series method, so I tried the method to see how much of an improvement (if any!) this would bring. I first attempted to naïvely implement the raw formula of Rhee and Glynn
with a (large) Poisson distribution on the stopping rule N, but this took ages. I then realised that the index n did not have to be absolute, i.e. to start at n=1 and proceed snailwise one integer at a time: the formula remains equally valid after a change of time, i.e. n=can start at an arbitrary value and proceeds by steps of arbitrary size, which obviously speeds things up! Continue reading
A rather exotic question on X validated: since π can be approximated by random sampling over a unit square, is there an equivalent for approximating e? This is an interesting question, as, indeed, why not focus on e rather than π after all?! But very quickly the very artificiality of the problem comes back to hit one in one’s face… With no restriction, it is straightforward to think of a Monte Carlo average that converges to e as the number of simulations grows to infinity. However, such methods like Poisson and normal simulations require some complex functions like sine, cosine, or exponential… But then someone came up with a connection to the great Russian probabilist Gnedenko, who gave as an exercise that the average number of uniforms one needs to add to exceed 1 is exactly e, because it writes as
(The result was later detailed in the American Statistician as an introductory simulation exercise akin to Buffon’s needle.) This is a brilliant solution as it does not involve anything but a standard uniform generator. I do not think it relates in any close way to the generation from a Poisson process with parameter λ=1 where the probability to exceed one in one step is e⁻¹, hence deriving a Geometric variable from this process leads to an unbiased estimator of e as well. As an aside, W. Huber proposed the following elegantly concise line of R code to implement an approximation of e:
1/mean(n*diff(sort(runif(n+1))) > 1)
Hard to beat, isn’t it?! (Although it is more exactly a Monte Carlo approximation of
which adds a further level of approximation to the solution….)
Panayiota Touloupou (Warwick), Naif Alzahrani, Peter Neal, Simon Spencer (Warwick) and Trevelyan McKinley arXived a paper yesterday on Model comparison with missing data using MCMC and importance sampling, where they proposed an importance sampling strategy based on an early MCMC run to approximate the marginal likelihood a.k.a. the evidence. Another instance of estimating a constant. It is thus similar to our Frontier paper with Jean-Michel, as well as to the recent Pima Indian survey of James and Nicolas. The authors give the difficulty to calibrate reversible jump MCMC as the starting point to their research. The importance sampler they use is the natural choice of a Gaussian or t distribution centred at some estimate of θ and with covariance matrix associated with Fisher’s information. Or derived from the warmup MCMC run. The comparison between the different approximations to the evidence are done first over longitudinal epidemiological models. Involving 11 parameters in the example processed therein. The competitors to the 9 versions of importance samplers investigated in the paper are the raw harmonic mean [rather than our HPD truncated version], Chib’s, path sampling and RJMCMC [which does not make much sense when comparing two models]. But neither bridge sampling, nor nested sampling. Without any surprise (!) harmonic means do not converge to the right value, but more surprisingly Chib’s method happens to be less accurate than most importance solutions studied therein. It may be due to the fact that Chib’s approximation requires three MCMC runs and hence is quite costly. The fact that the mixture (or defensive) importance sampling [with 5% weight on the prior] did best begs for a comparison with bridge sampling, no? The difficulty with such study is obviously that the results only apply in the setting of the simulation, hence that e.g. another mixture importance sampler or Chib’s solution would behave differently in another model. In particular, it is hard to judge of the impact of the dimensions of the parameter and of the missing data.
Ted Meeds and Max Welling have not so recently written about an embarrassingly parallel approach to ABC that they call optimisation Monte Carlo. [Danke Ingmar for pointing out the reference to me.] They start from a rather innocuous rephrasing of the ABC posterior, writing the pseudo-observations as deterministic transforms of the parameter and of a vector of uniforms. Innocuous provided this does not involve an infinite number of uniforms, obviously. Then they suddenly switch to the perspective that, for a given uniform vector u, one should seek the parameter value θ that agrees with the observation y. A sort of Monte Carlo inverse regression: if
then invert this equation in θ. This is quite clever! Maybe closer to fiducial than true Bayesian statistics, since the prior does not occur directly [only as a weight p(θ)], but if this is manageable [and it all depends on the way f(θ,u) is constructed], this should perform better than ABC! After thinking about it a wee bit more in London, though, I realised this was close to impossible in the realistic examples I could think of. But I still like the idea and want to see if anything at all can be made of this…
“However, it is hard to detect if our optimization succeeded and we may therefore sometimes reject samples that should not have been rejected. Thus, one should be careful not to create a bias against samples u for which the optimization is difficult. This situation is similar to a sampler that will not mix to remote local optima in the posterior distribution.”
Now, the paper does not go that way but keeps the ε-ball approach as in regular ABC, to derive an approximation of the posterior density. For a while I was missing the difference between the centre of the ball and the inverse of the above equation, bottom of page 3. But then I realised the former was an approximation to the latter. When the authors discuss their approximation in terms of the error ε, I remain unconvinced by the transfer of the tolerance to the optimisation error, as those are completely different notions. This also applies to the use of a Jacobian in the weight, which seems out of place since this Jacobian appears in a term associated with (or replacing) the likelihood, f(θ,u), which is then multiplied by the prior p(θ). (Assuming a Jacobian exists, which is unclear when considering most simulation patterns use hard bounds and indicators.) When looking at the toy examples, it however makes sense to have a Jacobian since the selected θ’s are transforms of the u’s. And the p(θ)’s are simply importance weights correcting for the wrong target. Overall, the appeal of the method proposed in the paper remains unclear to me. Most likely because I did not spend enough time over it.
Last week, while I was preparing my talk for the NIPS workshop, I spotted this fairly generic question on X validated. And decided to procrastinate by answering through generic comments on the pros and cons of each method. This is a challenging if probably empty question as it lacks a measure of evaluation for those different approaches. And this is another reason why I replied, in that it relates to my pondering the a-statistical nature of simulation-based approximation methods. Also called probabilistic numerics, not statistical numerics, eh! It is indeed close to impossible to compare such approaches and others on a general basis. For instance, the comparative analysis greatly differs when dealing with a once-in-a-lifetime problem and with an everyday issue, e.g. when building a package for a sufficiently standard model. In the former case, a quick-and-dirty off-the-shelf solution is recommended, while in the latter, designing an efficient and fine-tuned approach makes sense. (The pros and cons I discussed in my X validated answer thus do not apply in most settings!) If anything, using several approaches, whenever possible, is the best advice to give. If not on the targeted problem, at least on a toy or simulated version, to check for performances of those different tools. But this brings back the issue of cost and time… An endless garden of forking paths, one would say [in another setting].