efficiency of normalising over discrete parameters

Posted in Statistics on May 1, 2022 by xi'an

Yesterday, I noticed a new arXival entitled Investigating the efficiency of marginalising over discrete parameters in Bayesian computations, written by Wen Wang and coauthors. The paper actually compares a Gibbs sampler with a Hamiltonian Monte Carlo approach on Gaussian mixtures, respectively including and excluding the latent variables. The authors missed the opposite marginalisation, where the parameters are integrated out instead.

While marginalisation requires substantial mathematical effort, folk wisdom in the Stan community suggests that fitting models with marginalisation is more efficient than using Gibbs sampling.

The comparison is purely experimental, though, which means it depends on the simulated data, the sample size, the prior selection, and of course the chosen algorithms. It also involves the [mostly] automated [off-the-shelf] choices made in the adopted software, JAGS and Stan. The outcome is only evaluated through ESS and the (old) R̂ statistic, both of which depend on the parameterisation. The experiment also evacuates the label switching problem by imposing an ordering on the Gaussian means, which may have a different impact on marginalised and unmarginalised models. All in all, there is not much one can conclude from this experiment since the parameter values behind the simulated data seem to impact the performances much more than the type of algorithm one implements.
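To make the object of the comparison concrete, here is a minimal Python sketch (not taken from the paper) of what marginalising the discrete allocations means for a Gaussian mixture. The complete-data log-likelihood keeps one latent label per observation, as a Gibbs sampler would, while the marginal version sums the labels out, as Stan requires since it does not sample discrete parameters. The function names and the toy values in the usage note are hypothetical.

```python
import math


def log_normal_pdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2


def complete_loglik(xs, zs, weights, mus, sigmas):
    """Log-likelihood of the data AND the latent allocations zs (Gibbs-style model)."""
    return sum(
        math.log(weights[z]) + log_normal_pdf(x, mus[z], sigmas[z])
        for x, z in zip(xs, zs)
    )


def marginal_loglik(xs, weights, mus, sigmas):
    """Log-likelihood with the allocations summed out, via log-sum-exp per observation."""
    total = 0.0
    for x in xs:
        terms = [
            math.log(w) + log_normal_pdf(x, m, s)
            for w, m, s in zip(weights, mus, sigmas)
        ]
        top = max(terms)  # subtract the largest term for numerical stability
        total += top + math.log(sum(math.exp(t - top) for t in terms))
    return total
```

For instance, `marginal_loglik([0.3, 2.1], [0.4, 0.6], [0.0, 2.0], [1.0, 0.5])` agrees with the logarithm of the sum of `exp(complete_loglik(...))` over the four possible allocations of the two observations.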

information loss from the median

Posted in Books, Kids, Statistics on April 19, 2022 by xi'an

An interesting side item from an X validated question about calculating the Fisher information for the Normal median (as an estimator of the mean). While this information is not available in closed form, it has a “nice” expression

$1+n\mathbb E[Z_{n/2:n}\varphi(Z_{n/2:n})]-n\mathbb E[Z_{n/2:n-1}\varphi(Z_{n/2:n-1})]+$
$\frac{n(n-1)}{n/2-2}\varphi(Z_{n/2-2:n-2})^2+\frac{n(n-1)}{n-n/2-1}\varphi(Z_{n/2:n-2})^2$

which can easily be approximated by simulation (much faster than by estimating the variance of said median). This shows that the median is about 1.57 times less informative than the empirical mean. Bonus points for computing the information brought by the MAD statistic! (The information loss against the MLE is 2.69, since the Monte Carlo ratio of their variances is 0.37.)
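As a crude cross-check of the 1.57 figure, which is close to the asymptotic variance ratio π/2 between Normal median and mean, here is a small Python sketch. It goes through the slower variance route rather than through the order statistic expression above. The function name, the sample size n=101, the number of replications, and the seed are all arbitrary choices of mine.

```python
import random
import statistics


def median_to_mean_variance_ratio(n=101, reps=4000, seed=42):
    """Monte Carlo ratio Var(median) / Var(mean) for N(0,1) samples of size n."""
    rng = random.Random(seed)
    medians, means = [], []
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        medians.append(statistics.median(sample))
        means.append(statistics.fmean(sample))
    return statistics.variance(medians) / statistics.variance(means)


if __name__ == "__main__":
    # Expected to land in the vicinity of pi/2, up to Monte Carlo error.
    print(median_to_mean_variance_ratio())
```

The ratio compares variances rather than Fisher informations, so it only matches the information loss up to Monte Carlo error and finite sample effects.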

the riddle(r) of the certain winner losing in the end

Posted in Books, Kids, R, Statistics on November 25, 2020 by xi'an

Considering a binary random walk, starting at zero, what is the probability of being almost sure of winning at some point only to lose at the end? This is the question set by the post-election Riddler, with almost sure meaning above 99% and the time horizon set to n=101 steps (it could have been 50 or 538!). As I could not see a simple way to compute the collection of states with a probability of being positive at the end of at least 0.99, even after checking William Feller’s fabulous chapter on random walks, I wrote some R code to find them, and then ran a Monte Carlo evaluation of the probability to reach this collection and still end up with a negative value. This probability came out as 0.00212 over repeated simulations, obviously smaller than 0.01, but not considerably so. As seen on the above picture, the set to be visited is actually not inconsiderable. The bounding curves are the diagonal and the 2.33 √(n-t) bound derived from the limiting Brownian approximation to the random walk, which fits rather well. (I wonder if there is a closed form expression for the probability of the Brownian motion hitting the boundary 2.33 √(n-t)? Simulations with 1001 steps give an estimated probability of 0.505, leading to a final probability of 0.00505 of getting over the boundary and losing in the end, close to the 1/198 produced by The Riddler.)
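For such a short horizon, the Monte Carlo step can be replaced with an exact recursion. Here is a Python sketch, which is not the R code mentioned above. It assumes fair ±1 steps, takes "almost sure" to mean a conditional probability of at least 0.99 of finishing strictly positive, and uses function names of my own. The walk is stopped the first time it enters the almost sure set, and the conditional probability of losing from there is accumulated.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def win_prob(s, m):
    """P(s + sum of m fair +/-1 steps > 0), computed from the Binomial(m, 1/2) tail."""
    # The sum of m steps is 2B - m with B ~ Bin(m, 1/2); winning means B > (m - s) / 2.
    k_min = max((m - s) // 2 + 1, 0)
    if k_min > m:
        return 0.0
    return sum(comb(m, k) for k in range(k_min, m + 1)) / 2 ** m


def prob_sure_then_lose(n=101, level=0.99):
    """Probability of reaching a state with win probability >= level, then losing."""
    alive = {0: 1.0}  # position -> probability of being there, never "almost sure" so far
    total = 0.0
    for t in range(1, n + 1):
        moved = {}
        for s, p in alive.items():
            for step in (-1, 1):
                moved[s + step] = moved.get(s + step, 0.0) + p / 2
        alive = {}
        for s, p in moved.items():
            w = win_prob(s, n - t)
            if w >= level:
                total += p * (1.0 - w)  # first entry in the set, followed by a loss
            else:
                alive[s] = p
    return total


if __name__ == "__main__":
    print(prob_sure_then_lose())
```

Each contribution is a hitting probability multiplied by a losing probability of at most 0.01, so the returned value is necessarily below 0.01.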

borderline infinite variance in importance sampling

Posted in Books, Kids, Statistics on November 23, 2015 by xi'an

As I was still musing about the posts of last week around infinite variance importance sampling and its potential corrections, in conjunction with Aki’s post, I wondered whether or not there was a fundamental difference between “just” having a [finite] variance and “just” having none. To get a better feeling, I ran a quick experiment with Exp(1) as the target and Exp(a) as the importance distribution. When estimating E[X]=1, the above graph opposes a=1.95 to a=2.05 (variance versus no variance, bright yellow versus wheat), a=2.95 to a=3.05 (third moment versus none, bright yellow versus wheat), and a=3.95 to a=4.05 (fourth moment versus none, bright yellow versus wheat). The graph below is the same for the estimation of E[exp(X/2)]=2, which has an integrand that is not square integrable under the target, and hence seems to require higher moments for the importance weight. It is hard to derive universal theories from those two graphs, however they speak to the protection against sudden drifts in the estimation sequence. As an aside [not really!], apart from our rather confidential Confidence bands for Brownian motion and applications to Monte Carlo simulation with Wilfrid Kendall and Jean-Michel Marin, I do not know of many studies that consider the sequence of averages time-wise rather than across realisations at a given time, and I still think this is a more relevant perspective for simulation purposes.
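A minimal Python sketch of this type of experiment follows, assuming that Exp(a) denotes the exponential distribution with rate a (my reading, not stated in the post). Under that assumption the importance weight is exp((a-1)x)/a and the weighted integrand for E[X] has a finite variance if and only if a<2, which agrees with the first pair of values. The function name and its defaults are hypothetical.

```python
import math
import random


def running_is_means(a, n, h=lambda x: x, seed=0):
    """Sequence of importance sampling averages of h(X) for an Exp(1) target.

    Proposals are drawn from an exponential with RATE a (an assumption);
    the weight is target density over proposal density, exp(-x) / (a * exp(-a * x)).
    """
    rng = random.Random(seed)
    total = 0.0
    means = []
    for i in range(1, n + 1):
        x = rng.expovariate(a)
        weight = math.exp((a - 1.0) * x) / a
        total += weight * h(x)
        means.append(total / i)  # estimate after i simulations, i.e. the time-wise path
    return means


if __name__ == "__main__":
    for a in (1.95, 2.05):
        path = running_is_means(a, 100_000)
        print(a, path[999], path[9_999], path[-1])
```

Returning the whole sequence of averages rather than the final value is meant to expose the sudden jumps along a single path. The second experiment corresponds to `h=lambda x: math.exp(x / 2)`.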

self-healing umbrella sampling

Posted in Kids, pictures, Statistics, University life on November 5, 2014 by xi'an

Ten days ago, Gersende Fort, Benjamin Jourdain, Tony Lelièvre, and Gabriel Stoltz arXived a study about an adaptive umbrella sampler that can be re-interpreted as a Wang-Landau algorithm, if not the most efficient version of the latter. This reminded me very much of the workshop we had all together in Edinburgh last June, and even more of the focus of the molecular dynamics talks in this same ICMS workshop about accelerating the MCMC exploration of multimodal targets. The self-healing aspect of the sampler is to adapt to the multimodal structure thanks to a partition that defines a biased sampling scheme, spending time in each set of the partition with a frequency proportional to weights. While the optimal weights are the weights of the sets against the target distribution (are they truly optimal?! I would have thought lifting low density regions, i.e., marshes, could improve the mixing of the chain for a given proposal), those are unknown and they need to be estimated by an adaptive scheme that makes staying in a given set all the less desirable the more one has visited it, by increasing the inverse weight of a given set by a factor each time it is visited. Which sounds indeed like Wang-Landau. The plus side of the self-healing umbrella sampler is that it only depends on a scale γ (and on the partition), besides converging to the right weights of course. The downside is that it does not reach the most efficient convergence, since the adaptivity weight decreases in 1/n rather than 1/√n.
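To fix ideas about this type of adaptation, here is a toy Python sketch of a Wang-Landau-type recursion on a finite state space. It is not the self-healing umbrella sampler of the paper. The state space, the random walk proposal, the γ/n step size, and all names are assumptions of mine. The chain targets the density divided by the current weight estimate of the set it sits in, and the estimate of the visited set is inflated at each iteration, which makes staying there less attractive.

```python
import random


def wang_landau_weights(pi, n_sets, n_iter=200_000, gamma=2.0, seed=1):
    """Estimate the masses of n_sets consecutive blocks of states under pi.

    pi: list of positive unnormalised target values on states 0..K-1.
    Returns the normalised weight estimates theta (one per block).
    """
    rng = random.Random(seed)
    n_states = len(pi)
    block = n_states // n_sets
    set_of = [min(x // block, n_sets - 1) for x in range(n_states)]
    theta = [1.0 / n_sets] * n_sets
    x = 0
    for n in range(1, n_iter + 1):
        # Metropolis step for the biased target pi(x) / theta[set_of[x]].
        y = x + rng.choice((-1, 1))
        if 0 <= y < n_states:
            ratio = (pi[y] / theta[set_of[y]]) / (pi[x] / theta[set_of[x]])
            if rng.random() < ratio:
                x = y
        # Penalise the visited set with a step size decreasing as gamma / n.
        theta[set_of[x]] *= 1.0 + gamma / n
        norm = sum(theta)
        theta = [t / norm for t in theta]
    return theta


if __name__ == "__main__":
    target = [1, 4, 8, 4, 1, 0.5, 0.5, 1, 2, 4, 2, 1]
    print(wang_landau_weights(target, 3))  # true block masses are 17/29, 3/29, 9/29
```

With this toy target the estimates can be compared with the exact block masses. The partition is fixed in advance, in line with the sampler depending on the partition and on the scale γ.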

Note that the paper contains a massive experimental side where the authors checked the impact of various parameters by Monte Carlo studies of estimators involving more than a billion iterations, apparently repeated a large number of times.

The next step in adaptivity should be about the adaptive determination of the partition, hoping for a robustness against the dimension of the space. Which may be unreachable if I judge by the apparent deceleration of the method when the number of terms in the partition increases.