Archive for MCMskv

Le Sassine

Posted in Mountains, pictures, Travel, Wines with tags , , , , , , on May 20, 2016 by xi'an

Villa Arvedi

Posted in Mountains, pictures, Travel, Wines with tags , , , , , , on March 20, 2016 by xi'an

approximations of Markov Chains [another garden of forking paths]

Posted in Books, Mountains, pictures, Statistics, University life with tags , , , , , , , , , , on March 15, 2016 by xi'an

On the Sétaz cabin ride, Valloire, Dec. 23, 2011James Johndrow and co-authors from Duke wrote a paper on approximate MCMC that was arXived last August and that I missed. David Dunson‘s talk at MCMski made me aware of it. The paper studies the impact of replacing a valid kernel with a close approximation. Which is a central issue for many usages of MCMC in complex models, as exemplified by the large number of talks on that topic at MCMski.

“All of our bounds improve with the MCMC sample path length at the expected rate in t.”

A major constraint in the paper is Doeblin’s condition, which implies uniform geometric ergodicity. Not only it is a constraint on the Markov kernel but it is also one for the Markov operator in that it may prove impossible to… prove. The second constraint is that the approximate Markov kernel is close enough to the original, which sounds reasonable. Even though one can always worry that the total variation norm is too weak a norm to mean much. For instance, I presume with some confidence that this does not prevent the approximate Markov kernel from not being ergodic, e.g., not irreducible, not absolutely continuous wrt the target, null recurrent or transient. Actually, the assumption is stronger in that there exists a collection of approximations for all small enough values ε of the total variation distance. (Small enough meaning ε is much smaller than the complement α to 1 of the one step distance between the Markov kernel and the target. With poor kernels, the approximation must thus be very good.) This is less realistic than assuming the availability of one single approximation associated with an existing but undetermined distance ε. (For instance, the three examples of Section 3 in the paper show the existence of approximations achieving a certain distance ε, without providing a constructive determination of such approximations.) Under those assumptions, the average of the sequence of Markov moves according to the approximate kernel converges to the target in total variation (and in expectation for bounded functions). With sharp bounds on those distances. I am still a bit worried at the absence of conditions for the approximation to be ergodic.

“…for relatively short path lengths, there should exist a range of values for which aMCMC offers better performance in the compminimax sense.”

The paper also includes computational cost into the picture. Introducing the notion of compminimax error, which is the smallest (total variation) distance among all approximations at a given computational budget. Quite an interesting, innovative, and relevant notion that may however end up being too formal for practical use. And that does not include the time required to construct and calibrate the approximations.

R typos

Posted in Books, Kids, R, Statistics, Travel, University life with tags , , , , , , , , on January 27, 2016 by xi'an

Amster14At MCMskv, Alexander Ly (from Amsterdam) pointed out to me some R programming mistakes I made in the introduction to Metropolis-Hastings algorithms I wrote a few months ago for the Wiley on-line encyclopedia! While the outcome (Monte Carlo posterior) of the corrected version is moderately changed this is nonetheless embarrassing! The example (if not the R code) was a mixture of a Poisson and a Geometric distributions borrowed from our testing as mixture paper. Among other things, I used a flat prior on the mixture weights instead of a Beta(1/2,1/2) prior and a simple log-normal random walk on the mean parameter instead of a more elaborate second order expansion discussed in the text. And I also inverted the probabilities of success and failure for the Geometric density. The new version is now available on arXiv, and hopefully soon on the Wiley site, but one (the?) fact worth mentioning here is that the (right) corrections in the R code first led to overflows, because I was using the Beta random walk Be(εp,ε(1-p)) which major drawback I discussed here a few months ago. With the drag that nearly zero or one values of the weight parameter produced infinite values of the density… Adding 1 (or 1/2) to each parameter of the Beta proposal solved the problem. And led to a posterior on the weight still concentrating on the correct corner of the unit interval. In any case, a big thank you to Alexander for testing the R code and spotting out the several mistakes…

optimal importance sampling

Posted in Books, Statistics, Travel, University life with tags , , , , , , on January 13, 2016 by xi'an

somewhere near Zürich, Jan. 4, 2016An arXiv file that sat for quite a while in my to-read pile is Variance reduction in SGD by distributed importance sampling by Alain et al. I had to wait for the flight to Zürich and MCMskv to get a look at it. The part of the paper that is of primary interest to me is the generalisation of the optimal importance function result

q⁰(x)∞f(x)|h(x)|

to higher dimensions. Namely, what is the best importance function for approximating the expectation of h(X) when h is multidimensional? There does exist an optimal solution when the score function is the trace of the variance matrix. Where the solution is proportional to the target density times the norm of the target integrand

q⁰(x)∞f(x)||h(x)||

The application of the result to neural networks and stochastic gradients using minibatches of the training set somehow escapes me, even though the asynchronous aspects remind me of the recent asynchronous Gibbs sampler of Terenin, Draper, and Simpson.

While the optimality obtained in the paper is mathematically clear, I am a wee bit surprised at the approach: the lack of normalising constant in the optimum means using a reweighted approximation that drifts away from the optimal score. Furthermore, this optimum is sub-optimal when compared with the component wise optimum which produces a variance of zero (if we assume the normalising constant to be available). Obviously, using the component-wise optima requires to run as many simulations as there are components in the integrand, but since cost does not seem to be central to this study…