Archive for MCMskv

Le Sassine

Posted in Mountains, pictures, Travel, Wines on May 20, 2016 by xi'an

Villa Arvedi

Posted in Mountains, pictures, Travel, Wines on March 20, 2016 by xi'an

approximations of Markov Chains [another garden of forking paths]

Posted in Books, Mountains, pictures, Statistics, University life on March 15, 2016 by xi'an

[On the Sétaz cabin ride, Valloire, Dec. 23, 2011]

James Johndrow and co-authors from Duke wrote a paper on approximate MCMC that was arXived last August and that I missed. David Dunson‘s talk at MCMski made me aware of it. The paper studies the impact of replacing a valid kernel with a close approximation. Which is a central issue for many usages of MCMC in complex models, as exemplified by the large number of talks on that topic at MCMski.

“All of our bounds improve with the MCMC sample path length at the expected rate in t.”

A major constraint in the paper is Doeblin’s condition, which implies uniform geometric ergodicity. Not only is it a constraint on the Markov kernel, but it is also one for the Markov operator in that it may prove impossible to… prove. The second constraint is that the approximate Markov kernel is close enough to the original, which sounds reasonable. Even though one can always worry that the total variation norm is too weak a norm to mean much. For instance, I presume with some confidence that this does not prevent the approximate Markov kernel from failing to be ergodic, e.g., by not being irreducible, not being absolutely continuous wrt the target, or being null recurrent or transient. Actually, the assumption is stronger in that there exists a collection of approximations for all small enough values ε of the total variation distance. (Small enough meaning ε is much smaller than the complement α to 1 of the one-step distance between the Markov kernel and the target. With poor kernels, the approximation must thus be very good.) This is less realistic than assuming the availability of one single approximation associated with an existing but undetermined distance ε. (For instance, the three examples of Section 3 in the paper show the existence of approximations achieving a certain distance ε, without providing a constructive determination of such approximations.) Under those assumptions, the average of the sequence of Markov moves according to the approximate kernel converges to the target in total variation (and in expectation for bounded functions). With sharp bounds on those distances. I am still a bit worried about the absence of conditions for the approximation to be ergodic.
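To fix ideas, here is a stand-alone toy illustration in R (my own construction, not an example from the paper, with mh_chain a hypothetical helper of mine): a random walk Metropolis-Hastings kernel for a N(0,1) target is replaced by the same kernel targeting a perturbed density N(ε,1), and the ergodic averages drift away from the truth at a rate commensurate with the size ε of the perturbation.

```r
# toy check: how far do ergodic averages drift when the kernel is replaced
# by a close approximation? (my own illustration, not from the paper)
mh_chain <- function(niter, ltarget, sd_prop = 1) {
  x <- numeric(niter); x[1] <- 0
  for (t in 2:niter) {
    prop <- x[t - 1] + rnorm(1, sd = sd_prop)            # random walk proposal
    x[t] <- if (log(runif(1)) < ltarget(prop) - ltarget(x[t - 1]))
      prop else x[t - 1]
  }
  x
}

set.seed(123)
niter <- 1e5
for (eps in c(0, .01, .1, .5)) {
  lapprox <- function(x) dnorm(x, mean = eps, log = TRUE) # perturbed target
  cat("eps =", eps, " average of the chain =",
      round(mean(mh_chain(niter, lapprox)), 3), "\n")     # drifts roughly by eps
}
```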

“…for relatively short path lengths, there should exist a range of values for which aMCMC offers better performance in the compminimax sense.”

The paper also brings computational cost into the picture, introducing the notion of compminimax error, which is the smallest (total variation) distance among all approximations at a given computational budget. Quite an interesting, innovative, and relevant notion that may however end up being too formal for practical use. And that does not include the time required to construct and calibrate the approximations.
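As a back-of-the-envelope reading of this notion (my own schematic numbers, not taken from the paper), one can picture each available kernel as carrying a per-iteration cost and a total variation bias; at a fixed budget, the achievable error trades this bias against the Monte Carlo error allowed by the affordable path length, and the compminimax choice is the minimiser:

```r
# schematic compminimax comparison (made-up costs and biases)
budget <- 1e4                                  # total computational budget
cost   <- c(exact = 10, rough = 1, mid = 3)    # per-iteration costs
bias   <- c(exact = 0, rough = .05, mid = .01) # TV bias of each kernel

iters <- budget / cost            # path length affordable under the budget
error <- bias + 1 / sqrt(iters)   # bias plus (schematic) Monte Carlo error
error
names(which.min(error))           # compminimax choice at this budget: "mid"
```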

R typos

Posted in Books, Kids, R, Statistics, Travel, University life on January 27, 2016 by xi'an

At MCMskv, Alexander Ly (from Amsterdam) pointed out to me some R programming mistakes I made in the introduction to Metropolis-Hastings algorithms I wrote a few months ago for the Wiley on-line encyclopedia! While the outcome (Monte Carlo posterior) of the corrected version is only moderately changed, this is nonetheless embarrassing! The example (if not the R code) was a mixture of a Poisson and a Geometric distribution borrowed from our testing as mixture paper. Among other things, I used a flat prior on the mixture weights instead of a Beta(1/2,1/2) prior and a simple log-normal random walk on the mean parameter instead of a more elaborate second-order expansion discussed in the text. And I also inverted the probabilities of success and failure for the Geometric density. The new version is now available on arXiv, and hopefully soon on the Wiley site, but one (the?) fact worth mentioning here is that the (right) corrections in the R code first led to overflows, because I was using the Beta random walk Be(εp,ε(1-p)) whose major drawback I discussed here a few months ago. With the drag that values of the weight parameter close to zero or one produced infinite values of the density… Adding 1 (or 1/2) to each parameter of the Beta proposal solved the problem. And led to a posterior on the weight still concentrating on the correct corner of the unit interval. In any case, a big thank you to Alexander for testing the R code and spotting the several mistakes…
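To make the numerical issue concrete, here is a minimal stand-alone R sketch (not the encyclopedia code itself) of the Beta random walk proposal Be(εp,ε(1-p)) near the corner of the unit interval, and of the shifted version that solved the problem:

```r
eps <- 10     # scale of the Beta random walk proposal
p   <- 1e-6   # current value of the weight, close to the corner

# original proposal Be(eps*p, eps*(1-p)): with p this small, the first shape
# parameter collapses and most draws underflow to exactly zero, where the
# Beta density (with a shape below one) is infinite
q <- rbeta(1, eps * p, eps * (1 - p))
q                                              # typically exactly 0
dbeta(q, eps * p, eps * (1 - p), log = TRUE)   # typically +Inf

# shifted proposal Be(1 + eps*p, 1 + eps*(1-p)): both shapes stay above one,
# draws remain inside (0,1) and the proposal density stays finite
q <- rbeta(1, 1 + eps * p, 1 + eps * (1 - p))
dbeta(q, 1 + eps * p, 1 + eps * (1 - p), log = TRUE)
```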

optimal importance sampling

Posted in Books, Statistics, Travel, University life on January 13, 2016 by xi'an

[somewhere near Zürich, Jan. 4, 2016]

An arXiv file that sat for quite a while in my to-read pile is Variance reduction in SGD by distributed importance sampling by Alain et al. I had to wait for the flight to Zürich and MCMskv to get a look at it. The part of the paper that is of primary interest to me is the generalisation of the optimal importance function result

q⁰(x) ∝ f(x) |h(x)|

to higher dimensions. Namely, what is the best importance function for approximating the expectation of h(X) when h is multidimensional? There does exist an optimal solution when the score function is the trace of the variance matrix. Where the solution is proportional to the target density times the norm of the target integrand

q⁰(x) ∝ f(x) ||h(x)||

The application of the result to neural networks and stochastic gradients using minibatches of the training set somehow escapes me, even though the asynchronous aspects remind me of the recent asynchronous Gibbs sampler of Terenin, Draper, and Simpson.

While the optimality obtained in the paper is mathematically clear, I am a wee bit surprised at the approach: the lack of normalising constant in the optimum means using a reweighted approximation that drifts away from the optimal score. Furthermore, this optimum is sub-optimal when compared with the component-wise optimum, which produces a variance of zero (if we assume the normalising constant to be available). Obviously, using the component-wise optima requires running as many simulations as there are components in the integrand, but since cost does not seem to be central to this study…
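To check the form of the optimum on a toy case (my own discrete construction, not an example from the paper, with is_trace a helper of mine), one can compare the trace of the importance sampling covariance matrix under the proposal q ∝ f‖h‖ with plain Monte Carlo and with a uniform proposal, assuming the normalising constants known:

```r
# discrete toy check that q ∝ f(x)||h(x)|| minimises the trace of the
# covariance of the importance sampling estimator h(X) f(X) / q(X), X ~ q
set.seed(42)
K <- 20                           # support size of the toy target
f <- runif(K); f <- f / sum(f)    # target probabilities
h <- cbind(rnorm(K), rnorm(K)^2)  # a bivariate integrand h(x)

is_trace <- function(q) {
  q <- q / sum(q)
  m <- colSums(h * f)                  # exact expectation E_f[h(X)]
  second <- colSums((h * f)^2 / q)     # E_q[(h f / q)^2], componentwise
  sum(second - m^2)                    # trace of the estimator covariance
}

q_opt <- f * sqrt(rowSums(h^2))        # proposal ∝ f(x) ||h(x)||
c(optimal = is_trace(q_opt),           # smallest trace of the three
  target  = is_trace(f),               # plain Monte Carlo from f
  uniform = is_trace(rep(1, K)))       # an arbitrary proposal
```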

MCMskv #5 [future with a view]

Posted in Kids, Mountains, R, Statistics, Travel, University life on January 12, 2016 by xi'an

As I am flying back to Paris (with an afternoon committee meeting in München in-between), I am reminiscing on the superlative scientific quality of this MCMski meeting, on the novel directions in computational Bayesian statistics exhibited therein, and on the potential settings for the next meeting. If any.

First, as hopefully obvious from my previous entries, I found the scientific program very exciting, with almost uniformly terrific talks, and a coverage of the field of computational Bayesian statistics that is perfectly tuned to my own interests. In that sense, MCMski is my “top one” conference! Even without considering the idyllic location. While some of the talks were about papers I had already read (and commented here), others brought new vistas and ideas. If one theme is to emerge from this meeting it has to be that of approximate and noisy algorithms, with a wide variety of solutions and approaches to overcome complexity issues. If anything, I wish the solutions would also incorporate the Boxian fact that the statistical models themselves are approximate. Overall, a fantastic program (says one member of the scientific committee).

Second, as with previous MCMski meetings, I again enjoyed the unique ambience of the meeting, which always feels more relaxed and friendly than other conferences of a similar size, maybe because of the après-ski atmosphere or of the special coziness provided by luxurious mountain hotels. This year’s hotel was particularly pleasant, with non-guests like myself able to partake of some of its facilities. A big thank you to Anto for arranging all the details of such a large meeting so meticulously!!! I am even more grateful when realising this is the third time Anto has taken on the heavy load of organising MCMski. Grazie mille!

Since this is a [and even the!] BayesComp conference, the section’s current program chair and board must decide on the structure and schedule of the next meeting. A few suggestions if I may: I would scrap entirely the name MCMski from the next conference as (a) it may sound like academic tourism to unaware bystanders (who only need to check the program of any of the MCMski conferences to be reassured!) and (b) its topics go way beyond MCMC. Given the large attendance and equally large proportion of young researchers, I would also advise against hosting the conference in a ski resort for both cost and accessibility reasons [as we had already discussed after MCMskiv], in favour of a large enough town to offer a reasonable range of accommodations and of travel options. Like Chamonix, Innsbruck, Reykjavik, or any place with a major airport about one hour away… If nothing is available with skiing possibilities, so be it! While the outdoor inclinations of the early organisers induced us to pick locations where skiing over the lunch break was a perk, any accessible location that allows for a concentration of researchers in a small area and for the ensuing day-long exchange is fine! Among the novelties in the program, the tutorials and the Breaking news! sessions were quite successful (says one member of the scientific committee). And should be continued in one format or another. Maybe a more programming-oriented thread could be added as well… And as we had mentioned earlier, to see a stronger involvement of the Young Bayesian section in the program would be great! (Even though the current meeting already had many talks by young researchers.)

mixtures are slices of an orange

Posted in Kids, R, Statistics on January 11, 2016 by xi'an

After presenting this work in both London and Lenzerheide, Kaniav Kamary, Kate Lee and I arXived and submitted our paper on a new parametrisation of location-scale mixtures. Although it took a long while to finalise the paper, given that we came up with the original and central idea about a year ago, I remain quite excited by this new representation of mixtures, because the use of a global location-scale (hyper-)parameter doubling as the mean and standard deviation of the mixture itself implies that all the other parameters of this mixture model [besides the weights] belong to the intersection of a unit hypersphere with a hyperplane. [Hence the title above, which I regretted not using for the poster at MCMskv!]

[fitted mixture density for the galaxy data, 500 iterations]

This realisation that using a (meaningful) hyperparameter (μ,σ) leads to a compact parameter space for the component parameters is important for inference in such mixture models in that the hyperparameter (μ,σ) is easily estimated from the entire sample, while the other parameters can be studied using a non-informative prior like the Uniform prior on the ensuing compact space. This non-informative prior for mixtures is something I have been seeking for many years, hence my on-going excitement! In the mid-1990s, we looked at a Russian doll type parametrisation with Kerrie Mengersen that used the “first” component as the location-scale reference for the entire mixture, expressing each new component as a local perturbation of the previous one. While this is a similar idea to the current one, it falls short of leading to a natural non-informative prior, forcing us to devise a proper prior on the variance that was a mixture of a Uniform U(0,1) and of an inverse Uniform 1/U(0,1), because of the lack of compactness of the parameter space. Here, fixing both mean and variance (or even just the variance) binds the mixture parameters to an ellipse conditional on the weights. A space that can be turned into the unit sphere via a natural reparametrisation. Furthermore, the intersection with the hyperplane leads to a closed-form spherical reparametrisation. Yay!
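For concreteness, here is how I picture the constraint in R for a Gaussian mixture (my own reconstruction and notation, not necessarily the exact parametrisation of the paper or of Ultimixt): writing γ_i = √w_i (μ_i − μ)/σ and η_i = √w_i σ_i/σ, fixing the overall mean μ and standard deviation σ pins (γ,η) to the intersection of the unit sphere with the hyperplane Σ √w_i γ_i = 0, conditional on the weights w.

```r
# toy reconstruction of the sphere-hyperplane constraint (my own notation)
set.seed(1)
K  <- 3
w  <- c(.5, .3, .2)           # mixture weights
mu <- 2; sigma <- 1.5          # global location-scale hyperparameter

# draw a point in the sphere-hyperplane intersection: start from an arbitrary
# (gamma, eta), project gamma onto the hyperplane, then normalise the pair
gamma <- rnorm(K); eta <- abs(rnorm(K))
gamma <- gamma - sqrt(w) * sum(sqrt(w) * gamma)   # since sum(w) = 1
len   <- sqrt(sum(gamma^2 + eta^2))
gamma <- gamma / len; eta <- eta / len

# back to the usual component means and standard deviations
mus    <- mu + sigma * gamma / sqrt(w)
sigmas <- sigma * eta / sqrt(w)

# check: the resulting mixture has exactly the prescribed overall (mu, sigma)
c(mean = sum(w * mus),
  sd   = sqrt(sum(w * (sigmas^2 + (mus - sum(w * mus))^2))),
  target_mean = mu, target_sd = sigma)
```

The final check shows that any point in this intersection maps back to component parameters whose mixture has the prescribed global mean and standard deviation.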

While I do not wish to get into the debate about the [non-]existence of “non-informative” priors at this stage, I think being able to use the invariant reference prior π(μ,σ)=1/σ is quite neat here because the inference on the mixture parameters should be location and scale equivariant. The choice of the prior on the remaining parameters is of lesser importance, the Uniform over the compact space being one example, although we did not study this impact in depth, being satisfied with the outputs produced by the default (Uniform) choice.

From a computational perspective, the new parametrisation can be easily turned into the old parametrisation, hence leads to a closed-form likelihood. This implies that a Metropolis-within-Gibbs strategy can be easily implemented, as we did in the derived Ultimixt R package. (In whose programming I was not involved, solely suggesting the name Ultimixt, from ultimate mixture parametrisation, a former title that we eventually dropped for the paper.)
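As a stand-alone sketch of that closed-form likelihood (again my own toy code, not the Ultimixt implementation), once the component parameters are recovered from the new parametrisation, the mixture log-likelihood is directly available and can feed any Metropolis(-within-Gibbs) move:

```r
# mixture log-likelihood in the usual parametrisation, for a Gaussian mixture
loglik <- function(x, w, mus, sigmas) {
  dens <- sapply(seq_along(w), function(k) w[k] * dnorm(x, mus[k], sigmas[k]))
  sum(log(rowSums(dens)))   # sum over observations of the log mixture density
}

x <- c(rnorm(50, 0, 1), rnorm(50, 3, .5))   # artificial two-component data
loglik(x, w = c(.5, .5), mus = c(0, 3), sigmas = c(1, .5))
```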

Discussing the paper at MCMskv was very helpful in that I got very positive feedback about the approach and better arguments to justify it and its appeal. And to think about several extensions outside location-scale families, if not in higher dimensions, which remain a practical challenge (in the sense of designing a parametrisation of the covariance matrices in terms of the global covariance matrix).