Archive for asynchronous algorithms

automatic adaptation of MCMC algorithms

Posted in pictures, Statistics on March 4, 2019 by xi'an

“A typical adaptive MCMC sampler will approximately optimize performance given the kind of sampler chosen in the first place, but it will not optimize among the variety of samplers that could have been chosen.”

Last February (2018), Dao Nguyen and five co-authors arXived a paper that I had missed, on a new version of adaptive MCMC that aims at selecting among a wider range of proposal kernels, while still requiring a by-hand selection of this collection of kernels… Among the points addressed, beyond the theoretical guarantees that the adaptive scheme does not jeopardize convergence to the proper target, are a meta-exploration of the set of combinations of samplers and the integration of computational speed into the assessment of each sampler, including the very difficulty of assessing mixing. One could treat the index of the proposal as an extra (cyber-)parameter on top of its generic parameter (like the scale in the random walk), but the discreteness of this index makes the extension more delicate than expected, and justifies the distinction between internal and external parameters. The notion of a worst-mixing dimension is quite appealing and connects with the long-term hope that one could spend the largest fraction of the sampler runtime on the directions that mix poorly, while still preserving the correct target. The adaptive scheme is illustrated on several realistic models with rather convincing gains in efficiency and computing time.
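For the sake of illustration (and emphatically not as the authors' actual algorithm!), here is a minimal Python sketch of the generic idea: a collection of random-walk kernels, each with its own internal scale, plus an external discrete index whose selection probabilities are adapted with diminishing adaptation. The toy target and all tuning constants are mine.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(x):                      # toy target: standard normal in 2-D
    return -0.5 * np.sum(x ** 2)

scales = np.array([0.1, 1.0, 10.0])     # internal parameter of each kernel
weights = np.ones(3) / 3                # external parameter: kernel index probabilities
accepts = np.zeros(3)
uses = np.zeros(3)

x = np.zeros(2)
for t in range(1, 10_001):
    k = rng.choice(3, p=weights)        # draw the (discrete) kernel index
    prop = x + scales[k] * rng.standard_normal(x.shape)
    if np.log(rng.uniform()) < log_target(prop) - log_target(x):
        x = prop
        accepts[k] += 1
    uses[k] += 1
    if t % 100 == 0:                    # diminishing adaptation of the index
        rates = accepts / np.maximum(uses, 1)
        score = np.exp(-10 * (rates - 0.234) ** 2)  # favour ~0.234 acceptance
        eps = 1 / np.sqrt(t)
        weights = (1 - eps) * weights + eps * score / score.sum()
```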

The convergence tools are inspired by Roberts and Rosenthal (2007), with an assumption of uniform ergodicity over all kernels considered therein, which is both strong and delicate to assess in practical settings. Efficiency is rather unfortunately defined in terms of effective sample size, which is a measure of correlation, or lack thereof, but does not relate to the speed of escape from the basin of attraction of the starting point. I also wonder about the pertinence of estimating the effective sample size when the chain is based on different successive kernels, since the lack of correlation may be due to another kernel. Another calibration issue is the internal clock that relates to the average number of iterations required to properly tune a specific kernel, which again may be difficult to assess in a realistic situation. A last query is whether or not this scheme could be compared with an asynchronous (and valid) MCMC approach that exploits the parallel capacities of the computer.
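As a reminder of what effective sample size actually measures, a bare-bones estimate (my own illustration, not the estimator used in the paper) plugs the chain's empirical autocorrelations into N/(1+2Σρ̂) and truncates at the first non-positive term:

```python
import numpy as np

def ess(chain):
    """Effective sample size from empirical autocorrelations."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    n = len(x)
    # autocorrelations at lags 0..n-1 (O(n^2), fine for a sketch)
    acf = np.correlate(x, x, mode="full")[n - 1:] / (np.arange(n, 0, -1) * x.var())
    s = 0.0
    for rho in acf[1:]:
        if rho <= 0:                    # truncate at first non-positive lag
            break
        s += rho
    return n / (1 + 2 * s)
```

which makes clear that a chain alternating between kernels can exhibit little correlation without having explored much of the target.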

convergences of MCMC and unbiasedness

Posted in pictures, Statistics, University life on January 16, 2018 by xi'an

During his talk on unbiased MCMC at Dauphine today, Pierre Jacob provided a nice illustration of the convergence modes of MCMC algorithms, with the stationary target achieved after 100 Metropolis iterations, while the mean of the target takes many more iterations to be approximated by the empirical average. Plus a nice connection between coupling time and convergence to the target. During Pierre’s talk, some simple questions came to mind, from developing an “impatient user version”, as in perfect sampling, in order to stop chains that run “forever”, to optimising parallelisation in order to avoid problems of asynchronicity. While the complexity of coupling increases with dimension and the coupling probability goes down, the average coupling time varies, but an unexpected figure is that the expected cost is of two simulations per iteration, irrespective of the chosen kernels. Pierre also made a connection with optimal transport coupling and stressed that the maximal coupling was for the proposal and not for the target.
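For the curious, the maximal coupling of two proposal densities p and q, which is the device that makes the two chains meet, can be sampled with a simple rejection step; below is a minimal generic sketch (my own code, not Pierre's implementation):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def maximal_coupling(p_rvs, p_pdf, q_rvs, q_pdf):
    """Return (X, Y) with X ~ p, Y ~ q, and P(X = Y) maximal."""
    x = p_rvs()
    if rng.uniform(0, p_pdf(x)) <= q_pdf(x):
        return x, x                     # the two chains meet
    while True:                         # otherwise reject until Y lands
        y = q_rvs()                     # outside the overlap of p and q
        if rng.uniform(0, q_pdf(y)) > p_pdf(y):
            return x, y

# e.g., coupling two Gaussian random-walk proposals centred at the
# current states (0.5 and 2.0) of the two chains
x, y = maximal_coupling(lambda: rng.normal(0.5, 1), lambda u: norm.pdf(u, 0.5, 1),
                        lambda: rng.normal(2.0, 1), lambda u: norm.pdf(u, 2.0, 1))
```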

patterns of scalable Bayesian inference

Posted in Books, Statistics, University life on February 24, 2016 by xi'an

Elaine Angelino, Matthew Johnson and Ryan Adams just arXived a massive survey of 118 pages on scalable Bayesian inference, which could have been entitled Bayes for Big Data, as this monograph covers state-of-the-art computational approaches to large and complex data structures. I did not read each and every line of it, but I have already recommended it to my PhD students. Some of its material unsurprisingly draws from the recent survey by Rémi Bardenet et al. (2015) I discussed a while ago. It also relates rather frequently to the somewhat parallel ICML paper of Korattikara et al. (2014). And to the firefly Monte Carlo procedure also discussed previously here.

Chapter 2 provides some standard background on computational techniques, Chapter 3 covers MCMC with data subsets, Chapter 4 gives some entries on MCMC with parallel and distributed architectures, Chapter 5 focuses on variational solutions, and Chapter 6 is about open questions and challenges.

“Insisting on zero asymptotic bias from Monte Carlo estimates of expectations may leave us swamped in errors from high variance or transient bias.”

One central theme of the paper is the need for approximate solutions, MCMC being perceived as the exact solution. (Somewhat wrongly, in the sense that the output of an MCMC algorithm is at best an empirical version of the true posterior, hence endowed with a residual and incompressible variation for a given computing budget.) While Chapter 3 stresses the issue of assessing the distance to the true posterior, it does not dwell at all on computing times and budget, which is arguably a much harder problem. Chapter 4 seems to be more aware of this issue, since it argues that “a way to use parallel computing resources is to run multiple sequential MCMC algorithms at once [but that this] does not reduce the transient bias in MCMC estimates of posterior expectations” (p.54). The alternatives are to use either prefetching (which was the central theme of Elaine Angelino’s thesis), asynchronous Gibbs with the (new to me?) Hogwild Gibbs algorithms (connected with Terenin et al.’s recent paper, not quoted in this survey), or some versions of consensus Monte Carlo covered in earlier posts, the missing links being, in my humble opinion, an assessment of the worth of those solutions (in the spirit of “here’s the solution, what was the problem again?”) and, once again, the computing time issue. Chapter 5 briefly discusses some recent developments in variational mean field approximations, which are farther from my interests and (limited) competence, but which appear as a particular class of approximate models and thus could (and should?) relate to likelihood-free methods. Chapter 6, about the current challenges of the field, is presumably the most interesting chapter in this monograph in that it raises open questions and suggests directions for future research. For instance, contrasting the long-term MCMC error with the short-term transient bias. Or the issue of comparing different implementations from a practical and timely perspective.
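To fix ideas on the simplest of these alternatives, here is a bare-bones sketch of the consensus Monte Carlo combination step (in the spirit of the Scott et al. recipe; the precision weighting is exact for Gaussian subposteriors only, and all names are mine):

```python
import numpy as np

def consensus_combine(draws):
    """Combine subposterior draws by precision-weighted averaging.

    draws: list of (n, dim) arrays, one per data shard, dim >= 2,
    all with the same number n of (aligned) posterior draws.
    """
    precisions = [np.linalg.inv(np.cov(d, rowvar=False)) for d in draws]
    total = np.linalg.inv(sum(precisions))          # combined covariance
    weighted = sum(p @ d.T for p, d in zip(precisions, draws))
    return (total @ weighted).T                     # (n, dim) combined draws
```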

optimal importance sampling

Posted in Books, Statistics, Travel, University life on January 13, 2016 by xi'an

[picture: somewhere near Zürich, Jan. 4, 2016]

An arXiv file that sat for quite a while in my to-read pile is Variance reduction in SGD by distributed importance sampling by Alain et al. I had to wait for the flight to Zürich and MCMskv to get a look at it. The part of the paper that is of primary interest to me is the generalisation of the optimal importance function result

$$q^0(x) \propto f(x)\,\lvert h(x)\rvert$$

to higher dimensions. Namely, what is the best importance function for approximating the expectation of h(X) when h is multidimensional? There does exist an optimal solution when the criterion is the trace of the variance matrix of the estimator, in which case the solution is proportional to the target density times the Euclidean norm of the target integrand,

$$q^0(x) \propto f(x)\,\lVert h(x)\rVert$$

The application of the result to neural networks and stochastic gradients using minibatches of the training set somehow escapes me, even though the asynchronous aspects remind me of the recent asynchronous Gibbs sampler of Terenin, Draper, and Simpson.

While the optimality obtained in the paper is mathematically clear, I am a wee bit surprised at the approach: the lack of a normalising constant in the optimum means using a reweighted approximation that drifts away from this optimum. Furthermore, this optimum is sub-optimal when compared with the component-wise optimum, which produces a variance of zero (if we assume the normalising constant to be available). Obviously, using the component-wise optima requires running as many simulations as there are components in the integrand, but since cost does not seem to be central to this study…
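To make the gain brought by the optimum concrete, here is a quick numerical check (my own illustration by quadrature, far from the distributed SGD setting of the paper) comparing the trace of the variance of a one-draw importance sampling estimate of E_f[h(X)], for h(x)=(x,x²) and f the standard normal density, when sampling from f itself versus from q⁰ ∝ f‖h‖:

```python
import numpy as np

def f(x):                               # target: standard normal density
    return np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)

def h(x):                               # two-dimensional integrand
    return np.stack([x, x ** 2])

xs = np.linspace(-12, 12, 200_001)      # quadrature grid
dx = xs[1] - xs[0]
hn = np.linalg.norm(h(xs), axis=0)      # ||h(x)||
mu = (h(xs) * f(xs)).sum(axis=1) * dx   # E_f[h(X)] = (0, 1)

def trace_var(q):                       # tr Var of a single-draw IS estimate
    m = q > 0
    return (f(xs[m]) ** 2 / q[m] * hn[m] ** 2).sum() * dx - mu @ mu

q_target = f(xs)                                    # proposal = target
q_opt = f(xs) * hn / ((f(xs) * hn).sum() * dx)      # proposal ∝ f ||h||
print(trace_var(q_target), trace_var(q_opt))        # about 3 versus about 0.75
```

(The first value is E_f[‖h‖²]−‖μ‖² = 3, the second (E_f[‖h‖])²−‖μ‖², smaller by Cauchy–Schwarz.)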

MCMskv #5 [future with a view]

Posted in Kids, Mountains, R, Statistics, Travel, University life on January 12, 2016 by xi'an

As I am flying back to Paris (with an afternoon committee meeting in München in-between), I am reminiscing on the superlative scientific quality of this MCMski meeting, on the novel directions in computational Bayesian statistics exhibited therein, and on the potential settings for the next meeting. If any.

First, as hopefully obvious from my previous entries, I found the scientific program very exciting, with almost uniformly terrific talks and a coverage of the field of computational Bayesian statistics that is perfectly tuned to my own interests. In that sense, MCMski is my “top one” conference! Even without considering the idyllic location. While some of the talks were about papers I had already read (and commented on here), others brought new vistas and ideas. If one theme is to emerge from this meeting, it has to be that of approximate and noisy algorithms, with a wide variety of solutions and approaches to overcoming complexity issues. If anything, I wish the solutions would also incorporate the Boxian fact that the statistical models themselves are approximate. Overall, a fantastic program (says one member of the scientific committee).

Second, as with previous MCMski meetings, I again enjoyed the unique ambience of the meeting, which always feels more relaxed and friendly than other conferences of a similar size, maybe because of the après-ski atmosphere or the special coziness provided by luxurious mountain hotels. This year’s hotel was particularly pleasant, with non-guests like myself able to partake of some of its facilities. A big thank you to Anto for arranging all the details of such a large meeting so meticulously!!! I am even more grateful when realising this is the third time Anto has taken on the heavy load of organising MCMski. Grazie mille!

Since this is a [and even the!] BayesComp conference, the current section’s program chair and board must decide on the structure and schedule of the next meeting. A few suggestions if I may: I would scrap the name MCMski entirely from the next conference as (a) it may sound like academic tourism to unaware bystanders (who only need to check the program of any of the MCMski conferences to be reassured!) and (b) its topics go way beyond MCMC. Given the large attendance and the equally large proportion of young researchers, I would also advise against hosting the conference in a ski resort, for both cost and accessibility reasons [as we had already discussed after MCMski IV], in favour of a town large enough to offer a reasonable range of accommodations and travel options. Like Chamonix, Innsbruck, Reykjavik, or any place with a major airport about one hour away… If nothing is available with skiing possibilities, so be it! While the outdoor inclinations of the early organisers induced us to pick locations where skiing over the lunch break was a perk, any accessible location that allows for a concentration of researchers in a small area, and for the ensuing day-long exchanges, is fine! Among the novelties in the program, the tutorials and the Breaking news! sessions were quite successful (says one member of the scientific committee) and should be continued in one format or another. Maybe a more programming-oriented thread could be added as well… And, as we had mentioned earlier, a stronger involvement of the Young Bayesian section in the program would be great! (Even though the current meeting already had many talks by young researchers.)

MCMskv #1 [room with a view]

Posted in Mountains, pictures, Statistics, Travel, University life on January 6, 2016 by xi'an

That’s it, MCMskv has now started! We held our round-table Monday night, which ended with most of my interventions revolving around the importance of models, and the fact that models are always approximate (and wrong), hence that uncertainty and its ascertainment are paramount. Even more so with large datasets and high-dimensional models. Apologies to the audience if I sounded like I was running on a very short loop. (And maybe also for the round-table keeping them from their dinner!) Still, I got some items for reflection out of this discussion, including the notion that big data is usually and inappropriately associated with an impression of completeness that is almost deterministic in a Laplacian sense, namely that the available data for, say, all Facebook users seems to allow us (or The Machine) to play Laplace’s Demon, and thus forgoes the need for uncertainty and its ascertainment. Which obviously clashes with the issues of poor data, inappropriate models, and time or space stationarity of the available information.

Two more computing-related notions that came out of the discussion [for me] are asynchronicity (in the sense explored by Terenin et al. a few months ago) and subsampling. The latter seems to mean many things, judging from the discussion between the panel and the audience. For me, it corresponds to the ability (or inability) to handle only part of the available data when simulating the posterior associated with the entire dataset.

The first talk on Tuesday morning was the plenary talk by Michael Jordan about his incorporation of complexity constraints on the convergence of an MCMC variable selection algorithm. (I thought I had commented on this paper in the past on the ‘Og, but apparently I had not!) This was quite interesting, with ultra-fast convergence of the sampler. The talk was alas made harder to follow by a cameraman standing in front of most of the audience for the entire time, as in the above picture. (I also noticed the interesting randomness of the light panels, which all display different patterns of dots, maybe random enough to satisfy a randomness test!) Another, if irrelevant, annoyance was discovering upon arrival that my airbnb rental was located 8 kilometres away from the conference venue, in a completely different town! Thankfully, we had rented a car [for 5], which saved the day (and even more the night!).