Dream on!

On Saturday, I was asked to referee Jasper Vrugt’s paper “DREAM(D): an adaptive Markov chain Monte Carlo simulation algorithm to solve discrete, noncontinuous, posterior parameter estimation problems(I have added the proper upper case letters!) for the journal Hydrology and Earth System Sciences. It may sound surprising that I advertise this refereeing, but Hydrology and Earth System Sciences has the fairly interesting feature that referees’ comments can be turned into a discussion of the paper, along with other spontaneous comments. The only drawbacks are that (a) the discussion remains open for only 8 days and (b) using LaTeX commands is not as straightforward as on WordPress. Here is (more or less) my entry:

“The DREAM(D) code is perhaps the only method available to solve discontinuous parameter estimation problems, and simultaneously provide an estimate of uncertainty.”

The paper by Jasper Vrugt is following a long series of papers on this topic of DREAM (DiffeRential Evolution Adaptive Metropolis) algorithms developed by the author, from 2003 onwards. The major idea of the DREAM algorithms seems to be the use of several sequences in parallel (or of a population of points) in an MCMC setting. This reminds me of works in the 1990’s by Laird Breyer, Gareth Roberts, and Richard Tweedie on coupled MCMC, as well as of our pinball sampler, with Kerrie Mengersen. Assuming a target that is the product of the original targets, MCMC schemes that move the whole population at each iteration conditional on the previous population obviously allows for a sort of adaptivity, in the sense that empirical moments can be derived from this population. However, since the target increases in dimension, the drawback is that convergence clearly takes longer than with a single chain, despite the adaptive features. This is actually the reason why we did not pursue the pinball sampler… The current paper is an adaptation of the general scheme to a discrete state space, with the only change being to consider the integer part of the original DREAM perturbation. However, the algorithm contains an unclear step (b) referring to the “crossover probability” that I do not understand (there is at least one typo and the fact that d’ keeps decreasing by a factor 1 is not possible in the long run). As a consequence, I also do not understand why this does not impact detailed balance. Overall, I rather object to the adjective adaptive used in the name of the algorithm, next to MCMC, because this is not an “adaptive MCMC” algorithm in the sense of Andrieu, Haario, Roberts, Rosenthal, and others (incl. the participants to Adap’ski!), in that the Markov kernel on the product space does not change along iterations. There is no need for convergence to occur beyond detailed balance (Figure 2 being rather unhelpful in that respect).

“Here, I have introduced a discrete MCMC simulation algorithm that is especially designed to solve noncontinuous posterior sampling problems.”

The first example found in the paper is a sudoku solver, a problem I find quite interesting. The details are missing from the paper, but the likelihood associated with a grid presumably penalises grids breaking some of the constraints in a way similar to my simple exponential penalty. Although the goal is to maximise the likelihood (which then provides the only solution to the sudoku), the paper seems to advocate a regular MCMC dynamic, with time-homogeneous Markov kernel and target density. It thus seems that the global optimum is hit by chance along the way. Using a time-homogeneous Markov chain for optimisation, as opposed to the time-heterogeneous original Metropolis et al. (1953) simulated annealing approach, sounds too naïve for large discrete spaces.

“Each parameter is discretized equidistantly in 250 intervals with respective step size listed in the last column at the right hand side. This gridding is necessary to create a non-continuous, discrete, parameter estimation problem.”

The second example is about an hydrology complex model whose description is missing (Figure 3 is a sudoku related graph). It appears as a somehow contrived problem if the above quote is to be taken into account. The conclusion of this section is that the new DREAM algorithm applied to the discretised version of the problem is doing as well as the original DREAM algorithm applied to the original model. There is no optimisation in this case.

Overall, I am not convinced by the claims quoted above as to the adequation of the new DREAM algorithm to the discrete optimisation problem, in the sense that there is very little in the design that takes the problem into account: DREAM(D) is a mere discretisation of DREAM and the scale of the moves is dictated by the variability of the population at the previous step, assuming a sort of connexity that is less than common in discrete state spaces.There is no connexity in most of those problems. For instance, in the sudoku example, the values in a given entry of the grid have no ordinal meaning and 7 is just as far from 8 than 1. A discrete space distance would thus seem more appropriate than a linear distance in this case. But this does not mean it would provide an efficient generic way of exploring a discrete space because of the discontinuity in the problem.

I also asked Christian Schäfer for his opinion on this technique as he is currently working on binary model optimisation,  as in Schäfer and Chopin. Here are his comments:

Sampling on discrete state spaces using interacting particle Monte Carlo methods comes with the major difficulty that we cannot control the acceptance rates of the Markov chains by scaling the variance as we do in continuous state spaces. In discrete states spaces, there is no analogue to the classical random walk Hastings kernel driven by some (possibly adapted and scaled) multivariate normal.

Jasper Vrugt addresses this problem by rounding the proposals produced by the mechanism he found to be effective in continuous spaces. I would expect that the proposal mechanism of DREAM is likely to produce states that are hardly ever accepted (large b) or states that have already been visited (small b). The discontinuity of the state space will, in general, not allow calibrating b in the sense we scale the variance of the random walk proposals in the continuous case. Still, maybe it leads to efficient transition kernels in special cases.

Finally, I do not see why this approach would yield a good optimization algorithm; it seems to be rather a mean of state spaces exploration, but there are more efficient methods (like Wang-Landau or enumeration algorithms) to this end. However, my perspective is biased by the fact that I concentrate my work on binary spaces for which I am sure the DREAM algorithm would fail since the calibration problem described above is taken to an extreme.

5 Responses to “Dream on!”

  1. I am finally reading the comments in more detail. I suggest to have a look at the revised paper. This contains two additional proposal distributions beyond the simple rounding operator that maintain balance and are especially designed for combinatorial problems. Thus for spaces that are no longer Euclidean, and involve directed swap and random swap moves of coordinates of the search space. Strangely enough even to date (after 8 months), I did not receive any feedback on this revision, nor after emailing the revised paper directly to this referee. Obviously one can design a specific sudoku solver. But that is not the scope of the present paper. The sudoku example is just used to illustrate that DREAM_{D} can solve discrete problems as well. The hydrologic examples follow from the literature. The code is now being used by various others, and I am receiving positive feedback.

  2. […] on published papers and, why not?!, rejected authors could respond to reviews… This is why I liked the format of the review process in the journal Hydrology and Earth System Sciences. that allows […]

  3. Another referee, Bettina Schaefli from EPFL, also posted a set of comments on the paper. She also finds the Sudoku example too obscure as stated and also thinks the paper should include a truly discrete problem rather than a discretised version of a continuous problem.

  4. Dear Xi:

    Thank you for your review. I appreciate your comments and will repond to each of them in detail on the HESS-D website. In summary:

    1. Please keep in mind that DREAM-D maintains detailed balance. It is thus by no means designed to be efficient in solving Sudoku puzzles!!!! This is because of the detailed balance condition. I can easily change the proposal distribution to be way more efficient for the sudoku puzzle, but this does not maintain balance. Thus DREAM-D is designed to maintain balance. The Sudoku is just shown that it can even solve a problem that has 50+ parameters, without providing any constraints on the starting solution.

    2. Indeed, I can provide the likelihood function for the sudoku example; there is different possibilities, but one where you take the max of the three constraint violations works ok. This in principle is not a correct likelihood function, but that does not matter for the application because we are only interested in the optimum! And indeed, many runs are needed to solve the puzzle using DREAM-D, but it is good news that MCMC can even solve something complicated like this!!! Again, branch and bound optimization would solve the problem in 3500 functiion evaluations, and about 5 seconds!!!! But efficiency is not the point for the Sudoku. The point is that DREAM-D can even solve problems like this. It is way more efficient for the hydrologic problems considered in the paper. And indeed, a similar number of function evaluations is needed between DREAM and DREAM-D to reach the posterior.

    3. DREAM-D might perhaps not be most efficienct in finding the optimum of a discrete problem (as also mentioned by one of your colleagues), but derives uncertainty. How many methods are there out there that derive posterior uncertainty of integers? This is important. And yes the DREAM-D method is rather general, without much problem knowledge so that it can be applied to a range of problems!!! This is the underlying idea. You have different MCMC methods, but they are each developed for a small class of problems. We have shown that DREAM and hopefully DREAM-D works for a range of problems, thus the sudoku problem to illustrate this.

    4. I respectfully disagree with your statement that adaptive in the title is inappropriate. The Haario sampler you refer to uses a fixed mathematical form of the proposal distribution (multinormal distribution), but with covariance adapted from the history of sampled points. This is called adaptive. DREAM-D is similar: a fixed mathematical form of the proposal distribution is used, but the distance between points (orientation and scale) is changing en route to the posterior. So, whereas Haario compute the covariance of the points, we do not do that, and simply use differencing using their exact position. This is as adaptive as Haario!

    I will respond to your comments online, and lets continue our conversation. Perhaps best to do this by email? Rather than to have everyone look at this?!

  5. […] Friday also saw some excellent blogging: OR in an OB World wondered how data analysis will influence operations research  and Re-educate Seattle mused about meaningful feedback for students. Also setting an example (seemed to be the theme of the week), Xian’s Og shared some open refereeing. […]

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.