## X entropy for optimisation

Posted in Books, pictures, Statistics, Travel, University life on March 29, 2018 by xi'an

At Gregynog, with mounds of snow still visible in the surrounding hills, not to be confused with the many sheep dotting the fields(!), Owen Jones gave a three-hour lecture on simulation for optimisation, which is a less travelled path when compared with simulation for integration. His second lecture covered cross entropy for optimisation purposes. (I had forgotten that Reuven Rubinstein and Dirk Kroese had put forward this aspect of their technique in the very title of their book, subtitled “A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation and Machine Learning”.) The X entropy approach pushes for simulations restricted to top values of the target function, iterating to find the best parameter in the parametric family used for the simulation. (Best to be understood in the Kullback sense.) Now, this is a wee bit like simulated annealing, where lots of artificial entities have to be calibrated in the algorithm, due to the original problem being unrelated to a specific stochastic framework. X entropy facilitates concentration on the highest values of the target, but requires a family of probability distributions that puts weight on the top region. This may be a damning issue in large dimensions. Owen illustrated the approach in the case of the travelling salesman problem, where the parameterised distribution is a Markov chain on the state space of city sequences. Further, if the optimal value of the target is unknown, avoiding getting stuck in a local optimum may be tricky. (Owen presented a proof of convergence for a temperature going to zero slowly enough that is equivalent to a sure exploration of the entire state space, in a discrete setting, which does not provide reassurance in this respect, as the corresponding algorithm cannot be implemented.) This method falls into the range of methods that are doubly stochastic in that they rely on Monte Carlo approximations at each iteration of the exploration algorithm.
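To make the iteration concrete, here is a minimal Python sketch of cross-entropy maximisation. It assumes an independent Gaussian sampling family, chosen purely for illustration (it is not the Markov chain family used for the travelling salesman problem). Each round simulates from the current distribution, keeps the simulations with the top values of the target, and refits the parameters on those elite points, which for a Gaussian family is the Kullback-optimal update. The function name, the defaults and the toy target are all hypothetical.

```python
import numpy as np

def cross_entropy_maximise(f, mu, sigma, n_samples=200, elite_frac=0.1,
                           n_iter=50, seed=0):
    """Maximise f over R^d with an independent Gaussian sampling family."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    n_elite = max(2, int(elite_frac * n_samples))
    for _ in range(n_iter):
        # simulate from the current member of the parametric family
        x = rng.normal(mu, sigma, size=(n_samples, mu.size))
        scores = np.array([f(xi) for xi in x])
        # keep only the simulations with the top values of the target
        elite = x[np.argsort(scores)[-n_elite:]]
        # refit the Gaussian on the elite sample (maximum likelihood,
        # i.e. the Kullback-Leibler projection within this family)
        mu = elite.mean(axis=0)
        sigma = elite.std(axis=0) + 1e-12  # small floor to avoid a zero scale
    return mu

# toy target (an assumption) with a single maximum at (1, -2)
def toy_target(x):
    return -(x[0] - 1.0) ** 2 - (x[1] + 2.0) ** 2

best = cross_entropy_maximise(toy_target, mu=[0.0, 0.0], sigma=[5.0, 5.0])
print(best)
```

On a unimodal target like this one the scale shrinks geometrically around the maximum. On a multimodal target the same shrinkage is what makes the method liable to lock onto a local optimum, as noted above.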

During a later talk, I tried to recycle one of my earlier R codes on simulated annealing for sudokus, but could not find a useful family of proposal distributions to reach the (unique) solution. Using a mere product of distributions on each of the free positions in the sudoku grid only led me to a penalty of 13 errors…

1    2    8    5    9    7    4    9    3
7    3    5    1    2    4    6    2    8
4    6    9    6    3    8    5    7    1
2    7    5    3    1    6    9    4    8
8    1    4    7    8    9    7    6    2
6    9    3    8    4    2    1    3    5
3    8    6    4    7    5    2    1    9
1    4    2    9    6    3    8    5    7
9    5    7    2    1    8    3    4    6


It is hard to consider a distribution on the space of permutations, 𝔖⁸¹.
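The post does not spell out how the penalty is computed. One natural choice, assumed here, is to count the digits missing from each row, each column and each 3×3 block. Under that assumed definition, the grid above scores 13, in agreement with the figure quoted. A short Python sketch (the function name is hypothetical):

```python
import numpy as np

# the grid reported above
GRID = np.array([
    [1, 2, 8, 5, 9, 7, 4, 9, 3],
    [7, 3, 5, 1, 2, 4, 6, 2, 8],
    [4, 6, 9, 6, 3, 8, 5, 7, 1],
    [2, 7, 5, 3, 1, 6, 9, 4, 8],
    [8, 1, 4, 7, 8, 9, 7, 6, 2],
    [6, 9, 3, 8, 4, 2, 1, 3, 5],
    [3, 8, 6, 4, 7, 5, 2, 1, 9],
    [1, 4, 2, 9, 6, 3, 8, 5, 7],
    [9, 5, 7, 2, 1, 8, 3, 4, 6],
])

def sudoku_penalty(grid):
    """Number of digits missing from the rows, columns and 3x3 blocks."""
    grid = np.asarray(grid)
    units = [grid[i, :] for i in range(9)]            # rows
    units += [grid[:, j] for j in range(9)]           # columns
    units += [grid[r:r + 3, c:c + 3].ravel()          # 3x3 blocks
              for r in (0, 3, 6) for c in (0, 3, 6)]
    return sum(9 - len(set(unit.tolist())) for unit in units)

print(sudoku_penalty(GRID))  # → 13
```

The 13 splits into 6 errors over the rows, 6 over the columns and 1 over the blocks, so this grid is almost consistent block-wise while the rows and columns are not.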

## the ABC-SubSim algorithm

Posted in pictures, Statistics on April 29, 2014 by xi'an

In a nice coincidence with my ABC tutorial at AISTATS 2014 – MLSS, Manuel Chiachio, James Beck, Juan Chiachio, and Guillermo Rus arXived today a paper on a new ABC algorithm, called ABC-SubSim. The SubSim stands for subset simulation and corresponds to an approach developed by one of the authors for rare-event simulation. This approach looks somewhat similar to the cross-entropy method of Rubinstein and Kroese, in that successive tail sets are created towards reaching a very low probability tail set. Simulating from the current subset increases the probability of reaching the next, less probable tail set. The extension to the ABC setting is done by looking at the acceptance region (in the augmented space) as a tail set and by defining a sequence of tolerances. The paper could also be connected with nested sampling in that constrained simulation through MCMC occurs there as well. Following the earlier paper, the MCMC implementation therein is a random-walk-within-Gibbs algorithm. This is somewhat the central point in that the sample from the previous tolerance level is used to start a Markov chain aiming at the next tolerance level. (Del Moral, Doucet and Jasra use instead a particle filter, which could easily be adapted to the modified Metropolis move considered in the paper.) The core difficulty with this approach, not covered in the paper, is that the MCMC chains used to produce samples from the constrained sets have to be stopped at some point, esp. since the authors run those chains in parallel. The stopping rule is not provided (see, e.g., Algorithm 3) but its impact on the resulting estimate of the tail probability could be far from negligible… Esp. because there is no burn-in/warm-up. (I cannot see how “ABC-SubSim exhibits the benefits of perfect sampling” as claimed by the authors, p. 6!) The authors re-examined the MA(2) toy benchmark we had used in our earlier survey, reproducing as well the graphical representation on the simplex as shown above.
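For readers unfamiliar with subset simulation, the rare-event version (not ABC-SubSim itself) can be sketched in Python as follows, under simplifying assumptions: a standard normal input, a plain random-walk Metropolis move instead of the modified Metropolis move of the paper, a fixed fraction `p0` of points kept as seeds at each level, and a fixed chain length per seed, which is exactly the kind of arbitrary stopping rule discussed above. All names and defaults are hypothetical.

```python
import numpy as np

def subset_simulation(g, dim, threshold, n=2000, p0=0.1, step=1.0,
                      max_levels=20, seed=0):
    """Estimate P(g(X) > threshold) for X ~ N(0, I_dim) by nested tail sets."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, dim))
    y = np.array([g(xi) for xi in x])
    n_seeds = int(p0 * n)
    steps = n // n_seeds              # fixed chain length for every seed
    prob = 1.0
    for _ in range(max_levels):
        order = np.argsort(y)[::-1]
        level = y[order[n_seeds - 1]]  # intermediate threshold: top p0 fraction
        if level >= threshold:
            break                      # final tail set reached
        prob *= n_seeds / n
        # seeds already lie in the next tail set {g >= level}
        xs, ys = x[order[:n_seeds]], y[order[:n_seeds]]
        new_x, new_y = [], []
        for _ in range(steps):
            prop = xs + step * rng.standard_normal(xs.shape)
            prop_y = np.array([g(p) for p in prop])
            # Metropolis ratio for the N(0, I) target, restricted to the tail set
            log_ratio = 0.5 * (np.sum(xs ** 2, axis=1) - np.sum(prop ** 2, axis=1))
            accept = (np.log(rng.random(n_seeds)) < log_ratio) & (prop_y >= level)
            xs = np.where(accept[:, None], prop, xs)
            ys = np.where(accept, prop_y, ys)
            new_x.append(xs)
            new_y.append(ys)
        x, y = np.concatenate(new_x), np.concatenate(new_y)
    return prob * np.mean(y > threshold)

# toy check: P(X > 4) for a standard normal X is about 3.17e-5
print(subset_simulation(lambda x: x[0], dim=1, threshold=4.0))
```

The estimate is a product of conditional frequencies, each obtained from correlated chain output, so the choice of `steps` and `step` feeds directly into its variability.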

## Split Sampling: expectations, normalisation and rare events

Posted in Books, Statistics, University life on January 27, 2014 by xi'an

Just before Christmas (a year ago), John Birge, Changgee Chang, and Nick Polson arXived a paper with the above title. Split sampling is presented as a tool conceived to handle rare event probabilities, written in this paper as

$Z(m)=\mathbb{E}_\pi[\mathbb{I}\{L(X)>m\}]$

where π is the prior and L the likelihood, m being a large enough bound to make the probability small. However, given John Skilling’s representation of the marginal likelihood as the integral of the Z(m)’s, this simulation technique also applies to the approximation of the evidence. The paper refers from the start to nested sampling as a motivation for this method, presumably not as a way to run nested sampling, which was created as a tool for evidence evaluation, but as a competitor. Nested sampling may indeed face difficulties in handling the coverage of the higher likelihood regions under the prior and it is an approximative method, as we detailed in our earlier paper with Nicolas Chopin. The difference between nested and split sampling is that split sampling adds a distribution ω(m) on the likelihood levels m. If pairs (x,m) can be efficiently generated by MCMC for the target

$\pi(x)\omega(m)\mathbb{I}\{L(x)>m\},$

the marginal density of m can then be approximated by Rao-Blackwellisation. From which the authors derive an estimate of Z(m), since the marginal is actually proportional to ω(m)Z(m). (Because of the Rao-Blackwell argument, I wonder how much this differs from Chib’s 1995 method, i.e. if the split sampling estimator could be expressed as a special case of Chib’s estimator.) The resulting estimator of the marginal also requires a choice of ω(m) such that the associated cdf can be computed analytically. More generally, the choice of ω(m) impacts the quality of the approximation since it determines how often and easily high likelihood regions will be hit. Note also that the conditional π(x|m) is the same as in nested sampling, hence may run into difficulties for complex likelihoods or large datasets.
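One way to read the Rao-Blackwell step, offered as a reconstruction from the description above rather than in the paper's own notation: given x, the conditional of m is ω(m) truncated to m<L(x), with normalising constant Ω(L(x)), where Ω denotes the cdf of ω. Averaging this conditional over the simulated x's and dividing by ω(m) gives an estimate of Z(m) up to a constant, namely the average of $\mathbb{I}\{L(x_t)>m\}/\Omega(L(x_t))$. The Python sketch below runs this on a toy model where everything is available in closed form. The prior, likelihood and ω are assumptions made for illustration, and both conditionals are simulated exactly, which is precisely what cannot be done in realistic problems.

```python
import numpy as np

def split_sampling_toy(n_iter=20000, seed=0):
    """Gibbs sampler on (x, m) for a toy model; returns the values L(x_t).

    Assumed toy setting: prior pi = U(0, 1), likelihood L(x) = 1 + x,
    weight omega = U(0, 2), so that Z(m) = min(1, 2 - m) on [0, 2].
    """
    rng = np.random.default_rng(seed)
    x, lik = 0.5, np.empty(n_iter)
    for t in range(n_iter):
        m = rng.uniform(0.0, 1.0 + x)            # m | x: omega truncated to m < L(x)
        x = rng.uniform(max(0.0, m - 1.0), 1.0)  # x | m: prior restricted to L(x) > m
        lik[t] = 1.0 + x
    return lik

def z_hat(m, lik):
    """Rao-Blackwellised estimate of Z(m), normalised so that Z(0) = 1."""
    omega_cdf = lik / 2.0                        # cdf of U(0, 2) at L(x_t)
    return np.mean((lik > m) / omega_cdf) / np.mean(1.0 / omega_cdf)

lik = split_sampling_toy()
print(z_hat(1.5, lik))  # the exact value is 0.5 in this toy model
```

The normalisation uses the fact that Z(0)=1 here since the likelihood is bounded away from zero. Note that the terms 1/Ω(L(x_t)) are bounded in this toy model, which is not guaranteed for other choices of ω.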

When reading the beginning of the paper, the remark that “the chain will visit each level roughly uniformly” (p.13) made me wonder at a possible correspondence with the Wang-Landau estimator. Until I read the reference to Jacob and Ryder (2012) on page 16. Once again, I wonder at a stronger link between both papers since the Wang-Landau approach aims at optimising the exploration of the simulation space towards a flat histogram. See for instance Figure 2.

The following part of the paper draws a comparison with both nested sampling and the product estimator of Fishman (1994). I do not fully understand the consequences of the equivalence between those estimators and the split sampling estimator for specific choices of the weight function ω(m). Indeed, it seemed to me that the main point was to draw from a joint density on (x,m) to avoid the difficulties of exploring separately each level set. And also avoiding the approximation issues of nested sampling. As a side remark, the fact that the harmonic mean estimator occurs at several points of the paper makes me worried. The qualification of “poor Monte Carlo error variances properties” is an understatement for the harmonic mean estimator, as it generally has infinite variance and it hence should not be used at all, even as a starting point. The paper does not elaborate much about the cross-entropy method, despite using an example from Rubinstein and Kroese (2004).
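The infinite variance of the harmonic mean estimator can be checked in one line. Writing E for the evidence (a symbol introduced here to avoid a clash with Z(m)) and π(x|y)=π(x)L(x)/E for the posterior, the estimator averages 1/L(x) over posterior draws, which has the correct expectation 1/E, but its second moment is

$\mathbb{E}_{\pi(\cdot|y)}[L(X)^{-2}]=\dfrac{1}{E}\int \dfrac{\pi(x)}{L(x)}\,\text{d}x$

and this integral diverges as soon as the prior has tails at least as heavy as those of the likelihood. As an illustrative example (not taken from the paper), take a single observation y from N(x,σ²) with a N(0,τ²) prior on x. Then

$\dfrac{\pi(x)}{L(x)}\propto\exp\left\{\dfrac{(x-y)^2}{2\sigma^2}-\dfrac{x^2}{2\tau^2}\right\}$

which is not integrable whenever τ²≥σ², that is, whenever the prior is at least as diffuse as the likelihood, the usual situation in practice.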

In conclusion, an interesting paper that made me think anew about the nested sampling approach, which keeps its fascination over the years! I will most likely use it to build an MSc thesis project this summer in Warwick.

## Reuven Rubinstein (1938-2012)

Posted in Statistics on December 10, 2012 by xi'an

I just learned last night that Professor Reuven Rubinstein passed away. While I was not a close collaborator of his, I met Reuven Rubinstein a few times at conferences and during a short visit to Paris, and each time learned from the encounter. I also appreciated his contributions to the field of simulation, esp. his cross-entropy method that culminated in the book The Cross-Entropy Method with Dirk Kroese. Reuven was involved in many aspects of simulation throughout his prolific career; he will be especially remembered for his 1981 book Simulation and the Monte Carlo Method, which is arguably the very first book on simulation as a Monte Carlo method. This book had a recent second edition, co-authored with Dirk Kroese as well. It is thus quite a sad day to witness this immense contributor to the field leave us. (Here is a link to his webpage at Technion, including pictures of a trip to the Gulag camp where he spent most of his childhood.) I presume there will be testimonies about his influence at the WSC 2012 conference here in Berlin.