## optimisation via slice sampling

Posted in Statistics on December 20, 2012 by xi'an

This morning, over breakfast, I read the paper recently arXived by John Birge and Nick Polson. I was intrigued by the combination of optimisation and slice sampling, but was a wee bit disappointed by the paper, in that it proposes a simple form of simulated annealing, minimising f(x) by targeting a small collection of energy functions exp{-τf(x)}. Indeed, the slice sampler is used to explore each of those targets, i.e. for a fixed temperature τ. For the four functions considered in the paper, a slice sampler can indeed be implemented, but this feature could be seen as marginalia, given that a random walk Metropolis-Hastings algorithm could be used as a proposal mechanism in other cases. The other intriguing fact is that annealing is not used in the traditional way of increasing the coefficient τ along iterations (as in our SAME algorithm), for which convergence issues are much more intricate; instead, a whole (Markov) sample is produced for each fixed temperature, the outcomes being then compared in terms of localisation (and possibly used as starting points for the next temperature value). I did not see any discussion about the selection of the successive temperatures, which is usually a delicate issue in realistic settings, nor of the stopping rule(s) used to decide that the optimum has been reached.
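
To make this more concrete, here is a minimal sketch, definitely not the paper's implementation, of a one-dimensional slice sampler (with stepping-out and shrinkage) exploring the tempered target exp{-τf(x)} at a fixed temperature τ; the test function f, the step width w, and the temperature grid are arbitrary choices of mine.

```r
#minimal sketch: slice sampling of exp{-tau*f(x)} at a fixed temperature tau
#(the test function f, step width w and temperatures are arbitrary choices)
f=function(x) (x^2-1)^2+.5*sin(5*x)      #toy multimodal energy to be minimised
logtarget=function(x,tau) -tau*f(x)

slice=function(x0,tau,niter=10^3,w=1){
  x=rep(x0,niter)
  for (t in 2:niter){
    xc=x[t-1]
    logy=logtarget(xc,tau)-rexp(1)       #log-level defining the current slice
    L=xc-w*runif(1); R=L+w               #stepping-out to bracket the slice
    while (logtarget(L,tau)>logy) L=L-w
    while (logtarget(R,tau)>logy) R=R+w
    repeat{                              #shrinkage until the draw falls in the slice
      xp=runif(1,L,R)
      if (logtarget(xp,tau)>logy) break
      if (xp<xc) L=xp else R=xp
    }
    x[t]=xp
  }
  x
}

#one (Markov) sample per fixed temperature, compared in terms of localisation
for (tau in c(1,10,100)){
  out=slice(runif(1,-2,2),tau)
  print(c(tau,out[which.min(f(out))]))   #best point visited at this temperature
}
```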

## threshold schedules for ABC-SMC

Posted in Statistics, University life on October 17, 2012 by xi'an

Daniel Silk, Sarah Filippi and Michael Stumpf just posted on arXiv a new paper on ABC-SMC. They attack therein a typical problem with SMC (and PMC, and even MCMC!) methods, namely the ability to miss (global) modes of the target due to a poor initial exploration. So, if a proposal is built on previous simulations and if those simulations have missed an important mode, it is quite likely that this proposal will concentrate on other parts of the space and miss the important mode even more. This is also why simulated annealing and other stochastic optimisation algorithms are so fickle: a poor parameterisation (like the temperature schedule) leads the sequence to converge to a local optimum rather than to the global optimum. Since sequential ABC is a form of simulated annealing, the decreasing tolerance (or threshold) playing the part of the temperature, it is no surprise that it suffers from this generic disease…

The method proposed in the paper aims at avoiding this difficulty by monitoring sharp changes in the acceptance rate curve (as a function of the tolerance ε),

$\aleph_t(\epsilon)=\int p_t(x)\mathbb{I}(\Delta(x,x^\star)\le \epsilon)\,\text{d}x,$

suggesting for threshold the value of ε that maximises the second derivative of this curve. Now, before getting to the issue of turning this principle into a practical algorithm, let me linger on the motivation for it:

“To see this, one can imagine an ε-ball expanding about the true data; at first the ball only encompasses a small number of particles that were drawn from very close to the global maximum, corresponding to the low gradient at the foot of the shape. Once ε is large enough we are able to accept the relatively large number of particles sitting in the local maximum, which causes the increase in gradient.” (p.5)

The argument for looking at values of ε preceding fast increases in the acceptance rate ℵ is thus that we avoid flatter and lower regions of the posterior support corresponding to a local maximum. It clearly encompasses the case studied by the authors of a highly concentrated global mode surrounded by flatter and lower modes, but it seems to me that this is not necessarily the only possible reason for changes in the shape of the acceptance probability ℵ. First, we are looking at an ABC acceptance rate, not at a genuine likelihood. Second, this acceptance rate function depends on the current (and necessarily rough) approximation to the genuine predictive, p_t. Furthermore, when taking into account this rudimentary replacement of the true likelihood function, it is rather difficult to envision how it impacts the correspondence between a proximity in the data and a proximity in the parameter space. (The toy example is both illuminating and misleading, because it considers a setting where the data is a deterministic transform of the parameter.) I thus think the analysis should refer more strongly to the non-parametric literature, and in particular to the k-nearest-neighbour approach recently reformulated by Biau et al.: there is no reason to push the tolerance ε all the way down to zero, as this limit does not correspond to the optimal value of the tolerance. If one does not use a non-parametric determination of the “right” tolerance, the lingering question is when and why to stop the sequence of ABC simulations.

The acceptance rate function ℵ is approximated using an unscented transform. I understand neither the implementation of, nor the motivations for, this choice. Given that the function ℵ is a one-dimensional transform of the tolerance and that it actually corresponds to the cdf of the distance between the true data and the (current) pseudo-data, why couldn’t we use smooth versions of a cdf estimate based on the simulated distances, given that these distances are already computed…? I must be missing something.
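
For what it is worth, here is a minimal sketch of this smoothed-cdf alternative (not the unscented transform scheme of the paper): estimate ℵ(ε) by a kernel-smoothed empirical cdf of the already simulated distances and pick the tolerance maximising the numerical second derivative of the curve. The toy distances and the bandwidth choice below are mine, only meant as an illustration.

```r
#sketch of the smoothed-cdf alternative (not the paper's unscented transform):
#toy distances standing in for Delta(x,x*) computed at the current ABC step
set.seed(1)
dist=c(abs(rnorm(200,0,.1)),abs(rnorm(800,3,1)))

eps=seq(0,max(dist),length=500)
h=bw.nrd0(dist)                          #default kernel bandwidth
#kernel-smoothed empirical cdf, i.e. an estimate of the acceptance rate curve
aleph=sapply(eps,function(e) mean(pnorm((e-dist)/h)))

#numerical second derivative by finite differences
d2=diff(aleph,differences=2)/diff(eps)[1]^2
epstar=eps[which.max(d2)+1]              #tolerance maximising the curvature

plot(eps,aleph,type="l",xlab=expression(epsilon),ylab="acceptance rate")
abline(v=epstar,lty=2)
```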

## interacting particles ABC

Posted in Statistics on August 27, 2012 by xi'an

Carlo Albert and Hans Kuensch recently posted an arXiv paper which provides a new perspective on ABC. It relates to ABC-MCMC and to ABC-SMC in different ways, but the major point is to propose a sequential schedule for decreasing the tolerance that ensures convergence. Although there exist other proofs of convergence in the literature, this one is quite novel in that it connects ABC with the cooling schedules of simulated annealing. (The fact that the sample size does not appear, as it does in Fearnhead and Prangle’s non-parametric perspective, can be deemed less practical, but I think this is simply another perspective on the problem!) The corresponding ABC algorithm is a mix of MCMC and SMC in that it lets a population of N particles evolve in a quasi-independent manner, the population being only used to update the parameters of the independent (normal) proposal and those of the cooling tolerance. Each particle in the population moves according to a Metropolis-Hastings step, but this is not an ABC-MCMC scheme in that the algorithm works with a population at all times, and it is not an ABC-SMC scheme in that there is no weighting and no resampling.
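
To fix ideas, here is a minimal toy sketch in the spirit of the above description, and certainly not the authors' exact algorithm: each of N particles moves by an ABC Metropolis-Hastings step with an independent normal proposal refitted on the current population, while the tolerance follows a plain geometric decrease rather than the adaptive cooling schedule of the paper. The normal-mean model, the prior, and all tuning constants are mine.

```r
#toy sketch of an interacting-particle ABC scheme (not the authors' algorithm):
#normal-mean model, prior and tunings are arbitrary illustration choices
set.seed(42)
n=20; xobs=rnorm(n,mean=2); sobs=mean(xobs)     #observed data and its summary
prior=function(theta) dnorm(theta,0,sqrt(10))

N=100                                           #population size
theta=rnorm(N,0,sqrt(10))                       #particles started from the prior
dis=abs(sapply(theta,function(th) mean(rnorm(n,th)))-sobs)
eps=max(dis)                                    #initial tolerance

for (it in 1:50){
  mu=mean(theta); sig=sd(theta)                 #independent normal proposal
  for (i in 1:N){                               #refitted on the current population
    thp=rnorm(1,mu,sig)
    dp=abs(mean(rnorm(n,thp))-sobs)             #distance of the pseudo-data summary
    #ABC-MH acceptance with 0-1 kernel and independence proposal
    ratio=prior(thp)*dnorm(theta[i],mu,sig)/(prior(theta[i])*dnorm(thp,mu,sig))
    if (dp<=eps&&runif(1)<ratio){ theta[i]=thp; dis[i]=dp }
  }
  eps=eps*.9                                    #plain geometric tolerance decrease
}
print(c(mean(theta),sd(theta)))                 #location and spread of the ABC sample
```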

Maybe I can add two remarks about the conclusion: the authors do not seem aware of other works using penalties other than the 0-1 kernel, but those abound; see e.g. the discussion paper of Fearnhead and Prangle, or Ratmann et al. The other missing connection is with adaptive tolerance construction, which is also found in the literature; see e.g. Doucet et al. or Drovandi and Pettitt.

## ASC 2012 (#1)

Posted in Statistics, Travel, University life on July 11, 2012 by xi'an

This morning I attended Alan Gelfand's talk on directional data, i.e. data on the torus (0,2π), and found his modelling via wrapped normals (i.e. a normal reprojected onto the unit circle) quite interesting, raising lots of probabilistic questions. For instance, usual moments like mean and variance have no meaning in this space. The variance matrix of the underlying normal, as well as its mean, obviously matter. One thing I am wondering about is how restrictive the normal assumption is. Because of the projection, a random rescaling of the normal vector does not impact this wrapped normal distribution, but there are certainly features that are not covered by this family. For instance, I suspect it can offer at most two modes over the range (0,2π), and that it cannot be explosive at any point.
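
Out of curiosity, here is a quick simulation to poke at the modality question, reading the distribution as the direction of a bivariate normal reprojected onto the unit circle; the mean and covariance below are arbitrary choices of mine, picked to push towards bimodality.

```r
#quick check of the modality question for a normal reprojected onto the unit circle
#(mean and covariance are arbitrary choices pushing towards bimodality)
library(MASS)
set.seed(123)
mu=c(-.2,.1)                            #mean close to the origin favours two modes
Sig=matrix(c(1,.5,.5,2),2,2)
z=mvrnorm(10^5,mu,Sig)
theta=atan2(z[,2],z[,1])%%(2*pi)        #projected angles on (0,2*pi)
hist(theta,breaks=100,freq=FALSE,xlab="theta",
     main="normal reprojected onto the unit circle")
```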

The keynote lecture this afternoon was delivered by Roderick Little in a highly entertaining way, about calibrated Bayesian inference in official statistics. For instance, he mentioned the inferential “schizophrenia” in this field due to the divide between design-based and model-based inferences. Although he did not define what he meant by “calibrated Bayesian” in the most explicit manner, he had this nice list of nine good reasons to be Bayesian (that came close to my own list at the end of The Bayesian Choice):

1. conceptual simplicity (Bayes is prescriptive, frequentism is not), “having a model is an advantage!”
2. avoiding ancillarity angst (Bayes conditions on everything)
3. avoiding confidence cons (confidence is not probability)
4. nails nuisance parameters (frequentists are either wrong or have a really hard time)
5. escapes from asymptotia
6. incorporates prior information, and when information is lacking, weak priors work fine
7. Bayes is useful (25 of the top 30 cited are statisticians out of which … are Bayesians)
8. Bayesians go to Valencia! [joke! Actually it should have been Bayesians go MCMskiing!]
9. Calibrated Bayes gets better frequentist answers

He however insisted that frequentists should be Bayesians and also that Bayesians should be frequentists, hence the calibration qualification.

After an interesting session on Bayesian statistics, with mixtures (adaptive or not) and variational Bayes tools, I actually joined the “young statistician dinner” (without any pretence of being a young statistician, obviously) and had interesting exchanges on a whole variety of topics, especially as Kerrie Mengersen adopted (reinvented) my dinner table switch strategy (without my R simulated annealing code). Until jetlag caught up with me.

## [not] Le Monde puzzle (solution)

Posted in R, Statistics, University life on April 14, 2012 by xi'an

Following the question on dinner table permutations on StackExchange (mathematics) and the reply, provided by hardmath, that the right number was six, I was looking for a constructive solution to build the resolvable 2-(20,5,1) covering. A few hours later, hardmath again came up with an answer, found in the paper Equitable Resolvable Coverings by van Dam, Haemers and Peek (2002). In the meantime (and even during yesterday’s Big’MC seminar!), I had been thinking of a simulated annealing implementation, which was actually the approach used by van Dam, Haemers and Peek. Here is my (plain) R code:

#initialisation of the table assignments:
#rows for the 20 individuals, columns for the 6 courses,
#each course splitting the group into 4 tables of 5
T=sample(rep(1:4,5))
for (i in 2:6)
  T=cbind(T,sample(rep(1:4,5)))
#encounters table: M[i,j] counts the courses at which i and j share a table
meet=function(T){
  M=outer(T[,1],T[,1],"==")
  for (i in 2:6)
    M=M+outer(T[,i],T[,i],"==")
  M
}
#number of missed encounters (pairs never sharing a table)
penalty=function(M){ sum(M==0) }
penat=penalty(meet(T))

N=10^5     #maximum number of annealing iterations
gamma=100  #initial temperature

for (t in 1:N){
  #random pick of two switched individuals, favouring those with many missed encounters
  prop=sample(1:20,2,prob=apply(meet(T)==0,1,sum))
  cour=sample(1:6,1)
  Tp=T
  Tp[prop[1],cour]=T[prop[2],cour]
  Tp[prop[2],cour]=T[prop[1],cour]
  penatp=penalty(meet(Tp))
  print(c(penat,penatp))

  #stop as soon as a perfect arrangement is found
  if (penatp==0){
    T=Tp
    break
  }

  #annealing acceptance: always keep improvements, accept worse
  #arrangements with probability exp((penat-penatp)/gamma)
  if (log(runif(1))<(penat-penatp)/gamma){
    T=Tp
    penat=penatp
  }

  #cooling schedule: decrease the temperature every 10 iterations
  if (t%%10==0)
    gamma=gamma*.999
}


which happened to provide a solution on the second round (it got stuck at a penalty of 4 in the first round):

> T
T
[1,] 1 4 3 2 2 3
[2,] 1 2 4 3 4 4
[3,] 3 2 1 4 1 3
[4,] 1 2 3 1 1 1
[5,] 4 2 4 2 3 3
[6,] 2 4 1 2 4 1
[7,] 4 3 1 1 2 4
[8,] 1 3 2 4 3 1
[9,] 3 3 3 3 4 3
[10,] 4 4 2 3 1 1
[11,] 1 1 1 3 3 2
[12,] 3 4 4 1 3 2
[13,] 4 1 3 4 4 2
[14,] 2 4 3 4 3 4
[15,] 2 3 4 2 1 2
[16,] 2 2 2 3 2 2
[17,] 2 1 2 1 4 3
[18,] 4 3 1 1 2 4
[19,] 3 1 4 4 2 1
[20,] 3 1 2 2 1 4


(This makes a great example for my general public talk in Australia this summer/winter!)