Archive for simulated annealing

Posted in Books, Statistics, University life with tags , , , , , , , , , , on October 27, 2016 by xi'an

In the March 2016 issue of JASA that currently sits on my desk, there is a paper by Liang, Jim, Song and Liu on the adaptive exchange algorithm, which aims at handling posteriors for sampling distributions with intractable normalising constants. The concept behind the algorithm is the exchange principle initiated by Jesper Møller and co-authors in 2006, where an auxiliary pseudo-observation is simulated for the missing constants to vanish in a Metropolis-Hastings ratio. (The name exchangeable was introduced in a subsequent paper by Iain Murray, Zoubin Ghahramani and David MacKay, also in 2006.)

The crux of the method is to run an iteration as [where y denotes the observation]

1. Proposing a new value θ’ of the parameter from a proposal q(θ’|θ);
2. Generate a pseudo-observation z~ƒ(z|θ’);
3. Accept with probability

$\dfrac{\pi(\theta')f(y|\theta')}{\pi(\theta)f(y|\theta)}\dfrac{q(\theta|\theta')f(z|\theta)}{q(\theta'|\theta)f(z|\theta')}$

which has the appeal to cancel all normalising constants. And the repeal of requiring an exact simulation from the very distribution with the missing constant, ƒ(.|θ). Which means that in practice a finite number of MCMC steps will be used and will bias the outcome. The algorithm is unusual in that it replaces the exact proposal q(θ’|θ) with an unbiased random version q(θ’|θ)ƒ(z|θ’), z being just an augmentation of the proposal. (The current JASA paper by Liang et al. seems to confuse augment and argument, see p.378.)

To avoid the difficulty in simulating from ƒ(.|θ), the authors draw pseudo-observations from sampling distributions with a finite number m of parameter values under the [unrealistic] assumption (A⁰) that this collection of values provides an almost complete cover of the posterior support. One of the tricks stands with an auxiliary [time-heterogeneous] chain of pseudo-observations generated by single Metropolis steps from one of these m fixed targets. These pseudo-observations are then used in the main (or target) chain to define the above exchange probability. The auxiliary chain is Markov but time-heterogeneous since the probabilities of accepting a move are evolving with time according to a simulated annealing schedule. Which produces a convergent estimate of the m normalising constants. The main chain is not Markov in that it depends on the whole history of the auxiliary chain [see Step 5, p.380]. Even jointly the collection of both chains is not Markov. The paper prefers to consider the process as an adaptive Markov chain. I did not check the rather intricate in details, so cannot judge of the validity of the overall algorithm; I simply note that one condition (A², p.383) is incredibly strong in that it assumes the Markov transition kernel to be Doeblin uniformly on any compact set of the calibration parameters. However, the major difficulty with this approach seems to be in its delicate calibration. From providing a reference set of m parameter values scanning the posterior support to picking transition kernels on both the parameter and the sample spaces, to properly cooling the annealing schedule [always a fun part!], there seems to be [from my armchair expert’s perspective, of course!] a wide range of opportunities for missing the target or running into zero acceptance problems. Both examples analysed in the paper, the auto-logistic and the auto-normal models, are actually of limited complexity in that they depend on a few parameters, 2 and 4 resp., and enjoy sufficient statistics, of dimensions 2 and 4 as well. Hence simulating (pseudo-)realisations of those sufficient statistics should be less challenging than the original approach replicating an entire vector of thousands of dimensions.

Le Monde puzzle [#977]

Posted in Books, Kids, pictures, Statistics, Travel, University life with tags , , , , , on October 3, 2016 by xi'an

A mild arithmetic Le Monde mathematical puzzle:

Find the optimal permutation of {1,2,..,15} towards minimising the maximum of sum of all three  consecutive numbers, including the sums of the 14th, 15th, and first numbers, as well as the 15th, 1st and 2nd numbers.

If once again opted for a lazy solution, not even considering simulated annealing!,

parme=sample(15)
max(apply(matrix(c(parme,parme[-1],
parme[1],parme[-(1:2)],parme[1:2]),3),2,sum))


and got the minimal value of 26 for the permutation

14 9 2 15 7 1 11 10 4 12 8 5 13 6 3

Le Monde gave a solution with value 25, though, which is

11 9 7 5 13 8 2 10 14 6 1 12 15 4 3

but there is a genuine mistake in the solution!! This anyway shows that brute force may sometimes fail. (Which sounds like a positive conclusion to failing to find the proper solution! But trying with a simple simulated annealing version did not produce any 25 either…)

Posted in Books, Kids, pictures, R with tags , , , , , , , , , on May 6, 2016 by xi'an

The riddle from The Riddler this week is about finding an undirected graph with N nodes and no isolated node such that the number of nodes with more connections than the average of their neighbours is maximal. A representation of a connected graph is through a matrix X of zeros and ones, on which one can spot the nodes satisfying the above condition as the positive entries of the vector (X1)^2-(X^21), if 1 denotes the vector of ones. I thus wrote an R code aiming at optimising this target

targe <- function(F){
sum(F%*%F%*%rep(1,N)/(F%*%rep(1,N))^2<1)}


by mere simulated annealing:

rate <- function(N){
# generate matrix F
# 1. no single
F=matrix(0,N,N)
F[sample(2:N,1),1]=1
F[1,]=F[,1]
for (i in 2:(N-1)){
if (sum(F[,i])==0)
F[sample((i+1):N,1),i]=1
F[i,]=F[,i]}
if (sum(F[,N])==0)
F[sample(1:(N-1),1),N]=1
F[N,]=F[,N]
# 2. more connections
F[lower.tri(F)]=F[lower.tri(F)]+
sample(0:1,N*(N-1)/2,rep=TRUE,prob=c(N,1))
F[F>1]=1
F[upper.tri(F)]=t(F)[upper.tri(t(F))]
#simulated annealing
T=1e4
temp=N
targo=targe(F)
for (t in 1:T){
#1. local proposal
nod=sample(1:N,2)
prop=F
prop[nod[1],nod[2]]=prop[nod[2],nod[1]]=
1-prop[nod[1],nod[2]]
while (min(prop%*%rep(1,N))==0){
nod=sample(1:N,2)
prop=F
prop[nod[1],nod[2]]=prop[nod[2],nod[1]]=
1-prop[nod[1],nod[2]]}
target=targe(prop)
if (log(runif(1))*temp<target-targo){
F=prop;targo=target}
#2. global proposal
prop=F prop[lower.tri(prop)]=F[lower.tri(prop)]+
sample(c(0,1),N*(N-1)/2,rep=TRUE,prob=c(N,1))
prop[prop>1]=1
prop[upper.tri(prop)]=t(prop)[upper.tri(t(prop))]
target=targe(prop)
if (log(runif(1))*temp<target-targo){
F=prop;targo=target}
temp=temp*.999
}
return(F)}


This code returns quite consistently (modulo the simulated annealing uncertainty, which grows with N) the answer N-2 as the number of entries above average! Which is rather surprising in a Simpson-like manner since all entries but two are above average. (Incidentally, I found out that Edward Simpson recently wrote a paper in Significance about the Simpson-Yule paradox and him being a member of the Bletchley Park Enigma team. I must have missed out the connection with the Simpson paradox when reading the paper in the first place…)

Le Monde puzzle [#945]

Posted in Books, Kids, pictures, Statistics, Travel, University life with tags , , , , on January 25, 2016 by xi'an

A rather different Le Monde mathematical puzzle:

A two-person game is played on an nxn grid filled with zero. Each player pick an entry, which is increased by one as well as all adjacent entries. The game stops when all entries are equal. For n=3,4,5, what are the possible grids with identical values all over?

If I define an R neighbourhood function

 nighbo=function(i,n){
neigh=i
if (i%%n!=1) neigh=c(i-1,neigh)
if (i%%n>0) neigh=c(i+1,neigh)
if (i+n<=n*n) neigh=c(i+n,neigh)
if (i-n>0) neigh=c(i-n,neigh)
return(neigh)}


and try a brute force filling of the grid

while ((min(grid)==0)||(length(unique(grid))>1)){
ent=sample(1:(n*n),1,prob=1/as.vector(grid+1)^10)
grid[nighbo(ent,n)]=grid[nighbo(ent,n)]+1}


the loop never stops. When thinking of the case n=3 [while running in the early morning], I wondered whether or not reaching an equal value on all entries was at all possible. Indeed, it is impossible to update one of the four corners without updating at least one of the neighbours, while the converse is false. Experimenting further with simulated annealing to optimise the probabilities of picking different entries in the table when n=4,5 seems to indicate this is also the case for larger values of n, in that all attempts lead to larger values of neighbours to the four corners :

outer=c(1,n,n*n,n*n-n+1)
border=sort(unique(c(2:(n-1),(n*n-n+2):(n*n-1),1+n*(1:(n-2)),n+n*(1:(n-2)))))
inner=(1:(n*n))[-c(outer,border)]
#
target=function(weit){
grid=matrix(0,n,n)
for (t in 1:1e4){
cas=sample(1:3,1,prob=weit)
if (cas==1) ent=sample(outer,1,prob=max(grid[outer])-grid[outer]+1)
if (cas==2) ent=sample(border,1,prob=max(grid[border])-grid[border]+1)
if (cas==3) ent=sample(inner,1,prob=max(grid[inner])-grid[inner]+1)
ent=nighbo(ent,n)
grid[ent]=grid[ent]+1}
ave=c(mean(grid[outer]),mean(grid[border]),mean(grid[inner]))
return(list(dive=max(diff(sort(ave))),ave=ave,ava=mean(grid)))
}
#
weit=rep(1,3)
cur=target(weit)
T=100
while (cur$dive>0){ ind=sample(1:3,1,prob=1/cur$ave)
peit=weit
peit[ind]=weit[ind]-(cur$ave[ind]-cur$ava)*runif(1)/(cur$ava) while(peit[ind]<0) peit[ind]=weit[ind]-(cur$ave[ind]-cur$ava)*runif(1)/(cur$ava)
prop=target(peit)
if (log(runif(1))*1e4/T<prop$dive-cur$dive){
weit=peit;cur=prop}
T=T*1.00001
print(cur\$ave)}


convergence for non-Markovian simulated AAs

Posted in Books, pictures, Statistics with tags , , , on December 24, 2015 by xi'an

Mathieu Gerber (formerly CREST) and Luke Bornn have arXived a paper on the almost sure convergence of simulated annealing algorithms when using a non-Markovian sequence that can be in the limiting case completely deterministic and hence use quasi-Monte Carlo sequences. The paper extends the earlier Gerber and Bornn (2015) that I missed. While the paper is highly technical, it shows that under some conditions a sequence of time-varying kernels can be used to reach the maximum of an objective function. With my limited experience with simulated annealing I find this notion of non-iid or even non-random both worth investigating and somewhat unsurprising from a practitioner’s view in that modifying a standard simulated annealing algorithm with steps depending on the entire past of the sequence usually produces better performances.

more of the same!

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , on December 10, 2015 by xi'an

Daniel Seita, Haoyu Chen, and John Canny arXived last week a paper entitled “Fast parallel SAME Gibbs sampling on general discrete Bayesian networks“.  The distributions of the observables are defined by full conditional probability tables on the nodes of a graphical model. The distributions on the latent or missing nodes of the network are multinomial, with Dirichlet priors. To derive the MAP in such models, although this goal is not explicitly stated in the paper till the second page, the authors refer to the recent paper by Zhao et al. (2015), discussed on the ‘Og just as recently, which applies our SAME methodology. Since the paper is mostly computational (and submitted to ICLR 2016, which takes place juuust before AISTATS 2016), I do not have much to comment about it. Except to notice that the authors mention our paper as “Technical report, Statistics and Computing, 2002”. I am not sure the editor of Statistics and Computing will appreciate! The proper reference is in Statistics and Computing, 12:77-84, 2002.

“We argue that SAME is beneficial for Gibbs sampling because it helps to reduce excess variance.”

Still, I am a wee bit surprised at both the above statement and at the comparison with a JAGS implementation. Because SAME augments the number of latent vectors as the number of iterations increases, so should be slower by a mere curse of dimension,, slower than a regular Gibbs with a single latent vector. And because I do not get either the connection with JAGS: SAME could be programmed in JAGS, couldn’t it? If the authors means a regular Gibbs sampler with no latent vector augmentation, the comparison makes little sense as one algorithm aims at the MAP (with a modest five replicas), while the other encompasses the complete posterior distribution. But this sounds unlikely when considering that the larger the number m of replicas the better their alternative to JAGS. It would thus be interesting to understand what the authors mean by JAGS in this setup!

a simulated annealing approach to Bayesian inference

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , , , on October 1, 2015 by xi'an

A misleading title if any! Carlos Albert arXived a paper with this title this morning and I rushed to read it. Because it sounded like Bayesian analysis could be expressed as a special form of simulated annealing. But it happens to be a rather technical sequel [“that complies with physics standards”] to another paper I had missed, A simulated annealing approach to ABC, by Carlos Albert, Hans Künsch, and Andreas Scheidegger. Paper that appeared in Statistics and Computing last year, and which is most interesting!

“These update steps are associated with a flow of entropy from the system (the ensemble of particles in the product space of parameters and outputs) to the environment. Part of this flow is due to the decrease of entropy in the system when it transforms from the prior to the posterior state and constitutes the well-invested part of computation. Since the process happens in finite time, inevitably, additional entropy is produced. This entropy production is used as a measure of the wasted computation and minimized, as previously suggested for adaptive simulated annealing” (p.3)

The notion behind this simulated annealing intrusion into the ABC world is that the choice of the tolerance can be adapted along iterations according to a simulated annealing schedule. Both papers make use of thermodynamics notions that are completely foreign to me, like endoreversibility, but aim at minimising the “entropy production of the system, which is a measure for the waste of computation”. The central innovation is to introduce an augmented target on (θ,x) that is

f(x|θ)π(θ)exp{-ρ(x,y)/ε},

where ε is the tolerance, while ρ(x,y) is a measure of distance to the actual observations, and to treat ε as an annealing temperature. In an ABC-MCMC implementation, the acceptance probability of a random walk proposal (θ’,x’) is then

exp{ρ(x,y)/ε-ρ(x’,y)/ε}∧1.

Under some regularity constraints, the sequence of targets converges to

π(θ|y)exp{-ρ(x,y)},

if ε decreases slowly enough to zero. While the representation of ABC-MCMC through kernels other than the Heaviside function can be found in the earlier ABC literature, the embedding of tolerance updating within the modern theory of simulated annealing is rather exciting.

Furthermore, we will present an adaptive schedule that attempts convergence to the correct posterior while minimizing the required simulations from the likelihood. Both the jump distribution in parameter space and the tolerance are adapted using mean fields of the ensemble.” (p.2)

What I cannot infer from a rather quick perusal of the papers is whether or not the implementation gets into the way of the all-inclusive theory. For instance, how can the Markov chain keep moving as the tolerance gets to zero? Even with a particle population and a sequential Monte Carlo implementation, it is unclear why the proposal scale factor [as in equation (34)] does not collapse to zero in order to ensure a non-zero acceptance rate. In the published paper, the authors used the same toy mixture example as ours [from Sisson et al., 2007], where we earned the award of the “incredibly ugly squalid picture”, with improvements in the effective sample size, but this remains a toy example. (Hopefully a post to be continued in more depth…)