Archive for simulated annealing

non-Bayesian decision riddle

Posted in Statistics with tags , , , on June 22, 2017 by xi'an

As a continuation of the Bayesian resolution of last week's riddle, I looked at a numerical resolution of the four secretaries problem while in the train back from Rouen (and trying to block out the chatter of my neighbours, a nuisance I find myself more and more sensitive to!). The target function is defined as

gainz=function(b,c,T=1e4,type="raw"){
  # T rows of four U(0,1) draws, one row per repetition of the game
  x=matrix(runif(4*T),ncol=4)
  # running maxima from the left (maz) and from the right (zam)
  maz=t(apply(x,1,cummax))
  zam=t(apply(x[,4:1],1,cummax))
  # raw: average of the selected value relative to the overall maximum
  if (type=="raw"){return(mean(
   ((x[,2]>b*x[,1])*x[,2]+
    (x[,2]<b*x[,1])*((x[,3]>c*maz[,2])*x[,3]+
        (x[,3]<c*maz[,2])*x[,4]))/maz[,4]))}
  # global: frequency of selecting the overall maximum
  if (type=="global"){return(mean(
    ((x[,2]>b*x[,1])*(x[,2]==maz[,4])+
     (x[,2]<b*x[,1])*((x[,3]>c*maz[,2])*(x[,3]==maz[,4])+
         (x[,3]<c*maz[,2])*(x[,4]==maz[,4])))))}
  # remain: frequency of selecting the best of the draws still to come
  if (type=="remain"){return(mean(
    ((x[,2]>b*x[,1])*(x[,2]==zam[,3])+
     (x[,2]<b*x[,1])*((x[,3]>c*maz[,2])*(x[,3]==zam[,2])+
          (x[,3]<c*maz[,2])*(x[,4]==zam[,2])))))}}

where the data is generated from a U(0,1) distribution, as the loss functions are made scale free by deciding to always sacrifice the first draw, x¹. This function is to be optimised in (b,c), and hence I used a plain vanilla simulated annealing R code:

avemale=function(T=3e4,type){
  # plain simulated annealing over (b,c), with the temperature doubling
  # as the scale of the uniform random-walk proposal
  b=c=bs=cs=.5
  maxtar=targe=gainz(b,c,T=1e4,type)
  temp=0.1
  for (t in 1:T){
    bp=b+runif(1,-temp,temp)
    cp=c+runif(1,-temp,temp)
    # proposals with a negative coordinate get a zero target value
    parge=(bp>0)*(cp>0)*gainz(bp,cp,T=1e4,type)
    if (parge>maxtar){
      b=bs=bp;c=cs=cp;maxtar=targe=parge}else{
    if (runif(1)<exp((parge-targe)/temp)){
      b=bp;c=cp;targe=parge}}
    # geometric cooling
    temp=.9999*temp}
  return(list(bs=bs,cs=cs,max=maxtar))}

with outcomes

  • b=1, c=.5, and optimum 0.8 for the raw type
  • b=c=1 and optimum 0.45 for the global type
  • b undefined, c=2/3 and optimum 0.75 for the remain type
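
For completeness, a minimal way to reproduce these figures, assuming the two functions above are loaded (the exact optima fluctuate with the Monte Carlo noise):

avemale(type="raw")     # b close to 1, c close to .5, maximum near 0.8
avemale(type="global")  # b and c close to 1, maximum near 0.45
avemale(type="remain")  # c close to 2/3, maximum near 0.75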

gerrymandering detection by MCMC

Posted in Books, Statistics with tags , , , , , , , on June 16, 2017 by xi'an

In the latest issue of Nature I read (June 8), there is a rather long feature article on mathematical (and statistical) ways of measuring gerrymandering, that is, the manipulation of the delimitations of voting districts towards improving the chances of a certain party. (The name comes from Elbridge Gerry (1812) and the salamander shape of the district he created.) The difficulty covered by the article is about detecting gerrymandering, which leads to the challenging and almost philosophical question of defining a “fair” partition of a region into voting districts when those are not geographically induced, especially since each partition on its own breaks neither the principle of “one person, one vote” nor majority rule. Having a candidate or party win at the global level and lose at every local level seems to go against this majority rule, but with electoral systems like the one in the US, this frequently happens (with dire consequences in the latest elections). Just another illustration of Simpson’s paradox, essentially. And a damning drawback of multi-tiered electoral systems.

“In order to change the district boundaries, we use a Markov Chain Monte Carlo algorithm to produce about 24,000 random but reasonable redistrictings.”

In the arXiv paper that led to this Nature article (along with other studies), Bangia et al. essentially construct a tail probability to assess how extreme the current district partition is against a theoretical distribution of such partitions, finding that the actual redistrictings of 2012 and 2016 in North Carolina are “extremely atypical”. (The generation of random partitions obeyed four rules, namely equal population, geographic compactness and connectedness, proximity to county boundaries, and a majority of Afro-American voters in at least two districts, the last being a requirement in North Carolina. A score function was built as a linear combination of four corresponding scores, mostly χ²-like, and turned into a density, simulated annealing style.) The determination of the final temperature β=1 (p.18) [or equivalently of the weights (p.20)] remains unclear to me. As does the use of more than 10⁵ simulated annealing iterations to produce a single partition (p.18)…
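
To fix ideas, here is a loose R illustration of that construction; the names and inputs are mine and not the paper’s: a composite score turned into an unnormalised Gibbs-type density, plus a one-sided tail probability locating the enacted plan within the simulated ensemble of plans.

# combine the four sub-scores of a plan into a single energy (weights w assumed given)
energy=function(subscores,w=rep(1,4)) sum(w*subscores)
# unnormalised Gibbs-type density favouring low-energy, i.e. "reasonable", plans
gweight=function(J,beta=1) exp(-beta*J)
# tail probability of the enacted plan's summary (e.g., seats won) among sampled plans
tailprob=function(obs,sims) mean(sims>=obs)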

From a broader perspective, agreeing on a method to produce random district allocations could be the way to go towards solving the judicial dilemma in setting new voting maps, as is currently under discussion in the US.

SMC on a sequence of increasing dimension targets

Posted in Statistics with tags , , , , , , , , , on February 15, 2017 by xi'an

Richard Everitt and co-authors have arXived a preliminary version of a paper entitled Sequential Bayesian inference for mixture models and the coalescent using sequential Monte Carlo samplers with transformations. The central notion is an SMC version of the Carlin & Chib (1995) completion for the comparison of models of different dimensions, namely creating auxiliary variables for each model in such a way that the dimensions of the completed models are all the same. (Reversible jump MCMC à la Peter Green (1995) can also be interpreted this way, even though only relevant bits of the completion are used in the transitions.) I find the paper and the topic most interesting, if only because it relates to our earlier papers on population Monte Carlo. It also brought to my awareness the paper by Karagiannis and Andrieu (2013) on annealed reversible jump MCMC that I had missed at the time it appeared. The current paper exploits this annealed expansion in the devising of the moves. (Sequential Monte Carlo on a sequence of models with increasing dimension has been studied in the past.)

The way the SMC is described in the paper, namely reweight-subsample-move, does not strike me as the most efficient, as I would instead try move-reweight-subsample, using a relevant move that incorporates the new model and hence enhances the chances of not rejecting.
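
For concreteness, here is a self-contained toy sketch of the reweight-subsample-move ordering on a sequence of nested standard normal targets of increasing dimension; this is my own illustration under these simplifying assumptions, not the algorithm of the paper.

toy_smc=function(M=1e3,Tmax=5){
  x=matrix(rnorm(M),nrow=M)                 # particles for the first (1-d) target
  for (t in 2:Tmax){
    x=cbind(x,rnorm(M,sd=2))                # extend dimension with a wide proposal
    logw=dnorm(x[,t],log=TRUE)-
         dnorm(x[,t],sd=2,log=TRUE)         # reweight: new target over extension proposal
    w=exp(logw-max(logw));w=w/sum(w)
    x=x[sample(1:M,M,replace=TRUE,prob=w),,drop=FALSE]   # subsample
    prop=x+matrix(rnorm(M*t,sd=.5),nrow=M)               # move: random-walk Metropolis
    acc=log(runif(M))<rowSums(dnorm(prop,log=TRUE)-dnorm(x,log=TRUE))
    x[acc,]=prop[acc,]
  }
  x}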

One central application of the paper is mixture models with an unknown number of components. The SMC approach applied to this problem means creating a new component at each iteration t and moving the existing particles after adding the parameters of the new component. Since using the prior for this new part is unlikely to be at all efficient, a split move as in Richardson and Green (1997) can be considered, which brings back the dreaded Jacobian of RJMCMC into the picture! Here comes an interesting caveat of the method, namely that the split move forces a choice of the component to be split. However, this does not appear to be a strong difficulty: it is solved in the paper by auxiliary [index] variables, but possibly better solved by a mixture representation of the proposal, as in our PMC [population Monte Carlo] papers. Which also develop a family of SMC algorithms, incidentally. We found there that using a mixture representation of the proposal achieves a provable variance reduction.

“This puts a requirement on TSMC that the single transition it makes must be successful.”

As pointed out by the authors, the transformation SMC they develop faces the drawback that a given model is only explored once in the algorithm, when moving to the next model. In principle, there would be nothing wrong with including regret steps, retracing earlier models in the light of the current one, since each step is an importance sampling step valid in its own right. But SMC also offers a natural albeit potentially high-variance approximation to the marginal likelihood, which is quite appealing when comparing with an MCMC outcome. However, it would have been nice to see a comparison with alternative estimates of the marginal in the case of mixtures of distributions. I also wonder at the comparative performances of a dual approach that would be sequential in the number of observations as well, as in Chopin (2004) or our first population Monte Carlo paper (Cappé et al., 2005), since subsamples lead to tempered versions of the target and hence facilitate moves between models, being associated with flatter likelihoods.

a knapsack riddle?

Posted in Books, pictures, R, Statistics, Travel with tags , , , , , , on February 13, 2017 by xi'an

The [then current now past] riddle of the week is a sort of multi-armed bandit optimisation. Of sorts. Or rather a generalised knapsack problem. The question is about optimising the allocation of 100 indistinguishable units to 10 distinct boxes against a similarly endowed adversary, when the loss function is

L(x,y)=\sum_{i=1}^{10} i\,\big[\mathbb{I}(x_i>y_i)-\mathbb{I}(x_i<y_i)\big]

and the distribution q of the adversary is unknown. As usual (!), the phrasing of the riddle is somewhat ambiguous but I am under the impression that the game is played sequentially, hence that one can learn about the distribution of the adversary, at least when assuming this adversary keeps the same distribution q at all times. Continue reading
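
For reference, the loss above translates directly into R; a minimal sketch, with x and y the two allocation vectors of ten non-negative integers summing to 100:

# loss of allocation x against allocation y, box i weighted by i
loss=function(x,y) sum((1:10)*((x>y)-(x<y)))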

adaptive exchange

Posted in Books, Statistics, University life with tags , , , , , , , , , , on October 27, 2016 by xi'an

In the March 2016 issue of JASA that currently sits on my desk, there is a paper by Liang, Jin, Song and Liu on the adaptive exchange algorithm, which aims at handling posteriors for sampling distributions with intractable normalising constants. The concept behind the algorithm is the exchange principle initiated by Jesper Møller and co-authors in 2006, where an auxiliary pseudo-observation is simulated so that the missing constants vanish from the Metropolis-Hastings ratio. (The name exchange algorithm was introduced in a subsequent paper by Iain Murray, Zoubin Ghahramani and David MacKay, also in 2006.)

The crux of the method is to run an iteration as follows [where y denotes the observation]:

  1. Propose a new value θ’ of the parameter from a proposal q(θ’|θ);
  2. Generate a pseudo-observation z~ƒ(z|θ’);
  3. Accept with probability

\dfrac{\pi(\theta')f(y|\theta')}{\pi(\theta)f(y|\theta)}\dfrac{q(\theta|\theta')f(z|\theta)}{q(\theta'|\theta)f(z|\theta')}

which has the appeal of cancelling all normalising constants, and the repeal of requiring an exact simulation from the very distribution with the missing constant, ƒ(.|θ). Which means that in practice a finite number of MCMC steps will be used and will bias the outcome. The algorithm is unusual in that it replaces the exact proposal q(θ’|θ) with an unbiased random version q(θ’|θ)ƒ(z|θ’), z being just an augmentation of the proposal. (The current JASA paper by Liang et al. seems to confuse augment and argument, see p.378.)
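
As a concrete reading of steps 1–3 above, here is a bare-bones R sketch of one exchange iteration; prior(), lik(), rprop(), dprop() and the exact sampler rlik() are placeholders I introduce for illustration, not functions from the paper, and lik() may be unnormalised since its constant cancels.

exchange_step=function(theta,y,prior,lik,rprop,dprop,rlik){
  thetap=rprop(theta)                          # 1. propose theta' ~ q(.|theta)
  z=rlik(thetap)                               # 2. pseudo-observation z ~ f(.|theta')
  logr=log(prior(thetap))+log(lik(y,thetap))-  # 3. the displayed ratio, on the
       log(prior(theta))-log(lik(y,theta))+    #    log scale; the normalising
       log(dprop(theta,thetap))-log(dprop(thetap,theta))+
       log(lik(z,theta))-log(lik(z,thetap))    #    constants of f cancel out
  if (log(runif(1))<logr) thetap else theta}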

To avoid the difficulty in simulating from ƒ(.|θ), the authors draw pseudo-observations from sampling distributions with a finite number m of parameter values, under the [unrealistic] assumption (A⁰) that this collection of values provides an almost complete cover of the posterior support. One of the tricks relies on an auxiliary [time-heterogeneous] chain of pseudo-observations generated by single Metropolis steps from one of these m fixed targets. These pseudo-observations are then used in the main (or target) chain to define the above exchange probability. The auxiliary chain is Markov but time-heterogeneous since the probabilities of accepting a move evolve with time according to a simulated annealing schedule. Which produces a convergent estimate of the m normalising constants. The main chain is not Markov in that it depends on the whole history of the auxiliary chain [see Step 5, p.380]. Even jointly the collection of both chains is not Markov. The paper prefers to consider the process as an adaptive Markov chain. I did not check the rather intricate details, so cannot judge the validity of the overall algorithm; I simply note that one condition (A², p.383) is incredibly strong in that it assumes the Markov transition kernel to be Doeblin uniformly on any compact set of the calibration parameters. However, the major difficulty with this approach seems to be in its delicate calibration. From providing a reference set of m parameter values scanning the posterior support, to picking transition kernels on both the parameter and the sample spaces, to properly cooling the annealing schedule [always a fun part!], there seems to be [from my armchair expert’s perspective, of course!] a wide range of opportunities for missing the target or running into zero acceptance problems. Both examples analysed in the paper, the auto-logistic and the auto-normal models, are actually of limited complexity in that they depend on a few parameters, 2 and 4 resp., and enjoy sufficient statistics, of dimensions 2 and 4 as well. Hence simulating (pseudo-)realisations of those sufficient statistics should be less challenging than in the original approach of replicating an entire vector of thousands of dimensions.

Le Monde puzzle [#977]

Posted in Books, Kids, pictures, Statistics, Travel, University life with tags , , , , , on October 3, 2016 by xi'an

A mild arithmetic Le Monde mathematical puzzle:

Find the optimal permutation of {1,2,…,15} minimising the maximum of the sums of all three consecutive numbers, including the sums of the 14th, 15th, and 1st numbers, and of the 15th, 1st and 2nd numbers.

I once again opted for a lazy solution, not even considering simulated annealing!,

parme=sample(15)
# the 15 columns of the 3x15 matrix below list all cyclic windows of three
# consecutive entries, so the maximal column sum is the puzzle criterion
max(apply(matrix(c(parme,parme[-1],parme[1],
                   parme[-(1:2)],parme[1:2]),3),2,sum))

and got the minimal value of 26 for the permutation

14 9 2 15 7 1 11 10 4 12 8 5 13 6 3

Le Monde gave a solution with value 25, though, which is

11 9 7 5 13 8 2 10 14 6 1 12 15 4 3

but there is a genuine mistake in this solution, since its first three terms already sum to 27!! This anyway shows that brute force may sometimes fail. (Which sounds like a positive conclusion to failing to find the proper solution! But trying with a simple simulated annealing version did not produce any 25 either…)
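
For the record, here is a minimal sketch of the kind of simulated annealing attempt mentioned above, based on random pairwise swaps and a geometric cooling schedule (my own quick code, with no guarantee of reaching the optimum):

# maximal sum over the 15 cyclic windows of three consecutive entries
cycmax=function(p) max(sapply(1:15,function(i) sum(p[((i:(i+2))-1)%%15+1])))

anneal977=function(T=1e5,temp=10,rho=.9999){
  p=sample(15);cur=best=cycmax(p);bestp=p
  for (t in 1:T){
    q=p;sw=sample(15,2);q[sw]=q[rev(sw)]   # swap two entries
    val=cycmax(q)
    if (val<cur||runif(1)<exp((cur-val)/temp)){p=q;cur=val}
    if (cur<best){best=cur;bestp=p}
    temp=rho*temp}
  list(perm=bestp,max=best)}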

a Simpson paradox of sorts

Posted in Books, Kids, pictures, R with tags , , , , , , , , , on May 6, 2016 by xi'an

The riddle from The Riddler this week is about finding an undirected graph with N nodes and no isolated node such that the number of nodes with more connections than the average of their neighbours is maximal. A representation of a connected graph is through a matrix X of zeros and ones, on which one can spot the nodes satisfying the above condition as the positive entries of the vector (X1)²−(X²1), if 1 denotes the vector of ones. I thus wrote an R code aiming at optimising this target

targe <- function(F){
  # number of nodes whose degree exceeds the average degree of their neighbours
  N=nrow(F)
  sum(F%*%F%*%rep(1,N)/(F%*%rep(1,N))^2<1)}

by mere simulated annealing:

rate <- function(N){ 
# generate adjacency matrix F
# 1. make sure no node is isolated
F=matrix(0,N,N) 
F[sample(2:N,1),1]=1 
F[1,]=F[,1] 
for (i in 2:(N-1)){ 
if (sum(F[,i])==0) 
F[sample((i+1):N,1),i]=1 
F[i,]=F[,i]} 
if (sum(F[,N])==0) 
F[sample(1:(N-1),1),N]=1 
F[N,]=F[,N] 
# 2. add random extra connections
F[lower.tri(F)]=F[lower.tri(F)]+
  sample(0:1,N*(N-1)/2,rep=TRUE,prob=c(N,1)) 
F[F>1]=1
F[upper.tri(F)]=t(F)[upper.tri(t(F))]
#simulated annealing
T=1e4
temp=N
targo=targe(F)
for (t in 1:T){
  #1. local proposal
  nod=sample(1:N,2)
  prop=F
  prop[nod[1],nod[2]]=prop[nod[2],nod[1]]=
     1-prop[nod[1],nod[2]]
  while (min(prop%*%rep(1,N))==0){
    nod=sample(1:N,2)
    prop=F
    prop[nod[1],nod[2]]=prop[nod[2],nod[1]]=
     1-prop[nod[1],nod[2]]}
  target=targe(prop)
  if (log(runif(1))*temp<target-targo){ 
    F=prop;targo=target} 
  #2. global proposal
  prop=F
  prop[lower.tri(prop)]=F[lower.tri(prop)]+
    sample(c(0,1),N*(N-1)/2,rep=TRUE,prob=c(N,1))
  prop[prop>1]=1
  prop[upper.tri(prop)]=t(prop)[upper.tri(t(prop))]
  target=targe(prop)
  if (log(runif(1))*temp<target-targo){
      F=prop;targo=target}
   temp=temp*.999
   }
return(F)}
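
Under the assumption that both functions above are in the workspace, a quick check of the claim below (my own addition, not part of the original code) looks like:

N=20
F=rate(N)   # one annealed graph on N nodes
targe(F)    # typically close to N-2 = 18 nodes beating their neighbours' average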

This code returns quite consistently (modulo the simulated annealing uncertainty, which grows with N) the answer N-2 as the number of entries above average! Which is rather surprising in a Simpson-like manner, since all entries but two are above average. (Incidentally, I found out that Edward Simpson recently wrote a paper in Significance about the Simpson-Yule paradox and his being a member of the Bletchley Park Enigma team. I must have missed the connection with the Simpson paradox when reading the paper in the first place…)