## SMC on a sequence of increasing dimension targets

Posted in Statistics with tags , , , , , , , , , on February 15, 2017 by xi'an

Richard Everitt and co-authors have arXived a preliminary version of a paper entitled Sequential Bayesian inference for mixture models and the coalescent using sequential Monte Carlo samplers with transformations. The central notion is an SMC version of the Carlin & Chib (1995) completion in the comparison of models in different dimensions. Namely to create auxiliary variables for each model in such a way that the dimension of the completed models are all the same. (Reversible jump MCMC à la Peter Green (1995) can also be interpreted this way, even though only relevant bits of the completion are used in the transitions.) I find the paper and the topic most interesting if only because it relates to earlier papers of us on population Monte Carlo. It also brought to my awareness the paper by Karagiannis and Andrieu (2013) on annealed reversible jump MCMC that I had missed at the time it appeared. The current paper exploits this annealed expansion in the devising of the moves. (Sequential Monte Carlo on a sequence of models with increasing dimension has been studied in the past.)

The way the SMC is described in the paper, namely, reweight-subsample-move, does not strike me as the most efficient as I would try to instead move-reweight-subsample, using a relevant move that incorporate the new model and hence enhance the chances of not rejecting.

One central application of the paper is mixture models with an unknown number of components. The SMC approach applied to this problem means creating a new component at each iteration t and moving the existing particles after adding the parameters of the new component. Since using the prior for this new part is unlikely to be at all efficient, a split move as in Richardson and Green (1997) can be considered, which brings back the dreaded Jacobian of RJMCMC into the picture! Here comes an interesting caveat of the method, namely that the split move forces a choice of the split component of the mixture. However, this does not appear as a strong difficulty, solved in the paper by auxiliary [index] variables, but possibly better solved by a mixture representation of the proposal, as in our PMC [population Monte Carlo] papers. Which also develop a family of SMC algorithms, incidentally. We found there that using a mixture representation of the proposal achieves a provable variance reduction.

“This puts a requirement on TSMC that the single transition it makes must be successful.”

As pointed by the authors, the transformation SMC they develop faces the drawback that a given model is only explored once in the algorithm, when moving to the next model. On principle, there would be nothing wrong in including regret steps, retracing earlier models in the light of the current one, since each step is an importance sampling step valid on its own right. But SMC also offers a natural albeit potentially high-varianced approximation to the marginal likelihood, which is quite appealing when comparing with an MCMC outcome. However, it would have been nice to see a comparison with alternative estimates of the marginal in the case of mixtures of distributions. I also wonder at the comparative performances of a dual approach that would be sequential in the number of observations as well, as in Chopin (2004) or our first population Monte Carlo paper (Cappé et al., 2005), since subsamples lead to tempered versions of the target and hence facilitate moves between models, being associated with flatter likelihoods.

## a knapsack riddle?

Posted in Books, pictures, R, Statistics, Travel with tags , , , , , , on February 13, 2017 by xi'an

The [then current now past] riddle of the week is a sort of multiarmed bandits optimisation. Of sorts. Or rather a generalised knapsack problem. The question is about optimising the allocation of 100 undistinguishable units to 10 distinct boxes against a similarly endowed adversary, when the loss function is

$L(x,y)=(x_1>y_1)-(x_1y_{10})-(x_{10}

and the distribution q of the adversary is unknown. As usual (!), the phrasing of the riddle is somewhat ambiguous but I am under the impression that the game is played sequentially, hence that one can learn about the distribution of the adversary, at least when assuming this adversary keeps the same distribution q at all times. Continue reading

Posted in Books, Statistics, University life with tags , , , , , , , , , , on October 27, 2016 by xi'an

In the March 2016 issue of JASA that currently sits on my desk, there is a paper by Liang, Jim, Song and Liu on the adaptive exchange algorithm, which aims at handling posteriors for sampling distributions with intractable normalising constants. The concept behind the algorithm is the exchange principle initiated by Jesper Møller and co-authors in 2006, where an auxiliary pseudo-observation is simulated for the missing constants to vanish in a Metropolis-Hastings ratio. (The name exchangeable was introduced in a subsequent paper by Iain Murray, Zoubin Ghahramani and David MacKay, also in 2006.)

The crux of the method is to run an iteration as [where y denotes the observation]

1. Proposing a new value θ’ of the parameter from a proposal q(θ’|θ);
2. Generate a pseudo-observation z~ƒ(z|θ’);
3. Accept with probability

$\dfrac{\pi(\theta')f(y|\theta')}{\pi(\theta)f(y|\theta)}\dfrac{q(\theta|\theta')f(z|\theta)}{q(\theta'|\theta)f(z|\theta')}$

which has the appeal to cancel all normalising constants. And the repeal of requiring an exact simulation from the very distribution with the missing constant, ƒ(.|θ). Which means that in practice a finite number of MCMC steps will be used and will bias the outcome. The algorithm is unusual in that it replaces the exact proposal q(θ’|θ) with an unbiased random version q(θ’|θ)ƒ(z|θ’), z being just an augmentation of the proposal. (The current JASA paper by Liang et al. seems to confuse augment and argument, see p.378.)

To avoid the difficulty in simulating from ƒ(.|θ), the authors draw pseudo-observations from sampling distributions with a finite number m of parameter values under the [unrealistic] assumption (A⁰) that this collection of values provides an almost complete cover of the posterior support. One of the tricks stands with an auxiliary [time-heterogeneous] chain of pseudo-observations generated by single Metropolis steps from one of these m fixed targets. These pseudo-observations are then used in the main (or target) chain to define the above exchange probability. The auxiliary chain is Markov but time-heterogeneous since the probabilities of accepting a move are evolving with time according to a simulated annealing schedule. Which produces a convergent estimate of the m normalising constants. The main chain is not Markov in that it depends on the whole history of the auxiliary chain [see Step 5, p.380]. Even jointly the collection of both chains is not Markov. The paper prefers to consider the process as an adaptive Markov chain. I did not check the rather intricate in details, so cannot judge of the validity of the overall algorithm; I simply note that one condition (A², p.383) is incredibly strong in that it assumes the Markov transition kernel to be Doeblin uniformly on any compact set of the calibration parameters. However, the major difficulty with this approach seems to be in its delicate calibration. From providing a reference set of m parameter values scanning the posterior support to picking transition kernels on both the parameter and the sample spaces, to properly cooling the annealing schedule [always a fun part!], there seems to be [from my armchair expert’s perspective, of course!] a wide range of opportunities for missing the target or running into zero acceptance problems. Both examples analysed in the paper, the auto-logistic and the auto-normal models, are actually of limited complexity in that they depend on a few parameters, 2 and 4 resp., and enjoy sufficient statistics, of dimensions 2 and 4 as well. Hence simulating (pseudo-)realisations of those sufficient statistics should be less challenging than the original approach replicating an entire vector of thousands of dimensions.

## Le Monde puzzle [#977]

Posted in Books, Kids, pictures, Statistics, Travel, University life with tags , , , , , on October 3, 2016 by xi'an

A mild arithmetic Le Monde mathematical puzzle:

Find the optimal permutation of {1,2,..,15} towards minimising the maximum of sum of all three  consecutive numbers, including the sums of the 14th, 15th, and first numbers, as well as the 15th, 1st and 2nd numbers.

If once again opted for a lazy solution, not even considering simulated annealing!,

parme=sample(15)
max(apply(matrix(c(parme,parme[-1],
parme[1],parme[-(1:2)],parme[1:2]),3),2,sum))


and got the minimal value of 26 for the permutation

14 9 2 15 7 1 11 10 4 12 8 5 13 6 3

Le Monde gave a solution with value 25, though, which is

11 9 7 5 13 8 2 10 14 6 1 12 15 4 3

but there is a genuine mistake in the solution!! This anyway shows that brute force may sometimes fail. (Which sounds like a positive conclusion to failing to find the proper solution! But trying with a simple simulated annealing version did not produce any 25 either…)

## a Simpson paradox of sorts

Posted in Books, Kids, pictures, R with tags , , , , , , , , , on May 6, 2016 by xi'an

The riddle from The Riddler this week is about finding an undirected graph with N nodes and no isolated node such that the number of nodes with more connections than the average of their neighbours is maximal. A representation of a connected graph is through a matrix X of zeros and ones, on which one can spot the nodes satisfying the above condition as the positive entries of the vector (X1)^2-(X^21), if 1 denotes the vector of ones. I thus wrote an R code aiming at optimising this target

targe <- function(F){
sum(F%*%F%*%rep(1,N)/(F%*%rep(1,N))^2<1)}


by mere simulated annealing:

rate <- function(N){
# generate matrix F
# 1. no single
F=matrix(0,N,N)
F[sample(2:N,1),1]=1
F[1,]=F[,1]
for (i in 2:(N-1)){
if (sum(F[,i])==0)
F[sample((i+1):N,1),i]=1
F[i,]=F[,i]}
if (sum(F[,N])==0)
F[sample(1:(N-1),1),N]=1
F[N,]=F[,N]
# 2. more connections
F[lower.tri(F)]=F[lower.tri(F)]+
sample(0:1,N*(N-1)/2,rep=TRUE,prob=c(N,1))
F[F>1]=1
F[upper.tri(F)]=t(F)[upper.tri(t(F))]
#simulated annealing
T=1e4
temp=N
targo=targe(F)
for (t in 1:T){
#1. local proposal
nod=sample(1:N,2)
prop=F
prop[nod[1],nod[2]]=prop[nod[2],nod[1]]=
1-prop[nod[1],nod[2]]
while (min(prop%*%rep(1,N))==0){
nod=sample(1:N,2)
prop=F
prop[nod[1],nod[2]]=prop[nod[2],nod[1]]=
1-prop[nod[1],nod[2]]}
target=targe(prop)
if (log(runif(1))*temp<target-targo){
F=prop;targo=target}
#2. global proposal
prop=F prop[lower.tri(prop)]=F[lower.tri(prop)]+
sample(c(0,1),N*(N-1)/2,rep=TRUE,prob=c(N,1))
prop[prop>1]=1
prop[upper.tri(prop)]=t(prop)[upper.tri(t(prop))]
target=targe(prop)
if (log(runif(1))*temp<target-targo){
F=prop;targo=target}
temp=temp*.999
}
return(F)}


This code returns quite consistently (modulo the simulated annealing uncertainty, which grows with N) the answer N-2 as the number of entries above average! Which is rather surprising in a Simpson-like manner since all entries but two are above average. (Incidentally, I found out that Edward Simpson recently wrote a paper in Significance about the Simpson-Yule paradox and him being a member of the Bletchley Park Enigma team. I must have missed out the connection with the Simpson paradox when reading the paper in the first place…)

## Le Monde puzzle [#945]

Posted in Books, Kids, pictures, Statistics, Travel, University life with tags , , , , on January 25, 2016 by xi'an

A rather different Le Monde mathematical puzzle:

A two-person game is played on an nxn grid filled with zero. Each player pick an entry, which is increased by one as well as all adjacent entries. The game stops when all entries are equal. For n=3,4,5, what are the possible grids with identical values all over?

If I define an R neighbourhood function

 nighbo=function(i,n){
neigh=i
if (i%%n!=1) neigh=c(i-1,neigh)
if (i%%n>0) neigh=c(i+1,neigh)
if (i+n<=n*n) neigh=c(i+n,neigh)
if (i-n>0) neigh=c(i-n,neigh)
return(neigh)}


and try a brute force filling of the grid

while ((min(grid)==0)||(length(unique(grid))>1)){
ent=sample(1:(n*n),1,prob=1/as.vector(grid+1)^10)
grid[nighbo(ent,n)]=grid[nighbo(ent,n)]+1}


the loop never stops. When thinking of the case n=3 [while running in the early morning], I wondered whether or not reaching an equal value on all entries was at all possible. Indeed, it is impossible to update one of the four corners without updating at least one of the neighbours, while the converse is false. Experimenting further with simulated annealing to optimise the probabilities of picking different entries in the table when n=4,5 seems to indicate this is also the case for larger values of n, in that all attempts lead to larger values of neighbours to the four corners :

outer=c(1,n,n*n,n*n-n+1)
border=sort(unique(c(2:(n-1),(n*n-n+2):(n*n-1),1+n*(1:(n-2)),n+n*(1:(n-2)))))
inner=(1:(n*n))[-c(outer,border)]
#
target=function(weit){
grid=matrix(0,n,n)
for (t in 1:1e4){
cas=sample(1:3,1,prob=weit)
if (cas==1) ent=sample(outer,1,prob=max(grid[outer])-grid[outer]+1)
if (cas==2) ent=sample(border,1,prob=max(grid[border])-grid[border]+1)
if (cas==3) ent=sample(inner,1,prob=max(grid[inner])-grid[inner]+1)
ent=nighbo(ent,n)
grid[ent]=grid[ent]+1}
ave=c(mean(grid[outer]),mean(grid[border]),mean(grid[inner]))
return(list(dive=max(diff(sort(ave))),ave=ave,ava=mean(grid)))
}
#
weit=rep(1,3)
cur=target(weit)
T=100
while (cur$dive>0){ ind=sample(1:3,1,prob=1/cur$ave)
peit=weit
peit[ind]=weit[ind]-(cur$ave[ind]-cur$ava)*runif(1)/(cur$ava) while(peit[ind]<0) peit[ind]=weit[ind]-(cur$ave[ind]-cur$ava)*runif(1)/(cur$ava)
prop=target(peit)
if (log(runif(1))*1e4/T<prop$dive-cur$dive){
weit=peit;cur=prop}
T=T*1.00001
print(cur\$ave)}


## convergence for non-Markovian simulated AAs

Posted in Books, pictures, Statistics with tags , , , on December 24, 2015 by xi'an

Mathieu Gerber (formerly CREST) and Luke Bornn have arXived a paper on the almost sure convergence of simulated annealing algorithms when using a non-Markovian sequence that can be in the limiting case completely deterministic and hence use quasi-Monte Carlo sequences. The paper extends the earlier Gerber and Bornn (2015) that I missed. While the paper is highly technical, it shows that under some conditions a sequence of time-varying kernels can be used to reach the maximum of an objective function. With my limited experience with simulated annealing I find this notion of non-iid or even non-random both worth investigating and somewhat unsurprising from a practitioner’s view in that modifying a standard simulated annealing algorithm with steps depending on the entire past of the sequence usually produces better performances.