Archive for simulated annealing

hierarchical models are not Bayesian models

Posted in Books, Kids, Statistics, University life on February 18, 2015 by xi'an

When preparing my OxWaSP projects a few weeks ago, I chanced upon a set of slides entitled “Hierarchical models are not Bayesian“, written by Brian Dennis (University of Idaho), where the author argues against Bayesian inference in hierarchical models in ecology, much in line with the previously discussed paper of Subhash Lele. The argument is the same, namely a possibly major impact of the prior modelling on the resulting inference, in particular when some parameters are hardly identifiable, all the more so when the model is complex and the parameters numerous. And the claim that, “data cloning” being available since 2007, frequentist methods have “caught up” with Bayesian computational abilities.

Let me remind the reader that “data cloning” means constructing a sequence of Bayes estimators corresponding to the data being duplicated (or cloned) once, twice, &tc., until the point estimator stabilises. Since this corresponds to using increasing powers of the likelihood, the posteriors concentrate more and more around the maximum likelihood estimator, and even allow one to recover the Hessian matrix. This technique is actually older than 2007, since I proposed it in the early 1990’s under the name of prior feedback, with earlier occurrences in the literature like D’Epifanio (1989) and even the discussion of Aitkin (1991). A more efficient version of this approach is the SAME algorithm we developed in 2002 with Arnaud Doucet and Simon Godsill, where the power of the likelihood is increased along iterations in a simulated annealing fashion (with a preliminary version found in Duflo, 1996).
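
To make the concentration explicit, here is a toy illustration (mine, not from the slides): with a Bernoulli sample and a Beta prior, cloning the data K times keeps the posterior within the Beta family, so the effect of the powered likelihood can be followed in closed form.

x=c(1,0,1,1,0,1,1,1)  #toy data, with MLE mean(x)=0.75
a=b=5                 #deliberately strong Beta(5,5) prior
for (K in c(1,2,10,100,1000)){
  al=a+K*sum(x)       #Beta posterior parameters after K clones
  be=b+K*sum(1-x)
  cat("K =",K," mean =",round(al/(al+be),4),
      " sd =",round(sqrt(al*be/((al+be)^2*(al+be+1))),5),"\n")
}

The posterior mean moves from the prior-pulled 0.61 at K=1 towards the MLE 0.75, while K times the posterior variance stabilises, which is how the cloned posteriors recover the (inverse) Hessian.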

I completely agree with the author that a hierarchical model does not have to be Bayesian: when the random parameters in the model are analysed as sources of additional variation, as for instance in animal breeding or ecology, and integrated out, the resulting model can be analysed by any statistical method. Even though one may wonder about the motivations for selecting this particular randomness structure in the model. And about the increasing blurring between what is prior modelling and what is sampling modelling as the number of levels in the hierarchy goes up. This rather amusing set of slides nonetheless misses a few points, in particular the limited ability of data cloning to overcome identifiability and multimodality issues. Indeed, as with all simulated annealing techniques, there is a practical difficulty in avoiding the fatal attraction of a local mode when using MCMC techniques, hence a high chance that data cloning ends up in the “wrong” mode. Moreover, when the likelihood is multimodal, deciding which of the modes is most relevant for inference is a generic difficulty: in which sense is the MLE more objective than a Bayes estimate, then? Further, the impact of a prior on some aspects of the posterior distribution can be tested by re-running the Bayesian analysis with different priors, including empirical Bayes versions or, why not?!, data cloning, in order to understand where and why huge discrepancies occur. This is part of model building, in the end.

ABC by population annealing

Posted in Statistics, University life on January 6, 2015 by xi'an

The paper “Bayesian Parameter Inference and Model Selection by Population Annealing in Systems Biology” by Yohei Murakami was published in PLoS One last August, but I only became aware of it when ResearchGate pointed it out to me [by mentioning that one of our ABC papers was quoted there].

“We are recommended to try a number of annealing schedules to check the influence of the schedules on the simulated data (…) As a whole, the simulations with the posterior parameter ensemble could, not only reproduce the data used for parameter inference, but also capture and predict the data which was not used for parameter inference.”

Population annealing is a notion introduced by Y. Iba, the very same Iba who introduced the notion of population Monte Carlo that we studied in subsequent papers. It reproduces the setting found in many particle filter papers, of a sequence of (annealed, or rather tempered) targets ranging from an easy (i.e., almost flat) target to the genuine target, and of an update of a particle set by MCMC moves and reweighting. I actually have trouble perceiving the difference with other sequential Monte Carlo schemes such as those exposed in Del Moral, Doucet and Jasra (2006, Series B). And the same is true of the ABC extension covered in this paper, where the annealed intermediate targets correspond to larger tolerances. This sounds like a traditional ABC-SMC algorithm, without the adaptive scheme on the tolerance ε found e.g. in Del Moral et al., since the sequence is set in advance. [However, the discussion about the implementation includes the above quote, which suggests a vague form of cross-validated tolerance construction.] The approximation of the marginal likelihood also sounds standard, the marginal being approximated by the proportion of accepted pseudo-samples, or more exactly by the sum of the SMC weights at the end of the annealing simulation. This actually raises several questions: (a) this estimator always lies between 0 and 1, while the marginal likelihood is not so restricted [but this is due to a missing 1/ε factor in the likelihood estimate that cancels from both numerator and denominator]; (b) seeing the kernel as a non-parametric estimate of the likelihood led me to wonder why different ε’s could not be used for different models, in that the pseudo-data used for each model under comparison differs: if we were in a genuine non-parametric setting, the bandwidth would be derived from the pseudo-data.
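
For readers unfamiliar with the scheme, here is a bare-bones R sketch of this type of ABC-SMC algorithm with a tolerance sequence set in advance, in the ABC-PMC flavour of Beaumont et al. (2009) rather than a transcription of the paper’s algorithm; the toy normal mean model and every tuning choice below are mine.

set.seed(1)
n=20
xobs=rnorm(n,mean=2)   #pseudo-observed data, x ~ N(theta,1)
sobs=mean(xobs)        #summary statistic
eps=c(1,.5,.25,.1)     #tolerance schedule fixed in advance
N=1000
psd=sqrt(10)           #prior theta ~ N(0,10)
#first round: plain ABC rejection at the largest tolerance
theta=numeric(N)
for (i in 1:N){
  repeat{
    th=rnorm(1,0,psd)
    if (abs(mean(rnorm(n,th))-sobs)<=eps[1]) break
  }
  theta[i]=th
}
w=rep(1/N,N)
#next rounds: resample, perturb, reweight
for (t in 2:length(eps)){
  tau=sqrt(2*sum(w*(theta-sum(w*theta))^2))  #twice the weighted variance
  nth=numeric(N)
  for (i in 1:N){
    repeat{
      th=rnorm(1,sample(theta,1,prob=w),tau)
      if (abs(mean(rnorm(n,th))-sobs)<=eps[t]) break
    }
    nth[i]=th
  }
  #importance weight: prior over the mixture of perturbation kernels
  nw=dnorm(nth,0,psd)/
     sapply(nth,function(th) sum(w*dnorm(th,theta,tau)))
  theta=nth
  w=nw/sum(nw)
}
cat("ABC posterior mean:",sum(w*theta),"\n")

Replacing the preset ladder eps with, e.g., a quantile of the previous round’s distances would bring this closer to the adaptive scheme of Del Moral et al.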

“Thus, Bayesian model selection by population annealing is valid.”

The discussion about the use of ABC population annealing somewhat misses the point of using ABC, which is to approximate the genuine posterior distribution, to wit the above quote: that the ABC Bayes factors favour the correct model in the simulation says nothing about the degree of approximation with respect to the original Bayes factor. [The issue of non-consistent Bayes factors does not apply here as there is no summary statistic applied to the few observations in the data.] Further, the magnitude of the variability of the values of this Bayes factor as ε varies, from 1.3 to 9.6, mostly indicates that the numerical value is difficult to trust. (I also fail to explain the huge jump in Monte Carlo variability, from 0.09 to 1.17, in Table 1.) That this form of ABC-SMC improves upon the basic ABC rejection approach is clear; however, it needs to build in some self-control to avoid arbitrary calibration steps and to reduce the instability of the final estimates.

“The weighting function is set to be large value when the observed data and the simulated data are ‘‘close’’, small value when they are ‘‘distant’’, and constant when they are ‘‘equal’’.”

The above quote is somewhat surprising, as the estimated likelihood f(xobs|xobs,θ) is naturally constant when xobs=xsim… I also failed to understand how the model enters the indicator function used as the default ABC kernel.

top posts for 2014

Posted in Books, R, Statistics, University life on December 30, 2014 by xi'an

Here are the most popular entries for 2014:

17 equations that changed the World (#2) 995
Le Monde puzzle [website] 992
“simply start over and build something better” 991
accelerating MCMC via parallel predictive prefetching 990
Bayesian p-values 960
posterior predictive p-values 849
Bayesian Data Analysis [BDA3] 846
Bayesian programming [book review] 834
Feller’s shoes and Rasmus’ socks [well, Karl’s actually…] 804
the cartoon introduction to statistics 803
Asymptotically Exact, Embarrassingly Parallel MCMC 730
Foundations of Statistical Algorithms [book review] 707
a brief on naked statistics 704
In{s}a(ne)!! 682
the demise of the Bayes factor 660
Statistical modeling and computation [book review] 591
bridging the gap between machine learning and statistics 587
new laptop with ubuntu 14.04 574
Bayesian Data Analysis [BDA3 – part #2] 570
MCMC on zero measure sets 570
Solution manual to Bayesian Core on-line 567
Nonlinear Time Series just appeared 555
Sudoku via simulated annealing 538
Solution manual for Introducing Monte Carlo Methods with R 535
future of computational statistics 531

What I appreciate in that list is that (a) book reviews [of stats books] get a large chunk (50%!) of the attention and (b) my favourite topics of Bayesian testing, parallel MCMC, and MCMC on zero measure sets made it to the top of the list. Even the demise of the Bayes factor, which was only posted two weeks ago!

SAME but different

Posted in Statistics, University life on October 27, 2014 by xi'an

After several clones of our SAME algorithm appeared in the literature, it is rather fun to see another paper acknowledging the connection! SAME but different was arXived today by Zhao, Jiang and Canny. The point of this short paper is to show that a parallel implementation of SAME achieves efficient performances compared with existing standards. Since the duplicated latent variables are independent [given θ], they can be simulated in parallel. The authors further assume independence between the components of those latent variables, as well as finite support, as in document analysis, so that they can sample the replicated latent variables all at once. Parallelism is thus used solely for the components of the latent variable(s). SAME is normally associated with an annealing schedule, but the authors could not detect an improvement over a fixed and large number of replications; they report gains comparable to state-of-the-art variational Bayes on two large datasets. Quite fun to see SAME getting a new life thanks to computer scientists!
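
For the record, the replication idea fits in a few lines of R on a toy latent-variable model (mine, far simpler than the document-analysis setting of the paper): with x_i~N(z_i,1) and z_i~N(θ,1), the K copies of the latent vector are conditionally independent given θ, hence refreshable in parallel within each Gibbs sweep, and a fixed large K pushes θ towards the marginal MAP.

library(parallel)  #for mclapply (forking, so keep mc.cores=1 on Windows)
set.seed(2)
n=50
x=rnorm(n,mean=3,sd=sqrt(2))  #data from the marginal N(theta,2) model
K=50                          #fixed large number of replications
theta=0
for (i in 1:200){
  #K replicated latent vectors, independent given theta, with
  #z_j | x_j,theta ~ N((x_j+theta)/2,1/2), simulated in parallel
  z=mclapply(1:K,function(k) rnorm(n,(x+theta)/2,sqrt(.5)),mc.cores=2)
  #theta given all K replicas, under a N(0,100) prior
  prec=K*n+1/100
  theta=rnorm(1,sum(unlist(z))/prec,sqrt(1/prec))
}
cat("SAME estimate:",theta,"vs. marginal MAP",sum(x)/(n+.02),"\n")

The paper’s point is that, with finite-support components, this inner simulation step can be vectorised far more aggressively than in the above sketch.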

Le Monde puzzle [#868]

Posted in Books, Kids, Statistics on June 1, 2014 by xi'an

Another permutation-based Le Monde mathematical puzzle:

Given the integers 1,…,n, a “perfect” combination is a pair (i,j) of integers such that no other pair enjoys the same sum. For n=33, what is the maximum number of perfect combinations one can build? And for n=2014?

A rather straightforward problem, or so it seemed: take the pairs (2m,2m+1), their sums all differ, and we get the maximal possible number of sums, ⌊n/2⌋… However, I did not read the question properly (!) and the constraint is on the sum (i+j), namely

How many mutually exclusive pairs (i,j) can be found with different sums all bounded by n=33? n=2014?

In which case the previous and obvious proposal no longer works… The dumb brute-force search

T=10^6   #number of random trials
sol=0
for (t in 1:T){
  #draw an even size n at random in {20,...,32} and a permutation of 1:n
  perm=sample(1:sample(seq(20,32,2),1))
  #pair entry i with entry i+n/2 (rows of a two-column matrix) and
  #count the distinct pair sums below 33
  sal=sum(unique(apply(matrix(perm,ncol=2),1,sum))<33)
  if (sal>sol){
     sol=sal;laperm=perm}
  }

leads to a solution of

> sol
[1] 12
> laperm
 [1]  6  9  1 24 13 20  4  7 21 14 17  3 16 11 19 25 23 18 12 26 15  2  5 10 22
[26]  8
> unique(apply(matrix(laperm,ncol=2),1,sum))
[1] 17 28 26 47 31 32 30 22 23 19 27 25 24

which is close to the solution sol=13 proposed in Le Monde… It is obviously hopeless for a sum bounded by 2014. A light attempt at simulated annealing did not help either.
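Such a light attempt could look like the following sketch, reusing the crude score above and swapping two entries of the permutation at each step (the cooling schedule and the move being arbitrary choices of mine):

n=26                      #number of integers kept, as in the above solution
perm=sample(1:n)
score=function(p)         #same crude criterion as the brute force
  sum(unique(apply(matrix(p,ncol=2),1,sum))<33)
cur=score(perm)
sol=cur;laperm=perm
temp=1
for (t in 1:10^5){
  swp=sample(1:n,2)       #propose swapping two entries
  prop=perm
  prop[swp]=perm[rev(swp)]
  val=score(prop)
  if (log(runif(1))<(val-cur)/temp){  #annealed acceptance (maximising)
    perm=prop;cur=val}
  if (cur>sol){sol=cur;laperm=perm}
  temp=max(.999*temp,.01) #geometric cooling with a floor
}
sol
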

Le Monde puzzle [#853]

Posted in Books, Kids, Statistics on February 13, 2014 by xi'an

Yet another one of those Le Monde mathematical puzzles whose wording is confusing (or at least annoying) to me:

A political party has 11 commissions, to which belong some of the 13 members of the central committee. A token is given to each member for each commission to which he or she belongs. Two different members cannot share more than one common commission. How many tokens at most? Same question if the president belongs to five commissions.

I just dislike the “story” around the combinatoric problem. Given 13 sets and 11 letters, how many letters can one allocate to the sets so that each pair of sets shares at most one letter? While waiting for my prosthesis this afternoon, I thought of a purely random search, using the following R code:

n=11   #number of commissions (letters)
m=13   #number of members (sets)
best=0
for (t in 1:10^3){
  #fill at random: -1 unchecked, 1 allocated, 0 prohibited
  token=matrix(-1,ncol=n,nrow=m);diag(token)=1
  while (sum(token<0)>0){
    res=TRUE
    while (res){
      #pick an unchecked entry at random (the last one if only one is left)
      if (sum(token<0)==1){
          dex=(1:(n*m))[token[(1:(n*m))]<0]}else{
          dex=sample((1:(n*m))[token[(1:(n*m))]<0],1)}
      i=dex%%m+m*(dex%%m==0)   #row (member) index
      j=trunc(dex/(1.01*m))+1  #column (commission) index, 1.01 avoiding exact multiples of m
      #reject the entry if member i would then share two commissions
      #with a member k already in commission j
      res=FALSE
      for (k in (1:m)[token[,j]==1])
         res=res||(max(token[i,-j]+token[k,-j])==2)
      token[dex]=0
      if (sum(token<0)==0) res=FALSE
      }
    token[dex]=1*(sum(token<0)>0)
    #prohibit the entries that would now create a double share
    for (k in (1:m)[-i])
      if (max(token[i,-j]+token[k,-j])==2) token[k,j]=0
    image(1:13,1:11,token)  #monitor the current allocation
    }
  if (sum(token[token==1])>best) bestoken=token
  best=max(best,sum(token))
}

which led to the value of 43 after that many iterations. Longer runs on faster machines at Paris-Dauphine did not change the output but, as usual with brute-force simulation, the true solution may be so extreme that it is extremely unlikely to be reached… I then tried a standard simulated annealing exploration, but could not find a higher value. Except once, leading to the value 44. Here is the corresponding allocation of letters (committees) to sets (individuals) for that solution:

[image lemonde853: the 44-token allocation]

In the Feb. 5, 2014, issue of the Le Monde Science&Médecine leaflet, a review of (my Warwick colleague) Ian Stewart’s 17 equations that changed the World, which must have been recently translated into French (with no criticism of the less compelling chapter on Black-Scholes, and a confusion of the “bell curve” with statistics). Plus yet another tribune of Marco Zito, about the generalisation of Richard Feynman’s diagrams to a solid called the Amplituhedron. (Not making much sense as exposed!)

sudoku break

Posted in pictures, R, Statistics on December 13, 2013 by xi'an

[plot sudo291113: exploration path of the annealing solver]

While in Warwick last week, one evening after having exhausted my laptop battery, I tried the following Sudoku (from Libération):

>   printSudoku(readSudoku("libe.dk"))
  +-------+-------+-------+
  | 4   6 |   2   | 3   9 |
  |   3   |       |   2   |
  | 7   2 |       | 5   6 |
  +-------+-------+-------+
  |       | 9 4 5 |       |
  | 5     | 7 6 2 |     1 |
  |       | 3 1 8 |       |
  +-------+-------+-------+
  | 6   9 |       | 1   3 |
  |   7   |       |   9   |
  | 3   1 |   9   | 4   7 |
  +-------+-------+-------+

and could not even start: as it happened, this was a setting with no deterministic move, i.e., all free/empty entries had multiple possible values. So, after trying for a while and following trees to no obvious contradiction (!), I decided to give up and, on the next day (with power), to call my “old” Sudoku solver (built while at SAMSI), using simulated annealing, and got the result after a few thousand iterations. The detail of the exploration is represented in the above plot, the two colours coding for two different moves on the Sudoku table. Leading to the solution

  +-------+-------+-------+
  | 4 8 6 | 5 2 1 | 3 7 9 |
  | 1 3 5 | 6 7 9 | 8 2 4 |
  | 7 9 2 | 8 3 4 | 5 1 6 |
  +-------+-------+-------+
  | 2 1 3 | 9 4 5 | 7 6 8 |
  | 5 4 8 | 7 6 2 | 9 3 1 |
  | 9 6 7 | 3 1 8 | 2 4 5 |
  +-------+-------+-------+
  | 6 2 9 | 4 8 7 | 1 5 3 |
  | 8 7 4 | 1 5 3 | 6 9 2 |
  | 3 5 1 | 2 9 6 | 4 8 7 |
  +-------+-------+-------+

I then tried a variant with more proposals (hence more colours) at each iteration, which ended up stuck at a penalty of 4 (instead of 0) for the final thousand iterations. Although this is a single-occurrence experiment, I find it interesting that having more move proposals may get the algorithm stuck faster in a local minimum. Nothing very deep there, of course..!

[plot sudo301113: the variant’s exploration, stuck at penalty 4]
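
For the curious reader, a bare-bones version of this simulated annealing strategy fits below: fill each 3×3 box with its missing digits, then propose swapping two non-clue cells within a random box, with a penalty counting row and column duplicates. This is the generic scheme rather than my SAMSI code, and every tuning constant (and the function name sudosa) is an arbitrary choice.

#grid: 9x9 matrix with 0 for the empty cells, e.g. the Libération grid above
sudosa=function(grid,iters=2*10^5){
  clue=grid>0
  for (b in 0:8){  #complete each 3x3 box with its missing digits
    rs=(b%/%3)*3+1:3;cs=(b%%3)*3+1:3
    box=grid[rs,cs]
    miss=setdiff(1:9,box)
    box[box==0]=miss[sample.int(length(miss))]
    grid[rs,cs]=box
  }
  pen=function(g)  #number of duplicates along rows and columns
    sum(apply(g,1,function(v) 9-length(unique(v))))+
    sum(apply(g,2,function(v) 9-length(unique(v))))
  cur=pen(grid);temp=2
  for (t in 1:iters){
    if (cur==0) break  #solved
    b=sample(0:8,1)    #pick a box and two of its non-clue cells
    rs=(b%/%3)*3+1:3;cs=(b%%3)*3+1:3
    free=which(!clue[rs,cs])
    if (length(free)<2) next
    swp=sample(free,2)
    prop=grid
    box=prop[rs,cs];box[swp]=box[rev(swp)]
    prop[rs,cs]=box
    val=pen(prop)
    if (log(runif(1))<(cur-val)/temp){grid=prop;cur=val}
    temp=max(.99995*temp,.05)  #slow geometric cooling with a floor
  }
  list(grid=grid,penalty=cur)
}

Depending on the run, the final penalty is either zero or, as in the above plot, stuck at a small positive value, in which case reheating or simply restarting is the lazy fix.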
