Mathieu Gerber (formerly CREST) and Luke Bornn have arXived a paper on the almost sure convergence of simulated annealing algorithms when using a non-Markovian sequence that can be in the limiting case completely deterministic and hence use quasi-Monte Carlo sequences. The paper extends the earlier Gerber and Bornn (2015) that I missed. While the paper is highly technical, it shows that under some conditions a sequence of time-varying kernels can be used to reach the maximum of an objective function. With my limited experience with simulated annealing I find this notion of non-iid or even non-random both worth investigating and somewhat unsurprising from a practitioner’s view in that modifying a standard simulated annealing algorithm with steps depending on the entire past of the sequence usually produces better performances.
Archive for CREST
Yesterday, I took part in the thesis defence of James Ridgway [soon to move to the University of Bristol[ at Université Paris-Dauphine. While I have already commented on his joint paper with Nicolas on the Pima Indians, I had not read in any depth another paper in the thesis, “On the properties of variational approximations of Gibbs posteriors” written jointly with Pierre Alquier and Nicolas Chopin.
PAC stands for probably approximately correct and starts with an empirical form of posterior, called the Gibbs posterior, where the log-likelihood is replaced with an empirical error
that is rescaled by a factor λ. Factor that is called the learning rate, to be optimised as the (Kullback) closest approximation to the true unknown distribution, by Peter Grünwald (2012) in his SafeBayes approach. In the paper of James, Pierre and Nicolas, there is no visible Bayesian perspective, since the pseudo-posterior is used to define a randomised estimator that achieves optimal oracle bounds. When λ is of order n. The purpose of the paper is rather to produce an efficient approximation to the Gibbs posterior, by using variational Bayes techniques. And to derive point estimators. With the added appeal that the approximation also achieves the oracle bounds. (Surprisingly, the authors do not leave the Pima Indians alone as they use this benchmark for a ranking model.) Since there is no discussion on the choice of the learning rate λ, as opposed to Bissiri et al. (2013) I discussed around Bayes.250, I have difficulties perceiving the possible impact of this representation on Bayesian analysis. Except maybe as an ABC device, as suggested by Christophe Andrieu.
Nicolas Chopin ran a workshop at ENSAE on sequential Monte Carlo the past three days and it was a good opportunity to get a much needed up-to-date on the current trends in the field. Especially given that the meeting was literally downstairs from my office at CREST. And given the top range of researchers presenting their current or past work (in the very amphitheatre where I attended my first statistics lectures, a few dozen years ago!). Since unforeseen events made me miss most of the central day, I will not comment on individual talks, some of which I had already heard in the recent past, but this was a high quality workshop, topped by a superb organisation. (I started wondering why there was no a single female speaker in the program and so few female participants in the audience, then realised this is a field with a massive gender imbalance, which is difficult to explain given the different situation in Bayesian statistics and even in Bayesian computation…) Some key topics I gathered during the talks I could attend–apologies to the other speakers for missing their talk due to those unforeseen events–are unbiasedness, which sounds central to the SMC methods [at least those presented there] as opposed to MCMC algorithms, and local features, used in different ways like hierarchical decomposition, multiscale, parallelisation, local coupling, &tc., to improve convergence and efficiency…
The statistician, econometrician, macro- and micro-economist, Edmond Malinvaud died on Saturday, March 7. He had been director of my alma mater ENSAE (1962–1966), directeur de la Prévision at the Finance Department (1972–1974), director of INSEE (1974–1987), and Professeur at Collège de France (1988–1993). While primarily an economist, with his theories of disequilibrium and unemployment, reflected in his famous book Théorie macro-économique (1981) that he taught us at ENSAE, he was also instrumental in shaping the French econometrics school, see his equally famous Statistical Methods of Econometrics (1970), and in the reorganisation of INSEE as the post-war State census and economic planning tool. He was also an honorary Fellow of the Royal Statistical Society and the 1981 president of the International Institute of Statistics. Edmond Malinvaud studied under Maurice Allais, Nobel Prize in economics in 1988, and was himself considered as a potential Nobel for several years. My personal memories of him at ENSAE and CREST are of a very clear teacher and of a kind and considerate man, with the reserve and style of a now-bygone era…
There is an opening at the Statistics School ENSAE for a Statistics associate or full professor position, starting on September 2015. Currently located on the South-West boundary of Paris, the school is soon to move to the mega-campus of Paris Saclay, near École Polytechnique, along with a dozen other schools. See this description of the position. The deadline is very close, March 23!
When Andrew was in Paris, we discussed at length about using EP for handling big datasets in a different way than running parallel MCMC. A related preprint came out on arXiv a few days ago, with an introduction on Andrews’ blog. (Not written two months in advance as most of his entries!)
The major argument in using EP in a large data setting is that the approximation to the true posterior can be build using one part of the data at a time and thus avoids handling the entire likelihood function. Nonetheless, I still remain mostly agnostic about using EP and a seminar this morning at CREST by Guillaume Dehaene and Simon Barthelmé (re)generated self-interrogations about the method that hopefully can be exploited towards the future version of the paper.
One of the major difficulties I have with EP is about the nature of the resulting approximation. Since it is chosen out of a “nice” family of distributions, presumably restricted to an exponential family, the optimal approximation will remain within this family, which further makes EP sound like a specific variational Bayes method since the goal is to find the family member the closest to the posterior in terms of Kullback-Leibler divergence. (Except that the divergence is the opposite one.) I remain uncertain about what to do with the resulting solution, as the algorithm does not tell me how close this solution will be from the true posterior. Unless one can use it as a pseudo-distribution for indirect inference (a.k.a., ABC)..?
Another thing that became clear during this seminar is that the decomposition of the target as a product is completely arbitrary, i.e., does not correspond to an feature of the target other than the later being the product of those components. Hence, the EP partition could be adapted or even optimised within the algorithm. Similarly, the parametrisation could be optimised towards a “more Gaussian” posterior. This is something that makes EP both exciting as opening many avenues for experimentation and fuzzy as its perceived lack of goal makes comparing approaches delicate. For instance, using MCMC or HMC steps to estimate the parameters of the tilted distribution is quite natural in complex settings but the impact of the additional approximation must be gauged against the overall purpose of the approach.
As posted a few days ago, Mathieu Gerber and Nicolas Chopin will read this afternoon a Paper to the Royal Statistical Society on their sequential quasi-Monte Carlo sampling paper. Here are some comments on the paper that are preliminaries to my written discussion (to be sent before the slightly awkward deadline of Jan 2, 2015).
Quasi-Monte Carlo methods are definitely not popular within the (mainstream) statistical community, despite regular attempts by respected researchers like Art Owen and Pierre L’Écuyer to induce more use of those methods. It is thus to be hoped that the current attempt will be more successful, it being Read to the Royal Statistical Society being a major step towards a wide diffusion. I am looking forward to the collection of discussions that will result from the incoming afternoon (and bemoan once again having to miss it!).
“It is also the resampling step that makes the introduction of QMC into SMC sampling non-trivial.” (p.3)
At a mathematical level, the fact that randomised low discrepancy sequences produce both unbiased estimators and error rates of order
means that randomised quasi-Monte Carlo methods should always be used, instead of regular Monte Carlo methods! So why is it not always used?! The difficulty stands [I think] in expressing the Monte Carlo estimators in terms of a deterministic function of a fixed number of uniforms (and possibly of past simulated values). At least this is why I never attempted at crossing the Rubicon into the quasi-Monte Carlo realm… And maybe also why the step had to appear in connection with particle filters, which can be seen as dynamic importance sampling methods and hence enjoy a local iid-ness that relates better to quasi-Monte Carlo integrators than single-chain MCMC algorithms. For instance, each resampling step in a particle filter consists in a repeated multinomial generation, hence should have been turned into quasi-Monte Carlo ages ago. (However, rather than the basic solution drafted in Table 2, lower variance solutions like systematic and residual sampling have been proposed in the particle literature and I wonder if any of these is a special form of quasi-Monte Carlo.) In the present setting, the authors move further and apply quasi-Monte Carlo to the particles themselves. However, they still assume the deterministic transform
which the q-block on which I stumbled each time I contemplated quasi-Monte Carlo… So the fundamental difficulty with the whole proposal is that the generation from the Markov proposal
has to be of the above form. Is the strength of this assumption discussed anywhere in the paper? All baseline distributions there are normal. And in the case it does not easily apply, what would the gain bw in only using the second step (i.e., quasi-Monte Carlo-ing the multinomial simulation from the empirical cdf)? In a sequential setting with unknown parameters θ, the transform is modified each time θ is modified and I wonder at the impact on computing cost if the inverse cdf is not available analytically. And I presume simulating the θ’s cannot benefit from quasi-Monte Carlo improvements.
The paper obviously cannot get into every detail, obviously, but I would also welcome indications on the cost of deriving the Hilbert curve, in particular in connection with the dimension d as it has to separate all of the N particles, and on the stopping rule on m that means only Hm is used.
Another question stands with the multiplicity of low discrepancy sequences and their impact on the overall convergence. If Art Owen’s (1997) nested scrambling leads to the best rate, as implied by Theorem 7, why should we ever consider another choice?
In connection with Lemma 1 and the sequential quasi-Monte Carlo approximation of the evidence, I wonder at any possible Rao-Blackwellisation using all proposed moves rather than only those accepted. I mean, from a quasi-Monte Carlo viewpoint, is Rao-Blackwellisation easier and is it of any significant interest?
What are the computing costs and gains for forward and backward sampling? They are not discussed there. I also fail to understand the trick at the end of 4.2.1, using SQMC on a single vector instead of (t+1) of them. Again assuming inverse cdfs are available? Any connection with the Polson et al.’s particle learning literature?
Last questions: what is the (learning) effort for lazy me to move to SQMC? Any hope of stepping outside particle filtering?