## QuanTA

Posted in Books, pictures, Running, Statistics, University life on September 17, 2018 by xi'an

My Warwick colleagues Nick Tawn [who also is my most frequent accomplice to running, climbing and currying in Warwick!] and Gareth Roberts have just arXived a paper on QuanTA, a new parallel tempering algorithm that Nick designed during his thesis at Warwick, which he defended last semester. Parallel tempering targets in parallel several powered (or power-tempered) versions of the target distribution, with proposed switches between adjacent targets. An improved version transforms the local values before operating the switches. Ideally, the transform should be the composition of the cdf at one temperature and the inverse cdf at the other, but this is impossible. Linearising the transform is feasible, but does not agree with multimodality, which calls for local transforms, which themselves call for the identification of the different modes. In QuanTA, they are identified by N parallel runs of the standard algorithm, or rather N/2 to avoid dependence issues, and K-means estimates. The paper covers the construction of an optimal scaling of temperatures, in that the difference between the temperatures is scaled [with order 1/√d] so that the acceptance rate for swaps is 0.234. Which in turn induces a practical if costly calibration of the temperatures, especially when the size of the jump depends on the current temperature. However, this cost issue is addressed in the paper, resorting to the acceptance rate as a proxy for effective sample size and the acceptance rate over run time to run the comparison with regular parallel tempering, leading to strong improvements in the mixture examples examined in the paper. The use of machine learning techniques like K-means or more involved solutions is a promising thread in this exciting area of tempering, where intuition about high temperatures can actually be misleading. Because using the wrong scale means missing the area of interest, which is not the mode!
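To make the baseline concrete, here is a minimal sketch of standard parallel tempering with swaps between adjacent temperatures. It is not QuanTA (there is no quantile-preserving transform before the swap) and not the authors' code. The bimodal target, the temperature ladder and the random-walk scale are all assumptions chosen for illustration.

```python
import math
import random


def log_target(x, mu=4.0, sigma=0.5):
    """Log-density, up to a constant, of an equal-weight normal mixture
    with modes at -mu and +mu (an assumed toy target)."""
    a = -0.5 * ((x - mu) / sigma) ** 2
    b = -0.5 * ((x + mu) / sigma) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def swap_log_ratio(x_i, x_j, beta_i, beta_j):
    """Log Metropolis ratio for exchanging the states of the chains
    targeting pi^beta_i and pi^beta_j."""
    return (beta_i - beta_j) * (log_target(x_j) - log_target(x_i))


def parallel_tempering(betas, n_iter=20000, step=0.5, seed=1):
    """Return the cold-chain trajectory and the swap acceptance rate."""
    rng = random.Random(seed)
    x = [4.0] * len(betas)  # every chain starts in the right-hand mode
    cold, accepted, proposed = [], 0, 0
    for _ in range(n_iter):
        # Random-walk Metropolis move within each tempered target.
        for k, beta in enumerate(betas):
            prop = x[k] + rng.gauss(0.0, step / math.sqrt(beta))
            log_u = math.log(1.0 - rng.random())
            if log_u < beta * (log_target(prop) - log_target(x[k])):
                x[k] = prop
        # Propose one swap between a random pair of adjacent temperatures.
        if len(betas) > 1:
            k = rng.randrange(len(betas) - 1)
            proposed += 1
            log_u = math.log(1.0 - rng.random())
            if log_u < swap_log_ratio(x[k], x[k + 1], betas[k], betas[k + 1]):
                x[k], x[k + 1] = x[k + 1], x[k]
                accepted += 1
        cold.append(x[0])
    rate = accepted / proposed if proposed else 0.0
    return cold, rate


if __name__ == "__main__":
    cold, rate = parallel_tempering([1.0, 0.5, 0.25, 0.1, 0.03])
    left = sum(1 for v in cold if v < 0) / len(cold)
    print(f"swap acceptance rate: {rate:.2f}, time in left mode: {left:.2f}")
```

Running the same function with the single temperature `[1.0]` leaves the chain stuck in its starting mode, which is the failure the hotter chains are meant to repair.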

## Better together in Kolkata [slides]

Posted in Books, pictures, Statistics, Travel, University life on January 4, 2018 by xi'an

Here are the slides of the talk on modularisation I am giving today at the PC Mahalanobis 125 Conference in Kolkata, mostly borrowed from Pierre’s talk at O’Bayes 2017 last month:

[which made me realise Slideshare has discontinued the option to update one’s presentation, forcing users to create a new presentation for each update!] Incidentally, the amphitheatre at ISI is located right on top of a geological exhibit room with a reconstituted Barapasaurus tagorei so I will figuratively ride a dinosaur during my talk!

## mixtures of mixtures

Posted in pictures, Statistics, University life on March 9, 2015 by xi'an

And yet another arXival of a paper on mixtures! This one is written by Gertraud Malsiner-Walli, Sylvia Frühwirth-Schnatter, and Bettina Grün, from the Johannes Kepler University Linz and the Wirtschaftsuniversität Wien I visited last September. With the exact title being Identifying mixtures of mixtures using Bayesian estimation.

So, what is a mixture of mixtures if not a mixture?! Or if not only a mixture. The upper mixture level is associated with clusters, while the lower mixture level is used for modelling the distribution of a given cluster. Because the cluster needs to be real enough, the components of the mixture are assumed to be heavily overlapping. The paper thus spends a large amount of space on detailing the construction of the associated hierarchical prior. Which in particular implies defining through the prior what a cluster means. The paper also connects with the overfitting mixture idea of Rousseau and Mengersen (2011, Series B). At the cluster level, the Dirichlet hyperparameter is chosen to be very small, 0.001, which empties superfluous clusters but sounds rather arbitrary (which is the reason why we did not go for such small values in our testing/mixture modelling). By contrast, the within-cluster mixture weights have a hyperparameter staying (far) away from zero. The MCMC implementation is based on a standard Gibbs sampler and the outcome is analysed and sorted by estimating the “true” number of clusters as the MAP and by selecting MCMC simulations conditional on that value. From there, clusters are identified via the point process representation of a mixture posterior, using a standard k-means algorithm.
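As a rough illustration of why such a small Dirichlet hyperparameter empties clusters, the sketch below draws weight vectors from a symmetric Dirichlet prior and counts how many components carry non-negligible weight. It only looks at the prior on the weights, not at the paper's full hierarchical prior or posterior. The number of components (10), the threshold (0.01) and the "far from zero" value (4) are assumptions of mine.

```python
import math
import random


def log_gamma_draw(rng, shape):
    """Log of a Gamma(shape, 1) draw, kept on the log scale so that tiny
    shapes do not underflow. Uses Gamma(a) = Gamma(a + 1) * U**(1 / a)."""
    log_u = math.log(1.0 - rng.random())
    return math.log(rng.gammavariate(shape + 1.0, 1.0)) + log_u / shape


def dirichlet_draw(rng, alpha, k):
    """One draw from a symmetric Dirichlet(alpha, ..., alpha) of size k."""
    logs = [log_gamma_draw(rng, alpha) for _ in range(k)]
    top = max(logs)
    unnormalised = [math.exp(v - top) for v in logs]
    total = sum(unnormalised)
    return [v / total for v in unnormalised]


def mean_active(alpha, k=10, n_draws=2000, threshold=0.01, seed=0):
    """Average number of components whose prior weight exceeds threshold."""
    rng = random.Random(seed)
    count = 0
    for _ in range(n_draws):
        count += sum(1 for w in dirichlet_draw(rng, alpha, k) if w > threshold)
    return count / n_draws


if __name__ == "__main__":
    for alpha in (0.001, 4.0):
        print(f"alpha = {alpha}: about {mean_active(alpha):.2f} active components")
```

With the tiny hyperparameter nearly all the prior mass sits on a single component per draw, while the larger value spreads weight over almost all components.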

The remainder of the paper illustrates the approach on simulated and real datasets. Recovering in those small dimension setups the number of clusters used in the simulation or found in other studies. As noted in the conclusion, using solely a Gibbs sampler with such a large number of components is rather perilous since it may get stuck close to suboptimal configurations. Especially with very small Dirichlet hyperparameters.

## reading classics (#9,10)

Posted in Books, Kids, Statistics, University life on January 28, 2014 by xi'an

Today was the very last session of our Reading Classics Seminar for the academic year 2013-2014. We listened to two presentations, one on the Casella and Strawderman (1981) paper on the estimation of the bounded normal mean. And one on the Hartigan and Wong’s 1979 K-Means Clustering Algorithm paper in JRSS C. The first presentation did not go well as my student had difficulties with the maths behind the paper. (As he did not come to ask me or others for help, it may well be that he put this talk together at the last minute, at a time busy with finals and project deliveries. He also failed to exploit those earlier presentations of the paper.) The innovative part in the talk was the presentation of several R simulations comparing the risk of the minimax Bayes estimator with that of the MLE. Although the choice of simulating different samples of standard normals for different values of the parameter, and even for both estimators, made the curves (unnecessarily) all wiggly.
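The wiggliness issue can be reproduced in a few lines. The sketch below, in Python rather than the R of the talk, estimates both risk curves for a single observation x ~ N(θ, 1) with |θ| ≤ m, once with the same standard normal draws reused for every θ and both estimators, and once with fresh draws each time. The bound m = 1, the grid and the sample size are my assumptions. The Bayes estimator used is m·tanh(m·x), associated with the two-point prior on ±m, which is the minimax one only when m is small enough.

```python
import math
import random

M = 1.0  # assumed bound on |theta|


def mle(x, m=M):
    """Restricted MLE: the observation clipped to [-m, m]."""
    return max(-m, min(m, x))


def bayes_two_point(x, m=M):
    """Posterior mean under the uniform prior on {-m, +m}."""
    return m * math.tanh(m * x)


def risk_curve(estimator, thetas, noise_for):
    """Monte Carlo squared-error risk at each theta; noise_for(theta)
    returns the standard normal draws to use at that theta."""
    curve = []
    for theta in thetas:
        zs = noise_for(theta)
        loss = sum((estimator(theta + z) - theta) ** 2 for z in zs)
        curve.append(loss / len(zs))
    return curve


def roughness(curve):
    """Sum of absolute second differences, a crude wiggliness score."""
    return sum(abs(curve[i - 1] - 2.0 * curve[i] + curve[i + 1])
               for i in range(1, len(curve) - 1))


def compare(n_draws=400, n_grid=21, seed=42):
    rng = random.Random(seed)
    thetas = [-M + 2.0 * M * i / (n_grid - 1) for i in range(n_grid)]
    shared = [rng.gauss(0.0, 1.0) for _ in range(n_draws)]

    def common(theta):
        return shared  # same draws for every theta and both estimators

    def fresh(theta):
        return [rng.gauss(0.0, 1.0) for _ in range(n_draws)]

    curves = {}
    for name, est in (("mle", mle), ("bayes", bayes_two_point)):
        curves[name] = {"common": risk_curve(est, thetas, common),
                        "fresh": risk_curve(est, thetas, fresh)}
    return thetas, curves


if __name__ == "__main__":
    _, curves = compare()
    for name, pair in curves.items():
        print(name, {k: round(roughness(v), 3) for k, v in pair.items()})
```

Reusing the draws does not remove the Monte Carlo error, it only makes it vary smoothly with θ, which is what one wants when comparing curves by eye.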

By contrast, the second presentation was very well-designed, with great Beamer slides, interactive features and a software-oriented focus. My student Mouna Berrada started from the existing R function kmeans to explain the principles of the algorithm, recycling the interactive presentation of last year as well (with my permission), and creating a dynamic flowchart that was most helpful. So she made the best of this very short paper! Just (predictably) missing the question of the statistical model behind the procedure. During the discussion, I wondered why k-medians clustering was not more popular as it offered higher robustness guarantees, albeit further away from a genuine statistical model. And why k-means clustering was not more systematically compared with mixture (EM) estimation.

Here are the slides for the second talk

## reading classics (#1)

Posted in Statistics, University life on October 26, 2012 by xi'an

This year, a lot of my Master's students (plus all of my PhD students) registered for the Reading Classics Seminar course, so we should spend half of the year going through those “classics”. And have lively discussions thanks to the size of the group. The first student to present a paper, Céline Beji, chose Hartigan and Wong’s 1979 K-Means Clustering Algorithm paper in JRSS C. She did quite well, especially when considering she had two weeks to learn LaTeX and Beamer in addition to getting through the paper! She also managed to find an online demo of the algorithm. Here are her slides

This was not the easiest paper in the list, by far: it is short, mostly algorithmic and requires some background on the reasons why clustering was of interest and on how it impacted the field. Tellingly, the discussion with the class then focussed on the criterion rather than on the algorithm itself. In a sense, this is the most striking feature of the paper, namely that it is completely a-statistical in picking a criterion to minimise. There is neither randomness nor error involved at this stage, it is simply an extended least-squares approach. This is why the number of clusters (and again the discussion from the class spent some time on this) cannot be inferred via this method. A well-auguring start to the course!
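The point about the number of clusters can be checked numerically. The sketch below runs plain Lloyd iterations (a simpler scheme than the Hartigan and Wong algorithm of the paper, swapped in for brevity) on simulated one-dimensional data and reports the within-cluster sum of squares for several values of k. The data, seeds and iteration count are assumptions for illustration.

```python
import random


def kmeans_wss(points, k, n_iter=50, seed=0):
    """Within-cluster sum of squares after n_iter Lloyd iterations on
    one-dimensional data, starting from k distinct data points."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        # Assignment step: each point joins its closest centre.
        for p in points:
            j = min(range(k), key=lambda c: (p - centres[c]) ** 2)
            clusters[j].append(p)
        # Update step: each centre becomes the mean of its cluster
        # (an empty cluster keeps its previous centre).
        centres = [sum(c) / len(c) if c else centres[j]
                   for j, c in enumerate(clusters)]
    return sum(min((p - c) ** 2 for c in centres) for p in points)


def simulate(seed=7, per_group=20):
    """Three well-separated normal groups (an assumed toy dataset)."""
    rng = random.Random(seed)
    return [rng.gauss(mu, 0.5) for mu in (-5.0, 0.0, 5.0)
            for _ in range(per_group)]


if __name__ == "__main__":
    data = simulate()
    for k in (1, 2, 3, 6, len(data)):
        print(f"k = {k:2d}: within-cluster sum of squares = "
              f"{kmeans_wss(data, k):.3f}")
```

The criterion drops to zero when every point is its own cluster, so minimising it over k always favours the largest k, with no model to say when to stop.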