Archive for ICMS

limited shelf validity

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , , , , on December 11, 2019 by xi'an

A great article from Steve Stigler in the new, multi-scaled, and so exciting Harvard Data Science Review magisterially operated by Xiao-Li Meng, on the limitations of old datasets. Illustrated by three famous datasets used by three equally famous statisticians, Quetelet, Bortkiewicz, and Gosset. None of whom were fundamentally interested in the data for their own sake. First, Quetelet’s data was (wrongly) reconstructed and missed the opportunity to beat Galton at discovering correlation. Second, Bortkiewicz went looking (or even cherry-picking!) for these rare events in yearly tables of mortality minutely divided between causes such as military horse kicks. The third dataset is not Guinness‘, but a test between two sleeping pills, operated rather crudely over inmates from a psychiatric institution in Kalamazoo, with further mishandling by Gosset himself. Manipulations that turn the data into dead data, as Steve put it. (And illustrates with the above skull collection picture. As well as warning against attempts at resuscitating dead data into what could be called “zombie data”.)

“Successful resurrection is only slightly more common than in Christian theology.”

His global perspective on dead data is that they should stop being used before extending their (shelf) life, rather than turning into benchmarks recycled over and over as a proof of concept. If only (my two cents) because it leads to calibrate (and choose) methods doing well over these benchmarks. Another example that could have been added to the skulls above is the Galaxy Velocity Dataset that makes frequent appearances in works estimating Gaussian mixtures. Which Radford Neal signaled at the 2001 ICMS workshop on mixture estimation as an inappropriate use of the dataset since astrophysical arguments weighted against a mixture modelling.

“…the role of context in shaping data selection and form—context in temporal, political, and social as well as scientific terms—has been shown to be a powerful and interesting phenomenon.”

The potential for “dead-er” data (my neologism!) increases with the epoch in that the careful sleuth work Steve (and others) conducted about these historical datasets is absolutely impossible with the current massive data sets. Massive and proprietary. And presumably discarded once the associated neural net is designed and sold. Letting the burden of unmasking the potential (or highly probable?) biases to others. Most interestingly, this recoups a “comment” in Nature of 17 October by Sabina Leonelli on the transformation of data from a national treasure to a commodity which “ownership can confer and signal power”. But her call for openness and governance of research data seems as illusory as other attempts to sever the GAFAs from their extra-territorial privileges…

label switching by optimal transport: Wasserstein to the rescue

Posted in Books, Statistics, Travel with tags , , , , , , , , , , , , , , on November 28, 2019 by xi'an

A new arXival by Pierre Monteiller et al. on resolving label switching by optimal transport. To appear in NeurIPS 2019, next month (where I will be, but extra muros, as I have not registered for the conference). Among other things, the paper was inspired from an answer of mine on X validated, presumably a première (and a dernière?!). Rather than picketing [in the likely unpleasant weather ]on the pavement outside the conference centre, here are my raw reactions to the proposal made in the paper. (Usual disclaimer: I was not involved in the review of this paper.)

“Previous methods such as the invariant losses of Celeux et al. (2000) and pivot alignments of Marin et al. (2005) do not identify modes in a principled manner.”

Unprincipled, me?! We did not aim at identifying all modes but only one of them, since the posterior distribution is invariant under reparameterisation. Without any bad feeling (!), I still maintain my position that using a permutation invariant loss function is a most principled and Bayesian approach towards a proper resolution of the issue. Even though figuring out the resulting Bayes estimate may prove tricky.

The paper thus adopts a different approach, towards giving a manageable meaning to the average of the mixture distributions over all permutations, not in a linear Euclidean sense but thanks to a Wasserstein barycentre. Which indeed allows for an averaged mixture density, although a point-by-point estimate that does not require switching to occur at all was already proposed in earlier papers of ours. Including the Bayesian Core. As shown above. What was first unclear to me is how necessary the Wasserstein formalism proves to be in this context. In fact, the major difference with the above picture is that the estimated barycentre is a mixture with the same number of components. Computing time? Bayesian estimate?

Green’s approach to the problem via a point process representation [briefly mentioned on page 6] of the mixture itself, as for instance presented in our mixture analysis handbook, should have been considered. As well as issues about Bayes factors examined in Gelman et al. (2003) and our more recent work with Kate Jeong Eun Lee. Where the practical impossibility of considering all possible permutations is processed by importance sampling.

An idle thought that came to me while reading this paper (in Seoul) was that a more challenging problem would be to face a model invariant under the action of a group with only a subset of known elements of that group. Or simply too many elements in the group. In which case averaging over the orbit would become an issue.

at the centre of Bayes

Posted in Mountains, pictures, Statistics, Travel, University life with tags , , , , , , , , , , , , , , , , on October 14, 2019 by xi'an

thermodynamic integration plus temperings

Posted in Statistics, Travel, University life with tags , , , , , , , , , , , , on July 30, 2019 by xi'an

Biljana Stojkova and David Campbel recently arXived a paper on the used of parallel simulated tempering for thermodynamic integration towards producing estimates of marginal likelihoods. Resulting into a rather unwieldy acronym of PT-STWNC for “Parallel Tempering – Simulated Tempering Without Normalizing Constants”. Remember that parallel tempering runs T chains in parallel for T different powers of the likelihood (from 0 to 1), potentially swapping chain values at each iteration. Simulated tempering monitors a single chain that explores both the parameter space and the temperature range. Requiring a prior on the temperature. Whose optimal if unrealistic choice was found by Geyer and Thomson (1995) to be proportional to the inverse (and unknown) normalising constant (albeit over a finite set of temperatures). Proposing the new temperature instead via a random walk, the Metropolis within Gibbs update of the temperature τ then involves normalising constants.

“This approach is explored as proof of concept and not in a general sense because the precision of the approximation depends on the quality of the interpolator which in turn will be impacted by smoothness and continuity of the manifold, properties which are difficult to characterize or guarantee given the multi-modal nature of the likelihoods.”

To bypass this issue, the authors pick for their (formal) prior on the temperature τ, a prior such that the profile posterior distribution on τ is constant, i.e. the joint distribution at τ and at the mode [of the conditional posterior distribution of the parameter] is constant. This choice makes for a closed form prior, provided this mode of the tempered posterior can de facto be computed for each value of τ. (However it is unclear to me why the exact mode would need to be used.) The resulting Metropolis ratio becomes independent of the normalising constants. The final version of the algorithm runs an extra exchange step on both this simulated tempering version and the untempered version, i.e., the original unnormalised posterior. For the marginal likelihood, thermodynamic integration is invoked, following Friel and Pettitt (2008), using simulated tempering samples of (θ,τ) pairs (associated instead with the above constant profile posterior) and simple Riemann integration of the expected log posterior. The paper stresses the gain due to a continuous temperature scale, as it “removes the need for optimal temperature discretization schedule.” The method is applied to the Glaxy (mixture) dataset in order to compare it with the earlier approach of Friel and Pettitt (2008), resulting in (a) a selection of the mixture with five components and (b) much more variability between the estimated marginal  likelihoods for different numbers of components than in the earlier approach (where the estimates hardly move with k). And (c) a trimodal distribution on the means [and unimodal on the variances]. This example is however hard to interpret, since there are many contradicting interpretations for the various numbers of components in the model. (I recall Radford Neal giving an impromptu talks at an ICMS workshop in Edinburgh in 2001 to warn us we should not use the dataset without a clear(er) understanding of the astrophysics behind. If I remember well he was excluded all low values for the number of components as being inappropriate…. I also remember taking two days off with Peter Green to go climbing Craigh Meagaidh, as the only authorised climbing place around during the foot-and-mouth epidemics.) In conclusion, after presumably too light a read (I did not referee the paper!), it remains unclear to me why the combination of the various tempering schemes is bringing a noticeable improvement over the existing. At a given computational cost. As the temperature distribution does not seem to favour spending time in the regions where the target is most quickly changing. As such the algorithm rather appears as a special form of exchange algorithm.

convergence of MCMC

Posted in Statistics with tags , , , , , , , , , on June 16, 2017 by xi'an

Michael Betancourt just posted on arXiv an historical  review piece on the convergence of MCMC, with a physical perspective.

“The success of these of Markov chain Monte Carlo, however, contributed to its own demise.”

The discourse proceeds through augmented [reality!] versions of MCMC algorithms taking advantage of the shape and nature of the target distribution, like Langevin diffusions [which cannot be simulated directly and exactly at the same time] in statistics and molecular dynamics in physics. (Which reminded me of the two parallel threads at the ICMS workshop we had a few years ago.) Merging into hybrid Monte Carlo, morphing into Hamiltonian Monte Carlo under the quills of Radford Neal and David MacKay in the 1990’s. It is a short entry (and so is this post), with some background already well-known to the community, but it nonetheless provides a perspective and references rarely mentioned in statistics.

Bruce Lindsay (March 7, 1947 — May 5, 2015)

Posted in Books, Running, Statistics, Travel, University life with tags , , , , , , , , , , , on May 22, 2015 by xi'an

When early registering for Seattle (JSM 2015) today, I discovered on the ASA webpage the very sad news that Bruce Lindsay had passed away on May 5.  While Bruce was not a very close friend, we had met and interacted enough times for me to feel quite strongly about his most untimely death. Bruce was indeed “Mister mixtures” in many ways and I have always admired the unusual and innovative ways he had found for analysing mixtures. Including algebraic ones through the rank of associated matrices. Which is why I first met him—besides a few words at the 1989 Gertrude Cox (first) scholarship race in Washington DC—at the workshop I organised with Gilles Celeux and Mike West in Aussois, French Alps, in 1995. After this meeting, we met twice in Edinburgh at ICMS workshops on mixtures, organised with Mike Titterington. I remember sitting next to Bruce at one workshop dinner (at Blonde) and him talking about his childhood in Oregon and his father being a journalist and how this induced him to become an academic. He also contributed a chapter on estimating the number of components [of a mixture] to the Wiley book we edited out of this workshop. Obviously, his work extended beyond mixtures to a general neo-Fisherian theory of likelihood inference. (Bruce was certainly not a Bayesian!) Last time, I met him, it was in Italia, at a likelihood workshop in Venezia, October 2012, mixing Bayesian nonparametrics, intractable likelihoods, and pseudo-likelihoods. He gave a survey talk about composite likelihood, telling me about his extended stay in Italy (Padua?) around that time… So, Bruce, I hope you are now running great marathons in a place so full of mixtures that you can always keep ahead of the pack! Fare well!

 

“those” coincidences

Posted in pictures, Travel, University life with tags , , , , , , , , , , , , , , on June 21, 2014 by xi'an

waverleyLast Thursday night, after a friendly dinner closing the ICMS workshop, I was rushing back to Pollock Halls to catch some sleep before a very early flight. When crossing North Bridge, on top of Waverley station, I then spotted in the crowd a well-known face of a fellow statistician from Cambridge University, on an academic visit to the University of Edinburgh that was completely unrelated with the workshop. Then, today, on my way back from submitting a visa request at the Indian embassy in Paris, I took the RER train for one stop between Gare du Nord and Chatelet. When I stood up from my seat and looked behind me, a senior (and most famous) mathematician was sitting right there, in deep conversation with a colleague about algorithms… Just two of “those” coincidences. (Edinburgh may be propitious to coincidences: at the last ICMS workshop I attended, I ended up in the same Indian restaurant as Marc Suchard, who also was on an academic visit to the University of Edinburgh that was completely unrelated with the workshop!)