Archive for arXiv

troubling trends in machine learning

Posted in Books, pictures, Running, Statistics, University life on July 25, 2018 by xi'an

This morning, in Coventry, while having an n-th cup of tea after a very early morning run (light comes early at this time of the year!), I spotted an intriguing title in the arXivals of the day, by Zachary Lipton and Jacob Steinhardt, addressing the academic shortcomings of machine learning papers. While I first thought little of the attempt to address poor scholarship in the machine learning literature, I read it with growing interest and, although I am pessimistic about the chances of inverting the trend, considering the relentless pace and massive production of the community, I consider the exercise worth conducting, if only to launch a debate on the excesses found in the literature.

“…desirable characteristics:  (i) provide intuition to aid the reader’s understanding, but clearly distinguish it from stronger conclusions supported by evidence; (ii) describe empirical investigations that consider and rule out alternative hypotheses; (iii) make clear the relationship between theoretical analysis and intuitive or empirical claims; and (iv) use language to empower the reader, choosing terminology to avoid misleading or unproven connotations, collisions with other definitions, or conflation with other related but distinct concepts”

The points made by the authors are (p.1)

  1. Failure to distinguish between explanation and speculation
  2. Failure to identify the sources of empirical gains
  3. Mathiness
  4. Misuse of language

Again, I had misgivings about point 3., but this is not an anti-maths argument, rather one about the recourse to vaguely connected or oversold mathematical results as a way to support a method.

Most interestingly (and living dangerously!), the authors select specific papers to illustrate their point, picking from well-established authors and from their own papers, rather than from junior authors. They also include counter-examples of papers going the(ir) right way. Among the recommendations for emerging from the morass of poor scholarship papers, they suggest favouring critical writing and retrospective surveys (provided authors can be found for these!). And mention open reviews before I can mention these myself. One would think that published anonymous reviews are a step in the right direction; I would actually say that this should be the norm (plus or minus anonymity) for all journals or successors of journals (PCIs coming strongly to mind). But requiring more work from the referees implies rewards for said referees, as done in some biology and hydrology journals I refereed for (and PCIs of course).

accelerating MCMC

Posted in Books, Statistics, University life on April 11, 2018 by xi'an

As forecasted a rather long while ago (!), I wrote a short and incomplete survey on some approaches to accelerating MCMC, with the massive help of Victor Elvira (Lille), Nick Tawn (Warwick) and Changye Wu (Dauphine). The current version of this survey just got arXived and has now been accepted by WIREs Computational Statistics. The typology (and even the range of methods) adopted here is certainly mostly arbitrary, with suggestions for different divisions made by a very involved and helpful reviewer. While we achieved a quick conclusion to the review process, suggestions and comments are most welcome! Even if we cannot include every possible suggestion, just like those already made on X validated. (WIREs stands for Wiley Interdisciplinary Reviews and its dozen topics cover several fields, from computational stats to biology, to medicine, to engineering.)

ABCDay [arXivals]

Posted in Books, Statistics, University life on March 2, 2018 by xi'an

A bunch of ABC papers on arXiv yesterday, most of them linked to the incoming Handbook of ABC:

    1. Overview of Approximate Bayesian Computation, by S. A. Sisson, Y. Fan, M. A. Beaumont
    2. Kernel Recursive ABC: Point Estimation with Intractable Likelihood, by Takafumi Kajihara, Keisuke Yamazaki, Motonobu Kanagawa, Kenji Fukumizu
    3. High-dimensional ABC, by D. J. Nott, V. M.-H. Ong, Y. Fan, S. A. Sisson
    4. ABC Samplers, by Y. Fan, S. A. Sisson


divide & reconquer

Posted in Books, Statistics, University life on February 5, 2018 by xi'an

Qi Liu, Anindya Bhadra, and William Cleveland from Purdue have arXived a paper entitled Divide and Recombine for Large and Complex Data: Model Likelihood Functions using MCMC. Which is a variation on the earlier divide & … papers attempting to handle large datasets. The beginning is quite similar to these earlier papers in that the likelihood is split into sub-likelihoods, approximated from MCMC samples and recombined into an approximate full likelihood. As in for instance Scott et al., one approximation used for the subsample is to replace the likelihood with a Normal approximation, or a skew Normal generalisation, which remains a limited choice for heavy tailed likelihoods. Producing a Normal and skew-Normal approximation for the whole [data] likelihood, respectively. If I understand correctly, these approximations are missing a normalising constant to bring them to scale with the true likelihood, a concern I do not completely understand as the likelihood only needs to be defined up to a constant for most purposes, including Bayesian ones. The method of estimation of this constant proposed therein is called the contour probability algorithm and it consists in using a highest density region to compare a likelihood and its approximation. (Nothing to do with our adaptation of Gelfand and Dey (1994) based on HPDs, with Darren Wraith. Nor with nested sampling.) Returning a form of qq-plot. This is rather exploratory, while hardly addressing the issue of the precision of such approximations, the resolution of conflicting proposals, or the comparison with all these other recent proposals for splitting likelihoods into manageable bits (proposals that are mentioned in the final section, including our recentering scheme with my student Changye Wu).
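
For the Normal case, the recombination step has a closed form that may be worth spelling out. This is only a sketch: the notation (μ_k, Σ_k for the mean and covariance of the Normal approximation fitted on the k-th of K subsets) is an assumption made here, not taken from the paper. The product of the K Normal approximations is again Normal in θ, up to a constant free of θ, with the precisions adding up:

```latex
\prod_{k=1}^{K} \exp\left\{-\tfrac{1}{2}(\theta-\mu_k)^\top \Sigma_k^{-1}(\theta-\mu_k)\right\}
\;\propto\;
\exp\left\{-\tfrac{1}{2}(\theta-\mu)^\top \Sigma^{-1}(\theta-\mu)\right\},
\qquad
\Sigma^{-1}=\sum_{k=1}^{K}\Sigma_k^{-1},
\quad
\mu=\Sigma\sum_{k=1}^{K}\Sigma_k^{-1}\mu_k .
```

The constant dropped by the proportionality sign depends on the μ_k and Σ_k only, which is why it is irrelevant for inference on θ.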

a paradox about likelihood ratios?

Posted in Books, pictures, Statistics, University life on January 15, 2018 by xi'an

Aware of my fascination for paradoxes (and heterodox publications), Ewan Cameron sent me the link to a recent arXival by Louis Lyons (Oxford) on different asymptotic distributions of the likelihood ratio. Which is full of approximations. The overall point of the note is hard to fathom… Unless it simply plans to illustrate Betteridge’s law of headlines, as suggested by Ewan.

For instance, the log-likelihood of an exponential sample at the true value of the parameter τ is not asymptotically Gaussian but almost surely infinite in the limit. While (minus twice) the log of the (Wilks) likelihood ratio at the true value of τ is truly (if asymptotically) a χ² variable with one degree of freedom. That it is not a Gaussian is deemed a “paradox” by the author, explained by a cancellation of first order terms… Same thing again for the common Gaussian mean problem!
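
As a quick numerical check of the χ² behaviour, here is a minimal simulation sketch. It assumes that τ denotes the mean of the exponential distribution; the sample size, number of replications and seed are arbitrary choices. With this parametrisation the maximum likelihood estimator is the sample mean and minus twice the log likelihood ratio at τ₀ reduces to 2n(r − 1 − log r), with r the ratio of the sample mean to τ₀.

```python
import numpy as np

def loglik(x, tau):
    """Exponential log-likelihood, tau being the mean (an assumption of this sketch)."""
    return -x.size * np.log(tau) - x.sum() / tau

def wilks_statistic(x, tau0):
    """Minus twice the log likelihood ratio between tau0 and the MLE (the sample mean)."""
    r = x.mean() / tau0
    return 2.0 * x.size * (r - 1.0 - np.log(r))

rng = np.random.default_rng(0)
tau_true, n, reps = 2.0, 200, 5000
samples = rng.exponential(scale=tau_true, size=(reps, n))

stats = np.array([wilks_statistic(s, tau_true) for s in samples])
logliks = np.array([loglik(s, tau_true) for s in samples])

# A chi-square variable with one degree of freedom has mean 1
# and exceeds 3.841 with probability about 0.05.
print("mean of -2 log LR:", stats.mean())
print("P(-2 log LR > 3.841):", (stats > 3.841).mean())

# The log-likelihood itself drifts like -n (1 + log tau): it has no limit unless normalised.
print("mean log-likelihood / n:", logliks.mean() / n, "vs", -(1.0 + np.log(tau_true)))
```

The first two printed values should sit close to 1 and 0.05, while the last line shows the log-likelihood growing linearly in n.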

sliced Wasserstein estimation of mixtures

Posted in Books, pictures, R, Statistics on November 28, 2017 by xi'an

A paper by Soheil Kolouri and co-authors was arXived last week about using Wasserstein distance for inference on multivariate Gaussian mixtures. The basic concept is that the parameter is estimated by minimising the p-Wasserstein distance to the empirical distribution, smoothed by a Normal kernel. As the general Wasserstein distance is quite costly to compute, the approach relies on a sliced version, which means computing the Wasserstein distance between one-dimensional projections of the distributions. Optimising over the directions is an additional computational constraint.
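
To make the slicing idea concrete, here is a minimal sketch of a Monte Carlo sliced p-Wasserstein distance between two samples. It is not the paper's procedure: it compares two raw samples of equal size rather than a model and a kernel-smoothed empirical distribution, and it averages over random directions instead of optimising over them. The function name and its arguments are hypothetical. It relies on the fact that, in one dimension and for equal sample sizes, the optimal coupling simply matches order statistics.

```python
import numpy as np

def sliced_wasserstein(x, y, n_dir=200, p=2, rng=None):
    """Monte Carlo sliced p-Wasserstein distance between samples x and y of shape (n, d)."""
    rng = np.random.default_rng() if rng is None else rng
    if x.shape != y.shape:
        raise ValueError("this sketch assumes two samples of identical shape (n, d)")
    # Random directions, uniform on the unit sphere
    dirs = rng.normal(size=(n_dir, x.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    # Project both samples on each direction and sort: columns hold order statistics
    px = np.sort(x @ dirs.T, axis=0)
    py = np.sort(y @ dirs.T, axis=0)
    # One-dimensional W_p^p per direction, then average over directions
    wpp = np.mean(np.abs(px - py) ** p, axis=0)
    return float(np.mean(wpp) ** (1.0 / p))

rng = np.random.default_rng(1)
x = rng.normal(size=(500, 2))
y_close = rng.normal(size=(500, 2))                         # same distribution as x
y_far = rng.normal(size=(500, 2)) + np.array([3.0, 0.0])    # shifted distribution

d_close = sliced_wasserstein(x, y_close, rng=rng)
d_far = sliced_wasserstein(x, y_far, rng=rng)
print("same distribution:", d_close)
print("shifted distribution:", d_far)
```

Each direction only costs a sort, which is the whole appeal of the sliced version compared with solving a full optimal transport problem.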

“To fit a finite GMM to the observed data, one is required to answer the following questions: 1) how to estimate the number of mixture components needed to represent the data, and 2) how to estimate the parameters of the mixture components.”

The paper contains a most puzzling comment opposing maximum likelihood estimation to minimum Wasserstein distance estimation on the basis that the latter would not suffer from multimodality. This sounds incorrect as the multimodality of a mixture model (likelihood) stems from the lack of identifiability of the parameters. If all permutations of these parameters induce exactly the same distribution, they all stand at the same distance from the data distribution, whatever the distance is. Furthermore, the above tartan-like picture clashes with the representation of the log-likelihood of a Normal mixture, as exemplified by the picture below based on a sample of size 150 with means 0 and 2, same unit variance, and weights 0.3 and 0.7, which shows a smooth if bimodal structure:

[picture: log-likelihood surface of the Normal mixture]

And for the same dataset, my attempt at producing a Wasserstein “energy landscape” does return a multimodal structure (this is the surface of minus the logarithm of the 2-Wasserstein distance):

[picture: surface of minus the logarithm of the 2-Wasserstein distance]

“Jin et al. proved that with random initialization, the EM algorithm will converge to a bad critical point with high probability.”

This statement is most curious in that the “probability” in the assessment must depend on the choice of the random initialisation, hence on a sort of prior distribution that is not made explicit in the paper. Which remains blissfully unaware of Bayesian approaches.

Another [minor mode] puzzling statement is that the p-Wasserstein distance is defined on the space of probability measures with finite p-th moment, which does not make much sense when what matters is rather the finiteness of the expectation of the distance d(X,Y) raised to the power p. A lot of the maths details either do not make sense or seem superfluous.
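
The identifiability argument made earlier in this post, namely that permuted parameters stand at the same distance from the data whatever the distance, can be checked numerically. The sketch below is illustrative only: it simulates its own dataset (size 150, means 0 and 2, unit variances, weights 0.3 and 0.7) and, for simplicity, swaps in the one-dimensional 1-Wasserstein distance, computed as the integral of |F − Fₙ| on a grid, rather than the 2-Wasserstein distance used for the picture. All function names are hypothetical.

```python
import math
import numpy as np

def mixture_pdf(x, w, mu1, mu2):
    """Density of w N(mu1, 1) + (1 - w) N(mu2, 1)."""
    phi = lambda z: np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    return w * phi(x - mu1) + (1.0 - w) * phi(x - mu2)

def mixture_cdf(x, w, mu1, mu2):
    """Cdf of the same mixture, through the error function."""
    Phi = lambda z: 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    return w * Phi(x - mu1) + (1.0 - w) * Phi(x - mu2)

def loglik(x, w, mu1, mu2):
    return float(np.sum(np.log(mixture_pdf(x, w, mu1, mu2))))

def w1_to_sample(x, w, mu1, mu2, grid):
    """1-Wasserstein distance to the empirical cdf, as a Riemann sum of |F - F_n|."""
    F = mixture_cdf(grid, w, mu1, mu2)
    Fn = np.searchsorted(np.sort(x), grid, side="right") / x.size
    return float(np.sum(np.abs(F - Fn)) * (grid[1] - grid[0]))

rng = np.random.default_rng(2)
n = 150
from_first = rng.random(n) < 0.3
x = np.where(from_first, rng.normal(0.0, 1.0, n), rng.normal(2.0, 1.0, n))
grid = np.linspace(-8.0, 10.0, 4001)

theta = (0.3, 0.0, 2.0)            # (weight, first mean, second mean)
theta_swapped = (0.7, 2.0, 0.0)    # same distribution, labels permuted
theta_other = (0.3, 2.0, 0.0)      # means exchanged but weights kept: a different distribution

for name, th in [("original", theta), ("swapped", theta_swapped), ("other", theta_other)]:
    print(name, loglik(x, *th), w1_to_sample(x, *th, grid))
```

The first two parameter values return the same log-likelihood and the same distance, up to rounding, since they index the very same distribution, while the third one does not.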

approximate likelihood

Posted in Books, Statistics on September 6, 2017 by xi'an

Today, I read a newly arXived paper by Stephen Gratton on a method called GLASS for General Likelihood Approximate Solution Scheme… The starting point is the same as with ABC or synthetic likelihood, namely a collection of summary statistics and an intractable likelihood. The author proposes to use as a substitute a maximum entropy solution based on these summary statistics and their assumed moments under the theoretical model. What is quite unclear in the paper is whether or not these assumed moments are available in closed form. If they are not, the method would appear as a variant of the synthetic likelihood [aka simulated moments] approach, meaning that the expectations of the summary statistics under the theoretical model and for a given value of the parameter are obtained through Monte Carlo approximations. (All the examples therein allow for closed form expressions.)
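
For comparison, the synthetic likelihood alternative mentioned above can be sketched in a few lines. This is not GLASS: it is a generic Gaussian synthetic likelihood in which the mean and covariance of the summaries at a given parameter value are estimated by simulation. The toy model (Normal data with unknown mean θ, summarised by the sample mean and log-variance) and all function names are hypothetical choices made for illustration.

```python
import numpy as np

def simulate_summaries(theta, n_obs, rng):
    """Hypothetical toy model: data ~ N(theta, 1), summaries = (mean, log variance)."""
    y = rng.normal(loc=theta, scale=1.0, size=n_obs)
    return np.array([y.mean(), np.log(y.var(ddof=1))])

def synthetic_loglik(theta, s_obs, n_obs=50, n_sim=500, rng=None):
    """Gaussian log-density of s_obs, with moments estimated by Monte Carlo at theta."""
    rng = np.random.default_rng() if rng is None else rng
    sims = np.array([simulate_summaries(theta, n_obs, rng) for _ in range(n_sim)])
    mu = sims.mean(axis=0)                 # simulated expectation of the summaries
    cov = np.cov(sims, rowvar=False)       # simulated covariance of the summaries
    diff = s_obs - mu
    _, logdet = np.linalg.slogdet(cov)
    quad = diff @ np.linalg.solve(cov, diff)
    return float(-0.5 * (quad + logdet + diff.size * np.log(2.0 * np.pi)))

rng = np.random.default_rng(3)
s_obs = simulate_summaries(1.0, 50, rng)   # pseudo-observed summaries, true theta = 1

ll_true = synthetic_loglik(1.0, s_obs, rng=rng)
ll_far = synthetic_loglik(3.0, s_obs, rng=rng)
print("synthetic log-likelihood at theta = 1:", ll_true)
print("synthetic log-likelihood at theta = 3:", ll_far)
```

When the moments are available in closed form, as in the paper's examples, the simulation step disappears and only the choice of the approximating density (Gaussian here, maximum entropy there) remains.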