machine learning-based approach to likelihood-free inference

Posted in Statistics with tags , , , , , , , , , , , on March 3, 2017 by xi'an

At ABC’ory last week, Kyle Cranmer gave an extended talk on estimating the likelihood ratio by classification tools. Connected with a 2015 arXival. The idea is that the likelihood ratio is invariant by a transform s(.) that is monotonic with the likelihood ratio itself. It took me a few minutes (after the talk) to understand what this meant. Because it is a transform that actually depends on the parameter values in the denominator and the numerator of the ratio. For instance the ratio itself is a proper transform in the sense that the likelihood ratio based on the distribution of the likelihood ratio under both parameter values is the same as the original likelihood ratio. Or the (naïve Bayes) probability version of the likelihood ratio. Which reminds me of the invariance in Fearnhead and Prangle (2012) of the Bayes estimate given x and of the Bayes estimate given the Bayes estimate. I also feel there is a connection with Geyer’s logistic regression estimate of normalising constants mentioned several times on the ‘Og. (The paper mentions in the conclusion the connection with this problem.)

Now, back to the paper (which I read the night after the talk to get a global perspective on the approach), the ratio is of course unknown and the implementation therein is to estimate it by a classification method. Estimating thus the probability for a given x to be from one versus the other distribution. Once this estimate is produced, its distributions under both values of the parameter can be estimated by density estimation, hence an estimated likelihood ratio be produced. With better prospects since this is a one-dimensional quantity. An objection to this derivation is that it intrinsically depends on the pair of parameters θ¹ and θ² used therein. Changing to another pair requires a new ratio, new simulations, and new density estimations. When moving to a continuous collection of parameter values, in a classical setting, the likelihood ratio involves two maxima, which can be formally represented in (3.3) as a maximum over a likelihood ratio based on the estimated densities of likelihood ratios, except that each evaluation of this ratio seems to require another simulation. (Which makes the comparison with ABC more complex than presented in the paper [p.18], since ABC major computational hurdle lies in the production of the reference table and to a lesser degree of the local regression, both items that can be recycled for any new dataset.) A smoothing step is then to include the pair of parameters θ¹ and θ² as further inputs of the classifier.  There still remains the computational burden of simulating enough values of s(x) towards estimating its density for every new value of θ¹ and θ². And while the projection from x to s(x) does effectively reduce the dimension of the problem to one, the method still aims at estimating with some degree of precision the density of x, so cannot escape the curse of dimensionality. The sleight of hand resides in the classification step, since it is equivalent to estimating the likelihood ratio. I thus fail to understand how and why a poor classifier can then lead to a good approximations of the likelihood ratio “obtained by calibrating s(x)” (p.16). Where calibrating means estimating the density.

communication-efficient distributed statistical learning

Posted in Books, Statistics, University life with tags , , , , , , , , on June 10, 2016 by xi'an

Michael Jordan, Jason Lee, and Yun Yang just arXived a paper with their proposal on handling large datasets through distributed computing, thus contributing to the currently very active research topic of approximate solutions in large Bayesian models. The core of the proposal is summarised by the screenshot above, where the approximate likelihood replaces the exact likelihood with a first order Taylor expansion. The first term is the likelihood computed for a given subsample (or a given thread) at a ratio of one to N and the difference of the gradients is only computed once at a good enough guess. While the paper also considers M-estimators and non-Bayesian settings, the Bayesian part thus consists in running a regular MCMC when the log-target is approximated by the above. I first thought this proposal amounted to a Gaussian approximation à la Simon Wood or to an INLA approach but this is not the case: the first term of the approximate likelihood is exact and hence can be of any form, while the scalar product is linear in θ, providing a sort of first order approximation, albeit frozen at the chosen starting value.

Assuming that each block of the dataset is stored on a separate machine, I think the approach could further be implemented in parallel, running N MCMC chains and comparing the output. With a post-simulation summary stemming from the N empirical distributions thus produced. I also wonder how the method would perform outside the fairly smooth logistic regression case, where the single sample captures well-enough the target. The picture above shows a minor gain in a misclassification rate that is already essentially zero.

Mathematical underpinnings of Analytics (theory and applications)

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , , , , on September 25, 2015 by xi'an

“Today, a week or two spent reading Jaynes’ book can be a life-changing experience.” (p.8)

I received this book by Peter Grindrod, Mathematical underpinnings of Analytics (theory and applications), from Oxford University Press, quite a while ago. (Not that long ago since the book got published in 2015.) As a book for review for CHANCE. And let it sit on my desk and in my travel bag for the same while as it was unclear to me that it was connected with Statistics and CHANCE. What is [are?!] analytics?! I did not find much of a definition of analytics when I at last opened the book, and even less mentions of statistics or machine-learning, but Wikipedia told me the following:

“Analytics is a multidimensional discipline. There is extensive use of mathematics and statistics, the use of descriptive techniques and predictive models to gain valuable knowledge from data—data analysis. The insights from data are used to recommend action or to guide decision making rooted in business context. Thus, analytics is not so much concerned with individual analyses or analysis steps, but with the entire methodology.”

Barring the absurdity of speaking of a “multidimensional discipline” [and even worse of linking with the mathematical notion of dimension!], this tells me analytics is a mix of data analysis and decision making. Hence relying on (some) statistics. Fine.

“Perhaps in ten years, time, the mathematics of behavioural analytics will be common place: every mathematics department will be doing some of it.”(p.10)

First, and to start with some positive words (!), a book that quotes both Friedrich Nietzsche and Patti Smith cannot get everything wrong! (Of course, including a most likely apocryphal quote from the now late Yogi Berra does not partake from this category!) Second, from a general perspective, I feel the book meanders its way through chapters towards a higher level of statistical consciousness, from graphs to clustering, to hidden Markov models, without precisely mentioning statistics or statistical model, while insisting very much upon Bayesian procedures and Bayesian thinking. Overall, I can relate to most items mentioned in Peter Grindrod’s book, but mostly by first reconstructing the notions behind. While I personally appreciate the distanced and often ironic tone of the book, reflecting upon the author’s experience in retail modelling, I am thus wondering at which audience Mathematical underpinnings of Analytics aims, for a practitioner would have a hard time jumping the gap between the concepts exposed therein and one’s practice, while a theoretician would require more formal and deeper entries on the topics broached by the book. I just doubt this entry will be enough to lead maths departments to adopt behavioural analytics as part of their curriculum… Continue reading

Leave the Pima Indians alone!

Posted in Books, R, Statistics, University life with tags , , , , , , , , , , , , , , , , on July 15, 2015 by xi'an

“…our findings shall lead to us be critical of certain current practices. Specifically, most papers seem content with comparing some new algorithm with Gibbs sampling, on a few small datasets, such as the well-known Pima Indians diabetes dataset (8 covariates). But we shall see that, for such datasets, approaches that are even more basic than Gibbs sampling are actually hard to beat. In other words, datasets considered in the literature may be too toy-like to be used as a relevant benchmark. On the other hand, if ones considers larger datasets (with say 100 covariates), then not so many approaches seem to remain competitive” (p.1)

Nicolas Chopin and James Ridgway (CREST, Paris) completed and arXived a paper they had “threatened” to publish for a while now, namely why using the Pima Indian R logistic or probit regression benchmark for checking a computational algorithm is not such a great idea! Given that I am definitely guilty of such a sin (in papers not reported in the survey), I was quite eager to read the reasons why! Beyond the debate on the worth of such a benchmark, the paper considers a wider perspective as to how Bayesian computation algorithms should be compared, including the murky waters of CPU time versus designer or programmer time. Which plays against most MCMC sampler.

As a first entry, Nicolas and James point out that the MAP can be derived by standard a Newton-Raphson algorithm when the prior is Gaussian, and even when the prior is Cauchy as it seems most datasets allow for Newton-Raphson convergence. As well as the Hessian. We actually took advantage of this property in our comparison of evidence approximations published in the Festschrift for Jim Berger. Where we also noticed the awesome performances of an importance sampler based on the Gaussian or Laplace approximation. The authors call this proposal their gold standard. Because they also find it hard to beat. They also pursue this approximation to its logical (?) end by proposing an evidence approximation based on the above and Chib’s formula. Two close approximations are provided by INLA for posterior marginals and by a Laplace-EM for a Cauchy prior. Unsurprisingly, the expectation-propagation (EP) approach is also implemented. What EP lacks in theoretical backup, it seems to recover in sheer precision (in the examples analysed in the paper). And unsurprisingly as well the paper includes a randomised quasi-Monte Carlo version of the Gaussian importance sampler. (The authors report that “the improvement brought by RQMC varies strongly across datasets” without elaborating for the reasons behind this variability. They also do not report the CPU time of the IS-QMC, maybe identical to the one for the regular importance sampling.) Maybe more surprising is the absence of a nested sampling version.

In the Markov chain Monte Carlo solutions, Nicolas and James compare Gibbs, Metropolis-Hastings, Hamiltonian Monte Carlo, and NUTS. Plus a tempering SMC, All of which are outperformed by importance sampling for small enough datasets. But get back to competing grounds for large enough ones, since importance sampling then fails.

“…let’s all refrain from now on from using datasets and models that are too simple to serve as a reasonable benchmark.” (p.25)

This is a very nice survey on the theme of binary data (more than on the comparison of algorithms in that the authors do not really take into account design and complexity, but resort to MSEs versus CPus). I however do not agree with their overall message to leave the Pima Indians alone. Or at least not for the reason provided therein, namely that faster and more accurate approximations methods are available and cannot be beaten. Benchmarks always have the limitation of “what you get is what you see”, i.e., the output associated with a single dataset that only has that many idiosyncrasies. Plus, the closeness to a perfect normal posterior makes the logistic posterior too regular to pause a real challenge (even though MCMC algorithms are as usual slower than iid sampling). But having faster and more precise resolutions should on the opposite be  cause for cheers, as this provides a reference value, a golden standard, to check against. In a sense, for every Monte Carlo method, there is a much better answer, namely the exact value of the integral or of the optimum! And one is hardly aiming at a more precise inference for the benchmark itself: those Pima Indians [whose actual name is Akimel O’odham] with diabetes involved in the original study are definitely beyond help from statisticians and the model is unlikely to carry out to current populations. When the goal is to compare methods, as in our 2009 paper for Jim Berger’s 60th birthday, what matters is relative speed and relative ease of implementation (besides the obvious convergence to the proper target). In that sense bigger and larger is not always relevant. Unless one tackles really big or really large datasets, for which there is neither benchmark method nor reference value.

ABC model choice via random forests [expanded]

Posted in Statistics, University life with tags , , , , , , , , , , , on October 1, 2014 by xi'an

Today, we arXived a second version of our paper on ABC model choice with random forests. Or maybe [A]BC model choice with random forests. Since the random forest is built on a simulation from the prior predictive and no further approximation is used in the process. Except for the computation of the posterior [predictive] error rate. The update wrt the earlier version is that we ran massive simulations throughout the summer, on existing and new datasets. In particular, we have included a Human dataset extracted from the 1000 Genomes Project. Made of 51,250 SNP loci. While this dataset is not used to test new evolution scenarios, we compared six out-of-Africa scenarios, with a possible admixture for Americans of African ancestry. The scenario selected by a random forest procedure posits a single out-of-Africa colonization event with a secondary split into a European and an East Asian population lineages, and a recent genetic admixture between African and European lineages, for Americans of African origin. The procedure reported a high level of confidence since the estimated posterior error rate is equal to zero. The SNP loci were carefully selected using the following criteria: (i) all individuals have a genotype characterized by a quality score (GQ)>10, (ii) polymorphism is present in at least one of the individuals in order to fit the SNP simulation algorithm of Hudson (2002) used in DIYABC V2 (Cornuet et al., 2014), (iii) the minimum distance between two consecutive SNPs is 1 kb in order to minimize linkage disequilibrium between SNP, and (iv) SNP loci showing significant deviation from Hardy-Weinberg equilibrium at a 1% threshold in at least one of the four populations have been removed.

In terms of random forests, we optimised the size of the bootstrap subsamples for all of our datasets. While this optimisation requires extra computing time, it is negligible when compared with the enormous time taken by a logistic regression, which is [yet] the standard ABC model choice approach. Now the data has been gathered, it is only a matter of days before we can send the paper to a journal

Statistics second slides

Posted in Books, Kids, Statistics, University life with tags , , , , , on September 24, 2014 by xi'an

This is the next chapter of my Statistics course, definitely more standard, with some notions on statistical models, limit theorems, and exponential families. In the first class, I recalled the convergence notions with no proof but counterexamples and spend some time on a slide not included here, borrowed from Chris Holmes’ talk last Friday on the linear relation between blood pressure and the log odds ratio of an heart condition. This was a great example, both to illustrate the power of increasing the number of observations and of using a logistic regression model. Students kept asking questions about it.

the Poisson transform

Posted in Books, Kids, pictures, Statistics, University life with tags , , , , , , , on June 19, 2014 by xi'an

In obvious connection with an earlier post on the “estimation” of normalising constants, Simon Barthelmé and Nicolas Chopin just arXived a paper on The Poisson transform for unormalised statistical models. Obvious connection because I heard of the Guttmann and Hyvärinen (2012) paper when Simon came to CREST to give a BiP talk on this paper a few weeks ago. (A connected talk he gave in Banff is available as a BIRS video.)

Without getting too much into details, the neat idea therein is to turn the observed likelihood

$\sum_{i=1}^n f(x_i|\theta) - n \log \int \exp f(x|\theta) \text{d}x$

into a joint likelihood

$\sum_{i=1}^n[f(x_i|\theta)+\nu]-n\int\exp[f(x|\theta)+\nu]\text{d}x$

which is the likelihood of a Poisson point process with intensity function

$\exp\{ f(x|\theta) + \nu +\log n\}$

This is an alternative model in that the original likelihood does not appear as a marginal of the above. Only the modes coincide, with the conditional mode in ν providing the normalising constant. In practice, the above Poisson process likelihood is unavailable and Guttmann and Hyvärinen (2012) offer an approximation by means of their logistic regression.

Unavailable likelihoods inevitably make me think of ABC. Would ABC solutions be of interest there? In particular, could the Poisson point process be simulated with no further approximation? Since the “true” likelihood is not preserved by this representation, similar questions to those found in ABC arise, like a measure of departure from the “true” posterior. Looking forward the Bayesian version! (Marginalia: Siméon Poisson died in Sceaux, which seemed to have attracted many mathematicians at the time, since Cauchy also spent part of his life there…)