machine learning-based approach to likelihood-free inference

Posted in Statistics with tags , , , , , , , , , , , on March 3, 2017 by xi'an

At ABC’ory last week, Kyle Cranmer gave an extended talk on estimating the likelihood ratio by classification tools. Connected with a 2015 arXival. The idea is that the likelihood ratio is invariant by a transform s(.) that is monotonic with the likelihood ratio itself. It took me a few minutes (after the talk) to understand what this meant. Because it is a transform that actually depends on the parameter values in the denominator and the numerator of the ratio. For instance the ratio itself is a proper transform in the sense that the likelihood ratio based on the distribution of the likelihood ratio under both parameter values is the same as the original likelihood ratio. Or the (naïve Bayes) probability version of the likelihood ratio. Which reminds me of the invariance in Fearnhead and Prangle (2012) of the Bayes estimate given x and of the Bayes estimate given the Bayes estimate. I also feel there is a connection with Geyer’s logistic regression estimate of normalising constants mentioned several times on the ‘Og. (The paper mentions in the conclusion the connection with this problem.)

Now, back to the paper (which I read the night after the talk to get a global perspective on the approach), the ratio is of course unknown and the implementation therein is to estimate it by a classification method. Estimating thus the probability for a given x to be from one versus the other distribution. Once this estimate is produced, its distributions under both values of the parameter can be estimated by density estimation, hence an estimated likelihood ratio be produced. With better prospects since this is a one-dimensional quantity. An objection to this derivation is that it intrinsically depends on the pair of parameters θ¹ and θ² used therein. Changing to another pair requires a new ratio, new simulations, and new density estimations. When moving to a continuous collection of parameter values, in a classical setting, the likelihood ratio involves two maxima, which can be formally represented in (3.3) as a maximum over a likelihood ratio based on the estimated densities of likelihood ratios, except that each evaluation of this ratio seems to require another simulation. (Which makes the comparison with ABC more complex than presented in the paper [p.18], since ABC major computational hurdle lies in the production of the reference table and to a lesser degree of the local regression, both items that can be recycled for any new dataset.) A smoothing step is then to include the pair of parameters θ¹ and θ² as further inputs of the classifier.  There still remains the computational burden of simulating enough values of s(x) towards estimating its density for every new value of θ¹ and θ². And while the projection from x to s(x) does effectively reduce the dimension of the problem to one, the method still aims at estimating with some degree of precision the density of x, so cannot escape the curse of dimensionality. The sleight of hand resides in the classification step, since it is equivalent to estimating the likelihood ratio. I thus fail to understand how and why a poor classifier can then lead to a good approximations of the likelihood ratio “obtained by calibrating s(x)” (p.16). Where calibrating means estimating the density.

Umingmak Nuna [jatp]

Posted in Statistics with tags , , , , , , , , , , , on February 23, 2017 by xi'an

ABC’ory in Banff [17w5025]

Posted in Statistics with tags , , , , on February 22, 2017 by xi'an

ABC with kernelised regression

Posted in Mountains, pictures, Statistics, Travel, University life with tags , , , , , , , , , , , on February 22, 2017 by xi'an

The exact title of the paper by Jovana Metrovic, Dino Sejdinovic, and Yee Whye Teh is DR-ABC: Approximate Bayesian Computation with Kernel-Based Distribution Regression. It appeared last year in the proceedings of ICML.  The idea is to build ABC summaries by way of reproducing kernel Hilbert spaces (RKHS). Regressing such embeddings to the “optimal” choice of summary statistics by kernel ridge regression. With a possibility to derive summary statistics for quantities of interest rather than for the entire parameter vector. The use of RKHS reminds me of Arthur Gretton’s approach to ABC, although I see no mention made of that work in the current paper.

In the RKHS pseudo-linear formulation, the prediction of a parameter value given a sample attached to this value looks like a ridge estimator in classical linear estimation. (I thus wonder at why one would stop at the ridge stage instead of getting the full Bayes treatment!) Things get a bit more involved in the case of parameters (and observations) of interest, as the modelling requires two RKHS, because of the conditioning on the nuisance observations. Or rather three RHKS. Since those involve a maximum mean discrepancy between probability distributions, which define in turn a sort of intrinsic norm, I also wonder at a Wasserstein version of this approach.

What I find hard to understand in the paper is how a large-dimension large-size sample can be managed by such methods with no visible loss of information and no explosion of the computing budget. The authors mention Fourier features, which never rings a bell for me, but I wonder how this operates in a general setting, i.e., outside the iid case. The examples do not seem to go into enough details for me to understand how this massive dimension reduction operates (and they remain at a moderate level in terms of numbers of parameters). I was hoping Jovana Mitrovic could present her work here at the 17w5025 workshop but she sadly could not make it to Banff for lack of funding!

ABC’ory in Banff [17w5025]

Posted in Mountains, pictures, Statistics, Travel, University life with tags , , , , , , , , , on February 21, 2017 by xi'an

The ABC workshop I co-organised has now started and, despite a few last minutes cancellations, we have gathered a great crowd of researchers on the validation and expansion of ABC methods. Or ABC’ory to keep up with my naming of workshops. The videos of the talks should come up progressively on the BIRS webpage. When I did not forget to launch the recording. The program is quite open and with this size of workshop allows for talks and discussions to last longer than planned: the first days contain several expository talks on ABC convergence, auxiliary or synthetic models, summary constructions, challenging applications, dynamic models, and model assessment. Plus prepared discussions on those topics that hopefully involve several workshop participants. We had also set some time for snap-talks, to induce everyone to give a quick presentation of one’s on-going research and open problems. The first day was rather full but saw a lot of interactions and discussions during and around the talks, a mood I hope will last till Friday! Today in replacement of Richard Everitt who alas got sick just before the workshop, we are conducting a discussion on dimensional issues, part of which is made of parts of the following slides (mostly recycled from earlier talks, including the mini-course in Les Diablerets):