## machine learning-based approach to likelihood-free inference

Posted in Statistics with tags , , , , , , , , , , , on March 3, 2017 by xi'an

At ABC’ory last week, Kyle Cranmer gave an extended talk on estimating the likelihood ratio by classification tools. Connected with a 2015 arXival. The idea is that the likelihood ratio is invariant by a transform s(.) that is monotonic with the likelihood ratio itself. It took me a few minutes (after the talk) to understand what this meant. Because it is a transform that actually depends on the parameter values in the denominator and the numerator of the ratio. For instance the ratio itself is a proper transform in the sense that the likelihood ratio based on the distribution of the likelihood ratio under both parameter values is the same as the original likelihood ratio. Or the (naïve Bayes) probability version of the likelihood ratio. Which reminds me of the invariance in Fearnhead and Prangle (2012) of the Bayes estimate given x and of the Bayes estimate given the Bayes estimate. I also feel there is a connection with Geyer’s logistic regression estimate of normalising constants mentioned several times on the ‘Og. (The paper mentions in the conclusion the connection with this problem.)

Now, back to the paper (which I read the night after the talk to get a global perspective on the approach), the ratio is of course unknown and the implementation therein is to estimate it by a classification method. Estimating thus the probability for a given x to be from one versus the other distribution. Once this estimate is produced, its distributions under both values of the parameter can be estimated by density estimation, hence an estimated likelihood ratio be produced. With better prospects since this is a one-dimensional quantity. An objection to this derivation is that it intrinsically depends on the pair of parameters θ¹ and θ² used therein. Changing to another pair requires a new ratio, new simulations, and new density estimations. When moving to a continuous collection of parameter values, in a classical setting, the likelihood ratio involves two maxima, which can be formally represented in (3.3) as a maximum over a likelihood ratio based on the estimated densities of likelihood ratios, except that each evaluation of this ratio seems to require another simulation. (Which makes the comparison with ABC more complex than presented in the paper [p.18], since ABC major computational hurdle lies in the production of the reference table and to a lesser degree of the local regression, both items that can be recycled for any new dataset.) A smoothing step is then to include the pair of parameters θ¹ and θ² as further inputs of the classifier.  There still remains the computational burden of simulating enough values of s(x) towards estimating its density for every new value of θ¹ and θ². And while the projection from x to s(x) does effectively reduce the dimension of the problem to one, the method still aims at estimating with some degree of precision the density of x, so cannot escape the curse of dimensionality. The sleight of hand resides in the classification step, since it is equivalent to estimating the likelihood ratio. I thus fail to understand how and why a poor classifier can then lead to a good approximations of the likelihood ratio “obtained by calibrating s(x)” (p.16). Where calibrating means estimating the density.

## approximate lasso

Posted in pictures, R, Statistics with tags , , , on October 2, 2016 by xi'an

Here is a representation of the precision of a kernel density estimate (second axis) against the true value of the density (first axis), which looks like a lasso of sorts, hence the title. I am not sure this tells much, except that the estimated values are close to the true values and that a given value of f(x) is associated with two different estimates, predictably…

## local kernel reduction for ABC

Posted in Books, pictures, Statistics, University life with tags , , , , , on September 14, 2016 by xi'an

“…construction of low dimensional summary statistics can be performed as in a black box…”

Today Zhou and Fukuzumi just arXived a paper that proposes a gradient-based dimension reduction for ABC summary statistics, in the spirit of RKHS kernels as advocated, e.g., by Arthur Gretton. Here the projection is a mere linear projection Bs of the vector of summary statistics, s, where B is an estimated Hessian matrix associated with the posterior expectation E[θ|s]. (There is some connection with the latest version of Li’s and Fearnhead’s paper on ABC convergence as they also define a linear projection of the summary statistics, based on asymptotic arguments, although their matrix does depend on the true value of the parameter.) The linearity sounds like a strong restriction [to me] especially when the summary statistics have no reason to belong to a vectorial space and thus be open to changes of bases and linear projections. For instance, a specific value taken by a summary statistic, like 0 say, may be more relevant than the range of their values. On a larger scale, I am doubtful about always projecting a vector of summary statistics on a subspace with the smallest possible dimension, ie the dimension of θ. In practical settings, it seems impossible to derive the optimal projection and a subvector is almost certain to loose information against a larger vector.

“Another proposal is to use different summary statistics for different parameters.”

Which is exactly what we did in our random forest estimation paper. Using a different forest for each parameter of interest (but no real tree was damaged in the experiment!).

## Bayesian Indirect Inference and the ABC of GMM

Posted in Books, Statistics, University life with tags , , , , , , , , , , on February 17, 2016 by xi'an

“The practicality of estimation of a complex model using ABC is illustrated by the fact that we have been able to perform 2000 Monte Carlo replications of estimation of this simple DSGE model, using a single 32 core computer, in less than 72 hours.” (p.15)

Earlier this week, Michael Creel and his coauthors arXived a long paper with the above title, where ABC relates to approximate Bayesian computation. In short, this paper provides deeper theoretical foundations for the local regression post-processing of Mark Beaumont and his coauthors (2002). And some natural extensions. But apparently considering one univariate transform η(θ) of interest at a time. The theoretical validation of the method is that the resulting estimators converge at speed √n under some regularity assumptions. Including the identifiability of the parameter θ in the mean of the summary statistics T, which relates to our consistency result for ABC model choice. And a CLT on an available (?) preliminary estimator of η(θ).

The paper also includes a GMM version of ABC which appeal is less clear to me as it seems to rely on a preliminary estimator of the univariate transform of interest η(θ). Which is then randomized by a normal random walk. While this sounds a wee bit like noisy ABC, it differs from this generic approach as the model is not assumed to be known, but rather available through an asymptotic Gaussian approximation. (When the preliminary estimator is available in closed form, I do not see the appeal of adding this superfluous noise. When it is unavailable, it is unclear why a normal perturbation can be produced.)

“[In] the method we study, the estimator is consistent, asymptotically normal, and asymptotically as efficient as a limited information maximum likelihood estimator. It does not require either optimization, or MCMC, or the complex evaluation of the likelihood function.” (p.3)

Overall, I have trouble relating the paper to (my?) regular ABC in that the outcome of the supported procedures is an estimator rather than a posterior distribution. Those estimators are demonstrably endowed with convergence properties, including quantile estimates that can be exploited for credible intervals, but this does not produce a posterior distribution in the classical Bayesian sense. For instance, how can one run model comparison in this framework? Furthermore, each of those inferential steps requires solving another possibly costly optimisation problem.

“Posterior quantiles can also be used to form valid confidence intervals under correct model specification.” (p.4)

Nitpicking(ly), this statement is not correct in that posterior quantiles produce valid credible intervals and only asymptotically correct confidence intervals!

“A remedy is to choose the prior π(θ) iteratively or adaptively as functions of initial estimates of θ, so that the “prior” becomes dependent on the data, which can be denoted as π(θ|T).” (p.6)

This modification of the basic ABC scheme relying on simulation from the prior π(θ) can be found in many earlier references and the iterative construction of a better fitted importance function rather closely resembles ABC-PMC. Once again nitpicking(ly), the importance weights are defined therein (p.6) as the inverse of what they should be.

## estimation of deformations of densities

Posted in R, Statistics, University life with tags , , , , on May 22, 2014 by xi'an

Today, Jean-Michel Loubes from Toulouse gave a seminar in Dauphine on the estimation of deformations using Wassertsein distances. This is functional data analysis, where samples from random transforms of the original density are observed towards estimating the baseline (or true) measure

$\mu_i=\varphi_i(\mu)$

As a neophyte, I found the problem of interest if difficult to evaluate, in particular wrt the identifiability of μ. Esp. when the distribution of the transform φ is unknown. I also wondered about the choice of means over medians, because of the added robustness of the later… In a possible connection with David Dunson’s median estimate of densities. I ran the following simulation based on 150 (centred) location-scale transforms of a normal mixture [in red] with the median of the 150 density estimators [in blue]. It is not such a poor estimate! Now, the problem itself could have implications in ABC where we have replicas of random versions of the ABC density. For instance, DIYABC produces a few copies of the ABC posteriors for the parameters of the model. Jean-Michel also mentioned  connection with transport problems.

## Approximate Integrated Likelihood via ABC methods

Posted in Books, Statistics, University life with tags , , , , , , , , on March 13, 2014 by xi'an

My PhD student Clara Grazian just arXived this joint work with Brunero Liseo on using ABC for marginal density estimation. The idea in this paper is to produce an integrated likelihood approximation in intractable problems via the ratio

$L(\psi|x)\propto \dfrac{\pi(\psi|x)}{\pi(\psi)}$

both terms in the ratio being estimated from simulations,

$\hat L(\psi|x) \propto \dfrac{\hat\pi^\text{ABC}(\psi|x)}{\hat\pi(\psi)}$

(with possible closed form for the denominator). Although most of the examples processed in the paper (Poisson means ratio, Neyman-Scott’s problem, g-&-k quantile distribution, semi-parametric regression) rely on summary statistics, hence de facto replacing the numerator above with a pseudo-posterior conditional on those summaries, the approximation remains accurate (for those examples). In the g-&-k quantile example, Clara and Brunero compare our ABC-MCMC algorithm with the one of Allingham et al. (2009, Statistics & Computing): the later does better by not replicating values in the Markov chain but instead proposing a new value until it is accepted by the usual Metropolis step. (Although I did not spend much time on this issue, I cannot see how both approaches could be simultaneously correct. Even though the outcomes do not look very different.) As noted by the authors, “the main drawback of the present approach is that it requires the use of proper priors”, unless the marginalisation of the prior can be done analytically. (This is an interesting computational problem: how to provide an efficient approximation to a marginal density of a σ-finite measure, assuming this density exists.)

Clara will give a talk at CREST-ENSAE today about this work, in the Bayes in Paris seminar: 2pm in room 18.

## parallel MCMC via Weirstrass sampler (a reply by Xiangyu Wang)

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , on January 3, 2014 by xi'an

Almost immediately after I published my comments on his paper with David Dunson, Xiangyu Wang sent a long comment that I think worth a post on its own (especially, given that I am now busy skiing and enjoying Chamonix!). So here it is:

Thanks for the thoughtful comments. I did not realize that Neiswanger et al. also proposed the similar trick to avoid combinatoric problem as we did for the rejection sampler. Thank you for pointing that out.

For the criticism 3 on the tail degeneration, we did not mean to fire on the non-parametric estimation issues, but rather the problem caused by using the product equation. When two densities are multiplied together, the accuracy of the product mainly depends on the tail of the two densities (the overlapping area), if there are more than two densities, the impact will be more significant. As a result, it may be unwise to directly use the product equation, as the most distant sub-posteriors could be potentially very far away from each other, and most of the sub posterior draws are outside the overlapping area. (The full Gibbs sampler formulated in our paper does not have this issue, as shown in equation 5, there is a common part multiplied on each sub-posterior, which brought them close.)

Point 4 stated the problem caused by averaging. The approximated density follows Neiswanger et al. (2013) will be a mixture of Gaussian, whose component means are the average of the sub-posterior draws. Therefore, if sub-posteriors stick to different modes (assuming the true posterior is multi-modal), then the approximated density is likely to mess up the modes, and produce some faked modes (eg. average of the modes. We provide an example in the simulation 3.)

Sorry for the vague description of the refining method (4.2). The idea is kinda dull. We start from an initial approximation to θ and then do one step Gibbs update to obtain a new θ, and we call this procedure ‘refining’, as we believe such process would bring the original approximation closer to the true posterior distribution.

The first (4.1) and the second (4.2) algorithms do seem weird to be called as ‘parallel’, since they are both modified from the Gibbs sampler described in (4) and (5). The reason we want to propose these two algorithms is to overcome two problems. The first is the dimensionality curse, and the second is the issue when the subset inferences are not extremely accurate (subset effective sample size small) which might be a common scenario for logistic regression (with large parameters) even with huge data set. First, algorithm (4.1) and (4.2) both start from some initial approximations, and attempt to improve to obtain a better approximation, thus avoid the dimensional issue. Second, in our simulation 1, we attempt to pull down the performance of the simple averaging by worsening the sub-posterior performance (we allocate smaller amount of data to each subset), and the non-parametric method fails to approximate the combined density as well. However, the algorithm 4.1 and 4.2 still work in this case.

I have some problem with the logistic regression example provided in Neiswanger et al. (2013). As shown in the paper, under the authors’ setting (not fully specified in the paper), though the non-parametric method is better than simple averaging, the approximation error of simple averaging is small enough for practical use (I also have some problem with their error evaluation method), then why should we still bother to use a much more complicated method?

Actually I’m adding a new algorithm into the Weierstrass rejection sampling, which will render it thoroughly free from the dimensionality curse of p. The new scheme is applicable to the nonparametric method in Neiswanger et al. (2013) as well. It should appear soon in the second version of the draft.