## scaling the Gibbs posterior credible regions

Posted in Books, Statistics, University life on September 11, 2015 by xi'an

“The challenge in implementation of the Gibbs posterior is that it depends on an unspecified scale (or inverse temperature) parameter.”

A new paper by Nick Syring and Ryan Martin was arXived today on the same topic as the one I discussed last January. The setting is the same as with empirical likelihood, namely that the distribution of the data is not specified, while parameters of interest are defined via moments or, more generally, as minimising a loss function. A pseudo-likelihood can then be constructed as a substitute for the likelihood, in the spirit of Bissiri et al. (2013). It is called a “Gibbs posterior” distribution in this paper. So the “Gibbs” in the title has no link with the “Gibbs” in Gibbs sampler, since inference is conducted with respect to this pseudo-posterior. Somewhat logically (!), as n grows to infinity, the pseudo-posterior concentrates upon the pseudo-true value of θ minimising the expected loss, hence asymptotically resembles the M-estimator associated with this criterion. As I pointed out in the discussion of Bissiri et al. (2013), one major hurdle when turning a loss into a log-likelihood is that it is at best defined up to a scale factor ω. The authors choose ω so that the Gibbs posterior

$\exp\{-\omega n l_n(\theta,x) \}\pi(\theta)$

is well-calibrated, where $l_n$ is the empirical averaged loss. So the Gibbs posterior is part of the matching prior collection. In practice the authors calibrate ω by an iterative stochastic optimisation process, with bootstrap on the side to evaluate coverage. They briefly consider empirical likelihood as an alternative, on a median regression example, where they show that their “Gibbs confidence intervals (…) are clearly the best” (p.12). Apart from the relevance of being “well-calibrated”, the asymptotic nature of the results, and the dependence on the parameterisation via the loss function, one may also question the possibility of using this approach in large dimensional cases where all of or none of the parameters are of interest.
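To see why the scale ω matters, here is a minimal sketch (not the authors' code) of the Gibbs posterior for a median, assuming an absolute-error loss, a flat prior on a grid, and simulated Gaussian data. The function name, the grid and the values of ω are all hypothetical choices for illustration. The calibration step of the paper (stochastic optimisation plus bootstrap) is not reproduced.

```python
import math
import random


def gibbs_posterior_interval(data, omega, grid, level=0.95):
    """Equal-tailed credible interval of exp{-omega * n * l_n(theta)} * pi(theta),
    with pi flat on the grid and l_n the empirical mean absolute loss."""
    n = len(data)
    # log of the unnormalised Gibbs posterior at each grid point
    log_w = [-omega * n * (sum(abs(x - t) for x in data) / n) for t in grid]
    m = max(log_w)  # subtract the maximum for numerical stability
    w = [math.exp(v - m) for v in log_w]
    total = sum(w)
    lo_p, hi_p = (1 - level) / 2, 1 - (1 - level) / 2
    cum, lo, hi, found_lo = 0.0, grid[0], grid[-1], False
    for t, wi in zip(grid, w):
        cum += wi / total
        if not found_lo and cum >= lo_p:
            lo, found_lo = t, True
        if cum >= hi_p:
            hi = t
            break
    return lo, hi


random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(100)]  # simulated data, true median 0
grid = [-2 + 4 * i / 2000 for i in range(2001)]

widths = {}
for omega in (0.5, 1.0, 2.0, 4.0):
    lo, hi = gibbs_posterior_interval(data, omega, grid)
    widths[omega] = hi - lo
    print(f"omega={omega}: 95% interval = ({lo:.3f}, {hi:.3f})")
```

The interval shrinks as ω grows, so its frequentist coverage is entirely driven by ω, which is why some calibration rule is needed.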

## ABC and cosmology

Posted in Books, pictures, Statistics, University life on May 4, 2015 by xi'an

Two papers appeared on arXiv in the past two days with the similar theme of applying ABC-PMC [one version of which we developed with Mark Beaumont, Jean-Marie Cornuet, and Jean-Michel Marin in 2009] to cosmological problems. (As a further coincidence, I had just started refereeing yet another paper on ABC-PMC in another astronomy problem!) The first paper, cosmoabc: Likelihood-free inference via Population Monte Carlo Approximate Bayesian Computation by Ishida et al. [“et al” including Ewan Cameron], proposes a Python ABC-PMC sampler with applications to galaxy cluster catalogues. The paper is primarily a description of the cosmoabc package, including code snapshots. Earlier occurrences of ABC in cosmology are found for instance in this earlier workshop, as well as in Cameron and Pettitt's earlier paper. The package offers a way to evaluate the impact of a specific distance, with a 2D-graph demonstrating that the minimum [if not the range] of the simulated distances increases as the parameters move away from the best parameter values.

“We emphasis [sic] that the choice of the distance function is a crucial step in the design of the ABC algorithm and the reader must check its properties carefully before any ABC implementation is attempted.” E.E.O. Ishida et al.

The second [by one day] paper, Approximate Bayesian computation for forward modelling in cosmology by Akeret et al., also proposes a Python ABC-PMC sampler, abcpmc. With fairly similar explanations: maybe both samplers should be compared on a reference dataset. While I first thought the description of the algorithm was rather close to our version, including the choice of the empirical covariance matrix with the factor 2, it appears it is adapted from a tutorial in the Journal of Mathematical Psychology by Turner and van Zandt. One out of many tutorials and surveys on the ABC method, of which I was unaware, but which summarises the pre-2012 developments rather nicely. Except for missing Paul Fearnhead’s and Dennis Prangle’s semi-automatic Read Paper. In the abcpmc paper, the update of the covariance matrix is the one proposed by Sarah Filippi and co-authors, which includes an extra bias term for faraway particles.
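For readers unfamiliar with ABC-PMC, a minimal one-dimensional sketch follows. It uses neither the cosmoabc nor the abcpmc API: the toy simulator, the uniform prior, the tolerance schedule and all function names are assumptions made for illustration. The Gaussian kernel variance is set to twice the weighted empirical variance of the previous population, the "factor 2" mentioned above.

```python
import math
import random


def simulate(theta, n, rng):
    """Toy simulator (assumed): sample mean of n N(theta, 1) draws."""
    return sum(rng.gauss(theta, 1.0) for _ in range(n)) / n


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def abc_pmc(obs, n, epsilons, n_particles, rng, prior=(-10.0, 10.0)):
    lo, hi = prior
    # Iteration 0: plain rejection ABC from the uniform prior.
    particles = []
    while len(particles) < n_particles:
        theta = rng.uniform(lo, hi)
        if abs(simulate(theta, n, rng) - obs) <= epsilons[0]:
            particles.append(theta)
    weights = [1.0 / n_particles] * n_particles

    for eps in epsilons[1:]:
        mean = sum(w * t for w, t in zip(weights, particles))
        # Kernel variance: twice the weighted empirical variance.
        tau2 = 2.0 * sum(w * (t - mean) ** 2 for w, t in zip(weights, particles))
        new_particles, new_weights = [], []
        while len(new_particles) < n_particles:
            centre = rng.choices(particles, weights=weights)[0]
            theta = rng.gauss(centre, math.sqrt(tau2))
            if not lo <= theta <= hi:
                continue  # zero prior density
            if abs(simulate(theta, n, rng) - obs) > eps:
                continue  # simulated summary too far from the observed one
            # Importance weight: prior density over mixture proposal density.
            denom = sum(w * normal_pdf(theta, t, tau2)
                        for w, t in zip(weights, particles))
            new_particles.append(theta)
            new_weights.append((1.0 / (hi - lo)) / denom)
        total = sum(new_weights)
        particles = new_particles
        weights = [w / total for w in new_weights]
    return particles, weights


rng = random.Random(1)
particles, weights = abc_pmc(obs=1.5, n=20, epsilons=[2.0, 1.0, 0.5, 0.25],
                             n_particles=200, rng=rng)
post_mean = sum(w * t for w, t in zip(weights, particles))
print(f"ABC-PMC posterior mean: {post_mean:.3f}")
```

The distance here is simply the absolute difference between simulated and observed sample means; in realistic settings the choice of summaries and distance is the delicate part.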

“For complex data, it can be difficult or computationally expensive to calculate the distance ρ(x; y) using all the information available in x and y.” Akeret et al.

In both papers, the role of the distance is stressed as being quite important. However, the cosmoabc paper uses an L1 distance [see (2) therein] in a toy example without normalising between mean and variance, while the abcpmc paper suggests using a Mahalanobis distance that turns the d-dimensional problem into a comparison of one-dimensional projections.
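The normalisation issue can be illustrated with a minimal sketch, assuming summaries made of a sample mean and a sample variance and a pilot run at a fixed (hypothetical) parameter value. The scaling by pilot standard deviations is one simple choice among many, not the construction used in either paper.

```python
import random
import statistics


def summaries(sample):
    """Summary statistics: (sample mean, sample variance)."""
    return (statistics.mean(sample), statistics.pvariance(sample))


def l1(s, t):
    """Raw L1 distance between two summary vectors."""
    return sum(abs(a - b) for a, b in zip(s, t))


def scaled_l1(s, t, scales):
    """L1 distance after dividing each coordinate by its pilot spread."""
    return sum(abs(a - b) / c for a, b, c in zip(s, t, scales))


rng = random.Random(2)
# Pilot simulations at an assumed parameter value (mean 0, standard deviation 5).
pilot = [summaries([rng.gauss(0.0, 5.0) for _ in range(50)]) for _ in range(500)]
scales = [statistics.pstdev(column) for column in zip(*pilot)]

s_obs = (0.0, 25.0)
shift_mean = (s_obs[0] + scales[0], s_obs[1])  # one pilot sd off in the mean
shift_var = (s_obs[0], s_obs[1] + scales[1])   # one pilot sd off in the variance

print("raw L1:   ", l1(shift_mean, s_obs), l1(shift_var, s_obs))
print("scaled L1:", scaled_l1(shift_mean, s_obs, scales),
      scaled_l1(shift_var, s_obs, scales))
```

Under the raw L1 distance the variance coordinate dominates, so a tolerance tuned on it barely constrains the mean; after scaling, both one-standard-deviation shifts are at distance one.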

## likelihood-free model choice

Posted in Books, pictures, Statistics, University life, Wines on March 27, 2015 by xi'an

Jean-Michel Marin, Pierre Pudlo and I just arXived a short review on ABC model choice, first version of a chapter for the forthcoming Handbook of Approximate Bayesian computation edited by Scott Sisson, Yannan Fan, and Mark Beaumont. Except for a new analysis of a human evolution scenario, this survey mostly argues for the proposal made in our recent paper on the use of random forests and [also argues] about the lack of reliable approximations to posterior probabilities. (Paper that was rejected by PNAS and that is about to be resubmitted. Hopefully with a more positive outcome.) The conclusion of the survey is that

“The presumably most pessimistic conclusion of this study is that the connections between (i) the true posterior probability of a model, (ii) the ABC version of this probability, and (iii) the random forest version of the above, are at best very loose. This leaves open queries for acceptable approximations of (i), since the posterior predictive error is instead an error assessment for the ABC RF model choice procedure. While a Bayesian quantity that can be computed at little extra cost, it does not necessarily compete with the posterior probability of a model.”

reflecting my hope that we can eventually come up with a proper approximation to the “true” posterior probability…

## ABC@NIPS: call for papers

Posted in Statistics, Travel, University life on September 9, 2014 by xi'an

In connection with the previous announcement of ABC in Montréal, a call for papers that came out today:

NIPS 2014 Workshop: ABC in Montreal

December 12, 2014

Approximate Bayesian computation (ABC) or likelihood-free (LF) methods have developed mostly beyond the radar of the machine learning community, but are important tools for a large segment of the scientific community. This is particularly true for systems and population biology, computational psychology, computational chemistry, etc. Recent work has both applied machine learning models and algorithms to general ABC inference (NN, forests, GPs) and ABC inference to machine learning (e.g. using computer graphics to solve computer vision using ABC). In general, however, there is significant room for collaboration between the two communities.

The workshop will consist of invited and contributed talks, poster spotlights, and a poster session. Rather than a panel discussion we will encourage open discussion between the speakers and the audience!

Examples of topics of interest in the workshop include (but are not limited to):

* Applications of ABC to machine learning, e.g., computer vision, inverse problems
* ABC in Systems Biology, Computational Science, etc
* ABC Reinforcement Learning
* Machine learning simulator models, e.g., NN models of simulation responses, GPs etc.
* Selection of sufficient statistics
* Online and post-hoc error
* ABC with very expensive simulations and acceleration methods (surrogate modeling, choice of design/simulation points)
* ABC with probabilistic programming
* Posterior evaluation of scientific problems/interaction with scientists
* Post-computational error assessment
* Impact on resulting ABC inference
* ABC for model selection

===========
Submission:

## model selection by likelihood-free Bayesian methods

Posted in Books, pictures, Running, Statistics, University life on May 29, 2014 by xi'an

Just glanced at the introduction of this arXived paper over breakfast, back from my morning run: the exact title is “Model Selection for Likelihood-free Bayesian Methods Based on Moment Conditions: Theory and Numerical Examples” by Cheng Li and Wenxin Jiang. (The paper is 81 pages long.) I selected the paper for its title as it connected with a question of ours on how to extend our empirical likelihood [A]BC work to model choice. We looked at this issue with Kerrie Mengersen and Judith Rousseau the last time Kerrie visited Paris but could not spot a satisfying entry… The current paper is of a theoretical nature, considering a moment-defined model

$\mathbb{E}[g(D,\theta)]=0,$

where D denotes the data, as the dimension p of the parameter θ grows with n, the sample size. The approximate model is derived from a prior on the parameter θ and a Gaussian quasi-likelihood on the moment estimating function g(D,θ). Examples include single-index longitudinal data, quantile regression and partial correlation selection. The model selection setting is one of variable selection, resulting in $2^p$ models to compare, with p growing to infinity… Which makes the practical implementation rather delicate to conceive. And the probability one of hitting the right model a fairly asymptotic concept. (At least after a cursory read from my breakfast table!)
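As a rough illustration of what a Gaussian quasi-likelihood on a moment function looks like, here is a minimal scalar sketch. It assumes the generic form $-\tfrac{n}{2}\,\bar g_n(\theta)^2/\hat v(\theta)$ for the quasi-log-likelihood, with $\bar g_n$ the sample average of g and $\hat v$ its empirical variance. This is not taken from Li and Jiang's paper, whose exact construction may differ, and the data and function names are made up.

```python
import statistics


def gaussian_quasi_loglik(theta, data, g):
    """Assumed generic form: -(n/2) * gbar(theta)^2 / v(theta),
    for a scalar moment function g(x, theta)."""
    values = [g(x, theta) for x in data]
    n = len(values)
    gbar = statistics.mean(values)    # empirical moment
    v = statistics.pvariance(values)  # empirical variance of the moment function
    return -0.5 * n * gbar ** 2 / v


def mean_moment(x, theta):
    """Moment function whose zero expectation defines the mean: E[X - theta] = 0."""
    return x - theta


data = [1.2, 0.7, 2.9, 1.8, 2.4, 0.3, 1.1]  # made-up observations
grid = [i / 100 for i in range(0, 301)]
best = max(grid, key=lambda t: gaussian_quasi_loglik(t, data, mean_moment))
print(f"quasi-likelihood maximiser on the grid: {best}")
```

Multiplying the exponential of this quantity by a prior on θ gives a likelihood-free pseudo-posterior; in this toy case the maximiser is the grid point closest to the sample mean.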

## parallel MCMC via Weierstrass sampler (a reply by Xiangyu Wang)

Posted in Books, Statistics, University life on January 3, 2014 by xi'an

Almost immediately after I published my comments on his paper with David Dunson, Xiangyu Wang sent a long comment that I think is worth a post on its own (especially given that I am now busy skiing and enjoying Chamonix!). So here it is:

Thanks for the thoughtful comments. I did not realize that Neiswanger et al. also proposed the similar trick to avoid combinatoric problem as we did for the rejection sampler. Thank you for pointing that out.

For the criticism 3 on the tail degeneration, we did not mean to fire on the non-parametric estimation issues, but rather the problem caused by using the product equation. When two densities are multiplied together, the accuracy of the product mainly depends on the tail of the two densities (the overlapping area); if there are more than two densities, the impact will be more significant. As a result, it may be unwise to directly use the product equation, as the most distant sub-posteriors could be potentially very far away from each other, and most of the sub-posterior draws are outside the overlapping area. (The full Gibbs sampler formulated in our paper does not have this issue: as shown in equation 5, there is a common part multiplied on each sub-posterior, which brings them close.)

Point 4 stated the problem caused by averaging. The approximated density following Neiswanger et al. (2013) will be a mixture of Gaussians, whose component means are the averages of the sub-posterior draws. Therefore, if sub-posteriors stick to different modes (assuming the true posterior is multi-modal), then the approximated density is likely to mess up the modes, and produce some fake modes (e.g., the average of the modes; we provide an example in simulation 3).

Sorry for the vague description of the refining method (4.2). The idea is kinda dull. We start from an initial approximation to θ and then do one step Gibbs update to obtain a new θ, and we call this procedure ‘refining’, as we believe such process would bring the original approximation closer to the true posterior distribution.

The first (4.1) and the second (4.2) algorithms do seem weird to be called as ‘parallel’, since they are both modified from the Gibbs sampler described in (4) and (5). The reason we want to propose these two algorithms is to overcome two problems. The first is the dimensionality curse, and the second is the issue when the subset inferences are not extremely accurate (subset effective sample size small) which might be a common scenario for logistic regression (with large parameters) even with huge data set. First, algorithm (4.1) and (4.2) both start from some initial approximations, and attempt to improve to obtain a better approximation, thus avoid the dimensional issue. Second, in our simulation 1, we attempt to pull down the performance of the simple averaging by worsening the sub-posterior performance (we allocate smaller amount of data to each subset), and the non-parametric method fails to approximate the combined density as well. However, the algorithm 4.1 and 4.2 still work in this case.

I have some problem with the logistic regression example provided in Neiswanger et al. (2013). As shown in the paper, under the authors’ setting (not fully specified in the paper), though the non-parametric method is better than simple averaging, the approximation error of simple averaging is small enough for practical use (I also have some problem with their error evaluation method), then why should we still bother to use a much more complicated method?

Actually I’m adding a new algorithm into the Weierstrass rejection sampling, which will render it thoroughly free from the dimensionality curse of p. The new scheme is applicable to the nonparametric method in Neiswanger et al. (2013) as well. It should appear soon in the second version of the draft.
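The averaging issue raised under point 4 of the reply can be reproduced in a few lines. This is a toy sketch under made-up assumptions (two sub-posterior chains stuck in modes at -3 and +3, Gaussian draws), not the simulation from the paper.

```python
import random

rng = random.Random(3)

# Assumed setting: each sub-posterior sampler is stuck in a different mode.
sub1 = [rng.gauss(-3.0, 0.3) for _ in range(1000)]
sub2 = [rng.gauss(3.0, 0.3) for _ in range(1000)]

# Averaging draws across sub-posteriors, as in the component means
# of the mixture approximation discussed above.
averaged = [(a + b) / 2 for a, b in zip(sub1, sub2)]

share_near_zero = sum(abs(t) < 1 for t in averaged) / len(averaged)
share_near_modes = sum(abs(abs(t) - 3) < 1 for t in averaged) / len(averaged)
print(f"averaged draws within 1 of 0:  {share_near_zero:.2f}")
print(f"averaged draws within 1 of ±3: {share_near_modes:.2f}")
```

The averaged values pile up around zero, a region neither sub-posterior visits, which is the fake mode described in the reply.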

## parallel MCMC via Weierstrass sampler

Posted in Books, Statistics, University life on January 2, 2014 by xi'an

During O’Bayes 2013, Xiangyu Wang and David Dunson arXived a paper (with the above title) that David then presented on the 19th. The setting is quite similar to the recently discussed embarrassingly parallel paper of Neiswanger et al., in that Xiangyu and David start from the same product representation of the target (posterior). Namely,

$p(\theta|x) = \prod_{i=1}^m p_i(\theta|x).$

However, they criticise the choice made by Neiswanger et al. to use MCMC approximations to each component of the product for the following reasons:

1. Curse of dimensionality in the number of parameters p
2. Curse of dimensionality in the number of subsets m
3. Tail degeneration
4. Support inconsistency and mode misspecification
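The product representation above has a closed form when every sub-posterior is Gaussian, which gives a quick way to see the tail degeneration of point 3. A minimal sketch, assuming one-dimensional Gaussian sub-posteriors with made-up means and variances (the function name is hypothetical):

```python
def product_of_gaussians(means, variances):
    """Mean and variance of the normalised product of 1-D Gaussian densities:
    precisions add, and the mean is the precision-weighted average."""
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)
    mean = sum(p * m for p, m in zip(precisions, means)) / total_precision
    return mean, 1.0 / total_precision


# Overlapping sub-posteriors: the product sits where both have draws.
mean, var = product_of_gaussians([0.0, 2.0], [1.0, 1.0])
print(mean, var)

# Distant sub-posteriors: the product concentrates around 0, five standard
# deviations from either centre, where neither sampler produces draws.
far_mean, far_var = product_of_gaussians([-5.0, 5.0], [1.0, 1.0])
print(far_mean, far_var)
```

In the second case an approximation of the product built from sub-posterior draws has to rely on the extreme tails of each sample, which is the difficulty listed under point 3.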