## model selection by likelihood-free Bayesian methods

Posted in Books, pictures, Running, Statistics, University life with tags , , , , , , on May 29, 2014 by xi'an

Just glanced at the introduction of this arXived paper over breakfast, back from my morning run: the exact title is “Model Selection for Likelihood-free Bayesian Methods Based on Moment Conditions: Theory and Numerical Examples” by Cheng Li and Wenxin Jiang. (The paper is 81 pages long.) I selected the paper for its title as it connected with an interrogation of ours on the manner to extend our empirical likelihood [A]BC work to model choice. We looked at this issue with Kerrie Mengersen and Judith Rousseau the last time Kerrie visited Paris but could not spot a satisfying entry… The current paper is of a theoretical nature, considering a moment defined model

$\mathbb{E}[g(D,\theta)]=0,$

where D denotes the data, as the dimension p of the parameter θ grows with n, the sample size. The approximate model is derived from a prior on the parameter θ and of a Gaussian quasi-likelihood on the moment estimating function g(D,θ). Examples include single index longitudinal data, quantile regression and partial correlation selection. The model selection setting is one of variable selection, resulting in 2p models to compare, with p growing to infinity… Which makes the practical implementation rather delicate to conceive. And the probability one of hitting the right model a fairly asymptotic concept. (At least after a cursory read from my breakfast table!)

## parallel MCMC via Weirstrass sampler (a reply by Xiangyu Wang)

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , on January 3, 2014 by xi'an

Almost immediately after I published my comments on his paper with David Dunson, Xiangyu Wang sent a long comment that I think worth a post on its own (especially, given that I am now busy skiing and enjoying Chamonix!). So here it is:

Thanks for the thoughtful comments. I did not realize that Neiswanger et al. also proposed the similar trick to avoid combinatoric problem as we did for the rejection sampler. Thank you for pointing that out.

For the criticism 3 on the tail degeneration, we did not mean to fire on the non-parametric estimation issues, but rather the problem caused by using the product equation. When two densities are multiplied together, the accuracy of the product mainly depends on the tail of the two densities (the overlapping area), if there are more than two densities, the impact will be more significant. As a result, it may be unwise to directly use the product equation, as the most distant sub-posteriors could be potentially very far away from each other, and most of the sub posterior draws are outside the overlapping area. (The full Gibbs sampler formulated in our paper does not have this issue, as shown in equation 5, there is a common part multiplied on each sub-posterior, which brought them close.)

Point 4 stated the problem caused by averaging. The approximated density follows Neiswanger et al. (2013) will be a mixture of Gaussian, whose component means are the average of the sub-posterior draws. Therefore, if sub-posteriors stick to different modes (assuming the true posterior is multi-modal), then the approximated density is likely to mess up the modes, and produce some faked modes (eg. average of the modes. We provide an example in the simulation 3.)

Sorry for the vague description of the refining method (4.2). The idea is kinda dull. We start from an initial approximation to θ and then do one step Gibbs update to obtain a new θ, and we call this procedure ‘refining’, as we believe such process would bring the original approximation closer to the true posterior distribution.

The first (4.1) and the second (4.2) algorithms do seem weird to be called as ‘parallel’, since they are both modified from the Gibbs sampler described in (4) and (5). The reason we want to propose these two algorithms is to overcome two problems. The first is the dimensionality curse, and the second is the issue when the subset inferences are not extremely accurate (subset effective sample size small) which might be a common scenario for logistic regression (with large parameters) even with huge data set. First, algorithm (4.1) and (4.2) both start from some initial approximations, and attempt to improve to obtain a better approximation, thus avoid the dimensional issue. Second, in our simulation 1, we attempt to pull down the performance of the simple averaging by worsening the sub-posterior performance (we allocate smaller amount of data to each subset), and the non-parametric method fails to approximate the combined density as well. However, the algorithm 4.1 and 4.2 still work in this case.

I have some problem with the logistic regression example provided in Neiswanger et al. (2013). As shown in the paper, under the authors’ setting (not fully specified in the paper), though the non-parametric method is better than simple averaging, the approximation error of simple averaging is small enough for practical use (I also have some problem with their error evaluation method), then why should we still bother to use a much more complicated method?

Actually I’m adding a new algorithm into the Weierstrass rejection sampling, which will render it thoroughly free from the dimensionality curse of p. The new scheme is applicable to the nonparametric method in Neiswanger et al. (2013) as well. It should appear soon in the second version of the draft.

## parallel MCMC via Weirstrass sampler

Posted in Books, Statistics, University life with tags , , , , , , , , on January 2, 2014 by xi'an

During O’Bayes 2013, Xiangyu Wang and David Dunson arXived a paper (with the above title) that David then presented on the 19th.  The setting is quite similar to the recently discussed embarrassingly parallel paper of Neiswanger et al., in that Xiangyu and David start from the same product representation of the target (posterior). Namely,

$p(\theta|x) = \prod_{i=1}^m p_i(\theta|x).$

However, they criticise the choice made by Neiswanger et al to use MCMC approximations to each component of the product for the following reasons:

1. Curse of dimensionality in the number of parameters p
2. Curse of dimensionality in the number of subsets m
3. Tail degeneration
4. Support inconsistency and mode misspecification Continue reading

## ABC with composite score functions

Posted in Books, pictures, Statistics, University life with tags , , , , , , , on December 12, 2013 by xi'an

My friends Erlis Ruli, Nicola Sartori and Laura Ventura from Università degli Studi de Padova have just arXived a new paper entitled Approximate Bayesian Computation with composite score functions. While the paper provides a survey of composite likelihood methods, the core idea of the paper is to use the score function (of the composite likelihood) as the summary statistic,

$\dfrac{\partial\,c\ell(\theta;y)}{\partial\,\theta},$

when evaluated at the maximum composite likelihood at the observed data point. In the specific (but unrealistic) case of an exponential family, an ABC based on the score is asymptotically (i.e., as the tolerance ε goes to zero) exact. The choice of the composite likelihood thus induces a natural summary statistics and, as in our empirical likelihood paper, where we also use the score of a composite likelihood, the composite likelihoods that are available for computation are usually quite a few, thus leading to an automated choice of a summary statistic..

An interesting (common) feature in most examples found in this paper is that comparisons are made between ABC using the (truly) sufficient statistic and ABC based on the pairwise score function, which essentially relies on the very same statistics. So the difference, when there is a difference, pertains to the choice of a different combination of the summary statistics or, somehow equivalently to the choice of a different distance function. One of the examples starts from our MA(2) toy-example in the 2012 survey in Statistics and Computing. The composite likelihood is then based on the consecutive triplet marginal densities. As shown by the picture below, the composite version improves to some extent upon the original ABC solution using three autocorrelations.

A suggestion I would have about a refinement of the proposed method deals with the distance utilised in the paper, namely the sum of the absolute differences between the statistics. Indeed, this sum is not scaled at all, neither for regular ABC nor for composite ABC, while the composite likelihood perspective provides in addition to the score a natural metric through the matrix A(θ) [defined on page 12]. So I would suggest comparing the performances of the methods using instead this rescaling since, in my opinion and in contrast with a remark on page 13, it is relevant in some (many?) settings where the amount of information brought by the composite model widely varies from one parameter to the next.

## Maximum likelihood vs. likelihood-free quantum system identification in the atom maser

Posted in Books, Statistics, University life with tags , , , , , , on December 2, 2013 by xi'an

This paper (arXived a few days ago) compares maximum likelihood with different ABC approximations in a quantum physic setting and for an atom maser modelling that essentially bears down to a hidden Markov model. (I mostly blanked out of the physics explanations so cannot say I understand the model at all.) While the authors (from the University of Nottingham, hence Robin’s statue above…) do not consider the recent corpus of work by Ajay Jasra and coauthors (some of which was discussed on the ‘Og), they get interesting findings for an equally interesting model. First, when comparing the Fisher informations on the sole parameter of the model, the “Rabi angle” φ, for two different sets of statistics, one gets to zero at a certain value of the parameter, while the (fully informative) other is maximum (Figure 6). This is quite intriguing, esp. give the shape of the information in the former case, which reminds me of (my) inverse normal distributions. Second, the authors compare different collections of summary statistics in terms of ABC distributions against the likelihood function. While most bring much more uncertainty in the analysis, the whole collection recovers the range and shape of the likelihood function, which is nice. Third, they also use a kolmogorov-Smirnov distance to run their ABC, which is enticing, except that I cannot fathom from the paper when one would have enough of a sample (conditional on a parameter value) to rely on what is essentially an estimate of the sampling distribution. This seems to contradict the fact that they only use seven summary statistics. Or it may be that the “statistic” of waiting times happens to be a vector, in which case a Kolmogorov-Smirnov distance can indeed be adopted for the distance… The fact that the grouped seven-dimensional summary statistic provides the best ABC fit is somewhat of a surprise when considering the problem enjoys a single parameter.

“However, in practice, it is often difficult to find an s(.) which is sufficient.”

Just a point that irks me in most ABC papers is to find quotes like the above, since in most models, it is easy to show that there cannot be a non-trivial sufficient statistic! As soon as one leaves the exponential family cocoon, one is doomed in this respect!!!

## rate of convergence for ABC

Posted in Statistics, University life with tags , , , , on November 19, 2013 by xi'an

Barber, Voss, and Webster recently posted and arXived a paper entitled The Rate of Convergence for Approximate Bayesian Computation. The paper is essentially theoretical and establishes the optimal rate of convergence of the MSE—for approximating a posterior moment—at a rate of 2/(q+4), where q is the dimension of the summary statistic, associated with an optimal tolerance in n-1/4. I was first surprised at the role of the dimension of the summary statistic, but rationalised it as being the dimension where the non-parametric estimation takes place. I may have read the paper too quickly as I did not spot any link with earlier convergence results found in the literature: for instance, Blum (2010, JASA) links ABC with standard kernel density non-parametric estimation and find a tolerance (bandwidth) of order n-1/q+4 and an MSE of order 2/(q+4) as well. Similarly, Biau et al. (2013, Annales de l’IHP) obtain precise convergence rates for ABC interpreted as a k-nearest-neighbour estimator. And, as already discussed at length on this blog, Fearnhead and Prangle (2012, JRSS Series B) derive rates similar to Blum’s with a tolerance of order n-1/q+4 for the regular ABC and of order n-1/q+2 for the noisy ABC

## ABC for design

Posted in Statistics with tags , , , , , , , on August 30, 2013 by xi'an

I wrote a comment on this arXived paper on simulation based design that starts from Müller (1999) and gets an ABC perspective a while ago on my iPad when travelling to Montpellier and then forgot to download it…

Hainy, [Wener] Müller, and Wagner recently arXived a paper called “Likelihood-free Simulation-based Optimal Design“, paper which relies on ABC to construct optimal designs . Remember that [Peter] Müller (1999) uses a natural simulated annealing that is quite similar to our MAP [SAME] algorithm with Arnaud Doucet and Simon Godsill, relying on multiple versions of the data set to get to the maximum. The paper also builds upon our 2006 JASA paper with my then PhD student Billy Amzal, Eric Parent, and Frederic Bois, paper that took advantage of the then emerging particle methods to improve upon a static horizon target. While our method is sequential in that it pursues a moving target, it does not rely on the generic methodology developed by del Moral et al. (2006), where a backward kernel brings more stability to the moves. The paper also implements a version of our population Monte Carlo ABC algorithm (Beaumont et al., 2009), as a first step before an MCMC simulation. Overall, the paper sounds more like a review than like a strongly directive entry into ABC based design in that it remains quite generic. Not that I have specific suggestions, mind!, but I fear a realistic implementation (as opposed to the linear model used in the paper) would require a certain amount of calibration. There are missing references of recent papers using ABC for design, including some by Michael Stumpf I think.

I did not know about the Kuck et al. reference… Which is reproducing our 2006 approach within the del Moral framework. It uses a continuous temperature scale that I find artificial and not that useful, again a maybe superficial comment as I didn’t get very much into the paper … Just that integer powers lead to multiples of the sample and have a nice algorithmic counterpart.