## Approximate Integrated Likelihood via ABC methods

Posted in Books, Statistics, University life with tags , , , , , , , , on March 13, 2014 by xi'an

My PhD student Clara Grazian just arXived this joint work with Brunero Liseo on using ABC for marginal density estimation. The idea in this paper is to produce an integrated likelihood approximation in intractable problems via the ratio

$L(\psi|x)\propto \dfrac{\pi(\psi|x)}{\pi(\psi)}$

both terms in the ratio being estimated from simulations,

$\hat L(\psi|x) \propto \dfrac{\hat\pi^\text{ABC}(\psi|x)}{\hat\pi(\psi)}$

(with possible closed form for the denominator). Although most of the examples processed in the paper (Poisson means ratio, Neyman-Scott’s problem, g-&-k quantile distribution, semi-parametric regression) rely on summary statistics, hence de facto replacing the numerator above with a pseudo-posterior conditional on those summaries, the approximation remains accurate (for those examples). In the g-&-k quantile example, Clara and Brunero compare our ABC-MCMC algorithm with the one of Allingham et al. (2009, Statistics & Computing): the later does better by not replicating values in the Markov chain but instead proposing a new value until it is accepted by the usual Metropolis step. (Although I did not spend much time on this issue, I cannot see how both approaches could be simultaneously correct. Even though the outcomes do not look very different.) As noted by the authors, “the main drawback of the present approach is that it requires the use of proper priors”, unless the marginalisation of the prior can be done analytically. (This is an interesting computational problem: how to provide an efficient approximation to a marginal density of a σ-finite measure, assuming this density exists.)

Clara will give a talk at CREST-ENSAE today about this work, in the Bayes in Paris seminar: 2pm in room 18.

## parallel MCMC via Weierstrass sampler (a reply by Xiangyu Wang)

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , on January 3, 2014 by xi'an

Almost immediately after I published my comments on his paper with David Dunson, Xiangyu Wang sent a long comment that I think worth a post on its own (especially, given that I am now busy skiing and enjoying Chamonix!). So here it is:

Thanks for the thoughtful comments. I did not realize that Neiswanger et al. also proposed the similar trick to avoid combinatoric problem as we did for the rejection sampler. Thank you for pointing that out.

For the criticism 3 on the tail degeneration, we did not mean to fire on the non-parametric estimation issues, but rather the problem caused by using the product equation. When two densities are multiplied together, the accuracy of the product mainly depends on the tail of the two densities (the overlapping area), if there are more than two densities, the impact will be more significant. As a result, it may be unwise to directly use the product equation, as the most distant sub-posteriors could be potentially very far away from each other, and most of the sub posterior draws are outside the overlapping area. (The full Gibbs sampler formulated in our paper does not have this issue, as shown in equation 5, there is a common part multiplied on each sub-posterior, which brought them close.)

Point 4 stated the problem caused by averaging. The approximated density follows Neiswanger et al. (2013) will be a mixture of Gaussian, whose component means are the average of the sub-posterior draws. Therefore, if sub-posteriors stick to different modes (assuming the true posterior is multi-modal), then the approximated density is likely to mess up the modes, and produce some faked modes (eg. average of the modes. We provide an example in the simulation 3.)

Sorry for the vague description of the refining method (4.2). The idea is kinda dull. We start from an initial approximation to θ and then do one step Gibbs update to obtain a new θ, and we call this procedure ‘refining’, as we believe such process would bring the original approximation closer to the true posterior distribution.

The first (4.1) and the second (4.2) algorithms do seem weird to be called as ‘parallel’, since they are both modified from the Gibbs sampler described in (4) and (5). The reason we want to propose these two algorithms is to overcome two problems. The first is the dimensionality curse, and the second is the issue when the subset inferences are not extremely accurate (subset effective sample size small) which might be a common scenario for logistic regression (with large parameters) even with huge data set. First, algorithm (4.1) and (4.2) both start from some initial approximations, and attempt to improve to obtain a better approximation, thus avoid the dimensional issue. Second, in our simulation 1, we attempt to pull down the performance of the simple averaging by worsening the sub-posterior performance (we allocate smaller amount of data to each subset), and the non-parametric method fails to approximate the combined density as well. However, the algorithm 4.1 and 4.2 still work in this case.

I have some problem with the logistic regression example provided in Neiswanger et al. (2013). As shown in the paper, under the authors’ setting (not fully specified in the paper), though the non-parametric method is better than simple averaging, the approximation error of simple averaging is small enough for practical use (I also have some problem with their error evaluation method), then why should we still bother to use a much more complicated method?

Actually I’m adding a new algorithm into the Weierstrass rejection sampling, which will render it thoroughly free from the dimensionality curse of p. The new scheme is applicable to the nonparametric method in Neiswanger et al. (2013) as well. It should appear soon in the second version of the draft.

## parallel MCMC via Weirstrass sampler

Posted in Books, Statistics, University life with tags , , , , , , , , on January 2, 2014 by xi'an

During O’Bayes 2013, Xiangyu Wang and David Dunson arXived a paper (with the above title) that David then presented on the 19th.  The setting is quite similar to the recently discussed embarrassingly parallel paper of Neiswanger et al., in that Xiangyu and David start from the same product representation of the target (posterior). Namely,

$p(\theta|x) = \prod_{i=1}^m p_i(\theta|x).$

However, they criticise the choice made by Neiswanger et al to use MCMC approximations to each component of the product for the following reasons:

1. Curse of dimensionality in the number of parameters p
2. Curse of dimensionality in the number of subsets m
3. Tail degeneration
4. Support inconsistency and mode misspecification Continue reading

## why do we maximise the weights in empirical likelihood?

Posted in Books, Statistics, University life with tags , , , , on October 29, 2013 by xi'an

Mark Johnson sent me the following question a few days ago:

I have one question about EL: how important is it to maximise the probabilities pi on the data items in the formula (stolen from the Wikipedia page on EL)?

$\max_{\pi,\theta} \sum_{i=1}^n \ln\pi_i$

You’re already replacing the max over θ with a distribution over θ.  What about the πi

It would seem to be “more Bayesian” to put a prior on the data item probabilities pi_i, and it would also seem to “do the right thing” in situations where there are several different pi that have the same empirical likelihood.

This is a fairly reasonable question, which first reminds me of an issue we had examined with Costas Goutis, on his very last trip to Paris in 1996, a few months before he died in a diving accident near Seattle. We were wondering if treating the bandwidth in a non-parametric density estimator as a regular parameter was making sense. After experimenting for a few days with different priors we found that it was not such a great idea and that, instead, the prior on the bandwidth needed to depend on the sample size. This led to Costas’ posthumous paper, Nonparametric Estimation of a Mixing Density via the Kernel Method, in JASA in 1997 (with the kind help of Jianqing Fan).

Now, more to the point (of empirical likelihood), I am afraid that putting (almost) any kind of prior on the weights πi would be hopeless. For one thing, the πi are of the same size as the sample (modulo the identifying equation constraints) so estimating them based on a prior that does not depend on the sample size does not produce consistent estimators of the weights. (Search Bayesian nonparametric likelihood estimation for more advanced reasons.) Intuitively, it seems to me that the (true) parameter θ of the (unknown or unavailable) distribution of the data does not make sense in the non-parametric setting or, conversely, that the weights πi have no meaning for the inference on θ. It thus sounds difficult to treat them together and on an equal footing. The approximation

$\max_{\pi} \sum_{i=1}^n \ln\pi_i$

is a function of θ that replaces the unknown or unavailable likelihood, in which the weights have no statistical meaning. But this is a wee of a weak argument as other solutions than the maximisation of the entropy could be used to determine the weights.

In the end, this remains a puzzling issue (and hence a great question), hitting at the difficulty of replacing the true model with an approximation on the one hand and aiming at estimating the true parameter(s) on the other hand.