## a bootstrap likelihood approach to Bayesian computation

Posted in Books, R, Statistics, University life with tags , , , , , , , , on October 16, 2014 by xi'an

This paper by Weixuan Zhu, Juan Miguel Marín [from Carlos III in Madrid, not to be confused with Jean-Michel Marin, from Montpellier!], and Fabrizio Leisen proposes an alternative to our 2013 PNAS paper with Kerrie Mengersen and Pierre Pudlo on empirical likelihood ABC, or BCel. The alternative is based on Davison, Hinkley and Worton’s (1992) bootstrap likelihood, which relies on a double-bootstrap to produce a non-parametric estimate of the distribution of a given estimator of the parameter θ. Including a smooth curve-fitting algorithm step, for which not much description is available from the paper.

“…in contrast with the empirical likelihood method, the bootstrap likelihood doesn’t require any set of subjective constrains taking advantage from the bootstrap methodology. This makes the algorithm an automatic and reliable procedure where only a few parameters need to be specified.”

The spirit is indeed quite similar to ours in that a non-parametric substitute plays the role of the actual likelihood, with no correction for the substitution. Both approaches are convergent, with similar or identical convergence speeds. While the empirical likelihood relies on a choice of parameter identifying constraints, the bootstrap version starts directly from the [subjectively] chosen estimator of θ. For it indeed needs to be chosen. And computed.

“Another benefit of using the bootstrap likelihood (…) is that the construction of bootstrap likelihood could be done once and not at every iteration as the empirical likelihood. This leads to significant improvement in the computing time when different priors are compared.”

This is an improvement that could apply to the empirical likelihood approach, as well, once a large enough collection of likelihood values has been gathered. But only in small enough dimensions where smooth curve-fitting algorithms can operate. The same criticism applying to the derivation of a non-parametric density estimate for the distribution of the estimator of θ. Critically, the paper only processes examples with a few parameters.

In the comparisons between BCel and BCbl that are produced in the paper, the gain is indeed towards BCbl. Since this paper is mostly based on examples and illustrations, not unlike ours, I would like to see more details on the calibration of the non-parametric methods and of regular ABC, as well as on the computing time. And the variability of both methods on more than a single Monte Carlo experiment.

I am however uncertain as to how the authors process the population genetic example. They refer to the composite likelihood used in our paper to set the moment equations. Since this is not the true likelihood, how do the authors select their parameter estimates in the double-bootstrap experiment? The inclusion of Crakel’s and Flegal’s (2013) bivariate Beta, is somewhat superfluous as this example sounds to me like an artificial setting.

In the case of the Ising model, maybe the pre-processing step in our paper with Matt Moores could be compared with the other algorithms. In terms of BCbl, how does the bootstrap operate on an Ising model, i.e. (a) how does one subsample pixels and (b)what are the validity guarantees?

A test that would be of interest is to start from a standard ABC solution and use this solution as the reference estimator of θ, then proceeding to apply BCbl for that estimator. Given that the reference table would have to be produced only once, this would not necessarily increase the computational cost by a large amount…

## model selection by likelihood-free Bayesian methods

Posted in Books, pictures, Running, Statistics, University life with tags , , , , , , on May 29, 2014 by xi'an

Just glanced at the introduction of this arXived paper over breakfast, back from my morning run: the exact title is “Model Selection for Likelihood-free Bayesian Methods Based on Moment Conditions: Theory and Numerical Examples” by Cheng Li and Wenxin Jiang. (The paper is 81 pages long.) I selected the paper for its title as it connected with an interrogation of ours on the manner to extend our empirical likelihood [A]BC work to model choice. We looked at this issue with Kerrie Mengersen and Judith Rousseau the last time Kerrie visited Paris but could not spot a satisfying entry… The current paper is of a theoretical nature, considering a moment defined model

$\mathbb{E}[g(D,\theta)]=0,$

where D denotes the data, as the dimension p of the parameter θ grows with n, the sample size. The approximate model is derived from a prior on the parameter θ and of a Gaussian quasi-likelihood on the moment estimating function g(D,θ). Examples include single index longitudinal data, quantile regression and partial correlation selection. The model selection setting is one of variable selection, resulting in 2p models to compare, with p growing to infinity… Which makes the practical implementation rather delicate to conceive. And the probability one of hitting the right model a fairly asymptotic concept. (At least after a cursory read from my breakfast table!)

## ABC for bivariate betas

Posted in Statistics, University life with tags , , , , , , , on February 19, 2014 by xi'an

Crakel and Flegal just arXived a short paper running ABC for doing inference on the parameters of two families of bivariate betas. And I could not but read it thru. And wonder why ABC was that necessary to handle the model. The said bivariate betas are defined from

$V_1=(U_1+U_5+U_7)/(U_3+U_6+U_8)\,,$

$V_2=(U_2+U_5+U_8)/(U_4+U_6+U_7)$

when

$U_i\sim \text{Ga}(\delta_i,1)$

and

$X_1=V_1/(1+V_1)\,,\ X_2=V_2/(1+V_2)$

This makes each term in the pair Beta and the two components dependent. This construct was proposed by Arnold and Ng (2011). (The five-parameter version cancels the gammas for i=3,4,5.)

Since the pdf of the joint distribution is not available in closed form, Crakel and Flegal zoom on ABC-MCMC as the method of choice and discuss simulation experiments. (The choice of the tolerance ε as an absolute rather than relative value, ε=0.2,0.0.6,0.8, puzzles me, esp. since the distance between the summary statistics is not scaled.) I however wonder why other approaches are impossible. (Or why it is necessary to use this distribution to model correlated betas. Unless I am confused copulas were invented to this effect.) First, this is a latent variable model, so latent variables could be introduced inside an MCMC scheme. A wee bit costly but feasible. Second, several moments of those distributions are known so a empirical likelihood approach could be considered.

## my week at War[wick]

Posted in pictures, Running, Statistics, Travel, Uncategorized with tags , , , , , , , , , on February 1, 2014 by xi'an

This was a most busy and profitable week in Warwick as, in addition to meeting with local researchers and students on a wide range of questions and projects, giving an extended seminar to MASDOC students, attending as many seminars as humanly possible (!), and preparing a 5k race by running in the Warwickshire countryside (in the dark and in the rain), I received the visits of Kerrie Mengersen, Judith Rousseau and Jean-Michel Marin, with whom I made some progress on papers we are writing together. In particular, Jean-Michel and I wrote the skeleton of a paper we (still) plan to submit to COLT 2014 next week. And Judith, Kerrie and I drafted new if paradoxical aconnections between empirical likelihood and model selection. Jean-Michel and Judith also gave talks at the CRiSM seminar, Jean-Michel presenting the latest developments on the convergence of our AMIS algorithm, Judith summarising several papers on the analysis of empirical Bayes methods in non-parametric settings.

## ABC with composite score functions

Posted in Books, pictures, Statistics, University life with tags , , , , , , , on December 12, 2013 by xi'an

My friends Erlis Ruli, Nicola Sartori and Laura Ventura from Università degli Studi de Padova have just arXived a new paper entitled Approximate Bayesian Computation with composite score functions. While the paper provides a survey of composite likelihood methods, the core idea of the paper is to use the score function (of the composite likelihood) as the summary statistic,

$\dfrac{\partial\,c\ell(\theta;y)}{\partial\,\theta},$

when evaluated at the maximum composite likelihood at the observed data point. In the specific (but unrealistic) case of an exponential family, an ABC based on the score is asymptotically (i.e., as the tolerance ε goes to zero) exact. The choice of the composite likelihood thus induces a natural summary statistics and, as in our empirical likelihood paper, where we also use the score of a composite likelihood, the composite likelihoods that are available for computation are usually quite a few, thus leading to an automated choice of a summary statistic..

An interesting (common) feature in most examples found in this paper is that comparisons are made between ABC using the (truly) sufficient statistic and ABC based on the pairwise score function, which essentially relies on the very same statistics. So the difference, when there is a difference, pertains to the choice of a different combination of the summary statistics or, somehow equivalently to the choice of a different distance function. One of the examples starts from our MA(2) toy-example in the 2012 survey in Statistics and Computing. The composite likelihood is then based on the consecutive triplet marginal densities. As shown by the picture below, the composite version improves to some extent upon the original ABC solution using three autocorrelations.

A suggestion I would have about a refinement of the proposed method deals with the distance utilised in the paper, namely the sum of the absolute differences between the statistics. Indeed, this sum is not scaled at all, neither for regular ABC nor for composite ABC, while the composite likelihood perspective provides in addition to the score a natural metric through the matrix A(θ) [defined on page 12]. So I would suggest comparing the performances of the methods using instead this rescaling since, in my opinion and in contrast with a remark on page 13, it is relevant in some (many?) settings where the amount of information brought by the composite model widely varies from one parameter to the next.

## why do we maximise the weights in empirical likelihood?

Posted in Books, Statistics, University life with tags , , , , on October 29, 2013 by xi'an

Mark Johnson sent me the following question a few days ago:

I have one question about EL: how important is it to maximise the probabilities pi on the data items in the formula (stolen from the Wikipedia page on EL)?

$\max_{\pi,\theta} \sum_{i=1}^n \ln\pi_i$

You’re already replacing the max over θ with a distribution over θ.  What about the πi

It would seem to be “more Bayesian” to put a prior on the data item probabilities pi_i, and it would also seem to “do the right thing” in situations where there are several different pi that have the same empirical likelihood.

This is a fairly reasonable question, which first reminds me of an issue we had examined with Costas Goutis, on his very last trip to Paris in 1996, a few months before he died in a diving accident near Seattle. We were wondering if treating the bandwidth in a non-parametric density estimator as a regular parameter was making sense. After experimenting for a few days with different priors we found that it was not such a great idea and that, instead, the prior on the bandwidth needed to depend on the sample size. This led to Costas’ posthumous paper, Nonparametric Estimation of a Mixing Density via the Kernel Method, in JASA in 1997 (with the kind help of Jianqing Fan).

Now, more to the point (of empirical likelihood), I am afraid that putting (almost) any kind of prior on the weights πi would be hopeless. For one thing, the πi are of the same size as the sample (modulo the identifying equation constraints) so estimating them based on a prior that does not depend on the sample size does not produce consistent estimators of the weights. (Search Bayesian nonparametric likelihood estimation for more advanced reasons.) Intuitively, it seems to me that the (true) parameter θ of the (unknown or unavailable) distribution of the data does not make sense in the non-parametric setting or, conversely, that the weights πi have no meaning for the inference on θ. It thus sounds difficult to treat them together and on an equal footing. The approximation

$\max_{\pi} \sum_{i=1}^n \ln\pi_i$

is a function of θ that replaces the unknown or unavailable likelihood, in which the weights have no statistical meaning. But this is a wee of a weak argument as other solutions than the maximisation of the entropy could be used to determine the weights.

In the end, this remains a puzzling issue (and hence a great question), hitting at the difficulty of replacing the true model with an approximation on the one hand and aiming at estimating the true parameter(s) on the other hand.

## i-like[d the] workshop

Posted in Running, Statistics, Travel, University life with tags , , , , , , , , on May 17, 2013 by xi'an

Indeed, I liked the i-like workshop very much. Among the many interesting talks of the past two days (incl. Cristiano Varin’s ranking of Series B as the top influential stat. journal!) , Matti Vihola’s and Nicolas Chopin’s had the strongest impact on me (to the point of scribbling in my notebook). In a joint work with Christophe Andrieu, Matti focussed on evaluating the impact of replacing the target with an unbiased estimate in a Metropolis-Hastings algorithm. In particular, they found necessary and sufficient conditions for keeping geometric and uniform ergodicity. My question (asked by Iain Murray) was whether they had derived ways of selecting the number of terms in the unbiased estimator towards maximal efficiency. I also wonder if optimal reparameterisations can be found in this sense (since unbiased estimators remain unbiased after reparameterisation).

Nicolas’ talk was about particle Gibbs sampling, a joint paper with Sumeet Singh recently arXived. I did not catch the whole detail of their method but/as I got intrigued by a property of Marc Beaumont’s algorithm (the very same algorithm used by Matti & Christophe). Indeed, the notion is that an unbiased estimator of the target distribution can be found in missing variable settings by picking an importance sampling distribution q on those variables. This representation leads to a pseudo-target Metropolis-Hastings algorithm. In the stationary regime, there exists a way to derive an “exact” simulation from the joint posterior on (parameter,latent). All the remaining/rejected latents are then distributed from the proposal q. What I do not see is how this impacts the next MCMC move since it implies generating a new sample of latent variables. I spoke with Nicolas about this over breakfast: the explanation is that this re-generated set of latent variables can be used in the denominator of the Metropolis-Hastings acceptance probability and is validated as a Gibbs step. (Incidentally, it may be seen as a regeneration event as well.)

Furthermore, I had a terrific run in the rising sun (at 5am) all the way to Kenilworth where I was a deer, pheasants and plenty of rabbits. (As well as this sculpture that now appears to me as being a wee sexist…)