## why do we maximise the weights in empirical likelihood?

Mark Johnson sent me the following question a few days ago:

I have one question about EL: how important is it to maximise the probabilities π_i on the data items in the formula (stolen from the Wikipedia page on EL)?

$\max_{\pi,\theta} \sum_{i=1}^n \ln\pi_i$

You’re already replacing the max over θ with a distribution over θ. What about the π_i?

It would seem to be “more Bayesian” to put a prior on the data item probabilities π_i, and it would also seem to “do the right thing” in situations where there are several different π that have the same empirical likelihood.

This is a fairly reasonable question, which first reminds me of an issue we had examined with Costas Goutis, on his very last trip to Paris in 1996, a few months before he died in a diving accident near Seattle. We were wondering whether it made sense to treat the bandwidth in a non-parametric density estimator as a regular parameter. After experimenting for a few days with different priors, we found that it was not such a great idea and that, instead, the prior on the bandwidth needed to depend on the sample size. This led to Costas’ posthumous paper, Nonparametric Estimation of a Mixing Density via the Kernel Method, in JASA in 1997 (with the kind help of Jianqing Fan).

Now, more to the point (of empirical likelihood), I am afraid that putting (almost) any kind of prior on the weights π_i would be hopeless. For one thing, the number of weights π_i grows with the sample size (modulo the identifying equation constraints), so estimating them under a prior that does not depend on the sample size does not produce consistent estimators of the weights. (See the literature on Bayesian nonparametric estimation for more advanced reasons.) Intuitively, it seems to me that the (true) parameter θ of the (unknown or unavailable) distribution of the data does not make sense in the non-parametric setting or, conversely, that the weights π_i have no meaning for the inference on θ. It thus sounds difficult to treat them together and on an equal footing. The approximation

$\max_{\pi} \sum_{i=1}^n \ln\pi_i \quad\text{subject to}\quad \sum_{i=1}^n \pi_i = 1,\quad \sum_{i=1}^n \pi_i\, h(x_i,\theta) = 0,$

where h(·,θ) denotes the identifying equation, is a function of θ that replaces the unknown or unavailable likelihood, and in it the weights have no statistical meaning. But this is a wee bit of a weak argument, as other solutions than the maximisation of the entropy could be used to determine the weights.
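As a concrete illustration, here is a minimal sketch of solving this inner maximisation, taking the mean as the identifying equation (an assumption made purely for illustration): the Lagrangian gives the familiar reciprocal form π_i = 1/{n(1+λ(x_i−θ))}, with λ found by a one-dimensional root search.

```python
import numpy as np
from scipy.optimize import brentq

def el_weights(x, theta):
    """Empirical-likelihood weights maximising sum_i log(pi_i) subject to
    sum_i pi_i = 1 and sum_i pi_i*(x_i - theta) = 0 (mean constraint).
    The Lagrangian solution is pi_i = 1 / (n * (1 + lam*(x_i - theta)))."""
    z = np.asarray(x, dtype=float) - theta
    n = len(z)
    if z.min() >= 0 or z.max() <= 0:
        raise ValueError("theta must lie inside the convex hull of the data")
    # lam must keep every 1 + lam*z_i strictly positive:
    lo = (-1 + 1e-10) / z.max()   # just above -1/max(z)
    hi = (-1 + 1e-10) / z.min()   # just below -1/min(z)
    g = lambda lam: np.sum(z / (1 + lam * z))  # stationarity condition in lam
    lam = brentq(g, lo, hi)
    return 1.0 / (n * (1 + lam * z))
```

At θ equal to the sample mean, λ = 0 and the weights revert to the uniform 1/n, where the profile log-EL ratio Σ log(nπ_i) attains its maximum of zero.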

In the end, this remains a puzzling issue (and hence a great question), hitting at the difficulty of replacing the true model with an approximation on the one hand and aiming at estimating the true parameter(s) on the other hand.

### 4 Responses to “why do we maximise the weights in empirical likelihood?”

1. I maximized because that is the direct analogy to parametric likelihood and Wilks’ theorem etc. But there are alternatives. As Kamild points out, Lazar’s Bayesian EL and Rubin’s Bayesian bootstrap do things differently.

There is also a view that empirical likelihood is (nearly) a likelihood on a least favorable family with dimension equal to that of the parameter. Ch9 of my book points to work by DiCiccio and Romano (1990) on this. Then it is reasonable to multiply a prior on that family by an empirical likelihood. It would be interesting to connect these dots a bit more.

There are lots of papers on entropy methods. Entropy is natural for finding least informative distributions subject to constraints. It also leads to the familiar exponential tilting. But it looks like the probability of the model under the data, i.e., a backwards likelihood. Empirical likelihood gives a reciprocal tilting that can be solved by convex optimization.

2. The paper “Bayesian empirical likelihood” by N.A. Lazar (Biometrika, 90(2), 2003) discusses placing a prior on the weights. It admits that “The rationale for a prior specification on the pi is the same as that underlying the Bayesian bootstrap…On the other hand, this construction suffers from the same criticisms as are levelled against the Bayesian bootstrap…”

3. Merci, Simon! I just love exponential tilting…!

4. For a variant of empirical likelihood (the exponentially-tilted kind) you can define a prior over distributions so that the empirical likelihood is a good approximation of a non-parametric likelihood estimated under the prior.
See the following paper by Schennach:
http://biomet.oxfordjournals.org/content/92/1/31.short

“We show that a likelihood function very closely related to empirical likelihood naturally arises from a nonparametric Bayesian procedure which places a type of noninformative prior on the space of distributions. This prior gives preference to distributions having a small support and, among those sharing the same support, it favours entropy-maximising distributions. The resulting nonparametric Bayesian procedure admits a computationally convenient representation as an empirical-likelihood-type likelihood where the probability weights are obtained via exponential tilting.”
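To make the contrast between the two tiltings raised in the comments concrete, here is a hedged sketch, again taking the mean as the moment constraint (an assumption for illustration): the entropy/exponential-tilting solution weights the observations as π_i ∝ exp{λ(x_i−θ)}, versus the reciprocal form π_i ∝ 1/(1+λ(x_i−θ)) of empirical likelihood.

```python
import numpy as np
from scipy.optimize import brentq

def et_weights(x, theta):
    """Exponential-tilting (maximum-entropy) weights: pi_i proportional to
    exp(lam*(x_i - theta)), with lam set so the weighted mean equals theta."""
    z = np.asarray(x, dtype=float) - theta
    f = lambda lam: np.sum(z * np.exp(lam * z))  # zero at the tilting parameter
    a, b = -1.0, 1.0
    while f(a) > 0:   # expand the bracket until it straddles the root
        a *= 2.0
    while f(b) < 0:
        b *= 2.0
    lam = brentq(f, a, b)
    w = np.exp(lam * z)
    return w / w.sum()
```

Both tiltings return the uniform weights 1/n when θ equals the sample mean; away from it, the exponential version downweights discordant observations geometrically while the empirical-likelihood version does so hyperbolically.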
