Mark Johnson sent me the following question a few days ago:
I have one question about EL: how important is it to maximise the probabilities πi on the data items in the formula (stolen from the Wikipedia page on EL)?
You’re already replacing the max over θ with a distribution over θ. What about the πi?
It would seem to be “more Bayesian” to put a prior on the data item probabilities πi, and it would also seem to “do the right thing” in situations where there are several different π that have the same empirical likelihood.
This is a fairly reasonable question, which first reminds me of an issue I had examined with Costas Goutis on his very last trip to Paris in 1996, a few months before he died in a diving accident near Seattle. We were wondering whether treating the bandwidth in a non-parametric density estimator as a regular parameter made sense. After experimenting for a few days with different priors, we found that it was not such a great idea and that, instead, the prior on the bandwidth needed to depend on the sample size. This led to Costas’ posthumous paper, Nonparametric Estimation of a Mixing Density via the Kernel Method, in JASA in 1997 (with the kind help of Jianqing Fan).
Now, more to the point (of empirical likelihood), I am afraid that putting (almost) any kind of prior on the weights πi would be hopeless. For one thing, there are as many πi as observations (modulo the identifying equation constraints), so estimating them under a prior that does not depend on the sample size does not produce consistent estimators of the weights. (Search for “Bayesian nonparametric likelihood estimation” for more advanced reasons.) Intuitively, it seems to me that the (true) parameter θ of the (unknown or unavailable) distribution of the data does not make sense in the non-parametric setting or, conversely, that the weights πi have no meaning for the inference on θ. It thus sounds difficult to treat them together and on an equal footing. The empirical likelihood approximation is a function of θ that replaces the unknown or unavailable likelihood, in which the weights have no statistical meaning. But this is a wee bit of a weak argument, as solutions other than the maximisation of the entropy could be used to determine the weights.
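To make the role of the weights concrete, here is a minimal sketch of how the maximising πi can be computed for a fixed θ. It assumes the simplest identifying constraint, a mean: the πi sum to one and the weighted sum of the (x_i - θ) is zero. That constraint is an illustration only and is not the model discussed in the question. The function names `el_weights` and `log_el_ratio` are hypothetical, and the Lagrange multiplier is found by plain bisection rather than by any particular library routine.

```python
import math


def el_weights(x, theta, tol=1e-12, max_iter=200):
    """Weights maximising prod(pi_i) subject to
    sum(pi_i) = 1 and sum(pi_i * (x_i - theta)) = 0 (assumed mean constraint)."""
    n = len(x)
    z = [xi - theta for xi in x]
    zmin, zmax = min(z), max(z)
    if not (zmin < 0.0 < zmax):
        raise ValueError("theta must lie strictly inside the range of the data")

    # The solution has the form pi_i = 1 / (n * (1 + lam * z_i)); every weight
    # is positive only if lam stays inside the open interval (lo, hi).
    lo, hi = -1.0 / zmax, -1.0 / zmin

    def g(lam):
        # Strictly decreasing in lam; its root enforces the mean constraint.
        return sum(zi / (1.0 + lam * zi) for zi in z)

    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break

    lam = 0.5 * (lo + hi)
    return [1.0 / (n * (1.0 + lam * zi)) for zi in z]


def log_el_ratio(x, theta):
    """Log of prod(n * pi_i): zero at the sample mean, negative elsewhere."""
    n = len(x)
    return sum(math.log(n * w) for w in el_weights(x, theta))


if __name__ == "__main__":
    data = [1.0, 2.0, 4.0, 7.0]  # toy data, for illustration only
    for theta in (2.5, 3.0, 3.5, 4.0):
        print(theta, log_el_ratio(data, theta))
```

Evaluating `log_el_ratio` over a grid of θ values traces out the function of θ that stands in for the likelihood. The weights themselves are only a by-product of the constrained optimisation at each θ.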
In the end, this remains a puzzling issue (and hence a great question), touching on the difficulty of replacing the true model with an approximation on the one hand while aiming at estimating the true parameter(s) on the other hand.