## estimating a constant (not really)

Larry Wasserman wrote a blog entry on the normalizing constant paradox, where he repeats that he does not understand my earlier point… Let me try to recap here this point and the various comments I made on StackExchange (while keeping in mind all this is for intellectual fun!).

The entry is somewhat paradoxical in that Larry acknowledges (in that post) that the analysis in his book, All of Statistics, is wrong. The fact that “g(x)/c is a valid density only for one value of c” (and hence cannot lead to a notion of likelihood on c) is the very reason why I stated that there can be no statistical inference nor prior distribution about c: a sample from f does not bring statistical information about c and there can be no statistical estimate of c based on this sample. (In case you did not notice, I insist upon statistical!)

To me this problem is completely different from a statistical problem, at least in the modern sense: if I need to approximate the constant c (as I do in fact when computing Bayes factors), I can produce an arbitrarily long sample from a certain importance distribution and derive a converging (and sometimes unbiased) approximation of c. Once again, this is Monte Carlo integration, a numerical technique based on the Law of Large Numbers and the stabilisation of frequencies. (Call it a frequentist method if you wish. I completely agree that MCMC methods are inherently frequentist in that sense, and see no problem with this because they are not statistical methods. Of course, this may be the core of the disagreement with Larry and others, in that they count the Law of Large Numbers as statistics, and I do not. This lack of separation between both notions also shows up in a recent general public talk on Poincaré’s mistakes by Cédric Villani! All this may just mean I am irremediably Bayesian, seeing anything motivated by frequencies as non-statistical!)

But that process does not mean that c can take a range of values that would index a family of densities compatible with a given sample. In this Monte Carlo integration approach, the distribution of the sample is completely under control (modulo the errors induced by pseudo-random generation). This approach is therefore outside the realm of Bayesian analysis “that puts distributions on fixed but unknown constants”, because those unknown constants parameterise the distribution of an observed sample. Ergo, c is not a parameter of the sample and the sample Larry argues about (“we have data sampled from a distribution”) contains no information whatsoever about c that is not already in the function g. (It is not “data” in this respect, but a stochastic sequence that can be used for approximation purposes.) Which gets me back to my first argument, namely that c is known (and at the same time difficult or impossible to compute)!
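As a minimal sketch of the importance sampling approximation described above, the following Python code approximates the normalising constant of an unnormalised density. Everything specific here is an assumption made for illustration and is not taken from the post: the target g(x) = exp(-x²/2) (so that the true value c = √(2π) is available for checking), the standard Cauchy importance distribution, and the function names.

```python
import math
import random


def g(x):
    """Unnormalised density (illustrative choice): c = sqrt(2 * pi)."""
    return math.exp(-0.5 * x * x)


def cauchy_pdf(x):
    """Density of the standard Cauchy importance distribution."""
    return 1.0 / (math.pi * (1.0 + x * x))


def approximate_c(n, seed=0):
    """Average the importance weights g(x_i) / q(x_i) over n draws from q."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        # Inverse-cdf draw from the standard Cauchy distribution.
        x = math.tan(math.pi * (rng.random() - 0.5))
        total += g(x) / cauchy_pdf(x)
    return total / n


c_hat = approximate_c(200_000)
print(c_hat, math.sqrt(2.0 * math.pi))
```

Note that the simulated sample is produced entirely by the user from a known distribution: increasing `n` tightens the approximation, but no value of c other than the true one is ever compatible with g.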

Let me also answer here the comments “why is this any different from estimating the speed of light c?” and “why can’t you do this with the 100th digit of π?” made on the earlier post or on StackExchange. Estimating the speed of light means for me (who repeatedly flunked Physics exams after leaving high school!) that we have a physical experiment that measures the speed of light (like the original one by Rœmer at the Observatoire de Paris, which I visited last week) and that the statistical analysis draws inference about c by using those measurements and the impact of the imprecision of the measuring instruments (as we do when analysing astronomical data). If, now, there exists a physical formula of the kind

$c=\int_\Xi \psi(\xi) \varphi(\xi) \text{d}\xi$

where φ is a probability density, I can imagine stochastic approximations of c based on this formula, but I do not consider it a statistical problem any longer. The case is thus clearer for the 100th digit of π: it is also a fixed number, that I can approximate by a stochastic experiment but on which I cannot attach a statistical tag. (It is 9, by the way.) Throwing darts at random as I did during my Oz tour is not a statistical procedure, but simple Monte Carlo à la Buffon…
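To make the dart-throwing remark concrete, here is a small Python sketch of the classical hit-or-miss approximation of π. It is one instance of the formula above under assumptions chosen for illustration: φ is the uniform density on the unit square and ψ is four times the indicator of the quarter disc. The function name is hypothetical, and the sketch approximates π itself, not its 100th digit.

```python
import random


def darts_pi(n, seed=0):
    """Throw n uniform darts at the unit square; count those in the quarter disc."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    # The quarter disc has area pi / 4, hence the factor 4.
    return 4.0 * hits / n


pi_hat = darts_pi(100_000)
print(pi_hat)
```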

Overall, I still do not see this as a paradox for our field (and certainly not as a critique of Bayesian analysis), because there is no reason a statistical technique should be able to address any and every numerical problem. (Once again, Persi Diaconis would almost certainly differ, as he defended a Bayesian perspective on numerical analysis in the early days of MCMC…) There may be a “Bayesian” solution to this particular problem (and that would be nice) and there may be none (and that would be OK too!), but I am not even convinced I would call this solution “Bayesian”! (Again, let us remember this is mostly for intellectual fun!)

### 9 Responses to “estimating a constant (not really)”

1. [...] estimating a constant (not really) [...]

2. You’ve given us estimators for the mean and variance for an importance sampler run — can’t I use these to estimate (at least asymptotically) the likelihood of observing this importance sampler run for any given value of c?

And once we have likelihoods, don’t we have Bayesian inference for c?

• Argh, Mark, you are setting the debate back to stage one: I thought it was more or less settled (?!) that there is no such thing as a likelihood on the normalising constant?! Actually, in the previous post, I also pointed out that the numerical approximations of the constant are not estimates in the usual sense, since there is no unknown parameter. (Thanks for the comments, eh!)

• I understand your qualms: what could the relevant probability distributions possibly be over? (“possible worlds” in which the laws of arithmetic are different?) But I think it really is quite parallel to the “speed of light” case: we have one or more “noisy measurements” of some unknown quantity that we want to combine in some way. (Can’t we think of an importance sampling run as a noisy measurement of c?)

Imagine we set estimating c using annealed importance sampling as a homework problem for a group of students, and we’d like to combine their answers to arrive at an even more accurate estimate of c. But each student used a different reference distribution, a different number of samples, a different annealing schedule, etc., so a simple average doesn’t make sense; what would a better method be?

There’s a technical problem for even an “in principle” Bayesian estimator for c that I don’t see a way around: as you point out, we can estimate P(c | r) where c is the partition function and r is an importance sampler run, but for Bayesian estimation we need a likelihood P(r | c), and I don’t see how to get this. (We can estimate P(r, c) from multiple importance sampler runs; could we use this to estimate P(r | c)? Of course c is a deterministic function of r so our estimate of P(r | c) would be a mixture of delta functions …)

• Sorry to keep bothering you with this, but couldn’t we estimate (at least in principle) a likelihood P(r | c) (where r is an importance sampling run and c is the partition function) using techniques like the ones you use in ABC samplers? That is, approximate by putting epsilon balls around r and c? (I’m not sure what the right metric on r space would be, but you’re the expert here!). Then we could at least formally set up a Bayesian inference for c (even if we’re still unsure of what it would mean).

• Thanks, Mark, for the additional comments. I am still unsure about your suggestion: using a discretization via balls is going against the notion that we know everything but c (and hence c in a way!) about the problem.

So using this approach or a Bayesian non-parametric approach to estimating f sounds like throwing away information. My gut feeling is that this paradox centres on calling c a parameter (of the sample), while it is not.

• At this stage I’m just interested to see if it is possible to come up with any Bayesian estimator of c (let’s see if it is in fact possible before worrying about doing it efficiently).

Just thinking about this problem may help us understand the limits of Bayesian inference. As far as I can tell the question of whether there’s a Bayesian estimator for c is still open: there’s a significant technical problem (as you point out, we need a likelihood in which c is a parameter), but as far as I can tell there’s no proof that it can’t be done.

Importance sampling for c is basically just estimating a ratio from a set of samples, and it surprises me that something apparently so simple may be beyond the expressive power of Bayesian inference. Maybe I should be more reserved in advocating Bayesian methods to my colleagues and students!

3. ok one more question.
After you get your simulation-based point estimate of c,
how do you assess your uncertainty?
Is there a posterior for c?
Or a confidence interval?
Or something else?

Larry

• Thanks, Larry, you got me stuck for maybe one minute… then I went running in the early morning and realised this was just the same thing: when using a stochastic approximation to the integral c,

$\delta=\frac{1}{N} \sum_{i=1}^N h(x_i)$

say, the distribution of δ is given and known, at least formally, so I can exploit (in principle) this distribution to assess the variability of my evaluation δ. Obviously, in practice, the variance is also as unavailable as c,

$\varpi = \int (h(x)-c)^2 f(x) \text{d} x$

and has itself to be approximated, leading to a sort of infinite regress. However, it is indeed the same thing (to me).
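As a hedged illustration of this reply, the Python sketch below computes δ together with a plug-in approximation of its variability, replacing the unavailable variance ϖ by the empirical variance of the h(x_i). The setting is an assumption chosen so that the answer can be checked: f is the standard normal density and h(x) = x², hence c = 1. The helper name `delta_with_error` is hypothetical.

```python
import math
import random


def delta_with_error(h, sampler, n, seed=0):
    """Return the Monte Carlo average of h and its plug-in standard error."""
    rng = random.Random(seed)
    values = [h(sampler(rng)) for _ in range(n)]
    delta = sum(values) / n
    # Empirical variance of the h(x_i), standing in for the unknown variance.
    var_hat = sum((v - delta) ** 2 for v in values) / (n - 1)
    std_err = math.sqrt(var_hat / n)
    return delta, std_err


# Illustrative setting: x_i ~ N(0, 1) and h(x) = x^2, so that c = 1.
delta, std_err = delta_with_error(
    lambda x: x * x,
    lambda rng: rng.gauss(0.0, 1.0),
    50_000,
)
print(delta, std_err)
```

The standard error is itself a Monte Carlo approximation, which is the first step of the regress mentioned above: assessing its own precision would require approximating a fourth moment, and so on.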