## estimating normalising constants [mea culpa?!]

“The basic idea is to estimate the parameters by learning to discriminate between the data x and some artificially generated noise y.”

**A**s a follow-up to this popular earlier post of mine on [not] estimating normalising constants, Simon Barthelmé and Nicolas Chopin pointed me to recent papers by Michael Gutmann and Aapo Hyvärinen on this topic, one published in the proceedings of AISTATS 2010 in Sardinia and one from the proceedings of the 2013 Workshop on Information Theoretic Methods in Science and Engineering (WITMSE2013), in Tokyo. Which led me to reconsider my perspective on this issue…

**J**ust like Larry, Gutmann and Hyvärinen consider the normalising constant associated with an unnormalised density p(x;α),

$$Z(\alpha) = \int p(x;\alpha)\,\text{d}x,$$

as *an extra parameter*. They then add to the actual sample from the unnormalised density an artificial sample of identical size from a fixed distribution g, and proceed to run a logistic regression on the model index (p *versus* g) based on the merged datasets. This is a logistic regression parameterised by the difference of the log-densities:

$$\log p(x;\alpha) - \log Z - \log g(x).$$

With the actual sample corresponding to the first modality and the artificial sample to the second modality. While the resulting estimator is different, this approach reminds me of the proposal we made in our nested sampling paper of 2009 with Nicolas, esp. Section 6.3 where we also introduce an artificial mixture to estimate the normalising constant (and obtain an alternative version of bridge sampling). The difference is that Gutmann and Hyvärinen estimate both Z and α by logistic regression. And without imposing the integration constraint that would turn Z into a superfluous “parameter”.
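To make the logistic-regression step concrete, here is a minimal sketch in Python (NumPy only) on a toy case of my own choosing, not one taken from the papers: the unnormalised density is assumed to be p(x;α) = exp(−αx²/2), so the true constant is Z(α) = √(2π/α), and the noise distribution g is assumed to be a N(0, 2²) density. The function name `nce_fit`, the free parameter `c` standing for −log Z, and the damped Newton solver are all illustrative choices.

```python
import numpy as np

def nce_fit(x, y, log_g, n_iter=25):
    """Logistic regression of data x (label 1) against noise y (label 0).

    Assumed toy model: log p(u; alpha) = -alpha * u**2 / 2 (unnormalised),
    with c = -log Z treated as a free parameter. Returns (alpha_hat, c_hat).
    """
    u = np.concatenate([x, y])
    label = np.concatenate([np.ones(len(x)), np.zeros(len(y))])
    # The logit is G(u) = -alpha*u^2/2 + c - log g(u): linear in (alpha, c),
    # with -log g(u) acting as a known offset.
    feats = np.column_stack([-0.5 * u**2, np.ones_like(u)])
    offset = -log_g(u)

    def loss(theta):
        # Negative Bernoulli log-likelihood with logit G
        G = feats @ theta + offset
        return np.sum(np.logaddexp(0.0, G) - label * G)

    theta = np.array([1.0, 0.0])  # arbitrary starting point
    for _ in range(n_iter):
        G = feats @ theta + offset
        prob = np.exp(-np.logaddexp(0.0, -G))  # stable sigmoid(G)
        grad = feats.T @ (prob - label)
        hess = feats.T @ (feats * (prob * (1.0 - prob))[:, None])
        step = np.linalg.solve(hess, grad)
        # Newton step, halved until the loss does not increase
        t = 1.0
        while loss(theta - t * step) > loss(theta) and t > 1e-8:
            t *= 0.5
        theta = theta - t * step
    return theta

rng = np.random.default_rng(0)
alpha_true, sigma_g, n = 2.0, 2.0, 20000
x = rng.normal(0.0, 1.0 / np.sqrt(alpha_true), size=n)  # actual sample from p
y = rng.normal(0.0, sigma_g, size=n)                    # artificial sample from g

def log_g(u):
    # Log-density of the assumed N(0, sigma_g^2) noise distribution
    return -0.5 * (u / sigma_g) ** 2 - np.log(sigma_g * np.sqrt(2.0 * np.pi))

alpha_hat, c_hat = nce_fit(x, y, log_g)
Z_hat = np.exp(-c_hat)
Z_true = np.sqrt(2.0 * np.pi / alpha_true)
print(alpha_hat, Z_hat, Z_true)
```

Because the two samples have the same size, the prior log-odds vanish and the logit reduces to the difference of log-densities. Note that nothing in the fit forces exp(−c) to equal the integral of p(·;α): the constraint is only recovered asymptotically, which is precisely the point under discussion.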

**N**ow, if we return to the original debate, does this new estimation approach close it? And if so, is it to my defeat (hence the title)?! Obviously, Gutmann and Hyvärinen use both a statistical technique and a statistical model to estimate the constant Z(α). They produce an extra artificial sample from g but exploit the current sample from p and no other. The estimator of the normalising constant converges as the sample size grows. However, I do remain puzzled by the addition of the normalising constant to the parameter vector. The data comes from a probability distribution and hence the normalising constraint holds. Relaxing the constraint leads to a minimisation framework that can be interpreted as either statistics or numerics. Which keeps open my original question of what information about the constant Z(α) is contained in the sample per se… (But without questioning the potential of this method for providing an estimate of the constant.)

May 28, 2014 at 4:19 pm

Thanks, Corey. Maybe my reluctance to dub this as “statistical” is that I do not see a strong justification in the approximation of the variance of the estimator…

May 28, 2014 at 3:55 am

The above method strikes me as somehow similar to a scheme in a paper by Salimans and Knowles, where the application is definitely numerical optimization, not statistics.

One key insight of that paper is that when the target distribution being approximated belongs to the (exponential) family of approximating distributions, the (there, stochastic iterative) scheme finds the ideal “approximation” in a finite number of iterations. This is because the iterative scheme uses stochastic support points for a minimization scheme implemented through regression, and the stochasticity of the algorithm cancels out perfectly when an ideal “approximation” is possible.