**A**n ‘Og’s reader pointed me to this paper by Li and Malik, which made it to arXiv after not making it to NIPS. While the NIPS reviews were not particularly informative and strongly discordant, the authors point out in the comments that they are available for the sake of promoting discussion. (As made clear in earlier posts, I am quite supportive of this attitude! *Disclaimer: I was not involved in an evaluation of this paper, neither for NIPS nor for another conference or journal!!*) Although the paper does not seem to mention ABC in the setting of implicit likelihoods and generative models, there is a reference to the early (1984) paper by Peter Diggle and Richard Gratton that is often seen as the ancestor of ABC methods. The authors point out numerous issues with solutions proposed for parameter estimation in such implicit models. For instance, for GANs, they signal that “minimizing the Jensen-Shannon divergence or the Wasserstein distance between the empirical data distribution and the model distribution does not necessarily minimize the same between the true data distribution and the model distribution.” (Not mentioning the particular difficulty with Bayesian GANs.) Their own solution is the implicit maximum likelihood estimator, which picks the value of the parameter θ bringing a simulated sample the closest to the observed sample. Closest in the sense of the Euclidean distance between both samples. Or between the minimum of several simulated samples and the observed sample. (The modelling seems to imply the availability of n>1 observed samples.) They advocate using a stochastic gradient descent approach for finding the optimal parameter θ which presupposes that the dependence between θ and the simulated samples is somewhat differentiable. (And this does not account for using a min, which would make differentiation close to impossible.) The paper then meanders in a lengthy discussion as to whether maximising the likelihood makes sense, with a rather naïve view on why using the empirical distribution in a Kullback-Leibler divergence does not make sense! What does not make sense is considering the finite sample approximation to the Kullback-Leibler divergence with the true distribution in my opinion.

## Archive for untractable normalizing constant

## Implicit maximum likelihood estimates

Posted in Statistics with tags ABC, Approximate Bayesian computation, GANs, Hyvärinen score, Kullback-Leibler divergence, likelihood-free methods, maximum likelihood estimation, NIPS 2018, Peter Diggle, untractable normalizing constant, Wasserstein distance on October 9, 2018 by xi'an## approximate likelihood

Posted in Books, Statistics with tags ABC, arXiv, likelihood-free methods, maximum entropy, synthetic likelihood, untractable normalizing constant on September 6, 2017 by xi'an**T**oday, I read a newly arXived paper by Stephen Gratton on a method called GLASS for *General Likelihood Approximate Solution Scheme*… The starting point is the same as with ABC or synthetic likelihood, namely a collection of summary statistics and an intractable likelihood. The author proposes to use as a substitute a maximum entropy solution based on these summary statistics and their assumed moments under the theoretical model. What is quite unclear in the paper is whether or not these assumed moments are available in closed form or not. Otherwise, it would appear as a variant to the synthetic likelihood [aka simulated moments] approach, meaning that the expectations of the summary statistics under the theoretical model and for a given value of the parameter are obtained through Monte Carlo approximations. (All the examples therein allow for closed form expressions.)

## normalising constants of G-Wishart densities

Posted in Books, Statistics with tags birth-and-death process, G-Wishart distribution, graphs, ratio of integrals, reversible jump MCMC, untractable normalizing constant, Wishart distribution on June 28, 2017 by xi'anAbdolreza Mohammadi, Hélène Massam, and Gérard Letac arXived last week a paper on a new approximation of the ratio of two normalising constants associated with two G-Wishart densities associated with different graphs G. The G-Wishart is the generalisation of the Wishart distribution by Alberto Roverato to the case when some entries of the matrix are equal to zero, which locations are associated with the graph G. While enjoying the same shape as the Wishart density, this generalisation does not enjoy a closed form normalising constant. Which leads to an intractable ratio of normalising constants when doing Bayesian model selection across different graphs.

Atay-Kayis and Massam (2005) expressed the ratio as a ratio of two expectations, and the current paper shows that this leads to an approximation of the ratio of normalising constants for a graph G against the graph G augmented by the edge e, equal to

Γ(½{δ+d}) / 2 √π Γ(½{δ+d+1})

where δ is the degree of freedom of the G-Wishart and d is the number of minimal paths of length 2 linking the two end points of e. This is remarkably concise and provides a fast approximation. (The proof is quite involved, by comparison.) Which can then be used in reversible jump MCMC. The difficulty is obviously in evaluating the impact of the approximation on the target density, as there is no manageable available alternative to calibrate the approximation. In a simulation example where such an alternative is available, the error is negligible though.

## density normalization for MCMC algorithms

Posted in Statistics, University life with tags harmonic mean estimator, importance sampling, Laplace approximation, MCMC, MCQMC, Monte Carlo Statistical Methods, normalising constant, Rao-Blackwellisation, untractable normalizing constant on November 6, 2014 by xi'an**A**nother paper addressing the estimation of the normalising constant and the wealth of available solutions just came out on arXiv, with the full title of “Target density normalization for Markov chain Monte Carlo algorithms“, written by Allen Caldwell and Chang Liu. (I became aware of it by courtesy of Ewan Cameron, as it appeared in the physics section of arXiv. It is actually a wee bit annoying that papers in the subcategory “Data Analysis, Statistics and Probability” of physics do not get an automated reposting on the statistics lists…)

In this paper, the authors compare three approaches to the problem of finding

when the density f is unormalised, i.e., in more formal terms, when f is proportional to a probability density (and available):

- an “arithmetic mean”, which is an importance sampler based on (a) reducing the integration volume to a neighbourhood ω of the global mode. This neighbourhood is chosen as an hypercube and the importance function turns out to be the uniform over this hypercube. The corresponding estimator is then a rescaled version of the average of f over uniform simulations in ω.
- an “harmonic mean”, of all choices!, with again an integration over the neighbourhood ω of the global mode in order to avoid the almost sure infinite variance of harmonic mean estimators.
- a Laplace approximation, using the target at the mode and the Hessian at the mode as well.

The paper then goes to comparing those three solutions on a few examples, demonstrating how the diameter of the hypercube can be calibrated towards a minimum (estimated) uncertainty. The rather anticlimactic conclusion is that the arithmetic mean is the most reliable solution as harmonic means may fail in larger dimension and more importantly fail to signal its failure, while Laplace approximations only approximate well quasi-Gaussian densities…

What I find most interesting in this paper is the idea of using only one part of the integration space to compute the integral, even though it is not exactly new. Focussing on a specific region ω has pros and cons, the pros being that the reduction to a modal region reduces needs for absolute MCMC convergence and helps in selecting alternative proposals and also prevents from the worst consequences of using a dreaded harmonic mean, the cons being that the region needs be well-identified, which means requirements on the MCMC kernel, and that the estimate is a product of two estimates, the frequency being driven by a Binomial noise. I also like very much the idea of calibrating the diameter Δof the hypercube ex-post by estimating the uncertainty.

As an aside, the paper mentions most of the alternative solutions I just presented in my Monte Carlo graduate course two days ago (like nested or bridge or Rao-Blackwellised sampling, including our proposal with Darren Wraith), but dismisses them as not “directly applicable in an MCMC setting”, i.e., without modifying this setting. I unsurprisingly dispute this labelling, both because something like the Laplace approximation requires extra-work on the MCMC output (and once done this work can lead to advanced Laplace methods like INLA) and because other methods could be considered as well (for instance, bridge sampling over several hypercubes). As shown in the recent paper by Mathieu Gerber and Nicolas Chopin (soon to be discussed at the RSS!), MCqMC has also become a feasible alternative that would compete well with the methods studied in this paper.

Overall, this is a paper that comes in a long list of papers on constant approximations. I do not find the Markov chain of MCMC aspect particularly compelling or specific, once the effective sample size is accounted for. It would be nice to find generic ways of optimising the visit to the hypercube ω *and* to estimate efficiently the weight of ω. The comparison is solely run over examples, but they all rely on a proper characterisation of the hypercube and the ability to simulate efficiently f over that hypercube.

## this issue of Series B

Posted in Books, Statistics, Travel, University life with tags bag of little bootstraps, Bayesian bridge, Bayesian lasso, JRSSB, marginal likelihood, Markov chain Monte Carlo, normalising constant, Series B, simulation, untractable normalizing constant, Wasserman's paradox on September 5, 2014 by xi'an**T**he September issue of [JRSS] Series B I received a few days ago is of particular interest to me. (And not as an ex-co-editor since I was never involved in any of those papers!) To wit: a paper by Hani Doss and Aixin Tan on evaluating normalising constants based on MCMC output, a preliminary version I had seen at a previous JSM meeting, a paper by Nick Polson, James Scott and Jesse Windle on the Bayesian bridge, connected with Nick’s talk in Boston earlier this month, yet another paper by Ariel Kleiner, Ameet Talwalkar, Purnamrita Sarkar and Michael Jordan on the bag of little bootstraps, which presentation I heard Michael deliver a few times when he was in Paris. (Obviously, this does not imply any negative judgement on the other papers of this issue!)

For instance, Doss and Tan consider the multiple mixture estimator [my wording, the authors do not give the method a name, referring to Vardi (1985) but missing the connection with Owen and Zhou (2000)] of k ratios of normalising constants, namely

where the z’s are the normalising constants and with possible different numbers of iterations of each Markov chain. An interesting starting point (that Hans Künsch had mentioned to me a while ago but that I had since then forgotten) is that the problem was reformulated by Charlie Geyer (1994) as a quasi-likelihood estimation where the ratios of all z’s relative to one reference density are the unknowns. This is doubling interesting, actually, because it restates the constant estimation problem into a statistical light and thus somewhat relates to the infamous “paradox” raised by Larry Wasserman a while ago. The novelty in the paper is (a) to derive an optimal estimator of the ratios of normalising constants in the Markov case, essentially accounting for possibly different lengths of the Markov chains, and (b) to estimate the variance matrix of the ratio estimate by regeneration arguments. A favourite tool of mine, at least theoretically as practically useful minorising conditions are hard to come by, if at all available.

## recycling accept-reject rejections

Posted in Statistics, University life with tags accept-reject algorithm, arXiv, auxiliary variable, Data augmentation, George Casella, intractable likelihood, Monte Carlo Statistical Methods, Rao-Blackwellisation, recycling, untractable normalizing constant on July 1, 2014 by xi'an**V**inayak Rao, Lizhen Lin and David Dunson just arXived a paper which proposes anew technique to handle intractable normalising constants. And which exact title is Data augmentation for models based on rejection sampling. (Paper that I read in the morning plane to B’ham, since this is one of my weeks in Warwick.) The central idea therein is that, if the sample density (*aka* likelihood) satisfies

where all terms but p are known in closed form, then completion by the rejected values of an hypothetical accept-reject algorithm−hypothetical in the sense that the data does not have to be produced by an accept-reject scheme but simply the above domination condition to hold−allows for a data augmentation scheme. Without requiring the missing normalising constant. Since the completed likelihood is

A closed-form, if not necessarily congenial, function.

**N**ow this is quite a different use of the “rejected values” from the accept reject algorithm when compared with our 1996 Biometrika paper on the Rao-Blackwellisation of accept-reject schemes (which, still, could have been mentioned there… Or Section 4.2 of Monte Carlo Statistical Methods. Rather than re-deriving the joint density of the augmented sample, “accepted+rejected”.)

**I**t is a neat idea in that it completely bypasses the approximation of the normalising constant. And avoids the somewhat delicate tuning of the auxiliary solution of Moller et al. (2006) The difficulty with this algorithm is however in finding an upper bound M on the unnormalised density f that is

- in closed form;
- with a manageable and tight enough “constant” M;
- compatible with running a posterior simulation conditional on the added rejections.

The paper seems to assume further that the bound M is independent from the current parameter value θ, at least as suggested by the notation (and Theorem 2), but this is not in the least necessary for the validation of the formal algorithm. Such a constraint would pull M higher, hence reducing the efficiency of the method. Actually the matrix Langevin distribution considered in the first example involves a bound that depends on the parameter κ.

**T**he paper includes a result (Theorem 2) on the uniform ergodicity that relies on heavy assumptions on the proposal distribution. And a rather surprising one, namely that the probability of *rejection* is bounded from below, i.e. calling for a *less* efficient proposal. Now it seems to me that a uniform ergodicity result holds as well when the probability of *acceptance* is bounded from below since, then, the event when no rejection occurs constitutes an atom from the augmented Markov chain viewpoint. There therefore occurs a renewal each time the rejected variable set ϒ is empty, and ergodicity ensues (Robert, 1995, *Statistical Science*).

**N**ote also that, despite the opposition raised by the authors, the method *per se* does constitute a pseudo-marginal technique à la Andrieu-Roberts (2009) since the independent completion by the (pseudo) rejected variables produces an unbiased estimator of the likelihood. It would thus be of interest to see how the recent evaluation tools of Andrieu and Vihola can assess the loss in efficiency induced by this estimation of the likelihood.

*Maybe some further experimental evidence tomorrow…*

## the Poisson transform

Posted in Books, Kids, pictures, Statistics, University life with tags Banff, Cauchy, logistic regression, Paris, Poisson point process, Sceaux, Siméon Poisson, untractable normalizing constant on June 19, 2014 by xi'an**I**n obvious connection with an earlier post on the “estimation” of normalising constants, Simon Barthelmé and Nicolas Chopin just arXived a paper on The Poisson transform for unormalised statistical models. Obvious connection because I heard of the Guttmann and Hyvärinen (2012) paper when Simon came to CREST to give a BiP talk on this paper a few weeks ago. (A connected talk he gave in Banff is available as a BIRS video.)

**W**ithout getting too much into details, the neat idea therein is to turn the observed likelihood

into a joint likelihood

which is the likelihood of a Poisson point process with intensity function

This is an alternative model in that the original likelihood does not appear as a marginal of the above. Only the modes coincide, with the conditional mode in ν providing the normalising constant. In practice, the above Poisson process likelihood is unavailable and Guttmann and Hyvärinen (2012) offer an approximation by means of their logistic regression.

**U**navailable likelihoods inevitably make me think of ABC. Would ABC solutions be of interest there? In particular, could the Poisson point process be simulated with no further approximation? Since the “true” likelihood is not preserved by this representation, similar questions to those found in ABC arise, like a measure of departure from the “true” posterior. Looking forward the Bayesian version!* (Marginalia: Siméon Poisson died in Sceaux, which seemed to have attracted many mathematicians at the time, since Cauchy also spent part of his life there…)*