## posterior collapse

Posted in Statistics with tags , , , , , , on February 24, 2022 by xi'an

The latest ABC One World webinar was a talk by Yixin Wang about the posterior collapse of auto-encoders, of which I was completely unaware. It is essentially an identifiability issue with auto-encoders, where the latent variable z at the source of the VAE does not impact the likelihood, assumed to be an exponential family with parameter depending on z and on θ, through possibly a neural network construct. The variational part comes from the parameter being estimated as θ⁰, via a variational approximation.

“….the problem of posterior collapse mainly arises from the model and the data, rather than from inference or optimization…”

The collapse means that the posterior for the latent satisfies p(z|θ⁰,x)=p(z), which is not a standard property since θ⁰=θ⁰(x). Which Yixin Wang, David Blei and John Cunningham show is equivalent to p(x|θ⁰,z)=p(x|θ⁰), i.e. z being unidentifiable. The above quote is then both correct and incorrect in that the choice of the inference approach, i.e. of the estimator θ⁰=θ⁰(x) has an impact on whether or not p(z|θ⁰,x)=p(z) holds. As acknowledged by the authors when describing “methods modify the optimization objectives or algorithms of VAE to avoid parameter values θ at which the latent variable is non-identifiable“. They later build a resolution for identifiable VAEs by imposing that the conditional p(x|θ,z) is injective in z for all values of θ. Resulting in a neural network with Brenier maps.

From a Bayesian perspective, I have difficulties to connect to the issue, the folk lore being that selecting a proper prior is a sufficient fix for avoiding non-identifiability, but more fundamentally I wonder at the relevance of inferring about the latent z’s and hence worrying about their identifiability or lack thereof.

## variational approximation to empirical likelihood ABC

Posted in Statistics with tags , , , , , , , , , , , , , , , , , , on October 1, 2021 by xi'an

Sanjay Chaudhuri and his colleagues from Singapore arXived last year a paper on a novel version of empirical likelihood ABC that I hadn’t yet found time to read. This proposal connects with our own, published with Kerrie Mengersen and Pierre Pudlo in 2013 in PNAS. It is presented as an attempt at approximating the posterior distribution based on a vector of (summary) statistics, the variational approximation (or information projection) appearing in the construction of the sampling distribution of the observed summary. (Along with a weird eyed-g symbol! I checked inside the original LaTeX file and it happens to be a mathbbmtt g, that is, the typewriter version of a blackboard computer modern g…) Which writes as an entropic correction of the true posterior distribution (in Theorem 1).

“First, the true log-joint density of the observed summary, the summaries of the i.i.d. replicates and the parameter have to be estimated. Second, we need to estimate the expectation of the above log-joint density with respect to the distribution of the data generating process. Finally, the differential entropy of the data generating density needs to be estimated from the m replicates…”

The density of the observed summary is estimated by empirical likelihood, but I do not understand the reasoning behind the moment condition used in this empirical likelihood. Indeed the moment made of the difference between the observed summaries and the observed ones is zero iff the true value of the parameter is used in the simulation. I also fail to understand the connection with our SAME procedure (Doucet, Godsill & X, 2002), in that the empirical likelihood is based on a sample made of pairs (observed,generated) where the observed part is repeated m times, indeed, but not with the intent of approximating a marginal likelihood estimator… The notion of using the actual data instead of the true expectation (i.e. as a unbiased estimator) at the true parameter value is appealing as it avoids specifying the exact (or analytical) value of this expectation (as in our approach), but I am missing the justification for the extension to any parameter value. Unless one uses an ancillary statistic, which does not sound pertinent… The differential entropy is estimated by a Kozachenko-Leonenko estimator implying k-nearest neighbours.

“The proposed empirical likelihood estimates weights by matching the moments of g(X¹), , g(X) with that of
g(X), without requiring a direct relationship with the parameter. (…) the constraints used in the construction of the empirical likelihood are based on the identity in (7), which can only be satisfied when θ = θ⁰. “

Although I am feeling like missing one argument, the later part of the paper seems to comfort my impression, as quoted above. Meaning that the approximation will fare well only in the vicinity of the true parameter. Which makes it untrustworthy for model choice purposes, I believe. (The paper uses the g-and-k benchmark without exploiting Pierre Jacob’s package that allows for exact MCMC implementation.)

## approximate Bayesian inference [survey]

Posted in Statistics with tags , , , , , , , , , , , , , , , , , , on May 3, 2021 by xi'an

In connection with the special issue of Entropy I mentioned a while ago, Pierre Alquier (formerly of CREST) has written an introduction to the topic of approximate Bayesian inference that is worth advertising (and freely-available as well). Its reference list is particularly relevant. (The deadline for submissions is 21 June,)