## posterior collapse

**T**he latest ABC One World webinar was a talk by Yixin Wang about the posterior collapse of variational auto-encoders (VAEs), an issue of which I was completely unaware. It is essentially an *identifiability* issue, where the latent variable z at the source of the VAE does not impact the likelihood, the latter being assumed to be an exponential family with a parameter depending on z and on θ, possibly through a neural network construct. The *variational* part comes from the parameter being estimated as θ⁰ via a variational approximation.

*“….the problem of posterior collapse mainly arises from the model and the data, rather than from inference or optimization…”*

The collapse means that the posterior for the latent satisfies p(z|θ⁰,x)=p(z), which is not a standard property since θ⁰=θ⁰(x). Yixin Wang, David Blei and John Cunningham show that this is equivalent to p(x|θ⁰,z)=p(x|θ⁰), i.e., to z being unidentifiable. The above quote is then both correct and incorrect, in that the choice of the inference approach, i.e., of the estimator θ⁰=θ⁰(x), has an impact on whether or not p(z|θ⁰,x)=p(z) holds, as acknowledged by the authors when describing “*methods modify the optimization objectives or algorithms of VAE to avoid parameter values θ at which the latent variable is non-identifiable*”. They later build a resolution for identifiable VAEs by imposing that the conditional p(x|θ,z) is injective in z for all values of θ, resulting in a neural network with Brenier maps.
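The equivalence between the two statements can be sketched through Bayes' formula, under the assumptions (not spelled out in the text) that the prior p(z) does not depend on θ and is positive on the values of z under consideration:

```latex
\begin{align*}
p(z\mid\theta^0,x) &= \frac{p(x\mid\theta^0,z)\,p(z)}{p(x\mid\theta^0)},
\qquad p(x\mid\theta^0)=\int p(x\mid\theta^0,z)\,p(z)\,\mathrm{d}z,\\
p(z\mid\theta^0,x) = p(z)\ \text{for all } z
&\iff \frac{p(x\mid\theta^0,z)}{p(x\mid\theta^0)} = 1\ \text{for all } z
\iff p(x\mid\theta^0,z) = p(x\mid\theta^0)\ \text{for all } z.
\end{align*}
```

In words, the posterior returns the prior exactly when the likelihood is flat in z at θ⁰, which is the non-identifiability of z.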

From a Bayesian perspective, I have difficulty connecting with the issue, the folklore being that selecting a proper prior is a sufficient fix for avoiding non-identifiability, but more fundamentally I wonder at the relevance of inferring about the latent z’s and hence of worrying about their identifiability or lack thereof.

February 24, 2022 at 2:12 pm

Beyond identifiability, the ML community also cares about getting useful “latent representations”, i.e., continuous vector representations of observed data that are useful in other tasks.

One alternative is to ensure sufficient mutual information between the observed x and the latent z by enforcing the constraint I_p(x,z) = C, where C > 0 is a user-provided constant.

This line of thinking has been explored by Phuong et al. in https://openreview.net/pdf?id=HkbmWqxCZ
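To make the link between collapse and mutual information concrete, here is a minimal sketch on a toy discrete model. The binary z and x, the probability tables, and the helper `posterior_and_mi` are all hypothetical choices made for illustration, not taken from the talk or from Phuong et al. When the likelihood does not depend on z, the posterior returns the prior and the mutual information is zero, so a constraint I_p(x,z) = C with C > 0 excludes that case.

```python
import numpy as np


def posterior_and_mi(prior, lik):
    """Toy discrete model (illustrative assumption, not a VAE).

    prior: shape (K,), prior[k] = p(z=k)
    lik:   shape (K, M), lik[k, m] = p(x=m | z=k)
    Returns the posterior table p(z | x), shape (K, M), and I(x, z) in nats.
    """
    joint = prior[:, None] * lik            # p(z, x)
    px = joint.sum(axis=0)                  # marginal p(x)
    post = joint / px                       # p(z | x); each column sums to 1
    indep = prior[:, None] * px[None, :]    # p(z) p(x)
    mask = joint > 0                        # skip zero cells (0 log 0 = 0)
    mi = float(np.sum(joint[mask] * np.log(joint[mask] / indep[mask])))
    return post, mi


prior = np.array([0.3, 0.7])

# Collapsed case: both rows identical, so p(x | z) = p(x).
collapsed = np.array([[0.4, 0.6],
                      [0.4, 0.6]])

# Informative case: the likelihood changes with z.
informative = np.array([[0.9, 0.1],
                        [0.2, 0.8]])

post_c, mi_c = posterior_and_mi(prior, collapsed)
post_i, mi_i = posterior_and_mi(prior, informative)

print("collapsed:   posterior columns", post_c.T, "MI", mi_c)
print("informative: posterior columns", post_i.T, "MI", mi_i)
```

In the collapsed case every column of the posterior table equals the prior, while in the informative case the posterior moves with x and the mutual information is strictly positive.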