Archive for GANs

“more Bayesian” GANs

Posted in Books, Statistics on December 21, 2018 by xi'an
On X validated, I got pointed to this recent paper by He, Wang, Lee and Tian, which proposes a new form of Bayesian GAN. Although I do not see it as really Bayesian, as explained below.
“[The] existing Bayesian method (Saatchi & Wilson, 2017) may lead to incompatible conditionals, which suggest that the underlying joint distribution actually does not exist.”
The difference with the Bayesian GANs of Saatchi & Wilson (2017) [with Saatchi’s name being consistently misspelled throughout] is in the definition of the likelihood function, a function of both generative and discriminating parameters. As in Bissiri et al. (2013), the likelihood is replaced by the exponentiated loss function, or rather functions, which are computed with expected or plug-in distributions or discriminating functions. That is, expectations under the respective priors and for the observed data (?). Meaning there are “two likelihoods” for the same data, one being the inverse of the other in the minimax GAN case. Further, the prior on the generative parameter is actually of the prior feedback category: at each iteration, the authors “use the generator distribution in the previous time step as a prior for the next time step”. Which makes me wonder how they avoid ending up with a Dirac “prior”. (Even curiouser, the prior on the discriminating parameter, which is not a genuine component of the statistical model, is a flat prior across iterations.)

The convergence result established in the paper is that, if the (true) data-generating model can be written as the marginal of the assumed parametric generative model against an “optimal” distribution, then the discriminating function converges to non-discrimination between the (true) data-generating model and the assumed parametric generative model. This somehow negates the Bayesian side of the affair, as convergence to a point mass does not produce a Bayesian outcome on the parameters of the model, or on the comparison between true and assumed models. The paper also demonstrates the incompatibility of the two conditionals used by Saatchi & Wilson (2017) and provides a formal example [missing any form of data?] where the associated Bayesian GAN does not converge to the true value behind the model. But the issue is more complex in my opinion, in that using two incompatible conditionals does not mean that the associated Markov chain is transient (as, e.g., on a compact space).
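To summarise the construction in a schematic form (my own notation, not the authors’), each update looks like a generalised-Bayes step à la Bissiri et al., with the generator’s previous pseudo-posterior recycled as the current prior while the discriminator keeps its flat prior throughout:

\[
\pi_t(\theta_g) \propto \exp\{-\ell_g(\theta_g;\theta_d,x)\}\,\pi_{t-1}(\theta_g),
\qquad
\pi_t(\theta_d) \propto \exp\{-\ell_d(\theta_d;\theta_g,x)\}\,\pi_0(\theta_d),
\]

the accumulation of exponentiated losses on the generator side being precisely what prompts the Dirac worry above.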

Big Bayes goes South

Posted in Books, Mountains, pictures, Running, Statistics, Travel, University life on December 5, 2018 by xi'an

At the Big [Data] Bayes conference this week [which I found quite exciting despite a few last-minute cancellations by speakers] there were a lot of clustering talks, including the one by Amy Herring (Duke), using a notion of centering that should soon appear on arXiv. And the one by Peter Müller (UT Austin), towards handling large datasets, based on a predictive recursion that takes one value at a time, unsurprisingly similar to the update of Dirichlet process mixtures and inspired by a 1998 paper by Michael Newton and co-authors. The recursion doubles in size at each observation, requiring the culling of negligible components. Order matters? Links with the Malsiner-Walli et al. (2017) mixtures of mixtures. Also talks by Antonio Lijoi and Igor Pruenster (Bocconi Milano) on completely random measures used in creating clusters. And by Sylvia Frühwirth-Schnatter (WU Wien) on creating clusters in the Austrian labor market to assess the impact of company closures. And by Gregor Kastner (WU Wien) on multivariate factor stochastic volatility models, with a video of a large covariance matrix evolving over time and catching economic crises. And by David Dunson (Duke) on distance clustering, reflecting like myself on the definitely ill-defined nature of the [clustering] object: as the sample size increases, spurious clusters appear. (Which reminded me of a disagreement I had had with David MacKay at an ICMS conference on mixtures twenty years ago.) Making me realise I had missed the recent JASA paper by Miller and Dunson on that perspective.

Some further snapshots (with short comments visible by hovering on the picture) of a very high-quality meeting [says one of the organisers!]. Following suggestions from several participants, it would be great to hold another meeting at CIRM in the near future.

Implicit maximum likelihood estimates

Posted in Statistics on October 9, 2018 by xi'an

An ‘Og’s reader pointed me to this paper by Li and Malik, which made it to arXiv after not making it to NIPS. While the NIPS reviews were not particularly informative and strongly discordant, the authors point out in the comments that these reviews are available for the sake of promoting discussion. (As made clear in earlier posts, I am quite supportive of this attitude! Disclaimer: I was not involved in an evaluation of this paper, neither for NIPS nor for another conference or journal!!) Although the paper does not seem to mention ABC in the setting of implicit likelihoods and generative models, there is a reference to the early (1984) paper by Peter Diggle and Richard Gratton that is often seen as the ancestor of ABC methods. The authors point out numerous issues with solutions proposed for parameter estimation in such implicit models. For instance, for GANs, they signal that “minimizing the Jensen-Shannon divergence or the Wasserstein distance between the empirical data distribution and the model distribution does not necessarily minimize the same between the true data distribution and the model distribution.” (Not mentioning the particular difficulty with Bayesian GANs.) Their own solution is the implicit maximum likelihood estimator, which picks the value of the parameter θ bringing a simulated sample the closest to the observed sample, closest in the sense of the Euclidean distance between both samples, or in the sense of the distance between the closest of several simulated samples and the observed sample. (The modelling seems to imply the availability of n>1 observed samples.) They advocate using a stochastic gradient descent approach for finding the optimal parameter θ, which presupposes that the dependence between θ and the simulated samples is somewhat differentiable. (And this does not account for using a min, which would make differentiation close to impossible.) The paper then meanders into a lengthy discussion as to whether maximising the likelihood makes sense, with a rather naïve view on why using the empirical distribution in a Kullback-Leibler divergence does not make sense! What does not make sense, in my opinion, is considering the finite-sample approximation to the Kullback-Leibler divergence with the true distribution.
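To fix ideas, here is a minimal R rendering of the criterion (my own toy illustration, not the authors’ code), for a Normal location model, with the two samples matched after sorting and the optimisation done by brute force since the min precludes easy differentiation:

```r
# toy version of the implicit MLE criterion: Normal location model, x = theta + z, z ~ N(0,1)
set.seed(1)
xobs <- rnorm(10, mean = 2)                     # observed sample, true theta = 2

imle_crit <- function(theta, m = 50) {
  # distance from the observed sample to the closest of m simulated samples
  dists <- replicate(m, {
    xsim <- theta + rnorm(length(xobs))
    sqrt(sum((sort(xsim) - sort(xobs))^2))      # Euclidean distance between (sorted) samples
  })
  min(dists)
}

# brute-force grid search over theta, as the min makes the criterion non-smooth
thetas <- seq(0, 4, by = 0.05)
crit <- sapply(thetas, imle_crit)
plot(thetas, crit, type = "l", xlab = expression(theta), ylab = "closest-sample distance")
abline(v = thetas[which.min(crit)], col = "red")
```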

Gibbs for incompatible kids

Posted in Books, Statistics, University life on September 27, 2018 by xi'an

In continuation of my earlier post on Bayesian GANs, which resort to strongly incompatible conditionals, I read a 2015 paper by Chen and Ip that I had missed. (Published in the Journal of Statistical Computation and Simulation, which I first confused with JCGS and which I do not know at all. Actually, when looking at its editorial board, I recognised only one name.) But the study therein is quite disappointing and not very helpful, as it considers Markov chains on finite state spaces, meaning that the transition distributions are matrices and that convergence is ensured whenever these matrices have no null probability term. And while the paper is motivated by realistic situations where incompatible conditionals can reasonably appear, it only produces illustrations on two- and three-state Markov chains. Not that helpful, in the end… The game is still afoot!
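As a small illustration of that last point (mine, not from the paper), take a pair of binary variables with incompatible conditionals: the systematic-scan Gibbs kernel has all entries positive, hence a unique stationary distribution, even though no joint distribution admits the two inputs as its conditionals:

```r
# two incompatible conditionals on {0,1}: pX[y+1, x+1] = P(X=x|Y=y), pY[x+1, y+1] = P(Y=y|X=x)
pX <- matrix(c(.1, .9,
               .8, .2), nrow = 2, byrow = TRUE)  # X depends on Y
pY <- matrix(c(.5, .5,
               .5, .5), nrow = 2, byrow = TRUE)  # Y uniform whatever X: incompatible with pX

# transition matrix of the systematic-scan Gibbs sampler on the four joint states,
# updating x given y first, then y given the new x
states <- expand.grid(x = 0:1, y = 0:1)
K <- matrix(0, 4, 4)
for (i in 1:4) for (j in 1:4)
  K[i, j] <- pX[states$y[i] + 1, states$x[j] + 1] * pY[states$x[j] + 1, states$y[j] + 1]

# all entries of K are positive, so the chain is ergodic with a unique stationary law
v <- Re(eigen(t(K))$vectors[, 1])
round(v / sum(v), 3)
```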

JSM 2018 [#1]

Posted in Mountains, Statistics, Travel, University life on July 30, 2018 by xi'an

As our direct flight from Paris landed in Vancouver in the morning, we found ourselves in the unusual situation of having a few hours to kill before accessing our rental, and where better to spend them than at a general introduction to deep learning in the first round of sessions at JSM 2018?! In my humble opinion, or maybe just because it was past midnight Paris time!, the talk was pretty uninspiring, missing the natural question of the possible connections between the construction of a prediction function and statistics. Watching improving performances at classifying human faces does not convey much beyond the creation of a massively non-linear function in high dimensions with nicely designed error penalties. Most of the talk droned on about neural networks, their fitting by back-propagation, and the variations on stochastic gradient descent. Not addressing much the rather natural (?) questions about the choice of functions at each level, of the number of levels, of the penalty term or regulariser, and even less the reason why no sparsity is imposed on the structure, despite the humongous number of parameters involved. What came close [but not that close] to sparsity is the notion of dropout, which is a sort of purely automated culling of the nodes, and which was new to me. More like a sort of randomisation that turns the optimisation criterion into an average. Only at the end of the presentation did more relevant questions emerge, presenting unsupervised learning as density estimation, the pivot being the generative features of (most) statistical models. And GANs, of course. But nonetheless missing an explanation as to why models with massive numbers of parameters can be considered in this setting and not in standard statistics. (One slide about deterministic auto-encoders was somewhat puzzling in that it seemed to repeat the “fiducial mistake”.)
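For the record, the dropout idea in its barest form (a sketch of the concept only, nothing from the talk itself):

```r
# dropout as random masking: each unit is kept with probability p during training,
# turning the training criterion into an average of the loss over random sub-networks
dropout <- function(h, p = 0.5, train = TRUE) {
  if (!train) return(h)                  # keep every unit at prediction time
  mask <- rbinom(length(h), 1, p)
  h * mask / p                           # "inverted" rescaling keeps the expected output unchanged
}

h <- c(1.2, -0.7, 0.3, 2.1)              # activations of one hidden layer
dropout(h)                               # activations of one random sub-network
```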

Bayesian GANs [#2]

Posted in Books, pictures, R, Statistics on June 27, 2018 by xi'an

As an illustration of the lack of convergence of the Gibbs sampler applied to the two “conditionals” defined in the Bayesian GANs paper discussed yesterday, I took the simplest possible example of a Normal mean generative model (one parameter) with a logistic discriminator (one parameter) and implemented the scheme (during an ISBA 2018 session). With flat priors on both parameters. And a Normal random walk as Metropolis-Hastings proposal. As expected, since there is no stationary distribution associated with the Markov chain, simulated chains do not exhibit a stationary pattern.

And they eventually reach an overflow error or a trapping state as the log-likelihood gets essentially equal to zero (red curve).
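For completeness, here is a bare-bones R version of that toy experiment (a reconstruction in the same spirit, the details of the code run during the session possibly differing):

```r
# generator: x = mu + z, z ~ N(0,1); discriminator: D(x, phi) = plogis(phi * x)
# flat priors on mu and phi, Normal random-walk Metropolis-Hastings moves within Gibbs
set.seed(42)
xobs <- rnorm(20, mean = 1)                     # the "true" data
niter <- 1e4; n <- length(xobs)
mu <- phi <- numeric(niter)

for (t in 2:niter) {
  z <- rnorm(n)                                 # fresh white noise
  xgen <- mu[t - 1] + z                         # pseudo-data from the generator
  # (1) generator move, favouring values that fool the discriminator
  prop <- mu[t - 1] + rnorm(1, sd = .5)
  logr <- sum(plogis(phi[t - 1] * (prop + z), log.p = TRUE)) -
          sum(plogis(phi[t - 1] * xgen, log.p = TRUE))
  mu[t] <- if (log(runif(1)) < logr) prop else mu[t - 1]
  xgen <- mu[t] + z
  # (2) discriminator move, a standard "posterior" given true and generated samples
  prop <- phi[t - 1] + rnorm(1, sd = .5)
  logr <- sum(plogis(prop * xobs, log.p = TRUE)) +
          sum(plogis(prop * xgen, lower.tail = FALSE, log.p = TRUE)) -
          sum(plogis(phi[t - 1] * xobs, log.p = TRUE)) -
          sum(plogis(phi[t - 1] * xgen, lower.tail = FALSE, log.p = TRUE))
  phi[t] <- if (log(runif(1)) < logr) prop else phi[t - 1]
}
par(mfrow = c(2, 1))
plot(mu, type = "l", ylab = expression(mu))     # no stationary pattern in either chain
plot(phi, type = "l", ylab = expression(phi))
```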

Too bad I missed the talk by Shakir Mohamed yesterday, being stuck on the Edinburgh by-pass at rush hour!, as I would have loved to hear his views about this rather essential issue…

Bayesian gan [gan style]

Posted in Books, pictures, Statistics, University life on June 26, 2018 by xi'an

In their paper Bayesian GANs, arXived a year ago, Saatchi and Wilson consider a Bayesian version of generative adversarial networks, putting priors on both the model and the discriminator parameters. While the prospect seems somewhat remote from genuine statistical inference, if the following statement is representative

“GANs transform white noise through a deep neural network to generate candidate samples from a data distribution. A discriminator learns, in a supervised manner, how to tune its parameters so as to correctly classify whether a given sample has come from the generator or the true data distribution. Meanwhile, the generator updates its parameters so as to fool the discriminator. As long as the generator has sufficient capacity, it can approximate the cdf inverse-cdf composition required to sample from a data distribution of interest.”

I figure the concept can also apply to a standard statistical model, where x=G(z,θ) rephrases the distributional assumption x~F(x;θ) via a white noise z. This makes resorting to a prior distribution on θ more relevant in the sense of using potential prior information on θ (although the successes of probabilistic numerics show formal priors can be used on purely numerical grounds).
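For instance (an obvious textbook case, not one from the paper), the Normal model fits this representation as

\[
x = G(z,\theta) = \mu + \sigma z, \qquad z \sim \mathcal{N}(0,1), \qquad \theta = (\mu,\sigma),
\]

which returns the distributional assumption x~N(µ,σ²).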

The “posterior distribution” that is central to the notion of Bayesian GANs is however unorthodox in that the distribution is associated with the following conditional posteriors
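In the notation of Saatchi and Wilson (and up to details such as the respective sizes of the true and simulated samples), with θg the generative and θd the discriminative parameters, the z⁽ⁱ⁾’s the white noise inputs and the x⁽ⁱ⁾’s the observations, these read roughly as

\[
p(\theta_g \mid \mathbf{z}, \theta_d) \;\propto\; \Big\{\prod_{i} D\big(G(z^{(i)};\theta_g);\theta_d\big)\Big\}\; p(\theta_g \mid \alpha_g), \tag{1}
\]
\[
p(\theta_d \mid \mathbf{z}, \mathbf{x}, \theta_g) \;\propto\; \prod_{i} D\big(x^{(i)};\theta_d\big)\, \prod_{i} \Big\{1 - D\big(G(z^{(i)};\theta_g);\theta_d\big)\Big\}\; p(\theta_d \mid \alpha_d), \tag{2}
\]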

where D(x,θ) is the “discriminator”, that is, in GAN lingo, the probability to be allocated to the “true” data generating mechanism rather than to the one associated with G(·,θ). The generative conditional posterior (1) then aims at fooling the discriminator, i.e. favours generative parameter values that raise the probability of wrong allocation of the pseudo-data. The discriminative conditional posterior (2) is a standard Bayesian posterior based on the original sample and the generated sample. The authors then iteratively sample from these posteriors, effectively implementing a two-stage Gibbs sampler.

“By iteratively sampling from (1) and (2) at every step of an epoch one can, in the limit, obtain samples from the approximate posteriors over [both sets of parameters].”

What worries me about this approach is that it just cannot work, in the sense that (1) and (2) cannot be compatible conditional (posterior) distributions. There is no joint distribution for which (1) and (2) would be the conditionals, since the pseudo-data appears through D in (1) and through (1-D) in (2). This means that the convergence of a Gibbs sampler is at best to a stationary σ-finite measure. And hence that the meaning of the chain is delicate to ascertain… Am I missing any fundamental point?! [I checked the reviews on the NIPS webpage and could not spot this issue being raised.]