## martingale posteriors

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , on November 7, 2022 by xi'an

A new Royal Statistical Society Read Paper featuring Edwin Fong, Chris Holmes, and Steve Walker. Starting from the predictive

$p(y_{n+1:+\infty}|y_{1:n})\ \ \ (1)$

rather than from the posterior distribution on the parameter is a fairly novel idea, also pursued by Sonia Petrone and some of her coauthors. It thus adopts a de Finetti’s perspective while adding some substance to the rather metaphysical nature of the original. It however relies on the “existence” of an infinite sample in (1) that assumes a form of underlying model à la von Mises or at least an infinite population. The representation of a parameter θ as a function of an infinite sequence comes as a shock first but starts making sense when considering it as a functional of the underlying distribution. Of course, trading (modelling) a random “opaque” parameter θ for (envisioning) an infinite sequence of random (un)observations may sound like a sure loss rather than as a great deal, but it gives substance to the epistemic uncertainty about a distributional parameter, even when a model is assumed, as in Example 1, which defines θ in the usual parametric way (i.e., the mean of the iid variables). Furthermore, the link with bootstrap and even more Bayesian bootstrap becomes clear when θ is seen this way.

Always a fan of minimal loss approaches, but (2.4) defines either a moment or a true parameter value that depends on the parametric family indexed by θ. Hence does not exist outside the primary definition of said parametric family. The following construct of the empirical cdf based on the infinite sequence as providing the θ function is elegant but what is its Bayesian justification? (I did not read Appendix C.2. in full detail but could not spot the prior on F.)

“The resemblance of the martingale posterior to a bootstrap estimator should not have gone unnoticed”

I am always fan of minimal loss approaches, but I wonder at (2.4), as it defines either a moment or a true parameter value that depends on the parametric family indexed by θ. Hence it does not exist outside the primary definition of said parametric family, which limits its appeal. The following construct of the empirical cdf based on the infinite sequence as providing the θ function is elegant and connect with bootstrap, but I wonder at its Bayesian justification. (I did not read Appendix C.2. in full detail but could not spot a prior on F.)

While I completely missed the resemblance, it is indeed the case that, if the predictive at each step is build from the earlier “sample”, the support is not going to evolve. However, this is not particularly exciting as the Bayesian non-parametric estimator is most rudimentary. This seems to bring us back to Rubin (1981) ?! A Dirichlet prior is mentioned with no further detail. And I am getting confused at the complete lack of structure, prior, &tc. It seems to contradict the next section:

“While the prescription of (3.1) remains a subjective task, we find it to be no more subjective than the selection of a likelihood function”

Copulas!!! Again, I am very glad to see copulas involved in the analysis. However, I remain unclear as to why Corollary 1 implies that any sequence of copulas could do the job. Further, why does the Gaussian copula appear as the default choice? What is the computing cost of the update (4.4) after k steps? Similarly (4.7) is using a very special form of copula, with independent-across-dimension increments. I am also missing a guided tour on the implementation, as it sounds explosive in book-keeping and multiplying, while relying on a single hyperparameter in (4.5.2)?

In the illustration section, the use of the galaxy dataset may fail to appeal to Radford Neal, in a spirit similar to Chopin’s & Ridgway’s call to leave the Pima Indians alone, since he delivered a passionate lecture on the inappropriateness of a mixture model for this dataset (at ICMS in 2001). I am unclear as to where the number of modes is extracted from the infinite predictive. What is $\theta$ in this case?

Copulas!!! Although I am unclear why Corollary 1 implies that any sequence of copulas does the job. And why the Gaussian copula appears as the default choice. What is the computing cost of the update (4.4) after k steps? Similarly (4.7) is using a very special form of copula, with independent-across-dimension increments. Missing a guided tour on the implementation, as it sounds explosive in book-keeping and multiplying. A single hyperparameter (4.5.2)?

## simulating from the joint cdf

Posted in Books, Kids, pictures, R, Statistics, University life with tags , , , , , , , , on July 13, 2022 by xi'an

An X validated question (what else?!) brought back (to me) the question of handling a bivariate cdf for simulation purposes. In the specific case of a copula when thus marginals were (well-)known…. And led me to an erroneous chain of thought, fortunately rescued by Robin Ryder! When the marginal distributions are set, the simulation setup is indeed equivalent to a joint Uniform simulation from a copula

$\mathbb P[U_1\leq u_1,U_2\leq u_2,\dots,U_d\leq u_d]=C(u_1,u_2,\dots,u_d)$

In specific cases, as for instance the obvious example of Gaussian copulas, there exist customised simulation algorithms. Looking for more generic solutions, I turn to the Bible, where Chapter XI[an], has two entire sections XI.3.2. and XI.3.3 on the topic (even though Luc Devroye does not use the term copula there despite them being introduced in 1959 by A, Sklar, in response to a query of M. Fréchet). In addition to a study of copulas, both sections contain many specific solutions (as for instance in the [unnumbered] Table on page 585) but I found no generic simulation method. My [non-selected] answer to the question was thus to propose standard solutions such as finding one conditional since the marginals are Uniform. Which depends on the tractability of the derivatives of C(·,·).

However, being dissatisfied with this bland answer, I thought further about the problem and came up with a fallacious scheme, namely to first simulate the value p of C(U,V) by drawing a Uniform, and second simulate (U,V) conditional on C(U,V)=p. Going as far as running an R code on a simple copula, as shown above. Fallacious reasoning since (as I knew already!!!), C(U,V) is not uniformly distributed! But has instead a case-dependent distribution… As a (connected) aside, I wonder if the generator attached with Archimedean copulas has any magical feature that help with the generation of the associated copula.

## sequential neural likelihood estimation as ABC substitute

Posted in Books, Kids, Statistics, University life with tags , , , , , , , , , , , , , , , , , , on May 14, 2020 by xi'an

A JMLR paper by Papamakarios, Sterratt, and Murray (Edinburgh), first presented at the AISTATS 2019 meeting, on a new form of likelihood-free inference, away from non-zero tolerance and from the distance-based versions of ABC, following earlier papers by Iain Murray and co-authors in the same spirit. Which I got pointed to during the ABC workshop in Vancouver. At the time I had no idea as to autoregressive flows meant. We were supposed to hold a reading group in Paris-Dauphine on this paper last week, unfortunately cancelled as a coronaviral precaution… Here are some notes I had prepared for the meeting that did not take place.

A simulator model is a computer program, which takes a vector of parameters θ, makes internal calls to a random number generator, and outputs a data vector x.”

Just the usual generative model then.

“A conditional neural density estimator is a parametric model q(.|φ) (such as a neural network) controlled by a set of parameters φ, which takes a pair of datapoints (u,v) and outputs a conditional probability density q(u|v,φ).”

Less usual, in that the outcome is guaranteed to be a probability density.

“For its neural density estimator, SNPE uses a Mixture Density Network, which is a feed-forward neural network that takes x as input and outputs the parameters of a Gaussian mixture over θ.”

In which theoretical sense would it improve upon classical or Bayesian density estimators? Where are the error evaluation, the optimal rates, the sensitivity to the dimension of the data? of the parameter?

“Our new method, Sequential Neural Likelihood (SNL), avoids the bias introduced by the proposal, by opting to learn a model of the likelihood instead of the posterior.”

I do not get the argument in that the final outcome (of using the approximation within an MCMC scheme) remains biased since the likelihood is not the exact likelihood. Where is the error evaluation? Note that in the associated Algorithm 1, the learning set is enlarged on each round, as in AMIS, rather than set back to the empty set ∅ on each round.

…given enough simulations, a sufficiently flexible conditional neural density estimator will eventually approximate the likelihood in the support of the proposal, regardless of the shape of the proposal. In other words, as long as we do not exclude parts of the parameter space, the way we propose parameters does not bias learning the likelihood asymptotically. Unlike when learning the posterior, no adjustment is necessary to account for our proposing strategy.”

This is a rather vague statement, with the only support being that the Monte Carlo approximation to the Kullback-Leibler divergence does converge to its actual value, i.e. a direct application of the Law of Large Numbers! But an interesting point I informally made a (long) while ago that all that matters is the estimate of the density at x⁰. Or at the value of the statistic at x⁰. The masked auto-encoder density estimator is based on a sequence of bijections with a lower-triangular Jacobian matrix, meaning the conditional density estimate is available in closed form. Which makes it sounds like a form of neurotic variational Bayes solution.

The paper also links with ABC (too costly?), other parametric approximations to the posterior (like Gaussian copulas and variational likelihood-free inference), synthetic likelihood, Gaussian processes, noise contrastive estimation… With experiments involving some of the above. But the experiments involve rather smooth models with relatively few parameters.

“A general question is whether it is preferable to learn the posterior or the likelihood (…) Learning the likelihood can often be easier than learning the posterior, and it does not depend on the choice of proposal, which makes learning easier and more robust (…) On the other hand, methods such as SNPE return a parametric model of the posterior directly, whereas a further inference step (e.g. variational inference or MCMC) is needed on top of SNL to obtain a posterior estimate”

A fair point in the conclusion. Which also mentions the curse of dimensionality (both for parameters and observations) and the possibility to work directly with summaries.

Getting back to the earlier and connected Masked autoregressive flow for density estimation paper, by Papamakarios, Pavlakou and Murray:

“Viewing an autoregressive model as a normalizing flow opens the possibility of increasing its flexibility by stacking multiple models of the same type, by having each model provide the source of randomness for the next model in the stack. The resulting stack of models is a normalizing flow that is more flexible than the original model, and that remains tractable.”

Which makes it sound like a sort of a neural network in the density space. Optimised by Kullback-Leibler minimisation to get asymptotically close to the likelihood. But a form of Bayesian indirect inference in the end, namely an MLE on a pseudo-model, using the estimated model as a proxy in Bayesian inference…