Today, Pierre Jacob posted on arXiv a paper of ours on the use of the Wasserstein distance in statistical inference, which main focus is exploiting this distance to create an automated measure of discrepancy for ABC. Which is why the full title is Inference in generative models using the Wasserstein distance. Generative obviously standing for the case when a model can be generated from but cannot be associated with a closed-form likelihood. We had all together discussed this notion when I visited Harvard and Pierre last March, with much excitement. (While I have not contributed much more than that round of discussions and ideas to the paper, the authors kindly included me!) The paper contains theoretical results for the consistency of statistical inference based on those distances, as well as computational on how the computation of these distances is practically feasible and on how the Hilbert space-filling curve used in sequential quasi-Monte Carlo can help. The notion further extends to dependent data via delay reconstruction and residual reconstruction techniques (as we did for some models in our empirical likelihood BCel paper). I am quite enthusiastic about this approach and look forward discussing it at the 17w5015 BIRS ABC workshop, next month!
Archive for Harvard University
Lawrence Murray, Sumeet Singh, Pierre Jacob, and Anthony Lee (Warwick) recently arXived a paper on Anytime Monte Carlo. (The earlier post on this topic is no coincidence, as Lawrence had told me about this problem when he visited Paris last Spring. Including a forced extension when his passport got stolen.) The difficulty with anytime algorithms for MCMC is the lack of exchangeability of the MCMC sequence (except for formal settings where regeneration can be used).
When accounting for duration of computation between steps of an MCMC generation, the Markov chain turns into a Markov jump process, whose stationary distribution α is biased by the average delivery time. Unless it is constant. The authors manage this difficulty by interlocking the original chain with a secondary chain so that even- and odd-index chains are independent. The secondary chain is then discarded. This provides a way to run an anytime MCMC. The principle can be extended to K+1 chains, run one after the other, since only one of those chains need be discarded. It also applies to SMC and SMC². The appeal of anytime simulation in this particle setting is that resampling is no longer a bottleneck. Hence easily distributed among processors. One aspect I do not fully understand is how the computing budget is handled, since allocating the same real time to each iteration of SMC seems to envision each target in the sequence as requiring the same amount of time. (An interesting side remark made in this paper is the lack of exchangeability resulting from elaborate resampling mechanisms, lack I had not thought of before.)
Luke Bornn, Neal Shephard and Reza Solgi have recently arXived a research report on non-parametric Bayesian quantiles. This work relates to their earlier paper that combines Bayesian inference with moment estimators, in that the quantiles do not define entirely the distribution of the data, which then needs to be completed by Bayesian means. But contrary to this previous paper, it does not require MCMC simulation for distributions defined on a variety as, e.g., a curve.
Here a quantile is defined as minimising an asymmetric absolute risk, i.e., an expected loss. It is therefore a deterministic function of the model parameters for a parametric model and a functional of the model otherwise. And connected to a moment if not a moment per se. In the case of a model with a discrete support, the unconstrained model is parameterised by the probability vector θ and β=t(θ). However, the authors study the opposite approach, namely to set a prior on β, p(β), and then complement this prior with a conditional prior on θ, p(θ|β), the joint prior p(β)p(θ|β) being also the marginal p(θ) because of the deterministic relation. However, I am getting slightly lost in the motivation for the derivation of the conditional when the authors pick an arbitrary prior on θ and use it to derive a conditional on β which, along with an arbitrary (“scientific”) prior on β defines a new prior on θ. This works out in the discrete case because β has a finite support. But it is unclear (to me) why it should work in the continuous case [not covered in the paper].
Getting back to the central idea of defining first the distribution on the quantile β, a further motivation is provided in the hierarchical extension of Section 3, where the same quantile distribution is shared by all individuals (e.g., cricket players) in the population, while the underlying distributions for the individuals are otherwise disconnected and unconstrained. (Obviously, a part of the cricket example went far above my head. But one may always idly wonder why all players should share the same distribution. And about what would happen when imposing no quantile constraint but picking instead a direct hierarchical modelling on the θ’s.) This common distribution on β can then be modelled by a Dirichlet hyperprior.
The paper also contains a section on estimating the entire quantile function, which is a wee paradox in that this function is again a deterministic transform of the original parameter θ, but that the authors use instead pointwise estimation, i.e., for each level τ. I find the exercise furthermore paradoxical in that the hierarchical modelling with a common distribution on the quantile β(τ) only is repeated for each τ but separately, while it should be that the entire parameter should share a common distribution. Given the equivalence between the quantile function and the entire parameter θ.
As in Bristol two months ago, where I joined the statistics reading in the morning, I had the opportunity to discuss the paper on testing via mixtures prior to my talk with a group of Harvard graduate students. Which concentrated on the biasing effect of the Bayes factor against the more complex hypothesis/model. Arguing [if not in those terms!] that Occam’s razor was too sharp. With a neat remark that decomposing the log Bayes factor as
meant that the first marginal was immensely and uniquely impacted by the prior modelling, hence very likely to be very small for a larger model H, which would then take forever to recover from. And asking why there was such a difference with cross-validation
where the leave-one out posterior predictor is indeed more stable. While the later leads to major overfitting in my opinion, I never spotted the former decomposition which does appear as a strong and maybe damning criticism of the Bayes factor in terms of long-term impact of the prior modelling.
Other points made during the talk or before when preparing the talk:
- additive mixtures are but one encompassing model, geometric mixtures could be fun too, if harder to process (e.g., missing normalising constant). Or Zellner’s mixtures (with again the normalising issue);
- if the final outcome of the “test” is the posterior on α itself, the impact of the hyper-parameter on α is quite relative since this posterior can be calibrated by simulation against limiting cases (α=0,1);
- for the same reason the different rate of accumulation near zero and one when compared with a posterior probability is hardly worrying;
- what I see as a fundamental difference in processing improper priors for Bayes factors versus mixtures is not perceived as such by everyone;
- even a common parameter θ on both models does not mean both models are equally weighted a priori, which relates to an earlier remark in Amsterdam about the different Jeffreys priors one can use;
- the MCMC output also produces a sample of θ’s which behaviour is obviously different from single model outputs. It would be interesting to study further the behaviour of those samples, which are not to be confused with model averaging;
- the mixture setting has nothing intrinsically Bayesian in that the model can be processed in other ways.
Next week, I will be in Harvard Monday and Tuesday, visiting friends in the Department of Statistics and giving a seminar. The slides for the talk will be quite similar to those of my talk in Bristol, a few weeks ago. Hopefully, there will not be too much overlap between both audiences! And hopefully I’ll manage to get to my conclusion before all hell breaks loose (which is why I strategically set my conclusion in the early slides!)