Archive for MCMC

simple, scalable and accurate posterior interval estimation

Posted in Statistics with tags , , , , , , , on July 6, 2016 by xi'an

“There is a lack of simple and scalable algorithms for uncertainty quantification.”

A paper by Cheng Li , Sanvesh Srivastava, and David Dunson that I had missed and which was pointed out on Andrew’s blog two days ago. As recalled in the very first sentence of the paper, above, the existing scalable MCMC algorithms somewhat fail to account for confidence (credible) intervals. In the sense that handling parallel samples does not naturally produce credible intervals.Since the approach is limited to one-dimensional quantity of interest, ζ=h(θ), the authors of the paper consider the MCMC approximations of the cdf of the said quantity ζ based on the manageable subsets like as many different approximations of the same genuine posterior distribution of that quantity ζ. (Corrected by a power of the likelihood but dependent on the particular subset used for the estimation.) The estimate proposed in the paper is a Wasserstein barycentre of the available estimations, barycentre that is defined as minimising the sum of the Wasserstein distances to all estimates. (Why should this measure be relevant: the different estimates may be of different quality). Interestingly (at least at a computational level), the solution is such that the quantile function of the Wasserstein barycentre is the average of the estimated quantiles functions. (And is there an alternative loss returning the median cdf?) A confidence interval based on the quantile function can then be directly derived. The paper shows that this Wasserstein barycentre converges to the true (marginal) posterior as the sample size m of each sample grows to infinity (and faster than 1/√m), with the strange side-result that the convergence is in 1/√n when the MLE of the global parameter θ is unbiased. Strange to me because unbiasedness is highly dependent on parametrisation while the performances of this estimator should not be, i.e., should be invariant under reparameterisation. Maybe this is due to ζ being a linear transform of θ in the convergence theorem… In any case, I find this question of merging cdf’s from poorly defined approximations to an unknown cdf of the highest interest and look forward any further proposal to this effect!

communication-efficient distributed statistical learning

Posted in Books, Statistics, University life with tags , , , , , , , , on June 10, 2016 by xi'an

mikecemMichael Jordan, Jason Lee, and Yun Yang just arXived a paper with their proposal on handling large datasets through distributed computing, thus contributing to the currently very active research topic of approximate solutions in large Bayesian models. The core of the proposal is summarised by the screenshot above, where the approximate likelihood replaces the exact likelihood with a first order Taylor expansion. The first term is the likelihood computed for a given subsample (or a given thread) at a ratio of one to N and the difference of the gradients is only computed once at a good enough guess. While the paper also considers M-estimators and non-Bayesian settings, the Bayesian part thus consists in running a regular MCMC when the log-target is approximated by the above. I first thought this proposal amounted to a Gaussian approximation à la Simon Wood or to an INLA approach but this is not the case: the first term of the approximate likelihood is exact and hence can be of any form, while the scalar product is linear in θ, providing a sort of first order approximation, albeit frozen at the chosen starting value.

mikecem2Assuming that each block of the dataset is stored on a separate machine, I think the approach could further be implemented in parallel, running N MCMC chains and comparing the output. With a post-simulation summary stemming from the N empirical distributions thus produced. I also wonder how the method would perform outside the fairly smooth logistic regression case, where the single sample captures well-enough the target. The picture above shows a minor gain in a misclassification rate that is already essentially zero.

Nonparametric hierarchical Bayesian quantiles

Posted in Books, Statistics, University life with tags , , , , , , , on June 9, 2016 by xi'an

Luke Bornn, Neal Shephard and Reza Solgi have recently arXived a research report on non-parametric Bayesian quantiles. This work relates to their earlier paper that combines Bayesian inference with moment estimators, in that the quantiles do not define entirely the distribution of the data, which then needs to be completed by Bayesian means. But contrary to this previous paper, it does not require MCMC simulation for distributions defined on a variety as, e.g., a curve.

Here a quantile is defined as minimising an asymmetric absolute risk, i.e., an expected loss. It is therefore a deterministic function of the model parameters for a parametric model and a functional of the model otherwise. And connected to a moment if not a moment per se. In the case of a model with a discrete support, the unconstrained model is parameterised by the probability vector θ and β=t(θ). However, the authors study the opposite approach, namely to set a prior on β, p(β), and then complement this prior with a conditional prior on θ, p(θ|β), the joint prior p(β)p(θ|β) being also the marginal p(θ) because of the deterministic relation. However, I am getting slightly lost in the motivation for the derivation of the conditional when the authors pick an arbitrary prior on θ and use it to derive a conditional on β which, along with an arbitrary (“scientific”) prior on β defines a new prior on θ. This works out in the discrete case because β has a finite support. But it is unclear (to me) why it should work in the continuous case [not covered in the paper].

Getting back to the central idea of defining first the distribution on the quantile β, a further motivation is provided in the hierarchical extension of Section 3, where the same quantile distribution is shared by all individuals (e.g., cricket players) in the population, while the underlying distributions for the individuals are otherwise disconnected and unconstrained. (Obviously, a part of the cricket example went far above my head. But one may always idly wonder why all players should share the same distribution. And about what would happen when imposing no quantile constraint but picking instead a direct hierarchical modelling on the θ’s.) This common distribution on β can then be modelled by a Dirichlet hyperprior.

The paper also contains a section on estimating the entire quantile function, which is a wee paradox in that this function is again a deterministic transform of the original parameter θ, but that the authors use instead pointwise estimation, i.e., for each level τ. I find the exercise furthermore paradoxical in that the hierarchical modelling with a common distribution on the quantile β(τ) only is repeated for each τ but separately, while it should be that the entire parameter should share a common distribution. Given the equivalence between the quantile function and the entire parameter θ.

MCMC with strings and branes

Posted in Books, pictures, Statistics with tags , , , , , on June 6, 2016 by xi'an

branes“Roughly speaking, the idea [of MCMC] is that the thermal fluctuations of a particle moving in an energy landscape provides a conceptually elegant way to sample from a target distribution.”

A short version of their earlier long paper was arXived last week by Heckman et al. The starting point of the paper is to consider simultaneously M parallel samplers by envisioning the M-uple as a single object. This reminded me of the attempt we had made in our 1995 pinball sampler paper with Kerrie Mengersen, where we introduced a repulsive scheme to keep the particles apart and individually stationary against the same target. The joint target being defined as the product of the individual targets. As a single Markov chain, the MCMC sampler can take advantage of the parallel chains to possibly improve the efficiency when compared with running M parallel chains. Possibly only, because the target has moved to a space that is M times larger…

“…although the physics of point particles underlies much of our modern understanding of natural phenomena, it has proven fruitful to consider objects such as strings and branes with finite extent in p spatial dimensions.”

The details of the proposal are somewhat unclear in that the notion of brane remains a mystery to me. It sounds like a sort of random graph over the indices of the particles, but endowed with further (magical?) physical properties. As defined in the paper the suburban algorithm picks a random (neighbourhood) graph at each iteration that is used in the proposal over the particle system (or ensemble). As justified by the authors, the fact that this choice is independent of the current state of the Markov chain implies stationarity. If not efficiency, when compared with the independent parallel MCMC scheme. And because there are so many ways in taking into account the neighbours. (I did not see (11) as a particularly relevant implementation of the algorithm, mixing a random walk move with another random walk on the time differences.) Actually, I have somewhat of a worry about the term “nearest neighbour” which may be defined by the graph (which is fine) or by the configuration of the particle system at time t (which is not fine).

“The effective dimension rather than the overall topology of the grid plays the dominant role in the performance of the algorithm.”

The limited influence of the grid topology is quite understandable in that the chain targets an iid sample, so there is no reason one index value is more relevant than another. All particles in the vector are interchangeable in this respect and in the long run only the number of connected particles should matter. A more interesting feature is that the suburban sampler seems to perform best for strings, i.e. when the number of connected particles is larger than two. This somewhat agrees with my initial remark that dealing with the M particles as a single object should slow down convergence because of the dimension increase. This is one reason why we did not pursue our pinball sampler any further, as it seemed to converge quite slowly.

“To summarize: With too few friends one drifts into oblivion, but with too many friends one becomes a boring conformist.”

The notion of generating a sample all at once is quite appealing, especially because of the iid nature of the target, but I am not convinced the approach followed in this paper is sufficiently involved for this purpose. Alex Shestopaloff and Radford Neal’s recent work on ensemble MCMC is certainly pushing things further in the analysis of efficient moves. I also wonder if a repulsive component shouldn’t be added to the target as in pinball sampling, borrowing maybe from determinental processes, towards a more thorough exploration of the target. Even though it remains unclear that this will be more efficient than running M parallel chains.

inefficiency of data augmentation for large samples

Posted in Books, pictures, Running, Statistics, Travel, University life with tags , , , , , , , , , , on May 31, 2016 by xi'an

On Monday, James Johndrow, Aaron Smith, Natesh Pillai, and David Dunson arXived a paper on the diminishing benefits of using data augmentation for large and highly imbalanced categorical data. They reconsider the data augmentation scheme of Tanner and Wong (1987), surprisingly not mentioned, used in the first occurrences of the Gibbs sampler like Albert and Chib’s (1993) or our mixture estimation paper with Jean Diebolt (1990). The central difficulty with data augmentation is that the distribution to be simulated operates on a space that is of order O(n), even when the original distribution covers a single parameter. As illustrated by the coalescent in population genetics (and the subsequent intrusion of the ABC methodology), there are well-known cases when the completion is near to impossible and clearly inefficient (as again illustrated by the failure of importance sampling strategies on the coalescent). The paper provides spectral gaps for the logistic and probit regression completions, which are of order a power of log(n) divided by √n, when all observations are equal to one. In a somewhat related paper with Jim Hobert and Vivek Roy, we studied the spectral gap for mixtures with a small number of observations: I wonder at the existence of a similar result in this setting, when all observations stem from one component of the mixture, when all observations are one. The result in this paper is theoretically appealing, the more because the posteriors associated with such models are highly regular and very close to Gaussian (and hence not that challenging as argued by Chopin and Ridgway). And because the data augmentation algorithm is uniformly ergodic in this setting (as we established with Jean Diebolt  and later explored with Richard Tweedie). As demonstrated in the  experiment produced in the paper, when comparing with HMC and Metropolis-Hastings (same computing times?), which produce much higher effective sample sizes.

reversible chain[saw] massacre

Posted in Books, pictures, R, Statistics, University life with tags , , , , , , , , on May 16, 2016 by xi'an

A paper in Nature this week that uses reversible-jump MCMC, phylogenetic trees, and Bayes factors. And that looks at institutionalised or ritual murders in Austronesian cultures. How better can it get?!

“by applying Bayesian phylogenetic methods (…) we find strong support for models in which human sacrifice stabilizes social stratification once stratification has arisen, and promotes a shift to strictly inherited class systems.” Joseph Watts et al.

The aim of the paper is to establish that societies with human sacrifices are more likely to have become stratified and stable than societies without such niceties. The hypothesis to be tested is then about the evolution towards more stratified societies rather the existence of a high level of stratification.

“The social control hypothesis predicts that human sacrifice (i) co-evolves with social stratification, (ii) increases the chance of a culture gaining social stratification, and (iii) reduces the chance of a culture losing social stratification once stratification has arisen.” Joseph Watts et al.

The methodological question is then how can this be tested when considering those are extinct societies about which little is known. Grouping together moderate and high stratification societies against egalitarian societies, the authors tested independence of both traits versus dependence, with a resulting Bayes factor of 3.78 in favour of the latest. Other hypotheses of a similar flavour led to Bayes factors in the same range. Which is thus not overwhelming. Actually, given that the models are quite simplistic, I do not agree that those Bayes factors prove anything of the magnitude of such anthropological conjectures. Even if the presence/absence of human sacrifices is confirmed in all of the 93 societies, and if the stratification of the cultures is free from uncertainties, the evolutionary part is rather involved, from my neophyte point of view: the evolutionary structure (reproduced above) is based on a sample of 4,200 trees based on Bayesian analysis of Austronesian basic vocabulary items, followed by a call to the BayesTrait software to infer about evolution patterns between stratification levels, concluding (with p-values!) at a phylogenetic structure of the data. BayesTrait was also instrumental in deriving MLEs for the various transition rates, “in order to inform our choice of priors” (!). BayesTrait has an MCMC function used by the authors “to test for correlated evolution between traits” and derive the above Bayes factors. Using a stepping-stone method I am unaware of. And 10⁹ iterations (repeated 3 times for checking consistency)… Reversible jump is apparently used to move between constrained and unconstrained models, leading to the pie charts at the inner nodes of the above picture. Again a by-product of BayesTrait. The trees on the left and the right are completely identical, the difference being in the inference about stratification evolution (right) and sacrifice evolution (left). While the overall hypothesis makes sense at my layman level (as a culture has to be stratified enough to impose sacrifices from its members), I am not convinced that this involved statistical analysis brings that strong a support. (But it would make a fantastic topic for an undergraduate or a Master thesis!)

AISTATS 2016 [#1]

Posted in pictures, R, Running, Statistics, Travel, Wines with tags , , , , , , , , , , , , on May 11, 2016 by xi'an

Travelling through Seville, I arrived in Càdiz on Sunday night, along with a massive depression [weather-speaking!]. Walking through the city from the station was nonetheless pleasant as this is an town full of small streets and nice houses. If with less churches than Seville! Richard Samworth gave the first plenary talk of AISTATS 2016  with a presentation on random projections for classification. His classifier is based on an average of a large number of linear random projections of the original data where the projections are chosen as minimising the prediction error over a subset of the components. The performances of this approach seem to be consistently higher than for random forests, which makes it definitely worth investigating further. (A related R package is available.)

The following talks that day covered Bayesian optimisation and probabilistic numerics, with Javier Gonzales introducing glasses for Bayesian optimisation in order to solve its myopia (!)—by which he meant predicting the output of the optimisation over n future steps. And a first mention of the Pima Indians by Daniel Hernandez-Lobato in his talk about EP with stochastic gradient steps towards optimisation. (As well as much larger datasets.) And Mark Girolami bringing quasi-Monte Carlo into control variates. A kernel based ABC by Mijung Park, which uses kernels and maximum mean discrepancy to avoid defining summary statistics, and a version of parallel MCMC by Guillaume Basse. Plus another session on deep learning.

As usual with AISTATS conferences, the central activity of the day was the noon poster session, including speakers discussing their paper, and I had several interesting chats about MCMC related topics, with e.g. one alternative notion of ensemble MCMC [centred on estimating the normalising constant].

We awarded the notable student paper awards before the welcoming cocktail: The winners are Bo DaiNedelina Teneva, and Ye Wang.  And this first day ended up with a companionable evening in a most genuine tapa bar, tasting local blood sausage and local blue cheese. (If you do not mind the corrida theme!)

Follow

Get every new post delivered to your Inbox.

Join 1,068 other followers