generating from a failure rate function [X’ed]

Posted in Books, Kids, Statistics, University life with tags , , , , , , , , on July 4, 2015 by xi'an

While I now try to abstain from participating to the Cross Validated forum, as it proves too much of a time-consuming activity with little added value (in the sense that answers are much too often treated as disposable napkins by users who cannot be bothered to open a textbook and who usually do not exhibit any long-term impact of the provided answer, while clogging the forum with so many questions that the individual entries seem to get so little traffic, when compared say with the stackoverflow forum, to the point of making the analogy with disposable wipes more appropriate!), I came across a truly interesting question the other night. Truly interesting for me in that I had never considered the issue before.

The question is essentially wondering at how to simulate from a distribution defined by its failure rate function, which is connected with the density f of the distribution by

$\eta(t)=\frac{f(t)}{\int_t^\infty f(x)\,\text{d}x}=-\frac{\text{d}}{\text{d}t}\,\log \int_t^\infty f(x)\,\text{d}x$

From a purely probabilistic perspective, defining the distribution through f or through η is equivalent, as shown by the relation

$F(t)=1-\exp\left\{-\int_0^t\eta(x)\,\text{d}x\right\}$

but, from a simulation point of view, it may provide a different entry. Indeed, all that is needed is the ability to solve (in X) the equation

$\int_0^X\eta(x)\,\text{d}x=-\log(U)$

when U is a Uniform (0,1) variable. Which may help in that it does not require a derivation of f. Obviously, this also begs the question as to why would a distribution be defined by its failure rate function.

“UK outmoded universities must modernise”

Posted in Books, Kids, pictures, University life with tags , , , , on July 3, 2015 by xi'an

[A rather stinky piece in The Guardian today, written by a consultant self-styled Higher Education expert… No further comments needed!]

“The reasons cited for this laggardly response [to innovations] will be familiar to any observer of the university system: an inherently conservative and risk-averse culture in most institutions; sclerotic systems and processes designed for a different world, and a lack of capacity, skills and willingness to change among an ageing academic community. All these are reinforced by perceptions that most proposed innovations are over-hyped and that current ways of operating have plenty of life left in them yet.”

Posted in Books, Statistics, University life with tags , , , , , , , , , , on July 3, 2015 by xi'an

Heiko Strathmann, Dino Sejdinovic, Samuel Livingstone, Zoltán Szabó, and Arthur Gretton arXived paper about Kamiltonian MCMC generated comments from Michael Betancourt, Dan Simpson and myself, which themselves induced the following reply by Heiko, detailed enough to deserve a post of its own.

We certainly agree that the naive approach of using a non-parametric kernel density estimator on the chain history (as in [Christian’s book, Example 8.8]) as a *proposal* fails spectacularly on simple examples: the probability of proposing in unexplored regions is extremely small, independent of the current position of the MCMC trajectory. This is not what we do though. Instead, we use the gradient of a density estimator, and not the density itself, for our HMC proposal. Just like KAMH, KMC lite in fact falls back to Random Walk Metropolis in previously unexplored regions and therefore inherits geometric ergodicity properties. This in particular includes the ability to explore previously “unseen” regions, even if adaptation has stopped. I implemented a simple illustration and comparison here.

ABC example.
The main point of the ABC example, is that our method does not suffer from the additional bias from Gaussian synthetic likelihoods when being confronted with skewed models. But there is also a computational efficiency aspect. The scheme by Meeds et al. relies on finite differences and requires $2D$ simulations from the likelihood *every time* the gradient is evaluated (i.e. every leapfrog iteration) and H-ABC discards this valuable information subsequently. In contrast, KMC accumulates gradient information from simulations: it only requires to simulate from the likelihood *once* in the accept/reject step after the leapfrog integration (where gradients are available in closed form). The density is only updated then, and not during the leapfrog integration. Similar work on speeding up HMC via energy surrogates can be applied in the tall data scenario.

Approximating HMC when gradients aren’t available is in general a difficult problem. One approach (like surrogate models) may work well in some scenarios while a different approach (i.e. Monte Carlo) may work better in others, and the ABC example showcases such a case. We very much doubt that one size will fit all — but rather claim that it is of interest to find and document these scenarios.
Michael raised the concern that intractable gradients in the Pseudo-Marginal case can be avoided by running an MCMC chain on the joint space (e.g. $(f,\theta)$ for the GP classifier). To us, however, the situation is not that clear. In many cases, the correlations between variables can cause convergence problems (see e.g. here) for the MCMC and have to be addressed by de-correlation schemes (as here), or e.g. by incorporating geometric information, which also needs fixes as Michaels’s very own one. Which is the method of choice with a particular statistical problem at hand? Which method gives the smallest estimation error (if that is the goal?) for a given problem? Estimation error per time? A thorough comparison of these different classes of algorithms in terms of performance related to problem class would help here. Most papers (including ours) only show experiments favouring their own method.

GP estimator quality.
Finally, to address Michael’s point on the consistency of the GP estimator of the density gradient: this is discussed In the original paper on the infinite dimensional exponential family. As Michael points out, higher dimensional problems are unavoidably harder, however the specific details are rather involved. First, in terms of theory: both the well-specified case (when the natural parameter is in the RKHS, Section 4), and the ill-specified case (the natural parameter is in a “reasonable”, larger class of functions, Section 5), the estimate is consistent. Consistency is obtained in various metrics, including the L² error on gradients. The rates depend on how smooth the natural parameter is (and indeed a poor choice of hyper-parameter will mean slower convergence). The key point, in regards to Michael’s question, is that the smoothness requirement becomes more restrictive as the dimension increases: see Section 4.2, “range space assumption”.
Second, in terms of practice: we have found in experiments that the infinite dimensional exponential family does perform considerably better than a kernel density estimator when the dimension increases (Section 6). In other words, our density estimator can take advantage of smoothness properties of the “true” target density to get good convergence rates. As a practical strategy for hyper-parameter choice, we cross-validate, which works well empirically despite being distasteful to Bayesians. Experiments in the KMC paper also indicate that we can scale these estimators up to dimensions in the 100s on Laptop computers (unlike most other gradient estimation techniques in HMC, e.g. the ones in your HMC & sub-sampling note, or the finite differences in Meeds et al).

variational consensus Monte Carlo

Posted in Books, Statistics, University life with tags , , , , , , on July 2, 2015 by xi'an

“Unfortunately, the factorization does not make it immediately clear how to aggregate on the level of samples without first having to obtain an estimate of the densities themselves.” (p.2)

The recently arXived variational consensus Monte Carlo is a paper by Maxim Rabinovich, Elaine Angelino, and Michael Jordan that approaches the consensus Monte Carlo principle from a variational perspective. As in the embarrassingly parallel version,  the target is split into a product of K terms, each being interpreted as an unnormalised density and being fed to a different parallel processor. The most natural partition is to break the data into K subsamples and to raise the prior to the power 1/K in each term. While this decomposition makes sense from a storage perspective, since each bit corresponds to a different subsample of the data, it raises the question of the statistical pertinence of splitting the prior and my feelings about it are now more lukewarm than when I commented on the embarrassingly parallel version,  mainly for the reason that it is not reparameterisation invariant—getting different targets if one does the reparameterisation before or after the partition—and hence does not treat the prior as the reference measure it should be. I therefore prefer the version where the same original prior is attached to each part of the partitioned likelihood (and even more the random subsampling approaches discussed in the recent paper of Bardenet, Doucet, and Holmes). Another difficulty with the decomposition is that a product of densities is not a density in most cases (it may even be of infinite mass) and does not offer a natural path to the analysis of samples generated from each term in the product. Nor an explanation as to why those samples should be relevant to construct a sample for the original target.

“The performance of our algorithm depends critically on the choice of aggregation function family.” (p.5)

Since the variational Bayes approach is a common answer to complex products models, Rabinovich et al. explore the use of variational Bayes techniques to build the consensus distribution out of the separate samples. As in Scott et al., and Neiswanger et al., the simulation from the consensus distribution is a transform of simulations from each of the terms in the product, e.g., a weighted average. Which determines the consensus distribution as a member of an aggregation family defined loosely by a Dirac mass. When the transform is a sum of individual terms, variational Bayes solutions get much easier to find and the authors work under this restriction… In the empirical evaluation of this variational Bayes approach as opposed to the uniform and Gaussian averaging options in Scott et al., it improves upon those, except in a mixture example with a large enough common variance.

In fine, despite the relevance of variational Bayes to improve the consensus approximation, I still remain unconvinced about the use of the product of (pseudo-)densities and the subsequent mix of simulations from those components, for the reason mentioned above and also because the tail behaviour of those components is not related with the tail behaviour of the target. Still, this is a working solution to a real problem and as such is a reference for future works.

the (expected) demise of the Bayes factor [#2]

Posted in Books, Kids, pictures, Running, Statistics, Travel, University life with tags , , , , , , , , , on July 1, 2015 by xi'an

Following my earlier comments on Alexander Ly, Josine Verhagen, and Eric-Jan Wagenmakers, from Amsterdam, Joris Mulder, a special issue editor of the Journal of Mathematical Psychology, kindly asked me for a written discussion of that paper, discussion that I wrote last week and arXived this weekend. Besides the above comments on ToP, this discussion contains some of my usual arguments against the use of the Bayes factor as well as a short introduction to our recent proposal via mixtures. Short introduction as I had to restrain myself from reproducing the arguments in the original paper, for fear it would jeopardize its chances of getting published and, who knows?, discussed.

Kamiltonian Monte Carlo [no typo]

Posted in Books, Statistics, University life with tags , , , , , , , , , , on June 29, 2015 by xi'an

Heiko Strathmann, Dino Sejdinovic, Samuel Livingstone, Zoltán Szabó, and Arthur Gretton arXived a paper last week about Kamiltonian MCMC, the K being related with RKHS. (RKHS as in another KAMH paper for adaptive Metropolis-Hastings by essentially the same authors, plus Maria Lomeli and Christophe Andrieu. And another paper by some of the authors on density estimation via infinite exponential family models.) The goal here is to bypass the computation of the derivatives in the moves of the Hamiltonian MCMC algorithm by using a kernel surrogate. While the genuine RKHS approach operates within an infinite exponential family model, two versions are proposed, KMC lite with an increasing sequence of RKHS subspaces, and KMC finite, with a finite dimensional space. In practice, this means using a leapfrog integrator with a different potential function, hence with a different dynamics.

The estimation of the infinite exponential family model is somewhat of an issue, as it is estimated from the past history of the Markov chain, simplified into a random subsample from this history [presumably without replacement, meaning the Markovian structure is lost on the subsample]. This is puzzling because there is dependence on the whole past, which cancels ergodicity guarantees… For instance, we gave an illustration in Introducing Monte Carlo Methods with R [Chapter 8] of the poor impact of approximating the target by non-parametric kernel estimates. I would thus lean towards the requirement of a secondary Markov chain to build this kernel estimate. The authors are obviously aware of this difficulty and advocate an attenuation scheme. There is also the issue of the cost of a kernel estimate, in O(n³) for a subsample of size n. If, instead, a fixed dimension m for the RKHS is selected, the cost is in O(tm²+m³), with the advantage of a feasible on-line update, making it an O(m³) cost in fine. But again the worry of using the whole past of the Markov chain to set its future path…

Among the experiments, a KMC for ABC that follows the recent proposal of Hamiltonian ABC by Meeds et al. The arguments  are interesting albeit sketchy: KMC-ABC does not require simulations at each leapfrog step, is it because the kernel approximation does not get updated at each step? Puzzling.

I also discussed the paper with Michael Betancourt (Warwick) and here his comments:

“I’m hesitant for the same reason I’ve been hesitant about algorithms like Bayesian quadrature and GP emulators in general. Outside of a few dimensions I’m not convinced that GP priors have enough regularization to really specify the interpolation between the available samples, so any algorithm that uses a single interpolation will be fundamentally limited (as I believe is born out in non-trivial scaling examples) and trying to marginalize over interpolations will be too awkward.

They’re really using kernel methods to model the target density which then gives the gradient analytically. RKHS/kernel methods/ Gaussian processes are all the same math — they’re putting prior measures over functions. My hesitancy is that these measures are at once more diffuse than people think (there are lots of functions satisfying a given smoothness criterion) and more rigid than people think (perturb any of the smoothness hyper-parameters and you get an entirely new space of functions).

When using these methods as an emulator you have to set the values of the hyper-parameters which locks in a very singular definition of smoothness and neglects all others. But even within this singular definition there are a huge number of possible functions. So when you only have a few points to constrain the emulation surface, how accurate can you expect the emulator to be between the points?

In most cases where the gradient is unavailable it’s either because (a) people are using decades-old Fortran black boxes that no one understands, in which case there are bigger problems than trying to improve statistical methods or (b) there’s a marginalization, in which case the gradients are given by integrals which can be approximated with more MCMC. Lots of options.”

the girl who saved the king of Sweden [book review]

Posted in Books, Kids, Travel with tags , , , , , , , , , , , on June 27, 2015 by xi'an

When visiting a bookstore in Florence last month, during our short trip to Tuscany, I came upon this book with enough of a funny cover and enough of a funny title (possibly capitalising on the similarity with “the girl who played with fire”] to make me buy it. I am glad I gave in to this impulse as the book is simply hilarious! The style and narrative relate rather strongly to the series of similarly [mostly] hilarious picaresque tales written by Paasilina and not only because both authors are from Scandinavia. There is the same absurd feeling that the book characters should not have this sort of things happening to them and still the morbid fascination to watch catastrophe after catastrophe being piled upon them. While the story is deeply embedded within the recent history of South Africa and [not so much] of Sweden for the past 30 years, including major political figures, there is no true attempt at making the story in the least realistic, which is another characteristic of the best stories of Paasilina. Here, a young girl escapes the poverty of the slums of Soweto, to eventually make her way to Sweden along with a spare nuclear bomb and a fistful of diamonds. Which alas are not eternal… Her intelligence helps her to overcome most difficulties, but even her needs from time to time to face absurd situations as another victim. All is well that ends well for most characters in the story, some of whom one would prefer to vanish in a gruesome accident. Which seemed to happen until another thread in the story saved the idiot. The satire of South Africa and of Sweden is most enjoyable if somewhat easy! Now I have to read the previous volume in the series, The Hundred-Year-Old Man Who Climbed Out of the Window and Disappeared!