dynamic nested sampling for stars

Posted in Books, pictures, Statistics, Travel with tags , , , , , , , , , , , , , , , , , on April 12, 2019 by xi'an

In the sequel of earlier nested sampling packages, like MultiNest, Joshua Speagle has written a new package called dynesty that manages dynamic nested sampling, primarily intended for astronomical applications. Which is the field where nested sampling is the most popular. One of the first remarks in the paper is that nested sampling can be more easily implemented by using a Uniform reparameterisation of the prior, that is, a reparameterisation that turns the prior into a Uniform over the unit hypercube. Which means in fine that the prior distribution can be generated from a fixed vector of uniforms and known transforms. Maybe not such an issue given that this is the prior after all.  The author considers this makes sampling under the likelihood constraint a much simpler problem but it all depends in the end on the concentration of the likelihood within the unit hypercube. And on the ability to reach the higher likelihood slices. I did not see any special trick when looking at the documentation, but reflected on the fundamental connection between nested sampling and this ability. As in the original proposal by John Skilling (2006), the slice volumes are “estimated” by simulated Beta order statistics, with no connection with the actual sequence of simulation or the problem at hand. We did point out our incomprehension for such a scheme in our Biometrika paper with Nicolas Chopin. As in earlier versions, the algorithm attempts at visualising the slices by different bounding techniques, before proceeding to explore the bounded regions by several exploration algorithms, including HMC.

“As with any sampling method, we strongly advocate that Nested Sampling should not be viewed as being strictly“better” or “worse” than MCMC, but rather as a tool that can be more or less useful in certain problems. There is no “One True Method to Rule Them All”, even though it can be tempting to look for one.”

When introducing the dynamic version, the author lists three drawbacks for the static (original) version. One is the reliance on this transform of a Uniform vector over an hypercube. Another one is that the overall runtime is highly sensitive to the choice the prior. (If simulating from the prior rather than an importance function, as suggested in our paper.) A third one is the issue that nested sampling is impervious to the final goal, evidence approximation versus posterior simulation, i.e., uses a constant rate of prior integration. The dynamic version simply modifies the number of point simulated in each slice. According to the (relative) increase in evidence provided by the current slice, estimated through iterations. This makes nested sampling a sort of inversted Wang-Landau since it sharpens the difference between slices. (The dynamic aspects for estimating the volumes of the slices and the stopping rule may hinder convergence in unclear ways, which is not discussed by the paper.) Among the many examples produced in the paper, a 200 dimension Normal target, which is an interesting object for posterior simulation in that most of the posterior mass rests on a ring away from the maximum of the likelihood. But does not seem to merit a mention in the discussion. Another example of heterogeneous regression favourably compares dynesty with MCMC in terms of ESS (but fails to include an HMC version).

[Breaking News: Although I wrote this post before the exciting first image of the black hole in M87 was made public and hence before I was aware of it, the associated AJL paper points out relying on dynesty for comparing several physical models of the phenomenon by nested sampling.]

testing MCMC code

Posted in Books, Statistics, University life with tags , , , , , , , , on December 26, 2014 by xi'an

A title that caught my attention on arXiv: testing MCMC code by Roger Grosse and David Duvenaud. The paper is in fact a tutorial adapted from blog posts written by Grosse and Duvenaud, on the blog of the Harvard Intelligent Probabilistic Systems group. The purpose is to write code in such a modular way that (some) conditional probability computations can be tested. Using my favourite Gibbs sampler for the mixture model, they advocate computing the ratios

$\dfrac{p(x'|z)}{p(x|z)}\quad\text{and}\quad \dfrac{p(x',z)}{p(x,z)}$

to make sure they are exactly identical. (Where x denotes the part of the parameter being simulated and z anything else.) The paper also mentions an older paper by John Geweke—of which I was curiously unaware!—leading to another test: consider iterating the following two steps:

1. update the parameter θ given the current data x by an MCMC step that preserves the posterior p(θ|x);
2. update the data x given the current parameter value θ from the sampling distribution p(x|θ).

Since both steps preserve the joint distribution p(x,θ), values simulated from those steps should exhibit the same properties as a forward production of (x,θ), i.e., simulating from p(θ) and then from p(x|θ). So with enough simulations, comparison tests can be run. (Andrew has a very similar proposal at about the same time.) There are potential limitations to the first approach, obviously, from being unable to write the full conditionals [an ABC version anyone?!] to making a programming mistake that keep both ratios equal [as it would occur if a Metropolis-within-Gibbs was run by using the ratio of the joints in the acceptance probability]. Further, as noted by the authors it only addresses the mathematical correctness of the code, rather than the issue of whether the MCMC algorithm mixes well enough to provide a pseudo-iid-sample from p(θ|x). (Lack of mixing that could be spotted by Geweke’s test.) But it is so immediately available that it can indeed be added to every and all simulations involving a conditional step. While Geweke’s test requires re-running the MCMC algorithm altogether. Although clear divergence between an iid sampling from p(x,θ) and the Gibbs version above could appear fast enough for a stopping rule to be used. In fine, a worthwhile addition to the collection of checkings and tests built across the years for MCMC algorithms! (Of which the trick proposed by my friend Tobias Rydén to run first the MCMC code with n=0 observations in order to recover the prior p(θ) remains my favourite!)

reflections on the probability space induced by moment conditions with implications for Bayesian Inference [refleXions]

Posted in Statistics, University life with tags , , , , , , , , , , on November 26, 2014 by xi'an

“The main finding is that if the moment functions have one of the properties of a pivotal, then the assertion of a distribution on moment functions coupled with a proper prior does permit Bayesian inference. Without the semi-pivotal condition, the assertion of a distribution for moment functions either partially or completely specifies the prior.” (p.1)

Ron Gallant will present this paper at the Conference in honour of Christian Gouréroux held next week at Dauphine and I have been asked to discuss it. What follows is a collection of notes I made while reading the paper , rather than a coherent discussion, to come later. Hopefully prior to the conference.

The difficulty I have with the approach presented therein stands as much with the presentation as with the contents. I find it difficult to grasp the assumptions behind the model(s) and the motivations for only considering a moment and its distribution. Does it all come down to linking fiducial distributions with Bayesian approaches? In which case I am as usual sceptical about the ability to impose an arbitrary distribution on an arbitrary transform of the pair (x,θ), where x denotes the data. Rather than a genuine prior x likelihood construct. But I bet this is mostly linked with my lack of understanding of the notion of structural models.

“We are concerned with situations where the structural model does not imply exogeneity of θ, or one prefers not to rely on an assumption of exogeneity, or one cannot construct a likelihood at all due to the complexity of the model, or one does not trust the numerical approximations needed to construct a likelihood.” (p.4)

As often with econometrics papers, this notion of structural model sets me astray: does this mean any latent variable model or an incompletely defined model, and if so why is it incompletely defined? From a frequentist perspective anything random is not a parameter. The term exogeneity also hints at this notion of the parameter being not truly a parameter, but including latent variables and maybe random effects. Reading further (p.7) drives me to understand the structural model as defined by a moment condition, in the sense that

$\mathbb{E}[m(\mathbf{x},\theta)]=0$

has a unique solution in θ under the true model. However the focus then seems to make a major switch as Gallant considers the distribution of a pivotal quantity like

$Z=\sqrt{n} W(\mathbf{x},\theta)^{-\frac{1}{2}} m(\mathbf{x},\theta)$

as induced by the joint distribution on (x,θ), hence conversely inducing constraints on this joint, as well as an associated conditional. Which is something I have trouble understanding, First, where does this assumed distribution on Z stem from? And, second, exchanging randomness of terms in a random variable as if it was a linear equation is a pretty sure way to produce paradoxes and measure theoretic difficulties.

The purely mathematical problem itself is puzzling: if one knows the distribution of the transform Z=Z(X,Λ), what does that imply on the joint distribution of (X,Λ)? It seems unlikely this will induce a single prior and/or a single likelihood… It is actually more probable that the distribution one arbitrarily selects on m(x,θ) is incompatible with a joint on (x,θ), isn’t it?

“The usual computational method is MCMC (Markov chain Monte Carlo) for which the best known reference in econometrics is Chernozhukov and Hong (2003).” (p.6)

While I never heard of this reference before, it looks like a 50 page survey and may be sufficient for an introduction to MCMC methods for econometricians. What I do not get though is the connection between this reference to MCMC and the overall discussion of constructing priors (or not) out of fiducial distributions. The author also suggests using MCMC to produce the MAP estimate but this always stroke me as inefficient (unless one uses our SAME algorithm of course).

“One can also compute the marginal likelihood from the chain (Newton and Raftery (1994)), which is used for Bayesian model comparison.” (p.22)

Not the best solution to rely on harmonic means for marginal likelihoods…. Definitely not. While the author actually uses the stabilised version (15) of Newton and Raftery (1994) estimator, which in retrospect looks much like a bridge sampling estimator of sorts, it remains dangerously close to the original [harmonic mean solution] especially for a vague prior. And it only works when the likelihood is available in closed form.

“The MCMC chains were comprised of 100,000 draws well past the point where transients died off.” (p.22)

I wonder if the second statement (with a very nice image of those dying transients!) is intended as a consequence of the first one or independently.

“A common situation that requires consideration of the notions that follow is that deriving the likelihood from a structural model is analytically intractable and one cannot verify that the numerical approximations one would have to make to circumvent the intractability are sufficiently accurate.” (p.7)

This then is a completely different business, namely that defining a joint distribution by mean of moment equations prevents regular Bayesian inference because the likelihood is not available. This is more exciting because (i) there are alternative available! From ABC to INLA (maybe) to EP to variational Bayes (maybe). And beyond. In particular, the moment equations are strongly and even insistently suggesting that empirical likelihood techniques could be well-suited to this setting. And (ii) it is no longer a mathematical worry: there exist a joint distribution on m(x,θ), induced by a (or many) joint distribution on (x,θ). So the question of finding whether or not it induces a single proper prior on θ becomes relevant. But, if I want to use ABC, being given the distribution of m(x,θ) seems to mean I can only generate new values of this transform while missing a natural distance between observations and pseudo-observations. Still, I entertain lingering doubts that this is the meaning of the study. Where does the joint distribution come from..?!

“Typically C is coarse in the sense that it does not contain all the Borel sets (…)  The probability space cannot be used for Bayesian inference”

My understanding of that part is that defining a joint on m(x,θ) is not always enough to deduce a (unique) posterior on θ, which is fine and correct, but rather anticlimactic. This sounds to be what Gallant calls a “partial specification of the prior” (p.9).

Overall, after this linear read, I remain very much puzzled by the statistical (or Bayesian) implications of the paper . The fact that the moment conditions are central to the approach would once again induce me to check the properties of an alternative approach like empirical likelihood.

mathematical statistics books with Bayesian chapters [incomplete book reviews]

Posted in Books, Statistics, University life with tags , , , , , , , , on July 9, 2013 by xi'an

I received (in the same box) two mathematical statistics books from CRC Press, Understanding Advanced Statistical Methods by Westfall and Henning, and Statistical Theory A Concise Introduction by Abramovich and Ritov. For review in CHANCE. While they are both decent books for teaching mathematical statistics at undergraduate borderline graduate level, I do not find enough of a novelty in them to proceed to a full review. (Given more time, I could have changed my mind about the first one.) Instead, I concentrate here on their processing of the Bayesian paradigm, which takes a wee bit more than a chapter in either of them. (And this can be done over a single métro trip!) The important following disclaimer applies: comparing both books is highly unfair in that it is only because I received them together. They do not necessarily aim at the same audience. And I did not read the whole of either of them.

First, the concise Statistical Theory  covers the topic in a fairly traditional way. It starts with a warning about the philosophical nature of priors and posteriors, which reflect beliefs rather than frequency limits (just like likelihoods, no?!). It then introduces priors with the criticism that priors are difficult to build and assess. The two classes of priors analysed in this chapter are unsurprisingly conjugate priors (which hyperparameters have to be determined or chosen or estimated in the empirical Bayes heresy [my words!, not the authors’]) and “noninformative (objective) priors”.  The criticism of the flat priors is also traditional and leads to the  group invariant (Haar) measures, then to Jeffreys non-informative priors (with the apparent belief that Jeffreys only handled the univariate case). Point estimation is reduced to posterior expectations, confidence intervals to HPD regions, and testing to posterior probability ratios (with a warning about improper priors). Bayes rules make a reappearance in the following decision-theory chapter, as providers of both admissible and minimax estimators. This is it, as Bayesian techniques are not mentioned in the final “Linear Models” chapter. As a newcomer to statistics, I think I would be as bemused about Bayesian statistics as when I got my 15mn entry as a student, because here was a method that seemed to have a load of history, an inner coherence, and it was mentioned as an oddity in an otherwise purely non-Bayesian course. What good could this do to the understanding of the students?! So I would advise against getting this “token Bayesian” chapter in the book

“You are not ignorant! Prior information is what you know prior to collecting the data.” Understanding Advanced Statistical Methods (p.345)

Second, Understanding Advanced Statistical Methods offers a more intuitive entry, by justifying prior distributions as summaries of prior information. And observations as a mean to increase your knowledge about the parameter. The Bayesian chapter uses a toy but very clear survey examplew to illustrate the passage from prior to posterior distributions. And to discuss the distinction between informative and noninformative priors. (I like the “Ugly Rule of Thumb” insert, as it gives a guideline without getting too comfy about it… E.g., using a 90% credible interval is good enough on p.354.) Conjugate priors are mentioned as a result of past computational limitations and simulation is hailed as a highly natural tool for analysing posterior distributions. Yay! A small section discusses the purpose of vague priors without getting much into details and suggests to avoid improper priors by using “distributions with extremely large variance”, a concept we dismissed in Bayesian Core! For how large is “extremely large”?!

“You may end up being surprised to learn in later chapters (..) that, with classical methods, you simply cannot perform the types of analyses shown in this section (…) And that’s the answer to the question, “What good is Bayes?””Understanding Advanced Statistical Methods (p.345)

Then comes the really appreciable part, a section entitled “What good is Bayes?”—it actually reads “What Good is Bayes?” (p.359), leading to a private if grammatically poor joke since I.J. Good was one of the first modern Bayesians, working with Turing at Bletchley Park…—  The authors simply skip the philosophical arguments to give the reader a showcase of examples where the wealth of the Bayesian toolbox: logistic regression, VaR (Value at Risk), stock prices, drug profit prediction. Concluding with arguments in favour of the frequentist methods: (a) not requiring priors, (b) easier with generic distributifrequentistons, (c) easier to understand with simulation, and (d) easier to validate with validation. I do not mean to get into a debate about those points as my own point is that the authors are taking a certain stand about the pros and cons of the frequentist/Bayesian approaches and that they are making their readers aware of it. (Note that the Bayesian chapter comes before the frequentist chapter!) A further section is “Comparing the Bayesian and frequentist paradigms?” (p.384), again with a certain frequentist slant, but again making the distinctions and similarities quite clear to the reader. Of course, there is very little (if any) about Bayesian approaches in the next chapters but this is somehow coherent with the authors’ perspective. Once more, a perspective that is well-spelled and comprehensible for the reader. Even the novice statistician. In that sense, having a Bayesian chapter inside a general theory book makes sense.  (The second book has a rather detailed website, by the way! Even though handling simulations in Excel and drawing graphs in SAS could be dangerous to your health…)