Archive for evidence

dynamic nested sampling for stars

Posted in Books, pictures, Statistics, Travel on April 12, 2019 by xi'an

Following earlier nested sampling packages like MultiNest, Joshua Speagle has written a new package called dynesty that handles dynamic nested sampling, primarily intended for astronomical applications, which is the field where nested sampling is the most popular. One of the first remarks in the paper is that nested sampling can be more easily implemented by using a Uniform reparameterisation of the prior, that is, a reparameterisation that turns the prior into a Uniform over the unit hypercube. Which means in the end that a draw from the prior can be generated from a fixed vector of uniforms and known transforms. Maybe not such an issue given that this is the prior after all. The author considers this makes sampling under the likelihood constraint a much simpler problem, but it all depends in the end on the concentration of the likelihood within the unit hypercube. And on the ability to reach the higher likelihood slices. I did not see any special trick when looking at the documentation, but reflected on the fundamental connection between nested sampling and this ability. As in the original proposal by John Skilling (2006), the slice volumes are “estimated” by simulated Beta order statistics, with no connection with the actual sequence of simulations or the problem at hand. We did point out our incomprehension of such a scheme in our Biometrika paper with Nicolas Chopin. As in earlier versions, the algorithm attempts to visualise the slices through different bounding techniques, before exploring the bounded regions with one of several exploration algorithms, including HMC.
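To make the Uniform reparameterisation concrete, here is a minimal sketch of a transform that maps a point of the unit hypercube to a prior draw, one inverse CDF per coordinate. The three-parameter prior (Normal, Exponential, Uniform) and the name `prior_transform` are assumptions made for illustration; this is not code taken from dynesty or from the paper.

```python
import math
import random
from statistics import NormalDist

def prior_transform(u):
    """Map u in the unit cube (0, 1)^3 to a draw from a hypothetical prior.

    Illustrative prior, not from the paper:
      mu    ~ Normal(0, 10)
      sigma ~ Exponential(rate 1)
      w     ~ Uniform(-5, 5)
    Each coordinate goes through the inverse CDF of its own prior.
    """
    mu = NormalDist(mu=0.0, sigma=10.0).inv_cdf(u[0])
    sigma = -math.log(1.0 - u[1])   # inverse CDF of Exponential(1)
    w = -5.0 + 10.0 * u[2]          # inverse CDF of Uniform(-5, 5)
    return [mu, sigma, w]

rng = random.Random(0)
u = [rng.random() for _ in range(3)]   # a fixed vector of uniforms
theta = prior_transform(u)             # the corresponding prior draw

# The centre of the cube maps to the vector of prior medians.
median_point = prior_transform([0.5, 0.5, 0.5])
```

The sampler then only ever works on the hypercube, and the likelihood constraint is evaluated at `prior_transform(u)`. The transform does nothing about how concentrated the likelihood is inside the cube.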

“As with any sampling method, we strongly advocate that Nested Sampling should not be viewed as being strictly “better” or “worse” than MCMC, but rather as a tool that can be more or less useful in certain problems. There is no “One True Method to Rule Them All”, even though it can be tempting to look for one.”

When introducing the dynamic version, the author lists three drawbacks of the static (original) version. One is the reliance on this transform of a Uniform vector over a hypercube. Another one is that the overall runtime is highly sensitive to the choice of the prior. (If simulating from the prior rather than from an importance function, as suggested in our paper.) A third one is that nested sampling is impervious to the final goal, evidence approximation versus posterior simulation, i.e., uses a constant rate of prior integration. The dynamic version simply modifies the number of points simulated in each slice, according to the (relative) increase in evidence provided by the current slice, estimated through iterations. This makes nested sampling a sort of inverted Wang-Landau since it sharpens the difference between slices. (The dynamic aspects for estimating the volumes of the slices and the stopping rule may hinder convergence in unclear ways, which is not discussed in the paper.) Among the many examples produced in the paper is a 200-dimensional Normal target, which is an interesting object for posterior simulation in that most of the posterior mass rests on a ring away from the maximum of the likelihood. But it does not seem to merit a mention in the discussion. Another example of heterogeneous regression favourably compares dynesty with MCMC in terms of ESS (but fails to include an HMC version).
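The ring phenomenon is easy to check numerically. The sketch below assumes a standard Normal target in dimension 200 (the exact target used in the paper may differ) and looks at the distance of each draw from the mode at the origin.

```python
import math
import random

rng = random.Random(1)
d, n = 200, 1000   # dimension and number of draws (illustrative values)

radii = []
for _ in range(n):
    x = [rng.gauss(0.0, 1.0) for _ in range(d)]
    radii.append(math.sqrt(sum(v * v for v in x)))

mean_radius = sum(radii) / n   # concentrates near sqrt(d), about 14.1
min_radius = min(radii)        # closest any draw gets to the mode
print(f"mean radius {mean_radius:.2f}, smallest radius {min_radius:.2f}")
```

The radii pile up around sqrt(200) with a spread of order one, so no draw lands anywhere near the mode, even though the density is highest there.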

[Breaking News: Although I wrote this post before the exciting first image of the black hole in M87 was made public and hence before I was aware of it, the associated ApJL paper mentions relying on dynesty for comparing several physical models of the phenomenon by nested sampling.]

 

let the evidence speak [book review]

Posted in Books, Kids, Statistics on December 17, 2018 by xi'an

This book by Alan Jessop, professor at the Durham University Business School, aims at presenting Bayesian ideas and methods towards decision making “without formula because they are not necessary; the ability to add and multiply is all that is needed.” The trick is in using a Bayes grid, in other words a two-by-two table. (A few formulas survived the slaughter, see e.g. the formula for the entropy on p. 91, contained in the chapter on information, which I find definitely unclear.) When leaving the 2×2 world, things become more complicated and the construction of a prior belief as a probability density gets heroic without the availability of maths formulas. The first part of the book is about Likelihood, albeit not the likelihood function, despite having the general rule that (p.73)

belief is proportional to base rate x likelihood

which is the book’s version of Bayes’ (base?!) theorem. It then goes on to discuss the less structured nature of the prior (or prior beliefs) against the likelihood by describing Tony O’Hagan’s way of scaling experts’ beliefs in terms of a Beta distribution. And mentioning Jaynes’ maximum entropy prior without a single formula. What is hard to fathom from the text is how one can derive the likelihood outside surveys. (Using the illustration of Oswald’s 1963 murder by Ruby in the likelihood chapter does not particularly help!) A bit of nitpicking at this stage: the sentence

“The ancient Greeks, and before them the Chinese and the Aztecs…”

is historically incorrect since, while the Chinese empire dates back before the Greek dark ages, the Aztecs only ruled Mexico from the 14th century (AD) until the Spanish invasion. While most of the book sticks with unidimensional parameters, it also discusses more complex structures, for which it relies on Monte Carlo, although the description is rather cryptic (use your spreadsheet!, p.133). The book at this stage turns into a more story-telling mode, by considering for instance the Federalist papers analysis by Mosteller and Wallace. The reader can only follow the process of assessing a document’s authorship for a single word, as multidimensional cases (for either data or parameters) are out of reach. The same comment applies to the ecology, archeology, and psychology chapters that follow. The intermediary chapter on the “grossly misleading” [Court wording] of the statistical evidence in the Sally Clark prosecution is more accessible in that (again) it relies on a single number. Returning to the ban on Bayes’ rule in British courts:

In the light of the strong criticism by this court in the 1990s of using Bayes theorem before the jury in cases where there was no reliable statistical evidence, the practice of using a Bayesian approach and likelihood ratios to formulate opinions placed before a jury without that process being disclosed and debated in court is contrary to principles of open justice.

the discussion found in the book is quite moderate and inclusive, in that a Bayesian analysis helps in gathering evidence about a case, but may be misunderstood or misused at the [non-Bayesian] decision level.

In conclusion, Let the Evidence Speak is an interesting introduction to Bayesian thinking, through a simplifying device, the Bayes grid, which seems to come from management, with a large number of examples, if not necessarily all realistic, and some side-stories. I doubt this exposure can produce expert practitioners, but it makes for a worthwhile awakening for someone “likely to have read this book because [one] had heard of Bayes but were uncertain what it was” (p.222). With commendable caution and warnings along the way.

unrejected null [xkcd]

Posted in Statistics on July 18, 2018 by xi'an

new estimators of evidence

Posted in Books, Statistics on June 19, 2018 by xi'an

In an incredible accumulation of coincidences, I came across yet another paper about evidence and the harmonic mean challenge, by Yu-Bo Wang, Ming-Hui Chen [same as in Chen, Shao, Ibrahim], Lynn Kuo, and Paul O. Lewis this time, published in Bayesian Analysis. (Disclaimer: I was not involved in the reviews of any of these papers!) Authors who are located in Storrs, Connecticut, in geographic and thematic connection with the original Gelfand and Dey (1994) paper! (Private joke about the Old Man of Storr in above picture!)

“The working parameter space is essentially the constrained support considered by Robert and Wraith (2009) and Marin and Robert (2010).”

The central idea is to use a more general function than our HPD restricted prior but still with a known integral. Not in the sense of control variates, though. The function of choice is a weighted sum of indicators of terms of a finite partition, which implies a compact parameter set Ω. Or a form of HPD region, although it is unclear when the volume can be derived. While the consistency of the estimator of the inverse normalising constant [based on an MCMC sample] is unsurprising, the more advanced part of the paper is about finding the optimal sequence of weights, as in control variates. But it is also unsurprising in that the weights are proportional to the inverses of the inverse posteriors over the sets in the partition. Since these are hard to derive in practice, the authors come up with a fairly interesting alternative, which is to take the value of the posterior at an arbitrary point of the relevant set.
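For readers who have not met the restricted version of the harmonic mean identity, here is a minimal sketch of the simplest case, a single indicator over an HPD-like interval of known length, rather than the weighted partition of the paper. Everything in it is an assumption made for illustration: a conjugate Normal toy model where the evidence is known in closed form, exact posterior draws standing in for an MCMC sample, and an interval centred at the known posterior mean.

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of a Normal(mean, var) distribution at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

# Toy model: x | theta ~ N(theta, 1), theta ~ N(0, 1), a single observation.
x_obs = 1.0
post_mean, post_var = x_obs / 2.0, 0.5        # posterior is N(x/2, 1/2)
true_evidence = normal_pdf(x_obs, 0.0, 2.0)   # marginally, x ~ N(0, 2)

rng = random.Random(2)
T = 20000
draws = [rng.gauss(post_mean, math.sqrt(post_var)) for _ in range(T)]

# Region A: interval of half-width 0.5 around the posterior mean, volume 1.
half_width = 0.5
volume = 2.0 * half_width

# Identity: E_post[ 1_A(theta) / (vol(A) * L(theta) * prior(theta)) ] = 1 / Z.
total = 0.0
for theta in draws:
    if abs(theta - post_mean) <= half_width:
        unnormalised = normal_pdf(x_obs, theta, 1.0) * normal_pdf(theta, 0.0, 1.0)
        total += 1.0 / (volume * unnormalised)

evidence_hat = T / total
print(f"estimate {evidence_hat:.4f}, exact {true_evidence:.4f}")
```

Because the unnormalised posterior is bounded away from zero on the interval, the terms of the average are bounded, which is what removes the infinite variance of the regular harmonic mean estimator.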

The paper also contains an extension replacing the weights with functions that are integrable and with known integrals. Which is hard for most choices, even though it contains the regular harmonic mean estimator as a special case. And should also suffer from the curse of dimensionality when the constraint to keep the target almost constant is implemented (as in Figure 1).

The method, when properly calibrated, does much better than the harmonic mean (not a surprise) and than the Petris and Tardella (2007) alternative, but is compared with no other technique, on toy problems like Normal, Normal mixture, and probit regression with three covariates (no Pima Indians this time!). As an aside, I find it hard to understand how the regular harmonic mean estimator takes longer than this more advanced version, which should require more calibration. But I find it hard to see a general application of the principle, because the partition needs to be chosen in terms of the target. Embedded balls cannot work for every possible problem, even with ex-post standardisation.

 

unbiased consistent nested sampling via sequential Monte Carlo [a reply]

Posted in pictures, Statistics, Travel on June 13, 2018 by xi'an

Rob Salomone sent me the following reply to my comments of yesterday on their recently arXived paper.

Our main goal in the paper was to show that Nested Sampling (when interpreted a certain way) is really just a member of a larger class of SMC algorithms, and exploring the consequences of that. We should point out that the section regarding calibration applies generally to SMC samplers, and hope that people give those techniques a try regardless of their chosen SMC approach.
Regarding your question about “whether or not it makes more sense to get completely SMC and forego any nested sampling flavour!”, this is an interesting point. After all, if Nested Sampling is just a special form of SMC, why not just use more standard SMC approaches? It seems that Nested Sampling’s main advantage is its ability to cope with problems that have “phase transition” like behaviour, and thus it is robust to a wider range of difficult problems than annealing approaches. Nevertheless, we hope this way of looking at NS (and showing that there may be variations of SMC with certain advantages) leads to improved NS and SMC methods down the line.
Regarding your post, I should clarify a point regarding unbiasedness. The largest likelihood bound is actually set to infinity. Thus, for the fixed version of NS-SMC, one has an unbiased estimator of the “final” band. Choosing a final band prematurely will of course result in very high variance. However, the estimator is unbiased. For example, consider NS-SMC with only one stratum. Then, the method reduces to simply using the prior as an importance sampling distribution for the posterior (unbiased, but often high variance).
Comments related to two specific parts of your post are below (your comments in italicised bold):
“Which never occurred as the number one difficulty there, as the simplest implementation runs a Markov chain from the last removed entry, independently from the remaining entries. Even stationarity is not an issue since I believe that the first occurrence within the level set is distributed from the constrained prior.”
This is an interesting point that we had not considered! In practice, and in many papers that apply Nested Sampling with MCMC, the common approach is to start the MCMC at one of the randomly selected “live points”, so the discussion related to independence was in regard to these common implementations.
Regarding starting the chain from outside of the level set, this is likely not done in practice as it introduces an additional difficulty of needing to propose a sample inside the required region (Metropolis–Hastings will have non-zero probability of returning a sample that is still outside the constrained region for any fixed number of iterations). Forcing the continuation of MCMC until a valid point is proposed I believe will be a subtle violation of detailed balance. Of course, the bias of such a modification may be small in practice, but it is an additional awkwardness introduced by the requirement of sample independence!
“And then, in a twist that is not clearly explained in the paper, the focus moves to an improved nested sampler that moves one likelihood value at a time, with a particle step replacing a single particle. (Things get complicated when several particles may take the very same likelihood value, but randomisation helps.) At this stage the algorithm is quite similar to the original nested sampler. Except for the unbiased estimation of the constants, the final constant, and the replacement of exponential weights exp(-t/N) by powers of (N-1)/N.”
Thanks for pointing out that this isn’t clear, we will try to do better in the next revision! The goal of this part of the paper wasn’t necessarily to propose a new version of nested sampling. Our focus here was to demonstrate that NS-SMC is not simply the Nested Sampling idea with an SMC twist, but that the original NS algorithm with MCMC (and restarting the MCMC sampling at one of the “live points” as people do in practice) actually is a special case of SMC (with the weights replaced with a suboptimal choice).
The most curious thing is that, as you note, the estimates of remaining prior mass in the SMC context come out as powers of (N-1)/N and not exp(-t/N). In the paper by Walter (2017), he shows that the former choice is actually superior in terms of bias and variance. It was a nice touch that the superior choice of weights came out naturally in the SMC interpretation! 
That said, as the fixed version of NS-SMC is the one with the unbiasedness and consistency properties, this was the version we used in the main statistical examples.

unbiased consistent nested sampling via sequential Monte Carlo

Posted in pictures, Statistics, Travel on June 12, 2018 by xi'an

“Moreover, estimates of the marginal likelihood are unbiased.” (p.2)

Rob Salomone, Leah South, Chris Drovandi and Dirk Kroese (from QUT and UQ, Brisbane) recently arXived a paper that frames nested sampling in such a way that marginal likelihoods can be unbiasedly (and consistently) estimated.

“Why isn’t nested sampling more popular with statisticians?” (p.7)

A most interesting question, especially given its popularity in cosmology and other branches of physics. A first drawback pointed out in the paper is the requirement of independence between the elements of the sample produced at each iteration. Which never occurred to me as the number one difficulty there, as the simplest implementation runs a Markov chain from the last removed entry, independently from the remaining entries. Even stationarity is not an issue since I believe that the first occurrence within the level set is distributed from the constrained prior.

A second difficulty is the use of quadrature, which turns the integrand into a step function at random slices. Indeed, mixing Monte Carlo with numerical integration makes life much harder, as shown by the early avatars of nested sampling that only accounted for the numerical errors. (And which caused Nicolas and me to write our critical paper in Biometrika.) There are few studies of that kind in the literature, the only one I can think of being [my former PhD student] Anne Philippe’s thesis twenty years ago.

The third issue stands with the difficulty in parallelising the method. Except by jumping k points at once, rather than going one level at a time. While I agree this makes life more complicated, I am also unsure about the severity of that issue as k nested sampling algorithms can be run in parallel and aggregated in the end, from simple averaging to something more elaborate.

The final blemish is that the nested sampling estimator has a stopping mechanism that induces a truncation error, again maybe a lesser problem given the overall difficulty in assessing the total error.

The paper takes advantage of the ability of SMC to produce unbiased estimates of a sequence of normalising constants (or of the normalising constants of a sequence of targets). For nested sampling, the sequence is made of the prior distribution restricted to an embedded sequence of level sets. With another sequence restricted to bands (likelihood between two likelihood boundaries). If all restricted posteriors of the second kind and their normalising constants are known, the full posterior is known. Apparently up to the main normalising constant, i.e. the marginal likelihood, except that it is also the sum of all normalising constants. Handling this sequence by SMC addresses the four concerns of the four authors, apart from the truncation issue, since the largest likelihood bound needs to be set for running the algorithm.
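The remark that the marginal likelihood is the sum of the band constants can be checked on a toy case. The sketch below is not the NS-SMC algorithm: it draws from the prior only (the one-stratum, importance sampling situation), on an assumed conjugate Normal model with arbitrary likelihood bounds, the last one being set to infinity.

```python
import math
import random

def likelihood(theta, x_obs=1.0):
    """N(x_obs; theta, 1) likelihood for a single observation (toy model)."""
    return math.exp(-0.5 * (x_obs - theta) ** 2) / math.sqrt(2.0 * math.pi)

rng = random.Random(3)
n = 50000
# Likelihood values at draws from the N(0, 1) prior.
values = [likelihood(rng.gauss(0.0, 1.0)) for _ in range(n)]

# Plain prior-based estimate of the evidence.
z_hat = sum(values) / n

# Arbitrary likelihood bounds, the largest one being infinite.
bounds = [0.0, 0.1, 0.2, 0.3, math.inf]
band_constants = []
for lower, upper in zip(bounds[:-1], bounds[1:]):
    # Estimated integral of L * prior over the band {lower < L <= upper}.
    band_constants.append(sum(v for v in values if lower < v <= upper) / n)

true_evidence = math.exp(-0.25) / math.sqrt(4.0 * math.pi)   # N(1; 0, 2)
print(band_constants, sum(band_constants), z_hat, true_evidence)
```

With prior draws alone the decomposition is a mere bookkeeping identity. The point of the SMC construction is to estimate each band constant with particles moved towards that band rather than with leftovers from the prior.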

When the sequence of likelihood bounds is chosen based on the observed likelihoods so far, the method becomes adaptive. Requiring again the choice of a stopping rule that may induce bias if stopping occurs too early. And then, in a twist that is not clearly explained in the paper, the focus moves to an improved nested sampler that moves one likelihood value at a time, with a particle step replacing a single particle. (Things get complicated when several particles may take the very same likelihood value, but randomisation helps.) At this stage the algorithm is quite similar to the original nested sampler. Except for the unbiased estimation of the constants, the final constant, and the replacement of exponential weights exp(-t/N) by powers of (N-1)/N.
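To get a feeling for how close the two weighting schemes are, the following sketch compares exp(-t/N) with ((N-1)/N)^t and simulates the shrinkage of the prior volume under the usual nested sampling assumption, namely that each step multiplies the volume by the maximum of N uniforms, a Beta(N,1) variate. The values of N and t are arbitrary choices for illustration.

```python
import math
import random

N = 50    # number of live points (illustrative)
t = 100   # number of nested sampling iterations (illustrative)

exp_weight = math.exp(-t / N)         # original nested sampling choice
power_weight = ((N - 1) / N) ** t     # choice arising in the SMC version

rng = random.Random(4)
reps = 2000
log_volumes = []
for _ in range(reps):
    log_x = 0.0
    for _ in range(t):
        # The maximum of N uniforms is Beta(N, 1), simulated as U ** (1 / N);
        # 1 - random() lies in (0, 1], which avoids log(0).
        log_x += math.log(1.0 - rng.random()) / N
    log_volumes.append(log_x)

mean_log_volume = sum(log_volumes) / reps   # close to -t / N
print(exp_weight, power_weight, mean_log_volume)
```

Since log(1-1/N) is slightly below -1/N, the power version is always the smaller of the two, with a gap that grows with t/N². On the log scale, the simulated volumes average out at -t/N, which is where the exponential weight comes from.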

The remainder of this long paper (61 pages!) is dedicated to practical implementation, calibration and running a series of comparisons. A nice final touch is the thanks to the ‘Og for its series of posts on nested sampling, which “helped influence this work, and played a large part in inspiring it.”

In conclusion, this paper is certainly a worthy exploration of the nested sampler, providing further arguments towards a consistent version, with first and foremost an (almost?) unbiased resolution. The comparison with a wide range of alternatives remains open, in particular time-wise, if evidence is the sole target of the simulation. For instance, the choice of this sequence of targets in an SMC may be improved by another sequence, since changing one particle at a time does not sound efficient. The complexity of the implementation, and in particular of the simulation from the prior under more and more stringent constraints, needs to be addressed.

atheism: a very [very] short introduction [book review]

Posted in Books on November 3, 2017 by xi'an

After the rather disappointing Edge of Reason, I gave a try to Baggini’s very brief introduction to atheism, which is indeed very short. And equally very disappointing. Rather than approaching the topic from an (academic) philosophical perspective, ex nihilo, and while defending himself from doing so, the author indeed adopts a rather militant tone in trying to justify the arguments and ethics of atheism, setting the approach solely in a defensive opposition to religions. That is, in reverse, as an answer to faiths and creeds. Even when his arguments make complete sense, e.g., in the lack of support for agnosticism against atheism, the link with inductive reasoning (and Hume), and the logical [and obvious] disconnection between morality and religious attitudes.

“…once we accept the inductive method, we should, to be consistent, also accept that it points toward a naturalism that supports atheism…” (p.27)

While he mentions “militant atheism” as a fundamentalist position to be avoided as much as the numerous religious versions, I find the whole exercise in this book missing the point of both an intellectual criticism of atheism [in the sense of Kant’s best seller!] and of the VSI series. Again, to define atheism as an answer to religions and to their irrationality is reducing the scope of this philosophical branch to a contrarian posture, rather than independently advancing a rationalist and scientific position on the entropic nature of life and the universe, one that does not require a purpose or a higher cause. And to try to show it provides better answers to the same questions as those addressed by religions stoops down to their level.

“So it is not the case that atheism follows merely from some shallow commitment to the primacy of scientific inquiry.” (p.77)

The link therein with a philosophical analysis seems so weak that I deem the essay rather belongs to journalosophy. The very short history of atheism and its embarrassed debate on the attributed connections between atheism and some modern era totalitarianisms [found in the last chapter] are an illustration of this divergence from scholarly work. That the author felt the need to include pictures to illustrate his points says it all!