Archive for evidence

new estimators of evidence

Posted in Books, Statistics with tags , , , , , , , , , , , , on June 19, 2018 by xi'an

In an incredible accumulation of coincidences, I came across yet another paper about evidence and the harmonic mean challenge, by Yu-Bo Wang, Ming-Hui Chen [same as in Chen, Shao, Ibrahim], Lynn Kuo, and Paul O. Lewis this time, published in Bayesian Analysis. (Disclaimer: I was not involved in the reviews of any of these papers!)  Authors who arelocated in Storrs, Connecticut, in geographic and thematic connection with the original Gelfand and Dey (1994) paper! (Private joke about the Old Man of Storr in above picture!)

“The working parameter space is essentially the constrained support considered by Robert and Wraith (2009) and Marin and Robert (2010).”

The central idea is to use a more general function than our HPD restricted prior but still with a known integral. Not in the sense of control variates, though. The function of choice is a weighted sum of indicators of terms of a finite partition, which implies a compact parameter set Ω. Or a form of HPD region, although it is unclear when the volume can be derived. While the consistency of the estimator of the inverse normalising constant [based on an MCMC sample] is unsurprising, the more advanced part of the paper is about finding the optimal sequence of weights, as in control variates. But it is also unsurprising in that the weights are proportional to the inverses of the inverse posteriors over the sets in the partition. Since these are hard to derive in practice, the authors come up with a fairly interesting alternative, which is to take the value of the posterior at an arbitrary point of the relevant set.

The paper also contains an extension replacing the weights with functions that are integrable and with known integrals. Which is hard for most choices, even though it contains the regular harmonic mean estimator as a special case. And should also suffer from the curse of dimension when the constraint to keep the target almost constant is implemented (as in Figure 1).

The method, when properly calibrated, does much better than harmonic mean (not a surprise) and than Petris and Tardella (2007) alternative, but no other technique, on toy problems like Normal, Normal mixture, and probit regression with three covariates (no Pima Indians this time!). As an aside I find it hard to understand how the regular harmonic mean estimator takes longer than this more advanced version, which should require more calibration. But I find it hard to see a general application of the principle, because the partition needs to be chosen in terms of the target. Embedded balls cannot work for every possible problem, even with ex-post standardisation.


unbiased consistent nested sampling via sequential Monte Carlo [a reply]

Posted in pictures, Statistics, Travel with tags , , , , , , , , on June 13, 2018 by xi'an

Rob Salomone sent me the following reply on my comments of yesterday about their recently arXived paper.

Our main goal in the paper was to show that Nested Sampling (when interpreted a certain way) is really just a member of a larger class of SMC algorithms, and exploring the consequences of that. We should point out that the section regarding calibration applies generally to SMC samplers, and hope that people give those techniques a try regardless of their chosen SMC approach.
Regarding your question about “whether or not it makes more sense to get completely SMC and forego any nested sampling flavour!”, this is an interesting point. After all, if Nested Sampling is just a special form of SMC, why not just use more standard SMC approaches? It seems that the Nested Sampling’s main advantage is its ability to cope with problems that have “phase transition’’ like behaviour, and thus is robust to a wider range of difficult problems than annealing approaches. Nevertheless, we hope this way of looking at NS (and showing that there may be variations of SMC with certain advantages) leads to improved NS and SMC methods down the line.  
Regarding your post, I should clarify a point regarding unbiasedness. The largest likelihood bound is actually set to infinity. Thus, for the fixed version of NS—SMC, one has an unbiased estimator of the “final” band. Choosing a final band prematurely will of course result in very high variance. However, the estimator is unbiased. For example, consider NS—SMC with only one strata. Then, the method reduces to simply using the prior as an importance sampling distribution for the posterior (unbiased, but often high variance).
Comments related to two specific parts of your post are below (your comments in italicised bold):
“Which never occurred as the number one difficulty there, as the simplest implementation runs a Markov chain from the last removed entry, independently from the remaining entries. Even stationarity is not an issue since I believe that the first occurrence within the level set is distributed from the constrained prior.”
This is an interesting point that we had not considered! In practice, and in many papers that apply Nested Sampling with MCMC, the common approach is to start the MCMC at one of the randomly selected “live points”, so the discussion related to independence was in regard to these common implementations.
Regarding starting the chain from outside of the level set. This is likely not done in practice as it introduces an additional difficulty of needing to propose a sample inside the required region (Metropolis–Hastings will have non—zero probability of returning a sample that is still outside the constrained region for any fixed number of iterations). Forcing the continuation of MCMC until a valid point is proposed I believe will be a subtle violation of detailed balance. Of course, the bias of such a modification may be small in practice, but it is an additional awkwardness introduced by the requirement of sample independence!
“And then, in a twist that is not clearly explained in the paper, the focus moves to an improved nested sampler that moves one likelihood value at a time, with a particle step replacing a single  particle. (Things get complicated when several particles may take the very same likelihood value, but randomisation helps.) At this stage the algorithm is quite similar to the original nested sampler. Except for the unbiased estimation of the constants, the  final constant, and the replacement of exponential weights exp(-t/N) by powers of (N-1/N)”
Thanks for pointing out that this isn’t clear, we will try to do better in the next revision! The goal of this part of the paper wasn’t necessarily to propose a new version of nested sampling. Our focus here was to demonstrate that NS–SMC is not simply the Nested Sampling idea with an SMC twist, but that the original NS algorithm with MCMC (and restarting the MCMC sampling at one of the “live points’” as people do in practice) actually is a special case of SMC (with the weights replaced with a suboptimal choice).
The most curious thing is that, as you note, the estimates of remaining prior mass in the SMC context come out as powers of (N-1)/N and not exp(-t/N). In the paper by Walter (2017), he shows that the former choice is actually superior in terms of bias and variance. It was a nice touch that the superior choice of weights came out naturally in the SMC interpretation! 
That said, as the fixed version of NS-SMC is the one with the unbiasedness and consistency properties, this was the version we used in the main statistical examples.

unbiased consistent nested sampling via sequential Monte Carlo

Posted in pictures, Statistics, Travel with tags , , , , , , , , on June 12, 2018 by xi'an

“Moreover, estimates of the marginal likelihood are unbiased.” (p.2)

Rob Salomone, Leah South, Chris Drovandi and Dirk Kroese (from QUT and UQ, Brisbane) recently arXived a paper that frames the nested sampling in such a way that marginal likelihoods can be unbiasedly (and consistently) estimated.

“Why isn’t nested sampling more popular with statisticians?” (p.7)

A most interesting question, especially given its popularity in cosmology and other branches of physics. A first drawback pointed out in the c is the requirement of independence between the elements of the sample produced at each iteration. Which never occurred as the number one difficulty there, as the simplest implementation runs a Markov chain from the last removed entry, independently from the remaining entries. Even stationarity is not an issue since I believe that the first occurrence within the level set is distributed from the constrained prior.

A second difficulty is the use of quadrature which turns integrand into step functions at random slices. Indeed, mixing Monte Carlo with numerical integration makes life much harder, as shown by the early avatars of nested sampling that only accounted for the numerical errors. (And which caused Nicolas and I to write our critical paper in Biometrika.) There are few studies of that kind in the literature, the only one I can think of being [my former PhD student] Anne Philippe‘s thesis twenty years ago.

The third issue stands with the difficulty in parallelising the method. Except by jumping k points at once, rather than going one level at a time. While I agree this makes life more complicated, I am also unsure about the severity of that issue as k nested sampling algorithms can be run in parallel and aggregated in the end, from simple averaging to something more elaborate.

The final blemish is that the nested sampling estimator has a stopping mechanism that induces a truncation error, again maybe a lesser problem given the overall difficulty in assessing the total error.

The paper takes advantage of the ability of SMC to produce unbiased estimates of a sequence of normalising constants (or of the normalising constants of a sequence of targets). For nested sampling, the sequence is made of the prior distribution restricted to an embedded sequence of level sets. With another sequence restricted to bands (likelihood between two likelihood boundaries). If all restricted posteriors of the second kind and their normalising constant are known, the full posterior is known. Apparently up to the main normalising constant, i.e. the marginal likelihood., , except that it is also the sum of all normalising constants. Handling this sequence by SMC addresses the four concerns of the four authors, apart from the truncation issue, since the largest likelihood bound need be set for running the algorithm.

When the sequence of likelihood bounds is chosen based on the observed likelihoods so far, the method becomes adaptive. Requiring again the choice of a stopping rule that may induce bias if stopping occurs too early. And then, in a twist that is not clearly explained in the paper, the focus moves to an improved nested sampler that moves one likelihood value at a time, with a particle step replacing a single particle. (Things get complicated when several particles may take the very same likelihood value, but randomisation helps.) At this stage the algorithm is quite similar to the original nested sampler. Except for the unbiased estimation of the constants, the final constant, and the replacement of exponential weights exp(-t/N) by powers of (N-1/N).

The remainder of this long paper (61 pages!) is dedicated to practical implementation, calibration and running a series of comparisons. A nice final touch is the thanks to the ‘Og for its series of posts on nested sampling, which “helped influence this work, and played a large part in inspiring it.”

In conclusion, this paper is certainly a worthy exploration of the nested sampler, providing further arguments towards a consistent version, with first and foremost an (almost?) unbiased resolution. The comparison with a wide range of alternatives remains open, in particular time-wise, if evidence is the sole target of the simulation. For instance, the choice of this sequence of targets in an SMC may be improved by another sequence, since changing one particle at a time does not sound efficient. The complexity of the implementation and in particular of the simulation from the prior under more and more stringent constraints need to be addressed.

atheism: a very [very] short introduction [book review]

Posted in Books with tags , , , , , , , , , , , , , , , on November 3, 2017 by xi'an

After the rather disappointing Edge of Reason, I gave a try at Baggini’s very brief introduction to atheism, which is very short. And equally very disappointing. Rather than approaching the topic from a (academic) philosophical perspective, ex nihilo,  and while defending himself from doing so, the author indeed adopts a rather militant tone in trying to justify the arguments and ethics of atheism, setting the approach solely in a defensive opposition to religions. That is, in reverse, as an answer to faiths and creeds. Even when his arguments make complete sense, e.g., in the lack of support for agnosticism against atheism, the link with inductive reasoning (and Hume), and the logical [and obvious] disconnection between morality and religious attitudes.

“…once we accept the inductive method, we should, to be consistent, also accept that it points toward a naturalism that supports atheism…” (p.27)

While he mentions “militant atheism” as a fundamentalist position to be as avoided as the numerous religious versions, I find the whole exercise in this book missing the point of both an intellectual criticism of atheism [in the sense of Kant’s best seller!] and of the VSI series. Again, to define atheism as an answer to religions and to their irrationality is reducing the scope of this philosophical branch to a contrarian posture, rather than independently advancing a rationalist and scientific position on the entropic nature of life and the universe, one that does not require for a purpose or a higher cause. And to try to show it provides better answers to the same questions as those addressed by religions stoops down to their level.

“So it is not the case that atheism follows merely from some shallow commitment to the primacy of scientific inquiry.” (p.77)

The link therein with a philosophical analysis seems so weak that I deem the essay rather belongs to journalosophy. The very short history of atheism and its embarrassed debate on the attributed connections between atheism and some modern era totalitarianisms [found in the last chapter] are an illustration of this divergence from scholarly work. That the author felt the need to include pictures to illustrate his points says it all!

WBIC, practically

Posted in Statistics with tags , , , , , , , , , on October 20, 2017 by xi'an

“Thus far, WBIC has received no more than a cursory mention by Gelman et al. (2013)”

I had missed this 2015  paper by Nial Friel and co-authors on a practical investigation of Watanabe’s WBIC. Where WBIC stands for widely applicable Bayesian information criterion. The thermodynamic integration approach explored by Nial and some co-authors for the approximation of the evidence, thermodynamic integration that produces the log-evidence as an integral between temperatures t=0 and t=1 of a powered evidence, is eminently suited for WBIC, as the widely applicable Bayesian information criterion is associated with the specific temperature t⁰ that makes the power posterior equidistant, Kullback-Leibler-wise, from the prior and posterior distributions. And the expectation of the log-likelihood under this very power posterior equal to the (genuine) evidence. In fact, WBIC is often associated with the sub-optimal temperature 1/log(n), where n is the (effective?) sample size. (By comparison, if my minimalist description is unclear!, thermodynamic integration requires a whole range of temperatures and associated MCMC runs.) In an ideal Gaussian setting, WBIC improves considerably over thermodynamic integration, the larger the sample the better. In more realistic settings, though, including a simple regression and a logistic [Pima Indians!] model comparison, thermodynamic integration may do better for a given computational cost although the paper is unclear about these costs. The paper also runs a comparison with harmonic mean and nested sampling approximations. Since the integral of interest involves a power of the likelihood, I wonder if a safe version of the harmonic mean resolution can be derived from simulations of the genuine posterior. Provided the exact temperature t⁰ is known…

marginal likelihoods from MCMC

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , , , , on April 26, 2017 by xi'an

A new arXiv entry on ways to approximate marginal likelihoods based on MCMC output, by astronomers (apparently). With an application to the 2015 Planck satellite analysis of cosmic microwave background radiation data, which reminded me of our joint work with the cosmologists of the Paris Institut d’Astrophysique ten years ago. In the literature review, the authors miss several surveys on the approximation of those marginals, including our San Antonio chapter, on Bayes factors approximations, but mention our ABC survey somewhat inappropriately since it is not advocating the use of ABC for such a purpose. (They mention as well variational Bayes approximations, INLA, powered likelihoods, if not nested sampling.)

The proposal of this paper is to identify the marginal m [actually denoted a there] as the normalising constant of an unnormalised posterior density. And to do so the authors estimate the posterior by a non-parametric approach, namely a k-nearest-neighbour estimate. With the additional twist of producing a sort of Bayesian posterior on the constant m. [And the unusual notion of number density, used for the unnormalised posterior.] The Bayesian estimation of m relies on a Poisson sampling assumption on the k-nearest neighbour distribution. (Sort of, since k is actually fixed, not random.)

If the above sounds confusing and imprecise it is because I am myself rather mystified by the whole approach and find it difficult to see the point in this alternative. The Bayesian numerics does not seem to have other purposes than producing a MAP estimate. And using a non-parametric density estimate opens a Pandora box of difficulties, the most obvious one being the curse of dimension(ality). This reminded me of the commented paper of Delyon and Portier where they achieve super-efficient convergence when using a kernel estimator, but with a considerable cost and a similar sensitivity to dimension.

round-table on Bayes[ian[ism]]

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , , , , , , , on March 7, 2017 by xi'an

In a [sort of] coincidence, shortly after writing my review on Le bayésianisme aujourd’hui, I got invited by the book editor, Isabelle Drouet, to take part in a round-table on Bayesianism in La Sorbonne. Which constituted the first seminar in the monthly series of the séminaire “Probabilités, Décision, Incertitude”. Invitation that I accepted and honoured by taking place in this public debate (if not dispute) on all [or most] things Bayes. Along with Paul Egré (CNRS, Institut Jean Nicod) and Pascal Pernot (CNRS, Laboratoire de chimie physique). And without a neuroscientist, who could not or would not attend.

While nothing earthshaking came out of the seminar, and certainly not from me!, it was interesting to hear of the perspectives of my philosophy+psychology and chemistry colleagues, the former explaining his path from classical to Bayesian testing—while mentioning trying to read the book Statistical rethinking reviewed a few months ago—and the later the difficulty to teach both colleagues and students the need for an assessment of uncertainty in measurements. And alluding to GUM, developed by the Bureau International des Poids et Mesures I visited last year. I tried to present my relativity viewpoints on the [relative] nature of the prior, to avoid the usual morass of debates on the nature and subjectivity of the prior, tried to explain Bayesian posteriors via ABC, mentioned examples from The Theorem that Would not Die, yet untranslated into French, and expressed reserves about the glorious future of Bayesian statistics as we know it. This seminar was fairly enjoyable, with none of the stress induced by the constraints of a radio-show. Just too bad it did not attract a wider audience!