Archive for nested sampling

seeking the error in nested sampling

Posted in pictures, Statistics, Travel on April 13, 2017 by xi'an

A newly arXived paper on the error in nested sampling, written by Higson and co-authors, and read in Berlin, looks at the difficult task of evaluating the sampling error of nested sampling. The conclusion is essentially negative in that the authors recommend multiple runs of the method to assess the magnitude of the variability of the output by bootstrap, i.e. to call for the most empirical approach…

The core of this difficulty lies in the half-plug-in, half-quadrature, half-Monte Carlo (!) feature of the nested sampling algorithm, in that (i) the truncation of the unit interval is based on an expectation of the mass of each shell (i.e., the zone between two consecutive isoclines of the likelihood), (ii) the evidence estimator is a quadrature formula, and (iii) the level of the likelihood at the truncation is replaced with a simulated value that is not even unbiased (and is correlated with the previous value in the case of an MCMC implementation). As discussed in our paper with Nicolas, the error in the evidence approximation is of the same order as for other Monte Carlo methods, in that it decreases like the inverse square root of the number of points used at each iteration, contrary to earlier intuitions that focussed on the error due to the quadrature.
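To make the three ingredients concrete, here is a minimal, self-contained sketch of the vanilla algorithm on a toy problem of my own (a bivariate Gaussian likelihood under a uniform prior, not an example from the paper), with the constrained-prior simulation done by naive rejection rather than by the MCMC moves used in practice:

```python
import numpy as np

rng = np.random.default_rng(1)

def toy_nested_sampling(log_lik, prior_sample, n_live=100, n_iter=500):
    """Minimal nested sampling sketch: (i) the prior-mass shrinkage is replaced
    by its expectation, X_i = exp(-i/n_live); (ii) the evidence is the quadrature
    Z ~ sum_i L_i (X_{i-1} - X_i); (iii) the constrained-prior draw is done by
    naive rejection here (an MCMC move in realistic implementations)."""
    live = prior_sample(n_live)
    live_ll = np.array([log_lik(x) for x in live])
    log_Z, log_X_prev = -np.inf, 0.0
    for i in range(1, n_iter + 1):
        worst = np.argmin(live_ll)                    # lowest likelihood level L_i
        log_L, log_X = live_ll[worst], -i / n_live
        log_w = log_X_prev + np.log1p(-np.exp(log_X - log_X_prev))  # X_{i-1} - X_i
        log_Z = np.logaddexp(log_Z, log_L + log_w)
        while True:                                   # replace the dead point by a
            x = prior_sample(1)[0]                    # prior draw with L > L_i
            ll = log_lik(x)
            if ll > log_L:
                break
        live[worst], live_ll[worst] = x, ll
        log_X_prev = log_X
    # crude top-off with the mass left in the final live set
    log_L_bar = np.log(np.mean(np.exp(live_ll - live_ll.max()))) + live_ll.max()
    return np.logaddexp(log_Z, log_L_bar + log_X_prev)

# toy problem: N(0, I_2) likelihood, uniform prior on [-5, 5]^2, so Z ~ 1/100
log_lik = lambda x: -0.5 * x @ x - np.log(2 * np.pi)
prior_sample = lambda n: rng.uniform(-5.0, 5.0, size=(n, 2))
print(toy_nested_sampling(log_lik, prior_sample), np.log(1 / 100))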

But the situation is much less understood when the resulting sample is used to estimate quantities related to the posterior distribution, with no clear approach to assess, and even less to correct, the resulting error, since it is not solely a Monte Carlo error. As noted by the authors, the quadrature approximation to the univariate integral replaces the unknown prior weight of a shell with its Beta order-statistic expectation, and the average of the likelihood over the shell with a single (uniform???) realisation, or the mean value of a transform of the parameter with a single (biased) realisation, since most posterior expectations can be represented as integrals over likelihood levels of the average value over an iso-likelihood contour. The approach advocated in the paper involves multiple threads of an “unwoven nested sampling run”, which means launching n single-live-point nested sampling runs, one for each of the n current live points in the nested sample. (Those threads may then later be recombined into a single nested sample.) This is the starting point for a nested flavour of bootstrapping, where threads are sampled with replacement, from which confidence intervals and error estimates can be constructed. (The original notion appears in Skilling’s 2006 paper, but I had missed it.)
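Here is a rough sketch of the thread-plus-bootstrap idea in the simplest possible setting, where the threads are generated directly as independent single-live-point runs on the same toy Gaussian-likelihood, uniform-prior problem as above; the thread construction and the crude top-off term are my simplifications, not the paper's exact scheme:

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(2)

# toy problem: N(0, I_2) likelihood, uniform prior on [-5, 5]^2, so Z ~ 1/100
log_lik = lambda x: -0.5 * x @ x - np.log(2 * np.pi)
prior_sample = lambda: rng.uniform(-5.0, 5.0, size=2)
log_L_stop = log_lik(np.array([0.3, 0.0]))     # common stopping level for all threads

def single_thread():
    """One single-live-point run ('thread'): record dead-point likelihood levels
    until the live point passes the common stopping level."""
    x = prior_sample()
    ll, dead = log_lik(x), []
    while ll < log_L_stop:
        dead.append(ll)
        while True:                            # naive constrained-prior draw
            y = prior_sample()
            ll_new = log_lik(y)
            if ll_new > ll:
                break
        ll = ll_new
    return np.array(dead)

def combine(threads):
    """Recombine threads into one nested sampling estimate: with k threads the
    merged dead points get prior masses X_i = exp(-i/k)."""
    k = len(threads)
    log_L = np.sort(np.concatenate(threads))
    log_w = -np.arange(log_L.size) / k + np.log1p(-np.exp(-1.0 / k))  # X_i - X_{i+1}
    log_Z = logsumexp(log_L + log_w)
    return np.logaddexp(log_Z, log_L_stop - log_L.size / k)  # crude top-off term

threads = [single_thread() for _ in range(50)]
# bootstrap: resample whole threads with replacement, recombine, read off quantiles
boot = [combine([threads[j] for j in rng.integers(0, 50, 50)]) for _ in range(200)]
print(combine(threads), np.percentile(boot, [2.5, 97.5]))    # true log Z ~ -4.6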

The above graphic is an attempt within the paper at representing the (marginal) posterior of a transform f(θ), which I do not fully understand… The notations are rather horrendous, as X is not the data but the prior probability for the likelihood to be above a given bound, which is actually the corresponding quantile. (There is no symbol for the data and £ is used for the likelihood function as well as for realisations of the likelihood function…) A vertical slice on the central panel gives the posterior distribution of f(θ) given the event that the likelihood is in the corresponding upper tail, or given the corresponding shell (?).

nested sampling for phylogenies

Posted in Statistics on March 3, 2017 by xi'an

“Methods to estimate the marginal likelihood should be sensitive to the prior choice. Non-informative priors should increase the contribution of low-likelihood regions of parameter space in the estimated marginal likelihood. Consequently, the prior choice should affect the estimated evidence.”

In a most recent arXival, Maturana, Brewer, and Klaere discuss the appeal of nested sampling for conducting model choice in phylogenetic models, in comparison with the “generalized steppingstone sampling” method, which represents the evidence as a product of ratios of evidences (Fan et al., 2011), and which I do not think I have previously met, with all references provided therein relating to Bayesian phylogenetics, apparently. The steppingstone approach relies on a sequence of tempered targets, moving from a reference distribution to the real target as a temperature β goes from zero to one. (The paper also mentions thermodynamic integration as too costly.) Nested sampling—much discussed on this blog!—is presented in this paper as having the ability to deal with partly convex likelihoods, although I do not really get how or why. (There is nothing new in the fairly pedagogical presentation of nested sampling therein.) Nothing appears to be mentioned about the difficulty of handling multimodality, as isolated high-likelihood regions are unlikely to be sampled under poorly weighted priors (by which I mean that a region with significant likelihood mass is unlikely to get sampled if the prior distribution gives it little weight). The novelty in the paper is to compare nested sampling with generalized steppingstone sampling and path sampling on several phylogenetic examples. I did not spot any computing times mentioned there. As usual with examples, my reservation is that the conclusions drawn from one particular analysis of one (or rather three) particular example(s) do not convey a general message about the power and generality of a method. Even though I acknowledge that nested sampling is in principle operational in wide generality.
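Since the steppingstone device may be unfamiliar, here is a minimal sketch of the plain (non-generalised) estimator, where each ratio of evidences is an importance-sampling average under the previous tempered target; the generalised version of Fan et al. (2011) replaces the prior with a closer reference distribution. The conjugate normal toy below, with exact draws from the tempered targets, is my own choice rather than anything from the paper:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(3)

# toy conjugate model: y_i ~ N(theta, 1), i = 1..n, with prior theta ~ N(0, 1)
n_obs = 20
y = rng.normal(0.5, 1.0, n_obs)

def log_lik(theta):
    # log likelihood evaluated at an array of theta values
    return norm.logpdf(y[:, None], loc=theta, scale=1.0).sum(axis=0)

def tempered_sample(beta, size):
    """Exact draw from the tempered target  prior x likelihood^beta,
    available in closed form for this conjugate toy (MCMC in general)."""
    prec = 1.0 + beta * n_obs
    return rng.normal(beta * y.sum() / prec, 1.0 / np.sqrt(prec), size)

# steppingstone: log Z = sum_k log E_{beta_k}[ L^(beta_{k+1} - beta_k) ]
betas = np.linspace(0.0, 1.0, 33) ** 3        # temperatures piled up near zero
log_Z = 0.0
for b0, b1 in zip(betas[:-1], betas[1:]):
    ll = log_lik(tempered_sample(b0, 1000))
    log_Z += np.log(np.mean(np.exp((b1 - b0) * (ll - ll.max())))) + (b1 - b0) * ll.max()

# exact evidence for the toy: marginally y ~ N(0, I + 11')
exact = multivariate_normal.logpdf(y, np.zeros(n_obs), np.eye(n_obs) + 1.0)
print(log_Z, exact)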

Bayesian methods in cosmology

Posted in Statistics on January 18, 2017 by xi'an

A rather massive document was arXived a few days ago by Roberto Trotta on Bayesian methods for cosmology, in conjunction with an earlier winter school, the 44th Saas Fee Advanced Course on Astronomy and Astrophysics, “Cosmology with wide-field surveys”. While I never had the opportunity to give a winter school in Saas Fee, I will give next month a course on ABC to statistics graduates in another Swiss dream location, Les Diablerets.  And next Fall a course on ABC again but to astronomers and cosmologists, in Autrans, near Grenoble.

The course document is an 80-page introduction to probability and statistics, in particular Bayesian inference and Bayesian model choice, including exercises and references. As such, it is rather standard, in that the material could be found as well in textbooks. Statistics textbooks.

When introducing the Bayesian perspective, Roberto Trotta advances several arguments in favour of this approach. The first one is that it is generally easier to follow a Bayesian approach than to seek a non-Bayesian one that recovers long-term properties. (Although there are inconsistent Bayesian settings.) The second one is that Bayesian modelling naturally handles nuisance parameters, because there are essentially no nuisance parameters. (Even though preventing small-world modelling may lead to difficulties, as in the Robbins-Wasserman paradox.) The remaining two reasons are the incorporation of prior information and the appeal of conditioning on the actual data.

The document also includes a nice illustration of the concentration of measure as the dimension of the parameter increases. (Although one should not over-interpret it: the concentration does not occur in the same way for a normal distribution, for instance.) It further spends quite some space on the Bayes factor, its scaling as a natural Occam’s razor, and the comparison with p-values, before (unsurprisingly) introducing nested sampling. And the Savage-Dickey ratio. The conclusion of this model choice section proposes some open problems, with a rather unorthodox—in the Bayesian sense—line on the justification of priors and the notion of a “correct” prior (yeech!), plus a musing about adopting a loss function, with which I quite agree.
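As a side illustration of the generic phenomenon (my own quick simulation, not Trotta's figure): the norm of a d-dimensional standard normal vector concentrates around √d with a spread of order one, so that most of the mass sits on a thin shell away from the mode:

```python
import numpy as np

rng = np.random.default_rng(4)
for d in (1, 10, 100, 1000):
    r = np.linalg.norm(rng.standard_normal((10_000, d)), axis=1)
    # the mean radius grows like sqrt(d) while the spread stays of order one
    print(d, round(r.mean(), 2), round(r.std(), 2))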

pitfalls of nested Monte Carlo

Posted in Books, pictures, Statistics, University life on December 19, 2016 by xi'an

Cockatoo Island, Sydney Harbour, July 15, 2012

A few days ago, Tom Rainforth, Robert Cornish, Hongseok Yang, and Frank Wood from Oxford arXived a paper on the limitations of nested Monte Carlo. By nested Monte Carlo [not nested sampling], they mean Monte Carlo techniques used to evaluate the expectation of a non-linear transform of an expectation, which often calls for a plug-in resolution. The main result is that this expectation cannot be evaluated by an unbiased estimator, which is only mildly surprising. I do wonder if there still exist series solutions à la Glynn and Rhee, as in the Russian roulette version, which is mentioned in a footnote. Or specially tuned versions, as suggested by some techniques found in Devroye’s book, where the expectation of the exponential of another expectation is considered… (The paper is quite short, which may be correlated with the format imposed by some machine-learning conference proceedings like AISTATS.)
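A quick numerical illustration of the bias, on a toy of my own rather than one from the paper: with X ~ N(0,1), Y|X ~ N(X,1) and the non-linear transform g = exp, the plug-in nested estimator of E[exp(E[Y|X])] = exp(1/2) is biased for any finite inner sample size:

```python
import numpy as np

rng = np.random.default_rng(5)

def nested_mc(n_outer, n_inner):
    """Plug-in nested Monte Carlo estimate of E_X[ exp(E[Y | X]) ]
    with X ~ N(0,1) and Y | X ~ N(X,1); the exact value is exp(1/2)."""
    x = rng.standard_normal(n_outer)
    y = x[:, None] + rng.standard_normal((n_outer, n_inner))
    return np.exp(y.mean(axis=1)).mean()          # g = exp applied to inner means

print(round(np.exp(0.5), 3))                      # target, ~ 1.649
for m in (1, 10, 100):
    est = np.mean([nested_mc(2000, m) for _ in range(100)])
    # the bias, exp(1/2 + 1/(2m)) - exp(1/2), only vanishes as m grows
    print(m, round(est, 3))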

warp-U bridge sampling

Posted in Books, Statistics, Travel, University life on October 12, 2016 by xi'an

[I wrote this set of comments right after MCqMC 2016 on a preliminary version of the paper, so mileage may vary in terms of how well they match the current version!]

In warp-U bridge sampling, newly arXived and first presented at MCqMC 2016, Xiao-Li Meng continues (in collaboration with Lazhi Wang) his exploration of bridge sampling techniques towards improving the estimation of normalising constants and of ratios thereof. The bridge sampling estimator of Meng and Wong (1996) is a harmonic mean importance sampler that requires iterations, as it depends on the ratio of interest. Given that the normalising constant of a density does not depend on the chosen parameterisation, in the sense that the Jacobian transform preserves this constant, a degree of freedom lies in the choice of the parameterisation. This is the idea behind warp transformations. The initial version of Meng and Schilling (2002) used location-scale transforms, while the warp-U solution goes for a multiple location-scale transform that can be seen as based on a location-scale mixture representation of the target, with K components. This approach can also be seen as a sort of artificial reversible jump algorithm when one model is fully known, a strategy Nicolas and I also proposed in our nested sampling Biometrika paper.
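As a reminder of the baseline that the warp transforms feed into, here is a sketch of the iterative Meng and Wong (1996) bridge estimator of a ratio of normalising constants; the toy pair of Gaussian densities is my own choice:

```python
import numpy as np

rng = np.random.default_rng(6)

def bridge_ratio(x1, x2, log_q1, log_q2, n_iter=50):
    """Iterative Meng & Wong (1996) bridge sampling estimate of r = c1/c2,
    with q1, q2 unnormalised densities and x1 ~ q1/c1, x2 ~ q2/c2."""
    n1, n2 = len(x1), len(x2)
    s1, s2 = n1 / (n1 + n2), n2 / (n1 + n2)
    l1 = np.exp(log_q1(x1) - log_q2(x1))     # q1/q2 at draws from the first density
    l2 = np.exp(log_q1(x2) - log_q2(x2))     # q1/q2 at draws from the second density
    r = 1.0
    for _ in range(n_iter):                  # fixed-point iteration in the ratio
        num = np.mean(l2 / (s1 * l2 + s2 * r))
        den = np.mean(1.0 / (s1 * l1 + s2 * r))
        r = num / den
    return r

# toy check: q1 proportional to a N(0,1) density, q2 to a N(1, 2^2) one, so c1/c2 = 1/2
log_q1 = lambda x: -0.5 * x ** 2
log_q2 = lambda x: -0.5 * ((x - 1.0) / 2.0) ** 2
x1 = rng.normal(0.0, 1.0, 10_000)
x2 = rng.normal(1.0, 2.0, 10_000)
print(bridge_ratio(x1, x2, log_q1, log_q2))  # should be close to 0.5
```

The warp transforms act upstream of this iteration: by pushing both samples towards a common standard form, they increase the overlap between the two densities and hence the precision of the resulting estimate.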

Once such a mixture approximation is obtained, each and every component of the mixture can be turned into the standard version of the location-scale family by the appropriate location-scale transform. Since the component index k is unknown for a given X, they call this transform a random transform, which I find somewhat more confusing than helpful. The conditional distribution of the index given the observable x is well known for mixtures and it is used here to weight the component-wise location-scale transforms of the original distribution p into something that looks rather similar to the standard version of the location-scale family, provided no mode has been forgotten by the mixture. The simulations from the original p are then rescaled by one of those transforms, whose index k is picked according to the conditional distribution. As explained later to me by XL, the random[ness] in the picture is due to the inclusion of a random ± sign. Still, in the notation introduced in (13), I do not get how the distribution Þ [sorry for using different symbols, I cannot render a tilde on a p] is defined, since both ψ and W are random. Is it the marginal? In which case it would read as a weighted average of rescaled versions of p. I have the same problem with Theorem 1, in that I do not understand how one equates Þ with the joint distribution.
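My (possibly simplistic) reading of the construction, as a sketch on an invented one-dimensional toy: fit a normal mixture to the target (the fit is assumed given below rather than EM-estimated), draw the component index from its conditional distribution given x, standardise with that component's location and scale, and attach the random ± sign:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)

# bimodal 'target' draws, and a 2-component normal mixture playing the role of the
# fitted approximation (weights/means/sds assumed known instead of EM-estimated)
x = np.concatenate([rng.normal(-3.0, 1.0, 5000), rng.normal(4.0, 0.5, 5000)])
w, mu, sd = np.array([0.5, 0.5]), np.array([-3.0, 4.0]), np.array([1.0, 0.5])

# conditional distribution of the component index given x (mixture responsibilities)
logp = np.log(w) + norm.logpdf(x[:, None], mu, sd)
resp = np.exp(logp - logp.max(axis=1, keepdims=True))
resp /= resp.sum(axis=1, keepdims=True)

# 'random transform': draw k ~ p(k | x), standardise with component k's location
# and scale, and attach the random sign mentioned by XL
k = (rng.random(len(x))[:, None] > np.cumsum(resp, axis=1)).sum(axis=1)
sign = rng.choice([-1.0, 1.0], size=len(x))
z = sign * (x - mu[k]) / sd[k]

# if the mixture fits, the warped draws look roughly standard normal
print(round(z.mean(), 3), round(z.std(), 3))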

Equation (21) is much more illuminating (I find) than the previous explanation, in that it exposes the fact that the principle is one of aiming at a new distribution for both the target and the importance function, with the hope that the fit will get better. It might have been better to avoid the notion of a random transform, then, but this is mostly a matter of conveying the notion.

On more specific points (or minutiae), the unboundedness of the likelihood is rarely if ever a problem when using EM. An alternative to the multiple-start EM proposal would then be to estimate the mixture sequentially, only adding a component when it seems worth it; see e.g. Chopin and Pelgrin (2004) and Chopin (2007). This could also help with the bias mentioned therein, since only a (tiny?) fraction of the data would be used. And the number of components K has an impact on the accuracy of the approximation, as in not missing a mode, and on the computing time. However, my suggestion would be to avoid estimating K, as this must be immensely costly.

Section 6 obviously relates to my folded Markov interests. If I understand correctly, the paper argues that the transformed density Þ does not need to be computed when considering the folding-move-unfolding step as a single step rather than three steps. I fear the description between equations (30) and (31) is missing the move step over the transformed space. Also, on a personal basis, I still do not see how to add this approach to our folding methodology, even though the different transforms act as so many replicas of the original Markov chain.

Savage-Dickey supermodels

Posted in Books, Mountains, pictures, Statistics, Travel, University life on September 13, 2016 by xi'an

Combination picture showing Aymara indigenous women (L-R) Domitila Alana, 42, Bertha Vedia, 48, Lidia Huayllas, 48, and Dora Magueno, 50, posing for a photograph at the Huayna Potosi mountain, Bolivia, April 6, 2016. (c) REUTERS/David Mercado

A. Mootoovaloo, B. Bassett, and M. Kunz just arXived a paper on the computation of Bayes factors by the Savage-Dickey representation through a supermodel (or encompassing model). (I wonder why Savage-Dickey is so popular in astronomy and cosmology statistical papers and not so much elsewhere.) Recall that the trick is to write the Bayes factor of the embedded null model against the encompassing model as the ratio of the posterior and of the prior for the tested parameter (thus eliminating nuisance or common parameters) at its null value,

B⁰¹ = π(φ⁰|x) / π(φ⁰).

Modulo some continuity constraints on the prior density, and the assumption that the conditional prior on the nuisance parameters is the same under the null model and under the encompassing model [given the null value φ⁰]. If this sounds confusing or even shocking from a mathematical perspective, check the numerous previous entries on this topic on the ‘Og!
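For concreteness, here is the usual way the Savage-Dickey ratio is turned into a number from posterior simulations, namely a density estimate of the posterior at φ⁰ divided by the prior density at φ⁰, on a conjugate normal toy of my own where the exact answer is available; this is the standard recipe rather than the supermodel construction of the paper:

```python
import numpy as np
from scipy.stats import norm, gaussian_kde

rng = np.random.default_rng(8)

# toy: y_i ~ N(phi, 1), prior phi ~ N(0, 1), point null phi0 = 0
n, phi0 = 15, 0.0
y = rng.normal(0.3, 1.0, n)

# Savage-Dickey estimate: density estimate of the posterior at phi0 over the prior
# density at phi0 (posterior draws are exact in this conjugate toy, MCMC in general)
post_mean, post_sd = n * y.mean() / (n + 1), 1.0 / np.sqrt(n + 1)
draws = rng.normal(post_mean, post_sd, 50_000)
B01_hat = gaussian_kde(draws)(phi0)[0] / norm.pdf(phi0, 0.0, 1.0)

# closed-form value of the posterior-to-prior ratio for the same toy
B01 = norm.pdf(phi0, post_mean, post_sd) / norm.pdf(phi0, 0.0, 1.0)
print(round(B01_hat, 3), round(B01, 3))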

The supermodel created by the authors is a mixture of the original models, as in our paper, and… hold the presses!, it is a mixture of the likelihood functions, as in Phil O’Neill’s and Theodore Kypraios’ paper. Which is not mentioned in the current paper and should obviously be. In the current representation, the posterior distribution on the mixture weight α is a linear function of α involving both evidences, α(m¹-m²)+m², times the artificial prior on α. The resulting estimator of the Bayes factor thus shares features with bridge sampling, reversible jump, and the importance sampling version of nested sampling we developed in our Biometrika paper. In addition to O’Neill and Kypraios’s solution.

The following quote is inaccurate, since in realistic settings the MCMC algorithm needs to simulate the parameters of the compared models, hence to represent the multidimensional integrals by Monte Carlo versions.

“Though we have a clever way of avoiding multidimensional integrals to calculate the Bayesian Evidence, this new method requires very efficient sampling and for a small number of dimensions is not faster than individual nested sampling runs.”

I actually wonder at the sheer rationale of running an intensive MCMC sampler in such a setting, when the weight α is completely artificial. It is only used to jump from one model to the next, which sounds quite inefficient when compared with simulating from both models separately and independently. This approach can also be seen as a special case of Carlin and Chib’s (1995) alternative to reversible jump. Using the Savage-Dickey representation instead is of course infeasible, which makes the overall reference to this method rather inappropriate in my opinion. Further, the examples processed in the paper all involve naturally embedded models where the original Savage-Dickey approach applies, so creating an additional model to apply a pseudo-Savage-Dickey representation does not sound very compelling…

Incidentally, the paper also includes a discussion of a weird notion, the likelihood of the Bayes factor B¹², which is plotted as a distribution in B¹², most strangely. The only other place I have met this notion is in Murray Aitkin’s book. Something is unclear there, or in my head!

“One of the fundamental choices when using the supermodel approach is how to deal with common parameters to the two models.”

This is an interesting question, although maybe not so relevant for the Bayes factor issue where it should not matter. However, as in our paper, multiplying the number of parameters in the encompassing model may hinder convergence of the MCMC chain or reduce the precision of the approximation of the Bayes factor. Again, from a Bayes factor perspective, this does not matter [while it does in our perspective].

ABC by subset simulation

Posted in Books, Statistics, Travel on August 25, 2016 by xi'an

Last week, Vakilzadeh, Beck and Abrahamsson arXived a paper entitled “Using Approximate Bayesian Computation by Subset Simulation for Efficient Posterior Assessment of Dynamic State-Space Model Classes”. It follows an earlier paper by Beck and co-authors on ABC by subset simulation, a paper that I did not read. The model of interest is a hidden Markov model with continuous components and covariates (inputs), e.g. a stochastic volatility model. There is however a catch in the definition of the model, namely that the observable part of the HMM includes an extra measurement-error term linked with the tolerance level of the ABC algorithm, an error term that is dependent across time, the vector of errors lying within a ball of radius ε. This reminds me of noisy ABC, obviously (and as acknowledged by the authors), but also of some ABC developments of Ajay Jasra and co-authors. Indeed, as in those papers, Vakilzadeh et al. use the raw data sequence to compute their tolerance neighbourhoods, which obviously bypasses the selection of a summary statistic [vector] but may also drown the signal in noise for long enough series.

“In this study, we show that formulating a dynamical system as a general hierarchical state-space model enables us to independently estimate the model evidence for each model class.”

Subset simulation is a nested technique that produces a sequence of nested balls (and related tolerances) such that the conditional probability of being in the next ball given the previous one remains large enough, requiring a new round of simulation each time. This is somewhat reminiscent of nested sampling, even though the two methods differ. For subset simulation, estimating the level probabilities means that there also exists a converging (and even unbiased!) estimator of the evidence associated with different tolerance levels. Which is not a particularly natural object, unless one wants to turn it into a tolerance selection principle, which would be quite a novel perspective, though not one adopted in the paper, seemingly. Given that the application section truly compares models, I must have missed something there. (Blame the long flight from San Francisco to Sydney!) Interestingly, the different models in Table 4 relate to different tolerance levels, which may be a hindrance for the overall validation of the method.
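To fix ideas on how subset simulation produces these level-probability (and hence evidence-like) estimates, here is a stripped-down sketch on a one-parameter ABC toy of my own making (normal prior, normal pseudo-data, absolute distance), with a simplified repopulation step rather than the standard chain bookkeeping of the method:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(9)

# toy ABC setting: theta ~ N(0,1), pseudo-data z | theta ~ N(theta, 1), y_obs = 3
y_obs, eps_target = 3.0, 0.1

def subset_simulation(N=2000, p0=0.1, tau=0.5):
    """Estimate P(|z - y_obs| < eps_target) under the prior predictive through a
    sequence of nested tolerance levels, each conditional level kept near p0."""
    theta = rng.standard_normal(N)
    z = theta + rng.standard_normal(N)
    d = np.abs(z - y_obs)
    log_p = 0.0
    while True:
        eps = np.quantile(d, p0)
        if eps <= eps_target:                  # last level: finish with a plain average
            return log_p + np.log(np.mean(d < eps_target))
        log_p += np.log(p0)
        keep = d < eps                         # survivors become seeds
        theta, z, d = theta[keep], z[keep], d[keep]
        idx = rng.integers(0, len(theta), N)   # simplified repopulation step
        theta, z, d = theta[idx], z[idx], d[idx]
        for _ in range(10):                    # Metropolis moves restricted to {d < eps}
            th_p = theta + tau * rng.standard_normal(N)
            z_p = z + tau * rng.standard_normal(N)
            d_p = np.abs(z_p - y_obs)
            log_acc = (norm.logpdf(th_p) + norm.logpdf(z_p - th_p)
                       - norm.logpdf(theta) - norm.logpdf(z - theta))
            ok = (d_p < eps) & (np.log(rng.random(N)) < log_acc)
            theta[ok], z[ok], d[ok] = th_p[ok], z_p[ok], d_p[ok]

# exact value for the toy: the pseudo-data is marginally N(0, 2)
exact = norm.cdf((y_obs + 0.1) / np.sqrt(2)) - norm.cdf((y_obs - 0.1) / np.sqrt(2))
print(np.exp(subset_simulation()), exact)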

I find the subsequent part on getting rid of uncertain prediction error model parameters of lesser [personal] interest as it essentially replaces the marginal posterior on the parameters of interest by a BIC approximation, with the unsurprising conclusion that “the prior distribution of the nuisance parameter cancels out”.