Archive for marginal likelihood

ISBA 18 tidbits

Posted in Books, Mountains, pictures, Running, Statistics, Travel, University life on July 2, 2018 by xi'an

Among a continuous sequence of appealing sessions at this ISBA 2018 meeting [says a member of the scientific committee!], I happened to attend two talks [with a wee bit of overlap] by Sid Chib in two consecutive sessions, because his co-author Anna Simoni (CREST) was unfortunately sick. Their work was about models defined by a collection of moment conditions, as often happens in econometrics, developed in a recent JASA paper by Chib, Shin, and Simoni (2017). With an extension towards defining conditional expectations through a functional basis. The main approach relies on exponentially tilted empirical likelihoods, which reminded me of the empirical likelihood [BCel] implementation we ran with Kerrie Mengersen and Pierre Pudlo a few years ago. As a substitute to ABC. This issue made me wonder how Bayesian the estimating equation concept is, as it should somehow involve a nonparametric prior under the moment constraints.

Note that Sid’s [talks and] papers are disconnected from ABC, as everything comes in closed form, apart from the empirical likelihood derivation, as we actually found in our own work! But this could become a substitute model for ABC uses. For instance, identifying the parameter θ of the model by identifying equations. Would that impose too much input from the modeller? I figure I came up with this notion mostly because of the emphasis on proxy models the previous day at ABC in ‘burgh! Another connected item of interest in the work is the possibility of accounting for misspecification of these moment conditions by introducing a vector of errors with a spike & slab distribution, although I am not sure this is 100% necessary without getting further into the paper(s) [blame conference pressure on my time].

Another highlight was attending a fantastic poster session Monday night on computational methods, except I would have needed four more hours to get through each and every poster. This new version of ISBA has split the posters between two sites (great) and themes (not so great!), while I would have preferred more sites covering all themes over all nights, to lower the noise (still bearable this year) and to increase the possibility to check all posters of interest in a particular theme…

Mentioning as well a great talk by Dan Roy about assessing deep learning performance by what he calls non-vacuous error bounds. Namely, through PAC-Bayesian bounds. One major comment of his was about deep learning models being much more non-parametric (number of parameters rising with number of observations) than parametric models, meaning that generative adversarial constructs such as the one I discussed a few days ago may face a fundamental difficulty as models are taken at face value there.

On closed-form solutions, a closed-form Bayes factor for component selection in mixture models by Fúquene, Steel and Rossell that resembles the Savage-Dickey version, without the measure theoretic difficulties. But with non-local priors. And closed-form conjugate priors for the probit regression model, using unified skew-normal priors, as exhibited by Daniele Durante. Which are products of Normal cdfs and pdfs, and which allow for closed form marginal likelihoods and marginal posteriors as well. (The approach is not exactly conjugate as the prior and the posterior are not in the same family.)

And in the final session I attended, there were two talks on scalable MCMC, one on coresets, which will require some time and effort to assimilate, by Trevor Campbell and Tamara Broderick, and another one using Poisson subsampling. By Matias Quiroz and co-authors. Which did not completely convince me (but this was the end of a long day…)

All in all, this has been a great edition of the ISBA meetings, if quite intense due to a non-stop schedule, with a very efficient organisation that made parallel sessions manageable and poster sessions back to a reasonable scale [although I did not once manage to cross the street to the other session]. Being in unreasonably sunny Edinburgh helped a lot obviously! I am a wee bit disappointed that no one else followed my call to wear a kilt, but I had low expectations to start with… And too bad I missed the Ironman 70.3 Edinburgh by one day!

new estimators of evidence

Posted in Books, Statistics on June 19, 2018 by xi'an

In an incredible accumulation of coincidences, I came across yet another paper about evidence and the harmonic mean challenge, by Yu-Bo Wang, Ming-Hui Chen [same as in Chen, Shao, Ibrahim], Lynn Kuo, and Paul O. Lewis this time, published in Bayesian Analysis. (Disclaimer: I was not involved in the reviews of any of these papers!) Authors who are located in Storrs, Connecticut, in geographic and thematic connection with the original Gelfand and Dey (1994) paper! (Private joke about the Old Man of Storr in above picture!)

“The working parameter space is essentially the constrained support considered by Robert and Wraith (2009) and Marin and Robert (2010).”

The central idea is to use a more general function than our HPD restricted prior but still with a known integral. Not in the sense of control variates, though. The function of choice is a weighted sum of indicators of terms of a finite partition, which implies a compact parameter set Ω. Or a form of HPD region, although it is unclear when the volume can be derived. While the consistency of the estimator of the inverse normalising constant [based on an MCMC sample] is unsurprising, the more advanced part of the paper is about finding the optimal sequence of weights, as in control variates. But it is also unsurprising in that the weights are proportional to the inverses of the inverse posteriors over the sets in the partition. Since these are hard to derive in practice, the authors come up with a fairly interesting alternative, which is to take the value of the posterior at an arbitrary point of the relevant set.
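
To make the construction concrete, here is a minimal sketch of a partition-weighted estimator of the inverse evidence, under assumptions of my own rather than the paper's setting: a one-dimensional conjugate Normal model (so that the exact evidence is available for comparison), exact posterior draws standing in for an MCMC sample, Ω taken as the posterior mean plus or minus two standard deviations, and each weight set to the unnormalised posterior at the midpoint of its interval. All names (`log_q`, `edges`, `log_w`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy conjugate model (an assumption for illustration): x_i ~ N(theta, 1), theta ~ N(0, 1).
x = rng.normal(1.0, 1.0, size=20)
n, s = x.size, x.sum()
post_var = 1.0 / (n + 1.0)
post_mean = s * post_var

def log_q(theta):
    """Log unnormalised posterior (log prior + log likelihood) for a 1-d array theta."""
    theta = np.asarray(theta, dtype=float)
    log_prior = -0.5 * theta**2 - 0.5 * np.log(2 * np.pi)
    log_lik = (-0.5 * ((x[:, None] - theta[None, :]) ** 2).sum(axis=0)
               - 0.5 * n * np.log(2 * np.pi))
    return log_prior + log_lik

# Exact log evidence, available only because the toy model is conjugate.
log_Z_exact = (-0.5 * n * np.log(2 * np.pi) - 0.5 * np.log(n + 1.0)
               - 0.5 * ((x**2).sum() - s**2 / (n + 1.0)))

# Exact posterior draws stand in for an MCMC sample.
draws = rng.normal(post_mean, np.sqrt(post_var), size=10_000)

# Finite partition of the compact set Omega into K intervals.
K = 20
half_width = 2.0 * np.sqrt(post_var)
edges = np.linspace(post_mean - half_width, post_mean + half_width, K + 1)
mids = 0.5 * (edges[:-1] + edges[1:])
log_w = log_q(mids)               # weight: unnormalised posterior at one point per set
log_vol = np.log(np.diff(edges))  # volume (here, length) of each set

# phi(theta) = sum_k w_k 1{theta in A_k}, whose integral sum_k w_k vol_k is known.
k = np.searchsorted(edges, draws, side="right") - 1
inside = (k >= 0) & (k < K)
ratio = np.zeros_like(draws)      # phi / q, equal to zero outside Omega
ratio[inside] = np.exp(log_w[k[inside]] - log_q(draws[inside]))

# E_post[phi / q] = (integral of phi) / Z, hence an estimate of 1/Z.
log_inv_Z = np.log(ratio.mean()) - np.logaddexp.reduce(log_w + log_vol)
log_Z_hat = -log_inv_Z
print(log_Z_hat, log_Z_exact)
```

Since the weights track the unnormalised posterior on each interval, the averaged ratio stays close to constant over Ω, which is what keeps the variance under control in this toy case.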

The paper also contains an extension replacing the weights with functions that are integrable and with known integrals. Which is hard for most choices, even though it contains the regular harmonic mean estimator as a special case. And should also suffer from the curse of dimension when the constraint to keep the target almost constant is implemented (as in Figure 1).

The method, when properly calibrated, does much better than the harmonic mean (not a surprise) and than the Petris and Tardella (2007) alternative, but no other technique, on toy problems like Normal, Normal mixture, and probit regression with three covariates (no Pima Indians this time!). As an aside I find it hard to understand how the regular harmonic mean estimator takes longer than this more advanced version, which should require more calibration. But I find it hard to see a general application of the principle, because the partition needs to be chosen in terms of the target. Embedded balls cannot work for every possible problem, even with ex-post standardisation.

 

unbiased consistent nested sampling via sequential Monte Carlo [a reply]

Posted in pictures, Statistics, Travel on June 13, 2018 by xi'an

Rob Salomone sent me the following reply on my comments of yesterday about their recently arXived paper.

Our main goal in the paper was to show that Nested Sampling (when interpreted a certain way) is really just a member of a larger class of SMC algorithms, and exploring the consequences of that. We should point out that the section regarding calibration applies generally to SMC samplers, and hope that people give those techniques a try regardless of their chosen SMC approach.
Regarding your question about “whether or not it makes more sense to get completely SMC and forego any nested sampling flavour!”, this is an interesting point. After all, if Nested Sampling is just a special form of SMC, why not just use more standard SMC approaches? It seems that Nested Sampling’s main advantage is its ability to cope with problems that have “phase transition”-like behaviour, and thus is robust to a wider range of difficult problems than annealing approaches. Nevertheless, we hope this way of looking at NS (and showing that there may be variations of SMC with certain advantages) leads to improved NS and SMC methods down the line.
Regarding your post, I should clarify a point regarding unbiasedness. The largest likelihood bound is actually set to infinity. Thus, for the fixed version of NS-SMC, one has an unbiased estimator of the “final” band. Choosing a final band prematurely will of course result in very high variance. However, the estimator is unbiased. For example, consider NS-SMC with only one stratum. Then, the method reduces to simply using the prior as an importance sampling distribution for the posterior (unbiased, but often high variance).
Comments related to two specific parts of your post are below (your comments in italicised bold):
“Which never occurred as the number one difficulty there, as the simplest implementation runs a Markov chain from the last removed entry, independently from the remaining entries. Even stationarity is not an issue since I believe that the first occurrence within the level set is distributed from the constrained prior.”
This is an interesting point that we had not considered! In practice, and in many papers that apply Nested Sampling with MCMC, the common approach is to start the MCMC at one of the randomly selected “live points”, so the discussion related to independence was in regard to these common implementations.
Regarding starting the chain from outside of the level set, this is likely not done in practice as it introduces an additional difficulty of needing to propose a sample inside the required region (Metropolis–Hastings will have non-zero probability of returning a sample that is still outside the constrained region for any fixed number of iterations). Forcing the continuation of MCMC until a valid point is proposed will, I believe, be a subtle violation of detailed balance. Of course, the bias of such a modification may be small in practice, but it is an additional awkwardness introduced by the requirement of sample independence!
“And then, in a twist that is not clearly explained in the paper, the focus moves to an improved nested sampler that moves one likelihood value at a time, with a particle step replacing a single particle. (Things get complicated when several particles may take the very same likelihood value, but randomisation helps.) At this stage the algorithm is quite similar to the original nested sampler. Except for the unbiased estimation of the constants, the final constant, and the replacement of exponential weights exp(-t/N) by powers of (N-1)/N.”
Thanks for pointing out that this isn’t clear, we will try to do better in the next revision! The goal of this part of the paper wasn’t necessarily to propose a new version of nested sampling. Our focus here was to demonstrate that NS-SMC is not simply the Nested Sampling idea with an SMC twist, but that the original NS algorithm with MCMC (and restarting the MCMC sampling at one of the “live points” as people do in practice) actually is a special case of SMC (with the weights replaced with a suboptimal choice).
The most curious thing is that, as you note, the estimates of remaining prior mass in the SMC context come out as powers of (N-1)/N and not exp(-t/N). In the paper by Walter (2017), he shows that the former choice is actually superior in terms of bias and variance. It was a nice touch that the superior choice of weights came out naturally in the SMC interpretation! 
That said, as the fixed version of NS-SMC is the one with the unbiasedness and consistency properties, this was the version we used in the main statistical examples.
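
The one-stratum remark in the reply can be checked numerically. The following is only a sketch on a toy model of my own choosing (one Normal observation, standard Normal prior), not anything taken from the paper: averaging the likelihood over prior draws gives an unbiased estimate of the evidence, with a variance that depends on how well the prior covers the likelihood.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model (assumption): one observation x ~ N(theta, 1), prior theta ~ N(0, 1).
x_obs = 1.5

def likelihood(theta):
    return np.exp(-0.5 * (x_obs - theta) ** 2) / np.sqrt(2 * np.pi)

# Exact evidence for comparison: marginally, x ~ N(0, 2).
Z_exact = np.exp(-x_obs**2 / 4.0) / np.sqrt(4 * np.pi)

theta = rng.normal(0.0, 1.0, size=100_000)   # draws from the prior
L = likelihood(theta)
Z_hat = L.mean()                             # unbiased for Z, whatever the sample size
rel_se = L.std(ddof=1) / np.sqrt(L.size) / Z_hat  # estimated relative standard error
print(Z_hat, Z_exact, rel_se)
```

Moving `x_obs` further away from the prior mass inflates `rel_se` quickly, which is the "unbiased, but often high variance" behaviour mentioned in the reply.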

unbiased consistent nested sampling via sequential Monte Carlo

Posted in pictures, Statistics, Travel on June 12, 2018 by xi'an

“Moreover, estimates of the marginal likelihood are unbiased.” (p.2)

Rob Salomone, Leah South, Chris Drovandi and Dirk Kroese (from QUT and UQ, Brisbane) recently arXived a paper that frames nested sampling in such a way that marginal likelihoods can be unbiasedly (and consistently) estimated.

“Why isn’t nested sampling more popular with statisticians?” (p.7)

A most interesting question, especially given its popularity in cosmology and other branches of physics. A first drawback pointed out in the paper is the requirement of independence between the elements of the sample produced at each iteration. Which never occurred as the number one difficulty there, as the simplest implementation runs a Markov chain from the last removed entry, independently from the remaining entries. Even stationarity is not an issue since I believe that the first occurrence within the level set is distributed from the constrained prior.

A second difficulty is the use of quadrature, which turns the integrand into a step function at random slices. Indeed, mixing Monte Carlo with numerical integration makes life much harder, as shown by the early avatars of nested sampling that only accounted for the numerical errors. (And which caused Nicolas and me to write our critical paper in Biometrika.) There are few studies of that kind in the literature, the only one I can think of being [my former PhD student] Anne Philippe‘s thesis twenty years ago.

The third issue stands with the difficulty in parallelising the method. Except by jumping k points at once, rather than going one level at a time. While I agree this makes life more complicated, I am also unsure about the severity of that issue as k nested sampling algorithms can be run in parallel and aggregated in the end, from simple averaging to something more elaborate.

The final blemish is that the nested sampling estimator has a stopping mechanism that induces a truncation error, again maybe a lesser problem given the overall difficulty in assessing the total error.

The paper takes advantage of the ability of SMC to produce unbiased estimates of a sequence of normalising constants (or of the normalising constants of a sequence of targets). For nested sampling, the sequence is made of the prior distribution restricted to an embedded sequence of level sets. With another sequence restricted to bands (likelihood between two likelihood boundaries). If all restricted posteriors of the second kind and their normalising constants are known, the full posterior is known. Apparently up to the main normalising constant, i.e. the marginal likelihood, except that it is also the sum of all normalising constants. Handling this sequence by SMC addresses the four concerns of the four authors, apart from the truncation issue, since the largest likelihood bound needs to be set for running the algorithm.
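
To spell out the identity behind that last remark, in notation that is mine and not necessarily the paper's: take increasing likelihood bounds l_0 < l_1 < … < l_T, write Z_t for the normalising constant of the posterior restricted to the t-th band, and Z for the marginal likelihood. Since the bands partition the parameter space, the band constants add up to the evidence.

```latex
0 = l_0 < l_1 < \dots < l_T = \infty, \qquad
Z_t = \int \pi(\theta)\, L(\theta)\,
      \mathbb{1}\{\, l_{t-1} \le L(\theta) < l_t \,\}\, \mathrm{d}\theta,
\qquad
Z = \int \pi(\theta)\, L(\theta)\, \mathrm{d}\theta = \sum_{t=1}^{T} Z_t .
```

Setting l_T to infinity is what removes the truncation term; with a finite last bound, the sum misses the mass of the region where the likelihood exceeds it.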

When the sequence of likelihood bounds is chosen based on the observed likelihoods so far, the method becomes adaptive. Requiring again the choice of a stopping rule that may induce bias if stopping occurs too early. And then, in a twist that is not clearly explained in the paper, the focus moves to an improved nested sampler that moves one likelihood value at a time, with a particle step replacing a single particle. (Things get complicated when several particles may take the very same likelihood value, but randomisation helps.) At this stage the algorithm is quite similar to the original nested sampler. Except for the unbiased estimation of the constants, the final constant, and the replacement of exponential weights exp(-t/N) by powers of (N-1)/N.
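
For a sense of scale, the two weight sequences can be compared directly. This is plain arithmetic rather than anything from the paper, with N and the iteration counts picked arbitrarily: since log(1-1/N) is below -1/N, the power version is always the smaller of the two, by a factor close to exp(-t/(2N²)).

```python
import math

def prior_mass_weights(N, t):
    """Return the two candidate prior-mass values after t iterations with N particles."""
    power = ((N - 1) / N) ** t   # powers of (N-1)/N
    expo = math.exp(-t / N)      # exponential weights exp(-t/N)
    return power, expo

N = 100  # arbitrary number of particles
for t in (1, 10, 100, 1000):
    power, expo = prior_mass_weights(N, t)
    print(t, power, expo, power / expo)
```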

The remainder of this long paper (61 pages!) is dedicated to practical implementation, calibration and running a series of comparisons. A nice final touch is the thanks to the ‘Og for its series of posts on nested sampling, which “helped influence this work, and played a large part in inspiring it.”

In conclusion, this paper is certainly a worthy exploration of the nested sampler, providing further arguments towards a consistent version, with first and foremost an (almost?) unbiased resolution. The comparison with a wide range of alternatives remains open, in particular time-wise, if evidence is the sole target of the simulation. For instance, the choice of this sequence of targets in an SMC may be improved by another sequence, since changing one particle at a time does not sound efficient. The complexity of the implementation, and in particular of the simulation from the prior under more and more stringent constraints, needs to be addressed.

nested sampling when prior and likelihood clash

Posted in Books, Statistics on April 3, 2018 by xi'an

A recent arXival by Chen, Hobson, Das, and Gelderblom makes the proposal of a new nested sampling implementation when prior and likelihood disagree, making simulations from the prior inefficient. The paper holds the position that a single given prior is used over and over for all datasets that come along:

“…in applications where one wishes to perform analyses on many thousands (or even millions) of different datasets, since those (typically few) datasets for which the prior is unrepresentative can absorb a large fraction of the computational resources.” Chen et al., 2018

My reaction to this situation, provided (a) I want to implement nested sampling and (b) I realise there is a discrepancy, would be to resort to an importance sampling resolution, as we proposed in our Biometrika paper with Nicolas. Since one objection [from the authors] is that identifying outlier datasets is complicated (it should not be when the likelihood function can be computed) and time-consuming, sequential importance sampling could be implemented.

“The posterior repartitioning (PR) method takes advantage of the fact that nested sampling makes use of the likelihood L(θ) and prior π(θ) separately in its exploration of the parameter space, in contrast to Markov chain Monte Carlo (MCMC) sampling methods or genetic algorithms which typically deal solely in terms of the product.” Chen et al., 2018

The above salesman line does not ring a particularly convincing chime in that nested sampling is about as myopic as MCMC since based on the similar notion of a local proposal move, starting from the lowest likelihood argument (the minimum likelihood estimator!) in the nested sample.

“The advantage of this extension is that one can choose (π’,L’) so that simulating from π’ under the constraint L'(θ) > l is easier than simulating from π under the constraint L(θ) > l. For instance, one may choose an instrumental prior π’ such that Markov chain Monte Carlo steps adapted to the instrumental constrained prior are easier to implement than with respect to the actual constrained prior. In a similar vein, nested importance sampling facilitates contemplating several priors at once, as one may compute the evidence for each prior by producing the same nested sequence, based on the same pair (π’,L’), and by simply modifying the weight function.” Chopin & Robert, 2010

Since the authors propose to switch to a product (π’,L’) such that π’.L’=π.L, the solution appears like a special case of importance sampling, with the added drawback that when π’ is not normalised, its normalising constant must be estimated as well. (With an extra nested sampling implementation?) Furthermore, the advocated solution is to use tempering, which is not so obvious as it seems in small dimensions. As the mass does not always diffuse to relevant parts of the space. A more “natural” tempering would be to use a subsample in the (sub)likelihood for nested sampling and keep the remainder of the sample for weighting the evaluation of the evidence.
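
As an illustration of the repartitioning identity and of its importance sampling reading, here is a sketch under assumptions of my own: tempering is taken to mean raising the prior to a power β (my reading), the model is a Normal toy case in which the normalising constant of the tempered prior is known in closed form, and plain Monte Carlo from π’ replaces nested sampling altogether. The names `log_prior_new` and `log_lik_new` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy clash (assumption): prior theta ~ N(0, 1), one observation x = 8 with x ~ N(theta, 1).
x_obs, beta = 8.0, 0.1

def log_prior(theta):
    return -0.5 * theta**2 - 0.5 * np.log(2 * np.pi)

def log_lik(theta):
    return -0.5 * (x_obs - theta) ** 2 - 0.5 * np.log(2 * np.pi)

# Tempered prior pi' = pi^beta / C, i.e. N(0, 1/beta), with C the integral of pi^beta.
log_C = -0.5 * beta * np.log(2 * np.pi) + 0.5 * np.log(2 * np.pi / beta)

def log_prior_new(theta):
    return beta * log_prior(theta) - log_C

def log_lik_new(theta):
    # Chosen so that pi' * L' = pi * L pointwise.
    return log_lik(theta) + (1.0 - beta) * log_prior(theta) + log_C

# Check the repartitioning identity on a few points.
grid = np.linspace(-5.0, 12.0, 7)
repartition_gap = np.max(np.abs(log_prior_new(grid) + log_lik_new(grid)
                                - log_prior(grid) - log_lik(grid)))

Z_exact = np.exp(-x_obs**2 / 4.0) / np.sqrt(4 * np.pi)  # marginally, x ~ N(0, 2)

m = 200_000
# Average of L over draws from pi, versus average of L' over draws from pi'.
Z_direct = np.exp(log_lik(rng.normal(0.0, 1.0, size=m))).mean()
Z_repart = np.exp(log_lik_new(rng.normal(0.0, 1.0 / np.sqrt(beta), size=m))).mean()
print(repartition_gap, Z_direct / Z_exact, Z_repart / Z_exact)
```

The second estimate is nothing but importance sampling with π’ as the proposal, and it only works this smoothly here because `log_C` is available in closed form.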

marginal likelihoods from MCMC

Posted in Books, pictures, Statistics, University life on April 26, 2017 by xi'an

A new arXiv entry on ways to approximate marginal likelihoods based on MCMC output, by astronomers (apparently). With an application to the 2015 Planck satellite analysis of cosmic microwave background radiation data, which reminded me of our joint work with the cosmologists of the Paris Institut d’Astrophysique ten years ago. In the literature review, the authors miss several surveys on the approximation of those marginals, including our San Antonio chapter, on Bayes factors approximations, but mention our ABC survey somewhat inappropriately since it is not advocating the use of ABC for such a purpose. (They mention as well variational Bayes approximations, INLA, powered likelihoods, if not nested sampling.)

The proposal of this paper is to identify the marginal m [actually denoted a there] as the normalising constant of an unnormalised posterior density. And to do so the authors estimate the posterior by a non-parametric approach, namely a k-nearest-neighbour estimate. With the additional twist of producing a sort of Bayesian posterior on the constant m. [And the unusual notion of number density, used for the unnormalised posterior.] The Bayesian estimation of m relies on a Poisson sampling assumption on the k-nearest neighbour distribution. (Sort of, since k is actually fixed, not random.)
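
To fix ideas on the non-parametric step only, here is a plug-in sketch that is not the paper's method: it drops the Bayesian (Poisson) layer entirely and simply divides the unnormalised posterior by a k-nearest-neighbour density estimate at each draw, then takes a median. The toy model, the values of n and k, and the use of exact posterior draws in place of MCMC output are all assumptions of mine.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy model (assumption): x ~ N(theta, 1) with x = 1.5 and prior theta ~ N(0, 1),
# so the posterior is N(x/2, 1/2) and the evidence is the N(0, 2) density at x.
x_obs = 1.5

def log_q(theta):
    """Log unnormalised posterior: log prior + log likelihood."""
    return -0.5 * theta**2 - 0.5 * (x_obs - theta) ** 2 - np.log(2 * np.pi)

log_m_exact = -x_obs**2 / 4.0 - 0.5 * np.log(4 * np.pi)

# Exact posterior draws stand in for MCMC output.
n, k = 2000, 50
theta = rng.normal(x_obs / 2.0, np.sqrt(0.5), size=n)

# Distance from each draw to its k-th nearest neighbour (sorted position 0 is the draw itself).
dist = np.abs(theta[:, None] - theta[None, :])
r_k = np.partition(dist, k, axis=1)[:, k]

# 1-d k-nearest-neighbour density estimate: k of the n-1 other draws in an interval of length 2 r_k.
log_p_hat = np.log(k) - np.log(n - 1) - np.log(2.0 * r_k)

# q = m * p, hence log m = log q - log p; the median damps the noisy tail estimates.
log_m_hat = np.median(log_q(theta) - log_p_hat)
print(log_m_hat, log_m_exact)
```

In one dimension this behaves reasonably, but the volume term becomes r_k to the power d times the volume of the unit ball in dimension d, and the distances r_k grow quickly with d.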

If the above sounds confusing and imprecise it is because I am myself rather mystified by the whole approach and find it difficult to see the point in this alternative. The Bayesian numerics do not seem to have any other purpose than producing a MAP estimate. And using a non-parametric density estimate opens a Pandora's box of difficulties, the most obvious one being the curse of dimension(ality). This reminded me of the commented paper of Delyon and Portier where they achieve super-efficient convergence when using a kernel estimator, but with a considerable cost and a similar sensitivity to dimension.

Bayesian parameter estimation versus model comparison

Posted in Books, pictures, Statistics on December 5, 2016 by xi'an

John Kruschke [of puppies’ fame!] wrote a paper in Perspectives on Psychological Science a few years ago on the comparison between two Bayesian approaches to null hypotheses. Of which I became aware through a X validated question that seemed to confuse Bayesian parameter estimation with Bayesian hypothesis testing.

“Regardless of the decision rule, however, the primary attraction of using parameter estimation to assess null values is that an explicit posterior distribution reveals the relative credibility of all the parameter values.” (p.302)

After reading this paper, I realised that Kruschke meant something completely different, namely that a Bayesian approach to null hypothesis testing could operate from the posterior on the corresponding parameter, rather than engaging in formal Bayesian model comparison (null versus the rest of the World). The notion is to check whether or not the null value stands within the 95% [why 95?] HPD region [modulo a buffer zone], which offers the pluses of avoiding a Dirac mass at the null value and a long-term impact of the prior tails on the decision, with the minus of replacing the null with a tolerance region around the null and calibrating the rejection level. This opposition is thus a Bayesian counterpart of running tests on point null hypotheses either by Neyman-Pearson procedures or by confidence intervals. Note that in problems with nuisance parameters this solution requires a determination of the 95% HPD region associated with the marginal on the parameter of interest, which may prove a challenge.
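
A minimal sketch of such a decision rule, as I read it and without any claim to match Kruschke's exact implementation: compute the shortest interval holding 95% of the posterior draws (valid for a unimodal posterior) and compare it with a buffer zone around the null. The function names, the buffer half-width of 0.1, and the simulated "posterior" draws are all hypothetical.

```python
import numpy as np

def hpd_interval(draws, mass=0.95):
    """Shortest interval containing a fraction `mass` of the draws (unimodal case)."""
    s = np.sort(np.asarray(draws, dtype=float))
    n = s.size
    m = int(np.ceil(mass * n))            # number of draws the interval must contain
    widths = s[m - 1:] - s[: n - m + 1]   # widths of all windows of m consecutive draws
    i = int(np.argmin(widths))
    return s[i], s[i + m - 1]

def assess_null(draws, null=0.0, buffer=0.1, mass=0.95):
    """Compare the HPD interval with the buffer zone [null - buffer, null + buffer]."""
    lo, hi = hpd_interval(draws, mass)
    if hi < null - buffer or lo > null + buffer:
        return "reject null"    # HPD interval entirely outside the buffer zone
    if lo >= null - buffer and hi <= null + buffer:
        return "accept null"    # HPD interval entirely inside the buffer zone
    return "undecided"

rng = np.random.default_rng(4)
far = rng.normal(1.0, 0.2, size=10_000)     # hypothetical posterior draws away from 0
near = rng.normal(0.0, 0.02, size=10_000)   # hypothetical posterior draws concentrated at 0
wide = rng.normal(0.1, 0.2, size=10_000)    # hypothetical posterior draws straddling the zone
print(assess_null(far), assess_null(near), assess_null(wide))
```

Both the 95% level and the buffer half-width are calibration choices left to the user, which is exactly the minus noted above.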

“…the measure provides a natural penalty for vague priors that allow a broad range of parameter values, because a vague prior dilutes credibility across a broad range of parameter values, and therefore the weighted average is also attenuated.” (p. 306)

While I agree with most of the critical assessment of Bayesian model comparison, including Kruschke’s version of Occam’s razor [and Lindley’s paradox] above, I do not understand how Bayesian model comparison fails to return a full posterior on both the model indices [for model comparison] and the model parameters [for estimation]. To state that it does not because the Bayes factor only depends on marginal likelihoods (p.307) sounds unfair if only because most numerical techniques to approximate the Bayes factors rely on preliminary simulations of the posterior. The point that the Bayes factor strongly depends on the modelling of the alternative model is well-taken, albeit the selection of the null in the “estimation” approach does depend as well on this alternative modelling. Which is an issue if one ends up accepting the null value and running a Bayesian analysis based on this null value.

“The two Bayesian approaches to assessing null values can be unified in a single hierarchical model.” (p.308)

Incidentally, the paper briefly considers a unified modelling that can be interpreted as a mixture across both models, but this mixture representation completely differs from ours [where we also advocate estimation to replace testing] since the mixture is at the likelihood x prior level, as in O’Neill and Kypraios.