Archive for Chib’s approximation

Bayesian empirical likelihood

Posted in Books, pictures, Statistics on July 21, 2016 by xi'an

[Image: non-tibetan flags in Pula, Sardinia, June 12, 2016]

Sid Chib, Minchul Shin, and Anna Simoni (CREST) recently arXived a paper entitled “Bayesian Empirical Likelihood Estimation and Comparison of Moment Condition Models”, which Sid mentioned to me in Sardinia. The core notion is related to earlier Bayesian forays into empirical likelihood pseudo-models, like Lazar (2005) or our PNAS paper with Kerrie Mengersen and Pierre Pudlo. Namely to build a pseudo-likelihood using empirical likelihood principles and to derive the posterior associated with this pseudo-likelihood. Some novel aspects are the introduction of tolerance (nuisance) extra-parameters when some constraints do not hold, a maximum entropy (or exponentially tilted) representation of the empirical likelihood function, and a Chib-Jeliazkov representation of the marginal likelihood. The authors obtain a Bernstein-von Mises theorem under correct specification, meaning convergence of the posterior to a Gaussian limit, and another one under misspecification.

While the above Bernstein-von Mises theory is somewhat expected (if worth deriving) in the light of frequentist consistency results, the paper also considers a novel and exciting aspect, namely to compare models (or rather moment restrictions) by Bayes factors derived from empirical likelihoods. A grand (encompassing) model is obtained by considering all moment restrictions at once, which at first sounds more restricted, except that the extra-parameters are there to monitor constraints that actually hold. It is unclear from my cursory read of the paper whether priors on those extra-parameters can be automatically derived from a single prior. And how much they impact the value of the Bayes factor. The consistency results found in the paper do not seem to depend on the form of the priors adopted for each model (in all three cases: both models correctly specified, only one, or neither). Except maybe for some local asymptotic normality (LAN). Interestingly (?), the authors consider the Poisson versus Negative Binomial test we used in our testing by mixture paper. This paper thus brings a better view of the theoretical properties of a pseudo-Bayesian approach based on moment conditions and empirical likelihood approximations. Without a clear vision of the implementation details, from the parameterisation of the constraints (which could be tested the same way) to the construction of the prior(s) to the handling of MCMC difficulties in realistic models.

Using MCMC output to efficiently estimate Bayes factors

Posted in Books, R, Statistics, University life on May 19, 2016 by xi'an

As I was checking for software to answer a query on X validated about generic Bayes factor derivation, I came across an R package called BayesFactor, which only applies in regression settings and relies on the Savage-Dickey representation of the Bayes factor

B_{01}=\dfrac{f(y|\theta^0)}{m(y)}=\dfrac{\pi(\theta^0|y)}{\pi(\theta^0)}

when the null hypothesis is written as θ=θ⁰ (with possibly additional nuisance parameters endowed with [roughly speaking] an independent prior). As we discussed in our paper with Jean-Michel Marin [which got ignored by and large!], this representation of the Bayes factor is based on picking a very specific version of the prior, or more exactly of three prior densities. Assuming such versions are selected, I wonder about the performance of this approximation, given that it involves approximating the marginal posterior at θ⁰…
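To make the identity concrete, here is a minimal sketch in a toy conjugate setting that is not taken from the post or from the BayesFactor package: a normal sample with known unit variance, a N(0, τ²) prior on θ, and θ⁰=0. All names and numerical values (`n`, `ybar`, `tau`) are assumptions chosen for illustration. The sketch compares the Savage-Dickey ratio with the direct Bayes factor, and then with a version where the posterior density at θ⁰ is only estimated from simulated draws by a kernel estimate, which is the step questioned above.

```r
# Toy conjugate setting (an assumption for illustration):
# y_1..y_n ~ N(theta, 1), prior theta ~ N(0, tau^2), null hypothesis theta = theta0 = 0
n <- 20; ybar <- 0.3; tau <- 1; theta0 <- 0

# exact posterior of theta: N(post_mean, post_sd^2)
post_var  <- 1 / (n + 1 / tau^2)
post_mean <- post_var * n * ybar
post_sd   <- sqrt(post_var)

# Savage-Dickey ratio: posterior density over prior density at theta0
b01_sd <- dnorm(theta0, post_mean, post_sd) / dnorm(theta0, 0, tau)

# direct Bayes factor f(y|theta0)/m(y), written through the sufficient statistic ybar
b01_exact <- dnorm(ybar, theta0, sqrt(1 / n)) / dnorm(ybar, 0, sqrt(tau^2 + 1 / n))

# same ratio when the posterior density at theta0 must be estimated from draws
set.seed(1)
draws   <- rnorm(1e5, post_mean, post_sd)   # stand-in for MCMC output
kde     <- density(draws)                   # default Gaussian kernel estimate
b01_kde <- approx(kde$x, kde$y, xout = theta0)$y / dnorm(theta0, 0, tau)

c(savage_dickey = b01_sd, exact = b01_exact, kde = b01_kde)
```

In this one-dimensional case the kernel version should land close to the exact value; the concern raised in the post is about what happens when θ has more than a few components.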

“To ensure that the Bayes factor we compute using the Savage–Dickey ratio is the ratio of marginal densities that we intend, the condition (…) is easily met by models which specify priors in which the nuisance parameters are independent of the parameters of interest.” Morey et al. (2011)

First, when reading Morey et al. (2011), I realised (a wee bit late!) that Chib’s method is nothing but a version of the Savage-Dickey representation when the marginal posterior can be estimated in a parametric (Rao-Blackwellised) way. However, outside hierarchical models based on conjugate priors, such parametric approximations are intractable and non-parametric versions must be invoked instead, which necessarily degrades the quality of the method. A degradation that escalates with the dimension of the parameter θ. In addition, I am somewhat perplexed by the use of a Rao-Blackwell argument in the setting of the Savage-Dickey representation. Indeed this representation assumes that

\pi_1(\psi|\theta^0)=\pi_0(\psi) \quad\text{or}\quad \pi_1(\theta^0,\psi)=\pi_1(\theta^0)\pi_0(\psi)

which means that [the specific version of] the conditional density of θ⁰ given ψ should not depend on the nuisance parameter. But relying on a Rao-Blackwellisation leads to estimating the marginal posterior via full conditionals. Of course, θ given ψ and y may depend on ψ, but still… Morey et al. (2011) advocate the recourse to Chib’s formula as optimal but this obviously requires the full conditional to be available. They acknowledge this point as moot, since it is sufficient from their perspective to specify a conjugate prior. They consider this to be a slight modification of the model (p.377). However, I see the evaluation of an estimated density at a single (I repeat, single!) point as being the direst part of the method, as it is clearly more sensitive to approximations than the evaluation of a whole integral, since the latter incorporates an averaging effect by definition. Hence, even if this method were truly available for all models, I would be uncertain of its worth when compared with other methods, except the harmonic mean estimator of course!
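For readers who want to see what the Rao-Blackwellised version of Chib’s formula looks like when the full conditionals are available, here is a minimal sketch under assumptions of my own: a normal sample with unknown mean μ and variance σ², with a semi-conjugate prior (normal on μ, inverse gamma on σ²). The data, the hyperparameters `m0`, `s0`, `a`, `b`, and the helper names are all hypothetical. The evidence is evaluated at two different points (μ*, σ²*) since the identity should return the same value whatever the point.

```r
set.seed(1)
y <- rnorm(15, mean = 1, sd = 2)   # simulated data, for illustration only
n <- length(y)
m0 <- 0; s0 <- 10                  # assumed prior: mu ~ N(m0, s0^2)
a <- 2;  b <- 2                    # assumed prior: sigma2 ~ InvGamma(a, b)

# log-density of an inverse gamma InvGamma(shape, scale)
ldinvgamma <- function(x, shape, scale) {
  shape * log(scale) - lgamma(shape) - (shape + 1) * log(x) - scale / x
}

# mean and sd of the Gaussian full conditional of mu given sigma2 (vectorised in sigma2)
mu_cond <- function(sigma2) {
  v <- 1 / (n / sigma2 + 1 / s0^2)
  list(mean = v * (sum(y) / sigma2 + m0 / s0^2), sd = sqrt(v))
}

# Gibbs sampler alternating between the two full conditionals
n_iter <- 5000
sigma2 <- numeric(n_iter)
mu <- mean(y)
for (t in 1:n_iter) {
  sigma2[t] <- 1 / rgamma(1, shape = a + n / 2, rate = b + sum((y - mu)^2) / 2)
  mc <- mu_cond(sigma2[t])
  mu <- rnorm(1, mc$mean, mc$sd)
}
sigma2 <- sigma2[-(1:500)]         # drop a short burn-in

# Chib's identity: log m(y) = log f(y|mu*,s2*) + log prior - log posterior at (mu*, s2*)
chib_log_evidence <- function(mu_star, sigma2_star) {
  log_lik   <- sum(dnorm(y, mu_star, sqrt(sigma2_star), log = TRUE))
  log_prior <- dnorm(mu_star, m0, s0, log = TRUE) + ldinvgamma(sigma2_star, a, b)
  # Rao-Blackwellised estimate of pi(mu*|y): average of the full conditional over sigma2 draws
  mc <- mu_cond(sigma2)
  log_post_mu <- log(mean(dnorm(mu_star, mc$mean, mc$sd)))
  # pi(sigma2*|mu*, y) is available in closed form
  log_post_s2 <- ldinvgamma(sigma2_star, a + n / 2, b + sum((y - mu_star)^2) / 2)
  log_lik + log_prior - log_post_mu - log_post_s2
}

est_centre <- chib_log_evidence(mean(y), var(y))
est_off    <- chib_log_evidence(mean(y) + 0.3, 1.5 * var(y))
c(est_centre, est_off)
```

The two estimates should agree up to Monte Carlo error. The sketch only works because both full conditionals are standard distributions, which is the restriction discussed above.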

On the side, Morey et al. (2011) study a simple one-sample t test where they use an improper prior on the nuisance parameter σ, under both models. While the Savage-Dickey representation is correct in this special case, I fail to see why the identity would apply in every case under an improper prior. In particular, independence does not make sense with improper priors. The authors also indicate the possible use of this Bayes factor approximation for encompassing models. At first, I thought this could be most useful in our testing by mixture framework where we define an encompassing model as a mixture. However, I quickly realised that using a Beta Be(a,a) prior on the weight α with a<1 leads to an infinite density value at both zero and one, hence cannot be compatible with a Savage-Dickey representation of the Bayes factor.
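The endpoint issue is easy to check numerically. A two-line sketch, with a=0.5 picked arbitrarily as an example of a<1:

```r
a <- 0.5
endpoint_density <- dbeta(c(0, 1), a, a)   # Be(a, a) prior density at alpha = 0 and alpha = 1
uniform_density  <- dbeta(c(0, 1), 1, 1)   # uniform prior at the same points, for comparison
endpoint_density
```

With an infinite prior density at the tested value, the denominator of the Savage-Dickey ratio is not a usable number, whatever the posterior does there.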

approximating evidence with missing data

Posted in Books, pictures, Statistics, University life on December 23, 2015 by xi'an

[Image: University of Warwick, May 31 2010]

Panayiota Touloupou (Warwick), Naif Alzahrani, Peter Neal, Simon Spencer (Warwick) and Trevelyan McKinley arXived a paper yesterday on Model comparison with missing data using MCMC and importance sampling, where they propose an importance sampling strategy based on an early MCMC run to approximate the marginal likelihood a.k.a. the evidence. Another instance of estimating a constant. It is thus similar to our Frontier paper with Jean-Michel, as well as to the recent Pima Indian survey of James and Nicolas. The authors give the difficulty of calibrating reversible jump MCMC as the starting point of their research. The importance sampler they use is the natural choice of a Gaussian or t distribution centred at some estimate of θ and with covariance matrix associated with Fisher’s information. Or derived from the warmup MCMC run. The comparison between the different approximations to the evidence is done first over longitudinal epidemiological models, involving 11 parameters in the example processed therein. The competitors to the 9 versions of importance samplers investigated in the paper are the raw harmonic mean [rather than our HPD truncated version], Chib’s, path sampling and RJMCMC [which does not make much sense when comparing two models]. But neither bridge sampling, nor nested sampling. Without any surprise (!) harmonic means do not converge to the right value, but more surprisingly Chib’s method happens to be less accurate than most importance solutions studied therein. It may be due to the fact that Chib’s approximation requires three MCMC runs and hence is quite costly. The fact that the mixture (or defensive) importance sampling [with 5% weight on the prior] did best begs for a comparison with bridge sampling, no? The difficulty with such a study is obviously that the results only apply in the setting of the simulation, hence that, e.g., another mixture importance sampler or Chib’s solution would behave differently in another model. In particular, it is hard to judge the impact of the dimensions of the parameter and of the missing data.
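As a rough illustration of the defensive mixture idea, here is a sketch on a toy model where the evidence is known in closed form. This is not the authors’ epidemic model nor their code: the normal model, the prior scale `tau`, the sample size and the 95%/5% split between a Gaussian centred at the MLE and the prior are assumptions chosen to mimic the description above.

```r
set.seed(42)
n <- 20; tau <- 2
y <- rnorm(n, mean = 0.5, sd = 1)           # simulated data: y_i ~ N(theta, 1)
ybar <- mean(y); S <- sum((y - ybar)^2)     # prior (assumed): theta ~ N(0, tau^2)

log_lik <- function(theta) {
  vapply(theta, function(th) sum(dnorm(y, th, 1, log = TRUE)), numeric(1))
}

# defensive mixture proposal: 95% Gaussian at the MLE with Fisher variance 1/n, 5% prior
M <- 1e4
from_prior <- runif(M) < 0.05
theta <- ifelse(from_prior, rnorm(M, 0, tau), rnorm(M, ybar, sqrt(1 / n)))
log_q <- log(0.05 * dnorm(theta, 0, tau) + 0.95 * dnorm(theta, ybar, sqrt(1 / n)))

# importance weights likelihood x prior / proposal, averaged on the log scale
log_w <- log_lik(theta) + dnorm(theta, 0, tau, log = TRUE) - log_q
log_evid_is <- max(log_w) + log(mean(exp(log_w - max(log_w))))

# closed-form log evidence for this conjugate model
log_evid_exact <- -n / 2 * log(2 * pi) - S / 2 + 0.5 * log(2 * pi / n) +
  dnorm(ybar, 0, sqrt(tau^2 + 1 / n), log = TRUE)

c(importance_sampling = log_evid_is, exact = log_evid_exact)
```

The prior component keeps the importance weights bounded, which is the whole point of the defensive construction; how well this carries over to models with missing data is exactly what cannot be judged from a toy case like this one.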

Leave the Pima Indians alone!

Posted in Books, R, Statistics, University life on July 15, 2015 by xi'an

“…our findings shall lead us to be critical of certain current practices. Specifically, most papers seem content with comparing some new algorithm with Gibbs sampling, on a few small datasets, such as the well-known Pima Indians diabetes dataset (8 covariates). But we shall see that, for such datasets, approaches that are even more basic than Gibbs sampling are actually hard to beat. In other words, datasets considered in the literature may be too toy-like to be used as a relevant benchmark. On the other hand, if one considers larger datasets (with say 100 covariates), then not so many approaches seem to remain competitive” (p.1)

Nicolas Chopin and James Ridgway (CREST, Paris) completed and arXived a paper they had “threatened” to publish for a while now, namely why using the Pima Indian R logistic or probit regression benchmark for checking a computational algorithm is not such a great idea! Given that I am definitely guilty of such a sin (in papers not reported in the survey), I was quite eager to read the reasons why! Beyond the debate on the worth of such a benchmark, the paper considers a wider perspective as to how Bayesian computation algorithms should be compared, including the murky waters of CPU time versus designer or programmer time. Which plays against most MCMC samplers.

As a first entry, Nicolas and James point out that the MAP can be derived by a standard Newton-Raphson algorithm when the prior is Gaussian, and even when the prior is Cauchy as it seems most datasets allow for Newton-Raphson convergence. As well as the Hessian. We actually took advantage of this property in our comparison of evidence approximations published in the Festschrift for Jim Berger. Where we also noticed the awesome performance of an importance sampler based on the Gaussian or Laplace approximation. The authors call this proposal their gold standard. Because they also find it hard to beat. They also pursue this approximation to its logical (?) end by proposing an evidence approximation based on the above and Chib’s formula. Two close approximations are provided by INLA for posterior marginals and by a Laplace-EM for a Cauchy prior. Unsurprisingly, the expectation-propagation (EP) approach is also implemented. What EP lacks in theoretical backup, it seems to recover in sheer precision (in the examples analysed in the paper). And unsurprisingly as well the paper includes a randomised quasi-Monte Carlo version of the Gaussian importance sampler. (The authors report that “the improvement brought by RQMC varies strongly across datasets” without elaborating on the reasons behind this variability. They also do not report the CPU time of the IS-QMC, maybe identical to the one for the regular importance sampling.) Maybe more surprising is the absence of a nested sampling version.
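The Newton-Raphson step and the resulting Laplace approximation of the evidence can be sketched in a few lines of R. This is a minimal illustration under my own assumptions, not the code of the paper: the data are simulated rather than taken from the Pima benchmark, the prior is an independent N(0, τ²) on each coefficient with `tau` set arbitrarily, and no safeguard (step halving, line search) is included.

```r
set.seed(7)
n <- 200
X <- cbind(1, matrix(rnorm(n * 2), n, 2))    # simulated design: intercept + 2 covariates
beta_true <- c(-0.5, 1, -1)
y <- rbinom(n, 1, plogis(drop(X %*% beta_true)))
tau <- 5                                     # assumed prior sd: beta_j ~ N(0, tau^2)
d <- ncol(X)

# log posterior up to the evidence: logistic log-likelihood + Gaussian log-prior
log_post <- function(beta) {
  eta <- drop(X %*% beta)
  sum(y * eta - log1p(exp(eta))) + sum(dnorm(beta, 0, tau, log = TRUE))
}
gradient <- function(beta) {
  p <- plogis(drop(X %*% beta))
  drop(crossprod(X, y - p)) - beta / tau^2
}
neg_hessian <- function(beta) {
  p <- plogis(drop(X %*% beta))
  crossprod(X, X * (p * (1 - p))) + diag(d) / tau^2
}

# plain Newton-Raphson iterations towards the MAP
beta <- rep(0, d)
for (it in 1:25) {
  step <- solve(neg_hessian(beta), gradient(beta))
  beta <- beta + step
  if (max(abs(step)) < 1e-10) break
}
grad_map <- gradient(beta)

# Laplace approximation of the log evidence around the MAP
log_det <- as.numeric(determinant(neg_hessian(beta), logarithm = TRUE)$modulus)
log_evid_laplace <- log_post(beta) + d / 2 * log(2 * pi) - 0.5 * log_det

list(map = beta, log_evidence_laplace = log_evid_laplace)
```

The MAP and the inverse of `neg_hessian(beta)` are also what a Gaussian importance sampler would use as mean and covariance, which is how this connects to the gold standard mentioned above.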

In the Markov chain Monte Carlo solutions, Nicolas and James compare Gibbs, Metropolis-Hastings, Hamiltonian Monte Carlo, and NUTS. Plus a tempering SMC. All of which are outperformed by importance sampling for small enough datasets. But get back to competing grounds for large enough ones, since importance sampling then fails.

“…let’s all refrain from now on from using datasets and models that are too simple to serve as a reasonable benchmark.” (p.25)

This is a very nice survey on the theme of binary data (more than on the comparison of algorithms in that the authors do not really take into account design and complexity, but resort to MSEs versus CPUs). I however do not agree with their overall message to leave the Pima Indians alone. Or at least not for the reason provided therein, namely that faster and more accurate approximation methods are available and cannot be beaten. Benchmarks always have the limitation of “what you get is what you see”, i.e., the output associated with a single dataset that only has that many idiosyncrasies. Plus, the closeness to a perfect normal posterior makes the logistic posterior too regular to pose a real challenge (even though MCMC algorithms are as usual slower than iid sampling). But having faster and more precise resolutions should on the contrary be cause for cheers, as this provides a reference value, a gold standard, to check against. In a sense, for every Monte Carlo method, there is a much better answer, namely the exact value of the integral or of the optimum! And one is hardly aiming at a more precise inference for the benchmark itself: those Pima Indians [whose actual name is Akimel O’odham] with diabetes involved in the original study are definitely beyond help from statisticians and the model is unlikely to carry over to current populations. When the goal is to compare methods, as in our 2009 paper for Jim Berger’s 60th birthday, what matters is relative speed and relative ease of implementation (besides the obvious convergence to the proper target). In that sense bigger and larger is not always relevant. Unless one tackles really big or really large datasets, for which there is neither benchmark method nor reference value.

Le Monde puzzle [#902]

Posted in Books, Kids, Statistics, University life on March 8, 2015 by xi'an

Another arithmetic Le Monde mathematical puzzle:

From the set of the integers between 1 and 15, is it possible to partition it in such a way that the product of the terms in the first set is equal to the sum of the members of the second set? Can this be generalised to an arbitrary set {1,2,..,n}? What happens if instead we only consider the odd integers in those sets?

I used brute force by looking at random for a solution,

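# progress bar over the values of N explored below
pb <- txtProgressBar(min = 0, max = 100, style = 3)
for (N in 5:100){
  sol=FALSE
  while (!sol){
    # random size k of the product set, with weights k(N-k) that exclude k=N
    k=sample(1:N,1,prob=(1:N)*(N-(1:N)))
    # random subset of {1,...,N} with k elements
    pro=sample(1:N,k)
    # compare the product of the subset with the sum of its complement
    sol=(prod(pro)==sum((1:N)[-pro]))
  }
  setTxtProgressBar(pb, N)
}
close(pb)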

and while it took a while to run the R code, it eventually got out of the loop, meaning there was at least one solution for all n’s between 5 and 100. (It does not work for n=1,2,3,4, for obvious reasons.) For instance, when n=15, the integers in the product part are either 3,5,7, 1,7,14, or 1,9,11. Jean-Louis Fouley sent me an explanation:  when n is odd, n=2p+1, one solution is (1,p,2p), while when n is even, n=2p, one solution is (1,p-1,2p).
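The closed-form solution can be checked directly instead of relying on the random search. A short sketch, where the function names `closed_form_subset` and `is_solution` are mine:

```r
# closed-form product set: (1, p, 2p) for n = 2p+1 and (1, p-1, 2p) for n = 2p
closed_form_subset <- function(n) {
  p <- n %/% 2
  if (n %% 2 == 1) c(1, p, 2 * p) else c(1, p - 1, 2 * p)
}

# TRUE when the product of prod_set equals the sum of the remaining integers in 1..n
is_solution <- function(n, prod_set) {
  prod(prod_set) == sum(setdiff(1:n, prod_set))
}

ok <- vapply(5:100, function(n) is_solution(n, closed_form_subset(n)), logical(1))
all(ok)
closed_form_subset(15)
```

The algebra behind it is short: for n=2p+1 the total sum is (2p+1)(p+1), and removing 1+p+2p leaves 2p², which is the product 1×p×2p. For n=2p the total is p(2p+1), and removing 1+(p-1)+2p leaves 2p(p-1).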

A side remark on the R code: thanks to a Cross Validated question by Paulo Marques, on which I thought I had commented on this blog, I learned about the progress bar function in R, setTxtProgressBar(), which makes running R code with loops much nicer!

For the second question, I just adapted the R code to exclude even integers:

while (!sol){
  k=1+trunc(sample(1:N,1)/2)
  pro=sample(seq(1,N,by=2),k)
  cum=(1:N)[-pro]
  sol=(prod(pro)==sum(cum[cum%%2==1]))
}

and found a solution for n=15, namely 1,3,15 versus 5,7,9,11,13. However, there does not seem to be a solution for all n’s: I found solutions for n=15,21,23,31,39,41,47,49,55,59,63,71,75,79,87,95…
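Since the random search cannot prove the absence of a solution, an exhaustive search over all subsets of the odd integers settles the question for small n. A sketch of such a check (the function name is mine, and the cost grows exponentially with the number of odd integers, so this is only meant for small values of n):

```r
# exhaustive search: is there a subset of the odd integers up to n whose product
# equals the sum of the remaining odd integers?
odd_partition_exists <- function(n) {
  odds  <- seq(1, n, by = 2)
  total <- sum(odds)
  for (k in seq_len(length(odds) - 1)) {
    # all subsets of size k, each tested for product == sum of the complement
    hit <- combn(odds, k, FUN = function(s) prod(s) == total - sum(s))
    if (any(hit)) return(TRUE)
  }
  FALSE
}

odd_partition_exists(15)
odd_partition_exists(17)
```

For n=17 the absence of a solution can also be seen by hand: the odd integers sum to 81, any product set with three or more elements other than 1 already exceeds 81, and a pair (a,b) would need (a+1)(b+1) to equal 82 (without 1 in the product set) or 81 (with 1), neither of which has a solution in distinct odd integers between 3 and 17.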

Robert’s paradox [reading in Reading]

Posted in Statistics, Travel, University life on January 28, 2015 by xi'an

On Wednesday afternoon, Richard Everitt and Dennis Prangle organised an RSS workshop in Reading on Bayesian Computation. And invited me to give a talk there, along with John Hemmings, Christophe Andrieu, Marcelo Pereyra, and themselves. Given the proximity between Oxford and Reading, this felt like a neighbourly visit, especially when I realised I could take my bike on the train! John Hemmings gave a presentation on synthetic models for climate change and their evaluation, which could have some connection with Tony O’Hagan’s recent talk in Warwick. Dennis told us about “the lazier ABC” version in connection with his “lazy ABC” paper. [From my very personal view] Marcelo expanded on the Moreau-Yosida expansion he had presented in Bristol about six months ago, with the notion that using a Gaussian tail regularisation of a super-Gaussian target in a Langevin algorithm could produce better convergence guarantees than the competition, including Hamiltonian Monte Carlo. Luke Kelly spoke about an extension of phylogenetic trees using a notion of lateral transfer. And Richard introduced a notion of biased approximation to Metropolis-Hastings acceptance ratios, a notion that I found quite attractive if not completely formalised, as there should be a Monte Carlo equivalent to the improvement brought by biased Bayes estimators over unbiased classical counterparts. (Repeating a remark by Persi Diaconis made more than 20 years ago.) Christophe Andrieu also presented some recent developments of his on exact approximations à la Andrieu and Roberts (2009).

Since those developments are not yet finalised into an archived document, I will not delve into the details, but I found the results quite impressive and worth exploring, so I am looking forward to the forthcoming publication. One aspect of the talk which I can comment on is related to the exchange algorithm of Murray et al. (2006). Let me recall that this algorithm handles doubly intractable problems (i.e., likelihoods with intractable normalising constants like the Ising model), by introducing auxiliary variables with the same distribution as the data given the new value of the parameter and computing an augmented acceptance ratio whose expectation is the targeted acceptance ratio and which conveniently removes the unknown normalising constants. This auxiliary scheme produces a random acceptance ratio and hence differs from the exact-approximation MCMC approach, which targets directly the intractable likelihood. It somewhat replaces the unknown constant with the density taken at a plausible realisation, hence providing a proper scale. At least for the new value. I wonder if a comparison has been conducted between both versions, the naïve intuition being that the ratio of estimates should be more variable than the estimate of the ratio. More generally, it seemed to me [during the introductory part of Christophe’s talk] that those different methods always faced a harmonic mean danger when being phrased as expectations of ratios, since those ratios were not necessarily square integrable. And not necessarily bounded. Hence my rather gratuitous suggestion of using other tools than the expectation, like maybe a median, thus circling back to the biased estimators of Richard. (And later cycling back, unscathed, to Reading station!)
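For readers unfamiliar with the exchange algorithm, here is a minimal sketch on a toy example of my own where the normalising constant is actually known, so that the output can be checked: exponential data with unnormalised density exp(-θy), treated as if the constant θ were unavailable, and a Gamma(a,b) prior. The proposal scale, prior parameters and chain length are arbitrary assumptions. Nothing here comes from Christophe’s talk.

```r
set.seed(3)
n <- 30
y <- rexp(n, rate = 2)            # simulated data, for illustration only
a <- 1; b <- 1                    # assumed Gamma(a, b) prior on theta

log_q     <- function(x, theta) -theta * sum(x)   # unnormalised log-likelihood
log_prior <- function(theta) dgamma(theta, shape = a, rate = b, log = TRUE)

n_iter <- 20000
theta  <- numeric(n_iter)
cur    <- 1
for (t in 1:n_iter) {
  prop <- cur + rnorm(1, 0, 0.4)              # symmetric random-walk proposal
  if (prop > 0) {                             # proposals outside the support are rejected
    w <- rexp(n, rate = prop)                 # auxiliary data simulated at the proposed value
    # augmented ratio: the normalising constants of q cancel between y and w terms
    log_ratio <- log_q(y, prop) + log_prior(prop) + log_q(w, cur) -
      log_q(y, cur) - log_prior(cur) - log_q(w, prop)
    if (log(runif(1)) < log_ratio) cur <- prop
  }
  theta[t] <- cur
}

post_mean_exchange <- mean(theta[-(1:2000)])  # after discarding a burn-in
post_mean_exact    <- (a + n) / (b + sum(y))  # Gamma(a + n, b + sum(y)) posterior mean
c(exchange = post_mean_exchange, exact = post_mean_exact)
```

The acceptance ratio is random through `w`, which is the feature contrasted above with exact-approximation schemes. The sketch relies on exact simulation of the auxiliary data, something that is itself a hurdle for models like the Ising model.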

On top of the six talks in the afternoon, there was a small poster session during the tea break, where I met Garth Holloway, working in agricultural economics, who happened to be an (unsuspected) fan of mine, to the point of entitling his poster “Robert’s paradox”!!! The problem covered by this undeserved denomination connected to the bias in Chib’s approximation of the evidence in mixture estimation, a phenomenon that I related to the exchangeability of the component parameters in an earlier paper or set of slides. So “my” paradox is essentially label (un)switching and its consequences. For which I cannot claim any fame! Still, I am looking forward to the completed version of this poster to discuss Garth’s solution, but we had a beer together after the talks, drinking to the health of our mutual friend John Deely.

importance sampling schemes for evidence approximation [revised]

Posted in Statistics, University life on November 18, 2014 by xi'an

After a rather intense period of new simulations and versions, Juong Een (Kate) Lee and I have now resubmitted our paper on (some) importance sampling schemes for evidence approximation in mixture models to Bayesian Analysis. There is no fundamental change in the new version but rather a more detailed description of what those importance schemes mean in practice. The original idea in the paper is to improve upon the Rao-Blackwellisation solution proposed by Berkoff et al. (2002) and later by Marin et al. (2005) to avoid the impact of label switching on Chib’s formula. The Rao-Blackwellisation consists in averaging over all permutations of the labels while the improvement relies on the elimination of useless permutations, namely those that produce a negligible conditional density in Chib’s (candidate’s) formula. While the improvement implies truncating the overall sum and hence induces a potential bias (which was the concern of one referee), the determination of the irrelevant permutations after relabelling next to a single mode does not appear to cause any bias, while reducing the computational overload. Referees also made us aware of many recent proposals that lead to different evidence approximations, albeit not directly related to our purpose. (One was Rodrigues and Walker, 2014, discussed and commented in a recent post.)
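In symbols, and with notation that is mine rather than the paper’s (k components, latent allocations z⁽ᵗ⁾ from T MCMC iterations, S_k the set of permutations of the k labels), the quantities discussed above can be written as follows. The truncated version reflects my reading of the description above and should be taken as a sketch, not as the exact estimator of the paper.

```latex
% Chib's (candidate's) formula at an arbitrary point \theta^*
m(y) = \frac{f(y\mid\theta^*)\,\pi(\theta^*)}{\pi(\theta^*\mid y)}

% Rao-Blackwellised estimate of the posterior density, averaged over all label permutations
\hat\pi(\theta^*\mid y) = \frac{1}{T\,k!} \sum_{\sigma\in S_k} \sum_{t=1}^{T}
    \pi\big(\sigma(\theta^*)\mid y, z^{(t)}\big)

% Truncated version: for each t, keep only the subset \mathfrak{S}_t \subset S_k of
% permutations whose conditional density is not negligible
\tilde\pi(\theta^*\mid y) = \frac{1}{T\,k!} \sum_{t=1}^{T} \sum_{\sigma\in\mathfrak{S}_t}
    \pi\big(\sigma(\theta^*)\mid y, z^{(t)}\big)
```

The full average costs k! density evaluations per iteration, which is what makes dropping the negligible terms attractive as soon as k exceeds a handful of components.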