## Bayesian methods in cosmology

Posted in Statistics with tags , , , , , , , , , , , , on January 18, 2017 by xi'an

A rather massive document was arXived a few days ago by Roberto Trotta on Bayesian methods for cosmology, in conjunction with an earlier winter school, the 44th Saas Fee Advanced Course on Astronomy and Astrophysics, “Cosmology with wide-field surveys”. While I never had the opportunity to give a winter school in Saas Fee, I will give next month a course on ABC to statistics graduates in another Swiss dream location, Les Diablerets.  And next Fall a course on ABC again but to astronomers and cosmologists, in Autrans, near Grenoble.

The course document is an 80 pages introduction to probability and statistics, in particular Bayesian inference and Bayesian model choice. Including exercises and references. As such, it is rather standard in that the material could be found as well in textbooks. Statistics textbooks.

When introducing the Bayesian perspective, Roberto Trotta advances several arguments in favour of this approach. The first one is that it is generally easier to follow a Bayesian approach when compared with seeking a non-Bayesian one, while recovering long-term properties. (Although there are inconsistent Bayesian settings.) The second one is that a Bayesian modelling allows to handle naturally nuisance parameters, because there are essentially no nuisance parameters. (Even though preventing small world modelling may lead to difficulties as in the Robbins-Wasserman paradox.) The following two reasons are the incorporation of prior information and the appeal on conditioning on the actual data.

The document also includes this above and nice illustration of the concentration of measure as the dimension of the parameter increases. (Although one should not over-interpret it. The concentration does not occur in the same way for a normal distribution for instance.) It further spends quite some space on the Bayes factor, its scaling as a natural Occam’s razor,  and the comparison with p-values, before (unsurprisingly) introducing nested sampling. And the Savage-Dickey ratio. The conclusion of this model choice section proposes some open problems, with a rather unorthodox—in the Bayesian sense—line on the justification of priors and the notion of a “correct” prior (yeech!), plus an musing about adopting a loss function, with which I quite agree.

## sleeping beauty

Posted in Kids, Statistics, Books with tags , , , , , , , , , on December 24, 2016 by xi'an

Through X validated, W. Huber made me aware of this probability paradox [or para-paradox] of which I had never heard before. One of many guises of this paradox goes as follows:

Shahrazad is put to sleep on Sunday night. Depending on the hidden toss of a fair coin, she is awaken either once (Heads) or twice (Tails). After each awakening, she gets back to sleep and forget that awakening. When awakened, what should her probability of Heads be?

My first reaction is to argue that Shahrazad does not gain information between the time she goes to sleep when the coin is fair and the time(s) she is awaken, apart from being awaken, since she does not know how many times she has been awaken, so the probability of Heads remains ½. However, when thinking more about it on my bike ride to work, I thought of the problem as a decision theory or betting problem, which makes ⅓ the optimal answer.

I then read [if not the huge literature] a rather extensive analysis of the paradox by Ciweski, Kadane, Schervish, Seidenfeld, and Stern (CKS³), which concludes at roughly the same thing, namely that, when Monday is completely exchangeable with Tuesday, meaning that no event can bring any indication to Shahrazad of which day it is, the posterior probability of Heads does not change (Corollary 1) but that a fair betting strategy is p=1/3, with the somewhat confusing remark by CKS³ that this may differ from her credence. But then what is the point of the experiment? Or what is the meaning of credence? If Shahrazad is asked for an answer, there must be a utility or a penalty involved otherwise she could as well reply with a probability of p=-3.14 or p=10.56… This makes for another ill-defined aspect of the “paradox”.

Another remark about this ill-posed nature of the experiment is that, when imagining running an ABC experiment, I could only come with one where the fair coin is thrown (Heads or Tails) and a day (Monday or Tuesday) is chosen at random. Then every proposal (Heads or Tails) is accepted as an awakening, hence the posterior on Heads is the uniform prior. The same would not occurs if we consider the pair of awakenings under Tails as two occurrences of (p,E), but this does not sound (as) correct since Shahrazad only knows of one E: to paraphrase Jeffreys, this is an unobservable result that may have not occurred. (Or in other words, Bayesian learning is not possible on Groundhog Day!)

## asymptotically exact inference in likelihood-free models [a reply from the authors]

Posted in R, Statistics with tags , , , , , , , , , , , , , , , , , on December 1, 2016 by xi'an

[Following my post of lastTuesday, Matt Graham commented on the paper with force détails. Here are those comments. A nicer HTML version of the Markdown reply below is also available on Github.]

Thanks for the comments on the paper!

A few additional replies to augment what Amos wrote:

This however sounds somewhat intense in that it involves a quasi-Newton resolution at each step.

The method is definitely computationally expensive. If the constraint function is of the form of a function from an M-dimensional space to an N-dimensional space, with MN, for large N the dominant costs at each timestep are usually the constraint Jacobian (c/u) evaluation (with reverse-mode automatic differentiation this can be evaluated at a cost of O(N) generator / constraint evaluations) and Cholesky decomposition of the Jacobian product (c/u)(c/u) with O(N³) cost (though in many cases e.g. i.i.d. or Markovian simulated data, structure in the generator Jacobian can be exploited to give a significantly reduced cost). Each inner Quasi-Newton update involves a pair of triangular solve operations which have a O(N²) cost, two matrix-vector multiplications with O(MN) cost, and a single constraint / generator function evaluation; the number of Quasi-Newton updates required for convergence in the numerical experiments tended to be much less than N hence the Quasi-Newton iteration tended not to be the main cost.

The high computation cost per update is traded off however with often being able to make much larger proposed moves in high-dimensional state spaces with a high chance of acceptance compared to ABC MCMC approaches. Even in the relatively small Lotka-Volterra example we provide which has an input dimension of 104 (four inputs which map to ‘parameters’, and 100 inputs which map to ‘noise’ variables), the ABC MCMC chains using the coarse ABC kernel radius ϵ=100 with comparably very cheap updates were significantly less efficient in terms of effective sample size / computation time than the proposed constrained HMC approach. This was in large part due to the elliptical slice sampling updates in the ABC MCMC chains generally collapsing down to very small moves even for this relatively coarse ϵ. Performance was even worse using non-adaptive ABC MCMC methods and for smaller ϵ, and for higher input dimensions (e.g. using a longer sequence with correspondingly more random inputs) the comparison becomes even more favourable for the constrained HMC approach. Continue reading

## asymptotically exact inference in likelihood-free models

Posted in Books, pictures, Statistics with tags , , , , , , , on November 29, 2016 by xi'an

“We use the intuition that inference corresponds to integrating a density across the manifold corresponding to the set of inputs consistent with the observed outputs.”

Following my earlier post on that paper by Matt Graham and Amos Storkey (University of Edinburgh), I now read through it. The beginning is somewhat unsettling, albeit mildly!, as it starts by mentioning notions like variational auto-encoders, generative adversial nets, and simulator models, by which they mean generative models represented by a (differentiable) function g that essentially turn basic variates with density p into the variates of interest (with intractable density). A setting similar to Meeds’ and Welling’s optimisation Monte Carlo. Another proximity pointed out in the paper is Meeds et al.’s Hamiltonian ABC.

“…the probability of generating simulated data exactly matching the observed data is zero.”

The section on the standard ABC algorithms mentions the fact that ABC MCMC can be (re-)interpreted as a pseudo-marginal MCMC, albeit one targeting the ABC posterior instead of the original posterior. The starting point of the paper is the above quote, which echoes a conversation I had with Gabriel Stolz a few weeks ago, when he presented me his free energy method and when I could not see how to connect it with ABC, because having an exact match seemed to cancel the appeal of ABC, all parameter simulations then producing an exact match under the right constraint. However, the paper maintains this can be done, by looking at the joint distribution of the parameters, latent variables, and observables. Under the implicit restriction imposed by keeping the observables constant. Which defines a manifold. The mathematical validation is achieved by designing the density over this manifold, which looks like

$p(u)\left|\frac{\partial g^0}{\partial u}\frac{\partial g^0}{\partial u}^\text{T}\right|^{-\textonehalf}$

if the constraint can be rewritten as g⁰(u)=0. (This actually follows from a 2013 paper by Diaconis, Holmes, and Shahshahani.) In the paper, the simulation is conducted by Hamiltonian Monte Carlo (HMC), the leapfrog steps consisting of an unconstrained move followed by a projection onto the manifold. This however sounds somewhat intense in that it involves a quasi-Newton resolution at each step. I also find it surprising that this projection step does not jeopardise the stationary distribution of the process, as the argument found therein about the approximation of the approximation is not particularly deep. But the main thing that remains unclear to me after reading the paper is how the constraint that the pseudo-data be equal to the observable data can be turned into a closed form condition like g⁰(u)=0. As mentioned above, the authors assume a generative model based on uniform (or other simple) random inputs but this representation seems impossible to achieve in reasonably complex settings.

## rare events for ABC

Posted in Books, Mountains, pictures, Statistics, Travel, University life with tags , , , , , , , on November 24, 2016 by xi'an

Dennis Prangle, Richard G. Everitt and Theodore Kypraios just arXived a new paper on ABC, aiming at handling high dimensional data with latent variables, thanks to a cascading (or nested) approximation of the probability of a near coincidence between the observed data and the ABC simulated data. The approach amalgamates a rare event simulation method based on SMC, pseudo-marginal Metropolis-Hastings and of course ABC. The rare event is the near coincidence of the observed summary and of a simulated summary. This is so rare that regular ABC is forced to accept not so near coincidences. Especially as the dimension increases.  I mentioned nested above purposedly because I find that the rare event simulation method of Cérou et al. (2012) has a nested sampling flavour, in that each move of the particle system (in the sample space) is done according to a constrained MCMC move. Constraint derived from the distance between observed and simulated samples. Finding an efficient move of that kind may prove difficult or impossible. The authors opt for a slice sampler, proposed by Murray and Graham (2016), however they assume that the distribution of the latent variables is uniform over a unit hypercube, an assumption I do not fully understand. For the pseudo-marginal aspect, note that while the approach produces a better and faster evaluation of the likelihood, it remains an ABC likelihood and not the original likelihood. Because the estimate of the ABC likelihood is monotonic in the number of terms, a proposal can be terminated earlier without inducing a bias in the method.

This is certainly an innovative approach of clear interest and I hope we will discuss it at length at our BIRS ABC 15w5025 workshop next February. At this stage of light reading, I am slightly overwhelmed by the combination of so many computational techniques altogether towards a single algorithm. The authors argue there is very little calibration involved, but so many steps have to depend on as many configuration choices.

## an attempt at EP-ABC from scratch, nothing more… [except for a few bugs]

Posted in R, Statistics, University life with tags , , , , , , on October 19, 2016 by xi'an

Following a request from one of the reviewers of our chapter Likelihood-free model choice, I tried to run EP-ABC on a toy problem and to compare it with the outcome of a random forest ABC. Literally starting from scratch, namely from the description found in Simon and Nicolas’ JASA paper.  To run my test, I chose as my modelled data an exponential Exp(λ) sample of size 20, with a Gaussian N(0,1) prior on the log parameter (here replicated 100 times):

n=20
trueda=matrix(rexp(100*n),ncol=n)
#prior values
er0=0
Q0=1

Then I applied the formulas found in the paper for approximating the evidence, first defining the log normalising constant for an unnormalised Gaussian density as on the top of p.318

Psi=function(er,Q){
-.5*(log(Q/2/pi)-Q*er*er)}

[which exhibited a typo in the paper, as Simon Barthelmé figured out after emails exchanges that the right formulation was the inverse]

Psi=function(er,Q){
-.5*(log(Q/2/pi)-er*er/Q)}

and iterating the site updates as described in Sections 2.2, 3.1 and Algorithms 1 and 3:

ep=function(dat,M=1e4,eps,Nep=10){
n=length(dat)
#global and ith natural parameters for Gaussian approximations
#initialisation:
Qi=rep(0,n)
Q=Q0+sum(Qi)
eri=rep(0,n)
er=er0+sum(eri)
#constants Ci in (2.6)
normz=rep(1,n)
for (t in 1:Nep){
for (i in sample(1:n)){
#site i update
Qm=Q-Qi[i] #m for minus i
erm=er-eri[i]
#ABC sample
thetas=rnorm(M,me=erm/Qm,sd=1/sqrt(Qm))
dati=rexp(M,rate=exp(thetas))
#update by formula (3.3)
Mh=sum((abs(dati-dat[i])<eps))
mu=mean(thetas[abs(dati-dat[i])<eps])
sig=var(thetas[abs(dati-dat[i])<eps])
if (Mh>1e3){
#enough proposals accepted
Qi[i]=1/sig-Qm
eri[i]=mu/sig-erm
Q=Qm+Qi[i]
er=erm+eri[i]
#update of Ci as in formula (2.7)
#with correction bottom of p.319
normz[i]=log(Mh/M/2/eps)- Psi(er,Q)+Psi(erm,Qm)
}}}
#approximation of the log evidence as on p.318
return(sum(normz)+Psi(er,Q)-Psi(er0,Q0)) }

except that I made an R beginner’s mistake (!) when calling the normal simulation to be

thetas=rnorm(M,me=erm/Qm)/sqrt(Qm)

to be compared with the genuine evidence under a conjugate Exp(1) prior [values of the evidence should differ but should not differ that much]

ev=function(dat){
n=length(dat)
lgamma(n+1)-(n+1)*log(1+sum(dat))
}

After correcting for those bugs and too small sample sizes, thanks to Simon kind-hearted help!, running this code results in minor discrepancies between both quantities:

evid=ovid=rep(0,100)
M=1e4 #number of ABC samples
propep=.1 #tolerance factor
for (t in 1:100){
#tolerance scaled by data
epsi=propep*sd(trueda[t,])
evid[t]=ep(trueda[t,],M,epsi)
ovid[t]=ev(trueda[t,])
}
plot(evid,ovid,cex=.6,pch=19)

as shown by the comparison below between the evidence and the EP-ABC approximations to the evidence, called epabence (the dashed line is the diagonal):

Obviously, this short (if lengthy at my scale) experiment is not intended to conclude that EP-ABC work or does not work. (Simon also provided additional comparison with the genuine evidence, that is under the right prior, and with the zero Monte-Carlo version of the EP-ABC evidence, achieving a high concentration over the diagonal.) I just conclude that the method does require a certain amount of calibration to become stable. Calibrations steps that are clearly spelled out in the paper (adaptive choice of M, stabilisation by quasi-random samples, stabilisation by stochastic optimisation steps and importance sampling), but also imply that EP-ABC is also “far from routine use because it make take days to tune on real problems” (pp.315-316). Thinking about the reasons for such discrepancy over the past days, one possible explanation I see for the impact of the calibration parameters is that the validation of EP if any occurs at a numerical rather than Monte Carlo level, hence that the Monte Carlo error must be really small to avoid interfering with the numerical aspects.

## advanced computational methods for complex models in Biology [talk]

Posted in Books, pictures, Statistics, Travel, University life with tags , , , , , , , , , , , , on September 29, 2016 by xi'an

Here are the slides of the presentation I gave at the EPSRC Advanced Computational methods for complex models in Biology at University College London, last week. Introducing random forests as proper summaries for both model choice and parameter estimation (with considerable overlap with earlier slides, obviously!). The other talks of that highly interesting day on computational Biology were mostly about ancestral graphs, using Wright-Fisher diffusions for coalescents, plus a comparison of expectation-propagation and ABC on a genealogy model by Mark Beaumont and the decision theoretic approach to HMM order estimation by Chris Holmes. In addition, it gave me the opportunity to come back to the Department of Statistics at UCL more than twenty years after my previous visit, at a time when my friend Costas Goutis was still there. And to realise it had moved from its historical premises years ago. (I wonder what happened to the two staircases built to reduce frictions between Fisher and Pearson if I remember correctly…)