## automatic variational ABC

Posted in pictures, Statistics with tags , , , , , , , , , , on July 8, 2016 by xi'an

“Stochastic Variational inference is an appealing alternative to the inefficient sampling approaches commonly used in ABC.”

Moreno et al. [including Ted Meeds and Max Welling] recently arXived a paper merging variational inference and ABC. The argument for turning variational is computational speedup. The traditional (in variational inference) divergence decomposition of the log-marginal likelihood is replaced by an ABC version, parameterised in terms of intrinsic generators (i.e., generators that do not depend on cyber-parameters, like the U(0,1) or the N(0,1) generators). Or simulation code in the authors’ terms. Which leads to the automatic aspect of the approach. In the paper the derivation of the gradient is indeed automated.

“One issue is that even assuming that the ABC likelihood is an unbiased estimator of the true likelihood (which it is not), taking the log introduces a bias, so that we now have a biased estimate of the lower bound and thus biased gradients.”

I wonder how much of an issue this is, since we consider the variational lower bound. To be optimised in terms of the parameters of the variational posterior. Indeed, the endpoint of the analysis is to provide an optimal variational approximation, which remains an approximation whether or not the likelihood estimator is unbiased. A more “severe” limitation may be in the inversion constraint, since it seems to eliminate Beta or Gamma distributions. (Even though calling qbeta(runif(1),a,b) definitely is achievable… And not rejected by a Kolmogorov-Smirnov test.)

Incidentally, I discovered through the paper the existence of the Kumaraswamy distribution, which main appeal seems to be the ability to produce a closed-form quantile function, while bearing some resemblance with the Beta distribution. (Another arXival by Baltasar Trancón y Widemann studies some connections between those, but does not tell how to select the parameters to optimise the similarity.)

## debunking a (minor and personal) myth

Posted in Books, Kids, R, Statistics, University life with tags , , , , on September 9, 2015 by xi'an

For quite a while, I entertained the idea that Beta and Dirichlet proposals  were more adequate than (log-)normal random walks proposals for parameters on (0,1) and simplicia (simplices, simplexes), respectively, when running an MCMC. For instance, for p in (0,1) the value of the Markov chain at time t-1, the proposal at time t could be a Be(εp,ε{1-p}) generator, since its mean is equal to p and its variance is proportional to 1/(1+ε). (Although I cannot find track of this notion in my books.) The parameter ε can be calibrated towards a given acceptance rate, like the golden number 0.234 of Gelman, Gilks and Roberts (1996). However, when using this proposal on a mixture model, Kaniav Kamari and myself realised today that there is a catch, namely that pushing ε down to achieve an acceptance rate near 0.234 may end up in disaster, since the parameters of the Beta or of the Dirichlet may become lower than 1, which implies an infinite explosion on some boundaries of the parameter space. An explosion that gets more and more serious as ε decreases to zero. Hence is more and more likely to decrease the acceptance rate, thus to reduce ε, which in turns concentrates even more the support on the boundary and leads to a vicious circle and no convergence to the target acceptance rate… Continue reading

## intuition beyond a Beta property

Posted in Books, Kids, R, Statistics, University life with tags , , , on March 30, 2015 by xi'an

A self-study question on X validated exposed an interesting property of the Beta distribution:

If x is B(n,m) and y is B(n+½,m) then √xy is B(2n,2m)

While this can presumably be established by a mere change of variables, I could not carry the derivation till the end and used instead the moment generating function E[(XY)s/2] since it naturally leads to ratios of B(a,b) functions and to nice cancellations thanks to the ½ in some Gamma functions [and this was the solution proposed on X validated]. However, I wonder at a more fundamental derivation of the property that would stem from a statistical reasoning… Trying with the ratio of Gamma random variables did not work. And the connection with order statistics does not apply because of the ½. Any idea?

## where did the normalising constants go?! [part 2]

Posted in R, Statistics, Travel with tags , , , , , , , on March 12, 2014 by xi'an

Coming (swiftly and smoothly) back home after this wonderful and intense week in Banff, I hugged my loved ones,  quickly unpacked, ran a washing machine, and  then sat down to check where and how my reasoning was wrong. To start with, I experimented with a toy example in R:

# true target is (x^.7(1-x)^.3) (x^1.3 (1-x)^1.7)
# ie a Beta(3,3) distribution

# samples from partial posteriors
N=10^5
sam1=rbeta(N,1.7,1.3)
sam2=rbeta(N,2.3,2.7)

# first version: product of density estimates
dens1=density(sam1,from=0,to=1)
dens2=density(sam2,from=0,to=1)
prod=dens1$y*dens2$y
# normalising by hand
prod=prod*length(dens1$x)/sum(prod) plot(dens1$x,prod,type="l",col="steelblue",lwd=2)

# second version: F-S & P's yin+yang sampling
# with weights proportional to the other posterior

subsam1=sam1[sample(1:N,N,prob=dbeta(sam1,2.3,2.7),rep=T)]
plot(density(subsam1,from=0,to=1),col="steelblue",lwd=2)

subsam2=sam2[sample(1:N,N,prob=dbeta(sam2,1.7,1.3),rep=T)]
plot(density(subsam2,from=0,to=1),col="steelblue",lwd=2)


and (of course!) it produced the perfect fits reproduced below. Writing the R code acted as a developing bath as it showed why we could do without the constants!

Of course”, because the various derivations in the above R code all are clearly independent from the normalising constant: (i) when considering a product of kernel density estimators, as in the first version, this is an approximation of

$\prod_{i=1}^k p_i(\theta)$

as well as of

$\prod_{ i}^k m_i(\theta)$

since the constant does not matter. (ii) When considering a sample from mi and weighting it by the product of the remaining true or estimated mj‘s, this is a sampling weighting resampling simulation from the density proportional to the product and hence, once again, the constants do not matter. At last, (iii) when mixing the two subsamples, since they both are distributed from the product density, the constants do not matter. As I slowly realised when running this morning (trail-running, not code-runninh!, for the very first time in ten days!), the straight-from-the-box importance sampling version on the mixed samples I considered yesterday (to the point of wondering out loud where did the constants go) is never implemented in the cited papers. Hence, the fact that

$\prod_i^k p_i(\theta)\propto \prod_{i}^k m_i(\theta)$

is enough to justify handling the target directly as the product of the partial marginals. End of the mystery. Anticlimactic end, sorry…

## ABC for bivariate betas

Posted in Statistics, University life with tags , , , , , , , on February 19, 2014 by xi'an

Crakel and Flegal just arXived a short paper running ABC for doing inference on the parameters of two families of bivariate betas. And I could not but read it thru. And wonder why ABC was that necessary to handle the model. The said bivariate betas are defined from

$V_1=(U_1+U_5+U_7)/(U_3+U_6+U_8)\,,$

$V_2=(U_2+U_5+U_8)/(U_4+U_6+U_7)$

when

$U_i\sim \text{Ga}(\delta_i,1)$

and

$X_1=V_1/(1+V_1)\,,\ X_2=V_2/(1+V_2)$

This makes each term in the pair Beta and the two components dependent. This construct was proposed by Arnold and Ng (2011). (The five-parameter version cancels the gammas for i=3,4,5.)

Since the pdf of the joint distribution is not available in closed form, Crakel and Flegal zoom on ABC-MCMC as the method of choice and discuss simulation experiments. (The choice of the tolerance ε as an absolute rather than relative value, ε=0.2,0.0.6,0.8, puzzles me, esp. since the distance between the summary statistics is not scaled.) I however wonder why other approaches are impossible. (Or why it is necessary to use this distribution to model correlated betas. Unless I am confused copulas were invented to this effect.) First, this is a latent variable model, so latent variables could be introduced inside an MCMC scheme. A wee bit costly but feasible. Second, several moments of those distributions are known so a empirical likelihood approach could be considered.

## beta HPD

Posted in Books, R, Statistics, Uncategorized, University life with tags , , , , , , , on October 17, 2013 by xi'an

While writing an introductory chapter on Bayesian analysis (in French), I came by the issue of computing an HPD region when the posterior distribution is a Beta B(α,β) distribution… There is no analytic solution and hence I resorted to numerical resolution (provided here for α=117.5, β=115.5):

f=function(p){

# find the symmetric
g=function(x){return(x-p*((1-p)/(1-x))^(115.5/117.5))}
return(uniroot(g,c(.504,.99))$root)} ff=function(alpha){ # find the coverage g=function(x){return(x-p*((1-p)/(1-x))^(115.5/117.5))} return(uniroot(g,c(.011,.49))$root)}


and got the following return:

> ff(.95)
[1] 0.4504879
> f(ff(.95))
[1] 0.5580267


which was enough for my simple book illustration… Since (.450,558) is then the HPD region at credible level 0.95.

## A slice of infinity

Posted in R, Statistics, University life with tags , , , , , , , on July 28, 2011 by xi'an

Peng Yu sent me an email about the conditions for convergence of a Gibbs sampler:

The following statement mentions convergence. But I’m not familiar what the regularity condition is.

“But it is necessary to have a finite probability of moving away from the current state at all times in order to satisfy the regularity conditions on which the whole MCMC theory depends.”

Slice sampler is discussed in your book Monte Carlo Statistical Methods. I think that the “regularity condition” may have been discussed in your book. If so, would you please let me know where it is? Thanks and look forward to hearing from you!

The quote is from Martyn Plummer and deals with a stopping rule in JAGS implementation of the slice sampler. (The correct wording should be “strictly positive probability” rather than “finite probability”, I think.) However, this has nothing to do with a “regularity condition” on the irreducibility of a Markov chain: if a slice sampler is implemented for an unbounded density target, say a Beta(1/2,1/2), there is no irreducibility condition connected with the infiniteness of the density. In theory, (a) the chain never visits the “state” where the density is infinite (if only because we are dealing with a continuous state space) and (b) after visiting a value x with a large density f(x), the slice sampler allows for a move away from it since the slice involves a uniform simulation over (0,f(x)). Deeper properties of the slice sampler (like geometric ergodicity) are explored in, e.g., this JRSS B paper by Gareth Roberts and Jeff Rosenthal and this one in the Annals of Statistics by Radford Neal. In practice, the problem is caused by values of f(x) that cannot be computed and hence produce an error message like

Singularity in likelihood found by Slicer.

If those singularities can be localised, a neighbourhood excluding them should be introduced. (More easily said than done, obviously!)

Here is an example of a slice sampler with the Beta(1/2,1/2) distribution:

#graphics
dote=function(x,y) points(x,y,col="gold",pch=19,cex=.4)
mote=function(x,y,z,w) lines(c(x,z),c(y,w),col="gold",lwd=.5)
cst=dbeta(.5,.5,.5)*.5 #normalising constant
#inverting f(x)=d, 2nd degree equation
hitden=function(d) .5+.5*sqrt(1-4*( cst/ max(d,dbeta(.5,.5,.5)))^2)*c(-1,1)
#output
curve(dbeta(x,.5,.5),0,1,ylab="density",lwd=2,col="steelblue",n=1001)
x=runif(1);u=runif(1)*dbeta(x,.5,.5);dote(x,u)
for (t in 1:100){ #100 slice steps
bo=hitden(u)
nx=sample(c(runif(1,0,bo[1]),runif(1,bo[2],1)),1)
nu=runif(1)*dbeta(nx,.5,.5)
mote(x,u,nx,nu)
x=nx;u=nu;dote(x,u)
}


which clearly explores the whole area under the Beta(1/2,1/2) density. Even when started at a large density value like f(.999999), it eventually leaves the vicinity of this highly improbable value.