## Nonparametric applications of Bayesian inference

Posted in Books, Statistics, University life with tags , , , , , , on April 22, 2016 by xi'an

Gary Chamberlain and Guido Imbens published this paper in the Journal of Business & Economic Statistics in 2003. I just came to read it in connection with the paper by Luke Bornn, Niel Shephard and Reza Solgi that I commented a few months ago. The setting is somewhat similar: given a finite support distribution with associated probability parameter θ, a natural prior on θ is a Dirichlet prior. This prior induces a prior on transforms of θ, whether or not they are in close form (for instance as the solution of a moment equation E[F(X,β)]=0. As in Bornn et al. In this paper, Chamberlain and Imbens argue in favour of the limiting Dirichlet with all coefficients equal to zero as a way to avoid prior dominating influence when the number of classes J goes to infinity and the data size remains fixed. But they fail to address the issue that the posterior is no longer defined since some classes get unobserved. They consider instead that the parameters corresponding to those classes are equal to zero with probability one, a convention and not a result. (The computational advantage in using the improper prior sounds at best incremental.) The notion of letting some Dirichlet hyper-parameters going to zero is somewhat foreign to a Bayesian perspective as those quantities should be either fixed or distributed according to an hyper-prior, rather than set to converge according to a certain topology that has nothing to do with prior modelling. (Another reason why setting those quantities to zero does not have the same meaning as picking a Dirac mass at zero.)

“To allow for the possibility of an improper posterior distribution…” (p.4)

This is a weird beginning of a sentence, especially when followed by a concept of expected posterior distribution, which is actually a bootstrap expectation. Not as in Bayesian bootstrap, mind. And thus this feels quite orthogonal to the Bayesian approach. I do however find most interesting this notion of constructing a true expected posterior by imposing samples that ensure properness as it reminds me of our approach to mixtures with Jean Diebolt, where (latent) allocations were prohibited to induce improper priors. The bootstrapped posterior distribution seems to be proposed mostly for assessing the impact of the prior modelling, albeit in an non-quantitative manner. (I fail to understand how the very small bootstrap sample sizes are chosen.)

Obviously, there is a massive difference between this paper and Bornn et al, where the authors use two competing priors in parallel, one on θ and one on β, which induces difficulties in setting priors since the parameter space is concentrated upon a manifold. (In which case I wonder what would happen if one implemented the preposterior idea of Berger and Pérez, 2002, to derive a fixed point solution. That we implemented recently with Diego Salmerón and Juan Antonio Caño in a paper published in Statistica Sinica.. This exhibits a similarity with the above bootstrap proposal in that the posterior gets averaged wrt another posterior.)

## approximate Bayesian inference

Posted in Books, pictures, Statistics, Travel, University life with tags , , , , , , , , on March 23, 2016 by xi'an

Maybe it is just a coincidence, but both most recent issues of Bayesian Analysis have an article featuring approximate Bayesian inference. One is by Daniel Graham and co-authors on Approximate Bayesian Inference for Doubly Robust Estimation, while the other one is by Chris Drovandi and co-authors from QUT on Exact and Approximate Bayesian Inference for Low Integer-Valued Time Series Models with Intractable Likelihoods. The first paper has little connection with ABC. Even though it (a) uses a lot of three letter acronyms [which does not help with speed reading] and (b) relies on moment based and propensity score models. Instead, it relies on Bayesian bootstrap, which suddenly seems to me to be rather connected with empirical likelihood! Except the weights are estimated via a Dirichlet prior instead of being optimised. The approximation lies in using the bootstrap to derive a posterior predictive. I did not spot any assessment or control of the approximation effect in the paper.

“Note that we are always using the full data so avoiding the need to choose a summary statistic” (p.326)

The second paper connects pMCMC with ABC. Plus pseudo-marginals on the side! And even simplified reversible jump MCMC!!! I am far from certain I got every point of the paper, though, especially the notion of dimension reduction associated with this version of reversible jump MCMC. It may mean that latent variables are integrated out in approximate (marginalised) likelihoods [as explicated in Andrieu and Roberts (2009)].

“The difference with the common ABC approach is that we match on observations one-at-a-time” (p.328)

The model that the authors study is an integer value time-series, like the INAR(p) model. Which integer support allows for a non-zero probability of exact matching between simulated and observed data. One-at-a-time as indicated in the above quote. And integer valued tolerances like ε=1 otherwise. In the case auxiliary variables are necessary, the authors resort to the alive particle filter of Jasra et al. (2013), which main point is to produce an unbiased estimate of the (possibly approximate) likelihood, to be exploited by pseudo-marginal techniques. However, unbiasedness sounds less compelling when moving to approximate methods, as illustrated by the subsequent suggestion to use a more stable estimate of the log-likelihood. In fact, when the tolerance ε is positive, the pMCMC acceptance probability looks quite close to an ABC-MCMC probability when relying on several pseudo-data simulations. Which is unbiased for the “right” approximate target. A fact that may actually holds for all ABC algorithms. One quite interesting aspect of the paper is its reflection about the advantage of pseudo-marginal techniques for RJMCMC algorithms since they allow for trans-dimension moves to be simplified, as they consider marginals on the space of interest. Up to this day, I had not realised Andrieu and Roberts (2009) had a section on this aspect… I am still unclear about the derivation of the posterior probabilities of the models under comparison, unless it is a byproduct of the RJMCMC algorithm. A last point is that, for some of the Markov models used in the paper, the pseudo observations can be produced as a random one-time move away from the current true observation, which makes life much easier for ABC and explain why exact simulations can sometimes be produced. (A side note: the authors mention on p.326 that EP is only applicable when the posterior is from an exponential family, while my understanding is that it uses an exponential family to approximate the true posterior.)

## debunking a (minor and personal) myth

Posted in Books, Kids, R, Statistics, University life with tags , , , , on September 9, 2015 by xi'an

For quite a while, I entertained the idea that Beta and Dirichlet proposals  were more adequate than (log-)normal random walks proposals for parameters on (0,1) and simplicia (simplices, simplexes), respectively, when running an MCMC. For instance, for p in (0,1) the value of the Markov chain at time t-1, the proposal at time t could be a Be(εp,ε{1-p}) generator, since its mean is equal to p and its variance is proportional to 1/(1+ε). (Although I cannot find track of this notion in my books.) The parameter ε can be calibrated towards a given acceptance rate, like the golden number 0.234 of Gelman, Gilks and Roberts (1996). However, when using this proposal on a mixture model, Kaniav Kamari and myself realised today that there is a catch, namely that pushing ε down to achieve an acceptance rate near 0.234 may end up in disaster, since the parameters of the Beta or of the Dirichlet may become lower than 1, which implies an infinite explosion on some boundaries of the parameter space. An explosion that gets more and more serious as ε decreases to zero. Hence is more and more likely to decrease the acceptance rate, thus to reduce ε, which in turns concentrates even more the support on the boundary and leads to a vicious circle and no convergence to the target acceptance rate… Continue reading

## approximate approximate Bayesian computation [not a typo!]

Posted in Books, Statistics, University life with tags , , , , , on January 12, 2015 by xi'an

“Our approach in handling the model uncertainty has some resemblance to statistical ‘‘emulators’’ (Kennedy and O’Hagan, 2001), approximative methods used to express the model uncertainty when simulating data under a mechanistic model is computationally intensive. However, emulators are often motivated in the context of Gaussian processes, where the uncertainty in the model space can be reasonably well modeled by a normal distribution.”

Pierre Pudlo pointed out to me the paper AABC: Approximate approximate Bayesian computation for inference in population-genetic models by Buzbas and Rosenberg that just appeared in the first 2015 issue of Theoretical Population Biology. Despite the claim made above, including a confusion on the nature of Gaussian processes, I am rather reserved about the appeal of this AA rated ABC…

“When likelihood functions are computationally intractable, likelihood-based inference is a challenging problem that has received considerable attention in the literature (Robert and Casella, 2004).”

The ABC approach suggested therein is doubly approximate in that simulation from the sampling distribution is replaced with simulation from a substitute cheaper model. After a learning stage using the costly sampling distribution. While there is convergence of the approximation to the genuine ABC posterior under infinite sample and Monte Carlo sample sizes, there is no correction for this approximation and I am puzzled by its construction. It seems (see p.34) that the cheaper model is build by a sort of weighted bootstrap: given a parameter simulated from the prior, weights based on its distance to a reference table are constructed and then used to create a pseudo-sample by weighted sampling from the original pseudo-samples. Rather than using a continuous kernel centred on those original pseudo-samples, as would be the suggestion for a non-parametric regression. Each pseudo-sample is accepted only when a distance between the summary statistics is small enough. This bootstrap flavour is counter-intuitive in that it requires a large enough sample from the true  sampling distribution to operate with some confidence… I also wonder at what happens when the data is not iid.  (I added the quote above as another source of puzzlement, since the book is about cases when the likelihood is manageable.)

## A discussion on Bayesian analysis : Selecting Noninformative Priors

Posted in Statistics with tags , , , , , , on February 26, 2014 by xi'an

Following an earlier post on the American Statistician 2013 paper by Seaman III and co-authors, Hidden dangers of specifying noninformative priors, my PhD student Kaniav Kamary wrote a paper re-analysing the examples processed by those authors and concluding to the stability of the posterior distributions of the parameters and to the effect of the noninformative prior being essentially negligible. (This is the very first paper quoting verbatim from the ‘Og!) Kaniav logically submitted the paper to the American Statistician.

## hidden dangers of noninformative priors

Posted in Books, Statistics, University life with tags , , , , , on November 21, 2013 by xi'an

Last year, John Seaman (III), John Seaman (Jr.), and James Stamey published a paper in The American Statistician with the title Hidden dangers of specifying noninformative priors. (It does not seem to be freely available on-line.) I gave it to read to my PhD students, meaning to read towards the goal of writing a critical reply to the authors. In the meanwhile, here are my own two-cents on the paper.

“Applications typically employ Markov chain Monte Carlo (MCMC) methods to obtain posterior features, resulting in the need for proper priors, even when the modeler prefers that priors be relatively noninformative.” (p.77)

Apart from the above quote, which confuses proper priors with proper posteriors (maybe as the result of a contagious BUGS!), and which is used to focus solely and sort-of inappropriately on proper priors, there is no hard fact to bite in, but rather a collection of soft decisions and options that end up weakly supporting the authors’ thesis. (Obviously, following an earlier post, there is no such thing as a “noninformative” prior.) The paper is centred on four examples where a particular choice of (“noninformative”) prior leads to peaked or informative priors on some transform(s) of the parameters. Note that there is no definition provided for informative, non-informative, diffuse priors, except those found in BUGS with “extremely large variance” (p.77). (The quote below seems to settle on a uniform prior if one understands the “likely” as evaluated through the posterior density.) The argument of the authors is that “if parameters with diffuse proper priors are subsequently transformed, the resulting induced priors can, of course, be far from diffuse, possibly resulting in unintended influence on the posterior of the transformed parameters” (p.77).

## Estimating the number of species

Posted in Statistics with tags , , , , , on November 20, 2009 by xi'an

Bayesian Analysis just published on-line a paper by Hongmei Zhang and Hal Stern on a (new) Bayesian analysis of the problem of estimating the number of unseen species within a population. This problem has always fascinated me, as it seems at first sight to be an impossible problem, how can you estimate the number of species you do not know?! The approach relates to capture-recapture models, with an extra hierarchical layer for the species. The Bayesian analysis of the model obviously makes a lot of sense, with the prior modelling being quite influential. Zhang and Stern use a hierarchical Dirichlet prior on the capture probabilities, $\theta_i$, when the captures follow a multinomial model

$y|\theta,S \sim \mathcal{M}(N, \theta_1,\ldots,\theta_S)$

where $N=\sum_i y_i$ the total number of observed individuals,

$\mathbf{\theta}|S \sim \mathcal{D}(\alpha,\ldots,\alpha)$

and

$\pi(\alpha,S) = f(1-f)^{S-S_\text{min}} \alpha^{-3/2}$

forcing the coefficients of the Dirichlet prior towards zero. The paper also covers predictive design, analysing the capture effort corresponding to a given recovery rate of species. The overall approach is not immensely innovative in its methodology, the MCMC part being rather straightforward, but the predictive abilities of the model are nonetheless interesting.

The previously accepted paper in Bayesian Analysis is a note by Ron Christensen about an inconsistent Bayes estimator that you may want to use in an advanced Bayesian class. For all practical purposes, it should not overly worry you, since the example involves a sampling distribution that is normal when its parameter is irrational and is Cauchy otherwise. (The prior is assumed to be absolutely continuous wrt the Lebesgue measure and it thus gives mass zero to the set of rational numbers $\mathbb{Q}$. The fact that $\mathbb{Q}$ is dense in $\mathbb{R}$ is irrelevant from a measure-theoretic viewpoint.)