## likelihood-free inference in high-dimensional models

Posted in Books, R, Statistics, University life with tags , , , , , , , , , on September 1, 2015 by xi'an

“…for a general linear model (GLM), a single linear function is a sufficient statistic for each associated parameter…” The recently arXived paper “Likelihood-free inference in high-dimensional models“, by Kousathanas et al. (July 2015), proposes an ABC resolution of the dimensionality curse [when the dimension of the parameter and of the corresponding summary statistics] by turning Gibbs-like and by using a component-by-component ABC-MCMC update that allows for low dimensional statistics. In the (rare) event there exists a conditional sufficient statistic for each component of the parameter vector, the approach is just as justified as when using a generic ABC-Gibbs method based on the whole data. Otherwise, that is, when using a non-sufficient estimator of the corresponding component (as, e.g., in a generalised [not general!] linear model), the approach is less coherent as there is no joint target associated with the Gibbs moves. One may therefore wonder at the convergence properties of the resulting algorithm. The only safe case [in dimension 2] is when one of the restricted conditionals does not depend on the other parameter. Note also that each Gibbs step a priori requires the simulation of a new pseudo-dataset, which may be a major imposition on computing time. And that setting the tolerance for each parameter is a delicate calibration issue because in principle the tolerance should depend on the other component values. Continue reading

## an ABC experiment

Posted in Books, pictures, R, Statistics, University life with tags , , , , , , , , on November 24, 2014 by xi'an In a cross-validated forum exchange, I used the code below to illustrate the working of an ABC algorithm:

#normal data with 100 observations
n=100
x=rnorm(n)
#observed summaries

#normal x gamma prior
priori=function(N){
return(cbind(rnorm(N,sd=10),
1/sqrt(rgamma(N,shape=2,scale=5))))
}

ABC=function(N,alpha=.05){

prior=priori(N) #reference table

#pseudo-data
summ=matrix(0,N,2)
for (i in 1:N){
xi=rnorm(n)*prior[i,2]+prior[i,1]
}

#normalisation factor for the distance
#distance
#selection
posterior=prior[dist<quantile(dist,alpha),]}


Hence I used the median and the mad as my summary statistics. And the outcome is rather surprising, for two reasons: the first one is that the posterior on the mean μ is much wider than when using the mean and the variance as summary statistics. This is not completely surprising in that the latter are sufficient, while the former are not. Still, the (-10,10) range on the mean is way larger… The second reason for surprise is that the true posterior distribution cannot be derived since the joint density of med and mad is unavailable. After thinking about this for a while, I went back to my workbench to check the difference with using mean and variance. To my greater surprise, I found hardly any difference! Using the almost exact ABC with 10⁶ simulations and a 5% subsampling rate returns exactly the same outcome. (The first row above is for the sufficient statistics (mean,standard deviation) while the second row is for the (median,mad) pair.) Playing with the distance does not help. The genuine posterior output is quite different, as exposed on the last row of the above, using a basic Gibbs sampler since the posterior is not truly conjugate.

## a general framework for updating belief functions

Posted in Books, Statistics, University life with tags , , , , , , , , , on July 15, 2013 by xi'an

Pier Giovanni Bissiri, Chris Holmes and Stephen Walker have recently arXived the paper related to Sephen’s talk in London for Bayes 250. When I heard the talk (of which some slides are included below), my interest was aroused by the facts that (a) the approach they investigated could start from a statistics, rather than from a full model, with obvious implications for ABC, & (b) the starting point could be the dual to the prior x likelihood pair, namely the loss function. I thus read the paper with this in mind. (And rather quickly, which may mean I skipped important aspects. For instance, I did not get into Section 4 to any depth. Disclaimer: I wasn’t nor is a referee for this paper!)

The core idea is to stick to a Bayesian (hardcore?) line when missing the full model, i.e. the likelihood of the data, but wishing to infer about a well-defined parameter like the median of the observations. This parameter is model-free in that some degree of prior information is available in the form of a prior distribution. (This is thus the dual of frequentist inference: instead of a likelihood w/o a prior, they have a prior w/o a likelihood!) The approach in the paper is to define a “posterior” by using a functional type of loss function that balances fidelity to prior and fidelity to data. The prior part (of the loss) ends up with a Kullback-Leibler loss, while the data part (of the loss) is an expected loss wrt to l(THETASoEUR,x), ending up with the definition of a “posterior” that is $\exp\{ -l(\theta,x)\} \pi(\theta)$

the loss thus playing the role of the log-likelihood.

I like very much the problematic developed in the paper, as I think it is connected with the real world and the complex modelling issues we face nowadays. I also like the insistence on coherence like the updating principle when switching former posterior for new prior (a point sorely missed in this book!) The distinction between M-closed M-open, and M-free scenarios is worth mentioning, if only as an entry to the Bayesian processing of pseudo-likelihood and proxy models. I am however not entirely convinced by the solution presented therein, in that it involves a rather large degree of arbitrariness. In other words, while I agree on using the loss function as a pivot for defining the pseudo-posterior, I am reluctant to put the same faith in the loss as in the log-likelihood (maybe a frequentist atavistic gene somewhere…) In particular, I think some of the choices are either hard or impossible to make and remain unprincipled (despite a call to the LP on page 7).  I also consider the M-open case as remaining unsolved as finding a convergent assessment about the pseudo-true parameter brings little information about the real parameter and the lack of fit of the superimposed model. Given my great expectations, I ended up being disappointed by the M-free case: there is no optimal choice for the substitute to the loss function that sounds very much like a pseudo-likelihood (or log thereof). (I thought the talk was more conclusive about this, I presumably missed a slide there!) Another great expectation was to read about the proper scaling of the loss function (since L and wL are difficult to separate, except for monetary losses). The authors propose a “correct” scaling based on balancing both faithfulness for a single observation, but this is not a completely tight argument (dependence on parametrisation and prior, notion of a single observation, &tc.)

The illustration section contains two examples, one of which is a full-size or at least challenging  genetic data analysis. The loss function is based on a logistic  pseudo-likelihood and it provides results where the Bayes factor is in agreement with a likelihood ratio test using Cox’ proportional hazard model. The issue about keeping the baseline function as unkown reminded me of the Robbins-Wasserman paradox Jamie discussed in Varanasi. The second example offers a nice feature of putting uncertainties onto box-plots, although I cannot trust very much the 95%  of the credibles sets. (And I do not understand why a unique loss would come to be associated with the median parameter, see p.25.)

Watch out: Tomorrow’s post contains a reply from the authors!

## Pitman closeness renewal?

Posted in Statistics, University life with tags , , , , on July 26, 2012 by xi'an

As noticed there a few months ago, the Pitman closeness criterion for comparing estimators (through the probability

Pθ(|δ-θ|<|δ’-θ|)

which should be larger than .5 for the first estimator to be deemed “better” or “Pitman closer”) has been “resuscitated” by Canadian researchers. In 1993, I wrote a JASA (discussion) paper along with Gene Hwang and Bill Strawderman pointing out the many inconsistencies of this criterion as a decision tool.  It was entitled “Is Pitman Closeness a Reasonable Criterion?” (The answer was in the question, right?!)

In an arXiv posting today, Jozani, Balakrishnan, and Davies propose new characterisations for comparing (in this sense) symmetrically distributed estimators. There is nothing wrong with this mathematical exercise, obviously. However, the approach still seems to suffer from the same decisional inconsistencies as in the past:

1. the results in the paper (see, e.g., Lemma 1 and 2) only apply to independent estimators, which is rather unrealistic (to the point of having the authors applying it to dependent estimators, the sample median X[n/2] versus a fixed index observation, e.g. X3, and again at the end of the paper in the comparison of several order statistics). Having independent estimators to compare is a rather rare situation as one tries to make the most of a given sample;
2. the setup is highly dependent on considering a single (one-dimensional) location parameter, the results do not apply to more general settings (except location-scale cases with scale parameters known to some extent, see Lemma 5) ;
3. some results (see Remark 4) allow to find a whole range of estimators dominating a given (again independent) estimator δ’, but they do not give a ranking of those estimators, except in the weak sense of having the above probability maximal in one of the estimators δ (Lemma 9). This is due to the independence constraint on the comparison. There is therefore no possibility (in this setting) of obtaining an estimator that is the “Pitman closest estimator of θ“, as claimed by the authors in the final section of their paper.

Once again, I have nothing against these derivations, which are mostly correct, but I simply argue here that they cannot constitute a competitor to standard decision theory.

Posted in R, Statistics, University life with tags , , , , , on April 30, 2012 by xi'an In the motivating toy example to our ABC model choice paper, we compare summary statistics, mean, median, variance, and… median absolute deviation (mad). The latest is the only one able to discriminate between our normal and Laplace models (as now discussed on Cross Validated!). When rerunning simulations to produce nicer graphical outcomes (for the revision), I noticed a much longer run time associated with the computation of the mad statistic. Here is a comparison for the computation of the mean, median, and mad on identical simulations:

> system.time(mmean(10^5))
user  system elapsed
4.040   0.056   4.350
> system.time(mmedian(10^5))
user  system elapsed
12.509   0.012  15.353