Archive for bias

automatic variational ABC

Posted in pictures, Statistics on July 8, 2016 by xi'an

“Stochastic Variational inference is an appealing alternative to the inefficient sampling approaches commonly used in ABC.”

Moreno et al. [including Ted Meeds and Max Welling] recently arXived a paper merging variational inference and ABC. The argument for turning variational is computational speedup. The traditional (in variational inference) divergence decomposition of the log-marginal likelihood is replaced by an ABC version, parameterised in terms of intrinsic generators (i.e., generators that do not depend on cyber-parameters, like the U(0,1) or the N(0,1) generators). Or simulation code in the authors’ terms. Which leads to the automatic aspect of the approach. In the paper the derivation of the gradient is indeed automated.

“One issue is that even assuming that the ABC likelihood is an unbiased estimator of the true likelihood (which it is not), taking the log introduces a bias, so that we now have a biased estimate of the lower bound and thus biased gradients.”

I wonder how much of an issue this is, since we consider the variational lower bound. To be optimised in terms of the parameters of the variational posterior. Indeed, the endpoint of the analysis is to provide an optimal variational approximation, which remains an approximation whether or not the likelihood estimator is unbiased. A more “severe” limitation may be in the inversion constraint, since it seems to eliminate Beta or Gamma distributions. (Even though calling qbeta(runif(1),a,b) definitely is achievable… And not rejected by a Kolmogorov-Smirnov test.)
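The qbeta(runif(1),a,b) remark can be checked in a few lines. This is only a minimal sketch: the shape values a=2 and b=3, the sample size, and the seed are arbitrary assumptions, not taken from the paper.

```r
## inverse-cdf Beta generator driven by an "intrinsic" U(0,1) generator
set.seed(101)        # arbitrary seed (assumption)
a <- 2; b <- 3       # arbitrary shape parameters (assumption)
u <- runif(1e4)      # parameter-free uniform draws
x <- qbeta(u, a, b)  # push the uniforms through the Beta quantile function
## Kolmogorov-Smirnov test of the draws against the Beta(a,b) cdf
ks <- ks.test(x, "pbeta", a, b)
ks$p.value
```

The draws are exact Beta(a,b) variates, but qbeta has no closed form in general, which is presumably where the inversion constraint bites when gradients in (a,b) are required.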

Incidentally, I discovered through the paper the existence of the Kumaraswamy distribution, whose main appeal seems to be the ability to produce a closed-form quantile function, while bearing some resemblance to the Beta distribution. (Another arXival by Baltasar Trancón y Widemann studies some connections between those, but does not tell how to select the parameters to optimise the similarity.)
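For illustration, the Kumaraswamy(a,b) cdf on (0,1) is F(x) = 1 - (1 - x^a)^b, which inverts by hand. The sketch below uses hypothetical function names (pkumar, qkumar, rkumar are not from any package) and arbitrary parameter values.

```r
## Kumaraswamy(a,b) cdf on (0,1): F(x) = 1 - (1 - x^a)^b
pkumar <- function(x, a, b) 1 - (1 - x^a)^b
## closed-form quantile function, obtained by solving F(x) = u for x
qkumar <- function(u, a, b) (1 - (1 - u)^(1 / b))^(1 / a)
## inverse-cdf generator: an explicit transform of U(0,1) draws
rkumar <- function(n, a, b) qkumar(runif(n), a, b)

set.seed(42)                    # arbitrary seed (assumption)
a <- 2; b <- 3                  # arbitrary parameters (assumption)
u <- c(0.1, 0.5, 0.9)
pkumar(qkumar(u, a, b), a, b)   # recovers u up to rounding error
x <- rkumar(1e4, a, b)
```

Since qkumar is an explicit function of (u,a,b), its derivatives in a and b are available without numerical inversion, in contrast with qbeta.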

the penalty method

Posted in Statistics, University life on July 7, 2016 by xi'an

“In this paper we will make conceptually simple generalization of Metropolis algorithm, by adjusting the acceptance ratio formula so that the transition probabilities are unaffected by the fluctuations in the estimate of [the acceptance ratio]…”

Last Friday, in Paris-Dauphine, my PhD student Changye Wu showed me a paper of Ceperley and Dewing entitled the penalty method for random walks with uncertain energies. Of which I was unaware (and which alas pre-dated a recent advance made by Changye). Despite its physics connections, the paper is actually about estimating a Metropolis-Hastings acceptance ratio and correcting the Metropolis-Hastings move for this estimation. While there is no generic solution to this problem, assuming that the logarithm of the acceptance ratio estimate is Gaussian around the true log acceptance ratio (and hence unbiased) leads to a log-normal correction for the acceptance probability.
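As a hedged sketch of how the log-normal correction operates (in notation that is an assumption, not the paper's): if the estimated log ratio is N(δ,σ²) with σ known, accepting with probability min(1, exp(estimate − σ²/2)) gives an averaged acceptance probability A(δ) with a closed form, and detailed balance on average amounts to A(δ)/A(−δ) = exp(δ). The values of δ and σ below are arbitrary.

```r
## averaged penalised acceptance probability when the estimated log ratio
## is N(delta, sigma^2): closed form of E[ min(1, exp(dhat - sigma^2/2)) ]
avg_accept <- function(delta, sigma) {
  pnorm((delta - sigma^2 / 2) / sigma) +
    exp(delta) * pnorm(-(delta + sigma^2 / 2) / sigma)
}

delta <- 0.7   # true log acceptance ratio (arbitrary value, assumption)
sigma <- 1.3   # known noise standard deviation (arbitrary value, assumption)

## detailed balance on average: forward over reverse equals exp(delta)
ratio <- avg_accept(delta, sigma) / avg_accept(-delta, sigma)
c(ratio, exp(delta))

## Monte Carlo check of the closed form
set.seed(7)
dhat <- rnorm(1e5, delta, sigma)
mc <- mean(pmin(1, exp(dhat - sigma^2 / 2)))
c(mc, avg_accept(delta, sigma))
```

The check relies on σ being known exactly, which is precisely the assumption that fails in practice.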

“Unfortunately there is a serious complication: the variance needed in the noise penalty is also unknown.”

Even when the Gaussian assumption is acceptable, there is a further issue with this approach, namely that it also depends on an unknown variance term. And replacing it with an estimate induces further bias. So it may be that this method has not met with many followers because of those two penalising factors. Despite preceding the pseudo-marginal approach of Mark Beaumont (2003) by a few years, with the latter estimating separately numerator and denominator in the Metropolis-Hastings acceptance ratio. And hence being applicable in a much wider collection of cases. Although I wonder if some generic approaches like path sampling or the exchange algorithm could be applied on a generic basis… [I just realised the title could be confusing in relation with the current football competition!]

stopping rule impact

Posted in Books, R, Statistics, University life on May 9, 2014 by xi'an

Here is a question from my friend Shravan Vasishth about the consequences of using a stopping rule:

Psycholinguists and psychologists often adopt the following type of data-gathering procedure: The experimenter gathers n data points, then checks for significance (p<0.05 or not). If it’s not significant, he gets more data (n more data points). Since time and money are limited, he might decide to stop anyway at sample size, say, some multiple of n.  One can play with different scenarios here. A typical n might be 10 or 15.

This approach would give us a distribution of t-values and p-values under repeated sampling. Theoretically, under the standard assumptions of frequentist methods, we expect a Type I error to be 0.05. This is the case in standard analyses (I also track the t-statistic, in order to compare it with my stopping rule code below).

Here’s a simulation showing what happens. I wanted to ask you whether this simulation makes sense. I assume here that the experimenter gathers 10 data points, then checks for significance (p<0.05 or not). If it’s not significant, he gets more data (10 more data points). Since time and money are limited, he might decide to stop anyway at sample size 60. This gives us p-values under repeated sampling. Theoretically, under the standard assumptions of frequentist methods, we expect a Type I error to be 0.05. This is the case in standard analyses:
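n<-10 # sample size
nsim<-1000 # number of simulations
stddev<-1 # standard dev
mn<-0 ## mean

## storage for the p-values and t-statistics:
pvals<-tstat_standard<-rep(NA,nsim)
for(i in 1:nsim){
 samp<-rnorm(n,mean=mn,sd=stddev)
 tt<-t.test(samp)
 pvals[i]<-tt$p.value
 tstat_standard[i]<-tt$statistic
}
## Type I error rate: about 5% as theory says:
mean(pvals<0.05)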

But the situation quickly deteriorates as soon as we adopt the strategy I outlined above:

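## how many subjects can I run?
upper_bound<-n*6
pvals_stop<-tstat_stop<-rep(NA,nsim)

for(i in 1:nsim){
## at the outset we have no significant result:
 significant<-FALSE
## null hyp is going to be true,
## so any rejection is a mistake.
## take sample:
 x<-rnorm(n,mean=mn,sd=stddev)
 while(!significant && length(x)<upper_bound){
  ## if not significant:
  if(t.test(x)$p.value>0.05){
  ## get more data:
   x<-append(x,rnorm(n,mean=mn,sd=stddev))
  ## otherwise stop:
  } else {significant<-TRUE}}
## will be either significant or not:
 pvals_stop[i]<-t.test(x)$p.value
 tstat_stop[i]<-t.test(x)$statistic
}
## Type I error rate under the stopping rule:
mean(pvals_stop<0.05)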

Now let’s compare the distribution of the t-statistic in the standard case vs with the above stopping rule. We get fatter tails with the above stopping rule, as shown by the histogram below.

Is this a correct way to think about the stopping rule problem?


To which I replied the following:

By adopting a stopping rule on a random iid sequence, you favour values in the sequence that agree with your stopping condition, hence modify the distribution of the outcome. To take an extreme example, if you draw N(0,1) variates until the empirical average is between -2 and 2, the average thus produced cannot remain N(0,1/n) but has a different distribution.
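The extreme example can be simulated directly. In this sketch the function name stopped_mean and the safety cap max_n are hypothetical additions, and the seed and number of replications are arbitrary.

```r
## draw N(0,1) variates until the running average falls in (lower, upper),
## then return that stopped average along with the (random) sample size
stopped_mean <- function(lower = -2, upper = 2, max_n = 1000) {
  x <- rnorm(1)
  ## max_n is only a safety cap on the loop (assumption)
  while ((mean(x) <= lower || mean(x) >= upper) && length(x) < max_n) {
    x <- c(x, rnorm(1))
  }
  c(avg = mean(x), n = length(x))
}

set.seed(2014)
res <- t(replicate(5000, stopped_mean()))
range(res[, "avg"])   # support restricted to (-2,2), unlike any normal
table(res[, "n"])     # the final sample size is itself random
```

Whatever the sample size at stopping, the reported average is confined to (-2,2) by construction, so it cannot follow a N(0,1/n) distribution.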

The t-test p-value you build from your experiment, denoted t(·) below, is no longer distributed as a uniform variate because of the stopping rule: the sample (x1,…,x10m), with random size 10m [resulting from increases in the sample size by adding 10 more observations at a time], is distributed from

\prod_{i=1}^{10m} \phi(x_i) \times \prod_{j=1}^{m-1} \mathbb{I}_{t(x_1,\ldots,x_{10j})>.05} \times \mathbb{I}_{t(x_1,\ldots,x_{10m})<.05}

if 10m<60 [assuming the maximal acceptable sample size is 60] and from

\prod_{i=1}^{60} \phi(x_i) \times \prod_{j=1}^{5} \mathbb{I}_{t(x_1,\ldots,x_{10j})>.05}

otherwise. The histogram at the top of this post is the empirical distribution of the average of those observations, clearly far from a normal distribution.

bounded target support [#2]

Posted in Books, Kids, Statistics, University life on July 8, 2013 by xi'an

In a sort of echo from an earlier post, I received this emailed question from Gabriel:

I am contacting you in connection with my internship and your book «Le choix bayésien» where I cannot find an answer to my question. Given a constrained parameter space and an unconstrained Markov chain, is it correct to subsample the chain in order to keep only those points that satisfy the constraint?

To which I replied that this would induce a bias in the outcome, even though this is a marginally valid argument (if the Markov chain is in its stationary regime, picking one value at random from those satisfying the constraint is akin to accept-reject). The unbiased approach is to resort to Metropolis-Hastings steps in Gabriel’s Gibbs sampler to check whether or not each proposed move stays within the constrained space. If not, one needs to replicate the current value of the chain…

Update: Following comments by Ajay and David, I withdraw the term “bias”. The method works as a low key accept-reject but can be very inefficient, to the extent of having no visit to the constrained set. (However, in the event of a practically/numerically disconnected support with a huge gap between the connected components, it may also be more efficient than a low energy Metropolis-Hastings algorithm. Mileage may vary!)
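Both strategies can be compared on a toy case. The sketch below assumes a N(0,1) target constrained to θ>0 and a random-walk proposal with unit scale, none of which comes from Gabriel's actual problem.

```r
set.seed(2013)
niter <- 2e4
inside <- function(theta) theta > 0   # toy constraint (assumption)

## (a) unconstrained random-walk Metropolis on N(0,1), subsampled afterwards
free <- numeric(niter)
for (k in 2:niter) {
  prop <- free[k - 1] + rnorm(1)
  logr <- dnorm(prop, log = TRUE) - dnorm(free[k - 1], log = TRUE)
  free[k] <- if (log(runif(1)) < logr) prop else free[k - 1]
}
kept <- free[inside(free)]

## (b) Metropolis-Hastings on the constrained target: a proposal leaving
## the constrained space is rejected and the current value is replicated
cons <- numeric(niter)
cons[1] <- 1
for (k in 2:niter) {
  prop <- cons[k - 1] + rnorm(1)
  logr <- if (inside(prop)) {
    dnorm(prop, log = TRUE) - dnorm(cons[k - 1], log = TRUE)
  } else -Inf
  cons[k] <- if (log(runif(1)) < logr) prop else cons[k - 1]
}

## both averages should be close to the half-normal mean sqrt(2/pi)
c(subsample = mean(kept), constrained = mean(cons), exact = sqrt(2 / pi))
```

Here the constrained set has probability one half under the unconstrained target, so subsampling wastes about half of the chain; with a tiny constrained set, kept could well be empty.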

mostly nuisance, little interest

Posted in Statistics, University life on February 7, 2013 by xi'an

Sorry for the misleading if catchy (?) title, I mean mostly nuisance parameters, very few parameters of interest! This morning I attended a talk by Eric Lesage from CREST-ENSAI on non-responses in surveys and their modelling through instrumental variables. The weighting formula used to compensate for the missing values was exactly the one at the core of the Robins-Wasserman paradox, discussed a few weeks ago by Jamie in Varanasi. Namely the one with the estimated probability of response in the denominator. The solution adopted in the talk was obviously different, with linear estimators used at most steps to evaluate the bias of the procedure (since researchers in survey sampling seem particularly obsessed with bias!).

On a somewhat related topic, Aris Spanos arXived a short note (that I read yesterday) about the Neyman-Scott paradox. The problem is similar to the Robins-Wasserman paradox in that there is an infinity of nuisance parameters (the means of the successive pairs of observations) and that a convergent estimator of the parameter of interest, namely the variance common to all observations, is available. While there exist Bayesian solutions to this problem (see, e.g., this paper by Brunero Liseo), they require some preliminary steps to bypass the difficulty of this infinite number of parameters and, in this respect, involve ad-hockery to some extent, because the prior is then designed purposefully so. In other words, missing the direct solution based on the differences of the pairs is a wee bit frustrating, even though this statistic is not sufficient! The above paper by Brunero also contains my favourite example in this area: when considering a normal mean in large dimension, if the parameter of interest is the squared norm of this mean, the MLE ||x||² (and the Bayes estimator associated with Jeffreys’ prior) is (are) very poor: the bias is constant and of the order of the dimension of the mean, p. On the other hand, if one starts from ||x||² as the observation (definitely in-sufficient!), the resulting MLE (and the Bayes estimator associated with Jeffreys’ prior) has (have) much nicer properties. (I mentioned this example in my review of Chang’s book as it is paradoxical, gaining in efficiency by throwing away “information”! Of course, the part we throw away does not contain true information about the norm, but the likelihood does not factorise and hence the Bayesian answers differ…)
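Both phenomena show up in a quick simulation. This is a sketch under arbitrary assumptions (number of pairs, true variance, nuisance means, dimension p, mean vector), none taken from the papers mentioned above.

```r
set.seed(2013)
## Neyman-Scott: pairs (x_i1, x_i2) ~ N(mu_i, sigma^2), one nuisance mean per pair
npairs <- 1e5
sigma2 <- 4                       # true common variance (arbitrary, assumption)
mu <- rnorm(npairs, sd = 10)      # arbitrary nuisance means (assumption)
x1 <- rnorm(npairs, mu, sqrt(sigma2))
x2 <- rnorm(npairs, mu, sqrt(sigma2))
d <- x1 - x2                      # differences are N(0, 2 sigma^2), free of mu_i
mle   <- mean(d^2) / 4            # joint MLE of sigma^2: converges to sigma^2/2
fixed <- mean(d^2) / 2            # difference-based estimator: consistent
c(mle = mle, diff_based = fixed, truth = sigma2)

## squared norm of a normal mean in dimension p: E||x||^2 = ||theta||^2 + p
p <- 500
theta <- rep(1, p)                # arbitrary mean vector (assumption)
norm2 <- replicate(2000, sum(rnorm(p, theta)^2))
c(avg_norm2 = mean(norm2), truth = sum(theta^2),
  bias = mean(norm2) - sum(theta^2))
```

The joint MLE of the variance settles at half the true value however many pairs are observed, while ||x||² overshoots the squared norm by about p.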

I showed the paper to Andrew Gelman and here are his comments:

Spanos writes, “The answer is surprisingly straightforward.” I would change that to, “The answer is unsurprisingly straightforward.” He should’ve just asked me the answer first rather than wasting his time writing a paper!

The way it works is as follows. In Bayesian inference, everything unknown is unknown, they have a joint prior and a joint posterior distribution. In frequentist inference, each unknown quantity is either a parameter or a predictive quantity. Parameters do not have probability distributions (hence the discomfort that frequentists have with notation such as N(y|m,s); they prefer something like N(y;m,s) or f_N(y;m,s)), while predictions do have probability distributions. In frequentist statistics, you estimate parameters and you predict predictors. In this world, estimation and prediction are different. Estimates are evaluated conditional on the parameter. Predictions are evaluated conditional on model parameters but unconditional on the predictive quantities. Hence, mle can work well in many high-dimensional problems, as long as you consider many of the uncertain quantities as predictive. (But mle is still not perfect because of the problem of boundary estimates, e.g., here.)

paradoxes in scientific inference: a reply from the author

Posted in Books, Statistics, University life on December 26, 2012 by xi'an

(I received the following set of comments from Mark Chang after publishing a review of his book on the ‘Og. Here they are, verbatim, except for a few editing and spelling changes. It’s a huge post as Chang reproduces all of my comments as well.)

Professor Christian Robert reviewed my book: “Paradoxes in Scientific Inference”. I found that the majority of his criticisms had no foundation and were based on his truncated way of reading. I gave point-by-point responses below. For clarity, I kept his original comments.

Robert’s Comments: This CRC Press book was sent to me for review in CHANCE: Paradoxes in Scientific Inference is written by Mark Chang, vice-president of AMAG Pharmaceuticals. The topic of scientific paradoxes is one of my primary interests and I have learned a lot by looking at Lindley-Jeffreys and Savage-Dickey paradoxes. However, I did not find a renewed sense of excitement when reading the book. The very first (and maybe the best!) paradox with Paradoxes in Scientific Inference is that it is a book from the future! Indeed, its copyright year is 2013 (!), although I got it a few months ago. (Not mentioning here the cover mimicking Escher’s “paradoxical” pictures with dices. A sculpture due to Shigeo Fukuda and apparently not quoted in the book. As I do not want to get into another dice cover polemic, I will abstain from further comments!)

Thank you, Robert for reading and commenting on part of my book. I had the same question on the copyright year being 2013 when it was actually published in previous year. I believe the same thing had happened to my other books too. The incorrect year causes confusion for future citations. The cover was designed by the publisher. They gave me few options and I picked the one with dices. I was told that the publisher has the copyright for the art work. I am not aware of the original artist.

Harmonic means, again

Posted in Statistics, University life on January 3, 2012 by xi'an

Over the non-vacation and the vacation breaks of the past weeks, I skipped a lot of arXiv postings. This morning, I took a look at “Probabilities of exoplanet signals from posterior samplings” by Mikko Tuomi and Hugh R. A. Jones. This is a paper to appear in Astronomy and Astrophysics, but the main point [to me] is to study a novel approximation to the marginal likelihood. The authors propose what looks at first like defensive sampling: given a likelihood f(x|θ) and a corresponding Markov chain (θi), the approximation is based on the following importance sampling representation

\hat m(x) = \sum_{i=h+1}^N \dfrac{f(x|\theta_i)}{(1-\lambda) f(x|\theta_i) + \lambda f(x|\theta_{i-h})} \Big/ \sum_{i=h+1}^N \dfrac{1}{(1-\lambda) f(x|\theta_i) + \lambda f(x|\theta_{i-h})}

This is called a truncated posterior mixture approximation and, under closer scrutiny, it is not defensive sampling. Indeed the second part in the denominators does not depend on the parameter θi, therefore, as far as importance sampling is concerned, this is a constant (if random) term! The authors impose a bounded parameter space for this reason, however I do not see why such an approximation is converging. Except when λ=0, of course, which brings us back to the original harmonic mean estimator. (Properly rejected by the authors for having a very slow convergence. Or, more accurately, generally no stronger convergence than the law of large numbers.)  Furthermore, the generic importance sampling argument does not work here since, if

g(\theta) \propto (1-\lambda) \pi(\theta|x) + \lambda \pi(\theta_{i-h}|x)

is the importance function, the ratio

\dfrac{\pi(\theta_i)f(x|\theta_i)}{(1-\lambda) \pi(\theta_i|x) + \lambda \pi(\theta_{i-h}|x)}

does not simplify… I do not understand either why the authors compare Bayes factor approximations based on this technique, on the harmonic mean version or on Chib and Jeliazkov’s (2001) solution with both DIC and AIC, since the latter are not approximations to the Bayes factors. I am therefore quite surprised at the paper being accepted for publication, given that the numerical evaluation shows the impact of the coefficient λ does not vanish with the number of simulations. (Which is logical given the bias induced by the additional term.)
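For reference, the λ=0 case (the original harmonic mean estimator) can be tried on a toy model where the marginal likelihood is known. The conjugate normal model, the observation x=1.5, and the use of exact posterior draws in place of MCMC output are all assumptions made for this sketch.

```r
set.seed(2012)
## toy conjugate model (assumption): x | theta ~ N(theta, 1), theta ~ N(0, 1)
x <- 1.5
exact <- dnorm(x, 0, sqrt(2))            # closed-form marginal likelihood m(x)
## exact posterior draws stand in for the MCMC output (theta_i)
theta <- rnorm(1e5, x / 2, sqrt(1 / 2))
lik <- dnorm(x, theta, 1)                # f(x | theta_i)
hm <- 1 / mean(1 / lik)                  # harmonic mean estimator (lambda = 0)
## variability of the estimator across ten subchains of 10^4 draws each
hm_batches <- tapply(1 / lik, rep(1:10, each = 1e4), function(w) 1 / mean(w))
c(exact = exact, harmonic = hm)
range(hm_batches)
```

In this toy model 1/f(x|θ) has an infinite variance under the posterior, so the estimator converges by the law of large numbers only, which is the slow convergence mentioned above.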

