## Archive for bias

Posted in Books, Statistics, University life on October 27, 2016 by xi'an

In the March 2016 issue of JASA that currently sits on my desk, there is a paper by Liang, Jin, Song and Liu on the adaptive exchange algorithm, which aims at handling posteriors for sampling distributions with intractable normalising constants. The concept behind the algorithm is the exchange principle initiated by Jesper Møller and co-authors in 2006, where an auxiliary pseudo-observation is simulated for the missing constants to vanish in a Metropolis-Hastings ratio. (The name exchange algorithm was introduced in a subsequent paper by Iain Murray, Zoubin Ghahramani and David MacKay, also in 2006.)

The crux of the method is to run an iteration as [where y denotes the observation]

1. Propose a new value θ’ of the parameter from a proposal q(θ’|θ);
2. Generate a pseudo-observation z~ƒ(z|θ’);
3. Accept with probability

$\dfrac{\pi(\theta')f(y|\theta')}{\pi(\theta)f(y|\theta)}\dfrac{q(\theta|\theta')f(z|\theta)}{q(\theta'|\theta)f(z|\theta')}$

which has the appeal of cancelling all normalising constants, and the drawback of requiring an exact simulation from the very distribution with the missing constant, ƒ(.|θ). This means that in practice a finite number of MCMC steps will be used and will bias the outcome. The algorithm is unusual in that it replaces the exact proposal q(θ’|θ) with an unbiased random version q(θ’|θ)ƒ(z|θ’), z being just an augmentation of the proposal. (The current JASA paper by Liang et al. seems to confuse augment and argument, see p.378.)
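The three steps above can be sketched in R on a toy case where exact simulation of z is available. Everything in this sketch is an assumption made for illustration and not taken from the paper: exponential observations whose normalising constant is deliberately ignored, a Gamma(a,b) prior, and a symmetric Gaussian random walk proposal (so that q cancels in the ratio).

```r
set.seed(1)
n <- 20
y <- rexp(n, rate = 2)            # observations (toy data, assumption)
a <- 1; b <- 1                    # Gamma(a, b) prior (assumption)

## unnormalised log-likelihood: the constant theta^n is pretended unknown
log_h <- function(x, theta) -theta * sum(x)
log_prior <- function(theta) dgamma(theta, shape = a, rate = b, log = TRUE)

niter <- 20000
theta <- numeric(niter)
theta[1] <- 1
for (t in 2:niter) {
  cur <- theta[t - 1]
  prop <- cur + rnorm(1, sd = 0.5)          # step 1: symmetric proposal
  theta[t] <- cur                           # default: repeat current value
  if (prop > 0) {
    z <- rexp(n, rate = prop)               # step 2: exact pseudo-observation
    ## step 3: exchange ratio, all normalising constants cancel
    log_ratio <- log_prior(prop) + log_h(y, prop) + log_h(z, cur) -
      log_prior(cur) - log_h(y, cur) - log_h(z, prop)
    if (log(runif(1)) < log_ratio) theta[t] <- prop
  }
}

## in this toy case the exact posterior is Gamma(a + n, b + sum(y))
post_mean_exact <- (a + n) / (b + sum(y))
post_mean_mcmc <- mean(theta[-(1:2000)])
```

Since the posterior is available in closed form here, comparing `post_mean_mcmc` with `post_mean_exact` gives a quick sanity check that ignoring the constant does not bias the chain when z is simulated exactly.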

To avoid the difficulty in simulating from ƒ(.|θ), the authors draw pseudo-observations from sampling distributions with a finite number m of parameter values, under the [unrealistic] assumption (A⁰) that this collection of values provides an almost complete cover of the posterior support. One of the tricks stands with an auxiliary [time-heterogeneous] chain of pseudo-observations generated by single Metropolis steps from one of these m fixed targets. These pseudo-observations are then used in the main (or target) chain to define the above exchange probability. The auxiliary chain is Markov but time-heterogeneous, since the probabilities of accepting a move evolve with time according to a simulated annealing schedule, which produces a convergent estimate of the m normalising constants. The main chain is not Markov in that it depends on the whole history of the auxiliary chain [see Step 5, p.380]. Even jointly the collection of both chains is not Markov. The paper prefers to consider the process as an adaptive Markov chain. I did not check the rather intricate details, so cannot judge the validity of the overall algorithm; I simply note that one condition (A², p.383) is incredibly strong in that it assumes the Markov transition kernel to be Doeblin uniformly on any compact set of the calibration parameters.

However, the major difficulty with this approach seems to be in its delicate calibration. From providing a reference set of m parameter values scanning the posterior support to picking transition kernels on both the parameter and the sample spaces, to properly cooling the annealing schedule [always a fun part!], there seems to be [from my armchair expert’s perspective, of course!] a wide range of opportunities for missing the target or running into zero acceptance problems.

Both examples analysed in the paper, the auto-logistic and the auto-normal models, are actually of limited complexity in that they depend on a few parameters, 2 and 4 resp., and enjoy sufficient statistics, of dimensions 2 and 4 as well. Hence simulating (pseudo-)realisations of those sufficient statistics should be less challenging than the original approach replicating an entire vector of thousands of dimensions.

## automatic variational ABC

Posted in pictures, Statistics on July 8, 2016 by xi'an

“Stochastic Variational inference is an appealing alternative to the inefficient sampling approaches commonly used in ABC.”

Moreno et al. [including Ted Meeds and Max Welling] recently arXived a paper merging variational inference and ABC. The argument for turning variational is computational speedup. The traditional (in variational inference) divergence decomposition of the log-marginal likelihood is replaced by an ABC version, parameterised in terms of intrinsic generators (i.e., generators that do not depend on hyper-parameters, like the U(0,1) or the N(0,1) generators), or simulation code in the authors’ terms, which leads to the automatic aspect of the approach. In the paper the derivation of the gradient is indeed automated.

“One issue is that even assuming that the ABC likelihood is an unbiased estimator of the true likelihood (which it is not), taking the log introduces a bias, so that we now have a biased estimate of the lower bound and thus biased gradients.”

I wonder how much of an issue this is, since we consider the variational lower bound, to be optimised in terms of the parameters of the variational posterior. Indeed, the endpoint of the analysis is to provide an optimal variational approximation, which remains an approximation whether or not the likelihood estimator is unbiased. A more “severe” limitation may be in the inversion constraint, since it seems to eliminate Beta or Gamma distributions. (Even though calling qbeta(runif(1),a,b) definitely is achievable… And not rejected by a Kolmogorov-Smirnov test.)
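The parenthetical remark can be checked in a few lines of R. This is only a sketch: the shape parameters a=2, b=5 and the sample size are arbitrary choices, and qbeta inverts the cdf numerically rather than in closed form.

```r
set.seed(42)
a <- 2; b <- 5                       # arbitrary Beta parameters (assumption)
u <- runif(1e4)                      # intrinsic U(0,1) generator
x <- qbeta(u, a, b)                  # inversion through the quantile function
ks <- ks.test(x, "pbeta", a, b)      # Kolmogorov-Smirnov test against Beta(a,b)
ks$p.value
```

A large p-value only means that the test fails to detect a discrepancy between the inverted uniforms and the Beta(a,b) distribution, which is all the remark above claims.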

Incidentally, I discovered through the paper the existence of the Kumaraswamy distribution, whose main appeal seems to be the ability to produce a closed-form quantile function, while bearing some resemblance with the Beta distribution. (Another arXival by Baltasar Trancón y Widemann studies some connections between those, but does not tell how to select the parameters to optimise the similarity.)
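For the record, the Kumaraswamy(a,b) cdf on (0,1) is F(x)=1-(1-x^a)^b, which inverts explicitly. A short R sketch, where the function names pkumar, qkumar and rkumar are mine and not part of any package:

```r
## Kumaraswamy(a, b) distribution on (0, 1)
pkumar <- function(x, a, b) 1 - (1 - x^a)^b                  # cdf
qkumar <- function(u, a, b) (1 - (1 - u)^(1 / b))^(1 / a)    # closed-form quantile
rkumar <- function(n, a, b) qkumar(runif(n), a, b)           # simulation by inversion

## the quantile function is the exact inverse of the cdf
u <- c(0.1, 0.3, 0.9)
roundtrip <- pkumar(qkumar(u, a = 2, b = 3), a = 2, b = 3)
```

This is precisely the kind of generator that fits the inversion constraint, since the simulation only involves a U(0,1) variate and a differentiable transform.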

## the penalty method

Posted in Statistics, University life on July 7, 2016 by xi'an

“In this paper we will make conceptually simple generalization of Metropolis algorithm, by adjusting the acceptance ratio formula so that the transition probabilities are unaffected by the fluctuations in the estimate of [the acceptance ratio]…”

Last Friday, in Paris-Dauphine, my PhD student Changye Wu showed me a paper of Ceperley and Dewing entitled the penalty method for random walks with uncertain energies, of which I was unaware (and which alas pre-dated a recent advance made by Changye). Despite its physics connections, the paper is actually about estimating a Metropolis-Hastings acceptance ratio and correcting the Metropolis-Hastings move for this estimation. While there is no generic solution to this problem, assuming that the logarithm of the acceptance ratio estimate is Gaussian around the true log acceptance ratio (and hence unbiased) leads to a log-normal correction for the acceptance probability.
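As a rough illustration of this log-normal correction, here is a sketch in R under assumptions of my own choosing: a N(0,1) target, a Gaussian random walk proposal, and artificial Gaussian noise with known standard deviation sigma added to the exact log-ratio. With a noisy estimate of the log acceptance ratio, the move is accepted with probability min{1, exp(estimate − σ²/2)}.

```r
set.seed(7)
sigma <- 1          # known std dev of the noise on the log-ratio (assumption)
niter <- 50000
x <- numeric(niter) # chain started at 0
for (t in 2:niter) {
  cur <- x[t - 1]
  prop <- cur + rnorm(1, sd = 2)
  ## exact log acceptance ratio for the N(0,1) target (symmetric proposal)
  delta_true <- dnorm(prop, log = TRUE) - dnorm(cur, log = TRUE)
  ## noisy, unbiased Gaussian estimate of this log-ratio
  delta_hat <- delta_true + rnorm(1, sd = sigma)
  ## penalty correction: subtract sigma^2 / 2 before the usual comparison
  if (log(runif(1)) < delta_hat - sigma^2 / 2) x[t] <- prop else x[t] <- cur
}
c(mean(x), var(x))  # should be close to 0 and 1
```

Dropping the `- sigma^2 / 2` term in this sketch makes the chain accept too often and inflates the variance of the sample, which is one way to see what the penalty is correcting for.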

“Unfortunately there is a serious complication: the variance needed in the noise penalty is also unknown.”

Even when the Gaussian assumption is acceptable, there is a further issue with this approach, namely that it also depends on an unknown variance term, and replacing it with an estimate induces further bias. So it may be that this method has not met with many followers because of those two penalising factors, despite preceding the pseudo-marginal approach of Mark Beaumont (2003) by a few years, with the latter estimating separately numerator and denominator in the Metropolis-Hastings acceptance ratio, and hence being applicable in a much wider collection of cases. Although I wonder if some generic approaches like path sampling or the exchange algorithm could be applied on a generic basis… [I just realised the title could be confusing in relation with the current football competition!]

## stopping rule impact

Posted in Books, R, Statistics, University life on May 9, 2014 by xi'an

Here is a question from my friend Shravan Vasishth about the consequences of using a stopping rule:

Psycholinguists and psychologists often adopt the following type of data-gathering procedure: The experimenter gathers n data points, then checks for significance (p<0.05 or not). If it’s not significant, he gets more data (n more data points). Since time and money are limited, he might decide to stop anyway at sample size, say, some multiple of n.  One can play with different scenarios here. A typical n might be 10 or 15.

This approach would give us a distribution of t-values and p-values under repeated sampling. Theoretically, under the standard assumptions of frequentist methods, we expect a Type I error to be 0.05. This is the case in standard analyses (I also track the t-statistic, in order to compare it with my stopping rule code below).

Here’s a simulation showing what happens. I wanted to ask you whether this simulation makes sense. I assume here that the experimenter gathers 10 data points, then checks for significance (p<0.05 or not). If it’s not significant, he gets more data (10 more data points). Since time and money are limited, he might decide to stop anyway at sample size 60. This gives us p-values under repeated sampling. Theoretically, under the standard assumptions of frequentist methods, we expect a Type I error to be 0.05. This is the case in standard analyses:
##Standard:
pvals<-NULL
tstat_standard<-NULL
n<-10 # sample size
nsim<-1000 # number of simulations
stddev<-1 # standard dev
mn<-0 ## mean

for(i in 1:nsim){
samp<-rnorm(n,mean=mn,sd=stddev)
pvals[i]<-t.test(samp)$p.value
tstat_standard[i]<-t.test(samp)$statistic}

## Type I error rate: about 5% as theory says:
table(pvals<0.05)[2]/nsim


But the situation quickly deteriorates as soon as I adopt the strategy I outlined above:

pvals<-NULL
tstat<-NULL
## how many subjects can I run?
upper_bound<-n*6

for(i in 1:nsim){
## at the outset we have no significant result:
significant<-FALSE
## null hyp is going to be true,
## so any rejection is a mistake.
## take sample:
x<-rnorm(n,mean=mn,sd=stddev)
while(!significant & length(x)<upper_bound){
## if not significant:
if(t.test(x)$p.value>0.05){
## get more data:
x<-append(x,rnorm(n,mean=mn,sd=stddev))
## otherwise stop:
} else {significant<-TRUE}}
## will be either significant or not:
pvals[i]<-t.test(x)$p.value
tstat[i]<-t.test(x)$statistic}


Now let’s compare the distribution of the t-statistic in the standard case vs with the above stopping rule. We get fatter tails with the above stopping rule, as shown by the histogram below.

Is this a correct way to think about the stopping rule problem?

To which I replied the following:

By adopting a stopping rule on a random iid sequence, you favour values in the sequence that agree with your stopping condition, hence modify the distribution of the outcome. To take an extreme example, if you draw N(0,1) variates until the empirical average is between -2 and 2, the average thus produced cannot remain N(0,1/n) but has a different distribution.

The p-value of the t-test you build from your experiment is no longer distributed as a uniform variate because of the stopping rule: writing t(·) for this p-value, the sample (x1,…,x10m), with random size 10m [resulting from increases in the sample size by adding 10 more observations at a time], is distributed from

$\prod_{i=1}^{10m} \phi(x_i) \times \prod_{j=1}^{m-1} \mathbb{I}_{t(x_1,\ldots,x_{10j})>.05} \times \mathbb{I}_{t(x_1,\ldots,x_{10m})<.05}$

if 10m<60 [assuming the maximal acceptable sample size is 60] and from

$\prod_{i=1}^{60} \phi(x_i) \times \prod_{j=1}^{5} \mathbb{I}_{t(x_1,\ldots,x_{10j})>.05}$

otherwise. The histogram at the top of this post is the empirical distribution of the average of those observations, clearly far from a normal distribution.
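The inflation of the Type I error under this stopping rule can also be measured directly. The following is a compact re-writing of the simulation above, not Shravan's original code; the helper name final_pvalue and the number of replications are my own choices.

```r
set.seed(2014)
n <- 10             # observations added at each stage
upper_bound <- 60   # maximal sample size
nsim <- 2000        # number of simulated experiments

## p-value reported at the end of one experiment run under the stopping rule
final_pvalue <- function(n, upper_bound) {
  x <- rnorm(n)                       # the null hypothesis is true
  p <- t.test(x)$p.value
  while (p > 0.05 && length(x) < upper_bound) {
    x <- c(x, rnorm(n))               # not significant: get more data
    p <- t.test(x)$p.value
  }
  p
}

p_fixed <- replicate(nsim, t.test(rnorm(n))$p.value)
p_stop  <- replicate(nsim, final_pvalue(n, upper_bound))
type1_fixed <- mean(p_fixed < 0.05)   # fixed sample size: about 5%
type1_stop  <- mean(p_stop < 0.05)    # with the stopping rule: noticeably larger
```

With up to six looks at the data, each at the nominal 5% level, `type1_stop` comes out well above `type1_fixed`, which is the numerical counterpart of the fatter tails mentioned in the question.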

## bounded target support [#2]

Posted in Books, Kids, Statistics, University life on July 8, 2013 by xi'an

In a sort of echo from an earlier post, I received this emailed question from Gabriel:

I am contacting you in connection with my internship and your book «Le choix bayésien» where I cannot find an answer to my question. Given a constrained parameter space and an unconstrained Markov chain, is it correct to subsample the chain in order to keep only those points that satisfy the constraint?

To which I replied that this would induce a bias in the outcome, even though this is a marginally valid argument (if the Markov chain is in its stationary regime, picking one value at random from those satisfying the constraint is akin to accept-reject). The unbiased approach is to resort to Metropolis-Hastings steps in Gabriel’s Gibbs sampler to check whether or not each proposed move stays within the constrained space. If not, one needs to replicate the current value of the chain…

Update: Following comments by Ajay and David, I withdraw the term “bias”. The method works as a low key accept-reject but can be very inefficient, to the extent of having no visit to the constrained set. (However, in the event of a practically/numerically disconnected support with a huge gap between the connected components, it may also be more efficient than a low energy Metropolis-Hastings algorithm. Mileage may vary!)
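To make the Metropolis-Hastings recommendation concrete, here is a sketch in R on a toy case of my own choosing (not Gabriel's model): a N(0,1) target constrained to the positive half-line, explored by a random walk. A proposal falling outside the constrained space is rejected and the current value is replicated.

```r
set.seed(3)
niter <- 50000
in_support <- function(x) x > 0      # constraint (toy choice, assumption)
x <- numeric(niter)
x[1] <- 1                            # start inside the constrained space
for (t in 2:niter) {
  cur <- x[t - 1]
  prop <- cur + rnorm(1, sd = 1)     # unconstrained random walk proposal
  ok <- in_support(prop) &&
    log(runif(1)) < dnorm(prop, log = TRUE) - dnorm(cur, log = TRUE)
  x[t] <- if (ok) prop else cur      # outside the support: replicate current value
}
mean(x)  # to be compared with sqrt(2 / pi), the mean of the half-normal
```

The replication of the current value matters: simply skipping the iteration when the proposal leaves the support would change the stationary distribution near the boundary.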

## mostly nuisance, little interest

Posted in Statistics, University life on February 7, 2013 by xi'an

Sorry for the misleading if catchy (?) title, I mean mostly nuisance parameters, very few parameters of interest! This morning I attended a talk by Eric Lesage from CREST-ENSAI on non-responses in surveys and their modelling through instrumental variables. The weighting formula used to compensate for the missing values was exactly the one at the core of the Robins-Wasserman paradox, discussed a few weeks ago by Jamie in Varanasi, namely the one with the estimated probability of response in the denominator. The solution adopted in the talk was obviously different, with linear estimators used at most steps to evaluate the bias of the procedure (since researchers in survey sampling seem particularly obsessed with bias!)

On a somewhat related topic, Aris Spanos arXived a short note (that I read yesterday) about the Neyman-Scott paradox. The problem is similar to the Robins-Wasserman paradox in that there is an infinity of nuisance parameters (the means of the successive pairs of observations) and that a convergent estimator of the parameter of interest, namely the variance common to all observations, is available. While there exist Bayesian solutions to this problem (see, e.g., this paper by Brunero Liseo), they require some preliminary steps to bypass the difficulty of this infinite number of parameters and, in this respect, involve ad-hockery to some extent, because the prior is then designed purposefully so. In other words, missing the direct solution based on the difference of the pairs is a wee bit frustrating, even though this statistic is not sufficient! The above paper by Brunero also contains my favourite example in this area: when considering a normal mean in large dimension, if the parameter of interest is the squared norm of this mean, the MLE ||x||² (and the Bayes estimator associated with Jeffreys’ prior) is (are) very poor: the bias is constant and of the order of the dimension of the mean, p. On the other hand, if one starts from ||x||² as the observation (definitely in-sufficient!), the resulting MLE (and the Bayes estimator associated with Jeffreys’ prior) has (have) much nicer properties. (I mentioned this example in my review of Chang’s book as it is paradoxical, gaining in efficiency by throwing away “information”! Of course, the part we throw away does not contain true information about the norm, but the likelihood does not factorise and hence the Bayesian answers differ…)
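Both phenomena are easy to reproduce by simulation. The R sketch below uses arbitrary settings of my own (number of pairs, variance, dimension, true mean): it compares the MLE of the common variance in the Neyman-Scott problem with the estimator based on the differences within pairs, and it evaluates the bias of ||x||² as an estimator of the squared norm of a normal mean.

```r
set.seed(11)
## Neyman-Scott: n pairs, each with its own mean and a common variance sigma2
n <- 1e4; sigma2 <- 4
mu <- rnorm(n, sd = 10)                          # nuisance means (arbitrary)
x1 <- rnorm(n, mean = mu, sd = sqrt(sigma2))
x2 <- rnorm(n, mean = mu, sd = sqrt(sigma2))
xbar <- (x1 + x2) / 2
mle <- sum((x1 - xbar)^2 + (x2 - xbar)^2) / (2 * n)   # converges to sigma2 / 2
diff_est <- sum((x1 - x2)^2) / (2 * n)                # converges to sigma2

## squared norm of a normal mean in dimension p
p <- 100; theta <- rep(1, p); nrep <- 2000
sqnorm <- replicate(nrep, sum(rnorm(p, mean = theta)^2))
bias <- mean(sqnorm) - sum(theta^2)                   # close to p
```

The first part shows the MLE settling around half the true variance however many pairs are observed, while the second shows the bias of ||x||² staying around p, in agreement with the order stated above.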

I showed the paper to Andrew Gelman and here are his comments:

Spanos writes, “The answer is surprisingly straightforward.” I would change that to, “The answer is unsurprisingly straightforward.” He should’ve just asked me the answer first rather than wasting his time writing a paper!

The way it works is as follows. In Bayesian inference, everything unknown is unknown, they have a joint prior and a joint posterior distribution. In frequentist inference, each unknown quantity is either a parameter or a predictive quantity. Parameters do not have probability distributions (hence the discomfort that frequentists have with notation such as N(y|m,s); they prefer something like N(y;m,s) or f_N(y;m,s)), while predictions do have probability distributions. In frequentist statistics, you estimate parameters and you predict predictors. In this world, estimation and prediction are different. Estimates are evaluated conditional on the parameter. Predictions are evaluated conditional on model parameters but unconditional on the predictive quantities. Hence, mle can work well in many high-dimensional problems, as long as you consider many of the uncertain quantities as predictive. (But mle is still not perfect because of the problem of boundary estimates, e.g., here..