**W**hen looking at questions on X validated, I came across this seemingly obvious request for an unbiased estimator of **P**(X=k), when X~**B**(n,p). Except that X is not observed but only Y~**B**(s,p) with s<n. Since **P**(X=k) is a polynomial in p, I was expecting such an unbiased estimator to exist. But it does not, for the reasons that Y only takes s+1 values and that any function of Y, including the MLE of **P**(X=k), has an expectation involving monomials in p of power s at most. It is actually straightforward to establish properly that the unbiased estimator does not exist. But this remains an interesting additional example of the rarity of the existence of unbiased estimators, to be saved until a future mathematical statistics exam!

## Archive for unbiasedness

## unbiased estimators that do not exist

Posted in Statistics with tags binomial distribution, counterexample, cross validated, exam, existence of unbiased estimators, monomial, teaching, unbiasedness on January 21, 2019 by xi'an## the [not so infamous] arithmetic mean estimator

Posted in Books, Statistics with tags arithmetic mean estimator, Bayesian Analysis, Chib's approximation, harmonic mean estimator, HPD region, importance sampling, label switching, mixture of distributions, nested sampling, unbiasedness on June 15, 2018 by xi'an

“Unfortunately, no perfect solution exists.”Anna Pajor

**A**nother paper about harmonic and not-so-harmonic mean estimators that I (also) missed came out last year in Bayesian Analysis. The author is Anna Pajor, whose earlier note with Osiewalski I also spotted on the same day. The idea behind the approach [which belongs to the branch of Monte Carlo methods requiring additional simulations after an MCMC run] is to start as the corrected harmonic mean estimator on a restricted set **A** as to avoid tails of the distributions and the connected infinite variance issues that plague the harmonic mean estimator (an old ‘Og tune!). The marginal density p(y) then satisfies an identity involving the prior expectation of the likelihood function restricted to **A** divided by the posterior coverage of **A**. Which makes the resulting estimator unbiased only when this posterior coverage of **A** is known, which does not seem realist or efficient, except if **A** is an HPD region, as suggested in our earlier “safe” harmonic mean paper. And efficient only when **A** is well-chosen in terms of the likelihood function. In practice, the author notes that P(**A**|y) is to be estimated from the MCMC sequence and that the set **A** should be chosen to return large values of the likelihood, p(y|θ), through importance sampling, hence missing somehow the double opportunity of using an HPD region. Hence using the same default choice as in Lenk (2009), an HPD region which lower bound is derived as the minimum likelihood in the MCMC sample, “range of the posterior sampler output”. Meaning P(**A**|y)=1. (As an aside, the paper does not produce optimality properties or even heuristics towards efficiently choosing the various parameters to be calibrated in the algorithm, like the set **A** itself. As another aside, the paper concludes with a simulation study on an AR(p) model where the marginal may be obtained in closed form if stationarity is not imposed, which I first balked at, before realising that even in this setting both the posterior and the marginal do exist for a finite sample size, and hence the later can be estimated consistently by Monte Carlo methods.) A last remark is that computing costs are not discussed in the comparison of methods.

The final experiment in the paper is aiming at the marginal of a mixture model posterior, operating on the galaxy benchmark used by Roeder (1990) and about every other paper on mixtures since then (incl. ours). The prior is pseudo-conjugate, as in Chib (1995). And label-switching is handled by a random permutation of indices at each iteration. Which may not be enough to fight the attraction of the current mode on a Gibbs sampler and hence does not automatically correct Chib’s solution. As shown in Table 7 by the divergence with Radford Neal’s (1999) computations of the marginals, which happen to be quite close to the approximation proposed by the author. (As an aside, the paper mentions poor performances of Chib’s method when centred at the posterior mean, but this is a setting where the posterior mean is meaningless because of the permutation invariance. As another, I do not understand how the RMSE can be computed in this real data situation.) The comparison is limited to Chib’s method and a few versions of arithmetic and harmonic means. Missing nested sampling (Skilling, 2006; Chopin and X, 2011), and attuned importance sampling as in Berkoff et al. (2003), Marin, Mengersen and X (2005), and the most recent Lee and X (2016) in Bayesian Analysis.

## unbiased consistent nested sampling via sequential Monte Carlo [a reply]

Posted in pictures, Statistics, Travel with tags auxiliary variable, Brisbane, evidence, marginal likelihood, nested sampling, Og, particle filter, QUT, unbiasedness on June 13, 2018 by xi'an*Rob Salomone sent me the following reply on my comments of yesterday about their recently arXived paper.*

“Which never occurred as the number one difficulty there, as the simplest implementation runs a Markov chain from the last removed entry, independently from the remaining entries. Even stationarity is not an issue sinceI believe that the first occurrence within the level set is distributed from the constrained prior.”

“And then, in a twist that is not clearly explained in the paper, the focus moves to an improved nested sampler that moves one likelihood value at a time, with a particle step replacing a singleparticle. (Things get complicated when several particles may take the very same likelihood value, but randomisation helps.) At this stage the algorithm is quite similar to the original nested sampler. Except for the unbiased estimation of the constants, thefinal constant, and the replacement of exponential weights exp(-t/N) by powers of (N-1/N)”

**is**a special case of SMC (with the weights replaced with a suboptimal choice).

## unbiased consistent nested sampling via sequential Monte Carlo

Posted in pictures, Statistics, Travel with tags auxiliary variable, Brisbane, evidence, marginal likelihood, nested sampling, Og, particle filter, QUT, unbiasedness on June 12, 2018 by xi'an

“Moreover, estimates of the marginal likelihood are unbiased.” (p.2)

Rob Salomone, Leah South, Chris Drovandi and Dirk Kroese (from QUT and UQ, Brisbane) recently arXived a paper that frames the nested sampling in such a way that marginal likelihoods can be unbiasedly (and consistently) estimated.

“Why isn’t nested sampling more popular with statisticians?” (p.7)

A most interesting question, especially given its popularity in cosmology and other branches of physics. A first drawback pointed out in the c is the requirement of independence between the elements of the sample produced at each iteration. Which never occurred as the number one difficulty there, as the simplest implementation runs a Markov chain from the last removed entry, independently from the remaining entries. Even stationarity is not an issue since I believe that the first occurrence within the level set is distributed from the constrained prior.

A second difficulty is the use of quadrature which turns integrand into step functions at random slices. Indeed, mixing Monte Carlo with numerical integration makes life much harder, as shown by the early avatars of nested sampling that only accounted for the numerical errors. (And which caused Nicolas and I to write our critical paper in Biometrika.) There are few studies of that kind in the literature, the only one I can think of being [my former PhD student] Anne Philippe‘s thesis twenty years ago.

The third issue stands with the difficulty in parallelising the method. Except by jumping k points at once, rather than going one level at a time. While I agree this makes life more complicated, I am also unsure about the severity of that issue as k nested sampling algorithms can be run in parallel and aggregated in the end, from simple averaging to something more elaborate.

The final blemish is that the nested sampling estimator has a stopping mechanism that induces a truncation error, again maybe a lesser problem given the overall difficulty in assessing the total error.

The paper takes advantage of the ability of SMC to produce unbiased estimates of a sequence of normalising constants (or of the normalising constants of a sequence of targets). For nested sampling, the sequence is made of the prior distribution restricted to an embedded sequence of level sets. With another sequence restricted to bands (likelihood between two likelihood boundaries). If all restricted posteriors of the second kind and their normalising constant are known, the full posterior is known. Apparently up to the main normalising constant, i.e. the marginal likelihood., *ℨ*, except that it is also the sum of all normalising constants. Handling this sequence by SMC addresses the four concerns of the four authors, apart from the truncation issue, since the largest likelihood bound need be set for running the algorithm.

When the sequence of likelihood bounds is chosen based on the observed likelihoods so far, the method becomes adaptive. Requiring again the choice of a stopping rule that may induce bias if stopping occurs too early. And then, in a twist that is not clearly explained in the paper, the focus moves to an improved nested sampler that moves one likelihood value at a time, with a particle step replacing a single particle. (Things get complicated when several particles may take the very same likelihood value, but randomisation helps.) At this stage the algorithm is quite similar to the original nested sampler. Except for the unbiased estimation of the constants, the final constant, and the replacement of exponential weights exp(-t/N) by powers of (N-1/N).

The remainder of this long paper (61 pages!) is dedicated to practical implementation, calibration and running a series of comparisons. A nice final touch is the thanks to the ‘Og for its series of posts on nested sampling, which “helped influence this work, and played a large part in inspiring it.”

In conclusion, this paper is certainly a worthy exploration of the nested sampler, providing further arguments towards a consistent version, with first and foremost an (almost?) unbiased resolution. The comparison with a wide range of alternatives remains open, in particular time-wise, if evidence is the sole target of the simulation. For instance, the choice of this sequence of targets in an SMC may be improved by another sequence, since changing one particle at a time does not sound efficient. The complexity of the implementation and in particular of the simulation from the prior under more and more stringent constraints need to be addressed.

## about paradoxes

Posted in Books, Kids, Statistics, University life with tags bias, book review, email, Jacobian, Mark Chang, MLE, paradoxes, reparameterisation, scientific inference, The Bayesian Choice, unbiasedness on December 5, 2017 by xi'an**A**n email I received earlier today about statistical paradoxes:

I am a PhD student in biostatistics, and an avid reader of your work. I recently came across this blog post, where you review a text on statistical paradoxes, and I was struck by this section:

I found this section provocative, but I am unclear on the nature of these “paradoxes”. I reviewed my stat inference notes and came across the classic example that there is no unbiased estimator for 1/p w.r.t. a binomial distribution, but I believe you are getting at a much more general result. If it’s not too much trouble, I would sincerely appreciate it if you could point me in the direction of a reference or provide a bit more detail for these two “paradoxes”.

The text is Chang’s Paradoxes in Scientific Inference, which I indeed reviewed negatively. To answer about the bias “paradox”, it is indeed a neglected fact that, while the average of *any* transform of a sample obviously is an unbiased estimator of its mean (!), the converse does not hold, namely, an *arbitrary* transform of the model parameter θ is not necessarily enjoying an unbiased estimator. In Lehmann and Casella, Chapter 2, Section 4, this issue is (just slightly) discussed. But essentially, transforms that lead to unbiased estimators are mostly the polynomial transforms of the mean parameters… (This also somewhat connects to a recent X validated question as to why MLEs are not always unbiased. Although the simplest explanation is that the transform of the MLE is the MLE of the transform!) In exponential families, I would deem the range of transforms with unbiased estimators closely related to the collection of functions that allow for inverse Laplace transforms, although I cannot quote a specific result on this hunch.

The other “paradox” is that, if h(X) is the MLE of the model parameter θ for the observable X, the distribution of h(X) has a density different from the density of X and, hence, its maximisation in the parameter θ may differ. An example (my favourite!) is the MLE of ||a||² based on x N(a,I) which is ||x||², a poor estimate, and which (strongly) differs from the MLE of ||a||² based on ||x||², which is close to (1-p/||x||²)²||x||² and (nearly) admissible [as discussed in the Bayesian Choice].

## unbiased HMC

Posted in Books, pictures, Statistics with tags Boston, coupling, Hamiltonian Monte Carlo, Harvard, HMC, Massachusetts, sunrise, unbiasedness on September 25, 2017 by xi'an**J**eremy Heng and Pierre Jacob arXived last week a paper on unbiased Hamiltonian Monte Carlo by coupling, following the earlier paper of Pierre and co-authors on debiasing by coupling a few weeks ago. The coupling within the HMC amounts to running two HMC chains with common random numbers, plus subtleties!

“As with any other MCMC method, HMC estimators are justified in the limit of the number of iterations. Algorithms which rely on such asymptotics face the risk of becoming obsolete if computational power keeps increasing through the number of available processors and not through clock speed.”

The main difficulty here is to have both chains meet (exactly) with large probability, since coupled HMC can only bring these chain close to one another. The trick stands in using both coupled HMC *and* coupled Hastings-Metropolis kernels, since the coupled MH kernel allows for exact meetings when the chains are already close, after which they remain happily and forever together! The algorithm is implemented by choosing between the kernels at random at each iteration. (Unbiasedness follows by the Glynn-Rhee trick, which is eminently well-suited for coupling!) As pointed out from the start of the paper, the appeal of this unbiased version is that the algorithm can be (embarrassingly) parallelised since all processors in use return estimators that are iid copies of one another, hence easily merged into a better estimator.

## X-Outline of a Theory of Statistical Estimation

Posted in Books, Statistics, University life with tags Bayesian Analysis, confidence intervals, credible intervals, Dennis Lindley, Harold Jeffreys, inference, Jerzy Neyman, maximum likelihood estimation, unbiasedness, University of Warwick, X-Outline on March 23, 2017 by xi'an**W**hile visiting Warwick last week, Jean-Michel Marin pointed out and forwarded me this remarkable paper of Jerzy Neyman, published in 1937, and presented to the Royal Society by Harold Jeffreys.

“Leaving apart on one side the practical difficulty of achieving randomness and the meaning of this word when applied to actual experiments…”

“It may be useful to point out that although we are frequently witnessing controversies in which authors try to defend one or another system of the theory of probability as the only legitimate, I am of the opinion that several such theories may be and actually are legitimate, in spite of their occasionallycontradicting one another. Each of these theories is based on some system of postulates, and so long as the postulates forming one particular system do not contradict each other and are sufficient to construct a theory, this is as legitimate as any other. “

This paper is fairly long in part because Neyman starts by setting Kolmogorov’s axioms of probability. This is of historical interest but also needed for Neyman to oppose his notion of probability to Jeffreys’ (which is the same from a formal perspective, I believe!). He actually spends a fair chunk on explaining why constants cannot have anything but trivial probability measures. Getting ready to state that an a priori distribution has no meaning (p.343) and that in the rare cases it does it is mostly unknown. While reading the paper, I thought that the distinction was more in terms of frequentist or conditional properties of the estimators, Neyman’s arguments paving the way to his definition of a confidence interval. Assuming repeatability of the experiment under the same conditions and therefore same parameter value (p.344).

“The advantage of the unbiassed [sic] estimates and the justification of their use lies in the fact that in cases frequently met the probability of their differing very much from the estimated parameters is small.”

“…the maximum likelihood estimates appear to be what could be called the best “almost unbiassed [sic]” estimates.”

It is also quite interesting to read that the principle for insisting on unbiasedness is one of producing small errors, because this is not that often the case, as shown by the complete class theorems of Wald (ten years later). And that maximum likelihood is somewhat relegated to a secondary rank, almost unbiased being understood as consistent. A most amusing part of the paper is when Neyman inverts the credible set into a confidence set, that is, turning what is random in a constant and vice-versa. With a justification that the credible interval has zero or one coverage, while the confidence interval has a long-run validity of returning the correct rate of success. What is equally amusing is that the boundaries of a credible interval turn into functions of the sample, hence could be evaluated on a frequentist basis, as done later by Dennis Lindley and others like Welch and Peers, but that Neyman fails to see this and turn the bounds into hard values. For a given sample.

“This, however, is not always the case, and in general there are two or more systems of confidence intervals possible corresponding to the same confidence coefficient α, such that for certain sample points, E’, the intervals in one system are shorter than those in the other, while for some other sample points, E”, the reverse is true.”

The resulting construction of a confidence interval is then awfully convoluted when compared with the derivation of an HPD region, going through regions of acceptance that are the dual of a confidence interval (in the sampling space), while apparently [from my hasty read] missing a rule to order them. And rejecting the notion of a confidence interval being possibly empty, which, while being of practical interest, clashes with its frequentist backup.