## open reviews

Posted in Statistics with tags , , , , , , on September 13, 2019 by xi'an

When looking at a question on X validated, on the expected Metropolis-Hastings ratio being one (not all the time!), I was somewhat bemused at the OP linking to an anonymised paper under review for ICLR, as I thought this was breaching standard confidentiality rules for reviews. Digging a wee bit deeper, I realised this was a paper from the previous ICLR conference, already published both on arXiv and in the 2018 conference proceedings, and that ICLR was actually resorting to an open review policy where both papers and reviews were available and even better where anyone could comment on the paper while it was under review. And after. Which I think is a great idea, the worst possible situation being a poor paper remaining un-discussed. While I am not a big fan of the brutalist approach of many machine-learning conferences, where the restrictive format of both submissions and reviews is essentially preventing in-depth reviews, this feature should be added to statistics journal webpages (until PCIs become the norm).

## my likelihood is dominating my prior [not!]

Posted in Kids, Statistics with tags , , , , , on August 29, 2019 by xi'an

An interesting misconception read on X validated today, with a confusion between the absolute value of the likelihood function and its variability. Which I have trouble explaining except possibly by the extrapolation from the discrete case and a confusion between the probability density of the data [scaled as a probability] and the likelihood function [scale-less]. I also had trouble convincing the originator of the question of the irrelevance of the scale of the likelihood per se, even when demonstrating that |$$𝚺|$$ could vanish from the posterior with no consequence whatsoever. It is only when I thought of the case when the likelihood is constant in $$𝜃$$ that I managed to make my case.

## a problem that did not need ABC in the end

Posted in Books, pictures, Statistics, Travel with tags , , , , , , , , , , , , on August 8, 2019 by xi'an

While in Denver, at JSM, I came across [across validated!] this primarily challenging problem of finding the posterior of the 10³ long probability vector of a Multinomial M(10⁶,p) when only observing the range of a realisation of M(10⁶,p). This sounded challenging because the distribution of the pair (min,max) is not available in closed form. (Although this allowed me to find a paper on the topic by the late Shanti Gupta, who was chair at Purdue University when I visited 32 years ago…) This seemed to call for ABC (especially since I was about to give an introductory lecture on the topic!, law of the hammer…), but the simulation of datasets compatible with the extreme values of both minimum and maximum, m=80 and M=12000, proved difficult when using a uniform Dirichlet prior on the probability vector, since these extremes called for both small and large values of the probabilities. However, I later realised that the problem could be brought down to a Multinomial with only three categories and the observation (m,M,n-m-M), leading to an obvious Dirichlet posterior and a predictive for the remaining 10³-2 realisations.

## Gibbs sampling with incompatible conditionals

Posted in Books, Kids, R, Statistics with tags , , , , , , on July 23, 2019 by xi'an

An interesting question (with no clear motivation) on X validated wondering why a Gibbs sampler produces NAs… Interesting because multi-layered:

1. The attached R code indeed produces NAs because it calls the Negative Binomial Neg(x¹,p) random generator with a zero success parameter, x¹=0, which automatically returns NAs. This can be escaped by returning a one (1) instead.
2. The Gibbs sampler is based on a Bin(x²,p) conditional for X¹ and a Neg(x¹,p) conditional for X². When using the most standard version of the Negative Binomial random variate as the number of failures, hence supported on 0,1,2…. these two conditionals are incompatible, i.e., there cannot be a joint distribution behind that returns these as conditionals, which makes the limiting behaviour of the Markov chain harder to study. It however seems to converge to a distribution close to zero, which is not contradictory with the incompatibility property: the stationary joint distribution simply does not enjoy the conditionals used by the Gibbs sampler as its conditionals.
3. When using the less standard version of the Negative Binomial random variate understood as a number of attempts for the conditional on X², the two conditionals are compatible and correspond to a joint measure proportional to $x_1^{-1} {x_1 \choose x_2} p^{x_2} (1-p)^{x_1-x_2}$, however this pmf does not sum up to a finite quantity (as in the original Gibbs for Kids example!), hence the resulting Markov chain is at best null recurrent, which seems to be the case for p different from ½. This is unclear to me for p=½.

## truncated Normal moments

Posted in Books, Kids, Statistics with tags , , , , , on May 24, 2019 by xi'an

An interesting if presumably hopeless question spotted on X validated: a lower-truncated Normal distribution is parameterised by its location, scale, and truncation values, μ, σ, and α. There exist formulas to derive the mean and variance of the resulting distribution,  that is, when α=0,

$\Bbb{E}_{\mu,\sigma}[X]= \mu + \frac{\varphi(\mu/\sigma)}{1-\Phi(-\mu/\sigma)}\sigma$

and

$\text{var}_{\mu,\sigma}(X)=\sigma^2\left[1-\frac{\mu\varphi(\mu/\sigma)/\sigma}{1-\Phi(-\mu/\sigma)} -\left(\frac{\varphi(\mu/\sigma)}{1-\Phi(-\mu/\sigma)}\right)^2\right]$

but there is no easy way to choose (μ, σ) from these two quantities. Beyond numerical resolution of both equations. One of the issues is that ( μ, σ) is not a location-scale parameter for the truncated Normal distribution when α is fixed.

## visualising bias and unbiasedness

Posted in Books, Kids, pictures, R, Statistics, University life with tags , , , , , , , , , on April 29, 2019 by xi'an

A question on X validated led me to wonder at the point made by Christopher Bishop in his Pattern Recognition and Machine Learning book about the MLE of the Normal variance being biased. As it is illustrated by the above graph that opposes the true and green distribution of the data (made of two points) against the estimated and red distribution. While it is true that the MLE under-estimates the variance on average, the pictures are cartoonist caricatures in their deviance permanence across three replicas. When looking at 10⁵ replicas, rather than three, and at samples of size 10, rather than 2, the distinction between using the MLE (left) and the unbiased estimator of σ² (right).

When looking more specifically at the case n=2, the humongous variability of the density estimate completely dwarfs the bias issue:

Even when averaging over all 10⁵ replications, the difference is hard to spot (and both estimations are more dispersed than the truth!):

## dynamic nested sampling for stars

Posted in Books, pictures, Statistics, Travel with tags , , , , , , , , , , , , , , , , , on April 12, 2019 by xi'an

In the sequel of earlier nested sampling packages, like MultiNest, Joshua Speagle has written a new package called dynesty that manages dynamic nested sampling, primarily intended for astronomical applications. Which is the field where nested sampling is the most popular. One of the first remarks in the paper is that nested sampling can be more easily implemented by using a Uniform reparameterisation of the prior, that is, a reparameterisation that turns the prior into a Uniform over the unit hypercube. Which means in fine that the prior distribution can be generated from a fixed vector of uniforms and known transforms. Maybe not such an issue given that this is the prior after all.  The author considers this makes sampling under the likelihood constraint a much simpler problem but it all depends in the end on the concentration of the likelihood within the unit hypercube. And on the ability to reach the higher likelihood slices. I did not see any special trick when looking at the documentation, but reflected on the fundamental connection between nested sampling and this ability. As in the original proposal by John Skilling (2006), the slice volumes are “estimated” by simulated Beta order statistics, with no connection with the actual sequence of simulation or the problem at hand. We did point out our incomprehension for such a scheme in our Biometrika paper with Nicolas Chopin. As in earlier versions, the algorithm attempts at visualising the slices by different bounding techniques, before proceeding to explore the bounded regions by several exploration algorithms, including HMC.

“As with any sampling method, we strongly advocate that Nested Sampling should not be viewed as being strictly“better” or “worse” than MCMC, but rather as a tool that can be more or less useful in certain problems. There is no “One True Method to Rule Them All”, even though it can be tempting to look for one.”

When introducing the dynamic version, the author lists three drawbacks for the static (original) version. One is the reliance on this transform of a Uniform vector over an hypercube. Another one is that the overall runtime is highly sensitive to the choice the prior. (If simulating from the prior rather than an importance function, as suggested in our paper.) A third one is the issue that nested sampling is impervious to the final goal, evidence approximation versus posterior simulation, i.e., uses a constant rate of prior integration. The dynamic version simply modifies the number of point simulated in each slice. According to the (relative) increase in evidence provided by the current slice, estimated through iterations. This makes nested sampling a sort of inversted Wang-Landau since it sharpens the difference between slices. (The dynamic aspects for estimating the volumes of the slices and the stopping rule may hinder convergence in unclear ways, which is not discussed by the paper.) Among the many examples produced in the paper, a 200 dimension Normal target, which is an interesting object for posterior simulation in that most of the posterior mass rests on a ring away from the maximum of the likelihood. But does not seem to merit a mention in the discussion. Another example of heterogeneous regression favourably compares dynesty with MCMC in terms of ESS (but fails to include an HMC version).

[Breaking News: Although I wrote this post before the exciting first image of the black hole in M87 was made public and hence before I was aware of it, the associated AJL paper points out relying on dynesty for comparing several physical models of the phenomenon by nested sampling.]