## Monte Carlo Markov chains

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , , , , , , , , , , , on May 12, 2020 by xi'an

Darren Wraith pointed out this (currently free access) Springer book by Massimiliano Bonamente [whose family name means good spirit in Italian] to me for its use of the unusual Monte Carlo Markov chain rendering of MCMC.  (Google Trend seems to restrict its use to California!) This is a graduate text for physicists, but one could nonetheless expect more rigour in the processing of the topics. Particularly of the Bayesian topics. Here is a pot-pourri of memorable quotes:

“Two major avenues are available for the assignment of probabilities. One is based on the repetition of the experiments a large number of times under the same conditions, and goes under the name of the frequentist or classical method. The other is based on a more theoretical knowledge of the experiment, but without the experimental requirement, and is referred to as the Bayesian approach.”

“The Bayesian probability is assigned based on a quantitative understanding of the nature of the experiment, and in accord with the Kolmogorov axioms. It is sometimes referred to as empirical probability, in recognition of the fact that sometimes the probability of an event is assigned based upon a practical knowledge of the experiment, although without the classical requirement of repeating the experiment for a large number of times. This method is named after the Rev. Thomas Bayes, who pioneered the development of the theory of probability.”

“The likelihood P(B/A) represents the probability of making the measurement B given that the model A is a correct description of the experiment.”

“…a uniform distribution is normally the logical assumption in the absence of other information.”

“The Gaussian distribution can be considered as a special case of the binomial, when the number of tries is sufficiently large.”

“This clearly does not mean that the Poisson distribution has no variance—in that case, it would not be a random variable!”

“The method of moments therefore returns unbiased estimates for the mean and variance of every distribution in the case of a large number of measurements.”

“The great advantage of the Gibbs sampler is the fact that the acceptance is 100 %, since there is no rejection of candidates for the Markov chain, unlike the case of the Metropolis–Hastings algorithm.”

Let me then point out (or just whine about!) the book using “statistical independence” for plain independence, the use of / rather than Jeffreys’ | for conditioning (and sometimes forgetting \ in some LaTeX formulas), the confusion between events and random variables, esp. when computing the posterior distribution, between models and parameter values, the reliance on discrete probability for continuous settings, as in the Markov chain chapter, confusing density and probability, using Mendel’s pea data without mentioning the unlikely fit to the expected values (or, as put more subtly by Fisher (1936), “the data of most, if not all, of the experiments have been falsified so as to agree closely with Mendel’s expectations”), presenting Fisher’s and Anderson’s Iris data [a motive for rejection when George was JASA editor!] as a “a new classic experiment”, mentioning Pearson but not Lee for the data in the 1903 Biometrika paper “On the laws of inheritance in man” (and woman!), and not accounting for the discrete nature of this data in the linear regression chapter, the three page derivation of the Gaussian distribution from a Taylor expansion of the Binomial pmf obtained by differentiating in the integer argument, spending endless pages on deriving standard properties of classical distributions, this appalling mess of adding over the conditioning atoms with no normalisation in a Poisson experiment

$P(X=4|\mu=0,1,2) = \sum_{\mu=0}^2 \frac{\mu^4}{4!}\exp\{-\mu\}$,

botching the proof of the CLT, which is treated before the Law of Large Numbers, restricting maximum likelihood estimation to the Gaussian and Poisson cases and muddling its meaning by discussing unbiasedness, confusing a drifted Poisson random variable with a drift on its parameter, as well as using the pmf of the Poisson to define an area under the curve (Fig. 5.2), sweeping the improperty of a constant prior under the carpet, defining a null hypothesis as a range of values for a summary statistic, no mention of Bayesian perspectives in the hypothesis testing, model comparison, and regression chapters, having one-dimensional case chapters followed by two-dimensional case chapters, reducing model comparison to the use of the Kolmogorov-Smirnov test, processing bootstrap and jackknife in the Monte Carlo chapter without a mention of importance sampling, stating recurrence results without assuming irreducibility, motivating MCMC by the intractability of the evidence, resorting to the term link to designate the current value of a Markov chain, incorporating the need for a prior distribution in a terrible description of the Metropolis-Hastings algorithm, including a discrete proof for its stationarity, spending many pages on early 1990’s MCMC convergence tests rather than discussing the adaptive scaling of proposal distributions, the inclusion of numerical tables [in a 2017 book] and turning Bayes (1763) into Bayes and Price (1763), or Student (1908) into Gosset (1908).

[Usual disclaimer about potential self-plagiarism: this post or an edited version of it could possibly appear later in my Books Review section in CHANCE. Unlikely, though!]

## nested sampling with a test

Posted in Books, pictures, Statistics, Travel, University life with tags , , , , , on December 5, 2014 by xi'an

On my way back from Warwick, I read through a couple preprints, including this statistical test for nested sampling algorithms by Johannes Buchner. As it happens, I had already read and commented it in July! However, without the slightest memory of it (sad, isn’t it?!), I focussed this time much more on the modification proposed to MultiNest than on the test itself, which is in fact a Kolmogorov-Smirnov test applied to a specific target function.

Indeed, when reading the proposed modification of Buchner, I thought of a modification to the modification that sounded more appealing. Without getting back  to defining nested sampling in detail, this algorithm follows a swarm of N particles within upper-level sets of the likelihood surface, each step requiring a new simulation above the current value of the likelihood. The remark that set me on this time was that we should exploit the fact that (N-1) particles were already available within this level set. And uniformly distributed herein. Therefore this particle cloud should be exploited as much as possible to return yet another particle distributed just as uniformly as the other ones (!). Buchner proposes an alternative to MultiNest based on a randomised version of the maximal distance to a neighbour and a ball centre picked at random (but not uniformly). But it would be just as feasible to draw a distance from the empirical cdf of the distances to the nearest neighbours or to the k-nearest neighbours. With some possible calibration of k. And somewhat more accurate, because this distribution represents the repartition of the particle within the upper-level set. Although I looked at it briefly in the [sluggish] metro from Roissy airport, I could not figure out a way to account for the additional point to be included in the (N-1) existing particles. That is, how to deform the empirical cdf of those distances to account for an additional point. Unless one included the just-removed particle, which is at the boundary of this upper-level set. (Or rather, which defines the boundary of this upper-level set.) I have no clear intuition as to whether or not this would amount to a uniform generation over the true upper-level set. But simulating from the distance distribution would remove (I think) the clustering effect mentioned by Buchner.

“Other priors can be mapped [into the uniform prior over the unit hypercube] using the inverse of the cumulative prior distribution.”

Hence another illustration of the addictive features of nested sampling! Each time I get back to this notion, a new understanding or reinterpretation comes to mind. In any case, an equally endless source of projects for Master students. (Not that I agree with the above quote, mind you!)

## a statistical test for nested sampling

Posted in Books, Statistics, University life with tags , , , , , on July 25, 2014 by xi'an

A new arXival on nested sampling: “A statistical test for nested sampling algorithms” by Johannes Buchner. The point of the test is to check if versions of the nested sampling algorithm that fail to guarantee increased likelihood (or nesting) at each step are not missing parts of the posterior mass. and hence producing biased evidence approximations. This applies to MultiNest for instance. This version of nest sampling evaluates the above-threshold region by drawing hyper-balls around the remaining points. A solution which is known to fail in one specific but meaningful case. Buchner’s  arXived paper proposes an hyper-pyramid distribution for which the volume of any likelihood constrained set is known. Hence allowing for a distribution test like Kolmogorov-Smirnov. Confirming the findings of Beaujean and Caldwell (2013). The author then proposes an alternative to MultiNest that is more robust but also much more costly as it computes distances between all pairs of bootstrapped samples. This solution passes the so-called “shrinkage test”, but it is orders of magnitude less efficient than MultiNest. And also simply shows that its coverage is fine for a specific target rather than all possible targets. I wonder if a solution to the problem is at all possible given that evaluating a support or a convex hull is a complex problem which complexity explodes with the dimension.

## another R new trick [new for me!]

Posted in Books, Kids, R, Statistics, University life with tags , , , , , , , on July 16, 2014 by xi'an

While working with Andrew and a student from Dauphine on importance sampling, we wanted to assess the distribution of the resulting sample via the Kolmogorov-Smirnov measure

$\max_x |\hat{F_n}(x)-F(x)|$

where F is the target.  This distance (times √n) has an asymptotic distribution that does not depend on n, called the Kolmogorov distribution. After searching for a little while, we could not figure where this distribution was available in R. It had to, since ks.test was returning a p-value. Hopefully correct! So I looked into the ks.test function, which happens not to be entirely programmed in C, and found the line

PVAL <- 1 - if (alternative == "two.sided")
.Call(C_pKolmogorov2x, STATISTIC, n)


which means that the Kolmogorov distribution is coded as a C function C_pKolmogorov2x in R. However, I could not call the function myself.

> .Call(C_pKolmogorov2x,.3,4)


Hence, as I did not want to recode this distribution cdf, I posted the question on stackoverflow (long time no see!) and got a reply almost immediately as to use the package kolmim. Followed by the extra comment from the same person that calling the C code only required to add the path to its name, as in

> .Call(stats:::C_pKolmogorov2x,STAT=.3,n=4)
[1] 0.2292


## Maximum likelihood vs. likelihood-free quantum system identification in the atom maser

Posted in Books, Statistics, University life with tags , , , , , , on December 2, 2013 by xi'an

This paper (arXived a few days ago) compares maximum likelihood with different ABC approximations in a quantum physic setting and for an atom maser modelling that essentially bears down to a hidden Markov model. (I mostly blanked out of the physics explanations so cannot say I understand the model at all.) While the authors (from the University of Nottingham, hence Robin’s statue above…) do not consider the recent corpus of work by Ajay Jasra and coauthors (some of which was discussed on the ‘Og), they get interesting findings for an equally interesting model. First, when comparing the Fisher informations on the sole parameter of the model, the “Rabi angle” φ, for two different sets of statistics, one gets to zero at a certain value of the parameter, while the (fully informative) other is maximum (Figure 6). This is quite intriguing, esp. give the shape of the information in the former case, which reminds me of (my) inverse normal distributions. Second, the authors compare different collections of summary statistics in terms of ABC distributions against the likelihood function. While most bring much more uncertainty in the analysis, the whole collection recovers the range and shape of the likelihood function, which is nice. Third, they also use a kolmogorov-Smirnov distance to run their ABC, which is enticing, except that I cannot fathom from the paper when one would have enough of a sample (conditional on a parameter value) to rely on what is essentially an estimate of the sampling distribution. This seems to contradict the fact that they only use seven summary statistics. Or it may be that the “statistic” of waiting times happens to be a vector, in which case a Kolmogorov-Smirnov distance can indeed be adopted for the distance… The fact that the grouped seven-dimensional summary statistic provides the best ABC fit is somewhat of a surprise when considering the problem enjoys a single parameter.

“However, in practice, it is often difficult to find an s(.) which is sufficient.”

Just a point that irks me in most ABC papers is to find quotes like the above, since in most models, it is easy to show that there cannot be a non-trivial sufficient statistic! As soon as one leaves the exponential family cocoon, one is doomed in this respect!!!