**A**s Randal Douc and Éric Moulines are both very close friends and two authors of this book on Markov chains, I cannot engage into a regular book review! Judging from the table of contents, the coverage is not too dissimilar to the now classic Markov chain Stochastic Stability book by Sean Meyn and the late Richard Tweedie (1994), called the Bible of Markov chains by Peter Glynn, with more emphasis on convergence matters and a more mathematical perspective. The 757 pages book also includes a massive appendix on maths and probability background. As indicated in the preface, “the reason [the authors] thought it would be useful to write a new book is to survey some of the developments made during the 25 years that have elapsed since the publication of Meyn and Tweedie (1993b).” Connecting with the theoretical developments brought by MCMC methods. Like subgeometric rates of convergence to stationarity, sample paths, limit theorems, and concentration inequalities. The book also reflects on the numerous contributions of the authors to the field. Hence a perfect candidate for teaching Markov chains to mathematically well-prepared. graduate audiences. Congrats to the authors!

## Archive for probability theory

## Markov Chains [not a book review]

Posted in Books, pictures, Statistics, University life with tags book review, concentration inequalities, coupling, Eric Moulines, irreducibility, Markov chain and stochastic stability, Markov chain Monte Carlo, Markov chains, MCMC convergence, probability theory, Randal Douc, Richard Tweedie, Sean Meyn, Wasserstein distance on January 14, 2019 by xi'an## are there a frequentist and a Bayesian likelihoods?

Posted in Statistics with tags Bayes factor, Bayes formula, cross validated, dominating measure, Harold Jeffreys, likelihood function, Metron, probability theory, R.A. Fisher, University of Amsterdam, wikipedia on June 7, 2018 by xi'an**A** question that came up on X validated and led me to spot rather poor entries in Wikipedia about both the likelihood function and Bayes’ Theorem. Where unnecessary and confusing distinctions are made between the frequentist and Bayesian versions of these notions. I have already discussed the later (Bayes’ theorem) a fair amount here. The discussion about the likelihood is quite bemusing, in that the likelihood function is the … function of the parameter equal to the density indexed by this parameter at the observed value.

“What we can find from a sample is the likelihood of any particular value of r, if we define the likelihood as a quantity proportional to the probability that, from a population having the particular value of r, a sample having the observed value of r, should be obtained.”R.A. Fisher,On the “probable error’’ of a coefficient of correlation deduced from a small sample.Metron1, 1921, p.24

By mentioning an informal side to likelihood (rather than to likelihood function), and then stating that the likelihood is not a probability in the frequentist version but a probability in the Bayesian version, the W page makes a complete and unnecessary mess. Whoever is ready to rewrite this introduction is more than welcome! (Which reminded me of an earlier question also on X validated asking why a common reference measure was needed to define a likelihood function.)

This also led me to read a recent paper by Alexander Etz, whom I met at E.J. Wagenmakers‘ lab in Amsterdam a few years ago. Following Fisher, as Jeffreys complained about

“..likelihood, a convenient term introduced by Professor R.A. Fisher, though in his usage it is sometimes multiplied by a constant factor. This is the probability of the observations given the original information and the hypothesis under discussion.”H. Jeffreys,Theory of Probability, 1939, p.28

Alexander defines the likelihood up to a constant, which causes extra-confusion, for free!, as there is no foundational reason to introduce this degree of freedom rather than imposing an exact equality with the density of the data (albeit with an arbitrary choice of dominating measure, never neglect the dominating measure!). The paper also repeats the message that the likelihood is not a probability (density, *missing in the paper*). And provides intuitions about maximum likelihood, likelihood ratio and Wald tests. But does not venture into a separate definition of the likelihood, being satisfied with the fundamental notion to be plugged into the magical formula

posterior∝prior×likelihood

## fiducial inference

Posted in Books, Mountains, pictures, Running, Statistics, Travel with tags cross validated, fiducial distribution, fiducial inference, fjord, σ-algebra, measure theory, Norway, probability theory, Ronald Fisher, Trondheim, Trondheimsfjord on October 30, 2017 by xi'an**I**n connection with my recent tale of the many ε’s, I received from Gunnar Taraldsen [from Tronheim, Norge] a paper [jointly written with Bo Lindqvist and just appeared on-line in JSPI] on conditional fiducial models.

“The role of the prior and the statistical model in Bayesian analysis is replaced by the use of the fiducial model x=R(θ,ε) in fiducial inference. The fiducial is obtained in this case without a prior distribution for the parameter.”

Reading this paper after addressing the X validated question made me understood better the fundamental wrongness of fiducial analysis! If I may herein object to Fisher himself… Indeed, when writing x=R(θ,ε), as the representation of the [observed] random variable x as a deterministic transform of a parameter θ and of an [unobserved] random factor ε, the two random variables x and ε are based on the same random preimage ω, i.e., x=x(ω) and ε=ε(ω). Observing x hence sets a massive constraint on the preimage ω and on the conditional distribution of ε=ε(ω). When the fiducial inference incorporates another level of randomness via an independent random variable ε’ and inverts x=R(θ,ε’) into θ=θ(x,ε’), assuming there is only one solution to the inversion, it modifies the nature of the underlying σ-algebra into something that is incompatible with the original model. Because of this sudden duplication of the random variates. While the inversion of this equation x=R(θ,ε’) gives an idea of the possible values of θ when ε varies according to its [prior] distribution, it does not account for the connection between x and ε. And does not turn the original parameter into a random variable with an implicit prior distribution.

As to conditional fiducial distributions, they are defined by inversion of x=R(θ,ε), under a certain constraint on θ, like C(θ)=0, which immediately raises a Pavlovian reaction in me, namely that since the curve C(θ)=0 has measure zero under the original fiducial distribution, how can this conditional solution be uniquely or at all defined. Or to avoid the Borel paradox mentioned in the paper. If I get the meaning of the authors in this section, the resulting fiducial distribution will actually depend on the choice of σ-algebra governing the projection.

“A further advantage of the fiducial approach in the case of a simple fiducial model is that independent samples areproduced directly from independent sampling from [the fiducial distribution]. Bayesian simulations most often come as dependent samples from a Markov chain.”

This side argument in “favour” of the fiducial approach is most curious as it brings into the picture computational aspects that do not have any reason to be there. (The core of the paper is concerned with the unicity of the fiducial distribution in some univariate settings. Not with computational issues.)

## generating from a failure rate function [X’ed]

Posted in Books, Kids, Statistics, University life with tags cross validated, failure rate, Monte Carlo Statistical Methods, probability theory, reliability, simulation, StackExchange, stackoverflow, survival analysis on July 4, 2015 by xi'an**W**hile I now try to abstain from participating to the Cross Validated forum, as it proves too much of a time-consuming activity with little added value (in the sense that answers are much too often treated as disposable napkins by users who cannot be bothered to open a textbook and who usually do not exhibit any long-term impact of the provided answer, while clogging the forum with so many questions that the individual entries seem to get so little traffic, when compared say with the stackoverflow forum, to the point of making the analogy with disposable wipes more appropriate!), I came across a truly interesting question the other night. Truly interesting for me in that I had never considered the issue before.

The question is essentially wondering at how to simulate from a distribution defined by its failure rate function, which is connected with the density f of the distribution by

From a purely probabilistic perspective, defining the distribution through f or through η is equivalent, as shown by the relation

but, from a simulation point of view, it may provide a different entry. Indeed, all that is needed is the ability to solve (in X) the equation

when U is a Uniform (0,1) variable. Which may help in that it does not require a derivation of f. Obviously, this also begs the question as to why would a distribution be defined by its failure rate function.

## probabilistic numerics

Posted in pictures, Running, Statistics, Travel, University life with tags Bayesian statistics, Brownian motion, Coventry, CRiSM, Gaussian processes, numerical analysis, numerical integration, Persi Diaconis, probability theory, Runge-Kutta, stochastic processes, sunrise, trapezoidal approximation, University of Warwick, Warwickshire, workshop on April 27, 2015 by xi'an**I** attended an highly unusual workshop while in Warwick last week. Unusual for me, obviously. It was about *probabilistic numerics*, i.e., the use of probabilistic or stochastic arguments in the numerical resolution of (possibly) deterministic problems. The notion in this approach is fairly Bayesian in that it makes use to prior information or belief about the quantity of interest, e.g., a function, to construct an usually Gaussian process prior and derive both an estimator that is identical to a numerical method (e.g., Runge-Kutta or trapezoidal integration) and uncertainty or variability around this estimator. While I did not grasp much more than the classy introduction talk by Philipp Hennig, this concept sounds fairly interesting, if only because of the Bayesian connection, and I wonder if we will soon see a probability numerics section at ISBA! More seriously, placing priors on functions or functionals is a highly formal perspective (as in Bayesian non-parametrics) and it makes me wonder how much of the data (evaluation of a function at a given set of points) and how much of the prior is reflected in the output [variability]. (Obviously, one could also ask a similar question for statistical analyses!) For instance, issues of singularity arise among those stochastic process priors.

Another question that stemmed from this talk is whether or not more efficient numerical methods can derived that way, in addition to recovering the most classical ones. Somewhat, somehow, given the idealised nature of the prior, it feels like priors could be more easily compared or ranked than in classical statistical problems. Since the aim is to figure out the value of an integral or the solution to an ODE. (Or maybe not, since again almost the same could be said about estimating a normal mean.)

## stopping rule impact

Posted in Books, R, Statistics, University life with tags bias, p-values, probability theory, stopping rule on May 9, 2014 by xi'an**H**ere is a question from my friend Shravan Vasishth about the consequences of using a stopping rule:

Psycholinguists and psychologists often adopt the following type of data-gathering procedure: The experimenter gathers n data points, then checks for significance (p<0.05 or not). If it’s not significant, he gets more data (n more data points). Since time and money are limited, he might decide to stop anyway at sample size, say, some multiple of n. One can play with different scenarios here. A typical n might be 10 or 15.

This approach would give us a distribution of t-values and p-values under repeated sampling. Theoretically, under the standard assumptions of frequentist methods, we expect a Type I error to be 0.05. This is the case in standard analyses (I also track the t-statistic, in order to compare it with my stopping rule code below).

Here’s a simulation showing what happens. I wanted to ask you whether this simulation makes sense. I assume here that the experimenter gathers 10 data points, then checks for significance (p<0.05 or not). If it’s not significant, he gets more data (10 more data points). Since time and money are limited, he might decide to stop anyway at sample size 60. This gives us p-values under repeated sampling. Theoretically, under the standard assumptions of frequentist methods, we expect a Type I error to be 0.05. This is the case in standard analyses:##Standard: pvals<-NULL tstat_standard<-NULL n<-10 # sample size nsim<-1000 # number of simulations stddev<-1 # standard dev mn<-0 ## mean for(i in 1:nsim){ samp<-rnorm(n,mean=mn,sd=stddev) pvals[i]<-t.test(samp)$p.value tstat_standard[i]<-t.test(samp)$statistic} ## Type I error rate: about 5% as theory says: table(pvals<0.05)[2]/nsimBut the situation quickly deteriorates as soon as adopt the strategy I outlined above:

pvals<-NULL tstat<-NULL ## how many subjects can I run? upper_bound<-n*6 for(i in 1:nsim){ ## at the outset we have no significant result: significant<-FALSE ## null hyp is going to be true, ## so any rejection is a mistake. ## take sample: x<-rnorm(n,mean=mn,sd=stddev) while(!significant & length(x)<upper_bound){ ## if not significant: if(t.test(x)$p.value>0.05){ ## get more data: x<-append(x,rnorm(n,mean=mn,sd=stddev)) ## otherwise stop: } else {significant<-TRUE}} ## will be either significant or not: pvals[i]<-t.test(x)$p.value tstat[i]<-t.test(x)$statistic}Now let’s compare the distribution of the t-statistic in the standard case vs with the above stopping rule. We get fatter tails with the above stopping rule, as shown by the histogram below.

Is this a correct way to think about the stopping rule problem?

**T**o which I replied the following:

By adopting a stopping rule on a random iid sequence, you favour values in the sequence that agree with your stopping condition, hence modify the distribution of the outcome. To take an extreme example, if you draw N(0,1) variates

untilthe empirical average is between -2 and 2, the average thus produced cannot remain N(0,1/n) but have a different distribution.The t-test statistic you build from your experiment is no longer distributed as a uniform variate because of the stopping rule: the sample(x

_{1},…,x_{10m}) (with random size 10m [resulting from increases in the sample size by adding 10 more observations at a time] is distributed fromif 10m<60 [assuming the maximal acceptable sample size is 60] and from

otherwise. The histogram at the top of this post is the empirical distribution of the average of those observations, clearly far from a normal distribution.

## Bayes on the radio (regrets)

Posted in Books, Kids, Running, Statistics with tags Baccalauréat, Bayes formula, Bayes theorem, court of appeal, justice, probability theory on November 13, 2012 by xi'an**W**hile running this morning I was reconsidering (over and over) my discussion of Bayes’ formula on the radio and thought I should have turned the presentation of Bayes’ theorem differently. I spent much too much time on the math side of Bayes’ formula and not enough on the stat side. The math aspect is not of real importance as it is a mere reformulation of conditional probabilities. The stat side is what matters as introducing a (prior) distribution on the parameter (space) is the #1 specificity of Bayesian statistics…. And the focus point of most criticisms, as expressed later by the physicist working on the Higgs boson, Dirk Zerwas.

**I** also regret not mentioning that Bayes’ formula was taught in French high schools, as illustrated by the anecdote of Bayes at the bac. And not reacting at the question about Bayes in the courtroom with yet another anecdote of Bayes’ formula been thrown out of the accepted tools by an English court of appeal about a year ago. Oh well, another argument for sticking to the written world.