## Archive for conditional probability

## Monty Hall closes the door

Posted in Books, Kids, pictures with tags competition, conditional probability, game show, Monty Hall, Monty Hall problem, paradoxes, pop culture, Stigler's Law, The New York Times, USA on October 1, 2017 by xi'an

**A**mong much more dramatic news today, I learned of the passing of Monty Hall, who achieved long-lasting fame among probabilists through the TV game show that led to the Monty Hall problem, a simple conditional probability derivation that often leads to arguments because of the loose wording of the conditioning event. By virtue of Stigler's Law, the Monty Hall game was actually invented earlier, apparently by the French probabilist Joseph Bertrand in his *Calcul des probabilités*. The New York Times article linked with the image points out the role of the participants' outfits in getting selected by the host, Monty Hall. And that one show had a live elephant behind a door instead of a goat, an elephant which freaked out..!
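To make the role of the conditioning event explicit, here is a brief sketch of the standard derivation (my own summary, under the usual protocol where the host knows where the car is, always opens an unpicked door hiding a goat, and chooses uniformly when two such doors are available). With the contestant holding door 1 and the host opening door 3,

$$\mathbb{P}(\text{car behind }1\mid\text{host opens }3)=\frac{\frac{1}{2}\times\frac{1}{3}}{\frac{1}{2}\times\frac{1}{3}+1\times\frac{1}{3}+0\times\frac{1}{3}}=\frac{1}{3},$$

so switching wins with probability 2/3. Conditioning instead on the looser event "the car is not behind door 3" produces 1/2, which is precisely where the arguments start.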

## what makes variables random [book review]

Posted in Books, Mountains, Statistics with tags Bayesian Analysis, Bertrand's paradox, conditional probability, introductory textbooks, σ-algebra, Lebesgue integration, Riemann integration on July 19, 2017 by xi'an

**W**hen the goal of a book is to make measure-theoretic probability available to applied researchers for conducting their research, I cannot but applaud! Peter Veazie's goal of writing "a brief text that provides a basic conceptual introduction to measure theory" (p.4) is hence most commendable. Before reading What makes variables random, I was uncertain how this could be achieved with a limited calculus background, given the difficulties met by our third-year maths students. After reading the book, I am even less certain this is feasible!

“…it is the data generating process that makes the variables random and not the data.”

Chapter 2 is about basic notions of set theory. Chapter 3 defines measurable sets and measurable functions and integrals against a given measure μ as

$$\int f\,\text{d}\mu=\sup_{\mathcal{P}}\,\sum_{A\in\mathcal{P}}\Big[\inf_{\omega\in A}f(\omega)\Big]\,\mu(A),$$

the supremum being taken over partitions 𝒫 of the space, which I find particularly unnatural compared with the definition through simple functions (esp. because it does not tell how to handle 0×∞). The ensuing discussion shows the limitation of the exercise, in that the definition is only explained for finite sets (since the notion of a partition achieving the supremum on page 29 is otherwise meaningless). This is a generic problem with the book, in that most examples in the probability section relate to discrete settings (see the discussion of the power set p.66). I also did not see a justification as to why measurable functions enjoy well-defined integrals in the above sense. All in all, to see less than ten pages allocated to measure theory *per se* is rather staggering! For instance, the integral with respect to the Lebesgue measure does not appear to be defined at all.
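For reference, the standard construction through simple functions, which the above compares unfavourably with, proceeds as

$$\int s\,\text{d}\mu=\sum_{i=1}^n a_i\,\mu(A_i)\quad\text{for}\quad s=\sum_{i=1}^n a_i\,\mathbb{1}_{A_i},\ a_i\ge0,\qquad\int f\,\text{d}\mu=\sup\Big\{\int s\,\text{d}\mu\,:\ 0\le s\le f,\ s\ \text{simple}\Big\},$$

where the convention 0×∞=0 settles the ambiguity mentioned above.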

“…the mathematical probability theory underlying our analyses is just mathematics…”

Chapter 4 moves to probability measures. It distinguishes between objective (or frequentist) and subjective measures, which is of course open to diverse interpretations. And the definition of a conditional measure is the traditional one, conditional on a set rather than on a σ-algebra, surprisingly, as the latter is in my opinion one major reason for using measures in probability theory, and one that avoids unpleasant issues such as Bertrand's paradox. While random variables are defined in the standard sense of real-valued measurable functions, I did not see a definition of a continuous random variable or of the Lebesgue measure. And there are only a few lines (p.48) about the notion of expectation, which is so central to measure-theoretic probability as to provide a way of entry into measure theory! Progressing further, the σ-algebra induced by a random variable is defined as a partition (p.52), a particularly obscure notion for continuous rv's. When the conditional density of one random variable given the realisation of another is finally introduced (p.63), as an expectation reconciling with the set-wise definition of conditional probabilities, it is in a fairly convoluted way that I fear will scare newcomers out of their wits, since it relies on a sequence of nested sets with positive measure, implying an underlying topology and the like, which somewhat shows the impossibility of the overall task…
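To spell out (in my notation, not the book's) what such a nested-sequence construction amounts to: for sets $B_n$ shrinking to $\{y\}$ with $\mathbb{P}(Y\in B_n)>0$,

$$\mathbb{P}(X\in A\mid Y=y)=\lim_{n\to\infty}\frac{\mathbb{P}(X\in A,\,Y\in B_n)}{\mathbb{P}(Y\in B_n)},$$

a limit that indeed presupposes a topology (or at least a metric) to make sense of the shrinking sequence, while the measure-theoretic definition of conditioning on a σ-algebra requires no such structure.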

“In the Bayesian analysis, the likelihood provides meaning to the posterior.”

Statistics is hurriedly introduced in a short section at the end of Chapter 4, assuming the notion of likelihood is already known by the readers. But it nitpicks (p.65) at the representation of the terms in the log-likelihood as depending on an unspecified parameter value θ [not to be confused with the data-generating value of θ, which does not appear clearly in this section]. The section also manages to include arcane remarks distinguishing maximum likelihood estimation from Bayesian analysis, all this within a page! (Nowhere is the Bayesian perspective clearly defined.)

“We should no more perform an analysis clustered by state than we would cluster by age, income, or other random variable.”

The last part of the book is about probabilistic models, drawing a distinction between data generating process models and data models (p.89), by which the author means the hypothesised probabilistic model versus the empirical or bootstrap distribution. An interesting way to relate to the main thread, except that the convergence of the data distribution to the data generating process model cannot be established at this level. And hence the very nature of the bootstrap may be lost on the reader. A second and final chapter covers some common or vexing problems and the author's approach to them, revolving around standard errors and fixed and random effects. The distinction between standard deviation ("a mathematical property of a probability distribution") and standard error ("representation of variation due to a data generating process"), which is pursued over several pages, seems to boil down to a possible (and likely) model mis-specification. The chapter also contains an extensive discussion of notations, like indexes (or indicators), which seems a strange focus, esp. at this location in the book, and over 15 pages! (Furthermore, I find it quite confusing that a set of indices is denoted there by the double-barred I, usually employed for the indicator function.)

“…the reader will probably observe the conspicuous absence of a time-honoured topic in calculus courses, the “Riemann integral”… Only the stubborn conservatism of academic tradition could freeze it into a regular part of the curriculum, long after it had outlived its historical importance.” (Jean Dieudonné, *Foundations of Modern Analysis*)

In conclusion, I do not see the point of this book, from its insistence on measure theory that never concretises for lack of mathematical material, to an absence of convincing examples as to why this is useful for the applied researcher, to an intended audience that is expected to already know quite a lot about probability and statistics, to a final meandering around linear models that seems at odds with the remainder of What makes variables random, without providing an answer to the title question. Or to the more relevant one of why Lebesgue integration is preferable to Riemann integration. (Not that convincing replies to this question do not exist!)
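One classic such reply, for the record: the Dirichlet function $f=\mathbb{1}_{\mathbb{Q}\cap[0,1]}$ is Lebesgue integrable, with

$$\int_{[0,1]}f\,\text{d}\lambda=\lambda\big(\mathbb{Q}\cap[0,1]\big)=0,$$

while every lower Riemann sum equals 0 and every upper Riemann sum equals 1, so that its Riemann integral does not exist.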

## optimultiplication [a riddle]

Posted in Books, Kids, R, Statistics with tags coding, conditional probability, FiveThirtyEight, mathematical puzzle, R, The Riddler on April 14, 2017 by xi'an

**T**he riddle of this week is about optimising the positions of the four digits in a multiplication of two two-digit numbers, and it is open to a coding resolution:

Four digits are drawn without replacement from {0,1,…,9}, one at a time. What is the optimal strategy to position those four digits, two digits per row, as they are drawn, toward minimising the average product?

Although the problem can be solved algebraically by computing **E**[X⁴|x¹,..] and **E**[X⁴X³|x¹,..], I wrote three R codes to “optimise” the location of the first three digits: the first digit ends up as a unit if it is 5 or more, and as a multiple of ten otherwise, on the first row. For the second draw, it is slightly more variable: with this R code,

```r
second <- function(i, j, N = 1e5){
  # simulate the two remaining draws, N times
  drew <- matrix(0, N, 2)
  for (t in 1:N)
    drew[t, ] <- sample((0:9)[-c(i + 1, j + 1)], 2)
  conmean <- (45 - i - j)/8            # E[X3|x1,x2] = E[X4|x1,x2]
  conprod <- mean(drew[, 1]*drew[, 2]) # E[X3 X4|x1,x2]
  # expected products for the three slots left open to the second digit j
  if (i < 5){ # first digit placed in front, as 10*i
    pos <- c((110*i + 11*j)*conmean,
             100*i*j + 10*(i + j)*conmean + conprod,
             (100*i + j)*conmean + 10*i*j + 10*conprod)
  }else{      # first digit placed as a unit
    pos <- c((110*j + 11*i)*conmean,
             10*i*j + (100*j + i)*conmean + 10*conprod,
             10*(i + j)*conmean + i*j + 100*conprod)
  }
  return(order(pos)[1])
}
```

the resulting digit again ends up as a unit if it is 5 or more (except when x¹=7,8,9, where the threshold drops to 4) and as a multiple of ten otherwise, but on the second row. Except when x¹=0 and x²=1,2,3,4, in which case both digits end up on the first row together, 0 obviously in front.
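As a sanity check on the unit-versus-tens threshold for the first draw, here is a small sketch of mine (not from the original solution), simplifying matters by assuming the three later digits land uniformly at random in the three remaining slots rather than being placed adaptively:

```r
# exact expected products when the first digit i goes to a tens or a unit slot,
# with the three later digits filling the other slots exchangeably
first <- function(i){
  rest <- (0:9)[-(i + 1)]
  m <- mean(rest)                      # expectation of one later digit
  q <- (sum(rest)^2 - sum(rest^2))/72  # expected product of two distinct later digits
  tens <- 110*i*m + 11*q               # E[(10i+B)(10C+D)]
  unit <- 11*i*m + 110*q               # E[(10B+i)(10C+D)]
  c("tens", "unit")[order(c(tens, unit))[1]]
}
sapply(0:9, first) # "tens" for 0-4, "unit" for 5-9
```

Even under this simplification, the threshold at 5 is recovered.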

For the third and last open draw, only one random draw remains, which means that the decision only depends on x¹,x²,x³ and **E**[X⁴|x¹,x²,x³]=(45-x¹-x²-x³)/7. Attaching x³ to x² or to x¹ then leads to expected products that vary monotonically in x³, depending on whether x¹>x² or x¹<x²:

```r
fourth <- function(i, j, k){
  comean <- (45 - i - j - k)/7  # E[X4|x1,x2,x3]
  # expected products for the two slots left open to the third digit k
  if ((i < 1) & (j < 5)) pos <- c(10*comean + k, comean + 10*k)
  if ((i < 5) & (j > 4)) pos <- c(100*i*comean + k*j, j*comean + 100*i*k)
  if ((i > 0) & (i < 5) & (j < 5)) pos <- c(i*comean + k*j, j*comean + i*k)
  if ((i < 7) & (i > 4) & (j < 5)) pos <- c(i*comean + 100*k*j, j*comean + 100*i*k)
  if ((i < 7) & (i > 4) & (j > 4)) pos <- c(i*comean + k*j, j*comean + i*k)
  if ((i > 6) & (j < 4)) pos <- c(i*comean + 100*k*j, j*comean + 100*i*k)
  if ((i > 6) & (j > 3)) pos <- c(i*comean + k*j, j*comean + i*k)
  return(order(pos)[1])
}
```

Running this R code for all combinations of x¹,x² shows that, except for the cases x¹≥5 and x²=0, for which x³ invariably remains in front of x¹, both positions occur for some values of x³.
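A possible driver for this enumeration (my sketch, with arbitrary output formatting) reads

```r
# tabulate the optimal slot of x3 for every admissible triplet (x1, x2, x3)
for (i in 0:9) for (j in setdiff(0:9, i)){
  slots <- sapply(setdiff(0:9, c(i, j)), function(k) fourth(i, j, k))
  cat("x1 =", i, " x2 =", j, ":", slots, "\n")
}
```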

## Computing the variance of a conditional expectation via non-nested Monte Carlo

Posted in Books, pictures, Statistics, University life with tags conditional probability, debiasing, Monte Carlo approximations, Monte Carlo Statistical Methods, Rao-Blackwellisation on May 26, 2016 by xi'an

**T**he recent arXival by Takashi Goda of Computing the variance of a conditional expectation via non-nested Monte Carlo led me to read it, as I could not be certain of the contents from the title alone! The short paper considers the issue of estimating the variance of a conditional expectation when one is able to simulate from the joint distribution behind the quantity of interest. The second moment E(E[f(X)|Y]²) can be written as a triple integral, with two versions of x given y and one marginal y, which means that it can be approximated in an unbiased manner by simulating one realisation of y and then two conditional realisations of x. The variance requires a third simulation of x, which the author seems to deem too costly and hence replaces with another unbiased version based on two conditional generations only. (He notes that a faster biased version is available, with a bias going down faster than the Monte Carlo error, which makes the alternative somewhat irrelevant, as it is also costly to derive.) An open question after reading the paper stands with the optimal version of the generic estimator (5), although finding the optimum may require more computing time than it is worth spending. Another one is whether or not this version of the expected conditional variance is more interesting (computation-wise) than the difference between the variance and the expected conditional variance, as reproduced in (3), given that both quantities can equally be approximated by unbiased Monte Carlo…
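As a toy illustration of the second-moment trick (my example, not taken from the paper): with Y~N(0,1), X|Y=y~N(y,1), and f the identity, E[f(X)|Y]=Y, so the variance of the conditional expectation equals 1.

```r
# one marginal y and two conditional x's per replication
N <- 1e5
y <- rnorm(N)
x1 <- rnorm(N, mean = y)  # first conditional draw of X given Y
x2 <- rnorm(N, mean = y)  # second, independent, conditional draw
scndmom <- mean(x1*x2)    # unbiased estimate of E[E[f(X)|Y]^2]
scndmom - mean(x1)^2      # estimate of var(E[f(X)|Y]), close to 1
```

(Squaring the empirical mean makes this final difference slightly biased, unlike the debiased versions discussed in the paper.)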

## Sunday morning puzzle

Posted in Books, Kids, R with tags conditional probability, cross validated, mathematical puzzle, R on November 22, 2015 by xi'an

**A** question from X validated took me quite a while to fathom, and then the solution suddenly became quite obvious:

If a sample taken from an arbitrary distribution on {0,1}⁶ is censored from its (0,0,0,0,0,0) elements, and if the marginal probabilities are known for all six components of the random vector, what is an estimate of the proportion of (missing) (0,0,0,0,0,0) elements?

Since the censoring modifies all probabilities by the same renormalisation, i.e., divides them by the probability *ρ* of being different from (0,0,0,0,0,0), this probability can be estimated by looking at the marginal probabilities of being equal to 1, which equal the original and known marginal probabilities divided by *ρ*. Here is a short R code illustrating the approach, which I wrote in the taxi home yesterday night:

```r
# generate vectors
N <- 1e5
zprobs <- c(.1, .9) # iid example
smpl <- matrix(sample(0:1, 6*N, rep = TRUE, prob = zprobs), ncol = 6)
pty <- apply(smpl, 1, sum)
smpl <- smpl[pty > 0, ]           # censor the all-zero rows
ps <- apply(smpl, 2, mean)        # marginal frequencies of 1's after censoring
cor <- mean(ps/rep(zprobs[2], 6)) # estimates 1/rho
# estimated original size
length(smpl[, 1])*cor
```
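In equation form, the logic of the above (in my notation) is that the censored marginals satisfy $\hat p_j\approx p_j/\rho$, so that

$$\widehat{1/\rho}=\frac{1}{6}\sum_{j=1}^{6}\frac{\hat p_j}{p_j},\qquad \hat N=n_{\text{obs}}\times\widehat{1/\rho}\approx N\rho\times\frac{1}{\rho}=N,$$

where $n_{\text{obs}}$ is the size of the censored sample.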

A broader question is how many values (and which values) of the sample can be removed before this recovery becomes impossible (with the same amount of information).