O’Bayes 19/2

Posted in Books, pictures, Running, Travel, University life on July 1, 2019 by xi'an

One talk on Day 2 of O’Bayes 2019 was by Ryan Martin on data dependent priors (or “priors”). Which I have already discussed in this blog. Including the notion of a Gibbs posterior about quantities that “are not always defined through a model” [which is debatable if one sees it as part of a semi-parametric model]. A Gibbs posterior that is built through a pseudo-likelihood constructed from the empirical risk, which reminds me of Bissiri, Holmes and Walker. Although requiring a prior on this quantity that is not part of a model. And which is not necessarily a true posterior, nor necessarily endowed with the same concentration rate as a true posterior. Constructing a data-dependent distribution on the parameter does not necessarily mean an interesting inference and, to keep up with the theme of the conference, has no automated claim to [more] “objectivity”.

And after calling a prior both Beauty and The Beast!, Erlis Ruli argued about a “bias-reduction” prior where the prior is the solution to a differential equation related to some cumulants, connected with an earlier work of David Firth (Warwick). An interesting conundrum is how to create an MCMC algorithm when the prior is that intractable, with possible help from PDMP techniques like the Zig-Zag sampler.

While Peter Orbanz’ talk was centred on a central limit theorem under group invariance, further penalised by being the last of the (sun) day, Peter did a magnificent job of presenting the result and motivating each term. It reminded me of the work Jim Bondar was doing in Ottawa in the 1980’s on Haar measures for Bayesian inference. Including the notion of amenability [a term due to von Neumann] I had not met since then. (Neither have I met Jim since the last summer I spent in Carleton.) The CLT and associated LLN are remarkable in that the average is not over observations but over shifts of the same observation under elements of a sub-group of transformations. I wondered as well at the potential connection with the Read Paper of Kong et al. in 2003 on the use of group averaging for Monte Carlo integration [connection apart from the fact that both discussants, Michael Evans and myself, are present at this conference].

the beauty of maths in computer science [book review]

Posted in Books, Statistics, University life on January 17, 2019 by xi'an

CRC Press sent me this book for review in CHANCE: Written by Jun Wu, “staff research scientist in Google who invented Google’s Chinese, Japanese, and Korean Web search algorithms”, and translated from the Chinese, 数学之美, originating from Google blog entries. (Meaning most references are pre-2010.) A large part of the book is about word processing and web navigation, which is the author’s research specialty. And not so much about mathematics. (When rereading the first chapters to start this review I then realised why the part about language processing in AIQ sounded familiar: I had read it in the Beauty of Mathematics in Computer Science.)

In the first chapter, about the history of languages, I found out, among other things, that ancient Jewish copyists of the Bible had an error correcting algorithm consisting in giving each character a numerical equivalent, summing up each row, then all rows, and checking the sum at the end of the page was the original one. The second chapter explains why the early attempts at computer language processing, based on grammar rules, were unsuccessful and how a statistical approach had broken the blockade. Explained via Markov chains in the following chapter. Along with the Good-Turing [Bayesian] estimate of the transition probabilities. Next comes a short and low-tech chapter on word segmentation. And then an introduction to hidden Markov models. Mentioning the Baum-Welch algorithm as a special case of EM, which makes a return by Chapter 26. Plus a chapter on entropies and Kullback-Leibler divergence.

A first interlude is provided by a chapter dedicated to the late Frederick Jelinek, the author’s mentor (including what I find a rather unfortunate equivalence drawn between the Nazi and Communist eras in Czechoslovakia, p.64). A chapter that sounds a wee bit too much like an extended obituary.

The next section of chapters is about search engines, with a few pages on Boolean logic, dynamic programming, graph theory, Google’s PageRank and TF-IDF (term frequency/inverse document frequency). Unsurprisingly, given that the entries were originally written for Google’s blog, Google’s tools and concepts keep popping up throughout the entire book.

Another interlude about Amit Singhal, the designer of Google’s internal search ranking system, Ascorer. With another unfortunate equivalence with the AK-47 Kalashnikov rifle as “elegantly simple”, “effective, reliable, uncomplicated, and easy to implement or operate” (p.105). Even though I do get the (reason for the) analogy, using an equivalent tool whose purpose is not to kill other people would have been just decent…

Then chapters on measuring proximity between news articles by (vectors in a 64,000-dimensional vocabulary space and) their angle, and singular value decomposition, and turning URLs as long integers into 16-byte random numbers by the Mersenne Twister (why random, except for encryption?), missing both the square in von Neumann’s first PRNG (p.124) and the opportunity to link the probability of overlap with the birthday problem (p.129). Followed by another chapter on cryptography, always a favourite in maths popularisation books (but with no mention made of the originators of public key cryptography, like James Ellis or the RSA trio, or of the impact of quantum computers on the reliability of these methods). And by an a-mathematic chapter on spam detection.

Another sequence of chapters covers maximum entropy models (in a rather incomprehensible way, I think, see p.159), continued with an interesting argument on how Shannon’s first theorem predicts that it should be faster to type Chinese characters than Roman characters. Followed by the Bloom filter, which operates as an approximate Poisson variate. Then Bayesian networks where the “probability of any node is computed by Bayes’ formula” [not really]. With a slightly more advanced discussion on providing the highest posterior probability network. And conditional random fields, where the conditioning is not clearly discussed (p.192). Next are chapters about Viterbi’s algorithm (and successful career) and the EM algorithm, nicknamed “God’s algorithm” in the book (Chapter 26) although I never heard of this nickname previously.

The final two chapters are on neural networks and Big Data, clearly written later than the rest of the book, with the predictable illustration of AlphaGo (but without technical details). The twenty-page chapter on Big Data does not contain a larger amount of mathematics, with no equation apart from Chebyshev’s inequality, and a frequency estimate for a conditional probability. But I learned about 23andMe running genetic tests at a loss to build a huge (if biased) genetic database. (The bias in “Big Data” issues is actually not covered by this chapter.)

“One of my main objectives for writing the book is to introduce some mathematical knowledge related to the IT industry to people who do not work in the industry.”

To conclude, I found the book a fairly interesting insight into the vision of his field and job experience by a senior scientist at Google, with loads of anecdotes and some historical backgrounds, but very Google-centric and with what I felt like an excessive amount of name dropping and of I did, I solved, I &tc. The title is rather misleading in my opinion as the amount of maths is very limited and rarely sufficient to connect with the subject at hand. Although this is quite a relative concept, I did not spot beauty therein but rather technical advances and tricks, allowing the author and Google to beat the competition.

George Forsythe’s last paper

Posted in Books, Statistics, University life on May 25, 2018 by xi'an

When looking for a link in a recent post, I came across Richard Brent’s arXival of historical comments on George Forsythe’s last paper (in 1972). The paper is about the Forsythe-von Neumann approach to simulating exponential variates, covered in Luc Devroye’s Non-Uniform Random Variate Generation in a special section, Section 2 of Chapter 4. The method is about generating a random variable from a target density proportional to g(x)exp(-F(x)), where g is a density and F is a function taking values in (0,1). Then, after generating a realisation x⁰ from g and computing F(x⁰), generate a sequence u¹,u²,… of uniforms as long as they keep decreasing, i.e., F(x⁰)>u¹>u²>… If the number k of uniforms generated, including the one that breaks the decreasing chain, is odd, the algorithm exits with a value x⁰ generated from g(x)exp(-F(x)). Von Neumann (1949) treated the special case when g is constant and F(x)=x, which leads to an Exponential generator that never calls an exponential function. Which does not make the proposal a particularly efficient one as it rejects a sizeable fraction of the simulations [a proportion e⁻¹≈0.37 in von Neumann’s case]. Refinements of the algorithm lead to using on average 1.38 uniforms per Normal generation, which does not sound much faster than a call to the Box-Muller method, despite what is written in the paper. (Brent also suggests using David Wallace’s 1999 Normal generator, which I had not encountered before. And which I am uncertain is relevant at the present time.)
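As a minimal sketch of the special case above, and not of Forsythe’s or Brent’s refined versions, here is an R rendering where g is uniform on (0,1) and F(x)=x. The function name rexp_vn is mine, and the extension to the whole half-line, which uses the number of rejected rounds as the integer part of the output, is the usual companion trick that I am assuming here rather than something spelled out in this post.

```r
# Sketch of von Neumann's special case: g uniform on (0,1), F(x) = x.
# rexp_vn is a hypothetical name; no call to exp() or log() is made.
rexp_vn <- function(n) {
  replicate(n, {
    rejections <- 0              # number of rejected rounds so far
    repeat {
      x <- runif(1)              # candidate x0 from g
      prev <- x                  # current end of the decreasing chain
      k <- 1                     # uniforms generated in this round
      u <- runif(1)
      while (u < prev) {         # chain x0 > u1 > u2 > ... still holds
        prev <- u
        u <- runif(1)
        k <- k + 1
      }
      if (k %% 2 == 1) break     # odd k: accept x0, with probability exp(-x0)
      rejections <- rejections + 1
    }
    # accepted x0 has density proportional to exp(-x) on (0,1);
    # adding the (geometric) number of rejections gives an Exp(1) variate
    rejections + x
  })
}

set.seed(1)
z <- rexp_vn(1e4)
c(mean(z), var(z))               # both should be close to 1
```

Each round is rejected with probability e⁻¹, independently of the value eventually accepted, which is why the count of rejections can be recycled as the integer part.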

best unbiased estimator of θ² for a Poisson model

Posted in Books, Kids, pictures, Statistics, Travel, University life on May 23, 2018 by xi'an

A mostly traditional question on X validated about the “best” [minimum variance] unbiased estimator of θ² from a Poisson P(θ) sample leads to the Rao-Blackwell solution

$\mathbb{E}[X_1X_2|\underbrace{\sum_{i=1}^n X_i}_S=s] = -\frac{s}{n^2}+\frac{s^2}{n^2}=\frac{s(s-1)}{n^2}$

and a similar estimator could be constructed for θ³, θ⁴, … With the interesting limitation that this procedure stops at the power equal to the number of observations (minus one?). But, since the expectation of a power of the sufficient statistic S [with distribution P(nθ)] is a polynomial in θ, there is de facto no limitation. More interestingly, there is no unbiased estimator of negative powers of θ in this context, while this neat comparison on Wikipedia (borrowed from the great book of counter-examples by Romano and Siegel, 1986, selling for a mere \$180 on amazon!) shows why looking for an unbiased estimator of exp(-2θ) is particularly foolish: the only solution is (-1) to the power S [for a single observation]. (There is however a first way to circumvent the difficulty if having access to an arbitrary number of generations from the Poisson, since the Forsythe-von Neumann algorithm allows for an unbiased estimation of exp(-F(x)). And, as a second way, as remarked by Juho Kokkala below, a sample of at least two Poisson observations leads to a more coherent best unbiased estimator.)
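A quick simulation check of these unbiasedness claims, as a sketch only: the values of θ and n, the number of replications and the helper name ub_power are arbitrary choices of mine. The helper relies on the factorial moments of the Poisson, E[S(S-1)⋯(S-k+1)]=(nθ)ᵏ, which is the polynomial argument made above.

```r
set.seed(101)
theta <- 2; n <- 5; nsim <- 1e5

# unbiased estimator of theta^k based on S ~ P(n theta), via the
# falling factorial s(s-1)...(s-k+1) = choose(s,k) * k!
# (ub_power is a hypothetical helper name)
ub_power <- function(s, n, k) choose(s, k) * factorial(k) / n^k

s <- rpois(nsim, n * theta)      # simulate the sufficient statistic directly
est2 <- ub_power(s, n, 2)        # equals s(s-1)/n^2
est3 <- ub_power(s, n, 3)

x <- rpois(nsim, theta)          # single observation case
est_exp <- (-1)^x                # the "foolish" estimator of exp(-2 theta)

rbind(theta2 = c(mean(est2), theta^2),
      theta3 = c(mean(est3), theta^3),
      exp2   = c(mean(est_exp), exp(-2 * theta)))
```

The last row shows the point of the counter-example: the average does converge to exp(-2θ), but every single realisation is either -1 or +1, hence never a sensible estimate of a quantity in (0,1).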

10 great ideas about chance [book preview]

Posted in Books, pictures, Statistics, University life on November 13, 2017 by xi'an

[As I happened to be a reviewer of this book by Persi Diaconis and Brian Skyrms, I had the opportunity (and privilege!) to go through its earlier version. Here are the [edited] comments I sent back to PUP and the authors about this earlier version. All in all, a terrific book!!!]

The historical introduction (“measurement”) of this book is most interesting, especially its analogy of chance with length. I would have appreciated a connection earlier than Cardano, like some of the Greek philosophers, even though I gladly discovered there that Cardano was not only responsible for the closed form solutions to the third degree equation. I would also have liked to see more comments on the vexing issue of equiprobability: we all spend (if not waste) hours in the classroom explaining to (or arguing with) students why their solution is not correct. And they sometimes never get it! [And we sometimes get it wrong as well..!] Why is such a simple concept so hard to make explicit? In short, but this is nothing but a personal choice, I would have made the chapter more conceptual and less chronologically historical.

“Coherence is again a question of consistent evaluations of a betting arrangement that can be implemented in alternative ways.” (p.46)

The second chapter, about Frank Ramsey, is interesting, if only because it puts this “man of genius” back under the spotlight when he has all but been forgotten. (At least in my circles.) And for joining probability and utility together. And for postulating that probability can be derived from expectations rather than the opposite. Even though betting or gambling has a (negative) stigma in many cultures. At least gambling for money, since most of our actions involve some degree of betting. But not in a rational or reasoned manner. (Of course, this is not a mathematical but rather a psychological objection.) Further, the justification through betting is somewhat tautological in that it assumes probabilities are true probabilities from the start. For instance, the Dutch book example on p.39 produces a gain of .2 only if the probabilities are correct.

> gain=rep(0,1e4)
> for (t in 1:1e4){
+ p=rexp(3);p=p/sum(p) #random probabilities on the simplex
+ gain[t]=p[1]*(1-.6)+p[2]*(1-.2)+p[3]*(.9-1)} #expected gain under p
> hist(gain)

As I made it clear at the BFF4 conference last Spring, I now realise I have never really adhered to the Dutch book argument. This may be why I find the chapter somewhat unbalanced with not enough written on utilities and too much on Dutch books.

“The force of accumulating evidence made it less and less plausible to hold that subjective probability is, in general, approximate psychology.” (p.55)

A chapter on “psychology” may come as a surprise, but I feel a posteriori that it is appropriate. Most of it is about the Allais paradox. Plus entries on Ellsberg’s distinction between risk and uncertainty, with only the former being quantifiable by “objective” probabilities. And on Tversky’s and Kahneman’s distinction between heuristics, and the framing effect, i.e., how the way propositions are expressed impacts the choice of decision makers. However, it is leaving me unclear about the conclusion that the fact that people behave irrationally should not prevent a reliance on utility theory. Unclear because when taking actions involving other actors their potentially irrational choices should also be taken into account. (This is mostly nitpicking.)

“This is Bernoulli’s swindle. Try to make it precise and it falls apart. The conditional probabilities go in different directions, the desired intervals are of different quantities, and the desired probabilities are different probabilities.” (p.66)

The next chapter (“frequency”) is about Bernoulli’s Law of Large Numbers and the stabilisation of frequencies, with von Mises making it the basis of his approach to probability. And Birkhoff’s extension which is capital for the development of stochastic processes. And later for MCMC. I like the notions of “disreputable twin” (p.63) and “Bernoulli’s swindle” about the idea that “chance is frequency”. The authors call the identification of probabilities as limits of frequencies Bernoulli’s swindle, because it cannot handle zero probability events. With a nice link with the testing fallacy of equating rejection of the null with acceptance of the alternative. And an interesting description as to how Venn perceived the fallacy but could not overcome it: “If Venn’s theory appears to be full of holes, it is to his credit that he saw them himself.” The description of von Mises’ Kollektivs [and the welcome intervention of Abraham Wald] clarifies my previous and partial understanding of the notion, although I am unsure it is that clear for all potential readers. I also appreciate the connection with the very notion of randomness which has not yet found, I fear, a satisfactory definition. This chapter asks more (interesting) questions than it brings answers (to those or others). But enough, this is a brilliant chapter!

“…a random variable, the notion that Kac found mysterious in early expositions of probability theory.” (p.87)

Chapter 5 (“mathematics”) is very important [from my perspective] in that it justifies the necessity to associate measure theory with probability if one wishes to evolve further than urns and dice. To entitle Kolmogorov to posit his axioms of probability. And to define properly conditional probabilities as random variables (as my third-year students fail to realise). I enjoyed very much reading this chapter, but it may prove difficult to read for readers with no or little background in measure theory (although some advanced mathematical details have vanished from the published version). Still, this chapter constitutes a strong argument for preserving measure theory courses in graduate programs. As an aside, I find it amazing that mathematicians (even Kac!) had not at first realised the connection between measure theory and probability (p.84), but maybe not so amazing given the difficulty many still have with the notion of conditional probability. (Now, I would have liked to see some description of Borel’s paradox when it is mentioned (p.89).)

“Nothing hangs on a flat prior (…) Nothing hangs on a unique quantification of ignorance.” (p.115)

The following chapter (“inverse inference”) is about Thomas Bayes and his posthumous theorem, with an introduction setting the theorem at the centre of the Hume-Price-Bayes triangle. (It is nice that the authors include a picture of the original version of the essay, as the initial title is much more explicit than the published version!) A short coverage, in tune with the fact that Bayes only contributed a twenty-plus page paper to the field. And to be logically followed by a second part [formerly another chapter] on Pierre-Simon Laplace, both parts focussing on the selection of prior distributions on the probability of a Binomial (coin tossing) distribution. Emerging into a discussion of the position of statistics within or even outside mathematics. (And the assertion that Fisher was the Einstein of Statistics on p.120 may be disputed by many readers!)

“So it is perfectly legitimate to use Bayes’ mathematics even if we believe that chance does not exist.” (p.124)

The seventh chapter is about Bruno de Finetti with his astounding representation of exchangeable sequences as being mixtures of iid sequences. Defining an implicit prior on the side. While the description sticks to binary events, it gets quickly more advanced with the notion of partial and Markov exchangeability. With the most interesting connection between those exchangeabilities and sufficiency. (I would however disagree with the statement that “Bayes was the father of parametric Bayesian analysis” [p.133] as this is extrapolating too much from the Essay.) My next remark may be nonsensical, but I would have welcomed an entry at the end of the chapter on cases where the exchangeability representation fails, for instance those cases when there is no sufficiency structure to exploit in the model. A bonus to the chapter is a description of Birkhoff’s ergodic theorem “as a generalisation of de Finetti” (p.134-136), plus half a dozen pages of appendices on more technical aspects of de Finetti’s theorem.

“We want random sequences to pass all tests of randomness, with tests being computationally implemented”. (p.151)

The eighth chapter (“algorithmic randomness”) comes (again!) as a surprise as it centres on the character of Per Martin-Löf who is little known in statistics circles. (The chapter starts with a picture of him with the iconic Oberwolfach sculpture in the background.) Martin-Löf’s work concentrates on the notion of randomness, in a mathematical rather than probabilistic sense, and on the algorithmic consequences. I like very much the section on random generators. Including a mention of our old friend RANDU, the 15 planes random generator! This chapter connects with Chapter 4 since von Mises also attempted to define a random sequence. To the point it feels slightly repetitive (for instance Jean Ville is mentioned in rather similar terms in both chapters). Martin-Löf’s central notion is computability, which forces us to visit Turing’s machine. And its role in the undecidability of some logical statements. And Church’s recursive functions. (With a link not exploited here to the notion of probabilistic programming, where one language is actually named Church, after Alonzo Church.) Back to Martin-Löf. (I do not see how his test for randomness can be implemented on a real machine as the whole test requires going through the entire sequence: since this notion connects with von Mises’ Kollektivs, I am missing the point!) And then Kolmogorov is brought back with his own notion of complexity (which is also Chaitin’s and Solomonoff’s). Overall this is a pretty hard chapter both because of the notions it introduces and because I do not feel it is completely conclusive about the notion(s) of randomness. A side remark about casino hustlers and their “exploitation” of weak random generators: I believe Jeff Rosenthal has a similar if maybe simpler story in his book about Canadian lotteries.
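For readers who never met RANDU, here is a short sketch of why it earned its reputation. It assumes the usual definition of the generator, x ← 65539·x mod 2³¹ with an odd seed, which is not taken from the book, and the function name randu is mine. Since 65539²=6·65539-9 modulo 2³¹, three successive outputs always satisfy 9uₖ-6uₖ₊₁+uₖ₊₂ ∈ ℤ, hence fall on a small number of parallel planes in the unit cube.

```r
# RANDU: multiplicative congruential generator with multiplier 65539
# and modulus 2^31 (assumed textbook definition, odd seed required)
randu <- function(n, seed = 1) {
  x <- numeric(n)
  x[1] <- seed
  # products stay below 2^53, so double arithmetic is exact here
  for (i in 2:n) x[i] <- (65539 * x[i - 1]) %% 2^31
  x / 2^31                       # rescale to (0,1)
}

u <- randu(3e4)
m <- length(u)
# linear combination of successive triples (u_k, u_{k+1}, u_{k+2})
comb <- 9 * u[1:(m - 2)] - 6 * u[2:(m - 1)] + u[3:m]
range(comb - round(comb))        # zero: the combination is always an integer
length(unique(round(comb)))      # number of distinct planes visited
```

Since the combination lies strictly between -6 and 10, it can take at most fifteen integer values, one per plane.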

“Does quantum mechanics need a different notion of probability? We think not.” (p.180)

The penultimate chapter is about Boltzmann and the notion of “physical chance”. Or statistical physics. A story that involves Zermelo and Poincaré. And Gibbs, Maxwell and the Ehrenfests. The discussion focuses on the definition of probability in a thermodynamic setting, opposing time frequencies to space frequencies. Which requires ergodicity and hence Birkhoff [no surprise, this is about ergodicity!] as well as von Neumann. This reaches a point where conjectures in the theory are still open. (What I always (if presumably naïvely) find fascinating in this topic is the fact that ergodicity operates without requiring randomness. Dynamical systems can enjoy an ergodic theorem, while being completely deterministic.) This chapter also discusses quantum mechanics, whose main tenet requires probability. Which needs to be defined, from a frequency or a subjective perspective. And the Bernoulli shift that brings us back to random generators. The authors briefly mention the Einstein-Podolsky-Rosen paradox, which sounds more metaphysical than mathematical in my opinion, although they go into great detail to explain Bell’s conclusion that quantum theory leads to a mathematical impossibility (but they lost me along the way). Except that we “are left with quantum probabilities” (p.183). And the chapter leaves me still uncertain as to why statistical mechanics carries the label statistical. As it does not seem to involve inference at all.

“If you don’t like calling these ignorance priors on the ground that they may be sharply peaked, call them nondogmatic priors or skeptical priors, because these priors are quite in the spirit of ancient skepticism.” (p.199)

And then the last chapter (“induction”) brings us back to Hume and the 18th Century, where somehow “everything” [including statistics] started! Except that Hume’s strong scepticism (or skepticism) makes induction seemingly impossible. (A perspective with which I agree to some extent, if not to Keynes’ extreme version, when considering for instance financial time series as stationary. And a reason why I do not see the criticisms contained in the Black Swan as pertinent because they savage normality while accepting stationarity.) The chapter rediscusses Bayes’ and Laplace’s contributions to inference as well, challenging Hume’s conclusion of the impossibility to infer. Even though the representation of ignorance is not unique (p.199). And the authors call again for de Finetti’s representation theorem as bypassing the issue of whether or not there is such a thing as chance. And escaping inductive scepticism. (The section about Goodman’s grue hypothesis is somewhat distracting, maybe because I have always found it quite artificial and based on a linguistic pun rather than a logical contradiction.) The part about (Richard) Jeffrey is quite new to me but ends up quite abruptly! Similarly about Popper and his exclusion of induction. From this chapter, I appreciated very much the section on skeptical priors and its analysis from a meta-probabilist perspective.

There is no conclusion to the book, but to end up with a chapter on induction seems quite appropriate. (But there is an appendix as a probability tutorial, mentioning Monte Carlo resolutions. Plus notes on all chapters. And a commented bibliography.) Definitely recommended!

[Disclaimer about potential self-plagiarism: this post or an edited version will eventually appear in my Books Review section in CHANCE. As appropriate for a book about Chance!]

complexity of the von Neumann algorithm

Posted in Statistics on April 3, 2017 by xi'an

“Without the possibility of computing infimum and supremum of the density f over compact subintervals of the domain of f, sampling absolutely continuous distribution using the rejection method seems to be impossible in total generality.”

The von Neumann algorithm is another name for the rejection method introduced by von Neumann circa 1951. It was thus most exciting to spot a paper by Luc Devroye and Claude Gravel appearing in the latest Statistics and Computing. Assessing the method in terms of random bits and precision. Specifically, assuming that the only available random generator is one of random bits, which necessarily leads to an approximation when the target is a continuous density. The authors first propose a bisection algorithm for distributions defined on a compact interval, which compares random bits with recursive bisections of the unit interval and stops when the interval is small enough.

In higher dimensions, for densities f over the unit hypercube, they recall that the original algorithm consisted in simulating uniforms x and u over the hypercube and [0,1], using the uniform as the proposal distribution and comparing the density at x, f(x), with the rescaled uniform. When using only random bits, the proposed method is based on a quadtree that subdivides the unit hypercube into smaller and smaller hypercubes until the selected hypercube is entirely above or below the density. And is small enough for the desired precision. This obviously requires the computation of the upper and lower bounds of the density over the hypercubes to be feasible, with Devroye and Gravel considering that this is a necessary property as shown by the above quote. Densities with non-compact support can be re-expressed as densities on the unit hypercube thanks to the cdf transform. (Actually, this is equivalent to the general accept-reject algorithm, based on the associated proposal.)
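For reference, the original version with real-valued uniforms can be sketched in a few lines of R. This is the textbook rejection method in dimension one and not the random-bit quadtree algorithm of Devroye and Gravel. The function name rejection_unit, the Beta(2,3) target and its bound M=16/9 (the value of the density at its mode 1/3) are illustrative choices of mine.

```r
# Rejection sampling on [0,1] with a uniform proposal:
# accept x when M * u < f(x), where M bounds f from above
# (rejection_unit is a hypothetical helper name)
rejection_unit <- function(n, f, M) {
  out <- numeric(0)
  while (length(out) < n) {
    x <- runif(n)                # proposals, uniform over the support
    u <- runif(n)                # uniforms to be rescaled by M
    out <- c(out, x[M * u < f(x)])
  }
  out[1:n]
}

f <- function(x) dbeta(x, 2, 3)  # illustrative target, 12 x (1-x)^2
M <- 16 / 9                      # maximum of f, reached at x = 1/3

set.seed(42)
z <- rejection_unit(1e4, f, M)
c(mean(z), 2 / 5)                # empirical versus exact mean
```

The average acceptance rate is 1/M, about 56% here, and the need for the bound M is the one-dimensional shadow of the requirement discussed above, namely that upper bounds on the density must be computable.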

“With the oracles introduced in our modification of von Neumann’s method, we believe that it is impossible to design a rejection algorithm for densities that are not Riemann-integrable, so the question of the design of a universally valid rejection algorithm under the random bit model remains open.”

In conclusion, I enjoyed very much reading this paper, especially the reflection it proposes on the connection between Riemann integrability and rejection algorithms. (Actually, I cannot think straight away of a simulation algorithm that would handle non-Riemann-integrable densities, apart from nested sampling. Or of significant non-Riemann-integrable densities.)

Sobol’s Monte Carlo

Posted in Books, pictures, Statistics, University life on December 10, 2016 by xi'an

The name of Ilya Sobol is familiar to researchers in quasi-Monte Carlo methods for his Sobol’s sequences. I was thus surprised to find in my office a small book entitled The Monte Carlo Method by this author, which is a translation of his 1968 book in Russian. I have no idea how it reached my office and I went to check with the library of Paris-Dauphine around the corner [of my corridor] whether it had been lost: apparently, the library got rid of it among a collection of old books… Now, having read through this 67-page book (or booklet as Sobol puts it) makes me somewhat agree with the librarians, in that there is nothing of major relevance in this short introduction. It is quite interesting to go through the book and see the basics of simulation principles and Monte Carlo techniques unfolding, from the inverse cdf principle [established by a rather convoluted proof] to importance sampling, but the amount of information is about equivalent to the Wikipedia entry on the topic. From an historical perspective, it is also captivating to see the efforts to connect physical random generators (such as those based on vacuum tube noise) to shift-register pseudo-random generators created by Sobol in 1958. On a Soviet Strela computer.
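To make the two ends of this spectrum concrete, here is a small R sketch of the inverse cdf principle and of importance sampling. Both examples are standard illustrations of mine and not taken from Sobol’s booklet: an Exp(2) variate obtained by inverting its cdf, and a Normal tail probability P(Z>4) estimated with a shifted exponential proposal.

```r
set.seed(7)
nsim <- 1e5

# inverse cdf: if U ~ U(0,1), then F^{-1}(U) has cdf F;
# here F(x) = 1 - exp(-2x), so F^{-1}(u) = -log(1-u)/2
u <- runif(nsim)
x <- -log(1 - u) / 2
c(mean(x), 1 / 2)                # empirical versus exact mean

# importance sampling for P(Z > 4), Z ~ N(0,1):
# simulate from 4 + Exp(1) and weight by target over proposal densities
y <- 4 + rexp(nsim)
w <- dnorm(y) / dexp(y - 4)
c(mean(w), pnorm(-4))            # estimate versus exact tail probability
```

A naive Monte Carlo estimate based on the same number of Normal draws would only see about three exceedances of 4, which is the whole point of moving the simulation to the tail.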

While Googling the title of that book could not provide any connection, I found out that a 1994 version had been published under the title of A Primer for the Monte Carlo Method, which is mostly the same as my version, except for a few additional sections on pseudo-random generation, from the congruential method (with a FORTRAN code) to the accept-reject method being then called von Neumann’s instead of Neyman’s, to the notion of constructive dimension of a simulation technique, which amounts to demarginalisation, to quasi-Monte Carlo [for three pages]. A funny side note is that the author notes in the preface that the first translation [now in my office] was published without his permission!