## Buffon machines

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , on December 22, 2020 by xi'an

By chance I came across a 2010 paper entitled On Buffon Machines and Numbers by Philippe Flajolet, Maryse Pelletier and Michèle Soria. Which relates to Bernoulli factories, a related riddle, and the recent paper by Luis Mendo I reviewed here. What the authors call a Buffon machine is a device that produces a perfect simulation of a discrete random variable out of a uniform bit generator. Just like (George Louis Leclerc, comte de) Buffon’s needle produces a Bernoulli outcome with success probability π/4. out of a real Uniform over (0,1). Turned into a sequence of Uniform random bits.

“Machines that always halt can only produce Bernoulli distributions whose parameter is a dyadic rational.”

When I first read this sentence it seemed to clash with the earlier riddle, until I realised the later started from a B(p) coin to produce a fair coin, while this paper starts with a fair coin.

The paper hence aims at a version of the Bernoulli factory problem (see Definition 2), although the term is only mentioned at the very end, with the added requirement of simplicity and conciseness translated as a finite expected number of draws and possibly an exponential tail.

It first recalls the (Forsythe–)von Neumann method of generating exponential (and other) variates out of a Uniform generator (see Section IV.2 in Devroye’s generation bible). Expanded into a general algorithm for generating discrete random variables whose pmf’s are related to permutation cardinals,

$\mathbb P(N=n)\propto P_n\lambda^n/n!$

if the Bernoulli generator has probability λ. These include the Poisson and the logarithmic distributions and as a side product Bernoulli distributions with some logarithmic, exponential and trigonometric transforms of λ.

As a side remark, a Bernoulli generator with probability 1/π is derived from Ramanujan identity

$\frac{1}{\pi} = \sum_{n=0}^\infty {2n \choose n}^3 \frac{6n+1}{2^{8n+2}}$

as “a discrete analogue of Buffon’s original. In a neat connection with Mendo’s paper, the authors of this 2010 paper note that Euler’s constant does not appear to be achievable by a Buffon machine.

## Bernoulli factory in the Riddler

Posted in Books, Kids, R, Statistics with tags , , , , , , , , , , on December 1, 2020 by xi'an

“Mathematician John von Neumann is credited with figuring out how to take a p biased coin and “simulate” a fair coin. Simply flip the coin twice. If it comes up heads both times or tails both times, then flip it twice again. Eventually, you’ll get two different flips — either a heads and then a tails, or a tails and then a heads, with each of these two cases equally likely. Once you get two different flips, you can call the second of those flips the outcome of your “simulation.” For any value of p between zero and one, this procedure will always return heads half the time and tails half the time. This is pretty remarkable! But there’s a downside to von Neumann’s approach — you don’t know how long the simulation will last.” The Riddler

The associated riddle (first one of the post-T era!) is to figure out what are the values of p for which an algorithm can be derived for simulating a fair coin in at most three flips. In one single flip, p=½ sounds like the unique solution. For two flips, p²,(1-p)^2,2p(1-p)=½ work, but so do p+(1-p)p,(1-p)+p(1-p)=½, and the number of cases grows for three flips at most. However, since we can have 2³=8 different sequences, there are 2⁸ ways to aggregate these events and thus at most 2⁸ resulting probabilities (including 0 and 1). Running a quick R code and checking for proximity to ½ of any of these sums leads to

[1] 0.2062997 0.7937005 #p^3
[1] 0.2113249 0.7886753 #p^3+(1-p)^3
[1] 0.2281555 0.7718448 #p^3+p(1-p)^2
[1] 0.2372862 0.7627143 #p^3+(1-p)^3+p(1-p)^2
[1] 0.2653019 0.7346988 #p^3+2p(1-p)^2
[1] 0.2928933 0.7071078 #p^2
[1] 0.3154489 0.6845518 #p^3+2p^2(1-p)
[1] 0.352201  0.6477993 #p^3+p(1-p)^2+p^2(1-p)
[1] 0.4030316 0.5969686 #p^3+p(1-p)^2+3(1-p)p^2
[1] 0.5


which correspond to

1-p³=½, p³+(1-p)³=½,(1-p)³+(1-p)p²=½,p³+(1-p)³+p²(1-p),(1-p)³+2(1-p)p²=½,1-p²=½, p³+(1-p)³+p²(1-p)=½,(1-p)³+p(1-p)²+p²(1-p)=½,(1-p)³+p²(1-p)+3p(1-p)²=½,p³+p(1-p)²+3(p²(1-p)=½,p³+2p(1-p)²+3(1-p)p²=½,p=½,

(plus the symmetric ones), leading to 19 different values of p producing a “fair coin”. Missing any other combination?!

Another way to look at the problem is to find all roots of the $2^{2^n}$ equations

$a_0p^n+a_1p^{n-1}(1-p)+\cdots+a_{n-1}p(1-p)^{n-1}+a_n(1-p)^n=1/2$

where

$0\le a_i\le{n \choose i}$

(None of these solutions is rational, by the way, except p=½.) I also tried this route with a slightly longer R code, calling polyroot, and finding the same 19 roots for three flips, [at least] 271 for four, and [at least] 8641 for five (The Riddler says 8635!). With an imprecision in the exact number of roots due to rather poor numerical rounding by polyroot. (Since the coefficients of the above are not directly providing those of the polynomial, I went through an alternate representation as a polynomial in (1-p)/p, with a straightforward derivation of the coefficients.)

## O’Bayes 19/2

Posted in Books, pictures, Running, Travel, University life with tags , , , , , , , , , , , , , , , , , on July 1, 2019 by xi'an

One talk on Day 2 of O’Bayes 2019 was by Ryan Martin on data dependent priors (or “priors”). Which I have already discussed in this blog. Including the notion of a Gibbs posterior about quantities that “are not always defined through a model” [which is debatable if one sees it like part of a semi-parametric model]. Gibbs posterior that is built through a pseudo-likelihood constructed from the empirical risk, which reminds me of Bissiri, Holmes and Walker. Although requiring a prior on this quantity that is  not part of a model. And is not necessarily a true posterior and not necessarily with the same concentration rate as a true posterior. Constructing a data-dependent distribution on the parameter does not necessarily mean an interesting inference and to keep up with the theme of the conference has no automated claim to [more] “objectivity”.

And after calling a prior both Beauty and The Beast!, Erlis Ruli argued about a “bias-reduction” prior where the prior is solution to a differential equation related with some cumulants, connected with an earlier work of David Firth (Warwick).  An interesting conundrum is how to create an MCMC algorithm when the prior is that intractable, with a possible help from PDMP techniques like the Zig-Zag sampler.

While Peter Orbanz’ talk was centred on a central limit theorem under group invariance, further penalised by being the last of the (sun) day, Peter did a magnificent job of presenting the result and motivating each term. It reminded me of the work Jim Bondar was doing in Ottawa in the 1980’s on Haar measures for Bayesian inference. Including the notion of amenability [a term due to von Neumann] I had not met since then. (Neither have I met Jim since the last summer I spent in Carleton.) The CLT and associated LLN are remarkable in that the average is not over observations but over shifts of the same observation under elements of a sub-group of transformations. I wondered as well at the potential connection with the Read Paper of Kong et al. in 2003 on the use of group averaging for Monte Carlo integration [connection apart from the fact that both discussants, Michael Evans and myself, are present at this conference].

## the beauty of maths in computer science [book review]

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , , , , , , , , , on January 17, 2019 by xi'an

CRC Press sent me this book for review in CHANCE: Written by Jun Wu, “staff research scientist in Google who invented Google’s Chinese, Japanese, and Korean Web search algorithms”, and translated from the Chinese, 数学之美, originating from Google blog entries. (Meaning most references are pre-2010.) A large part of the book is about word processing and web navigation, which is the author’s research specialty. And not so much about mathematics. (When rereading the first chapters to start this review I then realised why the part about language processing in AIQ sounded familiar: I had read it in the Beauty of Mathematics in Computer Science.)

In the first chapter, about the history of languages, I found out, among other things, that ancient Jewish copists of the Bible had an error correcting algorithm consisting in giving each character a numerical equivalent, summing up each row, then all rows, and  checking the sum at the end of the page was the original one. The second chapter explains why the early attempts at language computer processing, based on grammar rules, were unsuccessful and how a statistical approach had broken the blockade. Explained via Markov chains in the following chapter. Along with the Good-Turing [Bayesian] estimate of the transition probabilities. Next comes a short and low-tech chapter on word segmentation. And then an introduction to hidden Markov models. Mentioning the Baum-Welch algorithm as a special case of EM, which makes a return by Chapter 26. Plus a chapter on entropies and Kullback-Leibler divergence.

A first intermede is provided by a chapter dedicated to the late Frederick Jelinek, the author’s mentor (including what I find a rather unfortunate equivalent drawn between the Nazi and Communist eras in Czechoslovakia, p.64). Chapter that sounds a wee bit too much like an extended obituary.

The next section of chapters is about search engines, with a few pages on Boolean logic, dynamic programming, graph theory, Google’s PageRank and TF-IDF (term frequency/inverse document frequency). Unsurprisingly, given that the entries were originally written for Google’s blog, Google’s tools and concepts keep popping throughout the entire book.

Another intermede about Amit Singhal, the designer of Google’s internal search ranking system, Ascorer. With another unfortunate equivalent with the AK-47 Kalashnikov rifle as “elegantly simple”, “effective, reliable, uncomplicated, and easy to implement or operate” (p.105). Even though I do get the (reason for the) analogy, using an equivalent tool which purpose is not to kill other people would have been just decent…

Then chapters on measuring proximity between news articles by (vectors in a 64,000 dimension vocabulary space and) their angle, and singular value decomposition, and turning URLs as long integers into 16 bytes random numbers by the Mersenne Twister (why random, except for encryption?), missing both the square in von Neumann’s first PRNG (p.124) and the opportunity to link the probability of overlap with the birthday problem (p.129). Followed by another chapter on cryptography, always a favourite in maths vulgarisation books (but with no mention made of the originators of public key cryptography, like James Hellis or the RSA trio, or of the impact of quantum computers on the reliability of these methods). And by an a-mathematic chapter on spam detection.

Another sequence of chapters cover maximum entropy models (in a rather incomprehensible way, I think, see p.159), continued with an interesting argument how Shannon’s first theorem predicts that it should be faster to type Chinese characters than Roman characters. Followed by the Bloom filter, which operates as an approximate Poisson variate. Then Bayesian networks where the “probability of any node is computed by Bayes’ formula” [not really]. With a slightly more advanced discussion on providing the highest posterior probability network. And conditional random fields, where the conditioning is not clearly discussed (p.192). Next are chapters about Viterbi’s algorithm (and successful career) and the EM algorithm, nicknamed “God’s algorithm” in the book (Chapter 26) although I never heard of this nickname previously.

The final two chapters are on neural networks and Big Data, clearly written later than the rest of the book, with the predictable illustration of AlphaGo (but without technical details). The twenty page chapter on Big Data does not contain a larger amount of mathematics, with no equation apart from Chebyshev’s inequality, and a frequency estimate for a conditional probability. But I learned about 23&me running genetic tests at a loss to build a huge (if biased) genetic database. (The bias in “Big Data” issues is actually not covered by this chapter.)

“One of my main objectives for writing the book is to introduce some mathematical knowledge related to the IT industry to people who do not work in the industry.”

To conclude, I found the book a fairly interesting insight on the vision of his field and job experience by a senior scientist at Google, with loads of anecdotes and some historical backgrounds, but very Google-centric and what I felt like an excessive amount of name dropping and of I did, I solved, I &tc. The title is rather misleading in my opinion as the amount of maths is very limited and rarely sufficient to connect with the subject at hand. Although this is quite a relative concept, I did not spot beauty therein but rather technical advances and trick, allowing the author and Google to beat the competition.

## George Forsythe’s last paper

Posted in Books, Statistics, University life with tags , , , on May 25, 2018 by xi'an

When looking for a link in a recent post, I came across Richard Brent’ arXival of historical comments on George Forsythe’s last paper (in 1972). Which is about the Forsythe-von Neumann approach to simulating exponential variates, covered in Luc Devroye’s Non-Uniform Random Variate Generation in a special section, Section 2 of Chapter 4,  is about generating a random variable from a target density proportional to g(x)exp(-F(x)), where g is a density and F is a function on (0,1). Then, after generating a realisation x⁰ from g and computing F(x⁰), generate a sequence u¹,u²,… of uniforms as long as they keep decreasing, i.e., F(x⁰) >u¹>u²>… If the maximal length k of this sequence is odd, the algorithm exists with a value x⁰ generated from  g(x)exp(-F(x)). Von Neumann (1949) treated the special case when g is constant and F(x)=x, which leads to an Exponential generator that never calls an exponential function. Which does not make the proposal a particularly efficient one as it rejects O(½) of the simulations. Refinements of the algorithm lead to using on average 1.38 uniforms per Normal generation, which does not sound much faster than a call to the Box-Muller method, despite what is written in the paper. (Brent also suggests using David Wallace’s 1999 Normal generator, which I had not encountered before. And which I am uncertain is relevant at the present time.)