Archive for Mersenne twister

random generators produce ties

Posted in Books, R, Statistics on April 21, 2020 by xi'an

“…an essential part of understanding how many ties these RNGs produce is to understand how many ties one expects in 32-bit integer arithmetic.”

A sort of birthday-problem paper about random generators, by Marius Hofert on arXiv, on why they produce ties. As shown for instance by the following R code (inspired by the paper):

sum(duplicated(runif(1e6)))

returning values around 100, which is indeed unexpected until one thinks a wee bit about it… With no change when moving to an alternative to the Mersenne twister generator. Indeed, assuming the R random generators produce integers over 2³² values, the expected number of ties is actually 116 for 10⁶ simulations. Moving to 2⁶⁴ values, the probability of a tie becomes negligible, around 10⁻⁸. A side remark of further interest in the paper is that, due to the different effective gaps between 0 and the smallest positive normal double, of order 10⁻³⁰⁸, and between 1 and the smallest double greater than 1, of order 10⁻¹⁶, “the grid of representable double numbers is not equidistant”. Justifying the need for special functions such as expm1 and log1p, corresponding to more accurate evaluations of exp(x)-1 and log(1+x) for small x.
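The back-of-the-envelope computation is immediate in R; a minimal sketch (with `expected_ties` a name of my own choosing), pairing the birthday-problem expectation with a log1p illustration:

```r
# Expected number of ties among n draws from a discrete uniform distribution
# on N values, by linearity of expectation over the choose(n, 2) pairs: about
# n^2 / (2N).
expected_ties <- function(n, N) choose(n, 2) / N

expected_ties(1e6, 2^32)  # about 116, matching the simulation above
expected_ties(1e6, 2^64)  # about 2.7e-08, so ties essentially vanish

# and the reason for log1p: the naive version loses everything for tiny x
x <- 1e-20
log(1 + x)  # 0, since 1 + x rounds to exactly 1 in double precision
log1p(x)    # 1e-20, as it should be
```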

really random generators [again!]

Posted in Books, Statistics on March 2, 2020 by xi'an

A pointer sent me to Chemistry World and an article therein about “really random numbers”. Or “truly” random numbers. Or “exactly” random numbers. Not particularly different from the (in)famous lava lamp generator!

“Cronin’s team has developed a robot that can automatically grow crystals in a 10 by 10 array of vials, take photographs of them, and use measurements of their size, orientation, and colour to generate strings of random numbers. The researchers analysed the numbers generated from crystals grown in three solutions – including a solution of copper sulfate – and found that they all passed statistical tests for the quality of their randomness.” Chemistry World, Tom Metcalfe, 18 February 2020
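Such statistical validation is easy to illustrate in R; a minimal sketch, with a chi-square goodness-of-fit test standing in for the (unspecified) battery of tests used in the paper:

```r
# Bin 10^5 uniform draws into ten equal cells and test equiprobability: a
# generator "passing the test" simply means a non-significant p-value.
u <- runif(1e5)
counts <- table(cut(u, breaks = seq(0, 1, by = 0.1)))
chisq.test(counts)$p.value  # typically well above 0.05
```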

The validation of this truly random generator is thus exactly the same as for a (“bad”) pseudo-random generator, namely that, in the law of large numbers sense, it fits the predicted behaviour. The difference between them thus cannot be statistical, but rather cryptographic:

“…we considered the encryption capability of this random number generator versus that of a frequently used pseudorandom number generator, the Mersenne Twister.” Lee et al., Matter, February 10, 2020

Meaning that knowledge of the starting point and of the deterministic transform of the Mersenne Twister makes it feasible to decipher, which is not the case for a physical and non-reproducible generator such as the one advocated here. One unclear aspect of the proposed generator is the time required to produce 10⁶ values, although the authors do mention that “the bit-generation rate is significantly lower than that in other methods”.

the beauty of maths in computer science [book review]

Posted in Books, Statistics, University life on January 17, 2019 by xi'an

CRC Press sent me this book for review in CHANCE: written by Jun Wu, “staff research scientist in Google who invented Google’s Chinese, Japanese, and Korean Web search algorithms”, and translated from the Chinese, 数学之美, it originates from Google blog entries. (Meaning most references are pre-2010.) A large part of the book is about word processing and web navigation, which is the author’s research specialty. And not so much about mathematics. (When rereading the first chapters to start this review, I realised why the part about language processing in AIQ sounded familiar: I had read it before in The Beauty of Mathematics in Computer Science.)

In the first chapter, about the history of languages, I found out, among other things, that ancient Jewish copyists of the Bible had an error-correcting algorithm consisting of giving each character a numerical equivalent, summing up each row, then all rows, and checking that the sum at the end of the page matched the original one. The second chapter explains why the early attempts at computer processing of language, based on grammar rules, were unsuccessful and how a statistical approach broke the impasse. Explained via Markov chains in the following chapter. Along with the Good-Turing [Bayesian] estimate of the transition probabilities. Next comes a short and low-tech chapter on word segmentation. And then an introduction to hidden Markov models. Mentioning the Baum-Welch algorithm as a special case of EM, which makes a return in Chapter 26. Plus a chapter on entropies and Kullback-Leibler divergence.
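The copyists' check lends itself to a short illustration; a sketch in R, where the numerical coding (plain Unicode code points rather than the traditional letter values) and all function names are my own inventions:

```r
# Each character maps to a number, each row of the page is summed, and the
# total over all rows must match the sum recorded for the reference copy.
char_value <- function(ch) utf8ToInt(ch)  # stand-in for the traditional values
row_sum    <- function(row) sum(sapply(strsplit(row, "")[[1]], char_value))
page_check <- function(rows, reference) sum(sapply(rows, row_sum)) == reference

page <- c("abc", "def")
ref  <- sum(utf8ToInt("abcdef"))
page_check(page, ref)              # TRUE: the copy matches the reference
page_check(c("abx", "def"), ref)   # FALSE: a single-character error is caught
```

(As with any checksum, compensating errors in two characters can of course slip through; summing each row separately narrows down where a mismatch occurred.)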

A first interlude is provided by a chapter dedicated to the late Frederick Jelinek, the author’s mentor (including what I find a rather unfortunate parallel drawn between the Nazi and Communist eras in Czechoslovakia, p.64). A chapter that sounds a wee bit too much like an extended obituary.

The next section of chapters is about search engines, with a few pages on Boolean logic, dynamic programming, graph theory, Google’s PageRank, and TF-IDF (term frequency/inverse document frequency). Unsurprisingly, given that the entries were originally written for Google’s blog, Google’s tools and concepts keep popping up throughout the entire book.

Another interlude, about Amit Singhal, the designer of Google’s internal search ranking system, Ascorer. With another unfortunate parallel, with the AK-47 Kalashnikov rifle as “elegantly simple”, “effective, reliable, uncomplicated, and easy to implement or operate” (p.105). Even though I do get the (reason for the) analogy, using a comparable tool whose purpose is not to kill other people would have been only decent…

Then come chapters on measuring proximity between news articles by (vectors in a 64,000-dimension vocabulary space and) their angle, and singular value decomposition, and on turning URLs, as long integers, into 16-byte random numbers with the Mersenne Twister (why random, except for encryption?), missing both the square in von Neumann’s first PRNG (p.124) and the opportunity to link the probability of overlap with the birthday problem (p.129). Followed by another chapter on cryptography, always a favourite in maths popularisation books (but with no mention made of the originators of public-key cryptography, like James Ellis or the RSA trio, or of the impact of quantum computers on the reliability of these methods). And by a mathematics-free chapter on spam detection.
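The von Neumann generator alluded to above (minus its missing square) is the middle-square method: square the current state and keep the middle digits as the next state. A toy four-digit R version, with purely illustrative names and parameters:

```r
# von Neumann's middle-square method: square the state, pad to 8 digits,
# keep the middle 4 digits as the new state. Short cycles and absorbing
# states make it a poor generator in practice, which is part of its fame.
middle_square <- function(seed, n) {
  out <- integer(n)
  state <- seed
  for (i in seq_len(n)) {
    sq <- sprintf("%08d", state^2)          # pad the square to 8 digits
    state <- as.integer(substr(sq, 3, 6))   # extract the middle 4 digits
    out[i] <- state
  }
  out
}

middle_square(1234, 5)  # 5227 3215 3362 3030 1809
```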

Another sequence of chapters covers maximum entropy models (in a rather incomprehensible way, I think, see p.159), continued with an interesting argument on how Shannon’s first theorem predicts that it should be faster to type Chinese characters than Roman ones. Followed by the Bloom filter, which operates as an approximate Poisson variate. Then Bayesian networks, where the “probability of any node is computed by Bayes’ formula” [not really]. With a slightly more advanced discussion on finding the network with the highest posterior probability. And conditional random fields, where the conditioning is not clearly discussed (p.192). Next are chapters about Viterbi’s algorithm (and successful career) and the EM algorithm, nicknamed “God’s algorithm” in the book (Chapter 26), although I had never heard this nickname before.
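As a complement on the Bloom filter remark, here is a minimal R sketch (hash family, sizes, and names all invented for illustration); the Poisson flavour shows in the false-positive rate, approximately (1 - exp(-kn/m))^k once n items have been inserted into m bits with k hashes:

```r
# A Bloom filter: k hash positions per item are set to TRUE in an m-bit
# array; membership queries can yield false positives but never false
# negatives.
m <- 1024L; k <- 3L
bits <- logical(m)

# toy hash family: not cryptographic, just for illustration
hashes <- function(x, k, m) {
  v <- sum(utf8ToInt(x) * seq_along(utf8ToInt(x)))
  ((v * (1:k) * 2654435761) %% m) + 1
}
bf_add   <- function(x) bits[hashes(x, k, m)] <<- TRUE
bf_query <- function(x) all(bits[hashes(x, k, m)])

for (w in c("markov", "monte", "carlo")) bf_add(w)
bf_query("markov")  # TRUE: inserted items are always found

# approximate false-positive rate after n = 3 insertions
(1 - exp(-k * 3 / m))^k  # tiny, as m is large relative to n
```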

The final two chapters are on neural networks and Big Data, clearly written later than the rest of the book, with the predictable illustration of AlphaGo (but without technical details). The twenty-page chapter on Big Data does not contain much more mathematics, with no equation apart from Chebyshev’s inequality, and a frequency estimate for a conditional probability. But I learned about 23andMe running genetic tests at a loss to build a huge (if biased) genetic database. (The bias in “Big Data” issues is actually not covered by this chapter.)

“One of my main objectives for writing the book is to introduce some mathematical knowledge related to the IT industry to people who do not work in the industry.”

To conclude, I found the book a fairly interesting insight into the vision of his field and job experience by a senior scientist at Google, with loads of anecdotes and some historical background, but very Google-centric, and with what I felt was an excessive amount of name dropping and of I did, I solved, I, &tc. The title is rather misleading in my opinion, as the amount of maths is very limited and rarely sufficient to connect with the subject at hand. Although this is quite a relative concept, I did not spot beauty therein, but rather technical advances and tricks allowing the author and Google to beat the competition.

atmospheric random generator?!

Posted in Books, Mountains, pictures, Statistics, Travel on April 10, 2012 by xi'an

As I was glancing through The Cleanest Line, the enjoyable blog of (the outdoor clothing company) Patagonia (as long as one keeps in mind that Patagonia is a company, although one with commendable ethical and ecological goals), I came upon the entry “And the Winner of “Chasing Waves” is …”, where the name of the winner of the book Chasing Waves was revealed. (Not that I am particularly into surfing…!) The interesting point to which I am coming so circumlocutorily (!) is that they used a random generator based on atmospheric noise to select the winner! I particularly like the claim that the generator “for many purposes is better than the pseudo-random number algorithms typically used in computer programs”. For which purpose exactly?!

Now, to be (at least a wee bit) fair, the random.org site contains an explanation about the quality of their generator. I am however surprised by the comparison they run with the rand() function from PHP on Microsoft Windows, since the latter produces a visible divergence from uniformity on a bitmap graph… Further investigation led to this explanation of the phenomenon, namely the inadequacy of the PHP language rather than of the underlying (pseudo-)random generator. (It had been a while since I last had a go at this randomness controversy!)
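The bitmap graph in question is easy to reproduce in R; a minimal sketch (using R's own generator, so the picture should be featureless noise, whereas a poor generator betrays itself through stripes or diagonals):

```r
# Fill a square matrix with random bits and plot it: structure visible to
# the eye in the image is evidence against uniformity and independence.
n <- 256
bitmap <- matrix(rbinom(n^2, size = 1, prob = 0.5), n, n)
image(bitmap, col = c("white", "black"), axes = FALSE)

# a crude numerical counterpart: the proportion of ones should be near 1/2
mean(bitmap)
```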

Random generators for parallel processing

Posted in R, Statistics on October 28, 2010 by xi'an

Given the growing interest in parallel processing through GPUs or multiple processors, there is a clear need for a proper use of (uniform) random number generators in this environment. We were discussing the issue yesterday with Jean-Michel Marin and briefly looked at a few solutions: