## folded Normals

Posted in Books, Kids, pictures, R, Running, Statistics with tags , , , , , , , , , , , , on February 25, 2021 by xi'an

While having breakfast (after an early morn swim at the vintage La Butte aux Cailles pool, which let me in free!), I noticed a letter to the Editor in the Annals of Applied Statistics, which I was unaware existed. (The concept, not this specific letter!) The point of the letter was to indicate that finding the MLE for the mean and variance of a folded normal distribution was feasible without resorting to the EM algorithm. Since the folded normal distribution is a special case of mixture (with fixed weights), using EM is indeed quite natural, but the author, Iain MacDonald, remarked that an optimiser such as R nlm() could be called instead. The few lines of relevant R code were even included. While this is a correct if minor remark, I am a wee bit surprised at seeing it included in the journal, the more because the authors of the original paper using the EM approach were given the opportunity to respond, noticing EM is much faster than nlm in the cases they tested, and Iain MacDonald had a further rejoinder! The more because the Wikipedia page mentioned the use of optimisers much earlier (and pointed out at the R package Rfast as producing MLEs for the distribution).

## your GAN is secretly an energy-based model

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , , on January 5, 2021 by xi'an

As I was reading this NeurIPS 2020 paper by Che et al., and trying to make sense of it, I came across a citation to our paper Casella, Robert and Wells (2004) on a generalized accept-reject sampling scheme where the proposal changes at each simulation that sounds surprising if appreciated! But after checking this paper also appears as the first reference on the Wikipedia page for rejection sampling, which makes me wonder if many actually read it. (On the side, we mostly wrote this paper on a drive from Baltimore to Ithaca, after JSM 1999.)

“We provide more evidence that it is beneficial to sample from the energy-based model defined both by the generator and the discriminator instead of from the generator only.”

The paper seems to propose a post-processing of the generator output by a GAN, generating from the mixture of both generator and discriminator, via a (unscented) Langevin algorithm. The core idea is that, if p(.) is the true data generating process, g(.) the estimated generator and d(.) the discriminator, then

p(x) ≈ p⁰(x)∝g(x) exp(d(x))

(The approximation would be exact the discriminator optimal.) The authors work with the latent z’s, in the GAN meaning that generating pseudo-data x from g means taking a deterministic transform of z, x=G(z). When considering the above p⁰, a generation from p⁰ can be seen as accept-reject with acceptance probability proportional to exp[d{G(z)}]. (On the side, Lemma 1 is the standard validation for accept-reject sampling schemes.)

Reading this paper made me realised how much the field had evolved since my previous GAN related read. With directions like Metropolis-Hastings GANs and Wasserstein GANs. (And I noticed a “broader impact” section past the conclusion section about possible misuses with societal consequences, which is a new requirement for NeurIPS publications.)

## on arithmetic derivations of square roots

Posted in Books, Kids, pictures, R, Statistics with tags , , , , , , , , , on November 13, 2020 by xi'an

An intriguing question made a short-lived appearance on the CodeGolf section of Stack Exchange, before being removed, namely the (most concise possible) coding of an arithmetic derivation of the square root of an integer, S, with a 30 digit precision and using only arithmetic operators. I was not aware of the myriad of solutions available, as demonstrated on the dedicated WIkipedia page. And ended playing with three of them during a sleepless pre-election night!

The first solution for finding √S is based on a continued fraction representation of the root,

$\sqrt{S}=a+\cfrac{r}{2a+\cfrac{r}{2a+\ddots}}$

with a²≤S and r=S-a². It is straightforward to code-golf:

while((r<-S-T*T)^2>1e-9)T=(F<-2*T+r/(2*T+F))-T;F


but I found it impossible to reach the 30 digit precision (even when decreasing the error bound from 10⁻⁹). Given the strict rules of the game, this would have been considered to be a failure.

The second solution is Goldschmidt’s algorithm

b=S
T=1/sum((1:S)^2<S)
while((1-S*prod(T)^2)^2>1e-9){
b=b*T[1]^2
T=c((3-b)/2,T)}
S*prod(T)


which is longer for code-golfing but produces both √S and 1/√S (and is faster than the Babylonian method and Newton-Raphson). Again no luck with high precision and almost surely unacceptable for the game.

The third solution is the most interesting [imho] as it mimicks long division, working two digits at a time (and connection with Napier’s bones)

~=length
D=~S
S=c(S,0*(1:30))
p=d=0
a=1:9
while(~S){
F=c(F,x<-sum(a*(20*p+a)<=(g<-100*d+10*S[1]+S[2])))
d=g-x*(20*p+x)
p=x+10*p
S=S[-1:-2]}
sum(10^{1+D/2-1:~F}*F)


plus providing an arbitrary number of digits with no error. This code requires S to be entered as a sequence of digits (with a possible extra top digit 0 to make the integer D even). Returning one digit at a time, it would further have satisfied the constraints of the question (if in a poorly condensed manner).

## artificial EM

Posted in Books, Kids, R, Statistics, University life with tags , , , , , , on October 28, 2020 by xi'an

When addressing an X validated question on the use of the EM algorithm when estimating a Normal mean, my first comment was that it was inappropriate since there is no missing data structure to anchor by (right preposition?). However I then reflected upon the infinite number of ways to demarginalise the normal density into a joint density

$$∫ f(x,z;μ)dz = φ(x–μ)$$

from the (slice sampler) call to an indicator function for $$f(x,z;μ)$$ to a joint Normal distribution with an arbitrary correlation. While the joint Normal representation produces a sequence converging to the MLE, the slice representation utterly fails as the indicator functions make any starting value of $$μ$$ a fixed point for EM.

Incidentally, when quoting from Wikipedia on the purpose of the EM algorithm, the following passage

Finding a maximum likelihood solution typically requires taking the derivatives of the likelihood function with respect to all the unknown values, the parameters and the latent variables, and simultaneously solving the resulting equations.

struck me as confusing and possibly wrong since it seems to suggest to seek a maximum in both the parameter and the latent variables. Which does not produce the same value as the observed likelihood maximisation.

## babbage in, babbage out?!

Posted in Books, Kids, Statistics with tags , , , , , , on May 25, 2020 by xi'an

When checking for the origin of “garbage in, garbage out” on Wikipedia, I came upon this citation from Charles Babbage:

“On two occasions I have been asked, “Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?” … I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.”

following earlier quotes from him on this ‘Og.