## puzzled by harmony [not!]

Posted in Books, Kids, Mountains, pictures, R, Running, Statistics, Travel with tags , , , , , on December 13, 2016 by xi'an

In answering yet another question on X validated about the numerical approximation of the marginal likelihood, I suggested using an harmonic mean estimate as a simple but worthless solution based on an MCMC posterior sample. This was on a toy example with a uniform prior on (0,π) and a “likelihood” equal to sin(θ) [really a toy problem!]. Simulating an MCMC chain by a random walk Metropolis-Hastings algorithm is straightforward, as is returning the harmonic mean of the sin(θ)’s.

f <- function(x){
if ((0<x)&(x<pi)){
return(sin(x))}else{
return(0)}}

n = 2000 #number of iterations
sigma = 0.5
x = runif(1,0,pi) #initial x value
chain = fx = f(x)
#generates an array of random x values from norm distribution
rands = rnorm(n,0, sigma)
#Metropolis - Hastings algorithm
for (i in 2:n){
can = x + rands[i]  #candidate for jump
fcan=f(can)
aprob = fcan/fx #acceptance probability
if (runif(1) < aprob){
x = can
fx = fcan}
chain=c(chain,fx)}
I = pi*length(chain)/sum(1/chain) #integral harmonic approximation

However, the outcome looks remarkably stable and close to the expected value 2/π, despite 1/sin(θ) having an infinite integral on (0,π). Meaning that the average of the 1/sin(θ)’s has no variance. Hence I wonder why this specific example does not lead to an unreliable output… But re-running the chain with a smaller scale σ starts producing values of sin(θ) regularly closer to zero, which leads to an estimate of I both farther away from 2 and much more variable. No miracle, in the end!

## debunking a (minor and personal) myth

Posted in Books, Kids, R, Statistics, University life with tags , , , , on September 9, 2015 by xi'an

For quite a while, I entertained the idea that Beta and Dirichlet proposals  were more adequate than (log-)normal random walks proposals for parameters on (0,1) and simplicia (simplices, simplexes), respectively, when running an MCMC. For instance, for p in (0,1) the value of the Markov chain at time t-1, the proposal at time t could be a Be(εp,ε{1-p}) generator, since its mean is equal to p and its variance is proportional to 1/(1+ε). (Although I cannot find track of this notion in my books.) The parameter ε can be calibrated towards a given acceptance rate, like the golden number 0.234 of Gelman, Gilks and Roberts (1996). However, when using this proposal on a mixture model, Kaniav Kamari and myself realised today that there is a catch, namely that pushing ε down to achieve an acceptance rate near 0.234 may end up in disaster, since the parameters of the Beta or of the Dirichlet may become lower than 1, which implies an infinite explosion on some boundaries of the parameter space. An explosion that gets more and more serious as ε decreases to zero. Hence is more and more likely to decrease the acceptance rate, thus to reduce ε, which in turns concentrates even more the support on the boundary and leads to a vicious circle and no convergence to the target acceptance rate… Continue reading

Posted in Books, Statistics, University life with tags , , , , , , on May 15, 2015 by xi'an

[Here is a reply by Pierre Druihlet to my comments on his paper.]

There are several goals in the paper, the last one being the most important one.

The first one is to insist that considering θ as a parameter is not appropriate. We are in complete agreement on that point, but I prefer considering l(θ) as the parameter rather than N, mainly because it is much simpler. Knowing N, the law of l(θ) is given by the law of a random walk with 0 as reflexive boundary (Jaynes in his book, explores this link). So for a given prior on N, we can derive a prior on l(θ). Since the random process that generate N is completely unknown, except that N is probably large, the true law of l(θ) is completely unknown, so we may consider l(θ).

The second one is to state explicitly that a flat prior on θ implies an exponentially increasing prior on l(θ). As an anecdote, Stone, in 1972, warned against this kind of prior for Gaussian models. Another interesting anecdote is that he cited the novel by Abbot “Flatland : a romance of many dimension” who described a world where the dimension is changed. This is exactly the case in the FP since θ has to be seen in two dimensions rather than in one dimension.

The third one is to make a distinction between randomness of the parameter and prior distribution, each one having its own rule. This point is extensively discussed in Section 2.3.
– In the intuitive reasoning, the probability of no annihilation involves the true joint distribution on (θ, x) and therefore the true unknown distribution of θ,.
– In the Bayesian reasoning, the posterior probability of no annihilation is derived from the prior distribution which is improper. The underlying idea is that a prior distribution does not obey probability rules but belongs to a projective space of measure. This is especially true if the prior does not represent an accurate knowledge. In that case, there is no discontinuity between proper and improper priors and therefore the impropriety of the distribution is not a key point. In that context, the joint and marginal distributions are irrelevant, not because the prior is improper, but because it is a prior and not a true law. If the prior were the true probability law of θ,, then the flat distribution could not be considered as a limit of probability distributions.

For most applications, the distinction between prior and probability law is not necessary and even pedantic, but it may appear essential in some situations. For example, in the Jeffreys-Lindley paradox, we may note that the construction of the prior is not compatible with the projective space structure.

## Slices and crumbs [arXiv:1011.4722]

Posted in Books, R, Statistics with tags , , , , , , , on November 30, 2010 by xi'an

An interesting note was arXived a few days ago by Madeleine Thompson and Radford Neal. Beside the nice touch of mixing crumbs and slices, the neat idea is to have multiple-try proposals for simulating within a slice and to decrease the dimension of the simulation space at each try. This dimension diminution is achieved via the construction of an orthogonal basis based on the gradient of the log densities at previously-rejected proposals.

$\mathbf{J}=(\nabla\log f(x_1),\ldots,\nabla\log f(x_k))$

until all dimensions are exhausted, in which case the scale of the Gaussian proposal is reduced. (The paper comes with R and C codes.) Provided the gradient can be computed (or at least approximated), this is a fairly general method (even though I have not tested it so cannot say how much calibration it requires). An interesting point is that, contrariwise to the delayed-rejection method of Antonietta Mira and co-authors,  the repeated proposals do not induce a complexification in the slice acceptance probability. I am less convinced by the authors’ conclusion that the method compares with adaptive Metropolis techniques, in the sense the “shrinking rank” method forgets about past experiences as it starts from scratch at each iteration: it is thus not really learning… (Now, in terms of performances, this may be the case!)