## dynamic nested sampling for stars

Posted in Books, pictures, Statistics, Travel with tags , , , , , , , , , , , , , , , , , on April 12, 2019 by xi'an

In the sequel of earlier nested sampling packages, like MultiNest, Joshua Speagle has written a new package called dynesty that manages dynamic nested sampling, primarily intended for astronomical applications. Which is the field where nested sampling is the most popular. One of the first remarks in the paper is that nested sampling can be more easily implemented by using a Uniform reparameterisation of the prior, that is, a reparameterisation that turns the prior into a Uniform over the unit hypercube. Which means in fine that the prior distribution can be generated from a fixed vector of uniforms and known transforms. Maybe not such an issue given that this is the prior after all.  The author considers this makes sampling under the likelihood constraint a much simpler problem but it all depends in the end on the concentration of the likelihood within the unit hypercube. And on the ability to reach the higher likelihood slices. I did not see any special trick when looking at the documentation, but reflected on the fundamental connection between nested sampling and this ability. As in the original proposal by John Skilling (2006), the slice volumes are “estimated” by simulated Beta order statistics, with no connection with the actual sequence of simulation or the problem at hand. We did point out our incomprehension for such a scheme in our Biometrika paper with Nicolas Chopin. As in earlier versions, the algorithm attempts at visualising the slices by different bounding techniques, before proceeding to explore the bounded regions by several exploration algorithms, including HMC.

“As with any sampling method, we strongly advocate that Nested Sampling should not be viewed as being strictly“better” or “worse” than MCMC, but rather as a tool that can be more or less useful in certain problems. There is no “One True Method to Rule Them All”, even though it can be tempting to look for one.”

When introducing the dynamic version, the author lists three drawbacks for the static (original) version. One is the reliance on this transform of a Uniform vector over an hypercube. Another one is that the overall runtime is highly sensitive to the choice the prior. (If simulating from the prior rather than an importance function, as suggested in our paper.) A third one is the issue that nested sampling is impervious to the final goal, evidence approximation versus posterior simulation, i.e., uses a constant rate of prior integration. The dynamic version simply modifies the number of point simulated in each slice. According to the (relative) increase in evidence provided by the current slice, estimated through iterations. This makes nested sampling a sort of inversted Wang-Landau since it sharpens the difference between slices. (The dynamic aspects for estimating the volumes of the slices and the stopping rule may hinder convergence in unclear ways, which is not discussed by the paper.) Among the many examples produced in the paper, a 200 dimension Normal target, which is an interesting object for posterior simulation in that most of the posterior mass rests on a ring away from the maximum of the likelihood. But does not seem to merit a mention in the discussion. Another example of heterogeneous regression favourably compares dynesty with MCMC in terms of ESS (but fails to include an HMC version).

[Breaking News: Although I wrote this post before the exciting first image of the black hole in M87 was made public and hence before I was aware of it, the associated AJL paper points out relying on dynesty for comparing several physical models of the phenomenon by nested sampling.]

## Gibbs clashes with importance sampling

Posted in pictures, Statistics with tags , , , , , on April 11, 2019 by xi'an

In an X validated question, an interesting proposal was made: at each (component-wise) step of a Gibbs sampler, replace simulation from the exact full conditional with simulation from an alternate density and weight the resulting simulation with a term made of a product of (a) the previous weight (b) the ratio of the true conditional over the substitute for the new value and (c) the inverse ratio for the earlier value of the same component. Which does not work for several reasons:

1. the reweighting is doomed by its very propagation in that it keeps multiplying ratios of expectation one, which means an almost sure chance of degenerating;
2. the weights are computed for a previous value that has not been generated from the same proposal and is anyway already properly weighted;
3. due to the change in dimension produced by Gibbs, the actual target is the full conditional, which involves an intractable normalising constant;
4. there is no guarantee for the weights to have finite variance, esp. when the proposal has thinner tails than the target.

as can be readily checked by a quick simulation experiment. The funny thing is that a proper importance weight can be constructed when envisioning  the sequence of Gibbs steps as a Metropolis proposal (in the dimension of the target).

## Metropolis gets off the ground

Posted in Books, Kids, Statistics with tags , , , , , , , on April 1, 2019 by xi'an

An X validated discussion that toed-and-froed about an incomprehension of the Metropolis-Hastings algorithm. Which started with a blame of George Casella‘s and Roger Berger’s Statistical Inference (p.254), when the real issue was the inquisitor having difficulties with the notation V ~ f(v), or the notion of random variable [generation], mistaking identically distributed with identical. Even (me) crawling from one iteration to the next did not help at the beginning. Another illustration of the strong tendency on this forum to jettison fundamental prerequisites…

## (x=scan())%in%(2*4^(n=0:x)-2^n-1)

Posted in Books, Kids, R with tags , , , , , , , , , on March 28, 2019 by xi'an

One challenge on code golf is to find the shortest possible code to identify whether or not an integer belongs to the binary cyclops numbers which binary expansion is 0, 101, 11011, 1110111, 111101111, &tc. The n-th such number being

$a(n) = 2^{2n + 1} - 2^n - 1 = 2\,4^n - 2^n - 1 = (2^n - 1)(2\,2^n + 1)$

this leads to the above solution in R (26 bits). The same length as the C solution [which I do not get]

f(n){n=~n==(n^=-~n)*~n/2;}

And with shorter versions in many esoteric languages I had never heard of, like the 8 bits Brachylog code

ḃD↔Dḍ×ᵐ≠

or the 7 bits Jelly

B¬ŒḂ⁼SƊ

As a side remark, since this was not the purpose of the game, the R code is most inefficient in creating a set of size (x+1), with most terms being Inf.

## dominating measure

Posted in Books, pictures, Statistics, Travel, University life with tags , , , , , , , on March 21, 2019 by xi'an

Yet another question on X validated reminded me of a discussion I had once  with Jay Kadane when visiting Carnegie Mellon in Pittsburgh. Namely the fundamentally ill-posed nature of conjugate priors. Indeed, when considering the definition of a conjugate family as being a parameterised family Þ of distributions over the parameter space Θ stable under transform to the posterior distribution, this property is completely dependent (if there is such a notion as completely dependent!) on the dominating measure adopted on the parameter space Θ. Adopted is the word as there is no default, reference, natural, &tc. measure that promotes one specific measure on Θ as being the dominating measure. This is a well-known difficulty that also sticks out in most “objective Bayes” problems, as well as with maximum entropy priors. This means for instance that, while the Gamma distributions constitute a conjugate family for a Poisson likelihood, so do the truncated Gamma distributions. And so do the distributions which density (against a Lebesgue measure over an arbitrary subset of (0,∞)) is the product of a Gamma density by an arbitrary function of θ. I readily acknowledge that the standard conjugate priors as introduced in every Bayesian textbook are standard because they facilitate (to a certain extent) posterior computations. But, just like there exist an infinity of MaxEnt priors associated with an infinity of dominating measures, there exist an infinity of conjugate families, once more associated with an infinity of dominating measures. And the fundamental reason is that the sampling model (which induces the shape of the conjugate family) does not provide a measure on the parameter space Θ.

## simulation fodder for future exams

Posted in Books, Kids, R, Statistics with tags , , , , on February 20, 2019 by xi'an

Here are two nice exercises for a future simulation exam, seen and solved on X validated.The first one is about simulating a Gibbs sampler associated with the joint target

exp{-|x|-|y|-a|y-x|}

defined over IR² for a≥0 (or possibly a>-1). The conditionals are identical and non-standard, but a simple bound on the conditional density is the corresponding standard double exponential density, which makes for a straightforward accept-reject implementation. However it is also feasible to break the full conditional into three parts, depending on the respective positions of x, y, and 0, and to obtain easily invertible cdfs on the three intervals.The second exercise is about simulating from the cdf

$F(x)=1-\exp\{-ax-bx^{p+1}/(p+1)\}$

which can be numerically inverted. It is however more fun to call for an accept-reject algorithm by bounding the density with a ½ ½ mixture of an Exponential Exp(a) and of the 1/(p+1)-th power of an Exponential Exp(b/(p+1)). Since no extra constant appears in the solution,  I suspect the (p+1) in b/(p+1) was introduced on purpose. As seen in the above fit for 10⁶ simulations (and a=1,b=2,p=3), there is no deviation from the target! There is nonetheless an even simpler and rather elegant resolution to the exercise: since the tail function (1-F(x)) appears as the product of two tail functions, exp(-ax) and the other one, the cdf is the distribution of the minimum of two random variates, one with the Exp(a) distribution and the other one being the 1/(p+1)-th power of an Exponential Exp(b/(p+1)) distribution. Which of course returns a very similar histogram fit:

## leave Bayes factors where they once belonged

Posted in Statistics with tags , , , , , , , , , , on February 19, 2019 by xi'an

In the past weeks I have received and read several papers (and X validated entries)where the Bayes factor is used to compare priors. Which does not look right to me, not on the basis of my general dislike of Bayes factors!, but simply because this seems to clash with the (my?) concept of Bayesian model choice and also because data should not play a role in that situation, from being used to select a prior, hence at least twice to run the inference, to resort to a single parameter value (namely the one behind the data) to decide between two distributions, to having no asymptotic justification, to eventually favouring the prior concentrated on the maximum likelihood estimator. And more. But I fear that this reticence to test for prior adequacy also extends to the prior predictive, or Box’s p-value, namely the probability under this prior predictive to observe something “more extreme” than the current observation, to quote from David Spiegelhalter.