## an interesting identity

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , on March 1, 2018 by xi'an

Another interesting X validated question, another remembrance of past discussions on that issue. Discussions that took place in the Institut d’Astrophysique de Paris, nearby this painting of Laplace, when working on our cosmostats project. Namely the potential appeal of recycling multidimensional simulations by permuting the individual components in nearly independent settings. As shown by the variance decomposition in my answer, when opposing N iid pairs (X,Y) to the N combinations of √N simulations of X and √N simulations of Y, the comparison

$\text{var} \hat{\mathfrak{h}}^2_N=\text{var} (\hat{\mathfrak{h}}^1_N)+\frac{mn(n-1)}{N^2}\,\text{var}^Y\left\{ \mathbb{E}^{X}\left\{\mathfrak{h}(X,Y)\right\}\right\}$

$+\frac{m(m-1)n}{N^2}\,\text{var}^X\left[\mathbb{E}^Y\left\{\mathfrak{h}(X,Y)\right\}\right]$

unsurprisingly gives the upper hand to the iid sequence. A sort of converse to Rao-Blackwellisation…. Unless the production of N simulations gets much more costly when compared with the N function evaluations. No wonder we never see this proposal in Monte Carlo textbooks!

## a well-hidden E step

Posted in Books, Kids, pictures, R, Statistics with tags , , , , , , , , , on February 3, 2017 by xi'an

A recent question on X validated ended up being quite interesting! The model under consideration is made of parallel Markov chains on a finite state space, all with the same Markov transition matrix, M, which turns into a hidden Markov model when the only summary available is the number of chains in a given state at a given time. When writing down the EM algorithm, the E step involves the expected number of moves from a given state to a given state at a given time. The conditional distribution of those numbers of chains is a product of multinomials across times and starting states, with no Markov structure since the number of chains starting from a given state is known at each instant. Except that those multinomials are constrained by the number of “arrivals” in each state at the next instant and that this makes the computation of the expectation intractable, as far as I can see.

A solution by Monte Carlo EM means running the moves for each instant under the above constraints, which is thus a sort of multinomial distribution with fixed margins, enjoying a closed-form expression but for the normalising constant. The direct simulation soon gets too costly as the number of states increases and I thus considered a basic Metropolis move, using one margin (row or column) or the other as proposal, with the correction taken on another margin. This is very basic but apparently enough for the purpose of the exercise. If I find time in the coming days, I will try to look at the ABC resolution of this problem, a logical move when starting from non-sufficient statistics!

## importance sampling and necessary sample size

Posted in Books, Statistics with tags , , , , , on September 7, 2016 by xi'an

Daniel Sanz-Alonso arXived a note yesterday where he analyses importance sampling from the point of view of empirical distributions. With the difficulty that unnormalised importance sampling estimators are not associated with an empirical distribution since the sum of the weights is not one. For several f-divergences, he obtains upper bounds on those divergences between the empirical cdf and a uniform version, D(w,u), which translate into lower bounds on the importance sample size. I however do not see why this divergence between a weighted sampled and the uniformly weighted version is relevant for the divergence between the target and the proposal, nor how the resulting Monte Carlo estimator is impacted by this bound. A side remark [in the paper] is that those results apply to infinite variance Monte Carlo estimators, as in the recent paper of Chatterjee and Diaconis I discussed earlier, which also discussed the necessary sample size.

## merging MCMC subposteriors

Posted in Books, Statistics, University life with tags , , , , , , , on June 8, 2016 by xi'an

Christopher Nemeth and Chris Sherlock arXived a paper yesterday about an approach to distributed MCMC sampling via Gaussian processes. As in several other papers commented on the ‘Og, the issue is to merge MCMC samples from sub-posteriors into a sample or any sort of approximation of the complete (product) posterior. I am quite sympathetic to the approach adopted in this paper, namely to use a log-Gaussian process representation of each sub-posterior and then to replace each sub-posterior with its log-Gaussian process posterior expectation in an MCMC or importance scheme. And to assess its variability through the posterior variance of the sum of log-Gaussian processes. As pointed out by the authors the closed form representation of the posterior mean of the log-posterior is invaluable as it allows for an HMC implementation. And importance solutions as well. The probabilistic numerics behind this perspective are also highly relevant.

A few arguable (?) points:

1. The method often relies on importance sampling and hence on the choice of an importance function that is most likely influential but delicate to calibrate in complex settings as I presume the Gaussian estimates are not useful in this regard;
2. Using Monte Carlo to approximate the value of the approximate density at a given parameter value (by simulating from the posterior distribution) is natural but is it that efficient?
3. It could be that, by treating all sub-posterior samples as noisy versions of the same (true) posterior, a more accurate approximation of this posterior could be constructed;
4. The method relies on the exponentiation of a posterior expectation or simulation. As of yesterday, I am somehow wary of log-normal expectations!
5. If the purpose of the exercise is to approximate univariate integrals, it would seem more profitable to use the Gaussian processes at the univariate level;
6. The way the normalising missing constants and the duplicate simulations are processed (or not) could deserve further exploration;
7. Computing costs are in fine unclear when compared with the other methods in the toolbox.

## Computing the variance of a conditional expectation via non-nested Monte Carlo

Posted in Books, pictures, Statistics, University life with tags , , , , on May 26, 2016 by xi'an

The recent arXival by Takashi Goda of Computing the variance of a conditional expectation via non-nested Monte Carlo led me to read it as I could not be certain of the contents from only reading the title! The short paper considers the issue of estimating the variance of a conditional expectation when able to simulate the joint distribution behind the quantity of interest. The second moment E(E[f(X)|Y]²) can be written as a triple integral with two versions of x given y and one marginal y, which means that it can approximated in an unbiased manner by simulating a realisation of y then conditionally two realisations of x. The variance requires a third simulation of x, which the author seems to deem too costly and that he hence replaces with another unbiased version based on two conditional generations only. (He notes that a faster biased version is available with bias going down faster than the Monte Carlo error, which makes the alternative somewhat irrelevant, as it is also costly to derive.) An open question after reading the paper stands with the optimal version of the generic estimator (5), although finding the optimum may require more computing time than it is worth spending. Another one is whether or not this version of the expected conditional variance is more interesting (computation-wise) that the difference between the variance and the expected conditional variance as reproduced in (3) given that both quantities can equally be approximated by unbiased Monte Carlo…

## Bayesian model averaging in astrophysics [guest post]

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , , , , , , on August 12, 2015 by xi'an

.[Following my posting of a misfiled 2013 blog, Ewan Cameron told me of the impact of this paper in starting his own blog and I asked him for a guest post, resulting in this analysis, much deeper than mine. No warning necessary this time!]

Back in February 2013 when “Bayesian Model Averaging in Astrophysics: A Review” by Parkinson & Liddle (hereafter PL13) first appeared on the arXiv I was a keen, young(ish) postdoc eager to get stuck into debates about anything and everything ‘astro-statistical’. And with its seemingly glaring flaws, PL13 was more grist to the mill. However, despite my best efforts on various forums I couldn’t get a decent fight started over the right way to do model averaging (BMA) in astronomy, so out of sheer frustration two months later I made my own soapbox to shout from at Another Astrostatistics Blog. Having seen PL13 reviewed recently here on Xian’s Og it feels like the right time to revisit the subject and reflect on where BMA in astronomy is today.

As pointed out to me back in 2013 by Tom Loredo, the act of Bayesian model averaging has been around much longer than its name; indeed an early astronomical example appears in Gregory & Loredo (1992) in which the posterior mean representation of an unknown signal is constructed for an astronomical “light-curve”, averaging over a set of constant and periodic candidate models. Nevertheless the wider popularisation of model averaging in astronomy has only recently taken place through a variety of applications in cosmology: e.g. Liddle, Mukherjee, Parkinson & Wang (2006) and Vardanyan, Trotta & Silk (2011).

In contrast to earlier studies like Gregory & Loredo (1992)—or the classic review on BMA by Hoeting et al. (1999)—in which the target of model averaging is typically either a utility function, a set of future observations, or a latent parameter of the observational process (e.g. the unknown “light-curve” shape) shared naturally by all competing models, the proposal of cosmological BMA studies is to produce a model-averaged version of the posterior for a given ‘shared’ parameter: a so-called “model-averaged PDF”. This proposal didn’t sit well with me back in 2013, and it still doesn’t sit well with me today. Philosophically: without a model a parameter has no meaning, so why should we want to seek meaning in the marginalised distribution of a parameter over an entire set of models? And, practically: to put it another way, without knowing the model ‘label’ to which a given mass of model-averaged parameter probability belongs there’s nothing much useful we can do with this ‘PDF’: nothing much we can say about the data we’ve just analysed and nothing much we can say about future experiments. Whereas the space of the observed data is shared automatically by all competing models, it seems to me to be somehow “un-Bayesian” to place the further restriction that the parameters of separate models share the same scale and topology. I say “un-Bayesian” since this mode of model averaging suggests a formulation of the parameter space + prior pairing stronger than the statement of one’s prior beliefs for the distribution of observable data given the model. But I would be happy to hear arguments from the other side in the comments box below … ! Continue reading