## more typos in Monte Carlo statistical methods

Posted in Books, Statistics, University life with tags , , , , , , , , , on October 28, 2011 by xi'an

Jan Hanning kindly sent me this email about several difficulties with Chapters 3, Monte Carlo Integration, and  5, Monte Carlo Optimization, when teaching out of our book Monte Carlo Statistical Methods [my replies in italics between square brackets, apologies for the late reply and posting, as well as for the confusion thus created. Of course, the additional typos will soon be included in the typo lists on my book webpage.]:

1. I seem to be unable to reproduce Table 3.3 on page 88 – especially the chi-square column does not look quite right. [No, they definitely are not right: the true χ² quantiles should be 2.70, 3.84, and 6.63, at the levels 0.1, 0.05, and 0.01, respectively. I actually fail to understand how we got this table that wrong...]
2. The second question  I have is the choice of the U(0,1) in this Example 3.6. It  feels to me that a choice of Beta(23.5,18.5) for p1 and Beta(36.5,5.5) for p2 might give a better representation based on the data we have. Any comments? [I am plainly uncertain about this... Yours is the choice based on the posterior Beta coefficient distributions associated with Jeffreys prior, hence making the best use of the data. I wonder whether or not we should remove this example altogether... It is certainly "better" than the uniform. However, in my opinion, there is no proper choice for the distribution of the pi's because we are mixing there a likelihood-ratio solution with a Bayesian perspective on the predictive distribution of the likelihood-ratio. If anything, this exposes the shortcomings of a classical approach, but it is likely to confuse the students! Anyway, this is a very interesting problem.]
3. My students discovered that Problem 5.19 has the following typos, copying from their e-mail: “x_x” should be “x_i” [sure!]. There are a few “( )”s missing here and there [yes!]. Most importantly, the likelihood/density seems incorrect. The normalizing constant should be the reciprocal of the one showed in the book [oh dear, indeed, the constant in the exponential density did not get to the denominator...]. As a result, all the formulas would differ except the ones in part (a). [they clearly need to be rewritten, sorry about this mess!]
4. I am unsure about the if and only if part of the Theorem 5.15 [namely that the likelihood sequence is stationary if and only if the Q function in the E step has reached a stationary point]. It appears to me that a condition for the “if part” is missing [the "only if" part is a direct consequence of Jensen's inequality]. Indeed Theorem 1 of Dempster et al 1977 has an extra condition [note that the original proof for convergence of EM has a flaw, as discussed here]. Am I missing something obvious? [maybe: it seems to me that, once Q reaches a fixed point, the likelihood L does not change... It is thus tautological, not a proof of convergence! But the theorem says a wee more, so this needs investigating. As Jan remarked, there is no symmetry in the Q function...]
5. Should there be a (n-m) in the last term of formula (5.17)? [yes, indeed!, multiply the last term by (n-m)]
6. Finally, I am a bit confused about the likelihood in Example 5.22 [which is a capture-recapture model]. Assume that Hij=k [meaning the animal i is in state k at time j]. Do you assume that you observed Xijr [which is the capture indicator for animal i at time j in zone k: it is equal to 1 for at most one k] as a Binomial B(n,pr) even for r≠k? [no, we observe all Xijr's with r≠k equal to zero]  The nature of the problem seems to suggest that the answer is no [for other indices, Xijr is always zero, indeed] If that is the case I do not see where the power on top of (1-pk) in the middle of the page 185 comes from [when the capture indices are zero, they do not contribute to the sum, which explains for this condensed formula. Therefore, I do not think there is anything wrong with this over-parameterised representation of the missing variables.]
7. In Section 5.3.4, there seems to be a missing minus sign in the approximation formula for the variance [indeed, shame on us for missing the minus in the observed information matrix!]
8. I could not find the definition of $\mathbb{N}^*$ in Theorem 6.15. Is it all natural numbers or all integers? May be it would help to include it in Appendix B. [Surprising! This is the set of all positive integers, I thought this was a standard math notation...]
9. In Definition 6.27, you probably want to say covering of A and not X. [Yes, we were already thinking of the next theorem, most likely!]
10. In Proposition 6.33 -   all x in A instead of all x in X. [Yes, again! As shown in the proof. Even though it also holds for all x in X]

Thanks a ton to Jan and to his UNC students (and apologies for leading them astray with those typos!!!)

## parallel Metropolis Hastings [published]

Posted in Statistics, University life with tags , , , , on October 27, 2011 by xi'an

As I was looking at the discussion paper by Yamin Yu and Xiao-Li Meng on improved efficiency for MCMC algorithms, which is available (for free) on-line, I realised the paper on parallel Metropolis-Hastings algorithm we wrote with Pierre Jacob and Murray Smith is now published in Journal of Computational and Graphical Statistics (on-line). This is a special issue for the 20th anniversary of the Journal of Computational and Graphical Statistics and our paper is within the “If Monte Carlo Be a Food of Computing, Simulate on” section! (My friends Olivier Cappé and Radu V. Craiu also have a paper in this issue.)  Here is the complete reference:

P. Jacob, C. P. Robert, & M. H. Smith. Using Parallel Computation to Improve Independent Metropolis–Hastings Based Estimation. Journal of Computational and Graphical Statistics. September 1, 2011, 20(3): 616-635. doi:10.1198/jcgs.2011.10167

The [20th Anniversary Featured Discussion] paper by Yamin Yu and Xiao-Li Meng has already been mentioned on Andrew’s blog, it is full of interesting ideas and remarks about improving Gibbs efficiency, in the spirit of the very fine work Jim Hobert and his collaborators have been developing in the past decade,  fun titles (“To center or not center – that is not the question”, “coupling is more promising than compromising”, “be all our insomnia remembered”, and “needing inception”, in connection with the talk Xiao-Li gave in Paris two months ago….), and above all the fascinating puzzle of linking statistical concepts and Monte Carlo concepts. How comes sufficiency and ancillarity are to play a role in simulation?! Where is the simulation equivalent of Basu’s theorem? These questions obviously relate to the idea of turning simulation into a measure estimation issue, discussed in a post of mine after the Columbia workshop. This interweaving paper also brings back memories of the fantastic Biometrika 1994 interleaving paper by Liu, Wong, and Kong, with its elegant proof of positive decreasing correlation and of improvement by Rao-Blackwellisation [another statistics theorem!] for data augmentation.

## Decision systems and nonstochastic randomness

Posted in Books, Statistics, University life with tags , , , , , on October 26, 2011 by xi'an

Thus the informativity of stochastic experiment turned out to depend on the Bayesian system and to coincide to within the scale factor with the previous “value of information”.” V. Ivanenko, Decision systems and nonstochastic randomness, p.208

This book, Decision systems and nonstochastic randomness, written by the Ukrainian researcher Victor Ivanenko, is related to decision theory and information theory, albeit with a statistical component as well. It however works at a fairly formal level and the reading is certainly not light. The randomness it address is the type formalised by Andreï Kolmogorov (also covered in the book Randomness through Computation I [rather negatively] reviewed a few months ago, inducing angry comments and scathing criticisms in the process). The terminology is slightly different from the usual one, but the basics are those of decision theory as in De Groot (1970). However, the tone quickly gets much more mathematical and the book lost me early in Chapter 3 (Indifferent uncertainty) on a casual reading. The following chapter on non-stochastic randomness reminded me of von Mises for its use of infinite sequences, and of the above book for its purpose, but otherwise offered an uninterrupted array of definitions and theorems that sounded utterly remote from statistical problems. After failing to make sense of the chapter on the informativity of experiment in Bayesian decision problems, I simply gave up… I thus cannot judge from this cursory reading whether or not the book is “useful in describing real situations of decision-making” (p.208). It just sounds very remote from my centres of interest. (Anyone interested by writing a review?)

## Catching up faster by switching sooner

Posted in R, Statistics, University life with tags , , , , , , , , on October 26, 2011 by xi'an

Here is our discussion (with Nicolas Chopin) of the Read Paper of last Wednesday by T. van Erven, P. Grünwald and S. de Rooij (Centrum voor Wiskunde en Informatica, Amsterdam), entitled Catching up faster by switching sooner: a predictive approach to adaptive estimation with an application to the Akaike information criterion–Bayesian information criterion dilemma. It is still available for written discussions, to be published in Series B. Even though the topic is quite tangential to our interests, the fact that the authors evolve in a Bayesian environment called for the following (my main contribution being in pointing out that the procedure is not Bayesian by failing to incorporate the switch in the predictive (6), hence using the same data for all models under competition…):

Figure 1 - Bayes factors of Model 2 vs.~Model 1 (gray line) and Model 3 vs.~Model 1 (dark line), plotted against the number of observations, i.e. of iterations, when comparing three stochastic volatility models; see Chopin et al. (2011) for full details.

This paper is an interesting attempt at a particularly important problem. We nonetheless believe more classical tools should be used instead if models are truly relevant in the inference led by the authors: Figure 1, reproduced from Chopin et al. (2011), plots [against time] the Bayes factors of Models 2 and 3 vs. Model 1, where all models are state-space models of increasing complexity, fitted to some real data. In this context, one often observes that more complex models need more time to “ascertain themselves”. On the other hand, even BMA based prediction is a very challenging computational problem (the only generic solution currently being the SMC² algorithm of the aforementioned paper), and we believe that the current proposed predictive strategy will remain too computationally expensive for practical use for nonlinear state-space models.

For other classes of models, since the provable methods put forward by this paper are based on “frozen strategies”, which are hard to defend from a modelling perspective, and since the more reasonable “basic switch” strategy seems to perform as well numerically, we would be curious to see how the proposed methods compare to predictive distributions obtained from genuine Bayesian models. A true change point model for instance would generate a coherent prediction strategy, which is not equivalent to the basic switch strategy. (Indeed, for one thing, the proposal made by the authors utilises the whole past to compute the switching probabilities, rather than allocating the proper portion of the data to the relevant model. In this sense, the proposal is “using the data [at least] twice” in a pseudo-Bayesian setting, similar to Aitkin’s, 1991.) More generally, the authors seem to focus on situations where the true generative process is a non-parametric class, and the completed models is an infinite sequence of richer and richer—but also of more and more complex—parametric models, which is a very sensible set-up in practice. Then, we wonder whether or not it would make more sense to set the prior distribution over the switch parameter s in such a way that (a) switches only occurs from one model to another model with greater complexity and (b) the number of switches is infinite.

For ABC readers, note the future Read Paper meeting on December 14 by Paul Fearnhead and Dennis Prangle.

## Approximate Bayesian computational methods on-line

Posted in R, Statistics, University life with tags , , , , , , on October 25, 2011 by xi'an

Fig. 4 – Boxplots of the evolution [against ε] of ABC approximations to the Bayes factor. The representation is made in terms of frequencies of visits to [accepted proposals from] models MA(1) and MA(2) during an ABC simulation when ε corresponds to the 10,1,.1,.01% quantiles on the simulated autocovariance distances. The data is a time series of 50 points simulated from a MA(2) model. The true Bayes factor is then equal to 17.71, corresponding to posterior probabilities of 0.95 and 0.05 for the MA(2) and MA(1) models, resp.

The survey we wrote with Jean-Michel Marin, Pierre Pudlo, and Robin Ryder is now published in [the expensive] Statistics and Computing (on-line). Beside recycling a lot of Og posts on ABC, this paper has the (personal) appeal of giving us the first hint that all was not so rosy in terms of ABC model choice. I wonder whether or not it will be part of the ABC special issue.