## estimating constants [survey]

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , , , on February 2, 2017 by xi'an

A new survey on Bayesian inference with intractable normalising constants was posted on arXiv yesterday by Jaewoo Park and Murali Haran. A rather massive work of 58 pages, almost handy for a short course on the topic! In particular, it goes through the most common MCMC methods with a detailed description, followed by comments on components to be calibrated and the potential theoretical backup. This includes for instance the method of Liang et al. (2016) that I reviewed a few months ago. As well as the Wang-Landau technique we proposed with Yves Atchadé and Nicolas Lartillot. And the noisy MCMC of Alquier et al. (2016), also reviewed a few months ago. (The Russian Roulette solution is only mentioned very briefly as” computationally very expensive”. But still used in some illustrations. The whole area of pseudo-marginal MCMC is also missing from the picture.)

“…auxiliary variable approaches tend to be more efficient than likelihood approximation approaches, though efficiencies vary quite a bit…”

The authors distinguish between MCMC methods where the normalizing constant is approximated and those where it is omitted by an auxiliary representation. The survey also distinguishes between asymptotically exact and asymptotically inexact solutions. For instance, using a finite number of MCMC steps instead of the associated target results in an asymptotically inexact method. The question that remains open is what to do with the output, i.e., whether or not there is a way to correct for this error. In the illustration for the Ising model, the double Metropolis-Hastings version of Liang et al. (2010) achieves for instance massive computational gains, but also exhibits a persistent bias that would go undetected were it the sole method implemented. This aspect of approximate inference is not really explored in the paper, but constitutes a major issue for modern statistics (and machine learning as well, when inference is taken into account.)

In conclusion, this survey provides a serious exploration of recent MCMC methods. It begs for a second part involving particle filters, which have often proven to be faster and more efficient than MCMC methods, at least in state space models. In that regard, Nicolas Chopin and James Ridgway examined further techniques when calling to leave the Pima Indians [dataset] alone.

## recycling Gibbs auxiliaries [a reply]

Posted in Books, pictures, Statistics, University life with tags , , , , , , , on January 3, 2017 by xi'an

[Here is a reply sent to me by Luca Martino, Victor Elvira, and Gustau Camp-Vallis, after my earlier comments on their paper.]

We provide our contribution to the discussion, reporting our experience with the application of Metropolis-within-Gibbs schemes. Since in literature there are miscellaneous opinions, we want to point out the following considerations:

– according to our experience, the use of M>1 steps of the Metropolis-Hastings (MH) method for drawing from each full-conditional (with or without recycling), decreases the MSE of the estimation (see code Ex1-Ex2 and related Figure 7(b) and Figures 8). If the corresponding full conditional is very concentrated, one possible solution is to applied an adaptive or automatic MH for drawing from this full-conditional (it can require the use of M internal steps; see references in Section 3.2).

– Fixing the number of evaluations of the posterior, the comparison between a longer Gibbs chain with a single step of MH and a shorter Gibbs chain with M>1 steps of MH per each full-conditional, is required. Generally, there is no clear winner. The better performance depends on different aspects: the specific scenario, if and adaptive MH is employed or not, if the recycling is applied or not (see Figure 10(a) and the corresponding code Ex2).

The previous considerations are supported/endorsed by several authors (see the references in Section 3.2). In order to highlight the number of controversial opinions about the MH-within-Gibbs implementation, we report a last observation:

– If it is possible to draw directly from the full-conditionals, of course this is the best scenario (this is our belief). Remarkably, as also reported in Chapter 1, page 393 of the book “Monte Carlo Statistical Methods”, C. Robert and Casella, 2004, some authors have found that a “bad” choice of the proposal function in the MH step (i.e., different from the full conditional, or a poor approximation of it) can improve the performance of the MH-within-Gibbs sampler. Namely, they assert that a more “precise” approximation of the full-conditional does not necessarily improve the overall performance. In our opinion, this is possibly due to the fact that the acceptance rate in the MH step (lower than 1) induces an “accidental” random scan of the components of the target pdf in the Gibbs sampler, which can improve the performance in some cases. In our work, for the simplicity, we only focus on the deterministic scan. However, a random scan could be also considered.

## recycling Gibbs auxiliaries

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , , , on December 6, 2016 by xi'an

Luca Martino, Victor Elvira and Gustau Camps-Valls have arXived a paper on recycling for Gibbs sampling. The argument therein is to take advantage of all simulations induced by MCMC simulation for one full conditional, towards improving estimation if not convergence. The context is thus one when Metropolis-within-Gibbs operates, with several (M) iterations of the corresponding Metropolis being run instead of only one (which is still valid from a theoretical perspective). While there are arguments in augmenting those iterations, as recalled in the paper, I am not a big fan of running a fixed number of M of iterations as this does not approximate better the simulation from the exact full conditional and even if this approximation was perfect, the goal remains simulating from the joint distribution. As such, multiplying the number of Metropolis iterations does not necessarily impact the convergence rate, only brings it closer to the standard Gibbs rate. Moreover, the improvement does varies with the chosen component, meaning that the different full conditionals have different characteristics that produce various levels of variance reduction:

• if the targeted expectation only depends on one component of the Markov chain, multiplying the number of simulations for the other components has no clear impact, except in increasing time;
• if the corresponding full conditional is very concentrated, repeating simulations should produce quasi-repetitions, and no gain.

The only advantage in computing time that I can see at this stage is when constructing the MCMC sampler for the full proposal is much more costly than repeating MCMC iterations, which are then almost free and contribute to the reduction of the variance of the estimator.

This analysis of MCMC-withing-Gibbs strategies reminds me of a recent X validated question, which was about the proper degree of splitting simulations from a marginal and from a corresponding conditional in the chain rule, the optimal balance being in my opinion dependent on the relative variances of the conditional expectations.

A last point is that recycling in the context of simulation and Monte Carlo methodology makes me immediately think of Rao-Blackwellisation, which is surprisingly absent from the current paperRao-Blackwellisation was introduced in the MCMC literature and to the MCMC community in the first papers of Alan Gelfand and Adrian Smith, in 1990. While this is not always producing a major gain in Monte Carlo variability, it remains a generic way of recycling auxiliary variables as shown, e.g., in the recycling paper we wrote with George Casella in 1996, one of my favourite papers.

## non-local priors for mixtures

Posted in Statistics, University life with tags , , , , , , , , , , , , , , , on September 15, 2016 by xi'an

[For some unknown reason, this commentary on the paper by Jairo Fúquene, Mark Steel, David Rossell —all colleagues at Warwick— on choosing mixture components by non-local priors remained untouched in my draft box…]

Choosing the number of components in a mixture of (e.g., Gaussian) distributions is a hard problem. It may actually be an altogether impossible problem, even when abstaining from moral judgements on mixtures. I do realise that the components can eventually be identified as the number of observations grows to infinity, as demonstrated for instance by Judith Rousseau and Kerrie Mengersen (2011). But for a finite and given number of observations, how much can we trust any conclusion about the number of components?! It seems to me that the criticism about the vacuity of point null hypotheses, namely the logical absurdity of trying to differentiate θ=0 from any other value of θ, applies to the estimation or test on the number of components of a mixture. Doubly so, one might argue, since a very small or a very close component is undistinguishable from a non-existing one. For instance, Definition 2 is correct from a mathematical viewpoint, but it does not spell out the multiple contiguities between k and k’ component mixtures.

The paper starts with a comprehensive coverage of l’état de l’art… When using a Bayes factor to compare a k-component and an h-component mixture, the behaviour of the factor is quite different depending on which model is correct. Essentially overfitted mixtures take much longer to detect than underfitted ones, which makes intuitive sense. And BIC should be corrected for overfitted mixtures by a canonical dimension λ between the true and the (larger) assumed number of parameters  into

2 log m(y) = 2 log p(y|θ) – λ log O(n) + O(log log n)

I would argue that this purely invalidates BIG in mixture settings since the canonical dimension λ is unavailable (and DIC does not provide a useful substitute as we illustrated a decade ago…) The criticism about Rousseau and Mengersen (2011) over-fitted mixture that their approach shrinks less than a model averaging over several numbers of components relates to minimaxity and hence sounds both overly technical and reverting to some frequentist approach to testing. Replacing testing with estimating sounds like the right idea.  And I am also unconvinced that a faster rate of convergence of the posterior probability or of the Bayes factor is a relevant factor when conducting

As for non local priors, the notion seems to rely on a specific topology for the parameter space since a k-component mixture can approach a k’-component mixture (when k'<k) in a continuum of ways (even for a given parameterisation). This topology seems to be summarised by the penalty (distance?) d(θ) in the paper. Is there an intrinsic version of d(θ), given the weird parameter space? Like one derived from the Kullback-Leibler distance between the models? The choice of how zero is approached clearly has an impact on how easily the “null” is detected, the more because of the somewhat discontinuous nature of the parameter space. Incidentally, I find it curious that only the distance between means is penalised… The prior also assumes independence between component parameters and component weights, which I think is suboptimal in dealing with mixtures, maybe suboptimal in a poetic sense!, as we discussed in our reparameterisation paper. I am not sure either than the speed the distance converges to zero (in Theorem 1) helps me to understand whether the mixture has too many components for the data’s own good when I can run a calibration experiment under both assumptions.

While I appreciate the derivation of a closed form non-local prior, I wonder at the importance of the result. Is it because this leads to an easier derivation of the posterior probability? I do not see the connection in Section 3, except maybe that the importance weight indeed involves this normalising constant when considering several k’s in parallel. Is there any convergence issue in the importance sampling solution of (3.1) and (3.3) since the simulations are run under the local posterior? While I appreciate the availability of an EM version for deriving the MAP, a fact I became aware of only recently, is it truly bringing an improvement when compared with picking the MCMC simulation with the highest completed posterior?

The section on prior elicitation is obviously of central interest to me! It however seems to be restricted to the derivation of the scale factor g, in the distance, and of the parameter q in the Dirichlet prior on the weights. While the other parameters suffer from being allocated the conjugate-like priors. I would obviously enjoy seeing how this approach proceeds with our non-informative prior(s). In this regard, the illustration section is nice, but one always wonders at the representative nature of the examples and the possible interpretations of real datasets. For instance, when considering that the Old Faithful is more of an HMM than a mixture.

## astroABC: ABC SMC sampler for cosmological parameter estimation

Posted in Books, R, Statistics, University life with tags , , , , , , , , on September 6, 2016 by xi'an

“…the chosen statistic needs to be a so-called sufficient statistic in that any information about the parameter of interest which is contained in the data, is also contained in the summary statistic.”

Elise Jenningsa and Maeve Madigan arXived a paper on a new Python code they developed for implementing ABC-SMC, towards astronomy or rather cosmology applications. They stress the parallelisation abilities of their approach which leads to “crucial speed enhancement” against the available competitors, abcpmc and cosmoabc. The version of ABC implemented there is “our” ABC PMC where particle clouds are shifted according to mixtures of random walks, based on each and every point of the current cloud, with a scale equal to twice the estimated posterior variance. (The paper curiously refers to non-astronomy papers through their arXiv version, even when they have been published. Like our 2008 Biometrika paper.) A large part of the paper is dedicated to computing aspects that escape me, like the constant reference to MPIs. The algorithm is partly automated, except for the choice of the summary statistics and of the distance. The tolerance is chosen as a (large) quantile of the previous set of simulated distances. Getting comments from the designers of abcpmc and cosmoabc would be great.

“It is clear that the simple Gaussian Likelihood assumption in this case, which neglects the effects of systematics yields biased cosmological constraints.”

The last part of the paper compares ABC and MCMC on a supernova simulated dataset. Which is somewhat a dubious comparison since the model used for producing the data and running ABC is not the same as the Gaussian version used with MCMC. Unsurprisingly, MCMC then misses the true value of the cosmological parameters and most likely and more importantly the true posterior HPD region. While ABC SMC (or PMC) proceeds to a concentration around the genuine parameter values. (There is no additional demonstration of how accelerated the approach is.)

## Assistant Professor position @ WU

Posted in Mountains, Statistics, Travel, University life, Wines with tags , , , , , , , on August 15, 2016 by xi'an

There is an opening for an assistant professor non-tenure position in Vienna, WU, in Sylvia Früwirth-Schnatter’s group. With deadline September 7, 2016. The requested profile is

– PhD in applied mathematics or in statistics with a strong mathematical background
– Enthusiastic interest in research in Bayesian statistics, exemplified through publications in international journals in topics including, but not limited to, Bayesian non-parametric methods, Bayesian inference for high-dimensional and complex data, Bayesian time series analysis and state space modelling, efficient Markov chain Monte Carlo methods
– Interest in applications in economics, finance, and business
– Excellent programming skills (e.g. in R or Matlab)
– German language skills are not a prerequisite

Here are the details for those interested in this exciting opportunity!

## commentaries in financial econometrics

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , , on April 27, 2016 by xi'an

My comment(arie)s on the moment approach to Bayesian inference by Ron Gallant have appeared, along with other comment(arie)s:

Invited Article
Reflections on the Probability Space Induced by Moment Conditions with
Implications for Bayesian Inference
A. Ronald Gallant . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
Commentaries
Dante Amengual and Enrique Sentana .. . . . . . . . . . 248
John Geweke . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .253
Jae-Young Kim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
Oliver Linton and Ruochen Wu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .261
Christian P. Robert . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
Christopher A. Sims . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
Wei Wei and Asger Lunde . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  . . . . . . . . . .278
Author Response
A. Ronald Gallant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .284

While commenting on commentaries is formally bound to induce an infinite loop [or l∞p], I remain puzzled by the main point of the paper, which is that setting a structural distribution on a moment function Z(x,θ) plus a prior p(θ) induces a distribution on the pair (x,θ) in a possibly weaker σ-algebra. (The two distributions may actually be incompatible.) Handling this framework requires checking that a posterior exists, which sounds rather unnatural (even though we also have to check properness of the posterior). And the meaning of such a posterior remains unclear, as for instance in this assertion that (4) above is a likelihood, when it does not define a density in x but on the object inside the exponential.

“…it is typically difficult to determine whether there exists a p(x|θ) such that the implied distribution of m(x,θ) is the one stated, and if not, what damage is done thereby” J. Geweke (p.254)