double yolk priors [a reply from the authors]

Posted in Books, Statistics, University life with tags , , , , , on March 14, 2018 by xi'an

[Here is an email I received from Subhadeep Mukhopadhyay, one of the authors of the paper I discussed yesterday.}
Thank for discussing our work. Let me clarify the technical point that you raised:
– The difference between Legj(u)_j and Tj=Legj(G(θ)). One is orthonormal polyn of L2[0,1] and the other one is L2[G]. The second one is poly of rank-transform G(θ).
– As you correctly pointed out there is a danger in directly approximating the ratio. We work on it after taking the quantile transform: evaluate the ratio at g⁻¹(θ), which is the d(u;G,F) over unit interval. Now, this new transformed function is a proper density.
-Thus the ratio now becomes d(G(θ)) which can be expended into (NOT in Leg-basis) in $T_j$, in eq (2.2), as it lives in the Hilbert space L2(G)
– For your last point on Step 2 of our algo, we can also use the simple integrate command.
-Unlike traditional prior-data conflict here we attempted to answer three questions in one-shot: (i) How compatible is the pre-selected g with the given data? (ii) In the event of a conflict, can we also inform the user on the nature of misfit–finer structure that was a priori unanticipated? (iii) Finally, we would like to provide a simple, yet formal guideline for upgrading (repairing) the starting g.
Hopefully, this will clear the air. But thanks for reading the paper so carefully. Appreciate it.

Hamiltonian MC on discrete spaces [a reply from the authors]

Posted in Books, pictures, Statistics, University life with tags , , , , , on July 8, 2017 by xi'an

Q. Why not embed discrete parameters so that the resulting surrogate density function is smooth?

A. This is only possible in very special settings. Let’s say we have a target distribution π(θ, n), where θ is continuous and ‘n’ is discrete. To construct a surrogate smooth density, we would need to somehow smoothly interpolate a collection of functions fn(θ) = π(θ, n) for n = 1, 2, …. It is not clear to us how we can achieve this in a general and tractable way.

Q. How to generalize the algorithm to a more complex parameter space?

A. We provide a clear solution to dealing with a discontinuous target density defined on a continuous parameter space. We agree, however, that there remains the question of whether and how a more complex parameter space can be embedded into a continuous space. This certainly deserves a further investigation. For example, a binary tree can be embedded in to an interval [0,1] through a dyadic expansion of a real number.

Q. Physical intuition of discontinuous Hamiltonian dynamics is not clear from a theory of differential measure-valued equation and selection principle.

A. Hamiltonian dynamics with a discontinuous potential energy has long been used by physicists as a natural model for some physical phenomena (also known as “impulsive systems”). The main difference from a smooth system is that a gradient become a “delta function” at the discontinuity, causing an instantaneous “push” toward the direction of lower potential energy. A theory of differential measure-valued equation / inclusion and selection principle is only a mathematical formalization of such physical systems.

Q. (A special case of) DHMC looks like taking multiple Gibbs steps?

A. The crucial difference from Metropolis-within-Gibbs is the presence of momentum in DHMC, which helps guide a Markov chain toward a high density region.

The effect of momentum is evident in the Jolly-Seber example of Section 5.1, where DHMC shows 60-fold efficiency improvement over a sampler “NUTS-Gibbs” based on conditional updates. Also, a direct comparison of DHMC and Metropolis-within-Gibbs can be found in Section S4.1 where DHMC, thanks to the momentum, is about 7 times more efficient than Metropolis-within-Gibbs (with optimal proposal variances).

Q. Unlike HMC, DHMC does not seem to use structural information about the parameter space and local information about the target density?

A. It does. After all, other than the use of Laplace momentum and discontinuity in the target density, DHMC is based on the same principle as HMC — simulating Hamiltonian dynamics to generate a proposal.

The confusion is perhaps due to the fact that the coordinate-wise integrator of DHMC does not require gradients. The gradient of the log density — which may be a “delta” function at discontinuities — plays a clear role if you look at Hamilton’s equations Eq (10) corresponding to a Laplace momentum. It’s just that, thanks to a property of a Laplace momentum and conservation of energy principle, we can approximate the exact dynamics without ever computing the gradient. This is in fact a remarkable property of a Laplace momentum and our coordinate-wise integrator.

asymptotically exact inference in likelihood-free models [a reply from the authors]

Posted in R, Statistics with tags , , , , , , , , , , , , , , , , , on December 1, 2016 by xi'an

[Following my post of lastTuesday, Matt Graham commented on the paper with force détails. Here are those comments. A nicer HTML version of the Markdown reply below is also available on Github.]

Thanks for the comments on the paper!

A few additional replies to augment what Amos wrote:

This however sounds somewhat intense in that it involves a quasi-Newton resolution at each step.

The method is definitely computationally expensive. If the constraint function is of the form of a function from an M-dimensional space to an N-dimensional space, with MN, for large N the dominant costs at each timestep are usually the constraint Jacobian (c/u) evaluation (with reverse-mode automatic differentiation this can be evaluated at a cost of O(N) generator / constraint evaluations) and Cholesky decomposition of the Jacobian product (c/u)(c/u) with O(N³) cost (though in many cases e.g. i.i.d. or Markovian simulated data, structure in the generator Jacobian can be exploited to give a significantly reduced cost). Each inner Quasi-Newton update involves a pair of triangular solve operations which have a O(N²) cost, two matrix-vector multiplications with O(MN) cost, and a single constraint / generator function evaluation; the number of Quasi-Newton updates required for convergence in the numerical experiments tended to be much less than N hence the Quasi-Newton iteration tended not to be the main cost.

The high computation cost per update is traded off however with often being able to make much larger proposed moves in high-dimensional state spaces with a high chance of acceptance compared to ABC MCMC approaches. Even in the relatively small Lotka-Volterra example we provide which has an input dimension of 104 (four inputs which map to ‘parameters’, and 100 inputs which map to ‘noise’ variables), the ABC MCMC chains using the coarse ABC kernel radius ϵ=100 with comparably very cheap updates were significantly less efficient in terms of effective sample size / computation time than the proposed constrained HMC approach. This was in large part due to the elliptical slice sampling updates in the ABC MCMC chains generally collapsing down to very small moves even for this relatively coarse ϵ. Performance was even worse using non-adaptive ABC MCMC methods and for smaller ϵ, and for higher input dimensions (e.g. using a longer sequence with correspondingly more random inputs) the comparison becomes even more favourable for the constrained HMC approach. Continue reading

Posted in Books, R, Statistics, University life with tags , , , , , , , , , , , , , on October 24, 2011 by xi'an

Bill Bolstad wrote a reply to my review of his book Understanding computational Bayesian statistics last week and here it is, unedited except for the first paragraph where he thanks me for the opportunity to respond, “so readers will see that the book has some good features beyond having a “nice cover”.” (!) I simply processed the Word document into an html output and put a Read More bar in the middle as it is fairly detailed. (As indicated at the beginning of my review, I am obviously biased on the topic: thus, I will not comment on the reply, lest we get into an infinite regress!)

The target audience for this book are upper division undergraduate students and first year graduate students in statistics whose prior statistical education has been mostly frequentist based. Many will have knowledge of Bayesian statistics at an introductory level similar to that in my first book, but some will have no previous Bayesian statistics course. Being self-contained, it will also be suitable for statistical practitioners without a background in Bayesian statistics.

The book aims to show that:

1. Bayesian statistics makes different assumptions from frequentist statistics, and these differences lead to the advantages of the Bayesian approach.
2. Finding the proportional posterior is easy, however finding the exact posterior distribution is difficult in practice, even numerically, especially for models with many parameters.
3. Inferences can be based on a (random) sample from the posterior.
4. There are methods for drawing samples from the incompletely known posterior.
5. Direct reshaping methods become inefficient for models with large number of parameters.
6. We can find a Markov chain that has the long-run distribution with the same shape as the posterior. A draw from this chain after it has run a long time can be considered a random draw from the posterior
7. We have many choices in setting up a Markov chain Monte Carlo. The book shows the things that should be considered, and how problems can be detected from sample output from the chain.
8. An independent Metropolis-Hastings chain with a suitable heavy-tailed candidate distribution will perform well, particularly for regression type models. The book shows all the details needed to set up such a chain.
9. The Gibbs sampling algorithm is especially well suited for hierarchical models.

I am satisfied that the book has achieved the goals that I set out above. The title “Understanding Computational Bayesian Statistics” explains what this book is about. I want the reader (who has background in frequentist statistics) to understand how computational Bayesian statistics can be applied to models he/she is familiar with. I keep an up-to-date errata on the book website..The website also contains the computer software used in the book. This includes Minitab macros and R-functions. These were used because because they had good data analysis capabilities that could be used in conjunction with the simulations. The website also contains Fortran executables that are much faster for models containing more parameters, and WinBUGS code for the examples in the book. Continue reading