*Q. Why not embed discrete parameters so that the resulting surrogate density function is smooth?*

A. This is only possible in very special settings. Let’s say we have a target distribution π(θ, n), where θ is continuous and ‘n’ is discrete. To construct a surrogate smooth density, we would need to somehow smoothly interpolate a collection of functions f_{n}(θ) = π(θ, n) for n = 1, 2, …. It is not clear to us how we can achieve this in a general and tractable way.

*Q. How to generalize the algorithm to a more complex parameter space?*

A. We provide a clear solution to dealing with a discontinuous target density defined on a continuous parameter space. We agree, however, that there remains the question of whether and how a more complex parameter space can be embedded into a continuous space. This certainly deserves a further investigation. For example, a binary tree can be embedded in to an interval [0,1] through a dyadic expansion of a real number.

*Q. Physical intuition of discontinuous Hamiltonian dynamics is not clear from a theory of differential measure-valued equation and selection principle.*

A. Hamiltonian dynamics with a discontinuous potential energy has long been used by physicists as a natural model for some physical phenomena (also known as “impulsive systems”). The main difference from a smooth system is that a gradient become a “delta function” at the discontinuity, causing an instantaneous “push” toward the direction of lower potential energy. A theory of differential measure-valued equation / inclusion and selection principle is only a mathematical formalization of such physical systems.

*Q. (A special case of) DHMC looks like taking multiple Gibbs steps?*

A. The crucial difference from Metropolis-within-Gibbs is the presence of momentum in DHMC, which helps guide a Markov chain toward a high density region.

The effect of momentum is evident in the Jolly-Seber example of Section 5.1, where DHMC shows 60-fold efficiency improvement over a sampler “NUTS-Gibbs” based on conditional updates. Also, a direct comparison of DHMC and Metropolis-within-Gibbs can be found in Section S4.1 where DHMC, thanks to the momentum, is about 7 times more efficient than Metropolis-within-Gibbs (with optimal proposal variances).

*Q. Unlike HMC, DHMC does not seem to use structural information about the parameter space and local information about the target density?*

A. It does. After all, other than the use of Laplace momentum and discontinuity in the target density, DHMC is based on the same principle as HMC — simulating Hamiltonian dynamics to generate a proposal.

The confusion is perhaps due to the fact that the coordinate-wise integrator of DHMC does not require gradients. The gradient of the log density — which may be a “delta” function at discontinuities — plays a clear role if you look at Hamilton’s equations Eq (10) corresponding to a Laplace momentum. It’s just that, thanks to a property of a Laplace momentum and conservation of energy principle, we can approximate the exact dynamics without ever computing the gradient. This is in fact a remarkable property of a Laplace momentum and our coordinate-wise integrator.