## is there such a thing as optimal subsampling? [comments from the authors]

I do wonder at the use of the most rudimentary representation of an approximation to the target when smoother versions could have been chosen and optimised on the same ground.

This is a fair comment that applies equally to MCMC – why seek a discrete approximation to the target in a situation where the target is known to have a continuous pdf? In our case, at least, it is helpful to consider discrete approximations because this enables the Stein discrepancy between the approximation and the target to be exactly computed.
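To make the "exactly computed" point concrete, here is a minimal sketch (our illustration, not the paper's implementation) of a kernel Stein discrepancy for a discrete point set, assuming an inverse multiquadric base kernel and a known score function `score(x) = ∇ log p(x)`; the function names and defaults are our own:

```python
import numpy as np

def imq_stein_kernel(X, score, c=1.0, beta=-0.5):
    """Stein kernel k_p(x, y) built from the IMQ base kernel
    k(x, y) = (c^2 + ||x - y||^2)^beta, on all pairs of rows of X.
    `score` maps an (n, d) array to the (n, d) array of grad log p."""
    n, d = X.shape
    S = score(X)                          # score at each sample
    diff = X[:, None, :] - X[None, :, :]  # pairwise x - y, shape (n, n, d)
    r2 = np.sum(diff**2, axis=-1)         # squared pairwise distances
    base = c**2 + r2
    k = base**beta
    # grad_x k = 2*beta*base^(beta-1)*(x - y); grad_y k is its negative
    gx = 2 * beta * base[..., None]**(beta - 1) * diff
    # divergence term div_x div_y k(x, y)
    trace = -2 * beta * (2 * (beta - 1) * base**(beta - 2) * r2
                         + d * base**(beta - 1))
    kp = (trace
          + np.einsum('id,ijd->ij', S, -gx)  # s(x) . grad_y k
          + np.einsum('jd,ijd->ij', S, gx)   # s(y) . grad_x k
          + (S @ S.T) * k)                   # s(x) . s(y) k(x, y)
    return kp

def ksd(X, score, **kw):
    """Kernel Stein discrepancy of the empirical measure on X w.r.t. p."""
    kp = imq_stein_kernel(X, score, **kw)
    return np.sqrt(kp.mean())
```

The double sum over the Stein kernel matrix is all that is needed; no integral against the continuous target ever has to be evaluated, which is precisely why a discrete approximation is convenient here.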

And I am also surprised at the dependence of both estimators and discrepancies on the choice of the (sort-of) covariance matrix in the inner kernel, as shown by the ODE examples provided in the paper (see, e.g., Figure 7).

In some respects our requirement to pick a suitable whitening matrix in the kernel is the same problem faced in kernel density estimation, where one has to pick a suitable bandwidth. In that context, many (asymptotically equivalent) heuristics exist for bandwidth selection, and these typically lead to quite different approximations to the pdf when applied to a finite dataset. Our empirical results suggest that it is better to assume an isotropic kernel whose single length-scale parameter can be robustly estimated, than it is to assume a more general kernel whose parameters may not be so robustly estimated.
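For concreteness, one such heuristic is the median heuristic, which sets the single length-scale of an isotropic kernel to the median pairwise distance between samples; a minimal sketch (our illustration, with subsampling of pairs for large sample sizes):

```python
import numpy as np

def median_heuristic_lengthscale(X, max_pairs=10000, seed=0):
    """Length-scale for an isotropic kernel: the median pairwise
    distance between rows of X, estimated from random pairs."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    i = rng.integers(0, n, size=max_pairs)
    j = rng.integers(0, n, size=max_pairs)
    dists = np.linalg.norm(X[i] - X[j], axis=1)
    return np.median(dists[dists > 0])  # drop accidental i == j pairs
```

A single scalar like this can be estimated stably even from modest sample sizes, which is the robustness trade-off described above: one well-estimated parameter rather than a full covariance matrix that may be poorly estimated.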

As an aside and at a shallow level, the approach also reminded me of the principal points of my late friend Bernhard Flury…

Thank you for pointing us towards this interesting work. The general problem of approximating a distribution using representative points is sometimes called “quantisation”; there is a (rich) literature on optimal quantisation (including principal points, as a particular example). However, this literature assumes a known distribution is being approximated. MCMC can be viewed as a (sub-optimal) solution to the quantisation problem when the distribution is only partially characterised (e.g. by an un-normalised pdf). Our work aims to approximate an optimal solution to the quantisation problem by selecting an optimal subset from MCMC output.

I am unsure the method is anything but what-you-get-is-what-you-see, i.e. prone to get misled by a poor exploration of the complete support of the target.

It is certainly true that if the MCMC missed a region of the state space then we cannot adjust for that by sub-sampling from the MCMC output. However, we can remove samples from regions that have been “over-explored” and in doing so obtain a better approximation to the target. In fact, by greedily selecting which samples to include in our approximation, we produce a sequence that performs “mode hopping” – first a sample is selected from a region of high probability (RoHP), then a sample is selected from a different RoHP and so forth, cycling through all RoHP to ensure that the number of samples selected from each RoHP is balanced. (This happens automatically through the Stein discrepancy, we do not literally need to work out where these regions are, or what their relative probabilities are, to implement the algorithm.)
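For illustration only, the greedy step just described can be sketched as follows, assuming a precomputed Stein kernel matrix `kp[i, j] = k_p(x_i, x_j)` over the MCMC output (the interface is ours, not the paper's): at each step the point minimising the discrepancy of the enlarged subset is appended, and the objective decomposes into a diagonal term plus a running sum over the points already chosen.

```python
import numpy as np

def greedy_stein_thinning(kp, m):
    """Greedily select m indices from MCMC output, given the Stein
    kernel matrix kp; each step adds the point that minimises the
    squared discrepancy of the enlarged subset (repeats allowed)."""
    n = kp.shape[0]
    running = np.zeros(n)  # sum of kp[:, j] over selected indices j
    idx = []
    for _ in range(m):
        obj = 0.5 * np.diag(kp) + running
        j = int(np.argmin(obj))
        idx.append(j)
        running += kp[:, j]
    return idx
```

The mode-hopping behaviour falls out of this objective: once several points have been taken from one high-probability region, the running sum penalises further points there, so the next selection jumps to an under-represented region, without the algorithm ever identifying the regions explicitly.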
