## Archive for reproducing kernel Hilbert space

Posted in Books, Statistics, University life with tags , , , , , , , , , , on July 3, 2015 by xi'an

Heiko Strathmann, Dino Sejdinovic, Samuel Livingstone, Zoltán Szabó, and Arthur Gretton arXived paper about Kamiltonian MCMC generated comments from Michael Betancourt, Dan Simpson and myself, which themselves induced the following reply by Heiko, detailed enough to deserve a post of its own.

We certainly agree that the naive approach of using a non-parametric kernel density estimator on the chain history (as in [Christian’s book, Example 8.8]) as a *proposal* fails spectacularly on simple examples: the probability of proposing in unexplored regions is extremely small, independent of the current position of the MCMC trajectory. This is not what we do though. Instead, we use the gradient of a density estimator, and not the density itself, for our HMC proposal. Just like KAMH, KMC lite in fact falls back to Random Walk Metropolis in previously unexplored regions and therefore inherits geometric ergodicity properties. This in particular includes the ability to explore previously “unseen” regions, even if adaptation has stopped. I implemented a simple illustration and comparison here.

ABC example.
The main point of the ABC example, is that our method does not suffer from the additional bias from Gaussian synthetic likelihoods when being confronted with skewed models. But there is also a computational efficiency aspect. The scheme by Meeds et al. relies on finite differences and requires $2D$ simulations from the likelihood *every time* the gradient is evaluated (i.e. every leapfrog iteration) and H-ABC discards this valuable information subsequently. In contrast, KMC accumulates gradient information from simulations: it only requires to simulate from the likelihood *once* in the accept/reject step after the leapfrog integration (where gradients are available in closed form). The density is only updated then, and not during the leapfrog integration. Similar work on speeding up HMC via energy surrogates can be applied in the tall data scenario.

Approximating HMC when gradients aren’t available is in general a difficult problem. One approach (like surrogate models) may work well in some scenarios while a different approach (i.e. Monte Carlo) may work better in others, and the ABC example showcases such a case. We very much doubt that one size will fit all — but rather claim that it is of interest to find and document these scenarios.
Michael raised the concern that intractable gradients in the Pseudo-Marginal case can be avoided by running an MCMC chain on the joint space (e.g. $(f,\theta)$ for the GP classifier). To us, however, the situation is not that clear. In many cases, the correlations between variables can cause convergence problems (see e.g. here) for the MCMC and have to be addressed by de-correlation schemes (as here), or e.g. by incorporating geometric information, which also needs fixes as Michaels’s very own one. Which is the method of choice with a particular statistical problem at hand? Which method gives the smallest estimation error (if that is the goal?) for a given problem? Estimation error per time? A thorough comparison of these different classes of algorithms in terms of performance related to problem class would help here. Most papers (including ours) only show experiments favouring their own method.

GP estimator quality.
Finally, to address Michael’s point on the consistency of the GP estimator of the density gradient: this is discussed In the original paper on the infinite dimensional exponential family. As Michael points out, higher dimensional problems are unavoidably harder, however the specific details are rather involved. First, in terms of theory: both the well-specified case (when the natural parameter is in the RKHS, Section 4), and the ill-specified case (the natural parameter is in a “reasonable”, larger class of functions, Section 5), the estimate is consistent. Consistency is obtained in various metrics, including the L² error on gradients. The rates depend on how smooth the natural parameter is (and indeed a poor choice of hyper-parameter will mean slower convergence). The key point, in regards to Michael’s question, is that the smoothness requirement becomes more restrictive as the dimension increases: see Section 4.2, “range space assumption”.
Second, in terms of practice: we have found in experiments that the infinite dimensional exponential family does perform considerably better than a kernel density estimator when the dimension increases (Section 6). In other words, our density estimator can take advantage of smoothness properties of the “true” target density to get good convergence rates. As a practical strategy for hyper-parameter choice, we cross-validate, which works well empirically despite being distasteful to Bayesians. Experiments in the KMC paper also indicate that we can scale these estimators up to dimensions in the 100s on Laptop computers (unlike most other gradient estimation techniques in HMC, e.g. the ones in your HMC & sub-sampling note, or the finite differences in Meeds et al).

## Kamiltonian Monte Carlo [no typo]

Posted in Books, Statistics, University life with tags , , , , , , , , , , on June 29, 2015 by xi'an

Heiko Strathmann, Dino Sejdinovic, Samuel Livingstone, Zoltán Szabó, and Arthur Gretton arXived a paper last week about Kamiltonian MCMC, the K being related with RKHS. (RKHS as in another KAMH paper for adaptive Metropolis-Hastings by essentially the same authors, plus Maria Lomeli and Christophe Andrieu. And another paper by some of the authors on density estimation via infinite exponential family models.) The goal here is to bypass the computation of the derivatives in the moves of the Hamiltonian MCMC algorithm by using a kernel surrogate. While the genuine RKHS approach operates within an infinite exponential family model, two versions are proposed, KMC lite with an increasing sequence of RKHS subspaces, and KMC finite, with a finite dimensional space. In practice, this means using a leapfrog integrator with a different potential function, hence with a different dynamics.

The estimation of the infinite exponential family model is somewhat of an issue, as it is estimated from the past history of the Markov chain, simplified into a random subsample from this history [presumably without replacement, meaning the Markovian structure is lost on the subsample]. This is puzzling because there is dependence on the whole past, which cancels ergodicity guarantees… For instance, we gave an illustration in Introducing Monte Carlo Methods with R [Chapter 8] of the poor impact of approximating the target by non-parametric kernel estimates. I would thus lean towards the requirement of a secondary Markov chain to build this kernel estimate. The authors are obviously aware of this difficulty and advocate an attenuation scheme. There is also the issue of the cost of a kernel estimate, in O(n³) for a subsample of size n. If, instead, a fixed dimension m for the RKHS is selected, the cost is in O(tm²+m³), with the advantage of a feasible on-line update, making it an O(m³) cost in fine. But again the worry of using the whole past of the Markov chain to set its future path…

Among the experiments, a KMC for ABC that follows the recent proposal of Hamiltonian ABC by Meeds et al. The arguments  are interesting albeit sketchy: KMC-ABC does not require simulations at each leapfrog step, is it because the kernel approximation does not get updated at each step? Puzzling.

I also discussed the paper with Michael Betancourt (Warwick) and here his comments:

“I’m hesitant for the same reason I’ve been hesitant about algorithms like Bayesian quadrature and GP emulators in general. Outside of a few dimensions I’m not convinced that GP priors have enough regularization to really specify the interpolation between the available samples, so any algorithm that uses a single interpolation will be fundamentally limited (as I believe is born out in non-trivial scaling examples) and trying to marginalize over interpolations will be too awkward.

They’re really using kernel methods to model the target density which then gives the gradient analytically. RKHS/kernel methods/ Gaussian processes are all the same math — they’re putting prior measures over functions. My hesitancy is that these measures are at once more diffuse than people think (there are lots of functions satisfying a given smoothness criterion) and more rigid than people think (perturb any of the smoothness hyper-parameters and you get an entirely new space of functions).

When using these methods as an emulator you have to set the values of the hyper-parameters which locks in a very singular definition of smoothness and neglects all others. But even within this singular definition there are a huge number of possible functions. So when you only have a few points to constrain the emulation surface, how accurate can you expect the emulator to be between the points?

In most cases where the gradient is unavailable it’s either because (a) people are using decades-old Fortran black boxes that no one understands, in which case there are bigger problems than trying to improve statistical methods or (b) there’s a marginalization, in which case the gradients are given by integrals which can be approximated with more MCMC. Lots of options.”

## Quasi-Monte Carlo sampling

Posted in Books, Kids, Statistics, Travel, University life, Wines with tags , , , , , , , , , , , , on December 10, 2014 by xi'an

“The QMC algorithm forces us to write any simulation as an explicit function of uniform samples.” (p.8)

As posted a few days ago, Mathieu Gerber and Nicolas Chopin will read this afternoon a Paper to the Royal Statistical Society on their sequential quasi-Monte Carlo sampling paper.  Here are some comments on the paper that are preliminaries to my written discussion (to be sent before the slightly awkward deadline of Jan 2, 2015).

Quasi-Monte Carlo methods are definitely not popular within the (mainstream) statistical community, despite regular attempts by respected researchers like Art Owen and Pierre L’Écuyer to induce more use of those methods. It is thus to be hoped that the current attempt will be more successful, it being Read to the Royal Statistical Society being a major step towards a wide diffusion. I am looking forward to the collection of discussions that will result from the incoming afternoon (and bemoan once again having to miss it!).

“It is also the resampling step that makes the introduction of QMC into SMC sampling non-trivial.” (p.3)

At a mathematical level, the fact that randomised low discrepancy sequences produce both unbiased estimators and error rates of order

$\mathfrak{O}(N^{-1}\log(N)^{d-}) \text{ at cost } \mathfrak{O}(N\log(N))$

means that randomised quasi-Monte Carlo methods should always be used, instead of regular Monte Carlo methods! So why is it not always used?! The difficulty stands [I think] in expressing the Monte Carlo estimators in terms of a deterministic function of a fixed number of uniforms (and possibly of past simulated values). At least this is why I never attempted at crossing the Rubicon into the quasi-Monte Carlo realm… And maybe also why the step had to appear in connection with particle filters, which can be seen as dynamic importance sampling methods and hence enjoy a local iid-ness that relates better to quasi-Monte Carlo integrators than single-chain MCMC algorithms.  For instance, each resampling step in a particle filter consists in a repeated multinomial generation, hence should have been turned into quasi-Monte Carlo ages ago. (However, rather than the basic solution drafted in Table 2, lower variance solutions like systematic and residual sampling have been proposed in the particle literature and I wonder if any of these is a special form of quasi-Monte Carlo.) In the present setting, the authors move further and apply quasi-Monte Carlo to the particles themselves. However, they still assume the deterministic transform

$\mathbf{x}_t^n = \Gamma_t(\mathbf{x}_{t-1}^n,\mathbf{u}_{t}^n)$

which the q-block on which I stumbled each time I contemplated quasi-Monte Carlo… So the fundamental difficulty with the whole proposal is that the generation from the Markov proposal

$m_t(\tilde{\mathbf{x}}_{t-1}^n,\cdot)$

has to be of the above form. Is the strength of this assumption discussed anywhere in the paper? All baseline distributions there are normal. And in the case it does not easily apply, what would the gain bw in only using the second step (i.e., quasi-Monte Carlo-ing the multinomial simulation from the empirical cdf)? In a sequential setting with unknown parameters θ, the transform is modified each time θ is modified and I wonder at the impact on computing cost if the inverse cdf is not available analytically. And I presume simulating the θ’s cannot benefit from quasi-Monte Carlo improvements.

The paper obviously cannot get into every detail, obviously, but I would also welcome indications on the cost of deriving the Hilbert curve, in particular in connection with the dimension d as it has to separate all of the N particles, and on the stopping rule on m that means only Hm is used.

Another question stands with the multiplicity of low discrepancy sequences and their impact on the overall convergence. If Art Owen’s (1997) nested scrambling leads to the best rate, as implied by Theorem 7, why should we ever consider another choice?

In connection with Lemma 1 and the sequential quasi-Monte Carlo approximation of the evidence, I wonder at any possible Rao-Blackwellisation using all proposed moves rather than only those accepted. I mean, from a quasi-Monte Carlo viewpoint, is Rao-Blackwellisation easier and is it of any significant interest?

What are the computing costs and gains for forward and backward sampling? They are not discussed there. I also fail to understand the trick at the end of 4.2.1, using SQMC on a single vector instead of (t+1) of them. Again assuming inverse cdfs are available? Any connection with the Polson et al.’s particle learning literature?

Last questions: what is the (learning) effort for lazy me to move to SQMC? Any hope of stepping outside particle filtering?

## Bayesian computation with empirical likelihood and no A

Posted in Statistics, University life with tags , , , , , , , , on December 7, 2012 by xi'an

We just resubmitted our paper to PNAS about using empirical likelihood for conducting Bayesian computation. Although this is an approximation as well, we removed the A (for approximation) from the title and from the name of the method, BCel, to comply with a referee’s request and also account for several comments during our seminars that this was not ABC! We can see the point in those comments, namely that ABC is understood as a corpus of methods that rely on the simulation of pseudo-datasets to compensate for the missing likelihood, while empirical likelihood stands as another route bypassing this difficulty… I keep my fingers crossed that this ultimate revision is convincing enough for the PNAS board!

Coincidentally, Jean-Pierre Florens came to give a (Malinvaud) seminar at CREST today about semi-parametric Bayesian modelling, mixing Gaussian process priors with generalised moment conditions. This was a fairly involved talk with a lot of technical details about RKHS spaces and a mix of asymptotics and conjugate priors (somewhat empirical Bayesianish in spirit!) In a sense, it was puzzling because the unknown distribution was modelled conditional on an unknown parameter, θ, which itself was a function of this distribution. It was however quite interesting in that it managed to mix Gaussian process priors with some sort of empirical likelihood (or GMM). Furthermore, in a sort of antithesis to our approach with empirical likelihood, Florens and Simoni had a plethora of moment restrictions they called over-identification and used this feature to improve the estimation of the underlying density. There were also connections with Fukumizu et al. kernel Bayes’ rule perspective, even though I am not clear about the later. I also got lost here by the representation of the data as a point in an Hilbert space, thanks to a convolution step. (The examples involved orthogonal polynomials like Lagrange’s or Hermitte’s, which made sense as the data was back to a finite dimension!) Once again, the most puzzling thing is certainly  over-identification: in an empirical likelihood version, it would degrade the quality of the approximation by peaking more and more the approximation. It does not appear to cause such worries in Florens’ and Simoni’s perspective.

## kernel approximate Bayesian computation for population genetic inferences

Posted in Statistics, University life with tags , , , , on May 22, 2012 by xi'an

A new posting about ABC on arXiv by Shigeki Nakagome, Kenji Fukumizu, and Shuhei Mano entitled kernel approximate Bayesian computation for population genetic inferences argues about an improvement brought by the use of reproducing kernel Hilbert space (RKHS) perspective in ABC methodology, when compared with more standard ABC relying on a rather arbitrary choice of summary statistics and metric. However, I feel that the paper does not substantially defend this point, only using a simulation experiment to compare mean square errors. In particular, the claim of consistency is unsubstantiated, as is the counterpoint that “conventional ABC did not have consistency” (page 14) [and several papers, including the just published Read Paper by Fearnhead and Prangle, claim the opposite]. Furthermore, a considerable amount of space is taken in the paper by the description of the existing ABC algorithms, while the complete version of the new kernel ABC-RKHS algorithm is missing. In particular, the coverage of kernel Bayes is too sketchy to be comprehensible [at least to me] without additional study. Actually, I do not get the notion of kernel Bayes’ rule, which seems defined only in terms of expectations

$\mathbb{E}[f(\theta)|s]=\sum_i w_i f(\theta_i),$

where the weights are the ridge-like matrix

$w_i=\sum_j (\mathbf{G}_S + n\epsilon_n \mathbf{I}_n)^{-1}_{ij}k(s_i,s_j)$

where the parameter is generated from the prior, the data s is generated from the sampling distribution, and the matrix GS is made of the k(si,sj)‘s. The surrounding Hilbert space presentation does not seem particularly relevant, esp. in population genetics… I am also under the impression that the choice of the kernel function k(.,.) is as important as the choice of the metric in regular ABC, although this is not discussed in the paper, since it implies [among other things] the choice of a metric. The implementation uses a Gaussian kernel and an Euclidean metric, which involves assumptions on the homogeneous nature of the components of the summary statistics or of the data. Similarly, the “regularization” parameter εn needs to be calibrated and the paper is unclear about this, apparently picking the parameter that “showed the smallest MSEs” (page 10), which cannot be called a calibration. (There is a rather unimportant proposition about concentration of information on page 6 which proof relies on two densities being ordered, see top of page 7.)