## gradient importance sampling

**I**ngmar Schuster, who visited Paris-Dauphine last Spring (and is soon to return here as a postdoc funded by the Fondation des Sciences Mathématiques de Paris), arXived last week a paper on gradient importance sampling. In this paper, he builds a sequential importance sampling (or population Monte Carlo) algorithm that exploits the additional information contained in the gradient of the target. The proposal or importance function is essentially a MALA move, mixed across the elements of the previous population. When compared with our original PMC mixture of random walk proposals found in e.g. this paper, each term in the mixture thus involves an extra gradient, with a scale factor that decreases to zero as *1/t√t*.

Ingmar compares his proposal with an adaptive Metropolis, an adaptive MALTA, and an HM algorithm, for two mixture distributions and the banana target of Haario et al. (1999) we also used in our paper, as well as a logistic regression. In each case, he finds both a smaller squared error and a smaller bias for the same computing time (evaluated as the number of likelihood evaluations). While we discussed this scheme when he visited, I remain intrigued as to why it works so well when compared with the other solutions. One possible explanation is that the use of the gradient drift is more efficient on a population of particles than on a single Markov chain, provided the population covers all modes of importance on the target surface: the “fatal” attraction of the local mode is then much less of an issue…
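As a rough illustration of the idea (and only that: this is not Ingmar's exact algorithm — the resampling step, the equal-weight mixture density, the 1/√t step schedule, and all defaults below are my own illustrative assumptions), a population importance sampler with gradient-drifted proposals might look like:

```python
import numpy as np

def gradient_is(log_target, grad_log_target, n_particles=200, n_iter=10,
                dim=2, seed=0):
    """Sketch of population importance sampling with MALA-style drifted
    proposals mixed over the previous population. Step schedule and
    defaults are illustrative assumptions, not the paper's choices."""
    rng = np.random.default_rng(seed)
    particles = rng.normal(size=(n_particles, dim))   # initial population
    weights = np.full(n_particles, 1.0 / n_particles)
    for t in range(1, n_iter + 1):
        step = 1.0 / np.sqrt(t)                       # decaying proposal scale
        # resample ancestors from the current weighted population
        anc = particles[rng.choice(n_particles, size=n_particles, p=weights)]
        # Langevin-type drift: ancestor plus half-step gradient move
        means = anc + 0.5 * step**2 * np.array([grad_log_target(x) for x in anc])
        new = means + step * rng.normal(size=(n_particles, dim))
        # importance density: equal-weight Gaussian mixture over all ancestors
        diffs = new[:, None, :] - means[None, :, :]   # shape (n, n, dim)
        log_kern = (-0.5 * np.sum(diffs**2, axis=-1) / step**2
                    - dim * np.log(step) - 0.5 * dim * np.log(2 * np.pi))
        log_q = np.logaddexp.reduce(log_kern, axis=1) - np.log(n_particles)
        log_w = np.array([log_target(y) for y in new]) - log_q
        log_w -= log_w.max()                          # stabilise before exp
        weights = np.exp(log_w)
        weights /= weights.sum()
        particles = new
    return particles, weights
```

The point of the construction is visible in the last few lines: each particle is pulled toward local mass by its own gradient drift, while weighting against the full mixture density keeps the importance weights valid.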

August 7, 2015 at 10:06 am

Indeed, I am quite surprised myself, especially considering the magnitude of improvement for some targets. Before I resubmit this, I will have to triple-check it.

Also, in the meantime I think that the main improvement over PMC/SMC samplers results from the adaptation of the covariance matrix for the random walk proposal, not so much from the use of gradient information (in the PMC paper you link, the adaptation was achieved by optimising the mixture weights of a fixed set of proposals). The importance of adapting the covariance matrix is also evident when comparing Adaptive Metropolis (AM, no gradient) with Atchadé's adaptive MALTA (essentially AM with gradient information): if anything, adding gradient information often made things worse, as it can add bias.
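The covariance-adaptation point can be made concrete with a Haario-style Adaptive Metropolis sketch (the standard recipe with the 2.38²/d scaling and an ε-regularised running covariance; function name and defaults are my own, not taken from the paper under discussion):

```python
import numpy as np

def adaptive_metropolis(log_target, x0, n_iter=5000, eps=1e-6, seed=0):
    """Sketch of Adaptive Metropolis: random-walk proposals whose covariance
    tracks the (approximate) empirical covariance of the chain so far,
    scaled by 2.38^2/d, plus eps*I for numerical stability."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    d = len(x)
    lp = log_target(x)
    mean = x.copy()                 # running mean of the chain
    cov = np.eye(d)                 # running (approximate) covariance
    sd = 2.38 ** 2 / d              # classical AM scaling factor
    chain = [x.copy()]
    for t in range(1, n_iter + 1):
        prop_cov = sd * cov + eps * np.eye(d)
        y = x + rng.multivariate_normal(np.zeros(d), prop_cov)
        lp_y = log_target(y)
        if np.log(rng.uniform()) < lp_y - lp:   # Metropolis accept/reject
            x, lp = y, lp_y
        chain.append(x.copy())
        # online update of running mean and covariance (Welford-style)
        delta = x - mean
        mean += delta / (t + 1)
        cov += (np.outer(delta, x - mean) - cov) / (t + 1)
    return np.array(chain)
```

Note that no gradient appears anywhere: the only adaptation is the covariance update in the last two lines, which is the ingredient the comment credits for most of the gain.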

The improvement over MCMC, of course, might very well be due to the lack of fatal attraction to the local mode.