## optimal importance sampling

**A**n arXiv file that sat for quite a while in my to-read pile is Variance reduction in SGD by distributed importance sampling by Alain et al. I had to wait for the flight to Zürich and MCMskv to get a look at it. The part of the paper that is of primary interest to me is the generalisation of the optimal importance function result

q⁰(x)∞f(x)|h(x)|

to higher dimensions. Namely, what is the best importance function for approximating the expectation of h(X) when h is multidimensional? There does exist an optimal solution when the score function is the trace of the variance matrix. Where the solution is proportional to the target density times the norm of the target integrand

q⁰(x)∞f(x)||h(x)||

The application of the result to neural networks and stochastic gradients using minibatches of the training set somehow escapes me, even though the asynchronous aspects remind me of the recent asynchronous Gibbs sampler of Terenin, Draper, and Simpson.

While the optimality obtained in the paper is mathematically clear, I am a wee bit surprised at the approach: the lack of normalising constant in the optimum means using a reweighted approximation that drifts away from the optimal score. Furthermore, this optimum is sub-optimal when compared with the component wise optimum which produces a variance of zero (if we assume the normalising constant to be available). Obviously, using the component-wise optima requires to run as many simulations as there are components in the integrand, but since cost does not seem to be central to this study…

## Leave a Reply