continuous herded Gibbs sampling

Posted in Books, pictures, Statistics with tags , , , , , , , , on June 28, 2021 by xi'an

Read a short paper by Laura Wolf and Marcus Baum on Gibbs herding, where herding is a technique of “deterministic sampling”, for instance selecting points over the support of the distribution by matching exact and empirical (or “empirical”!) moments. Which reminds me of the principal points devised by my late friend Bernhard Flury. With an unclear argument as to why it could take over random sampling:

“random numbers are often generated by pseudo-random number generators, hence are not truly random”

Especially since the aim is to “draw samples from continuous multivariate probability densities.” The sequential construction of such a sample proceeds sequentially by adding a new (T+1)-th point to the existing sample of y’s by maximising in x the discrepancy

$(T+1)\mathbb E^Y[k(x,Y)]-\sum_{t=1}^T k(x,y_t)$

where k(·,·) is a kernel, e.g. a Gaussian density. Hence a complexity that grows as O(T). The current paper suggests using Gibbs “sampling” to update one component of x at a time. Using the conditional version of the above discrepancy. Making the complexity grow as O(dT) in d dimensions.

I remain puzzled by the whole thing as these samples cannot be used as regular random or quasi-random samples. And in particular do not produce unbiased estimators of anything. Obviously. The production of such samples being furthermore computationally costly it is also unclear to me that they could even be used for quick & dirty approximations of a target sample.

control functionals for Monte Carlo integration

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , , on June 28, 2016 by xi'an

A paper on control variates by Chris Oates, Mark Girolami (Warwick) and Nicolas Chopin (CREST) appeared in a recent issue of Series B. I had read and discussed the paper with them previously and the following is a set of comments I wrote at some stage, to be taken with enough gains of salt since Chris, Mark and Nicolas answered them either orally or in the paper. Note also that I already discussed an earlier version, with comments that are not necessarily coherent with the following ones! [Thanks to the busy softshop this week, I resorted to publish some older drafts, so mileage can vary in the coming days.]

First, it took me quite a while to get over the paper, mostly because I have never worked with reproducible kernel Hilbert spaces (RKHS) before. I looked at some proofs in the appendix and at the whole paper but could not spot anything amiss. It is obviously a major step to uncover a manageable method with a rate that is lower than √n. When I set my PhD student Anne Philippe on the approach via Riemann sums, we were quickly hindered by the dimension issue and could not find a way out. In the first versions of the nested sampling approach, John Skilling had also thought he could get higher convergence rates before realising the Monte Carlo error had not disappeared and hence was keeping the rate at the same √n speed.

The core proof in the paper leading to the 7/12 convergence rate relies on a mathematical result of Sun and Wu (2009) that a certain rate of regularisation of the function of interest leads to an average variance of order 1/6. I have no reason to mistrust the result (and anyway did not check the original paper), but I am still puzzled by the fact that it almost immediately leads to the control variate estimator having a smaller order variance (or at least variability). On average or in probability. (I am also uncertain on the possibility to interpret the boxplot figures as establishing super-√n speed.)

Another thing I cannot truly grasp is how the control functional estimator of (7) can be both a mere linear recombination of individual unbiased estimators of the target expectation and an improvement in the variance rate. I acknowledge that the coefficients of the matrices are functions of the sample simulated from the target density but still…

Another source of inner puzzlement is the choice of the kernel in the paper, which seems too simple to be able to cover all problems despite being used in every illustration there. I see the kernel as centred at zero, which means a central location must be know, decreasing to zero away from this centre, so possibly missing aspects of the integrand that are too far away, and isotonic in the reference norm, which also seems to preclude some settings where the integrand is not that compatible with the geometry.

I am equally nonplussed by the existence of a deterministic bound on the error, although it is not completely deterministic, depending on the values of the reproducible kernel at the points of the sample. Does it imply anything restrictive on the function to be integrated?

A side remark about the use of intractable in the paper is that, given the development of a whole new branch of computational statistics handling likelihoods that cannot be computed at all, intractable should possibly be reserved for such higher complexity models.

projection predictive input variable selection

Posted in Books, Statistics, University life with tags , , , , , , on November 2, 2015 by xi'an

Juho Piironen and Aki Vehtari just arXived a paper on variable selection that relates to two projection papers we wrote in the 1990’s with Costas Goutis (who died near Seattle in a diving accident on July 1996) and Jérôme Dupuis… Except that they move to the functional space of Gaussian processes. The covariance function in a Gaussian process is indeed based on a distance between observations, which are themselves defined as a vector of inputs. Some of which matter and some of which do not matter in the kernel value. When rescaling the distance with “length-scales” for all variables, one could think that non-significant variates have very small scales and hence bypass the need for variable selection but this is not the case as those coefficients react poorly to non-linearities in the variates… The paper thus builds a projective structure from a reference model involving all input variables.

“…adding some irrelevant inputs is not disastrous if the model contains a sparsifying prior structure, and therefore, one can expect to lose less by using all the inputs than by trying to differentiate between the relevant and irrelevant ones and ignoring the uncertainty related to the left-out inputs.”

While I of course appreciate this avatar to our original idea (with some borrowing from McCulloch and Rossi, 1992), the paper reminds me of some of the discussions and doubts we had about the role of the reference or super model that “anchors” the projections, as there is no reason for that reference model to be a better one. It could be that an iterative process where the selected submodel becomes the reference for the next iteration could enjoy better performances. When I first presented this work in Cagliari, in the late 1990s, one comment was that the method had no theoretical guarantee like consistency. Which is correct if the minimum distance is not evolving (how quickly?!) with the sample size n. I also remember the difficulty Jérôme and I had in figuring out a manageable forward-backward exploration of the (huge) set of acceptable subsets of variables. Random walk exploration and RJMCMC are unlikely to solve this problem.