Archive for summary statistics

at the Isaac Newton Institute [talks]

Posted in Statistics on July 7, 2017 by xi'an

Here are the slides I edited this week [from previous talks by Pierre and Espen] for the INI Workshop on scalable inference, in connection with our recently completed and submitted paper on ABC with Wasserstein distances:

MCM 2017

Posted in Statistics on July 3, 2017 by xi'an

And thus I am back in Montréal, for MCM 2017, located in HEC Montréal, on the campus of Université de Montréal, for three days. My talk is predictably about ABC, what else?!, gathering diverse threads from different talks and papers:

automated ABC summary combination

Posted in Books, pictures, Statistics, University life on March 16, 2017 by xi'an

Jonathan Harrison and Ruth Baker (Oxford University) arXived this morning a paper on the optimal combination of summaries for ABC in the sense of deriving the proper weights in a Euclidean distance involving all the available summaries. The idea is to find the weights that lead to the maximal distance between prior and posterior, in a way reminiscent of Bernardo’s (1979) maximal information principle. Plus a sparsity penalty à la Lasso. The associated algorithm is sequential in that the weights are updated at each iteration. The paper does not get into theoretical justifications but considers instead several examples with limited numbers of both parameters and summary statistics. Which may highlight the limitations of the approach in that handling (and eliminating) a large number of parameters may prove impossible this way, when compared with optimisation methods like random forests. Or summary-free distances between empirical distributions like the Wasserstein distance.
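To make concrete where such weights enter, here is a minimal Python sketch of ABC rejection with a weighted Euclidean distance on the summaries. It is not the algorithm of the paper (no weight optimisation, no sparsity penalty, no sequential update): the toy Normal model, the deliberately uninformative second summary, and the function names `weighted_distance` and `abc_rejection` are all assumptions made for illustration.

```python
import numpy as np

def weighted_distance(s_sim, s_obs, weights):
    """Weighted Euclidean distance between simulated and observed summaries."""
    diff = np.asarray(s_sim) - np.asarray(s_obs)
    return np.sqrt(np.sum(weights * diff**2, axis=-1))

def abc_rejection(s_obs, weights, n_sims=20000, keep=200, seed=0):
    """Toy ABC rejection: theta ~ N(0, 10^2), data are 50 draws from N(theta, 1).

    Assumed summaries: the sample mean (informative) and an independent
    N(0, 10^2) variable (pure noise, unrelated to theta).
    """
    rng = np.random.default_rng(seed)
    theta = rng.normal(0.0, 10.0, size=n_sims)
    data = rng.normal(theta[:, None], 1.0, size=(n_sims, 50))
    noise = rng.normal(0.0, 10.0, size=n_sims)
    summaries = np.column_stack([data.mean(axis=1), noise])
    dist = weighted_distance(summaries, s_obs, weights)
    # Keep the parameter values whose summaries fall closest to the observed ones
    return theta[np.argsort(dist)[:keep]]
```

Calling `abc_rejection(np.array([2.0, 0.0]), np.array([1.0, 0.0]))` and then the same with weights `np.array([1.0, 1.0])` shows the effect of the weighting: giving weight to the noise summary spreads out the accepted values of θ, which is the kind of loss a data-driven choice of weights aims to avoid.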

ABC with kernelised regression

Posted in Mountains, pictures, Statistics, Travel, University life on February 22, 2017 by xi'an

The exact title of the paper by Jovana Mitrovic, Dino Sejdinovic, and Yee Whye Teh is DR-ABC: Approximate Bayesian Computation with Kernel-Based Distribution Regression. It appeared last year in the proceedings of ICML. The idea is to build ABC summaries by way of reproducing kernel Hilbert spaces (RKHS). Regressing such embeddings to the “optimal” choice of summary statistics by kernel ridge regression. With a possibility to derive summary statistics for quantities of interest rather than for the entire parameter vector. The use of RKHS reminds me of Arthur Gretton’s approach to ABC, although I see no mention made of that work in the current paper.

In the RKHS pseudo-linear formulation, the prediction of a parameter value given a sample attached to this value looks like a ridge estimator in classical linear estimation. (I thus wonder why one would stop at the ridge stage instead of getting the full Bayes treatment!) Things get a bit more involved in the case of parameters (and observations) of interest, as the modelling requires two RKHS, because of the conditioning on the nuisance observations. Or rather three RKHS. Since those involve a maximum mean discrepancy between probability distributions, which define in turn a sort of intrinsic norm, I also wonder at a Wasserstein version of this approach.
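For readers who want to see the ridge analogy spelled out, here is a minimal sketch of kernel ridge regression of parameters on per-dataset feature vectors, with prediction k(x)ᵀ(K + λI)⁻¹θ. It is a generic textbook construction and not the DR-ABC implementation: the Gaussian kernel, the bandwidth and penalty defaults, and the names `gaussian_gram` and `kernel_ridge_fit` are assumptions for illustration.

```python
import numpy as np

def gaussian_gram(A, B, bandwidth=1.0):
    """Gram matrix of the Gaussian kernel between the rows of A and the rows of B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def kernel_ridge_fit(X, theta, lam=1e-3, bandwidth=1.0):
    """Fit kernel ridge regression of theta on X and return a prediction function.

    X holds one feature vector per simulated dataset (one row each),
    theta holds the parameter values that generated those datasets.
    """
    K = gaussian_gram(X, X, bandwidth)
    # Ridge solution in the dual: alpha = (K + lam I)^{-1} theta
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), theta)

    def predict(X_new):
        return gaussian_gram(np.atleast_2d(X_new), X, bandwidth) @ alpha

    return predict
```

The value returned by `predict` then plays the role of a regression-based summary statistic for a new dataset, the penalty `lam` being the only trace of a prior in this construction.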

What I find hard to understand in the paper is how a large-dimension large-size sample can be managed by such methods with no visible loss of information and no explosion of the computing budget. The authors mention Fourier features, which never ring a bell for me, but I wonder how this operates in a general setting, i.e., outside the iid case. The examples do not seem to go into enough detail for me to understand how this massive dimension reduction operates (and they remain at a moderate level in terms of numbers of parameters). I was hoping Jovana Mitrovic could present her work here at the 17w5025 workshop but she sadly could not make it to Banff for lack of funding!
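As a pointer on the Fourier features, here is a minimal sketch of the standard random Fourier feature construction for a Gaussian kernel, in the iid setting only. Whether the paper uses exactly this variant is an assumption on my part, as are the names `make_rff`, `mean_embedding` and `approx_mmd2`. The point is that a sample of any size is mapped to a vector of fixed length, the average of its feature maps, and that a squared maximum mean discrepancy is then approximated by a squared Euclidean distance between such vectors.

```python
import numpy as np

def make_rff(dim, n_features=500, bandwidth=1.0, seed=0):
    """Random Fourier feature map approximating exp(-||x - y||^2 / (2 bandwidth^2))."""
    rng = np.random.default_rng(seed)
    # Frequencies drawn from the spectral density of the Gaussian kernel
    W = rng.normal(0.0, 1.0 / bandwidth, size=(dim, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)

    def phi(x):
        x = np.atleast_2d(x)
        return np.sqrt(2.0 / n_features) * np.cos(x @ W + b)

    return phi

def mean_embedding(sample, phi):
    """Finite-dimensional approximation of the kernel mean embedding of a sample."""
    return phi(sample).mean(axis=0)

def approx_mmd2(sample_x, sample_y, phi):
    """Approximate (biased) squared MMD between two samples."""
    diff = mean_embedding(sample_x, phi) - mean_embedding(sample_y, phi)
    return float(diff @ diff)
```

The cost is linear in the sample size and in the number of features, which explains the controlled computing budget, but says nothing about the information lost by the embedding nor about dependent data.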

a well-hidden E step

Posted in Books, Kids, pictures, R, Statistics on February 3, 2017 by xi'an

A recent question on X validated ended up being quite interesting! The model under consideration is made of parallel Markov chains on a finite state space, all with the same Markov transition matrix, M, which turns into a hidden Markov model when the only summary available is the number of chains in a given state at a given time. When writing down the EM algorithm, the E step involves the expected number of moves from a given state to a given state at a given time. The conditional distribution of those numbers of chains is a product of multinomials across times and starting states, with no Markov structure since the number of chains starting from a given state is known at each instant. Except that those multinomials are constrained by the number of “arrivals” in each state at the next instant and that this makes the computation of the expectation intractable, as far as I can see.

A solution by Monte Carlo EM means running the moves for each instant under the above constraints, which is thus a sort of multinomial distribution with fixed margins, enjoying a closed-form expression but for the normalising constant. The direct simulation soon gets too costly as the number of states increases and I thus considered a basic Metropolis move, using one margin (row or column) or the other as proposal, with the correction taken on another margin. This is very basic but apparently enough for the purpose of the exercise. If I find time in the coming days, I will try to look at the ABC resolution of this problem, a logical move when starting from non-sufficient statistics!
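For one time step, the constrained distribution can be explored with a few lines of Python. The sketch below is not the row-or-column proposal described above but a swapped-in alternative, the classical ±1 swap on a 2×2 sub-table, which keeps both margins fixed by construction. The target is the product of multinomials restricted to the column sums, i.e., proportional to the product over (i,j) of M[i,j]^N[i,j] / N[i,j]!. It assumes that all entries of M are positive and that the row and column counts have the same total. The names `northwest_corner` and `expected_moves` are mine.

```python
import numpy as np

def northwest_corner(rows, cols):
    """Build one non-negative integer table with the given row and column sums."""
    rows, cols = list(rows), list(cols)
    table = np.zeros((len(rows), len(cols)), dtype=int)
    i = j = 0
    while i < len(rows) and j < len(cols):
        move = min(rows[i], cols[j])
        table[i, j] = move
        rows[i] -= move
        cols[j] -= move
        if rows[i] == 0:
            i += 1
        else:
            j += 1
    return table

def expected_moves(M, rows, cols, n_iter=20000, burn=2000, seed=0):
    """Metropolis estimate of E[N] for transition counts N with fixed margins.

    rows[i] is the number of chains in state i at time t,
    cols[j] is the number of chains in state j at time t+1.
    """
    rng = np.random.default_rng(seed)
    K = M.shape[0]
    table = northwest_corner(rows, cols)
    total = np.zeros_like(table, dtype=float)
    for it in range(n_iter):
        i1, i2 = rng.choice(K, size=2, replace=False)
        j1, j2 = rng.choice(K, size=2, replace=False)
        # Ratio of target values after adding 1 to (i1,j1), (i2,j2) and
        # removing 1 from (i1,j2), (i2,j1); it is zero if a cell would go negative
        ratio = (M[i1, j1] * M[i2, j2] * table[i1, j2] * table[i2, j1]) / (
            M[i1, j2] * M[i2, j1] * (table[i1, j1] + 1) * (table[i2, j2] + 1))
        if rng.random() < ratio:
            table[i1, j1] += 1
            table[i2, j2] += 1
            table[i1, j2] -= 1
            table[i2, j1] -= 1
        if it >= burn:
            total += table
    return total / (n_iter - burn)
```

Within a Monte Carlo EM iteration, such averages computed at each instant under the current value of M would be summed over time and normalised by rows to produce the next value of M.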

local kernel reduction for ABC

Posted in Books, pictures, Statistics, University life on September 14, 2016 by xi'an

“…construction of low dimensional summary statistics can be performed as in a black box…”

Today Zhou and Fukumizu just arXived a paper that proposes a gradient-based dimension reduction for ABC summary statistics, in the spirit of RKHS kernels as advocated, e.g., by Arthur Gretton. Here the projection is a mere linear projection Bs of the vector of summary statistics, s, where B is an estimated Hessian matrix associated with the posterior expectation E[θ|s]. (There is some connection with the latest version of Li’s and Fearnhead’s paper on ABC convergence as they also define a linear projection of the summary statistics, based on asymptotic arguments, although their matrix does depend on the true value of the parameter.) The linearity sounds like a strong restriction [to me] especially when the summary statistics have no reason to belong to a vectorial space and thus be open to changes of bases and linear projections. For instance, a specific value taken by a summary statistic, like 0 say, may be more relevant than the range of their values. On a larger scale, I am doubtful about always projecting a vector of summary statistics on a subspace with the smallest possible dimension, i.e., the dimension of θ. In practical settings, it seems impossible to derive the optimal projection and a subvector is almost certain to lose information against a larger vector.

“Another proposal is to use different summary statistics for different parameters.”

Which is exactly what we did in our random forest estimation paper. Using a different forest for each parameter of interest (but no real tree was damaged in the experiment!).

Approximate Bayesian computation via sufficient dimension reduction

Posted in Statistics, University life on August 26, 2016 by xi'an

“One of our contribution comes from the mathematical analysis of the consequence of conditioning the parameters of interest on consistent statistics and intrinsically inconsistent statistics”

Xiaolong Zhong and Malay Ghosh have just arXived an ABC paper focussing on the convergence of the method. And on the use of sufficient dimension reduction techniques for the construction of summary statistics. I had not heard of this approach before so read the paper with interest. I however regret that the paper does not link with the recent consistency results of Li and Fearnhead and of Daniel Frazier, Gael Martin, Judith Rousseau and myself. When conditioning upon the MLE [or the posterior mean] as the summary statistic, Theorem 1 states that the Bernstein-von Mises theorem holds, missing a limit in the tolerance ε. And apparently missing conditions on the speed of convergence of this tolerance to zero although the conditioning event involves the true value of the parameter. This makes me wonder at the relevance of the result. The part about partial posteriors and the characterisation of limiting posterior distributions starts with the natural remark that the mean of the summary statistic must identify the whole parameter θ to achieve consistency, a point central to our 2014 JRSS B paper. The authors suggest using a support vector machine to derive the summary statistics, an idea already exploited by Heiko Strathmann et al. There is no consistency result of relevance for ABC in that second and final part, which ends up rather abruptly. Overall, while the paper contributes to the current reflection on the convergence properties of ABC, the lack of scaling of the tolerance ε calls for further investigations.