Archive for curse of dimensionality

improved importance sampling via iterated moment matching

Posted in Statistics with tags , , , , on August 1, 2019 by xi'an

Topi Paananen, Juho Piironen, Paul-Christian Bürkner and Aki Vehtari have recently arXived a work on constructing an adapted importance (sampling) distribution. The beginning is more a review than a new contribution, covering the earlier work by Vehtari, Gelman  and Gabri (2017): estimating the Pareto rate for the importance weight distribution helps in assessing whether or not this distribution allows for a (necessary) second moment. In case it does not (seem to), the authors propose an affine transform of the importance distribution, using the earlier sample to match the first two moments of the distribution. Or of the targeted function. Adaptation that is controlled by the same Pareto rate technique, as in the above picture (from the paper). Predicting a natural objection as to the poor performances of the earlier samples, the paper suggests to use robust estimators of these moments, for instance via Pareto smoothing. It also suggests using multiple importance sampling as a way to regularise and robustify the estimates. While I buy the argument of fitting the target moments to achieve a better fit of the importance sampling, I remain unclear as to why an affine transform would change the (poor) tail behaviour of the importance sampler. Hence why it would apply in full generality. An alternative could consist in finding appropriate Box-Cox transforms, although the difficulty would certainly increase with the dimension.

likelihood-free approximate Gibbs sampling

Posted in Books, Statistics with tags , , , , , , , , on June 19, 2019 by xi'an

“Low-dimensional regression-based models are constructed for each of these conditional distributions using synthetic (simulated) parameter value and summary statistic pairs, which then permit approximate Gibbs update steps (…) synthetic datasets are not generated during each sampler iteration, thereby providing efficiencies for expensive simulator models, and only require sufficient synthetic datasets to adequately construct the full conditional models (…) Construction of the approximate conditional distributions can exploit known structures of the high-dimensional posterior, where available, to considerably reduce computational overheads”

Guilherme Souza Rodrigues, David Nott, and Scott Sisson have just arXived a paper on approximate Gibbs sampling. Since this comes a few days after we posted our own version, here are some of the differences I could spot in the paper:

  1. Further references to earlier occurrences of Gibbs versions of ABC, esp. in cases when the likelihood function factorises into components and allows for summaries with lower dimensions. And even to ESP.
  2. More an ABC version of Gibbs sampling that a Gibbs version of ABC in that approximations to the conditionals are first constructed and then used with no further corrections.
  3. Inherently related to regression post-processing à la Beaumont et al.  (2002) in that the regression model is the start to designing an approximate full conditional, conditional on the “other” parameters and on the overall summary statistic. The construction of the approximation is far from automated. And may involve neural networks or other machine learning estimates.
  4. As a consequence of the above, a preliminary ABC step to design the collection of approximate full conditionals using a single and all-purpose multidimensional summary statistic.
  5. Once the approximations constructed, no further pseudo-data is generated.
  6. Drawing from the approximate full conditionals is done exactly, possibly via a bootstrapped version.
  7. Handling a highly complex g-and-k dynamic model with 13,140 unknown parameters, requiring a ten days simulation.

“In certain circumstances it can be seen that the likelihood-free approximate Gibbs sampler will exactly target the true partial posterior (…) In this case, then Algorithms 2 and 3 will be exact.”

Convergence and coherence are handled in the paper by setting the algorithm(s) as noisy Monte Carlo versions, à la Alquier et al., although the issue of incompatibility between the full conditionals is acknowledged, with the main reference being the finite state space analysis of Chen and Ip (2015). It thus remains unclear whether or not the Gibbs samplers that are implemented there do converge and if they do what is the significance of the stationary distribution.

selecting summary statistics [a tale of two distances]

Posted in Books, Statistics with tags , , , , , , , , , , , , , , on May 23, 2019 by xi'an

As Jonathan Harrison came to give a seminar in Warwick [which I could not attend], it made me aware of his paper with Ruth Baker on the selection of summaries in ABC. The setting is an ABC-SMC algorithm and it relates with Fearnhead and Prangle (2012), Barnes et al. (2012), our own random forest approach, the neural network version of Papamakarios and Murray (2016), and others. The notion here is to seek the optimal weights of different summary statistics in the tolerance distance, towards a maximization of a distance (Hellinger) between prior and ABC posterior (Wasserstein also comes to mind!). A sort of dual of the least informative prior. Estimated by a k-nearest neighbour version [based on samples from the prior and from the ABC posterior] I had never seen before. I first did not get how this k-nearest neighbour distance could be optimised in the weights since the posterior sample was already generated and (SMC) weighted, but the ABC sample can be modified by changing the [tolerance] distance weights and the resulting Hellinger distance optimised this way. (There are two distances involved, in case the above description is too murky!)

“We successfully obtain an informative unbiased posterior.”

The paper spends a significant while in demonstrating that the k-nearest neighbour estimator converges and much less on the optimisation procedure itself, which seems like a real challenge to me when facing a large number of particles and a high enough dimension (in the number of statistics). (In the examples, the size of the summary is 1 (where does the weight matter?), 32, 96, 64, with 5 10⁴, 5 10⁴, 5 10³ and…10 particles, respectively.) The authors address the issue, though, albeit briefly, by mentioning that, for the same overall computation time, the adaptive weight ABC is indeed further from the prior than a regular ABC with uniform weights [rather than weighted by the precisions]. They also argue that down-weighting some components is akin to selecting a subset of summaries, but I beg to disagree with this statement as the weights are never exactly zero, as far as I can see, hence failing to fight the curse of dimensionality. Some LASSO version could implement this feature.

deep and embarrassingly parallel MCMC

Posted in Books, pictures, Statistics with tags , , , , , , , on April 9, 2019 by xi'an

Diego Mesquita, Paul Blomstedt, and Samuel Kaski (from Helsinki, like the above picture) just arXived a paper on embarrassingly parallel MCMC. Following a series of papers discussed on this ‘og in the past. They use a deep learning approach of Dinh et al. (2017) to the computation of the probability density of a convoluted and non-volume-preserving transform of a given random variable to turn multiple samples from sub-posteriors [corresponding to the k k-th roots of the true posterior] into a sample from the true posterior. If I understand correctly the argument [on page 4], the deep neural network provides a density estimate that apparently does better than traditional non-parametric density estimates. Maybe by being more efficient than a Parzen-Rosenblat estimator which is of order the number of simulations… For any value of θ, the estimate of the true target is the product of these estimates and for a value of θ simulated from one of the subposteriors an importance weight naturally ensues. However, for a one-dimensional transform of θ, h(θ), I would prefer estimating first the density of h(θ) for each sample and then construct an importance weight. If only to avoid the curse of dimension.

On various benchmarks, like the banana-shaped 2D target above, the proposed method (NAP) does better. Even in relatively high dimensions. Given that the overall computing times are not produced, with only the calibration that the same number of subsamples were produced for all methods, it would be interesting to test the same performances on even higher dimensions and larger population sizes.

adaptive copulas for ABC

Posted in Statistics with tags , , , , , , , , on March 20, 2019 by xi'an

A paper on ABC I read on my way back from Cambodia:  Yanzhi Chen and Michael Gutmann arXived an ABC [in Edinburgh] paper on learning the target via Gaussian copulas, to be presented at AISTATS this year (in Okinawa!). Linking post-processing (regression) ABC and sequential ABC. The drawback in the regression approach is that the correction often relies on an homogeneity assumption on the distribution of the noise or residual since this approach only applies a drift to the original simulated sample. Their method is based on two stages, a coarse-grained one where the posterior is approximated by ordinary linear regression ABC. And a fine-grained one, which uses the above coarse Gaussian version as a proposal and returns a Gaussian copula estimate of the posterior. This proposal is somewhat similar to the neural network approach of Papamakarios and Murray (2016). And to the Gaussian copula version of Li et al. (2017). The major difference being the presence of two stages. The new method is compared with other ABC proposals at a fixed simulation cost, which does not account for the construction costs, although they should be relatively negligible. To compare these ABC avatars, the authors use a symmetrised Kullback-Leibler divergence I had not met previously, requiring a massive numerical integration (although this is not an issue for the practical implementation of the method, which only calls for the construction of the neural network(s)). Note also that sequential ABC is only run for two iterations, and also that none of the importance sampling ABC versions of Fearnhead and Prangle (2012) and of Li and Fearnhead (2018) are considered, all versions relying on the same vector of summary statistics with a dimension much larger than the dimension of the parameter. Except in our MA(2) example, where regression does as well. I wonder at the impact of the dimension of the summary statistic on the performances of the neural network, i.e., whether or not it is able to manage the curse of dimensionality by ignoring all but essentially the data  statistics in the optimisation.

approximate likelihood perspective on ABC

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , , , on December 20, 2018 by xi'an

George Karabatsos and Fabrizio Leisen have recently published in Statistics Surveys a fairly complete survey on ABC methods [which earlier arXival I had missed]. Listing within an extensive bibliography of 20 pages some twenty-plus earlier reviews on ABC (with further ones in applied domains)!

“(…) any ABC method (algorithm) can be categorized as either (1) rejection-, (2) kernel-, and (3) coupled ABC; and (4) synthetic-, (5) empirical- and (6) bootstrap-likelihood methods; and can be combined with classical MC or VI algorithms [and] all 22 reviews of ABC methods have covered rejection and kernel ABC methods, but only three covered synthetic likelihood, one reviewed the empirical likelihood, and none have reviewed coupled ABC and bootstrap likelihood methods.”

The motivation for using approximate likelihood methods is provided by the examples of g-and-k distributions, although the likelihood can be efficiently derived by numerical means, as shown by Pierre Jacob‘s winference package, of mixed effect linear models, although a completion by the mixed effects themselves is available for Gibbs sampling as in Zeger and Karim (1991), and of the hidden Potts model, which we covered by pre-processing in our 2015 paper with Matt Moores, Chris Drovandi, Kerrie Mengersen. The paper produces a general representation of the approximate likelihood that covers the algorithms listed above as through the table below (where t(.) denotes the summary statistic):

The table looks a wee bit challenging simply because the review includes the synthetic likelihood approach of Wood (2010), which figured preeminently in the 2012 Read Paper discussion but opens the door to all kinds of approximations of the likelihood function, including variational Bayes and non-parametric versions. After a description of the above versions (including a rather ignored coupled version) and the special issue of ABC model choice,  the authors expand on the difficulties with running ABC, from multiple tuning issues, to the genuine curse of dimensionality in the parameter (with unnecessary remarks on low-dimension sufficient statistics since they are almost surely inexistent in most realistic settings), to the mis-specified case (on which we are currently working with David Frazier and Judith Rousseau). To conclude, an worthwhile update on ABC and on the side a funny typo from the reference list!

Li, W. and Fearnhead, P. (2018, in press). On the asymptotic efficiency
of approximate Bayesian computation estimators. Biometrika na na-na.

JSM 2018 [#3]

Posted in Mountains, pictures, Statistics, Travel, University life with tags , , , , , , , , , , on August 2, 2018 by xi'an

Third day at JSM2018 and the audience is already much smaller than the previous days! Although it is hard to tell with a humongous conference centre spread between two buildings. And not getting hooked by the tantalising view of the bay, with waterplanes taking off every few minutes…


Still, there were (too) few participants in the two computational statistics (MCMC) sessions I attended in the morning, the first one being organised by James Flegal on different assessments of MCMC convergence. (Although this small audience made the session quite homely!) In his own talk, James developed an interesting version of multivariate ESS that he related with a stopping rule for minimal precision. Vivek Roy also spoke about a multiple importance sampling construction I missed when it came upon on arXiv last May.

In the second session, Mylène Bédard exposed the construction of and improvement brought by local scaling in MALA, with 20% gain from using non-local tuning. Making me idle muse over whether block sizes in block-Gibbs sampling could also be locally optimised… Then Aaron Smith discussed how HMC should be scaled for optimal performances, under rather idealised conditions and very high dimensions. Mentioning a running time of d, the dimension, to the power ¼. But not addressing the practical question of calibrating scale versus number of steps in the discretised version. (At which time my hands were [sort of] frozen solid thanks to the absurd air conditioning in the conference centre and I had to get out!)