Archive for Monte Carlo error

dropping a point

Posted in Statistics, University life with tags , , , , , , , , on September 8, 2020 by xi'an

“A discussion about whether to drop the initial point came up in the plenary tutorial of Fred Hickernell at MCQMC 2020 about QMCPy software for QMC. The issue has been discussed by the pytorch community , and the scipy community, which are both incorporating QMC methods.”

Art Owen recently arXived a paper entitled On dropping the first Sobol’ point in which he examines the impact of a common practice consisting in skipping the first point of a Sobol’ sequence when using quasi-Monte Carlo. By analogy with the burn-in practice for MCMC that aims at eliminating the biais due to the choice of the starting value. Art’s paper shows that by skipping just this one point the rate of convergence of some QMC estimates may drop by a factor, bringing the rate back to Monte Carlo values! As this applies to randomised scrambled Sobol sequences, this is quite amazing. The explanation centers on the suppression leaving one region of the hypercube unexplored, with an O(n⁻¹) error ensuing.

The above picture from the paper makes the case in a most obvious way: the mean squared error is not decreasing at the same rate for the no-drop and one-drop versions, since they are -3/2 and -1, respectively. The paper further “recommends against using roundnumber sample sizes and thinning QMC points.” Conclusion: QMC is not MC!

transport Monte Carlo

Posted in Books, pictures, Statistics, Travel with tags , , , , , , , , , , , , , , , on August 31, 2020 by xi'an

Read this recent arXival by Leo Duan (from UF in Gainesville) on transport approaches to approximate Bayesian computation, in connection with normalising flows. The author points out a “lack of flexibility in a large class of normalizing flows”  to bring forward his own proposal.

“…we assume the reference (a multivariate uniform distribution) can be written as a mixture of many one-to-one transforms from the posterior”

The transportation problem is turned into defining a joint distribution on (β,θ) such that θ is marginally distributed from the posterior and β is one of an infinite collection of transforms of θ. Which sounds quite different from normalizing flows, to be sure. Reverting the order, if one manages to simulate β from its marginal the resulting θ is one of the transforms. Chosen to be a location-scale modification of β, s⊗β+m. The weights when going from θ to β are logistic transforms with Dirichlet distributed scales. All with parameters to be optimised by minimising the Kullback-Leibler distance between the reference measure on β and its inverse mixture approximation, and resorting to gradient descent. (This may sound a wee bit overwhelming as an approximation strategy and I actually had to make a large cup of strong macha to get over it, but this may be due to the heat wave occurring at the same time!) Drawing θ from this approximation is custom-made straightforward and an MCMC correction can even be added, resulting in an independent Metropolis-Hastings version since the acceptance ratio remains computable. Although this may defeat the whole purpose of the exercise by stalling the chain if the approximation is poor (hence suggesting this last step being used instead as a control.)

The paper also contains a theoretical section that studies the approximation error, going to zero as the number of terms in the mixture, K, goes to infinity. Including a Monte Carlo error in log(n)/n (and incidentally quoting a result from my former HoD at Paris 6, Paul Deheuvels). Numerical experiments show domination or equivalence with some other solutions, e.g. being much faster than HMC, the remaining $1000 question being of course the on-line evaluation of the quality of the approximation.

uncertainty in the ABC posterior

Posted in Statistics with tags , , , , , , on July 24, 2019 by xi'an

In the most recent Bayesian Analysis, Marko Järvenpää et al. (including my coauthor Aki Vehtari) consider an ABC setting where the number of available simulations of pseudo-samples  is limited. And where they want to quantify the amount of uncertainty resulting from the estimation of the ABC posterior density. Which is a version of the Monte Carlo error in practical ABC, in that this is the difference between the ABC posterior density for a given choice of summaries and a given choice of tolerance, and the actual approximation based on a finite number of simulations from the prior predictive. As in earlier works by Michael Gutmann and co-authors, the focus stands in designing a sequential strategy to decide where to sample the next parameter value towards minimising a certain expected loss. And in adopting a Gaussian process modelling for the discrepancy between observed data and simulated data, hence generalising the synthetic likelihood approach. This allows them to compute the expectation and the variance of the unnormalised ABC posterior, based on plugged-in estimators. From where the authors derive a loss as the expected variance of the acceptance probability (although it is not parameterisation invariant). I am unsure I see the point for this choice in that there is no clear reason for the resulting sequence of parameter choices to explore the support of the posterior distribution in a relatively exhaustive manner. The paper also mentions alternatives where the next parameter is chosen at the location where “the uncertainty of the unnormalised ABC posterior is highest”. Which sounds more pertinent to me. And further avoids integrating out the parameter. I also wonder if ABC mis-specification analysis could apply in this framework since the Gaussian process is most certainly a “wrong” model. (When concluding this post, I realised I had written a similar entry two years ago about the earlier version of the paper!)


Posted in Books, Statistics with tags , , , , , , , , , , , , , on June 25, 2018 by xi'an

A recent arXival on a new version of ABC based on kernel estimators (but one could argue that all ABC versions are based on kernel estimators, one way or another.) In this ABC-CDE version, Izbicki,  Lee and Pospisilz [from CMU, hence the picture!] argue that past attempts failed to exploit the full advantages of kernel methods, including the 2016 ABCDE method (from Edinburgh) briefly covered on this blog. (As an aside, CDE stands for conditional density estimation.) They also criticise these attempts at selecting summary statistics and hence failing in sufficiency, which seems a non-issue to me, as already discussed numerous times on the ‘Og. One point of particular interest in the long list of drawbacks found in the paper is the inability to compare several estimates of the posterior density, since this is not directly ingrained in the Bayesian construct. Unless one moves to higher ground by calling for Bayesian non-parametrics within the ABC algorithm, a perspective which I am not aware has been pursued so far…

The selling points of ABC-CDE are that (a) the true focus is on estimating a conditional density at the observable x⁰ rather than everywhere. Hence, rejecting simulations from the reference table if the pseudo-observations are too far from x⁰ (which implies using a relevant distance and/or choosing adequate summary statistics). And then creating a conditional density estimator from this subsample (which makes me wonder at a double use of the data).

The specific density estimation approach adopted for this is called FlexCode and relates to an earlier if recent paper from Izbicki and Lee I did not read. As in many other density estimation approaches, they use an orthonormal basis (including wavelets) in low dimension to estimate the marginal of the posterior for one or a few components of the parameter θ. And noticing that the posterior marginal is a weighted average of the terms in the basis, where the weights are the posterior expectations of the functions themselves. All fine! The next step is to compare [posterior] estimators through an integrated squared error loss that does not integrate the prior or posterior and does not tell much about the quality of the approximation for Bayesian inference in my opinion. It is furthermore approximated by  a doubly integrated [over parameter and pseudo-observation] squared error loss, using the ABC(ε) sample from the prior predictive. And the approximation error only depends on the regularity of the error, that is the difference between posterior and approximated posterior. Which strikes me as odd, since the Monte Carlo error should take over but does not appear at all. I am thus unclear as to whether or not the convergence results are that relevant. (A difficulty with this paper is the strong dependence on the earlier one as it keeps referencing one version or another of FlexCode. Without reading the original one, I spotted a mention made of the use of random forests for selecting summary statistics of interest, without detailing the difference with our own ABC random forest papers (for both model selection and estimation). For instance, the remark that “nuisance statistics do not affect the performance of FlexCode-RF much” reproduces what we observed with ABC-RF.

The long experiment section always relates to the most standard rejection ABC algorithm, without accounting for the many alternatives produced in the literature (like Li and Fearnhead, 2018. that uses Beaumont et al’s 2002 scheme, along with importance sampling improvements, or ours). In the case of real cosmological data, used twice, I am uncertain of the comparison as I presume the truth is unknown. Furthermore, from having worked on similar data a dozen years ago, it is unclear why ABC is necessary in such context (although I remember us running a test about ABC in the Paris astrophysics institute once).

seeking the error in nested sampling

Posted in pictures, Statistics, Travel with tags , , , , , , on April 13, 2017 by xi'an

A newly arXived paper on the error in nested sampling, written by Higson and co-authors, and read in Berlin, looks at the difficult task of evaluating the sampling error of nested sampling. The conclusion is essentially negative in that the authors recommend multiple runs of the method to assess the magnitude of the variability of the output by bootstrap, i.e. to call for the most empirical approach…

The core of this difficulty lies in the half-plug-in, half-quadrature, half-Monte Carlo (!) feature of the nested sampling algorithm, in that (i) the truncation of the unit interval is based on a expectation of the mass of each shell (i.e., the zone between two consecutive isoclines of the likelihood, (ii) the evidence estimator is a quadrature formula, and (iii) the level of the likelihood at the truncation is replaced with a simulated value that is not even unbiased (and correlated with the previous value in the case of an MCMC implementation). As discussed in our paper with Nicolas, the error in the evidence approximation is of the same order as other Monte Carlo methods in that it gets down like the square root of the number of terms at each iteration. Contrary to earlier intuitions that focussed on the error due to the quadrature.

But the situation is much less understood when the resulting sample is used for estimation of quantities related with the posterior distribution. With no clear approach to assess and even less correct the resulting error, since it is not solely a Monte Carlo error. As noted by the authors, the quadrature approximation to the univariate integral replaces the unknown prior weight of a shell with its Beta order statistic expectation and the average of the likelihood over the shell with a single (uniform???) realisation. Or the mean value of a transform of the parameter with a single (biased) realisation. Since most posterior expectations can be represented as integrals over likelihood levels of the average value over an iso-likelihood contour. The approach advocated in the paper involved multiple threads of an “unwoven nested sampling run”, which means launching n nested sampling runs with one living term from the n currents living points in the current nested sample. (Those threads may then later be recombined into a single nested sample.) This is the starting point to a nested flavour of bootstrapping, where threads are sampled with replacement, from which confidence intervals and error estimates can be constructed. (The original notion appears in Skilling’s 2006 paper, but I missed it.)

The above graphic is an attempt within the paper at representing the (marginal) posterior of a transform f(θ). That I do not fully understand… The notations are rather horrendous as X is not the data but the prior probability for the likelihood to be above a given bound which is actually the corresponding quantile. (There is no symbol for data and £ is used for the likelihood function as well as realisations of the likelihood function…) A vertical slice on the central panel gives the posterior distribution of f(θ) given the event that the likelihood is in the corresponding upper tail. Or given the corresponding shell (?).