Archive for evidence

ABC on brain networks

Posted in Books, pictures, Statistics, University life on April 16, 2021 by xi'an

ResearchGate sent me an automated email pointing out a recent paper citing some of our ABC papers. The paper is written by Timothy West et al., neuroscientists in the UK, comparing models of Parkinsonian circuit dynamics. Using SMC-ABC. One novelty is the update of the tolerance by a fixed difference, unless the acceptance rate is too low, in which case the tolerance is reinitialised to a starting value.
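
A minimal sketch of such a tolerance schedule, where the decrement, the acceptance threshold and the restart value are illustrative choices of mine rather than those of the paper:

```python
# Sketch of an SMC-ABC tolerance update: shrink the tolerance by a fixed
# difference at each round, unless the acceptance rate collapses, in which
# case restart from the initial tolerance. Numerical values are illustrative.

def update_tolerance(eps, acc_rate, eps_init, delta=0.05, acc_min=0.01):
    """Return the tolerance to use in the next SMC-ABC round."""
    if acc_rate < acc_min:            # acceptance rate too low: reinitialise
        return eps_init
    return max(eps - delta, 0.0)      # otherwise decrease by a fixed amount
```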

“(…) the proposal density P(θ|D⁰) is formed from the accepted parameters sets. We use a density approximation to the marginals and a copula for the joint (…) [i.e.] a nonparametric estimation of the marginal densities over each parameter [and] the t-copula (…) Data are transformed to the copula scale (unit-square) using the kernel density estimator of the cumulative distribution function of each parameter and then transformed to the joint space with the t-copula.”

The construct of the proposal is quite involved, as described in the above quote. The model choice approach is standard (à la Grelaud et al.) but uses the median distance as a tolerance.
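
As a rough sketch of what such a proposal may look like, combining kernel density estimates of the marginals with a t-copula for the dependence (the fixed degrees of freedom and the moment-based estimate of the correlation matrix are shortcuts of mine, not necessarily the authors' choices):

```python
import numpy as np
from scipy import stats

def fit_copula_proposal(theta, df=4):
    """Fit a t-copula proposal to the accepted ABC parameters (n x d array):
    kernel density estimates for the marginals, a t-copula for the joint,
    with the correlation matrix crudely estimated on the copula scale."""
    n, d = theta.shape
    kdes = [stats.gaussian_kde(theta[:, j]) for j in range(d)]
    u = np.empty_like(theta)
    for j in range(d):                      # smoothed cdf transform to (0,1)
        u[:, j] = [kdes[j].integrate_box_1d(-np.inf, x) for x in theta[:, j]]
    u = np.clip(u, 1e-6, 1 - 1e-6)
    z = stats.t.ppf(u, df)                  # copula (t) scale
    return kdes, np.corrcoef(z, rowvar=False)

def sample_copula_proposal(kdes, corr, df=4, size=1, seed=None):
    """Draw proposals from the t-copula and map them back to parameter scale."""
    rng = np.random.default_rng(seed)
    d = corr.shape[0]
    chi = rng.chisquare(df, size) / df      # multivariate t via a chi-square mix
    z = rng.multivariate_normal(np.zeros(d), corr, size) / np.sqrt(chi)[:, None]
    u = stats.t.cdf(z, df)
    out = np.empty((size, d))
    for j in range(d):                      # invert each kernel cdf on a grid
        data = kdes[j].dataset.ravel()
        pad = 0.5 * (data.max() - data.min())
        grid = np.linspace(data.min() - pad, data.max() + pad, 512)
        cdf = [kdes[j].integrate_box_1d(-np.inf, x) for x in grid]
        out[:, j] = np.interp(u[:, j], cdf, grid)
    return out
```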

“(…) test whether the ABC estimator will: a) yield parameter estimates that are unique to the data from which they have been optimized; and b) yield consistent estimation of parameters across multiple instances (…) test the face validity of the model comparison framework (…) [and] demonstrate the scalability of the optimization and model comparison framework.”

The paper runs a fairly extensive test of the above features, concluding that “the ABC optimized posteriors are consistent across multiple initializations and that the output is determined by differences in the underlying model generating the given data.” Concerning model comparison, the authors mix the ABC Bayes factor with a post-hoc analysis of divergence to discriminate against overfitting. They only briefly mention the potential impact of the summary statistics, in the conclusion section, and the remark that the statistics were “sufficient to recover known parameters” does not support their use for model comparison. The additional criticism of sampling strategies for approximating Bayes factors is somewhat irrelevant, the main issue with ABC model choice being a change of magnitude in the evidence.
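
For readers unfamiliar with the Grelaud et al. reference, here is a bare-bones sketch of ABC model choice by rejection, with the tolerance taken as the median simulated distance. The toy models, priors and summary statistic are mine; the closing comment echoes the above caveat about summaries.

```python
import numpy as np

rng = np.random.default_rng(0)

def abc_model_choice(y_obs, simulators, priors, stat, n_sim=20_000):
    """ABC model choice by rejection à la Grelaud et al.: draw the model
    index uniformly, simulate pseudo-data, keep the simulations whose
    summary falls within the median distance, and estimate the posterior
    model probabilities by the acceptance frequencies."""
    s_obs = stat(y_obs)
    models = rng.integers(len(simulators), size=n_sim)
    dists = np.array([abs(stat(simulators[m](priors[m]())) - s_obs)
                      for m in models])
    eps = np.median(dists)                      # tolerance = median distance
    kept = models[dists <= eps]
    return np.bincount(kept, minlength=len(simulators)) / len(kept)

# toy comparison of a Poisson and a (shifted) geometric model, using the
# sample mean as summary: fine for estimation within each model, but, as
# stressed above, no guarantee it is adequate for comparing the two models
y = rng.poisson(3.0, size=50)
print(abc_model_choice(
    y,
    simulators=[lambda lam: rng.poisson(lam, size=50),
                lambda p: rng.geometric(p, size=50) - 1],
    priors=[lambda: rng.exponential(5.0),
            lambda: rng.uniform(0.05, 0.95)],
    stat=np.mean))
```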

“ABC has established itself as a key tool for parameter estimation in systems biology (…) but is yet to see wide adoption in systems neuroscience. It is known that ABC will not perform well under certain conditions (Sunnåker et al., 2013). Specifically, it has been shown that the simplest form of ABC algorithm based upon a rejection-sampling approach is inefficient in the case where the prior densities lie far from the true posterior (…) This motivates the use of neurobiologically grounded models over phenomenological models where often the ranges of potential parameter values are unknown.”
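
A throwaway illustration of that last point, on a toy Normal model of my own: the rejection-ABC acceptance rate collapses as the prior spreads far beyond the posterior.

```python
import numpy as np

rng = np.random.default_rng(1)
y_obs = rng.normal(2.0, 1.0, size=20)            # data from N(2, 1)

def acceptance_rate(prior_sd, eps=0.1, n=100_000):
    """Rejection ABC with theta ~ N(0, prior_sd^2) and the sample mean as
    summary; the pseudo-data mean of 20 draws is simulated directly."""
    theta = rng.normal(0.0, prior_sd, size=n)
    pseudo_mean = rng.normal(theta, 1.0 / np.sqrt(20))
    return np.mean(np.abs(pseudo_mean - y_obs.mean()) < eps)

for sd in (1.0, 5.0, 50.0):
    print(sd, acceptance_rate(sd))               # rate drops as the prior widens
```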

on Astra and clots

Posted in Books, Kids, pictures, Statistics on March 16, 2021 by xi'an

A tribune in The Guardian this morning by David Spiegelhalter on there being no evidence that the Oxford/AstraZeneca vaccine causes blood clots.

“It’s a common human tendency to attribute a causal effect between different events, even when there isn’t one present: we wash the car and the next day a bird relieves itself all over the bonnet. Typical.”

David sets the 30 thromboembolic events among the 5 million people vaccinated with AstraZeneca in perspective of the expected 100 deep vein thromboses a week within such a population. Which coincides with the UK’s Medicines and Healthcare Products Regulatory Agency statement that the blood clots are on par with the expected numbers in the vaccinated population. (The part of the tribune about the yellow card reports, based on 10 million vaccinated people, reiterates the remark but may prove confusing to some!) As for hoping for a rational approach to the issue, … we would need a different type of vaccine, far from being available! As demonstrated by the decision to temporarily stop vaccinating with this vaccine, which will surely cause additional deaths in the coming weeks.
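
A back-of-the-envelope comparison, with the length of the vaccination window being my own assumption, makes the point numerically:

```python
# Spiegelhalter's figures: about 100 deep vein thromboses expected per week
# in a population of 5 million adults, versus 30 reported thromboembolic
# events among those vaccinated. The number of weeks is purely illustrative.
expected_dvt_per_week = 100
weeks = 4                      # assumed vaccination window
reported_clots = 30

print(f"expected background: {expected_dvt_per_week * weeks}, reported: {reported_clots}")
```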

“Will we ever be able to resist the urge to find causal relationships between different events? One way of doing this would be promoting the scientific method and ensuring everyone understands this basic principle. Testing a hypothesis helps us see which hunches or assumptions are correct and which aren’t. In this way, randomised trials have proved the effectiveness of some Covid treatments and saved vast numbers of lives, while also showing us that some overblown claims about treatments for Covid-19, such as hydroxychloroquine and convalescent plasma, were incorrect.”

sandwiching a marginal

Posted in Books, pictures, Statistics, University life on March 8, 2021 by xi'an

When working recently on a paper for estimating the marginal likelihood, I was pointed to this earlier 2015 paper by Roger Grosse, Zoubin Ghahramani and Ryan Adams, which had escaped me till now. The beginning of the paper discusses the shortcomings of importance sampling (when simulating from the prior) and of the harmonic mean (when simulating from the posterior) as solutions. And of annealed importance sampling (when simulating from a sequence, which sequence?!, of targets). The authors end up proposing a sequential Monte Carlo or (posterior) particle learning solution. A remark on annealed importance sampling is that there exist both a forward and a backward version for estimating the marginal likelihood, either starting from a simulation from the prior (easy) or from a simulation from the posterior (hard!). As in, e.g., Nicolas Chopin’s thesis, the intermediate steps are constructed from a subsample of the entire sample.
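
As a reminder of the forward version, here is a minimal annealed importance sampler on a toy conjugate Normal model, with the geometric path, the temperature ladder and the Metropolis kernel being standard but otherwise arbitrary choices of mine:

```python
import numpy as np

rng = np.random.default_rng(2)

# toy model: y_i ~ N(theta, 1) with theta ~ N(0, 1); the marginal likelihood
# is available in closed form, making this a convenient sanity check
y = rng.normal(1.0, 1.0, size=30)

def log_prior(theta):
    return -0.5 * theta**2 - 0.5 * np.log(2 * np.pi)

def log_lik(theta):
    return np.sum(-0.5 * (y[None, :] - theta[:, None])**2
                  - 0.5 * np.log(2 * np.pi), axis=1)

def ais_log_evidence(n_particles=1_000, n_temps=100, step=0.2):
    """Forward annealed importance sampling along the geometric path
    p_t ∝ prior × likelihood^t: start from the prior, accumulate the
    incremental weights, and apply one Metropolis move per temperature.
    Returns the log of the (unbiased) evidence estimate."""
    betas = np.linspace(0.0, 1.0, n_temps + 1)
    theta = rng.normal(0.0, 1.0, size=n_particles)        # draws from the prior
    log_w = np.zeros(n_particles)
    for b_prev, b in zip(betas[:-1], betas[1:]):
        log_w += (b - b_prev) * log_lik(theta)            # incremental weight
        prop = theta + step * rng.normal(size=n_particles)
        log_acc = (log_prior(prop) + b * log_lik(prop)
                   - log_prior(theta) - b * log_lik(theta))
        accept = np.log(rng.uniform(size=n_particles)) < log_acc
        theta = np.where(accept, prop, theta)
    return np.logaddexp.reduce(log_w) - np.log(n_particles)
```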

“In this context, unbiasedness can be misleading: because partition function estimates can vary over many orders of magnitude, it’s common for an unbiased estimator to drastically underestimate Z with overwhelming probability, yet occasionally return extremely large estimates. (An extreme example is likelihood weighting, which is unbiased, but is extremely unlikely to give an accurate answer for a high-dimensional model.) Unless the estimator is chosen very carefully, the variance is likely to be extremely large, or even infinite.”

One novel aspect of the paper is to advocate for the simultaneous use of different methods, producing both lower and upper bounds on the marginal p(y) and waiting for them to get close enough. It is however delicate to find upper bounds, except when using the dreaded harmonic mean estimator. (A nice trick associated with reverse annealed importance sampling is that the reverse chain can be simulated exactly from the posterior if associated with simulated data, except I am rather lost at the connection between the actual and simulated data.) In a sequential harmonic mean version, the authors also look at the dangers of using an harmonic mean but argue that the potential infinite variance of the weights does not matter so much for log p(y), without displaying any variance calculation… The paper also contains a substantial experimental section that compares the different solutions evoked so far, plus others like nested sampling. Which did not work poorly in the experiment (see below) but could not be trusted to provide a lower or an upper bound. The computing time to achieve some level of agreement is however rather daunting. An interesting read definitely (and I wonder what happened to the paper in the end).
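
The sandwich itself rests on Jensen's inequality: logging an unbiased estimator of p(y), such as forward annealed importance sampling, gives a stochastic lower bound, while logging an unbiased estimator of 1/p(y), such as the harmonic mean or reverse annealed importance sampling, gives a stochastic upper bound. In symbols (my notation):

```latex
\[
\mathbb{E}\bigl[\log \hat Z\bigr] \le \log \mathbb{E}\bigl[\hat Z\bigr] = \log p(y)
  \quad\text{when } \mathbb{E}[\hat Z] = p(y),
\]
\[
-\,\mathbb{E}\bigl[\log \hat V\bigr] \ge -\log \mathbb{E}\bigl[\hat V\bigr] = \log p(y)
  \quad\text{when } \mathbb{E}[\hat V] = 1/p(y).
\]
```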

marginal likelihood with large amounts of missing data

Posted in Books, pictures, Statistics on October 20, 2020 by xi'an

In 2018, Panayiota Touloupou, research fellow at Warwick, and her co-authors published a paper in Bayesian Analysis that somehow escaped my radar, despite standing in my first circle of topics of interest! They construct an importance sampling approach to the approximation of the marginal likelihood, the importance function being approximated from a preliminary MCMC run, and consider the special case when the sampling density (i.e., the likelihood) can be represented as the marginal of a joint density. While this demarginalisation perspective is rather usual, the central point they make is that it is more efficient to estimate the sampling density based on the auxiliary or latent variables than to consider the joint posterior distribution of parameter and latent in the importance sampler. This induces a considerable reduction in dimension and hence explains (in part) why the approach should prove more efficient. Even though the approximation itself is costly, at about 5 seconds per marginal likelihood. But a nice feature of the paper is to include the above graph, which reports both computing time and variability for the different methods (the blue range corresponding to the marginal importance solution, the red range to RJMCMC and the green range to Chib’s estimate). Note that bridge sampling does not appear on the picture but returns a variability that is similar to the proposed methodology.
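
A minimal sketch of the general recipe, on a toy random-effects model with a Normal importance density fitted to the MCMC output (both choices mine, not the authors'): the key point is that the latent variables only appear inside the likelihood estimate, not in the importance sampler itself.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# toy latent-variable model: y_i | z_i ~ N(z_i, 1), z_i | theta ~ N(theta, 1),
# theta ~ N(0, 10^2); the latent z only appears inside the likelihood estimate
y = rng.normal(1.0, np.sqrt(2.0), size=25)

def log_lik_hat(theta, m=200):
    """Monte Carlo estimate of log p(y | theta), integrating the latents out."""
    z = rng.normal(theta, 1.0, size=(m, y.size))            # z_i | theta
    log_terms = stats.norm.logpdf(y, loc=z, scale=1.0)       # p(y_i | z_i)
    return np.sum(np.logaddexp.reduce(log_terms, axis=0) - np.log(m))

def marginal_likelihood_is(mcmc_draws, n_is=2_000):
    """Importance-sampling estimate of p(y), the importance density being a
    (slightly inflated) Normal fitted to a preliminary MCMC sample of theta."""
    q = stats.norm(mcmc_draws.mean(), 1.5 * mcmc_draws.std())
    thetas = q.rvs(size=n_is, random_state=rng)
    log_w = (np.array([log_lik_hat(t) for t in thetas])
             + stats.norm.logpdf(thetas, 0.0, 10.0)          # prior
             - q.logpdf(thetas))                             # importance density
    return np.exp(np.logaddexp.reduce(log_w) - np.log(n_is))
```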

scalable Metropolis-Hastings, nested Monte Carlo, and normalising flows

Posted in Books, pictures, Statistics, University life on June 16, 2020 by xi'an

Over a sunny if quarantined Sunday, I started reading the PhD dissertation of Rob Cornish, Oxford University, as I am the external member of his viva committee. This ended up in a highly pleasant afternoon discussing the thesis over a (remote) viva yesterday. (If bemoaning a lost opportunity to visit Oxford!) The introduction to the viva was most helpful and set the results within the different time and geographical zones of the PhD, since Rob had to switch from one group of advisors in Engineering to another group in Statistics. Plus an encompassing prospective discussion, expressing pessimism at exact MCMC for complex models and looking forward to further advances in probabilistic programming.

Made of three papers, the thesis includes this ICML 2019 [remember the era when there were conferences?!] paper on scalable Metropolis-Hastings, by Rob Cornish, Paul Vanetti, Alexandre Bouchard-Côté, Georges Deligiannidis, and Arnaud Doucet, which I commented on last year. Which achieves a remarkable and paradoxical O(1/√n) cost per iteration, provided (global) lower bounds are found on the (local) Metropolis-Hastings acceptance probabilities, since these allow for Poisson thinning à la Devroye (1986), with second-order Taylor expansions constructed for all components of the target and the third-order derivatives providing the bounds. However, the variability of the acceptance probability gets higher, which induces a longer but still manageable running time, provided the concentration of the posterior is in tune with the Bernstein-von Mises asymptotics. I had not paid enough attention in my first read to the strong theoretical justification for the method, relying on the convergence of MAP estimates in well- and (some) mis-specified settings. Now, I would have liked to see the paper dealing with a more complex problem than logistic regression.
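
Not the algorithm of the paper itself, but here is a bare-bones version of the underlying Poisson thinning trick: to draw a Bernoulli variable with success probability exp(−Σᵢ λᵢ), when each expensive λᵢ is dominated by a cheap bound λ̄ᵢ, only a Poisson(Σᵢ λ̄ᵢ) number of the λᵢ need be evaluated. The toy λᵢ and the 1.5 inflation factor below are my own.

```python
import numpy as np

rng = np.random.default_rng(4)

def thinned_bernoulli(lam, lam_bar):
    """Bernoulli draw with success probability exp(-sum_i lam(i)), evaluating
    only a Poisson(sum_i lam_bar[i]) number of the lam(i): simulate the
    dominating Poisson process and thin each point with probability
    lam(i) / lam_bar[i]; success means no point survives the thinning."""
    total = lam_bar.sum()
    n_points = rng.poisson(total)
    idx = rng.choice(lam_bar.size, size=n_points, p=lam_bar / total)
    for i in idx:
        if rng.uniform() < lam(i) / lam_bar[i]:    # thinning: keep this point
            return 0
    return 1

# sanity check against the exact probability; on average only ~0.75 of the
# 10,000 'expensive' terms get evaluated per draw
lam_i = rng.uniform(0.0, 1e-4, size=10_000)
draws = [thinned_bernoulli(lambda i: lam_i[i], 1.5 * lam_i) for _ in range(2_000)]
print(np.mean(draws), np.exp(-lam_i.sum()))
```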

The second paper in the thesis is an ICML 2018 proceedings paper by Tom Rainforth, Robert Cornish, Hongseok Yang, Andrew Warrington, and Frank Wood, which considers Monte Carlo problems involving several nested expectations in a non-linear manner, meaning that (a) several levels of Monte Carlo approximations are required, with associated asymptotics, and (b) the resulting overall estimator is biased. This includes common doubly intractable posteriors, obviously, as well as (Bayesian) design and control problems. [And it has nothing to do with nested sampling.] The resolution chosen by the authors is strictly plug-in, in that they replace each level in the nesting with a Monte Carlo substitute and do not attempt to reduce the bias. Which means a wide range of solutions (other than the plug-in one) could have been investigated, including bootstrap maybe. For instance, Bayesian design is presented as an application of the approach, but since it relies on the log-evidence, there exist several versions for estimating (unbiasedly) this log-evidence. Similarly, the Forsythe-von Neumann technique applies to arbitrary transforms of a primary integral. The central discussion dwells on the optimal choice of the volume of simulations at each level, optimal in terms of asymptotic MSE. Or rather of an asymptotic bound on the MSE. The interesting result being that the outer expectation requires the square of the number of simulations used for the other expectations. Which all need to converge to infinity. A trick in finding an estimator for a polynomial transform reminded me of the SAME algorithm, in that it duplicated the simulations as many times as the highest power of the polynomial. (The ‘Og briefly reported on this paper… four years ago.)
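
As a plug-in illustration of the nesting and of that outer-equals-inner-squared allocation, here is a made-up non-linear example with a known answer:

```python
import numpy as np

rng = np.random.default_rng(5)

def nested_mc(n_outer, n_inner):
    """Plug-in nested Monte Carlo estimate of E_x[ g( E_y[ f(x, y) ] ) ] with
    a non-linear g, here g(u) = log u, f(x, y) = exp(x y), x, y ~ N(0, 1):
    each inner expectation is replaced by its own Monte Carlo average, which
    makes the overall estimator biased, the bias vanishing as n_inner grows."""
    x = rng.normal(size=n_outer)
    y = rng.normal(size=(n_outer, n_inner))
    inner = np.exp(x[:, None] * y).mean(axis=1)    # estimates exp(x^2 / 2)
    return np.mean(np.log(inner))                  # exact answer: E[x^2 / 2] = 0.5

# allocation suggested by the asymptotic MSE bound: the number of outer
# simulations should grow like the square of the number of inner simulations
for m in (10, 30, 100):
    print(m, nested_mc(n_outer=m**2, n_inner=m))
```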

The third and last part of the thesis is a proposal [to appear in ICML 20] on relaxing bijectivity constraints in normalising flows with continuously indexed flows. (Or CIF. As Rob made a joke about this cleaning brand, let me add (?) to that joke by mentioning that looking at CIF and bijections is less dangerous in a Trump cum COVID era than looking at CIF and injections!) With Anthony Caterini, George Deligiannidis and Arnaud Doucet as co-authors. I am much less familiar with this area and hence a wee bit puzzled at the purpose of removing what I understand to be an appealing side of normalising flows, namely to produce a manageable representation of density functions as a combination of bijective and differentiable functions of a baseline random vector, like a standard Normal vector. The argument made in the paper is that this representation of the density imposes a constraint on the topology of its support, since said support is homeomorphic to the support of the baseline random vector. While the supporting theoretical argument is a mathematical theorem that shows the Lipschitz bound on the transform should be infinity in the case where the supports are topologically different, these arguments may be overly theoretical when faced with the practical implications of the replacement strategy. I somewhat miss its overall strength given that the whole point seems to be in approximating a density function, based on a finite sample.
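
To recall what the bijectivity buys in the standard construction, here is the change-of-variables bookkeeping for a single affine layer (nothing to do with the continuously indexed extension itself, just the baseline it relaxes; the triangular parameterisation is my own simplification):

```python
import numpy as np

rng = np.random.default_rng(6)
d = 3
# a bijective, differentiable map x = T(z) = A z + b of a standard Normal
# vector z; the triangular A with positive diagonal guarantees invertibility
A = np.tril(rng.normal(size=(d, d)), k=-1) + np.diag(np.exp(rng.normal(size=d)))
b = rng.normal(size=d)

def forward(z):
    return A @ z + b

def log_density(x):
    """Change of variables: log p_X(x) = log p_Z(T^{-1}(x)) + log|det J_{T^{-1}}(x)|."""
    z = np.linalg.solve(A, x - b)                     # T^{-1}(x)
    log_base = -0.5 * z @ z - 0.5 * d * np.log(2 * np.pi)
    log_det = -np.log(np.diag(A)).sum()               # 1 / |det A| for triangular A
    return log_base + log_det

print(log_density(forward(rng.normal(size=d))))
```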