## finding our way in the dark

Posted in Books, pictures, Statistics on November 18, 2021 by xi'an

The paper Finding our Way in the Dark: Approximate MCMC for Approximate Bayesian Methods by Evgeny Levi and (my friend) Radu Craiu was recently published in Bayesian Analysis. The central motivation for their work is that both ABC and synthetic likelihood are costly methods when the data is large and does not reduce to smaller summaries, that is, when summaries S of smaller dimension cannot be directly simulated. The idea is to try to estimate

$h(\theta)=\mathbb{P}_\theta(d(S,S^\text{obs})\le\epsilon)$

since this is the substitute for the likelihood used in ABC. (A related idea is to build an approximate and conditional [on θ] distribution on the distance, an idea with which Doc. Stoehr and I played a wee bit without getting anything definitely interesting!) This is a one-dimensional object, hence non-parametric estimates could be considered… For instance k-nearest-neighbour methods (which were already linked with ABC by Gérard Biau and co-authors). A random forest could also be used (?). Or neural nets. The method still requires full simulations of new datasets, so I wonder at the gain, unless the replacement of the naïve indicator with h(θ) brings a clear enough improvement to the approximation to allow for many fewer simulations. The effective sample size is definitely improved, which matters all the more since the CPU cost per iteration is higher. Could this be associated with the recourse to independent proposals?
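The k-nearest-neighbour route can be sketched on a toy example (the model, names, and tuning values below are my own illustration, not the authors' code): h(θ) is estimated by averaging the acceptance indicator over the k simulated parameter values nearest to θ.

```python
import numpy as np

rng = np.random.default_rng(0)

def knn_h_estimate(theta_grid, theta_sim, dist_sim, eps, k=50):
    """For each theta in theta_grid, average the indicator d_i <= eps
    over the k simulated parameter values closest to theta."""
    ind = (dist_sim <= eps).astype(float)
    h = np.empty(len(theta_grid))
    for j, t in enumerate(theta_grid):
        nearest = np.argsort(np.abs(theta_sim - t))[:k]
        h[j] = ind[nearest].mean()
    return h

# toy model: summary S ~ N(theta, 1), observed summary S_obs = 0
theta_sim = rng.uniform(-3, 3, size=5000)
S_sim = theta_sim + rng.standard_normal(5000)
dist_sim = np.abs(S_sim - 0.0)

theta_grid = np.linspace(-3, 3, 7)
h_hat = knn_h_estimate(theta_grid, theta_sim, dist_sim, eps=0.5, k=100)
# h_hat should peak near theta = 0, where S is most likely close to S_obs
```

A random forest or a neural net would simply replace the local averaging step by another regression of the indicator on θ.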

In a sense, Bayesian synthetic likelihood does not convey the same appeal, since it is a bit more of a tough cookie: the quantities to approximate, a mean vector and a covariance matrix, are multidimensional. (BSL is always more expensive!)
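For contrast, here is a minimal sketch of a generic synthetic-likelihood evaluation (a Gaussian plug-in on toy summaries, not the authors' implementation): the summary distribution at θ is approximated by a Gaussian whose mean and covariance are estimated from m simulated summary vectors.

```python
import numpy as np

def synthetic_loglik(s_obs, S_sim):
    """Gaussian log-density of s_obs under the plug-in mean and covariance
    estimated from the m x d array of simulated summaries S_sim."""
    mu = S_sim.mean(axis=0)
    Sigma = np.cov(S_sim, rowvar=False)
    diff = s_obs - mu
    _, logdet = np.linalg.slogdet(Sigma)
    quad = diff @ np.linalg.solve(Sigma, diff)
    return -0.5 * (len(mu) * np.log(2 * np.pi) + logdet + quad)

# toy comparison: summaries simulated at a matching vs. a mismatched parameter
rng = np.random.default_rng(1)
s_obs = np.zeros(3)
S_match = rng.multivariate_normal(np.zeros(3), np.eye(3), size=500)
S_off = rng.multivariate_normal(np.full(3, 2.0), np.eye(3), size=500)
ll_match = synthetic_loglik(s_obs, S_match)
ll_off = synthetic_loglik(s_obs, S_off)
```

The extra cost relative to the one-dimensional h(θ) above is visible: a d-dimensional mean and a d×d covariance must be estimated at every proposed θ.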

As a side remark, the authors use two chains in parallel to simplify convergence proofs, as we did a while ago with AMIS!

## the new DIYABC-RF

Posted in Books, pictures, R, Statistics, Wines on April 15, 2021 by xi'an

My friends and co-authors from Montpellier released last month the third version of the DIYABC software, DIYABC-RF, which includes and promotes the use of random forests for parameter inference and model selection, in connection with Louis Raynal’s thesis. It is intended, as were the earlier versions of DIYABC, for population genetic applications. Bienvenue!!!

The software DIYABC Random Forest (hereafter DIYABC-RF) v1.0 is composed of three parts: the dataset simulator, the Random Forest inference engine and the graphical user interface. The whole is packaged as a standalone and user-friendly graphical application named DIYABC-RF GUI and available at https://diyabc.github.io. The different developer and user manuals for each component of the software are available on the same website. DIYABC-RF is a multithreaded software running on three operating systems: GNU/Linux, Microsoft Windows and MacOS. The program can be used through a modern and user-friendly graphical interface designed as an R shiny application (Chang et al. 2019). For a fluid and simplified user experience, this interface is available through a standalone application, which does not require installing R or any dependencies and hence can be used independently. The application is also implemented in an R package providing a standard shiny web application (with the same graphical interface) that can be run locally as any shiny application, or hosted as a web service to provide a DIYABC-RF server for multiple users.

## ABC & the eighth plague of Egypt [locusts in forests]

Posted in Books, pictures, Statistics, Travel, University life on April 6, 2021 by xi'an

“If you refuse to let them go, I will bring locusts into your country tomorrow. They will cover the face of the ground so that it cannot be seen. They will devour what little you have left after the hail, including every tree that is growing in your fields. They will fill your houses and those of all your officials and all the Egyptians.” Exodus 10:3-6

Marie-Pierre Chapuis, Louis Raynal, and co-authors, mostly from Montpellier, published last year a paper on the evolutionary history of the African arid-adapted pest locust, Schistocerca gregaria, called the eighth plague of Egypt in the Bible. And a cause for a major food disaster in East Africa over the past months. The analysis was run with ABC-RF techniques. The paper was first reviewed in PCI Evolutionary Biology, with the following points:

The present-day distribution of extant species is the result of the interplay between their past population demography (e.g., expansion, contraction, isolation, and migration) and adaptation to the environment (…) The understanding of the key factors driving species evolution gives important insights into how the species may respond to changing conditions, which can be particularly relevant for the management of harmful species, such as agricultural pests.

Meaningful demographic inferences present major challenges. These include formulating evolutionary scenarios fitting species biology and the eco-geographical context and choosing informative molecular markers and accurate quantitative approaches to statistically compare multiple demographic scenarios and estimate the parameters of interest. A further issue comes with result interpretation. Accurately dating the inferred events is far from straightforward since reliable calibration points are necessary to translate the molecular estimates of the evolutionary time into absolute time units (i.e. years). This can be attempted in different ways (…) Nonetheless, most experimental systems rarely meet these conditions, hindering the comprehensive interpretation of results.

The contribution of Chapuis et al. addresses these issues to investigate the recent history of the (…) desert locust (…) Owing to their fast mutation rate, microsatellite markers offer at least two advantages: i) suitability for analyzing recently diverged populations, and ii) a direct estimate of the germline mutation rate in pedigree samples (…) The main aim of the study is to infer the history of divergence of the two subspecies of the desert locust, which have spatially disjoint distributions corresponding to the dry regions of North and West-South Africa. They first use paleo-vegetation maps to formulate hypotheses about changes in species range since the last glacial maximum. Based on these, they generate 12 divergence models. For the selection of the demographic model and parameter estimation, they apply the recently developed ABC-RF approach (…) Some methodological novelties are also introduced in this work, such as the computation of the error associated with the posterior parameter estimates under the best scenario (…) The best-supported model suggests a recent divergence event of the subspecies of S. gregaria (around 2.6 kya) and a reduction of population size in one of the subspecies (S. g. flaviventris) that colonized the southern distribution area. As such, results did not support the hypothesis that the southward colonization was driven by the expansion of African dry environments associated with the last glacial maximum (…) The estimated time of divergence points at a much more recent origin for the two subspecies, during the late Holocene, in a period corresponding to fairly stable arid conditions similar to current ones. Although the authors cannot exclude that their microsatellite data bear limited information on older colonization events than the last one, they bring arguments in favour of alternative explanations.
The privileged hypothesis does not involve climatic drivers, but rather the particularly efficient dispersal behaviour of the species, whose individuals are able to fly over long distances (up to thousands of kilometers) under favourable windy conditions (…)

There is a growing number of studies in phylogeography in arid regions in the Southern hemisphere, but the impact of past climate changes on the species distribution in this region remains understudied relative to the Northern hemisphere. The study presented by Chapuis et al. offers several important insights into demographic changes and the evolutionary history of an agriculturally important pest species in Africa, which could also mirror the history of other organisms in the continent (…)

Microsatellite markers have been offering a useful tool in population genetics and phylogeography for decades (…) This study reaffirms the usefulness of these classic molecular markers to estimate past demographic events, especially when species- and locus-specific microsatellite mutation features are available and a powerful inferential approach is adopted. Nonetheless, there are still hurdles to overcome, such as the limitations in scenario choice associated with the simulation software used (e.g. not allowing for continuous gene flow in this particular case), which calls for further improvement of simulation tools allowing for more flexible modeling of demographic events and mutation patterns. In sum, this work not only contributes to our understanding of the makeup of the African biodiversity but also offers a useful statistical framework, which can be applied to a wide array of species and molecular markers.
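The ABC-RF model-choice step invoked in the review can be sketched on a purely illustrative toy example (a generic scikit-learn random forest, not the DIYABC-RF implementation): simulate summary statistics under each candidate scenario, train a forest to classify scenarios from summaries, then read off the trees' votes at the observed summaries.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
n_per_model = 1000

# two toy scenarios producing 4-dimensional summaries with different means
S_model1 = rng.normal(0.0, 1.0, size=(n_per_model, 4))
S_model2 = rng.normal(1.5, 1.0, size=(n_per_model, 4))
X = np.vstack([S_model1, S_model2])
y = np.repeat([1, 2], n_per_model)

# train the forest on the simulated reference table
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# pseudo-observed summaries, deliberately close to scenario 2
s_obs = np.full((1, 4), 1.4)
best = rf.predict(s_obs)[0]
vote_share = rf.predict_proba(s_obs)[0]  # proportion of trees voting for each scenario
```

In the actual ABC-RF methodology the vote proportions are not used directly as posterior probabilities; a second regression forest is trained to estimate the posterior probability of the selected scenario.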

## quoting a book review of mine?!

Posted in Books, Statistics, University life on January 10, 2020 by xi'an

## selecting summary statistics [a tale of two distances]

Posted in Books, Statistics on May 23, 2019 by xi'an

As Jonathan Harrison came to give a seminar in Warwick [which I could not attend], it made me aware of his paper with Ruth Baker on the selection of summaries in ABC. The setting is an ABC-SMC algorithm, and it relates to Fearnhead and Prangle (2012), Barnes et al. (2012), our own random forest approach, the neural network version of Papamakarios and Murray (2016), and others. The notion here is to seek the optimal weights of the different summary statistics in the tolerance distance, towards maximizing a distance (Hellinger) between the prior and the ABC posterior (Wasserstein also comes to mind!). A sort of dual of the least informative prior. This distance is estimated by a k-nearest-neighbour version [based on samples from the prior and from the ABC posterior] I had never seen before. I first did not get how this k-nearest-neighbour distance could be optimised in the weights, since the posterior sample was already generated and (SMC) weighted, but the ABC sample can be modified by changing the [tolerance] distance weights, and the resulting Hellinger distance optimised this way. (There are two distances involved, in case the above description is too murky!)

“We successfully obtain an informative unbiased posterior.”

The paper spends a significant amount of time demonstrating that the k-nearest-neighbour estimator converges, and much less on the optimisation procedure itself, which seems like a real challenge to me when facing a large number of particles and a high enough dimension (in the number of statistics). (In the examples, the size of the summary is 1 (where does the weight matter?), 32, 96, and 64, with 5×10⁴, 5×10⁴, 5×10³ and…10 particles, respectively.) The authors address the issue, though, albeit briefly, by mentioning that, for the same overall computation time, the adaptive-weight ABC posterior is indeed further from the prior than a regular ABC with uniform weights [rather than weighted by the precisions]. They also argue that down-weighting some components is akin to selecting a subset of summaries, but I beg to disagree with this statement, as the weights are never exactly zero, as far as I can see, hence failing to fight the curse of dimensionality. Some LASSO version could implement this feature.
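The two-distance mechanism can be crudely illustrated on a toy model of my own (one informative and one pure-noise summary, with a naive one-dimensional kNN plug-in for the Hellinger distance, not the authors' estimator): weighting the tolerance distance towards the informative summary pushes the ABC posterior further from the prior.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
theta = rng.uniform(-5, 5, n)              # prior draws
S1 = theta + rng.normal(0, 0.5, n)         # informative summary
S2 = rng.normal(0, 1, n)                   # pure-noise summary
s_obs = (1.0, 0.0)                         # pseudo-observed summaries

def abc_posterior(w, keep=0.05):
    """Keep the `keep` fraction of prior draws closest in weighted distance."""
    d = np.sqrt(w[0] * (S1 - s_obs[0])**2 + w[1] * (S2 - s_obs[1])**2)
    return theta[d <= np.quantile(d, keep)]

def knn_radii(points, queries, k, exclude_self=False):
    """Distance from each query to its k-th nearest point in `points`."""
    d = np.abs(queries[:, None] - points[None, :])
    d.sort(axis=1)
    return d[:, k] if exclude_self else d[:, k - 1]

def hellinger2(prior_sample, post_sample, k=10, n_query=300):
    """H^2 = 1 - E_prior[sqrt(q/p)], with kNN plug-ins for densities p and q."""
    q_pts = prior_sample[:n_query]
    r_p = knn_radii(prior_sample, q_pts, k, exclude_self=True)
    r_q = knn_radii(post_sample, q_pts, k)
    ratio = (len(prior_sample) * r_p) / (len(post_sample) * r_q)
    return max(0.0, 1.0 - np.sqrt(ratio).mean())

h_info = hellinger2(theta, abc_posterior((1.0, 0.0)))   # all weight on S1
h_noise = hellinger2(theta, abc_posterior((0.0, 1.0)))  # all weight on S2
```

Since the noise-weighted posterior is essentially the prior, h_noise stays near zero while h_info does not; an outer optimiser over the weights would then select the informative summary. With a continuum of weights rather than this two-point comparison, the optimisation cost the post worries about becomes apparent.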