## distributed evidence

Posted in Books, pictures, Statistics, University life on December 16, 2021 by xi'an

Alexander Buchholz (who did his PhD at CREST with Nicolas Chopin), Daniel Ahfock, and my friend Sylvia Richardson published a great paper on the distributed computation of Bayesian evidence in Bayesian Analysis. The setting is one of distributed data from several sources with no communication between them, which relates to consensus Monte Carlo even though model choice has not been particularly studied from that perspective. The authors operate under the assumption of conditionally conjugate models, i.e., the existence of a data augmentation scheme into an exponential family so that conjugate priors can be used. For a division of the data into S blocks, the fundamental identity in the paper is

$p(y) = \alpha^S \prod_{s=1}^S \tilde p(y_s) \int \prod_{s=1}^S \tilde p(\theta|y_s)\,\text d\theta$

where α is the normalising constant of the sub-prior exp{log[p(θ)]/S} and the other terms are associated with this prior. Under the conditionally conjugate assumption, the integral can be approximated based on the latent variables. Most interestingly, the associated variance is directly connected with the variance of

$p(z_{1:S}|y)\Big/\prod_{s=1}^S \tilde p(z_s|y_s)$

under the joint:

“The variance of the ratio measures the quality of the product of the conditional sub-posterior as an importance sample proposal distribution.”

This assumes the variance is finite, which is likely. An approximate alternative is proposed, namely to replace the exact sub-posteriors with Normal distributions, as in consensus Monte Carlo, which obviously requires some consideration as to which parameterisation of the model produces the “most normal” (or the least abnormal!) posterior, and which ensures a finite variance in the importance sampling approximation (as guaranteed by the strong bounds in Proposition 5). This is a problem shared by the bridgesampling package.
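As a sanity check, the fundamental identity can be verified numerically in a toy Gaussian-Gaussian model, where every term is available in closed form or by quadrature. This is only a minimal sketch: the model, shard sizes, and variable names below are illustrative and not taken from the paper's implementation.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Toy conjugate model: theta ~ N(0, tau2), y_i | theta ~ N(theta, sigma2),
# with the data split into S shards.  All names and values here are
# illustrative, not taken from the paper.
rng = np.random.default_rng(1)
sigma2, tau2, S = 1.0, 4.0, 4
shards = [rng.normal(0.5, np.sqrt(sigma2), size=25) for _ in range(S)]

def log_marginal(data, prior_var):
    """log marginal likelihood of data ~ N(theta, sigma2 I) with theta ~ N(0, prior_var)."""
    n = len(data)
    cov = sigma2 * np.eye(n) + prior_var * np.ones((n, n))
    return stats.multivariate_normal(np.zeros(n), cov).logpdf(data)

# Left-hand side: full-data evidence under the original prior N(0, tau2)
log_py = log_marginal(np.concatenate(shards), tau2)

# Sub-prior p~(theta) = p(theta)^{1/S} / alpha is N(0, S*tau2); alpha in closed form
log_alpha = -np.log(2 * np.pi * tau2) / (2 * S) + 0.5 * np.log(2 * np.pi * S * tau2)

# Sub-posterior mean and variance for each shard (normal-normal conjugacy)
post = []
for ys in shards:
    prec = 1 / (S * tau2) + len(ys) / sigma2
    post.append((ys.sum() / sigma2 / prec, 1 / prec))

# Integral of the product of the S sub-posterior densities, by quadrature
def product_density(theta):
    return np.exp(sum(stats.norm(m, np.sqrt(v)).logpdf(theta) for m, v in post))

ms, sds = [m for m, _ in post], [np.sqrt(v) for _, v in post]
integral, _ = quad(product_density, min(ms) - 10 * max(sds), max(ms) + 10 * max(sds))

# Right-hand side of the identity: alpha^S x prod of sub-marginals x integral
log_py_dist = (S * log_alpha
               + sum(log_marginal(ys, S * tau2) for ys in shards)
               + np.log(integral))

print(log_py, log_py_dist)  # the two log evidences coincide
```

The quadrature step is where the latent-variable (or Normal) approximations of the paper would enter in a realistic, non-conjugate setting.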

“…if the error that comes from MCMC sampling is relatively small and that the shard sizes are large enough so that the quality of the subposterior normal approximation is reasonable, our suggested approach will result in good approximations of the full data set marginal likelihood.”
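In the same toy spirit, the consensus-style alternative amounts to fitting a Normal N(m_s, v_s) to each batch of sub-posterior draws and exploiting the fact that a product of Gaussian densities integrates in closed form. Again a hypothetical sketch: the simulated draws below merely stand in for S independent MCMC outputs, one per shard.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

rng = np.random.default_rng(2)

# Stand-in for S independent sub-posterior MCMC outputs (values are made up)
S = 4
draws = [rng.normal(0.5 + 0.05 * s, 0.2, size=5000) for s in range(S)]

# Normal approximation N(m_s, v_s) of each sub-posterior, consensus-MC style
m = np.array([d.mean() for d in draws])
v = np.array([d.var(ddof=1) for d in draws])

# log of int prod_s N(theta; m_s, v_s) dtheta in closed form: the product of
# Gaussians is itself Gaussian, with precision sum(1/v_s) and mean m*
prec = (1.0 / v).sum()
mstar = (m / v).sum() / prec
log_integral = (-0.5 * np.log(2 * np.pi * v).sum()
                + 0.5 * np.log(2 * np.pi / prec)
                - 0.5 * ((m ** 2 / v).sum() - mstar ** 2 * prec))

# sanity check against direct quadrature of the product of densities
def product_density(t):
    return np.exp(sum(stats.norm(mi, np.sqrt(vi)).logpdf(t)
                      for mi, vi in zip(m, v)))

check, _ = quad(product_density, m.min() - 2, m.max() + 2)
print(log_integral, np.log(check))  # the two values agree
```

Plugging this closed-form integral into the fundamental identity, in place of the exact one, gives the Normal-approximation evidence estimate; its quality hinges on how Gaussian the sub-posteriors really are, hence the parameterisation issue above.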

The resulting approximation can also prove handy in conjunction with reversible jump MCMC, in the sense that RJMCMC algorithms can be run in parallel on different chunks or shards of the entire dataset, although the computing gain may be reduced by the need for separate approximations.

## Introduction to Sequential Monte Carlo [book review]

Posted in Books, Statistics on June 8, 2021 by xi'an

[Warning: Due to many CoI, from Nicolas being a former PhD student of mine, to his being a current colleague at CREST, to Omiros being co-deputy-editor for Biometrika, this review will not be part of my CHANCE book reviews.]

My friends Nicolas Chopin and Omiros Papaspiliopoulos wrote in 2020 An Introduction to Sequential Monte Carlo (Springer), a book that took several years to complete and which I find remarkably coherent in its unified presentation. Particle filters and more broadly sequential Monte Carlo have expanded considerably in the last 25 years and I find it difficult to keep track of the main advances given the expansive and heterogeneous literature. The book is also quite careful in its mathematical treatment of the concepts and, while the Feynman-Kac formalism is somewhat scary, it provides a careful introduction to the sampling techniques relating to state-space models and to their asymptotic validation. As an introduction it does not go to the same depths as Pierre Del Moral’s 2004 book or our 2005 book (Cappé et al.), but it proposes a unified treatment of the most recent developments, including SMC² and ABC-SMC. There is even a chapter on sequential quasi-Monte Carlo, naturally connected to Mathieu Gerber’s and Nicolas Chopin’s 2015 Read Paper. Another significant feature is the articulation of the practical part around a massive Python package called particles [what else?!]. While the book is intended as a textbook, and has been used as such at ENSAE and in other places, there are only a few exercises per chapter and they are not necessarily manageable (e.g., Exercise 7.1, the unique exercise of the very short Chapter 7). The style is highly pedagogical; take for instance Chapter 10 on the various particle filters, with a detailed and separate analysis of the input, algorithm, and output of each of them. Examples are only strategically used, when comparing methods or illustrating convergence. While the MCMC chapter (Chapter 15) is surprisingly short, it actually serves as an introduction to the massive chapter on particle MCMC (and as a teaser for an incoming Papaspiliopoulos, Roberts and Tweedie, a slow-cooking dish that has now been baking for quite a while!).

## approximate Bayesian inference [survey]

Posted in Statistics on May 3, 2021 by xi'an

In connection with the special issue of Entropy I mentioned a while ago, Pierre Alquier (formerly of CREST) has written an introduction to the topic of approximate Bayesian inference that is worth advertising (and is freely available as well). Its reference list is particularly relevant. (The deadline for submissions is 21 June.)

## [de]quarantined by slideshare

Posted in Books, pictures, Statistics, University life on January 11, 2021 by xi'an

## democracy suffers when government statistics fail [review of a book review]

Posted in Books, Statistics, Travel on October 13, 2020 by xi'an

This week, rather extraordinarily!, the Nature book review was about official statistics, with a review of Julia Lane’s Democratizing our Data. (The democratizing in the title is painful to watch, though!) The reviewer is Beth Simone Noveck, who was deputy chief technology officer under Barack Obama and is a major researcher in digital democracy, no less! (By comparison, Trump’s deputy chief technology officer had a B.A. in politics and no other qualification for the job, but nonetheless got promoted to chief…)

“Lane asserts that the United States is failing to adequately track its population, economy and society. Agencies are stagnating. The census dramatically undercounts people from minority racial groups. There is no complete national list of households. The data are made available two years after the count, making them out of date as the basis for effective policy making.” B.S. Noveck

The debate raised by the book on the ability of official statistics to keep track of people in a timely manner is most interesting. And not limited to the USA, even though it seems to fit in a Hell of its own:

“In the United States, there is no single national statistical agency. The process of gathering and publishing public data is fragmented across multiple departments and agencies, making it difficult to introduce new ideas across the whole enterprise. Each agency is funded by, and accountable to, a different congressional committee. Congress once sued the commerce department for attempting to introduce modern techniques of statistical sampling to shore up a flawed census process that involves counting every person by hand.” B.S. Noveck

This remark brings back to (my) mind the titanic debates of the 1990s, when Republicans attacked sampling techniques and statisticians like Steve Fienberg rose to their defence. (Although others, like David Freedman, opposed the move, paradoxically mistrusting statistics!) The French official statistics institute, INSEE, has been running sampled censuses for decades now, without the national parliament going up in arms. I am certainly being partial, having been associated with INSEE, its statistics school ENSAE, and its research branch CREST since 1982, but it seems to me that the hiring of highly skilled and thoroughly trained civil servants by this institute helps in making the statistics it produces more trustworthy and efficient, including when measuring the impact of public policies. (Even though accusations of delay and bias show up regularly.) It also makes the institute more prone to adopt new methods, thanks to the rotation of its agents. (B.S. Noveck notices and deplores the absence of references to foreign agencies in the book.)

“By contrast, the best private-sector companies produce data that are in real time, comprehensive, relevant, accessible and meaningful.”  B.S. Noveck

However, the notion in the review (and the book?) that private companies are necessarily doing better is harder to buy, even if it makes for an easy jab at a public institution. Indeed, public official statistics institutes are the only ones with access to data covering the entire population, either directly or through other public bodies, like the IRS or social security claims. And trusting the few companies with a similar reach is beyond naïve (even though a company like Amazon has an almost instantaneous and highly local sensor of economic and social conditions!). It is also at odds with the call for democratizing, as shown by the impact of some of these companies on US elections.