Monte Carlo swindles

Posted in Statistics on April 2, 2023 by xi'an

While reading Boos and Hughes-Oliver's 1998 American Statistician paper on the applications of Basu's theorem, I came across the notion of Monte Carlo swindles, where a reduced variance can be achieved without a corresponding increase in the Monte Carlo budget. For instance, approximating the variance of the median statistic M for a Normal location family can be sped up by considering that

\text{var}(M)=\text{var}(M-\bar X)+\text{var}(\bar X)

by Basu’s theorem. However, when reading the originating 1973 paper by Gross (although the notion is presumably due to Tukey), the argument boils down to Rao-Blackwellisation (without the Rao-Blackwell theorem being mentioned). The related 1985 American Statistician paper by Johnstone and Velleman exploits a latent variable representation. It also makes the connection with the control variate approach, noticing the appeal of using the score function as a (standard) control and (unusual) swindle, since its expectation is zero. I am surprised at uncovering this notion only now… Possibly because the method only applies in special settings.
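In case a concrete illustration helps, here is a minimal NumPy sketch of the swindle (my own illustration, not code from Gross's paper), for a standard Normal sample of size n = 11: the plain Monte Carlo estimate of var(M) is set against the swindled version, which only estimates var(M − X̄) by simulation and adds the exact var(X̄) = σ²/n, as justified by Basu's theorem (M − X̄ is ancillary, hence independent of the complete sufficient X̄).

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps, sigma = 11, 100_000, 1.0

# reps samples of size n from the N(0, sigma²) member of the location family
x = rng.normal(0.0, sigma, size=(reps, n))
med = np.median(x, axis=1)   # the median statistic M
xbar = x.mean(axis=1)        # the sample mean, complete sufficient for the location

# plain Monte Carlo estimate of var(M)
naive = med.var(ddof=1)

# swindle: var(M) = var(M - Xbar) + var(Xbar), with var(Xbar) = sigma²/n known exactly
swindle = (med - xbar).var(ddof=1) + sigma**2 / n
```

Both estimators are unbiased for var(M), but the swindled version only carries the Monte Carlo noise of the much less variable M − X̄, hence converges faster for the same simulation budget.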

A side remark from the same 1998 paper, namely that the enticing decomposition

\mathbb E[(X/Y)^k] = \mathbb E[X^k] \big/ \mathbb E[Y^k]

when X/Y and Y are independent, should be kept out of reach of my undergraduates at all costs, as they would quickly get rid of the independence assumption!!!
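For the record, the one-line argument that makes the independence assumption indispensable (a standard derivation, added here for completeness): writing X^k = (X/Y)^k Y^k,

```latex
\mathbb{E}[X^k] \;=\; \mathbb{E}\big[(X/Y)^k\,Y^k\big] \;=\; \mathbb{E}\big[(X/Y)^k\big]\,\mathbb{E}[Y^k]
```

where the second equality uses the independence of X/Y and Y, and dividing by \mathbb{E}[Y^k] (assumed non-zero) gives the displayed identity. Without independence, the cross-moment does not factorise.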


Posted in Books, pictures, University life on April 1, 2023 by xi'an

I was pointed to an interesting NYT editorial of March 8, 2023, on ChatGPT, written by Noam Chomsky, Ian Roberts and Jeffrey Watumull.

“we fear that the most popular and fashionable strain of A.I. — machine learning — will degrade [linguistics] and debase our ethics by incorporating into our technology a fundamentally flawed conception of language and knowledge.”

Starting with a quote of Jorge Luis Borges, most appropriately for the dystopian prospects brought by the new chatbots. And seeing the arrival of these machines as something trivial that operates in contrast with the human mind by making use of terabytesque amounts of data and (cleverly) extrapolating to suit the question. Which is to say that they are merely (?) much better interfaces at reproducing patterns found in their databases. This remains a technical feat but, given the lack of reliability of their output (cf my exam answers) and the correlated lack of uncertainty in their assessment, they are very much useless at explanations. (But sometimes useful as typewriting monkeys for recommendation letters.)

“The crux of machine learning is description and prediction; it does not posit any causal mechanisms or physical laws.”

The second part of the tribune points out the amorality of such platforms, unable to reach a moral position. This is illustrated by Q&As about the morality of terraforming another planet (which I cannot connect with morality if there is no sentient life on that planet). While I see the point as a fundamental distinction between humans and AIs, I would feel uncomfortable with the latter producing moral judgements, as this would imply a choice of moral rules in their training, since there is no universal moral ground beyond the “obvious”… (Actually, by presenting arguments in an authoritative manner, rarely with provisions for being wrong or incomplete, ChatGPT is agreeing to lie by omission!)

“Given the amorality, faux science and linguistic incompetence of these systems, we can only laugh or cry at their popularity.”

Number savvy [book review]

Posted in Books, Statistics on March 31, 2023 by xi'an

“This book aspires to contribute to overall numeracy through a tour de force presentation of the production, use, and evolution of data.”

Number Savvy: From the Invention of Numbers to the Future of Data is written by George Sciadas, a statistician working at Statistics Canada. This book is mostly about data, even though it starts with the “compulsory” tour of the invention(s) of numbers and the evolution towards a mostly universal system and the issue of measurements (with a funny if illogical/anti-geographical confusion in “gare du midi in Paris and gare du Nord in Brussels”, since Gare du Midi (south) is in Brussels while Gare du Nord (north) is in Paris). The chapter (Chap. 3) on census and demography is quite detailed about the hurdles preventing an exact count of a population, but much less about the methods employed to improve the estimation. (The request for me to fill the short form for the 2023 French Census actually came while I was reading the book!)

The next chapter links measurement with socio-economic notions or models, like the unemployment rate, which depends on so many criteria (pp. 77-87) that its measurement sounds impossible or arbitrary. Almost as arbitrary as the reported number of protesters in a French demonstration! Same difficulty with the GDP, whose interpretation seems beyond the grasp of the common reader. And which does not cover significantly missing(-not-at-random) data like tax evasion, money laundering, and the grey economy. (Nitpicking: if GDP went down by 0.5% one year and up by 0.5% the year after, this does not exactly compensate!) Chapter 5 reflects upon the importance of definitions and boundaries in creating official statistics and categorical data. A chapter (Chap 6) on the gathering of data in the past (read: prior to the “Big Data” explosion) prepares the ground for the chapter on the current setting. Mostly about surveys, presented as definitely from the past, “shadows of their old selves”. And with anecdotes reminding me of my only experience as a survey interviewer (on Xmas practices!). About administrative data, progressively moving from collected by design to available for any prospection (or “farming”). A short chapter compared with the one (Chap 7) on new data (types), mostly customer, private-sector, data. Covering the data accumulated by big tech companies, but not particularly illuminating (with bar-room remarks like “Facebook users tend to portray their lives as they would like them to be. Google searches may reflect more truthfully what people are looking for.”)

The following Chapter 8 is somewhat confusing in its defence of microdata, by which I understand keeping the raw data rather than averaging through summary statistics. Synthetic data is mentioned there, but without reference to a reference model, while machine learning makes a very brief appearance (p.222). In Chapter 9, (statistical) data analysis is [at last!] examined, but mostly through descriptive statistics. Except for a regression model and a discussion of the issues around hypothesis testing, with Bayesian testing making its unique visit, albeit confusedly in-between references to Taleb's Black swan, Gödel's incompleteness theorem (which always seems to fascinate authors of general-public science books!), and Kahneman and Tversky's prospect theory. Somewhat surprisingly, the chapter also includes a Taoist tale about the farmer getting in turn lucky and unlucky… A tale that was already used in What are the chances?, which I reviewed two years ago. As this is a very established parable dating back at least to the 2nd century B.C., there is no copyright involved, but what are the chances the story finds its way that quickly into another book?!

The last chapter is about the future, unsurprisingly. With predictions of “plenty of black boxes”, “statistical lawlessness”, “data pooling” and data as a commodity (which relates to some themes of our OCEAN ERC-Synergy grant). Although the solution favoured by the author is centralised, through a (national) statistics office or another “trusted third party”. The last section is about the predicted end of theory, since “simply looking at data can reveal patterns”, but resisting the prophets of doom and idealising the Rise of the (AI) machines… The lyrical conclusion that “With both production consolidation and use of data increasingly in the ‘hands' of machines, and our wise interventions, the more distant future will bring complete integrations” sounds too much like Brave New World for my taste!

“…the privacy argument is weak, if not hypocritical. Logically, it’s hard to fathom what data that we share with an online retailer or a delivery company we wouldn’t share with others (…) A naysayer will say nay.” (p.190)

The way the book reads and unrolls is somewhat puzzling to this reader, as it sounds like a sequence of common-sense remarks with a Guesstimation flavour on the side, and tiny historical or technical facts, some unknown and most of no interest to me, while lacking in the larger picture. For instance, the long-winded tale on evaluating the cumulated size of a neighbourhood's lawns (p.34-38) does not seem to be getting anywhere. The inclusion of so many warnings, misgivings, and alternatives in the collection and definition of data may have the counter-effect of discouraging readers from making sense of numeric concepts and trusting the conclusions of data-based analyses. The constant switch in perspective(s) and the apparent absence of definite conclusions are also exhausting. Furthermore, I feel that the author and his rosy prospects are repeatedly minimizing the risks of data collection on individual privacy and freedom, when presenting the platforms as a solution to a real-time census (as, e.g., p.178), as exemplified by the high social control exercised by some number-savvy dictatorships! And he is highly critical of EU regulations such as GDPR, “less-than-subtle” (p.267), “with its huge impact on businesses” (p.268). I am thus overall uncertain which audience this book will eventually reach.

[Disclaimer about potential self-plagiarism: this post or an edited version will potentially appear in my Books Review section in CHANCE.]

même pas peur [not afrAId]

Posted in Books, Kids, Travel, University life on March 30, 2023 by xi'an

Both the Beeb and The New York Times are posting tonight about a call to pause AI experiments, by AI researchers and others, due to the danger they could pose to humanity. While this reminds me of Superintelligence, a book by Nick Bostrom I found rather unconvincing, and although I agree that automated help-to-decision systems should not become automated decision systems, I am rather surprised at them setting the omnipresent Chat-GPT as the reference not to be exceeded.

“AI systems with human-competitive intelligence can pose profound risks to society and humanity (…) recent months have seen AI labs locked in an out-of-control race to develop and deploy ever more powerful digital minds that no one – not even their creators – can understand, predict, or reliably control.”

The central (?) issue here is whether something like Chat-GPT can claim any intelligence, when pumping data from an (inevitably biased) database and producing mostly coherent sentences without any attention to facts. Which is useful when polishing a recommendation letter at the same level as a spelling corrector (but requires checking for potential fake-fact inclusions, like imaginary research prizes!)

“Contemporary AI systems are now becoming human-competitive at general tasks, and we must ask ourselves: Should we let machines flood our information channels with propaganda and untruth? Should we automate away all the jobs, including the fulfilling ones? Should we develop nonhuman minds that might eventually outnumber, outsmart, obsolete and replace us? Should we risk loss of control of our civilization?”

The increasingly doom-mongering tone of the above questions is rather over the top (civilization, nothing less?!) and again reminiscent of Superintelligence, while spreading propaganda and untruth need not wait for super-AIs to reach conspiracy theorists.

“Such decisions must not be delegated to unelected tech leaders. Powerful AI systems should be developed only once we are confident that their effects will be positive and their risks will be manageable (…) Therefore, we call on all AI labs to immediately pause for at least 6 months the training of AI systems more powerful than GPT-4. This pause should be public and verifiable, and include all key actors. If such a pause cannot be enacted quickly, governments should step in”

A six-month pause sounds inappropriate for an existential danger, while the belief that governments want to or can intervene sounds rather naïve, given for instance that they lack the ability to judge the danger of the threat and the safety nets to be imposed on gigantic black-box systems. Who can judge the positivity and risk of a billion (trillion?) parameter model? Why is being elected any guarantee of fairness or acumen? Beyond dictatorships thriving on surveillance automata, more democratic countries are also happily engaging in problematic AI experiments, incl. AI surveillance of the incoming Paris Olympics. (Another valuable reason to stay away from Paris over the games.)

“AI research and development should be refocused on making today’s powerful, state-of-the-art systems more accurate, safe, interpretable, transparent, robust, aligned, trustworthy, and loyal. In parallel, AI developers must work with policymakers to dramatically accelerate development of robust AI governance systems.”

While these are worthy goals at a conceptual level (with the practical issue of defining precisely each of these lofty adjectives), and although I am certainly missing a lot from my ignorance of the field, this call remains a mystery to me, as it sounds unrealistic that it could achieve its goal.

Bayesian inference: challenges, perspectives, and prospects

Posted in Books, Statistics, University life on March 29, 2023 by xi'an

Over the past year, Judith, Michael and I edited a special issue of Philosophical Transactions of the Royal Society on Bayesian inference: challenges, perspectives, and prospects, in celebration of the current President of the Royal Society, Adrian Smith, and his contributions to Bayesian analysis that have impacted the field up to this day. The issue is now out! The following is the beginning of our introduction of the series.

When contemplating his past achievements, it is striking to align the emergence of massive advances in these fields with some of his papers or books. For instance, Lindley & Smith's ‘Bayes Estimates for the Linear Model' (1971), a Read Paper at the Royal Statistical Society, makes the case for the Bayesian analysis of this most standard statistical model, as well as emphasizing the notion of exchangeability that is foundational in Bayesian statistics, and paving the way for the emergence of hierarchical Bayesian modelling. It thus makes a link between the early days of Bruno de Finetti, whose work Adrian Smith translated into English, and the current research in non-parametric and robust statistics. Bernardo & Smith's masterpiece, Bayesian Theory (1994), sets statistical inference within decision- and information-theoretic frameworks in a most elegant and universal manner that could be deemed a Bourbaki volume for Bayesian statistics if this classification endeavour had reached further than pure mathematics. It also emphasizes the central role of hierarchical modelling in the construction of priors, as exemplified in Carlin et al.'s ‘Hierarchical Bayesian analysis of change point problems' (1992).

The series of papers published in 1990 by Alan Gelfand & Adrian Smith, esp. ‘Sampling-Based Approaches to Calculating Marginal Densities' (1990), is overwhelmingly perceived as the birth date of modern Markov chain Monte Carlo (MCMC) methods, as it brought to the whole statistics community (and quickly to wider communities) the realization that MCMC simulation was the sesame to unlock complex modelling issues. The consequences on the adoption of Bayesian modelling by non-specialists are enormous and long-lasting. Similarly, Gordon et al.'s ‘Novel approach to nonlinear/non-Gaussian Bayesian state estimation' (1993) is considered as the birthplace of sequential Monte Carlo, aka particle filtering, with considerable consequences in tracking, robotics, econometrics and many other fields. Titterington, Smith & Makov's reference book, ‘Statistical Analysis of Finite Mixture Distributions' (1985), is a precursor in the formalization of heterogeneous data structures, paving the way for the incoming MCMC resolutions like Tanner & Wong (1987), Gelman & King (1990) and Diebolt & Robert (1990). Denison et al.'s book, ‘Bayesian methods for nonlinear classification and regression' (2002), is another testimony to the influence of Adrian Smith on the field, stressing the emergence of robust and general classification and nonlinear regression methods to analyse complex data, prefiguring in a way the later emergence of machine-learning methods, with the additional Bayesian assessment of uncertainty. It also brings forward the capacity of operating Bayesian non-parametric modelling that is now broadly accepted, following a series of papers by Denison et al. in the late 1990s like CART and MARS.

We are quite grateful to the authors contributing to this volume, namely Joshua J. Bon, Adam Bretherton, Katie Buchhorn, Susanna Cramb, Christopher Drovandi, Conor Hassan, Adrianne L. Jenner, Helen J. Mayfield, James M. McGree, Kerrie Mengersen, Aiden Price, Robert Salomone, Edgar Santos-Fernandez, Julie Vercelloni and Xiaoyu Wang, Afonso S. Bandeira, Antoine Maillard, Richard Nickl and Sven Wang, Fan Li, Peng Ding and Fabrizia Mealli, Matthew Stephens, Peter D. Grünwald, Sumio Watanabe, P. Müller, N. K. Chandra and A. Sarkar, Kori Khan and Alicia Carriquiry, Arnaud Doucet, Eric Moulines and Achille Thin, Beatrice Franzolini, Andrea Cremaschi, Willem van den Boom and Maria De Iorio, Sandra Fortini and Sonia Petrone, Sylvia Frühwirth-Schnatter, S. Wade, Chris C. Holmes and Stephen G. Walker, Lizhen Nie and Veronika Ročková. Some of the papers are open-access, if not all, hence enjoy them!
