Archive for data privacy

postdoctoral research position

Posted in Statistics, Travel, University life on April 27, 2023 by xi'an

Through the ERC Synergy grant OCEAN (On intelligenCE And Networks: Synergistic research in Bayesian Statistics, Microeconomics and Computer Sciences), I am seeking one postdoctoral researcher with an interest in Bayesian federated learning, distributed MCMC, approximate Bayesian inference, and data privacy.

The project is based at Université Paris Dauphine, on the new PariSanté Campus. The postdoc will join the OCEAN teams of researchers directed by Éric Moulines and Christian Robert to work on the above themes, with a focus ranging from statistical theory to Bayesian methodology, algorithms, and medical applications.

Qualifications

The candidate should hold a doctorate in statistics or machine learning, with demonstrated skills in Bayesian analysis and Monte Carlo methodology, a record of publications in these domains, and an interest in working as part of an interdisciplinary international team. Scientific maturity and research autonomy are a must for applying.

Funding

Besides a two-year postdoctoral contract at Université Paris Dauphine (with a possible one-year extension), at a salary of 31K€ per year, the project will fund travel to OCEAN partners’ institutions (University of Warwick or University of California, Berkeley) and participation in yearly summer schools. University benefits are attached to the position and no teaching duty is involved, as per ERC rules.

The postdoctoral work will begin 1 September 2023.

Application Procedure

To apply, preferably before 31 May, please send the following in a single PDF to Christian Robert (bayesianstatistics@gmail.com):

  • a letter of application,
  • a CV,
  • letters of recommendation, sent directly by the recommenders.

Number savvy [book review]

Posted in Books, Statistics on March 31, 2023 by xi'an

“This book aspires to contribute to overall numeracy through a tour de force presentation of the production, use, and evolution of data.”

Number Savvy: From the Invention of Numbers to the Future of Data is written by George Sciadas, a statistician working at Statistics Canada. This book is mostly about data, even though it starts with the “compulsory” tour of the invention(s) of numbers and the evolution towards a mostly universal system and the issue of measurements (with a funny if illogical/anti-geographical confusion in “gare du midi in Paris and gare du Nord in Brussels” since Gare du Midi (south) is in Brussels while Gare du Nord (north) is in Paris). The chapter (Chap. 3) on census and demography is quite detailed about the hurdles preventing an exact count of a population, but much less about the methods employed to improve the estimation. (The request for me to fill in the short form for the 2023 French Census actually came while I was reading the book!)

The next chapter links measurement with socio-economic notions or models, like the unemployment rate, which depends on so many criteria (pp. 77-87) that its measurement sounds impossible or arbitrary. Almost as arbitrary as the reported number of protesters in a French demonstration! Same difficulty with the GDP, whose interpretation seems beyond the grasp of the common reader. And which does not cover significantly missing (-not-at-random) data like tax evasion, money laundering, and the grey economy. (Nitpicking: if GDP went down by 0.5% one year and up by 0.5% the year after, this does not exactly compensate, since 0.995 × 1.005 = 0.999975!) Chapter 5 reflects upon the importance of definitions and boundaries in creating official statistics and categorical data. A chapter (Chap. 6) on the gathering of data in the past (read prior to the “Big Data” explosion) prepares the ground for the chapter on the current setting. Mostly about surveys, presented as definitely from the past, “shadows of their old selves”. And with anecdotes reminding me of my only experience as a survey interviewer (on Xmas practices!). About administrative data, progressively moving from collected by design to available for any prospection (or “farming”). A short chapter compared with the one (Chap. 7) on new data (types), mostly customer, private sector, data. Covering the data accumulated by big tech companies, but not particularly illuminating (with bar-room remarks like “Facebook users tend to portray their lives as they would like them to be. Google searches may reflect more truthfully what people are looking for.”)

The following Chapter 8 is somewhat confusing in its defence of microdata, by which I understand keeping the raw data rather than averaging through summary statistics. Synthetic data is mentioned there, but without mention of a reference model, while machine learning makes a very brief appearance (p.222). In Chapter 9, (statistical) data analysis is [at last!] examined, but mostly through descriptive statistics. Except for a regression model and a discussion of the issues around hypothesis testing, with Bayesian testing making its unique visit, albeit confusedly in-between references to Taleb’s Black Swan, Gödel’s incompleteness theorem (which always seems to fascinate authors of general public science books!), and Kahneman and Tversky’s prospect theory. Somewhat surprisingly, the chapter also includes a Taoist tale about the farmer getting in turns lucky and unlucky… A tale that was already used in What are the chances? that I reviewed two years ago. As this is a very established parable dating back at least to the 2nd century B.C., there is no copyright involved, but what are the chances the story finds its way that quickly into another book?!

The final chapter is about the future, unsurprisingly. With predictions of “plenty of black boxes”, “statistical lawlessness”, “data pooling” and data as a commodity (which relates to some themes of our OCEAN ERC-Synergy grant). Although the solution favoured by the author is centralised, through a (national) statistics office or another “trusted third party”. The last section is about the predicted end of theory, since “simply looking at data can reveal patterns”, but resisting the prophets of doom and idealising the Rise of the (AI) machines… The lyrical conclusion that “With both production consolidation and use of data increasingly in the ‘hands’ of machines, and our wise interventions, the more distant future will bring complete integrations” sounds too much like Brave New World for my taste!

“…the privacy argument is weak, if not hypocritical. Logically, it’s hard to fathom what data that we share with an online retailer or a delivery company we wouldn’t share with others (…) A naysayer will say nay.” (p.190)

The way the book reads and unrolls is somewhat puzzling to this reader, as it sounds like a sequence of common sense remarks with a Guesstimation flavour on the side, and tiny historical or technical facts, some unknown and most of no interest to me, while lacking in the larger picture. For instance, the long-winded tale on evaluating the cumulated size of a neighbourhood’s lawns (pp. 34-38) does not seem to be getting anywhere. The inclusion of so many warnings, misgivings, and alternatives in the collection and definition of data may have the counter-effect of discouraging readers from making sense of numeric concepts and trusting the conclusions of data-based analyses. The constant switch in perspective(s) and the apparent absence of definite conclusions are also exhausting. Furthermore, I feel that the author and his rosy prospects are repeatedly minimizing the risks of data collection on individual privacy and freedom, when presenting the platforms as a solution to a real time census (as, e.g., p.178), as exemplified by the high social control exercised by some number savvy dictatorships! And he is highly critical of EU regulations such as GDPR, “less-than-subtle” (p.267), “with its huge impact on businesses” (p.268). I am thus overall uncertain which audience this book will eventually reach.

[Disclaimer about potential self-plagiarism: this post or an edited version will potentially appear in my Books Review section in CHANCE.]

Ocean’s four!

Posted in Books, pictures, Statistics, University life on October 25, 2022 by xi'an

Fantastic news! The ERC-Synergy¹ proposal we submitted last year with Michael Jordan, Éric Moulines, and Gareth Roberts has been selected by the ERC (which explains the trips to Brussels last month). Its acronym is OCEAN [hence the whale pictured by a murmuration of starlings!], which stands for On intelligenCE And Networks: Mathematical and Algorithmic Foundations for Multi-Agent Decision-Making. Here is the abstract, which will presumably become public today along with the official announcement from the ERC:

Until recently, most of the major advances in machine learning and decision making have focused on a centralized paradigm in which data are aggregated at a central location to train models and/or decide on actions. This paradigm faces serious flaws in many real-world cases. In particular, centralized learning risks exposing user privacy, makes inefficient use of communication resources, creates data processing bottlenecks, and may lead to concentration of economic and political power. It thus appears most timely to develop the theory and practice of a new form of machine learning that targets heterogeneous, massively decentralized networks, involving self-interested agents who expect to receive value (or rewards, incentive) for their participation in data exchanges.

OCEAN will develop statistical and algorithmic foundations for systems involving multiple incentive-driven learning and decision-making agents, including uncertainty quantification at the agent’s level. OCEAN will study the interaction of learning with market constraints (scarcity, fairness), connecting adaptive microeconomics and market-aware machine learning.

OCEAN builds on a decade of joint advances in stochastic optimization, probabilistic machine learning, statistical inference, Bayesian assessment of uncertainty, computation, game theory, and information science, with PIs having complementary and internationally recognized skills in these domains. OCEAN will shed a new light on the value and handling of data in a competitive, potentially antagonistic, multi-agent environment, and develop new theories and methods to address these pressing challenges. OCEAN requires a fundamental departure from standard approaches and leads to major scientific interdisciplinary endeavors that will transform statistical learning in the long term while opening up exciting and novel areas of research.

Since the ERC support in this grant mostly goes to PhD and postdoctoral positions, watch out for calls in the coming months or contact us at any time.

Fusion at CIRM

Posted in Mountains, pictures, Statistics, Travel, University life on October 24, 2022 by xi'an

Today is the first day of the FUSION workshop that Rémi Bardenet and I organised. Due to schedule clashes, I will alas not be there, since [no alas!] I am at the BNP conference in Chile. The program and the collection of participants are quite exciting and I hope more fusion will result from this meeting. Enjoy! (And beware of boars, cold water, and cliffs!!!)

distributed evidence

Posted in Books, pictures, Statistics, University life on December 16, 2021 by xi'an

Alexander Buchholz (who did his PhD at CREST with Nicolas Chopin), Daniel Ahfock, and my friend Sylvia Richardson published a great paper on the distributed computation of Bayesian evidence in Bayesian Analysis. The setting is one of distributed data from several sources with no communication between them, which relates to consensus Monte Carlo even though model choice has not been particularly studied from that perspective. The authors operate under the assumption of conditionally conjugate models, i.e., the existence of a data augmentation scheme into an exponential family so that conjugate priors can be used. For a division of the data into S blocks, the fundamental identity in the paper is

p(y) = \alpha^S \prod_{s=1}^S \tilde p(y_s) \int \prod_{s=1}^S \tilde p(\theta|y_s)\,\text d\theta

where α is the normalising constant of the sub-prior exp{log[p(θ)]/S} and the other terms are associated with this prior. Under the conditionally conjugate assumption, the integral can be approximated based on the latent variables. Most interestingly, the associated variance is directly connected with the variance of

p(z_{1:S}|y)\Big/\prod_{s=1}^S \tilde p(z_s|y_s)

under the joint:

“The variance of the ratio measures the quality of the product of the conditional sub-posterior as an importance sample proposal distribution.”

Assuming this variance is finite (which is likely). An approximate alternative is proposed, namely to replace the exact sub-posterior with a Normal distribution, as in consensus Monte Carlo, which should obviously require some consideration as to which parameterisation of the model produces the “most normal” (or the least abnormal!) posterior. And ensures a finite variance in the importance sampling approximation (as ensured by the strong bounds in Proposition 5). A problem shared by the bridgesampling package.

“…if the error that comes from MCMC sampling is relatively small and that the shard sizes are large enough so that the quality of the subposterior normal approximation is reasonable, our suggested approach will result in good approximations of the full data set marginal likelihood.”

The resulting approximation can also be handy in conjunction with reversible jump MCMC, in the sense that RJMCMC algorithms can be run in parallel on different chunks or shards of the entire dataset. Although the computing gain may be reduced by the need for separate approximations.
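As a sanity check on the fundamental identity above, here is a minimal sketch in a toy setting where every term is available in closed form. This is my own illustration and not the authors' code: it assumes observations y_i ~ N(θ, σ²) with σ² known and a N(0, τ²) prior on θ, so that the sub-prior p(θ)^{1/S}/α is a N(0, Sτ²) density and each sub-posterior is Normal. The function names (log_evidence, distributed_log_evidence) and the numerical values are hypothetical choices made for the example.

```python
import math
import random


def log_evidence(y, sigma2, v):
    """Exact log marginal likelihood for y_i ~ N(theta, sigma2), theta ~ N(0, v)."""
    n = len(y)
    s1 = sum(y)
    s2 = sum(x * x for x in y)
    return (-0.5 * n * math.log(2 * math.pi * sigma2)
            - 0.5 * math.log(1 + n * v / sigma2)
            - 0.5 * (s2 - v * s1 ** 2 / (sigma2 + n * v)) / sigma2)


def distributed_log_evidence(blocks, sigma2, tau2):
    """log p(y) rebuilt from per-block quantities via the identity
    p(y) = alpha^S * prod_s p~(y_s) * int prod_s p~(theta | y_s) d theta."""
    S = len(blocks)
    v_sub = S * tau2  # variance of the sub-prior, proportional to p(theta)^(1/S)
    # alpha = integral of p(theta)^(1/S) for the N(0, tau2) prior (toy-model assumption)
    log_alpha = (-0.5 / S * math.log(2 * math.pi * tau2)
                 + 0.5 * math.log(2 * math.pi * v_sub))
    # sub-evidences p~(y_s), each computed under the sub-prior
    log_sub_evidence = sum(log_evidence(b, sigma2, v_sub) for b in blocks)
    # Normal sub-posteriors: precisions lam[s] and means m[s]
    lam = [len(b) / sigma2 + 1 / v_sub for b in blocks]
    m = [sum(b) / sigma2 / l for b, l in zip(blocks, lam)]
    # closed-form integral of the product of the S Normal sub-posterior densities
    Lam = sum(lam)
    M = sum(l * mi for l, mi in zip(lam, m)) / Lam
    log_integral = (sum(0.5 * math.log(l / (2 * math.pi)) for l in lam)
                    + 0.5 * math.log(2 * math.pi / Lam)
                    - 0.5 * (sum(l * mi ** 2 for l, mi in zip(lam, m)) - Lam * M ** 2))
    return S * log_alpha + log_sub_evidence + log_integral


random.seed(0)
sigma2, tau2, S = 1.0, 4.0, 5
y = [random.gauss(1.5, 1.0) for _ in range(200)]
blocks = [y[s::S] for s in range(S)]  # S shards of the data

full = log_evidence(y, sigma2, tau2)
dist = distributed_log_evidence(blocks, sigma2, tau2)
print(full, dist)  # the two values should agree up to rounding error
```

In this toy case the integral of the product of sub-posteriors is exact, so there is no Monte Carlo error to speak of. In the conditionally conjugate setting of the paper, that integral is instead approximated from the latent variables, which is where the variance discussed above comes into play.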
