Archive for score function

robust privacy

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , on May 14, 2024 by xi'an

During a recent working session, some Oceanerc (incl. me) went reading Privacy-Preserving Parametric Inference: A Case for Robust Statistics by Marco Avella-Medina (JASA, 2022), where robust criteria are advanced as efficient statistical tools in private settings. In this paper, robustness means using M-estimators T—as function of the empirical cdf—with basis score functions Ψ, defined as

\sum_{i=1}^n\Psi(x_i,T(\hat F_n))=0,

where Ψ is bounded. A construction further requiring that one can assess the sensitivity (in Dwork et al, 2006, sense) of a queried function, sensitivity itself linked with a measure of differential privacy. Because standard robustness approaches à la Huber allow for a portion of the sample to issue from an outlying (arbitrary) distribution, as in ε-contaminations, it makes perfect sense that robustness emerges within the differential framework. However, this common sense perception does not seem good enough for achieving differential privacy and the paper introduces a further randomization with noise scaled by (n,ε,δ) in the following way

T(\hat F_n)+\gamma(T,\hat F_n)5\sqrt{2\log(n)\log(2/\delta)/\epsilon_n}Z

that also applies to test statistics. This scaling seems to constitute the central result of the paper, which establishes asymptotically validity in the sense of statistical consistency (with the sample size n). But I am left wondering whether this outcome counts as supporting differential privacy as a sensible notion…

“…our proofs for the convergence of noisy gradient descent and noisy Newton’s method rely on showing that with high probability, the noise introduced to the gradients and Hessians has a negligible effect on the convergence of the iterates (up to the order of the statistical error of the non-noisy versions of the algorithms).” Avella-Medina, Bradshaw, & Loh

As a sequel I then read a more recent publication of Avella-Medina, Differentially private inference via noisy optimization, written with Casey Bradshaw & Po-Ling Loh, which appeared in the Annals of Statistics (2023). Again considering privatised estimation and inference for M-estimators, obtained by using noisy optimization procedures (noisy gradient descent, noisy Newton’s method) and constructing noisy confidence regions, that output differentially private avatars of standard M-estimators. Here the noisification goes through a randomisation of the gradient step like

\theta^{(k+1)}=\theta^{(k)}-\frac{\eta}{n}\sum_i\Psi(x_i,\theta^{(k)})+\frac{\eta B\sqrt K}{n}Z_k

where B is an upper bound on the gradient Ψ, η is a discretization step, and K is the total number of iterations (thus fixed in advance). The above stochastic gradient sequence converges with high probability to the actual M-estimator in n and not in K, since the upper bound on the distance scales in √K/n. Where does the attached privacy guarantee come from? It proceeds by an argument of a composition of a sequence of differentially private outputs, all based on the same dataset.

“…the larger the number [K] of data (gradient) queries of the algorithm, the more prone it will be to privacy leakage.”

The Newton method version is a variation on the above stochastic gradient descent. Except it seems to converge faster, as illustrated above.

Contextual Integrity for Differential Privacy #1 [23w5106]

Posted in Books, Mountains, pictures, Statistics, University life with tags , , , , , , , , , , , , , , , on August 2, 2023 by xi'an


Very relaxed beginning to the workshop, with [xkcd illustrated] tutorials on CI and DP, plus informal discussions to find a common ground over the week. Helped by the beautiful surroundings of the UBC Okanagan campus. Tutorial on the fundamentals of differential privacy, with a long room discussion on the meaning(s) of the definition, for instance exhibiting the minimax nature of the thing, placing all bad events at the same level. Any Bayesian version? Still unclear [to me] how one can inverse-engineer (ε,δ) into the probability of identifying one data entry. More questions about the nature of noise in DP algorithms. Again lacking [for me] the impact on identifying one data entry. Plus notions like Laplace privacy-accuracy sound parameterisation dependent. Exponential mechanism, that involves a score function (protecting privacy?) that ends up resembling a Gibbs posterior in the “safe Bayes” perspective. Interesting distinction about which agent provides randomness as impacting the privacy.

As for contextual integrity, presented by Helen Nissenbaum at the origin of the concept, it is complicated by a strongly subjective handling of notions, first and foremost privacy, as inevitably the case in philosophy and ethics.

BayesComp²³ [aka MCMski⁶]

Posted in Books, Mountains, pictures, Running, Statistics, Travel, University life with tags , , , , , , , , , , , , , , , , , , , on March 20, 2023 by xi'an

The main BayesComp meeting started right after the ABC workshop and went on at a grueling pace, and offered a constant conundrum as to which of the four sessions to attend, the more when trying to enjoy some outdoor activity during the lunch breaks. My overall feeling is that it went on too fast, too quickly! Here are some quick and haphazard notes from some of the talks I attended, as for instance the practical parallelisation of an SMC algorithm by Adrien Corenflos, the advances made by Giacommo Zanella on using Bayesian asymptotics to assess robustness of Gibbs samplers to the dimension of the data (although with no assessment of the ensuing time requirements), a nice session on simulated annealing, from black holes to Alps (if the wrong mountain chain for Levi), and the central role of contrastive learning à la Geyer (1994) in the GAN talks of Veronika Rockova and Éric Moulines. Victor  Elvira delivered an enthusiastic talk on our massively recycled importance on-going project that we need to complete asap!

While their earlier arXived paper was on my reading list, I was quite excited by Nicolas Chopin’s (along with Mathieu Gerber) work on some quadrature stabilisation that is not QMC (but not too far either), with stratification over the unit cube (after a possible reparameterisation) requiring more evaluations, plus a sort of pulled-by-its-own-bootstrap control variate, but beating regular Monte Carlo in terms of convergence rate and practical precision (if accepting a large simulation budget from the start). A difficulty common to all (?) stratification proposals is that it does not readily applies to highly concentrated functions.

I chaired the lightning talks session, which were 3mn one-slide snapshots about some incoming posters selected by the scientific committee. While I appreciated the entry into the poster session, the more because it was quite crowded and busy, if full of interesting results, and enjoyed the slide solely made of “0.234”, I regret that not all poster presenters were not given the same opportunity (although I am unclear about which format would have permitted this) and that it did not attract more attendees as it took place in parallel with other sessions.

In a not-solely-ABC session, I appreciated Sirio Legramanti speaking on comparing different distance measures via Rademacher complexity, highlighting that some distances are not robust, incl. for instance some (all?) Wasserstein distances that are not defined for heavy tailed distributions like the Cauchy distribution. And using the mean as a summary statistic in such heavy tail settings comes as an issue, since the distance between simulated and observed means does not decrease in variance with the sample size, with the practical difficulty that the problem is hard to detect on real (misspecified) data since the true distribution behing (if any) is unknown. Would that imply that only intrinsic distances like maximum mean discrepancy or Kolmogorov-Smirnov are the only reasonable choices in misspecified settings?! While, in the ABC session, Jeremiah went back to this role of distances for generalised Bayesian inference, replacing likelihood by scoring rule, and requirement for Monte Carlo approximation (but is approximating an approximation that a terrible thing?!). I also discussed briefly with Alejandra Avalos on her use of pseudo-likelihoods in Ising models, which, while not the original model, is nonetheless a model and therefore to taken as such rather than as approximation.

I also enjoyed Gregor Kastner’s work on Bayesian prediction for a city (Milano) planning agent-based model relying on cell phone activities, which reminded me at a superficial level of a similar exploitation of cell usage in an attraction park in Singapore Steve Fienberg told me about during his last sabbatical in Paris.

In conclusion, an exciting meeting that should have stretched a whole week (or taken place in a less congenial environment!). The call for organising BayesComp 2025 is still open, by the way.

 

focused Bayesian prediction

Posted in Books, pictures, Statistics, Travel, University life with tags , , , , , , , , , , , , , on June 3, 2020 by xi'an

In this fourth session of our One World ABC Seminar, my friend and coauthor Gael Martin, gave an after-dinner talk on focused Bayesian prediction, more in the spirit of Bissiri et al. than following a traditional ABC approach.  because along with Ruben Loaiza-Maya and [my friend and coauthor] David Frazier, they consider the possibility of a (mild?) misspecification of the model. Using thus scoring rules à la Gneiting and Raftery. Gael had in fact presented an earlier version at our workshop in Oaxaca, in November 2018. As in other solutions of that kind, difficulty in weighting the score into a distribution. Although asymptotic irrelevance, direct impact on the current predictions, at least for the early dates in the time series… Further calibration of the set of interest A. Or the focus of the prediction. As a side note the talk perfectly fits the One World likelihood-free seminar as it does not use the likelihood function!

“The very premise of this paper is that, in reality, any choice of predictive class is such that the truth is not contained therein, at which point there is no reason to presume that the expectation of any particular scoring rule will be maximized at the truth or, indeed, maximized by the same predictive distribution that maximizes a different (expected) score.”

This approach requires the proxy class to be close enough to the true data generating model. Or in the word of the authors to be plausible predictive models. And to produce the true distribution via the score as it is proper. Or the closest to the true model in the misspecified family. I thus wonder at a possible extension with a non-parametric version, the prior being thus on functionals rather than parameters, if I understand properly the meaning of Π(Pθ). (Could the score function be misspecified itself?!) Since the score is replaced with its empirical version, the implementation is  resorting to off-the-shelf MCMC. (I wonder for a few seconds if the approach could be seen as a pseudo-marginal MCMC but the estimation is always based on the same observed sample, hence does not directly fit the pseudo-marginal MCMC framework.)

[Notice: Next talk in the series is tomorrow, 11:30am GMT+1.]

neural summaries

Posted in Statistics, University life with tags , , , , , , on September 27, 2019 by xi'an