Archive for nuisance parameters

ABC in Svalbard [#1]

Posted in Books, Mountains, pictures, R, Running, Statistics, Travel, University life on April 13, 2021 by xi'an

It started a bit awkwardly for me as I ran late, having accidentally switched to UK time the previous evening (despite a record-breaking biking-time to the University!), then the welcome desk could not find the key to the webinar room and I ended up following the first session from my office, by myself (and my teapot)… Until we managed to reunite in the said room (with an air quality detector!).

Software sessions are rather difficult to follow and I wonder what the ideal on-line version should be. We could borrow from the teaching experience newly gained over the past year, when we had to engage students without being able to roam the computer lab and look at their screens to push them into coding. It is however unrealistic to run a computer lab, unless a few “guinea pigs” could be selected in advance and show their progress, or lack thereof, during the session. In any case, thanks to the speakers who made the presentations of

  1. BSL (R)
  2. ELFI (Python)
  3. ABCpy (Python)

this morning/evening. (Just taking the opportunity to point out the publication of the latest version of DIYABC!).

Florence Forbes’ talk on using mixture of experts was quite alluring (and generated online discussions during the break, recovering some of the fun in real conferences), esp. from my longtime interest normalising flows in mixtures of regression (and more to come as part of our biweekly reading group!). Louis talked about gaining efficiency by not resampling the entire data in large network models. Edwin Fong brought martingales and infinite dimension distributions to the rescue, generalising Polya urns! And Justin Alsing discussed the advantages of estimating the likelihood rather than estimating the posterior, which sounds counterintuitive. With a return to mixtures as approximations, using instead normalising flows. With the worth-repeating message that ABC marginalises over nuisance parameters so easily! And a nice perspective on ABayesian decision, which does not occur that often in the ABC literature. Cecilia Viscardi made a link between likelihood estimation and large deviations à la Sanov, the rare event being associated with the larger distances, albeit dependent on a primary choice of the tolerance. Michael Gutmann presented an intriguing optimisation Monte Carlo approach from his AISTATS 2020 paper of last year, the simulated parameter being defined by a fiducial inversion. Reweighted by the prior times a Jacobian term, which struck me as a wee bit odd, i.e., using two distributions on θ. And Rito concluded the day by seeking approximate sufficient statistics by constructing exponential families whose components are themselves parameterised as neural networks with neural parameter ω. Leading to an unnormalised model because of the energy function, hence to the use of inference techniques on ω that do not require the constant, like Gutmann & Hyvärinen (2012). And using the (pseudo-)sufficient statistic as ABC summary statistic. Which still requires an exchange MCMC step within ABC.

Bayesian sufficiency

Posted in Books, Kids, Statistics on February 12, 2021 by xi'an

“During the past seven decades, an astonishingly large amount of effort and ingenuity has gone into the search for reasonable answers to this question.” D. Basu

Induced by a vaguely related question on X validated, I re-read Basu’s great 1977 JASA paper on the elimination of nuisance parameters. Besides the limitations of competing definitions of conditional, partial, marginal sufficiency for the parameter of interest, Basu discusses various notions of Bayesian (partial) sufficiency.

“After a long journey through a forest of confusing ideas and examples, we seem to have lost our way.” D. Basu

Starting with Kolmogorov’s idea (published during WW II) of requiring all marginal posteriors on the parameter of interest θ to depend only on a statistic S(x). But having this hold for all priors cancels the notion, as the statistic then needs to be sufficient jointly for θ and σ, as shown by Hájek in the early 1960s. Following this attempt, Raiffa and Schlaifer introduced a more restricted class of priors, namely those where nuisance and interest parameters are a priori independent. In which case a conditional factorisation theorem is a sufficient (!) condition for this Q-sufficiency. But not a necessary one, as shown by the N(θ·σ, 1) counter-example (when σ=±1 and θ>0). [When the prior on σ is uniform, the absolute average is Q-sufficient but is this a positive feature?] This choice of prior separation is somewhat perplexing in that it does not hold under reparameterisation.
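As a quick numerical check of the bracketed remark on the N(θ·σ, 1) counter-example, here is a minimal sketch that computes the marginal posterior of θ on a grid when σ is uniform on ±1. The Exp(1) prior on θ, the grid, the sample sizes and the function name are all arbitrary choices of mine for the illustration, not taken from Basu's paper. Two different samples with opposite averages return the same marginal posterior, while shifting the sample changes it.

```python
import numpy as np

def marginal_posterior_theta(x, grid):
    """Grid posterior of theta for x_i ~ N(theta*sigma, 1), sigma uniform on {-1, +1}.
    The Exp(1) prior on theta > 0 is an assumption made for this illustration."""
    x = np.asarray(x)[:, None]
    loglik_plus = -0.5 * np.sum((x - grid) ** 2, axis=0)    # sigma = +1
    loglik_minus = -0.5 * np.sum((x + grid) ** 2, axis=0)   # sigma = -1
    # Average the two likelihoods over sigma, then add the log-prior on theta
    log_unnorm = np.logaddexp(loglik_plus, loglik_minus) + np.log(0.5) - grid
    w = np.exp(log_unnorm - log_unnorm.max())
    return w / w.sum()                                      # normalised grid weights

rng = np.random.default_rng(0)
grid = np.linspace(0.01, 5.0, 500)

x = rng.normal(1.3, 1.0, size=20)
y = rng.normal(0.0, 1.0, size=20)
y = y - y.mean() - x.mean()        # a different sample with mean(y) = -mean(x)
z = x + 1.0                        # a sample with a different absolute average

post_x = marginal_posterior_theta(x, grid)
post_y = marginal_posterior_theta(y, grid)
post_z = marginal_posterior_theta(z, grid)

gap_same_abs_mean = np.max(np.abs(post_x - post_y))
gap_diff_abs_mean = np.max(np.abs(post_x - post_z))
print(gap_same_abs_mean, gap_diff_abs_mean)
```

The first gap is at the level of floating-point error, since the sample enters the marginal posterior only through the absolute value of its average.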

Basu ends up with three challenges, including the multinomial M(θ·σ,½(1-θ)·(1+σ),½(1+θ)·(1-σ)), with (n¹,n²,n³) as a minimal sufficient statistic. And the joint observation of an Exponential Exp(θ) translated by σ and of an Exponential Exp(σ) translated by -θ, where the prior on σ gets eliminated in the marginal on θ.

Approximate Integrated Likelihood via ABC methods

Posted in Books, Statistics, University life on March 13, 2014 by xi'an

My PhD student Clara Grazian just arXived this joint work with Brunero Liseo on using ABC for marginal density estimation. The idea in this paper is to produce an integrated likelihood approximation in intractable problems via the ratio

L(\psi|x)\propto \dfrac{\pi(\psi|x)}{\pi(\psi)}

both terms in the ratio being estimated from simulations,

\hat L(\psi|x) \propto \dfrac{\hat\pi^\text{ABC}(\psi|x)}{\hat\pi(\psi)}

(with possible closed form for the denominator). Although most of the examples processed in the paper (Poisson means ratio, Neyman-Scott’s problem, g-&-k quantile distribution, semi-parametric regression) rely on summary statistics, hence de facto replacing the numerator above with a pseudo-posterior conditional on those summaries, the approximation remains accurate (for those examples). In the g-&-k quantile example, Clara and Brunero compare our ABC-MCMC algorithm with that of Allingham et al. (2009, Statistics & Computing): the latter does better by not replicating values in the Markov chain but instead proposing a new value until it is accepted by the usual Metropolis step. (Although I did not spend much time on this issue, I cannot see how both approaches could be simultaneously correct. Even though the outcomes do not look very different.) As noted by the authors, “the main drawback of the present approach is that it requires the use of proper priors”, unless the marginalisation of the prior can be done analytically. (This is an interesting computational problem: how to provide an efficient approximation to a marginal density of a σ-finite measure, assuming this density exists.)
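To make the ratio above concrete, here is a minimal sketch on a toy problem of my own choosing, not one of the paper's examples: a Normal sample with standard deviation ψ (of interest) and mean λ (nuisance). The priors, the summaries, the 0.5% acceptance rate, the Silverman-type bandwidth and the helper `kde` are all assumptions made for the illustration, and plain ABC rejection stands in for the ABC-MCMC algorithms discussed in the post.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model (assumption): x_i ~ N(lam, psi^2), psi of interest, lam nuisance
n, psi_true, lam_true = 30, 2.0, 1.0
x_obs = rng.normal(lam_true, psi_true, size=n)
s_obs = np.array([x_obs.mean(), x_obs.std(ddof=1)])

# Proper priors (assumption): psi ~ Exp(1), lam ~ N(0, 2^2)
N = 200_000
psi = rng.exponential(1.0, size=N)
lam = rng.normal(0.0, 2.0, size=N)

# Simulate pseudo-data and reduce them to the same summaries
sims = lam[:, None] + psi[:, None] * rng.standard_normal((N, n))
s_sim = np.column_stack([sims.mean(axis=1), sims.std(axis=1, ddof=1)])

# ABC rejection: keep the 0.5% of draws with the closest summaries
dist = np.linalg.norm(s_sim - s_obs, axis=1)
keep = dist <= np.quantile(dist, 0.005)
psi_abc = psi[keep]            # dropping lam marginalises over the nuisance

def kde(samples, grid):
    """Gaussian kernel density estimate with a Silverman-type bandwidth."""
    h = 1.06 * samples.std(ddof=1) * len(samples) ** (-0.2)
    z = (grid[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (len(samples) * h * np.sqrt(2 * np.pi))

# Ratio of the ABC posterior estimate to the closed-form prior density
grid = np.linspace(*np.quantile(psi_abc, [0.01, 0.99]), 200)
lik_hat = kde(psi_abc, grid) / np.exp(-grid)
lik_hat /= lik_hat.max()       # the integrated likelihood is defined up to a constant
psi_hat = grid[np.argmax(lik_hat)]
print(psi_hat, s_obs[1])
```

The grid is restricted to the bulk of the accepted values because the ratio becomes unstable where the density estimate is close to zero, which is also where the requirement of a proper prior bites.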

Clara will give a talk at CREST-ENSAE today about this work, in the Bayes in Paris seminar: 2pm in room 18.

mostly nuisance, little interest

Posted in Statistics, University life on February 7, 2013 by xi'an

Sorry for the misleading if catchy (?) title, I mean mostly nuisance parameters, very few parameters of interest! This morning I attended a talk by Eric Lesage from CREST-ENSAI on non-responses in surveys and their modelling through instrumental variables. The weighting formula used to compensate for the missing values was exactly the one at the core of the Robins-Wasserman paradox, discussed a few weeks ago by Jamie in Varanasi. Namely the one with the estimated probability of response in the denominator. The solution adopted in the talk was obviously different, with linear estimators used at most steps to evaluate the bias of the procedure (since researchers in survey sampling seem particularly obsessed with bias!).
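For readers unfamiliar with this type of weighting, here is a generic sketch of an estimator that divides each observed response by an estimated probability of response. It is not the estimator from the talk: the simulated survey, the binary covariate and the stratum-wise estimate of the response probability are all hypothetical choices made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Hypothetical survey: a binary covariate x drives both outcome and response
x = rng.integers(0, 2, size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)     # population mean of y is 2.0
p_resp = np.where(x == 1, 0.8, 0.3)        # true response probabilities
r = rng.random(n) < p_resp                 # r[i] is True when unit i responds

# Naive estimate: average over respondents only (over-represents x = 1)
naive = y[r].mean()

# Estimate the response probability within each stratum of x
p_hat = np.array([r[x == k].mean() for k in (0, 1)])[x]

# Weighted estimate, with the estimated response probability in the denominator
ipw = np.sum(y[r] / p_hat[r]) / n
print(naive, ipw)
```

In this toy setting the naive average is biased upwards, since respondents are mostly units with x = 1, while the weighted version recovers the population mean.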

On a somewhat related topic, Aris Spanos arXived a short note (that I read yesterday) about the Neyman-Scott paradox. The problem is similar to the Robins-Wasserman paradox in that there is an infinity of nuisance parameters (the means of the successive pairs of observations) and that a convergent estimator of the parameter of interest, namely the variance common to all observations, is available. While there exist Bayesian solutions to this problem (see, e.g., this paper by Brunero Liseo), they require some preliminary steps to bypass the difficulty of this infinite number of parameters and, in this respect, involve ad-hockery to some extent, because the prior is then designed purposefully so. In other words, missing the direct solution based on the difference of the pairs is a wee bit frustrating, even though this statistic is not sufficient! The above paper by Brunero also contains my favourite example in this area: when considering a normal mean in large dimension, if the parameter of interest is the squared norm of this mean, the MLE ||x||² (and the Bayes estimator associated with Jeffreys’ prior) is (are) very poor: the bias is constant and of the order of the dimension of the mean, p. On the other hand, if one starts from ||x||² as the observation (definitely in-sufficient!), the resulting MLE (and the Bayes estimator associated with Jeffreys’ prior) has (have) much nicer properties. (I mentioned this example in my review of Chang’s book as it is paradoxical, gaining in efficiency by throwing away “information”! Of course, the part we throw away does not contain true information about the norm, but the likelihood does not factorise and hence the Bayesian answers differ…)
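Both phenomena are easy to reproduce by simulation. The sketch below first compares the joint MLE of the common variance in the Neyman-Scott problem with the estimator based on within-pair differences, then evaluates the bias of ||x||² as an estimator of the squared norm of a normal mean. The sample sizes, the true values and the way the nuisance means are generated are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Neyman-Scott: pairs (x_i1, x_i2) ~ N(mu_i, sigma^2), one nuisance mean per pair
n_pairs, sigma2 = 50_000, 4.0
mu = rng.normal(0.0, 10.0, size=n_pairs)            # arbitrary nuisance means
pairs = mu[:, None] + np.sqrt(sigma2) * rng.standard_normal((n_pairs, 2))

# Joint MLE of sigma^2: plug in the pair means, divide by the 2n observations
resid = pairs - pairs.mean(axis=1, keepdims=True)
mle = np.sum(resid ** 2) / (2 * n_pairs)            # converges to sigma^2 / 2

# Estimator based on the within-pair differences d_i ~ N(0, 2 sigma^2)
d = pairs[:, 0] - pairs[:, 1]
diff_based = np.mean(d ** 2) / 2                    # converges to sigma^2

# Squared norm of a normal mean: x ~ N_p(theta, I), interest in ||theta||^2
p, reps = 100, 20_000
theta = np.ones(p)                                  # so that ||theta||^2 = 100
x = theta + rng.standard_normal((reps, p))
bias_mle = np.mean(np.sum(x ** 2, axis=1)) - theta @ theta   # close to p

print(mle, diff_based, bias_mle)
```

The joint MLE settles around half the true variance however many pairs are observed, while the difference-based estimator is consistent, and the bias of ||x||² matches the dimension p.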

I showed the paper to Andrew Gelman and here are his comments:

Spanos writes, “The answer is surprisingly straightforward.” I would change that to, “The answer is unsurprisingly straightforward.” He should’ve just asked me the answer first rather than wasting his time writing a paper!

The way it works is as follows. In Bayesian inference, everything unknown is unknown, they have a joint prior and a joint posterior distribution. In frequentist inference, each unknown quantity is either a parameter or a predictive quantity. Parameters do not have probability distributions (hence the discomfort that frequentists have with notation such as N(y|m,s); they prefer something like N(y;m,s) or f_N(y;m,s)), while predictions do have probability distributions. In frequentist statistics, you estimate parameters and you predict predictors. In this world, estimation and prediction are different. Estimates are evaluated conditional on the parameter. Predictions are evaluated conditional on model parameters but unconditional on the predictive quantities. Hence, mle can work well in many high-dimensional problems, as long as you consider many of the uncertain quantities as predictive. (But mle is still not perfect because of the problem of boundary estimates, e.g., here.)