Archive for Cross Validation

likelihood-free inference by ratio estimation

Posted in Books, Mountains, pictures, Running, Statistics, Travel, University life with tags , , , , , , , , , , , , , , , on September 9, 2019 by xi'an

“This approach for posterior estimation with generative models mirrors the approach of Gutmann and Hyvärinen (2012) for the estimation of unnormalised models. The main difference is that here we classify between two simulated data sets while Gutmann and Hyvärinen (2012) classified between the observed data and simulated reference data.”

A 2018 arXiv posting by Owen Thomas et al. (including my colleague at Warwick, Rito Dutta, CoI warning!) about estimating the likelihood (and the posterior) when it is intractable. Likelihood-free but not ABC, since the ratio likelihood to marginal is estimated in a non- or semi-parametric (and biased) way. Following Geyer’s 1994 fabulous estimate of an unknown normalising constant via logistic regression, the current paper which I read in preparation for my discussion in the ABC optimal design in Salzburg uses probabilistic classification and an exponential family representation of the ratio. Opposing data from the density and data from the marginal, assuming both can be readily produced. The logistic regression minimizing the asymptotic classification error is the logistic transform of the log-ratio. For a finite (double) sample, this minimization thus leads to an empirical version of the ratio. Or to a smooth version if the log-ratio is represented as a convex combination of summary statistics, turning the approximation into an exponential family,  which is a clever way to buckle the buckle towards ABC notions. And synthetic likelihood. Although with a difference in estimating the exponential family parameters β(θ) by minimizing the classification error, parameters that are indeed conditional on the parameter θ. Actually the paper introduces a further penalisation or regularisation term on those parameters β(θ), which could have been processed by Bayesian Lasso instead. This step is essentially dirving the selection of the summaries, except that it is for each value of the parameter θ, at the expense of a X-validation step. This is quite an original approach, as far as I can tell, but I wonder at the link with more standard density estimation methods, in particular in terms of the precision of the resulting estimate (and the speed of convergence with the sample size, if convergence there is).

the paper where you are a node

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , , , , on February 5, 2019 by xi'an

Sophie Donnet pointed out to me this arXived paper by Tianxi Li, Elizaveta Levina, and Ji Zhu, on a network resampling strategy for X validation, where I appear as a datapoint rather than as a [direct] citation! Which reminded me of the “where you are the hero” gamebooks with which my kids briefly played, before computer games took over. The model selection method is illustrated on a dataset made of X citations [reduced to 706 authors]  in all papers published between 2003 and 2012 in the Annals of Statistics, Biometrika, JASA, and JRSS Series B. With the outcome being the determination of a number of communities, 20, which the authors labelled as they wanted, based on 10 authors with the largest number of citations in the category. As it happens, I appear in the list, within the “mixed (causality + theory + Bayesian)” category (!), along with Jamie Robbins, Paul Fearnhead, Gilles Blanchard, Zhiqiang Tan, Stijn Vansteelandt, Nancy Reid, Jae Kwang Kim, Tyler VanderWeele, and Scott Sisson, which is somewhat mind-boggling in that I am pretty sure I never quoted six of these authors [although I find it hilarious that Jamie appears in the category, given that we almost got into a car crash together, at one of the Valencià meetings!].

unbiased estimation of log-normalising constants

Posted in Statistics with tags , , , , , , , on October 16, 2018 by xi'an

Maxime Rischard, Pierre Jacob, and Natesh Pillai [warning: both of whom are co-authors and friends of mine!] have just arXived a paper on the use of path sampling (a.k.a., thermodynamic integration) for log-constant unbiased approximation and the resulting consequences on Bayesian model comparison by X validation. If the goal is the estimation of the log of a ratio of two constants, creating an artificial path between the corresponding distributions and looking at the derivative at any point of this path of the log-density produces an unbiased estimator. Meaning that random sampling along the path, corrected by the distribution of the sampling still produces an unbiased estimator. From there the authors derive an unbiased estimator for any X validation objective function, CV(V,T)=-log p(V|T), taking m observations T in and leaving n-m observations T out… The marginal conditional log density in the criterion is indeed estimated by an unbiased path sampler, using a powered conditional likelihood. And unbiased MCMC schemes à la Jacob et al. for simulating unbiased MCMC realisations of the intermediary targets on the path. Tuning it towards an approximately constant cost for all powers.

So in all objectivity and fairness (!!!), I am quite excited by this new proposal within my favourite area! Or rather two areas since it brings together the estimation of constants and an alternative to Bayes factors for Bayesian testing. (Although the paper does not broach upon the calibration of the X validation values.)

Lindley’s paradox as a loss of resolution

Posted in Books, pictures, Statistics with tags , , , , , , , , on November 9, 2016 by xi'an

“The principle of indifference states that in the absence of prior information, all mutually exclusive models should be assigned equal prior probability.”

lindleypColin LaMont and Paul Wiggins arxived a paper on Lindley’s paradox a few days ago. The above quote is the (standard) argument for picking (½,½) partition between the two hypotheses, which I object to if only because it does not stand for multiple embedded models. The main point in the paper is to argue about the loss of resolution induced by averaging against the prior, as illustrated by the picture above for the N(0,1) versus N(μ,1) toy problem. What they call resolution is the lowest possible mean estimate for which the null is rejected by the Bayes factor (assuming a rejection for Bayes factors larger than 1). While the detail is missing, I presume the different curves on the lower panel correspond to different choices of L when using U(-L,L) priors on μ… The “Bayesian rejoinder” to the Lindley-Bartlett paradox (p.4) is in tune with my interpretation, namely that as the prior mass under the alternative gets more and more spread out, there is less and less prior support for reasonable values of the parameter, hence a growing tendency to accept the null. This is an illustration of the long-lasting impact of the prior on the posterior probability of the model, because the data cannot impact the tails very much.

“If the true prior is known, Bayesian inference using the true prior is optimal.”

This sentence and the arguments following is meaningless in my opinion as knowing the “true” prior makes the Bayesian debate superfluous. If there was a unique, Nature provided, known prior π, it would loose its original meaning to become part of the (frequentist) model. The argument is actually mostly used in negative, namely that since it is not know we should not follow a Bayesian approach: this is, e.g., the main criticism in Inferential Models. But there is no such thing as a “true” prior! (Or a “true’ model, all things considered!) In the current paper, this pseudo-natural approach to priors is utilised to justify a return to the pseudo-Bayes factors of the 1990’s, when one part of the data is used to stabilise and proper-ise the (improper) prior, and a second part to run the test per se. This includes an interesting insight on the limiting cases of partitioning corresponding to AIC and BIC, respectively, that I had not seen before. With the surprising conclusion that “AIC is the derivative of BIC”!

conditional sampling

Posted in R, Statistics with tags , , , , on September 5, 2016 by xi'an

An interesting question about stratified sampling came up on X validated last week, namely how to optimise a Monte Carlo estimate based on two subsequent simulations, one, X, from a marginal and one or several Y from the corresponding conditional given X, when the costs of producing those two simulations significantly differ. When looking at the variance of the Monte Carlo estimate, this variance can be minimised in the number of simulations from the marginal under a computing budget. However, when the costs of producing x, y given or (x,y) are about the same, it does not pay to replicate simulations from y given x or x given y, because this increases the randomness of the estimator and induces positive correlation between some terms in the sum. Assuming that the conditional variances are always computable, we could envision an extension to the question where for each new value of a simulated x (or y), a variable number of conditionally simulated y (or x) could be produced. Or even when those conditional variances are unknown but can be replaced with empirical versions.

The illustration comes from a bivariate normal model with correlation, for a rather arbitrary function , but the pattern remains the same, namely that iid simulations of the pair (X,Y invariably leads to a smaller variance of the estimator compared with simulation with a 1:10 (10 Y’s for one X) or 10:1 ratio between x’s and y’s. Depending on the function and the relative variances, the 1:10 or 10:1 schemes may have a similar variability.

zigma=c(9,1,-.9*sqrt(1*9))

geney=function(x,n=1){
  return(rnorm(n,mean=zigma[3]*x/zigma[1],sd=sqrt(zigma[2]-
    zigma[3]^2/zigma[1])))}
genex=function(y,n=1){
  return(rnorm(n,mean=zigma[3]*y/zigma[2],sd=sqrt(zigma[1]-
     zigma[3]*zigma[3]/zigma[2])))}
targ=function(x,y){ log(x^2*y^4)+x^2*cos(x^2)/y*sin(y^2)}

T=1e4;N=1e3
vales=matrix(0,N,3)
for (i in 1:N){
   xx=rnorm(T,sd=sqrt(zigma[1]))
   vales[i,1]=mean(targ(xx,geney(xx,n=T)))
   xx=rep(rnorm(T/10,sd=sqrt(zigma[1])),10)
   vales[i,2]=mean(targ(xx,geney(xx,n=T)))
   yy=rep(rnorm(T/10,sd=sqrt(zigma[2])),10)
   vales[i,3]=mean(targ(enex(yy,n=T),yy))}

insufficient statistics for ABC model choice

Posted in Books, Kids, Statistics, University life with tags , , , , , , , , , , , on October 17, 2014 by xi'an

[Here is a revised version of my comments on the paper by Julien Stoehr, Pierre Pudlo, and Lionel Cucala, now to appear [both paper and comments] in Statistics and Computing special MCMSki 4 issue.]

Approximate Bayesian computation techniques are 2000’s successors of MCMC methods as handling new models where MCMC algorithms are at a loss, in the same way the latter were able in the 1990’s to cover models that regular Monte Carlo approaches could not reach. While they first sounded like “quick-and-dirty” solutions, only to be considered until more elaborate solutions could (not) be found, they have been progressively incorporated within the statistican’s toolbox as a novel form of non-parametric inference handling partly defined models. A statistically relevant feature of those ACB methods is that they require replacing the data with smaller dimension summaries or statistics, because of the complexity of the former. In almost every case when calling ABC is the unique solution, those summaries are not sufficient and the method thus implies a loss of statistical information, at least at a formal level since relying on the raw data is out of question. This forced reduction of statistical information raises many relevant questions, from the choice of summary statistics to the consistency of the ensuing inference.

In this paper of the special MCMSki 4 issue of Statistics and Computing, Stoehr et al. attack the recurrent problem of selecting summary statistics for ABC in a hidden Markov random field, since there is no fixed dimension sufficient statistics in that case. The paper provides a very broad overview of the issues and difficulties related with ABC model choice, which has been the focus of some advanced research only for a few years. Most interestingly, the authors define a novel, local, and somewhat Bayesian misclassification rate, an error that is conditional on the observed value and derived from the ABC reference table. It is the posterior predictive error rate

\mathbb{P}^{\text{ABC}}(\hat{m}(y^{\text{obs}})\ne m|S(y^{\text{obs}}))

integrating in both the model index m and the corresponding random variable Y (and the hidden intermediary parameter) given the observation. Or rather given the transform of the observation by the summary statistic S. The authors even go further to define the error rate of a classification rule based on a first (collection of) statistic, conditional on a second (collection of) statistic (see Definition 1). A notion rather delicate to validate on a fully Bayesian basis. And they advocate the substitution of the unreliable (estimates of the) posterior probabilities by this local error rate, estimated by traditional non-parametric kernel methods. Methods that are calibrated by cross-validation. Given a reference summary statistic, this perspective leads (at least in theory) to select the optimal summary statistic as the one leading to the minimal local error rate. Besides its application to hidden Markov random fields, which is of interest per se, this paper thus opens a new vista on calibrating ABC methods and evaluating their true performances conditional on the actual data. (The advocated abandonment of the posterior probabilities could almost justify the denomination of a paradigm shift. This is also the approach advocated in our random forest paper.)