Archive for logistic regression

likelihood-free inference by ratio estimation

Posted in Books, Mountains, pictures, Running, Statistics, Travel, University life on September 9, 2019 by xi'an

“This approach for posterior estimation with generative models mirrors the approach of Gutmann and Hyvärinen (2012) for the estimation of unnormalised models. The main difference is that here we classify between two simulated data sets while Gutmann and Hyvärinen (2012) classified between the observed data and simulated reference data.”

A 2018 arXiv posting by Owen Thomas et al. (including my colleague at Warwick, Rito Dutta, CoI warning!) about estimating the likelihood (and the posterior) when it is intractable. Likelihood-free but not ABC, since the likelihood-to-marginal ratio is estimated in a non- or semi-parametric (and biased) way. Following Geyer’s 1994 fabulous estimate of an unknown normalising constant via logistic regression, the current paper, which I read in preparation for my discussion of ABC optimal design in Salzburg, uses probabilistic classification and an exponential family representation of the ratio. Opposing data from the density and data from the marginal, assuming both can be readily produced. The logistic regression minimizing the asymptotic classification error is the logistic transform of the log-ratio. For a finite (double) sample, this minimization thus leads to an empirical version of the ratio. Or to a smooth version if the log-ratio is represented as a convex combination of summary statistics, turning the approximation into an exponential family, which is a clever way to come full circle towards ABC notions. And synthetic likelihood. Although with a difference in estimating the exponential family parameters β(θ) by minimizing the classification error, parameters that are indeed conditional on the parameter θ. Actually the paper introduces a further penalisation or regularisation term on those parameters β(θ), which could have been processed by Bayesian Lasso instead. This step is essentially driving the selection of the summaries, except that it operates for each value of the parameter θ, at the expense of a cross-validation step. This is quite an original approach, as far as I can tell, but I wonder at the link with more standard density estimation methods, in particular in terms of the precision of the resulting estimate (and the speed of convergence with the sample size, if convergence there is).
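
For the curious, here is a minimal sketch of the underlying logistic-regression trick (Geyer’s, not the authors’ LFIRE implementation): oppose draws from two densities, fit a logistic classifier on summary features, and read the estimated log-ratio off the fitted logit. The two Gaussian densities, the quadratic summaries, and the regularisation level are arbitrary choices of mine for illustration.

```python
# Minimal illustration of Geyer's (1994) logistic-regression trick for density
# ratio estimation (not the LFIRE implementation of Thomas et al.): draws from
# p = N(1,1) and q = N(0,2^2) are opposed in a classifier whose fitted logit
# estimates log p(x) - log q(x).
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 10_000
x_p = rng.normal(1.0, 1.0, n)                  # draws from the "numerator" density p
x_q = rng.normal(0.0, 2.0, n)                  # draws from the "denominator" density q

def feats(x):                                  # summaries (x, x^2): the log-ratio is
    return np.column_stack([x, x**2])          # linear in them, i.e. exponential-family-like

X = np.vstack([feats(x_p), feats(x_q)])
y = np.concatenate([np.ones(n), np.zeros(n)])  # 1 = from p, 0 = from q
clf = LogisticRegression(C=1e3).fit(X, y)      # mild regularisation

log_ratio = lambda x: clf.decision_function(feats(x))   # ~ log p(x) - log q(x)

x0 = np.linspace(-3.0, 4.0, 5)
print(np.round(log_ratio(x0), 2))
print(np.round(norm.logpdf(x0, 1, 1) - norm.logpdf(x0, 0, 2), 2))   # exact reference
```

With balanced samples, the fitted logit converges to log p(x) − log q(x), which is exactly the empirical version of the ratio mentioned above.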

conditional noise contrastive estimation

Posted in Books, pictures, University life on August 13, 2019 by xi'an

At ICML last year, Ciwan Ceylan and Michael Gutmann presented a new version of noise contrastive estimation to deal with intractable constants. While noise contrastive estimation relies upon a second, independent sample to contrast with the observed sample, this approach instead uses a perturbed or noisy version of the original sample, for instance a Normal generation centred at the original datapoint. And eliminates the annoying constant by breaking the (original and noisy) samples into two groups: the probability of belonging to one group or the other then does not depend on the constant, which is a very effective trick. And can be optimised with respect to the parameters of the model of interest. Recovering the score matching objective of Hyvärinen (2005). While this is in line with earlier papers by Gutmann and Hyvärinen, this line of reasoning (starting with Charlie Geyer’s logistic regression) never ceases to amaze me!
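
As a toy rendering of the trick (my own sketch of the idea, not the authors’ code), take an unnormalised Gaussian with unknown precision: pairing each observation with a noisy copy and classifying which element of the pair is the original only involves differences of the unnormalised log-density, so the constant cancels. The precision value, the noise scale, and the optimiser below are arbitrary choices.

```python
# Toy sketch of the conditional noise contrastive idea: each observation x is
# paired with a noisy copy y = x + eps, and the classifier deciding which
# element of the pair is the original only involves log phi(x;theta) -
# log phi(y;theta), so the intractable normalising constant cancels.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
lam_true = 2.0                                   # precision of the unnormalised Gaussian model
n = 50_000
x = rng.normal(0.0, 1.0 / np.sqrt(lam_true), n)  # observed sample
y = x + rng.normal(0.0, 0.5, n)                  # symmetric noisy perturbation

def log_phi(z, lam):
    return -0.5 * lam * z**2                     # unnormalised log-density

def cnce_loss(lam):
    g = log_phi(x, lam) - log_phi(y, lam)        # log-odds that x is the original
    return np.mean(np.logaddexp(0.0, -g))        # -mean log sigmoid(g)

fit = minimize_scalar(cnce_loss, bounds=(1e-2, 10.0), method="bounded")
print(f"estimated precision: {fit.x:.3f} (truth {lam_true})")
```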

sampling and imbalanced

Posted in Statistics on June 21, 2019 by xi'an

Deborshee Sen, Matthias Sachs, Jianfeng Lu and David Dunson have recently arXived a sub-sampling paper for classification (logistic) models where some covariates or some responses are imbalanced. With a PDMP, namely the zig-zag sampler, used towards preserving the correct invariant distribution (as already mentioned in an earlier post on the zig-zag sampler and in a recent Annals paper by Joris Bierkens, Paul Fearnhead, and Gareth Roberts (Warwick)). The current paper is thus an improvement on the above, using (non-uniform) importance sub-sampling across observations and simpler upper bounds for the Poisson process. A rather practical form of Poisson thinning. And proposing unbiased estimates of the sub-sample log-posterior as well as stratified sub-sampling.

I idly wondered if the zig-zag sampler could itself be improved by not switching the bouncing directions at random, since directions associated with almost certainly null coefficients should be neglected as much as possible, but the intensity functions associated with the directions do incorporate this feature. Except that this requires computing the intensities for all directions, which is especially costly when facing many covariates.
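
For concreteness, here is a bare-bones sketch of a zig-zag sampler with subsampled switching rates and constant Poisson-thinning bounds, on a toy logistic regression with a flat prior; it only illustrates the mechanism and is not the importance-subsampled, stratified algorithm of the paper. The data, bounds, and trajectory length are arbitrary choices of mine.

```python
# Bare-bones zig-zag sampler with subsampled switching rates and constant
# Poisson-thinning bounds, on a toy logistic regression with a flat prior
# (proper posterior for this non-separable simulated dataset). Illustration of
# the mechanism only, not the algorithm of Sen et al.
import numpy as np

rng = np.random.default_rng(2)
N, d = 100, 2
X = rng.normal(size=(N, d))
beta_true = np.array([1.0, -0.5])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# constant upper bounds on the subsampled rates, using |sigmoid(.) - y_j| <= 1
Lam = N * np.max(np.abs(X), axis=0)              # one bound per coordinate

def zigzag(T=200.0):
    beta = np.zeros(d)                           # position
    v = rng.choice([-1.0, 1.0], size=d)          # velocities
    t, integral = 0.0, np.zeros(d)               # time integral of the path
    while t < T:
        taus = rng.exponential(1.0 / Lam)        # proposed event time per coordinate
        i = int(np.argmin(taus))
        tau = min(taus[i], T - t)
        integral += beta * tau + 0.5 * v * tau**2    # exact integral of the linear path
        beta, t = beta + v * tau, t + tau
        if t >= T:
            break
        J = rng.integers(N)                      # single data point for the rate estimate
        rate = max(0.0, v[i] * N * X[J, i] * (sigmoid(X[J] @ beta) - y[J]))
        if rng.random() < rate / Lam[i]:         # thinning: accept the velocity flip
            v[i] = -v[i]
    return integral / T                          # path average ~ posterior mean

print("zig-zag path average:", np.round(zigzag(), 2))
print("true coefficients:   ", beta_true)
```

The constant bounds above come from |σ(·) − y| ≤ 1, which is precisely the kind of crude upper bound that non-uniform importance sub-sampling aims at sharpening.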

Thinking of the logistic regression model itself, it is sort of frustrating that something so close to an exponential family causes so many headaches! Formally, it is an exponential family, but the normalising constant is rather unwieldy, especially when there are many observations and many covariates. The Pólya-Gamma completion is a way around it, but it proves highly costly when the dimension is large…
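
To make the headache explicit (writing the Pólya-Gamma identity of Polson, Scott and Windle, 2013, from memory, hence to be checked against the original):

$$
L(\beta)=\prod_{j=1}^n \frac{\exp\{y_j x_j^{\mathsf T}\beta\}}{1+\exp\{x_j^{\mathsf T}\beta\}}
=\exp\Big\{\Big(\sum_j y_j x_j\Big)^{\!\mathsf T}\beta-\sum_j \log\big(1+e^{x_j^{\mathsf T}\beta}\big)\Big\},
$$

$$
\frac{(e^{\psi})^{a}}{(1+e^{\psi})^{b}}
=2^{-b}\,e^{\kappa\psi}\int_0^\infty e^{-\omega\psi^{2}/2}\,p_{\mathrm{PG}(b,0)}(\omega)\,\mathrm d\omega,
\qquad \kappa=a-\tfrac b2,
$$

so the likelihood is exponential-family in β with a log-partition made of n awkward log(1+e^{x_j^T β}) terms, and the identity trades each of them for a latent ω_j, turning the conditional distribution of β into a Gaussian, at the price of one latent variable per observation and a full covariance update per sweep, hence the cost in large samples and large dimensions.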

assessing MCMC convergence

Posted in Books, Statistics, University life on June 6, 2019 by xi'an

When MCMC became mainstream in the 1990’s, there was a flurry of proposals to check, assess, and even guarantee convergence to the stationary distribution, as discussed in our MCMC book. Along with Chantal Guihenneuc and Kerrie Mengersen, we also maintained for a while a reviewww webpage categorising these. Niloy Biswas and Pierre Jacob have recently posted a paper where they propose the use of couplings (and unbiased MCMC) towards deriving bounds on different metrics between the target and the current distribution of the Markov chain. Two chains are created from a given kernel and coupled with a lag of L, meaning that after a while, the two chains become one with a time difference of L. (The supplementary material contains many details on how to induce coupling.) The distance to the target can then be bounded by a sum of distances between the two chains until they merge. The picture in the paper compares a Pólya urn sampler with several HMC samplers for a logistic target (not involving the Pima Indian dataset!). The larger the lag L, the more accurate the bound. But the larger the lag, the more expensive the assessment of how many steps are needed for convergence. Especially when considering that the evaluation requires restarting the chains from scratch and rerunning them until they couple again, rather than continuing one run, which can only bring the chain closer to stationarity and to being distributed from the target. I thus wonder at the possibility of some Rao-Blackwellisation of the simulations used in this assessment (while realising once more that assessing convergence almost inevitably requires another order of magnitude of effort than convergence itself!). Without a clear idea of how to do it… For instance, keeping the values of the chain(s) at the time of coupling is not directly helpful to create a sample from the target, since they are not distributed from that target.

[Pierre also wrote a blog post about the paper on Statisfaction that is definitely much clearer and more pedagogical than the above.]
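
For readers who want to see the bound in action, here is a toy rendering of my own, with a coupled random-walk Metropolis kernel on a standard Normal target rather than the HMC samplers of the paper, and with the bound formula quoted from memory: the L-lag meeting times τ translate into the upper bound E[max(0, ⌈(τ−L−t)/L⌉)] on the total variation distance to the target at time t.

```python
# Toy rendering of the L-lag coupling bound of Biswas & Jacob, with a coupled
# random-walk Metropolis chain on a N(0,1) target; my own sketch, not the
# couplings or samplers of the paper.
import numpy as np

rng = np.random.default_rng(3)
log_pi = lambda z: -0.5 * z**2                  # standard normal target, unnormalised
sigma, x0 = 1.0, 5.0                            # proposal scale and fixed starting point

def coupled_proposals(x, y):
    """Maximal coupling of N(x, sigma^2) and N(y, sigma^2) via rejection."""
    logq = lambda z, m: -0.5 * ((z - m) / sigma) ** 2
    xp = rng.normal(x, sigma)
    if np.log(rng.random()) + logq(xp, x) <= logq(xp, y):
        return xp, xp                           # the two proposals coincide
    while True:
        yp = rng.normal(y, sigma)
        if np.log(rng.random()) + logq(yp, y) > logq(yp, x):
            return xp, yp

def meeting_time(L=10):
    x = x0
    for _ in range(L):                          # advance the X chain L steps first
        xp = rng.normal(x, sigma)
        if np.log(rng.random()) < log_pi(xp) - log_pi(x):
            x = xp
    y, t = x0, L
    while x != y:                               # couple until X_t = Y_{t-L}
        xp, yp = coupled_proposals(x, y)
        u = np.log(rng.random())                # common uniform for both acceptances
        x = xp if u < log_pi(xp) - log_pi(x) else x
        y = yp if u < log_pi(yp) - log_pi(y) else y
        t += 1
    return t

L, R = 10, 500
taus = np.array([meeting_time(L) for _ in range(R)])
tv_bound = lambda t: np.mean(np.maximum(0.0, np.ceil((taus - L - t) / L)))
for t in (0, 25, 50, 100):
    print(f"t = {t:3d}   estimated TV upper bound {tv_bound(t):.3f}")
```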

how a hiring quota failed [or not]

Posted in Books, Statistics, University life on February 26, 2019 by xi'an

This week, Nature has a “career news” section dedicated to how hiring quotas [may have] failed for French university hiring. And based solely on a technical report by a Sciences Po Paris researcher. The hiring quota means that every hiring committee for a French public university position must be made of at least 40% members of each gender. (Plus at least 50% external members.) Which has been reduced to 30% in some severely imbalanced fields like mathematics. The main conclusion of the report is that the reform has had a negative impact on the hiring imbalance between men and women in French universities, with “the higher the share of women in a committee, the lower women are ranked” (p.2). As head of the hiring board in maths at Dauphine, which officiates as a secretarial committee for assembling all hiring committees, I was interested in the reasons for this perceived impact, as I had not observed it at my [first order remote] level. As a warning, the discussion that follows makes little sense without a prior glance at the report.

“Deschamps estimated that without the reform, 21 men and 12 women would have been hired in the field of mathematics. But with the reform, committees whose membership met the quota hired 30 men and 3 women” Nature

Skipping the non-quantitative and somewhat ideological part of the report, as well as the descriptive statistics, I looked mostly at the modelling behind the conclusions, as reported for instance in the above definite statement in Nature. Starting with a collection of assumptions and simplifications. A first dubious such assumption is that fields, and even less convincingly universities, where the more-than-40% quota already held before the 2015 reform could be used as “control groups”, given the huge potential for confounders, especially the huge imbalance in female-to-male ratios across fields. Second, the data only covers hiring histories for three French universities (out of 63 in total) over the years 2009-2018 and furthermore merges assistant professor (Maître de Conférences) and full professor positions, the latter hiring being de facto much more involved, with often one candidate being contacted [prior to the official advertising of the position] by the department as an expression of interest (or the reverse). Third, the remark that

“there are no significant differences between the percentage of women who apply and those who are hired” (p.9)

seems to make the whole discussion moot… and to contradict both the conclusion and the above assertion! Fourth, the candidate’s qualification (or quality) is equated with the h-index, which is highly reductive and, once again, open to considerable biases in terms of seniority and field, depending on the publication lag, on the percentage of publications in English versus the vernacular in the given field, and on the type of publications (from an average of 2.94 in business to 9.96 in physics). Fifth, the report equates academic connections [that may bias the ranking] with having the supervisor present in the hiring committee [which sounds like a clear conflict of interest] or with the candidate applying to the [same] university that delivered his or her PhD, missing a myriad of other connections that make committee members often prone to impact the ranking by reporting facts from outside the application form.

“…controlling for field fixed effects and connections make the coefficient [of the percentage of women in the committee] statistically insignificant, though the point estimate remains high.” (p.17)

The models used by Pierre Deschamps are multivariate logit and probit regressions, where each jury attaches a utility to each of its candidates, made of a qualification term [for the position] and of a gender bias that, most surprisingly, multiplies candidate-gender and jury-gender dummies. The qualification term is expressed as a [jury-free] linear regression on covariates plus a jury fixed effect, plus an error distributed as a Gumbel extreme-value variate that leads to a closed-form likelihood [and this seems to be the only reason for picking this highly skewed distribution]. The probit model is used to model the probability that one candidate has a better utility than another. The main issue with this modelling is the accumulation of independence assumptions, as (i) candidates, hired or not, are not independent, from being evaluated over several positions all at once, with earlier selections and rankings all public, to having to rank themselves all the positions where they are eligible, to possibly being co-authors of other candidates; (ii) juries are not independent either, as the limited pool of external members, especially in gender-imbalanced fields, means that the same faculty member often sits on several juries at once and hence evaluates the same candidates, besides deciding on local rankings in connection with earlier rankings; (iii) neither are the several juries of the same university independent, when this university may try to impose a certain if unofficial gender quota, obviously impossible to fill. Plus, again, a unique modelling across disciplines. A side but not solely technical remark is that, among the covariates used to predict the ranking or the first position of a female candidate, the percentage of female candidates appears, while being exogenous. Again, using a univariate probit to predict the probability that a candidate is ranked first ignores the comparison between a dozen candidates, both male and female, operated by the jury. Overall, I find little reason to give (significant) weight to the indicator that the president is a woman in the logistic regression, and even less to believe that a better gender balance in the juries has led to a worse gender balance in the hirings. From one model to the next the coefficients change from significant to non-significant and, again, I find the definition of the control group fairly crude and unsatisfactory, if only because juries move from one session to the next (and there is little reason to believe one field more gender-biased than another, with everything else accounted for). And because my own experience within hiring committees at Dauphine or elsewhere has never been one where the president strongly impacts the decision. If anything, the president is often more neutral (and never ever, in my opinion, makes use of the additional vote to break ties!)…
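
For the record, the only technical role of the Gumbel assumption is the standard discrete-choice algebra: writing the utility of candidate k as U_k = V_k + ε_k with the ε_k i.i.d. standard Gumbel,

$$
\mathbb P(U_1>U_2)=\frac{e^{V_1}}{e^{V_1}+e^{V_2}}=\frac{1}{1+e^{-(V_1-V_2)}},
\qquad
\mathbb P\big(U_k=\max_j U_j\big)=\frac{e^{V_k}}{\sum_j e^{V_j}},
$$

that is, pairwise comparisons are logistic in the utility difference and the top-ranked candidate follows a multinomial logit, hence the closed-form likelihood.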

algorithm for predicting when kids are in danger [guest post]

Posted in Books, Kids, Statistics on January 23, 2018 by xi'an

[Last week, I read this article in The New York Times about child abuse prediction software and approached Kristian Lum, of HRDAG, for her opinion on the approach, possibly for a guest post which she kindly and quickly provided!]

A week or so ago, an article about the use of statistical models to predict child abuse was published in the New York Times. The article recounts a heart-breaking story of two young boys who died in a fire due to parental neglect. Despite the fact that social services had received “numerous calls” to report the family, human screeners had not regarded the reports as meeting the criteria to warrant a full investigation. Offered as a solution to imperfect and potentially biased human screeners is the use of computer models that compile data from a variety of sources (jails, alcohol and drug treatment centers, etc.) to output a predicted risk score. The implication here is that had the human screeners had access to such technology, the software might have issued a warning that the case was high risk and, based on this warning, the screener might have sent out investigators to intervene, thus saving the children.

These types of models bring up all sorts of interesting questions regarding fairness, equity, transparency, and accountability (which, by the way, are an exciting area of statistical research that I hope some readers here will take up!). For example, most risk assessment models that I have seen are just logistic regressions of [indicator of undesirable outcome] on [characteristics]. In this case, the outcome is likely an indicator of whether child abuse had been determined to take place in the home or not. This raises the issue of whether past determinations of abuse– which make up the training data that is used to build the risk assessment tool– are objective, or whether they encode systemic bias against certain groups that will be passed through the tool to result in systematically biased predictions. To quote the article, “All of the data on which the algorithm is based is biased. Black children are, relatively speaking, over-surveilled in our systems, and white children are under-surveilled.” And one need not look further than the same news outlet to find cases in which there have been egregiously unfair determinations of abuse, which disproportionately impact poor and minority communities. Child abuse isn’t my immediate area of expertise, and so I can’t responsibly comment on whether these types of cases are prevalent enough that the bias they introduce will swamp the utility of the tool.

At the end of the day, we obviously want to prevent all instances of child abuse, and this tool seems to get a lot of things right in terms of transparency and responsible use. And according to the original article, it (at least on the surface) seems to be effective at more efficiently allocating scarce resources to investigate reports of child abuse. As these types of models become used more and more for a wider variety of prediction types, we need to be cognizant that (to quote my brilliant colleague, Josh Norkin) we don’t “lose sight of the fact that because this system is so broken all we are doing is finding new ways to sort our country’s poorest citizens. What we should be finding are new ways to lift people out of poverty.”

machine learning-based approach to likelihood-free inference

Posted in Statistics on March 3, 2017 by xi'an

[polyptych painting within the TransCanada Pipeline Pavilion, Banff Centre, Banff, March 21, 2012]

At ABC’ory last week, Kyle Cranmer gave an extended talk on estimating the likelihood ratio by classification tools. Connected with a 2015 arXival. The idea is that the likelihood ratio is invariant under a transform s(.) that is monotonic with the likelihood ratio itself. It took me a few minutes (after the talk) to understand what this meant, because it is a transform that actually depends on the parameter values in the denominator and the numerator of the ratio. For instance, the ratio itself is a proper transform, in the sense that the likelihood ratio based on the distribution of the likelihood ratio under both parameter values is the same as the original likelihood ratio. Or the (naïve Bayes) probability version of the likelihood ratio. Which reminds me of the invariance in Fearnhead and Prangle (2012) of the Bayes estimate given x and of the Bayes estimate given the Bayes estimate. I also feel there is a connection with Geyer’s logistic regression estimate of normalising constants, mentioned several times on the ‘Og. (The paper mentions in the conclusion the connection with this problem.)

Now, back to the paper (which I read the night after the talk to get a global perspective on the approach), the ratio is of course unknown and the implementation therein is to estimate it by a classification method, estimating thus the probability for a given x to be from one versus the other distribution. Once this estimate is produced, its distributions under both values of the parameter can be estimated by density estimation, hence an estimated likelihood ratio can be produced, with better prospects since this is a one-dimensional quantity. An objection to this derivation is that it intrinsically depends on the pair of parameters θ¹ and θ² used therein. Changing to another pair requires a new ratio, new simulations, and new density estimations. When moving to a continuous collection of parameter values, in a classical setting, the likelihood ratio involves two maxima, which can be formally represented in (3.3) as a maximum over a likelihood ratio based on the estimated densities of likelihood ratios, except that each evaluation of this ratio seems to require another simulation. (Which makes the comparison with ABC more complex than presented in the paper [p.18], since ABC’s major computational hurdle lies in the production of the reference table and, to a lesser degree, in the local regression, both items that can be recycled for any new dataset.) A smoothing step is then to include the pair of parameters θ¹ and θ² as further inputs of the classifier. There still remains the computational burden of simulating enough values of s(x) towards estimating its density for every new value of θ¹ and θ². And while the projection from x to s(x) does effectively reduce the dimension of the problem to one, the method still aims at estimating with some degree of precision the density of x, so it cannot escape the curse of dimensionality. The sleight of hand resides in the classification step, since it is equivalent to estimating the likelihood ratio. I thus fail to understand how and why a poor classifier can then lead to a good approximation of the likelihood ratio “obtained by calibrating s(x)” (p.16). Where calibrating means estimating the density.
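
To fix ideas, here is a toy rendering of the calibrated-classifier construction on a tractable example where the exact ratio is available for comparison (my own sketch, not the code of the paper): classify draws from p(x|θ¹) against draws from p(x|θ²), then estimate the one-dimensional densities of s(x) under both parameter values. The Gaussian models, sample sizes, and kernel density estimates are arbitrary choices.

```python
# Toy rendering of the classifier-based likelihood ratio with a calibration
# step: classify draws from p(x|theta1) against draws from p(x|theta2), then
# estimate the one-dimensional densities of the classifier output s(x) under
# both parameter values. My own sketch on a tractable Gaussian example.
import numpy as np
from scipy.stats import norm, gaussian_kde
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
theta1, theta2 = 0.5, 0.0                       # two Gaussian location models
n = 10_000
x1 = rng.normal(theta1, 1.0, n)
x2 = rng.normal(theta2, 1.0, n)

# probabilistic classifier s(x) ~ P(the sample comes from theta1 | x)
clf = LogisticRegression().fit(
    np.concatenate([x1, x2]).reshape(-1, 1),
    np.concatenate([np.ones(n), np.zeros(n)]),
)
s = lambda x: clf.predict_proba(np.asarray(x).reshape(-1, 1))[:, 1]

# calibration: density estimates of the 1d statistic s(x) under each parameter
kde1, kde2 = gaussian_kde(s(x1)), gaussian_kde(s(x2))

x0 = np.linspace(-2.0, 3.0, 6)
print(np.round(kde1(s(x0)) / kde2(s(x0)), 2))   # calibrated ratio of densities of s
print(np.round(s(x0) / (1 - s(x0)), 2))         # direct classifier odds
print(np.round(norm.pdf(x0, theta1) / norm.pdf(x0, theta2), 2))   # exact ratio
```

On this example both the classifier odds s/(1−s) and the calibrated ratio of densities of s(x) roughly recover the exact likelihood ratio; the question raised above is what happens to the calibrated version when the classifier itself is poor.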