When interviewing impressive applicants from a stunning variety of places and backgrounds for fellowships in our Data Science for Social Good program (in Warwick and Kaiserslautern) this summer, we ran into the common conundrum of comparing ranks when each of us had only met a subset of the candidates. Over a free morning, I briefly thought about the problem (while swimming) and then wrote a short R code to infer an aggregate ranking, ρ, based on a simple model, namely a Poisson distribution on the distance between an individual's ranking and the aggregate,

a uniform distribution on the missing ranks as well as on the aggregate, and a non-informative prior on λ. This leads to a three-step Gibbs sampler for the completion and for the simulation of ρ and λ.
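The post does not include the code (and the original was in R); a minimal Python sketch of such a Gibbs sampler, assuming complete rankings for simplicity (the completion step for missing ranks is omitted) and a Spearman footrule distance (an assumed choice, as the post does not name the distance), could look like:

```python
import numpy as np
from math import lgamma

rng = np.random.default_rng(0)

def footrule(r, rho):
    # Spearman footrule distance between two rankings (an assumed choice)
    return int(np.abs(r - rho).sum())

def gibbs_aggregate(rankings, n_iter=1000):
    """Sample the aggregate ranking rho and scale lam under
    d(r_i, rho) ~ Poisson(lam), with a flat prior on lam."""
    rankings = np.asarray(rankings)
    N, n = rankings.shape
    rho = np.arange(n)            # initial aggregate ranking
    lam = 1.0
    for _ in range(n_iter):
        d = [footrule(r, rho) for r in rankings]
        # lam | rho, data: Gamma(sum(d) + 1, rate N) under a flat prior
        lam = rng.gamma(shape=sum(d) + 1) / N
        # rho | lam, data: Metropolis step proposing a random transposition
        i, j = rng.choice(n, size=2, replace=False)
        prop = rho.copy()
        prop[[i, j]] = prop[[j, i]]
        dp = [footrule(r, prop) for r in rankings]
        # full Poisson log-likelihood ratio, factorial terms included
        log_acc = ((sum(dp) - sum(d)) * np.log(lam)
                   + sum(lgamma(a + 1) - lgamma(b + 1) for a, b in zip(d, dp)))
        if np.log(rng.random()) < log_acc:
            rho = prop
    return rho, lam
```

Since the transposition proposal is symmetric, the acceptance ratio only involves the Poisson likelihood, and the Gamma step follows from conjugacy given the distances.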

I am aware that the problem has been tackled in many different ways, including Bayesian ones (as in Deng et al., 2014) and local ones, but this was a fun exercise. Albeit we did not use any model in the end!

The paper Probabilistic Preference Learning with the Mallows Rank Model by Vitelli et al. was published last year in JMLR, which may be why I missed it. It brings yet another approach to the perpetual issue of intractable normalising constants. Here, the data is made of rankings of n objects by N experts, with the assumption of a latent ordering ρ acting as the “mean” in the Mallows model, along with a scale α, both to be estimated. The likelihood indeed involves an intractable normalising constant that only depends on the scale α because the distance is right-invariant. For instance the Hamming distance used in coding. There exists a simplification of the expression of the normalising constant due to the distance only taking a finite number of values, each multiplied by the number of permutations achieving that value, but this remains a formidable combinatoric problem. Running a Gibbs sampler is not an issue for the parameter ρ, as the resulting Metropolis-Hastings-within-Gibbs step does not involve the missing constant, but it poses a challenge for the scale α, because the Mallows model cannot be exactly simulated for most distances, presumably making the use of pseudo-marginal and exchange algorithms impossible. The authors use instead an importance sampling approximation to the normalising constant, relying on a pseudo-likelihood version of the Mallows model and a massive number (10⁶ to 10⁸) of simulations (in the humongous set of N-sampled permutations of 1,…,n). The interesting point in using this approximation is that the convergence result associated with pseudo-marginals no longer applies and that the resulting MCMC algorithm converges to another limiting distribution, with the drawback that this limiting distribution is conditional on the importance sample. Various extensions are found in the paper, including a mixture of Mallows models. And a round of applications, including one on sushi preferences across Japan (fatty tuna coming almost always on top!).
As the authors note, a very large number of items like n>10⁴ remains a challenge (or requires an alternative model).
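As an illustration of why the constant is both painful and occasionally tractable: for the Kendall tau distance (not the paper's Hamming example), Z(α) admits a well-known closed product form, which a brute-force enumeration over all n! permutations recovers for small n. This is only a sketch of the combinatorial point, not the authors' importance sampling scheme:

```python
import itertools
import math

def inversions(sigma):
    # Kendall tau distance to the identity = number of inversions
    n = len(sigma)
    return sum(1 for i in range(n) for j in range(i + 1, n) if sigma[i] > sigma[j])

def z_brute(n, alpha):
    # normalising constant by summing exp(-alpha d) over all n! permutations
    return sum(math.exp(-alpha * inversions(s))
               for s in itertools.permutations(range(n)))

def z_closed(n, alpha):
    # closed form for Kendall tau: prod_j (1 - e^{-j.alpha}) / (1 - e^{-alpha})
    return math.prod((1 - math.exp(-j * alpha)) / (1 - math.exp(-alpha))
                     for j in range(1, n + 1))
```

Already at n=10 the brute force sums over 3,628,800 terms while the product costs n operations; for other right-invariant distances such closed forms are not generally available, hence the importance sampling approximation.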

This week, Nature has a “career news” section dedicated to how hiring quotas [may have] failed for French university hiring. And based solely on a technical report by a Sciences Po Paris researcher. The hiring quota means that every hiring committee for a French public university must include at least 40% members of each gender. (Plus at least 50% of external members.) Which has been reduced to 30% in some severely imbalanced fields like mathematics. The main conclusion of the report is that the reform has had a negative impact on the hiring imbalance between men and women in French universities, with “the higher the share of women in a committee, the lower women are ranked” (p.2). As head of the hiring board in maths at Dauphine, which officiates as a secretarial committee for assembling all hiring committees, I was interested in the reasons for this perceived impact, as I had not observed it at my [first order remote] level. As a warning, the discussion that follows makes little sense without a prior glance at the paper.

“Deschamps estimated that without the reform, 21 men and 12 women would have been hired in the field of mathematics. But with the reform, committees whose membership met the quota hired 30 men and 3 women” (Nature)

Skipping the non-quantitative and somewhat ideological part of the report, as well as the descriptive statistics, I looked mostly at the modelling behind the conclusions, as reported for instance in the above definite statement in Nature. Starting with a collection of assumptions and simplifications. A first dubious such assumption is that fields, and even less plausibly universities, where the more-than-40% quota already existed before the 2015 reform could be used as “control groups”, given the huge potential for confounders, especially the huge imbalance in female-to-male ratios across fields. Second, the data only covers hiring histories for three French universities (out of 63 total) over the years 2009-2018 and furthermore merges assistant professors (Maîtres de Conférence) with full professors, for whom hiring is de facto much more involved, with often one candidate being contacted [prior to the official advertising of the position] by the department as an expression of interest (or the reverse). Third, the remark that

“there are no significant differences between the percentage of women who apply and those who are hired” (p.9)

seems to make the whole discussion moot… and to contradict both the conclusion and the above assertion! Fourth, a candidate's qualification (or quality) is equated with the h-index, which is highly reductive and, once again, open to considerable biases in terms of seniority and of field. Depending on the publication lag and also on the percentage of publications in English versus the vernacular in the given field. And on the type of publications (from an average h-index of 2.94 in business to 9.96 in physics). Fifth, the report equates academic connections [that may bias the ranking] with having the supervisor present in the hiring committee [which sounds like a clear conflict of interest] or with the candidate applying to the [same] university that delivered his or her PhD. Missing a myriad of other connections that make committee members often prone to impact the ranking by reporting facts from outside the application form.

“…controlling for field fixed effects and connections make the coefficient [of the percentage of women in the committee] statistically insignificant, though the point estimate remains high.” (p.17)

The models used by Pierre Deschamps are multivariate logit and probit regressions, where each jury attaches a utility to each of its candidates, made of a qualification term [for the position] and of a gender bias, most surprisingly multiplying candidate-gender and jury-gender dummies. The qualification term is expressed as a [jury-free] linear regression on covariates, plus a jury fixed effect. Plus an error distributed as a Gumbel extreme-value variate that leads to a closed-form likelihood [and this seems to be the only reason for picking this highly skewed distribution]. The probit model is used to model the probability that one candidate has a better utility than another. The main issue with this modelling is the agglomeration of independence assumptions, as (i) candidates, including the hired ones, are not independent, from being evaluated over several positions all at once, with earlier selections and rankings all public, to having to rank themselves all the positions where they are eligible, to possibly being co-authors of other candidates; (ii) jurys are not independent either, as the limited pool of external members, especially in gender-imbalanced fields, means that the same faculty often ends up in several jurys at once and hence evaluates the same candidates as a result, plus decides on local rankings in connection with earlier rankings; (iii) several jurys of the same university are not independent either when this university may try to impose a certain if unofficial gender quota, a quota obviously impossible to fulfil. Plus again a unique modelling across disciplines. A side but not solely technical remark is that, among the covariates used to predict the ranking or the first position of a female candidate, the percentage of female candidates appears, while being exogenous. Again, using a univariate probit to predict the probability that a candidate is ranked first ignores the comparison between a dozen candidates, both male and female, operated by the jury.
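On the Gumbel choice: the reason it yields a closed-form likelihood is the classical random-utility result that the difference of two independent Gumbel errors is logistic, so pairwise choice probabilities take the logit form. A quick Monte Carlo check (the utility values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
v1, v2 = 0.8, 0.3               # arbitrary deterministic utilities
n = 200_000
u1 = v1 + rng.gumbel(size=n)    # random utility of candidate 1
u2 = v2 + rng.gumbel(size=n)    # random utility of candidate 2
mc = (u1 > u2).mean()           # Monte Carlo choice frequency
logit = np.exp(v1) / (np.exp(v1) + np.exp(v2))  # closed-form logit probability
print(mc, logit)                # the two agree up to Monte Carlo error
```

With any other error distribution (Normal, say) the choice probability involves an integral without such a simple closed form, which supports the suspicion that analytical convenience drove the choice.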
Overall, I find little reason to give (significant) weight to the indicator that the president is a woman in the logistic regression, and even less to believe that a better gender balance in the jurys has led to a worse gender balance in the hirings. From one model to the next, the coefficients change from significant to non-significant and, again, I find the definition of the control group fairly crude and unsatisfactory, if only because jurys change from one session to the next (and there is little reason to believe one field more gender-biased than another, everything else accounted for). And for another, my own experience within hiring committees, in Dauphine or elsewhere, has never been one where the president strongly impacts the decision. If anything, the president is often more neutral (and never ever, in my experience, makes use of the additional vote to break ties!)…

I recently read a fairly interesting paper by Daniel Yekutieli on a Bayesian perspective for parameters selected after viewing the data, published in Series B in 2012. (Disclaimer: I was not involved in processing this paper!)

The first example differentiates the Normal-Normal posterior on the mean, when θ is N(0,1) and x is N(θ,1), from the restricted posterior when θ is N(0,1) and x is N(θ,1) truncated to (0,∞), by restating the latter as the repeated generation from the joint until x>0. This does not sound particularly controversial, except for the notion of selecting the parameter after viewing the data. That the posterior support may depend on the data is not that surprising…!
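The restatement is easy to check by simulation (a sketch, with an arbitrary conditioning bin of my choosing): generating repeatedly from the joint until x>0 and then conditioning on the realised x returns the untruncated posterior N(x/2, ½), since once x is observed the selection event is redundant:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400_000
theta = rng.normal(size=n)        # theta ~ N(0,1)
x = rng.normal(loc=theta)         # x | theta ~ N(theta, 1)
keep = x > 0                      # repeated generation from the joint until x > 0
theta, x = theta[keep], x[keep]
sel = np.abs(x - 1.0) < 0.1       # condition on x near 1 (arbitrary bin)
post_mean = theta[sel].mean()
print(post_mean)                  # close to 1/2 = x/2, the untruncated posterior mean
```

This is the “random parameter” reading of the example; the “fixed parameter” reading, where the likelihood itself is the truncated density φ(x−θ)/Φ(θ), would give a different, selection-adjusted posterior.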

“The observation that selection affects Bayesian inference carries the important implication that in Bayesian analysis of large data sets, for each potential parameter, it is necessary to explicitly specify a selection rule that determines when inference is provided for the parameter and provide inference that is based on the selection-adjusted posterior distribution of the parameter.” (p.31)

The more interesting distinction is between “fixed” and “random” parameters (Section 2.1), which separates cases where the data is from a truncated distribution (given the parameter) from cases where the joint distribution is truncated but misses the normalising constant (a function of θ) for the truncated sampling distribution. The “mixed” case introduces a hyperparameter λ, and the normalising constant integrates out θ and depends on λ, which amounts to switching to another (marginal) prior on θ. This is quite interesting, even though one can debate the very notion of treating “random” and “mixed” “parameters”, which are those where the posterior most often changes, as true parameters. Take for instance Stephen Senn's example (p.6) of the mean associated with the largest observation in a sample of Normal variates with distinct means. When accounting for the distribution of the largest variate, this random variable is no longer a Normal variate with a single unknown mean but instead depends on all the means of the sample. Speaking of the mean of the largest observation is therefore misleading, in that it is neither the mean of the largest observation nor a parameter per se, since the index [of the largest observation] is a random variable induced by the observed sample.
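Senn's point is easy to visualise by simulation: with distinct means (the values below are hypothetical), the index of the largest observation keeps changing from sample to sample, so “the mean of the largest observation” does not point at a fixed parameter:

```python
import numpy as np

rng = np.random.default_rng(3)
mu = np.array([0.0, 0.5, 1.0])    # hypothetical distinct means
# index of the largest observation over many replicated samples
idx = np.array([rng.normal(mu).argmax() for _ in range(50_000)])
freq = np.bincount(idx, minlength=3) / len(idx)
print(freq)    # every component gets selected with positive frequency
```

The component with the largest mean is selected most often, but never always, so any inference about “the selected mean” necessarily involves all three means.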

In conclusion, a very original article, if difficult to assess as it can be argued that selection models other than the “random” case result from an intentional modelling choice of the joint distribution.

While in Brussels last week, I noticed an interesting question on X validated that I considered in the train back home and then more over the weekend. This is a question about spacings, namely how long on average it takes to cover an interval of length L when drawing unit intervals at random (with a torus handling of the endpoints). Which immediately reminded me of Wilfrid Kendall's (Warwick) famous gif animation of coupling from the past via leaves covering a square region, from the top (forward) and from the bottom (backward)…

The problem is rather easily expressed in terms of uniform spacings, more specifically in terms of the maximum spacing being less than 1 (or 1/L, depending on the parameterisation). Except for the additional constraint at the boundary, which is not independent of the other spacings. Replacing this extra event with an independent spacing, there exists a direct formula for the expected stopping time, which can be checked rather easily by simulation. But the exact case appears to add a few more steps to the draws, 3/2 apparently. The following graph displays the regression of the Monte Carlo number of steps over 10⁴ replicas against the exact values:
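A direct simulation of the stopping time (my own quick sketch, not the code behind the graph) rests on the spacings formulation: the circle is covered exactly when every circular gap between the sampled left endpoints is at most 1:

```python
import numpy as np

rng = np.random.default_rng(4)

def cover_time(L, trials=500):
    """Average number of random unit arcs needed to cover a circle of
    circumference L (torus handling of the endpoints)."""
    total = 0
    for _ in range(trials):
        starts = []
        while True:
            starts.append(rng.uniform(0, L))
            pts = np.sort(starts)
            # circular spacings between consecutive arc left endpoints
            gaps = np.diff(pts, append=pts[0] + L)
            if gaps.max() <= 1.0:   # every gap is covered by a unit arc
                total += len(starts)
                break
    return total / trials
```

For L=2, the maximum-spacing distribution gives P(not covered after n draws) = n/2ⁿ⁻¹, so the expected stopping time is 5, and the simulated average settles near that value.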