Archive for logistic regression

reXing the bridge

Posted in Books, pictures, Statistics on April 27, 2021 by xi'an

As I was re-reading Xiao-Li Meng’s and Wing Hung Wong’s 1996 bridge sampling paper in Statistica Sinica, I realised they were making the link with Geyer’s (1994) mythical tech report, in the sense that the iterative construction of α functions “converges to the `reverse logistic regression’ described in Geyer (1994) for the two-density cases” (p.839). Although they also saw the latter as an “iterative” application of Torrie and Valleau’s (1977) “umbrella sampling” estimator. And cited Bennett (1976) in the Journal of Computational Physics [for which Elsevier still asks for $39.95!] as the originator of the formula [check (6)]. And of the optimal solution (check (8)). Bennett (1976) also mentions that the method fares poorly when the targets do not overlap:

“When the two ensembles neither overlap nor satisfy the above smoothness condition, an accurate estimate of the free energy cannot be made without gathering additional MC data from one or more intermediate ensembles”

in which case this sequence of intermediate targets could be constructed and, who knows?!, optimised. (This may be the chain solution discussed in the conclusion of the paper.) Another optimisation not considered in enough detail is the allocation of the computing time to the two densities, maybe using a bandit strategy to avoid estimating the variance of the importance weights first.
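
To make the iterative construction more concrete, here is a minimal R sketch of the Meng and Wong fixed-point iteration for the optimal bridge estimator, on a toy pair of Gaussian targets of my own choosing, with a known ratio of normalising constants equal to ½:

set.seed(42)
n1 <- n2 <- 1e4
x1 <- rnorm(n1, 0, 1)             # draws from p1 = N(0,1)
x2 <- rnorm(n2, 1, 2)             # draws from p2 = N(1,4)
q1 <- function(x) exp(-x^2/2)     # unnormalised p1, Z1 = sqrt(2*pi)
q2 <- function(x) exp(-(x-1)^2/8) # unnormalised p2, Z2 = 2*sqrt(2*pi), so r = Z1/Z2 = 1/2
s1 <- n1/(n1 + n2); s2 <- n2/(n1 + n2)
r <- 1                            # starting value for the ratio of normalising constants
for (t in 1:50) {                 # fixed-point iteration on the optimal bridge function
  num <- mean(q1(x2)/(s1*q1(x2) + s2*r*q2(x2)))
  den <- mean(q2(x1)/(s1*q1(x1) + s2*r*q2(x1)))
  r <- num/den
}
r                                 # should be close to 0.5

The iteration only requires evaluating both unnormalised densities on both samples and converges in a handful of steps on this toy example.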

Metropolis-Hastings via classification

Posted in pictures, Statistics, Travel, University life on February 23, 2021 by xi'an

Veronika Rockova (from Chicago Booth) gave a talk on this theme at the Oxford Stats seminar this afternoon, starting with a survey of ABC, synthetic likelihoods, and pseudo-marginals to motivate her approach via GANs, namely learning an approximation of the likelihood from the GAN discriminator. Her explanation of the GAN-type estimate was crystal clear and made me wonder at the connection with Geyer’s 1994 logistic estimator of the likelihood (a form of discriminator with a fixed generator). She also expressed the ABC approximation hence created as the actual posterior times an exponential tilt, which she proved to be of order 1/n, and showed that a random variant of the algorithm (where the shift is averaged) is unbiased. Most interestingly, the method requires no calibration and no tolerance, except indirectly when building the discriminator, and no summary statistic. A noteworthy tension remains between getting the shape right and getting the location right.
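
To make the connection with Geyer (1994) concrete, here is a hedged toy illustration in R, with a plain logistic regression standing in for the GAN discriminator and two Gaussian samples of my own choosing: the fitted logit of the classifier recovers the log-ratio of the two densities (up to the log-ratio of the sample sizes, zero here).

set.seed(1)
m <- 1e4
xa <- rnorm(m, 1, 1)   # draws from p_a = N(1,1), standing in for the model/generator samples
xb <- rnorm(m, 0, 1)   # draws from p_b = N(0,1), standing in for the reference samples
dat <- data.frame(x = c(xa, xb), lab = rep(1:0, each = m))
fit <- glm(lab ~ x, family = binomial, data = dat)   # the "discriminator"
x0 <- 0.7                                            # fitted logit estimates log p_a(x0) - log p_b(x0)
c(estimated = unname(predict(fit, newdata = data.frame(x = x0))),
  exact = dnorm(x0, 1, 1, log = TRUE) - dnorm(x0, 0, 1, log = TRUE))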

wrong algebra for slice sampler

Posted in Books, Kids, R, Statistics on January 27, 2021 by xi'an

Once more, and thrice alas!, I became aware of a typo in our “Use R!” book through a question on X validated from a reader unable to reproduce the slice set of a basic 2D slice sampler for a logistic regression with coefficients (a,b). Indeed, our slice reads as the incorrect set (missing the intersection over i=1,…,n)

\left\{ (a,b): y_i(a+bx_i) > \log \frac{u_i}{1-u_i} \right\}

when it should have been

\bigcap_{i=1}^{n} \left\{ (a,b)\,:\ (-1)^{y_i}(a+bx_i) > \log\frac{u_i}{1-u_i} \right\}

which is the version I found in my LaTeX file. So I do not know what happened (unless I corrected the LaTeX file at a later date and cannot remember it, but the latest change to the file reads October 2011…). Fortunately, the resulting slices in a and b and the accompanying R code remain correct. Unfortunately, both French and Japanese translations reproduce the mistake…
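
For the record, here is a minimal R sketch of the corrected Gibbs-cum-slice sampler (not the book’s code), assuming the standard parameterisation P(y_i=1|x_i)=exp(a+bx_i)/{1+exp(a+bx_i)} and a flat prior, so that the sign in the constraints is (2y_i−1); the book’s sign convention may differ, but the structure of the slices in a and b is the same.

set.seed(3)
n <- 100
x <- rnorm(n)
y <- rbinom(n, 1, plogis(1 + 2*x))    # simulated data, true (a,b) = (1,2)
s <- 2*y - 1                          # sign +1 when y_i = 1, -1 when y_i = 0
lik <- function(a, b) plogis(s*(a + b*x))   # per-observation likelihood terms
niter <- 1e4
ab <- matrix(0, niter, 2); a <- b <- 0
for (t in 1:niter) {
  u <- runif(n)*lik(a, b)             # auxiliary uniforms below the current likelihood terms
  L <- log(u/(1 - u))                 # thresholds in the slice constraints s_i(a + b*x_i) > L_i
  a <- runif(1, max((L - b*x)[s > 0], -Inf), min((-L - b*x)[s < 0], Inf))  # slice in a given b
  bnd <- (L - s*a)/(s*x)              # constraint bounds for b given a
  b <- runif(1, max(bnd[s*x > 0], -Inf), min(bnd[s*x < 0], Inf))           # slice in b given a
  ab[t, ] <- c(a, b)
}
colMeans(ab[-(1:1000), ])             # posterior means, roughly recovering (1,2)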

scalable Metropolis-Hastings, nested Monte Carlo, and normalising flows

Posted in Books, pictures, Statistics, University life on June 16, 2020 by xi'an

Over a sunny if quarantined Sunday, I started reading the PhD dissertation of Rob Cornish, Oxford University, as I was the external member of his viva committee, and ended up in a highly pleasant afternoon discussing the thesis over a (remote) viva yesterday. (If bemoaning a lost opportunity to visit Oxford!) The introduction to the viva was most helpful and set the results within the different time and geographical zones of the PhD, since Rob had to switch from one group of advisors in Engineering to another group in Statistics. Plus an encompassing prospective discussion, expressing pessimism at exact MCMC for complex models and looking forward to further advances in probabilistic programming.

Made of three papers, the thesis includes this ICML 2019 [remember the era when there were conferences?!] paper on scalable Metropolis-Hastings, by Rob Cornish, Paul Vanetti, Alexandre Bouchard-Côté, Georges Deligiannidis, and Arnaud Doucet, which I commented on last year. It achieves a remarkable and paradoxical O(1/√n) cost per iteration, provided (global) lower bounds can be found on the (local) Metropolis-Hastings acceptance probabilities, since these allow for Poisson thinning à la Devroye (1986), with second-order Taylor expansions constructed for all components of the target and the third-order derivatives providing the bounds. However, the variability of the acceptance probability gets higher, which makes for slower, if still manageable, exploration, provided the concentration of the posterior is in tune with the Bernstein-von Mises asymptotics. I had not paid enough attention in my first read to the strong theoretical justification for the method, relying on the convergence of MAP estimates in well- and (some) mis-specified settings. Now, I would have liked to see the paper deal with a more complex problem than logistic regression.
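
Since the Poisson thinning step is what makes the subsampling work, here is a hedged toy sketch in R of that mechanism alone, with made-up per-datum terms μ_i and upper bounds λ_i (nothing below is taken from the paper beyond the thinning identity itself): it delivers a coin with success probability exp(−Σ_i μ_i) while evaluating only a Poisson(Σ_i λ_i) number of the μ_i, about one on average here instead of all 10⁴.

set.seed(7)
n <- 1e4
mu <- rexp(n, rate = 2*n)   # stand-ins for the positive parts of the per-datum log acceptance factors (kept tiny, as after Taylor control)
lambda <- 2*mu              # assumed available upper bounds, lambda_i >= mu_i
Lam <- sum(lambda)          # total bound, precomputable
thinned_accept <- function() {
  N <- rpois(1, Lam)                                 # number of candidate factors to inspect
  if (N == 0) return(TRUE)                           # no candidate, accept outright
  idx <- sample(n, N, replace = TRUE, prob = lambda) # candidates drawn proportionally to their bounds
  all(runif(N) > mu[idx]/lambda[idx])                # accept iff no candidate survives the thinning
}
## sanity check: the thinned coin matches the exact acceptance probability exp(-sum(mu))
c(thinned = mean(replicate(2e3, thinned_accept())), exact = exp(-sum(mu)))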

The second paper in the thesis is an ICML 2018 proceedings paper by Tom Rainforth, Robert Cornish, Hongseok Yang, Andrew Warrington, and Frank Wood, which considers Monte Carlo problems involving several nested expectations in a non-linear manner, meaning that (a) several levels of Monte Carlo approximations are required, with associated asymptotics, and (b) the resulting overall estimator is biased. This includes common doubly intractable posteriors, obviously, as well as (Bayesian) design and control problems. [And it has nothing to do with nested sampling.] The resolution chosen by the authors is strictly plug-in, in that they replace each level in the nesting with a Monte Carlo substitute and do not attempt to reduce the bias. Which means a wide range of solutions (other than the plug-in one) could have been investigated, including bootstrap maybe. For instance, Bayesian design is presented as an application of the approach, but since it relies on the log-evidence, there exist several versions for estimating (unbiasedly) this log-evidence. Similarly, the Forsythe-von Neumann technique applies to arbitrary transforms of a primary integral. The central discussion dwells on the optimal choice of the volume of simulations at each level, optimal in terms of asymptotic MSE. Or rather of an asymptotic bound on the MSE. The interesting result being that the outer expectation requires the square of the number of simulations used for the other expectations, all of which need to converge to infinity. A trick in finding an estimator for a polynomial transform reminded me of the SAME algorithm in that it duplicated the simulations as many times as the highest power of the polynomial. (The ‘Og briefly reported on this paper… four years ago.)
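
As a hedged toy illustration of the plug-in estimator and of the square-allocation rule, with all distributional choices mine and a known answer of 1: take γ = E_Y[g(E_X[X|Y])] with g(z)=z², Y ~ N(0,1) and X|Y ~ N(Y,1), so that γ = E[Y²] = 1 and the plug-in bias is of order 1/M for M inner simulations.

set.seed(11)
nested <- function(M) {                 # inner sample size M, outer sample size N = M^2
  N <- M^2                              # square allocation suggested by the asymptotic MSE analysis
  y <- rnorm(N)                         # outer draws
  inner <- y + rowMeans(matrix(rnorm(N*M), N, M))   # M inner draws per outer draw, averaged
  mean(inner^2)                         # plug-in estimate of E[Y^2] = 1, biased by about 1/M
}
sapply(c(10, 30, 100), nested)          # bias (and noise) shrink as M grows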

The third and last part of the thesis is a proposal [to appear in ICML 20] on relaxing bijectivity constraints in normalising flows with continuously indexed flows. (Or CIF. As Rob made a joke about this cleaning brand, let me add (?) to that joke by mentioning that, in a Trump cum COVID era, looking at CIF and bijections is less dangerous than looking at CIF and injections!) With Anthony Caterini, George Deligiannidis and Arnaud Doucet as co-authors. I am much less familiar with this area and hence a wee bit puzzled at the purpose of removing what I understand to be an appealing side of normalising flows, namely to produce a manageable representation of density functions as a combination of bijective and differentiable functions of a baseline random vector, like a standard Normal vector. The argument made in the paper is that imposing this representation of the density imposes a constraint on the topology of its support, since said support is homeomorphic to the support of the baseline random vector. While the supporting theoretical argument is a mathematical theorem showing that the Lipschitz bound on the transform must be infinite when the supports are topologically different, these arguments may be overly theoretical when faced with the practical implications of the replacement strategy. I somewhat miss its overall strength given that the whole point seems to be in approximating a density function, based on a finite sample.
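
For readers as unfamiliar with the area as I am, here is a minimal R reminder of the bijective change-of-variables representation that the paper seeks to relax, with a single hand-picked bijection of a standard Normal base variable (nothing to do with CIFs themselves):

set.seed(5)
f     <- sinh                              # a bijection from R to R
f_inv <- asinh
log_density <- function(x)                 # log p_X(x) = log p_Z(f^{-1}(x)) + log |d f^{-1}(x)/dx|
  dnorm(f_inv(x), log = TRUE) - 0.5*log(1 + x^2)          # since d asinh(x)/dx = 1/sqrt(1 + x^2)
integrate(function(x) exp(log_density(x)), -Inf, Inf)     # integrates to one
hist(f(rnorm(1e5)), breaks = 200, freq = FALSE, xlim = c(-5, 5))  # sampling = pushing base draws through f
curve(exp(log_density(x)), add = TRUE, col = 2, lwd = 2)          # matches the histogram

Because f is a homeomorphism, the support of the transformed variable is forced to be homeomorphic to the support of the base variable, which is precisely the topological constraint the continuously indexed construction aims at relaxing.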

likelihood-free inference by ratio estimation

Posted in Books, Mountains, pictures, Running, Statistics, Travel, University life on September 9, 2019 by xi'an

“This approach for posterior estimation with generative models mirrors the approach of Gutmann and Hyvärinen (2012) for the estimation of unnormalised models. The main difference is that here we classify between two simulated data sets while Gutmann and Hyvärinen (2012) classified between the observed data and simulated reference data.”

A 2018 arXiv posting by Owen Thomas et al. (including my colleague at Warwick, Rito Dutta, CoI warning!) about estimating the likelihood (and the posterior) when it is intractable. Likelihood-free but not ABC, since the ratio of the likelihood to the marginal is estimated in a non- or semi-parametric (and biased) way. Following Geyer’s 1994 fabulous estimate of an unknown normalising constant via logistic regression, the current paper, which I read in preparation for my discussion in the ABC optimal design session in Salzburg, uses probabilistic classification and an exponential family representation of the ratio. Opposing data from the density and data from the marginal, assuming both can be readily produced. The logistic regression minimizing the asymptotic classification error is the logistic transform of the log-ratio. For a finite (double) sample, this minimization thus leads to an empirical version of the ratio. Or to a smooth version if the log-ratio is represented as a convex combination of summary statistics, turning the approximation into an exponential family, which is a clever way to buckle the buckle towards ABC notions. And synthetic likelihood. Although with a difference in estimating the exponential family parameters β(θ) by minimizing the classification error, parameters that are indeed conditional on the parameter θ. Actually the paper introduces a further penalisation or regularisation term on those parameters β(θ), which could have been processed by a Bayesian Lasso instead. This step is essentially driving the selection of the summaries, except that it is done for each value of the parameter θ, at the expense of an X-validation step. This is quite an original approach, as far as I can tell, but I wonder at the link with more standard density estimation methods, in particular in terms of the precision of the resulting estimate (and the speed of convergence with the sample size, if convergence there is).
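
As a hedged toy check of the classification idea, on a conjugate Gaussian example of my own (not the paper’s setting) where the likelihood-to-marginal ratio is available in closed form, a logistic regression on two summary statistics of the simulated data recovers the log-ratio through its logit:

set.seed(9)
m <- 10; theta0 <- 1              # dataset size and the parameter value at which the ratio is estimated
nrep <- 1e4                       # number of simulated datasets per class
## the summary is the sample mean (simulated directly for each dataset), plus its square
s_lik  <- rnorm(nrep, theta0, sqrt(1/m))     # summaries of datasets from the likelihood p(x|theta0)
s_marg <- rnorm(nrep, 0, sqrt(1 + 1/m))      # summaries of datasets from the marginal (prior N(0,1))
dat <- data.frame(s1 = c(s_lik, s_marg), s2 = c(s_lik, s_marg)^2, lab = rep(1:0, each = nrep))
fit <- glm(lab ~ s1 + s2, family = binomial, data = dat)   # probabilistic classifier on the summaries
s_obs <- 0.8                                               # observed summary (arbitrary)
c(estimated = unname(predict(fit, newdata = data.frame(s1 = s_obs, s2 = s_obs^2))),
  exact = dnorm(s_obs, theta0, sqrt(1/m), log = TRUE) - dnorm(s_obs, 0, sqrt(1 + 1/m), log = TRUE))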