Archive for Université Paris Dauphine

Markov chain importance sampling

Posted in Books, pictures, Running, Statistics, Travel, University life with tags , , , , , , , , , , , on May 31, 2018 by xi'an

Ingmar Schuster (formerly a postdoc at Dauphine and now in Freie Universität Berlin) and Ilja Klebanov (from Berlin) have recently arXived a paper on recycling proposed values in [a rather large class of] Metropolis-Hastings and unadjusted Langevin algorithms. This means using the proposed variates of one of these algorithms as in an importance sampler, with an importance weight going from the target over the (fully conditional) proposal to the target over the marginal stationary target. In the Metropolis-Hastings case, since the later is not available in most setups, the authors suggest using a Rao-Blackwellised nonparametric estimate based on the entire MCMC chain. Or a subset.

“Our estimator refutes the folk theorem that it is hard to estimate [the normalising constant] with mainstream Monte Carlo methods such as Metropolis-Hastings.”

The paper thus brings an interesting focus on the proposed values, rather than on the original Markov chain,  which naturally brings back to mind the derivation of the joint distribution of these proposed values we made in our (1996) Rao-Blackwellisation paper with George Casella. Where we considered a parametric and non-asymptotic version of this distribution, which brings a guaranteed improvement to MCMC (Metropolis-Hastings) estimates of integrals. In subsequent papers with George, we tried to quantify this improvement and to compare different importance samplers based on some importance sampling corrections, but as far as I remember, we only got partial results along this way, and did not cover the special case of the normalising constant Þ… Normalising constants did not seem such a pressing issue at that time, I figure. (A Monte Carlo 101 question: how can we be certain the importance sampler offers a finite variance?)

Ingmar’s views about this:

I think this is interesting future work. My intuition is that for Metropolis-Hastings importance sampling with random walk proposals, the variance is guaranteed to be finite because the importance distribution ρ_θ is a convolution of your target ρ with the random walk kernel q. This guarantees that the tails of ρ_θ are no lighter than those of ρ. What other forms of q mean for the tails of ρ_θ I have less intuition about.

When considering the Langevin alternative with transition (4), I was first confused and thought it was incorrect for moving from one value of Y (proposal) to the next. But that’s what unadjusted means in “unadjusted Langevin”! As pointed out in the early Langevin literature, e.g., by Gareth Roberts and Richard Tweedie, using a discretised Langevin diffusion in an MCMC framework means there is a risk of non-stationarity & non-ergodicity. Obviously, the corrected (MALA) version is more delicate to approximate (?) but at the very least it ensures the Markov chain does not diverge. Even when the unadjusted Langevin has a stationary regime, its joint distribution is likely quite far from the joint distribution of a proper discretisation. Now this also made me think about a parameterised version in the 1996 paper spirit, but there is nothing specific about MALA that would prevent the implementation of the general principle. As for the unadjusted version, the joint distribution is directly available.  (But not necessarily the marginals.)

Here is an answer from Ingmar about that point

Personally, I think the most interesting part is the practical performance gain in terms of estimation accuracy for fixed CPU time, combined with the convergence guarantee from the CLT. ULA was particularly important to us because of the papers of Arnak Dalalyan, Alain Durmus & Eric Moulines and recently from Mike Jordan’s group, which all look at an unadjusted Langevin diffusion (and unimodal target distributions). But MALA admits a Metropolis-Hastings importance sampling estimator, just as Random Walk Metropolis does – we didn’t include MALA in the experiments to not get people confused with MALA and ULA. But there is no delicacy involved whatsoever in approximating the marginal MALA proposal distribution. The beauty of our approach is that it works for almost all Metropolis-Hastings algorithms where you can evaluate the proposal density q, there is no constraint to use random walks at all (we will emphasize this more in the paper).

in the street for a year

Posted in Mountains, pictures, Travel, University life with tags , , , , , , , , , , on April 13, 2018 by xi'an

Just like about every year, I sent two of my pictures to the photography competition of Paris Dauphine, with not much consideration for the theme “green the future”, and was hence quite surprised to get selected this time! (Almost as much surprised as last year when an almost perfect copy of my picture of the Alcazar Baths of Lady María de Padilla got selected!) As I could travel back from Oxford to attend the opening ceremony, I went there last night, wondering at which of my pictures had been selected (Lac Pavin, Auvergne versus the Quinrang, Skye)…


And so this picture will remain exposed in the street, boulevard Lannes, for the incoming year, meaning I will cross it each time I bike to the university! The 22 other pictures were more in tune with the theme of a green future, like the winning one of a fast moving métro carriage at the station Chemin Vert. Or this simple blade of grass growing from ashes…

 
And thus the winner is… Continue reading

accelerating MCMC

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , on April 11, 2018 by xi'an

As forecasted a rather long while ago (!), I wrote a short and incomplete survey on some approaches to accelerating MCMC. With the massive help of Victor Elvira (Lille), Nick Tawn (Warwick) and Changye Wu (Dauphine). Survey which current version just got arXived and which has now been accepted by WIREs Computational Statistics. The typology (and even the range of methods) adopted here is certainly mostly arbitrary, with suggestions for different divisions made by a very involved and helpful reviewer. While we achieved a quick conclusion to the review process, suggestions and comments are most welcome! Even if we cannot include every possible suggestion, just like those already made on X validated. (WIREs stands for Wiley Interdisciplinary Reviews and its dozen topics cover several fields, from computational stats to biology, to medicine, to engineering.)

gender gaps

Posted in Statistics, University life with tags , , , , , , , , , , on March 31, 2018 by xi'an

Two of my colleagues [and co-authors] at Dauphine, Elyès Jouini and Clotilde Napp, published a paper in Science last week (and an associated tribune in Le Monde which I spotted first) about explaining differences in national gender inequalities in maths (as measured by PISA) in terms of the degree of overall inequality in the respective countries. Gaps in the highest maths performer sex ratio. While I have no qualm about the dependency or the overall statistical cum machine learning analysis (supported by our common co-author Jean-Michel Marin), and while I obviously know nothing about the topic!, I leisurely wonder at the cultural factor (which may also partly explain for the degree of inequality) when considering that the countries at the bottom of the above graphs are rather religious (and mostly catholic). I also find it most intriguing that the gender gap is consistently reversed when considering higher performer sex ratio for reading, because mastering the language should be a strong factor in power structures and hence differences therein should also lead to inequalities…

ABCDE for approximate Bayesian conditional density estimation

Posted in Books, pictures, Statistics, Travel, University life with tags , , , , , , , , , , , , , on February 26, 2018 by xi'an

Another arXived paper I surprisingly (?) missed, by George Papamakarios and Iain Murray, on an ABCDE (my acronym!) substitute to ABC for generative models. The paper was reviewed [with reviews made available!] and accepted by NIPS 2016. (Most obviously, I was not one of the reviewers!)

“Conventional ABC algorithms such as the above suffer from three drawbacks. First, they only represent the parameter posterior as a set of (possibly weighted or correlated) samples [for which] it is not obvious how to perform some other computations using samples, such as combining posteriors from two separate analyses. Second, the parameter samples do not come from the correct Bayesian posterior (…) Third, as the ε-tolerance is reduced, it can become impractical to simulate the model enough times to match the observed data even once [when] simulations are expensive to perform”

The above criticisms are a wee bit overly harsh as, well…, Monte Carlo approximations remain a solution worth considering for all Bayesian purposes!, while the approximation [replacing the data with a ball] in ABC is replaced with an approximation of the true posterior as a mixture. Both requiring repeated [and likely expensive] simulations. The alternative is in iteratively simulating from pseudo-predictives towards learning better pseudo-posteriors, then used as new proposals at the next iteration modulo an importance sampling correction.  The approximation to the posterior chosen therein is a mixture density network, namely a mixture distribution with parameters obtained as neural networks based on the simulated pseudo-observations. Which the authors claim [p.4] requires no tuning. (Still, there are several aspects to tune, from the number of components to the hyper-parameter λ [p.11, eqn (35)], to the structure of the neural network [20 tanh? 50 tanh?], to the number of iterations, to the amount of X checking. As usual in NIPS papers, it is difficult to assess how arbitrary the choices made in the experiments are. Unless one starts experimenting with the codes provided.) All in all, I find the paper nonetheless exciting enough (!) to now start a summer student project on it in Dauphine and hope to check the performances of ABCDE on different models, as well as comparing this ABC implementation with a synthetic likelihood version.

 As an addendum, let me point out the very pertinent analysis of this paper by Dennis Prangle, 18 months ago!

complex Cauchys

Posted in Books, pictures, Statistics, Travel, University life with tags , , , , , , , , , , , on February 8, 2018 by xi'an

During a visit of Don Fraser and Nancy Reid to Paris-Dauphine where Nancy gave a nice introduction to confidence distributions, Don pointed out to me a 1992 paper by Peter McCullagh on the Cauchy distribution. Following my recent foray into the estimation of the Cauchy location parameter. Among several most interesting aspects of the Cauchy, Peter re-expressed the density of a Cauchy C(θ¹,θ²) as

f(x;θ¹,θ²) = |θ²| / |x-θ|²

when θ=θ¹+ιθ² [a complex number on the half-plane]. Denoting the Cauchy C(θ¹,θ²) as Cauchy C(θ), the property that the ratio aX+b/cX+d follows a Cauchy for all real numbers a,b,c,d,

C(aθ+b/cθ+d)

[when X is C(θ)] follows rather readily. But then comes the remark that

“those properties follow immediately from the definition of the Cauchy as the ratio of two correlated normals with zero mean.”

which seems to relate to the conjecture solved by Natesh Pillai and Xiao-Li Meng a few years ago. But the fact that  a ratio of two correlated centred Normals is Cauchy is actually known at least from the1930’s, as shown by Feller (1930, Biometrika) and Geary (1930, JRSS B).

exams

Posted in Kids, Statistics, University life with tags , , , , , , , on February 7, 2018 by xi'an
As in every term, here comes the painful week of grading hundreds of exams! My mathematical statistics exam was highly traditional and did not even involve Bayesian material, as the few students who attended the lectures were so eager to discuss sufficiency and ancilarity, that I decided to spend an extra lecture on these notions rather than rushing though conjugate priors. Highly traditional indeed with an inverse Gaussian model and a few basic consequences of Basu’s theorem. actually exposed during this lecture. Plus mostly standard multiple choices about maximum likelihood estimation and R programming… Among the major trends this year, I spotted out the widespread use of strange derivatives of negative powers, the simultaneous derivation of two incompatible convergent estimates, the common mixup between the inverse of a sum and the sum of the inverses, the inability to produce the MLE of a constant transform of the parameter, the choice of estimators depending on the parameter, and a lack of concern for Fisher informations equal to zero.