Archive for Stanford University
I have just heard that Charles Stein, Professor at Stanford University, passed away last night. Although the following image is definitely over-used, I truly feel this is the departure of a giant of statistics. He has been deeply influential on the fields of probability and mathematical statistics, primarily in decision theory and approximation techniques. On the first field, he led to considerable changes in the perception of optimality by exhibiting the Stein phenomenon, where the aggregation of several admissible estimators of unrelated quantities may (and will) become inadmissible for the joint estimation of those quantities! Although the result can be explained by mathematical and statistical reasoning, it was still dubbed a paradox due to its counter-intuitive nature. More foundationally, it led to expose the ill-posed nature of frequentist optimality criteria and certainly contributed to the Bayesian renewal of the 1980’s, before the MCMC revolution. (It definitely contributed to my own move, as I started working on the Stein phenomenon during my thesis, before realising the fundamentally Bayesian nature of the domination results.)
“…the Bayesian point of view is often accompanied by an insistence that people ought to agree to a certain doctrine even without really knowing what this doctrine is.” (Statistical Science, 1986)
The second major contribution of Charles Stein was the introduction of a new technique for normal approximation that is now called the Stein method. It relies on a differential operator and produces estimates of approximation error in Central Limit theorems, even in dependent settings. While I am much less familiar with this aspect of Charles Stein’s work, I believe the impact it has had on the field is much more profound and durable than the Stein effect in Normal mean estimation.
[I wrote this set of comments right after MCqMC 2016 on a preliminary version of the paper so mileage may vary in terms of the adequation to the current version!]
In warp-U bridge sampling, newly arXived and first presented at MCqMC 16, Xiao-Li Meng continues (in collaboration with Lahzi Wang) his exploration of bridge sampling techniques towards improving the estimation of normalising constants and ratios thereof. The bridge sampling estimator of Meng and Wong (1996) is an harmonic mean importance sampler that requires iterations as it depends on the ratio of interest. Given that the normalising constant of a density does not depend on the chosen parameterisation in the sense that the Jacobian transform preserves this constant, a degree of freedom is in the choice of the parameterisation. This is the idea behind warp transformations. The initial version of Meng and Schilling (2002) used location-scale transforms, while the warp-U solution goes for a multiple location-scale transform that can be seen as based on a location-scale mixture representation of the target. With K components. This approach can also be seen as a sort of artificial reversible jump algorithm when one model is fully known. A strategy Nicolas and I also proposed in our nested sampling Biometrika paper.
Once such a mixture approximation is obtained. each and every component of the mixture can be turned into the standard version of the location-scale family by the appropriate location-scale transform. Since the component index k is unknown for a given X, they call this transform a random transform, which I find somewhat more confusing that helpful. The conditional distribution of the index given the observable x is well-known for mixtures and it is used here to weight the component-wise location-scale transforms of the original distribution p into something that looks rather similar to the standard version of the location-scale family. If no mode has been forgotten by the mixture. The simulations from the original p are then rescaled by one of those transforms, which index k is picked according to the conditional distribution. As explained later to me by XL, the random[ness] in the picture is due to the inclusion of a random ± sign. Still, in the notation introduced in (13), I do not get how the distribution Þ [sorry for using different symbols, I cannot render a tilde on a p] is defined since both ψ and W are random. Is it the marginal? In which case it would read as a weighted average of rescaled versions of p. I have the same problem with Theorem 1 in that I do not understand how one equates Þ with the joint distribution.
Equation (21) is much more illuminating (I find) than the previous explanation in that it exposes the fact that the principle is one of aiming at a new distribution for both the target and the importance function, with hopes that the fit will get better. It could have been better to avoid the notion of random transform, then, but this is mostly a matter of conveying the notion.
On more specifics points (or minutiae), the unboundedness of the likelihood is rarely if ever a problem when using EM. An alternative to the multiple start EM proposal would then be to get sequential and estimate the mixture in a sequential manner, only adding a component when it seems worth it. See eg Chopin and Pelgrin (2004) and Chopin (2007). This could also help with the bias mentioned therein since only a (tiny?) fraction of the data would be used. And the number of components K has an impact on the accuracy of the approximation, as in not missing a mode, and on the computing time. However my suggestion would be to avoid estimating K as this must be immensely costly.
Section 6 obviously relates to my folded Markov interests. If I understand correctly, the paper argues that the transformed density Þ does not need to be computed when considering the folding-move-unfolding step as a single step rather than three steps. I fear the description between equations (30) and (31) is missing the move step over the transformed space. Also on a personal basis I still do not see how to add this approach to our folding methodology, even though the different transforms act as as many replicas of the original Markov chain.
The nice logo of MCqMC 2016 was a collection of eight series of QMC dots on the unit (?) cube. The organisers set a competition to identify the principles behind those quasi-random sets and as I had no idea for most of them I entered very random sets unconnected with algorithmia, for which I got an honourable mention and a CD prize (if not the conference staff tee-shirt I was coveting!) Art Owen sent me back my entry, posted below and hopefully (or not!) readable.
In his plenary talk this morning, Arnaud Doucet discussed the application of pseudo-marginal techniques to the latent variable models he has been investigating for many years. And its limiting behaviour towards efficiency, with the idea of introducing correlation in the estimation of the likelihood ratio. Reducing complexity from O(T²) to O(T√T). With the very surprising conclusion that the correlation must go to 1 at a precise rate to get this reduction, since perfect correlation would induce a bias. A massive piece of work, indeed!
The next session of the morning was another instance of conflicting talks and I hoped from one room to the next to listen to Hani Doss’s empirical Bayes estimation with intractable constants (where maybe SAME could be of interest), Youssef Marzouk’s transport maps for MCMC, which sounds like an attractive idea provided the construction of the map remains manageable, and Paul Russel’s adaptive importance sampling that somehow sounded connected with our population Monte Carlo approach. (With the additional step of considering transform maps.)
An interesting item of information I got from the final announcements at MCqMC 2016 just before heading to Monash, Melbourne, is that MCqMC 2018 will take place in the city of Rennes, Brittany, on July 2-6. Not only it is a nice location on its own, but it is most conveniently located in space and time to attend ISBA 2018 in Edinburgh the week after! Just moving from one Celtic city to another Celtic city. Along with other planned satellite workshops, this occurrence should make ISBA 2018 more attractive [if need be!] for participants from oversea.
On Thursday, Christoph Aistleiter [from TU Gräz] gave a plenary talk at MCqMC 2016 around Hermann Weyl’s 1916 paper, Über die Gleichverteilung von Zahlen mod. Eins, which demonstrates that the sequence a, 22a, 32a, … mod 1 is uniformly distributed on the unit interval when a is irrational. Obviously, the notion was not introduced for simulation purposes, but the construction applies in this setting! At least in a theoretical sense. Since for instance the result that the sequence (a,a²,a³,…) mod 1 being uniformly distributed for almost all a’s has not yet found one realisation a. But a nice hour of history of mathematics and number theory: it is not that common we hear the Riemann zeta function mentioned in a simulation conference!
The following session was a nightmare in that I wanted to attend all four at once! I eventually chose the transport session, in particular because Xiao-Li advertised it at the end of my talk. The connection is that his warp bridge sampling technique provides a folding map between modes of a target. Using a mixture representation of the target and folding all components to a single distribution. Interestingly, this transformation does not require a partition and preserves the normalising constants [which has a side appeal for bridge sampling of course]. In a problem with an unknown number of modes, the technique could be completed by [our] folding in order to bring the unobserved modes into the support of the folded target. Looking forward the incoming paper! The last talk of this session was by Matti Vihola, connecting multi-level Monte Carlo and unbiased estimation à la Rhee and Glynn, paper that I missed when it got arXived last December.
The last session of the day was about probabilistic numerics. I have already discussed extensively about this approach to numerical integration, to the point of being invited to the NIPS workshop as a skeptic! But this was an interesting session, both with introductory aspects and with new ones from my viewpoint, especially Chris Oates’ description of a PN method for handling both integrand and integrating measure as being uncertain. Another arXival that went under my decidedly deficient radar.