Archive for MCqMC 2016

your topic is so much impressive

Posted in University life with tags , , , on August 5, 2017 by xi'an

An email from a predatory “journal” I received last weekend… With presumably all other speakers at MCqMC 2016. Items of [moderate] interest after looking at the “journal” website:

  • weird wording
  • no mention is made in the email of the $650 required to publish a paper
  • the Editorial Board is nonexistent, to the point that there is no Editor and the page calls for applications

warp-U bridge sampling

Posted in Books, Statistics, Travel, University life with tags , , , , , , , , , , on October 12, 2016 by xi'an

[I wrote this set of comments right after MCqMC 2016 on a preliminary version of the paper, so mileage may vary in terms of how well they fit the current version!]

In warp-U bridge sampling, newly arXived and first presented at MCqMC 16, Xiao-Li Meng continues (in collaboration with Lazhi Wang) his exploration of bridge sampling techniques towards improving the estimation of normalising constants and ratios thereof. The bridge sampling estimator of Meng and Wong (1996) is a harmonic mean importance sampler that requires iterations, as it depends on the very ratio of interest. Given that the normalising constant of a density does not depend on the chosen parameterisation, in the sense that the Jacobian transform preserves this constant, there is a degree of freedom in the choice of the parameterisation. This is the idea behind warp transformations. The initial version of Meng and Schilling (2002) used location-scale transforms, while the warp-U solution goes for a multiple location-scale transform that can be seen as based on a location-scale mixture representation of the target, with K components. This approach can also be seen as a sort of artificial reversible jump algorithm when one model is fully known, a strategy Nicolas and I also proposed in our nested sampling Biometrika paper.
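
As a reminder of the mechanism, here is a minimal sketch of the Meng and Wong (1996) iterative bridge estimator for a ratio of normalising constants, on a toy pair of Gaussians (the densities and sample sizes are illustrative choices, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalised densities: q1 is N(0,1) up to Z1 = sqrt(2*pi),
# q2 is N(0.5,4) up to Z2 = 2*sqrt(2*pi), so the true ratio is r = Z1/Z2 = 0.5.
q1 = lambda x: np.exp(-0.5 * x**2)
q2 = lambda x: np.exp(-0.5 * (x - 0.5)**2 / 4.0)

n1 = n2 = 50_000
x1 = rng.normal(0.0, 1.0, n1)   # draws from p1 = q1/Z1
x2 = rng.normal(0.5, 2.0, n2)   # draws from p2 = q2/Z2
s1, s2 = n1 / (n1 + n2), n2 / (n1 + n2)
l1, l2 = q1(x1) / q2(x1), q1(x2) / q2(x2)

# Fixed-point iteration for the (optimal) bridge estimator of r = Z1/Z2:
# the bridge function depends on r itself, hence the need to iterate.
r = 1.0
for _ in range(50):
    num = np.mean(l2 / (s1 * l2 + s2 * r))
    den = np.mean(1.0 / (s1 * l1 + s2 * r))
    r = num / den
```

The iteration converges quickly here because the two densities overlap substantially; warp transforms are precisely a device to create such overlap when it is absent.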

Once such a mixture approximation is obtained, each and every component of the mixture can be turned into the standard version of the location-scale family by the appropriate location-scale transform. Since the component index k is unknown for a given X, they call this transform a random transform, which I find somewhat more confusing than helpful. The conditional distribution of the index given the observable x is well-known for mixtures and it is used here to weight the component-wise location-scale transforms of the original distribution p into something that looks rather similar to the standard version of the location-scale family, provided no mode has been forgotten by the mixture. The simulations from the original p are then rescaled by one of those transforms, whose index k is picked according to the conditional distribution. As explained later to me by XL, the random[ness] in the picture is due to the inclusion of a random ± sign. Still, in the notation introduced in (13), I do not get how the distribution Þ [sorry for using different symbols, I cannot render a tilde on a p] is defined, since both ψ and W are random. Is it the marginal? In which case it would read as a weighted average of rescaled versions of p. I have the same problem with Theorem 1, in that I do not understand how one equates Þ with the joint distribution.
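
A minimal sketch of this component-wise transform, assuming the mixture parameters are known rather than EM-fitted (all numerical values are hypothetical), and taking the target equal to the mixture itself so that the warped sample should come out exactly standard normal:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical two-component mixture playing the role of the fitted approximation.
pi_k  = np.array([0.4, 0.6])
mu_k  = np.array([-3.0, 2.0])
sig_k = np.array([1.0, 0.5])

# Draws from the target; here the target *is* the mixture.
n = 200_000
comp = rng.choice(2, size=n, p=pi_k)
x = rng.normal(mu_k[comp], sig_k[comp])

# Conditional distribution of the component index k given x (Bayes' rule;
# the 1/sqrt(2*pi) constant cancels in the normalisation).
dens = pi_k * np.exp(-0.5 * ((x[:, None] - mu_k) / sig_k) ** 2) / sig_k
post = dens / dens.sum(axis=1, keepdims=True)
k = (rng.random(n)[:, None] > np.cumsum(post, axis=1)).sum(axis=1)

# Component-wise location-scale transform, with the random ± sign
# responsible for the "random transform" terminology.
sign = rng.choice([-1.0, 1.0], size=n)
w = sign * (x - mu_k[k]) / sig_k[k]
```

Since (x, k) has the exact mixture joint distribution, (x − μ_k)/σ_k is N(0,1) given k, hence marginally N(0,1): the warped sample has lost the two modes of the original target.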

Equation (21) is much more illuminating (I find) than the previous explanation in that it exposes the fact that the principle is one of aiming at a new distribution for both the target and the importance function, with hopes that the fit will get better. It could have been better to avoid the notion of random transform, then, but this is mostly a matter of conveying the notion.

On more specific points (or minutiae), the unboundedness of the likelihood is rarely if ever a problem when using EM. An alternative to the multiple-start EM proposal would then be to go sequential and estimate the mixture in a sequential manner, only adding a component when it seems worth it. See e.g. Chopin and Pelgrin (2004) and Chopin (2007). This could also help with the bias mentioned therein since only a (tiny?) fraction of the data would be used. And the number of components K has an impact on the accuracy of the approximation, as in not missing a mode, and on the computing time. However, my suggestion would be to avoid estimating K as this must be immensely costly.

Section 6 obviously relates to my folded Markov interests. If I understand correctly, the paper argues that the transformed density Þ does not need to be computed when considering the folding-move-unfolding step as a single step rather than three steps. I fear the description between equations (30) and (31) is missing the move step over the transformed space. Also on a personal basis I still do not see how to add this approach to our folding methodology, even though the different transforms act as so many replicas of the original Markov chain.

parallel adaptive importance sampling

Posted in Statistics with tags , , , , , on August 30, 2016 by xi'an

Following Paul Russell’s talk at MCqMC 2016, I took a look at his recently arXived paper, on the plane to Sydney. The pseudo-code representation of the method is identical to our population Monte Carlo algorithm, as is the suggestion to approximate the posterior by a mixture, but one novel aspect is to use Reich’s ensemble transportation at the resampling stage, in order to maximise the correlation between the original and the resampled versions of the particle systems. (As in our later versions of PMC, the authors also use as importance denominator the entire mixture rather than conditioning on the selected last-step particle.)

“The output of the resampling algorithm gives us a set of evenly weighted samples that we believe represents the target distribution well”

I disagree with this statement: reweighting does not improve the quality of the posterior approximation, since it introduces more variability. If the original sample is found wanting in its fit to the target, so is the resampled one. Worse, by producing a sample with equal weights, this step may give a false impression of adequate representation…

Another unclear point in the paper relates to tuning the parameters of the mixture importance sampler. The paper discusses tuning these parameters during a burn-in stage, invoking “the constraints on adaptive MCMC algorithms”, which indeed are only pertinent for MCMC algorithms, since importance sampling can be constantly modified while remaining valid. This was a major point for advocating PMC. I am thus unsure what the authors mean by a burn-in period in such a context. Actually, I am also unsure about how they use effective sample size to select the new value of the importance parameter, e.g., the variance β in a random walk mixture: the effective sample size involves this variance implicitly through the realised sample, hence changing β means changing the realised sample… This seems too costly to contemplate, so I wonder at the way Figure 4.2 is produced.
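
For reference, the effective sample size in question is the usual (Σwᵢ)²/Σwᵢ² of the realised importance weights, which makes its dependence on the realised sample explicit:

```python
import numpy as np

def ess(weights):
    """Effective sample size (sum w)^2 / sum(w^2) of importance weights;
    it can only be evaluated on weights produced by an actual sample."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

# Equal weights: every particle counts fully, ess = n.
# One dominant weight: the sample is effectively a single point, ess ~ 1.
```

Hence comparing ESS across values of β requires a fresh sample for each β, which is the cost I question above.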

“A popular approach for adaptive MCMC algorithms is to view the scaling parameter as a random variable which we can sample during the course of the MCMC iterations.”

While this is indeed an attractive notion [that I played with in the early days of adaptive MCMC, with the short-lived notion of cyber-parameters], I do not think it is of much help in optimising an MCMC algorithm, since the scaling parameter needs to be optimised, resulting in a time-inhomogeneous target. A more appropriate tool is thus stochastic optimisation à la Robbins-Monro, as exemplified in Andrieu and Moulines (2006). The paper however remains unclear as to how the scales are updated (see e.g. Section 4.2).

“Ideally, we would like to use a resampling algorithm which is not prohibitively costly for moderately or large sized ensembles, which preserves the mean of the samples, and which makes it much harder for the new samples to forget a significant region in the density.”

The paper also misses the developments of the early 2000’s about more sophisticated resampling steps, especially Paul Fearnhead’s contributions (see also Nicolas Chopin’s thesis). There exist valid resampling methods that require a single uniform (0,1) draw, rather than m. The proposed method has a flavour similar to systematic resampling, but I wonder at the validity of returning values that are averages of earlier simulations, since this modifies their distribution into ones with slimmer tails. (And it is parameterisation dependent.) Producing x_i with probability p_i is not the same as returning the average of the p_i x_i‘s.
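
For the record, systematic resampling is one such scheme requiring a single uniform draw; a minimal sketch (not the authors’ ensemble-transportation step):

```python
import numpy as np

def systematic_resample(weights, rng):
    """Resample m indices with a single uniform: the i-th of m evenly
    spaced points (u+i)/m is inverted through the weight CDF, so index i
    appears either floor(m*w_i) or ceil(m*w_i) times."""
    w = np.asarray(weights, dtype=float)
    m = len(w)
    positions = (rng.random() + np.arange(m)) / m   # one uniform u fixes all m points
    cdf = np.cumsum(w / w.sum())
    cdf[-1] = 1.0                                   # guard against round-off
    return np.searchsorted(cdf, positions)
```

Beyond frugality, the evenly spaced grid keeps the resampled counts within one unit of their expectations, a variance reduction multinomial resampling does not enjoy.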

winning entry at MCqMC’16

Posted in Books, Kids, pictures, Statistics, Travel, University life with tags , , , , , , , on August 29, 2016 by xi'an

The nice logo of MCqMC 2016 was a collection of eight series of QMC dots on the unit (?) cube. The organisers set a competition to identify the principles behind those quasi-random sets and, as I had no idea for most of them, I entered very random sets unconnected with any algorithm, for which I got an honourable mention and a CD prize (if not the conference staff tee-shirt I was coveting!). Art Owen sent me back my entry, posted below and hopefully (or not!) readable.

home-grown food [mush/bathroom]

Posted in pictures, Travel, University life with tags , , , , , , , , on August 27, 2016 by xi'an

MCqMC 2016 [#4]

Posted in Mountains, pictures, Running, Statistics, Travel, University life with tags , , , , , , , , , , , , , , on August 21, 2016 by xi'an

In his plenary talk this morning, Arnaud Doucet discussed the application of pseudo-marginal techniques to the latent variable models he has been investigating for many years, and their limiting behaviour towards efficiency, with the idea of introducing correlation in the estimation of the likelihood ratio, reducing complexity from O(T²) to O(T√T). With the very surprising conclusion that the correlation must go to 1 at a precise rate to get this reduction, since perfect correlation would induce a bias. A massive piece of work, indeed!
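
A toy illustration, unrelated to Arnaud’s actual construction, of why correlating the auxiliary variables behind two likelihood estimates shrinks the noise in the estimated ratio:

```python
import numpy as np

rng = np.random.default_rng(3)

def log_lik_hat(theta, u):
    """Unbiased MC estimate of Z(theta) = E[exp(theta*U)], U ~ N(0,1),
    based on m auxiliary draws u; the true value is exp(theta^2/2)."""
    return np.log(np.mean(np.exp(theta * u)))

theta, theta_new, m, reps, rho = 1.0, 1.1, 100, 500, 0.999
indep, corr = [], []
for _ in range(reps):
    u = rng.normal(size=m)
    v_indep = rng.normal(size=m)                          # fresh auxiliaries
    v_corr = rho * u + np.sqrt(1 - rho**2) * rng.normal(size=m)  # correlated ones
    indep.append(log_lik_hat(theta_new, v_indep) - log_lik_hat(theta, u))
    corr.append(log_lik_hat(theta_new, v_corr) - log_lik_hat(theta, u))

var_indep, var_corr = np.var(indep), np.var(corr)
```

With correlated auxiliaries the two estimation errors nearly cancel in the log-ratio, which is the quantity driving a pseudo-marginal acceptance step; the subtle point of the talk is how fast ρ may approach 1 without biasing the whole scheme.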

The next session of the morning was another instance of conflicting talks and I hopped from one room to the next to listen to Hani Doss’s empirical Bayes estimation with intractable constants (where maybe SAME could be of interest), Youssef Marzouk’s transport maps for MCMC, which sounds like an attractive idea provided the construction of the map remains manageable, and Paul Russell’s adaptive importance sampling that somehow sounded connected with our population Monte Carlo approach. (With the additional step of considering transport maps.)

An interesting item of information I got from the final announcements at MCqMC 2016 just before heading to Monash, Melbourne, is that MCqMC 2018 will take place in the city of Rennes, Brittany, on July 2-6. Not only is it a nice location on its own, but it is most conveniently located in space and time to attend ISBA 2018 in Edinburgh the week after! Just moving from one Celtic city to another Celtic city. Along with other planned satellite workshops, this occurrence should make ISBA 2018 more attractive [if need be!] for participants from overseas.

MCqMC [#3]

Posted in Books, pictures, Statistics, Travel, University life with tags , , , , , , , , on August 20, 2016 by xi'an

On Thursday, Christoph Aistleitner [from TU Graz] gave a plenary talk at MCqMC 2016 around Hermann Weyl’s 1916 paper, Über die Gleichverteilung von Zahlen mod. Eins, which demonstrates that the sequence a, 2²a, 3²a, … mod 1 is uniformly distributed on the unit interval when a is irrational. Obviously, the notion was not introduced for simulation purposes, but the construction applies in this setting! At least in a theoretical sense, since for instance the sequence (a, a², a³, …) mod 1 is known to be uniformly distributed for almost all a’s, yet not a single explicit value of a has been exhibited for which this holds. But a nice hour of history of mathematics and number theory: it is not that often that we hear the Riemann zeta function mentioned in a simulation conference!
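
The phenomenon is easy to witness numerically in the linear case {na}, also covered by Weyl’s criterion, for, say, a = √2 (a sketch merely illustrating equidistribution, not a quasi-Monte Carlo recipe):

```python
import numpy as np

a = np.sqrt(2.0)                        # any irrational number works
n = 100_000
seq = (a * np.arange(1, n + 1)) % 1.0   # the Weyl sequence {n*a} on [0,1)

# Equidistribution: each of ten equal bins should get close to a tenth
# of the points; for sqrt(2) the discrepancy is in fact O(log n / n).
props = np.bincount((10 * seq).astype(int), minlength=10) / n
```

The same ten-bin check on a plain pseudo-random sample would show fluctuations of order 1/√n, which is precisely the gap between Monte Carlo and quasi-Monte Carlo rates.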

The following session was a nightmare in that I wanted to attend all four at once! I eventually chose the transport session, in particular because Xiao-Li advertised it at the end of my talk. The connection is that his warp bridge sampling technique provides a folding map between modes of a target, using a mixture representation of the target and folding all components to a single distribution. Interestingly, this transformation does not require a partition and preserves the normalising constants [which has a side appeal for bridge sampling of course]. In a problem with an unknown number of modes, the technique could be completed by [our] folding in order to bring the unobserved modes into the support of the folded target. Looking forward to the incoming paper! The last talk of this session was by Matti Vihola, connecting multi-level Monte Carlo and unbiased estimation à la Rhee and Glynn, a paper that I missed when it got arXived last December.

The last session of the day was about probabilistic numerics. I have already discussed this approach to numerical integration extensively, to the point of being invited to the NIPS workshop as a skeptic! But this was an interesting session, both with introductory aspects and with new ones from my viewpoint, especially Chris Oates’ description of a PN method for handling both integrand and integrating measure as being uncertain. Another arXival that went under my decidedly deficient radar.