Archive for exchange algorithm

Mallows model with intractable constant

Posted in Books, pictures, Statistics with tags , , , , , , , , on November 21, 2019 by xi'an

The paper Probabilistic Preference Learning with the Mallows Rank Model by Vitelli et al. was published last year in JMLR which may be why I missed it. It brings yet another approach to the perpetual issue of intractable  normalising constants. Here, the data is made of rankings of n objects by N experts, with an assumption of a latent ordering ρ acting as “mean” in the Mallows model. Along with a scale α, both to be estimated, and indeed involving an intractable normalising constant in the likelihood that only depends on the scale α because the distance is right-invariant. For instance the Hamming distance used in coding. There exists a simplification of the expression of the normalising constant due to the distance only taking a finite number of values, multiplied by the number of cases achieving a given value. Still this remains a formidable combinatoric problem. Running a Gibbs sampler is not an issue for the parameter ρ as the resulting Metropolis-Hastings-within-Gibbs step does not involve the missing constant. But it poses a challenge for the scale α, because the Mallows model cannot be exactly simulated for most distances. Making the use of pseudo-marginal and exchange algorithms presumably impossible. The authors use instead an importance sampling approximation to the normalising constant relying on a pseudo-likelihood version of Mallows model and a massive number (10⁶ to 10⁸) of simulations (in the humongous set of N-sampled permutations of 1,…,n). The interesting point in using this approximation is that the convergence result associated with pseudo-marginals no long applies and that the resulting MCMC algorithm converges to another limiting distribution. With the drawback that this limiting distribution is conditional to the importance sample. Various extensions are found in the paper, including a mixture of Mallows models. And an round of applications, including one on sushi preferences across Japan (fatty tuna coming almost always on top!). As the authors note, a very large number of items like n>10⁴ remains a challenge (or requires an alternative model).

thermodynamic integration plus temperings

Posted in Statistics, Travel, University life with tags , , , , , , , , , , , , on July 30, 2019 by xi'an

Biljana Stojkova and David Campbel recently arXived a paper on the used of parallel simulated tempering for thermodynamic integration towards producing estimates of marginal likelihoods. Resulting into a rather unwieldy acronym of PT-STWNC for “Parallel Tempering – Simulated Tempering Without Normalizing Constants”. Remember that parallel tempering runs T chains in parallel for T different powers of the likelihood (from 0 to 1), potentially swapping chain values at each iteration. Simulated tempering monitors a single chain that explores both the parameter space and the temperature range. Requiring a prior on the temperature. Whose optimal if unrealistic choice was found by Geyer and Thomson (1995) to be proportional to the inverse (and unknown) normalising constant (albeit over a finite set of temperatures). Proposing the new temperature instead via a random walk, the Metropolis within Gibbs update of the temperature τ then involves normalising constants.

“This approach is explored as proof of concept and not in a general sense because the precision of the approximation depends on the quality of the interpolator which in turn will be impacted by smoothness and continuity of the manifold, properties which are difficult to characterize or guarantee given the multi-modal nature of the likelihoods.”

To bypass this issue, the authors pick for their (formal) prior on the temperature τ, a prior such that the profile posterior distribution on τ is constant, i.e. the joint distribution at τ and at the mode [of the conditional posterior distribution of the parameter] is constant. This choice makes for a closed form prior, provided this mode of the tempered posterior can de facto be computed for each value of τ. (However it is unclear to me why the exact mode would need to be used.) The resulting Metropolis ratio becomes independent of the normalising constants. The final version of the algorithm runs an extra exchange step on both this simulated tempering version and the untempered version, i.e., the original unnormalised posterior. For the marginal likelihood, thermodynamic integration is invoked, following Friel and Pettitt (2008), using simulated tempering samples of (θ,τ) pairs (associated instead with the above constant profile posterior) and simple Riemann integration of the expected log posterior. The paper stresses the gain due to a continuous temperature scale, as it “removes the need for optimal temperature discretization schedule.” The method is applied to the Glaxy (mixture) dataset in order to compare it with the earlier approach of Friel and Pettitt (2008), resulting in (a) a selection of the mixture with five components and (b) much more variability between the estimated marginal  likelihoods for different numbers of components than in the earlier approach (where the estimates hardly move with k). And (c) a trimodal distribution on the means [and unimodal on the variances]. This example is however hard to interpret, since there are many contradicting interpretations for the various numbers of components in the model. (I recall Radford Neal giving an impromptu talks at an ICMS workshop in Edinburgh in 2001 to warn us we should not use the dataset without a clear(er) understanding of the astrophysics behind. If I remember well he was excluded all low values for the number of components as being inappropriate…. I also remember taking two days off with Peter Green to go climbing Craigh Meagaidh, as the only authorised climbing place around during the foot-and-mouth epidemics.) In conclusion, after presumably too light a read (I did not referee the paper!), it remains unclear to me why the combination of the various tempering schemes is bringing a noticeable improvement over the existing. At a given computational cost. As the temperature distribution does not seem to favour spending time in the regions where the target is most quickly changing. As such the algorithm rather appears as a special form of exchange algorithm.

Bayesian goodness of fit

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , , on April 10, 2018 by xi'an


Persi Diaconis and Guanyang Wang have just arXived an interesting reflection on the notion of Bayesian goodness of fit tests. Which is a notion that has always bothered me, in a rather positive sense (!), as

“I also have to confess at the outset to the zeal of a convert, a born again believer in stochastic methods. Last week, Dave Wright reminded me of the advice I had given a graduate student during my algebraic geometry days in the 70’s :`Good Grief, don’t waste your time studying statistics. It’s all cookbook nonsense.’ I take it back! …” David Mumford

The paper starts with a reference to David Mumford, whose paper with Wu and Zhou on exponential “maximum entropy” synthetic distributions is at the source (?) of this paper, and whose name appears in its very title: “A conversation for David Mumford”…, about his conversion from pure (algebraic) maths to applied maths. The issue of (Bayesian) goodness of fit is addressed, with card shuffling examples, the null hypothesis being that the permutation resulting from the shuffling is uniformly distributed if shuffling takes enough time. Interestingly, while the parameter space is compact as a distribution on a finite set, Lindley’s paradox still occurs, namely that the null (the permutation comes from a Uniform) is always accepted provided there is no repetition under a “flat prior”, which is the Dirichlet D(1,…,1) over all permutations. (In this finite setting an improper prior is definitely improper as it does not get proper after accounting for observations. Although I do not understand why the Jeffreys prior is not the Dirichlet(½,…,½) in this case…) When resorting to the exponential family of distributions entertained by Zhou, Wu and Mumford, including the uniform distribution as one of its members, Diaconis and Wang advocate the use of a conjugate prior (exponential family, right?!) to compute a Bayes factor that simplifies into a ratio of two intractable normalising constants. For which the authors suggest using importance sampling, thermodynamic integration, or the exchange algorithm. Except that they rely on the (dreaded) harmonic mean estimator for computing the Bayes factor in the following illustrative section! Due to the finite nature of the space, I presume this estimator still has a finite variance. (Remark 1 calls for convergence results on exchange algorithms, which can be found I think in the just as recent arXival by Christophe Andrieu and co-authors.) An interesting if rare feature of the example processed in the paper is that the sufficient statistic used for the permutation model can be directly simulated from a Multinomial distribution. This is rare as seen when considering the benchmark of Ising models, for which the summary and sufficient statistic cannot be directly simulated. (If only…!) In fine, while I enjoyed the paper a lot, I remain uncertain as to its bearings, since defining an objective alternative for the goodness-of-fit test becomes quickly challenging outside simple enough models.

same simulation, different acceptance

Posted in Books, Statistics with tags , , , on January 30, 2018 by xi'an

In doubly intractable settings, where the likelihood involves an intractable constant Z(θ), an auxiliary or pseudo- observation x is generated to incorporate strategically located densities in the acceptance probability towards cancelling out the Z(θ)’s. The funny thing is that Møller et al.  (2005) and Murray et al. (2006) both use the same simulations in their auxiliary algorithms, namely θ’~q(θ|θ,y) and x’~f(x|θ’), but return different acceptance probabilities. The former use an artificial target on the pair (θ’,x’) [with a free conditional on x’] while the later uses a pseudo-marginal argument to estimate the missing constant Z(θ) by importance sampling as noticed by Everitt (2012). This apparent paradox is rather common to simulation in that several importance weights can often be constructed for the same importance function. But in the case of doubly intractable distributions, the first approach offers a surprisingly wide variability in the selection of the conditional on x’, which can be absolutely any density g(x|θ,y). And hence could be optimised for maximal acceptance rate. Or maximal effective sample size. In the original paper of Møller et al.  (2005) a plug-in version f(x|θ) was suggested, with θ replaced with a crude estimate. This morning, when discussing both versions with Julien Stoehr, I realised that a geometric average of f(x|θ)’s could be used as well, since the intractable normalising constants would not be an issue [as opposed to an arithmetic or harmonic average]. I [idly] wonder if anything has been done in this direction…

the HMC algorithm meets the exchange algorithm

Posted in Mountains, pictures, Statistics, Travel, University life with tags , , , , , , , , on July 26, 2017 by xi'an

Julien Stoehr (now in Dublin, soon to join us as a new faculty in Paris-Dauphine!), Alan Benson and Nial Friel (both at UCD) arXived last week a paper entitled Noisy HMC for doubly-intractable distributions. Which considers solutions for adapting Hamiltonian Monte Carlo to target densities that involve a missing constant. In the sense of our workshop last year in Warwick. And in the theme pursued by Nial in the past years. The notion is thus to tackle a density π(θ)∞exp(V(X|θ)/Z(θ) when Z(θ) is intractable. In that case the gradient of log Z(θ) can be estimated as the expectation of the gradient of V(X|θ) [as a standard exponential family identity]. And the ratio of the Z(θ)’s appearing in the Metropolis ratio can be derived by Iain Murray’s exchange algorithm, based on simulations from the sampling distribution attached to the parameter in the denominator.

The resulting algorithm proposed by the authors thus uses N simulations of auxiliary variables at each step þ of the leapfrog part, towards an approximation of the gradient term, plus another N simulations for approximating the ratio of the normalising constants Z(θ)/Z(θ’). While justified from an importance sampling perspective, this approximation is quite poor when θ and θ’ differ. A better solution [as shown in the paper] is to take advantage of all leapfrog steps and of associated auxiliary simulations to build a telescopic product of ratios where the parameter values θ and θ’ are much closer. The main difficulty is in drawing a comparison with the exchange algorithm, since the noisy HMC version is computationally more demanding. (A secondary difficulty is in having an approximate algorithm that no longer leaves the target density stationary.)