## estimating the marginal likelihood (or an information criterion)

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , , , , on December 28, 2019 by xi'an Tory Imai (from Kyoto University) arXived a paper last summer on what first looked like a novel approximation of the marginal likelihood. Based on the variance of thermodynamic integration. The starting argument is that there exists a power 0<t⁰<1 such that the expectation of the logarithm of the product of the prior by the likelihood to the power t⁰ or t⁰-powered likelihood  is equal to the standard log-marginal $\log m(x) = \mathbb{E}^{t^0}[ \log f(X|\theta) ]$

when the expectation is under the posterior corresponding to the t⁰-powered likelihood (rather than the full likelihood). By an application of the mean value theorem. Watanabe’s (2013) WBIC replaces the optimum t⁰ with 1/log(n), n being the sample size. The issue in terms of computational statistics is of course that the error of WBIC (against the true log m(x)) is only characterised as an order of n.

The second part of the paper is rather obscure to me, as the motivation for the real log canonical threshold is missing, even though the quantity is connected with the power likelihood. And the DIC effective dimension. It then goes on to propose a new approximation of sBIC, where s stands for singular, of Drton and Plummer (2017) which I had missed (and may ask my colleague Martin later today at Warwick!). Quickly reading through the later however brings explanations about the real log canonical threshold being simply the effective dimension in Schwarwz’s BIC approximation to the log marginal, $\log m(x) \approx= \log f(x|\hat{\theta}_n) - \lambda \log n +(m-1)\log\log n$

(as derived by Watanabe), where m is called the multiplicity of the real log canonical threshold. Both λ and m being unknown, Drton and Plummer (2017) estimate the above approximation in a Bayesian fashion, which leads to a double indexed marginal approximation for a collection of models. Since this thread leads me further and further from a numerical resolution of the marginal estimation, but brings in a different perspective on mixture Bayesian estimation, I will return to this highly  in a later post. The paper of Imai discusses a different numerical approximation to sBIC, With a potential improvement in computing sBIC. (The paper was proposed as a poster to BayesComp 2020, so I am looking forward discussing it with the author.)

## Mallows model with intractable constant

Posted in Books, pictures, Statistics with tags , , , , , , , , on November 21, 2019 by xi'an The paper Probabilistic Preference Learning with the Mallows Rank Model by Vitelli et al. was published last year in JMLR which may be why I missed it. It brings yet another approach to the perpetual issue of intractable  normalising constants. Here, the data is made of rankings of n objects by N experts, with an assumption of a latent ordering ρ acting as “mean” in the Mallows model. Along with a scale α, both to be estimated, and indeed involving an intractable normalising constant in the likelihood that only depends on the scale α because the distance is right-invariant. For instance the Hamming distance used in coding. There exists a simplification of the expression of the normalising constant due to the distance only taking a finite number of values, multiplied by the number of cases achieving a given value. Still this remains a formidable combinatoric problem. Running a Gibbs sampler is not an issue for the parameter ρ as the resulting Metropolis-Hastings-within-Gibbs step does not involve the missing constant. But it poses a challenge for the scale α, because the Mallows model cannot be exactly simulated for most distances. Making the use of pseudo-marginal and exchange algorithms presumably impossible. The authors use instead an importance sampling approximation to the normalising constant relying on a pseudo-likelihood version of Mallows model and a massive number (10⁶ to 10⁸) of simulations (in the humongous set of N-sampled permutations of 1,…,n). The interesting point in using this approximation is that the convergence result associated with pseudo-marginals no long applies and that the resulting MCMC algorithm converges to another limiting distribution. With the drawback that this limiting distribution is conditional to the importance sample. Various extensions are found in the paper, including a mixture of Mallows models. And an round of applications, including one on sushi preferences across Japan (fatty tuna coming almost always on top!). As the authors note, a very large number of items like n>10⁴ remains a challenge (or requires an alternative model).

## likelihood-free inference by ratio estimation

Posted in Books, Mountains, pictures, Running, Statistics, Travel, University life with tags , , , , , , , , , , , , , , , on September 9, 2019 by xi'an “This approach for posterior estimation with generative models mirrors the approach of Gutmann and Hyvärinen (2012) for the estimation of unnormalised models. The main difference is that here we classify between two simulated data sets while Gutmann and Hyvärinen (2012) classified between the observed data and simulated reference data.”

A 2018 arXiv posting by Owen Thomas et al. (including my colleague at Warwick, Rito Dutta, CoI warning!) about estimating the likelihood (and the posterior) when it is intractable. Likelihood-free but not ABC, since the ratio likelihood to marginal is estimated in a non- or semi-parametric (and biased) way. Following Geyer’s 1994 fabulous estimate of an unknown normalising constant via logistic regression, the current paper which I read in preparation for my discussion in the ABC optimal design in Salzburg uses probabilistic classification and an exponential family representation of the ratio. Opposing data from the density and data from the marginal, assuming both can be readily produced. The logistic regression minimizing the asymptotic classification error is the logistic transform of the log-ratio. For a finite (double) sample, this minimization thus leads to an empirical version of the ratio. Or to a smooth version if the log-ratio is represented as a convex combination of summary statistics, turning the approximation into an exponential family,  which is a clever way to buckle the buckle towards ABC notions. And synthetic likelihood. Although with a difference in estimating the exponential family parameters β(θ) by minimizing the classification error, parameters that are indeed conditional on the parameter θ. Actually the paper introduces a further penalisation or regularisation term on those parameters β(θ), which could have been processed by Bayesian Lasso instead. This step is essentially dirving the selection of the summaries, except that it is for each value of the parameter θ, at the expense of a X-validation step. This is quite an original approach, as far as I can tell, but I wonder at the link with more standard density estimation methods, in particular in terms of the precision of the resulting estimate (and the speed of convergence with the sample size, if convergence there is).

## Bayesian intelligence in Warwick

Posted in pictures, Statistics, Travel, University life, Wines with tags , , , , , , , , , , , , on February 18, 2019 by xi'an This is an announcement for an exciting CRiSM Day in Warwick on 20 March 2019: with speakers

10:00-11:00 Xiao-Li Meng (Harvard): “Artificial Bayesian Monte Carlo Integration: A Practical Resolution to the Bayesian (Normalizing Constant) Paradox”

11:00-12:00 Julien Stoehr (Dauphine): “Gibbs sampling and ABC”

14:00-15:00 Arthur Ulysse Jacot-Guillarmod (École Polytechnique Fedérale de Lausanne): “Neural Tangent Kernel: Convergence and Generalization of Deep Neural Networks”

15:00-16:00 Antonietta Mira (Università della Svizzera italiana e Università degli studi dell’Insubria): “Bayesian identifications of the data intrinsic dimensions”

[whose abstracts are on the workshop webpage] and free attendance. The title for the workshop mentions Bayesian Intelligence: this obviously includes human intelligence and not just AI!

## the HMC algorithm meets the exchange algorithm

Posted in Mountains, pictures, Statistics, Travel, University life with tags , , , , , , , , on July 26, 2017 by xi'an Julien Stoehr (now in Dublin, soon to join us as a new faculty in Paris-Dauphine!), Alan Benson and Nial Friel (both at UCD) arXived last week a paper entitled Noisy HMC for doubly-intractable distributions. Which considers solutions for adapting Hamiltonian Monte Carlo to target densities that involve a missing constant. In the sense of our workshop last year in Warwick. And in the theme pursued by Nial in the past years. The notion is thus to tackle a density π(θ)∞exp(V(X|θ)/Z(θ) when Z(θ) is intractable. In that case the gradient of log Z(θ) can be estimated as the expectation of the gradient of V(X|θ) [as a standard exponential family identity]. And the ratio of the Z(θ)’s appearing in the Metropolis ratio can be derived by Iain Murray’s exchange algorithm, based on simulations from the sampling distribution attached to the parameter in the denominator.

The resulting algorithm proposed by the authors thus uses N simulations of auxiliary variables at each step þ of the leapfrog part, towards an approximation of the gradient term, plus another N simulations for approximating the ratio of the normalising constants Z(θ)/Z(θ’). While justified from an importance sampling perspective, this approximation is quite poor when θ and θ’ differ. A better solution [as shown in the paper] is to take advantage of all leapfrog steps and of associated auxiliary simulations to build a telescopic product of ratios where the parameter values θ and θ’ are much closer. The main difficulty is in drawing a comparison with the exchange algorithm, since the noisy HMC version is computationally more demanding. (A secondary difficulty is in having an approximate algorithm that no longer leaves the target density stationary.)