## variational approximation to empirical likelihood ABC

Posted in Statistics on October 1, 2021 by xi'an

Sanjay Chaudhuri and his colleagues from Singapore arXived last year a paper on a novel version of empirical likelihood ABC that I hadn’t yet found time to read. This proposal connects with our own, published with Kerrie Mengersen and Pierre Pudlo in 2013 in PNAS. It is presented as an attempt at approximating the posterior distribution based on a vector of (summary) statistics, the variational approximation (or information projection) appearing in the construction of the sampling distribution of the observed summary. This approximation writes as an entropic correction of the true posterior distribution (in Theorem 1). (Along with a weird eyed-g symbol! I checked inside the original LaTeX file and it happens to be a mathbbmtt g, that is, the typewriter version of a blackboard computer modern g…)

“First, the true log-joint density of the observed summary, the summaries of the i.i.d. replicates and the parameter have to be estimated. Second, we need to estimate the expectation of the above log-joint density with respect to the distribution of the data generating process. Finally, the differential entropy of the data generating density needs to be estimated from the m replicates…”

The density of the observed summary is estimated by empirical likelihood, but I do not understand the reasoning behind the moment condition used in this empirical likelihood. Indeed, the expectation of the difference between the observed summaries and the simulated ones is zero iff the true value of the parameter is used in the simulation. I also fail to understand the connection with our SAME procedure (Doucet, Godsill & X, 2002), in that the empirical likelihood is based on a sample made of pairs (observed,generated) where the observed part is repeated m times, indeed, but not with the intent of approximating a marginal likelihood estimator… The notion of using the actual data instead of the true expectation (i.e. as an unbiased estimator) at the true parameter value is appealing as it avoids specifying the exact (or analytical) value of this expectation (as in our approach), but I am missing the justification for the extension to any parameter value. Unless one uses an ancillary statistic, which does not sound pertinent… The differential entropy is estimated by a Kozachenko-Leonenko estimator, which relies on k-nearest neighbours.
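To make the last step concrete, here is a minimal sketch of a Kozachenko-Leonenko entropy estimate in Python, assuming the standard form ψ(n) − ψ(k) + log c_d + (d/n) Σ log ε_i, where ε_i is the Euclidean distance from the i-th point to its k-th nearest neighbour and c_d is the volume of the unit ball. The function names `digamma_int` and `kl_entropy` and the default k=3 are mine, not the paper's, and the brute-force distance matrix is only meant for small samples.

```python
import math
import numpy as np

def digamma_int(n):
    # Digamma function at a positive integer: psi(n) = -gamma + sum_{j=1}^{n-1} 1/j
    return -np.euler_gamma + sum(1.0 / j for j in range(1, n))

def kl_entropy(sample, k=3):
    """Kozachenko-Leonenko estimate of differential entropy (in nats).

    Hypothetical helper: `sample` is an (n,) or (n, d) array, `k` the
    number of neighbours. Uses an O(n^2) distance matrix.
    """
    x = np.asarray(sample, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    n, d = x.shape
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # After sorting each row, column 0 is the zero distance of a point to itself,
    # so column k holds the distance to the k-th nearest neighbour.
    eps = np.sort(dist, axis=1)[:, k]
    # Log-volume of the d-dimensional unit ball: pi^(d/2) / Gamma(d/2 + 1)
    log_unit_ball = 0.5 * d * math.log(math.pi) - math.lgamma(0.5 * d + 1.0)
    return digamma_int(n) - digamma_int(k) + log_unit_ball + d * np.mean(np.log(eps))
```

On a standard Normal sample the output can be compared with the exact entropy ½ log(2πe), which gives a quick sanity check on the choice of k.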

“The proposed empirical likelihood estimates weights by matching the moments of g(X¹), …, g(Xᵐ) with that of g(X), without requiring a direct relationship with the parameter. (…) the constraints used in the construction of the empirical likelihood are based on the identity in (7), which can only be satisfied when θ = θ⁰.”

Although I feel like I am missing an argument, the later part of the paper seems to confirm my impression, as quoted above. Meaning that the approximation will fare well only in the vicinity of the true parameter. Which makes it untrustworthy for model choice purposes, I believe. (The paper uses the g-and-k benchmark without exploiting Pierre Jacob’s package that allows for exact MCMC implementation.)

## return of the boomerang

Posted in Books, pictures, Statistics, Travel on January 26, 2021 by xi'an

Pagani, Wiegand and Nadarajah wrote a paper last Spring on the Rosenbrock distribution. Now, I did not know this distribution under that name but as the banana benchmark distribution I met for the first time in the 1999 Haario, Saksman and Tamminen paper on adaptive MCMC. And that I used in several papers (the picture below being borrowed from Statisfaction!)

The Rosenbrock function was introduced by… Howard Rosenbrock in 1960 in a computer journal as a benchmark for optimisation. (Or by someone else to keep up with Stigler’s Law of Eponymy.) It can be turned into a probability density by exponentiating its opposite. It corresponds to a Normal N(μ,σ²) marginal on the first component, followed by T Normal

N(x²ₜ₋₁, σ²/10)

conditional distributions on the following components. It is thus fully known, incl. its normalising constant, and easy to simulate. Hence to use as a fat tail target for benchmarking MCMC algorithms. The authors propose an extension as the hybrid Rosenbrock where several parallel sequences stem from the same component, but it is unclear to me how useful of a generalisation this is…
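Since the distribution is a Normal marginal followed by T Normal conditionals, simulating from it is a matter of a few lines. The sketch below assumes NumPy, and the function name `sample_rosenbrock` as well as the default values of μ and σ are purely illustrative choices of mine.

```python
import numpy as np

def sample_rosenbrock(n, T, mu=1.0, sigma=1.0, seed=0):
    """Draw n points from the (T+1)-dimensional Rosenbrock distribution.

    x_0 ~ N(mu, sigma^2), then x_t | x_{t-1} ~ N(x_{t-1}^2, sigma^2 / 10)
    for t = 1, ..., T. Defaults are hypothetical, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    x = np.empty((n, T + 1))
    x[:, 0] = rng.normal(mu, sigma, size=n)
    cond_sd = sigma / np.sqrt(10.0)  # standard deviation of each conditional
    for t in range(1, T + 1):
        # The conditional mean is the square of the previous component
        x[:, t] = rng.normal(x[:, t - 1] ** 2, cond_sd)
    return x
```

Plotting the first two columns of `sample_rosenbrock(10_000, T=1)` should reproduce the familiar banana shape, giving an exact sample against which an MCMC output can be compared.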

## assessing MCMC convergence

Posted in Books, Statistics, University life on June 6, 2019 by xi'an

When MCMC became mainstream in the 1990’s, there was a flurry of proposals to check, assess, and even guarantee convergence to the stationary distribution, as discussed in our MCMC book. Along with Chantal Guihenneuc and Kerrie Mengersen, we also maintained for a while a reviewww webpage categorising these. Niloy Biswas and Pierre Jacob have recently posted a paper where they propose the use of couplings (and unbiased MCMC) towards deriving bounds on different metrics between the target and the current distribution of the Markov chain. Two chains are created from a given kernel and coupled with a lag of L, meaning that after a while, the two chains become one with a time difference of L. (The supplementary material contains many details on how to induce coupling.) The distance to the target can then be bounded by a sum of distances between the two chains until they merge. The above picture from the paper is a comparison of a Polya-Urn sampler with several HMC samplers for a logistic target (not involving the Pima Indian dataset!). The larger the lag L, the more accurate the bound. But the larger the lag, the more expensive the assessment of how many steps are needed for convergence. Especially when considering that the evaluation requires restarting the chains from scratch and rerunning until they couple again, rather than continuing one run, which can only bring the chain closer to stationarity and to being distributed from the target. I thus wonder at the possibility of some Rao-Blackwellisation of the simulations used in this assessment (while realising once more that assessing convergence almost inevitably requires another order of magnitude than convergence itself!). Without a clear idea of how to do it… For instance, keeping the values of the chain(s) at the time of coupling is not directly helpful to create a sample from the target since they are not distributed from that target.
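As a small illustration of how such a bound gets evaluated in practice, the sketch below assumes that, for the total variation distance, the bound at iteration t takes the form E[max(0, ⌈(τ − L − t)/L⌉)], with τ the meeting time of the two L-lagged chains. This is how I recall the result and it should be checked against the paper. The function name `tv_upper_bound` is hypothetical and the meeting times are supposed to come from independent coupled runs that are not shown here.

```python
import numpy as np

def tv_upper_bound(meeting_times, lag, t):
    """Monte Carlo estimate of an upper bound on the total variation
    distance between the chain at iteration t and its target.

    Assumed form of the bound: mean of max(0, ceil((tau - lag - t) / lag))
    over independent replications of the meeting time tau.
    """
    tau = np.asarray(meeting_times, dtype=float)
    return float(np.mean(np.maximum(0.0, np.ceil((tau - lag - t) / lag))))
```

Evaluating this function over a grid of t values gives a curve that decreases to zero once t exceeds the largest observed meeting time minus the lag, and each new lag requires a fresh batch of coupled runs, which is where the cost mentioned above comes from.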

[Pierre also wrote a blog post about the paper on Statisfaction that is definitely much clearer and more pedagogical than the above.]

## correlation for maximal coupling

Posted in Books, Kids, pictures, R, Statistics, University life on January 3, 2018 by xi'an

An interesting (if vaguely formulated) question on X validated: given two Gaussian variates that are maximally coupled, what is the correlation between these variates? The answer depends on the parameters of both Gaussians, with a correlation of one when both Gaussians are identical. Answering the question by simulation (as I could not figure out the analytical formula on Boxing Day…) led me back to Pierre Jacob’s entry on the topic on Statisfaction, where simulating the maximal coupling stems from the decompositions

p(x)=p(x)∧q(x)+{p(x)-p(x)∧q(x)}  and  q(x)=p(x)∧q(x)+{q(x)-p(x)∧q(x)}

and incidentally to the R function image.plot (from the R library fields) for including the side legend.
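The decompositions above translate into a short rejection scheme: draw X from p, keep Y = X when a uniform on (0, p(X)) falls below q(X), and otherwise draw Y from the residual part of q by rejection. Below is a minimal Python sketch for two Gaussians, written with NumPy rather than R, where the function names `normal_pdf`, `maximal_coupling` and `coupled_sample` are my own labels.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    # Density of N(mu, sigma^2) at x
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def maximal_coupling(mu_p, sigma_p, mu_q, sigma_q, rng):
    """Return one pair (X, Y) with X ~ p, Y ~ q and P(X = Y) maximal."""
    x = rng.normal(mu_p, sigma_p)
    w = rng.uniform(0.0, normal_pdf(x, mu_p, sigma_p))
    if w <= normal_pdf(x, mu_q, sigma_q):
        # The point falls under min(p, q): both variates take the same value
        return x, x
    while True:
        # Otherwise sample Y from the residual q - min(p, q) by rejection
        y = rng.normal(mu_q, sigma_q)
        w = rng.uniform(0.0, normal_pdf(y, mu_q, sigma_q))
        if w > normal_pdf(y, mu_p, sigma_p):
            return x, y

def coupled_sample(n, mu_p, sigma_p, mu_q, sigma_q, seed=0):
    # Stack n independent coupled pairs into an (n, 2) array
    rng = np.random.default_rng(seed)
    return np.array([maximal_coupling(mu_p, sigma_p, mu_q, sigma_q, rng)
                     for _ in range(n)])
```

The correlation asked about in the question is then estimated by `np.corrcoef(pairs.T)[0, 1]` on the output `pairs` of `coupled_sample`, and the proportion of rows with equal entries estimates one minus the total variation distance between the two Gaussians.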

## truncated normal algorithms

Posted in Books, pictures, R, Statistics, University life on January 4, 2017 by xi'an

Nicolas Chopin (CREST) just posted an entry on Statisfaction about the comparison of truncated Normal algorithms run by Alan Rogers, from the University of Utah. Nicolas wrote a paper in Statistics and Computing about a simulation method, which proposes a Ziggurat type of algorithm for this purpose, and which I do not remember reading, thanks to my diminishing memory buffer! As shown in the picture below, when truncating to the half-line (a,∞), this method improves upon my accept-reject approach except in the far tails.

On the top graph, made by Alan Rogers, my uniform proposal (r) seems to be doing better for a Normal truncated to (a,b) when b<0, or when a gets large and close to b. Nicolas’ ziggurat (c) works better than the Gaussian accept-reject method (c) on the positive part. (I wonder what the exponential proposal (e) stands for, in terms of scale parameter.)
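For reference, the accept-reject approach for the half-line (a,∞) can be sketched as follows, assuming the usual translated exponential proposal with rate α = (a + √(a² + 4))/2 and acceptance probability exp(−(z − α)²/2). This is a Python rendering under those assumptions, with the function name `rtnorm_tail` being hypothetical, and it is not the code compared in the graphs.

```python
import numpy as np

def rtnorm_tail(n, a, seed=0):
    """Simulate n draws from a standard Normal truncated to (a, infinity).

    Accept-reject with a translated exponential proposal a + Exp(alpha),
    using the rate alpha = (a + sqrt(a^2 + 4)) / 2 (assumed optimal choice).
    """
    rng = np.random.default_rng(seed)
    alpha = 0.5 * (a + np.sqrt(a * a + 4.0))
    out = np.empty(n)
    filled = 0
    while filled < n:
        m = n - filled
        # Proposal: shift an exponential variate with rate alpha by a
        z = a + rng.exponential(1.0 / alpha, size=m)
        # Accept with probability exp(-(z - alpha)^2 / 2)
        accept = rng.uniform(size=m) <= np.exp(-0.5 * (z - alpha) ** 2)
        k = int(accept.sum())
        out[filled:filled + k] = z[accept]
        filled += k
    return out
```

The acceptance rate increases with a, which makes this proposal a natural choice in the far tails, while for small or negative a a plain Gaussian accept-reject step is the cheaper option.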