## ABC in Clermont-Ferrand

Posted in Mountains, pictures, Statistics, Travel, University life with tags , , , , , , , , , , , on September 20, 2019 by xi'an

Today I am taking part in a one-day workshop at the Université of Clermont Auvergne on ABC. With applications to cosmostatistics, along with Martin Kilbinger [with whom I worked on PMC schemes], Florent Leclerc and Grégoire Aufort. This should prove a most exciting day! (With not enough time to run up Puy de Dôme in the morning, though.)

## the three i’s of poverty

Posted in Books, pictures, Statistics, Travel, University life with tags , , , , , on September 15, 2019 by xi'an

Today I made a “quick” (10h door to door!) round trip visit to Marseille (by train) to take part in the PhD thesis defense (committee) of Edwin Fourrier-Nicolaï, which title was Poverty, inequality and redistribution: an econometric approach. While this was mainly a thesis in economics, meaning defending some theory on inequalities based on East German data, there were Bayesian components in the thesis that justified (to some extent!) my presence in the jury. Especially around mixture estimation by Gibbs sampling. (On which I started working almost exactly 30 years ago, when I joined Paris 6 and met  Gilles Celeux and Jean Diebolt.) One intriguing [for me] question stemmed from this defense, namely the notion of a Bayesian estimation of a three i’s of poverty (TIP) curve. The three i’s stand for incidence, intensity, and inequality, as, introduced in Jenkins and Lambert (1997), this curve measure the average income loss from the poverty level for the 100p% lower incomes, when p varies between 0 and 1. It thus depends on the distribution F of the incomes and when using a mixture distribution its computation requires a numerical cdf inversion to determine the income p-th quantile. A related question is thus on how to define a Bayesian estimate of the TIP curve. Using an average over the values of an MCMC sample does not sound absolutely satisfactory since the upper bound in the integral varies for each realisation of the parameter. The use of another estimate would however require a specific loss function, an issue not discussed in the thesis.

## efficient MCMC sampling

Posted in Statistics with tags , , , on June 24, 2019 by xi'an

Maxime Vono, Daniel Paulin and Arnaud Doucet recently arXived a paper about a regularisation technique that allows for efficient sampling from a complex posterior which potential function factorises as a large sum of transforms of linear projections of the parameter θ

$U(\theta)=\sum_i U_i(A_i\theta)$

The central idea in the paper [which was new to me] is to introduce auxiliary variates for the different terms in the sum, replacing the projections in the transforms, with an additional regularisation forcing these auxiliary variates to be as close as possible from the corresponding projection

$U(\theta,\mathbf z)=\sum_i U_i(z_i)+\varrho^{-1}||z_i-A_i\theta||^2$

This is only an approximation to the true target but it enjoys the possibility to run a massive Gibbs sampler in quite a reduced dimension. As the variance ρ of the regularisation term goes to zero the marginal posterior on the parameter θ converges to the true posterior. The authors manage to achieve precise convergence rates both in total variation and in Wasserstein distance.

From a practical point of view, only judging from the logistic example, it is hard to fathom how much this approach improves upon other approaches (provided they still apply) as the impact of the value of ρ should be assessed on top of the convergence of the high-dimensional Gibbs sampler. Or is there an annealing version in the pipe-line? While parallelisation is a major argument, it also seems that the Gibbs sampler need a central monitoring for each new simulation of θ. Unless some asynchronous version can be implemented.

## likelihood-free approximate Gibbs sampling

Posted in Books, Statistics with tags , , , , , , , , on June 19, 2019 by xi'an

“Low-dimensional regression-based models are constructed for each of these conditional distributions using synthetic (simulated) parameter value and summary statistic pairs, which then permit approximate Gibbs update steps (…) synthetic datasets are not generated during each sampler iteration, thereby providing efficiencies for expensive simulator models, and only require sufficient synthetic datasets to adequately construct the full conditional models (…) Construction of the approximate conditional distributions can exploit known structures of the high-dimensional posterior, where available, to considerably reduce computational overheads”

Guilherme Souza Rodrigues, David Nott, and Scott Sisson have just arXived a paper on approximate Gibbs sampling. Since this comes a few days after we posted our own version, here are some of the differences I could spot in the paper:

1. Further references to earlier occurrences of Gibbs versions of ABC, esp. in cases when the likelihood function factorises into components and allows for summaries with lower dimensions. And even to ESP.
2. More an ABC version of Gibbs sampling that a Gibbs version of ABC in that approximations to the conditionals are first constructed and then used with no further corrections.
3. Inherently related to regression post-processing à la Beaumont et al.  (2002) in that the regression model is the start to designing an approximate full conditional, conditional on the “other” parameters and on the overall summary statistic. The construction of the approximation is far from automated. And may involve neural networks or other machine learning estimates.
4. As a consequence of the above, a preliminary ABC step to design the collection of approximate full conditionals using a single and all-purpose multidimensional summary statistic.
5. Once the approximations constructed, no further pseudo-data is generated.
6. Drawing from the approximate full conditionals is done exactly, possibly via a bootstrapped version.
7. Handling a highly complex g-and-k dynamic model with 13,140 unknown parameters, requiring a ten days simulation.

“In certain circumstances it can be seen that the likelihood-free approximate Gibbs sampler will exactly target the true partial posterior (…) In this case, then Algorithms 2 and 3 will be exact.”

Convergence and coherence are handled in the paper by setting the algorithm(s) as noisy Monte Carlo versions, à la Alquier et al., although the issue of incompatibility between the full conditionals is acknowledged, with the main reference being the finite state space analysis of Chen and Ip (2015). It thus remains unclear whether or not the Gibbs samplers that are implemented there do converge and if they do what is the significance of the stationary distribution.

## Gibbs clashes with importance sampling

Posted in pictures, Statistics with tags , , , , , on April 11, 2019 by xi'an

In an X validated question, an interesting proposal was made: at each (component-wise) step of a Gibbs sampler, replace simulation from the exact full conditional with simulation from an alternate density and weight the resulting simulation with a term made of a product of (a) the previous weight (b) the ratio of the true conditional over the substitute for the new value and (c) the inverse ratio for the earlier value of the same component. Which does not work for several reasons:

1. the reweighting is doomed by its very propagation in that it keeps multiplying ratios of expectation one, which means an almost sure chance of degenerating;
2. the weights are computed for a previous value that has not been generated from the same proposal and is anyway already properly weighted;
3. due to the change in dimension produced by Gibbs, the actual target is the full conditional, which involves an intractable normalising constant;
4. there is no guarantee for the weights to have finite variance, esp. when the proposal has thinner tails than the target.

as can be readily checked by a quick simulation experiment. The funny thing is that a proper importance weight can be constructed when envisioning  the sequence of Gibbs steps as a Metropolis proposal (in the dimension of the target).

## automatic adaptation of MCMC algorithms

Posted in pictures, Statistics with tags , , , , , , , on March 4, 2019 by xi'an

“A typical adaptive MCMC sampler will approximately optimize performance given the kind of sampler chosen in the first place, but it will not optimize among the variety of samplers that could have been chosen.”

Last February (2018), Dao Nguyen and five co-authors arXived a paper that I missed. On a new version of adaptive MCMC that aims at selecting a wider range of proposal kernels. Still requiring a by-hand selection of this collection of kernels… Among the points addressed, beyond the theoretical guarantees that the adaptive scheme does not jeopardize convergence to the proper target, are a meta-exploration of the set of combinations of samplers and integration of the computational speed in the assessment of each sampler. Including the very difficulty of assessing mixing. One could deem the index of the proposal as an extra (cyber-)parameter to its generic parameter (like the scale in the random walk), but the discreteness of this index makes the extension more delicate than expected. And justifies the distinction between internal and external parameters. The notion of a worst-mixing dimension is quite appealing and connects with the long-term hope that one could spend the maximum fraction of the sampler runtime over the directions that are poorly mixing, while still keeping the target as should be. The adaptive scheme is illustrated on several realistic models with rather convincing gains in efficiency and time.

The convergence tools are inspired from Roberts and Rosenthal (2007), with an assumption of uniform ergodicity over all kernels considered therein which is both strong and delicate to assess in practical settings. Efficiency is rather unfortunately defined in terms of effective sample size, which is a measure of correlation or lack thereof, but which does not relate to the speed of escape from the basin of attraction of the starting point. I also wonder at the pertinence of the estimation of the effective sample size when the chain is based on different successive kernels, since the lack of correlation may be due to another kernel. Another calibration issue is the internal clock that relates to the average number of iterations required to tune properly a specific kernel, which again may be difficult to assess in a realistic situation. A last query is whether or not this scheme could be compared with an asynchronous (and valid) MCMC approach that exploits parallel capacities of the computer.

## Le Monde puzzle [#1087]

Posted in Books, Kids, R, Statistics with tags , , , , , on February 25, 2019 by xi'an

A board-like Le Monde mathematical puzzle in the digit category:

Given a (k,m) binary matrix, what is the maximum number S of entries with only one neighbour equal to one? Solve for k=m=2,…,13, and k=6,m=8.

For instance, for k=m=2, the matrix

$\begin{matrix} 0 &0\\ 1 &1\\ \end{matrix}$

is producing the maximal number 4. I first attempted a brute force random filling of these matrices with only a few steps of explorations and got the numbers 4,8,16,34,44,57,… for the first cases. Since I was convinced that the square k² of a number k previously exhibited to reach its maximum as S=k² was again perfect in this regard, I then tried another approach based on Gibbs sampling and annealing (what else?):

gibzit=function(k,m,A=1e2,N=1e2){
temp=1 #temperature
board=sole=matrix(sample(c(0,1),(k+2)*(m+2),rep=TRUE),k+2,m+2)
board[1,]=board[k+2,]=board[,1]=board[,m+2]=0 #boundaries
maxol=counter(board,k,m) #how many one-neighbours?
for (t in 1:A){#annealing
for (r in 1:N){#basic gibbs steps
for (i in 2:(k+1))
for (j in 2:(m+1)){
prop=board
prop[i,j]=1-board[i,j]
u=runif(1)
if (log(u/(1-u))<temp*(counter(prop,k,m)-val)){
board=prop;val=counter(prop,k,m)
if (val>maxol){
maxol=val;sole=board}}
}}
temp=temp*2}
return(sole[-c(1,k+2),-c(1,m+2)])}


which leads systematically to the optimal solution, namely a perfect square k² when k is even and a perfect but one k²-1 when k is odd. When k=6, m=8, all entries can afford one neighbour exactly,

> gibzbbgiz(6,8)
[1] 48
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,]    1    0    0    1    1    0    0    1
[2,]    1    0    0    0    0    0    0    1
[3,]    0    0    1    0    0    1    0    0
[4,]    0    0    1    0    0    1    0    0
[5,]    1    0    0    0    0    0    0    1
[6,]    1    0    0    1    1    0    0    1


but this does not seem feasible when k=6, m=7, which only achieves 40 entries with one single neighbour.