## null recurrent = zero utility?

Posted in Books, R, Statistics with tags , , , , , , , on April 28, 2022 by xi'an

The stability result that the ratio

$\dfrac{\sum^T_{t=1} f(\theta^{(t)})}{\sum^T_{t=1} g(\theta^{(t)})}\qquad(1)$

converges holds for a Harris π-null-recurrent Markov chain for all functions f,g in L¹(π) [Meyn & Tweedie, 1993, Theorem 17.3.2] is rather fascinating. However, it is unclear it can be useful in simulation environments, as for the integral priors we have been studying over the years with Juan Antonio Cano and Diego Salmeron Martinez. Above, the result of an experiment where I simulated a Markov chain as a Normal random walk in dimension one, hence a Harris π-null-recurrent Markov chain for the Lebesgue measure λ, and monitored the stabilisation of the ratio (1) when using two densities for f and g,  to its expected value (1, shown by a red horizontal line). There is quite a variability in the outcome (repeated 100 times),  but the most intriguing is the quick stabilisation of most cumulated averages to values different from 1. Even longer runs display this feature

which I would blame on the excursions of the random walk far away from the central regions for both f and g, that is on long sequences where zeroes keep being added to numerator and denominators in (1). As far as integral approximation is concerned, this is not very helpful!

## O’Bayes 19/2

Posted in Books, pictures, Running, Travel, University life with tags , , , , , , , , , , , , , , , , , on July 1, 2019 by xi'an

One talk on Day 2 of O’Bayes 2019 was by Ryan Martin on data dependent priors (or “priors”). Which I have already discussed in this blog. Including the notion of a Gibbs posterior about quantities that “are not always defined through a model” [which is debatable if one sees it like part of a semi-parametric model]. Gibbs posterior that is built through a pseudo-likelihood constructed from the empirical risk, which reminds me of Bissiri, Holmes and Walker. Although requiring a prior on this quantity that is  not part of a model. And is not necessarily a true posterior and not necessarily with the same concentration rate as a true posterior. Constructing a data-dependent distribution on the parameter does not necessarily mean an interesting inference and to keep up with the theme of the conference has no automated claim to [more] “objectivity”.

And after calling a prior both Beauty and The Beast!, Erlis Ruli argued about a “bias-reduction” prior where the prior is solution to a differential equation related with some cumulants, connected with an earlier work of David Firth (Warwick).  An interesting conundrum is how to create an MCMC algorithm when the prior is that intractable, with a possible help from PDMP techniques like the Zig-Zag sampler.

While Peter Orbanz’ talk was centred on a central limit theorem under group invariance, further penalised by being the last of the (sun) day, Peter did a magnificent job of presenting the result and motivating each term. It reminded me of the work Jim Bondar was doing in Ottawa in the 1980’s on Haar measures for Bayesian inference. Including the notion of amenability [a term due to von Neumann] I had not met since then. (Neither have I met Jim since the last summer I spent in Carleton.) The CLT and associated LLN are remarkable in that the average is not over observations but over shifts of the same observation under elements of a sub-group of transformations. I wondered as well at the potential connection with the Read Paper of Kong et al. in 2003 on the use of group averaging for Monte Carlo integration [connection apart from the fact that both discussants, Michael Evans and myself, are present at this conference].

## ABC with Gibbs steps

Posted in Statistics with tags , , , , , , , , , , , , , , , , , on June 3, 2019 by xi'an

With Grégoire Clarté, Robin Ryder and Julien Stoehr, all from Paris-Dauphine, we have just arXived a paper on the specifics of ABC-Gibbs, which is a version of ABC where the generic ABC accept-reject step is replaced by a sequence of n conditional ABC accept-reject steps, each aiming at an ABC version of a conditional distribution extracted from the joint and intractable target. Hence an ABC version of the standard Gibbs sampler. What makes it so special is that each conditional can (and should) be conditioning on a different statistic in order to decrease the dimension of this statistic, ideally down to the dimension of the corresponding component of the parameter. This successfully bypasses the curse of dimensionality but immediately meets with two difficulties. The first one is that the resulting sequence of conditionals is not coherent, since it is not a Gibbs sampler on the ABC target. The conditionals are thus incompatible and therefore convergence of the associated Markov chain becomes an issue. We produce sufficient conditions for the Gibbs sampler to converge to a stationary distribution using incompatible conditionals. The second problem is then that, provided it exists, the limiting and also intractable distribution does not enjoy a Bayesian interpretation, hence may fail to be justified from an inferential viewpoint. We however succeed in producing a version of ABC-Gibbs in a hierarchical model where the limiting distribution can be explicited and even better can be weighted towards recovering the original target. (At least with limiting zero tolerance.)

## Hamiltonian MC on discrete spaces

Posted in Statistics, Travel, University life with tags , , , , , , , , on July 3, 2017 by xi'an

Following a lively discussion with Akihiko Nishimura during a BNP11 poster session last Tuesday, I took the opportunity of the flight to Montréal to read through the arXived paper (written jointly with David Dunson and Jianfeng Liu). The issue is thus one of handling discrete valued parameters in Hamiltonian Monte Carlo. The basic “trick” in handling this complexity goes by turning the discrete support via the inclusion of an auxiliary continuous variable whose discretisation is the discrete parameter, hence resembling to some extent the slice sampler. This removes the discreteness blockage but creates another difficulty, namely handling a discontinuous target density. (I idly wonder why the trick cannot be iterated to second or higher order so that to achieve the right amount of smoothness. Of course, the maths behind would be less cool!) The extension of the Hamiltonian to this setting by a  convolution is a trick I had not seen since the derivation of the Central Limit Theorem during Neveu’s course at Polytechnique.  What I find most exciting in the resolution is the move from a Gaussian momentum to a Laplace momentum, for the reason that I always wondered at alternatives [without trying anything myself!]. The Laplace version is indeed most appropriate here in that it avoids a computation of all discontinuity points and associated values along a trajectory. Since the moves are done component-wise, the method has a Metropolis-within-Gibbs flavour, which actually happens to be a special case. What is also striking is that the approach is both rejection-free and exact, provided ergodicity occurs, which is the case when the stepsize is random.

In addition to this resolution of the discrete parameter problem, the paper presents the further appeal of (re-)running an analysis of the Jolly-Seber capture-recapture model. Where the discrete parameter is the latent number of live animals [or whatever] in the system at any observed time. (Which we cover in Bayesian essentials with R as a neat entry to both dynamic and latent variable models.) I would have liked to see a comparison with the completion approach of Jérôme Dupuis (1995, Biometrika), since I figure the Metropolis version implemented here differs from Jérôme’s. The second example is built on Bissiri et al. (2016) surrogate likelihood (discussed earlier here) and Chopin and Ridgway (2017) catalogue of solutions for not analysing the Pima Indian dataset. (Replaced by another dataset here.)

## ergodicity of approximate MCMC chains with applications to large datasets

Posted in pictures, Statistics, Travel, University life with tags , , , , , , , , , , on August 31, 2015 by xi'an

Another arXived paper I read on my way to Warwick! And yet another paper written by my friend Natesh Pillai (and his co-author Aaron Smith, from Ottawa). The goal of the paper is to study the ergodicity and the degree of approximation of the true posterior distribution of approximate MCMC algorithms that recently flourished as an answer to “Big Data” issues… [Comments below are about the second version of this paper.] One of the most curious results in the paper is the fact that the approximation may prove better than the original kernel, in terms of computing costs! If asymptotically in the computing cost. There also are acknowledged connections with the approximative MCMC kernel of Pierre Alquier, Neal Friel, Richard Everitt and A Boland, briefly mentioned in an earlier post.

The paper starts with a fairly theoretical part, to follow with an application to austerity sampling [and, in the earlier version of the paper, to the Hoeffding bounds of Bardenet et al., both discussed earlier on the ‘Og, to exponential random graphs (the paper being rather terse on the description of the subsampling mechanism), to stochastic gradient Langevin dynamics (by Max Welling and Yee-Whye Teh), and to ABC-MCMC]. The assumptions are about the transition kernels of a reference Markov kernel and of one associated with the approximation, imposing some bounds on the Wasserstein distance between those kernels, K and K’. Results being generic, there is no constraint as to how K is chosen or on how K’ is derived from K. Except in Lemma 3.6 and in the application section, where the same proposal kernel L is used for both Metropolis-Hastings algorithms K and K’. While I understand this makes for an easier coupling of the kernels, this also sounds like a restriction to me in that modifying the target begs for a similar modification in the proposal, if only because the tails they are a-changin’

In the case of subsampling the likelihood to gain computation time (as discussed by Korattikara et al. and by Bardenet et al.), the austerity algorithm as described in Algorithm 2 is surprising as the average of the sampled data log-densities and the log-transform of the remainder of the Metropolis-Hastings probability, which seem unrelated, are compared until they are close enough.  I also find hard to derive from the different approximation theorems bounding exceedance probabilities a rule to decide on the subsampling rate as a function of the overall sample size and of the computing cost. (As a side if general remark, I remain somewhat reserved about the subsampling idea, given that it requires the entire dataset to be available at every iteration. This makes parallel implementations rather difficult to contemplate.)