Archive for ergodicity

O’Bayes 19/2

Posted in Books, pictures, Running, Travel, University life with tags , , , , , , , , , , , , , , , , , on July 1, 2019 by xi'an

One talk on Day 2 of O’Bayes 2019 was by Ryan Martin on data dependent priors (or “priors”). Which I have already discussed in this blog. Including the notion of a Gibbs posterior about quantities that “are not always defined through a model” [which is debatable if one sees it like part of a semi-parametric model]. Gibbs posterior that is built through a pseudo-likelihood constructed from the empirical risk, which reminds me of Bissiri, Holmes and Walker. Although requiring a prior on this quantity that is  not part of a model. And is not necessarily a true posterior and not necessarily with the same concentration rate as a true posterior. Constructing a data-dependent distribution on the parameter does not necessarily mean an interesting inference and to keep up with the theme of the conference has no automated claim to [more] “objectivity”.

And after calling a prior both Beauty and The Beast!, Erlis Ruli argued about a “bias-reduction” prior where the prior is solution to a differential equation related with some cumulants, connected with an earlier work of David Firth (Warwick).  An interesting conundrum is how to create an MCMC algorithm when the prior is that intractable, with a possible help from PDMP techniques like the Zig-Zag sampler.

While Peter Orbanz’ talk was centred on a central limit theorem under group invariance, further penalised by being the last of the (sun) day, Peter did a magnificent job of presenting the result and motivating each term. It reminded me of the work Jim Bondar was doing in Ottawa in the 1980’s on Haar measures for Bayesian inference. Including the notion of amenability [a term due to von Neumann] I had not met since then. (Neither have I met Jim since the last summer I spent in Carleton.) The CLT and associated LLN are remarkable in that the average is not over observations but over shifts of the same observation under elements of a sub-group of transformations. I wondered as well at the potential connection with the Read Paper of Kong et al. in 2003 on the use of group averaging for Monte Carlo integration [connection apart from the fact that both discussants, Michael Evans and myself, are present at this conference].

ABC with Gibbs steps

Posted in Statistics with tags , , , , , , , , , , , , , , , , , on June 3, 2019 by xi'an

With Grégoire Clarté, Robin Ryder and Julien Stoehr, all from Paris-Dauphine, we have just arXived a paper on the specifics of ABC-Gibbs, which is a version of ABC where the generic ABC accept-reject step is replaced by a sequence of n conditional ABC accept-reject steps, each aiming at an ABC version of a conditional distribution extracted from the joint and intractable target. Hence an ABC version of the standard Gibbs sampler. What makes it so special is that each conditional can (and should) be conditioning on a different statistic in order to decrease the dimension of this statistic, ideally down to the dimension of the corresponding component of the parameter. This successfully bypasses the curse of dimensionality but immediately meets with two difficulties. The first one is that the resulting sequence of conditionals is not coherent, since it is not a Gibbs sampler on the ABC target. The conditionals are thus incompatible and therefore convergence of the associated Markov chain becomes an issue. We produce sufficient conditions for the Gibbs sampler to converge to a stationary distribution using incompatible conditionals. The second problem is then that, provided it exists, the limiting and also intractable distribution does not enjoy a Bayesian interpretation, hence may fail to be justified from an inferential viewpoint. We however succeed in producing a version of ABC-Gibbs in a hierarchical model where the limiting distribution can be explicited and even better can be weighted towards recovering the original target. (At least with limiting zero tolerance.)

Hamiltonian MC on discrete spaces

Posted in Statistics, Travel, University life with tags , , , , , , , , on July 3, 2017 by xi'an

Following a lively discussion with Akihiko Nishimura during a BNP11 poster session last Tuesday, I took the opportunity of the flight to Montréal to read through the arXived paper (written jointly with David Dunson and Jianfeng Liu). The issue is thus one of handling discrete valued parameters in Hamiltonian Monte Carlo. The basic “trick” in handling this complexity goes by turning the discrete support via the inclusion of an auxiliary continuous variable whose discretisation is the discrete parameter, hence resembling to some extent the slice sampler. This removes the discreteness blockage but creates another difficulty, namely handling a discontinuous target density. (I idly wonder why the trick cannot be iterated to second or higher order so that to achieve the right amount of smoothness. Of course, the maths behind would be less cool!) The extension of the Hamiltonian to this setting by a  convolution is a trick I had not seen since the derivation of the Central Limit Theorem during Neveu’s course at Polytechnique.  What I find most exciting in the resolution is the move from a Gaussian momentum to a Laplace momentum, for the reason that I always wondered at alternatives [without trying anything myself!]. The Laplace version is indeed most appropriate here in that it avoids a computation of all discontinuity points and associated values along a trajectory. Since the moves are done component-wise, the method has a Metropolis-within-Gibbs flavour, which actually happens to be a special case. What is also striking is that the approach is both rejection-free and exact, provided ergodicity occurs, which is the case when the stepsize is random.

In addition to this resolution of the discrete parameter problem, the paper presents the further appeal of (re-)running an analysis of the Jolly-Seber capture-recapture model. Where the discrete parameter is the latent number of live animals [or whatever] in the system at any observed time. (Which we cover in Bayesian essentials with R as a neat entry to both dynamic and latent variable models.) I would have liked to see a comparison with the completion approach of Jérôme Dupuis (1995, Biometrika), since I figure the Metropolis version implemented here differs from Jérôme’s. The second example is built on Bissiri et al. (2016) surrogate likelihood (discussed earlier here) and Chopin and Ridgway (2017) catalogue of solutions for not analysing the Pima Indian dataset. (Replaced by another dataset here.)

ergodicity of approximate MCMC chains with applications to large datasets

Posted in pictures, Statistics, Travel, University life with tags , , , , , , , , , , on August 31, 2015 by xi'an

bhamAnother arXived paper I read on my way to Warwick! And yet another paper written by my friend Natesh Pillai (and his co-author Aaron Smith, from Ottawa). The goal of the paper is to study the ergodicity and the degree of approximation of the true posterior distribution of approximate MCMC algorithms that recently flourished as an answer to “Big Data” issues… [Comments below are about the second version of this paper.] One of the most curious results in the paper is the fact that the approximation may prove better than the original kernel, in terms of computing costs! If asymptotically in the computing cost. There also are acknowledged connections with the approximative MCMC kernel of Pierre Alquier, Neal Friel, Richard Everitt and A Boland, briefly mentioned in an earlier post.

The paper starts with a fairly theoretical part, to follow with an application to austerity sampling [and, in the earlier version of the paper, to the Hoeffding bounds of Bardenet et al., both discussed earlier on the ‘Og, to exponential random graphs (the paper being rather terse on the description of the subsampling mechanism), to stochastic gradient Langevin dynamics (by Max Welling and Yee-Whye Teh), and to ABC-MCMC]. The assumptions are about the transition kernels of a reference Markov kernel and of one associated with the approximation, imposing some bounds on the Wasserstein distance between those kernels, K and K’. Results being generic, there is no constraint as to how K is chosen or on how K’ is derived from K. Except in Lemma 3.6 and in the application section, where the same proposal kernel L is used for both Metropolis-Hastings algorithms K and K’. While I understand this makes for an easier coupling of the kernels, this also sounds like a restriction to me in that modifying the target begs for a similar modification in the proposal, if only because the tails they are a-changin’

In the case of subsampling the likelihood to gain computation time (as discussed by Korattikara et al. and by Bardenet et al.), the austerity algorithm as described in Algorithm 2 is surprising as the average of the sampled data log-densities and the log-transform of the remainder of the Metropolis-Hastings probability, which seem unrelated, are compared until they are close enough.  I also find hard to derive from the different approximation theorems bounding exceedance probabilities a rule to decide on the subsampling rate as a function of the overall sample size and of the computing cost. (As a side if general remark, I remain somewhat reserved about the subsampling idea, given that it requires the entire dataset to be available at every iteration. This makes parallel implementations rather difficult to contemplate.)

can we trust computer simulations? [day #2]

Posted in Books, pictures, Statistics, Travel, University life with tags , , , , , , , , , , , , , , , , , , , , , , on July 13, 2015 by xi'an

Herrenhausen“Sometimes the models are better than the data.” G. Krinner

Second day at the conference on building trust in computer simulations. Starting with a highly debated issue, climate change projections. Since so many criticisms are addressed to climate models as being not only wrong but also unverifiable. And uncheckable. As explained by Gerhart Krinner, the IPCC has developed methodologies to compare models and evaluate predictions. However, from what I understood, this validation does not say anything about the future, which is the part of the predictions that matters. And that is attacked by critics and feeds climatic-skeptics. Because it is so easy to argue against the homogeneity of the climate evolution and for “what you’ve seen is not what you’ll get“! (Even though climatic-skeptics are the least likely to use this time-heterogeneity argument, being convinced as they are of the lack of human impact over the climate.)  The second talk was by Viktoria Radchuk about validation in ecology. Defined here as a test of predictions against independent data (and designs). And mentioning Simon Wood’s synthetic likelihood as the Bayesian reference for conducting model choice (as a synthetic likelihoods ratio). I had never thought of this use (found in Wood’s original paper) for synthetic likelihood, I feel a bit queasy about using a synthetic likelihood ratio as a genuine likelihood ratio. Which led to a lively discussion at the end of her talk. The next talk was about validation in economics by Matteo Richiardi, who discussed state-space models where the hidden state is observed through a summary statistic, perfect playground for ABC! But Matteo opted instead for a non-parametric approach that seems to increase imprecision and that I have never seen used in state-space models. The last part of the talk was about non-ergodic models, for which checking for validity becomes much more problematic, in my opinion. Unless one manages multiple observations of the non-ergodic path. Nicole Saam concluded this “Validation in…” morning with Validation in Sociology. With a more pessimistic approach to the possibility of finding a falsifying strategy, because of the vague nature of sociology models. For which data can never be fully informative. She illustrated the issue with an EU negotiation analysis. Where most hypotheses could hardly be tested.

“Bayesians persist with poor examples of randomness.” L. Smith

“Bayesians can be extremely reasonable.” L. Smith

The afternoon session was dedicated to methodology, mostly statistics! Andrew Robinson started with a talk on (frequentist) model validation. Called splitters and lumpers. Illustrated by a forest growth model. He went through traditional hypothesis tests like Neyman-Pearson’s that try to split between samples. And (bio)equivalence tests that take difference as the null. Using his equivalence R package. Then Leonard Smith took over [in a literal way!] from a sort-of-Bayesian perspective, in a work joint with Jim Berger and Gary Rosner on pragmatic Bayes which was mostly negative about Bayesian modelling. Introducing (to me) the compelling notion of structural model error as a representation of the inadequacy of the model. With illustrations from weather and climate models. His criticism of the Bayesian approach is that it cannot be holistic while pretending to be [my wording]. And being inadequate to measure model inadequacy, to the point of making prior choice meaningless. Funny enough, he went back to the ball dropping experiment David Higdon discussed at one JSM I attended a while ago, with the unexpected outcome that one ball did not make it to the bottom of the shaft. A more positive side was that posteriors are useful models but should not be interpreted from a probabilistic perspective. Move beyond probability was his final message. (For most of the talk, I misunderstood P(BS), the probability of a big surprise, for something else…) This was certainly the most provocative talk of the conference  and the discussion could have gone on for the rest of day! Somewhat, Lenny was voluntarily provocative in piling the responsibility upon the Bayesian’s head for being overconfident and not accounting for the physicist’ limitations in modelling the phenomenon of interest. Next talk was by Edward Dougherty on methods used in biology. He separated within-model uncertainty from outside-model inadequacy. The within model part is mostly easy to agree upon. Even though difficulties in estimating parameters creates uncertainty classes of models. Especially because of being from a small data discipline. He analysed the impact of machine learning techniques like classification as being useless without prior knowledge. And argued in favour of the Bayesian minimum mean square error estimator. Which can also lead to a classifier. And experimental design. (Using MSE seems rather reductive when facing large dimensional parameters.) Last talk of the day was by Nicolas Becu, a geographer, with a surprising approach to validation via stakeholders. A priori not too enticing a name! The discussion was of a more philosophical nature, going back to (re)define validation against reality and imperfect models. And including social aspects of validation, e.g., reality being socially constructed. This led to the stakeholders, because a model is then a shared representation. Nicolas illustrated the construction by simulation “games” of a collective model in a community of Thai farmers and in a group of water users.

In a rather unique fashion, we also had an evening discussion on points we share and points we disagreed upon. After dinner (and wine), which did not help I fear! Bill Oberkampf mentioned the use of manufactured solutions to check code, which seemed very much related to physics. But then we got mired into the necessity of dividing between verification and validation. Which sounded very and too much engineering-like to me. Maybe because I do not usually integrate coding errors and algorithmic errors into my reasoning (verification)… Although sharing code and making it available makes a big difference. Or maybe because considering all models are wrong is neither part of my methodology (validation). This part ended up in a fairly pessimistic conclusion on the lack of trust in most published articles. At least in the biological sciences.