Archive for consistency

online approximate Bayesian learning

Posted in Statistics on September 25, 2020 by xi'an

My friends and coauthors Matthieu Gerber and Randal Douc have just arXived a massive paper on online approximate Bayesian learning, namely the handling of the posterior distribution on the parameters of a state-space model, which remains a challenge to this day… Their starting point is the iterated batch importance sampling (IBIS) algorithm that Nicolas (Chopin, 2002) introduced in his PhD thesis. The online (“by online we mean that the memory and computational requirement to process each observation is finite and bounded uniformly in t”) method they construct guarantees that the approximate posterior converges to the (pseudo-)true value of the parameter as the sample size grows to infinity, where the sequence of approximations is a Cesàro mixture of initial approximations with Gaussian or t priors, AMIS-like. (I am somewhat uncertain about the notion of a sequence of priors used in this setup. Another funny feature is the necessity to consider a fat-tailed t prior from time to time in this sequence!) The sequence is in turn approximated by a particle filter. The computational cost of this IBIS is roughly O(NT), depending on the regeneration rate.

prior against truth!

Posted in Books, Kids, Statistics on June 4, 2018 by xi'an

A question from X validated had interesting ramifications, about what happens when the prior does not cover the true value of the parameter (assuming there is such a true value). In fact, not so much happens, in that, from a decision-theoretic perspective, the fact that π(θ⁰)=0, or even that π(θ)=0 in a neighbourhood of θ⁰, does not matter [too much]. Indeed, the formal derivation of a Bayes estimator as minimising the posterior loss means that the resulting estimator may take values that were “impossible” from a prior perspective! Taking for example the posterior mean, the convex combination of all possible values of θ under π may well escape the support of π when this support is not convex. Of course, one could argue that estimators should further be restricted to be possible values of θ under π, but that would reduce their decision-theoretic efficiency.

An example is the brilliant minimaxity result by George Casella and Bill Strawderman from 1981: when estimating a Normal mean μ based on a single observation x with the additional constraint that |μ|<ρ, and when ρ is small enough, ρ≤1.0567 quite specifically, the minimax estimator for this problem under squared error loss corresponds to a (least favourable) uniform prior on the pair {−ρ,ρ}, meaning that π gives equal weight to −ρ and ρ (and none to any other value of the mean μ). When ρ increases above this bound, the least favourable prior sees its support growing one point at a time, but remaining a finite set of possible values. However the posterior expectation, 𝔼[μ|x], can take any value on (−ρ,ρ).
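To make the last point concrete, here is a minimal sketch (mine, not taken from the Casella and Strawderman paper), assuming a unit-variance Normal observation and the two-point prior with mass 1/2 on each of −ρ and ρ. The function name `posterior_mean_two_point` is purely illustrative. Under these assumptions the posterior mean reduces to ρ tanh(ρx), which sweeps the whole open interval between the two support points without ever hitting them.

```python
import math


def posterior_mean_two_point(x, rho):
    """Posterior mean of mu given x ~ N(mu, 1), under the prior
    putting mass 1/2 on each of -rho and +rho (assumed setup)."""
    # Log-likelihood at each support point; common constants cancel.
    log_plus = -0.5 * (x - rho) ** 2
    log_minus = -0.5 * (x + rho) ** 2
    # Subtract the maximum before exponentiating, for numerical stability.
    m = max(log_plus, log_minus)
    w_plus = math.exp(log_plus - m)
    w_minus = math.exp(log_minus - m)
    # Weighted average of the two support points +rho and -rho.
    return rho * (w_plus - w_minus) / (w_plus + w_minus)


if __name__ == "__main__":
    rho = 1.0
    for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
        est = posterior_mean_two_point(x, rho)
        closed_form = rho * math.tanh(rho * x)
        print(f"x = {x:+.1f}  E[mu|x] = {est:+.6f}  rho*tanh(rho*x) = {closed_form:+.6f}")
```

Every printed estimate sits strictly between −ρ and ρ, that is, at a value to which the prior gives zero mass.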

In an even broader suspension of belief (in the prior), it may be that the prior has such a restricted support that it cannot consistently estimate the (true value of the) parameter, but the associated estimator may remain admissible or minimax.

the Hyvärinen score is back

Posted in pictures, Statistics, Travel on November 21, 2017 by xi'an

Stéphane Shao, Pierre Jacob and co-authors from Harvard have just posted on arXiv a new paper on Bayesian model comparison using the Hyvärinen score

\mathcal{H}(y, p) = 2\Delta_y \log p(y) + ||\nabla_y \log p(y)||^2

which thus uses the Laplacian as a natural and normalisation-free penalisation for the score test. (Score that I first met in Padova, a few weeks before moving from X to IX.) Which brings a decision-theoretic alternative to the Bayes factor and which delivers a coherent answer when using improper priors. Thus a very appealing proposal in my (biased) opinion! The paper is mostly computational in that it proposes SMC and SMC² solutions to handle the estimation of the Hyvärinen score for models with tractable likelihoods and tractable completed likelihoods, respectively. (Reminding me that Pierre worked on SMC² algorithms quite early during his Ph.D. thesis.)
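As a rough illustration of why this score is normalisation-free (my own sketch, not code from the paper), the snippet below evaluates the one-dimensional version of the formula above by central finite differences, for a Gaussian log-density known only up to an additive constant. The helper names `hyvarinen_score_1d` and `gaussian_log_density` are hypothetical, and the step size `h` is an arbitrary choice. Since only derivatives of log p enter the score, any constant, including the arbitrary one attached to an improper prior, drops out.

```python
import math


def hyvarinen_score_1d(log_p, y, h=1e-3):
    """H(y, p) = 2 (log p)''(y) + ((log p)'(y))^2, with both
    derivatives approximated by central finite differences."""
    grad = (log_p(y + h) - log_p(y - h)) / (2 * h)
    lap = (log_p(y + h) - 2 * log_p(y) + log_p(y - h)) / h ** 2
    return 2 * lap + grad ** 2


def gaussian_log_density(m, s, log_const=0.0):
    """Log-density of N(m, s^2) up to the additive constant log_const."""
    return lambda y: -0.5 * ((y - m) / s) ** 2 + log_const


if __name__ == "__main__":
    m, s, y = 1.0, 2.0, 0.3
    # Closed form for a Gaussian: -2 / s^2 + (y - m)^2 / s^4.
    exact = -2 / s ** 2 + (y - m) ** 2 / s ** 4
    normalised = gaussian_log_density(m, s, -math.log(s * math.sqrt(2 * math.pi)))
    unnormalised = gaussian_log_density(m, s, 5.0)  # arbitrary constant
    print("closed form      :", exact)
    print("normalised   p(y):", hyvarinen_score_1d(normalised, y))
    print("unnormalised p(y):", hyvarinen_score_1d(unnormalised, y))
```

The three printed values agree up to rounding error, whatever constant is plugged in.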

A most interesting remark in the paper is to recall that the Hyvärinen score associated with a generic model on a series must be the prequential (predictive) version

\mathcal{H}_T (M) = \sum_{t=1}^T \mathcal{H}(y_t, p_M(dy_t|y_{1:t-1}))

rather than the version on the joint marginal density of the whole series. (Followed by a remark within the remark that the logarithmic scoring rule does not make this distinction. And I had to write down the cascading representation

\log p(y_{1:T})=\sum_{t=1}^T \log p(y_t|y_{1:t-1})

to convince myself that this unnatural decomposition, where the posterior on θ varies with each term, is true!) For consistency reasons.
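For readers who need the same convincing, here is a small numerical check of the cascading representation on a toy conjugate model of my own choosing (not one from the paper): y_t given θ is N(θ,1) and θ is N(0,τ²). The function names and the data values are illustrative assumptions. The sequential version updates the posterior on θ at each step and sums the log predictive densities, while the joint version evaluates the marginal density of the whole series in one go.

```python
import math


def log_normal_pdf(y, mean, var):
    """Log-density of N(mean, var) at y."""
    return -0.5 * (math.log(2 * math.pi * var) + (y - mean) ** 2 / var)


def log_marginal_sequential(ys, tau2):
    """Sum over t of log p(y_t | y_{1:t-1}) for y_t | theta ~ N(theta, 1),
    theta ~ N(0, tau2), updating the posterior on theta at each step."""
    prec, running_sum, total = 1.0 / tau2, 0.0, 0.0
    for y in ys:
        post_mean, post_var = running_sum / prec, 1.0 / prec
        # Predictive of y_t: N(posterior mean, 1 + posterior variance).
        total += log_normal_pdf(y, post_mean, 1.0 + post_var)
        prec += 1.0
        running_sum += y
    return total


def log_marginal_joint(ys, tau2):
    """log p(y_{1:T}) computed directly from the joint Normal marginal
    with covariance I + tau2 * (matrix of ones)."""
    T = len(ys)
    sum_y = sum(ys)
    sum_y2 = sum(y * y for y in ys)
    quad = sum_y2 - tau2 * sum_y ** 2 / (1 + T * tau2)
    return -0.5 * (T * math.log(2 * math.pi) + math.log(1 + T * tau2) + quad)


if __name__ == "__main__":
    ys = [0.3, -1.2, 0.8, 2.1, 0.4]  # made-up observations
    print("sequential:", log_marginal_sequential(ys, tau2=4.0))
    print("joint     :", log_marginal_joint(ys, tau2=4.0))
```

Both numbers coincide up to rounding, which is the log-score identity. The Hyvärinen score admits no such telescoping, since it is not additive over the factors of the joint density in this way, hence the need to pick the prequential version explicitly.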

This prequential decomposition is however a plus in terms of computation when resorting to sequential Monte Carlo, since each time step produces an evaluation of the associated marginal. In the case of state-space models, another decomposition by the authors, based on measurement densities and partial conditional expectations of the latent states, allows for another (SMC²) approximation. The paper also establishes that, for non-nested models, the Hyvärinen score as a model selection tool asymptotically selects the model closest to the data-generating process, for the divergence induced by the score. Even for state-space models, under some technical assumptions. From this asymptotic perspective, the paper exhibits an example where the Bayes factor and the Hyvärinen factor disagree, even asymptotically in the number of observations, about which mis-specified model to select. And last but not least the authors propose and assess a discrete alternative relying on finite differences instead of derivatives. Which remains a proper scoring rule.

I am quite excited by this work (call me biased!) and I hope it can prompt follow-up work as a viable alternative to Bayes factors, if only for being more robust to the [unspecified] impact of the prior tails. As in the above picture where some realisations of the SMC² output and of the sequential decision process see the wrong model being almost acceptable for quite a long while…

repulsive mixtures

Posted in Books, Statistics on April 10, 2017 by xi'an

Fangzheng Xie and Yanxun Xu arXived today a paper on Bayesian repulsive modelling for mixtures. Not that Bayesian modelling is repulsive in any psychological sense, but rather that the components of the mixture are repulsive one against another. The device towards this repulsiveness is to add a penalty term to the original prior such that close means are penalised. (In the spirit of the sugar loaf with water drops represented on the cover of Bayesian Choice that we used in our pinball sampler, repulsiveness being there on the particles of a simulated sample and not on components.) Which means a prior assumption that close covariance matrices are of lesser importance. A question I have is why empty components are not excluded as well, but this does not make too much sense in the Dirichlet process formulation of the current paper. And in the finite mixture version the Dirichlet prior on the weights has coefficients less than one.

The paper establishes consistency results for such repulsive priors, both for estimating the distribution itself and the number of components, K, under a collection of assumptions on the distribution, prior, and repulsiveness factors. While I have no mathematical issue with such results, I always wonder at their relevance for a given finite sample from a finite mixture in that they give an impression that the number of components is a perfectly estimable quantity, which it is not (in my opinion!) because of the fluid nature of mixture components and therefore the inevitable impact of prior modelling. (As Larry Wasserman would pound in, mixtures like tequila are evil and should likewise be avoided!)

The implementation of this modelling goes through a “block-collapsed” Gibbs sampler that exploits the latent variable representation (as in our early mixture paper with Jean Diebolt). Which includes the Old Faithful data as an illustration (for which a submission of ours was recently rejected for using too old datasets). And uses the logarithm of the conditional predictive ordinate as an assessment tool, which is a posterior predictive estimated by MCMC, using the data a second time for the fit.

Approximate Bayesian computation via sufficient dimension reduction

Posted in Statistics, University life on August 26, 2016 by xi'an

“One of our contribution comes from the mathematical analysis of the consequence of conditioning the parameters of interest on consistent statistics and intrinsically inconsistent statistics”

Xiaolong Zhong and Malay Ghosh have just arXived an ABC paper focussing on the convergence of the method. And on the use of sufficient dimension reduction techniques for the construction of summary statistics. I had not heard of this approach before, so I read the paper with interest. I however regret that the paper does not link with the recent consistency results of Liu and Fearnhead and of Daniel Frazier, Gael Martin, Judith Rousseau and myself. When conditioning upon the MLE [or the posterior mean] as the summary statistic, Theorem 1 states that the Bernstein-von Mises theorem holds, missing a limit in the tolerance ε. And apparently missing conditions on the speed of convergence of this tolerance to zero although the conditioning event involves the true value of the parameter. This makes me wonder at the relevance of the result. The part about partial posteriors and the characterisation of limiting posterior distributions starts with the natural remark that the mean of the summary statistic must identify the whole parameter θ to achieve consistency, a point central to our 2014 JRSS B paper. The authors suggest using a support vector machine to derive the summary statistics, an idea already exploited by Heiko Strathmann et al. There is no consistency result of relevance for ABC in that second and final part, which ends rather abruptly. Overall, while the paper contributes to the current reflection on the convergence properties of ABC, the lack of scaling of the tolerance ε calls for further investigations.