## O’Bayes 19/1 [snapshots]

Posted in Books, pictures, Statistics, University life on June 30, 2019 by xi'an

Although yesterday’s O’Bayes 2019 tutorials were poorly attended, despite being great entries into objective Bayesian model choice, recent advances in MCMC methodology, and the multiple layers of BART, the first day of talks was well attended, despite weekend commitments, conference fatigue, and perfect summer weather! I have to blame myself for the tutorial attendance, for sticking the beginning of O’Bayes too closely to the end of BNP, as only the most dedicated could manage the commute from Oxford to Coventry to reach Warwick in time. Here are some snapshots from my bench (and apologies for not covering better the more theoretical talks I had trouble following, due to an early and intense morning swimming lesson! Like Steve Walker’s utility-based derivation of priors that generalise maximum entropy priors. But being entirely independent from the model does not sound to me like such a desirable feature… And Natalia Bochkina’s Bernstein-von Mises theorem for a location-scale semi-parametric model, including a clever construct of a mixture of two Dirichlet priors to achieve proper convergence.)

Jim Berger started the day with a talk on imprecise probabilities, involving the society for imprecise probability, which I discovered while reading Keynes’ book. The talk offered a neat resolution of the Jeffreys-Lindley paradox: when the null is re-expressed as an imprecise null, the posterior probability of the null no longer converges to one, its limit depending on the prior modelling, if involving a prior on the bias as well. Chris discussed the talk and mentioned a recent work with Edwin Fong on reinterpreting marginal likelihood as exhaustive cross-validation, summing over all possible subsets of the data [using log marginal predictive].

Håvard Rue gave a follow-up to his Valencià O’Bayes 2015 talk on PC-priors. With a pretty hilarious introduction on his difficulties with constructing priors and counseling students about their Bayesian modelling. With a list of principles and desiderata to define a reference prior. However, I somewhat disagree with his argument that the Kullback-Leibler distance from the simpler (base) model cannot be scaled, as it is essentially a log-likelihood. And it feels like multivariate parameters need some sort of separability to define distance(s) to the base model, since the distance somewhat summarises the whole departure from the simpler model. (Håvard also joined my achievement of putting an ostrich in a slide!) In his discussion, Robin Ryder made a very pragmatic recap of the difficulties with constructing priors. And pointed out a natural link with ABC (which brings us back to Don Rubin’s motivation for introducing the algorithm as a formal thought experiment).
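
The reinterpretation of the marginal likelihood as exhaustive cross-validation can be checked numerically on a toy case. The sketch below is my own reading of the identity, not code from the work mentioned above: it assumes a Bernoulli model with a Beta(1,1) prior (chosen only because everything is available in closed form) and averages the log predictive score over every test set of every size p, with the remaining points as training set. All function names are made up for the illustration.

```python
import math
from itertools import combinations

def log_marginal(y, a=1.0, b=1.0):
    """Closed-form log marginal likelihood of binary data under a Beta(a, b) prior."""
    k, n = sum(y), len(y)
    return (math.lgamma(a + k) + math.lgamma(b + n - k) - math.lgamma(a + b + n)
            - math.lgamma(a) - math.lgamma(b) + math.lgamma(a + b))

def log_predictive(y_new, train, a=1.0, b=1.0):
    """Log posterior predictive of one held-out point given the training points."""
    p_one = (a + sum(train)) / (a + b + len(train))
    return math.log(p_one if y_new == 1 else 1.0 - p_one)

def exhaustive_cv(y):
    """Sum over p of the leave-p-out score, averaged over all test sets of size p."""
    n = len(y)
    total = 0.0
    for p in range(1, n + 1):
        test_sets = list(combinations(range(n), p))
        score_p = 0.0
        for test in test_sets:
            held_out = set(test)
            train = [y[i] for i in range(n) if i not in held_out]
            # average log predictive over the p held-out points
            score_p += sum(log_predictive(y[j], train) for j in test) / p
        total += score_p / len(test_sets)
    return total
```

On a short sequence such as `[1, 0, 1, 1, 0, 1]` both functions should return the same value up to rounding, which follows from averaging the chain rule decomposition of the marginal over all orderings of the data. The cost is exponential in n, so this is a sanity check rather than a computing strategy.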

Sara Wade gave the final talk of the day, about her work on Bayesian cluster analysis. Whose discussion in Bayesian Analysis I alas missed. Cluster estimation, as mentioned frequently on this blog, is a rather frustrating challenge despite the simple formulation of the problem. (And I will not mention Larry’s tequila analogy!) The current approach is based on loss functions directly addressing the clustering aspect, integrating out the parameters. Which produces the interesting notion of neighbourhoods of partitions and hence credible balls in the space of partitions. It still remains unclear to me that cluster estimation is at all achievable, since the partition space explodes with the sample size and hence makes the most probable cluster more and more unlikely in that space. Somewhat paradoxically, the paper concludes that estimating the cluster produces a more reliable estimator of the number of clusters than looking at the marginal distribution of this number. In her discussion, Clara Grazian also pointed out the ambivalent use of clustering, where the intended meaning somehow diverges from the meaning induced by the mixture model.

## asymptotics of synthetic likelihood

Posted in pictures, Statistics, Travel on March 11, 2019 by xi'an

David Nott, Chris Drovandi and Robert Kohn just arXived a paper on a comparison between ABC and synthetic likelihood, which is both interesting and timely given that synthetic likelihood seems to be lagging behind in terms of theoretical evaluation. I am however as puzzled by the results therein as I was by the earlier paper by Price et al. on the same topic. Maybe due to the Cambodia jetlag, which is where and when I read the paper.

My puzzlement, thus, comes from the difficulty in comparing both approaches on a strictly common ground. The paper first establishes convergence and asymptotic normality for synthetic likelihood, based on the 2003 MCMC paper of Chernozhukov and Hong [which I never studied in detail but which appears to be the MCMC reference in the econometrics literature]. The results are similar to recent ABC convergence results, unsurprisingly when assuming a CLT on the summary statistic vector. One additional dimension of the paper is to consider convergence for a misspecified covariance matrix in the synthetic likelihood [and it will come back with a vengeance]. And asymptotic normality of the synthetic score function. Which is obviously unavailable in intractable models.

The first point I have difficulty with is how the computing time required for approximating mean and variance in the synthetic likelihood, by Monte Carlo means, is not accounted for in the comparison between ABC and synthetic likelihood versions. Remember that ABC only requires one (or at most two) pseudo-samples per parameter simulation. Synthetic likelihood requires M of them, with M later constrained to increase to infinity with the sample size. And these simulations are usually the costliest part of the algorithms. If ABC were to use M simulated samples as well, since it already relies on a kernel, it could as well construct [at least in principle] a similar estimator of the [summary statistic] density. Or else produce M times more pairs (parameter x pseudo-sample). The authors pointed out (once this post was out) that they do account for the factor M when computing the effective sample size (before Lemma 4, page 12), but I still miss why the ESS converging to N=MN/M when M goes to infinity is such a positive feature.
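
To make the cost argument concrete, here is a minimal sketch of one synthetic likelihood evaluation, under assumptions of my own: a toy simulator drawing n values from N(θ,1), a scalar summary (the sample mean), and hypothetical function names. It is not the implementation of the paper, only an illustration that each evaluation at a single θ consumes M pseudo-samples.

```python
import math
import random
import statistics

def simulate_summary(theta, n, rng):
    """Toy simulator (an assumption): n draws from N(theta, 1), summarised by their mean."""
    return statistics.fmean([rng.gauss(theta, 1.0) for _ in range(n)])

def synthetic_loglik(s_obs, theta, M, n, rng):
    """Gaussian synthetic log-likelihood of s_obs at theta, using M fresh pseudo-samples."""
    sims = [simulate_summary(theta, n, rng) for _ in range(M)]  # M simulator calls
    mu_hat = statistics.fmean(sims)
    var_hat = statistics.variance(sims)  # unbiased Monte Carlo variance estimate
    return (-0.5 * math.log(2.0 * math.pi * var_hat)
            - 0.5 * (s_obs - mu_hat) ** 2 / var_hat)
```

For instance, `synthetic_loglik(0.1, 0.0, 200, 50, random.Random(1))` uses 200 simulated datasets for one value of θ, where a basic ABC step would use a single one for the same θ. With a vector of summaries the variance becomes a covariance matrix, which is where the number of parameters to estimate starts weighing on the choice of M.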

Another point deals with the use of multiple approximate posteriors in the comparison. Since the approximations differ, it is unclear that convergence to a given approximation is all that should matter, if the approximation is less efficient [when compared with the original and out-of-reach posterior distribution]. Especially for a finite sample size n. This chasm in the targets becomes more evident when the authors discuss the use of a constrained synthetic likelihood covariance matrix towards requiring fewer pseudo-samples, i.e. lower values of M, because of a smaller number of parameters to estimate. This should be balanced against the loss in concentration of the synthetic approximation, as exemplified by the realistic examples in the paper. (It is also hard to see why M could not be of order √n for Monte Carlo reasons.)

The last section of the paper revolves around diverse issues for misspecified models, from wrong covariance matrix to wrong generating model. As we just submitted a paper on ABC for misspecified models, I will not engage in a debate on this point, but I find the proposed strategy, which goes through an approximation of the log-likelihood surface by a Gaussian process and a derivation of the covariance matrix of the score function, apparently greedy in both calibration and computing. And not so clearly validated when the generating model is misspecified.

## auxiliary likelihood-based approximate Bayesian computation in state-space models

Posted in Books, pictures, Statistics, University life on May 2, 2016 by xi'an

With Gael Martin, Brendan McCabe, David T. Frazier, and Worapree Maneesoonthorn, we arXived (and submitted) a strongly revised version of our earlier paper. We begin by demonstrating that reduction to a set of sufficient statistics of reduced dimension relative to the sample size is infeasible for most state-space models, hence calling for the use of partial posteriors in such settings. Then we give conditions [like parameter identification] under which ABC methods are Bayesian consistent, when using an auxiliary model to produce summaries, either as MLEs or [more efficiently] scores. Indeed, for the order of accuracy required by the ABC perspective, scores are equivalent to MLEs but are computed much faster than MLEs. Those conditions happen to be weaker than those found in the recent papers of Li and Fearnhead (2016) and Creel et al. (2015), in particular as we make no assumption about the limiting distributions of the summary statistics. We also tackle the dimensionality curse that plagues ABC techniques by numerically exhibiting the improved accuracy brought by looking at marginal rather than joint modes. That is, by matching individual parameters via the corresponding scalar score of the integrated auxiliary likelihood rather than matching on the multi-dimensional score statistics. The approach is illustrated on realistically complex models, namely a (latent) Ornstein-Uhlenbeck process, for which a discrete-time linear Gaussian approximation is adopted along with a Kalman filter auxiliary likelihood. And a square root volatility process with an auxiliary likelihood associated with an Euler discretisation and the augmented unscented Kalman filter. In our experiments, we compared our auxiliary-based technique to the two-step approach of Fearnhead and Prangle (in the Read Paper of 2012), exhibiting improvement for the examples analysed therein.
Somewhat predictably, an important challenge in this approach, shared with the related techniques of indirect inference and efficient method of moments, is the choice of a computationally efficient and accurate auxiliary model. But most of the current ABC literature discusses the role and choice of the summary statistics, which amounts to the same challenge, while missing the regularity provided by score functions of our auxiliary models.
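
As a rough illustration of using an auxiliary score as the ABC summary, here is a toy sketch that is not one of the models from the paper. It assumes a latent AR(1) state observed with Gaussian noise, a deliberately wrong auxiliary model (a Gaussian AR(1) fitted directly to the observations), a uniform prior on (0, 0.95), and nearest-neighbour acceptance. All names and settings are hypothetical.

```python
import random

def simulate_state_space(rho, T, rng, obs_sd=0.5):
    """Latent AR(1) state with unit innovations, observed with N(0, obs_sd^2) noise."""
    x = rng.gauss(0.0, 1.0) / (1.0 - rho ** 2) ** 0.5  # stationary start
    ys = []
    for _ in range(T):
        x = rho * x + rng.gauss(0.0, 1.0)
        ys.append(x + rng.gauss(0.0, obs_sd))
    return ys

def auxiliary_mle(y):
    """MLE of the auxiliary model y_t = phi * y_{t-1} + e_t, e_t ~ N(0, s2)."""
    phi = (sum(a * b for a, b in zip(y[1:], y[:-1]))
           / sum(b * b for b in y[:-1]))
    resid = [a - phi * b for a, b in zip(y[1:], y[:-1])]
    return phi, sum(r * r for r in resid) / len(resid)

def auxiliary_score(z, phi, s2):
    """Averaged auxiliary score in phi, evaluated on simulated data z."""
    return (sum((a - phi * b) * b for a, b in zip(z[1:], z[:-1]))
            / (s2 * (len(z) - 1)))

def abc_score(y_obs, n_draws, keep, rng):
    """Keep the prior draws whose simulated data give the smallest |score|."""
    phi_hat, s2_hat = auxiliary_mle(y_obs)  # computed once, on the observed data
    draws = []
    for _ in range(n_draws):
        rho = rng.uniform(0.0, 0.95)
        z = simulate_state_space(rho, len(y_obs), rng)
        draws.append((abs(auxiliary_score(z, phi_hat, s2_hat)), rho))
    draws.sort()
    return [rho for _, rho in draws[:keep]]
```

The auxiliary MLE is only computed once, on the observed data, and each simulated dataset then costs a single score evaluation with no optimisation. In this toy the MLE is closed-form, so the speed gain does not show, but it would with an auxiliary likelihood requiring numerical maximisation. Only the score in φ is matched here, in the spirit of matching one parameter through one scalar score.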

## a third way of probability?

Posted in Books, Mountains, Statistics, Travel, University life on September 5, 2015 by xi'an

Because the title intrigued me (who would dream of claiming connection with Tony Blair’s “new” Labour move to centre-right?!), I downloaded William Briggs’ paper The Third Way of Probability & Statistics from arXiv and read it while secluded away, with no connection to the outside world, at Longmire, Mount Rainier National Park. Early morning at Paradise Inn. The subtitle of the document is “Beyond Testing and Estimation To Importance, Relevance, and Skill”. Actually, Longmire may have been the only place where I would read through the entire paper and its 14 pages, as the document somewhat sounds like a practical (?) joke. And almost made me wonder whether Mr Briggs was a pseudonym… And where the filter behind arXiv publishing principles was that day.

The notion behind Briggs’ third way is that parameters do not exist and that only conditional probability exists. Not exactly a novel perspective then. The first five pages go on repeating this principle in various ways, without ever embarking on the implementation of the idea, at best referring to a future book in search of a friendly publisher… The remainder of the paper proceeds to analyse a college GPA dataset without ever explaining how the predictive distribution was constructed. The only discussion is about devising a tool to compare predictors, which is chosen as the continuous ranked probability score of Gneiting and Raftery (2007). Looking at those scores seems to encompass this third way advocated by the author, then, which sounds to me to be an awfully short lane into statistics. With no foray whatsoever into probability.
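
For reference, the continuous ranked probability score can be approximated from draws of a predictive distribution through the representation E|X−y| − ½E|X−X′|, with X and X′ independent draws from the predictive. The sketch below uses this in its loss orientation (lower is better) and a hypothetical function name. It says nothing about how the paper computed its scores.

```python
def crps_sample(draws, y):
    """Sample-based CRPS of predictive draws against the realised value y."""
    m = len(draws)
    # E|X - y|, estimated by the average distance to the observation
    spread_to_obs = sum(abs(x - y) for x in draws) / m
    # (1/2) E|X - X'|, estimated over all ordered pairs of draws
    internal_spread = sum(abs(a - b) for a in draws for b in draws) / (2 * m * m)
    return spread_to_obs - internal_spread
```

A predictive that puts all its mass on the realised value scores zero, and a two-point predictive on {0, 2} scores 0.5 against y = 1, which agrees with integrating the squared difference between the predictive cdf and the step function at y.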

## Statistics slides (4)

Posted in Books, Kids, Statistics, University life on November 10, 2014 by xi'an

Here is the fourth set of slides for my third year statistics course, trying to build intuition, through graphs, about the likelihood surface and why on Earth one would want to find its maximum?! I am as yet uncertain whether or not I will reach the point where I can teach more asymptotics, so maybe I will also include asymptotic normality of the MLE under regularity conditions in this chapter…

## Approximate Bayesian Computation in state space models

Posted in Statistics, Travel, University life on October 2, 2014 by xi'an

While it took quite a while (!), with several visits by three of us to our respective antipodes, incl. my exciting trip to Melbourne and Monash University two years ago, our paper on ABC for state space models was arXived yesterday! Thanks to my coauthors, Gael Martin, Brendan McCabe, and Worapree Maneesoonthorn, I am very glad about this outcome and the new perspective on ABC it produces. For one thing, it concentrates on the selection of summary statistics from a more econometric point of view than usual, defining asymptotic sufficiency in this context and demonstrating that both asymptotic sufficiency and Bayes consistency can be achieved when using maximum likelihood estimators of the parameters of an auxiliary model as summary statistics. In addition, the proximity to (asymptotic) sufficiency yielded by the MLE is replicated by the score vector. Using the score instead of the MLE as a summary statistic allows for huge gains in terms of speed. The method is then applied to a continuous time state space model, using an augmented unscented Kalman filter as auxiliary model. We also found in the various state space models tested therein that the ABC approach based on the marginal [likelihood] score was performing quite well, including wrt Fearnhead’s and Prangle’s (2012) approach… I like the idea of using such a generic object as the unscented Kalman filter for state space models, even when it is not a particularly accurate representation of the true model. Another appealing feature of the paper is in the connections made with indirect inference.

## checking for finite variance of importance samplers

Posted in R, Statistics, Travel, University life on June 11, 2014 by xi'an

Over a welcome curry last night in Edinburgh I read this 2008 paper by Koopman, Shephard and Creal, testing the assumptions behind importance sampling, whose purpose is to check on-line for (in)finite variance in an importance sampler, based on the empirical distribution of the importance weights. To this end, the authors use the upper tail of the weights and a limit theorem that provides the limiting distribution as a type of Pareto distribution

$\dfrac{1}{\beta}\left(1+\xi z/\beta \right)^{-1-1/\xi}$

over (0,∞). They then implement a series of asymptotic tests like the likelihood ratio, Wald and score tests to assess whether or not the shape parameter ξ of the Pareto distribution is below ½, which is the threshold under which the weights have a finite variance. While there is nothing wrong with this approach, which produces a statistically validated diagnosis, I still wonder at the added value from a practical perspective, as raw graphs of the estimation sequence itself should exhibit similar jumps and a similar lack of stabilisation as the ones seen in the various figures of the paper. Alternatively, a few repeated calls to the importance sampler should disclose the poor convergence properties of the sampler, as in the above graph. Where the blue line indicates the true value of the integral.
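
As a crude stand-in for the diagnosis discussed above, the sketch below estimates the tail index ξ of the importance weights with the Hill estimator on the k largest weights and compares it with ½. This is a swapped-in technique and not the likelihood ratio, Wald or score tests of the paper, and it comes with no formal test level. The target (standard normal), the normal proposals and the function names are assumptions made for the illustration.

```python
import math
import random

def importance_weights(n, proposal_sd, rng):
    """Weights phi(x) / q(x) for a N(0,1) target and a N(0, proposal_sd^2) proposal."""
    weights = []
    for _ in range(n):
        x = rng.gauss(0.0, proposal_sd)
        # log phi(x) - log q(x); the 2*pi constants cancel
        log_w = -0.5 * x * x + 0.5 * x * x / proposal_sd ** 2 + math.log(proposal_sd)
        weights.append(math.exp(log_w))
    return weights

def hill_estimator(weights, k):
    """Hill estimate of the tail index xi from the k largest weights."""
    top = sorted(weights, reverse=True)[: k + 1]
    threshold = top[k]  # (k+1)-th largest weight
    return sum(math.log(w / threshold) for w in top[:k]) / k
```

With a proposal that is too narrow (say a standard deviation of 0.3) the weights have infinite variance and the estimate should land above ½, while a wide proposal (standard deviation 2) gives bounded weights and an estimate near zero. The Hill estimator presumes ξ>0, so it only makes sense as a warning light for the heavy-tailed case.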