## an arithmetic mean identity

Posted in Books, pictures, R, Statistics, Travel, University life with tags , , , , , , , , , , , , on December 19, 2019 by xi'an

A 2017 paper by Ana Pajor published in Bayesian Analysis addresses my favourite problem [of computing the marginal likelihood] and which I discussed on the ‘Og, linking with another paper by Lenk published in 2012 in JCGS. That I already discussed here last year. Lenk’s (2009) paper is actually using a technique related to the harmonic mean correction based on HPD regions Darren Wraith and myself proposed at MaxEnt 2009. And which Jean-Michel and I presented at Frontiers of statistical decision making and Bayesian analysis in 2010. As I had only vague memories about the arithmetic mean version, we discussed the paper together with graduate students in Paris Dauphine.

The arithmetic mean solution, representing the marginal likelihood as the prior average of the likelihood, is a well-known approach used as well as the basis for nested sampling. With the improvement consisting in restricting the simulation to a set Ð with sufficiently high posterior probability. I am quite uneasy about P(Ð|y) estimated by 1 as the shape of the set containing all posterior simulations is completely arbitrary, parameterisation dependent, and very random since based on the extremes of this posterior sample. Plus, the set Ð converges to the entire parameter space with the number of posterior simulations. An alternative that we advocated in our earlier paper is to take Ð as the HPD region or a variational Bayes version . But the central issue with the HPD regions is how to construct these from an MCMC output and how to compute both P(Ð) and P(Ð|y). It does not seem like a good idea to set P(Ð|x) to the intended α level for the HPD coverage. Using a non-parametric version for estimating Ð could be in the end the only reasonable solution.

As a test, I reran the example of a conjugate normal model used in the paper, based on (exact) simulations from both the prior and  the posterior, and obtained approximations that were all close from the true marginal. With Chib’s being exact in that case (of course!), and an arithmetic mean surprisingly close without an importance correction:

> print(c(hame,chme,came,chib))
[1] -107.6821 -106.5968 -115.5950 -115.3610


Both harmonic versions are of the right order but not trustworthy, the truncation to such a set Ð as the one chosen in this paper having little impact.

## Bayesian webinar: Bayesian conjugate gradient

Posted in Statistics with tags , , , , , , on September 25, 2019 by xi'an

Bayesian Analysis is launching its webinar series on discussion papers! Meaning the first 90 registrants will be able to participate interactively via the Zoom Conference platform while additional registrants will be able to view the Webinar on a dedicated YouTube Channel. This fantastic initiative is starting with the Bayesian conjugate gradient method of Jon Cockayne (University of Warwick) et al., on October 2 at 4pm Greenwich time. (With available equivalences for other time zones!) I strongly support this initiative and wish it the widest possible success, as it could bring a new standard for conferences, having distant participants gathering in a nearby location to present talks and attend other talks from another part of the World, while effectively participating. An dense enough network could even see the non-stop conference emerging!

## latent nested nonparametric priors

Posted in Books, Statistics with tags , , , , , , , on September 23, 2019 by xi'an

A paper on an extended type of non-parametric priors by Camerlenghi et al. [all good friends!] is about to appear in Bayesian Analysis, with a discussion open for contributions (until October 15). While a fairly theoretical piece of work, it validates a Bayesian approach for non-parametric clustering of separate populations with, broadly speaking, common clusters. More formally, it constructs a new family of models that allows for a partial or complete equality between two probability measures, but does not force full identity when the associated samples do share some common observations. Indeed, the more traditional structures prohibit one or the other, from the Dirichlet process (DP) prohibiting two probability measure realisations from being equal or partly equal to some hierarchical DP (HDP) already allowing for common atoms across measure realisations, but prohibiting complete identity between two realised distributions, to nested DP offering one extra level of randomness, but with an infinity of DP realisations that prohibits common atomic support besides completely identical support (and hence distribution).

The current paper imagines two realisations of random measures written as a sum of a common random measure and of one of two separate almost independent random measures: (14) is the core formula of the paper that allows for partial or total equality. An extension to a setting larger than facing two samples seems complicated if only because of the number of common measures one has to introduce, from the totally common measure to measures that are only shared by a subset of the samples. Except in the simplified framework when a single and universally common measure is adopted (with enough justification). The randomness of the model is handled via different completely random measures that involved something like four degrees of hierarchy in the Bayesian model.

Since the example is somewhat central to the paper, the case of one or rather two two-component Normal mixtures with a common component (but with different mixture weights) is handled by the approach, although it seems that it was already covered by HDP. Having exactly the same term (i.e., with the very same weight) is not, but this may be less interesting in real life applications. Note that alternative & easily constructed & parametric constructs are already available in this specific case, involving a limited prior input and a lighter computational burden, although the  Gibbs sampler behind the model proves extremely simple on the paper. (One may wonder at the robustness of the sampler once the case of identical distributions is visited.)

Due to the combinatoric explosion associated with a higher number of observed samples, despite obvious practical situations,  one may wonder at any feasible (and possibly sequential) extension, that would further keep a coherence under marginalisation (in the number of samples). And also whether or not multiple testing could be coherently envisioned in this setting, for instance when handling all hospitals in the UK. Another consistency question covers the Bayes factor used to assess whether the two distributions behind the samples are or not identical. (One may wonder at the importance of the question, hopefully applied to more relevant dataset than the Iris data!)

## uncertainty in the ABC posterior

Posted in Statistics with tags , , , , , , on July 24, 2019 by xi'an

In the most recent Bayesian Analysis, Marko Järvenpää et al. (including my coauthor Aki Vehtari) consider an ABC setting where the number of available simulations of pseudo-samples  is limited. And where they want to quantify the amount of uncertainty resulting from the estimation of the ABC posterior density. Which is a version of the Monte Carlo error in practical ABC, in that this is the difference between the ABC posterior density for a given choice of summaries and a given choice of tolerance, and the actual approximation based on a finite number of simulations from the prior predictive. As in earlier works by Michael Gutmann and co-authors, the focus stands in designing a sequential strategy to decide where to sample the next parameter value towards minimising a certain expected loss. And in adopting a Gaussian process modelling for the discrepancy between observed data and simulated data, hence generalising the synthetic likelihood approach. This allows them to compute the expectation and the variance of the unnormalised ABC posterior, based on plugged-in estimators. From where the authors derive a loss as the expected variance of the acceptance probability (although it is not parameterisation invariant). I am unsure I see the point for this choice in that there is no clear reason for the resulting sequence of parameter choices to explore the support of the posterior distribution in a relatively exhaustive manner. The paper also mentions alternatives where the next parameter is chosen at the location where “the uncertainty of the unnormalised ABC posterior is highest”. Which sounds more pertinent to me. And further avoids integrating out the parameter. I also wonder if ABC mis-specification analysis could apply in this framework since the Gaussian process is most certainly a “wrong” model. (When concluding this post, I realised I had written a similar entry two years ago about the earlier version of the paper!)

## the most probable cluster

Posted in Books, Statistics with tags , , , , , , on July 11, 2019 by xi'an

In the last issue of Bayesian Analysis, Lukasz Rajkowski studies the most likely (MAP) cluster associated with the Dirichlet process mixture model. Reminding me that most Bayesian estimates of the number of clusters are not consistent (when the sample size grows to infinity). I am always puzzled by this problem, as estimating the number of clusters sounds like an ill-posed problem, since it is growing with the number of observations, by definition of the Dirichlet process. For instance, the current paper establishes that the number of clusters intersecting a given compact set remains bounded. (The setup is one of a Normal Dirichlet process mixture with constant and known covariance matrix.)

Since the posterior probability of a given partition of {1,2,…,n} can be (formally) computed, the MAP estimate can be (formally) derived. I inserted formally in the previous sentence as the derivation of the exact MAP is an NP hard problem in the number n of observations. As an aside, I have trouble with the author’s argument that the convex hulls of the clusters should be disjoin: I do not see why they should when the mixture components are overlapping. (More generally, I fail to relate to notions like “bad clusters” or “overestimation of the number of clusters” or a “sensible choice” of the covariance matrix.) More globally, I am somewhat perplexed by the purpose of the paper and the relevance of the MAP estimate, even putting aside my generic criticisms of the MAP approach. No uncertainty is attached to the estimator, which thus appears as a form of penalised likelihood strategy rather than a genuinely Bayesian (Analysis) solution.

The first example in the paper is using data from a Uniform over (-1,1), concluding at a “misleading” partition by the MAP since it produces more than one cluster. I find this statement flabbergasting as the generative model is not the estimated model. To wit, the case of an exponential Exp(1) sample that cannot reach a maximum of the target function with a finite number of sample. Which brings me back full-circle to my general unease about clustering in that much more seems to be assumed about this notion than what the statistical model delivers.

## Bayesian conjugate gradients [open for discussion]

Posted in Books, pictures, Statistics, University life with tags , , , , , on June 25, 2019 by xi'an

When fishing for an illustration for this post on Google, I came upon this Bayesian methods for hackers cover, a book about which I have no clue whatsoever (!) but that mentions probabilistic programming. Which serves as a perfect (?!) introduction to the call for discussion in Bayesian Analysis of the incoming Bayesian conjugate gradient method by Jon Cockayne, Chris Oates (formerly Warwick), Ilse Ipsen and Mark Girolami (still partially Warwick!). Since indeed the paper is about probabilistic numerics à la Mark and co-authors. Surprisingly dealing with solving the deterministic equation Ax=b by Bayesian methods. The method produces a posterior distribution on the solution x⁰, given a fixed computing effort, which makes it pertain to the anytime algorithms. It also relates to an earlier 2015 paper by Christian Hennig where the posterior is on A⁻¹ rather than x⁰ (which is quite a surprising if valid approach to the problem!) The computing effort is translated here in computations of projections of random projections of Ax, which can be made compatible with conjugate gradient steps. Interestingly, the choice of the prior on x is quite important, including setting a low or high convergence rate…  Deadline is August 04!

## from tramway to Panzer (or back!)…

Posted in Books, pictures, Statistics with tags , , , , , , on June 14, 2019 by xi'an

Although it is usually presented as the tramway problem, namely estimating the number of tram or bus lines in a city given observing one line number, including The Bayesian Choice by yours truly, the original version of the problem is about German tanks, Panzer V tanks to be precise, which total number M was to be estimated by the Allies from their observation of serial numbers of a number k of tanks. The Riddler is restating the problem when the only available information is made of the smallest, 22, and largest, 144, numbers, with no information about the number k itself. I am unsure what the Riddler means by “best” estimate, but a posterior distribution on M (and k) can be certainly be constructed for a prior like 1/k x 1/M² on (k,M). (Using M² to make sure the posterior mean does exist.) The joint distribution of the order statistics is

$\frac{k!}{(k-2)!} M^{-k} (144-22)^{k-2}\, \Bbb I_{2\le k\le M\ge 144}$

which makes the computation of the posterior distribution rather straightforward. Here is the posterior surface (with an unfortunate rendering of an artefactual horizontal line at 237!), showing a concentration near the lower bound M=144. The posterior mode is actually achieved for M=144 and k=7, while the posterior means are (rounded as) M=169 and k=9.