## a Bayesian criterion for singular models [discussion]

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , , , , , on October 10, 2016 by xi'an

[Here is the discussion Judith Rousseau and I wrote about the paper by Mathias Drton and Martyn Plummer, a Bayesian criterion for singular models, which was discussed last week at the Royal Statistical Society. There is still time to send a written discussion! Note: This post was written using the latex2wp converter.]

It is a well-known fact that the BIC approximation of the marginal likelihood in a given irregular model ${\mathcal M_k}$ fails or may fail. The BIC approximation has the form

$\displaystyle BIC_k = \log p(\mathbf Y_n| \hat \pi_k, \mathcal M_k) - d_k \log n /2$

where ${d_k }$ corresponds on the number of parameters to be estimated in model ${\mathcal M_k}$. In irregular models the dimension ${d_k}$ typically does not provide a good measure of complexity for model ${\mathcal M_k}$, at least in the sense that it does not lead to an approximation of

$\displaystyle \log m(\mathbf Y_n |\mathcal M_k) = \log \left( \int_{\mathcal M_k} p(\mathbf Y_n| \pi_k, \mathcal M_k) dP(\pi_k|k )\right) \,.$

A way to understand the behaviour of ${\log m(\mathbf Y_n |\mathcal M_k) }$ is through the effective dimension

$\displaystyle \tilde d_k = -\lim_n \frac{ \log P( \{ KL(p(\mathbf Y_n| \pi_0, \mathcal M_k) , p(\mathbf Y_n| \pi_k, \mathcal M_k) ) \leq 1/n | k ) }{ \log n}$

when it exists, see for instance the discussions in Chambaz and Rousseau (2008) and Rousseau (2007). Watanabe (2009} provided a more precise formula, which is the starting point of the approach of Drton and Plummer:

$\displaystyle \log m(\mathbf Y_n |\mathcal M_k) = \log p(\mathbf Y_n| \hat \pi_k, \mathcal M_k) - \lambda_k(\pi_0) \log n + [m_k(\pi_0) - 1] \log \log n + O_p(1)$

where ${\pi_0}$ is the true parameter. The authors propose a clever algorithm to approximate of the marginal likelihood. Given the popularity of the BIC criterion for model choice, obtaining a relevant penalized likelihood when the models are singular is an important issue and we congratulate the authors for it. Indeed a major advantage of the BIC formula is that it is an off-the-shelf crierion which is implemented in many softwares, thus can be used easily by non statisticians. In the context of singular models, a more refined approach needs to be considered and although the algorithm proposed by the authors remains quite simple, it requires that the functions ${ \lambda_k(\pi)}$ and ${m_k(\pi)}$ need be known in advance, which so far limitates the number of problems that can be thus processed. In this regard their equation (3.2) is both puzzling and attractive. Attractive because it invokes nonparametric principles to estimate the underlying distribution; puzzling because why should we engage into deriving an approximation like (3.1) and call for Bayesian principles when (3.1) is at best an approximation. In this case why not just use a true marginal likelihood?

1. Why do we want to use a BIC type formula?

The BIC formula can be viewed from a purely frequentist perspective, as an example of penalised likelihood. The difficulty then stands into choosing the penalty and a common view on these approaches is to choose the smallest possible penalty that still leads to consistency of the model choice procedure, since it then enjoys better separation rates. In this case a ${\log \log n}$ penalty is sufficient, as proved in Gassiat et al. (2013). Now whether or not this is a desirable property is entirely debatable, and one might advocate that for a given sample size, if the data fits the smallest model (almost) equally well, then this model should be chosen. But unless one is specifying what equally well means, it does not add much to the debate. This also explains the popularity of the BIC formula (in regular models), since it approximates the marginal likelihood and thus benefits from the Bayesian justification of the measure of fit of a model for a given data set, often qualified of being a Bayesian Ockham’s razor. But then why should we not compute instead the marginal likelihood? Typical answers to this question that are in favour of BIC-type formula include: (1) BIC is supposingly easier to compute and (2) BIC does not call for a specification of the prior on the parameters within each model. Given that the latter is a difficult task and that the prior can be highly influential in non-regular models, this may sound like a good argument. However, it is only apparently so, since the only justification of BIC is purely asymptotic, namely, in such a regime the difficulties linked to the choice of the prior disappear. This is even more the case for the sBIC criterion, since it is only valid if the parameter space is compact. Then the impact of the prior becomes less of an issue as non informative priors can typically be used. With all due respect, the solution proposed by the authors, namely to use the posterior mean or the posterior mode to allow for non compact parameter spaces, does not seem to make sense in this regard since they depend on the prior. The same comments apply to the author’s discussion on Prior’s matter for sBIC. Indeed variations of the sBIC could be obtained by penalizing for bigger models via the prior on the weights, for instance as in Mengersen and Rousseau (2011) or by, considering repulsive priors as in Petralia et al. (20120, but then it becomes more meaningful to (again) directly compute the marginal likelihood. Remains (as an argument in its favour) the relative computational ease of use of sBIC, when compared with the marginal likelihood. This simplification is however achieved at the expense of requiring a deeper knowledge on the behaviour of the models and it therefore looses the off-the-shelf appeal of the BIC formula and the range of applications of the method, at least so far. Although the dependence of the approximation of ${\log m(\mathbf Y_n |\mathcal M_k)}$ on ${\mathcal M_j }$, \$latex {j \leq k} is strange, this does not seem crucial, since marginal likelihoods in themselves bring little information and they are only meaningful when compared to other marginal likelihoods. It becomes much more of an issue in the context of a large number of models.

2. Should we care so much about penalized or marginal likelihoods ?

Marginal or penalized likelihoods are exploratory tools in a statistical analysis, as one is trying to define a reasonable model to fit the data. An unpleasant feature of these tools is that they provide numbers which in themselves do not have much meaning and can only be used in comparison with others and without any notion of uncertainty attached to them. A somewhat richer approach of exploratory analysis is to interrogate the posterior distributions by either varying the priors or by varying the loss functions. The former has been proposed in van Havre et l. (2016) in mixture models using the prior tempering algorithm. The latter has been used for instance by Yau and Holmes (2013) for segmentation based on Hidden Markov models. Introducing a decision-analytic perspective in the construction of information criteria sounds to us like a reasonable requirement, especially when accounting for the current surge in studies of such aspects.

[Posted as arXiv:1610.02503]

## [Royal] Series B’log

Posted in Books, Statistics, University life, Wines with tags , , , , , , on September 12, 2016 by xi'an

[Thanks to Ingmar for suggesting the additional Royal!]

Last week, I got an email from Piotr Fryzlewicz on behalf of the Publication Committee of the Royal Statistical Society enquiring about my interest in becoming a blog associate editor for Series B! Although it does not come exactly as a surprise, as I had previously heard about this interest in creating a dedicated blog, this is great news as I think a lively blog can only enhance the visibility and impact of papers published in Series B and hence increase the influence of Series B. Being quite excited by this on-line and interactive extension to the journal, I have accepted the proposal and we are now working on designing the new blog (Series B’log!) to get it on track as quickly as possible.

Suggestions towards this experiment are most welcome! I am thinking of involving authors to write blog summaries of their paper, AEs and reviewers to voice their expert opinions about the paper, anonymously or not, and of course anyone interested in commenting the paper. The idea is to turn (almost) all papers into on-line Read Papers, with hopefully the backup of authors through their interactions with the commentators. I certainly do not intend to launch discussions on each and every paper, betting on the AEs or referees to share their impressions. And if a paper ends up being un-discussed, this may prove enough of an incentive for some. (Someone asked me if we intended to discuss rejected papers as well. This is an interesting concept, but not to be considered at the moment!)

## a mistake in a 1990 paper

Posted in Kids, Statistics, University life with tags , , , , , , , , on August 7, 2016 by xi'an

As we were working on the Handbook of mixture analysis with Sylvia Früwirth-Schnatter and Gilles Celeux today, near Saint-Germain des Près, I realised that there was a mistake in our 1990 mixture paper with Jean Diebolt [published in 1994], in that when we are proposing to use improper “Jeffreys” priors under the restriction that no component of the Gaussian mixture is “empty”, meaning that there are at least two observations generated from each component, the likelihood needs to be renormalised to be a density for the sample. This normalisation constant only depends on the weights of the mixture, which means that, when simulating from the full conditional distribution of the weights, there should be an extra-acceptance step to account for this correction. Of course, the term is essentially equal to one for a large enough sample but this remains a mistake nonetheless! It is funny that it remained undetected for so long in my most cited paper. Checking on Larry’s 1999 paper exploring the idea of excluding terms from the likelihood to allow for improper priors, I did not spot him using a correction either.

## control functionals for Monte Carlo integration

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , , on June 28, 2016 by xi'an

A paper on control variates by Chris Oates, Mark Girolami (Warwick) and Nicolas Chopin (CREST) appeared in a recent issue of Series B. I had read and discussed the paper with them previously and the following is a set of comments I wrote at some stage, to be taken with enough gains of salt since Chris, Mark and Nicolas answered them either orally or in the paper. Note also that I already discussed an earlier version, with comments that are not necessarily coherent with the following ones! [Thanks to the busy softshop this week, I resorted to publish some older drafts, so mileage can vary in the coming days.]

First, it took me quite a while to get over the paper, mostly because I have never worked with reproducible kernel Hilbert spaces (RKHS) before. I looked at some proofs in the appendix and at the whole paper but could not spot anything amiss. It is obviously a major step to uncover a manageable method with a rate that is lower than √n. When I set my PhD student Anne Philippe on the approach via Riemann sums, we were quickly hindered by the dimension issue and could not find a way out. In the first versions of the nested sampling approach, John Skilling had also thought he could get higher convergence rates before realising the Monte Carlo error had not disappeared and hence was keeping the rate at the same √n speed.

The core proof in the paper leading to the 7/12 convergence rate relies on a mathematical result of Sun and Wu (2009) that a certain rate of regularisation of the function of interest leads to an average variance of order 1/6. I have no reason to mistrust the result (and anyway did not check the original paper), but I am still puzzled by the fact that it almost immediately leads to the control variate estimator having a smaller order variance (or at least variability). On average or in probability. (I am also uncertain on the possibility to interpret the boxplot figures as establishing super-√n speed.)

Another thing I cannot truly grasp is how the control functional estimator of (7) can be both a mere linear recombination of individual unbiased estimators of the target expectation and an improvement in the variance rate. I acknowledge that the coefficients of the matrices are functions of the sample simulated from the target density but still…

Another source of inner puzzlement is the choice of the kernel in the paper, which seems too simple to be able to cover all problems despite being used in every illustration there. I see the kernel as centred at zero, which means a central location must be know, decreasing to zero away from this centre, so possibly missing aspects of the integrand that are too far away, and isotonic in the reference norm, which also seems to preclude some settings where the integrand is not that compatible with the geometry.

I am equally nonplussed by the existence of a deterministic bound on the error, although it is not completely deterministic, depending on the values of the reproducible kernel at the points of the sample. Does it imply anything restrictive on the function to be integrated?

A side remark about the use of intractable in the paper is that, given the development of a whole new branch of computational statistics handling likelihoods that cannot be computed at all, intractable should possibly be reserved for such higher complexity models.

## Conditional love [guest post]

Posted in Books, Kids, Statistics, University life with tags , , , , , , , , , , , , , , , , , , , , on August 4, 2015 by xi'an

[When Dan Simpson told me he was reading Terenin’s and Draper’s latest arXival in a nice Bath pub—and not a nice bath tub!—, I asked him for a blog entry and he agreed. Here is his piece, read at your own risk! If you remember to skip the part about Céline Dion, you should enjoy it very much!!!]

Probability has traditionally been described, as per Kolmogorov and his ardent follower Katy Perry, unconditionally. This is, of course, excellent for those of us who really like measure theory, as the maths is identical. Unfortunately mathematical convenience is not necessarily enough and a large part of the applied statistical community is working with Bayesian methods. These are unavoidably conditional and, as such, it is natural to ask if there is a fundamentally conditional basis for probability.

Bruno de Finetti—and later Richard Cox and Edwin Jaynes—considered conditional bases for Bayesian probability that are, unfortunately, incomplete. The critical problem is that they mainly consider finite state spaces and construct finitely additive systems of conditional probability. For a variety of reasons, neither of these restrictions hold much truck in the modern world of statistics.

In a recently arXiv’d paper, Alexander Terenin and David Draper devise a set of axioms that make the Cox-Jaynes system of conditional probability rigorous. Furthermore, they show that the complete set of Kolmogorov axioms (including countable additivity) can be derived as theorems from their axioms by conditioning on the entire sample space.

This is a deep and fundamental paper, which unfortunately means that I most probably do not grasp it’s complexities (especially as, for some reason, I keep reading it in pubs!). However I’m going to have a shot at having some thoughts on it, because I feel like it’s the sort of paper one should have thoughts on. Continue reading

## discussions on Gerber and Chopin

Posted in Books, Kids, Statistics, University life with tags , , , , , , , , , , , , , , , on May 29, 2015 by xi'an

As a coincidence, I received my copy of JRSS Series B with the Read Paper by Mathieu Gerber and Nicolas Chopin on sequential quasi Monte Carlo just as I was preparing an arXival of a few discussions on the paper! Among the [numerous and diverse] discussions, a few were of particular interest to me [I highlighted members of the University of Warwick and of Université Paris-Dauphine to suggest potential biases!]:

1. Mike Pitt (Warwick), Murray Pollock et al.  (Warwick) and Finke et al. (Warwick) all suggested combining quasi Monte Carlo with pseudomarginal Metropolis-Hastings, pMCMC (Pitt) and Rao-Bklackwellisation (Finke et al.);
2. Arnaud Doucet pointed out that John Skilling had used the Hilbert (ordering) curve in a 2004 paper;
3. Chris Oates, Dan Simpson and Mark Girolami (Warwick) suggested combining quasi Monte Carlo with their functional control variate idea;
4. Richard Everitt wondered about the dimension barrier of d=6 and about possible slice extensions;
5. Zhijian He and Art Owen pointed out simple solutions to handle a random number of uniforms (for simulating each step in sequential Monte Carlo), namely to start with quasi Monte Carlo and end up with regular Monte Carlo, in an hybrid manner;
6. Hans Künsch points out the connection with systematic resampling à la Carpenter, Clifford and Fearnhead (1999) and wonders about separating the impact of quasi Monte Carlo between resampling and propagating [which vaguely links to one of my comments];
7. Pierre L’Ecuyer points out a possible improvement over the Hilbert curve by a preliminary sorting;
8. Frederik Lindsten and Sumeet Singh propose using ABC to extend the backward smoother to intractable cases [but still with a fixed number of uniforms to use at each step], as well as Mateu and Ryder (Paris-Dauphine) for a more general class of intractable models;
9. Omiros Papaspiliopoulos wonders at the possibility of a quasi Markov chain with “low discrepancy paths”;
10. Daniel Rudolf suggest linking the error rate of sequential quasi Monte Carlo with the bounds of Vapnik and Ĉervonenkis (1977).

The arXiv document also includes the discussions by Julyan Arbel and Igor Prünster (Turino) on the Bayesian nonparametric side of sqMC and by Robin Ryder (Dauphine) on the potential of sqMC for ABC.

## Relevant statistics for Bayesian model choice [hot off the press!]

Posted in Books, Statistics, University life with tags , , , , , , on October 30, 2014 by xi'an

Our paper about evaluating statistics used for ABC model choice has just appeared in Series B! It somewhat paradoxical that it comes out just a few days after we submitted our paper on using random forests for Bayesian model choice, thus bypassing the need for selecting those summary statistics by incorporating all statistics available and letting the trees automatically rank those statistics in term of their discriminating power. Nonetheless, this paper remains an exciting piece of work (!) as it addresses the more general and pressing question of the validity of running a Bayesian analysis with only part of the information contained in the data. Quite usefull in my (biased) opinion when considering the emergence of approximate inference already discussed on this ‘Og…

[As a trivial aside, I had first used fresh from the press(es) as the bracketted comment, before I realised the meaning was not necessarily the same in English and in French.]