## Archive for Bayesian inference

## postdoc on missing data at École Polytechnique

Posted in Kids, pictures, R, Statistics, Travel, University life with tags École Polytechnique, Bayesian inference, CMAP, France, generalized SVD, latent variable models, matrix completion, missing data, Palaiseau, Paris, postdoctoral position, R on November 18, 2016 by xi'an

**J**ulie Josse contacted me to advertise a postdoc position at École Polytechnique, in Palaiseau, south of Paris. *“The fellowship focuses on missing data. Interested graduates should apply as early as possible since the position will be filled as soon as a suitable candidate is found. The Centre for Applied Mathematics (CMAP) is looking for highly motivated individuals able to develop a general multiple imputation method for multivariate continuous and categorical variables and to implement it in the free R software. The successful candidate will be part of a research group in the statistical team working on missing values. The postdoc will also have excellent opportunities to collaborate with researchers in public health on the analysis of a large register from the Paris Hospital (APHP), modelling the decisions and events that occur when severe trauma patients are handled by emergency doctors. Candidates should contact Julie Josse at polytechnique.edu.”*

## Nature snapshot [Volume 539 Number 7627]

Posted in Books, Statistics, University life with tags Africa, Bayesian inference, Capuchin monkeys, data analysis, HIV, linguistics, Monte Carlo Statistical Methods, Patient 0, phylogenetic models, psychiatry, xkcd on November 15, 2016 by xi'an

**A** number of entries of interest [to me] in that Nature issue: from the Capuchin monkeys that break stones in a way resembling early hominin biface tools, to the persistent association between some sounds and some meanings across numerous languages, to the use of infected mosquitoes in South America to fight Zika, to the call for more maths in psychiatry by the NIMH director (where, since prevision is mentioned, I presume stats is included), to the potentially earthshaking green power revolution in Africa, to the reconstruction of the first HIV strains in North America, along with the deconstruction of the “Patient 0” myth, helped by Bayesian phylogenetic analyses, to a cover of the Open Syllabus Project, with Monte Carlo Statistical Methods arriving first [in the Monte Carlo list]…

“Observations should not converge on one model but aim to find anomalies that carry clues about the nature of dark matter, dark energy or initial conditions of the Universe. Further observations should be motivated by testing unconventional interpretations of those anomalies (such as exotic forms of dark matter or modified theories of gravity). Vast data sets may contain evidence for unusual behaviour that was unanticipated when the projects were conceived.” Avi Loeb

One editorial particularly drew my attention, Good data are not enough, by the astronomer Avi Loeb. As illustrated by the quote above, Loeb objects to data being interpreted, and even to data being collected, towards the assessment of the standard model. While I agree that this model contains a lot of fudge factors like dark matter and dark energy, which apparently constitute most of the available matter, the discussion is quite curious, in that interpreting data according to alternative theories sounds impossible and certainly beyond the reach of most PhD students [as Loeb criticises the analysis of some data in a recent thesis he evaluated].

“modern cosmology is augmented by unsubstantiated, mathematically sophisticated ideas — of the multiverse, anthropic reasoning and string theory.”

The author argues for always allowing alternative interpretations of the data, which sounds fine at a primary level but again calls for the conception of such alternative models. When discrepancies are found between the standard model and the data, they can be due to errors in the measurement itself, in the measurement model, or in the theoretical model. However, they may be impossible to analyse outside the model, in the *neutral* way called for and wished for by Loeb. Designing *neutral* experiments sounds even less meaningful. Which is why I am fairly taken aback by the call for “a research frontier [that] should maintain at least two ways of interpreting data so that new experiments will aim to select the correct one”! Why two and not more?! And which ones?! I am not aware of fully developed alternative theories and cannot see how experiments designed under one model could produce indications about a new and incomplete model.

“Such simple, off-the-shelf remedies could help us to avoid the scientific fate of the otherwise admirable Mayan civilization.”

Hence I am bemused by the whole exercise, whose deepest arguments seem to be a paper written by the author last year and an interdisciplinary centre on black holes recently launched by the same author.

## copy code at your own peril

Posted in Books, Kids, R, Statistics, University life with tags Bayesian inference, cross validated, exported code, normal model, R, xkcd on November 14, 2016 by xi'an

**I** have come several times upon cases of scientists [I mean real, recognised, publishing, senior scientists!] from other fields blindly copying MCMC code from a paper or website and expecting the program to operate on their own problem… One illustration is from last week, when I read an X Validated question [from 2013] about an attempt of that kind, on a rather standard Normal posterior, but using an R code where the posterior function was not even defined. (I foolishly replied, despite having no expectation of a reply from the person asking the question.)
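As a cautionary contrast, here is a minimal, self-contained Metropolis sampler for a standard Normal-mean posterior in which the target log-posterior is actually defined before being used. This is my own sketch (in Python, with invented toy data and tuning constants), not the code from the X Validated question:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 50 observations from N(theta, 1); with a flat prior
# on theta the posterior is N(xbar, 1/n).
x = rng.normal(loc=2.0, scale=1.0, size=50)

def log_posterior(theta):
    """Log-posterior of theta up to an additive constant: the very
    function that was missing from the copied code."""
    return -0.5 * np.sum((x - theta) ** 2)

def metropolis(n_iter=10_000, scale=0.5):
    """Random-walk Metropolis targeting log_posterior."""
    theta = 0.0
    chain = np.empty(n_iter)
    for t in range(n_iter):
        prop = theta + scale * rng.normal()
        # accept with probability min(1, posterior ratio)
        if np.log(rng.uniform()) < log_posterior(prop) - log_posterior(theta):
            theta = prop
        chain[t] = theta
    return chain

chain = metropolis()
# after burn-in, the chain mean should sit close to the sample mean xbar
print(chain[5000:].mean(), x.mean())
```

The point of the exercise is that the sampler is useless, indeed cannot even run, without a problem-specific `log_posterior`: this is exactly the part that cannot be copied blindly from someone else's example.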

## Lindley’s paradox as a loss of resolution

Posted in Books, pictures, Statistics with tags AIC, Bayesian inference, Bayesian tests of hypotheses, BIC, Cross Validation, Inferential Models, Jeffreys-Lindley paradox, p-value, pseudo-Bayes factors on November 9, 2016 by xi'an

“The principle of indifference states that in the absence of prior information, all mutually exclusive models should be assigned equal prior probability.”

**C**olin LaMont and Paul Wiggins arXived a paper on Lindley’s paradox a few days ago. The above quote is the (standard) argument for picking a (½,½) partition between the two hypotheses, which I object to, if only because it does not stand for multiple embedded models. The main point in the paper is to argue about the loss of resolution induced by averaging against the prior, as illustrated by the picture above for the N(0,1) versus N(μ,1) toy problem. What they call resolution is the lowest possible mean estimate for which the null is rejected by the Bayes factor (assuming rejection for Bayes factors larger than 1). While the details are missing, I presume the different curves on the lower panel correspond to different choices of L when using U(-L,L) priors on μ… The “Bayesian rejoinder” to the Lindley-Bartlett paradox (p.4) is in tune with my interpretation, namely that as the prior mass under the alternative gets more and more spread out, there is less and less prior support for reasonable values of the parameter, hence a growing tendency to accept the null. This is an illustration of the long-lasting impact of the prior on the posterior probability of the model, because the data cannot impact the tails very much.
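The resolution effect is easy to reproduce numerically. The sketch below is my own construction of the toy problem, assuming a U(-L,L) prior on μ under the alternative and rejection of the null when the Bayes factor favours the alternative; it searches for the smallest sample mean leading to rejection, which indeed grows with L:

```python
import math

def log_bf01(xbar, n, L):
    """Log Bayes factor of H0: mu = 0 against H1: mu ~ U(-L, L),
    for a size-n N(mu, 1) sample summarised by its mean xbar."""
    # marginal under H0: N(0, 1/n) density at xbar
    log_m0 = -0.5 * math.log(2 * math.pi / n) - 0.5 * n * xbar ** 2
    # marginal under H1: (1/2L) [Phi(sqrt(n)(L - xbar)) - Phi(sqrt(n)(-L - xbar))]
    Phi = lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2)))
    m1 = (Phi(math.sqrt(n) * (L - xbar)) - Phi(math.sqrt(n) * (-L - xbar))) / (2 * L)
    return log_m0 - math.log(m1)

def resolution(n, L, step=1e-3):
    """Smallest positive xbar for which H0 is rejected (BF01 < 1),
    found by a crude grid search."""
    for k in range(1, 5000):
        xbar = k * step
        if log_bf01(xbar, n, L) < 0:
            return xbar
    return float("inf")

# The resolution degrades (larger means are needed to reject the null)
# as the prior under the alternative spreads out:
print(resolution(100, 1), resolution(100, 10), resolution(100, 100))
```

This is the Lindley-Bartlett phenomenon in miniature: widening the U(-L,L) prior dilutes the support for reasonable values of μ and pushes the Bayes factor towards the null.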

“If the true prior is known, Bayesian inference using the true prior is optimal.”

This sentence and the arguments following it are meaningless in my opinion, as knowing the “true” prior makes the Bayesian debate superfluous. If there was a unique, Nature-provided, known prior π, it would lose its original meaning and become part of the (frequentist) model. The argument is actually mostly used in the negative, namely that since the prior is not known we should not follow a Bayesian approach: this is, e.g., the main criticism in Inferential Models. But there is no such thing as a “true” prior! (Or a “true” model, all things considered!) In the current paper, this pseudo-natural approach to priors is utilised to justify a return to the pseudo-Bayes factors of the 1990’s, when one part of the data is used to stabilise and proper-ise the (improper) prior, and a second part to run the test *per se*. This includes an interesting insight on the limiting cases of partitioning, corresponding to AIC and BIC respectively, that I had not seen before. With the surprising conclusion that “AIC is the derivative of BIC”!
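A toy version of such a data-splitting pseudo-Bayes factor, for the Normal-mean problem, can be written in a few lines. This is my own minimal construction (not the paper's exact procedure): the first m observations turn the flat prior on μ into a proper N(x̄_train, 1/m), and the remaining observations are used for the test itself.

```python
import math
import random

def log_norm_pdf(x, mean, var):
    """Log density of N(mean, var) at x."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def log_partial_bf10(x, m):
    """Log partial Bayes factor of H1: mu free vs H0: mu = 0, for N(mu, 1)
    data: train part propers-up the flat prior, test part runs the test.
    Positive values favour H1."""
    train, test = x[:m], x[m:]
    k = len(test)
    xbar_train = sum(train) / m
    xbar_test = sum(test) / k
    # sufficient statistic xbar_test ~ N(mu, 1/k); under H1, integrating mu
    # against the trained prior N(xbar_train, 1/m) gives another Normal
    log_m1 = log_norm_pdf(xbar_test, xbar_train, 1 / m + 1 / k)
    log_m0 = log_norm_pdf(xbar_test, 0.0, 1 / k)
    return log_m1 - log_m0

random.seed(1)
x = [random.gauss(1.0, 1.0) for _ in range(100)]  # hypothetical data, true mu = 1
print(log_partial_bf10(x, 10))  # should strongly favour H1
```

Moving the split point m between its two extremes is what produces the AIC-like and BIC-like limiting behaviours discussed in the paper.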

## SAS on Bayes

Posted in Books, Kids, pictures, R, Statistics, University life with tags asymptotics, Bayesian inference, credible intervals, cross validated, epistemic probability, plug-in resolution, SAS, Stack Exchange on November 8, 2016 by xi'an

**F**ollowing a question on X Validated, I became aware of the following descriptions of the pros and cons of Bayesian analysis, as perceived by whoever (Tim Arnold?) wrote SAS/STAT(R) 9.2 User’s Guide, Second Edition. I replied more specifically on the point

It [Bayesian inference] provides inferences that are conditional on the data and are exact, without reliance on asymptotic approximation. Small sample inference proceeds in the same manner as if one had a large sample. Bayesian analysis also can estimate any functions of parameters directly, without using the “plug-in” method (a way to estimate functionals by plugging the estimated parameters in the functionals).

which I find utterly confusing and not particularly relevant. The other points in the list are more traditional, except for this one

It provides interpretable answers, such as “the true parameter θ has a probability of 0.95 of falling in a 95% credible interval.”

that I find somewhat unappealing in that the 95% probability only has relevance wrt the resulting posterior, hence has no absolute (and definitely no frequentist) meaning. The criticisms of the prior selection

It does not tell you how to select a prior. There is no correct way to choose a prior. Bayesian inferences require skills to translate subjective prior beliefs into a mathematically formulated prior. If you do not proceed with caution, you can generate misleading results.

It can produce posterior distributions that are heavily influenced by the priors. From a practical point of view, it might sometimes be difficult to convince subject matter experts who do not agree with the validity of the chosen prior.

are traditional but nonetheless irksome. Once it is acknowledged that there is no correct or true prior, it follows naturally that the resulting inference depends on the choice of the prior and has to be understood conditional on that prior, which is why the credible interval has, for instance, an epistemic rather than a frequentist interpretation. There is also little reason to try to convince a fellow Bayesian statistician of one's prior. Everything is conditional on the chosen prior and I see less and less why this should be an issue.
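That the credible interval, and hence its 0.95 epistemic probability, is conditional on the chosen prior is easy to demonstrate with a conjugate Normal model. The data and the two priors below are invented for the illustration:

```python
import math

def credible_interval(x, prior_mean, prior_var, obs_var=1.0):
    """95% credible interval for a Normal mean under a conjugate
    N(prior_mean, prior_var) prior, with known observation variance."""
    n = len(x)
    xbar = sum(x) / n
    # standard conjugate update for the Normal mean
    post_var = 1 / (1 / prior_var + n / obs_var)
    post_mean = post_var * (prior_mean / prior_var + n * xbar / obs_var)
    z = 1.959963984540054  # 97.5% standard Normal quantile
    half = z * math.sqrt(post_var)
    return (post_mean - half, post_mean + half)

x = [1.2, 0.8, 1.5, 0.9, 1.1]
# Two analysts with different priors report different 95% intervals,
# each carrying probability 0.95 under its own posterior only:
print(credible_interval(x, prior_mean=0.0, prior_var=100.0))  # vague prior
print(credible_interval(x, prior_mean=0.0, prior_var=0.1))    # tight prior at 0
```

Neither interval is "wrong": each has posterior probability 0.95 conditional on its own prior, which is precisely the epistemic reading defended above.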

## Monte Carlo with determinantal processes [reply from the authors]

Posted in Books, Statistics with tags Bayesian inference, central limit theorem, determinantal point processes, Gaussian processes, Gaussian quadrature, hypercube, Lebesgue measure, Monte Carlo integration, Monte Carlo Statistical Methods, orthogonal polynomials, probabilistic numerics, quasi-Monte Carlo methods, super-efficiency, Terence Tao, zero variance importance sampling on September 22, 2016 by xi'an

*[Rémi Bardenet and Adrien Hardy have written a reply to my comments of today on their paper, which is more readable as a post than as comments, so here it is. I appreciate the intention, as well as the perfect editing of the reply, suited for a direct posting!]*

**T**hanks for your comments, Xian. As a foreword, a few people we met also had the intuition that DPPs would be relevant for Monte Carlo, but no result so far was backing this claim. As it turns out, we had to work hard to prove a CLT for importance-reweighted DPPs, using some deep recent results on orthogonal polynomials. We are currently working on turning this probabilistic result into practical algorithms. For instance, efficient sampling of DPPs is indeed an important open question, to which most of your comments refer. Although this question is out of the scope of our paper, note however that our results do not depend on how you sample. Efficient sampling of DPPs, along with other natural computational questions, is actually the crux of an ANR grant we just got, so hopefully in a few years we can write a more detailed answer on this blog! We now answer some of your other points.

*“one has to examine the conditions for the result to operate, from the support being within the unit hypercube,”*

Any compactly supported measure would do, using dilations, for instance. Note that we don’t assume the support is the whole hypercube.

*“to the existence of N orthogonal polynomials wrt the dominating measure, not discussed here”*

As explained in Section 2.1.2, it is enough that the reference measure charges some open set of the hypercube, which is for instance the case if it has a density with respect to the Lebesgue measure.

*“to the lack of relation between the point process and the integrand,”*

Actually, our method depends heavily on the target measure μ. Unlike vanilla QMC, the repulsiveness between the quadrature nodes is tailored to the integration problem.

*“changing N requires a new simulation of the entire vector unless I missed the point.”*

You’re absolutely right. This is a well-known open issue in probability, see the discussion on Terence Tao’s blog.

*“This requires figuring out the upper bounds on the acceptance ratios, a “problem-dependent” request that may prove impossible to implement”*

We agree that in general this isn’t trivial. However, good bounds are available for all Jacobi polynomials, see Section 3.

*“Even without this stumbling block, generating the N-sized sample for dimension d=N (why d=N, I wonder?)”*

This is a misunderstanding: we do not say that d=N in any sense. We only say that sampling from a DPP using the algorithm of [Hough et al] requires the same number of operations as orthonormalizing N vectors of dimension N, hence the cubic cost.

*1. “how does it relate to quasi-Monte Carlo?”*

So far, the connection to QMC is only intuitive: both rely on well-spaced nodes, but using different mathematical tools.

*2. “the marginals of the N-th order determinantal process are far from uniform (see Fig. 1), and seemingly concentrated on the boundaries”*

This phenomenon is due to orthogonal polynomials. We are investigating more general constructions that give more flexibility.

*3. “Is the variance of the resulting estimator (2.11) always finite?”*

Yes. For instance, this follows from the inequality below (5.56) since ƒ(x)/K(x,x) is Lipschitz.

*4. and 5.* We are investigating concentration inequalities to answer these points.

*6. “probabilistic numerics produce an epistemic assessment of uncertainty, contrary to the current proposal.”*

A partial answer may be our Remark 2.12. You can interpret DPPs as putting a Gaussian process prior over ƒ and sequentially sampling from the posterior variance of the GP.

## local kernel reduction for ABC

Posted in Books, pictures, Statistics, University life with tags ABC, Approximate Bayesian computation, Bayesian inference, kernel density estimator, reproducing kernel Hilbert space, summary statistics on September 14, 2016 by xi'an

“…construction of low dimensional summary statistics can be performed as in a black box…”

**T**oday Zhou and Fukuzumi arXived a paper that proposes a gradient-based dimension reduction for ABC summary statistics, in the spirit of the RKHS kernels advocated, e.g., by Arthur Gretton. Here the projection is a mere *linear* projection Bs of the vector of summary statistics, s, where B is an estimated Hessian matrix associated with the posterior expectation E[θ|s]. (There is some connection with the latest version of Li’s and Fearnhead’s paper on ABC convergence, as they also define a *linear* projection of the summary statistics, based on asymptotic arguments, although their matrix does depend on the true value of the parameter.) The linearity sounds like a strong restriction [to me], especially when the summary statistics have no reason to belong to a vector space and thus be open to changes of bases and linear projections. For instance, a specific value taken by a summary statistic, like 0 say, may be more relevant than the range of its values. On a larger scale, I am doubtful about always projecting a vector of summary statistics onto a subspace with the smallest possible dimension, i.e., the dimension of θ. In practical settings, it seems impossible to derive the optimal projection, and a subvector is almost certain to lose information against a larger vector.
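For concreteness, here is a crude sketch of ABC with a linear projection of the summary statistics, where the projection is estimated by least-squares regression of θ on s rather than by the Hessian-based construction of the paper. All model choices (the Normal toy model, the redundant summaries, the tolerance) are mine:

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate_summaries(theta, n=50):
    """Hypothetical model: N(theta, 1) data, summarised by a redundant
    vector (mean, median, first observation)."""
    x = rng.normal(theta, 1.0, size=n)
    return np.array([x.mean(), np.median(x), x[0]])

# reference table: draw theta from a U(-5, 5) prior, simulate summaries
prior_draws = rng.uniform(-5, 5, size=5000)
S = np.array([simulate_summaries(t) for t in prior_draws])

# linear projection Bs of the summaries: B estimated by regressing theta
# on s, a common stand-in for projections towards E[theta | s]
X = np.c_[np.ones(len(S)), S]
B, *_ = np.linalg.lstsq(X, prior_draws, rcond=None)
proj = X @ B  # one-dimensional projected summary, matching dim(theta)

# ABC rejection using the projected statistic and a 1% tolerance
s_obs = simulate_summaries(1.5)  # pseudo-observed data, true theta = 1.5
proj_obs = B[0] + s_obs @ B[1:]
dist = np.abs(proj - proj_obs)
keep = dist < np.quantile(dist, 0.01)
print(prior_draws[keep].mean())  # crude ABC posterior mean, near 1.5
```

Note that the fitted projection concentrates its weight on the informative components (mean and median) and essentially discards the noisy first observation, which is the dimension-reduction effect the paper is after, here in its simplest linear form.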

“Another proposal is to use different summary statistics for different parameters.”

Which is exactly what we did in our random forest estimation paper. Using a different forest for each parameter of interest (but no real tree was damaged in the experiment!).