Archive for Bayesian inference

Monte Carlo with determinantal processes [reply from the authors]

Posted in Books, Statistics on September 22, 2016 by xi'an

[Rémi Bardenet and Adrien Hardy have written a reply to my comments of today on their paper, which is more readable as a post than as comments, so here it is. I appreciate the intention, as well as the perfect editing of the reply, suited for a direct posting!]

Thanks for your comments, Xian. As a foreword, a few people we met also had the intuition that DPPs would be relevant for Monte Carlo, but no result so far was backing this claim. As it turns out, we had to work hard to prove a CLT for importance-reweighted DPPs, using some deep recent results on orthogonal polynomials. We are currently working on turning this probabilistic result into practical algorithms. For instance, efficient sampling of DPPs is indeed an important open question, to which most of your comments refer. Although this question is out of the scope of our paper, note however that our results do not depend on how you sample. Efficient sampling of DPPs, along with other natural computational questions, is actually the crux of an ANR grant we just got, so hopefully in a few years we can write a more detailed answer on this blog! We now answer some of your other points.

“one has to examine the conditions for the result to operate, from the support being within the unit hypercube,”
Any compactly supported measure would do, using dilations, for instance. Note that we don’t assume the support is the whole hypercube.

“to the existence of N orthogonal polynomials wrt the dominating measure, not discussed here”
As explained in Section 2.1.2, it is enough that the reference measure charges some open set of the hypercube, which is for instance the case if it has a density with respect to the Lebesgue measure.

“to the lack of relation between the point process and the integrand,”
Actually, our method depends heavily on the target measure μ. Unlike vanilla QMC, the repulsiveness between the quadrature nodes is tailored to the integration problem.

“changing N requires a new simulation of the entire vector unless I missed the point.”
You’re absolutely right. This is a well-known open issue in probability, see the discussion on Terence Tao’s blog.

“This requires figuring out the upper bounds on the acceptance ratios, a “problem-dependent” request that may prove impossible to implement”
We agree that in general this isn’t trivial. However, good bounds are available for all Jacobi polynomials, see Section 3.

“Even without this stumbling block, generating the N-sized sample for dimension d=N (why d=N, I wonder?)”
This is a misunderstanding: we do not say that d=N in any sense. We only say that sampling from a DPP using the algorithm of [Hough et al] requires the same number of operations as orthonormalizing N vectors of dimension N, hence the cubic cost.

1. “how does it relate to quasi-Monte Carlo?”
So far, the connection to QMC is only intuitive: both rely on well-spaced nodes, but using different mathematical tools.

2. “the marginals of the N-th order determinantal process are far from uniform (see Fig. 1), and seemingly concentrated on the boundaries”
This phenomenon is due to orthogonal polynomials. We are investigating more general constructions that give more flexibility.

3. “Is the variance of the resulting estimator (2.11) always finite?”
Yes. For instance, this follows from the inequality below (5.56) since ƒ(x)/K(x,x) is Lipschitz.

4. and 5. We are investigating concentration inequalities to answer these points.

6. “probabilistic numerics produce an epistemic assessment of uncertainty, contrary to the current proposal.”
A partial answer may be our Remark 2.12. You can interpret DPPs as putting a Gaussian process prior over ƒ and sequentially sampling from the posterior variance of the GP.
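
To make point 2 above a little more concrete, here is a minimal sketch in the simplest setting I can think of: one dimension, uniform reference measure on [-1,1], hence Legendre polynomials. This setting is my choice for illustration and is not taken from the paper. The function names are hypothetical. The sketch evaluates the diagonal K_N(x,x) of the projection kernel built from the first N orthonormal Legendre polynomials. Since K_N(x,x)/N is the marginal density of a point of the N-point process with respect to the reference measure, comparing its values in the middle and at the edge of the interval shows how much mass the orthogonal polynomials push towards the boundary.

```python
def legendre_values(x, n):
    """Return [P_0(x), ..., P_{n-1}(x)] using the three-term recurrence."""
    values = [1.0, x][:n]
    for k in range(1, n - 1):
        # (k + 1) P_{k+1}(x) = (2k + 1) x P_k(x) - k P_{k-1}(x)
        values.append(((2 * k + 1) * x * values[k] - k * values[k - 1]) / (k + 1))
    return values


def kernel_diagonal(x, n):
    """K_N(x, x) = sum over k < N of phi_k(x)^2, with phi_k = sqrt((2k+1)/2) P_k.

    The phi_k are orthonormal for the Lebesgue measure on [-1, 1]
    (an illustrative choice of reference measure, not the paper's general setting).
    """
    return sum((2 * k + 1) / 2.0 * p * p
               for k, p in enumerate(legendre_values(x, n)))


if __name__ == "__main__":
    n_points = 10
    for x in (0.0, 0.5, 0.9, 1.0):
        # K_N(x, x) / N is the marginal density of one point of the process
        print(x, kernel_diagonal(x, n_points) / n_points)
```

The same diagonal is the quantity that reweights the integrand in ƒ(x)/K(x,x), so nodes that pile up near the boundary are down-weighted accordingly.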

local kernel reduction for ABC

Posted in Books, pictures, Statistics, University life on September 14, 2016 by xi'an

“…construction of low dimensional summary statistics can be performed as in a black box…”

Today Zhou and Fukumizu just arXived a paper that proposes a gradient-based dimension reduction for ABC summary statistics, in the spirit of RKHS kernels as advocated, e.g., by Arthur Gretton. Here the projection is a mere linear projection Bs of the vector of summary statistics, s, where B is an estimated Hessian matrix associated with the posterior expectation E[θ|s]. (There is some connection with the latest version of Li’s and Fearnhead’s paper on ABC convergence as they also define a linear projection of the summary statistics, based on asymptotic arguments, although their matrix does depend on the true value of the parameter.) The linearity sounds like a strong restriction [to me], especially when the summary statistics have no reason to belong to a vector space and thus be open to changes of bases and linear projections. For instance, a specific value taken by a summary statistic, like 0 say, may be more relevant than the range of its values. On a larger scale, I am doubtful about always projecting a vector of summary statistics on a subspace with the smallest possible dimension, i.e., the dimension of θ. In practical settings, it seems impossible to derive the optimal projection and a subvector is almost certain to lose information against a larger vector.

“Another proposal is to use different summary statistics for different parameters.”

Which is exactly what we did in our random forest estimation paper. Using a different forest for each parameter of interest (but no real tree was damaged in the experiment!).

Assistant Professor position @ WU

Posted in Mountains, Statistics, University life, Wines, Travel on August 15, 2016 by xi'an

There is an opening for a non-tenure assistant professor position at WU, Vienna, in Sylvia Frühwirth-Schnatter’s group. With deadline September 7, 2016. The requested profile is

– PhD in applied mathematics or in statistics with a strong mathematical background
– Enthusiastic interest in research in Bayesian statistics, exemplified through publications in international journals in topics including, but not limited to, Bayesian non-parametric methods, Bayesian inference for high-dimensional and complex data, Bayesian time series analysis and state space modelling, efficient Markov chain Monte Carlo methods
– Interest in applications in economics, finance, and business
– Excellent programming skills (e.g. in R or Matlab)
– German language skills are not a prerequisite

Here are the details for those interested in this exciting opportunity!

Validity and the foundations of statistical inference

Posted in Statistics on July 29, 2016 by xi'an

Natesh pointed out to me this recent arXival with a somewhat grandiose abstract:

In this paper, we argue that the primary goal of the foundations of statistics is to provide data analysts with a set of guiding principles that are guaranteed to lead to valid statistical inference. This leads to two new questions: “what is valid statistical inference?” and “do existing methods achieve this?” Towards answering these questions, this paper makes three contributions. First, we express statistical inference as a process of converting observations into degrees of belief, and we give a clear mathematical definition of what it means for statistical inference to be valid. Second, we evaluate existing approaches Bayesian and frequentist approaches relative to this definition and conclude that, in general, these fail to provide valid statistical inference. This motivates a new way of thinking, and our third contribution is a demonstration that the inferential model framework meets the proposed criteria for valid and prior-free statistical inference, thereby solving perhaps the most important unsolved problem in statistics.

Since solving the “most important unsolved problem in statistics” sounds worth pursuing, I went and checked the paper’s contents.

“To us, the primary goal of the foundations of statistics is to provide a set of guiding principles that, if followed, will guarantee validity of the resulting inference. Our motivation for writing this paper is to be clear about what is meant by valid inference and to provide the necessary principles to help data analysts achieve validity.”

Which can be interpreted in so many ways that it is somewhat meaningless…

“…if real subjective prior information is available, we recommend using it. However, there is an expanding collection of work (e.g., machine learning, etc) that takes the perspective that no real prior information is available. Even a large part of the literature claiming to be Bayesian has abandoned the interpretation of the prior as a serious part of the model, opting for “default” prior that “works.” Our choice to omit a prior from the model is not for the (misleading) purpose of being “objective”—subjectivity is necessary—but, rather, for the purpose of exploring what can be done in cases where a fully satisfactory prior is not available, to see what improvements can be made over the status quo.”

This is a pretty traditional criticism of the Bayesian approach, namely that if a “true” prior is provided (by whom?) then it is optimal to use it. But this amounts to turning the prior into another piece of the sampling distribution and is not in my opinion a Bayesian argument! Most of the criticisms in the paper are directed at objective Bayes approaches, with the surprising conclusion that, because there exist cases where no matching prior is available, “the objective Bayesian approach [cannot] be considered as a general framework for scientific inference.” (p.9)

Another section argues that Bayesian modelling cannot describe a state of total ignorance. This is formally correct, which is why there is no such thing as a non-informative or the non-informative prior, as often discussed here, but is this truly relevant, in that the inference problem contains one way or another information about the parameter, for instance through a loss function or a pseudo-likelihood?

“This is a desirable property that most existing methods lack.”

The proposal central to the paper’s thesis is to replace posterior probabilities by belief functions b(.|X), called statistical inference, that are interpreted as measures of evidence about subsets A of the parameter space. If not necessarily as probabilities. This is not very novel, witness the works of Dempster, Shafer and subsequent researchers. And not very much used outside Bayesian and fiducial statistics because of the mostly impossible task of defining a function over all subsets of the parameter space. Because of the subjectivity of such “beliefs”, they will be “valid” only if they are well-calibrated in the sense of b(A|X) being sub-uniform, that is, more concentrated near zero than a uniform variate (i.e., small) under the alternative, i.e. when θ is not in A. At this stage, since this is a mix of a minimax and proper coverage condition, my interest started to quickly wane… Especially because the sub-uniformity condition is highly demanding, if leading to controls over the Type I error and the frequentist coverage. As often, I wonder at the meaning of a calibration property obtained over all realisations of the random variable and all values of the parameter. So for me stability is neither “desirable” nor “essential”. Overall, I have increasing difficulties in perceiving proper coverage as a relevant property. Which has no stronger or weaker meaning than the coverage derived from a Bayesian construction.

“…frequentism does not provide any guidance for selecting a particular rule or procedure.”

I agree with this assessment, which means that there is no such thing as frequentist inference, but rather a philosophy for assessing procedures. That the Gleser-Hwang paradox invalidates this philosophy sounds a bit excessive, however. Especially when the bounded nature of Bayesian credible intervals is also analysed as a failure. A more relevant criticism is the lack of directives for picking procedures.

“…we are the first to recognize that the belief function’s properties are necessary in order for the inferential output to satisfy the required validity property”

The construction of the “inferential model” proposed by the authors offers similarities with fiducial inference, in that it builds upon the representation of the observable X as X=a(θ,U). With further constraints on the function a() to ensure the validity condition holds… An interesting point is that the functional connection X=a(θ,U) means that the nature of U changes once X is observed, albeit in a delicate manner outside a Bayesian framework. When illustrated on the Gleser-Hwang paradox, the resolution proceeds from an arbitrary choice of a one-dimensional summary, though. (As I am reading the paper, I realise it builds on other and earlier papers by the authors, papers that I cannot read for lack of time. I must have listened to a talk by one of the authors last year at JSM as this rings a bell. Somewhat.) In conclusion of a quick Sunday afternoon read, I am not convinced by the arguments in the paper and even less by the impression of a remaining arbitrariness in setting the resulting procedure.

asymptotic properties of Approximate Bayesian Computation

Posted in pictures, Statistics, Travel, University life on July 26, 2016 by xi'an

[Street light near the St Kilda Road bridge, Melbourne, July 21, 2012]

With David Frazier and Gael Martin from Monash University, and with Judith Rousseau (Paris-Dauphine), we have now completed and arXived a paper entitled Asymptotic Properties of Approximate Bayesian Computation. This paper undertakes a fairly complete study of the large sample properties of ABC under weak regularity conditions. We produce therein sufficient conditions for posterior concentration, asymptotic normality of the ABC posterior estimate, and asymptotic normality of the ABC posterior mean. Moreover, those (theoretical) results are of significant import for practitioners of ABC as they pertain to the choice of tolerance ε used within ABC for selecting parameter draws. In particular, they [the results] contradict the conventional ABC wisdom that this tolerance should always be taken as small as the computing budget allows.

Now, this paper bears some similarities with our earlier paper on the consistency of ABC, written with David and Gael. As it happens, the paper was rejected after submission and I then discussed it in an internal seminar in Paris-Dauphine, with Judith taking part in the discussion and quickly suggesting some alternative approach that is now central to the current paper. The previous version analysed Bayesian consistency of ABC under specific uniformity conditions on the summary statistics used within ABC. But the conditions for consistency are now much weaker than earlier, thanks to Judith’s input!

There are also similarities with Li and Fearnhead (2015). Previously discussed here. However, while similar in spirit, the results contained in the two papers strongly differ on several fronts:

  1. Li and Fearnhead (2015) considers an ABC algorithm based on kernel smoothing, whereas our interest is the original ABC accept-reject and its many derivatives;
  2. our theoretical approach permits a complete study of the asymptotic properties of ABC, posterior concentration, asymptotic normality of ABC posteriors, and asymptotic normality of the ABC posterior mean, whereas Li and Fearnhead (2015) is only concerned with asymptotic normality of the ABC posterior mean estimator (and various related point estimators);
  3. the results of Li and Fearnhead (2015) are derived under very strict uniformity and continuity/differentiability conditions, which bear a strong resemblance to those conditions in Yuan and Clark (2004) and Creel et al. (2015), while the results herein do not rely on such conditions and only assume very weak regularity conditions on the summary statistics themselves; this difference allows us to characterise the behaviour of ABC in situations not covered by the approach taken in Li and Fearnhead (2015).
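
For readers who have not met it, the original ABC accept-reject algorithm and the role of the tolerance ε can be sketched in a few lines. The toy model below (Normal observations with unknown mean θ, an N(0,3²) prior, the sample mean as summary statistic, an observed summary equal to 1) is an assumption made purely for illustration and does not come from the paper. The function name is hypothetical as well.

```python
import random
import statistics


def abc_rejection(observed_summary, tolerance, n_draws, n_obs, rng):
    """Basic ABC accept-reject for a toy Normal-mean model (illustrative only)."""
    accepted = []
    for _ in range(n_draws):
        theta = rng.gauss(0.0, 3.0)  # draw from the assumed N(0, 3^2) prior
        pseudo_data = [rng.gauss(theta, 1.0) for _ in range(n_obs)]
        # keep theta when the simulated summary falls within the tolerance
        if abs(statistics.fmean(pseudo_data) - observed_summary) <= tolerance:
            accepted.append(theta)
    return accepted


rng = random.Random(2016)
tight = abc_rejection(1.0, tolerance=0.05, n_draws=20000, n_obs=20, rng=rng)
loose = abc_rejection(1.0, tolerance=1.0, n_draws=20000, n_obs=20, rng=rng)

print(len(tight), statistics.fmean(tight), statistics.stdev(tight))
print(len(loose), statistics.fmean(loose), statistics.stdev(loose))
```

A smaller ε keeps fewer draws and gives a more concentrated ABC posterior sample, at a higher simulation cost per accepted value. How fast ε should decrease with the sample size is exactly the kind of question the asymptotic results address, and this sketch does not attempt to answer it.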

the curious incident of the inverse of the mean

Posted in R, Statistics, University life on July 15, 2016 by xi'an

As I figured out while working with astronomer colleagues last week, a strange if understandable difficulty proceeds from the simplest and most studied statistical model, namely the Normal model

x~N(μ,1)

Indeed, if one reparametrises this model as x~N(υ⁻¹,1) with υ>0, a single observation x brings very little information about υ! (This is not a toy problem as it corresponds to estimating distances from observations of parallaxes.) If x gets large, υ is very likely to be small, but if x is small or negative, υ is certainly large, with no power to discriminate between highly different values. For instance, Fisher’s information for this model and parametrisation is the squared derivative of the mean υ⁻¹, that is, υ⁻⁴, and thus collapses to zero as υ grows.

While one can always hope for Bayesian miracles, they do not automatically occur. For instance, working with a Gamma prior Ga(3,10³) on υ [as informed by a large astronomy dataset] leads to a posterior expectation hardly impacted by the value of the observation x.

And using an alternative estimate like the harmonic posterior mean that is associated with the relative squared error loss does not see much more impact from the observation.

There is simply not enough information contained in one datapoint (or even several datapoints for that matter) to infer about υ.
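
A rough numerical check of this lack of impact can be sketched as follows, under two assumptions of mine that the post does not spell out: Ga(3,10³) is read as shape 3 and scale 10³ (prior mean 3000), and the harmonic posterior mean is taken as 1/E[1/υ|x]. The function name and the integration grid are hypothetical choices. The integrals are approximated by a plain Riemann sum on a log-spaced grid.

```python
import math


def posterior_summaries(x, shape=3.0, scale=1e3, n_grid=20001):
    """Posterior mean and harmonic mean of v for x ~ N(1/v, 1), v ~ Gamma.

    Assumes a shape/scale reading of the Gamma prior (not stated in the post).
    Integrates over t = log(v) on a uniform grid covering v in [1e-3, 1e5].
    """
    lo, hi = math.log(1e-3), math.log(1e5)
    step = (hi - lo) / (n_grid - 1)
    norm = first_moment = inverse_moment = 0.0
    for i in range(n_grid):
        v = math.exp(lo + i * step)
        # prior kernel times Normal likelihood, times the Jacobian v of v = exp(t)
        weight = v ** (shape - 1) * math.exp(-v / scale - 0.5 * (x - 1.0 / v) ** 2) * v
        norm += weight
        first_moment += weight * v
        inverse_moment += weight / v
    return first_moment / norm, norm / inverse_moment


for x in (-2.0, 0.0, 2.0):
    mean, harmonic = posterior_summaries(x)
    print(f"x={x:+.1f}  posterior mean={mean:.1f}  harmonic mean={harmonic:.1f}")
```

Under this reading of the prior, both summaries should stay very close to their prior values whatever the observation, which is the phenomenon described above. A rate parametrisation of the Gamma prior would call for a different grid and may behave differently.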

ABC random forests for Bayesian parameter inference [version 2.0]

Posted in Books, Kids, pictures, Statistics, Travel, University life, Wines on June 30, 2016 by xi'an

Just mentioning that a second version of our paper has been arXived and submitted to JMLR, the main input being the inclusion of a reference to the abcrf package. And just repeating our best-selling arguments that (a) forests do not require a preliminary selection of the summary statistics, since an arbitrary number of summaries can be used as input for the random forest, even when including a large number of useless white noise variables; (b) there is no longer a tolerance level involved in the process, since the many trees in the random forest define a natural if rudimentary distance that corresponds to being or not being in the same leaf as the observed vector of summary statistics η(y); (c) the size of the reference table simulated from the prior (predictive) distribution does not need to be as large as in usual ABC settings and hence this approach leads to significant gains in computing time since the production of the reference table usually is the costly part! To the point that deriving a different forest for each univariate transform of interest is truly a minor drag in the overall computing cost of the approach.