## Monte Carlo with determinantal processes [reply from the authors]

Posted in Books, Statistics with tags , , , , , , , , , , , , , , on September 22, 2016 by xi'an

[Rémi Bardenet and Adrien Hardy have written a reply to my comments of today on their paper, which is more readable as a post than as comments, so here it is. I appreciate the intention, as well as the perfect editing of the reply, suited for a direct posting!]

Thanks for your comments, Xian. As a foreword, a few people we met also had the intuition that DPPs would be relevant for Monte Carlo, but no result so far was backing this claim. As it turns out, we had to work hard to prove a CLT for importance-reweighted DPPs, using some deep recent results on orthogonal polynomials. We are currently working on turning this probabilistic result into practical algorithms. For instance, efficient sampling of DPPs is indeed an important open question, to which most of your comments refer. Although this question is out of the scope of our paper, note however that our results do not depend on how you sample. Efficient sampling of DPPs, along with other natural computational questions, is actually the crux of an ANR grant we just got, so hopefully in a few years we can write a more detailed answer on this blog! We now answer some of your other points.

“one has to examine the conditions for the result to operate, from the support being within the unit hypercube,”
Any compactly supported measure would do, using dilations, for instance. Note that we don’t assume the support is the whole hypercube.

“to the existence of N orthogonal polynomials wrt the dominating measure, not discussed here”
As explained in Section 2.1.2, it is enough that the reference measure charges some open set of the hypercube, which is for instance the case if it has a density with respect to the Lebesgue measure.

“to the lack of relation between the point process and the integrand,”
Actually, our method depends heavily on the target measure μ. Unlike vanilla QMC, the repulsiveness between the quadrature nodes is tailored to the integration problem.

“changing N requires a new simulation of the entire vector unless I missed the point.”
You’re absolutely right. This is a well-known open issue in probability, see the discussion on Terence Tao’s blog.

“This requires figuring out the upper bounds on the acceptance ratios, a “problem-dependent” request that may prove impossible to implement”
We agree that in general this isn’t trivial. However, good bounds are available for all Jacobi polynomials, see Section 3.

“Even without this stumbling block, generating the N-sized sample for dimension d=N (why d=N, I wonder?)”
This is a misunderstanding: we do not say that d=N in any sense. We only say that sampling from a DPP using the algorithm of [Hough et al] requires the same number of operations as orthonormalizing N vectors of dimension N, hence the cubic cost.

1. “how does it relate to quasi-Monte Carlo?”
So far, the connection to QMC is only intuitive: both rely on well-spaced nodes, but using different mathematical tools.

2. “the marginals of the N-th order determinantal process are far from uniform (see Fig. 1), and seemingly concentrated on the boundaries”
This phenomenon is due to orthogonal polynomials. We are investigating more general constructions that give more flexibility.

3. “Is the variance of the resulting estimator (2.11) always finite?”
Yes. For instance, this follows from the inequality below (5.56) since ƒ(x)/K(x,x) is Lipschitz.

4. and 5. We are investigating concentration inequalities to answer these points.

6. “probabilistic numerics produce an epistemic assessment of uncertainty, contrary to the current proposal.”
A partial answer may be our Remark 2.12. You can interpret DPPs as putting a Gaussian process prior over ƒ and sequentially sampling from the posterior variance of the GP.

## miXed distributions

Posted in Books, Kids, Statistics, University life with tags , , , , , , , , , on November 3, 2015 by xi'an

A couple of questions on X validated showed the difficulty students have with mixed measures and their density. Actually, my students always react with incredulity to the likelihood of a censored normal sample or to the derivation of a Bayes factor associated with the null (and atomic) hypothesis μ=0…

I attribute this difficulty to a poor understanding of the notion of density and hence to a deficiency in the training in measure theory, since the density f of the distribution F is always relative to a reference measure dμ, i.e.

f(x) = dF/dμ(x)

(Hence Lebesgue’s moustache on the attached poster!) To handle atoms in the distribution requires introducing a dominating measure dμ with atomic components, i.e., usually a sum of the Lebesgue measure and of the counting measure on the appropriate set. Which is not so absolutely obvious: while the first question had {0,1} as atoms, the second question introduced atoms on {-θ,θ}and required a change of variable to consider a counting measure on {-1,1}. I found this second question actually of genuine interest and a great toy example for class and exams.

## Moment conditions and Bayesian nonparametrics

Posted in R, Statistics, University life with tags , , , , , , , , , , on August 6, 2015 by xi'an

Luke Bornn, Neil Shephard, and Reza Solgi (all from Harvard) have arXived a pretty interesting paper on simulating targets on a zero measure set. Although it is not initially presented this way, but rather in non-parametric terms as moment conditions

$\mathbb{E}_\theta[g(X,\beta)]=0$

where θ is the parameter of the sampling distribution, constrained by the value of β. (Which also contains quantile regression.) The very problem of simulating under a hard constraint has been bugging me for years and it is hence very exciting to see them come up with a proposal towards solving this difficulty! Even though it is restricted here to observations with a finite support (hence allowing for the use of a parametric Dirichlet prior). One interesting extension (Section 3.6) processed in the paper is the case when the support is unknown, but finite, with some points in the support being unobserved. Maybe connecting with non-parametrics if a prior is added on the number of unobserved points.

The setting of constricting θ via a parameterised moment condition relates to moment defined econometrics models, in a similar spirit to Gallant’s paper I recently discussed, but equally to empirical likelihood, which would then benefit from a fully Bayesian treatment thanks to the approach advocated by the authors.

Despite the zero-measure difficulty, or more exactly the non-linear manifold structure of the parameter space, for instance

β = log {θ/(1-θ)}

the authors manage to define a “projected” [my words] measure on the set of admissible pairs (β,θ). In a sense this is related with the choice of a certain metric, but the so-called Hausdorff reference measure allows for an automated definition of the original prior. It took me a (wee) while to spot (p.7) that the starting point was not a (unconstrained) prior on that (unconstrained) pair (β,θ) but directly on the manifold

$\mathbb{E}_\theta[g(X,\beta)]=0.$

Which makes its construction a difficulty. Even though, as noted in Section 4, all that we need is a prior over θ since the Hausdorff-Jacobian identity defines the “joint”, in a sort of backward way. (This is a wee bit confusing in that β being a transform of θ, all we need is a prior over θ, but we nonetheless end up with a different density on the joint distribution on the pair (β,θ). Any connection with incompatible priors merged together into a consensus prior?) Another question extending the scope of the paper would be to define Jeffreys’ or reference priors in this manifold sense.

The authors also discuss (Section 4.3) the problem I originally thought they were processing, namely starting from an unconstrained pair (β,θ) and it corresponding prior. The projected prior can then be defined based on a version of the original density on the constrained space, but it is definitely arbitrary. In that sense the paper does not address the general problem.

“…traditional simulation algorithms will fail because the prior and the posterior of the model are supported on a zero Lebesgue measure set…” (p.10)

I somewhat resist this presentation through the measure zero set: once the prior is defined on a manifold, the fact that it is a measure zero set in a larger space is moot. Provided one can simulate a proposal over that manifold, e.g., a random walk, absolutely continuous wrt the same dominating measure, and compute or estimate a Metropolis-Hastings ratio of densities against a common measure, one can formally run MCMC on manifolds as well as regular Euclidean spaces. A first and theoretically straightforward (?) solution is to solve the constraint

$\mathbb{E}_\theta[g(X,\beta)]=0$

in β=β(θ). Then the joint prior p(β,θ) can be projected by the Hausdorff projection into p(θ). For instance, in the case of the above logit transform, the projected density is

p(θ)=p(β,θ) {1+1/θ²(1-θ)²}½

In practice, the inversion may be too costly and Bornn et al. directly simulate the pair (β,θ) within the manifold capitalising on the fact that the constraint is linear in θ given β. Indeed, in this setting, β is unconstrained and θ can be simulated from a proposal restricted to the hyperplane. Gibbs-like.

## approximation of improper by vague priors

Posted in Statistics, University life with tags , , , on November 18, 2013 by xi'an

“…many authors prefer to replace these improper priors by vague priors, i.e. probability measures that aim to represent very few knowledge on the parameter.”

Christèle Bioche and Pierre Druihlet arXived a few days ago a paper with this title. They aim at bringing a new light on the convergence of vague priors to their limit. Their notion of convergence is a pointwise convergence in the quotient space of Radon measures, quotient being defined by the removal of the “normalising” constant. The first results contained in the paper do not show particularly enticing properties of the improper limit of proper measures as the limit cannot be given any (useful) probabilistic interpretation. (A feature already noticeable when reading Jeffreys.) The first result that truly caught my interest in connection with my current research is the fact that the Haar measures appear as a (weak) limit of conjugate priors (Section 2.5). And that the Jeffreys prior is the limit of the parametrisation-free conjugate priors of Druilhet and Pommeret (2012, Bayesian Analysis, a paper I will discuss soon!). The result about the convergence of posterior means is rather anticlimactic as the basis assumption is the uniform integrability of the sequence of the prior densities. An interesting counterexample (somehow familiar to invariance fans): the sequence of Poisson distributions with mean n has no weak limit. And the Haldane prior does appear as a limit of Beta distributions (less surprising). On (0,1) if not on [0,1].

The paper contains a section on the Jeffreys-Lindley paradox, which is only considered from the second perspective, the one I favour. There is however a mention made of the noninformative answer, which is the (meaningless) one associated with the Lebesgue measure of normalising constant one. This Lebesgue measure also appears as a weak limit in the paper, even though the limit of the posterior probabilities is 1. Except when the likelihood has bounded variations outside compacts. Then the  limit of the probabilities is the prior probability of the null… Interesting, truly, but not compelling enough to change my perspective on the topic. (And thanks to the authors for their thanks!)

## optimal direction Gibbs

Posted in Statistics, University life with tags , , , , , , on May 29, 2012 by xi'an

An interesting paper appeared on arXiv today. Entitled On optimal direction gibbs sampling, by Andrés Christen, Colin Fox, Diego Andrés Pérez-Ruiz and Mario Santana-Cibrian, it defines optimality as picking the direction that brings the maximum independence between two successive realisations in the Gibbs sampler. More precisely, it aims at choosing the direction e that minimises the mutual information criterion

$\int\int f_{Y,X}(y,x)\log\dfrac{f_{Y,X}(y,x)}{f_Y(y)f_X(x)}\,\text{d}x\,\text{d}y$

I have a bit of an issue about this choice because it clashes with measure theory. Indeed, in one Gibbs step associated with e the transition kernel is defined in terms of the Lebesgue measure over the line induced by e. Hence the joint density of the pair of successive realisations is defined in terms of the product of the Lebesgue measure on the overall space and of the Lebesgue measure over the line induced by e… While the product in the denominator is defined against the product of the Lebesgue measure on the overall space and itself. The two densities are therefore not comparable since not defined against equivalent measures… The difference between numerator and denominator is actually clearly expressed in the normal example (page 3) when the chain operates over a n dimensional space, but where the conditional distribution of the next realisation is one-dimensional, thus does not relate with the multivariate normal target on the denominator. I therefore do not agree with the derivation of the mutual information henceforth produced as (3).

The above difficulty is indirectly perceived by the authors, who note “we cannot simply choose the best direction: the resulting Gibbs sampler would not be irreducible” (page 5), an objection I had from an earlier page… They instead pick directions at random over the unit sphere and (for the normal case) suggest using a density over those directions such that

$h^*(\mathbf{e})\propto(\mathbf{e}^\prime A\mathbf{e})^{1/2}$

which cannot truly be called “optimal”.

More globally, searching for “optimal” directions (or more generally transforms) is quite a worthwhile idea, esp. when linked with adaptive strategies…

## principles of uncertainty

Posted in Books, R, Statistics, University life with tags , , , , , , , , , , , , , , on October 14, 2011 by xi'an

Bayes Theorem is a simple consequence of the axioms of probability, and is therefore accepted by all as valid. However, some who challenge the use of personal probability reject certain applications of Bayes Theorem.”  J. Kadane, p.44

Principles of uncertainty by Joseph (“Jay”) Kadane (Carnegie Mellon University, Pittsburgh) is a profound and mesmerising book on the foundations and principles of subjectivist or behaviouristic Bayesian analysis. Jay Kadane wrote Principles of uncertainty over a period of several years and, more or less in his own words, it represents the legacy he wants to leave for the future. The book starts with a large section on Jay’s definition of a probability model, with rigorous mathematical derivations all the way to Lebesgue measure (or more exactly the McShane-Stieltjes measure). This section contains many side derivations that pertain to mathematical analysis, in order to explain the subtleties of infinite countable and uncountable sets, and the distinction between finitely additive and countably additive (probability) measures. Unsurprisingly, the role of utility is emphasized in this book that keeps stressing the personalistic entry to Bayesian statistics. Principles of uncertainty also contains a formal development on the validity of Markov chain Monte Carlo methods that is superb and missing in most equivalent textbooks. Overall, the book is a pleasure to read. And highly recommended for teaching as it can be used at many different levels. Continue reading