Archive for Bayesian asymptotics

Korean trip

Posted in Mountains, Running, Statistics, Travel, University life on November 24, 2019 by xi'an

A fairly short but exciting trip to Seoul and to the Fall meeting of the Korean Statistical Society there. Plus giving a seminar at Seoul National University, where I stayed and enjoyed its beautiful campus surrounded by hills painted in the flamboyant reds and yellows of trees. Running to the top of Gwanaksan in the early morning, with some scrambling moments, was a fantastic beginning for the day! Although it was quite unintentional, Sacha Tsybakov from CREST happened to be another invited speaker at the meeting (along with Regina Liu from Rutgers, whom I had also met in Salzburg two months ago) and we had a nice stroll together on the University of Seoul campus during a break in the sessions, gaining another view of the city from the top of the Bukhansan mountain. The talk I gave there on the asymptotics of ABC happened to be better attended than my tutorial lecture delivered at the beginning of JSM in Denver this summer. I am thus quite grateful to the organisers for their invitation and this opportunity to meet Korean statisticians and to get a glimpse of Korean culture and cuisine!

 

probably ABC [and provably robust]

Posted in Books, pictures, Statistics, Travel on August 8, 2017 by xi'an

Two weeks ago, James Ridgway (formerly CREST) arXived a paper on misspecification and ABC, a topic on which David Frazier, Judith Rousseau and I have been working for a while now [and soon to be arXived as well]. A paper that I re-read on a flight to Amsterdam [hence the above picture], written as a continuation of our earlier paper with David, Gael, and Judith. One specificity of the paper is to use an exponential distribution on the distance between the observed and simulated samples within the ABC distribution. Which reminds me of the resolution by Bissiri, Holmes, and Walker (2016) of the intractability of the likelihood function. James’ paper contains oracle inequalities between the ABC approximation and the genuine distribution of the summary statistics, like a bound on the distance between the expectations of the summary statistics under both models. Which can be written as a sum of a model bias, of two divergences between empirical and theoretical averages, of smoothness penalties, and of a prior impact term. And a similar bound on the expected distance to the oracle estimator of θ under the ABC distribution [under a Lipschitz type assumption also found in our paper]. Which first sounded weird [to me] as I would have expected the true posterior, until it dawned on me that the ABC distribution is the one used for the estimation [a passing strike of over-Bayesianism!]. While the oracle bound could have been used directly to discuss the rate of convergence of the exponential rate λ to zero [with the sample size n], James goes into the interesting alternative direction of setting a prior on λ, an idea that dates back to Olivier Catoni and Peter Grünwald. Or rather a pseudo-posterior on λ, a common occurrence in the PAC-Bayesian literature. In one of his results, James obtains a dependence of λ on the dimension m of the summary [as well as a root-n dependence on the sample size], which seems to contradict our earlier independence result, until one realises this scale parameter is associated with a distance variable, itself scaling with m.
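To fix ideas, here is a minimal importance-sampling sketch of this exponential tilting of the ABC distance, on a toy Gaussian model of my own choosing [the model, the summary statistic, and the value of λ are illustrative assumptions, not taken from James’ paper]:

```python
import numpy as np

rng = np.random.default_rng(1)

# toy setup: data from N(1, 1), with the sample mean as summary statistic
x_obs = rng.normal(1.0, 1.0, size=100)
s_obs = x_obs.mean()

lam, n_sims = 50.0, 10_000
theta = rng.normal(0.0, 3.0, size=n_sims)   # draws from a N(0, 3^2) prior

# simulate a summary statistic for each prior draw
s_sim = np.array([rng.normal(t, 1.0, size=100).mean() for t in theta])

# exponential kernel on the distance, replacing the usual hard threshold:
# weight(theta) proportional to exp(-lambda * d(s_sim, s_obs))
w = np.exp(-lam * np.abs(s_sim - s_obs))
w /= w.sum()

print("ABC pseudo-posterior mean of theta:", np.sum(w * theta))
```

Putting a (pseudo-)prior on λ, as in the PAC-Bayesian direction discussed above, would amount to replacing the fixed lam with draws from that distribution.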

The paper also contains a non-parametric part, where the parameter θ is the unknown distribution of the data and the summary the data itself. Which is quite surprising as I did not deem it possible to handle non-parametrics with ABC. Especially in a misspecified setting (although I have trouble perceiving what this really means).

“We can use most of the Monte Carlo toolbox available in this context.”

The theoretical parts are a bit heavy on notation and hard to read [as a vacation morning read at least!]. They are followed by a Monte Carlo implementation using SMC-ABC. And pseudo-marginals [at least formally, as I do not see how the specific features of pseudo-marginals amount to more than an augmented representation here]. And adaptive multiple pseudo-samples that reminded me of the Biometrika paper of Anthony Lee and Krys Latuszynski (Warwick). Therefore using indeed most of the toolbox!
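For concreteness, here is a bare-bones ABC-PMC sketch in the spirit of this SMC-ABC implementation [on the same toy Gaussian model as above, with an arbitrary tolerance schedule of my own; the paper’s actual sampler, with its pseudo-marginal and multiple pseudo-sample features, is more elaborate]:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
x_obs = rng.normal(1.0, 1.0, size=100)
s_obs = x_obs.mean()

def sim_summary(t):
    """Simulate a dataset from N(t, 1) and return its mean as summary."""
    return rng.normal(t, 1.0, size=100).mean()

N = 1000
prior = norm(0.0, 3.0)
theta = prior.rvs(size=N, random_state=rng)       # initial particles from the prior
w = np.full(N, 1.0 / N)

for eps in [1.0, 0.5, 0.2, 0.1]:                  # decreasing tolerance schedule
    tau = np.sqrt(2 * np.cov(theta, aweights=w))  # kernel scale, as in ABC-PMC
    new_theta = np.empty(N)
    for i in range(N):
        while True:                               # move a resampled particle until accepted
            t = rng.choice(theta, p=w) + tau * rng.normal()
            if abs(sim_summary(t) - s_obs) <= eps:
                new_theta[i] = t
                break
    # importance weights: prior density over the mixture proposal density
    kd = norm.pdf((new_theta[:, None] - theta[None, :]) / tau) / tau
    w_new = prior.pdf(new_theta) / (kd * w[None, :]).sum(axis=1)
    theta, w = new_theta, w_new / w_new.sum()

print("ABC-PMC posterior mean:", np.sum(w * theta))
```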

CORE talk at Louvain-la-Neuve

Posted in Statistics on March 16, 2017 by xi'an

Tomorrow, I will give a talk at the seminar for econometrics and finance of CORE, in Louvain-la-Neuve, Belgium. Here are my slides, recycled from several earlier talks and from Judith’s slides in Banff.

 

mixture models with a prior on the number of components

Posted in Books, Statistics, University life on March 6, 2015 by xi'an

[figure mixdir: posterior agreement between finite mixture and Dirichlet process modelling]

“From a Bayesian perspective, perhaps the most natural approach is to treat the number of components like any other unknown parameter and put a prior on it.”

Another mixture paper on arXiv! Indeed, Jeffrey Miller and Matthew Harrison recently arXived a paper on estimating the number of components in a mixture model, comparing the parametric with the non-parametric Dirichlet prior approaches, since priors can be chosen to bring the two into agreement. This is an obviously interesting issue, as the two approaches are often opposed in modelling debates. The above graph shows a crystal clear agreement between finite component mixture modelling and Dirichlet process modelling. The same happens for classification. However, Dirichlet process priors do not return an estimate of the number of components, which may be considered a drawback if one considers this an identifiable quantity in a mixture model… But the paper stresses that the number of estimated clusters under Dirichlet process modelling tends to be larger than the number of components in the finite case. Hence Dirichlet process mixture modelling is not consistent in that respect, producing spurious extra clusters…
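The phenomenon is easy to reproduce by simulation. A quick sketch [with arbitrary values of n, k, α and γ, unrelated to the paper’s experiments] comparing the number of occupied clusters under a Chinese restaurant process prior and under a finite mixture with symmetric Dirichlet weights:

```python
import numpy as np

rng = np.random.default_rng(0)

def crp_clusters(n, alpha):
    """Number of clusters in a partition of n items drawn from a CRP(alpha)."""
    counts = []
    for _ in range(n):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        j = rng.choice(len(probs), p=probs)
        if j == len(counts):
            counts.append(1)                    # open a new table
        else:
            counts[j] += 1                      # join an existing table
    return len(counts)

def finite_clusters(n, k, gamma):
    """Number of occupied components under a symmetric Dirichlet(gamma) k-mixture."""
    weights = rng.dirichlet([gamma] * k)
    z = rng.choice(k, size=n, p=weights)
    return len(np.unique(z))

n = 500
crp = [crp_clusters(n, alpha=1.0) for _ in range(200)]
fin = [finite_clusters(n, k=5, gamma=1.0) for _ in range(200)]
print("CRP   : mean number of clusters", np.mean(crp))
print("finite: mean number of clusters", np.mean(fin))
```

The CRP count keeps growing (logarithmically) with n, while the finite mixture count is capped at k, in line with the extra-cluster behaviour discussed above.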

In the parametric modelling, the authors assume the same scale is used in all Dirichlet priors, that is, for all values of k, the number of components. Which means an incoherence when marginalising from k to (k-p) components. A mild incoherence, in fact, as the parameters of the different models do not have to share the same priors. And, as shown by Proposition 3.3 in the paper, this does not prevent coherence in the marginal distribution of the latent variables. The authors also draw a comparison between the distribution of the partition in the finite mixture case and the Chinese restaurant process associated with the partition in the infinite case. A further analogy is that the finite case also allows for a stick-breaking representation. A noteworthy difference between the two modellings lies in the size of the partitions:

\mathbb{P}(s_1,\ldots,s_k)\propto\prod_{j=1}^k s_j^{-\gamma}\quad\text{versus}\quad\mathbb{P}(s_1,\ldots,s_k)\propto\prod_{j=1}^k s_j^{-1}

in the finite (homogeneous partitions) and infinite (extreme partitions) cases.
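The stick-breaking analogy can be made concrete as well: Beta(1, α) sticks produce the Dirichlet process weights, while Beta(γ, (k−j)γ) sticks recover the symmetric Dirichlet(γ) weights of the finite case, by the aggregation property of the Dirichlet distribution. A small sketch [parameter values are mine]:

```python
import numpy as np

rng = np.random.default_rng(2)

def dp_sticks(alpha, trunc=25):
    """Truncated stick-breaking weights of a Dirichlet process."""
    v = rng.beta(1.0, alpha, size=trunc)
    return v * np.concatenate(([1.0], np.cumprod(1 - v)[:-1]))

def finite_sticks(gamma, k):
    """Stick-breaking construction of symmetric Dirichlet(gamma) weights."""
    w, rest = np.empty(k), 1.0
    for j in range(k - 1):
        v = rng.beta(gamma, gamma * (k - j - 1))
        w[j] = v * rest
        rest *= 1 - v
    w[-1] = rest
    return w

print("DP weights (truncated):  ", np.round(dp_sticks(1.0), 3))
print("finite Dirichlet weights:", np.round(finite_sticks(1.0, 5), 3))
```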

An interesting entry into the connections between “regular” mixture modelling and Dirichlet mixture models. Maybe not ultimately surprising given the past studies of both approaches by Peter Green and Sylvia Richardson (1997 in Series B and 2001 in the Scandinavian Journal of Statistics).

model selection by likelihood-free Bayesian methods

Posted in Books, pictures, Running, Statistics, University life on May 29, 2014 by xi'an

Just glanced at the introduction of this arXived paper over breakfast, back from my morning run: the exact title is “Model Selection for Likelihood-free Bayesian Methods Based on Moment Conditions: Theory and Numerical Examples” by Cheng Li and Wenxin Jiang. (The paper is 81 pages long.) I selected the paper for its title, as it connected with a question of ours on how to extend our empirical likelihood [A]BC work to model choice. We looked at this issue with Kerrie Mengersen and Judith Rousseau the last time Kerrie visited Paris, but could not spot a satisfying entry… The current paper is of a theoretical nature, considering a moment-defined model

\mathbb{E}[g(D,\theta)]=0,

where D denotes the data, as the dimension p of the parameter θ grows with n, the sample size. The approximate model is derived from a prior on the parameter θ and from a Gaussian quasi-likelihood on the moment estimating function g(D,θ). Examples include single index longitudinal data, quantile regression, and partial correlation selection. The model selection setting is one of variable selection, resulting in 2^p models to compare, with p growing to infinity… Which makes the practical implementation rather delicate to conceive. And the guarantee of hitting the right model with probability one a fairly asymptotic concept. (At least after a cursory read from my breakfast table!)
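As a toy illustration of the quasi-likelihood construction [with my own choices of moment function, prior, and grid, not the paper’s examples]: taking g(D,θ) = D − θ, the Gaussian quasi-likelihood built on the averaged moment function turns the prior into a quasi-posterior:

```python
import numpy as np

rng = np.random.default_rng(3)
D = rng.normal(2.0, 1.0, size=200)              # toy data with true theta = 2
n = D.size

grid = np.linspace(0.0, 4.0, 401)               # grid over theta
log_prior = -0.5 * (grid / 10.0) ** 2           # N(0, 10^2) prior, say

def quasi_loglik(theta):
    g = D - theta                               # moment function g(D, theta)
    gbar, v = g.mean(), g.var()                 # averaged moment and its variance
    return -0.5 * n * gbar**2 / v               # Gaussian quasi-likelihood

post = np.exp(log_prior + np.array([quasi_loglik(t) for t in grid]))
post /= post.sum()
print("quasi-posterior mean:", np.sum(post * grid))
```

In the variable selection setting of the paper, a similar quasi-likelihood would have to be evaluated for each of the 2^p candidate models, hence the delicacy of the implementation when p grows.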