Archive for hyperparameter

from here to infinity

Posted in Books, Statistics, Travel with tags , , , , , , , , , , , , , on September 30, 2019 by xi'an

“Introducing a sparsity prior avoids overfitting the number of clusters not only for finite mixtures, but also (somewhat unexpectedly) for Dirichlet process mixtures which are known to overfit the number of clusters.”

On my way back from Clermont-Ferrand, in an old train that reminded me of my previous ride on that line that took place in… 1975!, I read a fairly interesting paper published in Advances in Data Analysis and Classification by [my Viennese friends] Sylvia Früwirth-Schnatter and Gertrud Malsiner-Walli, where they describe how sparse finite mixtures and Dirichlet process mixtures can achieve similar results when clustering a given dataset. Provided the hyperparameters in both approaches are calibrated accordingly. In both cases these hyperparameters (scale of the Dirichlet process mixture versus scale of the Dirichlet prior on the weights) are endowed with Gamma priors, both depending on the number of components in the finite mixture. Another interesting feature of the paper is to witness how close the related MCMC algorithms are when exploiting the stick-breaking representation of the Dirichlet process mixture. With a resolution of the label switching difficulties via a point process representation and k-mean clustering in the parameter space. [The title of the paper is inspired from Ian Stewart’s book.]

Nonparametric hierarchical Bayesian quantiles

Posted in Books, Statistics, University life with tags , , , , , , , on June 9, 2016 by xi'an

Luke Bornn, Neal Shephard and Reza Solgi have recently arXived a research report on non-parametric Bayesian quantiles. This work relates to their earlier paper that combines Bayesian inference with moment estimators, in that the quantiles do not define entirely the distribution of the data, which then needs to be completed by Bayesian means. But contrary to this previous paper, it does not require MCMC simulation for distributions defined on a variety as, e.g., a curve.

Here a quantile is defined as minimising an asymmetric absolute risk, i.e., an expected loss. It is therefore a deterministic function of the model parameters for a parametric model and a functional of the model otherwise. And connected to a moment if not a moment per se. In the case of a model with a discrete support, the unconstrained model is parameterised by the probability vector θ and β=t(θ). However, the authors study the opposite approach, namely to set a prior on β, p(β), and then complement this prior with a conditional prior on θ, p(θ|β), the joint prior p(β)p(θ|β) being also the marginal p(θ) because of the deterministic relation. However, I am getting slightly lost in the motivation for the derivation of the conditional when the authors pick an arbitrary prior on θ and use it to derive a conditional on β which, along with an arbitrary (“scientific”) prior on β defines a new prior on θ. This works out in the discrete case because β has a finite support. But it is unclear (to me) why it should work in the continuous case [not covered in the paper].

Getting back to the central idea of defining first the distribution on the quantile β, a further motivation is provided in the hierarchical extension of Section 3, where the same quantile distribution is shared by all individuals (e.g., cricket players) in the population, while the underlying distributions for the individuals are otherwise disconnected and unconstrained. (Obviously, a part of the cricket example went far above my head. But one may always idly wonder why all players should share the same distribution. And about what would happen when imposing no quantile constraint but picking instead a direct hierarchical modelling on the θ’s.) This common distribution on β can then be modelled by a Dirichlet hyperprior.

The paper also contains a section on estimating the entire quantile function, which is a wee paradox in that this function is again a deterministic transform of the original parameter θ, but that the authors use instead pointwise estimation, i.e., for each level τ. I find the exercise furthermore paradoxical in that the hierarchical modelling with a common distribution on the quantile β(τ) only is repeated for each τ but separately, while it should be that the entire parameter should share a common distribution. Given the equivalence between the quantile function and the entire parameter θ.

Nonparametric applications of Bayesian inference

Posted in Books, Statistics, University life with tags , , , , , , on April 22, 2016 by xi'an

Gary Chamberlain and Guido Imbens published this paper in the Journal of Business & Economic Statistics in 2003. I just came to read it in connection with the paper by Luke Bornn, Niel Shephard and Reza Solgi that I commented a few months ago. The setting is somewhat similar: given a finite support distribution with associated probability parameter θ, a natural prior on θ is a Dirichlet prior. This prior induces a prior on transforms of θ, whether or not they are in close form (for instance as the solution of a moment equation E[F(X,β)]=0. As in Bornn et al. In this paper, Chamberlain and Imbens argue in favour of the limiting Dirichlet with all coefficients equal to zero as a way to avoid prior dominating influence when the number of classes J goes to infinity and the data size remains fixed. But they fail to address the issue that the posterior is no longer defined since some classes get unobserved. They consider instead that the parameters corresponding to those classes are equal to zero with probability one, a convention and not a result. (The computational advantage in using the improper prior sounds at best incremental.) The notion of letting some Dirichlet hyper-parameters going to zero is somewhat foreign to a Bayesian perspective as those quantities should be either fixed or distributed according to an hyper-prior, rather than set to converge according to a certain topology that has nothing to do with prior modelling. (Another reason why setting those quantities to zero does not have the same meaning as picking a Dirac mass at zero.)

“To allow for the possibility of an improper posterior distribution…” (p.4)

This is a weird beginning of a sentence, especially when followed by a concept of expected posterior distribution, which is actually a bootstrap expectation. Not as in Bayesian bootstrap, mind. And thus this feels quite orthogonal to the Bayesian approach. I do however find most interesting this notion of constructing a true expected posterior by imposing samples that ensure properness as it reminds me of our approach to mixtures with Jean Diebolt, where (latent) allocations were prohibited to induce improper priors. The bootstrapped posterior distribution seems to be proposed mostly for assessing the impact of the prior modelling, albeit in an non-quantitative manner. (I fail to understand how the very small bootstrap sample sizes are chosen.)

Obviously, there is a massive difference between this paper and Bornn et al, where the authors use two competing priors in parallel, one on θ and one on β, which induces difficulties in setting priors since the parameter space is concentrated upon a manifold. (In which case I wonder what would happen if one implemented the preposterior idea of Berger and Pérez, 2002, to derive a fixed point solution. That we implemented recently with Diego Salmerón and Juan Antonio Caño in a paper published in Statistica Sinica.. This exhibits a similarity with the above bootstrap proposal in that the posterior gets averaged wrt another posterior.)

mixtures are slices of an orange

Posted in Kids, R, Statistics with tags , , , , , , , , , , , , , , , , on January 11, 2016 by xi'an

licenceDataTempering_mu_pAfter presenting this work in both London and Lenzerheide, Kaniav Kamary, Kate Lee and I arXived and submitted our paper on a new parametrisation of location-scale mixtures. Although it took a long while to finalise the paper, given that we came with the original and central idea about a year ago, I remain quite excited by this new representation of mixtures, because the use of a global location-scale (hyper-)parameter doubling as the mean-standard deviation for the mixture itself implies that all the other parameters of this mixture model [beside the weights] belong to the intersection of a unit hypersphere with an hyperplane. [Hence the title above I regretted not using for the poster at MCMskv!]fitted_density_galaxy_data_500iters2This realisation that using a (meaningful) hyperparameter (μ,σ) leads to a compact parameter space for the component parameters is important for inference in such mixture models in that the hyperparameter (μ,σ) is easily estimated from the entire sample, while the other parameters can be studied using a non-informative prior like the Uniform prior on the ensuing compact space. This non-informative prior for mixtures is something I have been seeking for many years, hence my on-going excitement! In the mid-1990‘s, we looked at a Russian doll type parametrisation with Kerrie Mengersen that used the “first” component as defining the location-scale reference for the entire mixture. And expressing each new component as a local perturbation of the previous one. While this is a similar idea than the current one, it falls short of leading to a natural non-informative prior, forcing us to devise a proper prior on the variance that was a mixture of a Uniform U(0,1) and of an inverse Uniform 1/U(0,1). Because of the lack of compactness of the parameter space. Here, fixing both mean and variance (or even just the variance) binds the mixture parameter to an ellipse conditional on the weights. A space that can be turned into the unit sphere via a natural reparameterisation. Furthermore, the intersection with the hyperplane leads to a closed form spherical reparameterisation. Yay!

While I do not wish to get into the debate about the [non-]existence of “non-informative” priors at this stage, I think being able to using the invariant reference prior π(μ,σ)=1/σ is quite neat here because the inference on the mixture parameters should be location and scale equivariant. The choice of the prior on the remaining parameters is of lesser importance, the Uniform over the compact being one example, although we did not study in depth this impact, being satisfied with the outputs produced from the default (Uniform) choice.

From a computational perspective, the new parametrisation can be easily turned into the old parametrisation, hence leads to a closed-form likelihood. This implies a Metropolis-within-Gibbs strategy can be easily implemented, as we did in the derived Ultimixt R package. (Which programming I was not involved in, solely suggesting the name Ultimixt from ultimate mixture parametrisation, a former title that we eventually dropped off for the paper.)

Discussing the paper at MCMskv was very helpful in that I got very positive feedback about the approach and superior arguments to justify the approach and its appeal. And to think about several extensions outside location scale families, if not in higher dimensions which remain a practical challenge (in the sense of designing a parametrisation of the covariance matrices in terms of the global covariance matrix).

MCMskv #2 [ridge with a view]

Posted in Mountains, pictures, R, Statistics, Travel, University life with tags , , , , , , , , , , , , , on January 7, 2016 by xi'an

Tuesday at MCMSkv was a rather tense day for me, from having to plan the whole day “away from home” [8km away] to the mundane worry of renting ski equipment and getting to the ski runs over the noon break, to giving a poster over our new mixture paper with Kaniav Kamary and Kate Lee, as Kaniav could not get a visa in time. It actually worked out quite nicely, with almost Swiss efficiency. After Michael Jordan’s talk, I attended a Bayesian molecular biology session with an impressive talk by Jukka Corander on evolutionary genomics with novel ABC aspects. And then a Hamiltonian Monte Carlo session with two deep talks by Sam Livingstone and Elena Akhmatskaya on the convergence of HMC, followed by an amazing entry into Bayesian cosmology by Jens Jasche (with a slight drawback that MCMC simulations took about a calendar year, handling over 10⁷ parameters). Finishing the day with more “classical” MCMC convergence results and techniques, with talks about forgetting time, stopping time (an undervalued alternative to convergence controls), and CLTs. Including a multivariate ESS by James Flegal. (This choice of sessions was uniformly frustrating as I was also equally interested in “the other” session. The drawback of running parallel sessions, obviously.)

The poster session was busy and animated, but I alas could not get an idea of the other posters as I was presenting mine. This was quite exciting as I discussed a new parametrisation for location-scale mixture models that allows for a rather straightforward “non-informative” or reference prior. (The paper with Kaniav Kamary and Kate Lee should be arXived overnight!) The recently deposited CRAN package Ultimixt by Kaniav and Kate contains Metropolis-Hastings functions related to this new approach. The result is quite exciting, especially because I have been looking for it for decades and I will discuss it pretty soon in another post, and I had great exchanges with the conference participants, which led me to consider the reparametrisation in a larger scale and to simplify the presentation of the approach, turning the global mean and variance as hyperparameters.

The day was also most auspicious for a ski break as it was very mild and sunny, while the snow conditions were (somewhat) better than the ones we had in the French Alps two weeks ago. (Too bad that the Tweedie ski race had to be cancelled for lack of snow on the reserved run! The Blossom ski reward will have again to be randomly allocated!) Just not exciting enough to consider another afternoon out, given the tension in getting there and back. (And especially when considering that it took me the entire break time to arXive our mixture paper…)