## Archive for marginalisation

## individualised polychotomous logistic regression

Posted in Books, Statistics, University life with tags Biometrika, individualisation, letter to the editor, marginalisation, maximum likelihood estimation, polychotomous logistic regression on May 17, 2022 by xi'an

**A** recent submission to Biometrika made me read the 1984 Biometrika paper of Begg and Gray on the individualisation of polychotomous regression, namely the idea that, when considering this model with T categories, the regression parameters can be estimated by considering only the pairs (0,i), 0 being the baseline category (with no parameter), since the (true) probability of being in category i, conditional on being in either category 0 or category i, is logistic with the same coefficient as in the polychotomous model. While I see no issue with this remark (contrary to the submission author), it of course produces (much more quickly) a different estimate of the polychotomous parameter than the full likelihood approach. Not only because it does not exploit the entire information contained in the data, but also because it operates with a pseudo-likelihood.
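The identity underlying Begg and Gray's trick is easy to check numerically. A minimal sketch (my own illustration, with arbitrary made-up coefficients and covariate value), confirming that the conditional probability of category i given {0,i} is exactly logistic in the same coefficient:

```python
import numpy as np

# Check of the Begg-Gray identity: in a polychotomous (multinomial) logit
# with baseline category 0 (zero linear predictor), the probability
# P(Y = i | Y in {0, i}) is logistic with the same coefficient beta_i.
rng = np.random.default_rng(0)
T = 3                      # number of non-baseline categories (made up)
beta = rng.normal(size=T)  # one slope per category (no intercepts, for brevity)
x = 1.7                    # an arbitrary covariate value

# full multinomial probabilities; category 0 has linear predictor 0
eta = np.concatenate(([0.0], beta * x))
p = np.exp(eta) / np.exp(eta).sum()

# conditional probability of category i given {0, i}
i = 2
cond = p[i] / (p[0] + p[i])
logistic = 1 / (1 + np.exp(-beta[i - 1] * x))
print(np.isclose(cond, logistic))  # True: the identity holds exactly
```

The identity is exact in the population; the difference with the full likelihood approach only arises at the estimation stage.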

## efficiency of normalising over discrete parameters

Posted in Statistics with tags arXiv, Gibbs sampler, Hamiltonian Monte Carlo, JAGS, latent variable models, marginalisation, MCMC, mixtures of distributions, Monte Carlo experiment, STAN on May 1, 2022 by xi'an

**Y**esterday, I noticed a new arXival entitled *Investigating the efficiency of marginalising over discrete parameters in Bayesian computations*, written by Wen Wang and coauthors. The paper compares the simulation of a Gibbs sampler with a Hamiltonian Monte Carlo approach on Gaussian mixtures, when including and excluding the latent variables, respectively. The authors missed the opposite marginalisation, when the parameters themselves are integrated out.

*While marginalisation requires substantial mathematical effort, folk wisdom in the Stan community suggests that fitting models with marginalisation is more efficient than using Gibbs sampling.*

The comparison is purely experimental, though, which means it depends on the simulated data, the sample size, the prior selection, and of course the chosen algorithms. It also involves the [mostly] automated [off-the-shelf] choices made in the adopted software, JAGS and Stan. The outcome is evaluated only through the ESS and the (old) R̂ statistic, both of which depend on the parameterisation. And the comparison evacuates the label switching problem by imposing an ordering on the Gaussian means, which may have a different impact on marginalised and unmarginalised models. All in all, there is not much one can conclude from this experiment, since the parameter values behind the simulated data seem to impact the performances much more than the type of algorithm one implements.
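For concreteness, the marginalisation at stake replaces the discrete allocations z_i with a sum over components inside the likelihood, leaving only continuous parameters for HMC to explore. A minimal sketch of this marginal mixture likelihood (my own illustration, with made-up data and parameter values, not the paper's code):

```python
import numpy as np

# simulate a two-component Gaussian mixture sample (made-up values)
rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(-2, 1, 50), rng.normal(2, 1, 50)])

def norm_pdf(y, mu, sigma):
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def marginal_loglik(y, w, mu, sigma):
    # sum_i log( sum_k w_k N(y_i; mu_k, sigma_k) ): the discrete z_i are
    # summed out, so HMC (as in Stan) can sample the continuous
    # parameters directly, while Gibbs (as in JAGS) would simulate the z_i
    comp = w[None, :] * norm_pdf(y[:, None], mu[None, :], sigma[None, :])
    return np.log(comp.sum(axis=1)).sum()

print(marginal_loglik(y, np.array([0.5, 0.5]),
                      np.array([-2.0, 2.0]), np.array([1.0, 1.0])))
```

Note that this marginal likelihood is invariant to permuting the component labels, which is exactly why the label switching issue plays out differently once the z_i are summed out.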

## a case for Bayesian deep learning

Posted in Books, pictures, Statistics, Travel, University life with tags Bayesian foundations, Bayesian model choice, Bayesian neural networks, Bayesian variable selection, Berlin Tegel flughafen, marginalisation, model uncertainty, noninformative priors, normalisation, objective Bayes, snowstorm on September 30, 2020 by xi'an

**A**ndrew Wilson wrote a piece about Bayesian deep learning last winter. Which I just read. It starts with the (posterior) predictive distribution being the core of Bayesian model evaluation or of model (epistemic) uncertainty.

*“On the other hand, a flat prior may have a major effect on marginalization.”*

Interesting sentence, as, from my viewpoint, using a flat prior is a no-no when running model evaluation, since the marginal likelihood (or evidence) is no longer a probability density. (Check the Lindley-Jeffreys paradox in this tribune.) The author then argues in favour of a Bayesian approach to deep neural networks for the reason that data cannot be informative about every parameter in the network, which should then be integrated out wrt a prior. He also draws a parallel between deep ensemble learning, where random initialisations produce different fits, and posterior distributions, although the equivalent of the prior distribution in an optimisation exercise remains somewhat vague.
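The impact of flattening the prior on the evidence is easy to illustrate in the simplest conjugate setting (my own toy numbers, not the paper's): for y₁,…,yₙ ~ N(θ,1) with θ ~ N(0,τ²), the evidence is proportional to a N(0, 1/n+τ²) density evaluated at ȳ, hence vanishes as τ grows, mechanically favouring any model with a fixed prior:

```python
import numpy as np

# Jeffreys-Lindley effect: the evidence for y_1..y_n ~ N(theta, 1),
# theta ~ N(0, tau^2), is N(ybar; 0, 1/n + tau^2) up to a theta-free
# factor, so "flattening" the prior (tau -> infinity) drives it to zero.
n, ybar = 25, 0.3  # made-up sample size and sample mean
evidences = []
for tau in [1.0, 10.0, 100.0]:
    var = 1 / n + tau**2
    evidences.append(np.exp(-0.5 * ybar**2 / var) / np.sqrt(2 * np.pi * var))
    print(tau, evidences[-1])  # decreases roughly like 1/tau
```

This is the sense in which a flat prior does not merely have "a major effect" on marginalisation: it makes the evidence arbitrary, since it is defined only up to the (infinite) normalising constant of the prior.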

*“…we do not need samples from a posterior, or even a faithful approximation to the posterior. We need to evaluate the posterior in places that will make the greatest contributions to the [posterior predictive].”*

The paper also contains an interesting point distinguishing between priors over parameters and priors over functions, only the latter mattering for prediction. Which must be structured enough to compensate for the lack of data information about most aspects of the functions. The paper further discusses uninformative priors (over the parameters) in the O'Bayes sense as a default way to select priors. It is however unclear to me how this discussion accounts for the problems met in high dimensions by standard uninformative solutions. More aggressively penalising priors may be needed, as those found in high-dimensional variable selection, e.g., in the 10⁷-dimensional space mentioned in the paper. Interesting read all in all!

## Why do we draw parameters to draw from a marginal distribution that does not contain the parameters?

Posted in Statistics with tags accept-reject algorithm, Animal Farm, auxiliary variables, cross validated, importance sampling, marginalisation, multiple importance methods, probability basics on November 3, 2019 by xi'an

**A** revealing question on X validated about a simulation concept students (and others) have trouble grappling with, namely using auxiliary variates to simulate from a marginal distribution, since these auxiliary variables are later dismissed and hence appear to them (students) of no use at all. Even after being exposed to the accept-reject algorithm. Or to multiple importance sampling, in the sense that a realisation of a random variable can be associated with a whole series of densities in an importance weight, all of them valid (but some more equal than others!).
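A standard instance of the trick (my own example, not necessarily the one from the X validated thread): to simulate from a Student t marginal, draw an auxiliary variance, then a conditional Normal, and discard the variance, even though it appears nowhere in the target t density:

```python
import numpy as np

# simulate X ~ t_nu via the scale-mixture representation:
# V ~ nu / chi^2_nu (auxiliary variance), X | V ~ N(0, V);
# marginally X is Student t_nu, and V is simply thrown away
rng = np.random.default_rng(2)
nu, n = 5, 100_000
V = nu / rng.chisquare(nu, size=n)   # auxiliary draws
X = rng.normal(0.0, np.sqrt(V))      # kept draws; V is dismissed
print(X.var())                        # close to nu / (nu - 2) = 5/3
```

The auxiliary V is precisely the variable being marginalised out: keeping only X produces draws from the marginal without ever evaluating its density.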

## maximum of a Dirichlet vector

Posted in Books, Statistics with tags cross validated, Dirichlet distribution, LaTeX, marginalisation, order statistics, Peter Dirichlet, Stack Exchange, stamp on September 26, 2016 by xi'an

**A**n intriguing question on Stack Exchange this weekend, about the distribution of max{p₁,p₂,…}, the maximum component of a Dirichlet vector Dir(α₁,α₂,…) with arbitrary hyper-parameters. Writing the density of this random variable is feasible, using its connection with a Gamma vector, but I could not find a closed-form expression. If there is such an expression, it may follow from the many properties of the Dirichlet distribution and I'd be interested in learning about it. (Very nice stamp, by the way! I wonder if the original formula was made with LaTeX…)
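Pending a closed form, the Gamma connection at least makes the maximum trivial to simulate; a quick Monte Carlo sketch, with made-up hyper-parameters of my own:

```python
import numpy as np

# Gamma representation of the Dirichlet: p_k = G_k / sum_j G_j with
# independent G_k ~ Gamma(alpha_k, 1), so max_k p_k is easy to simulate
# even without a closed-form density
rng = np.random.default_rng(3)
alpha = np.array([1.0, 2.0, 3.0])          # arbitrary hyper-parameters
G = rng.gamma(alpha, size=(100_000, 3))    # independent Gamma draws
p = G / G.sum(axis=1, keepdims=True)       # Dirichlet(alpha) vectors
m = p.max(axis=1)                          # the maximum component
print(m.mean())                            # Monte Carlo estimate of E[max]
```

Since the components sum to one, the maximum of a T-dimensional Dirichlet vector is always at least 1/T, which gives a quick sanity check on the simulation.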