Archive for Bayesian model choice

a case for Bayesian deep learnin

Posted in Books, pictures, Statistics, Travel, University life with tags , , , , , , , , , , on September 30, 2020 by xi'an

Andrew Wilson wrote a piece about Bayesian deep learning last winter. Which I just read. It starts with the (posterior) predictive distribution being the core of Bayesian model evaluation or of model (epistemic) uncertainty.

“On the other hand, a flat prior may have a major effect on marginalization.”

Interesting sentence, as, from my viewpoint, using a flat prior is a no-no when running model evaluation since the marginal likelihood (or evidence) is no longer a probability density. (Check Lindley-Jeffreys’ paradox in this tribune.) The author then goes for an argument in favour of a Bayesian approach to deep neural networks for the reason that data cannot be informative on every parameter in the network, which should then be integrated out wrt a prior. He also draws a parallel between deep ensemble learning, where random initialisations produce different fits, with posterior distributions, although the equivalent to the prior distribution in an optimisation exercise is somewhat vague.

“…we do not need samples from a posterior, or even a faithful approximation to the posterior. We need to evaluate the posterior in places that will make the greatest contributions to the [posterior predictive].”

The paper also contains an interesting point distinguishing between priors over parameters and priors over functions, ony the later mattering for prediction. Which must be structured enough to compensate for the lack of data information about most aspects of the functions. The paper further discusses uninformative priors (over the parameters) in the O’Bayes sense as a default way to select priors. It is however unclear to me how this discussion accounts for the problems met in high dimensions by standard uninformative solutions. More aggressively penalising priors may be needed, as those found in high dimension variable selection. As in e.g. the 10⁷ dimensional space mentioned in the paper. Interesting read all in all!

logic (not logistic!) regression

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , , on February 12, 2020 by xi'an

A Bayesian Analysis paper by Aliaksandr Hubin, Geir Storvik, and Florian Frommlet on Bayesian logic regression was open for discussion. Here are some hasty notes I made during our group discussion in Paris Dauphine (and later turned into a discussion submitted to Bayesian Analysis):

“Originally logic regression was introduced together with likelihood based model selection, where simulated annealing served as a strategy to obtain one “best” model.”

Indeed, logic regression is not to be confused with logistic regression! Rejection of a true model in Bayesian model choice leads to Bayesian model choice and… apparently to Bayesian logic regression. The central object of interest is a generalised linear model based on a vector of binary covariates and using some if not all possible logical combinations (trees) of said covariates (leaves). The GLM is further using rather standard indicators to signify whether or not some trees are included in the regression (and hence the model). The prior modelling on the model indices sounds rather simple (simplistic?!) in that it is only function of the number of active trees, leading to an automated penalisation of larger trees and not accounting for a possible specificity of some covariates. For instance when dealing with imbalanced covariates (much more 1 than 0, say).

A first question is thus how much of a novel model this is when compared with say an analysis of variance since all covariates are dummy variables. Culling the number of trees away from the exponential of exponential number of possible covariates remains obscure but, without it, the model is nothing but variable selection in GLMs, except for “enjoying” a massive number of variables. Note that there could be a connection with variable length Markov chain models but it is not exploited there.

“…using Jeffrey’s prior for model selection has been widely criticized for not being consistent once the true model coincides with the null model.”

A second point that strongly puzzles me in the paper is its loose handling of improper priors. It is well-known that improper priors are at worst fishy in model choice settings and at best avoided altogether, to wit the Lindley-Jeffreys paradox and friends. Not only does the paper adopts the notion of a same, improper, prior on the GLM scale parameter, which is a position adopted in some of the Bayesian literature, but it also seems to be using an improper prior on each set of parameters (further undifferentiated between models). Because the priors operate on different (sub)sets of parameters, I think this jeopardises the later discourse on the posterior probabilities of the different models since they are not meaningful from a probabilistic viewpoint, with no joint distribution as a reference, neither marginal density. In some cases, p(y|M) may become infinite. Referring to a “simple Jeffrey’s” prior in this setting is therefore anything but simple as Jeffreys (1939) himself shied away from using improper priors on the parameter of interest. I find it surprising that this fundamental and well-known difficulty with improper priors in hypothesis testing is not even alluded to in the paper. Its core setting thus seems to be flawed. Now, the numerical comparison between Jeffrey’s [sic] prior and a regular g-prior exhibits close proximity and I thus wonder at the reason. Could it be that the culling and selection processes end up having the same number of variables and thus eliminate the impact of the prior? Or is it due to the recourse to a Laplace approximation of the marginal likelihood that completely escapes the lack of definition of the said marginal? Computing the normalising constant and repeating this computation while the algorithm is running ignores the central issue.

“…hereby, all states, including all possible models of maximum sized, will eventually be visited.”

Further, I found some confusion between principles and numerics. And as usual bemoan the acronym inflation with the appearance of a GMJMCMC! Where G stands for genetic (algorithm), MJ for mode jumping, and MCMC for…, well no surprise there! I was not aware of the mode jumping algorithm of Hubin and Storvik (2018), so cannot comment on the very starting point of the paper. A fundamental issue with Markov chains on discrete spaces is that the notion of neighbourhood becomes quite fishy and is highly dependent on the nature of the covariates. And the Markovian aspects are unclear because of the self-avoiding aspect of the algorithm. The novel algorithm is intricate and as such seems to require a superlative amount of calibration. Are all modes truly visited, really? (What are memetic algorithms?!)

back to Ockham’s razor

Posted in Statistics with tags , , , , , , , , , on July 31, 2019 by xi'an

“All in all, the Bayesian argument for selecting the MAP model as the single ‘best’ model is suggestive but not compelling.”

Last month, Jonty Rougier and Carey Priebe arXived a paper on Ockham’s factor, with a generalisation of a prior distribution acting as a regulariser, R(θ). Calling on the late David MacKay to argue that the evidence involves the correct penalising factor although they acknowledge that his central argument is not absolutely convincing, being based on a first-order Laplace approximation to the posterior distribution and hence “dubious”. The current approach stems from the candidate’s formula that is already at the core of Sid Chib’s method. The log evidence then decomposes as the sum of the maximum log-likelihood minus the log of the posterior-to-prior ratio at the MAP estimator. Called the flexibility.

“Defining model complexity as flexibility unifies the Bayesian and Frequentist justifications for selecting a single model by maximizing the evidence.”

While they bring forward rational arguments to consider this as a measure model complexity, it remains at an informal level in that other functions of this ratio could be used as well. This is especially hard to accept by non-Bayesians in that it (seriously) depends on the choice of the prior distribution, as all transforms of the evidence would. I am thus skeptical about the reception of the argument by frequentists…

O’Bayes 19/4

Posted in Books, pictures, Running, Statistics, Travel, University life with tags , , , , , , , , , , , on July 4, 2019 by xi'an

Last talks of the conference! With Rui Paulo (along with Gonzalo Garcia-Donato) considering the special case of factors when doing variable selection. Which is an interesting question that I had never considered, as at best I would remove all leves or keeping them all. Except that there may be misspecification in the factors as for instance when several levels have the same impact.With Michael Evans discussing a paper that he wrote for the conference! Following his own approach to statistical evidence. And including his reluctance to cover infinity (calling on Gauß for backup!) or continuity, and his call to falsify a Bayesian model by checking it can be contradicted by the data. His assumption that checking for prior is separable from checking for [sampling] model is debatable. (With another mention made of the Savage-Dickey ratio.)

And with Dimitris Fouskakis giving a wide ranging assessment [which Mark Steel (Warwick) called a PEP talk!] of power-expected-posterior priors, used with reference (and usually improper) priors. Which in retrospect would have suited better the beginning of the conference as it provided a background to several of the talks. Raising a question (from my perspective) on using the maximum likelihood estimator as a pseudo-sufficient statistic when this MLE is computed for the base (simplest) model. Maybe an ABC induced bias in this question as it would not work for ABC model choice.

Overall, I think the scientific outcomes of the conference were quite positive: a wide range of topics and perspectives, a reasonable and diverse attendance, especially when considering the heavy load of related conferences in the surrounding weeks (the “June fatigue”!), animated poster sessions. I am obviously not the one to assess the organisation of the conference! Things I forgot to do in this regard: organise transportation from Oxford to Warwick University, provide an attached room for in-pair research, insist on sustainability despite the imposed catering solution, facilitate sharing joint transportation to and from the Warwick campus, mention that tap water was potable, and… wear long pants when running in nettles.

Siem Reap conference

Posted in Kids, pictures, Travel, University life with tags , , , , , , , , , , , , , , , , , , on March 8, 2019 by xi'an

As I returned from the conference in Siem Reap. on a flight avoiding India and Pakistan and their [brittle and bristling!] boundary on the way back, instead flying far far north, near Arkhangelsk (but with nothing to show for it, as the flight back was fully in the dark), I reflected how enjoyable this conference had been, within a highly friendly atmosphere, meeting again with many old friends (some met prior to the creation of CREST) and new ones, a pleasure not hindered by the fabulous location near Angkor of course. (The above picture is the “last hour” group picture, missing a major part of the participants, already gone!)

Among the many talks, Stéphane Shao gave a great presentation on a paper [to appear in JASA] jointly written with Pierre Jacob, Jie Ding, and Vahid Tarokh on the Hyvärinen score and its use for Bayesian model choice, with a highly intuitive representation of this divergence function (which I first met in Padua when Phil Dawid gave a talk on this approach to Bayesian model comparison). Which is based on the use of a divergence function based on the squared error difference between the gradients of the true log-score and of the model log-score functions. Providing an alternative to the Bayes factor that can be shown to be consistent, even for some non-iid data, with some gains in the experiments represented by the above graph.

Arnak Dalalyan (CREST) presented a paper written with Lionel Riou-Durand on the convergence of non-Metropolised Langevin Monte Carlo methods, with a new discretization which leads to a substantial improvement of the upper bound on the sampling error rate measured in Wasserstein distance. Moving from p/ε to √p/√ε in the requested number of steps when p is the dimension and ε the target precision, for smooth and strongly log-concave targets.

This post gives me the opportunity to advertise for the NGO Sala Baï hostelry school, which the whole conference visited for lunch and which trains youths from underprivileged backgrounds towards jobs in hostelery, supported by donations, companies (like Krama Krama), or visiting the Sala Baï  restaurant and/or hotel while in Siem Reap.