## logic (not logistic!) regression

Posted in Books, Statistics, University life on February 12, 2020 by xi'an

A Bayesian Analysis paper by Aliaksandr Hubin, Geir Storvik, and Florian Frommlet on Bayesian logic regression was open for discussion. Here are some hasty notes I made during our group discussion in Paris Dauphine (and later turned into a discussion submitted to Bayesian Analysis):

“Originally logic regression was introduced together with likelihood based model selection, where simulated annealing served as a strategy to obtain one “best” model.”

Indeed, logic regression is not to be confused with logistic regression! Rejection of a true model in Bayesian model choice leads to Bayesian model choice and… apparently to Bayesian logic regression. The central object of interest is a generalised linear model based on a vector of binary covariates, using some if not all possible logical combinations (trees) of said covariates (leaves). The GLM further uses rather standard indicators to signify whether or not some trees are included in the regression (and hence the model). The prior modelling on the model indices sounds rather simple (simplistic?!) in that it is only a function of the number of active trees, leading to an automated penalisation of larger trees while not accounting for a possible specificity of some covariates, for instance imbalanced covariates (many more 1's than 0's, say).
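For concreteness, here is a toy construction of such logic covariates and of a size-only model prior (a sketch under my own simplifications: depth-one trees built from AND/OR pairs only, and a made-up penalty constant; none of this is taken from the paper):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 4
X = rng.integers(0, 2, size=(n, p))      # binary covariates = the leaves

def logic_features(X):
    """Depth-one logic 'trees': all pairwise AND / OR combinations of leaves."""
    feats, names = [], []
    for i, j in itertools.combinations(range(X.shape[1]), 2):
        feats.append(X[:, i] & X[:, j]); names.append(f"x{i}&x{j}")
        feats.append(X[:, i] | X[:, j]); names.append(f"x{i}|x{j}")
    return np.column_stack(feats), names

L, names = logic_features(X)             # 2 * C(4,2) = 12 tree covariates

# a model prior that is only a function of the number of active trees,
# p(M) proportional to exp(-lam * |M|), penalising larger models uniformly
lam = 1.0
def log_model_prior(active):
    return -lam * sum(active)
```

The point of the sketch is that such a prior is blind to which covariates enter a tree, which is exactly the specificity issue raised above for imbalanced covariates.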

A first question is thus how much of a novel model this is when compared with, say, an analysis of variance, since all covariates are dummy variables. Culling the number of trees down from the doubly exponential number of possible covariates remains obscure but, without it, the model is nothing but variable selection in GLMs, except for "enjoying" a massive number of variables. Note that there could be a connection with variable length Markov chain models but it is not exploited there.

“…using Jeffrey’s prior for model selection has been widely criticized for not being consistent once the true model coincides with the null model.”

A second point that strongly puzzles me in the paper is its loose handling of improper priors. It is well-known that improper priors are at worst fishy in model choice settings and at best avoided altogether, to wit the Lindley-Jeffreys paradox and friends. Not only does the paper adopt the notion of a same, improper, prior on the GLM scale parameter, a position adopted in some of the Bayesian literature, but it also seems to be using an improper prior on each set of parameters (further undifferentiated between models). Because the priors operate on different (sub)sets of parameters, I think this jeopardises the later discourse on the posterior probabilities of the different models, since they are not meaningful from a probabilistic viewpoint, with no joint distribution as a reference, nor a marginal density. In some cases, p(y|M) may become infinite. Referring to a "simple Jeffrey's" prior in this setting is therefore anything but simple, as Jeffreys (1939) himself shied away from using improper priors on the parameter of interest. I find it surprising that this fundamental and well-known difficulty with improper priors in hypothesis testing is not even alluded to in the paper. Its core setting thus seems to be flawed. Now, the numerical comparison between Jeffrey's [sic] prior and a regular g-prior exhibits close proximity and I thus wonder at the reason. Could it be that the culling and selection processes end up having the same number of variables and thus eliminate the impact of the prior? Or is it due to the recourse to a Laplace approximation of the marginal likelihood that completely escapes the lack of definition of the said marginal? Computing the normalising constant and repeating this computation while the algorithm is running ignores the central issue.
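The root of the difficulty is the Lindley-Jeffreys phenomenon, which can be illustrated in its simplest Gaussian form (a toy computation with made-up numbers, unrelated to the models of the paper): testing θ=0 against θ~N(0,τ²) for Normal observations, the Bayes factor in favour of the null grows without bound as the prior variance does, so an improper flat prior, being the τ²→∞ limit, returns no usable evidence.

```python
import math

def log_norm_pdf(x, var):
    # log density of a N(0, var) evaluated at x
    return -0.5 * (math.log(2 * math.pi * var) + x * x / var)

n, ybar = 50, 0.3                      # hypothetical sample size and sample mean
log_bf01 = {}
for tau2 in (1.0, 100.0, 1e6):
    m0 = log_norm_pdf(ybar, 1 / n)            # marginal of ybar under H0: theta = 0
    m1 = log_norm_pdf(ybar, tau2 + 1 / n)     # marginal under H1: theta ~ N(0, tau2)
    log_bf01[tau2] = m0 - m1
# the vaguer the prior (larger tau2), the larger the log Bayes factor for H0:
# in the improper limit the "evidence" for the null diverges
```

This is why posterior model probabilities built from improper priors on model-specific parameters carry an arbitrary multiplicative constant and lose probabilistic meaning.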

“…hereby, all states, including all possible models of maximum sized, will eventually be visited.”

Further, I found some confusion between principles and numerics. And, as usual, I bemoan the acronym inflation with the appearance of a GMJMCMC! Where G stands for genetic (algorithm), MJ for mode jumping, and MCMC for…, well, no surprise there! I was not aware of the mode jumping algorithm of Hubin and Storvik (2018), so cannot comment on the very starting point of the paper. A fundamental issue with Markov chains on discrete spaces is that the notion of neighbourhood becomes quite fishy and is highly dependent on the nature of the covariates. And the Markovian aspects are unclear because of the self-avoiding aspect of the algorithm. The novel algorithm is intricate and as such seems to require a superlative amount of calibration. Are all modes truly visited, really? (What are memetic algorithms?!)

## Greek variations on power-expected-posterior priors

Posted in Books, Statistics, University life on October 5, 2016 by xi'an

Dimitris Fouskakis, Ioannis Ntzoufras and Konstantinos Perrakis, from Athens, have just arXived a paper on power-expected-posterior priors. Just like the power prior and the expected-posterior prior, this approach aims at avoiding improper priors by the use of imaginary data, whose distribution is itself the marginal against another prior. (In the papers I wrote on that topic with Juan Antonio Cano and Diego Salmerón, we used MCMC to figure out a fixed point for such priors.)

The current paper (which I only perused) studies properties of two versions of power-expected-posterior priors proposed in an earlier paper by the same authors, for the normal linear model. They use a posterior derived from an unnormalised powered likelihood, either (DR) integrated in the imaginary data against the prior predictive distribution of the reference model based on the powered likelihood, or (CR) integrated in the imaginary data against the prior predictive distribution of the reference model based on the actual likelihood. The baseline model is the g-prior with g=n². Both versions lead to a marginal likelihood that is similar to BIC and hence consistent. The DR version coincides with the original power-expected-posterior prior in the linear case, while the CR version involves a change of covariance matrix. All in all, the CR version tends to favour less complex models, but is less parsimonious as a variable selection tool, which sounds a wee bit contradictory. Overall, I thus feel (possibly incorrectly) that the paper is more an appendix to the earlier paper than a paper in itself, as I do not get in the end a clear impression of which method should be preferred.
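To see why a g-prior baseline with g=n² behaves like BIC, one can use the standard closed-form g-prior marginal likelihood for the normal linear model (written up to a model-free constant; the sample size, R² values, and model dimensions below are purely illustrative, not taken from the paper):

```python
import math

def log_gprior_marginal(n, k, R2, g):
    # Zellner g-prior marginal likelihood of a linear model with k predictors,
    # coefficient of determination R2, up to a constant common to all models
    return (0.5 * (n - k - 1) * math.log(1 + g)
            - 0.5 * (n - 1) * math.log(1 + g * (1 - R2)))

n = 100
g = n ** 2                # the baseline choice in the paper
# a larger model must improve R2 substantially to overcome the g = n^2 penalty:
small = log_gprior_marginal(n, k=2, R2=0.50, g=g)    # 2 predictors
big   = log_gprior_marginal(n, k=10, R2=0.52, g=g)   # 10 predictors, tiny R2 gain
```

With g growing with n, each extra predictor costs roughly ½ log g ≈ log n on the log-marginal scale, which is the BIC penalty and the source of the consistency mentioned above.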

## top model choice week (#3)

Posted in Statistics, University life on June 19, 2013 by xi'an

To conclude this exciting week, there will be a final seminar by Veronika Rockovà (Erasmus University) on Friday, June 21, at 11am at ENSAE in Room 14. Here is her abstract:

11am: Fast Dynamic Posterior Exploration for Factor Augmented Multivariate Regression by Veronika Rockova

Advancements in high-throughput experimental techniques have facilitated the availability of diverse genomic data, which provide complementary information regarding the function and organization of gene regulatory mechanisms. The massive accumulation of data has increased demands for more elaborate modeling approaches that combine the multiple data platforms. We consider a sparse factor regression model, which augments the multivariate regression approach by adding a latent factor structure, thereby allowing for dependent patterns of marginal covariance between the responses. In order to enable the identification of parsimonious structure, we impose spike and slab priors on the individual entries in the factor loading and regression matrices. The continuous relaxation of the point mass spike and slab enables the implementation of a rapid EM inferential procedure for dynamic posterior model exploration. This is accomplished by considering a nested sequence of spike and slab priors and various factor space cardinalities. Identified candidate models are evaluated by a conditional posterior model probability criterion, permitting trans-dimensional comparisons. Patterned sparsity manifestations such as an orthogonal allocation of zeros in factor loadings are facilitated by structured priors on the binary inclusion matrix. The model is applied to a problem of integrating two genomic datasets, where expression of microRNA’s is related to the expression of genes with an underlying connectivity pathway network.

## top model choice week (#2)

Posted in Statistics, University life on June 18, 2013 by xi'an

Following Ed George (Wharton) and Feng Liang (University of Illinois at Urbana-Champaign) talks today in Dauphine, Natalia Bochkina (University of Edinburgh) will give a talk on Thursday, June 20, at 2pm in Room 18 at ENSAE (Malakoff) [not Dauphine!]. Here is her abstract:

2pm: Simultaneous local and global adaptivity of Bayesian wavelet estimators in nonparametric regression by Natalia Bochkina

We consider wavelet estimators in the context of nonparametric regression, with the aim of finding estimators that simultaneously achieve the local and global adaptive minimax rate of convergence. It is known that one estimator – the James-Stein block thresholding estimator of T. Cai (2008) – achieves simultaneously both optimal rates of convergence but over a limited set of Besov spaces; in particular, over the sets of spatially inhomogeneous functions (with 1 ≤ p < 2) the upper bound on the global rate of this estimator is slower than the optimal minimax rate.
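For readers who have not met block thresholding, here is a stylised blockwise James-Stein rule in the spirit of Cai's estimator (the block size, noise level, and threshold constant are indicative choices of mine, not a faithful implementation of the cited paper):

```python
import numpy as np

def block_james_stein(coeffs, block_size=5, lam=4.505, sigma=1.0):
    """Shrink noisy wavelet coefficients block by block: within each block,
    apply the James-Stein factor max(0, 1 - lam * L * sigma^2 / ||block||^2),
    killing blocks dominated by noise and keeping strong-signal blocks."""
    out = np.zeros_like(coeffs, dtype=float)
    for start in range(0, len(coeffs), block_size):
        block = coeffs[start:start + block_size]
        s2 = np.sum(block ** 2)
        shrink = max(0.0, 1.0 - lam * len(block) * sigma ** 2 / s2) if s2 > 0 else 0.0
        out[start:start + block_size] = shrink * block
    return out

# a small-noise block is zeroed, a strong-signal block is nearly untouched
coeffs = np.array([0.1] * 5 + [10.0] * 5)
denoised = block_james_stein(coeffs)
```

Sharing one shrinkage factor within a block is what buys the global (mean-square) adaptivity, at the price of the local limitations over inhomogeneous Besov classes discussed in the abstract.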

Another possible candidate to achieve both rates of convergence simultaneously is the Empirical Bayes estimator of Johnstone and Silverman (2005), which is an adaptive estimator that achieves the global minimax rate over a wide range of Besov spaces and Besov balls. The maximum marginal likelihood approach is used to estimate the hyperparameters, and it can be interpreted as a Bayesian estimator with a uniform prior. We show that it also achieves the adaptive local minimax rate over all Besov spaces, and hence it does indeed achieve both local and global rates of convergence simultaneously over Besov spaces. We also give an example of how it works in practice.

## top model choice week

Posted in Statistics, University life on June 13, 2013 by xi'an

Next week, we are having a special Bayesian [top] model choice week in Dauphine, thanks to the simultaneous visits of Ed George (Wharton), Feng Liang (University of Illinois at Urbana-Champaign), and Veronika Rockovà (Erasmus University). To start the week and get to know the local actors (!), Ed and Feng both give a talk on Tuesday, June 18, at 11am and 1pm in Room C108. Here are the abstracts:

11am: Prediction and Model Selection for Multi-task Learning by Feng Liang

In multi-task learning one simultaneously fits multiple regression models. We are interested in inference problems like model selection and prediction when there are a large number of tasks. A simple version of such models is a one-way ANOVA model where the number of replicates is fixed but the number of groups goes to infinity. We examine the consistency of Bayesian procedures using Zellner (1986)’s g-prior and its variants (such as mixed g-priors and Empirical Bayes), and compare their prediction accuracy with other procedures, such as the ones based on AIC/BIC and the group Lasso. Our results indicate that the Empirical Bayes procedure (with some modification for the large p small n setting) can achieve model selection consistency, and also have better estimation accuracy than other procedures being considered. During my talk, I’ll focus on the analysis of the one-way ANOVA model, but will also give a summary of our findings for multi-task learning involving a more general regression setting. This is based on joint work with my PhD student Bin Li from University of Illinois at Urbana-Champaign.

1pm: EMVS: The EM Approach to Bayesian Variable Selection by Edward George

Despite rapid developments in stochastic search algorithms, the practicality of Bayesian variable selection methods has continued to pose challenges. High-dimensional data are now routinely analyzed, typically with many more covariates than observations. To broaden the applicability of Bayesian variable selection for such high-dimensional linear regression contexts, we propose EMVS, a deterministic alternative to stochastic search based on an EM algorithm which exploits a conjugate mixture prior formulation to quickly find posterior modes. Combining a spike-and-slab regularization diagram for the discovery of active predictor sets with subsequent rigorous evaluation of posterior model probabilities, EMVS rapidly identifies promising sparse high posterior probability submodels. External structural information such as likely covariate groupings or network topologies is easily incorporated into the EMVS framework. Deterministic annealing variants are seen to improve the effectiveness of our algorithms by mitigating the posterior multi-modality associated with variable selection priors. The usefulness of the EMVS approach is demonstrated on real high-dimensional data, where computational complexity renders stochastic search less practical. This is joint work with Veronika Rockova of Erasmus University.
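The EM recursion at the heart of such spike-and-slab approaches can be sketched in a few lines (my own minimal version, with σ² and the prior inclusion probability held fixed for simplicity, whereas EMVS treats them as unknown and adds deterministic annealing):

```python
import numpy as np

def spike_slab_em(X, y, v0=0.01, v1=10.0, theta=0.5, n_iter=50):
    """Minimal EM for a continuous spike-and-slab prior:
    beta_j ~ (1 - gamma_j) N(0, v0) + gamma_j N(0, v1), gamma_j ~ Bern(theta)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    for _ in range(n_iter):
        # E-step: posterior inclusion probabilities given the current beta
        dens1 = theta * np.exp(-beta ** 2 / (2 * v1)) / np.sqrt(v1)
        dens0 = (1 - theta) * np.exp(-beta ** 2 / (2 * v0)) / np.sqrt(v0)
        pstar = dens1 / (dens1 + dens0)
        # M-step: ridge regression with coefficient-specific penalties,
        # heavy shrinkage (1/v0) for likely-spike, light (1/v1) for likely-slab
        d = (1 - pstar) / v0 + pstar / v1
        beta = np.linalg.solve(X.T @ X + np.diag(d), X.T @ y)
    return beta, pstar

# hypothetical demo: two strong signals among five predictors
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_beta = np.array([3.0, 0.0, 0.0, 2.0, 0.0])
y = X @ true_beta + 0.5 * rng.normal(size=100)
beta_hat, incl_prob = spike_slab_em(X, y)
```

Each iteration is a closed-form ridge solve, which is what makes the deterministic search over a ladder of (v0, v1) values so much faster than a stochastic search over 2^p models.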