tractable Bayesian variable selection: beyond normality
David Rossell and Francisco Rubio (both from Warwick) arXived a month ago a paper on non-normal variable selection. They use two-piece error models that preserve manageable inference and allow for simple computational algorithms, but also characterise the behaviour of the resulting variable selection process under model misspecification. Interestingly, they show that the existence of asymmetries or heavy tails leads to power losses when using the Normal model. The two-piece error distribution is made of two halves of location-scale transforms of the same reference density on the two sides of the common location parameter. In this paper, the density is either Gaussian or Laplace (i.e., exponential?). In both cases the (log-)likelihood has a nice compact expression (although it does not allow for a useful sufficient statistic). One is the L¹ version versus the other which is the L² version. Which is the main reason for using this formalism based on only two families of parametric distributions, I presume. (As mentioned in an earlier post, I do not consider those distributions as mixtures because the component of a given observation can always be identified. And because as shown in the current paper, maximum likelihood estimates can be easily derived.) The prior construction follows the non-local prior principles of Johnson and Rossell (2010, 2012) also discussed in earlier posts. The construction is very detailed and hence highlights how many calibration steps are needed in the process.
“Bayes factor rates are the same as when the correct model is assumed [but] model misspecification often causes a decrease in the power to detect truly active variables.”
When there are too many models to compare at once, the authors propose a random walk on the finite set of models (which does not require advanced measure-theoretic tools like reversible jump MCMC). One interesting aspect is that moving away from the normal to another member of this small family is driven by the density of the data under the marginal densities, which means moving only to interesting alternatives. But also sticking to the normal only for adequate datasets. In a sense this is not extremely surprising given that the marginal likelihoods (model-wise) are available. It is also interesting that on real datasets, one of the four models is heavily favoured against the others, be it Normal (6.3) or Laplace (6.4). And that the four model framework returns almost identical values when compared with a single (most likely) model. Although not immensely surprising when acknowledging that the frequency of the most likely model is 0.998 and 0.998, respectively.
“Our framework represents a middle-ground to add flexibility in a parsimonious manner that remains analytically and computationally tractable, facilitating applications where either p is large or n is too moderate to fit more flexible models accurately.”
Overall, I find the experiment quite conclusive and do not object [much] to this choice of parametric family in that it is always more general and generic than the sempiternal Gaussian model. That we picked in our Bayesian Essentials, following tradition. In a sense, it would be natural to pick the most general possible parametric family that allows for fast computations, if this notion does make any sense…