## Archive for overfitting

## curve fittings [xkcd]

Posted in Books, Kids with tags curve estimation, extrapolation, graphics, LOESS, overfitting, splines, xkcd on November 4, 2018 by xi'an

## JSM 2018 [#4½]

Posted in Statistics, University life with tags anchor, British Columbia, Canada, handbook of mixture analysis, JSM 2018, overfitting, prior selection, regularisation, Vancouver on August 10, 2018 by xi'an

**A**s I wrote my previous blog entry on JSM2018 before the sessions, I did not have the chance to comment on our mixture session, which I found most interesting, with new entries on the topic and a great discussion by Bettina Grün. Including the important call for linking the weights with the other parameters, as making both groups independent does not make sense when the number of components is uncertain. (Incidentally, our paper with Kaniav Kamary and Kate Lee does create a dependence.) The talk by Deborah Kunkel was about anchored mixture estimation, a joint work with Mario Peruggia, another arXival that I had missed.

The notion of anchoring found in this paper is to allocate specific observations to specific components. These observations are thus *anchored* to these components. Among other things, this modification of the sampling model implies a removal of the unidentifiability problem. Hence, formally, of the label-switching (or lack thereof) issue. (Although, as Peter Green repeatedly mentioned, visualising the parameter space as a point process eliminates the issue.) This idea is somewhat connected with the constraint Jean Diebolt and I imposed in our 1990 mixture paper, namely that no component would have fewer than two observations allocated to it, but imposing which ones are which of course drastically reduces the complexity of the model. Another (related) aspect of anchoring is that the observations that are anchored to the components act as parts of the prior model, modifying the initial priors (which can then become improper as in our 1990 paper). The difficulty of the anchoring approach is to find observations to anchor in an unsupervised setting. The paper proceeds by optimising the allocations, which somewhat turns the prior into a data-dependent prior since all observations are used to set the anchors and then used again for the standard Bayesian processing. In that respect, I would rather follow the sequential procedure developed by Nicolas Chopin and Florian Pelgrin, where the number of components grows by steps with the number of observations.
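To make the anchoring idea concrete, here is a minimal sketch of a single allocation sweep in a Gaussian mixture where some observations are pinned to given components. This is only an illustration under my own assumptions, not the algorithm of the paper: the function names, the toy data, the known means and standard deviations, and the choice of anchors are all hypothetical.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a N(mean, sd^2) distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def sample_allocations(data, weights, means, sds, anchors, rng):
    """One allocation sweep for a Gaussian mixture with known parameters.

    `anchors` (hypothetical name) maps an observation index to the
    component it is pinned to; those allocations are never resampled.
    """
    labels = []
    for i, x in enumerate(data):
        if i in anchors:
            # Anchored observation: its component is fixed in advance.
            labels.append(anchors[i])
            continue
        # Free observation: sample its component from the conditional
        # distribution, proportional to weight times component density.
        probs = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
        labels.append(rng.choices(range(len(weights)), weights=probs)[0])
    return labels


# Toy example (made-up values): first and last observations are anchored.
data = [-2.1, -1.9, 0.1, 2.0, 2.2]
anchors = {0: 0, 4: 1}
labels = sample_allocations(data, [0.5, 0.5], [-2.0, 2.0], [1.0, 1.0],
                            anchors, random.Random(1))
```

Since component 0 always contains observation 0 and component 1 always contains observation 4, the two labels can no longer be exchanged, which is the removal of unidentifiability mentioned above.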

## JSM 2018 [#1]

Posted in Mountains, Statistics, Travel, University life with tags British Columbia, Canada, curse of dimensionality, deep learning, GANs, JSM 2018, overfitting, regularisation, sparsity, stochastic gradient descent, Vancouver on July 30, 2018 by xi'an

**A**s our direct flight from Paris landed in the morning in Vancouver, we found ourselves in the unusual situation of having a few hours to kill before accessing our rental, and where else better than a general introduction to deep learning in the first round of sessions at JSM2018?! In my humble opinion, or maybe just because it was past midnight in Paris time!, the talk was pretty uninspiring in missing the natural question of the possible connections between the construction of a prediction function and statistics. Watching improving performances at classifying human faces does not tell much more than creating a massively non-linear function in high dimensions with nicely designed error penalties. Most of the talk droned on about neural networks and their fitting by back-propagation and the variations on stochastic gradient descent. Not addressing much the rather natural (?) questions about the choice of functions at each level, of the number of levels, of the penalty term, or regulariser, and even less the reason why no sparsity is imposed on the structure, despite the humongous number of parameters involved. What came close [but not that close] to sparsity is the notion of dropout, which is a sort of purely automated culling of the nodes, and which was new to me. More like a sort of randomisation that turns the optimisation criterion into an average. Only at the end of the presentation did more relevant questions emerge, presenting unsupervised learning as density estimation, the pivot being the generative features of (most) statistical models. And GANs of course. But nonetheless missing an explanation as to why models with massive numbers of parameters can be considered in this setting and not in standard statistics.
(One slide about deterministic auto-encoders was somewhat puzzling in that it seemed to repeat the “fiducial mistake”.)
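As a toy illustration of dropout turning the criterion into an average, the sketch below enumerates every dropout mask for a single linear unit with squared-error loss and averages the loss over the masks. Under these assumptions (linear unit, independent Bernoulli masks with the usual 1/keep_prob rescaling, all names and values being mine rather than the talk's), the averaged criterion is exactly the original loss plus a ridge-like penalty, which is one way to see dropout as a regulariser.

```python
import itertools


def averaged_dropout_loss(weights, inputs, target, keep_prob):
    """Squared-error loss of a linear unit, averaged over all dropout masks."""
    n = len(inputs)
    total = 0.0
    for mask in itertools.product([0, 1], repeat=n):
        kept = sum(mask)
        # Probability of this mask under independent Bernoulli(keep_prob) draws.
        prob = keep_prob ** kept * (1.0 - keep_prob) ** (n - kept)
        # Kept inputs are rescaled by 1/keep_prob ("inverted" dropout).
        output = sum(w * x * m / keep_prob
                     for w, x, m in zip(weights, inputs, mask))
        total += prob * (target - output) ** 2
    return total


def penalised_loss(weights, inputs, target, keep_prob):
    """Loss without dropout plus the variance term induced by the masks."""
    full_output = sum(w * x for w, x in zip(weights, inputs))
    penalty = (1.0 - keep_prob) / keep_prob * sum(
        (w * x) ** 2 for w, x in zip(weights, inputs))
    return (target - full_output) ** 2 + penalty


# Made-up values for illustration only.
w, x, y, p = [0.5, -1.0, 2.0], [1.0, 2.0, 0.5], 1.0, 0.8
gap = abs(averaged_dropout_loss(w, x, y, p) - penalised_loss(w, x, y, p))
```

The identity is exact only in this linear, squared-error case; for an actual deep network the average over masks has no such closed form and is approximated by sampling one mask per gradient step.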

## Bayesian regression trees [seminar]

Posted in pictures, Statistics, University life with tags Bayesian CART, Bayesian inference, CREST, ENSAE, overfitting, Paris-Saclay campus, random histogram, regression trees, seminar, talk, tree on January 26, 2018 by xi'an

**D**uring her visit to Paris, Veronika Ročková (Chicago Booth) will give a talk in ENSAE-CREST on the Saclay Plateau at 2pm. Here is the abstract:

**Posterior Concentration for Bayesian Regression Trees and Ensembles**

(joint with Stephanie van der Pas)

Since their inception in the 1980’s, regression trees have been one of the more widely used non-parametric prediction methods. Tree-structured methods yield a histogram reconstruction of the regression surface, where the bins correspond to terminal nodes of recursive partitioning. Trees are powerful, yet susceptible to over-fitting. Strategies against overfitting have traditionally relied on pruning greedily grown trees. The Bayesian framework offers an alternative remedy against overfitting through priors. Roughly speaking, a good prior charges smaller trees where overfitting does not occur. While the consistency of random histograms, trees and their ensembles has been studied quite extensively, the theoretical understanding of the Bayesian counterparts has been missing. In this paper, we take a step towards understanding why/when do Bayesian trees and their ensembles not overfit. To address this question, we study the speed at which the posterior concentrates around the true smooth regression function. We propose a spike-and-tree variant of the popular Bayesian CART prior and establish new theoretical results showing that regression trees (and their ensembles) (a) are capable of recovering smooth regression surfaces, achieving optimal rates up to a log factor, (b) can adapt to the unknown level of smoothness and (c) can perform effective dimension reduction when p>n. These results provide a piece of missing theoretical evidence explaining why Bayesian trees (and additive variants thereof) have worked so well in practice.
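As a rough illustration of a prior penalising deep trees, here is a sketch of the tree-shape part of the classical Bayesian CART prior of Chipman, George and McCulloch, where a node at depth d splits with probability α(1+d)^(-β). This is my assumption about what "the popular Bayesian CART prior" refers to, and it is not the spike-and-tree variant of the abstract. The nested-tuple encoding of trees and the function names are hypothetical, and the prior on splitting rules is left out.

```python
import math


def split_probability(depth, alpha=0.95, beta=2.0):
    """Probability that a node at the given depth is split (root has depth 0)."""
    return alpha * (1.0 + depth) ** (-beta)


def log_tree_prior(tree, depth=0, alpha=0.95, beta=2.0):
    """Log prior of a binary tree shape.

    `tree` is None for a terminal node, or a (left, right) pair for a split.
    """
    p = split_probability(depth, alpha, beta)
    if tree is None:
        # Terminal node: the node was not split.
        return math.log(1.0 - p)
    left, right = tree
    return (math.log(p)
            + log_tree_prior(left, depth + 1, alpha, beta)
            + log_tree_prior(right, depth + 1, alpha, beta))


one_split = (None, None)               # a root with two leaves
full_depth_two = (one_split, one_split)  # a balanced tree with four leaves
```

Comparing `log_tree_prior(one_split)` with `log_tree_prior(full_depth_two)` shows the deeper tree receiving less prior mass, since the split probability decays polynomially with depth.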

## Dirichlet process mixture inconsistency

Posted in Books, Statistics with tags Dirichlet process, mixtures of distributions, NIPS, overfitting, unknown number of components on February 15, 2016 by xi'an

**J**udith Rousseau pointed out to me this NIPS paper by Jeff Miller and Matthew Harrison on the possible inconsistency of Dirichlet mixture priors for estimating the (true) number of components in a (true) mixture model. The resulting posterior on the number of components does not concentrate on the right number of components. Which is not the case when setting a prior on the unknown number of components of a mixture, where consistency occurs. (The inconsistency results established in the paper are actually focussed on iid Gaussian observations, for which the estimated number of Gaussian components is almost never equal to 1.) In a more recent arXiv paper, they also show that a Dirichlet prior on the weights and a prior on the number of components can still produce the same features as Dirichlet mixture priors. Even the stick-breaking representation! (A paper that I already reviewed last Spring.)
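For some intuition (and no more than that, since the paper deals with the posterior), the sketch below looks at the number of clusters under the Dirichlet process prior alone, through its Chinese restaurant representation. With concentration parameter α, observation i+1 opens a new cluster with probability α/(α+i), so the expected number of clusters after n observations is the sum of these terms and keeps growing with n. The function names are mine.

```python
import random


def expected_clusters(alpha, n):
    """Prior expected number of clusters among n observations."""
    return sum(alpha / (alpha + i) for i in range(n))


def simulate_crp_clusters(alpha, n, rng):
    """Simulate a Chinese restaurant process and return the number of clusters."""
    counts = []
    for i in range(n):
        if rng.random() < alpha / (alpha + i):
            # Open a new cluster (always the case for the first observation).
            counts.append(1)
        else:
            # Join an existing cluster with probability proportional to its size.
            k = rng.choices(range(len(counts)), weights=counts)[0]
            counts[k] += 1
    return len(counts)


simulated = simulate_crp_clusters(1.0, 50, random.Random(0))
```

The expectation behaves like α log n for large n, so the prior never settles on a fixed number of clusters, in contrast with a mixture that puts a proper prior on a finite number of components.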

## mixtures of mixtures

Posted in pictures, Statistics, University life with tags arXiv, Austria, clustering, k-mean clustering algorithm, Linkz, map, MCMC, mixture, overfitting, Wien on March 9, 2015 by xi'an

**A**nd yet another arXival of a paper on mixtures! This one is written by Gertraud Malsiner-Walli, Sylvia Frühwirth-Schnatter, and Bettina Grün, from the Johannes Kepler University Linz and the Wirtschaftsuniversität Wien I visited last September. With the exact title being Identifying mixtures of mixtures using Bayesian estimation.

So, what *is* a mixture of mixtures if not a mixture?! Or if not *only* a mixture. The upper mixture level is associated with clusters, while the lower mixture level is used for modelling the distribution of a given cluster. Because the cluster needs to be real enough, the components of the mixture are assumed to be heavily overlapping. The paper thus spends a large amount of space on detailing the construction of the associated hierarchical prior. Which in particular implies defining through the prior what a cluster means. The paper also connects with the overfitting mixture idea of Rousseau and Mengersen (2011, Series B). At the cluster level, the Dirichlet hyperparameter is chosen to be very small, 0.001, which empties superfluous clusters but sounds rather arbitrary (which is the reason why we did not go for such small values in our testing/mixture modelling). By contrast, the mixture weights have a hyperparameter staying (far) away from zero. The MCMC implementation is based on a standard Gibbs sampler and the outcome is analysed and sorted by estimating the “true” number of clusters as the MAP and by selecting MCMC simulations conditional on that value. From there clusters are identified via the point process representation of a mixture posterior. Using a standard k-means algorithm.
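To see why a Dirichlet hyperparameter as small as 0.001 empties superfluous clusters, here is a small sketch of the allocations induced by a symmetric Dirichlet prior on the weights, with the weights integrated out (a Pólya urn scheme). It ignores the data altogether, so it only reflects the prior pull towards few occupied clusters, not the sampler of the paper. The names `occupied_components` and `e0`, as well as the settings of 100 observations and 10 components, are my own choices.

```python
import random


def occupied_components(n_obs, n_components, e0, rng):
    """Number of non-empty components after allocating n_obs observations.

    Each observation joins component k with probability proportional to
    (current count of k) + e0, which is the allocation rule obtained when
    the weights follow a symmetric Dirichlet(e0, ..., e0) prior.
    """
    counts = [0] * n_components
    for _ in range(n_obs):
        urn_weights = [c + e0 for c in counts]
        k = rng.choices(range(n_components), weights=urn_weights)[0]
        counts[k] += 1
    return sum(1 for c in counts if c > 0)


rng = random.Random(0)
sparse = occupied_components(100, 10, 0.001, rng)  # tiny hyperparameter
dense = occupied_components(100, 10, 4.0, rng)     # hyperparameter away from zero
```

With the tiny hyperparameter nearly all observations pile up in the first occupied component, while a value away from zero spreads them over essentially all ten components, matching the contrast drawn above between the cluster-level and the within-cluster weights.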

The remainder of the paper illustrates the approach on simulated and real datasets. Recovering in those small dimension setups the number of clusters used in the simulation or found in other studies. As noted in the conclusion, using solely a Gibbs sampler with such a large number of components is rather perilous since it may get stuck close to suboptimal configurations. Especially with very small Dirichlet hyperparameters.