Classical statistical methods developed during the last century are suitable when the number of observations is much larger than the number of parameters to infer. However, many fields such as astronomy, genetics, medicine or neuroscience now produce large and complex data sets whose dimension is much larger than the number of experimental units. Such data are said to be high-dimensional. To face this challenge, often called the curse of dimensionality, new methodologies based on sparsity assumptions have been developed.
The goal of these courses is to present the main concepts and ideas on selected topics of high-dimensional statistics such as variable selection, nonparametric estimation, supervised and unsupervised classification, or multiple testing. Theoretical aspects are motivated by practical applications of the methods presented.
Course 1 (Vincent Rivoirard): Estimation in the high-dimensional setting
The goal of this course is to present modern statistical tools and some of their theoretical properties for estimation in the high-dimensional setting, including:
- Wavelets and thresholding rules
- Penalized estimators: model selection procedures, Ridge and Lasso estimates
- Generalizations and variations of the Lasso estimate: Group-Lasso, Fused-Lasso, elastic-net and Dantzig selectors. Links with Bayesian rules.
- Statistical properties of Lasso estimators: study in the classical regression model and extensions to the generalized linear model.
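To make the thresholding rules in the list above concrete, here is a minimal Python sketch (an illustration only, not course material). It relies on a standard fact: when the design matrix is orthonormal, the Lasso estimate for the criterion (1/2)||y - Xb||^2 + lam ||b||_1 is obtained by soft-thresholding each least-squares coefficient at level lam. The coefficient values and the threshold `lam` are hypothetical.

```python
def soft_threshold(z: float, lam: float) -> float:
    """Soft-thresholding: shrink z toward zero by lam; return 0 if |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def hard_threshold(z: float, lam: float) -> float:
    """Hard-thresholding: keep z unchanged if |z| > lam, otherwise return 0."""
    return z if abs(z) > lam else 0.0


# Hypothetical noisy coefficients: two large signals among small noise terms.
coeffs = [3.0, -0.2, 0.1, -2.5, 0.4]
lam = 0.5  # threshold level (assumed for illustration)

soft = [soft_threshold(c, lam) for c in coeffs]
hard = [hard_threshold(c, lam) for c in coeffs]
print("soft:", soft)
print("hard:", hard)
```

Both rules set the small coefficients to zero, which is the source of sparsity. Soft-thresholding also shrinks the surviving coefficients by `lam`, while hard-thresholding leaves them unchanged.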
Course 2 (Franck Picard): Empirical properties of penalized estimators in high-dimensional linear models
In this course we will illustrate and explore the empirical properties of penalized estimators in linear models (Gaussian regression and generalized linear models). Using R, we will learn how to build relevant simulation designs to assess the performance of the Lasso and its variants. We will also focus on the importance of penalty calibration in practice, using methods such as cross-validation or the BIC. Most importantly, we will show how penalization can be used in both the low-dimensional and the high-dimensional settings. These simulation studies constitute a good framework to explore the inherent difficulties of high-dimensional statistics in practice: what should be expected, and what should not be! Another aspect of the course will be to learn about penalized unsupervised methods such as sparse PCA and sparse clustering.
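The kind of simulation study described above can be sketched as follows. The course itself uses R; this is a rough, self-contained Python illustration under stated assumptions, not the course's code. The sample sizes, the true coefficient vector, the noise level, the grid `lambdas` and the helper names (`lasso_cd`, `cv_error`) are all hypothetical choices. The Lasso is fitted by plain coordinate descent on the criterion (1/(2n))||y - Xb||^2 + lam ||b||_1, and the penalty is calibrated by 5-fold cross-validation.

```python
import random


def soft_threshold(z, lam):
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate descent for (1/(2n)) * ||y - X b||^2 + lam * ||b||_1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, with beta = 0 initially
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the partial residual (column j excluded).
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta


def mse(X, y, beta):
    """Mean squared prediction error of beta on (X, y)."""
    preds = [sum(x * b for x, b in zip(row, beta)) for row in X]
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y)


def cv_error(X, y, lam, k=5):
    """Average held-out error over k interleaved folds."""
    n = len(y)
    total = 0.0
    for fold in range(k):
        test_idx = set(range(fold, n, k))
        train = [i for i in range(n) if i not in test_idx]
        test = sorted(test_idx)
        beta = lasso_cd([X[i] for i in train], [y[i] for i in train], lam)
        total += mse([X[i] for i in test], [y[i] for i in test], beta)
    return total / k


# Simulation design (all values assumed): sparse truth, more variables than observations.
rng = random.Random(0)
n, p = 40, 60
true_beta = [2.0, -2.0, 1.5] + [0.0] * (p - 3)
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [sum(x * b for x, b in zip(row, true_beta)) + rng.gauss(0, 0.5) for row in X]

lambdas = [1.0, 0.5, 0.2, 0.1, 0.05]
errors = {lam: cv_error(X, y, lam) for lam in lambdas}
best_lam = min(errors, key=errors.get)
beta_hat = lasso_cd(X, y, best_lam)
support = [j for j, b in enumerate(beta_hat) if b != 0.0]
print("selected lambda:", best_lam)
print("estimated support:", support)
```

Comparing `support` with the indices of the nonzero entries of `true_beta` across many replications, sample sizes and noise levels is one way to see empirically when variable selection can be expected to succeed.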
Course 3 (Tristan Mary-Huard): Supervised classification in the high-dimensional setting
In this course we will first introduce the basics of supervised classification along with some classical supervised methods (logistic regression, linear discriminant analysis). We will then focus on the high-dimensional setting and introduce regularized methods (ridge logistic regression, SVM…). We will investigate how these algorithms can be cast in a general framework of penalized convex risk minimization and will provide some theoretical guarantees about their performance. Practical questions such as model selection via cross-validation will also be introduced.
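As an illustration of the penalized convex risk minimization framework mentioned above, here is a minimal Python sketch of ridge logistic regression. It is a sketch under stated assumptions, not the course's implementation. With labels in {-1, +1}, it minimizes the empirical risk (1/n) Σ log(1 + exp(-y_i <w, x_i>)) plus the penalty (lam/2)||w||^2 by plain gradient descent. The toy data, the step size and the helper names are hypothetical. Replacing the logistic loss by the hinge loss max(0, 1 - y_i <w, x_i>) in the same criterion gives the linear SVM.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def logistic_loss(margin):
    """Convex surrogate log(1 + exp(-margin)), computed in a numerically stable way."""
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def sigmoid(t):
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def penalized_risk(w, X, y, lam):
    """Empirical logistic risk plus ridge penalty (lam / 2) * ||w||^2."""
    risk = sum(logistic_loss(yi * dot(w, xi)) for xi, yi in zip(X, y)) / len(y)
    return risk + 0.5 * lam * dot(w, w)


def fit_ridge_logistic(X, y, lam, step=0.5, n_iter=500):
    """Gradient descent on the penalized risk; labels y must be in {-1, +1}."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        grad = [lam * wj for wj in w]  # gradient of the ridge penalty
        for xi, yi in zip(X, y):
            # d/dm log(1 + exp(-m)) = -sigmoid(-m), with margin m = y_i * <w, x_i>
            coef = -yi * sigmoid(-yi * dot(w, xi)) / n
            for j in range(p):
                grad[j] += coef * xi[j]
        w = [wj - step * gj for wj, gj in zip(w, grad)]
    return w


# Toy, linearly separable data (assumed for illustration).
X = [[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]]
y = [1, 1, -1, -1]

w = fit_ridge_logistic(X, y, lam=0.1)
predictions = [1 if dot(w, xi) > 0 else -1 for xi in X]
print("weights:", w)
print("predictions:", predictions)
print("penalized risk:", penalized_risk(w, X, y, 0.1))
```

On separable data such as this, the unpenalized logistic risk has no finite minimizer, so the ridge term is what keeps the weights bounded. Larger values of `lam` shrink them further toward zero.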
Slides can be downloaded here