Conference Agenda

Overview and details of the sessions of this conference. Please select a date or location to show only the sessions on that day or at that location. Please select a single session for a detailed view (with abstracts and downloads, if available).

Location indicates the building first and then the room number!

Click on "Floor plan" for orientation in the builings and on the campus.

 
Session Overview
Session
S12 (3): Computational, functional and high-dimensional statistics
Time:
Wednesday, 12/Mar/2025:
3:50 pm - 5:30 pm

Session Chair: Martin Wahl
Location: ZEU 260
Floor plan

Zeuner Bau
Session Topics:
12. Computational, functional and high-dimensional statistics

Presentations
3:50 pm - 4:15 pm

Tracy-Widom, Gaussian, and Bootstrap: Approximations for Leading Eigenvalues in High-Dimensional PCA

Nina Dörnemann1, Miles E. Lopes2

1Aarhus University; 2University of California, Davis

Under certain conditions, the largest eigenvalue of a sample covariance matrix undergoes a well-known phase transition when the sample size $n$ and data dimension $p$ diverge proportionally. In the subcritical regime, this eigenvalue has fluctuations of order $n^{-2/3}$ that can be approximated by a Tracy-Widom distribution, while in the supercritical regime, it has fluctuations of order $n^{-1/2}$ that can be approximated with a Gaussian distribution. However, the statistical problem of determining which regime underlies a given dataset has remained largely unresolved. We develop a new testing framework and procedure to address this problem. In particular, we demonstrate that the procedure has an asymptotically controlled level, and that it is power consistent for certain alternatives. Also, this testing procedure enables the design of a new bootstrap method for approximating the distributions of functionals of the leading sample eigenvalues within the subcritical regime, which is the first such method supported by theoretical guarantees.
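
The phase transition described in the abstract can be seen in a small Monte Carlo experiment. The sketch below is not the authors' procedure; the one-spike covariance model, the dimensions, the spike values and the number of replications are illustrative assumptions. It tracks the largest eigenvalue of the sample covariance matrix on both sides of the phase-transition threshold $1 + \sqrt{p/n}$ for the spiked population eigenvalue.

```python
# Illustrative sketch (not the authors' code): Monte Carlo look at the largest
# sample-covariance eigenvalue in the sub- vs supercritical regime of a
# one-spike model.  All parameter choices are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
n, p = 400, 200                        # sample size and dimension, gamma = p/n = 0.5
gamma = p / n
threshold = 1.0 + np.sqrt(gamma)       # phase-transition point for the spiked eigenvalue

def largest_eigval(spike, reps=200):
    """Simulate the top eigenvalue of S = X^T X / n under a one-spike covariance."""
    sigma = np.ones(p)
    sigma[0] = spike                   # single spiked population eigenvalue
    vals = []
    for _ in range(reps):
        X = rng.standard_normal((n, p)) * np.sqrt(sigma)   # rows ~ N(0, diag(sigma))
        S = X.T @ X / n
        vals.append(np.linalg.eigvalsh(S)[-1])
    return np.array(vals)

for spike in (1.0, 4.0):               # 1.0: subcritical, 4.0: supercritical
    lam = largest_eigval(spike)
    regime = "sub" if spike <= threshold else "super"
    print(f"spike={spike:4.1f} ({regime}critical): "
          f"mean top eigenvalue {lam.mean():.3f}, sd {lam.std():.3f}")

# The bulk edge (1 + sqrt(gamma))^2 is about 2.91 here.  In the subcritical case
# the top eigenvalue concentrates near it with n^(-2/3) (Tracy-Widom) fluctuations,
# while in the supercritical case it separates and fluctuates at rate n^(-1/2).
```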


4:15 pm - 4:40 pm

AIC for many-regressor heteroskedastic regressions

Stanislav Anatolyev

CERGE-EI, Czech Republic

The original Akaike information criterion (AIC) and its corrected version (AICc) have long been routinely used for model selection. The penalty terms in these criteria are tied to the classical normal linear regression, characterized by conditional homoskedasticity and a small number of regressors relative to the sample size, which leads to very simple and computationally attractive penalty forms.

We derive, from the same principles, a general version that takes account of conditional heteroskedasticity and regressor numerosity. The new AICm penalty takes the form of a ratio of certain weighted average error variances and encompasses the classical ones: it is approximately equal to the AIC penalty when the regression is conditionally homoskedastic and regressors are few, and to the AICc penalty when the regression is conditionally homoskedastic but the number of regressors is not negligible. In contrast to those of AIC and AICc, the AICm penalty is stochastic and thus not immediately implementable, as it additionally depends on the pattern of conditional heteroskedasticity in the sample.

The infeasible AICm criterion, however, can be operationalized via unbiased estimation of individual variances. The feasible AICm criterion still minimizes the expected Kullback-Leibler divergence up to an asymptotically negligible term that does not relate to regressor numerosity. In simulations, the feasible AICm does select models that deliver systematically better out-of-sample predictions than the classical criteria.
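
For reference, the sketch below shows how the classical AIC and AICc penalties can be computed for a Gaussian linear regression fitted by least squares. It is an illustration under one common convention (counting the error variance among the estimated parameters and dropping additive constants), not code from the paper, and it does not implement the AICm criterion, whose penalty depends on the unknown pattern of conditional heteroskedasticity.

```python
# Illustrative sketch (not from the paper): classical AIC and AICc for a
# Gaussian linear regression, computed from the residual sum of squares.
# Conventions for additive constants and for counting the error variance
# among the parameters vary; this is one common variant.
import numpy as np

def aic_aicc(y, X):
    """AIC and AICc for the OLS fit of y on X (columns = regressors incl. intercept)."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    loglik_term = n * np.log(rss / n)                    # up to an additive constant
    aic = loglik_term + 2 * (k + 1)                      # k coefficients + error variance
    aicc = aic + 2 * (k + 1) * (k + 2) / (n - k - 2)     # small-sample correction
    return aic, aicc

# Tiny usage example with simulated homoskedastic data
rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.standard_normal((n, 3))])
y = X @ np.array([1.0, 0.5, -0.3, 0.0]) + rng.standard_normal(n)
print(aic_aicc(y, X))
```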


4:40 pm - 5:05 pm

Identification in ill-posed linear regression: estimation rates, prediction risk, asymptotic distributions

Gianluca Finocchio, Tatyana Krivobokova

University of Vienna, Austria

In applications from biology, chemistry, genomics and finance, practitioners face a plethora of high-dimensional data sets affected by highly correlated features. They often exploit regression models to predict unobserved responses or to identify combinations of features driving the underlying generating process.

The prediction problem can be tackled by nonlinear regression algorithms. Model complexity is not necessarily a burden, since overparametrised models in the regime of benign overfitting achieve small prediction error despite interpolating the observed response. Deep learning, random forests, kernel estimators, L2-penalised linear regression and latent factor regression can all exhibit the double-descent pattern typical of benign overfitting: the prediction error as a function of the total number of parameters has a local minimum in the underparametrised regime and a global minimum in the overparametrised regime.
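
A minimal simulation can reproduce this double-descent shape for the simplest of these methods, minimum-norm ("ridgeless") least squares. The setup below is an assumed toy example, not taken from the talk: the response depends on $d$ features, the fitted model uses only the first $p$ of them, and the test error typically spikes near the interpolation threshold $p = n$ before decreasing again in the overparametrised regime.

```python
# Illustrative sketch (assumed setup, not from the talk): double descent for
# minimum-norm least squares.  All sizes and the noise level are assumptions.
import numpy as np

rng = np.random.default_rng(2)
n, d, n_test, sigma = 60, 300, 2000, 0.5
beta = rng.standard_normal(d) / np.sqrt(d)      # true coefficients on all d features

X_tr, X_te = rng.standard_normal((n, d)), rng.standard_normal((n_test, d))
y_tr = X_tr @ beta + sigma * rng.standard_normal(n)
y_te = X_te @ beta + sigma * rng.standard_normal(n_test)

for p in (10, 30, 55, 60, 65, 90, 150, 300):    # sweep through the threshold p = n
    b_hat = np.linalg.pinv(X_tr[:, :p]) @ y_tr  # min-norm least squares on first p features
    mse = np.mean((y_te - X_te[:, :p] @ b_hat) ** 2)
    print(f"p = {p:3d}: test MSE = {mse:.3f}")
```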

Most overparametrised methods are hard to interpret and thus unsuitable for identifying any combination of features that might be relevant for the response. Linear regression is the simplest alternative since it provides a vector of coefficients highlighting the contribution of the features and, despite its simplicity, the regularisation of ill-posed linear models is still relevant in modern data science.

An established strategy for identification relies on the sparsity principle: one assumes that only a few features actually carry any information on the response. Despite the recent development of diagnostic measures of influence, it has become apparent that the necessary regularity conditions fail dramatically when dealing with ill-posed data sets from genomics, where model selection becomes essentially random. Furthermore, one of the major drawbacks of the sparsity principle in general is its lack of invariance under orthogonal transformations. This means that any sparse method will overestimate the degrees of freedom of the problem when only a few linear combinations of the features are important, rather than the features themselves.
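
The lack of invariance under orthogonal transformations is easy to demonstrate numerically. The sketch below is an assumed toy example (using scikit-learn's cross-validated lasso, not anything from the talk): the same response is sparse in the original coordinates but dense after a random rotation of the features, and the lasso then selects far more variables.

```python
# Illustrative sketch (assumption, not from the talk): sparsity is not invariant
# under orthogonal transformations.  Data sizes, noise level and the use of
# scikit-learn's LassoCV are assumptions.
import numpy as np
from scipy.stats import ortho_group
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(3)
n, p, k_true = 200, 50, 3
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:k_true] = 2.0                               # response sparse in these coordinates
y = X @ beta + rng.standard_normal(n)

Q = ortho_group.rvs(p, random_state=3)            # random orthogonal transformation
X_rot = X @ Q                                     # same information, rotated features
# In rotated coordinates the true coefficient vector Q.T @ beta is dense.

for name, design in (("original", X), ("rotated", X_rot)):
    coef = LassoCV(cv=5, random_state=0).fit(design, y).coef_
    print(f"{name} features: lasso selects {np.sum(np.abs(coef) > 1e-8)} of {p}")
```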

Another strategy for identification relies on the principal components principle, which assumes that the response only depends on the main directions of variation of the features. The classical theory of latent factor models hinges on regularity conditions allowing consistent estimation of the true number of latent factors via the sample eigenvalue ratio. This makes principal components regression (PCR), or unsupervised dimensionality reduction in general, the most natural approach to such problems. However, the sentiment that these assumptions are too restrictive is quite old, and many authors have suggested that there is no logical reason for the principal components to contain any information at all on the response.

Extensive reviews are available for genome-wide association studies (GWAS) aiming at identifying the association of genotypes with phenotypes of many diseases such as coronary artery disease, atrial fibrillation, type 2 diabetes, inflammatory bowel disease and breast cancer. The association is estimated by fitting linear models with the addition of possibly random effects. The problem is ill-posed because genotypes of genetic variants that are physically close together are not independent and, more importantly, complex traits may be highly polygenic in the sense that many genetic variants with small effects contribute to the phenotype. The interpretation of such complex models is a big open challenge that is beyond the capabilities of sparse regression and principal components regression.

Motivated by the above, we revisit the theory for identification in ill-posed linear models and propose a novel framework. The classical latent factor model for linear regression is extended by assuming that, up to an unknown orthogonal transformation, the features consist of subsets that are relevant and irrelevant for the response. Furthermore, a joint low-dimensionality is imposed only on the relevant features vector and the response variable. The proposed framework makes it possible to: i) characterise the identifiable parameters of interest that are crucial for interpretation; ii) characterise the intrinsic geometrical properties of any regularisation algorithm; iii) comprehensively study the partial least squares (PLS) algorithm under random design with heavy tails. In particular, a novel perturbation bound for PLS solutions is proven, and high-probability L2-rates for estimation and prediction with the PLS estimator are obtained. As a corollary, necessary and sufficient conditions for the asymptotic normality of PLS estimators are derived. This framework sheds light on the identification performance of regularisation methods for ill-posed linear regression that exploit sparsity or unsupervised projection. The theoretical findings are confirmed by numerical studies on both real and simulated data.
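
As a rough numerical illustration of why a supervised projection such as PLS can succeed where unsupervised projection fails, the sketch below compares PCR and PLS when the response loads on a low-variance direction of the features. It is an assumed toy setup using scikit-learn, not the paper's experiments; all dimensions, variances and component numbers are arbitrary choices.

```python
# Illustrative sketch (assumed setup): PCR vs PLS when the leading principal
# components carry little information about the response.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(4)
n, p = 300, 20
scales = np.linspace(5.0, 0.5, p)                  # decreasing feature variances
X = rng.standard_normal((n, p)) * scales
y = 3.0 * X[:, -1] + rng.standard_normal(n)        # y driven by a low-variance feature

X_tr, X_te, y_tr, y_te = X[:200], X[200:], y[:200], y[200:]
k = 3                                              # number of components for both methods

# PCR: unsupervised projection onto the top-k principal components, then OLS
pca = PCA(n_components=k).fit(X_tr)
pcr = LinearRegression().fit(pca.transform(X_tr), y_tr)
mse_pcr = np.mean((y_te - pcr.predict(pca.transform(X_te))) ** 2)

# PLS: supervised projection, components chosen to covary with the response
pls = PLSRegression(n_components=k).fit(X_tr, y_tr)
mse_pls = np.mean((y_te - pls.predict(X_te).ravel()) ** 2)

print(f"test MSE with {k} components: PCR = {mse_pcr:.3f}, PLS = {mse_pls:.3f}")
```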


 