Past Talks in Joint Analysis Seminar
There weren't any events in the past six months.
Past Talks in Post Graduate Seminar
08.02.2024, 16:15
Laura Paul (RWTH Aachen University):
Covariance Estimation for Massive MIMO
Massive multiple-input multiple-output (MIMO) communication systems are very promising for wireless communication and fifth-generation (5G) cellular networks. In massive MIMO, a large number of antennas are employed at the base station (BS), which provides a high degree of spatial freedom and enables the BS to communicate simultaneously with multiple user terminals. Due to the limited angular spread, the user channel vectors lie in low-dimensional subspaces. For each user, we aim to find a low-dimensional beamforming subspace that captures a large amount of the power of its channel vectors. We will see that this signal subspace estimation problem can be reduced to finding a good estimator of the covariance matrix in terms of a truncated version of the nuclear norm. Since the channel covariance matrix is not known a priori in practice, it has to be estimated from the observed data samples. In this talk, theoretical guarantees for signal covariance and subspace estimation from compressed measurements are investigated. We derive improved bounds on the estimation error in terms of the number of observed time samples, the truncation level, and the noise level.
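For readers who want a concrete picture of the subspace-estimation step, the following rough sketch (not the estimator analyzed in the talk) forms a sample covariance from noisy channel snapshots and takes its dominant eigenvectors as a beamforming subspace; all dimensions and constants are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
M, r, T = 64, 4, 200          # antennas, subspace dimension, time samples (illustrative)

# Synthetic low-rank channel covariance: power concentrated in an r-dimensional subspace.
U_true, _ = np.linalg.qr(rng.standard_normal((M, r)))
C_true = U_true @ np.diag([4.0, 2.0, 1.0, 0.5]) @ U_true.T

# Observed noisy channel snapshots h_t drawn with covariance C_true, plus measurement noise.
L_chol = np.linalg.cholesky(C_true + 1e-9 * np.eye(M))
H = L_chol @ rng.standard_normal((M, T)) + 0.1 * rng.standard_normal((M, T))

# Sample covariance and its dominant r-dimensional eigenspace.
C_hat = H @ H.T / T
eigvals, eigvecs = np.linalg.eigh(C_hat)
U_hat = eigvecs[:, -r:]                      # estimated beamforming subspace

# Fraction of the true channel power captured by the estimated subspace.
captured = np.trace(U_hat.T @ C_true @ U_hat) / np.trace(C_true)
print(f"captured power fraction: {captured:.3f}")
```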
01.02.2024, 16:15
Arinze Folarin (RWTH Aachen University):
Tensor Recovery: Exploring Hierarchical Tensor Representation in ISLET Algorithm
This talk is intended to introduce you to the Importance Sketching Low-rank Estimation for Tensors (ISLET) algorithm by Anru Zhang. The algorithm uses the Higher-Order Orthogonal Iteration (HOOI) tensor decomposition method to derive importance sketching directions, which are valuable for the tensor estimates produced by the ISLET algorithm. I will introduce the hierarchical tensor representation as an alternative way to derive sketching directions, yielding a variant of the ISLET algorithm. These algorithms produce tensor estimates from the responses and tensor covariates with randomized designs in a given low-rank tensor regression model, enabling recovery of the unknown low-rank tensor.
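As a rough illustration of the HOOI step mentioned above, the following numpy sketch extracts orthonormal factor matrices from a synthetic low-rank 3-way tensor; these factors play the role of sketching directions, but the sketch is not the ISLET pipeline itself, and all shapes and names are made up for the example.

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding of a 3-way tensor."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def hooi(T, ranks, n_iter=10):
    """Higher-Order Orthogonal Iteration: returns orthonormal factors U1, U2, U3."""
    # Initialize with truncated SVDs of the unfoldings (HOSVD initialization).
    U = [np.linalg.svd(unfold(T, m), full_matrices=False)[0][:, :r]
         for m, r in enumerate(ranks)]
    for _ in range(n_iter):
        for m in range(3):
            # Project onto the other two factors, then update factor m by a truncated SVD.
            Y = T
            for i in (i for i in range(3) if i != m):
                Y = np.moveaxis(np.tensordot(U[i].T, np.moveaxis(Y, i, 0), axes=1), 0, i)
            U[m] = np.linalg.svd(unfold(Y, m), full_matrices=False)[0][:, :ranks[m]]
    return U

rng = np.random.default_rng(1)
ranks = (2, 2, 2)
# Synthetic low-Tucker-rank tensor: core G multiplied by orthonormal factor matrices.
G = rng.standard_normal(ranks)
A, B, C = (np.linalg.qr(rng.standard_normal((n, 2)))[0] for n in (10, 12, 14))
X = np.einsum('abc,ia,jb,kc->ijk', G, A, B, C)

U1, U2, U3 = hooi(X, ranks)   # these factors serve as the sketching directions
print(U1.shape, U2.shape, U3.shape)
```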
25.01.2024, 16:15
Robert Kunsch (RWTH Aachen University):
Monte Carlo quadrature with optimal confidence
We study the numerical integration of smooth functions using finitely many function evaluations within randomized algorithms, aiming for the smallest possible error guarantees that hold with high probability (the so-called confidence level). There are different strategies for constructing robust estimators, for example taking the median of several repeated realizations of a basic method, where the number of required repetitions depends on the desired confidence level. For Sobolev classes of continuous functions, however, we can find linear integration methods that have optimal error bounds for all confidence levels. Numerical experiments show that the tails of the error distribution are significantly smaller than with previously known methods.
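As a toy illustration of the repetition-and-median strategy mentioned above (not the optimal linear methods presented in the talk), one can boost the reliability of a basic Monte Carlo quadrature by taking the median of several independent runs; the integrand and sample sizes below are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(2)

def mc_estimate(f, n):
    """Basic Monte Carlo estimate of the integral of f over [0, 1]."""
    x = rng.random(n)
    return f(x).mean()

def median_of_runs(f, n, k):
    """Median of k independent basic estimates; the number of repetitions k
    would be chosen according to the desired confidence level."""
    return np.median([mc_estimate(f, n) for _ in range(k)])

f = lambda x: np.exp(x)          # exact integral over [0, 1] is e - 1
exact = np.e - 1

# Compare the error tails of a single run and a median of runs at the same total budget.
single = [abs(mc_estimate(f, 1000) - exact) for _ in range(2000)]
median = [abs(median_of_runs(f, 100, 10) - exact) for _ in range(2000)]
print("99th percentile error, single run:     ", np.quantile(single, 0.99))
print("99th percentile error, median of runs: ", np.quantile(median, 0.99))
```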
11.01.2024, 16:15
Johannes Müller (RWTH Aachen University):
Natural Gradients for Scientific Machine Learning
Neural-network-based PDE solvers have received rapidly growing attention within the scientific machine learning community, as they offer the promise of working effectively in high dimensions. However, even for low-dimensional problems, common approaches are well documented to fail to reach satisfactory accuracy. We present an error estimate decomposing the error into an approximation, an optimization, and a penalization error, the latter arising from approximately enforced boundary conditions. Further, we propose energy natural gradient descent, a natural gradient method with respect to a Hessian-induced Riemannian metric. As a main motivation we show that the update direction in function space resulting from the energy natural gradient corresponds to the Newton direction, modulo an orthogonal projection onto the model's tangent space. We demonstrate experimentally that energy natural gradient descent yields highly accurate solutions with errors several orders of magnitude smaller than what is obtained when training PINNs with standard optimizers such as gradient descent, Adam, or BFGS, even when those are allowed significantly more computation time.
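To make the preconditioning idea concrete, here is a schematic sketch of a natural-gradient-style update for a simple least-squares fit, where the metric is the Gram (Gauss-Newton) matrix of the model's Jacobian; this is a generic illustration, not the energy natural gradient method or the PINN setting from the talk, and the model and constants are invented.

```python
import numpy as np

# Tiny nonlinear model u(x; theta) = a * tanh(w * x + b), fitted to data by least squares.
def model(theta, x):
    a, w, b = theta
    return a * np.tanh(w * x + b)

def jacobian(theta, x):
    a, w, b = theta
    t = np.tanh(w * x + b)
    s = 1.0 - t**2                                   # derivative of tanh
    return np.stack([t, a * s * x, a * s], axis=1)   # d model / d (a, w, b)

x = np.linspace(-1.0, 1.0, 50)
y_target = np.sin(2.0 * x)                # target function (arbitrary for the example)
theta = np.array([0.5, 0.5, 0.0])

for step in range(200):
    r = model(theta, x) - y_target        # residual
    J = jacobian(theta, x)
    grad = J.T @ r                        # ordinary gradient of 0.5 * ||r||^2
    G = J.T @ J + 1e-6 * np.eye(3)        # Gram / Gauss-Newton matrix used as the metric
    direction = np.linalg.solve(G, grad)  # preconditioned ("natural") direction
    theta -= 0.5 * direction

print("final loss:", 0.5 * np.sum((model(theta, x) - y_target) ** 2))
```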
21.12.2023, 16:15
Robert M. Gower (Flatiron Institute):
Analysing stochastic gradient descent with adaptive stepsize and under interpolation
Training a modern machine learning architecture on a new task requires extensive learning-rate tuning, which comes at a high computational cost. In this talk I will cover new adaptive step sizes for SGD (stochastic gradient descent). I will start by introducing an "optimal" step size that relies on interpolation; by interpolation, I mean that we can perfectly fit our model to the data at hand. I will then present some new elegant theory for these optimal step sizes. I will then move on to more practical considerations, showing how such step sizes can be used in conjunction with any momentum method. This leads to a momentum-model-based adaptive learning rate for SGD-M (stochastic gradient descent with momentum), which we call MoMo. MoMo uses momentum estimates of the batch losses and gradients sampled at each iteration to build a model of the loss function. Our model also makes use of any known lower bound of the loss function by using truncation; for example, most losses are lower-bounded by zero. We then approximately minimize this model at each iteration to compute the next step. We show how MoMo can be used in combination with any momentum-based method, and showcase this by developing MoMo-Adam, which is Adam with our new model-based adaptive learning rate. Through extensive numerical experiments, we demonstrate that MoMo and MoMo-Adam improve over SGD-M and Adam in terms of accuracy and robustness to hyperparameter tuning for training image classifiers on MNIST, CIFAR10, CIFAR100, and ImageNet, recommender systems on the Criteo dataset, and a transformer model on the IWSLT14 translation task.
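As a simplified sketch of the underlying idea (a Polyak-type step size built from momentum averages of the sampled losses and gradients, truncated at a known lower bound of the loss), the following toy example is illustrative only; it is not the authors' exact MoMo update, and the problem and constants are invented.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy least-squares problem: loss(x) = 0.5 * mean((A x - b)^2), with a consistent system
# so that the minimal loss is 0 (interpolation) and f_lower = 0 is a valid lower bound.
A = rng.standard_normal((200, 20))
x_star = rng.standard_normal(20)
b = A @ x_star
f_lower = 0.0

def batch(n=32):
    idx = rng.integers(0, A.shape[0], n)
    return A[idx], b[idx]

x = np.zeros(20)
beta, lr_max = 0.9, 1.0
loss_avg, grad_avg = None, None

for it in range(500):
    Ab, bb = batch()
    r = Ab @ x - bb
    loss = 0.5 * np.mean(r ** 2)
    grad = Ab.T @ r / len(bb)
    # Momentum (exponential moving) averages of the sampled losses and gradients.
    loss_avg = loss if loss_avg is None else beta * loss_avg + (1 - beta) * loss
    grad_avg = grad if grad_avg is None else beta * grad_avg + (1 - beta) * grad
    # Polyak-type step from the averaged quantities, truncated at the known lower bound
    # and capped by a maximal learning rate.
    step = min(lr_max, max(0.0, loss_avg - f_lower) / (np.dot(grad_avg, grad_avg) + 1e-12))
    x -= step * grad_avg

print("distance to interpolating solution:", np.linalg.norm(x - x_star))
```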
14.12.2023, 16:15
07.12.2023, 16:15
Yaim Cooper (University of Notre Dame):
Tradeoffs in Machine Learning
In this talk, I'll discuss three classical and influential tradeoffs in machine learning: the bias-variance tradeoff, accuracy-interpretability tradeoff, and tradeoffs between different definitions of fairness. No prior background is assumed - I will describe each tradeoff, highlight work from the past decade revisiting each, and invite consideration of the role of these tradeoffs in our work.
30.11.2023, 16:15
Nathan Srebro (Toyota Technological Institute at Chicago):
Interpolation Learning and Overfitting with Linear Predictors and Short Programs
Classical theory, conventional wisdom, and all textbooks tell us to avoid reaching zero training error and overfitting the noise, and instead to balance model fit and complexity. Yet recent empirical and theoretical results suggest that in many cases overfitting is benign, and even interpolating the training data can lead to good generalization. Can we characterize and understand when overfitting is indeed benign, and when it is catastrophic as classical theory suggests? And can existing theoretical approaches be used to study and explain benign overfitting and the "double descent" curve? I will discuss interpolation learning in linear (and kernel) methods, as well as using the universal "minimum description length" or "shortest program" learning rule.
16.11.2023, 16:15
El Mehdi Achour (RWTH Aachen University):
The loss landscape of deep linear neural networks: a second-order analysis
We study the optimization landscape of deep linear neural networks with the square loss. It is known that, under weak assumptions, there are no spurious local minima and no local maxima. However, the existence and diversity of non-strict saddle points, which can play a role in the dynamics of first-order algorithms, have only been lightly studied. We go a step further with a full analysis of the optimization landscape at order 2. We characterize, among all critical points, which are global minimizers, strict saddle points, and non-strict saddle points, and we enumerate all the associated critical values. The characterization is simple, involves conditions on the ranks of partial matrix products, and sheds some light on the global convergence and implicit regularization properties that have been proved or observed when optimizing linear neural networks. In passing, we provide an explicit parameterization of the set of all global minimizers and exhibit large sets of strict and non-strict saddle points.
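For reference, the setting can be written as follows; the notation below is chosen for illustration and follows the usual formulation of deep linear networks with the square loss.

```latex
% Square loss of a depth-H linear network: W_H, ..., W_1 are the weight matrices,
% X the input data matrix, Y the targets.
L(W_1, \dots, W_H) \;=\; \tfrac{1}{2}\,\bigl\| W_H W_{H-1} \cdots W_1 X - Y \bigr\|_F^2 .
% The classification of a critical point (global minimizer, strict saddle,
% non-strict saddle) is expressed through conditions on the ranks of partial
% products W_j W_{j-1} \cdots W_i for 1 \le i \le j \le H.
```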
09.11.2023, 16:15
Matus Telgarsky (NYU Courant Institute):
Linear and nonlinear implicit biases of gradient descent on classification losses
The first half of this talk will survey a variety of results, mostly co-authored with Ziwei Ji, on the implicit bias of gradient descent with classification losses: specifically, slow and fast rates in the linear case, and a variety of asymptotic results in deep linear and nonlinear homogeneous cases. The talk will be conducted on a tablet, and thus the second half will be more free-form, allowing time to delve into proofs and other coincidences.
Short bio: Matus Telgarsky is an assistant professor at the Courant Institute of Mathematical Sciences at New York University, specializing in deep learning theory. He was fortunate to receive a PhD at UCSD under Sanjoy Dasgupta. Other highlights include: co-founding, in 2017, the Midwest ML Symposium (MMLS) with Po-Ling Loh (while on faculty at the University of Illinois, Urbana-Champaign); receiving a 2018 NSF CAREER award; and organizing two Simons Institute programs, one on deep learning theory (summer 2019), and one on generalization (fall 2024).
02.11.2023, 16:15
Quanquan Gu (UCLA):
Why Does Sharpness-Aware Minimization Generalize Better Than SGD?
The challenge of overfitting, in which the model memorizes the training data and fails to generalize to test data, has become increasingly significant in the training of large neural networks. To tackle this challenge, Sharpness-Aware Minimization (SAM) has emerged as a promising training method, which can improve the generalization of neural networks even in the presence of label noise. However, a deep understanding of how SAM works, especially in the setting of nonlinear neural networks and classification tasks, remains largely missing. In this talk, I will show why SAM generalizes better than Stochastic Gradient Descent (SGD) for a certain data model and two-layer convolutional ReLU networks. Our theoretical analysis explains the benefits of SAM, particularly its ability to prevent noise learning in the early stages, which enables the learning of weak features more effectively. Experiments on both synthetic and real data sets corroborate our theory. This talk is based on joint work with Zixiang Chen, Junkai Zhang, Yiwen Kou, Xiangning Chen and Cho-Jui Hsieh.
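For readers unfamiliar with SAM, here is a minimal sketch of the generic SAM update (an ascent step along the normalized gradient, followed by a descent step using the gradient at the perturbed weights); the toy objective and constants are placeholders, not the data model or networks analyzed in the talk.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy smooth nonconvex objective in a weight vector w (placeholder for a network loss).
def loss(w):
    return 0.25 * np.sum(w ** 4) - np.sum(np.cos(w))

def grad(w):
    return w ** 3 + np.sin(w)

w = rng.standard_normal(10)
lr, rho = 0.05, 0.1                               # learning rate and SAM perturbation radius

for step in range(300):
    g = grad(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)   # ascent step to a nearby perturbed point
    g_sam = grad(w + eps)                         # gradient evaluated at the perturbed weights
    w -= lr * g_sam                               # descent step with the SAM gradient

print("final loss:", loss(w))
```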
Short bio: Quanquan Gu is an Associate Professor of Computer Science at UCLA. His research is in the area of artificial intelligence and machine learning, with a focus on developing and analyzing nonconvex optimization algorithms for machine learning to understand large-scale, dynamic, complex, and heterogeneous data, and on building the theoretical foundations of deep learning and reinforcement learning. He received his Ph.D. degree in Computer Science from the University of Illinois at Urbana-Champaign in 2014. He is a recipient of the Sloan Research Fellowship, the NSF CAREER Award, and the Simons Berkeley Research Fellowship, among other industrial research awards.