# Past Talks in Joint Analysis Seminar

There weren't any events in the past six months.

# Past Talks in Post Graduate Seminar

13.05.2024, 16:00

**Adit Radhakrishnan** (Harvard University): **How do neural networks learn features from data?**

Understanding how neural networks learn features, or relevant patterns in data, for prediction is necessary for their reliable use in technological and scientific applications. We propose a unifying mechanism that characterizes feature learning in neural network architectures. Namely, we show that features learned by neural networks are captured by a statistical operator known as the average gradient outer product (AGOP). Empirically, we show that the AGOP captures features across a broad class of network architectures including convolutional networks and large language models. Moreover, we use AGOP to enable feature learning in general machine learning models through an algorithm we call Recursive Feature Machine (RFM). We show that RFM automatically identifies sparse subsets of features relevant for prediction and explicitly connects feature learning in neural networks with classical sparse recovery and low rank matrix factorization algorithms. Overall, this line of work advances our fundamental understanding of how neural networks extract features from data, leading to the development of novel, interpretable, and effective models for use in scientific applications.
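For intuition, the AGOP of a model $f$ is the matrix $\frac{1}{n}\sum_i \nabla f(x_i)\,\nabla f(x_i)^\top$, where the gradients are taken with respect to the inputs. A minimal numpy sketch (the two-layer network with random weights is a stand-in for a trained model, not the setup of the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a trained network: a two-layer MLP f(x) = v^T tanh(W x).
# The weights are random here purely for illustration.
d, h, n = 5, 16, 200
W = rng.normal(size=(h, d))
v = rng.normal(size=h)
X = rng.normal(size=(n, d))

def f_grad(x):
    """Gradient of f(x) = v @ tanh(W @ x) with respect to the input x."""
    pre = W @ x
    return W.T @ (v * (1.0 - np.tanh(pre) ** 2))

# Average gradient outer product (AGOP):
#   AGOP = (1/n) * sum_i grad f(x_i) grad f(x_i)^T
agop = sum(np.outer(g, g) for g in (f_grad(x) for x in X)) / n

# AGOP is symmetric positive semi-definite; its top eigenvectors
# indicate the input directions the model is most sensitive to.
eigvals = np.linalg.eigvalsh(agop)
```

The dominant eigenvectors of `agop` play the role of the learned features; RFM iterates this construction by reweighting the inputs with the AGOP and refitting.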

06.05.2024, 16:00

**Dominik Stöger** (KU Eichstätt-Ingolstadt): **Breaking the quadratic rank bottleneck in non-convex matrix sensing: Recovery guarantees with (near-)optimal sample complexity**

Low-rank matrix recovery problems are ubiquitous in many areas of science and engineering. Most of the methods that have been studied for these problems can be divided
into two categories: Convex optimization approaches based on nuclear norm minimization, and non-convex approaches that use factorized gradient descent.

While the latter promises to be computationally much less expensive, basically all existing recovery guarantees for factorized gradient descent are much more pessimistic with respect
to the number of samples required. In particular, they require the number of samples to scale quadratically with the rank of the ground truth matrix. This is in stark contrast to
empirical observations which suggest that the non-convex approaches perform as well as the convex ones with respect to the sample complexity.

In this talk, we resolve this issue and present, to the best of our knowledge, the first theoretical guarantees for matrix sensing showing that factorized gradient descent recovers
the ground truth matrix with a sample size that is optimal in the number of degrees of freedom. Our proof is based on new probabilistic decoupling arguments, which we expect to be of
independent interest. Joint work with Yizhe Zhu (UC Irvine).
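The factorized approach under discussion is gradient descent on $f(U)=\frac{1}{2m}\sum_i\big(\langle A_i, UU^\top\rangle - y_i\big)^2$ over the low-rank factor $U$. A small numpy illustration (dimensions, step size, and iteration count are arbitrary choices for the sketch, not those of the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, m = 10, 2, 200                 # dimension, rank, number of measurements

# Ground truth: a rank-r PSD matrix M* = U* U*^T
U_star = rng.normal(size=(d, r))
M_star = U_star @ U_star.T

# Gaussian sensing matrices and linear measurements y_i = <A_i, M*>
A = rng.normal(size=(m, d, d))
y = np.einsum('mij,ij->m', A, M_star)

def loss(U):
    """f(U) = (1/2m) * sum_i (<A_i, U U^T> - y_i)^2"""
    return 0.5 * np.mean((np.einsum('mij,ij->m', A, U @ U.T) - y) ** 2)

# Factorized gradient descent from a small random initialization
U = 0.01 * rng.normal(size=(d, r))
eta = 0.005
loss0 = loss(U)
for _ in range(5000):
    res = np.einsum('mij,ij->m', A, U @ U.T) - y     # residuals
    G = np.einsum('m,mij->ij', res, A) / m           # (1/m) sum_i res_i A_i
    U -= eta * (G + G.T) @ U                         # gradient step on f

rel_err = np.linalg.norm(U @ U.T - M_star) / np.linalg.norm(M_star)
```

Here $m = 200$ is comfortably above the $dr = 20$ degrees of freedom; the talk concerns how small $m$ can be while such iterations still provably succeed.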

29.04.2024, 10:00

**Massimo Fornasier** (TU Munich): **Wasserstein Sobolev functions and their numerical approximations**

We start the talk by presenting general results on the strong density of sub-algebras of bounded Lipschitz functions in metric Sobolev spaces. We apply such results to show the
density of smooth cylinder functions in Sobolev spaces of functions on the Wasserstein space $\mathcal P_2$ endowed with a finite positive Borel measure. As a byproduct, we obtain the
infinitesimal Hilbertianity of Wasserstein Sobolev spaces. Taking advantage of these results, we further address the challenging problem of the numerical approximation of Wasserstein
Sobolev functions defined on probability spaces. Our particular focus centers on the Wasserstein distance function, which serves as a relevant example. In contrast to the existing body
of literature focused on efficiently approximating pointwise evaluations, we chart a new course to define functional approximants by adopting three machine learning-based approaches:

1. Solving a finite number of optimal transport problems and computing the corresponding Wasserstein potentials.

2. Employing empirical risk minimization with Tikhonov regularization in Wasserstein Sobolev spaces.

3. Addressing the problem through the saddle point formulation that characterizes the weak form of the Tikhonov functional's Euler-Lagrange equation.

As a theoretical contribution, we furnish explicit and quantitative bounds on generalization errors for each of these solutions. In the proofs, we leverage the theory of metric
Sobolev spaces introduced above and combine it with techniques from optimal transport, variational calculus, and large deviation bounds. In our numerical implementation, we harness
appropriately designed neural networks to serve as basis functions. Consequently, our constructive solutions significantly enhance the evaluation speed at equal accuracy, surpassing
that of state-of-the-art methods by several orders of magnitude.

The talk presents a collection of results with Pascal Heid, Giacomo Sodini, and Giuseppe Savaré.
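The first of the three approaches can be illustrated in the discrete setting, where optimal transport between two equally weighted point clouds reduces to an assignment problem. A minimal sketch (using scipy's `linear_sum_assignment`; the point clouds are toy data, not the training setup of the talk):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(2)

def w2_empirical(x, y):
    """2-Wasserstein distance between two uniform empirical measures
    on equal-size point clouds x, y, via an optimal assignment."""
    n = len(x)
    cost = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    rows, cols = linear_sum_assignment(cost)     # solve the discrete OT problem
    return np.sqrt(cost[rows, cols].sum() / n)

# Toy 1-D example: for sorted 1-D samples the optimal coupling is monotone,
# so W2^2 equals the mean squared difference of the sorted samples.
x = rng.normal(size=(50, 1))
y = rng.normal(loc=2.0, size=(50, 1))
d = w2_empirical(x, y)
```

Pairs of measures together with such computed distances (and the associated potentials) would then serve as training data for a functional approximant.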

08.02.2024, 16:15

**Laura Paul** (RWTH Aachen University): **Covariance Estimation for Massive MIMO**

Massive multiple-input multiple-output (MIMO) communication systems are very promising for wireless communication and fifth generation (5G) cellular networks. In massive MIMO, a large number of antennas are employed at the base station (BS), which provides a high degree of spatial freedom and enables the BS to simultaneously communicate with multiple user terminals. Due to the limited angular spread, the user channel vectors lie on low-dimensional subspaces. For each user, we aim to find a low-dimensional beamforming subspace that captures a large amount of the power of its channel vectors. We will see that this signal subspace estimation problem can be reduced to finding a good estimator of the covariance matrix in terms of a truncated version of the nuclear norm. Since the channel covariance matrix is not a priori known in practice, it has to be estimated from the observed data samples. In this talk, theoretical guarantees for signal covariance and subspace estimation from compressed measurements are investigated. We derive improved bounds on the estimation error in terms of the number of observed time samples, the truncation level, and the noise level.
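The subspace estimation step can be sketched in a few lines: form a sample covariance from channel snapshots and take its dominant eigenvectors as the beamforming subspace. A toy numpy illustration (the low-rank-plus-noise channel model, the dimensions, and the noise level are illustrative assumptions, not the compressed-measurement setting of the talk):

```python
import numpy as np

rng = np.random.default_rng(3)
M, n, k = 32, 500, 4        # antennas, time samples, subspace dimension

# Synthetic channel model: snapshots concentrated on a k-dimensional
# signal subspace plus a small amount of noise (illustration only).
Q, _ = np.linalg.qr(rng.normal(size=(M, k)))     # true signal subspace
H = Q @ rng.normal(size=(k, n)) + 0.05 * rng.normal(size=(M, n))

# Sample covariance estimate from the observed channel snapshots
C_hat = H @ H.T / n

# Beamforming subspace: span of the top-k eigenvectors of C_hat
eigvals, eigvecs = np.linalg.eigh(C_hat)
U = eigvecs[:, -k:]                              # dominant subspace

# Fraction of the channel power captured by the estimated subspace
captured = np.trace(U.T @ C_hat @ U) / np.trace(C_hat)
```

The quality of `U` is governed by how well `C_hat` approximates the true covariance, which is exactly what the bounds in the talk quantify.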

01.02.2024, 16:15

**Arinze Folarin** (RWTH Aachen University): **Tensor Recovery: Exploring Hierarchical Tensor Representation in ISLET Algorithm**

This talk introduces the Importance Sketching Low-rank Estimation for Tensors (ISLET) algorithm by Anru Zhang. The algorithm utilizes the High-Order Orthogonal Iteration tensor decomposition method to derive importance sketching directions, which are valuable for the tensor estimates produced by the ISLET algorithm. I will introduce the hierarchical tensor representation as an alternative way to derive sketching directions, generating a variant of the ISLET algorithm. These algorithms produce tensor estimates using the responses and tensor covariates with randomized designs from a given low-rank tensor regression model, enabling the recovery of the unknown low-rank tensor.

25.01.2024, 16:15

**Robert Kunsch** (RWTH Aachen University): **Monte Carlo quadrature with optimal confidence**

We study the numerical integration of smooth functions using finitely many function evaluations within randomized algorithms, aiming for the smallest possible error guarantees with high probability (the so-called confidence level). There are different strategies for constructing robust estimators, for example, taking the median of several repeated realizations of a basic method where the number of required repetitions depends on the desired confidence level. For Sobolev classes of continuous functions, however, we can find linear integration methods that have optimal error bounds for all confidence levels. Numerical experiments show that the tails of the error distribution are significantly smaller than with previously known methods.
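The median trick mentioned above is easy to sketch: run the basic Monte Carlo estimator several times and return the median, which sharpens the confidence level relative to a single run. A minimal illustration (the integrand and the parameters `n`, `k` are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)

def mc_estimate(f, n):
    """Basic Monte Carlo estimate of the integral of f over [0, 1]."""
    return np.mean(f(rng.random(n)))

def median_of_means(f, n, k):
    """Median of k independent basic estimates: boosts the confidence
    level at the cost of k-fold more function evaluations."""
    return np.median([mc_estimate(f, n) for _ in range(k)])

f = lambda x: np.exp(x)          # integral over [0, 1] is e - 1
est = median_of_means(f, n=2000, k=15)
```

The point of the talk is that for Sobolev classes one can do better: suitable linear methods achieve optimal error bounds for all confidence levels without this repetition overhead.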

11.01.2024, 16:15

**Johannes Müller** (RWTH Aachen University): **Natural Gradients for Scientific Machine Learning**

Neural network based PDE solvers have received rapidly growing attention within the scientific machine learning community as they offer the promise to work effectively in high dimensions. However, even for low-dimensional problems, common approaches are well documented to produce unsatisfactory accuracy. We present an error estimate decomposing the error into an approximation, optimization and penalization error, arising from approximately enforced boundary conditions. Further, we propose energy natural gradient descent, a natural gradient method with respect to a Hessian-induced Riemannian metric. As a main motivation we show that the update direction in function space resulting from the energy natural gradient corresponds to the Newton direction modulo an orthogonal projection onto the model's tangent space. We demonstrate experimentally that energy natural gradient descent yields highly accurate solutions with errors several orders of magnitude smaller than what is obtained when training PINNs with standard optimizers like gradient descent, Adam or BFGS, even when those are allowed significantly more computation time.
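For intuition, on a linear-in-parameters least-squares model a natural gradient step preconditions the Euclidean gradient with the Gram matrix of the model's parameter Jacobian, i.e. it coincides with a Gauss-Newton step. A toy numpy sketch (the two-parameter model is an illustrative assumption, far simpler than the PINN setting of the talk):

```python
import numpy as np

# Toy least-squares fit: model u(x; theta) = theta_0 * sin(x) + theta_1 * cos(x),
# loss L(theta) = (1/2) * sum_i (u(x_i; theta) - y_i)^2  (illustration only).
x = np.linspace(0, np.pi, 20)
y = 2.0 * np.sin(x) - 0.5 * np.cos(x)
theta = np.zeros(2)

Phi = np.stack([np.sin(x), np.cos(x)], axis=1)   # Jacobian of u w.r.t. theta
residual = Phi @ theta - y
grad = Phi.T @ residual                          # Euclidean gradient of L

# Natural gradient step: precondition with the Gram matrix G = J^T J induced
# by the parametrization. For this linear model it is Gauss-Newton, so a
# single preconditioned step solves the least-squares problem exactly.
G = Phi.T @ Phi
direction = np.linalg.solve(G, grad)
theta_new = theta - direction
```

For a nonlinear model such as a PINN, the Gram matrix changes with the parameters and only a local correspondence to the Newton direction holds, which is the content of the result described above.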

21.12.2023, 16:15

**Robert M. Gower** (Flatiron Institute): **Analysing stochastic gradient descent with adaptive stepsize and under interpolation**

Training a modern machine learning architecture on a new task requires extensive learning-rate tuning, which comes at a high computational cost. In this talk I will cover new adaptive step sizes for SGD (stochastic gradient descent). I will start by introducing an "optimal" step size that relies on interpolation. By interpolation, I mean that we can perfectly fit our model to the data at hand. I will then present some new elegant theory for these optimal step sizes. I will then move on to even more practical considerations by showing how such step sizes can be used in conjunction with any momentum method. This will lead to a model-based adaptive learning rate for SGD-M (stochastic gradient descent with momentum) which we call MoMo. MoMo uses momentum estimates of the batch losses and gradients sampled at each iteration to build a model of the loss function. Our model also makes use of any known lower bound of the loss function by using truncation, e.g. most losses are lower-bounded by zero. We then approximately minimize this model at each iteration to compute the next step. We show how MoMo can be used in combination with any momentum-based method, and showcase this by developing MoMo-Adam - which is Adam with our new model-based adaptive learning rate. Through extensive numerical experiments, we demonstrate that MoMo and MoMo-Adam improve over SGD-M and Adam in terms of accuracy and robustness to hyperparameter tuning for training image classifiers on MNIST, CIFAR10, CIFAR100, ImageNet, recommender systems on the Criteo dataset, and a transformer model on the translation task IWSLT14.
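The interpolation-based "optimal" step size can be illustrated with a Polyak-type rule: divide the current per-sample loss (minus its known lower bound, here zero) by the squared gradient norm. A toy numpy sketch on noiseless linear regression (the model and constants are illustrative choices; this is the basic step-size idea, not MoMo itself):

```python
import numpy as np

rng = np.random.default_rng(6)

# Interpolation setting: noiseless linear regression, so every
# per-sample loss can be driven to its lower bound of zero.
n, d = 100, 10
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
y = X @ w_star

w = np.zeros(d)
for _ in range(3000):
    i = rng.integers(n)                       # sample one data point
    err = X[i] @ w - y[i]
    loss_i = 0.5 * err ** 2                   # per-sample loss, lower bound 0
    grad_i = err * X[i]
    g2 = grad_i @ grad_i
    if g2 > 0:
        # Polyak-type step: (f_i(w) - lower bound) / ||grad f_i(w)||^2
        w -= (loss_i / g2) * grad_i
```

Because each step size adapts to the sampled loss and gradient, no learning-rate tuning is needed under interpolation; MoMo extends this idea with momentum-based model building and truncation.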

14.12.2023, 16:15

**Maxime Nguegnang** (RWTH Aachen University): **Analysis of gradient descent and stochastic gradient descent for learning linear neural networks**

In this talk we will present our work on understanding the training phase of deep linear neural networks. First, we studied the dynamics of gradient descent (GD) for learning deep linear neural networks, established bounds on the GD iterates, and proved convergence to a critical point of the square loss. We then extended the convergence results towards a global minimum of Bah et al. (2020) from gradient flow to GD. In contrast to the gradient flow result, our work provides precise conditions that ensure convergence for both constant and decreasing step sizes. In the second part, we will talk about our work in progress on extending the insights of this study from GD to SGD. Based on stochastic approximation theory, we use an analytical approach that combines SGD iterates and gradient flow trajectories to study the convergence properties of SGD for learning deep linear neural networks.
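The object of study can be sketched in the simplest case: gradient descent on a two-layer linear network under the square loss. A toy numpy illustration (a whitened-data formulation; sizes, initialization scale, and step size are arbitrary choices, not the conditions of the talk):

```python
import numpy as np

rng = np.random.default_rng(7)
d = 5
M = rng.normal(size=(d, d))           # target linear map

# Two-layer linear network f(x) = W2 @ W1 @ x, trained on the square loss
# L(W1, W2) = (1/2) * ||W2 @ W1 - M||_F^2  (whitened-data formulation).
W1 = 0.1 * rng.normal(size=(d, d))    # small random initialization
W2 = 0.1 * rng.normal(size=(d, d))

eta = 0.05                            # constant step size
for _ in range(5000):
    E = W2 @ W1 - M                   # residual in matrix space
    gW1 = W2.T @ E                    # gradient of L w.r.t. W1
    gW2 = E @ W1.T                    # gradient of L w.r.t. W2
    W1 -= eta * gW1
    W2 -= eta * gW2
```

Although the end-to-end map `W2 @ W1` is linear, the loss is non-convex in the factors; the results described above give conditions under which such iterations nevertheless converge, including towards a global minimum.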

07.12.2023, 16:15

**Yaim Cooper** (University of Notre Dame): **Tradeoffs in Machine Learning**

In this talk, I'll discuss three classical and influential tradeoffs in machine learning: the bias-variance tradeoff, accuracy-interpretability tradeoff, and tradeoffs between different definitions of fairness. No prior background is assumed - I will describe each tradeoff, highlight work from the past decade revisiting each, and invite consideration of the role of these tradeoffs in our work.

30.11.2023, 16:15

**Nathan Srebro** (Toyota Technological Institute at Chicago): **Interpolation Learning and Overfitting with Linear Predictors and Short Programs**

Classical theory, conventional wisdom, and all textbooks tell us to avoid reaching zero training error and overfitting the noise, and instead balance model fit and complexity. Yet, recent empirical and theoretical results suggest that in many cases overfitting is benign, and even interpolating the training data can lead to good generalization. Can we characterize and understand when overfitting is indeed benign, and when it is catastrophic as classic theory suggests? And can existing theoretical approaches be used to study and explain benign overfitting and the "double descent" curve? I will discuss interpolation learning in linear (and kernel) methods, as well as using the universal "minimum description length" or "shortest program" learning rule.