Past Talks in the Joint Analysis Seminar
No events have taken place in the past six months.
Past Talks in the Oberseminar
Johannes Hertrich (Université Paris Dauphine, online):
Fast Summations via Kernel Slicing
The fast computation of large kernel sums is a challenging task which arises as a subproblem in any kernel method. Computed directly, this problem has complexity O(N²), where N
is the number of considered data points. In this talk, we propose an approximation algorithm which reduces this complexity to O(N). Our approach is based on two ideas. First, we prove
that under mild assumptions radial kernels can be represented as a sliced version of a one-dimensional kernel, and we derive an analytic formula for this one-dimensional counterpart.
Hence, we can reduce the d-dimensional kernel summation to a one-dimensional setting. Second, for solving these one-dimensional problems efficiently, we apply fast Fourier summations
on non-equispaced data or a sorting algorithm. We prove bounds for the slicing error, employ quasi-Monte Carlo methods for improved error rates and demonstrate the advantages of our
approach by numerical examples. Finally, we present an application in image generation.
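To illustrate the slicing idea, here is a minimal NumPy sketch (not the algorithm from the talk): it uses the negative-distance (energy) kernel, for which the one-dimensional counterpart is classical, plain Monte Carlo directions instead of quasi-Monte Carlo rules, and a sorting-based O(N log N) routine for the one-dimensional sums; the kernel choice and all sizes are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.special import gammaln

def abs_sum_1d(t, w):
    """s_m = sum_n w_n |t_m - t_n| for all m in O(N log N) via sorting and prefix sums."""
    order = np.argsort(t)
    ts, ws = t[order], w[order]
    cw, cwt = np.cumsum(ws), np.cumsum(ws * ts)
    s_sorted = ts * cw - cwt + (cwt[-1] - cwt) - ts * (cw[-1] - cw)
    out = np.empty_like(s_sorted)
    out[order] = s_sorted
    return out

rng = np.random.default_rng(0)
d, N, P = 10, 2000, 200                       # dimension, points, number of slices
x = rng.standard_normal((N, d))
w = rng.random(N)

# Direct O(N^2) reference: s_m = sum_n w_n K(x_m, x_n) with K(x, y) = -||x - y||.
s_direct = -(cdist(x, x) @ w)

# Sliced estimate: -||z|| = -E_xi |<xi, z>| / c_d for xi uniform on the sphere,
# with c_d = Gamma(d/2) / (sqrt(pi) * Gamma((d+1)/2)).
c_d = np.exp(gammaln(d / 2) - gammaln((d + 1) / 2)) / np.sqrt(np.pi)
xi = rng.standard_normal((P, d))
xi /= np.linalg.norm(xi, axis=1, keepdims=True)
s_sliced = -sum(abs_sum_1d(x @ xi[p], w) for p in range(P)) / (P * c_d)

print("relative error:", np.max(np.abs(s_sliced - s_direct)) / np.max(np.abs(s_direct)))
```

For other radial kernels, the quasi-Monte Carlo direction sets and the fast Fourier summations mentioned in the abstract would replace the plain Monte Carlo average and the sorting routine used here.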
David Holzmueller (INRIA Paris, hybrid):
Generalization of Neural Networks and Kernel Methods: Theory and Practice
Training neural networks on vectors of continuous or discrete data is a classical and highly relevant problem, yet many open questions and new developments remain,
both from a theoretical and from an applied standpoint. From a theoretical viewpoint, I will discuss how neural networks are connected to kernel methods and can be studied using linearization.
In some cases, this allows us to prove whether they generalize to new data with or even without regularization. Finally, I will provide a quick summary of what matters in practice, and
how we managed to create kernel methods and neural networks achieving state-of-the-art results on applied benchmarks.
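As a pointer to the kernel connection mentioned above, the following NumPy sketch computes the empirical neural tangent kernel of a one-hidden-layer ReLU network at initialization; the architecture, width, and NTK-style scaling are illustrative assumptions, not the specific setting of the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 5, 512                            # input dimension, hidden width
W = rng.standard_normal((m, d))          # hidden-layer weights at initialization
a = rng.standard_normal(m)               # output weights at initialization

def param_gradient(x):
    """Gradient of f(x) = a^T relu(W x) / sqrt(m) with respect to all parameters (W, a)."""
    pre = W @ x
    grad_a = np.maximum(pre, 0.0) / np.sqrt(m)                       # df/da
    grad_W = ((a * (pre > 0)) / np.sqrt(m))[:, None] * x[None, :]    # df/dW
    return np.concatenate([grad_W.ravel(), grad_a])

def empirical_ntk(x1, x2):
    """Empirical NTK: inner product of parameter gradients at the current weights."""
    return param_gradient(x1) @ param_gradient(x2)

x1, x2 = rng.standard_normal(d), rng.standard_normal(d)
print(empirical_ntk(x1, x2))
# In the linearized ("lazy") regime, gradient-descent training of the network is well
# approximated by kernel regression with this kernel.
```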
Jason Altschuler (University of Pennsylvania, online):
Negative Stepsizes Make Gradient-Descent-Ascent Converge
Solving min-max problems is a central question in optimization, games, learning, and controls. Arguably the most natural algorithm is Gradient-Descent-Ascent (GDA); however, since the 1970s,
conventional wisdom has held that it fails to converge even on simple problems. This failure spurred an extensive literature on modifying GDA with extragradients, optimism, momentum, anchoring,
etc. In contrast, we show that GDA converges in its original form by simply using a judicious choice of stepsizes.
The key innovation is the proposal of unconventional stepsize schedules that are time-varying, asymmetric, and (most surprisingly) periodically negative. We show that all three properties are
necessary for convergence, and that altogether this enables GDA to converge on the classical counterexamples (e.g., unconstrained convex-concave problems). The core intuition is that although
negative stepsizes make backward progress, they de-synchronize the min/max variables (overcoming the cycling issue of GDA) and lead to a slingshot phenomenon in which the forward progress in the
other iterations is overwhelmingly larger. This results in fast overall convergence. Geometrically, the slingshot dynamics leverage the non-reversibility of gradient flow: positive/negative steps
cancel to first order, yielding a second-order net movement in a new direction that leads to convergence, a direction in which GDA could otherwise not move. Joint work with Henry Shugart.
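A minimal NumPy sketch of the phenomenon on the classical bilinear toy problem f(x, y) = xy: constant symmetric stepsizes spiral outward, while a hand-picked periodic schedule that is time-varying, asymmetric, and periodically negative contracts. The schedule below was chosen only so that the two-step map has spectral radius below one; it is an illustrative stand-in, not one of the schedules constructed in the paper.

```python
import numpy as np

def gda(x, y, schedule, iters):
    """Simultaneous gradient descent-ascent on f(x, y) = x * y;
    schedule(t) returns the stepsizes (alpha_t, beta_t) of the min and max player."""
    for t in range(iters):
        alpha, beta = schedule(t)
        gx, gy = y, x                              # grad_x f = y, grad_y f = x
        x, y = x - alpha * gx, y + beta * gy       # both gradients at the old iterate
    return x, y

# Constant symmetric stepsizes: the classical counterexample, the iterates spiral outward.
print(gda(1.0, 1.0, lambda t: (0.2, 0.2), 500))

# A periodic schedule with a negative stepsize every other iteration; the composed
# two-step linear map has spectral radius sqrt(0.2) < 1, so the iterates converge to
# the saddle point (0, 0).  (Hand-picked for illustration, not the paper's schedule.)
print(gda(1.0, 1.0, lambda t: (2.0, 0.5) if t % 2 == 0 else (-1.0, 0.9), 500))
```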
Semih Cayci (RWTH Aachen, hybrid):
A Riemannian Perspective of the Gauss-Newton Method for Feedforward Neural Networks
Gauss-Newton methods offer a geometry-aware alternative to the standard first-order optimization in deep learning, yet their behavior when applied to neural network
training remains elusive. In this talk, I will present new convergence guarantees and provably effective adaptive damping schemes for the Gauss-Newton method in the neural tangent
kernel regime.
In the underparameterized regime, the Gauss-Newton gradient flow induces a Riemannian gradient flow on a low-dimensional smooth submanifold of the Euclidean predictor space.
Leveraging tools from Riemannian optimization, we establish last-iterate convergence of the Riemannian gradient flow to the optimal in-class predictor at an exponential rate
independent of the conditioning of the Gram matrix, without requiring explicit regularization. We also characterize the critical impacts of the neural network scaling factor and the
initialization on the convergence behavior. In the overparameterized regime, we show that the regularized Gauss-Newton dynamics under appropriate adaptive damping schedules leads
to the global optimum at an exponential rate that is independent of the conditioning of the neural tangent kernel, mirroring the behavior observed in the underparameterized regime. We
further characterize the implicit bias of the Gauss-Newton method in the NTK regime.
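For orientation, a generic damped Gauss-Newton step for a nonlinear least-squares objective looks as follows: a minimal sketch with a finite-difference Jacobian and a fixed damping parameter; the adaptive damping schedules and the NTK-regime analysis from the talk are not reproduced, and the two-parameter toy model is an assumption made purely for illustration.

```python
import numpy as np

def gauss_newton_step(theta, residual_fn, damping):
    """One damped (Levenberg-Marquardt-style) Gauss-Newton step for the loss
    0.5 * ||r(theta)||^2, using a forward-difference Jacobian of the residuals."""
    r = residual_fn(theta)
    eps = 1e-6
    J = np.stack([(residual_fn(theta + eps * e) - r) / eps
                  for e in np.eye(theta.size)], axis=1)      # J[i, j] = dr_i / dtheta_j
    delta = np.linalg.solve(J.T @ J + damping * np.eye(theta.size), J.T @ r)
    return theta - delta

# Toy regression: fit f(x) = theta_0 * tanh(theta_1 * x) to noisy data.
rng = np.random.default_rng(0)
x = rng.standard_normal(50)
y = 2.0 * np.tanh(0.7 * x) + 0.01 * rng.standard_normal(50)

def residuals(theta):
    return theta[0] * np.tanh(theta[1] * x) - y

theta = np.array([1.0, 1.0])
for _ in range(20):
    theta = gauss_newton_step(theta, residuals, damping=1e-3)   # fixed damping, for simplicity
print(theta)   # approaches roughly (2.0, 0.7)
```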
Davide Pucci (University of Florence, hybrid):
Transferring Line Search Step Sizes across Scales: Challenges and Insights
Stochastic Gradient Descent (SGD) and its variants are the de facto standard for solving large-scale finite-sum problems. Despite the huge success of these methods in
training neural networks, their speed of convergence and the generalization abilities of the resulting model strongly depend on the selection of the step size. In recent years,
(nonmonotone) line search methods have been successfully applied to SGD and its variants leading to impressive experimental performance, while maintaining strong convergence
properties.
In this talk, we discuss a key bottleneck of this class of methods: the number of backtracks required to satisfy the line search condition at each iteration. In fact, especially when
training huge networks with billions of parameters, the extra forward passes required by this internal procedure may be too costly. To overcome this limitation, we explore a solution
based on hyperparameter transfer. Inspired by a line of recent works, we follow the intuition that some properties of both the models and the optimization process should be preserved
when scaling the width and the depth of the architecture. This observation could be leveraged to transfer optimal parameters from a small "toy" network, which can be easily
tuned, to a larger "target" network.
Following this intuition, we present the idea of transferring the whole sequence of accepted learning rates from a small architecture to a larger one. We show that this operation is
not trivial, as the loss and the gradient norm, which directly influence the accepted step size, typically behave differently across scales. We report a detailed experimental study of
how the training problem changes when the model is scaled, offering valuable insights from the optimization viewpoint.
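For context, the class of methods discussed above can be sketched as SGD with a backtracking Armijo line search evaluated on the sampled mini-batch. This is a minimal monotone variant; the constants, the reset rule for the initial stepsize, and the quadratic toy problem are assumptions, and neither the nonmonotone conditions nor the transfer scheme from the talk are implemented.

```python
import numpy as np

def sgd_armijo(w, sample_batch, loss_grad, iters, lr0=1.0, c=0.1, rho=0.5):
    """SGD where each stepsize is found by backtracking until the Armijo condition
    holds on the *same* mini-batch that produced the gradient."""
    backtracks = []
    for _ in range(iters):
        batch = sample_batch()
        loss, grad = loss_grad(w, batch)
        lr, n_back = lr0, 0
        # Every backtrack costs one extra forward pass -- the bottleneck discussed above.
        while loss_grad(w - lr * grad, batch)[0] > loss - c * lr * grad @ grad:
            lr *= rho
            n_back += 1
        w = w - lr * grad
        backtracks.append(n_back)
    return w, backtracks

# Toy finite-sum problem: mini-batch linear least squares on random data.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((1000, 20)), rng.standard_normal(1000)

def sample_batch(batch_size=32):
    idx = rng.integers(0, len(b), batch_size)
    return A[idx], b[idx]

def loss_grad(w, batch):
    A_b, b_b = batch
    r = A_b @ w - b_b
    return 0.5 * np.mean(r ** 2), A_b.T @ r / len(b_b)

w, backtracks = sgd_armijo(np.zeros(20), sample_batch, loss_grad, iters=200)
print("average extra forward passes per iteration:", np.mean(backtracks))
```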
Christian Fiedler (TUM, hybrid):
Revisiting statistical learning theory with distributional inputs
In a range of machine learning applications, from medical diagnostics to causality, distributions appear as inputs, which in turn might not even be accessible directly,
but only through samples thereof. Kernel-based methods are particularly suited for this task, and the case of regression -- called distributional regression -- is by now
well-established with efficient algorithms and substantial theory. However, other variants of learning with distributional inputs have received less attention, most notably the case of
distributional classification. Motivated by this latter fact, we present recent work on advancing statistical learning theory with distributional inputs, including new oracle
inequalities and stability-based generalization bounds.
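A minimal sketch of the standard kernel approach behind distributional regression: estimate kernel mean embeddings from the sample bags and run kernel ridge regression on them. The Gaussian base kernel, the toy task of regressing the mean of each input distribution, and all bandwidths and regularization constants are illustrative assumptions; the classification setting and the new bounds from the talk are not touched here.

```python
import numpy as np

def embedding_kernel(bag_i, bag_j, gamma=1.0):
    """Estimate <mu_{P_i}, mu_{P_j}> = E_{x ~ P_i, y ~ P_j} k(x, y) for the Gaussian
    kernel k(x, y) = exp(-gamma (x - y)^2), using only samples ("bags") of P_i and P_j."""
    d2 = (bag_i[:, None] - bag_j[None, :]) ** 2
    return np.mean(np.exp(-gamma * d2))

# Toy task: each input is a distribution N(mu_i, 1) observed only through a sample bag;
# the regression target is the mean mu_i (two-stage sampling).
rng = np.random.default_rng(0)
n_bags, bag_size = 60, 200
mus = rng.uniform(-2.0, 2.0, n_bags)
bags = [rng.normal(mu, 1.0, bag_size) for mu in mus]

# Kernel ridge regression on the estimated mean embeddings.
lam = 0.1
K = np.array([[embedding_kernel(bi, bj) for bj in bags] for bi in bags])
alpha = np.linalg.solve(K + lam * np.eye(n_bags), mus)

def predict(new_bag):
    k = np.array([embedding_kernel(new_bag, bj) for bj in bags])
    return k @ alpha

print(predict(rng.normal(0.75, 1.0, bag_size)), "(true mean: 0.75)")
```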
Felix Voigtländer (KU Eichstaett-Ingolstadt, hybrid):
Rates of approximation by multivariate ridge functions and shallow neural networks
We consider approximation by sums of $n$ multivariate ridge functions for fixed input dimension $d$,
as the number of summands $n$ tends to infinity.
Here, for $1 \leq \ell < d$, an $\ell$-variate ridge function is of the form $\mathbb{R}^d \ni x \mapsto h(A x)$
with $A \in \mathbb{R}^{\ell \times d}$ and $h : \mathbb{R}^\ell \to \mathbb{R}$, where we will assume that $h$ is locally bounded.
For the lower bound, we show that when approximating functions in the (unit ball of the) Sobolev space
$W^{r,\infty}$ on the unit ball in $\mathbb{R}^d$, with the error measured in $L^1$,
the error is bounded below by a constant multiple of $n^{-r / (d - \ell)}$.
Regarding the upper bound, we show that this approximation rate is achievable, even when approximating
$L^p$ Sobolev functions with respect to $L^p$, for arbitrary $1 \leq p \leq \infty$.
For the case $\ell = 1$ of "vanilla" ridge functions, this recovers known results by Maiorov;
we remark that along the way we also fix a small gap in the existing proof.
Finally, we discuss the implications of our bounds regarding the approximation rates for
shallow neural networks, both in the case of real-valued networks (which corresponds to $\ell=1$)
and complex-valued networks (which corresponds to the new case $\ell=2$).
In particular, our results imply that shallow complex-valued networks can achieve a strictly better
approximation rate than shallow real-valued networks.
Pedro Abdalla Teixeira (UCI, online):
LLM Watermarking Using Mixtures and Statistical-to-Computational Gaps
Given a text, can we determine whether it was generated by a large language model (LLM) or by a human? Watermarking is a prominent approach used to tackle this question.
In this talk, we will examine the watermarking problem from a theoretical perspective, with a particular focus on how mixtures of distributions can be useful for watermarking. Time
permitting, we will also discuss a robust framework involving tools from statistical-to-computational gaps. This is joint work with Roman Vershynin.
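As background on the detection side, here is a toy detector for a simple "green-list" watermark, based on a z-test on the number of green tokens. This textbook-style scheme (fixed pseudo-random green list, biased sampling during generation) is only meant to make the question concrete; it is not the mixture-based construction discussed in the talk, and the vocabulary size, bias strength, and token streams are made up.

```python
import numpy as np

def watermark_z_score(tokens, vocab_size, gamma=0.5, seed=0):
    """Toy detector: a fixed pseudo-random 'green list' covers a fraction gamma of the
    vocabulary; watermarked generation is assumed to favor green tokens, so we test
    whether the observed number of green tokens is improbably high."""
    green = np.random.default_rng(seed).random(vocab_size) < gamma
    count, T = np.sum(green[np.asarray(tokens)]), len(tokens)
    return (count - gamma * T) / np.sqrt(gamma * (1 - gamma) * T)

rng = np.random.default_rng(1)
vocab_size, T = 1000, 400

# Unwatermarked text: tokens uniform over the vocabulary -> z-score close to 0.
plain = rng.integers(0, vocab_size, T)
print(watermark_z_score(plain, vocab_size))

# Crudely "watermarked" text: ~80% of the tokens are drawn from the green list
# (matching the detector's seed) -> large positive z-score.
green = np.random.default_rng(0).random(vocab_size) < 0.5
marked = np.where(rng.random(T) < 0.8,
                  rng.choice(np.flatnonzero(green), T),
                  rng.choice(np.flatnonzero(~green), T))
print(watermark_z_score(marked, vocab_size))
```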
Simon Du (University of Washington, online):
How Over-Parameterization Slows Down Gradient Descent
We investigate how over-parameterization impacts the convergence behaviors of gradient descent through two examples. In the context of learning a single ReLU neuron, we
prove that the convergence rate shifts from $\exp(-T)$ in the exact-parameterization scenario to an exponentially slower $1/T^3$ rate in the over-parameterized setting. In the
canonical matrix sensing problem, specifically for symmetric matrix sensing with symmetric parametrization, the convergence rate transitions from $\exp(-T)$ in the
exact-parameterization case to $1/T^2$ in the over-parameterized case. Interestingly, employing an asymmetric parameterization restores the $\exp(-T)$ rate, though this rate also
depends on the initialization scaling. Lastly, we demonstrate that incorporating an additional step within a single gradient descent iteration can achieve a convergence rate
independent of the initialization scaling.
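A scalar caricature of the matrix-sensing part of this story (an illustration only, not the setting analyzed in the talk): with the symmetric parameterization x -> x^2, fitting a strictly positive target gives a locally strongly convex problem and a linear rate, whereas fitting the target 0 with the same parameterization, which then has more degrees of freedom than the target needs, flattens the landscape and slows gradient descent down to a polynomial rate.

```python
import numpy as np

def gd_losses(m, x0=1.5, lr=0.1, iters=2000):
    """Gradient descent on the scalar loss L(x) = (x^2 - m)^2 / 4, a caricature of
    symmetric matrix sensing with the parameterization x -> x^2."""
    x, losses = x0, []
    for _ in range(iters):
        x -= lr * x * (x ** 2 - m)            # L'(x) = x (x^2 - m)
        losses.append((x ** 2 - m) ** 2 / 4)
    return losses

exact = gd_losses(m=1.0)   # target x^2 = 1 > 0: locally strongly convex, linear rate
over = gd_losses(m=0.0)    # target x^2 = 0: flat minimum, rate only about 1/T^2
for T in (100, 1000, 2000):
    print(T, exact[T - 1], over[T - 1])
```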
Johannes Schmidt-Hieber (University of Twente, online):
On the expressivity of deep Heaviside networks
We show that deep Heaviside networks (DHNs) have limited expressiveness but that this can be overcome by including either skip connections or neurons with linear activation. We provide lower and
upper bounds for the Vapnik-Chervonenkis (VC) dimensions and approximation rates of these network classes. As an application, we derive statistical convergence rates for DHN fits in the
nonparametric regression model. Joint work with Insung Kong (Twente), Juntong Chen (Twente), and Sophie Langer (Bochum).
Francesco Orabona (KAUST, online):
New Lower Bounds for Stochastic Non-Convex Optimization through Divergence Composition
We study fundamental limits of first-order stochastic optimization in a range of nonconvex settings, including L-smooth functions satisfying Quasar-Convexity (QC),
Quadratic Growth (QG), and Restricted Secant Inequalities (RSI). While the convergence properties of standard algorithms are well-understood in deterministic regimes, significantly
fewer results address the stochastic case, where only unbiased and noisy gradients are available.
We establish new lower bounds on the number of noisy gradient queries to minimize these classes of functions, also showing that they are tight (up to a logarithmic factor) in all the
relevant quantities characterizing each class. Our approach reformulates the optimization task as a function identification problem, leveraging divergence composition arguments to
construct a challenging subclass that leads to sharp lower bounds.