# Past Talks in Joint Analysis Seminar

There weren't any events in the past six months.

# Past Talks in Post Graduate Seminar

**Enric Boix** (MIT):

**Transformers learn through gradual rank increase**

We identify incremental learning dynamics in transformers, where the difference between trained and initial weights progressively increases in rank. We rigorously prove that
this occurs under the simplifying assumptions of diagonal weight matrices and small initialization. Our experiments support the theory and also show that the phenomenon can occur in
practice without the simplifying assumptions. Joint work with Etai Littwin, Emmanuel Abbe, Samy Bengio, and Joshua Susskind.
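The simplified setting mentioned in the abstract can be illustrated with a small sketch. This is my own toy example, not the paper's experiments: a depth-2 "diagonal" network with effective weights `a * b`, trained by gradient descent from a small initialization on a hand-picked target; the number of active coordinates (the rank of the effective diagonal weight matrix) grows incrementally during training.

```python
import numpy as np

# Toy sketch (illustrative choices throughout): depth-2 diagonal linear model
# with effective weights w_eff = a * b, trained from small initialization.
w_star = np.array([2.0, 1.0, 0.5])    # target; entries of different magnitude
alpha = 1e-3                          # small initialization scale
a = np.full(3, alpha)
b = np.full(3, alpha)
lr = 0.01

ranks = []
for step in range(3000):
    grad = a * b - w_star             # gradient of 0.5 * ||a*b - w_star||^2
    a, b = a - lr * grad * b, b - lr * grad * a
    if step % 100 == 0:
        # "rank" = number of coordinates of the effective diagonal matrix
        # that have clearly departed from the (tiny) initialization
        ranks.append(int(np.sum(np.abs(a * b) > 0.05)))

print(ranks)      # rank increases one coordinate at a time, largest first
print(a * b)      # converges to w_star
```

Coordinates with larger target values escape the small-initialization plateau first, so the rank of the weight difference increases in stages rather than all at once.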


**Nicolas Flammarion** (EPFL):

**Implicit regularization, stochasticity and stepsize**

In this talk, we investigate the impact of stochasticity and large stepsizes on the implicit regularisation of gradient descent (GD) and stochastic gradient descent
(SGD) over diagonal linear networks. We prove the convergence of GD and SGD with macroscopic stepsizes in an overparametrised regression setting and characterise their solutions
through an implicit regularisation problem. Our crisp characterisation leads to qualitative insights about the impact of stochasticity and stepsizes on the recovered solution.
Specifically, we show that large stepsizes consistently benefit SGD for sparse regression problems, while they can hinder the recovery of sparse solutions for GD. These effects are
magnified for stepsizes in a tight window just below the divergence threshold, in the "edge of stability" regime.


**Simon Foucart** (Texas A&M University):

**A Trustworthy Learning Theory? The View from Optimal Recovery**

For a function observed through point evaluations, is there an optimal way to recover it, or merely to estimate a dependent quantity? Guided by an atmospheric science
thread, I will give affirmative answers to several variations of this data-focused question, especially under the assumption that the function belongs to a model set defined by
approximation capabilities. In fact, I will uncover computationally implementable linear recovery maps that are optimal in the worst-case setting. I will also present some recent and
ongoing work extending the theory in several directions, with particular emphasis on observations that are inexact, whether adversarially or randomly.


**Hossein Mobahi** (Google Research):

**Sharpness Aware Minimization (SAM): The Algorithm and Its Recent Developments**

In today's heavily overparameterized models, the value of the training loss provides few guarantees on model generalization ability. Indeed, optimizing only the training
loss value can easily lead to suboptimal model quality. Motivated by prior work connecting the geometry of the loss landscape and generalization, we introduce a procedure for
simultaneously minimizing loss value and loss sharpness. In particular, our procedure, Sharpness-Aware Minimization (SAM), seeks parameters that lie in neighborhoods having uniformly
low loss; this formulation results in a min-max optimization problem. SAM has been shown to yield significant generalization improvements across a number of domains, especially
in computer vision and NLP tasks. In this talk, I will present the main ideas behind SAM, and then discuss open problems and recent developments related to SAM.
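The min-max idea can be sketched in a few lines. This is a minimal illustration on a toy quadratic of my own choosing, not Google's implementation: the inner maximization is approximated to first order by a normalized gradient ascent step of radius `rho`, and the outer descent step uses the gradient at that perturbed point.

```python
import numpy as np

H = np.diag([10.0, 1.0])          # toy ill-conditioned quadratic (illustrative)

def loss(w):
    return 0.5 * w @ H @ w

def grad(w):
    return H @ w

def sam_step(w, lr=0.05, rho=0.05):
    g = grad(w)
    # first-order approximation of the inner maximizer over the rho-ball
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    # outer minimization step, using the gradient at the perturbed point
    return w - lr * grad(w + eps)

w = np.array([1.0, 1.0])
for _ in range(200):
    w = sam_step(w)
print(loss(w))
```

In practice `grad` is a mini-batch gradient and each SAM step therefore costs roughly two backward passes, one at `w + eps` and (implicitly, for `eps`) one at `w`.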


**Leonardo Galli** (RWTH Aachen University):

**Line Search Methods for Deep Learning. It Could Work**

Recent works have shown that line search methods can speed up Stochastic Gradient Descent (SGD) and Adam in modern over-parameterized settings. However, existing line
searches may take steps that are smaller than necessary, since they require a monotone decrease of the (mini-)batch objective function. We explore nonmonotone line search
methods to relax this condition and possibly accept larger step sizes. Despite the lack of a monotonic decrease, we prove the same fast rates of convergence as in the
monotone case. Our experiments show that nonmonotone methods improve the speed of convergence and generalization properties of SGD/Adam even beyond the previous monotone
line searches. We propose a POlyak NOnmonotone Stochastic (PoNoS) method, obtained by combining a nonmonotone line search with a Polyak initial step size. Furthermore, we
develop a new resetting technique that reduces the number of backtracks to zero in the majority of the iterations while still maintaining a large initial step size. To the
best of our knowledge, ours is the first runtime comparison showing that the epoch-wise advantage of line-search-based methods is reflected in the overall computational time.
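The two ingredients named in the abstract, a Polyak initial step size and a nonmonotone Armijo condition, can be sketched on a toy deterministic problem. This is my own simplification (the method is designed for mini-batch losses, and I assume an interpolation-style optimal value `f_star = 0`): the Armijo test compares against the maximum of the last `M` function values rather than the current one, so occasional increases are accepted.

```python
import numpy as np

lam = np.array([10.0, 1.0])       # toy quadratic, illustrative choice

def f(w):
    return 0.5 * np.sum(lam * w**2)

def grad(w):
    return lam * w

def ponos_step(w, history, M=10, c=0.5, beta=0.5, f_star=0.0):
    g = grad(w)
    gg = np.dot(g, g)
    eta = (f(w) - f_star) / (gg + 1e-12)     # Polyak initial step size
    ref = max(history[-M:])                  # nonmonotone reference value
    while f(w - eta * g) > ref - c * eta * gg:
        eta *= beta                          # backtrack only if needed
    w = w - eta * g
    history.append(f(w))
    return w

w = np.array([1.0, 1.0])
history = [f(w)]
for _ in range(100):
    w = ponos_step(w, history)
print(f(w))
```

Because the reference value `ref` is at least the current `f(w)`, the nonmonotone test is weaker than the classical Armijo condition, so the (large) Polyak step is accepted more often and fewer backtracks are performed.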


**Sakirudeen Abdulsalaam** (RWTH Aachen University):

**Phase recovery from masked measurements**

It is commonly known that the Fourier phase is oftentimes more crucial than the Fourier magnitude for reconstructing a signal from its Fourier transform. In many real-life measurement systems, however, only the squared magnitude of the Fourier transform of the underlying signal is available, because the phase is lost or is expensive or impractical to measure. Reconstructing a signal from its Fourier magnitude alone is known as the phase retrieval problem and is generally very difficult. In this talk, we will consider a recent convex optimization technique for phase recovery from Fourier measurements with random masks.
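The masked measurement model can be written down in a few lines. This is a toy illustration of my own (not a recovery algorithm): each random mask modulates the signal before the Fourier transform, only squared magnitudes are recorded, and the global phase of the signal is provably lost.

```python
import numpy as np

rng = np.random.default_rng(0)
n, num_masks = 16, 4
x = rng.standard_normal(n) + 1j * rng.standard_normal(n)   # unknown signal
masks = rng.choice([1, -1, 1j, -1j], size=(num_masks, n))  # random phase masks

def measure(x, masks):
    # phaseless measurements: squared Fourier magnitudes of the masked signal
    return np.abs(np.fft.fft(masks * x, axis=1)) ** 2

y = measure(x, masks)

# magnitude-only data cannot distinguish x from exp(i*theta) * x
y_rotated = measure(np.exp(1j * 0.7) * x, masks)
print(np.allclose(y, y_rotated))   # True: the global phase is lost
```

This global-phase ambiguity is intrinsic to magnitude-only measurements, so recovery is only ever possible up to a unimodular constant; the masks serve to remove the further ambiguities that plain Fourier magnitudes would leave.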


**Emmanuel Abbe** (EPFL):

**Leap complexity and generalization on the unseen**

Boolean functions with a sparse Fourier transform on uniform inputs are known to be efficiently learnable in the PAC model.
Here we consider the more specific model of learning with SGD on 'regular' neural networks.
We claim that the sample complexity of learning sparse target functions in this model is controlled by a new "leap" complexity measure,
which measures how "hierarchical" target functions are in the orthonormal L2 basis (the Fourier transform for Boolean functions).
For depth 2, we prove such a claim: we show that a time complexity of d^Leap is sufficient for a (layerwise projected) SGD and necessary for noisy GD.
In particular, it is shown that SGD learns such functions with saddle-to-saddle dynamics, climbing the degree of hierarchical features.
We then discuss consequences for out-of-distribution generalization and how this leads to a new 'degree curriculum' learning algorithm.
Joint work with E. Boix (MIT), T. Misiakiewicz (Stanford), S. Bengio (Apple), and A. Lotfi (EPFL).


**Simon S. Du** (University of Washington):

**Passive and Active Multi-Task Representation Learning**

Representation learning has been widely used in many applications. In this talk, I will present our work, which uncovers, from a statistical learning point of view, when and why
representation learning provably improves sample efficiency. Furthermore, I will discuss how to actively select the most relevant task to boost performance.
