# Past Talks in Joint Analysis Seminar

There weren't any events in the past six months.

# Past Talks in Post Graduate Seminar

11-28-2022, 11:30 AM

**Jingzhao Zhang** (Tsinghua University):**On the (Non)smoothness of Neural Network Training**

In this talk, we will discuss the following questions - why is neural network training non-smooth from an optimization perspective, and how should we analyze convergence in nonsmooth settings? We will start by showing that the non-smoothness is essential to standard neural network training procedures, and that network training converges in an unstable manner. We then provide theoretical models for understanding why optimization in neural network is unstable. We conclude by showing that new definitions of convergence in the nonsmooth settings can reconcile theory with practice.

11-07-2022, 12:30 PM

10-31-2022, 10:00 AM

**Zhiyuan Li** (Toyota Technological Institute at Chicago):**Understanding Gradient Descent on Edge of Stability in Deep Learning**

Deep learning experiments by Cohen et al. [2021] using deterministic Gradient Descent (GD) revealed an Edge of Stability (EoS) phase when learning rate (LR) and sharpness (i.e., the largest eigenvalue of Hessian) no longer behave as in traditional optimization. Sharpness stabilizes around 2/LR and loss goes up and down across iterations, yet still with an overall downward trend. In this talk, we present a mathematical analysis for a new mechanism of implicit regularization in the EoS phase, whereby GD updates due to non-smooth loss landscape turn out to evolve along some deterministic flow on the manifold of minimum loss. This is in contrast to many previous results about implicit bias either relying on infinitesimal updates or noise in gradient. Formally, for any smooth function L with certain regularity condition, this effect is demonstrated for (1) Normalized GD, i.e., GD with loss L and a varying LR, which is proportional to the inverse of gradient norm and (2) GD with constant LR and a non-smooth loss, $\sqrt{L}$. Both provably enter the Edge of Stability regime, with the associated flow on the manifold minimizing sharpness. The above theoretical results have been corroborated by an experimental study. If time permits, I will also talk about our recent results on how EoS emerges naturally from running GD on a scale invariant loss (networks with normalization layers) with l2 regularization with constant learning rate.

10-24-2022, 12:30 PM

**Jeremy M. Cohen** (Carnegie Mellon University):**The dynamics of optimization in deep learning**

Neural networks are trained using first-order optimization algorithms. The simplest algorithm in this class is full-batch gradient descent with a fixed step size. In the traditional viewpoint, gradient descent converges because this step size is set sufficiently small compared to the local curvature (second derivatives). Yet, experiments have shown that this viewpoint does not describe the actual behavior of gradient descent in deep learning. Instead, a ubiquitous feature of practical deep learning tasks is that the negative gradient points in a direction of increased curvature. As a consequence, for practical step sizes, the curvature does not remain “small enough” along the optimization trajectory, but rather rises past the threshold of 2/(step size) at which gradient descent becomes unstable — an eventuality not envisioned by the traditional theory. Interestingly, after becoming unstable, gradient descent does not fail, but rather starts to operate at the “edge of stability,” an unusual regime in which unstable oscillations along high-curvature directions automatically regulate the curvature from increasing beyond the threshold of 2/(step size). These findings suggest that much (if not most) of optimization theory will need to be rethought for deep learning. We will conclude by discussing a number of open problems.

09-07-2022, 03:00 PM

**Ekkehard Schnoor** (RWTH Aachen University):**Deciphering Lasso-based Classification Through a Large Dimensional Analysis of the Iterative Soft-Thresholding Algorithm**

Machine learning is fundamentally about generalization, i.e. aims for performance guarantuees for unseen (test) data, based on the training samples available. Typically, generalization error bounds are obtained through uniform convergence results (e.g. based on VC dimension bounds of the hypothesis class) and depend on the ratio of the (finite) data dimension and the sample size. In this talk, we present a different approach that aims to predict the asymptotically precise performance (classification accuracy) of a classifier, in a realistic regime where the dimension of the data and the sample size tend to infinity, at the same order of magnitude. Here, in contrast to the framework of uniform convergence, we investigate how the data distribution induces a probability distribution over the hypothesis class. We will study this approach at the example of Lasso-based sparse linear classifiers learned through the iterative soft-thresholding algorithm (ISTA).

(Joint work with Malik Tiomoko, Mohamed El Amine Seddik, Igor Colin and Aladin Virmaux.)

07-15-2022, 10:30 AM

**Daniel Soudry** (Technion):**On stability and efficiency in deep learning**

This talk has three unrelated parts:

[1] "The Implicit Bias of Minima Stability: A View from Function Space", R. Mulayoff, T. Michaeli, D. Soudry, NeurIPS 2021. We prove that the tendency of SGD to converge to dynamically stable ("flat") minima leads to a step-size dependent bound on the smoothness of the learned function, in shallow univariate ReLU nets.

[2] "Accelerated Sparse Neural Training: A Provable and Efficient Method to Find N:M Transposable Masks", I. Hubara, B. Chmiel, M. Island, R. Banner, S. Naor, D. Soudry, NeurIPS 2021. We show how to reduce training time by 33%, using fine-grained (2:4) sparsity on both the weight matrix and its transpose, by mapping this problem to a min-cost flow problem and finding an efficient 2-approximation.

[3] "Physics-Aware Downsampling with Deep Learning for Scalable Flood Modeling", N. Giladi, Z. Ben-Haim, S. Nevo, Y. Matias, D. Soudry, NeurIPS 2021. We show how to have large speed-ups (e.g., x4096) in PDE-based simulations for flood predictions, by downsampling the terrain map using a physics-aware deep net which aims to minimize its steady-state prediction error.

07-11-2022, 12:30 PM

**Frederik Hoppe** (RWTH Aachen University):**Confidence intervals for compressive MRI**

High-dimensional statistical problems can be found in many applications in the machine learning framework. In order to decide if a solution is good or not, uncertainty quantification must be performed. However, this task is challenging and very few methods have been proposed. Arguably, the most important one, known as desparsified LASSO, removes the bias of the LASSO. Initial works on this method focused on real Gaussian designs as a toy model for such high-dimensional problems. However, in medical imaging applications, such as compressive sensing for MRI, the measurement systems possess physical constraints that translate into structured sensing matrices. The purpose of this work is to extend debiased solutions of the LASSO to the complex subsampled Fourier case that is present in MRI applications. This allows us to construct valid confidence intervals for every pixel in MR images. This is a joint work with Felix Krahmer, Claudio Mayrink Verdun and Holger Rauhut.

07-04-2022, 12:30 PM

**Robert J. Kunsch** (RWTH Aachen University):**How much randomness is needed for high-confidence Monte Carlo integration?**

We study Monte Carlo methods for integrating smooth d-variate functions based on n function evaluations.\ The classical way of assessing the precision \uc0\u949 of a randomized integration method is based on the root mean squared error (RMSE) or the mean error. Optimal Monte Carlo error rates in terms of the mean error are well known for classical isotropic as well as mixed smoothness Sobolev spaces. One integration method known to be optimal is based on a randomly shifted and dilated Frolov rule and uses only 2d random parameters. If, however, the error is measured in terms of small error \u949 with high probability 1\u8722 \u948 , the so-called probabilistic error criterion, random Frolov turn out to be suboptimal with the error \u949 =e(n,\u948 ) depending polynomially on 1/\u948 instead of the polynomial dependence on log(1/\u948 ) we hope for. Some integration methods with optimal confidence properties exploit concentration phenomena, e.g. Hoeffding\'92s inequality, but require independent sample points; in any event, the amount of random numbers in such optimal methods is proportional to dn. This raises the question: How small can the probabilistic error be if we limit the amount of randomness? Restricted Monte Carlo methods that only use a small amount of random bits have been studied by Heinrich, Novak, and Pfeiffer (2004) for the RMSE criterion. A similar study for the probabilistic error criterion of restricted Monte Carlo methods will be presented.

06-13-2022, 12:30 PM

**Laura Paul** (RWTH Aachen University):**Covariance Estimation for Massive MIMO**

Massive multiple-input multiple-output (MIMO) communication systems are very promising for wireless communication and fifth generation (5G) cellular networks. In massive MIMO, a large number of antennas are employed at the base station (BS), which provides a high degree of spatial freedom and enables the BS to simultaneously communicate with multiple user terminals. Due to the limited angular spread, the user channel vectors lie on low-dimensional subspaces. For each user, we aim to find a low-dimensional beamforming subspace that captures a large amount of the power of its channel vectors. We will see, that this signal subspace estimation problem can be reduced to finding a good estimator of the covariance matrix in terms of a truncated version of the nuclear norm. Since the channel covariance matrix is not a priori known in practice, it has to be estimated from the observed data samples. In this talk, theoretical guarantees for signal covariance estimation from compressed measurements are investigated. We will derive bounds on the estimation error in terms of the number of observed time samples, the number of sampled entries (antennas) and the truncation level by using a result for the restricted isometry constant of partial circulant random matrices generated by isotropic vectors with independent L-subgaussian entries.\ \ Based on joint work with Sjoerd Dirksen and Holger Rauhut.

06-07-2022, 04:30 PM

05-30-2022, 12:30 PM

**Leonardo Galli** (RWTH Aachen University):**A Polyak Nonmonotone Stochastic Gradient Descent Method for Training Deep Neural Networks**

Deep Learning (DL) models are nowadays the key to achieve state-of-the-art performances in a wide range of real-world applications. Despite their extensive use, the understanding of DL is limited and the theoretical guarantees are few. Moreover, performances of DL models are very sensible to various training algorithmic choices, for instance and above all, the learning rate. The robust selection of the learning rate is indeed a very central topic in DL and it is still one of the main concern of researchers working in this field. In this project, we study nonmonotone Stochastic Gradient Descent (SGD) algorithms to improve the robustness of the training method w.r.t. the learning rate. This is the first time in which nonmonotone line searches are applied together with a stochastic method. For this reason, the convergence rates of deterministic nonmonotone methods needs to be adapted to the stochastic setting. To conclude, a Polyak method for selecting the initial step is here shown to achieve very promising numerical results.

05-09-2022, 12:30 PM

**Josef Teichmann** (ETH Zürich):**Machine Learning in Finance via randomization**

Randomized Signature or random feature selection are two instances of machine learning, where randomly chosen structures appear to be highly expressive. We analyze several aspects of the theory behind it, show that these structures have several theoretically attractive properties and introduce two classes of examples from finance. (joint works with Christa Cuchiero, Lukas Gonon, Lyudmila Grigoryeva, Martin Larsson, and Juan-Pablo Ortega).

04-25-2022, 12:30 PM

**Francis Bach** (INRIA):**Statistics, Machine Learning, and Optimization with Kernel Sums-of-Squares**

In this talk, I will present recent work on representing non-negative functions with infinite-dimensional sums of squares, with applications to non-convex optimization, optimal transport, optimal control, Bayesian inference, and shape-constrained optimization.

04-11-2022, 12:30 PM

**Hung-Hsu Chou** (RWTH Aachen University):**More is Less: Inducing Sparsity via Overparameterization**

In deep learning it is common to overparameterize neural networks, that is, to use more parameters than training samples. Quite surprisingly training the neural network via (stochastic) gradient descent leads to models that generalize very well, while classical statistics would suggest overfitting. In order to gain understanding of this implicit bias phenomenon we study the special case of sparse recovery (compressed sensing) which is of interest on its own. More precisely, in order to reconstruct a vector from underdetermined linear measurements, we introduce a corresponding overparameterized square loss functional, where the vector to be reconstructed is deeply factorized into several vectors. We show that, if there exists an exact solution, vanilla gradient flow for the overparameterized loss functional converges to a good approximation of the solution of minimal $\\ell_1$-norm. The latter is well-known to promote sparse solutions. As a by-product, our results significantly improve the sample complexity for compressed sensing via gradient flow/descent on overparameterized models derived in previous works. The theory accurately predicts the recovery rate in numerical experiments. Our proof relies on a analyzing a certain Bregman divergence of the flow. This bypasses the obstacles caused by non-convexity and should be of independent interest.

04-04-2022, 04:30 PM

**Sharan Vaswani** (Simon Fraser University):**Towards Noise-adaptive, Problem-adaptive Stochastic Gradient Descent**

We aim to make stochastic gradient descent (SGD) adaptive to (i) the noise $\\sigma^2$ in the stochastic gradients and (ii) problem-dependent constants. When minimizing smooth, strongly-convex functions with condition number $\\kappa$, we prove that $T$ iterations of SGD with exponentially decreasing step-sizes and knowledge of the smoothness can achieve an $\\tilde\{O\} \\left(\\exp \\left( \\nicefrac\{-T\}\{\\kappa\} \\right) + \\nicefrac\{\\sigma^2\}\{T\} \\right)$ rate, without knowing $\\sigma^2$. In order to be adaptive to the smoothness, we use a stochastic line-search (SLS) and show (via upper and lower-bounds) that SGD with SLS converges at the desired rate, but only to a neighbourhood of the solution. On the other hand, we prove that SGD with an offline estimate of the smoothness converges to the minimizer. However, its rate is slowed down proportional to the estimation error. Next, we prove that SGD with Nesterov acceleration and exponential step-sizes (referred to as ASGD) can achieve the near-optimal $\\tilde\{O\} \\left(\\exp \\left( \\nicefrac\{-T\}\{\\sqrt\{\\kappa\}\} \\right) + \\nicefrac\{\\sigma^2\}\{T\} \\right)$ rate, without knowledge of $\\sigma^2$. When used with offline estimates of the smoothness and strong-convexity, ASGD still converges to the solution, albeit at a slower rate. Finally, we empirically demonstrate the effectiveness of exponential step-sizes coupled with a novel variant of SLS.

Link: https://arxiv.org/abs/2110.11442

02-07-2022, 12:30 PM

**Shahar Mendelson** (University of Warwick):**Uniform preservation of tails**

Given a class of functions on a probability space (Omega,mu) and X_1,...,X_N that are independent and distributed according to mu, the hope is that, with high probability with respect to \\mu^N and for many (all?) possible values of t, the tail distributions N^\{-1\}|\{ i : f(X_i) > t\}| accurately capture the true distributions P( f(X) > t) for every f in the class. I will present some recent progress on this question and some applications of the current state of the art.

01-31-2022, 04:30 PM

**Mark Schmidt** (University of British Columbia):**Faster Algorithms for Deep Learning?**

The last 10 years have seen a revolution in stochastic gradient methods, with variance-reduced methods like SAG/SVRG provably achieving faster convergence rates than all previous methods. These methods give dramatic speedups in a variety of applications, but have had virtually no impact to the practice of training deep models. We hypothesize that this is due to the over-parameterized nature of modern deep learning models, where the models are so powerful that they could fit every training example with zero error (at least theoretically). Such over-parameterization nullifies the benefits of variance-reduced methods, because in some sense it leads to "easier" optimization problems. In this work, we present algorithms specifically designed for over-parameterized models. This leads to methods that provably achieve Nesterov acceleration, methods that automatically tune the step-size as they learn, and methods that achieve superlinear convergence with second-order information.

01-17-2022, 12:30 PM

**Lyudmila Grigoryeva** (University of Warwick):**Deep Learning for Discrete-time Pricing and Calibration**

The pricing and hedging of derivative contracts have been extensively studied in the context of stochastic volatility models in both continuous and discrete time. The former suffer from inability to precisely capture the ``stylized facts'' observed in financial markets while the latter rarely admit closed form pricing formulas and hence require computationally expensive Monte-Carlo simulations. We propose an alternative deep learning approach and demonstrate that convolutional neural networks (CNNs) allow for accurate approximation of pricing maps and calibration mappings in the discrete time setting. We showcase these claims with the Heston-Nandi GARCH model for which a semi-closed pricing formula is available and hence the quantitative performance of deep learning approach can be assessed. Moreover, we illustrate that pricing and calibrating CNNs can be used as transfer learning models in a real data application. Additionally, illustrate that missing data problems are efficiently combatted by means of extreme learning machines (ELMs).

01-10-2022, 04:30 PM

**Katya Scheinberg** (Cornell University):**Stochastic Oracles and Where to Find Them**

Continuous optimization is a mature field, which has recently undergone major expansion and change. One of the key new directions is the development of methods that do not require exact information about the objective function. Nevertheless, the majority of these methods, from stochastic gradient descent to "zero-th order" methods use some kind of approximate first order information. We will overview different methods of obtaining this information in different settings, including simple stochastic gradient via sampling, traditional and randomized finite difference methods and more. We will discuss what key properties of these inexact, stochastic order oracles and compare the cost and benefit of having access to gradient estimates versus function value estimates.

12-13-2021, 12:30 PM

**Volkan Cevher** (EPFL):**Optimization challenges in adversarial machine learning**

Thanks to neural networks (NNs), faster computation, and massive datasets, machine learning (ML) is under increasing pressure to provide automated solutions to even harder real-world tasks beyond human performance with ever faster response times due to potentially huge technological and societal benefits. Unsurprisingly, the NN learning formulations present a fundamental challenge to the back-end learning algorithms despite their scalability, in particular due to the existence of traps in the non-convex optimization landscape, such as saddle points, that can prevent algorithms from obtaining \'93good\'94 solutions.

In this talk, we describe our recent research that has demonstrated that the non-convex optimization dogma is false by showing that scalable stochastic optimization algorithms can avoid traps and rapidly obtain locally optimal solutions. Coupled with the progress in representation learning, such as over-parameterized neural networks, such local solutions can be globally optimal.

Unfortunately, this talk will also demonstrate that the central min-max optimization problems in ML, such as generative adversarial networks (GANs), robust reinforcement learning (RL), and distributionally robust ML, contain spurious attractors that do not include any stationary points of the original learning formulation. Indeed, we will describe how algorithms are subject to a grander challenge, including unavoidable convergence failures, which could explain the stagnation in their progress despite the impressive earlier demonstrations. We will conclude with promising new preliminary results from our recent progress on some of these difficult challenges.

12-06-2021, 12:30 PM

**Stefan Kunis** (Universität Osnabrück):**Super-resolution of points and curves**

The term super-resolution is used in many different contexts but in general asks for resolving fine details from low-frequency measurements. Algebraic techniques have proven useful in this context with different imaging tasks such as spike reconstruction (single molecule microscopy), phase retrieval (X-ray crystallography), and contour reconstruction (natural images). The available data typically consists of a blurred version of the specimen or equivalently trigonometric moments of low to moderate order and one asks for the reconstruction of fine details modeled by zero- or positive-dimensional algebraic varieties. Often, such reconstruction problems have a generically unique solution when the number of data is larger than the degrees of freedom in the model. Beyond that, we concentrate on simple a-priori conditions to guarantee that the reconstruction problem is well or only mildly ill conditioned. For the reconstruction of points on the complex torus, popular results ask the order of the moments to be larger than the inverse minimal distance of the points. Moreover, simple and efficient eigenvalue based methods achieve this stability numerically in specific settings. Recently, the situations involving clustered points, points with multiplicities, and positive-dimensional algebraic varieties are starting to gain interest, and these shall be discussed within this talk.

11-22-2021, 12:30 PM

**Guido F. Montufar** (UCLA):**Geometry of Linear Convolutional Networks**

We study the family of functions that are represented by a linear convolutional neural network (LCN). These functions form a semi-algebraic subset of the set of linear maps from input space to output space. In contrast, the families of functions represented by fully-connected linear networks form algebraic sets. We observe that the functions represented by LCNs can be identified with polynomials that admit certain factorizations, and we use this perspective to describe the impact of the network's architecture on the geometry of the resulting function space. We further study the optimization of an objective function over an LCN, analyzing critical points in function space and in parameter space, and describing dynamical invariants for gradient descent. Overall, our theory predicts that the optimized parameters of an LCN will often correspond to repeated filters across layers, or filters that can be decomposed as repeated filters. We also conduct numerical and symbolic experiments that illustrate our results and present an in-depth analysis of the landscape for small architectures. This talk is based on joint work with Kathl\'e9n Kohn, Thomas Merkh, Matthew Trager: https://arxiv.org/abs/2108.01538.

11-15-2021, 12:30 PM

**Ekkehard Schnoor** (RWTH Aachen University):**Generalization Error Bounds for Iterative Recovery Algorithms Unfolded as Neural Networks**

We present generalization error bounds for iterative recovery algorithms unfolded as neural networks, based on estimates of the Rademacher complexity of hypothesis classes consisting of such deep networks.\
In the first part, we consider compressive sensing in the scenario where the sparsity basis (dictionary) is not known in advance, but needs to be learned from examples. Motivated by the well-known iterative soft thresholding algorithm (ISTA) for the reconstruction, we define deep networks parametrized by the dictionary. Based on training samples, we aim at learning the optimal sparsifying dictionary and thereby the optimal network that reconstructs signals from their low-dimensional linear measurements. The dictionary learning is performed via minimizing the empirical risk. In the second part, we extend our previous work to a more general and flexible scenario, allowing different degrees of weight-sharing between the layers and also taking more trainable parameters into account, such as the thresholding parameters.

Based on joint work with Arash Beboobi and Holger Rauhut.

11-08-2021, 12:30 PM

**Afonso S. Bandeira** (ETH Zürich):**Noncommutative Matrix Concentration Inequalities**

Matrix Concentration inequalities such as Matrix Bernstein inequality have played an important role in many areas of pure and applied mathematics. These inequalities are intimately related to the celebrated noncommutative Khintchine inequality of Lust-Piquard and Pisier. In the middle of the 2010's, Tropp improved the dimensional dependence of this inequality in certain settings by leveraging cancellations due to non-commutativity of the underlying random matrices, giving rise to the question of whether such dependency could be removed.

In this talk we leverage ideas from Free Probability to fully remove the dimensional dependence in a range of instances, yielding optimal bounds in many settings of interest. As a byproduct we develop matrix concentration inequalities that capture non-commutativity (or, to be more precise, ``freeness''), improving over Matrix Bernstein in a range of instances. No background knowledge of Free Probability will be assumed in the talk.

Joint work with March Boedihardjo and Ramon van Handel, more information at arXiv:2108.06312 [math.PR].

10-25-2021, 12:30 PM

**Noam Razin** (Tel Aviv University):**Generalization in Deep Learning Through the Lens of Implicit Rank Minimization**

The mysterious ability of overparameterized neural networks to generalize is believed to stem from an implicit regularization, a tendency of gradient-based optimization to fit data with predictors of low complexity. A widespread hope is that this complexity can be characterized using norms, and a standard test-bed for studying this prospect is matrix factorization --- matrix completion via linear neural networks. In this talk, I will establish that the implicit regularization in matrix factorization cannot be captured by norms, proving there exist settings on which it drives all norms towards infinity while minimizing rank. Then, as a step towards practical deep learning, I will show that the tendency to low rank extends from matrix factorization to tensor factorization --- tensor completion via certain non-linear neural networks. Motivated by tensor rank capturing the implicit regularization of these non-linear networks, I will propose it as a measure of complexity, and present empirical evidence that it captures aspects of natural data. Our results suggest that notions of rank (such as tensor rank) may shed light on both implicit regularization of neural networks, and properties of real-world data translating this implicit regularization to generalization.

Works covered in this talk were in collaboration with Asaf Maman and Nadav Cohen.

10-04-2021, 12:30 PM

**Jonathan Siegel** (Penn State University):**The Approximation Theory of Shallow Neural Networks**

A shallow neural network is a linear combination of ridge functions whose profile is determined by a fixed activation function. We will introduce spaces of functions which can be efficiently approximated by shallow neural networks for a wide variety of activation functions and analyze their properties. Specifically, we will compute their metric entropy and n-widths, which are fundamental quantities in approximation theory that control the limits of linear and non-linear approximation and statistical estimation for a given class of functions. Consequences of these results include: the optimal approximation rates which can be attained for shallow neural networks, that shallow neural networks dramatically outperform linear methods of approximation, and indeed that shallow neural networks outperform all stable methods of approximation on these spaces. This will provide insights into how neural networks break the curse of dimensionality. Finally, we discuss algorithmic aspects of approximation by shallow networks. Specifically, we analyze a class of greedy algorithms and show that they can attain the theoretically optimal approximation rates.

09-27-2021, 12:30 PM

**Dominik Stoeger** (USC):**Small random initialization is akin to spectral learning: Optimization and generalization guarantees for overparameterized low-rank matrix reconstruction**

Modern machine learning models are typically trained in an over-parameterized regime where the parameters of the model far exceed the size of the training data. Due to over-parameterization, these models in principle have the capacity to (over)fit any set of labels including pure noise. Despite this high fitting capacity, somewhat paradoxically, these models trained via first-order methods continue to predict well on yet unseen test data. In this talk, I will focus on overparameterized learning in the context of low-rank reconstruction from a few measurements. For this problem, I will show that despite the presence of many global optima gradient descent from small random initialization converges to a generalizable solution and finds the underlying low-rank matrix. Notably, this analysis is not in the \'93lazy\'94 training regime and is based on intriguing phenomena uncovering the critical role of small random initialization: a few iterations of gradient descent behaves akin to popular spectral methods. We also show that this implicit spectral bias from small random initialization, which is provably more prominent for overparameterized models, puts the gradient descent iterations on a particular trajectory towards solutions that are not only globally optimal but also generalize well.

09-20-2021, 03:30 PM

**Joan Bruna** (NYU):**On the role of data structure in learning guarantees**

High-dimensional learning remains an outstanding task where empirical successes often coexist alongside mathematical and statistical curses. In this talk, we will describe two vignettes of this tension that underscore the importance of distributional assumptions. First, we will describe the role of invariance and symmetry priors in statistical learning theory, by studying the gains in sample complexity brought by incorporating these priors into the learning model. Next, we will describe the role of data structure on the computational side, by studying computational-to-statistical gaps arising in the seemingly simple problem of learning a single neuron. \ Joint work with Alberto Bietti and Luca Venturi (first part), and Min Jae Song and Ilias Zadik (second part).

08-23-2021, 09:00 AM

**Yi Ma** (UC Berkeley):**White-Box Deep (Convolution) Networks from First Principles**

In this talk, we offer an entirely “white box” interpretation of deep (convolution) networks from the perspective of data compression (and group invariance). In particular, we show how modern deep layered architectures, linear (convolution) operators and nonlinear activations, and even all parameters can be derived from the principle of maximizing rate reduction (with group invariance). All layers, operators, and parameters of the network are explicitly constructed via forward propagation, instead of learned via back propagation. All components of so-obtained network, called ReduNet, have precise optimization, geometric, and statistical interpretation. There are also several nice surprises from this principled approach: it reveals a fundamental tradeoff between invariance and sparsity for class separability; it reveals a fundamental connection between deep networks and Fourier transform for group invariance \'96 the computational advantage in the spectral domain (why spiking neurons?); this approach also clarifies the mathematical role of forward propagation (optimization) and backward propagation (variation). In particular, the so-obtained ReduNet is amenable to fine-tuning via both forward and backward (stochastic) propagation, both for optimizing the same objective.

This is joint work with students Yaodong Yu, Ryan Chan, Haozhi Qi of Berkeley, Dr. Chong You now at Google Research, and Professor John Wright of Columbia University. A related paper can be found at: https://arxiv.org/abs/2105.10446

07-26-2021, 12:30 PM

**Gabin Maxime Nguegnang** (RWTH Aachen University):**Convergence of Gradient Descent in Learning Deep Liner Neural Network**

In this work, we study the convergence properties of gradient descent in training deep linear neural networks. We prove that if the initial weight matrices are approximately balanced and the weight norm of each of this matrices is bounded by a constant which depends on the initial loss value and the network depth (N) then the weight matrices will stay bounded throughout all iterations. In addition, we show that under suitable conditions gradient descent with both constant and non constant step size converges to a critical point of the loss function. Moreover, we demonstrate that for almost all initialization gradient descent converges to a global minimum in the case of two layers. In the case of three or more layers we prove that gradient descent converges to a global minimum on the manifold matrices of rank k less or equal to r the rank of the over-parameterized matrix of the network. Finally, we provide an attempt to extend these results to stochastic gradient descent.

07-19-2021, 12:30 PM

**Johannes Lederer** (Ruhr-University Bochum):**Sparse Deep Learning**

Sparsity is popular in statistics and machine learning, because it can avoid overfitting, speed up computations, and facilitate interpretations. In deep learning, however, the full potential of sparsity still needs to be explored. In this presentation, we first discuss sparsity in the framework of high-dimensional statistics and corresponding statistical theories. We then use these insights to further our understanding of sparsity in deep learning.

07-12-2021, 12:30 PM

**Juan-Pablo Ortega Lahuerta** (Nanyang Technological University):**Reservoir Computing and the Learning of Dynamic Processes**

Dynamic processes regulate the behaviour of virtually any artificial and biological agent, from stock markets to epidemics, from driverless cars to healthcare robots. The problem of modeling, forecasting, and generally speaking learning dynamic processes is one of the most classical, sophisticated, and strategically significant problems in the natural and the social sciences. In this talk we shall discuss both classical and recent results on the modeling and learning of dynamical systems and input/output systems using an approach generically known as reservoir computing. This information processing framework is characterized by the use of cheap-to-train randomly generated state-space systems for which promising high-performance physical realizations with dedicated hardware have been proposed in recent years. In our presentation we shall put a special emphasis in the approximation properties of these constructions.

07-05-2021, 12:30 PM

**Frederik Hoppe** (RWTH Aachen University):**Debiased Lasso: Confidence intervals for the Lasso**

This talk approaches the problem of constructing confidence intervals for the Lasso estimator. Based on the Lasso, we build an estimator whose distribution is asymptotically known. We compare some existing approaches for this and identify the key requirements. The new estimator is called debiased or desparsified Lasso and can be decomposed into a \'93rest\'94 term and a \'93noise\'94 term. Under certain assumptions the \'93rest\'94 term vanishes and the remaining \'93noise\'94 term is asymptotically Gaussian distributed. This allows us to construct confidence intervals. Furthermore, we address a possible application in magnetic resonance imaging.

06-28-2021, 12:30 PM

**Vicky Kouni** (RWTH Aachen University):**Spark Deficient Gabor Frames for Inverse Problems**

In this presentation, we address two inverse problems: analysis Compressed Sensing and speech denoising. Both problems\'92 approach is based on a redundant, analysis-sparse representation of the original signals of interest. We pick an eigenvector of the Zauner unitary matrix and \'96under certain assumptions on the ambient dimension\'96 we use it as window vector to generate a spark deficient Gabor frame. The analysis operator associated with such a frame, is a (highly) redundant Gabor transform, which we use as a sparsifying transform in both analysis Compressed Sensing and denoising procedures. We conduct computational experiments on synthetic and real-world data, solving the analysis l1-minimization and analysis basis pursuit for Compressed Sensing and de- noising problems respectively, with four different choices of analysis operators, including our Gabor analysis operator. The results show that our proposed redundant Gabor transform outperforms \'96in all cases\'96 Gabor transforms generated by state-of-the-art window vectors of time-frequency analysis.

06-14-2021, 12:30 PM

**Alessandro Rudi** (INRIA and École Normale Supérieure):**Finding Global Minima via Kernel Approximations**

We consider the global minimization of smooth functions based solely on function evaluations. Algorithms that achieve the optimal number of function evaluations for a given precision level typically rely on explicitly constructing an approximation of the function which is then minimized with algorithms that have exponential running-time complexity. In this paper, we consider an approach that jointly models the function to approximate and finds a global minimum. This is done by using infinite sums of square smooth functions and has strong links with polynomial sum-of-squares hierarchies. Leveraging recent representation properties of reproducing kernel Hilbert spaces, the infinite-dimensional optimization problem can be solved by subsampling in time polynomial in the number of function evaluations, and with theoretical guarantees on the obtained minimum.\ Given n samples, the computational cost is O(n^(3.5)) in time, O(n^2) in space, and we achieve a convergence rate to the global optimum that is O(n^(\uc0\u8722 m/d+1/2+3/d)) where m is the degree of differentiability of the function and d the number of dimensions. The rate is nearly optimal in the case of Sobolev functions and more generally makes the proposed method particularly suitable for functions that have a large number of derivatives. Indeed, when m is in the order of d, the convergence rate to the global optimum does not suffer from the curse of dimensionality, which affects only the worst-case constants.

06-07-2021, 12:30 PM

**Leonardo Galli** (RWTH Aachen University):**Robustness and Generalization in Training Deep Neural Networks**

Recent breakthroughs in the field of AI and machine learning are achieved with Deep Learning (DL) models. Despite their extensive use to solve a wide range of real problems, the understanding of DL is limited and the theoretical guarantees are few. Moreover, performances of DL models are very sensible to various training algorithmic choices e.g., learning rate. The robust selection of the learning rate is indeed a very central topic in DL. In this project, we study nonmonotone Stochastic Gradient Descent (SGD) algorithms to improve the robustness of the training method w.r.t. the learning rate. The focus of this seminar will be on a first comparison between various learning rate selections. Results are showing that the current research avenue is promising and that nonmonotone techniques might play an interesting role in the selection of the learning rate.

05-10-2021, 12:30 PM

**Hung-Hsu Chou** (RWTH Aachen University):**Exploration and Exploitation: Understanding Deep Neural Network via Matrix Factorization**

This talk aim to provide analysis and experiments on linear neural network, and its relation with general neural network via kernel. It consists of two parts: exploration, where the dynamics is searching for the relevant feature space, and exploitation, where the dynamics is evaluating the importance of those features. Many of the current literature, including neural tangent kernel, commuting measurement for matrix sensing, and our our previous work, focus on the exploitation. However, exploitation can not be separated from exploration, and is in fact could be quite meaningless without exploration in practice. Thus we provide some analysis and experiments with the exploration, that will hopefully integrate the two parts. This is a joint work with Holger Rauhut and Johannes Maly.

05-03-2021, 12:30 PM

**Lénaïc Chizat** (Université Paris-Saclay):**Analysis of Gradient Descent on Wide Two-Layer ReLU Neural Networks**

In this talk, we propose an analysis of gradient descent on wide two-layer ReLU neural networks that leads to sharp characterizations of the learned predictor. The main idea is to study the dynamics when the width of the hidden layer goes to infinity, which is a Wasserstein gradient flow. While this dynamics evolves on a non-convex landscape, we show that for appropriate initializations, its limit, when it exists, is a global minimizer. We also study the implicit regularization of this algorithm when the objective is the unregularized logistic loss, which leads to a max-margin classifier in a certain functional space. We finally discuss what these results tell us about the generalization performance, and in particular how these models compare to kernel methods. This is based on joint work with Francis Bach, and be around my COLT 2020 https://arxiv.org/abs/2002.04486.

04-15-2021, 04:30 PM

**Panayotis Mertikopoulos** (Univ. Grenoble Alpes, CNRS):**Games, dynamics, and (min-max) optimization**

This talk aims to survey the triple-point interface between optimization, game theory, and dynamical systems. We will begin by discussing how the ordinary differential equation (ODE) method of stochastic approximation can be used to analyze the trajectories of a wide array of stochastic first-order algorithms in non-convex minimization problems and games. The key notion here is that of an internally chain transitive (ICT) set: in minimization problems, ICT sets correspond to the problem's critical points, and we discuss a range of conditions guaranteeing convergence to critical points while avoiding unstable saddle points. Similar results can also be derived for min-max problems and games: unstable stationary points are avoided and stable Nash equilibria are attracting with probability 1 (or with arbitrarily high probability in the local case). However, despite these encouraging results, the overall situation can be considerably more involved: we show that the sequence of play may converge with arbitrarily high probability to lower-dimensional manifolds that are in no way unilaterally stable (or even stationary). "Spurious attractors" of this type can arise even in simple two-player zero-sum games with one-dimensional action sets and polynomial losses of degree 4, a fact which highlights the fundamental gap between minimization problems and games.\ \ Relevant papers: The talk will be based on material from the preprints below - https://arxiv.org/pdf/2006.11144.pdf - https://arxiv.org/pdf/2006.09065.pdf

04-08-2021, 04:30 PM

**Robert Kunsch** (RWTH Aachen University):**A new rotation equivariant neural network for 3D point clouds - proof of concept**

Scanning data in real world applications are often organized as 3D point clouds rather than 2D images. Approaches for deep point cloud processing are diverse. In this talk, a new rotation equivariant neural architecture is presented for processing 3D point clouds that are assumed to be sampled from 3D manifolds. The architecture is inspired by PointNet and its state of the art extension PointNet++. The main idea is that when extracting features from the neighbourhood of a point (let us call it a \'93centroid\'94), the point cloud is locally rotated around that centroid to let the surface tangent plane fit the xy-plane in relative coordinates, feature extraction is repeated for a bunch of equidistant rotation angles within the xy-plane. That way, features come along with interpretable geometric data which might be useful e.g. in pose detection. First experiments on a toy problem prove that, indeed, the network can be trained on examples in fixed orientation and later perform successfully on rotated (and translated) test examples.

04-01-2021, 04:30 PM

**Michael Schaub** (RWTH Aachen University):**Learning from signals on graphs with unobserved edges**

In many applications we are confronted with the following system identification scenario: we observe a dynamical process that describes the state of a system at particular times. Based on these observations we want to infer the (dynamical) interactions between the entities we observe. In the context of a distributed system, this typically corresponds to a "network identification" task: find the (weighted) edges of the graph of interconnections. However, often the number of samples we can obtain from such a process are far too few to identify the edges of the network exactly. Can we still reliably infer some aspects of the underlying system? \ \ Motivated by this question we consider the following identification problem: instead of trying to infer the exact network, we aim to recover a (low-dimensional) statistical model of the network based on the observed signals on the nodes. More concretely, here we focus on observations that consist of snapshots of a diffusive process that evolves over the unknown network. We model the (unobserved) network as generated from an independent draw from a latent stochastic blockmodel (SBM), and our goal is to infer both the partition of the nodes into blocks, as well as the parameters of this SBM. We present simple spectral algorithms that provably solve the partition and parameter inference problems with high-accuracy. We further discuss some possible variations and extensions of this problem setup.

03-25-2021, 04:30 PM

**Yeonjong Shin** (Brown University):**Towards a mathematical understanding of modern machine learning: theory, algorithms, and applications**

Modern machine learning (ML) has achieved unprecedented empirical success in many application areas. However, much of this success involves trial-and-error and numerous tricks. These result in a lack of robustness and reliability in ML. Foundational research is needed for the development of robust and reliable ML. This talk consists of two parts. The first part will present the first mathematical theory of physics informed neural networks (PINNs) - one of the most popular deep learning frameworks for solving PDEs. Linear second-order elliptic and parabolic PDEs are considered. I will show the consistency of PINNs by adapting the Schauder approach and the maximum principle. The second part will focus on some recent mathematical understanding and development of neural network training. Specifically, two ML phenomena are analyzed -- "Plateau Phenomenon" and "Dying ReLU." New algorithms are developed based on the insights gained from the mathematical analysis to improve neural network training.

03-18-2021, 04:30 PM

**Marc Goerigk** (Uni Siegen):**Robust Optimization with Deep Neural Networks**

To derive a robust optimization model, a central ingredient is to identify a suitable model for uncertainty, containing all scenarios against which we wish to protect. An ongoing challenge in the recent literature is to derive uncertainty sets from given historical data. \ \ In this talk I propose to use deep neural networks to construct non-convex uncertainty sets from data. Trained neural networks can be integrated into a robust optimization model by formulating the adversarial problem as a convex quadratic mixed-integer program. This allows us to derive robust solutions through an iterative scenario generation process.\ \ In a comparison to uncertainty sets by kernel-based support vector clustering, I show that our method can give a better description of data, leading to robust solutions that considerably outperform the comparison method both with respect to objective value and feasibility. This presentation is based on joint work with Jannis Kurtz.

03-04-2021, 10:00 AM

**Rayan Saab** (UCSD):**A greedy algorithm for quantizing neural networks**

While neural networks have been remarkably successful in a wide array of application areas, miniaturizing them and implementing them in hardware remains an area of intense research. By quantizing, or replacing the weights of a neural network with quantized (e.g., 8-bit, or binary) counterparts, massive savings in cost, computation speed, memory, and power consumption can be attained. Of course, one wishes to attain these savings without compromising the performance of the network. We propose a new data-driven and computationally efficient method for quantizing the weights of already trained neural networks. Specifically, we quantize each neuron, or hidden unit, using a greedy path-following algorithm. We prove that this simple algorithm has favorable error guarantees for a single-layer neural network (or, alternatively, for quantizing the first layer of a multi-layer network) when the training data are random from an appropriate distribution. Under these assumptions, the quantization error decays with the width of the layer, i.e., its level of over-parametrization. We also provide numerical experiments, on multi-layer networks, to illustrate the performance of our methods.

02-25-2021, 04:30 PM

**Arthur Ulysse Jacot-Guillarmod** (EPFL):**Neural Tangent Kernel: Convergence and Generalization of DNNs**

Modern deep learning has popularized the use of very large neural networks, but the theoretical tools to study such networks are still lacking. The Neural Tangent Kernel (NTK) describes how the output neutrons evolve during training allowing a precise description of the convergence and generalization of DNNs. In the infinite width limit (when the number of hidden neutrons grows to infinity) the NTK converges to a deterministic and fixed limit ensuring convergence to a global minimum. In particular, training a DNN with the MSE loss corresponds to doing Kernel Ridge Regression (KRR) with the NTK. Under the assumption of i.i.d input samples, the risk of KRR can be approximated using the Signal Capture Threshold (SCT), which identifies which principal components of the signal are learned. A further approximation leads to the Kernel Alignement Risk Estimator (KARE), which predicts the test error of KRR from the train data only.

02-18-2021, 04:30 PM

**Lukas Gonon** (LMU):**Deep ReLU network expression rates for option prices in high-dimensional, exponential Levy models**

We study the expression rates of deep neural networks for option prices written on high-dimensional baskets of risky assets, whose log-returns are modelled by a multivariate Levy process with general correlation structure of jumps. We establish sufficient conditions on the characteristic triplet of the Levy process that ensure that the size of the DNN required to achieve a given approximation accuracy grows only polynomially with respect to the dimension of the Levy process and the reciprocal of the approximation accuracy, thereby overcoming the curse of dimensionality and justifying the use of DNNs in financial modelling of large baskets in markets with jumps. In addition, we exploit parabolic smoothing of Kolmogorov partial integrodifferential equations for certain multivariate Levy processes to present alternative architectures of ReLU DNNs that provide higher approximation rates, however, with constants potentially growing exponentially with respect to the dimension. Under stronger, dimension-uniform non-degeneracy conditions on the Levy symbol, we obtain algebraic expression rates of option prices in exponential Levy models which are free from the curse of dimensionality. In this case the ReLU DNN expression rates of prices depend on certain sparsity conditions on the characteristic Levy triplet. We indicate several consequences and possible extensions of the present results. The talk is based on joint work with Christoph Schwab.

02-11-2021, 04:30 PM

**Harald Oberhauser** (Oxford):**Random Paths, Random Tensors, and some Machine Learning**

Many real-world data sources have a natural sequential structure (e.g. patient data in health care, market data in finance, etc). For many inference tasks, it is important to capture this sequential structure. I will talk about how some classic ideas from pure mathematics provide a structured description of a path/sequence in terms of tensors. This can in turn be combined with ideas from machine learning to produce scalable methods. Vice versa, we will see how applications in machine learning lead to interesting theoretical questions in probability theory and stochastic analysis.

02-04-2021, 04:30 PM

**Felix Krahmer** (TUM):**Sketching with Kerdock's Crayons: Sparse transforms for arbitrary linear maps**

Data that admits a sparse representation arises naturally in many different applications. Consequently, efficiently computing the significant coefficients of a sparse signal is of great importance. For specific representation systems such as the Fourier basis or certain tensor bases, this problem has been studied in a number of works, and various fast algorithms are available. In most unstructured cases, however, no algorithm of computational complexity o(n^2), where n is the dimension, was known prior to our work. In this talk, we discuss a randomized approach to address this problem. Our method relies on a representation of the sparsity basis in terms of certain real-valued mutually unbiased bases derived from Kerdock codes. This representation has to be computed only once, which requires a total of O(n^3 log^2(n)) operations. We show that from this representation, our randomized fast transform computes the significant coefficients of any vector that is s-sparse signal in the given basis with high probability requiring only a significantly reduced computational complexity of O(n(s + log(n))). The talk is based on joint work with Tim Fuchs (TUM), David Gross (U Cologne), Richard Kueng (JKU Linz), and Dustin Mixon (OSU).

01-21-2021, 04:30 PM

**Massimo Fornasier** (TUM):**Identification of deep feed forward neural networks**

(A joint work with C. Fiedler, T. Klock, and M. Rauchensteiner.) In this talk we approach the problem of unique and stable identifiability of generic deep artificial neural networks from a finite number of input-output samples.\ The scope of this work is to shed light on explainability and interpretability of the parameters/weights of a network, its uniform stability, and the amount of needed input-output data for identifiability in the realizable context of data generated by a given generic network.\ More specifically we introduce the so-called \{\\it entangled weights\}, which compose weights of successive layers intertwined with suitable diagonal and invertible matrices depending on the activation functions and their shifts. We prove that entangled weights are completely and stably approximated by an efficient and robust algorithm as soon as O(D^2 x m)$ nonadaptive input-output samples of the network are collected, where $m$ is the number of neurons of the network and $D$ is the input dimension.\ Provided knowledge of layer assignments and of the remaining parameters, which can be further heuristically obtained by least squares, the entangled weights identify the network completely and uniquely.\ To highlight the relevance of the theoretical result of stable recovery of entangled weights, we present numerical experiments, which demonstrate that multilayered networks with generic weights can be robustly identified and therefore uniformly approximated by the presented algorithmic pipeline. In contrast backpropagation that cannot generalize stably very well in the current setting, being always limited by relatively large uniform error. In terms of practical impact, our study shows that we can relate input-output information uniquely and stably to network parameters, providing a form of explainability. Moreover, our method paves the way for compression of overparametrized networks, which could be also interpreted as a new form of malicious attack for network cloning.

01-14-2021, 04:30 PM

**Youness Boutaib** (RWTH Aachen University):**Path classification with continuous-time recurrent neural networks**

We seek mathematical learning guarantees for path-classification tasks using stochastic (linear, in this talk) recurrent networks with continuous-time rate dynamics. This architecture finds its origin in neuroscience and is simple as training only requires finding a pre-processing projection vector and the parameters of a read-out map. We give generalisation error bounds and argue that stochasticity provides learning with a robustness against adversarial attacks. We also explicitly show that training (in the case of our loss function) is equivalent to an optimisation problem of a functional on the signatures of the training paths. This is an ongoing joint work with Wiebke Bartolomeus (RWTH Aachen), S. Nestler (FZ J\'fclich) and H. Rauhut (RWTH Aachen).

12-17-2020, 04:30 PM

**Martin Genzel** (Utrecht University):**Robust Solutions to (Non-)Linear Inverse Problems: From Compressed Sensing to Deep Learning**

Recent advances in quantized compressed sensing and high-dimensional estimation have shown that signal recovery is even feasible under strong non-linear distortions in the observation process. An important characteristic of associated guarantees is uniformity, i.e., recovery succeeds for an entire class of structured signals with a fixed measurement ensemble. However, despite significant results in various special cases, a general understanding of uniform recovery from non-linear observations is still missing. In this talk, we will discuss a unified approach to this problem under the assumption of i.i.d. sub-Gaussian measurement vectors. It will turn out that a simple generalized Lasso estimator can serve as a universal recovery strategy, which is (outlier) robust and does not require explicit knowledge of the underlying non-linearity. A key technical ingredient is an approximative increment condition that can be implemented for many types of non-linear models. This flexibility allows us to apply our approach to a variety of problems in quantized compressed sensing and high-dimensional statistics.\ The second part of this talk is devoted to linear inverse problems and how they can be solved by deep learning techniques. Compared to traditional estimators such as the Lasso, recent works have pointed out that deep neural networks suffer from severe instabilities in various image reconstruction tasks. We will shed new light on this concern and deal with an extensive empirical study of the robustness of deep-learning-based algorithms for solving underdetermined inverse problems. This covers compressed-sensing tasks with Gaussian measurements as well as image recovery from Fourier measurements, including a real-world scenario for MRI (using the NYU fastMRI dataset). Our main focus is on computing adversarial perturbations of the measurements that maximize the reconstruction error. In contrast to previous findings, our results reveal that standard end-to-end network architectures are not only surprisingly resilient against statistical noise, but also against adversarial perturbations. Remarkably, all considered networks are trained by common deep learning techniques, without sophisticated defense strategies.\ This is joint work with Jan Macdonald (TU Berlin), Maximilian März (TU Berlin), and Alexander Stollenwerk (UCLouvain).

12-03-2020, 04:30 PM

**Michael Perlmutter** (UCLA):**Understanding Neural Networks via the Scattering Transform**

The scattering transform is a mathematical framework for understanding convolutional neural networks (CNNs), originally introduced by St\\'ephane Mallat for functions defined on $\\mathbb\{R\}^n.$ Similar to standard CNNs, the scattering transform is constructed as an alternating cascade of convolutions and nonlinearities. It differs from traditional CNNs by using predesigned, wavelet filters rather than filters learned from training data. This leads to a network that provably has desirable mathematical properties, such as translation invariance and diffeomorphism stability.\ \ In addition to these theoretical properties, the scattering transform is also a practical object that can achieve near state of the art numerical results in certain settings. Moreover, in addition to preforming well on tasks such as image recognition, the scattering transform can also be used to extract statistical information about stochastic processes with stationary increments. I will provide an overview of Mallat's original construction and also discuss recent variations of the scattering transform which are customized for certain tasks including geometric deep learning and quantum chemistry.