# Past Talks in Joint Analysis Seminar

There weren't any events in the past six months.

# Past Talks in Post Graduate Seminar

12-06-2021, 12:30 PM

**Stefan Kunis** (Universität Osnabrück):**Super-resolution of points and curves**

The term super-resolution is used in many different contexts but in general asks for resolving fine details from low-frequency measurements. Algebraic techniques have proven useful in this context with different imaging tasks such as spike reconstruction (single molecule microscopy), phase retrieval (X-ray crystallography), and contour reconstruction (natural images). The available data typically consists of a blurred version of the specimen or equivalently trigonometric moments of low to moderate order and one asks for the reconstruction of fine details modeled by zero- or positive-dimensional algebraic varieties. Often, such reconstruction problems have a generically unique solution when the number of data is larger than the degrees of freedom in the model. Beyond that, we concentrate on simple a-priori conditions to guarantee that the reconstruction problem is well or only mildly ill conditioned. For the reconstruction of points on the complex torus, popular results ask the order of the moments to be larger than the inverse minimal distance of the points. Moreover, simple and efficient eigenvalue based methods achieve this stability numerically in specific settings. Recently, the situations involving clustered points, points with multiplicities, and positive-dimensional algebraic varieties are starting to gain interest, and these shall be discussed within this talk.

11-22-2021, 12:30 PM

**Guido F. Montufar** (UCLA):**Geometry of Linear Convolutional Networks**

We study the family of functions that are represented by a linear convolutional neural network (LCN). These functions form a semi-algebraic subset of the set of linear maps from input space to output space. In contrast, the families of functions represented by fully-connected linear networks form algebraic sets. We observe that the functions represented by LCNs can be identified with polynomials that admit certain factorizations, and we use this perspective to describe the impact of the network's architecture on the geometry of the resulting function space. We further study the optimization of an objective function over an LCN, analyzing critical points in function space and in parameter space, and describing dynamical invariants for gradient descent. Overall, our theory predicts that the optimized parameters of an LCN will often correspond to repeated filters across layers, or filters that can be decomposed as repeated filters. We also conduct numerical and symbolic experiments that illustrate our results and present an in-depth analysis of the landscape for small architectures. This talk is based on joint work with Kathlén Kohn, Thomas Merkh, Matthew Trager: https://arxiv.org/abs/2108.01538.

11-15-2021, 12:30 PM

**Ekkehard Schnoor** (RWTH Aachen University):**Generalization Error Bounds for Iterative Recovery Algorithms Unfolded as Neural Networks**

We present generalization error bounds for iterative recovery algorithms unfolded as neural networks, based on estimates of the Rademacher complexity of hypothesis classes consisting of such deep networks.

In the first part, we consider compressive sensing in the scenario where the sparsity basis (dictionary) is not known in advance, but needs to be learned from examples. Motivated by the well-known iterative soft thresholding algorithm (ISTA) for the reconstruction, we define deep networks parametrized by the dictionary. Based on training samples, we aim at learning the optimal sparsifying dictionary and thereby the optimal network that reconstructs signals from their low-dimensional linear measurements. The dictionary learning is performed by minimizing the empirical risk. In the second part, we extend our previous work to a more general and flexible scenario, allowing different degrees of weight-sharing between the layers and also taking more trainable parameters into account, such as the thresholding parameters.

Based on joint work with Arash Behboodi and Holger Rauhut.
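
The unfolding idea can be illustrated with a minimal sketch: each ISTA iteration becomes one network layer whose weights are determined by the (learnable) dictionary. Everything below is an illustrative assumption, not the authors' implementation; in particular, the step size, threshold, and depth are ad hoc.

```python
import numpy as np

def soft_threshold(z, tau):
    # Componentwise soft-thresholding, the proximal map of the l1-norm.
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def unfolded_ista(y, A, D, n_layers=10, tau=0.1, step=None):
    """One forward pass of ISTA unfolded into n_layers network layers.

    y: measurements, A: measurement matrix, D: (trainable) dictionary.
    The reconstruction is x = D @ c, where c is the sparse code.
    """
    M = A @ D                                     # effective sensing matrix
    if step is None:
        step = 1.0 / np.linalg.norm(M, 2) ** 2    # 1/L with L = ||M||_2^2
    c = np.zeros(D.shape[1])
    for _ in range(n_layers):                     # each iteration = one layer
        c = soft_threshold(c + step * M.T @ (y - M @ c), step * tau)
    return D @ c
```

In the learned setting sketched in the abstract, the dictionary D (and, in the second part, the thresholding parameters) would be trained by minimizing the empirical risk over such forward passes.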

11-08-2021, 12:30 PM

**Afonso S. Bandeira** (ETH Zürich):**Noncommutative Matrix Concentration Inequalities**

Matrix Concentration inequalities such as the Matrix Bernstein inequality have played an important role in many areas of pure and applied mathematics. These inequalities are intimately related to the celebrated noncommutative Khintchine inequality of Lust-Piquard and Pisier. In the mid-2010s, Tropp improved the dimensional dependence of this inequality in certain settings by leveraging cancellations due to non-commutativity of the underlying random matrices, giving rise to the question of whether such dependency could be removed.

In this talk we leverage ideas from Free Probability to fully remove the dimensional dependence in a range of instances, yielding optimal bounds in many settings of interest. As a byproduct we develop matrix concentration inequalities that capture non-commutativity (or, to be more precise, "freeness"), improving over Matrix Bernstein in a range of instances. No background knowledge of Free Probability will be assumed in the talk.

Joint work with March Boedihardjo and Ramon van Handel; more information at arXiv:2108.06312 [math.PR].

10-25-2021, 12:30 PM

**Noam Razin** (Tel Aviv University):**Generalization in Deep Learning Through the Lens of Implicit Rank Minimization**

The mysterious ability of overparameterized neural networks to generalize is believed to stem from an implicit regularization, a tendency of gradient-based optimization to fit data with predictors of low complexity. A widespread hope is that this complexity can be characterized using norms, and a standard test-bed for studying this prospect is matrix factorization --- matrix completion via linear neural networks. In this talk, I will establish that the implicit regularization in matrix factorization cannot be captured by norms, proving there exist settings on which it drives all norms towards infinity while minimizing rank. Then, as a step towards practical deep learning, I will show that the tendency to low rank extends from matrix factorization to tensor factorization --- tensor completion via certain non-linear neural networks. Motivated by tensor rank capturing the implicit regularization of these non-linear networks, I will propose it as a measure of complexity, and present empirical evidence that it captures aspects of natural data. Our results suggest that notions of rank (such as tensor rank) may shed light on both implicit regularization of neural networks, and properties of real-world data translating this implicit regularization to generalization.

Works covered in this talk were in collaboration with Asaf Maman and Nadav Cohen.
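
The low-rank bias described above can be observed in a small experiment: gradient descent on an unconstrained depth-2 matrix factorization, started from a small random initialization, tends to fit the observed entries with a product of low effective rank even though the factors are full-dimensional. This is a toy sketch under ad hoc assumptions (sizes, step size, initialization scale), not code from the works covered in the talk.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 20, 2
target = rng.standard_normal((d, r)) @ rng.standard_normal((r, d))
target /= np.linalg.norm(target, 2)              # normalize spectral norm to 1
mask = rng.random((d, d)) < 0.5                  # ~50% observed entries

# Depth-2 "matrix factorization" with small random initialization.
scale, lr = 1e-3, 0.1
W1 = scale * rng.standard_normal((d, d))
W2 = scale * rng.standard_normal((d, d))
for _ in range(20000):
    E = mask * (W2 @ W1 - target)                # residual on observed entries
    W1, W2 = W1 - lr * W2.T @ E, W2 - lr * E @ W1.T

s = np.linalg.svd(W2 @ W1, compute_uv=False)
effective_rank = int(np.sum(s > 5e-2 * s[0]))    # often close to the true rank 2
```

Note that no rank constraint or norm penalty appears anywhere; the low effective rank of the learned product is purely an implicit effect of gradient descent from small initialization.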

10-04-2021, 12:30 PM

**Jonathan Siegel** (Penn State University):**The Approximation Theory of Shallow Neural Networks**

A shallow neural network is a linear combination of ridge functions whose profile is determined by a fixed activation function. We will introduce spaces of functions which can be efficiently approximated by shallow neural networks for a wide variety of activation functions and analyze their properties. Specifically, we will compute their metric entropy and n-widths, which are fundamental quantities in approximation theory that control the limits of linear and non-linear approximation and statistical estimation for a given class of functions. Consequences of these results include: the optimal approximation rates which can be attained for shallow neural networks, that shallow neural networks dramatically outperform linear methods of approximation, and indeed that shallow neural networks outperform all stable methods of approximation on these spaces. This will provide insights into how neural networks break the curse of dimensionality. Finally, we discuss algorithmic aspects of approximation by shallow networks. Specifically, we analyze a class of greedy algorithms and show that they can attain the theoretically optimal approximation rates.

09-27-2021, 12:30 PM

**Dominik Stoeger** (USC):**Small random initialization is akin to spectral learning: Optimization and generalization guarantees for overparameterized low-rank matrix reconstruction**

Modern machine learning models are typically trained in an over-parameterized regime where the parameters of the model far exceed the size of the training data. Due to over-parameterization, these models in principle have the capacity to (over)fit any set of labels, including pure noise. Despite this high fitting capacity, somewhat paradoxically, these models trained via first-order methods continue to predict well on yet unseen test data. In this talk, I will focus on overparameterized learning in the context of low-rank reconstruction from a few measurements. For this problem, I will show that despite the presence of many global optima, gradient descent from small random initialization converges to a generalizable solution and finds the underlying low-rank matrix. Notably, this analysis is not in the “lazy” training regime and is based on intriguing phenomena uncovering the critical role of small random initialization: the first few iterations of gradient descent behave akin to popular spectral methods. We also show that this implicit spectral bias from small random initialization, which is provably more prominent for overparameterized models, puts the gradient descent iterations on a particular trajectory towards solutions that are not only globally optimal but also generalize well.
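
The "akin to spectral methods" phenomenon can be seen in a toy symmetric rank-one problem: while the factor is still tiny, each gradient step multiplies it by roughly (I + lr·X), so the early iterates act like power iterations on the ground-truth matrix. The setting and constants below are illustrative assumptions, not the speaker's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
u = rng.standard_normal(d)
u /= np.linalg.norm(u)
X = np.outer(u, u)                      # rank-1 PSD ground truth

U = 1e-6 * rng.standard_normal((d, d))  # small random initialization
lr = 0.25
for _ in range(25):                     # only the early iterations
    U = U - lr * (U @ U.T - X) @ U      # grad of 0.25 * ||U U^T - X||_F^2

# While U is tiny, U U^T is negligible, so each step is approximately
# U <- (I + lr * X) @ U: power iteration on X. The leading direction of
# the iterate therefore aligns with the ground-truth direction u.
top = np.linalg.svd(U)[0][:, 0]
alignment = abs(top @ u)
```

After 25 such steps the leading left singular vector of U is already nearly parallel to u, even though the factor norm is still far from its final value.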

09-20-2021, 03:30 PM

**Joan Bruna** (NYU):**On the role of data structure in learning guarantees**

High-dimensional learning remains an outstanding task where empirical successes often coexist alongside mathematical and statistical curses. In this talk, we will describe two vignettes of this tension that underscore the importance of distributional assumptions. First, we will describe the role of invariance and symmetry priors in statistical learning theory, by studying the gains in sample complexity brought by incorporating these priors into the learning model. Next, we will describe the role of data structure on the computational side, by studying computational-to-statistical gaps arising in the seemingly simple problem of learning a single neuron.

Joint work with Alberto Bietti and Luca Venturi (first part), and Min Jae Song and Ilias Zadik (second part).

08-23-2021, 09:00 AM

**Yi Ma** (UC Berkeley):**White-Box Deep (Convolution) Networks from First Principles**

In this talk, we offer an entirely “white box” interpretation of deep (convolution) networks from the perspective of data compression (and group invariance). In particular, we show how modern deep layered architectures, linear (convolution) operators and nonlinear activations, and even all parameters can be derived from the principle of maximizing rate reduction (with group invariance). All layers, operators, and parameters of the network are explicitly constructed via forward propagation, instead of learned via back propagation. All components of the so-obtained network, called ReduNet, have precise optimization, geometric, and statistical interpretations. There are also several nice surprises from this principled approach: it reveals a fundamental tradeoff between invariance and sparsity for class separability; it reveals a fundamental connection between deep networks and the Fourier transform for group invariance – the computational advantage in the spectral domain (why spiking neurons?); and it clarifies the mathematical role of forward propagation (optimization) and backward propagation (variation). In particular, the so-obtained ReduNet is amenable to fine-tuning via both forward and backward (stochastic) propagation, both optimizing the same objective.

This is joint work with students Yaodong Yu, Ryan Chan, Haozhi Qi of Berkeley, Dr. Chong You now at Google Research, and Professor John Wright of Columbia University. A related paper can be found at: https://arxiv.org/abs/2105.10446

07-26-2021, 12:30 PM

**Gabin Maxime Nguegnang** (RWTH Aachen University):**Convergence of Gradient Descent in Learning Deep Linear Neural Networks**

In this work, we study the convergence properties of gradient descent in training deep linear neural networks. We prove that if the initial weight matrices are approximately balanced and the weight norm of each of these matrices is bounded by a constant depending on the initial loss value and the network depth N, then the weight matrices stay bounded throughout all iterations. In addition, we show that under suitable conditions gradient descent with both constant and non-constant step sizes converges to a critical point of the loss function. Moreover, we demonstrate that for almost all initializations gradient descent converges to a global minimum in the case of two layers. In the case of three or more layers, we prove that gradient descent converges to a global minimum on the manifold of matrices of rank k, with k less than or equal to the rank r of the over-parameterized product matrix of the network. Finally, we provide an attempt to extend these results to stochastic gradient descent.

07-19-2021, 12:30 PM

**Johannes Lederer** (Ruhr-University Bochum):**Sparse Deep Learning**

Sparsity is popular in statistics and machine learning, because it can avoid overfitting, speed up computations, and facilitate interpretations. In deep learning, however, the full potential of sparsity still needs to be explored. In this presentation, we first discuss sparsity in the framework of high-dimensional statistics and corresponding statistical theories. We then use these insights to further our understanding of sparsity in deep learning.

07-12-2021, 12:30 PM

**Juan-Pablo Ortega Lahuerta** (Nanyang Technological University):**Reservoir Computing and the Learning of Dynamic Processes**

Dynamic processes regulate the behaviour of virtually any artificial and biological agent, from stock markets to epidemics, from driverless cars to healthcare robots. The problem of modeling, forecasting, and generally speaking learning dynamic processes is one of the most classical, sophisticated, and strategically significant problems in the natural and the social sciences. In this talk we shall discuss both classical and recent results on the modeling and learning of dynamical systems and input/output systems using an approach generically known as reservoir computing. This information processing framework is characterized by the use of cheap-to-train randomly generated state-space systems for which promising high-performance physical realizations with dedicated hardware have been proposed in recent years. In our presentation we shall put a special emphasis on the approximation properties of these constructions.

07-05-2021, 12:30 PM

**Frederik Hoppe** (RWTH Aachen University):**Debiased Lasso: Confidence intervals for the Lasso**

This talk approaches the problem of constructing confidence intervals for the Lasso estimator. Based on the Lasso, we build an estimator whose distribution is asymptotically known. We compare some existing approaches for this and identify the key requirements. The new estimator is called debiased or desparsified Lasso and can be decomposed into a “rest” term and a “noise” term. Under certain assumptions the “rest” term vanishes and the remaining “noise” term is asymptotically Gaussian distributed. This allows us to construct confidence intervals. Furthermore, we address a possible application in magnetic resonance imaging.
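
A minimal numerical sketch of the construction: starting from a Lasso solution, one adds a correction term built from the residual, which removes the shrinkage bias on the support. Here we assume an i.i.d. Gaussian design, so the identity can serve as the approximate precision matrix M; all sizes, the noise level, and the regularization parameter are ad hoc illustrative choices.

```python
import numpy as np

def ista_lasso(y, A, lam, n_iter=500):
    # Plain ISTA for the Lasso: min 0.5 * ||y - A x||^2 + lam * ||x||_1.
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = x + step * A.T @ (y - A @ x)
        x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return x

rng = np.random.default_rng(0)
n, p, s = 200, 500, 5
A = rng.standard_normal((n, p)) / np.sqrt(n)   # isotropic design, Sigma = I
x0 = np.zeros(p); x0[:s] = 2.0
y = A @ x0 + 0.1 * rng.standard_normal(n)

x_lasso = ista_lasso(y, A, lam=0.3)
# Debiasing step: for this design M = I approximates the precision matrix,
# so the correction is simply A^T times the residual.
x_deb = x_lasso + A.T @ (y - A @ x_lasso)
# x_deb - x0 splits into a vanishing "rest" term and an approximately
# Gaussian "noise" term, which enables coordinatewise confidence intervals.
```

On the support, the Lasso coordinates are shrunk by roughly the regularization parameter, while the debiased coordinates fluctuate around the truth at the noise level, which is exactly what makes the Gaussian limit usable for interval construction.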

06-28-2021, 12:30 PM

**Vicky Kouni** (RWTH Aachen University):**Spark Deficient Gabor Frames for Inverse Problems**

In this presentation, we address two inverse problems: analysis Compressed Sensing and speech denoising. Our approach to both problems is based on a redundant, analysis-sparse representation of the original signals of interest. We pick an eigenvector of the Zauner unitary matrix and – under certain assumptions on the ambient dimension – use it as a window vector to generate a spark deficient Gabor frame. The analysis operator associated with such a frame is a (highly) redundant Gabor transform, which we use as a sparsifying transform in both the analysis Compressed Sensing and denoising procedures. We conduct computational experiments on synthetic and real-world data, solving analysis l1-minimization and analysis basis pursuit for the Compressed Sensing and denoising problems respectively, with four different choices of analysis operators, including our Gabor analysis operator. The results show that our proposed redundant Gabor transform outperforms – in all cases – Gabor transforms generated by state-of-the-art window vectors of time-frequency analysis.

06-14-2021, 12:30 PM

**Alessandro Rudi** (INRIA and École Normale Supérieure):**Finding Global Minima via Kernel Approximations**

We consider the global minimization of smooth functions based solely on function evaluations. Algorithms that achieve the optimal number of function evaluations for a given precision level typically rely on explicitly constructing an approximation of the function which is then minimized with algorithms that have exponential running-time complexity. In this paper, we consider an approach that jointly models the function to approximate and finds a global minimum. This is done by using infinite sums of square smooth functions and has strong links with polynomial sum-of-squares hierarchies. Leveraging recent representation properties of reproducing kernel Hilbert spaces, the infinite-dimensional optimization problem can be solved by subsampling in time polynomial in the number of function evaluations, and with theoretical guarantees on the obtained minimum.

Given n samples, the computational cost is O(n^(3.5)) in time, O(n^2) in space, and we achieve a convergence rate to the global optimum that is O(n^(−m/d+1/2+3/d)), where m is the degree of differentiability of the function and d the number of dimensions. The rate is nearly optimal in the case of Sobolev functions and more generally makes the proposed method particularly suitable for functions that have a large number of derivatives. Indeed, when m is in the order of d, the convergence rate to the global optimum does not suffer from the curse of dimensionality, which affects only the worst-case constants.

06-07-2021, 12:30 PM

**Leonardo Galli** (RWTH Aachen University):**Robustness and Generalization in Training Deep Neural Networks**

Recent breakthroughs in the field of AI and machine learning have been achieved with Deep Learning (DL) models. Despite their extensive use to solve a wide range of real problems, the understanding of DL is limited and the theoretical guarantees are few. Moreover, the performance of DL models is very sensitive to various algorithmic choices in training, e.g., the learning rate. The robust selection of the learning rate is indeed a very central topic in DL. In this project, we study nonmonotone Stochastic Gradient Descent (SGD) algorithms to improve the robustness of the training method w.r.t. the learning rate. The focus of this seminar will be on a first comparison between various learning rate selections. Results show that the current research avenue is promising and that nonmonotone techniques might play an interesting role in the selection of the learning rate.

05-10-2021, 12:30 PM

**Hung-Hsu Chou** (RWTH Aachen University):**Exploration and Exploitation: Understanding Deep Neural Network via Matrix Factorization**

This talk aims to provide analysis and experiments on linear neural networks and their relation to general neural networks via kernels. It consists of two parts: exploration, where the dynamics searches for the relevant feature space, and exploitation, where the dynamics evaluates the importance of those features. Much of the current literature, including the neural tangent kernel, commuting measurements for matrix sensing, and our previous work, focuses on exploitation. However, exploitation cannot be separated from exploration, and in practice could in fact be quite meaningless without it. Thus we provide some analysis and experiments on exploration that will hopefully integrate the two parts. This is joint work with Holger Rauhut and Johannes Maly.

05-03-2021, 12:30 PM

**Lénaïc Chizat** (Université Paris-Saclay):**Analysis of Gradient Descent on Wide Two-Layer ReLU Neural Networks**

In this talk, we propose an analysis of gradient descent on wide two-layer ReLU neural networks that leads to sharp characterizations of the learned predictor. The main idea is to study the dynamics when the width of the hidden layer goes to infinity, which is a Wasserstein gradient flow. While this dynamics evolves on a non-convex landscape, we show that for appropriate initializations, its limit, when it exists, is a global minimizer. We also study the implicit regularization of this algorithm when the objective is the unregularized logistic loss, which leads to a max-margin classifier in a certain functional space. We finally discuss what these results tell us about the generalization performance, and in particular how these models compare to kernel methods. This is based on joint work with Francis Bach and revolves around my COLT 2020 paper: https://arxiv.org/abs/2002.04486.

04-15-2021, 04:30 PM

**Panayotis Mertikopoulos** (Univ. Grenoble Alpes, CNRS):**Games, dynamics, and (min-max) optimization**

This talk aims to survey the triple-point interface between optimization, game theory, and dynamical systems. We will begin by discussing how the ordinary differential equation (ODE) method of stochastic approximation can be used to analyze the trajectories of a wide array of stochastic first-order algorithms in non-convex minimization problems and games. The key notion here is that of an internally chain transitive (ICT) set: in minimization problems, ICT sets correspond to the problem's critical points, and we discuss a range of conditions guaranteeing convergence to critical points while avoiding unstable saddle points. Similar results can also be derived for min-max problems and games: unstable stationary points are avoided and stable Nash equilibria are attracting with probability 1 (or with arbitrarily high probability in the local case). However, despite these encouraging results, the overall situation can be considerably more involved: we show that the sequence of play may converge with arbitrarily high probability to lower-dimensional manifolds that are in no way unilaterally stable (or even stationary). "Spurious attractors" of this type can arise even in simple two-player zero-sum games with one-dimensional action sets and polynomial losses of degree 4, a fact which highlights the fundamental gap between minimization problems and games.

Relevant papers: the talk will be based on material from the preprints https://arxiv.org/pdf/2006.11144.pdf and https://arxiv.org/pdf/2006.09065.pdf

04-08-2021, 04:30 PM

**Robert Kunsch** (RWTH Aachen University):**A new rotation equivariant neural network for 3D point clouds - proof of concept**

Scanning data in real world applications are often organized as 3D point clouds rather than 2D images. Approaches for deep point cloud processing are diverse. In this talk, a new rotation equivariant neural architecture is presented for processing 3D point clouds that are assumed to be sampled from 3D manifolds. The architecture is inspired by PointNet and its state of the art extension PointNet++. The main idea is that when extracting features from the neighbourhood of a point (let us call it a “centroid”), the point cloud is locally rotated around that centroid to let the surface tangent plane fit the xy-plane in relative coordinates; feature extraction is then repeated for a number of equidistant rotation angles within the xy-plane. That way, features come along with interpretable geometric data which might be useful e.g. in pose detection. First experiments on a toy problem prove that, indeed, the network can be trained on examples in fixed orientation and later perform successfully on rotated (and translated) test examples.

04-01-2021, 04:30 PM

**Michael Schaub** (RWTH Aachen University):**Learning from signals on graphs with unobserved edges**

In many applications we are confronted with the following system identification scenario: we observe a dynamical process that describes the state of a system at particular times. Based on these observations we want to infer the (dynamical) interactions between the entities we observe. In the context of a distributed system, this typically corresponds to a "network identification" task: find the (weighted) edges of the graph of interconnections. However, the number of samples we can obtain from such a process is often far too few to identify the edges of the network exactly. Can we still reliably infer some aspects of the underlying system?

Motivated by this question we consider the following identification problem: instead of trying to infer the exact network, we aim to recover a (low-dimensional) statistical model of the network based on the observed signals on the nodes. More concretely, here we focus on observations that consist of snapshots of a diffusive process that evolves over the unknown network. We model the (unobserved) network as generated from an independent draw from a latent stochastic blockmodel (SBM), and our goal is to infer both the partition of the nodes into blocks, as well as the parameters of this SBM. We present simple spectral algorithms that provably solve the partition and parameter inference problems with high accuracy. We further discuss some possible variations and extensions of this problem setup.

03-25-2021, 04:30 PM

**Yeonjong Shin** (Brown University):**Towards a mathematical understanding of modern machine learning: theory, algorithms, and applications**

Modern machine learning (ML) has achieved unprecedented empirical success in many application areas. However, much of this success involves trial-and-error and numerous tricks. These result in a lack of robustness and reliability in ML. Foundational research is needed for the development of robust and reliable ML. This talk consists of two parts. The first part will present the first mathematical theory of physics-informed neural networks (PINNs), one of the most popular deep learning frameworks for solving PDEs. Linear second-order elliptic and parabolic PDEs are considered. I will show the consistency of PINNs by adapting the Schauder approach and the maximum principle. The second part will focus on some recent mathematical understanding and development of neural network training. Specifically, two ML phenomena are analyzed -- "Plateau Phenomenon" and "Dying ReLU." New algorithms are developed based on the insights gained from the mathematical analysis to improve neural network training.

03-18-2021, 04:30 PM

**Marc Goerigk** (Uni Siegen):**Robust Optimization with Deep Neural Networks**

To derive a robust optimization model, a central ingredient is to identify a suitable model for uncertainty, containing all scenarios against which we wish to protect. An ongoing challenge in the recent literature is to derive uncertainty sets from given historical data.

In this talk I propose to use deep neural networks to construct non-convex uncertainty sets from data. Trained neural networks can be integrated into a robust optimization model by formulating the adversarial problem as a convex quadratic mixed-integer program. This allows us to derive robust solutions through an iterative scenario generation process.

In a comparison to uncertainty sets by kernel-based support vector clustering, I show that our method can give a better description of data, leading to robust solutions that considerably outperform the comparison method both with respect to objective value and feasibility. This presentation is based on joint work with Jannis Kurtz.

03-04-2021, 10:00 AM

**Rayan Saab** (UCSD):**A greedy algorithm for quantizing neural networks**

While neural networks have been remarkably successful in a wide array of application areas, miniaturizing them and implementing them in hardware remains an area of intense research. By quantizing, or replacing the weights of a neural network with quantized (e.g., 8-bit, or binary) counterparts, massive savings in cost, computation speed, memory, and power consumption can be attained. Of course, one wishes to attain these savings without compromising the performance of the network. We propose a new data-driven and computationally efficient method for quantizing the weights of already trained neural networks. Specifically, we quantize each neuron, or hidden unit, using a greedy path-following algorithm. We prove that this simple algorithm has favorable error guarantees for a single-layer neural network (or, alternatively, for quantizing the first layer of a multi-layer network) when the training data are drawn at random from an appropriate distribution. Under these assumptions, the quantization error decays with the width of the layer, i.e., its level of over-parametrization. We also provide numerical experiments, on multi-layer networks, to illustrate the performance of our methods.
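
A sketch of the greedy path-following idea for a single neuron: the weights are quantized one at a time, and each step picks the alphabet element that best absorbs the error accumulated so far in the pre-activations. The alphabet, data model, and sizes below are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

def greedy_quantize(w, X, alphabet):
    """Greedily quantize neuron weights w using data X (one column per weight).

    At step t we choose the alphabet element that best tracks the running
    error of the analog pre-activations X @ w, following the path greedily.
    """
    q = np.zeros_like(w)
    u = np.zeros(X.shape[0])          # accumulated error in sample space
    for t in range(len(w)):
        col = X[:, t]
        # candidate residual norms for each alphabet element
        errs = [np.linalg.norm(u + w[t] * col - p * col) for p in alphabet]
        q[t] = alphabet[int(np.argmin(errs))]
        u = u + w[t] * col - q[t] * col
    return q

rng = np.random.default_rng(0)
m, d = 500, 100
X = rng.standard_normal((m, d))       # random training data
w = rng.uniform(-1, 1, d)             # pretrained analog weights
alphabet = np.linspace(-1, 1, 9)      # 9-level (~3-bit) quantizer, an assumption
q = greedy_quantize(w, X, alphabet)

rel = np.linalg.norm(X @ w - X @ q) / np.linalg.norm(X @ w)
```

Because each step controls the accumulated error rather than rounding weights independently, the relative pre-activation error stays well below that of memoryless scalar rounding for the same alphabet.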

02-25-2021, 04:30 PM

**Arthur Ulysse Jacot-Guillarmod** (EPFL):**Neural Tangent Kernel: Convergence and Generalization of DNNs**

Modern deep learning has popularized the use of very large neural networks, but the theoretical tools to study such networks are still lacking. The Neural Tangent Kernel (NTK) describes how the output neurons evolve during training, allowing a precise description of the convergence and generalization of DNNs. In the infinite width limit (when the number of hidden neurons grows to infinity) the NTK converges to a deterministic and fixed limit, ensuring convergence to a global minimum. In particular, training a DNN with the MSE loss corresponds to doing Kernel Ridge Regression (KRR) with the NTK. Under the assumption of i.i.d. input samples, the risk of KRR can be approximated using the Signal Capture Threshold (SCT), which identifies which principal components of the signal are learned. A further approximation leads to the Kernel Alignment Risk Estimator (KARE), which predicts the test error of KRR from the train data only.
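
The empirical NTK can be formed directly from parameter gradients: its Gram matrix has entries given by inner products of the gradients of the network output at pairs of inputs. The following sketch does this for a scalar-output two-layer ReLU network; the initialization scaling and all sizes are assumptions chosen to mimic the NTK parameterization.

```python
import numpy as np

def two_layer(params, x):
    # Scalar-output two-layer ReLU network f(x) = a . relu(W x).
    W, a = params
    return a @ np.maximum(W @ x, 0.0)

def grad_params(params, x):
    # Gradient of the network output w.r.t. all parameters, flattened.
    W, a = params
    h = W @ x
    dW = np.outer(a * (h > 0), x)     # d f / d W  (ReLU derivative = indicator)
    da = np.maximum(h, 0.0)           # d f / d a
    return np.concatenate([dW.ravel(), da])

def empirical_ntk(params, xs):
    # K[i, j] = <grad f(x_i), grad f(x_j)>: the empirical NTK Gram matrix.
    G = np.stack([grad_params(params, x) for x in xs])
    return G @ G.T

rng = np.random.default_rng(0)
width, d = 2000, 5
W = rng.standard_normal((width, d)) / np.sqrt(d)
a = rng.standard_normal(width) / np.sqrt(width)
xs = [rng.standard_normal(d) for _ in range(4)]
K = empirical_ntk((W, a), xs)   # PSD kernel matrix, nearly deterministic at large width
```

At large width this random matrix concentrates around the deterministic limiting NTK, which is the regime in which the KRR correspondence described above applies.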

02-18-2021, 04:30 PM

**Lukas Gonon** (LMU):**Deep ReLU network expression rates for option prices in high-dimensional, exponential Lévy models**

We study the expression rates of deep neural networks for option prices written on high-dimensional baskets of risky assets, whose log-returns are modelled by a multivariate Lévy process with general correlation structure of jumps. We establish sufficient conditions on the characteristic triplet of the Lévy process that ensure that the size of the DNN required to achieve a given approximation accuracy grows only polynomially with respect to the dimension of the Lévy process and the reciprocal of the approximation accuracy, thereby overcoming the curse of dimensionality and justifying the use of DNNs in financial modelling of large baskets in markets with jumps. In addition, we exploit parabolic smoothing of Kolmogorov partial integro-differential equations for certain multivariate Lévy processes to present alternative architectures of ReLU DNNs that provide higher approximation rates, however, with constants potentially growing exponentially with respect to the dimension. Under stronger, dimension-uniform non-degeneracy conditions on the Lévy symbol, we obtain algebraic expression rates of option prices in exponential Lévy models which are free from the curse of dimensionality. In this case the ReLU DNN expression rates of prices depend on certain sparsity conditions on the characteristic Lévy triplet. We indicate several consequences and possible extensions of the present results. The talk is based on joint work with Christoph Schwab.

02-11-2021, 04:30 PM

**Harald Oberhauser** (Oxford):**Random Paths, Random Tensors, and some Machine Learning**

Many real-world data sources have a natural sequential structure (e.g. patient data in health care, market data in finance, etc). For many inference tasks, it is important to capture this sequential structure. I will talk about how some classic ideas from pure mathematics provide a structured description of a path/sequence in terms of tensors. This can in turn be combined with ideas from machine learning to produce scalable methods. Vice versa, we will see how applications in machine learning lead to interesting theoretical questions in probability theory and stochastic analysis.
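To make the "structured description of a path in terms of tensors" concrete, the first two levels of the path signature of a piecewise-linear path can be computed directly from its increments. This is a hedged toy sketch (not code from the talk); the shuffle identity $S^{(2)} + (S^{(2)})^\top = S^{(1)} \otimes S^{(1)}$ serves as a sanity check.

```python
import numpy as np

def signature_level_1_2(path):
    """First two signature levels of a piecewise-linear path.

    path: (n_points, d) array. Returns (S1, S2) with S1 in R^d (the total
    increment) and S2 in R^{d x d} (the matrix of second iterated integrals)."""
    inc = np.diff(path, axis=0)           # increments of each linear piece
    S1 = inc.sum(axis=0)
    S2 = np.zeros((path.shape[1],) * 2)
    running = np.zeros(path.shape[1])     # x_t - x_0 accumulated before each piece
    for dx in inc:
        # exact second iterated integral over one linear piece
        S2 += np.outer(running, dx) + 0.5 * np.outer(dx, dx)
        running += dx
    return S1, S2

# three sides of a unit square, traversed as a 2D path
path = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
S1, S2 = signature_level_1_2(path)
```

The antisymmetric part of `S2` is the Lévy area of the path, one of the classical quantities these tensors encode.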

02-04-2021, 04:30 PM

**Felix Krahmer** (TUM):**Sketching with Kerdock's Crayons: Sparse transforms for arbitrary linear maps**

Data that admits a sparse representation arises naturally in many different applications. Consequently, efficiently computing the significant coefficients of a sparse signal is of great importance. For specific representation systems such as the Fourier basis or certain tensor bases, this problem has been studied in a number of works, and various fast algorithms are available. In most unstructured cases, however, no algorithm of computational complexity o(n^2), where n is the dimension, was known prior to our work. In this talk, we discuss a randomized approach to address this problem. Our method relies on a representation of the sparsity basis in terms of certain real-valued mutually unbiased bases derived from Kerdock codes. This representation has to be computed only once, which requires a total of O(n^3 log^2(n)) operations. We show that from this representation, our randomized fast transform computes the significant coefficients of any vector that is s-sparse in the given basis with high probability, requiring only a significantly reduced computational complexity of O(n(s + log(n))). The talk is based on joint work with Tim Fuchs (TUM), David Gross (U Cologne), Richard Kueng (JKU Linz), and Dustin Mixon (OSU).

01-21-2021, 04:30 PM

**Massimo Fornasier** (TUM):**Identification of deep feed forward neural networks**

(A joint work with C. Fiedler, T. Klock, and M. Rauchensteiner.) In this talk we approach the problem of unique and stable identifiability of generic deep artificial neural networks from a finite number of input-output samples. The scope of this work is to shed light on explainability and interpretability of the parameters/weights of a network, its uniform stability, and the amount of input-output data needed for identifiability in the realizable context of data generated by a given generic network. More specifically, we introduce the so-called *entangled weights*, which compose weights of successive layers intertwined with suitable diagonal and invertible matrices depending on the activation functions and their shifts. We prove that entangled weights are completely and stably approximated by an efficient and robust algorithm as soon as $O(D^2 \times m)$ nonadaptive input-output samples of the network are collected, where $m$ is the number of neurons of the network and $D$ is the input dimension. Provided knowledge of layer assignments and of the remaining parameters, which can be further heuristically obtained by least squares, the entangled weights identify the network completely and uniquely. To highlight the relevance of the theoretical result of stable recovery of entangled weights, we present numerical experiments, which demonstrate that multilayered networks with generic weights can be robustly identified and therefore uniformly approximated by the presented algorithmic pipeline. In contrast, backpropagation cannot generalize stably in the current setting, being always limited by a relatively large uniform error. In terms of practical impact, our study shows that we can relate input-output information uniquely and stably to network parameters, providing a form of explainability. Moreover, our method paves the way for compression of overparametrized networks, which could also be interpreted as a new form of malicious attack for network cloning.

01-14-2021, 04:30 PM

**Youness Boutaib** (RWTH Aachen University):**Path classification with continuous-time recurrent neural networks**

We seek mathematical learning guarantees for path-classification tasks using stochastic (linear, in this talk) recurrent networks with continuous-time rate dynamics. This architecture finds its origin in neuroscience and is simple as training only requires finding a pre-processing projection vector and the parameters of a read-out map. We give generalisation error bounds and argue that stochasticity provides learning with a robustness against adversarial attacks. We also explicitly show that training (in the case of our loss function) is equivalent to an optimisation problem of a functional on the signatures of the training paths. This is an ongoing joint work with Wiebke Bartolomeus (RWTH Aachen), S. Nestler (FZ Jülich) and H. Rauhut (RWTH Aachen).

12-17-2020, 04:30 PM

**Martin Genzel** (Utrecht University):**Robust Solutions to (Non-)Linear Inverse Problems: From Compressed Sensing to Deep Learning**

Recent advances in quantized compressed sensing and high-dimensional estimation have shown that signal recovery is even feasible under strong non-linear distortions in the observation process. An important characteristic of associated guarantees is uniformity, i.e., recovery succeeds for an entire class of structured signals with a fixed measurement ensemble. However, despite significant results in various special cases, a general understanding of uniform recovery from non-linear observations is still missing. In this talk, we will discuss a unified approach to this problem under the assumption of i.i.d. sub-Gaussian measurement vectors. It will turn out that a simple generalized Lasso estimator can serve as a universal recovery strategy, which is (outlier) robust and does not require explicit knowledge of the underlying non-linearity. A key technical ingredient is an approximative increment condition that can be implemented for many types of non-linear models. This flexibility allows us to apply our approach to a variety of problems in quantized compressed sensing and high-dimensional statistics. The second part of this talk is devoted to linear inverse problems and how they can be solved by deep learning techniques. Compared to traditional estimators such as the Lasso, recent works have pointed out that deep neural networks suffer from severe instabilities in various image reconstruction tasks. We will shed new light on this concern and deal with an extensive empirical study of the robustness of deep-learning-based algorithms for solving underdetermined inverse problems. This covers compressed-sensing tasks with Gaussian measurements as well as image recovery from Fourier measurements, including a real-world scenario for MRI (using the NYU fastMRI dataset). Our main focus is on computing adversarial perturbations of the measurements that maximize the reconstruction error. In contrast to previous findings, our results reveal that standard end-to-end network architectures are not only surprisingly resilient against statistical noise, but also against adversarial perturbations. Remarkably, all considered networks are trained by common deep learning techniques, without sophisticated defense strategies. This is joint work with Jan Macdonald (TU Berlin), Maximilian März (TU Berlin), and Alexander Stollenwerk (UCLouvain).
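The phenomenon that no knowledge of the non-linearity is needed can be illustrated in its simplest non-uniform form (a hedged toy sketch under Gaussian-measurement assumptions, not the talk's uniform result): with i.i.d. Gaussian rows, plain least squares applied to one-bit observations $y = \mathrm{sign}(Ax)$ estimates $x$ up to the scalar factor $\mu = \mathbb{E}[f(g)g]$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 20, 5000
x = rng.standard_normal(n)
x /= np.linalg.norm(x)                   # unit-norm signal
A = rng.standard_normal((m, n))          # i.i.d. Gaussian measurement vectors
y = np.sign(A @ x)                       # unknown nonlinearity f = sign

# Plain least squares on the nonlinear data (the generalized Lasso without a
# regularizer, since m >> n here): it targets mu * x without knowing f.
x_hat = np.linalg.lstsq(A, y, rcond=None)[0]
mu = np.sqrt(2.0 / np.pi)                # E[sign(g) g] for standard Gaussian g
```

The regularized (Lasso) variant matters when the signal is structured and $m$ is small; this unregularized version only shows the scaling-factor phenomenon.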

12-03-2020, 04:30 PM

**Michael Perlmutter** (UCLA):**Understanding Neural Networks via the Scattering Transform**

The scattering transform is a mathematical framework for understanding convolutional neural networks (CNNs), originally introduced by Stéphane Mallat for functions defined on $\mathbb{R}^n$. Similar to standard CNNs, the scattering transform is constructed as an alternating cascade of convolutions and nonlinearities. It differs from traditional CNNs by using predesigned wavelet filters rather than filters learned from training data. This leads to a network that provably has desirable mathematical properties, such as translation invariance and diffeomorphism stability.

In addition to these theoretical properties, the scattering transform is also a practical object that can achieve near state-of-the-art numerical results in certain settings. Moreover, in addition to performing well on tasks such as image recognition, the scattering transform can also be used to extract statistical information about stochastic processes with stationary increments. I will provide an overview of Mallat's original construction and also discuss recent variations of the scattering transform which are customized for certain tasks including geometric deep learning and quantum chemistry.
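The cascade structure can be sketched in one dimension: fixed Gabor-type filters, complex modulus, global averaging. This is a hedged toy version (circular convolutions, crude filters chosen for the example, not Mallat's construction); exact translation invariance under circular shifts follows because every step is shift-equivariant and the final average is shift-invariant.

```python
import numpy as np

def scattering_1d(x, n_scales=3):
    """Toy 1D scattering: cascade of circular convolutions with fixed
    Gabor-type filters, complex modulus, then global averaging."""
    n = len(x)
    t = np.arange(n)
    filters = []
    for j in range(n_scales):
        xi, sigma = 0.4 / 2**j, 0.1 / 2**j   # center frequency and width per scale
        psi = np.exp(2j * np.pi * xi * t) * np.exp(-0.5 * (sigma * (t - n // 2))**2)
        filters.append(np.fft.fft(psi))

    def conv(sig, fhat):                     # circular convolution via FFT
        return np.fft.ifft(np.fft.fft(sig) * fhat)

    coeffs = [np.mean(x)]                    # zeroth-order coefficient
    for f1 in filters:
        u1 = np.abs(conv(x, f1))             # first-order modulus
        coeffs.append(u1.mean())
        for f2 in filters:
            coeffs.append(np.abs(conv(u1, f2)).mean())  # second order
    return np.array(coeffs)

x = np.random.default_rng(1).standard_normal(128)
S = scattering_1d(x)
```

Shifting the input circularly leaves the coefficient vector unchanged, mirroring (in a toy way) the translation invariance cited above.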

11-26-2020, 04:30 PM

**Soledad Villar** (JHU):**Dimensionality reduction, regularization, and generalization in overparameterized regressions**

Overparameterization in deep learning has been shown to be powerful: very large models can fit the training data perfectly and yet generalize well. Investigation of overparameterization brought back the study of linear models, which, like more complex models, show a "double descent" behavior. This involves two features: (1) the risk (out-of-sample prediction error) can grow arbitrarily when the number of samples n approaches the number of parameters p (from either side), and (2) the risk decreases with p at p > n, sometimes achieving a lower value than the lowest risk at p < n. The divergence of the risk at p = n is related to the condition number of the empirical covariance in the feature set. For this reason, it can be avoided with regularization. In this work we show that performing a PCA-based dimensionality reduction also avoids the divergence at p = n; we provide a finite upper bound for the variance of the estimator that decreases with p. This result is in contrast to recent work showing that a different form of dimensionality reduction (one based on the population covariance instead of the empirical covariance) does not avoid the divergence. We connect these results to an analysis of adversarial attacks, which become more effective as they raise the condition number of the empirical covariance of the features. We show that ordinary least squares is arbitrarily susceptible to data-poisoning attacks in the overparameterized regime, unlike the underparameterized regime, and how regularization and dimensionality reduction improve its robustness. We also translate the results on the highly overparameterized linear regression regime to Gaussian processes.
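The PCA-based reduction is easy to sketch at the interpolation threshold p = n, where the min-norm least squares estimator is dominated by the smallest singular values. This toy example (not the paper's analysis; all sizes invented) compares min-norm least squares with regression on the top-k empirical principal components.

```python
import numpy as np

rng = np.random.default_rng(0)
n = p = 50                                    # interpolation threshold p = n
X = rng.standard_normal((n, p))
w = rng.standard_normal(p) / np.sqrt(p)
y = X @ w + 0.1 * rng.standard_normal(n)

# Min-norm least squares: its variance blows up with the condition number at p = n.
w_ols = np.linalg.pinv(X) @ y

# PCA-based reduction: project onto the top-k empirical principal components,
# regress there, and map back; the effective condition number stays bounded.
k = 20
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Z = X @ Vt[:k].T                              # scores in the top-k PC subspace
w_pca = Vt[:k].T @ np.linalg.lstsq(Z, y, rcond=None)[0]
```

In the SVD basis, `w_pca` is exactly the truncation of `w_ols` to the top k components, which is why discarding the small singular values removes the diverging variance terms.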

11-19-2020, 04:30 PM

**Romain Couillet** (University Grenoble-Alpes):**Can random matrices change the future of machine learning?**

Many standard machine learning algorithms and intuitions are known to misbehave, if not dramatically collapse, when operated on large dimensional data. In this talk, we will show that large dimensional statistics, and particularly random matrix theory, not only can elucidate this behavior but provides a new set of tools to understand and (sometimes drastically) improve machine learning algorithms. Besides, we will show that our various theoretical findings are provably applicable to very realistic and not-so-large dimensional data.

11-05-2020, 04:30 PM

**Matthew Hirn** (Michigan State University):**Understanding convolutional neural networks through signal processing: From signals to manifolds to graphs**

Convolutional neural networks (CNNs) are the go-to tool for image processing tasks in machine learning. But how and why do they work so well? Using the basic guiding principles of CNNs, namely their convolutional structure, invariance properties, and multi-scale nature, we will discuss how the CNN architecture arises as a natural bi-product of these principles using the language of nonlinear signal processing. In doing so we will extract some core ideas that allow us to generalize these architectures to manifolds and graphs, while still being able to provide mathematical guarantees on the nature of the representation provided by these tools.

10-29-2020, 04:30 PM

**Leonardo Galli** (Global Optimization Laboratory):**A Nonmonotone Line Search Approach to Stochastic Gradient Descent for Over-parametrized Models**

Stochastic Gradient Descent (SGD) is the dominant optimization method used to train deep networks today. Even if its simplicity and extremely low memory requirements seem crucial for dealing with these huge models, SGD's accomplishments are deeply connected to the choice of the step size. For this reason, the focus of many optimization and machine learning researchers has shifted to the development of efficient adaptive methods for updating the step size (e.g., AdaGrad, Adam, AdaDelta). On the other hand, the classical optimization approach to selecting the step size is to use a line search technique. A recent branch of research is now trying to combine the line search with SGD. In particular, Vaswani et al. (2019) show promising numerical results and convergence rates, but the theoretical algorithm they provide does not match their implementation. In fact, they developed a numerical heuristic to deal with a small-step-size issue. Also in this case, the classical optimization doctrine provides a more elegant solution to this issue: a nonmonotone modification of the line search. In my presentation, I will show a tentative agenda of various avenues that may open up thanks to the application of nonmonotone line searches in the SGD world.
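The mechanics of a nonmonotone line search are simple to sketch: instead of requiring the new loss to decrease relative to the current one, the backtracking test compares against the maximum of the last M accepted losses. The following is a hedged toy sketch on a deterministic quadratic standing in for a minibatch loss; the parameters and problem are invented for illustration, not those of Vaswani et al.

```python
import numpy as np

def nonmonotone_sgd_step(f, grad, x, history, t0=1.0, c=1e-4, beta=0.5, M=10):
    """One gradient step with a nonmonotone Armijo line search: accept the
    step as soon as the new loss is below the max of the last M accepted
    losses minus a sufficient-decrease term."""
    g = grad(x)
    ref = max(history[-M:])              # nonmonotone reference value
    t = t0
    while f(x - t * g) > ref - c * t * np.dot(g, g):
        t *= beta                        # backtrack
    x_new = x - t * g
    history.append(f(x_new))
    return x_new

# toy convex loss standing in for a minibatch loss
A = np.diag([1.0, 2.0])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
x = np.array([5.0, 3.0])
hist = [f(x)]
for _ in range(30):
    x = nonmonotone_sgd_step(f, grad, x, hist)
```

Note that the nonmonotone test accepts steps a monotone Armijo search would reject (here, full steps that leave the loss unchanged), which is exactly the extra freedom the talk discusses.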

10-22-2020, 04:30 PM

**Sjoerd Dirksen** (Utrecht University):**Covariance estimation under one-bit quantization**

In my talk, I will consider the classical problem of estimating the covariance matrix of a high-dimensional subgaussian distribution from i.i.d. samples in the novel context of coarse quantization. That is, instead of the samples being fully known, each sample is quantized to one or two bits per entry. This estimation problem occurs naturally in signal processing applications. I will introduce new estimators for two popular quantization scenarios and present non-asymptotic bounds for the estimation errors. In both scenarios, the results allow for masked estimation. I will illustrate the near-optimality of the derived error bounds using (minimax) lower bounds and numerical simulations. The talk is based on joint work with Johannes Maly and Holger Rauhut.
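The simplest one-bit scenario admits a closed-form correlation estimator via the classical arcsine law, $\mathbb{E}[\mathrm{sign}(x_i)\,\mathrm{sign}(x_j)] = \tfrac{2}{\pi}\arcsin(R_{ij})$ for Gaussian vectors. This is only a hedged sketch of that textbook setting, not one of the talk's estimators.

```python
import numpy as np

def onebit_correlation_estimate(S):
    """Estimate the correlation matrix of a Gaussian vector from one-bit
    (sign) samples via the arcsine law.

    S: (n_samples, d) array of +/-1 sign observations."""
    C = S.T @ S / S.shape[0]                      # empirical sign-correlation
    return np.sin(0.5 * np.pi * np.clip(C, -1.0, 1.0))

rng = np.random.default_rng(0)
R = np.array([[1.0, 0.6], [0.6, 1.0]])            # ground-truth correlation
L = np.linalg.cholesky(R)
X = rng.standard_normal((200_000, 2)) @ L.T       # Gaussian samples with corr R
R_hat = onebit_correlation_estimate(np.sign(X))
```

Sign measurements lose the scale of the distribution, which is one reason the quantization schemes analyzed in the talk (e.g. with dithering) are more refined than this sketch.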

10-14-2020, 03:00 PM

**Nadav Cohen** (Tel Aviv University):**Analyzing Optimization and Generalization in Deep Learning via Dynamics of Gradient Descent**

Understanding deep learning calls for addressing the questions of: (i) optimization --- the effectiveness of simple gradient-based algorithms in solving neural network training programs that are non-convex and thus seemingly difficult; and (ii) generalization --- the phenomenon of deep learning models not overfitting despite having many more parameters than examples to learn from. Existing analyses of optimization and/or generalization typically adopt the language of classical learning theory, abstracting away many details on the setting at hand. In this talk I will argue that a more refined perspective is in order, one that accounts for the dynamics of the optimizer. I will then demonstrate a manifestation of this approach, analyzing the dynamics of gradient descent over linear neural networks. We will derive what is perhaps the most general guarantee to date for efficient convergence to global minimum of a gradient-based algorithm training a deep network. Moreover, in stark contrast to conventional wisdom, we will see that sometimes, adding (redundant) linear layers to a classic linear model significantly accelerates gradient descent, despite the introduction of non-convexity. Finally, we will show that such addition of layers induces an implicit bias towards low rank (different from any type of norm regularization), and by this explain generalization of deep linear neural networks for the classic problem of low rank matrix completion. Works covered in this talk were in collaboration with Sanjeev Arora, Noah Golowich, Elad Hazan, Wei Hu, Yuping Luo and Noam Razin.

10-08-2020, 04:30 PM

**Dustin Mixon** (Ohio State University):**Ingredients matter: Quick and easy recipes for estimating clusters, manifolds, and epidemics**

Data science resembles the culinary arts in the sense that better ingredients allow for better results. We consider three instances of this phenomenon. First, we estimate clusters in graphs, and we find that more signal allows for faster estimation. Here, "signal" refers to having more edges within planted communities than across communities. Next, in the context of manifolds, we find that an informative prior allows for estimates of lower error. In particular, we apply the prior that the unknown manifold enjoys a large, unknown symmetry group. Finally, we consider the problem of estimating parameters in epidemiological models, where we find that a certain diversity of data allows one to design estimation algorithms with provable guarantees. In this case, data diversity refers to certain combinatorial features of the social network. Joint work with Jameson Cahill, Charles Clum, Hans Parshall, and Kaiying Xie.

10-01-2020, 04:30 PM

**Jannis Kurtz** (RWTH Aachen University):**An Integer Programming Approach to Deep Neural Networks with Binary Activation Functions**

We study deep neural networks with binary activation functions (BDNN), i.e. the activation function only has two states. We show that the BDNN can be reformulated as a mixed-integer linear program which can be solved to global optimality by classical integer programming solvers. Additionally, a heuristic solution algorithm is presented and we study the model under data uncertainty, applying a two-stage robust optimization approach. We implemented our methods on random and real datasets and show that the heuristic version of the BDNN outperforms classical deep neural networks on the Breast Cancer Wisconsin dataset while performing worse on random data.

09-24-2020, 04:30 PM

**Yuege (Gail) Xie** (UT Austin):**Weighted Optimization: better generalization by smoother interpolation**

We provide a rigorous analysis of how implicit bias towards smooth interpolations leads to low generalization error in the overparameterized setting. We provide the first case study of this connection through a random Fourier series model and weighted least squares. We then argue through this model and numerical experiments that normalization methods in deep learning such as weight normalization improve generalization in overparameterized neural networks by implicitly encouraging smooth interpolants. I will also talk about some possible extensions. This is joint work with Rachel Ward, Holger Rauhut, and Hung-Hsu Chou.

09-15-2020, 04:30 PM

**Soon Hoe Lim** (KTH Stockholm):**Understanding Recurrent Neural Networks Using Response Theory**

Recurrent neural networks (RNNs) are brain-inspired models widely used in machine learning for analyzing sequential data. In this talk, we discuss how RNNs process input signals using the response theory from nonequilibrium statistical mechanics. For a class of continuous-time stochastic RNNs (SRNNs) driven by an input signal, we derive a Volterra type series representation for their output. This representation is interpretable and disentangles the input signal from the SRNN architecture. The kernels of the series are certain recursively defined correlation functions with respect to the unperturbed dynamics that completely determine the output. Moreover, exploiting connections of this representation and its implications to path signature, we identify a universal feature -- the response feature, which turns out to be the signature of tensor product of the input signal and a natural support basis. In particular, we show that the SRNNs can be viewed as kernel machines operating on a reproducing kernel Hilbert space associated with the response feature.

09-10-2020, 04:30 PM

**Jan Vybiral** (CTU Prague):**Numerical integration and Schur's product theorem**

The classical Schur product theorem says that the coordinate-wise product of two symmetric positive semi-definite matrices is a positive semi-definite matrix. We derive a new version of Schur's product theorem and use it to solve an open problem of Erich Novak about the tractability of numerical integration in high dimensions. Furthermore, we show the consequences of the new theorem for Bochner's theorem, covariance matrices, and mean values of trigonometric polynomials. This is a joint work with A. Hinrichs, D. Krieg, and E. Novak.
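The classical statement is easy to check numerically; the following minimal sketch (invented example, unrelated to the new version proved in the talk) verifies that the entrywise product of two random PSD matrices has nonnegative spectrum.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_psd(d):
    B = rng.standard_normal((d, d))
    return B @ B.T                      # Gram matrix, hence symmetric PSD

A, B = random_psd(5), random_psd(5)
hadamard = A * B                        # entrywise (Schur/Hadamard) product
eigs = np.linalg.eigvalsh(hadamard)     # Schur's theorem: all eigenvalues >= 0
```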

09-03-2020, 04:30 PM

**Felix Voigtländer** (Universität Wien):**Approximation theoretic and topological properties of neural networks**

In the first part of the talk, we discuss the approximation of $C^\beta$ functions and of classification functions with $C^\beta$ classification boundaries using ReLU networks of constant depth. The proof is based on implementing or approximating important "standard operations" like multiplication and partitions of unity using neural networks.

The optimality of these approximation results is discussed in the second part of the talk. Under the assumption that the complexity of the individual weights of the approximating networks can be reasonably controlled in terms of the approximation accuracy ("quantised networks"), the derived approximation rates are indeed optimal. Yet, if one allows arbitrarily deep networks with arbitrarily complex weights, then recent remarkable results by Yarotsky show that one can in fact *double* the "optimal" approximation rate.

The third part of the talk covers more general lower bounds for neural network approximation. On the one hand, we show for input dimension $d = 1$ that functions that are well approximated by neural networks necessarily have to have a certain (albeit low) Besov smoothness. On the other hand, for certain function classes one can improve on the optimality results presented earlier by showing that there is not just one single "hard to approximate" function, but that the property of being hard to approximate is in fact *generic*. More precisely, for certain function classes we construct a probability measure on the function class such that the probability of picking a function that can be approximated at a "higher than optimal rate" is zero.

We close the talk by discussing topological properties of the set of all functions implemented by neural networks of a fixed architecture. We show that this set is highly non-convex and not closed in $L^p(\mu)$, for a wide range of finite measures $\mu$ on $[-B,B]^d$. Closedness with respect to the uniform norm is more subtle: for most activation functions except for the ReLU, the set is not closed in $C([-B,B]^d)$, but for shallow ReLU networks, the set is indeed closed; for deeper ReLU networks this is an open problem. As we discuss, these pathological topological properties provide a partial explanation for the phenomenon of weight blowup in the training of neural networks.

This is joint work with Philipp Petersen, Mones Raslan, Philipp Grohs, Andreas Klotz, Rémi Gribonval, Gitta Kutyniok, and Morten Nielsen.

08-27-2020, 04:30 PM

**Kathlen Kohn** (KTH Stockholm):**The geometry of neural networks**

A fundamental goal in the theory of deep learning is to explain why the optimization of the loss function of a neural network does not seem to be affected by the presence of non-global local minima. Even in the case of linear networks, most of the existing literature paints a purely analytical picture of the loss, and provides no explanation as to *why* such architectures exhibit no bad local minima. We explain the intrinsic geometric reasons for this behavior of linear networks. For neural networks in general, we discuss the neuromanifold, i.e., the space of functions parameterized by a network with a fixed architecture. For instance, the neuromanifold of a linear network is a determinantal variety, a classical object of study in algebraic geometry. We introduce a natural distinction between pure critical points, which only depend on the neuromanifold, and spurious critical points, which arise from the parameterization. This talk is based on joint work with Matthew Trager and Joan Bruna.

08-20-2020, 04:30 PM

**Bubacarr Bah** (AIMS South Africa):**Practical High-Throughput, Non-Adaptive and Noise-Robust SARS-CoV-2 Testing**

We propose a compressed sensing-based testing approach with a practical measurement design and a tuning-free and noise-robust algorithm for detecting infected persons. Compressed sensing results can be used to provably detect a small number of infected persons among a possibly large number of people. There are several advantages of this method compared to classical group testing. Firstly, it is non-adaptive and thus possibly faster to perform than adaptive methods, which is crucial in exponentially growing pandemic phases. Secondly, due to the nonnegativity of measurements and an appropriate noise model, the compressed sensing problem can be solved with the non-negative least absolute deviation regression (NNLAD) algorithm. This convex tuning-free program requires the same number of tests as current state-of-the-art group testing methods. Empirically it performs significantly better than theoretically guaranteed, and is thus high-throughput, reducing the number of tests to a fraction of that required by other methods. Further, numerical evidence suggests that our method can correct sparsely occurring errors.
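The NNLAD objective is $\min_{x \ge 0} \|Ax - y\|_1$. The sketch below solves a noiseless toy instance with a projected subgradient method using Polyak steps (valid here because the optimal value is zero); this is a hedged illustration of the objective only, not the solver or measurement design from the talk.

```python
import numpy as np

def nnlad(A, y, iters=3000):
    """Minimize ||A x - y||_1 over x >= 0 (the NNLAD objective) by projected
    subgradient descent with Polyak steps; returns the best iterate found."""
    x = np.zeros(A.shape[1])
    best_x, best_f = x, float(np.sum(np.abs(A @ x - y)))
    for _ in range(iters):
        r = A @ x - y
        f = float(np.sum(np.abs(r)))
        if f < best_f:
            best_x, best_f = x.copy(), f
        g = A.T @ np.sign(r)                   # subgradient of the l1 loss
        gg = float(g @ g)
        if gg == 0.0 or f < 1e-12:
            break
        x = np.maximum(0.0, x - (f / gg) * g)  # Polyak step, project onto x >= 0
    return best_x, best_f

rng = np.random.default_rng(0)
A = rng.uniform(0.0, 1.0, size=(30, 40))       # nonnegative pooling matrix (toy)
x_true = np.zeros(40)
x_true[[3, 17, 25]] = rng.uniform(1.0, 2.0, 3) # few "infected" coordinates
y = A @ x_true                                  # noiseless pooled measurements
x_hat, obj = nnlad(A, y)
```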

08-13-2020, 04:30 PM

**Markus Faulhuber** (Universität Wien):**Inside the Frame Set**

In this talk I am going to present my new project called "Inside the Frame Set". The project originates from the area of time-frequency analysis, but also branches out into the areas of mathematical physics, analytic number theory, and geometric function theory. We will start this talk with an introduction to Gabor systems, introduced by Dennis Gabor in 1946, but already used by John von Neumann in the 1930s. The idea is to expand a signal (function) on the real line into a generalized Fourier series, such that the coefficients contain joint information about the temporal behavior and the frequency distribution. The information is obtained from a 2-dimensional representation in the time-frequency plane (also called phase space) of the 1-dimensional signal. This leads to the theory of Gabor frames, which provide a way to generalize orthonormal bases, and in the sequel we encounter the frame set and (lattice) packing problems. We will discuss several open problems of current interest in the field of time-frequency analysis and will outline connections to open problems in other fields. Also, we will see an entertaining application of the arithmetic-geometric mean of Gauss for Gaussian Gabor frames.

07-17-2020, 02:00 PM

**Robert Kunsch** (RWTH Aachen University):**Equivariances in 3D point cloud processing**

Scanning data in real-world applications is often organized as point clouds rather than 2D images. While Convolutional Neural Networks (CNNs) prove quite successful for image processing tasks and naturally implement translation equivariance, approaches for point cloud processing are diverse. This talk will give an overview of different approaches to point cloud processing in deep learning; the focus will be on rotation equivariance, and new ideas in this regard will be presented.

07-09-2020, 04:30 PM

**Maxime Nguegnang** (RWTH Aachen University):**Convergence of Gradient Descent in Deep Neural Networks**

Despite the great success of deep learning during the last decades, scientists are still not able to clearly describe the theory behind it. Our project aims to contribute to the mathematical foundations that underlie deep neural networks. Recently, Bah et al. (2019) showed the convergence of Riemannian gradient flows to the global minimizer in linear neural networks. They also showed that in the particular case of an autoencoder the flow converges to a global minimum for almost all initializations. We intend to extend these results to gradient descent and stochastic gradient descent. Next, based on the insights gained on linear neural networks, we will investigate the convergence properties of non-linear neural networks.

07-02-2020, 04:30 PM

**Hans Christian Jung** (RWTH Aachen University):**BIHT 2: Optimal measurement rates for 1-bit compressive sensing via projected subgradient descent**

In one-bit compressive sensing one tries to estimate a real-valued signal from its one-bit measurements. Due to the limited amount of information conveyed by the one-bit measurements it is only possible to approximate the signal up to a predetermined accuracy depending on the number of bits collected. In this talk we discuss the fundamental limitations of one-bit compressive sensing and prove that if the rows of the measurement matrix are subgaussian, then using a projected subgradient descent method leads to optimal measurement rates, i.e., projected subgradient descent uses the collected information in an optimal manner. This provides the first tractable recovery method for one-bit compressive sensing with a provably optimal measurement rate. Joint work with S. Dirksen and L.T. Huynh.
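The BIHT iteration the title alludes to alternates a gradient-type step on the one-bit consistency objective with hard thresholding to the s largest entries. The following is a hedged toy implementation under Gaussian-measurement assumptions (sizes and step size invented), not the talk's algorithm variant or analysis.

```python
import numpy as np

def biht(y, A, s, iters=100):
    """Binary iterative hard thresholding for one-bit CS, y = sign(A x):
    gradient-type step, then projection onto s-sparse vectors, then
    normalization (the signal norm is unrecoverable from signs)."""
    m, n = A.shape
    x = np.zeros(n)
    x[0] = 1.0                                  # arbitrary nonzero start
    for _ in range(iters):
        x = x + (1.0 / m) * A.T @ (y - np.sign(A @ x))
        keep = np.argsort(np.abs(x))[-s:]       # keep the s largest entries
        z = np.zeros(n)
        z[keep] = x[keep]
        x = z / np.linalg.norm(z)
    return x

rng = np.random.default_rng(0)
n, m, s = 100, 500, 5
x_true = np.zeros(n)
x_true[rng.choice(n, s, replace=False)] = rng.standard_normal(s)
x_true /= np.linalg.norm(x_true)
A = rng.standard_normal((m, n))                 # subgaussian measurements
y = np.sign(A @ x_true)                         # one-bit observations
x_hat = biht(y, A, s)
```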

06-25-2020, 04:30 PM

**Youness Boutaib** (RWTH Aachen University):**Introduction to the Mathematical Foundations of Recurrent Neural Networks**

In this introductory talk, we will motivate the need for recurrent neural networks for tasks where the temporal structure of the data plays an important role. We will review the mathematical foundations behind Reservoir Computing that deal with discrete-time (time series) inputs with a dependence structure (based on works by L. Gonon, L. Grigoryeva and J-P. Ortega) and then present some preliminary insights in the case of time-continuous streams of data where the reservoir computation is disturbed by noise. This last part is based on an ongoing joint work with H. Rauhut.

06-19-2020, 04:00 PM

**Ekkehard Schnoor** (RWTH Aachen University):**Generalization bounds for deep thresholding networks**

We consider compressive sensing in the scenario where the sparsity basis (dictionary) is not known in advance, but needs to be learned from examples. Motivated by the well-known iterative soft thresholding algorithm for the reconstruction, we define deep networks parametrized by the dictionary, which we call deep thresholding networks. Based on training samples, we aim at learning the optimal sparsifying dictionary and thereby the optimal network that reconstructs signals from their low-dimensional linear measurements. The dictionary learning is performed via minimizing the empirical risk. We derive generalization bounds by analyzing the Rademacher complexity of hypothesis classes consisting of such deep networks. We obtain estimates of the sample complexity that depend only linearly on the dimensions and on the depth. (Based on joint work with Arash Behboodi and Holger Rauhut.)
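A deep thresholding network in this sense is just unrolled iterative soft thresholding parametrized by the dictionary. The minimal sketch below fixes an orthonormal dictionary rather than learning it (so the talk's learning and generalization aspects are absent); in this orthonormal, noiseless case each entry of the output is off from the true sparse code by at most the threshold.

```python
import numpy as np

def soft(x, tau):
    """Entrywise soft thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def thresholding_network(y, Phi, depth=5, lam=0.05, step=1.0):
    """Deep thresholding network: `depth` unrolled soft-thresholding
    iterations, parametrized by the dictionary Phi (the learnable weights
    in the talk's setting; fixed here for illustration)."""
    x = np.zeros(Phi.shape[1])
    for _ in range(depth):
        x = soft(x + step * Phi.T @ (y - Phi @ x), step * lam)
    return x

rng = np.random.default_rng(0)
n, s, lam = 64, 4, 0.05
Phi, _ = np.linalg.qr(rng.standard_normal((n, n)))   # orthonormal dictionary
x_true = np.zeros(n)
x_true[rng.choice(n, s, replace=False)] = rng.choice([-1.0, 1.0], s)
y = Phi @ x_true                                      # noiseless measurements
x_hat = thresholding_network(y, Phi, lam=lam)
```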

06-04-2020, 04:30 PM

**Ulrich Terstiege** (RWTH Aachen University):**Learning deep linear neural networks: Riemannian gradient flows and convergence to global minimizers**

We study the convergence of gradient flows related to learning deep linear neural networks (where the activation function is the identity map) from data. In this case, the composition of the network layers amounts to simply multiplying the weight matrices of all layers together, resulting in an overparameterized problem. The gradient flow with respect to these factors can (for suitable initializations) be re-interpreted as a Riemannian gradient flow on the manifold of rank-r matrices endowed with a suitable Riemannian metric. We show that the flow always converges to a critical point of the underlying functional. Moreover, we establish that, for almost all initializations, the flow converges to a global minimum on the manifold of rank-k matrices for some k less than or equal to r. This is joint work with Bubacarr Bah, Holger Rauhut, and Michael Westdickenberg.

05-28-2020, 04:30 PM

**Johannes Maly** (RWTH Aachen University):**Robust Recovery of Low-Rank Matrices with Non-Orthogonal Sparse Decomposition II**

We again consider the problem of recovering an unknown, effectively (s1,s2)-sparse, low-rank-R matrix X with possibly non-orthogonal rank-1 decomposition from incomplete and inaccurate linear measurements of the form A(X)+n, where n is unavoidable noise. We propose an alternative version of the previously presented algorithm ATLAS, which is based on multi-penalty regularization and is able to leverage both structures (low-rankness and sparsity) simultaneously. The new algorithm's variational formulation allows for a neater theoretical analysis.

05-14-2020, 04:30 PM

**Holger Rauhut** (RWTH Aachen University):**Generic Chaining**

Generic chaining is a method to bound the supremum of a stochastic process introduced by Michel Talagrand. It can be used, for instance, to investigate the smallest and largest singular values and the restricted isometry constants of Gaussian and structured random matrices as well as the Rademacher complexity in machine learning. This talk will give an introduction to generic chaining.
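For orientation (standard material, not specific to the talk): for a process $(X_t)_{t\in T}$ whose increments are subgaussian with respect to a metric $d$, Talagrand's bound reads

```latex
\mathbb{E}\,\sup_{t\in T} X_t \;\le\; C\,\gamma_2(T,d),
\qquad
\gamma_2(T,d) \;=\; \inf_{(T_n)}\,\sup_{t\in T}\,\sum_{n\ge 0} 2^{n/2}\, d(t, T_n),
```

where the infimum runs over admissible sequences of subsets $T_n \subset T$ with $|T_0| = 1$ and $|T_n| \le 2^{2^n}$, and $d(t,T_n)$ denotes the distance from $t$ to $T_n$.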

05-07-2020, 04:30 PM

**Hung-Hsu Chou** (RWTH Aachen University):**Implicit Bias of Gradient Descent on Matrix Factorization with Perturbed Initialization**

Despite its success in many fields, deep learning still lacks theoretical analysis due to its complicated nature, including non-linearity and over-parameterization. In matrix sensing in particular, it has been observed that matrix factorization, a simplified model of neural networks, enjoys desirable properties such as low rank without additional regularization. This phenomenon is sometimes referred to as "implicit bias". We study a particular instance where this phenomenon is tractable, and further improve the model by using perturbed initialization, as opposed to the conventional identical initialization.
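A small NumPy experiment in this spirit (a generic matrix sensing sketch with small random, i.e. perturbed, initialization; dimensions, step size, and iteration count are illustrative and not taken from the talk): gradient descent on an overparameterized factorization U Uᵀ recovers a rank-1 ground truth without any explicit rank constraint.

```python
import numpy as np

rng = np.random.default_rng(2)
d, m = 8, 60
x = rng.standard_normal(d)
X_true = np.outer(x, x)  # rank-1 positive semidefinite ground truth

# symmetric Gaussian measurement matrices A_i and measurements y_i = <A_i, X_true>
As = np.array([(G + G.T) / 2 for G in rng.standard_normal((m, d, d))])
y = np.einsum('mij,ij->m', As, X_true)

# overparameterized factorization X = U U^T with small perturbed initialization
U = 1e-3 * rng.standard_normal((d, d))

eta = 0.01
for _ in range(5000):
    r = np.einsum('mij,ij->m', As, U @ U.T) - y   # measurement residuals
    grad = (2.0 / m) * np.einsum('m,mij->ij', r, As) @ U
    U -= eta * grad

X_hat = U @ U.T
rel_err = np.linalg.norm(X_hat - X_true) / np.linalg.norm(X_true)
```

Although U is a full square matrix and could represent any positive semidefinite matrix, gradient descent from small initialization ends up near the low-rank solution; this is the implicit bias at work.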

04-30-2020, 04:30 PM

**Thang Huynh** (RWTH Aachen University):**Convergence Analysis of The Binary Iterative Hard Thresholding Algorithm**

One-bit compressive sensing aims to recover (sparse) signals from 1-bit quantized measurements. The binary iterative hard thresholding (BIHT) algorithm has been numerically demonstrated to achieve state-of-the-art performance. However, its theoretical guarantees have remained unknown. In this talk, we discuss the convergence analysis of the BIHT algorithm. In particular, if the measurement matrix is a Gaussian matrix, we show that the algorithm converges linearly with a near-optimal number of measurements. Joint work with Hans-Christian Jung.
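Since the talk centers on the BIHT iteration, here is a minimal NumPy version (the standard form of the algorithm; step size, normalization convention, and problem sizes are illustrative):

```python
import numpy as np

def hard_threshold(z, s):
    # keep the s largest-magnitude entries, set the rest to zero
    out = np.zeros_like(z)
    idx = np.argsort(np.abs(z))[-s:]
    out[idx] = z[idx]
    return out

def biht(A, y, s, tau=1.0, n_iter=100):
    """Binary iterative hard thresholding for y = sign(A x), x s-sparse.
    One-bit measurements determine x only up to scale, so we renormalize."""
    m, n = A.shape
    x = np.zeros(n)
    for _ in range(n_iter):
        x = hard_threshold(x + (tau / (2 * m)) * A.T @ (y - np.sign(A @ x)), s)
        nrm = np.linalg.norm(x)
        if nrm > 0:
            x /= nrm
    return x

rng = np.random.default_rng(3)
n, m, s = 100, 500, 5
x_true = np.zeros(n)
x_true[rng.choice(n, s, replace=False)] = rng.standard_normal(s)
x_true /= np.linalg.norm(x_true)   # the scale is unrecoverable from signs
A = rng.standard_normal((m, n))
y = np.sign(A @ x_true)
x_hat = biht(A, y, s)
```

The per-iteration renormalization reflects the fact that sign measurements only identify the direction of the signal on the unit sphere.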

01-27-2020, 02:30 PM

**Maxime Nguegnang** (RWTH Aachen University):**Convergence Studies of Gradient Descent in Deep Neural Networks**

Despite the great success of deep learning during the last decades, scientists are still not able to clearly describe the theory behind it. Our project aims to contribute to the mathematical foundations that underlie deep neural networks. Recently, Bah et al. (2019) showed the convergence of Riemannian gradient flows to the global minimizer in linear neural networks. They also showed that in the particular case of an autoencoder the flow converges to a global minimum for almost all initializations. We intend to extend these results to gradient descent and stochastic gradient descent. Next, based on the insights gained on linear neural networks, we will investigate the convergence properties of non-linear neural networks.

01-20-2020, 02:30 PM

**Ekkehard Schnoor** (RWTH Aachen University):**Compressive Sensing and Neural Networks**

The talk begins with a general introduction to the project, which aims to explore connections between compressive sensing and deep learning. Then a specific algorithm for dictionary learning based on learned iterative soft thresholding (LISTA) is presented, which can be interpreted and implemented as a neural network. We formulate this as a learning problem, which is currently under investigation. To be able to tackle this problem, we review some results in statistical learning theory for deep neural networks in the final part. (Based on joint work with Arash Behboodi and Holger Rauhut.)

01-13-2020, 02:30 PM

**Robert Kunsch** (RWTH Aachen University):**Breaking the curse of dimensionality for uniform approximation in Hilbert spaces via random linear measurements**

We study the uniform approximation of d-variate functions from Hilbert spaces based on information from linear functionals (measurements). It is a common phenomenon in complexity analysis that high-dimensional problems with a priori equally important dimensions suffer from the curse of dimensionality, that is, the number n(ε,d) of measurements needed in order to solve a problem within a given accuracy ε > 0 grows exponentially in d. In this talk, certain approximation problems in periodic tensor product spaces are presented, in particular Korobov spaces with smoothness r > 1/2, for which deterministic approximation methods suffer from the curse of dimensionality while Monte Carlo methods that use random measurements can break the curse, namely, n(ε,d) ≤ c ε⁻² d (1 + log d). Interestingly, random measurements prove successful in compressed sensing as well, though randomness is merely an expedient there, as it is hard to construct optimal measurement schemes by hand.

01-06-2020, 02:30 PM

**Adeline Fermanian** (Sorbonne University):**Learning with signatures**

Sequential and temporal data arise in many fields of research, such as quantitative finance, medicine, or computer vision. We will be concerned with a novel approach for sequential learning, called the signature method, rooted in rough path theory. Its basic principle is to represent multidimensional paths by a graded feature set of their iterated integrals, called the signature. On the one hand, this approach relies critically on an embedding principle, which consists in representing discretely sampled data as paths, i.e., functions from [0,1] to R^d. After a survey of machine learning methodologies for signatures, we investigate the influence of embeddings on prediction accuracy with an in-depth study of three recent and challenging datasets. We show that a specific embedding, called lead-lag, is systematically better, whatever the dataset or algorithm used. On the other hand, in order to combine signatures with machine learning algorithms, it is necessary to truncate these infinite series. Therefore, we define an estimator of the truncation order and prove its convergence in a signature regression model.
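To make the objects concrete, here is a small NumPy sketch (standard definitions, not the speaker's code) computing the first two signature levels of a piecewise-linear path via Chen's identity, together with one common form of the lead-lag embedding of a scalar time series:

```python
import numpy as np

def signature_level2(path):
    """Levels 1 and 2 of the signature of a piecewise-linear path.
    path: array of shape (num_points, d). By Chen's identity, a straight
    segment with increment D contributes D (tensor) D / 2 at level 2."""
    inc = np.diff(path, axis=0)      # increments of the linear segments
    S1 = inc.sum(axis=0)             # level 1: total increment of the path
    d = path.shape[1]
    S2 = np.zeros((d, d))
    running = np.zeros(d)            # level-1 signature of the path so far
    for D in inc:
        S2 += np.outer(running, D) + 0.5 * np.outer(D, D)
        running += D
    return S1, S2

def lead_lag(x):
    """Lead-lag embedding of a scalar series x as a path in R^2
    (one common convention among several in the literature)."""
    lead = np.repeat(x, 2)[1:]
    lag = np.repeat(x, 2)[:-1]
    return np.stack([lead, lag], axis=1)
```

A quick sanity check on these definitions is the shuffle identity S2 + S2ᵀ = S1 ⊗ S1, which holds for the signature of any path.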

12-09-2019, 02:30 PM

**Hung-Hsu Chou** (RWTH Aachen University):**Double Descent and Implicit Bias of Gradient Descent**

In comparison to classical statistics, many modern learning tasks operate in the overparameterized regime, where the number of parameters is larger than the number of samples. Solutions to such problems are not unique without additional constraints or regularization, such as sparsity or low rank, which is closely connected to compressed sensing. However, in many complicated systems, deep neural networks for instance, it is difficult to quantify the effectiveness and interpretability of a regularizer. The success of numerical simulations without regularization suggests that there might be an implicit bias in either the algorithm or the data. Double descent and implicit bias of gradient descent are models that provide us with intuition and tools to analyze such phenomena.
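The simplest instance of implicit bias (a textbook example, not from the talk): gradient descent on an underdetermined least squares problem, started at zero, converges to the minimum-l2-norm interpolator with no explicit regularization.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 20, 50                      # overparameterized: more unknowns than samples
A = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# gradient descent on 0.5*||Ax - y||^2 from zero initialization
x = np.zeros(p)
eta = 1.0 / np.linalg.norm(A, 2) ** 2
for _ in range(5000):
    x -= eta * A.T @ (A @ x - y)

# the minimum-norm solution among the infinitely many interpolators
x_minnorm = np.linalg.pinv(A) @ y
```

The iterates never leave the row space of A when started at zero, which is exactly why gradient descent singles out the minimum-norm solution among all interpolating ones.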