PhD Thesis Defence - Chen Fan
Name: Chen Fan
Date: Friday, November 28, 2025
Time: 12:30 PM
Location: ICCS 146
Supervisors: Mark Schmidt, Christos Thrampoulidis
Thesis Title: Adaptive Step Sizes and Implicit Regularization in Optimization Models
Abstract:
Given the ever-increasing size of machine learning models, better optimization algorithms are needed to improve computational efficiency. Despite significant recent progress, the general working principles behind common optimizers remain poorly understood for complex tasks such as neural network training. Two criteria are typically considered when measuring the performance of an optimizer: 1. the convergence speed, i.e., the rate at which the training loss decreases; 2. the test accuracy of the converged solution. Given these criteria, this thesis focuses on three relevant ingredients: data sampling for (stochastic) gradient computation, step sizes, and the implicit bias of optimization.
We first consider a commonly used sampling-without-replacement scheme for computing the stochastic gradient, known as random reshuffling (RR). Despite its success in training deep neural networks, its theoretical justification has only recently been studied. In the over-parameterized setting, it had not been shown that RR achieves the same linear convergence rate as stochastic gradient descent (SGD). We bridge this gap by showing that the rate of RR is indeed linear and can be faster than that of SGD.
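To make the distinction between the two sampling schemes concrete, here is a minimal Python sketch (not taken from the thesis) contrasting one epoch of with-replacement SGD with one epoch of random reshuffling; the per-example gradient function grad_i, the step size lr, and the problem size n are illustrative placeholders.

    import numpy as np

    def sgd_epoch(w, grad_i, n, lr, rng):
        # SGD: draw each index uniformly with replacement
        for _ in range(n):
            i = rng.integers(n)
            w = w - lr * grad_i(w, i)
        return w

    def rr_epoch(w, grad_i, n, lr, rng):
        # Random reshuffling: one fresh permutation per epoch, no replacement
        for i in rng.permutation(n):
            w = w - lr * grad_i(w, i)
        return w

    # Toy least-squares example (purely illustrative)
    rng = np.random.default_rng(0)
    A, b = rng.standard_normal((10, 3)), rng.standard_normal(10)
    grad_i = lambda w, i: (A[i] @ w - b[i]) * A[i]
    w = rr_epoch(np.zeros(3), grad_i, 10, 0.1, rng)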
The stochastic Polyak (SPS) and stochastic line-search (SLS) step sizes are known to converge fast under over-parameterization, thanks to their adaptivity to the local curvature of the loss. Without over-parameterization, however, they are not guaranteed to converge to the exact solution. Given this, we first extend SPS and SLS to the non-over-parameterized setting. The advantage of our modifications is that the step sizes are not required to be monotonically decreasing. We then propose variants of SPS and SLS for bilevel optimization, which involves tuning two step sizes.
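For reference, here is a minimal sketch of the standard capped stochastic Polyak step size, with the per-example optimal value f_i^* set to zero as is common for non-negative losses; the thesis's modified variants for the non-over-parameterized and bilevel settings are not reproduced here, and loss_i and grad_i are placeholder callables.

    import numpy as np

    def sps_step(w, loss_i, grad_i, i, c=0.5, eta_max=1.0, f_star=0.0):
        # Standard (capped) stochastic Polyak step size:
        #   eta = min(eta_max, (f_i(w) - f_i^*) / (c * ||grad f_i(w)||^2))
        g = grad_i(w, i)
        eta = min(eta_max, (loss_i(w, i) - f_star) / (c * np.dot(g, g) + 1e-12))
        return w - eta * g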
Previous works on the implicit bias of optimization have focused mostly on the binary classification setting, whereas machine learning applications are typically multiclass. To this end, we study a family of optimization algorithms known as normalized steepest descent, which includes several popular algorithms, in linear multiclass classification with separable data. We show that their iterates converge to the max-margin solution defined with respect to the norm used to define each algorithm.
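As a rough illustration of the algorithm family (the multiclass margin analysis itself is beyond a short sketch), normalized steepest descent with respect to a chosen norm steps along the direction that maximizes correlation with the gradient under a unit-norm constraint; the l2 and l-infinity choices below recover normalized gradient descent and sign descent, respectively. The grad, lr, and norm arguments are illustrative placeholders.

    import numpy as np

    def nsd_step(w, grad, lr, norm="l2"):
        # Normalized steepest descent:
        #   w <- w - lr * argmax_{||d|| <= 1} <grad(w), d>
        g = grad(w)
        if norm == "l2":
            d = g / (np.linalg.norm(g) + 1e-12)   # normalized gradient descent
        elif norm == "linf":
            d = np.sign(g)                        # sign descent
        else:
            raise ValueError(f"unsupported norm: {norm}")
        return w - lr * d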