MSc Thesis Presentation - Alan Milligan
Name: Alan Milligan
Date: Tuesday, November 25th, 2025
Time: 2:00pm to 3:00pm
Location: ICCS 146
Supervisor's name: Mark Schmidt
Title of the thesis: What does the Adam optimizer actually adapt to?
Abstract: As the impact of machine learning on all aspects of daily life grows, so does the importance of understanding why and how it works. Despite the success of recent machine-learning-based systems, several elements of the general machine learning pipeline remain poorly understood. This lack of understanding is no longer a strictly academic question. Companies now spend millions of dollars training models, incurring massive energy and carbon costs. Without a solid understanding of the training process, predicting downstream model performance as a function of the choices made before training remains impossible. Many tricks have been discovered to address these challenges, but the tricks themselves remain poorly understood. In particular, training machine learning models is framed as an optimization problem, yet optimization theory lacks sufficient tools to analyze modern problems: it often fails to explain why problems are hard or why algorithms are effective. A pointed example is the Adam optimization algorithm. Adam is widely successful and considered the default optimization algorithm in machine learning, but theory predicts it to be no better than classical algorithms such as gradient descent. Adam stems from a line of “adaptive optimizers” that in some way adapt to the problem, but what they are adapting to is not clearly defined either. This thesis aims to identify characteristics of optimization problems that Adam addresses, and to show how classical theoretical assumptions fail to explain its success. We isolate heavy-tailed class imbalance in language modelling as a characteristic that causes gradient descent to fail while leaving Adam unaffected. Further analysis shows that this characteristic leads to correlations between the gradients and Hessians of the model, a quality theorized to help Adam. We then find that imbalanced features, as arise in settings using graph neural networks, likewise cause gradient descent to fail while Adam remains effective. Finally, we further challenge existing theory by showing that Adam’s performance can be both improved and destroyed by the choice of basis in which the optimization problem is posed. Most existing theory is invariant to the choice of basis and therefore fails to capture Adam’s advantage.
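
Background: as a reference for the talk, the standard Adam update of Kingma and Ba (2015) is sketched below, where g_t is the stochastic gradient, beta_1 and beta_2 are momentum parameters, alpha is the step size, epsilon > 0, and all operations are element-wise; the per-coordinate rescaling by the second-moment estimate v_t is the “adaptation” referred to in the abstract.

\begin{align*}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, &
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2, \\
\hat{m}_t &= m_t / (1 - \beta_1^t), &
\hat{v}_t &= v_t / (1 - \beta_2^t), \\
\theta_t &= \theta_{t-1} - \alpha\, \hat{m}_t / \bigl(\sqrt{\hat{v}_t} + \epsilon\bigr).
\end{align*}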