PhD Defence - Frederik Dieter Kunstner

Name: Frederik Dieter Kunstner

Date: Aug 26, 2024

Time: 10:00 am to 1:00 pm

Location: ICCS X836

Supervisor: Mark Schmidt

Title: Optimization Algorithms in Machine Learning: Methods for Probabilistic Models and Adaptive Methods

Abstract: 

The impressive recent applications of machine learning have coincided with an increase in the cost of developing new methods. Beyond the obvious computational cost of processing large amounts of data, the more insidious cost is complexity: the development of machine learning systems is not yet predictable and instead relies on rules of thumb and expensive trial and error. This thesis focuses on the fundamental methods used to build machine learning models, which “learn” by solving a numerical optimization problem to fit a model to data. Many successful optimization heuristics have emerged in recent years, but we have little understanding of why these heuristics outperform classical optimization schemes. The goal of this thesis is to build a better understanding of why the methods widely used in machine learning work, presenting theoretical or empirical contributions in three areas.

  • Algorithms are often designed to tackle specific problems with a known structure, like the expectation maximization (EM) algorithm for statistical models with missing data. But current optimization theory ignores this structure and relies on assumptions that are not satisfied, even by textbook applications. We derive the convergence rate of EM for exponential families without additional assumptions (a schematic statement of the EM update in this setting is sketched after this list).
  • Many heuristics have been proposed to build “adaptive” methods that automatically adjust per-coordinate step-sizes, but these heuristics are often brittle because there is no formal definition of “adaptivity”. We formalize the problem as finding per-coordinate step-sizes that are competitive with the optimal ones for a given problem, and develop an algorithm that provably finds competitive step-sizes (this notion of competitiveness is written out after this list).
  • For recently developed language models, the Adam algorithm outperforms stochastic gradient descent by such a large margin that it is now the default option, but the reason for this improvement is not clear. We empirically evaluate hypotheses proposed to explain this performance gap, showing that the gap is not due to noise, and isolate a feature of language problems that leads to optimization difficulties. We show that heavy-tailed class imbalance, where many rare classes have a large impact on the objective function, leads to a performance gap and to a correlation between gradients and Hessians that has been hypothesized to benefit Adam (a toy illustration of such an imbalanced problem follows this list).
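
As a rough guide for the first point: in the textbook setting where the complete-data model is an exponential family, EM can be written as a moment-matching iteration. The notation below is a generic sketch of that classical update, assuming a complete-data likelihood p(x, z | \theta) = h(x, z) \exp(\langle s(x, z), \theta \rangle - A(\theta)) with observed data x and missing data z; it is not necessarily the exact formulation used in the thesis.

E-step:  \bar{s}_t = \mathbb{E}_{z \sim p(z \mid x, \theta_t)}\left[ s(x, z) \right]
M-step:  \theta_{t+1} = \arg\max_{\theta} \; \langle \bar{s}_t, \theta \rangle - A(\theta), \quad \text{i.e.} \quad \nabla A(\theta_{t+1}) = \bar{s}_t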
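
For the second point, the competitiveness criterion in the abstract can be read schematically as follows (the notation here is chosen for illustration). For gradient descent with per-coordinate step-sizes p \in \mathbb{R}^d_{>0},

x_{t+1} = x_t - \mathrm{diag}(p)\, \nabla f(x_t),

the goal is an algorithm that finds step-sizes whose performance is within a constant factor of that of the best fixed per-coordinate step-sizes p^{*}, chosen with full knowledge of the problem.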
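
For the third point, the snippet below is a small self-contained illustration, written for this announcement rather than taken from the thesis, of a heavy-tailed class-imbalance problem: softmax regression with Zipf-distributed labels, trained once with full-batch gradient descent and once with Adam so the two can be compared without gradient noise. All problem sizes and step-sizes are arbitrary illustrative choices.

import numpy as np

# Toy illustration (not the thesis experiments): softmax regression with a
# heavy-tailed, Zipf-like label distribution, optimized with full-batch
# gradient descent and with Adam.
rng = np.random.default_rng(0)
n, d, c = 2000, 32, 200            # samples, features, classes

freq = 1.0 / np.arange(1, c + 1)   # Zipf-like frequencies: a few common
freq /= freq.sum()                 # classes, many rare ones
y = rng.choice(c, size=n, p=freq)
X = rng.standard_normal((n, d))

def loss_grad(W):
    """Mean cross-entropy of softmax regression and its gradient in W."""
    logits = X @ W
    logits -= logits.max(axis=1, keepdims=True)
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)
    loss = -np.mean(np.log(P[np.arange(n), y] + 1e-12))
    P[np.arange(n), y] -= 1.0      # P - onehot(y)
    return loss, X.T @ P / n

def run_gd(steps=500, lr=1.0):
    W = np.zeros((d, c))
    for _ in range(steps):
        _, g = loss_grad(W)
        W -= lr * g
    return loss_grad(W)[0]

def run_adam(steps=500, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    W = np.zeros((d, c))
    m, v = np.zeros_like(W), np.zeros_like(W)
    for t in range(1, steps + 1):
        _, g = loss_grad(W)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g ** 2
        W -= lr * (m / (1 - b1 ** t)) / (np.sqrt(v / (1 - b2 ** t)) + eps)
    return loss_grad(W)[0]

print("full-batch gradient descent, final loss:", run_gd())
print("Adam, final loss:", run_adam())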