PhD Thesis Defence - Dujian Ding
Name: Dujian Ding
Date: Thursday, 29 May 2025
Time: 12:30pm
Location: X836, ICICS, 2366 Main Mall
Zoom Link: https://ubc.zoom.us/j/62059037432?pwd=7mNZ6LibHfds6xZRMAoE0Jmro4BU7k.1 (Meeting ID: 620 5903 7432; Passcode: 598151)
Supervisor: Laks V.S. Lakshmanan
Thesis title: Towards Efficient Machine Learning Management Systems
Abstract:
Machine learning (ML), especially deep learning (DL), has become a leading force in both academic research and industrial applications, yielding state-of-the-art solutions for a wide range of real-world problems. In tandem with the impressive capability and high generalizability of ML models, model sizes have grown drastically in recent years. Gigantic models not only pose computational challenges, including enormous resource consumption and unacceptably high costs at both the training and deployment stages, but also raise ethical concerns around green AI, responsible AI, and more. Significant research effort has been invested in making ML more efficient, either at the model level or at the system level. Model-level ML efficiency studies the fundamental trade-off between model efficiency and effectiveness, while system-level ML efficiency treats ML models as atomic operators and addresses the overall efficiency of answering ML inference queries that invoke multiple models.
In this dissertation, we aim to answer the central question: how can we make machine learning services more efficient without compromising overall performance? Our work spans both model-level and system-level ML efficiency. At the model level, we propose effective approaches to extract efficient subnetworks from gigantic ML models under user-specified sparsity targets while maintaining high model performance. At the system level, we identify two important sub-tasks: bulk query processing, where query objects are provided all at once, and streaming query processing, where they arrive over time. For bulk query processing, we study the Fixed-Radius Near Neighbour query and develop approximate algorithms that efficiently deliver high-quality answers with statistical guarantees by choosing a judicious combination of one cheap proxy model and one expensive oracle model. We then investigate the more general setting where a set of ML models with different cost-accuracy trade-offs is available, and conceive principled algorithms that select optimal model assignments for given query objects and compute high-accuracy answers with statistical guarantees for ML classification queries. For streaming query processing, we consider powerful conversational ML services such as ChatGPT, which are powered by large language models (LLMs). We develop novel adaptive query routing solutions that significantly reduce overall cost by diverting query traffic from expensive cloud models to small on-device models without compromising response quality. Finally, we extend the routing framework to a spectrum of LLMs with different efficiency-performance trade-offs and leverage the intrinsic non-determinism of modern LLMs to harness their respective strengths and deliver high-quality responses.
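To give a flavour of the adaptive routing idea in the abstract, the sketch below sends each query to a small on-device model when a router score deems it easy and falls back to an expensive cloud model otherwise. The scorer, threshold, and model stand-ins are hypothetical illustrations, not the thesis's actual method.

```python
def route_queries(queries, router_score, cheap_model, expensive_model, threshold=0.5):
    """Answer each query with the cheap model when the router is confident
    it will respond well; otherwise fall back to the expensive model.
    Returns the answers and the fraction of queries served cheaply."""
    answers, cheap_hits = [], 0
    for q in queries:
        if router_score(q) >= threshold:
            answers.append(cheap_model(q))
            cheap_hits += 1
        else:
            answers.append(expensive_model(q))
    return answers, cheap_hits / len(queries)

# Toy stand-ins: both "models" just echo the query; the router treats
# short queries as easy. A real router would be a learned quality predictor.
cheap = lambda q: q
expensive = lambda q: q
score = lambda q: 1.0 if len(q) < 20 else 0.0

answers, cheap_fraction = route_queries(
    ["hi", "a very long and complicated question indeed"],
    score, cheap, expensive)
```

In this toy run, half the traffic is served by the cheap model; in practice, the threshold trades off cost savings against response quality.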