PhD Thesis Defense for Tanzila Rahman
Name: Tanzila Rahman
Date: Monday, April 15, 2024
Time: 1 pm
Location: ICCS X836
Supervisor: Leonid Sigal
Title: On Effective Learning for Multimodal Data
Abstract:
Humans perceive the world through multiple modalities. Strong evidence from behavioral science suggests that this ability, with its inherent implicit information integration and cross-modal alignment, is critical for human learning. Nevertheless, until relatively recently, most deep learning methods focused primarily on single-modality problems associated with learning from vision, sound, or text. In recent years, however, researchers have started to focus on multi-modal learning, specifically emphasizing high-level visual comprehension challenges such as image-text matching, video captioning, and generation of audio-visual content. In this thesis, we aim to broaden the scope of learning from multi-modal information, enhance its integration, and solve problems related to human-centric spatio-temporal perception in a manner that does not necessarily require complete supervision (e.g., granular spatio-temporal multi-modal alignment).
Specifically, in this thesis, we focus on addressing two fundamental challenges: (1) multi-modal learning and (2) weak supervision. We address these challenges across a range of diverse tasks. First, we focus on weakly-supervised dense video captioning, where we combine audio with visual features to improve state-of-the-art performance. We also show that audio on its own can carry a surprising amount of information compared to existing visual-only models. Second, we introduce an end-to-end audio-visual co-segmentation network to recognize individual objects and corresponding sounds using only object labels. Importantly, compared to prior work, our approach does not require any additional supervision or bounding box proposals. Third, we propose TriBERT, a transformer-based architecture with co-attention that learns contextual features across three modalities: vision, pose, and audio. We show that these features are general and improve performance on a variety of tasks spanning audio-visual sound source separation and cross-modal retrieval. Fourth, we delve into generative text-to-image (TTI) models, specifically to address consistency when generating complex story visualizations. We do so by augmenting diffusion models with memory that can be leveraged for reference resolution and to implicitly maintain consistency between referred visual elements. Finally, we look at aspects of personalization within TTI. This allows us to generate diverse visuals for custom, user-specified concepts (e.g., a specific person, dog, etc.). Through our comprehensive analysis of these tasks, we present significant algorithmic, theoretical, and empirical contributions to the field of multimodal machine learning and computer vision.