PhD Thesis Defense - Raghav Goyal
Name: Raghav Goyal
Date: November 19
Time: 1 pm
Location: ICCS 146
Supervisor: Leonid Sigal
Title: Data-Efficient Learning On Structured Output Data
Abstract:
Deep learning relies on huge amounts of labeled data, which are time-consuming and expensive to gather and annotate. The issue becomes even more pressing when dealing with structured data, where the cost of annotation grows with the complexity of the annotations, from scribbles and bounding boxes to segmentation masks and scene graphs. Efficient learning approaches attempt to alleviate this issue by utilizing annotations of lower quantity (e.g., few-shot learning) or lower quality (e.g., weakly-supervised learning) for a target task. In this thesis, we explore and develop efficient learning approaches to tackle structured output tasks across images and videos.
Specifically, we first propose a unifying approach for any-shot (zero-shot and few-shot) object detection and segmentation using a semi-supervised transfer learning methodology that learns to semantically transform weak detectors/segmentors into strong ones. We then move to more granular annotations, scene graphs, and propose a simple weakly-supervised approach for human-centric scene-graph detection, where, despite assuming weaker supervision for objects and relations, we perform competitively with state-of-the-art approaches.
We then turn our focus to videos, which, compared to images, are more label-intensive due to the additional temporal dimension. We explore model design choices for videos in the context of efficient learning paradigms. First, we look at dynamic spatio-temporal annotations, where we propose a single, unified model for multi-modal, query-based understanding of long-form videos, and show that multi-task training improves performance and the ability to generalize to unseen tasks. Second, in the context of long-video object segmentation, we propose a transformation-aware loss that places greater emphasis on parts of a video where the tracked object is undergoing deformation, along with a time-coded memory that goes beyond vanilla additive positional encoding and helps propagate context across long videos; together these yield improved performance over prior works.