PhD thesis defense - Shih-Han Chou
Name: Shih-Han Chou
Date: July 29th
Time: 3 pm
Location: ICCS X836
Supervisor: Leonid Sigal
Title: Vision and Language: Representation Learning, Commonsense Reasoning, and Consistency
Abstract:
Vision-Language Models (VLMs), a prominent sub-class of multimodal architectures, are important because they allow automated systems to process and understand both visual and linguistic data, enabling more intuitive and accessible interactions. They bridge the gap between vision and language, driving advances in areas such as visual search, virtual assistants, and many others. The recent advent of Large Language Models (LLMs) has significantly accelerated progress in this domain, prompting new research directions and challenges.
This thesis investigates vision-language modeling from several complementary perspectives, with the overarching aim of improving alignment, reasoning, and consistency in multimodal systems. First, we address the problem of fine-grained alignment between vision and language. We propose a semi-supervised grounding mechanism that leverages a pre-trained phrase grounding model to generate region-phrase pseudo-labels. This approach enables more granular alignment and stronger feature learning without requiring additional human annotations, which is particularly valuable in large-scale settings. Second, we explore the integration of commonsense knowledge in multi-sentence video captioning. We introduce a Transformer-based model that incorporates both implicit (visuo-lingual and purely linguistic) and explicit (knowledge-base) commonsense information, and demonstrate that incorporating such prior knowledge leads to more coherent and contextually appropriate captions. Third, we present a systematic analysis of the semantic consistency of VLMs. We construct a benchmark comprising three tasks (Question Rephrasing, Image Restyling, and Context Reasoning) and evaluate model responses using correctness and consistency metrics. Finally, we propose a test-time adaptation framework that improves semantic consistency without requiring supervised re-training. This model-agnostic method optimizes two complementary objectives: a cross-entropy agreement loss that aligns predictions across semantically equivalent inputs, and a pseudo-label consistency loss that steers predictions toward a stable consensus. The method operates post hoc and uses only the test input, making it broadly applicable.
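To make the last contribution concrete, the lines below give a minimal, illustrative sketch (in PyTorch) of the two test-time objectives described above. The function and variable names (consistency_losses, logits) are hypothetical, and the exact formulation used in the thesis may differ.

import torch
import torch.nn.functional as F

def consistency_losses(logits: torch.Tensor) -> torch.Tensor:
    """Illustrative test-time objectives for a single test input.

    logits: (K, C) answer logits produced by a VLM for K semantically
    equivalent variants of the same input (e.g., rephrased questions).
    """
    probs = F.softmax(logits, dim=-1)
    log_probs = F.log_softmax(logits, dim=-1)

    # Consensus distribution over answers, averaged across the K variants.
    consensus = probs.mean(dim=0).detach()

    # Agreement loss: cross-entropy between the consensus distribution and each
    # variant's prediction, pulling semantically equivalent inputs together.
    agreement = -(consensus * log_probs).sum(dim=-1).mean()

    # Pseudo-label consistency loss: steer every variant toward the consensus
    # (hard) pseudo-label.
    pseudo_label = consensus.argmax().repeat(logits.size(0))
    pseudo = F.cross_entropy(logits, pseudo_label)

    return agreement + pseudo

In practice, such a combined loss would be minimized for a few gradient steps on a small set of adaptable parameters at test time, using only the test input and its generated variants.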
Collectively, the contributions of this thesis advance both the methodological foundations and empirical understanding of multimodal learning, particularly in enhancing alignment, reasoning, and reliability in VLMs.