MSc Thesis Presentation - Amirhossein Abaskohi

Name: MSc Thesis Presentation - Amirhossein Abaskohi

Date: 7th August 

Time: 10 a.m.

Location: ICCS 104

Supervisor: Giuseppe Carenini

Co-supervisor: Issam H. Laradji

Title: Multimodal Understanding of Long Documents: From Topic Modeling to Question Answering

Abstract:
Long multimodal documents, which contain text, images, and other types of content, are common in real-world settings but remain difficult for natural language processing (NLP) models to process. These documents pose challenges in both understanding their content and training models when labeled data is limited. This thesis presents two contributions that address these problems from different angles.

First, we introduce CEMTM, a topic modeling method designed for long documents that include both text and images. Instead of relying on bag-of-words representations or treating the modalities separately, CEMTM uses contextual embeddings and cross-modal alignment to produce more coherent and meaningful topics. It performs well across several datasets and offers better topic diversity and interpretability.

Second, we present FM2DS, a pipeline for generating synthetic training data for multimodal multihop question answering (MMQA). FM2DS uses prompting and document retrieval to create realistic question answering (QA) examples, and applies knowledge distillation to transfer reasoning ability from a large teacher model to a smaller multimodal model. This approach makes it possible to train competitive QA systems with only a few examples, reducing the need for large annotated datasets.

Together, these two methods support more effective processing of long multimodal documents: CEMTM for exploring and summarizing content, and FM2DS for enabling downstream MMQA systems in low-resource settings. We evaluate both approaches across multiple tasks and demonstrate a substantial performance improvement.