Domain adaptation for automatic summarization of human conversations
By Oana Sandu, UBC CS
The goal of summarization in natural language processing is to create abridged and informative versions of documents. A popular approach is supervised extractive summarization: given a training source corpus of documents with sentences labeled with their informativeness, train a model to select sentences from a target document and produce an extract. Conversational text is challenging to summarize because it is less formal, its structure depends on the modality or domain, and few annotated corpora exist.
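As a rough illustration of the supervised extractive setup described above, the following is a minimal sketch, not the system presented in this work. It assumes a toy feature set (relative position and sentence length) and a plain logistic regression trained by stochastic gradient ascent; the function names, features, and example sentences are all hypothetical.

```python
import math

def features(sentence, index, total):
    """Toy features (assumed for illustration): bias, relative position, capped length."""
    words = sentence.split()
    return [1.0, index / max(total - 1, 1), min(len(words) / 20.0, 1.0)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(w, x):
    """Predicted probability that a sentence is informative."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))

def train(docs, epochs=200, lr=0.5):
    """docs: list of documents, each a list of (sentence, label), label in {0, 1}."""
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for doc in docs:
            for i, (sent, label) in enumerate(doc):
                x = features(sent, i, len(doc))
                p = score(w, x)
                # Gradient step on the log-likelihood of the label.
                w = [wj + lr * (label - p) * xj for wj, xj in zip(w, x)]
    return w

def extract(w, sentences, k):
    """Return the k highest-scoring sentences, kept in document order."""
    scored = [(score(w, features(s, i, len(sentences))), i)
              for i, s in enumerate(sentences)]
    top = sorted(scored, reverse=True)[:k]
    return [sentences[i] for i in sorted(i for _, i in top)]

# Hypothetical labeled source document: 1 = informative, 0 = not informative.
train_docs = [[
    ("ok", 0),
    ("we decided to ship the new remote control design by the end of march", 1),
    ("yeah", 0),
    ("the budget for the prototype is capped at twelve euros per unit", 1),
]]
weights = train(train_docs)

# Hypothetical target document to summarize.
target = [
    "hi all",
    "the review meeting is moved to friday because the test results are late",
    "thanks",
    "please send your slides to the organizers before thursday evening at the latest",
]
summary = extract(weights, target, k=2)
```

The extract is simply the subset of target sentences the model scores highest, which is what makes the approach sensitive to differences between the source and target domains.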

We use a labeled corpus of meeting transcripts as the source, and attempt to summarize a different target domain, threaded emails. We study two domain adaptation scenarios: a supervised scenario in which some labeled target-domain data is available for training, and an unsupervised scenario in which only unlabeled target data is available, together with labeled data from a related but different domain.

We implement several recent domain adaptation algorithms and perform a comparative study of their performance. We also compare the effectiveness of using a small set of conversation-specific features with a large set of raw lexical and syntactic features for domain adaptation. We report significant improvements of the algorithms over their baselines.

Our results show that in the supervised case, given the amount of email data available and the set of features specific to conversations, training directly in-domain and ignoring the out-of-domain data is best. With only the more domain-specific lexical features, overall performance is lower, but domain adaptation can effectively leverage these features to improve results in both the supervised and unsupervised scenarios.
