Supervised Machine Learning for Email Thread Summarization

By Jan Ulrich

Email has become part of most people's lives, and the ever-increasing volume of messages people receive can lead to email overload. We attempt to mitigate this problem using email thread summarization. We have built a machine learning summarizer for emails and annotated a dataset for training it. While previous research has shown that machine learning algorithms are a promising approach to email summarization, there has not been a study on the impact of the choice of algorithm. We explore new techniques in email thread summarization using several regression-based classifiers, and the results show that the choice of classifier is critical. We also present a novel feature set for email summarization and analyze two email corpora. Finally, we introduce the BC3 corpus, a new publicly available email dataset annotated for summarization, along with the open source framework that we built to perform the annotation.
