The BC3: British Columbia Conversation Corpus: The First Publicly Available Annotated Corpus for Email Summarization

Download here


The corpus consists of 40 email threads (3222 sentences) drawn from the W3C corpus. Each thread has been annotated by three different annotators. The annotation consists of the following:

- Extractive Summaries
- Abstractive Summaries with linked sentences
- Labeled Sentences with the following labels
    - Speech Acts: Propose, Request, Commit, Meeting
    - Meta Sentences
    - Subjectivity
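
The annotation layers listed above can be pictured as a small data model. The sketch below is only an illustration: the class names, field names, and sentence ids are hypothetical and do not describe the file format of the corpus download. It also assumes that a sentence may carry more than one speech-act label. It shows one way to hold the three annotators' output for a thread and to count how many of them selected each sentence for their extractive summary.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

# Speech-act labels named in the corpus description.
SPEECH_ACTS = ("Propose", "Request", "Commit", "Meeting")


@dataclass
class AbstractSentence:
    """One sentence of an abstractive summary, linked to thread sentences."""
    text: str
    linked_sentence_ids: List[str] = field(default_factory=list)


@dataclass
class Annotation:
    """What one annotator produced for one thread (hypothetical layout)."""
    annotator: str
    extractive_ids: List[str] = field(default_factory=list)
    abstractive: List[AbstractSentence] = field(default_factory=list)
    # Sentence id -> speech-act labels (assumed multi-label).
    speech_acts: Dict[str, Set[str]] = field(default_factory=dict)
    meta_ids: Set[str] = field(default_factory=set)
    subjective_ids: Set[str] = field(default_factory=set)


def extractive_votes(annotations: List[Annotation]) -> Dict[str, int]:
    """Count how many annotators put each sentence in their extractive summary."""
    votes: Dict[str, int] = {}
    for ann in annotations:
        for sid in set(ann.extractive_ids):
            votes[sid] = votes.get(sid, 0) + 1
    return votes


# Toy data: invented sentence ids, not taken from the corpus.
annotations = [
    Annotation("A", extractive_ids=["1.1", "2.3"]),
    Annotation("B", extractive_ids=["1.1"]),
    Annotation("C", extractive_ids=["1.1", "2.3", "3.2"]),
]
votes = extractive_votes(annotations)
```

Per-sentence vote counts of this kind are one simple way to turn three separate extractive summaries into a single score per sentence.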

If you use the BC3 corpus please cite the following paper:

Ulrich J., Murray G., Carenini G. A Publicly Available Annotated Corpus for Supervised Email Summarization. AAAI08 EMAIL Workshop, Chicago, USA, 2008. [pdf] [bib]

Papers using the BC3 corpus:

Shafiq Joty, Giuseppe Carenini, Chin-Yew Lin. Unsupervised Modeling of Dialog Acts in Asynchronous Conversations. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence (IJCAI) 2011. Barcelona, Spain. [pdf]

Shafiq Joty, Giuseppe Carenini, Gabriel Murray and Raymond Ng. Supervised Topic Segmentation of Email Conversations. In Fifth International AAAI Conference on Weblogs and Social Media (ICWSM) 2011, Barcelona, Spain, AAAI. (short paper). [pdf]

Joty S., Carenini G., Murray G. and Ng R. Exploiting Conversation Structure in Unsupervised Topic Segmentation for Emails. In Proc. of the Conference on Empirical Methods in NLP (EMNLP 2010), MIT, Massachusetts, USA, Oct. 2010. [pdf]

Minwoo Jeong, Chin-Yew Lin and Gary Geunbae Lee. Semi-Supervised Speech Act Recognition in Emails and Forums. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP 2009), pages 1250-1259, Singapore, 6-7 August 2009. [pdf] [bib]

Murray G. and Carenini G., Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP 2009), Singapore, 2009. [pdf] [bib]

Jan Ulrich, Giuseppe Carenini, Gabriel Murray, Raymond Ng. Regression-Based Summarization of Email Conversations. 3rd Int'l AAAI Conference on Weblogs and Social Media (ICWSM-09), San Jose, CA. [pdf] [bib]

Murray G. and Carenini G., Summarizing Spoken and Written Conversations. Empirical Methods in NLP (EMNLP 2008), Waikiki, Hawaii, 2008. [pdf] [bib]


The BC3 Annotation Software: An open-source tool for annotating email threads or other conversations

The BC3 corpus was annotated using a web-based annotation framework. The framework is open source and available for download for annotating conversations. It is built with Ruby on Rails and a MySQL database, so researchers can set up a web server to import and manage an email corpus. It also lets users annotate email threads with summaries and label email features.

Download here

Creative Commons License


The BC3 Corpus is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License.

The BC3 framework is licensed under the MIT license.

If you have any questions or comments, please contact one of the following team members:

bullet Shafiq Joty
bullet Gabriel Murray
bullet Giuseppe Carenini

Previous Team members include:

