The BC3: British Columbia Conversation Corpora

 

The collection currently comprises:

BC3- Email Corpus

BC3- Email Corpus- Polarity

BC3- Blog Corpus

BC3-Email and Blog Corpus Annotated with Topics


 

BC3- Email Corpus

The corpus consists of 40 email threads (3222 sentences) from the W3C corpus. Each thread has been annotated by three different annotators. The annotation consists of the following:

  • Extractive Summaries
  • Abstractive Summaries with linked sentences
  • Sentences labeled with:
    • Speech Acts: Propose, Request, Commit, Meeting
    • Meta Sentences
    • Subjectivity
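
As a rough illustration of how the three layers of annotation listed above could be held in memory, here is a minimal Python sketch. It does not describe the corpus's actual file format: the class names, field names, the sentence ids such as "1.1", and the helper `extractive_votes` are all hypothetical and chosen only for this example.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Set, Tuple


class SpeechAct(Enum):
    # The four speech-act labels named in the corpus description.
    PROPOSE = "Propose"
    REQUEST = "Request"
    COMMIT = "Commit"
    MEETING = "Meeting"


@dataclass
class SentenceLabels:
    # Per-sentence labels from one annotator (hypothetical field names).
    speech_acts: Set[SpeechAct] = field(default_factory=set)
    is_meta: bool = False
    is_subjective: bool = False


@dataclass
class ThreadAnnotation:
    # One annotator's annotation of one email thread (hypothetical layout).
    annotator: str
    # Ids of the sentences selected for the extractive summary.
    extractive_summary: List[str]
    # Each abstractive summary sentence with the ids of its linked sentences.
    abstractive_summary: List[Tuple[str, List[str]]]
    # Sentence id -> labels assigned by this annotator.
    labels: Dict[str, SentenceLabels] = field(default_factory=dict)


def extractive_votes(annotations: List[ThreadAnnotation]) -> Counter:
    """Count how many annotators selected each sentence id."""
    votes: Counter = Counter()
    for annotation in annotations:
        # Use a set so one annotator contributes at most one vote per sentence.
        votes.update(set(annotation.extractive_summary))
    return votes
```

Since each thread has three annotators, a helper like `extractive_votes` gives each sentence a score from 0 to 3, which is one simple way to compare or combine the extractive summaries.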

     

Download the corpus

 

If you use the BC3 email corpus please cite the following paper:

 

Ulrich J., Murray G., Carenini G., A Publicly Available Annotated Corpus for Supervised Email Summarization. AAAI08 EMAIL Workshop, Chicago, USA, 2008. [pdf] [bib]

Some papers using the BC3 email corpus:

 

Shafiq Joty, Giuseppe Carenini, Chin-Yew Lin. Unsupervised Modeling of Dialog Acts in Asynchronous Conversations. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence (IJCAI) 2011. Barcelona, Spain. [pdf] [bib]

 

Minwoo Jeong, Chin-Yew Lin and Gary Geunbae Lee, Semi-Supervised Speech Act Recognition in Emails and Forums, Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 1250–1259, Singapore, 6-7 August 2009. [pdf] [bib]


Murray G. and Carenini G., Predicting Subjectivity in Multimodal Conversations. Empirical Methods in NLP (EMNLP 2009), Singapore, 2009. [pdf] [bib]


Jan Ulrich, Giuseppe Carenini, Gabriel Murray, Raymond Ng. Regression-Based Summarization of Email Conversations. 3rd Int'l AAAI Conference on Weblogs and Social Media (ICWSM-09), San Jose, CA. [pdf] [bib]


Murray G. and Carenini G., Summarizing Spoken and Written Conversations. Empirical Methods in NLP (EMNLP 2008), Waikiki, Hawaii, 2008. [pdf] [bib]


BC3- Email Corpus- Polarity

In a second round of annotations, three different annotators were asked to go through all of the sentences previously labeled as subjective and indicate whether each sentence was positive (P), negative (N), both positive and negative (PN), or neither (X).
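
With three annotators per sentence, the polarity labels often need to be combined into a single label before use. The following is a minimal sketch of one common option, a strict majority vote over the P, N, PN and X labels. It is an assumption made for illustration, not a procedure prescribed by the corpus, and the function name `majority_polarity` is hypothetical.

```python
from collections import Counter

# The four polarity labels used in the second annotation round.
POLARITY_LABELS = {"P", "N", "PN", "X"}


def majority_polarity(labels):
    """Return the label chosen by more than half of the annotators, or None.

    `labels` holds one polarity label per annotator for a single sentence.
    """
    if not labels:
        raise ValueError("at least one label is required")
    unknown = set(labels) - POLARITY_LABELS
    if unknown:
        raise ValueError(f"unknown polarity labels: {sorted(unknown)}")
    top_label, count = Counter(labels).most_common(1)[0]
    # Require a strict majority; a three-way split yields no consensus label.
    return top_label if count > len(labels) / 2 else None
```

Returning `None` on a three-way split leaves the choice of how to treat disagreement (discard the sentence, keep all labels, and so on) to the caller.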

Download the email corpus with polarity annotations


BC3- Blog Corpus

This corpus consists of 7000 blog conversations with user-labeled comments from 6 popular websites (Slashdot, Macrumors, AndroidCentral, Dailykos, BusinessInsider, TSN).

Download the blog corpus

For details on this corpus, see this M.Sc. thesis.


BC3-Email and Blog Corpus Annotated with Topics

This corpus comes with topic annotations for the 40 email threads of the BC3 email corpus and 20 blog conversations from Slashdot.

If you use this corpus please cite the following paper:

  • S. Joty, G. Carenini and R. T. Ng (2013). Topic Segmentation and Labeling in Asynchronous Conversations. JAIR, Volume 47, pages 521-573.

Papers using this corpus:

  • Yashar Mehdad, Giuseppe Carenini, Raymond Ng and Shafiq Joty. Towards Topic Labeling with Phrase Entailment and Aggregation. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2013), Atlanta, USA. [pdf][bib]
  • Shafiq Joty, Giuseppe Carenini, Gabriel Murray and Raymond Ng. Supervised Topic Segmentation of Email Conversations. In Fifth International AAAI Conference on Weblogs and Social Media (ICWSM) 2011, Barcelona, Spain, AAAI. [pdf][bib]
  • Shafiq Joty, Giuseppe Carenini, Gabriel Murray and Raymond Ng. Exploiting Conversation Structure in Unsupervised Topic Segmentation for Emails. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2010), MIT, Massachusetts, USA. [pdf][bib]




    The BC3 Corpus is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License.

    Link to conversation annotation software

     If you have any questions or comments, please contact one of the following team members:

    Previous team members include:

    • Jan Ulrich
    • Gabriel Murray