Software and Data

Unleashing the Power of Neural Discourse Parsers - A Context and Structure Aware Approach Using Large Scale Pretraining (COLING 2020)

We investigate the benefits of large-scale language models and silver-standard tree pretraining for RST discourse parsing. Our results indicate that both additions significantly improve parsing results.

Improving Context Modeling in Neural Topic Segmentation (AACL 2020)

We enhance a neural topic segmenter based on a hierarchical attention BiLSTM network to better model context by adding a coherence-related auxiliary task and restricted self-attention.
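As an illustration of the restricted self-attention component, below is a minimal sketch in which each sentence representation may only attend to neighbours inside a fixed window; the window size and tensor shapes are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def restricted_self_attention(h, window=3):
    """Self-attention over sentence encodings h of shape (seq_len, dim),
    where each sentence may only attend to neighbours at most `window`
    positions away (the 'restricted' part)."""
    seq_len, dim = h.shape
    scores = h @ h.T / dim ** 0.5                        # pairwise similarity
    idx = torch.arange(seq_len)
    out_of_window = (idx[None, :] - idx[:, None]).abs() > window
    scores = scores.masked_fill(out_of_window, float("-inf"))
    weights = F.softmax(scores, dim=-1)                  # attention limited to the window
    return weights @ h                                   # context-enriched sentence vectors
```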

Systematically Exploring Redundancy Reduction in Summarizing Long Documents (AACL 2020)

We organize current redundancy reduction methods into categories based on when and how redundancy is considered, and propose three new methods that balance non-redundancy and importance in a general and flexible way.
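As a point of reference for balancing non-redundancy and importance, the sketch below shows a standard MMR-style (Maximal Marginal Relevance) greedy selection loop; it is not one of the three methods proposed in the paper, and the trade-off parameter and similarity function are placeholders.

```python
def mmr_select(importance, similarity, k, lam=0.7):
    """Greedily select k sentences, trading off importance against
    redundancy with already-selected sentences (MMR-style).

    importance[i]    -- model-estimated importance score of sentence i
    similarity[i][j] -- redundancy (e.g. cosine similarity) between sentences i and j
    lam              -- weight on importance vs. non-redundancy
    """
    selected, candidates = [], set(range(len(importance)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            redundancy = max((similarity[i][j] for j in selected), default=0.0)
            return lam * importance[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected
```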

Neural RST-based Evaluation of Discourse Coherence (AACL 2020)

We propose an approach for classifying document coherence on the Grammarly Corpus of Discourse Coherence using silver-standard RST discourse trees.

  • Source Code: GitHub
  • Paper: TBD

MEGA RST Discourse Treebanks with Structure and Nuclearity from Scalable Distant Sentiment Supervision (EMNLP 2020)

We present a novel scalable methodology to automatically generate discourse treebanks using distant supervision from sentiment-annotated datasets, creating MEGA-DT, a new large-scale discourse-annotated corpus.

Towards Domain-Independent Text Structuring Trainable on Large Discourse Treebanks (Findings of EMNLP 2020)

We propose a pretraining approach for learning content structuring/ordering for long-document neural NLG. The results indicate that our approach learns better groupings of the semantically relevant content than the pointer-based baseline.

Do We Really Need That Many Parameters In Transformer For Extractive Summarization? Discourse Can Help ! (CODI workshop at EMNLP 2020)

We incorporate discourse information into the attention module of a neural extractive summarization model, reducing the size of the model while keeping competitive performance.
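One way to picture injecting discourse information into the attention module is to mix the learned attention scores with a fixed affinity matrix derived from the discourse structure, so that fewer learned parameters are needed; the sketch below is only an illustration of that idea, with the mixing rule and names as assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def discourse_biased_attention(q, k, v, discourse_bias, alpha=0.5):
    """Scaled dot-product attention whose scores are mixed with a
    precomputed discourse affinity matrix (e.g. derived from an RST
    tree), so structure supplies part of the attention signal.

    q, k, v        : (seq_len, dim) query/key/value matrices
    discourse_bias : (seq_len, seq_len) affinity between sentence pairs
    alpha          : weight on learned attention vs. discourse structure
    """
    dim = q.shape[-1]
    learned = q @ k.T / dim ** 0.5
    scores = alpha * learned + (1 - alpha) * discourse_bias
    return F.softmax(scores, dim=-1) @ v
```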

Coreference for Discourse Parsing: A Neural Approach (CODI workshop at EMNLP 2020)

We perform experiments on incorporating coreference resolution information into a neural RST discourse parser.


The software listed below is open source, licensed under the GNU General Public License. Note that this is the full GPL, which allows many free uses but does not allow incorporation (even in part or in translation) into any proprietary software that you distribute.

BC3 Corpus

The corpus consists of 40 email threads (3222 sentences) from the W3C corpus. Each thread has been annotated by three different annotators. The annotation consists of the following:

  • Extractive Summaries
  • Abstractive Summaries with linked sentences
  • Sentences labeled with Speech Acts and Subjectivity
Recently, topics have also been annotated for this email corpus, as well as for 20 blog conversations.
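For concreteness, a hypothetical in-memory representation of one annotated BC3 thread might look like the sketch below; the class and field names are illustrative and do not reflect the corpus's actual distribution format.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical view of one BC3 thread; names are illustrative only.

@dataclass
class Sentence:
    sent_id: str
    text: str
    speech_acts: List[str] = field(default_factory=list)   # per-annotator labels
    subjective: List[bool] = field(default_factory=list)

@dataclass
class ThreadAnnotation:
    annotator: str
    extractive_summary: List[str]           # ids of selected sentences
    abstractive_summary: List[str]          # abstract sentences
    # maps each abstract sentence index to the thread sentences it covers
    sentence_links: Dict[int, List[str]] = field(default_factory=dict)

@dataclass
class Thread:
    thread_id: str
    sentences: List[Sentence]
    annotations: List[ThreadAnnotation]     # three annotators per thread
```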

Topic Segmentation and Labelling

More information on Topic Segmentation and Labelling.

Discourse Parser

This parser builds a discourse tree by applying an optimal parsing algorithm to probabilities inferred from two Conditional Random Fields: one for intra-sentential parsing and the other for multi-sentential parsing.
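To make the notion of an optimal parsing algorithm concrete, the sketch below shows a CKY-style dynamic program that combines span probabilities (such as those produced by the CRFs) into a globally best tree; the scoring interface is a simplified placeholder, not the parser's actual code.

```python
def best_tree(n, span_log_prob):
    """CKY-style search for the highest-probability discourse tree over
    n elementary discourse units.

    span_log_prob(i, k, j) -- log-probability (e.g. from a CRF) of a node
    covering units i..j-1 that splits into children [i, k) and [k, j),
    including its relation/nuclearity decision.
    """
    best = {(i, i + 1): (0.0, None) for i in range(n)}   # single EDUs
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            best[(i, j)] = max(
                (best[(i, k)][0] + best[(k, j)][0] + span_log_prob(i, k, j), k)
                for k in range(i + 1, j)
            )
    return best   # backtrack from the stored splits at (0, n) to recover the tree
```

In the parser itself, the intra-sentential and multi-sentential CRFs supply these probabilities at the sentence and document levels, respectively.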

Document-level Discourse Parser:  Demo  
For the latest version of the Discourse Parser see QCRI's Source Code Page  
More information on Discourse Parser.

Abstractive Meeting Summarization

An automatic abstractive summarization system for meeting conversations.

More information on abstractive meeting summarization: Paper published in INLG 2014, Master's thesis

ConVis: Visual text analytic system for asynchronous conversations

More information about this project.