Multi-document Summarization: The What, Why, and How

While reading the news, you are likely to encounter several articles that describe the same event or incident; each of these articles comes from a different news outlet and provides a different viewpoint on the event. However, most of us do not have the time to read all of these articles to get a fully unbiased view of current events, so we usually select the outlet that best suits our ideology or agrees with our preconceived opinions, leading to a biased view of events.

Here at Almeta, we believe in the importance of discussing different viewpoints on any argument. Imagine having a service that can identify the various articles that describe the same event and then combine them into a short, condensed, and informative summary that covers all the details given by the different outlets.

We have already described the details of our event detection system in a different series. In this article, we look closer at the second part, namely combining various documents into one informative summary.

What is Multi-document Summarization?

Summarization methods can be categorized into single-document or multi-document approaches. In contrast to the single-document setting, the multi-document setup can exploit the fact that in some domains, like news, different sources describe the same event, so the articles share a lot of content. In such systems, the input is two or more pieces of text, and the output should be a summary that covers the details of all the inputs.

Furthermore, summarization approaches can be classified into two other classes based on the way the summary is constructed from the input text. Extractive methods rely solely on the words of the input, e.g. extracting whole sentences from it, and do not produce any new sentences. In abstractive approaches, on the other hand, the output of the system is rarely bound to any constraints, and these systems can generate any text that constitutes a valid summary of the input. Abstractive systems have gained a lot of traction recently due to advances in Deep Learning.

Data Sources

In order to build any machine learning system, the main concern is having a large and diverse enough dataset for the model to learn from. In the case of multi-document summarization, each data point is composed of a set of two or more articles alongside a summary of them.

Available Datasets

In our previous article, we explored the available datasets for summarization in both Arabic and English. All of these datasets are built for single-document summarization, with the exception of DUC2004; this dataset includes only 100 samples and is designed mainly for evaluation. Other datasets were introduced through the MultiLing conference, but they too suffer from data sparsity.

On the contrary, there are several English multi-document summarization datasets, including DUC (2001-2007), TAC (2008-2011), and MultiLing. Nonetheless, even in the case of English, all of these datasets are extremely small, with the largest being DUC2004. Last year, researchers at Google Brain produced the first large-scale multi-document summarization dataset for English, basically by using Wikipedia [1], followed by a similar effort on Wikinews [2] (more on this in the next section). However, the publicly available portion is restricted by the authors to around 10k samples out of the over one million data points collected in the original dataset. And earlier this year, researchers at Yale [3] created another large-scale dataset, mostly by relying on a news site; this 50K-sample dataset is available for download from GitHub.

How Can We Collect More Data

Although there is a limited number of publicly available datasets to train a multi-document summarizer, there are several methods reported in the literature to build such a dataset in an automatic or semi-automatic manner. While most of these methods are applied to English, we focus in this part on methods that can be extended to relatively under-resourced languages, especially Arabic.

Summarizing Wikipedia

Several people have used Wikipedia in one way or another to generate data for summarization. In [1], the Google Brain team uses the whole English Wikipedia as a multi-document summarization dataset. This application is extremely complicated.

However, we believe that the following approach is feasible for a relatively small team with limited resources. The authors of [2] used Wikinews to generate a large multi-document summarization dataset. The idea is that each Wikinews article can be treated as the summary, while the sources it cites serve as the input documents.


Pros:

  • While treating all of Wikipedia as a summarization dataset is hard in under-resourced languages, mainly because many of the sources are usually written in English or other Western languages, in the case of Wikinews most of the sources come from news outlets writing in the native language.
  • The same approach can be applied to many languages.


Cons:

  • Cleaning data coming from multiple outlets would usually require a language-detection module.
  • The size of the data is rather limited, at around 52k articles.


How to Implement a Multi-document Summarizer

In this section, we will highlight the main methods to implement a multi-document summarizer.

Extractive Methods

Most of these algorithms are unsupervised and thus do not require any training data. In these approaches, the sentences of all the documents are extracted and scored using certain metrics, then the highest-scoring sentences are selected. The difference between these algorithms mainly lies in the scoring methods.


One of the oldest of these algorithms is LexRank. LexRank is very similar to other graph-based summarizers like TextRank and DivRank, in that they “recommend” other similar sentences to the reader. Thus, if one sentence is very similar to many others, it will likely be a sentence of great importance. The importance of this sentence also stems from the importance of the sentences “recommending” it. Thus, to get ranked highly and placed in a summary, a sentence must be similar to many sentences that are in turn also similar to many other sentences. This makes intuitive sense and allows the algorithms to be applied to any arbitrary new text.

To implement this, a graph of all the sentences in all the documents is generated, where the edge weights between sentences represent their lexical/semantic similarity; the LexRank algorithm is then used to rank the sentences.
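To make this concrete, here is a minimal toy sketch of the idea (our own illustrative code, not a production LexRank implementation): sentences are embedded as TF-IDF vectors, an edge is drawn between sentences whose cosine similarity exceeds a threshold, and a PageRank-style power iteration scores them. All function names and parameter values here are our own choices.

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    # Bag-of-words TF-IDF with a smoothed IDF so no vector collapses to zero.
    docs = [s.lower().split() for s in sentences]
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))
    return [{w: tf * (math.log(n / df[w]) + 1.0) for w, tf in Counter(d).items()}
            for d in docs]

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def lexrank(sentences, threshold=0.1, damping=0.85, iters=50):
    vecs = tfidf_vectors(sentences)
    n = len(sentences)
    # Unweighted edge between sentences that are similar enough.
    adj = [[1.0 if i != j and cosine(vecs[i], vecs[j]) > threshold else 0.0
            for j in range(n)] for i in range(n)]
    scores = [1.0 / n] * n
    for _ in range(iters):
        # Standard PageRank power iteration over the similarity graph.
        scores = [(1 - damping) / n + damping * sum(
            scores[j] * adj[j][i] / max(sum(adj[j]), 1.0) for j in range(n))
            for i in range(n)]
    return scores
```

The top-scoring sentences are then taken as the extractive summary; a real implementation would also handle redundancy among the selected sentences.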

For a very good overview of the differences between DivRank, TextRank, and LexRank we recommend this awesome review.

Finally, there were also a few attempts at using deep learning for performing sentence ranking for MDS. The system in [4] employs a Graph Convolutional Network (GCN) on the sentence relation graphs, with sentence embeddings obtained from Recurrent Neural Networks as input node features. Through multiple layer-wise propagations, the GCN generates high-level hidden sentence features for saliency estimation. A greedy heuristic is then applied to extract salient sentences while avoiding redundancy.

Reconstruction Based

Sparse reconstruction methods [5]–[7] employ the idea of data reconstruction in the summarization task: summarization is treated as a data reconstruction problem. Intuitively, a good summary should be able to recover, or in other words reconstruct, the whole set of documents. Thus the summary should meet three key requirements: Coverage, Sparsity, and Diversity.

  • Coverage: the extracted summary covers every aspect of all the documents. This is estimated by trying to reconstruct every document sentence as a weighted average of the summary sentences.
  • Sparsity: the summary sentences should describe different topics.
  • Diversity: the summary sentences should not be correlated and should be as distinctive as possible.

The overall structure is shown in the following graph:

To represent this in a mathematically feasible manner, the main outline of these approaches goes as follows:

  • represent every sentence as a vector, e.g. using TF-IDF or Doc2Vec
  • assuming a summary set of sentences S, define the following metrics:

Coverage metric: how well every original sentence representation (vector) can be reconstructed as a non-negative weighted average of the summary sentence vectors. This is basically an MSE error of the form:

E_cov = \sum_i \| S_i - \sum_{j=1}^{k} a_{ij} s^*_j \|^2, \quad a_{ij} \ge 0

where S_i is a sentence from the original documents, s^*_j is a sentence from the summary, k is the number of sentences in the summary, and a_{ij} is the weight of the j-th summary sentence in reconstructing the i-th original sentence.

Sparsity metric: the previous step results in a matrix of weights a_{ij}. We want every original sentence to be represented by only a small number of summary sentences, so sparsity can be achieved by imposing a sparsity constraint on the weight matrix, for example an L1 norm:

E_sp = \sum_{i,j} |a_{ij}|

Diversity metric: the final metric tries to minimize the correlation between the sentences in the summary. The correlation between two summary sentences s^*_i and s^*_j can be calculated as:

corr(s^*_i, s^*_j) = (s^*_i - \bar{s})^T (s^*_j - \bar{s})

where \bar{s} is the average vector of the summary sentences.

Based on these steps, the overall error function is a weighted combination of the three terms:

E = \sum_i \| S_i - \sum_{j=1}^{k} a_{ij} s^*_j \|^2 + \lambda_1 \sum_{i,j} |a_{ij}| + \lambda_2 \sum_{i \ne j} corr(s^*_i, s^*_j)

where \lambda_1 and \lambda_2 trade off sparsity and diversity against coverage.

The last step is to find the summary set of sentences that minimizes this error, using a stochastic search algorithm like simulated annealing or a genetic algorithm.
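As a rough illustration of how such an error function can be evaluated for a candidate summary (a sketch under simplifying assumptions: the non-negativity constraint on the weights is dropped so plain least squares can be used, and the penalty weights are arbitrary values of our own choosing):

```python
import numpy as np

def reconstruction_error(doc_vecs, summ_vecs, lam_sparse=0.1, lam_div=0.1):
    """doc_vecs: (m, d) original sentence vectors; summ_vecs: (k, d) summary vectors."""
    # Coverage: least-squares weights reconstructing each original sentence
    # from the summary sentences (non-negativity omitted for simplicity).
    A, *_ = np.linalg.lstsq(summ_vecs.T, doc_vecs.T, rcond=None)  # shape (k, m)
    coverage = np.mean((doc_vecs - A.T @ summ_vecs) ** 2)
    # Sparsity: L1 penalty on the reconstruction weights.
    sparsity = np.abs(A).sum()
    # Diversity: penalize correlation between centered summary sentences.
    centered = summ_vecs - summ_vecs.mean(axis=0)
    corr = centered @ centered.T
    diversity = np.abs(corr - np.diag(np.diag(corr))).sum()
    return coverage + lam_sparse * sparsity + lam_div * diversity
```

A stochastic search (e.g. simulated annealing) would then propose different subsets of sentences as `summ_vecs` and keep the subset with the lowest error.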

The different implementations of this approach differ in the choice of:

  • Sentence representations TF-IDF, Doc2Vec, neural representation using auto-encoders, …
  • Metrics definition
  • Optimization algorithms


Topic Modeling Methods

This is not to be confused with query-based summarization (also sometimes called topic-based, because linguists are great at naming). In query-based methods, the system accepts a set of documents and a query and should construct a summary most relevant to that query.

The algorithms used in these approaches are similar. The general structure of these algorithms goes like this:

  1. Find the topics/events described in the documents. The number of topics to consider is usually based on the desired summary length. The topic modeling is usually carried out using LDA or similar algorithms in an unsupervised manner.
  2. Rank each of the sentences based on their similarity to the topics. The usual way of doing this is to construct a query from each topic by concatenating its most prominent words; each sentence in the documents is then ranked using IR metrics like IDF or TF-IDF, although other similarity metrics based on manifolds or document embeddings have also been suggested.
  3. After selecting a specific number of sentences for each topic, the final summary is usually constructed using the Maximal Marginal Relevance (MMR) algorithm [8]. MMR relies on a greedy approach: it chooses sentences that are similar enough to the topic (query) and at the same time dissimilar enough from the already selected sentences. The goal is to select the sentences that are the most relevant to the topics and the most novel at the same time.
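The MMR selection step above can be sketched as follows (an illustrative toy version; the similarity function and the lambda trade-off value are placeholders you would replace with, e.g., TF-IDF cosine similarity):

```python
def mmr_select(candidates, query, sim, k=3, lam=0.7):
    """Greedily pick k sentences, balancing query relevance against novelty.

    sim(a, b) is any similarity function returning a float in [0, 1].
    """
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def mmr_score(s):
            relevance = sim(s, query)
            # Redundancy: similarity to the closest already-selected sentence.
            redundancy = max((sim(s, t) for t in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(pool, key=mmr_score)
        selected.append(best)
        pool.remove(best)
    return selected

def jaccard(a, b):
    # Toy word-overlap similarity, standing in for TF-IDF cosine.
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(len(wa | wb), 1)
```

For example, `mmr_select(sentences, topic_query, jaccard, k=2)` first picks the sentence most similar to the query, then a sentence that is both relevant and different from the first pick.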


Pros:

  • Relatively simple to understand and implement, with several ready-made implementations.
  • Language-agnostic, although language-dependent text pre-processing can improve the results.
  • Unsupervised, and thus require no training data.
  • Can give relatively decent results in practice, depending on the domain.
  • Relatively very fast (with the exception of the reconstruction-based method).


Cons:

  • These algorithms work on the sentence level and thus have no information about the sentence context; this can lead to discontinuities in the resulting summary, which will look more like a bullet-point list than a coherent paragraph.
  • Some documents may effectively be ignored, e.g. when they are too short or when there are too many input documents.
  • No abstraction or entailment of ideas.

Intermediate Methods

These methods serve as a natural extension of, and usually a second processing step after, an extractive summarizer. In this step the system applies further processing to the selected sentences.

Sentence Compression

Compression is achieved by removing unimportant or redundant words/phrases from the selected sentences, making them more concise and general. This task is also called sentence compression, and you can learn more about it through our previous post on automatic paraphrasing in NLP.

Summary Revision

This step is also deployed as a post-processing step after an extractive summarizer. Basically, it tries to improve the quality of the summary by rewriting noun phrases and resolving co-references; see for example [9]. Furthermore, our previous discussion of text paraphrasing can also be of help here.


Pros:

  • There are several possible ways to implement this task, in a supervised or unsupervised manner, and in a language-dependent or language-agnostic way.
  • Relatively simple to understand and implement, with several ready-made implementations.


Cons:

  • These algorithms also work on the sentence level and have no information about the sentence context.
  • Possibility of error propagation, since these systems are usually implemented as a second step after extractive summarizers. Also, no abstraction or entailment of ideas.
  • The relative improvement over simple extractive methods is limited when measured using automatic metrics like ROUGE; however, the authors claim improvements in human evaluations.

Abstractive Methods

Unlike the previous methods, abstraction-based methods can generate new sentences whose fragments come from different source sentences, using sentence aggregation and fusion. Here are some of the main approaches to the task.

Sentence Fusion

Several publications have addressed this issue. As mentioned above, the goal here is to find similar sentences in a piece of text and then merge them to create a shorter version of the text. This approach is used in QuillBot; check it out to get an idea of the performance of these systems. The main structure of the algorithm is shown in the following figure.

The main structure [10], [11] follows this approach:

  1. First, split the documents into sentences and cluster them. One approach, followed by [11], identifies the most important document in the multi-document set; the sentences of the most important document are then aligned to sentences in the other documents to generate clusters of similar sentences.
  2. Second, generate the K-shortest paths through the sentences of each cluster using a word-graph structure, producing candidate summary sentences. See the following figure for an illustration.
  3. Finally, rank these candidates based on informativeness, etc., and combine them to generate the summary. For example, integer linear programming (ILP) can be used to select the summary sentences.
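A toy sketch of the word-graph step described above (our own simplified code: nodes are plain words, so repeated words collapse into one node, which is exactly the multi-word-expression weakness of this approach; real systems tag nodes with part-of-speech and position information, and extract K paths rather than one):

```python
import heapq
from collections import defaultdict

def fuse(cluster):
    """Fuse a cluster of similar sentences via a shortest path in a word graph."""
    # Count word-to-word transitions across all sentences in the cluster.
    counts = defaultdict(int)
    for sent in cluster:
        words = ["<s>"] + sent.lower().split() + ["</s>"]
        for a, b in zip(words, words[1:]):
            counts[(a, b)] += 1
    # Directed graph; transitions shared by many sentences are cheaper.
    graph = defaultdict(list)
    for (a, b), c in counts.items():
        graph[a].append((b, 1.0 / c))
    # Dijkstra from <s> to </s>.
    heap = [(0.0, "<s>", ["<s>"])]
    seen = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == "</s>":
            return " ".join(path[1:-1])  # drop the boundary markers
        if node in seen:
            continue
        seen.add(node)
        for nxt, w in graph[node]:
            if nxt not in seen:
                heapq.heappush(heap, (cost + w, nxt, path + [nxt]))
    return ""
```

Because edges shared by several sentences are cheap, the returned path tends to follow the wording the cluster agrees on.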


Pros:

  • Unsupervised and language-agnostic.
  • Can generate abstractions of the original sentences.
  • Relatively simple to implement.


Cons:

  • The generated abstractions are not extremely novel, since they are generated from the words of the original text.
  • The shortest-path approach has issues when dealing with multi-word expressions.
  • The quality of the model depends on the clustering quality.

Deep Learning Methods

These systems basically treat the task as a many-to-one machine translation problem, where the input is the set of original documents and the target is the summary. The models are trained in a supervised manner and thus require a sizable amount of data to generalize, which can be collected using approaches like the Wikipedia summarization method outlined above.


Pros:

  • Truly abstractive, and can yield better summaries.


Cons:

  • Need a large parallel training dataset.
  • Overfitting issues when working with different genres or domains; this can be extremely problematic, since the language model component of these systems can start generating word salad.

Did you know that we use all this and other AI technologies in our app? Take a look at what you're reading now applied in action. Try our Almeta News app. You can download it from Google Play or Apple's App Store.


References

[1] P. J. Liu et al., "Generating Wikipedia by summarizing long sequences," arXiv preprint arXiv:1801.10198, 2018.

[2] J. Zhang and X. Wan, “Towards Automatic Construction of News Overview Articles by News Synthesis,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 2111–2116.

[3] A. R. Fabbri, I. Li, T. She, S. Li, and D. R. Radev, "Multi-News: a Large-Scale Multi-Document Summarization Dataset and Abstractive Hierarchical Model," arXiv preprint arXiv:1906.01749, 2019.

[4] M. Yasunaga, R. Zhang, K. Meelu, A. Pareek, K. Srinivasan, and D. Radev, "Graph-based neural multi-document summarization," arXiv preprint arXiv:1706.06681, 2017.

[5] H. Liu, H. Yu, and Z.-H. Deng, “Multi-document summarization based on two-level sparse representation model,” in Twenty-ninth AAAI conference on artificial intelligence, 2015.

[6] J. Yao, X. Wan, and J. Xiao, “Compressive document summarization via sparse optimization,” in Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.

[7] S. Ma, Z.-H. Deng, and Y. Yang, “An unsupervised multi-document summarization framework based on neural document model,” in Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, 2016, pp. 1514–1523.

[8] J. G. Carbonell and J. Goldstein, “The use of MMR, diversity-based reranking for reordering documents and producing summaries.,” in SIGIR, 1998, vol. 98, pp. 335–336.

[9] A. Nenkova, “Entity-driven rewrite for multi-document summarization,” 2008.

[10] M. T. Nayeem, T. A. Fuad, and Y. Chali, “Abstractive unsupervised multi-document summarization using paraphrastic sentence fusion,” in Proceedings of the 27th International Conference on Computational Linguistics, 2018, pp. 1191–1204.

[11] S. Banerjee, P. Mitra, and K. Sugiyama, “Multi-document abstractive summarization using ilp based multi-sentence compression,” in Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
