Measuring informativeness through summaries

Can you measure a text's informativeness using its summary?

In a previous article (see the next paragraph) we explored how to approximate an article's informativeness in a supervised fashion. Such a method requires training data, and in this article we will explore one way to obtain this data, and a very unusual way at that.

This article is part of our research on informativeness, and we have a lot of it. If you are interested in more details, you can check each individual piece, or review the gist of it.

What is the goal?

In this task, the goal is to assign a given piece of text a tag (or number) representing the level of informativeness or detail it holds, usually by training a model to do so.

Here we rely on the intuitive idea suggested in [1], which states that in a news article the lead paragraph can usually serve as a good summary, especially if the author follows the inverted pyramid style. A good lead paragraph looks something like this:

“The European Union’s chief trade negotiator, Peter Mandelson, urged the United States on Monday to reduce subsidies to its farmers and to address unsolved issues on the trade in services to avert a breakdown in global trade talks. Ahead of a meeting with President Bush on Tuesday, Mr. Mandelson said the latest round of trade talks, begun in Doha, Qatar, in 2001, are at a crucial stage. He warned of a “serious potential breakdown” if rapid progress is not made in the coming months.”

However, this generalization does not apply to all authors, as some use a more creative approach, e.g.:

“ART consists of limitation,” G. K. Chesterton said. “The most beautiful part of every picture is the frame.” Well put, though the buyer of the latest multi-million-dollar Picasso may not agree.

But there are pictures – whether sketches on paper or oils on canvas – that may look like nothing but scratch marks or listless piles of paint when you bring them home from the auction house or dealer. But with the addition of the perfect frame, these works of art may glow or gleam or rustle or whatever their makers intended them to do.

It is clear that the first lead paragraph is more informative than the second. The authors of [1] relied on the idea that leads like the first example constitute a more plausible summary of the article than the second, and thus suggested that a corpus for informativeness prediction can be generated automatically from a corpus of human-summarized news articles: simply compare the human-generated summary with the lead paragraph and tag the lead paragraph as (informative/creative) based on this similarity.
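The corpus-generation recipe above can be sketched in a few lines. Note that the 0.5 threshold and the Jaccard word-overlap similarity used here are illustrative placeholders only, not the metric the article settles on:

```python
def tag_lead(lead: str, summary: str, similarity, threshold: float = 0.5) -> str:
    """Tag a lead paragraph by how closely it matches the human summary."""
    return "informative" if similarity(lead, summary) >= threshold else "creative"


# Toy similarity: Jaccard word overlap, just to make the sketch runnable.
def jaccard(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0
```

Running this over every (lead, summary) pair in a summarization corpus yields the automatically labeled training data described above.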

What Similarity Metric to Use?

We decided to go with a simple similarity metric for our dataset selection process: Jaro-Winkler similarity, a normalized text similarity function similar to edit distance. This choice is based on two factors:

  • While, ideally, we would use more informative distances, like the cosine similarity between the Doc2Vec representations of the summary and the lead paragraph, this simpler metric represents a stronger constraint on similarity and is acceptable since the summaries we are using are extractive.
  • On the other hand, it is a normalized metric, which accounts for variation in summary and article length.

The metric's value ranges between 0 and 1, where 1 represents a total match and 0 represents no match at all.
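To make the metric concrete, here is a minimal from-scratch sketch of the standard Jaro-Winkler formulation (prefix weight 0.1, shared prefix capped at 4 characters); in practice a library implementation would likely be preferable:

```python
def jaro(s1: str, s2: str) -> float:
    """Plain Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # characters count as matching if they sit within this window of each other
    window = max(len1, len2) // 2 - 1
    used1, used2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not used2[j] and s2[j] == c:
                used1[i] = used2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # transpositions: mismatched positions between the two matched sequences
    m1 = [c for c, u in zip(s1, used1) if u]
    m2 = [c for c, u in zip(s2, used2) if u]
    transpositions = sum(a != b for a, b in zip(m1, m2)) / 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro-Winkler: boosts the Jaro score for a shared prefix (max 4 chars)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1.0 - j)
```

The classic textbook pair "MARTHA"/"MARHTA" scores about 0.961 under this formulation, while two strings with no common characters score 0.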

Available Summarization datasets:


  • The wiki-how dataset was built from the wikiHow website. It is a very large extractive-summaries dataset; the English version contains around 500k articles. It is built by aggregating the title sentences of the steps of a method to create an overall summary of that method. Since the site is multilingual, the same approach can be applied to the Arabic version of the site (or any other language, for that matter).
  • DUC2004 is an instance of the Document Understanding Conference datasets for extractive summarization. This version includes noisy machine-translated Arabic documents alongside their summaries. The DUC datasets are thoroughly studied in the literature and are based on newswires. However, because the Arabic documents are machine translated, they are relatively very noisy; furthermore, acquiring these datasets requires a formal request and several bureaucratic forms to be filled.
  • RTLTDS is a large collection of Iranian scientific publications along with their abstracts, keywords, and authors.
  • MULTILING2011 is a small dataset of abstractive summaries based on WikiNews, where events are aggregated and summarized. It is composed of 10 events, each covering 10 short articles. The data is manually translated and summarized in 7 languages (Arabic, Czech, English, French, Hebrew, Hindi, Greek) with a limit of 250 words for each event summary. Overall, the data is small and the quality is OK, with some minor problems such as spelling errors. More importantly, the dataset contains multiple references, enabling the use of metrics such as ROUGE.
  • MULTILING2013 is an updated version of MULTILING2011. Here the sources are Wikipedia entries rather than events, and the summaries are extractive. The dataset covers 40 languages instead of 7, and for each language there are 30 articles with a single manual source summary. The quality of the summaries is relatively high; however, the original articles have scraping issues and need near-manual cleaning.
  • MULTILING2019Task1 is another dataset from MultiLing that focuses on financial documents. Although it is very large, the quality is questionable, since the documents are OCRed from PDF files and there are a lot of issues with the source text.
  • EASC is a small extractive dataset. It consists of 153 short articles extracted from Wikipedia and the Alwatan and Alrai newspapers, each with 5 manual reference summaries named A to E. Although most of the articles come from Wikipedia, nearly all of them adhere to the inverted pyramid scheme, so it is possible to use them to generate informativeness tags. However, the quality of the reference summaries varies, since they were generated using MTurk. After inspection, the main issue we found with this dataset is inconsistency in both article length and summary length, mainly because no restriction on summary length was imposed on the annotators. The articles have an average length of 383 words, but the standard deviation of 180 is relatively large. The same can be said of the reference summaries, whose average length is around 130 words across the 5 references, but with a standard deviation that can reach 88 and an average compression rate of 0.32. Overall, the most consistent annotator seems to be E, with the lowest standard deviation of 74. If we consider only the summaries shorter than 350 words, nearly all the references have the same standard deviation of 70, with the exception of reference E, which still has the lowest at 65. However, manual inspection of the summaries reveals no significant difference (except in individual cases) between the annotators' behavior, and thus any of them can be used. Furthermore, to assess the usability of this dataset for informativeness annotation, we used the first 4 sentences of every article as the lead paragraph and then calculated the distance between the lead paragraph and each of the reference summaries using the similarity metric.
The choice of lead size was made empirically using the elbow method, by observing the change in the overall standard deviation of the similarity (averaged across the 5 references) for lead sizes between 1 and 10 sentences. We found that 4 sentences gave the highest reduction in overall standard deviation without impacting the similarity value. The following is the histogram of overall similarity.
  • Kalimat is a very large (20K documents) multi-purpose dataset. The original documents are scraped from several newspaper websites, and it covers several tasks, including morphological analysis, single- and multi-document summarization, NER, and topic. The summaries here are extractive summaries generated automatically, and thus of lower quality than in other datasets. We conducted a manual inspection of the sports part of the dataset, and the following are our observations:
    • The length of the generated summaries is relatively consistent across all the articles, with a mean of 174 words and a small standard deviation of 23 words; compare this with the massive variance of 235 found in the article lengths (the following figure shows the histogram of summary and article lengths). This means there is a great loss of information, especially for longer articles.
    • There is a large number of very short articles for which the summary is basically a permuted version of the original text; in the sports genre, out of 4100 articles, nearly 1400 have the same length as their summaries.
    • While the generated summaries have low fluency, the main reason for this low quality is not the omission of meaningful sentences from the article but rather the erroneous re-ordering of the sentences by the summarizer; this phenomenon appears clearly in short documents. While this issue is critical for the task of automatic summarization, it should be possible to use this dataset for informativeness if an order-agnostic distance measure is used, like the one we are using. The following figure shows the distribution of the similarity between the lead paragraph (again 4 sentences) and the summary.
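The lead-size selection described for EASC above can be sketched as follows. `articles` is assumed to be a list of (sentence list, reference summaries) pairs, and `sim` is any normalized similarity function; both are placeholders standing in for the actual corpus and metric:

```python
import statistics


def lead_size_stds(articles, sim, max_lead: int = 10):
    """For each candidate lead size k, compute the corpus-wide std of the
    similarity between the k-sentence lead and the reference summaries
    (averaged across references). The elbow in these stds picks k."""
    stds = {}
    for k in range(1, max_lead + 1):
        sims = []
        for sentences, references in articles:
            lead = " ".join(sentences[:k])
            # average similarity across the reference summaries
            sims.append(sum(sim(lead, r) for r in references) / len(references))
        stds[k] = statistics.pstdev(sims)
    return stds  # inspect for the elbow, e.g. k = 4 in our experiments
```

Plotting the returned stds against k and looking for the point where the drop flattens reproduces the elbow-method selection described above.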


For a fuller picture, view this awesome list; however, here are some honorable mentions that are true eye-openers:

  • TLDR 2017 is a massive 3-million-document dataset collected from the Reddit site using the TL;DR tag in posts; there is a whole community around TL;DR, see [2].

Which dataset to use?

| Dataset | Genre | Pros | Cons |
| --- | --- | --- | --- |
| wiki-how | Tips and life hacks | multi-lingual; relatively large | summaries are too short; this dataset can't be used for informativeness detection |
| DUC2004 | Newswires | there is previous literature on it; based on newswires and thus usable for informativeness | relatively small; based on machine translation and thus very noisy; requires form filling and no less than 7 working days for a response |
| RTLTDS | Scientific publications | | |
| MULTILING2011 | News articles about events | multi-lingual; multiple reference summaries; quality is OK | very small: 10 events per language, each segmented into 10 very short articles; not trivial to use for informativeness, since the event write-ups represent a chronological order of facts rather than an inverted pyramid scheme |
| MULTILING2013 | Various Wikipedia entries, ranging from cities to characters to sites | multi-lingual; quality is high | very small: 30 articles per language; not trivial to use for informativeness, since Wikipedia entries do not necessarily follow the inverted pyramid scheme |
| MULTILING2019Task1 | Financial narrative disclosures | multi-lingual; relatively large | questionable quality, since the documents are OCRed from PDF files and the source text has many issues; the text domain is very narrow; not applicable to informativeness |
| EASC | Wikipedia entries and news stories | high quality; multiple references | too small (only 153 articles); nearly all of the lead paragraphs have the same distance (further investigation using other similarity metrics is needed) |
| Kalimat | Newswires | very large; single- and multi-document | questionable quality, since the summaries are automatically generated |

Did you know that we use all this and other AI technologies in our app? See what you're reading now applied in action: try our Almeta News app. You can download it from Google Play or Apple's App Store.


[1] Y. Yang and A. Nenkova, “Detecting information-dense texts in multiple news domains,” in Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.

[2] M. Völske, M. Potthast, S. Syed, and B. Stein, “Tl; dr: Mining reddit to learn automatic summarization,” in Proceedings of the Workshop on New Frontiers in Summarization, 2017, pp. 59–63.
