In the words of Daily Show host Trevor Noah, there is currently “So much news, so little time”. In fact, the problem of information explosion extends beyond the realm of news and covers every aspect of our lives.
The increasing amount of text data in the digital age calls for methods that reduce reading time while preserving information content. Summarization achieves this by deleting, generalizing, or paraphrasing fragments of the input text to create a more concise version.
In our previous article, we examined the task of automatic summarization with particular emphasis on the different ways you can implement abstractive summarization for an under-resourced language.
However, in any machine learning system, the main component is the dataset. In the case of abstractive summarization, this dataset must contain articles alongside a shortened excerpt of the article.
Furthermore, the best-performing models for abstractive summarization are neural seq2seq models. Such models are extremely data-hungry, and you will usually need a dataset of hundreds of thousands or even millions of articles for the model to generalize properly.
In this article, we will explore 5 different ways you can build such a large dataset for abstractive summarization in an automatic or semi-automatic manner.
Headline Generation
One of the earliest approaches to abstractive summarization is headline generation. In this task, the model accepts the article as input and produces the headline as its output.
The largest available dataset for abstractive summarization, not only in English but in any language, is the Gigaword dataset. This dataset of 9 million articles contains news from various outlets.
This dataset is available in various languages, including Arabic, and you can download it from the LDC. However, if you are 3K dollars short, your other option is to scrape this data yourself. The process is usually not that hard, and the availability of crawl dumps from projects like CommonCrawl greatly simplifies the task.
Nonetheless, it is extremely important to filter the titles, since some of them are designed to attract clicks and are purposefully non-informative. In  the authors provide a detailed way to filter the articles in order to limit the effects of click-bait titles; a similar approach is described in  as well.
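The kind of title filtering described above can be sketched in a few lines of Python. This is only a minimal illustration under our own assumptions, not the procedure from the cited papers: the cue phrases, the length bounds, and the `min_overlap` threshold are all hypothetical values that would need tuning per language and outlet.

```python
import re

# Hypothetical cue phrases; a real list would be curated per language and outlet.
CLICKBAIT_CUES = ("you won't believe", "what happened next", "this is why")


def tokenize(text):
    """Lowercase and split on non-word characters (works for any Unicode script)."""
    return [t for t in re.split(r"\W+", text.lower()) if t]


def keep_headline(title, article, min_words=4, max_words=25, min_overlap=0.5):
    """Return True if the title looks informative enough to serve as a target."""
    title_tokens = tokenize(title)
    # Very short or very long titles are rarely useful summaries.
    if not (min_words <= len(title_tokens) <= max_words):
        return False
    # Questions (Latin or Arabic question mark) and known cue phrases are dropped.
    if title.strip().endswith(("?", "؟")):
        return False
    if any(cue in title.lower() for cue in CLICKBAIT_CUES):
        return False
    # An informative title should share most of its words with the article body.
    article_vocab = set(tokenize(article))
    overlap = sum(t in article_vocab for t in title_tokens) / len(title_tokens)
    return overlap >= min_overlap
```

Only the pairs for which `keep_headline(title, article)` returns `True` would be kept as training examples. A lexical-overlap check like this is crude, but it is cheap enough to run over millions of scraped articles.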
Pros and Cons
- The main advantage of this method is its simplicity, whether you are collecting the data yourself or using an already annotated dataset like Gigaword.
- While a title generator might be of little use to readers, such a system can be a great asset for authors and content creators.
- While the output size is usually limited to the typical length of a title, it is possible (up to a certain length) to increase the output size in a seq2seq model .
- A headline generator is not suitable for multi-sentence summarization (such as the service provided by our system at Almeta), mainly because titles usually have a different nature than summaries or excerpts. Thus, even if we increased the output to summary length following methods like , the performance of the resulting model cannot be guaranteed.
- Using this method entails pre-processing steps to filter out clickbait titles.
News Articles Summarization
Many news outlets add a small excerpt or summary to each article. These summaries vary from short subtitles to full multi-sentence summaries.
Furthermore, while in some cases the summary constitutes a part of the article (usually the first paragraph) and thus can only support extractive summarization, in other cases, the summaries are independent entities that can be used in training an abstractive summarizer.
In  the authors compiled the largest summarization dataset for the Czech language by scraping the domains of 5 news outlets, generating around 2.5 million examples in a fully automatic manner.
However, in order to make sure the resulting summaries are informative it is important to apply several filtering steps to remove spammy articles.
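One simple way to tell whether an outlet's summary is merely a copied part of the article or an independent text is to measure how many of its n-grams are new. The sketch below is an illustration under our own assumptions: the use of bigrams and the `threshold` of 0.2 are arbitrary choices, not values taken from the cited work.

```python
import re


def tokenize(text):
    """Lowercase and split on non-word characters."""
    return [t for t in re.split(r"\W+", text.lower()) if t]


def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def novel_ngram_ratio(summary, article, n=2):
    """Fraction of summary n-grams that never occur in the article."""
    summary_ngrams = ngrams(tokenize(summary), n)
    if not summary_ngrams:
        return 0.0
    article_ngrams = ngrams(tokenize(article), n)
    return len(summary_ngrams - article_ngrams) / len(summary_ngrams)


def classify_summary(summary, article, threshold=0.2):
    """Label a pair 'extractive' when almost all summary bigrams are copied."""
    ratio = novel_ngram_ratio(summary, article)
    return "abstractive" if ratio > threshold else "extractive"
```

Running `classify_summary` over a sample of pages from one outlet gives a quick indication of whether that outlet is worth scraping for abstractive training data.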
In the case of Arabic, we reviewed the 60 currently most popular Arabic news outlets. The following table shows the outlets that include either abstractive or extractive summaries of their articles.
| Outlet | Summary type | Notes |
|---|---|---|
| http://www.akhbarana.com/ | abstractive | Very spammy site with low quality in both the article and the summary |
| http://www.almayadeen.net/ | abstractive | Very good summary |
| https://middle-east-online.com/ | abstractive | Not a summary, but very good multi-sentence subtitles; coupled with the title and good filtering, they can yield a good abstractive summary |
| https://www.dw.com/ar/ | abstractive | Very good summary |
| https://www.france24.com/ar/ | abstractive | Some of the articles have a very good abstractive summary, some don’t (there is a field to search for) |
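When an outlet exposes its summary as a field in the page markup, it can be pulled out with the standard library alone. The sketch below assumes the summary sits in a `<meta name="description">` or `<meta property="og:description">` tag. That is only an assumption for illustration: the actual field differs from outlet to outlet and has to be found by inspecting each site's pages.

```python
from html.parser import HTMLParser


class SummaryFieldParser(HTMLParser):
    """Collects the page title and the first description-like <meta> tag."""

    # Assumed field names; inspect each outlet's markup to find the right one.
    SUMMARY_KEYS = ("description", "og:description")

    def __init__(self):
        super().__init__()
        self.title = ""
        self.summary = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and self.summary is None:
            key = attrs.get("name") or attrs.get("property")
            if key in self.SUMMARY_KEYS and attrs.get("content"):
                self.summary = attrs["content"].strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title_and_summary(html):
    """Return (title, summary); summary is None when no known field is present."""
    parser = SummaryFieldParser()
    parser.feed(html)
    return parser.title.strip(), parser.summary
```

Pages for which the summary comes back as `None` can simply be skipped, which also handles outlets where only some of the articles carry a summary.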
Pros and Cons
- Possibility of generating a large amount of data
- Can be used in multi-sentence summarization
- Several news outlets with extremely different topics
- Can generate data for either extractive or abstractive summarization
- Cleaning data from multiple outlets is complex
Bullet Point Summarization
In this task, the system accepts an article and generates a bullet-point list that summarizes the main highlights of the article.
To collect data for this task, the authors of  used the fact that news articles from the Daily Mail (and formerly CNN) carry summaries in this format; see the following figure. The authors scraped over 300k examples of this type.
However, while exploring the 60 most popular news outlets in Arabic, we found no news outlet that follows this style.
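For outlets that do publish highlights in this style, turning a scraped page into a training pair is mostly a matter of cleaning the bullets and joining them. The following is a minimal sketch with hypothetical function names and an assumed output layout (`source` and `target` keys), not the preprocessing used by the cited authors.

```python
def highlights_to_target(highlights, separator=" "):
    """Join bullet-point highlights into one multi-sentence target string."""
    cleaned = []
    for line in highlights:
        # Remove surrounding whitespace and any leading bullet marker.
        line = line.strip().lstrip("-•* ").strip()
        if not line:
            continue
        # Highlights often lack final punctuation; add a period so that
        # the joined target reads as separate sentences.
        if line[-1] not in ".!?":
            line += "."
        cleaned.append(line)
    return separator.join(cleaned)


def build_example(article, highlights):
    """Pair the article body with its joined highlights."""
    return {"source": article.strip(), "target": highlights_to_target(highlights)}
```

Keeping the highlights as separate sentences in the target lets the same pairs serve either a bullet-point generator or a regular multi-sentence summarizer.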
Multi-document Summarization Using Wikinews
Several people have used Wikipedia in one way or another to generate data for summarization. In , the Google Brain team uses the whole of English Wikipedia as a multi-document summarization dataset. This approach is extremely complicated.
However, we believe that the following approach is feasible for a relatively small team with limited resources. The authors of  used Wikinews to generate a large multi-document summarization dataset. The idea is that each Wikinews article can be treated as the summary, while the sources it cites serve as the input documents.
Pros and Cons
- While treating all of Wikipedia as a summarization dataset is hard in under-resourced languages, mainly because many of the sources are written in English or other Western languages, in the case of Wikinews most of the sources come from news outlets that publish in the native language.
- The same approach can be applied to many languages.
- Cleaning data from multiple outlets is complex and usually requires language detection modules.
- The size of the data is rather limited, at 52k articles.
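The Wikinews idea, together with the language filtering it requires, can be sketched as follows. This is an illustration under our own assumptions: the `MultiDocExample` layout is hypothetical, and the character-ratio check is a crude stand-in for a real language detection module. It only tests for Arabic script, so it cannot separate Arabic from other languages written in the same script.

```python
from dataclasses import dataclass

ARABIC_BLOCK = range(0x0600, 0x0700)  # basic Arabic Unicode block


@dataclass
class MultiDocExample:
    sources: list  # texts of the source articles cited by the Wikinews page
    summary: str   # text of the Wikinews article itself


def arabic_ratio(text):
    """Share of alphabetic characters that fall in the basic Arabic block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ord(ch) in ARABIC_BLOCK for ch in letters) / len(letters)


def build_example(wikinews_text, source_texts, min_ratio=0.8, min_sources=2):
    """Keep only sources in the target script; drop the example if too few remain."""
    kept = [s for s in source_texts if arabic_ratio(s) >= min_ratio]
    if len(kept) < min_sources:
        return None
    return MultiDocExample(sources=kept, summary=wikinews_text)
```

Requiring at least two surviving sources keeps the task genuinely multi-document. Both `min_ratio` and `min_sources` are assumed defaults that should be adjusted after inspecting the scraped data.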
Sentence Compression
Sentence compression is a paraphrasing task where the goal is to generate a sentence shorter than the given one while preserving its essential content. A robust compression system would be useful for mobile devices as well as a module in an extractive summarization system.
The largest dataset for this task was built automatically by . The data is generated from news articles by utilizing the first sentence S and the headline H.
First, they apply several filters on S and H to exclude articles with grammatically or semantically problematic headlines (such as click-bait). Next, the headline is used to create a more condensed version of the first sentence. It is possible to extend such an automatic approach to other languages like Arabic.
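To make the idea of using H to condense S concrete, here is a heavily simplified, token-level sketch. It is a deliberate simplification and not the procedure from the cited paper: it keeps the tokens of S that also occur in H, plus any function word directly preceding a kept token. The `FUNCTION_WORDS` list and the `min_coverage` threshold are illustrative assumptions.

```python
import re

# Illustrative English function words; an Arabic list would be needed for Arabic.
FUNCTION_WORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "for", "by", "at"}


def tokenize(text):
    """Split text into word tokens, keeping the original casing."""
    return re.findall(r"\w+", text)


def compress(first_sentence, headline, min_coverage=0.6):
    """Return (compressed_text, keep_labels), or None if H is poorly covered by S."""
    sent_tokens = tokenize(first_sentence)
    head_words = {t.lower() for t in tokenize(headline)} - FUNCTION_WORDS
    if not head_words:
        return None
    # Filter step: skip pairs where the headline shares too few words with S.
    sent_words = {t.lower() for t in sent_tokens}
    if len(head_words & sent_words) / len(head_words) < min_coverage:
        return None
    matched = [t.lower() in head_words for t in sent_tokens]
    if not any(matched):
        return None
    # Keep matched tokens plus a function word directly preceding a matched token.
    keep = list(matched)
    for i in range(len(sent_tokens) - 1):
        if sent_tokens[i].lower() in FUNCTION_WORDS and matched[i + 1]:
            keep[i] = True
    compressed = " ".join(t for t, k in zip(sent_tokens, keep) if k)
    return compressed, keep
```

The per-token `keep` labels can be used directly as training targets for a deletion-based compression model. Because this version ignores syntax, its output will often be less grammatical than what a structure-aware method would produce.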
 Straka, Milan, et al. “SumeCzech: Large Czech News-Based Summarization Dataset.” Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). 2018.
 Filippova, Katja, and Yasemin Altun. “Overcoming the lack of parallel data in sentence compression.” (2013).
 Kikuchi, Yuta, et al. “Controlling output length in neural encoder-decoders.” arXiv preprint arXiv:1609.09552 (2016).
 Hermann, Karl Moritz, et al. “Teaching machines to read and comprehend.” Advances in neural information processing systems. 2015.
 Liu, Peter J., et al. “Generating wikipedia by summarizing long sequences.” arXiv preprint arXiv:1801.10198 (2018).
 Zhang, Jianmin, and Xiaojun Wan. “Towards Automatic Construction of News Overview Articles by News Synthesis.” Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2017.