Abstractive Summarization in Underresourced Languages

The increasing amount of text data in the digital age calls for methods that reduce reading time while preserving information content. Summarization achieves this by deleting, generalizing, or paraphrasing fragments of the input text to create a more concise version. Summarization methods can be categorized as single-document or multi-document, and as extractive or abstractive. In contrast to the single-document setup, the multi-document setup can exploit the fact that in some domains, such as news, different sources describe the same event, so their articles share many similarities. Extractive methods rely solely on the words of the input and, for example, extract whole sentences from it. Abstractive approaches, on the other hand, are rarely bound to any such constraints, and they have gained a lot of traction recently thanks to advances in deep learning and Seq2Seq models.

In this article, we will concentrate our discussion on extractive versus abstractive approaches.

On the one hand, extractive summarizers have several benefits:

  • They are typically easier to implement: in its simplest form, an extractive summarizer can be seen as a sentence ranking model.
  • In practice, these models are very robust and generalize easily to other domains, mainly because their training step (if there is one) places no restrictions on the domain or language.
  • Finally, most of these approaches are either unsupervised or do not require any training, and are thus helpful for under-resourced languages like Arabic.
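To make the "sentence ranking" view concrete, here is a minimal sketch of an unsupervised extractive summarizer. It assumes a very simple scoring rule (the average document frequency of a sentence's words) and a naive regex-based sentence splitter. Both are illustrative choices, not a method taken from any of the cited works.

```python
import re
from collections import Counter


def rank_sentences(text, top_k=2):
    """Return the top_k highest-scoring sentences, kept in document order.

    Assumption: a sentence is scored by the average frequency of its
    words in the whole document (no stop-word removal, no training).
    """
    # Naive sentence splitter: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenized = [re.findall(r"\w+", s.lower()) for s in sentences]

    # Word frequencies over the whole document.
    freq = Counter(word for words in tokenized for word in words)

    scores = [
        sum(freq[w] for w in words) / len(words) if words else 0.0
        for words in tokenized
    ]

    # Pick the best-scoring sentences, then restore their original order.
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:top_k]
    return [sentences[i] for i in sorted(best)]


if __name__ == "__main__":
    doc = "The cat sat. The cat ate the fish. Dogs bark."
    print(rank_sentences(doc, top_k=2))
```

Because nothing here depends on annotated data, the same code runs unchanged on text in any language that the regular expressions can tokenize.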

On the other hand, the best-performing summarization models currently available are abstractive summarizers that rely on very large deep learning models with millions of parameters. These models perform much better than their extractive counterparts. However, they suffer from two major problems:

  • Deep learning models, and Seq2Seq models in particular, are data-hungry: they require large training sets of parallel text-summary pairs, which are usually built manually. Such datasets are scarce even for Western languages, let alone under-resourced languages like Arabic.
  • Furthermore, these approaches struggle with text from domains other than that of their training set, and the options for model adaptation in this task are rather limited compared with other tasks that use the Seq2Seq architecture, such as machine translation or automatic speech recognition.

In the case of Arabic, for instance, our previous article explored the freely available summarization corpora and, unfortunately, found no training set large enough to enable neural abstractive models.

In this article, we will explore several methods in which we can implement an abstractive summarizer in an underresourced language where no large parallel data is available.

We try to tackle this task in 4 different ways:

  1. Easily building large datasets to support Seq2Seq supervised models
  2. Using non-neural abstractive summarizers
  3. Using unsupervised neural summarizers that require no annotated data
  4. Using other miscellaneous approaches

Building Datasets for Neural Models

One option for enabling an abstractive summarizer is to build a dataset large enough to train such models. However, manually creating such a huge dataset can be extremely costly. If, on the other hand, such a dataset could be generated automatically or semi-automatically in the under-resourced language, abstractive summarizers could be created easily.

The only method we found that tackles a similar task is reported in [1]. In that work, researchers at Google Brain used the references of English Wikipedia articles as input and trained a Seq2Seq model to generate the actual article from the references. The goal of that work was not summarization in its own right but rather to test neural models in real settings.

Apart from the complexity of parsing all the references in Wikipedia articles, this approach might be hard to generalize to other languages because of the limited number of references. In Arabic, for example, a good portion of the references are in English rather than Arabic.

Non-Neural Approaches

Research in abstractive summarization goes beyond current neural methods, and some early non-neural methods exist. In this section, we outline some aspects of these methods.

In nearly all of these non-neural approaches, the summarizer can be decomposed into two steps:

  1. Content Selection
  2. Surface Realization

Content Selection aims to select a subset of the candidate phrases extracted from the text for inclusion in the final summary, typically subject to length constraints.

Some methods rely on heuristics. For instance, the model in [2] heuristically selects the candidate phrases most frequently mentioned for an aspect.

However, the preferred method is Integer Linear Programming (ILP). ILP can be used to optimize an objective function subject to a set of linear constraints. When applied to content selection, the objective function is a weighted sum of a set of binary variables. Each variable represents a candidate phrase and has the value 1 if and only if ILP decides to select it for inclusion in the final summary. The weight associated with each variable indicates the importance of the phrase. The authors of [3], for instance, estimate the salience of each candidate phrase based on its position and its grammatical role in the input document and use the salience score as its weight. The linear constraints encode length limits; for example, one constraint limits the number of words in each sentence of the summary. The key advantage of employing ILP for content selection is that the decision is made jointly over all phrases.
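The formulation above can be illustrated with a small sketch. It assumes a single overall word budget as the only constraint, made-up salience weights, and a brute-force enumeration of all 0/1 assignments in place of a real ILP solver. The enumeration finds the same optimum on tiny inputs but does not scale, so a practical system would hand the same objective and constraints to a solver.

```python
from itertools import product


def select_phrases(phrases, weights, max_words):
    """Solve: maximize sum(w_i * x_i)
              subject to sum(len_i * x_i) <= max_words, x_i in {0, 1}.

    Assumption: brute-force search stands in for an ILP solver, and the
    only constraint is one global length budget (in words).
    """
    lengths = [len(p.split()) for p in phrases]
    best_value = 0.0
    best_choice = (0,) * len(phrases)

    # Enumerate every assignment of the binary selection variables.
    for choice in product((0, 1), repeat=len(phrases)):
        total_len = sum(n * x for n, x in zip(lengths, choice))
        if total_len > max_words:
            continue  # violates the length constraint
        value = sum(w * x for w, x in zip(weights, choice))
        if value > best_value:
            best_value, best_choice = value, choice

    return [p for p, x in zip(phrases, best_choice) if x == 1]


if __name__ == "__main__":
    candidates = [
        "the council approved the budget",
        "on Tuesday",
        "after a long debate",
        "the mayor objected",
    ]
    salience = [0.9, 0.2, 0.4, 0.6]  # hypothetical weights
    print(select_phrases(candidates, salience, max_words=8))
```

Note how the decision is joint: the third phrase is more salient than the second, yet neither is chosen, because the first and fourth together use the whole budget and give the highest total weight.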

Surface Realization, on the other hand, aims to combine the candidates chosen during content selection using grammatical and syntactic rules to generate a summary. This part usually includes complicated Natural Language Generation (NLG) steps. While tools like SimpleNLG can be used for English, NLG work on other languages is still extremely limited, especially for under-resourced languages like Arabic.

Some Approaches

One of the prominent early approaches is the template-based method. Template-based methods are motivated by the observation that human summaries of a given type (e.g., meeting summaries) have common sentence structures, which can be learned from the training set and encoded as templates. Given an input document, a summary can then be generated by filling the slots in the best-fitting templates learned for this type of document. Template-based methods typically consist of three steps:

  1. Learning the templates from the human summaries
  2. Extracting important phrases from the input document
  3. Generating a summary based on the filled templates.

For example, the authors of [4] propose a template-based method for meeting summarization:

  1. In step 1 (template learning), a template is first generated from a sentence of each human summary in the training set by replacing each noun phrase (NP) in the sentence with a blank slot labelled with the hypernym of the NP’s head, obtained from WordNet. These templates are then clustered based on their root verbs.
  2. In step 2 (keyphrase extraction), the important phrases for each topic of the input document are extracted and labelled with their hypernyms.
  3. Finally, in step 3 (generation), the templates with the highest similarity to each topic of the meeting are selected. Candidate summary sentences are then generated by filling each template with phrases whose labels match its slots. A sentence ranker is trained to rank the generated sentences in each segment, and the highest-ranked sentence for each topic segment is selected for inclusion in the summary. Finally, the selected sentences are sorted by the chronological order of the topic segments in the input document.
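The slot mechanism in these steps can be sketched in a few lines. This is a toy illustration, not the implementation from [4]: the `HYPERNYMS` dictionary is a hand-written stand-in for WordNet, the noun phrases are supplied by hand instead of coming from a parser, and template clustering and sentence ranking are omitted.

```python
import re

# Hypothetical head-noun -> hypernym table standing in for WordNet lookups.
HYPERNYMS = {
    "manager": "person",
    "designer": "person",
    "budget": "fund",
    "control": "device",
}


def label_phrase(noun_phrase):
    """Label a noun phrase with the hypernym of its head (assumed: last word)."""
    head = noun_phrase.split()[-1]
    return noun_phrase, HYPERNYMS[head]


def learn_template(sentence, noun_phrases):
    """Step 1: replace each noun phrase with a slot named after its label."""
    template = sentence
    for np in noun_phrases:
        _, label = label_phrase(np)
        template = template.replace(np, "[" + label + "]", 1)
    return template


def fill_template(template, keyphrases):
    """Step 3: fill each slot with the first unused keyphrase of that label."""
    pool = list(keyphrases)  # (phrase, label) pairs from step 2

    def fill(match):
        label = match.group(1)
        for i, (phrase, phrase_label) in enumerate(pool):
            if phrase_label == label:
                del pool[i]
                return phrase
        return match.group(0)  # no matching phrase: leave the slot as is

    return re.sub(r"\[(\w+)\]", fill, template)


if __name__ == "__main__":
    template = learn_template(
        "the project manager presented the budget",
        ["the project manager", "the budget"],
    )
    print(template)
    keyphrases = [
        label_phrase("the industrial designer"),
        label_phrase("the remote control"),
        label_phrase("the production budget"),
    ]
    print(fill_template(template, keyphrases))
```

Even this toy version shows why the approach depends on external resources: without a hypernym lexicon and a reliable noun-phrase chunker for the target language, neither the templates nor the labelled keyphrases can be produced.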

Other early methods encode the text as a graph, such as the event semantic link network (ESLN) [5]. In this approach, given an input text, a graph is constructed in which each node corresponds to an event mentioned in the text, and an edge between two nodes encodes the semantic relation between the corresponding events. After graph construction, ILP can be applied to this network to perform content selection (i.e., selecting a subset of nodes for generating the summary). From this intuitive representation, a simple NLG step can then build the final summary.

Pros and Cons

Most of these methods either use unsupervised machine learning or are fully knowledge-based systems, meaning they require little to no training data.

Many of these methods rely on other NLP steps such as key-phrase extraction and syntactic parsing. This dependence causes several issues:

  1. Errors propagate from the subsystems to the summarizer
  2. Most of these pre-processing steps are challenging in their own right and can perform poorly on under-resourced languages
  3. The integration of these subsystems is usually done manually with a lot of tinkering
  4. The complexity of such systems can impact their speed
  5. The performance of such models is much lower than that of their neural counterparts

Unsupervised Neural Approaches

These approaches aim to secure the gains of a neural model, namely its simplicity and performance, without the main problem of data shortage. The following are some of the methods reported in the literature.

Most of these methods rely on some form of autoencoder. We give a brief introduction to autoencoders here; a detailed study is beyond the scope of this article, yet you can follow this series for a simple intro with code, or read this article if you wish to better understand the math behind autoencoders.

An autoencoder is a type of artificial neural network used to learn efficient data codings in an unsupervised manner. The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for dimensionality reduction. The basic architecture (shown in the figure below) is similar to that of encoder-decoder systems, with one main difference: autoencoders (being auto) use the same text as both the input and the output of the model.

In the case of abstractive summarization, the autoencoders are trained to shrink the input text. There are several ways to accomplish this:

The first-ever unsupervised abstractive summarizer is introduced in [6]. The goal of that work is multi-document summarization of customer reviews: given a set of customer reviews, the model returns a single review that is supposed to include all the information in these reviews. The model, depicted in the following figure, is composed of two autoencoders. The first learns encodings of the input reviews; these representations are then averaged and fed to a second autoencoder that produces the final summary.

The whole model is trained in a single run to optimize both the average summary similarity and the autoencoder reconstruction loss. The results are far from optimal, yet they are comparable to those of other abstractive approaches. The following figure compares the results of this unsupervised model with a strong extractive baseline. The authors have made the code for this publication available.
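The averaging step and the similarity term can be sketched numerically. This is a sketch under stated assumptions rather than the model from [6]: the review encodings are made-up vectors (a real model learns them with a neural encoder), and the similarity term is assumed to be the average cosine distance between an encoding of the generated summary and the encodings of the input reviews.

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length encodings component by component."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def average_summary_distance(summary_encoding, review_encodings):
    """Assumed similarity loss: mean cosine distance to every review encoding."""
    distances = [1.0 - cosine_similarity(summary_encoding, z) for z in review_encodings]
    return sum(distances) / len(distances)


if __name__ == "__main__":
    # Hypothetical 2-dimensional encodings of two reviews.
    reviews = [[1.0, 0.0], [0.0, 1.0]]
    combined = mean_vector(reviews)  # fed to the summary decoder
    print(combined)
    # If the summary re-encodes exactly to the mean, this is its loss term.
    print(average_summary_distance(combined, reviews))
```

A low value means the summary sits close to all input reviews in the encoding space, which is what allows training without any reference summaries.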

Another approach is reported in [7]. In this work, the author trains an autoencoder to learn representations of the input text, but uses a technique from [8] to restrict the output of the autoencoder to a predefined length. Results are reported on the annotated Gigaword dataset, one of the best-known English corpora, with ROUGE as the metric (if you are not familiar with ROUGE, consult this article). The reported ROUGE score is 23.41, compared to 39.11 for the state-of-the-art model on this task.
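For readers who want a feel for what a ROUGE score measures, the following is a minimal sketch of ROUGE-N as n-gram overlap between a candidate summary and a single reference. It assumes whitespace tokenization and lowercasing only; official implementations add stemming options and support multiple references, and the text above does not say which ROUGE variant the quoted figures refer to.

```python
from collections import Counter


def rouge_n(candidate, reference, n=1):
    """Return ROUGE-N precision, recall and F1 for one candidate/reference pair."""

    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Clipped overlap: each n-gram counts at most as often as in the reference.
    overlap = sum((cand & ref).values())

    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    system = "the cat sat on the mat"
    gold = "the cat lay on the mat"
    print(rouge_n(system, gold, n=1))
    print(rouge_n(system, gold, n=2))
```

The function returns fractions between 0 and 1; papers usually report the same quantity multiplied by 100.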

Other unsupervised abstractive summarizers include [9].

Other Related Tasks

In this section, we outline some tasks that we believe are close enough to abstractive summarization yet they fall outside the scope of this article.

Sentence Compression

Sentence compression is a paraphrasing task where the goal is to generate a sentence shorter than the given one while preserving its essential content. A robust compression system would be useful for mobile devices as well as a module in an extractive summarization system. The largest dataset for this task was built automatically by [10]. The data is generated from news articles by using the first sentence S and the headline H. First, several filters are applied to S and H to exclude articles with grammatically or semantically problematic headlines (click-bait and the like). Next, the headline is used to create a more condensed version of the first sentence. It is possible to extend such automatic construction to other languages like Arabic, yet the extension is not trivial.

Cross-Lingual Abstractive Summarization

The authors of [11] and [12] suggest nearly the same idea. They start with a large summarization dataset, which usually exists in English, and then perform a round-trip translation as follows:

  • To build the training data for a given under-resourced language X, the articles of the corpus are translated from English to language X using a neural machine translation system (Google Translate, for example). This yields a noisy translation in language X. This noisy translation is then translated back to English in the same way, resulting in even noisier articles.
  • For training, the model takes the noisy round-trip-translated articles as input and learns to output the original clean English summaries.
  • In deployment, an article in language X is first translated to English using the same neural translator, and the summarizer then uses this translation to generate an English summary.

This allows articles in under-resourced languages to be summarized. However, the summaries are in English, which is not optimal in most cases.
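The round-trip procedure described above can be sketched as a small data pipeline. The `translate` and `summarizer` callables are hypothetical placeholders for a real machine translation system and a trained summarization model. The toy versions below exist only so the sketch runs end to end; they are not part of [11] or [12].

```python
def build_training_pairs(english_articles, english_summaries, translate, lang_x):
    """Create (noisy English article, clean English summary) training pairs."""
    pairs = []
    for article, summary in zip(english_articles, english_summaries):
        in_x = translate(article, src="en", tgt=lang_x)   # English -> X
        noisy_en = translate(in_x, src=lang_x, tgt="en")  # X -> English (noisier)
        pairs.append((noisy_en, summary))
    return pairs


def summarize_foreign(article_x, translate, summarizer, lang_x):
    """Deployment: translate the language-X article to English, then summarize."""
    return summarizer(translate(article_x, src=lang_x, tgt="en"))


# --- Toy stand-ins (assumptions for illustration only) ---------------------

def toy_translate(text, src, tgt):
    """Fake translator: marks words with 'x_' going to X; strips the mark and
    lowercases coming back, which mimics lossy, noisy translation."""
    if tgt == "en":
        return " ".join(word[2:].lower() for word in text.split())
    return " ".join("x_" + word for word in text.split())


def toy_summarizer(text):
    """Fake summarizer: keeps the first two words."""
    return " ".join(text.split()[:2])


if __name__ == "__main__":
    pairs = build_training_pairs(
        ["The Council met today"], ["Council meets"], toy_translate, "xx"
    )
    print(pairs)
    print(summarize_foreign("x_The x_Council x_met x_today", toy_translate, toy_summarizer, "xx"))
```

Training on the noisy round-trip text matters because, at deployment time, the summarizer only ever sees machine-translated English, never clean English.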

Automatic Text Paraphrasing

One of the main shortcomings of extractive summarization in real applications is legal: copying large chunks of an article directly is not allowed, as this would violate intellectual property rights. However, displaying a processed version of the article is allowed. This means that adding a paraphrasing component on top of a strong extractive summarizer can solve the legal issues, provided the extractive summarizer is good enough.

There is a large body of research on text paraphrasing. Sentence compression is considered a paraphrasing task; another prominent paraphrasing task is style transfer.


Conclusion

In this article, we tried to provide a very wide and very shallow overview of how abstractive summarization can be implemented in under-resourced languages. It lists only some of the approaches that we believe are promising for building an under-resourced abstractive summarizer, and implementing any of them must be preceded by further, deeper research. The following are some of the highlights of our research:

  • The main concern when moving from extractive to abstractive summarization is the risk of untruthful summaries, where the model starts generating random stories related to the general topic instead of summarizing the actual article. This phenomenon is usually associated with neural models and, to a lesser degree, with non-neural approaches.
  • We have not come across many research articles on automatically or semi-automatically building training sets for abstractive summarizers, with the exception of [1] and [11]. Some of these approaches are applicable, but we believe that further research in this area can be of value.
  • The non-neural models, while being general and usually domain- and language-independent, are complex to implement and usually offer little to no improvement over their extractive counterparts.
  • Although the reported performance of the unsupervised neural models is rather low, some of them come with code bases, so it might be easy to test them as off-the-shelf components.
  • In our view, a paraphrasing component can be a great addition to our current system, which uses extractive summarization, and would usually take much less time to build.



References

[1] P. J. Liu et al., “Generating Wikipedia by summarizing long sequences,” arXiv preprint arXiv:1801.10198, 2018.

[2] P.-E. Genest and G. Lapalme, “Fully abstractive approach to guided summarization,” in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2012, pp. 354–358.

[3] L. Bing, P. Li, Y. Liao, W. Lam, W. Guo, and R. J. Passonneau, “Abstractive multi-document summarization via phrase selection and merging,” arXiv preprint arXiv:1506.01597, 2015.

[4] T. Oya, Y. Mehdad, G. Carenini, and R. Ng, “A template-based abstractive meeting summarization: Leveraging summary and source text relationships,” in Proceedings of the 8th International Natural Language Generation Conference (INLG), 2014, pp. 45–53.

[5] W. Li, L. He, and H. Zhuge, “Abstractive news summarization based on event semantic link network,” 2016.

[6] E. Chu and P. Liu, “MeanSum: a neural model for unsupervised multi-document abstractive summarization,” in International Conference on Machine Learning, 2019, pp. 1223–1232.

[7] R. Schumann, “Unsupervised abstractive sentence summarization using length controlled variational autoencoder,” arXiv preprint arXiv:1809.05233, 2018.

[8] Y. Kikuchi, G. Neubig, R. Sasano, H. Takamura, and M. Okumura, “Controlling output length in neural encoder-decoders,” arXiv preprint arXiv:1609.09552, 2016.

[9] M. T. Nayeem, T. A. Fuad, and Y. Chali, “Abstractive unsupervised multi-document summarization using paraphrastic sentence fusion,” in Proceedings of the 27th International Conference on Computational Linguistics, 2018, pp. 1191–1204.

[10] K. Filippova and Y. Altun, “Overcoming the lack of parallel data in sentence compression,” 2013.

[11] J. Zhu et al., “NCLS: Neural Cross-Lingual Summarization,” arXiv preprint arXiv:1909.00156, 2019.

[12] J. Ouyang, B. Song, and K. McKeown, “A robust abstractive system for cross-lingual summarization,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 2025–2031.
