Term Informativeness Estimation in the Arabic Language

In our effort at Almeta to provide Arabic readers with the most informative articles, we have employed several methods to measure the informativeness of a piece of news. In this article we shed light on one of the approaches we are using, namely term informativeness.

This article is part of our research on measuring text informativeness.

The What

This task is based on the idea that some words carry more semantic content than others, which leads to the notion of term specificity, or informativeness.

A closely related task is term weighting (for search engines, indexing, summarization, …).

The goal of this task is to assign every term a metric representing its importance. Methods of this type were dominant before the neural explosion, since a per-term measure of informativeness can be used in many NLP and IR applications (key phrase extraction, summarization, …), basically by ranking the terms according to this metric.

Ideally, with such a metric we can estimate the informativeness of an article as the average informativeness of its terms.
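A minimal sketch of this averaging idea is given here. It assumes the per-term scores have already been computed by one of the metrics discussed later; the function name `article_informativeness`, the `default` value for unseen terms and the toy scores are all hypothetical.

```python
def article_informativeness(tokens, term_scores, default=0.0):
    """Mean informativeness of the tokens in one article.

    term_scores maps a term to a precomputed informativeness value.
    Terms missing from term_scores fall back to `default`, which is an
    assumption of this sketch rather than part of any published method.
    """
    if not tokens:
        return 0.0
    return sum(term_scores.get(tok, default) for tok in tokens) / len(tokens)


# Hypothetical scores and tokenized articles, for illustration only.
scores = {"election": 2.5, "parliament": 3.0, "the": 0.1, "of": 0.1}
focused = ["the", "election", "of", "parliament"]
filler = ["the", "of", "the", "of"]

print(article_informativeness(focused, scores))
print(article_informativeness(filler, scores))
```

An article made mostly of function words ends up with a lower average than one that mentions topical terms, which is the behaviour we want from such a proxy.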

There is no simple way to evaluate this task (because it is mostly metric-based). Interesting ways to do it include:

• Observing the average, median and relative statistics of the metrics across several test sets.
• Measuring the relative informativeness gain between parents and children in lexicons like WordNet (based on the assumption that children in such hierarchies tend to be more specific, and thus more informative, than their parent terms).
• Embedding the metrics as features in other tasks, most notably key phrase extraction.
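The parent/child evaluation can be sketched as follows. This is only an illustration: the pairs and scores are hypothetical toy data (in practice they would come from a lexicon such as WordNet), and the function name `hierarchy_agreement` is our own.

```python
def hierarchy_agreement(pairs, term_scores):
    """Fraction of (parent, child) pairs in which the child scores higher.

    Pairs with a term missing from term_scores are skipped.
    """
    usable = [(p, c) for p, c in pairs if p in term_scores and c in term_scores]
    if not usable:
        return 0.0
    hits = sum(1 for p, c in usable if term_scores[c] > term_scores[p])
    return hits / len(usable)


# Hypothetical hypernym/hyponym pairs and scores, for illustration only.
pairs = [("animal", "dog"), ("dog", "poodle"), ("vehicle", "car"), ("fruit", "apple")]
scores = {"animal": 1.0, "dog": 2.0, "poodle": 3.5, "vehicle": 1.5, "car": 1.2}

print(hierarchy_agreement(pairs, scores))
```

A metric that agrees with the hierarchy on most pairs is behaving as the specificity assumption predicts.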

The How

The methods that can be used to measure term informativeness fall into three major categories:

Statistics-based

These metrics collect some statistics about each of the terms from a relatively large corpus of texts. Many of these methods are based on some variant of term frequency, and they are especially common in the IR community. Note that all of these methods are language independent:

• TF-IDF (term frequency-inverse document frequency): a very popular algorithm in NLP that weights the terms in a corpus based on their occurrence frequencies. The details of TF-IDF are beyond the scope of this article.
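• Term variance [1]: a measure of how much the frequency of a term varies around its mean across documents. This metric is given by:

$$\mathrm{var}(w) = \frac{1}{D} \sum_{d=1}^{D} \left(t_{dw} - \bar t_w\right)^2$$

where $D$ is the size of the training corpus (the number of documents), $t_{dw}$ is the frequency of term $w$ in document $d$, and

$$\bar t_w = \frac{1}{D} \sum_{d=1}^{D} t_{dw}$$

Here $\bar t_w$ is the mean frequency of $w$ per document, that is, the total number of occurrences of the word $w$ in the corpus divided by $D$.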

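• Burstiness [1]: this metric compares the collection frequency and the document frequency directly. It is given by:

$$\mathrm{burstiness}(w) = \frac{cf_w}{df_w}$$

where $cf_w$ is the total number of occurrences of $w$ in the corpus and $df_w$ is the number of documents that contain $w$.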
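• IDF-based metrics: these metrics are variations on the inverse document frequency, $\mathrm{IDF}(w) = -\log_2 \frac{df_w}{D}$, where $df_w$ is the number of documents containing $w$ and $D$ is the number of documents in the corpus. IDF is a very popular measure in the field of Information Retrieval. Here are some of these metrics:
• Residual IDF [2]: the difference between the observed IDF and the IDF expected under a Poisson model. This metric is given by:

$$\mathrm{RIDF}(w) = \mathrm{IDF}(w) - \widehat{\mathrm{IDF}}(w)$$

where

$$\widehat{\mathrm{IDF}}(w) = -\log_2\left(1 - e^{-cf_w / D}\right)$$

and $cf_w$ is the total number of occurrences of $w$ in the corpus.

• IDF gain [2]: another IDF-based metric, given by:

$$\mathrm{gain}(w) = \frac{df_w}{D}\left(\frac{df_w}{D} - 1 - \log \frac{df_w}{D}\right)$$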
• Mixture score [3]: this score is based on the idea that, although topic-centric words are somewhat rare across the whole corpus, they exhibit two modes of operation:
1. A high-frequency mode, when the document is relevant to the word, and
2. A low (or zero) frequency mode, when the document is irrelevant.

Based on this, the authors suggest modeling the occurrences of a word with a mixture of binomial distributions. The score is given by:

$$s_{mix}(w) = \log \frac{P_{mix}(w)}{P_{uni}(w)}$$

where $P_{mix}$ is the probability of the word's appearances in the whole corpus under the mixture model, while $P_{uni}$ is the probability of the same appearances under a unigram model. The mixture model is estimated using EM.
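As a rough illustration of the mixture idea, the following sketch fits a two-component binomial mixture to the per-document counts of one word with EM and compares it with a single-rate (unigram) fit. The function name `mixture_score`, the initialisation, the clamping constant and the use of a log-likelihood difference as the score are assumptions of this sketch, not details confirmed by [3]. The counts are toy data.

```python
import math

EPS = 1e-6  # keeps probabilities away from 0 and 1 so the logs stay finite


def _clamp(p):
    return min(max(p, EPS), 1.0 - EPS)


def _log_kernel(h, n, theta):
    # log(theta^h * (1 - theta)^(n - h)); the binomial coefficient is
    # omitted because it cancels in the mixture/unigram comparison.
    return h * math.log(theta) + (n - h) * math.log(1.0 - theta)


def _log_sum_exp(a, b):
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def _component_logs(h, n, lam, theta_hi, theta_lo):
    a = math.log(lam) + _log_kernel(h, n, theta_hi)
    b = math.log(1.0 - lam) + _log_kernel(h, n, theta_lo)
    return a, b


def mixture_score(hits, lengths, iterations=50):
    """hits[i]: occurrences of the word in document i; lengths[i]: its token count."""
    docs = list(zip(hits, lengths))
    theta_uni = _clamp(sum(hits) / sum(lengths))
    rates = [h / n for h, n in docs]
    # Start the two components at the extreme per-document rates.
    theta_hi, theta_lo, lam = _clamp(max(rates)), _clamp(min(rates)), 0.5

    for _ in range(iterations):
        # E-step: responsibility of the high-frequency component per document.
        resp = []
        for h, n in docs:
            a, b = _component_logs(h, n, lam, theta_hi, theta_lo)
            resp.append(math.exp(a - _log_sum_exp(a, b)))
        # M-step: re-estimate the mixing weight and both rates.
        lam = _clamp(sum(resp) / len(resp))
        hi_den = max(sum(r * n for r, (_, n) in zip(resp, docs)), 1e-12)
        lo_den = max(sum((1 - r) * n for r, (_, n) in zip(resp, docs)), 1e-12)
        theta_hi = _clamp(sum(r * h for r, (h, _) in zip(resp, docs)) / hi_den)
        theta_lo = _clamp(sum((1 - r) * h for r, (h, _) in zip(resp, docs)) / lo_den)

    log_mix = sum(_log_sum_exp(*_component_logs(h, n, lam, theta_hi, theta_lo))
                  for h, n in docs)
    log_uni = sum(_log_kernel(h, n, theta_uni) for h, n in docs)
    return log_mix - log_uni


# Toy data: eight documents of 100 tokens each, 22 occurrences in total.
lengths = [100] * 8
bursty = [10, 0, 0, 12, 0, 0, 0, 0]   # concentrated in two documents
even = [3, 3, 2, 3, 3, 2, 3, 3]       # spread over all documents

print(mixture_score(bursty, lengths))
print(mixture_score(even, lengths))
```

The word whose occurrences are concentrated in a few documents gets a much higher score than the evenly spread one, even though both have the same total count.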

However, there are a few issues with this metric:

• The probabilities are calculated for each individual word, which is a great overhead.
• Using a mixture of binomial distributions also greatly increases the complexity, since for every single word each document must be represented as a one-hot vector.
• The final scores show negligible to no improvement over simpler metrics like RIDF.
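To make the simpler frequency-based metrics of this section concrete, here is a small sketch that computes term variance, burstiness, IDF, residual IDF and gain from a tokenized corpus. The formulas are the commonly cited forms and should be read as assumptions of this sketch: population variance (division by the number of documents), burstiness as collection frequency over document frequency, base-2 logarithms for IDF and RIDF, and the natural logarithm in the gain. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def corpus_statistics(docs):
    """Collect per-document term counts plus collection and document frequencies."""
    per_doc = [Counter(doc) for doc in docs]
    cf = Counter()  # collection frequency: total occurrences of each term
    df = Counter()  # document frequency: number of documents containing each term
    for counts in per_doc:
        cf.update(counts)
        df.update(counts.keys())
    return per_doc, cf, df


def term_metrics(term, per_doc, cf, df):
    """Metrics for one term; the term must occur at least once in the corpus."""
    n_docs = len(per_doc)
    mean_tf = cf[term] / n_docs
    variance = sum((counts[term] - mean_tf) ** 2 for counts in per_doc) / n_docs
    ratio = df[term] / n_docs
    idf = math.log2(n_docs / df[term])
    # IDF expected if the occurrences were scattered by a Poisson process.
    expected_idf = -math.log2(1.0 - math.exp(-mean_tf))
    return {
        "variance": variance,
        "burstiness": cf[term] / df[term],
        "idf": idf,
        "ridf": idf - expected_idf,
        "gain": ratio * (ratio - 1.0 - math.log(ratio)),
    }


# Toy tokenized corpus, for illustration only.
docs = [
    ["the", "election", "the", "election", "election", "result"],
    ["the", "market", "the", "price"],
    ["the", "election", "the", "election", "vote"],
    ["the", "weather", "the", "rain"],
]

per_doc, cf, df = corpus_statistics(docs)
for term in ("the", "election"):
    metrics = term_metrics(term, per_doc, cf, df)
    print(term, {name: round(value, 3) for name, value in metrics.items()})
```

On this toy corpus the function word "the" appears evenly in every document and gets zero variance, zero IDF and a negative residual IDF, while the topical word "election" scores higher on all five metrics.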

Semantics based

These metrics build some sort of representation of a term's meaning and harness it to measure the term's informativeness.

LSA informativity [4]: a very simple and plausible metric. LSA here stands for Latent Semantic Analysis, another very popular algorithm in the field of NLP. Its details are also out of the scope of this article, but we refer you to [9] for a general introduction.

Here, LSA is used to model the corpus, and a vector is generated for each word. The usual applications of LSA are clustering, similarity measurement, and so on, all of which are based on the cosine similarity between word vectors (or weighted sums of them in the case of document similarity).

However, the authors use the length of the word vector (which usually serves as the weighting term in the document weighted sum) to measure term informativeness. This is motivated by the following observation: “Intuitively, the vector length tells us how much information LSA has about this vector. […] Words that LSA knows a lot about (because they appear frequently in the training corpus[…]) have greater vector lengths than words LSA does not know well. Function words that are used frequently in many different contexts have low vector lengths – LSA knows nothing about them and cannot tell them apart since they appear in all contexts.”

This greatly overlaps with the motivation of the mixture score mentioned above. Their simple metric is as follows:

$$I_{LSA}(w) = \lVert \vec{v}_w \rVert_2$$

where $\vec{v}_w$ is the LSA vector of the word $w$.

Interestingly enough, the same reasoning can be carried over to Word2Vec, and therefore a similar measure can be based on Word2Vec vectors.
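Given word vectors from any such model, ranking terms by vector length takes only a few lines. The sketch below assumes the vectors have already been trained elsewhere; the three-dimensional vectors are made-up toy values, and the function names are hypothetical.

```python
import math


def vector_length(vector):
    """Euclidean (L2) norm of a word vector."""
    return math.sqrt(sum(x * x for x in vector))


def rank_by_vector_length(word_vectors):
    """Words sorted from the longest vector to the shortest."""
    return sorted(word_vectors,
                  key=lambda word: vector_length(word_vectors[word]),
                  reverse=True)


# Hypothetical low-dimensional vectors, for illustration only; real LSA or
# Word2Vec vectors typically have hundreds of dimensions.
word_vectors = {
    "parliament": [0.9, 1.2, 0.4],
    "election": [0.7, 0.2, 1.1],
    "the": [0.05, 0.02, 0.03],
}

for word in rank_by_vector_length(word_vectors):
    print(word, round(vector_length(word_vectors[word]), 3))
```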

However, the main problem with such approaches, just as with the statistical methods, is their dependence on the training corpus and the issue of out-of-vocabulary (OOV) words. Furthermore, it seems that using genre-specific corpora is superior to building the systems on generic ones like Wikipedia.

Model Reuse

As stated above, these metrics were originally used to rank terms in tasks like keyphrase extraction and summarization. Since these tasks are currently dominated by supervised neural models, one option is to use the confidence of such models (specifically keyphrase extraction models) to rank words by importance. However, there are two issues with this scheme:

• The confidence of these models tends to be non-smooth, with non-key words receiving nearly zero confidence.
• Such a scheme will cause error propagation and will be tightly coupled to the training set.

Term Weighting

This task is very similar to the task of term weighting in IR. Term weighting is out of the scope of our discussion for now, and we hope to present a separate article on it. However, if you are out of patience, you can review the following papers (in order of relevance): [1], [5]–[8]

Final remarks regarding implementation

• Nearly all of the presented methods are language agnostic, and therefore any monolingual dataset can be used.
• Some of the metrics, like TF-IDF, IDF and the LSA-based measure, are simple and efficient to compute, and given a proper training set can be implemented in 1 to 2 workdays.
• Most of the methods depend directly or indirectly (in the case of model reuse) on the training corpus, so starting with a restricted domain is advised.
• Some metrics, like those based on LSA or Word2Vec, have a heavy memory footprint, so indexing frameworks and structures might be relevant.

Conclusion

In this article we explored a simple question: how can we measure the importance of a single word in a text, and how can such a simple measure serve as a proxy for the informativeness of the whole article?

Did you know that we use this and other AI technologies in our app? See what you have just read about in action: try our Almeta News app. You can download it from Google Play or Apple’s App Store.

References

[1] Z. Wu and C. L. Giles, “Measuring term informativeness in context,” in Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2013, pp. 259–269.

[2] K. Papineni, “Why inverse document frequency?,” in Proceedings of the second meeting of the North American Chapter of the Association for Computational Linguistics on Language technologies, 2001, pp. 1–8.

[3] J. D. Rennie and T. Jaakkola, “Using term informativeness for named entity detection,” in Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, 2005, pp. 353–360.

[4] K. Kireyev, “Semantic-based estimation of term informativeness,” in Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2009, pp. 530–538.

[5] N. Nanas, V. Uren, A. De Roeck, and J. Domingue, “A comparative study of term weighting methods for information filtering,” KMi-TR-128, Knowledge Media Institute, The Open University, 2003.

[6] G. Murray and S. Renals, “Term-weighting for summarization of multi-party spoken dialogues,” in International Workshop on Machine Learning for Multimodal Interaction, 2007, pp. 156–167.

[7] M. Shirakawa, T. Hara, and S. Nishio, “N-gram idf: A global term weighting scheme based on information distance,” in Proceedings of the 24th International Conference on World Wide Web, 2015, pp. 960–970.

[8] C. Lioma and R. Blanco, “Part of Speech Based Term Weighting for Information Retrieval,” ArXiv Prepr. ArXiv170401617, 2017.

[9] T. K. Landauer, P. W. Foltz, and D. Laham, “An introduction to latent semantic analysis,” Discourse Processes, vol. 25, no. 2–3, pp. 259–284, 1998.