Let’s start with a simple question: what constitutes an informative article? According to the Oxford dictionary:
informative /ɪnˈfɔːmətɪv/ adjective
providing useful or interesting information
However, this is still an abstract concept, and measuring how informative a piece of news is turns out to be far from simple. We at Almeta have been debating this issue and testing multiple approaches for quite some time. In this article we provide an overview of our plan to implement a service that measures an article’s informativeness and provides you with the best possible news feed.
If you don’t feel like reading a lot, you can jump directly to the final section for a gist of the gist.
This article is a digest of all of our research on informativeness, and we have A LOT of it. If you are interested in more details, you can check each individual piece to learn:
- What makes an article informative in our view and how can we measure this quantitatively?
- How can we detect cliches in text, and why do they matter?
- How to measure the informativeness of each individual word?
- How to train a supervised model to measure informativeness?
- Can we get data to train such a model from summaries?
- Can we get such data in Arabic?
- How can we rank articles based on how informative they are?
Properties of the Informative Text
One way to think about informativeness is to see it as a collection of features that come together in a news piece and determine its worth.
Imagine a beautiful building. Can we define what makes it beautiful? It is really hard to pin down: it might be the structure, the distribution of light, or maybe the selection of colours. Each of these aspects is simpler than beauty as a whole and can hopefully be measured quantitatively.
The same analogy applies to text. Here readability, skimmability, or the use of cliches can serve as measurable features. In our previous article we explored which features mark an article as informative while still being measurable by our systems.
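To make the idea of measurable features concrete, here is a minimal sketch in Python. The cliche list and the three features are hypothetical choices made for illustration only; they are not the feature set from our earlier article.

```python
import re

# Hypothetical cliche list, for illustration only.
CLICHES = ("at the end of the day", "only time will tell", "a perfect storm")


def text_features(text: str) -> dict:
    """Compute a few crude, measurable proxies for how a text reads."""
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text.lower())
    lowered = text.lower()
    return {
        # Longer sentences are usually harder to read and to skim.
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
        # Count occurrences of each known cliche phrase.
        "cliche_count": sum(lowered.count(c) for c in CLICHES),
    }


features = text_features("Only time will tell. The bank raised rates.")
print(features)
```

Each value is a plain number, so any downstream model or rule can consume it directly.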
A student-teacher approach
While the reasoning above might seem sound, let us explore a different assumption: that informativeness is an abstract concept that is hard to pin down exactly, but that we can still judge using common sense, just as we judge the coherence of a text or a musical composition.
These common-sense rules must have been built up in our minds through years of experience. If this assumption is valid, wouldn’t it be possible to transfer this experience to information systems? In this article, we explore exactly this possibility.
In AI terminology, this means treating informativeness measurement as a learning problem: train a model in a supervised manner to take a piece of text and output a number that represents how informative it is. Obviously, such a system can also use the features described above.
The bottleneck in such a scheme is creating a large enough amount of data to train the model. We have explored in detail how this can be done and how data can be harnessed for the task (here and here), but the main obstacle remains our failure to harness such data for Arabic articles.
The Human Point of View
The failure of supervised informativeness detection made us rethink our steps: is this task really what we think it is?
Before we try to automate any task, we must first understand it. It is not possible to ask an information system to measure the worth of an article if we don’t know how humans do that.
Yes, it is much simpler to flag an article as spammy or unprofessional. In our previous article here we explored the limits that make an article not worth your time at all. But apart from the obvious, what makes one article more informative and more noteworthy than others?
Imagine you are given 10 articles and asked to rate each on a scale from 1 to 10 based on how “informative” it is. How long will it take you? An hour? Half an hour?
Let’s try another task: given 5 pairs of articles, for each pair choose the article that seems more informative to you. Clearly, the second task is much simpler.
This is the difference between regression (the former task) and ranking (the latter). We as humans are much better suited to the latter. It is much easier to choose between chocolate and vanilla ice cream than to rate chocolate on a scale of 10. The same holds for AI models, mainly because it is simpler for human annotators to build datasets of pairwise comparisons than to assign direct scores.
While this seems like a sensible proposition, it also has to be applicable in practice. This is what we explored in our previous article, along with how we can hopefully deal with the lack of data using Snorkel.
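As a small illustration of why pairwise judgments are still useful, the sketch below turns a handful of "A is more informative than B" choices into an overall ordering by counting wins. The article identifiers and the win-count rule are assumptions made for this example, not a description of our annotation pipeline.

```python
from collections import Counter


def rank_from_pairs(judgments):
    """Order articles by how many pairwise comparisons they won.

    judgments: list of (winner_id, loser_id) tuples from annotators.
    """
    wins = Counter()
    for winner, loser in judgments:
        wins[winner] += 1
        wins[loser] += 0  # make sure articles that never win still appear
    # Most wins first; ties are broken alphabetically for determinism.
    return sorted(wins, key=lambda article: (-wins[article], article))


# Hypothetical annotator choices over three articles.
judgments = [("a", "b"), ("a", "c"), ("b", "c")]
print(rank_from_pairs(judgments))
```

The annotator never has to produce a number; the ordering falls out of the simple choices.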
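The weak-labeling idea behind tools like Snorkel can be sketched in plain Python: several noisy heuristics vote on each pair, and each may abstain. The snippet below uses a simple majority vote and does not use Snorkel's actual API or its label model; the two heuristics are hypothetical stand-ins.

```python
import re
from collections import Counter

ABSTAIN, FIRST, SECOND = -1, 0, 1


def lf_more_numbers(a: str, b: str) -> int:
    """Hypothetical heuristic: prefer the article that cites more figures."""
    na, nb = len(re.findall(r"\d+", a)), len(re.findall(r"\d+", b))
    if na == nb:
        return ABSTAIN
    return FIRST if na > nb else SECOND


def lf_much_longer(a: str, b: str) -> int:
    """Hypothetical heuristic: prefer an article that is at least 50% longer."""
    la, lb = len(a.split()), len(b.split())
    if la > 1.5 * lb:
        return FIRST
    if lb > 1.5 * la:
        return SECOND
    return ABSTAIN


LABELING_FUNCTIONS = [lf_more_numbers, lf_much_longer]


def weak_label(a: str, b: str) -> int:
    """Majority vote over the labeling functions, ignoring abstentions."""
    votes = Counter(lf(a, b) for lf in LABELING_FUNCTIONS)
    votes.pop(ABSTAIN, None)
    if not votes or votes[FIRST] == votes[SECOND]:
        return ABSTAIN  # no evidence, or a tie
    return FIRST if votes[FIRST] > votes[SECOND] else SECOND


a = "The bank raised rates by 2 points in 2019 after 3 meetings."
b = "Rates went up."
print(weak_label(a, b))
```

Pairs that receive a non-abstaining label can then be added to the small manually annotated set.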
The Road Ahead
Overall, our current plan for building an informativeness detection system goes as follows:
- Treat the task as a ranking problem rather than a regression problem.
- Build a pairwise ranker of articles as follows:
  - Create a small, manually annotated evaluation and development dataset.
  - Use Snorkel to expand this dataset.
  - Train a pairwise ranker like LambdaRank on the created dataset.
- Reuse the pairwise ranker to give a score to each of the fetched articles, effectively converting the model from a ranking model into a regression model.
This is what we will do.
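The last step of the plan, reusing a pairwise ranker as a scorer, can be sketched as follows. This is a deliberately simple stand-in, not LambdaRank: a linear scoring function trained with a logistic loss on feature differences (in the spirit of RankNet). The two-dimensional feature vectors are hypothetical.

```python
import math


def train_pairwise_ranker(pairs, n_features, epochs=200, lr=0.1):
    """Learn weights w so that w.better > w.worse for each training pair.

    pairs: list of (better_features, worse_features) tuples.
    """
    w = [0.0] * n_features
    for _ in range(epochs):
        for better, worse in pairs:
            diff = [b - c for b, c in zip(better, worse)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            # Loss is log(1 + exp(-margin)); its gradient with respect to w
            # is -sigmoid(-margin) * diff, so we step along +g * diff.
            g = 1.0 / (1.0 + math.exp(margin))
            w = [wi + lr * g * di for wi, di in zip(w, diff)]
    return w


def score(w, features):
    """Score a single article; higher means ranked as more informative."""
    return sum(wi * xi for wi, xi in zip(w, features))


# Hypothetical features: [number of concrete facts, number of cliches].
pairs = [([3.0, 0.0], [1.0, 2.0]), ([2.0, 1.0], [0.0, 3.0])]
w = train_pairwise_ranker(pairs, n_features=2)
print(score(w, [3.0, 0.0]), score(w, [1.0, 2.0]))
```

Although the model only ever sees pairs during training, `score` assigns a number to a single article, which is what lets a ranking model stand in for a regression model. The scores are only meaningful relative to each other, not as absolute ratings.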
Did you know that we use this and other AI technologies in our app? See what you have just read about in action: try our Almeta News app, which you can download from Google Play or Apple’s App Store.