Difference is a fine and beautiful phenomenon. It should always be accepted, expected, and respected.
Difference adds richness to the topics we discuss and opens up perspectives we never thought of.
We believe that a viewpoint different from ours is the other side of a truth that we could not reach ourselves, but others did. As part of our mission at Almeta to give our users the whole picture, in this post we analyze the task of finding articles with viewpoints different from the one under consideration.
Suppose we have a corpus of articles discussing different topics. Given an article with a specific viewpoint about a topic, we want to determine the articles discussing the same topic but from the contrary viewpoint.
We can split this problem into two phases. Given an article A discussing a topic T from a viewpoint V:
- Finding other articles discussing T.
- Ranking these articles according to their disagreement with V.
Phase 1: Topic Identification
In general, topics vary in their level such that higher-level topics are broader. For instance:
topic level1: “Sport” > topic level2: “Football” > topic level3: “The world cup” > topic level4: “France won the world cup 2018”.
Let’s consider the question: which level of topics do we need to capture for this problem?
Obviously, it doesn’t make sense to try to capture contrary viewpoints in broad topics: imagine measuring the disagreement between two articles when one reports a basketball story and the other a football story!
Hence, we should identify topics that are as fine-grained as possible.
As a result, neither broad-genre text classification methods nor topic modeling methods are applicable here, simply because the intended topic level is finer than the output of these systems.
From the example stated above, we can see that the finest topic-level is just an event reported in an article, which leads us to think about using event detection methods.
Event detection aims to aggregate news articles into fine-grained story clusters. This task usually deals with news streams in an online manner, using sequential clustering algorithms, and typically relies on features related to named entities and timestamps to detect events. For more information about this task, you can refer to our previous post Event Detection in Media using NLP and AI.
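As a rough illustration, single-pass (sequential) clustering over bag-of-words vectors could look like the sketch below. The threshold, tokenization, and function names are all placeholders; a real event detection system would add named-entity and timestamp features.

```python
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def sequential_cluster(articles, threshold=0.3):
    """Single-pass clustering: assign each incoming article to the most
    similar existing event, or open a new event if none is close enough."""
    events = []       # each event is a list of member article vectors
    assignments = []  # event index assigned to each article, in order
    for text in articles:
        vec = Counter(text.lower().split())
        best_idx, best_sim = -1, threshold
        for i, members in enumerate(events):
            # compare against the event centroid (sum of member vectors)
            centroid = sum(members, Counter())
            sim = cosine(vec, centroid)
            if sim >= best_sim:
                best_idx, best_sim = i, sim
        if best_idx == -1:
            events.append([vec])
            assignments.append(len(events) - 1)
        else:
            events[best_idx].append(vec)
            assignments.append(best_idx)
    return assignments
```

Articles about the same story share enough vocabulary to fall into one event, while unrelated articles open new events.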
Phase 2: Disagreement Based Ranking
Now that we have all the articles for one event grouped together, for each article A in an event E we want to rank the other articles belonging to E according to their disagreement with A. The following are our suggestions for performing this task.
When considering political news articles, one idea that may come to mind for detecting disagreement is identifying the political orientation of each article; articles from different orientations usually disagree.
The questions that emerge are: what are these orientations, and how do we detect them?
Well, orientations vary by region. In Western politics, for instance, the political spectrum is fairly uniform, with a clear dichotomy between left and right. In Arabic politics, however, the picture is blurred by various orientations, which may be organizational or religious.
Once these orientations are determined, the problem can be treated as a normal text classification problem.
For more details about the political orientation identification task, you can refer to our previous post, Political Orientation Detection – AI and NLP Approach.
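As a toy illustration of orientation detection, the sketch below scores an article against hand-built keyword lexicons, one per orientation. The labels and lexicons are purely illustrative; a real system would train a supervised text classifier on labelled articles instead.

```python
# Illustrative keyword lexicons, one per orientation (hypothetical labels).
ORIENTATION_LEXICONS = {
    "left":  {"welfare", "regulation", "equality", "union"},
    "right": {"deregulation", "tradition", "tax", "market"},
}

def classify_orientation(text: str) -> str:
    """Assign the orientation whose lexicon overlaps most with the text."""
    tokens = set(text.lower().split())
    scores = {label: len(tokens & words)
              for label, words in ORIENTATION_LEXICONS.items()}
    best = max(scores, key=scores.get)
    # fall back to "unknown" when no lexicon word matches at all
    return best if scores[best] > 0 else "unknown"
```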
However, this approach has several limitations:
- Genre-specific: Solves the problem only for political articles.
- Region-specific: Orientations vary between regions.
- Not a ranking: The problem can no longer be treated as a ranking task.
- Vague base: Identifying political orientation tends to be vague. The orientations are indistinct, and the classification features are unclear.
- Weak assumption: Different orientations don’t always mean disagreement in viewpoints. Two different orientations may disagree on some concepts while agreeing on others.
Entity-Level Sentiment Analysis
Recalling that an event is characterized by a set of entities, a viewpoint about an event can be interpreted as sentiment polarities against its entities. The entity-level sentiment analysis is a separate NLP task by itself. You can find more details about it and its solutions in our post: Aspect-level Vs Entity-level Sentiment Analysis.
In this suggested method, we rank the articles according to the agreement of their views about the named entities.
The suggested steps to rank an article are as follows:
1. Extract the named entities from the articles, using a named entity recognition system.
2. Find the polarity that corresponds to each entity, using an entity-level sentiment analysis system.
3. Segment the clitics from the named entities for easier matching. This step is particularly important for languages like Arabic. However, the risk of segmentation errors is high when working at the entity level, e.g. “الميتا” (Almeta) is likely to be segmented as “الـ+ميتا”. Thus, the quality of this step depends heavily on the segmentation system’s performance.
4. Unify the entities: one entity may be mentioned multiple times in an article, with varying polarities. Aggregate all the polarities of an entity into one value by averaging them.
5. Represent each article with a vector, where each element corresponds to the polarity of a named entity in the article. The dimension of this vector equals the number of named entities appearing across all the articles belonging to the same event.
6. Finally, calculate the similarity between two articles by finding the distance between their vectors.
7. An additional step could be weighting the entities according to their significance in the event. To obtain these weights, for each article we generate the same vector as in step 5, but with the entities’ TF-IDF values as its elements. Then, all the articles’ vectors are aggregated by averaging them; the resulting vector contains the entity weights. These weights are applied by multiplying the weights vector element-wise with the vector generated in step 5.
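A minimal sketch of the polarity-vector representation and distance computation described above (the clitic segmentation and TF-IDF weighting steps are omitted for brevity). All names and example polarities are illustrative:

```python
from math import sqrt

def polarity_vector(entity_polarities, vocabulary):
    """Represent an article as a vector of per-entity polarities.
    `entity_polarities` maps entity -> list of polarity scores in [-1, 1];
    repeated mentions of one entity are unified by averaging.
    Entities absent from the article get polarity 0."""
    return [sum(p) / len(p) if (p := entity_polarities.get(e)) else 0.0
            for e in vocabulary]

def cosine_distance(u, v):
    """Distance between two article vectors: 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return 1.0 - (dot / (nu * nv) if nu and nv else 0.0)

# Example: three articles about one event, with made-up polarities.
vocab = ["Aramco", "Iranian Army"]
v1 = polarity_vector({"Iranian Army": [-0.8, -0.6], "Aramco": [0.2]}, vocab)
v2 = polarity_vector({"Iranian Army": [-0.7], "Aramco": [0.1]}, vocab)
v3 = polarity_vector({"Iranian Army": [0.9]}, vocab)
```

Articles sharing a viewpoint (v1, v2) end up close in this space, while an article with the opposite polarity toward the same entity (v3) ends up far away.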
What features can we consider other than the ones related to the sentiment polarities?
The intuitive answer is the words used for expressing an opinion. These words should differ between articles with different viewpoints, and they come in two types:
- Words that usually carry polarities, and can therefore be captured by sentiment lexicons.
- Words that express different viewpoints but may not be captured by sentiment lexicons.
In general, features related to the article content can be represented using TF-IDF, or perhaps distributed embeddings like doc2vec. While this representation covers the whole text, the first-type words can also be separated into their own feature vector after extracting them using sentiment lexicons.
Features related to the first-type words are:
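As an illustration, a bare-bones TF-IDF representation of the article content can be computed with the standard formula; in practice a library implementation would be preferable:

```python
from collections import Counter
from math import log

def tfidf_vectors(docs):
    """Represent each document as a dict of word -> TF-IDF weight,
    using raw term frequency and log inverse document frequency."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(w for toks in tokenized for w in set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({w: (c / len(toks)) * log(n / df[w])
                        for w, c in tf.items()})
    return vectors
```

Words appearing in every document get weight zero, while words specific to one article are emphasized.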
- The overall article polarity: the average polarity of the article’s polar words, extracted using a sentiment lexicon.
- The count of positive and negative words in the article, again using a sentiment lexicon.
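These two lexicon-based features can be sketched as follows. The mini lexicon here is purely illustrative; a real system would use a full published sentiment lexicon for the target language.

```python
# Hypothetical mini sentiment lexicon: word -> polarity in [-1, 1].
LEXICON = {"convinced": 0.3, "victory": 0.8, "attack": -0.7,
           "accusation": -0.6, "incident": -0.2}

def lexicon_features(text):
    """Overall polarity plus positive/negative word counts."""
    scores = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    positives = sum(1 for s in scores if s > 0)
    negatives = sum(1 for s in scores if s < 0)
    overall = sum(scores) / len(scores) if scores else 0.0
    return {"overall_polarity": overall,
            "positive_count": positives,
            "negative_count": negatives}
```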
Another answer to this question is to consider metrics that indicate the type of an opinion, which should, in theory, differ between articles with different viewpoints, such as:
- Abusive: Demeaning and abusive language.
- Obscenity: Obscene or profane language.
- Racism: Demeaning and abusive language targeted towards a particular ethnicity.
- Sexism: Demeaning and abusive language targeted towards a particular gender.
- Insults: Scornful remarks directed towards an individual.
- Threats: Expressing a wish or intention for pain, injury, or violence against an individual or group.
All these features, along with the entity-level sentiment analysis features discussed in the previous section, should be aggregated into one weighted sum: the ranking formula.
The weights in the ranking formula are determined manually.
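A minimal sketch of this manually weighted ranking formula, assuming each per-pair feature has already been normalised to [0, 1]; the feature names and weight values are illustrative:

```python
# Manually chosen weights (illustrative); keys name the pair features.
WEIGHTS = {"entity_polarity_distance": 0.5,
           "overall_polarity_gap": 0.3,
           "abusive_score_gap": 0.2}

def disagreement_score(features):
    """Ranking formula: weighted sum of per-pair disagreement features."""
    return sum(WEIGHTS[name] * value for name, value in features.items())

def rank_by_disagreement(candidates):
    """Sort candidate article ids, most contrary first.
    `candidates` maps article id -> feature dict."""
    return sorted(candidates,
                  key=lambda art: disagreement_score(candidates[art]),
                  reverse=True)
```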
To overcome the limitations of the previous method, we suggest training a pairwise classifier. Its input is the different similarity values between two articles, concatenated into one vector; its output is a value indicating whether the two articles agree on one viewpoint or not. By learning to link the input to the corresponding output, the classifier automatically determines the weights of the ranking formula.
How to get training data?
Since we already assume an event detection system for the first phase of the problem, we can collect an event detection test collection in which articles reporting the same event are labelled; you can refer to Building a Test Collection for Event Detection Systems Evaluation for details about the collection process. Then, for each event in the test collection, we generate all possible article pairs and, for each pair, decide whether they disagree or not.
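The pair-generation step can be sketched as below; the resulting pairs would then be labelled by annotators as agree/disagree and fed, together with their similarity features, to the pairwise classifier:

```python
from itertools import combinations

def generate_pairs(events):
    """For each labelled event, emit every unordered article pair.
    `events` maps event id -> list of article ids; each returned tuple
    is (event_id, article_a, article_b), awaiting a human agree/disagree
    label."""
    pairs = []
    for event_id, article_ids in events.items():
        for a, b in combinations(sorted(article_ids), 2):
            pairs.append((event_id, a, b))
    return pairs
```

An event with N articles yields N·(N−1)/2 training pairs, so even a modest test collection produces a usable training set.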
Two remarks on this approach:
- Treating the problem as classification rather than a ranking task has performance benefits. For an event of size N, we would have to rank all N articles against a new incoming article to obtain a ranked list of contrary-view articles. Instead, we can compare against the previously recognized P viewpoints, where P << N and each viewpoint can be represented, for example, by the average of its articles’ vectors.
- Using the similarities as the features of the classifier should ideally isolate it from variations in genre, region, etc.; thus we can train a single global classifier.
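The viewpoint-centroid idea above (comparing a new article against P averaged viewpoint vectors rather than all N articles) can be sketched as follows; all names and vectors are illustrative:

```python
def centroid(vectors):
    """Average a list of equal-length article vectors element-wise."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def nearest_viewpoint(article_vec, viewpoints):
    """Return the id of the viewpoint centroid closest to the article,
    using squared Euclidean distance."""
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(viewpoints, key=lambda vp: dist(article_vec, viewpoints[vp]))

# Example: two viewpoints, each the centroid of its member articles.
viewpoints = {"pro":  centroid([[1.0, 0.0], [0.8, 0.2]]),
              "anti": centroid([[-1.0, 0.0], [-0.9, -0.1]])}
```

A new article is then matched against just these P centroids, after which the contrary viewpoints can be reported directly.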
So far, our suggestions have depended heavily on named entities. Let’s consider the named entity linking task and see how it can improve the algorithm’s understanding of the named entities.
Named entity recognition (NER) systems identify occurrences of named entities in text and classify them into predefined categories. Named entity linking (NEL), on the other hand, assigns a unique identity to the entities mentioned in the text. In other words, NEL is the task of linking entity mentions in a text to their corresponding entries in a knowledge base.
But how does it help us in our mission?
Consider the following example:
| In English | In Arabic |
|---|---|
| Aljoubair said during a press conference in the British capital London: “we are convinced, based on the evidence we have, that the Iranian Army had a role to play in the Aramco attacks” | وقال الجبير في مؤتمر صحفي بالعاصمة البريطانية لندن “مقتنعون من خلال الأدلة الموجودة لدينا بتورط الجيش الإيراني في هجمات أرامكو” |
One of the entities extracted from this text would be “الجيش الإيراني” (the Iranian Army), and we can see that the sentence conveys a negative sentiment towards this entity.
Now consider the following excerpt describing the same event just in a different way:
| In English | In Arabic |
|---|---|
| Aljoubair repeated his accusation that the Iranian armed forces were behind the Aramco incident | وجدد الجبير اتهامه للقوات العسكرية الإيرانية بالوقوف وراء حادثة أرامكو |
In order to understand the articles better, it is important for our contrary view detection system to be able to tell that the entities “الجيش الإيراني” (the Iranian Army) and “القوات العسكرية الإيرانية” (the Iranian armed forces) represent the same real-world entity.
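As a toy sketch of this, a dictionary-based linker can map both surface forms to one canonical identifier; the alias table and identifiers below are hypothetical, and a real NEL system would query a knowledge base such as DBpedia instead.

```python
# Hypothetical alias table mapping surface forms to knowledge-base IDs.
ALIAS_TABLE = {
    "الجيش الإيراني": "Q_iranian_army",              # the Iranian Army
    "القوات العسكرية الإيرانية": "Q_iranian_army",   # the Iranian armed forces
    "أرامكو": "Q_aramco",                             # Aramco
}

def link_entity(mention: str):
    """Resolve an entity mention to a canonical KB identifier.
    Handles the Arabic clitic preposition 'لـ' fused with the definite
    article 'ال' (written 'لل'), restoring 'ال' before lookup."""
    mention = mention.strip()
    if mention not in ALIAS_TABLE and mention.startswith("لل"):
        mention = "ا" + mention[1:]
    return ALIAS_TABLE.get(mention)
```

With both mentions linked to one identifier, the entity-polarity vectors from the previous section share a dimension instead of splitting one entity into two.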
For further information about NEL and how to implement it, refer to our post Aspect Detection and Named Entity Linking (NEL): Using SPARQL and DBpedia.
In this post, we introduced a new task, Contrary View Detection, with its two phases: 1. article grouping and 2. disagreement-based ranking. We presented our suggestions for solving this problem, including methods to use and features to represent the problem.
Did you know that we use all this and other AI technologies in our app? See what you’re reading now applied in action: try our Almeta News app. You can download it from Google Play or Apple’s App Store.