In our effort to provide the best news feed out there, one of the goals we are trying to achieve here at Almeta is to capture the interaction between different news outlets and how the coverage of the same event is presented by different views and different publishers.
Imagine having the ability to combine all the articles from various news outlets that are talking about a certain event in one place that gives you a summary of all of the coverage while at the same time providing you with the different views on this event.
The process of Named Entity Linking (NEL) is a crucial part of achieving this goal.
To understand NEL we must first understand the core concept of Information Extraction (IE).
While IE encompasses various applications from DNA mapping to weather and time series forecasting, the common aspect in all of these tasks is the fact that we are trying to extract structured information from unstructured data.
The original data can be measurements from a telescope or texts from Wikipedia but the goal remains the same.
Let us assume that we take the following sentence out of a paragraph:
… and it is believed that Tim was born in London on the 8th of June 1955 …
The process of IE on this sentence is
Now let us take a deeper look at this “information extraction” algorithm.
The process can be done in three simple steps:
- Find the entities in the text (in our case Tim, London and the date)
- Disambiguate the entities: for each entity found in the text, we need to know which physical entity it corresponds to. For example, the disambiguator has found from the context that we are talking about Tim Berners-Lee and has linked the mention to a resource in a knowledge base like Wikipedia.
- The last step is to find relations between the entities, for example, the relation between Tim and London is that he was born there
The process seems simple enough for a human annotator, but how would a computer be able to do it and do it efficiently?
The aforementioned list represents 3 different tasks in NLP:
- Named Entity Recognition (NER): given a text find all named entities and assign to each of them a class (Person, Place, …)
- Named Entity Linking (NEL), also called Named Entity Disambiguation: given a text and a list of entities from it, assign each entity to a resource from a knowledge base
- Relation Extraction (RE): given a text and a list of entities from it, find the relations that link these entities
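To make the three tasks concrete, here is a minimal sketch that runs them on the example sentence. Everything in it is an assumption made for illustration: the gazetteer, the date pattern, the tiny knowledge base, the `wiki:` identifiers and the single "born in" rule stand in for trained models and a real knowledge base.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    text: str   # surface form as it appears in the sentence
    label: str  # coarse class assigned by the NER step


# Toy gazetteer standing in for a trained NER model (illustrative only).
GAZETTEER = {"Tim": "Person", "London": "Place"}
DATE_PATTERN = re.compile(r"\d{1,2}(?:st|nd|rd|th) of \w+ \d{4}")

# Toy knowledge base: (surface form, class) -> resource id (hypothetical ids).
KNOWLEDGE_BASE = {
    ("Tim", "Person"): "wiki:Tim_Berners-Lee",
    ("London", "Place"): "wiki:London",
}


def recognize(sentence):
    """NER: find entity mentions and assign each of them a class."""
    words = re.findall(r"\w+", sentence)
    entities = [Entity(w, GAZETTEER[w]) for w in words if w in GAZETTEER]
    entities += [Entity(m.group(0), "Date") for m in DATE_PATTERN.finditer(sentence)]
    return entities


def link(entities):
    """NEL: map each mention to a knowledge-base resource, or None if unknown."""
    return {e.text: KNOWLEDGE_BASE.get((e.text, e.label)) for e in entities}


def extract_relations(sentence, entities):
    """RE: one hand-written pattern for the 'born in' relation."""
    known = {e.text for e in entities}
    match = re.search(r"(\w+) was born in (\w+)", sentence)
    if match and match.group(1) in known and match.group(2) in known:
        return [(match.group(1), "born_in", match.group(2))]
    return []


sentence = "it is believed that Tim was born in London on the 8th of June 1955"
entities = recognize(sentence)
links = link(entities)
relations = extract_relations(sentence, entities)
```

A real system would replace each function with a learned model, but the interfaces stay roughly the same: text in, entities out, then links and relations on top of those entities.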
NER VS NEL
A named entity is a real-world object, such as a person, a location or an organization. NER identifies and classifies named entity occurrences in text into pre-defined categories. It is usually modelled as the task of assigning a tag to each word in a sentence. NER tells us which words are entities and what their types are.
On the other hand, NEL assigns a unique identity to each entity mentioned in the text. In other words, NEL is the task of linking entity mentions in text to their corresponding entities in a knowledge base. The target knowledge base depends on the application, but we can use knowledge bases derived from Wikipedia for open-domain text.
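The difference is easiest to see in the outputs. The sketch below assumes the common BIO tagging scheme for NER (B- opens an entity, I- continues it, O is outside any entity) and a hypothetical lookup table for the linking step; the tags and the `dbpedia:` identifiers are hand-written for illustration, not produced by a model.

```python
def decode_bio(tokens, tags):
    """Group BIO-tagged tokens into (mention, class) spans."""
    spans, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts: close the one in progress, if any.
            if current:
                spans.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            # Continuation of the entity in progress.
            current.append(token)
        else:
            # "O" tag (or an inconsistent I- tag): close the entity in progress.
            if current:
                spans.append((" ".join(current), label))
            current, label = [], None
    if current:
        spans.append((" ".join(current), label))
    return spans


# What NER gives us: one tag per word.
tokens = ["Tim", "Berners-Lee", "was", "born", "in", "London"]
tags = ["B-PER", "I-PER", "O", "O", "O", "B-LOC"]
mentions = decode_bio(tokens, tags)

# What NEL adds: a unique identity per mention (hypothetical lookup table).
KB_IDS = {
    ("Tim Berners-Lee", "PER"): "dbpedia:Tim_Berners-Lee",
    ("London", "LOC"): "dbpedia:London",
}
linked = [(text, label, KB_IDS.get((text, label))) for text, label in mentions]
```

NER stops at `mentions`; NEL is what turns each of those spans into a knowledge-base identifier.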
The space of possible applications of NEL is simply massive. Here is a list of what you can do:
- By disambiguating the entities we can have more accurate search services
- We can extract richer information from our articles and compose it into a semantic web (see the figure below). This allows us to answer questions like “who is married to whom?” or “how many children does X have?”
- And way, way more
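As a rough sketch of the semantic web idea, linked entities and their relations can be stored as (subject, predicate, object) triples and queried by pattern. The people and predicates below are hypothetical placeholders, not data from a real knowledge base.

```python
# Hypothetical triples, as they might be extracted from a set of articles.
TRIPLES = [
    ("Person_A", "spouse", "Person_B"),
    ("Person_A", "child", "Person_C"),
    ("Person_A", "child", "Person_D"),
    ("Person_B", "birthPlace", "City_X"),
]


def query(subject=None, predicate=None, obj=None):
    """Return every triple matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for s, p, o in TRIPLES
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


# "Who is married to whom?"
marriages = [(s, o) for s, _, o in query(predicate="spouse")]

# "How many children does Person_A have?"
children_count = len(query(subject="Person_A", predicate="child"))
```

Such queries only work if every mention of a person was linked to the same identifier in the first place, which is exactly what NEL provides.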
But how does it help our goal? Consider the following phrase from a news article:
وقال الجبير في مؤتمر صحفي بالعاصمة البريطانية لندن “مقتنعون من خلال الأدلة الموجودة لدينا بتورط الجيش الإيراني في هجمات أرامكو”
The quote translates to:
Aljoubair said during a press conference in the British capital, London: “We are convinced, based on the evidence we have, that the Iranian Army was involved in the Aramco attacks.”
One of the entities extracted from this text would be “الجيش الإيراني” (the Iranian Army), and we can see that the sentence conveys a negative sentiment towards this entity. Now consider the following excerpt describing the same event, just in a different way:
و جدد الجبير اتهامه للقوات العسكرية الإيرانية بالوقوف وراء حادثة أرامكو
Which translates to:
Aljoubair repeated his accusation that the Iranian armed forces were behind the Aramco incident.
In order to truly understand these articles, it is important for our platform to be able to say that the entities “الجيش الإيراني” (the Iranian Army) and “القوات العسكرية الإيرانية” (the Iranian armed forces) represent the same physical thing.
This is a basic building block in finding relations between different articles.
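Here is a minimal sketch of that building block, assuming an NEL system has already mapped each mention to a knowledge-base identifier. The `kb:` identifiers, the article names and the lookup table are hypothetical; the mention strings are the ones from the two excerpts (the second without its preposition prefix).

```python
# Hypothetical NEL output: surface mention -> knowledge-base identifier.
LINKED_MENTIONS = {
    "الجيش الإيراني": "kb:Iranian_Armed_Forces",
    "القوات العسكرية الإيرانية": "kb:Iranian_Armed_Forces",
    "أرامكو": "kb:Saudi_Aramco",
}

# Mentions found by NER in each article (illustrative).
ARTICLES = {
    "article_1": ["الجيش الإيراني", "أرامكو"],
    "article_2": ["القوات العسكرية الإيرانية", "أرامكو"],
}


def entities_of(mentions):
    """Resolve surface mentions to the set of entities they refer to."""
    return {LINKED_MENTIONS[m] for m in mentions if m in LINKED_MENTIONS}


def shared_entities(first, second):
    """Entities that two articles have in common, regardless of wording."""
    return entities_of(ARTICLES[first]) & entities_of(ARTICLES[second])


common = shared_entities("article_1", "article_2")
```

Comparing the raw strings would find only one shared mention between the two articles; comparing the linked identifiers finds both shared entities.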
Hopefully, you now understand what NEL is and why it is a good idea to have it around. Now let us find out how we can do it.
The Lazy Way
If you are as lazy as me and happen to be working on English, German or Portuguese texts, then you are in for a treat. You basically need to do nothing, since the people at DBpedia Spotlight have a service ready for you that will handle all of the fuss and wikify your text directly. Don’t believe me? Test the demo. Technically, the service well
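For the curious, calling the service from code could look roughly like the sketch below. The endpoint URL, the `text` and `confidence` parameters, and the `Resources`, `@surfaceForm` and `@URI` response fields reflect my understanding of the public DBpedia Spotlight REST API and should be checked against its current documentation; the helper names are my own.

```python
import json
import urllib.parse
import urllib.request

# Assumed public endpoint for English; other languages use another path segment.
SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"


def build_request(text, confidence=0.5):
    """Build a GET request that asks Spotlight for a JSON annotation of the text."""
    query = urllib.parse.urlencode({"text": text, "confidence": confidence})
    return urllib.request.Request(
        f"{SPOTLIGHT_URL}?{query}",
        headers={"Accept": "application/json"},
    )


def parse_resources(payload):
    """Extract (surface form, DBpedia URI) pairs from a decoded JSON response."""
    return [(r["@surfaceForm"], r["@URI"]) for r in payload.get("Resources", [])]


def annotate(text, confidence=0.5):
    """Send the text to the service and return the linked entities."""
    request = build_request(text, confidence)
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_resources(json.load(response))


# Usage (requires network access):
# print(annotate("Tim Berners-Lee was born in London."))
```

Raising `confidence` should make the service return fewer, more certain links, which is usually what you want for news text.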
Follow this section if you are lazy and working on other languages like Arabic. We can still use DBpedia to link our data through a very simple approximation.
Here, we are assuming that the NER step is already implemented. There are plenty of off-the-shelf packages to achieve this. Cool? Let’s talk NEL.
Now you have a system that takes in a text and spits out a list of entities with their classes. The simplest way to disambiguate these entities takes 3 steps. For every entity, do the following:
- Find a ranked list of resources from a knowledge base that match your entity. In the case of DBpedia, you can use their amazing text search service. The ranking is usually based on occurrence counts; one commonly used metric is the inverse candidate frequency (ICF) measure, which weights the words in the context based on their overall frequency.
- Filter out the list based on the named entity class. If your entity is a person, you won’t need resources that are places or dates.
- Select the most frequent candidate
The shortcoming of this approach is pretty clear: we link every mention of an entity to its most frequent disambiguation, whatever the context says. But if you can accept the hit to the performance, then Wikifier is your guy.
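The three steps can be sketched in a few lines. The candidate index below is an in-memory stand-in for a knowledge-base search service, and its resources, classes and frequencies are made up for illustration.

```python
# Hypothetical candidate index: surface form -> (resource, class, frequency).
CANDIDATES = {
    "Paris": [
        ("kb:Paris_Hilton", "Person", 1200),
        ("kb:Paris", "Place", 9000),
        ("kb:Paris_(mythology)", "Person", 300),
    ],
}


def link_entity(mention, ner_class, index=CANDIDATES):
    """Link a mention to the most frequent candidate of a compatible class."""
    # Step 1: ranked list of matching resources, most frequent first.
    ranked = sorted(index.get(mention, []), key=lambda c: c[2], reverse=True)
    # Step 2: drop candidates whose class disagrees with the NER class.
    compatible = [c for c in ranked if c[1] == ner_class]
    # Step 3: select the most frequent remaining candidate, if there is one.
    return compatible[0][0] if compatible else None


place = link_entity("Paris", "Place")
person = link_entity("Paris", "Person")
```

Note how every “Paris” tagged as a person is linked to the same resource, whatever the surrounding text says. That is the shortcoming described above.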
The Right Way, aka the “Really Messed Up, I Have No Life” Way
OK, don’t freak out, it is not that hard.
The fact is: NEL is a really wide research area and usually the task is converted into an ML problem. The papers in the reference section are some of the best options to start with. All of the aforementioned papers are cross-lingual or multi-lingual and some of them have their own implementations open-sourced.
It would be really hard for us to cover the whole field in this article, and therefore we will focus on one particularly amazing work: DeepType (Raiman and Raiman, 2018). This paper by OpenAI is, at the time of writing, the state of the art on this task.
In this approach, the authors rely on a very detailed NER that can find very specific classes like Animal, Road, Vehicle or Region. From these classes they add symbolic constraints on the output of their ML model by splitting the learning process into 2 steps:
- Finding the list of classes to be used based on DBpedia classes and types;
- Training a constrained ML model using those types;
I am obviously glossing over a ton of details and if you like math then you should totally give the paper a read.
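To give a feel for why fine-grained types help, here is a heavily simplified sketch: each candidate's link prior is multiplied by the probability that a type classifier assigns to the candidate's type in the current context. This is not the authors' implementation; the candidates, priors and type probabilities are invented for illustration, and a trained neural classifier would normally produce the probabilities.

```python
def pick_by_prior(candidates):
    """Baseline: choose the candidate with the highest link prior."""
    return max(candidates, key=lambda c: c[2])[0]


def pick_with_types(candidates, type_probs):
    """Rescore each candidate by prior * P(type | context) and keep the best."""
    scored = [
        (resource, prior * type_probs.get(entity_type, 0.0))
        for resource, entity_type, prior in candidates
    ]
    return max(scored, key=lambda s: s[1])[0]


# Hypothetical candidates for the mention "jaguar": (resource, type, prior).
candidates = [
    ("kb:Jaguar_Cars", "Vehicle", 0.7),
    ("kb:Jaguar", "Animal", 0.3),
]

# Hypothetical classifier output for the context "the jaguar hunts at night".
type_probs = {"Animal": 0.9, "Vehicle": 0.1}

baseline = pick_by_prior(candidates)
constrained = pick_with_types(candidates, type_probs)
```

The prior alone picks the car maker, while the type evidence from the context flips the decision to the animal.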
In this article, we introduced the task of named entity linking, and explained what it is and why you should put in the effort to incorporate it into your system. We also introduced some ways to tackle it, and left you with a reading list to delve deeper: check the references list.
[1] Raiman, Jonathan Raphael, and Olivier Michel Raiman. “DeepType: multilingual entity linking by neural type system evolution.” Thirty-Second AAAI Conference on Artificial Intelligence. 2018.
[2] Le, Phong, and Ivan Titov. “Improving entity linking by modeling latent relations between mentions.” arXiv preprint arXiv:1804.10637 (2018).
[3] Ganea, Octavian-Eugen, and Thomas Hofmann. “Deep joint entity disambiguation with local neural attention.” arXiv preprint arXiv:1704.04920 (2017).
[4] Kolitsas, Nikolaos, Octavian-Eugen Ganea, and Thomas Hofmann. “End-to-end neural entity linking.” arXiv preprint arXiv:1808.07699 (2018).