Event Detection – Almeta’s Research Gist

News stories are published every day by many news agencies, and users often receive news streams from multiple sources at once. Browsing such a large information space without guidance is not effective.

Suppose, for example, that a person has just returned from a long vacation and wants to find out what happened while they were away. Reading the whole news collection is impossible, and formulating specific queries about facts they do not yet know is unrealistic. As a result, it is difficult to retrieve and check all the potentially relevant stories.

Thus, it is useful to have an intelligent agent to automatically locate related stories in the continuous stream of news articles.

This article presents a gist of all of our research on event detection, and we have a lot of it. If you are interested in more details, you can follow the individual articles referenced throughout.

How Do Humans and Machines Perceive News Events

As humans, what do we perceive as an event in a stream of news? We can say that an event is a group of articles that:

  • Are very similar to each other. They usually share the same vocabulary and named entities, such as location names, people’s names, and organization names.
  • Are typically reported within a relatively brief time window, e.g. 1–4 weeks, which is the period when the actual event is taking place.

In order to build systems that can find articles describing the same event using machine learning algorithms, we need a way to represent articles numerically while capturing the aforementioned features of events (such as similarity in vocabulary or closeness in time).

The usual way to represent an article in many NLP applications is the bag-of-words scheme, where the article is represented as a vector of its word counts. Such vectors can be easily handled by machine learning systems. If you are not familiar with the concept of bag-of-words, please follow this very simplified explanation.
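As a very simplified illustration, a bag-of-words vector can be built with nothing more than a word counter (the function name and example sentence below are ours, for illustration only):

```python
from collections import Counter

def bag_of_words(text):
    """Represent an article as a sparse vector of word counts."""
    tokens = text.lower().split()
    return Counter(tokens)

doc = "protests erupted in the capital as protest leaders called for protests"
vec = bag_of_words(doc)
# The vector ignores word order but keeps vocabulary frequencies,
# which is enough to compare the wording of two articles.
```

Two articles about the same event tend to produce vectors with many overlapping keys, which is exactly the vocabulary similarity described above.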

While such a simple representation can capture vocabulary similarities, it does not encode any information about the publishing time of the article, nor the entities that appear in it (locations, politicians’ names, …). To resolve this issue we followed the 5Ws1H method.

5Ws1H Representation

5W1H is a technique employed in journalism to gather all information about a story, to turn it into a news article. It consists of six questions: What, Who, Where, When, Why, and How. An article is not considered complete until all of these six questions are answered.

It seems that little agreement can be reached on what counts as part of What or Why, or even on how to represent How, but Where, Who, and When are more concrete and can give better results.

Who and Where can be represented as the named entities (proper nouns, organizations, locations, etc.) that appear in the text.

Time information (When) is represented by the publish date of the article.

What can be represented by the topic of the article, or in a more concrete sense using the aforementioned bag-of-words method.
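As a rough sketch, these three concrete facets can be bundled into one structure. Here `represent_article` and its field names are illustrative, the entity list is passed in by hand, and a real system would obtain it from a named-entity recognition model:

```python
from collections import Counter
from datetime import date

def represent_article(text, entities, published):
    """Combine the Who/Where (named entities), When (publish date),
    and What (bag-of-words) facets into one structured representation."""
    return {
        "who_where": set(entities),             # entities from an NER step
        "when": published,                      # publishing date
        "what": Counter(text.lower().split()),  # vocabulary counts
    }

article = represent_article(
    "Protesters gathered in Berlin on Monday",
    entities=["Berlin"],
    published=date(2020, 3, 2),
)
```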

Can We Find a More Detailed Representation?

All of the aforementioned information is straightforward to extract from the article. Yet there is a lot of information about the event that we still miss. For instance, while we can extract every named entity in the article (places or people’s names), not all of these names are equally important to the event. The basic missing piece is the participants and their relation to the event.

For example, in the case of a protest movement, the names of the protest leaders or the politicians who took part in the protest are much more important than the name of the street where the demonstration took place. On the other hand, in the case of a terrorist attack, the location of the attack is far more important.

To handle this issue we tackled the task of event extraction; you can read all about it in our dedicated piece. The goal of this task is to extract detailed information about the events that appear in an article’s body: which entities participated in them, and in what way. Although we are not using event extraction in our current system, we are planning to add it in future versions.

How to Train Machines to Detect Events in The News Stream

We have already agreed on possible ways to represent a news article so that it is interpretable by computers while encoding vital information about the event. The next step is to discuss the algorithms that can detect events using such representations.

Let us consider search queries, passport scans, barcode scans, your online shopping history, your photos on Instagram, your tweets on Twitter, the daily news articles, and more, and more…

The basic common denominator between all of these is that they are data streams: collections of information that grow gradually and dynamically.

Do you know what is also a data stream?

Events in the news feed.

Time crumbles things; everything grows old and is forgotten under the power of Time


The main problem with data streams for machine learning is their evolving nature. Most machine learning systems are trained statically on some historical data, then asked to repeat the same task over and over again. Because of the dynamic nature of data streams, such systems quickly become obsolete.

For example, a model trained on last month’s news would be able to detect the events in those articles. However, when new events happen, such a model will have a hard time recognizing them. To keep performance high, we would need to retrain the model periodically on both newer data (to detect new events) and historical data (to recognize developments of older events). This leads to a long cycle of periodic retraining, which is costly in computation time, team effort, and storage.

Given a new article, a dream system should implement at least two functionalities:

  • Decide whether this article reports a new event or an old one.
  • If the article doesn’t report a new event, identify the old event it belongs to.

In our previous article, we presented sequential clustering algorithms and why we believe they are a very good candidate for our task.
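The two functionalities above can be sketched as a minimal sequential clustering loop. This is not Almeta’s actual implementation; the `assign` and `cosine` helpers and the threshold value are illustrative:

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def assign(article_vec, events, threshold=0.5):
    """Attach the article to the most similar known event,
    or open a new event if nothing is similar enough."""
    best, best_sim = None, 0.0
    for i, centroid in enumerate(events):
        sim = cosine(article_vec, centroid)
        if sim > best_sim:
            best, best_sim = i, sim
    if best is not None and best_sim >= threshold:
        events[best] += article_vec  # fold the article into the event
        return best
    events.append(article_vec.copy())  # start a brand-new event
    return len(events) - 1

events = []
i1 = assign(Counter("flood hits coastal city".split()), events)
i2 = assign(Counter("flood damages coastal city".split()), events)
```

Here two articles sharing most of their vocabulary end up in the same event, while an unrelated article would open a new one.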

What Have We Accomplished Already in Almeta

In this section, we give a brief overview of how we combined sequential clustering with the aforementioned article representation to build the first version of our event detection system.

The Plan

In our first trials, we experimented with representing the article as a large heterogeneous vector of features covering both the text and the publishing time of the article. However, this initial representation yielded sub-optimal results, mainly because each of these features comes from a different domain and has a different nature. To resolve this issue, we re-evaluated the way we calculate the similarity between articles (to decide whether they belong to the same event or not).

In this article, we explored in full detail the shortcomings of the initial approach and what can be done to improve the system performance. But let us try to summarize the approach here in less than 100 words.

In short, when comparing a new article with the representation of an event (to decide whether the article belongs to the event), instead of comparing two large heterogeneous representations as before, we created two separate representations: one for the time, and another for the text, which should answer the remaining questions such as who and how. Next, we learn to balance the impact of each of these representations on the system’s performance.
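One way to sketch this balancing act is a weighted blend of the two scores. The exponential time decay and the `alpha` and `half_life` knobs below are illustrative, not the tuned values from our experiments:

```python
from math import exp

def combined_similarity(text_sim, days_apart, alpha=0.7, half_life=7.0):
    """Blend textual similarity with a time-decay term.

    text_sim   -- similarity of the two text representations, in [0, 1]
    days_apart -- gap between the article and the event's last update
    """
    time_sim = exp(-days_apart / half_life)  # recent articles score near 1
    return alpha * text_sim + (1 - alpha) * time_sim
```

With this split, a fresh article scores higher than a month-old one even when their wording matches the event equally well, and the weight `alpha` can be tuned on real data.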

The Execution

Plans are useless, but planning is indispensable.

Dwight D. Eisenhower

Well, to be honest, our plan was very good. However, plans aren’t everything, and there are always new modifications needed.

In this article, we explore in detail the results of applying our algorithm to data from Almeta’s news feed, along with everything we did to get these results, such as model parameter tuning. We then perform a very detailed analysis of these results to identify the pitfalls of the current system.

When to “Decommission” a Stale Event?

One of the issues we didn’t account for in our plan is the age of an event. In very simple terms, as the algorithm processes more data, it finds more events. It is possible to keep track of all the events in the system, so that when a new article is processed it is compared against all historical events. However, this does not seem right: as we said before, real-world events cover a small period of time, and since real events emerge and then disappear from the news as time progresses, so should the events in our system.

Furthermore, storing 100 events instead of 10 means that when clustering a new article the system will have to compare it against all 100 instead of 10, and this can really hurt performance. You can read all about this issue in the same article, but let us try to summarize it here.

So, we had to ask the following question: when do we consider that an event has ended and is no longer covered by media?

One suggestion is to use the update rate: if no new articles cover an event for, say, 3 days, then we can consider the event ended. But the answer isn’t that trivial! Say we have a cluster with one article that has not been updated for three days, and another cluster with 100 articles that has not been updated for three days either. Clearly, these can’t be treated the same way: while the small cluster seems outdated, the larger one does not, as large events usually extend over a long period of time and may not be updated daily.

To resolve this matter, we created a delicate balance between the size of the event (the number of articles that have covered it) and the time of its last update.
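A toy version of such a balance might let the allowed silence period grow with the event’s size. The logarithmic rule and its parameters below are purely illustrative, not the balance used in our system:

```python
from math import log

def is_stale(cluster_size, days_since_update, base_days=3.0, scale=2.0):
    """Retire an event when its silence exceeds a grace period
    that grows with the event's size (bigger events live longer)."""
    allowed = base_days + scale * log(1 + cluster_size)
    return days_since_update > allowed
```

Under this rule, a one-article cluster silent for five days is decommissioned, while a 100-article cluster with the same silence is kept alive.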

Can We Build a System That Works with any Language?

Well, yes and no.

Yes, because language-agnostic event detectors do exist, and we have already explored this area of research; you can read more about it at the end of this article.

And no, mainly because the systems we found in the literature are, in our view, not robust enough for deployment and generally require more effort to update and maintain.

Overall, while we are not considering this task at the moment, we believe more research into language-independent event detection is needed.


In this article, we tried to cover, in a very simple manner, the results of our research here at Almeta on event detection. If you are interested in more details, please follow our individual articles.

Did you know that we use all this and other AI technologies in our app? See what you’re reading now applied in action: try our Almeta News app. You can download it from Google Play or Apple’s App Store.
