An Initial, Failed Solution For The Event Detection Task

In this post, we are trying to validate our initial solution for the event detection task.

If you’re not familiar with the task you can refer to our previous post about “How to” Event Detection in Media using NLP and AI.

To apply event detection to news articles, we plan to do the following:

  • Represent each article as a vector of expressive features.
  • Feed the vectorized articles into a sequential clustering model to aggregate the ones talking about the same event.

In this post, we’re solving the problem of event detection in the Arabic language.

Features Effectiveness

In this section, we’re trying to measure the effectiveness of different features in representing the articles.

Our methodology is based on the existence of a ground-truth dataset of news events (i.e. news articles aggregated under the events they report). Hence, an effective representation of the articles in the vector space is one that places the articles reporting the same event close to each other, while keeping them far from the articles reporting other events.

A Ground-Truth Events Dataset

In a previous effort, we collected a ground-truth events dataset which contains ~1,400 news articles covering ~120 different events. The dataset was collected following a methodology described in a previous post, Building a Test Collection for Event Detection Systems Evaluation.

Measuring the Quality of the Problem Representation

We measured the effectiveness of representation in two ways.


For each representation we want to validate:

  1. Represent the articles in the ground-truth dataset with this representation.
  2. Measure the overlap between the events according to this representation.

Representations that cause a higher overlap between the events are worse.

To measure the overlap between the events we used the silhouette score, which measures how similar an object (i.e. article) is to its own cluster (i.e. event) (cohesion) compared to other clusters (i.e. events) (separation). The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighbouring clusters. Values near 0 indicate overlapping clusters.
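This measurement can be sketched with scikit-learn's silhouette_score; the matrix X and the event_ids labels below are toy stand-ins for our article vectors and ground-truth events:

```python
# Sketch (toy data, not our dataset): silhouette score of a labelled collection.
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Two toy "events": articles drawn around two distant feature centroids.
X = np.vstack([rng.normal(0.0, 0.3, (20, 5)), rng.normal(2.0, 0.3, (20, 5))])
event_ids = np.array([0] * 20 + [1] * 20)

score = silhouette_score(X, event_ids, metric="euclidean")
print(round(score, 2))  # close to +1: well-separated events
```

With overlapping centroids the same call would return a score near 0, which is exactly the symptom we look for when comparing representations.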


Moreover, for visual exploration we plot the following:

  • All the articles after projecting them into 2D space using t-SNE.
  • The events, where each event is the mean of all of its articles vectors also projected using t-SNE.

You may notice events that are not associated with a group of points; this means the articles of such an event are scattered around, and this representation couldn't capture their similarities.
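The two projections described above can be produced jointly; since t-SNE cannot embed new points after fitting, articles and event centroids are stacked and projected in one call (the vectors below are hypothetical stand-ins):

```python
# Sketch (toy data): project articles and per-event means to 2D with t-SNE.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))          # article vectors (hypothetical)
events = rng.integers(0, 6, size=60)   # ground-truth event of each article

# Each event's point is the mean of its articles' vectors.
centroids = np.vstack([X[events == e].mean(axis=0) for e in range(6)])

# Project articles and centroids together so they share one 2D embedding.
emb = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(
    np.vstack([X, centroids]))
article_2d, event_2d = emb[:60], emb[60:]
print(article_2d.shape, event_2d.shape)
```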

The Explored Features

We explored features that may capture the answers to the 5W1H questions related to an event.

  • Who: this feature can be captured by the output of a NER model, specifically the persons (and possibly the organizations) mentioned in the article.
  • Where: this can also be captured by the locations extracted using a NER model.
  • When: we can use the publish date of the article to capture when the event took place.
  • What/Why/How: these are more complicated; we decided to capture them with the TF-IDF representation of the article content. Another option was to use word clusters, which can act as something similar to topics: each word in the article is replaced by its cluster number, then the resulting content is vectorized using TF-IDF as well. Check out our previous post about Arabic Words Clustering Using word2vec.
  • Other features: multi-word terms, which are simply frequent 2-grams and 3-grams. We believe these features can also capture something related to the answers of who/where, and also when.

We experimented with these features and their concatenations.
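Concatenating feature blocks can be sketched with scikit-learn; the tiny corpus and per-article entity strings below are hypothetical stand-ins for our data and NER output:

```python
# Sketch (hypothetical corpus): TF-IDF content features concatenated with
# TF counts of NER-extracted entities.
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["fires erupt in lebanon forests hariri asks for help",
        "fires erupt in syria forests civil defense contains them",
        "protests in lebanon against new taxes"]
ner_strings = ["hariri lebanon", "syria", "lebanon"]  # PERS + LOC per article

tfidf = TfidfVectorizer()   # What/Why/How: content features
tf_ner = CountVectorizer()  # Who/Where: raw entity counts
X = hstack([tfidf.fit_transform(docs), tf_ner.fit_transform(ner_strings)])
print(X.shape)  # (3, n_content_terms + n_entity_terms)
```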


Below are the results of applying the previously discussed features to our ground-truth event detection dataset.

Expressiveness Validation

To validate that the chosen features are as expressive as we think, we visualized the top-weighted features of each feature type in a word cloud:


Silhouette Score

The following table summarizes the results of some of our experiments

Feature | Filtering | Silhouette Score
TF-IDF (Uni) | max_df=0.1, min_df=10 | ~0.11
TF-IDF (Uni + Bi) | Uni: max_df=0.1, min_df=10; Bi: max_df=0.5, min_df=50 |
TF-IDF (Uni + Tri) | Uni: max_df=0.1, min_df=10; Tri: max_df=0.5, min_df=50 |
TF-IDF (Uni + Bi + Tri) | Uni: max_df=0.1, min_df=10; Bi: max_df=0.5, min_df=50; Tri: max_df=0.5, min_df=50 |
TF-IDF (W2V clusters) | | ~0.08
TF-IDF (Uni) + TF (PERS + LOC + ORG) | Uni: max_df=0.1, min_df=10; PERS, LOC, ORG: min_df=5 | ~0.11
TF-IDF (W2V clusters) + TF (PERS + LOC + ORG) | PERS, LOC, ORG: min_df=5 | ~0.11
TF-IDF (Uni) + TF (PERS) | PERS: min_df=5 | ~0.06
TF-IDF (Uni) + TF (LOC) | LOC: min_df=5 | ~0.03
TF-IDF (Uni) + TF (ORG) | ORG: min_df=5 | ~-0.02
  • max_df: a filtering threshold based on document frequency, e.g. max_df=0.1 means omitting the features that appeared in more than 10% of the documents.
  • min_df: a filtering threshold based on document frequency, e.g. min_df=5 means omitting the features that appeared in fewer than 5 documents.
  • The max_df & min_df threshold values were chosen by manually inspecting the features produced by the vectorizers.
  • Why use TF-IDF with W2V clusters? Simply because very frequent clusters tend to be stopwords, or at least not topic-related.
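These two thresholds can be illustrated concretely on a toy corpus (not our dataset):

```python
# Sketch: how max_df / min_df prune the vocabulary in scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the fire spread", "the fire stopped",
        "the protest grew", "the protest ended"]

# max_df=0.9 drops "the" (in 100% of docs); min_df=2 drops words seen once.
vec = TfidfVectorizer(max_df=0.9, min_df=2)
vec.fit(docs)
print(sorted(vec.vocabulary_))  # ['fire', 'protest']
```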

Observations: The silhouette scores, even the best ones, are not promising: they are too close to zero, which means the events highly overlap under these representations.


We’re not going to show all the projection plots, but here are two examples.

Following is the projection of the TF-IDF (Uni):

Here are two of the highly overlapped events (31, 90):

In English | In Arabic
The Lebanese prime minister Saad Al-Hariri asked for international help after struggling to contain forest fires. | تواصل سلسلة من حرائق الغابات في لبنان، ورئيس الوزراء الحريري يطلب مساعدات دولية لاحتوائها
More than 100 wildfires have erupted in the forests of three Syrian governorates; most of them have been brought under control by the Syrian Civil Defense forces. | نشوب أكثر من مائة حريق في غابات ثلاث محافظات سورية، استطاعت قوات الدفاع المدني السوري السيطرة على معظمها

We expected such events to be split apart according to their locations once the NER features were added.

Following is the projection of TF-IDF (Uni) + TF (PERS + LOC + ORG):

The events are obviously more scattered and overlapped than in the previous plot, which explains why this representation's silhouette score is lower.

Although the new representation using NEs could split events 31 and 90 according to their locations (one in Syria and the other in Lebanon), another problem emerged as events 90 and 34 overlapped:

In English | In Arabic
The Lebanese prime minister Saad Al-Hariri asked for international help after struggling to contain forest fires. | تواصل سلسلة من حرائق الغابات في لبنان، ورئيس الوزراء الحريري يطلب مساعدات دولية لاحتوائها
Protests in Lebanon over plans to impose new taxes | احتجاجات واعتصامات في شوارع لبنان، احتجاجاً على فرض ضرائب جديدة على الشعب اللبناني

Experiments With The Date Feature

We decided to use the publish date of the articles to model the time aspect of the events. Since date features have a very different meaning than the TF-IDF ones, it's probably not a good idea to combine them in one vector. However, we tried doing so anyway.

There are many ways to handle time data representation as machine learning features, but none of them seemed to be suitable for our problem. Here are some of our experiments:

  • Date as three features (day, month, year): since these features' ranges are much bigger than the range of the TF-IDF weights, the date features drowned out the effect of the TF-IDF features. Moreover, the day feature had the biggest effect, as it is the most frequently changing one with high values; e.g. articles published on day 20 of any month tend to be close to each other despite having different content, or even different publish months.
  • Date as a timestamp: we tried representing the date as a single timestamp feature to reduce the effect of the day feature alone. However, this feature's range is again much bigger than the range of the TF-IDF weights, which drowned out the TF-IDF features.
  • Date as a normalized timestamp: we normalized the timestamp to reduce its effect. However, at that point it no longer affected the results at all.
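A minimal sketch of the third variant, with a hypothetical stand-in for the TF-IDF matrix:

```python
# Sketch: append a min-max-normalized publish timestamp as one extra
# column next to the TF-IDF weights.
from datetime import datetime

import numpy as np

dates = [datetime(2019, 10, 14), datetime(2019, 10, 15), datetime(2019, 10, 20)]
ts = np.array([d.timestamp() for d in dates])

# Min-max normalize into [0, 1] so the column is on the TF-IDF scale.
ts_norm = (ts - ts.min()) / (ts.max() - ts.min())

X_tfidf = np.random.default_rng(0).random((3, 4))  # stand-in TF-IDF matrix
X = np.hstack([X_tfidf, ts_norm.reshape(-1, 1)])
print(X.shape)  # (3, 5)
```

As noted above, once squeezed into [0, 1] this single column carries almost no weight against hundreds of TF-IDF dimensions, which matches what we observed.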

To Wrap up

The chosen representation does not seem able to characterize the problem well:

  • There is high overlap between our ground-truth events.
  • NEs don't play a significant role, although they carry most of the information about an event.
  • Dates can't be added to the representation, although time is a significant characteristic of an event.

Training The Model

Let's try to validate our assumptions about the event detection feature representation by training and evaluating a model that uses it.

Evaluation Metrics

The evaluation metrics for clustering given a ground truth include:

Homogeneity score: A clustering result satisfies homogeneity if all of its clusters contain only data points that are members of a single class. Examples:

Splitting classes into more clusters can still be perfectly homogeneous:

True labels = [0, 0, 1, 1], Predicted labels = [0, 0, 1, 2]

Score = 1

Clusters that include samples from different classes do not make for homogeneous labelling:

True labels = [0, 0, 1, 1], Predicted labels = [0, 1, 0, 1]

Score = 0

Completeness Score: A clustering result satisfies completeness if all the data points that are members of a given class are elements of the same cluster. Examples:

Non-perfect labelling that assigns all classes members to the same clusters are still complete:

True Labels = [0, 0, 1, 1], Predicted labels = [0, 0, 0, 0]

Score = 1

If classes members are split across different clusters, the assignment cannot be complete:

True labels = [0, 0, 1, 1], Predicted labels = [0, 1, 0, 1]

Score = 0

V-Measure Score: The V-measure is the harmonic mean between homogeneity and completeness, thus perfect labelling is both homogeneous and complete. Examples:

Labellings that assign all classes' members to the same clusters are complete but not homogeneous, hence penalized:

True labels = [0, 0, 1, 2], Predicted Labels = [0, 0, 1, 1]

Score = 0.8

Labellings that have pure clusters with members coming from the same classes are homogeneous but unnecessary splits harm completeness and thus penalized as well:

True labels = [0, 0, 1, 1], Predicted Labels = [0, 0, 1, 2]

Score = 0.8

Hence, we chose V-Measure as our evaluation metric.

Clustering Algorithm

We used Birch, an online learning algorithm that constructs a tree data structure with the cluster centroids being read off the leaves. Birch has one hyper-parameter that plays a critical role in its final results: the threshold. In Birch, the radius of the sub-cluster obtained by merging a new sample with its closest sub-cluster must be less than the threshold; otherwise, a new sub-cluster is started. So, setting the threshold very low promotes splitting, and vice-versa.
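The effect of the threshold can be illustrated on synthetic data:

```python
# Sketch (synthetic data): Birch's `threshold` controls sub-cluster splitting.
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (30, 2)), rng.normal(5, 0.2, (30, 2))])

# A low threshold promotes splitting into many small sub-clusters...
many = Birch(threshold=0.1, n_clusters=None).fit(X)
# ...while a high threshold merges samples into fewer, wider ones.
few = Birch(threshold=2.0, n_clusters=None).fit(X)
print(len(many.subcluster_centers_) > len(few.subcluster_centers_))  # True
```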

Training and Evaluation

In each experiment, the Birch model was fed the articles incrementally according to their publish order, where each batch contains the articles published within one day.
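This training loop can be sketched with Birch's partial_fit; the daily batches below are hypothetical stand-ins for our vectorized articles grouped by publish date:

```python
# Sketch: feed Birch one day of articles at a time via partial_fit.
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
daily_batches = [("2019-10-14", rng.random((8, 6))),   # (day, feature matrix)
                 ("2019-10-15", rng.random((5, 6))),
                 ("2019-10-16", rng.random((7, 6)))]

model = Birch(threshold=3.0, n_clusters=None)
for day, X_day in daily_batches:
    model.partial_fit(X_day)  # incremental update, one day per call

labels = model.predict(np.vstack([X for _, X in daily_batches]))
print(labels.shape)  # (20,)
```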

The following tables summarize our experiments which involve two kinds of features:

TF-IDF (W2V clusters) + TF (PERS + LOC + ORG)

Birch Threshold | V-measure score | n_clusters
2 | 0.6591 | 900
2.5 | 0.6389 | 706
3 | 0.6158 | 523
3.5 | 0.5843 | 362
4 | 0.5388 | 255
4.5 | 0.4744 | 162

TF-IDF (Uni) + TF (PERS + LOC + ORG)

Birch Threshold | V-measure score | n_clusters
2 | 0.6644 | 916
2.5 | 0.6389 | 723
3 | 0.6170 | 548
3.5 | 0.5881 | 378
4 | 0.5473 | 262
4.5 | 0.5000 | 178

Observations: Given that we have ~120 events in our ground truth, a number of clusters like 900 is large enough to put almost every article in a separate cluster, which is definitely not good clustering. However, V-measure gave it the highest score, so we couldn't be confident in this evaluation score, and we inspected the resulting clusters manually. The best clusters were produced by TF-IDF (W2V clusters) + TF (PERS + LOC + ORG) with threshold=3. However, these clusters are still messy. This can be attributed to, and further supports, the fact that the used features can't characterize this problem well.

Examples of the produced clusters:


Our initial solution for event detection does not seem to be the right way to solve this problem. Now we need a new road map to continue on, which we will propose in future posts.

Did you know that we use all this and other AI technologies in our app? See what you're reading about now applied in action. Try our Almeta News app; you can download it from Google Play or Apple's App Store.
