In a previous post, we discussed a News Stream Sequential Clustering Algorithm. In this post, we’re discussing the details of implementing this algorithm with minimal tuning, and showing the results produced by this implementation.
We also evaluate our results and tune the algorithm’s hyper-parameters using a test collection built in a previous effort, following the method described in Building a Test Collection for Event Detection Systems Evaluation.
Applying The Algorithm
In this section, we’re discussing the implementation details of the previously proposed news stream clustering algorithm.
In this algorithm, each news article is represented by a set of subvectors indicating textual features, and three values indicating the time feature.
To capture an event in text, we relied on two textual features: the TF-IDF of the main content of the news article, and the named entities extracted from that content. We also considered the time feature, which is very important for event detection. For more information about the features of the event detection problem, you can refer to our article Event Detection in Media using NLP and AI.
In order to tune the hyper-parameters, we need an evaluation metric. The chosen evaluation metric is based on the existence of a labeled test dataset (news articles assigned to the events they report).
We evaluated the clustering in the following manner: let Tp be the number of correctly clustered-together document pairs, let Fp be the number of incorrectly clustered-together document pairs and let Fn be the number of incorrectly not-clustered-together document pairs. Then we report precision as Tp/(Tp+Fp), recall as Tp/(Tp+Fn) and F1 as the harmonic mean of the precision and recall measures.
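This pairwise evaluation can be implemented directly from cluster assignments. Below is a minimal sketch; the function and variable names are ours, not from the original implementation:

```python
from itertools import combinations

def pairwise_f1(labels_true, labels_pred):
    """Pairwise precision, recall, and F1 over document pairs."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_pred and same_true:
            tp += 1  # correctly clustered together
        elif same_pred:
            fp += 1  # incorrectly clustered together
        elif same_true:
            fn += 1  # incorrectly not clustered together
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```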
The algorithm has many hyper-parameters:
- The weights of the features, which denote how much each feature matters in measuring the similarity between an article and a cluster. We fixed all these weights to “1” in this experiment, meaning all features contributed equally to the similarity measure.
- σ is the standard deviation of a Gaussian function applied to the time feature, whose input is the difference _in days_ between two dates. It was fixed to “3 days”.
- μ is the mean of that Gaussian function. It was fixed to “0”.
- We only tuned τ, the similarity threshold: if the similarity between a document and a cluster exceeds it, the document is added to that cluster. We tuned this parameter on our test collection using the F1 metric introduced above; the best value for τ was “4.5”, where F1 = 60%.
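Putting these hyper-parameters together, the assignment rule can be sketched as follows. This is an illustrative sketch, not the original implementation: the number of textual subvectors and how their similarities are computed are assumptions; only the values of μ, σ, and τ come from the discussion above.

```python
import math

SIGMA, MU, TAU = 3.0, 0.0, 4.5  # values used in this experiment

def time_similarity(day_gap, mu=MU, sigma=SIGMA):
    """Gaussian similarity over the difference in days between two dates."""
    return math.exp(-((day_gap - mu) ** 2) / (2 * sigma ** 2))

def total_similarity(feature_sims, day_gap, weights=None):
    """Weighted sum of the textual feature similarities plus the time feature.

    `feature_sims` holds the per-subvector similarities (e.g. the TF-IDF
    and named-entity similarities); all weights are fixed to 1 here.
    """
    weights = weights or [1.0] * (len(feature_sims) + 1)
    sims = feature_sims + [time_similarity(day_gap)]
    return sum(w * s for w, s in zip(weights, sims))

def assign_to_cluster(feature_sims, day_gap, tau=TAU):
    """True if the article should be added to the cluster."""
    return total_similarity(feature_sims, day_gap) >= tau
```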
Let’s show you some of our results after applying the algorithm using the previously discussed configurations.
Our Results on The Test Collection
We got ~160 clusters for ~2000 news articles in the test collection. Examples of the good clusters produced by the model are shown below:
Here is another example where the model got confused:
The model merged two distinct events. This is a common error, probably caused by a combination of the following factors, all of which appear in the previous example:
- Many of the named entities are shared between the two events e.g. ترامب، أمريكا
- Many of the non-named entities are shared between the two events too e.g. قرار
- They are too close to each other in terms of time.
You can find our suggestions for improving the model’s performance in the last section of this post.
Representing The Event
In this section, we try to summarize the main event using expressive words or phrases, in a simple manner.
Let’s try to represent the event as a word-cloud.
Let’s take one event as an example:
The following word-clouds were generated by aggregating the TF-IDF vectors of the titles/contents of all the articles in the event.
Following are word-clouds generated using the content:
Following are word-clouds generated using the title:
The Best Word/Phrase
Let’s try to represent the event in one word/phrase, which should be the most expressive one.
First, we’ll take the word/phrase with the highest TF-IDF score after aggregating the TF-IDF vectors over all the articles’ titles/contents in the event.
For the same cluster shown in the previous section:
Choosing the word/phrase from the content:
|ذمار||التحالف السعودي||اللجنة الدولية للصليب|
Choosing the word/phrase from the title:
|ذمار||المبعوث الأممي||عشرات القتلى والجرحى|
One example is enough to show that these words/phrases are too generic and incomplete to represent the whole event.
Another approach we tried is searching for the longest word sequence that occurs in more than half of the event’s articles.
For the same previous example we got:
اللجنة الدولية للصليب الأحمر
And for the following cluster:
So, this method sometimes gives good results, but other times we still get generic phrases. Moreover, in clusters where the articles share large spans of text, e.g. when they report that someone said something, we got very long phrases.
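The search for the longest majority word sequence can be sketched as a brute-force enumeration of n-grams, longest first. This is our illustrative implementation, not necessarily the one used:

```python
def longest_majority_sequence(articles):
    """Longest word sequence occurring in more than half of the articles."""
    token_lists = [article.split() for article in articles]
    majority = len(articles) / 2
    max_len = max(len(tokens) for tokens in token_lists)

    def occurs_in(tokens, ngram):
        n = len(ngram)
        return any(tokens[i:i + n] == ngram for i in range(len(tokens) - n + 1))

    # Try every n-gram length from longest to shortest; return the first
    # sequence found in a majority of the articles.
    for n in range(max_len, 0, -1):
        candidates = {tuple(tokens[i:i + n])
                      for tokens in token_lists
                      for i in range(len(tokens) - n + 1)}
        for ngram in candidates:
            hits = sum(occurs_in(tokens, list(ngram)) for tokens in token_lists)
            if hits > majority:
                return " ".join(ngram)
    return ""
```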
The Best Named Entities
For each article in the event, we extracted three types of named entities: people’s names, locations, and organizations’ names. We expressed each event by the named entities from each type that were mentioned in more than half of the articles in the event.
To show the results let’s consider the same clusters above:
|People|Locations|Organizations|
|---|---|---|
|يوسف، عبد|ذمار، اليمن، صنعاء، السعودية|الحوثي|
|دوريان، دونالد، ترامب|دوريان، الولايات، المتحدة، فلوريدا|–|
A new example cluster:
Where the used named entity extractor is FARASA.
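The majority-vote selection of named entities can be sketched as follows. The per-article entity dictionaries are an assumed input format; the extraction itself is delegated to the extractor (FARASA in our case):

```python
from collections import Counter

def majority_entities(article_entities):
    """Per entity type, keep the entities mentioned in more than half
    of the event's articles.

    `article_entities` holds one dict per article, mapping an entity
    type ('person', 'location', 'organization') to the set of entities
    the extractor found of that type.
    """
    majority = len(article_entities) / 2
    result = {}
    for etype in ("person", "location", "organization"):
        counts = Counter()
        for per_article in article_entities:
            counts.update(set(per_article.get(etype, ())))
        result[etype] = sorted(e for e, c in counts.items() if c > majority)
    return result
```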
Expressing The Events with Concepts
This time we’re trying to express each event by the concepts presented in its articles. For this task, we’re using Wikifier which is a web service that annotates a text with links to relevant Wikipedia concepts.
For each article related to the event, we extracted its concepts. We used the page-rank value associated with each concept as its weight, since these values can be used to disambiguate concepts; for more information about the Wikifier algorithm, you can refer to the paper by Brank et al. listed in the references. We then aggregated the concept weights over all the articles and chose the concepts whose aggregated weights exceeded a threshold to express the event. For the following results, we set the threshold to 0.004.
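The aggregation step can be sketched as below. The input format is an assumption, and whether the original implementation summed or averaged the per-article weights is not stated; this sketch sums them:

```python
from collections import defaultdict

def event_concepts(article_concepts, threshold=0.004):
    """Aggregate per-article concept weights and keep the concepts
    whose total weight exceeds the threshold.

    `article_concepts` holds one dict per article, mapping a Wikipedia
    concept to the page-rank weight returned by the Wikifier service.
    """
    totals = defaultdict(float)
    for concepts in article_concepts:
        for concept, weight in concepts.items():
            totals[concept] += weight
    return {c: w for c, w in totals.items() if w > threshold}
```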
Again to show the results, let’s consider the same example clusters above using Wikifier:
|Event|Concepts|
|---|---|
|Dhamar attack|حوثيون، اللجنة الدولية للصليب الأحمر، السعودية، اليمن، صنعاء، التدخل العسكري في اليمن، الأمم المتحدة، ذمار، الجزيرة (قناة)|
|Hurricane Dorian|إعصار، باهامس، فلوريدا، الولايات المتحدة، المحيط الأطلسي، الشرق الأوسط، جزر أباكو، دونالد ترامب، النمسا، دوريان|
|Amazon Fires|جايير بولسونارو، غابة الأمازون، رسام، البرازيل، منظمة الصحة العالمية، غابة، الشرق الأوسط، نظام بيئي، العالم، طفل|
We can see a huge improvement over all the previous techniques:
- Concepts are unified. In other words, the following problem, which we faced with the previous techniques, does not exist here:
إعصار <> وإعصار
- Concepts are always complete. With the previous techniques, some extracted phrases were incomplete, e.g.:
الولايات، المتحدة، اللجنة الدولية للصليب
- The main remaining issue is the presence of concepts that are mentioned in the articles but are not really key concepts, e.g. الجزيرة (قناة)
The Clusters’ Count Effect on The Clustering Time
As the number of clusters grows over time with the evolving news stream, the number of similarity measurements grows too, which is reflected in the clustering time of an incoming article.
The following graph, which was generated using the test collection, shows this effect visually:
Obviously, the number of clusters degrades the clustering time. One way to reduce this effect is to eliminate outdated clusters, which we discuss in the next section.
Eliminating The Outdated, Stale Clusters
We defined an outdated cluster as a cluster that has not been updated for a while. The question arises: how long is that “while”?
Let’s say we have a cluster with one article that has not been updated for three days, and another cluster with 100 articles that has not been updated for three days either. Clearly, these can’t be treated the same way: while the small cluster seems outdated, the larger one does not, since large events usually span a long period of time and may not be updated daily.
Hence, for eliminating the outdated clusters we should consider both the cluster last update date and its size.
For this goal let’s propose the following two normalized Gaussian functions:
F1 = exp(-(x - μ1)^2 / (2*σ1^2))
Where the input is the difference between today’s date and the publish date of the newest article that was added to the cluster, in terms of days.
μ1: the mean, set to “0”.
σ1: the standard deviation, in days, set to “3”.
The more recently the cluster has been updated, the higher the value of this function is.
F2 = 1 - exp(-(x - μ2)^2 / (2*σ2^2))
Where the input is the number of articles in the cluster.
μ2: the mean, set to “0”.
σ2: the standard deviation, in number of articles, set to “5”.
The larger the cluster is, the higher the value of this function is.
We combined both functions in one formula as the following:
outdated_indicator(c) = F1(c.news_cluster_date) + 2 × F2(c.articles_num)
The maximum value of this function is “3”.
Finally, we defined a threshold: if the outdated-indicator value doesn’t exceed it, the cluster is eliminated; otherwise, it is kept. We set this threshold to “2” by manual tuning.
As an additional constraint, any cluster that was updated in the last three days shouldn’t be eliminated, no matter its size.
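The elimination rule can be sketched directly from the definitions above; `f1` and `f2` mirror F1 and F2, and the function names are ours:

```python
import math

def f1(days_since_update, mu=0.0, sigma=3.0):
    """High for recently updated clusters, decays as idle days grow."""
    return math.exp(-((days_since_update - mu) ** 2) / (2 * sigma ** 2))

def f2(articles_num, mu=0.0, sigma=5.0):
    """Low for small clusters, approaches 1 as the cluster grows."""
    return 1 - math.exp(-((articles_num - mu) ** 2) / (2 * sigma ** 2))

def keep_cluster(days_since_update, articles_num, threshold=2.0):
    """A cluster is kept if the outdated indicator exceeds the threshold,
    or unconditionally if it was updated within the last three days."""
    if days_since_update <= 3:
        return True
    indicator = f1(days_since_update) + 2 * f2(articles_num)
    return indicator > threshold
```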
To demonstrate the effectiveness of the proposed method, we generated inputs for F1 in the range [3, 50] and for F2 in the range [1, 100], then took their combinations as inputs to the outdated-indicator formula.
We randomly sampled these combinations and plotted them in the following graph, where each point denotes a combination and its color denotes whether the cluster was kept or removed according to its outdated-indicator value.
You can see that large clusters can be preserved for about a month at most without any update, while small clusters are eliminated within a few days of being frozen.
After applying this cutoff method while clustering our test collection, the following graph was generated:
The maximum number of stored clusters we reached is ~40, compared to all 170 clusters before the cutoff. The maximum clustering time for an incoming document is ~0.2s, compared to ~0.7s before the cutoff.
In comparison with the first graph without the cutoff, there is definitely a huge improvement in the clustering time; the cutoff keeps it stable. Importantly, applying this cutoff did not affect the clustering performance in any way.
Our Results on The Real-World Data
In this section, we validate the algorithm and the chosen hyper-parameters against real-world data.
We tried to detect the events in Almeta’s news feed data over three months. Here are some of the produced clusters:
Some clusters make sense, like the following:
While others tend to be totally disastrous, like the following one:
In general, our evaluation on the test collection was better than the actual results we got on real-world data. This can be attributed to the noise in the real-world data (e.g. articles from domains other than news), while the test collection is clean and well-defined.
Another common error type we noticed is articles whose similarity with their cluster exceeded the threshold only because of the time feature, although their textual similarity with the articles already in the cluster was very low. Thus, we applied a textual-similarity threshold: if the textual similarity between the incoming article and the cluster doesn’t exceed it, the article isn’t merged into the cluster. The following example shows one cluster before and after applying the textual threshold:
Before applying the textual threshold:
Where red titles are unrelated.
After applying the textual threshold:
We got rid of all the unrelated documents.
The textual threshold was tuned manually on the real-world data and was set to 1.5.
As a final attempt to improve the results on the real-world data and make our clustering more specific, we tried one more thing: instead of taking just the TF-IDF of the whole article content, we took both the title and the content TF-IDF vectors and concatenated them. This raised F1 on the test collection to 62% with the same hyper-parameters as above. However, the results on the real-world examples improved only slightly.
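The concatenation can be sketched with scikit-learn and SciPy. Fitting two separate TF-IDF vocabularies, one for titles and one for contents, is an assumption about how the two vectors were built:

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

def title_content_vectors(titles, contents):
    """Fit separate TF-IDF spaces for titles and contents, then
    concatenate each article's two vectors into one feature vector."""
    title_vec = TfidfVectorizer().fit_transform(titles)
    content_vec = TfidfVectorizer().fit_transform(contents)
    # Side-by-side concatenation: one row per article, title features
    # followed by content features.
    return hstack([title_vec, content_vec]).tocsr()
```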
For further experimenting and improvements, we’re listing some aspects we would like to explore more in the future.
Better clustering decisions:
- Improving our decision about the best matching cluster for an incoming document by learning to weight our similarity features. This decision can be treated as a ranking problem, where the ranked items are the clusters, ranked with respect to the incoming document. A ranking model trained for this task would implicitly learn to weight our similarity features. For further information, you can refer to News Stream Clustering – Sequential Clustering in Action.
- Improving our decision about when to create a new cluster for an incoming article and when not to, by replacing the threshold parameter τ with a binary classifier that fully owns this decision. Again, for further information, you can refer to News Stream Clustering – Sequential Clustering in Action.
What about further experiments with the features?
- Different combinations of the different parts of the document e.g title, body, first paragraph.
- What’s the effect of some preprocessing steps like lemmatization and stemming?
- Does using a dense representation like doc2vec improve the performance?
In this post, we discussed the details of our implementation of a news stream clustering algorithm for event detection with minimal tuning. We also presented the results produced by this implementation on both test and real-world data.
Did you know that we use all this and other AI technologies in our app? See what you’re reading about applied in action. Try our Almeta News app; you can download it from Google Play or Apple’s App Store.
Janez Brank, Gregor Leban, Marko Grobelnik. Annotating Documents with Relevant Wikipedia Concepts. Proceedings of the Slovenian Conference on Data Mining and Data Warehouses (SiKDD 2017), Ljubljana, Slovenia, 9 October 2017.