In a previous post, we talked about “How to” Event Detection in Media using NLP and AI. In another post, we presented the Sequential Clustering. Today we’re introducing an online (sequential) clustering algorithm specialized in aggregating news articles into fine-grained story clusters.
We focus on the clustering of a stream of documents, where the number of clusters is not fixed and learned automatically. We denote by D (potentially infinite) space of documents. We are interested in associating each document with a cluster via the function C(d) ∈ N, which returns the cluster label given a document.
For each cluster, we maintained a centroid that needs to be incrementally updated to reflect new information that exists in a new incoming document.
The Clustering Algorithm
The online clustering process works as follows:
With a new incoming document d, represented as a vector, we compute a similarity metric between the document vector and each of the existing centroids. If the largest similarity exceeds a threshold τ for cluster index j, then we set C(d) = j. In that case, we also update the jth cluster centroid to include new information from document d. If none of the similarity values exceeds a threshold τ, we find the first cluster’s id which is still unassigned i and set C(d) = i, therefore creating a new cluster.
We maintain a centroid for each cluster that is created as the average of all the representations of documents that belong to that cluster. We recalculate this average for each document insertion in the cluster.
In the proposed clustering algorithm each news document can be represented by multiple subvectors, where each subvector denotes a feature type e.g. several TF-IDF subvectors with words, word lemmas and named entities. Besides these textual vectors, we use the documents’ timestamp.
In our previous post: “How to” Event Detection in Media using NLP and AI, we discussed a bunch of representative features for events in news documents. We’re going to briefly give an overview of these features in the rest of this section.
The textual features related to an event usually try to capture the answers of the 5W1H questions in the news article to gather information about a story. Questions like who and where can be answered by extracting the named entities in the article. While what and how are more complicated, some representations like the TF-IDF of the whole content or the LDA representation of the article can be used to capture them. Finally, the when question is answered using the timestamp feature.
The used similarity metric computes weighted cosine similarity on the different subvectors. Formally, the similarity is given by a function defined as:
Here, dj is the jth document in the stream and cl is a cluster.
The textual similarity is computed on the features subvectors where K is the number of subvectors for the relevant document representation.
The function φi(dj, cl) returns the cosine similarity between the document representation of the jth document and the centroid for cluster cl. The vector q0 denotes the weights through which each of the cosine similarity values for each subvector is weighted.
The function γ(d, c) that maps a pair of a document and a cluster is defined as follows:
for a given μ and σ > 0. For each document d and cluster c, we generate the following three-dimensional vector γ(d, c) = (s1 , s2, s3):
- s1 = f(t(d) − n1(c)) where t(dj) is the timestamp for document d and n1(c) is the timestamp for the newest document in cluster c.
- s2 = f(t(d)−n2(c)) where n2(c) is the average timestamp for all documents in cluster c.
- s3 = f(t(d) − n3(c)) where n3(c) is the timestamp for the oldest document in cluster c.
The vector q1 denotes the weights for the timestamp features
These three timestamps features model the time aspect of the online stream of news data and help disambiguate clustering decisions since time is a valuable indicator that a news story has changed, even if a cluster representation has a reasonable match in the textual features with the incoming document.
Regarding the hyper-parameters related to the timestamp features, we can fix μ = 0 and tune σ on the development set.
Learning To Weight The Similarities
Perhaps the similarity value related to a specific feature type matters more than others in taking the decision towards choosing the best cluster for a document. q0 and q1 encode the contributions of each of these feature types to this decision. In order to determine the weights in q0 and q1 we can consider training a ranking model to learn them, where the ranking problem can be formulated as follows:
The problem is about choosing the most relevant cluster to an incoming document. This can be treated as a ranking problem where the most relevant cluster has the highest rank, thus the considered features in this problem would be our similarity metrics, and a ranking model can be trained to learn how to weight these features.
For this goal, we need to generate a collection of ranking examples, where for each incoming document in the dataset we have a ranked list of the best clusters that match it.
How to generate the training data for the clusters ranking problem?
In a previous post, we discussed Building a Test Collection for Event Detection Systems Evaluation. We can generate the training data using a partition of our test collection (that we should not get evaluated on). For each document in our training partition, we create a list of clusters ranked according to their similarity to the document, this ranking can be done using only a simple vector representation like the TF-IDF of the content. Hence, we should simulate the execution of the clustering algorithm on the training partition to produce the clusters.
After all, a ranking algorithm is trained on the generated data using the document-cluster similarities as features, thus the ranking algorithm will weigh the similarities according to their contribution to the decision of choosing the best matching cluster, the resulted weights are q0 and q1.
New Event Detection
New event detection refers to the process of deciding when an incoming document should be merged to the best cluster or create a new one. This decision can be made by two different common ways:
- Defining a parameter τ which is a similarity threshold. If the largest similarity exceeds this threshold τ the new document should be merged to the best cluster, otherwise, a new cluster should be created. This parameter can be tuned on the development set using a grid search.
- Training a binary classifier to learn this decision: simply by passing the max of the similarities between the incoming document and the current clustering pool as the input feature vector to the model. This way, the classifier learns when the current clusters, as a whole, are of different news stories than the incoming document.
In this post, we described a state of the art method for clustering of an incoming stream of documents. The method works by maintaining centroids for the clusters, where a cluster groups a set of documents. We also discussed how to leverage different training procedures for ranking and classification to improve clustering decisions.
Do you know that we use all this and other AI technologies in our app? Look at what you’re reading now applied in action. Try our Almeta News app. You can download it from Google Play or Apple’s App Store
. Miranda, Sebastião, et al. “Multilingual Clustering of Streaming News.” arXiv preprint arXiv:1809.00540 (2018).
. Staykovski, Todor, et al. “Dense vs. Sparse Representations for News Stream Clustering.” Text2Story@ ECIR. 2019.
. Aggarwal, Charu C., and Philip S. Yu. “A framework for clustering massive text and categorical data streams.” Proceedings of the 2006 SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics, 2006.