Search queries, passport scans, barcode scans, your online shopping history, your photos on Instagram, your tweets on Twitter, voice messages, everyday news articles, and more… All of these contain a huge amount of data.
Data generation is growing exponentially. This is where the term Big Data comes in: it characterizes this generated data by its large volume, its high velocity (the speed at which data gets generated), and its wide variety.
But without a way to get insights from this data that is being generated every day, it would be just worthless!
Due to the growing need for big data analysis, many data mining algorithms have been developed that allow us to make sense of these increasing amounts of information in real time.
Our focus in this post is on the velocity attribute of big data. We briefly discuss how to handle infinite data streams, with a major focus on unsupervised learning algorithms.
What’s The Difficulty of Handling Data Streams?
Let’s consider a traditional learning process. As usual, the process starts by getting a static training dataset, with labels if it’s a supervised learning problem. The next step would be to perform feature engineering, scaling, selection, etc. We can then fit multiple machine learning models on the data, try to fine-tune their parameters, and finally select the best one to be deployed and make live predictions.
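This kind of batch learning assumes that the whole dataset is available up front. A data stream never ends, so keeping such a model up to date means retraining it periodically on the data collected so far. This is computationally expensive and requires that all relevant data is stored for periodic re-evaluation.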
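To make that cost concrete, here is a minimal sketch of periodic re-fitting on a growing dataset. It is an illustration under stated assumptions, not code from any particular library: the class `BatchRefitRegressor`, the helper `simulate`, and the toy data (points on the line y = 2x + 1) are all hypothetical.

```python
from statistics import fmean


class BatchRefitRegressor:
    """Hypothetical batch learner: stores every observation and refits from scratch."""

    def __init__(self):
        self.xs = []               # every input seen so far has to be kept
        self.ys = []               # ...and every target as well
        self.slope = 0.0
        self.intercept = 0.0
        self.points_processed = 0  # total points touched across all refits

    def refit(self, new_xs, new_ys):
        # Append the new batch, then recompute least squares over ALL stored data.
        self.xs.extend(new_xs)
        self.ys.extend(new_ys)
        mean_x = fmean(self.xs)
        mean_y = fmean(self.ys)
        sxx = sum((x - mean_x) ** 2 for x in self.xs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(self.xs, self.ys))
        self.slope = sxy / sxx if sxx else 0.0
        self.intercept = mean_y - self.slope * mean_x
        self.points_processed += len(self.xs)


def simulate(num_batches=10, batch_size=100):
    """Feed batches of toy data (y = 2x + 1) and refit after each one."""
    model = BatchRefitRegressor()
    for b in range(num_batches):
        xs = [float(b * batch_size + i) for i in range(batch_size)]
        ys = [2.0 * x + 1.0 for x in xs]
        model.refit(xs, ys)
    return model


model = simulate()
print(len(model.xs), model.points_processed)
```

With ten batches of 100 points, the model ends up storing all 1,000 observations and touches 5,500 points in total across the refits. Both numbers keep growing for as long as the stream continues.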
So, How to Handle These Streams?
The solution is known as online learning or incremental learning. In contrast to traditional batch learning, it embraces the fact that sequentially received data can be used to update our best model for future data at each step.
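The per-step update can be sketched with a tiny online linear regressor trained by stochastic gradient descent. This is one possible illustration of the idea, not a reference implementation: the class `OnlineLinearRegressor`, its `learn_one` method, the learning rate, and the synthetic stream (y = 2x + 1) are assumptions made for the example.

```python
import random


class OnlineLinearRegressor:
    """Hypothetical online learner: one SGD step per observation, no data stored."""

    def __init__(self, learning_rate=0.1):
        self.w = 0.0
        self.b = 0.0
        self.lr = learning_rate

    def predict(self, x):
        return self.w * x + self.b

    def learn_one(self, x, y):
        # Gradient of 0.5 * (prediction - y)^2 with respect to w and b.
        error = self.predict(x) - y
        self.w -= self.lr * error * x
        self.b -= self.lr * error


rng = random.Random(0)  # fixed seed so the sketch is reproducible
model = OnlineLinearRegressor()

for _ in range(5000):
    x = rng.random()            # simulated stream: one observation at a time
    y = 2.0 * x + 1.0           # assumed ground truth for the toy stream
    model.learn_one(x, y)       # the observation can be discarded afterwards

print(round(model.w, 2), round(model.b, 2))
```

Each observation is used once and then thrown away, so memory use stays constant no matter how long the stream runs.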
Sequential Clustering
Sequential clustering, also known as incremental, online, or stream clustering, is a set of unsupervised online learning algorithms.
It’s a suitable choice when we aim to recognize patterns in an unordered, infinite and evolving stream of observations. It enables updating the existing clusters and integrating new observations into the existing model by identifying emerging structures and removing outdated structures incrementally.
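The two operations described above, integrating new observations and removing outdated structures, can be sketched with a simplified leader-style clusterer that fades cluster weights over time. This is not one of the published algorithms named in this post; the class `FadingSequentialClusterer`, its parameters (`radius`, `decay`, `min_weight`), and the toy stream are hypothetical choices for illustration.

```python
import math


class FadingSequentialClusterer:
    """Hypothetical leader-style stream clusterer with exponential fading."""

    def __init__(self, radius=1.0, decay=0.9, min_weight=0.1):
        self.radius = radius          # max distance to join an existing cluster
        self.decay = decay            # fading factor applied at every step
        self.min_weight = min_weight  # clusters below this weight are dropped
        self.centers = []
        self.weights = []

    def learn_one(self, point):
        # 1. Fade all clusters so that old evidence counts less.
        self.weights = [w * self.decay for w in self.weights]

        # 2. Find the nearest existing cluster, if any.
        nearest, nearest_dist = None, math.inf
        for i, center in enumerate(self.centers):
            d = math.dist(center, point)
            if d < nearest_dist:
                nearest, nearest_dist = i, d

        if nearest is not None and nearest_dist <= self.radius:
            # 3a. Integrate the point: move the center toward it.
            self.weights[nearest] += 1.0
            step = 1.0 / self.weights[nearest]
            self.centers[nearest] = [
                c + step * (p - c) for c, p in zip(self.centers[nearest], point)
            ]
        else:
            # 3b. Emerging structure: open a new cluster at this point.
            self.centers.append(list(point))
            self.weights.append(1.0)

        # 4. Remove outdated clusters whose weight has faded away.
        kept = [i for i, w in enumerate(self.weights) if w >= self.min_weight]
        self.centers = [self.centers[i] for i in kept]
        self.weights = [self.weights[i] for i in kept]


def make_phase(offset, n=50):
    """Deterministic toy points scattered tightly around (offset, offset)."""
    return [(offset + 0.1 * (i % 3), offset + 0.1 * (i % 2)) for i in range(n)]


clusterer = FadingSequentialClusterer()
for p in make_phase(0.0):
    clusterer.learn_one(p)
clusters_after_first_phase = len(clusterer.centers)

second = make_phase(10.0)
for p in second[:10]:
    clusterer.learn_one(p)
clusters_during_drift = len(clusterer.centers)

for p in second[10:]:
    clusterer.learn_one(p)
clusters_at_end = len(clusterer.centers)

print(clusters_after_first_phase, clusters_during_drift, clusters_at_end)
```

In this toy stream the data first sits near (0, 0) and then jumps to around (10, 10). The clusterer opens a second cluster when the jump happens, and once the old cluster has received no points for long enough, its weight fades below `min_weight` and it is dropped.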
Available Tools for Sequential Clustering
Here is a set of available implementations of sequential clustering algorithms in different languages:
Name | Language | Implemented Algorithms |
stream | R package | D-Stream, D-Stream with Attraction, DBSTREAM, BICO and BIRCH |
Massive Online Analysis (MOA) | Java framework | StreamKM++, CluStream, ClusTree, DenStream, D-Stream, BICO and COBWEB |
streamMOA | R package | Interfaces to Java implementations of CluStream, DenStream and ClusTree |
subspaceMOA | Java framework | PreDeConStream and HDDStream |
evoStream | R, C++, Python | Evolutionary stream clustering |
BIRCH | C | |
BIRCH | Python | |
BICO | C++ | |
BICO | Python | |
StreamKM++ | C++ | |
Conclusion
In this post, we discussed the difficulties behind handling data streams and introduced the online learning approach. We also talked about sequential clustering and provided a list of tools that implement multiple sequential clustering algorithms, so that you can try them yourself.
If you wish to see a real-life example of sequential clustering, check out our series on event detection to learn how we have used sequential clustering to detect articles describing the same event.
Did you know that we use these and other AI technologies in our app? To see what you’re reading about applied in action, try our Almeta News app. You can download it from Google Play or Apple’s App Store.
Further Reading
[1] Carnein, Matthias, and Heike Trautmann. “Optimizing data stream representation: An extensive survey on stream clustering algorithms.” Business & Information Systems Engineering (2019): 1-21.