Elasticsearch is a distributed, RESTful search and analytics engine capable of addressing a growing number of use cases. As the heart of the Elastic Stack (ELK), it centrally stores your data so you can discover the expected and uncover the unexpected.
In this post, we explore some features and out-of-the-box use cases of Elasticsearch in the field of NLP.
Search Enhancement Features
Elasticsearch provides several handy features to enhance the end-user search experience.
You Complete Me
Effective search is not just about returning relevant results when a user types in a search phrase; it’s also about helping your users choose the best search phrases.
Did you mean …?
Elasticsearch has a phrase suggester that can correct the user’s spelling after they have searched.
The phrase suggester selects entire corrected phrases, weighted using n-gram language models. It is able to decide which tokens to pick based on co-occurrence statistics and term frequencies.
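As a sketch, a phrase-suggest request body could look like the following; here we just build the JSON body as a Python dict, and the field name `title` and suggester name `did-you-mean` are illustrative assumptions, not from any particular index:

```python
# Hypothetical "did you mean" request body using the phrase suggester.
# The field name "title" and the suggester name "did-you-mean" are assumptions.
phrase_suggest_body = {
    "suggest": {
        "did-you-mean": {
            "text": "elasticsaerch sugester",  # the user's misspelled search phrase
            "phrase": {
                "field": "title",              # field whose contents drive the n-gram model
                "size": 3,                     # return up to 3 corrected phrases
                "highlight": {                 # wrap corrected tokens for display
                    "pre_tag": "<em>",
                    "post_tag": "</em>",
                },
            },
        }
    }
}
```

The response then contains candidate corrections for the whole phrase, each with its own score, ready to render as “Did you mean …?” links.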
Suggestions while you type
The completion suggester can make suggestions while the user types. Giving users the right search phrase before they have issued their first search makes for happier users and reduced load on your servers. These suggestions can come from, for example, existing tags.
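A minimal sketch of a completion setup, assuming a dedicated field of type `completion` (the index and field names here are illustrative): first the mapping, then the suggest request issued as the user types.

```python
# Hypothetical mapping with a completion field for search-as-you-type suggestions.
tags_mapping = {
    "mappings": {
        "properties": {
            "tag_suggest": {"type": "completion"}  # special field type optimized for fast prefix lookups
        }
    }
}

# Request body sent on every keystroke, e.g. after the user has typed "elas".
completion_body = {
    "suggest": {
        "tag-suggest": {
            "prefix": "elas",                      # what the user has typed so far
            "completion": {
                "field": "tag_suggest",
                "skip_duplicates": True,           # don't show the same tag twice
            },
        }
    }
}
```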
A fuzzy search is one that is lenient toward spelling errors. For example, a fuzzy search can still find “Levenshtein” even when the search phrase misspells it.
Fuzzy searches are simple to enable and can greatly improve recall, but they can also be very expensive to perform.
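As a sketch, enabling fuzziness is a one-parameter change on an ordinary match query (the field name `title` is an assumption):

```python
# Hypothetical match query with fuzziness enabled.
# "AUTO" picks an edit distance based on term length
# (0 for very short terms, up to 2 for longer ones).
fuzzy_body = {
    "query": {
        "match": {
            "title": {
                "query": "levenstein",   # misspelled input
                "fuzziness": "AUTO",
            }
        }
    }
}
```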
Fuzzy matching isn’t always the right tool for the job; oftentimes imprecise matches can be found through other techniques. The phonetic analysis plugin contains a number of tools for approximating matches, such as the Metaphone analyzer, which finds words that sound similar to other words — useful, for instance, when similar-sounding spellings should be considered equivalent. Alternatively, for catching misspellings, N-gram analysis as described in this short tutorial can run quite a bit faster at query time, depending on the dataset.
Basic NLP Tasks
These are basic NLP tasks, typically run as part of the preprocessing step before applying more complex NLP analysis to the data.
Elasticsearch has over 20 built-in language analyzers, including one for Arabic.
What can an analyzer do?
- Tokenization
- Token filtering (e.g., lowercasing, stemming)
- Stopword removal
Also, an analyzer can be customized, for instance by providing your own stopword list.
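For instance, a custom analyzer configuration could override the built-in stopword list like this (the analyzer name and the stopwords shown are illustrative):

```python
# Hypothetical index settings: the standard analyzer with a custom stopword list.
stopword_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "type": "standard",               # reuse the standard analyzer...
                    "stopwords": ["the", "a", "an"],  # ...but with our own stopwords
                }
            }
        }
    }
}
```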
Language detection is a so-called “solved” NLP problem, so there is no need to reinvent the wheel over and over. If you already have Elasticsearch up and running, you can simply install one of its language-detection plugins, like this one, or you can always provide a custom plugin of your own.
Advanced NLP Tasks
These are higher-level tasks that require more complicated setup and strategies. Let’s discover how they can be handled by Elasticsearch.
Text classification is a task traditionally solved with supervised machine learning. The input to train a model is a set of labeled documents. The minimal representation of this would be a JSON document with 2 fields: “content” and “category”.
Traditionally, text classification is solved with a tool like scikit-learn, Weka, NLTK, or Apache Mahout.
With Elasticsearch, this task can be solved in a much simpler way, given the same input described above:
You just need to execute 4 steps:
- Configure your mapping ("content": "text", "category": "keyword")
- Index your documents
- Run a More Like This Query (MLT Query)
- Write a small script that aggregates the hits of that query by score
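The first two steps might be sketched like this (the field contents and category labels are toy examples made up for illustration); each labeled document is simply indexed as-is:

```python
# Step 1: hypothetical mapping — free text in "content", the label in "category".
classification_mapping = {
    "mappings": {
        "properties": {
            "content": {"type": "text"},       # analyzed full text
            "category": {"type": "keyword"},   # exact-match label, not analyzed
        }
    }
}

# Step 2: the labeled training documents to index (toy examples).
training_docs = [
    {"content": "The home side won the match 2-1 after extra time.",
     "category": "sports"},
    {"content": "The central bank raised interest rates by 25 basis points.",
     "category": "economy"},
]
```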
The MLT query is a very important query for text mining.
How does it work?
It can process arbitrary text, extract the top n keywords relative to the actual “model” and run a boolean match query with those keywords. This query is often used to gather similar documents.
So all you need to do is run an MLT query with the input document as the like text, and write a small script that aggregates the score and category of the top n hits.
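Steps 3 and 4 might be sketched as follows (the field name `content` and the hits are illustrative; in practice the hits come from the MLT query’s response): sum the scores of the top hits per category and take the best-scoring one as the prediction.

```python
from collections import defaultdict

# Step 3: hypothetical MLT request body, using the input document as the "like" text.
mlt_body = {
    "query": {
        "more_like_this": {
            "fields": ["content"],
            "like": "The striker scored twice in the second half.",
            "min_term_freq": 1,   # lowered thresholds so short documents still match
            "min_doc_freq": 1,
        }
    }
}

# Step 4: aggregate the top-n hits by category, weighted by score.
# (Hits mocked here; in practice they come from the MLT query's response.)
hits = [
    {"_score": 4.2, "_source": {"category": "sports"}},
    {"_score": 3.1, "_source": {"category": "sports"}},
    {"_score": 1.4, "_source": {"category": "economy"}},
]

category_scores = defaultdict(float)
for hit in hits:
    category_scores[hit["_source"]["category"]] += hit["_score"]

predicted = max(category_scores, key=category_scores.get)  # "sports"
```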
With the Elasticsearch approach, training happens at index time, and your model can be updated dynamically at any point in time with zero downtime for your application. If your data is stored in Elasticsearch anyway, you don’t need any additional infrastructure. With over 10% of results being highly accurate, you can usually fill the first page. In many applications, that’s enough for a good first impression.
Recommendations and search are two sides of the same coin. Both rank content for a user based on “relevance”; the only difference is whether a keyword query is provided.
By translating the problem of recommending content to a user into a search problem over the user’s implied interests, we can base our recommender system on a search engine like Elasticsearch.
For further information, you can refer to our previous post that tackles this use case in detail.
Social Media Monitoring
Thanks to the digital social revolution, opinions that used to be bottled up within confined media channels or the four walls of one’s private life are out there for all to see.
Wouldn’t it be intriguing to get insights into this public sentiment towards e.g. your brand, products, hot topics, etc. ?
We would like to get a view of the public sentiment expressed towards a specific entity, which requires three main tasks:
- Tracking the data streamed from Twitter.
- Analyzing this data stream to get the sentiment and maybe semantic features.
- Visualizing the analysis results.
To this end, the ELK Stack (Elasticsearch, Logstash, Kibana) can be of benefit.
The world’s most popular open-source log analysis platform, instead of being used to ingest log files, can be fed tweets via Twitter’s streaming API. On top of the aggregated data, we can create a series of graphic visualizations that best depict the Twitter trends.
Logstash, the “L” in the “ELK Stack”, is used at the beginning of the log pipeline to ingest and collect logs before sending them on to Elasticsearch for indexing. Log analysis is the most common use case, but any type of event can be forwarded into Logstash and parsed using plugins.
In the context of this task, it would be configured to deal with the Twitter stream using this plugin. After setting up Logstash you can configure it to track specific keywords in the tweets.
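A minimal sketch of such a Logstash pipeline, assuming the `logstash-input-twitter` plugin is installed (the credentials, keyword, and index name are placeholders):

```
input {
  twitter {
    consumer_key       => "YOUR_CONSUMER_KEY"
    consumer_secret    => "YOUR_CONSUMER_SECRET"
    oauth_token        => "YOUR_ACCESS_TOKEN"
    oauth_token_secret => "YOUR_ACCESS_TOKEN_SECRET"
    keywords           => ["mybrand"]   # keywords to track in the stream
    full_tweet         => true          # keep the complete tweet JSON
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "tweets"
  }
}
```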
After receiving the feed from Twitter for a while, you should have a large enough pool of data to pull from.
We can then begin to use Kibana to search for the data we’re looking for.
If we’re tracking public sentiment regarding our company’s brand, for example, we could query the brand name itself and check the correlation with sentimental expressions.
Once we have narrowed the available data down to the information that interests us, the next step is to create a graphical depiction of it so that we can identify trends over time.
As an example, we can create a Mentions Over Time visualization, showing mentions of the entities we’re tracking over time.
Another example is creating a map depicting the geographic locations of tweets.
And these are just simple examples of what can be done with your Twitter data in Kibana!
Elasticsearch is a powerful search engine that provides many simple-to-use features to enhance the end-user search experience.
Some NLP tasks, such as syntactic parsing, require deep linguistic analysis. For tasks of this kind, Elasticsearch doesn’t provide the ideal architecture and data format out of the box: for tasks that go beyond the token level, custom plugins accessing the full text need to be written or used. But tasks such as classification, clustering, keyword extraction, and measuring similarity only require a normalized and possibly weighted bag-of-words representation of a given document, and here Elasticsearch can be of great benefit.
Integrating Elasticsearch with the other ELK Stack components, namely Logstash and Kibana, yields a strong data mining and analysis stack.