Fetching and Indexing Tweets in AWS Elasticsearch

Here at Almeta we strive to be a source of all valuable content on the Arabic side of the web. While our current effort focuses on indexing and analysing news articles and blog posts, we believe that social outlets remain a vital part of the whole news experience of any internet user.

In this post we present our plan to fetch, analyze and index tweets in a simple and efficient manner.

Twitter API

Twitter offers two major free APIs (async and streaming), with the following main differences between them:

  1. Functionality: the async API allows you to search, filter and inspect historical/archived tweets, while the streaming API only gives us the latest tweets as they come online.
  2. Limits: the streaming API is practically unlimited for most small use cases; you can filter (subscribe to) up to 5,000 users and receive any new tweet they post. The async API, on the other hand, is extremely limited, and most of its endpoints allow only a couple of hundred requests per 15 minutes.
  3. Usage: the streaming API is a push stream that keeps a connection open at all times, so whatever consumes it must be serverful; this means we need to find an efficient way to send these tweets to Elasticsearch or any other component in the AWS stack. The async API, on the other hand, can easily be handled using Lambda functions or other serverless implementations.
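To make the subscription side concrete, here is a minimal sketch of how the streaming filter could be assembled. It assumes the legacy statuses/filter endpoint, which takes a comma-separated `follow` parameter of up to 5,000 user IDs; the IDs below are placeholders, and authentication is left out entirely.

```python
# Build the POST parameters for Twitter's statuses/filter streaming
# endpoint. The 5,000-user cap is the documented limit for `follow`.
MAX_FOLLOWS = 5000

def build_filter_params(user_ids):
    """Return the request body for the filter stream, enforcing the cap."""
    if len(user_ids) > MAX_FOLLOWS:
        raise ValueError(f"streaming filter accepts at most {MAX_FOLLOWS} users")
    return {"follow": ",".join(str(uid) for uid in user_ids)}

# e.g. subscribe to two (placeholder) outlet accounts
params = build_filter_params([111, 222])
```

The actual connection handling (OAuth, reconnects, backoff) is what makes the consumer serverful, which is why we delegate it to a dedicated component below.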

Use Cases

The exact design of the stack depends heavily on the use case we are trying to implement. Some of the most common applications include:

  1. Creating a specialized tweet feed: for example, we want to find all tweets posted by an original outlet (say, Aljazeera tweeting one of its articles). In a nutshell, this requires implementing the streaming API and subscribing to the outlet's accounts on Twitter.
  2. Given a tweet, find all of its replies/retweets: this can also be very helpful for collecting stats and mining user opinions. For this case, the Twitter endpoint for fetching retweets has a relatively high limit of 300 requests per 15 minutes, with which we can get the latest 100 retweets of the tweet in consideration. However, this can't be applied to a real-time stream, since it would easily exhaust the limit, so we need a more sensible solution for this case.
  3. Find all the tweets that mention a certain URL: many news outlets include their article URLs in their Twitter cards, so a useful capability is linking news articles with the tweets referencing them. This use case is extremely similar to the second one; however, it is driven by the articles stream instead of the tweets stream.
  4. Find all the tweets that mention a certain entity: this is a more elaborate use case, since it requires adding more intelligence to the system. In its simplest form, the system should be able to identify entities in the tweet text, link them with other information such as sentiment, and then index all of this information in Elasticsearch to support analytics using Kibana, or filtering based on the entity itself.
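For use case 4, the end product is an enriched document per tweet. As a rough sketch (field names are our own working assumptions, not a fixed schema, and the entity/sentiment extraction is assumed to happen upstream):

```python
# Shape of the enriched document we would index for use case 4:
# the raw text plus extracted entities and an overall sentiment label.
def build_tweet_doc(tweet_id, text, entities, sentiment):
    return {
        "tweet_id": str(tweet_id),
        "text": text,
        # keep just the entity names for easy filtering in Kibana
        "entities": [e["name"] for e in entities],
        "sentiment": sentiment,  # e.g. "positive" | "neutral" | "negative"
    }
```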

The Components

The main parts of this system are as follows:

The Consumer: this service is always connected to Twitter's streaming API. Whenever a new tweet is published by an account on the subscribed list, the consumer fetches it and sends it directly to Elasticsearch, enabling use case 1. Logstash is a very good tool for this component, and it can be hosted on an EC2 machine at a cost of under $4 per month. This component can also send the tweets to the Tweets AI Workflow to enable the smart processing needed in use case 4, either directly or through one of the streaming services in AWS such as Kinesis or SQS. For more information about this, you can review our previous article on user tracking in Elasticsearch.
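A Logstash-based consumer of the kind we have in mind could look roughly like the pipeline below. This is a sketch assuming the logstash-input-twitter plugin; the credentials, user ID and Elasticsearch host are placeholders.

```
input {
  twitter {
    consumer_key       => "<key>"
    consumer_secret    => "<secret>"
    oauth_token        => "<token>"
    oauth_token_secret => "<token-secret>"
    follows            => ["111"]   # placeholder user IDs to subscribe to
    full_tweet         => true
  }
}
output {
  elasticsearch {
    hosts => ["https://our-domain.es.amazonaws.com:443"]
    index => "tweets"
  }
}
```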

The Tweets AI Workflow: a simple workflow built on AWS Step Functions that implements a set of smart processing steps for extracting information from the tweet text. These steps can include normalizing the text, finding entities such as names of people and places, determining the sentiment of the overall tweet or the sentiment towards each of the entities, and so on.
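As an illustration of the normalization step, here is one possible cleanup pass over the tweet text before entity extraction. The exact rules are ours to choose; this sketch just strips retweet prefixes, URLs and @-mentions.

```python
import re

def normalize_tweet(text):
    """Strip RT prefix, URLs and mentions, then collapse whitespace."""
    text = re.sub(r"^RT\s+", "", text)        # drop retweet prefix
    text = re.sub(r"https?://\S+", "", text)  # drop URLs
    text = re.sub(r"@\w+", "", text)          # drop @-mentions
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace
```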

The Articles AI Workflow: this is the current system we have in place here at Almeta; it accepts a new article URL, fetches the article and performs various processing steps on it.

The PostProcessing service: as noted in use case 2, we can't directly extract this kind of information in real time from the tweets stream. In the case of retweets or replies, for example, it makes no sense to do so, since we need to wait a certain period of time until the retweets are posted. The PostProcessing service enables this time-delayed processing.

Elasticsearch: here we have three main indexes: /Tweets, where we store the tweets and their info; /TweetsActions, where we store any further actions we wish to apply to those tweets, such as fetching the replies or the retweets; and /Articles, where we store the articles' information.
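The mappings for two of these indexes could be sketched as follows. The field names are our own working assumptions rather than a final schema.

```python
# Rough mapping sketches for /Tweets and /TweetsActions.
tweets_mapping = {
    "mappings": {"properties": {
        "tweet_id":  {"type": "keyword"},
        "text":      {"type": "text"},
        "entities":  {"type": "keyword"},  # extracted entity names
        "sentiment": {"type": "keyword"},
        "posted_at": {"type": "date"},
    }}
}

tweets_actions_mapping = {
    "mappings": {"properties": {
        "url":    {"type": "keyword"},  # tweet or article URL
        "action": {"type": "keyword"},  # e.g. "replies", "retweets", "search"
        "done":   {"type": "boolean"},  # has the action been applied yet?
    }}
}
```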

Linking it up

We can expect three simple flows:

The tweets flow processes incoming tweets:

  1. The consumer fetches a new tweet and sends it to the Tweets AI Workflow.
  2. The Tweets AI Workflow performs the processing steps and then indexes the results in Elasticsearch in the /Tweets index.
  3. The Tweets AI Workflow also indexes a new actions object for the tweet in the /TweetsActions index. This object simply includes the tweet URL plus any PostProcessing we want to apply to it, such as replies or retweets extraction.
  4. After indexing, it is possible to use Elasticsearch queries to find tweets that describe a certain topic or entity, or even to collect stats about them, simply by querying the /Tweets index.
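For step 4, a query against /Tweets might look like the sketch below, assuming the indexed documents carry an `entities` keyword field (a field name of our own choosing).

```python
# Build an Elasticsearch query body: exact match on one extracted entity.
def tweets_mentioning(entity):
    """Query body for /Tweets, assuming an `entities` keyword field."""
    return {"query": {"term": {"entities": entity}}}
```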

The articles flow processes the tweets mentioning an article URL; in this case:

  1. An article URL is detected through RSS or other means
  2. The URL is sent to the Articles AI Workflow, which processes the article and then indexes it in the /Articles index.
  3. From the URL we create an action object that includes the article URL and the action "search", and index this action in the /TweetsActions index.
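The action object of step 3 could be as simple as the sketch below; the field names follow our own conventions rather than any fixed schema.

```python
# Action object indexed in /TweetsActions for the articles flow:
# it tells the PostProcessing service to search Twitter for this URL.
def build_search_action(article_url):
    return {"url": article_url, "action": "search", "done": False}
```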

The PostProcessing flow applies the actions indexed in the /TweetsActions index, and it works as follows:

  1. A scheduler calls the PostProcessing service at regular intervals.
  2. The PostProcessing service queries the /TweetsActions index to find any new actions.
  3. The PostProcessing service applies these actions, usually by calling the async Twitter API.
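Because the async API is rate-limited (e.g. 300 requests per 15-minute window for the retweets endpoint), each PostProcessing run should only take on as many actions as the window allows. A minimal sketch of that batching, assuming action objects with a `done` flag:

```python
WINDOW_BUDGET = 300  # requests per 15-minute window (retweets endpoint)

def pick_batch(pending_actions, budget=WINDOW_BUDGET):
    """Take at most `budget` not-yet-applied actions for this run."""
    todo = [a for a in pending_actions if not a.get("done")]
    return todo[:budget]
```

Anything left over simply waits for the next scheduler tick, which keeps the service within the limit without any explicit sleeping.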

Conclusion

In this post we presented our plan for implementing a tweet-analysis tool for Almeta. While this design might not fit your exact use case, we hope it helps you develop a tailored solution for your own tasks.
