User Profiling Using AWS ElasticSearch – RomCom use case.

Personal differences and preferences are an important part of our identity, and optimizing the user experience around them can be a great tool for improving user engagement. In our previous post we tackled the issue of personalized recommendations and how ElasticSearch can make the process much simpler. However, to build a robust personalized recommendation system, it is paramount to have an idea of each user: who they are and what they like. This is commonly referred to as a user profile. In this post we will present a road map to enabling user profiling with … Continue reading User Profiling Using AWS ElasticSearch – RomCom use case.

Visualization Platforms for Data Search — Amplitude VS Kibana

In the world of Big Data, data visualization tools and technologies are essential to analyze massive amounts of information and make data-driven decisions. Visualization is an increasingly key tool to make sense of the trillions of rows of data generated every day. Our eyes are drawn to colors and patterns. We can quickly identify red from blue, square from circle. Our culture is visual, including everything from art and advertisements to TV and movies. Data visualization is another form of visual art that grabs our interest and keeps our eyes on the message. Hence, starting from the data value in … Continue reading Visualization Platforms for Data Search — Amplitude VS Kibana

RomCom — The Personalized Recommendation System For Almeta

The nice thing about working at Almeta is that we are our own users. As a platform for the best Arabic content on the web, we want to deliver the best content for each user. For us. For each Arabic-speaking reader. As usual, many companies think of “personalization” as the way to go to improve engagement, reach, or even acquisition. Just let the user tell you what he wants, fit a model against his needs, and let him see what he wants to see. Not what he should see. We tend to disagree. Given our own unique thoughts, interests and … Continue reading RomCom — The Personalized Recommendation System For Almeta

Communication/Messaging Tools and Patterns between Microservices

When designing a microservice, remember that other services will need to integrate with it. There is no single best style of communication for every case. In practice, we need to find the best solution for the problem at hand. In this post, we’re discussing different approaches and technologies used in designing the communication between microservices, shedding light on the most common communication services provided by AWS, and weighing the different patterns and services against one another with several factors in mind. Communication Patterns In this section, let’s introduce you to the two major communication patterns … Continue reading Communication/Messaging Tools and Patterns between Microservices

Automatically Extracting Valuable Content from News Streams.

Almeta News, a content aggregator app drawing on about 50 sources at the time of writing, always aims to provide its users with the best quality pieces to read. Rather than relying on an army of content watchers and editors, Almeta is looking to develop the best algorithms to review content automatically, looking for indicators of quality and assessing a content’s placement. This post is a part of our research efforts seeking the best content ranking methodology. In this post, we’re trying to determine the most effective indicators of content quality. Depending on human experts’ points of view, … Continue reading Automatically Extracting Valuable Content from News Streams.

Initial Genre Classification Experiments

The ability to filter your news feed by genre is a critical component of any news aggregator: users often want to read only sports or political news, not just the most recent or hottest news. In this post, we will explore in great detail our initial genre classification system. Let’s start with the data. In the following experiments, we used an in-house data set. The data set is composed of 190,307 HTML documents crawled from the following domains: [Aljazira, Alarabia, Aljadeed, RT Arabic, BBC Arabic]. For each of the documents we tried to extract the following features: … Continue reading Initial Genre Classification Experiments

Initial Experiments on Measuring Informativity of an Arabic Content – Data Collection

In one of our previous articles we suggested a method to build an initial system for informativeness detection. This system should utilize a small set of manually annotated pairwise comparisons, use Snorkel to expand these annotations automatically to a larger training set, and then train the model to estimate article informativeness using this set. In this article, we will go into the details of the implementation of this plan. Data Annotation As noted above, Snorkel will need 3 types of training data: A small manually annotated test set to evaluate the results of the model A smaller manually annotated … Continue reading Initial Experiments on Measuring Informativity of an Arabic Content – Data Collection

Git submodules in the Python world: Why and How

The basic principle that makes many professional tech companies professional is the simple principle of domain engineering: working for a long period of time on a small set of domains in the hope that your codebase will grow to make you more efficient and successful at developing projects in these domains. The main component in this formula is the idea of code reuse. Sooner or later you will have a certain piece of code that you use constantly across all your projects. If we are talking about NLP, these might be your text normalizers, your feature extractors or … Continue reading Git submodules in the Python world: Why and How

What is Political Bias? – In Technical Terms

In this article, we will review the research done in the field of discovering political bias. Understanding Characteristics of Biased Sentences in News Articles Methodology Bias Labeling via Crowd-Sourcing They used crowdsourcing to collect bias labels on the “Figure Eight” platform. In crowdsourcing, they let the workers make judgements on each target news article (also using the reference news article). Analysis of Perceived News Bias To analyze what kind of words are tagged as bias triggers by the workers, they analyze the phrases annotated as biased in terms of the word length (4 words in a sentence have been annotated). … Continue reading What is Political Bias? – In Technical Terms

News Stream Clustering – Sequential Clustering in Action

In a previous post, we talked about “How to” Event Detection in Media using NLP and AI. In another post, we presented sequential clustering. Today we’re introducing an online (sequential) clustering algorithm specialized in aggregating news articles into fine-grained story clusters. Problem Formulation We focus on the clustering of a stream of documents, where the number of clusters is not fixed but learned automatically. We denote by D the (potentially infinite) space of documents. We are interested in associating each document with a cluster via the function C(d) ∈ N, which returns the cluster label given a document. For each … Continue reading News Stream Clustering – Sequential Clustering in Action
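
To make the problem formulation above concrete, here is a minimal sketch of a single-pass clusterer in Python. It is not the algorithm from the post: the bag-of-words representation, the cosine similarity, the `threshold` value, and the names `SequentialClusterer` and `assign` are all assumptions chosen for illustration. It only shows the shape of a function C(d) that returns an integer label for each incoming document while the number of clusters grows as needed.

```python
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SequentialClusterer:
    """Hypothetical single-pass clusterer: one decision per incoming document."""

    def __init__(self, threshold: float = 0.3):
        # Assumed similarity cut-off; a real system would tune or learn it.
        self.threshold = threshold
        # One summed term-count vector per cluster; the list index is the label.
        self.centroids: list[Counter] = []

    def assign(self, text: str) -> int:
        """Plays the role of C(d): return the cluster label for a document."""
        vec = Counter(text.lower().split())
        best_label, best_sim = -1, 0.0
        for label, centroid in enumerate(self.centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best_label, best_sim = label, sim
        if best_label >= 0 and best_sim >= self.threshold:
            # Close enough to an existing story: fold the document into it.
            self.centroids[best_label].update(vec)
            return best_label
        # Otherwise open a new cluster, so the cluster count is never fixed.
        self.centroids.append(vec)
        return len(self.centroids) - 1


clusterer = SequentialClusterer(threshold=0.3)
labels = [
    clusterer.assign("earthquake hits coastal city"),
    clusterer.assign("coastal city earthquake damage"),
    clusterer.assign("football team wins final"),
]
print(labels)
```

Each document is seen once and never reassigned, which is what makes the approach suitable for a stream; the trade-off is that early mistakes are not corrected later.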

How to Version Control Your Machine Learning? – A Look into Data Version Control (DVC)

If you have spent time working with Machine Learning, one thing is clear: it’s an iterative process. Machine learning is about rapid experimentation and iteration. Each experiment consists of different parts: the data you use, hyperparameters, learning algorithm, architecture, and the optimal combination of all of those. Throughout this iterative process, your accuracy on your dataset will vary accordingly, and without keeping track of your experiment history you won’t be able to learn much. Versioning lets you keep track of all of your experiments and their different components. How to Version Control ML Projects? One of the most popular ways … Continue reading How to Version Control Your Machine Learning? – A Look into Data Version Control (DVC)

Available Visualization Libraries to Handle Stream of Data

Google Analytics Google Analytics generates detailed statistics and fresh insights into your website’s traffic and traffic sources. With Google Analytics users can track visitors from all referrers, including search engines and social networks, direct visits and referring sites. It also tracks and monitors display advertising, PPC networks, email marketing and other digital collateral. You can not only measure sales and conversions, but also gain fresh insights into how visitors use your site and how you can keep them coming back. Segment From startups to the Fortune 500, thousands of companies use Segment as their customer data hub. We believe that … Continue reading Available Visualization Libraries to Handle Stream of Data

Contrary view detection based on VODUM

While reading the news each one of us perceives it in a different manner. We have our own biases and we tend to search for information that confirms our previous beliefs. Thus different people might have drastically different viewpoints of … Continue reading Contrary view detection based on VODUM

AWS Batch Jobs — An Overview

AWS Batch enables you to run batch computing workloads on the AWS Cloud. This service automatically provisions compute resources and optimizes the workload distribution based on the quantity and scale of the workloads. Related Definitions Jobs: A unit of work (such as a shell script, a Linux executable, or a Docker container image) that you submit to AWS Batch. It runs as a containerized application on an Amazon EC2 instance in your computing environment, using parameters that you specify in a job definition. Container images are stored in and pulled from container registries. Job Definitions: specify how jobs are … Continue reading AWS Batch Jobs — An Overview

AWS Lambda and SQS Payload Limitation

In this post, we’re talking about deploying an AI service that processes the political news articles in Almeta’s database. The service is assumed to be deployed as an AWS Lambda function, with AWS SQS holding the incoming requests while the function is throttled. An important limit to consider in such a situation is the one that both AWS Lambda and AWS SQS put on the payload size. Throughout this post, we investigate what this limit is for each of the two services and how it affects us. AWS SQS There are two … Continue reading AWS Lambda and SQS Payload Limitation

Almeta App — Caricature Tab

Caricature Tab is a new feature planned to become part of the Almeta News app soon. In its initial version, the feature will provide the user with a stack of in-house designed caricature images to browse. If you’re curious about how we at Almeta handle such new features, this post is for you. We’re showing the entire decision-making process behind a number of design and deployment questions: Where to store the images? How to handle new ones? Do we need caching? What to store in … Continue reading Almeta App — Caricature Tab

Informativity Detection – Almeta’s Research Gist

Let’s start with a simple question: what constitutes an informative article? According to Oxford’s dictionary: informative /ɪnˈfɔːmətɪv/ (adjective): providing useful or interesting information. However, this is still an abstract concept. The question of measuring how informative a piece of news is … Continue reading Informativity Detection – Almeta’s Research Gist

A Guideline for Writing Research/Tech Blogs

Intro At Almeta you have to write a lot for the research tickets you have in a Sprint. You have to read tons of research papers, academic and sometimes boring. But when you write your proposal, you don’t have to write like them. As a matter of fact, we want to be as close to non-techies as possible when writing our tech blogs. So, you’re an engineer and you love to code. You are a machine learning engineer and you love to read. You’re both and here comes a research/investigation ticket. You read, read, and read some more and now comes … Continue reading A Guideline for Writing Research/Tech Blogs

Our Agile/Scrum Setup in Almeta

Intro We’re currently experimenting with different styles, somewhere between Agile/Scrum and Kanban. This is our latest setup. We’re going to keep this post updated. The Team in Almeta We are a remote, cross-functional team. We try to keep a balance in the skills we have. We favor T-shaped employees. We <3 Valve. Skin in the Game: In a startup you have to eat your own food. And you have to take extra responsibility for any code you develop. We don’t have researchers and engineers. We have research-engineers. Those who learned to do research, develop ideas, write their code and also bring them … Continue reading Our Agile/Scrum Setup in Almeta

How to Fact-Check using Natural Language Processing Techniques? A Literature Review

In this article, we present a summary of our research in the field of fact-checking. We group the work into two categories: the first is closed-source published applications and the second is research projects done in this field. Closed Source Snobs Their methodology depends on human annotators to fact-check a piece of news and present a detailed report on the inaccuracies in the article Reporters’ Lab Their methodology depends on human annotators as well, and the datasets can be found at https://www.politifact.com/texas/ and https://factnameh.com/ Fullfact Their methodology builds a fully automated fact checker, but no details are provided … Continue reading How to Fact-Check using Natural Language Processing Techniques? A Literature Review