Automatically Extracting Valuable Content from News Streams.

Almeta News is a content aggregator app that, at the time of writing, draws on about 50 sources, and it always aims to provide its users with the best-quality pieces to read.

Rather than relying on an army of content watchers and editors, Almeta aims to develop algorithms that review content automatically, looking for indicators of quality and deciding where each piece should be placed.

This post is part of our research into the best content ranking methodology.
In this post, we try to determine the most effective indicators of content quality, drawing on human experts’ points of view, well-known competitors’ approaches, and the scientific literature. We also suggest ways to measure these indicators automatically.

What Constitutes Quality Content?

In this section, we list the most important indicators of quality content, derived from human experts’ recommendations to authors, with a brief description of how each factor can be applied technically.

1. Quality Content is valuable from the user’s point of view

What my audience really cares about is what they care about, not what I care about.

a. Cultural context
Culture is what ultimately drives people and their decision-making process. Thus, we need to make sure that the top content is relevant to the culture of our target segment.

b. Geographical context
This is partly related to the cultural context. Beyond that, people are more likely to be interested in what affects them directly. Thus, geography-based content filtering may be an effective way to determine what a user probably cares about.

c. Personal context
People are different: what one person finds interesting is not necessarily interesting to others.

Technical Application

  • Studying our target segment and focusing on the types of content that attract it.
  • Detecting the user’s location and filtering the content geographically, which can be treated as an automatic content classification problem, so that the user is served content matching his cultural and local preferences.
  • Getting to know each user individually, by means of:
    • Profile: users can provide basic information about their career, education, etc., and additional information about their hobbies, feelings, etc.
    • Interests: let users choose the topics, keywords, etc. they are interested in following.
    • History: learn about the user from his behavior: what kind of content he liked, commented on positively or negatively, shared, viewed, blocked, followed, etc.
  • Making the content searchable, so that the user can find exactly what he needs at any point.

What kinds of recommendations to make in this context?

  • Recommend content/interests similar to the ones the user is interested in.
  • Recommend content/interests based on the user’s activity: things he is not following but, judging by his recent activity, seems to be interested in.
  • Recommend content/interests that similar users are interested in.
  • Recommend content that helps the user improve his career or makes him feel better; the list of ideas is endless!

The recommendations here may require a mix of classification, clustering, recommendation and even rule-based methods.
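
As a minimal sketch of the first kind of recommendation (content similar to what the user already likes), the example below assumes each article is represented by a plain bag of words and that the user profile is the sum of the articles he liked. The function names and the toy data are hypothetical; a real system would more likely use TF-IDF weights or learned embeddings.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts; a stand-in for a real text representation."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(liked_texts, candidates, top_k=2):
    """Rank candidate articles by similarity to the user's liked articles."""
    profile = Counter()
    for text in liked_texts:
        profile.update(bag_of_words(text))
    scored = [(cosine(profile, bag_of_words(c)), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]


liked = ["football league final result", "football transfer news"]
candidates = [
    "stock market falls again",
    "football cup final tonight",
    "new phone released",
]
print(recommend(liked, candidates, top_k=1))  # ['football cup final tonight']
```

The same profile vector could also be compared between users to find "similar users" for the third kind of recommendation.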

2. Quality Content is made in the way the audience likely wants to consume it

It’s not all about the substance of the content, but the structure of it, too. For instance, does an audience want to read long narratives to glean what happened at a conference?

Technical Application

Let the user choose the structure of the content he wants to read, e.g. by sorting the content by length.

3. Quality is determined by the audience

If they find it useful – which we can tell from the data and feedback – it is likely quality content.

Technical Application

Save data and statistics for each article, and favor the articles with the best feedback.
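
One practical problem with "best feedback" is that an article with 9 likes out of 10 should not automatically outrank one with 90 out of 100. A common remedy, which is our suggestion here and not something taken from the sources of this post, is to rank by the lower bound of the Wilson score interval. The sketch assumes feedback has already been reduced to positive and total counts.

```python
import math


def wilson_lower_bound(positive, total, z=1.96):
    """Lower bound of the Wilson score interval for the positive-feedback rate.

    z=1.96 corresponds to a 95% confidence level (an assumed default).
    """
    if total == 0:
        return 0.0
    p = positive / total
    denom = 1 + z * z / total
    centre = p + z * z / (2 * total)
    margin = z * math.sqrt((p * (1 - p) + z * z / (4 * total)) / total)
    return (centre - margin) / denom


# Same 90% approval rate, but more evidence gives a higher score.
few_votes = wilson_lower_bound(9, 10)
many_votes = wilson_lower_bound(90, 100)
print(few_votes < many_votes)  # True
```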

4. Quality Content is usually actionable

As a general rule, content that has practical use is naturally interesting, or at least more interesting than content that does not.
Favor content that readers can take and use in real life. After consuming the content, readers walk away with advice, steps, or insight they can apply to a job they need to get done.

Technical Application

This can be implemented as a kind of automatic content classification, where the training data can be gathered from sources like wikiHow.

5. Quality Content is Novel

The ideas discussed in the content should be novel, not a rehash of the same concepts over and over again.
Novelty can also be looked at from the user’s point of view: novel content is content that is new and interesting to the user.

Technical Application

We need to implement some kind of novelty detection algorithm. One proposed solution is online clustering of the content, where the most novel content forms a new cluster.
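
A minimal sketch of this idea is single-pass clustering: each incoming article joins the most similar existing cluster, or starts a new cluster (and is flagged as novel) when nothing is similar enough. The bag-of-words representation, the class name and the 0.3 threshold are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class OnlineClusterer:
    def __init__(self, threshold=0.3):
        self.threshold = threshold
        self.centroids = []  # one summed word-count vector per cluster

    def add(self, text):
        """Assign text to the nearest cluster; return (cluster_id, is_novel)."""
        vec = Counter(text.lower().split())
        best_id, best_sim = None, 0.0
        for i, centroid in enumerate(self.centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best_id, best_sim = i, sim
        if best_id is None or best_sim < self.threshold:
            # Nothing similar enough: the article opens a new cluster.
            self.centroids.append(vec)
            return len(self.centroids) - 1, True
        # Otherwise fold the article into the existing cluster centroid.
        self.centroids[best_id].update(vec)
        return best_id, False


clusterer = OnlineClusterer()
print(clusterer.add("earthquake hits coastal city"))         # (0, True)
print(clusterer.add("strong earthquake hits city center"))   # (0, False)
print(clusterer.add("central bank raises interest rates"))   # (1, True)
```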

6. Quality Content is Trendy

To make the content more interesting to the current audience, we can leverage the power of recent trends. Trends can be considered in either a worldwide or a local setting.

Technical Application

Online clustering can also be used to detect trends, where trends form emerging clusters over a period of time.
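
Assuming articles have already been assigned to clusters, one simple way to spot an emerging cluster is to compare how many articles it received in the most recent time window with the window before it. The function name, the one-hour window and the thresholds below are illustrative assumptions, not tuned values.

```python
from collections import Counter


def emerging_clusters(events, now, window=3600, min_count=3, growth=2.0):
    """Return cluster ids whose article count grew sharply in the last window.

    events: iterable of (timestamp_seconds, cluster_id) pairs.
    """
    recent, previous = Counter(), Counter()
    for ts, cluster_id in events:
        age = now - ts
        if 0 <= age < window:
            recent[cluster_id] += 1
        elif window <= age < 2 * window:
            previous[cluster_id] += 1
    trending = []
    for cluster_id, count in recent.items():
        # Add 1 to the previous count so brand-new clusters do not divide by zero.
        if count >= min_count and count / (previous[cluster_id] + 1) >= growth:
            trending.append(cluster_id)
    return sorted(trending)
```

Running the same function on events filtered by country would give local rather than worldwide trends.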

7. Quality Content is Fresh

The importance of a piece of news changes over time. Although old content can be interesting too, the freshness factor aims to keep the user up to date with the latest news and to leave out stale stories.
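
A common way to model freshness is to let an article's score decay exponentially with its age. The sketch below assumes a half-life parameter; the 12-hour default is an arbitrary placeholder that would need tuning, probably per news category.

```python
def freshness(age_hours, half_life_hours=12.0):
    """Exponential decay: 1.0 when just published, 0.5 after one half-life."""
    return 0.5 ** (max(age_hours, 0.0) / half_life_hours)


def time_aware_score(quality, age_hours, half_life_hours=12.0):
    """Discount a quality score (any non-negative number) by the article's age."""
    return quality * freshness(age_hours, half_life_hours)


print(freshness(0))    # 1.0
print(freshness(12))   # 0.5
```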

8. Quality Content is Accurate

Accuracy builds trust with readers.

Technical Application

Based on the tips recommended to authors for thinking about content accuracy, the following two factors can be used to estimate accuracy:

  • The number of links to other sources and content: The more the author can back up and substantiate what he’s writing about, the more trusted his content will become.
  • Consider who the content is linking to: Are they a trusted and authoritative source? Linking to other quality websites will earn more trust from the readers.
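
Both factors can be computed directly from the article's HTML. The sketch below counts outbound links and checks how many point to a trusted domain. The `trusted_domains` set is a hypothetical, hand-maintained list; how to build it is outside the scope of this post.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collects the host names of all absolute links in an HTML document."""

    def __init__(self):
        super().__init__()
        self.domains = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            host = urlparse(href).netloc.lower()
            if host:  # relative links have no host and are skipped
                self.domains.append(host[4:] if host.startswith("www.") else host)


def link_features(html, trusted_domains):
    """Return the link count and the share of links to trusted domains."""
    parser = LinkCollector()
    parser.feed(html)
    total = len(parser.domains)
    trusted = sum(1 for d in parser.domains if d in trusted_domains)
    return {
        "links": total,
        "trusted_links": trusted,
        "trusted_ratio": trusted / total if total else 0.0,
    }
```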

9. Quality Content is Trusted

People tend to consume content from sources they trust more.

Technical Application

  • Let users decide which sources to follow.
  • Score the sources over time according to the quality, amount, and frequency of the content they provide, also taking their popularity into consideration.
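
One simple way to score sources over time is an exponential moving average of the quality of the articles each source publishes, so that recent articles count more than old ones. This is only a sketch: it assumes a per-article quality value in [0, 1] is already available, and it leaves out amount, frequency and popularity. The class name and defaults are hypothetical.

```python
class SourceScorer:
    """Keeps a running quality score per source."""

    def __init__(self, alpha=0.2, initial=0.5):
        self.alpha = alpha        # weight given to the newest article
        self.initial = initial    # score assumed for a source never seen before
        self.scores = {}

    def update(self, source, article_quality):
        """Blend the new article's quality into the source's running score."""
        old = self.scores.get(source, self.initial)
        new = (1 - self.alpha) * old + self.alpha * article_quality
        self.scores[source] = new
        return new


scorer = SourceScorer(alpha=0.5)
print(scorer.update("example-source", 1.0))  # 0.75
print(scorer.update("example-source", 1.0))  # 0.875
```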

10. Quality Content is Short and Pointed

There is nothing better than brief, to-the-point content that is filled with information. So don’t focus on word count. Longer content does not mean better content. Quality content is detailed and covers the topic from many angles. But it’s tight. Not a word is wasted.

Technical Application

  • Automatic readability scoring.
  • Read time estimation.
  • The length of the article.
  • How far the sentences diverge from the centroid of the content.
  • And maybe more…
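
Several of these measures can be computed with very little code. The sketch below estimates the length, the read time and a "focus" value, defined here as the average cosine similarity between each sentence and the centroid of the whole text. The 200 words-per-minute reading speed and the naive sentence splitter are assumptions, and readability scoring is left out.

```python
import math
import re
from collections import Counter

WORDS_PER_MINUTE = 200  # assumed average reading speed


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def brevity_features(text):
    """Length, read time and sentence-to-centroid similarity for one article."""
    # Naive split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    vectors = [Counter(re.findall(r"\w+", s.lower())) for s in sentences]
    centroid = Counter()
    for vec in vectors:
        centroid.update(vec)
    word_count = sum(centroid.values())
    sims = [cosine(vec, centroid) for vec in vectors]
    return {
        "word_count": word_count,
        "read_minutes": word_count / WORDS_PER_MINUTE,
        "avg_sentence_length": word_count / len(sentences) if sentences else 0.0,
        "focus": sum(sims) / len(sims) if sims else 0.0,
    }
```

A low focus value suggests sentences that wander away from the main topic of the article.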

11. Quality Content is Engaging

Quality content keeps the reader engaged.

Technical Application

Based on tips recommended to authors for making content engaging, we may consider the following factors:

  • The number of questions: engaging content includes questions that make readers reflect on how they can apply the knowledge the content provides.
  • An important and promising introduction: most people probably decide within the first few sentences whether the post is worth reading.
    • In the simplest setting, this can be implemented as a kind of automatic classification, where the model determines how closely an introduction resembles the “good introductions” class.
      The dataset can be gathered automatically, using real introductions from well-known trusted sources as positive examples and random sentences as negative examples or, for smoother results, introductions from known lower-quality sources as negative examples.
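
The first factor, the number of questions, is simple to measure. The sketch below returns the share of sentences that end with a question mark. It uses a naive sentence splitter, and it also accepts the Arabic question mark, on the assumption that the content may be in Arabic.

```python
import re

QUESTION_MARKS = ("?", "؟")  # Latin and Arabic question marks


def question_ratio(text):
    """Fraction of sentences in the text that are questions."""
    # A sentence is any run of text that ends with '.', '!', '?' or '؟'.
    sentences = re.findall(r"[^.!?؟]+[.!?؟]", text)
    if not sentences:
        return 0.0
    questions = sum(1 for s in sentences if s.endswith(QUESTION_MARKS))
    return questions / len(sentences)


print(question_ratio("How does it work? It uses signals. Can you try it? Yes."))  # 0.5
```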

12. Quality Content Provides a Balanced View

Quality content considers the strengths and weaknesses of the alternatives and selects the most correct one.

Technical Application

Measuring the opinion bias in the content.

13. Quality Content Provides Good Communication

Vision is the strongest human sense, so people are naturally drawn to visually appealing pieces. Whether the content uses pictures, videos, or diagrams, they can help illustrate the author’s point.
Quality content is often highly visual and easy to consume.

How Do Giant Content Providers Do This?

Here is a deeper look at how some well-known content providers surface valuable content.

Google News

Forget PageRank: Google’s news service doesn’t rely on the same algorithm used by “regular” Google, of which PageRank is a part. Instead, Google News taps into its own unique ranking signals. This was noted by Josh Cohen, the business product manager of Google News, in an interview on November 24, 2009.

During this interview, Cohen showed what’s under the hood of the Google News ranking methodology.

So, How Does Google News Work?

Google News structures news into story clusters: groups of individual articles that all cover a given angle of a particular news event.

What causes an individual article to be the lead item in a particular story cluster?
Various factors are involved, Cohen said:

1. Freshness, local relevancy & originality

Is there original content? The timeliness. Coverage of recent developments? The relevancy to the cluster at hand. In some cases, is there local relevance? Is there content from a local source with local content?

2. Publication reputation

How an individual article ranks within a story cluster is further influenced by the reputation that its publishing source carries within Google News.

What’s the volume of publication of original content in a given category? If you look at Bloomberg and Reuters, they may have hundreds of original articles in business. That’s a pretty good indication of the quality of that source for that category. Compared to sports, there’s not that much original content [and so they might not have as much authority when ranking sports stories].

3. Measuring Clicks

What users are clicking on from the results they see. You understand who are trusted sources for users. If you go to a given cluster for Google News, you’d expect the first story to get more clicks than the second and so on.

4. Textual Content Counts, Too

Your URL, title and body are three components you can look at. If you’re weak in any one of those, it puts additional weight on the other categories.

Moreover, Google News has various “editions” for different countries, such as Google News UK versus Google News US. Each edition has its own particular blend of signals it uses to rank news content. Furthermore, each section within a Google News edition (such as Entertainment versus Sports) also uses its own unique blend of ranking signals.


Flipboard

Flipboard delivers over 100,000 stories per day. Here are some of the factors it considers in surfacing good-quality stories:

1. Ranking sources

A team of humans determines the editorial quality of a source, and then something called the domain ranker comes into play. Built for spam detection, the ranker allows the team to favor sources with known track records, who themselves follow time-honored journalistic principles. Who’s ranked and how is carefully guarded and continually reviewed.

2. Incorporating signal from as many people as possible

While the ranker does make it harder for stories from the long tail to surface on Flipboard, there’s another filter that influences what you see: the user satisfaction score. A set of signals that indicate how engaged people are with a piece of content, the score is a proxy for quality.

3. Clustering stories for multiple perspectives.

Flipboard says that story clustering is how it surfaces the plurality of sources and voices it carries. Story clustering is an algorithmic technique that pulls together stories from different sources on the same topic.

Not every cluster might actually have stories with truly unique viewpoints —machine learning just isn’t there yet— but the structure gives us a framework to offer balance.

Flipboard Team

4. Attributing for context.

All stories on Flipboard have author, publisher and/or curator attribution, so the user can see where each story comes from and make his own informed decisions about the person’s inherent biases.


Medium

The ethos of Medium is inherently democratic; it seeks to give a voice to people who have something interesting to say, even if they don’t have thousands of Twitter followers, an active blog or friends in the right places. Medium is built to reward content for its quality, not for the pedigree or popularity of the author.

So while Medium allows anyone to publish pretty much anything, it works hard to guarantee that visitors only see the good stuff.

The website’s ever-evolving algorithm that determines post ranking considers a variety of factors. Ev Williams, who co-founded Medium with Biz Stone, explained:

What we’re doing is ordering things by our best guess of the relative quality/interestingness of the different items – according to the people who have seen them… It’s not a direct popularity ranking. It takes in a variety of factors, including whether or not a post seems to actually have been read (not just clicked on) and whether people click the “Recommend” button at the bottom of posts. The ratio of people who view it who read it and who read it and recommend it are important factors, not just the number. (This is an attempt to level the playing field for those who don’t already have large followings and/or a penchant for writing click-bait headlines.)
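
The two ratios mentioned in the quote are easy to compute from per-post counters. The quote does not say how they are combined, so multiplying them, as the sketch below does, is purely our assumption for illustration.

```python
def engagement_score(views, reads, recommends):
    """Combine the read ratio and the recommend ratio into one score.

    read ratio      = reads / views       (how many viewers actually read it)
    recommend ratio = recommends / reads  (how many readers recommended it)
    """
    read_ratio = reads / views if views else 0.0
    recommend_ratio = recommends / reads if reads else 0.0
    return read_ratio * recommend_ratio


# A small post with engaged readers can outrank a widely clicked one.
small_post = engagement_score(views=100, reads=50, recommends=10)
viral_post = engagement_score(views=100000, reads=10000, recommends=500)
print(small_post > viral_post)  # True
```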

In addition, Medium’s algorithm prioritizes quality over publication date.

Medium also allows users to personalize their experience.

How Do They Do this in the Literature?

In terms of academic papers, only a handful of works have considered this problem.

[1, 2, 3] proposed ranking algorithms for news information, finding the most authoritative news sources and identifying the most interesting events.
All of the proposed algorithms share the following properties:

1. Ranking the news sources:

Important news articles are clustered: an important news story is probably (partially) replicated by many sources. For instance, consider a news article n originating from a press agency. Its importance is also expressed by the number of different online newspapers that replicate n, which means that the (weighted) size of the cluster formed around n is a measure of its importance.

2. Mutual Reinforcement between News Articles and News Sources:

We can assign different importance to different news sources according to the importance of the news articles they produce. Thus, a piece of news coming from the “Washington Post” can be more authoritative than a similar article coming from, say, “ACME press”, since the “Washington Post” is known for producing good stories.
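
The mutual reinforcement idea can be sketched as a HITS-style iteration: an article is as strong as the sources reporting on its cluster, and a source is as strong as the articles it publishes. This is a simplified illustration under our own assumptions (unweighted clusters, no time decay), not the exact algorithm of [1], [2] or [3].

```python
from collections import defaultdict


def rank_articles_and_sources(articles, iterations=30):
    """articles: list of (article_id, source, cluster_id) tuples.

    Returns (article_score, source_score) dictionaries.
    """
    if not articles:
        return {}, {}
    source_score = {source: 1.0 for _, source, _ in articles}
    article_score = {}
    for _step in range(iterations):
        # A cluster is as strong as the sources that report on it.
        cluster_strength = defaultdict(float)
        for _, source, cluster in articles:
            cluster_strength[cluster] += source_score[source]
        # An article inherits the strength of its cluster.
        article_score = {aid: cluster_strength[c] for aid, _, c in articles}
        # A source is as strong as the articles it publishes.
        totals = defaultdict(float)
        for aid, source, _ in articles:
            totals[source] += article_score[aid]
        # Normalize so that the best source always scores 1.0.
        top = max(totals.values())
        source_score = {s: v / top for s, v in totals.items()}
    return article_score, source_score
```

Sources that only publish stories nobody else covers drift toward zero, while sources that cover widely replicated stories rise.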

3. Time awareness:

The importance of a piece of news changes over time. We are dealing with a stream of information where a fresh news story should be considered more important than an old one.


Conclusion

In this post, we discussed a variety of factors to consider when surfacing valuable content from a large amount of aggregated data.
We gathered these factors by looking at:

  • Experts’ recommendations to authors on writing quality content.
  • The methodologies followed by giant content providers.
  • A review of the scientific literature.

How to combine these factors in a meaningful way depends on your application. You may find a way to develop a weighted sum combining all of them, or simply provide some of them as separate features.
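
As a minimal sketch of the weighted-sum option, the function below assumes every factor has already been scaled to the range [0, 1]. The feature names and weights in the example are placeholders; in practice the weights would be tuned by hand or learned from user feedback.

```python
def combined_score(features, weights):
    """Weighted average of feature values that are assumed to lie in [0, 1].

    A feature missing from `features` is treated as 0.
    """
    total_weight = sum(weights.values())
    if total_weight == 0:
        return 0.0
    weighted = sum(w * features.get(name, 0.0) for name, w in weights.items())
    return weighted / total_weight


weights = {"freshness": 1, "trust": 1}
print(combined_score({"freshness": 1.0, "trust": 0.5}, weights))  # 0.75
```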

So pick your factors, make your plan, and let your customers enjoy valuable content!

Did you know that we use all of this, along with other AI technologies, in our app? See what you have just read applied in action: try our Almeta News app. You can download it from Google Play or Apple’s App Store.


References

[1] Del Corso, Gianna M., Antonio Gulli, and Francesco Romani. “Ranking a stream of news.” Proceedings of the 14th international conference on World Wide Web. 2005.

[2] Mahour, Bhavana, and Akhilesh Tiwari. “A Ranking Algorithm for News Data Streams.” International Journal of Computer Applications 94.6 (2014).

[3] Trajkovski, Igor. “Pagerank-like algorithm for ranking news stories and news portals.” International Conference on ICT Innovations. Springer, Heidelberg, 2013.
