Building a Test Collection for Event Detection Systems Evaluation

Before we start, if you're not familiar with the Event Detection task in NLP, you can refer to our previous post on this topic here.

So you’ve built a system to detect events in the media… now what?

While building a system is a key step, how it performs on real-world data matters just as much. We need to know whether it actually works and whether we can trust its decisions.

So, we need to evaluate our system before putting it into use.

Evaluation is a crucial step in the development of any system: it allows us to judge the system's quality and to verify that it meets its goals, so that it provides users with a reliable service.

In this post, we’re discussing a technique to build a test corpus for evaluating an event detection system.

What Does It Mean to Be a Successful Event Detection System?

Event detection systems usually deal with a stream of data (e.g. news articles when working in the traditional media domain).

The performance of an event detection system is measured in terms of how effectively it performs its main functions:

  • Locating similar news stories in the data stream.
  • Detecting the new news stories in the data stream.

How to Judge an Event Detection System's Performance?

Judging an event detection system can be done using a test collection.

A test collection typically consists of:

  1. A set of documents (e.g. news stories).
  2. Events representing information needs.
  3. Relevance judgments (i.e. labels) that specify which documents are relevant to which events.
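To make these three components concrete, here is a minimal sketch of how such a test collection could be represented in Python. The class and field names are hypothetical, chosen for illustration; they are not taken from the post or any specific toolkit.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    doc_id: str
    text: str

@dataclass
class Event:
    event_id: str
    title: str  # short description of the information need

@dataclass
class TestCollection:
    documents: dict                 # doc_id -> Document
    events: dict                    # event_id -> Event
    # relevance judgments: (event_id, doc_id) -> 1 (relevant) / 0 (not relevant)
    judgments: dict = field(default_factory=dict)

    def relevant_docs(self, event_id):
        """All doc_ids judged relevant to the given event."""
        return [d for (e, d), rel in self.judgments.items()
                if e == event_id and rel == 1]
```

A system under evaluation is then scored by comparing the clusters it produces against `judgments`.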

How to Build a Test Collection for Event Detection Systems?

When constructing a test collection there are typically a number of practical issues that must be addressed.

What Items Should be Selected to Create The Document Collection?

To construct a test collection, a dataset has to be acquired first. This document collection should be a good representative of the domain in which the systems will be applied.

How Should a Suitable Set of Events be Generated?

Previous works choose significant (i.e., popular) events as the core of the test collection.

This has several advantages:

  • Rich content (i.e., relevant stories) is more available for popular events.
  • The popularity of the topics might help annotators produce more consistent judgments, as they are likely to be familiar with significant events.

How to determine significant events in a predefined period of time?

  • A set of events that took place in that period is determined with the help of Wikipedia’s Current Events Portal.
  • The event set is then filtered by manually searching for articles on popular news websites such as Aljazeera and CNN… using several manually-crafted queries per event. Only those events that have been discussed by at least # news articles are kept.
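The filtering step above can be sketched as follows. The threshold value and the function name are assumptions for illustration; the post leaves the exact minimum article count unspecified.

```python
# Assumed threshold: the post does not fix the exact number of articles.
MIN_ARTICLES = 3

def filter_significant_events(article_counts, min_articles=MIN_ARTICLES):
    """Keep only events covered by at least `min_articles` news articles.

    `article_counts` maps an event name to the number of news articles
    found for it via the manually-crafted search queries.
    """
    return {event: n for event, n in article_counts.items()
            if n >= min_articles}

counts = {"election": 12, "local fair": 1, "earthquake": 7}
print(filter_significant_events(counts))  # keeps "election" and "earthquake"
```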

How Should The Events be Expressed?

Each event requires getting a set of potentially-relevant documents that constitute its Judgment Pool.

Previous studies showed that query variations for a given event are effective in producing a diverse document pool. Therefore, to diversify the document pool, researchers manually craft a list of keyword and phrase queries for each event by performing an interactive search on news websites. This ensures wide coverage of the event's aspects.

The crafted queries can then be used to search the document collection with an off-the-shelf retrieval engine, retrieving the potentially-relevant documents that constitute the judgment pool.
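As a sketch, this pooling step amounts to taking the union of the top results for each query variant. Here `search` is a stand-in for whatever retrieval engine is used; it is an assumption, not a specific engine's API.

```python
def build_judgment_pool(queries, search, depth=100):
    """Union of the top-`depth` doc_ids returned for each query variant.

    `search(query, k)` is assumed to return a ranked list of doc_ids
    from any off-the-shelf retrieval engine.
    """
    pool = set()
    for q in queries:
        pool.update(search(q, depth))
    return pool
```

Because different queries surface different aspects of the event, the pooled set is more diverse than the results of any single query.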

How Many Events are Required for Obtaining Reliable Evaluation Results?

A minimum of 50 events should be included in the test collection to ensure reliable evaluation results. That is the number typically used in TREC (the Text Retrieval Conference) and reported in the literature to be reliable in practice [1].

Do The Events Represent a Diverse Enough Set of Information Needs?

A range of events should be selected with varying characteristics to test the systems under a range of settings.

Document-Event Relevance Assessments

After identifying the potentially-relevant documents, we need to collect relevance judgments.

Relevance judgment answers the question: is this document talking about this event?

After this judgment, we get a set of clusters, each of which is relevant to a specific event. Each event cluster contains stories covering different aspects of that event.
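A minimal sketch of turning the collected judgments into per-event clusters (function and field names are illustrative, not from the post):

```python
from collections import defaultdict

def build_event_clusters(judgments):
    """Group judged documents by event.

    `judgments` is an iterable of (event_id, doc_id, is_relevant) triples;
    only documents judged relevant enter their event's cluster.
    """
    clusters = defaultdict(set)
    for event_id, doc_id, is_relevant in judgments:
        if is_relevant:
            clusters[event_id].add(doc_id)
    return dict(clusters)
```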

Novelty Judgments

Relevant documents for each event are distributed into clusters of "semantically-similar" documents; each cluster contains a set of relevant documents considered to carry the same information. Determining those internal clusters constitutes a second layer of labels on top of the relevance judgments. The instructions for this labeling task have been shared by the authors of [1].
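One simple, hypothetical way to form such internal clusters is a single-pass, threshold-based grouping. This is a sketch of the idea only, not the procedure used by [1]; `similarity` stands in for any text-similarity measure.

```python
def novelty_clusters(docs, similarity, threshold=0.8):
    """Single-pass grouping of an event's relevant documents.

    Each document joins the first existing cluster whose seed (first)
    document it resembles above `threshold`; otherwise it starts a new
    cluster. `docs` is a list of (doc_id, text) pairs.
    """
    clusters = []  # each cluster is a list of doc_ids
    seeds = []     # representative text of each cluster
    for doc_id, text in docs:
        for i, seed_text in enumerate(seeds):
            if similarity(text, seed_text) >= threshold:
                clusters[i].append(doc_id)
                break
        else:
            clusters.append([doc_id])
            seeds.append(text)
    return clusters
```

A new cluster signals a novel piece of information about the event, which is exactly what novelty judgments are meant to capture.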

Event Detection in the Arabic Language

To the best of our knowledge, there is no prior work that constructs an Arabic-only news test collection as a primary contribution. Therefore, researchers working on Arabic event detection have had to construct their own test collections to conduct their experiments.

The closest work we are aware of is [1]; however, it builds a test collection of tweets rather than news articles.


The event detection task requires relevance judgments for each event. This inherently enables the test collection to support the Ad-hoc Search (AS) task as well. Ad-hoc search is the typical search task in IR, in which an ad-hoc query (representing a topic of interest) is issued to a search system that must retrieve, from a collection of documents, a ranked list of documents relevant to that topic.


A typical test collection comprises three major components: a document collection, a set of topics, and a set of judgments per topic. The construction process raises several challenges. First, the document collection should be representative of the domain in which the systems will be applied. Second, the events should be carefully designed to represent real-world information needs. Third, the document-event pairs to be judged should be carefully selected, and the collected judgments should be consistent to achieve reliable evaluation.

Did you know that we use these and other AI technologies in our app? See what you've just read about applied in action in our Almeta News app. You can download it from Google Play or Apple's App Store.


[1] Hasanain, Maram, et al. “EveTAR: building a large-scale multi-task test collection over Arabic tweets.” Information Retrieval Journal 21.4 (2018): 307-336.
