In a previous article we proposed a method for building an initial system for informativeness detection. The system starts from a small set of manually annotated pairwise comparisons, uses Snorkel to expand these annotations automatically into a larger training set, and then trains a model on that set to estimate article informativeness.
In this article, we go into the details of implementing this plan.
As noted above, Snorkel needs 3 types of data:
- A small manually annotated test set to evaluate the results of the model
- A smaller manually annotated development set to facilitate automatic annotation using Snorkel
- A large unannotated training set to be statistically annotated using Snorkel
Building the test and development sets requires manual annotation. Each data point consists of 2 lead paragraphs, and the human annotator chooses the paragraph they find more informative. We used paragraphs instead of full articles to simplify manual annotation. There were no strict guidelines on what makes one article more informative than another; we mainly relied on the annotators' common sense.
However, for the 2 paragraphs to be comparable, they should describe the same event. We therefore used a dataset developed in-house for the task of event detection, in which articles are grouped into classes that correspond to the same event. The events were sampled from various news outlets and manually annotated. From this event data we build our training, development, and test sets.
The original dataset includes 121 events. We started by removing any event with fewer than 2 articles, which reduced the number of events to 104. These events are distributed across 9 genres. The following graph shows the distribution of the number of articles across genres. We can clearly see that the distribution is extremely imbalanced.
Next, we removed any article with fewer than 30 words, since such articles are mostly related to programs or infographics.
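The two filtering steps above can be sketched as follows. This is a minimal illustration, not our production code: the `events` layout (event id mapped to a list of article texts), the function name, and whitespace tokenization for counting words are all assumptions made for the example.

```python
MIN_ARTICLES_PER_EVENT = 2   # threshold stated in the text
MIN_WORDS_PER_ARTICLE = 30   # threshold stated in the text


def filter_events(events):
    """Apply the two filters to a hypothetical {event_id: [article_text, ...]} dict."""
    # Step 1: drop events that have fewer than 2 articles.
    kept = {
        event_id: articles
        for event_id, articles in events.items()
        if len(articles) >= MIN_ARTICLES_PER_EVENT
    }
    # Step 2: drop articles with fewer than 30 words.
    # Splitting on whitespace is an assumption about how words are counted.
    return {
        event_id: [a for a in articles if len(a.split()) >= MIN_WORDS_PER_ARTICLE]
        for event_id, articles in kept.items()
    }
```

Note that after the second step an event may be left with a single article; such an event simply yields no pairs later on.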
For the manual annotation sets, to make the data points as unique as possible, we randomly sampled an even number of articles from each event and grouped every unique tuple as a data point. The manual set totaled 360 data points distributed among 3 annotators. Furthermore, to measure inter-annotator agreement, a random sample of 23% of the data was shared among the annotators.
For the unannotated training set, on the other hand, we subtracted the randomly sampled manual set from the original event set and then used all possible permutations of 2 articles within the remaining unannotated split. Note that the permutations are restricted to articles from the same event.
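One possible reading of this sampling scheme is sketched below: an even number of articles is drawn per event and paired off so that no article appears in two data points, and a shared fraction of the pairs is then given to every annotator. The function names, the `articles_per_event` parameter, the fixed seed, and the round-robin assignment of the non-shared pairs are assumptions for illustration only.

```python
import random


def sample_disjoint_pairs(events, articles_per_event, seed=0):
    """Draw an even number of articles per event and pair them off."""
    rng = random.Random(seed)
    pairs = []
    for event_id, articles in events.items():
        # Use an even count that does not exceed the available articles.
        n = min(articles_per_event, len(articles))
        n -= n % 2
        chosen = rng.sample(articles, n)
        # Pair consecutive samples so each article is used at most once.
        for i in range(0, n, 2):
            pairs.append((event_id, chosen[i], chosen[i + 1]))
    return pairs


def split_among_annotators(pairs, n_annotators=3, shared_fraction=0.23, seed=0):
    """Give every annotator the shared part plus a slice of the rest."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    n_shared = round(len(shuffled) * shared_fraction)
    shared, rest = shuffled[:n_shared], shuffled[n_shared:]
    # Round-robin assignment of the non-shared pairs (an assumption).
    return [shared + rest[i::n_annotators] for i in range(n_annotators)]
```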
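The pair generation for the unannotated set can be sketched with `itertools.permutations`. The sketch assumes that the subtraction happens at the article level (any article used in the manual set is excluded) and that events are stored as a hypothetical `{event_id: [article, ...]}` dict; neither detail is fixed by the description above.

```python
from itertools import permutations


def build_unlabeled_pairs(events, manual_articles):
    """Build all ordered within-event pairs from articles outside the manual set."""
    pairs = []
    for event_id, articles in events.items():
        # Subtract the articles already used for manual annotation.
        remaining = [a for a in articles if a not in manual_articles]
        # Ordered pairs: (a, b) and (b, a) are distinct data points.
        pairs.extend(
            (event_id, first, second)
            for first, second in permutations(remaining, 2)
        )
    return pairs
```

Because permutations are ordered, an event with n remaining articles contributes n * (n - 1) data points, so large events dominate the unannotated set.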
Analysis of the annotation process
The manual set was annotated by 3 annotators, with a shared random part of the dataset to facilitate analysis. Each annotator had to make one of 3 choices: Top, Down, or Unk. Top means that the first paragraph is more informative, Down means that the second paragraph is more informative, and Unk means that the annotator couldn't choose between them.
We first examined the distribution of tags in the overall annotated set by annotator and found clear differences in the annotators' behaviour. The following figures show the distribution of classes per annotator: annotator_1 tagged 16% of the data as unknown, while for the other annotators this number is less than 3%.
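A per-annotator tag distribution of this kind can be computed with a few lines of Python. The function name and the input format (a plain list of tag strings for one annotator) are assumptions for the sake of the example.

```python
from collections import Counter

TAGS = ("Top", "Down", "Unk")


def tag_distribution(labels):
    """Return the fraction of each tag in one annotator's list of labels."""
    counts = Counter(labels)
    total = sum(counts[tag] for tag in TAGS)
    if total == 0:
        # No labels at all: report zero for every tag.
        return {tag: 0.0 for tag in TAGS}
    return {tag: counts[tag] / total for tag in TAGS}
```

Comparing the `Unk` fraction across annotators is a quick way to spot the kind of behavioural difference described above.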
To further assess this issue, we used the data shared between annotators to calculate Cohen's Kappa inter-annotator agreement. This metric is given by
kappa = (P_o - P_e) / (1 - P_e)
where P_o is the empirical agreement probability (agreement percentage) and P_e is the probability of random agreement. The metric ranges between -1 and +1: values closer to +1 indicate high agreement, values close to zero indicate agreement no better than chance, and negative values indicate agreement worse than chance. The following table shows the metric value for different pairs of annotators and the overall agreement. We tested 2 settings: using all the tags, and discarding the unknown tag.
| classes | 1 vs 2 | 1 vs 3 | 2 vs 3 | average |
We can clearly see that on the small shared set, the agreement between annotators is very low. While this does look like a red flag, it might be attributable to the small size of the shared set.
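For reference, Cohen's Kappa for a pair of annotators can be computed directly from its definition, as in the minimal sketch below. The function names are illustrative, and the way the "discarding the unknown tag" setting is implemented here (dropping every pair where either annotator chose Unk) is an assumption.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa for two equally long lists of labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # P_o: observed fraction of items on which the annotators agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # P_e: chance agreement from each annotator's own tag frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[tag] * freq_b[tag] for tag in freq_a) / (n * n)
    if p_e == 1.0:
        # Both annotators used one identical tag throughout: undefined.
        return float("nan")
    return (p_o - p_e) / (1 - p_e)


def drop_unknown(labels_a, labels_b, unknown="Unk"):
    """Keep only the items where neither annotator chose the unknown tag."""
    kept = [(a, b) for a, b in zip(labels_a, labels_b)
            if a != unknown and b != unknown]
    return [a for a, _ in kept], [b for _, b in kept]
```

With very few shared items, a single disagreement moves the value considerably, which is consistent with the caveat about the small shared set. Libraries such as scikit-learn also provide this metric as `cohen_kappa_score`.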