The ability to filter a news feed by genre is a critical component of any news aggregator: users often want to read only sports or only political news, not just the most recent or hottest items. In this post, we will explore our initial genre classification system in detail.
Let’s start with the… data
In the following experiments, we used an in-house data set.
The data set is composed of 190,307 HTML documents crawled from the following domains: Aljazeera, Alarabia, Aljadeed, RT Arabic, and BBC Arabic.
For each of the documents we tried to extract the following features:
- title: title of the article
- text: the actual text of the article
- keyphrases: using either meta info or keywords set by the author
- summary: some anchors use the inverted pyramid style, so their lead paragraph can be used as a basic extractive summary
- genre (political, sports, …): extracted from the URL, the meta tags, or in some cases the HTML body
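As a rough illustration, extracting these fields from a single page can be sketched with Python’s standard `html.parser`; the real crawler is domain-specific, and the tag and attribute choices here (e.g. a `keywords` meta tag, one `<p>` per paragraph) are assumptions:

```python
from html.parser import HTMLParser

class ArticleExtractor(HTMLParser):
    """Minimal sketch: pull title, meta keywords and body text from one HTML page."""
    def __init__(self):
        super().__init__()
        self._in_title = False
        self._in_p = False
        self.title = ""
        self.keyphrases = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            self._in_p = True
        elif tag == "meta" and attrs.get("name") == "keywords":
            self.keyphrases = [k.strip() for k in attrs.get("content", "").split(",") if k.strip()]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_p:
            self.text_parts.append(data.strip())

def extract_features(html):
    parser = ArticleExtractor()
    parser.feed(html)
    # the lead paragraph doubles as a crude extractive summary (inverted pyramid)
    summary = parser.text_parts[0] if parser.text_parts else ""
    return {"title": parser.title.strip(),
            "text": " ".join(t for t in parser.text_parts if t),
            "keyphrases": parser.keyphrases,
            "summary": summary}
```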
After the information is extracted, only articles that contain text (regardless of size) are preserved (many of the pages represent tags, infographics, etc.), leaving us with 113,526 articles.
Next, we removed any articles that have text but for which we could not extract the genre automatically; this reduced the number of documents to 104,205. The distribution of the articles among the news anchors is shown in the following graph:
However, upon manual inspection we found that several articles, specifically from Aljadeed, had very little text in them. The following figures show the distribution of article lengths in words, without any normalization, for some of the news anchors. Note that nearly 20% of RT and BBC articles are shorter than 50 words, and about 10% of Aljazeera and Aljadeed articles follow the same pattern. These articles were mostly breaking news, one-liners, or reports and infographics. We decided to discard any article shorter than 50 words; the resulting set accounted for 95,140 articles.
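The filtering passes above (non-empty text, then a 50-word minimum) amount to a simple predicate; a minimal sketch, assuming each article is a dict with a `text` field:

```python
MIN_WORDS = 50

def keep_article(article, min_words=MIN_WORDS):
    """Keep only articles whose text is at least `min_words` words long;
    this also drops tag pages and infographics, whose text is empty."""
    text = article.get("text") or ""
    return len(text.split()) >= min_words
```

Applying it is a one-liner: `articles = [a for a in articles if keep_article(a)]`.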
Next, we filtered based on genre: many of the articles had uninformative genres, such as years in the case of Alarabia and numerical values in the case of BBC. The filtered set contained 94,953 articles. The distribution of the genres among anchors is shown below.
Note that all the anchors have extremely splintered genre distributions, with the exception of Aljadeed, which has very few genres. Also note that BBC has several articles with the same genre under different formatting, e.g. “science-and-tech” vs. “scienceandtech”; these will be resolved once we unify the genres across the whole dataset.
The Aljazeera graph was omitted because it was extremely splintered, mainly because Aljazeera’s genres have a hierarchical structure, for example:
- اقتصاد > قضايا (Economy > Issues)
- اقتصاد > مؤسسات وهياكل (Economy > Institutions and Structures)
- الأخبار > استطلاع رأي (News > Opinion Polls)
- الأخبار > الاقتصاد (News > Economy)
- الأخبار > تقارير وحوارات (News > Reports and Interviews)
It is important to find a unified mapping among these different genres when they correspond to the same content. We developed a manual mapping that unifies all of them into a single label set; the categories of this mapping are shown below:
- uncategorizable: genres with no clear topic (editor’s choices, videos, reports, and the like), or genres with too few articles to warrant their own category, such as hajj
- news: news articles covering both the Arab world and other countries; they might include global catastrophes such as earthquakes alongside political events
- world news
- art_and_culture: news focused on artistic topics such as galleries, historical stories, or musical pieces
- IT: news specifically related to IT
- science: a broader category covering both tech and other sciences such as genetics, math, etc.
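In code, the manual mapping is just a lookup table from raw genre strings to the unified labels. The entries below are a small hypothetical fragment (the real table covers every raw genre string seen across the five sources, and the label assignments shown are illustrative):

```python
# Hypothetical fragment of the manual genre mapping.
GENRE_MAP = {
    "science-and-tech": "science",
    "scienceandtech": "science",     # same BBC genre, different formatting
    "hajj": "uncategorizable",       # too few articles for its own category
    "اقتصاد > قضايا": "news",        # hierarchical Aljazeera genre (Economy > Issues)
}

def unify_genre(raw, mapping=GENRE_MAP):
    """Map a raw genre string to the unified label set; anything unseen
    falls back to uncategorizable."""
    return mapping.get(raw.strip(), "uncategorizable")
```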
It is clear that such a list of genres is very limited and more detailed feeds should eventually be considered. However, the choice of these genres is motivated by two reasons:
- the goal of this experiment is to build an initial genre classification system; a smaller number of genres means faster development and relatively better performance
- as we will show next, our dataset is extremely imbalanced even with such a small number of classes; increasing the number of genres might create many small classes on which a machine learning model would struggle to train and generalize
The following figure illustrates the distribution of the whole dataset under this new mapping; the imbalance between the classes is easy to see, with news covering nearly half of the data.
Furthermore, at first glance we can see that the classes are not 100% pure. Below are word clouds for the classes; these are simply the most frequent words excluding stop words.
To simplify evaluation, we created a holdout test set by randomly sampling 100 articles from each class; the following graph illustrates the distribution of the classes in the resulting training set.
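The holdout split can be sketched as a per-class shuffle-and-slice; `per_class=100` matches the sampling described above, and the dict layout of the articles is an assumption:

```python
import random
from collections import defaultdict

def split_holdout(articles, per_class=100, seed=13):
    """Sketch of the holdout split: `per_class` random articles per class
    go to the test set, everything else stays in the training set."""
    by_genre = defaultdict(list)
    for article in articles:
        by_genre[article["genre"]].append(article)
    rng = random.Random(seed)
    train, test = [], []
    for items in by_genre.values():
        rng.shuffle(items)
        test.extend(items[:per_class])
        train.extend(items[per_class:])
    return train, test
```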
As we already mentioned, the data is extremely imbalanced, so training a single multi-class classifier might cause the model to favour the most frequent classes. To circumvent this, for each class we created a balanced one-vs-all training set: for a given genre X, the positive samples are those belonging to X, and the negative samples are randomly selected from the other genres, with each genre contributing roughly the same number of articles. This resulted in a nearly balanced negative set for each genre.
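A sketch of the balanced one-vs-all construction; the even per-genre negative quota is the key point, while the exact rounding is an assumption:

```python
import random
from collections import defaultdict

def one_vs_all(articles, target, seed=13):
    """Balanced one-vs-all set: all articles of `target` are positives; an
    equal number of negatives is drawn, spread as evenly as possible over
    the other genres."""
    rng = random.Random(seed)
    pos = [a for a in articles if a["genre"] == target]
    others = defaultdict(list)
    for a in articles:
        if a["genre"] != target:
            others[a["genre"]].append(a)
    per_genre = max(1, len(pos) // max(1, len(others)))
    neg = []
    for items in others.values():
        neg.extend(rng.sample(items, min(per_genre, len(items))))
    return pos, neg
```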
As for the model, we settled on Facebook’s fastText: it is fast and simple, which is why we started with it (if you are not familiar with the algorithm, check this). Below are the experiments we ran; for brevity we report only F1, and all results are measured on a per-class 10% held-out development set.
| Exp | Description |
| --- | --- |
| 1 | Training directly on each genre’s dataset separately. |
| 2 | In experiment 1 we noticed that the classes with fewer articles had the worst results; this might be because the model cannot learn good word representations from so little data. To rectify this, we pre-trained a fastText model on all the training texts in an unsupervised manner to get better word representations, then fine-tuned it on each class’s data. |
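With the official `fasttext` Python package, the two experiments reduce to a handful of calls. The `__label__<tag>` line prefix is fastText’s supervised input convention; the paths and the helper below are illustrative, not our exact pipeline:

```python
def to_fasttext_line(label, text):
    """fastText supervised format: one article per line, label prefixed."""
    return "__label__{} {}".format(label, " ".join(text.split()))

def train_genre_model(train_path, pretrained_vec=None):
    """Exp 1: train with the library defaults (lr=0.05, 5 epochs, softmax loss).
    Exp 2: additionally pass vectors from a prior unsupervised run via
    `pretrainedVectors` (a .vec file, e.g. dumped from
    fasttext.train_unsupervised over all training texts)."""
    import fasttext  # pip install fasttext
    kwargs = {"input": train_path}
    if pretrained_vec:
        kwargs["pretrainedVectors"] = pretrained_vec
    return fasttext.train_supervised(**kwargs)
```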
Next, we ran a fast hyperparameter search on the development set; details of each experiment follow. For simplicity we did not do a full grid search (no time); instead we selected the best value for each hyperparameter independently. All of the following experiments start from experiment 2.
- Initial learning rate: the default value in exp 2 is 0.05; we tested 0.1 and 0.25 in experiments 3 and 4. The latter gives a small increase in overall F1 of nearly 0.5% absolute.
- Number of epochs: again starting from exp 2, the default value is 5; we tried 10, 25, 50, and 100 (experiments 5, 6, 7, and 8). We found that 25 epochs gave a tangible increase in overall F1 of 1% absolute, while greater values had no real effect.
- Loss function: the default in exp 2 is softmax; we tried two alternatives, negative sampling (exp 9) and hierarchical softmax (exp 10). Negative sampling gave a small increase of 0.5% absolute.
- Best-parameters model: we selected the best value for every parameter, i.e. lr=0.25, epochs=25, and loss=negative sampling; this model gives a sizable increase of 1.5% absolute over the model in exp 2.
The following table shows the overall results on the development set. While they seem high, we are afraid of overfitting, especially in the sports class, since its results were extremely high.
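The search procedure above is a one-pass coordinate search rather than a grid: each hyperparameter is varied on its own while the others stay at their exp-2 defaults. A generic sketch, where the `train_fn`/`score_fn` callables are placeholders for training a fastText model and scoring F1 on the dev set:

```python
def coordinate_search(train_fn, score_fn, defaults, grid):
    """For each hyperparameter independently, try every candidate value with
    all other parameters at their defaults, and keep the best-scoring value."""
    best = dict(defaults)
    for name, values in grid.items():
        scored = []
        for value in values:
            params = dict(defaults)
            params[name] = value
            scored.append((score_fn(train_fn(**params)), value))
        best[name] = max(scored)[1]
    return best
```

Note this assumes the hyperparameters do not interact; a full grid search would be more thorough but far slower.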
Testing on Live Data
To validate our results, we tested the model on a fresh version of our news feed.
We are using a snapshot of the Almeta dataset for this analysis; the original set contains 12,935 samples.
Note that the text fields in this snapshot do not include any English words, as they are extracted from the output of the ESL model, whereas our models were trained on articles with English words retained; this might affect the results on this set. However, this deficiency will not be present in deployment.
Furthermore, we removed all the articles with fewer than 50 words of text.
The distribution of the articles in the set is shown below.
The model used in this analysis is trained with the best configuration we obtained in exp 11, with one difference: we removed the uncategorizable class from the negative samples of the training set, since it was extremely impure. The model is tested in two settings, single-tag and multi-tag:
- Single-tag: we simply retrieve the tag with the highest confidence if that confidence is higher than 0.5, otherwise we return uncategorized. In this case, the model is correct if the tag is suitable for the article.
- Multi-tag: for a given threshold T, we report all the tags with confidence higher than T. Here the model is correct if all the reported tags are suitable for the article.
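Given the per-genre one-vs-all confidences, the two settings are a one-liner each; a sketch assuming `scores` is a dict of genre → confidence:

```python
def single_tag(scores, threshold=0.5):
    """Return the most confident tag, or 'uncategorized' below the threshold."""
    tag, conf = max(scores.items(), key=lambda kv: kv[1])
    return tag if conf > threshold else "uncategorized"

def multi_tag(scores, threshold):
    """Return every tag whose confidence clears the threshold T."""
    return sorted(tag for tag, conf in scores.items() if conf > threshold)
```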
Single Tag Analysis
First, let us look at the distribution of the genres across the dataset and see if it is rational. It seems OK, although we expected news to be larger, while the uncategorizable class is very small.
Next, let us look at the confidence distribution across the whole dataset. Nearly all the tags have very high confidence, which might seem problematic; however, remember that for each article we are selecting the model with the highest confidence. To assess this, let us look at the confidence distribution of each individual model.
Below are the distributions of each individual model. Most of these models have near-binary confidence, meaning each individual classifier is very strict; this might seem problematic, but it does not imply that the bootstrapping of these classifiers is bad. Actually, it is the exact opposite.
We split the data into three ranges based on model confidence; these ranges reflect the extremely binary distribution of the confidence:
- strong confidence [100–90]
- normal confidence [90–50]
- uncategorizable [<50]
Note that these ranges apply only to the most probable tag.
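Bucketing the top tag’s confidence into these ranges is straightforward:

```python
def confidence_range(top_confidence):
    """Map the most probable tag's confidence to one of the three ranges."""
    if top_confidence >= 0.9:
        return "strong"
    if top_confidence >= 0.5:
        return "normal"
    return "uncategorizable"
```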
The strong-confidence range constitutes most of the dataset, 11,813 articles. As we have seen, the distribution is very binary, so we analyzed only 2% of them, i.e. 236 articles. Furthermore, to avoid bias towards the news class due to its size, the selected sample was uniformly distributed between the genres to facilitate per-genre error evaluation. The following table shows the analysis of the errors on this set, where:
- #false_positives: actual false positives, where the model totally failed to select the correct tag even though it was possible to give the article a tag from the classes we considered
- #incidents: one class we have seen a lot in the false positives, namely incidents such as killings, thefts, or accidents. Examples:
- شاهد.. إخلاء جوي للمتواجدين في محطة قطار الحرمين خلال الحريق (“Watch: aerial evacuation of people at the Haramain train station during the fire”)
- شرطة سلطنة عمان تصدر بيانا بعد القبض على أفراد عصابة من جنسية عربية (“Oman’s police issue a statement after arresting members of a gang of Arab nationality”)
- فيديو.. حريق في منشأة نفطية إيرانية (“Video: fire at an Iranian oil facility”)
- #unknown: articles that do not fit any of the classes used. Example:
- صراع تحت الماء من أجل البقاء (“An underwater struggle for survival”)
- #invalid: articles with invalid text from the collection process
Now let us analyze the patterns in the false positives:
- Incidents or local news: this is a relatively common class; its share in this small sample is around 3%, but it was not accounted for.
- Weapons industry: a lot of articles, specifically from RT, talk about weapons technology and industry. This class was not considered during training, and at test time it represents the bulk of the science and tech classes. Examples:
- الولايات المتحدة تعلن عن اختبار ناجح لصاروخ مضاد للسفن قرب جزيرة غوام (“The United States announces a successful test of an anti-ship missile near the island of Guam”)
- انطلاق مبيعات النماذج المدنية لـ”كلاشينكوف آكا -12″ بالتجزئة (“Retail sales of the civilian models of the Kalashnikov AK-12 kick off”)
- شاهد.. إنزال فوج من مظليي البحرية الروسية (“Watch: an airborne drop of a Russian naval infantry regiment”)
The normal-confidence range amounts to 1,115 articles from the set; we selected 10% of it, 111 articles, uniformly among the genres, and manually checked the results.
The main problem in this range is the prevalence of invalid articles. Following is an example (sorry, it is normalized, so it is a bit hard to read; P and D appear to stand for punctuation marks and digits). Note how the article is basically word salad:
الوسائط مكتبه التقارير P D P D D P D P P D P D P P P P P P D D D D D D D D D P P P D P D D P D P P D P D يجمع خبراء المال والاعمال علي ان هناك عده اسباب تقف وراء فشل الكثير من المشاريع التي تكون واعده في بداياتها P ومن الاسباب التقليد الاعمي للمشاريع الاخري P او عدم الاستعانه بالخبرات المؤهله P او عدم القدره علي ضبط المصاريف P تقرير P محمد فاوري قراءه P محمد رمال تاريخ البث P D بحسب وسائل اعلام تركيه فان المنطقه الامنه تمتد من نهر الفرات غربا الي مدينه المالكيه بمثلث الحدود التركيه العراقيه بعمق يتراوح بين D كيلومترا و D كيلومترا P ويسيطر علي معظمها وحدات حمايه الشعب الكرديه P تاريخ البث P D اطلقت تركيا عمليه عسكريه بشمال شرق سوريا اسمتها P نبع السلام P وقالت ان هدفها محاربه التنظيمات الارهابيه P الا ان ردود الفعل الدوليه تباينت بشانها P فقد ندد بها الاوروبيون ورفضت اميركا المشاركه فيها P في حين اظهرت روسيا وايران تفهمها للعمليه P تقرير P ناصر ايت طاهر تاريخ البث P D اكدت رئاسه تركيا علي لسان رئيس دائره الاتصال فيها ان الجيش التركي جاهز لعبور الحدود السوريه وتنفيذ عمليه عسكريه P في حين توقع مسؤولون اميركيون ان يبدا الاتراك العمليه التي يشارك فيها الجيش السوري الحر خلال D ساعه P تقرير P فاضل ابو الحسن تاريخ البث P D تمور فلسطينحسب تاكيد البنك المركزي الاردني ارتفاع حجم مديونيه الافراد الي D مليار دولار P وهناك عوامل عده قللت من القدره الشرائيه للاردنيين واجبرتهم علي الاقتراض P في مقدمتها ارتفاع مطرد لاسعار السلع والخدمات وزياده معدل الضرائب P تقرير P رائد عواد تاريخ البث P D
The second important problem is the prevalence of weapons articles in this range, which particularly hurts the tech class.
The last range has very low confidence across all the models and comprises only uncategorizable articles. It holds only 205 articles; we manually annotated 25 of them, and only 11 were actually uncategorizable. The best option for this range is simply to ignore it, since it holds very few articles.
Text Length Impact
To measure the impact of text length on the classifier’s results, let us look at the distribution of the errors by article length in words. The following graph (left) shows the histogram over the three manually annotated sets. We can clearly see that the errors concentrate in short articles and nearly vanish as the articles grow longer. To validate this further, let us inspect the distribution for the high-confidence range alone, since it represents the bulk of the results; the following graph (right) illustrates it. Here the correlation between article length and error rate is even clearer (most of the errors happen in articles shorter than 150 words); most importantly, for articles shorter than 50 words the accuracy falls below 50%.
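The length analysis boils down to an error rate per word-count bucket; a sketch over (word_count, is_error) pairs from the annotated sets, with illustrative bucket edges:

```python
def error_rate_by_length(samples, edges=(0, 50, 150, 500, 10_000)):
    """samples: (word_count, is_error) pairs from the manually annotated sets;
    returns the error rate per length bucket, mirroring the histogram above."""
    buckets = {}
    for lo, hi in zip(edges, edges[1:]):
        in_bucket = [err for n, err in samples if lo <= n < hi]
        if in_bucket:
            buckets[(lo, hi)] = sum(in_bucket) / len(in_bucket)
    return buckets
```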
Multi Tag Analysis
Again, we split the confidence range into the following sub-ranges:
- Strong confidence [100-90]
- Normal confidence [90-50]
Here we test in two settings:
- Strict: all the reported tags must be correct for the example to be counted as correct
- Mild: the majority of the reported tags must be correct for the example to be counted as correct
The strong-confidence range covers 1,122 articles that can be annotated with at least two tags. The following chart illustrates the distribution of the number of tags: most of the articles can be assigned 2 tags, some 3, and hardly any 4.
We sampled 10% of these articles, i.e. 111 articles, and manually checked them in the two aforementioned settings.
- Strict: 86 articles had a correct mapping, i.e. an accuracy of 77%. Of the errors, 28% were caused by confusion in the health class, 24% by tech, and 12% by culture.
- Mild: 108 articles were correct, i.e. an accuracy of 97%. This is mainly because our classes are really wide.
The main error patterns here are extremely similar to the single-tag case, specifically weapons and military training in the case of tech, and incidents in the case of health.
The normal-confidence range contains only 140 articles (those with 2 or more classes with confidence between 0.7 and 0.9). The following chart illustrates the distribution of the number of tags; again most of the articles can be assigned 2 tags, some 3, and hardly any 4. We manually annotated 50% of this set to look for any unusual error patterns.
In this article we detailed the process we took to develop our initial genre classification model; here are the notes we found:
- The plain accuracy of the models in this case is not awful, but it does need to be improved.
- The pre-normalized text in the test set reduced the accuracy, and we expect better results in deployment.
- Short articles represent a problem for this model.
- Most of the models have nearly binary confidence. This is good, since a bootstrapped or voting model will have less confusion; but it also means that far fewer results will have 2 or more tags, which seems inaccurate, as many of the articles we reviewed can carry more than a single tag.
- The main cause of errors is the under-representation of some classes, most notably incidents, weapons, tourism, etc. The best way to handle this is to create a fairly detailed taxonomy based on our own choice of important classes rather than the tags given by the authors, and to scrape specialized sites for it; this could enable something like hashtags and hashtag following, in a manner similar to Tumblr.
- Multi-tag classification can and should be implemented, since its accuracy is rather OK.
- The binary nature of the confidence nearly wipes out the other classes. This might be caused by the selected model (fastText); we can resolve it in two ways:
- add more models to cover as many other classes as possible
- use different models with less extreme confidence
Did you know that we use all of this and other AI technologies in our app? Take a look at what you’re reading now, applied in action. Try our Almeta News app; you can download it from Google Play or Apple’s App Store.