In a previous article, “How to Detect Clickbait Headlines using NLP?”, we introduced the task of clickbait detection and explored how it can be modeled as a machine learning and NLP problem. If you are not familiar with clickbait detection, make sure to review that article before continuing.
In this post, we build a clickbait classifier for news headlines on top of a pre-trained Arabic Word2Vec model, and we validate this solution. If you are not familiar with Word2Vec, you can refer to the Wikipedia article on it for more information.
News Headlines Representation
To get a vector representation of a headline using word2vec, we aggregate the vectors of all the words in the headline by averaging them. A headline can be represented this way because it is a very short text; the method does not carry over to longer texts such as the article body.
Word2vec features are not easily interpretable, so to get some insight into how well this representation suits the problem, we projected the training-set headline vectors onto a two-dimensional space using the t-SNE algorithm.
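The averaging step can be sketched as follows. This is a minimal illustration, not our production code: the function name `headline_vector`, the toy 3-dimensional vectors, and the zero-vector fallback for headlines with no known words are all assumptions made for the example.

```python
import numpy as np

def headline_vector(tokens, word_vectors, dim):
    """Average the vectors of the in-vocabulary tokens of one headline."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        # No known word: fall back to a zero vector (an assumption of this sketch).
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Toy 3-dimensional "embeddings"; a real pre-trained model has far more
# dimensions and would be loaded from disk.
toy_vectors = {
    "خمس": np.array([1.0, 0.0, 2.0]),
    "حقائق": np.array([3.0, 2.0, 0.0]),
}

# The third token is not in the toy vocabulary, so it is skipped.
vec = headline_vector(["خمس", "حقائق", "مثيره"], toy_vectors, dim=3)
```

Skipping out-of-vocabulary words keeps the average from being diluted by words the model knows nothing about.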
The following graph shows the projection:
neg: the negative class, indicates “not-clickbait”.
pos: the positive class, indicates “clickbait”.
As the projection shows, the clickbait headlines are mostly grouped together in a cluster that separates them from the not-clickbait headlines. The overlap that does occur may come from three sources:
- Annotation error.
- Projection error.
- Representation error, which will be a part of our classification error.
Hence, we can say that this representation captures our problem effectively, though not perfectly, so we can proceed with our implementation.
Another thing to note from this projection is that the classes do not look linearly separable, at least in two dimensions.
Is it really effective, though? It can represent the problem, but the classes are definitely not 100% separable, since there is an area of intersection.
I think it is, based on inspecting the results, and at least in comparison to our first implementation, which relied on hand-crafted features. Moreover, I think most of our error with this representation comes from insufficient coverage in the training set, not from the representation itself. Finally, a representation does not have to be perfect to be effective.
Testing on Real-World Data
After training a clickbait GBoost classifier on our training dataset using the representation discussed above, we analyze its performance on real-world data in this section.
The Test Dataset
The dataset is a collection of article headlines gathered from ALMETA’s news feed data, plus other headlines scraped from news outlets’ websites. It contains ~14,000 headlines. The following pie chart shows how these headlines are distributed over the news outlets they were collected from:
For a valid evaluation, none of these domains were used to collect the training dataset.
The Probability Distribution of Being Clickbait
Let’s look at the distribution of clickbait probabilities produced by this classifier:
Most of the headlines have a low probability of being clickbait. This makes sense: clickbait is an anomalous characteristic, so most real-world headlines should not be clickbait.
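A distribution like this can be summarized by binning the predicted probabilities. The sketch below assumes the classifier's probabilities are already available as a list; the values in `probs` are made up for illustration and are not our test results, and `probability_histogram` is a hypothetical helper name.

```python
import numpy as np

def probability_histogram(probs, n_bins=10):
    """Return the share of headlines falling in each equal-width probability bin."""
    counts, edges = np.histogram(probs, bins=n_bins, range=(0.0, 1.0))
    return counts / counts.sum(), edges

# Made-up probabilities, skewed toward low values like the real distribution.
probs = [0.02, 0.05, 0.08, 0.12, 0.15, 0.31, 0.47, 0.55, 0.81, 0.97]
shares, edges = probability_histogram(probs)
```

Here `shares[0]` is the fraction of headlines with a probability below 10%, and so on up the range.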
Investigating The Learned Features
To be more confident in the model, let’s try to discover the features it uses to separate the classes. It would be a good sign if they make sense to us as humans.
The following method was applied to discover the types of words that the classifier uses to detect clickbait:
- Clustering the Arabic words (the vocabulary of the pre-trained word2vec model) according to their word2vec representation, such that semantically related words _from the model viewpoint_ are grouped together.
- Finding the word clusters that appear most often in the headlines that have a high probability of being clickbait.
- Highlighting the words that belong to each of these clusters in those headlines.
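The last two steps can be sketched as below, assuming the clustering step has already produced a word-to-cluster mapping. The names `top_clusters` and `highlight`, the 0.9 probability threshold, and the English toy tokens are all assumptions made for illustration.

```python
from collections import Counter

def top_clusters(headlines, probs, word_to_cluster, threshold=0.9, n=2):
    """Rank clusters by how many high-probability headlines contain them."""
    counts = Counter()
    for tokens, p in zip(headlines, probs):
        if p >= threshold:
            # Count each cluster at most once per headline.
            counts.update({word_to_cluster[t] for t in tokens if t in word_to_cluster})
    return [cluster for cluster, _ in counts.most_common(n)]

def highlight(tokens, word_to_cluster, cluster_id):
    """Wrap the tokens that belong to cluster_id in ** markers."""
    return " ".join(
        f"**{t}**" if word_to_cluster.get(t) == cluster_id else t for t in tokens
    )

# Toy data: cluster 0 ~ sensational adjectives, 1 ~ question words, 2 ~ politics.
word_to_cluster = {"shocking": 0, "amazing": 0, "why": 1, "minister": 2}
headlines = [
    ["shocking", "video", "why"],
    ["amazing", "trick"],
    ["minister", "visits", "city"],
]
probs = [0.95, 0.92, 0.10]

ranked = top_clusters(headlines, probs, word_to_cluster)
marked = highlight(headlines[0], word_to_cluster, ranked[0])
```

Counting a cluster once per headline, rather than once per word, keeps a single headline with many related words from dominating the ranking.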
Following are examples from our test set with the discovered words in bold:
|Word-cluster interpretation|Examples|
|---|---|
|Adjective|بريد جستون تتلقي صدمه جديده|
||ثغره خطيره في انستغرام صورك وفيديوهاتك ليست خاصه بك|
||بعد انستغرام فيسبوك تختبر ميزه صادمه للمستخدمين|
|Words common in clickbait|باختصار خمس حقائق مثيره عن اصحاب العيون الخضراء|
||اسرار جمال منزليه .. لا تعرفها سوي المراه التركيه|
|Time adverbs|بعد ان حير العلماء الكشف عن اصل وموطن الحشيش|
||مغنيه اميركيه تعتنق الاسلام بعد حادث نيوزلندا اشعر ببراءه الطفوله|
|Stopwords common in clickbait|من الاعتزال الي احضان الغوريلا حلا شيحه تثير الانتقادات مجددا|
||ناسا تقرر العوده برواد فضاء الي القمر مره اخري|
|Question words|هل يمكنك البحث عن معلومه داخل فيديو ؟ غوغل ترد|
||ثلاث صور غدت ايقونات ل 11 سبتمبر 2001 .. فماذا يقول اصحابها ؟|
|Time adverbs, demonstrative pronouns, etc.|فانتازيا بشائر طيبه حينما تتفوق الدراما علي الروايه الاصليه|
||لعشاق الايفون هكذا سيصبح سعر الهاتف لو صنع في اميركا|
||السيارات الكهربائيه واقع يشبه الخيال العلمي ولكن|
|Currencies|ارتداء الشورت في السعوديه يكلفك خمسه الاف ريال|
||وسم غرامه البيجامه الف دينار يتصدر بالكويت والسبب|
|Superlative adjective|اطول رحله جويه في العالم تهبط في نيويورك فكم دامت|
||هل يساريو اليد اذكي وانجح من يمينيي اليد|
Recall the handcrafted features used for the clickbait problem in “How to Detect Clickbait Headlines using NLP?”. In comparison, these learned features seem sensible.
Measuring The Error Rate
We measure the model’s error rate in different sub-ranges of the clickbait probability range by manually inspecting randomly sampled examples: for each sub-range, we sample examples classified with probabilities in that sub-range, label them manually, and count the errors.
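The sampling and counting procedure can be sketched as follows. The function names, the sub-range edges, the sample size, and the toy labels are assumptions for illustration only; in practice the manual labels come from a human annotator.

```python
import random

def sample_by_subrange(probs, edges, k, seed=0):
    """Pick up to k random example indices from each [lo, hi) probability sub-range."""
    rng = random.Random(seed)
    samples = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        idx = [i for i, p in enumerate(probs) if lo <= p < hi]
        samples[(lo, hi)] = rng.sample(idx, min(k, len(idx)))
    return samples

def error_rate(sampled_idx, manual_labels, predicted_label):
    """Share of sampled examples whose manual label disagrees with the prediction."""
    if not sampled_idx:
        return 0.0
    wrong = sum(1 for i in sampled_idx if manual_labels[i] != predicted_label)
    return wrong / len(sampled_idx)

# Toy data: predicted probabilities and manual labels (1 = clickbait).
probs = [0.55, 0.60, 0.70, 0.80, 0.90, 0.95]
manual_labels = [0, 1, 1, 1, 1, 0]
# The last edge is slightly above 1 so that a probability of exactly 1.0 is kept.
edges = [0.5, 0.75, 1.01]

samples = sample_by_subrange(probs, edges, k=10)
# False-positive rate per sub-range: predicted clickbait (1), manually labeled 0.
fp_rates = {rng: error_rate(idx, manual_labels, 1) for rng, idx in samples.items()}
```

Fixing the random seed makes the sampled set reproducible, which helps when the manual labeling is done in several sessions.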
The False-Positive Error
These are headlines classified as being clickbait while they’re not. The following graph shows the distribution of this error over the probability sub-ranges:
So, ~6% of the not-clickbait headlines may be classified with a high probability of being clickbait. As the probability of being clickbait approaches 50%, this error rate increases.
The False-Negative Error
These are headlines classified as not-clickbait while they are in fact clickbait; in other words, clickbait headlines that were assigned a low probability of being clickbait. The following graph shows the rate of this error in the low-probability range [0%, 50%):
So, ~10% of the clickbait headlines may be classified with a low probability of being clickbait.
Analyzing The Error Sources
We manually checked the false-positive errors for recurring patterns and found some word types that appear frequently in them:
|Word type|Examples|
|---|---|
|Person-related words|يونيسيف الاطفال يشكلون ثلث ضحايا تجاره البشر في العالم|
||الصحف المصريه تركز الاهتمام علي مصرع ابن كاهانا وزوجته ومقررات القمه الخليجيه|
|Car-related words|سياره هيونداي سانتا في بجيلها الرابع|
||تحسين موديل اكس 4 من بي ام دبليو|
|Art-related words|ساره والموسيقي مجموعه من الاوركسترات الرائعه|
||وفاه عازف البيانو الشهير فيكتور بورغ|
These word-types are clearly neutral and shouldn’t be indicators of clickbait.
The following graph shows the rate of presence of each of these word types in the
To determine the cause of this error type, we investigated the presence of these word types in the training set using the following method:
- Clustering the Arabic words (the vocabulary of the pre-trained word2vec model) according to their word2vec representation, such that semantically related words _from the model viewpoint_ are grouped together.
- Finding the word cluster that corresponds to each word type.
- Calculating the rate of headlines that contain words from this cluster in both the clickbait and not-clickbait classes of the training set.
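The last step can be sketched as below, assuming the cluster for a word type is available as a set of words. The function name `cluster_rate_by_class`, the `pos`/`neg` labels as used in the projection legend, and the English toy headlines are assumptions for illustration.

```python
def cluster_rate_by_class(headlines, labels, cluster_words):
    """Share of headlines in each class that contain at least one cluster word."""
    totals, hits = {}, {}
    for tokens, label in zip(headlines, labels):
        totals[label] = totals.get(label, 0) + 1
        if cluster_words.intersection(tokens):
            hits[label] = hits.get(label, 0) + 1
    return {label: hits.get(label, 0) / totals[label] for label in totals}

# Toy training set: "pos" = clickbait, "neg" = not-clickbait.
car_words = {"toyota", "car"}
headlines = [
    ["see", "this", "car"],
    ["toyota", "gift"],
    ["secret", "trick"],
    ["parliament", "votes"],
]
labels = ["pos", "pos", "pos", "neg"]

rates = cluster_rate_by_class(headlines, labels, car_words)
```

A large gap between the two rates for a neutral word type is the kind of training-set bias this check is meant to surface.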
The following graph shows the rates of these word types in both classes of the training set:
So the training set seems biased with respect to these word types: they are over-represented in the clickbait examples and rare in the not-clickbait examples, so the classifier learned to give them high weights as clickbait indicators.
Examples of these word types in the clickbait class of the training set:
|Word type|Examples|
|---|---|
|Person-related words|تعرفوا على سبب اكل هذه طفله لسجاد والرمل|
||بالفيديو: لحظه انقاذ فتاه حاولت الانتحار من اعلى مبنى|
||صور شاب يحول نفسه الى اميرات ديزني بمكياج يستحيل ان يكتشفه احد|
|Car-related words|صور: سعودي يحصل على سياره تويوتا هديه .. تعرف على السبب|
||فيديو سوبارو معدله تخرج من حفره عميقه بطريقه مذهله! شاهد بالصوره …|
||عمر البشير بملابس بيضاء وسياره ” لاند كروزر “… اول ظهور للبشير منذ الاطاحه به|
|Art-related words|شمبانزي بدرجه فنان يعزف على الجيتار بطريقه رائعه|
||بالفيديو : اب يشارك طفلته رقصه مجنونه برفقه المكنسه|
||رد فعل مفاجئ لانثى اسد لم يعجبها عزف احد الزوار|
In this post, we proposed and validated a solution to the clickbait problem using the word2vec representation of article headlines. We also tried to interpret the behavior of the resulting model by discovering the learned features and analyzing its errors. The model seems to be affected by the domains of the articles it was trained on. To improve the results, we need to build a more diverse and less biased dataset.
I don’t have any other proposals. I believe the article content is not useful for this problem; moreover, it can’t be represented by averaging word2vec vectors because it is long, and representing it differently and then concatenating it with the title seems messy. The only solution I have is to improve the dataset.