Jigsaw Unintended Bias in Toxicity Classification

Pulkit Ratna Ganjeer
Apr 22, 2022


In recent years, social media platforms have become important spaces where users can express their thoughts and ideas, share content, and interact with one another. Unfortunately, toxicity has become a serious issue affecting a wide range of users, causing psychological problems such as depression and even suicidality.


Toxicity in online conversation can be defined as anything rude, disrespectful, trolling, or otherwise likely to make someone leave a discussion, offend them, or spread hatred and/or false information. Toxicity online poses a serious challenge for platforms and publishers. Online abuse and harassment suppress important voices in conversation, forcing already marginalized people offline.

The prevalence of online toxicity has been growing rapidly in recent years, and the problem is a major concern. In some cases, toxic comments online have even caused real-life violence.

Social media platforms depend on thousands of human reviewers who struggle to moderate the ever-increasing volume of toxicity. In 2019, it was reported that Facebook moderators are at risk of suffering from PTSD (post-traumatic stress disorder) as a result of repeated exposure to such distressing content.

Hence, solving this problem by machine learning can help manage the rising volume of toxicity while limiting human exposure to it, and help mitigate toxicity and ensure healthy conversations online.

Prerequisites

  1. Understanding of Statistical Concepts/Plots.
  2. Familiarity with Deep Learning concepts such as:
    i. Neural Networks and their layers.
    ii. Recurrent Neural Network-based models like LSTM.
    iii. Attention Mechanism.
    iv. Transformer-based models like BERT.

Business Problem

Jigsaw hosted a competition in 2018 called “Toxic Comment Classification Challenge”, which aimed to identify the toxicity in a given comment. Participants were expected to build a model that detects different kinds of toxicity (toxic, severe toxic, obscene, threat, insult, and identity hate) by predicting the probability of each type of toxicity for each comment. The dataset included a large set of Wikipedia comments labeled by human raters for toxic behavior. However, it was observed that models trained on this dataset showed biases towards minority groups. For example, the sentence “I am a black man” received a toxicity score of 80%, while the sentence “I am a man” received a toxicity score of 20%. Such bias happened because the dataset was prepared from sources where, unfortunately, certain identities like “black” are often referred to in offensive ways.

To overcome this unintended bias problem, Jigsaw hosted another competition in 2019 called “Jigsaw Unintended Bias in Toxicity Classification”, with the intent to detect toxic comments while minimizing unintended model bias. Such a machine learning solution can cater to a variety of use cases: comment sections, forums, or any text-based conversation. This case study aims to use machine learning to identify toxic comments without any bias, making it easier to host better conversations online.

Machine Learning Formulation of the Business Problem

The problem can be projected into a binary classification task of determining whether a given text is toxic or not, along with providing the likelihood probability of toxicity in the given text. Given the dataset containing comments and labels for identity mentions, we need to build a model that would recognize toxicity and minimize unintended bias for mentions of identities, by optimizing a metric designed to measure unintended bias.

Business Constraints

  1. Low Latency Requirement: As the toxicity score can be given as feedback to commenters as soon as they write something, low latency is required so that a commenter can immediately know whether the text they have typed contains any toxicity.
  2. Class Probabilities: It would be better to provide a toxicity score that represents the likelihood that someone will perceive the given text as toxic.
  3. Minimum Unintended Bias: To overcome the problems observed after the 2018 Jigsaw competition, it is of utmost importance for the model to minimize any unintended bias while predicting toxicity. The metric suggested by Jigsaw helps account for this constraint.

Source of Data

The data is taken from the Kaggle competition “Jigsaw Unintended Bias in Toxicity Classification”.

The distribution of the dataset between toxic and non-toxic comments is shown below:

Dataset Distribution between toxic and non-toxic comments

Dataset Details

There are three files:

  1. train.csv: Training dataset containing comment text, toxicity label, subgroup, and identities.
  2. test.csv: Test dataset containing only the comment text for which the toxicity has to be found, without any toxicity labels, subgroups, or identities.
  3. sample_submission.csv: A sample submission file in the correct format to be submitted for leaderboard scores.

Column Details

The training dataset contains 45 columns/attributes:

  1. id: Unique identifier for comment text in training and test dataset.
  2. target: Each comment in the training dataset has a toxicity label representing the fraction of human raters/annotators who believed the given comment was toxic. This has to be found out for the test dataset.
  3. comment_text: Text of the comments.
  4. Subgroup or Toxicity Subtype Attributes: Represents subtypes of toxicity. The following 6 subgroups are available: severe_toxicity, obscene, threat, insult, identity_attack, and sexual_explicit. The model does not need to predict these attributes.
  5. Identity Attributes: Represents various identities mentioned in the comment. A total of 24 identities are given in the training dataset: male, female, transgender, other_gender, heterosexual, homosexual_gay_or_lesbian, bisexual, other_sexual_orientation, christian, jewish, muslim, hindu, buddhist, atheist, other_religion, black, white, asian, latino, other_race_or_ethnicity, physical_disability, intellectual_or_learning_disability, psychiatric_or_mental_illness, and other_disability.
    Out of these 24 identity attributes, only the identities with more than 500 examples in the test dataset will be included in the evaluation calculation. Hence, we will consider only these nine attributes: male, female, homosexual_gay_or_lesbian, christian, jewish, muslim, black, white, and psychiatric_or_mental_illness.
  6. In addition to the attributes described above, the training dataset also provides metadata from Jigsaw’s annotation: toxicity_annotator_count and identity_annotator_count, and metadata from Civil Comments: created_date, publication_id, parent_id, article_id, rating, funny, wow, sad, likes, and disagree.

Performance Metrics

In this case study, a newly developed custom metric has been provided by the competition, which combines several sub-metrics to balance overall performance with various aspects of unintended bias.

Source: Kaggle Evaluation

Before we define the final metric, let’s define each sub-metric.

1. Overall AUC

This is the ROC-AUC for the full evaluation set.

Code to calculate the AUC
Code to compute the Overall AUC
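As a reference, here is a minimal sketch of how the overall AUC can be computed with scikit-learn, assuming an evaluation DataFrame with a boolean toxicity label column and a prediction column (the column names below are illustrative):

```python
from sklearn.metrics import roc_auc_score

def compute_auc(y_true, y_pred):
    # ROC-AUC between binary toxicity labels and predicted toxicity scores
    return roc_auc_score(y_true, y_pred)

def compute_overall_auc(df, label_col='is_toxic', pred_col='prediction'):
    # The Overall AUC is simply the ROC-AUC over the full evaluation set
    return compute_auc(df[label_col], df[pred_col])
```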

2. Bias AUCs

To measure the unintended bias, we again calculate the ROC-AUC, on three specific subsets of the test dataset for each identity, each capturing a different aspect of unintended bias.

The below figure will help in understanding the terms used in the Bias AUC sub-metrics:

Different Subgroups in the Dataset

2.1. Subgroup AUC

Here, we restrict the data set to only the examples that mention the specific identity subgroup. A low value in this metric means the model does a poor job of distinguishing between toxic and non-toxic comments that mention the identity.

Code to compute the Subgroup AUC
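A minimal sketch of the Subgroup AUC, assuming the identity columns and the toxicity label have been binarized to booleans at the 0.5 threshold (column names are illustrative):

```python
from sklearn.metrics import roc_auc_score

def compute_subgroup_auc(df, subgroup, label_col='is_toxic', pred_col='prediction'):
    # Keep only the examples that mention the given identity subgroup
    subgroup_examples = df[df[subgroup]]
    return roc_auc_score(subgroup_examples[label_col], subgroup_examples[pred_col])
```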

2.2. BPSN (Background Positive, Subgroup Negative) AUC

Here, we restrict the test set to the non-toxic examples that mention the identity and the toxic examples that do not. A low value in this metric means that the model confuses non-toxic examples that mention the identity with toxic examples that do not, likely meaning that the model predicts higher toxicity scores than it should for non-toxic examples mentioning the identity.

Code to compute the BPSN AUC
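A minimal sketch of the BPSN AUC under the same assumptions (boolean identity and label columns):

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def compute_bpsn_auc(df, subgroup, label_col='is_toxic', pred_col='prediction'):
    # Non-toxic examples that mention the identity (Subgroup Negative)
    subgroup_negative = df[df[subgroup] & ~df[label_col]]
    # Toxic examples that do not mention the identity (Background Positive)
    background_positive = df[~df[subgroup] & df[label_col]]
    examples = pd.concat([subgroup_negative, background_positive])
    return roc_auc_score(examples[label_col], examples[pred_col])
```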

2.3. BNSP (Background Negative, Subgroup Positive) AUC

Here, we restrict the test set to the toxic examples that mention the identity and the non-toxic examples that do not. A low value here means that the model confuses toxic examples that mention the identity with non-toxic examples that do not, likely meaning that the model predicts lower toxicity scores than it should for toxic examples mentioning the identity.

Code to compute the BNSP AUC
Code to compute the metrics for each subgroup
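A minimal sketch of the BNSP AUC and of collecting all three bias AUCs per identity into one table, reusing the helpers sketched above (column names remain illustrative):

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def compute_bnsp_auc(df, subgroup, label_col='is_toxic', pred_col='prediction'):
    # Toxic examples that mention the identity (Subgroup Positive)
    subgroup_positive = df[df[subgroup] & df[label_col]]
    # Non-toxic examples that do not mention the identity (Background Negative)
    background_negative = df[~df[subgroup] & ~df[label_col]]
    examples = pd.concat([subgroup_positive, background_negative])
    return roc_auc_score(examples[label_col], examples[pred_col])

def compute_bias_metrics_for_subgroups(df, subgroups, label_col='is_toxic', pred_col='prediction'):
    # One row per identity, holding its three bias AUCs
    records = []
    for subgroup in subgroups:
        records.append({
            'subgroup': subgroup,
            'subgroup_size': int(df[subgroup].sum()),
            'subgroup_auc': compute_subgroup_auc(df, subgroup, label_col, pred_col),
            'bpsn_auc': compute_bpsn_auc(df, subgroup, label_col, pred_col),
            'bnsp_auc': compute_bnsp_auc(df, subgroup, label_col, pred_col),
        })
    return pd.DataFrame(records).sort_values('subgroup_auc', ascending=True)
```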

3. Generalized Mean of Bias AUCs

To combine the per-identity Bias AUCs into one overall measure, we calculate their generalized mean as defined below:

Generalized Mean of Bias AUCs (Source)
Code to compute the Power Mean
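According to the competition's evaluation page, the generalized mean is M_p(m_s) = ((1/N) * sum over s of m_s^p)^(1/p), where m_s is the bias AUC for identity subgroup s, N is the number of identity subgroups, and p = -5. A minimal sketch:

```python
import numpy as np

def power_mean(values, p=-5):
    # Generalized (power) mean; the competition uses p = -5, which strongly penalizes low per-identity AUCs
    values = np.asarray(values, dtype=np.float64)
    return np.power(np.mean(np.power(values, p)), 1.0 / p)
```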

4. Final Metric

We combine the overall AUC with the generalized mean of the Bias AUCs to calculate the final model score:

Final Metric (Source)
Code to compute the Final Metric
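Per the evaluation page, the final score is score = w0 * AUC_overall + sum over the three bias AUCs of w_a * M_p(m_s,a), with all four weights equal to 0.25. A minimal sketch that consumes the per-subgroup table and the power_mean helper sketched above:

```python
import numpy as np

OVERALL_WEIGHT = 0.25   # the remaining 0.75 is split equally across the three bias AUCs
POWER = -5

def get_final_metric(bias_df, overall_auc):
    # bias_df is the per-subgroup table with 'subgroup_auc', 'bpsn_auc' and 'bnsp_auc' columns
    bias_score = np.average([
        power_mean(bias_df['subgroup_auc'], POWER),
        power_mean(bias_df['bpsn_auc'], POWER),
        power_mean(bias_df['bnsp_auc'], POWER),
    ])
    return OVERALL_WEIGHT * overall_auc + (1 - OVERALL_WEIGHT) * bias_score
```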

Existing Approaches and Improvements over them

There are some existing solutions and research works on this problem, which have been given in the references. The existing solutions use simple LSTM/GRU and CNN architectures to solve the problem.

In this solution, we will use more complex architectures based on bidirectional LSTM layers. We will also try the BERT model to extract vector representations of the comment texts and use them as features in an MLP network.

Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) helps in analyzing the data using simple tools from statistics, plots, linear algebra, and others.

In EDA, we plot different graphs for getting a better understanding of the data. Let’s start doing the EDA on various features available in the training dataset to understand the dataset better.

There are various features in the training dataset like comment text, toxicity (target), toxicity subgroups, identities, and other metadata. In the test dataset, only two features are available viz., id and comment_text, which implies that we will just have the comment text available when predicting the toxicity in real-time.

1. EDA of ‘Target’ Class Label:

1.1. PDF and Box Plot of the data in the ‘Target’ Class:

Histogram/PDF and Box Plot of the data in the ‘Target’ Class Label

Observation from the Histogram/PDF and Box Plot of the ‘Target’ Class Label:

  • Most of the comments have 0 toxicity.
  • 75% of the data have a toxicity of less than 0.2 ⟹ non-toxic comments.
  • The majority of the comments have a toxicity of less than 0.5 ⟹ non-toxic comments.
  • Very few comments have toxicity greater than 0.5 ⟹ toxic comments.

1.2. Count Plot and Pie Chart of the ‘Target’ Class:
Let’s plot the Count Plot for the ‘target’ class to see the number of comments in each category (toxic and non-toxic).

  • We will consider the comments as toxic if the toxicity is greater than or equal to 0.5, otherwise non-toxic.
  • Let’s create a new feature called ‘IsToxic’ to serve as the binary class label, with two values (a minimal sketch follows the plot below):
    i. 1 ⟹ toxic comment (toxicity ≥ 0.5).
    ii. 0 ⟹ non-toxic comment (toxicity < 0.5).
Distribution of Class Labels (Toxic and Non-toxic)
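A minimal sketch of deriving the binary label, assuming the training data has been loaded into a DataFrame called train_df:

```python
import pandas as pd

train_df = pd.read_csv('train.csv')

# Binarize the toxicity score at the 0.5 threshold
train_df['IsToxic'] = (train_df['target'] >= 0.5).astype(int)

print(train_df['IsToxic'].value_counts(normalize=True))  # roughly 92% non-toxic, 8% toxic
```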

Observation from the Count Plot of the Class Label:

  • The dataset is highly imbalanced with 92% non-toxic comments and 8% toxic comments.

2. Multi-variate Analysis of the Toxicity Subtype Features:

  • Here, we will do some multi-variate analysis of the various Subgroups or Toxicity Subgroup attributes.
  • These attributes represent the subtype of toxicity.
  • There are 6 toxicity subgroup attributes available viz., severe_toxicity, obscene, threat, insult, identity_attack, and sexual_explicit.
  • In the given problem, the model does not need to predict these attributes. Otherwise, the problem would have been a multi-label classification problem.
PDF of the data in various Toxicity Subgroups

Observation from the PDF of data in various Toxicity Subgroups:

  • From the PDF, we can see that most values of the toxicity subgroup ‘severe_toxicity’ are 0 ⟹ in the training data, comments are least often categorized as ‘severe_toxicity’ by the annotators.
  • Let’s plot, for each toxicity subgroup, the percentage of toxic comments in which that subgroup has the maximum score. In other words, for each toxic comment we count only the toxicity subgroup with the highest score.
Percentage of Toxicity Subgroups having maximum Toxicity Score in a comment

Observation from the above plot:

  • We can see that 79.23% of the toxic comments are of ‘insult’ type, followed by small percentages of ‘identity_attack’, ‘obscene’, and others.
  • A negligible portion (0.02%) of toxic comments are of ‘severe_toxicity’ type.
  • Based on the above plot, we could have considered creating a new feature called ‘IsInsult’ for the classification, due to its large contribution to toxic comments. However, no toxicity subgroup information is available in the test data, so we skip this feature.

3. Multi-variate Analysis of the Identity Features:

  • Here, we will do some multi-variate analysis of the various Identity attributes that represent various identities mentioned in the comments.
  • There are 24 identity features in the training dataset viz., male, female, transgender, other_gender, heterosexual, homosexual_gay_or_lesbian, bisexual, other_sexual_orientation, christian, jewish, muslim, hindu, buddhist, atheist, other_religion, black, white, asian, latino, other_race_or_ethnicity, physical_disability, intellectual_or_learning_disability, psychiatric_or_mental_illness, and other_disability.
  • Out of these 24 identity attributes, only the 9 identities with more than 500 examples in the test dataset will be included in the evaluation calculation.
  • Hence, we will consider only these 9 identity features for the EDA.
PDF of the data in various Identity Subgroups

Observation from the PDF of the data in various Identity Subgroups:

  • From the above PDF, we can observe that the identity ‘psychiatric_or_mental_illness’ has the highest peak at 0 ⟹ ‘psychiatric_or_mental_illness’ is the least mentioned identity in the comments.
Count Plot of various Identity Subgroups

Observation from the Count Plot of various Identity Subgroups:

  • ‘female’ is the most occurring identity in the toxic comments, followed by the identities: ‘male’, ‘white’, ‘muslim’, and so on.

4. Uni-variate Analysis of the Comment Texts:

  • Here, we will perform various EDA on the comment text, which is the most important feature in the whole dataset.

4.1. Plot number of comments per toxic class:

  • Let’s draw a count plot to view the number of toxic comments as well as the number of comments in each toxic subgroup where the score is greater than or equal to 50% (≥ 0.5).
Number of comments per target class and per toxicity subgroup

Observation from the above plot:

  • There are a total of 144,334 toxic comments.
  • Among the toxicity subgroups, ‘insult’ is the one marked in the most comments with a score of at least 0.5, i.e., by at least 50% of the annotators.
  • All other toxicity subgroups are annotated far less often.
  • ‘severe_toxicity’ is the least attributed subgroup.

4.2. Trend Graph of the number of comments over time:

  • Here, we will plot the distribution of comments over time.
Trend Graph of the number of comments raised over time

Observation from the trend of toxic comments raised over time:

  • We can see an overall upward trend in the number of toxic comments raised over time.
  • There is a slight dip after Jan 2017 but the trend increases again after April 2017.
  • The maximum number of toxic comments was raised in October 2017.
  • There is a drastic fall in November 2017, which could be due to the unavailability of data from November 2017 onwards.
  • Let’s try to plot a similar trend graph for various types of toxicity subgroups.
  • However, if we just consider the toxicity score to be greater than 0.5 for each of the subgroups, there is negligible data and the graph does not show any trend.
  • Hence, we will plot the trend graph for the toxicity subgroups considering a non-zero score i.e., even if at least one annotator believed the comment to be of any specific toxicity, we would consider them in the trend graph.
Trend Graph of the number of comments for each Toxicity Subgroup raised over time

Observation from the trend of various Toxicity Subgroup comments raised over time:

  • The trends for all the toxicity subgroups follow a similar pattern.
  • Comments marked as ‘insult’ show a significantly larger increase than any other kind of comment.

4.3. Plot the distribution of comments’ length:

  • Here, we will plot the distribution of the length of the comments for both toxic and non-toxic classes.
Histogram/PDF and Box Plot of the distribution of comment lengths for each class label

Observation from the distribution of comments’ length:

  • The comment lengths follow a similar distribution for both toxic and non-toxic classes.
  • Most of the comments (either toxic or non-toxic) have lengths between 10 and 30 words.
  • Almost all comments have a length of fewer than 150 words. However, there are some outliers with comment lengths even up to 350+ words.

4.4. Word Cloud of the words in the Comments:

4.4.1. Word Cloud based on the ‘Target’ Class:

Word Cloud of the words in Toxic Comments
Word Cloud of the words in Non-toxic Comments

Observation from the Word Cloud of the words in Comment Texts based on Toxic and Non-toxic classes:

  • In the toxic comments, ‘Trump’, ‘people’, and ‘like’ are the most frequently used words.
  • In the non-toxic comments, ‘people’, ‘one’, and ‘would’ are the most frequently used words.

4.4.2. Word Cloud based on the Toxicity Subgroups

  • Plot the word cloud of all the words in the comment texts from the various toxicity subgroup classes having scores greater than or equal to 0.5.
Word Cloud of the words in all Toxicity Subgroups having toxicity score ≥ 0.5

Observation from the Word Cloud of the words in Comment Texts in various Toxicity Subgroups:

  • Most often used words based on the toxicity subgroup classes are as follows:
    i. severe_toxicity: kill, shit, and time.
    ii. obscene: people, get, like, would and crap.
    iii. threat: kill and shoot.
    iv. insult: Trump, people, like and one.
    v. identity_attack: Muslim and black.
    vi. sexual_explicit: sex and one.

Data Pre-processing

As the primary data to be used in the model is text data, and we will build various non-transformer-based models as well, we can perform some text pre-processing to clean the data.

The following text pre-processing operations have been performed:

  1. Remove HTML Tags:
    HTML Tags do not add much value to understanding and analyzing texts.

2. Remove Accented Characters:

  • We may receive some accented characters/letters in comments. E.g., résumé, tête-à-tête, etc.
  • The most common accents are the acute (é), grave (è), circumflex (â, î or ô), tilde (ñ), umlaut/dieresis (ü or ï), and cedilla (ç). Accent marks (also referred to as diacritics) usually appear above a character. (Reference)
  • We need to ensure that we convert and standardize such characters to ASCII characters.
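A minimal sketch of one common way to do this, using Python's unicodedata module:

```python
import unicodedata

def remove_accented_chars(text):
    # e.g. 'résumé' -> 'resume', 'tête-à-tête' -> 'tete-a-tete'
    return (unicodedata.normalize('NFKD', text)
            .encode('ascii', 'ignore')
            .decode('utf-8', 'ignore'))
```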

3. Convert to Lowercase:

  • Convert the comment text to lower case before doing further preprocessing.
  • Texts in lowercase help in the process of preprocessing and in later stages in NLP.
  • Converting the text to lowercase is pretty easy and can be done by the ‘.lower()’ method of a string.

4. Remove IP Address, Hyperlinks, and Numbers:

  • Remove any IP Addresses, Hyperlinks, and numbers from the comment text as they won’t add any value to perform the toxicity classification.
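A minimal regex-based sketch of this step (the exact patterns used in the project may differ):

```python
import re

def remove_ip_links_numbers(text):
    text = re.sub(r'\b(?:\d{1,3}\.){3}\d{1,3}\b', ' ', text)  # IPv4 addresses
    text = re.sub(r'(https?://\S+|www\.\S+)', ' ', text)      # hyperlinks
    text = re.sub(r'\b\d+\b', ' ', text)                      # standalone numbers
    return re.sub(r'\s+', ' ', text).strip()
```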

5. Replace Emoticons with the corresponding words:

  • Replace the emoticons with the corresponding words like “:-(” with “sad”.
  • The emoticons and their corresponding words have been referenced from PC.net.

6. Remove Special Characters:

Remove special characters except for the following four, because we will use these characters in tokenization and embeddings due to their importance in a sentence:
i. Single quote: '
ii. Full stop (period): .
iii. Question mark: ?
iv. Exclamation mark: !
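A minimal sketch of this step; everything except letters, digits, whitespace, and the four retained characters is dropped:

```python
import re

def remove_special_chars(text):
    # Keep letters, digits, whitespace and the four retained marks: ' . ? !
    return re.sub(r"[^A-Za-z0-9\s'.?!]", ' ', text)
```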

7. Add space around sentence end markers and remove duplicate markers:

  • Add one space around sentence end markers and remove duplicate end markers.
  • If the same sentence end marker appears multiple times in a row, with or without spaces in between, it is collapsed to a single marker (see the sketch after this list).
  • For example:
    i. ‘ !! ’ will be replaced with ‘ ! ’
    ii. ‘ ? ? ’ will be replaced with ‘ ? ’
  • This is done to not lose information about these markers at the stage of transformation of the text into a word embedding.
  • This can be omitted while using BERT.
  • The following end markers will be considered:
    i. Full stop (period): .
    ii. Question mark: ?
    iii. Exclamation mark: !
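A minimal sketch of collapsing repeated end markers and padding them with single spaces:

```python
import re

def normalize_end_markers(text):
    for marker in ['.', '?', '!']:
        escaped = re.escape(marker)
        # 'What??', 'What ? ?' -> 'What ? '
        text = re.sub(rf'(?:{escaped}\s*)+', f' {marker} ', text)
    return re.sub(r'\s+', ' ', text).strip()
```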

8. Decontraction:

  • A contraction, or short form, is an abbreviated form of a word or words, from which one or more letters have been omitted and replaced by an apostrophe. Contractions are very common in English sentences. (Reference)
  • For example:
    i. I am ⟶ I’m.
    ii. He is ⟶ He’s.
    iii. It is ⟶ It’s.
    iv. We will ⟶ We’ll.
  • We need to de-contract (opposite of contraction) the words to their original form, to help with text standardization.
  • Here we will use a dictionary (contraction map) containing the contracted form in its keys and their corresponding expanded form in its values. (Reference)
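A minimal sketch of the decontraction step; only a small illustrative subset of the contraction map is shown here:

```python
import re

# Illustrative subset; the full contraction map covers many more forms
CONTRACTION_MAP = {
    "i'm": "i am", "he's": "he is", "it's": "it is",
    "we'll": "we will", "can't": "cannot", "won't": "will not",
    "n't": " not",
}

def expand_contractions(text, contraction_map=CONTRACTION_MAP):
    pattern = re.compile('({})'.format('|'.join(map(re.escape, contraction_map))),
                         flags=re.IGNORECASE)
    return pattern.sub(lambda m: contraction_map.get(m.group(0).lower(), m.group(0)), text)
```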

Deep Learning Models

  • To classify and find the toxicity in a given text, various deep learning models from simple architecture to more complex ones have been tried.
  • For text embeddings, the GloVe model has been used to generate a vector representation of each word and to create an embedding matrix for all the words (tokens). This embedding matrix is used in the embedding layer to generate a vector representation of the given comment/sentence/text (a minimal sketch of building the matrix follows this list).
  • Sentence embedding using the state-of-the-art (SOTA) Transformer-based BERT model (DistilBert) has also been tried.
  • We will see the architecture of all the models tried and a brief about them.
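A minimal sketch of building the GloVe embedding matrix for a Keras tokenizer's vocabulary (file name and dimensions are illustrative):

```python
import numpy as np

def build_embedding_matrix(word_index, glove_path='glove.840B.300d.txt', embedding_dim=300):
    # Load pre-trained GloVe vectors into a dict: word -> vector
    embeddings_index = {}
    with open(glove_path, encoding='utf-8') as f:
        for line in f:
            values = line.rstrip().split(' ')
            word = ' '.join(values[:-embedding_dim])
            embeddings_index[word] = np.asarray(values[-embedding_dim:], dtype='float32')

    # Row i of the matrix holds the vector for the word with tokenizer index i
    embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
    for word, i in word_index.items():
        vector = embeddings_index.get(word)
        if vector is not None:
            embedding_matrix[i] = vector
    return embedding_matrix
```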

1. CNN-based Models

  • Below are two 1D CNN-based architectures used for the baseline modeling:
    i. One with one Input Layer for the tokenized texts.
    ii. The other with two Input Layers, one for the tokenized texts and another for a numerical feature indicating the number of words in the comment text.
  • The architectures of both models are shown below:
CNN-based Model with one Input Layer for tokenized texts
CNN-based Model with one Input Layer for tokenized texts and another Input Layer for numerical feature
  • The inclusion of the word count feature did not help in getting a better score. Hence, this feature has not been used in the later models.

2. Bidirectional LSTM-based Models

  • Below are the architectures of the bidirectional LSTM-based models with skip connections:
Bidirectional LSTM Model with skip connection (Architecture 1)
Bidirectional LSTM Architecture with skip connection (Architecture 2)

3. Bidirectional LSTM-based Model with Attention Mechanism

  • An attention layer using the Bahdanau attention mechanism was used:
Attention Mechanism (Source)
Bidirectional LSTM Model with Attention Mechanism

4. BERT Models

  • BERT (Bidirectional Encoder Representations from Transformers) is a SOTA transformer-based machine learning technique for natural language processing (NLP).
  • A lightweight BERT model called DistilBERT from the Hugging Face library was used to extract vector representations of the tokenized comment texts.
  • The comment texts were first tokenized using the DistilBERT tokenizer, and then text embedding was done on the tokenized texts using the DistilBERT model ‘distilbert-base-uncased’.
  • The embedded vectors were then passed to a simple Multi-Layer Perceptron (MLP) network, as sketched after the figure below.
MLP Model using BERT embedded vectors as input
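A minimal sketch of extracting such sentence vectors with Hugging Face Transformers (shown here with the PyTorch DistilBERT classes; the project may have used the TensorFlow variants instead):

```python
import torch
from transformers import DistilBertTokenizer, DistilBertModel

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertModel.from_pretrained('distilbert-base-uncased')
model.eval()

def bert_sentence_vectors(texts, max_len=128):
    # One fixed-size vector per comment, taken from the [CLS] position of the last hidden layer
    encoded = tokenizer(texts, padding=True, truncation=True,
                        max_length=max_len, return_tensors='pt')
    with torch.no_grad():
        output = model(**encoded)
    return output.last_hidden_state[:, 0, :].numpy()

features = bert_sentence_vectors(["I am a black man", "I am a man"])
print(features.shape)  # (2, 768)
```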

5. Bidirectional LSTM Model with GloVe and BERT Features

  • The bidirectional LSTM model was used with a GloVe embedding layer, and its output was combined with the BERT features, followed by skip connections.
Bidirectional LSTM Model with GloVe and BERT Features

Comparison of Models

  • Various Deep Learning Models (discussed above) were trained and evaluated on the test dataset.
  • The table below shows a comparison between various models.
Comparison of Models
  • Conclusion: As the Bidirectional LSTM Model with skip connections (Architecture 1) outperformed the other models, we have chosen it as the final model.

Final Model

  • The final model is chosen based on the evaluation score on the test dataset.
  • The final model is built using bidirectional LSTM layers followed by skip connections.
  • The architecture of the final model is shown below:
The architecture of the Final Model
  • A sample code for building the architecture of the final model is shown below:
  • The model was configured to train for 21 epochs, but early stopping halted training after the 11th epoch.
  • Below is the learning curve of the model over different epochs:
Learning Curve of Loss and Accuracy w.r.t. epochs
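A minimal Keras sketch of such a bidirectional LSTM architecture with skip connections (layer sizes and hyperparameters below are illustrative, not the exact values used in the final model):

```python
from tensorflow.keras.layers import (Input, Embedding, SpatialDropout1D, Bidirectional, LSTM,
                                     GlobalMaxPooling1D, GlobalAveragePooling1D,
                                     concatenate, add, Dense)
from tensorflow.keras.models import Model

def build_final_model(embedding_matrix, max_len=220, lstm_units=128):
    vocab_size, embedding_dim = embedding_matrix.shape

    inputs = Input(shape=(max_len,), dtype='int32')
    x = Embedding(vocab_size, embedding_dim, weights=[embedding_matrix], trainable=False)(inputs)
    x = SpatialDropout1D(0.3)(x)
    x = Bidirectional(LSTM(lstm_units, return_sequences=True))(x)
    x = Bidirectional(LSTM(lstm_units, return_sequences=True))(x)

    # Pool the sequence outputs into a fixed-size vector
    hidden = concatenate([GlobalMaxPooling1D()(x), GlobalAveragePooling1D()(x)])

    # Dense blocks with skip (add) connections
    hidden = add([hidden, Dense(4 * lstm_units, activation='relu')(hidden)])
    hidden = add([hidden, Dense(4 * lstm_units, activation='relu')(hidden)])

    output = Dense(1, activation='sigmoid')(hidden)
    model = Model(inputs=inputs, outputs=output)
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model
```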

Deployment

Flask has been used to build a REST API for the productionisation of the model.

The Web Application requires a comment text to be entered in a text area and shows the toxicity after prediction.
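A minimal sketch of what the Flask endpoint could look like; the artifact names and the preprocess / texts_to_padded_sequence helpers are placeholders for the preprocessing and tokenization pipeline described earlier:

```python
from flask import Flask, request, jsonify
from tensorflow.keras.models import load_model

model = load_model('final_model.h5')  # illustrative file name

app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
    comment = request.form.get('comment_text', '')
    cleaned = preprocess(comment)                    # hypothetical: the cleaning pipeline above
    sequence = texts_to_padded_sequence([cleaned])   # hypothetical: tokenizer + padding from training
    toxicity = float(model.predict(sequence)[0][0])
    return jsonify({'toxicity': round(toxicity, 4), 'is_toxic': toxicity >= 0.5})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
```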

Below is the demo from localhost (which will be updated later to show the demo from a hosted environment):

Demo of the Model Prediction

Kaggle Score

The final metric score on the validation data was 0.92620.

The Kaggle score on the test data was 0.92482.

Kaggle Score of the Final Model

Future Work

The future scope of this problem could be:

  • Use a trainable embedding layer when using the GloVe Embedding Matrix.
  • Fine-tune the BERT Model. This could be resource-intensive.
  • Use a more advanced transformer-based model called XLNet, which overcomes the limitations of BERT.
