Quora is a platform that allows people to learn from each other. On Quora, people can ask questions and connect with others who contribute unique insights and authoritative opinions. Several months ago, Quora organized a competition on Kaggle, Quora Insincere Questions Classification, aimed at an existential problem for any major website today: how to handle toxic and divisive content. A key challenge was to weed out insincere questions, those founded upon false premises or asked by users who intend to make a statement rather than to look for helpful answers.
This post describes my approach to the problem, which got me into the top 3% (85th place among 4,000 participants, a solo silver medal).
The dataset consists of 1.3M questions; the goal is to mark each question as toxic or non-toxic (a binary classification task). An insincere question is defined as a question intended to make a statement rather than to look for helpful answers. The following signs can be used to spot an insincere question (see the full definition):
- Has a non-neutral tone
- Is disparaging or inflammatory
- Isn’t grounded in reality
- Uses sexual content (incest, bestiality, pedophilia) for shock value, and not to seek genuine answers
Examples of sincere questions:
1. What are the best rental property calculators?
2. Apart from wealth, fame, and their tragic ends, what did Anthony Bourdain and Kate Spade have in common that might provide some insight into their choice to take their own lives?
3. How do you find your true purpose or mission in life?
4. What is the relation between space and time if they are connected? Are they converting continuously in each other?
5. Is there an underlying message that can be read into the many multilateral agreement exits of the Trump administration during its first year and a half?
Examples of insincere questions:
1. Lol no disrespect but I think you are ducking smart?
2. Are Denmark and England destroyed by Muslim immigrants?
3. How am I supposed to get a girlfriend if every woman thinks every man is a rapist?
4. How many black friends does a white person need to make what they say 'not racist' on this basis?
5. Are Russian women more beautiful than Ukrainian women?
The competition uses the F1 score as its metric, calculated on 56k (public) and 376k (private) unseen test questions. Quora Insincere Questions Classification was a kernels-only competition: instead of uploading predictions, you submit code, which is then run on the Kaggle platform. The organizers set memory and time limits for both stages: 6 hours on CPU only, or 2 hours with CPU+GPU.
The organizers also prohibited the use of any external data (which didn't stop some participants anyway), except for several pretrained word embeddings: GoogleNews-vectors-negative300, glove.840B.300d, paragram_300_sl999, and wiki-news-300d-1M.
In this competition, we had to detect toxic questions. Let's suppose that was the only reason for running the competition (other reasons might include hiring professional data scientists, generating publicity, or comparing the score of the model in production with the top scores in the competition). Probably the most important yet least frequently asked questions during the preparation of a data-driven project are: do we actually need it? What problems are we going to solve?
Toxic behavior is bad for the Quora community. It could scare some users off, as they won't feel safe sharing their knowledge with the world. We want our users to express their ideas freely, without fear of being trolled (problem). We may achieve this by detecting toxic questions and removing them, or by making them visible only to the author of the question, a so-called hellban (solution). To detect toxic questions, we could use human pre-moderation (by Quora moderators or Mechanical Turk), but processing all questions takes time, and moderators could be overwhelmed by the number of questions asked on Quora. So we would like to automate the process with a data-driven approach: before a question is posted, it is checked by the model, which is what we have to come up with during this competition.
Does this solve the initial problem? We could draw on the experience of an industry struggling with a similar problem: video games. According to "Can a video game company tame toxic behavior?", most toxic comments in video games come from "average persons, who just had a bad day", and banning all toxic players would not change much:
Common wisdom holds that the bulk of the cruelty on the Internet comes from a sliver of its inhabitants — the trolls. Indeed, Lin’s team found that only about 1% of players were consistently toxic. But it turned out that these trolls produced only about 5% of the toxicity in League of Legends. “The vast majority was from the average person just having a bad day,” says Lin.
That meant that even if Riot banned all the most toxic players, it might not have a big impact. To reduce the bad behaviour that most players experienced, the company would have to change how players act.
In my opinion, removing toxic comments won't solve the problem, but it will reduce its effect on the community. It could be integrated into an anti-troll strategy, but it is only one part of the package needed to address the issue. In any case, the overall idea of the competition seemed viable and important to me, so I decided to participate.
Throughout the competition, I tested my pipeline both locally (on my PC) and in Kaggle kernels. This allowed me to (1) run more experiments simultaneously and (2) make sure my pipeline would fit the time/memory requirements of the 2nd stage. I tracked the results of each experiment (time, validation score, public leaderboard score, comments, and findings) in an Excel file.
I think the most important lesson of this competition was not to trust the public leaderboard. As mentioned previously, the public leaderboard was calculated on 56k questions, roughly 4% of the number of questions in the train part. That is an extremely small fraction, useful only for checking that you have not introduced a dramatic bug into the code, not for comparing scores.
I decided to fully trust my validation and used a StratifiedKFold (K=5) strategy for local validation and parameter selection. For the final submissions (two were allowed), I used the same strategy for one and a train-validation split (90/10%) for the other.
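As a minimal sketch, such a validation split can be set up with scikit-learn (the toy data and variable names here are illustrative, not the competition code):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# toy stand-ins for the question features and toxicity labels
X = np.zeros((10, 1))
y = np.array([0, 1] * 5)

# K=5 stratified folds preserve the toxic/non-toxic ratio in every fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
folds = list(skf.split(X, y))  # each fold is a (train_idx, val_idx) pair
```

Stratification matters here because the classes are heavily imbalanced: a plain random split could leave a fold with almost no toxic questions.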
My preprocessing pipeline consisted of several steps:
- Clean math formulas. Approximately 2% of questions in the train dataset contained "math" tags. The formulas within carried no useful information and no toxic content, but they dramatically increased the length of a question, so I decided to replace them with the word "math". Admittedly, clever trolls could exploit this and start insulting people inside math formulas.
[math] 2 + 2 = 4 [/math] -> math
- Add a space between a word and a punctuation symbol. This turned out to be a better strategy than removing punctuation:
"Wait... What is the relation between space and time if they are connected? Are they converting continuously in each other?" -> "Wait . . . What is the relation between space and time if they are connected ? Are they converting continuously in each other ?"
- If a word is absent from the embedding vocabulary, try lowercasing/uppercasing/capitalizing it and look it up again. This simple yet effective move greatly reduces the number of "new" words (i.e. words which are not in the w2v vocabulary). After the end of the competition, I realized that stemming or lemmatization might also decrease the number of absent words.
for word, i in tqdm(word_index.items()):
    embedding_vector = embeddings_index.get(word)
    # 'ABcd' -> 'abcd'
    if embedding_vector is None:
        embedding_vector = embeddings_index.get(word.lower())
    # 'denis' -> 'Denis'
    if embedding_vector is None:
        embedding_vector = embeddings_index.get(word.capitalize())
    # 'usa' -> 'USA'
    if embedding_vector is None:
        embedding_vector = embeddings_index.get(word.upper())
    # dealing with numbers in the Google News embedding: '123' -> '###'
    if word.isdigit() and (embedding_vector is None):
        embedding_vector = embeddings_index.get(len(word) * '#')
    # '1123336548956552515151515151544444444' -> 'number'
    if word.isdigit() and (embedding_vector is None):
        embedding_vector = embeddings_index.get('number')
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
        in_vocab += 1
    else:
        non_in_vocab += 1
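For illustration, the first two cleaning steps (math-tag removal and punctuation spacing) can be sketched with regular expressions; `clean_text` is a hypothetical helper, not the exact competition code:

```python
import re

def clean_text(text):
    # replace [math] ... [/math] blocks with the single token 'math'
    text = re.sub(r'\[math\].*?\[/math\]', ' math ', text, flags=re.DOTALL)
    # surround punctuation with spaces instead of removing it
    text = re.sub(r'([?!.,:;"\'()\[\]])', r' \1 ', text)
    # collapse repeated whitespace introduced by the steps above
    return re.sub(r'\s+', ' ', text).strip()
```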
Models and Embeddings
I tested both CNN and LSTM models in this competition; the best one was the following LSTM:
def get_model(embedding_matrix, nb_words, embedding_size=607):
    inp = Input(shape=(max_length,))
    x = Embedding(nb_words, embedding_size, weights=[embedding_matrix], trainable=False)(inp)
    x = SpatialDropout1D(0.3)(x)
    x1 = Bidirectional(CuDNNLSTM(256, return_sequences=True))(x)
    x2 = Bidirectional(CuDNNGRU(128, return_sequences=True))(x1)
    max_pool1 = GlobalMaxPooling1D()(x1)
    max_pool2 = GlobalMaxPooling1D()(x2)
    conc = Concatenate()([max_pool1, max_pool2])
    predictions = Dense(1, activation='sigmoid')(conc)
    model = Model(inputs=inp, outputs=predictions)
    adam = optimizers.Adam(lr=learning_rate)
    model.compile(optimizer=adam, loss='binary_crossentropy')
    return model
I used a concatenation of the Google News and GloVe embeddings, with additional features appended to each word's representation: relative position in the sentence (order number), relative position in the question (order number), is_upper, is_lower, is_number, is_punctuation, share of uppercase letters in the word ("WORD" → 1.0, "Word" → 0.25), and the frequency of the word in the train dataset.
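A sketch of how such a per-word vector could be assembled (the exact feature set, ordering, and function name here are my assumptions; the positional features would be added per token when building sequences):

```python
import numpy as np

def word_vector(word, google_vec, glove_vec, freq):
    """Concatenate two 300-d embeddings with a few handcrafted word features."""
    base = np.concatenate([google_vec, glove_vec])  # 600 dims
    feats = np.array([
        float(word.isupper()),                        # is_upper
        float(word.islower()),                        # is_lower
        float(word.isdigit()),                        # is_number
        float(not any(c.isalnum() for c in word)),    # is_punctuation
        sum(c.isupper() for c in word) / len(word),   # 'Word' -> 0.25
        freq,                                         # frequency in train set
    ])
    return np.concatenate([base, feats])
```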
Snapshot ensembling is a technique commonly used in Kaggle competitions, described in Snapshot Ensembles: Train 1, Get M for Free. The idea behind it is very simple: we train a single model with a cyclic learning rate, saving the weights of the model at the end of each cycle (the end of a cycle usually lands in a local minimum). In the end, we get several models instead of just one, and averaging the predictions of the ensemble gives a better score than a single model.
During training, we are converging to and escaping from multiple local minima. A snapshot is taken in each local minimum
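A cosine-annealed cyclic schedule in the spirit of the paper might look like this (a sketch; `snapshot_lr` and its parameters are my own naming, not the competition code):

```python
import math

def snapshot_lr(step, steps_per_cycle, lr_max=1e-3):
    # the learning rate restarts at lr_max at the beginning of each cycle
    # and decays towards 0 by its end, where a model snapshot is saved
    t = (step % steps_per_cycle) / steps_per_cycle
    return lr_max / 2 * (math.cos(math.pi * t) + 1)
```

At inference time, the saved snapshots are averaged, e.g. `np.mean([m.predict(X) for m in snapshots], axis=0)`.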
The idea of pseudo-labeling is to increase the amount of data available for model training. It is a common approach in Kaggle competitions, but unfortunately not so commonly used in industry (here are some papers on the topic: Training Deep Neural Networks on Noisy Labels with Bootstrapping; Pseudo-Label: The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks). There are a few ways to apply pseudo-labeling:
- Add the most confident test predictions to the training data (examples with confidence ≥ some level, e.g. level = 0.9). We should not simply take the top N1% and bottom N0% of probabilities and add them to the training dataset (due to class imbalance). Instead, we must determine the optimal threshold and select the most confident examples with a positive and a negative label. Following this approach over several epochs of NN training, I added about 45k negative samples and 100 positive samples.
- Add the full test data with pseudo labels, but assign a weight to each example in accordance with its confidence. I tried several weight functions and found that a linear weight function works best for this task.
Assigning weights to pseudo labels (first 1000 data points in the test set). The more confident our predictions are, the bigger the weight we assign during training
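The linear weighting can be sketched like this (the function name and exact normalization are my assumptions):

```python
import numpy as np

def pseudo_label_weights(probs, threshold=0.5):
    # weight grows linearly with distance from the decision threshold:
    # 0 at the threshold itself, 1 at a fully confident 0.0 or 1.0
    probs = np.asarray(probs, dtype=float)
    return np.abs(probs - threshold) / max(threshold, 1.0 - threshold)
```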
Optimal Threshold Selection
The metric of this competition is the F1 score, which means we have to submit classes (0/1, non-toxic/toxic). The model outputs probabilities, so we need to select an appropriate threshold to maximize the score. There are a number of ways to do that:
- Select a threshold before training. I think this leads to a non-optimal threshold and a lower score. I tested it but did not use it in the final submission, although the winners of the competition used a variation of this approach.
- Fit the model, make predictions for the train part, and select the threshold that optimizes the score on the train part. A straight way to overfit. Tested, but not used in the final submission.
- Fit the model, make predictions for the validation part, and select the threshold that optimizes the score on the validation part. In my pipeline, the validation part is used to select the optimal number of epochs (a slight overfit of the validation data) and to optimize the hyperparameters of the neural net (a major overfit). Selecting a threshold on already "used" data "as is" is not a good idea: we might completely overfit and get a low score on unseen data. Instead, I decided to select a threshold on subsets of the validation part, repeat this several times, and then aggregate the results (motivated by the idea of subsampling in statistics). It gave me good results both in this competition and in a few others where I had to select a threshold as well [code].
- Make out-of-fold (OOF, meaning a separate model for each fold) predictions and find a threshold. This is a good way, but there are two problems with it: (1) we don't have that much time to make OOF predictions, because we want to fit as many diverse models (CNN/LSTM) as possible; (2) it might be the case that our validation split is biased and our predictions get "shifted". Threshold search is very sensitive, so we would not get an optimal solution. However, I used this approach for ranking the probabilities on each fold to reduce the influence of the "shift". It worked well enough for me.
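The subsampled threshold search described above can be sketched as follows (a simplified version; the function names and parameter defaults are mine):

```python
import numpy as np

def f1(y_true, y_pred):
    # plain F1 = 2*TP / (2*TP + FP + FN)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return 2 * tp / max(2 * tp + fp + fn, 1)

def select_threshold(y_true, y_prob, n_rounds=20, subsample=0.7, seed=0):
    # find the best-F1 threshold on random subsets of the validation
    # data, then average the per-subset optima for stability
    rng = np.random.RandomState(seed)
    n, thresholds, best = len(y_true), np.arange(0.1, 0.9, 0.01), []
    for _ in range(n_rounds):
        idx = rng.choice(n, int(subsample * n), replace=False)
        scores = [f1(y_true[idx], (y_prob[idx] > t).astype(int)) for t in thresholds]
        best.append(thresholds[int(np.argmax(scores))])
    return float(np.mean(best))
```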
During the competition, I tested tons of ideas, but only a small part of them made it into the final pipeline. This section gives an overview of some techniques that did not work for me.
A lot of stuff did not work during this competition (picture by Schmitz)
Data Augmentation and Test Time Augmentation (TTA)
The idea is to increase the size of the training dataset. There are several ways to do that; the most commonly used are back-translation (translate the English sentence into French and then back into English) and synonym replacement. Back-translation was not an option because Internet access was not available during the 2nd stage, so I decided to focus on synonyms. I tested two approaches:
- Split the sentence into words and replace a word with the closest word by w2v embedding with a pre-defined probability. Repeat a few times to get different sentences.
- Add random noise to random words in the sentence.
Neither approach worked well. I also tried combining questions together, i.e. non-toxic + toxic = toxic. That didn't work either:
Non toxic: Is there an underlying message that can be read into the many multilateral agreement exits of the Trump administration during its first year and a half?
Toxic: Lol no disrespect but I think you are ducking smart?
New toxic:
Is there an underlying message that can be read into the many multilateral agreement exits of the Trump administration during its first year and a half? Lol no disrespect but I think you are ducking smart?
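The synonym-replacement idea can be sketched as follows; here `neighbours` is a precomputed map from each word to its nearest embedding neighbour (e.g. built offline via gensim's `most_similar`), and all names are illustrative:

```python
import random

def synonym_augment(words, neighbours, p=0.3, seed=0):
    # replace each word with its nearest-embedding neighbour with
    # probability p; rerun with different seeds for sentence variants
    rng = random.Random(seed)
    return [neighbours[w] if w in neighbours and rng.random() < p else w
            for w in words]
```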
Additional Sentence Features
I tried different sentence-based features, but they did not help much, and some of them even reduced the score, so I decided not to include them in the final model. Here are some of the features I tested: number of words, number of uppercase words, number of numbers, sum/mean/max/min of the numbers, number of punctuation symbols, sum/mean/max/min of word frequencies, number of sentences in a question, starting and ending characters, etc.
Neural Net Inner Layers Output
The idea is simple: we take the output of the net's concat layer and train a tree-based model on top of it. I have tested this approach in recent competitions, and it always increased the score of the final ensemble, since it allows blending diverse models (neural nets and tree-based models).
But in this case, the increase was very tiny, especially given the time it took due to the high output dimension (the dimensionality of the concat layer was ~400, depending on the model).
TSNE visualization of NN inner layer output for 1000 random questions in the validation dataset