Experience from participating in a computer vision competition hosted by the Chinese data science platform AI Challenger
In this article I will share my experience solving a video classification problem in a Chinese machine learning competition.
There are many data science platforms for competitors. We usually think of Kaggle, the most popular one. However, there are a number of other platforms that provide data scientists with challenging tasks, and it is a good time to explore them. This is why my teammate Alexey Grigorev and I entered the Short Video Real-Time Classification competition on the Chinese platform AI Challenger. Big thanks to Kaggle Grandmaster Artur Kuzin, who became our mentor in this competition. He helped us with creative ideas and provided us with a GPU server.
The dataset contained about 2 TB of videos and was split into train and validation sets. There were two test sets: test A and test B. Each video could have from 1 to 3 labels, so it was a multi-label classification problem. For example, below is a frame from a video that has 2 classes: piano and baby.
And here is a full list of possible tags:
This competition had a complex evaluation metric. Besides accuracy, a time restriction was added as an evaluation metric. You may think of it as the overall time taken from inputting a single video to outputting a single prediction.
Finally, taking these 2 formulas into account, the composite metric was calculated as the weighted distance between the submitted results and the reference point. You may read more about that metric here.
The key idea is that the organizers wanted the data science community to build a high-speed application that could be used in industry. To submit, competitors had to upload a Docker image package.
Training neural network
We used a mix of neural nets and simpler models to build an ensemble for the final predictions. We tried different network architectures, and the best result came from se_resnet50. The “Squeeze and Excitation” block improves the quality of a ResNet model at minimal computational cost. You may read more about this architecture in this publication.
As a baseline, we trained the model on the first frame extracted from each video. The learning rate was 0.001 and decayed by a factor of 10 every 30 epochs. The overall number of epochs was 150. As an optimizer we used stochastic gradient descent with weight decay and momentum for faster convergence. Interestingly, despite the fact that we had a multi-label, multi-class problem, we got bad results with BCEWithLogitsLoss. Instead we used cross-entropy loss, which improved accuracy considerably. The reason for the quality gap could be a lack of multi-label examples in the data.
We spent some time thinking about how to predict multiple labels. For example, we trained a binary classifier to predict whether an example has one label or more than one. One more issue was the thresholds for classes. We analyzed the class distribution and calculated weights for the classes. The idea was to add a weighting coefficient whenever the binary classifier predicted that an example had more than one label. In that case the scores could reach the thresholds for some classes that we previously didn’t take into account.
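The learning rate schedule described above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming that "decayed by 10" means the rate is divided by 10 at every 30th epoch, as in a standard step schedule. The function name step_decay_lr and its parameters are hypothetical and chosen only for this example.

```python
def step_decay_lr(epoch, base_lr=0.001, factor=0.1, step=30):
    """Learning rate for a 0-based epoch under step decay (assumed schedule)."""
    # Integer division counts how many decay steps have already happened.
    return base_lr * factor ** (epoch // step)


# Learning rate at the start of each 30-epoch stage over 150 epochs.
schedule = [step_decay_lr(epoch) for epoch in range(0, 150, 30)]
```

Under this assumption, training passes through five stages, each with a learning rate ten times smaller than the previous one.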
In the end, however, we didn’t use per-class thresholds: from the metric we understood that, given an example with more than one label, it is better to predict only one of them than to risk predicting a redundant one. There is still a lot of room to experiment with such methods to make multi-label prediction models more stable.
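To make the trade-off concrete, here is a small sketch comparing the two decoding strategies. The scores, the threshold of 0.5 and the function names are all hypothetical; they only illustrate how a threshold can add a redundant class while top-1 decoding cannot.

```python
def predict_with_threshold(scores, threshold=0.5):
    """Return every class index whose score reaches the threshold."""
    return [i for i, score in enumerate(scores) if score >= threshold]


def predict_top1(scores):
    """Return only the single highest-scoring class index."""
    return [max(range(len(scores)), key=lambda i: scores[i])]


# Hypothetical per-class scores for one video whose true labels are {0, 2}.
scores = [0.70, 0.55, 0.40, 0.05]
thresholded = predict_with_threshold(scores)  # classes 0 and 1; class 1 is redundant
top1 = predict_top1(scores)                   # class 0 only, a correct label
```

In this toy case the threshold misses class 2 and adds the wrong class 1, while top-1 decoding returns a single correct label.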
Trying classic ML algorithms for ensembling
The other part of the solution was to extract features from the neural nets after global pooling and use them as input for simpler classifiers such as gradient boosting, logistic regression and a support vector machine. The final ensemble consisted of the four best single models (neural network, logistic regression, SVM, CatBoost) combined with a majority voting classifier. This model gave us 0.76 on test A, but unfortunately that was not enough to progress.
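Majority voting over the four models can be sketched as follows. This is an illustration under assumptions, not the exact ensemble code: the per-model predictions are made up, and ties are resolved in favour of the model listed first, which is a choice made for this example only.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(votes).most_common(1)[0][0]


# Hypothetical predictions of each model for three videos.
model_predictions = {
    "neural_net": ["piano", "baby", "baby"],
    "log_reg": ["piano", "piano", "baby"],
    "svm": ["baby", "baby", "baby"],
    "catboost": ["piano", "baby", "piano"],
}

# zip(*...) regroups the predictions so that each tuple holds all votes for one video.
ensemble = [majority_vote(votes) for votes in zip(*model_predictions.values())]
```

With an even number of voters a 2-2 split is possible, so the order of the models matters in this sketch: listing the strongest model first makes it the tie-breaker.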
Aggregating video frames and facing new problems
The next step was to add aggregation to neural network training. We now extracted five random frames from each video and used the most frequent label as a single prediction. We did the same thing to extract new features for the classic models. It worked very well, and we got 0.91 on validation. We were happy with these results. A week before the deadline we decided to submit it, but then we ran into a problem with time. The models were fast, but according to the rules the time for video preprocessing also had to be counted. We noticed it takes about 0.03–0.05 ms to extract one frame. Unfortunately, this didn’t allow us to meet the submission restrictions.
Then we decided to train the model with aggregation but predict on one random frame. This worked a bit better than our previous model without any aggregation. The score was 0.81 on validation and 0.79 on test A, with a time of about 0.11 ms for a single prediction. We had an idea to decode videos on the GPU to speed up the process, but we didn’t have enough time to work out how to do it with Docker. This is how our pipeline looks:
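The frame aggregation step can be sketched like this. It is a simplified illustration: toy_classifier stands in for the real network, frames are assumed to be sampled without replacement, and all function names are hypothetical.

```python
import random
from collections import Counter


def sample_frame_indices(num_frames, k=5, rng=random):
    """Pick up to k distinct random frame indices, in ascending order."""
    return sorted(rng.sample(range(num_frames), min(k, num_frames)))


def predict_video(num_frames, classify_frame, k=5, rng=random):
    """Classify k random frames and return the most frequent label."""
    indices = sample_frame_indices(num_frames, k, rng)
    labels = [classify_frame(index) for index in indices]
    return Counter(labels).most_common(1)[0][0]


def toy_classifier(frame_index):
    """Stand-in for the frame-level model (illustration only)."""
    return "piano" if frame_index < 80 else "baby"


indices = sample_frame_indices(100, k=5, rng=random.Random(0))
label = predict_video(100, toy_classifier, k=5, rng=random.Random(0))
```

Every sampled frame has to be decoded before it can be classified, so the preprocessing time grows with k. Setting k=1 corresponds to predicting on a single random frame.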
Let me add a few words about the competition organization. This challenge was organized by the company MeiTu (美图). AI Challenger aims to be a global data science platform, but it has a lot of work to do before it gets there. We faced technical problems while participating, mostly with the Docker submission. For example, the competition started on September 3, but the Docker instructions did not appear on the website until about October 12. All instructions were in Chinese. We found out by accident that the organizers were assisting participants in the Chinese messenger WeChat, but that was also in Chinese and didn’t help foreign competitors. The funny thing was that even the submit button was in Chinese. We once even wrote a comment on the forum to encourage the organizers to solve the technical problems, but it only partly helped. In the future, I hope we will see a better level of organization and technical support.
That’s how the competition ended for us. Although we didn’t manage to submit our best model, it was worth trying. We gained an understanding of how to solve this class of machine learning problems. Moreover, after almost 3 months of working on this project we now have a good pipeline that you may find on GitHub here.
On the positive side, the rules of the challenge forced competitors to think about designing an algorithm for production use. So besides modelling, it helped us improve our skills in code optimization and in creating a reusable pipeline. Finally, I would like to encourage everyone to explore new data science platforms and keep participating in competitions to make such challenges better.