Results which are “too good to be true”
Can results ever be “too good to be true”? Well, for example, you may be using transfer learning: you don’t have time to build and train a complex architecture from scratch, so you take a pretrained model (say, ResNet101 trained on ImageNet) and fine-tune it on your data set. You hope to get good accuracy without incurring the cost of a long training run, but you don’t expect excellent results, unless your data set is extremely similar to the one on which the model was originally trained.
Or maybe, before training a complex and thus time-intensive architecture, you may want to first try a simpler model. The simpler neural network would serve as a baseline, to measure the advantage gained by using the bigger/badder model. You wouldn’t expect a simple model to get SOTA results.
Finally, you may be working on a simple model because its use case (e.g., a mobile app, or an on-premise app) prevents you from going too deep, especially if you plan on performing online training.
What do you think, then, if you get an accuracy of 99.3% on the test set on your first try? You shouldn’t immediately conclude that the problem is simple and that you just don’t need a Deep Network to ace it. Instead, there are a few practical tips which will help you assess whether you’ve been lucky, the problem is simple, or you made some glaring mistakes. Some of these tips only make sense for classifiers, while others apply to all kinds of Deep Learning models.
In the following we will assume that you already followed best practices to avoid overfitting, i.e., you either used k-fold Cross-Validation (CV) or a training/validation/test set split to estimate your model’s generalization error, and you found a suspiciously low value.
Check for data leaking
First of all, check that all data transforms (or estimators, in TensorFlow lingo) are fit on the training set and applied to the test set, for each fold of k-fold CV. For example, if dropping “useless” features is part of your model, you must not choose the “useful” features on the full training set and then reuse the same features for each fold. On the contrary, you must repeat the feature selection operation for each fold, which means that in each fold you may be using different features to make your predictions. If using a training/validation/test set split instead of k-fold CV, the equivalent approach is to use only the validation set to choose the useful features, without ever peeking at the test set.
If you’re coding in Python, a simple way to make sure that the exact same operations are repeated for each fold is to use the `Pipeline` class (in `scikit-learn`) or the `Estimator` class (in TensorFlow), and to make sure that all operations on data are performed inside the pipeline/estimator.
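As a minimal `scikit-learn` sketch of the leakage-free setup (the data set here is synthetic, purely for illustration): because feature selection lives *inside* the `Pipeline`, it is re-fit on the training portion of every CV fold, so no information from the held-out fold leaks into the choice of features.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 50 features, only 5 of which are informative.
X, y = make_classification(n_samples=300, n_features=50, n_informative=5,
                           random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),               # fit on each fold's training split only
    ("select", SelectKBest(f_classif, k=10)),  # feature selection redone per fold
    ("clf", LogisticRegression(max_iter=1000)),
])

# cross_val_score re-fits the whole pipeline (scaler + selector + classifier)
# on each fold's training data, never on the held-out data.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Had we called `SelectKBest.fit` on the full data set before cross-validating, the selected features would carry information about every fold’s test points, inflating the CV score.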
Are you using the right metric?
If building a classifier, verify that you’re using the right metric for classification. For example, accuracy makes sense as a classification metric *only* if the various classes in the population are reasonably balanced. But on a medical data set, where the incidence of the disease in the population is 0.1%, you can get 99.9% accuracy on a representative data set just by assigning each point to the majority class. In this case, 99.9% accuracy is nothing to be surprised about, because accuracy is not the right metric to use here.
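The point can be made concrete with a tiny sketch (synthetic data, assumed incidence of 0.1%): a “classifier” that always predicts the majority class scores nearly perfect accuracy, while balanced accuracy reveals it is no better than chance.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

rng = np.random.default_rng(0)
y = (rng.random(100_000) < 0.001).astype(int)  # ~0.1% positive (disease) class
X = rng.normal(size=(100_000, 5))              # features carry no signal here

# Always predicts the majority class, ignoring the features entirely.
clf = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = clf.predict(X)

print(accuracy_score(y, pred))           # ~0.999: looks spectacular
print(balanced_accuracy_score(y, pred))  # 0.5: no better than a coin flip
```

Balanced accuracy averages the per-class recalls, so the completely missed minority class drags it down to 0.5; precision/recall on the positive class, or the F1 score, would expose the same problem.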
Is the problem really simple?
Again, in the case of classification, check that the problem is not overly simple. For example, you could have perfectly separable data. In other words, there exists a manifold (or the intersection of multiple manifolds) in input space which separates the various classes perfectly (no noise). In the case of a two-class linearly separable problem, there is a hyperplane in the input space which perfectly separates the two classes. This is easily diagnosed, since all lower-dimensional projections of a hyperplane are hyperplanes. Thus, for example, in the 2D scatterplots for all pairs of features, the two classes can be perfectly separated by a line.
Of course, since Deep Learning is often applied to problems with K (the number of features) in the order of thousands or tens of thousands (e.g., image pixels), it may not even be possible to visualize more than a small fraction of all the K(K−1)/2 scatterplots. This is not a problem, because linearly separable data will be classified perfectly (or nearly perfectly) by simpler machine learning tools, such as linear discriminant analysis, linear Support Vector Machines (SVM) or regularized logistic regression (unregularized logistic regression fit with MLE is unstable for linearly separable problems). Thus if visualization is not possible, fitting these models will give the same information, although in a less direct way.
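A quick sketch of this check with `scikit-learn` (using a synthetic two-blob data set as a stand-in for linearly separable data): if a plain linear SVM already reaches essentially perfect cross-validated accuracy, the classes are (nearly) linearly separable, and the great score of your deep model is no mystery.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Two well-separated Gaussian blobs: a linearly separable toy problem.
X, y = make_blobs(n_samples=500, centers=[[-5, -5], [5, 5]],
                  cluster_std=0.5, random_state=0)

# A linear SVM needs no deep architecture to ace linearly separable data.
scores = cross_val_score(LinearSVC(), X, y, cv=5)
print(scores.mean())  # ~1.0 on separable data
```

If the linear baseline scores far below your network, the problem is genuinely nonlinear and the deep model’s advantage is real; if both are near 100%, suspect an overly simple problem (or leakage).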
Nonlinearly (but perfectly) separable data are harder to detect without visualization. However, machine learning tools such as kernel SVMs (or better, disentangled VAEs), or advanced visualization methods such as t-SNE, may help identify such a condition.
Plotting decision regions, for example with the `plot_decision_regions` function from the `mlxtend` helper package for `scikit-learn`, is another way to quickly identify nonlinearly perfectly separable data. However, again, in Deep Learning problems where you have a vast number of features, such visualizations may be impossible or impractical.
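When a direct decision-region plot is impractical, a t-SNE projection is one hedged alternative: embed the features in 2D and color by class; well-separated, class-pure clusters suggest (possibly nonlinear) separability. A minimal sketch, using a subset of the `digits` data set purely as a stand-in for your features:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)
X, y = X[:500], y[:500]  # a subset keeps t-SNE fast for illustration

# Project the 64-dimensional feature vectors to 2-D.
emb = TSNE(n_components=2, random_state=0).fit_transform(X)

plt.scatter(emb[:, 0], emb[:, 1], c=y, s=5, cmap="tab10")
plt.title("t-SNE embedding colored by class")
plt.show()
```

Keep in mind that t-SNE distorts global distances, so cluster separation in the embedding is suggestive evidence, not proof, of separability in the original space.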
In both cases (linearly and nonlinearly separable data) the point is that there is a “hard” separation boundary between classes (i.e., no noise), so it’s possible to get perfect classification. Arguably, few real-world problems fall into these categories: in most cases there is no “hard” boundary which will perfectly separate the two classes, for all training sets extracted from the population. Still, it can be worthwhile to make this check.
Train on random labels
A test which is often illuminating is to refit the DNN on shuffled (permuted) labels. In this case there’s no relationship between the features and the class labels, thus the neural network has only one way to get good training set error: memorize the whole training set. This will usually manifest as a much longer training time. Also, it has no way to get good test set error. If you still get excellent performance on the test set, then there is either something seriously wrong with your code (consider increasing the coverage of your unit tests, to find what’s wrong) or with the framework you’re using. In the latter case, you should consider opening a GitHub issue at this point.
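A compact sketch of the shuffled-label check, using a small `scikit-learn` MLP as a stand-in for your network: after permuting the labels, cross-validated accuracy should collapse toward chance (about 50% here, for two balanced classes).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=400, n_features=20, random_state=0)

rng = np.random.default_rng(0)
y_shuffled = rng.permutation(y)  # destroys any feature-label relationship

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)

real = cross_val_score(clf, X, y, cv=5).mean()
shuffled = cross_val_score(clf, X, y_shuffled, cv=5).mean()
print(real, shuffled)  # shuffled score should hover near 0.5
```

If the shuffled-label score stays high, your evaluation is broken (most likely some form of leakage between the training and test data), not your luck.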
Train on a small data set
Go to the extreme opposite: train on a single (or very few) data points. In this case, you should get ∼100% accuracy on the training set very quickly (super-short training time), and of course extremely low accuracy on the test set. Again, if you don’t, this indicates some serious issues in your code or in the framework you’re using.
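A sketch of this sanity check, again with a small `scikit-learn` MLP standing in for your network: a model that cannot drive training accuracy to ~100% on a handful of points has a bug somewhere in the code or the pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_tiny, y_tiny = X[:8], y[:8]  # train on just 8 points

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
clf.fit(X_tiny, y_tiny)

print(clf.score(X_tiny, y_tiny))  # expect ~1.0: the model memorizes the points
print(clf.score(X[8:], y[8:]))    # expect much worse on the unseen points
```

With deep learning frameworks, the analogous check is overfitting a single batch: loss should drop to essentially zero within a few hundred steps.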
Last check: get new data
Finally, you can try to obtain a few new test points, representative of the real use case for the DNN, which you’ve never seen (let alone used in training) until now. If you still get excellent accuracy, then congratulations, your model is just great!