This tutorial was adapted from Fastai DL Lesson 1 of 2019 with many of my additions and clarifications. I hope you find it helpful.
Following this tutorial, you will be able to build and train an Image Recognizer on any image dataset of your choice, with a good understanding of the underlying model architecture and training process.
This tutorial covers:
1. Data Extraction
2. Data Visualization
3. Model Training: CNNs, ResNets, transfer learning
4. Results Interpretation
5. Freezing & Unfreezing of model layers
6. Fine-Tuning: Learning rate finder, One Cycle Policy
This tutorial is a great introduction to any new Deep Learning practitioner, anyone who wants to simply refresh on the basics of image classification using CNNs and ResNets, or anyone who has not used fastai library and wants to try it out.
The notebook of this tutorial can also be found here.
To run the notebook, you can simply open it with Google Colab here.
The notebook is self-contained and bug-free, so you can run it as is.
Once in Colab, make sure to change the following to enable GPU backend,
Runtime -> Change runtime type -> Hardware Accelerator -> GPU
The code in this tutorial is concisely explained. Further documentation for any of the classes, methods, etc. can be found at fastai docs.
Let’s get started…
Importing necessary libraries,
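A minimal sketch of the import cell, assuming the fastai v1 library that the 2019 course used:

```python
from fastai.vision import *        # vision application: data, transforms, models, training
from fastai.metrics import error_rate  # the metric we will report during training
```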
Let’s do some initializations,
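A sketch of the initializations (the batch size of 64 matches the example below; the seed value itself is arbitrary, any fixed number works):

```python
import numpy as np

bs = 64            # batch size; reduce this if you run out of GPU memory
np.random.seed(2)  # fix the seed so the train/validation split is reproducible
```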
bs is our batch size, which is the number of training images to be fed to the model at once. The model parameters are updated after each batch iteration.
For instance, if we have 640 images and our batch size is 64; the parameters will be updated 10 times over the course of 1 epoch.
If you happen to run out of memory at some point during the tutorial, a smaller batch size can help. Batch sizes are usually powers of 2.
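The batch arithmetic above can be checked with a tiny helper (a hypothetical function, just to make the example concrete):

```python
# With 640 training images and a batch size of 64, the parameters are
# updated 640 / 64 = 10 times over the course of one epoch.
def updates_per_epoch(n_images, batch_size):
    """Number of parameter updates in one epoch (counting full batches)."""
    return n_images // batch_size

print(updates_per_epoch(640, 64))  # 10
```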
Initializing the pseudo-random number generator above with a fixed value makes the runs deterministic, producing reproducible results.
URLs.PETS is the URL of the dataset. It features 12 cat breeds and 25 dog breeds.
untar_data downloads the data file and decompresses it into our data path.
get_image_files gets the paths of ALL files contained in the images directory and stores them into
fnames. An instance from
fnames would look as follows,
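As a sketch, the two calls above look like this (fastai v1 API):

```python
path = untar_data(URLs.PETS)        # downloads and decompresses the dataset
path_img = path/'images'
fnames = get_image_files(path_img)  # paths of ALL image files in the directory
```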
Since the label of each image is contained within the image filename, we shall use regular expressions to extract it. A regular expression, often abbreviated regex, is a pattern describing a certain amount of text. Our pattern to extract the image label is as follows,
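The pattern can be tried out with Python's built-in re module; the file path below is a hypothetical example shaped like the Pets dataset filenames:

```python
import re

# The label is everything between the last '/' and the trailing '_<index>.jpg'.
pat = r'/([^/]+)_\d+.jpg$'

fname = 'images/scottish_terrier_119.jpg'  # hypothetical example path
label = re.search(pat, fname).group(1)
print(label)  # scottish_terrier
```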
This last step is specific to this dataset. For instance, we do not have to worry about it if the images belonging to the same class are within the same folder.
Let’s now create our training and validation datasets,
ImageDataBunch creates a training dataset, train_ds, and a validation dataset, valid_ds, from the images in the path
from_name_re gets the labels from the list of file names
fnames using the compiled regular expression pattern defined above
ds_tfms are transformations applied to the images on the fly. Here, images will be resized to 224×224, center-cropped, and zoomed. Such transformations are instances of Data Augmentation, which has proved effective in computer vision. The transformations do not change what is inside the image but change its pixel values so that the model generalizes better.
normalize normalizes the data using the standard deviation and mean of ImageNet images.
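Putting the pieces together, the dataset creation might look like this (a sketch of the fastai v1 call, assuming path_img, fnames, pat, and bs were defined as above):

```python
data = ImageDataBunch.from_name_re(
    path_img, fnames, pat,
    ds_tfms=get_transforms(),   # default augmentations: flips, rotations, zooms, ...
    size=224, bs=bs
).normalize(imagenet_stats)     # ImageNet per-channel means and standard deviations
```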
A training data sample is represented as
(Image (3, 224, 224), Category scottish_terrier)
where the first element represents the image tensor (3 RGB channels, rows, and columns). The second element is the image label.
The corresponding image of this instance is,
len(data.train_ds) and len(data.valid_ds) output the number of training and validation samples, 5912 and 1478, respectively.
data.classes outputs the class labels. There are 37 classes with the following labels,
['Abyssinian', 'Bengal', 'Birman', 'Bombay', 'British_Shorthair', 'Egyptian_Mau', 'Maine_Coon', 'Persian', 'Ragdoll', 'Russian_Blue', 'Siamese', 'Sphynx', 'american_bulldog', 'american_pit_bull_terrier', 'basset_hound', 'beagle','boxer', 'chihuahua', 'english_cocker_spaniel', 'english_setter', 'german_shorthaired', 'great_pyrenees', 'havanese', 'japanese_chin', 'keeshond', 'leonberger', 'miniature_pinscher', 'newfoundland', 'pomeranian', 'pug', 'saint_bernard', 'samoyed', 'scottish_terrier', 'shiba_inu', 'staffordshire_bull_terrier', 'wheaten_terrier', 'yorkshire_terrier']
show_batch shows a few images inside a batch.
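A sketch of the call (fastai v1 API; the grid size is a common choice, not mandated by the library):

```python
data.show_batch(rows=3, figsize=(7, 6))  # a 3x3 grid of augmented training images
```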
cnn_learner builds a CNN learner using a pre-trained model from a given architecture. The learned parameters from the pre-trained model are used to initialize our model, allowing a faster convergence with high accuracy.
The CNN architecture used here is ResNet34, which has had great success within the last few years and is still considered state-of-the-art.
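The learner creation might look like this (fastai v1 API; error_rate is the metric imported earlier):

```python
learn = cnn_learner(data, models.resnet34, metrics=error_rate)
```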
There is great value in discussing CNNs and ResNets, as that will help us understand better our training process here. Shall we? 🙂
CNNs in a nutshell:
So first, what is a Convolutional Neural Network (CNN or ConvNet)? We can think of a ConvNet as a list of layers that transform the image volume into an output volume, which can be a class score, as is the case in this tutorial. These layers are made up of neurons connected to other neurons of the previous layers. For an in-depth read, I highly recommend Convolutional Neural Networks from Stanford’s CS231 class.
A typical CNN architecture [Source]
This figure is an illustration of a typical ConvNet architecture. We can think of all CNN architectures as various combinations of different differentiable functions (convolutions, downsamplings, and affine transformations). The above figure has only a few layers, but deep networks have dozens to hundreds.
ResNets in a nutshell:
A very common problem in deep networks is the degradation problem, where the model accuracy gets saturated and then degrades rapidly. This is counterintuitive, as we expect the additional layers to enable more detailed and abstract representations. Degradation is exactly what ResNets aim to solve, making it possible to train deeper networks reliably.
ResNets’ approach to solving the degradation problem is by introducing “identity shortcut connections”, often referred to as “skip connections”, which skip one or more layers. The output of the skip connection is added to the output of the stacked layers, as shown in the figure below. The skip connections effectively skip the learning process on some layers enabling the deep network to also act as a shallow network in a way.
A residual block [Source]
The skip function creates what is known as a residual block, F(x) in the figure, and that’s where the name Residual Nets (ResNets) came from. Traditional networks aim to learn the output H(x) directly, while ResNets aim to learn the residual F(x). Making F(x) = 0 allows the network to skip that subnetwork, as H(x) = x.
It has been shown that the addition of these identity mappings allows the model to go deeper without degradation in performance and such networks are easier to optimize than plain stacked layers. There are several variants of ResNets, such as ResNet50, ResNet101, ResNet152; the ResNet number represents the number of layers (depth) of the ResNet network.
In this tutorial, we are using ResNet34, which looks as follows,
Architecture and convolutional kernels of ResNet34 [Source]
In the figure, the bottom number represents the input or feature map size (Height x Width) and the number above represents the number of channels (number of filters). For instance, the leftmost block represents the input image (224 x 224 x 3). Each of the “Layers” in the figure contains a few residual blocks, which in turn contain stacked layers with different differentiable functions, resulting in 34 layers end-to-end. Below is the full underlying layout of the ResNet34 architecture compared to a similar plain architecture; the side arrows represent the identity connections.
A plain 34-layer CNN (left) and a 34-layer ResNet (right) [Source]
Feel free to try any of the other ResNets by simply replacing
models.resnet34 with
models.resnet50 or any other desired architecture. Bear in mind that increasing the number of layers requires more GPU memory.
What we have described above of using a pre-trained model and adapting it to our dataset is called Transfer learning. But why use transfer learning?
Deep neural networks have a huge number of parameters, often in the range of millions. Training such networks on a small dataset (one that is smaller than the number of parameters) greatly affects the network’s ability to generalize, often resulting in overfitting. So in practice, it is rare to train a network from scratch with random weights initialization.
The pre-trained model is usually trained on a very large dataset, such as ImageNet which contains 1.2 million images with 1000 categories. Thus, the pre-trained model would have already learned to capture universal features like curves, color gradients, and edges in its early layers, which can be relevant and useful to most other computer vision classification problems. Transfer learning has also been shown to be effective in other domains, such as NLP and speech recognition.
Now, with transfer learning, our model is already pre-trained on ImageNet and we only need to make it more specific to the details of our dataset in hand. We have two options to do this: update only the parameters of the last layers, or update all of the model’s layers. The first option is often referred to as feature extraction, while the second is referred to as fine-tuning. In both approaches, it is important to first reshape the final layer to have the same number of classes as our dataset, since the ImageNet pre-trained model has an output layer of size 1000.
Great! We have covered many core concepts so far.
Let’s now train the model on our dataset,
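The training call itself is a one-liner (fastai v1 API; 4 epochs as discussed below):

```python
learn.fit_one_cycle(4)  # train the head for 4 epochs using the 1cycle policy
```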
fit_one_cycle trains the model for the number of epochs provided, i.e., 4 here.
The number of epochs represents the number of times the model looks at the entire set of images. In every epoch, however, each image is slightly different due to our data augmentation.
Usually, the error metric goes down with each epoch. It is a good idea to increase the number of epochs as long as the accuracy of the validation set keeps improving. However, too many epochs can cause the model to learn the specific images rather than the general classes, something we want to avoid.
We can also train our model using the method
fit instead of
fit_one_cycle, but let’s not worry about this for now. We will cover their difference when we get to our discussion about learning rates and fine-tuning.
The training that we just did here is what we referred to as feature extraction, so only the parameters of the head (last layers) of our model were updated. We shall try fine-tuning all the layers next.
Congratulations!!! The model has been successfully trained to recognize dog and cat breeds.
Our achieved accuracy above is ≈ 93.5%
Can we do even better? We’ll see after fine-tuning.
Let’s save the current model parameters in case we want to reload them later.
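A sketch of the save call (fastai v1 API; 'stage-1' is an arbitrary checkpoint name, not anything special to the library):

```python
learn.save('stage-1')  # writes the weights under the data path for later reloading
```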
Let’s now see how to properly interpret the current model results.
ClassificationInterpretation provides a visualization of the misclassified images.
plot_top_losses shows images with top losses along with their:
prediction label / actual label / loss / probability of actual image class
A high loss implies high confidence about the wrong answer. Plotting top losses is a great way to visualize and interpret classification results.
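The two calls described above might look like this (fastai v1 API; the grid size and figure size are common choices, not requirements):

```python
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_top_losses(9, figsize=(15, 11))  # 9 worst-classified images
```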
Misclassified images with the highest losses
Classification confusion matrix
In a confusion matrix, the diagonal elements represent the number of images for which the predicted label is equal to the true label, while off-diagonal elements are those that are mislabeled by the classifier.
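The confusion matrix can be plotted from the same interpretation object (fastai v1 API; figure size and dpi are illustrative):

```python
interp.plot_confusion_matrix(figsize=(12, 12), dpi=60)
```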
most_confused simply grabs the most confused combinations of predicted and actual categories; in other words, the ones it got wrong most often. We can see that it often misclassified the staffordshire bull terrier as an american pit bull terrier; they do actually look very similar 🙂
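A sketch of the call (fastai v1 API; min_val=2 keeps only pairs confused at least twice):

```python
interp.most_confused(min_val=2)
```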
[('Siamese', 'Birman', 6),
('american_pit_bull_terrier', 'staffordshire_bull_terrier', 5),
('staffordshire_bull_terrier', 'american_pit_bull_terrier', 5),
('Maine_Coon', 'Ragdoll', 4),
('beagle', 'basset_hound', 4),
('chihuahua', 'miniature_pinscher', 3),
('staffordshire_bull_terrier', 'american_bulldog', 3),
('Birman', 'Ragdoll', 2),
('British_Shorthair', 'Russian_Blue', 2),
('Egyptian_Mau', 'Abyssinian', 2),
('Ragdoll', 'Birman', 2),
('american_bulldog', 'staffordshire_bull_terrier', 2),
('boxer', 'american_pit_bull_terrier', 2),
('chihuahua', 'shiba_inu', 2),
('miniature_pinscher', 'american_pit_bull_terrier', 2),
('yorkshire_terrier', 'havanese', 2)]
By default in fastai, using a pre-trained model freezes the earlier layers, so that the network can only make changes to the parameters of the last layers, as we did above. Freezing the first layers and training only the deeper layers can significantly reduce the computation.
We can always train all of the network’s layers by calling the
unfreeze method, followed by
fit_one_cycle. This is what we called fine-tuning, as we are tuning the parameters of the whole network. Let’s do it,
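A sketch of the two calls (fastai v1 API; one epoch here, matching the lesson's quick check):

```python
learn.unfreeze()        # make all layers trainable
learn.fit_one_cycle(1)  # one epoch over the whole network
```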
The accuracy now is a little worse than before. Why is that?
It is because we are updating the parameters of all the layers at the same speed, which is not what we want, since the first layers do not need as much change as the last layers do. The hyperparameter controlling the size of the weight updates is called the learning rate, also referred to as the step size. It adjusts the weights with respect to the gradient of the loss, with the objective of reducing the loss. For instance, in the most common gradient descent optimizer, the relationship between the weights and the learning rate is as follows,
which translates to
new_weight = old_weight - lr * gradient
By the way, a gradient is simply the multi-variable generalization of a derivative: a vector of partial derivatives.
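The update rule above, written out as a tiny function (illustrative names, plain gradient descent on a single scalar weight):

```python
# new_weight = old_weight - lr * gradient, for one weight and one step.
def sgd_step(old_weight, gradient, lr):
    return old_weight - lr * gradient

new_weight = sgd_step(0.5, 2.0, lr=0.1)  # 0.5 - 0.1 * 2.0, i.e. about 0.3
```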
Therefore, a better approach to fine-tune the model would be to use different learning rates for the lower and higher layers, often referred to as differential or discriminative learning rates.
By the way, I am using parameters and weights interchangeably in this tutorial. More accurately, parameters are weights and biases, but let’s not worry about this subtlety here. However, note that hyperparameters and parameters are different; hyperparameters cannot be estimated within training.
In order to find the most adequate learning rate for fine-tuning the model, we use a learning rate finder, where the learning rate is gradually increased and the corresponding loss is recorded after each batch. The fastai library has this implemented in
lr_find. For a further read on this, check out How Do You Find A Good Learning Rate by @GuggerSylvain.
Let’s load the model we had previously saved and run the learning rate finder,
The recorder.plot method can be used to plot the losses versus the learning rates. The plot stops when the loss starts to diverge.
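A sketch of these three steps (fastai v1 API; 'stage-1' is the checkpoint name assumed earlier):

```python
learn.load('stage-1')  # restore the parameters saved before fine-tuning
learn.lr_find()        # sweep increasing learning rates, recording the loss per batch
learn.recorder.plot()  # plot loss versus learning rate
```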
From the resulting plot, we conclude that an appropriate learning rate would be around 1e-4 or lower, a bit before the loss starts to increase and go out of control. We will assign 1e-4 to the last layers and a much smaller rate, 1e-6, to the earlier layers. Again, this is because the earlier layers are already well trained to capture universal features and would not need as much updating.
In case you are wondering about the learning rate used in our previous experiments since we did not explicitly declare it, it was 0.003 which is set by default in the library.
Before we train our model with these discriminative learning rates, now is a good time to demystify the difference between the
fit_one_cycle and
fit methods. This discussion can be very valuable in understanding the training process, but feel free to skip directly to the results.
Briefly, the difference is that
fit_one_cycle implements Leslie Smith’s 1cycle policy, which, instead of using a fixed or a decreasing learning rate to update the network’s parameters, oscillates between two reasonable lower and upper learning rate bounds. Let’s dig a little deeper into how this can help our training.
➯ Learning Rate Hyperparameter in Training
A good learning rate hyperparameter is crucial when tuning deep neural networks. A high learning rate allows the network to learn faster, but a learning rate that is too high can prevent the model from converging. On the other hand, a small learning rate makes training progress very slowly.
Effect of various learning rates on convergence [Source]
In our case, we estimated the appropriate learning rate (lr) by looking at the recorded losses at different learning rates. It is possible to use this learning rate as a fixed value in updating the network’s parameters; in other words, the same learning rate will be applied through all training iterations. This is what
learn.fit(lr) does. A much better approach is to change the learning rate as training progresses. There are two ways to do this: learning rate schedules (time-based decay, step decay, exponential decay, etc.) or adaptive learning rate methods (Adagrad, RMSprop, Adam, etc.). For more about this, check out CS230 Stanford class notes on Parameter Updates. Another good resource is An overview of gradient descent optimization algorithms by @Sebastian Ruder.
➯ One Cycle Policy in a nutshell
The one cycle policy is a type of learning rate schedule that lets the learning rate oscillate between reasonable minimum and maximum bounds. What are the values of these two bounds? The upper bound is what we got from our learning rate finder, while the minimum bound can be 10 times smaller. The advantage of this approach is that it can overcome local minima and saddle points (points on flat surfaces with typically small gradients). The 1cycle policy has proved to be faster and more accurate than other scheduling or adaptive learning approaches. Fastai implements the 1cycle policy in
fit_one_cycle, which internally calls
fit method along with a
OneCycleScheduler callback. Documentation of fastai 1cycle policy implementation can be found here.
One cycle length of 1cycle policy [Source]
A slight modification of the 1cycle policy in the fastai implementation is that the second phase uses cosine annealing to bring the learning rate back down from its maximum, rather than a linear descent.
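As a toy illustration of the shape described above, here is a sketch of a 1cycle-style schedule with a linear warm-up and a cosine-annealed second phase (illustrative only; fastai's actual implementation differs in its details, e.g. it also cycles momentum):

```python
import math

def one_cycle_lr(step, total_steps, lr_max, div=10, pct_start=0.3):
    """Learning rate at a given step of a toy 1cycle schedule."""
    lr_min = lr_max / div
    warm = int(total_steps * pct_start)
    if step < warm:
        # phase 1: linear ramp from lr_min up to lr_max
        return lr_min + (lr_max - lr_min) * step / warm
    # phase 2: cosine annealing from lr_max back down to lr_min
    t = (step - warm) / (total_steps - warm)
    return lr_min + (lr_max - lr_min) * (1 + math.cos(math.pi * t)) / 2

schedule = [one_cycle_lr(s, 100, 1e-3) for s in range(100)]
```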
➯ 1cycle Policy discovery
Leslie Smith first introduced a method he called Cyclical Learning Rates (CLR), where he showed that CLRs are not computationally expensive and eliminate the need to find the best learning rate value, since the optimal learning rate will fall somewhere between the minimum and maximum bounds. He then followed that paper with another, A disciplined approach to neural network hyper-parameters: Part 1 — learning rate, batch size, momentum, and weight decay, where he highlighted various remarks and suggestions to enable faster training of networks with optimal results. One of the propositions was to use CLR with just one cycle to achieve optimal and fast results, which he elaborated in a third paper, Super-Convergence; the authors named the approach the 1cycle policy.
The figure below is an illustration of how the super-convergence method reaches higher accuracies than a typical (piecewise constant) training regime in much fewer iterations for Cifar-10, both using a 56 layer residual network architecture.
Super-convergence accuracy test vs a typical training regime with the same architecture on Cifar-10 [Source]
Moment of Truth
Now that we picked our discriminative learning rates for our layers, we can unfreeze the model and train accordingly.
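The fine-tuning step might look like this (fastai v1 API; 2 epochs is an illustrative choice):

```python
learn.unfreeze()                                  # make all layers trainable
learn.fit_one_cycle(2, max_lr=slice(1e-6, 1e-4))  # discriminative learning rates
```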
The slice function assigns 1e-4 to the last layers and 1e-6 to the first layers; the layer groups in between get learning rates spaced evenly on a log scale between the two bounds.
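The spacing idea can be sketched with a small helper (hypothetical function name; a sketch of the idea behind slice(1e-6, 1e-4), not fastai's exact implementation):

```python
# Spread learning rates across n layer groups with multiplicatively even
# steps: the earliest group gets the lower bound, the head gets the upper.
def discriminative_lrs(lr_lo, lr_hi, n_groups):
    if n_groups == 1:
        return [lr_hi]
    ratio = (lr_hi / lr_lo) ** (1 / (n_groups - 1))
    return [lr_lo * ratio ** i for i in range(n_groups)]

print(discriminative_lrs(1e-6, 1e-4, 3))  # roughly [1e-06, 1e-05, 1e-04]
```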
We see the accuracy has improved a bit, but not by much, so we might wonder whether we needed to fine-tune the model at all.
Two key factors to always consider before fine-tuning any model are the size of the dataset and its similarity to the dataset the model was pre-trained on. Check out Stanford’s CS231 notes on When and how to fine-tune?. In our case, the Pet dataset is similar to the images in ImageNet and it is relatively small, which is why we achieved a high classification accuracy from the start without fine-tuning the full network.
Nonetheless, we were still able to improve our results a bit and learned so much, so GREAT JOB 🙂
The figure below illustrates the three plausible ways to use and fine-tune a pre-trained model. In this tutorial, we attempted the first and third strategies. Strategy 2 is also common in cases where the dataset is small but distinct from the dataset of the pre-trained model, or where the dataset is large but similar to the dataset of the pre-trained model.
Fine-tuning strategies on a pre-trained model
Congratulations, we have successfully covered image classification using a state-of-the-art CNN, with a solid foundation of the underlying structure and training process.
You are ready to build an image recognizer on your own dataset. If you do not already have one, you can scrape images from Google Images and make up a dataset. I made a very short tutorial for that ⬇ check it out.
A State-of-the-Art Image Classifier on Your Dataset in Less Than 10 Minutes
Fast multi-class image classification with code ready, using fastai and PyTorch libraries
I hope you found this tutorial helpful. Please share it and give it a few claps so it can reach others as well. Feel free to leave any comments, connect with me on Twitter @SalimChemlal, and follow me on Medium for more!
“A mind that is stretched by a new experience can never go back to its old dimensions.” — Oliver Wendell Holmes Jr.