Let us now look at some of the resulting outputs to see how well we have done and to gain some insight into how the results might be improved.
Below, on the left, are whole images from the validation set with some patches highlighted. In the center are those patches extracted from the output of our CNN, and on the right the same patches extracted from images scaled up with the standard process at idealo (which uses GIMP’s image scaling feature).
LR image (left), reconstructed SR (center), GIMP baseline scaling (right). Source: DIV2K dataset.
The results are far from perfect: for instance, it is easy to spot unwanted artifacts around the butterfly’s antennae. However, details such as the hair around the neck and on the back of the butterfly, and the contours of the spots on the wings, look distinctly crisper in the output of the network than in the baseline.
Understanding the results
To understand where our model generalizes well and where it does not, we extracted from the validation set patches with high PSNR values and patches with low ones.
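As a rough illustration of this kind of analysis, the sketch below computes the PSNR of each patch and sorts the patches from worst to best. The function names `psnr` and `rank_patches_by_psnr` are made up for this example, and it assumes 8-bit images (peak value 255) held as NumPy arrays; it is not the code we used.

```python
import numpy as np

def psnr(hr, sr, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    hr = np.asarray(hr, dtype=np.float64)
    sr = np.asarray(sr, dtype=np.float64)
    mse = np.mean((hr - sr) ** 2)
    if mse == 0:
        # Identical images: the ratio is unbounded.
        return float("inf")
    return 10.0 * np.log10(max_value ** 2 / mse)

def rank_patches_by_psnr(hr_patches, sr_patches):
    """Return (order, scores): patch indices sorted from lowest to highest PSNR."""
    scores = np.array([psnr(h, s) for h, s in zip(hr_patches, sr_patches)])
    order = np.argsort(scores)
    return order, scores
```

The first indices in `order` point at the patches the network reproduces worst, the last ones at the patches it reproduces best.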
Unsurprisingly, the best performing patches are the ones with large flat areas while more complex patterns are harder to reproduce accurately. We might then want to focus on these areas for both training and evaluation of the results.
The same is highlighted by a heat map of the error between the original HR image and the SR output of the network: darker colors correspond to higher pixel-wise mean squared error, while lighter colors correspond to lower error, or better results.
Heatmap for pixel-wise HR-SR error. Darker colors mean higher error and lighter lower error, or better results.
We can see that areas with more patterns correspond to higher errors, but also that intuitively “simpler” transition areas (cloud to sky, for instance) are fairly dark. This is something that might be improved upon, as it will be relevant for idealo’s catalog use case.
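A heat map of this kind can be computed in a few lines. The sketch below is only an assumption about how one might do it: `error_heatmap` is a hypothetical helper that takes the squared HR-SR difference at each pixel and averages it over the color channels.

```python
import numpy as np

def error_heatmap(hr, sr):
    """Squared error per pixel, averaged over color channels if present."""
    hr = np.asarray(hr, dtype=np.float64)
    sr = np.asarray(sr, dtype=np.float64)
    sq_err = (hr - sr) ** 2
    if sq_err.ndim == 3:
        # Collapse the channel axis so the result is a 2D map.
        sq_err = sq_err.mean(axis=-1)
    return sq_err
```

Plotting the result with a reversed grayscale colormap (for example matplotlib's `imshow` with `cmap="gray_r"`) gives the convention used in the figure, where darker means higher error.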
A few words on non-standard ground truth in deep learning tasks
Unlike more common supervised deep learning tasks where the labels are either categorical or numerical, the ground truth that we use to evaluate the output of the network is the original HR image.
This is both good and bad news. Bad news first: popular deep learning frameworks like Keras do not have pre-made training utilities (such as generators) that can be applied in this setting. They typically fetch training/evaluation labels from a 1-dimensional array or file, or derive them from the folder structure, so there will be some extra coding involved (is this really bad news?). The (very) good news is that there is no need to go to great lengths to get the labels: given a decent pool of HR images, we can simply downscale them to obtain our LR training data and use the original HR images to evaluate the loss.
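To make the idea concrete, here is a minimal sketch of building an (LR, HR) training pair from a single HR image. It downscales by averaging blocks of pixels, which is an assumption chosen to keep the example dependency-free; any other interpolation method could be swapped in. The name `downscale_by_averaging` and the default factor of 2 are illustrative.

```python
import numpy as np

def downscale_by_averaging(hr, scale=2):
    """Return (lr, hr_cropped) for an H x W x C image.

    The LR image is built by averaging scale x scale pixel blocks.
    The HR image is cropped so that its sides are exact multiples of scale.
    """
    hr = np.asarray(hr, dtype=np.float64)
    h, w, c = hr.shape
    h, w = h - h % scale, w - w % scale
    hr = hr[:h, :w]
    # Split each spatial axis into (blocks, pixels within block), then average.
    blocks = hr.reshape(h // scale, scale, w // scale, scale, c)
    lr = blocks.mean(axis=(1, 3))
    return lr, hr
```

The LR array is the network input and the cropped HR array is the target, so no manual labelling is needed.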
Normally, when training a neural network on image data, the training batches are created by randomly selecting a number of images from the training set. These are then rescaled to a smaller size, typically around 100×100 pixels, augmented on the fly with random transformations, and fed to the network. In this context, feeding the network with whole images is neither necessary nor desirable. We cannot rescale the images down to, say, a 100×100 training point: we want to scale them up, after all. At the same time, we cannot afford to train on large images (such as 500×600), as they would take a very long time to process. Instead, random patches of very small size (down to 16×16) can be extracted from the whole picture, giving us a lot more data points to play with, as each image can be the source of hundreds of different patches.
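The patch extraction described above might look roughly like the following sketch. It assumes the LR and HR versions of an image are already available as NumPy arrays, with the HR one exactly `scale` times larger. The function name `random_patch_pairs` and its parameters are hypothetical.

```python
import numpy as np

def random_patch_pairs(lr, hr, n_patches, lr_patch_size=16, scale=2, rng=None):
    """Cut n_patches random LR patches and the matching HR patches."""
    rng = np.random.default_rng() if rng is None else rng
    max_row = lr.shape[0] - lr_patch_size
    max_col = lr.shape[1] - lr_patch_size
    if max_row < 0 or max_col < 0:
        raise ValueError("image is smaller than the requested patch size")
    hr_patch_size = lr_patch_size * scale
    lr_patches, hr_patches = [], []
    for _ in range(n_patches):
        # Top-left corner in LR coordinates (upper bound is exclusive).
        r = rng.integers(0, max_row + 1)
        c = rng.integers(0, max_col + 1)
        lr_patches.append(lr[r:r + lr_patch_size, c:c + lr_patch_size])
        # The same region in HR coordinates is scaled by the upscaling factor.
        hr_patches.append(hr[r * scale:r * scale + hr_patch_size,
                             c * scale:c * scale + hr_patch_size])
    return np.stack(lr_patches), np.stack(hr_patches)
```

With 16×16 patches, even a modest image offers thousands of distinct crop positions, which is where the extra data points come from.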
The reason why we can afford to take very small portions of the images is that we are not classifying a bunch of patterns into categories (legs + tails + whiskers + dead mouse =? cat). This is also why the usual dense layers at the end of the architecture are missing. We only need the network to construct an abstract representation of those patterns and learn how to scale them up (and recombine them so that the image makes sense). This abstract representation is built by the convolutional layers which, together with the upscaling layer, are the only types of layer in this network.
On a related note, the fully convolutional architecture makes this network independent of the input size. This means that, unlike many other CNNs used for classification, you can feed the network with images of any size: no matter what the initial size is, the network will output an image of double the original size.
For more details about the RDN check the paper linked in the introduction and at the bottom of this article.
On the flip side, one extra decision is needed: how to extract these patches from the images. We boiled it down to this: pick n random images from the dataset, then extract (and augment) p random patches from each one of them. We ended up trying a few ways of doing this, summarized in the picture below.
Pictorial description of the different feeding methods we tried.
At first we created an entire dataset of patches extracted following a uniform grid. At training time we would randomly pick batch_size of them, augment them on the fly and feed them to the network. This approach had two downsides. First, it produced a VERY large dataset that needed to be statically stored, which is not ideal if you want to use a cloud service for training, since moving and extracting the dataset is a fairly time-consuming operation. Second, the dataset was deterministically defined, which might not be optimal. An alternative approach we tried was to randomly select batch_size whole images and extract a single patch from each. The bottleneck for this approach turned out to be reading from disk, which drastically slowed down training (from 15 minutes to an entire hour per epoch with our setup).
We finally converged on randomly picking a single whole image from the original dataset and extracting batch_size patches from it on the fly. This meant we only had to store the original dataset, while keeping training fast.
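A generator following this last scheme could be sketched as below. Several details are assumptions made for the example rather than a description of our code: the images are held in memory instead of being read from disk, the augmentation is limited to random 90-degree rotations and horizontal flips, and each LR patch is obtained by block-averaging its HR patch. The name `patch_batches` is hypothetical.

```python
import numpy as np

def patch_batches(hr_images, batch_size, lr_patch_size=16, scale=2, rng=None):
    """Yield (lr_batch, hr_batch) forever, each batch cut from one random image."""
    rng = np.random.default_rng() if rng is None else rng
    hr_size = lr_patch_size * scale
    while True:
        # One image per batch: a single load instead of batch_size loads.
        hr = hr_images[rng.integers(len(hr_images))]
        lr_batch, hr_batch = [], []
        for _ in range(batch_size):
            r = rng.integers(0, hr.shape[0] - hr_size + 1)
            c = rng.integers(0, hr.shape[1] - hr_size + 1)
            hr_patch = hr[r:r + hr_size, c:c + hr_size].astype(np.float64)
            # Augmentation (assumed): random rotation and horizontal flip.
            hr_patch = np.rot90(hr_patch, k=int(rng.integers(4)))
            if rng.random() < 0.5:
                hr_patch = hr_patch[:, ::-1]
            # LR patch by averaging scale x scale blocks (assumed method).
            lr_patch = hr_patch.reshape(
                lr_patch_size, scale, lr_patch_size, scale, -1
            ).mean(axis=(1, 3))
            lr_batch.append(lr_patch)
            hr_batch.append(hr_patch)
        yield np.stack(lr_batch), np.stack(hr_batch)
```

Calling `next()` on the generator returns one batch; with images on disk, the only change would be to load the chosen file at the top of the loop.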
This was the first step towards magnifying idealo’s product catalog.
Below is the network output for a low quality, low resolution image from our product catalog.
Low resolution image of a sandal.
Super-scaled image of a sandal.
There is very noticeable noise where the image transitions from the foreground object into the flat background, and the text is also slightly distorted. These are the low-hanging fruit that we plan to improve upon.
The next step will be training the network on our own product image dataset. Hopefully this will help with text and background/object contrast, which are not heavily present in the natural images coming from the DIV2K dataset. Another step on the wish list is incorporating noise reduction by adding random noise at the time of downscaling, but this is further down the line.
Please let me know if you found this article useful (👏🏻) so others can find it too, and share it with your friends. You can follow me here on Medium (Francesco Cardinale) to stay up-to-date with my work. Thanks a lot for reading!
Github: Image Super Resolution
Paper: Residual Dense Network for Image Super-Resolution (Zhang et al. 2018)