Understanding Convolutional Neural Networks

Whether you have noticed it or not, deep learning is one of the most promising topics in artificial intelligence. Its design is loosely inspired by the brain’s neural networks, and it has given outstanding results in many areas: computer vision, natural language processing, speech recognition and more.

Furthermore, deep learning has given really great results on images:

Thanks to DL, we can now write a simple script that recognizes faces in photos or classifies an image as a “cat” or a “dog”. I hope you believe me when I say that there are deep learning networks which can “create” (or, better said, generate) human-like faces!

When dealing with images, we always hear the term “Convolutional Neural Networks”, and yes, CNNs are (so far) one of the most widely used architectures for image processing.

So what are CNNs, and how do they differ from ANNs?

Architecture of a simple CNN:

Generally, a simple Convolutional Neural Network would look like this:

Ok! cool…

Let’s break it down:

It would be really good if you have a basic understanding of Artificial Neural Networks; otherwise, take a look at this page:

Or if you prefer a video tutorial, I highly recommend 3Blue1Brown:

So the first layer is the “Convolution Layer” (CL):

To easily understand what this layer actually does, think of the case where we have to decide whether an input image shows a cat or a dog.

How would a human being differentiate between a dog and a cat?

Well, simply put, there are some “features” you can base your choice on (e.g. the shape of the ears, the eyes, the shape of the head…).

And that is what the CL does.

Its role is to extract these features!

You may wonder: OK, this is really great, but how?

Suppose that we have a small matrix that encodes a pattern (a feature) that exists in cats and not in dogs, or vice versa. We slide this matrix over the input image and, at each position, compute the “similarity” between the matrix and the part of the image underneath it. This similarity is computed by the “convolution function”: the matching values are multiplied one by one and the products are added up.

Graphically, it looks like this:

Just keep in mind that the CL uses small matrices that allow us to extract features from our input layer. These matrices are called “kernels” (or “filters”), and the result of applying one of them to the input is called a “feature map”. In classical image processing, experts designed such matrices by hand; in a CNN, their values are learned during training.
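For reference, here is one way to write that operation as a formula. This is only a sketch: the symbols I (input image), K (a k*k matrix) and S (the output) are my own names, and I assume no padding and a step of one pixel.

```latex
S(i, j) = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} I(i + m,\; j + n)\, K(m, n)
```

Strictly speaking, this is a cross-correlation, because the matrix K is not flipped. Most deep learning libraries compute it this way and still call it a convolution.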

Here is an example that computes an approximation of the vertical changes in the image A:

Kernel (the small matrix) used to detect vertical changes

Effect of the previous kernel on the image

Computation for the first value:

3*1 + 1*0 + 1*-1 + 1*1 + 0*0 + 7*-1 + 2*1 + 3*0 + 5*-1 = -7
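If you want to check that sum yourself, here is a minimal sketch in plain Python. The numbers come from the computation above; I assume they are read row by row, and the variable names are my own.

```python
# 3x3 part of image A that sits under the matrix for the first value
# (taken from the sum above, assumed to be read row by row).
patch = [
    [3, 1, 1],
    [1, 0, 7],
    [2, 3, 5],
]

# The matrix applied to it, read in the same order.
kernel = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]

# Multiply matching entries and add everything up.
first_value = sum(
    patch[r][c] * kernel[r][c]
    for r in range(3)
    for c in range(3)
)

print(first_value)  # -7
```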

As you saw in the image above, the CL not only extracts important features but also decreases the size of its input without losing “pertinent” information.
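To see the size reduction in numbers, here is a small sketch that slides a 3*3 matrix over a whole image. The 4*4 image is made up for illustration (only its top-left corner matches the example above), and I assume no padding and a step of one pixel. Under those assumptions an n*n input and a k*k matrix give an (n-k+1)*(n-k+1) output.

```python
def convolve_valid(image, kernel):
    """Slide a square kernel over the image (no padding, step of 1)."""
    k = len(kernel)
    out_rows = len(image) - k + 1
    out_cols = len(image[0]) - k + 1
    output = []
    for i in range(out_rows):
        row = []
        for j in range(out_cols):
            # Multiply the kernel with the patch whose top-left corner is (i, j).
            total = 0
            for m in range(k):
                for n in range(k):
                    total += image[i + m][j + n] * kernel[m][n]
            row.append(total)
        output.append(row)
    return output


# Hypothetical 4x4 image; the values outside the top-left 3x3 corner are invented.
image = [
    [3, 1, 1, 2],
    [1, 0, 7, 3],
    [2, 3, 5, 1],
    [4, 2, 6, 0],
]

# Same matrix as in the example, assumed to be laid out row by row.
kernel = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]

result = convolve_valid(image, kernel)
print(result)                        # [[-7, -2], [-11, 1]]
print(len(result), len(result[0]))   # 2 2
```

The 4*4 input became a 2*2 output, and the first value is the same -7 as before.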

So now we have a new matrix containing the extracted features, BUT!

Don’t forget that we are working with images here, and the matrices representing them are huge. Even after the CL we would still have an enormous matrix, which we would like to shrink while keeping its most pertinent features. Here comes the role of the “Pooling Layer” (PL).

The role of the PL is easier to digest: as I said, it keeps the important information from the output matrix of the CL. The idea is so simple: we take a square window (generally 2*2 or 3*3), slide it over our matrix, and at each position keep only one value.

The most commonly used PL keeps the maximum value of each window (“Max Pooling”); another common choice is the average (“Average Pooling”).

Here is an example of a Max PL:

So we used a 2*2 window on a 4*4 matrix:

For the first position we have four values: 4, 6, 1 and 3. The max is of course 6, so we keep it, move the window, and so on until we reach the end of the matrix.
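Here is a minimal sketch of that sliding window in plain Python. Only the first window (4, 6, 1, 3) comes from the example; the rest of the 4*4 matrix is invented, and I assume the window moves by its own size so that windows do not overlap. The function names are mine.

```python
def pool(matrix, size=2, reduce=max):
    """Apply `reduce` to each non-overlapping size x size window."""
    pooled = []
    for i in range(0, len(matrix) - size + 1, size):
        row = []
        for j in range(0, len(matrix[0]) - size + 1, size):
            # Collect the values currently covered by the window.
            window = [
                matrix[i + m][j + n]
                for m in range(size)
                for n in range(size)
            ]
            row.append(reduce(window))
        pooled.append(row)
    return pooled


def average(values):
    return sum(values) / len(values)


# Hypothetical 4x4 matrix; only the top-left window matches the example.
matrix = [
    [4, 6, 2, 0],
    [1, 3, 5, 8],
    [7, 2, 9, 1],
    [0, 4, 3, 6],
]

max_pooled = pool(matrix)                  # max pooling
avg_pooled = pool(matrix, reduce=average)  # average pooling

print(max_pooled)  # [[6, 8], [7, 9]]
print(avg_pooled)  # [[3.5, 3.75], [3.25, 4.75]]
```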


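In many cases you will encounter a model with more than one CL and PL.

The purpose behind that is to improve the performance of the model: the first layers pick up simple patterns such as edges, and the later layers combine them into more complex features.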
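To get a feeling for what stacking does to the size of the data, here is a small sketch that only tracks the side length of the matrix. The 28*28 input, the 3*3 matrices and the 2*2 windows are assumptions I picked for illustration, with no padding and non-overlapping pooling windows.

```python
def conv_output_size(size, kernel):
    """Side length after a convolution with no padding and a step of 1."""
    return size - kernel + 1


def pool_output_size(size, window):
    """Side length after pooling with non-overlapping windows."""
    return size // window


size = 28  # hypothetical 28x28 input image
for layer in range(1, 3):
    size = conv_output_size(size, kernel=3)
    print(f"after convolution {layer}: {size}x{size}")
    size = pool_output_size(size, window=2)
    print(f"after pooling {layer}: {size}x{size}")
```

With these choices the side length goes 28, 26, 13, 11, 5, so two CL + PL pairs already turn a 28*28 image into a 5*5 matrix.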

Next we find a “Fully Connected Network” (FC), which is nothing more than a traditional neural network that takes the output of the PL and tries to compute the best predictions.

But earlier we said that the PL takes a matrix and reduces its size, so its output is also a matrix, which cannot be the input of the FC.


Well, there is an intermediate phase between the pooling and the fully connected network, which is called “Flattening”.

What it basically does is transform the output matrix of the PL into a vector of features, which is then fed to the FC.
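Here is a minimal sketch of these last two steps in plain Python. Everything in it is made up for illustration: the pooled values, the weights and the biases (a real network learns them during training). The softmax at the end is a common way to turn scores into probabilities, but it is my assumption, not something fixed by the architecture.

```python
import math

# Output of the pooling layer (made-up values).
pooled = [
    [6, 8],
    [7, 9],
]

# Flattening: read the matrix row by row into a single list.
feature_vector = [value for row in pooled for value in row]
print(feature_vector)  # [6, 8, 7, 9]

# One fully connected layer with two outputs, say "cat" and "dog".
# Arbitrary weights and biases, one row of weights per output.
weights = [
    [0.2, -0.1, 0.0, 0.1],
    [-0.3, 0.2, 0.1, 0.0],
]
biases = [0.0, 0.1]

# Each output is a weighted sum of all the features plus a bias.
scores = [
    sum(w * x for w, x in zip(row, feature_vector)) + b
    for row, b in zip(weights, biases)
]

# Softmax: turn the scores into probabilities that sum to 1.
largest = max(scores)
exps = [math.exp(s - largest) for s in scores]
probabilities = [e / sum(exps) for e in exps]
print(probabilities)
```

When the CL produces several matrices instead of one, they are all flattened and joined into one long vector in the same way.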

Recap: The CNN contains:

Convolution Layer: extracts important features by applying small matrices (kernels) to the input with the convolution function

Pooling Layer: shrinks the output of the CL and keeps its most important values thanks to the sliding window

Flattening: transforms the output matrix of the PL into a feature vector to feed to the Fully Connected Network

Fully Connected Network: a normal Artificial Neural Network that takes the feature vector as input and outputs a prediction vector.

If you are interested in a practical implementation of a CNN, take a look at my repo: https://github.com/ayoubbenaissa/DeepLearningRepository/blob/master/CNN/CNN.py

I am still a beginner in this field, so if I somehow made an error, I apologize in advance. If you have any comments, they would be very valuable to me so I can improve my future articles.


A huge thanks to Mr. Besbes Ahmed for his nice article:


This article was also helpful:


If you want to search for GIFs for your article, visit:


read original article at https://medium.com/@bucky01roberts/understanding-convolution-neural-networks-70ccdd1bbdb?source=rss——artificial_intelligence-5