DeepEmotion — A CNN-based system to predict what emotion a picture will evoke in humans


The main use-cases we wanted to address are:

  1. Enabling Artificial Intelligence (AI) systems to understand what emotion a particular image is likely to trigger.
  2. Better interpretability of human emotions – Finding the areas in the image that cause us to feel a certain emotion.

The Data

The database used is the Categorized Affective Pictures Database (CAP-D), which includes 526 affective pictures taken from four known databases. The images were categorized into discrete emotions in two steps. First, clinical psychologists were asked to generate emotional labels for each image, according to the emotion the image evoked in them. This resulted in 10 emotional categories. Afterwards, students were shown the images and asked which emotional category each image evoked in them. Then, agreement levels were calculated for each image: the agreement level for a category is the fraction of students who classified the image into that category. For more information about this dataset, read this journal paper.
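To make the agreement level concrete, here is a minimal sketch in Python. The ratings and the function name `agreement_levels` are made up for illustration and are not taken from the CAP-D data or its tooling.

```python
from collections import Counter

def agreement_levels(votes):
    """Map each chosen category to the fraction of raters who chose it."""
    counts = Counter(votes)
    total = len(votes)
    return {category: n / total for category, n in counts.items()}

# Hypothetical ratings of one image by five students.
votes = ["fear", "fear", "disgust", "fear", "sadness"]
levels = agreement_levels(votes)

# The category with the highest agreement becomes the image's label.
top_category = max(levels, key=levels.get)
print(top_category, levels[top_category])  # fear 0.6
```

In this toy case the image would be labeled "fear" with an agreement level of 60 percent.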

Data pre-processing

During the hackathon, we considered several alternatives for data pre-processing. The goal was to meet the use-cases we defined, but also to reach a working prototype quickly: a trained model whose results we could analyze in a short time.
The main choices we had to make were:
 1. Whether to predict for each image the distribution of emotions that it is expected to evoke or to predict only one emotion (i.e., solve a classification task).
 2. Whether to use all the images for training the model or only those that have been classified using a high agreement level.
 3. Whether to use all the classes (all 10 emotions) or to use only those that have a relatively high number of images.

Due to time constraints, we chose to solve only a classification problem, for the six most common emotions, using a subset of the data that includes images that were rated with an agreement level greater than 50 percent.

After pre-processing, we were left with about 300 images, which came from six emotion classes. The frequencies shown below were calculated before omitting images due to a low agreement level.

These images were divided into an 80 percent training set and a 20 percent test set. We cannot post images from the original database, since access to it is restricted. To illustrate, we have included similar images throughout this blog.
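The selection and split described above could look roughly like the sketch below. The record layout `(image_id, label, agreement)`, the function name `select_and_split`, the plain random (non-stratified) split, and the choice to pick the six most common classes before filtering by agreement are all assumptions. The synthetic records only stand in for the real data.

```python
import random
from collections import Counter

def select_and_split(records, min_agreement=0.5, n_classes=6,
                     train_frac=0.8, seed=0):
    """Keep high-agreement images of the most common classes, then split.

    records: list of (image_id, label, agreement) tuples (hypothetical layout).
    """
    # Assumption: class frequencies are counted before the agreement filter.
    counts = Counter(label for _, label, _ in records)
    top = {label for label, _ in counts.most_common(n_classes)}
    kept = [r for r in records if r[1] in top and r[2] > min_agreement]
    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(kept)
    cut = round(train_frac * len(kept))
    return kept[:cut], kept[cut:]

# Synthetic stand-in data: 8 labels with different frequencies.
records = []
for i in range(8):
    for j in range(10 * (8 - i)):
        agreement = [0.4, 0.5, 0.6, 0.7, 0.8][j % 5]
        records.append((f"img_{i}_{j}", f"emotion_{i}", agreement))

train, test = select_and_split(records)
print(len(train), len(test))
```

With so few images per class, a stratified split might be a safer choice in practice; the post does not say which kind of split was used.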


Our work can be divided into two main efforts:

I. Classification

Classification of images into one of the six classes: we used an Inception-V3 model that was pre-trained on ImageNet. We removed the last Fully Connected (FC) layer of 1000 neurons and replaced it with an output FC layer of 6 neurons. We first trained only the last layer for X epochs, keeping the weights of all other layers frozen. Then we trained all the layers for an additional Y epochs. The data augmentations were carefully selected to create as many different images as possible without producing illogical ones (such as a landscape picture with the sky at the bottom, which a vertical flip would produce).
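The two-phase fine-tuning described above can be sketched in Keras as follows. This is not the code we ran during the hackathon: the framework, the 299x299 input size, the global average pooling, the Adam optimizer, the learning rates, and the helper names `build_model` and `compile_model` are assumptions made for illustration.

```python
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 6  # the six emotion classes

def build_model(weights="imagenet"):
    """Inception-V3 backbone with a new 6-way output layer."""
    # include_top=False drops the original 1000-way ImageNet classifier;
    # pooling="avg" yields one feature vector per image.
    base = keras.applications.InceptionV3(
        include_top=False, weights=weights,
        input_shape=(299, 299, 3), pooling="avg")
    base.trainable = False  # phase 1: freeze everything except the new head
    # Inputs are expected to be scaled with
    # keras.applications.inception_v3.preprocess_input.
    inputs = keras.Input(shape=(299, 299, 3))
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(base(inputs))
    return base, keras.Model(inputs, outputs)

def compile_model(model, learning_rate):
    """(Re)compile; needed after changing which layers are trainable."""
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"])

# Phase 1 (head only), then phase 2 (all layers). train_ds is a placeholder
# for a dataset of (image, label) batches; epoch counts are left open.
# base, model = build_model()
# compile_model(model, 1e-3); model.fit(train_ds, epochs=...)
# base.trainable = True
# compile_model(model, 1e-5); model.fit(train_ds, epochs=...)
```

A lower learning rate in the second phase is a common precaution so that the pre-trained weights are not overwritten too aggressively by a small dataset.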

Inception-V3 Architecture

II. Analysis and Visualization

Analysis of the results and visualization of the trained model: we used Google’s Deep Dream method to find the areas in the image that caused the network to classify the image into the given emotion. The method uses gradient ascent to reveal the areas in the image that have the strongest influence on the activation of neurons in different layers of the network, especially the last layer, which is responsible for the classification of the emotion.
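The core idea, following the gradient of a class score with respect to the input pixels, can be illustrated on a toy model. The sketch below is not Deep Dream on Inception-V3: it swaps in a tiny linear softmax classifier with random weights so that the gradient can be written by hand, and all sizes and names are assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def class_log_prob_grad(x, W, b, c):
    """Gradient of log p(c | x) w.r.t. the input x for a linear softmax model."""
    p = softmax(W @ x + b)
    return W[c] - p @ W

def gradient_ascent(x, W, b, c, steps=100, lr=0.01):
    """Nudge the input in the direction that raises the score of class c."""
    x = x.copy()
    for _ in range(steps):
        x += lr * class_log_prob_grad(x, W, b, c)
    return x

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 16))    # 6 classes, 16 "pixels" (toy sizes)
b = np.zeros(6)
x0 = 0.1 * rng.normal(size=16)  # toy "image"
target = 2                      # index of the class of interest

# Pixels with a large gradient magnitude influence the class score the most.
saliency = np.abs(class_log_prob_grad(x0, W, b, target))

x1 = gradient_ascent(x0, W, b, target)
p_before = softmax(W @ x0 + b)[target]
p_after = softmax(W @ x1 + b)[target]
print(p_before, p_after)
```

In a real network the gradient would come from automatic differentiation rather than a hand-written formula, but the highlighted pixels are obtained in the same spirit.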


When classifying each image in the test set into one of the six classes, we achieved 50.2% accuracy. This accuracy is satisfying given the time limit and the fact that the expected accuracy of random guessing is about 17% for six classes. Before training all the layers of the model, when training only the last one, the model had an accuracy of 20%. The significant difference between the initial accuracy (20%) and the final accuracy (50%) suggests that the model found relations between the images and the emotion classes.
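As a rough sanity check on these numbers, the chance level and the uncertainty of the accuracy estimate can be worked out in a few lines. The test-set size of about 60 images is an assumption derived from "about 300 images" and the 20 percent split, and the baseline assumes uniform random guessing over equally likely classes.

```python
import math

n_classes = 6
chance = 1 / n_classes            # uniform random guessing, about 0.167

n_test = round(0.2 * 300)         # assumed test-set size (about 60 images)
accuracy = 0.502                  # reported test accuracy

# Binomial standard error of the measured accuracy.
std_err = math.sqrt(accuracy * (1 - accuracy) / n_test)

# How many standard deviations the result lies above chance,
# using the variance expected under random guessing.
z = (accuracy - chance) / math.sqrt(chance * (1 - chance) / n_test)

print(f"chance={chance:.3f}, std_err={std_err:.3f}, z={z:.1f}")
```

If the classes are imbalanced, always predicting the most common class would give a higher baseline than uniform guessing, so that comparison is worth checking as well.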

To address the second objective and to show the scientific potential of the network, we chose some of the clearest emotion-evoking pictures (over 80% certainty) and highlighted the pixels that had the highest influence on the classification decision.

For example, a picture of a smiling child evokes happiness, and the reason for this is apparently the child’s smile. This is pretty straightforward and easy to understand. Here is a demonstration of how the network highlighted the most relevant pixels:

On the other hand, a picture of a snake evokes fear, but understanding what exactly about the snake causes the fear is not as simple as in the case of a child’s smile. In the smile image, our method highlighted the pixels of the smile itself, suggesting that the smile made the network classify the picture as joyful. For the snake image, the network marked a specific area in the snake’s body, mouth, and teeth. Those pixels might hold the explanation of what makes us fear snakes:

By exploring this method, we can better interpret our feelings, and begin a journey at the end of which we could hold a much better understanding of our emotions.

One may say that the arousing factor is individual and cannot be generalized, but for this reason, cognitive psychology defines personality tests to understand what is normal and what a reasonable person is expected to feel. Of course, time will tell whether this approach is the right one for this issue or whether it should be used differently. Until then, we play with the thought that we can make AI systems understand emotions.

Our Team

It should be noted that Guy, Aviv and Ohad also took part in writing this post.
