Perform emotion classification in the context of a conversation.
Welcome to another paper review!
DialogueRNN is a method that aims to perform emotion classification of utterances in the context of a conversation. Many applications can benefit from this type of analysis, such as understanding the emotional context and exchange in debates and social media threads.
Previous methods do not pay attention to individuals’ emotional states. The proposed emotion detection model considers individual speakers by focusing on three different aspects: the speaker, the context of preceding utterances, and the emotion from preceding utterances. The idea is that these three aspects are important to accurately predict the emotion of the utterance.
Utterances for a party (i.e., individual) are represented through textual features obtained from a convolutional neural network. As utterances come in a multimodal setting, audio and visual features are also extracted using openSMILE and 3D-CNN, respectively. The network is trained at the utterance level with the target emotion labels.
The proposed model (called DialogueRNN), illustrated in the figure above, determines the final emotion of the utterance through the following factors:
- Party state: models each party's emotion dynamics throughout the conversation. The basic idea behind the party state is to ensure that the model is aware of the speaker of each utterance in the conversation.
- Global state: models the context of an utterance in the dialogue by jointly encoding the preceding utterances and the party state. Note that an attention mechanism is applied to the global states to provide an improved context representation. This state also serves as the speaker-specific utterance representation.
- Emotion representation: inferred from the speaker's party state, using the preceding emotion representation as context. This representation is used to perform the final emotion classification via a softmax layer.
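To make the last step concrete, here is a minimal sketch of how an emotion representation could be turned into a label with a softmax layer. Everything in it is an assumption for illustration: the label set `LABELS`, the single linear layer, and the names `classify`, `weights`, and `bias` are hypothetical and not taken from the paper.

```python
import math

# Hypothetical label set, for illustration only.
LABELS = ["happy", "sad", "neutral", "angry", "excited", "frustrated"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(emotion_repr, weights, bias):
    """Map an emotion representation to (label, probabilities).

    weights has one row per label; each row is as long as emotion_repr.
    A single linear layer is assumed here to keep the sketch short.
    """
    logits = [sum(w * x for w, x in zip(row, emotion_repr)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs


# Usage with made-up numbers: only the last label has non-zero weights.
weights = [[0.0, 0.0, 0.0]] * 5 + [[1.0, 1.0, 1.0]]
label, probs = classify([0.2, -0.1, 0.7], weights, [0.0] * 6)
```

In a trained model the weights would be learned jointly with the rest of the network; here they are hand-picked so the behavior is easy to follow.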
Each component of the architecture is modeled by a gated recurrent unit (GRU). It's important to note that during training, the speaker state is updated using the current utterance along with its context, which is simply the set of preceding global states combined through an attention mechanism. The attention mechanism assigns higher scores to the preceding utterances that are emotionally relevant to the current utterance.
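The attention step described here can be sketched in a few lines. This is a simplified illustration, not the paper's exact formulation: it scores each preceding global state with a plain dot product against the current utterance (a learned projection is omitted), and the function name `attend` is hypothetical. It also assumes the utterance vector and the global states have the same length.

```python
import math


def attend(utterance, global_states):
    """Return (weights, context) for the current utterance.

    utterance: list of floats (features of the current utterance).
    global_states: list of earlier global-state vectors, each the same
    length as utterance (an assumption made for this sketch).
    """
    if not global_states:
        # First utterance of the dialogue: there is no history to attend over.
        return [], [0.0] * len(utterance)
    # Relevance score of each earlier state (plain dot product).
    scores = [sum(u * g for u, g in zip(utterance, state))
              for state in global_states]
    # Softmax turns the scores into attention weights that sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The context is the attention-weighted sum of the earlier states.
    context = [sum(w * state[i] for w, state in zip(weights, global_states))
               for i in range(len(utterance))]
    return weights, context


# Usage: the first earlier state is more similar to the utterance,
# so it receives the larger weight.
weights, context = attend([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```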
Overall, the speaker update encodes, via the Party GRU (shown in blue), the information on the current utterance along with its context from the Global GRU (shown in green). All this information is important for the final emotion classification, which is performed by the Emotion GRU (shown in maroon). Note that the current emotion classification also relies on the previous emotion-relevant information.
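Putting the pieces together, the flow through the three GRUs can be sketched as a toy forward pass. This is only a rough sketch under several assumptions: the weights are random and untrained, biases are omitted, attention uses plain dot-product scores, the utterance vectors and all states share one dimension `dim`, and the names `GRUCell`, `attend_context`, and `run_dialogue` are hypothetical.

```python
import math
import random


def _matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class GRUCell:
    """Minimal GRU cell with small random weights (biases omitted)."""

    def __init__(self, input_size, hidden_size, rng):
        def mat(cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                    for _ in range(hidden_size)]
        # One input matrix and one recurrent matrix per gate.
        self.wz, self.wr, self.wh = mat(input_size), mat(input_size), mat(input_size)
        self.uz, self.ur, self.uh = mat(hidden_size), mat(hidden_size), mat(hidden_size)

    def __call__(self, x, h):
        z = [_sigmoid(a + b) for a, b in zip(_matvec(self.wz, x), _matvec(self.uz, h))]
        r = [_sigmoid(a + b) for a, b in zip(_matvec(self.wr, x), _matvec(self.ur, h))]
        rh = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(a + b) for a, b in zip(_matvec(self.wh, x), _matvec(self.uh, rh))]
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def attend_context(query, states):
    """Softmax-weighted sum of earlier global states (dot-product scores)."""
    if not states:
        return [0.0] * len(query)
    scores = [sum(q * s for q, s in zip(query, st)) for st in states]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e / total * st[i] for e, st in zip(exps, states))
            for i in range(len(query))]


def run_dialogue(utterances, speakers, dim, seed=0):
    """Return one emotion representation per utterance."""
    rng = random.Random(seed)
    global_gru = GRUCell(2 * dim, dim, rng)   # input: utterance + speaker's party state
    party_gru = GRUCell(2 * dim, dim, rng)    # input: utterance + attended context
    emotion_gru = GRUCell(dim, dim, rng)      # input: updated party state
    zeros = [0.0] * dim
    party = {s: zeros for s in set(speakers)}
    global_history, g, e, outputs = [], zeros, zeros, []
    for u, s in zip(utterances, speakers):
        context = attend_context(u, global_history)   # attends over earlier global states only
        g = global_gru(u + party[s], g)               # uses the speaker's state before its update
        party[s] = party_gru(u + context, party[s])   # only the current speaker's state changes
        e = emotion_gru(party[s], e)                  # carries over the previous emotion state
        global_history.append(g)
        outputs.append(e)
    return outputs


# Usage: a three-turn dialogue between two speakers with made-up features.
reps = run_dialogue(
    [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4], [0.5, 0.5, 0.5]],
    ["A", "B", "A"],
    dim=3,
)
```

Each entry of `reps` would then be passed to the classifier. The point of the sketch is the ordering of the updates, not the numbers it produces.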
Several variants of the DialogueRNN model are proposed and compared in this study:
- DialogueRNN_l: considers an extra listener state (defined at the end of this post) while a speaker utters.
- BiDialogueRNN: a bidirectional RNN architecture is used instead.
- DialogueRNN+Att: attention is applied over all surrounding emotion representations.
- BiDialogueRNN+Att: similar to the previous model but uses a bidirectional RNN instead.
Other baselines are also compared, which you can find in the paper.
From the table below, we can observe that DialogueRNN (highlighted in green) outperforms all baselines and the state-of-the-art model (CMN) on both datasets. Note that these results use only the text modality.
We can also observe in the table above that the listener component (model highlighted in orange) doesn't improve the model's performance. The other variants were found to perform well, especially BiDialogueRNN+Att, which in general produced the best results.
As shown in the table below, the proposed model, DialogueRNN, also significantly outperforms other models in the multimodal setting (using a fusion of modalities).
As a case study, we can observe from the attention figure below that DialogueRNN correctly anticipates the emotion of frustration (labeled Turn 44) using the preceding context (turns 41 and 42). For the CMN model, this was found not to be the case.
An important ablation study was conducted to observe the importance of the Emotion GRU and party state components. We can see from the table below that the absence of the party state decreases performance. In fact, the party state seems to be more important than the Emotion GRU.