Microsoft has reported improved results for its Multi-Task Deep Neural Network (MT-DNN) ensemble model. The significant performance boost puts the model at the top of the GLUE benchmark rankings, as shown in the following table.
GLUE (General Language Understanding Evaluation) is a multi-task benchmark and analysis platform for natural language understanding. Most systems reach an accuracy of around 65 percent on the WNLI task, while the new MT-DNN-ensemble system has reached 89.0 percent, second only to human performance of 95.9 percent. WNLI is a reading comprehension task that uses sentences containing a pronoun and a list of possible referents to test a model’s ability to solve pronoun disambiguation problems. It is one of the harder tasks ranked on GLUE.
Even educated people can struggle on particulars in the WNLI test, and so the high accuracy of the MT-DNN-ensemble model came as a pleasant surprise. Microsoft researchers didn’t reveal many details on the multi-model integration of MT-DNN, which draws on the paper Multi-Task Deep Neural Networks for Natural Language Understanding.
A fundamental part of natural language understanding is language embedding learning: the process of mapping symbolic natural language text to semantic vector representations. This is what Multi-Task Deep Neural Network models attempt to do, namely learn universal language embeddings.
MT-DNN-ensemble is a multi-task learning method, which means all tasks share the same structure although the objective function of each task is different. It is also a combination of multi-task learning and language model pre-training. The MT-DNN model’s architecture is shown below.
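The idea of a shared structure with a different objective per task can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes two task types, classification scored with cross-entropy and regression scored with squared error, and the task names and the `TASK_LOSSES` registry are hypothetical. The article does not say which loss functions MT-DNN actually uses.

```python
import math


def cross_entropy(logits, label):
    """Classification objective: negative log-probability of the correct class."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def squared_error(prediction, target):
    """Regression objective: squared difference between prediction and target."""
    return (prediction - target) ** 2


# Hypothetical registry (an assumption for illustration): each task type
# gets its own objective, while the model underneath would be shared.
TASK_LOSSES = {
    "classification": cross_entropy,
    "regression": squared_error,
}


def task_loss(task, output, target):
    """Look up the objective for `task` and apply it to the model output."""
    return TASK_LOSSES[task](output, target)
```

During multi-task training, a step on a classification batch would call `task_loss("classification", logits, label)`, while a step on a regression batch would call `task_loss("regression", score, target)`.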
Architecture of the MT-DNN model for representation learning
The lower layers are shared across all tasks. The input X is first represented as a sequence of embedding vectors, one for each token, in l_1; the transformer-based encoder then generates shared contextual embedding vectors in l_2. Finally, the top layers are task-specific and thus suited to learning the features that best fit a specific task. Similar to the BERT model, MT-DNN is trained in two phases: pre-training and fine-tuning. Unlike BERT, MT-DNN uses multi-task learning (MTL) during the fine-tuning phase and has multiple task-specific layers in its model architecture. The experimental results are shown below.
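The layering described above (shared l_1 and l_2, task-specific layers on top) can be made concrete with a toy sketch. Everything here is an assumption for illustration: the class names, the random weights, and the task names are hypothetical, and the "contextual" step simply mixes each token vector with the sequence mean, a stand-in for a real transformer encoder. It shows only how the pieces connect, not how MT-DNN is implemented.

```python
import math
import random


class SharedEncoder:
    """Stand-in for the shared lower layers (l_1 embeddings, l_2 contextual encoder)."""

    def __init__(self, vocab, dim, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        self.table = {tok: [rng.uniform(-1, 1) for _ in range(dim)] for tok in vocab}

    def embed(self, tokens):
        # l_1: one embedding vector per input token
        return [self.table[tok] for tok in tokens]

    def contextualize(self, vectors):
        # l_2 (toy assumption): mix each token with the sequence mean,
        # where a real model would apply transformer layers
        n = len(vectors)
        mean = [sum(v[i] for v in vectors) / n for i in range(self.dim)]
        return [[math.tanh(x + m) for x, m in zip(v, mean)] for v in vectors]

    def __call__(self, tokens):
        return self.contextualize(self.embed(tokens))


class TaskHead:
    """Task-specific top layer: a single linear map over the first token's vector."""

    def __init__(self, dim, n_outputs, seed):
        rng = random.Random(seed)
        self.weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_outputs)]

    def __call__(self, contextual):
        pooled = contextual[0]  # first token serves as the sequence summary
        return [sum(w * x for w, x in zip(row, pooled)) for row in self.weights]


class MultiTaskModel:
    """One shared encoder, one head per task."""

    def __init__(self, vocab, dim, task_outputs):
        self.encoder = SharedEncoder(vocab, dim)
        self.heads = {
            name: TaskHead(dim, n, seed=i + 1)
            for i, (name, n) in enumerate(task_outputs.items())
        }

    def forward(self, task, tokens):
        return self.heads[task](self.encoder(tokens))


# Hypothetical usage: two tasks with different output sizes share one encoder.
tokens = ["[CLS]", "the", "trophy", "fits"]
model = MultiTaskModel(vocab=tokens, dim=4, task_outputs={"entailment": 3, "similarity": 1})
entailment_scores = model.forward("entailment", tokens)  # three class scores
similarity_score = model.forward("similarity", tokens)   # one regression score
```

Both calls to `forward` pass through the same `SharedEncoder` instance, so any update to its parameters would affect every task, while each `TaskHead` is touched only by its own task.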
GLUE test set results
Only about two months earlier, the MT-DNN model had achieved results similar to those of the BERT model on WNLI, with an accuracy of 65.1 percent. The improvement to 89 percent accuracy has many in the ML community waiting anxiously for Microsoft to reveal the secret of its success.
The associated paper Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding is on arXiv.