Deep Learning for Classical Japanese Literature

Overview

The paper introduces three new benchmark datasets for Machine Learning, namely:
  • Kuzushiji-MNIST — A drop-in replacement for the MNIST dataset (28×28)
  • Kuzushiji-49 — A much larger but imbalanced dataset containing 48 Hiragana characters and 1 Hiragana iteration mark (28×28)
  • Kuzushiji-Kanji — An imbalanced dataset of 3,832 Kanji characters, including rare characters with very few samples (64×64)

Fig 2. An example of a Kuzushiji script

Due to the modernization of the Japanese language around the turn of the 20th century, the Kuzushiji script is no longer taught in the standard school curriculum. Even though Kuzushiji was in use for over 1,000 years, most Japanese people today cannot read books written just 150 years ago!

Fig 3. Difference between text printed in 1772 and 1900

There are over 1.7 million books registered in the General Catalog of National Books, over 3 million unregistered books, and over 1 billion historical documents. Although some of these have been digitized, only people who know Kuzushiji can read them.

This paper introduces datasets made specifically for ML research, and introduces the ML community to the world of Japanese literature.

The paper releases benchmarks for the Kuzushiji-MNIST and Kuzushiji-49 datasets using recent models. Kuzushiji-MNIST can be used as a drop-in replacement for the standard MNIST dataset.

The paper also applies generative modelling to a domain transfer task between unseen Kuzushiji-Kanji and Modern Kanji.

Fig 4. Domain Transfer Experiment between Kuzushiji Kanji and Modern Kanji

Datasets

The Kuzushiji dataset was created by the National Institute of Japanese Literature (NIJL) and is curated by the Center for Open Data in the Humanities (CODH).
The full Kuzushiji dataset was released in November 2016, and it currently contains 3,999 character types and 403,242 characters.

The authors of this paper pre-processed characters scanned from 35 classical books printed in the 18th century and divided them into three datasets:
  • Kuzushiji-MNIST — A drop-in replacement for the MNIST dataset (28×28)
  • Kuzushiji-49 — A much larger but imbalanced dataset containing 48 Hiragana characters and 1 Hiragana iteration mark (28×28)
  • Kuzushiji-Kanji — An imbalanced dataset of 3,832 Kanji characters, including rare characters with very few samples (64×64)
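The public CODH releases distribute each split as a NumPy `.npz` archive holding a single array (the key name and file layout below are assumptions based on the public download, not details from the paper). A minimal loading sketch, demonstrated on a synthetic stand-in file so it is self-contained:

```python
import os
import tempfile
import numpy as np

def load_split(img_path, label_path):
    """Load one dataset split from the assumed .npz layout."""
    imgs = np.load(img_path)["arr_0"]      # e.g. (N, 28, 28) uint8 for K-MNIST / K-49
    labels = np.load(label_path)["arr_0"]  # (N,) uint8 class labels
    return imgs, labels

# Demo with tiny synthetic arrays standing in for the real download.
tmp = tempfile.mkdtemp()
np.savez(os.path.join(tmp, "imgs.npz"), np.zeros((10, 28, 28), dtype=np.uint8))
np.savez(os.path.join(tmp, "labels.npz"), np.arange(10, dtype=np.uint8))
imgs, labels = load_split(os.path.join(tmp, "imgs.npz"),
                          os.path.join(tmp, "labels.npz"))
print(imgs.shape, labels.shape)  # (10, 28, 28) (10,)
```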

Fig 5. 10 Classes of Kuzushiji-MNIST

One characteristic of Classical Japanese which is very different from Modern Japanese is Hentaigana (変体仮名).

Hentaigana are Hiragana characters which have more than one form of writing as they were derived from different Kanji.

Therefore, one Hiragana class of Kuzushiji-MNIST and Kuzushiji-49 may have many characters mapped to it (as seen in the above image). This makes the Kuzushiji dataset more challenging than the MNIST dataset.

Fig 6. Few examples from Kuzushiji-Kanji

The high class imbalance in Kuzushiji-49 and Kuzushiji-Kanji reflects how frequently each character appears in the source books, and is kept that way to represent the real data distribution.

  • Kuzushiji-49 — has 49 classes with a total of 266,407 images (28×28)
  • Kuzushiji-Kanji — has 3,832 classes with a total of 140,426 images, ranging from 1,766 examples down to a single example per class (64×64)
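When training on such an imbalanced dataset, one common countermeasure (not prescribed by the paper, just a standard technique) is to weight the loss inversely to class frequency. A minimal sketch with a hypothetical label array:

```python
import numpy as np

# Hypothetical labels for a small imbalanced dataset: class 0 dominates.
labels = np.array([0, 0, 0, 0, 1, 1, 2])
counts = np.bincount(labels)                     # per-class sample counts: [4, 2, 1]
weights = counts.sum() / (len(counts) * counts)  # inverse-frequency, mean-normalized
print(weights)  # rare classes get proportionally larger weights
```

Multiplying each example's loss by the weight of its class makes rare classes contribute as much to the gradient as frequent ones.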

Kuzushiji-MNIST is balanced.
Kuzushiji-Kanji is created for more experimental tasks rather than merely classification and recognition benchmarks.

Fig 7. Kuzushiji-49 Classes

Experiments

Classification Baselines for Kuzushiji-MNIST and Kuzushiji-49

Fig 8. Classification baselines for Kuzushiji-MNIST and Kuzushiji-49
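The simplest of the reported baselines is a k-nearest-neighbour classifier (with k=4). A self-contained numpy sketch of that baseline, run on easy synthetic clusters standing in for the flattened 28×28 images:

```python
import numpy as np

# Synthetic stand-in data: 10 well-separated class clusters in 784 dims.
rng = np.random.default_rng(0)
centers = rng.normal(size=(10, 784))
y_train = np.repeat(np.arange(10), 10)                    # 10 examples per class
X_train = centers[y_train] + 0.05 * rng.normal(size=(100, 784))
X_test = centers + 0.05 * rng.normal(size=(10, 784))
y_test = np.arange(10)

def knn_predict(x, k=4):
    dists = np.linalg.norm(X_train - x, axis=1)  # L2 distance to every training point
    nearest = y_train[np.argsort(dists)[:k]]     # labels of the k nearest neighbours
    return np.bincount(nearest).argmax()         # majority vote

preds = np.array([knn_predict(x) for x in X_test])
print((preds == y_test).mean())  # 1.0 on this easy synthetic data
```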

Domain Transfer from Kuzushiji-Kanji to Modern Kanji

The Kuzushiji-Kanji dataset is used for domain transfer from pixel images to vector images (as opposed to previous approaches, which focus on domain transfer from pixel images to pixel images).

The proposed model aims to generate the Modern Kanji version of a given Kuzushiji-Kanji input, in both pixel and stroke-based formats.

Fig 9. Kuzushiji Kanji to Modern Kanji

In the figure below, the overall approach is presented.

Fig 10. Kuzushiji-Kanji to Modern Kanji Approach

They train two separate Convolutional Variational Autoencoders: one on the Kuzushiji-Kanji dataset, and another on a pixel version of the KanjiVG dataset rendered at 64×64 pixels for consistency. The architecture of the VAE is identical to [3].
Both datasets are compressed into their own 64-dimensional latent spaces, z_old and z_new. The KL loss term is not optimized when it falls below a certain threshold.
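The KL-threshold trick can be sketched as clipping the KL term at a floor, so the optimizer stops shrinking it once the latent code carries enough information (the threshold value below is illustrative, not from the paper):

```python
import numpy as np

def kl_diag_gaussian(mu, logvar):
    """KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior, per example."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)

mu = np.zeros((1, 64))
logvar = np.zeros((1, 64))             # q(z|x) already equals the prior here
kl = kl_diag_gaussian(mu, logvar)      # -> 0.0
kl_floor = 0.5                         # assumed threshold
effective_kl = np.maximum(kl, kl_floor)  # constant (zero-gradient) below the floor
print(kl[0], effective_kl[0])  # 0.0 0.5
```

Because the clipped term is constant whenever the true KL is below the floor, it contributes no gradient there, leaving capacity in the latent space.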

Then, a Mixture Density Network (MDN) with two hidden layers is trained to model the density function P(z_new | z_old), approximated as a mixture of Gaussians.

We can then sample a latent vector z_new given a latent vector z_old encoded from Kuzushiji-Kanji.
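The sampling step can be sketched as follows: for a given z_old, the MDN head emits mixture weights, means, and scales over z_new; we pick a component and draw from its Gaussian. The shapes and random values below are illustrative stand-ins for real network outputs:

```python
import numpy as np

rng = np.random.default_rng(1)
K, D = 5, 64                                   # mixture components, latent dimension
logits = rng.normal(size=K)                    # unnormalized mixture weights
mu = rng.normal(size=(K, D))                   # component means
sigma = np.exp(0.1 * rng.normal(size=(K, D)))  # positive component scales

pi = np.exp(logits - logits.max())
pi /= pi.sum()                                 # softmax -> mixture weights
k = rng.choice(K, p=pi)                        # pick a component
z_new = mu[k] + sigma[k] * rng.normal(size=D)  # draw z_new | z_old from component k
print(z_new.shape)  # (64,)
```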

The paper reports that training two separate VAE models, one per dataset, is much more efficient and achieves better results than training a single model end-to-end.

In the last step, a sketch-RNN decoder model is trained to generate Modern Kanji based on z_new.

Fig 11. The algorithm for domain transfer from Kuzushiji-Kanji to KanjiVG

There are 3,600 overlapping characters between the two datasets.
– For characters outside the overlapping set, the sketch-RNN model is conditioned on z_new encoded from KanjiVG data to generate the stroke data, also from KanjiVG [see (1) in Fig 10].
– For characters inside the overlapping set, z_new is sampled from the MDN conditioned on z_old and used to generate the stroke data, also from KanjiVG [see (2) in Fig 10].
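The two branches amount to a simple selector for the vector that conditions the sketch-RNN decoder. A sketch with hypothetical stand-in functions (the real ones are the trained VAE encoder and MDN):

```python
def conditioning_vector(char, overlap_set, encode_kanjivg, sample_mdn, z_old=None):
    """Pick the latent vector that conditions the sketch-RNN decoder."""
    if char in overlap_set and z_old is not None:
        return sample_mdn(z_old)   # branch (2): z_new sampled from the MDN given z_old
    return encode_kanjivg(char)    # branch (1): z_new from the KanjiVG VAE encoder

# Demo with dummy stand-ins for the trained models.
overlap = {"a"}
pick = conditioning_vector("a", overlap, lambda c: "vae", lambda z: "mdn", z_old=0)
print(pick)  # mdn
```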

This helps the sketch-RNN fine-tune around aspects of Modern Kanji's data distribution that the VAE's latent space, trained only on pixels, may not capture well.

References

[1] Y. LeCun. The MNIST database of handwritten digits, 1998. http://yann.lecun.com/exdb/mnist/
[2] Center for Open Data in the Humanities. Kuzushiji dataset, 2016. http://codh.rois.ac.jp/char-shape/
[3] D. Ha and J. Schmidhuber. Recurrent World Models Facilitate Policy Evolution. arXiv preprint arXiv:1809.01999, 2018. https://worldmodels.github.io/

Read the original article at https://towardsdatascience.com/deep-learning-for-classical-japanese-literature-48ae04c17dfd