Developing Neural Machine Translation (NMT) models for low-resource languages is a hot topic in both industry and academia. In this tutorial, we discuss tagged back-translation, one of the most effective and efficient approaches to training more robust models. Tagged back-translation is useful not only for low-resource languages but also for other scenarios of data sparsity.
- Tagged Back-Translation
- Lower-Casing vs. True-Casing
- Sub-wording to Avoid Unknowns
- Shared Vocab vs. Separate Vocab
- Crawled Data
Tagged Back-Translation
This approach augments the available parallel training data with synthetic data that reflects the domain and style of the texts the model is meant to translate. Several researchers, including Edunov et al. (2018) and Caswell et al. (2019), have shown that tagged back-translation is very helpful when training NMT models for low-resource languages. Moreover, it can help rich-resource languages by enriching datasets with specific linguistic features.
Assuming we want to train an English-to-Hindi NMT model, the tagged back-translation data augmentation technique involves the following steps:
- Train a Hindi-to-English model (i.e. in the opposite direction of the English-to-Hindi model you want), using publicly available data from OPUS;
- Select publicly available Hindi monolingual data (e.g. from OSCAR) whose domains and linguistic features are similar to the texts you expect to translate;
- Use the Hindi-to-English model to create a synthetic dataset, by translating the Hindi monolingual data into English. Note here that only the English side (the source for EN-HI) is MTed while the Hindi side (the target for EN-HI) is human-generated text;
- Consider using one of the available Quality Estimation tools, such as TransQuest (Ranasinghe et al., 2020) or OpenKiwi (Kepler et al., 2019), to filter out low-quality back-translations;
- Add a special tag like <BT> to the start of the MTed segments;
- Build the vocabulary on all the data, both the original and the synthetic datasets;
- Augment the original English-to-Hindi training dataset with the synthetic dataset;
- Train a new English-to-Hindi model using the dataset generated from the previous step.
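The filtering, tagging, and augmentation steps above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a definitive implementation: the function names and the threshold are hypothetical, and the quality scores are assumed to have been produced separately by whichever Quality Estimation tool you use.

```python
from typing import Iterable, List, Optional, Sequence, Tuple

Pair = Tuple[str, str]  # (English source, Hindi target)

BT_TAG = "<BT>"  # the tag used in the steps above


def filter_by_score(pairs: Sequence[Pair], scores: Sequence[float],
                    threshold: float) -> List[Pair]:
    """Keep only the pairs whose quality score reaches the threshold."""
    return [pair for pair, score in zip(pairs, scores) if score >= threshold]


def tag_synthetic(pairs: Iterable[Pair], tag: str = BT_TAG) -> List[Pair]:
    """Prefix the machine-translated (source) side of each pair with the tag."""
    return [(f"{tag} {src}", tgt) for src, tgt in pairs]


def augment(original: Sequence[Pair], synthetic: Sequence[Pair],
            scores: Optional[Sequence[float]] = None,
            threshold: float = 0.0) -> List[Pair]:
    """Return the original pairs followed by the filtered, tagged synthetic pairs."""
    if scores is not None:
        synthetic = filter_by_score(synthetic, scores, threshold)
    return list(original) + tag_synthetic(synthetic)
```

Only the source side receives the tag, because only that side is machine-translated; the Hindi target stays untouched.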
For low-resource languages like Hindi, Haque et al. (2020) showed that the technique works well with a 1:1 ratio of synthetic to original data. Still, you can experiment with different proportions, especially for language pairs with richer resources.
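To experiment with the synthetic-to-original ratio, you can subsample the synthetic data before merging. The helper below is a hypothetical sketch; its name, the `ratio` parameter, and the fixed seed are assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]


def sample_synthetic(synthetic: Sequence[Pair], n_original: int,
                     ratio: float = 1.0, seed: int = 0) -> List[Pair]:
    """Randomly pick about ratio * n_original synthetic pairs.

    If fewer synthetic pairs are available, all of them are returned
    (in random order).
    """
    target = min(len(synthetic), int(round(ratio * n_original)))
    return random.Random(seed).sample(list(synthetic), target)
```

With `ratio=1.0` this gives the 1:1 setting; larger values let you test whether more synthetic data still helps for your language pair.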
As demonstrated by Hoang et al. (2018), iterative back-translation for 2-3 runs can improve the quality further. Now that you have a better English-to-Hindi model, use it to back-translate English monolingual data and train a new version of the Hindi-to-English model. After that, use the new Hindi-to-English model to back-translate the same Hindi monolingual dataset you used for the first run, and train a new version of the English-to-Hindi model. The idea is that you are using a better model to translate the same monolingual data, i.e. without any increase or change, which should result in a better NMT model.
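The alternation between the two directions can be sketched as a loop. This is only a skeleton under assumptions: `train_fwd` and `train_bwd` are hypothetical callables standing in for your NMT toolkit (each takes training pairs and returns a translate function), and tagging the synthetic data in both directions is a choice made for this sketch.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Pair = Tuple[str, str]  # (model input, model output)
Translator = Callable[[Sequence[str]], List[str]]
Trainer = Callable[[Sequence[Pair]], Translator]


def iterative_back_translation(
    parallel: Sequence[Pair],   # authentic (source, target) pairs, e.g. (EN, HI)
    mono_src: Sequence[str],    # source-language monolingual data (EN)
    mono_tgt: Sequence[str],    # target-language monolingual data (HI)
    train_fwd: Trainer,         # trains source-to-target (EN-HI)
    train_bwd: Trainer,         # trains target-to-source (HI-EN)
    rounds: int = 2,
    tag: str = "<BT>",
) -> Tuple[Optional[Translator], Translator]:
    reversed_parallel = [(tgt, src) for src, tgt in parallel]
    # Initial backward model, trained on authentic data only.
    backward = train_bwd(reversed_parallel)
    forward = None
    for _ in range(rounds):
        # Back-translate the same target monolingual data with the current
        # backward model: synthetic source, authentic target.
        synth_src = backward(mono_tgt)
        fwd_data = list(parallel) + [
            (f"{tag} {s}", t) for s, t in zip(synth_src, mono_tgt)
        ]
        forward = train_fwd(fwd_data)
        # Use the improved forward model to build synthetic data for the
        # backward direction: synthetic target-language input, authentic output.
        synth_tgt = forward(mono_src)
        bwd_data = reversed_parallel + [
            (f"{tag} {t}", s) for t, s in zip(synth_tgt, mono_src)
        ]
        backward = train_bwd(bwd_data)
    return forward, backward
```

Each round reuses the same monolingual data; only the model that translates it changes.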
Popel et al. (2020) explored the effect of Block-Backtranslation, where the training data are presented to the neural network in blocks of authentic parallel data alternated with blocks of synthetic data.
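As a rough illustration of the block idea, the sketch below yields training blocks that alternate between authentic and synthetic pairs instead of mixing them at random. The function name and the `block_size` parameter are hypothetical, and the block sizes used in the cited work are not reproduced here.

```python
from typing import Iterator, List, Sequence, Tuple

Pair = Tuple[str, str]


def block_schedule(authentic: Sequence[Pair], synthetic: Sequence[Pair],
                   block_size: int) -> Iterator[List[Pair]]:
    """Yield blocks in the order authentic, synthetic, authentic, ...

    When one dataset runs out, the remaining blocks of the other follow.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    a_blocks = [list(authentic[i:i + block_size])
                for i in range(0, len(authentic), block_size)]
    s_blocks = [list(synthetic[i:i + block_size])
                for i in range(0, len(synthetic), block_size)]
    for i in range(max(len(a_blocks), len(s_blocks))):
        if i < len(a_blocks):
            yield a_blocks[i]
        if i < len(s_blocks):
            yield s_blocks[i]
```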
Lower-Casing vs. True-Casing
For low-resource languages, I prefer lower-casing the data. However, in real-life scenarios, or if you are submitting a paper, you are usually required to produce the translation in true case, so you can train a truecaser or use sacreMoses’ truecaser for English.
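To show what a truecaser does, here is a deliberately simple frequency-based sketch in plain Python. It is not sacreMoses' truecaser and is far less robust; the function names are hypothetical, and the input is assumed to be whitespace-tokenized.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable


def train_truecaser(sentences: Iterable[str]) -> Dict[str, str]:
    """Learn the most frequent surface form of each lower-cased token."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Skip the first token: its capital letter is positional, not lexical.
        for token in sentence.split()[1:]:
            counts[token.lower()][token] += 1
    return {low: forms.most_common(1)[0][0] for low, forms in counts.items()}


def truecase(sentence: str, model: Dict[str, str]) -> str:
    """Restore casing of a lower-cased sentence and capitalize its first token."""
    tokens = [model.get(tok.lower(), tok) for tok in sentence.split()]
    if tokens:
        tokens[0] = tokens[0][:1].upper() + tokens[0][1:]
    return " ".join(tokens)


model = train_truecaser([
    "We visited Paris in May .",
    "She loves Paris and London .",
])
print(truecase("he moved to paris .", model))  # He moved to Paris .
```

The workflow is then: lower-case the training data, translate, and run the truecaser on the lower-cased output.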
Sub-wording to Avoid Unknowns
To avoid out-of-vocabulary (OOV) words, it is recommended to train your NMT model on subwords instead of whole words. Subwording (e.g. with BPE or a unigram model) is recommended for any type of machine translation model, regardless of whether the language pair is low-resource or rich-resource. SentencePiece is among the most popular subwording tools.
If you used <BT>, for example, as the back-translation token, you have to add it to the SentencePiece model with the --user_defined_symbols option during training. The same option is useful for adding any other special tokens found in your training data, such as tags and non-Latin numerals.
Consider also using the following SentencePiece options:
- --input_sentence_size to cap the number of sentences the trainer loads from a large corpus. This number is independent of the vocabulary size, which is set separately with --vocab_size;
- --shuffle_input_sentence to shuffle the dataset, so that the loaded sentences are sampled from the whole corpus;
- --split_by_number to split tokens by numbers (0-9); and
- --byte_fallback to decompose unknown pieces into UTF-8 byte pieces.
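Put together, the options above map onto keyword arguments of the SentencePiece Python trainer. The sketch below only assembles the arguments and imports `sentencepiece` lazily; the helper names, file paths, vocabulary size, and sentence cap are placeholders chosen for illustration, not recommendations.

```python
from typing import Any, Dict, Sequence


def spm_train_args(corpus_path: str, model_prefix: str, vocab_size: int,
                   special_tokens: Sequence[str] = ("<BT>",),
                   input_sentence_size: int = 1_000_000) -> Dict[str, Any]:
    """Collect the SentencePiece training options discussed above."""
    return {
        "input": corpus_path,                    # one sentence per line
        "model_prefix": model_prefix,            # writes <prefix>.model and <prefix>.vocab
        "vocab_size": vocab_size,
        "model_type": "unigram",                 # or "bpe"
        "user_defined_symbols": list(special_tokens),
        "input_sentence_size": input_sentence_size,
        "shuffle_input_sentence": True,
        "split_by_number": True,
        "byte_fallback": True,
    }


def train_subword_model(**kwargs: Any) -> None:
    """Train the model; requires the third-party `sentencepiece` package."""
    import sentencepiece as spm
    spm.SentencePieceTrainer.train(**kwargs)


if __name__ == "__main__":
    args = spm_train_args("train.all.txt", "subword", 32000)  # hypothetical paths
    print(args)
    # train_subword_model(**args)  # uncomment once the corpus file exists
```

Because `<BT>` is registered as a user-defined symbol, it is kept as a single piece instead of being split into characters.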
Shared Vocab vs. Separate Vocab
If the source and target share some vocabulary, e.g. with similar languages or code-switching, using a shared vocabulary might help. Using a shared vocabulary involves two steps:
- Training a SentencePiece model on all datasets for both languages;
- Using shared vocab instead of separate vocabs while training the NMT model.
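One rough way to check how much vocabulary the two sides actually share is to measure token overlap on a sample of the data. This heuristic is an assumption of this sketch rather than part of the recipe above; the function names are hypothetical and tokenization is plain whitespace splitting.

```python
from typing import Iterable, Set


def vocab(sentences: Iterable[str]) -> Set[str]:
    """Set of lower-cased whitespace tokens."""
    return {tok for sentence in sentences for tok in sentence.lower().split()}


def vocab_overlap(src_sentences: Iterable[str],
                  tgt_sentences: Iterable[str]) -> float:
    """Jaccard overlap between the source and target vocabularies (0.0 to 1.0)."""
    src, tgt = vocab(src_sentences), vocab(tgt_sentences)
    union = src | tgt
    return len(src & tgt) / len(union) if union else 0.0
```

A value near zero, as you would expect for languages written in different scripts without code-switching, suggests there is little for a shared vocabulary to exploit.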
Crawled Data
Currently, OPUS includes some datasets that are crawled from bilingual websites, with sentences then matched using multilingual similarity tools such as LASER, LaBSE, and m-USE. However, according to Agrawal et al. (2021), crawled datasets suffer from quality issues that can affect the quality of the resulting NMT models. Hence, it is important to try filtering them before use, and perhaps to exclude them from initial baselines.
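Before reaching for a dedicated cleaning tool, a few rule-based checks already remove obvious noise. The sketch below is illustrative only: the function name and the thresholds are assumptions, and it does not replace tools such as Bicleaner.

```python
from typing import Iterable, Iterator, Tuple

Pair = Tuple[str, str]


def basic_filter(pairs: Iterable[Pair], max_ratio: float = 2.0,
                 min_tokens: int = 1, max_tokens: int = 200) -> Iterator[Pair]:
    """Drop empty, overlong, copied, badly length-mismatched, and duplicate pairs."""
    seen = set()
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        n_src, n_tgt = len(src.split()), len(tgt.split())
        # Length limits on both sides (also removes empty segments).
        if not (min_tokens <= n_src <= max_tokens
                and min_tokens <= n_tgt <= max_tokens):
            continue
        # Source copied to target usually means the segment was not translated.
        if src == tgt:
            continue
        # A large length ratio often signals misaligned sentences.
        if max(n_src, n_tgt) / max(1, min(n_src, n_tgt)) > max_ratio:
            continue
        # Exact duplicates.
        if (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        yield src, tgt
```

The length-ratio threshold is language-pair dependent, so it is worth inspecting what gets dropped on a sample before filtering the full dataset.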
References
- Tagged Back-Translation (Caswell et al., 2019)
- Understanding Back-Translation at Scale (Edunov et al., 2018)
- Terminology-Aware Sentence Mining for NMT Domain Adaptation (Haque et al., 2020)
- Iterative Back-Translation for Neural Machine Translation (Hoang et al., 2018)
- Transforming machine translation: a deep learning system reaches news translation quality comparable to human professionals (Popel et al., 2020)
- CUNI Systems in WMT21: Revisiting Backtranslation Techniques for English-Czech NMT (Gebauer et al., 2021)
- Bifixer and Bicleaner: two open-source tools to clean your parallel data (Ramírez-Sánchez et al., 2020)
- Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets (Agrawal et al., 2021)