Developing Neural Machine Translation (NMT) models for low-resource languages is an active topic, both in industry and academia. In this tutorial, we are going to discuss tagged back-translation as one of the most effective and efficient approaches to training more robust models. Tagged back-translation is useful not only for low-resource languages, but also for other data-sparsity scenarios.
Table of Contents:
- Tagged Back-Translation
- Lower-Casing vs. True-Casing
- Sub-wording to Avoid Unknowns
- Shared Vocab vs. Separate Vocab
- Crawled Data
- Transfer Learning
- References
Tagged Back-Translation
This approach aims at augmenting the available parallel training data with synthetic data that represents the domain and purpose of the model. Several researchers, including Sennrich et al. (2016), Edunov et al. (2018), and Caswell et al. (2019), have shown that back-translation, and its tagged variant in particular, is very helpful when training NMT models for low-resource languages. Moreover, it can be helpful for rich-resource languages by enriching datasets with specific linguistic features or domains.
Assuming we want to train an English-to-Hindi NMT model, the tagged back-translation data augmentation technique involves the following steps:
- For an English-to-Hindi model, train another Hindi-to-English model (i.e. in the other direction), using publicly available data from OPUS;
- Select publicly available monolingual data in Hindi (e.g. from OSCAR), whose domains and linguistic features should be similar to the texts the final model is expected to translate;
- Use the Hindi-to-English model to create a synthetic dataset, by translating the Hindi monolingual data into English. Note here that only the English side (the source for EN-HI) is MTed while the Hindi side (the target for EN-HI) is human-generated text;
- Consider using one of the available Quality Estimation tools such as TransQuest (Ranasinghe et al., 2020) or OpenKiwi (Kepler et al., 2019) to filter out back-translations of low quality;
- Add a special tag like <BT> to the start of the MTed (English) segments, as shown in the sketch after this list;
- Build the vocabulary on all the data, both the original and the synthetic datasets;
- Augment the original English-to-Hindi training dataset with the synthetic dataset;
- Train a new English-to-Hindi model using the dataset generated from the previous step.
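To make the tagging step concrete, here is a minimal Python sketch that prepends the <BT> tag to the MTed English segments and appends the synthetic pairs to the authentic training files. The file names (synthetic.en, synthetic.hi, train.en, train.hi) are hypothetical placeholders for your own data.

```python
# Tag back-translated source segments and merge them with the authentic data.
# File names are placeholders; adjust them to your own dataset paths.

BT_TAG = "<BT>"

def tag_backtranslations(mt_source_path, tagged_path, tag=BT_TAG):
    """Prepend the back-translation tag to every MTed source segment."""
    with open(mt_source_path, encoding="utf-8") as src, \
         open(tagged_path, "w", encoding="utf-8") as out:
        for line in src:
            out.write(f"{tag} {line.strip()}\n")

def append_file(extra_path, base_path):
    """Append the synthetic segments to the authentic training file."""
    with open(extra_path, encoding="utf-8") as extra, \
         open(base_path, "a", encoding="utf-8") as base:
        for line in extra:
            base.write(line)

if __name__ == "__main__":
    # synthetic.en = MTed English (source); synthetic.hi = authentic Hindi (target)
    tag_backtranslations("synthetic.en", "synthetic.tagged.en")
    append_file("synthetic.tagged.en", "train.en")
    append_file("synthetic.hi", "train.hi")
```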
For low-resource languages like Hindi, Haque et al. (2020) showed that the technique works well with a 1:1 ratio of synthetic to original data. Still, you can experiment with different ratios, especially for language pairs with richer resources.
As demonstrated by Hoang et al. (2018), iterative back-translation for 2-3 runs can improve the quality further. Now that you have a better English-to-Hindi model, use it to back-translate English monolingual data and train a new version of the Hindi-to-English model. After that, use this improved Hindi-to-English model to back-translate the same Hindi monolingual dataset you used for the first run, and train a new version of the English-to-Hindi model. The idea here is that you are using a better model to translate the same monolingual data, i.e. without any increase or change, which should result in a better NMT model. Interestingly, you can use both NMT and phrase-based SMT models for back-translation, and then train or fine-tune your baseline NMT system in the required language direction.
Popel et al. (2020) explored the effect of block back-translation, where the training data are presented to the neural network in blocks of authentic parallel data alternated with blocks of synthetic data (see also Gebauer et al., 2021).
Lower-Casing vs. True-Casing
For low-resource languages, I prefer lower-casing the data. However, in real-life scenarios, or if you are submitting a paper, you are usually required to produce the translation in the true case. So you can train a truecaser yourself, for example with sacremoses' truecaser for English.
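As a rough sketch, assuming the sacremoses package is installed and english.txt is a hypothetical monolingual English corpus, training and applying a Moses-style truecaser could look like this:

```python
# A minimal truecasing sketch with sacremoses (pip install sacremoses).
# "english.txt" is a hypothetical English corpus used to learn casing statistics.
from sacremoses import MosesTruecaser

mtr = MosesTruecaser()
mtr.train_from_file("english.txt", save_to="english.truecasemodel")

# Apply the truecaser to lower-cased MT output (after detokenization).
print(mtr.truecase("the adventures of sherlock holmes", return_str=True))
```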
Sub-wording to Avoid Unknowns
To avoid out-of-vocabulary tokens (unknowns), it is recommended to train your NMT model on subwords instead of whole words. Subwording (e.g. with BPE or the unigram model) is recommended for any type of machine translation model, regardless of whether it targets a low-resource or rich-resource language pair. Among the most popular subwording tools is SentencePiece.
If you used <BT> as the back-translation token, for example, you have to add it to the SentencePiece model using the option --user_defined_symbols during training. The same option can be useful for adding any other special tokens found in your training data, such as tags and non-Latin numbers.
Consider also using the following SentencePiece options (a training sketch follows this list):
- --input_sentence_size to determine the maximum number of sentences the trainer loads (for very large corpora, training on a sample is enough);
- --shuffle_input_sentence to shuffle the dataset;
- --split_by_number to split tokens by numbers (0-9); and
- --byte_fallback to decompose unknown pieces into UTF-8 byte pieces.
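Here is a rough training sketch using the SentencePiece Python API with these options; the file name train.all, the vocabulary size, and the other values are hypothetical and should be adjusted to your data.

```python
# Training a SentencePiece model with the options discussed above.
# "train.all" and the parameter values below are placeholders.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="train.all",                  # training text, one sentence per line
    model_prefix="spm_enhi",            # produces spm_enhi.model and spm_enhi.vocab
    vocab_size=16000,
    model_type="unigram",               # or "bpe"
    user_defined_symbols=["<BT>"],      # keep the back-translation tag as one token
    input_sentence_size=10000000,       # maximum number of sentences the trainer loads
    shuffle_input_sentence=True,        # shuffle the (sampled) input sentences
    split_by_number=True,               # split tokens by numbers (0-9)
    byte_fallback=True,                 # decompose unknown pieces into UTF-8 bytes
)

# Segment a sentence with the trained model.
sp = spm.SentencePieceProcessor(model_file="spm_enhi.model")
print(sp.encode("<BT> This is a back-translated sentence.", out_type=str))
```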
Shared Vocab vs. Separate Vocab
If both the source and target languages share some vocabulary, e.g. similar languages or frequent code-switching, using a shared vocabulary might help. Using a shared vocabulary involves two steps:
- Training a SentencePiece model on all datasets for both languages;
- Using shared vocab instead of separate vocabs while training the NMT model.
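As a minimal sketch, a shared SentencePiece model can be obtained by concatenating the source and target training files and training one model on the result; the file names here (train.en, train.hi, train.both) are hypothetical:

```python
# Building a shared vocabulary: train one SentencePiece model on both languages.
# "train.en", "train.hi" and "train.both" are placeholders for your training files.
import sentencepiece as spm

# Concatenate the English and Hindi training data into one file.
with open("train.both", "w", encoding="utf-8") as out:
    for path in ("train.en", "train.hi"):
        with open(path, encoding="utf-8") as f:
            for line in f:
                out.write(line)

# Train a single model used for both source and target.
spm.SentencePieceTrainer.train(
    input="train.both",
    model_prefix="spm_shared",
    vocab_size=32000,
    user_defined_symbols=["<BT>"],
)
# Configure your NMT toolkit to use spm_shared.model and a shared vocab on both sides.
```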
Crawled Data
Currently, OPUS includes some datasets that were crawled from bilingual websites, with the sentences matched using multilingual similarity tools such as LASER, LaBSE, and m-USE. However, according to Kreutzer et al. (2022), crawled datasets suffer from quality issues that can affect the quality of the resulting NMT models. Hence, it is important to try to filter them before use, for example with tools like Bifixer and Bicleaner (Ramírez-Sánchez et al., 2020), and maybe to exclude them from initial baselines.
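As an illustration of similarity-based filtering (not the exact pipeline used by OPUS), the following sketch scores sentence pairs with LaBSE embeddings via the sentence-transformers library and keeps only pairs above a hypothetical threshold:

```python
# Filtering crawled parallel data by LaBSE cosine similarity.
# The threshold (0.75) and the input lists are illustrative values only.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")

english = ["The weather is nice today.", "Completely unrelated sentence."]
hindi = ["आज मौसम अच्छा है।", "मुझे चाय पसंद है।"]

emb_en = model.encode(english, convert_to_tensor=True, normalize_embeddings=True)
emb_hi = model.encode(hindi, convert_to_tensor=True, normalize_embeddings=True)

# Cosine similarity between each aligned pair.
scores = util.cos_sim(emb_en, emb_hi).diagonal().tolist()

filtered = [(en, hi) for en, hi, s in zip(english, hindi, scores) if s >= 0.75]
print(filtered)
```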
Transfer Learning
Instead of training a model from scratch, transfer learning can be applied. In this sense, you can use a multilingual model like mBART-50, M2M-100, or NLLB-200, and fine-tune it on your dataset. Moreover, unidirectional pre-trained models can be used (e.g. the OPUS-MT models). If your low-resource language is similar to languages supported by such models, it can benefit from shared linguistic features. Back-translation can be used here as well to augment the authentic dataset.
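As a rough sketch, assuming the Hugging Face transformers library and the publicly released facebook/nllb-200-distilled-600M checkpoint, translating English into Hindi with NLLB-200 (e.g. before or after fine-tuning) could look like this:

```python
# Translating with a pre-trained NLLB-200 checkpoint via Hugging Face transformers.
# The checkpoint name and the language codes follow the NLLB/FLORES-200 conventions.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer("The weather is nice today.", return_tensors="pt")
outputs = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("hin_Deva"),  # target language
    max_new_tokens=64,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```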
References
- Caswell, I., Chelba, C., & Grangier, D. (2019). Tagged Back-Translation. Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers), 53–63. https://doi.org/10.18653/v1/W19-5206
- Edunov, S., Ott, M., Auli, M., & Grangier, D. (2018). Understanding Back-Translation at Scale. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 489–500. https://doi.org/10.18653/v1/D18-1045
- Gebauer, P., Bojar, O., Švandelík, V., & Popel, M. (2021). CUNI Systems in WMT21: Revisiting Backtranslation Techniques for English-Czech NMT. Proceedings of the Sixth Conference on Machine Translation, 123–129. https://aclanthology.org/2021.wmt-1.7
- Haque, R., Moslem, Y., & Way, A. (2020). Terminology-Aware Sentence Mining for NMT Domain Adaptation: ADAPT’s Submission to the Adap-MT 2020 English-to-Hindi AI Translation Shared Task. Proceedings of the 17th International Conference on Natural Language Processing (ICON): Adap-MT 2020 Shared Task, 17–23. https://aclanthology.org/2020.icon-adapmt.4
- Hoang, V. C. D., Koehn, P., Haffari, G., & Cohn, T. (2018). Iterative Back-Translation for Neural Machine Translation. Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, 18–24. https://doi.org/10.18653/v1/W18-2703
- Kreutzer, J., Caswell, I., Wang, L., Wahab, A., van Esch, D., Ulzii-Orshikh, N., Tapo, A., Subramani, N., Sokolov, A., Sikasote, C., Setyawan, M., Sarin, S., Samb, S., Sagot, B., Rivera, C., Rios, A., Papadimitriou, I., Osei, S., Suarez, P. O., … Adeyemi, M. (2022). Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets. Transactions of the Association for Computational Linguistics, 10, 50–72. https://doi.org/10.1162/tacl_a_00447
- Popel, M., Tomkova, M., Tomek, J., Kaiser, Ł., Uszkoreit, J., Bojar, O., & Žabokrtský, Z. (2020). Transforming machine translation: a deep learning system reaches news translation quality comparable to human professionals. Nature Communications, 11(1), 4381. https://doi.org/10.1038/s41467-020-18073-9
- Ramírez-Sánchez, G., Zaragoza-Bernabeu, J., Bañón, M., & Rojas, S. O. (2020). Bifixer and Bicleaner: two open-source tools to clean your parallel data. Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, 291–298. https://aclanthology.org/2020.eamt-1.31/
- Sennrich, R., Haddow, B., & Birch, A. (2016). Improving Neural Machine Translation Models with Monolingual Data. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 86–96. https://doi.org/10.18653/v1/P16-1009