Yasmin Moslem

NLP Researcher

Notes on Multilingual Machine Translation

04 Dec 2021 » nmt

A key feature of multilingual NMT (MNMT) is its scalability: a single model can translate between any number of languages, instead of having to build individual models for each language pair. MNMT systems are also desirable because training on data from diverse language pairs can help a low-resource language acquire extra knowledge from other languages. Moreover, MNMT systems tend to generalize better due to exposure to diverse languages, leading to improved translation quality compared to bilingual NMT systems. This phenomenon is known as translation Transfer Learning or Knowledge Transfer (Dabre et al., 2020).

Tips for training multilingual NMT models

Building a many-to-one MT system that translates from several languages into one language is simple: just merge all the datasets and shuffle the result. The illustration below shows the more general many-to-many case, where each source sentence is prefixed with a token indicating the desired target language; a minimal script that produces such a merged file is sketched after the table.


Source Target
<ar> Thank you very much شكرا جزيلا
<es> Thank you very much Muchas gracias
<fr> Thank you very much Merci beaucoup
<hi> Thank you very much आपका बहुत बहुत धन्यवाद
<ar> आपका बहुत बहुत धन्यवाद شكرا جزيلا
<en> आपका बहुत बहुत धन्यवाद Thank you very much
<es> आपका बहुत बहुत धन्यवाद Muchas gracias
<fr> आपका बहुत बहुत धन्यवाद Merci beaucoup
<ar> Muchas gracias شكرا جزيلا
<en> Muchas gracias Thank you very much
<fr> Muchas gracias Merci beaucoup
<hi> Muchas gracias आपका बहुत बहुत धन्यवाद
<en> شكرا جزيلا Thank you very much
<es> شكرا جزيلا Muchas gracias
<fr> شكرا جزيلا Merci beaucoup
<hi> شكرا جزيلا आपका बहुत बहुत धन्यवाद
<ar> Merci beaucoup شكرا جزيلا
<en> Merci beaucoup Thank you very much
<es> Merci beaucoup Muchas gracias
<hi> Merci beaucoup आपका बहुत बहुत धन्यवाद
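
For concreteness, below is a minimal Python sketch of how such a merged training file could be produced. The corpus file names and the set of target-language tokens are hypothetical placeholders; the idea is simply to prepend the token of the desired target language to each source sentence, concatenate all language pairs, and shuffle.

```python
import random

# Hypothetical parallel corpora: (source_file, target_file, target_language_token)
corpora = [
    ("train.en-es.en", "train.en-es.es", "<es>"),
    ("train.en-fr.en", "train.en-fr.fr", "<fr>"),
    ("train.fr-en.fr", "train.fr-en.en", "<en>"),
]

rows = []
for src_path, tgt_path, tgt_token in corpora:
    with open(src_path, encoding="utf-8") as src_f, open(tgt_path, encoding="utf-8") as tgt_f:
        for src_line, tgt_line in zip(src_f, tgt_f):
            # Prepend the token of the desired target language to the source sentence
            rows.append((f"{tgt_token} {src_line.strip()}", tgt_line.strip()))

# Shuffle so that the language pairs are mixed during training
random.shuffle(rows)

with open("train.src", "w", encoding="utf-8") as out_src, open("train.tgt", "w", encoding="utf-8") as out_tgt:
    for src_line, tgt_line in rows:
        out_src.write(src_line + "\n")
        out_tgt.write(tgt_line + "\n")
```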

There are a few important points to take into consideration while building multilingual models:

  • If the data is clearly unbalanced, e.g. you have 75 million sentences for Spanish but only 15 million sentences for Portuguese, you have to balance it; otherwise, you would end up with a system that translates Spanish better than Portuguese. This technique is called over-sampling (or up-sampling). The common way to achieve it in NMT toolkits is to assign weights to your datasets: in this example, the Spanish dataset can take a weight of 1 while the Portuguese dataset takes a weight of 5, because your Spanish dataset is 5 times larger than your Portuguese dataset (see the over-sampling sketch after this list).
  • Some papers suggest adding a special token to the start of each sentence. For example, you can start Spanish sentences with the token <es> and Portuguese sentences with the token <pt>. In this case, you will have to add these tokens to your SentencePiece model through the option --user_defined_symbols (see the SentencePiece sketch after this list). However, some researchers believe this step is optional.
  • Multilingual NMT models are more useful for low-resource languages than they are for high-resource languages. Still, low-resource languages that share some linguistic characteristics with high-resource languages can benefit from coexisting in one multilingual model. In this sense, multilingual NMT can be considered one of the “Transfer Learning” approaches (Tras et al., 2021; Ding et al., 2021).
  • Languages that do not share the same alphabet cannot achieve the same linguistic benefits from a multilingual NMT model. Still, researchers investigate approaches like transliteration to increase knowledge transfer between languages that belong to the same language family but use different alphabets. For example, using this transliteration trick, my Indic-to-English multilingual NMT model can translate from 10 Indic languages into English.
  • Integrating other data augmentation approaches like Back-Translation can still be useful.
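
As a toolkit-agnostic illustration of the over-sampling point above, here is a minimal Python sketch that repeats the smaller (Portuguese) corpus so that both language pairs contribute roughly the same amount of data. The file names are hypothetical, and in practice most NMT toolkits let you achieve the same effect by assigning per-dataset weights in the training configuration instead of physically duplicating the data.

```python
# Over-sampling sketch: up-sample the smaller corpus (Portuguese) so that it is
# roughly balanced with the larger one (Spanish). File names are placeholders.

def read_parallel(src_path, tgt_path):
    with open(src_path, encoding="utf-8") as s, open(tgt_path, encoding="utf-8") as t:
        return list(zip(s.read().splitlines(), t.read().splitlines()))

es_pairs = read_parallel("train.es-en.es", "train.es-en.en")  # ~75M pairs
pt_pairs = read_parallel("train.pt-en.pt", "train.pt-en.en")  # ~15M pairs

# The weight is the size ratio between the largest corpus and this corpus
pt_weight = round(len(es_pairs) / len(pt_pairs))  # ~5 in this example

# Simple up-sampling by repetition, equivalent to giving the corpus a weight of 5
balanced = es_pairs + pt_pairs * pt_weight
```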
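
If you adopt the language-token approach, the tokens must be kept intact by your subword model. Below is a minimal sketch of training a SentencePiece model with the language tokens registered via user_defined_symbols; the input file name and vocabulary size are assumptions.

```python
import sentencepiece as spm

# Train a SentencePiece model; the language tokens are declared as
# user-defined symbols so they are never split into subwords.
spm.SentencePieceTrainer.train(
    input="train.src",            # hypothetical merged source file
    model_prefix="multilingual",
    vocab_size=32000,
    user_defined_symbols=["<en>", "<es>", "<fr>", "<ar>", "<hi>", "<pt>"],
)

sp = spm.SentencePieceProcessor(model_file="multilingual.model")
print(sp.encode("<es> Thank you very much", out_type=str))
# The token <es> is kept as a single piece.
```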

Using pre-trained NMT models

What about pre-trained multilingual NMT models like mBART (Liu et al., 2020) and M2M-100 (Fan et al., 2020); when should you use them? The simple answer is: for low-resource languages (from a few thousand to a few million sentences, up to roughly 15 million), using mBART directly or fine-tuning it can give better results. For high-resource languages, training a baseline model from scratch can outperform mBART. Then, applying mixed fine-tuning (Chu et al., 2017) on this new baseline using in-house data can achieve even better gains in terms of machine translation quality. Check this code snippet if you would like to try mBART. You can also convert the M2M-100 model to the CTranslate2 format for better efficiency, as explained here.
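
The post links its own snippet for mBART; purely as an illustration, here is a minimal sketch of translating with the publicly released mBART-50 many-to-many checkpoint through Hugging Face Transformers (the checkpoint name and language codes follow its model card). Fine-tuning this model on your own low-resource data would use the standard Transformers training workflow.

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

# Load the pre-trained multilingual mBART-50 many-to-many checkpoint
model_name = "facebook/mbart-large-50-many-to-many-mmt"
model = MBartForConditionalGeneration.from_pretrained(model_name)
tokenizer = MBart50TokenizerFast.from_pretrained(model_name)

# Translate Hindi -> English
tokenizer.src_lang = "hi_IN"
inputs = tokenizer("आपका बहुत बहुत धन्यवाद", return_tensors="pt")
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"],  # force English output
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```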

References: