Yasmin Moslem

Machine Translation Researcher

Domain Adaptation Techniques for Low-Resource Scenarios

18 Jan 2020 » nmt

Let’s imagine this scenario. You have a new Machine Translation project, and you feel excited. However, you have realized that your training corpus is too small. If you train on such a limited corpus, your machine translation model will likely be very poor, with many out-of-vocabulary words and perhaps unidiomatic translations.

So, what is the solution? Should you just give up? Fortunately, Domain Adaptation can be a good solution to this issue.

Do you have another corpus that is big enough? Does this big corpus share some characteristics with the small corpus, like the language pair and/or major subject?

In this case, you can use one of the Domain Adaptation techniques to make use of both the big generic corpus and the small specialized corpus. While the big generic corpus will help avoid out-of-vocabulary words and unidiomatic translations, the smaller specialized corpus will help enforce the terminology and vocabulary required for your current Machine Translation project.

Domain Adaptation Use Cases

  • Low-Resource Domains & Institutions
  • Low-Resource Languages

To give you a clearer idea about Machine Translation Domain Adaptation, let’s consider these two popular use cases:

In the first use case, we have Institution A and Institution B, or Major Subject A and Minor Subject B. Institution A and Institution B share much vocabulary; however, they have some different terminology (e.g. chairman vs. president; vice-president vs. deputy chairperson). You have a big corpus for Institution A and a very small corpus for Institution B; however, your Machine Translation project is for Institution B with the small corpus. Domain Adaptation can help you to use the small corpus of Institution B for adapting or specializing the NMT model that could be generated from training on the big corpus of Institution A (assuming there are no license restrictions). With Domain Adaptation, our final model will, hopefully, give the right terminology used at Institution B.

In the second use case, we have a language with very limited bilingual resources, so we do not have enough data to train a good Machine Translation model for this language. I am sure you can think of many low-resource languages all over the world. Sometimes, there are high-resource languages that are very similar to such low-resource languages, and share vocabulary and structure with them. Moreover, sometimes they are not independent languages, but rather dialects of the same language.

So the question is: can we use the rich resources of Language A to train a better Machine Translation model for Language B, which otherwise has low resources? This is possible through Domain Adaptation.


Give an example of two languages:

  • Language A: High resources
  • Language B: Low resources

Language A and Language B share vocabulary and structure (vocabulary overlaps).

So this is a quiz. In the comments area, please mention two languages: Language A and Language B. Language A has rich resources while Language B has only very limited resources. However, there is a condition: Language A and Language B must share some vocabulary, meaning that many words in Language A are the same as, or very similar to, words in Language B. Can you think of any example of Language A and Language B?

Domain Adaptation Approaches

  • Incremental Training / Re-training
  • Ensemble Decoding (of two models)
  • Combining Training Data
  • Data Weighting

There are several approaches to Domain Adaptation, and I am going to discuss four of them.

  • Incremental Training / Re-training: So you have a big pre-trained model trained on a big corpus, and you continue training it with the new data from the small corpus.
  • Ensemble Decoding (of two models): You have two models and you use both models during translation.
  • Combining Training Data: You merge the two corpora and train one model on the whole combined data.
  • Data Weighting: You give higher weights for specialized segments over generic segments.

Let’s see how to apply these techniques and the best practices.

Incremental Training / Re-training

First Step: Training the Base Model

  a. Preprocessing the base (generic, big) corpus
  b. Training the base model

Second Step: Retraining with the New Data

  a. Preprocessing the new (specialized) corpus
  b. Retraining the base model on the specialized corpus

Incremental Training means training a model on one corpus and then continuing to train the same model on a new corpus.

As part of my Machine Translation research, I managed to achieve successful results in retraining Neural Machine Translation models for the purpose of Domain Adaptation (see: Domain Adaptation Experiment)

Now you have two corpora. The first corpus is the base corpus; it is generic or less specialized, and it is usually big, like several million segments. The other corpus is specialized, and it might have fewer translated segments.

In my experiment, the outcome was very promising and the model learned to use the in-domain terminology.

There is an important matter to take into consideration when using this Incremental Training approach for Domain Adaptation. If you use only in-domain data in your retraining corpus, you may encounter a case of “catastrophic forgetting”, in which some sentences are translated badly (for example, with an unidiomatic structure or unknown words) by the retrained model while they are translated better by the base model. To avoid this issue, the retraining corpus should usually be a combination of in-domain and generic data. For example, if your original in-domain corpus includes one hundred thousand segments, you can add around fifty thousand generic segments.
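As a minimal sketch of this mixing step, the snippet below adds a random sample of generic segments to the in-domain data. The function name `build_retraining_corpus` and the `generic_ratio` parameter are hypothetical, chosen only for illustration; a ratio of 0.5 mirrors the "fifty thousand generic for one hundred thousand in-domain" example, and the best ratio for your data may well differ.

```python
import random

def build_retraining_corpus(in_domain, generic, generic_ratio=0.5, seed=0):
    """Return the in-domain segments plus a random sample of generic ones.

    generic_ratio is the number of generic segments added per in-domain
    segment (an illustrative assumption, not a tuned value).
    """
    rng = random.Random(seed)
    n_generic = min(len(generic), int(len(in_domain) * generic_ratio))
    mixed = list(in_domain) + rng.sample(list(generic), n_generic)
    rng.shuffle(mixed)  # avoid presenting all generic segments in one block
    return mixed

# Toy (source, target) pairs standing in for real corpora.
in_domain = [(f"in-src {i}", f"in-tgt {i}") for i in range(100)]
generic = [(f"gen-src {i}", f"gen-tgt {i}") for i in range(1000)]

mixed = build_retraining_corpus(in_domain, generic)
print(len(mixed))  # 150
```

Sampling with a fixed seed keeps the mixed corpus reproducible between runs, which makes it easier to compare retraining experiments.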

Another consideration is that you need to retrain on the new data for long enough to learn the new vocabulary. So you can see how many epochs or steps you used to train the base model and use a similar number to retrain on the new corpus.

Note also that depending on the NMT framework you are using, you may have the option to update vocabulary instead of re-initializing the whole network. For example, in OpenNMT-tf (the TensorFlow version of OpenNMT), there is a script that can be used to change the word vocabularies contained in a base model while keeping the learned weights of shared words, so that you can add in-domain terminology during retraining.

Ensemble Decoding (of two models)

One of the suggested methods of Domain Adaptation is to “ensemble” the baseline model trained on generic data and the continued model retrained on in-domain data. “Ensemble” simply means combining models during translation (not data during training). For more details about Ensemble Decoding, you may want to refer to a useful paper, Fast Domain Adaptation for Neural Machine Translation, by Markus Freitag and Yaser Al-Onaizan.

There are different techniques for Ensemble Decoding; here, I describe how it is used in the OpenNMT-py framework to give you an idea.

Ensemble Decoding is a method that allows using multiple models simultaneously, combining their prediction distributions by averaging. All models in the ensemble must share a target vocabulary.
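To make the averaging concrete, here is a toy sketch of a single decoding step. It is not OpenNMT-py code; the function `ensemble_average`, the three-word vocabulary, and the probabilities are all made up for illustration, and a real decoder would repeat this at every step of beam search.

```python
def ensemble_average(distributions):
    """Average several probability distributions over one shared target vocabulary."""
    vocab_size = len(distributions[0])
    if any(len(d) != vocab_size for d in distributions):
        raise ValueError("all models must share the same target vocabulary")
    n = len(distributions)
    return [sum(d[i] for d in distributions) / n for i in range(vocab_size)]

# Hypothetical next-word probabilities from the two models.
vocab = ["chairman", "president", "</s>"]
generic_probs = [0.6, 0.3, 0.1]    # base model prefers "chairman"
in_domain_probs = [0.2, 0.7, 0.1]  # retrained model prefers "president"

avg = ensemble_average([generic_probs, in_domain_probs])
best = vocab[max(range(len(vocab)), key=avg.__getitem__)]
print(best)  # president
```

The length check reflects the shared-vocabulary requirement: the i-th entry must refer to the same target word in every model, otherwise the average is meaningless.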

This means that although Ensemble Decoding is used during translation, you should observe some considerations during training. During the preprocessing step, you have to include the vocabulary of both the generic corpus and the in-domain corpus. Later, at training time, you first train the base generic model, and then continue training with your specialized data to create a new model. Finally, during translation, you can use the two models simultaneously with Ensemble Decoding. Note here that you do not train the two models independently; instead, your second model is incrementally trained from the last checkpoint of the first model.

As you can see, Ensemble Decoding can be helpful whenever you want to utilize multiple models at translation time; Domain Adaptation is only one such use case, with its own special process.

Combining Training Data

Combining your training data is another approach you can use for Domain Adaptation. So you combine both the big generic corpus and the small specialized corpus into only one corpus. Now, you can train your model with this new corpus.

If you are going to combine two relatively different datasets, then according to Prof. Andrew Ng (video), do not shuffle your combined dataset to generate the training, dev, and test sets; instead, he recommends that you divide your data as follows:

  1. Training Dataset: 100% of the big, generic dataset + most of the small specialized dataset.
  2. Dev (validation) Dataset: Portion of the small specialized dataset (e.g. 2500).
  3. Test Dataset: Portion of the small specialized dataset (e.g. 2500).

So now, you are concentrating on improving the performance of your model to act well on the Dev (Validation) Dataset, which includes the data you care about.
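The split described above could be sketched as follows. The function `split_corpora` and the toy sizes are assumptions for illustration; with real data you would use sizes like the 2500 segments mentioned in the list.

```python
import random

def split_corpora(generic, specialized, dev_size, test_size, seed=0):
    """Hold out dev and test sets from the specialized data only."""
    rng = random.Random(seed)
    shuffled = list(specialized)
    rng.shuffle(shuffled)  # shuffle only the specialized data before slicing
    dev = shuffled[:dev_size]
    test = shuffled[dev_size:dev_size + test_size]
    # Training data: all of the generic corpus + the rest of the specialized one.
    train = list(generic) + shuffled[dev_size + test_size:]
    return train, dev, test

generic = [f"generic-{i}" for i in range(1000)]
specialized = [f"special-{i}" for i in range(100)]

train, dev, test = split_corpora(generic, specialized, dev_size=10, test_size=10)
print(len(train), len(dev), len(test))  # 1080 10 10
```

Because dev and test are sliced from the same shuffled list, they cannot overlap with each other or with the specialized part of the training set.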

However, when you think about combining data for the sake of training Neural Machine Translation models, there is a problem! In Neural Machine Translation, we extract only the most frequent vocabulary, the most frequent words in the corpus (~ 50,000 is common). Now, as you have a big generic corpus and a small specialized one, you might end up with vocabulary from the big corpus only while the words you want to include from the small corpus will be missing because they are not frequent enough. Plus, the model would observe terminology choices from the bigger corpus because they are more frequent.
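The vocabulary problem is easy to reproduce on toy data. In this sketch, the function `top_k_vocab` and the tiny corpora are made up for illustration, and the cut-off of 5 stands in for the usual ~50,000.

```python
from collections import Counter

def top_k_vocab(segments, k):
    """Keep only the k most frequent whitespace-separated tokens."""
    counts = Counter(token for seg in segments for token in seg.split())
    return {token for token, _ in counts.most_common(k)}

generic = ["the chairman opened the meeting"] * 50 + ["the meeting was long"] * 50
specialized = ["the deputy chairperson opened the meeting"] * 2

vocab = top_k_vocab(generic + specialized, k=5)
specialized_tokens = {token for seg in specialized for token in seg.split()}
missing = specialized_tokens - vocab
print(sorted(missing))  # ['chairperson', 'deputy']
```

The specialized terms appear only twice, so they fall below the cut-off even though they are exactly the words the project needs.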

I can hear you now asking: Can I extract all the words in the corpus? Of course, you can; however, if your corpus is really huge, and your training parameters are memory intensive, you might get an out-of-memory error and not be able to continue training or even start it.

So what is the solution? What about increasing the specialized data? There is a suggested method: Data Augmentation.

Data Augmentation for Neural Machine Translation:

The purpose of Data Augmentation here is to increase the size of your limited specialized data. In my experiment, I used a statistical approach similar to what has been used in Statistical Machine Translation (e.g. Moses), as illustrated by Prof. Philipp Koehn in the chapter, Phrase-based Models, of his book “Statistical Machine Translation”.

First Step: Extract word alignment of the specialized corpus. You can use tools like fast_align, eflomal, or efmaral. You can use any of them as a word aligner which takes an input of parallel sentences, and produces outputs in the widely-used “Pharaoh format”.

neue modelle werden erprobt ||| new models are being tested
0-0 1-1 2-2 2-3 3-4
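Each `i-j` pair in the second line links source word `i` to target word `j` (both zero-based). As a small sketch, the aligner output above can be read like this; the helper name `parse_pharaoh` is hypothetical.

```python
def parse_pharaoh(alignment_line):
    """Turn 'i-j' pairs into (source_index, target_index) tuples."""
    return [tuple(int(x) for x in pair.split("-")) for pair in alignment_line.split()]

sentence_pair = "neue modelle werden erprobt ||| new models are being tested"
alignment_line = "0-0 1-1 2-2 2-3 3-4"

source, target = (side.split() for side in sentence_pair.split(" ||| "))
links = parse_pharaoh(alignment_line)

# Show which words are linked to each other.
aligned_words = [(source[i], target[j]) for i, j in links]
print(aligned_words)
# [('neue', 'new'), ('modelle', 'models'), ('werden', 'are'),
#  ('werden', 'being'), ('erprobt', 'tested')]
```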

Second Step: Generate n-gram phrases. Here, you can see an example:

neue — new
neue modelle — new models
neue modelle werden — new models are being
neue modelle werden erprobt — new models are being tested
modelle — models
modelle werden — models are being
modelle werden erprobt — models are being tested
werden — are being
werden erprobt — are being tested
erprobt — tested
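A simplified sketch of how such phrases can be generated from the alignment is shown below. It keeps a source span only if the target words it links to are not also linked to source words outside the span, which is the usual consistency check in phrase extraction. Unlike full SMT phrase extraction, it does not extend phrases over unaligned words; the function `extract_phrases` and its `max_len` limit are assumptions for illustration, not the exact code from my experiment.

```python
def extract_phrases(source, target, links, max_len=7):
    """Extract consistent phrase pairs from one word-aligned sentence pair."""
    phrases = []
    for start in range(len(source)):
        for end in range(start, min(start + max_len, len(source))):
            # Target positions linked to the source span [start, end].
            linked = [j for i, j in links if start <= i <= end]
            if not linked:
                continue
            low, high = min(linked), max(linked)
            # Reject the span if a target word inside [low, high] is also
            # linked to a source word outside [start, end].
            if any(low <= j <= high and not start <= i <= end for i, j in links):
                continue
            phrases.append((" ".join(source[start:end + 1]),
                            " ".join(target[low:high + 1])))
    return phrases

source = "neue modelle werden erprobt".split()
target = "new models are being tested".split()
links = [(0, 0), (1, 1), (2, 2), (2, 3), (3, 4)]

phrases = extract_phrases(source, target, links)
for src, tgt in phrases:
    print(f"{src} ||| {tgt}")
```

On this sentence pair, the sketch yields the same ten phrase pairs listed above, in the same order.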

As I mentioned, this approach is very similar to the method used in Statistical Machine Translation; however, I did not go further and calculate probabilities because: 1) this would take a lot of time and memory; and most importantly 2) there is no need for this step, as Neural Machine Translation has its own approach to calculating probabilities. So all we need is a simple filtering step.

Third Step: Remove exact duplicates. Apply any other filters as needed; for example, you can delete very long sentences or uncommon single words.
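A possible filtering step might look like the sketch below. The thresholds `max_tokens` and `min_single_word_count` are hypothetical defaults; in particular, "uncommon" is interpreted here as a single-word pair that was extracted fewer than a given number of times, which is only one way to define it.

```python
from collections import Counter

def filter_pairs(pairs, max_tokens=50, min_single_word_count=2):
    """Drop exact duplicates, overly long pairs, and rare single-word pairs."""
    counts = Counter(pairs)  # how often each pair was extracted
    seen = set()
    kept = []
    for src, tgt in pairs:
        if (src, tgt) in seen:
            continue  # exact duplicate
        seen.add((src, tgt))
        if len(src.split()) > max_tokens or len(tgt.split()) > max_tokens:
            continue  # too long
        if len(src.split()) == 1 and counts[(src, tgt)] < min_single_word_count:
            continue  # uncommon single word
        kept.append((src, tgt))
    return kept

pairs = [
    ("neue modelle", "new models"),
    ("neue modelle", "new models"),
    ("neue", "new"),
    ("neue", "new"),
    ("erprobt", "tested"),
]
filtered = filter_pairs(pairs)
print(filtered)  # [('neue modelle', 'new models'), ('neue', 'new')]
```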

Now, you can combine your increased specialized data with the generic data, and start preprocessing and training your model.

Note here that we have two datasets: one uses this n-gram phrase splitting and one does not. In my experiment, when I trained my model on a dataset where I had applied this method to all of its segments, I got better translations for some segments; however, I noticed literal or unidiomatic translations on other occasions, and in general the quality was lower. So if you are going to use this n-gram phrase splitting with your Neural Machine Translation training, it is recommended to use it only on a part of the final dataset. That is why we used this approach only on the specialized dataset and kept the generic dataset as is, without phrase splitting.

Apart from training a model, you can use the generated phrase table for more options at translation time.

Other combination methods may include removing irrelevant segments from the big corpus, or replacing mismatching terminology based on a glossary during preprocessing.

Data Weighting

Data Weighting is another technique that can be useful for Domain Adaptation. In Data Weighting, you can either:

  • train one model on two corpora at the same time while giving a higher weight to the specialized corpus than to the generic corpus, or
  • train the model on only one corpus that includes both generic segments and specialized segments, giving higher weights to specialized segments.

For example, OpenNMT-py (the PyTorch version of OpenNMT) supports using different weights for different corpora; so we define the “data weights” list, which determines the weight each corpus should have; for example, 1 for Corpus A and 7 for Corpus B. This means that when building batches, we will take 1 segment from Corpus A, then 7 segments from Corpus B, and so on.
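The "1 from Corpus A, then 7 from Corpus B" pattern can be illustrated with a small generator. This is a toy sketch of the idea, not OpenNMT-py's implementation; `weighted_stream` is a hypothetical name, and the sketch assumes that each corpus is non-empty and simply restarts when it runs out.

```python
from itertools import cycle, islice

def weighted_stream(corpora, weights):
    """Yield segments endlessly: weights[k] segments from corpora[k], in turn."""
    iterators = [cycle(corpus) for corpus in corpora]  # restart exhausted corpora
    while True:
        for iterator, weight in zip(iterators, weights):
            for _ in range(weight):
                yield next(iterator)

corpus_a = [f"A{i}" for i in range(3)]   # stands in for Corpus A
corpus_b = [f"B{i}" for i in range(10)]  # stands in for Corpus B

first = list(islice(weighted_stream([corpus_a, corpus_b], [1, 7]), 16))
print(first)
# ['A0', 'B0', 'B1', 'B2', 'B3', 'B4', 'B5', 'B6',
#  'A1', 'B7', 'B8', 'B9', 'B0', 'B1', 'B2', 'B3']
```

Note that a small corpus with a high weight is repeated many times in such a stream, so the weight effectively controls how often the model sees each of its segments.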

Similarly, the Marian NMT toolkit supports sentence-level and word-level data weighting strategies, weighting each data item according to its proximity to the in-domain data. In Marian, data weighting requires you to provide a special file with the weights of sentences or words.

Other Domain Adaptation Approaches

For more state-of-the-art Domain Adaptation approaches, please check my AMTA presentation.

Final Note: Full Words vs. Sub-words

When preparing our data, we usually tokenize segments into complete words. However, it turns out that tokenizing segments into sub-words instead can help improve translation quality. Sub-wording is not a technique related only to Domain Adaptation; it is actually recommended for any kind of Neural Machine Translation training.

The main purpose of Sub-wording is to minimize out-of-vocabulary words. As I mentioned earlier, in Neural Machine Translation, there are limitations to vocabulary extraction. If your corpus is really huge, you are forced to extract only the most frequent vocabulary (~ 50,000 is common), or you might get an out-of-memory error during training. Extracting the most frequent vocabulary will be enough for most translations as long as you translate only sentences in the same domain as your corpus; however, in some cases, you might encounter out-of-vocabulary words.

Sub-wording can help in some cases:

  • Word variations in the same language, e.g. “translate vs. translation”
  • Compound words in the same language, e.g. “multi-tasking”. So now your model is not only able to translate “multi-tasking”, but also any other phrase that includes the word “multi”.
  • Shared words between languages
  • Common misspellings, like forgetting accents.

Just as with any other technique, on some occasions sub-wording will not give you better results; however, on many occasions, it will be a game changer. So, it is highly recommended to give it a try.

Methods of sub-wording include: Byte Pair Encoding (BPE) and unigram language model, both of which are supported by SentencePiece.
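To give an intuition for how BPE works, here is a toy, pure-Python sketch of learning merges and applying them to an unseen word. It is not SentencePiece and is far too slow for real corpora; the function names, the `</w>` end-of-word marker, and the tiny word list are assumptions for illustration.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(words, num_merges):
    """Learn up to num_merges merges, most frequent adjacent pair first."""
    vocab = Counter(tuple(word) + ("</w>",) for word in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged_vocab = Counter()
        for symbols, freq in vocab.items():
            merged_vocab[merge_pair(symbols, best)] += freq
        vocab = merged_vocab
    return merges

def segment(word, merges):
    """Split a word into sub-words by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

words = ["translate"] * 5 + ["translation"] * 5 + ["translator"] * 2
merges = learn_bpe(words, num_merges=10)

# "translated" never occurs in the training words, yet it can still be
# represented with sub-words learned from its relatives.
pieces = segment("translated", merges)
print(pieces)
```

Even if a word was never seen during training, it is broken down into known pieces instead of becoming an out-of-vocabulary token.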


So in this article, you have seen how Domain Adaptation can be useful when you want to train a Machine Translation model, but you have only limited data for an institution, language, or minor domain. Then, I discussed several techniques of Domain Adaptation, including: Incremental Training / Re-training, Ensemble Decoding, Combining Training Data, and Data Weighting. Along the way, I suggested a method for Data Augmentation to increase the size of the limited specialized corpus. Finally, I explained how sub-wording can help avoid out-of-vocabulary words. If you have questions or suggestions, please feel free to leave a comment.