Yasmin Moslem

Machine Translation Researcher

Mixed Fine-Tuning - Domain Adaptation That Works!

06 Jan 2022 » nmt

Training a robust generic model is an interesting task; however, when you want to customize your Machine Translation model to observe the terminology and style of a certain domain or client, Domain Adaptation comes to life. In previous posts, we discussed several approaches to Domain Adaptation. In this post, we are going to concentrate on a very effective approach called Mixed Fine-Tuning.

Fine-tuning an NMT model usually consists of two steps:

  1. Building a baseline NMT model, e.g. a generic model.
  2. Continuing training the baseline NMT model on an in-domain dataset.

However, fine-tuning in this way can lead to “catastrophic forgetting”, i.e. the model overfits the in-domain data, starts forgetting the information learned from the baseline data, and loses generalization. So in practice, compared to the baseline model, the fine-tuned in-domain model gives better BLEU scores and human evaluation results for sentences similar to the in-domain dataset, but worse BLEU scores for out-of-domain sentences.

Solution: Unlike plain fine-tuning, in the Mixed Fine-Tuning approach (Chu et al., 2017), you randomly sample a portion from the generic data you used to train the baseline model, and use it during the fine-tuning step along with the in-domain dataset. Over-sampling the in-domain data is the main trick.

The training procedure of the Mixed Fine-tuning approach is as follows:

  1. Train a baseline NMT model on out-of-domain data until convergence.
  2. Continue training the NMT baseline model on a mix of in-domain and out-of-domain data (by oversampling the in-domain data) until convergence.
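
As a rough illustration of the second step, the mixed corpus can also be built offline by sampling from the generic data and repeating the in-domain data. This is only a minimal sketch: the function name `build_mixed_corpus`, its parameters, and the sample size and oversampling factor used below are hypothetical and chosen for illustration, not values prescribed by the approach.

```python
import random

def build_mixed_corpus(generic, in_domain, generic_sample_size,
                       oversample_factor, seed=0):
    """Mix a random generic sample with oversampled in-domain pairs."""
    rng = random.Random(seed)
    # Randomly sample a portion of the generic (out-of-domain) data.
    sample_size = min(generic_sample_size, len(generic))
    generic_sample = rng.sample(generic, sample_size)
    # Oversample the in-domain data by simply repeating it.
    oversampled = in_domain * oversample_factor
    # Shuffle so both kinds of sentence pairs are interleaved.
    mixed = generic_sample + oversampled
    rng.shuffle(mixed)
    return mixed

# Toy (source, target) pairs standing in for real parallel data.
generic = [(f"generic src {i}", f"generic tgt {i}") for i in range(1000)]
in_domain = [(f"domain src {i}", f"domain tgt {i}") for i in range(100)]

mixed = build_mixed_corpus(generic, in_domain,
                           generic_sample_size=500, oversample_factor=5)
print(len(mixed))  # 500 generic + 100 * 5 in-domain = 1000
```

Repeating the data on disk is the simplest way to oversample, but it duplicates files; the dataset weights described next achieve a similar effect at training time.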

In NMT tools, such as OpenNMT and MarianMT, dataset weights can be used to replicate over-sampling.


Dataset Counts:
Dataset #1: 1,000,000 sentences
Dataset #2: 100,000 sentences

Use weights 1:10 so that the training takes 1 sentence from the bigger generic dataset, and 10 sentences from the smaller in-domain dataset.
Dataset #1: 1
Dataset #2: 10

In this example, we sequentially sample 1 example from Dataset #1, then 10 examples from Dataset #2, and so on. By giving Dataset #2 a higher weight, the model can learn the style and terminology of the in-domain dataset while still being able to generalize, i.e. output high-quality translations for out-of-domain sentences.
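
The sampling pattern described above can be sketched in a few lines of Python. This is a simplified illustration only: the generator `weighted_interleave` is a hypothetical helper, it assumes non-empty datasets, and it ignores the shuffling and batching that real NMT toolkits apply.

```python
from itertools import cycle, islice

def weighted_interleave(datasets, weights):
    """Take weights[i] examples from datasets[i] in turn, forever."""
    # cycle() restarts a dataset once it is exhausted, which is what
    # makes the smaller in-domain dataset effectively oversampled.
    iterators = [cycle(dataset) for dataset in datasets]
    while True:
        for iterator, weight in zip(iterators, weights):
            for _ in range(weight):
                yield next(iterator)

# Toy datasets standing in for Dataset #1 (generic) and #2 (in-domain).
generic = [f"generic-{i}" for i in range(1000)]
in_domain = [f"domain-{i}" for i in range(100)]

stream = weighted_interleave([generic, in_domain], weights=[1, 10])
first_22 = list(islice(stream, 22))  # two full 1 + 10 rounds
n_domain = sum(example.startswith("domain") for example in first_22)
print(n_domain, len(first_22))  # 20 22
```

With weights 1:10, about 10 out of every 11 training examples (roughly 91%) come from the in-domain dataset. For the counts above, one full pass over the 100,000 in-domain sentences consumes only 10,000 of the 1,000,000 generic sentences.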

Setting the dataset weights differs from one tool to another. In OpenNMT-py, dataset weights are set as numbers as in the aforementioned example. In OpenNMT-tf, dataset weights are set as ratios.
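
If a tool expects ratios rather than integer counts, the conversion is a simple normalization. The helper below is a hypothetical sketch of that arithmetic, not part of any toolkit; check your tool's documentation for the exact format it expects.

```python
def weights_to_ratios(weights):
    """Normalize integer dataset weights so that they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return [weight / total for weight in weights]

# Weights 1:10 from the example above, expressed as ratios.
ratios = weights_to_ratios([1, 10])
print([round(ratio, 3) for ratio in ratios])  # [0.091, 0.909]
```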

Further notes on the Mixed Fine-tuning approach (feel free to experiment with something different, though!)

  • The approach works well for in-domain datasets of between 50k and 500k sentences. For very small in-domain datasets, this approach might not work well; for very big in-domain datasets, you might want to try different weights.
  • If your baseline training data is too big, you can randomly extract a sample of it that is 10 times the size of the in-domain data.
  • If both the generic and in-domain data are available before training the baseline, we build the vocabulary and SentencePiece models on all the data, generic and in-domain combined.
  • During fine-tuning, we extract a dev/validation dataset from the in-domain dataset only.
  • After fine-tuning, we use two test datasets, one that we used for the out-of-domain baseline, and one extracted from the in-domain dataset, to make sure the model works in both cases.
  • To alleviate “catastrophic forgetting” on generic data, consider averaging the baseline model with the fine-tuned model.
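
The averaging mentioned in the last note can be pictured as a parameter-wise mean of the two models. The sketch below uses plain Python dictionaries of floats as stand-ins for model checkpoints; the function `average_parameters`, the `baseline_weight` argument, and the parameter names are all hypothetical. In practice you would use the averaging utility that comes with your NMT toolkit.

```python
def average_parameters(baseline, fine_tuned, baseline_weight=0.5):
    """Return the weighted parameter-wise average of two models.

    Both models are dicts mapping parameter names to lists of floats
    and must share the same architecture (same names, same sizes).
    """
    if baseline.keys() != fine_tuned.keys():
        raise ValueError("models must have the same parameter names")
    averaged = {}
    for name, base_values in baseline.items():
        tuned_values = fine_tuned[name]
        if len(base_values) != len(tuned_values):
            raise ValueError(f"size mismatch for parameter {name!r}")
        averaged[name] = [
            baseline_weight * base + (1 - baseline_weight) * tuned
            for base, tuned in zip(base_values, tuned_values)
        ]
    return averaged

# Toy "checkpoints" with two parameters each.
baseline = {"encoder.weight": [0.0, 2.0], "decoder.bias": [1.0]}
fine_tuned = {"encoder.weight": [1.0, 4.0], "decoder.bias": [3.0]}

averaged = average_parameters(baseline, fine_tuned)
print(averaged)  # {'encoder.weight': [0.5, 3.0], 'decoder.bias': [2.0]}
```

Evaluating the averaged model on both test datasets, generic and in-domain, shows whether the trade-off is worth it for your use case.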

The main benefit of the Mixed Fine-Tuning approach is that the fine-tuned in-domain NMT model still works well on both in-domain data and general/out-of-domain data.

It is worth mentioning that we have successfully applied the Mixed Fine-Tuning approach, proposed by Chu et al. (2017), in production-level scenarios in the industry. We also employed it, in combination with other approaches, in a number of our Domain Adaptation and Low-Resource NMT papers, such as Haque et al. (2020), with which we won first place at ICON 2020.