Training a robust generic model is an interesting task; however, when you want to customize your Machine Translation model to adhere to the terminology and style of a certain domain or client, Domain Adaptation comes into play. In previous posts, we discussed several approaches to Domain Adaptation. In this post, we concentrate on a very effective approach called Mixed Fine-Tuning.
Fine-tuning an NMT model usually consists of two steps:
- Building a baseline NMT model, e.g. a generic model.
- Continuing to train the baseline NMT model on an in-domain dataset.
However, fine-tuning in this way can lead to “catastrophic forgetting”: the model overfits the in-domain data, starts forgetting the information learned from the baseline data, and loses generalization. In practice, when you compare it to the baseline model, the in-domain model achieves better BLEU scores and human evaluation results on sentences similar to the in-domain dataset, but worse BLEU scores on out-of-domain sentences.
Solution: Unlike plain fine-tuning, in the Mixed Fine-Tuning approach (Chu et al., 2017), you randomly sample a portion from the generic data you used to train the baseline model, and use it during the fine-tuning step along with the in-domain dataset. Over-sampling the in-domain data is the main trick.
The training procedure of the Mixed Fine-Tuning approach is as follows:
- Train a baseline NMT model on out-of-domain data until convergence.
- Continue training the NMT baseline model on a mix of in-domain and out-of-domain data (by oversampling the in-domain data) until convergence.
In NMT tools such as OpenNMT and MarianMT, dataset weights can be used to implement over-sampling.
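For example, in OpenNMT-py, per-corpus weights can be declared in the training configuration. Here is a sketch of such a config; the corpus names and file paths are hypothetical:

```yaml
# Sketch of an OpenNMT-py training config with dataset weights.
# Corpus names and paths are hypothetical.
data:
    generic:
        path_src: data/generic.src
        path_tgt: data/generic.tgt
        weight: 1
    in_domain:
        path_src: data/in_domain.src
        path_tgt: data/in_domain.tgt
        weight: 10
```

During training, examples are then drawn from the corpora in proportion to these weights.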
Dataset #1: 1,000,000 sentences
Dataset #2: 100,000 sentences
Use weights 1:10 so that training takes 1 sentence from the bigger generic dataset and 10 sentences from the smaller in-domain dataset:
Dataset #1: 1
Dataset #2: 10
In this example, we sequentially sample 1 example from Dataset #1, then 10 examples from Dataset #2, and so on. By giving Dataset #2 a higher weight, the model can learn the style and terminology of the in-domain dataset while still being able to generalize, i.e. produce high-quality translations for out-of-domain sentences.
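This sequential weighted sampling can be sketched in plain Python; `weighted_stream` is a hypothetical helper for illustration, not part of any NMT toolkit:

```python
import itertools

def weighted_stream(datasets, weights):
    """Cycle through the datasets, yielding `weight` examples from each
    in turn -- a minimal sketch of how per-dataset weights over-sample
    the smaller corpus (hypothetical helper, not a toolkit API)."""
    iterators = [itertools.cycle(d) for d in datasets]
    while True:
        for it, weight in zip(iterators, weights):
            for _ in range(weight):
                yield next(it)

# Toy corpora standing in for the generic and in-domain datasets.
generic = [f"generic-{i}" for i in range(1000)]
in_domain = [f"domain-{i}" for i in range(100)]

stream = weighted_stream([generic, in_domain], [1, 10])
batch = [next(stream) for _ in range(22)]
# Each block of 11 examples contains 1 generic and 10 in-domain sentences.
```

With weights 1:10, the in-domain corpus dominates each training pass while the generic data keeps flowing in, which is exactly the over-sampling effect described above.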
Further notes on the Mixed Fine-Tuning approach (feel free to experiment with something different, though!):
- The approach works well for in-domain datasets between 50k and 500k sentences. For very small in-domain datasets, this approach might not work well; for very big in-domain datasets, you might want to try different weights.
- If your baseline training data is too big, randomly extract a subset of it roughly 10 times the size of the in-domain data and use that as the out-of-domain portion of the mix.
- If both the generic and in-domain data are available before training the baseline, we build the vocabulary and SentencePiece models on all datasets, both generic and in-domain.
- During fine-tuning, we extract a dev/validation dataset from the in-domain dataset only.
- After fine-tuning, we use two test datasets, one that we used for the out-of-domain baseline, and one extracted from the in-domain dataset, to make sure the model works in both cases.
- To alleviate “catastrophic forgetting” on generic data, consider averaging the baseline model with the fine-tuned model.
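The checkpoint averaging mentioned in the last note can be sketched as a simple parameter-wise mean. Real toolkits ship their own averaging scripts; in this sketch each "tensor" is just a list of floats, and the function name is hypothetical:

```python
def average_checkpoints(state_a, state_b, alpha=0.5):
    """Element-wise weighted average of two parameter dictionaries.
    A minimal sketch of checkpoint averaging: alpha=0.5 gives the
    plain mean of the baseline and fine-tuned parameters."""
    assert state_a.keys() == state_b.keys()
    return {
        name: [alpha * a + (1 - alpha) * b
               for a, b in zip(state_a[name], state_b[name])]
        for name in state_a
    }

# Toy parameter dictionaries standing in for model checkpoints.
baseline = {"encoder.w": [1.0, 2.0], "decoder.w": [0.0, 4.0]}
fine_tuned = {"encoder.w": [3.0, 2.0], "decoder.w": [2.0, 0.0]}

averaged = average_checkpoints(baseline, fine_tuned)
# averaged["encoder.w"] → [2.0, 2.0]
```

The averaged model sits between the two checkpoints in parameter space, which often trades a little in-domain quality for better retention of generic performance.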
The key benefit of the Mixed Fine-Tuning approach is that the resulting in-domain model still performs well on both in-domain data and general/out-of-domain data.
It is worth mentioning that we have successfully applied the Mixed Fine-Tuning approach, proposed by Chu et al. (2017), in production scenarios in the industry. We also employed it, in combination with other approaches, in a number of our Domain Adaptation and Low-Resource NMT papers, such as Haque et al. (2020), with which we won first place at ICON 2020.