Mixed Fine-Tuning - Domain Adaptation That Works!

Training a robust generic model is an interesting task. However, when you want to customize your Machine Translation model to observe the terminology and style of a certain domain or client, Domain Adaptation comes into play. In previous posts, we discussed several approaches to Domain Adaptation. In this post, we are going to concentrate on a very effective approach called Mixed Fine-Tuning, originally proposed by Chu et al. (2017).

Regular fine-tuning of an NMT model usually consists of two steps:

Building a baseline NMT model, e.g. a generic model.
Continuing to train the baseline NMT model on an in-domain dataset.

However, fine-tuning in this way can lead to “catastrophic forgetting”, i.e. the model overfits the in-domain data, starts forgetting the information learned from the baseline data, and loses generalization. In practice, compared with the baseline model, the in-domain model gives better BLEU scores and human evaluation results for sentences very similar to the in-domain training dataset, but worse BLEU scores for out-of-domain sentences, or even for new in-domain sentences.

Solution: Unlike plain fine-tuning, in the Mixed Fine-Tuning approach (Chu et al., 2017), you randomly sample a portion of the generic data you used to train the baseline model, and use it during the fine-tuning step along with the in-domain dataset. Over-sampling the in-domain data is the main trick.

The training procedure of the Mixed Fine-Tuning approach is as follows:

Train a baseline NMT model on out-of-domain data until convergence.
Continue training the NMT baseline model on a mix of in-domain and out-of-domain data (by oversampling the in-domain data) until convergence.
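
As a rough illustration of the data side of step two, the sketch below builds a mixed corpus by repeating the in-domain sentence pairs and shuffling them together with the generic pairs. The function name `build_mixed_corpus`, the toy data, and the choice to repeat the in-domain data until it roughly matches the generic data in size are all assumptions made for this example; the right ratio depends on your data.

```python
import math
import random


def build_mixed_corpus(generic, in_domain, seed=0):
    """Mix generic and in-domain sentence pairs, oversampling the in-domain side.

    Both arguments are lists of (source, target) tuples. The in-domain list is
    repeated until it is at least as large as the generic list (an assumption
    of this sketch), then everything is shuffled together.
    """
    if not in_domain:
        raise ValueError("in_domain must not be empty")
    repeats = max(1, math.ceil(len(generic) / len(in_domain)))
    mixed = list(generic) + list(in_domain) * repeats
    random.Random(seed).shuffle(mixed)
    return mixed


# Toy data standing in for real parallel corpora.
generic = [(f"generic src {i}", f"generic tgt {i}") for i in range(1000)]
in_domain = [(f"domain src {i}", f"domain tgt {i}") for i in range(100)]

mixed = build_mixed_corpus(generic, in_domain)
print(len(mixed))  # 1000 generic pairs + 100 in-domain pairs repeated 10 times
```

Fine-tuning would then continue from the baseline checkpoint on `mixed` rather than on the in-domain data alone.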

In NMT tools, such as OpenNMT and MarianMT, dataset weights can be used to replicate over-sampling.

Example:

Dataset Counts:

Generic Dataset: 1,000,000 sentences
In-domain Dataset: 100,000 sentences

Use weights 1:10 so that the training takes 1 sentence from the bigger generic dataset, and 10 sentences from the smaller in-domain dataset.

Generic Dataset: 1
In-domain Dataset: 10

In this example, we sequentially sample 1 example from the “Generic Dataset” and 10 examples from the “In-domain Dataset” and so on. By giving the “In-domain Dataset” a higher weight, the model can learn the style and terminology from the in-domain dataset while still being able to generalize, i.e. output high-quality translations for out-of-domain sentences.
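
To make the sampling pattern concrete, here is a minimal sketch of weight-based sequential sampling. It is not how any particular toolkit implements it; `weighted_interleave` and the toy datasets are hypothetical names chosen for illustration, and each dataset simply restarts when it runs out.

```python
from itertools import cycle, islice


def weighted_interleave(datasets, weights):
    """Yield examples by visiting each dataset in turn and taking `weight`
    examples from it. Datasets restart when exhausted, so the stream is
    infinite and the smaller dataset is effectively oversampled."""
    if any(len(d) == 0 for d in datasets):
        raise ValueError("every dataset needs at least one example")
    iterators = [cycle(d) for d in datasets]
    while True:
        for iterator, weight in zip(iterators, weights):
            for _ in range(weight):
                yield next(iterator)


# Toy datasets; real ones would hold parallel sentence pairs.
generic = [f"generic-{i}" for i in range(5)]
in_domain = [f"domain-{i}" for i in range(3)]

stream = weighted_interleave([generic, in_domain], weights=[1, 10])
first_cycle = list(islice(stream, 11))
print(first_cycle[:3])  # ['generic-0', 'domain-0', 'domain-1']
```

With the counts from the example above, each cycle draws 11 sentences, 10 of which (about 91%) are in-domain. One full pass over the 100,000 in-domain sentences takes 10,000 cycles, during which only 10,000 of the 1,000,000 generic sentences (1%) are seen.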

Setting the dataset weights differs from one tool to another. In OpenNMT-py, dataset weights are set as numbers as in the aforementioned example. In OpenNMT-tf, dataset weights are set as ratios.
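
The sketch below shows both styles for the 1:10 example. The first helper builds a dictionary shaped like the `data` section of an OpenNMT-py YAML configuration, where each corpus has `path_src`, `path_tgt`, and `weight` keys; this layout reflects my understanding of the OpenNMT-py documentation and should be checked against the version you use. The second helper converts integer weights into ratios that sum to 1. The file paths and helper names are hypothetical.

```python
import json


def opennmt_py_data_section(corpora):
    """Build a dict mirroring an OpenNMT-py style `data` section.

    `corpora` maps a corpus name to (path_src, path_tgt, weight).
    The key names are assumed from the OpenNMT-py docs; verify before use.
    """
    return {
        "data": {
            name: {"path_src": src, "path_tgt": tgt, "weight": weight}
            for name, (src, tgt, weight) in corpora.items()
        }
    }


def weights_to_ratios(weights):
    """Convert integer weights such as [1, 10] into ratios that sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return [w / total for w in weights]


# Hypothetical file paths, for illustration only.
section = opennmt_py_data_section({
    "generic": ("data/generic.src", "data/generic.tgt", 1),
    "in_domain": ("data/in_domain.src", "data/in_domain.tgt", 10),
})
print(json.dumps(section, indent=2))

ratios = weights_to_ratios([1, 10])
print([round(r, 3) for r in ratios])  # [0.091, 0.909]
```

The printed dictionary still has to be written out as YAML in the tool's own configuration file, alongside the validation set and training options.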

Further notes on the Mixed Fine-Tuning approach (feel free to experiment with something different, though!):

The approach works well for in-domain datasets of roughly 50k to 500k sentences. For very small in-domain datasets, it might not work well; for bigger in-domain datasets, you might want to try different weights; and for very big in-domain datasets, you can train on the in-domain dataset alone, enriching it with missing aspects such as shorter sentences if needed.
If your baseline training data is too big, randomly extract a generic sample about 10 times the size of the in-domain data.
If both the generic and in-domain data are available before training the baseline, we build the vocabulary and SentencePiece models on all of the data, generic and in-domain alike.
During fine-tuning, we extract a dev/validation dataset from the in-domain dataset only.
After fine-tuning, we use two test datasets, one that we used for the out-of-domain baseline, and one extracted from the in-domain dataset, to make sure the model works in both cases.
To alleviate “catastrophic forgetting” on generic data, consider averaging the baseline model with the fine-tuned model.
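
Two of the notes above can be sketched in a few lines: drawing a generic sample about 10 times the size of the in-domain data, and averaging the baseline model with the fine-tuned model. Both helpers are hypothetical and simplified. In particular, `average_parameters` works on plain dictionaries of float lists as a stand-in for real checkpoints; in practice you would rely on the checkpoint-averaging utility of your NMT toolkit.

```python
import random


def sample_generic(generic_pairs, in_domain_size, factor=10, seed=0):
    """Randomly pick up to `factor` times `in_domain_size` generic pairs."""
    k = min(len(generic_pairs), factor * in_domain_size)
    return random.Random(seed).sample(generic_pairs, k)


def average_parameters(checkpoints):
    """Average parameters element-wise across checkpoints.

    Each checkpoint is a dict mapping a parameter name to a list of floats;
    all checkpoints must share the same names and shapes.
    """
    names = checkpoints[0].keys()
    for checkpoint in checkpoints[1:]:
        if checkpoint.keys() != names:
            raise ValueError("checkpoints must have the same parameter names")
    count = len(checkpoints)
    return {
        name: [sum(values) / count
               for values in zip(*(c[name] for c in checkpoints))]
        for name in names
    }


# Toy generic corpus and a pretend in-domain size of 20 pairs.
generic_pairs = [(f"src {i}", f"tgt {i}") for i in range(1000)]
subset = sample_generic(generic_pairs, in_domain_size=20)
print(len(subset))  # 200

# Toy "checkpoints" with a single two-value parameter each.
baseline = {"w": [1.0, 2.0]}
fine_tuned = {"w": [3.0, 6.0]}
averaged = average_parameters([baseline, fine_tuned])
print(averaged)  # {'w': [2.0, 4.0]}
```

After averaging, it is worth re-running both test sets, since the averaged model trades some in-domain gains for better generic quality.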

One advantage of the Mixed Fine-Tuning approach is that the fine-tuned in-domain NMT model still works well on both unseen in-domain data and general/out-of-domain data. Moreover, the approach can be fully automated (e.g. for various clients) once you verify it for your use cases.

It is worth mentioning that we have successfully applied the Mixed Fine-Tuning approach, proposed by Chu et al. (2017), in production-level scenarios in the industry. We also employed it in a number of our Domain Adaptation and Low-Resource NMT papers, such as Haque et al. (2020), where we combined it with other approaches and achieved first place at the ICON 2020 shared task, and Moslem et al. (2022), where we used synthetic in-domain data.

References

Axelrod, A., He, X., & Gao, J. (2011). Domain Adaptation via Pseudo In-Domain Data Selection. Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, 355–362. https://aclanthology.org/D11-1033
Chinea-Ríos, M., Peris, Á., & Casacuberta, F. (2017). Adapting Neural Machine Translation with Parallel Synthetic Data. Proceedings of the Second Conference on Machine Translation, 138–147. https://doi.org/10.18653/v1/W17-4714
Chu, C., Dabre, R., & Kurohashi, S. (2017). An Empirical Comparison of Domain Adaptation Methods for Neural Machine Translation. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 385–391. https://doi.org/10.18653/v1/P17-2061
Freitag, M., & Al-Onaizan, Y. (2016). Fast Domain Adaptation for Neural Machine Translation. In arXiv [cs.CL]. arXiv. http://arxiv.org/abs/1612.06897
Haque, R., Moslem, Y., & Way, A. (2020). Terminology-Aware Sentence Mining for NMT Domain Adaptation: ADAPT’s Submission to the Adap-MT 2020 English-to-Hindi AI Translation Shared Task. Proceedings of the 17th International Conference on Natural Language Processing (ICON): Adap-MT 2020 Shared Task, 17–23. https://aclanthology.org/2020.icon-adapmt.4
Kobus, C., Crego, J., & Senellart, J. (2017). Domain Control for Neural Machine Translation. Proceedings of Recent Advances in Natural Language Processing, 372–378. http://arxiv.org/abs/1612.06140
Luong, M.-T., & Manning, C. (2015). Stanford Neural Machine Translation Systems for Spoken Language Domains. Proceedings of the 12th International Workshop on Spoken Language Translation: Evaluation Campaign, 76–79. https://aclanthology.org/2015.iwslt-evaluation.11
Moslem, Y., Haque, R., Kelleher, J., & Way, A. (2022). Domain-Specific Text Generation for Machine Translation. Proceedings of the 15th Biennial Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track), 14–30. https://aclanthology.org/2022.amta-research.2
Moslem, Y. (2024). Language Modelling Approaches to Adaptive Machine Translation. In arXiv [cs.CL]. arXiv. http://arxiv.org/abs/2401.14559
Saunders, D. (2022). Domain Adaptation and Multi-Domain Adaptation for Neural Machine Translation: A Survey. Journal of Artificial Intelligence Research, 75, 351–424. https://doi.org/10.1613/jair.1.13566
Sennrich, R., Haddow, B., & Birch, A. (2016a). Controlling Politeness in Neural Machine Translation via Side Constraints. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 35–40. https://doi.org/10.18653/v1/N16-1005
Sennrich, R., Haddow, B., & Birch, A. (2016b). Improving Neural Machine Translation Models with Monolingual Data. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 86–96. https://doi.org/10.18653/v1/P16-1009

Yasmin Moslem
