Yasmin Moslem

Machine Translation Researcher.

Adaptive Neural Machine Translation

21 Apr 2021 » nmt

In a linguistic environment, translations and edits never stop. Therefore, while periodic fine-tuning of our NMT models can help, there is a clear need to take newly translated and edited segments into consideration as they arrive. Otherwise, the MT system will keep making the same mistakes, and will not always observe new terminology and style, until a new or fine-tuned version of the model is released. This is where Online Learning, or Online Adaptation, comes in handy: the NMT model can incrementally learn from new translations and edits as it goes along!

Multi-Domain Neural Machine Translation through Unsupervised Adaptation (Farajian et al., 2017) is one of the best papers I have read on the topic, especially because it adapts on the fly, so there is no need to train individual models. A similar approach is used by ModernMT for Adaptive NMT.

We can highlight the process offered by the paper as follows:

  1. Given a source input q (this can range from a single translation unit to an entire document), extract from the dataset/TM the top (source, target) pairs in terms of similarity between the source and q.
  2. Use the retrieved pairs to fine-tune the baseline model, which is then applied to translate q.
  3. After a linguist edits the MT translation and approves it, add it to the dataset/TM. Consider also having a dedicated “context” dataset for each client or project.
  4. Reset the adapted model to the original parameters, translate the next input source, and so on.
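To make step 1 more concrete, here is a minimal sketch of retrieving the most similar pairs from a translation memory. It is only an illustration under my own assumptions: the similarity measure (Python's `difflib.SequenceMatcher`), the function names, the `top_k` and `min_score` parameters, and the sample sentences are all hypothetical and are not the retrieval method used in the paper or in any production system.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; an assumed stand-in for a real fuzzy-match score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def retrieve_similar(query, memory, top_k=3, min_score=0.5):
    """Return up to top_k (source, target) pairs whose source is most similar to the query."""
    scored = [(similarity(query, src), src, tgt) for src, tgt in memory]
    # Drop weak matches so that unrelated segments never reach the fine-tuning step.
    scored = [item for item in scored if item[0] >= min_score]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(src, tgt) for _, src, tgt in scored[:top_k]]


# Hypothetical translation memory (English to French).
translation_memory = [
    ("Press the power button to start the device.",
     "Appuyez sur le bouton d'alimentation pour démarrer l'appareil."),
    ("The invoice is due within thirty days.",
     "La facture est payable sous trente jours."),
    ("Press the reset button to restart the device.",
     "Appuyez sur le bouton de réinitialisation pour redémarrer l'appareil."),
]

matches = retrieve_similar(
    "Press the power button to restart the device.", translation_memory, top_k=2
)
for src, tgt in matches:
    print(src, "=>", tgt)
```

The retrieved pairs would then be passed to the fine-tuning step (step 2). In a real system, the similarity threshold matters: if nothing in the memory is close enough to the input, it is probably safer to translate with the unadapted baseline.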

It is best applied in a CAT tool. The “dataset” or “parallel data” in this case is what linguists call a “translation memory”. “Instead of the static pool of in-domain parallel data, you can have a dynamic pool which is consistently updated by adding the new post-edited sentence pairs,” said Amin Farajian, the main author of the paper. “You will have a system that learns constantly from your post-editions. Moreover, by having separate pools for each of your post-editors, you can even have MT systems that adapt to the style of your translators!”

Similarly, Emil Lynegaard explained the process in simple words. “When you use a context memory for a translation request, it will look for similar source paragraphs in the reference context memory. If any are found, […] it will briefly “fine-tune” the underlying model. This actually modifies the weights and biases of the neural network, albeit it only does so temporarily. When the fine-tuning has finished (this is typically a sub-second training run), then your input paragraph will be translated using the updated model, after which the model will have its weights reset to the original configuration.”
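The "fine-tune, translate, reset" cycle described above can be sketched as a snapshot-and-restore pattern. The example below is a deliberately tiny stand-in, assuming a one-unit linear model instead of a real NMT network; `ToyModel`, `translate_with_adaptation`, the learning rate, and the number of steps are all hypothetical names and values chosen for illustration, not anything taken from the paper or from a specific toolkit.

```python
import copy


class ToyModel:
    """Stand-in for an NMT model: a single linear unit y = w * x + b."""

    def __init__(self, w=1.0, b=0.0):
        self.params = {"w": w, "b": b}

    def predict(self, x):
        return self.params["w"] * x + self.params["b"]

    def train_step(self, x, y, lr=0.05):
        # One gradient-descent step on the squared error 0.5 * (prediction - y) ** 2.
        error = self.predict(x) - y
        self.params["w"] -= lr * error * x
        self.params["b"] -= lr * error


def translate_with_adaptation(model, query, retrieved_pairs, steps=20):
    """Briefly fine-tune on the retrieved pairs, predict, then restore the baseline."""
    original = copy.deepcopy(model.params)  # snapshot of the baseline parameters
    try:
        for _ in range(steps):
            for x, y in retrieved_pairs:
                model.train_step(x, y)
        return model.predict(query)
    finally:
        # Reset to the original parameters so the next request starts from the baseline.
        model.params = original


model = ToyModel()
baseline = model.predict(2.0)
# The "retrieved pairs" follow y = 2x + 1, which the baseline (y = x) does not know.
adapted = translate_with_adaptation(model, 2.0, [(1.0, 3.0), (2.0, 5.0)])
print("baseline:", baseline, "adapted:", adapted, "params after reset:", model.params)
```

With a real neural model, the same idea would apply to the full set of weights: keep a copy of the baseline parameters, run a very short training loop on the retrieved pairs, translate, and load the baseline parameters back.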

This human-in-the-loop, adaptive approach is just brilliant in multiple respects. For example, by simply resetting the model, it avoids the “catastrophic forgetting” that could result from fine-tuning on a small number of sentences. Moreover, it does this in a straightforward way without having to change the original architecture of the model.

To test the system, we need to create development and test datasets. According to the paper, “from each specific domain a set of size 500 sentence pairs is randomly selected as development set, and 1,000 sentence pairs are used as held-out test corpus.”
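As a rough illustration of such a split, the sketch below randomly holds out 500 pairs for development and 1,000 for testing from one domain's data. The function name `split_domain`, the fixed seed, and the synthetic corpus are my own assumptions; the paper does not describe its selection script.

```python
import random


def split_domain(pairs, dev_size=500, test_size=1000, seed=0):
    """Randomly hold out dev and test sets; the remaining pairs stay in the pool."""
    if len(pairs) < dev_size + test_size:
        raise ValueError("Not enough sentence pairs for the requested split.")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    dev = shuffled[:dev_size]
    test = shuffled[dev_size:dev_size + test_size]
    remaining = shuffled[dev_size + test_size:]
    return remaining, dev, test


# Hypothetical in-domain corpus of (source, target) pairs.
corpus = [(f"source {i}", f"target {i}") for i in range(5000)]
remaining, dev, test = split_domain(corpus)
print(len(remaining), len(dev), len(test))
```

The split would be repeated for each domain, and the held-out pairs must of course be kept out of the dataset/TM that is used for retrieval.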

One caveat about this approach is worth noting, though. While it saves time and resources by eliminating the need to train many in-domain/custom models, especially for domains with limited data, it is still compute-intensive: it requires GPUs at translation time, usually equivalent to those used for training the baseline model. That said, I believe that in some scenarios this approach can be a perfect solution, especially if it is combined with other lines of work like Knowledge Distillation (Kim and Rush, 2016; Crego and Senellart, 2016; Zhang et al., 2018) to make the fine-tuning process more efficient.

I was honoured to present this paper, among others, in my presentation about NMT Domain Adaptation Techniques at AMTA2020.