<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>MachineTranslation.io</title>
    <description>Research topics on Machine Translation</description>
    <link>https://blog.machinetranslation.io/</link>
    <atom:link href="https://blog.machinetranslation.io/sitemap.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Mon, 17 Nov 2025 16:15:49 +0000</pubDate>
    <lastBuildDate>Mon, 17 Nov 2025 16:15:49 +0000</lastBuildDate>
    <generator>Jekyll v3.10.0</generator>
    
      <item>
        <title>Iterative Layer Pruning for Efficient Inference</title>
        <description>&lt;p&gt;Model pruning is a compression technique, that aims to remove redundant components without significantly compromising the model’s performance or accuracy. This process facilitates efficient deployment of complex models by making them smaller and faster.&lt;/p&gt;

&lt;p&gt;Pruning is a hardware-agnostic compression approach. Unlike some other compression approaches such as quantisation, models resulting from structured pruning can be deployed on any modern GPU with similar performance gains. Moreover, pruning can be part of a sophisticated compression pipeline that incorporates other techniques such as quantisation and efficient fine-tuning (e.g. LoRA, QLoRA), which can lead to higher compression and efficiency levels.&lt;/p&gt;

&lt;p&gt;In this article, we will cover some insights from our two papers about Iterative Layer Pruning at IWSLT 2025 [1] and WMT 2025 [2].&lt;/p&gt;

&lt;h2 id=&quot;iterative-layer-pruning&quot;&gt;Iterative Layer Pruning&lt;/h2&gt;

&lt;p&gt;The process of Iterative Layer Pruning involves incrementally identifying and removing layers with minimal contribution to translation or generation quality, one layer at a time. The pruning process is usually followed by fine-tuning the resulting models on relevant training data to restore the translation quality. Moreover, knowledge distillation data from the baseline (teacher) model can be used to help the pruned (student) model to reach the quality of the teacher model.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../static/img/pruning/pruning-digram.png&quot; alt=&quot;pruning-diagram&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;layer-importance-evaluation&quot;&gt;Layer Importance Evaluation&lt;/h2&gt;

&lt;p&gt;We conduct layer importance evaluation by measuring translation performance with each layer removed in turn. The process is as follows:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Remove one layer from the model.&lt;/li&gt;
  &lt;li&gt;Evaluate the model (chrF++).&lt;/li&gt;
  &lt;li&gt;Restore the layer, and repeat for each remaining layer.&lt;/li&gt;
  &lt;li&gt;Prune the least important layer (the one whose removal gives the best chrF++).&lt;/li&gt;
  &lt;li&gt;Repeat steps 1–4 until the pruning target is reached.&lt;/li&gt;
&lt;/ol&gt;
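&lt;p&gt;The loop above can be sketched in a few lines of Python. Here, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;evaluate&lt;/code&gt; is a hypothetical stand-in that assigns each layer a fixed contribution; in practice, it would decode a development set with the given layers and compute chrF++.&lt;/p&gt;

```python
def evaluate(layers):
    # Hypothetical stand-in for translating a dev set with the given
    # layers and scoring it with chrF++ (higher is better).
    contribution = {0: 5.0, 1: 1.0, 2: 0.2, 3: 4.0, 4: 0.1, 5: 3.0}
    return sum(contribution[layer] for layer in layers)


def prune_iteratively(layers, target_num_layers):
    layers = list(layers)
    while len(layers) > target_num_layers:
        # Score the model with each layer removed in turn.
        scores = {layer: evaluate([l for l in layers if l != layer])
                  for layer in layers}
        # The least important layer is the one whose removal
        # keeps the score highest.
        least_important = max(scores, key=scores.get)
        layers.remove(least_important)
    return layers


print(prune_iteratively([0, 1, 2, 3, 4, 5], 3))  # [0, 3, 5]
```

&lt;p&gt;Note that the removal is greedy: each iteration re-scores all remaining layers before pruning a single one, exactly as in steps 1–4 above.&lt;/p&gt;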

&lt;h2 id=&quot;evaluation-results&quot;&gt;Evaluation Results&lt;/h2&gt;

&lt;p&gt;For translation from Czech to German (CES-DEU), pruning 8 layers and then fine-tuning the resulting model retains 98% of the translation quality (as measured by COMET), while achieving considerable speedup gains. Interestingly, for translation from English to Egyptian Arabic (ENG-ARZ), the model resulting from pruning up to 16 layers and then fine-tuning outperforms the Aya-Expanse-8B baseline for this language pair. Pruned models achieve up to ~2× speedup.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../static/img/pruning/pruning-results-wmt.png&quot; alt=&quot;pruning-results&quot; style=&quot;zoom:70%;&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;knowledge-distillation&quot;&gt;Knowledge Distillation&lt;/h2&gt;

&lt;p&gt;Knowledge Distillation aims to transfer knowledge from a larger model (teacher) to a smaller one (student). In “sequence-level” knowledge distillation, the student model is trained to generate sequences that match the teacher’s outputs. Fine-tuning the pruned models on a combination of authentic and synthetic data (from Aya-Expanse-32B) improved the Czech-to-German (CES-DEU) translation quality, with the 24-layer pruned model nearly matching the performance of the Aya-Expanse-8B baseline.&lt;/p&gt;
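&lt;p&gt;The data construction for sequence-level knowledge distillation can be sketched as follows. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;teacher_translate&lt;/code&gt; function is a hypothetical placeholder for decoding with the teacher model (e.g. Aya-Expanse-32B); the pruned student is then fine-tuned on the mixed pairs.&lt;/p&gt;

```python
def teacher_translate(source):
    # Hypothetical placeholder for decoding with the teacher model;
    # here it just tags the input so the sketch stays self-contained.
    return f"teacher({source})"


def build_distillation_data(sources, authentic_pairs):
    # Sequence-level KD: the student is trained on the teacher's full
    # output sequences, mixed with authentic parallel data.
    synthetic_pairs = [(src, teacher_translate(src)) for src in sources]
    return authentic_pairs + synthetic_pairs


data = build_distillation_data(["Dobrý den.", "Děkuji."],
                               [("Ahoj.", "Hallo.")])
print(len(data))  # 3
```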

&lt;div style=&quot;text-align:center;&quot;&gt;&lt;img src=&quot;../static/img/pruning/pruning-results-wmt-kd.png&quot; alt=&quot;pruning-results&quot; style=&quot;zoom:40%;&quot; /&gt;&lt;/div&gt;

&lt;h2 id=&quot;further-performance-gains&quot;&gt;Further performance gains&lt;/h2&gt;

&lt;p&gt;It is highly recommended to use an efficient inference engine such as vLLM, which outperforms inference with the Transformers framework. In both cases, pruned models demonstrate up to ~2× speedup.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../static/img/pruning/trasformers-vllm.png&quot; alt=&quot;vLLM&quot; style=&quot;zoom:70%;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Moreover, you can quantise the pruned models for further compression. For example, in our IWSLT 2025 paper, we applied QLoRA after pruning the models. However, note that low-precision quantisation (e.g. 4-bit and 8-bit) requires special hardware (e.g. H100 or H200) to observe performance gains.&lt;/p&gt;

&lt;h2 id=&quot;questions-and-answers&quot;&gt;Questions and Answers&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Q. Why iterative layer pruning instead of random or middle layer pruning?
    &lt;ul&gt;
      &lt;li&gt;A. Iterative layer pruning relies on layer importance analysis. Hence, only the layers with minimal contribution to the output quality can be removed. In our experiments, iterative layer pruning achieves better results than middle layer pruning.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Q. Can we prune parts of the model other than layers?
    &lt;ul&gt;
      &lt;li&gt;A. There are two types of pruning: structured pruning and unstructured pruning. In structured pruning, you remove whole layers, attention heads, or other entire computational blocks, while in unstructured pruning, you remove individual weights from the neural network. While unstructured pruning can achieve higher compression rates, it requires specialised hardware for efficient deployment. In contrast, models resulting from structured pruning can be deployed on standard GPUs.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Q. Is it better to fine-tune the baseline model &lt;em&gt;before&lt;/em&gt; pruning?
    &lt;ul&gt;
      &lt;li&gt;A. If your task, domain, or language is very different from the distribution of the baseline model, it is better to fine-tune the baseline model first. Otherwise, you can prune the baseline directly. Either way, fine-tuning &lt;em&gt;after&lt;/em&gt; pruning is always required to restore the quality of the baseline model.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Q. Can we fine-tune after each layer pruning step?
    &lt;ul&gt;
      &lt;li&gt;A. We experimented with fine-tuning both after each layer pruning step and after a number of pruned layers. In both cases, there was no improvement over fine-tuning just once at the end of the pruning process. This might be due to overfitting resulting from the several fine-tuning passes.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Q. Can the same approach be applied to encoder-decoder models?
    &lt;ul&gt;
      &lt;li&gt;A. Yes, we applied this iterative layer pruning approach to both a decoder-only model, Aya-Expanse, and an encoder-decoder model, Qwen2-Audio. However, for encoder-decoder models, we observed that pruning only the decoder leads to better overall performance.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;github-repository&quot;&gt;GitHub Repository&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/ymoslem/Model-Compression&quot;&gt;&lt;img src=&quot;https://github-readme-stats.vercel.app/api/pin/?theme=graywhite&amp;amp;username=ymoslem&amp;amp;repo=Model-Compression&quot; alt=&quot;Model-Compression&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a href=&quot;https://aclanthology.org/2025.iwslt-1.40/&quot;&gt;Efficient Speech Translation through Model Compression and Knowledge Distillation&lt;/a&gt; (Moslem, IWSLT 2025)&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://aclanthology.org/2025.wmt-1.78/&quot;&gt;Iterative Layer Pruning for Efficient Translation Inference&lt;/a&gt; (Moslem et al., WMT 2025)&lt;/li&gt;
&lt;/ol&gt;

</description>
        <pubDate>Mon, 17 Nov 2025 00:00:00 +0000</pubDate>
        <link>https://blog.machinetranslation.io/iterative-layer-pruning/</link>
        <guid isPermaLink="true">https://blog.machinetranslation.io/iterative-layer-pruning/</guid>
        
        
        <category>efficiency</category>
        
        <category>llm</category>
        
        <category>mt</category>
        
        <category>speech</category>
        
      </item>
    
      <item>
        <title>Adaptive Translation and Terminology with Large Language Models</title>
        <description>&lt;p&gt;Large-scale language models (LLMs) have shown interesting capabilities of in-context learning, where they learn to replicate certain input-output text generation patterns, without further fine-tuning. By feeding an LLM at inference time with a prompt that consists of a list of translation pairs, it can then simulate the domain, terminology, and style characteristics.&lt;/p&gt;

&lt;h2 id=&quot;adaptive-machine-translation-with-large-language-models&quot;&gt;Adaptive Machine Translation with Large Language Models&lt;/h2&gt;

&lt;p&gt;First preprint: January 2023&lt;/p&gt;

&lt;p&gt;Peer-reviewed: EAMT 2023&lt;/p&gt;

&lt;h3 id=&quot;abstract&quot;&gt;Abstract:&lt;/h3&gt;

&lt;p&gt;Consistency is a key requirement of high-quality translation. It is especially important to adhere to pre-approved terminology and adapt to corrected translations in domain-specific projects. Machine translation (MT) has achieved significant progress in the area of domain adaptation. However, real-time adaptation remains challenging. Large-scale language models (LLMs) have recently shown interesting capabilities of in-context learning, where they learn to replicate certain input-output text generation patterns, without further fine-tuning. By feeding an LLM at inference time with a prompt that consists of a list of translation pairs, it can then simulate the domain and style characteristics. This work aims to investigate how we can utilize in-context learning to improve real-time adaptive MT. Our extensive experiments show promising results at translation time. For example, GPT-3.5 can adapt to a set of in-domain sentence pairs and/or terminology while translating a new sentence. We observe that the translation quality with few-shot in-context learning can surpass that of strong encoder-decoder MT systems, especially for high-resource languages. Moreover, we investigate whether we can combine MT from strong encoder-decoder models with fuzzy matches, which can further improve translation quality, especially for less supported languages. We conduct our experiments across five diverse language pairs, namely English-to-Arabic (EN-AR), English-to-Chinese (EN-ZH), English-to-French (EN-FR), English-to-Kinyarwanda (EN-RW), and English-to-Spanish (EN-ES).&lt;/p&gt;

&lt;div class=&quot;language-bib highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nc&quot;&gt;@inproceedings&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;moslem-etal-2023-adaptive&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;title&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Adaptive Machine Translation with Large Language Models&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;author&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Moslem, Yasmin  and
      Haque, Rejwanul  and
      Kelleher, John D.  and
      Way, Andy&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;booktitle&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Proceedings of the 24th Annual Conference of the European Association for Machine Translation&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;month&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;jun&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;year&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;2023&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;address&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Tampere, Finland&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;publisher&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;European Association for Machine Translation&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;url&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;https://aclanthology.org/2023.eamt-1.22/&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;pages&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;227--237&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
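&lt;p&gt;The in-context learning setup can be illustrated with a minimal prompt builder: fuzzy matches retrieved from a translation memory become few-shot examples for the new source sentence. The template below is an illustrative sketch, not the exact prompt format used in the paper.&lt;/p&gt;

```python
def build_fewshot_prompt(fuzzy_matches, new_source,
                         src_lang="English", tgt_lang="Spanish"):
    # Each (source, target) fuzzy match becomes a few-shot example;
    # the model is expected to complete the final target line.
    lines = []
    for src, tgt in fuzzy_matches:
        lines.append(f"{src_lang}: {src}")
        lines.append(f"{tgt_lang}: {tgt}")
    lines.append(f"{src_lang}: {new_source}")
    lines.append(f"{tgt_lang}:")
    return "\n".join(lines)


print(build_fewshot_prompt(
    [("The patient shows symptoms.", "El paciente presenta síntomas.")],
    "The patient is stable."))
```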

&lt;h2 id=&quot;domain-terminology-integration-into-machine-translation-leveraging-large-language-models&quot;&gt;Domain Terminology Integration into Machine Translation: Leveraging Large Language Models&lt;/h2&gt;

&lt;p&gt;First preprint: October 2023&lt;/p&gt;

&lt;p&gt;Peer-reviewed: WMT 2023&lt;/p&gt;

&lt;h3 id=&quot;abstract-1&quot;&gt;Abstract:&lt;/h3&gt;

&lt;p&gt;This paper discusses the methods that we used for our submissions to the WMT 2023 Terminology Shared Task for German-to-English (DE-EN), English-to-Czech (EN-CS), and Chinese-to-English (ZH-EN) language pairs. The task aims to advance machine translation (MT) by challenging participants to develop systems that accurately translate technical terms, ultimately enhancing communication and understanding in specialised domains. To this end, we conduct experiments that utilise large language models (LLMs) for two purposes: generating synthetic bilingual terminology-based data, and post-editing translations generated by an MT model through incorporating pre-approved terms. Our system employs a four-step process: (i) using an LLM to generate bilingual synthetic data based on the provided terminology, (ii) fine-tuning a generic encoder-decoder MT model, with a mix of the terminology-based synthetic data generated in the first step and a randomly sampled portion of the original generic training data, (iii) generating translations with the fine-tuned MT model, and (iv) finally, leveraging an LLM for terminology-constrained automatic post-editing of the translations that do not include the required terms. The results demonstrate the effectiveness of our proposed approach in improving the integration of pre-approved terms into translations. The number of terms incorporated into the translations of the blind dataset increases from an average of 36.67% with the generic model to an average of 72.88% by the end of the process. In other words, successful utilisation of terms nearly doubles across the three language pairs.&lt;/p&gt;
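&lt;p&gt;Step (iv) first requires identifying which pre-approved terms are missing from a translation. A minimal sketch of that check is shown below; the function name and the simple case-insensitive substring matching are illustrative assumptions.&lt;/p&gt;

```python
def missing_terms(translation, required_terms):
    # Return the pre-approved target terms that do not appear in the
    # translation, i.e. the candidates for terminology-constrained
    # automatic post-editing with an LLM.
    lowered = translation.lower()
    return [term for term in required_terms
            if term.lower() not in lowered]


print(missing_terms("Der Bremssattel ist defekt.",
                    ["Bremssattel", "Hauptzylinder"]))  # ['Hauptzylinder']
```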

&lt;div class=&quot;language-bib highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nc&quot;&gt;@inproceedings&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;moslem-etal-2023-domain&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;title&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Domain Terminology Integration into Machine Translation: Leveraging Large Language Models&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;author&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Moslem, Yasmin  and
      Romani, Gianfranco  and
      Molaei, Mahdi  and
      Kelleher, John D.  and
      Haque, Rejwanul  and
      Way, Andy&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;booktitle&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Proceedings of the Eighth Conference on Machine Translation&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;month&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;dec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;year&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;2023&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;address&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Singapore&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;publisher&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Association for Computational Linguistics&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;url&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;https://aclanthology.org/2023.wmt-1.82/&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;doi&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;10.18653/v1/2023.wmt-1.82&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;pages&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;902--911&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;fine-tuning-large-language-models-for-adaptive-machine-translation&quot;&gt;Fine-tuning Large Language Models for Adaptive Machine Translation&lt;/h2&gt;

&lt;p&gt;First preprint: December 2023&lt;/p&gt;

&lt;p&gt;Published as: thesis chapter&lt;/p&gt;

&lt;h3 id=&quot;abstract-2&quot;&gt;Abstract:&lt;/h3&gt;

&lt;p&gt;This paper presents the outcomes of fine-tuning Mistral 7B, a general-purpose large language model (LLM), for adaptive machine translation (MT). The fine-tuning process involves utilising a combination of zero-shot and one-shot translation prompts within the medical domain. The primary objective is to enhance real-time adaptive MT capabilities of Mistral 7B, enabling it to adapt translations to the required domain at inference time. The results, particularly for Spanish-to-English MT, showcase the efficacy of the fine-tuned model, demonstrating quality improvements in both zero-shot and one-shot translation scenarios, surpassing Mistral 7B’s baseline performance. Notably, the fine-tuned Mistral outperforms ChatGPT “gpt-3.5-turbo” in zero-shot translation while achieving comparable one-shot translation quality. Moreover, the zero-shot translation of the fine-tuned Mistral matches NLLB 3.3B’s performance, and its one-shot translation quality surpasses that of NLLB 3.3B. These findings emphasise the significance of fine-tuning efficient LLMs like Mistral 7B to yield high-quality zero-shot translations comparable to task-oriented models like NLLB 3.3B. Additionally, the adaptive gains achieved in one-shot translation are comparable to those of commercial LLMs such as ChatGPT. Our experiments demonstrate that, with a relatively small dataset of 20,000 segments that incorporate a mix of zero-shot and one-shot prompts, fine-tuning significantly enhances Mistral’s in-context learning ability, especially for real-time adaptive MT.&lt;/p&gt;

&lt;div class=&quot;language-bib highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nc&quot;&gt;@article&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;moslem-etal-2023-fine-tuning-llms&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;title&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{Fine-tuning Large Language Models for Adaptive Machine Translation}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
      &lt;span class=&quot;na&quot;&gt;author&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{Yasmin Moslem and Rejwanul Haque and Andy Way}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;year&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{2023}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;eprint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{2312.12740}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;archivePrefix&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{arXiv}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;primaryClass&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{cs.CL}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{https://arxiv.org/abs/2312.12740}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;github-repository&quot;&gt;GitHub Repository&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/ymoslem/Adaptive-MT-LLM-Fine-tuning&quot;&gt;&lt;img src=&quot;https://github-readme-stats.vercel.app/api/pin/?theme=graywhite&amp;amp;username=ymoslem&amp;amp;repo=Adaptive-MT-LLM-Fine-tuning&quot; alt=&quot;Adaptive-MT-LLM-Fine-tuning&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;language-modelling-approaches-to-adaptive-machine-translation&quot;&gt;Language Modelling Approaches to Adaptive Machine Translation&lt;/h2&gt;

&lt;p&gt;First preprint: January 2024&lt;/p&gt;

&lt;p&gt;Published as: PhD thesis (DCU)&lt;/p&gt;

&lt;h3 id=&quot;abstract-3&quot;&gt;Abstract:&lt;/h3&gt;

&lt;p&gt;Consistency is a key requirement of high-quality translation. It is especially important to adhere to pre-approved terminology and adapt to corrected translations in domain-specific projects. Machine translation (MT) has achieved significant progress in the area of domain adaptation. However, in-domain data scarcity is common in translation settings, due to the lack of specialised datasets and terminology, or inconsistency and inaccuracy of available in-domain translations. In such scenarios where there is insufficient in-domain data to fine-tune MT models, producing translations that are consistent with the relevant context is challenging. While real-time adaptation can make use of smaller amounts of in-domain data to improve the translation on the fly, it remains challenging due to supported context limitations and efficiency constraints. Large language models (LLMs) have recently shown interesting capabilities of in-context learning, where they learn to replicate certain input-output text generation patterns, without further fine-tuning. Such capabilities have opened new horizons for domain-specific data augmentation and real-time adaptive MT. This work attempts to address two main relevant questions: 1) in scenarios involving human interaction and continuous feedback, can we employ language models to improve the quality of adaptive MT at inference time? and 2) in the absence of sufficient in-domain data, can we use pre-trained large-scale language models to improve the process of MT domain adaptation?&lt;/p&gt;

&lt;div class=&quot;language-bib highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nc&quot;&gt;@article&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;moslem-2024-adaptive-mt-llms&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;title&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{Language Modelling Approaches to Adaptive Machine Translation, {PhD} thesis}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
      &lt;span class=&quot;na&quot;&gt;author&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{Yasmin Moslem}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;year&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{2024}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;eprint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{2401.14559}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;archivePrefix&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{arXiv}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;primaryClass&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{cs.CL}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{https://arxiv.org/abs/2401.14559}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://aclanthology.org/2023.eamt-1.22/&quot;&gt;Adaptive Machine Translation with Large Language Models&lt;/a&gt; (Moslem et al., EAMT 2023)&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://aclanthology.org/2023.wmt-1.82/&quot;&gt;Domain Terminology Integration into Machine Translation: Leveraging Large Language Models&lt;/a&gt; (Moslem et al., WMT 2023)&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/2312.12740&quot;&gt;Fine-tuning Large Language Models for Adaptive Machine Translation&lt;/a&gt; (Moslem et al., thesis chapter 2023)&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/2401.14559&quot;&gt;Language Modelling Approaches to Adaptive Machine Translation&lt;/a&gt; (Moslem, PhD thesis, DCU 2024)&lt;/li&gt;
&lt;/ul&gt;
</description>
        <pubDate>Wed, 10 Jan 2024 00:00:00 +0000</pubDate>
        <link>https://blog.machinetranslation.io/adaptive-mt/</link>
        <guid isPermaLink="true">https://blog.machinetranslation.io/adaptive-mt/</guid>
        
        
        <category>nmt</category>
        
        <category>llm</category>
        
      </item>
    
      <item>
        <title>Purely Synthetic Bilingual Data for Machine Translation?</title>
        <description>&lt;p&gt;In-domain data scarcity is common in translation settings, due to the lack of specialized datasets and terminology, or inconsistency and inaccuracy of available in-domain translations. You might be familiar to such situation when there is a big translation project, but there is only a tiny in-domain translation memory, or no translation memory at all. In the absence of sufficient domain-specific data required to fine-tune machine translation (MT) systems, adhering to the domain terminology and client’s style can be challenging.  Recently, there has been a considerable advancement in training large language models, not only for English, but also for diverse languages. Among autoregressive language models, trained to predict the next word in a sequence, are BLOOM, GPT-3, and GPT-J. The question is: &lt;strong&gt;can we use these large language models to generate more domain-specific &lt;em&gt;bilingual&lt;/em&gt; data?&lt;/strong&gt;&lt;/p&gt;

&lt;h2 id=&quot;method&quot;&gt;Method&lt;/h2&gt;

&lt;p&gt;Interestingly, when you feed such large language models with an in-domain sentence, they can generate more synthetic sentences that simulate the domain and linguistic characteristics of the authentic sentence. In the research “&lt;strong&gt;&lt;em&gt;&lt;a href=&quot;https://aclanthology.org/2022.amta-research.2&quot;&gt;Domain-Specific Text Generation for Machine Translation&lt;/a&gt;&lt;/em&gt;&lt;/strong&gt;” (Moslem et al., 2022), we investigated the feasibility of this domain-specific text generation technique when no bilingual in-domain dataset, or only a limited one, is available. We proposed a novel approach to domain adaptation, leveraging state-of-the-art pre-trained language models to generate huge amounts of synthetic bilingual in-domain data, with the goal of improving translation of in-domain texts. The process can be summarised in three simple steps:&lt;/p&gt;

&lt;h4 id=&quot;1-text-generation-target&quot;&gt;1. Text generation (target)&lt;/h4&gt;

&lt;blockquote&gt;
  &lt;p&gt;Generate target-side synthetic sentences using a large pre-trained language model.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When there is a small in-domain translation memory, you can use each target sentence as a prompt to generate text that simulates the domain characteristics of the authentic in-domain data. If there is no translation memory at all, you can first forward-translate the source text to be translated, or a portion of it, using the baseline MT model.&lt;/p&gt;

&lt;h4 id=&quot;2-back-translation-source&quot;&gt;2. Back-translation (source)&lt;/h4&gt;

&lt;blockquote&gt;
  &lt;p&gt;Back-translate the synthetic target-side sentences into source language.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Combining the idea of in-domain text generation with back-translation, you can generate huge amounts of synthetic bilingual in-domain data for both use cases.&lt;/p&gt;

&lt;h4 id=&quot;3-mixed-fine-tuning&quot;&gt;3. Mixed fine-tuning&lt;/h4&gt;

&lt;blockquote&gt;
  &lt;p&gt;Fine-tune the baseline model, on a mix of synthetic and authentic data.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Finally, the baseline MT model should be fine-tuned using a combination of the synthetic bilingual in-domain dataset and a randomly sampled section of the original generic dataset.&lt;/p&gt;
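&lt;p&gt;The three steps can be sketched end to end. Both &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;generate_synthetic_targets&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;back_translate&lt;/code&gt; are hypothetical placeholders for the language model and the reverse-direction MT model, respectively.&lt;/p&gt;

```python
import random


def generate_synthetic_targets(seed_sentences, n_per_seed):
    # Placeholder for prompting a language model (e.g. GPT-J) with each
    # in-domain target sentence to generate similar synthetic sentences.
    return [f"{s} (variant {i})"
            for s in seed_sentences for i in range(n_per_seed)]


def back_translate(targets):
    # Placeholder for decoding with the reverse-direction MT model.
    return [f"src({t})" for t in targets]


def mixed_finetuning_data(seed_targets, generic_pairs,
                          sample_size, n_per_seed=2):
    targets = generate_synthetic_targets(seed_targets, n_per_seed)  # step 1
    sources = back_translate(targets)                               # step 2
    synthetic = list(zip(sources, targets))
    # Step 3: mix with a random sample of the original generic data.
    sample = random.sample(generic_pairs,
                           min(sample_size, len(generic_pairs)))
    return synthetic + sample
```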

&lt;h2 id=&quot;target-text-generation&quot;&gt;Target Text Generation&lt;/h2&gt;

&lt;p&gt;This code snippet shows how to load the GPT-J language model. You can use some efficient loading techniques, such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;float16&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;low_cpu_mem_usage&lt;/code&gt;.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;transformers&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;GPTJForCausalLM&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AutoTokenizer&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;torch&lt;/span&gt;


&lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AutoTokenizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;from_pretrained&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;EleutherAI/gpt-j-6B&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;padding_side&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;left&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pad_token&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;eos_token&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;GPTJForCausalLM&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;from_pretrained&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;EleutherAI/gpt-j-6B&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                                         &lt;span class=&quot;n&quot;&gt;revision&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;float16&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                                         &lt;span class=&quot;n&quot;&gt;torch_dtype&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;float16&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                                         &lt;span class=&quot;n&quot;&gt;low_cpu_mem_usage&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                                         &lt;span class=&quot;n&quot;&gt;cache_dir&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;models_cache/&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                                         &lt;span class=&quot;n&quot;&gt;pad_token_id&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;eos_token_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;half&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;to&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;cuda&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Afterwards, you can use each target segment in the authentic in-domain dataset as a prompt to generate synthetic in-domain text. We use top-k and top-p sampling to generate diverse text sequences. Here, we set &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;num_return_sequences&lt;/code&gt; to generate 5 sequences per prompt. Each sequence might include multiple sentences, which you can then split apart using any sentence splitter.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;target_segment&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;I am an example sentence that talks about something very specialized!&quot;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;input_ids&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;target_segment&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;add_special_tokens&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;return_tensors&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;pt&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;input_ids&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;to&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;cuda&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;sample_outputs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;generate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;input_ids&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                                &lt;span class=&quot;n&quot;&gt;do_sample&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                                &lt;span class=&quot;n&quot;&gt;max_length&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;300&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                                &lt;span class=&quot;n&quot;&gt;top_k&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;50&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                                &lt;span class=&quot;n&quot;&gt;top_p&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.95&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                                &lt;span class=&quot;n&quot;&gt;num_return_sequences&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                                &lt;span class=&quot;n&quot;&gt;early_stopping&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;generated_text&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;batch_decode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sample_outputs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;skip_special_tokens&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
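&lt;p&gt;Since each generated sequence may contain several sentences, a splitting step is needed before the output can be used as training segments. The following is a minimal rule-based sketch written for this post; a dedicated sentence splitter (e.g. NLTK or spaCy) handles abbreviations and other edge cases better.&lt;/p&gt;

```python
import re

def split_sentences(text):
    """Naively split a generated text sequence into sentences.

    Captures runs of characters ending in ., ! or ?, plus a final
    fragment without terminal punctuation. A dedicated sentence
    splitter is more robust (abbreviations, decimals, etc.).
    """
    pieces = re.findall(r"[^.!?]+[.!?]+|[^.!?]+$", text)
    return [piece.strip() for piece in pieces if piece.strip()]

# Flatten the generated sequences into individual synthetic sentences
sequences = ["India ordered a shutdown. The pandemic spread worldwide!"]
synthetic_sentences = []
for sequence in sequences:
    synthetic_sentences.extend(split_sentences(sequence))
```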

&lt;p&gt;The quality of the language model is important. Here, you can see examples of generated text that is both linguistically correct and factually accurate.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;In March 2020, India ordered the countrywide shut down of all non-essential economic activities due to the spreading COVID-19 pandemic.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;While the overall worldwide economic impact of COVID-19 will only be realized through the end of 2020 and the recovery phase in 2021, it is clear that certain parts of the world have been severely impacted.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Sometimes, the generated text can be linguistically correct; however, numbers or names might be inaccurate.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Antiviral drugs are approved for pregnant women and should be considered for children younger than &lt;strong&gt;XX&lt;/strong&gt; years, although some are still being investigated.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;Scientists have found some species of &lt;strong&gt;unicorn&lt;/strong&gt; in Amazon rainforests.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If there are only small mistakes, the generated synthetic data can still be used. Obviously, the better the quality of the text generated by the language model, the better the quality we can expect when fine-tuning the baseline MT model on this synthetic data.&lt;/p&gt;

&lt;h2 id=&quot;back-translation&quot;&gt;Back-Translation&lt;/h2&gt;

&lt;p&gt;Now, we have the target side of our new in-domain dataset. To generate the source side, use back-translation in the reverse language direction. For back-translation, you can either train another MT model yourself or use pre-trained models such as &lt;a href=&quot;https://github.com/Helsinki-NLP/Tatoeba-Challenge/blob/master/results/tatoeba-models-all.md&quot;&gt;OPUS models&lt;/a&gt;. Optionally, you can convert OPUS models to the &lt;a href=&quot;https://opennmt.net/CTranslate2/&quot;&gt;CTranslate2&lt;/a&gt; format, with quantisation, to enhance efficiency.&lt;/p&gt;
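&lt;p&gt;Whatever MT backend you pick, the back-translation step itself is just batched inference over the synthetic target sentences. The sketch below is illustrative, not part of our released scripts: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;translate_fn&lt;/code&gt; is a stand-in for whatever engine you use (e.g. a CTranslate2 translator loaded from an OPUS model), and only the batching logic is shown.&lt;/p&gt;

```python
def backtranslate(target_sentences, translate_fn, batch_size=32):
    """Back-translate synthetic target-side sentences into the source language.

    translate_fn: any callable mapping a list of target-language sentences
    to a list of source-language sentences (i.e. the actual MT engine).
    """
    source_sentences = []
    for start in range(0, len(target_sentences), batch_size):
        batch = target_sentences[start:start + batch_size]
        source_sentences.extend(translate_fn(batch))
    return source_sentences

# Pair each back-translated source sentence with its synthetic target,
# using a dummy "engine" purely for illustration
targets = ["Hello.", "Goodbye."]
dummy_engine = lambda batch: ["SRC " + sentence for sentence in batch]
bitext = list(zip(backtranslate(targets, dummy_engine), targets))
```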

&lt;p&gt;Basically, both the source and target sides of our new large in-domain dataset consist of synthetic data. The target side is generated by a language model, while the source side is generated by back-translation in the reverse language direction.&lt;/p&gt;

&lt;h2 id=&quot;mixed-fine-tuning&quot;&gt;Mixed Fine-Tuning&lt;/h2&gt;

&lt;p&gt;Now, it’s time to apply mixed fine-tuning to the baseline model.&lt;/p&gt;

&lt;p&gt;In other words, continue training our baseline model on a mix of (a) the synthetic bilingual in-domain dataset we obtained from the two previous steps, and (b) a randomly sampled portion of the original generic dataset. In our experiments, we oversampled the synthetic in-domain dataset by a 9x ratio.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../static/img/mixed-fine-tuning-oversampling.png&quot; alt=&quot;Mixed Fine-Tuning Oversampling&quot; /&gt;&lt;/p&gt;

&lt;center&gt;Mixed fine-tuning (Chu et al., 2017)&lt;/center&gt;

&lt;p&gt; &lt;/p&gt;
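&lt;p&gt;At the file level, the mixed fine-tuning recipe can be sketched as follows. This is an illustrative sketch, not the exact scripts used in the paper; the function name and defaults are assumptions made for the example.&lt;/p&gt;

```python
import random

def build_mixed_dataset(in_domain, generic, oversample=9,
                        generic_sample_size=None, seed=42):
    """Mix the synthetic in-domain data (oversampled) with a random
    sample of the original generic data, then shuffle the result."""
    rng = random.Random(seed)
    if generic_sample_size is None:
        generic_sample_size = len(generic)
    generic_part = rng.sample(generic, min(generic_sample_size, len(generic)))
    mixed = in_domain * oversample + generic_part
    rng.shuffle(mixed)
    return mixed
```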

&lt;p&gt;To apply oversampling, we employed the dataset weights feature in OpenNMT-tf. If you are using OpenNMT-py or OpenNMT-tf, you can find more details in this &lt;a href=&quot;https://blog.machinetranslation.io/domain-adaptation-mixed-fine-tuning/&quot;&gt;tutorial on mixed fine-tuning&lt;/a&gt; of MT models.&lt;/p&gt;
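&lt;p&gt;For reference, OpenNMT-py (version 2 and later) exposes a similar per-corpus &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;weight&lt;/code&gt; option in its YAML data configuration. The snippet below is a sketch with placeholder paths, not our exact configuration.&lt;/p&gt;

```yaml
# Sketch of an OpenNMT-py (v2+) data section with corpus weighting.
# Paths are placeholders; the weights implement the 9:1 oversampling.
data:
    generic:
        path_src: data/generic.src
        path_tgt: data/generic.tgt
        weight: 1
    in_domain:
        path_src: data/synthetic_in_domain.src
        path_tgt: data/synthetic_in_domain.tgt
        weight: 9
```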

&lt;h2 id=&quot;results&quot;&gt;Results&lt;/h2&gt;

&lt;p&gt;In both scenarios, our proposed method achieves significant improvements, as demonstrated by both automatic and human evaluations. As expected, Setup 1 (where there is a tiny bilingual dataset) yields better results than Setup 2 (where there is no bilingual dataset at all). Still, the models resulting from both setups outperform the baseline model.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../static/img/MT-ML-results.png&quot; alt=&quot;Results&quot; /&gt;&lt;/p&gt;

&lt;center&gt;Evaluation results on the in-domain test set, TICO-19&lt;/center&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Previously, synthetic data for machine translation had been created either on the source side only (forward-translation) or the target side only (back-translation). In some cases, researchers replaced a few words from the source and/or the target with synonyms or similar words. The assumption was that “relevant” monolingual data was available, which was not always the case!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In real-life scenarios&lt;/strong&gt;, things can get more complex. Usually, there is insufficient human-produced data to train or fine-tune high-quality MT systems. Production-level projects can be highly specialised, while mining crawled monolingual datasets is inefficient and not necessarily helpful.&lt;/p&gt;

&lt;p&gt;This research work &lt;strong&gt;generates brand new synthetic data on both the source and target sides&lt;/strong&gt;. It employs large language models to put together coherent sentences similar to those to be translated in the current project. Then, the new synthetic data can be used to fine-tune production-level MT systems for domain-specific scenarios.  Feel free to check out our paper, &lt;a href=&quot;https://aclanthology.org/2022.amta-research.2&quot;&gt;&lt;em&gt;Domain-Specific Text Generation for Machine Translation&lt;/em&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;download-scripts&quot;&gt;Download Scripts&lt;/h2&gt;

&lt;p&gt;You can download our scripts and configuration files at &lt;a href=&quot;https://github.com/ymoslem/MT-LM&quot;&gt;GitHub&lt;/a&gt;. If you have applied the method and/or have questions, please let me know.&lt;/p&gt;

&lt;h2 id=&quot;citation&quot;&gt;Citation&lt;/h2&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;@inproceedings{moslem-etal-2022-domain,
    title = &quot;Domain-Specific Text Generation for Machine Translation&quot;,
    author = &quot;Moslem, Yasmin  and
      Haque, Rejwanul  and
      Kelleher, John  and
      Way, Andy&quot;,
    booktitle = &quot;Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)&quot;,
    month = sep,
    year = &quot;2022&quot;,
    address = &quot;Orlando, USA&quot;,
    publisher = &quot;Association for Machine Translation in the Americas&quot;,
    url = &quot;https://aclanthology.org/2022.amta-research.2&quot;,
    pages = &quot;14--30&quot;,
    abstract = &quot;Preservation of domain knowledge from the source to target is crucial in any translation workflow. It is common in the translation industry to receive highly specialized projects, where there is hardly any parallel in-domain data. In such scenarios where there is insufficient in-domain data to fine-tune Machine Translation (MT) models, producing translations that are consistent with the relevant context is challenging. In this work, we propose a novel approach to domain adaptation leveraging state-of-the-art pretrained language models (LMs) for domain-specific data augmentation for MT, simulating the domain characteristics of either (a) a small bilingual dataset, or (b) the monolingual source text to be translated. Combining this idea with back-translation, we can generate huge amounts of synthetic bilingual in-domain data for both use cases. For our investigation, we use the state-of-the-art Transformer architecture. We employ mixed fine-tuning to train models that significantly improve translation of in-domain texts. More specifically, in both scenarios, our proposed methods achieve improvements of approximately 5-6 BLEU and 2-3 BLEU, respectively, on the Arabic-to-English and English-to-Arabic language pairs. Furthermore, the outcome of human evaluation corroborates the automatic evaluation results&quot;,
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
</description>
        <pubDate>Mon, 12 Dec 2022 00:00:00 +0000</pubDate>
        <link>https://blog.machinetranslation.io/synthetic-data-machine-translation/</link>
        <guid isPermaLink="true">https://blog.machinetranslation.io/synthetic-data-machine-translation/</guid>
        
        
        <category>nmt</category>
        
      </item>
    
      <item>
        <title>Translation Auto-suggestions: What do Linguists Think?</title>
        <description>&lt;p&gt;Translation auto-suggestion and auto-completion are among the important features that can help translators better utilize Machine Translation (MT) systems. In a Computer-Aided Translation (CAT) environment, a translator can make use of the MT word auto-suggestion feature as follows:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;when the translator types a few words, or clicks a word in a proposed MT translation, a list of suggestions is displayed;&lt;/li&gt;
  &lt;li&gt;when the translator selects one of the word suggestions from the list, the rest of the translation is modified accordingly.&lt;/li&gt;
&lt;/ul&gt;
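&lt;p&gt;As a toy illustration of this interaction (not a production auto-completion system), one can extract next-word suggestions from the MT hypotheses that are consistent with the prefix the translator has accepted so far. The function name and logic here are assumptions made for the example.&lt;/p&gt;

```python
def suggest_next_words(prefix, hypotheses):
    """Return candidate next words from MT hypotheses whose beginning
    matches the prefix the translator has typed or accepted so far."""
    prefix_tokens = prefix.split()
    suggestions = []
    for hypothesis in hypotheses:
        tokens = hypothesis.split()
        # The hypothesis must start with the prefix and continue past it
        if tokens[:len(prefix_tokens)] == prefix_tokens and len(tokens) != len(prefix_tokens):
            word = tokens[len(prefix_tokens)]
            if word not in suggestions:
                suggestions.append(word)
    return suggestions
```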

&lt;p&gt;In a user survey we designed and distributed via social media networks, we asked participants whether they thought an MT word-level auto-suggestions feature could be helpful, and provided a simple definition and an illustrative image. If their answer was “Yes”, the respondent was asked to specify a reason. By the time of writing this article, we had received 41 responses to our survey. While we do not believe this survey is enough to justify introducing an auto-suggestions feature into every MT system, it can be an indicator as to why some users think such a feature could be helpful.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../static/img/autosuggest.png&quot; alt=&quot;MT-autosuggestions&quot; /&gt;&lt;/p&gt;

&lt;p&gt;To answer the question, “&lt;em&gt;Which of the following best describes you?&lt;/em&gt;” 46.3% (19) of the respondents chose “&lt;em&gt;Translator/Linguist&lt;/em&gt;”, 31.7% (13) selected “&lt;em&gt;NLP Engineer/Researcher&lt;/em&gt;”, and the remaining 22% (9) were other “&lt;em&gt;MT Users&lt;/em&gt;” not included in the two aforementioned categories.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../static/img/autosuggest-categories.png&quot; alt=&quot;users&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Among the respondents to the survey, 90.2% (37) answered “Yes” to the question “&lt;em&gt;In general, do you believe that a word-level auto-suggestions feature is helpful?&lt;/em&gt;” The figure below shows the breakdown of answers to the question, “&lt;em&gt;Why do you believe that a word-level auto-suggestions feature can be helpful?&lt;/em&gt;” taking into consideration those who answered “No” to the previous question.&lt;/p&gt;

&lt;p&gt;Out of the 37 respondents who believed a word-level auto-suggestions feature can be helpful, 40.5% (15) specified that it can give them some inspiration. This answer is particularly interesting as it is not constrained by time-saving benefits; hence, it focuses more on effectiveness than efficiency. The respondent who answered “Other” mentioned that the feature allows them to look for alternative senses or phrasings, especially when they suspect the initial translation is bad, and referred to this as “human in the loop”.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../static/img/autosuggest-why.png&quot; alt=&quot;autosuggest-why&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Respondents were allowed to give extra comments; among the notable comments were:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;small&gt;I think word-level suggestions can be a useful feature, particularly when the target language can have several translations of a single source word.&lt;/small&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;small&gt;Word-level suggestions can be helpful, but sometimes you end up spending a lot of time figuring out if the MT suggestion is a valid translation in that context. So, I’m not really sure yet how I feel about it.&lt;/small&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;small&gt;It’s useful, as long as it’s seen as a suggestion, and not inserted in the target where the translator is typing.&lt;/small&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Among the respondents who answered “For me, it is easier or faster than typing”, comments included:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;small&gt;Though most of the time; the suggestions are lousy.&lt;/small&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;small&gt;I don’t think it gives me inspiration as I mostly need it for structures, not single words.&lt;/small&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;small&gt;Auto-suggestion does not have to come from machine translation. History is much more useful.&lt;/small&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The last comment above might refer to the fact that in some CAT tools, auto-suggestions can also include glossary terms and translation memory sub-segments. This encourages further research into methods that enhance the leveraging of, and interaction between, various translation resources in human-in-the-loop environments.&lt;/p&gt;

&lt;p&gt;We hope this survey will inspire future user studies to look deeper into how diverse users of MT and CAT tools prefer to utilize certain features, such as auto-suggestions, and the value they seek. More aspects should be taken into consideration such as language pairs, translation workflows, and user interfaces. This can help improve these features to better support linguists and other MT users and boost their productivity as well as translation quality.&lt;/p&gt;

&lt;h3 id=&quot;citation&quot;&gt;Citation&lt;/h3&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;@inproceedings{moslem2022-autosuggest,
    title = &quot;Word-Level Auto-Completion: What can we achieve out of the box?&quot;,
    author = &quot;Moslem, Yasmin  and
      Haque, Rejwanul  and
      Way, Andy&quot;,
    booktitle = &quot;Proceedings of the Seventh Conference on Machine Translation&quot;,
    month = dec,
    year = &quot;2022&quot;,
    address = &quot;Abu Dhabi, UAE&quot;,
    publisher = &quot;Association for Computational Linguistics&quot;,
    url = &quot;https://arxiv.org/abs/2210.12802&quot;,
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
</description>
        <pubDate>Mon, 24 Oct 2022 00:00:00 +0000</pubDate>
        <link>https://blog.machinetranslation.io/translation-autosuggestion-survey/</link>
        <guid isPermaLink="true">https://blog.machinetranslation.io/translation-autosuggestion-survey/</guid>
        
        
        <category>nmt</category>
        
      </item>
    
      <item>
        <title>Machine Translation Robustness</title>
        <description>&lt;p&gt;Let’s talk briefly about the concept of “Robustness” of neural machine translation (NMT) systems. While robustness should be emphasized when building any NMT system, even high-resource languages with plenty of data can still face linguistic challenges.&lt;/p&gt;

&lt;h2 id=&quot;what-does-nmt-robustness-mean&quot;&gt;What does NMT “Robustness” mean?&lt;/h2&gt;

&lt;p&gt;It simply means that a given NMT engine can handle a specific linguistic feature found in the input to be translated, even if this feature does not occur naturally in the training data. Examples of linguistic features we want our NMT model to be robust to include: domain terminology, proper names, number formats, text case, misspellings, code-switching (between two languages), and untranslatables such as tags, email addresses, etc.&lt;/p&gt;

&lt;h2 id=&quot;how-can-we-improve-nmt-robustness&quot;&gt;How can we identify robustness issues?&lt;/h2&gt;

&lt;p&gt;The first step to machine translation robustness is defining the issues that your model frequently encounters when translating a certain type of text. This step is underestimated, and in my opinion it is a sign of the maturity of production-level operations.&lt;/p&gt;

&lt;p&gt;This goes beyond numerical human evaluation, and moves a step further towards defining specific types of issues. In simple words, human evaluators are asked to state a clear reason why they think a translation should be ranked as, for example, 3 out of 5. At the beginning, they might be provided with lists of common issues, but they should also have the option to add new issues that can be integrated into the list later. Such explanations should not be vague; they should be precise enough to allow MT engineers to fix these issues. Problematic words should be marked; sometimes the track-changes feature is used. The main question is: &lt;em&gt;What is the most critical issue in this translation?&lt;/em&gt;&lt;/p&gt;

&lt;h2 id=&quot;how-can-we-improve-nmt-robustness-1&quot;&gt;How can we improve NMT “Robustness”?&lt;/h2&gt;

&lt;p&gt;In the findings of the WMT2020 Robustness Shared Task, under the “Common Trends” section, Specia et al. (2020) stated: “Participating systems were trained following a standard recipe, i) using big-transformer models, ii) boosting performance with tagged back-translation, iii) continued training with filtered data and in-domain data (where available), iv) ensembling different models to obtain further improvements.”&lt;/p&gt;

&lt;p&gt;In this sense, data augmentation techniques can be helpful; the new data can then be integrated into the NMT system, either by combining it with the original training data or by fine-tuning.&lt;/p&gt;

&lt;p&gt;As training a new system frequently might not be feasible, it is common in some companies to temporarily apply on-the-fly find-and-replace operations on translations until the next training is possible. Some researchers also suggest making such on-the-fly handling easier by injecting the training data with certain placeholders that can be replaced later. To apply this, in a portion of the training data, natural tags (HTML, XML, long numbers, etc.) are replaced with pseudo-tags (e.g. &amp;lt;t0&amp;gt;, &amp;lt;t1&amp;gt;, &amp;lt;t2&amp;gt;, …). These pseudo-tags should also be added as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;user_defined_symbols&lt;/code&gt; to the SentencePiece model (cf. SPM &lt;a href=&quot;https://github.com/google/sentencepiece/blob/master/doc/options.md&quot;&gt;options&lt;/a&gt;). At inference time, it is then easy to swap untranslatables for these pseudo-tags during pre-processing and restore them during post-processing. On a related note, activating the SentencePiece option &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;split_digits&lt;/code&gt; helps with copying longer numbers without intervention, while the option &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;byte_fallback&lt;/code&gt; sometimes helps with irregular characters in the training data.&lt;/p&gt;
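&lt;p&gt;The placeholder scheme can be sketched as a pre-/post-processing pair: natural tags are swapped for numbered pseudo-tags before translation and restored afterwards. This is a simplified illustration; the pseudo-tag format must match the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;user_defined_symbols&lt;/code&gt; added to the SentencePiece model, and a production implementation needs to guard against collisions with pseudo-tags occurring in the input.&lt;/p&gt;

```python
import re

# \u003c and \u003e are Python escapes for the two angle-bracket characters
TAG_PATTERN = re.compile("\u003c[^\u003e]+\u003e")

def mask_tags(sentence):
    """Replace natural tags with numbered pseudo-tags.
    Returns the masked sentence and a pseudo-tag to original-tag mapping."""
    mapping = {}
    def replace(match):
        pseudo = "\u003ct%d\u003e" % len(mapping)
        mapping[pseudo] = match.group(0)
        return pseudo
    return TAG_PATTERN.sub(replace, sentence), mapping

def unmask_tags(translation, mapping):
    """Restore the original tags in the translated output."""
    for pseudo, original in mapping.items():
        translation = translation.replace(pseudo, original)
    return translation
```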

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://themqm.org/&quot;&gt;MQM&lt;/a&gt; - Multidimensional Quality Metrics (&lt;a href=&quot;https://aclanthology.org/2013.tc-1.6/&quot;&gt;Lommel et al., 2013&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;Training Neural Machine Translation to Apply Terminology Constraints (&lt;a href=&quot;https://aclanthology.org/P19-1294/&quot;&gt;Dinu et al., 2019&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;Improving Robustness in Real-World Neural Machine Translation Engines (&lt;a href=&quot;https://aclanthology.org/W19-6727/&quot;&gt;Gupta et al., 2019&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;How Should Markup Tags Be Translated? (&lt;a href=&quot;https://aclanthology.org/2020.wmt-1.138/&quot;&gt;Hanneman and Dinu, 2020&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;Evaluating Robustness to Input Perturbations for Neural Machine Translation (&lt;a href=&quot;https://arxiv.org/abs/2005.00580&quot;&gt;Niu et al., 2020&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;Findings of the WMT 2020 Shared Task on Machine Translation Robustness (&lt;a href=&quot;https://aclanthology.org/2020.wmt-1.4/&quot;&gt;Specia et al., 2020&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;Business Critical Errors: A Framework for Adaptive Quality Feedback (&lt;a href=&quot;https://aclanthology.org/2022.amta-upg.17/&quot;&gt;Stewart et al., 2022&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;Improve MT for Search with Selected Translation Memory using Search Signals (&lt;a href=&quot;https://aclanthology.org/2022.amta-upg.9/&quot;&gt;Zhang 2022&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
</description>
        <pubDate>Wed, 28 Sep 2022 00:00:00 +0000</pubDate>
        <link>https://blog.machinetranslation.io/machine-translation-robustness/</link>
        <guid isPermaLink="true">https://blog.machinetranslation.io/machine-translation-robustness/</guid>
        
        
        <category>nmt</category>
        
      </item>
    
      <item>
        <title>Machine Translation Models: How to Build and Deploy</title>
        <description>&lt;p&gt;This is a Neural Machine Translation (NMT) tutorial with &lt;a href=&quot;https://github.com/ymoslem/OpenNMT-py&quot;&gt;OpenNMT-py&lt;/a&gt; and relevant tools. It covers data preprocessing, model training, evaluation, and deployment. The tutorial was put together as part of a mentorship activity I organised in 2022.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/ymoslem/OpenNMT-Tutorial&quot;&gt;&lt;img src=&quot;https://github-readme-stats.vercel.app/api/pin/?theme=graywhite&amp;amp;username=ymoslem&amp;amp;repo=OpenNMT-Tutorial&quot; alt=&quot;NMT-tutorial&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;fundamentals&quot;&gt;Fundamentals&lt;/h2&gt;
&lt;ul&gt;
  &lt;li&gt;Data Processing (&lt;a href=&quot;1-NMT-Data-Processing.ipynb&quot;&gt;notebook&lt;/a&gt;, &lt;a href=&quot;https://github.com/ymoslem/MT-Preparation&quot;&gt;repository&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;NMT Model Training with OpenNMT-py (&lt;a href=&quot;2-NMT-Training.ipynb&quot;&gt;notebook&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;Translation/Inference with CTranslate2 (&lt;a href=&quot;https://gist.github.com/ymoslem/60e1d1dc44fe006f67e130b6ad703c4b&quot;&gt;code&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;MT Evaluation with BLEU (&lt;a href=&quot;https://blog.machinetranslation.io/compute-bleu-score/&quot;&gt;tutorial&lt;/a&gt;, &lt;a href=&quot;https://github.com/ymoslem/MT-Evaluation&quot;&gt;repository&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;Simple Web UI (&lt;a href=&quot;https://blog.machinetranslation.io/nmt-web-interface/&quot;&gt;tutorial&lt;/a&gt;, &lt;a href=&quot;https://github.com/ymoslem/OpenNMT-Web-Interface&quot;&gt;repository&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;advanced-topics&quot;&gt;Advanced Topics&lt;/h2&gt;
&lt;ul&gt;
  &lt;li&gt;Running TensorBoard with OpenNMT (&lt;a href=&quot;https://blog.machinetranslation.io/TensorBoard/&quot;&gt;tutorial&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;Low-Resource Neural Machine Translation (&lt;a href=&quot;https://blog.machinetranslation.io/low-resource-nmt/&quot;&gt;tutorial&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;Domain Adaptation with Mixed Fine-tuning (&lt;a href=&quot;https://blog.machinetranslation.io/domain-adaptation-mixed-fine-tuning/&quot;&gt;tutorial&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;Overview of Domain Adaptation Techniques (&lt;a href=&quot;https://amtaweb.org/wp-content/uploads/2020/11/NMTDomainAdaptationTechniques.pdf&quot;&gt;tutorial&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;Multilingual Machine Translation (&lt;a href=&quot;https://blog.machinetranslation.io/multilingual-nmt/&quot;&gt;tutorial&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;Using Pre-trained NMT models with CTranslate2 (&lt;a href=&quot;https://gist.github.com/ymoslem/a414a0ead0d3e50f4d7ff7110b1d1c0d&quot;&gt;tutorial&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
</description>
        <pubDate>Tue, 15 Mar 2022 00:00:00 +0000</pubDate>
        <link>https://blog.machinetranslation.io/OpenNMT-tutorial/</link>
        <guid isPermaLink="true">https://blog.machinetranslation.io/OpenNMT-tutorial/</guid>
        
        
        <category>nmt</category>
        
      </item>
    
      <item>
        <title>Mixed Fine-Tuning - Domain Adaptation That Works!</title>
<description>&lt;p&gt;Training a robust generic model is an interesting task. However, when you want to customize your Machine Translation model to adhere to the terminology and style of a certain domain or client, Domain Adaptation comes into play. In previous posts, we discussed several approaches to Domain Adaptation. In this post, we are going to concentrate on a very effective approach called &lt;strong&gt;Mixed Fine-Tuning&lt;/strong&gt;, originally proposed by &lt;a href=&quot;https://aclanthology.org/P17-2061/&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;Chu et al., 2017&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Regular fine-tuning of an NMT model usually consists of two steps:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Building a baseline NMT model, e.g. a generic model.&lt;/li&gt;
  &lt;li&gt;Continuing training the baseline NMT model on an in-domain dataset.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;However, fine-tuning in this way can lead to “catastrophic forgetting”: the model overfits the in-domain data, starts forgetting the information learned from the baseline data, and loses generalization. In practice, compared to the baseline model, the in-domain model would give a better BLEU score and human evaluation for sentences very similar to the in-domain training dataset, but a worse BLEU score for out-of-domain sentences or even new in-domain sentences.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Unlike plain fine-tuning, in the Mixed Fine-Tuning approach (Chu et al., 2017), you randomly sample a portion from the generic data you used to train the baseline model, and use it during the fine-tuning step along with the in-domain dataset. Over-sampling the in-domain data is the main trick.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The training procedure of the Mixed Fine-tuning approach is as follows:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Train a baseline NMT model on out-of-domain data until convergence.&lt;/li&gt;
  &lt;li&gt;Continue training the NMT baseline model on a &lt;em&gt;mix&lt;/em&gt; of in-domain and out-of-domain data (by &lt;em&gt;oversampling&lt;/em&gt; the in-domain data) until convergence.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In NMT tools, such as OpenNMT and MarianMT, &lt;em&gt;dataset weights&lt;/em&gt; can be used to replicate over-sampling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Dataset Counts:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Generic Dataset: 1,000,000 sentences&lt;/li&gt;
  &lt;li&gt;In-domain Dataset: 100,000 sentences&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use weights of 1:10 so that training takes 1 sentence from the bigger generic dataset for every 10 sentences from the smaller in-domain dataset.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Generic Dataset: 1&lt;/li&gt;
  &lt;li&gt;In-domain Dataset: 10&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this example, we sequentially sample 1 example from the “Generic Dataset”, then 10 examples from the “In-domain Dataset”, and so on. By giving the “In-domain Dataset” a higher weight, the model can learn the style and terminology of the in-domain dataset while still being able to generalize, i.e. output high-quality translations for out-of-domain sentences.&lt;/p&gt;

&lt;p&gt;Setting the dataset weights differs from one tool to another. In &lt;a href=&quot;https://opennmt.net/OpenNMT-py/FAQ.html#how-can-i-weight-different-corpora-at-training&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;OpenNMT-py&lt;/a&gt;, dataset weights are set as numbers as in the aforementioned example. In &lt;a href=&quot;https://opennmt.net/OpenNMT-tf/data.html?highlight=weighted#weighted-dataset&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;OpenNMT-tf&lt;/a&gt;, dataset weights are set as ratios.&lt;/p&gt;
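&lt;p&gt;As a tool-agnostic illustration of what such weights do, the following sketch interleaves two toy corpora with a 1:10 weighting. The dataset names and sizes here are purely illustrative; real toolkits handle the sampling internally.&lt;/p&gt;

```python
from itertools import cycle, islice

def weighted_stream(datasets, weights):
    """Yield training examples round-robin, taking `weights[name]`
    examples per pass from each (infinitely cycled) dataset."""
    iterators = {name: cycle(examples) for name, examples in datasets.items()}
    while True:
        for name, weight in weights.items():
            for _ in range(weight):
                yield next(iterators[name])

# Toy corpora mirroring the example above (scaled down 1000x)
datasets = {
    "generic": [f"generic-{i}" for i in range(1000)],
    "in_domain": [f"domain-{i}" for i in range(100)],
}
weights = {"generic": 1, "in_domain": 10}

# Each pass of 11 examples contains 1 generic and 10 in-domain sentences
batch = list(islice(weighted_stream(datasets, weights), 11))
print(sum(s.startswith("domain") for s in batch))  # 10
```

&lt;p&gt;Because the smaller corpus is cycled, the in-domain data is effectively over-sampled until both corpora contribute equally to training.&lt;/p&gt;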

&lt;p&gt;&lt;strong&gt;Further notes on the Mixed Fine-tuning approach&lt;/strong&gt; (feel free to experiment with something different, though!)&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;The approach works well for in-domain datasets between 50k and 500k sentences. For very small in-domain datasets, this approach might &lt;em&gt;not&lt;/em&gt; work well; for bigger in-domain datasets, you might want to try different weights; and for very big in-domain datasets, you can just use the in-domain dataset alone, enriching it with missing aspects like shorter sentences, if needed.&lt;/li&gt;
  &lt;li&gt;If your baseline training data is too big, you can randomly extract a generic sample about 10 times the size of the in-domain data to use in the mix.&lt;/li&gt;
  &lt;li&gt;If both the generic and in-domain data are available before training the baseline, we build the vocabulary and SentencePiece models on all datasets, both generic and in-domain datasets.&lt;/li&gt;
  &lt;li&gt;During fine-tuning, we extract a dev/validation dataset from the in-domain dataset only.&lt;/li&gt;
  &lt;li&gt;After fine-tuning, we use two test datasets, one that we used for the out-of-domain baseline, and one extracted from the in-domain dataset, to make sure the model works in both cases.&lt;/li&gt;
  &lt;li&gt;To alleviate “catastrophic forgetting” on generic data, consider averaging the baseline model with the fine-tuned model.&lt;/li&gt;
&lt;/ul&gt;
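&lt;p&gt;The model-averaging trick in the last note is just an element-wise mean over the two models’ parameters. Below is a minimal numeric sketch with toy dictionaries; real toolkits (e.g. OpenNMT) ship checkpoint-averaging utilities that operate on actual tensors.&lt;/p&gt;

```python
def average_checkpoints(baseline, fine_tuned, alpha=0.5):
    """Element-wise weighted mean of two flat parameter dicts;
    alpha is the weight given to the baseline model."""
    assert baseline.keys() == fine_tuned.keys()
    return {
        name: [alpha * b + (1 - alpha) * f
               for b, f in zip(baseline[name], fine_tuned[name])]
        for name in baseline
    }

# Toy checkpoints: parameter name to flat list of weights
baseline   = {"encoder.w": [0.0, 2.0], "decoder.w": [1.0, 0.0]}
fine_tuned = {"encoder.w": [1.0, 4.0], "decoder.w": [0.0, 2.0]}

print(average_checkpoints(baseline, fine_tuned))
# {'encoder.w': [0.5, 3.0], 'decoder.w': [0.5, 1.0]}
```

&lt;p&gt;Pulling the averaged parameters halfway back towards the baseline is what softens the “catastrophic forgetting” effect on generic data.&lt;/p&gt;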

&lt;p&gt;One advantage of the Mixed Fine-tuning approach is that the fine-tuned in-domain NMT model still works well on both unseen in-domain data and generic/out-of-domain data. Moreover, the approach can be fully automated (e.g. for various clients) once you verify it for your use cases.&lt;/p&gt;

&lt;p&gt;It is worth mentioning that we have successfully applied the Mixed Fine-Tuning approach, proposed by Chu et al. (2017), in production-level scenarios in the industry. We also employed it in a number of our Domain Adaptation and Low-Resource NMT papers such as &lt;a href=&quot;https://aclanthology.org/2020.icon-adapmt.4/&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;Haque et al. (2020)&lt;/a&gt; in combination with other approaches, through which we achieved the first place at ICON 2020 shared task, as well as &lt;a href=&quot;https://aclanthology.org/2022.amta-research.2/&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;Moslem et al. (2022)&lt;/a&gt; where we used synthetic in-domain data.&lt;/p&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Axelrod, A., He, X., &amp;amp; Gao, J. (2011). Domain Adaptation via Pseudo In-Domain Data Selection. &lt;em&gt;Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing&lt;/em&gt;, 355–362. &lt;a href=&quot;https://aclanthology.org/D11-1033&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://aclanthology.org/D11-1033&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Chinea-Ríos, M., Peris, Á., &amp;amp; Casacuberta, F. (2017). Adapting Neural Machine Translation with Parallel Synthetic Data. &lt;em&gt;Proceedings of the Second Conference on Machine Translation&lt;/em&gt;, 138–147. &lt;a href=&quot;https://doi.org/10.18653/v1/W17-4714&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://doi.org/10.18653/v1/W17-4714&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Chu, C., Dabre, R., &amp;amp; Kurohashi, S. (2017). An Empirical Comparison of Domain Adaptation Methods for Neural Machine Translation. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 385–391. &lt;a href=&quot;https://doi.org/10.18653/v1/P17-2061&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://doi.org/10.18653/v1/P17-2061&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Freitag, M., &amp;amp; Al-Onaizan, Y. (2016). Fast Domain Adaptation for Neural Machine Translation. In &lt;em&gt;arXiv [cs.CL]&lt;/em&gt;. arXiv. &lt;a href=&quot;http://arxiv.org/abs/1612.06897&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;http://arxiv.org/abs/1612.06897&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Haque, R., Moslem, Y., &amp;amp; Way, A. (2020). Terminology-Aware Sentence Mining for NMT Domain Adaptation: ADAPT’s Submission to the Adap-MT 2020 English-to-Hindi AI Translation Shared Task. Proceedings of the 17th International Conference on Natural Language Processing (ICON): Adap-MT 2020 Shared Task, 17–23. &lt;a href=&quot;https://aclanthology.org/2020.icon-adapmt.4&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://aclanthology.org/2020.icon-adapmt.4&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Kobus, C., Crego, J., &amp;amp; Senellart, J. (2017). Domain Control for Neural Machine Translation. &lt;em&gt;Proceedings of Recent Advances in Natural Language Processing&lt;/em&gt;, 372–378. &lt;a href=&quot;http://arxiv.org/abs/1612.06140&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;http://arxiv.org/abs/1612.06140&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Luong, M.-T., &amp;amp; Manning, C. 2015. Stanford neural machine translation systems for spoken language domains. &lt;em&gt;Proceedings of the 12th International Workshop on Spoken Language Translation: Evaluation Campaign&lt;/em&gt;, 76–79. &lt;a href=&quot;https://aclanthology.org/2015.iwslt-evaluation.11&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://aclanthology.org/2015.iwslt-evaluation.11&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Moslem, Y., Haque, R., Kelleher, J., &amp;amp; Way, A. (2022). Domain-Specific Text Generation for Machine Translation. &lt;em&gt;Proceedings of the 15th Biennial Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)&lt;/em&gt;, 14–30. &lt;a href=&quot;https://aclanthology.org/2022.amta-research.2&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://aclanthology.org/2022.amta-research.2&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Moslem, Y. (2024). Language Modelling Approaches to Adaptive Machine Translation. In &lt;em&gt;arXiv [cs.CL]&lt;/em&gt;. arXiv. &lt;a href=&quot;http://arxiv.org/abs/2401.14559&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;http://arxiv.org/abs/2401.14559&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Saunders, D. (2022). Domain Adaptation and Multi-Domain Adaptation for Neural Machine Translation: A Survey. &lt;em&gt;Journal of Artificial Intelligence Research&lt;/em&gt;, &lt;em&gt;75&lt;/em&gt;, 351–424. &lt;a href=&quot;https://doi.org/10.1613/jair.1.13566&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://doi.org/10.1613/jair.1.13566&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Sennrich, R., Haddow, B., &amp;amp; Birch, A. (2016a). Controlling Politeness in Neural Machine Translation via Side Constraints. &lt;em&gt;Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies&lt;/em&gt;, 35–40. &lt;a href=&quot;https://doi.org/10.18653/v1/N16-1005&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://doi.org/10.18653/v1/N16-1005&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Sennrich, R., Haddow, B., &amp;amp; Birch, A. (2016b). Improving Neural Machine Translation Models with Monolingual Data. &lt;em&gt;Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)&lt;/em&gt;, 86–96. &lt;a href=&quot;https://doi.org/10.18653/v1/P16-1009&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://doi.org/10.18653/v1/P16-1009&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

</description>
        <pubDate>Thu, 06 Jan 2022 00:00:00 +0000</pubDate>
        <link>https://blog.machinetranslation.io/domain-adaptation-mixed-fine-tuning/</link>
        <guid isPermaLink="true">https://blog.machinetranslation.io/domain-adaptation-mixed-fine-tuning/</guid>
        
        
        <category>nmt</category>
        
      </item>
    
      <item>
        <title>Notes on Multilingual Machine Translation</title>
<description>&lt;p&gt;Multilingual NMT is distinguished by its scalability: a single model can translate between any number of languages, instead of having to build individual bilingual models. MNMT systems are also desirable because training models with data from diverse language pairs might help a low-resource language acquire extra knowledge from other languages. Moreover, MNMT systems tend to generalize better due to exposure to diverse languages, leading to improved translation quality compared to bilingual NMT systems. This phenomenon is known as translation Transfer Learning, or Knowledge Transfer (Dabre et al., 2020).&lt;/p&gt;

&lt;h2 id=&quot;tips-for-training-multilingual-nmt-models&quot;&gt;Tips for training multilingual NMT models&lt;/h2&gt;

&lt;p&gt;Building a multilingual MT system that translates between several languages is straightforward: merge all the datasets, with a token at the start of each source sentence indicating the target language. Here is an illustration of how your data should look. Afterwards, it is recommended to shuffle your dataset.&lt;/p&gt;

&lt;hr /&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Source&lt;/th&gt;
      &lt;th&gt;Target&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&amp;lt;ar&amp;gt; Thank you very much&lt;/td&gt;
      &lt;td&gt;شكرا جزيلا&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&amp;lt;es&amp;gt; Thank you very much&lt;/td&gt;
      &lt;td&gt;Muchas gracias&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&amp;lt;fr&amp;gt; Thank you very much&lt;/td&gt;
      &lt;td&gt;Merci beaucoup&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&amp;lt;hi&amp;gt; Thank you very much&lt;/td&gt;
      &lt;td&gt;आपका बहुत बहुत धन्यवाद&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&amp;lt;ar&amp;gt; आपका बहुत बहुत धन्यवाद          &lt;/td&gt;
      &lt;td&gt;شكرا جزيلا&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&amp;lt;en&amp;gt; आपका बहुत बहुत धन्यवाद&lt;/td&gt;
      &lt;td&gt;Thank you very much&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&amp;lt;es&amp;gt; आपका बहुत बहुत धन्यवाद&lt;/td&gt;
      &lt;td&gt;Muchas gracias&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&amp;lt;fr&amp;gt; आपका बहुत बहुत धन्यवाद&lt;/td&gt;
      &lt;td&gt;Merci beaucoup&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&amp;lt;ar&amp;gt; Muchas gracias&lt;/td&gt;
      &lt;td&gt;شكرا جزيلا&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&amp;lt;en&amp;gt; Muchas gracias&lt;/td&gt;
      &lt;td&gt;Thank you very much&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&amp;lt;fr&amp;gt; Muchas gracias&lt;/td&gt;
      &lt;td&gt;Merci beaucoup&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&amp;lt;hi&amp;gt; Muchas gracias&lt;/td&gt;
      &lt;td&gt;आपका बहुत बहुत धन्यवाद&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&amp;lt;en&amp;gt; شكرا جزيلا&lt;/td&gt;
      &lt;td&gt;Thank you very much&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&amp;lt;es&amp;gt; شكرا جزيلا&lt;/td&gt;
      &lt;td&gt;Muchas gracias&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&amp;lt;fr&amp;gt; شكرا جزيلا&lt;/td&gt;
      &lt;td&gt;Merci beaucoup&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&amp;lt;hi&amp;gt; شكرا جزيلا&lt;/td&gt;
      &lt;td&gt;आपका बहुत बहुत धन्यवाद&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&amp;lt;ar&amp;gt; Merci beaucoup&lt;/td&gt;
      &lt;td&gt;شكرا جزيلا&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&amp;lt;en&amp;gt; Merci beaucoup&lt;/td&gt;
      &lt;td&gt;Thank you very much&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&amp;lt;es&amp;gt; Merci beaucoup&lt;/td&gt;
      &lt;td&gt;Muchas gracias&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&amp;lt;hi&amp;gt; Merci beaucoup&lt;/td&gt;
      &lt;td&gt;आपका बहुत बहुत धन्यवाद&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;hr /&gt;

&lt;p&gt;There are a few important points to take into consideration while building multilingual models:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;If the data is clearly unbalanced, e.g. you have 75 million sentences for Spanish and only 15 million sentences for Portuguese, you have to balance it; otherwise, you would end up with a system that translates Spanish better than Portuguese. This technique is called over-sampling (or up-sampling). The usual way to achieve it in NMT toolkits is by giving &lt;strong&gt;weights&lt;/strong&gt; to your datasets. In this example, the Spanish dataset can take a weight of 1 while the Portuguese dataset can take a weight of 5, because the Spanish dataset is 5 times larger than the Portuguese dataset.&lt;/li&gt;
  &lt;li&gt;Some papers suggest adding a special token to the start of each sentence. For example, you can start Spanish sentences with the token &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;es&amp;gt;&lt;/code&gt; and Portuguese sentences with the token &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;pt&amp;gt;&lt;/code&gt;. In this case, you will have to add these tokens to your SentencePiece model through the option &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--user_defined_symbols&lt;/code&gt;. However, some researchers believe this step is optional.&lt;/li&gt;
  &lt;li&gt;Multilingual NMT models are more useful for low-resource languages than they are for rich-resource languages. Still, low-resource languages that share some linguistic characteristics with rich-resource languages can benefit from coexistence in one multilingual model. In this sense, multilingual NMT can be considered one of the “Transfer Learning” approaches (Tars et al., 2021; Ding et al., 2021).&lt;/li&gt;
  &lt;li&gt;Languages that do not share the same alphabet cannot achieve the same linguistic benefits from a multilingual NMT model. Still, researchers investigate approaches like &lt;em&gt;transliteration&lt;/em&gt; to increase knowledge transfer between languages that belong to the same language family, but use different alphabets. For example, using this &lt;em&gt;transliteration&lt;/em&gt; trick, my &lt;a href=&quot;https://www.machinetranslation.io/&quot;&gt;Indic-to-English multilingual NMT model&lt;/a&gt; can translate from 10 Indic languages to English.&lt;/li&gt;
  &lt;li&gt;Integrating other data augmentation approaches like &lt;a href=&quot;https://blog.machinetranslation.io/low-resource-nmt/&quot;&gt;Back-Translation&lt;/a&gt; can still be useful.&lt;/li&gt;
&lt;/ul&gt;
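&lt;p&gt;The data preparation illustrated in the table above, i.e. prepending a target-language token to each source sentence, merging the corpora, and shuffling, can be sketched as follows. The function name and in-memory corpora are illustrative; in practice you would stream files from disk.&lt;/p&gt;

```python
import random

def build_multilingual_corpus(corpora, seed=42):
    """Merge parallel corpora into one multilingual dataset,
    prepending a target-language token to every source sentence.
    `corpora` maps a target-language code to (source, target) pairs."""
    merged = [
        (f"<{tgt_lang}> {src}", tgt)
        for tgt_lang, pairs in corpora.items()
        for src, tgt in pairs
    ]
    random.Random(seed).shuffle(merged)  # shuffle, as recommended above
    return merged

corpora = {
    "es": [("Thank you very much", "Muchas gracias")],
    "fr": [("Thank you very much", "Merci beaucoup")],
}
for src, tgt in build_multilingual_corpus(corpora):
    print(src, "->", tgt)  # e.g. <es> Thank you very much -> Muchas gracias
```

&lt;p&gt;Remember that any language token you introduce this way should also be passed to SentencePiece via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--user_defined_symbols&lt;/code&gt; so it is kept as a single piece.&lt;/p&gt;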

&lt;h2 id=&quot;using-pre-trained-nmt-models&quot;&gt;Using pre-trained NMT models&lt;/h2&gt;

&lt;p&gt;What about pre-trained multilingual NMT models like mBART (Liu et al., 2020) and M2M-100 (Fan et al., 2020); when should you use them? The short answer is: for low-resource languages (e.g. from a few thousand to a few million sentence pairs, up to 15M), using mBART directly or fine-tuning it can give better results. For high-resource languages, training a baseline model from scratch can outperform mBART. Then, applying mixed fine-tuning (Chu et al., 2017) to this new baseline using in-house data can achieve even better gains in Machine Translation quality. Check this &lt;a href=&quot;https://gist.github.com/ymoslem/d85b55d2182cfd2ab5d08bed6c63c713&quot;&gt;code snippet&lt;/a&gt; if you would like to try mBART. You can also convert the M2M-100 model to the CTranslate2 format for better efficiency, as explained &lt;a href=&quot;https://gist.github.com/ymoslem/a414a0ead0d3e50f4d7ff7110b1d1c0d&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;References:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://dl.acm.org/doi/abs/10.1145/3406095&quot;&gt;A Survey of Multilingual Neural Machine Translation, Dabre et al., 2020&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://aclanthology.org/2020.tacl-1.47/&quot;&gt;Multilingual Denoising Pre-training for Neural Machine Translation, Liu et al., 2020&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/2109.10465&quot;&gt;Scalable and Efficient MoE Training for Multitask Multilingual Models, Kim et al., 2021&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://aclanthology.org/2021.nodalida-main.5/&quot;&gt;Extremely low-resource machine translation for closely related languages, Tars et al., 2021&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://aclanthology.org/2021.emnlp-main.263/&quot;&gt;Improving Neural Machine Translation by Bidirectional Training, Ding et al., 2021&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        <pubDate>Sat, 04 Dec 2021 00:00:00 +0000</pubDate>
        <link>https://blog.machinetranslation.io/multilingual-nmt/</link>
        <guid isPermaLink="true">https://blog.machinetranslation.io/multilingual-nmt/</guid>
        
        
        <category>nmt</category>
        
      </item>
    
      <item>
        <title>Low-Resource Neural Machine Translation</title>
<description>&lt;p&gt;Developing Neural Machine Translation (NMT) models for low-resource languages is an active research topic, both in the industry and academia. In this tutorial, we are going to discuss &lt;strong&gt;tagged back-translation&lt;/strong&gt; as one of the most effective and efficient approaches to training more robust models. Tagged back-translation is not only useful for low-resource languages, but also for other scenarios of data sparsity.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;strong&gt;Table of Contents:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#tagged-back-translation&quot;&gt;Tagged Back-Translation&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#lower-casing-vs-true-casing&quot;&gt;Lower-Casing vs. True-Casing&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#sub-wording-to-avoid-unknowns&quot;&gt;Sub-wording to Avoid Unknowns&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#shared-vocab-vs-separate-vocab&quot;&gt;Shared Vocab vs. Separate Vocab&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#crawled-data&quot;&gt;Crawled Data&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#transfer-learning&quot;&gt;Transfer Learning&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#references&quot;&gt;References&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;tagged-back-translation&quot;&gt;Tagged Back-Translation&lt;/h2&gt;

&lt;p&gt;This approach augments the available parallel training data with synthetic data that reflects the domain and purpose of the model. Several researchers, including Edunov et al. (2018) and Caswell et al. (2019), have shown that tagged back-translation is very helpful when training NMT models for low-resource languages. Moreover, it can be helpful for rich-resource languages by enriching datasets with specific linguistic features.&lt;/p&gt;

&lt;p&gt;Assuming we want to train an English-to-Hindi NMT model, the Tagged Back-Translation data augmentation technique consists of the following steps:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;For an English-to-Hindi model, train another Hindi-to-English model (i.e. in the other direction), using publicly available data from &lt;a href=&quot;https://opus.nlpl.eu/&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;OPUS&lt;/a&gt;;&lt;/li&gt;
  &lt;li&gt;Select monolingual data in Hindi publicly available (e.g. at &lt;a href=&quot;https://oscar-corpus.com/&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;OSCAR&lt;/a&gt;), which must have domains and linguistic features similar to the potential texts to be translated;&lt;/li&gt;
  &lt;li&gt;Use the Hindi-to-English model to create a synthetic dataset, by translating the Hindi monolingual data into English. Note here that only the English side (the source for EN-HI) is MTed while the Hindi side (the target for EN-HI) is human-generated text;&lt;/li&gt;
  &lt;li&gt;Consider using one of the available Quality Estimation tools such as &lt;a href=&quot;https://github.com/TharinduDR/TransQuest&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;TransQuest&lt;/a&gt; (Ranasinghe et al., 2020) or &lt;a href=&quot;https://github.com/Unbabel/OpenKiwi&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;OpenKiwi&lt;/a&gt; (Kepler et al., 2019) to filter out back-translations of low quality;&lt;/li&gt;
  &lt;li&gt;Add a special tag like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;BT&amp;gt;&lt;/code&gt; to the start of the MTed segments;&lt;/li&gt;
  &lt;li&gt;Build the vocabulary on all the data, both the original and the synthetic datasets;&lt;/li&gt;
  &lt;li&gt;Augment the original English-to-Hindi training dataset with the synthetic dataset;&lt;/li&gt;
  &lt;li&gt;Train a new English-to-Hindi model using the dataset generated from the previous step.&lt;/li&gt;
&lt;/ol&gt;
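&lt;p&gt;Steps 5 and 7 above, tagging the MTed source segments and merging them with the authentic parallel data, can be sketched as follows (the function name and toy sentence pairs are illustrative):&lt;/p&gt;

```python
def tag_and_merge(authentic, back_translated, tag="<BT>"):
    """Prepend `tag` to each machine-translated source segment and
    merge the synthetic pairs with the authentic parallel data."""
    tagged = [(f"{tag} {src}", tgt) for src, tgt in back_translated]
    return authentic + tagged

# (source, target) pairs; the synthetic source side is MTed English,
# while its target side is the original human-written Hindi
authentic = [("How are you?", "आप कैसे हैं?")]
back_translated = [("This is a synthetic English sentence.", "यह एक वास्तविक हिंदी वाक्य है।")]

corpus = tag_and_merge(authentic, back_translated)
print(corpus[1][0])  # <BT> This is a synthetic English sentence.
```

&lt;p&gt;The tag lets the model distinguish synthetic from authentic source sentences at training time, which is the core idea of Caswell et al. (2019).&lt;/p&gt;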

&lt;p&gt;For low-resource languages like Hindi, Haque et al. (2020) showed that the technique works well with 1:1 synthetic to original data. Still, you can experiment with different portions, especially for language pairs of richer resources.&lt;/p&gt;

&lt;p&gt;As demonstrated by Hoang et al. (2018), iterative back-translation for 2-3 runs can improve the quality further. Now, as you have a better Hindi-to-English model, back-translate English monolingual data to train a new version of the English-to-Hindi model. After that, use the new English-to-Hindi model to back-translate the same Hindi monolingual dataset you used for the first run to create a new version of the Hindi-to-English model. The idea here is that you are using a better model to translate the same monolingual data, i.e. without any increase or change, which should result in a better NMT model. Interestingly, you can use both NMT and phrase-based SMT models for back-translation, and then train or fine-tune your baseline NMT system in the required language direction.&lt;/p&gt;

&lt;p&gt;Popel et al. (2020) explored the effect of block back-translation, where the training data are presented to the neural network in blocks of authentic parallel data alternated with blocks of synthetic data.&lt;/p&gt;

&lt;h2 id=&quot;lower-casing-vs-true-casing&quot;&gt;Lower-Casing vs. True-Casing&lt;/h2&gt;

&lt;p&gt;For low-resource languages, I prefer lower-casing the data. However, in real-life scenarios, or if you are submitting a paper, you are usually required to produce the translation in true case, so you can train a truecaser, or use Sacremoses’ &lt;a href=&quot;https://github.com/alvations/sacremoses#truecaser&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;truecaser&lt;/a&gt; for English.&lt;/p&gt;

&lt;h2 id=&quot;sub-wording-to-avoid-unknowns&quot;&gt;Sub-wording to Avoid Unknowns&lt;/h2&gt;

&lt;p&gt;To avoid out-of-vocabulary tokens, it is recommended to train your NMT model on subwords instead of whole words. Subwording (e.g. with a BPE or unigram model) is recommended for any type of machine translation model, regardless of whether it is for a low-resource or rich-resource language pair. Among the most popular subwording tools is &lt;a href=&quot;https://github.com/google/sentencepiece&quot;&gt;SentencePiece&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you used &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;BT&amp;gt;&lt;/code&gt;, for example, as the back-translation token, you have to add it to the SentencePiece model using the option &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--user_defined_symbols&lt;/code&gt; during training. The same option can be useful for adding any other special tokens found in your training data, such as tags and non-Latin numbers.&lt;/p&gt;

&lt;p&gt;Consider also using the following SentencePiece options:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--input_sentence_size&lt;/code&gt; to cap the number of sentences the trainer loads, which is useful for very large corpora;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--shuffle_input_sentence&lt;/code&gt; to shuffle the dataset;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--split_by_number&lt;/code&gt; to split tokens by numbers (0-9); and&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--byte_fallback&lt;/code&gt; to decompose unknown pieces into UTF-8 byte pieces.&lt;/li&gt;
&lt;/ul&gt;
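&lt;p&gt;Putting these options together, a SentencePiece training command might look like the following sketch. The file names, vocabulary size, and model type are illustrative; adjust them to your data.&lt;/p&gt;

```shell
spm_train \
  --input=train.source-target.txt \
  --model_prefix=spm.joint \
  --vocab_size=32000 \
  --model_type=bpe \
  --user_defined_symbols="<BT>" \
  --input_sentence_size=10000000 \
  --shuffle_input_sentence=true \
  --split_by_number=true \
  --byte_fallback=true
```

&lt;p&gt;The resulting &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;spm.joint.model&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;spm.joint.vocab&lt;/code&gt; files can then be used to subword your training, validation, and test data.&lt;/p&gt;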

&lt;h2 id=&quot;shared-vocab-vs-separate-vocab&quot;&gt;Shared Vocab vs. Separate Vocab&lt;/h2&gt;

&lt;p&gt;If the source and target languages share some vocabulary, e.g. with similar languages or code switching, using a shared vocabulary might help. Using a shared vocabulary involves two steps:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Training a SentencePiece model on all datasets for both languages;&lt;/li&gt;
  &lt;li&gt;Using shared vocab instead of separate vocabs while training the NMT model.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;crawled-data&quot;&gt;Crawled Data&lt;/h2&gt;

&lt;p&gt;Currently, OPUS includes some datasets that are crawled from bilingual websites, with sentence pairs matched using multilingual similarity tools such as &lt;a href=&quot;https://github.com/facebookresearch/LASER&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;LASER&lt;/a&gt;, &lt;a href=&quot;https://github.com/bojone/labse&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;LaBSE&lt;/a&gt;, and &lt;a href=&quot;https://github.com/tensorflow/hub/blob/master/examples/colab/cross_lingual_similarity_with_tf_hub_multilingual_universal_encoder.ipynb&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;m-USE&lt;/a&gt;. However, according to Kreutzer et al. (2022), crawled datasets suffer from quality issues that can affect the resulting NMT models. Hence, it is important to try filtering them before use, and maybe to exclude them from initial baselines.&lt;/p&gt;

&lt;h2 id=&quot;transfer-learning&quot;&gt;Transfer Learning&lt;/h2&gt;

&lt;p&gt;Instead of training a model from scratch, you can apply transfer learning: take a multilingual model like mBART-50, M2M-100, or NLLB-200, and fine-tune it on your dataset. Unidirectional pre-trained models can be used as well (e.g. &lt;a href=&quot;https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;OPUS-MT&lt;/a&gt;). If your low-resource language is similar to languages supported by such models, it can benefit from shared linguistic features. Back-translation can be used here as well to augment the authentic dataset.&lt;/p&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;
&lt;ul&gt;
  &lt;li&gt;Caswell, I., Chelba, C., &amp;amp; Grangier, D. (2019). Tagged Back-Translation. Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers), 53–63. &lt;a href=&quot;https://doi.org/10.18653/v1/W19-5206&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://doi.org/10.18653/v1/W19-5206&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Edunov, S., Ott, M., Auli, M., &amp;amp; Grangier, D. (2018). Understanding Back-Translation at Scale. &lt;em&gt;Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing&lt;/em&gt;, 489–500. &lt;a href=&quot;https://doi.org/10.18653/v1/D18-1045&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://doi.org/10.18653/v1/D18-1045&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Gebauer, P., Bojar, O., Švandelík, V., &amp;amp; Popel, M. (2021). CUNI Systems in WMT21: Revisiting Backtranslation Techniques for English-Czech NMT. &lt;em&gt;Proceedings of the Sixth Conference on Machine Translation&lt;/em&gt;, 123–129. &lt;a href=&quot;https://aclanthology.org/2021.wmt-1.7&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://aclanthology.org/2021.wmt-1.7&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Haque, R., Moslem, Y., &amp;amp; Way, A. (2020). Terminology-Aware Sentence Mining for NMT Domain Adaptation: ADAPT’s Submission to the Adap-MT 2020 English-to-Hindi AI Translation Shared Task. &lt;em&gt;Proceedings of the 17th International Conference on Natural Language Processing (ICON): Adap-MT 2020 Shared Task&lt;/em&gt;, 17–23. &lt;a href=&quot;https://aclanthology.org/2020.icon-adapmt.4&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://aclanthology.org/2020.icon-adapmt.4&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Hoang, V. C. D., Koehn, P., Haffari, G., &amp;amp; Cohn, T. (2018). Iterative Back-Translation for Neural Machine Translation. &lt;em&gt;Proceedings of the 2nd Workshop on Neural Machine Translation and Generation&lt;/em&gt;, 18–24. &lt;a href=&quot;https://doi.org/10.18653/v1/W18-2703&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://doi.org/10.18653/v1/W18-2703&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Kreutzer, J., Caswell, I., Wang, L., Wahab, A., van Esch, D., Ulzii-Orshikh, N., Tapo, A., Subramani, N., Sokolov, A., Sikasote, C., Setyawan, M., Sarin, S., Samb, S., Sagot, B., Rivera, C., Rios, A., Papadimitriou, I., Osei, S., Suarez, P. O., … Adeyemi, M. (2022). Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets. &lt;em&gt;Transactions of the Association for Computational Linguistics&lt;/em&gt;, &lt;em&gt;10&lt;/em&gt;, 50–72. &lt;a href=&quot;https://doi.org/10.1162/tacl_a_00447&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://doi.org/10.1162/tacl_a_00447&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Popel, M., Tomkova, M., Tomek, J., Kaiser, Ł., Uszkoreit, J., Bojar, O., &amp;amp; Žabokrtský, Z. (2020). Transforming machine translation: a deep learning system reaches news translation quality comparable to human professionals. &lt;em&gt;Nature Communications&lt;/em&gt;, &lt;em&gt;11&lt;/em&gt;(1), 4381. &lt;a href=&quot;https://doi.org/10.1038/s41467-020-18073-9&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://doi.org/10.1038/s41467-020-18073-9&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Ramírez-Sánchez, G., Zaragoza-Bernabeu, J., Bañón, M., &amp;amp; Rojas, S. O. (2020). Bifixer and Bicleaner: two open-source tools to clean your parallel data. &lt;em&gt;Proceedings of the 22nd Annual Conference of the European Association for Machine Translation&lt;/em&gt;, 291–298. &lt;a href=&quot;https://aclanthology.org/2020.eamt-1.31/&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://aclanthology.org/2020.eamt-1.31/&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Sennrich, R., Haddow, B., &amp;amp; Birch, A. (2016). Improving Neural Machine Translation Models with Monolingual Data. &lt;em&gt;Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)&lt;/em&gt;, 86–96. &lt;a href=&quot;https://doi.org/10.18653/v1/P16-1009&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://doi.org/10.18653/v1/P16-1009&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        <pubDate>Sat, 25 Sep 2021 00:00:00 +0000</pubDate>
        <link>https://blog.machinetranslation.io/low-resource-nmt/</link>
        <guid isPermaLink="true">https://blog.machinetranslation.io/low-resource-nmt/</guid>
        
        
        <category>nmt</category>
        
      </item>
    
      <item>
        <title>Web Interface for Machine Translation</title>
        <description>&lt;p&gt;Today, we will create a very simple &lt;strong&gt;Machine Translation (MT) Web Interface&lt;/strong&gt; for &lt;em&gt;OpenNMT-py&lt;/em&gt;, &lt;em&gt;OpenNMT-tf&lt;/em&gt; and &lt;em&gt;FairSeq&lt;/em&gt; models using &lt;em&gt;CTranslate2&lt;/em&gt; and &lt;em&gt;Streamlit&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Previously, there were other tutorials on how to use a &lt;a href=&quot;https://forum.opennmt.net/t/simple-opennmt-py-rest-server/1392&quot;&gt;simple server&lt;/a&gt; and &lt;a href=&quot;https://github.com/ymoslem/OpenNMT-GUI&quot;&gt;web interface with Flask&lt;/a&gt;. However, today’s tutorial is for those who want to create an ultra-simple, quick demo.&lt;/p&gt;

&lt;p&gt;We also aim to highlight that &lt;em&gt;CTranslate2&lt;/em&gt; is now the way to go for serving OpenNMT models due to its exceptional performance. You can use it in a simple way, as we do here, or integrate it into a REST API for more advanced use cases.&lt;/p&gt;

&lt;p&gt;So let’s start…&lt;/p&gt;

&lt;hr /&gt;
&lt;p&gt;&lt;strong&gt;Table of Contents:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#objective-simple-machine-translation-web-interface&quot;&gt;Objective: Simple Machine Translation Web Interface&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#install-requirements&quot;&gt;Install Requirements&lt;/a&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#optional-create-and-activate-a-virtual-environment&quot;&gt;Optional: Create and Activate a Virtual Environment&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#install-required-libraries&quot;&gt;Install Required Libraries&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#convert-model-to-ctranslate2&quot;&gt;Convert Model to CTranslate2&lt;/a&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#ctranslate2-python-sample&quot;&gt;CTranslate2 Python Sample&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#create-your-app&quot;&gt;Create Your App&lt;/a&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#test-app&quot;&gt;Test App&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#translation-app&quot;&gt;Translation App&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#add-language-pairs&quot;&gt;Add Language Pairs&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#full-code&quot;&gt;Full Code&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#next-steps&quot;&gt;Next Steps&lt;/a&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#streamlit-components&quot;&gt;Streamlit Components&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#deployment&quot;&gt;Deployment&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;objective-simple-machine-translation-web-interface&quot;&gt;Objective: Simple Machine Translation Web Interface&lt;/h2&gt;

&lt;p&gt;Our objective is to develop a simple web interface for Machine Translation like this one.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../static/img/streamlit-translate-gui.png&quot; alt=&quot;streamlit-translate-gui&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;install-requirements&quot;&gt;Install Requirements&lt;/h2&gt;

&lt;h3 id=&quot;optional-create-and-activate-a-virtual-environment&quot;&gt;Optional: Create and Activate a Virtual Environment&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Install &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;virtualenv&lt;/code&gt;:
    &lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pip3 &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;virtualenv
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;Create a virtual environment, e.g. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;myvenv&lt;/code&gt;:
    &lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;virtualenv myvenv &lt;span class=&quot;nt&quot;&gt;--python&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;python3
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;Activate the virtual environment:
    &lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nb&quot;&gt;source &lt;/span&gt;myvenv/bin/activate
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;install-required-libraries&quot;&gt;Install Required Libraries&lt;/h3&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pip3 &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;ctranslate2 sentencepiece streamlit watchdog nltk
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;convert-model-to-ctranslate2&quot;&gt;Convert Model to CTranslate2&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/OpenNMT/CTranslate2&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;CTranslate2&lt;/a&gt; supports both &lt;em&gt;OpenNMT-py&lt;/em&gt; and &lt;em&gt;OpenNMT-tf&lt;/em&gt; models. As of version 2.0, it also supports &lt;em&gt;FairSeq&lt;/em&gt; models. However, you need to convert your model to the &lt;em&gt;CTranslate2&lt;/em&gt; format before using it.&lt;/p&gt;

&lt;p&gt;The following commands are simply copied from the &lt;em&gt;CTranslate2&lt;/em&gt; repository, and tested to make sure they are up-to-date. This example uses pre-trained Transformer English-German models. If you trained your own model, run the same commands on it instead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For an OpenNMT-py model:&lt;/strong&gt;&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pip3 &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;OpenNMT-py

wget https://s3.amazonaws.com/opennmt-models/transformer-ende-wmt-pyOnmt.tar.gz
&lt;span class=&quot;nb&quot;&gt;tar &lt;/span&gt;xf transformer-ende-wmt-pyOnmt.tar.gz

ct2-opennmt-py-converter &lt;span class=&quot;nt&quot;&gt;--model_path&lt;/span&gt; averaged-10-epoch.pt &lt;span class=&quot;nt&quot;&gt;--output_dir&lt;/span&gt; ende_ctranslate2 &lt;span class=&quot;nt&quot;&gt;--quantization&lt;/span&gt; int8
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;For an OpenNMT-tf model:&lt;/strong&gt;&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pip3 &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;OpenNMT-tf

wget https://s3.amazonaws.com/opennmt-models/averaged-ende-ckpt500k-v2.tar.gz
&lt;span class=&quot;nb&quot;&gt;tar &lt;/span&gt;xf averaged-ende-ckpt500k-v2.tar.gz

ct2-opennmt-tf-converter &lt;span class=&quot;nt&quot;&gt;--model_path&lt;/span&gt; averaged-ende-ckpt500k-v2 &lt;span class=&quot;nt&quot;&gt;--output_dir&lt;/span&gt; ende_ctranslate2 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--src_vocab&lt;/span&gt; averaged-ende-ckpt500k-v2/wmtende.vocab &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--tgt_vocab&lt;/span&gt; averaged-ende-ckpt500k-v2/wmtende.vocab &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--model_type&lt;/span&gt; TransformerBase &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--quantization&lt;/span&gt; int8
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;For a FairSeq model:&lt;/strong&gt;&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;ct2-fairseq-converter &lt;span class=&quot;nt&quot;&gt;--model_path&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$MODEL&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--data_dir&lt;/span&gt; dict &lt;span class=&quot;nt&quot;&gt;--fixed_dictionary&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$DICT&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--output_dir&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$OUTPUT&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--quantization&lt;/span&gt; int8
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;As you can see, we used the option &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--quantization int8&lt;/code&gt; to reduce the model size and improve inference speed.&lt;/p&gt;
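&lt;p&gt;For intuition, int8 quantization stores each weight as an 8-bit integer together with a scale factor, which is why it shrinks the model to roughly a quarter of its float32 size. The toy sketch below illustrates the general idea only; it is not CTranslate2’s exact quantization scheme:&lt;/p&gt;

```python
# Toy symmetric int8 quantization of a list of float weights.
# Illustrative only: CTranslate2's actual per-layer scheme differs.

def quantize_int8(weights):
    """Map floats to integers in [-127, 127] plus a scale factor."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integers."""
    return [x * scale for x in q]

weights = [0.5, -1.27, 0.02, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

&lt;p&gt;Dequantizing multiplies the integers back by the scale, recovering the weights up to a small rounding error.&lt;/p&gt;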

&lt;h3 id=&quot;ctranslate2-python-sample&quot;&gt;CTranslate2 Python Sample&lt;/h3&gt;

&lt;p&gt;Let’s make sure that &lt;em&gt;CTranslate2&lt;/em&gt; works properly in our setup by running this Python code:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;ctranslate2&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;translator&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ctranslate2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Translator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;ende_ctranslate2/&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;translator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;translate_batch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;▁H&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;ello&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;▁world&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;!&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;translate_batch()&lt;/code&gt; can take a list of sentences and translate them in batches, which is much more efficient than translating one sentence at a time. Here, we use only one sentence for demonstration purposes.&lt;/p&gt;
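&lt;p&gt;If you later translate whole files, the usual pattern is to group sentences into fixed-size batches before calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;translate_batch()&lt;/code&gt;. A small, framework-free helper; the batch size of 32 is an illustrative choice, not a CTranslate2 requirement:&lt;/p&gt;

```python
# Split a list of tokenized sentences into fixed-size batches, the shape
# of input you would feed to translate_batch() one chunk at a time.

def make_batches(sentences, batch_size=32):
    """Yield successive batches from a list of sentences."""
    for i in range(0, len(sentences), batch_size):
        yield sentences[i:i + batch_size]

sentences = [f"sentence {i} .".split() for i in range(70)]
batches = list(make_batches(sentences, batch_size=32))  # 32 + 32 + 6
```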

&lt;p&gt;You can also check this detailed example that opens a file and translates it with CTranslate2.&lt;/p&gt;

&lt;script src=&quot;https://gist.github.com/ymoslem/60e1d1dc44fe006f67e130b6ad703c4b.js&quot;&gt;&lt;/script&gt;

&lt;h2 id=&quot;create-your-app&quot;&gt;Create Your App&lt;/h2&gt;

&lt;h3 id=&quot;test-app&quot;&gt;Test App&lt;/h3&gt;
&lt;p&gt;Let’s first create a small app to see how &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Streamlit&lt;/code&gt; works.&lt;/p&gt;

&lt;p&gt;Create a file called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;test.py&lt;/code&gt; for example and add the following lines to it.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;streamlit&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;st&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;st&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;title&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Upper My Text&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;user_input&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;st&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;text_input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Write something and press Enter &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\
&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;    to convert it to the UPPER case.&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;user_input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;output&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;user_input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;upper&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;st&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;info&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Launch your test app by opening the Terminal and running the following command.&lt;/p&gt;
&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;streamlit run test.py
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If everything works as expected, you should see something like this in your browser at the URL &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;http://localhost:8501&lt;/code&gt;. Once you type some text and press Enter, it will be printed in the UPPER case.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../static/img/streamlit-test.png&quot; alt=&quot;streamlit-test&quot; /&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;h3 id=&quot;translation-app&quot;&gt;Translation App&lt;/h3&gt;

&lt;p&gt;Let’s now develop our translation web interface. Create a file called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;translate.py&lt;/code&gt; for example, and add the following to it.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;streamlit&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;st&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sentencepiece&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;spm&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;ctranslate2&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;nltk&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sent_tokenize&lt;/span&gt;


&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;translate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;source&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;translator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sp_source_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sp_target_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;&quot;&quot;Use CTranslate model to translate a sentence

    Args:
        source (str): Source sentences to translate
        translator (object): Object of Translator, with the CTranslate2 model
        sp_source_model (object): Object of SentencePieceProcessor, with the SentencePiece source model
        sp_target_model (object): Object of SentencePieceProcessor, with the SentencePiece target model
    Returns:
        Translation of the source text
    &quot;&quot;&quot;&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;source_sentences&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sent_tokenize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;source&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;source_tokenized&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sp_source_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;encode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;source_sentences&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;out_type&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;translations&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;translator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;translate_batch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;source_tokenized&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;translations&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;translation&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;tokens&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;translation&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;translations&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;translations_detokenized&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sp_target_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;decode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;translations&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;translation&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot; &quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;join&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;translations_detokenized&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;translation&lt;/span&gt;


&lt;span class=&quot;c1&quot;&gt;# [Modify] File paths here to the CTranslate2 SentencePiece models.
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ct_model_path&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;/path/to/the/ctranslate/model/directory&quot;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;sp_source_model_path&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;/path/to/the/sentencepiece/source/model/file&quot;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;sp_target_model_path&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;/path/to/the/sentencepiece/target/model/file&quot;&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Create objects of CTranslate2 Translator and SentencePieceProcessor to load the models
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;translator&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ctranslate2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Translator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ct_model_path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;cpu&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;# or &quot;cuda&quot; for GPU
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sp_source_model&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;spm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SentencePieceProcessor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sp_source_model_path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;sp_target_model&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;spm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SentencePieceProcessor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sp_target_model_path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;


&lt;span class=&quot;c1&quot;&gt;# Title for the page and nice icon
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;st&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set_page_config&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;page_title&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;NMT&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;page_icon&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;🤖&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# Header
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;st&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;title&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Translate&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Form to add your items
&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;st&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;form&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;my_form&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# Textarea to type the source text.
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;user_input&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;st&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;text_area&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Source Text&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_chars&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;200&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# Translate with CTranslate2 model
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;translation&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;translate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;user_input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;translator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sp_source_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sp_target_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# Create a button
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;submitted&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;st&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;form_submit_button&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Translate&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# If the button pressed, print the translation
&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;# Here, we use &quot;st.info&quot;, but you can try &quot;st.write&quot;, &quot;st.code&quot;, or &quot;st.success&quot;.
&lt;/span&gt;    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;submitted&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;st&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;write&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Translation&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;st&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;info&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;translation&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Make sure you update the variables &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ct_model_path&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sp_source_model_path&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sp_target_model_path&lt;/code&gt; with your own paths to the CTranslate2 model, and the SentencePiece source and target models.&lt;/p&gt;

&lt;p&gt;Let’s launch our translator. Run the following command in the Terminal.&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;streamlit run translate.py
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If everything works fine, you should see an output like this at the URL &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;http://localhost:8501/&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Try typing a sentence (in the source language of your model) and press the “Translate” button. The translation should be printed as you see here!&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../static/img/streamlit-translate.png&quot; alt=&quot;streamlit-translate&quot; /&gt;&lt;/p&gt;

&lt;h3 id=&quot;add-language-pairs&quot;&gt;Add Language Pairs&lt;/h3&gt;

&lt;p&gt;To give your visitor the option to select between multiple language pairs, you can add a dropdown menu like this one.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../static/img/streamlit-dropdown.png&quot; alt=&quot;streamlit-dropdown&quot; /&gt;&lt;/p&gt;

&lt;p&gt;You can first change the paths part into a function:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;load_models&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;option&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;option&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;English-to-Japanese&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;ct_model_path&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;path/to/your/ct_model&quot;&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;sp_source_model_path&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;path/to/your/sp_source_model&quot;&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;sp_target_model_path&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;path/to/your/sp_target_model&quot;&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;elif&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;option&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Japanese-to-English&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;ct_model_path&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;path/to/your/ct_model&quot;&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;sp_source_model_path&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;path/to/your/sp_source_model&quot;&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;sp_target_model_path&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;path/to/your/sp_target_model&quot;&lt;/span&gt;
    
    &lt;span class=&quot;n&quot;&gt;translator&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ctranslate2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Translator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ct_model_path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;sp_source_model&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;spm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SentencePieceProcessor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sp_source_model_path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;sp_target_model&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;spm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SentencePieceProcessor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sp_target_model_path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;translator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sp_source_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sp_target_model&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Then, change the form to:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;st&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;form&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;my_form&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# Dropdown menu to select a language pair
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;option&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;st&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;selectbox&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;Select Language Pair&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;English-to-Japanese&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Japanese-to-English&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;#st.write(&apos;You selected:&apos;, option)
&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# Textarea to type the source text.
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;user_input&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;st&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;text_area&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Source Text&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_chars&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;200&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# Load models
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;translator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sp_source_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sp_target_model&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;load_models&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;option&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    
    &lt;span class=&quot;c1&quot;&gt;# Translate with CTranslate2 model
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;translation&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;translate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;user_input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;translator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sp_source_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sp_target_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# Create a button
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;submitted&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;st&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;form_submit_button&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Translate&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# If the button pressed, print the translation
&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;# Here, we use &quot;st.info&quot;, but you can try &quot;st.write&quot;, &quot;st.code&quot;, or &quot;st.success&quot;.
&lt;/span&gt;    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;submitted&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;st&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;write&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Translation&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;st&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;info&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;translation&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;full-code&quot;&gt;Full Code&lt;/h3&gt;

&lt;p&gt;I will be updating &lt;a href=&quot;https://github.com/ymoslem/CTranslate-NMT-Web-Interface&quot;&gt;this repository&lt;/a&gt; with Python samples.&lt;/p&gt;

&lt;h2 id=&quot;next-steps&quot;&gt;Next steps&lt;/h2&gt;

&lt;h3 id=&quot;streamlit-components&quot;&gt;Streamlit Components&lt;/h3&gt;

&lt;p&gt;Streamlit comes with more &lt;a href=&quot;https://streamlit.io/components&quot;&gt;components&lt;/a&gt;. One of the most interesting NLP components you might want to check is &lt;a href=&quot;https://github.com/explosion/spacy-streamlit&quot;&gt;spacy-streamlit&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;deployment&quot;&gt;Deployment&lt;/h3&gt;

&lt;p&gt;You can deploy your app on any service of your choice. However, if you are looking for a free and easy option, consider using &lt;a href=&quot;https://devcenter.heroku.com/articles/getting-started-with-python&quot;&gt;Heroku&lt;/a&gt;. For better performance, test your app with and without Streamlit’s &lt;a href=&quot;https://docs.streamlit.io/en/stable/caching.html&quot;&gt;caching&lt;/a&gt; option and see if it helps.&lt;/p&gt;

&lt;p&gt;Thanks for reading! If you have questions or suggestions, feel free to &lt;a href=&quot;https://blog.machinetranslation.io/contact/&quot;&gt;contact me&lt;/a&gt;.&lt;/p&gt;

</description>
        <pubDate>Sun, 25 Jul 2021 00:00:00 +0000</pubDate>
        <link>https://blog.machinetranslation.io/nmt-web-interface/</link>
        <guid isPermaLink="true">https://blog.machinetranslation.io/nmt-web-interface/</guid>
        
        
        <category>nmt</category>
        
      </item>
    
      <item>
        <title>Adaptive Neural Machine Translation</title>
<description>&lt;p&gt;In a translation environment, translations and edits never stop. Therefore, while periodic fine-tuning of our neural machine translation (NMT) models can help, there is definitely a need to take new translated and edited segments into consideration as they arrive. Otherwise, the MT system will keep making the same mistakes, failing to observe new terminology and style, until a new or fine-tuned version of the model is released. This is where Online Learning, or &lt;strong&gt;Online Adaptation&lt;/strong&gt;, comes in handy: the NMT model can incrementally learn from new translations and edits as it goes along!&lt;/p&gt;

&lt;p&gt;Generally speaking, there are several approaches to online adaptation. In this article, I am mainly discussing two types of adaptive machine translation: (a) instance-based adaptation of encoder-decoder MT models (&lt;a href=&quot;https://aclanthology.org/W17-4713/&quot;&gt;Farajian et al., 2017&lt;/a&gt;); and (b) adaptive translation with autoregressive LLMs (&lt;a href=&quot;https://aclanthology.org/2023.eamt-1.22/&quot;&gt;Moslem et al., 2023&lt;/a&gt;).&lt;/p&gt;

&lt;h2 id=&quot;adaptive-translation-with-encoder-decoder-nmt-models&quot;&gt;Adaptive Translation with Encoder-Decoder NMT Models&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Multi-Domain Neural Machine Translation through Unsupervised Adaptation&lt;/em&gt; (&lt;a href=&quot;https://aclanthology.org/W17-4713/&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;Farajian et al., 2017&lt;/a&gt;) is one of the best papers I have read on the topic, especially as it adapts on the fly, so there is no need to train individual models. A similar approach is used by ModernMT for Adaptive NMT.&lt;/p&gt;

&lt;p&gt;We can highlight the process offered by the paper as follows:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Given a source input &lt;em&gt;q&lt;/em&gt; (this can range from a single translation unit to an entire document), extract from the dataset/TM the top (source, target) pairs in terms of similarity between the source and &lt;em&gt;q&lt;/em&gt;.&lt;/li&gt;
  &lt;li&gt;Use the retrieved pairs to fine-tune the baseline model, which is then applied to translate &lt;em&gt;q&lt;/em&gt;.&lt;/li&gt;
  &lt;li&gt;After a linguist edits the MT translation and approves it, add it to the dataset/TM. Consider also having a dedicated “context” dataset for each client or project.&lt;/li&gt;
  &lt;li&gt;Reset the adapted model to the original parameters, translate the next input source, and so on.&lt;/li&gt;
&lt;/ol&gt;
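Step 1, the retrieval of similar pairs, can be sketched with a simple fuzzy match over an in-memory translation memory. The tiny TM and the `retrieve` helper below are illustrative; a production system would use an indexed approximate search over millions of segments:

```python
# Minimal sketch of instance-based retrieval (step 1): rank TM entries
# by fuzzy similarity between their source side and the input q.
from difflib import SequenceMatcher

translation_memory = [
    ("The cat sat on the mat.", "Le chat était assis sur le tapis."),
    ("Machine translation is useful.", "La traduction automatique est utile."),
    ("A dog slept on the rug.", "Un chien dormait sur la carpette."),
]

def retrieve(q, tm, top_k=2):
    """Return the top_k (source, target) pairs most similar to q."""
    scored = [(SequenceMatcher(None, q, src).ratio(), src, tgt)
              for src, tgt in tm]
    scored.sort(key=lambda x: x[0], reverse=True)
    return [(src, tgt) for _, src, tgt in scored[:top_k]]

pairs = retrieve("The cat sat on the rug.", translation_memory)
print(pairs[0][0])  # the most similar source sentence in the TM
```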

&lt;p&gt;It is best applied in a CAT tool. The “dataset” or “parallel data” in this case is what linguists call a “translation memory”. “Instead of the static pool of in-domain parallel data, you can have a dynamic pool which is consistently updated by adding the new post-edited sentence pairs,” said &lt;a href=&quot;https://www.linkedin.com/in/tetris/amin-farajian/&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;Amin Farajian&lt;/a&gt;, the main author of the paper. “You will have a system that learns constantly from your post-editions. Moreover, by having separate pools for each of your post-editors, you can even have MT systems that adapt to the style of your translators!”&lt;/p&gt;

&lt;p&gt;Similarly, &lt;a href=&quot;https://github.com/modernmt/modernmt/issues/546#issuecomment-650343894&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;Emil Lynegaard&lt;/a&gt; explained the process in simple words. “When you use a context memory for a translation request, it will look for similar source paragraphs in the reference context memory. If any are found, […] it will briefly “fine-tune” the underlying model. This actually modifies the weights and biases of the neural network, albeit it only does so temporarily. When the fine-tuning has finished (this is typically a sub-second training run), then your input paragraph will be translated using the updated model, after which the model will have its weights reset to the original configuration.”&lt;/p&gt;

&lt;p&gt;This human-in-the-loop, adaptive approach is just brilliant in multiple aspects. For example, it solves the issue of “catastrophic forgetting” that could happen due to fine-tuning on a small number of sentences by simply resetting the model. Moreover, it does this in a straightforward way without having to change the original architecture of the model.&lt;/p&gt;
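The adapt-translate-reset loop can be sketched abstractly as follows; the toy parameter dictionary and the `adapt` helper are hypothetical stand-ins for snapshotting and fine-tuning the network's actual weight tensors:

```python
import copy

# Abstract sketch of the adapt-then-reset loop: briefly update a copy
# of the baseline parameters for the current input, translate with it,
# then discard the copy so the baseline stays intact for the next input.
# Real systems snapshot and restore the network's weight tensors.
baseline = {"w": 1.0, "b": 0.0}

def adapt(params, retrieved_pairs, lr=0.1):
    """Stand-in for a few fine-tuning steps on the retrieved pairs."""
    adapted = copy.deepcopy(params)
    adapted["w"] += lr * len(retrieved_pairs)  # pretend gradient update
    return adapted

for q in ["first input sentence", "second input sentence"]:
    adapted = adapt(baseline, retrieved_pairs=[("src", "tgt")])
    # ... translate q with the temporarily adapted parameters ...

print(baseline["w"])  # the baseline is unchanged after every round
```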

&lt;p&gt;To test the system, we need to create development and test sets. According to the paper, “from each specific domain a set of size 500 sentence pairs is randomly selected as development set, and 1,000 sentence pairs are used as held-out test corpus.”&lt;/p&gt;

&lt;p&gt;One caveat about this approach: while it saves time and resources by eliminating the need to train many in-domain/custom models, especially if these domains have limited data, it is still compute-intensive, as it requires real-time use of GPUs, usually equivalent to those used for training the baseline model. That said, I believe in some scenarios this approach can be a perfect solution, especially if it is combined with other lines of work like Knowledge Distillation (&lt;a href=&quot;https://aclanthology.org/D16-1139/&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;Kim and Rush, 2016&lt;/a&gt;; &lt;a href=&quot;https://arxiv.org/abs/1612.06139&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;Crego and Senellart, 2016&lt;/a&gt;; &lt;a href=&quot;https://workshop2018.iwslt.org/downloads/Proceedings_IWSLT_2018.pdf#page=38&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;Zhang et al., 2018&lt;/a&gt;) to make the fine-tuning process more efficient.&lt;/p&gt;

&lt;p&gt;I was honoured to present this paper, among others, in my talk on &lt;a href=&quot;https://amtaweb.org/wp-content/uploads/2020/11/NMTDomainAdaptationTechniques.pdf&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;NMT Domain Adaptation Techniques at AMTA2020&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;adaptive-translation-with-autoregressive-decoder-only-language-models&quot;&gt;Adaptive Translation with Autoregressive Decoder-only Language Models&lt;/h2&gt;

&lt;p&gt;One of the advantages of high-quality large language models (LLMs) is that they take context into consideration. At inference time, feeding an LLM with in-domain example translations or terminology can enhance its ability to generate more accurate and relevant translations. In general, this on-the-fly adaptation feature of LLMs is referred to as in-context learning. Early in 2023, I published my paper on the topic, namely &lt;a href=&quot;https://aclanthology.org/2023.eamt-1.22/&quot;&gt;Adaptive Machine Translation with Large Language Models&lt;/a&gt;, which was later peer-reviewed and accepted for publication at EAMT 2023.&lt;/p&gt;
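The in-context learning idea can be illustrated by assembling retrieved translation pairs into a few-shot prompt. The prompt template below is a hypothetical example, not the exact format used in the paper:

```python
# Illustrative sketch of in-context learning for adaptive translation:
# prepend retrieved (source, target) pairs to the new source sentence
# so the LLM can imitate the domain's terminology and style.

def build_prompt(examples, new_source, src_lang="English", tgt_lang="French"):
    """Assemble a few-shot translation prompt from (source, target) pairs."""
    lines = []
    for src, tgt in examples:
        lines.append(f"{src_lang}: {src}")
        lines.append(f"{tgt_lang}: {tgt}")
    # the new source sentence, ending with an open target-language cue
    lines.append(f"{src_lang}: {new_source}")
    lines.append(f"{tgt_lang}:")
    return "\n".join(lines)

examples = [
    ("The update fixes the issue.", "La mise à jour corrige le problème."),
]
prompt = build_prompt(examples, "The update improves performance.")
print(prompt)
```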

</description>
        <pubDate>Wed, 21 Apr 2021 00:00:00 +0000</pubDate>
        <link>https://blog.machinetranslation.io/adaptive-nmt/</link>
        <guid isPermaLink="true">https://blog.machinetranslation.io/adaptive-nmt/</guid>
        
        
        <category>nmt</category>
        
      </item>
    
      <item>
        <title>Running TensorBoard with OpenNMT</title>
<description>&lt;p&gt;TensorBoard is a tool that provides useful visualizations of how the training of a deep learning model is progressing. It allows you to track and visualize metrics such as accuracy and perplexity. You can use TensorBoard with diverse deep learning frameworks such as TensorFlow and PyTorch. In this tutorial, you will learn how to activate TensorBoard in OpenNMT-tf and OpenNMT-py in different environments.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#1--activating-tensorboard&quot;&gt;1- Activating TensorBoard&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#2--accessing-tensorboard&quot;&gt;2- Accessing TensorBoard&lt;/a&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#google-colab&quot;&gt;Google Colab&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#ngrok&quot;&gt;ngrok&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#google-cloud-platform-gcp&quot;&gt;Google Cloud Platform (GCP)&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;1--activating-tensorboard&quot;&gt;1- Activating TensorBoard&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;For OpenNMT-tf, TensorBoard is enabled by default. For OpenNMT-py, you need to enable TensorBoard, and optionally customize the log directory. Add these lines to the training configuration YAML file.
    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;tensorboard: true
tensorboard_log_dir: logs
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Start your OpenNMT training as usual.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Create a screen for TensorBoard: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;screen -S tensorboard&lt;/code&gt;. Note: if you use Google Colab, you do not need &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;screen&lt;/code&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Open the directory of the log files. In OpenNMT-tf, by default the log files are in the same folder as the model. In OpenNMT-py, the logs are in a directory with today’s date inside “runs/onmt” or the path you specified for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tensorboard_log_dir&lt;/code&gt; in your config file.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
&lt;p&gt;If you have multiple models you want to compare, located in one parent directory, you can instead use the path of this parent directory.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Start TensorBoard and specify the log directory: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tensorboard --logdir=&quot;.&quot;&lt;/code&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;At this point, you should see a message that TensorBoard is running on localhost &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;http://localhost:6006/&lt;/code&gt; and that’s how to access it from a local browser if you are working on the same machine.&lt;/p&gt;
  &lt;/li&gt;
&lt;li&gt;Detach from this screen by pressing Ctrl+A, then D.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;2--accessing-tensorboard&quot;&gt;2- Accessing TensorBoard&lt;/h2&gt;

&lt;p&gt;There are multiple ways in which you can display the output of TensorBoard. We are exploring some of the most popular approaches.&lt;/p&gt;

&lt;h3 id=&quot;google-colab&quot;&gt;Google Colab&lt;/h3&gt;

&lt;p&gt;You can start TensorBoard within the notebook using &lt;a href=&quot;https://ipython.readthedocs.io/en/stable/interactive/magics.html&quot;&gt;magics&lt;/a&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;%load_ext tensorboard
%tensorboard --logdir runs
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;exposing-tensorboard-to-network&quot;&gt;Exposing TensorBoard to the network&lt;/h3&gt;

&lt;p&gt;You can add the flag &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--bind_all&lt;/code&gt; to your command to make TensorBoard accessible from a local browser via the server IP.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;tensorboard --logdir logs --bind_all
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;ngrok&quot;&gt;ngrok&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Sign up to &lt;a href=&quot;https://ngrok.com/&quot;&gt;ngrok&lt;/a&gt; and download the suitable version; for example the one for &lt;a href=&quot;https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip&quot;&gt;Linux&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Unzip the downloaded ngrok archive.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Find your authentication key &lt;a href=&quot;https://dashboard.ngrok.com/get-started/setup&quot;&gt;here&lt;/a&gt; and run the command: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;./ngrok authtoken &amp;lt;your_authentication_key&amp;gt;&lt;/code&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Start a new screen: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;screen -S ngrok&lt;/code&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
&lt;p&gt;Start ngrok on TensorBoard’s default port 6006: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;./ngrok http 6006&lt;/code&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;If everything works well, you should see a black screen with “Session Status Online” and other details, including “Forwarding”.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
&lt;p&gt;Copy the “Forwarding” HTTP or HTTPS URL and open it in your browser. You should be able to see something like this:&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;../static/img/tensorboard.png&quot; alt=&quot;tensorboard&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Disclaimer: The ngrok method should be only used for research or demonstration purposes. For corporate and security-sensitive purposes, consult with your team first. Depending on the infrastructure you are using, there might be better methods.&lt;/p&gt;

&lt;h3 id=&quot;google-cloud-platform-gcp&quot;&gt;Google Cloud Platform (GCP)&lt;/h3&gt;

&lt;p&gt;If you are training your models on Google Cloud Platform (GCP), you can instead run TensorBoard locally using the approach explained &lt;a href=&quot;https://cs.brown.edu/courses/csci1430/proj4/gcp-guide/&quot;&gt;here&lt;/a&gt; and &lt;a href=&quot;https://stackoverflow.com/questions/43711110/google-cloud-platform-access-tensorboard&quot;&gt;here&lt;/a&gt;, for example.&lt;/p&gt;

&lt;p&gt;You can learn more &lt;a href=&quot;https://www.tensorflow.org/tensorboard/get_started&quot;&gt;here&lt;/a&gt; about TensorBoard and how to use it in other scenarios.&lt;/p&gt;

</description>
        <pubDate>Fri, 19 Feb 2021 00:00:00 +0000</pubDate>
        <link>https://blog.machinetranslation.io/TensorBoard/</link>
        <guid isPermaLink="true">https://blog.machinetranslation.io/TensorBoard/</guid>
        
        
        <category>nmt</category>
        
      </item>
    
      <item>
        <title>Bash Commands for NLP Engineers</title>
<description>&lt;p&gt;As using Bash commands is inevitable if you work on NLP and MT tasks, I thought it would be useful to list the majority of commands I have learnt to use on a daily basis, thanks to practice, searching, and the helpful colleagues I have met over the years. Obviously, this is not an exhaustive list; however, I hope it includes most of the one-line Bash commands you would need. Please note that the majority of these commands have been mainly tested on Linux.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;
&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#file-management&quot;&gt;File Management&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#reading-files&quot;&gt;Reading Files&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#nano-editor-commands&quot;&gt;Nano Editor Commands&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#finding&quot;&gt;Finding&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#downloading&quot;&gt;Downloading&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#compressing-and-extracting&quot;&gt;Compressing and Extracting&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#server-related-bash-commands&quot;&gt;Server-related Bash Commands&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#other-useful-packages&quot;&gt;Other Useful Packages&lt;/a&gt;
&lt;br /&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;file-management&quot;&gt;File Management&lt;/h2&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Open a directory:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;cd &amp;lt;path/dir_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;List the files and sub-directories in the current directory:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;ls&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Create a new directory:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;mkdir &amp;lt;dir_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Rename or move a file or directory:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;mv &amp;lt;old_filename&amp;gt; &amp;lt;new_filename&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Move a file to a directory:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;mv &amp;lt;old_filename&amp;gt; &amp;lt;dir_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Move all files whose names start with a string, using *:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;mv &amp;lt;old_filename&amp;gt;* &amp;lt;folder_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Rename multiple files:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;rename 's/&amp;lt;original_string&amp;gt;/&amp;lt;new_string&amp;gt;/g' *&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Delete a file:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;rm &amp;lt;file_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;To delete multiple files, just add them after the rm command separated by spaces:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;rm &amp;lt;file_name1&amp;gt; &amp;lt;file_name2&amp;gt; &amp;lt;file_name3&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Delete any file that starts with “wow”, using *:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;rm wow*&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Delete a directory and its contents:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;rm -r &amp;lt;dir_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;To avoid deleting files by mistake, use trash instead of rm, after installing trash-cli:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;sudo apt-get install trash-cli&lt;br /&gt;
• Delete:&lt;br /&gt;
trash &amp;lt;file_name&amp;gt;&lt;br /&gt;
• List trashed items:&lt;br /&gt;
trash-list&lt;br /&gt;
• Restore a file (first move to the root folder or a specific folder):&lt;br /&gt;
restore-trash and then type a number.&lt;br /&gt;
• Empty the trash list:&lt;br /&gt;
trash-empty&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Copy a file:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;cp &amp;lt;original_filename&amp;gt; &amp;lt;new_filename&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Copy a directory and its contained files (at least -r is required):&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;cp -avr &amp;lt;original_dirname&amp;gt; &amp;lt;new_dirname&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Copy and show a progress bar (good for large files)&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;rsync -ah --progress &amp;lt;source&amp;gt; &amp;lt;destination&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Complete a command or file name (e.g. my_file_name.txt):&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Type &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;my&lt;/code&gt; and then press &lt;kbd&gt;Tab&lt;/kbd&gt; – once if there is no other file starting with “my”. &lt;br /&gt;
OR&lt;/li&gt;
  &lt;li&gt;Type &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;my&lt;/code&gt; and then press &lt;kbd&gt;Tab&lt;/kbd&gt; – twice if you want to see which files start with “my”.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Move to a location in a command or text:&lt;/strong&gt;
Move the cursor to the location, press &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Alt&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Option&lt;/code&gt;, and click.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Clear the current window:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Type &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;clear&lt;/code&gt;&lt;br /&gt;
OR&lt;/li&gt;
  &lt;li&gt;Press &lt;kbd&gt;Ctrl&lt;/kbd&gt;+&lt;kbd&gt;l&lt;/kbd&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;End the current command (before it finishes):&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Press &lt;kbd&gt;Ctrl&lt;/kbd&gt;+&lt;kbd&gt;c&lt;/kbd&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Move to the last accessed path:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;cd -&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;List your previous commands&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;history&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Search your command history&lt;/strong&gt;&lt;br /&gt;
&lt;kbd&gt;Ctrl&lt;/kbd&gt;+&lt;kbd&gt;r&lt;/kbd&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;List the *.txt files in the current directory (or path):&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;ls *.txt&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Show the files in all folders that start with “aaa”:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;ls aaa*&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Show files and subdirectories in all directories in the current directory:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;ls *&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;List all the files with details:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;ls -l&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Display file details:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;ls -l &amp;lt;file_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;List all the files with details, with human-readable sizes (KB/MB/GB):&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;ls -lh&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;List all the files with details and human-readable sizes, sorted by modification time, newest first:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;ls -lht&lt;br /&gt;
ls -lht &amp;lt;dir_name1&amp;gt;/*/&amp;lt;dir_name2&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;List all the files with details and human-readable sizes, sorted by modification time, oldest first:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;ls -lhtr&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;List file sizes only for all files in the current directory:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;ls -hs&lt;br /&gt;
OR&lt;br /&gt;
du&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Display the file size only:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;ls -hs &amp;lt;file_name&amp;gt;&lt;br /&gt;
OR&lt;br /&gt;
du -h &amp;lt;file_name&amp;gt;&lt;br /&gt;
OR, for one total size of a directory and everything in it:&lt;br /&gt;
du -hs &amp;lt;dir_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Display the last modified file:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;ls -t | head -1&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Display sizes of the current directory:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;du -d 1 -h .&lt;br /&gt;
Sort the results in ascending order:&lt;br /&gt;
du -d 1 -h . | sort -h&lt;br /&gt;
Sort the results in descending order:&lt;br /&gt;
du -d 1 -h . | sort -h -r&lt;/p&gt;
&lt;/blockquote&gt;
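As a quick sanity check of the du and sort combination above, here is a toy run with made-up directory names; the grand total for “.” always sorts last:

```shell
# Create two directories of very different sizes.
mkdir -p big small
head -c 1048576 /dev/zero > big/file.bin   # ~1 MB
head -c 1024 /dev/zero > small/file.bin    # ~1 KB
# Per-directory sizes in 1K blocks, smallest first; the "." total comes last.
du -d 1 . | sort -n
```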

&lt;p&gt;&lt;strong&gt;Find files that are bigger than 200MB:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;find /home/$USER/ -type f -size +200000k -exec ls -lh {} \; | awk &apos;{ print $9 &quot;: &quot; $5 }&apos;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Display file size with stat (Linux):&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;stat --printf=&quot;%s&quot; &amp;lt;file_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Display file last edited time (Linux):&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;stat -c %y &amp;lt;file_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Display file last edited time (Mac):&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;stat -x &amp;lt;file_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Get the current path (print working directory):&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;pwd&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Create a symbolic link, i.e. a shortcut to a file or directory:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;ln -s &amp;lt;file_name&amp;gt; &amp;lt;shortcut_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Get the path of a file:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;readlink -f &amp;lt;file_name&amp;gt;&lt;br /&gt;
OR&lt;br /&gt;
echo &quot;$(pwd)/file_name&quot;&lt;br /&gt;
OR&lt;br /&gt;
realpath &amp;lt;file_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Get the line, word, and byte counts of a file:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;wc &amp;lt;file_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Get the number of lines in a file:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;wc -l &amp;lt;file_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Count lines of all matching files in subdirectories; wrap a partial file name in *:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;find ./ -type f -name &quot;*&amp;lt;file_name&amp;gt;*&quot; -exec wc -l {} +&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Count lines in a *.gz file; use -c to avoid writing the uncompressed file to disk:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;gunzip -c &amp;lt;file_name.gz&amp;gt; | wc -l&lt;/p&gt;
&lt;/blockquote&gt;
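An end-to-end check of the gunzip piping trick above, using a throwaway file name:

```shell
seq 1 5 > data.txt              # a file with 5 numbered lines
gzip data.txt                   # creates data.txt.gz and removes data.txt
gunzip -c data.txt.gz | wc -l   # counts 5 lines without writing the file to disk
```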

&lt;p&gt;&lt;strong&gt;Split a file into multiple files, 3000 lines each, with numeric-suffixes:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;split -a 4 -d -l 3000 &amp;lt;file_name&amp;gt; &amp;lt;prefix&amp;gt; --additional-suffix=&amp;lt;extension&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;
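A small sketch of split on a 10-line file (file and prefix names are made up; the -d numeric-suffix flag is a GNU split feature and is not available in BSD/macOS split):

```shell
# Split a 10-line file into 3-line chunks with 4-digit numeric suffixes.
seq 1 10 > sample.txt
split -a 4 -d -l 3 sample.txt part_
ls part_*   # part_0000 part_0001 part_0002 part_0003
```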

&lt;p&gt;&lt;strong&gt;Find out if two files are identical:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;cmp --silent &amp;lt;file_name1&amp;gt; &amp;lt;file_name2&amp;gt; || echo &quot;Files are different.&quot;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Find out the difference between two files:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;diff &amp;lt;file_name1&amp;gt; &amp;lt;file_name2&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Find different lines in file1.txt compared to file2.txt:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;comm -23 &amp;lt;(sort file1.txt) &amp;lt;(sort file2.txt) &amp;gt; different.txt&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Find common lines in both file1.txt and file2.txt:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;comm -12 &amp;lt;(sort file1.txt) &amp;lt;(sort file2.txt) &amp;gt; common.txt&lt;/p&gt;
&lt;/blockquote&gt;
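The two comm commands above can be sanity-checked on a toy pair of files. This sketch sorts into temporary files first; the `<(sort …)` process substitution used above does the same thing inline, but requires bash:

```shell
printf 'c\na\nb\n' > file1.txt
printf 'd\nb\nc\n' > file2.txt
sort file1.txt > file1.sorted
sort file2.txt > file2.sorted
comm -23 file1.sorted file2.sorted > different.txt  # lines only in file1.txt
comm -12 file1.sorted file2.sorted > common.txt     # lines in both files
```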

&lt;p&gt;&lt;strong&gt;Continue a long command on a new line (end the line with a backslash):&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;\&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;reading-files&quot;&gt;Reading Files&lt;/h2&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Read the whole file:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;cat &amp;lt;file_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Read the whole file; display line numbers:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;cat -n &amp;lt;file_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Read the first 10 lines of a file:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;head &amp;lt;file_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Read the first 4 lines of a file:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;head -4 &amp;lt;file_name&amp;gt;&lt;br /&gt;
OR&lt;br /&gt;
head -n 4 &amp;lt;file_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Read the first 3 lines of two files:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;head -q -n 3 &amp;lt;file_name1&amp;gt; &amp;lt;file_name2&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Read the last 10 lines of a file:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;tail &amp;lt;file_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Read the last 3 lines of a file:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;tail -3 &amp;lt;file_name&amp;gt;&lt;br /&gt;
OR&lt;br /&gt;
tail -n 3 &amp;lt;file_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Read a specific line of a file, e.g. line #10:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;sed -n 10p &amp;lt;file_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;
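For example, printing line 10 of a numbered throwaway file returns exactly that line:

```shell
seq 1 20 > nums.txt       # lines "1" through "20"
sed -n 10p nums.txt       # prints: 10
```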

&lt;p&gt;&lt;strong&gt;Read the end of the file and use -f to update the output:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;tail -f &amp;lt;file_name&amp;gt;&lt;br /&gt;
Use &lt;kbd&gt;Ctrl&lt;/kbd&gt;+&lt;kbd&gt;c&lt;/kbd&gt; to exit.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Read a file in chunks:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;less &amp;lt;file_name&amp;gt;&lt;br /&gt;
Press &lt;kbd&gt;Space&lt;/kbd&gt; to move to the next chunk of the file, and “q” to quit.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Read a file in chunks, display line numbers:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;less -N &amp;lt;file_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Disable sending to stdout (i.e. printing in Terminal) by adding 1&amp;gt; /dev/null&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;cat &amp;lt;file_name1&amp;gt; &amp;lt;file_name2&amp;gt; | tee &amp;lt;output_file_name&amp;gt; 1&amp;gt; /dev/null&lt;/p&gt;
&lt;/blockquote&gt;
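A minimal check of the tee redirection above, with throwaway file names:

```shell
printf 'line A\n' > f1.txt
printf 'line B\n' > f2.txt
# Write the merged output to merged.txt; 1> /dev/null silences the terminal copy.
cat f1.txt f2.txt | tee merged.txt 1> /dev/null
wc -l < merged.txt   # 2
```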

&lt;h2 id=&quot;processing&quot;&gt;Processing&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Merge two files, use &amp;gt; to create the output file:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;cat &amp;lt;file_name1&amp;gt; &amp;lt;file_name2&amp;gt; &amp;gt; &amp;lt;output_file&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Merge all the files that end with a given extension (say “.en”) into one file (e.g. “all.en”):&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;cat *.en &amp;gt; all.en&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Merge all the files in the current folder:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;cat * &amp;gt; &amp;lt;output_file_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Merge the source text and target translation into one tab-delimited file&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;paste -d &quot;\t&quot; all.en all.ar &amp;gt; all.enar&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Remove duplicates from a file&lt;/strong&gt; (note that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;uniq -u&lt;/code&gt; would instead keep only the lines that occur exactly once, dropping every copy of a duplicated line)&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;sort -S 95% --parallel=8 all.enar | uniq &amp;gt; all.unique.enar&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Shuffle&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;shuf all.unique.enar &amp;gt; all.unique.shuf.enar&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Split a tab-delimited file into two files, one for the source and one for the target&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;cut -f 1 all.unique.shuf.enar &amp;gt; all.unique.en&lt;br /&gt;
cut -f 2 all.unique.shuf.enar &amp;gt; all.unique.ar&lt;/p&gt;
&lt;/blockquote&gt;
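Putting the processing steps above together, here is a toy run of the whole parallel-corpus pipeline (merge, deduplicate, shuffle, split back into source and target). The file names follow the examples above; note that shuf is a GNU coreutils tool (gshuf via coreutils on macOS):

```shell
# Toy aligned files with one duplicated sentence pair.
printf 'Hello\nHello\nBye\n' > all.en
printf 'Hola\nHola\nAdios\n' > all.ar
paste -d "\t" all.en all.ar > all.enar        # merge into tab-delimited pairs
sort all.enar | uniq > all.unique.enar        # keep one copy of each pair
shuf all.unique.enar > all.unique.shuf.enar   # shuffle the pairs together
cut -f 1 all.unique.shuf.enar > all.unique.en # source side
cut -f 2 all.unique.shuf.enar > all.unique.ar # target side
```

Deduplicating on the merged file (rather than on each side separately) is what keeps the two sides aligned line by line.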

&lt;p&gt;&lt;strong&gt;Replace “abc” with “XYZ” in a file&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;sed -i -e &apos;s/abc/XYZ/g&apos; /tmp/file.txt&lt;/p&gt;
&lt;/blockquote&gt;
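For example, with a throwaway file (note that on macOS/BSD sed, -i needs an explicit backup suffix, e.g. sed -i '' -e …):

```shell
printf 'abc def\nxyz abc\n' > /tmp/file.txt
sed -i -e 's/abc/XYZ/g' /tmp/file.txt   # GNU sed in-place replacement
cat /tmp/file.txt                        # both lines now contain XYZ
```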

&lt;h2 id=&quot;nano-editor-commands&quot;&gt;Nano Editor Commands&lt;/h2&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Create a new file:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;nano &amp;lt;new_file_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Open an existing file:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;nano &amp;lt;file_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Open multiple files:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;nano &amp;lt;file_name1&amp;gt; &amp;lt;file_name2&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Search the current file:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;&lt;kbd&gt;Ctrl&lt;/kbd&gt;+&lt;kbd&gt;w&lt;/kbd&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Move to the end of the file:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;&lt;kbd&gt;Ctrl&lt;/kbd&gt;+&lt;kbd&gt;w&lt;/kbd&gt; and then &lt;kbd&gt;Ctrl&lt;/kbd&gt;+&lt;kbd&gt;v&lt;/kbd&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Move to the end of the line:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;&lt;kbd&gt;Ctrl&lt;/kbd&gt;+&lt;kbd&gt;e&lt;/kbd&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Move to the start of the line:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;&lt;kbd&gt;Ctrl&lt;/kbd&gt;+&lt;kbd&gt;a&lt;/kbd&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Delete the current line:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;&lt;kbd&gt;Ctrl&lt;/kbd&gt;+&lt;kbd&gt;k&lt;/kbd&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Move a page down:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;&lt;kbd&gt;Ctrl&lt;/kbd&gt;+&lt;kbd&gt;v&lt;/kbd&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Move a page up:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;&lt;kbd&gt;Ctrl&lt;/kbd&gt;+&lt;kbd&gt;y&lt;/kbd&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Cut the current line:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;&lt;kbd&gt;Ctrl&lt;/kbd&gt;+&lt;kbd&gt;k&lt;/kbd&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Mark text:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;&lt;kbd&gt;Ctrl&lt;/kbd&gt;+&lt;kbd&gt;Shift&lt;/kbd&gt;+&lt;kbd&gt;6&lt;/kbd&gt; (i.e. it is &lt;kbd&gt;Ctrl&lt;/kbd&gt;+&lt;kbd&gt;^&lt;/kbd&gt;) and then move in the direction you need.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Cut the marked text:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;&lt;kbd&gt;Ctrl&lt;/kbd&gt;+&lt;kbd&gt;k&lt;/kbd&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Paste the cut text:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;&lt;kbd&gt;Ctrl&lt;/kbd&gt;+&lt;kbd&gt;u&lt;/kbd&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Note: to paste across multiple files, open the files together in the same nano session, cut or copy the text in the first file, switch to the other file, and then paste.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Close the current file:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;&lt;kbd&gt;Ctrl&lt;/kbd&gt;+&lt;kbd&gt;x&lt;/kbd&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You will be prompted if you want to save; type “y” for yes and “n” for no. If you select to save, just press Enter to keep the current file name. You can also move between two open files as in the next command.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Move between two open files:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;&lt;kbd&gt;Alt&lt;/kbd&gt;+&lt;kbd&gt;.&lt;/kbd&gt; to move forward one file.&lt;br /&gt;
&lt;kbd&gt;Alt&lt;/kbd&gt;+&lt;kbd&gt;,&lt;/kbd&gt; to move backward one file.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Note that if you are on Mac, Option+. and Option+, are used to insert ≥≤ symbols, so you need to first press Alt+Command+O to change the behaviour of Option in Terminal.&lt;/p&gt;

&lt;h2 id=&quot;finding&quot;&gt;Finding&lt;/h2&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Find files that include a phrase (e.g. “really great” in *.txt files):&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;grep &quot;really great&quot; *.txt&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Search sub-directories recursively using grep:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;grep -r &amp;lt;word_to_search&amp;gt; *&lt;br /&gt;
OR&lt;br /&gt;
grep -R &amp;lt;word_to_search&amp;gt; *&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Use regular expressions with grep, e.g. the only word in the line is ‘nan’:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;grep &apos;^nan$&apos; &amp;lt;file_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;
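The ^ and $ anchors restrict the match to whole lines, which this toy file demonstrates:

```shell
printf 'nan\nbanana\nnan here\n' > scores.txt
# Only the line that is exactly "nan" matches; "banana" and "nan here" do not.
grep -c '^nan$' scores.txt   # 1
```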

&lt;p&gt;&lt;strong&gt;Find a file on the machine by name:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;sudo find / -name &amp;lt;file_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Find all files in directory and subdirectories that end with *.en:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;find &quot;$PWD&quot; -type f | grep &apos;\.en$&apos;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Find all files in the directory and subdirectories whose paths contain “aaa” (in a regular expression, unlike in the shell, * means repetition of the previous character, so no * is needed here):&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;find &quot;$PWD&quot; -type f | grep &quot;aaa&quot;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Find files in the current directory whose names include “wonderful”:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;ls | grep “wonderful”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;If you have a very long list generated by ls and want to display it page by page:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;ls | less&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;List files whose names include a range of numbers:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;ls model.0{1..3}*&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;List files whose names include different letters:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;ls model.{a,b,c,d}&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Move multiple files&lt;/strong&gt; (or run any command on multiple files):&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;put the varying parts of the names inside &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;{ }&lt;/code&gt;, separated by commas.&lt;/li&gt;
&lt;/ul&gt;
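A minimal sketch of this brace expansion, with a made-up file name (brace expansion is a bash/zsh feature, not plain POSIX sh):

```shell
touch report.txt
# The shell expands the braces before running mv, so this runs as:
#   mv report.txt report.bak
mv report.{txt,bak}
```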

&lt;p&gt;&lt;strong&gt;Find installed Python3 packages:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;pip3 freeze&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Find installed Python3 packages that start with “tensor”, use -i to ignore case:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;pip3 freeze | grep -i tensor&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Find the location of a command (e.g python3):&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;which python3&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;downloading&quot;&gt;Downloading&lt;/h2&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Download a file using curl:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;curl &amp;lt;http://some.url&amp;gt; --output &amp;lt;file_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If this is the first time you use curl, you might get a message like “Command ‘curl’ not found, but can be installed with:”&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;sudo apt install curl&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Download a file that requires cookies:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;curl --cookie &amp;lt;cookies.txt&amp;gt; &amp;lt;http://some.url&amp;gt; --output &amp;lt;file_name&amp;gt;&lt;br /&gt;
To get the “cookies.txt” file, you can use a Chrome extension like “cookies.txt” to export cookies into a TXT file.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Copy GitHub repository to the machine:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;git clone https://github.com/USERNAME/REPOSITORYNAME&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Update a downloaded GitHub repository:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;cd &amp;lt;repository_dir_name&amp;gt;&lt;br /&gt;
git checkout master&lt;br /&gt;
git pull&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Stage and Commit a GitHub repository&lt;/strong&gt; (&lt;a href=&quot;https://www.nobledesktop.com/learn/git/stage-commit-files&quot;&gt;details&lt;/a&gt;)&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;git add &amp;lt;file_name&amp;gt;&lt;br /&gt;
git commit -m &quot;Message, e.g. Update file&quot;&lt;br /&gt;
git push origin main&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The default branch is usually called “master” or “main” – if it is not, replace it with the right name.&lt;/p&gt;

&lt;h2 id=&quot;compressing-and-extracting&quot;&gt;Compressing and Extracting&lt;/h2&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extract a *.zip file:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;unzip &amp;lt;file_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Create a zip archive from file(s):&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;zip &amp;lt;archive_filename&amp;gt; &amp;lt;file_list&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Create a zip archive from a directory with high level of compression:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;zip -r -9 &amp;lt;archive_filename.zip&amp;gt; &amp;lt;dir_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Extract a *.gz file:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;gunzip &amp;lt;file_name.gz&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Compress all the files separately as file_name.gz&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;cd &amp;lt;dir_name&amp;gt;&lt;br /&gt;
gzip *&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Compress all the files in the same directory even if there are subdirectories:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;cd &amp;lt;dir_name&amp;gt;&lt;br /&gt;
gzip -r .&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Extract a *.tar.gz file:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;tar xzvf &amp;lt;file_name.tar.gz&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Extract a *.tgz file:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;tar xzvf &amp;lt;file_name.tgz&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Extract in a different directory:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;tar xzvf &amp;lt;file_name.tgz&amp;gt; -C &amp;lt;/path/dir_name&amp;gt;&lt;br /&gt;
OR&lt;br /&gt;
gunzip -c &amp;lt;file_name.tgz&amp;gt; | tar xvf -&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Create a *.tar.gz archive:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;tar -czvf archive.tar.gz &amp;lt;dir_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Create a *.tar.gz archive from multiple files/directories:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;tar -czvf &amp;lt;archive_file_name.tar.gz&amp;gt; &amp;lt;file_name1&amp;gt; &amp;lt;file_name2&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Compress as *.tar.bz2 (higher compression):&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;tar -jcvf &amp;lt;archive_name.tar.bz2&amp;gt; &amp;lt;file_dir_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Extract a *.tar.bz2 archive:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;tar -jxvf &amp;lt;archive_name.tar.bz2&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Compress all the files separately as file_name.bz2&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;bzip2 *&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Extract file_name.bz2 (without tar)&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;bzip2 -d &amp;lt;file_name.bz2&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;
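The tar commands above compose into a simple round trip; a sketch with throwaway directory and file names:

```shell
mkdir -p mydir
echo hello > mydir/a.txt
tar -czvf archive.tar.gz mydir         # create a compressed archive of the directory
mkdir -p extracted
tar xzvf archive.tar.gz -C extracted   # extract it into another directory
cat extracted/mydir/a.txt              # hello
```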

&lt;h2 id=&quot;server-related-bash-commands&quot;&gt;Server-related Bash Commands&lt;/h2&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;Obviously, many of these commands can be used locally, but they are most useful while working on servers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Find out the server date and time:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;date&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Measure time taken to run a script or command:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;time &amp;lt;command&amp;gt;, e.g. time python3 script.py&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Find out the free space on the disk:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;df -h&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Create an alias for a command:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;alias &amp;lt;name&amp;gt;=&quot;&amp;lt;command&amp;gt;&quot;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;To save aliases permanently, add them to ~/.bash_aliases:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;nano ~/.bash_aliases&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;For example, you can add this command to the ~/.bash_aliases file, use quotes for multi-word commands:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;alias frz=&quot;pip3 freeze&quot;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;For the alias change to take effect&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;source ~/.bash_aliases&lt;br /&gt;
OR&lt;br /&gt;
exec bash&lt;br /&gt;
The next time you type &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;frz&lt;/code&gt; in the Terminal, it will run the command &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pip3 freeze&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Repeat the same command&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;watch &amp;lt;command&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Avoid ending a command if the local Terminal is closed:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;screen&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Create a new screen with a name:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;screen -S &amp;lt;name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Create a new screen with logging enabled; screenlog.0 is created:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;screen -L -S &amp;lt;name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Detach the current screen:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;&lt;kbd&gt;Ctrl&lt;/kbd&gt;+&lt;kbd&gt;a&lt;/kbd&gt;+&lt;kbd&gt;d&lt;/kbd&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Resume a single screen:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;screen -r&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Resume a screen from multiple running screen:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;screen -r &amp;lt;name&amp;gt;&lt;br /&gt;
OR&lt;br /&gt;
screen -r &amp;lt;id&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;List the currently running screens:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;screen -list&lt;br /&gt;
OR&lt;br /&gt;
screen -ls&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;End a screen:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;screen -X -S &amp;lt;id&amp;gt; quit&lt;br /&gt;
or resume the screen and then&lt;br /&gt;
&lt;kbd&gt;Ctrl&lt;/kbd&gt;+&lt;kbd&gt;a&lt;/kbd&gt; then &lt;kbd&gt;k&lt;/kbd&gt; then &lt;kbd&gt;y&lt;/kbd&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Shutdown the machine after a command finishes; separate the two commands with ; (use &amp;amp;&amp;amp; instead to shut down only if the command succeeds):&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;python3 file.py; sudo shutdown&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Adjust file permissions, access by the current user only:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;chmod 700 &amp;lt;file_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For example, this is required before using the *.pem key file provided by AWS EC2.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Display RAM used:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;free -m&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Display GPU memory used:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;nvidia-smi&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Find the CUDA version:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;nvcc --version&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Run a command continuously&lt;/strong&gt; (optionally use -n &amp;lt;seconds&amp;gt; for the refresh interval, and -d to highlight changes):&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;watch &amp;lt;command&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Check kernel termination errors&lt;/strong&gt; (use one of these commands)&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;dmesg&lt;br /&gt;
OR&lt;br /&gt;
nano /var/log/kern.log&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Check currently running processes - use grep if you are looking for a specific type of processes:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;ps -ef | grep python3&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Copy a file from the local machine to a server (run it from the local machine):&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;scp &amp;lt;file_name&amp;gt; &amp;lt;user&amp;gt;@&amp;lt;server_ip:port&amp;gt;:/&amp;lt;dir_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Copy a directory from the local machine to a server; use -r (run it from the local machine):&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;scp -r &amp;lt;dir_name&amp;gt; &amp;lt;user&amp;gt;@&amp;lt;server_ip:port&amp;gt;:/&amp;lt;dir_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Copy a file from the local machine to an AWS EC2 instance (run it from the local machine):&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;scp -i &amp;lt;key.pem&amp;gt; &amp;lt;file_name&amp;gt; ubuntu@ec2[…].compute.amazonaws.com:~/&amp;lt;dir_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Copy a file from a server to the local machine (run it from the local machine):&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;scp &amp;lt;user&amp;gt;@&amp;lt;server_ip:port&amp;gt;:/&amp;lt;dir_name&amp;gt;/&amp;lt;file_name&amp;gt; &amp;lt;/path/on/the/local/machine&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Move a file from Google Cloud to the local machine:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;gcloud compute scp --project &amp;lt;project_name&amp;gt; --recurse &amp;lt;user_name&amp;gt;@&amp;lt;machine_name&amp;gt;:~/&amp;lt;dir_name&amp;gt;/&amp;lt;file_name&amp;gt; &amp;lt;/path/on/the/local/machine&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Log out of the current connection (and in similar scenarios):&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;&lt;kbd&gt;Ctrl&lt;/kbd&gt;+&lt;kbd&gt;d&lt;/kbd&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;other-useful-packages&quot;&gt;Other Useful Packages&lt;/h2&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;Among useful packages that you might want to install yourself are:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;curl&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wget&lt;/code&gt; for downloading files, or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;aria2c&lt;/code&gt; for faster download&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;trash-cli&lt;/code&gt; for trashing unwanted files into a folder instead of using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rm&lt;/code&gt; command&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tree&lt;/code&gt; for displaying the directory structure&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;htop&lt;/code&gt; for monitoring CPU resources&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;locate&lt;/code&gt; for quickly finding files by name after &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;updatedb&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ack&lt;/code&gt; for searching files like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;grep&lt;/code&gt;, but faster&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;parallel&lt;/code&gt; for multithreading from the bash&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;s3cmd&lt;/code&gt; for uploading and downloading files between AWS S3 buckets and non-AWS servers. For AWS EC2 servers, use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;aws s3&lt;/code&gt; command instead.&lt;/li&gt;
&lt;/ul&gt;
</description>
        <pubDate>Sun, 02 Aug 2020 00:00:00 +0000</pubDate>
        <link>https://blog.machinetranslation.io/bash-commands/</link>
        <guid isPermaLink="true">https://blog.machinetranslation.io/bash-commands/</guid>
        
        
        <category>mt</category>
        
      </item>
    
      <item>
        <title>Pre-trained Neural Machine Translation (NMT) Models</title>
        <description>&lt;p&gt;Neural Machine Translation (NMT) in-domain models outperform generic models for the “domain” on which they are trained. In other words, in-domain models can observe terminology and generate translations that are more in line with a specialized context.&lt;/p&gt;

&lt;p&gt;You can download the NMT models below. Enjoy!&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://machinetranslation.io/nmt-pretrained-models&quot;&gt;Download Pre-Trained NMT Models&lt;/a&gt;&lt;/p&gt;
</description>
        <pubDate>Tue, 16 Jun 2020 00:00:00 +0000</pubDate>
        <link>https://blog.machinetranslation.io/pre-trained-nmt-models/</link>
        <guid isPermaLink="true">https://blog.machinetranslation.io/pre-trained-nmt-models/</guid>
        
        
        <category>nmt</category>
        
      </item>
    
      <item>
        <title>WER Score for Machine Translation</title>
        <description>&lt;p&gt;&lt;strong&gt;Update:&lt;/strong&gt; Currently, it is recommended to use &lt;strong&gt;&lt;em&gt;SacreBLEU&lt;/em&gt;&lt;/strong&gt; for calculating &lt;em&gt;BLEU&lt;/em&gt;, &lt;em&gt;ChrF&lt;/em&gt;, and &lt;em&gt;WER&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/mjpost/sacrebleu&quot;&gt;&lt;img src=&quot;https://github-readme-stats.vercel.app/api/pin/?theme=graywhite&amp;amp;username=mjpost&amp;amp;repo=sacrebleu&quot; alt=&quot;SacreBLEU&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Word Error Rate (WER) computes the minimum Edit Distance between the human-generated sentence and the machine-predicted sentence. In other tutorials, I explained &lt;a href=&quot;2020-01-26-compute-bleu-score.md&quot;&gt;how to use Python to compute BLEU&lt;/a&gt; and &lt;a href=&quot;https://python.gotrained.com/nltk-edit-distance-jaccard-distance/&quot;&gt;Edit Distance&lt;/a&gt;, and in this tutorial, I am going to explain how to calculate the WER score.&lt;/p&gt;
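&lt;p&gt;To make the definition concrete, here is an illustrative pure-Python sketch (not JIWER itself) of WER as the word-level minimum Edit Distance divided by the reference length:&lt;/p&gt;

```python
# Illustrative WER: minimum word-level edit distance divided by reference length.
# This is a didactic sketch, not the JIWER implementation.
def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for the Levenshtein distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions turn the first i reference words into nothing
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions build the first j hypothesis words from nothing
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("hello world", "hello duck"))  # 0.5: one substitution in two words
```

&lt;p&gt;A lower WER is better; 0.0 means the hypothesis matches the reference exactly.&lt;/p&gt;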

&lt;p&gt;For this WER score tutorial, I am going to use the Python library, &lt;em&gt;JIWER&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;
&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#files-required-to-compute-wer&quot;&gt;Files Required to Compute WER&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#corpus-wer-calculator&quot;&gt;Corpus WER Calculator&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#sentence-wer-calculator&quot;&gt;Sentence WER Calculator&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#conclusion&quot;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;files-required-to-compute-wer&quot;&gt;Files Required to Compute WER&lt;/h2&gt;

&lt;p&gt;To measure WER score, you need to have two files:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Ground Truth: It is the reference human translation (target) file of your test dataset. In the code, I will refer to such sentences as “refs” or “test” interchangeably.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Hypothesis: It is the Machine Translation prediction for the source of the same test dataset used for “Ground Truth”. In the code, I will refer to such sentences as “preds”.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;corpus-wer-calculator&quot;&gt;Corpus WER Calculator&lt;/h2&gt;

&lt;p&gt;JIWER allows computing the overall WER score on multiple sentences using two lists that include the same number of sentences. I am quoting a sample code from JIWER’s page as follows:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;jiwer&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;wer&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;ground_truth&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;hello world&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;i like monthy python&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;hypothesis&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;hello duck&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;i like python&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;error&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;wer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ground_truth&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hypothesis&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
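&lt;p&gt;Note that the corpus-level score is not the average of per-sentence scores: conceptually, it aggregates the word edits across all sentence pairs and then divides by the total number of reference words. Here is a pure-Python sketch of this aggregation (illustrative only, not JIWER&apos;s own code):&lt;/p&gt;

```python
# Illustrative corpus-level WER: total word edits over total reference words.
def edit_distance(ref, hyp):
    # Levenshtein distance between two word lists
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)]

def corpus_wer(ground_truth, hypothesis):
    refs = [s.split() for s in ground_truth]
    hyps = [s.split() for s in hypothesis]
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_words = sum(len(r) for r in refs)
    return total_edits / total_words

ground_truth = ["hello world", "i like monthy python"]
hypothesis = ["hello duck", "i like python"]
print(round(corpus_wer(ground_truth, hypothesis), 3))  # 0.333: 2 edits over 6 reference words
```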

&lt;p&gt;Now, let’s apply the same concept to the two files.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Create the argument list. This is optional, but it lets you run the script with arguments from CMD/Terminal as follows:&lt;/li&gt;
&lt;/ol&gt;

&lt;div class=&quot;language-console highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;go&quot;&gt;python3 wer.py human.txt mt.txt
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Open the two files, human translation and machine translation of the same test dataset, and add the sentences (lines) to two lists using the Python method &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;readlines()&lt;/code&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;From the JIWER library, use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wer&lt;/code&gt; to calculate the WER score on the two lists of sentences, and print the output.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here you can find the code that reflects these steps.&lt;/p&gt;

&lt;script src=&quot;https://gist.github.com/ymoslem/cdc320d9be2cb9a258fd5e0cc5871004.js&quot;&gt;&lt;/script&gt;

&lt;h2 id=&quot;sentence-wer-calculator&quot;&gt;Sentence WER Calculator&lt;/h2&gt;

&lt;p&gt;The previous code computes WER for the whole test dataset, which is the common practice. Still, you might want to calculate WER segment by segment. The following code uses the same &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wer&lt;/code&gt; method from the JIWER library inside a for loop. Finally, it saves the output, i.e. the WER score of each sentence on a new line, into a text file.&lt;/p&gt;

&lt;script src=&quot;https://gist.github.com/ymoslem/f1783b566b3a17b4107a34198daee6a6.js&quot;&gt;&lt;/script&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;So just as we did for computing the BLEU score for Machine Translation, we have now managed to use the WER score as well. As I said earlier, these scores are mainly useful for comparing the quality of different models, rather than judging the acceptability of each individual sentence. In Speech Recognition evaluation, it makes sense to expect the system to convey each uttered word exactly and in the same order; Machine Translation evaluation, however, is trickier because different wordings can still convey the same meaning. Hence, Machine Translation evaluation is still a hot research topic, and in some cases human evaluation is preferred.&lt;/p&gt;

&lt;p&gt;If you have questions, please feel free to comment.&lt;/p&gt;

</description>
        <pubDate>Wed, 04 Mar 2020 00:00:00 +0000</pubDate>
        <link>https://blog.machinetranslation.io/compute-wer-score/</link>
        <guid isPermaLink="true">https://blog.machinetranslation.io/compute-wer-score/</guid>
        
        
        <category>nmt</category>
        
      </item>
    
      <item>
        <title>Computing BLEU Score for Machine Translation</title>
        <description>&lt;p&gt;In this tutorial, I am going to explain how I compute the BLEU score for the Machine Translation output using Python.&lt;/p&gt;

&lt;p&gt;BLEU is simply a measure for evaluating the quality of your Machine Translation system. It does not really matter whether your MT target is from a high-level framework like OpenNMT or Marian, or from a lower-level one like TensorFlow or PyTorch. It does not also matter whether it is a Neural Machine Translation system or a Statistical Machine Translation tool like Moses.&lt;/p&gt;
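&lt;p&gt;Before going through the tooling, it helps to see what BLEU actually computes: the geometric mean of modified n-gram precisions, multiplied by a brevity penalty that punishes translations shorter than the reference. The following is a simplified, illustrative sentence-level implementation; real toolkits such as sacreBLEU add standard tokenization and smoothing on top of this core formula.&lt;/p&gt;

```python
import math
from collections import Counter

# Simplified sentence-level BLEU for illustration only; sacreBLEU adds
# standard tokenization and smoothing on top of this core formula.
def simple_bleu(reference, hypothesis, max_n=4):
    ref, hyp = reference.split(), hypothesis.split()
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        # Modified precision: clip each n-gram count by its count in the reference
        overlap = sum(min(count, ref_ngrams[ngram]) for ngram, count in hyp_ngrams.items())
        total = sum(hyp_ngrams.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, any zero precision zeroes the score
        log_precision_sum += math.log(overlap / total) / max_n
    # Brevity penalty: penalize hypotheses shorter than the reference
    brevity_penalty = min(1.0, math.exp(1 - len(ref) / len(hyp)))
    return 100 * brevity_penalty * math.exp(log_precision_sum)

print(simple_bleu("the cat sat on the mat", "the cat sat on the mat"))  # 100.0
```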

&lt;p&gt;So let’s see the steps I follow to calculate the BLEU score.&lt;/p&gt;

&lt;p&gt;Table of Contents&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#files&quot;&gt;Files Required to Compute BLEU&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#detoc&quot;&gt;Detokenization &amp;amp; BLEU Calculation&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#code&quot;&gt;Code of MT BLEU Calculator&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#args&quot;&gt;File Names as Arguments&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#sent&quot;&gt;Sentence BLEU Calculator&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#multi&quot;&gt;Multi-BLEU&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#meteor&quot;&gt;METEOR&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#accurate&quot;&gt;Final Note: Is BLEU Accurate?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a name=&quot;files&quot;&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;files-required-to-compute-bleu&quot;&gt;Files Required to Compute BLEU&lt;/h2&gt;

&lt;p&gt;To measure BLEU, you need to have two files:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Reference: It is the human translation (target) file of your test dataset. In the code, I will refer to such sentences as “refs” or “test” interchangeably.&lt;/li&gt;
  &lt;li&gt;System: It is the MTed translation/prediction, generated by the machine translation model for the source of the same test dataset used for “Reference”. In the code, I will refer to such sentences as “preds”.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a name=&quot;detoc&quot;&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;detokenization--bleu-calculation&quot;&gt;Detokenization &amp;amp; BLEU Calculation&lt;/h2&gt;

&lt;p&gt;To compute BLEU, I use &lt;a href=&quot;https://github.com/mjpost/sacreBLEU&quot;&gt;sacreBLEU&lt;/a&gt; which works on detokenized text (unless the ‘--force’ parameter is used). For the detokenization step, I use the Python library &lt;a href=&quot;https://github.com/alvations/sacremoses&quot;&gt;SacreMoses&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Why detokenization? Different tokenization tools generate different outputs while to be able to say that BLEU is a standard score, the factors must be the same. That is why sacreBLEU works on detokenized data and applies standard tokenization rules.&lt;/p&gt;

&lt;p&gt;For languages other than Japanese and Chinese, SacreBLEU uses mteval-v13a, the standard tokenization used by WMT.&lt;/p&gt;

&lt;p&gt;&lt;a name=&quot;code&quot;&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;code-of-mt-bleu-calculator&quot;&gt;Code of MT BLEU Calculator&lt;/h2&gt;

&lt;p&gt;BLEU is a corpus-level metric. Here is how you can calculate corpus BLEU using sacreBLEU.&lt;/p&gt;

&lt;script src=&quot;https://gist.github.com/ymoslem/f9e4df761f527996115387a2144912c0.js&quot;&gt;&lt;/script&gt;

&lt;p&gt;&lt;a name=&quot;args&quot;&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3 id=&quot;file-names-as-arguments&quot;&gt;File Names as Arguments&lt;/h3&gt;

&lt;p&gt;In the above script, file names are hardcoded. You can easily pass the file names as arguments instead. To let the Python script understand the arguments, first &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;import sys&lt;/code&gt; and then create two variables: one for the test dataset, e.g. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;target_test&lt;/code&gt;, with the value &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sys.argv[1]&lt;/code&gt; for the test file argument, and one for the MT output, e.g. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;target_pred&lt;/code&gt;, with the value &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sys.argv[2]&lt;/code&gt; for the MTed file argument. Optionally, you can also add an argument for language segmentation. Finally, instead of hardcoding the test dataset name and the MTed file name, use these two variables.&lt;/p&gt;

&lt;p&gt;As you can see in the Python script below, I used &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;argv&lt;/code&gt; which is a list including the arguments given in the command line; the first item [0] is saved for the Python script file name. So to run this script, you can use a similar command line in your CMD or Terminal:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;python3 bleu-script.py test.txt mt.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
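&lt;p&gt;A minimal sketch of this argument handling (the file names below are just examples):&lt;/p&gt;

```python
# Sketch of reading the file names as command-line arguments. In the real
# script, you would call parse_args(sys.argv) after `import sys`, so it can
# be run as: python3 bleu-script.py test.txt mt.txt
def parse_args(argv):
    # argv[0] is the script name itself; the next two items are the file paths
    target_test = argv[1]  # reference (human) translation file
    target_pred = argv[2]  # machine translation output file
    return target_test, target_pred

print(parse_args(["bleu-script.py", "test.txt", "mt.txt"]))  # ('test.txt', 'mt.txt')
```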

&lt;p&gt;Here is the BLEU script, but now with arguments.&lt;/p&gt;

&lt;script src=&quot;https://gist.github.com/ymoslem/70c83345efb9c3aba193aad7102b3016.js&quot;&gt;&lt;/script&gt;

&lt;p&gt;&lt;a name=&quot;sent&quot;&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;sentence-bleu-calculator&quot;&gt;Sentence BLEU Calculator&lt;/h2&gt;

&lt;p&gt;The previous code computes BLEU for the whole test dataset, which is the common practice. Still, you might want to calculate BLEU segment by segment. The following code uses the function &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sentence_bleu()&lt;/code&gt; from the sacreBLEU library inside a for loop. Finally, it saves the output, i.e. the BLEU score of each sentence on a new line, into a file called “bleu.txt”.&lt;/p&gt;

&lt;script src=&quot;https://gist.github.com/ymoslem/27747f9e10c057ee13867f3a61b6a144.js&quot;&gt;&lt;/script&gt;

&lt;p&gt;As we did with the corpus BLEU script, here is the sentence BLEU script, but now with arguments.&lt;/p&gt;

&lt;script src=&quot;https://gist.github.com/ymoslem/c200e30288ff9f4dc745a62410062b10.js&quot;&gt;&lt;/script&gt;

&lt;p&gt;&lt;strong&gt;Update:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The code is now updated to reflect two main changes:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Updates in version 1.4: (a) the reference sentence must be a list; and (b) use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bleu.score&lt;/code&gt; instead of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bleu&lt;/code&gt; to print/write the score.&lt;/li&gt;
  &lt;li&gt;Conclusions from this &lt;a href=&quot;https://github.com/mjpost/sacrebleu/issues/98&quot;&gt;discussion&lt;/a&gt;: add the argument &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;smooth_method=&apos;exp&apos;&lt;/code&gt; if you want to get the same result as when using sacreBLEU from the command line.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a name=&quot;multi&quot;&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;multi-bleu&quot;&gt;Multi-BLEU&lt;/h2&gt;

&lt;p&gt;One of the popular scripts to calculate BLEU is &lt;a href=&quot;https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl&quot;&gt;multi-bleu.perl&lt;/a&gt;. It works very similarly to sacreBLEU.&lt;/p&gt;

&lt;p&gt;According to the script “… you should detokenize then use &lt;a href=&quot;https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/mteval-v14.pl&quot;&gt;mteval-v14.pl&lt;/a&gt;, which has a standard tokenization.”&lt;/p&gt;

&lt;p&gt;To use multi-bleu.perl, you can simply run this command line in your Terminal.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;perl multi-bleu.perl human-translation.txt &amp;lt; mt-pred.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;a name=&quot;meteor&quot;&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;meteor&quot;&gt;METEOR&lt;/h2&gt;

&lt;p&gt;Using BLEU, you might wonder why it does not count some sub-words of the same origin as correct alternatives. So, I came across another metric called &lt;a href=&quot;https://www.cs.cmu.edu/~alavie/METEOR/&quot;&gt;&lt;em&gt;METEOR&lt;/em&gt;&lt;/a&gt;, which addresses this issue to some extent.&lt;/p&gt;

&lt;p&gt;I am quoting Rachael Tatman’s article &lt;a href=&quot;https://towardsdatascience.com/evaluating-text-output-in-nlp-bleu-at-your-own-risk-e8609665a213&quot;&gt;&lt;em&gt;Evaluating Text Output in NLP: BLEU at your own risk&lt;/em&gt;&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;METEOR is similar to BLEU but includes additional steps, like considering synonyms and comparing the stems of words (so that “running” and “runs” would be counted as matches).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I have created the following script for METEOR calculation using NLTK. For the same sentences, METEOR gives me higher scores than BLEU. Unlike many other metrics including BLEU, METEOR mainly works on sentence evaluation rather than corpus evaluation.&lt;/p&gt;

&lt;script src=&quot;https://gist.github.com/ymoslem/5174469f88d9f1fb1660121a663bb87f.js&quot;&gt;&lt;/script&gt;

&lt;p&gt;&lt;a name=&quot;accurate&quot;&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;final-note-is-bleu-accurate&quot;&gt;Final Note: Is BLEU Accurate?&lt;/h2&gt;

&lt;p&gt;Well, BLEU simply compares the human translation to the machine translation. It does not take into consideration synonyms or accepted word order changes.&lt;/p&gt;

&lt;p&gt;Here is an example of the original translation in the corpus:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;ul&gt;
    &lt;li&gt;FR: Notre ONU peut jouer un rôle déterminant dans la lutte contre les menaces qui se présentent à nous, et elle le jouera.&lt;/li&gt;
    &lt;li&gt;EN: Our United Nations can and will make a difference in the fight against the threats before us.&lt;/li&gt;
  &lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;… and here is the machine translation by two of my NMT models:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;ul&gt;
    &lt;li&gt;EN: Our United Nations can play a decisive role in combating the threats we face, and it will do so.&lt;/li&gt;
    &lt;li&gt;EN: Our United Nations can play a decisive role in combating the threats we face, and it will play it.&lt;/li&gt;
  &lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;As you can see, the MT translations are perfectly acceptable; yet if you calculate BLEU against the original sentence, you will get a BLEU score of only ≈ 15.7!&lt;/p&gt;

&lt;p&gt;So BLEU, just as any other automatic measure, can be used as a reference, for example until a pre-agreed score is reached, and you can expect better translations from a model with an overall higher BLEU score. Moreover, some newer metrics are worth considering, such as &lt;a href=&quot;https://github.com/chikiulo/yisi&quot;&gt;YiSi&lt;/a&gt; and &lt;a href=&quot;https://github.com/Unbabel/COMET&quot;&gt;COMET&lt;/a&gt;. Still, some companies would finally run a human evaluation, which we might talk about in another article.&lt;/p&gt;

</description>
        <pubDate>Sun, 26 Jan 2020 00:00:00 +0000</pubDate>
        <link>https://blog.machinetranslation.io/compute-bleu-score/</link>
        <guid isPermaLink="true">https://blog.machinetranslation.io/compute-bleu-score/</guid>
        
        
        <category>nmt</category>
        
      </item>
    
      <item>
        <title>Stand-alone Executable Translator for OpenNMT</title>
        <description>&lt;p&gt;The question was: if I want to have a stand-alone version of OpenNMT to run on Windows, without any manual preparations or installations on the target machine, and does not connect to the Internet for Machine Translation, what are my options to achieve this?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; This post is fairly old; it uses OpenNMT-py 0.9.1 and currently applies only to &lt;strong&gt;Windows&lt;/strong&gt;. If you want to develop a web interface or a stand-alone application on Windows, Linux or Mac, &lt;strong&gt;check the following up-to-date options:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/ymoslem/DesktopTranslator&quot;&gt;&lt;img src=&quot;https://github-readme-stats.vercel.app/api/pin/?theme=graywhite&amp;amp;username=ymoslem&amp;amp;repo=DesktopTranslator&quot; alt=&quot;DesktopTranslator&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/ymoslem/CTranslate-NMT-Web-Interface&quot;&gt;&lt;img src=&quot;https://github-readme-stats.vercel.app/api/pin/?theme=graywhite&amp;amp;username=ymoslem&amp;amp;repo=CTranslate-NMT-Web-Interface&quot; alt=&quot;CTranslate-NMT-Web-Interface&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/ymoslem/OpenNMT-Web-Interface&quot;&gt;&lt;img src=&quot;https://github-readme-stats.vercel.app/api/pin/?theme=graywhite&amp;amp;username=ymoslem&amp;amp;repo=OpenNMT-Web-Interface&quot; alt=&quot;OpenNMT-Web-Interface&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;After some research, I finally managed to create a Translator GUI for Windows, using Python Tkinter, PyInstaller, NSIS and the PyTorch version of OpenNMT.&lt;/p&gt;

&lt;h2 id=&quot;purpose&quot;&gt;Purpose&lt;/h2&gt;
&lt;p&gt;Creating a stand-alone executable of OpenNMT-py on Windows that requires minimal technical experience to install and use, and no Internet connection, for Machine Translation.&lt;/p&gt;

&lt;h2 id=&quot;outcome&quot;&gt;Outcome&lt;/h2&gt;
&lt;p&gt;A proof-of-concept version can be downloaded &lt;a href=&quot;https://s3.us-west-2.amazonaws.com/opennmt-gui/translate-gui.exe&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Tested on Windows 7 and Windows 10. Only 64-bit versions of Windows are supported (PyTorch works on 64-bit Python only).&lt;/p&gt;

&lt;p&gt;The executable can be used to locally translate files, using a local pre-trained model file generated by OpenNMT-py Neural Machine Translation framework.&lt;/p&gt;

&lt;h2 id=&quot;how-it-works&quot;&gt;How it works&lt;/h2&gt;

&lt;h3 id=&quot;installation&quot;&gt;Installation&lt;/h3&gt;

&lt;p&gt;After downloading and launching the installer, it will copy the files to the “Program Files” folder. When the installer finishes, there will be a shortcut on the Desktop called “translate-gui”.&lt;/p&gt;

&lt;h3 id=&quot;usage&quot;&gt;Usage&lt;/h3&gt;

&lt;p&gt;When you run the shortcut “translate-gui” (which refers to translate-gui.exe), the following window opens.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Select the source file (*.txt)&lt;/li&gt;
  &lt;li&gt;Select the model file (*.pt)&lt;/li&gt;
  &lt;li&gt;Click “Translate”.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;../static/img/GUI-1.png&quot; alt=&quot;Translator&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Note: For a quick test, you can download this &lt;a href=&quot;../static/uploads/source.txt&quot;&gt;test source&lt;/a&gt; file (right-click &amp;gt; Save link as) and this &lt;a href=&quot;https://github.com/OpenNMT/OpenNMT-py/raw/master/onmt/tests/test_model.pt&quot;&gt;test model&lt;/a&gt; file.&lt;/p&gt;

&lt;p&gt;If everything works fine, it should create the translation file “yourtranslation.txt” on the Desktop. Responding with “Yes” to this prompt message should open the translation TXT file in Notepad.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../static/img/GUI-2.png&quot; alt=&quot;Translator&quot; /&gt;&lt;/p&gt;

&lt;h3 id=&quot;uninstallation&quot;&gt;Uninstallation&lt;/h3&gt;

&lt;p&gt;To uninstall, simply delete the folder “translate-gui” from the “Program Files” folder.&lt;/p&gt;

&lt;h2 id=&quot;changes-in-the-opennmt-py-code&quot;&gt;Changes in the OpenNMT-py Code&lt;/h2&gt;

&lt;p&gt;Only simple changes to the existing arguments were needed; nothing major.&lt;/p&gt;

&lt;h3 id=&quot;1--onmtoptspy&quot;&gt;1- onmt/opts.py&lt;/h3&gt;

&lt;p&gt;For the arguments &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-src&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-model&lt;/code&gt;, change the attribute &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;required=True&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;required=False&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;group.add(&apos;--src&apos;, &apos;-src&apos;, required=False,
    help=&quot;Source sequence to decode (one line per &quot;
        &quot;sequence)&quot;)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;group.add(&apos;--model&apos;, &apos;-model&apos;, dest=&apos;models&apos;, metavar=&apos;MODEL&apos;,
    nargs=&apos;+&apos;, type=str, default=[], required=False, 
    help=&quot;Path to model .pt file(s). &quot;
        &quot;Multiple models can be specified, &quot;
        &quot;for ensemble decoding.&quot;)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;2--translatepy&quot;&gt;2- translate.py&lt;/h3&gt;

&lt;p&gt;Assigning values from the Tkinter GUI to the following variables:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;opt.src&lt;/code&gt; (source file path – string)&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;opt.models&lt;/code&gt; (model file path – list of strings)&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;opt.output&lt;/code&gt; (target file path – string)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For testing purposes, you can hardcode the values to get an idea how it works (without a GUI).&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;if __name__ == &quot;__main__&quot;:
    parser = _get_parser()
    opt = parser.parse_args()

    # edits
    opt.src = r&quot;D:\Users\yasmin\output\source.txt&quot;
    opt.models = [r&quot;D:\Users\yasmin\output\test_model.pt&quot;]
    opt.output = &quot;yourtranslation.txt&quot;

    main(opt)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;However, in the actual file, I replaced this with a function (e.g. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;go&lt;/code&gt;)&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;def go():
    parser = _get_parser()
    opt = parser.parse_args()

    try:
        opt.src = file_source
        opt.models = [file_model]
        opt.output = &quot;yourtranslation.txt&quot;

        main(opt)

        success = messagebox.askyesno(&apos;Success&apos;, &apos;Your source text has been successfully translated and saved as &quot;yourtranslation.txt&quot;. Do you want to open the target file?&apos;)
        if success:
            webbrowser.open(&quot;yourtranslation.txt&quot;)
    except:
        messagebox.showerror(&apos;Error&apos;, &apos;Make sure you select the right Source and Model files.&apos;)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;… and then assigned this go function to the command attribute of the “Translate” button in the GUI.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;btn_translate = Button(frame3, text=&quot;Translate&quot;, width=20, highlightbackground=&quot;#BBCAE8&quot;, command=go)
btn_translate.pack(padx=1)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note that the variables &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;file_source&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;file_model&lt;/code&gt; get their values from the GUI.&lt;/p&gt;

&lt;p&gt;Final minimum working example can be found &lt;a href=&quot;https://gist.github.com/ymoslem/de033c2886c01e7b18e6f558b33bd24a&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;notes-on-pyinstaller-and-nsis&quot;&gt;Notes on PyInstaller and NSIS&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://www.pyinstaller.org/&quot;&gt;PyInstaller&lt;/a&gt; freezes (packages) Python applications into stand-alone executables. &lt;a href=&quot;https://nsis.sourceforge.io/Main_Page&quot;&gt;NSIS&lt;/a&gt; is an open-source tool to create Windows installers. Here are some notes on using them:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Installing PyInstaller is straightforward through using this command in your CMD/Terminal: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pip3 install pyinstaller&lt;/code&gt; or through installing &lt;a href=&quot;https://pypi.org/project/auto-py-to-exe/&quot;&gt;Auto PY to EXE&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;Consider bundling on Windows 7 and then testing on Windows 10. Otherwise, you might have to deal with some Windows dependencies.&lt;/li&gt;
  &lt;li&gt;To use PyInstaller, specify the Python file name and the argument &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-w&lt;/code&gt; to hide the console window: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pyinstaller -y -w &quot;yourfile.py&quot;&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;At this stage, you created a folder including all the dependencies and an *.exe inside it that will run the Python file.
Do NOT use the “onefile” argument &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-F&lt;/code&gt; of PyInstaller which creates a one-file bundled executable, i.e. instead of having the above-mentioned folder, you will have a big *.exe for the whole thing. Why not? This external *.exe is like an archive that extracts the packaged files (including the internal *.exe) to a temporary directory every time you run it, which takes a long time due to the huge file size of PyTorch and other dependencies. Instead, use NSIS to create an installer which will extract the files only once.&lt;/li&gt;
  &lt;li&gt;NSIS can be &lt;a href=&quot;https://nsis.sourceforge.io/Download&quot;&gt;downloaded&lt;/a&gt;, installed and used on Windows like any application.&lt;/li&gt;
  &lt;li&gt;Before using NSIS, compress the contents of the “dist” directory created by PyInstaller into a *.zip archive using any tool like 7-Zip or WinZip.&lt;/li&gt;
  &lt;li&gt;Launch NSIS, click “Installer based on a .ZIP file”, and click “Open” to locate the package *.zip file you have just created.&lt;/li&gt;
  &lt;li&gt;If you want to make the files installed (extracted) to the “Program Files” of the target user, in the “Default Folder” enter &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$PROGRAMFILES&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;If you want to add a shortcut to the internal *.exe file on the Desktop after installation, you can add something like this to the file “Modern.nsh” at: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;C:\Program Files\NSIS\Contrib\zip2exe\&quot;&lt;/code&gt;. Depending on your OS, the path could be at “Program Files (x86)”. I just added these lines at the end of the file. Note that the exe path should be consistent with the path you selected under NSIS’s “Default Folder” drop-down menu, the folder name, and the exe file name.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Section &quot;Desktop Shortcut&quot; SectionX
    SetShellVarContext current
    CreateShortCut &quot;$DESKTOP\translate-gui.lnk&quot; &quot;$PROGRAMFILES\translate-gui\translate-gui.exe&quot;
SectionEnd
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;ul&gt;
  &lt;li&gt;Finally, click the NSIS “Generate” button, which will create the *.exe installer that can be shipped to other Windows machines.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;future-work&quot;&gt;Future Work&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Adding more OpenNMT-py translation options (and maybe training options) to the GUI.&lt;/li&gt;
  &lt;li&gt;Improving the user experience during installation, usage, and uninstallation.&lt;/li&gt;
  &lt;li&gt;Reducing the required space by removing unnecessary dependencies.&lt;/li&gt;
  &lt;li&gt;Testing the same approach for the TensorFlow version, OpenNMT-tf.&lt;/li&gt;
&lt;/ul&gt;

</description>
        <pubDate>Sat, 18 Jan 2020 00:00:00 +0000</pubDate>
        <link>https://blog.machinetranslation.io/stand-alone-executable-gui-opennmt/</link>
        <guid isPermaLink="true">https://blog.machinetranslation.io/stand-alone-executable-gui-opennmt/</guid>
        
        
        <category>nmt</category>
        
      </item>
    
      <item>
        <title>Domain Adaptation Techniques for Low-Resource Scenarios</title>
<description>&lt;p&gt;Let’s imagine this scenario. You have a new Machine Translation project, and you feel excited. However, you have realized that your training corpus is too small. If you use such a limited corpus, your machine translation model will be very poor, with many out-of-vocabulary words and possibly unidiomatic translations.&lt;/p&gt;

&lt;p&gt;So, what is the solution? Should you just give up? Fortunately, Domain Adaptation can be a good solution to this issue.&lt;/p&gt;

&lt;p&gt;Do you have another corpus that is big enough? Does this big corpus share some characteristics with the small corpus, like the language pair and/or major subject?&lt;/p&gt;

&lt;p&gt;In this case, you can use one of the Domain Adaptation techniques to make use of both the big generic corpus and the small specialized corpus. While the big generic corpus will help avoid out-of-vocabulary words and unidiomatic translations, the small specialized corpus will help enforce the terminology and vocabulary required for your current Machine Translation project.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;
&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#domain-adaptation-use-cases&quot;&gt;Domain Adaptation Use Cases&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#domain-adaptation-approaches&quot;&gt;Domain Adaptation Approaches&lt;/a&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#incremental-training--re-training&quot;&gt;Incremental Training / Re-training&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#ensemble-decoding-of-two-models&quot;&gt;Ensemble Decoding (of two models)&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#combining-training-data&quot;&gt;Combining Training Data&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#data-weighting&quot;&gt;Data Weighting&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#other-domain-adaptation-approaches&quot;&gt;Other Domain Adaptation Approaches&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#final-note-full-words-vs-sub-words&quot;&gt;Final Note: Full Words vs. Sub-words&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#conclusion&quot;&gt;Conclusion&lt;/a&gt;
&lt;br /&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;domain-adaptation-use-cases&quot;&gt;Domain Adaptation Use Cases&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Low-Resource Domains &amp;amp; Institutions&lt;/li&gt;
  &lt;li&gt;Low-Resource Languages&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To give you a clearer idea about Machine Translation Domain Adaptation, let’s consider these two popular use cases:&lt;/p&gt;

&lt;p&gt;In the first use case, we have Institution A and Institution B, or Major Subject A and Minor Subject B. Institution A and Institution B share much vocabulary; however, they have some different terminology (e.g. chairman vs. president; vice-president vs. deputy chairperson). You have a big corpus for Institution A and a very small corpus for Institution B; however, your Machine Translation project is for Institution B with the small corpus. Domain Adaptation can help you to use the small corpus of Institution B for adapting or specializing the NMT model that could be generated from training on the big corpus of Institution A (assuming there are no license restrictions). With Domain Adaptation, our final model will, hopefully, give the right terminology used at Institution B.&lt;/p&gt;

&lt;p&gt;In the second use case, we have a language with very limited bilingual resources, so we do not have enough data to train a good Machine Translation model for it. I am sure you can think of many low-resource languages all over the world. Sometimes, there are high-resource languages that are very similar to such low-resource languages, sharing vocabulary and structure with them. Moreover, sometimes they are not independent languages at all, but rather dialects of a common parent language.&lt;/p&gt;

&lt;p&gt;So the question is: can we use the rich resources of Language A to train a better Machine Translation model for Language B, which otherwise has low resources? Apparently, this is possible through Domain Adaptation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quiz&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Give an example of two languages:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Language A: High resources&lt;/li&gt;
  &lt;li&gt;Language B: Low resources&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Language A and Language B share vocabulary and structure (vocabulary overlaps).&lt;/p&gt;

&lt;p&gt;So this is a quiz. In the comments area, please mention two languages: Language A and Language B. Language A has rich resources while Language B has only very limited resources. However, there is a condition: Language A and Language B must share some vocabulary, meaning that many words in Language A overlap with words in Language B, so such words are the same or very similar in the two languages. Can you think of any example of Language A and Language B?&lt;/p&gt;

&lt;h2 id=&quot;domain-adaptation-approaches&quot;&gt;Domain Adaptation Approaches&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Incremental Training / Re-training&lt;/li&gt;
  &lt;li&gt;Ensemble Decoding (of two models)&lt;/li&gt;
  &lt;li&gt;Combining Training Data&lt;/li&gt;
  &lt;li&gt;Data Weighting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are several approaches to Domain Adaptation, and I am going to discuss four of them.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Incremental Training / Re-training: So you have a big pre-trained model trained on a big corpus, and you continue training it with the new data from the small corpus.&lt;/li&gt;
  &lt;li&gt;Ensemble Decoding (of two models): You have two models and you use both models during translation.&lt;/li&gt;
  &lt;li&gt;Combining Training Data: You merge the two corpora and train one model on the whole combined data.&lt;/li&gt;
  &lt;li&gt;Data Weighting: You give higher weights for specialized segments over generic segments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s see how to apply these techniques and the best practices.&lt;/p&gt;

&lt;h3 id=&quot;incremental-training--re-training&quot;&gt;Incremental Training / Re-training&lt;/h3&gt;

&lt;p&gt;First Step: Training the Base Model&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Preprocessing the base (generic, big) corpus&lt;/li&gt;
  &lt;li&gt;Training the base model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Second Step: Retraining with the New Data&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Preprocessing the new (specialized) corpus&lt;/li&gt;
  &lt;li&gt;Retraining the base model on the specialized corpus&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Incremental Training means to train a model on a corpus and then continue training the same model on a new corpus.&lt;/p&gt;

&lt;p&gt;As part of my Machine Translation research, I managed to achieve successful results in retraining Neural Machine Translation models for the purpose of Domain Adaptation (see: Domain Adaptation Experiment).&lt;/p&gt;

&lt;p&gt;Now you have two corpora. The first corpus is the base corpus; it is generic or less specialized, and it is usually big, like several million segments. The other corpus is specialized, and it usually has a smaller number of translated segments.&lt;/p&gt;

&lt;p&gt;In my experiment, the outcome was very promising and the model learned to use the in-domain terminology.&lt;/p&gt;

&lt;p&gt;There is an important matter to take into consideration while using this Incremental Training approach for Domain Adaptation. If you only use in-domain data in your retraining corpus, you may encounter a case of “catastrophic forgetting”, in which some sentences are translated badly (e.g. with an unidiomatic structure or unknown words) by the retrained model while they are translated better by the base model. To avoid this issue, the retraining corpus should usually be a combination of in-domain and generic data. So, for example, if your original in-domain corpus includes one hundred thousand segments, you can add around fifty thousand generic segments.&lt;/p&gt;

&lt;p&gt;Another consideration is that you need to retrain on the new data for long enough to learn the new vocabulary. So you can see how many epochs or steps you used to train the base model and use a similar number to retrain on the new corpus.&lt;/p&gt;

&lt;p&gt;Note also that depending on the NMT framework you are using, you may have the option to update vocabulary instead of re-initializing the whole network. For example, in OpenNMT-tf (the TensorFlow version of OpenNMT), there is a script that can be used to change the word vocabularies contained in a base model while keeping the learned weights of shared words, so that you can add in-domain terminology during retraining.&lt;/p&gt;

&lt;h3 id=&quot;ensemble-decoding-of-two-models&quot;&gt;Ensemble Decoding (of two models)&lt;/h3&gt;

&lt;p&gt;One of the suggested methods of Domain Adaptation is to “ensemble” the baseline model trained on generic data and the new model retrained on in-domain data. “Ensemble” simply means combining models during translation (not data during training). For more details about Ensemble Decoding, you may want to refer to a useful paper, “Fast Domain Adaptation for Neural Machine Translation”, by Markus Freitag and Yaser Al-Onaizan.&lt;/p&gt;

&lt;p&gt;Actually, there are different techniques for Ensemble Decoding; however, I am giving you an example of how it is used in the OpenNMT-py framework to give you an idea.&lt;/p&gt;

&lt;p&gt;Ensemble Decoding is a method that allows using multiple models simultaneously, combining their prediction distributions by averaging. All models in the ensemble must share a target vocabulary.&lt;/p&gt;

&lt;p&gt;This means that although Ensemble Decoding is used during translation, you should observe some considerations during training. During the preprocessing step, you have to include the vocabulary of both the generic corpus and the in-domain corpus. Later, during training, you first train the base generic model, and then continue training with your specialized data to create a new model. Finally, during translation, you can use the two models simultaneously with Ensemble Decoding. Note here that you do not train the two models independently; rather, the second model is incrementally trained from the last checkpoint of the first model.&lt;/p&gt;

&lt;p&gt;As you can see, Ensemble Decoding can be helpful in diverse situations where you want to utilize multiple models at translation time, and Domain Adaptation is only one such use case, with a special process.&lt;/p&gt;

&lt;h3 id=&quot;combining-training-data&quot;&gt;Combining Training Data&lt;/h3&gt;

&lt;p&gt;Combining your training data is another approach you can use for Domain Adaptation. So you combine both the big generic corpus and the small specialized corpus into only one corpus. Now, you can train your model with this new corpus.&lt;/p&gt;

&lt;p&gt;If you are going to combine two relatively different datasets, then according to Prof. Andrew Ng (video), do not shuffle your combined dataset to generate the training, dev, and test sets; instead, he recommends that you divide your data as follows:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Training Dataset: 100% of the big, generic dataset + most of the small specialized dataset.&lt;/li&gt;
  &lt;li&gt;Dev (validation) Dataset: Portion of the small specialized dataset (e.g. 2500).&lt;/li&gt;
  &lt;li&gt;Test Dataset: Portion of the small specialized dataset (e.g. 2500).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So now, you are concentrating on improving the performance of your model to act well on the Dev (Validation) Dataset, which includes the data you care about.&lt;/p&gt;
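
&lt;p&gt;The split above can be sketched as follows (names and sizes here are toy values of my choosing; in practice you would keep parallel source/target pairs together when splitting):&lt;/p&gt;

```python
import random

# Sketch of the recommended split: all of the generic data plus most of
# the specialized data go to training, while dev and test come only
# from the specialized (in-domain) data. Sizes here are toy values.
generic = ["generic-%d" % i for i in range(100_000)]
specialized = ["specialized-%d" % i for i in range(10_000)]

random.seed(0)
random.shuffle(specialized)            # shuffle only within the small corpus

dev = specialized[:2_500]              # in-domain dev (validation) set
test = specialized[2_500:5_000]        # in-domain test set
train = generic + specialized[5_000:]  # 100% generic + rest of specialized
```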

&lt;p&gt;However, when you think about combining data for the sake of training Neural Machine Translation models, there is a problem! In Neural Machine Translation, we extract only the most frequent vocabulary, the most frequent words in the corpus (~50,000 is common). Now, as you have a big generic corpus and a small specialized one, you might end up with vocabulary from the big corpus only, while the words you want to include from the small corpus will be missing because they are not frequent enough. Plus, the model would prefer terminology choices from the bigger corpus because they are more frequent.&lt;/p&gt;

&lt;p&gt;I can hear you now asking: Can I extract all the words in the corpus? Of course, you can; however, if your corpus is really huge, and your training parameters are memory intensive, you might get an out-of-memory error and not be able to continue training or even start it.&lt;/p&gt;
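
&lt;p&gt;The frequency cut described above can be illustrated with a toy counter (the corpus and the tiny vocabulary size below are made up for illustration; ~50,000 is the realistic setting mentioned above):&lt;/p&gt;

```python
from collections import Counter

# Toy illustration of frequency-based vocabulary extraction: rare
# in-domain terms can fall outside the top-N cut. N is tiny here;
# in practice around 50,000 types is a common setting.
corpus = ["the president opened the meeting".split(),
          "the president closed the meeting".split(),
          "the chairperson spoke".split()]

counts = Counter(token for sentence in corpus for token in sentence)
vocab_size = 3
vocab = {token for token, _ in counts.most_common(vocab_size)}
# "chairperson" occurs only once, so it misses the cut
```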

&lt;p&gt;So what is the solution? What about increasing the specialized data? There is a suggested method: Data Augmentation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Augmentation for Neural Machine Translation:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The purpose of Data Augmentation here is to increase the size of your limited specialized data. In my experiment, I used a statistical approach that is similar to what has been used in Statistical Machine Translation (e.g. Moses), as illustrated by Prof. Philipp Koehn in the chapter “Phrase-based Models” of his book “Statistical Machine Translation”.&lt;/p&gt;

&lt;p&gt;First Step: Extract the word alignment of the specialized corpus. You can use tools like fast_align, eflomal, or efmaral. Each of them is a word aligner that takes parallel sentences as input and produces output in the widely used “Pharaoh format”.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;neue modelle werden erprobt ||| new models are being tested&lt;br /&gt;
0-0 1-1 2-2 2-3 3-4&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Second Step: Generate n-gram phrases. Here, you can see an example:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;neue — new&lt;br /&gt;
neue modelle — new models&lt;br /&gt;
neue modelle werden — new models are being&lt;br /&gt;
neue modelle werden erprobt — new models are being tested&lt;br /&gt;
modelle — models&lt;br /&gt;
modelle werden — models are being&lt;br /&gt;
modelle werden erprobt — models are being tested&lt;br /&gt;
werden — are being&lt;br /&gt;
werden erprobt — are being tested&lt;br /&gt;
erprobt — tested&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;As I mentioned, this approach is very similar to the method used in Statistical Machine Translation; however, I did not move further to calculate probabilities because: 1) this would take a lot of time and memory; and, most importantly, 2) this step is not needed because Neural Machine Translation has its own approach to calculating probabilities. So all we need is a simple filtering step.&lt;/p&gt;

&lt;p&gt;Third Step: Remove exact duplicates. Apply any other filters as needed; for example, you can delete very long sentences or uncommon single words, etc.&lt;/p&gt;
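
&lt;p&gt;The first two steps can be sketched as follows (a simplification for illustration; real phrase extractors also check alignment consistency, and the function name is mine): for each contiguous source n-gram, collect the target words its tokens align to, using the “Pharaoh format” links shown above.&lt;/p&gt;

```python
def phrase_pairs(src, tgt, alignment):
    """Pair each contiguous source n-gram with the target span covered
    by its word alignments (simplified phrase-extraction sketch)."""
    links = [tuple(map(int, link.split("-"))) for link in alignment.split()]
    pairs = []
    for i in range(len(src)):
        for j in range(i, len(src)):
            # target indices aligned to any source word in src[i..j]
            tgt_idx = sorted({t for s, t in links if s in range(i, j + 1)})
            if tgt_idx:
                pairs.append((" ".join(src[i:j + 1]),
                              " ".join(tgt[tgt_idx[0]:tgt_idx[-1] + 1])))
    return pairs

# The running example from the article:
src = "neue modelle werden erprobt".split()
tgt = "new models are being tested".split()
pairs = phrase_pairs(src, tgt, "0-0 1-1 2-2 2-3 3-4")
# yields ("neue", "new"), ("werden", "are being"), and so on
```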

&lt;p&gt;Now, you can combine your increased specialized data with the generic data, and start preprocessing and training your model.&lt;/p&gt;

&lt;p&gt;Note here that we have two datasets, one that uses this n-gram phrase splitting and one that does not. In my experiment, when I trained my model on a dataset where this method was applied to all segments, I got better translations for some segments; however, I noticed literal or unidiomatic translations on other occasions, and in general the quality was lower. So if you are going to use this n-gram phrase splitting in your Neural Machine Translation training, it is recommended to apply it to only a part of the final dataset. That is why we applied this approach only to the specialized dataset and kept the generic dataset as is, without phrase splitting.&lt;/p&gt;

&lt;p&gt;Apart from training a model, you can also use the generated phrase table for more options at translation time.&lt;/p&gt;

&lt;p&gt;Other combination methods may include: removing irrelevant segments from the big corpus, or replacing mismatching terminology based on a glossary during preprocessing.&lt;/p&gt;

&lt;h3 id=&quot;data-weighting&quot;&gt;Data Weighting&lt;/h3&gt;

&lt;p&gt;Data Weighting is another technique that can be useful for Domain Adaptation. In Data Weighting, you can either:&lt;/p&gt;

&lt;p&gt;train one model on two corpora at the same time while giving a higher weight for the specialized corpus over the other generic corpus, or
train the model on only one corpus that includes both generic segments and specialized segments, giving higher weights for specialized segments.
For example, OpenNMT-py (the PyTorch version of OpenNMT) supports using different weights for different corpora; so we define the “data weights” list, which determines the weight each corpus should have; for example, 1 for Corpus A and 7 for Corpus B. This means when building batches, we will take 1 segment from Corpus A, then 7 segments from Corpus B, and so on.&lt;/p&gt;

&lt;p&gt;Similarly, Marian NMT toolkit supports sentence and word-level data weighting strategies, weighting each data item according to its proximity to the in-domain data. In Marian, data weighting requires you to provide a special file with weights of sentences or words.&lt;/p&gt;
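
&lt;p&gt;The batch-building behaviour described for OpenNMT-py can be sketched as follows (an illustration of the 1:7 weighting above with made-up corpora, not the toolkit’s actual code):&lt;/p&gt;

```python
import itertools

def weighted_stream(corpora, weights):
    """Yield examples corpus by corpus, taking `weight` examples from
    each in turn, mimicking data weighting when building batches."""
    cycles = [itertools.cycle(corpus) for corpus in corpora]
    while True:
        for cycle, weight in zip(cycles, weights):
            for _ in range(weight):
                yield next(cycle)

corpus_a = ["generic-1", "generic-2"]                # weight 1
corpus_b = ["indomain-%d" % i for i in range(1, 8)]  # weight 7
stream = weighted_stream([corpus_a, corpus_b], [1, 7])
first_eight = [next(stream) for _ in range(8)]
# 1 segment from Corpus A, then 7 segments from Corpus B
```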

&lt;h3 id=&quot;other-domain-adaptation-approaches&quot;&gt;Other Domain Adaptation Approaches&lt;/h3&gt;

&lt;p&gt;For more state-of-the-art Domain Adaptation approaches, please check my AMTA’s &lt;a href=&quot;https://amtaweb.org/wp-content/uploads/2020/11/NMTDomainAdaptationTechniques.pdf&quot;&gt;presentation&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;final-note-full-words-vs-sub-words&quot;&gt;Final Note: Full Words vs. Sub-words&lt;/h3&gt;

&lt;p&gt;While preparing our data, we usually tokenize segments into complete words. However, it turns out that tokenizing segments into sub-words instead can improve translation quality. Sub-wording is not a technique related only to Domain Adaptation; it is actually recommended for any kind of Neural Machine Translation training.&lt;/p&gt;

&lt;p&gt;The main purpose of sub-wording is to minimize out-of-vocabulary words. As I mentioned earlier, in Neural Machine Translation, there are limitations to vocabulary extraction. If your corpus is really huge, you are forced to extract only the most frequent vocabulary (~50,000 is common), or you might get an out-of-memory error during training. Extracting the most frequent vocabulary will be enough for most translations as long as you translate only sentences in the same domain as your corpus; however, in some cases, you might encounter out-of-vocabulary words.&lt;/p&gt;

&lt;p&gt;Sub-wording can help in some cases:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Word variations in the same language, e.g. “translate vs. translation”&lt;/li&gt;
  &lt;li&gt;Compound words in the same language, e.g. “multi-tasking”. So now your model is not only able to translate “multi-tasking”, but also any other phrase that includes the word “multi”.&lt;/li&gt;
  &lt;li&gt;Shared words between languages&lt;/li&gt;
  &lt;li&gt;Common misspellings, like forgetting accents.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Just as with any other technique, on some occasions sub-wording will not give you better results; however, on many occasions, it will be a game changer. So, it is highly recommended to give it a try.&lt;/p&gt;

&lt;p&gt;Methods of sub-wording include: Byte Pair Encoding (BPE) and unigram language model, both of which are supported by SentencePiece.&lt;/p&gt;
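
&lt;p&gt;As a toy illustration of how BPE builds its sub-word vocabulary (a minimal version of the merge loop, not SentencePiece’s implementation), the sketch below repeatedly merges the most frequent adjacent pair of symbols, so related word variations such as “translate” and “translation” come to share sub-word pieces:&lt;/p&gt;

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merge operations: repeatedly merge the most frequent
    adjacent pair of symbols in a toy word-frequency vocabulary."""
    vocab = {tuple(word): count for word, count in Counter(words).items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Re-segment every word with the new merged symbol.
        vocab = {tuple(" ".join(symbols)
                       .replace(" ".join(best), "".join(best))
                       .split()): count
                 for symbols, count in vocab.items()}
    return merges

# "translate" and "translation" grow the same sub-word pieces first.
merges = bpe_merges(["translate", "translation", "translation"], 4)
```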

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;So in this article, you have seen how Domain Adaptation can be useful when you want to train a Machine Translation model, but you have only limited data for an institution, language, or minor domain. Then, I discussed diverse techniques of Domain Adaptation, including: Incremental Training / Re-training, Ensemble Decoding, Combining Training Data, and Data Weighting. Along the way, I suggested a method for Data Augmentation, to increase the size of the limited specialized corpus. Finally, I explained how sub-wording can help avoid out-of-vocabulary words. If you have questions or suggestions, please feel free to send a comment.&lt;/p&gt;

</description>
        <pubDate>Sat, 18 Jan 2020 00:00:00 +0000</pubDate>
        <link>https://blog.machinetranslation.io/nmt-domain-adaptation/</link>
        <guid isPermaLink="true">https://blog.machinetranslation.io/nmt-domain-adaptation/</guid>
        
        
        <category>nmt</category>
        
      </item>
    
      <item>
        <title>Domain Adaptation Experiment in Neural Machine Translation</title>
        <description>&lt;p&gt;Domain Adaptation is useful for specializing current generic Machine Translation models, mainly when the specialized corpus is too limited to train a separate model. Furthermore, Domain Adaptation techniques can be handy for low-resource languages that share vocabulary and structure with other rich-resource family languages.&lt;/p&gt;

&lt;p&gt;As part of my Machine Translation research, I managed to achieve successful results in retraining Neural Machine Translation models for the purpose of Domain Adaptation using OpenNMT-py (the PyTorch version of OpenNMT). In this article, I am elaborating on the path I took and the achieved outcomes; hopefully, this will be useful for others.&lt;/p&gt;

&lt;p&gt;The base model is a vertical (in-domain) model trained on approx. 13 million segments, and retrained on approx. 123,000 institution-specific segments. Language Pair: French-English. Tokenization: complete words.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;
&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#first-step-training-the-base-model&quot;&gt;First Step: Training the Base Model&lt;/a&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#preprocessing&quot;&gt;Preprocessing&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#training&quot;&gt;Training&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#second-step-retraining-with-the-new-data&quot;&gt;Second Step: Retraining with the New Data&lt;/a&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#preprocessing-1&quot;&gt;Preprocessing&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#retraining&quot;&gt;Retraining&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#outcomes&quot;&gt;Outcomes&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#further-research&quot;&gt;Further Research&lt;/a&gt;
&lt;br /&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a name=&quot;step1&quot;&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;first-step-training-the-base-model&quot;&gt;First Step: Training the Base Model&lt;/h2&gt;

&lt;h3 id=&quot;preprocessing&quot;&gt;Preprocessing&lt;/h3&gt;

&lt;p&gt;Using default options of OpenNMT-py.&lt;/p&gt;

&lt;h3 id=&quot;training&quot;&gt;Training&lt;/h3&gt;
&lt;p&gt;Using the recommended Transformer model options, except that I had only 2 GPUs.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;CUDA_VISIBLE_DEVICES=0,1 python3 train.py -data basedata \ 
    -save_model basemodel -layers 6 -rnn_size 512 -word_vec_size 512 \ 
    -transformer_ff 2048 -heads 8  -encoder_type transformer \ 
    -decoder_type transformer -position_encoding -train_steps 200000 \ 
    -max_generator_batches 2 -dropout 0.1 -batch_size 4096 \ 
    -batch_type tokens -normalization tokens  -accum_count 2 \ 
    -optim adam -adam_beta2 0.998 -decay_method noam \ 
    -warmup_steps 8000 -learning_rate 2 -max_grad_norm 0 \ 
    -param_init 0 -param_init_glorot -label_smoothing 0.1 \ 
    -valid_steps 10000 -save_checkpoint_steps 10000 -world_size 2 \ 
    -gpu_ranks 0 1 -log_file train.log ; sudo shutdown
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;a name=&quot;step2&quot;&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;second-step-retraining-with-the-new-data&quot;&gt;Second Step: Retraining with the New Data&lt;/h2&gt;

&lt;h3 id=&quot;preprocessing-1&quot;&gt;Preprocessing&lt;/h3&gt;

&lt;p&gt;I passed the basedata.vocab.pt file to the parameter -src_vocab. There is no need for -tgt_vocab, but use -share_vocab as well (&lt;a href=&quot;http://forum.opennmt.net/t/incremental-learning-in-domain-adaptation-retraining-in-pytorch-version-opennmt/2417&quot;&gt;reference&lt;/a&gt;). Actually, only -src_vocab supports *.vocab.pt files, and adding the file to -tgt_vocab will cause an error.&lt;/p&gt;

&lt;p&gt;I also used -src_seq_length 200 because I have long sentences, but you can use the default (50) or whatever you need.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;python3 preprocess.py -train_src newdata.fr -train_tgt newdata.en \ 
    -save_data newdata -src_seq_length 200 -tgt_seq_length 200 \ 
    -src_vocab basedata.vocab.pt -dynamic_dict -share_vocab \ 
    -log_file preprocess-new.log
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;retraining&quot;&gt;Retraining&lt;/h3&gt;

&lt;p&gt;I used -train_from with the last-step checkpoint of the base model, retraining the model for an extra 10,000 steps. Note that the old model was trained for 200,000 steps; so to set the extra 10,000 steps in retraining, -train_steps must be 210,000, because retraining reuses the previous arguments unless you use the argument -reset_optim.&lt;/p&gt;

&lt;p&gt;Note also that the second machine had 8 GPUs; so with the same batch size, 10,000 steps on 8 GPUs are similar to 40,000 steps on 2 GPUs (&lt;a href=&quot;https://forum.opennmt.net/t/multi-gpus-is-slower-than-single-gpu/981/12&quot;&gt;reference&lt;/a&gt;). Calculating steps in the first place was tricky, because the batch type here depends on tokens, not sentences, and there are multiple GPUs (&lt;a href=&quot;https://github.com/OpenNMT/OpenNMT-py/issues/866&quot;&gt;reference&lt;/a&gt;). I used the sequence length from the preprocessing step as a reference (actually half of it, because not many sentences reach 200 tokens). This is not very accurate, since it is a maximum rather than an exact number, but it helps you understand what you are doing. The ultimate purpose was to retrain on the new data for long enough to learn the new vocabulary (&lt;a href=&quot;https://forum.opennmt.net/t/problem-with-incremental-in-domain-training/330/7&quot;&gt;reference&lt;/a&gt;).&lt;/p&gt;
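
&lt;p&gt;The rough conversion used here can be written out explicitly (toy arithmetic based on this setup; real throughput also depends on tokens per batch):&lt;/p&gt;

```python
# With the same per-GPU batch size, the effective batch per step scales
# with the number of GPUs, so fewer steps cover the same amount of data.
steps_on_8_gpus = 10_000
equivalent_steps_on_2_gpus = steps_on_8_gpus * 8 // 2
# 10,000 steps on 8 GPUs roughly match 40,000 steps on 2 GPUs
```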

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 train.py -data newdata \
    -train_from basemodel_step_200000.pt -save_model newmodel \ 
    -layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 \ 
    -heads 8 -encoder_type transformer -decoder_type transformer \ 
    -position_encoding -train_steps 210000 -max_generator_batches 2 \ 
    -dropout 0.1 -batch_size 4096 -batch_type tokens \ 
    -normalization tokens -accum_count 2 -optim adam -adam_beta2 0.998 \ 
    -decay_method noam -warmup_steps 8000 -learning_rate 2 \ 
    -max_grad_norm 0 -param_init 0 -param_init_glorot \ 
    -label_smoothing 0.1 -save_checkpoint_steps 10000 -world_size 8 \ 
    -gpu_ranks 0 1 2 3 4 5 6 7 -log_file retrain.log ; sudo shutdown
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Retraining took 37360 seconds (about 10.38 hours) on an AWS p2.8xlarge machine with 8 GPUs, 12 GB memory each, and 488 GB of RAM.&lt;/p&gt;

&lt;h3 id=&quot;outcomes&quot;&gt;Outcomes&lt;/h3&gt;
&lt;p&gt;When I started retraining with OpenNMT-py, I was not sure whether the model would only learn new vocabulary or would also replace vocabulary, because it was usually said that OpenNMT-py is not the best for retraining, as it does not have an update-vocabulary option, unlike the TensorFlow version, OpenNMT-tf.&lt;/p&gt;

&lt;p&gt;However, the outcome is very promising. The model learnt to use the institution-based terminology. Here is one simple example to get an idea: the base model translates the French words “président” and “vice-président” as “president” and “vice-president” in English respectively while the retrained model translates them as “chairperson” and “deputy chairperson” respectively, which are the adopted English terms in the institution.&lt;/p&gt;

&lt;h3 id=&quot;further-research&quot;&gt;Further Research&lt;/h3&gt;
&lt;p&gt;The issue I noticed, though, is that some sentences are translated badly (e.g. with an unidiomatic structure or UNKs) by the retrained model while they are translated better by the base model. I am not sure why, and I wonder whether this could be caused by an exaggerated number of retraining steps, so I have to test this. Another suggestion I got on the OpenNMT forum, which I am going to try, is that this may be a case of “catastrophic forgetting”; usually the retraining data should be a combination of in-domain and generic data. Still, note that my base model was not trained on generic data, but rather on a dataset from the same domain as the new dataset. As a workaround, I believe I can offer translations from the two models and let the user select, or automatically select the best translation based on automatic evaluation. So I am going to conduct more experiments and report the outcomes.&lt;/p&gt;

&lt;p&gt;So that is it. If you have questions or suggestions, please let me know.&lt;/p&gt;

</description>
        <pubDate>Sat, 27 Jul 2019 00:00:00 +0000</pubDate>
        <link>https://blog.machinetranslation.io/domain-adaptation-neural-machine-translation/</link>
        <guid isPermaLink="true">https://blog.machinetranslation.io/domain-adaptation-neural-machine-translation/</guid>
        
        
        <category>nmt</category>
        
      </item>
    
  </channel>
</rss>
