<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>MachineTranslation.io</title>
    <description>Research topics on Machine Translation</description>
    <link>https://blog.machinetranslation.io/</link>
    <atom:link href="https://blog.machinetranslation.io/sitemap.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Mon, 17 Nov 2025 16:15:49 +0000</pubDate>
    <lastBuildDate>Mon, 17 Nov 2025 16:15:49 +0000</lastBuildDate>
    <generator>Jekyll v3.10.0</generator>
    
      <item>
        <title>Iterative Layer Pruning for Efficient Inference</title>
        <description>&lt;p&gt;Model pruning is a compression technique, that aims to remove redundant components without significantly compromising the model’s performance or accuracy. This process facilitates efficient deployment of complex models by making them smaller and faster.&lt;/p&gt;

&lt;p&gt;Pruning is a hardware-agnostic compression approach. Unlike some other compression approaches such as quantisation, models resulting from structured pruning can be deployed on any modern GPU with similar performance gains. Moreover, pruning can be part of a sophisticated compression pipeline that incorporates other techniques such as quantisation and efficient fine-tuning (e.g. LoRA, QLoRA), which can lead to higher compression and efficiency levels.&lt;/p&gt;

&lt;p&gt;In this article, we will cover some insights from our two papers about Iterative Layer Pruning at IWSLT 2025 [1] and WMT 2025 [2].&lt;/p&gt;

&lt;h2 id=&quot;iterative-layer-pruning&quot;&gt;Iterative Layer Pruning&lt;/h2&gt;

&lt;p&gt;The process of Iterative Layer Pruning involves incrementally identifying and removing layers with minimal contribution to translation or generation quality, one layer at a time. The pruning process is usually followed by fine-tuning the resulting models on relevant training data to restore the translation quality. Moreover, knowledge distillation data from the baseline (teacher) model can be used to help the pruned (student) model to reach the quality of the teacher model.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../static/img/pruning/pruning-digram.png&quot; alt=&quot;pruning-diagram&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;layer-importance-evaluation&quot;&gt;Layer Importance Evaluation&lt;/h2&gt;

&lt;p&gt;We conduct layer importance evaluation by measuring translation performance with each layer removed in turn. The process is as follows:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Remove one layer from the model.&lt;/li&gt;
  &lt;li&gt;Evaluate the model (chrF++).&lt;/li&gt;
  &lt;li&gt;Restore the layer, and repeat for each remaining layer.&lt;/li&gt;
  &lt;li&gt;Prune the least important layer (the one whose removal gives the best chrF++).&lt;/li&gt;
  &lt;li&gt;Repeat steps 1–4 until the pruning target is reached.&lt;/li&gt;
&lt;/ol&gt;
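&lt;p&gt;The loop above can be sketched in a few lines of Python. Here, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;evaluate&lt;/code&gt; is a hypothetical stand-in that assigns each layer a fixed contribution; in practice, it would decode a development set with the given layers and compute chrF++.&lt;/p&gt;

```python
def evaluate(layers):
    # Hypothetical stand-in for translating a dev set with the given
    # layers and scoring it with chrF++ (higher is better).
    contribution = {0: 5.0, 1: 1.0, 2: 0.2, 3: 4.0, 4: 0.1, 5: 3.0}
    return sum(contribution[layer] for layer in layers)


def prune_iteratively(layers, target_num_layers):
    layers = list(layers)
    while len(layers) > target_num_layers:
        # Score the model with each layer removed in turn.
        scores = {layer: evaluate([l for l in layers if l != layer])
                  for layer in layers}
        # The least important layer is the one whose removal
        # keeps the score highest.
        least_important = max(scores, key=scores.get)
        layers.remove(least_important)
    return layers


print(prune_iteratively([0, 1, 2, 3, 4, 5], 3))  # [0, 3, 5]
```

&lt;p&gt;Note that the removal is greedy: each iteration re-scores all remaining layers before pruning a single one, exactly as in steps 1–4 above.&lt;/p&gt;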

&lt;h2 id=&quot;evaluation-results&quot;&gt;Evaluation Results&lt;/h2&gt;

&lt;p&gt;For translation from Czech to German (CES-DEU), pruning 8 layers and then fine-tuning the resulting model retains 98% of the translation quality (as measured by COMET), while achieving considerable speedup gains. Interestingly, for translation from English to Egyptian Arabic (ENG-ARZ), the model resulting from pruning up to 16 layers and then fine-tuning outperforms the Aya-Expanse-8B baseline for this language pair. Pruned models achieve up to ~2× speedup.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../static/img/pruning/pruning-results-wmt.png&quot; alt=&quot;pruning-results&quot; style=&quot;zoom:70%;&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;knowledge-distillation&quot;&gt;Knowledge Distillation&lt;/h2&gt;

&lt;p&gt;Knowledge Distillation aims to transfer knowledge from a larger model (teacher) to a smaller one (student). In “sequence-level” knowledge distillation, the student model is trained to generate sequences that match the teacher’s outputs. Fine-tuning the pruned models on a combination of authentic and synthetic data (from Aya-Expanse-32B) improved the Czech-to-German (CES-DEU) translation quality, with the 24-layer pruned model nearly matching the performance of the Aya-Expanse-8B baseline.&lt;/p&gt;
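&lt;p&gt;The data construction for sequence-level knowledge distillation can be sketched as follows. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;teacher_translate&lt;/code&gt; function is a hypothetical placeholder for decoding with the teacher model (e.g. Aya-Expanse-32B); the pruned student is then fine-tuned on the mixed pairs.&lt;/p&gt;

```python
def teacher_translate(source):
    # Hypothetical placeholder for decoding with the teacher model;
    # here it just tags the input so the sketch stays self-contained.
    return f"teacher({source})"


def build_distillation_data(sources, authentic_pairs):
    # Sequence-level KD: the student is trained on the teacher's full
    # output sequences, mixed with authentic parallel data.
    synthetic_pairs = [(src, teacher_translate(src)) for src in sources]
    return authentic_pairs + synthetic_pairs


data = build_distillation_data(["Dobrý den.", "Děkuji."],
                               [("Ahoj.", "Hallo.")])
print(len(data))  # 3
```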

&lt;div style=&quot;text-align:center;&quot;&gt;&lt;img src=&quot;../static/img/pruning/pruning-results-wmt-kd.png&quot; alt=&quot;pruning-results&quot; style=&quot;zoom:40%;&quot; /&gt;&lt;/div&gt;

&lt;h2 id=&quot;further-performance-gains&quot;&gt;Further performance gains&lt;/h2&gt;

&lt;p&gt;It is highly recommended to use an efficient inference engine such as vLLM, which outperforms inference with the Transformers framework. In both cases, pruned models demonstrate up to ~2× speedup.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../static/img/pruning/trasformers-vllm.png&quot; alt=&quot;vLLM&quot; style=&quot;zoom:70%;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Moreover, you can quantise the pruned models for further compression. For example, in our IWSLT 2025 paper, we applied QLoRA after pruning the models. However, note that low-precision quantisation (e.g. 4-bit and 8-bit) requires special hardware (e.g. H100 or H200) to observe performance gains.&lt;/p&gt;

&lt;h2 id=&quot;questions-and-answers&quot;&gt;Questions and Answers&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Q. Why iterative layer pruning instead of random or middle layer pruning?
    &lt;ul&gt;
      &lt;li&gt;A. Iterative layer pruning relies on layer importance analysis. Hence, only the layers with minimal contribution to the output quality can be removed. In our experiments, iterative layer pruning achieves better results than middle layer pruning.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Q. Can we prune parts of the model other than layers?
    &lt;ul&gt;
      &lt;li&gt;A. There are two types of pruning: structured pruning and unstructured pruning. In structured pruning, you remove whole layers, attention heads, or other entire computational blocks, while in unstructured pruning, you remove individual weights from the neural network. While unstructured pruning can achieve higher compression rates, it requires specialised hardware for efficient deployment. In contrast, models resulting from structured pruning can be deployed on standard GPUs.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Q. Is it better to fine-tune the baseline model &lt;em&gt;before&lt;/em&gt; pruning?
    &lt;ul&gt;
      &lt;li&gt;A. If your task, domain, or language is very different from the distribution of the baseline model, it is better to fine-tune the baseline model first. Otherwise, you can prune the baseline directly. Either way, fine-tuning &lt;em&gt;after&lt;/em&gt; pruning is always required to restore the quality of the baseline model.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Q. Can we fine-tune after each layer pruning step?
    &lt;ul&gt;
      &lt;li&gt;A. We experimented with fine-tuning both after each layer pruning step and after a number of pruned layers. In both cases, there was no improvement over fine-tuning just once at the end of the pruning process. This might be due to overfitting resulting from the several fine-tuning passes.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Q. Can the same approach be applied to encoder-decoder models?
    &lt;ul&gt;
      &lt;li&gt;A. Yes, we applied this iterative layer pruning approach to both a decoder-only model, Aya-Expanse, and an encoder-decoder model, Qwen2-Audio. However, for encoder-decoder models, we observed that pruning only the decoder leads to better overall performance.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;github-repository&quot;&gt;GitHub Repository&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/ymoslem/Model-Compression&quot;&gt;&lt;img src=&quot;https://github-readme-stats.vercel.app/api/pin/?theme=graywhite&amp;amp;username=ymoslem&amp;amp;repo=Model-Compression&quot; alt=&quot;Model-Compression&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a href=&quot;https://aclanthology.org/2025.iwslt-1.40/&quot;&gt;Efficient Speech Translation through Model Compression and Knowledge Distillation&lt;/a&gt; (Moslem, IWSLT 2025)&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://aclanthology.org/2025.wmt-1.78/&quot;&gt;Iterative Layer Pruning for Efficient Translation Inference&lt;/a&gt; (Moslem et al., WMT 2025)&lt;/li&gt;
&lt;/ol&gt;

</description>
        <pubDate>Mon, 17 Nov 2025 00:00:00 +0000</pubDate>
        <link>https://blog.machinetranslation.io/iterative-layer-pruning/</link>
        <guid isPermaLink="true">https://blog.machinetranslation.io/iterative-layer-pruning/</guid>
        
        
        <category>efficiency</category>
        
        <category>llm</category>
        
        <category>mt</category>
        
        <category>speech</category>
        
      </item>
    
      <item>
        <title>Adaptive Translation and Terminology with Large Language Models</title>
        <description>&lt;p&gt;Large-scale language models (LLMs) have shown interesting capabilities of in-context learning, where they learn to replicate certain input-output text generation patterns, without further fine-tuning. By feeding an LLM at inference time with a prompt that consists of a list of translation pairs, it can then simulate the domain, terminology, and style characteristics.&lt;/p&gt;

&lt;h2 id=&quot;adaptive-machine-translation-with-large-language-models&quot;&gt;Adaptive Machine Translation with Large Language Models&lt;/h2&gt;

&lt;p&gt;First preprint: January 2023&lt;/p&gt;

&lt;p&gt;Peer-reviewed: EAMT 2023&lt;/p&gt;

&lt;h3 id=&quot;abstract&quot;&gt;Abstract:&lt;/h3&gt;

&lt;p&gt;Consistency is a key requirement of high-quality translation. It is especially important to adhere to pre-approved terminology and adapt to corrected translations in domain-specific projects. Machine translation (MT) has achieved significant progress in the area of domain adaptation. However, real-time adaptation remains challenging. Large-scale language models (LLMs) have recently shown interesting capabilities of in-context learning, where they learn to replicate certain input-output text generation patterns, without further fine-tuning. By feeding an LLM at inference time with a prompt that consists of a list of translation pairs, it can then simulate the domain and style characteristics. This work aims to investigate how we can utilize in-context learning to improve real-time adaptive MT. Our extensive experiments show promising results at translation time. For example, GPT-3.5 can adapt to a set of in-domain sentence pairs and/or terminology while translating a new sentence. We observe that the translation quality with few-shot in-context learning can surpass that of strong encoder-decoder MT systems, especially for high-resource languages. Moreover, we investigate whether we can combine MT from strong encoder-decoder models with fuzzy matches, which can further improve translation quality, especially for less supported languages. We conduct our experiments across five diverse language pairs, namely English-to-Arabic (EN-AR), English-to-Chinese (EN-ZH), English-to-French (EN-FR), English-to-Kinyarwanda (EN-RW), and English-to-Spanish (EN-ES).&lt;/p&gt;

&lt;div class=&quot;language-bib highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nc&quot;&gt;@inproceedings&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;moslem-etal-2023-adaptive&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;title&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Adaptive Machine Translation with Large Language Models&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;author&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Moslem, Yasmin  and
      Haque, Rejwanul  and
      Kelleher, John D.  and
      Way, Andy&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;booktitle&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Proceedings of the 24th Annual Conference of the European Association for Machine Translation&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;month&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;jun&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;year&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;2023&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;address&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Tampere, Finland&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;publisher&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;European Association for Machine Translation&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;url&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;https://aclanthology.org/2023.eamt-1.22/&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;pages&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;227--237&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
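&lt;p&gt;The in-context learning setup can be illustrated with a minimal prompt builder: fuzzy matches retrieved from a translation memory become few-shot examples for the new source sentence. The template below is an illustrative sketch, not the exact prompt format used in the paper.&lt;/p&gt;

```python
def build_fewshot_prompt(fuzzy_matches, new_source,
                         src_lang="English", tgt_lang="Spanish"):
    # Each (source, target) fuzzy match becomes a few-shot example;
    # the model is expected to complete the final target line.
    lines = []
    for src, tgt in fuzzy_matches:
        lines.append(f"{src_lang}: {src}")
        lines.append(f"{tgt_lang}: {tgt}")
    lines.append(f"{src_lang}: {new_source}")
    lines.append(f"{tgt_lang}:")
    return "\n".join(lines)


print(build_fewshot_prompt(
    [("The patient shows symptoms.", "El paciente presenta síntomas.")],
    "The patient is stable."))
```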

&lt;h2 id=&quot;domain-terminology-integration-into-machine-translation-leveraging-large-language-models&quot;&gt;Domain Terminology Integration into Machine Translation: Leveraging Large Language Models&lt;/h2&gt;

&lt;p&gt;First preprint: October 2023&lt;/p&gt;

&lt;p&gt;Peer-reviewed: WMT 2023&lt;/p&gt;

&lt;h3 id=&quot;abstract-1&quot;&gt;Abstract:&lt;/h3&gt;

&lt;p&gt;This paper discusses the methods that we used for our submissions to the WMT 2023 Terminology Shared Task for German-to-English (DE-EN), English-to-Czech (EN-CS), and Chinese-to-English (ZH-EN) language pairs. The task aims to advance machine translation (MT) by challenging participants to develop systems that accurately translate technical terms, ultimately enhancing communication and understanding in specialised domains. To this end, we conduct experiments that utilise large language models (LLMs) for two purposes: generating synthetic bilingual terminology-based data, and post-editing translations generated by an MT model through incorporating pre-approved terms. Our system employs a four-step process: (i) using an LLM to generate bilingual synthetic data based on the provided terminology, (ii) fine-tuning a generic encoder-decoder MT model, with a mix of the terminology-based synthetic data generated in the first step and a randomly sampled portion of the original generic training data, (iii) generating translations with the fine-tuned MT model, and (iv) finally, leveraging an LLM for terminology-constrained automatic post-editing of the translations that do not include the required terms. The results demonstrate the effectiveness of our proposed approach in improving the integration of pre-approved terms into translations. The number of terms incorporated into the translations of the blind dataset increases from an average of 36.67% with the generic model to an average of 72.88% by the end of the process. In other words, successful utilisation of terms nearly doubles across the three language pairs.&lt;/p&gt;
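&lt;p&gt;Step (iv) first requires identifying which pre-approved terms are missing from a translation. A minimal sketch of that check is shown below; the function name and the simple case-insensitive substring matching are illustrative assumptions.&lt;/p&gt;

```python
def missing_terms(translation, required_terms):
    # Return the pre-approved target terms that do not appear in the
    # translation, i.e. the candidates for terminology-constrained
    # automatic post-editing with an LLM.
    lowered = translation.lower()
    return [term for term in required_terms
            if term.lower() not in lowered]


print(missing_terms("Der Bremssattel ist defekt.",
                    ["Bremssattel", "Hauptzylinder"]))  # ['Hauptzylinder']
```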

&lt;div class=&quot;language-bib highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nc&quot;&gt;@inproceedings&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;moslem-etal-2023-domain&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;title&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Domain Terminology Integration into Machine Translation: Leveraging Large Language Models&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;author&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Moslem, Yasmin  and
      Romani, Gianfranco  and
      Molaei, Mahdi  and
      Kelleher, John D.  and
      Haque, Rejwanul  and
      Way, Andy&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;booktitle&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Proceedings of the Eighth Conference on Machine Translation&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;month&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;dec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;year&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;2023&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;address&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Singapore&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;publisher&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Association for Computational Linguistics&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;url&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;https://aclanthology.org/2023.wmt-1.82/&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;doi&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;10.18653/v1/2023.wmt-1.82&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;pages&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;902--911&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;fine-tuning-large-language-models-for-adaptive-machine-translation&quot;&gt;Fine-tuning Large Language Models for Adaptive Machine Translation&lt;/h2&gt;

&lt;p&gt;First preprint: December 2023&lt;/p&gt;

&lt;p&gt;Published as: thesis chapter&lt;/p&gt;

&lt;h3 id=&quot;abstract-2&quot;&gt;Abstract:&lt;/h3&gt;

&lt;p&gt;This paper presents the outcomes of fine-tuning Mistral 7B, a general-purpose large language model (LLM), for adaptive machine translation (MT). The fine-tuning process involves utilising a combination of zero-shot and one-shot translation prompts within the medical domain. The primary objective is to enhance real-time adaptive MT capabilities of Mistral 7B, enabling it to adapt translations to the required domain at inference time. The results, particularly for Spanish-to-English MT, showcase the efficacy of the fine-tuned model, demonstrating quality improvements in both zero-shot and one-shot translation scenarios, surpassing Mistral 7B’s baseline performance. Notably, the fine-tuned Mistral outperforms ChatGPT “gpt-3.5-turbo” in zero-shot translation while achieving comparable one-shot translation quality. Moreover, the zero-shot translation of the fine-tuned Mistral matches NLLB 3.3B’s performance, and its one-shot translation quality surpasses that of NLLB 3.3B. These findings emphasise the significance of fine-tuning efficient LLMs like Mistral 7B to yield high-quality zero-shot translations comparable to task-oriented models like NLLB 3.3B. Additionally, the adaptive gains achieved in one-shot translation are comparable to those of commercial LLMs such as ChatGPT. Our experiments demonstrate that, with a relatively small dataset of 20,000 segments that incorporate a mix of zero-shot and one-shot prompts, fine-tuning significantly enhances Mistral’s in-context learning ability, especially for real-time adaptive MT.&lt;/p&gt;

&lt;div class=&quot;language-bib highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nc&quot;&gt;@article&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;moslem-etal-2023-fine-tuning-llms&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;title&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{Fine-tuning Large Language Models for Adaptive Machine Translation}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
      &lt;span class=&quot;na&quot;&gt;author&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{Yasmin Moslem and Rejwanul Haque and Andy Way}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;year&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{2023}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;eprint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{2312.12740}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;archivePrefix&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{arXiv}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;primaryClass&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{cs.CL}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{https://arxiv.org/abs/2312.12740}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;github-repository&quot;&gt;GitHub Repository&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/ymoslem/Adaptive-MT-LLM-Fine-tuning&quot;&gt;&lt;img src=&quot;https://github-readme-stats.vercel.app/api/pin/?theme=graywhite&amp;amp;username=ymoslem&amp;amp;repo=Adaptive-MT-LLM-Fine-tuning&quot; alt=&quot;Adaptive-MT-LLM-Fine-tuning&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;language-modelling-approaches-to-adaptive-machine-translation&quot;&gt;Language Modelling Approaches to Adaptive Machine Translation&lt;/h2&gt;

&lt;p&gt;First preprint: January 2024&lt;/p&gt;

&lt;p&gt;Published as: PhD thesis (DCU)&lt;/p&gt;

&lt;h3 id=&quot;abstract-3&quot;&gt;Abstract:&lt;/h3&gt;

&lt;p&gt;Consistency is a key requirement of high-quality translation. It is especially important to adhere to pre-approved terminology and adapt to corrected translations in domain-specific projects. Machine translation (MT) has achieved significant progress in the area of domain adaptation. However, in-domain data scarcity is common in translation settings, due to the lack of specialised datasets and terminology, or inconsistency and inaccuracy of available in-domain translations. In such scenarios where there is insufficient in-domain data to fine-tune MT models, producing translations that are consistent with the relevant context is challenging. While real-time adaptation can make use of smaller amounts of in-domain data to improve the translation on the fly, it remains challenging due to supported context limitations and efficiency constraints. Large language models (LLMs) have recently shown interesting capabilities of in-context learning, where they learn to replicate certain input-output text generation patterns, without further fine-tuning. Such capabilities have opened new horizons for domain-specific data augmentation and real-time adaptive MT. This work attempts to address two main relevant questions: 1) in scenarios involving human interaction and continuous feedback, can we employ language models to improve the quality of adaptive MT at inference time? and 2) in the absence of sufficient in-domain data, can we use pre-trained large-scale language models to improve the process of MT domain adaptation?&lt;/p&gt;

&lt;div class=&quot;language-bib highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nc&quot;&gt;@article&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;moslem-2024-adaptive-mt-llms&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;title&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{Language Modelling Approaches to Adaptive Machine Translation, {PhD} thesis}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
      &lt;span class=&quot;na&quot;&gt;author&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{Yasmin Moslem}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;year&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{2024}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;eprint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{2401.14559}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;archivePrefix&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{arXiv}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;primaryClass&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{cs.CL}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{https://arxiv.org/abs/2401.14559}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://aclanthology.org/2023.eamt-1.22/&quot;&gt;Adaptive Machine Translation with Large Language Models&lt;/a&gt; (Moslem et al., EAMT 2023)&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://aclanthology.org/2023.wmt-1.82/&quot;&gt;Domain Terminology Integration into Machine Translation: Leveraging Large Language Models&lt;/a&gt; (Moslem et al., WMT 2023)&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/2312.12740&quot;&gt;Fine-tuning Large Language Models for Adaptive Machine Translation&lt;/a&gt; (Moslem et al., thesis chapter 2023)&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/2401.14559&quot;&gt;Language Modelling Approaches to Adaptive Machine Translation&lt;/a&gt; (Moslem, PhD thesis, DCU 2024)&lt;/li&gt;
&lt;/ul&gt;
</description>
        <pubDate>Wed, 10 Jan 2024 00:00:00 +0000</pubDate>
        <link>https://blog.machinetranslation.io/adaptive-mt/</link>
        <guid isPermaLink="true">https://blog.machinetranslation.io/adaptive-mt/</guid>
        
        
        <category>nmt</category>
        
        <category>llm</category>
        
      </item>
    
      <item>
        <title>Purely Synthetic Bilingual Data for Machine Translation?</title>
        <description>&lt;p&gt;In-domain data scarcity is common in translation settings, due to the lack of specialized datasets and terminology, or inconsistency and inaccuracy of available in-domain translations. You might be familiar to such situation when there is a big translation project, but there is only a tiny in-domain translation memory, or no translation memory at all. In the absence of sufficient domain-specific data required to fine-tune machine translation (MT) systems, adhering to the domain terminology and client’s style can be challenging.  Recently, there has been a considerable advancement in training large language models, not only for English, but also for diverse languages. Among autoregressive language models, trained to predict the next word in a sequence, are BLOOM, GPT-3, and GPT-J. The question is: &lt;strong&gt;can we use these large language models to generate more domain-specific &lt;em&gt;bilingual&lt;/em&gt; data?&lt;/strong&gt;&lt;/p&gt;

&lt;h2 id=&quot;method&quot;&gt;Method&lt;/h2&gt;

&lt;p&gt;Interestingly, when you feed such large language models with an in-domain sentence, they can generate more synthetic sentences that simulate the domain and linguistic characteristics of the authentic sentence. In the research “&lt;strong&gt;&lt;em&gt;&lt;a href=&quot;https://aclanthology.org/2022.amta-research.2&quot;&gt;Domain-Specific Text Generation for Machine Translation&lt;/a&gt;&lt;/em&gt;&lt;/strong&gt;” (Moslem et al., 2022), we investigated the feasibility of this domain-specific text generation technique when no bilingual in-domain dataset, or only a limited one, is available. We proposed a novel approach to domain adaptation, leveraging state-of-the-art pre-trained language models to generate huge amounts of synthetic bilingual in-domain data, with the goal of improving translation of in-domain texts. The process can be summarised in three simple steps:&lt;/p&gt;

&lt;h4 id=&quot;1-text-generation-target&quot;&gt;1. Text generation (target)&lt;/h4&gt;

&lt;blockquote&gt;
  &lt;p&gt;Generate target-side synthetic sentences using a large pre-trained language model.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When there is a small in-domain translation memory, you can use each target sentence as a prompt to generate text that simulates the domain characteristics of the authentic in-domain data. If there is no translation memory at all, you can first forward-translate the source text to be translated, or a portion of it, using the baseline MT model.&lt;/p&gt;

&lt;h4 id=&quot;2-back-translation-source&quot;&gt;2. Back-translation (source)&lt;/h4&gt;

&lt;blockquote&gt;
  &lt;p&gt;Back-translate the synthetic target-side sentences into source language.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Combining the idea of in-domain text generation with back-translation, you can generate huge amounts of synthetic bilingual in-domain data for both use cases.&lt;/p&gt;

&lt;h4 id=&quot;3-mixed-fine-tuning&quot;&gt;3. Mixed fine-tuning&lt;/h4&gt;

&lt;blockquote&gt;
  &lt;p&gt;Fine-tune the baseline model, on a mix of synthetic and authentic data.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Finally, the baseline MT model should be fine-tuned using a combination of the synthetic bilingual in-domain dataset and a randomly sampled section of the original generic dataset.&lt;/p&gt;
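&lt;p&gt;The three steps can be sketched end to end. Both &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;generate_synthetic_targets&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;back_translate&lt;/code&gt; are hypothetical placeholders for the language model and the reverse-direction MT model, respectively.&lt;/p&gt;

```python
import random


def generate_synthetic_targets(seed_sentences, n_per_seed):
    # Placeholder for prompting a language model (e.g. GPT-J) with each
    # in-domain target sentence to generate similar synthetic sentences.
    return [f"{s} (variant {i})"
            for s in seed_sentences for i in range(n_per_seed)]


def back_translate(targets):
    # Placeholder for decoding with the reverse-direction MT model.
    return [f"src({t})" for t in targets]


def mixed_finetuning_data(seed_targets, generic_pairs,
                          sample_size, n_per_seed=2):
    targets = generate_synthetic_targets(seed_targets, n_per_seed)  # step 1
    sources = back_translate(targets)                               # step 2
    synthetic = list(zip(sources, targets))
    # Step 3: mix with a random sample of the original generic data.
    sample = random.sample(generic_pairs,
                           min(sample_size, len(generic_pairs)))
    return synthetic + sample
```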

&lt;h2 id=&quot;target-text-generation&quot;&gt;Target Text Generation&lt;/h2&gt;

&lt;p&gt;This code snippet shows how to load the GPT-J language model. You can use some efficient loading techniques, such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;float16&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;low_cpu_mem_usage&lt;/code&gt;.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;transformers&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;GPTJForCausalLM&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AutoTokenizer&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;torch&lt;/span&gt;


&lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AutoTokenizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;from_pretrained&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;EleutherAI/gpt-j-6B&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;padding_side&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;left&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pad_token&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;eos_token&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;GPTJForCausalLM&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;from_pretrained&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;EleutherAI/gpt-j-6B&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                                         &lt;span class=&quot;n&quot;&gt;revision&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;float16&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                                         &lt;span class=&quot;n&quot;&gt;torch_dtype&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;float16&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                                         &lt;span class=&quot;n&quot;&gt;low_cpu_mem_usage&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                                         &lt;span class=&quot;n&quot;&gt;cache_dir&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;models_cache/&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                                         &lt;span class=&quot;n&quot;&gt;pad_token_id&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;eos_token_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;half&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;to&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;cuda&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Afterwards, you can use each target segment in the authentic in-domain dataset as a prompt to generate synthetic in-domain text. We use top-k and top-p sampling to generate diverse text sequences. Here, we set &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;num_return_sequences&lt;/code&gt; to generate 5 sequences per prompt. Each sequence might include multiple sentences, which you can then split apart using any sentence splitter.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;target_segment&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;I am an example sentence that talks about something very specialized!&quot;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;input_ids&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;target_segment&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;add_special_tokens&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;return_tensors&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;pt&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;input_ids&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;to&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;cuda&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;sample_outputs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;generate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;input_ids&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                                &lt;span class=&quot;n&quot;&gt;do_sample&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                                &lt;span class=&quot;n&quot;&gt;max_length&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;300&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                                &lt;span class=&quot;n&quot;&gt;top_k&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;50&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                                &lt;span class=&quot;n&quot;&gt;top_p&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.95&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                                &lt;span class=&quot;n&quot;&gt;num_return_sequences&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                                &lt;span class=&quot;n&quot;&gt;early_stopping&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;generated_text&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;batch_decode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sample_outputs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;skip_special_tokens&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
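&lt;p&gt;Since each generated sequence may contain several sentences, a splitting step is needed before the output can be used as training segments. The following is a minimal rule-based sketch written for this post; a dedicated sentence splitter (e.g. NLTK or spaCy) handles abbreviations and other edge cases better.&lt;/p&gt;

```python
import re

def split_sentences(text):
    """Naively split a generated text sequence into sentences.

    Captures runs of characters ending in ., ! or ?, plus a final
    fragment without terminal punctuation. A dedicated sentence
    splitter is more robust (abbreviations, decimals, etc.).
    """
    pieces = re.findall(r"[^.!?]+[.!?]+|[^.!?]+$", text)
    return [piece.strip() for piece in pieces if piece.strip()]

# Flatten the generated sequences into individual synthetic sentences
sequences = ["India ordered a shutdown. The pandemic spread worldwide!"]
synthetic_sentences = []
for sequence in sequences:
    synthetic_sentences.extend(split_sentences(sequence))
```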

&lt;p&gt;The quality of the language model is important. Here, you can see examples of generated text that is both linguistically correct and factually accurate.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;In March 2020, India ordered the countrywide shut down of all non-essential economic activities due to the spreading COVID-19 pandemic.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;While the overall worldwide economic impact of COVID-19 will only be realized through the end of 2020 and the recovery phase in 2021, it is clear that certain parts of the world have been severely impacted.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Sometimes, the generated text can be linguistically correct; however, numbers or names might be inaccurate.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Antiviral drugs are approved for pregnant women and should be considered for children younger than &lt;strong&gt;XX&lt;/strong&gt; years, although some are still being investigated.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;Scientists have found some species of &lt;strong&gt;unicorn&lt;/strong&gt; in Amazon rainforests.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If there are only small mistakes, the generated synthetic data can still be used. Obviously, the better the quality of the text generated by the language model, the better the quality we can expect when fine-tuning the baseline MT model on this synthetic data.&lt;/p&gt;

&lt;h2 id=&quot;back-translation&quot;&gt;Back-Translation&lt;/h2&gt;

&lt;p&gt;Now, we have the target side of our new in-domain dataset. To generate the source side, use back-translation in the reverse language direction. For back-translation, you can either train another MT model yourself or use pre-trained models such as &lt;a href=&quot;https://github.com/Helsinki-NLP/Tatoeba-Challenge/blob/master/results/tatoeba-models-all.md&quot;&gt;OPUS models&lt;/a&gt;. Optionally, you can convert OPUS models to the &lt;a href=&quot;https://opennmt.net/CTranslate2/&quot;&gt;CTranslate2&lt;/a&gt; format, with quantisation, to enhance efficiency.&lt;/p&gt;
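&lt;p&gt;Whatever MT backend you pick, the back-translation step itself is just batched inference over the synthetic target sentences. The sketch below is illustrative, not part of our released scripts: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;translate_fn&lt;/code&gt; is a stand-in for whatever engine you use (e.g. a CTranslate2 translator loaded from an OPUS model), and only the batching logic is shown.&lt;/p&gt;

```python
def backtranslate(target_sentences, translate_fn, batch_size=32):
    """Back-translate synthetic target-side sentences into the source language.

    translate_fn: any callable mapping a list of target-language sentences
    to a list of source-language sentences (i.e. the actual MT engine).
    """
    source_sentences = []
    for start in range(0, len(target_sentences), batch_size):
        batch = target_sentences[start:start + batch_size]
        source_sentences.extend(translate_fn(batch))
    return source_sentences

# Pair each back-translated source sentence with its synthetic target,
# using a dummy "engine" purely for illustration
targets = ["Hello.", "Goodbye."]
dummy_engine = lambda batch: ["SRC " + sentence for sentence in batch]
bitext = list(zip(backtranslate(targets, dummy_engine), targets))
```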

&lt;p&gt;Basically, both the source and target sides of our new large in-domain dataset consist of synthetic data. The target side is generated by a language model, while the source side is generated by back-translation in the reverse language direction.&lt;/p&gt;

&lt;h2 id=&quot;mixed-fine-tuning&quot;&gt;Mixed Fine-Tuning&lt;/h2&gt;

&lt;p&gt;Now, it’s time to apply mixed fine-tuning to the baseline model.&lt;/p&gt;

&lt;p&gt;In other words, continue training our baseline model on a mix of (a) the synthetic bilingual in-domain dataset we obtained from the two previous steps, and (b) a randomly sampled portion of the original generic dataset. In our experiments, we oversampled the synthetic in-domain dataset by a 9x ratio.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../static/img/mixed-fine-tuning-oversampling.png&quot; alt=&quot;Mixed Fine-Tuning Oversampling&quot; /&gt;&lt;/p&gt;

&lt;center&gt;Mixed fine-tuning (Chu et al., 2017)&lt;/center&gt;

&lt;p&gt; &lt;/p&gt;
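&lt;p&gt;At the file level, the mixed fine-tuning recipe can be sketched as follows. This is an illustrative sketch, not the exact scripts used in the paper; the function name and defaults are assumptions made for the example.&lt;/p&gt;

```python
import random

def build_mixed_dataset(in_domain, generic, oversample=9,
                        generic_sample_size=None, seed=42):
    """Mix the synthetic in-domain data (oversampled) with a random
    sample of the original generic data, then shuffle the result."""
    rng = random.Random(seed)
    if generic_sample_size is None:
        generic_sample_size = len(generic)
    generic_part = rng.sample(generic, min(generic_sample_size, len(generic)))
    mixed = in_domain * oversample + generic_part
    rng.shuffle(mixed)
    return mixed
```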

&lt;p&gt;To apply oversampling, we employed the dataset weights feature in OpenNMT-tf. If you are using OpenNMT-py or OpenNMT-tf, you can find more details in this &lt;a href=&quot;https://blog.machinetranslation.io/domain-adaptation-mixed-fine-tuning/&quot;&gt;tutorial on mixed fine-tuning&lt;/a&gt; of MT models.&lt;/p&gt;
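&lt;p&gt;For reference, OpenNMT-py (version 2 and later) exposes a similar per-corpus &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;weight&lt;/code&gt; option in its YAML data configuration. The snippet below is a sketch with placeholder paths, not our exact configuration.&lt;/p&gt;

```yaml
# Sketch of an OpenNMT-py (v2+) data section with corpus weighting.
# Paths are placeholders; the weights implement the 9:1 oversampling.
data:
    generic:
        path_src: data/generic.src
        path_tgt: data/generic.tgt
        weight: 1
    in_domain:
        path_src: data/synthetic_in_domain.src
        path_tgt: data/synthetic_in_domain.tgt
        weight: 9
```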

&lt;h2 id=&quot;results&quot;&gt;Results&lt;/h2&gt;

&lt;p&gt;In both scenarios, our proposed method achieves significant improvements, as demonstrated by both automatic and human evaluations. As expected, Setup 1 (where there is a tiny bilingual dataset) yields better results than Setup 2 (where there is no bilingual dataset at all). Still, the models resulting from both setups outperform the baseline model.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../static/img/MT-ML-results.png&quot; alt=&quot;Results&quot; /&gt;&lt;/p&gt;

&lt;center&gt;Evaluation results on the in-domain test set, TICO-19&lt;/center&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Previously, synthetic data for machine translation had been created either on the source side only (forward-translation) or the target side only (back-translation). In some cases, researchers replaced a few words from the source and/or the target with synonyms or similar words. The assumption was that “relevant” monolingual data was available, which was not always the case!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In real-life scenarios&lt;/strong&gt;, things can get more complex. Usually, there is insufficient human-produced data to train or fine-tune high-quality MT systems. Production-level projects can be highly specialised, while mining crawled monolingual datasets is inefficient and not necessarily helpful.&lt;/p&gt;

&lt;p&gt;This research work &lt;strong&gt;generates brand new synthetic data on both the source and target sides&lt;/strong&gt;. It employs large language models to put together coherent sentences similar to those to be translated in the current project. Then, the new synthetic data can be used to fine-tune production-level MT systems for domain-specific scenarios.  Feel free to check out our paper, &lt;a href=&quot;https://aclanthology.org/2022.amta-research.2&quot;&gt;&lt;em&gt;Domain-Specific Text Generation for Machine Translation&lt;/em&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;download-scripts&quot;&gt;Download Scripts&lt;/h2&gt;

&lt;p&gt;You can download our scripts and configuration files at &lt;a href=&quot;https://github.com/ymoslem/MT-LM&quot;&gt;GitHub&lt;/a&gt;. If you have applied the method and/or have questions, please let me know.&lt;/p&gt;

&lt;h2 id=&quot;citation&quot;&gt;Citation&lt;/h2&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;@inproceedings{moslem-etal-2022-domain,
    title = &quot;Domain-Specific Text Generation for Machine Translation&quot;,
    author = &quot;Moslem, Yasmin  and
      Haque, Rejwanul  and
      Kelleher, John  and
      Way, Andy&quot;,
    booktitle = &quot;Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)&quot;,
    month = sep,
    year = &quot;2022&quot;,
    address = &quot;Orlando, USA&quot;,
    publisher = &quot;Association for Machine Translation in the Americas&quot;,
    url = &quot;https://aclanthology.org/2022.amta-research.2&quot;,
    pages = &quot;14--30&quot;,
    abstract = &quot;Preservation of domain knowledge from the source to target is crucial in any translation workflow. It is common in the translation industry to receive highly specialized projects, where there is hardly any parallel in-domain data. In such scenarios where there is insufficient in-domain data to fine-tune Machine Translation (MT) models, producing translations that are consistent with the relevant context is challenging. In this work, we propose a novel approach to domain adaptation leveraging state-of-the-art pretrained language models (LMs) for domain-specific data augmentation for MT, simulating the domain characteristics of either (a) a small bilingual dataset, or (b) the monolingual source text to be translated. Combining this idea with back-translation, we can generate huge amounts of synthetic bilingual in-domain data for both use cases. For our investigation, we use the state-of-the-art Transformer architecture. We employ mixed fine-tuning to train models that significantly improve translation of in-domain texts. More specifically, in both scenarios, our proposed methods achieve improvements of approximately 5-6 BLEU and 2-3 BLEU, respectively, on the Arabic-to-English and English-to-Arabic language pairs. Furthermore, the outcome of human evaluation corroborates the automatic evaluation results&quot;,
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
</description>
        <pubDate>Mon, 12 Dec 2022 00:00:00 +0000</pubDate>
        <link>https://blog.machinetranslation.io/synthetic-data-machine-translation/</link>
        <guid isPermaLink="true">https://blog.machinetranslation.io/synthetic-data-machine-translation/</guid>
        
        
        <category>nmt</category>
        
      </item>
    
      <item>
        <title>Translation Auto-suggestions: What do Linguists Think?</title>
        <description>&lt;p&gt;Translation auto-suggestion and auto-completion are among the important features that can help translators better utilize Machine Translation (MT) systems. In a Computer-Aided Translation (CAT) environment, a translator can make use of the MT word auto-suggestion feature as follows:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;when the translator types a few words, or clicks a word in a proposed MT translation, a list of suggestions is displayed;&lt;/li&gt;
  &lt;li&gt;when the translator selects one of the word suggestions from the list, the rest of the translation is modified accordingly.&lt;/li&gt;
&lt;/ul&gt;
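&lt;p&gt;As a toy illustration of this interaction (not a production auto-completion system), one can extract next-word suggestions from the MT hypotheses that are consistent with the prefix the translator has accepted so far. The function name and logic here are assumptions made for the example.&lt;/p&gt;

```python
def suggest_next_words(prefix, hypotheses):
    """Return candidate next words from MT hypotheses whose beginning
    matches the prefix the translator has typed or accepted so far."""
    prefix_tokens = prefix.split()
    suggestions = []
    for hypothesis in hypotheses:
        tokens = hypothesis.split()
        # The hypothesis must start with the prefix and continue past it
        if tokens[:len(prefix_tokens)] == prefix_tokens and len(tokens) != len(prefix_tokens):
            word = tokens[len(prefix_tokens)]
            if word not in suggestions:
                suggestions.append(word)
    return suggestions
```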

&lt;p&gt;In a user survey we designed and distributed via social media networks, we asked participants whether they thought an MT word-level auto-suggestions feature could be helpful, and provided a simple definition and an illustrative image. If their answer was “Yes”, the respondent was asked to specify a reason. By the time of writing this article, we had received 41 responses to our survey. While we do not believe this survey is enough to justify introducing an auto-suggestions feature into every MT system, it can be an indicator as to why some users think such a feature could be helpful.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../static/img/autosuggest.png&quot; alt=&quot;MT-autosuggestions&quot; /&gt;&lt;/p&gt;

&lt;p&gt;To answer the question, “&lt;em&gt;Which of the following best describes you?&lt;/em&gt;” 46.3% (19) of the respondents chose “&lt;em&gt;Translator/Linguist&lt;/em&gt;”, 31.7% (13) selected “&lt;em&gt;NLP Engineer/Researcher&lt;/em&gt;”, and the remaining 22% (9) were other “&lt;em&gt;MT Users&lt;/em&gt;” not included in the two aforementioned categories.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../static/img/autosuggest-categories.png&quot; alt=&quot;users&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Among the respondents to the survey, 90.2% (37) answered “Yes” to the question “&lt;em&gt;In general, do you believe that a word-level auto-suggestions feature is helpful?&lt;/em&gt;” The figure below shows the breakdown of answers to the question, “&lt;em&gt;Why do you believe that a word-level auto-suggestions feature can be helpful?&lt;/em&gt;” taking into consideration those who answered “No” to the previous question.&lt;/p&gt;

&lt;p&gt;Out of the 37 respondents who believed a word-level auto-suggestions feature can be helpful, 40.5% (15) specified that it can give them some inspiration. This answer is particularly interesting as it is not constrained by time-saving benefits; hence, it focuses more on effectiveness than efficiency. The respondent who answered “Other” mentioned that the feature allows them to look for alternative senses or phrasings, especially when they suspect the initial translation is bad, and referred to this as “human in the loop”.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../static/img/autosuggest-why.png&quot; alt=&quot;autosuggest-why&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Respondents were allowed to give extra comments; among the notable comments were:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;small&gt;I think word-level suggestions can be a useful feature, particularly when the target language can have several translations of a single source word.&lt;/small&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;small&gt;Word-level suggestions can be helpful, but sometimes you end up spending a lot of time figuring out if the MT suggestion is a valid translation in that context. So, I’m not really sure yet how I feel about it.&lt;/small&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;small&gt;It’s useful, as long as it’s seen as a suggestion, and not inserted in the target where the translator is typing.&lt;/small&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Among the respondents who answered “For me, it is easier or faster than typing”, comments included:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;small&gt;Though most of the time; the suggestions are lousy.&lt;/small&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;small&gt;I don’t think it gives me inspiration as I mostly need it for structures, not single words.&lt;/small&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;small&gt;Auto-suggestion does not have to come from machine translation. History is much more useful.&lt;/small&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The last comment above might refer to the fact that in some CAT tools, auto-suggestions can also include glossary terms and translation memory sub-segments. This encourages further research into methods that enhance the leveraging of, and interaction between, various translation resources in human-in-the-loop environments.&lt;/p&gt;

&lt;p&gt;We hope this survey will inspire future user studies to look deeper into how diverse users of MT and CAT tools prefer to utilize certain features, such as auto-suggestions, and the value they seek. More aspects should be taken into consideration such as language pairs, translation workflows, and user interfaces. This can help improve these features to better support linguists and other MT users and boost their productivity as well as translation quality.&lt;/p&gt;

&lt;h3 id=&quot;citation&quot;&gt;Citation&lt;/h3&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;@inproceedings{moslem2022-autosuggest,
    title = &quot;Word-Level Auto-Completion: What can we achieve out of the box?&quot;,
    author = &quot;Moslem, Yasmin  and
      Haque, Rejwanul  and
      Way, Andy&quot;,
    booktitle = &quot;Proceedings of the Seventh Conference on Machine Translation&quot;,
    month = dec,
    year = &quot;2022&quot;,
    address = &quot;Abu Dhabi, UAE&quot;,
    publisher = &quot;Association for Computational Linguistics&quot;,
    url = &quot;https://arxiv.org/abs/2210.12802&quot;,
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
</description>
        <pubDate>Mon, 24 Oct 2022 00:00:00 +0000</pubDate>
        <link>https://blog.machinetranslation.io/translation-autosuggestion-survey/</link>
        <guid isPermaLink="true">https://blog.machinetranslation.io/translation-autosuggestion-survey/</guid>
        
        
        <category>nmt</category>
        
      </item>
    
      <item>
        <title>Machine Translation Robustness</title>
        <description>&lt;p&gt;Let’s talk briefly about the concept of “Robustness” of neural machine translation (NMT) systems. While robustness should be emphasized when building any NMT system, even high-resource languages with plenty of data can still face linguistic challenges.&lt;/p&gt;

&lt;h2 id=&quot;what-does-nmt-robustness-mean&quot;&gt;What does NMT “Robustness” mean?&lt;/h2&gt;

&lt;p&gt;It simply means that a given NMT engine can handle a specific linguistic feature found in the input to be translated, even if this feature does not occur naturally in the training data. Examples of linguistic features we want our NMT model to be robust to include: domain terminology, proper names, number formats, text case, misspellings, code-switching (between two languages), and untranslatables such as tags, email addresses, etc.&lt;/p&gt;

&lt;h2 id=&quot;how-can-we-improve-nmt-robustness&quot;&gt;How can we identify robustness issues?&lt;/h2&gt;

&lt;p&gt;The first step to machine translation robustness is defining the issues that your model frequently encounters when translating a certain type of text. This step is underestimated, and in my opinion it is a sign of the maturity of production-level operations.&lt;/p&gt;

&lt;p&gt;This goes beyond numerical human evaluation, and moves a step further towards defining specific types of issues. In simple words, human evaluators are asked to state a clear reason why they think a translation should be ranked as, for example, 3 out of 5. At the beginning, they might be provided with lists of common issues, but they should also have the option to add new issues that can be integrated into the list later. Such explanations should not be vague; they should be precise enough to allow MT engineers to fix these issues. Problematic words should be marked; sometimes the track-changes feature is used. The main question is: &lt;em&gt;What is the most critical issue in this translation?&lt;/em&gt;&lt;/p&gt;

&lt;h2 id=&quot;how-can-we-improve-nmt-robustness-1&quot;&gt;How can we improve NMT “Robustness”?&lt;/h2&gt;

&lt;p&gt;In the findings of the WMT2020 Robustness Shared Task, under the “Common Trends” section, Specia et al. (2020) stated: “Participating systems were trained following a standard recipe, i) using big-transformer models, ii) boosting performance with tagged back-translation, iii) continued training with filtered data and in-domain data (where available), iv) ensembling different models to obtain further improvements.”&lt;/p&gt;

&lt;p&gt;In this sense, data augmentation techniques can be helpful; the new data can then be integrated into the NMT system, either by combining it with the original training data or by fine-tuning.&lt;/p&gt;

&lt;p&gt;As training a new system frequently might not be feasible, it is common in some companies to temporarily apply on-the-fly find-and-replace operations on translations until the next training is possible. Some researchers also suggest making such on-the-fly handling easier by injecting the training data with certain placeholders that can be replaced later. To apply this, in a portion of the training data, natural tags (HTML, XML, long numbers, etc.) are replaced with pseudo-tags (e.g. &amp;lt;t0&amp;gt;, &amp;lt;t1&amp;gt;, &amp;lt;t2&amp;gt;, …). These pseudo-tags should also be added as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;user_defined_symbols&lt;/code&gt; to the SentencePiece model (cf. SPM &lt;a href=&quot;https://github.com/google/sentencepiece/blob/master/doc/options.md&quot;&gt;options&lt;/a&gt;). At inference time, it is then easy to swap untranslatables for these pseudo-tags during pre-processing and restore them during post-processing. On a related note, activating the SentencePiece option &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;split_digits&lt;/code&gt; helps with copying longer numbers without intervention, while the option &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;byte_fallback&lt;/code&gt; sometimes helps with irregular characters in the training data.&lt;/p&gt;
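&lt;p&gt;The placeholder scheme can be sketched as a pre-/post-processing pair: natural tags are swapped for numbered pseudo-tags before translation and restored afterwards. This is a simplified illustration; the pseudo-tag format must match the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;user_defined_symbols&lt;/code&gt; added to the SentencePiece model, and a production implementation needs to guard against collisions with pseudo-tags occurring in the input.&lt;/p&gt;

```python
import re

# \u003c and \u003e are Python escapes for the two angle-bracket characters
TAG_PATTERN = re.compile("\u003c[^\u003e]+\u003e")

def mask_tags(sentence):
    """Replace natural tags with numbered pseudo-tags.
    Returns the masked sentence and a pseudo-tag to original-tag mapping."""
    mapping = {}
    def replace(match):
        pseudo = "\u003ct%d\u003e" % len(mapping)
        mapping[pseudo] = match.group(0)
        return pseudo
    return TAG_PATTERN.sub(replace, sentence), mapping

def unmask_tags(translation, mapping):
    """Restore the original tags in the translated output."""
    for pseudo, original in mapping.items():
        translation = translation.replace(pseudo, original)
    return translation
```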

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://themqm.org/&quot;&gt;MQM&lt;/a&gt; - Multidimensional Quality Metrics (&lt;a href=&quot;https://aclanthology.org/2013.tc-1.6/&quot;&gt;Lommel et al., 2013&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;Training Neural Machine Translation to Apply Terminology Constraints (&lt;a href=&quot;https://aclanthology.org/P19-1294/&quot;&gt;Dinu et al., 2019&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;Improving Robustness in Real-World Neural Machine Translation Engines (&lt;a href=&quot;https://aclanthology.org/W19-6727/&quot;&gt;Gupta et al., 2019&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;How Should Markup Tags Be Translated? (&lt;a href=&quot;https://aclanthology.org/2020.wmt-1.138/&quot;&gt;Hanneman and Dinu, 2020&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;Evaluating Robustness to Input Perturbations for Neural Machine Translation (&lt;a href=&quot;https://arxiv.org/abs/2005.00580&quot;&gt;Niu et al., 2020&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;Findings of the WMT 2020 Shared Task on Machine Translation Robustness (&lt;a href=&quot;https://aclanthology.org/2020.wmt-1.4/&quot;&gt;Specia et al., 2020&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;Business Critical Errors: A Framework for Adaptive Quality Feedback (&lt;a href=&quot;https://aclanthology.org/2022.amta-upg.17/&quot;&gt;Stewart et al., 2022&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;Improve MT for Search with Selected Translation Memory using Search Signals (&lt;a href=&quot;https://aclanthology.org/2022.amta-upg.9/&quot;&gt;Zhang 2022&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
</description>
        <pubDate>Wed, 28 Sep 2022 00:00:00 +0000</pubDate>
        <link>https://blog.machinetranslation.io/machine-translation-robustness/</link>
        <guid isPermaLink="true">https://blog.machinetranslation.io/machine-translation-robustness/</guid>
        
        
        <category>nmt</category>
        
      </item>
    
      <item>
        <title>Machine Translation Models: How to Build and Deploy</title>
        <description>&lt;p&gt;This is a Neural Machine Translation (NMT) tutorial with &lt;a href=&quot;https://github.com/ymoslem/OpenNMT-py&quot;&gt;OpenNMT-py&lt;/a&gt; and relevant tools. It covers data preprocessing, model training, evaluation, and deployment. The tutorial was put together as part of a mentorship activity I organised in 2022.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/ymoslem/OpenNMT-Tutorial&quot;&gt;&lt;img src=&quot;https://github-readme-stats.vercel.app/api/pin/?theme=graywhite&amp;amp;username=ymoslem&amp;amp;repo=OpenNMT-Tutorial&quot; alt=&quot;NMT-tutorial&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;fundamentals&quot;&gt;Fundamentals&lt;/h2&gt;
&lt;ul&gt;
  &lt;li&gt;Data Processing (&lt;a href=&quot;1-NMT-Data-Processing.ipynb&quot;&gt;notebook&lt;/a&gt;, &lt;a href=&quot;https://github.com/ymoslem/MT-Preparation&quot;&gt;repository&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;NMT Model Training with OpenNMT-py (&lt;a href=&quot;2-NMT-Training.ipynb&quot;&gt;notebook&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;Translation/Inference with CTranslate2 (&lt;a href=&quot;https://gist.github.com/ymoslem/60e1d1dc44fe006f67e130b6ad703c4b&quot;&gt;code&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;MT Evaluation with BLEU (&lt;a href=&quot;https://blog.machinetranslation.io/compute-bleu-score/&quot;&gt;tutorial&lt;/a&gt;, &lt;a href=&quot;https://github.com/ymoslem/MT-Evaluation&quot;&gt;repository&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;Simple Web UI (&lt;a href=&quot;https://blog.machinetranslation.io/nmt-web-interface/&quot;&gt;tutorial&lt;/a&gt;, &lt;a href=&quot;https://github.com/ymoslem/OpenNMT-Web-Interface&quot;&gt;repository&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;advanced-topics&quot;&gt;Advanced Topics&lt;/h2&gt;
&lt;ul&gt;
  &lt;li&gt;Running TensorBoard with OpenNMT (&lt;a href=&quot;https://blog.machinetranslation.io/TensorBoard/&quot;&gt;tutorial&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;Low-Resource Neural Machine Translation (&lt;a href=&quot;https://blog.machinetranslation.io/low-resource-nmt/&quot;&gt;tutorial&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;Domain Adaptation with Mixed Fine-tuning (&lt;a href=&quot;https://blog.machinetranslation.io/domain-adaptation-mixed-fine-tuning/&quot;&gt;tutorial&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;Overview of Domain Adaptation Techniques (&lt;a href=&quot;https://amtaweb.org/wp-content/uploads/2020/11/NMTDomainAdaptationTechniques.pdf&quot;&gt;tutorial&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;Multilingual Machine Translation (&lt;a href=&quot;https://blog.machinetranslation.io/multilingual-nmt/&quot;&gt;tutorial&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;Using Pre-trained NMT models with CTranslate2 (&lt;a href=&quot;https://gist.github.com/ymoslem/a414a0ead0d3e50f4d7ff7110b1d1c0d&quot;&gt;tutorial&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
</description>
        <pubDate>Tue, 15 Mar 2022 00:00:00 +0000</pubDate>
        <link>https://blog.machinetranslation.io/OpenNMT-tutorial/</link>
        <guid isPermaLink="true">https://blog.machinetranslation.io/OpenNMT-tutorial/</guid>
        
        
        <category>nmt</category>
        
      </item>
    
      <item>
        <title>Mixed Fine-Tuning - Domain Adaptation That Works!</title>
<description>&lt;p&gt;Training a robust generic model is an interesting task. However, when you want to customize your Machine Translation model to adhere to the terminology and style of a certain domain or client, Domain Adaptation comes into play. In previous posts, we discussed several approaches to Domain Adaptation. In this post, we are going to concentrate on a very effective approach called &lt;strong&gt;Mixed Fine-Tuning&lt;/strong&gt;, originally proposed by &lt;a href=&quot;https://aclanthology.org/P17-2061/&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;Chu et al., 2017&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Regular fine-tuning of an NMT model usually consists of two steps:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Building a baseline NMT model, e.g. a generic model.&lt;/li&gt;
  &lt;li&gt;Continuing training the baseline NMT model on an in-domain dataset.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;However, fine-tuning in this way can lead to “catastrophic forgetting”: the model overfits the in-domain data, starts forgetting the information learned from the baseline data, and loses generalization. In practice, compared to the baseline model, the in-domain model would give a better BLEU score and human evaluation for sentences very similar to the in-domain training dataset, but a worse BLEU score for out-of-domain sentences or even new in-domain sentences.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Unlike plain fine-tuning, in the Mixed Fine-Tuning approach (Chu et al., 2017), you randomly sample a portion from the generic data you used to train the baseline model, and use it during the fine-tuning step along with the in-domain dataset. Over-sampling the in-domain data is the main trick.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The training procedure of the Mixed Fine-tuning approach is as follows:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Train a baseline NMT model on out-of-domain data until convergence.&lt;/li&gt;
  &lt;li&gt;Continue training the NMT baseline model on a &lt;em&gt;mix&lt;/em&gt; of in-domain and out-of-domain data (by &lt;em&gt;oversampling&lt;/em&gt; the in-domain data) until convergence.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In NMT tools, such as OpenNMT and MarianMT, &lt;em&gt;dataset weights&lt;/em&gt; can be used to replicate over-sampling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Dataset Counts:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Generic Dataset: 1,000,000 sentences&lt;/li&gt;
  &lt;li&gt;In-domain Dataset: 100,000 sentences&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use weights of 1:10 so that training takes 1 sentence from the bigger generic dataset for every 10 sentences from the smaller in-domain dataset.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Generic Dataset: 1&lt;/li&gt;
  &lt;li&gt;In-domain Dataset: 10&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this example, we sequentially sample 1 example from the “Generic Dataset”, then 10 examples from the “In-domain Dataset”, and so on. By giving the “In-domain Dataset” a higher weight, the model can learn the style and terminology of the in-domain dataset while still being able to generalize, i.e. output high-quality translations for out-of-domain sentences.&lt;/p&gt;

&lt;p&gt;Setting the dataset weights differs from one tool to another. In &lt;a href=&quot;https://opennmt.net/OpenNMT-py/FAQ.html#how-can-i-weight-different-corpora-at-training&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;OpenNMT-py&lt;/a&gt;, dataset weights are set as numbers as in the aforementioned example. In &lt;a href=&quot;https://opennmt.net/OpenNMT-tf/data.html?highlight=weighted#weighted-dataset&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;OpenNMT-tf&lt;/a&gt;, dataset weights are set as ratios.&lt;/p&gt;
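&lt;p&gt;As a tool-agnostic illustration of what such weights do, the following sketch interleaves two toy corpora with a 1:10 weighting. The dataset names and sizes here are purely illustrative; real toolkits handle the sampling internally.&lt;/p&gt;

```python
from itertools import cycle, islice

def weighted_stream(datasets, weights):
    """Yield training examples round-robin, taking `weights[name]`
    examples per pass from each (infinitely cycled) dataset."""
    iterators = {name: cycle(examples) for name, examples in datasets.items()}
    while True:
        for name, weight in weights.items():
            for _ in range(weight):
                yield next(iterators[name])

# Toy corpora mirroring the example above (scaled down 1000x)
datasets = {
    "generic": [f"generic-{i}" for i in range(1000)],
    "in_domain": [f"domain-{i}" for i in range(100)],
}
weights = {"generic": 1, "in_domain": 10}

# Each pass of 11 examples contains 1 generic and 10 in-domain sentences
batch = list(islice(weighted_stream(datasets, weights), 11))
print(sum(s.startswith("domain") for s in batch))  # 10
```

&lt;p&gt;Because the smaller corpus is cycled, the in-domain data is effectively over-sampled until both corpora contribute equally to training.&lt;/p&gt;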

&lt;p&gt;&lt;strong&gt;Further notes on the Mixed Fine-tuning approach&lt;/strong&gt; (feel free to experiment with something different, though!)&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;The approach works well for in-domain datasets between 50k and 500k sentences. For very small in-domain datasets, this approach might &lt;em&gt;not&lt;/em&gt; work well; for bigger in-domain datasets, you might want to try different weights; and for very big in-domain datasets, you can just use the in-domain dataset alone, enriching it with missing aspects like shorter sentences, if needed.&lt;/li&gt;
  &lt;li&gt;If your baseline training data is too big, you can randomly extract a generic sample about 10 times the size of the in-domain data to use in the mix.&lt;/li&gt;
  &lt;li&gt;If both the generic and in-domain data are available before training the baseline, we build the vocabulary and SentencePiece models on all datasets, both generic and in-domain datasets.&lt;/li&gt;
  &lt;li&gt;During fine-tuning, we extract a dev/validation dataset from the in-domain dataset only.&lt;/li&gt;
  &lt;li&gt;After fine-tuning, we use two test datasets, one that we used for the out-of-domain baseline, and one extracted from the in-domain dataset, to make sure the model works in both cases.&lt;/li&gt;
  &lt;li&gt;To alleviate “catastrophic forgetting” on generic data, consider averaging the baseline model with the fine-tuned model.&lt;/li&gt;
&lt;/ul&gt;
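&lt;p&gt;The model-averaging trick in the last note is just an element-wise mean over the two models’ parameters. Below is a minimal numeric sketch with toy dictionaries; real toolkits (e.g. OpenNMT) ship checkpoint-averaging utilities that operate on actual tensors.&lt;/p&gt;

```python
def average_checkpoints(baseline, fine_tuned, alpha=0.5):
    """Element-wise weighted mean of two flat parameter dicts;
    alpha is the weight given to the baseline model."""
    assert baseline.keys() == fine_tuned.keys()
    return {
        name: [alpha * b + (1 - alpha) * f
               for b, f in zip(baseline[name], fine_tuned[name])]
        for name in baseline
    }

# Toy checkpoints: parameter name to flat list of weights
baseline   = {"encoder.w": [0.0, 2.0], "decoder.w": [1.0, 0.0]}
fine_tuned = {"encoder.w": [1.0, 4.0], "decoder.w": [0.0, 2.0]}

print(average_checkpoints(baseline, fine_tuned))
# {'encoder.w': [0.5, 3.0], 'decoder.w': [0.5, 1.0]}
```

&lt;p&gt;Pulling the averaged parameters halfway back towards the baseline is what softens the “catastrophic forgetting” effect on generic data.&lt;/p&gt;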

&lt;p&gt;One advantage of the Mixed Fine-tuning approach is that the fine-tuned in-domain NMT model still works well on both unseen in-domain data and generic/out-of-domain data. Moreover, the approach can be fully automated (e.g. for various clients) once you verify it for your use cases.&lt;/p&gt;

&lt;p&gt;It is worth mentioning that we have successfully applied the Mixed Fine-Tuning approach, proposed by Chu et al. (2017), in production-level scenarios in the industry. We also employed it in a number of our Domain Adaptation and Low-Resource NMT papers such as &lt;a href=&quot;https://aclanthology.org/2020.icon-adapmt.4/&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;Haque et al. (2020)&lt;/a&gt; in combination with other approaches, through which we achieved the first place at ICON 2020 shared task, as well as &lt;a href=&quot;https://aclanthology.org/2022.amta-research.2/&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;Moslem et al. (2022)&lt;/a&gt; where we used synthetic in-domain data.&lt;/p&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Axelrod, A., He, X., &amp;amp; Gao, J. (2011). Domain Adaptation via Pseudo In-Domain Data Selection. &lt;em&gt;Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing&lt;/em&gt;, 355–362. &lt;a href=&quot;https://aclanthology.org/D11-1033&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://aclanthology.org/D11-1033&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Chinea-Ríos, M., Peris, Á., &amp;amp; Casacuberta, F. (2017). Adapting Neural Machine Translation with Parallel Synthetic Data. &lt;em&gt;Proceedings of the Second Conference on Machine Translation&lt;/em&gt;, 138–147. &lt;a href=&quot;https://doi.org/10.18653/v1/W17-4714&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://doi.org/10.18653/v1/W17-4714&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Chu, C., Dabre, R., &amp;amp; Kurohashi, S. (2017). An Empirical Comparison of Domain Adaptation Methods for Neural Machine Translation. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 385–391. &lt;a href=&quot;https://doi.org/10.18653/v1/P17-2061&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://doi.org/10.18653/v1/P17-2061&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Freitag, M., &amp;amp; Al-Onaizan, Y. (2016). Fast Domain Adaptation for Neural Machine Translation. In &lt;em&gt;arXiv [cs.CL]&lt;/em&gt;. arXiv. &lt;a href=&quot;http://arxiv.org/abs/1612.06897&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;http://arxiv.org/abs/1612.06897&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Haque, R., Moslem, Y., &amp;amp; Way, A. (2020). Terminology-Aware Sentence Mining for NMT Domain Adaptation: ADAPT’s Submission to the Adap-MT 2020 English-to-Hindi AI Translation Shared Task. Proceedings of the 17th International Conference on Natural Language Processing (ICON): Adap-MT 2020 Shared Task, 17–23. &lt;a href=&quot;https://aclanthology.org/2020.icon-adapmt.4&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://aclanthology.org/2020.icon-adapmt.4&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Kobus, C., Crego, J., &amp;amp; Senellart, J. (2017). Domain Control for Neural Machine Translation. &lt;em&gt;Proceedings of Recent Advances in Natural Language Processing&lt;/em&gt;, 372–378. &lt;a href=&quot;http://arxiv.org/abs/1612.06140&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;http://arxiv.org/abs/1612.06140&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Luong, M.-T., &amp;amp; Manning, C. 2015. Stanford neural machine translation systems for spoken language domains. &lt;em&gt;Proceedings of the 12th International Workshop on Spoken Language Translation: Evaluation Campaign&lt;/em&gt;, 76–79. &lt;a href=&quot;https://aclanthology.org/2015.iwslt-evaluation.11&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://aclanthology.org/2015.iwslt-evaluation.11&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Moslem, Y., Haque, R., Kelleher, J., &amp;amp; Way, A. (2022). Domain-Specific Text Generation for Machine Translation. &lt;em&gt;Proceedings of the 15th Biennial Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)&lt;/em&gt;, 14–30. &lt;a href=&quot;https://aclanthology.org/2022.amta-research.2&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://aclanthology.org/2022.amta-research.2&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Moslem, Y. (2024). Language Modelling Approaches to Adaptive Machine Translation. In &lt;em&gt;arXiv [cs.CL]&lt;/em&gt;. arXiv. &lt;a href=&quot;http://arxiv.org/abs/2401.14559&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;http://arxiv.org/abs/2401.14559&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Saunders, D. (2022). Domain Adaptation and Multi-Domain Adaptation for Neural Machine Translation: A Survey. &lt;em&gt;Journal of Artificial Intelligence Research&lt;/em&gt;, &lt;em&gt;75&lt;/em&gt;, 351–424. &lt;a href=&quot;https://doi.org/10.1613/jair.1.13566&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://doi.org/10.1613/jair.1.13566&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Sennrich, R., Haddow, B., &amp;amp; Birch, A. (2016a). Controlling Politeness in Neural Machine Translation via Side Constraints. &lt;em&gt;Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies&lt;/em&gt;, 35–40. &lt;a href=&quot;https://doi.org/10.18653/v1/N16-1005&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://doi.org/10.18653/v1/N16-1005&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Sennrich, R., Haddow, B., &amp;amp; Birch, A. (2016b). Improving Neural Machine Translation Models with Monolingual Data. &lt;em&gt;Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)&lt;/em&gt;, 86–96. &lt;a href=&quot;https://doi.org/10.18653/v1/P16-1009&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://doi.org/10.18653/v1/P16-1009&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

</description>
        <pubDate>Thu, 06 Jan 2022 00:00:00 +0000</pubDate>
        <link>https://blog.machinetranslation.io/domain-adaptation-mixed-fine-tuning/</link>
        <guid isPermaLink="true">https://blog.machinetranslation.io/domain-adaptation-mixed-fine-tuning/</guid>
        
        
        <category>nmt</category>
        
      </item>
    
      <item>
        <title>Notes on Multilingual Machine Translation</title>
<description>&lt;p&gt;Multilingual NMT is distinguished by its scalability: a single model can translate between any number of languages, instead of having to build individual bilingual models. MNMT systems are also desirable because training models with data from diverse language pairs might help a low-resource language acquire extra knowledge from other languages. Moreover, MNMT systems tend to generalize better due to exposure to diverse languages, leading to improved translation quality compared to bilingual NMT systems. This phenomenon is known as translation Transfer Learning, or Knowledge Transfer (Dabre et al., 2020).&lt;/p&gt;

&lt;h2 id=&quot;tips-for-training-multilingual-nmt-models&quot;&gt;Tips for training multilingual NMT models&lt;/h2&gt;

&lt;p&gt;Building a multilingual MT system that translates between several languages is straightforward: merge all the datasets, with a token at the start of each source sentence indicating the target language. Here is an illustration of how your data should look. Afterwards, it is recommended to shuffle your dataset.&lt;/p&gt;

&lt;hr /&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Source&lt;/th&gt;
      &lt;th&gt;Target&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&amp;lt;ar&amp;gt; Thank you very much&lt;/td&gt;
      &lt;td&gt;شكرا جزيلا&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&amp;lt;es&amp;gt; Thank you very much&lt;/td&gt;
      &lt;td&gt;Muchas gracias&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&amp;lt;fr&amp;gt; Thank you very much&lt;/td&gt;
      &lt;td&gt;Merci beaucoup&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&amp;lt;hi&amp;gt; Thank you very much&lt;/td&gt;
      &lt;td&gt;आपका बहुत बहुत धन्यवाद&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&amp;lt;ar&amp;gt; आपका बहुत बहुत धन्यवाद          &lt;/td&gt;
      &lt;td&gt;شكرا جزيلا&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&amp;lt;en&amp;gt; आपका बहुत बहुत धन्यवाद&lt;/td&gt;
      &lt;td&gt;Thank you very much&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&amp;lt;es&amp;gt; आपका बहुत बहुत धन्यवाद&lt;/td&gt;
      &lt;td&gt;Muchas gracias&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&amp;lt;fr&amp;gt; आपका बहुत बहुत धन्यवाद&lt;/td&gt;
      &lt;td&gt;Merci beaucoup&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&amp;lt;ar&amp;gt; Muchas gracias&lt;/td&gt;
      &lt;td&gt;شكرا جزيلا&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&amp;lt;en&amp;gt; Muchas gracias&lt;/td&gt;
      &lt;td&gt;Thank you very much&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&amp;lt;fr&amp;gt; Muchas gracias&lt;/td&gt;
      &lt;td&gt;Merci beaucoup&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&amp;lt;hi&amp;gt; Muchas gracias&lt;/td&gt;
      &lt;td&gt;आपका बहुत बहुत धन्यवाद&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&amp;lt;en&amp;gt; شكرا جزيلا&lt;/td&gt;
      &lt;td&gt;Thank you very much&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&amp;lt;es&amp;gt; شكرا جزيلا&lt;/td&gt;
      &lt;td&gt;Muchas gracias&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&amp;lt;fr&amp;gt; شكرا جزيلا&lt;/td&gt;
      &lt;td&gt;Merci beaucoup&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&amp;lt;hi&amp;gt; شكرا جزيلا&lt;/td&gt;
      &lt;td&gt;आपका बहुत बहुत धन्यवाद&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&amp;lt;ar&amp;gt; Merci beaucoup&lt;/td&gt;
      &lt;td&gt;شكرا جزيلا&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&amp;lt;en&amp;gt; Merci beaucoup&lt;/td&gt;
      &lt;td&gt;Thank you very much&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&amp;lt;es&amp;gt; Merci beaucoup&lt;/td&gt;
      &lt;td&gt;Muchas gracias&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&amp;lt;hi&amp;gt; Merci beaucoup&lt;/td&gt;
      &lt;td&gt;आपका बहुत बहुत धन्यवाद&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;hr /&gt;

&lt;p&gt;There are a few important points to take into consideration while building multilingual models:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;If the data is clearly unbalanced, e.g. you have 75 million sentences for Spanish and only 15 million sentences for Portuguese, you have to balance it; otherwise, you would end up with a system that translates Spanish better than Portuguese. This technique is called over-sampling (or up-sampling). The usual way to achieve it in NMT toolkits is by giving &lt;strong&gt;weights&lt;/strong&gt; to your datasets. In this example, the Spanish dataset can take a weight of 1 while the Portuguese dataset can take a weight of 5, because the Spanish dataset is 5 times larger than the Portuguese dataset.&lt;/li&gt;
  &lt;li&gt;Some papers suggest adding a special token to the start of each sentence. For example, you can start Spanish sentences with the token &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;es&amp;gt;&lt;/code&gt; and Portuguese sentences with the token &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;pt&amp;gt;&lt;/code&gt;. In this case, you will have to add these tokens to your SentencePiece model through the option &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--user_defined_symbols&lt;/code&gt;. However, some researchers believe this step is optional.&lt;/li&gt;
  &lt;li&gt;Multilingual NMT models are more useful for low-resource languages than they are for rich-resource languages. Still, low-resource languages that share some linguistic characteristics with rich-resource languages can benefit from coexistence in one multilingual model. In this sense, multilingual NMT can be considered one of the “Transfer Learning” approaches (Tars et al., 2021; Ding et al., 2021).&lt;/li&gt;
  &lt;li&gt;Languages that do not share the same alphabet cannot achieve the same linguistic benefits from a multilingual NMT model. Still, researchers investigate approaches like &lt;em&gt;transliteration&lt;/em&gt; to increase knowledge transfer between languages that belong to the same language family, but use different alphabets. For example, using this &lt;em&gt;transliteration&lt;/em&gt; trick, my &lt;a href=&quot;https://www.machinetranslation.io/&quot;&gt;Indic-to-English multilingual NMT model&lt;/a&gt; can translate from 10 Indic languages to English.&lt;/li&gt;
  &lt;li&gt;Integrating other data augmentation approaches like &lt;a href=&quot;https://blog.machinetranslation.io/low-resource-nmt/&quot;&gt;Back-Translation&lt;/a&gt; can still be useful.&lt;/li&gt;
&lt;/ul&gt;
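&lt;p&gt;The data preparation illustrated in the table above, i.e. prepending a target-language token to each source sentence, merging the corpora, and shuffling, can be sketched as follows. The function name and in-memory corpora are illustrative; in practice you would stream files from disk.&lt;/p&gt;

```python
import random

def build_multilingual_corpus(corpora, seed=42):
    """Merge parallel corpora into one multilingual dataset,
    prepending a target-language token to every source sentence.
    `corpora` maps a target-language code to (source, target) pairs."""
    merged = [
        (f"<{tgt_lang}> {src}", tgt)
        for tgt_lang, pairs in corpora.items()
        for src, tgt in pairs
    ]
    random.Random(seed).shuffle(merged)  # shuffle, as recommended above
    return merged

corpora = {
    "es": [("Thank you very much", "Muchas gracias")],
    "fr": [("Thank you very much", "Merci beaucoup")],
}
for src, tgt in build_multilingual_corpus(corpora):
    print(src, "->", tgt)  # e.g. <es> Thank you very much -> Muchas gracias
```

&lt;p&gt;Remember that any language token you introduce this way should also be passed to SentencePiece via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--user_defined_symbols&lt;/code&gt; so it is kept as a single piece.&lt;/p&gt;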

&lt;h2 id=&quot;using-pre-trained-nmt-models&quot;&gt;Using pre-trained NMT models&lt;/h2&gt;

&lt;p&gt;What about pre-trained multilingual NMT models like mBART (Liu et al., 2020) and M2M-100 (Fan et al., 2020); when should you use them? The short answer is: for low-resource languages (e.g. from a few thousand to a few million sentence pairs, up to 15M), using mBART directly or fine-tuning it can give better results. For high-resource languages, training a baseline model from scratch can outperform mBART. Then, applying mixed fine-tuning (Chu et al., 2017) to this new baseline using in-house data can achieve even better gains in Machine Translation quality. Check this &lt;a href=&quot;https://gist.github.com/ymoslem/d85b55d2182cfd2ab5d08bed6c63c713&quot;&gt;code snippet&lt;/a&gt; if you would like to try mBART. You can also convert the M2M-100 model to the CTranslate2 format for better efficiency, as explained &lt;a href=&quot;https://gist.github.com/ymoslem/a414a0ead0d3e50f4d7ff7110b1d1c0d&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;References:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://dl.acm.org/doi/abs/10.1145/3406095&quot;&gt;A Survey of Multilingual Neural Machine Translation, Dabre et al., 2020&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://aclanthology.org/2020.tacl-1.47/&quot;&gt;Multilingual Denoising Pre-training for Neural Machine Translation, Liu et al., 2020&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/2109.10465&quot;&gt;Scalable and Efficient MoE Training for Multitask Multilingual Models, Kim et al., 2021&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://aclanthology.org/2021.nodalida-main.5/&quot;&gt;Extremely low-resource machine translation for closely related languages, Tars et al., 2021&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://aclanthology.org/2021.emnlp-main.263/&quot;&gt;Improving Neural Machine Translation by Bidirectional Training, Ding et al., 2021&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        <pubDate>Sat, 04 Dec 2021 00:00:00 +0000</pubDate>
        <link>https://blog.machinetranslation.io/multilingual-nmt/</link>
        <guid isPermaLink="true">https://blog.machinetranslation.io/multilingual-nmt/</guid>
        
        
        <category>nmt</category>
        
      </item>
    
      <item>
        <title>Low-Resource Neural Machine Translation</title>
<description>&lt;p&gt;Developing Neural Machine Translation (NMT) models for low-resource languages is an active research topic, both in the industry and academia. In this tutorial, we are going to discuss &lt;strong&gt;tagged back-translation&lt;/strong&gt; as one of the most effective and efficient approaches to training more robust models. Tagged back-translation is not only useful for low-resource languages, but also for other scenarios of data sparsity.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;strong&gt;Table of Contents:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#tagged-back-translation&quot;&gt;Tagged Back-Translation&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#lower-casing-vs-true-casing&quot;&gt;Lower-Casing vs. True-Casing&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#sub-wording-to-avoid-unknowns&quot;&gt;Sub-wording to Avoid Unknowns&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#shared-vocab-vs-separate-vocab&quot;&gt;Shared Vocab vs. Separate Vocab&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#crawled-data&quot;&gt;Crawled Data&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#transfer-learning&quot;&gt;Transfer Learning&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#references&quot;&gt;References&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;tagged-back-translation&quot;&gt;Tagged Back-Translation&lt;/h2&gt;

&lt;p&gt;This approach augments the available parallel training data with synthetic data that reflects the domain and purpose of the model. Several researchers, including Edunov et al. (2018) and Caswell et al. (2019), have shown that tagged back-translation is very helpful when training NMT models for low-resource languages. Moreover, it can be helpful for rich-resource languages by enriching datasets with specific linguistic features.&lt;/p&gt;

&lt;p&gt;Assuming we want to train an English-to-Hindi NMT model, the Tagged Back-Translation data augmentation technique consists of the following steps:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;For an English-to-Hindi model, train another Hindi-to-English model (i.e. in the other direction), using publicly available data from &lt;a href=&quot;https://opus.nlpl.eu/&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;OPUS&lt;/a&gt;;&lt;/li&gt;
  &lt;li&gt;Select monolingual data in Hindi publicly available (e.g. at &lt;a href=&quot;https://oscar-corpus.com/&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;OSCAR&lt;/a&gt;), which must have domains and linguistic features similar to the potential texts to be translated;&lt;/li&gt;
  &lt;li&gt;Use the Hindi-to-English model to create a synthetic dataset, by translating the Hindi monolingual data into English. Note here that only the English side (the source for EN-HI) is MTed while the Hindi side (the target for EN-HI) is human-generated text;&lt;/li&gt;
  &lt;li&gt;Consider using one of the available Quality Estimation tools such as &lt;a href=&quot;https://github.com/TharinduDR/TransQuest&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;TransQuest&lt;/a&gt; (Ranasinghe et al., 2020) or &lt;a href=&quot;https://github.com/Unbabel/OpenKiwi&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;OpenKiwi&lt;/a&gt; (Kepler et al., 2019) to filter out back-translations of low quality;&lt;/li&gt;
  &lt;li&gt;Add a special tag like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;BT&amp;gt;&lt;/code&gt; to the start of the MTed segments;&lt;/li&gt;
  &lt;li&gt;Build the vocabulary on all the data, both the original and the synthetic datasets;&lt;/li&gt;
  &lt;li&gt;Augment the original English-to-Hindi training dataset with the synthetic dataset;&lt;/li&gt;
  &lt;li&gt;Train a new English-to-Hindi model using the dataset generated from the previous step.&lt;/li&gt;
&lt;/ol&gt;
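&lt;p&gt;Steps 5 and 7 above, tagging the MTed source segments and merging them with the authentic parallel data, can be sketched as follows (the function name and toy sentence pairs are illustrative):&lt;/p&gt;

```python
def tag_and_merge(authentic, back_translated, tag="<BT>"):
    """Prepend `tag` to each machine-translated source segment and
    merge the synthetic pairs with the authentic parallel data."""
    tagged = [(f"{tag} {src}", tgt) for src, tgt in back_translated]
    return authentic + tagged

# (source, target) pairs; the synthetic source side is MTed English,
# while its target side is the original human-written Hindi
authentic = [("How are you?", "आप कैसे हैं?")]
back_translated = [("This is a synthetic English sentence.", "यह एक वास्तविक हिंदी वाक्य है।")]

corpus = tag_and_merge(authentic, back_translated)
print(corpus[1][0])  # <BT> This is a synthetic English sentence.
```

&lt;p&gt;The tag lets the model distinguish synthetic from authentic source sentences at training time, which is the core idea of Caswell et al. (2019).&lt;/p&gt;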

&lt;p&gt;For low-resource languages like Hindi, Haque et al. (2020) showed that the technique works well with 1:1 synthetic to original data. Still, you can experiment with different portions, especially for language pairs of richer resources.&lt;/p&gt;

&lt;p&gt;As demonstrated by Hoang et al. (2018), iterative back-translation for 2-3 runs can improve the quality further. Now, as you have a better Hindi-to-English model, back-translate English monolingual data to train a new version of the English-to-Hindi model. After that, use the new English-to-Hindi model to back-translate the same Hindi monolingual dataset you used for the first run to create a new version of the Hindi-to-English model. The idea here is that you are using a better model to translate the same monolingual data, i.e. without any increase or change, which should result in a better NMT model. Interestingly, you can use both NMT and phrase-based SMT models for back-translation, and then train or fine-tune your baseline NMT system in the required language direction.&lt;/p&gt;

&lt;p&gt;Popel et al. (2020) explored the effect of block back-translation, where the training data are presented to the neural network in blocks of authentic parallel data alternated with blocks of synthetic data.&lt;/p&gt;

&lt;h2 id=&quot;lower-casing-vs-true-casing&quot;&gt;Lower-Casing vs. True-Casing&lt;/h2&gt;

&lt;p&gt;For low-resource languages, I prefer lower-casing the data. However, in real-life scenarios, or if you are submitting a paper, you are usually required to produce the translation in true case, so you can train a truecaser, or use Sacremoses’ &lt;a href=&quot;https://github.com/alvations/sacremoses#truecaser&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;truecaser&lt;/a&gt; for English.&lt;/p&gt;

&lt;h2 id=&quot;sub-wording-to-avoid-unknowns&quot;&gt;Sub-wording to Avoid Unknowns&lt;/h2&gt;

&lt;p&gt;To avoid out-of-vocabulary tokens, it is recommended to train your NMT model on subwords instead of whole words. Subwording (e.g. with a BPE or unigram model) is recommended for any type of machine translation model, regardless of whether it is for a low-resource or rich-resource language pair. Among the most popular subwording tools is &lt;a href=&quot;https://github.com/google/sentencepiece&quot;&gt;SentencePiece&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you used &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;BT&amp;gt;&lt;/code&gt;, for example, as the back-translation token, you have to add it to the SentencePiece model using the option &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--user_defined_symbols&lt;/code&gt; during training. The same option can be useful for adding any other special tokens found in your training data, such as tags and non-Latin numbers.&lt;/p&gt;

&lt;p&gt;Consider also using the following SentencePiece options:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--input_sentence_size&lt;/code&gt; to cap the number of sentences the trainer loads, which is useful for very large corpora;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--shuffle_input_sentence&lt;/code&gt; to shuffle the dataset;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--split_by_number&lt;/code&gt; to split tokens by numbers (0-9); and&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--byte_fallback&lt;/code&gt; to decompose unknown pieces into UTF-8 byte pieces.&lt;/li&gt;
&lt;/ul&gt;
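&lt;p&gt;Putting these options together, a SentencePiece training command might look like the following sketch. The file names, vocabulary size, and model type are illustrative; adjust them to your data.&lt;/p&gt;

```shell
spm_train \
  --input=train.source-target.txt \
  --model_prefix=spm.joint \
  --vocab_size=32000 \
  --model_type=bpe \
  --user_defined_symbols="<BT>" \
  --input_sentence_size=10000000 \
  --shuffle_input_sentence=true \
  --split_by_number=true \
  --byte_fallback=true
```

&lt;p&gt;The resulting &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;spm.joint.model&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;spm.joint.vocab&lt;/code&gt; files can then be used to subword your training, validation, and test data.&lt;/p&gt;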

&lt;h2 id=&quot;shared-vocab-vs-separate-vocab&quot;&gt;Shared Vocab vs. Separate Vocab&lt;/h2&gt;

&lt;p&gt;If the source and target languages share some vocabulary, e.g. with similar languages or code switching, using a shared vocabulary might help. Using a shared vocabulary involves two steps:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Training a SentencePiece model on all datasets for both languages;&lt;/li&gt;
  &lt;li&gt;Using shared vocab instead of separate vocabs while training the NMT model.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;crawled-data&quot;&gt;Crawled Data&lt;/h2&gt;

&lt;p&gt;Currently, OPUS includes some datasets that are crawled from bilingual websites, with sentence pairs matched using multilingual similarity tools such as &lt;a href=&quot;https://github.com/facebookresearch/LASER&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;LASER&lt;/a&gt;, &lt;a href=&quot;https://github.com/bojone/labse&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;LaBSE&lt;/a&gt;, and &lt;a href=&quot;https://github.com/tensorflow/hub/blob/master/examples/colab/cross_lingual_similarity_with_tf_hub_multilingual_universal_encoder.ipynb&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;m-USE&lt;/a&gt;. However, according to Kreutzer et al. (2022), crawled datasets suffer from quality issues that can affect the resulting NMT models. Hence, it is important to try filtering them before use, and maybe to exclude them from initial baselines.&lt;/p&gt;

&lt;h2 id=&quot;transfer-learning&quot;&gt;Transfer Learning&lt;/h2&gt;

&lt;p&gt;Instead of training a model from scratch, you can apply transfer learning: take a multilingual model like mBART-50, M2M-100, or NLLB-200, and fine-tune it on your dataset. Unidirectional pre-trained models can be used as well (e.g. &lt;a href=&quot;https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;OPUS-MT&lt;/a&gt;). If your low-resource language is similar to languages supported by such models, it can benefit from shared linguistic features. Back-translation can be used here as well to augment the authentic dataset.&lt;/p&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;
&lt;ul&gt;
  &lt;li&gt;Caswell, I., Chelba, C., &amp;amp; Grangier, D. (2019). Tagged Back-Translation. Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers), 53–63. &lt;a href=&quot;https://doi.org/10.18653/v1/W19-5206&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://doi.org/10.18653/v1/W19-5206&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Edunov, S., Ott, M., Auli, M., &amp;amp; Grangier, D. (2018). Understanding Back-Translation at Scale. &lt;em&gt;Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing&lt;/em&gt;, 489–500. &lt;a href=&quot;https://doi.org/10.18653/v1/D18-1045&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://doi.org/10.18653/v1/D18-1045&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Gebauer, P., Bojar, O., Švandelík, V., &amp;amp; Popel, M. (2021). CUNI Systems in WMT21: Revisiting Backtranslation Techniques for English-Czech NMT. &lt;em&gt;Proceedings of the Sixth Conference on Machine Translation&lt;/em&gt;, 123–129. &lt;a href=&quot;https://aclanthology.org/2021.wmt-1.7&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://aclanthology.org/2021.wmt-1.7&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Haque, R., Moslem, Y., &amp;amp; Way, A. (2020). Terminology-Aware Sentence Mining for NMT Domain Adaptation: ADAPT’s Submission to the Adap-MT 2020 English-to-Hindi AI Translation Shared Task. &lt;em&gt;Proceedings of the 17th International Conference on Natural Language Processing (ICON): Adap-MT 2020 Shared Task&lt;/em&gt;, 17–23. &lt;a href=&quot;https://aclanthology.org/2020.icon-adapmt.4&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://aclanthology.org/2020.icon-adapmt.4&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Hoang, V. C. D., Koehn, P., Haffari, G., &amp;amp; Cohn, T. (2018). Iterative Back-Translation for Neural Machine Translation. &lt;em&gt;Proceedings of the 2nd Workshop on Neural Machine Translation and Generation&lt;/em&gt;, 18–24. &lt;a href=&quot;https://doi.org/10.18653/v1/W18-2703&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://doi.org/10.18653/v1/W18-2703&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Kreutzer, J., Caswell, I., Wang, L., Wahab, A., van Esch, D., Ulzii-Orshikh, N., Tapo, A., Subramani, N., Sokolov, A., Sikasote, C., Setyawan, M., Sarin, S., Samb, S., Sagot, B., Rivera, C., Rios, A., Papadimitriou, I., Osei, S., Suarez, P. O., … Adeyemi, M. (2022). Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets. &lt;em&gt;Transactions of the Association for Computational Linguistics&lt;/em&gt;, &lt;em&gt;10&lt;/em&gt;, 50–72. &lt;a href=&quot;https://doi.org/10.1162/tacl_a_00447&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://doi.org/10.1162/tacl_a_00447&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Popel, M., Tomkova, M., Tomek, J., Kaiser, Ł., Uszkoreit, J., Bojar, O., &amp;amp; Žabokrtský, Z. (2020). Transforming machine translation: a deep learning system reaches news translation quality comparable to human professionals. &lt;em&gt;Nature Communications&lt;/em&gt;, &lt;em&gt;11&lt;/em&gt;(1), 4381. &lt;a href=&quot;https://doi.org/10.1038/s41467-020-18073-9&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://doi.org/10.1038/s41467-020-18073-9&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Ramírez-Sánchez, G., Zaragoza-Bernabeu, J., Bañón, M., &amp;amp; Rojas, S. O. (2020). Bifixer and Bicleaner: two open-source tools to clean your parallel data. &lt;em&gt;Proceedings of the 22nd Annual Conference of the European Association for Machine Translation&lt;/em&gt;, 291–298. &lt;a href=&quot;https://aclanthology.org/2020.eamt-1.31/&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://aclanthology.org/2020.eamt-1.31/&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Sennrich, R., Haddow, B., &amp;amp; Birch, A. (2016). Improving Neural Machine Translation Models with Monolingual Data. &lt;em&gt;Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)&lt;/em&gt;, 86–96. &lt;a href=&quot;https://doi.org/10.18653/v1/P16-1009&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://doi.org/10.18653/v1/P16-1009&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        <pubDate>Sat, 25 Sep 2021 00:00:00 +0000</pubDate>
        <link>https://blog.machinetranslation.io/low-resource-nmt/</link>
        <guid isPermaLink="true">https://blog.machinetranslation.io/low-resource-nmt/</guid>
        
        
        <category>nmt</category>
        
      </item>
    
      <item>
        <title>Web Interface for Machine Translation</title>
        <description>&lt;p&gt;Today, we will create a very simple &lt;strong&gt;Machine Translation (MT) Web Interface&lt;/strong&gt; for &lt;em&gt;OpenNMT-py&lt;/em&gt;, &lt;em&gt;OpenNMT-tf&lt;/em&gt; and &lt;em&gt;FairSeq&lt;/em&gt; models using &lt;em&gt;CTranslate2&lt;/em&gt; and &lt;em&gt;Streamlit&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Previously, there were other tutorials on how to use a &lt;a href=&quot;https://forum.opennmt.net/t/simple-opennmt-py-rest-server/1392&quot;&gt;simple server&lt;/a&gt; and &lt;a href=&quot;https://github.com/ymoslem/OpenNMT-GUI&quot;&gt;web interface with Flask&lt;/a&gt;. However, today’s tutorial is for those who want to create an ultra-simple, quick demo.&lt;/p&gt;

&lt;p&gt;We also aim to highlight that &lt;em&gt;CTranslate2&lt;/em&gt; is now the way to go for serving OpenNMT models due to its exceptional performance. You can use it in a simple way, as we do here, or integrate it into a REST API for more advanced use cases.&lt;/p&gt;

&lt;p&gt;So let’s start…&lt;/p&gt;

&lt;hr /&gt;
&lt;p&gt;&lt;strong&gt;Table of Contents:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#objective-simple-machine-translation-web-interface&quot;&gt;Objective: Simple Machine Translation Web Interface&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#install-requirements&quot;&gt;Install Requirements&lt;/a&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#optional-create-and-activate-a-virtual-environment&quot;&gt;Optional: Create and Activate a Virtual Environment&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#install-required-libraries&quot;&gt;Install Required Libraries&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#convert-model-to-ctranslate2&quot;&gt;Convert Model to CTranslate2&lt;/a&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#ctranslate2-python-sample&quot;&gt;CTranslate2 Python Sample&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#create-your-app&quot;&gt;Create Your App&lt;/a&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#test-app&quot;&gt;Test App&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#translation-app&quot;&gt;Translation App&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#add-language-pairs&quot;&gt;Add Language Pairs&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#full-code&quot;&gt;Full Code&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#next-steps&quot;&gt;Next Steps&lt;/a&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#streamlit-components&quot;&gt;Streamlit Components&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#deployment&quot;&gt;Deployment&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;objective-simple-machine-translation-web-interface&quot;&gt;Objective: Simple Machine Translation Web Interface&lt;/h2&gt;

&lt;p&gt;Our objective is to develop a simple web interface for Machine Translation like this one.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../static/img/streamlit-translate-gui.png&quot; alt=&quot;streamlit-translate-gui&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;install-requirements&quot;&gt;Install Requirements&lt;/h2&gt;

&lt;h3 id=&quot;optional-create-and-activate-a-virtual-environment&quot;&gt;Optional: Create and Activate a Virtual Environment&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Install &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;virtualenv&lt;/code&gt;:
    &lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pip3 &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;virtualenv
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;Create a virtual environment, e.g. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;myvenv&lt;/code&gt;:
    &lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;virtualenv myvenv &lt;span class=&quot;nt&quot;&gt;--python&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;python3
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;Activate the virtual environment:
    &lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nb&quot;&gt;source &lt;/span&gt;myvenv/bin/activate
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;install-required-libraries&quot;&gt;Install Required Libraries&lt;/h3&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pip3 &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;ctranslate2 sentencepiece streamlit watchdog nltk
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;convert-model-to-ctranslate2&quot;&gt;Convert Model to CTranslate2&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/OpenNMT/CTranslate2&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;CTranslate2&lt;/a&gt; supports both &lt;em&gt;OpenNMT-py&lt;/em&gt; and &lt;em&gt;OpenNMT-tf&lt;/em&gt; models. As of version 2.0, it also supports &lt;em&gt;FairSeq&lt;/em&gt; models. However, you need to convert your model to the &lt;em&gt;CTranslate2&lt;/em&gt; format before using it.&lt;/p&gt;

&lt;p&gt;The following commands are simply copied from the &lt;em&gt;CTranslate2&lt;/em&gt; repository, and tested to make sure they are up-to-date. This example uses pre-trained Transformer English-German models. If you trained your own model, run the same commands on it instead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For an OpenNMT-py model:&lt;/strong&gt;&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pip3 &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;OpenNMT-py

wget https://s3.amazonaws.com/opennmt-models/transformer-ende-wmt-pyOnmt.tar.gz
&lt;span class=&quot;nb&quot;&gt;tar &lt;/span&gt;xf transformer-ende-wmt-pyOnmt.tar.gz

ct2-opennmt-py-converter &lt;span class=&quot;nt&quot;&gt;--model_path&lt;/span&gt; averaged-10-epoch.pt &lt;span class=&quot;nt&quot;&gt;--output_dir&lt;/span&gt; ende_ctranslate2 &lt;span class=&quot;nt&quot;&gt;--quantization&lt;/span&gt; int8
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;For an OpenNMT-tf model:&lt;/strong&gt;&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pip3 &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;OpenNMT-tf

wget https://s3.amazonaws.com/opennmt-models/averaged-ende-ckpt500k-v2.tar.gz
&lt;span class=&quot;nb&quot;&gt;tar &lt;/span&gt;xf averaged-ende-ckpt500k-v2.tar.gz

ct2-opennmt-tf-converter &lt;span class=&quot;nt&quot;&gt;--model_path&lt;/span&gt; averaged-ende-ckpt500k-v2 &lt;span class=&quot;nt&quot;&gt;--output_dir&lt;/span&gt; ende_ctranslate2 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--src_vocab&lt;/span&gt; averaged-ende-ckpt500k-v2/wmtende.vocab &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--tgt_vocab&lt;/span&gt; averaged-ende-ckpt500k-v2/wmtende.vocab &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--model_type&lt;/span&gt; TransformerBase &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--quantization&lt;/span&gt; int8
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;For a FairSeq model:&lt;/strong&gt;&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;ct2-fairseq-converter &lt;span class=&quot;nt&quot;&gt;--model_path&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$MODEL&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--data_dir&lt;/span&gt; dict &lt;span class=&quot;nt&quot;&gt;--fixed_dictionary&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$DICT&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--output_dir&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$OUTPUT&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--quantization&lt;/span&gt; int8
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;As you can see, we used the option &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--quantization int8&lt;/code&gt; to reduce the model size and improve inference speed.&lt;/p&gt;
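&lt;p&gt;For intuition, int8 quantization stores each weight as an 8-bit integer together with a scale factor, which is why it shrinks the model to roughly a quarter of its float32 size. The toy sketch below illustrates the general idea only; it is not CTranslate2’s exact quantization scheme:&lt;/p&gt;

```python
# Toy symmetric int8 quantization of a list of float weights.
# Illustrative only: CTranslate2's actual per-layer scheme differs.

def quantize_int8(weights):
    """Map floats to integers in [-127, 127] plus a scale factor."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integers."""
    return [x * scale for x in q]

weights = [0.5, -1.27, 0.02, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

&lt;p&gt;Dequantizing multiplies the integers back by the scale, recovering the weights up to a small rounding error.&lt;/p&gt;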

&lt;h3 id=&quot;ctranslate2-python-sample&quot;&gt;CTranslate2 Python Sample&lt;/h3&gt;

&lt;p&gt;Let’s make sure that &lt;em&gt;CTranslate2&lt;/em&gt; works properly in our setup by running this Python code:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;ctranslate2&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;translator&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ctranslate2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Translator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;ende_ctranslate2/&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;translator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;translate_batch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;▁H&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;ello&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;▁world&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;!&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;translate_batch()&lt;/code&gt; can take a list of sentences and translate them in batches, which is much more efficient than translating one sentence at a time. Here, we use only one sentence for demonstration purposes.&lt;/p&gt;
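&lt;p&gt;If you later translate whole files, the usual pattern is to group sentences into fixed-size batches before calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;translate_batch()&lt;/code&gt;. A small, framework-free helper; the batch size of 32 is an illustrative choice, not a CTranslate2 requirement:&lt;/p&gt;

```python
# Split a list of tokenized sentences into fixed-size batches, the shape
# of input you would feed to translate_batch() one chunk at a time.

def make_batches(sentences, batch_size=32):
    """Yield successive batches from a list of sentences."""
    for i in range(0, len(sentences), batch_size):
        yield sentences[i:i + batch_size]

sentences = [f"sentence {i} .".split() for i in range(70)]
batches = list(make_batches(sentences, batch_size=32))  # 32 + 32 + 6
```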

&lt;p&gt;You can also check this detailed example that opens a file and translates it with CTranslate2.&lt;/p&gt;

&lt;script src=&quot;https://gist.github.com/ymoslem/60e1d1dc44fe006f67e130b6ad703c4b.js&quot;&gt;&lt;/script&gt;

&lt;h2 id=&quot;create-your-app&quot;&gt;Create Your App&lt;/h2&gt;

&lt;h3 id=&quot;test-app&quot;&gt;Test App&lt;/h3&gt;
&lt;p&gt;Let’s first create a small app to see how &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Streamlit&lt;/code&gt; works.&lt;/p&gt;

&lt;p&gt;Create a file called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;test.py&lt;/code&gt; for example and add the following lines to it.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;streamlit&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;st&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;st&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;title&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Upper My Text&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;user_input&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;st&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;text_input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Write something and press Enter &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\
&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;    to convert it to the UPPER case.&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;user_input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;output&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;user_input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;upper&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;st&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;info&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Launch your test app by opening the Terminal and running the following command.&lt;/p&gt;
&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;streamlit run test.py
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If everything works as expected, you should see something like this in your browser at the URL &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;http://localhost:8501&lt;/code&gt;. Once you type some text and press Enter, it will be printed in the UPPER case.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../static/img/streamlit-test.png&quot; alt=&quot;streamlit-test&quot; /&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;h3 id=&quot;translation-app&quot;&gt;Translation App&lt;/h3&gt;

&lt;p&gt;Let’s now develop our translation web interface. Create a file called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;translate.py&lt;/code&gt; for example, and add the following to it.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;streamlit&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;st&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sentencepiece&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;spm&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;ctranslate2&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;nltk&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sent_tokenize&lt;/span&gt;


&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;translate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;source&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;translator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sp_source_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sp_target_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;&quot;&quot;Use CTranslate model to translate a sentence

    Args:
        source (str): Source sentences to translate
        translator (object): Object of Translator, with the CTranslate2 model
        sp_source_model (object): Object of SentencePieceProcessor, with the SentencePiece source model
        sp_target_model (object): Object of SentencePieceProcessor, with the SentencePiece target model
    Returns:
        Translation of the source text
    &quot;&quot;&quot;&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;source_sentences&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sent_tokenize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;source&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;source_tokenized&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sp_source_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;encode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;source_sentences&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;out_type&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;translations&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;translator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;translate_batch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;source_tokenized&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;translations&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;translation&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;tokens&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;translation&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;translations&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;translations_detokenized&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sp_target_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;decode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;translations&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;translation&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot; &quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;join&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;translations_detokenized&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;translation&lt;/span&gt;


&lt;span class=&quot;c1&quot;&gt;# [Modify] File paths here to the CTranslate2 SentencePiece models.
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ct_model_path&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;/path/to/the/ctranslate/model/directory&quot;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;sp_source_model_path&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;/path/to/the/sentencepiece/source/model/file&quot;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;sp_target_model_path&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;/path/to/the/sentencepiece/target/model/file&quot;&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Create objects of CTranslate2 Translator and SentencePieceProcessor to load the models
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;translator&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ctranslate2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Translator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ct_model_path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;cpu&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;# or &quot;cuda&quot; for GPU
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sp_source_model&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;spm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SentencePieceProcessor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sp_source_model_path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;sp_target_model&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;spm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SentencePieceProcessor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sp_target_model_path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;


&lt;span class=&quot;c1&quot;&gt;# Title for the page and nice icon
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;st&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set_page_config&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;page_title&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;NMT&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;page_icon&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;🤖&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# Header
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;st&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;title&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Translate&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Form to add your items
&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;st&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;form&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;my_form&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# Textarea to type the source text.
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;user_input&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;st&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;text_area&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Source Text&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_chars&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;200&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# Translate with CTranslate2 model
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;translation&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;translate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;user_input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;translator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sp_source_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sp_target_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# Create a button
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;submitted&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;st&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;form_submit_button&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Translate&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# If the button pressed, print the translation
&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;# Here, we use &quot;st.info&quot;, but you can try &quot;st.write&quot;, &quot;st.code&quot;, or &quot;st.success&quot;.
&lt;/span&gt;    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;submitted&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;st&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;write&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Translation&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;st&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;info&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;translation&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Make sure you update the variables &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ct_model_path&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sp_source_model_path&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sp_target_model_path&lt;/code&gt; with your own paths to the CTranslate2 model, and the SentencePiece source and target models.&lt;/p&gt;

&lt;p&gt;Let’s launch our translator. Run the following command in the Terminal.&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;streamlit run translate.py
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If everything works fine, you should see an output like this at the URL &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;http://localhost:8501/&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Try typing a sentence (in the source language of your model) and press the “Translate” button. The translation should be printed as you see here!&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../static/img/streamlit-translate.png&quot; alt=&quot;streamlit-translate&quot; /&gt;&lt;/p&gt;

&lt;h3 id=&quot;add-language-pairs&quot;&gt;Add Language Pairs&lt;/h3&gt;

&lt;p&gt;To give your visitor the option to select between multiple language pairs, you can add a dropdown menu like this one.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../static/img/streamlit-dropdown.png&quot; alt=&quot;streamlit-dropdown&quot; /&gt;&lt;/p&gt;

&lt;p&gt;You can first change the paths part into a function:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;load_models&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;option&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;option&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;English-to-Japanese&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;ct_model_path&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;path/to/your/ct_model&quot;&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;sp_source_model_path&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;path/to/your/sp_source_model&quot;&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;sp_target_model_path&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;path/to/your/sp_target_model&quot;&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;elif&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;option&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Japanese-to-English&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;ct_model_path&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;path/to/your/ct_model&quot;&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;sp_source_model_path&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;path/to/your/sp_source_model&quot;&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;sp_target_model_path&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;path/to/your/sp_target_model&quot;&lt;/span&gt;
    
    &lt;span class=&quot;n&quot;&gt;translator&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ctranslate2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Translator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ct_model_path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;sp_source_model&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;spm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SentencePieceProcessor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sp_source_model_path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;sp_target_model&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;spm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SentencePieceProcessor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sp_target_model_path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;translator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sp_source_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sp_target_model&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Then, change the form to:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;st&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;form&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;my_form&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# Dropdown menu to select a language pair
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;option&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;st&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;selectbox&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;Select Language Pair&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;English-to-Japanese&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Japanese-to-English&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;#st.write(&apos;You selected:&apos;, option)
&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# Textarea to type the source text.
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;user_input&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;st&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;text_area&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Source Text&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_chars&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;200&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# Load models
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;translator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sp_source_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sp_target_model&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;load_models&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;option&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    
    &lt;span class=&quot;c1&quot;&gt;# Translate with CTranslate2 model
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;translation&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;translate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;user_input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;translator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sp_source_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sp_target_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# Create a button
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;submitted&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;st&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;form_submit_button&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Translate&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# If the button pressed, print the translation
&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;# Here, we use &quot;st.info&quot;, but you can try &quot;st.write&quot;, &quot;st.code&quot;, or &quot;st.success&quot;.
&lt;/span&gt;    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;submitted&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;st&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;write&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Translation&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;st&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;info&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;translation&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;full-code&quot;&gt;Full Code&lt;/h3&gt;

&lt;p&gt;I will be updating &lt;a href=&quot;https://github.com/ymoslem/CTranslate-NMT-Web-Interface&quot;&gt;this repository&lt;/a&gt; with Python samples.&lt;/p&gt;

&lt;h2 id=&quot;next-steps&quot;&gt;Next steps&lt;/h2&gt;

&lt;h3 id=&quot;streamlit-components&quot;&gt;Streamlit Components&lt;/h3&gt;

&lt;p&gt;Streamlit comes with more &lt;a href=&quot;https://streamlit.io/components&quot;&gt;components&lt;/a&gt;. One of the most interesting NLP components you might want to check is &lt;a href=&quot;https://github.com/explosion/spacy-streamlit&quot;&gt;spacy-streamlit&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;deployment&quot;&gt;Deployment&lt;/h3&gt;

&lt;p&gt;You can deploy your app on any service of your choice. However, if you are looking for a free and easy option, consider using &lt;a href=&quot;https://devcenter.heroku.com/articles/getting-started-with-python&quot;&gt;Heroku&lt;/a&gt;. For better performance, test your app with and without Streamlit’s &lt;a href=&quot;https://docs.streamlit.io/en/stable/caching.html&quot;&gt;caching&lt;/a&gt; option and see if it helps.&lt;/p&gt;

&lt;p&gt;Thanks for reading! If you have questions or suggestions, feel free to &lt;a href=&quot;https://blog.machinetranslation.io/contact/&quot;&gt;contact me&lt;/a&gt;.&lt;/p&gt;

</description>
        <pubDate>Sun, 25 Jul 2021 00:00:00 +0000</pubDate>
        <link>https://blog.machinetranslation.io/nmt-web-interface/</link>
        <guid isPermaLink="true">https://blog.machinetranslation.io/nmt-web-interface/</guid>
        
        
        <category>nmt</category>
        
      </item>
    
      <item>
        <title>Adaptive Neural Machine Translation</title>
<description>&lt;p&gt;In a translation environment, translations and edits never stop. Therefore, while periodic fine-tuning of our neural machine translation (NMT) models can help, there is definitely a need to take new translated and edited segments into consideration as they arrive. Otherwise, the MT system will keep making the same mistakes, failing to observe new terminology and style, until a new or fine-tuned version of the model is released. This is where Online Learning, or &lt;strong&gt;Online Adaptation&lt;/strong&gt;, comes in handy: the NMT model can incrementally learn from new translations and edits as it goes along!&lt;/p&gt;

&lt;p&gt;Generally speaking, there are several approaches to online adaptation. In this article, I am mainly discussing two types of adaptive machine translation: (a) instance-based adaptation of encoder-decoder MT models (&lt;a href=&quot;https://aclanthology.org/W17-4713/&quot;&gt;Farajian et al., 2017&lt;/a&gt;); and (b) adaptive translation with autoregressive LLMs (&lt;a href=&quot;https://aclanthology.org/2023.eamt-1.22/&quot;&gt;Moslem et al., 2023&lt;/a&gt;).&lt;/p&gt;

&lt;h2 id=&quot;adaptive-translation-with-encoder-decoder-nmt-models&quot;&gt;Adaptive Translation with Encoder-Decoder NMT Models&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Multi-Domain Neural Machine Translation through Unsupervised Adaptation&lt;/em&gt; (&lt;a href=&quot;https://aclanthology.org/W17-4713/&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;Farajian et al., 2017&lt;/a&gt;) is one of the best papers I have read on the topic, especially as it adapts on the fly, so there is no need to train individual models. A similar approach is used by ModernMT for Adaptive NMT.&lt;/p&gt;

&lt;p&gt;We can highlight the process offered by the paper as follows:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Given a source input &lt;em&gt;q&lt;/em&gt; (this can range from a single translation unit to an entire document), extract from the dataset/TM the top (source, target) pairs in terms of similarity between the source and &lt;em&gt;q&lt;/em&gt;.&lt;/li&gt;
  &lt;li&gt;Use the retrieved pairs to fine-tune the baseline model, which is then applied to translate &lt;em&gt;q&lt;/em&gt;.&lt;/li&gt;
  &lt;li&gt;After a linguist edits the MT translation and approves it, add it to the dataset/TM. Consider also having a dedicated “context” dataset for each client or project.&lt;/li&gt;
  &lt;li&gt;Reset the adapted model to the original parameters, translate the next input source, and so on.&lt;/li&gt;
&lt;/ol&gt;
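Step 1, the retrieval of similar pairs, can be sketched with a simple fuzzy match over an in-memory translation memory. The tiny TM and the `retrieve` helper below are illustrative; a production system would use an indexed approximate search over millions of segments:

```python
# Minimal sketch of instance-based retrieval (step 1): rank TM entries
# by fuzzy similarity between their source side and the input q.
from difflib import SequenceMatcher

translation_memory = [
    ("The cat sat on the mat.", "Le chat était assis sur le tapis."),
    ("Machine translation is useful.", "La traduction automatique est utile."),
    ("A dog slept on the rug.", "Un chien dormait sur la carpette."),
]

def retrieve(q, tm, top_k=2):
    """Return the top_k (source, target) pairs most similar to q."""
    scored = [(SequenceMatcher(None, q, src).ratio(), src, tgt)
              for src, tgt in tm]
    scored.sort(key=lambda x: x[0], reverse=True)
    return [(src, tgt) for _, src, tgt in scored[:top_k]]

pairs = retrieve("The cat sat on the rug.", translation_memory)
print(pairs[0][0])  # the most similar source sentence in the TM
```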

&lt;p&gt;It is best applied in a CAT tool. The “dataset” or “parallel data” in this case is what linguists call a “translation memory”. “Instead of the static pool of in-domain parallel data, you can have a dynamic pool which is consistently updated by adding the new post-edited sentence pairs,” said &lt;a href=&quot;https://www.linkedin.com/in/tetris/amin-farajian/&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;Amin Farajian&lt;/a&gt;, the main author of the paper. “You will have a system that learns constantly from your post-editions. Moreover, by having separate pools for each of your post-editors, you can even have MT systems that adapt to the style of your translators!”&lt;/p&gt;

&lt;p&gt;Similarly, &lt;a href=&quot;https://github.com/modernmt/modernmt/issues/546#issuecomment-650343894&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;Emil Lynegaard&lt;/a&gt; explained the process in simple words. “When you use a context memory for a translation request, it will look for similar source paragraphs in the reference context memory. If any are found, […] it will briefly “fine-tune” the underlying model. This actually modifies the weights and biases of the neural network, albeit it only does so temporarily. When the fine-tuning has finished (this is typically a sub-second training run), then your input paragraph will be translated using the updated model, after which the model will have its weights reset to the original configuration.”&lt;/p&gt;

&lt;p&gt;This human-in-the-loop, adaptive approach is just brilliant in multiple aspects. For example, it solves the issue of “catastrophic forgetting” that could happen due to fine-tuning on a small number of sentences by simply resetting the model. Moreover, it does this in a straightforward way without having to change the original architecture of the model.&lt;/p&gt;
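The adapt-translate-reset loop can be sketched abstractly as follows; the toy parameter dictionary and the `adapt` helper are hypothetical stand-ins for snapshotting and fine-tuning the network's actual weight tensors:

```python
import copy

# Abstract sketch of the adapt-then-reset loop: briefly update a copy
# of the baseline parameters for the current input, translate with it,
# then discard the copy so the baseline stays intact for the next input.
# Real systems snapshot and restore the network's weight tensors.
baseline = {"w": 1.0, "b": 0.0}

def adapt(params, retrieved_pairs, lr=0.1):
    """Stand-in for a few fine-tuning steps on the retrieved pairs."""
    adapted = copy.deepcopy(params)
    adapted["w"] += lr * len(retrieved_pairs)  # pretend gradient update
    return adapted

for q in ["first input sentence", "second input sentence"]:
    adapted = adapt(baseline, retrieved_pairs=[("src", "tgt")])
    # ... translate q with the temporarily adapted parameters ...

print(baseline["w"])  # the baseline is unchanged after every round
```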

&lt;p&gt;To test the system, we need to create development and test sets. According to the paper, “from each specific domain a set of size 500 sentence pairs is randomly selected as development set, and 1,000 sentence pairs are used as held-out test corpus.”&lt;/p&gt;

&lt;p&gt;One caveat about this approach: while it saves time and resources by eliminating the need to train many in-domain/custom models, especially if these domains have limited data, it is still compute-intensive, as it requires real-time use of GPUs, usually equivalent to those used for training the baseline model. That said, I believe in some scenarios this approach can be a perfect solution, especially if it is combined with other lines of work like Knowledge Distillation (&lt;a href=&quot;https://aclanthology.org/D16-1139/&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;Kim and Rush, 2016&lt;/a&gt;; &lt;a href=&quot;https://arxiv.org/abs/1612.06139&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;Crego and Senellart, 2016&lt;/a&gt;; &lt;a href=&quot;https://workshop2018.iwslt.org/downloads/Proceedings_IWSLT_2018.pdf#page=38&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;Zhang et al., 2018&lt;/a&gt;) to make the fine-tuning process more efficient.&lt;/p&gt;

&lt;p&gt;I was honoured to present this paper, among others, in my talk on &lt;a href=&quot;https://amtaweb.org/wp-content/uploads/2020/11/NMTDomainAdaptationTechniques.pdf&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;NMT Domain Adaptation Techniques at AMTA2020&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;adaptive-translation-with-autoregressive-decoder-only-language-models&quot;&gt;Adaptive Translation with Autoregressive Decoder-only Language Models&lt;/h2&gt;

&lt;p&gt;One of the advantages of high-quality large language models (LLMs) is that they take context into consideration. At inference time, feeding an LLM with in-domain example translations or terminology can enhance its ability to generate more accurate and relevant translations. In general, this on-the-fly adaptation feature of LLMs is referred to as in-context learning. Early in 2023, I published my paper on the topic, namely &lt;a href=&quot;https://aclanthology.org/2023.eamt-1.22/&quot;&gt;Adaptive Machine Translation with Large Language Models&lt;/a&gt;, which was later peer-reviewed and accepted for publication at EAMT 2023.&lt;/p&gt;
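The in-context learning idea can be illustrated by assembling retrieved translation pairs into a few-shot prompt. The prompt template below is a hypothetical example, not the exact format used in the paper:

```python
# Illustrative sketch of in-context learning for adaptive translation:
# prepend retrieved (source, target) pairs to the new source sentence
# so the LLM can imitate the domain's terminology and style.

def build_prompt(examples, new_source, src_lang="English", tgt_lang="French"):
    """Assemble a few-shot translation prompt from (source, target) pairs."""
    lines = []
    for src, tgt in examples:
        lines.append(f"{src_lang}: {src}")
        lines.append(f"{tgt_lang}: {tgt}")
    # the new source sentence, ending with an open target-language cue
    lines.append(f"{src_lang}: {new_source}")
    lines.append(f"{tgt_lang}:")
    return "\n".join(lines)

examples = [
    ("The update fixes the issue.", "La mise à jour corrige le problème."),
]
prompt = build_prompt(examples, "The update improves performance.")
print(prompt)
```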

</description>
        <pubDate>Wed, 21 Apr 2021 00:00:00 +0000</pubDate>
        <link>https://blog.machinetranslation.io/adaptive-nmt/</link>
        <guid isPermaLink="true">https://blog.machinetranslation.io/adaptive-nmt/</guid>
        
        
        <category>nmt</category>
        
      </item>
    
      <item>
        <title>Running TensorBoard with OpenNMT</title>
<description>&lt;p&gt;TensorBoard is a tool that provides useful visualizations of how the training of a deep learning model is progressing. It allows you to track and visualize metrics such as accuracy and perplexity. You can use TensorBoard with diverse deep learning frameworks such as TensorFlow and PyTorch. In this tutorial, you will learn how to activate TensorBoard in OpenNMT-tf and OpenNMT-py in different environments.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#1--activating-tensorboard&quot;&gt;1- Activating TensorBoard&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#2--accessing-tensorboard&quot;&gt;2- Accessing TensorBoard&lt;/a&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#google-colab&quot;&gt;Google Colab&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#ngrok&quot;&gt;ngrok&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#google-cloud-platform-gcp&quot;&gt;Google Cloud Platform (GCP)&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;1--activating-tensorboard&quot;&gt;1- Activating TensorBoard&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;For OpenNMT-tf, TensorBoard is enabled by default. For OpenNMT-py, you need to enable TensorBoard, and optionally customize the log directory. Add these lines to the training configuration YAML file.
    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;tensorboard: true
tensorboard_log_dir: logs
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Start your OpenNMT training as usual.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Create a screen for TensorBoard: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;screen -S tensorboard&lt;/code&gt;. Note: if you use Google Colab, you do not need &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;screen&lt;/code&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Open the directory of the log files. In OpenNMT-tf, by default the log files are in the same folder as the model. In OpenNMT-py, the logs are in a directory with today’s date inside “runs/onmt” or the path you specified for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tensorboard_log_dir&lt;/code&gt; in your config file.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
&lt;p&gt;If you have multiple models you want to compare, located in one parent directory, you can instead use the path of this parent directory.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Start TensorBoard and specify the log directory: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tensorboard --logdir=&quot;.&quot;&lt;/code&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;At this point, you should see a message that TensorBoard is running on localhost &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;http://localhost:6006/&lt;/code&gt; and that’s how to access it from a local browser if you are working on the same machine.&lt;/p&gt;
  &lt;/li&gt;
&lt;li&gt;Detach from this screen by pressing Ctrl+A, then D.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;2--accessing-tensorboard&quot;&gt;2- Accessing TensorBoard&lt;/h2&gt;

&lt;p&gt;There are multiple ways in which you can display the output of TensorBoard. We are exploring some of the most popular approaches.&lt;/p&gt;

&lt;h3 id=&quot;google-colab&quot;&gt;Google Colab&lt;/h3&gt;

&lt;p&gt;You can start TensorBoard within the notebook using &lt;a href=&quot;https://ipython.readthedocs.io/en/stable/interactive/magics.html&quot;&gt;magics&lt;/a&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;%load_ext tensorboard
%tensorboard --logdir runs
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;exposing-tensorboard-to-network&quot;&gt;Exposing TensorBoard to the network&lt;/h3&gt;

&lt;p&gt;You can add the flag &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--bind_all&lt;/code&gt; to your command to make TensorBoard accessible from a local browser via the server IP.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;tensorboard --logdir logs --bind_all
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;ngrok&quot;&gt;ngrok&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Sign up to &lt;a href=&quot;https://ngrok.com/&quot;&gt;ngrok&lt;/a&gt; and download the suitable version; for example the one for &lt;a href=&quot;https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip&quot;&gt;Linux&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Unzip the downloaded ngrok archive.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Find your authentication key &lt;a href=&quot;https://dashboard.ngrok.com/get-started/setup&quot;&gt;here&lt;/a&gt; and run the command: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;./ngrok authtoken &amp;lt;your_authentication_key&amp;gt;&lt;/code&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Start a new screen: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;screen -S ngrok&lt;/code&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
&lt;p&gt;Start ngrok on TensorBoard’s default port 6006: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;./ngrok http 6006&lt;/code&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;If everything works well, you should see a black screen with “Session Status Online” and other details, including “Forwarding”.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
&lt;p&gt;Copy the “Forwarding” HTTP or HTTPS URL and open it in your browser. You should be able to see something like this:&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;../static/img/tensorboard.png&quot; alt=&quot;tensorboard&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Disclaimer: The ngrok method should be only used for research or demonstration purposes. For corporate and security-sensitive purposes, consult with your team first. Depending on the infrastructure you are using, there might be better methods.&lt;/p&gt;

&lt;h3 id=&quot;google-cloud-platform-gcp&quot;&gt;Google Cloud Platform (GCP)&lt;/h3&gt;

&lt;p&gt;If you are training your models on Google Cloud Platform (GCP), you can instead run TensorBoard locally using the approach explained &lt;a href=&quot;https://cs.brown.edu/courses/csci1430/proj4/gcp-guide/&quot;&gt;here&lt;/a&gt; and &lt;a href=&quot;https://stackoverflow.com/questions/43711110/google-cloud-platform-access-tensorboard&quot;&gt;here&lt;/a&gt;, for example.&lt;/p&gt;

&lt;p&gt;You can learn more &lt;a href=&quot;https://www.tensorflow.org/tensorboard/get_started&quot;&gt;here&lt;/a&gt; about TensorBoard and how to use it in other scenarios.&lt;/p&gt;

</description>
        <pubDate>Fri, 19 Feb 2021 00:00:00 +0000</pubDate>
        <link>https://blog.machinetranslation.io/TensorBoard/</link>
        <guid isPermaLink="true">https://blog.machinetranslation.io/TensorBoard/</guid>
        
        
        <category>nmt</category>
        
      </item>
    
      <item>
        <title>Bash Commands for NLP Engineers</title>
<description>&lt;p&gt;As using Bash commands is inevitable if you work on NLP and MT tasks, I thought it would be useful to list the majority of commands I have learnt to use on a daily basis, thanks to practice, searching, and the helpful colleagues I have met over the years. Obviously, this is not an exhaustive list; however, I hope it includes most of the one-line Bash commands you would need. Please note that the majority of these commands have been mainly tested on Linux.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;
&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#file-management&quot;&gt;File Management&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#reading-files&quot;&gt;Reading Files&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#nano-editor-commands&quot;&gt;Nano Editor Commands&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#finding&quot;&gt;Finding&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#downloading&quot;&gt;Downloading&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#compressing-and-extracting&quot;&gt;Compressing and Extracting&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#server-related-bash-commands&quot;&gt;Server-related Bash Commands&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#other-useful-packages&quot;&gt;Other Useful Packages&lt;/a&gt;
&lt;br /&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;file-management&quot;&gt;File Management&lt;/h2&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Open a directory:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;cd &amp;lt;path/dir_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;List the files and sub-directories in the current directory:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;ls&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Create a new directory:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;mkdir &amp;lt;dir_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Rename or move a file or directory:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;mv &amp;lt;old_filename&amp;gt; &amp;lt;new_filename&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Move a file to a directory:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;mv &amp;lt;old_filename&amp;gt; &amp;lt;dir_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Move all files whose names start with a string, using *:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;mv &amp;lt;old_filename&amp;gt;* &amp;lt;folder_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Rename multiple files:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;rename 's/&amp;lt;original_string&amp;gt;/&amp;lt;new_string&amp;gt;/g' *&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Delete a file:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;rm &amp;lt;file_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;To delete multiple files, just add them after the rm command separated by spaces:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;rm &amp;lt;file_name1&amp;gt; &amp;lt;file_name2&amp;gt; &amp;lt;file_name3&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Delete any file that starts with “wow”, using *:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;rm wow*&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Delete a directory and its contents:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;rm -r &amp;lt;dir_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;To avoid deleting files by mistake, use trash instead of rm, after installing trash-cli:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;sudo apt-get install trash-cli&lt;br /&gt;
• Delete:&lt;br /&gt;
trash &amp;lt;file_name&amp;gt;&lt;br /&gt;
• List trashed items:&lt;br /&gt;
trash-list&lt;br /&gt;
• Restore a file (first move to the root folder or a specific folder):&lt;br /&gt;
restore-trash and then type a number.&lt;br /&gt;
• Empty the trash list:&lt;br /&gt;
trash-empty&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Copy a file:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;cp &amp;lt;original_filename&amp;gt; &amp;lt;new_filename&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Copy a directory and its contained files (at least -r is required):&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;cp -avr &amp;lt;original_dirname&amp;gt; &amp;lt;new_dirname&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Copy and show a progress bar (good for large files)&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;rsync -ah --progress &amp;lt;source&amp;gt; &amp;lt;destination&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Complete a command or file name (e.g. my_file_name.txt):&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Type &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;my&lt;/code&gt; and then press &lt;kbd&gt;Tab&lt;/kbd&gt; – once if there is no other file starting with “my”. &lt;br /&gt;
OR&lt;/li&gt;
  &lt;li&gt;Type &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;my&lt;/code&gt; and then press &lt;kbd&gt;Tab&lt;/kbd&gt; – twice if you want to see which files start with “my”.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Move to a location in a command or text:&lt;/strong&gt;
Move the cursor to the location, press &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Alt&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Option&lt;/code&gt;, and click.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Clear the current window:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Type &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;clear&lt;/code&gt;&lt;br /&gt;
OR&lt;/li&gt;
  &lt;li&gt;Press &lt;kbd&gt;Ctrl&lt;/kbd&gt;+&lt;kbd&gt;l&lt;/kbd&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;End the current command (before it finishes):&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Press &lt;kbd&gt;Ctrl&lt;/kbd&gt;+&lt;kbd&gt;c&lt;/kbd&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Move to the last accessed path:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;cd -&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;List your previous commands&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;history&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Search your command history&lt;/strong&gt;&lt;br /&gt;
&lt;kbd&gt;Ctrl&lt;/kbd&gt;+&lt;kbd&gt;r&lt;/kbd&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;List the *.txt files in the current directory (or path):&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;ls *.txt&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Show the files in all folders that start with “aaa”:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;ls aaa*&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Show files and subdirectories in all directories in the current directory:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;ls *&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;List all the files with details:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;ls -l&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Display file details:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;ls -l &amp;lt;file_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;List all the files with details, with human-readable sizes (KB/MB/GB):&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;ls -lh&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;List all the files with details and human-readable sizes, sorted by modification time, newest first:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;ls -lht&lt;br /&gt;
ls -lht &amp;lt;dir_name1&amp;gt;/*/&amp;lt;dir_name2&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;List all the files with details and human-readable sizes, sorted by modification time, oldest first:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;ls -lhtr&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;List file sizes only for all files in the current directory:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;ls -hs&lt;br /&gt;
OR&lt;br /&gt;
du&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Display the file size only:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;ls -hs &amp;lt;file_name&amp;gt;&lt;br /&gt;
OR&lt;br /&gt;
du -h &amp;lt;file_name&amp;gt;&lt;br /&gt;
OR, for one total size of a directory and everything in it:&lt;br /&gt;
du -hs &amp;lt;dir_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Display the last modified file:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;ls -t | head -1&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Display sizes of the current directory:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;du -d 1 -h .&lt;br /&gt;
Sort the results in ascending order:&lt;br /&gt;
du -d 1 -h . | sort -h&lt;br /&gt;
Sort the results in descending order:&lt;br /&gt;
du -d 1 -h . | sort -h -r&lt;/p&gt;
&lt;/blockquote&gt;
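As a quick sanity check of the du and sort combination above, here is a toy run with made-up directory names; the grand total for “.” always sorts last:

```shell
# Create two directories of very different sizes.
mkdir -p big small
head -c 1048576 /dev/zero > big/file.bin   # ~1 MB
head -c 1024 /dev/zero > small/file.bin    # ~1 KB
# Per-directory sizes in 1K blocks, smallest first; the "." total comes last.
du -d 1 . | sort -n
```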

&lt;p&gt;&lt;strong&gt;Find files that are bigger than 200MB:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;find /home/$USER/ -type f -size +200000k -exec ls -lh {} \; | awk &apos;{ print $9 &quot;: &quot; $5 }&apos;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Display file size with stat (Linux):&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;stat --printf=&quot;%s&quot; &amp;lt;file_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Display file last edited time (Linux):&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;stat -c %y &amp;lt;file_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Display file last edited time (Mac):&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;stat -x &amp;lt;file_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Get the current path (print working directory):&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;pwd&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Create a symbolic link, i.e. a shortcut to a file or directory:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;ln -s &amp;lt;file_name&amp;gt; &amp;lt;shortcut_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Get the path of a file:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;readlink -f &amp;lt;file_name&amp;gt;&lt;br /&gt;
OR&lt;br /&gt;
echo &quot;$(pwd)/file_name&quot;&lt;br /&gt;
OR&lt;br /&gt;
realpath &amp;lt;file_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Get the line, word, and byte counts of a file:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;wc &amp;lt;file_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Get the number of lines in a file:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;wc -l &amp;lt;file_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Count lines of all matching files in subdirectories; wrap a partial file name in *:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;find ./ -type f -name &quot;*&amp;lt;file_name&amp;gt;*&quot; -exec wc -l {} +&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Count lines in a *.gz file; use -c to avoid writing the uncompressed file to disk:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;gunzip -c &amp;lt;file_name.gz&amp;gt; | wc -l&lt;/p&gt;
&lt;/blockquote&gt;
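An end-to-end check of the gunzip piping trick above, using a throwaway file name:

```shell
seq 1 5 > data.txt              # a file with 5 numbered lines
gzip data.txt                   # creates data.txt.gz and removes data.txt
gunzip -c data.txt.gz | wc -l   # counts 5 lines without writing the file to disk
```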

&lt;p&gt;&lt;strong&gt;Split a file into multiple files, 3000 lines each, with numeric-suffixes:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;split -a 4 -d -l 3000 &amp;lt;file_name&amp;gt; &amp;lt;prefix&amp;gt; --additional-suffix=&amp;lt;extension&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;
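A small sketch of split on a 10-line file (file and prefix names are made up; the -d numeric-suffix flag is a GNU split feature and is not available in BSD/macOS split):

```shell
# Split a 10-line file into 3-line chunks with 4-digit numeric suffixes.
seq 1 10 > sample.txt
split -a 4 -d -l 3 sample.txt part_
ls part_*   # part_0000 part_0001 part_0002 part_0003
```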

&lt;p&gt;&lt;strong&gt;Find out if two files are identical:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;cmp --silent &amp;lt;file_name1&amp;gt; &amp;lt;file_name2&amp;gt; || echo &quot;Files are different.&quot;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Find out the difference between two files:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;diff &amp;lt;file_name1&amp;gt; &amp;lt;file_name2&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Find different lines in file1.txt compared to file2.txt:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;comm -23 &amp;lt;(sort file1.txt) &amp;lt;(sort file2.txt) &amp;gt; different.txt&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Find common lines in both file1.txt and file2.txt:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;comm -12 &amp;lt;(sort file1.txt) &amp;lt;(sort file2.txt) &amp;gt; common.txt&lt;/p&gt;
&lt;/blockquote&gt;
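The two comm commands above can be sanity-checked on a toy pair of files. This sketch sorts into temporary files first; the `<(sort …)` process substitution used above does the same thing inline, but requires bash:

```shell
printf 'c\na\nb\n' > file1.txt
printf 'd\nb\nc\n' > file2.txt
sort file1.txt > file1.sorted
sort file2.txt > file2.sorted
comm -23 file1.sorted file2.sorted > different.txt  # lines only in file1.txt
comm -12 file1.sorted file2.sorted > common.txt     # lines in both files
```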

&lt;p&gt;&lt;strong&gt;Continue a long command on a new line (end the line with a backslash):&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;\&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;reading-files&quot;&gt;Reading Files&lt;/h2&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Read the whole file:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;cat &amp;lt;file_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Read the whole file; display line numbers:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;cat -n &amp;lt;file_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Read the first 10 lines of a file:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;head &amp;lt;file_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Read the first 4 lines of a file:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;head -4 &amp;lt;file_name&amp;gt;&lt;br /&gt;
OR&lt;br /&gt;
head -n 4 &amp;lt;file_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Read the first 3 lines of two files:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;head -q -n 3 &amp;lt;file_name1&amp;gt; &amp;lt;file_name2&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Read the last 10 lines of a file:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;tail &amp;lt;file_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Read the last 3 lines of a file:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;tail -3 &amp;lt;file_name&amp;gt;&lt;br /&gt;
OR&lt;br /&gt;
tail -n 3 &amp;lt;file_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Read a specific line of a file, e.g. line #10:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;sed -n 10p &amp;lt;file_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;
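For example, printing line 10 of a numbered throwaway file returns exactly that line:

```shell
seq 1 20 > nums.txt       # lines "1" through "20"
sed -n 10p nums.txt       # prints: 10
```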

&lt;p&gt;&lt;strong&gt;Read the end of the file and use -f to update the output:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;tail -f &amp;lt;file_name&amp;gt;&lt;br /&gt;
Use &lt;kbd&gt;Ctrl&lt;/kbd&gt;+&lt;kbd&gt;c&lt;/kbd&gt; to exit.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Read a file in chunks:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;less &amp;lt;file_name&amp;gt;&lt;br /&gt;
Press &lt;kbd&gt;Space&lt;/kbd&gt; to move to the next chunk of the file, and “q” to quit.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Read a file in chunks, display line numbers:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;less -N &amp;lt;file_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Disable sending to stdout (i.e. printing in Terminal) by adding 1&amp;gt; /dev/null&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;cat &amp;lt;file_name1&amp;gt; &amp;lt;file_name2&amp;gt; | tee &amp;lt;output_file_name&amp;gt; 1&amp;gt; /dev/null&lt;/p&gt;
&lt;/blockquote&gt;
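A minimal check of the tee redirection above, with throwaway file names:

```shell
printf 'line A\n' > f1.txt
printf 'line B\n' > f2.txt
# Write the merged output to merged.txt; 1> /dev/null silences the terminal copy.
cat f1.txt f2.txt | tee merged.txt 1> /dev/null
wc -l < merged.txt   # 2
```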

&lt;h2 id=&quot;processing&quot;&gt;Processing&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Merge two files, use &amp;gt; to create the output file:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;cat &amp;lt;file_name1&amp;gt; &amp;lt;file_name2&amp;gt; &amp;gt; &amp;lt;output_file&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Merge all the files that end with a given extension (say “.en”) into one file (e.g. “all.en”):&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;cat *.en &amp;gt; all.en&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Merge all the files in the current folder:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;cat * &amp;gt; &amp;lt;output_file_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Merge the source text and target translation into one tab-delimited file&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;paste -d &quot;\t&quot; all.en all.ar &amp;gt; all.enar&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Remove duplicates from a file&lt;/strong&gt; (note that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;uniq -u&lt;/code&gt; would instead keep only the lines that occur exactly once, dropping every copy of a duplicated line)&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;sort -S 95% --parallel=8 all.enar | uniq &amp;gt; all.unique.enar&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Shuffle&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;shuf all.unique.enar &amp;gt; all.unique.shuf.enar&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Split a tab-delimited file into two files, one for the source and one for the target&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;cut -f 1 all.unique.shuf.enar &amp;gt; all.unique.en&lt;br /&gt;
cut -f 2 all.unique.shuf.enar &amp;gt; all.unique.ar&lt;/p&gt;
&lt;/blockquote&gt;
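Putting the processing steps above together, here is a toy run of the whole parallel-corpus pipeline (merge, deduplicate, shuffle, split back into source and target). The file names follow the examples above; note that shuf is a GNU coreutils tool (gshuf via coreutils on macOS):

```shell
# Toy aligned files with one duplicated sentence pair.
printf 'Hello\nHello\nBye\n' > all.en
printf 'Hola\nHola\nAdios\n' > all.ar
paste -d "\t" all.en all.ar > all.enar        # merge into tab-delimited pairs
sort all.enar | uniq > all.unique.enar        # keep one copy of each pair
shuf all.unique.enar > all.unique.shuf.enar   # shuffle the pairs together
cut -f 1 all.unique.shuf.enar > all.unique.en # source side
cut -f 2 all.unique.shuf.enar > all.unique.ar # target side
```

Deduplicating on the merged file (rather than on each side separately) is what keeps the two sides aligned line by line.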

&lt;p&gt;&lt;strong&gt;Replace “abc” with “XYZ” in a file&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;sed -i -e &apos;s/abc/XYZ/g&apos; /tmp/file.txt&lt;/p&gt;
&lt;/blockquote&gt;
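For example, with a throwaway file (note that on macOS/BSD sed, -i needs an explicit backup suffix, e.g. sed -i '' -e …):

```shell
printf 'abc def\nxyz abc\n' > /tmp/file.txt
sed -i -e 's/abc/XYZ/g' /tmp/file.txt   # GNU sed in-place replacement
cat /tmp/file.txt                        # both lines now contain XYZ
```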

&lt;h2 id=&quot;nano-editor-commands&quot;&gt;Nano Editor Commands&lt;/h2&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Create a new file:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;nano &amp;lt;new_file_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Open an existing file:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;nano &amp;lt;file_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Open multiple files:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;nano &amp;lt;file_name1&amp;gt; &amp;lt;file_name2&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Search the current file:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;&lt;kbd&gt;Ctrl&lt;/kbd&gt;+&lt;kbd&gt;w&lt;/kbd&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Move to the end of the file:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;&lt;kbd&gt;Ctrl&lt;/kbd&gt;+&lt;kbd&gt;w&lt;/kbd&gt; and then &lt;kbd&gt;Ctrl&lt;/kbd&gt;+&lt;kbd&gt;v&lt;/kbd&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Move to the end of the line:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;&lt;kbd&gt;Ctrl&lt;/kbd&gt;+&lt;kbd&gt;e&lt;/kbd&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Move to the start of the line:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;&lt;kbd&gt;Ctrl&lt;/kbd&gt;+&lt;kbd&gt;a&lt;/kbd&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Delete the current line:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;&lt;kbd&gt;Ctrl&lt;/kbd&gt;+&lt;kbd&gt;k&lt;/kbd&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Move a page down:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;&lt;kbd&gt;Ctrl&lt;/kbd&gt;+&lt;kbd&gt;v&lt;/kbd&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Move a page up:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;&lt;kbd&gt;Ctrl&lt;/kbd&gt;+&lt;kbd&gt;y&lt;/kbd&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Cut the current line:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;&lt;kbd&gt;Ctrl&lt;/kbd&gt;+&lt;kbd&gt;k&lt;/kbd&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Mark text:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;&lt;kbd&gt;Ctrl&lt;/kbd&gt;+&lt;kbd&gt;Shift&lt;/kbd&gt;+&lt;kbd&gt;6&lt;/kbd&gt; (i.e. it is &lt;kbd&gt;Ctrl&lt;/kbd&gt;+&lt;kbd&gt;^&lt;/kbd&gt;) and then move in the direction you need.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Cut the marked text:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;&lt;kbd&gt;Ctrl&lt;/kbd&gt;+&lt;kbd&gt;k&lt;/kbd&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Paste the cut text:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;&lt;kbd&gt;Ctrl&lt;/kbd&gt;+&lt;kbd&gt;u&lt;/kbd&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Note: to paste across multiple files, open the files together in the same nano session, cut or copy the text in the first file, switch to the other file, and then paste.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Close the current file:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;&lt;kbd&gt;Ctrl&lt;/kbd&gt;+&lt;kbd&gt;x&lt;/kbd&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You will be prompted if you want to save; type “y” for yes and “n” for no. If you select to save, just press Enter to keep the current file name. You can also move between two open files as in the next command.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Move between two open files:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;&lt;kbd&gt;Alt&lt;/kbd&gt;+&lt;kbd&gt;.&lt;/kbd&gt; to move forward one file.&lt;br /&gt;
&lt;kbd&gt;Alt&lt;/kbd&gt;+&lt;kbd&gt;,&lt;/kbd&gt; to move backward one file.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Note that if you are on Mac, Option+. and Option+, are used to insert ≥≤ symbols, so you need to first press Alt+Command+O to change the behaviour of Option in Terminal.&lt;/p&gt;

&lt;h2 id=&quot;finding&quot;&gt;Finding&lt;/h2&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Find files that include a phrase (e.g. “really great” in *.txt files):&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;grep &quot;really great&quot; *.txt&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Search sub-directories recursively using grep:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;grep -r &amp;lt;word_to_search&amp;gt; *&lt;br /&gt;
OR&lt;br /&gt;
grep -R &amp;lt;word_to_search&amp;gt; *&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Use regular expressions with grep, e.g. the only word in the line is ‘nan’:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;grep &apos;^nan$&apos; &amp;lt;file_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;
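The ^ and $ anchors restrict the match to whole lines, which this toy file demonstrates:

```shell
printf 'nan\nbanana\nnan here\n' > scores.txt
# Only the line that is exactly "nan" matches; "banana" and "nan here" do not.
grep -c '^nan$' scores.txt   # 1
```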

&lt;p&gt;&lt;strong&gt;Find a file on the machine by name:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;sudo find / -name &amp;lt;file_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Find all files in directory and subdirectories that end with *.en:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;find &quot;$PWD&quot; -type f | grep &apos;\.en$&apos;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Find all files in the directory and subdirectories whose paths contain “aaa” (in a regular expression, unlike in the shell, * means repetition of the previous character, so no * is needed here):&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;find &quot;$PWD&quot; -type f | grep &quot;aaa&quot;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Find files in the current directory whose names include “wonderful”:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;ls | grep “wonderful”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;If you have a very long list generated by ls and want to display it page by page:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;ls | less&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;List files whose names include a range of numbers:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;ls model.0{1..3}*&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;List files whose names include different letters:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;ls model.{a,b,c,d}&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Move multiple files&lt;/strong&gt; (or run any command on multiple files):&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;put the varying parts of the names inside &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;{ }&lt;/code&gt;, separated by commas.&lt;/li&gt;
&lt;/ul&gt;
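A minimal sketch of this brace expansion, with a made-up file name (brace expansion is a bash/zsh feature, not plain POSIX sh):

```shell
touch report.txt
# The shell expands the braces before running mv, so this runs as:
#   mv report.txt report.bak
mv report.{txt,bak}
```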

&lt;p&gt;&lt;strong&gt;Find installed Python3 packages:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;pip3 freeze&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Find installed Python3 packages that start with “tensor”, use -i to ignore case:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;pip3 freeze | grep -i tensor&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Find the location of a command (e.g python3):&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;which python3&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;downloading&quot;&gt;Downloading&lt;/h2&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Download a file using curl:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;curl &amp;lt;http://some.url&amp;gt; --output &amp;lt;file_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If this is the first time you use curl, you might get a message like “Command ‘curl’ not found, but can be installed with:”&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;sudo apt install curl&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Download a file that requires cookies:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;curl --cookie &amp;lt;cookies.txt&amp;gt; &amp;lt;http://some.url&amp;gt; --output &amp;lt;file_name&amp;gt;&lt;br /&gt;
To get the “cookies.txt” file, you can use a Chrome extension like “cookies.txt” to export cookies into a TXT file.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Copy GitHub repository to the machine:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;git clone https://github.com/USERNAME/REPOSITORYNAME&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Update a downloaded GitHub repository:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;cd &amp;lt;repository_dir_name&amp;gt;&lt;br /&gt;
git checkout master&lt;br /&gt;
git pull&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Stage and Commit a GitHub repository&lt;/strong&gt; (&lt;a href=&quot;https://www.nobledesktop.com/learn/git/stage-commit-files&quot;&gt;details&lt;/a&gt;)&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;git add &amp;lt;file_name&amp;gt;&lt;br /&gt;
git commit -m &quot;Message, e.g. Update file&quot;&lt;br /&gt;
git push origin main&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The default branch is usually called “master” or “main” – if it is not, replace it with the right name.&lt;/p&gt;

&lt;h2 id=&quot;compressing-and-extracting&quot;&gt;Compressing and Extracting&lt;/h2&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extract a *.zip file:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;unzip &amp;lt;file_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Create a zip archive from file(s):&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;zip &amp;lt;archive_filename&amp;gt; &amp;lt;file_list&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Create a zip archive from a directory with high level of compression:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;zip -r -9 &amp;lt;archive_filename.zip&amp;gt; &amp;lt;dir_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Extract a *.gz file:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;gunzip &amp;lt;file_name.gz&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Compress all the files separately as file_name.gz&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;cd &amp;lt;dir_name&amp;gt;&lt;br /&gt;
gzip *&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Compress all the files in the same directory even if there are subdirectories:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;cd &amp;lt;dir_name&amp;gt;&lt;br /&gt;
gzip -r .&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Extract a *.tar.gz file:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;tar xzvf &amp;lt;file_name.tar.gz&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Extract a *.tgz file:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;tar xzvf &amp;lt;file_name.tgz&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Extract in a different directory:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;tar xzvf &amp;lt;file_name.tgz&amp;gt; -C &amp;lt;/path/dir_name&amp;gt;&lt;br /&gt;
OR&lt;br /&gt;
gunzip -c &amp;lt;file_name.tgz&amp;gt; | tar xvf -&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Create a *.tar.gz archive:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;tar -czvf archive.tar.gz &amp;lt;dir_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Create a *.tar.gz archive from multiple files/directories:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;tar -czvf &amp;lt;archive_file_name.tar.gz&amp;gt; &amp;lt;file_name1&amp;gt; &amp;lt;file_name2&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Compress as *.tar.bz2 (higher compression):&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;tar -jcvf &amp;lt;archive_name.tar.bz2&amp;gt; &amp;lt;file_dir_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Extract a *.tar.bz2 archive:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;tar -jxvf &amp;lt;archive_name.tar.bz2&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Compress all the files separately as file_name.bz2&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;bzip2 *&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Extract file_name.bz2 (without tar)&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;bzip2 -d &amp;lt;file_name.bz2&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;
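The tar commands above compose into a simple round trip; a sketch with throwaway directory and file names:

```shell
mkdir -p mydir
echo hello > mydir/a.txt
tar -czvf archive.tar.gz mydir         # create a compressed archive of the directory
mkdir -p extracted
tar xzvf archive.tar.gz -C extracted   # extract it into another directory
cat extracted/mydir/a.txt              # hello
```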

&lt;h2 id=&quot;server-related-bash-commands&quot;&gt;Server-related Bash Commands&lt;/h2&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;Obviously, many of these commands can be used locally, but they are most useful while working on servers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Find out the server date and time:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;date&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Measure time taken to run a script or command:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;time &amp;lt;command&amp;gt;, e.g. time python3 script.py&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Find out the free space on the disk:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;df -h&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Create an alias for a command:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;alias &amp;lt;name&amp;gt;=&quot;&amp;lt;command&amp;gt;&quot;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;To save aliases permanently, add them to ~/.bash_aliases:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;nano ~/.bash_aliases&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;For example, you can add this command to the ~/.bash_aliases file, use quotes for multi-word commands:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;alias frz=&quot;pip3 freeze&quot;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;For the alias change to take effect&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;source ~/.bash_aliases&lt;br /&gt;
OR&lt;br /&gt;
exec bash&lt;br /&gt;
The next time you type &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;frz&lt;/code&gt; in the Terminal, it will run the command &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pip3 freeze&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Repeat the same command&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;watch &amp;lt;command&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Avoid ending a command if the local Terminal is closed:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;screen&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Create a new screen with a name:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;screen -S &amp;lt;name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Create a new screen with logging enabled; screenlog.0 is created:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;screen -L -S &amp;lt;name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Detach the current screen:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;&lt;kbd&gt;Ctrl&lt;/kbd&gt;+&lt;kbd&gt;a&lt;/kbd&gt;+&lt;kbd&gt;d&lt;/kbd&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Resume a single screen:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;screen -r&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Resume a screen from multiple running screen:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;screen -r &amp;lt;name&amp;gt;&lt;br /&gt;
OR&lt;br /&gt;
screen -r &amp;lt;id&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;List the currently running screens:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;screen -list&lt;br /&gt;
OR&lt;br /&gt;
screen -ls&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;End a screen:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;screen -X -S &amp;lt;id&amp;gt; quit&lt;br /&gt;
or resume the screen and then&lt;br /&gt;
&lt;kbd&gt;Ctrl&lt;/kbd&gt;+&lt;kbd&gt;a&lt;/kbd&gt; then &lt;kbd&gt;k&lt;/kbd&gt; then &lt;kbd&gt;y&lt;/kbd&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Shutdown the machine after a command finishes; separate the two commands with ; (use &amp;amp;&amp;amp; instead to shut down only if the command succeeds):&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;python3 file.py; sudo shutdown&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Adjust file permissions, access by the current user only:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;chmod 700 &amp;lt;file_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For example, this is required before using the *.pem key file provided by AWS EC2.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Display RAM used:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;free -m&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Display GPU memory used:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;nvidia-smi&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Find the CUDA version:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;nvcc --version&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Run a command continuously&lt;/strong&gt; (optionally use -n &amp;lt;seconds&amp;gt; for the refresh interval, and -d to highlight changes):&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;watch &amp;lt;command&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Check kernel termination errors&lt;/strong&gt; (use one of these commands)&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;dmesg&lt;br /&gt;
OR&lt;br /&gt;
nano /var/log/kern.log&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Check currently running processes - use grep if you are looking for a specific type of processes:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;ps -ef | grep python3&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Copy a file from the local machine to a server (run it from the local machine):&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;scp &amp;lt;file_name&amp;gt; &amp;lt;user&amp;gt;@&amp;lt;server_ip:port&amp;gt;:/&amp;lt;dir_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Copy a directory from the local machine to a server; use -r (run it from the local machine):&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;scp -r &amp;lt;dir_name&amp;gt; &amp;lt;user&amp;gt;@&amp;lt;server_ip:port&amp;gt;:/&amp;lt;dir_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Copy a file from the local machine to an AWS EC2 instance (run it from the local machine):&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;scp -i &amp;lt;key.pem&amp;gt; &amp;lt;file_name&amp;gt; ubuntu@ec2[…].compute.amazonaws.com:~/&amp;lt;dir_name&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Copy a file from a server to the local machine (run it from the local machine):&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;scp &amp;lt;user&amp;gt;@&amp;lt;server_ip:port&amp;gt;:/&amp;lt;dir_name&amp;gt;/&amp;lt;file_name&amp;gt; &amp;lt;/path/on/the/local/machine&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Move a file from Google Cloud to the local machine:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;gcloud compute scp --project &amp;lt;project_name&amp;gt; --recurse &amp;lt;user_name&amp;gt;@&amp;lt;machine_name&amp;gt;:~/&amp;lt;dir_name&amp;gt;/&amp;lt;file_name&amp;gt; &amp;lt;/path/on/the/local/machine&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Log out of the current connection (and in similar scenarios):&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;&lt;kbd&gt;Ctrl&lt;/kbd&gt;+&lt;kbd&gt;d&lt;/kbd&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;other-useful-packages&quot;&gt;Other Useful Packages&lt;/h2&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;Among useful packages that you might want to install yourself are:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;curl&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wget&lt;/code&gt; for downloading files, or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;aria2c&lt;/code&gt; for faster download&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;trash-cli&lt;/code&gt; for trashing unwanted files into a folder instead of using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rm&lt;/code&gt; command&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tree&lt;/code&gt; for displaying the directory structure&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;htop&lt;/code&gt; for monitoring CPU resources&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;locate&lt;/code&gt; for quickly finding files by name after &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;updatedb&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ack&lt;/code&gt; for searching files like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;grep&lt;/code&gt;, but faster&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;parallel&lt;/code&gt; for multithreading from the bash&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;s3cmd&lt;/code&gt; for uploading and downloading files between AWS S3 buckets and non-AWS servers. For AWS EC2 servers, use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;aws s3&lt;/code&gt; command instead.&lt;/li&gt;
&lt;/ul&gt;
</description>
        <pubDate>Sun, 02 Aug 2020 00:00:00 +0000</pubDate>
        <link>https://blog.machinetranslation.io/bash-commands/</link>
        <guid isPermaLink="true">https://blog.machinetranslation.io/bash-commands/</guid>
        
        
        <category>mt</category>
        
      </item>
    
      <item>
        <title>Pre-trained Neural Machine Translation (NMT) Models</title>
        <description>&lt;p&gt;Neural Machine Translation (NMT) in-domain models outperform generic models for the “domain” on which they are trained. In other words, in-domain models can observe terminology and generate translations that are more in line with a specialized context.&lt;/p&gt;

&lt;p&gt;You can download the NMT models below. Enjoy!&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://machinetranslation.io/nmt-pretrained-models&quot;&gt;Download Pre-Trained NMT Models&lt;/a&gt;&lt;/p&gt;
</description>
        <pubDate>Tue, 16 Jun 2020 00:00:00 +0000</pubDate>
        <link>https://blog.machinetranslation.io/pre-trained-nmt-models/</link>
        <guid isPermaLink="true">https://blog.machinetranslation.io/pre-trained-nmt-models/</guid>
        
        
        <category>nmt</category>
        
      </item>
    
      <item>
        <title>WER Score for Machine Translation</title>
        <description>&lt;p&gt;&lt;strong&gt;Update:&lt;/strong&gt; Currently, it is recommended to use &lt;strong&gt;&lt;em&gt;SacreBLEU&lt;/em&gt;&lt;/strong&gt; for calculating &lt;em&gt;BLEU&lt;/em&gt;, &lt;em&gt;ChrF&lt;/em&gt;, and &lt;em&gt;WER&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/mjpost/sacrebleu&quot;&gt;&lt;img src=&quot;https://github-readme-stats.vercel.app/api/pin/?theme=graywhite&amp;amp;username=mjpost&amp;amp;repo=sacrebleu&quot; alt=&quot;SacreBLEU&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Word Error Rate (WER) computes the minimum Edit Distance between the human-generated sentence and the machine-predicted sentence. In other tutorials, I explained &lt;a href=&quot;2020-01-26-compute-bleu-score.md&quot;&gt;how to use Python to compute BLEU&lt;/a&gt; and &lt;a href=&quot;https://python.gotrained.com/nltk-edit-distance-jaccard-distance/&quot;&gt;Edit Distance&lt;/a&gt;, and in this tutorial, I am going to explain how to calculate the WER score.&lt;/p&gt;
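&lt;p&gt;To make the definition concrete, here is an illustrative pure-Python sketch (not JIWER itself) of WER as the word-level minimum Edit Distance divided by the reference length:&lt;/p&gt;

```python
# Illustrative WER: minimum word-level edit distance divided by reference length.
# This is a didactic sketch, not the JIWER implementation.
def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for the Levenshtein distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions turn the first i reference words into nothing
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions build the first j hypothesis words from nothing
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("hello world", "hello duck"))  # 0.5: one substitution in two words
```

&lt;p&gt;A lower WER is better; 0.0 means the hypothesis matches the reference exactly.&lt;/p&gt;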

&lt;p&gt;For this WER score tutorial, I am going to use the Python library, &lt;em&gt;JIWER&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;
&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#files-required-to-compute-wer&quot;&gt;Files Required to Compute WER&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#corpus-wer-calculator&quot;&gt;Corpus WER Calculator&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#sentence-wer-calculator&quot;&gt;Sentence WER Calculator&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#conclusion&quot;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;files-required-to-compute-wer&quot;&gt;Files Required to Compute WER&lt;/h2&gt;

&lt;p&gt;To measure WER score, you need to have two files:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Ground Truth: It is the reference human translation (target) file of your test dataset. In the code, I will refer to such sentences as “refs” or “test” interchangeably.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Hypothesis: It is the Machine Translation prediction for the source of the same test dataset used for “Ground Truth”. In the code, I will refer to such sentences as “preds”.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;corpus-wer-calculator&quot;&gt;Corpus WER Calculator&lt;/h2&gt;

&lt;p&gt;JIWER allows computing the overall WER score on multiple sentences using two lists that include the same number of sentences. I am quoting a sample code from JIWER’s page as follows:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;jiwer&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;wer&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;ground_truth&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;hello world&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;i like monthy python&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;hypothesis&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;hello duck&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;i like python&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;error&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;wer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ground_truth&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hypothesis&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
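&lt;p&gt;Note that the corpus-level score is not the average of per-sentence scores: conceptually, it aggregates the word edits across all sentence pairs and then divides by the total number of reference words. Here is a pure-Python sketch of this aggregation (illustrative only, not JIWER&apos;s own code):&lt;/p&gt;

```python
# Illustrative corpus-level WER: total word edits over total reference words.
def edit_distance(ref, hyp):
    # Levenshtein distance between two word lists
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)]

def corpus_wer(ground_truth, hypothesis):
    refs = [s.split() for s in ground_truth]
    hyps = [s.split() for s in hypothesis]
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_words = sum(len(r) for r in refs)
    return total_edits / total_words

ground_truth = ["hello world", "i like monthy python"]
hypothesis = ["hello duck", "i like python"]
print(round(corpus_wer(ground_truth, hypothesis), 3))  # 0.333: 2 edits over 6 reference words
```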

&lt;p&gt;Now, let’s apply the same concept to the two files.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Create the argument list. This is optional, but it lets you run the script with arguments from CMD/Terminal as follows:&lt;/li&gt;
&lt;/ol&gt;

&lt;div class=&quot;language-console highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;go&quot;&gt;python3 wer.py human.txt mt.txt
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Open the two files, human translation and machine translation of the same test dataset, and add the sentences (lines) to two lists using the Python method &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;readlines()&lt;/code&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;From the JIWER library, use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wer&lt;/code&gt; to calculate the WER score on the two lists of sentences, and print the output.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here you can find the code that reflects these steps.&lt;/p&gt;

&lt;script src=&quot;https://gist.github.com/ymoslem/cdc320d9be2cb9a258fd5e0cc5871004.js&quot;&gt;&lt;/script&gt;

&lt;h2 id=&quot;sentence-wer-calculator&quot;&gt;Sentence WER Calculator&lt;/h2&gt;

&lt;p&gt;The previous code computes WER for the whole test dataset, which is the common practice. Still, you might want to calculate WER segment by segment. The following code uses the same &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wer&lt;/code&gt; method from the JIWER library inside a for loop. Finally, it saves the output, i.e. the WER score of each sentence on a new line, into a text file.&lt;/p&gt;

&lt;script src=&quot;https://gist.github.com/ymoslem/f1783b566b3a17b4107a34198daee6a6.js&quot;&gt;&lt;/script&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;So just as we did for computing the BLEU score for Machine Translation, we have now managed to use the WER score as well. As I said earlier, these scores are mainly useful for comparing the quality of different models, rather than judging the acceptability of each individual sentence. In Speech Recognition evaluation, it makes sense to expect the system to convey each uttered word exactly and in the same order; Machine Translation evaluation, however, is trickier because different wordings can still convey the same meaning. Hence, Machine Translation evaluation is still a hot research topic, and in some cases human evaluation is preferred.&lt;/p&gt;

&lt;p&gt;If you have questions, please feel free to comment.&lt;/p&gt;

</description>
        <pubDate>Wed, 04 Mar 2020 00:00:00 +0000</pubDate>
        <link>https://blog.machinetranslation.io/compute-wer-score/</link>
        <guid isPermaLink="true">https://blog.machinetranslation.io/compute-wer-score/</guid>
        
        
        <category>nmt</category>
        
      </item>
    
      <item>
        <title>Computing BLEU Score for Machine Translation</title>
        <description>&lt;p&gt;In this tutorial, I am going to explain how I compute the BLEU score for the Machine Translation output using Python.&lt;/p&gt;

&lt;p&gt;BLEU is simply a measure for evaluating the quality of your Machine Translation system. It does not really matter whether your MT target is from a high-level framework like OpenNMT or Marian, or from a lower-level one like TensorFlow or PyTorch. It does not also matter whether it is a Neural Machine Translation system or a Statistical Machine Translation tool like Moses.&lt;/p&gt;
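&lt;p&gt;Before going through the tooling, it helps to see what BLEU actually computes: the geometric mean of modified n-gram precisions, multiplied by a brevity penalty that punishes translations shorter than the reference. The following is a simplified, illustrative sentence-level implementation; real toolkits such as sacreBLEU add standard tokenization and smoothing on top of this core formula.&lt;/p&gt;

```python
import math
from collections import Counter

# Simplified sentence-level BLEU for illustration only; sacreBLEU adds
# standard tokenization and smoothing on top of this core formula.
def simple_bleu(reference, hypothesis, max_n=4):
    ref, hyp = reference.split(), hypothesis.split()
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        # Modified precision: clip each n-gram count by its count in the reference
        overlap = sum(min(count, ref_ngrams[ngram]) for ngram, count in hyp_ngrams.items())
        total = sum(hyp_ngrams.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, any zero precision zeroes the score
        log_precision_sum += math.log(overlap / total) / max_n
    # Brevity penalty: penalize hypotheses shorter than the reference
    brevity_penalty = min(1.0, math.exp(1 - len(ref) / len(hyp)))
    return 100 * brevity_penalty * math.exp(log_precision_sum)

print(simple_bleu("the cat sat on the mat", "the cat sat on the mat"))  # 100.0
```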

&lt;p&gt;So let’s see the steps I follow to calculate the BLEU score.&lt;/p&gt;

&lt;p&gt;Table of Contents&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#files&quot;&gt;Files Required to Compute BLEU&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#detoc&quot;&gt;Detokenization &amp;amp; BLEU Calculation&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#code&quot;&gt;Code of MT BLEU Calculator&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#args&quot;&gt;File Names as Arguments&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#sent&quot;&gt;Sentence BLEU Calculator&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#multi&quot;&gt;Multi-BLEU&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#meteor&quot;&gt;METEOR&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#accurate&quot;&gt;Final Note: Is BLEU Accurate?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a name=&quot;files&quot;&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;files-required-to-compute-bleu&quot;&gt;Files Required to Compute BLEU&lt;/h2&gt;

&lt;p&gt;To measure BLEU, you need to have two files:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Reference: It is the human translation (target) file of your test dataset. In the code, I will refer to such sentences as “refs” or “test” interchangeably.&lt;/li&gt;
  &lt;li&gt;System: It is the MTed translation/prediction, generated by the machine translation model for the source of the same test dataset used for “Reference”. In the code, I will refer to such sentences as “preds”.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a name=&quot;detoc&quot;&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;detokenization--bleu-calculation&quot;&gt;Detokenization &amp;amp; BLEU Calculation&lt;/h2&gt;

&lt;p&gt;To compute BLEU, I use &lt;a href=&quot;https://github.com/mjpost/sacreBLEU&quot;&gt;sacreBLEU&lt;/a&gt; which works on detokenized text (unless the ‘--force’ parameter is used). For the detokenization step, I use the Python library &lt;a href=&quot;https://github.com/alvations/sacremoses&quot;&gt;SacreMoses&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Why detokenization? Different tokenization tools generate different outputs while to be able to say that BLEU is a standard score, the factors must be the same. That is why sacreBLEU works on detokenized data and applies standard tokenization rules.&lt;/p&gt;

&lt;p&gt;For languages other than Japanese and Chinese, SacreBLEU uses mteval-v13a, the standard tokenization used by WMT.&lt;/p&gt;

&lt;p&gt;&lt;a name=&quot;code&quot;&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;code-of-mt-bleu-calculator&quot;&gt;Code of MT BLEU Calculator&lt;/h2&gt;

&lt;p&gt;BLEU is a corpus-level metric. Here is how you can calculate corpus BLEU using sacreBLEU.&lt;/p&gt;

&lt;script src=&quot;https://gist.github.com/ymoslem/f9e4df761f527996115387a2144912c0.js&quot;&gt;&lt;/script&gt;

&lt;p&gt;&lt;a name=&quot;args&quot;&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3 id=&quot;file-names-as-arguments&quot;&gt;File Names as Arguments&lt;/h3&gt;

&lt;p&gt;In the above script, file names are hardcoded. You can easily pass the file names as arguments instead. To let the Python script understand the arguments, first &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;import sys&lt;/code&gt; and then create two variables: one for the test dataset, e.g. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;target_test&lt;/code&gt;, with the value &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sys.argv[1]&lt;/code&gt; for the test file argument, and one for the MT output, e.g. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;target_pred&lt;/code&gt;, with the value &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sys.argv[2]&lt;/code&gt; for the MTed file argument. Optionally, you can also add an argument for language segmentation. Finally, instead of hardcoding the test dataset name and the MTed file name, use these two variables.&lt;/p&gt;

&lt;p&gt;As you can see in the Python script below, I used &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;argv&lt;/code&gt; which is a list including the arguments given in the command line; the first item [0] is saved for the Python script file name. So to run this script, you can use a similar command line in your CMD or Terminal:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;python3 bleu-script.py test.txt mt.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
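&lt;p&gt;A minimal sketch of this argument handling (the file names below are just examples):&lt;/p&gt;

```python
# Sketch of reading the file names as command-line arguments. In the real
# script, you would call parse_args(sys.argv) after `import sys`, so it can
# be run as: python3 bleu-script.py test.txt mt.txt
def parse_args(argv):
    # argv[0] is the script name itself; the next two items are the file paths
    target_test = argv[1]  # reference (human) translation file
    target_pred = argv[2]  # machine translation output file
    return target_test, target_pred

print(parse_args(["bleu-script.py", "test.txt", "mt.txt"]))  # ('test.txt', 'mt.txt')
```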

&lt;p&gt;Here is the BLEU script, but now with arguments.&lt;/p&gt;

&lt;script src=&quot;https://gist.github.com/ymoslem/70c83345efb9c3aba193aad7102b3016.js&quot;&gt;&lt;/script&gt;

&lt;p&gt;&lt;a name=&quot;sent&quot;&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;sentence-bleu-calculator&quot;&gt;Sentence BLEU Calculator&lt;/h2&gt;

&lt;p&gt;The previous code computes BLEU for the whole test dataset, which is the common practice. Still, you might want to calculate BLEU segment by segment. The following code uses the function &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sentence_bleu()&lt;/code&gt; from the sacreBLEU library inside a for loop. Finally, it saves the output, i.e. the BLEU score of each sentence on a new line, into a file called “bleu.txt”.&lt;/p&gt;

&lt;script src=&quot;https://gist.github.com/ymoslem/27747f9e10c057ee13867f3a61b6a144.js&quot;&gt;&lt;/script&gt;

&lt;p&gt;As we did with the corpus BLEU script, here is the sentence BLEU script, but now with arguments.&lt;/p&gt;

&lt;script src=&quot;https://gist.github.com/ymoslem/c200e30288ff9f4dc745a62410062b10.js&quot;&gt;&lt;/script&gt;

&lt;p&gt;&lt;strong&gt;Update:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The code is now updated to reflect two main changes:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Updates in version 1.4: (a) the reference sentence must be a list; and (b) use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bleu.score&lt;/code&gt; instead of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bleu&lt;/code&gt; to print/write the score.&lt;/li&gt;
  &lt;li&gt;Conclusions from this &lt;a href=&quot;https://github.com/mjpost/sacrebleu/issues/98&quot;&gt;discussion&lt;/a&gt;: add the argument &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;smooth_method=&apos;exp&apos;&lt;/code&gt; if you want to get the same result as when using sacreBLEU from the command line.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a name=&quot;multi&quot;&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;multi-bleu&quot;&gt;Multi-BLEU&lt;/h2&gt;

&lt;p&gt;One of the popular scripts to calculate BLEU is &lt;a href=&quot;https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl&quot;&gt;multi-bleu.perl&lt;/a&gt;. It works very similarly to sacreBLEU.&lt;/p&gt;

&lt;p&gt;According to the script “… you should detokenize then use &lt;a href=&quot;https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/mteval-v14.pl&quot;&gt;mteval-v14.pl&lt;/a&gt;, which has a standard tokenization.”&lt;/p&gt;

&lt;p&gt;To use multi-bleu.perl, you can simply run this command line in your Terminal.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;perl multi-bleu.perl human-translation.txt &amp;lt; mt-pred.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;a name=&quot;meteor&quot;&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;meteor&quot;&gt;METEOR&lt;/h2&gt;

&lt;p&gt;Using BLEU, you might wonder why it does not count some sub-words of the same origin as correct alternatives. So, I came across another metric called &lt;a href=&quot;https://www.cs.cmu.edu/~alavie/METEOR/&quot;&gt;&lt;em&gt;METEOR&lt;/em&gt;&lt;/a&gt;, which addresses this issue to some extent.&lt;/p&gt;

&lt;p&gt;I am quoting Rachael Tatman’s article &lt;a href=&quot;https://towardsdatascience.com/evaluating-text-output-in-nlp-bleu-at-your-own-risk-e8609665a213&quot;&gt;&lt;em&gt;Evaluating Text Output in NLP: BLEU at your own risk&lt;/em&gt;&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;METEOR is similar to BLEU but includes additional steps, like considering synonyms and comparing the stems of words (so that “running” and “runs” would be counted as matches).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I have created the following script for METEOR calculation using NLTK. For the same sentences, METEOR gives me higher scores than BLEU. Unlike many other metrics including BLEU, METEOR mainly works on sentence evaluation rather than corpus evaluation.&lt;/p&gt;

&lt;script src=&quot;https://gist.github.com/ymoslem/5174469f88d9f1fb1660121a663bb87f.js&quot;&gt;&lt;/script&gt;

&lt;p&gt;&lt;a name=&quot;accurate&quot;&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;final-note-is-bleu-accurate&quot;&gt;Final Note: Is BLEU Accurate?&lt;/h2&gt;

&lt;p&gt;Well, BLEU simply compares the human translation to the machine translation. It does not take into consideration synonyms or accepted word order changes.&lt;/p&gt;

&lt;p&gt;Here is an example of the original translation in the corpus:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;ul&gt;
    &lt;li&gt;FR: Notre ONU peut jouer un rôle déterminant dans la lutte contre les menaces qui se présentent à nous, et elle le jouera.&lt;/li&gt;
    &lt;li&gt;EN: Our United Nations can and will make a difference in the fight against the threats before us.&lt;/li&gt;
  &lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;… and here is the machine translation by two of my NMT models:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;ul&gt;
    &lt;li&gt;EN: Our United Nations can play a decisive role in combating the threats we face, and it will do so.&lt;/li&gt;
    &lt;li&gt;EN: Our United Nations can play a decisive role in combating the threats we face, and it will play it.&lt;/li&gt;
  &lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;As you can see, the MT translations are perfectly acceptable; yet if you calculate BLEU against the original sentence, you will get a BLEU score of only ≈ 15.7!&lt;/p&gt;

&lt;p&gt;So BLEU, just as any other automatic measure, can be used as a reference, for example until a pre-agreed score is reached, and you can expect better translations from a model with an overall higher BLEU score. Moreover, some newer metrics are worth considering, such as &lt;a href=&quot;https://github.com/chikiulo/yisi&quot;&gt;YiSi&lt;/a&gt; and &lt;a href=&quot;https://github.com/Unbabel/COMET&quot;&gt;COMET&lt;/a&gt;. Still, some companies would finally run a human evaluation, which we might talk about in another article.&lt;/p&gt;

</description>
        <pubDate>Sun, 26 Jan 2020 00:00:00 +0000</pubDate>
        <link>https://blog.machinetranslation.io/compute-bleu-score/</link>
        <guid isPermaLink="true">https://blog.machinetranslation.io/compute-bleu-score/</guid>
        
        
        <category>nmt</category>
        
      </item>
    
      <item>
        <title>Stand-alone Executable Translator for OpenNMT</title>
        <description>&lt;p&gt;The question was: if I want to have a stand-alone version of OpenNMT to run on Windows, without any manual preparations or installations on the target machine, and does not connect to the Internet for Machine Translation, what are my options to achieve this?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; This post is fairly old; it uses OpenNMT-py 0.9.1 and currently applies only to &lt;strong&gt;Windows&lt;/strong&gt;. If you want to develop a web interface or a stand-alone application on Windows, Linux or Mac, &lt;strong&gt;check the following up-to-date options:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/ymoslem/DesktopTranslator&quot;&gt;&lt;img src=&quot;https://github-readme-stats.vercel.app/api/pin/?theme=graywhite&amp;amp;username=ymoslem&amp;amp;repo=DesktopTranslator&quot; alt=&quot;DesktopTranslator&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/ymoslem/CTranslate-NMT-Web-Interface&quot;&gt;&lt;img src=&quot;https://github-readme-stats.vercel.app/api/pin/?theme=graywhite&amp;amp;username=ymoslem&amp;amp;repo=CTranslate-NMT-Web-Interface&quot; alt=&quot;CTranslate-NMT-Web-Interface&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/ymoslem/OpenNMT-Web-Interface&quot;&gt;&lt;img src=&quot;https://github-readme-stats.vercel.app/api/pin/?theme=graywhite&amp;amp;username=ymoslem&amp;amp;repo=OpenNMT-Web-Interface&quot; alt=&quot;OpenNMT-Web-Interface&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;After some research, I finally managed to create a Translator GUI for Windows, using Python Tkinter, PyInstaller, NSIS and the PyTorch version of OpenNMT.&lt;/p&gt;

&lt;h2 id=&quot;purpose&quot;&gt;Purpose&lt;/h2&gt;
&lt;p&gt;Creating a stand-alone executable of OpenNMT-py on Windows that requires minimal technical experience to install and use, and no Internet connection, for Machine Translation.&lt;/p&gt;

&lt;h2 id=&quot;outcome&quot;&gt;Outcome&lt;/h2&gt;
&lt;p&gt;A proof-of-concept version can be downloaded &lt;a href=&quot;https://s3.us-west-2.amazonaws.com/opennmt-gui/translate-gui.exe&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Tested on Windows 7 and Windows 10. Only 64-bit versions of Windows are supported (PyTorch works on 64-bit Python only).&lt;/p&gt;

&lt;p&gt;The executable can be used to locally translate files, using a local pre-trained model file generated by OpenNMT-py Neural Machine Translation framework.&lt;/p&gt;

&lt;h2 id=&quot;how-it-works&quot;&gt;How it works&lt;/h2&gt;

&lt;h3 id=&quot;installation&quot;&gt;Installation&lt;/h3&gt;

&lt;p&gt;After downloading and launching the installer, it will copy the files to the “Program Files” folder. When the installer finishes, there will be a shortcut on the Desktop called “translate-gui”.&lt;/p&gt;

&lt;h3 id=&quot;usage&quot;&gt;Usage&lt;/h3&gt;

&lt;p&gt;When you run the shortcut “translate-gui” (which refers to translate-gui.exe), the following window opens.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Select the source file (*.txt)&lt;/li&gt;
  &lt;li&gt;Select the model file (*.pt)&lt;/li&gt;
  &lt;li&gt;Click “Translate”.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;../static/img/GUI-1.png&quot; alt=&quot;Translator&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Note: For a quick test, you can download this &lt;a href=&quot;../static/uploads/source.txt&quot;&gt;test source&lt;/a&gt; file (right-click &amp;gt; Save link as) and this &lt;a href=&quot;https://github.com/OpenNMT/OpenNMT-py/raw/master/onmt/tests/test_model.pt&quot;&gt;test model&lt;/a&gt; file.&lt;/p&gt;

&lt;p&gt;If everything works fine, it should create the translation file “yourtranslation.txt” on the Desktop. Responding with “Yes” to this prompt message should open the translation TXT file in Notepad.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../static/img/GUI-2.png&quot; alt=&quot;Translator&quot; /&gt;&lt;/p&gt;

&lt;h3 id=&quot;uninstallation&quot;&gt;Uninstallation&lt;/h3&gt;

&lt;p&gt;To uninstall, simply delete the folder “translate-gui” from the “Program Files” folder.&lt;/p&gt;

&lt;h2 id=&quot;changes-in-the-opennmt-py-code&quot;&gt;Changes in the OpenNMT-py Code&lt;/h2&gt;

&lt;p&gt;Only simple changes to the existing arguments were needed; nothing major.&lt;/p&gt;

&lt;h3 id=&quot;1--onmtoptspy&quot;&gt;1- onmt/opts.py&lt;/h3&gt;

&lt;p&gt;For the arguments &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-src&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-model&lt;/code&gt;, change the attribute &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;required=True&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;required=False&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;group.add(&apos;--src&apos;, &apos;-src&apos;, required=False,
    help=&quot;Source sequence to decode (one line per &quot;
        &quot;sequence)&quot;)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;group.add(&apos;--model&apos;, &apos;-model&apos;, dest=&apos;models&apos;, metavar=&apos;MODEL&apos;,
    nargs=&apos;+&apos;, type=str, default=[], required=False, 
    help=&quot;Path to model .pt file(s). &quot;
        &quot;Multiple models can be specified, &quot;
        &quot;for ensemble decoding.&quot;)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;2--translatepy&quot;&gt;2- translate.py&lt;/h3&gt;

&lt;p&gt;Assigning values from the Tkinter GUI to the following variables:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;opt.src&lt;/code&gt; (source file path – string)&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;opt.models&lt;/code&gt; (model file path – list of strings)&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;opt.output&lt;/code&gt; (target file path – string)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For testing purposes, you can hardcode the values to get an idea how it works (without a GUI).&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;if __name__ == &quot;__main__&quot;:
    parser = _get_parser()
    opt = parser.parse_args()

    # edits
    opt.src = r&quot;D:\Users\yasmin\output\source.txt&quot;
    opt.models = [r&quot;D:\Users\yasmin\output\test_model.pt&quot;]
    opt.output = &quot;yourtranslation.txt&quot;

    main(opt)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;However, in the actual file, I replaced this with a function (e.g. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;go&lt;/code&gt;)&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;def go():
    parser = _get_parser()
    opt = parser.parse_args()

    try:
        opt.src = file_source
        opt.models = [file_model]
        opt.output = &quot;yourtranslation.txt&quot;

        main(opt)

        success = messagebox.askyesno(&apos;Success&apos;, &apos;Your source text has been successfully translated and saved as &quot;yourtranslation.txt&quot;. Do you want to open the target file?&apos;)
        if success:
            webbrowser.open(&quot;yourtranslation.txt&quot;)
    except:
        messagebox.showerror(&apos;Error&apos;, &apos;Make sure you select the right Source and Model files.&apos;)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;… and then assigned this go function to the command attribute of the “Translate” button in the GUI.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;btn_translate = Button(frame3, text=&quot;Translate&quot;, width=20, highlightbackground=&quot;#BBCAE8&quot;, command=go)
btn_translate.pack(padx=1)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note that the variables &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;file_source&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;file_model&lt;/code&gt; get their values from the GUI.&lt;/p&gt;

&lt;p&gt;Final minimum working example can be found &lt;a href=&quot;https://gist.github.com/ymoslem/de033c2886c01e7b18e6f558b33bd24a&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;notes-on-pyinstaller-and-nsis&quot;&gt;Notes on PyInstaller and NSIS&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://www.pyinstaller.org/&quot;&gt;PyInstaller&lt;/a&gt; freezes (packages) Python applications into stand-alone executables. &lt;a href=&quot;https://nsis.sourceforge.io/Main_Page&quot;&gt;NSIS&lt;/a&gt; is an open-source tool to create Windows installers. Here are some notes on using them:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Installing PyInstaller is straightforward through using this command in your CMD/Terminal: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pip3 install pyinstaller&lt;/code&gt; or through installing &lt;a href=&quot;https://pypi.org/project/auto-py-to-exe/&quot;&gt;Auto PY to EXE&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;Consider bundling on Windows 7 and then testing on Windows 10. Otherwise, you might have to deal with some Windows dependencies.&lt;/li&gt;
  &lt;li&gt;To use PyInstaller, specify the Python file name and the argument &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-w&lt;/code&gt; to hide the console window: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pyinstaller -y -w &quot;yourfile.py&quot;&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;At this stage, you created a folder including all the dependencies and an *.exe inside it that will run the Python file.
Do NOT use the “onefile” argument &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-F&lt;/code&gt; of PyInstaller which creates a one-file bundled executable, i.e. instead of having the above-mentioned folder, you will have a big *.exe for the whole thing. Why not? This external *.exe is like an archive that extracts the packaged files (including the internal *.exe) to a temporary directory every time you run it, which takes a long time due to the huge file size of PyTorch and other dependencies. Instead, use NSIS to create an installer which will extract the files only once.&lt;/li&gt;
  &lt;li&gt;NSIS can be &lt;a href=&quot;https://nsis.sourceforge.io/Download&quot;&gt;downloaded&lt;/a&gt;, installed and used on Windows like any application.&lt;/li&gt;
  &lt;li&gt;Before using NSIS, compress the contents of the “dist” directory created by PyInstaller into a *.zip archive using any tool like 7-Zip or WinZip.&lt;/li&gt;
  &lt;li&gt;Launch NSIS, click “Installer based on a .ZIP file”, and click “Open” to locate the package *.zip file you have just created.&lt;/li&gt;
  &lt;li&gt;If you want to make the files installed (extracted) to the “Program Files” of the target user, in the “Default Folder” enter &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$PROGRAMFILES&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;If you want to add a shortcut to the internal *.exe file on the Desktop after installation, you can add something like this to the file “Modern.nsh” at: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;C:\Program Files\NSIS\Contrib\zip2exe\&quot;&lt;/code&gt;. Depending on your OS, the path could be at “Program Files (x86)”. I just added these lines at the end of the file. Note that the exe path should be consistent with the path you selected under NSIS’s “Default Folder” drop-down menu, the folder name, and the exe file name.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Section &quot;Desktop Shortcut&quot; SectionX
    SetShellVarContext current
    CreateShortCut &quot;$DESKTOP\translate-gui.lnk&quot; &quot;$PROGRAMFILES\translate-gui\translate-gui.exe&quot;
SectionEnd
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;ul&gt;
  &lt;li&gt;Finally, click the NSIS “Generate” button, which will create the *.exe installer that can be shipped to other Windows machines.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;future-work&quot;&gt;Future Work&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Adding more OpenNMT-py translation options (and maybe training options) to the GUI.&lt;/li&gt;
  &lt;li&gt;Improving the user experience during installation, usage, and uninstallation.&lt;/li&gt;
  &lt;li&gt;Reducing the required space by removing unnecessary dependencies.&lt;/li&gt;
  &lt;li&gt;Testing the same approach for the TensorFlow version, OpenNMT-tf.&lt;/li&gt;
&lt;/ul&gt;

</description>
        <pubDate>Sat, 18 Jan 2020 00:00:00 +0000</pubDate>
        <link>https://blog.machinetranslation.io/stand-alone-executable-gui-opennmt/</link>
        <guid isPermaLink="true">https://blog.machinetranslation.io/stand-alone-executable-gui-opennmt/</guid>
        
        
        <category>nmt</category>
        
      </item>
    
      <item>
        <title>Domain Adaptation Techniques for Low-Resource Scenarios</title>
<description>&lt;p&gt;Let’s imagine this scenario. You have a new Machine Translation project, and you feel excited. However, you have realized that your training corpus is too small. If you use such a limited corpus, your machine translation model will be very poor, with many out-of-vocabulary words and possibly unidiomatic translations.&lt;/p&gt;

&lt;p&gt;So, what is the solution? Should you just give up? Fortunately, Domain Adaptation can be a good solution to this issue.&lt;/p&gt;

&lt;p&gt;Do you have another corpus that is big enough? Does this big corpus share some characteristics with the small corpus, like the language pair and/or major subject?&lt;/p&gt;

&lt;p&gt;In this case, you can use one of the Domain Adaptation techniques to make use of both the big generic corpus and the small specialized corpus. While the big generic corpus will help avoid out-of-vocabulary words and unidiomatic translations, the small specialized corpus will help enforce the terminology and vocabulary required for your current Machine Translation project.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;
&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#domain-adaptation-use-cases&quot;&gt;Domain Adaptation Use Cases&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#domain-adaptation-approaches&quot;&gt;Domain Adaptation Approaches&lt;/a&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#incremental-training--re-training&quot;&gt;Incremental Training / Re-training&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#ensemble-decoding-of-two-models&quot;&gt;Ensemble Decoding (of two models)&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#combining-training-data&quot;&gt;Combining Training Data&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#data-weighting&quot;&gt;Data Weighting&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#other-domain-adaptation-approaches&quot;&gt;Other Domain Adaptation Approaches&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#final-note-full-words-vs-sub-words&quot;&gt;Final Note: Full Words vs. Sub-words&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#conclusion&quot;&gt;Conclusion&lt;/a&gt;
&lt;br /&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;domain-adaptation-use-cases&quot;&gt;Domain Adaptation Use Cases&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Low-Resource Domains &amp;amp; Institutions&lt;/li&gt;
  &lt;li&gt;Low-Resource Languages&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To give you a clearer idea about Machine Translation Domain Adaptation, let’s consider these two popular use cases:&lt;/p&gt;

&lt;p&gt;In the first use case, we have Institution A and Institution B, or Major Subject A and Minor Subject B. Institution A and Institution B share much vocabulary; however, they have some different terminology (e.g. chairman vs. president; vice-president vs. deputy chairperson). You have a big corpus for Institution A and a very small corpus for Institution B; however, your Machine Translation project is for Institution B with the small corpus. Domain Adaptation can help you to use the small corpus of Institution B for adapting or specializing the NMT model that could be generated from training on the big corpus of Institution A (assuming there are no license restrictions). With Domain Adaptation, our final model will, hopefully, give the right terminology used at Institution B.&lt;/p&gt;

&lt;p&gt;In the second use case, we have a language with very limited bilingual resources, so we do not have enough data to train a good Machine Translation model for it. I am sure you can think of many low-resource languages all over the world. Sometimes, there are high-resource languages that are very similar to such low-resource languages, sharing vocabulary and structure with them. Moreover, sometimes they are not independent languages at all, but rather dialects of a common parent language.&lt;/p&gt;

&lt;p&gt;So the question is: can we use the rich resources of Language A to train a better Machine Translation model for Language B, which otherwise has low resources? Apparently, this is possible through Domain Adaptation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quiz&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Give an example of two languages:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Language A: High resources&lt;/li&gt;
  &lt;li&gt;Language B: Low resources&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Language A and Language B share vocabulary and structure (vocabulary overlaps).&lt;/p&gt;

&lt;p&gt;So this is a quiz. In the comments area, please mention two languages: Language A and Language B. Language A has rich resources while Language B has only very limited resources. However, there is a condition: Language A and Language B must share some vocabulary, meaning that many words in Language A overlap with words in Language B, so such words are the same or very similar in the two languages. Can you think of any example of Language A and Language B?&lt;/p&gt;

&lt;h2 id=&quot;domain-adaptation-approaches&quot;&gt;Domain Adaptation Approaches&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Incremental Training / Re-training&lt;/li&gt;
  &lt;li&gt;Ensemble Decoding (of two models)&lt;/li&gt;
  &lt;li&gt;Combining Training Data&lt;/li&gt;
  &lt;li&gt;Data Weighting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are several approaches to Domain Adaptation, and I am going to discuss four of them.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Incremental Training / Re-training: So you have a big pre-trained model trained on a big corpus, and you continue training it with the new data from the small corpus.&lt;/li&gt;
  &lt;li&gt;Ensemble Decoding (of two models): You have two models and you use both models during translation.&lt;/li&gt;
  &lt;li&gt;Combining Training Data: You merge the two corpora and train one model on the whole combined data.&lt;/li&gt;
  &lt;li&gt;Data Weighting: You give higher weights for specialized segments over generic segments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s see how to apply these techniques and the best practices.&lt;/p&gt;

&lt;h3 id=&quot;incremental-training--re-training&quot;&gt;Incremental Training / Re-training&lt;/h3&gt;

&lt;p&gt;First Step: Training the Base Model&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Preprocessing the base (generic, big) corpus&lt;/li&gt;
  &lt;li&gt;Training the base model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Second Step: Retraining with the New Data&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Preprocessing the new (specialized) corpus&lt;/li&gt;
  &lt;li&gt;Retraining the base model on the specialized corpus&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Incremental Training means to train a model on a corpus and then continue training the same model on a new corpus.&lt;/p&gt;

&lt;p&gt;As part of my Machine Translation research, I managed to achieve successful results in retraining Neural Machine Translation models for the purpose of Domain Adaptation (see: Domain Adaptation Experiment).&lt;/p&gt;

&lt;p&gt;Now you have two corpora. The first corpus is the base corpus; it is generic or less specialized, and it is usually big, like several million segments. The other corpus is specialized, and it usually has a smaller number of translated segments.&lt;/p&gt;

&lt;p&gt;In my experiment, the outcome was very promising and the model learned to use the in-domain terminology.&lt;/p&gt;

&lt;p&gt;There is an important matter to take into consideration while using this Incremental Training approach for Domain Adaptation. If you only use in-domain data in your retraining corpus, you may encounter a case of “catastrophic forgetting”, in which some sentences are translated badly (e.g. with an unidiomatic structure or unknown words) by the retrained model while they are translated better by the base model. To avoid this issue, the retraining corpus should usually be a combination of in-domain and generic data. So, for example, if your original in-domain corpus includes one hundred thousand segments, you can add around fifty thousand generic segments.&lt;/p&gt;

&lt;p&gt;Another consideration is that you need to retrain on the new data for long enough to learn the new vocabulary. So you can see how many epochs or steps you used to train the base model and use a similar number to retrain on the new corpus.&lt;/p&gt;

&lt;p&gt;Note also that depending on the NMT framework you are using, you may have the option to update vocabulary instead of re-initializing the whole network. For example, in OpenNMT-tf (the TensorFlow version of OpenNMT), there is a script that can be used to change the word vocabularies contained in a base model while keeping the learned weights of shared words, so that you can add in-domain terminology during retraining.&lt;/p&gt;

&lt;h3 id=&quot;ensemble-decoding-of-two-models&quot;&gt;Ensemble Decoding (of two models)&lt;/h3&gt;

&lt;p&gt;One of the suggested methods of Domain Adaptation is to “ensemble” the baseline model trained on generic data and the new model retrained on in-domain data. “Ensemble” simply means combining models during translation (not data during training). For more details about Ensemble Decoding, you may want to refer to a useful paper, “Fast Domain Adaptation for Neural Machine Translation”, by Markus Freitag and Yaser Al-Onaizan.&lt;/p&gt;

&lt;p&gt;Actually, there are different techniques for Ensemble Decoding; however, I am giving you an example of how it is used in the OpenNMT-py framework to give you an idea.&lt;/p&gt;

&lt;p&gt;Ensemble Decoding is a method that allows using multiple models simultaneously, combining their prediction distributions by averaging. All models in the ensemble must share a target vocabulary.&lt;/p&gt;

&lt;p&gt;This means that although Ensemble Decoding is used during translation, you should observe some considerations during training. During the preprocessing step, you have to include the vocabulary of both the generic corpus and the in-domain corpus. Later, during training, you first train the base generic model, and then continue training with your specialized data to create a new model. Finally, during translation, you can use the two models simultaneously with Ensemble Decoding. Note here that you do not train the two models independently; rather, the second model is incrementally trained from the last checkpoint of the first model.&lt;/p&gt;

&lt;p&gt;As you can see, Ensemble Decoding can be helpful in diverse situations where you want to utilize multiple models at translation time, and Domain Adaptation is only one such use case, with a special process.&lt;/p&gt;

&lt;h3 id=&quot;combining-training-data&quot;&gt;Combining Training Data&lt;/h3&gt;

&lt;p&gt;Combining your training data is another approach you can use for Domain Adaptation. So you combine both the big generic corpus and the small specialized corpus into only one corpus. Now, you can train your model with this new corpus.&lt;/p&gt;

&lt;p&gt;If you are going to combine two relatively different datasets, then according to Prof. Andrew Ng (video), do not shuffle your combined dataset to generate the training, dev, and test sets; instead, he recommends that you divide your data as follows:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Training Dataset: 100% of the big, generic dataset + most of the small specialized dataset.&lt;/li&gt;
  &lt;li&gt;Dev (validation) Dataset: Portion of the small specialized dataset (e.g. 2500).&lt;/li&gt;
  &lt;li&gt;Test Dataset: Portion of the small specialized dataset (e.g. 2500).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So now, you are concentrating on improving the performance of your model to act well on the Dev (Validation) Dataset, which includes the data you care about.&lt;/p&gt;
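
&lt;p&gt;The split above can be sketched as follows (names and sizes here are toy values of my choosing; in practice you would keep parallel source/target pairs together when splitting):&lt;/p&gt;

```python
import random

# Sketch of the recommended split: all of the generic data plus most of
# the specialized data go to training, while dev and test come only
# from the specialized (in-domain) data. Sizes here are toy values.
generic = ["generic-%d" % i for i in range(100_000)]
specialized = ["specialized-%d" % i for i in range(10_000)]

random.seed(0)
random.shuffle(specialized)            # shuffle only within the small corpus

dev = specialized[:2_500]              # in-domain dev (validation) set
test = specialized[2_500:5_000]        # in-domain test set
train = generic + specialized[5_000:]  # 100% generic + rest of specialized
```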

&lt;p&gt;However, when you think about combining data for the sake of training Neural Machine Translation models, there is a problem! In Neural Machine Translation, we extract only the most frequent vocabulary, the most frequent words in the corpus (~50,000 is common). Now, as you have a big generic corpus and a small specialized one, you might end up with vocabulary from the big corpus only, while the words you want to include from the small corpus will be missing because they are not frequent enough. Plus, the model would prefer terminology choices from the bigger corpus because they are more frequent.&lt;/p&gt;

&lt;p&gt;I can hear you now asking: Can I extract all the words in the corpus? Of course, you can; however, if your corpus is really huge, and your training parameters are memory intensive, you might get an out-of-memory error and not be able to continue training or even start it.&lt;/p&gt;
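
&lt;p&gt;The frequency cut described above can be illustrated with a toy counter (the corpus and the tiny vocabulary size below are made up for illustration; ~50,000 is the realistic setting mentioned above):&lt;/p&gt;

```python
from collections import Counter

# Toy illustration of frequency-based vocabulary extraction: rare
# in-domain terms can fall outside the top-N cut. N is tiny here;
# in practice around 50,000 types is a common setting.
corpus = ["the president opened the meeting".split(),
          "the president closed the meeting".split(),
          "the chairperson spoke".split()]

counts = Counter(token for sentence in corpus for token in sentence)
vocab_size = 3
vocab = {token for token, _ in counts.most_common(vocab_size)}
# "chairperson" occurs only once, so it misses the cut
```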

&lt;p&gt;So what is the solution? What about increasing the specialized data? There is a suggested method: Data Augmentation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Augmentation for Neural Machine Translation:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The purpose of Data Augmentation here is to increase the size of your limited specialized data. In my experiment, I used a statistical approach that is similar to what has been used in Statistical Machine Translation (e.g. Moses), as illustrated by Prof. Philipp Koehn in the chapter “Phrase-based Models” of his book “Statistical Machine Translation”.&lt;/p&gt;

&lt;p&gt;First Step: Extract the word alignment of the specialized corpus. You can use tools like fast_align, eflomal, or efmaral. Each of them is a word aligner that takes parallel sentences as input and produces output in the widely used “Pharaoh format”.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;neue modelle werden erprobt ||| new models are being tested&lt;br /&gt;
0-0 1-1 2-2 2-3 3-4&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Second Step: Generate n-gram phrases. Here, you can see an example:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;neue — new&lt;br /&gt;
neue modelle — new models&lt;br /&gt;
neue modelle werden — new models are being&lt;br /&gt;
neue modelle werden erprobt — new models are being tested&lt;br /&gt;
modelle — models&lt;br /&gt;
modelle werden — models are being&lt;br /&gt;
modelle werden erprobt — models are being tested&lt;br /&gt;
werden — are being&lt;br /&gt;
werden erprobt — are being tested&lt;br /&gt;
erprobt — tested&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;As I mentioned, this approach is very similar to the method used in Statistical Machine Translation; however, I did not move further to calculate probabilities because: 1) this would take a lot of time and memory; and, most importantly, 2) this step is not needed because Neural Machine Translation has its own approach to calculating probabilities. So all we need is a simple filtering step.&lt;/p&gt;

&lt;p&gt;Third Step: Remove exact duplicates. Apply any other filters as needed; for example, you can delete very long sentences or uncommon single words, etc.&lt;/p&gt;
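
&lt;p&gt;The first two steps can be sketched as follows (a simplification for illustration; real phrase extractors also check alignment consistency, and the function name is mine): for each contiguous source n-gram, collect the target words its tokens align to, using the “Pharaoh format” links shown above.&lt;/p&gt;

```python
def phrase_pairs(src, tgt, alignment):
    """Pair each contiguous source n-gram with the target span covered
    by its word alignments (simplified phrase-extraction sketch)."""
    links = [tuple(map(int, link.split("-"))) for link in alignment.split()]
    pairs = []
    for i in range(len(src)):
        for j in range(i, len(src)):
            # target indices aligned to any source word in src[i..j]
            tgt_idx = sorted({t for s, t in links if s in range(i, j + 1)})
            if tgt_idx:
                pairs.append((" ".join(src[i:j + 1]),
                              " ".join(tgt[tgt_idx[0]:tgt_idx[-1] + 1])))
    return pairs

# The running example from the article:
src = "neue modelle werden erprobt".split()
tgt = "new models are being tested".split()
pairs = phrase_pairs(src, tgt, "0-0 1-1 2-2 2-3 3-4")
# yields ("neue", "new"), ("werden", "are being"), and so on
```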

&lt;p&gt;Now, you can combine your increased specialized data with the generic data, and start preprocessing and training your model.&lt;/p&gt;

&lt;p&gt;Note here that we have two datasets, one that uses this n-gram phrase splitting and one that does not. In my experiment, when I trained my model on a dataset where this method was applied to all segments, I got better translations for some segments; however, I noticed literal or unidiomatic translations on other occasions, and in general the quality was lower. So if you are going to use this n-gram phrase splitting in your Neural Machine Translation training, it is recommended to apply it to only a part of the final dataset. That is why we applied this approach only to the specialized dataset and kept the generic dataset as is, without phrase splitting.&lt;/p&gt;

&lt;p&gt;Apart from training a model, you can also use the generated phrase table for more options at translation time.&lt;/p&gt;

&lt;p&gt;Other combination methods may include: removing irrelevant segments from the big corpus, or replacing mismatching terminology based on a glossary during preprocessing.&lt;/p&gt;

&lt;h3 id=&quot;data-weighting&quot;&gt;Data Weighting&lt;/h3&gt;

&lt;p&gt;Data Weighting is another technique that can be useful for Domain Adaptation. In Data Weighting, you can either:&lt;/p&gt;

&lt;p&gt;train one model on two corpora at the same time while giving a higher weight for the specialized corpus over the other generic corpus, or
train the model on only one corpus that includes both generic segments and specialized segments, giving higher weights for specialized segments.
For example, OpenNMT-py (the PyTorch version of OpenNMT) supports using different weights for different corpora; so we define the “data weights” list, which determines the weight each corpus should have; for example, 1 for Corpus A and 7 for Corpus B. This means when building batches, we will take 1 segment from Corpus A, then 7 segments from Corpus B, and so on.&lt;/p&gt;

&lt;p&gt;Similarly, Marian NMT toolkit supports sentence and word-level data weighting strategies, weighting each data item according to its proximity to the in-domain data. In Marian, data weighting requires you to provide a special file with weights of sentences or words.&lt;/p&gt;
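
&lt;p&gt;The batch-building behaviour described for OpenNMT-py can be sketched as follows (an illustration of the 1:7 weighting above with made-up corpora, not the toolkit’s actual code):&lt;/p&gt;

```python
import itertools

def weighted_stream(corpora, weights):
    """Yield examples corpus by corpus, taking `weight` examples from
    each in turn, mimicking data weighting when building batches."""
    cycles = [itertools.cycle(corpus) for corpus in corpora]
    while True:
        for cycle, weight in zip(cycles, weights):
            for _ in range(weight):
                yield next(cycle)

corpus_a = ["generic-1", "generic-2"]                # weight 1
corpus_b = ["indomain-%d" % i for i in range(1, 8)]  # weight 7
stream = weighted_stream([corpus_a, corpus_b], [1, 7])
first_eight = [next(stream) for _ in range(8)]
# 1 segment from Corpus A, then 7 segments from Corpus B
```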

&lt;h3 id=&quot;other-domain-adaptation-approaches&quot;&gt;Other Domain Adaptation Approaches&lt;/h3&gt;

&lt;p&gt;For more state-of-the-art Domain Adaptation approaches, please check my AMTA’s &lt;a href=&quot;https://amtaweb.org/wp-content/uploads/2020/11/NMTDomainAdaptationTechniques.pdf&quot;&gt;presentation&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;final-note-full-words-vs-sub-words&quot;&gt;Final Note: Full Words vs. Sub-words&lt;/h3&gt;

&lt;p&gt;While preparing our data, we usually tokenize segments into complete words. However, it turns out that tokenizing segments into sub-words instead can improve translation quality. Sub-wording is not a technique related only to Domain Adaptation; it is actually recommended for any kind of Neural Machine Translation training.&lt;/p&gt;

&lt;p&gt;The main purpose of sub-wording is to minimize out-of-vocabulary words. As I mentioned earlier, in Neural Machine Translation, there are limitations to vocabulary extraction. If your corpus is really huge, you are forced to extract only the most frequent vocabulary (~50,000 is common), or you might get an out-of-memory error during training. Extracting the most frequent vocabulary will be enough for most translations as long as you translate only sentences in the same domain as your corpus; however, in some cases, you might encounter out-of-vocabulary words.&lt;/p&gt;

&lt;p&gt;Sub-wording can help in some cases:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Word variations in the same language, e.g. “translate vs. translation”&lt;/li&gt;
  &lt;li&gt;Compound words in the same language, e.g. “multi-tasking”. So now your model is not only able to translate “multi-tasking”, but also any other phrase that includes the word “multi”.&lt;/li&gt;
  &lt;li&gt;Shared words between languages&lt;/li&gt;
  &lt;li&gt;Common misspellings, like forgetting accents.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Just as with any other technique, on some occasions sub-wording will not give you better results; however, on many occasions, it will be a game changer. So, it is highly recommended to give it a try.&lt;/p&gt;

&lt;p&gt;Methods of sub-wording include: Byte Pair Encoding (BPE) and unigram language model, both of which are supported by SentencePiece.&lt;/p&gt;
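
&lt;p&gt;As a toy illustration of how BPE builds its sub-word vocabulary (a minimal version of the merge loop, not SentencePiece’s implementation), the sketch below repeatedly merges the most frequent adjacent pair of symbols, so related word variations such as “translate” and “translation” come to share sub-word pieces:&lt;/p&gt;

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merge operations: repeatedly merge the most frequent
    adjacent pair of symbols in a toy word-frequency vocabulary."""
    vocab = {tuple(word): count for word, count in Counter(words).items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Re-segment every word with the new merged symbol.
        vocab = {tuple(" ".join(symbols)
                       .replace(" ".join(best), "".join(best))
                       .split()): count
                 for symbols, count in vocab.items()}
    return merges

# "translate" and "translation" grow the same sub-word pieces first.
merges = bpe_merges(["translate", "translation", "translation"], 4)
```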

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;So in this article, you have seen how Domain Adaptation can be useful when you want to train a Machine Translation model, but you have only limited data for an institution, language, or minor domain. Then, I discussed diverse techniques of Domain Adaptation, including: Incremental Training / Re-training, Ensemble Decoding, Combining Training Data, and Data Weighting. Along the way, I suggested a method for Data Augmentation, to increase the size of the limited specialized corpus. Finally, I explained how sub-wording can help avoid out-of-vocabulary words. If you have questions or suggestions, please feel free to send a comment.&lt;/p&gt;

</description>
        <pubDate>Sat, 18 Jan 2020 00:00:00 +0000</pubDate>
        <link>https://blog.machinetranslation.io/nmt-domain-adaptation/</link>
        <guid isPermaLink="true">https://blog.machinetranslation.io/nmt-domain-adaptation/</guid>
        
        
        <category>nmt</category>
        
      </item>
    
      <item>
        <title>Domain Adaptation Experiment in Neural Machine Translation</title>
        <description>&lt;p&gt;Domain Adaptation is useful for specializing current generic Machine Translation models, mainly when the specialized corpus is too limited to train a separate model. Furthermore, Domain Adaptation techniques can be handy for low-resource languages that share vocabulary and structure with other rich-resource family languages.&lt;/p&gt;

&lt;p&gt;As part of my Machine Translation research, I managed to achieve successful results in retraining Neural Machine Translation models for the purpose of Domain Adaptation using OpenNMT-py (the PyTorch version of OpenNMT). In this article, I am elaborating on the path I took and the achieved outcomes; hopefully, this will be useful for others.&lt;/p&gt;

&lt;p&gt;The base model is a vertical (in-domain) model trained on approx. 13 million segments, and retrained on approx. 123,000 institution-specific segments. Language Pair: French-English. Tokenization: complete words.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;
&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#first-step-training-the-base-model&quot;&gt;First Step: Training the Base Model&lt;/a&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#preprocessing&quot;&gt;Preprocessing&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#training&quot;&gt;Training&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#second-step-retraining-with-the-new-data&quot;&gt;Second Step: Retraining with the New Data&lt;/a&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#preprocessing-1&quot;&gt;Preprocessing&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#retraining&quot;&gt;Retraining&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#outcomes&quot;&gt;Outcomes&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#further-research&quot;&gt;Further Research&lt;/a&gt;
&lt;br /&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a name=&quot;step1&quot;&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;first-step-training-the-base-model&quot;&gt;First Step: Training the Base Model&lt;/h2&gt;

&lt;h3 id=&quot;preprocessing&quot;&gt;Preprocessing&lt;/h3&gt;

&lt;p&gt;Using default options of OpenNMT-py.&lt;/p&gt;

&lt;h3 id=&quot;training&quot;&gt;Training&lt;/h3&gt;
&lt;p&gt;Using the recommended Transformer model options, except that I had only 2 GPUs.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;CUDA_VISIBLE_DEVICES=0,1 python3 train.py -data basedata \ 
    -save_model basemodel -layers 6 -rnn_size 512 -word_vec_size 512 \ 
    -transformer_ff 2048 -heads 8  -encoder_type transformer \ 
    -decoder_type transformer -position_encoding -train_steps 200000 \ 
    -max_generator_batches 2 -dropout 0.1 -batch_size 4096 \ 
    -batch_type tokens -normalization tokens  -accum_count 2 \ 
    -optim adam -adam_beta2 0.998 -decay_method noam \ 
    -warmup_steps 8000 -learning_rate 2 -max_grad_norm 0 \ 
    -param_init 0 -param_init_glorot -label_smoothing 0.1 \ 
    -valid_steps 10000 -save_checkpoint_steps 10000 -world_size 2 \ 
    -gpu_ranks 0 1 -log_file train.log ; sudo shutdown
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;a name=&quot;step2&quot;&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;second-step-retraining-with-the-new-data&quot;&gt;Second Step: Retraining with the New Data&lt;/h2&gt;

&lt;h3 id=&quot;preprocessing-1&quot;&gt;Preprocessing&lt;/h3&gt;

&lt;p&gt;I passed the basedata.vocab.pt file to the parameter -src_vocab. There is no need for -tgt_vocab, but use -share_vocab as well (&lt;a href=&quot;http://forum.opennmt.net/t/incremental-learning-in-domain-adaptation-retraining-in-pytorch-version-opennmt/2417&quot;&gt;reference&lt;/a&gt;). Actually, only -src_vocab supports *.vocab.pt files, and adding the file to -tgt_vocab will cause an error.&lt;/p&gt;

&lt;p&gt;I also used -src_seq_length 200 because I have long sentences, but you can use the default (50) or whatever you need.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;python3 preprocess.py -train_src newdata.fr -train_tgt newdata.en \ 
    -save_data newdata -src_seq_length 200 -tgt_seq_length 200 \ 
    -src_vocab basedata.vocab.pt -dynamic_dict -share_vocab \ 
    -log_file preprocess-new.log
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;retraining&quot;&gt;Retraining&lt;/h3&gt;

&lt;p&gt;I used -train_from with the last-step checkpoint of the base model, retraining the model for an extra 10,000 steps. Note that the old model was trained for 200,000 steps; so to set the extra 10,000 steps in retraining, -train_steps must be 210,000, because retraining reuses the previous arguments unless you use the argument -reset_optim.&lt;/p&gt;

&lt;p&gt;Note also that the second machine had 8 GPUs; so with the same batch size, 10,000 steps on 8 GPUs are similar to 40,000 steps on 2 GPUs (&lt;a href=&quot;https://forum.opennmt.net/t/multi-gpus-is-slower-than-single-gpu/981/12&quot;&gt;reference&lt;/a&gt;). Calculating steps in the first place was tricky, because the batch type here depends on tokens, not sentences, and there are multiple GPUs (&lt;a href=&quot;https://github.com/OpenNMT/OpenNMT-py/issues/866&quot;&gt;reference&lt;/a&gt;). I used the sequence length from the preprocessing step as a reference (actually half of it, because not many sentences reach 200 tokens). This is not very accurate, since it is a maximum rather than an exact number, but it helps you understand what you are doing. The ultimate purpose was to retrain on the new data for long enough to learn the new vocabulary (&lt;a href=&quot;https://forum.opennmt.net/t/problem-with-incremental-in-domain-training/330/7&quot;&gt;reference&lt;/a&gt;).&lt;/p&gt;
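
&lt;p&gt;The rough conversion used here can be written out explicitly (toy arithmetic based on this setup; real throughput also depends on tokens per batch):&lt;/p&gt;

```python
# With the same per-GPU batch size, the effective batch per step scales
# with the number of GPUs, so fewer steps cover the same amount of data.
steps_on_8_gpus = 10_000
equivalent_steps_on_2_gpus = steps_on_8_gpus * 8 // 2
# 10,000 steps on 8 GPUs roughly match 40,000 steps on 2 GPUs
```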

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 train.py -data newdata \
    -train_from basemodel_step_200000.pt -save_model newmodel \ 
    -layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 \ 
    -heads 8 -encoder_type transformer -decoder_type transformer \ 
    -position_encoding -train_steps 210000 -max_generator_batches 2 \ 
    -dropout 0.1 -batch_size 4096 -batch_type tokens \ 
    -normalization tokens -accum_count 2 -optim adam -adam_beta2 0.998 \ 
    -decay_method noam -warmup_steps 8000 -learning_rate 2 \ 
    -max_grad_norm 0 -param_init 0 -param_init_glorot \ 
    -label_smoothing 0.1 -save_checkpoint_steps 10000 -world_size 8 \ 
    -gpu_ranks 0 1 2 3 4 5 6 7 -log_file retrain.log ; sudo shutdown
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Retraining took 37360 seconds (about 10.38 hours) on an AWS p2.8xlarge machine with 8 GPUs, 12 GB memory each, and 488 GB of RAM.&lt;/p&gt;

&lt;h3 id=&quot;outcomes&quot;&gt;Outcomes&lt;/h3&gt;
&lt;p&gt;When I started retraining with OpenNMT-py, I was not sure whether the model would only learn new vocabulary or would also replace vocabulary, because it was usually said that OpenNMT-py is not the best for retraining, as it does not have an update-vocabulary option, unlike the TensorFlow version, OpenNMT-tf.&lt;/p&gt;

&lt;p&gt;However, the outcome is very promising. The model learnt to use the institution-based terminology. Here is one simple example to get an idea: the base model translates the French words “président” and “vice-président” as “president” and “vice-president” in English respectively while the retrained model translates them as “chairperson” and “deputy chairperson” respectively, which are the adopted English terms in the institution.&lt;/p&gt;

&lt;h3 id=&quot;further-research&quot;&gt;Further Research&lt;/h3&gt;
&lt;p&gt;The issue I noticed, though, is that some sentences are translated badly (e.g. with an unidiomatic structure or UNKs) by the retrained model while they are translated better by the base model. I am not sure why, and I wonder whether this could be caused by an exaggerated number of retraining steps, so I have to test this. Another suggestion I got on the OpenNMT forum, which I am going to try, is that this may be a case of “catastrophic forgetting”; usually the retraining data should be a combination of in-domain and generic data. Still, note that my base model was not trained on generic data, but rather on a dataset from the same domain as the new dataset. As a workaround, I believe I can offer translations from the two models and let the user select, or automatically select the best translation based on automatic evaluation. So I am going to conduct more experiments and report the outcomes.&lt;/p&gt;

&lt;p&gt;So that is it. If you have questions or suggestions, please let me know.&lt;/p&gt;

</description>
        <pubDate>Sat, 27 Jul 2019 00:00:00 +0000</pubDate>
        <link>https://blog.machinetranslation.io/domain-adaptation-neural-machine-translation/</link>
        <guid isPermaLink="true">https://blog.machinetranslation.io/domain-adaptation-neural-machine-translation/</guid>
        
        
        <category>nmt</category>
        
      </item>
    
  </channel>
</rss>
