Yasmin Moslem

Machine Translation Researcher

MachineTranslation.io


Purely Synthetic Bilingual Data for Machine Translation?

In-domain data scarcity is common in translation settings, due to the lack of specialized datasets and terminology, or the inconsistency and inaccuracy of available in-domain translations. You might be familiar with such a situation: there is a big translation project, but only a tiny in-domain translation memory, or no translation memory at all. In the absence of sufficient domain-specific data required to fine-tune machine translation (MT) systems, adhering to the domain terminology and the client’s style can be challenging. Recently, there has been considerable advancement in training large language models, not only for English, but also for diverse languages. Autoregressive language models, trained to predict the next word in a sequence, include BLOOM, GPT-3, and GPT-J. The question is: can we use these large language models to generate more domain-specific bilingual data?
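
As a flavour of what such a pipeline can look like, here is a minimal sketch that prompts a causal language model for new in-domain sentences with Hugging Face Transformers; the model choice, seed sentence, and prompt format are illustrative assumptions, not necessarily the article’s exact setup.

```python
# A minimal sketch: prompt a causal LM (here GPT-J, an assumption) to
# generate new in-domain sentences from a seed sentence.
from transformers import pipeline

generator = pipeline("text-generation", model="EleutherAI/gpt-j-6B")

seed = "The patient was administered 5 mg of the drug once daily."
prompt = f"Sentences from a medical leaflet:\n1. {seed}\n2."

outputs = generator(prompt, max_new_tokens=60, do_sample=True, top_p=0.9)
print(outputs[0]["generated_text"])
```

Generated sentences can then be paired with their machine translations, e.g. through back-translation, to obtain synthetic bilingual pairs.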

» Read more...


Translation Auto-suggestions: What do Linguists Think?

Translation auto-suggestion and auto-completion are among the important features that can help translators better utilize Machine Translation (MT) systems. In a Computer-Aided Translation (CAT) environment, a translator can make use of the MT word auto-suggestion feature as follows:

  • When the translator types a few words, or clicks a word in a proposed MT translation, a list of word suggestions is displayed.
  • When the translator selects one of the suggestions, the rest of the translation is adjusted accordingly, as in the sketch below.
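
As a minimal sketch of the word auto-suggestion part, CTranslate2 can return alternative words at the position right after a given target prefix; the model path and tokens below are placeholders.

```python
# A minimal sketch of word auto-suggestion with CTranslate2: ask for
# alternative words right after the prefix the translator has typed.
import ctranslate2

translator = ctranslate2.Translator("ende_ctranslate2/", device="cpu")

source = [["▁How", "▁are", "▁you", "?"]]  # pre-tokenized source
prefix = [["▁Wie"]]                       # what the translator typed so far

results = translator.translate_batch(
    source,
    target_prefix=prefix,
    num_hypotheses=5,
    return_alternatives=True,  # expand alternatives after the prefix
)
for hypothesis in results[0].hypotheses:
    print(" ".join(hypothesis))
```

Each hypothesis starts with the same prefix and continues with a different candidate word, which is the list a CAT tool can display; once the translator picks one, the model can re-translate with the extended prefix to adjust the rest of the sentence.
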
» Read more...


Machine Translation Robustness

Let’s talk briefly about the concept of “robustness” of neural machine translation (NMT) systems. While robustness should be emphasized when building any NMT system, even models for high-resource languages with plenty of data can still face linguistic challenges.

» Read more...


Machine Translation Models: How to Build and Deploy

This is a Neural Machine Translation (NMT) tutorial with OpenNMT-py and relevant tools. It covers data preprocessing, model training, evaluation, and deployment. The tutorial was put together as part of a mentorship activity I organised in 2022.
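
As a flavour of the preprocessing step, here is a minimal sketch of training a subword model with SentencePiece; the file names, vocabulary size, and model type are placeholder assumptions rather than the tutorial’s exact settings.

```python
# A minimal sketch of subword preprocessing with SentencePiece.
import sentencepiece as spm

# Train a subword model on the source-side training data (placeholder file)
spm.SentencePieceTrainer.train(
    input="train.en",
    model_prefix="source",
    vocab_size=32000,
    model_type="bpe",
)

# Tokenize a sentence into subword pieces
sp = spm.SentencePieceProcessor(model_file="source.model")
print(sp.encode("Hello world!", out_type=str))
```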

» Read more...


Mixed Fine-Tuning - Domain Adaptation That Works!

Training a robust generic model is an interesting task; however, when you want to customize your Machine Translation model to observe the terminology and style of a certain domain or client, Domain Adaptation comes into play. In previous posts, we discussed several approaches to Domain Adaptation. In this post, we are going to concentrate on a very effective approach called Mixed Fine-Tuning.
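
In a nutshell, Mixed Fine-Tuning continues training a pre-trained generic model on a mix of the (oversampled) in-domain data and generic data. Here is a minimal sketch of preparing such a mixed corpus; the file names and oversampling ratio are placeholder assumptions.

```python
# A minimal sketch: oversample the small in-domain corpus and mix it
# with the generic corpus before fine-tuning the generic model.
import random

def read_pairs(src_path, tgt_path):
    with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
        return list(zip(src.read().splitlines(), tgt.read().splitlines()))

generic = read_pairs("generic.src", "generic.tgt")
in_domain = read_pairs("indomain.src", "indomain.tgt")

# Oversample the in-domain data to roughly balance the two portions
ratio = max(1, len(generic) // max(1, len(in_domain)))
mixed = generic + in_domain * ratio
random.shuffle(mixed)

with open("mixed.src", "w", encoding="utf-8") as src, open("mixed.tgt", "w", encoding="utf-8") as tgt:
    for s, t in mixed:
        src.write(s + "\n")
        tgt.write(t + "\n")
```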

» Read more...


Notes on Multilingual Machine Translation

Multilingual NMT (MNMT) is characterized by its scalability across any number of languages, instead of having to build individual models for each language pair. MNMT systems are also desirable because training models with data from diverse language pairs might help a low-resource language acquire extra knowledge from other languages. Moreover, MNMT systems tend to generalize better due to exposure to diverse languages, leading to improved translation quality compared to bilingual NMT systems. This particular phenomenon is known as Transfer Learning or Knowledge Transfer (Dabre et al., 2020).

» Read more...


Low-Resource Neural Machine Translation

Developing Neural Machine Translation (NMT) models for low-resource languages is a popular topic, in both industry and academia. In this tutorial, we are going to discuss tagged back-translation as one of the most effective and efficient approaches to training more robust models. Tagged back-translation is not only useful for low-resource languages, but also for other scenarios of data sparsity.
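
To illustrate the idea: monolingual target-language text is translated into the source language with a reverse model, and each synthetic source sentence is marked with a reserved tag so the model can tell it apart from authentic data. A minimal sketch, assuming a reverse model converted to CTranslate2 and pre-tokenized input (all names are placeholders):

```python
# A minimal sketch of tagged back-translation with CTranslate2.
import ctranslate2

# Reverse (target-to-source) model, converted to CTranslate2 (placeholder path)
reverse_translator = ctranslate2.Translator("reverse_model/", device="cpu")

monolingual_target = [["▁Das", "▁ist", "▁ein", "▁Test", "."]]
results = reverse_translator.translate_batch(monolingual_target)

TAG = "<BT>"  # marks the source side as synthetic
for target_tokens, result in zip(monolingual_target, results):
    synthetic_source = [TAG] + result.hypotheses[0]
    print(" ".join(synthetic_source), "|||", " ".join(target_tokens))
```

The tagged synthetic pairs are then mixed with the authentic parallel data for training.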

» Read more...


Web Interface for Machine Translation

Today, we will create a very simple Machine Translation (MT) Web Interface for OpenNMT-py, OpenNMT-tf and FairSeq models using CTranslate2 and Streamlit.
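
A minimal sketch of such an app, assuming a CTranslate2 model and a SentencePiece model on disk (the paths are placeholders; the full tutorial goes into more detail):

```python
# A minimal sketch of an MT web interface: save as app.py and run
# "streamlit run app.py".
import ctranslate2
import sentencepiece as spm
import streamlit as st

translator = ctranslate2.Translator("ctranslate2_model/", device="cpu")
sp = spm.SentencePieceProcessor(model_file="sentencepiece.model")

st.title("Machine Translation")
text = st.text_area("Source text")

if st.button("Translate") and text.strip():
    tokens = sp.encode(text, out_type=str)
    results = translator.translate_batch([tokens])
    st.write(sp.decode(results[0].hypotheses[0]))
```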

» Read more...


Adaptive Neural Machine Translation

In a linguistic environment, translations and edits do not stop. Therefore, while periodic fine-tuning of our NMT models can help, there is definitely a need to simultaneously take new translated and edited segments into consideration. Otherwise, the MT system will keep making the same mistakes, not always observing new terminology and style, until a new or fine-tuned version of the model is released. Hence, Online Learning or Online Adaptation comes in handy in such a situation, so that the NMT model can incrementally learn from new translations and edits as it goes along!

» Read more...


Running TensorBoard with OpenNMT

TensorBoard is a tool that provides useful visualizations of how the training of a deep learning model is progressing. It allows you to track and visualize metrics such as accuracy and perplexity. You can use TensorBoard with diverse deep learning frameworks such as TensorFlow and PyTorch. In this tutorial, you will learn how to activate TensorBoard in OpenNMT-tf and OpenNMT-py in different environments.
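
For instance, in OpenNMT-py, TensorBoard logging can be switched on from the training YAML configuration; the log directory below is a placeholder:

```yaml
# Hedged example: enabling TensorBoard in an OpenNMT-py training config
tensorboard: true
tensorboard_log_dir: runs/onmt
```

You can then point TensorBoard at that directory with tensorboard --logdir runs/onmt and open the reported URL in your browser.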

» Read more...


Bash Commands for NLP Engineers

As using Bash commands is inevitable if you work on NLP and MT tasks, I thought it would be useful to list the majority of commands I have learnt to use on a daily basis, thanks to practice, searching, and the helpful colleagues I have met over the years. Obviously, this is not an exhaustive list; however, I hope it includes most of the one-line Bash commands you would need. Please note that the majority of these commands have been tested mainly on Linux.

» Read more...


Pre-trained Neural Machine Translation (NMT) Models

In-domain Neural Machine Translation (NMT) models outperform generic models on the domain for which they are trained. In other words, in-domain models can observe terminology and generate translations that are more in line with a specialized context.

» Read more...


WER Score for Machine Translation

Word Error Rate (WER) computes the minimum Edit Distance between the human-generated sentence and the machine-predicted sentence. In other tutorials, I explained how to use Python to compute BLEU and Edit Distance; in this tutorial, I am going to explain how to calculate the WER score.
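
As a minimal sketch, the jiwer library is one common Python option (the tutorial itself may use a different implementation); the sentences are illustrative only:

```python
# A minimal sketch of computing WER with jiwer.
from jiwer import wer

reference = "the patient was discharged on the second day"
hypothesis = "the patient was discharged the second day"

# WER = (substitutions + deletions + insertions) / reference length
print(wer(reference, hypothesis))
```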

» Read more...


Computing BLEU Score for Machine Translation

In this tutorial, I am going to explain how I compute the BLEU score for the Machine Translation output using Python.
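
As a minimal sketch, sacreBLEU is a widely used implementation (the exact library used in the tutorial may differ):

```python
# A minimal sketch of corpus-level BLEU with sacreBLEU.
from sacrebleu.metrics import BLEU

hypotheses = ["The cat sat on the mat ."]
references = [["The cat is sitting on the mat ."]]  # one reference stream

bleu = BLEU()
print(bleu.corpus_score(hypotheses, references).score)
```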

» Read more...


Stand-alone Executable Translator for OpenNMT

The question was: if I want a stand-alone version of OpenNMT that runs on Windows, requires no manual preparation or installation on the target machine, and does not connect to the Internet for Machine Translation, what are my options?

» Read more...


Domain Adaptation Techniques for Low-Resource Scenarios

Let’s imagine this scenario: you have a new Machine Translation project, and you feel excited. However, you realize that your training corpus is too small. If you use such a limited corpus, your machine translation model will be very poor, with many out-of-vocabulary words and maybe unidiomatic translations.

» Read more...


Domain Adaptation Experiment in Neural Machine Translation

Domain Adaptation is useful for specializing current generic Machine Translation models, mainly when the specialized corpus is too limited to train a separate model. Furthermore, Domain Adaptation techniques can be handy for low-resource languages that share vocabulary and structure with high-resource languages from the same family.

» Read more...