Yasmin Moslem

NLP Researcher

Computing BLEU Score for Machine Translation

26 Jan 2020 » nmt

In this tutorial, I am going to explain how I compute the BLEU score for Machine Translation output using Python.

BLEU is simply a measure for evaluating the quality of your Machine Translation system. It does not really matter whether your MT output comes from a high-level framework like OpenNMT or Marian, or from a lower-level one like TensorFlow or PyTorch. Nor does it matter whether it is a Neural Machine Translation system or a Statistical Machine Translation tool like Moses.

So let’s see the steps I follow to calculate the BLEU score.

Table of Contents

  1. Files Required to Compute BLEU
  2. Detokenization & BLEU Calculation
  3. Code of MT BLEU Calculator

Files Required to Compute BLEU

To measure BLEU, you need to have two files:

  1. Reference: the human translation (target) file of your test dataset. In the code, I will refer to these sentences as “refs” or “test” interchangeably.
  2. System: the machine-translated output (prediction), generated by the machine translation model for the source side of the same test dataset used for the reference. In the code, I will refer to these sentences as “preds”.
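
As a starting point, here is a minimal sketch of loading the two files into Python lists, assuming one sentence per line; the file names target_test.txt and predictions.txt are placeholders for your own files:

```python
# A minimal sketch: load the reference and system files into lists,
# assuming one sentence per line. The file names below are placeholders.
with open("target_test.txt", encoding="utf-8") as ref_file:
    refs = [line.strip() for line in ref_file]

with open("predictions.txt", encoding="utf-8") as pred_file:
    preds = [line.strip() for line in pred_file]

# Both files must cover the same test set, line by line.
assert len(refs) == len(preds)
```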

Detokenization & BLEU Calculation

To compute BLEU, I use sacreBLEU, which works on detokenized text (unless the ‘--force’ parameter is used). For the detokenization step, I use the Python library SacreMoses.

Why detokenization? Different tokenization tools generate different outputs, and for BLEU to be a standard, comparable score, the same tokenization must apply across systems. That is why sacreBLEU works on detokenized data and applies its own standard tokenization rules.

For languages other than Japanese and Chinese, sacreBLEU uses mteval-v13a, the standard tokenization used by WMT.
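
For illustration, here is a minimal sketch of detokenizing tokenized MT output with SacreMoses, assuming English text with space-separated tokens:

```python
from sacremoses import MosesDetokenizer

# Build a detokenizer for the target language (English here).
md = MosesDetokenizer(lang="en")

# detokenize() takes a list of tokens and returns a plain string.
tokenized_line = "This is a tokenized sentence ."
detokenized_line = md.detokenize(tokenized_line.split())
print(detokenized_line)  # This is a tokenized sentence.
```

In practice, you would apply this to every line of the system output (and to the reference file, if it is tokenized as well) before passing the sentences to sacreBLEU.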

Code of MT BLEU Calculator

BLEU is a corpus-level metric. Here is how you can calculate corpus BLEU using sacreBLEU.
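
The sketch below uses sacreBLEU’s Python API; it assumes preds and refs are equal-length lists of detokenized sentences, prepared as described above (the two sentence pairs are toy examples for illustration):

```python
import sacrebleu

# Detokenized system outputs and their human references (toy examples).
preds = ["The cat sat on the mat.", "It is raining heavily today."]
refs = ["The cat sat on the mat.", "It rains heavily today."]

# corpus_bleu() takes the hypotheses as a list of strings and the
# references as a list of reference streams (one stream per reference set).
bleu = sacrebleu.corpus_bleu(preds, [refs])

print(bleu.score)  # corpus-level BLEU, a float between 0 and 100
```

If you prefer the command line, the same result can typically be obtained with something like `cat predictions.txt | sacrebleu target_test.txt`, using the file names from the loading sketch above.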