In this tutorial, I am going to explain how I compute the BLEU score for the Machine Translation output using Python.
BLEU is simply a measure for evaluating the quality of your Machine Translation system. It does not really matter whether your MT output comes from a high-level framework like OpenNMT or Marian, or from a lower-level one like TensorFlow or PyTorch. Nor does it matter whether it is a Neural Machine Translation system or a Statistical Machine Translation tool like Moses.
So let’s see the steps I follow to calculate the BLEU score.
Table of Contents
- Files Required to Compute BLEU
- Detokenization & BLEU Calculation
- Code of MT BLEU Calculator
- File Names as Arguments
- Sentence BLEU Calculator
- Multi-BLEU
- METEOR
- Final Note: Is BLEU Accurate?
Files Required to Compute BLEU
To measure BLEU, you need to have two files:
- Reference: It is the human translation (target) file of your test dataset. In the code, I will refer to such sentences as “refs” or “test” interchangeably.
- System: It is the MT output (prediction), generated by the machine translation model from the source side of the same test dataset used for “Reference”. In the code, I will refer to such sentences as “preds”. A minimal sketch of loading both files follows this list.
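Both files are plain text with one sentence per line, and they must be aligned line by line. Here is a minimal sketch of loading them into Python lists; the file names “target.txt” and “predictions.txt” are placeholders for your own files.

```python
# Load the reference (human) translations and the MT system predictions.
# The file names below are placeholders; use the paths of your own test files.
with open("target.txt", encoding="utf-8") as target_file:
    refs = [line.strip() for line in target_file]

with open("predictions.txt", encoding="utf-8") as pred_file:
    preds = [line.strip() for line in pred_file]

# Both files must be aligned line by line (one sentence per line).
assert len(refs) == len(preds), "Reference and prediction files must have the same number of lines!"
```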
Detokenization & BLEU Calculation
To compute BLEU, I use sacreBLEU, which works on detokenized text (unless the ‘--force’ parameter is used). For the detokenization step, I use the Python library SacreMoses.
Why detokenization? Different tokenization tools generate different outputs, and for BLEU to be a standard, comparable score, the tokenization must be the same for everyone. That is why sacreBLEU works on detokenized data and applies its own standard tokenization rules internally.
For languages other than Japanese and Chinese, sacreBLEU uses mteval-v13a, the standard tokenization used by WMT.
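If your reference or system output is still tokenized (space-separated tokens), you can detokenize it with SacreMoses before passing it to sacreBLEU. Here is a minimal sketch; the language code “en” is just an example, so adjust it to your target language.

```python
from sacremoses import MosesDetokenizer

# Use the language code of your target language; "en" is only an example.
detokenizer = MosesDetokenizer(lang="en")

tokenized_sentence = "This is a sample tokenized sentence ."
# detokenize() expects a list of tokens and returns a plain string.
detokenized_sentence = detokenizer.detokenize(tokenized_sentence.split())

print(detokenized_sentence)  # This is a sample tokenized sentence.
```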
Code of MT BLEU Calculator
BLEU is a corpus-level metric. Here is how you can calculate corpus BLEU using sacreBLEU.
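As a starting point, here is a minimal sketch of a corpus-level BLEU calculation with sacreBLEU, assuming `refs` and `preds` are the detokenized sentence lists described above (shown here with two inline example sentences).

```python
import sacrebleu

# Detokenized sentences: one reference list and one prediction list,
# aligned sentence by sentence.
refs = ["The cat sat on the mat.", "It is raining today."]
preds = ["The cat is sitting on the mat.", "It rains today."]

# corpus_bleu() expects the predictions first, then a list of reference sets
# (a list of lists, to support multiple references per sentence).
bleu = sacrebleu.corpus_bleu(preds, [refs])

print(bleu.score)  # corpus-level BLEU score as a float between 0 and 100
```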