In this tutorial, I am going to explain how I compute the BLEU score for Machine Translation output using Python.
BLEU is simply a measure for evaluating the quality of your Machine Translation system. It does not really matter whether your MT output comes from a high-level framework like OpenNMT or Marian, or from a lower-level one like TensorFlow or PyTorch. It also does not matter whether it is a Neural Machine Translation system or a Statistical Machine Translation tool like Moses.
So let’s see the steps I follow to calculate the BLEU score.
Table of Contents
- Files Required to Compute BLEU
- Detokenization & BLEU Calculation
- Code of MT BLEU Calculator
- File Names as Arguments
- Sentence BLEU Calculator
- Final Note: Is BLEU Accurate?
Files Required to Compute BLEU
To measure BLEU, you need to have two files:
- Reference: It is the human translation (target) file of your test dataset. In the code, I will refer to such sentences as “refs” or “test” interchangeably.
- System: It is the MTed translation/prediction, generated by the machine translation model for the source of the same test dataset used for “Reference”. In the code, I will refer to such sentences as “preds”.
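Before scoring, it can help to confirm that the two files are aligned line by line. The following is a minimal sketch, assuming one sentence per line and UTF-8 encoding; the function name load_parallel and the file names passed to it are placeholders, not part of the original scripts.

```python
from pathlib import Path


def load_parallel(ref_path, pred_path):
    """Read references and predictions, one sentence per line, and check alignment."""
    refs = Path(ref_path).read_text(encoding="utf-8").splitlines()
    preds = Path(pred_path).read_text(encoding="utf-8").splitlines()
    # BLEU compares sentence i of the MT output with sentence i of the reference,
    # so a different number of lines means the files are not aligned.
    if len(refs) != len(preds):
        raise ValueError(
            f"Line count mismatch: {len(refs)} references vs {len(preds)} predictions"
        )
    return refs, preds
```

For example, load_parallel("test.txt", "mt.txt") would return the two lists, or raise an error if the files have different lengths.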
Detokenization & BLEU Calculation
Why detokenization? Different tokenization tools generate different outputs, and for BLEU to serve as a standard, comparable score, the tokenization must be the same across systems. That is why sacreBLEU works on detokenized data and applies its own standard tokenization rules.
For languages other than Japanese and Chinese, sacreBLEU uses mteval-v13a, the standard tokenization used by WMT.
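To see why tokenization affects the score, compare how two tokenization schemes split the same sentence. The two toy tokenizers below are illustrative assumptions only; they are not the rules that sacreBLEU or mteval-v13a apply.

```python
import re


def whitespace_tokenize(text):
    # Splits on whitespace only, so punctuation stays attached to words.
    return text.split()


def punctuation_tokenize(text):
    # Separates runs of word characters from individual punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


sentence = "Hello, world!"
print(whitespace_tokenize(sentence))   # ['Hello,', 'world!']
print(punctuation_tokenize(sentence))  # ['Hello', ',', 'world', '!']
```

Since BLEU counts matching n-grams over tokens, these two splits produce different n-gram counts for the same text, and therefore potentially different scores.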
Code of MT BLEU Calculator
BLEU is a corpus-level metric. Here is how you can calculate corpus BLEU using sacreBLEU.
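A minimal sketch of such a calculation, assuming sacreBLEU is installed (pip install sacrebleu) and that both files are detokenized with one sentence per line. The helper names read_lines and corpus_bleu_from_files are my own placeholders; corpus_bleu() and the .score attribute come from the sacreBLEU library.

```python
def read_lines(path):
    """Read a UTF-8 text file into a list of sentences, one per line."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f]


def corpus_bleu_from_files(ref_path, pred_path):
    """Return the corpus-level BLEU score for two aligned, detokenized files."""
    # Third-party import kept local so the file helper above works without it.
    import sacrebleu

    refs = read_lines(ref_path)
    preds = read_lines(pred_path)
    # corpus_bleu takes the hypotheses first, then a list of reference sets;
    # with one human translation per sentence, that list has a single element.
    bleu = sacrebleu.corpus_bleu(preds, [refs])
    return bleu.score
```

Calling corpus_bleu_from_files("test.txt", "mt.txt") would return a single score for the whole test set; the file names here are hardcoded placeholders.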
File Names as Arguments
In the above script, file names are hardcoded. You can easily pass the file names as arguments instead. To let the Python script read the arguments, you first need to import sys and then create two variables: one for the test dataset, e.g. target_test, with the value sys.argv[1] for the test file argument, and one for the MT output, e.g. target_pred, with the value sys.argv[2] for the MTed file argument. Optionally, you can also add an argument for language segmentation. Finally, instead of hardcoding the test dataset name and the MTed file name, you can use these two variables.
As you can see in the Python script below, I used argv, which is a list of the arguments given in the command line; the first item, argv[0], is reserved for the Python script file name. So to run this script, you can use a command like this in your CMD or Terminal:
python3 bleu-scrip.py test.txt mt.txt
Here is the BLEU script, but now with arguments.
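A minimal sketch of the argument-based version, assuming sacreBLEU is installed. The variable names target_test and target_pred follow the description above; the helper get_file_names and the usage message are my own additions for illustration.

```python
import sys


def get_file_names(argv):
    """Return (target_test, target_pred) from argv, or None if arguments are missing."""
    # argv[0] is the script name; the two file names follow it.
    if len(argv) != 3:
        return None
    target_test = argv[1]  # human reference file
    target_pred = argv[2]  # MT output file
    return target_test, target_pred


def main(argv):
    names = get_file_names(argv)
    if names is None:
        print(f"Usage: python3 {argv[0]} <reference_file> <mt_file>", file=sys.stderr)
        return 1
    target_test, target_pred = names

    import sacrebleu  # third-party; only needed once the arguments are valid

    with open(target_test, encoding="utf-8") as f:
        refs = [line.strip() for line in f]
    with open(target_pred, encoding="utf-8") as f:
        preds = [line.strip() for line in f]

    # Hypotheses first, then a list containing the single reference set.
    bleu = sacrebleu.corpus_bleu(preds, [refs])
    print(bleu.score)
    return 0


if __name__ == "__main__":
    main(sys.argv)
```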
Sentence BLEU Calculator
The previous code computes BLEU for the whole test dataset, and this is the common practice. Still, you might want to calculate BLEU segment by segment. The following code uses the function sentence_bleu() from the sacreBLEU library to achieve this with a for loop. Finally, it saves the output, i.e. the BLEU score for each sentence on a new line, into a file called “bleu.txt”.
As we did with the corpus BLEU script, here is the sentence BLEU script, but now with arguments.
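A minimal sketch of a sentence-level script, assuming sacreBLEU is installed and the two files are aligned line by line. The function name write_sentence_bleu and its parameters are placeholders of mine; sentence_bleu(), its smooth_method argument, and the .score attribute come from sacreBLEU.

```python
def write_sentence_bleu(ref_path, pred_path, out_path="bleu.txt"):
    """Write one sentence-level BLEU score per line to out_path."""
    import sacrebleu  # third-party; install with: pip install sacrebleu

    with open(ref_path, encoding="utf-8") as f:
        refs = [line.strip() for line in f]
    with open(pred_path, encoding="utf-8") as f:
        preds = [line.strip() for line in f]

    with open(out_path, "w", encoding="utf-8") as out:
        for ref, pred in zip(refs, preds):
            # The reference is passed as a list, and the numeric value
            # is read from the .score attribute of the returned object.
            bleu = sacrebleu.sentence_bleu(pred, [ref], smooth_method="exp")
            out.write(f"{bleu.score}\n")
    # Number of sentence pairs that were scored.
    return min(len(refs), len(preds))
```

To take the file names from the command line, the function could be called as write_sentence_bleu(sys.argv[1], sys.argv[2]) after importing sys.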
The code is now updated to reflect two main changes:
1- Updates in sacreBLEU version 1.4: a) the reference sentence must be passed as a list; and b) use bleu.score instead of bleu to print/write the score.
2- Conclusions from this discussion: add the argument smooth_method='exp' if you want to get the same result as when using sacreBLEU from the command line.
One of the popular scripts to calculate BLEU is multi-bleu.perl. It works very similarly to sacreBLEU.
According to the script “… you should detokenize then use mteval-v14.pl, which has a standard tokenization.”
To use multi-bleu.perl, you can simply run this command line in your Terminal.
perl multi-bleu.perl human-translation.txt < mt-pred.txt
Using BLEU, you might wonder why it does not count words that share the same stem as correct alternatives. While looking into this, I came across another metric called METEOR, which addresses this issue to some extent.
I am quoting Rachael Tatman’s article Evaluating Text Output in NLP: BLEU at your own risk:
METEOR is similar to BLEU but includes additional steps, like considering synonyms and comparing the stems of words (so that “running” and “runs” would be counted as matches).
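The stem matching mentioned in the quote can be illustrated with a toy example. This is only a sketch of the idea, not METEOR itself and not the NLTK implementation: toy_stem is a deliberately crude, hypothetical stemmer, and unigram_matches simply counts overlapping words.

```python
def toy_stem(word):
    """Very crude stemmer for illustration only; real tools use a proper stemmer."""
    word = word.lower()
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            word = word[: -len(suffix)]
            break
    # Collapse a doubled final consonant, e.g. "runn" -> "run".
    if len(word) > 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word


def unigram_matches(reference, hypothesis, normalize=lambda w: w):
    """Count hypothesis tokens that also appear in the reference after normalization."""
    ref_tokens = {normalize(w) for w in reference.split()}
    return sum(1 for w in hypothesis.split() if normalize(w) in ref_tokens)


ref = "she runs every morning"
hyp = "she is running every morning"
print(unigram_matches(ref, hyp))            # 3 exact matches: she, every, morning
print(unigram_matches(ref, hyp, toy_stem))  # 4 once "running" and "runs" share a stem
```

With exact matching, as in BLEU, "running" gets no credit against "runs"; once both are reduced to a common stem, it counts as a match.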
I have created the following script for METEOR calculation using NLTK. For the same sentences, METEOR gives me higher scores than BLEU. Unlike many other metrics including BLEU, METEOR mainly works on sentence evaluation rather than corpus evaluation.
Final Note: Is BLEU Accurate?
Well, BLEU simply compares the human translation to the machine translation. It does not take into consideration synonyms or accepted word order changes.
Here is an example of the original translation in the corpus:
- FR: Notre ONU peut jouer un rôle déterminant dans la lutte contre les menaces qui se présentent à nous, et elle le jouera.
- EN: Our United Nations can and will make a difference in the fight against the threats before us.
… and here is the machine translation by two of my NMT models:
- EN: Our United Nations can play a decisive role in combating the threats we face, and it will do so.
- EN: Our United Nations can play a decisive role in combating the threats we face, and it will play it.
As you can see, the MT translations are very acceptable; yet if you calculate BLEU against the original sentence, you will get a BLEU score of only ≈ 15.7!
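To see where such a low number comes from, here is a simplified, self-contained BLEU calculation for this sentence pair. It is a sketch under stated assumptions: a toy tokenizer (words and single punctuation marks), a single reference, and no smoothing. It is not sacreBLEU's implementation, although for this particular pair it lands on roughly the same value.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Toy tokenizer (an assumption): words and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_sentence_bleu(reference, hypothesis, max_n=4):
    """Unsmoothed single-reference BLEU on a 0-100 scale."""
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    log_precisions = []
    for n in range(1, max_n + 1):
        ref_counts, hyp_counts = ngrams(ref, n), ngrams(hyp, n)
        # Clip each hypothesis n-gram count by its count in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if matches == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(matches / total))
    # Brevity penalty: only hypotheses shorter than the reference are penalized.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return 100 * bp * math.exp(sum(log_precisions) / max_n)


reference = "Our United Nations can and will make a difference in the fight against the threats before us."
hypothesis = "Our United Nations can play a decisive role in combating the threats we face, and it will do so."
print(round(simple_sentence_bleu(reference, hypothesis), 1))  # 15.7
```

With this tokenization, only 11 of 21 unigrams, 4 of 20 bigrams, 2 of 19 trigrams, and 1 of 18 four-grams in the MT output also appear in the reference. The geometric mean of these precisions is what pulls the score down, even though the translation reads well.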
So BLEU, just like any other automatic measure, can be used for reference until reaching a pre-agreed score, and you can expect a better translation from a model with an overall higher BLEU score. Moreover, some newer metrics are worth considering, such as YiSi and COMET. Still, some companies would finally run a human evaluation, which we might talk about in another article.