In this tutorial, I am going to explain how I compute the BLEU score for the Machine Translation output using Python.
BLEU is simply a measure for evaluating the quality of your Machine Translation system. It does not really matter whether your MT output comes from a high-level framework like OpenNMT or Marian, or from a lower-level one like TensorFlow or PyTorch. Nor does it matter whether it is a Neural Machine Translation system or a Statistical Machine Translation tool like Moses.
So let’s see the steps I follow to calculate the BLEU score.
Table of Contents
- Files Required to Compute BLEU
- Detokenization & BLEU Calculation
- Code of MT BLEU Calculator
- File Names as Arguments
- Sentence BLEU Calculator
- Multi-BLEU
- METEOR
- Final Note: Is BLEU Accurate?
Files Required to Compute BLEU
To measure BLEU, you need to have two files:
- Reference: It is the human translation (target) file of your test dataset. In the code, I will refer to such sentences as “refs” or “test” interchangeably.
- System: It is the MT output (prediction), generated by the machine translation model from the source side of the same test dataset used for “Reference”. In the code, I will refer to such sentences as “preds”. A minimal sketch of loading both files follows this list.
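Both files are plain text with one sentence per line, and they must be aligned line by line. Here is a minimal sketch of loading them into Python lists; the file names “target.txt” and “predictions.txt” are placeholders for your own files.

```python
# Load the reference (human) translations and the MT system predictions.
# The file names below are placeholders; use the paths of your own test files.
with open("target.txt", encoding="utf-8") as target_file:
    refs = [line.strip() for line in target_file]

with open("predictions.txt", encoding="utf-8") as pred_file:
    preds = [line.strip() for line in pred_file]

# Both files must be aligned line by line (one sentence per line).
assert len(refs) == len(preds), "Reference and prediction files must have the same number of lines!"
```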
Detokenization & BLEU Calculation
To compute BLEU, I use sacreBLEU, which works on detokenized text (unless the ‘--force’ parameter is used). For the detokenization step, I use the Python library SacreMoses.
Why detokenization? Different tokenization tools generate different outputs, and for BLEU to be a standard, comparable score, the tokenization must be the same for everyone. That is why sacreBLEU works on detokenized data and applies its own standard tokenization rules internally.
For languages other than Japanese and Chinese, sacreBLEU uses mteval-v13a, the standard tokenization used by WMT.
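If your reference or system output is still tokenized (space-separated tokens), you can detokenize it with SacreMoses before passing it to sacreBLEU. Here is a minimal sketch; the language code “en” is just an example, so adjust it to your target language.

```python
from sacremoses import MosesDetokenizer

# Use the language code of your target language; "en" is only an example.
detokenizer = MosesDetokenizer(lang="en")

tokenized_sentence = "This is a sample tokenized sentence ."
# detokenize() expects a list of tokens and returns a plain string.
detokenized_sentence = detokenizer.detokenize(tokenized_sentence.split())

print(detokenized_sentence)  # This is a sample tokenized sentence.
```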
Code of MT BLEU Calculator
BLEU is a corpus-level metric. Here is how you can calculate corpus BLEU using sacreBLEU.
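As a starting point, here is a minimal sketch of a corpus-level BLEU calculation with sacreBLEU, assuming `refs` and `preds` are the detokenized sentence lists described above (shown here with two inline example sentences).

```python
import sacrebleu

# Detokenized sentences: one reference list and one prediction list,
# aligned sentence by sentence.
refs = ["The cat sat on the mat.", "It is raining today."]
preds = ["The cat is sitting on the mat.", "It rains today."]

# corpus_bleu() expects the predictions first, then a list of reference sets
# (a list of lists, to support multiple references per sentence).
bleu = sacrebleu.corpus_bleu(preds, [refs])

print(bleu.score)  # corpus-level BLEU score as a float between 0 and 100
```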