Computing BLEU Score for Machine Translation

In this tutorial, I explain how I compute the BLEU score for Machine Translation output using Python.

BLEU is simply a measure for evaluating the quality of your Machine Translation system. It does not really matter whether your MT output comes from a high-level framework like OpenNMT or Marian, or from a lower-level one like TensorFlow or PyTorch. It also does not matter whether it is a Neural Machine Translation system or a Statistical Machine Translation tool like Moses.

So let’s see the steps I follow to calculate the BLEU score.

Table of Contents

Files Required to Compute BLEU
Detokenization & BLEU Calculation
Code of MT BLEU Calculator
File Names as Arguments
Sentence BLEU Calculator
Multi-BLEU
METEOR
Final Note: Is BLEU Accurate?

Files Required to Compute BLEU

To measure BLEU, you need to have two files:

Reference: It is the human translation (target) file of your test dataset. In the code, I will refer to such sentences as “refs” or “test” interchangeably.
System: It is the MTed translation/prediction, generated by the machine translation model for the source of the same test dataset used for “Reference”. In the code, I will refer to such sentences as “preds”.
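
Both files need to be line-aligned: line N of the System file should be the translation of the same source sentence as line N of the Reference file. A quick check before scoring can catch mismatched files early. This is only a minimal sketch; the helper name `check_aligned` is hypothetical and is not part of sacreBLEU.

```python
def check_aligned(refs, preds):
    """Return the number of segments if both lists have the same length.

    Raises ValueError otherwise, since a different line count usually
    means the two files do not belong to the same test dataset.
    """
    if len(refs) != len(preds):
        raise ValueError(
            f"Line count mismatch: {len(refs)} references vs {len(preds)} predictions"
        )
    return len(refs)


# Made-up in-memory example; in practice the lists come from reading the two files.
example_refs = ["The cat sat on the mat .", "It is raining ."]
example_preds = ["The cat is on the mat .", "It rains ."]
print(check_aligned(example_refs, example_preds))
```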

Detokenization & BLEU Calculation

To compute BLEU, I use sacreBLEU which works on detokenized text (unless the ‘--force’ parameter is used). For the detokenization step, I use the Python library SacreMoses.

Why detokenization? Different tokenization tools produce different outputs, and BLEU can only serve as a standard score if every system is scored under the same conditions. That is why sacreBLEU works on detokenized data and applies its own standard tokenization rules.

For languages other than Japanese and Chinese, sacreBLEU uses mteval-v13a, the standard tokenization used by WMT.
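
To see why the tokenization has to be fixed, here is a small illustration in plain Python. It is not how sacreBLEU works internally; it only counts matching tokens for the same sentence pair as split by two hypothetical tokenizers, and the function name `unigram_overlap` is my own.

```python
from collections import Counter


def unigram_overlap(reference_tokens, hypothesis_tokens):
    """Count hypothesis tokens that also occur in the reference (clipped)."""
    ref_counts = Counter(reference_tokens)
    hyp_counts = Counter(hypothesis_tokens)
    return sum(min(count, ref_counts[token]) for token, count in hyp_counts.items())


# The same sentence pair, pre-tokenized by two hypothetical tokenizers.
ref_a = "I do n't know .".split()
hyp_a = "I do n't know it .".split()

ref_b = "I don ' t know .".split()
hyp_b = "I don ' t know it .".split()

print(unigram_overlap(ref_a, hyp_a), "of", len(hyp_a))  # tokenizer A
print(unigram_overlap(ref_b, hyp_b), "of", len(hyp_b))  # tokenizer B
```

The text is identical in both cases, yet the match counts and token totals differ, so any score built on them would differ too.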

Code of MT BLEU Calculator

BLEU is a corpus-level metric. Here is how you can calculate corpus BLEU using sacreBLEU.

	import sacrebleu
	from sacremoses import MosesDetokenizer
	md = MosesDetokenizer(lang='en')


	# Open the test dataset human translation file and detokenize the references
	refs = []

	with open("target.test") as test:
	    for line in test:
	        line = line.strip().split()
	        line = md.detokenize(line)
	        refs.append(line)

	print("Reference 1st sentence:", refs[0])

	refs = [refs] # Yes, it is a list of list(s) as required by sacreBLEU


	# Open the translation file by the NMT model and detokenize the predictions
	preds = []

	with open("target.pred") as pred:
	    for line in pred:
	        line = line.strip().split()
	        line = md.detokenize(line)
	        preds.append(line)

	print("MTed 1st sentence:", preds[0])


	# Calculate and print the BLEU score
	bleu = sacrebleu.corpus_bleu(preds, refs)
	print(bleu.score)
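
The `refs = [refs]` line deserves a second look. The predictions are a flat list of strings, while the references are a list of reference sets: each inner list holds one complete translation of the test dataset, aligned with the predictions. The sketch below uses plain Python and made-up sentences to show that shape with two hypothetical reference translations; the variable names are for illustration only.

```python
# Made-up data for illustration: three segments, two reference translations each.
preds = ["the cat sat", "it rains", "good morning"]
refs_a = ["the cat sat down", "it is raining", "good morning"]
refs_b = ["a cat was sitting", "it rains", "morning"]

# Shape of the references argument: one inner list per reference set.
refs = [refs_a, refs_b]

# Every reference set must be aligned with the predictions.
assert all(len(ref_set) == len(preds) for ref_set in refs)

# Transposing groups the references per segment, which is the shape
# a sentence-level call takes for a single prediction.
refs_per_segment = [list(group) for group in zip(*refs)]
for pred, group in zip(preds, refs_per_segment):
    print(pred, "<-", group)
```

With a single human translation, as in this tutorial, the outer list simply contains one inner list.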

File Names as Arguments

In the above script, file names are hardcoded. You can easily pass the file names as arguments instead. To let the Python script read the arguments, first import sys, and then create two variables: one for the test dataset, e.g. target_test, with the value sys.argv[1], and one for the MT output, e.g. target_pred, with the value sys.argv[2]. Optionally, you can also add an argument for the language, which the scripts here hardcode as lang='en' for the detokenizer. Finally, use these two variables wherever the test dataset name and the MTed file name were hardcoded.

As you can see in the Python script below, I used argv, which is a list of the arguments given on the command line; the first item [0] is reserved for the Python script file name. So to run this script, you can use a command like this in your CMD or Terminal:

python3 compute-bleu-args.py test.txt mt.txt

Here is the BLEU script, but now with arguments.

	# Corpus BLEU with arguments
	# Run this file from CMD/Terminal
	# Example Command: python3 compute-bleu-args.py test_file_name.txt mt_file_name.txt


	import sys
	import sacrebleu
	from sacremoses import MosesDetokenizer
	md = MosesDetokenizer(lang='en')

	target_test = sys.argv[1] # Test file argument
	target_pred = sys.argv[2] # MTed file argument

	# Open the test dataset human translation file and detokenize the references
	refs = []

	with open(target_test) as test:
	    for line in test:
	        line = line.strip().split()
	        line = md.detokenize(line)
	        refs.append(line)

	print("Reference 1st sentence:", refs[0])

	refs = [refs] # Yes, it is a list of list(s) as required by sacreBLEU


	# Open the translation file by the NMT model and detokenize the predictions
	preds = []

	with open(target_pred) as pred:
	    for line in pred:
	        line = line.strip().split()
	        line = md.detokenize(line)
	        preds.append(line)

	print("MTed 1st sentence:", preds[0])


	# Calculate and print the BLEU score
	bleu = sacrebleu.corpus_bleu(preds, refs)
	print("BLEU: ", bleu.score)
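
If you prefer named, self-documenting arguments over positional sys.argv indexes, the standard library's argparse is one alternative. The sketch below is only a suggestion and not part of the script above: the `parse_args` helper and the `--lang` option are my own naming, with `--lang` meant as a value you could pass on to MosesDetokenizer(lang=...).

```python
import argparse


def parse_args(argv=None):
    """Parse the reference and MT file names; --lang is an illustrative extra."""
    parser = argparse.ArgumentParser(
        description="Compute corpus BLEU for an MT output file."
    )
    parser.add_argument("target_test", help="reference (human translation) file")
    parser.add_argument("target_pred", help="MT output (prediction) file")
    parser.add_argument("--lang", default="en", help="language code for the detokenizer")
    return parser.parse_args(argv)


# Passing a list simulates: python3 compute-bleu-args.py test.txt mt.txt --lang fr
args = parse_args(["test.txt", "mt.txt", "--lang", "fr"])
print(args.target_test, args.target_pred, args.lang)
```

Calling `parse_args()` without a list reads the real command line, and running the script with `-h` prints a usage message for free.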

Sentence BLEU Calculator

The previous code computes BLEU for the whole test dataset, and this is the common practice. Still, you might want to calculate BLEU segment by segment. The following code uses the function sentence_bleu() from the sacreBLEU library in a for loop to do this. Finally, it saves the output, i.e. the BLEU score of each sentence on a new line, into a file called “bleu.txt”.

	# BLEU for segment by segment

	import sacrebleu
	from sacremoses import MosesDetokenizer
	md = MosesDetokenizer(lang='en')


	# Open the test dataset human translation file and detokenize the references
	refs = []

	with open("target.test") as test:
	    for line in test:
	        line = line.strip().split()
	        line = md.detokenize(line)
	        refs.append(line)

	print("Reference 1st sentence:", refs[0])

	# Open the translation file by the NMT model and detokenize the predictions
	preds = []

	with open("target.pred") as pred:
	    for line in pred:
	        line = line.strip().split()
	        line = md.detokenize(line)
	        preds.append(line)

	# Calculate BLEU for sentence by sentence and save the result to a file
	with open("bleu.txt", "w+") as output:
	    for line in zip(refs,preds):
	        test = line[0]
	        pred = line[1]
	        print(test, "\t--->\t", pred)
	        bleu = sacrebleu.sentence_bleu(pred, [test], smooth_method='exp')
	        print(bleu.score, "\n")
	        output.write(str(bleu.score) + "\n")

As we did with the corpus BLEU script, here is the sentence BLEU script, but now with arguments.

	# BLEU for segment by segment with arguments
	# Run this file from CMD/Terminal
	# Example Command: python3 compute-bleu-sentence-args.py test_file_name.txt mt_file_name.txt

	import sys
	import sacrebleu
	from sacremoses import MosesDetokenizer
	md = MosesDetokenizer(lang='en')

	target_test = sys.argv[1] # Test file argument
	target_pred = sys.argv[2] # MTed file argument

	# Open the test dataset human translation file and detokenize the references
	refs = []

	with open(target_test) as test:
	    for line in test:
	        line = line.strip().split()
	        line = md.detokenize(line)
	        refs.append(line)

	print("Reference 1st sentence:", refs[0])

	# Open the translation file by the NMT model and detokenize the predictions
	preds = []

	with open(target_pred) as pred:
	    for line in pred:
	        line = line.strip().split()
	        line = md.detokenize(line)
	        preds.append(line)

	# Calculate BLEU for sentence by sentence and save the result to a file
	with open("bleu-" + target_pred + ".txt", "w+") as output:
	    for line in zip(refs,preds):
	        test = line[0]
	        pred = line[1]
	        print(test, "\t--->\t", pred)
	        bleu = sacrebleu.sentence_bleu(pred, [test], smooth_method='exp')
	        print(bleu.score, "\n")
	        output.write(str(bleu.score) + "\n")
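
One possible use of the per-sentence scores is to surface the weakest segments for manual review. The following is a minimal sketch with made-up scores and a hypothetical helper, `lowest_scoring`; it assumes you already have the references, predictions, and scores as three aligned lists.

```python
def lowest_scoring(refs, preds, scores, n=5):
    """Return the n (score, reference, prediction) triples with the lowest scores."""
    triples = sorted(zip(scores, refs, preds), key=lambda triple: triple[0])
    return triples[:n]


# Made-up values for illustration; real scores would be read from the output file.
refs = ["the cat sat down", "it is raining", "good morning"]
preds = ["the cat sat", "rain", "good morning"]
scores = [60.7, 8.1, 100.0]

for score, ref, pred in lowest_scoring(refs, preds, scores, n=2):
    print(f"{score:6.2f}\t{ref}\t--->\t{pred}")
```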

Update:

The code is now updated to reflect two main changes: 1- Updates in version 1.4: a) the reference sentence must be a list; and b) use bleu.score instead of bleu to print/write the score. 2- Conclusions from this discussion: add the argument smooth_method='exp' if you want to get the same result as when using sacreBLEU from the command line.

Multi-BLEU

One of the popular scripts to calculate BLEU is multi-bleu.perl. It computes the same kind of score as sacreBLEU, but on whatever tokenization its input files already have.

That is why the script itself says: “… you should detokenize then use mteval-v14.pl, which has a standard tokenization.”

To use multi-bleu.perl, you can simply run this command line in your Terminal.

perl multi-bleu.perl human-translation.txt < mt-pred.txt

METEOR

When using BLEU, you might wonder why it does not count words of the same origin as correct alternatives. So, I came across another metric called METEOR, which addresses this issue to some extent.

I am quoting Rachael Tatman’s article Evaluating Text Output in NLP: BLEU at your own risk:

METEOR is similar to BLEU but includes additional steps, like considering synonyms and comparing the stems of words (so that “running” and “runs” would be counted as matches).
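
The stem-matching idea in that quote can be illustrated with a toy example. To be clear, this is not METEOR and not the stemmer that METEOR implementations actually use; `crude_stem` is a deliberately naive, hypothetical suffix stripper that only shows why “running” and “runs” fail an exact comparison but match once reduced to a common stem.

```python
def crude_stem(word):
    """Very rough suffix stripping, for illustration only."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Collapse a doubled final consonant left behind by stripping ("runn" -> "run").
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word


def exact_match(a, b):
    """Exact comparison of two words, ignoring case."""
    return a.lower() == b.lower()


def stem_match(a, b):
    """Comparison of two words after crude stemming."""
    return crude_stem(a) == crude_stem(b)


print(exact_match("running", "runs"))  # exact comparison
print(stem_match("running", "runs"))   # comparison after crude stemming
```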

I have created the following script for METEOR calculation using NLTK. For the same sentences, METEOR gives me higher scores than BLEU. Unlike many other metrics including BLEU, METEOR mainly works on sentence evaluation rather than corpus evaluation.

	# Sentence METEOR

	# METEOR mainly works on sentence evaluation rather than corpus evaluation
	# Run this file from CMD/Terminal
	# Example Command: python3 sentence-meteor.py test_file_name.txt mt_file_name.txt

	import sys
	from nltk.translate.meteor_score import meteor_score


	target_test = sys.argv[1] # Test file argument
	target_pred = sys.argv[2] # MTed file argument


	# Open the test dataset human translation file
	with open(target_test) as test:
	    refs = test.readlines()

	#print("Reference 1st sentence:", refs[0])

	# Open the translation file by the NMT model
	with open(target_pred) as pred:
	    preds = pred.readlines()

	meteor_file = "meteor-" + target_pred + ".txt"

	# Calculate METEOR for each sentence and save the result to a file
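	with open(meteor_file, "w+") as output:
	    for line in zip(refs, preds):
	        # Recent NLTK versions expect pre-tokenized input (lists of tokens);
	        # older versions took plain strings instead.
	        test = line[0].split()
	        pred = line[1].split()
	        #print(test, pred)

	        meteor = round(meteor_score([test], pred), 2) # list of references
	        #print(meteor, "\n")
	        output.write(str(meteor) + "\n")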

	print("Done! Please check the METEOR file '" + meteor_file + "' in the same folder!")

Final Note: Is BLEU Accurate?

Well, BLEU simply compares the human translation to the machine translation. It does not take into consideration synonyms or accepted word order changes.

Here is an example of the original translation in the corpus:

FR: Notre ONU peut jouer un rôle déterminant dans la lutte contre les menaces qui se présentent à nous, et elle le jouera.

EN: Our United Nations can and will make a difference in the fight against the threats before us.

… and here is the machine translation by two of my NMT models:

EN: Our United Nations can play a decisive role in combating the threats we face, and it will do so.

EN: Our United Nations can play a decisive role in combating the threats we face, and it will play it.

As you can see, the MT translations are very acceptable; yet if you calculate BLEU against the original sentence, you will get a BLEU score of only ≈ 15.7!
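
To get a feel for why the score is so low, you can count how many n-grams of the first MT output also appear in the reference. The sketch below is a simplification and will not reproduce the score above: it uses lowercased whitespace tokens instead of sacreBLEU's tokenization and leaves out the brevity penalty and smoothing. The function names are my own.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision(reference, hypothesis, n):
    """Clipped n-gram precision of the hypothesis against a single reference."""
    ref_counts = ngrams(reference.lower().split(), n)
    hyp_counts = ngrams(hypothesis.lower().split(), n)
    matches = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return matches / total if total else 0.0


reference = ("Our United Nations can and will make a difference "
             "in the fight against the threats before us.")
hypothesis = ("Our United Nations can play a decisive role "
              "in combating the threats we face, and it will do so.")

for n in range(1, 5):
    print(n, round(ngram_precision(reference, hypothesis, n), 3))
```

About half of the single words match, but only one 4-gram does (“our united nations can”). Longer n-grams break as soon as the wording diverges, even though the meaning is preserved.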

So BLEU, just like any other automatic measure, can be used as a reference point until a pre-agreed score is reached, and you can expect better translations from a model with an overall higher BLEU score. Moreover, some newer metrics are worth considering, such as YiSi and COMET. Still, some companies would ultimately run a human evaluation, which we might talk about in another article.

Yasmin Moslem