Yasmin Moslem

NLP Researcher

WER Score for Machine Translation

04 Mar 2020 » nmt

Update: Currently, it is recommended to use SacreBLEU for calculating BLEU, ChrF, and WER.

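For reference, here is a minimal sketch of what corpus-level scoring could look like with SacreBLEU's Python API. The sentences below are invented for illustration, and note that, as far as I know, the edit-distance metric SacreBLEU ships is TER rather than WER itself:

from sacrebleu.metrics import BLEU, CHRF, TER

# Invented example sentences for illustration
hyps = ["the cat sat on a mat", "i like apples"]              # MT output
refs = [["the cat sat on the mat", "i like green apples"]]    # one inner list per reference set

print(BLEU().corpus_score(hyps, refs))   # BLEU
print(CHRF().corpus_score(hyps, refs))   # chrF
print(TER().corpus_score(hyps, refs))    # TER (edit-distance based)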


Word Error Rate (WER) computes the minimum Edit Distance between the human-generated sentence and the machine-predicted sentence. In other tutorials, I explained how to use Python to compute BLEU and Edit Distance, and in this tutorial, I am going to explain how to calculate the WER score.
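Formally, WER counts the word substitutions (S), deletions (D), and insertions (I) needed to turn the machine prediction into the reference, divided by the number of words (N) in the reference: WER = (S + D + I) / N. For example, if the reference is “the cat sat on the mat” (6 words) and the prediction is “the cat sat on a mat”, there is one substitution and no deletions or insertions, so WER = 1/6 ≈ 0.17.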

For this WER score tutorial, I am going to use the Python library, JIWER.


Table of Contents

  1. Files Required to Compute WER
  2. Corpus WER Calculator
  3. Sentence WER Calculator
  4. Conclusion

Files Required to Compute WER

To measure the WER score, you need two files:

  1. Ground Truth: the reference human translation (target) file of your test dataset. In the code, I will refer to these sentences as “refs” or “test” interchangeably.

  2. Hypothesis: the Machine Translation output for the source side of the same test dataset. In the code, I will refer to these sentences as “preds”.
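If you do not have a real test set at hand yet, you could create two tiny aligned files to experiment with. This is only an illustrative sketch; the file names (human.txt and mt.txt) and the sentences are invented, and line N in one file must correspond to line N in the other:

# Write two toy files to experiment with
refs = ["the cat sat on the mat", "i like green apples"]   # invented reference translations
preds = ["the cat sat on a mat", "i like apples"]          # invented MT output

with open("human.txt", "w") as ref_file:
    ref_file.write("\n".join(refs) + "\n")

with open("mt.txt", "w") as pred_file:
    pred_file.write("\n".join(preds) + "\n")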

Corpus WER Calculator

JIWER can compute the overall WER score over multiple sentences, given two lists that contain the same number of sentences. The following sample code is quoted from JIWER's page:

from jiwer import wer

ground_truth = ["hello world", "i like monthy python"]
hypothesis = ["hello duck", "i like python"]

error = wer(ground_truth, hypothesis)
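If I read JIWER's behaviour correctly, when it receives two lists it pools the edit counts over all sentence pairs rather than averaging per-sentence scores, so the sample above should give roughly 0.33: two word errors (“duck” for “world”, and the missing “monthy”) against six reference words. Printing the result is enough to check:

print(error)  # expected to be around 0.333 for the sample above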

Now, let’s apply the same concept to the two files.

  1. Create the argument list. This is optional, but it is useful for running the file with arguments from CMD/Terminal as follows:
python3 corpus-wer.py human.txt mt.txt
  2. Open the two files, the human translation and the machine translation of the same test dataset, and read their sentences (lines) into two lists using the Python method readlines().

  3. From the JIWER library, use wer to calculate the WER score on the two lists of sentences, and print the output.

Here you can find the code that reflects these steps.

# Corpus WER
# WER score for the whole corpus
# Run this file from CMD/Terminal
# Example Command: python3 corpus-wer.py test_file_name.txt mt_file_name.txt

import sys
from jiwer import wer

target_test = sys.argv[1]  # Test file argument
target_pred = sys.argv[2]  # MTed file argument

# Open the test dataset human translation file
with open(target_test) as test:
    refs = test.readlines()

# Open the translation file by the NMT model
with open(target_pred) as pred:
    preds = pred.readlines()

# Calculate WER for the whole corpus
wer_score = wer(refs, preds)
print("WER Score:", wer_score)
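For instance, with the toy human.txt and mt.txt files sketched earlier, running python3 corpus-wer.py human.txt mt.txt should report a WER of about 0.2: one substitution in the first sentence plus one deletion in the second, against ten reference words in total (again, assuming JIWER pools the edits over the whole corpus).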

Sentence WER Calculator

The previous code computes WER for the whole test dataset, and this is the common practice. Still, you might want to calculate WER segment by segment. The following code uses the same wer method from the JIWER library to achieve this with a for loop. Finally, it saves the output, i.e. the WER score of each sentence on a new line, to a text file.

# Sentence WER
# WER score segment by segment
# Run this file from CMD/Terminal
# Example Command: python3 sentence-wer.py test_file_name.txt mt_file_name.txt

import sys
from jiwer import wer

target_test = sys.argv[1]  # Test file argument
target_pred = sys.argv[2]  # MTed file argument

# Open the test dataset human translation file
with open(target_test) as test:
    refs = test.readlines()

# Open the translation file by the NMT model
with open(target_pred) as pred:
    preds = pred.readlines()

# Name of the output file for the sentence-level scores
wer_file = "wer-" + target_pred + ".txt"

# Calculate WER sentence by sentence and save the results to a file
with open(wer_file, "w+") as output:
    for ref, mt in zip(refs, preds):
        wer_score = wer(ref, mt)
        output.write(str(wer_score) + "\n")

print("Done! Please check the WER file '" + wer_file + "' in the same folder!")
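With the same toy files, the generated output file would contain one score per line: roughly 0.17 for the first sentence (one substitution over six reference words) and 0.25 for the second (one deletion over four reference words).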

Conclusion

So just as we did for Computing BLEU Score for Machine Translation, we have now managed to use the WER score as well. As I said earlier, these scores are mainly useful for comparing the quality of different models, rather than judging the acceptability of the meaning of each individual sentence. When evaluating Speech Recognition, it makes sense to expect the system to convey exactly each uttered word, and in the same order; however, Machine Translation evaluation is somewhat trickier, as different wordings can still convey the same meaning. Hence, Machine Translation evaluation is still a hot research topic, and in some cases human evaluation is preferred.

If you have questions, please feel free to comment.