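Word Error Rate (WER) is based on the minimum edit distance, counted in words, between the human-generated sentence and the machine-predicted sentence: the number of word substitutions, deletions, and insertions, divided by the number of words in the human-generated sentence. In other tutorials, I explained how to use Python to compute BLEU and Edit Distance; in this tutorial, I am going to explain how to calculate the WER score.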
For this WER score tutorial, I am going to use the Python library, JIWER.
Files Required to Compute WER
To measure WER score, you need to have two files:
1- Ground Truth: It is the reference human translation (target) file of your test dataset. In the code, I will refer to such sentences as “refs” or “test” interchangeably.
2- Hypothesis: It is the Machine Translation prediction for the source of the same test dataset used for “Ground Truth”. In the code, I will refer to such sentences as “preds”.
Corpus WER Calculator
JIWER can compute the overall WER score on multiple sentences, given two lists that contain the same number of sentences. I am quoting sample code from JIWER’s page as follows:
from jiwer import wer
ground_truth = ["hello world", "i like monthy python"]
hypothesis = ["hello duck", "i like python"]
error = wer(ground_truth, hypothesis)
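To see where such a score comes from, here is a minimal sketch that recomputes the corpus-level WER for the same two lists in plain Python. The function names `word_edit_distance` and `corpus_wer_by_hand` are my own, chosen for illustration; this is not JIWER's implementation, which may differ in details such as text normalization. It assumes whitespace tokenization and pools the errors over all sentences before dividing by the total number of reference words.

```python
def word_edit_distance(ref_words, hyp_words):
    """Minimum number of word substitutions, deletions, and insertions."""
    # previous[j]: distance between the reference words seen so far
    # and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def corpus_wer_by_hand(refs, preds):
    """Total word errors divided by total reference words."""
    errors = 0
    ref_word_count = 0
    for ref, pred in zip(refs, preds):
        ref_words = ref.split()
        errors += word_edit_distance(ref_words, pred.split())
        ref_word_count += len(ref_words)
    return errors / ref_word_count


ground_truth = ["hello world", "i like monthy python"]
hypothesis = ["hello duck", "i like python"]
print(corpus_wer_by_hand(ground_truth, hypothesis))
```

The first pair contains one substitution ("world" vs. "duck") and the second pair one deletion ("monthy"), so the sketch counts 2 errors over 6 reference words, i.e. about 0.33.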
Now, let’s apply the same concept to the two files.
1- Create the argument list. This is optional, but it is useful for being able to run the file with arguments from CMD/Terminal as follows:
python3 wer.py human.txt mt.txt
2- Open the two files, the human translation and the machine translation of the same test dataset, and read the sentences (lines) into two lists.
3- From the JIWER library, use wer to calculate the WER score on the two lists of sentences, and print the output.
Note: The argument standardize=True expands abbreviations (contractions), such as he’s, they’re, won’t, let’s, and n’t.
Here you can find the code that reflects these steps.
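As a rough sketch of these three steps, the code could look like the following. The helper names `read_sentences`, `load_parallel`, `corpus_wer`, and `main` are hypothetical, and the files are assumed to be UTF-8 with one sentence per line. The sketch calls `wer` with its two positional arguments only, because the `standardize` argument may not be available in every JIWER version.

```python
def read_sentences(path):
    """Read a text file into a list of sentences, one per line."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f]


def load_parallel(ref_path, pred_path):
    """Load references and predictions and check that they line up."""
    refs = read_sentences(ref_path)
    preds = read_sentences(pred_path)
    if len(refs) != len(preds):
        raise ValueError(
            f"{ref_path} has {len(refs)} lines, "
            f"but {pred_path} has {len(preds)} lines"
        )
    return refs, preds


def corpus_wer(ref_path, pred_path):
    """Corpus-level WER between a reference file and a prediction file."""
    # Imported here so the file helpers stay usable without JIWER installed.
    from jiwer import wer

    refs, preds = load_parallel(ref_path, pred_path)
    return wer(refs, preds)


def main(argv):
    # argv[1]: human translation file, argv[2]: machine translation file
    score = corpus_wer(argv[1], argv[2])
    print("WER:", score)
```

Ending the file with a call to `main(sys.argv)` (after `import sys`, under the usual `if __name__ == "__main__":` guard) would make it runnable as `python3 wer.py human.txt mt.txt`.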
Sentence WER Calculator
The previous code computes WER for the whole test dataset, and this is the common practice. Still, you might want to calculate WER segment by segment. The following code uses the same method wer from the JIWER library to achieve this task with a for loop. Finally, it saves the output, i.e. the WER score for each sentence on a new line, to a text file.
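A minimal sketch of this per-sentence loop is shown below. The function names `sentence_wers` and `write_scores` and the output file name `wer_scores.txt` are placeholders of my own, and the two lists are assumed to be already loaded and of equal length.

```python
def sentence_wers(refs, preds):
    """Compute one WER score per (reference, prediction) pair."""
    # Imported here so write_scores stays usable without JIWER installed.
    from jiwer import wer

    scores = []
    for ref, pred in zip(refs, preds):
        scores.append(wer(ref, pred))
    return scores


def write_scores(scores, out_path):
    """Save the scores to a text file, one score per line."""
    with open(out_path, "w", encoding="utf-8") as f:
        for score in scores:
            f.write(f"{score}\n")
```

For example, `write_scores(sentence_wers(refs, preds), "wer_scores.txt")` would produce a file whose line numbers match the line numbers of the test dataset, which makes it easy to look up the worst-scoring segments.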
So just as we did for Computing BLEU Score for Machine Translation, we have now managed to use the WER score as well. As I said earlier, these scores are mainly useful for comparing the quality of different models, rather than for judging whether the meaning of each individual sentence is acceptable. When evaluating Speech Recognition, it makes sense to want the system to convey each uttered word exactly and in the same order; however, Machine Translation evaluation is somewhat tricky, as different wordings can still convey the same meaning. Hence, Machine Translation evaluation is still a hot research topic, and in some cases, human evaluation is preferred.
If you have questions, please feel free to comment below.