Title: A comprehensive understanding of popular machine translation evaluation metrics
Authors: Md. Adnanul Islam; Md. Saddam Hossain Mukta
Addresses: Department of Computer Science and Engineering, Military Institute of Science and Technology, Dhaka, Bangladesh; Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh
Abstract: Machine translation is one of the pioneering applications of natural language processing and artificial intelligence. Automatic evaluation of machine translation performance is one of the most challenging tasks in the field, as manual evaluation of large volumes of translated documents is infeasible in practice. Thus, to facilitate automatic evaluation of translation performance, several metrics have been introduced and are widely used. Although these evaluation metrics cannot match the reliability of human evaluation, they are commonly employed to assess the translation quality of texts across diverse application domains. This article discusses three such widely used evaluation metrics, BLEU, METEOR, and TER, in detail, demonstrating their calculations step by step. The main novelty of this article lies in its use of several example translations to present and clarify the calculation process of these three popular metrics for measuring the performance, or quality, of machine translation. Moreover, the article presents a comparative analysis of the three metrics on two different datasets to reveal their behavioural similarities and differences.
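To give a flavour of the step-by-step calculations the article walks through, the following is a minimal illustrative sketch of a simplified TER-style score: word-level edit distance divided by reference length. This is an assumption-laden simplification, not the paper's exact procedure; in particular, full TER also allows block shifts, which are omitted here for brevity.

```python
# Simplified TER-style score (assumed sketch, not the paper's exact method):
# number of word-level edits (insert/delete/substitute) divided by the
# number of words in the reference. Full TER additionally counts shifts.

def word_edit_distance(hyp, ref):
    """Levenshtein distance over word tokens."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all hypothesis words
    for j in range(n + 1):
        d[0][j] = j  # insert all reference words
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[m][n]

def simplified_ter(hypothesis, reference):
    hyp, ref = hypothesis.split(), reference.split()
    return word_edit_distance(hyp, ref) / len(ref)

hyp = "the cat sat on mat"
ref = "the cat sat on the mat"
# One insertion over six reference words gives 1/6.
print(round(simplified_ter(hyp, ref), 3))
```

Lower scores indicate fewer edits and hence a translation closer to the reference, which is why TER behaves inversely to precision-based metrics such as BLEU.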
Keywords: evaluation metrics; translation performance; bilingual evaluation understudy; BLEU; METEOR; translation edit rate; TER; machine translation.
International Journal of Computational Science and Engineering, 2022, Vol.25, No.5, pp.467-478
Received: 15 Mar 2021
Accepted: 02 Oct 2021
Published online: 18 Oct 2022