Article

The Task of Post-Editing Machine Translation for the Low-Resource Language

by Diana Rakhimova 1,2,*, Aidana Karibayeva 1,2 and Assem Turarbek 1
1 Department of Information Systems, Al-Farabi Kazakh National University, Almaty 050040, Kazakhstan
2 Institute of Information and Computational Technologies, Almaty 050010, Kazakhstan
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(2), 486; https://doi.org/10.3390/app14020486
Submission received: 2 December 2023 / Revised: 23 December 2023 / Accepted: 2 January 2024 / Published: 5 January 2024
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

In recent years, machine translation has made significant advancements; however, its effectiveness can vary widely depending on the language pair. Languages with limited resources, such as Kazakh, Uzbek, Kalmyk, Tatar, and others, often encounter challenges in achieving high-quality machine translation. Kazakh is an agglutinative language with complex morphology and remains a low-resource language. This article addresses the task of post-editing machine translation for the Kazakh language. The research begins by discussing the history and evolution of machine translation and how it has developed to meet the unique needs of languages with limited resources. The research resulted in the development of a machine translation post-editing system. The system utilizes modern machine learning methods, starting with neural machine translation using the BRNN model in the initial post-editing stage. Subsequently, the transformer model is applied to further edit the text. Complex structural and grammatical forms are processed, and abbreviations are replaced. Practical experiments were conducted on various texts: news publications, legislative documents, IT texts, etc. This article serves as a valuable resource for researchers and practitioners in the field of machine translation, shedding light on effective post-editing strategies to enhance translation quality, particularly in scenarios involving languages with limited resources such as Kazakh and Uzbek. The obtained results were tested and evaluated using specialized metrics: BLEU, TER, and WER.

1. Introduction

The modern world and our future are heavily reliant on applied intelligent systems as new technologies continue to advance each day. One of the tasks of intelligent systems is machine translation from one natural language to another. Machine translation (MT) enables people to communicate regardless of language differences, eliminating language barriers and opening up new avenues for communication. Machine translation is a groundbreaking technology and a significant step in human development [1]. Such translation is helpful when one needs to quickly understand what an interlocutor has written, for example in a letter. The quality of such translation can be low for some language pairs, but in most cases the basic meaning can be understood. Machine translation has become a crucial tool in the fields of linguistics and natural language processing (NLP) in recent decades [2,3]. It facilitates the translation of textual data between different languages, easing communication and the dissemination of information in a globalized world. However, despite significant achievements in machine translation for languages with abundant resources, such as English or Chinese, many low-resource languages remain inadequately represented. Kazakh, which belongs to the Turkic language family, is one such low-resource language [4]. Currently, there are limited linguistic tools and resources available for the Kazakh language, making it challenging to create high-quality machine translation systems [5].
As a result of machine translation (MT), there are always certain shortcomings that can be addressed through post-editing. Post-editing involves human processing of the text after machine translation [6]. Today, many language service providers are developing post-editing techniques and methods for training editors. The first studies on post-editing and related implementations emerged in the 1980s. To develop appropriate guidelines and training, members of the American Machine Translation Association and the European Association for Machine Translation established a Special Interest Group on Post-Editing in 1999 [7].
Post-editing of machine translation (PEMT) is the process of correcting errors made by machine translation systems to enhance the quality of the final translation (García, 2010). This approach is often used to improve translations from English into other languages, primarily for languages with substantial data volume. However, the question of its effectiveness for low-resource languages like Kazakh remains open. Kazakh, being an agglutinative language, presents specific challenges for machine translation. Post-editing of machine translation (PEMT) for the Kazakh language faces several unique challenges related to the language’s morphology, syntax, and lexicon.
In this article, we explore the application of post-editing methods to Kazakh language machine translation. The aim of this research is to determine whether PEMT methods can enhance the quality of machine translation for the Kazakh language, as well as identify the main problems and limitations researchers encounter when working with low-resource languages.

1.1. Review of Research Studies

A comprehensive review of the development of post-editing in machine translation (MT) was provided by the authors of [7,8]. They describe that post-editing was initially conceived as a final step in the MT process, and the post-editor did not need to know both languages; they could edit the text knowing only the target language. Post-editing is increasingly carried out by professional translators, as it is now a client service with its own international standard [9], which requires post-editors to have expertise and professional experience. Modern methods of text post-editing were discussed in [10]. The authors of [11] examined post-editing in neural machine translation (NMT). Neural machine translation achieves significantly higher quality than other machine translation systems (MTSs). The results of neural machine translation are perceived as smoother and more natural than those of statistical machine translation (SMT). Therefore, it can be concluded that NMT should enhance post-editing efficiency to a greater extent than previously used approaches. The use of neural machine translation is growing significantly in application development. Many studies show that, overall, NMT results can achieve higher scores than those of other types of MTSs.
In both research and practical applications of NMT in professional settings, reviews and comparative analyses of NMT and SMT quality have been conducted. Study [12] found that an English–Spanish (EN–ES) NMT system outperformed the corresponding SMT system when trained on data of 15 million words. The authors noted issues with NMT, including domain mismatch and the handling of rare words. Generally, when NMT has access to sufficiently large training data matching the domain and vocabulary of the texts to be translated, there is a substantial improvement in translation quality. High-quality NMT systems make fewer errors, reducing the workload for automatic post-editing (APE) systems. Given the dominance of the NMT approach in both academic and industrial applications, significant attention to this approach is highly justified.
The development of machine translation for the Kazakh language began in 2000. Professor U. A. Tukeyev was one of the pioneers in this field, establishing a research school actively engaged in MT studies. Through the work of the research group under Professor U. A. Tukeyev’s guidance, grammatical formal models of simple sentences and the first version of a machine translation program (MTP) from Kazakh to English were defined [13,14]. Additionally, a polysemous method for machine translation of morphologically complex natural languages such as Russian and Kazakh was developed [15,16,17]. Since 2018, efforts have been ongoing to develop and research a neural machine translation system for the Kazakh language. Over the last three to four years, both the theory and practice of machine translation have significantly expanded, creating a new direction in machine translation that reflects very high quality. This, in turn, has raised the standard of machine translation quality to new heights.
Neural MT systems have significantly enhanced the capabilities of post-editing for the Kazakh language, as deep learning-based models have demonstrated superior performance on agglutinative languages [18]. Research into the neural approach for automatic post-editing of machine translation began in 2016. For example, the study [19] introduced an automatic post-editing system for the English–Italian language pair based on a bidirectional recurrent neural network model. In this classical approach, the model consists of an encoder, which encodes the MT output into a fixed-length vector, from which the decoder generates the post-edited output. With continued research, model architectures have become more complex, and results have improved over the earlier architectures. For instance, in the study [20], the model architecture consists of three components: a Seq2Seq model for transforming the source sentence into the target sentence in forced decoding mode; Seq2Seq models for gradually generating a sequence of editing operations (action generator); and an RNN (recurrent neural network) for summarizing the sequence of edited words produced by the actions generated so far (interpreter). They demonstrated improvements of up to +1 BLEU and −0.7 TER for German–English language pairs.
In 2019, the application of the transformer model began. In the study [21], a transformer model was employed for the English–German language pair, achieving results that surpassed contemporary technologies due to its simpler architecture, suitable for industrial applications. For post-editing English–Kazakh translations, classical approaches proved to be less effective due to the complexity of the sentences. Therefore, the transformer model was utilized to obtain good results.
The use of large language models (LLMs) such as GPT-3, BERT, and others in machine translation provides a number of advantages and opportunities to improve translation quality and system efficiency. Following are some ways these models are used in machine translation:
  • Pre-training: Large language models undergo extensive pre-training on huge corpora of text, allowing them to learn rich language representations. These representations can then be used in machine translation tasks.
  • Contextual understanding: Models such as BERT are able to understand the context and dependencies between words in a sentence. This allows them to grasp semantic relationships more accurately and correctly interpret phrases in translation.
  • Fine-tuning for machine translation: After pre-training, models can be further tuned on data specific to the machine translation task. This makes it possible to account for the features of the language and the requirements of a specific task.
  • Processing long texts: Larger language models can be better at processing long texts, which is useful for translating complex and extended sentences.
  • Generating context-aware translations: Transformer-based models are capable of generating context-aware translations, given the previous context. This allows for a more natural conveyance of the meaning and tone of the translated text.
  • Working with multiple languages: Many large language models are trained on multiple languages, allowing them to translate text into different languages more efficiently. This is especially useful in multicultural and multilingual scenarios.
  • Adaptation to specialized fields: Machine translation models based on large language models can be customized and adapted to specialized fields such as medicine, law, or engineering.
  • Supervised and unsupervised learning: These models can be applied to both supervised (using labeled parallel data) and unsupervised machine translation tasks.
The use of large language models in machine translation continues to evolve, and these models provide new opportunities to improve the quality of automatic translation and mitigate some of the problems faced by older models [22].
Large language models typically require extensive data for pre-training. Unfortunately, at the moment, there are no large and diverse text corpora for the Kazakh language that can contribute to the successful application of the model. The Kazakh language has its own unique linguistic features, such as agglutination and the use of the Cyrillic and Latin alphabets. The model must be configured to take these features into account. Training and using large language models require significant computational resources. Having sufficient computing power can be critical. The application of large language models to the Kazakh language requires a careful approach and consideration of various factors. However, with the development in this direction through the use of open resources, with proper configuration and training, such models can achieve good results in machine translation tasks for the Kazakh language.

1.2. Overview of Post-Editing Types in Machine Translation

Post-editing of automated machine translation is carried out by a specialist fluent in the target language. Machine translation post-editing (MTPE) is the process of correcting and improving machine-generated translations to enhance their quality and bring them closer to the standard achieved by human translators [23]. Post-editing can be performed by professional editors or native speakers specialized in the relevant subject matter.
Two main types of post-editing are commonly used. Partial post-editing involves eliminating grammatical, gross semantic, and stylistic errors in the text; the resulting text is easily comprehensible and entirely understandable. Full post-editing, on the other hand, involves checking and correcting the text until it becomes equivalent to a translation done by a human, reflecting not only the essence of the original text but also the depth of industry-specific terminology; a wide range of expressive language elements is utilized in this process. Popular websites and modern online editing services distinguish four types of post-editing; a detailed description is provided in Table 1.
  • High post-editing: No differences from human translation; the result is an accurate, error-free translation.
  • Full post-editing: Emphasizes translation quality, aiming to eliminate errors.
  • Light post-editing: Involves editing the machine translation output to make its meaning clear, accurate, and unambiguous, but it may still contain grammar, spelling, and sequence problems. The original style and tone are not emphasized.
  • Weak post-editing: Focuses on understanding over language quality and allows for linguistic and stylistic issues [24].
This table allows for comparing four types of post-editing in machine translation across various aspects, including descriptions, objectives, permissible errors, time investments, and examples of tasks to which they are applied.
There are several different machine translation systems (MTSs), and some perform translations more effectively than others. However, to determine which translator handles translations for specific language pairs or certain types of texts, it is necessary to conduct an analysis of translation quality based on individual text fragments. As a result of the analysis in this study, various factors were considered.

1.3. Description of Machine Translation Challenges for the Kazakh Language and Problem Statement

Machine translation for the Kazakh language encounters several unique challenges related to the language’s characteristics, its morphological complexity, and its status as a low-resource language, as depicted in Figure 1:
  • Morphological complexity. Kazakh is an agglutinative language, which means that words can have multiple forms due to prefixes and suffixes. This complexity poses challenges for machine translation, as it is necessary to correctly interpret and translate each morpheme. Translation between Kazakh and languages with different morphological structures, such as English, can lead to a lack or excess of information in the target text [25].
  • Low-resource language: The Kazakh language is considered low-resource because there are currently much fewer training data available for Kazakh compared to languages such as English or Russian. This complexity makes the training of highly effective machine translation models more difficult. Additionally, the absence of specialized corpora for specific domains or professions further limits the translation quality in these areas [26].
  • Transition to Latin script: It is also worth noting that the plans to transition the Kazakh language from Cyrillic to Latin script add an additional layer of complexity. Machine translation systems need to be adapted to both writing systems, further complicating the translation process.
  • Cultural and idiomatic differences: Idioms, proverbs, and other culturally specific expressions in the Kazakh language might lack direct equivalents in other languages, requiring careful adaptation during translation [27].
  • Syntactic differences: Syntactic differences, such as word order and sentence structure, in the Kazakh language can vary from other language groups like Germanic, Slavic, and others. This creates additional challenges for machine translation, as translation models need to account for these differences to generate grammatically correct and comprehensible sentences [28].
Given these challenges, there is a need for the development of a post-editing system for the Kazakh language.
Problem statement: The aim is to develop a machine translation post-editing system for a low-resource language, using the Kazakh language as an example, which takes into account the specific linguistic characteristics and ensures a high translation quality. Based on the analysis of existing research, the following research tasks have been identified:
  • Data collection, processing, and structuring; for this, it is necessary to collect and structure large bodies of bilingual texts, as well as develop tools for automatic preprocessing and text annotation;
  • Development of a post-editing approach for English–Kazakh and Russian–Kazakh machine translation based on machine learning;
  • Quality assessment, conducting an objective assessment of the quality of translation using modern methods and tools.
To implement these tasks, the methodology and stages of developing a machine translation post-editing system for the Kazakh language will be presented further.

2. Materials and Methods

2.1. Post-Editing of Machine Translation for English–Kazakh and Russian–Kazakh Translation

The task of post-editing machine translation is the focus of attention of researchers who are concerned with the development of automated translation systems. This problem becomes especially relevant for low-resource languages, such as Kazakh, Kyrgyz, Uzbek, and others [28]. The Kazakh language is characterized by an extensive morphology, which makes the task of machine translation difficult. Errors in determining the basis of a word or in morphemes can lead to significant errors in translation. In this context, machine translation post-editing (MTPE) often focuses on correcting morphological errors.
Traditional MTP systems based on statistical machine translation (SMT) were used in the early years of the development of translation technologies for the Kazakh language. Approaches to automatic post-editing of machine translation were initially developed on the basis of the grammatical rules of the language. Correction started with errors in individual words, then moved to the word forms that make up phrases, and then to whole sentences.
We consider that the architecture of the MTP process for the Kazakh language can contain the following stages:
  • Preparation of the source text. To do this, the initial input is scanned to check the correctness of the format and identify possible problems; after that, the source language is determined, and then the input text is divided into sentences or phrases.
  • Machine translation: this process consists of tokenization, that is, splitting sentences into separate words or tokens; morphological analysis, i.e., the determination of the morphological features of the language; a neural model, where a trained MT model (for example, based on the transformer architecture) produces the initial translation; and detokenization, that is, the conversion of tokenized text back into a readable format.
  • The post-editing stage: an initial review of the MT output is carried out for the presence of obvious errors. After that, if the main goal is general understanding, the text undergoes light editing, that is, only obvious errors are corrected. If the goal is a high-quality translation, the text undergoes complete editing, and all kinds of errors (stylistic, grammatical, lexical, and semantic) are corrected. The choice between light and full post-editing usually depends on the purpose of the translation, the project budget, and quality requirements.
  • Adaptive feedback cycle: corrections made during post-editing are fed back into the MT system to improve future translations. This stage is optional and depends on the adaptive capabilities of the MT system.
  • Output data and verification: quality control is carried out, or rather, automatic checks for the presence of untranslated segments, inconsistencies, or problems with the format. The final review, the last stage, is performed to make sure that the translation meets the required quality standards. At the very end, the result is provided in the form of a ready translation and is output in the required format or medium. Figure 2 shows the general architecture of post-editing for the low-resource language.
This architecture provides a sequential process from input to output, detailing each stage involved in the post-editing processes for the Kazakh and Uzbek languages. The post-editing module itself is divided into two levels. Light post-editing is responsible for making small changes to the machine translation text, while full post-editing carries out a detailed correction of possible syntactic and lexical errors of machine translation [29].
Further, descriptions of the processes of the main levels of the machine translation post-editing system will be presented.

2.2. Development of the Light Post-Editing Level

Light post-editing is one of the initial and superficial stages of editing in the machine translation post-editing system. The light post-editing level involves taking the raw output from machine translation and making minimal changes to the text to make the translation understandable, essentially accurate, and grammatically correct [30,31].
The tasks included in the light post-editing level are as follows:
  • Correcting only the most obvious typos and lexical and grammatical errors;
  • Rectifying machine errors;
  • Removing unnecessary or redundant translation options generated by the machine.
The complete system of endings for the Kazakh language was proposed by Professor U. A. Tukeyev in [27], where the authors introduced the CSE (Complete Set of Endings) model. The CSE system of endings for the Kazakh language was developed using a combinatorial model, taking into account semantic and morphological exceptions. A list of possible endings for all parts of speech was compiled. The system of endings was derived through a combinatorial approach, considering the semantic admissibility of combining the language's basic suffixes and the peculiarities of sound harmonization in the language. All possible types of endings for the Kazakh language are covered, totaling 4727 ending types.
The Kazakh language’s system of endings was also applied to the segmentation task [18]. The accuracy of converting words to their bases was above 87% for four different texts in the subject area. The system of endings was employed in the post-editing task of the Kazakh language in machine translation. The system of endings was used to correct morphological errors in post-edited Kazakh text.
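As a simple illustration of how such an endings resource can be used to convert words to their bases, the following Python sketch strips the longest matching ending from a word. The list of endings shown here is a tiny toy subset and the function is hypothetical; the real CSE resource contains 4727 ending types.

```python
# Hypothetical illustration of base extraction with an endings list.
# KAZAKH_ENDINGS is a toy subset; the real CSE resource lists 4727 endings.
KAZAKH_ENDINGS = ["лардың", "лерге", "дарға", "лар", "лер", "ға", "ге", "да", "де"]

def to_base(word, endings=KAZAKH_ENDINGS):
    """Return the word base by removing the longest matching ending."""
    for ending in sorted(endings, key=len, reverse=True):
        # keep at least two characters of the base to avoid over-stripping
        if word.endswith(ending) and len(word) > len(ending) + 1:
            return word[: -len(ending)]
    return word

print(to_base("кітаптарға"))  # -> "кітаптар" with this toy list
```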
To build the light post-editing level model, the following algorithm was developed:
  • Aligning sentences {mt} obtained from machine translation with high-quality sentences in the target language {pe};
  • Conducting sentence tokenization;
  • Splitting into training and validation sets, as well as building a vocabulary;
  • Processed data are fed into a recurrent neural network;
  • The trained model is used to translate the test corpus, and the results are compared using the BLEU metric {mt}-{pe} and {mt}-{ape}, where ape represents the text corpus obtained after the post-editing stage.
The stages of the light post-editing process are illustrated in Figure 3.
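A minimal Python sketch of the first steps of this algorithm (pairing the {mt} output with the {pe} references, tokenizing, and splitting into training and validation sets) is given below; the file names and the simple whitespace tokenizer are illustrative assumptions rather than the exact tooling used in the experiments.

```python
# Sketch of steps 1-3 of the light post-editing algorithm: align {mt}
# with {pe}, tokenize, and split into training/validation sets.
# File names and the whitespace tokenizer are illustrative assumptions.
import random

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f]

mt_sents = read_lines("corpus.mt.kk")   # raw machine translation output
pe_sents = read_lines("corpus.pe.kk")   # high-quality post-edited sentences
assert len(mt_sents) == len(pe_sents), "corpora must be sentence-aligned"

# simple whitespace tokenization (a subword tokenizer could be used instead)
pairs = [(src.split(), tgt.split()) for src, tgt in zip(mt_sents, pe_sents)]

random.seed(42)
random.shuffle(pairs)
split = int(0.9 * len(pairs))
train_pairs, valid_pairs = pairs[:split], pairs[split:]

# vocabulary is built from the training portion only
vocab = {token for src, tgt in train_pairs for token in src + tgt}
print(len(train_pairs), len(valid_pairs), len(vocab))
```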
Certainly, light post-editing is the process of correcting machine translation with minimal changes to achieve an acceptable level of quality. The light post-editing level performs the following functions:
  • Elimination of glaring grammatical, orthographic, and punctuation errors;
  • Correction of machine-translated phrases or words that clearly do not match the original text;
  • Preservation of the basic formatting and structure of the original text without additional effort on high-quality formatting;
  • In the case of extremely complex or ambiguous translations, it may decide to skip or make minimal changes to maintain the text’s clarity.
Ultimately, light post-editing ensures readability and a structured text, preserves overall clarity and understanding of the text, avoids extensive revisions of complex sections, provides minimal grammatical correctness of the text, and corrects obvious translation errors while maintaining the overall meaning. Full post-editing is used for further editing.

2.3. Development of the Full Post-Editing Level

Full post-editing is a more profound editing process that encompasses all the basics of light post-editing with any necessary structural and stylistic corrections [32]. It is a more labor-intensive process but yields high-quality results. To implement the full post-editing module, additional analysis and solutions for deep editing will be applied, such as specific terminology, professional stylistic characteristics of the text, and machine learning approaches adapted for the Kazakh language.
Developing the full post-editing level for English–Kazakh and Russian–Kazakh translation requires careful design and implementation of an algorithm that will intervene more deeply in the text to enhance its quality. Figure 4 illustrates the stages that can be included in the algorithm development for this module.
This algorithm describes a more in-depth process of the full post-editing level, which focuses on the detailed correction of grammatical and structural errors while preserving the structure, style, and meaning of the original text.
The full post-editing level performs the following functions:
  • Conducts comprehensive deep editing of machine translation, taking into account the syntactic and semantic properties of the output Kazakh language;
  • Ensures the preservation of the structure of the original text and the properties of the Kazakh language for simple and complex sentences;
  • Ensures uniform translation of terms and phrases and evaluates the logical integrity and meaning of the text to ensure that the translation accurately conveys the ideas and information of the original text;
  • Conducts a final check of the translation to ensure that the text fully meets the requirements.
The foundation of the full post-editing level is based on NMT architectures, which have been intensively developed over the last decade. The traditionally used NMT architecture for translating from one language to another was Seq2Seq. The BRNN model is a modified version of the RNN model. In the architectural implementation of a BRNN, neurons are divided into parts responsible for the forward and backward directions; outputs from forward states are not connected to the inputs of backward states. Without the backward states, this architecture reduces to a unidirectional RNN [33,34].
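To make the BRNN encoder concrete, the following PyTorch sketch shows a bidirectional recurrent encoder in which the forward and backward states are kept separate and only concatenated at the output, as described above; the layer sizes are illustrative and are not the configuration used in the experiments.

```python
# A simplified BRNN encoder sketch in PyTorch; hyperparameters are illustrative.
import torch
import torch.nn as nn

class BRNNEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        # bidirectional=True builds independent forward and backward RNNs;
        # their outputs are concatenated, and forward outputs never feed
        # the backward states (and vice versa).
        self.rnn = nn.LSTM(emb_dim, hidden_dim // 2, num_layers=2,
                           batch_first=True, bidirectional=True)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)      # (batch, length, emb_dim)
        outputs, state = self.rnn(embedded)       # outputs: (batch, length, hidden_dim)
        return outputs, state

encoder = BRNNEncoder(vocab_size=32000)
dummy = torch.randint(1, 32000, (2, 7))          # a toy batch of token ids
print(encoder(dummy)[0].shape)                    # torch.Size([2, 7, 512])
```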

3. Results

The architecture of the post-editing information system for English–Kazakh and Russian–Kazakh machine translation consists of three main modules, each comprising subtasks and data:
  • Client-interface module;
  • Post-edit module;
  • NL resource module.
The prototype of the post-editing system consists of these three modules. The operational scheme of the post-editing system prototype is illustrated in Figure 5 below.
The first module, the client-interface module, serves as the user interface connecting the user with the system. This module includes the following:
  • Input windows for processing and editing functions linked to the post-edit module.
  • Output result windows.
To process the input data using trained machine translation and post-editing models, a server is created, which calls the specified models upon request. As the experimental part involved training with the open-source framework OpenNMT, its server configuration is used for model invocation.
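As a hedged illustration, the sketch below shows how the client interface can invoke a model through the OpenNMT-py REST server; the host, port, and model id are placeholders, and the request and response layout assume the default server format rather than the exact deployment used here.

```python
# Sketch of a client request to an OpenNMT-py translation server.
# URL, port, model id, and the response layout are assumptions based on
# the default OpenNMT-py server; adjust to the actual deployment.
import requests

def translate(sentence, model_id=100,
              url="http://localhost:5000/translator/translate"):
    payload = [{"src": sentence, "id": model_id}]
    response = requests.post(url, json=payload, timeout=60)
    response.raise_for_status()
    # the default server wraps results in a nested list of dicts
    return response.json()[0][0]["tgt"]

if __name__ == "__main__":
    print(translate("He came to the lesson."))
```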
The subsequent steps involve the stages of post-editing. The post-edit module, the core component of the system, consists of the following levels (sub-modules): light and full post-editing, which include the following processes:
  • Processing level: this level involves linguistic and statistical processing of input data, removing stop words, correcting the input text, normalization, stemming, identifying incorrect words, resolving abbreviations, and proper noun resolution;
  • Structural level: this level addresses grammatical errors in the text, determining the correct syntactic structure of simple and complex sentences in the Kazakh language;
  • Edit level: at this level, semantic analysis of the text is performed, resolving word ambiguity, possible substitutions, and text editing.
In the software implementation, post-editing is carried out using both light and full post-editing modules and a rule-based approach. This approach allows for working with both simple and complex sentences [25].
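A schematic Python sketch of how these levels can be chained is shown below; the function names, the placeholder bodies, and the mapping of the processing, structural, and edit levels onto light and full post-editing are assumptions for illustration, not the authors' implementation.

```python
# Schematic sketch of the post-edit module; all names are hypothetical
# and the bodies are placeholders.
def processing_level(text: str) -> str:
    """Normalization, stemming, stop-word removal, abbreviation and
    proper-noun resolution of the MT output."""
    return text  # placeholder

def structural_level(text: str) -> str:
    """Correction of grammatical errors and of the syntactic structure
    of simple and complex Kazakh sentences."""
    return text  # placeholder

def edit_level(text: str) -> str:
    """Semantic analysis: word-sense disambiguation, substitutions,
    and final text editing."""
    return text  # placeholder

def post_edit(mt_output: str, mode: str = "full") -> str:
    # assumed mapping: light = processing only; full = all three levels
    text = processing_level(mt_output)
    if mode == "light":
        return text
    return edit_level(structural_level(text))
```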

Dataset

The implementation of the system and model training heavily relies on resources. The NL resource module is responsible for data storage and processing in this system.
The NL resource module consists of the following:
  • A parallel Treebank corpus for Kazakh, Russian, and English languages;
  • Dictionaries;
  • Sets of rules and data for the Kazakh language.
During text editing, common errors such as incorrect decoding or transliteration of abbreviations and proper nouns might occur, making it challenging to understand the translation. Shortening words, terms, organization names, and other frequently used words and phrases is an integral part of editing texts. It is crucial not only to learn how to shorten terminological constructions correctly but also to translate abbreviations correctly. For editing Kazakh machine translation texts, the following data dictionaries were created:
  • Abbreviation dictionary;
  • Synonym dictionary;
  • Proper noun dictionary [35].
Uzbek and Kazakh both belong to the Turkic language family, which makes them genetically close; however, they differ in linguistic features, vocabulary, and grammar. The following are some similarities and differences between the Uzbek and Kazakh languages:
  • Genetic origin: both languages belong to the Turkic language family (Kazakh to the Kipchak branch and Uzbek to the Karluk branch).
  • Similar vocabulary: there are some similarities between the Uzbek and Kazakh languages in the basic vocabulary and in some phrases.
  • General grammatical features: both languages use agglutination, i.e., adding affixes to root words to create new forms and meanings.
  • Writing systems: the Uzbek language previously used the Arabic alphabet, which was replaced by Latin after 1928 and then by Cyrillic in 1940; since 1993, the Latin alphabet has been used again. The Kazakh language is currently written in Cyrillic, and a state transition to the Latin script is planned by 2025.
  • Lexical and grammatical differences: despite their genetic closeness, the languages differ in vocabulary and grammar; for example, verb forms differ, and many words can have different meanings.
  • Cultural influence: the Uzbek and Kazakh languages have been influenced by different cultural traditions and historical events, which is also reflected in their linguistic features.
Despite their proximity, each of these languages is a separate linguistic system with its own unique characteristics and historical development.
To work with the Uzbek language, open-access electronic corpora were used on the portals https://www.sketchengine.eu/uzwac-uzbek-corpus/ (accessed on 22 June 2023) and https://github.com/elmurod1202/Uzbek-Corpus-Sample (accessed on 22 June October 2023). Experiments were carried out on a test corpus of 98,000 sentences.
In post-editing Kazakh machine translation texts, errors related to abbreviated words that might occur are presented in Table 2; they might either be frequently transliterated or remain unchanged in translation.
Considering the importance of understanding and translating abbreviations and simple shortenings, a module was developed to address abbreviation editing issues for the English–Kazakh and Russian–Kazakh language pairs; a description is provided in Table 3.
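A minimal sketch of the dictionary-based substitution such a module can perform is shown below; the two entries are illustrative examples rather than entries from the actual abbreviation dictionary.

```python
# Toy abbreviation dictionary and substitution routine; the entries are
# illustrative examples, not taken from the system's dictionaries.
import re

ABBREVIATIONS = {
    "БҰҰ": "Біріккен Ұлттар Ұйымы",   # UN
    "ЖОО": "жоғары оқу орны",          # higher education institution
}

def expand_abbreviations(text, table=ABBREVIATIONS):
    """Replace known abbreviations with their full Kazakh forms."""
    for abbr, full in table.items():
        text = re.sub(rf"\b{re.escape(abbr)}\b", full, text)
    return text

print(expand_abbreviations("БҰҰ жарғысы қабылданды."))
```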
To develop the post-edit module for the English–Kazakh and Russian–Kazakh language pairs, a machine learning-based approach was adopted. Parallel corpora were necessary for experiments and testing. English–Kazakh parallel corpora were collected from open sources such as the OPUS project and WMT conferences [36,37] and by parsing websites containing content in multiple languages. The Bitextor open-source application was used for parsing, applying a set of heuristics to select file pairs containing the same text in two different languages. Bitextor was applied to various websites, including the Kazakh National University (http://www.kaznu.kz) (accessed on 1 April 2023), Bolashak International Scholarship bolashak.gov (http://www.bolashak.gov.kz) (accessed on 29 April 2023), Eurasian National University (http://www.enu.kz) (accessed on 13 June 2023), KazPost (http://www.kazpost.kz) (accessed on 19 May 2023), news portals (http://inform.kz, http://tengrinews.kz), Legal information system of Regulatory Legal Acts of the Republic of Kazakhstan (https://adilet.zan.kz) (accessed on 2 September 2023) and others. A total of 45,000 English–Kazakh sentence pairs were obtained.
For training the Russian–Kazakh and English–Kazakh translation model, a “Treebank” corpus (parallel sentences aligned in three languages—Kazakh, Russian, and English) was used, collected, and aligned through various tools. The total volume of the corpus is 680,000 sentences. The model training data consisted of 90% of the corpus, 5% for validation, and 5% for testing. The vocabulary was created from frequently used words in the corpora (occurring more than three times) and consisted of 120,000 words.
To create the prototype of the post-editing system, neural network models such as BRNN and transformer were utilized. These models were configured based on the user-friendly OpenNMT v2.0.0 (https://github.com/OpenNMT/OpenNMT-py/releases/tag/2.0.0rc1) (accessed on 1 September 2023) application, which employs the PyTorch machine learning framework. Experiments were conducted on a test corpus with a volume of 380,000 sentences. In the first stage, English–Kazakh or Russian–Kazakh neural machine translation was performed using the BRNN model. Subsequently, the transformer model was used for post-editing the text. Afterward, complex structural and grammatical forms were processed, and abbreviations were replaced.
Training MT models: OpenNMT [38] is a deep learning framework specializing in sequence-to-sequence models, covering various tasks related to machine translation, speech recognition, image-to-text conversion, and more. Training corpora for post-editing were implemented using OpenNMT.
In the experiment under consideration, post-editing for the English–Kazakh language pair was explored. A total of 45,000 English–Kazakh parallel sentences obtained from open sources and using the Bitextor application were utilized in this experiment. The parallel corpus for the English–Kazakh language pair was collected from various official websites, aligned, and tokenized. The translation of the source {src} English corpus was implemented using machine translation systems such as Prompt, Yandex, and Webtran.
Figure 6 illustrates the OpenNMT library set for training and deploying neural machine translation models, using the example of the sentence “Ол сабаққа келді”, which translates as “He came to the lesson” in English.
The system was implemented using Lua/Torch frameworks. OpenNMT [39] is an open-source system freely accessible at https://github.com/OpenNMT/OpenNMT (accessed on 1 September 2023). To train the post-editing model {mt}-{pe}, the data were divided into 80% training, 10% validation, and 10% testing sets. The corpora were saved in text files (txt).
Next, text data preprocessing (tokenization) was performed to create a vocabulary for training. An efficient training approach was applied using the transformer model, which has shown good results on parallel corpora. For training, 85,000 parallel sentences were chosen, and the training was conducted for 50,000 steps.
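A hedged sketch of how such a transformer post-editing model can be configured and launched with OpenNMT-py 2.x is given below; the paths, vocabulary files, and hyperparameters are placeholders and do not reproduce the authors' exact settings.

```python
# Sketch: build vocabulary and train a transformer post-editing model with
# OpenNMT-py 2.x. Paths and hyperparameters are illustrative placeholders.
import subprocess
import yaml

config = {
    "save_data": "run/ape",
    "src_vocab": "run/ape.vocab.src",
    "tgt_vocab": "run/ape.vocab.tgt",
    "data": {
        "corpus_1": {"path_src": "train.mt.kk", "path_tgt": "train.pe.kk"},
        "valid":    {"path_src": "valid.mt.kk", "path_tgt": "valid.pe.kk"},
    },
    "save_model": "run/ape_transformer",
    "encoder_type": "transformer",
    "decoder_type": "transformer",
    "layers": 6,
    "heads": 8,
    "train_steps": 50000,
    "valid_steps": 5000,
    "world_size": 1,
    "gpu_ranks": [0],
}

with open("ape_config.yaml", "w", encoding="utf-8") as f:
    yaml.safe_dump(config, f)

subprocess.run(["onmt_build_vocab", "-config", "ape_config.yaml", "-n_sample", "-1"], check=True)
subprocess.run(["onmt_train", "-config", "ape_config.yaml"], check=True)
```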
The experiment was conducted on a computer with the following specifications: CPU—Core i7 4790 K, 32 GB RAM, 1 TB SSD, GPU: RTX 2070 Super, GTX 1080. To assess the quality of the English–Kazakh and Russian–Kazakh translation models, the following metrics were used: BLEU [34,40], WER [41], and TER [42].
Translation and text editing are depicted in Figure 7. The top-left cell contains the original text in English, the cell opposite contains its translation into Kazakh, and the last cell shows the results of the post-editing. Additionally, testing was performed on translated texts containing complex structures and numerous abbreviations.
One of the first metrics that demonstrated a high correlation with human quality evaluations is BLEU. This metric still remains one of the most popular and widely used metrics. WER (Word Error Rate) is a general measure of system performance, and TER (Translation Error Rate) is a method used to predict the volume of post-editing for machine translation outputs.
BLEU (Bilingual Evaluation Understudy) is one of the most common automatic metrics for assessing the quality of machine translation. It is based on measuring the overlap between the sets of words in the generated translation and the reference translation. Following are some reasons why BLEU may be considered a better metric in some cases:
  • Simplicity and clarity: BLEU is easy to understand and calculate. This is one of the reasons for its popularity. Its formula involves simple operations such as counting sets of words and their intersections.
  • Consistency with human assessments: In some studies, BLEU has shown good agreement with human quality assessments. This does not mean that BLEU always accurately reflects human evaluation, but in some cases, it provides a good correlation.
  • Easy interpretation: BLEU provides a numerical value that is easy to interpret. A high BLEU score often means a translation that is more accurate and similar to the reference.
  • Applicability to various tasks: BLEU can be used to evaluate the quality of translation in various fields and languages, making it universal and applicable to a wide range of tasks.
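The sketch below shows one way to compute all three scores with the sacrebleu and jiwer Python libraries; the file names are hypothetical, and the choice of libraries is an assumption, since the paper does not state which implementations were used.

```python
# Compute BLEU and TER with sacrebleu and WER with jiwer for post-edited
# output against reference translations. File names are hypothetical.
import sacrebleu
from sacrebleu.metrics import TER
import jiwer

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f]

hyp = read_lines("test.ape.kk")   # post-edited system output
ref = read_lines("test.ref.kk")   # reference translations

bleu = sacrebleu.corpus_bleu(hyp, [ref])
ter = TER().corpus_score(hyp, [ref])
wer = jiwer.wer(ref, hyp)

print(f"BLEU: {bleu.score:.2f}  TER: {ter.score:.2f}  WER: {wer:.3f}")
```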
The results of the experiments with the trained post-editing model for English–Kazakh and Russian–Kazakh machine translation outputs are presented in Table 4, and for the Uzbek language, they are presented in Table 5.
The technologies and approaches proposed in the article for post-editing machine translation of the Kazakh and Uzbek languages have many advantages and usefulness and can also be effective in the context of improving the quality of translation. The developed modules provide linguistic accuracy and identification of types of complex sentences, consider the syntactic and semantic properties, which are important for complex language structures like Kazakh and Uzbek, and provide translation of terms and phrases. Since the hybrid approach benefits from an automatic editor plus additional editing modules, such as spelling errors, common nouns, and terms, productivity in the post-editing process is significantly increased. The developed approaches allow us to achieve a more accurate and high-quality edited text.
Thus, light and full editing approaches for post-editing machine translation are useful and effective tools that can significantly improve translation quality, increase productivity, and reduce costs.
To achieve better translation results, it is necessary to have a more extensive, well-structured, clean, and diverse corpus. Our corpus at this stage covers specific topics and was created using forward translation (FT) and backward translation (BT) methods. The translated texts generated by the BRNN models were improved using the transformer post-editing model, demonstrating its effectiveness in the considered post-editing task. However, the basic machine translation metrics can be enhanced by increasing the volume of the corpus and the variety of topics it covers, which is a crucial task for low-resource languages. The metrics show the usefulness and the effectiveness of the developed approach.
The developed system was also tested by users. In addition, to assess the usability of the system and the system interface, a System Usability Scale (SUS) survey was conducted. This survey consists of the classic 10 questions. Each question is scored using a Likert scale from 1 to 5 as follows:
  • 1—Completely disagree
  • 2—Disagree
  • 3—Neutral
  • 4—Agree
  • 5—Completely agree
After the user has answered all questions, the total score is calculated, taking into account the inversion of scores for some questions. The resulting score can range from 0 to 100, and a higher score generally indicates a higher perceived usability of the system [38,39,40,41,42,43].
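For reference, the standard scoring rule for the classic 10-item SUS questionnaire (assumed here, since the paper uses the classic form) can be computed as follows: odd-numbered items contribute (score - 1), even-numbered items contribute (5 - score), and the sum is multiplied by 2.5 to obtain a value between 0 and 100.

```python
# Standard SUS scoring: odd items count (score - 1), even items (5 - score),
# and the total is scaled by 2.5 to the 0-100 range.
def sus_score(answers):
    """answers: ten Likert responses (1-5), in questionnaire order."""
    assert len(answers) == 10
    total = sum((a - 1) if i % 2 == 1 else (5 - a)
                for i, a in enumerate(answers, start=1))
    return total * 2.5

print(sus_score([4, 2, 4, 2, 5, 1, 4, 2, 4, 2]))  # example respondent -> 80.0
```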
Based on the survey results, the following results were obtained:
To evaluate the usability of the developed system, 15 users were recruited for each language. A group of experts with knowledge of the relevant field and language was invited. The group of experts included five employees of the Faculty of Philology of Al-Farabi Kazakh National University, five specialists who use machine translation systems in their daily work, and five students. The users were provided with clear instructions regarding the operation of the system and the evaluation criteria. The experts were tasked not only with assessing the convenience of the system but also with paying attention to grammatical correctness, the conveyance of meaning, and the structure of sentences and text. To avoid bias, translations were provided to users without indicating their source (professional translation or system output). Figure 8 and Figure 9 show the average of user responses to each question. Figure 10 shows the average score on the SUS scale, where the blue line represents the Kazakh language and the red line represents the Uzbek language. Based on the results of the SUS survey, a value of 71.16 was obtained for the Kazakh language and a value of 67.6 for the Uzbek language. These values are satisfactory; of course, further development is needed to improve the system.

4. Conclusions

In this paper, we explored the task of post-editing machine translation for a low-resource language, focusing on the Kazakh and Uzbek languages. Throughout the study, various aspects of machine translation were examined, starting from its history and evolution and concluding with post-editing methods and strategies.
During the research, an architecture and a post-editing system for the Kazakh and Uzbek languages were developed. This work serves as a valuable resource for researchers and practitioners in the field of machine translation, especially for those working with low-resource languages like Kazakh.
Practical experiments were conducted on various texts: news publications, legislative documents, IT texts, etc. As shown in the practical results, the quality assessment according to the BLEU metric improved by 0.12 for English–Kazakh translation, by 0.24 for Russian–Kazakh translation, and by 0.09 for English–Uzbek translation. The developed system was also tested by users. To assess the usability of the system and its interface, a System Usability Scale (SUS) survey was conducted. The advantage of the developed approach is that, even with a small number of resources, it makes it possible to improve the quality of machine translation for various unrelated groups of languages by exploiting the linguistic properties of the language together with modern machine learning methods. The developed approach can also be applied to related groups of Turkic languages such as Kyrgyz, Turkish, and others.
We hope that our study will contribute to the further development of post-editing methods and the improvement of machine translation quality for languages with limited resources. This is a significant direction in the field of artificial intelligence and machine translation, with substantial potential for practical applications in the modern world. We plan to further apply the developed approaches and the collected and processed resources to various applied problems of artificial intelligence: question-answering systems, speech processing and synthesis for Turkic languages, etc.

Author Contributions

Conceptualization, D.R.; Methodology, A.T.; Software, D.R. and A.K.; Validation, A.T.; Resources, A.K. and A.T.; Writing—original draft, D.R.; Project administration, D.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research was performed and financed by the grant Project IRN AP 19677835 of the Ministry of Science and Higher Education of the Republic of Kazakhstan.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Mohamed, S.A.; Elsayed, A.A.; Hassan, Y.F.; Abdou, M.A. Neural machine translation: Past, present, and future. Neural Comput. Appl. 2021, 33, 15919–15931. [Google Scholar] [CrossRef]
  2. Sutskever, I.; Vinyals, O.; Le, Q. Sequence to Sequence Learning with Neural Networks. Adv. Neural Inf. Process. Syst. 2014, 4, 1–9. [Google Scholar]
  3. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; Curran Associates: Red Hook, NY, USA, 2017; pp. 6000–6010. [Google Scholar]
  4. Bissembayeva, L. Spiritual unity of the Kazakh and Kyrgyz peoples under colonialism (second half of the 19th century–beginning of the 20th century). In Proceedings of the International Scientific-Practical Conference “Academician Council Nurpeys and the History of the Revival of Kazakh Statehood” Held in the Framework of “Nurpeys Studies” on the Occasion of the 85th Anniversary of the Birth of Nurpeys Kenesy Nurpeysuly, Astana, Kazakhstan; 2020; pp. 153–158. (In Kazakh). [Google Scholar]
  5. Makazhanov, A.; Myrzakhmetov, B.; Assylbekov, Z. Manual vs Automatic Bitext Extraction. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 7–12 May 2018; pp. 3834–3838. [Google Scholar]
  6. Vieira, L.N.; Alonso, E.; Bywood, L. Introduction: Post-editing in practice—Process, product and networks. J. Spec. Transl. 2019, 31, 2–13. [Google Scholar]
  7. Shterionov, D.; do Carmo, F.; Moorkens, J.; Hossari, M.; Wagner, J.; Paquin, E.; Schmidtke, D.; Groves, D.; Way, A. A roadmap to neural automatic post-editing: An empirical approach. Mach. Transl. 2020, 34, 67–96. [Google Scholar] [CrossRef] [PubMed]
  8. Negri, M.; Turchi, M.; Bertoldi, N.; Federico, M. Online Neural Automatic Post-editing for Neural Machine Translation. In Proceedings of the Fifth Italian Conference on Computational Linguistics, Torino, Italy, 10–12 December 2018; pp. 525–536. [Google Scholar]
  9. ISO 18587:2017; Translation Services—Post-Editing of Machine Translation Output—Requirements. ISO: Geneva, Switzerland, 2017. Available online: https://www.iso.org/obp/ui/en/#iso:std:iso:18587:ed-1:v1:en (accessed on 1 December 2023).
  10. Koponen, M.; Salmi, L.; Nikulin, M. A product and process analysis of post-editor corrections on neural, statistical and rule-based machine translation output. Mach. Transl. 2019, 33, 61–90. [Google Scholar] [CrossRef]
  11. Koehn, P. Statistical Machine Translation. Draft of Chapter 13: Neural Machine Translation. arXiv 2017, arXiv:1709.07809. [Google Scholar]
  12. Zhumanov, Z.M.; Tukeyev, U.A. Development of machine translation software logical model (translation from Kazakh into English language). In Proceedings of the Third Congress of the World Mathematical Society of Turkic Countries, Almaty, Kazakhstan, 6 July 2009; Volume 1, pp. 356–363. [Google Scholar]
  13. Tukeyev, U.; Zhumanov, Z.; Rakhimova, D. Features of development for natural language processing. In ICT—From Theory to Practice; Milosz, M., Ed.; Polish Information Processing Society: Warszawa, Poland, 2010; pp. 149–174. [Google Scholar]
  14. Tukeyev, U.; Rakhimova, D. Augmented attribute grammar in meaning of natural languages sentences. In Proceedings of the 6th International Conference on Soft Computing and Intelligent Systems, and the 13th International Symposium on Advanced Intelligent Systems, SCIS-ISIS2012, Kobe, Japan, 20–24 November 2012; pp. 1080–1085. [Google Scholar]
  15. Farrús Cabeceran, M.; Costa-Jussà, M.R.; Mariño Acebal, J.B.; Rodríguez Fonollosa, J.A. Linguistic-based evaluation criteria to identify statistical machine translation errors. In Proceedings of the 14th Annual Conference of the European Association for Machine Translation, Saint-Raphaël, France, 27–28 May 2010; pp. 167–173. [Google Scholar]
  16. Matthias, E.; Stephan, V.; Alex, W. Communicating Unknown Words in Machine Translation. In Proceedings of the International Conference on Language Resources and Evaluation, Reykjavik, Iceland, 26–31 May 2014. [Google Scholar]
  17. Sinha, R.M.K. Dealing with unknowns in machine translation. In Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, e-Systems and e-Man for Cybernetics in Cyberspace, Tucson, AZ, USA, 7–10 October 2001; pp. 940–944. [Google Scholar]
  18. Turganbayeva, A.; Tukeyev, U. The Solution of the Problem of Unknown Words Under Neural Machine Translation of the Kazakh Language. In Proceedings of the Intelligent Information and Database Systems 12th Asian Conference, Phuket, Thailand, 23–26 March 2020; pp. 319–328. [Google Scholar]
  19. Zhang, J.; Zhai, F.; Zong, C. Handling unknown words in statistical machine translation from a new perspective. In Proceedings of the First CCF Conference Natural Language Processing and Chinese Computing, Beijing, China, 31 October–5 November 2012; pp. 176–187. [Google Scholar]
  20. Marton, Y.; Callison-Burch, C.; Resnik, P. Improved statistical machine translation using monolingually-derived paraphrases. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language, Singapore, 6–7 August 2009; pp. 381–390. [Google Scholar]
  21. Zhang, J.; Zhai, F.; Zong, C. A substitution-translation-restoration framework for handling unknown words in statistical machine translation. J. Comput. Sci. Technol. 2013, 28, 907–918. [Google Scholar] [CrossRef]
  22. Lyu, C.; Xu, J.; Wang, L. New Trends in Machine Translation using Large Language Models: Case Examples with ChatGPT. arXiv 2023, arXiv:2305.01181. [Google Scholar]
Figure 1. Problems of machine translation for the Kazakh language.
Figure 2. Architecture of the machine translation post-editing system for a low-resource language.
Figure 3. Stages of the light post-editing algorithm.
Figure 4. Algorithm of the full post-editing process.
Figure 5. The scheme of the prototype of the post-editing system.
Figure 6. OpenNMT set of libraries for training and deploying neural machine translation models.
Figure 7. Example of English–Kazakh translation and post-editing in the system.
Figure 8. Average value of the system assessment using the SUS method for the Kazakh language.
Figure 9. Average value of the system assessment using the SUS method for the Uzbek language.
Figure 10. Average assessment of the performance of the prototype post-editing system for the Kazakh (blue line) and Uzbek (red line) languages using the SUS method.
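Figures 8–10 report averaged System Usability Scale (SUS) scores. For reference, the sketch below shows the standard SUS calculation for a single respondent; the example responses are made up and are not taken from the reported study.

```python
def sus_score(responses):
    """Standard SUS scoring for one respondent.
    `responses` holds the ten 1-5 Likert answers in questionnaire order."""
    assert len(responses) == 10
    odd = sum(responses[i] - 1 for i in range(0, 10, 2))    # items 1, 3, 5, 7, 9
    even = sum(5 - responses[i] for i in range(1, 10, 2))   # items 2, 4, 6, 8, 10
    return 2.5 * (odd + even)                               # scale the 0-40 total to 0-100

# Made-up example responses, for illustration only.
print(sus_score([5, 2, 4, 1, 5, 2, 4, 2, 5, 1]))  # 87.5
```

Averaging such per-respondent scores over all participants gives the values plotted in the figures.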
Table 1. Comparison of types of post-editing.
Aspect | High Post-Editing | Full Post-Editing | Light Post-Editing | Weak Post-Editing
Description | Accurate, error-free translation without linguistic and stylistic errors | High-quality translation with respect to structure and style | Understanding and meaning of the original text | Conditional understanding of the original, with linguistic errors
Main objective | Provide a clear and accurate translation without errors | Provide a high-quality translation | Improve the understanding of the text without paying attention to grammar and structure | Convey the basic meaning of the original, allowing mistakes in grammar and structure
Errors allowed | Punctuation and grammar errors; the translation may contain errors in punctuation and structure | Minor errors may be present | Grammatical and spelling mistakes are acceptable | Linguistic and stylistic errors
Time and effort | Minimal time and effort; does not require correction | Considerable time and effort | Moderate time and effort | Considerable time and effort
Examples of tasks | Translation of instructions and technical documents; technical translation; translation of diagrams | Translation of literary works; academic translation | Translation of scientific texts; product and technical descriptions | Understanding the general content of the text; quick correction of basic errors
Table 2. Example of machine translation results and post-editing of proper names.
Text with Errors in Proper Names | Text after Editing Proper Names | Translation from English
Қазахстан аумағында Алтаи тауы орналасқан. | Қазақстан аумағында Алтай тауы орналасқан. | Altai Mountain is located on the territory of Kazakhstan.
Алматe қаласы-көне қалалардың бірі. | Алматы қаласы-көне қалалардың бірі. | Almaty is one of the oldest cities.
Казахстан Республикасының мүгедектігі бар адамдарды әлеуметтiк қорғау туралы заңнамасын бұзу | Қазақстан Республикасының мүгедектігі бар адамдарды әлеуметтiк қорғау туралы заңнамасын бұзу | Violation of the legislation of the Republic of Kazakhstan on social protection of people with disabilities
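The corrections shown in Table 2 amount to replacing misspelled proper names with their canonical Kazakh forms. A minimal sketch of how such a lookup could be implemented is given below; the small gazetteer of canonical names and the edit-distance similarity threshold are illustrative assumptions, not the exact resources used in the described system.

```python
import difflib

# Hypothetical gazetteer of canonical Kazakh proper names (illustrative entries only).
CANONICAL_NAMES = ["Қазақстан", "Алтай", "Алматы"]

def correct_proper_names(tokens, cutoff=0.8):
    """Replace tokens that closely match a canonical proper name."""
    corrected = []
    for token in tokens:
        # Only capitalized tokens are treated as proper-name candidates.
        if token[:1].isupper():
            match = difflib.get_close_matches(token, CANONICAL_NAMES, n=1, cutoff=cutoff)
            corrected.append(match[0] if match else token)
        else:
            corrected.append(token)
    return corrected

tokens = "Қазахстан аумағында Алтаи тауы орналасқан .".split()
print(" ".join(correct_proper_names(tokens)))
# Expected: Қазақстан аумағында Алтай тауы орналасқан .
```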
Table 3. Example of machine translation results and post-editing of abbreviations.
Translation from English | Text before Editing | Text after Editing Abbreviations
Changing the DNS server requires only entering the selected IP addresses into the appropriate fields of the router or other content configuration page | DNS серверін өзгерту тек таңдалған IP-мекен-жайларды маршрутизатордың сәйкес өрістеріне немесе басқа контентті конфигурациялау бетіне енгізуді талап етеді | DNS (Domain Name System—Домендік атаулар жүйесі-система доменных имен) серверін өзгерту тек таңдалған IP-мекен-жайларды маршрутизатордың сәйкес өрістеріне немесе басқа контентті конфигурациялау бетіне енгізуді талап етеді.
WHO reports that since 1950, the suicide rate among men between the ages of 15 and 24 has increased to 268%. | WHO 1950 жылдан бастап 15 пен 24 жас аралығындағы ерлер арасындағы суицид деңгейі 268%-ға дейін өскенін хабарлады. | ДДСҰ (Дүниежүзілік денсаулық сақтау ұйымы) 1950 жылдан бастап 15 пен 24 жас аралығындағы ерлер арасындағы суицид деңгейі 268%-ға дейін өскенін хабарлады.
Note from ILLI! See Article 920 for the procedure for the implementation of this Codex. | ЗҚАИ-ның ескертпесі! Осы Кодекстің қолданысқа енгізілу тәртібін 920-баптан қараңыз. | ЗҚАИ (Заңнама және құқықтық ақпарат институты)-ның ескертпесі! Осы Кодекстің қолданысқа енгізілу тәртібін 920-баптан қараңыз.
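Table 3 illustrates the abbreviation-handling step, in which a recognized abbreviation is rewritten into its expanded form. The following is a minimal sketch under that assumption; the dictionary entries and the regex-based matching are illustrative only and do not reproduce the system's actual abbreviation resource.

```python
import re

# Hypothetical abbreviation dictionary; entries are illustrative, not the system's actual resource.
ABBREVIATIONS = {
    "DNS": "DNS (Domain Name System - Домендік атаулар жүйесі)",
    "WHO": "ДДСҰ (Дүниежүзілік денсаулық сақтау ұйымы)",
    "ЗҚАИ": "ЗҚАИ (Заңнама және құқықтық ақпарат институты)",
}

def expand_abbreviations(text: str) -> str:
    """Rewrite the first occurrence of each known abbreviation into its expanded form."""
    for abbr, expansion in ABBREVIATIONS.items():
        # \b keeps the abbreviation from matching inside longer tokens;
        # count=1 expands only the first occurrence, as in Table 3.
        text = re.sub(rf"\b{re.escape(abbr)}\b", expansion, text, count=1)
    return text

print(expand_abbreviations("WHO 1950 жылдан бастап суицид деңгейі өскенін хабарлады."))
# Expected: ДДСҰ (Дүниежүзілік денсаулық сақтау ұйымы) 1950 жылдан бастап ...
```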
Table 4. Quality metrics of the post-editing system for the Kazakh language.
Corpus | Architecture | BLEU | WER | TER
English–Kazakh translation | BRNN | 0.37 | 0.49 | 0.57
Russian–Kazakh translation | BRNN | 0.25 | 0.56 | 0.25
Post-editing of Kazakh text | Transformer | 0.49 | 0.45 | 0.47
Table 5. Quality metrics of the post-editing system for the Uzbek language.
Corpus | Architecture | BLEU | WER | TER
English–Uzbek translation | BRNN | 0.26 | 0.54 | 0.59
Post-editing of Uzbek text | Transformer | 0.35 | 0.47 | 0.52
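Tables 4 and 5 report BLEU, WER, and TER on a 0-1 scale. As a reference point for how such scores are obtained, the sketch below computes BLEU with the sacrebleu toolkit and WER as a word-level Levenshtein distance; the sentence pair is made up, and TER differs from WER only in additionally allowing block shifts of word sequences.

```python
import sacrebleu  # pip install sacrebleu

def word_error_rate(hypothesis: str, reference: str) -> float:
    """Word-level Levenshtein distance divided by the reference length (WER)."""
    hyp, ref = hypothesis.split(), reference.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Made-up example pair, for illustration only.
hypotheses = ["Алматы қаласы көне қалалардың бірі"]
references = ["Алматы қаласы ең көне қалалардың бірі"]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score / 100:.2f}")  # sacrebleu reports 0-100; the tables use a 0-1 scale
print(f"WER:  {word_error_rate(hypotheses[0], references[0]):.2f}")
```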