Reduction of Neural Machine Translation Failures by Incorporating Statistical Machine Translation

Dugonik, Jani; Sepesy Maučec, Mirjam; Verber, Domen; Brest, Janez

doi:10.3390/math11112484

Open AccessArticle

Reduction of Neural Machine Translation Failures by Incorporating Statistical Machine Translation

Faculty of Electrical Engineering and Computer Science, University of Maribor, SI-2000 Maribor, Slovenia

^*

Author to whom correspondence should be addressed.

Mathematics 2023, 11(11), 2484; https://doi.org/10.3390/math11112484

Submission received: 21 April 2023 / Revised: 24 May 2023 / Accepted: 25 May 2023 / Published: 28 May 2023

(This article belongs to the Special Issue Current Trends in Natural Language Processing (NLP) and Human Language Technology (HLT))

Download

Browse Figures

Versions Notes

Abstract

:

This paper proposes a hybrid machine translation (HMT) system that improves the quality of neural machine translation (NMT) by incorporating statistical machine translation (SMT). Therefore, two NMT systems and two SMT systems were built for the Slovenian–English language pair, each for translation in one direction. We used a multilingual language model to embed the source sentence and translations into the same vector space. From each vector, we extracted features based on the distances and similarities calculated between the source sentence and the NMT translation, and between the source sentence and the SMT translation. To select the best possible translation, we used several well-known classifiers to predict which translation system generated a better translation of the source sentence. The proposed method of combining SMT and NMT in the hybrid system is novel. Our framework is language-independent and can be applied to other languages supported by the multilingual language model. Our experiment involved empirical applications. We compared the performance of the classifiers, and the results demonstrate that our proposed HMT system achieved notable improvements in the BLEU score, with an increase of 1.5 points and 10.9 points for both translation directions, respectively.

Keywords:

neural machine translation; statistical machine translation; sentence embedding; similarity; classification; hybrid machine translation

MSC:

68T50

1. Introduction

The statistical machine translation (SMT) paradigm was the primary approach used in machine translation (MT) research for many years. About a decade ago, neural machine translation (NMT) emerged and produced remarkable results. As a result, SMT systems were largely replaced by NMT systems in practical applications. Today, SMT systems are rarely used, with NMT architectures dominating both research and practical applications of machine translation. While NMT generally outperforms SMT, there are certain cases where SMT remains superior. Research has shown that the errors made by NMT and SMT systems are complementary [1]. For instance, NMT outputs are more prone to accuracy-related errors, such as mistranslation and omission errors, while both systems tend to make word-form errors in morphologically rich languages, with NMT performing slightly better [2].

Languages that share similarities tend to be easier to translate due to the presence of equivalent linguistic structures. In machine translation, languages are often paired with English, due largely to the availability of bilingual corpora. The English language is an analytic language that employs helper words (such as particles and prepositions) and word order to express relationships between words. As such, the linguistic structures are relatively simple. Conversely, morphologically rich languages tend to have more complex linguistic structures, with inflectional languages using inflections to express relationships between words and having a more relaxed word order. While NMT approaches generally outperform SMT for the majority of language phenomena, there are still cases that are handled better by SMT. For instance, according to [2], SMT may, in some cases, be preferable for highly inflected languages.

In this paper, we examine the translation between English and the highly inflected Slovenian language in both directions. We propose a hybrid machine translation system that combines both approaches in order to capitalize on their respective strengths. The main contributions of this paper are to improve NMT translation quality by using SMT, and to represent the source sentence and both translations as the vectors in the same vector space, using a multilingual language model. The used multilingual language model, mBERT [3], supports more than 100 languages, making it versatile across multiple languages. The source and translation vectors are then utilized to extract features, which are subsequently fed into classifiers that predict which translation system produced a superior translation. The proposed method of combining SMT and NMT in the hybrid system is novel. Our framework is language-independent and can be applied to other languages supported by the multilingual language model.

The remainder of the paper is organized as follows. Section 2 presents the background of our research. It contains related work, preliminaries of NMT and SMT, the classification task, and our aims and research contributions. Section 3 presents the methodology of the proposed HMT. The experiments and results are described in Section 4. We discuss the obtained results in Section 5, and conclude the paper with Section 6.

2. Background

In this section, we present the related work and provide the necessary preliminaries for a better understanding of this paper.

2.1. Related Work

There is no doubt that NMT is currently the prevalent approach to MT. Before NMT, the most effective SMT systems were based on phrase-based models [4,5]. In these systems, different models (the translation model, reordering model, language model, etc.) were trained independently and combined in a log-linear scheme, in which each model was assigned a different weight by a tuning algorithm [6].

In NMT, there are no separate models; instead, a large network is trained as a whole [7,8]. This network is trained to transform the source sentence directly into the target sentence, and words are represented as continuous vectors called word embeddings. The learned word embeddings capture morphological, syntactic, and semantic similarity across words [9]. Methods for training word embeddings on raw text often consider the context in which the word frequently occurs. For MT, it is desirable to embed whole phrases or sentences instead of single words. To accomplish this, self-attention is used to find sentence representations [10].

Different NMT architectures have been developed over time, and they generally exhibit comparable performance. The first standalone architecture was Long Short-Term Memory (LSTM), which is a sequence-to-sequence encoder–decoder architecture that uses two Recurrent Neural Networks (RNNs) [7,8]. An encoder network produces a representation of the source sentence, and a decoder network generates the target sentence from that representation. LSTM is used as a gated activation function to address the vanishing gradient problem, which makes it difficult to train RNNs to capture long-range dependencies [8]. The first architectures represented the source sentence as a fixed-length vector and different word orders were examined in the source sentence. Bidirectional RNNs are able to capture both directions and are most commonly used [11]. The concept of attention was introduced in [11] to avoid a fixed-length source sentence representation. The attention decoder can place its attention on the parts of the source sentence that are useful for generating the next word in the translation using time-dependent context vectors. The attention mechanism is the interface between the encoder and decoder. Afterwards, convolutional architectures were introduced [12], which have several potential advantages over RNN models. They reduce sequential computation, and their hierarchical structure connects distant words via a shorter path. For the translation of long sentences, multiple convolutional layers are used, which increase the effective context size. Convolutional models are deeper and often more difficult to train. The attention mechanism called self-attention relates several positions in the source and target sentences without using sequence-aligned RNNs or convolutions [10]. The Transformer architecture uses multi-headed self-attention and is currently one of the most widely used NMT architectures [10].

The authors in [13,14] provide an overview of the literature and approaches to combining NMT and SMT paradigms. They highlight that, while NMT has become the dominant approach in recent years, NMT and SMT have complementary strengths. Two categories of hybrid approaches are discussed. The first category includes methods that incorporate key ideas or components from SMT into NMT, such as combining NMT scores with SMT features and incorporating symbolic SMT-style lexical translation tables into the NMT decoder. The second category involves system combination, where a fully trained SMT system is combined with an independently trained NMT system, often using rescoring and reranking methods or minimum Bayes risk (MBR)-based approaches. Various combinations and cascades of NMT and SMT are explored, demonstrating the flexibility and potential for improving translation quality through hybrid approaches. Ensembling different NMT models has been shown to outperform single ones. The number of different NMT models in ensemble architectures ranges from 2 to up to 72 translation models [8,15]. However, the decoding speed is significantly worse when using many translation models. The decoder needs to apply multiple models rather than only one. It makes sense if the models complement each other. Models are either trained independently [8] or they share some training iterations [16]. The ensemble decoder computes predictions for each model, which are then combined using the arithmetic or geometric average [8,17]. The authors in [18] proposed a hybrid MT system that combines NMT and rule-based MT (RBMT) to compensate for the inadequacy of NMT in rare-resource domains. They used a classifier to predict which translation from the two systems was more reliable, and to do so, they explored a set of features that reflected the reliability of the translation. They also made a comparison between feature- and text-based classification, and the results showed that the feature-based classification achieved better classification accuracy. In our paper, we combine NMT and SMT for translation in both directions. We also use different sets of feature vectors, where we first transform our source sentence and both translations into the same vector space. Then, we use similarity and distance measures to obtain feature vectors. The authors in [19] address the challenge of improving NMT systems in low-resource scenarios, where large-scale parallel corpora are not readily available. The proposed approach leverages an SMT system to extract parallel phrases from the original training data, augmenting the training data for the NMT system. The approach utilizes gated recurrent unit (GRU) and Transformer architecture, and is evaluated on Hindi–English and Hindi–Bengali datasets in the domains of Health, Tourism, and Judicial.

2.2. Preliminaries

This section describes the basics of two MT paradigms: NMT and SMT. Both approaches belong to supervised approaches to MT based on machine learning technology, where training is conducted using sentence-aligned (human) translations. Given a large number of source/target language sentence pairs, the MT system learns how to translate fully automatically. NMT is described first since it is the dominant approach today, followed by the description of SMT, as it is used as the complementary approach. In this paper, we propose the HMT architecture as a two-engine combination in which the selection between NMT and SMT is made by the classification algorithm. Therefore, a short description of the classification algorithms that we used is also given in this section.

2.2.1. Neural Machine Translation

NMT is an approach to MT that uses an artificial neural network. The state-of-the-art NMT systems use the Transformer architecture [10], which is shown in Figure 1, to produce high-quality translations. The Transformer architecture relies on the attention mechanism and remains the dominant architecture for several language pairs. Self-attention is an attention mechanism that connects different positions of a single sequence to compute a sequence representation. The self-attention mechanism is used successfully in various tasks, such as text summarization and textual integration. The self-attention layers of this architecture learn the dependencies between words in a sequence by studying the connections between all the words in the matching sequences and by directly modeling these relationships. This approach is simpler than the gating mechanism used by RNNs. The simplicity of this architecture has allowed for researchers to develop high-quality translation models with the Transformer architecture, even for languages with few resources. The Transformer architecture was the first to rely entirely on the self-attention mechanism to compute input and output representations, without using feed-forward or sequence-aligned convolutional neural networks. The encoder and decoder can be stacked N layers high, with each layer taking inputs from the encoder and the previous layers. By stacking layers, the model can learn to extract and focus on different combinations of attention from its attention heads, boosting prediction power.

During training, the model is optimized to minimize the difference between its predicted translations and the true translations in the training data. This is typically achieved using maximum likelihood estimation, where the model is trained to maximize the likelihood of generating the correct target sentence given the source sentence.

Overall, NMT with the Transformer architecture has shown great promise in producing high-quality translations across a wide range of language pairs [20]. It is now used widely in many real-world translation applications and continues to be an active area of research [21,22,23,24].

2.2.2. Statistical Machine Translation

Phrase-based SMT, shown in Figure 2, is a traditional approach to MT that has been used widely for many years. It is based on the idea of breaking down the input sentence into smaller phrases or sequences of words, translating them independently, and then recombining them to form the final translation. Phrase-based SMT systems learn dependencies between words, phrases, or sequences of words in both languages, as well as dependencies between words in the target language and local reorderings, among other things [4]. These learned dependencies are stored in the various models of the SMT system. M denotes a number of models used in SMT.

2.2.3. Classification

In a classification task [25,26], the goal is to assign a set of input instances to predefined categories or classes. A binary classification task [27] specifically involves dividing the instances into two distinct classes. The task aims to determine to which class a given instance belongs based on its features or attributes.

In our framework, a vital part of the hybridization approach is the classification task that is used to choose either SMT or NMT translation as the final translation. Therefore, we used and compared some of the well-known algorithms for binary classification:

Logistic Regression (LR) [28] is a simple and widely used algorithm for binary classification. It works by modeling the probability of the positive class using a logistic function.
Decision Tree (DT) [29] is a simple algorithm for binary classification. It works by splitting the data recursively, based on the features that are most informative for the classification task.
Gradient-Boosted Decision Tree (GBDT) [30] is an algorithm that sequentially builds decision trees to correct errors made by previous trees, making it effective for binary classification tasks. It combines the predictions of multiple trees to provide accurate binary classification results, capturing complex patterns in the data while mitigating overfitting through regularization techniques.
Random Forest (RF) [31] is an ensemble learning method that combines multiple decision trees to improve the accuracy and stability of the model. It works by selecting a subset of features randomly at each node in the decision tree.
Naive Bayes (NB) [32] is a probabilistic algorithm that assumes independence between features and works by calculating the probability of the observation belonging to each class based on the likelihood and prior probabilities.
K-Nearest Neighbors (kNN) [33] is a non-parametric algorithm that works by finding the k-nearest data points to a new observation and assigning the label based on the majority of the neighbors.
Multilayer Perceptron (MLP) [28] is a type of neural network (NN) that can be used for binary classification problems. It works by building a network of interconnected nodes that process input data and produce an output.
Convolutional Neural Network (CNN) [34] is a type of neural network that excels at analyzing and extracting features from structured data-like images. Layers of convolutional filters are used to automatically learn hierarchical representations, making them highly effective for binary classification tasks where they can capture intricate patterns and relationships in the data to make accurate predictions.
Support Vector Machine (SVM) [28] is a powerful machine learning algorithm that is commonly used for binary classification problems. It works by finding a hyperplane that separates the two classes with the largest possible margin.

Equation (1) represents the obtaining of translations for a given source sentence s, where

t_{1}

and

t_{2}

are the translations generated by the NMT and SMT systems, respectively. The use of the trained models (obtained as shown in Figure 1 and Figure 2) in the translation procedure is shown in Figure 3.

t_{1} = N M T (s), t_{2} = S M T (s)

(1)

The classification task can be formalized as:

f : (x_{t_{1}}, x_{t_{2}}) \to {0, 1}, f \in {L R, D T, G B D T, R F, N B, k N N, M L P, C N N, S V M},

(2)

where

x_{t_{1}}

and

x_{t_{2}}

are the feature vectors of translations

t_{1}

and

t_{2}

, respectively, and 0 denotes translation

t_{2}

and 1 denotes translation

t_{1}

. The feature vectors fed into the classifier should reflect the adequacy of the translations. Therefore, they are constructed depending on the source sentence s. The construction of feature vectors is important for the accuracy of the classification task.

2.3. Aim and Research Contribution

The goal of this paper is to reduce translation failures in NMT by integrating SMT, and the source sentence and translations are represented as the vectors in the same vector space, using the multilingual language model. These source and translation vectors are then utilized to extract features, which are subsequently fed into classifiers that predict which translation system will produce a superior translation. Although NMT generally outperforms SMT, there are specific cases in which the SMT remains more competitive. HMT combines the strengths of both the SMT and NMT systems. By leveraging the best features of each system, HMT can offer improved translation quality compared to when either system is used independently.

3. Methodology

This section describes the proposed framework in detail, as shown in Figure 4. The core idea of our framework is to compare SMT and NMT translations and choose the better translation of the source sentence. All three sentences should be represented as vectors that can be compared in terms of semantic and syntactic similarities between the source and translation, and express the differences between the different translations. Sentence embedding is used to encode sentences into vectors. In the process of feature extraction, different measures are applied to determine the similarities and differences between vectors. Each measure provides the value of one feature in the vector. After obtaining informative feature vectors, various classifiers are trained to deduce which one has the better prediction power to select SMT or NMT translation. The following subsections will outline some of the relevant methods used in our framework.

3.1. Sentence Embeddings

Sentence embeddings in Natural Language Processing (NLP) refer to techniques that capture the semantic meaning of entire sentences by representing them as dense numerical vectors, enabling a wide range of downstream tasks such as sentence similarity, paraphrase detection, and text classification. BERT (Bidirectional Encoder Representations from Transformers) [35,36,37] embeddings capture rich contextual information and have revolutionized NLP tasks such as text classification, named entity recognition, and sentiment analysis. BOW and TF-IDF [38] embeddings are simpler but still useful for tasks such as document classification or information retrieval, where word frequency or presence is crucial. In our framework, three sentence embeddings were constructed for the source sentence and both translations. Because all three sentences could have different lengths, that makes them difficult to compare using similarity or distance measures. One of the most popular methods for generating sentence embeddings is to use pre-trained language models such as BERT. mBERT (Multilingual BERT) [3,39] is an extension of the original BERT model developed by Google, which was trained on a large corpus of text from more than 100 different languages. mBERT can learn to understand the meaning and context of words and sentences in multiple languages, and can be applied to a variety of NLP tasks, including text classification, question answering, and MT. One of the advantages of using mBERT is that it allows developers to build NLP applications that can work with multiple languages, without having to train separate models for each language. This can save time and resources, while also improving the overall accuracy and performance of the model. However, it is important to note that mBERT is not perfect and may not perform as well as language-specific models for certain languages. Additionally, it may not be able to capture all the nuances of each language, especially those with complex grammar or syntax.

3.2. Feature Extraction

Feature extraction is the process of transforming raw data into a set of meaningful features that are used as input to a machine learning algorithm. We are looking for informative features that would allow the learning algorithm to build a model that accurately predicts which translation is better, SMT or NMT. Feature extraction also helps to reduce the dimensionality of the input data, improve model performance, and increase the interpretability of the results. Features are extracted from sentence embeddings. To provide as accurate a classification as possible, we explore the following 11 features [18,40,41,42,43]: cosine similarity, Jensen–Shannon divergence, Euclidean distance, Cityblock distance, Squared Euclidean distance, Chebyshev distance, Canberra distance, Dice coefficient, Kulczynski distance, Russel–Rao similarity, and Sokal–Sneath similarity. Each feature has a value between 0 and 1. The full list of features and their positions in the feature vector can be seen in Table 1. We end up with two feature vectors:

x_{t_{1}}

represents the feature vector between the source sentence and the NMT translation, and

x_{t_{2}}

represents the feature vector between the source sentence and the SMT translation.

3.2.1. Cosine Similarity

The cosine similarity [44,45] is a measure of the similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them. In the context of comparing two real-valued vectors, cosine similarity is a popular feature similarity measure that is used commonly in machine learning and information retrieval. Cosine similarity is often used in text analysis applications, such as document similarity and clustering, but it can also be used in other domains where feature vectors are used to represent objects or entities.

3.2.2. Jensen–Shannon Divergence

The Jensen–Shannon divergence [46] is a measure of similarity or dissimilarity between two probability distributions. It is often used to compare two probability density functions or two sets of discrete probabilities. To use the Jensen–Shannon divergence to compare two real-valued vectors, we can first interpret them as probability distributions by normalizing them to sum to 1.

3.2.3. Euclidean Distance

The Euclidean distance [47] is a commonly used metric for comparing two real-valued vectors in machine learning and data analysis. It measures the straight-line distance between two points in the Euclidean space. The Euclidean distance can be used for a variety of tasks, such as clustering, classification, and anomaly detection. It is a useful metric for comparing vectors in many machine learning applications.

3.2.4. Cityblock Distance

The cityblock distance [48], also known as the Manhattan distance, is a way to measure the distance between two points in a two-dimensional space (or higher dimensions). In the context of comparing two real-valued vectors, the cityblock distance is a way to measure the similarity or dissimilarity between two vectors based on the sum of the absolute differences between their corresponding elements. The resulting distance is a non-negative value that represents the total distance between the two vectors. In other words, the larger the distance, the more dissimilar the vectors. The cityblock distance can be useful in many applications, such as image processing, clustering, and data analysis.

3.2.5. Squared Euclidean Distance

The squared Euclidean distance [49] metric is a way of measuring the distance between two real-valued vectors of equal length. This metric is sometimes preferred to the standard Euclidean distance, which calculates the square root of the sum of squared differences. The squared Euclidean distance can be useful in certain applications where the computation of square roots is computationally expensive or unnecessary, such as in some clustering or classification algorithms. It is worth noting that the squared Euclidean distance is always non-negative and symmetric, and satisfies the triangle inequality, which are all properties of a valid distance metric.

3.2.6. Chebyshev Distance

The Chebyshev distance [50] is a metric that can be used to compare two real-valued vectors. It is defined as the maximum absolute difference between the corresponding elements of the two vectors. The Chebyshev distance is a useful distance metric in many applications, such as image processing, pattern recognition, and clustering, where one wants to compare objects based on their maximum deviation in any one dimension.

3.2.7. Canberra Distance

The Canberra distance [51] is a measure of the distance between two points in a multidimensional space. It considers the magnitude of the differences between corresponding elements of two vectors, as well as their absolute values. The Canberra distance is a popular feature in data analysis and machine learning because it is robust to outliers and can handle sparse data. It is used commonly in clustering, classification, and regression problems.

3.2.8. Dice Coefficient

The Dice coefficient [52] is a similarity measure used to compare the similarity between two sets or vectors. The Dice coefficient ranges from 0 to 1, with 1 indicating that the two vectors are identical and 0 indicating that they are completely dissimilar. A higher Dice coefficient value indicates a higher degree of similarity between the two vectors. It is commonly used in clustering, classification, and information retrieval tasks where the similarity between two vectors needs to be computed.

3.2.9. Kulczynski Distance

The Kulczynski distance [53] is a statistical measure used to compare two real-valued vectors. It is a measure of similarity that considers the proportion of shared values between the two vectors. The Kulczynski distance has been used in a variety of applications, including information retrieval, data mining, and machine learning.

3.2.10. Russell-Rao Similarity

The Russell–Rao similarity [54] is a measure that can be used to compare two real-valued vectors. The Russell–Rao similarity is a simple and intuitive similarity measure that ranges from 0 to 1, where 0 indicates that the vectors have no coordinates in common, and 1 indicates that the vectors are identical. However, it does not consider the magnitude or direction of the vectors, and it can be sensitive to outliers.

3.2.11. Sokal–Sneath Distance

The Sokal–Sneath distance [55] is a measure of similarity between two real-valued vectors. It is used commonly in cluster analysis and classification problems. Intuitively, the Sokal–Sneath distance measures the proportion of non-matching values in the two vectors, considering the sparsity of the vectors. The Sokal–Sneath distance is useful when comparing sparse vectors, such as those that arise in text analysis or the analysis of high-dimensional data. It has the property of being symmetric and satisfying the triangle inequality, which makes it suitable for use in hierarchical clustering algorithms.

3.3. Classification

Before classification is applied, the feature vector

x

is constructed, in which each dimension

j = 1, \dots, 11

contains a value of a specific distance or similarity measure. Two feature vectors,

x_{t_{1}}

and

x_{t_{2}}

, are constructed: one for the NMT translation and one for SMT translation. These feature vectors are used as input for a classification algorithm to determine which translation is more accurate or appropriate for a given input sentence. The construction of these feature vectors is crucial in determining the accuracy of the classification algorithm. Careful selection of the measures is required to ensure that they capture the relevant features of the input data.

As shown in Figure 4, a classification is adopted to select the best translations generated by NMT and SMT systems. Since the performance of SMT is lower than that of NMT in general cases, the classification accuracy becomes more important to prevent the hybridized results from being lower than the accuracy of NMT, which we consider as a baseline.

4. Experiments

In this section, we present experiments conducted on the Slovenian–English language pair. The Slovenian language is a Slavic language with rich inflectional morphology and a relaxed word order. English is an analytic language with very little inflection, where word order is very important for understanding the meaning. Considering this, we are translating between two structurally very different languages. The experiments were conducted in both translation directions, and we describe the corpora and tools used for data preprocessing and training the MT systems. The experimental settings are provided, and we present the results of the HMT systems, comparing them with the baseline NMT systems. Additionally, we show the results obtained with various well-known binary classification algorithms.

4.1. Corpora and Tools

The ParaCrawl corpus [56] is a valuable resource for researchers and developers working in the fields of MT and NLP, providing a large and diverse set of parallel texts that can be used to train and evaluate models in a variety of languages (more than 80). The corpus was created by crawling and scraping multilingual content from the web, using a combination of automated and manual methods to filter and clean the data.

The corpus used was tokenized and lowercased, and sentences longer than 80 words were removed. To obtain a representative sample, sentences were chosen from different parts of the corpus to create the training, development, and test sets. The training set consisted of 9,000,000 sentences, the development set had 90,000 sentences, and the test set also had 90,000 sentences. The training set was used to train both the SMT and NMT systems. The development set was split into two parts, with 45,000 sentences each. For the NMT systems, the first part was used during the training as a validation set, and for the SMT systems, we used 500 sentences for optimization. In Ref. [57], the authors recommend using a maximum of 1000 sentences for optimization, and in Refs. [58,59], 500–700 sentences were used for optimization. We used the second part of the development set to train the classifiers and augmented the data to obtain 90,000 sentences. To test the SMT, NMT, and HMT systems, we used 3000 sentences from the test set. The final corpus sizes of all the sets are shown in Table 2.

To evaluate the MT systems, we used various evaluation metrics. Bilingual Evaluation Understudy (BLEU) [60] is a metric that operates on the principle of n-gram matching, which involves comparing sequences of words (or sometimes characters) between the machine translation and the reference translations. It considers both precision and brevity in its evaluation. The BLEU score is calculated by computing the precision of n-grams (usually up to a certain maximum length) in the translation and comparing this to the precision of the same n-grams in the reference translations. The precision values are then combined using a geometric mean. Additionally, BLEU incorporates a brevity penalty to discourage excessively short translations that may inflate the precision score. The resulting BLEU score ranges from 0 to 100, with a higher score indicating better translation quality. It is important to note that BLEU is a relatively simple metric that primarily measures lexical similarity and does not capture other aspects of translation quality, such as fluency, adequacy, or word order. Despite its limitations, BLEU remains widely used as a quick and automatic evaluation metric, especially when comparing different machine translation systems or evaluating improvements during system development. It provides a rough estimate of translation quality but should be used in conjunction with other evaluation metrics for a more comprehensive evaluation. Additional metrics have been included for more information about the quality. Character n-gram F-score (chrF) [61] is a metric that evaluates the translation quality based on character-level n-gram matches. It considers the precision and recall of character n-grams in both the machine translation and the reference translations. By considering character-level matches, chrF can capture the adequacy and fluency of the translation, even in cases where word order or word choice may differ. The resulting chrF score ranges from 0 to 100, with a higher score indicating better translation quality. Translation Edit Rate (TER) [62] is a metric that captures more global changes in the translation and is less sensitive to minor lexical variations. It aims to assess the overall fluency and adequacy of the translation by considering the broader context and number of changes needed to align it with the reference. The resulting TER score ranges from 0 to 100, with a lower score indicating better translation quality. For BLEU, chrF, and TER metrics, we utilized SacreBLEU [63], which provides a hassle-free computation of shareable, comparable, and reproducible scores. The Metric for Evaluation of Translation with Explicit ORdering (METEOR) [64] is a metric that primarily focuses on lexical similarity. METEOR incorporates more linguistic features and considers synonyms, stemming, and the reordering of words. It uses a combination of unigram matching, stemming, and WordNet synonym matching to compute an alignment score. Additionally, METEOR also incorporates a penalty for incorrect word order, rewarding translations that have a more similar word order to the references. The resulting METEOR score ranges from 0 to 100, with a higher score indicating better translation quality. The Consistent Translation Evaluation Metric (COMET) [65] is a metric that utilizes a pre-trained neural network model that is trained on a large parallel corpus of human translations. It compares the machine-generated translation against the human reference translations to compute a score that reflects the quality and similarity of the translation. The resulting COMET score ranges from 0 to 100, with a higher score indicating better translation quality.

In NMT, Byte Pair Encoding (BPE) [66] is used to address the out-of-vocabulary (OOV) words problem. Since NMT models learn from a fixed vocabulary, any word not present in the vocabulary is considered as an OOV word, and its translation cannot be learned. By applying BPE to the source and target language texts, we can split unknown words into subword units that are already present in the vocabulary. This helps the NMT model to translate sentences with OOV words accurately. For example, the rare word “petrichor” would be split into more common subwords, such as “pet”, “rich”, and “or”.

4.2. Experimental Settings for Models’ Training and Classification

To train NMT systems, we used Marian NMT [67], which is an efficient and free NMT framework written in pure C++ with minimal dependencies. Using toolkits such as Marian NMT, it is relatively straightforward to construct end-to-end NMT systems, which only require a little preprocessing of the training corpora and post-processing of the system output. We trained two NMT systems for both translation directions. The hyperparameters used for training are shown in Table 3.

To train the SMT systems, we used the Moses toolkit [69], which is an open-source toolkit with a wide variety of tools for the training and optimization of MT systems. We trained two independent phrase-based SMT systems for both translation directions. Each SMT system had six models and 14 parameters (model weights). To improve the SMT systems’ translation quality, model weights were optimized by the DE algorithm. In our previous research [59], we showed the competitive performance of the DE algorithm in comparison with MERT, MIRA, and PRO optimizers, which are commonly used in SMT optimization. The hyperparameters used to train and optimize the SMT systems are shown in Table 4.

4.3. Results

As NMT is the dominant approach, we used NMT as the baseline in our experiments. To evaluate the translation quality, the primary metric was the BLEU metric. We also included chrF, TER, METEOR, and COMET metrics for additional information. The ↑ and ↓ symbols in the tables indicate which values are better.

The results for the baseline (NMT) are shown in Table 5 for the BLEU, chrF, TER, METEOR, and COMET metrics. For translations from Slovenian to English, the baseline (NMT) achieved a BLEU score of 46.4, a chrF score of 65.5, a TER score of 40.1, a METEOR score of 70.5, and a COMET score of 83.3. For translations from English to Slovenian, the baseline (NMT) achieved a BLEU score of 32.0, a chrF score of 54.1, a TER score of 54.4, a METEOR score of 55.3, and a COMET score of 80.7.

The results for HMT are presented in Table 6 and Table 7 for the BLEU, chrF, TER, METEOR, and COMET metrics.

For the translation from Slovenian to English, seven classifiers achieved a better BLEU score than the baseline by a range of 0.4 to 1.5 points. The two classifiers achieved a worse or almost equal BLEU score compared to the baseline. Five classifiers achieved a better chrF score than the baseline by a range of 0.3 to 1.0 points. Four classifiers achieved a worse or almost equal chrF score compared to the baseline. One classifier achieved a better TER score than the baseline by 0.2 points. Eight classifiers achieved a worse or equal TER score compared to the baseline. One classifier achieved a better METEOR score than the baseline by 0.4 points. Eight classifiers achieved a worse or equal METEOR score compared to the baseline. Four classifiers achieved a better COMET score than the baseline by 0.2 to 0.6 points. Five classifiers achieved a worse or equal COMET score compared to the baseline.

For the translation from English to Slovenian, all nine classifiers achieved a better BLEU score than the baseline by a range of 8.4 to 10.9 points, a better chrF score than the baseline by a range of 6.7 to 9.0 points, a better TER score than the baseline by a range of 3.9 to 7.1 points, and a better METEOR score than the baseline by a range of 3.9 to 6.2 points. Seven classifiers achieved a better COMET score than the baseline by 0.3 to 1.8 points. Two classifiers achieved a worse COMET score compared to the baseline.

5. Discussion

We consider the NMT translation quality as our baseline. While NMT generally outperforms SMT, there are certain cases where SMT remains more competitive. As can be seen from the results, for translation from Slovenian to English, NMT achieved a better translation quality than SMT, while for the translation from English to Slovenian, SMT achieved a better translation quality. By using the best features of each system, HMT can offer an improved translation quality. HMT translates the source sentence using both systems and selects the more reliable translation depending on the features. In our experiment, the primary metric was the BLEU metric, while the other metrics were calculated as additional information. The results indicated similar conclusions to those obtained with the BLEU metric. For translation from Slovenian to English, SVM classifier achieved better scores according to all five metrics, with a BLEU score of 47.9, a chrF score of 66.6, a TER score of 39.9, a METEOR score of 70.9, and a COMET score of 83.9. For translation from English to Slovenian, two classifiers achieved the best scores. The MLP classifier achieved better scores according to BLEU and chrF metrics with a BLEU score of 42.9 and a chrF score of 63.1, while the NB classifier achieved better scores according to TER and COMET metrics, with a TER score of 47.4 and a COMET score of 82.5. Both MLP and NB classifiers achieved the same, better score according to the METEOR metric, with a METEOR score of 61.5. The results for the classifiers are shown in Table 6 and Table 7 for the BLEU, chrF, TER, COMET, and METEOR metrics. The upper bound for the BLEU metric is presented so we can see the maximum potential improvement in the classification task. It should be noted that the upper bound is only achievable if the classification is perfect, which is difficult to attain in reality.

To better understand the potential of HMT, Table 8 shows the maximum BLEU scores that can be achieved with perfect classification. For translation from Slovenian to English, this is 53.5, and for translation from English to Slovenian, this is 49.5. It also shows the percentage of translations where SMT or NMT is better based on their BLEU scores.

The contribution of the proposed system to translation quality is evident in the case of English to Slovenian translation, where NMT achieved a BLEU score of 32.0, and HMT achieved a BLEU score of 42.9, showing an improvement of 10.9 points. On the other hand, in the case of Slovenian to English, NMT achieved a BLEU score of 46.4, and HMT achieved a BLEU score of 47.9, showing an improvement of 1.5 points. Improving translation quality by 0.5 points or more can be a challenging task, especially if the initial translation quality is already high. Even small improvements in the translation quality can require significant effort and experimentation. In general, as the quality of the baseline system improves, it becomes increasingly difficult to make further gains in translation quality. However, this also depends on the specific language pair, the quality of the training data, the complexity of the target language, and other factors. Many MT systems already exist, and instead of spending months training new ones, we should consider reusing and combining them with one or more systems.

The limitation of the proposed system is the multilingual language model. Although multilingual language model supports many languages, there are still languages that are not included in its support. Additionally, the coverage of supported languages might be sparse, depending on the data upon which it was built.

Translation Examples

In Table 9 and Table 10, we present translation examples where SMT outperformed NMT. It is important to note that the ParaCrawl corpus used in our experiments was obtained through web crawling and filtering, resulting in a corpus that contains a significant amount of noise.

Table 9 shows some translation examples for translations from Slovenian to English. In the first example, we can see that the word order is different in the SMT and NMT translations. The NMT translation keeps the word order from the source sentence, while the SMT translation changes the word order and was closer to the reference translation. In the second example, we can see that the NMT translation literally translated the phrase as “golden wedding”, while the SMT translation translated it as “50th wedding anniversary” and was closer to the reference translation. In the fourth example, we show that the SMT used the simple present tense, while the NMT used an expression with a modal verb. Additionally, all three translations used different units: SMT used feet, NMT used meters, and the reference used yards.

Table 10 shows some translation examples for translations from English to Slovenian. In the first example, we can see that the NMT translation uses the singular form instead of the plural form, probably because of the noun that follows, which is in the singular form. In the second example, we can see that the length of the NMT translation is much shorter than that of the reference, source, and even SMT translation. In the fourth example, we can see that although both translations look good, the SMT translation provides a more accurate translation.

6. Conclusions

The main contributions of this paper involve enhancing NMT translation quality through the integration of SMT and representing the source sentence and translations as vectors in a shared vector space using a multilingual language model. These features were utilized to capture and quantify the differences between the two translation approaches. To determine the best possible translation, the classification algorithm predicts which translation system produced a superior translation. Several classifiers were used to select the best possible translation, and the results showed that the proposed HMT system improved the BLEU score by 1.5 and 10.9 points for both translation directions, respectively. The proposed method of combining SMT and NMT in the hybrid system is novel. Our framework is language-independent and can be applied to other languages supported by the multilingual language model. As seen from the results, the proposed HMT system successfully combined the strengths of both NMT and SMT and, by using the best features of each system, can offer an improved translation quality.

For future work, researchers can explore novel approaches to integrate additional models or even incorporate domain-specific models for an improved translation performance. Another idea worth exploring is the development of an even larger multilingual language model, expanding its coverage and potentially enhancing translation quality.

Author Contributions

Conceptualization, J.D.; methodology, J.B. and M.S.M.; software, J.D. and D.V.; validation, J.B., M.S.M., and D.V.; formal analysis, J.D. and J.B.; investigation, J.D. and M.S.M.; resources, J.D. and M.S.M.; writing—original draft preparation, J.D., J.B., M.S.M., and D.V.; writing—review and editing, J.D., J.B., M.S.M., and D.V.; visualization, J.D.; supervision, J.B. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Slovenian Research Agency (research core funding No. P2-0069— Advanced Methods of Interaction in Telecommunications, P2-0041—Computer Systems, Methodologies, and Intelligent Services, and P2-0057—Information systems).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. The data can be found here: https://opus.nlpl.eu/ParaCrawl.php (accessed on 14 April 2023).

Acknowledgments

The authors thank the authors of the ParaCrawl parallel corpora, the authors of the Marian NMT and Moses SMT toolkits, and the authors of mBERT for making all of these publicly available.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

MT	Machine Translation
NMT	Neural Machine Translation
SMT	Statistical Machine Translation
HMT	Hybrid Machine Translation
NLP	Natural Language Processing
LR	Logistic Regression
DT	Decision Tree
GBDT	Gradient-Boosted Decision Tree
RF	Random Forest
NB	Naive Bayes
kNN	K-Nearest Neighbors
MLP	Multilayer Perceptron
CNN	Convolutional Neural Network
SVM	Support Vector Machine
BLEU	BiLingual Evaluation Understudy
chrF	Character F-score
TER	Translation Edit Rate
BERT	Bidirectional Encoder Representations from Transformers
mBERT	Multilingual Bidirectional Encoder Representations from Transformers
DE	Differential Evolution
RNN	Recurrent Neural Networks
LSTM	Long Short-Term Memory
GRU	Gated Recurrent Unit
OOV	Out-of-vocabulary
WMT	Workshop on Machine Translation

References

Popović, M. Comparing Language Related Issues for NMT and PBMT between German and English. Prague Bull. Math. Linguist. 2017, 108, 209–220. [Google Scholar] [CrossRef]
Popović, M. Language-related issues for NMT and PBMT for English–German and English–Serbian. Mach. Transl. 2018, 32, 237–253. [Google Scholar] [CrossRef]
Pires, T.; Schlinger, E.; Garrette, D. How Multilingual is Multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; Association for Computational Linguistics: Cedarville, OH, USA; pp. 4996–5001. [Google Scholar] [CrossRef]
Koehn, P.; Och, F.J.; Marcu, D. Statistical phrase-based translation. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, Edmonton, AB, Canada, 27 May–1 June 2003. [Google Scholar]
Koehn, P. Statistical Machine Translation; Cambridge University Press: Cambridge, UK, 2010. [Google Scholar]
Lopez, A. Statistical machine translation. ACM Comput. Surv. (CSUR) 2008, 40, 1–49. [Google Scholar] [CrossRef]
Cho, K.; Van Merriënboer, B.; Bahdanau, D.; Bengio, Y. On the properties of neural machine translation: Encoder-decoder approaches. arXiv 2014, arXiv:1409.1259. [Google Scholar]
Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to sequence learning with neural networks. Adv. Neural Inf. Process. Syst. 2014, 2, 3104–3112. [Google Scholar]
Vashishth, S.; Bhandari, M.; Yadav, P.; Rai, P.; Bhattacharyya, C.; Talukdar, P. Incorporating syntactic and semantic information in word embeddings using graph convolutional networks. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; Association for Computational Linguistics: Cedarville, OH, USA, 2019; pp. 3308–3318. [Google Scholar]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar] [CrossRef]
Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
Meng, F.; Lu, Z.; Wang, M.; Li, H.; Jiang, W.; Liu, Q. Encoding Source Language with Convolutional Neural Network for Machine Translation. arXiv 2015, arXiv:1503.01838. [Google Scholar] [CrossRef]
Stahlberg, F.; Hasler, E.; Byrne, B. The edit distance transducer in action: The University of Cambridge English-German system at WMT16. arXiv 2016, arXiv:1606.04963. [Google Scholar]
Stahlberg, F. Neural Machine Translation: A Review. J. Artif. Intell. Res. 2020, 69, 343–418. [Google Scholar] [CrossRef]
Wang, X.; Pham, H.; Dai, Z.; Neubig, G. SwitchOut: An efficient data augmentation algorithm for neural machine translation. arXiv 2018, arXiv:1808.07512. [Google Scholar]
Sennrich, R.; Haddow, B.; Birch, A. Edinburgh neural machine translation systems for WMT 16. arXiv 2016, arXiv:1606.02891. [Google Scholar]
Cromieres, F.; Chu, C.; Nakazawa, T.; Kurohashi, S. Kyoto university participation to WAT 2016. In Proceedings of the 3rd Workshop on Asian Translation (WAT2016), Osaka, Japan, 11–16 December 2016; pp. 166–174. [Google Scholar]
Huang, J.X.; Lee, K.S.; Kim, Y.K. Hybrid Translation with Classification: Revisiting Rule-Based and Neural Machine Translation. Electronics 2020, 9, 201. [Google Scholar] [CrossRef]
Sen, S.; Hasanuzzaman, M.; Ekbal, A.; Bhattacharyya, P.; Way, A. Neural machine translation of low-resource languages using SMT phrase pair injection. Nat. Lang. Eng. 2021, 27, 271–292. [Google Scholar] [CrossRef]
Yan, R.; Li, J.; Su, X.; Wang, X.; Gao, G. Boosting the Transformer with the BERT Supervision in Low-Resource Machine Translation. Appl. Sci. 2022, 12, 7195. [Google Scholar] [CrossRef]
Bacanin, N.; Zivkovic, M.; Stoean, C.; Antonijevic, M.; Janicijevic, S.; Sarac, M.; Strumberger, I. Application of Natural Language Processing and Machine Learning Boosted with Swarm Intelligence for Spam Email Filtering. Mathematics 2022, 10, 4173. [Google Scholar] [CrossRef]
Fuad, A.; Al-Yahya, M. Cross-Lingual Transfer Learning for Arabic Task-Oriented Dialogue Systems Using Multilingual Transformer Model mT5. Mathematics 2022, 10, 746. [Google Scholar] [CrossRef]
Baniata, L.H.; Kang, S.; Ampomah, I.K.E. A Reverse Positional Encoding Multi-Head Attention-Based Neural Machine Translation Model for Arabic Dialects. Mathematics 2022, 10, 3666. [Google Scholar] [CrossRef]
Alokla, A.; Gad, W.; Nazih, W.; Aref, M.; Salem, A.B. Retrieval-Based Transformer Pseudocode Generation. Mathematics 2022, 10, 604. [Google Scholar] [CrossRef]
Minaee, S.; Kalchbrenner, N.; Cambria, E.; Nikzad, N.; Chenaghlu, M.; Gao, J. Deep Learning–Based Text Classification: A Comprehensive Review. ACM Comput. Surv. 2021, 54, 62. [Google Scholar] [CrossRef]
Chen, L.C.; Chang, K.H.; Yang, S.C.; Chen, S.C. A Corpus-Based Word Classification Method for Detecting Difficulty Level of English Proficiency Tests. Appl. Sci. 2023, 13, 1699. [Google Scholar] [CrossRef]
Canbek, G.; Taskaya Temizel, T.; Sagiroglu, S. PToPI: A Comprehensive Review, Analysis, and Knowledge Representation of Binary Classification Performance Measures/Metrics. SN Comput. Sci. 2023, 4, 13. [Google Scholar] [CrossRef] [PubMed]
Hsu, B.M. Comparison of Supervised Classification Models on Textual Data. Mathematics 2020, 8, 851. [Google Scholar] [CrossRef]
Panigrahi, R.; Borah, S.; Bhoi, A.K.; Ijaz, M.F.; Pramanik, M.; Kumar, Y.; Jhaveri, R.H. A Consolidated Decision Tree-Based Intrusion Detection System for Binary and Multiclass Imbalanced Datasets. Mathematics 2021, 9, 751. [Google Scholar] [CrossRef]
Ding, W.; Chen, Q.; Dong, Y.; Shao, N. Fault Diagnosis Method of Intelligent Substation Protection System Based on Gradient Boosting Decision Tree. Appl. Sci. 2022, 12, 8989. [Google Scholar] [CrossRef]
Lučin, I.; Lučin, B.; Čarija, Z.; Sikirica, A. Data-Driven Leak Localization in Urban Water Distribution Networks Using Big Data for Random Forest Classifier. Mathematics 2021, 9, 672. [Google Scholar] [CrossRef]
Gan, S.; Shao, S.; Chen, L.; Yu, L.; Jiang, L. Adapting Hidden Naive Bayes for Text Classification. Mathematics 2021, 9, 2378. [Google Scholar] [CrossRef]
Kang, S. k-Nearest Neighbor Learning with Graph Neural Networks. Mathematics 2021, 9, 830. [Google Scholar] [CrossRef]
Nadeem, M.I.; Ahmed, K.; Li, D.; Zheng, Z.; Naheed, H.; Muaad, A.Y.; Alqarafi, A.; Abdel Hameed, H. SHO-CNN: A Metaheuristic Optimization of a Convolutional Neural Network for Multi-Label News Classification. Electronics 2023, 12, 113. [Google Scholar] [CrossRef]
Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar] [CrossRef]
Savini, E.; Caragea, C. Intermediate-Task Transfer Learning with BERT for Sarcasm Detection. Mathematics 2022, 10, 844. [Google Scholar] [CrossRef]
Patil, R.; Boit, S.; Gudivada, V.; Nandigam, J. A Survey of Text Representation and Embedding Techniques in NLP. IEEE Access 2023, 11, 36120–36146. [Google Scholar] [CrossRef]
Dash, G.; Sharma, C.; Sharma, S. Sustainable Marketing and the Role of Social Media: An Experimental Study Using Natural Language Processing (NLP). Sustainability 2023, 15, 5443. [Google Scholar] [CrossRef]
de Lima, R.R.; Fernandes, A.M.R.; Bombasar, J.R.; da Silva, B.A.; Crocker, P.; Leithardt, V.R.Q. An Empirical Comparison of Portuguese and Multilingual BERT Models for Auto-Classification of NCM Codes in International Trade. Big Data Cogn. Comput. 2022, 6, 8. [Google Scholar] [CrossRef]
Gomaa, W.H.; Fahmy, A.A. A Survey of Text Similarity Approaches. Int. J. Comput. Appl. 2013, 68, 13–18. [Google Scholar]
Dzisevič, R.; Šešok, D. Text Classification using Different Feature Extraction Approaches. In Proceedings of the 2019 Open Conference of Electrical, Electronic and Information Sciences (eStream), Vilnius, Lithuania, 25 April 2019; pp. 1–4. [Google Scholar] [CrossRef]
Magalhães, D.; Pozo, A.; Santana, R. An empirical comparison of distance/similarity measures for Natural Language Processing. In Proceedings of the Anais do XVI Encontro Nacional de Inteligência Artificial e Computacional, SBC, Porto Alegre, Brasil, 15–18 October 2019; pp. 717–728. [Google Scholar] [CrossRef]
Wang, J.; Dong, Y. Measurement of Text Similarity: A Survey. Information 2020, 11, 421. [Google Scholar] [CrossRef]
Ristanti, P.Y.; Wibawa, A.P.; Pujianto, U. Cosine Similarity for Title and Abstract of Economic Journal Classification. In Proceedings of the 2019 5th International Conference on Science in Information Technology (ICSITech), Jogjakarta, Indonesia, 23–24 October 2019; pp. 123–127. [Google Scholar] [CrossRef]
Park, K.; Hong, J.S.; Kim, W. A Methodology Combining Cosine Similarity with Classifier for Text Classification. Appl. Artif. Intell. 2020, 34, 396–411. [Google Scholar] [CrossRef]
Eligüzel, N.; Çetinkaya, C.; Dereli, T. A novel approach for text categorization by applying hybrid genetic bat algorithm through feature extraction and feature selection methods. Expert Syst. Appl. 2022, 202, 117433. [Google Scholar] [CrossRef]
Kadhim, A.I. Survey on Supervised Machine Learning Techniques for Automatic Text Classification. Artif. Intell. Rev. 2019, 52, 273–292. [Google Scholar] [CrossRef]
Berciu, A.G.; Dulf, E.H.; Micu, D.D. Improving the Efficiency of Electricity Consumption by Applying Real-Time Fuzzy and Fractional Control. Mathematics 2022, 10, 3807. [Google Scholar] [CrossRef]
Inyang, U.; Akpan, E.; Akinyokun, O. A Hybrid Machine Learning Approach for Flood Risk Assessment and Classification. Int. J. Comput. Intell. Appl. 2020, 19, 2050012. [Google Scholar] [CrossRef]
Krivulin, N.; Prinkov, A.; Gladkikh, I. Using Pairwise Comparisons to Determine Consumer Preferences in Hotel Selection. Mathematics 2022, 10, 730. [Google Scholar] [CrossRef]
Machado, J.A.T.; Mendes Lopes, A. Fractional Jensen–Shannon analysis of the scientific output of researchers in fractional calculus. Entropy 2017, 19, 127. [Google Scholar] [CrossRef]
Shamir, R.R.; Duchin, Y.; Kim, J.; Sapiro, G.; Harel, N. Continuous dice coefficient: A method for evaluating probabilistic segmentations. arXiv 2019, arXiv:1906.11031. [Google Scholar]
Cha, S.H. Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions. Int. J. Math. Model. Meth. Appl. Sci. 2007, 1, 300–307. [Google Scholar]
Ibrahim, H.; El Kerdawy, A.M.; Abdo, A.; Eldin, A.S. Similarity-based machine learning framework for predicting safety signals of adverse drug–drug interactions. Inform. Med. Unlocked 2021, 26, 100699. [Google Scholar] [CrossRef]
Gutiérrez-Reina, D.; Sharma, V.; You, I.; Toral, S. Dissimilarity metric based on local neighboring information and genetic programming for data dissemination in vehicular ad hoc networks (VANETs). Sensors 2018, 18, 2320. [Google Scholar] [CrossRef] [PubMed]
Bañón, M.; Chen, P.; Haddow, B.; Heafield, K.; Hoang, H.; Esplà-Gomis, M.; Forcada, M.L.; Kamran, A.; Kirefu, F.; Koehn, P.; et al. ParaCrawl: Web-Scale Acquisition of Parallel Corpora. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 4555–4567. [Google Scholar] [CrossRef]
Neubig, G.; Watanabe, T. Optimization for Statistical Machine Translation: A Survey. Comput. Linguist. 2016, 42, 1–54. [Google Scholar] [CrossRef]
Lü, Y.; Huang, J.; Liu, Q. Improving Statistical Machine Translation Performance by Training Data Selection and Optimization. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Prague, Czech Republic, 28–30 June 2007; pp. 343–350. [Google Scholar]
Dugonik, J.; Bošković, B.; Brest, J.; Sepesy Maučec, M. Improving Statistical Machine Translation Quality Using Differential Evolution. Informatica 2019, 30, 629–645. [Google Scholar] [CrossRef]
Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Stroudsburg, PA, USA, 7–12 July 2002; pp. 311–318. [Google Scholar] [CrossRef]
Popović, M. chrF: Character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, Lisbon, Portugal, 17–18 September 2015; Association for Computational Linguistics: Cedarville, OH, USA, 2015; pp. 392–395. [Google Scholar] [CrossRef]
Snover, M.; Dorr, B.; Schwartz, R.; Micciulla, L.; Makhoul, J. A Study of Translation Edit Rate with Targeted Human Annotation. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers, Cambridge, MA, USA, 8–12 August 2006; pp. 223–231. [Google Scholar]
Post, M. A Call for Clarity in Reporting BLEU Scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, Belgium, Brussels, 31 October–1 November 2018; Association for Computational Linguistics: Cedarville, OH, USA, 2018; pp. 186–191. [Google Scholar]
Banerjee, S.; Lavie, A. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, MI, USA, 29 June 2005; Association for Computational Linguistics: Cedarville, OH, USA, 2005; pp. 65–72. [Google Scholar]
Rei, R.; Stewart, C.; Farinha, A.C.; Lavie, A. COMET: A Neural Framework for MT Evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; Association for Computational Linguistics: Cedarville, OH, USA, 2020; pp. 2685–2702. [Google Scholar] [CrossRef]
Sennrich, R.; Haddow, B.; Birch, A. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, 7–12 August 2016; pp. 1715–1725. [Google Scholar] [CrossRef]
Junczys-Dowmunt, M.; Grundkiewicz, R.; Dwojak, T.; Hoang, H.; Heafield, K.; Neckermann, T.; Seide, F.; Germann, U.; Fikri Aji, A.; Bogoychev, N.; et al. Marian: Fast Neural Machine Translation in C++. In Proceedings of the ACL 2018, System Demonstrations, Melbourne, Australia, 15–20 July 2018; pp. 116–121. [Google Scholar]
Marian NMT Documentation. Online. 2018. Available online: https://marian-nmt.github.io/docs/cmd/marian/ (accessed on 14 April 2023).
Koehn, P.; Hoang, H.; Birch, A.; Callison-Burch, C.; Federico, M.; Bertoldi, N.; Cowan, B.; Shen, W.; Moran, C.; Zens, R.; et al. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, Prague, Czech Republic, 23–30 June 2007; pp. 177–180. [Google Scholar]
Moses SMT Documentation. Online. 2017. Available online: http://www2.statmt.org/moses/ (accessed on 14 April 2023).

Figure 1. The Transformer architecture for training NMT systems.

Figure 2. The SMT architecture for training SMT systems.

Figure 3. NMT generates the translation

t_{1}

using one large model, while SMT generates the translation

t_{2}

using multiple models.

Figure 3. NMT generates the translation

t_{1}

using one large model, while SMT generates the translation

t_{2}

using multiple models.

Figure 4. The HMT architecture.

Table 1. The full list of features and their positions in the feature vector.

Feature	Name
$x_{1}$	Cosine similarity
$x_{2}$	Jensen–Shannon divergence
$x_{3}$	Euclidean distance
$x_{4}$	Cityblock distance
$x_{5}$	Squared Euclidean distance
$x_{6}$	Chebyshev distance
$x_{7}$	Canberra distance
$x_{8}$	Dice coefficient
$x_{9}$	Kulczynski distance
$x_{10}$	Russell–Rao similarity
$x_{11}$	Sokal–Sneath similarity

Table 2. The ParaCrawl corpus division for the training, development, and test sets.

	Training	Development			Test
	Training	SMT	NMT	HMT	Test
Sentences	9,000,000	500	45,000	90,000	3000

Table 3. Marian NMT training parameters. For the description of parameters and their values, see [68].

Parameter	Value
type	transformer
workspace GPU memory	10 GB
max–length	100
mini–batch–fit	True
maxi–batch	1000
early–stopping	10
after–epochs	50
valid–metrics	cross–entropy and perplexity
valid–mini–batch	64
beam–size	6
normalize	0.6
enc–depth	6
dec–depth	6
transformer–heads	8
transformer–postprocess–emb	d
transformer–postprocess	dan
transformer–dropout	0.1
label–smoothing	0.1
learn–rate	0.0003
lr–warmup	16,000
lr–decay–inv–sqrt	16,000
optimizer–params	0.9, 0.98, 1 × 10 $^{- 9}$
clip–norm	5
tied–embeddings–all	True
sync–sgd	True
exponential–smothing	True

Table 4. Moses SMT training parameters [70]. The last three parameters are for DE optimization.

Parameter	Value
alignment	grow-diag-final-and
reordering	msd-bidirectional-fe
smoothing	improved-kneser-ney
evaluation metric	BLEU
n-gram language model order	5
number of generations	50
population size	25
dimension	14

Table 5. The results of baseline (NMT) for both translation directions.

	Baseline (NMT)
	BLEU ↑	chrF ↑	TER ↓	METEOR ↑	COMET ↑
Slovenian ⇒ English	46.4	65.6	40.1	70.5	83.3
English ⇒ Slovenian	32.0	54.1	54.4	55.3	80.7