Article

Application of Natural Language Processing and Machine Learning Boosted with Swarm Intelligence for Spam Email Filtering

1
Faculty of Informatics and Computing, Singidunum University, Danijelova 32, 11010 Belgrade, Serbia
2
Human Language Technologies Center, Faculty of Mathematics and Computer Science, University of Bucharest, Academiei 14, 010014 Bucharest, Romania
3
Department of Computer Science, Faculty of Sciences, University of Craiova, A.I.Cuza, 13, 200585 Craiova, Romania
*
Author to whom correspondence should be addressed.
Mathematics 2022, 10(22), 4173; https://doi.org/10.3390/math10224173
Submission received: 22 September 2022 / Revised: 29 October 2022 / Accepted: 4 November 2022 / Published: 8 November 2022

Abstract

Spam represents a genuine irritation for email users, since it often disturbs them during their work or free time. Machine learning approaches are commonly utilized as the engine of spam detection solutions, as they are efficient and usually exhibit a high degree of classification accuracy. Nevertheless, it sometimes happens that good messages are labeled as spam and, more often, that spam emails enter the inbox as good ones. This manuscript proposes a novel email spam detection approach that combines machine learning models with an enhanced sine cosine swarm intelligence algorithm to counter the deficiencies of the existing techniques. The introduced novel sine cosine algorithm was adopted for training logistic regression and for tuning XGBoost models as part of the hybrid machine learning-metaheuristics framework. The developed framework has been validated on two public high-dimensional spam benchmark datasets (CSDMC2010 and TurkishEmail), and the extensive experiments conducted have shown that the model successfully deals with high-dimensional data. The comparative analysis with other cutting-edge spam detection models, also based on metaheuristics, has shown that the proposed hybrid method obtains superior performance in terms of accuracy, precision, recall, f1 score, and other relevant classification metrics. Additionally, the empirically established superiority of the proposed method is validated using rigorous statistical tests.

1. Introduction

A significant amount of unsolicited emails comes from legitimate e-commerce businesses that encourage users to buy their products. These are not particularly dangerous for the email recipients, although they are distracting, and most people would prefer that these messages be sent directly to their spam folder. The most dangerous ones, however, are phishing messages that target sensitive information such as usernames, passwords, card numbers, or PINs, or that ask directly for a sum of money with the promise that a lot more will be paid back later [1]. Scammers have also taken advantage of the uncertainty and fear surrounding COVID-19, sending messages regarding financial assistance, offering protective equipment, and often impersonating various medical bodies, with users being rather susceptible to the topic [2]. Phishing emails also target the employees of large companies, aiming to create breaches that can later be exploited. According to a recent report from IBM (https://www.ibm.com/security/data-breach, accessed on 17 September 2022), the global average total cost of a data breach was USD 4.35 million in 2022, an increase from USD 4.24 million in 2021, while stolen and compromised credentials are responsible for 19% of breaches, and phishing, for 16%.
Spam messages do not only burden email users; they also consume a large amount of network bandwidth, occupy disk space, and often carry malware in attachments. A general classification of anti-spam approaches partitions these methods into two broad categories: static and dynamic [3]. Filtering strategies based on specified whitelists and blacklists are examples of static techniques. However, the malware mentioned above sometimes transforms the receiving device into a machine that further sends spam without the user being aware of it [4]. Hence, static methods prove to be insufficient, since they are based on a list of IPs or email addresses that are often used as senders, while spammers utilize the newly infected devices as transmitters. In order to determine whether an email is spam, dynamic algorithms typically take into account the content of the message, utilizing text modeling techniques developed with statistical or machine-learning methodologies.
Different algorithms approach tasks in various ways, and to differing degrees of success. Artificial intelligence is a viable solution for issues in the constantly evolving field of network security, due to its capacity to learn and to adapt to a changing environment. Traditional approaches such as firewalls, blacklists, and others are still used, but their effectiveness must be constantly monitored and maintained. Researchers have tried to enhance current models and increase network security as a whole by applying AI to these issues.
In the last decade, considerable advances in the domain of Artificial Intelligence (AI) have resulted in the widespread adoption of machine learning (ML) algorithms in a wide spectrum of different industries. Business [5], finance [6], healthcare [7], and other fields [8] rely on AI for everyday tasks and operations. Consequently, the number and the variety of solutions that AI offers to real-world challenges is rapidly increasing. The most recent applications of AI models include network security and intrusion detection [9,10,11], phishing [12,13], IoT network botnet discovery [14,15,16], and many more. Spam detection is still a particularly open issue that has been addressed several times with ML approaches, as can be seen from the recent literature [4,17,18,19]. This problem falls within the domain of classical classification tasks. However, according to the no free lunch theorem, a universal solution does not exist, and for any given method it is possible to find a problem where another method brings better results. Inspired by the research provided in [4], this research proposes two approaches: a logistic regression model trained by a metaheuristic algorithm (the same setup as in [4]), and an XGBoost model tuned using the same metaheuristic.
Logistic regression (LR) [20] is an approach well suited to determining the connections between features and specific outcomes, and consequently, it is frequently used for classification tasks. It is considered the baseline ML method in natural language processing. LR can be used for either binary or multi-class classification problems. It is a very simple and efficient model that allows for fast classification, which is extremely useful in real-time applications, and it has already been used to tackle the spam filtering task [21]. Notwithstanding the obvious advantages of LR, the shortcoming of this approach is that it commonly relies on stochastic gradient descent (SGD) for training, which can lead to premature convergence towards poor local optima. As in [4], this paper proposes avoiding this problem by performing the training with the help of a metaheuristic algorithm.
Concerning classification tasks, ensemble learning models typically outperform single-model techniques. The XGBoost model has been established as a powerful ensemble classifier, and it is very popular among the research community for solving various difficult challenges. As a result, the XGBoost model has been utilized in a variety of application domains, including healthcare [22,23,24,25], finance [26,27], and many others [28,29]. Notwithstanding the respectable level of performance that the XGBoost model is capable of, a challenge still exists to properly implement the model and to appropriately tune its control parameters for the method to deliver the expected accuracy level for each particular classification problem.
XGBoost hyperparameter tuning refers to the choice of appropriate values for the parameters of the approach, and as the model must be tuned for each individual problem that needs to be solved, it deserves adequate attention. The task belongs to the class of NP-hard challenges. Traditionally, deterministic techniques cannot be applied, as they would require an impractical amount of time to find the solution. Metaheuristic algorithms, on the other hand, belong to a group of stochastic techniques and have been utilized to address a variety of optimization tasks in different domains, with significant success. Metaheuristic approaches can be used to address NP-hard problems that are otherwise considered impossible to solve by conventional methods [30].
A prominent subclass of the metaheuristic algorithms is swarm intelligence [31]. Algorithms that fall into this group are inspired by the natural processes of animal social behavior, mathematically modeled through the actions performed by the individuals in the population. A variety of efficient optimization algorithms have been developed by modeling the complex behaviors of groups of wolves, ants, bees, fireflies, whales, and so on, in the shape of the gray wolf optimizer (GWO) [32], ant colony optimization (ACO) [33], artificial bee colony (ABC) [34], firefly algorithm (FA) [35], and whale optimization algorithm (WOA) [36]. Notable exceptions are the group of algorithms influenced by mathematical laws and the properties of specific functions, where the most famous representatives are the arithmetic optimization algorithm (AOA) [37] and the sine cosine algorithm (SCA) [38], the latter being used in this paper as well.
The SCA algorithm, proposed by [38] in 2016, gained significant popularity with researchers, and has been proven to be a powerful and efficient optimizer. It was inspired by the mathematical properties shown by sine and cosine functions that are used to guide the search of the algorithm. Promising results on benchmark functions under test conditions made it an attractive option for scientists in various domains; however, extensive simulations have shown that there is enough room for additional enhancements of the basic implementation. This research proposes an enhanced version of the SCA algorithm, which was named diversity-oriented SCA (DOSCA), and implemented with the goal to address the known deficiencies of the basic SCA.
The goal of this manuscript is to employ the implemented DOSCA within two ML models and to apply them to the spam classification problem, similarly to the approach presented in [4]. DOSCA was used to tune the hyperparameters of the XGBoost model and to train the logistic regression (LR) model. The most important contributions of this research can be summarized in the following way:
  • Develop a novel SCA-based algorithm that specifically targets the known drawbacks of the basic SCA implementation.
  • Utilize the novel SCA algorithm to train the LR model and to tune the hyperparameters of the XGBoost model.
  • Evaluate the proposed models against two benchmark spam detection datasets.
The evaluation methodology implies a comparison of the results of several methods based on measures such as accuracy, precision, recall, and f1 score. Two datasets have been used in the experiments, CSDMC2010 and the Turkish Email Dataset. The Turkish language is very challenging for the classification of spam emails, since it has a more complex semantic structure than English. As always in data mining, data preprocessing plays an important role, since adequate preparation enables the algorithms to produce effective results. Both datasets have been tackled with the 500 and 1000 most representative features, as described in Section 4. After preprocessing, the models have been evaluated on these two datasets in terms of classification accuracy and recall, since the false negative (FN) rate was especially targeted. The quality of the results of the proposed models was superior when compared to the outputs obtained using other methods.
The rest of the paper is structured as follows: Section 2 presents the background on spam filtering systems and metaheuristics optimization. Section 3 describes the basic SCA metaheuristics, highlights its deficiencies, and proposes an improved version of the algorithm. The whole of Section 4 is dedicated to preprocessing, as it is a separate problem that falls into the natural language processing (NLP) domain. Section 5 describes the experimental setup, presents the obtained results, and validates them statistically.

2. Background and Literature Review

This section first provides an overview of the most relevant spam detection approaches from the recent literature. Afterwards, a brief summary of the metaheuristics optimization algorithms is provided. Finally, this section gives a survey of hybrid ML and swarm intelligence approaches that have been applied to the spam detection problem.

2.1. Spam Detection

Modern spam filtering systems typically include an ML model that is used for classification. Three of the most commonly applied ML models are logistic regression (LR), extreme learning machine (ELM), and XGBoost.
LR classifiers are based on a logistic function that models a dichotomous dependent variable, under the supposition that the components of a feature vector are mutually independent [39]. Because of its simplicity, quick convergence, and straightforward interpretation, LR is frequently employed in spam filtering. The algorithm is linear, with a nonlinear transform applied to the output.
ELM is an ML model that has recently gained popularity within the academic community, although it was introduced in 2004 by [40]. ELM operates with single-hidden-layer feed-forward neural networks (SLFNs). This approach has been able to achieve improved generalization performance in comparison to other feed-forward neural network approaches, while providing excellent learning speed and high efficiency. ELM classifiers are used in classification, regression, and clustering, as well as in sparse approximation, compression, and feature learning. Typically, they are created using a single layer or several layers of hidden nodes, with the hidden nodes' settings left untuned. Olatunji [41] emphasized that on a comparative scale based on accuracy, SVM outperformed ELM. However, ELM fared substantially better than SVM in terms of the speed of operation.
The XGBoost algorithm is utilizing the additive training method for the optimization of the objective function [42], meaning that every step of the optimization task depends on the result from the preceding step.
Dedeturk et al. [4] concluded that LR is sensitive to equal feature weights, and that the weight and bias values it obtains may converge to local minima. The LR classifier behaves stably after tf-idf feature extraction, and it yields good accuracy when a dataset is preprocessed with a feature reduction technique. The study published in [43] used the pre-trained bidirectional encoder representation from transformers (BERT) and the LR algorithm to categorize ham and spam emails in order to solve the NLP email classification challenge. The LR algorithm produced the best classification results on the training and test datasets. Another study [44] compared accuracy, recall, and precision for many classifiers, and among all of them, only LR achieved equal percentages for all three measures. Goodman and Yih [45] presented a straightforward LR model trained using an online gradient descent approach, and claimed that their model can give outcomes competitive with the best published generative approach.
Lucay [46] suggested a hybrid approach, explaining that the weights between the input and hidden layers are given random values by the ELM, while the hidden layer's biases are fixed during training. The online sequential ELM performs well in terms of solution quality and execution speed. ELM is an excellent method for imbalanced datasets, and email datasets very often have such a form. Roul [47] proposed a method for text mining search using a Multilayer ELM, since this procedure overcomes limitations of classic backpropagation neural networks. It saved training time, since it did not require fine-tuning of hyperparameters and converged to a global optimum without using kernel techniques for feature separation.
The next advanced spam filtering technique is XGBoost, an ensemble method that uses decision trees within a gradient-boosting framework. It is usually used for predictions on unstructured data such as images and text. The main pillars of the XGBoost procedure are parallel processing, tree pruning, handling of missing values, and the use of regularization to avoid overfitting and bias. Ismail et al. [48] reported improved performance of an Extreme Gradient Boosting-based spam detection model, although, to their knowledge, the technique has received little attention for spam email detection issues. Anitha et al. [49] reported increased accuracy in spam detection based on the XGBoost classifier. They explored the proposed system's performance using a more comprehensive range of experimental metrics and achieved better accuracy (95%) when compared to other classifiers. Pandey et al. [50] suggested the XGBoost method, which chooses the most crucial characteristics for effective phishing website detection using a feature selection strategy. Both trustworthy and fraudulent websites follow specific patterns, and the outcomes of the machine learning classifiers serve as the foundation for the phishing detection model, which was applied to evaluate phishing websites. In comparison to the AdaBoost and gradient boosting machine learning methods, XGBoost performs better.
Generally, despite the fact that these classifiers are simple to use, are generalizable, and are reasonably effective, they frequently have drawbacks such as the ones indicated by Dedeturk et al. [4]:
  • the dimensionality curse,
  • significant expenses for computation,
  • misclassification rates,
  • responsiveness to feature weights,
  • modest operation speeds for practical applications,
  • overfitting or getting stuck in local minima.
The effectiveness of these classifiers is also influenced by the nature of the problem and a few specified criteria. The novel spam filter methodology proposed in this paper aims to identify spam in emails in light of the shortcomings of the current spam detection techniques.

2.2. Metaheuristics Optimization

Traditional, deterministic algorithms are not suitable for solving NP-hard tasks, as they would require an impractical amount of time and resources to find the solution. Metaheuristic algorithms, on the other hand, provide satisfying solutions (not guaranteed to be the best, but good enough) in a reasonable time. One of the most prominent groups of metaheuristic optimization algorithms is swarm intelligence, where the algorithms are inspired by the behavior of different types of animals and processes found in nature. Some of the most famous nature-inspired methods include the ABC [34] metaheuristics, which is frequently utilized to optimize the performance of neural networks [51]. Another notable example is the WOA [36], which models the unique hunting techniques of humpback whales. WOA is extremely popular due to its interesting search patterns, and it has been utilized to address many real-world challenges with great success [52]. Another popular metaheuristic, the GWO [32], was modeled to mimic the hunting techniques exhibited by a pack of gray wolves, and has also been established as a very powerful optimizer with numerous applications such as [53]. The original FA metaheuristics [35] is also well known and capable of superior performance due to its powerful search, and it was recently used in a wide range of domains, including credit card fraud detection [54], medical diagnostics [55], neural network optimization [56], plant classification [57], and many others.
Recent publications show the successful combination of neural networks and swarm intelligence metaheuristics, and also other successful application domains. The most notable contemporary applications of metaheuristics optimization include COVID-19 MRI classification and illness severity prediction [58], computer-assisted tumor MRI classification [59], feature selection task [60], cryptocurrencies trends estimation [61], security and intrusion detection [62], fake news detection [63], cloud computing workflow planning [64], sensor networks tuning [65], and many others.
According to the famous no free lunch theorem, the perfect single solution that is the best for all optimization problems cannot be proven to exist. Therefore, recent research is focused on the enhancement of the existing algorithms through modification and hybridization, and the implementation of novel, more efficient options.

2.3. Hybrid Machine Learning and Metaheuristics Approaches to Spam Detection

Table 1 presents a brief comparative view of various studies that hybridized ML approaches with metaheuristics in various manners. All the entries from the table are subsequently described in the following paragraphs.
Various combinations of machine learning and metaheuristics techniques are considered in the spam detection process. For instance, in [66], an integrated approach of the Naive Bayes (NB) algorithm is used for the learning and classification of email content, and the Particle Swarm Optimization model for the global optimization of the parameters of the machine learning algorithm is used for email spam detection. The Naive Bayes classifier, together with the binary firefly algorithm, is proposed in [67]. In this case, metaheuristics are used for decreasing the dimensionality of features and enhancing the accuracy of the email spam classification process. The obtained results show that the proposed approach achieved an accuracy of 95.14% on the SpamBase dataset used. Another example of facing a feature selection (FS) problem in ML is presented in [68]. In the described experiments, the k-Nearest Neighbours (KNN) classifier is used to define the target function of the FS challenge, which is solved by the proposed hybrid model of whale optimization and the flower pollination algorithm based on the concept of opposition-based learning. The authors in [4] proposed a new spam detection model based on the logistic regression (LR) classification algorithm combined with artificial bee colony metaheuristics. Experiments are conducted on the Enron, CSDMC2010, and TurkishEmail datasets, and are compared with the support vector machine (SVM), LR, and NB classifiers. The research conclusion is that, in terms of classification accuracy, the hybrid model outperforms other spam detection techniques.
One of the models for spam detection systems in social networks, utilizing artificial neural network machine learning techniques enhanced with the artificial bee colony optimization algorithm, is presented in [69]. In [70], the community-inspired firefly algorithm for spam detection is proposed for searching for the features that provide good performance for SVM, KNN, and Random forest classifiers. The effectiveness of the proposed method is validated by comparing the results with other existing feature selection methods, with the results showing improved performance in terms of accuracy, false positive rate, and F1-measure on two benchmark Twitter spam datasets. The authors in [71] compare five bio-inspired optimization techniques in combination with the k-Nearest Neighbours machine learning approach on the same dataset from the UCI repository. The presented results show 100% mean values for accuracy, precision, recall, and F1-measure when different algorithms are applied and Manhattan distance is used in KNN for classifying the emails as spam or legitimate, while the approaches provide different results for different distance metrics. For example, the grasshopper optimization algorithm provides the highest average accuracy, and the whale optimization algorithm provides the highest average values of precision and F1-measure.

2.4. Text Mining Models

This section introduces the text mining models used in this research. A brief overview of the logistic regression is given first, and is followed by a description of the XGBoost model.

2.4.1. Logistic Regression

The approach models the probability of an event and belongs to a class of ML models with a statistical base. The logarithm of the odds of the event is a linear combination of one or more independent variables ("predictors"). The logistic model belongs to the field of regression analysis, since the objective is to estimate the parameters (the coefficients in the linear combination). The independent variables in binary logistic regression can each be either binary (two classes, coded by an indicator variable) or continuous. The dependent variable in binary logistic regression is binary, coded by an indicator variable, with the two values labeled true/false or 1/0. The associated probability can range from 0 to 1. Because it is difficult to determine the precise line dividing the two classes in a linear model, points close to that boundary in logistic regression are subject to uncertain decisions.
The logistic function is of the form (1), where $\mu$ is a location parameter (the midpoint of the curve, where $p(\mu) = \frac{1}{2}$) and $s$ is a scale parameter. This expression may be rewritten as (2), where $\beta_0 = -\mu/s$ is known as the intercept (it is the vertical intercept or y-intercept of the line $y = \beta_0 + \beta_1 x$), and $\beta_1 = 1/s$ (inverse scale or rate parameter): these are the y-intercept and slope of the log-odds as a function of x. Conversely, $\mu = -\beta_0/\beta_1$ and $s = 1/\beta_1$.
$$p(x) = \frac{1}{1 + e^{-(x-\mu)/s}}$$
$$p(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$
Predictive analytics and classification frequently use logistic regression. Based on a given dataset of independent variables, logistic regression calculates the likelihood that an event will occur. Since the result is a probability, the dependent variable's range is limited to the interval between 0 and 1. In logistic regression, the odds, i.e., the probability of success divided by the probability of failure, are transformed using the logit formula. Equations (3) and (4) represent this logit transform, which is sometimes referred to as the log odds or the natural logarithm of the odds.
$$\mathrm{logit}(p) = \sigma^{-1}(p) = \ln\frac{p}{1-p} \quad \text{for } p \in (0, 1)$$
$$P_i = \mathrm{Prob}(y_i = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + \varepsilon)}} \;\Rightarrow\; \ln\left(\frac{P_i}{1 - P_i}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + \varepsilon$$
Regularization can be used to train a model to better generalize to unseen data, preventing the algorithm from overfitting the training dataset. A regression model that uses the L1 norm for regularization is called Lasso regression, and a model that employs the L2 norm is known as Ridge regression. If the L2 regularization strength is 0, then overfitting occurs easily, and if it is very large, it adds too much penalty, which leads to underfitting. We find the best hyperparameter using cross-validation (CV). CV is a decision-aiding method that compares metrics across different samples by reserving a portion of the data for model evaluation. For LR, we used 90% of the data for training and the remaining 10% as the test set. For XGBoost, we used 80% for training and 20% for testing. The reason for this is that the LR is trained by the swarm, and we needed a bigger portion of the dataset for this task, while XGBoost only has its parameter values tuned. These proportions were established during the pre-experimental setup via trial and error.
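For illustration only (this is not the authors' code), the split and regularization choices described above could be reproduced with scikit-learn roughly as follows; the feature matrix X, the label vector y, and the candidate C grid are assumptions of the sketch.

```python
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

def fit_lr_with_cv(X, y):
    """90/10 split and CV-based choice of the L2 (Ridge-style) regularization strength."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.10, stratify=y, random_state=42)
    search = GridSearchCV(
        LogisticRegression(penalty="l2", max_iter=1000),
        param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # illustrative grid of inverse strengths
        cv=5, scoring="accuracy")
    search.fit(X_train, y_train)
    test_accuracy = search.best_estimator_.score(X_test, y_test)
    return search.best_params_["C"], test_accuracy
```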

2.4.2. XGBoost

The XGBoost model recently became very popular in the scientific community, as it is generally capable of delivering good overall results. Fundamentally, XGBoost deals with the resolution of the linear classification task.
There is an objective function given by Equation (5), where $l$ specifies the loss of the t-th round and $const$ refers to the constants, while $\Omega$ denotes the regularization term obtained by Equation (6). Within the latter, $\gamma$ and $\lambda$ specify the control parameters. $f_t$ is the function added in iteration t, $y$ is the real value, and $\hat{y}$ is the predicted (expected) value.
$$obj^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{t-1} + f_t(x_i)\right) + \Omega(f_t) + const$$
$$\Omega(f_t) = \gamma \cdot T_t + \lambda \frac{1}{2} \sum_{j=1}^{T} w_j^2$$
Since this is a minimization problem that needs to be solved, gradient boosted trees and the Taylor approximation method are applied. Both imply a linearization of the objective function, so that the problem can be treated as a linear classification task. The Taylor approximation is a transformation to a simple function around a single point that is obtained from the previous step $t-1$. After the second-order Taylor approximation, the loss function takes the form given in Equation (7), where $const$ refers to the constants, while $\Omega$ denotes the regularization term in Equation (6).
A further enhancement of optimization is achieved during the training process, where every iteration is dependent on the result achieved in the preceding one.
$$obj^{(t)} = \sum_{i=1}^{n} \left[ l(y_i, \hat{y}_i^{t-1}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) + const$$
Next, the first derivative is obtained with respect to Equation (8) and the second derivative is defined by Equation (9).
$$g_i = \partial_{\hat{y}_i^{t-1}}\, l(y_i, \hat{y}_i^{t-1})$$
$$h_i = \partial^2_{\hat{y}_i^{t-1}}\, l(y_i, \hat{y}_i^{t-1})$$
The combination of Equations (6), (8), and (9) allows the forming of Equation (7); and after determining the derivative, the loss function is determined by Equation (10), while the weight values are obtained by Equation (11).
$$obj^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{\left(\sum_{i \in I_j} g_i\right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma \cdot T$$
$$w_j^{*} = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$$
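To make Equations (8)–(11) concrete, the following toy sketch (our illustration, not part of the original study) computes the gradient and hessian for the common squared-error loss and the resulting optimal leaf weight; the sums run over the instances assumed to fall into a single leaf $I_j$, and the example values, $\lambda$, and $\gamma$ are placeholders.

```python
import numpy as np

# Toy illustration of Equations (8)-(11) for the squared-error loss l = (y - y_hat)^2.
# The four instances are assumed to fall into a single leaf j, so the sums below are
# the per-leaf sums of gradients and hessians.
y = np.array([1.0, 0.0, 1.0, 1.0])       # true labels
y_hat = np.array([0.6, 0.4, 0.7, 0.2])   # predictions from the previous round t-1
lam, gamma, T = 1.0, 0.1, 1              # placeholder regularization settings

g = 2.0 * (y_hat - y)                    # Eq. (8): first derivative of the loss
h = np.full_like(y, 2.0)                 # Eq. (9): second derivative of the loss

w_star = -g.sum() / (h.sum() + lam)                           # Eq. (11): optimal leaf weight
obj_star = -0.5 * g.sum() ** 2 / (h.sum() + lam) + gamma * T  # Eq. (10) for a single leaf
print(w_star, obj_star)
```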
The flexibility of the XGBoost model allows it to be completely adapted to every particular problem; however, it also means that its hyperparameters must be tuned for every given task individually. There are many parameters that may be considered for fine-tuning, and we restricted ourselves to the ones enumerated below (further information about them and their possible values is offered within Section 5):
  • Learning rate,
  • Minimum sum of instance weight (hessian) needed in a child,
  • Subsample ratio of the training instances,
  • Subsample ratio of columns when constructing each tree,
  • The maximum depth of a tree,
  • The minimum loss reduction required to make a further partition on a leaf node of the tree.
The hyperparameter optimization task refers to the problem of choosing the optimal values for the particular problem that needs to be solved, and it is regarded as an NP-hard challenge. If this task were executed manually, through trial and error, it would require an impractical amount of time and resources.

3. Proposed Method

This section first gives details of the SCA metaheuristics. Afterward, the observed shortcomings of its basic version are elaborated. Finally, details of the proposed method that overcomes the deficiencies of the basic SCA are provided.

3.1. The Original SCA Method

The SCA algorithm belongs to a novel group of optimization metaheuristics, inspired by the mathematical properties of trigonometric functions [38]. Sine and cosine functions are responsible for updating the positions of the solutions within the population, making them oscillate in the proximity of the best solution. As both functions return values in the range $[-1, 1]$, they ensure that the solutions fluctuate. At the beginning of the algorithm, during the initialization phase, a defined number of candidate solutions are generated in an arbitrary fashion within the limits of the search domain. The exploration and exploitation processes are driven by randomly adjustable control variables during the algorithm's execution.
The procedure of updating the individuals' positions (which, in our particular case, encode the values for the parameters considered for tuning) is executed in every round by utilizing Equations (12) and (13), as defined by [38], where $X_i^t$ and $X_i^{t+1}$ represent the current individual's location in the i-th dimension at the t-th and (t+1)-th rounds; $r_1$–$r_3$ are pseudo-randomly produced control values, while $P_i^*$ denotes the destination point's location (the latest best approximation of the optimal value) within the i-th dimension.
$$X_i^{t+1} = X_i^{t} + r_1 \cdot \sin(r_2) \cdot \left| r_3 \cdot P_i^{*t} - X_i^{t} \right|$$
$$X_i^{t+1} = X_i^{t} + r_1 \cdot \cos(r_2) \cdot \left| r_3 \cdot P_i^{*t} - X_i^{t} \right|$$
The fourth control variable $r_4$ is utilized to control the search mechanism by switching between these two equations, as given by Equation (14), where $r_4$ is a randomly produced value in the range $[0, 1]$. New values of the pseudo-random control parameters $r_1$–$r_4$ are produced for each component of every individual within the population.
$$X_i^{t+1} = \begin{cases} X_i^{t} + r_1 \cdot \sin(r_2) \cdot \left| r_3 \cdot P_i^{*t} - X_i^{t} \right|, & r_4 < 0.5 \\ X_i^{t} + r_1 \cdot \cos(r_2) \cdot \left| r_3 \cdot P_i^{*t} - X_i^{t} \right|, & r_4 \geq 0.5 \end{cases}$$
During the execution of the algorithm, the search process is controlled by the four random control parameters $r_1$–$r_4$, which impact the positions of the current and the best solutions. A balance among the solutions is necessary for efficient convergence to the global optimum, and this balance is granted by changing the functions' range in an ad hoc manner.
The characteristic of both the sine and cosine functions is that they exhibit a cyclic pattern, allowing for repositioning in the proximity of the solution, and therefore, granting exploitation. The changing of the range of both functions allows for a search outside of the dedicated destinations. Additionally, each individual is required not to overlap its position with other individuals in the population.
To improve the quality of randomness, the control parameter $r_2$ is produced in the range $[0, 2\pi]$, therefore guaranteeing exploration. The balance between the exploration and exploitation processes is controlled by Equation (15), where t denotes the current round of execution and T denotes the maximum number of rounds in a single run, while a represents a constant value.
$$r_1 = a - t\,\frac{a}{T}$$
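The position update of Equations (14) and (15) can be sketched compactly as below; this is our paraphrase for illustration, and the sampling range of $[0, 2]$ assumed for $r_3$ follows the original SCA publication rather than the text above.

```python
import numpy as np

def sca_update(X, best, t, T, a=2.0):
    """One round of the SCA position update (Equations (14) and (15)).

    X    : (N, D) array of candidate solutions
    best : (D,) array, the destination point P* (best solution found so far)
    t, T : current round and maximum number of rounds
    """
    r1 = a - t * (a / T)                             # Eq. (15): linearly decreasing amplitude
    N, D = X.shape
    r2 = np.random.uniform(0.0, 2.0 * np.pi, (N, D))
    r3 = np.random.uniform(0.0, 2.0, (N, D))         # assumed range [0, 2], as in the original SCA
    r4 = np.random.uniform(0.0, 1.0, (N, D))
    sin_step = X + r1 * np.sin(r2) * np.abs(r3 * best - X)
    cos_step = X + r1 * np.cos(r2) * np.abs(r3 * best - X)
    return np.where(r4 < 0.5, sin_step, cos_step)    # Eq. (14): switch on r4
```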

3.2. Limitation of Basic SCA and Proposed Improvements

The original version of SCA metaheuristics is known to be relatively simple as it does not include many control parameters, but it is still capable of achieving an outstanding level of performance for both the bound-constrained and constrained benchmarks [38]. It has also been employed for tackling numerous real-world challenges recently [72].
Although the original SCA exhibits excellent exploitation and exploration capabilities, extensive experiments on both benchmark functions and practical problems have empirically shown that in some runs of the basic algorithm, the convergence to the optimal search region happens in later rounds of execution, not leaving enough iterations for the algorithm to execute a fine-grained exploitation. The reason for this behavior is found in the fundamental search equation (Equation (14)), which applies the sine and cosine functions and directs the search towards the most recent approximation of the optimum ($P_i^*$) for every individual's parameter i. Consequently, although the original SCA's search performs exploitation very efficiently, there is still enough room for enhancements.

3.3. Diversity-Oriented Sine Cosine Algorithm (DOSCA)

As discussed earlier, the basic implementation of the SCA algorithm tends to converge to the sub-optimal solutions in the early rounds of some runs, having a significant impact on the algorithm’s performance level, as the correct search region is found late and there is no time for fine-tuned exploitation. Researchers have recently proposed several options to tackle this drawback, such as [58,59,73,74].
This paper suggests an improvement of the SCA through a definition of the proper population diversity procedure during the initialization phase. Moreover, the proposed improved SCA tries to keep the population diversity over the course of the complete execution of the algorithm by incorporating two additional procedures:
  • A new initialization procedure to provide the best possible start of the run.
  • A system that keeps population diversity control during the entire run.

3.3.1. A Novel Initialization Procedure

The initialization procedure used to produce the starting population of individuals for this research is provided in Equation (16), where $x_{i,j}$ denotes the j-th variable of the i-th individual, and $lb_j$ and $ub_j$ determine the lower and upper limits for variable j, respectively. Moreover, $\psi$ denotes a pseudo-random value drawn from the uniform distribution in the range $[0, 1]$.
$$x_{i,j} = lb_j + \psi \cdot (ub_j - lb_j)$$
It is also important to mention that some of the recent research papers [75] noted that large domains of the search space can be mapped if the quasi-reflection-based learning (QRL) mechanism is applied to the initial solutions produced by Equation (16). This procedure generates a quasi-reflexive-opposite component ($x_j^{qr}$) for each individual's parameter j ($x_j$), according to Equation (17), where the $rnd$ procedure is used to choose a pseudo-random value within the range $\left[\frac{lb_j + ub_j}{2}, x_j\right]$.
$$x_j^{qr} = rnd\left(\frac{lb_j + ub_j}{2},\, x_j\right)$$
The proposed initialization mechanism that utilizes QRL does not increase the complexity of the algorithm with respect to the fitness function evaluations (FFEs), as at first, only N / 2 solutions are initialized, where N represents the count of the solutions in the population. This mechanism is given in Algorithm 1.
Algorithm 1 QRL-based initialization procedure pseudo-code
  • Step 1: Produce population P i n i t of N / 2 solutions by applying Equation (16)
  • Step 2: Produce QRL population P q r based on P i n i t by applying Equation (17)
  • Step 3: Produce the starting population P by uniting $P_{init}$ and $P_{qr}$ ($P = P_{init} \cup P_{qr}$)
  • Step 4: Obtain fitness value for every individual from P
  • Step 5: Sort all individuals within P in terms of fitness value
The proposed procedure provides better diversity in the initial population, which subsequently leads to a boost to the search phase, as can be observed in the quality of the results, in the experimental Results section.
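A possible Python reading of Equations (16) and (17) and the merge step of Algorithm 1 is sketched below; the fitness callable is a placeholder, and minimization is assumed.

```python
import numpy as np

def qrl_initialization(N, lb, ub, fitness):
    """QRL-based initialization (Algorithm 1); lb and ub are (D,) bound arrays."""
    D = lb.shape[0]
    # Step 1: produce N/2 random solutions via Eq. (16)
    P_init = lb + np.random.uniform(0.0, 1.0, (N // 2, D)) * (ub - lb)
    # Step 2: quasi-reflexive-opposite solutions via Eq. (17),
    # drawn uniformly between (lb + ub) / 2 and each parameter value
    mid = (lb + ub) / 2.0
    low = np.minimum(mid, P_init)
    high = np.maximum(mid, P_init)
    P_qr = np.random.uniform(low, high)
    # Steps 3-5: unite both halves, evaluate, and sort by fitness (minimization)
    P = np.vstack([P_init, P_qr])
    order = np.argsort([fitness(x) for x in P])
    return P[order]
```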

3.3.2. Procedure for Keeping the Population Diversity

The procedure for population diversification can be used for monitoring the convergence and/or divergence speed throughout the search phase, as discussed by [76]. This research utilizes the population diversity measure, called the L 1 norm proposed in [76], which comprises diversities over two properties—the elements (individuals) and the dimensionality of the problem. As discussed in [76], the dimension-related measure of the L 1 norm brings important information regarding the algorithm’s search process.
Let m denote the count of individuals in the population, and n the count of dimensions. The $L_1$ norm expression is formulated as defined in Equations (18)–(20), whereby $\bar{x}$ represents the vector of average positions of the solutions in every dimension; $D_j^p$ denotes the solutions' position diversity vector in terms of the $L_1$ norm, while $D^p$ denotes the diversity value as a scalar for the entire population.
$$\bar{x}_j = \frac{1}{m} \sum_{i=1}^{m} x_{ij}$$
$$D_j^p = \frac{1}{m} \sum_{i=1}^{m} \left| x_{ij} - \bar{x}_j \right|$$
$$D^p = \frac{1}{n} \sum_{j=1}^{n} D_j^p$$
In every run of the algorithm, during the initialization phase, where the solutions are produced by utilizing the standard method given by Equation (16), the diversity of the population is large. Nevertheless, the diversity will gradually decrease in later rounds of the run, as the algorithm begins to converge in the proximity of the potential solution. The L 1 measure is utilized to control the population’s diversity, together with the dynamic threshold parameter D t .
The suggested diversity-oriented method operates as follows: during the initialization phase, the initial value of $D_t$ ($D_t^0$) is obtained. For every following round, the conditional statement $D^P < D_t$ is evaluated. If the condition is satisfied, the population's diversity is not satisfactory, and the $n_{rs}$ worst individuals are replaced with random solutions. Therefore, $n_{rs}$ is an additional control parameter that defines the number of individuals to be removed and reinitialized with new ones. The equation used to calculate $D_t^0$ is given by Equation (21).
$$D_t^0 = \frac{\sum_{j=1}^{n} (ub_j - lb_j)}{2 \cdot n}$$
This presumes that the majority of the individuals would be produced in the proximity of the mean of the upper and lower parameter boundaries, as per Equation (16) and Algorithm 1. Additionally, for rounds where it can be anticipated that the population is converging towards the more promising areas and moving away from the starting value $D_t = D_t^0$, the threshold $D_t$ is reduced according to Equation (22), where $iter$ and $iter+1$ denote the ongoing and following rounds of execution, respectively, and T determines the maximum repetition count for the execution. In that case, with every following round, the value of $D_t$ will be dynamically decreased towards the last round, regardless of $D^P$.
$$D_{t,\,iter+1} = D_{t,\,iter} - D_{t,\,iter} \cdot \frac{iter}{T}$$
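The diversity measure of Equations (18)–(20), the initial threshold of Equation (21), and the decay of Equation (22) could be computed as in the short sketch below (our illustrative reading, not the authors' implementation).

```python
import numpy as np

def l1_diversity(X):
    """Population diversity D^p as the L1 norm (Eqs. (18)-(20)); X is (m, n)."""
    x_bar = X.mean(axis=0)                    # Eq. (18): mean position per dimension
    D_j = np.abs(X - x_bar).mean(axis=0)      # Eq. (19): per-dimension diversity
    return D_j.mean()                         # Eq. (20): scalar diversity

def initial_threshold(lb, ub):
    """Initial dynamic threshold D_t^0 (Eq. (21))."""
    n = lb.shape[0]
    return (ub - lb).sum() / (2.0 * n)

def decay_threshold(D_t, iteration, T):
    """Dynamic threshold update (Eq. (22))."""
    return D_t - D_t * iteration / T
```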

3.3.3. The Inner Workings and Complexity of the Proposed Method

With regard to the described modifications, a novel algorithm derived from SCA has been introduced, and is called the diversity-oriented SCA (DOSCA). The pseudo-code for the novel DOSCA method is presented in Algorithm 2.
Algorithm 2 The DOSCA pseudo-code
  • Initialize a population of N solutions according to Algorithm 1
  • Initialize SCA control parameters $r_1$, $r_2$, $r_3$, and $r_4$
  • Determine values of D t 0 and D t
  • Evaluate each of the solutions with respect to the objective function value
  • while t < T  do
  •    Update $r_1$, $r_2$, $r_3$, and $r_4$
  •    Update the position of individuals using Equation (14)
  •    Calculate the value of D P
  •    if ( D P < D t ) then
  •        Replace worst n r s individuals with new ones produced by Equation (16)
  •    end if
  •    Assess the population
  •    Find the current best solution
  •    Increment i t e r
  •    Update D t by applying Equation (22)
  • end while
  • return the best (optimal) solution determined so far

4. Employed Datasets and Data Preprocessing

In this section, the utilized data preprocessing technique is explained first. Afterwards, the datasets that were used in the research are described in detail.

4.1. Data Preprocessing

Let us define class $c_j \in C = \{c_1, \dots, c_{|C|}\}$ as a text classification corresponding to a document $d_i \in D = \{d_1, \dots, d_{|D|}\}$, where C and D are collections of categories and documents, respectively. An appropriate format should be used to express a message using preprocessing techniques before it is delivered to a text modeling method. The foundation of text mining is represented by preprocessing procedures that convert words into understandable vectors. The first sub-procedures in text mining algorithms are stemming, lemmatization, tokenization, pruning, and stop word removal; at the end, there are feature selection and feature extraction. Feature selection is very important, as it reduces the dimensionality of the final dataset.
A function $\varphi : D \times C \rightarrow \{T, F\}$ returns true (T) if a document $d_i$ is assigned to a class $c_j$, and returns false (F) otherwise. In spam detection, we are given a set of emails $D = \{d_1, d_2, \dots, d_m\}$ and two classes $C = \{c_{spam}, c_{legitimate}\}$. The objective is to assign every email to one of the two classes accurately and precisely. This paper presents a creative spam filtering method that incorporates an advanced swarm intelligence metaheuristic, an effective optimization method for real-time applications, into the training process in order to prevent convergence to subpar local minima. The employed algorithm, described in Section 3, is guided by the mathematical properties of the sine and cosine functions.
In this study, we assessed the efficacy of the algorithms for spam email filtering using two publicly available datasets. The first is the CSDMC2010 spam corpus, a well-known dataset from the ICONIP 2010 data mining competition. There are a total of 4327 English email messages, of which 1378 (31.85%) are classified as spam and 2949 (68.15%) as legitimate emails. The labels of the emails are contained in the file SPAMTrain.label, where 0 denotes HAM and 1 represents SPAM. The testing dataset contains 4292 messages without known class labels.
The TurkishEmail dataset is the second dataset; it contains Turkish emails, of which 400 (50%) are marked as spam and the other half as non-spam [77].
Consequently, we define a set of emails $D = \{d_1, d_2, \dots, d_m\}$ and two classes $C = \{c_{spam}, c_{non\text{-}spam}\}$. The objective is to relate an email $d_i \in D$ to one of the two classes. The major steps of the research approach are listed in the algorithm schema. The non-informational HTML tags are removed, and the body text and the subject of a given email are taken from the English and Turkish databases. Tokenization is used, and all letters are changed to lowercase to separate the strings into lists of split tokens. The aim is to obtain substrings that are free of punctuation and stop words, with the exception of the exclamation points, which are often used in spam emails. Two distinct stemming libraries are employed for the two sets, since Turkish and English word morphologies and structures differ significantly from one another; the subject is detailed in Section 4.2.
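A rough sketch of this preprocessing chain is given below; the stop word source (NLTK) and the exact token pattern are our assumptions, while the PorterStemmer and TurkishStemmer calls follow the libraries named in Section 4.2.

```python
import re
from nltk.corpus import stopwords          # requires the NLTK stopwords corpus
from nltk.stem import PorterStemmer
from TurkishStemmer import TurkishStemmer  # turkish-stemmer-python package

def preprocess(text, language="english"):
    """Lowercase, tokenize, drop punctuation except '!', remove stop words, stem."""
    text = re.sub(r"<[^>]+>", " ", text)                # strip non-informational HTML tags
    tokens = re.findall(r"[^\W\d_]+|!", text.lower())   # word tokens plus exclamation marks
    stops = set(stopwords.words("turkish" if language == "turkish" else "english"))
    stemmer = TurkishStemmer() if language == "turkish" else PorterStemmer()
    return [tok if tok == "!" else stemmer.stem(tok)
            for tok in tokens if tok == "!" or tok not in stops]
```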
Machine learning classifiers are applied to the data after the preprocessing step, operating on the reduced vector representation; the classifier maps documents to classes, as shown in Equation (23).
$$\phi(d_i, c_j) = \begin{cases} 1, & \text{if } d_i \in c_j \\ 0, & \text{otherwise} \end{cases}$$
Each document $d_i \in D$ is represented by a vector of terms $t_1, t_2, \dots, t_{|B|}$. $tf(t, d)$ represents the number of occurrences of the term t in the document d, normalized by the frequency of the most frequent term in d, and it is computed as in (24), where $f_d(t)$ denotes the frequency of the term t in document d. $idf(t, D)$ weighs the term against the total number of documents, as in Equation (25), by applying a logarithm to the fraction between the total number of documents and the number of documents that include that specific term. Finally, $tf\text{-}idf$ (Equation (26)) considers the frequency of a term in a document, but its importance is reduced if t appears in many other documents.
$$tf(t, d) = \frac{f_d(t)}{\max_{w \in d} f_d(w)}$$
$$idf(t, D) = \ln \frac{|D|}{|\{d \in D : t \in d\}|}$$
$$tf\text{-}idf(t, d, D) = tf(t, d) \cdot idf(t, D)$$
Feature selection boosts a classifier's success by removing non-discriminatory terms that are present in almost all classes. Training a classification model utilizing all the terms takes a lot of computation time and resources. The goals are to reduce computational complexity and to get rid of superfluous characteristics. Each email is represented by a numerical feature vector. The $tf\text{-}idf$ procedure [78] proved to be successful for feature selection, and it is utilized in the current work as well.
After the $tf\text{-}idf$ vectors are computed (Equation (26)), we normalize the vectors with the Euclidean (L2) norm, a basic technique, as in Equation (27) [79]. Long documents make terms appear more important than they actually are, since their likelihood of occurrence increases. By taking the document length into account, the normalization seeks to avoid this bias in lengthy documents. By applying normalization approaches to $tf\text{-}idf$ vectors, Amayri and Bouguila [80] found that using the L2-norm produced better results than using the L1-norm, and that normalization can help defend against attacks on sparse data.
$$v_{norm}^{i} = \frac{v^{i}}{\sqrt{(v_1^{i})^2 + (v_2^{i})^2 + \dots + (v_n^{i})^2}} \quad (i = 1, \dots, N)$$
The total weight of a term is calculated by summing its normalized $tf\text{-}idf$ weights across all documents. The overall weights are then sorted in descending order, and the features with the highest overall weights are chosen: after sorting, the first half of the terms is selected out of the total. Each document is then represented as a feature vector made up of the normalized $tf\text{-}idf$ weights of these terms. The feature set ($S = \{t_1, \dots, t_n\}$) was thus formed from these critical terms, so the feature selection determines the final set.
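Combining Equations (24)–(27) with the weight-based selection just described, one possible sketch using scikit-learn is shown below; note that TfidfVectorizer's built-in tf and idf variants differ slightly from Equations (24) and (25), so this is a simplification rather than an exact reproduction of the procedure above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def build_feature_matrix(documents, n_features=500):
    """tf-idf vectors, L2-normalized, restricted to the terms whose summed weights
    across all documents are the largest (the selection criterion described above)."""
    vectorizer = TfidfVectorizer(norm="l2")               # Euclidean normalization (Eq. (27))
    tfidf = vectorizer.fit_transform(documents)           # sparse (num_docs, num_terms)
    total_weight = np.asarray(tfidf.sum(axis=0)).ravel()  # summed weight per term
    top = np.argsort(-total_weight)[:n_features]          # indices of the heaviest terms
    return tfidf[:, top], vectorizer.get_feature_names_out()[top]
```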

4.2. Dataset Details and Basic Exploratory Data Analysis

We provide an exploratory analysis for both datasets. According to the count of distinct terms, there are 82,148 and 25,650 different terms in the CSDMC2010 and TurkishEmail datasets, respectively. A further direction of exploratory analysis is the evaluation of sparsity and class imbalance. The latter is often presented through an imbalance ratio, which is calculated by dividing the number of emails that are not spam by the number of spam messages. The TurkishEmail and CSDMC2010 datasets have imbalance ratios of 1.51 and 2.14, respectively. TurkishEmail is a more balanced set than CSDMC2010, but both are rather unbalanced. For the Turkish dataset, in [4] the dataset was perfectly balanced, but according to our preprocessing, the ratio between the two classes was different.
For the evaluation of sparsity, the feature vector size was 1000, and the degrees of sparsity of the TurkishEmail and CSDMC2010 datasets were 90.02% and 90.48%, respectively. Consequently, the two datasets were sparse, as shown by the sparsity percentages. This is a brief summary of the ground truth about the CSDMC2010 and TurkishEmail datasets. The exploratory analysis in Table 2 presents the main frequency statistics of the trained bundles.
As a stemmer for the CSDMC2010 dataset, PorterStemmer is used, since this is a commonly used Python function for this type of data. As a stemmer for the TurkishEmail dataset, TurkishStemmer is used. It is a Python library that is very often used in machine learning and natural language processing applications (https://kandi.openweaver.com/python/otuncelli/turkish-stemmer-python (accessed on 17 July 2022)). A complexity test that has been applied is the Fraction of Borderline Points. This measure was proposed by Friedman for testing whether two multivariate samples come from the same distribution. It is given as a percentage of points connecting two opposite classes in the training samples: the measure counts the number of boundary points, each of which is adjacent to a case from a different class. The final value is normalized, so the result ranges from 0 to 1. A result near 0 implies that the data are separable, and a result close to 1 indicates that the data are not separable. The measure is sensitive to imbalanced data and to the separability of the classes.
The WordCloud function outputs for English spam emails, non-spam emails, and the whole dataframe are shown in Figure 1.
The WordCloud function outputs for Turkish spam emails, non-spam emails, and the whole dataframe are shown in Figure 2.
The class distribution for both the English and Turkish datasets is shown in Figure 3. The English dataset contains around 68% of non-spam and 32% of spam emails, while the Turkish dataset comprises around 60% non-spam, and 40% of spam emails, respectively.
The procedure of feature selection was established according to the previous data preprocessing steps: cleaning, tokenization, stemming, stop word removal, pruning, bag of words, and tf-idf vectorization. The tf-idf vectorization produces weighted values for every word, which are sorted in descending order. The first 500 words from the ordered set are selected, and a data frame with 500 features is created for training. The data frame for training with 1000 features is established in a similar manner. The weighted values for both datasets are shown in Figure 4.

5. Experimental Setup and Results

This section describes the experimental setup for both the LR and XGBoost experiments. Later on, the experimental outcomes are given and discussed.

5.1. Basic Experimental Setup

The introduced DOSCA algorithm has been utilized to train the LR model and to optimize the XGBoost hyperparameters for the case of spam detection. Flat swarm encoding has been employed, where each individual in the population comprises all the parameters being optimized. In addition, since the LR and XGBoost models are different in nature, each was tested with its own global experimental control parameters.
For the LR experiments, DOSCA is used to perform the training, and each solution represents the LR coefficients and intercept; therefore, the solution length D is determined as $D = n_f + 1$, where $n_f$ is the number of features (coefficients) and the intercept adds a length of 1. The coefficients' boundaries were determined empirically for the English dataset with 500 and 1000 features (a detailed description of the English dataset is given in Section 4). In the referenced paper [4], the boundaries were set to $[-8, 8]$, while this research utilizes a range of $[-6, 6]$. In the case of the Turkish dataset with 500 and 1000 features (a detailed description of the Turkish dataset is also given in Section 4), the referenced paper [4] also uses a range of $[-8, 8]$, while this research utilizes a range of $[-6, 6]$. The intercept boundaries were not disclosed in [4], but were established empirically for the purpose of this research through a trial and error process, and were set to $[0, 1]$ and $[-1, 1]$ for the English and Turkish datasets, respectively. All observed variables in this scenario are continuous.
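Under this encoding, a candidate solution produced by the metaheuristic could be decoded into an LR classifier roughly as in the sketch below (our illustration; the 0.5 decision threshold is an assumption).

```python
import numpy as np

def decode_and_predict(solution, X):
    """Decode a swarm individual of length n_f + 1 into LR coefficients plus an
    intercept, and produce class predictions for the feature matrix X."""
    coefs, intercept = np.asarray(solution[:-1]), solution[-1]
    probs = 1.0 / (1.0 + np.exp(-(X @ coefs + intercept)))   # logistic function, Eq. (2)
    return (probs >= 0.5).astype(int)
```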
In the case of the LR experiments, specific LR parameters (such as C, regularization, and so on) were set to the default values from the scikit-learn library, since the focus of this experiment was on model training. All algorithms were tested with 40 individuals in the population, and the maximum number of iterations was set to $T = 500$. The paper [4] that inspired the LR experiments utilized a greater number of iterations and solutions. In this way, all algorithms were treated equally in terms of fitness function evaluations (FFEs), $N + N \cdot T$. However, FA was tested with 20 solutions in the population, since the worst-case complexity of FA is $N^2 \cdot T$, while the average is $N^2/2 \cdot T$. Each algorithm was executed in 15 independent runs ($R = 15$).
For the XGBoost simulations, the solution vector's length was determined by the number of hyperparameters being optimized, and in this case $D = 6$, since a total of six parameters are tuned by DOSCA. The XGBoost hyperparameters being tuned and their respective constraints are:
  • Learning rate ($\eta$), boundaries: $[0.1, 0.9]$, type: continuous,
  • min_child_weight, boundaries: $[0, 10]$, type: continuous,
  • subsample, boundaries: $[0.01, 1]$, type: continuous,
  • colsample_bytree, boundaries: $[0.01, 1]$, type: continuous,
  • max_depth, boundaries: $[3, 10]$, type: integer,
  • gamma, boundaries: $[0, 0.5]$, type: continuous.
All other XGBoost hyperparameters were set to the default values from the scikit-learn library.
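A solution vector from the XGBoost experiment could be decoded into an estimator roughly as sketched below (illustrative only; the ordering of the parameters within the vector is our assumption, and max_depth is rounded to satisfy its integer type).

```python
from xgboost import XGBClassifier

def decode_xgb(solution):
    """Map a 6-dimensional swarm individual onto the tuned XGBoost hyperparameters."""
    return XGBClassifier(
        learning_rate=float(solution[0]),     # eta, [0.1, 0.9]
        min_child_weight=float(solution[1]),  # [0, 10]
        subsample=float(solution[2]),         # [0.01, 1]
        colsample_bytree=float(solution[3]),  # [0.01, 1]
        max_depth=int(round(solution[4])),    # integer, [3, 10]
        gamma=float(solution[5]))             # [0, 0.5]
```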
In the case of the XGBoost experiment, the algorithms were executed with 10 solutions in the population and 30 iterations, over 15 independent runs. The FA algorithm was tested with $N = 5$ solutions in the population, in order to provide the same number of FFEs.
The fitness calculation of a swarm individual (which is a simple classification error rate) is based on the training set in both experiments, the one with LR and the one with XGBoost. After all iterations in a run are completed, the best individual (the one with best fitness on the training set) is validated against the testing set, and this represents the final result of a run.
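The fitness function and the run-level evaluation protocol described here could look roughly as follows; predict_fn is a hypothetical helper that decodes a solution and returns predictions (for instance, the LR decoding sketch above, or a wrapper that fits the decoded XGBoost model on the training data).

```python
from sklearn.metrics import accuracy_score

def fitness(solution, X_train, y_train, predict_fn):
    """Swarm objective: simple classification error rate on the training set."""
    return 1.0 - accuracy_score(y_train, predict_fn(solution, X_train))

def evaluate_run(best_solution, X_test, y_test, predict_fn):
    """Final result of a run: the best individual validated against the test set."""
    return accuracy_score(y_test, predict_fn(best_solution, X_test))
```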
The performance of the suggested DOSCA algorithm with respect to convergence speed and general optimization capabilities was evaluated. The experimental outcomes were compared with the results obtained by seven other state-of-the-art metaheuristic algorithms employed in the same experimental setup. These competitor algorithms were: the original implementation of SCA [38], ABC [34], FA [35], BA [81], HHO [82], SNS [83], and TLB [84]. The competitor metaheuristics were implemented separately by the authors for comparison, and the control parameter setups were taken from the original publications.
For easier tracking of the results, the following acronyms are used: for the LR experiments, all metaheuristics were assigned the LR prefix (for example, LR-DOSCA, LR-ABC, etc.), and for the XGBoost experiments, the XGB prefix was assigned (XGB-DOSCA, XGB-HHO, etc.).
The workflow of the proposed simulation is provided in Figure 5.

5.2. Obtained Results and Comparative Analysis

Table 3 depicts the overall metrics (classification error) for the LR experiments achieved by all metaheuristic algorithms on the English 500 and 1000 and the Turkish 500 and 1000 datasets, respectively. It can be seen that the proposed LR-DOSCA approach obtained the best results (best metric) in all four experiments.
Detailed metrics for the best run in the case of the English and Turkish datasets are given in Table 4 and Table 5, respectively. Here, it is possible to see the superiority of the LR-DOSCA model, which is most obvious in the case of the Turkish dataset, where the algorithm achieved 100% accuracy.
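The per-class and averaged values reported in these tables can be reproduced from the test-set predictions with standard scikit-learn utilities. The following sketch is illustrative rather than the authors' code; it assumes that the "M.Avg." rows correspond to support-weighted averages (the averaged recall then coincides with the accuracy, as it does in the tables), which is our reading rather than a stated fact.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def detailed_metrics(y_true, y_pred):
    """Per-class precision/recall/F1 for ham (0) and spam (1),
    plus support-weighted averages and the overall accuracy."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=[0, 1], zero_division=0)
    w_precision, w_recall, w_f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted", zero_division=0)
    return {
        "accuracy_percent": 100.0 * accuracy_score(y_true, y_pred),
        "precision": dict(zip(("class_0", "class_1", "avg"), (*precision, w_precision))),
        "recall": dict(zip(("class_0", "class_1", "avg"), (*recall, w_recall))),
        "f1": dict(zip(("class_0", "class_1", "avg"), (*f1, w_f1))),
    }
```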
Convergence graphs for objective function (error rate), box plots, and violin diagrams for all observed methods in the case of LR training with the English dataset with 500 and 1000 features, respectively, are given in Figure 6, while those for the Turkish dataset (500 and 1000 features) are given in Figure 7.
Table 6 presents the overall metrics for the XGBoost experiments achieved by all of the metaheuristic algorithms on the English 500 and 1000 and the Turkish 500 and 1000 datasets, respectively. It can be seen that the proposed XGBoost-DOSCA approach obtained the best results (best metric) in all four experiments (three superior best results, and tied for the best result in the case of the English 1000 dataset).
Detailed metrics for the best run in the case of the English and Turkish datasets are given in Table 7 and Table 8, respectively. Here, it is possible to see the superiority of the XGBoost-DOSCA model, which is most obvious in the case of the Turkish 500 and 1000 datasets, where the algorithm achieved a substantial margin over the other methods. The only case where the proposed method did not reach the best result is English 1000, where the difference in accuracy is on the order of a tenth of a percentage point. This additionally indicates that the proposed method is not the most accurate for all test cases, as the no free lunch theorem also suggests. The statistical tests in the subsequent section indicate, however, that, overall, the proposed technique leads to the most accurate results.
To better visualize the results of the XGBoost method and all observed metaheuristics, the box plots and violin diagrams for English 500 and 1000 datasets are given in Figure 8. The convergence graphs, box plots, and violin diagrams for Turkish 500 and 1000 datasets are given in Figure 9.
Finally, to better visualize the performance of both the LR-DOSCA and XGBoost-DOSCA methods, confusion matrices, precision-recall (PR) curves, and receiver operating characteristic one-vs-one (ROC OvO) curves are shown in Figure 10.
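Such diagnostic plots can be produced directly from the test-set predictions and scores; the sketch below (an assumption, using the display helpers available in recent scikit-learn versions and matplotlib, with the helper name plot_diagnostics being ours) shows one way to do this for the binary spam/ham case, where the OvO ROC reduces to the standard ROC curve.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import (ConfusionMatrixDisplay, PrecisionRecallDisplay,
                             RocCurveDisplay)

def plot_diagnostics(y_test, y_pred, y_score, title):
    """Confusion matrix, PR curve, and ROC curve for one tuned model.
    y_score holds the predicted probability of the spam class."""
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    ConfusionMatrixDisplay.from_predictions(y_test, y_pred, ax=axes[0])
    PrecisionRecallDisplay.from_predictions(y_test, y_score, ax=axes[1])
    RocCurveDisplay.from_predictions(y_test, y_score, ax=axes[2])
    fig.suptitle(title)
    fig.tight_layout()
    return fig
```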

5.3. Validation of the DOSCA Improvements via Statistical Tests

In order to confirm the improved performance of the DOSCA algorithm compared to its opponents, further statistical analyses are necessary to show whether or not the obtained improvements are statistically significant. According to the relevant literature [85,86,87], statistical tests in this case can be executed by taking the mean values of the measured objectives over multiple independent runs to construct a results sample for each approach. A possible disadvantage of this method occurs if the measured variable has outliers that do not follow a normal distribution, which can lead to false or misleading conclusions. The usage of the average objective function value for the purpose of statistical tests is therefore still an open issue [87].
Therefore, to check whether or not the usage of the mean values is safe, the objective function (classification error rate) of each run was used to construct a data sample for each method-problem instance pair. Further, the Shapiro–Wilk test for single-problem analysis [88] was conducted, and for each method-problem pair all generated p-values were larger than the threshold α = 0.05, implying that the data samples come from a normal distribution. Consequently, it was concluded that the mean values can be used for further analysis.
Later, according to [89], the conditions for the safe usage of parametric tests were checked; these include independence, normality, and homoscedasticity of the variances of the data. With a unique pseudo-random number seed as a starting point, each run was executed independently, which means that the condition of independence is satisfied. The Shapiro–Wilk test [88] for multiple-problem analysis was then used to check the fulfillment of the normality condition, and these results are shown in Table 9.
Finally, to check homoscedasticity based on means, Levene's test [90] was employed, and a p-value of 0.55 was obtained, which yields the conclusion that homoscedasticity is satisfied. However, as can be seen from Table 9, all of the p-values obtained from the Shapiro–Wilk test are smaller than α = 0.05, which means that the conditions for the safe use of parametric tests are not met, and so we proceeded with non-parametric tests. In the following non-parametric tests, the DOSCA method proposed in this research is established as the control method.
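These two checks are readily reproduced with SciPy; the sketch below is an illustrative example (the dictionary layout and the helper name are ours), where each algorithm's sample contains its mean objective values over the eight method-problem instances and Levene's test is centered on the mean, as in the analysis above.

```python
from scipy import stats

def normality_and_homoscedasticity(error_rates, alpha=0.05):
    """Shapiro-Wilk normality check per algorithm and Levene's test for
    homogeneity of variances across all algorithms.

    error_rates: dict {algorithm_name: list of mean error rates per problem}.
    """
    shapiro_p = {name: stats.shapiro(values).pvalue
                 for name, values in error_rates.items()}
    levene_p = stats.levene(*error_rates.values(), center="mean").pvalue
    normal = {name: p > alpha for name, p in shapiro_p.items()}
    return shapiro_p, levene_p, normal
```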
In order to verify the significance of the proposed DOSCA's performance advantage over the other algorithms, the Friedman test [91,92] and the two-way variance analysis by ranks were employed. The results of this test are presented in Table 10. Furthermore, the Friedman aligned test was also conducted, and these findings are shown in Table 11.
From the findings shown in Table 10, the presented DOSCA method statistically outperformed the other algorithms to which it was compared, achieving an average rank of 1.44. The second-best result belongs to the HHO method, with an average rank of 3.69. The original SCA method accomplished an average rank of 4.69, which provides proof of the superiority of the proposed DOSCA over the original method. Moreover, the Friedman statistic (χr² = 21.97) is greater than the χ² critical value with seven degrees of freedom (14.067) at the significance level α = 0.05, and the Friedman p-value is 1.89 × 10⁻¹³, inferring that significant differences in results between the different methods exist. Consequently, it is possible to reject the null hypothesis (H0) and state that the performance obtained by the proposed DOSCA is significantly different from that of the other competitors. Similar conclusions can be derived from the Friedman aligned test results.
As the research in [93] indicates that Iman and Davenport's test [94] can give more precise results than the χ² statistic, this test was performed as well. The resulting value of 4.85 is considerably larger than the critical value of the F-distribution (2.20). Additionally, the Iman and Davenport p-value is 4.43 × 10⁻², which is smaller than the significance level. From all this, it can be concluded that this test also rejects H0.
Since both tests rejected the null hypothesis, the non-parametric post hoc Holm's step-down procedure was applied. The outcomes of this test are presented in Table 12. In the table, the compared methods are sorted according to their p-values and evaluated against α/(k − i), where k denotes the degrees of freedom (k = 7 in this research) and i is the algorithm's position after sorting by p-value in ascending order (corresponding to the rank). In this experiment, α values of 0.05 and 0.1 are used. The results shown in Table 12 clearly indicate that the suggested DOSCA significantly outperformed all compared methods at both significance levels.
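The overall testing pipeline can be sketched as follows. This is an illustrative example under stated assumptions, not the authors' code: the Friedman test is taken from SciPy, and the pairwise p-values against the control method are obtained here with the Wilcoxon signed-rank test as a stand-in for the rank-based z-statistics described in the cited methodology papers; only the Holm step-down thresholds α/(k − i) mirror Table 12 exactly.

```python
import numpy as np
from scipy import stats

def friedman_and_holm(results, control="DOSCA", alpha=0.05):
    """Friedman test over all algorithms, then Holm's step-down procedure
    comparing every algorithm against the control method.

    results: dict {algorithm: list of mean error rates, one per problem}.
    """
    names = list(results)
    friedman = stats.friedmanchisquare(*(results[name] for name in names))

    others = [n for n in names if n != control]
    pvals = [stats.wilcoxon(results[control], results[n]).pvalue for n in others]
    order = np.argsort(pvals)                    # most significant first
    holm = []
    for position, idx in enumerate(order):
        threshold = alpha / (len(others) - position)   # alpha / (k - i)
        holm.append((others[idx], pvals[idx], threshold,
                     pvals[idx] <= threshold))
    return friedman, holm
```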

6. Conclusions

This manuscript presented an innovative version of the SCA metaheuristics, implemented in such a way as to tackle the drawbacks of the original SCA variant. The novel algorithm was named diversity-oriented SCA (DOSCA), and it was implemented as part of a hybrid machine learning framework. The suggested DOSCA was employed for LR training and for XGBoost hyperparameters optimization.
The goal of the presented research is to further enhance spam email filtering techniques based on intelligent algorithms. Therefore, both models were evaluated on two benchmark spam email datasets: CSDMC2010 for the English language and the TurkishEmail dataset for the Turkish language.
The experimental outcomes of the DOSCA algorithm were compared with those of seven other metaheuristics implemented in the same experimental framework. The obtained simulation results, backed up by the executed statistical tests, clearly suggest that LR-DOSCA and XGB-DOSCA achieve a superior accuracy level compared to the other methods included in the comparative analysis.
Future experiments in this domain will aim to further test the suggested models on more real-world datasets, with the goal of building confidence in the models before implementing them in real-world systems that deal with spam detection and the overall security of the Internet, as well as of other networks that use email services.

Author Contributions

Conceptualization, M.Z., N.B. and C.S.; methodology, N.B., C.S. and S.J.; software, N.B. and M.Z.; validation, M.A., I.S. and M.S.; formal analysis, M.Z.; investigation, C.S., N.B. and S.J.; resources, N.B., M.S., I.S. and C.S.; data curation, M.Z., M.A. and N.B.; writing—original draft preparation, I.S., M.A. and S.J.; writing—review and editing, C.S., M.Z. and N.B.; visualization, N.B., M.A. and M.Z.; supervision, N.B.; project administration, M.Z. and N.B.; funding acquisition, N.B. and C.S. All authors have read and agreed to the published version of the manuscript.

Funding

Catalin Stoean was supported by a grant of the Romanian Ministry of Education and Research, CCCDI—UEFISCDI, project number 411PED/2020, code PN-III-P2-2.1-PED-2019-2271, within PNCDI III.

Data Availability Statement

Not applicable.

Conflicts of Interest

All authors declare no conflict of interest.

References

  1. Ripa, S.P.; Islam, F.; Arifuzzaman, M. The Emergence Threat of Phishing Attack and The Detection Techniques Using Machine Learning Models. In Proceedings of the 2021 International Conference on Automation, Control and Mechatronics for Industry 4.0 (ACMI), Rajshahi, Bangladesh, 8–9 July 2021; pp. 1–6. [Google Scholar] [CrossRef]
  2. Rameem Zahra, S.; Ahsan Chishti, M.; Iqbal Baba, A.; Wu, F. Detecting COVID-19 chaos driven phishing/malicious URL attacks by a fuzzy logic and data mining based intelligence system. Egypt. Inform. J. 2022, 23, 197–214. [Google Scholar] [CrossRef]
  3. Özgür, L.; Güngör, T.; Gürgen, F. Adaptive anti-spam filtering for agglutinative languages: A special case for Turkish. Pattern Recognit. Lett. 2004, 25, 1819–1831. [Google Scholar] [CrossRef]
  4. Dedeturk, B.K.; Akay, B. Spam filtering using a logistic regression model trained by an artificial bee colony algorithm. Appl. Soft Comput. 2020, 91, 106229. [Google Scholar] [CrossRef]
  5. Akerkar, R. Artificial Intelligence for Business; Springer: Cham, Switzerland, 2019. [Google Scholar]
  6. Buchanan, B. Artificial Intelligence in Finance; The Alan Turing Institute: London, UK, 2019. [Google Scholar]
  7. Hamet, P.; Tremblay, J. Artificial intelligence in medicine. Metabolism 2017, 69, S36–S40. [Google Scholar] [CrossRef] [PubMed]
  8. Dias, R.; Torkamani, A. Artificial intelligence in clinical and genomic diagnostics. Genome Med. 2019, 11, 1–12. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  9. Ahmad, Z.; Shahid Khan, A.; Wai Shiang, C.; Abdullah, J.; Ahmad, F. Network intrusion detection system: A systematic study of machine learning and deep learning approaches. Trans. Emerg. Telecommun. Technol. 2021, 32, e4150. [Google Scholar] [CrossRef]
  10. Almomani, O.; Almaiah, M.A.; Alsaaidah, A.; Smadi, S.; Mohammad, A.H.; Althunibat, A. Machine learning classifiers for network intrusion detection system: Comparative study. In Proceedings of the 2021 International Conference on Information Technology (ICIT), Amman, Jordan, 14–15 July 2021; pp. 440–445. [Google Scholar]
  11. Saba, T.; Sadad, T.; Rehman, A.; Mehmood, Z.; Javaid, Q. Intrusion detection system through advance machine learning for the internet of things networks. IT Prof. 2021, 23, 58–64. [Google Scholar] [CrossRef]
  12. Tang, L.; Mahmoud, Q.H. A survey of machine learning-based solutions for phishing website detection. Mach. Learn. Knowl. Extr. 2021, 3, 672–694. [Google Scholar] [CrossRef]
  13. Gandotra, E.; Gupta, D. An efficient approach for phishing detection using machine learning. In Multimedia Security; Springer: Singapore, 2021; pp. 239–253. [Google Scholar]
  14. Doshi, R.; Apthorpe, N.; Feamster, N. Machine learning ddos detection for consumer internet of things devices. In Proceedings of the 2018 IEEE Security and Privacy Workshops (SPW), San Francisco, CA, USA, 24 May 2018; pp. 29–35. [Google Scholar]
  15. Injadat, M.; Moubayed, A.; Shami, A. Detecting botnet attacks in IoT environments: An optimized machine learning approach. In Proceedings of the 2020 32nd International Conference on Microelectronics (ICM), Aqaba, Jordan, 14–17 December 2020; pp. 1–4. [Google Scholar]
  16. Soe, Y.N.; Feng, Y.; Santosa, P.I.; Hartanto, R.; Sakurai, K. Machine learning-based IoT-botnet attack detection with sequential architecture. Sensors 2020, 20, 4372. [Google Scholar] [CrossRef]
  17. Rao, S.; Verma, A.K.; Bhatia, T. A review on social spam detection: Challenges, open issues, and future directions. Expert Syst. Appl. 2021, 186, 115742. [Google Scholar] [CrossRef]
  18. Ahmed, N.; Amin, R.; Aldabbas, H.; Koundal, D.; Alouffi, B.; Shah, T. Machine learning techniques for spam detection in email and IoT platforms: Analysis and research challenges. Secur. Commun. Netw. 2022, 2022, 1862888. [Google Scholar] [CrossRef]
  19. Hossain, F.; Uddin, M.N.; Halder, R.K. Analysis of optimized machine learning and deep learning techniques for spam detection. In Proceedings of the 2021 IEEE International IOT, Electronics and Mechatronics Conference (IEMTRONICS), Toronto, ON, Canada, 21–24 April 2021; pp. 1–7. [Google Scholar]
  20. Jurafsky, D.; Martin, J.H. Speech and Language Processing; Prentice Hall: Upper Saddle River, NJ, USA, 2014; Volume 3. [Google Scholar]
  21. Han, Y.; Yang, M.; Qi, H.; He, X.; Li, S. The Improved Logistic Regression Models for Spam Filtering. In Proceedings of the 2009 International Conference on Asian Language Processing, Singapore, 7–9 December 2009; pp. 314–317. [Google Scholar]
  22. Kabiraj, S.; Raihan, M.; Alvi, N.; Afrin, M.; Akter, L.; Sohagi, S.A.; Podder, E. Breast cancer risk prediction using XGBoost and random forest algorithm. In Proceedings of the 2020 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT), Kharagpur, India, 1–3 July 2020; pp. 1–4. [Google Scholar]
  23. Li, M.; Fu, X.; Li, D. Diabetes prediction based on XGBoost algorithm. In IOP Conference Series: Materials Science and Engineering; IOP Publishing: Bristol, UK, 2020; Volume 768, p. 072093. [Google Scholar]
  24. Ryu, S.E.; Shin, D.H.; Chung, K. Prediction model of dementia risk based on XGBoost using derived variable extraction and hyper parameter optimization. IEEE Access 2020, 8, 177708–177720. [Google Scholar] [CrossRef]
  25. Ogunleye, A.; Wang, Q.G. XGBoost model for chronic kidney disease diagnosis. IEEE/ACM Trans. Comput. Biol. Bioinform. 2019, 17, 2131–2140. [Google Scholar] [CrossRef] [PubMed]
  26. Nobre, J.; Neves, R.F. Combining principal component analysis, discrete wavelet transform and XGBoost to trade in the financial markets. Expert Syst. Appl. 2019, 125, 181–194. [Google Scholar] [CrossRef]
  27. Wang, Y.; Guo, Y. Forecasting method of stock market volatility in time series data based on mixed model of ARIMA and XGBoost. China Commun. 2020, 17, 205–221. [Google Scholar] [CrossRef]
  28. Shi, X.; Li, Q.; Qi, Y.; Huang, T.; Li, J. An accident prediction approach based on XGBoost. In Proceedings of the 2017 12th International Conference on Intelligent Systems and Knowledge Engineering (ISKE), Nanjing, China, 24–26 December 2017; pp. 1–7. [Google Scholar]
  29. Zhang, S.; Zhang, D.; Qiao, J.; Wang, X.; Zhang, Z. Preventive control for power system transient security based on XGBoost and DCOPF with consideration of model interpretability. CSEE J. Power Energy Syst. 2020, 7, 279–294. [Google Scholar]
  30. Abdel-Basset, M.; Abdel-Fatah, L.; Sangaiah, A.K. Metaheuristic algorithms: A comprehensive review. In Computational Intelligence for Multimedia Big Data on the Cloud with Engineering Applications; Elsevier: Amsterdam, The Netherlands, 2018; pp. 185–231. [Google Scholar]
  31. Blum, C.; Li, X. Swarm intelligence in optimization. In Swarm intelligence; Springer: Berlin/Heidelberg, Germany, 2008; pp. 43–85. [Google Scholar]
  32. Mirjalili, S.; Mirjalili, S.M.; Lewis, A. Grey wolf optimizer. Adv. Eng. Softw. 2014, 69, 46–61. [Google Scholar] [CrossRef] [Green Version]
  33. Dorigo, M.; Birattari, M.; Stutzle, T. Ant colony optimization. IEEE Comput. Intell. Mag. 2006, 1, 28–39. [Google Scholar] [CrossRef]
  34. Karaboga, D. Artificial bee colony algorithm. Scholarpedia 2010, 5, 6915. [Google Scholar] [CrossRef]
  35. Yang, X.S. Firefly algorithms for multimodal optimization. In International Symposium on Stochastic Algorithms; Springer: Berlin/Heidelberg, Germany, 2009; pp. 169–178. [Google Scholar]
  36. Mirjalili, S.; Lewis, A. The whale optimization algorithm. Adv. Eng. Softw. 2016, 95, 51–67. [Google Scholar] [CrossRef]
  37. Abualigah, L.; Diabat, A.; Mirjalili, S.; Abd Elaziz, M.; Gandomi, A.H. The arithmetic optimization algorithm. Comput. Methods Appl. Mech. Eng. 2021, 376, 113609. [Google Scholar] [CrossRef]
  38. Mirjalili, S. SCA: A Sine Cosine Algorithm for solving optimization problems. Knowl.-Based Syst. 2016, 96, 120–133. [Google Scholar] [CrossRef]
  39. Maulud, D.; Abdulazeez, A.M. A review on linear regression comprehensive in machine learning. J. Appl. Sci. Technol. Trends 2020, 1, 140–147. [Google Scholar] [CrossRef]
  40. Huang, G.B.; Zhu, Q.Y.; Siew, C.K. Extreme learning machine: A new learning scheme of feedforward neural networks. In Proceedings of the 2004 IEEE International Joint Conference on Neural Networks (IEEE Cat. No.04CH37541), Budapest, Hungary, 25–29 July 2004; Volume 2, pp. 985–990. [Google Scholar] [CrossRef]
  41. Olatunji, S.O. Extreme Learning machines and Support Vector Machines models for email spam detection. In Proceedings of the 2017 IEEE 30th Canadian Conference on Electrical and Computer Engineering (CCECE), Windsor, ON, Canada, 30 April–3 May 2017; pp. 1–6. [Google Scholar]
  42. Chen, T.; He, T.; Benesty, M.; Khotilovich, V.; Tang, Y.; Cho, H.; Chen, K. Xgboost: Extreme gradient boosting. R Package Version 0.4-2 2015, 1, 1–4. [Google Scholar]
  43. Guo, Y.; Mustafaoglu, Z.; Koundal, D. Spam Detection Using Bidirectional Transformers and Machine Learning Classifier Algorithms. J. Comput. Cogn. Eng. 2022, 1–5. [Google Scholar] [CrossRef]
  44. Vanaja, P.; Kumari, M.V. Machine Learning based Optimization for Efficient Detection of Email Spam. Available online: http://positifreview.com/gallery/33-june2022.pdf (accessed on 10 July 2022).
  45. Goodman, J.; Yih, W.T. Online Discriminative Spam Filter Training. In Proceedings of the CEAS 2006—Third Conference on Email and AntiSpam, Mountain View, CA, USA, 27–28 July 2006; pp. 1–4. [Google Scholar]
  46. Lucay, F.A. Accelerating Global Sensitivity Analysis via Supervised Machine Learning Tools: Case Studies for Mineral Processing Models. Minerals 2022, 12, 750. [Google Scholar] [CrossRef]
  47. Roul, R.K. Impact of multilayer ELM feature mapping technique on supervised and semi-supervised learning algorithms. Soft Comput. 2022, 26, 423–437. [Google Scholar] [CrossRef]
  48. Mustapha, I.B.; Hasan, S.; Olatunji, S.O.; Shamsuddin, S.M.; Kazeem, A. Effective Email Spam Detection System using Extreme Gradient Boosting. arXiv 2020, arXiv:2012.14430. [Google Scholar]
  49. Anitha, P.; Rao, C.G.; Babu, D.S. Email Spam Filtering Using Machine Learning Based Xgboost Classifier Method. Turk. J. Comput. Math. Educ. 2021, 12, 2182–2190. [Google Scholar]
  50. Pandey, M.K.; Singh, M.K.; Pal, S.; Tiwari, B. Measure the Performance by Analysis of Different Boosting Algorithms on Various Patterns of Phishing Datasets. 2022. Available online: https://doi.org/10.21203/rs.3.rs-1794002/v2 (accessed on 14 July 2022).
  51. Cuk, A.; Bezdan, T.; Bacanin, N.; Zivkovic, M.; Venkatachalam, K.; Rashid, T.A.; Devi, V.K. Feedforward multi-layer perceptron training by hybridized method between genetic algorithm and artificial bee colony. In Data Science and Data Analytics: Opportunities and Challenges; Chapman and Hall/CRC: Boca Raton, FL, USA, 2021; p. 279. [Google Scholar]
  52. Strumberger, I.; Bezdan, T.; Ivanovic, M.; Jovanovic, L. Improving Energy Usage in Wireless Sensor Networks by Whale Optimization Algorithm. In Proceedings of the 2021 29th Telecommunications Forum (TELFOR), Belgrade, Serbia, 23–24 November 2021; pp. 1–4. [Google Scholar]
  53. Zivkovic, M.; Bacanin, N.; Zivkovic, T.; Strumberger, I.; Tuba, E.; Tuba, M. Enhanced grey wolf algorithm for energy efficient wireless sensor networks. In Proceedings of the 2020 Zooming Innovation in Consumer Technologies Conference (ZINC), Online, 26–27 May 2020; pp. 87–92. [Google Scholar]
  54. Jovanovic, D.; Antonijevic, M.; Stankovic, M.; Zivkovic, M.; Tanaskovic, M.; Bacanin, N. Tuning Machine Learning Models Using a Group Search Firefly Algorithm for Credit Card Fraud Detection. Mathematics 2022, 10, 2272. [Google Scholar] [CrossRef]
  55. Tair, M.; Bacanin, N.; Zivkovic, M.; Venkatachalam, K. A Chaotic Oppositional Whale Optimisation Algorithm with Firefly Search for Medical Diagnostics. Comput. Mater. Contin. 2022, 72, 959–982. [Google Scholar] [CrossRef]
  56. Bacanin, N.; Stoean, R.; Zivkovic, M.; Petrovic, A.; Rashid, T.A.; Bezdan, T. Performance of a novel chaotic firefly algorithm with enhanced exploration for tackling global optimization problems: Application for dropout regularization. Mathematics 2021, 9, 2705. [Google Scholar] [CrossRef]
  57. Bacanin, N.; Zivkovic, M.; Sarac, M.; Petrovic, A.; Strumberger, I.; Antonijevic, M.; Petrovic, A.; Venkatachalam, K. A Novel Multiswarm Firefly Algorithm: An Application for Plant Classification. In International Conference on Intelligent and Fuzzy Systems; Springer: Cham, Switzerland, 2022; pp. 1007–1016. [Google Scholar]
  58. Zivkovic, M.; Petrovic, A.; Bacanin, N.; Milosevic, S.; Veljic, V.; Vesic, A. The COVID-19 Images Classification by MobileNetV3 and Enhanced Sine Cosine Metaheuristics. In Mobile Computing and Sustainable Informatics; Springer: Singapore, 2022; pp. 937–950. [Google Scholar]
  59. Bacanin, N.; Zivkovic, M.; Al-Turjman, F.; Venkatachalam, K.; Trojovskỳ, P.; Strumberger, I.; Bezdan, T. Hybridized sine cosine algorithm with convolutional neural networks dropout regularization application. Sci. Rep. 2022, 12, 6302. [Google Scholar] [CrossRef] [PubMed]
  60. Zivkovic, M.; Stoean, C.; Chhabra, A.; Budimirovic, N.; Petrovic, A.; Bacanin, N. Novel improved salp swarm algorithm: An application for feature selection. Sensors 2022, 22, 1711. [Google Scholar] [CrossRef] [PubMed]
  61. Salb, M.; Zivkovic, M.; Bacanin, N.; Chhabra, A.; Suresh, M. Support Vector Machine Performance Improvements for Cryptocurrency Value Forecasting by Enhanced Sine Cosine Algorithm. In Computer Vision and Robotics; Springer: Singapore, 2022; pp. 527–536. [Google Scholar]
  62. Zivkovic, M.; Jovanovic, L.; Ivanovic, M.; Bacanin, N.; Strumberger, I.; Joseph, P.M. XGBoost Hyperparameters Tuning by Fitness-Dependent Optimizer for Network Intrusion Detection. In Communication and Intelligent Systems; Springer: Singapore, 2022; pp. 947–962. [Google Scholar]
  63. Zivkovic, M.; Stoean, C.; Petrovic, A.; Bacanin, N.; Strumberger, I.; Zivkovic, T. A Novel Method for COVID-19 Pandemic Information Fake News Detection Based on the Arithmetic Optimization Algorithm. In Proceedings of the 2021 23rd International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), Timisoara, Romania, 7–10 December 2021; pp. 259–266. [Google Scholar]
  64. Bacanin, N.; Zivkovic, M.; Bezdan, T.; Venkatachalam, K.; Abouhawwash, M. Modified firefly algorithm for workflow scheduling in cloud-edge environment. Neural Comput. Appl. 2022, 34, 9043–9068. [Google Scholar] [CrossRef]
  65. Bacanin, N.; Antonijevic, M.; Bezdan, T.; Zivkovic, M.; Rashid, T.A. Wireless Sensor Networks Localization by Improved Whale Optimization Algorithm. In 2nd International Conference on Artificial Intelligence: Advances and Applications; Springer: Singapore, 2022; pp. 769–783. [Google Scholar]
  66. Agarwal, K.; Kumar, T. Email Spam Detection Using Integrated Approach of Naïve Bayes and Particle Swarm Optimization. In Proceedings of the 2018 Second International Conference on Intelligent Computing and Control Systems (ICICCS), Madurai, India, 14–15 June 2018; pp. 685–690. [Google Scholar] [CrossRef]
  67. Ahmed, B. Wrapper Feature Selection Approach Based on Binary Firefly Algorithm for Spam E-mail Filtering. J. Soft Comput. Data Min. 2020, 1, 44–52. [Google Scholar]
  68. Mohammadzadeh, H.; Gharehchopogh, F.S. A novel hybrid whale optimization algorithm with flower pollination algorithm for feature selection: Case study Email spam detection. Comput. Intell. 2021, 37, 176–209. [Google Scholar] [CrossRef]
  69. Singh, A.; Chahal, N.; Singh, S.; Gupta, S.K. Spam Detection using ANN and ABC Algorithm. In Proceedings of the 2021 11th International Conference on Cloud Computing, Data Science & Engineering (Confluence), Noida, India, 28–29 January 2021; pp. 164–168. [Google Scholar] [CrossRef]
  70. Elakkiya, E.; Selvakumar, S.; Velusamy, R.L. CIFAS: Community Inspired Firefly Algorithm with fuzzy cross-entropy for feature selection in Twitter Spam detection. In Proceedings of the 2020 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT), Kharagpur, India, 1–3 July 2020; pp. 1–7. [Google Scholar] [CrossRef]
  71. Batra, J.; Jain, R.; Tikkiwal, V.A.; Chakraborty, A. A comprehensive study of spam detection in e-mails using bio-inspired optimization techniques. Int. J. Inf. Manag. Data Insights 2021, 1, 100006. [Google Scholar] [CrossRef]
  72. Gabis, A.B.; Meraihi, Y.; Mirjalili, S.; Ramdane-Cherif, A. A comprehensive survey of sine cosine algorithm: Variants and applications. Artif. Intell. Rev. 2021, 54, 5469–5540. [Google Scholar] [CrossRef]
  73. Wu, S.; Mao, P.; Li, R.; Cai, Z.; Heidari, A.A.; Xia, J.; Chen, H.; Mafarja, M.; Turabieh, H.; Chen, X. Evolving fuzzy k-nearest neighbors using an enhanced sine cosine algorithm: Case study of lupus nephritis. Comput. Biol. Med. 2021, 135, 104582. [Google Scholar] [CrossRef]
  74. Gupta, S. Enhanced sine cosine algorithm with crossover: A comparative study and empirical analysis. Expert Syst. Appl. 2022, 198, 116856. [Google Scholar] [CrossRef]
  75. Rahnamayan, S.; Tizhoosh, H.R.; Salama, M.M.A. Quasi-oppositional Differential Evolution. In Proceedings of the 2007 IEEE Congress on Evolutionary Computation, Singapore, 25–28 September 2007; pp. 2229–2236. [Google Scholar] [CrossRef]
  76. Cheng, S.; Shi, Y. Diversity control in particle swarm optimization. In Proceedings of the 2011 IEEE Symposium on Swarm Intelligence, Paris, France, 11–15 April 2011; pp. 1–9. [Google Scholar]
  77. Ergin, S.; Sora Gunal, E.; Yigit, H.; Aydin, R. Turkish anti-spam filtering using binary and probabilistic models. Glob. J. Technol. 2012, 1, 1007–1012. [Google Scholar]
  78. Barushka, A.; Hajek, P. Spam filtering using integrated distribution-based balancing approach and regularized deep neural networks. Appl. Intell. 2018, 48, 3538–3556. [Google Scholar] [CrossRef] [Green Version]
  79. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  80. Amayri, O.; Bouguila, N. A study of spam filtering using support vector machines. Artif. Intell. Rev. 2010, 34, 73–108. [Google Scholar] [CrossRef]
  81. Yang, X.S. Bat algorithm for multi-objective optimisation. Int. J. Bio-Inspired Comput. 2011, 3, 267–274. [Google Scholar] [CrossRef] [Green Version]
  82. Heidari, A.A.; Mirjalili, S.; Faris, H.; Aljarah, I.; Mafarja, M.; Chen, H. Harris hawks optimization: Algorithm and applications. Future Gener. Comput. Syst. 2019, 97, 849–872. [Google Scholar] [CrossRef]
  83. Talatahari, S.; Bayzidi, H.; Saraee, M. Social network search for global optimization. IEEE Access 2021, 9, 92815–92863. [Google Scholar] [CrossRef]
  84. Rao, R.V.; Savsani, V.J.; Vakharia, D. Teaching–learning-based optimization: A novel method for constrained mechanical design optimization problems. Comput.-Aided Des. 2011, 43, 303–315. [Google Scholar] [CrossRef]
  85. Derrac, J.; García, S.; Molina, D.; Herrera, F. A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms. Swarm Evol. Comput. 2011, 1, 3–18. [Google Scholar] [CrossRef]
  86. García, S.; Molina, D.; Lozano, M.; Herrera, F. A study on the use of non-parametric tests for analyzing the evolutionary algorithms’ behaviour: A case study on the CEC’2005 special session on real parameter optimization. J. Heuristics 2009, 15, 617–644. [Google Scholar] [CrossRef]
  87. Eftimov, T.; Korošec, P.; Seljak, B.K. Disadvantages of statistical comparison of stochastic optimization algorithms. In Proceedings of the Bioinspired Optimization Methods and their Applications, BIOMA 2016, Bled, Slovenia, 18–20 May 2016; pp. 105–118. [Google Scholar]
  88. Shapiro, S.S.; Francia, R. An approximate analysis of variance test for normality. J. Am. Stat. Assoc. 1972, 67, 215–216. [Google Scholar] [CrossRef]
  89. LaTorre, A.; Molina, D.; Osaba, E.; Poyatos, J.; Del Ser, J.; Herrera, F. A prescription of methodological guidelines for comparing bio-inspired optimization algorithms. Swarm Evol. Comput. 2021, 67, 100973. [Google Scholar] [CrossRef]
  90. Glass, G.V. Testing homogeneity of variances. Am. Educ. Res. J. 1966, 3, 187–190. [Google Scholar] [CrossRef]
  91. Friedman, M. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J. Am. Stat. Assoc. 1937, 32, 675–701. [Google Scholar] [CrossRef]
  92. Friedman, M. A comparison of alternative tests of significance for the problem of m rankings. Ann. Math. Stat. 1940, 11, 86–92. [Google Scholar] [CrossRef]
  93. Sheskin, D.J. Handbook of Parametric and Nonparametric Statistical Procedures; Chapman and Hall/CRC: Boca Raton, FL, USA, 2020. [Google Scholar]
  94. Iman, R.L.; Davenport, J.M. Approximations of the critical region of the fbietkan statistic. Commun. Stat.-Theory Methods 1980, 9, 571–595. [Google Scholar] [CrossRef]
Figure 1. WordCloud function outputs for English spam (top left), non-spam emails (top right), and the whole dataframe (bottom).
Figure 2. WordCloud function outputs for Turkish spam (top left), non-spam emails (top right), and the whole dataframe (bottom).
Figure 3. Class distribution for English and Turkish datasets.
Figure 4. Weights for English and Turkish datasets.
Figure 5. Work-flow of conducted simulations.
Figure 6. Convergence graphs, box plots, and violin diagrams for all observed methods and LR on the English dataset (500 and 1000 features).
Figure 7. Convergence graphs, box plots, and violin diagrams for all observed methods and LR on the Turkish dataset (500 and 1000 features).
Figure 8. Box plots and violin diagrams for error rate for all observed methods and XGBoost on English dataset (500 and 1000 features).
Figure 9. Convergence graphs, box plots, and violin diagrams for error rate for all observed methods and XGBoost on Turkish dataset (500 and 1000 features).
Figure 10. The LR-DOSCA and XGBoost-DOSCA visualization for the obtained confusion matrices, PR curves, and ROC OvO curves for some datasets.
Table 1. Overview of the selected ML and metaheuristics for spam detection.
ML Approach | Metaheuristics | Application | Ref.
Naive Bayes | Particle swarm optimization | Parameter tuning | [66]
Naive Bayes | Binary firefly algorithm | Feature selection | [67]
k-Nearest Neighbours | Whale optimization and flower pollination algorithm | Feature selection | [68]
Logistic regression | Artificial bee colony | Determine weight and bias values of LR | [4]
Neural network | Artificial bee colony | Feature selection | [69]
SVM, KNN, Random forest | Firefly algorithm | Feature selection and combination | [70]
k-Nearest Neighbours | Grey wolf optimization, firefly optimization, chicken swarm optimization, grasshopper optimization, and whale optimization | Parameter tuning | [71]
Table 2. Frequency statistics for the two datasets.
 | CSDMC2010 Dataset | TurkishEmail Dataset
len(tokens) | 1,574,504 | 289,598
len(distinct tokens) | 90,392 | 46,529
len(not stopwords) | 47,385 | 30,808
len(tokens when the string consists of alphabetic characters) | 47,533 | 30,861
len(stemming) | 35,652 | 20,195
len(dictionary tfidf for vectorization) | 35,617 | 20,138
Shape(dataframe) | (4327, 35,617) | (826, 20,138)
Table 3. Overall metrics for LR results in terms of classification error. Best results in each row are in bold.
Method | LR-DOSCA | LR-SCA | LR-ABC | LR-FA | LR-BA | LR-HHO | LR-SNS | LR-TLB
English 500
Best | 0.030023 | 0.036952 | 0.039261 | 0.036952 | 0.032333 | 0.039261 | 0.034642 | 0.036952
Worst | 0.043880 | 0.046189 | 0.043880 | 0.041570 | 0.043880 | 0.043880 | 0.053118 | 0.046189
Mean | 0.036952 | 0.041570 | 0.041570 | 0.039261 | 0.036374 | 0.040993 | 0.043303 | 0.040993
Median | 0.036952 | 0.041570 | 0.041570 | 0.039261 | 0.034642 | 0.040416 | 0.042725 | 0.040416
Std | 0.005164 | 0.003652 | 0.002309 | 0.001633 | 0.004726 | 0.001915 | 0.007000 | 0.003416
Var | 0.000027 | 0.000013 | 0.000005 | 0.000003 | 0.000022 | 0.000004 | 0.000049 | 0.000012
English 1000
Best | 0.025404 | 0.041570 | 0.034642 | 0.034642 | 0.039261 | 0.036952 | 0.039261 | 0.034642
Worst | 0.050808 | 0.046189 | 0.055427 | 0.050808 | 0.048499 | 0.046189 | 0.046189 | 0.041570
Mean | 0.040416 | 0.043303 | 0.043303 | 0.042148 | 0.044457 | 0.042148 | 0.043303 | 0.039261
Median | 0.042725 | 0.042725 | 0.041570 | 0.041570 | 0.045035 | 0.042725 | 0.043880 | 0.040416
Std | 0.009310 | 0.001915 | 0.008226 | 0.005745 | 0.004123 | 0.004123 | 0.002517 | 0.002829
Var | 0.000087 | 0.000004 | 0.000068 | 0.000033 | 0.000017 | 0.000017 | 0.000006 | 0.000008
Turkish 500
Best | 0 | 0.024096 | 0.012048 | 0.024096 | 0.012048 | 0.012048 | 0.012048 | 0.024096
Worst | 0.036145 | 0.048193 | 0.048193 | 0.036145 | 0.036145 | 0.036145 | 0.048193 | 0.048193
Mean | 0.024096 | 0.036145 | 0.030120 | 0.033133 | 0.024096 | 0.027108 | 0.033133 | 0.033133
Median | 0.030120 | 0.036145 | 0.030120 | 0.036145 | 0.024096 | 0.030120 | 0.036145 | 0.030120
Std | 0.014756 | 0.008519 | 0.013470 | 0.005217 | 0.012048 | 0.009990 | 0.013129 | 0.009990
Var | 0.000218 | 0.000073 | 0.000181 | 0.000027 | 0.000145 | 0.000100 | 0.000172 | 0.000100
Turkish 1000
Best | 0 | 0.012048 | 0.012048 | 0.012048 | 0.012048 | 0.024096 | 0.024096 | 0.024096
Worst | 0.036145 | 0.024096 | 0.036145 | 0.036145 | 0.036145 | 0.060241 | 0.048193 | 0.024096
Mean | 0.018072 | 0.015060 | 0.024096 | 0.024096 | 0.027108 | 0.039157 | 0.039157 | 0.024096
Median | 0.018072 | 0.012048 | 0.024096 | 0.024096 | 0.030120 | 0.036145 | 0.042169 | 0.024096
Std | 0.013470 | 0.005217 | 0.008519 | 0.008519 | 0.009990 | 0.015651 | 0.009990 | 0.000000
Var | 0.000181 | 0.000027 | 0.000073 | 0.000073 | 0.000100 | 0.000245 | 0.000100 | 0.000000
Table 4. Detailed metrics for LR results and English dataset. Best results in each row are in bold.
 | LR-DOSCA | LR-SCA | LR-ABC | LR-FA | LR-BA | LR-HHO | LR-SNS | LR-TLB
English 500
Accuracy (%) | 96.9977 | 96.3048 | 96.0739 | 96.3048 | 96.7667 | 96.0739 | 96.5358 | 96.3048
Precision 0 | 0.970000 | 0.957377 | 0.966443 | 0.963455 | 0.966777 | 0.976027 | 0.963576 | 0.963455
Precision 1 | 0.969925 | 0.976563 | 0.948148 | 0.962121 | 0.969697 | 0.929078 | 0.969466 | 0.962121
M.Avg. Precision | 0.969976 | 0.963492 | 0.960612 | 0.963030 | 0.967708 | 0.961064 | 0.965453 | 0.963030
Recall 0 | 0.986441 | 0.989831 | 0.976271 | 0.983051 | 0.986441 | 0.966102 | 0.986441 | 0.983051
Recall 1 | 0.934783 | 0.905797 | 0.927536 | 0.920290 | 0.927536 | 0.949275 | 0.920290 | 0.920290
M.Avg. Recall | 0.969977 | 0.963048 | 0.960739 | 0.963048 | 0.967667 | 0.960739 | 0.965358 | 0.963048
F1 Score 0 | 0.978151 | 0.973333 | 0.971332 | 0.973154 | 0.976510 | 0.971039 | 0.974874 | 0.973154
F1 Score 1 | 0.952030 | 0.939850 | 0.937729 | 0.940741 | 0.948148 | 0.939068 | 0.944238 | 0.940741
M.Avg. F1 Score | 0.969826 | 0.962662 | 0.960623 | 0.962824 | 0.967471 | 0.960850 | 0.965110 | 0.962824
English 1000
Accuracy (%) | 97.4596 | 95.8430 | 96.5358 | 96.5358 | 96.0739 | 96.3048 | 96.0739 | 96.5358
Precision 0 | 0.973333 | 0.960133 | 0.969799 | 0.963576 | 0.957237 | 0.966555 | 0.960265 | 0.963576
Precision 1 | 0.977444 | 0.954545 | 0.955556 | 0.969466 | 0.968992 | 0.955224 | 0.961832 | 0.969466
M.Avg. Precision | 0.974643 | 0.958352 | 0.965259 | 0.965453 | 0.960983 | 0.962944 | 0.960764 | 0.965453
Recall 0 | 0.989831 | 0.979661 | 0.979661 | 0.986441 | 0.986441 | 0.979661 | 0.983051 | 0.986441
Recall 1 | 0.942029 | 0.913043 | 0.934783 | 0.920290 | 0.905797 | 0.927536 | 0.913043 | 0.920290
M.Avg. Recall | 0.974596 | 0.958430 | 0.965358 | 0.965358 | 0.960739 | 0.963048 | 0.960739 | 0.965358
F1 Score 0 | 0.981513 | 0.969799 | 0.974705 | 0.974874 | 0.971619 | 0.973064 | 0.971524 | 0.974874
F1 Score 1 | 0.959410 | 0.933333 | 0.945055 | 0.944238 | 0.936330 | 0.941176 | 0.936803 | 0.944238
M.Avg. F1 Score | 0.974468 | 0.958177 | 0.965255 | 0.965110 | 0.960372 | 0.962901 | 0.960458 | 0.965110
Table 5. Detailed metrics for LR results and Turkish dataset. Best results in each row are in bold.
 | LR-DOSCA | LR-SCA | LR-ABC | LR-FA | LR-BA | LR-HHO | LR-SNS | LR-TLB
Turkish 500
Accuracy (%) | 100 | 97.5904 | 98.7952 | 97.5904 | 98.7952 | 98.7952 | 98.7952 | 97.5904
Precision 0 | 1.00000 | 0.980000 | 1.00000 | 1.00000 | 0.980392 | 1.00000 | 0.980392 | 0.980000
Precision 1 | 1.00000 | 0.969697 | 0.970588 | 0.942857 | 1.00000 | 0.970588 | 1.00000 | 0.969697
M.Avg. Precision | 1.00000 | 0.975904 | 0.988306 | 0.977281 | 0.988188 | 0.988306 | 0.988188 | 0.975904
Recall 0 | 1.00000 | 0.980000 | 0.980000 | 0.960000 | 1.00000 | 0.980000 | 1.00000 | 0.980000
Recall 1 | 1.00000 | 0.969697 | 1.00000 | 1.00000 | 0.969697 | 1.00000 | 0.969697 | 0.969697
M.Avg. Recall | 1.00000 | 0.975904 | 0.987952 | 0.975904 | 0.987952 | 0.987952 | 0.987952 | 0.975904
F1 Score 0 | 1.00000 | 0.980000 | 0.989899 | 0.979592 | 0.990099 | 0.989899 | 0.990099 | 0.980000
F1 Score 1 | 1.00000 | 0.969697 | 0.985075 | 0.970588 | 0.984615 | 0.985075 | 0.984615 | 0.969697
M.Avg. F1 Score | 1.00000 | 0.975904 | 0.987981 | 0.976012 | 0.987919 | 0.987981 | 0.987919 | 0.975904
Turkish 1000
Accuracy (%) | 100 | 98.7952 | 98.7952 | 98.7952 | 98.7952 | 97.5904 | 97.5904 | 97.5904
Precision 0 | 1.00000 | 0.980392 | 1.00000 | 1.00000 | 1.00000 | 0.961538 | 0.980000 | 0.961538
Precision 1 | 1.00000 | 1.00000 | 0.970588 | 0.970588 | 0.970588 | 1.00000 | 0.969697 | 1.00000
M.Avg. Precision | 1.00000 | 0.988188 | 0.988306 | 0.988306 | 0.988306 | 0.976830 | 0.975904 | 0.97683
Recall 0 | 1.00000 | 1.00000 | 0.980000 | 0.980000 | 0.980000 | 1.00000 | 0.980000 | 1.00000
Recall 1 | 1.00000 | 0.969697 | 1.00000 | 1.00000 | 1.00000 | 0.939394 | 0.969697 | 0.939394
M.Avg. Recall | 1.00000 | 0.987952 | 0.987952 | 0.987952 | 0.987952 | 0.975904 | 0.975904 | 0.975904
F1 Score 0 | 1.00000 | 0.990099 | 0.989899 | 0.989899 | 0.989899 | 0.980392 | 0.980000 | 0.980392
F1 Score 1 | 1.00000 | 0.984615 | 0.985075 | 0.985075 | 0.985075 | 0.968750 | 0.969697 | 0.968750
M.Avg. F1 Score | 1.00000 | 0.987919 | 0.987981 | 0.987981 | 0.987981 | 0.975763 | 0.975904 | 0.975763
Table 6. Overall metrics for XGBoost results in terms of classification error. Best results in each row are in bold.
Method | X-DOSCA | X-SCA | X-ABC | X-FA | X-BA | X-HHO | X-SNS | X-TLB
English 500
Best | 0.013857 | 0.017321 | 0.017321 | 0.015012 | 0.018476 | 0.019630 | 0.020785 | 0.018476
Worst | 0.019630 | 0.019630 | 0.023095 | 0.021940 | 0.023095 | 0.020785 | 0.028868 | 0.020785
Mean | 0.018245 | 0.019169 | 0.020554 | 0.018014 | 0.019861 | 0.020092 | 0.022633 | 0.020092
Median | 0.019630 | 0.019630 | 0.020785 | 0.017321 | 0.018476 | 0.019630 | 0.020785 | 0.020785
Std | 0.002239 | 0.000924 | 0.001848 | 0.002263 | 0.001848 | 0.000566 | 0.003150 | 0.000924
Var | 0.000005 | 0.000001 | 0.000003 | 0.000005 | 0.000003 | 0.000000 | 0.000010 | 0.000001
English 1000
Best | 0.013857 | 0.013857 | 0.013857 | 0.015012 | 0.017321 | 0.013857 | 0.016166 | 0.012702
Worst | 0.016166 | 0.019630 | 0.020785 | 0.021940 | 0.021940 | 0.018476 | 0.020785 | 0.018476
Mean | 0.015012 | 0.017090 | 0.018707 | 0.018245 | 0.018938 | 0.016166 | 0.018245 | 0.016166
Median | 0.015012 | 0.017321 | 0.019630 | 0.018476 | 0.018476 | 0.016166 | 0.018476 | 0.016166
Std | 0.000730 | 0.001987 | 0.002572 | 0.002466 | 0.001728 | 0.001633 | 0.001848 | 0.001932
Var | 0.000001 | 0.000004 | 0.000007 | 0.000006 | 0.000003 | 0.000003 | 0.000003 | 0.000004
Turkish 500
Best | 0.066667 | 0.078788 | 0.084848 | 0.090909 | 0.084848 | 0.072727 | 0.084848 | 0.090909
Worst | 0.090909 | 0.096970 | 0.109091 | 0.103030 | 0.109091 | 0.096970 | 0.103030 | 0.109091
Mean | 0.080000 | 0.089697 | 0.093333 | 0.094545 | 0.095758 | 0.084848 | 0.094545 | 0.096970
Median | 0.084848 | 0.090909 | 0.090909 | 0.090909 | 0.096970 | 0.084848 | 0.096970 | 0.096970
Std | 0.008907 | 0.005938 | 0.009071 | 0.004848 | 0.008040 | 0.008571 | 0.006181 | 0.006639
Var | 0.000079 | 0.000035 | 0.000082 | 0.000024 | 0.000065 | 0.000073 | 0.000038 | 0.000044
Turkish 1000
Best | 0.066667 | 0.090909 | 0.084848 | 0.084848 | 0.078788 | 0.072727 | 0.090909 | 0.084848
Worst | 0.084848 | 0.103030 | 0.096970 | 0.090909 | 0.096970 | 0.090909 | 0.103030 | 0.103030
Mean | 0.076364 | 0.093333 | 0.092121 | 0.088485 | 0.090909 | 0.077576 | 0.096970 | 0.093333
Median | 0.072727 | 0.090909 | 0.096970 | 0.090909 | 0.090909 | 0.072727 | 0.096970 | 0.090909
Std | 0.007273 | 0.004848 | 0.005938 | 0.002969 | 0.006639 | 0.007068 | 0.003833 | 0.006181
Var | 0.000053 | 0.000024 | 0.000035 | 0.000009 | 0.000044 | 0.000050 | 0.000015 | 0.000038
Table 7. Detailed metrics for XGBoost results and English dataset. Best results in each row are in bold.
 | X-DOSCA | X-SCA | X-ABC | X-FA | X-BA | X-HHO | X-SNS | X-TLB
English 500
Accuracy (%) | 98.6143 | 98.2679 | 98.2679 | 98.4988 | 98.1524 | 98.0370 | 97.9215 | 98.1524
Precision 0 | 0.984899 | 0.981575 | 0.983193 | 0.983250 | 0.979933 | 0.978297 | 0.981481 | 0.981544
Precision 1 | 0.988889 | 0.985130 | 0.981550 | 0.988848 | 0.985075 | 0.985019 | 0.974265 | 0.981481
M.Avg. Precision | 0.986171 | 0.982708 | 0.982669 | 0.985034 | 0.981572 | 0.980439 | 0.979181 | 0.981524
Recall 0 | 0.994915 | 0.993220 | 0.991525 | 0.994915 | 0.993220 | 0.993220 | 0.988136 | 0.991525
Recall 1 | 0.967391 | 0.960145 | 0.963768 | 0.963768 | 0.956522 | 0.952899 | 0.960145 | 0.960145
M.Avg. Recall | 0.986143 | 0.982679 | 0.982679 | 0.984988 | 0.981524 | 0.980370 | 0.979215 | 0.981524
F1 Score 0 | 0.989882 | 0.987363 | 0.987342 | 0.989048 | 0.986532 | 0.985702 | 0.984797 | 0.986509
F1 Score 1 | 0.978022 | 0.972477 | 0.972578 | 0.976147 | 0.970588 | 0.968692 | 0.967153 | 0.970696
M.Avg. F1 Score | 0.986102 | 0.982619 | 0.982636 | 0.984936 | 0.981451 | 0.980281 | 0.979174 | 0.981469
English 1000
Accuracy (%) | 98.6143 | 98.6143 | 98.6143 | 98.4988 | 98.2679 | 98.6143 | 98.3834 | 98.7298
Precision 0 | 0.986532 | 0.984899 | 0.986532 | 0.983250 | 0.986464 | 0.986532 | 0.980000 | 0.986555
Precision 1 | 0.985294 | 0.988889 | 0.985294 | 0.988848 | 0.974545 | 0.985294 | 0.992481 | 0.988930
M.Avg. Precision | 0.986137 | 0.986171 | 0.986137 | 0.985034 | 0.982665 | 0.986137 | 0.983978 | 0.987312
Recall 0 | 0.993220 | 0.994915 | 0.993220 | 0.994915 | 0.988136 | 0.993220 | 0.996610 | 0.994915
Recall 1 | 0.971014 | 0.967391 | 0.971014 | 0.963768 | 0.971014 | 0.971014 | 0.956522 | 0.971014
M.Avg. Recall | 0.986143 | 0.986143 | 0.986143 | 0.984988 | 0.982679 | 0.986143 | 0.983834 | 0.987298
F1 Score 0 | 0.989865 | 0.989882 | 0.989865 | 0.989048 | 0.987299 | 0.989865 | 0.988235 | 0.990717
F1 Score 1 | 0.978102 | 0.978022 | 0.978102 | 0.976147 | 0.972777 | 0.978102 | 0.974170 | 0.979890
M.Avg. F1 Score | 0.986116 | 0.986102 | 0.986116 | 0.984936 | 0.982671 | 0.986116 | 0.983753 | 0.987267
Table 8. Detailed metrics for XGBoost results and Turkish dataset. Best results in each row are in bold.
 | X-DOSCA | X-SCA | X-ABC | X-FA | X-BA | X-HHO | X-SNS | X-TLB
Turkish 500
Accuracy (%) | 93.3333 | 92.1212 | 91.5152 | 53.3333 | 91.5152 | 92.7273 | 91.5152 | 90.9091
Precision 0 | 0.923077 | 0.921569 | 0.897196 | 0.610000 | 0.920792 | 0.922330 | 0.920792 | 0.903846
Precision 1 | 0.950820 | 0.920635 | 0.948276 | 0.415385 | 0.906250 | 0.935484 | 0.906250 | 0.918033
M.Avg. Precision | 0.934174 | 0.921195 | 0.917628 | 0.532154 | 0.914975 | 0.927592 | 0.914975 | 0.909521
Recall 0 | 0.969697 | 0.949495 | 0.969697 | 0.616162 | 0.939394 | 0.959596 | 0.939394 | 0.949495
Recall 1 | 0.878788 | 0.878788 | 0.833333 | 0.409091 | 0.878788 | 0.878788 | 0.878788 | 0.848485
M.Avg. Recall | 0.933333 | 0.921212 | 0.915152 | 0.533333 | 0.915152 | 0.927273 | 0.915152 | 0.909091
F1 Score 0 | 0.945813 | 0.935323 | 0.932039 | 0.613065 | 0.930000 | 0.940594 | 0.930000 | 0.926108
F1 Score 1 | 0.913386 | 0.899225 | 0.887097 | 0.412214 | 0.892308 | 0.906250 | 0.892308 | 0.881890
M.Avg. F1 Score | 0.932842 | 0.920884 | 0.914062 | 0.532725 | 0.914923 | 0.926856 | 0.914923 | 0.908421
Turkish 1000
Accuracy (%) | 93.3333 | 90.9091 | 91.5152 | 91.5152 | 92.1212 | 92.7273 | 90.9091 | 91.5152
Precision 0 | 0.915094 | 0.896226 | 0.912621 | 0.889908 | 0.913462 | 0.914286 | 0.896226 | 0.912621
Precision 1 | 0.966102 | 0.932203 | 0.919355 | 0.964286 | 0.934426 | 0.950000 | 0.932203 | 0.919355
M.Avg. Precision | 0.935497 | 0.910617 | 0.915315 | 0.919659 | 0.921847 | 0.928571 | 0.910617 | 0.915315
Recall 0 | 0.979798 | 0.959596 | 0.949495 | 0.979798 | 0.959596 | 0.969697 | 0.959596 | 0.949495
Recall 1 | 0.863636 | 0.833333 | 0.863636 | 0.818182 | 0.863636 | 0.863636 | 0.833333 | 0.863636
M.Avg. Recall | 0.933333 | 0.909091 | 0.915152 | 0.915152 | 0.921212 | 0.927273 | 0.909091 | 0.915152
F1 Score 0 | 0.946341 | 0.926829 | 0.930693 | 0.932692 | 0.935961 | 0.941176 | 0.926829 | 0.930693
F1 Score 1 | 0.912000 | 0.880000 | 0.890625 | 0.885246 | 0.897638 | 0.904762 | 0.880000 | 0.890625
M.Avg. F1 Score | 0.932605 | 0.908098 | 0.914666 | 0.913714 | 0.920631 | 0.926611 | 0.908098 | 0.914666
Table 9. Shapiro–Wilk test results for multiple methods multiple problem analysis.
 | DOSCA | SCA | ABC | FA | BA | HHO | SNS | TLB
p-value | 0.015682 | 0.016325 | 0.012733 | 0.025307 | 0.030842 | 0.013288 | 0.029549 | 0.035672
Table 10. Friedman statistical test results.
Functions | DOSCA | SCA | ABC | FA | BA | HHO | SNS | TLB
LR English 500 | 2 | 6.5 | 6.5 | 3 | 1 | 4.5 | 8 | 4.5
LR English 1000 | 1 | 5.5 | 5.5 | 2.5 | 8 | 2.5 | 5.5 | 5.5
LR Turkish 500 | 1.5 | 8 | 4 | 6 | 1.5 | 3 | 6 | 6
LR Turkish 1000 | 2 | 1 | 4 | 4 | 6 | 7.5 | 7.5 | 4
XGBoost English 500 | 2 | 3 | 7 | 1 | 4 | 5.5 | 8 | 5.5
XGBoost English 1000 | 1 | 4 | 7 | 5.5 | 8 | 2.5 | 5.5 | 2.5
XGBoost Turkish 500 | 1 | 3 | 4 | 5.5 | 7 | 2 | 5.5 | 8
XGBoost Turkish 1000 | 1 | 6.5 | 5 | 3 | 4 | 2 | 8 | 6.5
Average Ranking | 1.44 | 4.69 | 5.38 | 3.81 | 4.94 | 3.69 | 6.75 | 5.31
Rank | 1 | 4 | 7 | 3 | 5 | 2 | 8 | 6
Table 11. Friedman aligned statistical test results.
Functions | DOSCA | SCA | ABC | FA | BA | HHO | SNS | TLB
LR English 500 | 10 | 43.5 | 43.5 | 22 | 9 | 38.5 | 53 | 38.5
LR English 1000 | 12 | 33.5 | 33.5 | 24.5 | 46 | 24.5 | 33.5 | 33.5
LR Turkish 500 | 7.5 | 61 | 28 | 51 | 7.5 | 11 | 51 | 51
LR Turkish 1000 | 5 | 2 | 15 | 15 | 37 | 63.5 | 63.5 | 15
XGBoost English 500 | 18 | 23 | 36 | 17 | 29 | 30.5 | 49 | 30.5
XGBoost English 1000 | 13 | 26 | 42 | 40.5 | 45 | 20.5 | 40.5 | 20.5
XGBoost Turkish 500 | 3 | 19 | 47 | 54.5 | 57 | 6 | 54.5 | 60
XGBoost Turkish 1000 | 1 | 58.5 | 56 | 27 | 48 | 4 | 62 | 58.5
Average Ranking | 8.69 | 33.31 | 37.63 | 31.44 | 34.81 | 24.81 | 50.88 | 38.44
Rank | 1 | 4 | 6 | 3 | 5 | 2 | 8 | 7
Table 12. Holm’s step-down procedure statistical test results.
Comparison | p-Values | Ranking | Alpha = 0.05 | Alpha = 0.1 | H1 | H2
DOSCA vs. SNS | 7.20 × 10⁻⁶ | 0 | 0.007143 | 0.014286 | 1 | 1
DOSCA vs. ABC | 6.52 × 10⁻⁴ | 1 | 0.008333 | 0.016667 | 1 | 1
DOSCA vs. TLB | 7.78 × 10⁻⁴ | 2 | 0.010000 | 0.020000 | 1 | 1
DOSCA vs. BA | 2.13 × 10⁻³ | 3 | 0.012500 | 0.025000 | 1 | 1
DOSCA vs. SCA | 3.98 × 10⁻³ | 4 | 0.016667 | 0.033333 | 1 | 1
DOSCA vs. FA | 2.62 × 10⁻² | 5 | 0.025000 | 0.050000 | 1 | 1
DOSCA vs. HHO | 3.31 × 10⁻² | 6 | 0.050000 | 0.100000 | 1 | 1
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
