1. Introduction
A multitude of everyday problems from various sciences can be treated as classification or data fitting problems, such as problems that appear in the fields of physics [
1,
2,
3,
4], chemistry [
5,
6,
7], economics [
8,
9], environmental problems [
10,
11,
12], and medical problems [
13,
14]. In the relevant literature, there is a wide range of techniques that one can use to handle such problems, such as the k nearest neighbors model (k-NN) [
15,
16], artificial neural networks (ANNs) [
17,
18], radial basis function (RBF) networks [
19,
20], support vector machines (SVM) [
21,
22], and decision trees [
23,
24]. Also, many practical problems have been tackled using machine learning approaches, such as prediction of non-breaking waves [
25], energy conservation problems [
26], and the prediction of scour depth at seawalls using genetic programming and neural networks [
27]. Furthermore, machine learning models have been used in various complex tasks such as neural machine translation [
28], oil distribution [
29], image processing [
30], robotics [
31], and hydrocarbon production [
32]. A brief description of the methods that can be used for classification datasets is given in the publication of Kotsiantis et al. [
33].
In the majority of cases, machine learning models have a number of parameters that should be determined through some algorithms. These parameters include the weights of artificial neural networks, which can be estimated with techniques such as the backpropagation method [
34,
35] or genetic algorithms [
36,
37,
38], as well as the hyperparameters of learning models, which require different approaches [
39,
40,
41]. However, most of the time, there are some problems in the parameterization of machine learning models:
A long training time is required, which is proportional to the dimension of the input data. For example, in a neural network with one hidden layer equipped with 10 processing nodes and a provided dataset with 10 inputs, more than 100 parameters are required to build the neural network. Therefore, the size of the network will grow proportionally to the problem, and longer training times will be required for the model.
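As a minimal worked example, assuming the common formulation in which every hidden node carries $d$ input weights, a bias, and an output weight, the parameter count of such a network is
\[ (d+2) H = (10+2) \times 10 = 120, \]
where $d$ stands for the number of inputs and $H$ for the number of processing nodes.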
Another important problem presented in machine learning techniques is the fact that many models require significant storage space in the computer’s memory for their parameters, and in fact, this space increases significantly with the increase in the dimension of the objective problem. For example, in the BFGS [
42] optimization method,
storage space of order $O(n^2)$, where $n$ stands for the number of parameters of the training model, will be required for the Hessian approximation and the partial derivatives used by the optimization method. This issue was thoroughly discussed in a paper by Verleysen et al. [
43]. Some common approaches proposed in order to reduce the dimension of the input datasets are the principal component analysis (PCA) method [
44,
45,
46] as well as the minimum redundancy feature selection (MRMR) technique [
47,
48]. Moreover, Pourzangbar proposed a feature selection method [
49] based on genetic programming for the determination of the most effective parameters for scour depth at seawalls. The proposed technique, in addition to creating artificial features, also essentially selects features at the same time, since it can remove from the final features those that will not bring significant benefits to the learning of the objective problem. Furthermore, Wang et al. proposed an auto-encoder reduction method, which was applied on a series of large datasets [
50].
Another interesting problem, which has been tackled by dozens of researchers in the last few decades, is that of overfitting of neural networks or machine learning models in general. In this problem, although the machine learning model has achieved a satisfactory level of training, this is not reflected in the unknown patterns (test set) that were not present during training. The paper by Geman et al. [
51] as well as the article by Hawkins [
52] thoroughly discussed the topic of overfitting. Examples of techniques proposed to tackle this problem are the weight sharing methods [
53,
54], methods that reduce the number of parameters of the model (pruning methods) [
55,
56], weight elimination [
57,
58,
59], weight decaying methods [
60,
61], dropout methods [
62,
63], the Sarprop method [
64], and positive correlation methods [
65]. Recently, a variety of papers have proposed methods to handle the overfitting problem in various cases, such as the usage of genetic algorithms for training data selection in RBF networks [
66], the evolution of RBF models using genetic algorithms for rainfall prediction [
67], and pruning decision trees using genetic algorithms [
68,
69].
This paper recommends a two-phase method for data classification or regression problems. In the first phase, a global optimization method directs the production of artificial features from the existing ones with the help of grammatical evolution [
70]. Grammatical evolution is a variation of genetic programming where the chromosomes are production rules of the target BNF grammar, and it has been used successfully in a variety of applications, such as music composition [
71], economics [
72], symbolic regression [
73], robotics [
74], and caching algorithms [
75]. The global optimization method used in this work is the particle swarm optimization (PSO) method [
76,
77,
78]. The PSO method was selected as the optimization method due to its simplicity and the small number of parameters that should be set. Also, the PSO method has been used in many difficult problems in all areas of the sciences, such as problems that arise in physics [
79,
80], chemistry [
81,
82], medicine [
83,
84], and economics [
85]. Furthermore, the PSO method was successfully applied recently in many practical problems such as flow shop scheduling [
86], the successful development of electric vehicle charging strategies [
87], emotion recognition [
88], robotics [
89], the optimal design of a brace-viscous damper and pendulum tuned mass damper [
90], application to high-dimensional expensive industrial problems [
91], and RFID readers [
92]. The generated artificial features are nonlinear combinations of the original ones, and any machine learning model can be used to effectively estimate their dynamics. In the present implementation, the RBF network was used, since it is a widely tested machine learning model and because its training is much faster compared with other models. In the second phase, the best features obtained from the first phase are also used to modify the test set of the objective problem, and a machine learning method can be used to estimate the error on this modified test set.
The idea of creating artificial features using grammatical evolution was first introduced in the paper by Gavrilis et al. [
93], and it has been successfully applied on a series of problems, such as spam identification [
94], fetal heart classification [
95], epileptic oscillations [
96], the construction of COVID-19 predictive models [
97], and performance and early drop prediction for higher education students [
98].
Feature selection using neural networks has been also proposed in a series of papers, such as the work of Verikas and Bacauskiene [
99] or the work of Kabir et al. [
100]. Moreover, Devi utilized a simulated annealing approach [
101] to select the most important features for classification datasets. Also, Neshatian et al. [
102] developed a genetic algorithm that produces features using an entropy-based fitness function.
The rest of this article is organized as follows. In
Section 2, the steps of the proposed method are fully described. In
Section 3, the used experimental datasets as well as the results obtained by the incorporation of the proposed method are outlined. Finally, in
Section 4, some conclusions are listed.
2. The Proposed Method
This section will introduce the main parts of the proposed two-step method. The first subsection will introduce the basics of grammatical evolution and give a complete example of building a valid function from a chromosome. Next, the process by which the grammatical evolution chromosomes can be used to create artificial features from existing ones will be presented in
Section 2.2. The procedure by which the fitness of each chromosome can be assessed is presented in
Section 2.3. Finally, in
Section 2.4, the overall algorithm is presented along with a flowchart for its graphical representation.
2.1. The Technique of Grammatical Evolution
The process of grammatical evolution uses chromosomes that represent the production rules of the underlying Backus–Naur form (BNF) grammar [
103] of the objective problem. BNF grammars have been widely used to describe the syntax of programming languages. Any BNF grammar is defined as a set $G = (N, T, S, P)$,
where the following are true:
The set N represents the non-terminal symbols of the grammar. Any non-terminal symbol is analyzed to a series of terminal symbols using the production rules of the grammar.
T is the set of terminal symbols.
The non-terminal symbol S represents the start symbol of the grammar.
The set P contains the production rules of the grammar. Typically, any production rule is expressed in the form $A \to a$ or $A \to aB$, where $A, B \in N$ and $a \in T$.
The process that creates a valid program starts from the symbol S and gradually replaces non-terminal symbols with the right-hand side of the selected production rule from the provided chromosome. The rule is selected with the following steps:
Read the next element V from the chromosome.
Select the production rule according to the scheme Rule = V mod R, where R is the total number of production rules for the current non-terminal symbol.
The BNF grammar for the proposed method is shown in
Figure 1. Symbols in < > brackets denote non-terminal symbols that belong to set
N. In every line of the grammar, a production rule is shown for every non-terminal symbol. The numbers in parentheses represent the sequence number of the production rule for the corresponding non-terminal symbol. For example, the non-terminal symbol <op> has four production rules, with each leading to a terminating arithmetic operation symbol. The constant N is the dimension of the input dataset.
An example that produces a valid expression from a given chromosome is shown in
Table 1. This chromosome represents a series of sequence numbers of production rules from the above grammar. The grammatical evolution method takes the elements of the chromosome one by one and finds the corresponding production rule by taking the remainder of the division of the current chromosome element by the number of production rules of the current non-terminal symbol. The final expression created by this procedure is shown in Table 1.
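As an illustration of this decoding scheme, the following C++ sketch applies the modulo rule selection to a simplified grammar; the grammar, the symbol names, and the example chromosome used here are assumptions for demonstration only and do not reproduce the exact grammar of Figure 1:

```cpp
#include <iostream>
#include <map>
#include <string>
#include <vector>

// A much simplified BNF grammar: each non-terminal maps to its production
// rules. Terminals are emitted verbatim; names in < > are non-terminals.
std::map<std::string, std::vector<std::vector<std::string>>> grammar = {
    {"<expr>",  {{"(", "<expr>", "<op>", "<expr>", ")"},
                 {"<func>", "(", "<expr>", ")"},
                 {"<xlist>"}}},
    {"<op>",    {{"+"}, {"-"}, {"*"}, {"/"}}},
    {"<func>",  {{"sin"}, {"cos"}, {"exp"}, {"log"}}},
    {"<xlist>", {{"x1"}, {"x2"}, {"x3"}}}   // N = 3 original features
};

// Decode a chromosome into an expression: repeatedly replace the leftmost
// non-terminal, selecting rule = gene % (number of rules for that symbol).
std::string decode(const std::vector<int> &chromosome) {
    std::vector<std::string> sentence = {"<expr>"};
    size_t pos = 0;     // next gene to consume (wraps, as in wrapping GE)
    int maxSteps = 100; // guard against non-terminating derivations
    while (maxSteps-- > 0) {
        // find the leftmost non-terminal symbol
        size_t i = 0;
        while (i < sentence.size() && grammar.find(sentence[i]) == grammar.end()) i++;
        if (i == sentence.size()) break; // only terminal symbols remain
        const auto &rules = grammar[sentence[i]];
        const auto &rule = rules[chromosome[pos % chromosome.size()] % rules.size()];
        pos++;
        sentence.erase(sentence.begin() + i);
        sentence.insert(sentence.begin() + i, rule.begin(), rule.end());
    }
    std::string out;
    for (const auto &s : sentence) out += s;
    return out;
}

int main() {
    std::vector<int> chromosome = {9, 8, 6, 4, 16, 10, 17, 23, 8, 14};
    std::cout << decode(chromosome) << std::endl; // prints (x1+exp(x3))
    return 0;
}
```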
2.2. Feature Construction
In the proposed technique, the chromosomes of grammatical evolution are used as a set of functions that create artificial features as nonlinear combinations of the existing ones. This process can also be considered a feature selection method, since it is possible that only a part of the original features will be used in the generated features. The proposed method creates artificial features from the original ones, and the process for any chromosome p is as follows:
Split the chromosome p into K parts, where K is the number of artificial features that should be constructed.
For every part, apply the decoding procedure of Section 2.1 in order to produce a mapping function $g_i(\vec{x})$, $i = 1, \ldots, K$.
The constructed feature vector for any pattern $\vec{x}$ of the original dataset is $g(\vec{x}) = \left( g_1(\vec{x}), \ldots, g_K(\vec{x}) \right)$.
The final set of features will be considered mapping functions of the original ones. For example, the set $g = \left( g_1(\vec{x}), g_2(\vec{x}) \right)$ is a set of two mapping functions for the original feature vector $\vec{x}$. However, sometimes the generated features can lead to extreme values, and this will result in generalization problems for the used machine learning models. For this reason, in the present work, penalty factors are used so that the mapping functions do not lead to extreme values. These penalty factors also modify the fitness function that the particle swarm optimization technique will minimize each time, and they are considered next.
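As a brief sketch of how such mapping functions modify a dataset, the following C++ fragment replaces every original pattern with its image under the constructed features; the hand-written mapping functions here are hypothetical stand-ins for decoded chromosomes:

```cpp
#include <cmath>
#include <functional>
#include <iostream>
#include <vector>

using Pattern = std::vector<double>;
// A mapping function g_i turns an original pattern into one artificial feature.
using MapFunc = std::function<double(const Pattern &)>;

// Replace every original pattern x by the mapped vector (g_1(x), ..., g_K(x)).
std::vector<Pattern> transformDataset(const std::vector<Pattern> &data,
                                      const std::vector<MapFunc> &g) {
    std::vector<Pattern> out;
    for (const Pattern &x : data) {
        Pattern mapped;
        for (const MapFunc &gi : g) mapped.push_back(gi(x));
        out.push_back(mapped);
    }
    return out;
}

int main() {
    // Hypothetical mapping functions, standing in for decoded chromosomes.
    std::vector<MapFunc> g = {
        [](const Pattern &x) { return x[0] + std::sin(x[1]); },
        [](const Pattern &x) { return std::exp(x[2]); }
    };
    std::vector<Pattern> data = {{1.0, 0.5, 0.1}, {2.0, 1.5, 0.2}};
    for (const Pattern &p : transformDataset(data, g))
        std::cout << p[0] << " " << p[1] << std::endl;
    return 0;
}
```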
2.3. Fitness Calculation
Each chromosome in grammatical evolution produces a series of artificial features, which are nonlinear functions of existing features. However, an evaluation and a distinction should be made between those sets of features which will yield more in the learning process and those which will yield less. This is accomplished by assessing the appropriateness of these features. In order to be able to compute the fitness of each group of features, the original training set should be reduced using the artificial features that have been produced, and the following steps should be executed for any given chromosome p:
Denote as $T = \left\{ \left( \vec{x}_1, y_1 \right), \ldots, \left( \vec{x}_M, y_M \right) \right\}$ the original training set.
Set $P = 0$ for the penalty factor.
Compute the mapping function $g(\vec{x})$ as suggested in Section 2.2.
Set $T_g = \emptyset$ for the modified training set.
For $i = 1, \ldots, M$, carry out the following steps.
- (a) Set $\vec{t}_i = g\left( \vec{x}_i \right)$.
- (b) Set $T_g = T_g \cup \left\{ \left( \vec{t}_i, y_i \right) \right\}$.
- (c) If $\left\| \vec{t}_i \right\| > B$, where $B$ is a predefined bound, then set $P = P + \lambda$, where $\lambda$ is a predefined positive value.
End For.
Train an RBF network $R(\vec{x})$ with $H$ processing nodes on the modified set $T_g$ and obtain the following error:
\[ E_p = \sum_{i=1}^{M} \left( R\left( \vec{t}_i \right) - y_i \right)^2 \]
Compute the final fitness value:
\[ f_p = E_p \times \left( 1 + P \right) \]
where $P \geq 0$ is the penalty factor accumulated in the previous steps.
2.4. The Used PSO Method
The main steps of this algorithm are outlined in detail in Algorithm 1.
Algorithm 1 The base PSO algorithm executed in one processing unit.
Initialization Step.
- (a) Set $k = 0$ as the iteration counter.
- (b) Set $m$ as the total number of particles.
- (c) Set $k_{\max}$ as the maximum number of iterations allowed.
- (d) Initialize randomly the positions $\vec{p}_1, \vec{p}_2, \ldots, \vec{p}_m$ of the particles. For the grammatical evolution, every particle is a vector of randomly selected integers.
- (e) Initialize randomly the velocities $\vec{u}_1, \vec{u}_2, \ldots, \vec{u}_m$. For the current work, every vector of velocities is a series of randomly selected integers in a predefined range $\left[ L, R \right]$.
- (f) For $i = 1, \ldots, m$, set $\vec{b}_i = \vec{p}_i$. The vector $\vec{b}_i$ denotes the best located position of particle $i$.
- (g) Set $\vec{p}^{*} = \arg \min_{i} f\left( \vec{p}_i \right)$, the best position discovered so far.
Termination Check Step. If $k \geq k_{\max}$, then go to the Test Step.
Update Step. For $i = 1, \ldots, m$, do the following:
- (a) Compute the new velocity $\vec{u}_i$ as a combination of the vectors $\vec{u}_i$, $\vec{b}_i$, and $\vec{p}^{*}$.
- (b) Set the new position for the particle to $\vec{p}_i = \vec{p}_i + \vec{u}_i$.
- (c) Calculate the fitness $f\left( \vec{p}_i \right)$ of particle $i$ using the procedure described in Section 2.3.
- (d) If $f\left( \vec{p}_i \right) \leq f\left( \vec{b}_i \right)$, then set $\vec{b}_i = \vec{p}_i$.
End For. Set $k = k + 1$. Update $\vec{p}^{*}$ as the best position discovered so far. Go to the Termination Check Step.
Test Step. Apply the mapping function of the best particle $\vec{p}^{*}$ to the test set of the problem, and apply a machine learning model, obtaining the corresponding test error.
The above algorithm calculates at every iteration the new position of particle $i$ using
\[ \vec{p}_i = \vec{p}_i + \vec{u}_i . \]
In most cases, the new velocity is a linear combination of the previously computed velocity and of the vectors towards the best located positions $\vec{b}_i$ and $\vec{p}^{*}$, and it can be defined as follows:
\[ \vec{u}_i = \omega \vec{u}_i + r_1 c_1 \left( \vec{b}_i - \vec{p}_i \right) + r_2 c_2 \left( \vec{p}^{*} - \vec{p}_i \right) \]
where the following are true:
The variables $r_1, r_2$ are random numbers defined in $\left[ 0, 1 \right]$.
The constants $c_1, c_2$ are defined in the range $\left[ 1, 2 \right]$.
The variable $\omega$, commonly called the inertia, was suggested by Shi and Eberhart [
76]. In the original paper, they proposed the idea that large values of the inertia coefficient lead to a better exploration of the search space, while smaller values of the coefficient concentrate the method around regions likely to contain the global minimum. Hence, in their work, the value of the inertia factor generally started at large values and decreased with the iterations. In the current work, the inertia value was computed through the following equation:
\[ \omega = 0.5 + \frac{r}{2} \]
The variable $r$ is a random number with $r \in \left[ 0, 1 \right]$. This inertia calculation was proposed in [
104]. With this calculation of the inertia variable, an even better exploration of the search space is achieved through the randomness it introduces, something that was also found in the publication of Charilogis and Tsoulos [
105].
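A compact C++ sketch of this update rule, using the random inertia above, is shown below; treating the positions as integer vectors that are rounded and clipped after the update is an assumption made here so that the particles remain valid grammatical evolution chromosomes:

```cpp
#include <algorithm>
#include <cmath>
#include <iostream>
#include <random>
#include <vector>

// One PSO velocity and position update for an integer chromosome, following
// the velocity equation and the random inertia omega = 0.5 + r/2 given above.
void updateParticle(std::vector<int> &p, std::vector<double> &u,
                    const std::vector<int> &best,        // b_i
                    const std::vector<int> &globalBest,  // p*
                    std::mt19937 &gen) {
    std::uniform_real_distribution<double> U(0.0, 1.0);
    const double c1 = 1.5, c2 = 1.5;          // constants chosen in [1, 2]
    const double omega = 0.5 + U(gen) / 2.0;  // random inertia in [0.5, 1.0]
    for (size_t j = 0; j < p.size(); j++) {
        const double r1 = U(gen), r2 = U(gen);
        u[j] = omega * u[j]
             + r1 * c1 * (best[j] - p[j])
             + r2 * c2 * (globalBest[j] - p[j]);
        // new position, rounded and kept non-negative for GE decoding
        p[j] = std::max(0, p[j] + (int)std::lround(u[j]));
    }
}

int main() {
    std::mt19937 gen(42);
    std::vector<int> p = {9, 8, 6, 4}, b = {9, 8, 6, 4}, g = {12, 3, 7, 5};
    std::vector<double> u = {1.0, -1.0, 0.5, 0.0};
    updateParticle(p, u, b, g, gen);
    for (int v : p) std::cout << v << " ";
    std::cout << std::endl;
    return 0;
}
```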
A flowchart for the overall process is shown in
Figure 2.
3. Experiments
The ability of the proposed technique to produce effective artificial features for class prediction and feature learning will be measured in this section on some datasets from the relevant literature. These problems have been studied by various researchers and cover a wide range of research areas from physics to economics. The datasets come from relevant publicly available websites, such as the UCI Machine Learning Repository [106].
The proposed technique will be compared with a series of known machine learning techniques, and the experimental results are then presented in the relevant tables.
3.1. Experimental Datasets
The classification problems used in the experiments are the following:
Appendicitis, a medical dataset [
107,
108];
The
Australian dataset [
109], a dataset concerning economic transactions in banks;
The
Balance dataset, a dataset generated to model psychological experimental results [
110];
The
Bands dataset, a dataset used in rotogravure printing [
111];
The
Dermatology dataset [
112], a medical dataset used to detect a type of erythemato-squamous disease;
The
Hayes Roth dataset [
113];
The
Heart dataset [
114], a medical dataset used to detect heart diseases;
The
House Votes dataset [
115], a dataset related to the congressional voting records of the USA;
The
Ionosphere dataset, used to classify measurements from the ionosphere, which has been examined in a variety of research papers [
116,
117];
The
Liver disorder dataset [
118,
119], a dataset used for medical purposes;
The
Mammography dataset [
120], a medical dataset used for breast cancer diagnosis;
The
Parkinson’s dataset [
121,
122], a dataset used to detect Parkinson’s disease using voice measurements;
The
Pima dataset [
123], a dataset used for medical purposes;
The
Pop failures dataset [
124], a dataset related to meteorological data;
The
Regions2 dataset, a medical dataset for liver biopsy images [
125];
The
Saheart dataset [
126], a medical dataset;
The
Segment dataset [
127], a dataset related to image segmentation;
The
BC dataset [
128], a dataset used for the diagnosis of breast tumors;
The
Wine dataset, a dataset related to chemical analysis of wines [
129,
130];
EEG datasets [
131,
132], an EEG dataset from which the following cases were used in the experiments:
- (a)
Z_F_S;
- (b)
ZO_NF_S;
- (c)
ZONF_S.
The regression datasets used in the relevant experiments are the following:
The
Abalone dataset [
134], a dataset used to predict the age of abalones;
The
Airfoil dataset, a dataset provided by NASA [
135] which was obtained from a series of aerodynamic and acoustic tests;
The Baseball dataset, a dataset related to the salaries of baseball players;
The
BK dataset [
136], a dataset used to calculate the points in a basketball game;
The BL dataset, which is used in machine problems;
The
Concrete dataset [
137], a civil engineering dataset for calculating concrete’s compressive strength;
The Dee dataset, which is used to estimate the daily average price of 1 kWh of electric energy in Spain;
The Diabetes dataset, which is a medical dataset;
The
Housing dataset [
138];
The FA dataset, which is used to fit body fat to other measurements;
The MORTGAGE dataset, holding economic data from the USA, where the goal is to predict the 30-year conventional mortgage rate;
The
PY dataset (pyrimidines problem) [
139];
The Quake dataset, which is used to approximate the strength of an earthquake given the depth of its focal point and its latitude and longitude;
The Treasury dataset, which contains economic data for the USA, where the goal is to predict the one-month CD rate.
3.2. Experimental Results
In order to give greater credibility to the experiments carried out, the method of tenfold cross validation was incorporated for every experimental dataset. Every experiment was repeated 30 times, using different seeds for the random generator each time. All the used code was implemented in ANSI C++ using the OPTIMUMS programming library for optimization purposes, which is freely available at
https://github.com/itsoulos/OPTIMUMS/ (accessed on 10 July 2023). For the classification datasets, the average classification error as measured on the test set is reported, while for the regression datasets, the average regression error is reported. Here, by the term classification error, we mean the percentage of patterns in the test set that were classified into a different class than the expected one. Also, in every table, an additional column, “AVERAGE”, was added to show the average classification or regression error for the corresponding datasets. The values for the experimental parameters are shown in
Table 2.
In all techniques, the same parameter sets and the same random numbers were used in order to have a fair comparison of the experimental results.
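The following minimal C++ sketch illustrates this validation protocol, i.e., a tenfold split repeated for 30 different seeds; the training routine is a stub, since any of the compared models can be plugged in:

```cpp
#include <algorithm>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

// Stub: train a model on the training indices and return its error on the
// test indices. Any of the compared models could be plugged in here.
double trainAndTest(const std::vector<int> &train, const std::vector<int> &test) {
    (void)train; (void)test;
    return 0.0;
}

// Tenfold cross validation repeated 30 times with a different seed each
// time; the reported value is the average test error over all folds.
double averageCvError(int patterns) {
    const int repeats = 30, folds = 10;
    double total = 0.0;
    for (int seed = 1; seed <= repeats; seed++) {
        std::vector<int> idx(patterns);
        std::iota(idx.begin(), idx.end(), 0);
        std::mt19937 gen(seed);
        std::shuffle(idx.begin(), idx.end(), gen);
        for (int f = 0; f < folds; f++) {
            std::vector<int> train, test;
            for (int i = 0; i < patterns; i++)
                (i % folds == f ? test : train).push_back(idx[i]);
            total += trainAndTest(train, test);
        }
    }
    return total / (repeats * folds);
}

int main() {
    std::cout << averageCvError(100) << std::endl;
    return 0;
}
```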
In the first phase, the proposed technique will generate new artificial features from the existing ones with the help of a technique guided by the partnership of particle swarm optimization and grammatical evolution. In the second phase of the technique, these features will be used to modify the original test set, to which any machine learning method can then be applied. In the second phase of the current work, two different techniques were used: an RBF neural network and an artificial neural network trained using a genetic algorithm. This was performed in order to establish the potential of the proposed procedure to improve the performance of both simple and more complex machine learning models. The proposed technique that created artificial features was compared on the same datasets against a series of well-known methods from the relevant literature:
A genetic algorithm with m chromosomes, denoted as GENETIC in the experimental tables, is used to train an artificial neural network with H hidden nodes. After termination of the genetic algorithm, the local optimization method BFGS is applied to the best chromosome of the population.
The radial basis function (RBF) network [
140] with
H processing nodes.
The optimization method Adam [
141], which is used to train an artificial neural network with
H hidden nodes.
The Rprop optimization method [
142,
143], which is used to train an artificial neural network with
H hidden nodes.
The NEAT method (NeuroEvolution of Augmenting Topologies) [
144].
The experimental results using the above methods on the classification datasets are shown in
Table 3, and the results for the regression datasets are illustrated in
Table 4.
The results using the proposed method and for the construction of two, three, and four artificial features are presented in
Table 5 and
Table 6. The RBF column represents the experimental results in which, after the construction of the artificial features, an RBF network with
H processing nodes was applied on the modified dataset. Also, the column marked “GENETIC” in
Table 5 and
Table 6 stands for the results obtained by the application of a genetic algorithm with
m chromosomes to the modified dataset when the feature creation procedure was finished.
The experimental results are of great interest, as one can see from their careful study that the proposed technique was able to significantly reduce the error in the corresponding test sets. Especially in the case of regression problems, the reduction in error was, on average, greater than 50%. Moreover, the usage of a neural network trained by a genetic algorithm on the modified datasets gave clearly better results than the use of an RBF neural network, especially in the classification datasets. An additional test was performed for the regression datasets, where the number of particles in the PSO algorithm increased from 100 to 400, and the results are graphically illustrated in
Figure 3.
Judging from the results, we can observe that the selection of 200 particles in the experimental results was an optimal choice and a compromise between the speed and efficiency of the method, as adding another 200 particles to the particle swarm optimization did not significantly improve the efficiency of the proposed method.
Moreover, a graphical comparison between the genetic algorithm when applied to artificial datasets and the genetic algorithm when applied to the original classification datasets is given in
Figure 4. The same graphical comparison is also shown for the RBF model in
Figure 5, and in these figures, the ability of the proposed method to drastically reduce the learning error through the construction of artificial features is evident.
Also, in
Figure 6, a comparison for the regression datasets is outlined between the proposed method with the application of the genetic algorithm in the second phase and the genetic algorithm applied to the original regression datasets.
Finally, using the Wilcoxon signed-rank test, a comparison was made between the proposed method and all the mentioned machine learning methods for the classification datasets. This comparison is graphically outlined in
Figure 7.
These results suggest that the proposed method has a distinct advantage over the well-known classification methods when it comes to constructing artificial features for classification tasks. It offered improved performance and provided a more effective solution for these datasets.
4. Conclusions
A hybrid technique that utilizes a particle swarm optimizer and a feature creation method using grammatical evolution was introduced here. The proposed method can identify possible dependencies between the original features and can also reduce the number of required features to a limited number. Also, the method can remove from the set of features those features that may not contribute to the learning of the dataset by some machine learning model. In addition, to make learning more efficient, the values of the generated features are bounded within a value interval using penalty factors. The constructed features are evaluated in terms of their effectiveness with the help of a fast machine learning model such as the RBF network, even though other more effective models could also be used. Among the advantages of the proposed procedure is the fact that it does not require any prior knowledge of the dataset to which it will be applied, and furthermore, the procedure is exactly the same whether it is a data classification problem or a data fitting problem. The particle swarm optimization method was used for the production of the characteristics, as it has been proven by the relevant literature to be an extremely efficient technique with a limited number of parameters that must be defined by the user.
The current work was applied on an extended series of widely used datasets from various fields and was compared against some machine learning models on the same datasets. From the experimental results, it was seen that the proposed technique dramatically improved the performance of traditional learning techniques when applied to artificial features. The proposed two-stage technique generated artificial features in the first stage guided by the particle swarm optimization technique, and in the second stage, either a neural network trained by a genetic algorithm or an RBF network was applied to the modified test set. In both cases, the improvement from artificial feature generation in the test error was significant for each learning model. This improvement reached an average of 30% for data classification and 50% for data fitting problems. In fact, in many cases, the improvement in the test error exceeded 75%. Moreover, the method appears to be quite robust, since increasing the number of particles in the particle swarm optimization method did not appear to significantly reduce the average error in the test sets. Furthermore, increasing the number of features constructed did not seem to have a dramatic effect on the performance of the method, which means that the method was able to achieve good generalization results even with a limited number of features, which in turn led to greatly reducing the number of dimensions of the original problem. Future work on the method may include the use of parallel techniques for feature construction to drastically reduce the required execution time.