Article

Detection of Diabetes through Microarray Genes with Enhancement of Classifiers Performance

by Dinesh Chellappan 1,* and Harikumar Rajaguru 2
1 Department of Electrical and Electronics Engineering, KPR Institute of Engineering and Technology, Coimbatore 641 407, Tamil Nadu, India
2 Department of Electronics and Communication Engineering, Bannari Amman Institute of Technology, Sathyamangalam 638 401, Tamil Nadu, India
* Author to whom correspondence should be addressed.
Diagnostics 2023, 13(16), 2654; https://doi.org/10.3390/diagnostics13162654
Submission received: 30 July 2023 / Revised: 6 August 2023 / Accepted: 7 August 2023 / Published: 11 August 2023
(This article belongs to the Section Machine Learning and Artificial Intelligence in Diagnostics)

Abstract:
Diabetes is a life-threatening, non-communicable disease. Diabetes mellitus is a prevalent chronic disease with a significant global impact. The timely detection of diabetes in patients is necessary for an effective treatment. The primary objective of this study is to propose a novel approach for identifying type II diabetes mellitus using microarray gene data. Specifically, our research focuses on the performance enhancement of methods for detecting diabetes. Four different Dimensionality Reduction techniques, Detrend Fluctuation Analysis (DFA), the Chi-square probability density function (Chi2pdf), the Firefly algorithm, and Cuckoo Search, are used to reduce high dimensional data. Metaheuristic algorithms like Particle Swarm Optimization (PSO) and Harmonic Search (HS) are used for feature selection. Seven classifiers, Non-Linear Regression (NLR), Linear Regression (LR), Logistics Regression (LoR), Gaussian Mixture Model (GMM), Bayesian Linear Discriminant Classifier (BLDC), Softmax Discriminant Classifier (SDC), and Support Vector Machine—Radial Basis Function (SVM-RBF), are utilized to classify the diabetic and non-diabetic classes. The classifiers’ performances are analyzed through parameters such as accuracy, recall, precision, F1 score, error rate, Matthews Correlation Coefficient (MCC), Jaccard metric, and kappa. The SVM (RBF) classifier with the Chi2pdf Dimensionality Reduction technique with a PSO feature selection method attained a high accuracy of 91% with a Kappa of 0.7961, outperforming all of the other classifiers.

1. Introduction

Some statistics related to diabetes worldwide are as follows. The global prevalence of diabetes among adults (20–79 years old) was 10.5% in 2021 [1]. The prevalence of diabetes is higher in low- and middle-income countries than in high-income countries [2]. The region with the highest prevalence of diabetes is the Middle East and North Africa, where 13.9% of adults have diabetes. Diabetes was the ninth leading cause of death worldwide in 2019, with 4.2 million deaths attributed to the disease or its complications. In most cases, diabetes is caused by factors such as eating at irregular intervals and a lack of physical activity [3]. When a healthy human consumes a normal meal during the day, their blood glucose level rises to around 120–140 mg/dL [4].
India has a high prevalence of diabetes, and is called the world’s diabetes capital. According to the International Diabetes Federation [5], in 2021, India had an estimated 87 million adults aged between 20 and 79 years with diabetes. This number is projected to increase to 151 million by 2045. The prevalence of diabetes in India varies across regions, with the southern and northern states having higher prevalence rates compared to the eastern and northeastern states. The states with the highest prevalence rates are Kerala, Tamil Nadu, and Punjab [6]. Type 2 diabetes accounts for more than 90% of all cases of diabetes in India [7]. Type 1 diabetes is less common and accounts for less than 10% of all diabetes cases. The complications of diabetes, such as heart disease, kidney disease, and eye disease, are also a significant problem in India. The burden of diabetes and its complications is significant in India and highlights the need for effective prevention and management programs [8]. Diabetes is a chronic illness that affects the body’s ability to produce insulin, a hormone that regulates blood sugar levels. As a result, people with diabetes often have high blood sugar levels, which can lead to several health complications. Some of the symptoms of high blood sugar include increased thirst, increased hunger, and frequent urination [9].
Diabetes can cause a range of complications, including neuropathy, retinopathy, nephropathy, and foot ulcers. Approximately one in two people with diabetes are undiagnosed, which means that they may not be receiving appropriate treatment to prevent or manage these complications. In terms of economic burden, the estimated global health expenditure on diabetes was $760 billion in 2019 [10]. Diabetes is a major cause of lost productivity, as it can lead to disability and premature death. The economic burden of diabetes is expected to increase in the coming years, as more people are diagnosed with the disease. These statistics highlight the growing global burden of diabetes and the urgent need for effective prevention and management strategies. Identifying type II diabetic patients from microarray gene data obtained from the pancreas poses several challenges that need to be addressed. The high dimensionality of these gene data presents a significant challenge. Microarray experiments often generate many genes, resulting in a high-dimensional feature space [11]. This can lead to increased computational complexity and may require Dimensionality Reduction techniques to alleviate the curse of dimensionality.
Selecting informative features from the reduced dataset is crucial for accurate classification. Identifying the most relevant genes associated with type II diabetes is a non-trivial task, as the genetic basis of this disease is complex and involves various interactions. Optimization techniques, such as genetic algorithms or feature selection algorithms, need to be employed to identify the most discriminative features [12]. Additionally, the heterogeneity of gene expression patterns within the pancreas and individual variations in gene regulation further complicate the task of classification. The identification of robust and reliable classifiers that can effectively capture the subtle patterns in the data while generalizing well to unseen samples is another significant challenge. The identification of non-diabetic individuals who are at a higher risk of developing type II diabetes can lead to early detection and intervention [13]. Once individuals are accurately classified as diabetic or non-diabetic based on their gene expression profiles, the information can be used to develop personalized treatment strategies. This includes tailoring medication regimens, dietary recommendations, and exercise plans that are specific to the genetic profiles of diabetic patients. The gene expression patterns obtained from the microarray data can provide valuable insights into the underlying biological processes involved in type II diabetes. This information can be utilized to identify potential drug targets and guide the development of novel therapeutic interventions.
The objective of this research is to develop a classification framework for analyzing microarray gene data obtained from pancreases and classify the studied individuals as either diabetic or non-diabetic. This objective will be achieved through the utilization of the machine learning approach and meta-heuristic algorithms for feature extraction, selection, and classification.

2. Literature Review

According to the WHO [5], in India, 77 million people are living with type II diabetes, and 25 million are at high risk of developing it. Many people with diabetes are unaware of the severity of the condition, which can lead to serious complications such as nerve damage, reduced blood flow, and limb amputation. Shaw JE, Sicree RA, and Zimmet PZ [14] presented statistical data projecting that the global prevalence of diabetes will increase from 6.4% in 2010 to 7.7% in 2030, affecting an estimated 439 million adults. Mohan V and Pradeepa R [15] reported that the prevalence of type 2 diabetes in India is increasing rapidly, and the country is expected to have the largest number of people with diabetes in the world by 2045. The high prevalence of diabetes in India is due to a combination of genetic, lifestyle, and demographic factors. The complications of type 2 diabetes, such as diabetic retinopathy, neuropathy, and cardiovascular disease, are a significant burden in India. S. A. Abdulkareem et al. [16] conducted a comparative analysis of three soft computing techniques to predict diabetes risk: fuzzy analytical hierarchy processes (FAHP), support vector machine (SVM), and artificial neural networks (ANNs). The analysis involved 520 participants using a publicly available dataset, and the results show that these computational intelligence methods can reliably and effectively predict diabetes. The reported sensitivity values were 0.7312, 0.747, and 0.8793 for the FAHP, ANN, and SVM models, respectively. Aiswarya Mujumdar et al. [17] published a study using various machine learning algorithms to predict diabetes; Logistic Regression achieved an accuracy of 96%, while the AdaBoost classifier achieved the highest accuracy of 98.8%.
K. Bhaskaran et al. [18], in a study published in the journal Diabetes Care in 2016, found that a machine learning algorithm based on Logistic Regression was able to predict diabetes with an accuracy of 85%. A machine learning algorithm based on decision trees was able to detect diabetes with an accuracy of 90%. A study published in the journal JAMA in 2018 found that a machine learning algorithm based on support vector machines was able to predict diabetes with an accuracy of 80%. Olta Llaha et al. [19] used the four classification methods mentioned above to classify data from a dataset of women with diabetes. The results of the study showed that the decision tree algorithm was the most accurate, with an accuracy of 79%. The Naive Bayes algorithm was the least accurate, with an accuracy of 65%. The SVM and Logistic Regression algorithms had accuracies of 73% and 74%, respectively. The results of the study are promising, but it is important to note that the study was conducted on a dataset of women with diabetes from the Public Health Institute database. B. Shamreen Ahamed et al. [20] used different classifiers like Random Forest, Light Gradient Boosting Machine (LGBM), Gradient Boosting Machine, Support Vector Machine (SVM), Decision Tree, and XGBoost, and the database used in this study was the PIMA database from the UCI repository, with 768 instances. The study aimed to improve the detection accuracy to 95.20% using the LGBM Classifier. The Decision Tree classifier had the highest accuracy of 73.82% without preprocessing, while the classifiers KNN (k = 1) and Random Forest produced the highest accuracy rate of 100% after preprocessing.
Neha Prerna Tigga et al. [21] explained six machine learning classification methods, including Logistic Regression, Naive Bayes, Decision Tree, Support Vector Machine, Random Forest, and K-nearest neighbors, which were implemented to predict the risk of type 2 diabetes. Random Forest had the highest accuracy of 94.10%. The database used in this study was PIMA, with a total of 952 participants selected who were aged 18 and above, out of which 580 were males and 372 were females. The parameters with the highest significance for predicting diabetes were age, family history of diabetes, physical activity, regular medication, and gestational diabetes. Maniruzzaman, Md, et al. [22] used a Gaussian process (GP)-based classification technique to predict diabetes. The dataset used in this study, from UCI (University of California, Irvine), comprised 768 female patients, of which 268 were diabetic patients and 500 were non-diabetic patients. The dataset was used to compare the performance of four machine learning methods on a classification task. The methods used were Gaussian Process (GP), Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), and Naive Bayes (NB). The study shows that GP had the highest accuracy (81.97%), followed by LDA (75.39%), QDA (74.38%), and NB (73.17%). The GP model outperformed all of the other methods in terms of accuracy and sensitivity. Gupta, Sanjay Kumar et al. [23] conducted a study in India which found that the Indian Diabetes Risk Score (IDRS) was 64% effective in identifying people with a high risk of developing diabetes. The IDRS is a simple, cost-effective tool that can be used to screen large populations for diabetes. The study found that the IDRS was more effective in identifying people with a high BMI (body mass index). People with a BMI of more than 30 were six times more likely to have a high IDRS than people with a BMI of less than 18.5.
Howlader, Koushik Chandra et al. [24] conducted a study which used machine learning to identify features associated with T2D in Pima Indians. The best classifiers were Generalized Boosted Regression modeling, Sparse Distance Weighted Discrimination, a Generalized Additive Model using LOESS, and Boosted Generalized Additive Models. The study found that Generalized Boosted Regression modeling had the highest accuracy (90.91%), with a Kappa statistic of 78.77% and a specificity of 85.19%. Sisodia, Deepti et al. [25] conducted a study comparing the performance of three machine learning algorithms for detecting diabetes: Decision Tree, Support Vector Machine (SVM), and Naive Bayes. The study used the Pima Indians Diabetes Database (PIDD) and evaluated the algorithms on various measures, including accuracy, precision, F-measure, and recall. The study found that Naive Bayes had the highest accuracy (76.30%), followed by Decision Tree (74.67%) and SVM (72.00%). Mathur, Prashant et al. [26] conducted a study about Indian diabetes statistics. In India, 9.3% of adults have diabetes, and 24.5% have impaired fasting blood glucose. Of those with diabetes, only 45.8% are aware of their condition, 36.1% are on treatment, and 15.7% have it under control. This is lower than the awareness, treatment, and control rates in other countries. For example, in the United States, 75% of adults with diabetes are aware of their condition, 64% are on treatment, and 54% have it under control. Kazerouni, Faranak et al. [27] conducted a performance evaluation of various algorithms in which the AUC, sensitivity, and specificity were considered and ROC curves were plotted. The KNN algorithm had a mean AUC of 91% with a standard deviation of 0.09, while the mean sensitivity and specificity were 96% and 85%, respectively. The SVM algorithm achieved a mean AUC of 95% with a standard deviation of 0.05 after stratified 10-fold cross-validation, along with a mean sensitivity and specificity of 95% and 86%.
Ramdaniah et al. [28] used microarray genes for the identification of diabetic classes from the GSE18732 dataset. A total of 46 diabetic and 72 non-diabetic samples were used in this study. The machine learning techniques used were the Naive Bayes and SVM Sigmoid kernel methods, with accuracies of 88.89% and 83.33%, respectively. Many researchers have used the PIMA Indian diabetes dataset to classify and analyze diabetic and non-diabetic classes and report performance metrics like accuracy, sensitivity, specificity, MCC, etc., although very few findings are available for microarray gene-based datasets for the identification of diabetic and non-diabetic classes.
The methodology of the research conducted in this study is depicted in Figure 1. It shows that four Dimensionality Reduction techniques, DFA, Chi2pdf, Firefly search, and Cuckoo search, are used. To further analyze the data, classification of data without feature selection and with feature selection methods is conducted. In this study, with feature selection methods, two optimization algorithms are used, PSO and HS. Moreover, seven classifiers are used—NLR, LR, LoR, GMM, BLDC, SDC, and SVM-RBF—to classify the classes as normal and diabetic.
Section 1 introduces the paper. The literature review is discussed in Section 2. The material and methods of datasets are explained in Section 3. Feature extraction through Dimensionality Reduction techniques is explained in Section 4. Section 5 deals with feature selection methods for the research. The classifier’s properties are explained in Section 6. Training and Testing of the classifier is discussed in Section 7. The results are discussed in Section 8. This paper is concluded in Section 9.

3. Material and Methods

Microarray gene expression analysis plays a vital role in understanding the molecular mechanisms and identifying gene expression patterns associated with various diseases, including diabetes. Here are some ways in which microarray gene expression analysis contributes to our understanding of diabetes: microarray analysis allows researchers to compare gene expression levels between healthy individuals and those with diabetes, and the differentially expressed genes may be directly involved in disease development, progression, or complications. The data are readily available from many public repositories. The “Expression data from human pancreatic islets” were taken from a Nordic islet transplantation program and consist of 57 non-diabetic and 20 diabetic cadaver donors, yielding a total of 28,735 gene data sets (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA178122) (accessed on 20 August 2021). The data are organized as 22,960 genes per patient, and the peak intensity with the average value was selected among the total samples. A base-10 logarithmic transformation was applied, and each individual sample was standardized to a mean of 0 and a variance of 1.
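As a simple illustration of the preprocessing described above, the sketch below applies a base-10 logarithmic transformation and standardizes each sample to zero mean and unit variance. The array shape and variable names are assumptions for illustration only, not the authors' actual pipeline.

```python
import numpy as np

# Hypothetical expression matrix: rows = genes, columns = donor samples
# (values assumed strictly positive so the log transform is defined).
expression = np.random.uniform(1.0, 1000.0, size=(22960, 77))

# Base-10 logarithmic transformation, as described in the text.
log_expr = np.log10(expression)

# Standardize each individual sample (column) to mean 0 and variance 1.
standardized = (log_expr - log_expr.mean(axis=0)) / log_expr.std(axis=0)

print(standardized.mean(axis=0)[:3])  # approximately 0 for each sample
print(standardized.std(axis=0)[:3])   # approximately 1 for each sample
```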
The statistical analysis methods used in this article, described in the following sections, are as follows: four Dimensionality Reduction techniques for reducing the high-dimensional data, two meta-heuristic algorithms for feature selection with verification of the p-value through a t-test, and seven classifiers used for classifying the diabetic and non-diabetic classes through statistical parameters such as accuracy, recall, precision, F1 score, MCC, error rate, Jaccard metric, and kappa.

Data Set

When using biological functions to detect diabetes and the features of its secondary criteria in probability functions based on p-values, false positive errors in the selection of significant genes also have to be detected. The human gene data available in many portals consist of 28,735 genes with 50 non-diabetic and 20 diabetic samples, considered for the greatest minimal intensity across the 70 samples. During dimensionality reduction, both the model- and heuristic-based groupings of the diabetic and non-diabetic samples were reduced to [2870 × 20] and [2870 × 50]. These were further reduced to 287 × 20 and 287 × 50 through feature selection based on two techniques, Particle Swarm Optimization and Harmonic Search, before classification.
Table 1 shows the description of Pancreas Microarray Gene Data set for Diabetic and Non-Diabetic classes.

4. Dimensionality Reduction

Dimensionality Reduction techniques are important for the analysis of microarray gene data for the type II diabetes class. Microarray experiments often generate a vast amount of gene expression data, resulting in a high-dimensional feature space. However, not all genes contribute equally to the classification task, and the presence of noise and irrelevant features can hinder the accuracy and interpretability of the results. Dimensionality reduction methods play a pivotal role in addressing these challenges by extracting the most informative features that are relevant to type II diabetes classification. These techniques aim to reduce the dimensionality of the data while preserving the discriminatory information, enabling the efficient computation and improved performance of subsequent classification algorithms. By eliminating redundant and irrelevant features, Dimensionality Reduction can enhance the classification results.
In this paper, four DR methods, DFA, Chi2pdf, the Firefly algorithm, and the Cuckoo search algorithm are utilized to reduce the dimension of a data set. The DR methods are discussed in the following section of the paper.
A.
Detrend Fluctuation Analysis (DFA)
DFA is the first DR method utilized in this paper. It is based on inspecting the correlation properties of stationary and non-stationary signals, and the short-range and long-range relationships used in DFA were described by Berthouze L et al. [29]. In a typical application, the scaling exponent of DFA is used to segregate the input data as rational or irrational. To estimate the functions of the output class data, DFA is useful for discriminating between healthy and unhealthy subjects.
The algorithm is determined by the root mean square fluctuation of the detrended, integrated time series of the input data at a given scale.
$A(n) = \sum_{i=1}^{n} \left[ X(i) - \bar{X} \right]$
Here, X(i) denotes the ith sample of the input data, $\bar{X}$ is the mean value of the overall signal, and A(n) is the estimated value of the integrated time series.
$F(n) = \sqrt{\frac{1}{N} \sum_{k=1}^{N} \left[ b(k) - b_n(k) \right]^2}$
where $b_n(k)$ is the local trend of the kth point within a predetermined window of scale n.
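The following is a minimal sketch of the DFA computation implied by the equations above: it integrates the mean-subtracted series, removes a linear trend in non-overlapping windows of scale n, and returns the RMS fluctuation F(n). The window handling and the linear (first-order) detrending are assumptions for illustration.

```python
import numpy as np

def dfa_fluctuation(x, n):
    """Root-mean-square fluctuation F(n) of series x at window scale n."""
    x = np.asarray(x, dtype=float)
    # Integrated, mean-subtracted profile A(k) = sum_{i<=k} (X(i) - mean(X)).
    profile = np.cumsum(x - x.mean())
    # Split the profile into non-overlapping windows of length n.
    n_windows = len(profile) // n
    fluctuations = []
    for w in range(n_windows):
        segment = profile[w * n:(w + 1) * n]
        t = np.arange(n)
        # Local linear trend b_n(k) fitted inside the window.
        coeffs = np.polyfit(t, segment, deg=1)
        trend = np.polyval(coeffs, t)
        fluctuations.append(np.mean((segment - trend) ** 2))
    return np.sqrt(np.mean(fluctuations))

# Example: fluctuation of a simulated gene-expression profile at scale 16.
rng = np.random.default_rng(0)
print(dfa_fluctuation(rng.normal(size=1024), n=16))
```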
B.
Chi Square Probability Density Function
Siswantining et al. [30] proposed a slightly different approach to Chi square statistics, based on a goodness-of-fit test and a test of independence. A sample obtained from the data is referred to as the number of cases, and the data are segregated by the incidence of occurrence in each group. In this Chi square statistical method, if the hypothesis is accurate, the expected number of cases in each category forms the statement of the null hypothesis. The test is based on the ratio of the experimental data to the predicted values in each group. It is defined as:
$\chi^2 = \sum_i \frac{(E_i - P_i)^2}{P_i}$
where Ei refers to the experimental number of cases in category i and Pi refers to the predicted number of cases in category i. To compute the Chi square statistic, the difference between the experimental and predicted cases is calculated, squared, and divided by the predicted value, and these terms are summed over all categories of the distribution. Whether the null hypothesis holds is a major concern that depends on the data distribution. For the Chi square method, the null and alternative hypotheses are defined below:
H0: Ei = Pi
H1: Ei ≠ Pi
If Ei − Pi is small for each type, then the expected value and predicted values are very close to each other, and the null hypothesis is real. When the expected data are not associated with the predicted value of the null hypothesis, then a large difference appears between Ei and Pi. For real values of the null hypothesis, a small value will be found for the Chi square statistics, and if the value is false for the null hypothesis, a large value will be attained. The degree of freedom is dependent upon the variables of the categories utilized to calculate the Chi square. By means of hypothesis testing, the Chi square method reduces the dimensionality of the data.
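The hypothesis-testing idea above can be expressed as a simple per-gene scoring step: genes whose observed (experimental) class counts deviate strongly from the predicted counts receive a large χ² value and can be retained when reducing the dimensionality. The sketch below is an assumed illustration using binned counts, not the authors' exact procedure.

```python
import numpy as np

def chi2_score(observed, predicted):
    """Chi-square statistic between observed and predicted counts."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.sum((observed - predicted) ** 2 / predicted)

# Hypothetical example: counts of a gene's expression values per bin
# for one class (observed) against the counts expected under H0.
observed = np.array([12, 30, 8])
predicted = np.array([15, 25, 10])
print(f"chi-square statistic: {chi2_score(observed, predicted):.3f}")

# Genes with large chi-square scores (observed far from predicted)
# would be kept; genes with small scores would be discarded.
```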
C.
Firefly Algorithm as Dimensionality Reduction
Yang, Xin-She (2010) [31] and Yang (2013) [32] proposed a reliable metaheuristic model for real-life problem-solving tasks such as the scheduling of events, the classification of a system, dynamic problem optimization, and economic load dispatch problems. The Firefly algorithm works by using the idealized flashing-light behaviour of a firefly to attract other fireflies.
Three rules have been identified in the firefly algorithm:
  • An attraction is made to another firefly regardless of sex, because every firefly is considered unisex;
  • Attractiveness is proportional to brightness: a firefly that is slightly less bright is attracted to a brighter one. If no firefly is brighter, it moves randomly in the search space;
  • As the distance increases, the brightness or light intensity of a firefly decreases because the medium (air) absorbs light; thus, the brightness of a firefly k as seen by another firefly is given by:
$\beta_k(r) = \beta_{k0} e^{-\alpha r^2}$
where $\beta_{k0}$ represents the brightness of firefly k at r = 0, α is the light absorption coefficient of the medium, and the Euclidean distance r between fireflies $i$ and $k$ is
$r = \left\| x_i - x_k \right\| = \sqrt{\sum_{j=1}^{d} (x_{ij} - x_{kj})^2}$
where $x_i$ and $x_k$ are the positions of fireflies $i$ and $k$, respectively. If firefly $k$ is the brighter one, its degree of attractiveness directs the movement of firefly $i$, based on Yang and He, 2013 [32]:
$x_i = x_i + \beta_k(r)\,(x_k - x_i) + \gamma\,(rnd)$
where $\gamma$ is the randomization parameter and $rnd$ is a random number generator drawn from a uniform distribution in the range [−1, +1]. The middle term in this equation accounts for the movement of firefly $i$ towards firefly $k$. The last term moves the solution away from a local optimum value when such an incident occurs.
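The movement rule above can be summarised in a few lines; the sketch below performs one firefly update step. The absorption coefficient α, base attractiveness $\beta_{k0}$, and randomization parameter γ are chosen arbitrarily for illustration and are not the authors' tuned values.

```python
import numpy as np

def firefly_move(x_i, x_k, beta0=1.0, alpha=1.0, gamma=0.5, rng=None):
    """Move firefly i towards the brighter firefly k (one update step)."""
    if rng is None:
        rng = np.random.default_rng()
    r = np.linalg.norm(x_i - x_k)              # Euclidean distance
    beta = beta0 * np.exp(-alpha * r ** 2)     # attractiveness decays with r
    rnd = rng.uniform(-1.0, 1.0, size=x_i.shape)
    return x_i + beta * (x_k - x_i) + gamma * rnd

rng = np.random.default_rng(1)
x_i = rng.normal(size=5)   # dimmer firefly (current position)
x_k = rng.normal(size=5)   # brighter firefly
print(firefly_move(x_i, x_k, rng=rng))
```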
D.
Cuckoo Search Algorithm as Dimensionality Reduction
Yang, X. S, and Deb, S (2009) [33] proposed another metaheuristic model that gives finite solutions and is used for solving real-world problems such as event scheduling, dynamic problem optimization, classification, and economic load dispatch. The main inspiration for this algorithm is the interesting breeding behaviour of certain cuckoo species, in particular their obligate brood parasitism. This breeding behaviour is the main idealized aspect of the Cuckoo search algorithm, and the algorithm is applicable to many real-time optimization problems.
Each egg in a host nest represents a solution, and a cuckoo egg represents a new solution. The main aim is to replace less fit solutions with better ones (the cuckoos). Each egg corresponds to one solution, each nest can hold multiple eggs, and the set of nests represents the set of candidate solutions from which the best one is sought.
The three rules (Xin-She Yang & Suash Deb 2009) [33] for the Cuckoo search algorithm are:
  • Cuckoo’s lay an egg at a time, and are kept inside an arbitrarily selected shell;
  • To create a consecutive generation, the best host shell with a good quality egg is selected to transfer its own.
  • A fixed no. of host nests is accessible. Indeed, cuckoo’s place an egg in a nest with a probability of  P a ϵ 0,1 ; where  P a  is Cuckoo egg probability. To construct a new nest in an additional location, the host can demolish the cuckoo eggs, or remove the nest
Moreover, Yang and Deb detailed an appropriate searching technique based on Lévy flights, whose performance is better than that of a simple random-walk (RW) technique. They modified the conventional method to construct the proposed method using classification techniques.
The Lévy flight method describes the RW characteristics of a bird's position and is used to obtain the following position $P_i(t+1)$ based on the present position $P_i(t)$, as mentioned in the article by Gandomi et al. (2013) [34]
$P_i(t+1) = P_i(t) \oplus \beta \, \text{Lévy}(\lambda)$
where $\oplus$ and $\beta$ represent the entry-wise multiplication and the step size. Commonly, $\beta > 0$ is related to the scale of variation of the problem under consideration. For almost all classification problems, its value is fixed at 1. The above equation describes a random walk, a stochastic model in which the following position depends only on the present position and the transition probability, as denoted in a Markov chain.
$\text{Lévy} \sim \mu = t^{-\beta}$
In the classification problem, the value of β is tuned to 0.2, which corresponds to a distribution with infinite variance and infinite mean. The step lengths follow this heavy-tailed power-law distribution, so the cuckoo's consecutive steps principally follow a random walk with heavy tails. To speed up the classification process, the best solution is found using the Lévy walk method.
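A minimal sketch of one Lévy-flight position update used by Cuckoo search is shown below, drawing heavy-tailed step lengths with Mantegna's algorithm. The exponent and step size are illustrative assumptions rather than the authors' tuned values.

```python
import numpy as np
from math import gamma, sin, pi

def levy_step(dim, beta=1.5, rng=None):
    """Draw a heavy-tailed Levy step using Mantegna's algorithm."""
    if rng is None:
        rng = np.random.default_rng()
    sigma_u = (gamma(1 + beta) * sin(pi * beta / 2) /
               (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.normal(0.0, sigma_u, size=dim)
    v = rng.normal(0.0, 1.0, size=dim)
    return u / np.abs(v) ** (1 / beta)

def cuckoo_update(position, step_size=0.01, rng=None):
    """One Levy-flight move of a cuckoo from its present position."""
    if rng is None:
        rng = np.random.default_rng()
    return position + step_size * levy_step(position.shape[0], rng=rng)

rng = np.random.default_rng(2)
p_t = rng.normal(size=5)             # present position P_i(t)
print(cuckoo_update(p_t, rng=rng))   # following position P_i(t+1)
```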

Statistical Analysis

The dimensionally reduced microarray genes, obtained through four DR methods, are then analysed by statistical parameters like mean, variance, skewness, kurtosis, Pearson correlation coefficient (PCC), and CCA to identify whether the outcomes represent the underlying microarray genes properties in the reduced subspace. Table 2 shows the statistical features analysis for four types of dimensionally reduced diabetic and non-diabetic pancreas microarray genes.
As shown in Table 2, the DFA and Cuckoo search-based DR methods depict higher values of mean and variance among the classes. The Chi2pdf and Firefly algorithm display low and overlapping values of mean and variance among the classes. The negative skewness depicted only by the Chi2pdf DR method indicates the presence of skewed components embedded in the classes. The Firefly algorithm indicates unusually flat kurtosis and the Cuckoo search DR method indicates negative kurtosis. This, in turn, leads to the observance that the DR methods are not modifying the underlying microarray genes characteristics. The PCC values indicate a high correlation within the class of attained outputs. This subsequently exhibits that the statistical parameters are associated with non-gaussian and non-linear outputs. The same is further examined by the histogram, Normal probability plots, and Scatter plots of the DR techniques outputs. Canonical Correlation Analysis (CCA) visualizes the correlation of the DR methods outcomes among the diabetic and non-diabetic cases. The low CCA value in Table 2 indicates that the DR outcomes are less correlated among the two classes. Further, the reduced data attained from the four DR techniques are analyzed by means of histogram, Normal probability plots, and Scatter plots to visualize the presence of non-linearity and the non-Gaussian nature of the datasets.
Figure 2 shows the histogram of the Detrend Fluctuation Analysis (DFA) technique for the diabetic gene class. It is noted in Figure 2 that the histogram displays near quasi-Gaussian qualities and the presence of non-linearity in the DR method outputs. The legend in the figure represents patients 1 to 10 as x(:,1) to x(:,10).
Figure 3 displays the histogram of the Detrend Fluctuation Analysis (DFA) technique for the non-diabetic gene class. Patients 1–10 are represented as x(:,1) to x(:,10). It can be observed from Figure 3 that the histogram displays near-Gaussian qualities and the presence of non-linearity and gaps in the DR method outputs.
Figure 4 exhibits the Normal probability plot of the Chi Square DR technique features for the diabetic gene class. In the figure, data 1–5 represent references, data 6–10 represent upper bound values, and data 11–15 represent feature selection points. As indicated by Figure 4, the normal probability plot displays the total cluster of Chi Square DR outputs and the presence of non-linearly correlated variables among the classes.
Figure 5 depicts the Normal probability plot for the Chi Square PDF DR techniques’ features for the non-diabetic gene class. As shown in Figure 5, the Normal probability plot displays the total cluster of Chi Square DR outputs and the presence of non-linearly correlated variables among the classes. This is due to low variance and negatively skewed variables for the DR method outcomes. In the figure, data 1–5 are references, data 6–10 are upper bound values, and data 11–15 are feature selection variable points.
Figure 6 indicates the normal Probability plot for Firefly algorithm DR techniques’ features for the diabetic gene class. In the figure, data 1–5 are references, data 6–10 are upper bound values, and data 11–15 are feature selection Firefly algorithm diabetic gene points. As shown in Figure 6, the normal probability plots display discrete clusters for the Firefly DR outputs. This indicates the presence of non-Gaussian and non-linear variables within the classes. This is due to low variance and flat kurtosis variables of the DR method outcomes.
Figure 7 demonstrates the Normal probability plot for the Firefly algorithm DR Techniques’ features for the non-diabetic gene class. Data 1–5 are references in the figure, data 6–10 are upper bound data, and data 11–15 are Firefly DR technique feature selection points. As shown by Figure 7, the normal probability plots display the discrete clusters for Firefly DR outputs. This indicates the presence of non-Gaussian and non-linear variables within the classes. This is due to a low variance and flat kurtosis variables of the DR method outcomes.
Figure 8 shows the Scatter plot for the Cuckoo search algorithm DR techniques’ features for the diabetic and non-diabetic gene classes. As depicted by the Figure 8, the Scatter plots from the Cuckoo search display the total scattering of the variables of both classes across the entire subspace. The Scatter plot also indicates the presence of non-Gaussian, non-linear, and higher values for all of the statistical parameters.
From the above graphs it can be observed that the DR methods are insufficient to classify the data sets into appropriate classes. Therefore, feature selection methods like PSO and Harmonic Search are used to enhance the classifier performance.

5. Feature Selection

In the field of optimization, finding the optimal solution for complex problems is a significant challenge. Traditional optimization algorithms often struggle to handle high-dimensional search spaces or non-linear relationships between variables. To address these challenges, two popular meta-heuristic algorithms, Particle Swarm Optimization (PSO) and Harmonic Search (HS), are incorporated as feature selection methods in this paper.

5.1. Particle Swarm Optimization (PSO)

Particle Swarm Optimization (PSO), Rajaguru H et al. [35], is one of the best and simplest to understand among all search algorithms. It uses some basic parameters for its initial search and a population called particles. In an h-dimensional space, any of the particles may give the best possible solution for processing and analysis. Every particle's position and velocity need to be traced for the optimized values to be achieved.
Position traced by: $P_j^k = (P_{j1}^k, P_{j2}^k, \ldots, P_{jh}^k)$;
Velocity traced by: $Ve_j^k = (Ve_{j1}^k, Ve_{j2}^k, \ldots, Ve_{jh}^k)$;
The updated velocity of each particle is given by:
$Ve_j^{k+1} = w_j Ve_j^k + c_1 r_1 \left( pbest_j - P_j^k \right) + c_2 r_2 \left( gbest_j - P_j^k \right)$
where $r_1$ and $r_2$ are random variables in the range from 0 to 1, and $c_1$ and $c_2$ are the acceleration coefficients that control the movement (motion) of the particles.
The updated position of each particle is defined as:
$P_j^{k+1} = P_j^k + Ve_j^{k+1}$
If a particle attains the best position, then it progresses to the next particle. The representation of the best position is expressed as p-best and the representation of the best position for all of the particles is expressed as g-best.
The weight function is expressed as:
$w_j = w_{max} - \frac{w_{max} - w_{min}}{k_{max}} \times k$
Steps for implementation:
  • Step 1: Initialization of the process;
  • Step 2: For each particle, the dimension of a space is denoted as h;
  • Step 3: Initialization of the particle position as  p j  and velocity as  V e j ;
  • Step 4: Evaluate the fitness function;
  • Step 5: Initialize the  p b e s t j  with a copy of  p j ;
  • Step 6: Initialize the  g b e s t j  with a copy of  p j  with the best fitness function;
  • Step 7: Repeat the steps until the stopping criteria are satisfied.
$pbest_j = P_j^{k+1} \quad \text{if } f(pbest_j) < f(P_j^{k+1})$
$gbest_j = P_j^{k+1} \quad \text{if } f(gbest_j) < f(P_j^{k+1})$
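The update equations above translate into the following minimal PSO sketch. The inertia schedule, acceleration coefficients, and the toy fitness function (sphere) are assumptions for illustration only, not the settings used in this study.

```python
import numpy as np

def pso(fitness, dim=10, n_particles=20, k_max=100,
        c1=2.0, c2=2.0, w_max=0.9, w_min=0.4, seed=0):
    """Minimal particle swarm optimization (minimization)."""
    rng = np.random.default_rng(seed)
    pos = rng.uniform(-1.0, 1.0, size=(n_particles, dim))
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_fit = np.array([fitness(p) for p in pos])
    gbest = pbest[np.argmin(pbest_fit)].copy()

    for k in range(k_max):
        w = w_max - (w_max - w_min) / k_max * k        # inertia weight w_j
        r1, r2 = rng.random((2, n_particles, dim))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel                                # position update
        fit = np.array([fitness(p) for p in pos])
        improved = fit < pbest_fit                     # update personal bests
        pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
        gbest = pbest[np.argmin(pbest_fit)].copy()     # update global best
    return gbest

# Toy usage: minimize the sphere function.
print(pso(lambda x: np.sum(x ** 2), dim=5)[:3])
```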

5.2. Harmonic Search (HS)

Harmony Search (HS) is a meta-heuristic algorithm that draws inspiration from the evolution of music and the quest for achieving perfect harmony. Bharanidharan, N et al. [36] introduced HS as an algorithm that emulates the improvisational techniques employed by musicians. The HS algorithm involves a series of steps to be implemented.
  • Step 1: Initialization
The optimization problem is generally formulated as minimizing or maximizing the objective function f(x), subject to yi ∈ Y, where i = 1, 2, …, N. In this formulation, y represents the set of decision variables, N denotes the number of decision variables, and Y represents the set of all possible values for each decision variable (i.e., yiLo ≤ yi ≤ yiUp, where yiLo and yiUp are the lower and upper bounds for each decision variable). Along with defining the problem, the subsequent step involves initializing the following parameters for the Harmonic Search (HS) algorithm.
  • Step 2: Memory Initialization
The Harmony Memory (HM) is a matrix that stores all of the decision variables. In the context of the general optimization problem, the initial HM is created by generating random values from a uniform distribution, bounded by yiLo and yiUp, for each decision variable.
  • Step 3: New Harmony Development
During the process of solution improvisation, a new harmony is created by adhering to the following constraints:
  • Memory consideration;
  • Pitch adjustment;
  • Random selection.
  • Step 4: Harmony memory updation
The fitness function for both the old and new harmony vectors is calculated. If the fitness function of the new harmony vector is lower than that of the old harmony vector, the old harmony vector is replaced with the new one. Otherwise, the old harmony vector is retained.
  • Step 5: Stopping criteria
Steps 3 and 4 are repeated until the maximum number of iterations is reached.
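The steps above can be condensed into the following minimal Harmony Search sketch. The harmony memory size, memory-consideration and pitch-adjustment rates, bandwidth, and the toy objective are illustrative assumptions, not the parameters used by the authors.

```python
import numpy as np

def harmony_search(objective, lower, upper, hm_size=10, hmcr=0.9,
                   par=0.3, bandwidth=0.05, iterations=500, seed=0):
    """Minimal Harmony Search for minimizing `objective` on a box."""
    rng = np.random.default_rng(seed)
    dim = len(lower)
    # Steps 1-2: initialize harmony memory with uniform random harmonies.
    hm = rng.uniform(lower, upper, size=(hm_size, dim))
    fitness = np.array([objective(h) for h in hm])

    for _ in range(iterations):
        new = np.empty(dim)
        for i in range(dim):
            if rng.random() < hmcr:                     # memory consideration
                new[i] = hm[rng.integers(hm_size), i]
                if rng.random() < par:                  # pitch adjustment
                    new[i] += bandwidth * rng.uniform(-1, 1)
            else:                                       # random selection
                new[i] = rng.uniform(lower[i], upper[i])
        new = np.clip(new, lower, upper)
        new_fit = objective(new)
        worst = np.argmax(fitness)                      # Step 4: memory update
        if new_fit < fitness[worst]:
            hm[worst], fitness[worst] = new, new_fit
    return hm[np.argmin(fitness)]

# Toy usage: minimize the sphere function on [-5, 5]^3.
best = harmony_search(lambda x: np.sum(x ** 2), np.full(3, -5.0), np.full(3, 5.0))
print(best)
```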
The effectiveness of the feature selection methods' outputs is analysed through the significance of the p-value from the t-test. Table 3 shows the p-value significance of the PSO and Harmonic Search feature selection outputs after they are applied to the four DR techniques.
t-test: The t-test is a statistical test used to determine if there is a significant difference between the means of two groups. It is commonly used in hypothesis testing when comparing the means of two independent samples to assess whether the difference observed in the samples is likely to reflect a true difference in the population.
The t-test and significance of the p-values is used to test for the null hypothesis of the feature selection method, which is explicated below.
(i)
Formulate the null hypothesis;
H0. 
The clustering/feature selection procedure is random or inconsistent; select the significance level α.
H1. 
The clustering/feature selection procedure is non-random or consistent, within the control level of α.
(ii)
Compute the p-value t-test;
(iii)
Check the significance of the p-value: if p < 0.01, the null hypothesis is rejected; otherwise, it is accepted.
In the context of this study, with two groups such as diabetic and non-diabetic individuals, a t-test can be used to compare the mean values of certain variables (e.g., gene expression levels) between the groups.
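As a concrete illustration of the test described above, the sketch below compares one selected feature between the two groups using an independent two-sample t-test from SciPy and checks the p-value against the 0.01 level. The data are simulated and purely illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated expression values of one selected feature for each group.
diabetic = rng.normal(loc=0.62, scale=0.10, size=20)
non_diabetic = rng.normal(loc=0.48, scale=0.10, size=50)

t_stat, p_value = stats.ttest_ind(diabetic, non_diabetic, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")

alpha = 0.01
if p_value < alpha:
    print("Significant difference between group means at the 1% level.")
else:
    print("No significant difference detected at the 1% level.")
```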
As tabulated in Table 3, the PSO feature selection method does not show any significant p-values among the classes for all four DR methods. In the case of Harmonic Search Feature selection, a certain p-value significance is shown for the DFA and Firefly DR techniques for the diabetic class. At the same time, all of the other DR methods exhibit non-significant p-values. These p-values will be measured to quantify the presence of outliers, and non-linear and non-Gaussian variables among the classes after applying the feature selection methods. The classification methods are explained in the following section.

6. Classification Techniques

There are seven classification models used after Dimensionality Reduction: (1) Non-Linear Regression, (2) Linear Regression, (3) Logistic Regression, (4) Gaussian Mixture Model, (5) Bayesian Linear Discriminant Classifier, (6) Softmax Discriminant Classifier, and (7) Support Vector Machine with Radial Basis Function.
  • Non-Linear regression
The behaviour of the system is denoted as a mathematical expression for easy representation and analysis, so as to obtain the most accurate best-fit line between the classifier values. For a linear system with variables (a, b), the equation takes the linear form $y = ax + b$; in the non-linear case, the variables a and b are nonlinear and random, respectively. One of the primary objectives of non-linear regression is to minimize the sum of squares. The non-linear model requires more attention than the linear model because of its complex nature, and researchers have developed many methods to reduce its complexity, such as the Levenberg–Marquardt and Gauss–Newton methods. To reduce the residual sum of squares, the equation must be solved for the non-linear parameters. The Taylor series method, the steepest descent method, and Levenberg–Marquardt's method, Zhang et al. [37], can be used for non-linear equations in an iterative manner.
The authors assume a model:
$z_i = f(x_i, \theta) + \varepsilon_i, \quad \text{where } i = 1, 2, 3, \ldots, n$
Here, $x_i$ and $z_i$ are the independent and dependent variables of the ith observation, $\theta = (\theta_1, \theta_2, \ldots, \theta_m)$ are the parameters, and $\varepsilon_i$ is the error term that follows $N(0, \sigma^2)$.
The residual sum of the squares is given by:
$S_u(\theta) = \sum_{i=1}^{n} \left[ z_i - f(x_i, \theta) \right]^2$
Let $\theta^k = (\theta_1^k, \theta_2^k, \ldots, \theta_p^k)$ be the starting values; the successive estimates are obtained using:
$(H + \tau I)(\theta^0 - \theta^1) = g$
where $g = \left. \frac{\partial S_u(\theta)}{\partial \theta} \right|_{\theta = \theta^0}$ and $H = \left. \frac{\partial^2 S_u(\theta)}{\partial \theta \, \partial \theta'} \right|_{\theta = \theta^1}$, $\tau$ is a multiplier, and $I$ is the identity matrix.
The initial parameter values can be chosen from previous experiments or from theoretical considerations of similar systems. The goodness of fit of the model is approximated using the Mean Square Error (MSE) statistic, described by:
$\text{Mean Square Error (MSE)} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$
The overall number of experimental values in the model is represented by N, and the classification of the normal and diabetic patient samples in the dataset is determined by running the run test and the normality test.
The steps of the non-linear regression algorithm are listed below; the main objective is to minimize the MSE so that the fitted function best matches the data points.
  • Parameter initialization;
  • Curves value produced by the initial values;
  • To minimize the MSE value, calculate the parameters iteratively and adjust them so that the curve moves closer to the data points;
  • If the MSE value has not changed when compared to the previous value, the process must stop.
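A minimal non-linear least-squares sketch corresponding to the steps above is shown below, using SciPy's curve fitting (Levenberg–Marquardt by default for unconstrained problems) and reporting the MSE. The exponential model form and the simulated data are assumptions made only for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b):
    """Assumed non-linear model f(x, theta) = a * exp(b * x)."""
    return a * np.exp(b * x)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0, 50)
z = model(x, 2.0, 1.3) + rng.normal(scale=0.2, size=x.size)  # noisy data

# Iteratively estimate the parameters starting from an initial guess.
theta, _ = curve_fit(model, x, z, p0=[1.0, 1.0])
predictions = model(x, *theta)
mse = np.mean((z - predictions) ** 2)  # residual goodness-of-fit measure
print(f"estimated parameters: {theta}, MSE: {mse:.4f}")
```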
2.
Linear regression
To analyze gene expression data, linear regression is well suited for obtaining the best-fit curve, as the expression level varies only to a small extent at the gene level. By comparing the training data set with the gene expression of each data class to obtain the most informative genes used in the feature selection process above, various diversified levels of data are achieved. In this linear regression model, y is the dependent variable and x the independent variable [38]. The model is established to forecast the values of y using the x variable, with the regression fitness maximized over the population of y values. The hypothesis function of the single variable is given as:
$g_\theta(x) = \theta_0 + \theta_1 x$
where the $\theta_i$ are the parameters. To select $\theta_0$ and $\theta_1$ in such a manner that $g_\theta(x)$ is close to y for the training data set (x, y), the cost function is given by:
$R(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( g_\theta(x_i) - y_i \right)^2$
The total samples are represented by m in the training dataset. The linear regression model with n variables is given by:
$g_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_n x_n$
and the cost function is given by:
$R(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( g_\theta(x_i) - y_i \right)^2$
where $\theta$ is the set $\{\theta_0, \theta_1, \theta_2, \ldots, \theta_n\}$.
The algorithm for the linear regression is
  • The feature selection parameters based on the DFA, Chi2Pdf, Firefly, and Cuckoo search algorithms are input to the classifiers;
  • Fit a line $g_\theta(x) = \theta_0 + \theta_1 x$ that models the data linearly;
  • Define the cost function as the total squared error between the predicted and observed data;
  • Find the solutions by setting the derivatives with respect to $\theta_0$ and $\theta_1$ equal to zero;
  • Repeat steps 2, 3, and 4 to obtain the coefficients that give the minimum squared error.
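The algorithm above translates into a short gradient-descent sketch for the single-variable case; the learning rate, iteration count, and simulated data are illustrative assumptions.

```python
import numpy as np

def linear_regression_gd(x, y, lr=0.1, iterations=1000):
    """Fit g_theta(x) = theta0 + theta1 * x by minimizing the cost R(theta)."""
    theta0, theta1 = 0.0, 0.0
    m = len(x)
    for _ in range(iterations):
        pred = theta0 + theta1 * x
        error = pred - y
        # Gradients of R(theta) = (1/2m) * sum((g(x_i) - y_i)^2).
        grad0 = error.sum() / m
        grad1 = (error * x).sum() / m
        theta0 -= lr * grad0
        theta1 -= lr * grad1
    return theta0, theta1

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = 0.5 + 2.0 * x + rng.normal(scale=0.05, size=100)  # simulated data
print(linear_regression_gd(x, y))  # approximately (0.5, 2.0)
```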
3.
Logistic Regression
The Logit function has been utilized effectively for classification problems such as diabetes, cancer, and epilepsy. The authors consider the function y as an array of disease status, with values from 0 to 1 representing normal to diabetic patients. Let us assume the gene expression vector $x = (x_1, x_2, \ldots, x_m)$, where $x_j$ is the jth gene expression level. A model-based approach of $\Pi(x)$, the likelihood of y = 1 given x, is used, which can be useful for a new type of gene selection for diabetic patients. To identify the maximum likelihood in the Dimensionality Reduction techniques and find the “q” informative genes for the Logistic Regression, let $x_j^*$ be the representation of the gene expression, where j = 1, 2, 3, …, q; the binary disease status is given by the array $y_i$, where i = 1, 2, …, n, and the gene expression vectors are defined as $x_i = (x_{i1}, \ldots, x_{ip})$. The Logistic Regression model is denoted by:
$\text{Logit}\,\Pi(x) = \upsilon_0 + \sum_{j=1}^{q} \upsilon_j x_j$
The fitness function and the log-likelihood should be the maximum when using the following function:
$l(\upsilon_0, \upsilon) = \sum_{i=1}^{n} \left[ y_i \log \pi_i + (1 - y_i) \log (1 - \pi_i) \right] - \frac{1}{2\tau^2} \left\| \upsilon \right\|^2$
where $\tau$ is the parameter that shrinks $\upsilon$ towards 0, and $\pi_i = \pi(x_i)$, as specified by the model in the article by Hamid et al. [39,40].
$\|\upsilon\|^2$ is the squared Euclidean length of $\upsilon = (\upsilon_1, \upsilon_2, \ldots, \upsilon_p)$. The selection of $q$ and $\tau$ is based on the parametric bootstrap and constrains the accurate calculation of the prediction error. First, the value of $\upsilon$ is set to zero for computing the cost function; it is then varied to minimize the cost function. The sigmoid function maps values to the range from 0 to 1 for the purpose of attenuation. The threshold cut-off value between the diabetic and the normal patients is fixed at 0.5. Therefore, any probability under 0.5 is taken as a normal patient and any probability above the threshold value is considered as a diabetic patient.
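A compact sketch of the thresholding described above is given below using scikit-learn's logistic regression on simulated feature-selected data. The 0.5 cut-off matches the text, while the data shape and regularization strength are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Simulated feature-selected gene data: 70 patients x 10 informative genes.
X = rng.normal(size=(70, 10))
y = np.r_[np.ones(20), np.zeros(50)]        # 20 diabetic (1), 50 non-diabetic (0)
X[y == 1] += 0.8                            # shift diabetic class for separability

clf = LogisticRegression(C=1.0, max_iter=1000).fit(X, y)
probabilities = clf.predict_proba(X)[:, 1]  # sigmoid outputs in [0, 1]

# Threshold at 0.5: below -> normal patient, above -> diabetic patient.
predicted_diabetic = probabilities > 0.5
print(f"predicted diabetic patients: {predicted_diabetic.sum()}")
```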
In the three methods below, the authors similarly use threshold values for the separation of the dataset.
4.
Gaussian Mixture Model (GMM)
The Gaussian Mixture Model is one of the popular unsupervised learning models in machine learning. It is used for pattern recognition and signal classification and depends on grouping related objects. By using clustering techniques, similar data are grouped in a way that makes it easy to predict and compute unrated items in the same category. Clustering techniques can be hard or soft, and GMM [41] comes under the category of soft clustering techniques. The GMM represents the data as a mixture of Gaussian distributions for further analysis; every GMM includes $g$ Gaussian distributions. In the probability density function of a GMM, the distributed components are added in a linear form to model the generated data. For a random vector $a$ in an n-dimensional sample space $\chi$, if ‘$a$’ obeys the Gaussian distribution, the probability density function is expressed as:
$p(a) = \frac{1}{(2\pi)^{n/2} \left| \Sigma \right|^{1/2}} \exp\left( -\frac{1}{2} (a - \mu)^T \Sigma^{-1} (a - \mu) \right)$
where $\mu$ is the mean vector of the n-dimensional space and $\Sigma$ is the $n \times n$ covariance matrix. The Gaussian distribution is fully determined by the covariance matrix $\Sigma$ and the mean vector $\mu$. Many components are mixed in the Gaussian mixture distribution, each with its own parameters in the distribution curve. The mixture distribution equation is expressed as:
$P_Q(a) = \sum_{j=1}^{k} \alpha_j \, p(a \mid \mu_j, \Sigma_j)$
The $j$th Gaussian component of the mixture has parameters $\mu_j$ and $\Sigma_j$, with the corresponding mixing coefficient represented as $\alpha_j$.
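The mixture model above corresponds closely to scikit-learn's GaussianMixture. The sketch below fits a two-component mixture to simulated reduced features and reads back the mixing coefficients, means, and covariances; the number of components and the data are assumptions for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Simulated dimensionally reduced features from two latent groups.
group_a = rng.normal(loc=0.0, scale=1.0, size=(50, 3))
group_b = rng.normal(loc=3.0, scale=0.5, size=(20, 3))
X = np.vstack([group_a, group_b])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
labels = gmm.fit_predict(X)                    # soft clustering, hard labels here

print("mixing coefficients:", gmm.weights_)    # the alpha_j terms
print("component means:", gmm.means_)          # the mu_j vectors
print("first covariance matrix:\n", gmm.covariances_[0])  # Sigma_j
```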
5.
Bayesian Linear Discriminant Classifier (BLDC)
The main usage of this type of classifier is to regularize high-dimensional signals, reduce noisy signals, and limit the computational burden. An assumption to be made before proceeding to the Bayesian linear discriminant analysis, Zhou et al. [42], is that the target is related to the vectors b and c through white Gaussian noise; it is therefore expressed as $a = x^T b + c$. The weight function is denoted by x, and its likelihood function is expressed as $p(G \mid \beta, x) = \left( \frac{\beta}{2\pi} \right)^{C/2} \exp\left( -\frac{\beta}{2} \left\| B^T x - m \right\|^2 \right)$, where the pair $(B, m)$ is denoted as G. The matrix B contains the training vectors, $a$ denotes the filtered signal, $\beta$ denotes the inverse variance of the noise, and the sample size is denoted by C. The prior distribution of $x$ is expressed as:
$p(x \mid \alpha) = \left( \frac{\alpha}{2\pi} \right)^{1/2} \left( \frac{\varepsilon}{2\pi} \right)^{1/2} \exp\left( -\frac{1}{2} x^T H(\alpha)\, x \right)$
Here, the regularization matrix is represented as
$H(\alpha) = \begin{bmatrix} \alpha & 0 \\ 0 & \varepsilon \end{bmatrix}_{(l+1) \times (l+1)}$
where the hyperparameter $\alpha$ is produced from forecasting the data, and the number of vectors is assigned as l. The weight $x$ follows a Gaussian distribution with zero mean, and $\varepsilon$ contains a small value. According to the Bayes rule, the posterior distribution of $x$ can be easily computed as:
$p(x \mid \beta, \alpha, G) = \frac{p(G \mid \beta, x)\, p(x \mid \alpha)}{\int p(G \mid \beta, y)\, p(y \mid \alpha)\, dy}$
For the posterior distribution, the mean vector $\upsilon$ and the covariance matrix $X$ should satisfy Equations (30) and (31). The posterior distribution is Gaussian.
$\upsilon = \beta \left( \beta B B^T + H(\alpha) \right)^{-1} B a$
$X = \left( \beta B B^T + H(\alpha) \right)^{-1}$
For an input prediction vector $\hat{b}$, the predictive probability distribution of the regression is given by $p(\hat{a} \mid \beta, \alpha, \hat{b}, G) = \int p(\hat{a} \mid \beta, \hat{b}, x)\, p(x \mid \beta, \alpha, G)\, dx$.
This predictive distribution is again Gaussian, with mean $\mu = \upsilon^T \hat{b}$ and variance $\delta^2 = \frac{1}{\beta} + \hat{b}^T X \hat{b}$.
6.
Softmax Discriminant Classifier (SDC)
SDC [43] is included in this analysis for determination and identification of the group from which the specific test sample is taken. In this case, the weighing of its distance between the training samples to the test samples in a specific class or group of data is undertaken. The training set is denoted as:
$Z = [Z_1, Z_2, \ldots, Z_q] \in \mathbb{R}^{c \times d}$
which comes from q distinct classes. $Z_q = [Z_1^q, Z_2^q, \ldots, Z_{d_q}^q] \in \mathbb{R}^{c \times d_q}$ indicates the $d_q$ samples from the $q$th class, where $\sum_{i=1}^{q} d_i = d$. Assume $K \in \mathbb{R}^{c \times 1}$ is the test sample, which is given to the classifiers. If a negligible reconstruction error can be obtained from the test sample, then the class of $q$ is used. The class samples of $q$ and the test samples are transformed into non-linear enhanced values, by which the ideology of SDC is satisfied in the following equations:
$h(K) = \arg\max_i\, Z_{w_i}$
$h(K) = \arg\max_i \log \sum_{j=1}^{d_i} \exp\left( -\lambda \left\| v - \upsilon_j^i \right\|^2 \right)$
Here, $h(K)$ represents the distance between the $i$th class and the test sample, and $\lambda > 0$ sets the penalty cost. Hence, if $K$ is identified as belonging to the $i$th class, then $v$ and $\upsilon_j^i$ have the same characteristic function, so $\left\| v - \upsilon_j^i \right\|^2$ approaches zero and, hence, the maximization of $Z_{w_i}$ is achieved asymptotically, where its maximum possibility is attained.
7.
Support Vector Machine– Radial Basis Function (SVM–RBF)
The SVM classifier is one of the important machine learning techniques for classification problems, especially the non-linear regression based multilayer perceptron with Radial Basis Function (RBF) by Yao, X. J., et al. [44].
The training time and computational complexity of an SVM depend on the classifier and on the number of support vectors required. Increasing the number of support vectors increases the computational requirements, since more floating-point multiplications and additions have to be calculated.
  • Steps for the SVM
Step 1—With the help of quadratic optimization, linearization and convergence can be used. The dual optimization problem, transformed from the primal minimization problem, is referred to as maximizing the dual Lagrangian $L_D$ with respect to $\alpha_i$:
$\max L_D = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j (X_i \cdot X_j)$
subject to $\sum_{i=1}^{l} \alpha_i y_i = 0$, where $\alpha_i \geq 0$, $i = 1, 2, 3, \ldots, l$
Step 2—By solving the above quadratic programming problem for the optimal separating hyperplane, the points with a non-zero Lagrangian multiplier ($\alpha_i > 0$) become the support vectors.
Step 3—In the trained data, the optimal hyperplane is fixed by the support vectors, and it is very close to the decision boundary.
Step 4—K-means clustering is applied to the data set. It groups the data into clusters according to the conditions of Steps 2 and 3. Center points are randomly chosen from the given dataset, with three points per cluster, and each center point acquires the data points present around it.
Step 5—With six center points, one from each corner, the SVM is trained on the data using kernel methods.
Radial Basis Function: $k(x_i, x_j) = \exp\left( -\frac{\left\| x_i - x_j \right\|^2}{(2\sigma)^2} \right)$
The hyperplane and support vectors are used to separate linearly separable and nonlinearly separable data.
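The following sketch shows an RBF-kernel SVM trained on simulated feature-selected data with scikit-learn; the kernel width (gamma plays the role of the kernel-width term in the expression above), the penalty parameter, and the data are illustrative assumptions rather than the authors' settings.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Simulated feature-selected data: 70 patients x 10 features.
X = rng.normal(size=(70, 10))
y = np.r_[np.ones(20), np.zeros(50)]    # 20 diabetic, 50 non-diabetic
X[y == 1] += 0.8                        # make the classes separable

# RBF kernel: gamma controls the kernel width term in the expression above.
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print("number of support vectors per class:", clf.n_support_)
print("training accuracy:", clf.score(X, y))
```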

7. Training and Testing of Classifiers

The training data for the dataset are limited. Therefore, we performed k-fold cross-validation. k-fold cross-validation is a popular method for estimating the performance of a machine learning model. The process performed by Fushiki et al. [45] for k-fold cross-validation is as follows. The first step is to divide the dataset into k equally sized subsets (or “folds”). For each fold i, the model is trained on all the data except the ith fold and the model is tested on the ith fold. The process is repeated for all k folds so that each is used once for testing. At the end of the process, you will have k performance estimates (one for each fold). Now, the average of the k performance estimates is calculated to get an overall estimate of the model’s performance. Once the model has been trained and validated using k-fold cross-validation, you can retrain it on the full dataset and predict new, unseen data. The advantage of k-fold cross-validation is that it provides a more reliable estimate of a model’s performance than a simple train–test split, as it uses all the available data. In this paper, the k-value is chosen as 10-fold. This research used a value of 2870 dimensionally reduced features per patient. This research is associated with 20 diabetic and 50 non-diabetic patients with multi-trail training of the required classifiers. The use of cross-validation removes any dependence on the choice of pattern for the test set. The training process is controlled by monitoring the Mean Square Error (MSE), which is defined as:
$MSE = \frac{1}{N} \sum_{j=1}^{N} \left( O_j - T_j \right)^2$
where $O_j$ is the observed value at sample j and $T_j$ is the target value for sample j;
j = 1, 2, …, N, and N is the total number of observations per epoch; in our case it is 2870. As the training progressed, the MSE value reached 1.0 × 10⁻¹² within 2000 iterations.
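The 10-fold procedure described above can be sketched as follows with scikit-learn; the classifier, the simulated data shape, and the accuracy scoring are assumptions used only to illustrate the validation loop, not the exact setup of this study.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Simulated dimensionally reduced data: 70 patients x 2870 features.
X = rng.normal(size=(70, 2870))
y = np.r_[np.ones(20), np.zeros(50)]    # 20 diabetic, 50 non-diabetic
X[y == 1, :50] += 0.5                   # weak class signal for illustration

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(SVC(kernel="rbf"), X, y, cv=cv, scoring="accuracy")

# One estimate per fold; their mean is the overall performance estimate.
print("fold accuracies:", np.round(scores, 3))
print("mean 10-fold accuracy:", scores.mean())
```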
Table 4 depicts the confusion matrix for diabetic and non-diabetic patient detection.
In the case of diabetes detection, the following terms are defined:
True Positive (TP): a patient is correctly identified as belonging to the diabetic class;
True Negative (TN): a patient is correctly identified as belonging to the non-diabetic class;
False Positive (FP): a patient is incorrectly identified as diabetic when they belong to the non-diabetic class;
False Negative (FN): a patient is incorrectly identified as non-diabetic when they belong to the diabetic class.
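As a minimal illustration, these counts can be obtained directly from the true and predicted labels; the label vectors below are placeholders, not study data.

```python
# Minimal sketch: deriving TP, TN, FP and FN for the diabetic (positive) class
# from true and predicted labels (illustrative values only).
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1, 0, 0])   # 1 = diabetic, 0 = non-diabetic
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 1, 0, 0])

TP = int(np.sum((y_true == 1) & (y_pred == 1)))
TN = int(np.sum((y_true == 0) & (y_pred == 0)))
FP = int(np.sum((y_true == 0) & (y_pred == 1)))
FN = int(np.sum((y_true == 1) & (y_pred == 0)))
print(f"TP={TP}, TN={TN}, FP={FP}, FN={FN}")
```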
Table 5 presents the training and testing MSE performance of the classifiers without a feature selection method for the four Dimensionality Reduction techniques. The training MSE varied between 10−4 and 10−8, while the testing MSE varied from 10−4 to 10−6. The SVM (RBF) classifier without a feature selection method settled at minimum training and testing MSEs of 1.26 × 10−8 and 5.141 × 10−6, respectively. A low testing MSE is one of the indicators of better classifier performance; as shown in Table 5, a higher testing MSE leads to a poorer classifier performance irrespective of the Dimensionality Reduction technique.
Table 6 displays the training and testing MSE performance of the classifiers with the PSO feature selection method for the four Dimensionality Reduction techniques. The training MSE varied between 10−5 and 10−8, while the testing MSE varied from 10−4 to 10−6. The SVM (RBF) classifier with the PSO feature selection method settled at minimum training and testing MSEs of 1.94 × 10−9 and 1.885 × 10−6, respectively. All of the classifiers slightly improved their testing MSE compared to the case without feature selection; this is reflected in the enhanced classification accuracy, irrespective of the type of Dimensionality Reduction technique.
Table 7 depicts the training and testing MSE performance of the classifiers with the Harmonic Search feature selection method for the four Dimensionality Reduction techniques. The training MSE varied between 10−5 and 10−8, while the testing MSE varied from 10−4 to 10−6. The SVM (RBF) classifier with the Harmonic Search feature selection method settled at minimum training and testing MSEs of 1.86 × 10−8 and 1.7 × 10−6, respectively. All of the classifiers improved their testing MSE compared to the case without feature selection; this is reflected in the improved accuracy, MCC, and Kappa values of the classifiers, irrespective of the type of Dimensionality Reduction technique.

Selection of Target

The target value for the non-diabetic case  ( T N D )  is taken at the lower side of the zero to one (0 → 1) scale and this mapping is made according to the constraint of:
$\dfrac{1}{N} \sum_{i=1}^{N} \mu_i \ge T_{ND}$
where  μ i  is the mean value of input feature vectors for the N number of non-diabetic features taken for classification. Similarly, the target value for the diabetic cases ( T D i a ) is taken at the upper side of the zero to one (0 → 1) scale and this mapping is made based on:
$\dfrac{1}{M} \sum_{j=1}^{M} \mu_j \le T_{Dia}$
where  μ j  is the average value of input feature vectors for the M number of diabetic cases taken for classification. Note that the target value  T D i a  would be greater than the average values of  μ i  and  μ j . The difference between the selected target values must be greater than or equal to 0.5, which is given by:
$\| T_{Dia} - T_{ND} \| \ge 0.5$
Based on the above constraints, the targets  T N D  and  T D i a  for the non-diabetic and diabetic patient output classes are chosen at 0.1 and 0.85, respectively. After selecting the target values, the Mean Squared Error (MSE) is used for evaluating the performance of machine learning classifiers.
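A minimal sketch of this target-selection check is given below. The class-wise mean feature values are placeholders, and the direction of the inequality for the non-diabetic constraint is assumed from the description above.

```python
# Minimal sketch: checking the chosen targets T_ND = 0.1 and T_Dia = 0.85 against
# the class-wise mean feature values (placeholder means, assumed inequality directions).
import numpy as np

T_ND, T_DIA = 0.10, 0.85
mu_non_diabetic = np.random.default_rng(0).uniform(0.2, 0.5, size=50)  # mu_i, N = 50
mu_diabetic = np.random.default_rng(1).uniform(0.3, 0.6, size=20)      # mu_j, M = 20

assert abs(T_DIA - T_ND) >= 0.5                 # targets separated by at least 0.5
assert mu_non_diabetic.mean() >= T_ND           # non-diabetic target below the class mean (assumed direction)
assert T_DIA > mu_non_diabetic.mean()           # T_Dia above both class means
assert T_DIA > mu_diabetic.mean()
print("targets satisfy the constraints:", T_ND, T_DIA)
```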
Table 8 presents the selection of optimum parameter values for the classifiers.

8. Results and Discussion

The research uses standard 10-fold training and testing, in which 10% of the input samples are employed for testing and 90% for training. The choice of performance measures is significant in evaluating classifier performance. The confusion matrix is used to evaluate the performance of classifiers, especially in binary classification (i.e., classification into two classes, such as diabetic or non-diabetic from the pancreas microarray genes). It can be used to calculate performance metrics such as accuracy, F1 score, MCC, error rate, Jaccard metric, and Kappa, which are commonly used to evaluate the model’s overall performance. Table 9 presents the parameters associated with the classifiers for the performance analysis.

8.1. Performance Metrics

1. Accuracy
The accuracy of a classifier is a measure of how well it correctly identifies the class labels of a dataset. It is calculated by dividing the number of correctly classified instances by the total number of instances in the dataset. The equation for accuracy is given by Fawcett et al. [46]:
$\mathrm{Acc} = \dfrac{TP + TN}{TP + TN + FP + FN}$
2. Recall
Recall is a critical performance metric used to evaluate the classifier’s ability to correctly identify positive instances, specifically diabetic individuals, out of all the actual positive instances present in the dataset. It measures the proportion of true positive predictions out of all the instances that are positive in the dataset.
$\mathrm{Recall} = \dfrac{TP}{TP + FN}$
3. Precision
Precision is used to evaluate the classifier’s ability to accurately predict positive instances. It measures the proportion of true positive predictions out of all instances predicted as positive by the classifier. A high precision value indicates that the classifier has a low false positive rate, meaning it correctly identifies a high proportion of diabetic individuals without misclassifying non-diabetic individuals as diabetic.
$\mathrm{Precision} = \dfrac{TP}{TP + FP}$
4. F1 Score
The F1 score is a measure of a classifier’s accuracy that combines precision and recall into a single metric. It is calculated as the harmonic mean of precision and recall, with values ranging from 0 to 1, where 1 indicates perfect precision and recall. The equation for the F1 score is given by Saito et al. [47]:
$F1 = \dfrac{2 \times TP}{2 \times TP + FP + FN}$
where precision is the proportion of true positives among all instances classified as positive, and recall is the proportion of true positives among all instances that are positive. The F1 score is useful when the classes in the dataset are imbalanced, meaning there are more instances of one class than the other. In such cases, accuracy may not be a good metric to use, as a classifier that simply predicts the majority class would have a high accuracy but low precision and recall. The F1 score provides a more balanced measure of a classifier’s performance.
5. Matthews Correlation Coefficient (MCC)
MCC is a measure of the quality of binary (two-class) classification models. It considers true and false positives and negatives and is particularly useful in situations where the classes are imbalanced.
The MCC is defined by the following equation, as given in Chicco et al. [48]:
$\mathrm{MCC} = \dfrac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$
The MCC takes on values between −1 and 1, where a coefficient of 1 represents a perfect prediction, 0 represents a random prediction, and −1 represents a perfectly incorrect prediction.
6. Error Rate
The error rate of a classifier, as mentioned in Duda et al. [49], is the proportion of instances that are misclassified. It can be calculated using the following equation:
$\mathrm{Error\ rate} = \dfrac{FP + FN}{TP + TN + FP + FN}$
7. Jaccard Metric
The Jaccard metric, also known as the Tanimoto similarity coefficient, explicitly disregards the accurate classification of negative samples [50].
$\mathrm{Jaccard} = \dfrac{TP}{TP + FP + FN}$
Changes in data distributions can greatly impact the sensitivity of the Jaccard metric.
8. Kappa
The Kappa statistic, also known as Cohen’s Kappa, is a measure of agreement between two raters, or between a rater and a classifier. In the context of classification, it is used to evaluate the performance of a classifier on a binary or multi-class classification task. The Kappa statistic measures the agreement between the predicted and true classes, considering the possibility of agreement by chance. Kvålseth et al. [51] defined Kappa as follows:
$\mathrm{Kappa} = \dfrac{P_o - P_e}{1 - P_e}$, where $P_o$ is the observed proportion of agreement and $P_e$ is the proportion of agreement expected by chance. $P_o$ and $P_e$ are calculated as follows:
$P_o = \dfrac{TP + TN}{TP + TN + FP + FN}$
$P_e = \dfrac{(TP + FP)(TP + FN) + (FP + TN)(FN + TN)}{(TP + TN + FP + FN)^2}$
The Kappa statistic takes on values between −1 and 1, where values greater than 0 indicate agreement better than chance, 0 indicates agreement by chance, and values less than 0 indicate agreement worse than chance. The results are tabulated in the following tables.
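For reference, the following minimal sketch computes all of the metrics defined in this subsection from confusion-matrix counts; the counts themselves are illustrative and are not results from this study.

```python
# Minimal sketch: computing the metrics of Section 8.1 from confusion-matrix counts
# (illustrative counts, not study results).
import math

TP, TN, FP, FN = 18, 46, 4, 2
total = TP + TN + FP + FN

accuracy = (TP + TN) / total
recall = TP / (TP + FN)
precision = TP / (TP + FP)
f1 = 2 * TP / (2 * TP + FP + FN)
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
error_rate = (FP + FN) / total
jaccard = TP / (TP + FP + FN)
p_o = (TP + TN) / total
p_e = ((TP + FP) * (TP + FN) + (FP + TN) * (FN + TN)) / total ** 2
kappa = (p_o - p_e) / (1 - p_e)

print(f"Acc={accuracy:.4f} Recall={recall:.4f} Precision={precision:.4f} "
      f"F1={f1:.4f} MCC={mcc:.4f} Err={error_rate:.4f} "
      f"Jaccard={jaccard:.4f} Kappa={kappa:.4f}")
```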
Table 9 presents the performance analysis of the seven classifiers, based on parameters such as accuracy, F1 score, MCC, error rate, Jaccard metric, and Kappa, for the four Dimensionality Reduction methods without feature selection. It can be seen from Table 9 that the SVM (RBF) classifier with the Cuckoo Search DR technique attains a moderate accuracy of 65.71%, an F1 score of 50%, an error rate of 34.28%, and a Jaccard metric of 33.33%, together with a low MCC of 0.2581 and a Kappa value of 0.25. The Logistic Regression classifier with the Firefly algorithm DR technique sits at the low end, with an accuracy of 51.42%, a high error rate of 48.57%, an F1 score of 37.03%, and a Jaccard metric of 22.72%; its MCC and Kappa values are 0.01807 and 0.01652, respectively. Irrespective of the Dimensionality Reduction technique, all of the classifiers settled at an accuracy within the range of 50–65%. This is due to the inherent limitations of the Dimensionality Reduction techniques alone; therefore, feature selection methods are incorporated to enhance classifier performance.
Figure 9 depicts the same analysis graphically, showing the accuracy, F1 score, error rate, and Jaccard metric values of the seven classifiers for the four Dimensionality Reduction methods without feature selection. The SVM (RBF) classifier with Cuckoo Search attains the moderate accuracy of 65.71%, an F1 score of 50%, an error rate of 34.28%, and a Jaccard metric of 33.33%, while the Logistic Regression classifier with the Firefly algorithm lies at the low end, with an accuracy of 51.42%, a high error rate of 48.57%, an F1 score of 37.03%, and a Jaccard metric of 22.72%.
Table 10 presents the performance analysis of the seven classifiers for the four Dimensionality Reduction methods with the PSO feature selection method. It can be observed from Table 10 that the SVM (RBF) classifier with the Chi-square pdf DR technique attains a high accuracy of 91.42%, an F1 score of 85.71%, a low error rate of 8.57%, and a Jaccard metric of 75%, together with a high MCC of 0.7979 and a Kappa value of 0.7961. The Logistic Regression classifier with the Firefly algorithm DR technique is once again at the low end, with an accuracy of 55.71%, a high error rate of 44.28%, an F1 score of 43.63%, and a Jaccard metric of 27.9%; its MCC and Kappa values are 0.1264 and 0.1142, respectively. Irrespective of the Dimensionality Reduction technique, all of the classifiers settled at an accuracy within the range of 55–92%. This enhancement in accuracy is due to the inherent properties of the PSO feature selection method.
Figure 10 displays the performance analysis of the seven classifiers for the four Dimensionality Reduction methods with the PSO feature selection method. It is also evident from Figure 10 that the SVM (RBF) classifier with the Chi-square pdf DR technique attains a high accuracy of 91.42%, an F1 score of 85.71%, a low error rate of 8.57%, and a Jaccard metric of 75%, whereas the Logistic Regression classifier with the Firefly algorithm DR technique settles at the low end, with an accuracy of 55.71%, a high error rate of 44.28%, an F1 score of 43.63%, and a Jaccard metric of 27.9%. The PSO feature selection method improves the classifier accuracy by around 10–35%, irrespective of the DR technique.
Table 11 presents the performance analysis of the seven classifiers for the four Dimensionality Reduction methods with the Harmonic Search feature selection method. It is observed from Table 11 that the SVM (RBF) classifier with the Cuckoo Search DR technique attains a high accuracy of 90%, an F1 score of 83.72%, a low error rate of 10%, and a Jaccard metric of 72%, together with a high MCC of 0.7694 and a Kappa value of 0.7655. The Linear Regression classifier with the Detrend Fluctuation Analysis (DFA) DR technique attains the lowest accuracy of 52.85%, with a high error rate of 47.14%, an F1 score of 37.75%, and a Jaccard metric of 23.25%; its MCC and Kappa values are 0.0361 and 0.03343, respectively. Irrespective of the Dimensionality Reduction technique, all of the classifiers settled at an accuracy within the range of 50–90%. This enhancement in accuracy is due to the use of the Harmonic Search feature selection method.
Figure 11 shows the performance analysis of the seven classifiers for the four Dimensionality Reduction methods with the Harmonic Search feature selection method. It is also observed from Figure 11 that the SVM (RBF) classifier with the Cuckoo Search DR technique attains a high accuracy of 90%, an F1 score of 83.72%, a low error rate of 10%, and a Jaccard metric of 72%, whereas the Linear Regression classifier with the DFA DR technique settles at the lowest accuracy of 52.85%, with a high error rate of 47.14%, an F1 score of 37.75%, and a Jaccard metric of 23.25%. The Harmonic Search feature selection method improves the classifier accuracy by around 10–25%, irrespective of the DR technique, and ranks just behind the PSO feature selection method.
Figure 12 displays the MCC and Kappa values across the classifiers for the four DR techniques, both without and with the two feature selection methods. MCC and Kappa are benchmark parameters that summarize the outcomes of the classifiers for different inputs. In this research, there are three categories of inputs: dimensionally reduced features without feature selection, with PSO feature selection, and with Harmonic Search feature selection. The classifiers’ performances are observed through the MCC and Kappa values attained for these inputs; the average MCC and Kappa values across the classifiers are 0.2984 and 0.2849, respectively. With reference to Figure 12, a methodology was devised to characterize classifier performance by dividing the MCC values into three ranges: 0.01–0.25, 0.251–0.54, and 0.55–0.8. Classifier performance is very poor in the first range, the MCC-versus-Kappa slope increases steeply in the second range, and classifiers in the third range settle at a higher performance without any glitches.

8.2. Computational Complexity

The classifiers are also analysed in terms of computational complexity, expressed as a function of the input size n in big-O notation. A complexity of O(1) is the lowest, since it does not depend on the input size, which is a desirable property for any algorithm; for most algorithms, the computational cost grows as the number of inputs increases. If the computational cost grows logarithmically with n, the complexity is denoted O(log n). In this paper, all of the classifiers are hybrid, classifying the dimensionally reduced outputs together with the feature selection methods.
Table 12 shows the computational complexity of the classifiers for the four Dimensionality Reduction techniques without feature selection methods. It is observed from Table 12 that almost all of the classifiers’ computational complexities are nearly equal, and their performances sit at a low level of accuracy. The Linear Regression classifier has a low computational complexity of O(2n log 2n); at the same time, the Logistic Regression classifier with the Firefly algorithm DR technique has a higher complexity of O(2n³ log 2n), yet both classifiers are at the same level of accuracy. The SVM (RBF) classifier with the Cuckoo Search DR technique has a high computational complexity of O(2n³ log 4n), with increased accuracy, MCC, and Kappa values.
Table 13 displays the computational complexity of the classifiers for the four Dimensionality Reduction techniques with the PSO feature selection method. It can be seen from Table 13 that almost all of the classifiers’ computational complexities are nearly equal, and their performances sit at a high level of accuracy. The Linear Regression classifier has a low computational complexity of O(2n³ log 2n); at the same time, the Logistic Regression classifier with the Firefly algorithm DR technique has a higher complexity of O(2n⁴ log 2n), yet both classifiers are at the same level of accuracy. The SVM (RBF) classifier with the Chi-square pdf DR technique has a high computational complexity of O(2n⁴ log 4n), with the highest accuracy of 91%, and MCC and Kappa values of 0.7979 and 0.7961, respectively.
Table 14 shows the computational complexity of the classifiers for the four Dimensionality Reduction techniques with the Harmonic Search feature selection method. It can be seen from Table 14 that almost all of the classifiers’ computational complexities are nearly equivalent, and their performances sit at a high level of accuracy. The Linear Regression classifier has a low computational complexity of O(2n² log 2n); at the same time, the BDLC and GMM classifiers with the Firefly algorithm DR technique have a higher complexity of O(2n⁵ log 2n), yet both classifiers are at the same level of accuracy. The SVM (RBF) classifier with the Cuckoo Search DR technique has a high computational complexity of O(2n⁴ log 2n), with the highest accuracy of 90%, and MCC and Kappa values of 0.7694 and 0.7655, respectively. Even though the GMM and BDLC classifiers have a high computational complexity, they did not achieve better performance metrics.

8.3. Comparison of Previous Works

Table 15 shows a comparison of previous works that used different machine learning techniques to detect diabetes.
As shown in Table 15, the majority of machine learning classifiers, such as SVM (RBF), Naïve Bayes, Logistic Regression, decision tree, non-linear regression, random forest, multilayer perceptron, and deep neural networks, have been used to classify diabetes based on clinical databases. The reported accuracies lie in the range of 67–91%. The current study detects diabetes from microarray genes, and the SVM (RBF) classifier achieved an accuracy of 91%.
Table 16 compares classifier performance across different datasets and explores the efficacy of machine learning classifiers in detecting various diseases. It is also noted that, on the clinical diabetes dataset from the Frankfurt hospital, Germany, the decision tree (ID3) classifier attains the highest accuracy of 99%.

9. Conclusions

The main aim of this paper is to identify and analyze a good classifier for classifying diabetic microarray gene data with a high accuracy and low error rate. To this end, seven classifiers are analyzed through benchmark measures such as accuracy, F1 score, MCC, error rate, Jaccard metric, and Kappa. The highest accuracy of 91% is achieved by the SVM (RBF) classifier for the Chi2 pdf DR technique with the PSO feature selection method; this classifier exhibits a high computational complexity of O(2n⁴ log 4n). The second highest accuracy of 90% is achieved by the SVM (RBF) classifier in combination with the Cuckoo Search DR technique and the Harmonic Search feature selection method, with a computational complexity of O(2n⁴ log 2n). However, even the GMM and BLDC classifiers, despite their higher computational complexity, failed to achieve good accuracy. The enhanced accuracy attained by the SVM (RBF) classifier can be utilized for better classification of diabetic microarray gene data. Future research will be directed towards CNN and DNN models for further improvement of classifier performance.

Author Contributions

Conceptualization, D.C.; Methodology, D.C.; Software, D.C.; Validation, H.R.; Formal analysis, H.R.; Investigation, D.C.; Resources, D.C. and H.R.; Data curation, H.R.; Writing—original draft, D.C.; Writing—review and editing, H.R.; Visualization, D.C.; Supervision, H.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Kumar, D.; Ashok, R. Govindasamy. Performance and evaluation of classification data mining techniques in diabetes. Int. J. Comput. Sci. Inf. Technol. 2015, 6, 1312–1319. [Google Scholar]
  2. Lam, A.A.; Lepe, A.; Wild, S.H.; Jackson, C. Diabetes comorbidities in low-and middle-income countries: An umbrella review. J. Glob. Health 2021, 11, 04040. [Google Scholar] [CrossRef]
  3. Mohsen, F.; Safieh, H.; Shibani, M.; Ismail, H.; Alzabibi, M.A.; Armashi, H.; Sawaf, B. Assessing diabetes mellitus knowledge among Syrian medical students: A cross-sectional study. Heliyon 2021, 7, e08079. [Google Scholar] [CrossRef]
  4. Nakrani, M.N.; Wineland, R.H.; Anjum, F. Physiology, Glucose Metabolism. In StatPearls; StatPearls Publishing: Treasure Island, FL, USA, 2023. Available online: https://www.ncbi.nlm.nih.gov/books/NBK560599/ (accessed on 20 August 2021).
  5. WHO. Diabetes—India; World Health Organization: Geneve, Switzerland; Available online: https://www.who.int/india/health-topics/mobile-technology-for-preventing-ncds (accessed on 13 November 2021).
  6. Krishnamoorthy, Y.; Rajaa, S.; Murali, S.; Rehman, T.; Sahoo, J.; Kar, S.S. Prevalence of metabolic syndrome among adult population in India: A systematic review and meta-analysis. PLoS ONE 2020, 15, e0240971. [Google Scholar] [CrossRef]
  7. Sahu, G.; Gohain, S.; Brahma, A. Drug utilization pattern of antidiabetic drugs among indoor diabetic patients in a tertiary care teaching hospital, Jorhat. Biomedicine 2020, 40, 512–515. [Google Scholar] [CrossRef]
  8. NCD Risk Factor Collaboration. Worldwide trends in diabetes since 1980: A pooled analysis of 751 population-based studies with 4·4 million participants. Lancet 2016, 387, 1513–1530. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  9. Available online: https://www.niddk.nih.gov/health-information/diabetes/overview/what-is-diabetes/type-2-diabetes (accessed on 20 August 2021).
  10. Deshpande, A.D.; Harris-Hayes, M.; Schootman, M. Epidemiology of diabetes and diabetes-related complications. Phys. Ther. 2008, 88, 1254–1264. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  11. Tonyan, Z.N.; Nasykhova, Y.A.; Danilova, M.M.; Barbitoff, Y.A.; Changalidi, A.I.; Mikhailova, A.A.; Glotov, A.S. Overview of Transcriptomic Research on Type 2 Diabetes: Challenges and Perspectives. Genes 2022, 13, 1176. [Google Scholar] [CrossRef]
  12. Kavakiotis, I.; Tsave, O.; Salifoglou, A.; Maglaveras, N.; Vlahavas, I.; Chouvarda, I. Machine learning and data mining methods in diabetes research. Comput. Struct. Biotechnol. J. 2017, 15, 104–116. [Google Scholar] [CrossRef] [PubMed]
  13. Davidson, K.W.; Barry, M.J.; Mangione, C.M.; Cabana, M.; Caughey, A.B.; Davis, E.M.; Donahue, K.E.; Doubeni, C.A.; Krist, A.H.; Kubik, M.; et al. Screening for prediabetes and type 2 diabetes: US Preventive Services Task Force recommendation statement. Jama 2021, 326, 736–743. [Google Scholar] [PubMed]
  14. Shaw, J.E.; Sicree, R.A.; Zimmet, P.Z. Global estimates of the prevalence of diabetes for 2010 and 2030. Diabetes Res. Clin. Pract. 2010, 87, 4–14. [Google Scholar] [CrossRef] [PubMed]
  15. Mohan, V.; Pradeepa, R. Epidemiology of type 2 diabetes in India. Indian J. Ophthalmol. 2021, 69, 2932. [Google Scholar] [CrossRef] [PubMed]
  16. Abdulkareem, S.A.; Radhi, H.Y.; Fadil, Y.A.; Mahdi, H. Soft computing techniques for early diabetes prediction. Indones. J. Electr. Eng. Comput. Sci. 2022, 25, 1167–1176. [Google Scholar] [CrossRef]
  17. Mujumdar, A.; Vaidehi, V. Diabetes prediction using machine learning algorithms. Procedia Comput. Sci. 2019, 165, 292–299. [Google Scholar] [CrossRef]
  18. Bhaskaran, K.; Unnikrishnan, R.; Deepa, S. Prediction of diabetes using machine learning techniques. Int. J. Eng. Res. Technol. 2017, 6, 232–237. [Google Scholar]
  19. Llaha, O.; Rista, A. Prediction and Detection of Diabetes using Machine Learning. In Proceedings of the 20th International Conference on Real-Time Applications in Computer Science and Information Technology (RTA-CSIT), Tirana, Albania, 21–22 May 2021; pp. 94–102. [Google Scholar]
  20. Ahamed, B.S.; Arya, M.S.; Sangeetha, S.K.B.; Osvin, N.V.A. Diabetes Mellitus Disease Prediction and Type Classification Involving Predictive Modeling Using Machine Learning Techniques and Classifiers. Appl. Comput. Intell. Soft Comput. 2022, 2022, 7899364. [Google Scholar] [CrossRef]
  21. Tigga, N.P.; Garg, S. Prediction of type 2 diabetes using machine learning classification methods. Procedia Comput. Sci. 2020, 167, 706–716. [Google Scholar] [CrossRef]
  22. Maniruzzaman; Kumar, N.; Abedin, M.; Islam, S.; Suri, H.S.; El-Baz, A.S.; Suri, J.S. Comparative approaches for classification of diabetes mellitus data: Machine learning paradigm. Comput. Methods Programs Biomed. 2017, 152, 23–34. [Google Scholar] [CrossRef]
  23. Gupta, S.K.; Singh, Z.; Purty, A.J.; Kar, M.; Vedapriya, D.R.; Mahajan, P.; Cherian, J. Diabetes prevalence and its risk factors in rural area of Tamil Nadu. Indian J. Community Med. Off. Publ. Indian Assoc. Prev. Soc. Med. 2010, 35, 396. [Google Scholar] [CrossRef]
  24. Howlader, K.C.; Satu, S.; Awal, A.; Islam, R.; Islam, S.M.S.; Quinn, J.M.W.; Moni, M.A. Machine learning models for classification and identification of significant attributes to detect type 2 diabetes. Health Inf. Sci. Syst. 2022, 10, 2. [Google Scholar] [CrossRef]
  25. Sisodia, D.; Sisodia, D.S. Prediction of diabetes using classification algorithms. Procedia Comput. Sci. 2018, 132, 1578–1585. [Google Scholar] [CrossRef]
  26. Mathur, P.; Leburu, S.; Kulothungan, V. Prevalence, awareness, treatment and control of diabetes in India from the countrywide National NCD Monitoring Survey. Front. Public Health 2022, 10, 205. [Google Scholar] [CrossRef] [PubMed]
  27. Kazerouni, F.; Bayani, A.; Asadi, F.; Saeidi, L.; Parvizi, N.; Mansoori, Z. Type2 diabetes mellitus prediction using data mining algorithms based on the long-noncoding RNAs expression: A comparison of four data mining approaches. BMC Bioinform. 2020, 21, 372. [Google Scholar] [CrossRef]
  28. Lawi, A.; Syarif, S. Performance evaluation of naive Bayes and support vector machine in type 2 diabetes Mellitus gene expression microarray data. J. Phys. Conf. Ser. 2019, 1341, 042018. [Google Scholar]
  29. Berthouze, L.; Farmer, S.F. Adaptive time-varying detrended fluctuation analysis. J. Neurosci. Methods 2012, 209, 178–188. [Google Scholar] [CrossRef]
  30. Alifah; Siswantining, T.; Sarwinda, D.; Bustamam, A. RFE and Chi-Square Based Feature Selection Approach for Detection of Diabetic Retinopathy. In Proceedings of the International Joint Conference on Science and Engineering (IJCSE 2020), Surabaya, Indonesia, 3 October 2020; Atlantis Press: Amsterdam, The Netherland, 2020. [Google Scholar]
  31. Yang, X.-S. Firefly algorithm, Levy flights and global optimization. In Research and Development in Intelligent Systems XXVI: Incorporating Applications and Innovations in Intelligent Systems XVII.; Springer: London, UK, 2010. [Google Scholar]
  32. Yang, X.S.; He, X. Firefly algorithm: Recent advances and applications. arXiv 2013, arXiv:1308.3898. [Google Scholar] [CrossRef] [Green Version]
  33. Yang, X.S.; Deb, S. Cuckoo search via Lévy flights. In Proceedings of the 2009 World Congress on Nature & Biologically Inspired Computing (NaBIC), Coimbatore, India, 9–11 December 2009; pp. 210–214. [Google Scholar]
  34. Gandomi, A.H.; Yang, X.S.; Alavi, A.H. Cuckoo search algorithm: A metaheuristic approach to solve structural optimization problems. Eng. Comput. 2013, 29, 17–35. [Google Scholar] [CrossRef]
  35. Rajaguru, H.; Prabhakar, S.K.; Manjusha, M. Performance Analysis of Original Particle Swarm Optimization and Modified PSO Technique for Robust Classification of Epilepsy Risk level from EEG Signals. Int. J. Pharm. Technol. 2016, 8, 18273–18283. [Google Scholar]
  36. Bharanidharan, N.; Rajaguru, H. Classification of dementia using harmony search optimization technique. In Proceedings of the 2018 IEEE Region 10 Humanitarian Technology Conference (R10-HTC), Malambe, Sri Lanka, 6–8 December 2018. [Google Scholar]
  37. Zhang, G.; Allaire, D.; Cagan, J. Reducing the Search Space for Global Minimum: A Focused Regions Identification Method for Least Squares Parameter Estimation in Nonlinear Models. J. Comput. Inf. Sci. Eng. 2023, 23, 021006. [Google Scholar] [CrossRef]
  38. Draper, N.R.; Smith, H. Applied Regression Analysis; John Wiley & Sons: Hoboken, NJ, USA, 1998; Volume 326. [Google Scholar]
  39. Hamid, I.Y. Prediction of Type 2 Diabetes through Risk Factors using Binary Logistic Regression. J. Al-Qadisiyah Comput. Sci. Math. 2020, 12, 1. [Google Scholar]
  40. Adiwijaya, K.; Wisesty, U.N.; Lisnawati, E.; Aditsania, A.; Kusumo, D.S. Dimensionality reduction using principal component analysis for cancer detection based on microarray data classification. J. Comput. Sci. 2018, 14, 1521–1530. [Google Scholar] [CrossRef] [Green Version]
  41. Prabhakar, S.K.; Rajaguru, H.; Lee, S.-W. A comprehensive analysis of alcoholic EEG signals with detrend fluctuation analysis and post classifiers. In Proceedings of the 2019 7th International Winter Conference on Brain-Computer Interface (BCI), Gangwon, Republic of Korea, 18–20 February 2019. [Google Scholar]
  42. Zhou, W.; Liu, Y.; Yuan, Q.; Li, X. Epileptic seizure detection using lacunarity and Bayesian linear discriminant analysis in intracranial EEG. IEEE Trans. Biomed. Eng. 2013, 60, 3375–3381. [Google Scholar] [CrossRef] [PubMed]
  43. Zang, F.; Zhang, J.S. Softmax Discriminant Classifier. In Proceedings of the 3rd International Conference on Multimedia Information Networking and Security, Shanghai, China, 4–6 November 2011; pp. 16–20. [Google Scholar]
  44. Yao, X.; Panaye, A.; Doucet, J.; Chen, H.; Zhang, R.; Fan, B.; Liu, M.; Hu, Z. Comparative classification study of toxicity mechanisms using support vector machines and radial basis function neural networks. Anal. Chim. Acta 2005, 535, 259–273. [Google Scholar] [CrossRef]
  45. Fushiki, T. Estimation of prediction error by using K-fold cross-validation. Stat. Comput. 2011, 21, 137–146. [Google Scholar] [CrossRef]
  46. Fawcett, T. An introduction to ROC analysis. Pattern Recognit. Lett. 2006, 27, 861–874. [Google Scholar] [CrossRef]
  47. Saito, M.; Yamanaka, S. Performance evaluation of least-squares probabilistic classifier for corporate credit rating classification problem. JSIAM Lett. 2021, 13, 9–12. [Google Scholar] [CrossRef]
  48. Chicco, D.; Jurman, G. An invitation to greater use of Matthews correlation coefficient (MCC) in robotics and artificial intelligence. Front. Robot. AI 2022, 78, 876814. [Google Scholar] [CrossRef]
  49. Hart, P.E.; Stork, D.G.; Duda, R.O. Pattern Classification; John Wiley & Sons: Hoboken, NJ, USA, 2006. [Google Scholar]
  50. Tharwat, A. Classification assessment methods. Appl. Comput. Inform. 2020, 17, 168–192. [Google Scholar] [CrossRef]
  51. Kvålseth, T.O. Note on Cohen’s kappa. Psychol. Rep. 1989, 65, 223–226. [Google Scholar] [CrossRef]
  52. Kumar, P.S.; Pranavi, S. Performance analysis of machine learning algorithms on diabetes dataset using big data analytics. In Proceedings of the 2017 International Conference on Infocom Technologies and Unmanned Systems (Trends and Future directions) (ICTUS), Venue, Amity, 18–20 December 2017. [Google Scholar] [CrossRef]
  53. Olivera, A.R.; Roesler, V.; Iochpe, C.; Schmidt, M.I.; Vigo, A.; Barreto, S.M.; Duncan, B.B. Comparison of machine-learning algorithms to build a predictive model for detecting undiagnosed diabetes-ELSA-Brasil: Accuracy study. Sao Paulo Med. J. 2017, 135, 234–246. [Google Scholar] [CrossRef] [Green Version]
  54. Alghamdi, M.; Al-Mallah, M.; Keteyian, S.; Brawner, C.; Ehrman, J.; Sakr, S. Predicting diabetes mellitus using smote and ensemble machine learning approach: The henry ford exercise testing (ft) project. PLoS ONE 2017, 12, e0179805. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  55. Xie, J.; Liu, Y.; Zeng, X.; Zhang, W.; Mei, Z. A Bayesian network model for predicting type 2 diabetes risk based on electronic health records. Mod. Phys Lett B. 2017, 31, 1740055. [Google Scholar] [CrossRef] [Green Version]
  56. Sarwar, M.A.; Kamal, N.; Hamid, W.; Shah, M.A. Prediction of diabetes using machine learning algorithms in healthcare. In Proceedings of the 2018 24th International Conference on Automation and Computing (ICAC), Tyne, UK, 6–7 September 2018. [Google Scholar] [CrossRef]
  57. Zou, Q.; Qu, K.; Luo, Y.; Yin, D.; Ju, Y.; Tang, H. Predicting diabetes mellitus with machine learning techniques. Front. Genet. 2018, 9, 515. [Google Scholar] [CrossRef]
  58. Perveen, S.; Shahbaz, M.; Keshavjee, K.; Guergachi, A. Metabolic syndrome and development of diabetes mellitus: Predictive modeling based on machine learning techniques. IEEE Access 2019, 7, 1365–1375. [Google Scholar] [CrossRef]
  59. Yuvaraj, N.; Sripreethaa, K.R. Diabetes prediction in healthcare systems using machine learning algorithms on Hadoop cluster. Clust. Comput. 2017, 22, 1365–1375. [Google Scholar] [CrossRef]
  60. Jakka, A.; Jakka, V.R. Performance evaluation of machine learning models for diabetes prediction. Int. J. Innov. Technol. Explor. Eng Regul. Issue 2019, 8, 1976–1980. [Google Scholar] [CrossRef]
  61. Radja, M.; Emanuel, A.W.R. Performance evaluation of supervised machine learning algorithms using different data set sizes for diabetes prediction. In Proceedings of the 2019 5th International Conference on Science in Information Technology (ICSITech), Jogjakarta, Indonesia, 23–24 October 2019. [Google Scholar] [CrossRef]
  62. Xiong, X.-L.; Zhang, R.-X.; Bi, Y.; Zhou, W.-H.; Yu, Y.; Zhu, D.-L. Machine learning models in type 2 diabetes risk prediction: Results from a cross-sectional retrospective study in Chinese adults. Curr. Med. Sci. 2019, 39, 582–588. [Google Scholar] [CrossRef]
  63. Dinh, A.; Miertschin, S.; Young, A.; Mohanty, S.D. A data-driven approach to predicting diabetes and cardiovascular disease with machine learning. BMC Med. Inform. Decis. Mak. 2019, 19, 211. [Google Scholar] [CrossRef] [Green Version]
  64. Yang, T.; Zhang, L.; Yi, L.; Feng, H.; Li, S.; Chen, H.; Zhu, J.; Zhao, J.; Zeng, Y.; Liu, H. Ensemble learning models based on noninvasive features for type 2 diabetes screening: Model development and validation. JMIR Med. Inform. 2020, 8, e15431. [Google Scholar] [CrossRef]
  65. Muhammad, L.J.; Algehyne, E.A.; Usman, S.S. Predictive supervised machine learning models for diabetes mellitus. SN Comput. Sci. 2020, 1, 240. [Google Scholar] [CrossRef]
  66. Lam, B.; Catt, M.; Cassidy, S.; Bacardit, J.; Darke, P.; Butterfeld, S.; Alshabrawy, O.; Trenell, M.; Missier, P. Using wearable activity trackers to predict type 2 diabetes: Machine learning-based cross-sectional study of the UK biobank accelerometer cohort. JMIR Diabetes 2021, 6, 23364. [Google Scholar] [CrossRef]
  67. De Silva, K.; Lim, S.; Mousa, A.; Teede, H.; Forbes, A.; Demmer, R.T.; Jonsson, D.; Enticott, J. Nutritional markers of undiagnosed type 2 diabetes in adults: Findings of a machine learning analysis with external validation and benchmarking. PLoS ONE 2021, 16, e0250832. [Google Scholar] [CrossRef]
  68. Kim, H.; Lim, D.H.; Kim, Y. Classification and prediction on the effects of nutritional intake on overweight/obesity, dyslipidemia, hypertension and type 2 diabetes mellitus using deep learning model: 4–7th Korea national health and nutrition examination survey. Int. J. Environ. Res. Public Health 2021, 18, 5597. [Google Scholar] [CrossRef]
  69. Ramesh, J.; Aburukba, R.; Sagahyroon, A. A remote healthcare monitoring framework for diabetes prediction using machine learning. Healthc. Technol. Lett. 2021, 8, 45–57. [Google Scholar] [CrossRef]
  70. Phongying, M.; Hiriote, S. Diabetes Classification Using Machine Learning Techniques. Computation 2023, 11, 96. [Google Scholar] [CrossRef]
  71. Changpetch, P.; Pitpeng, A.; Hiriote, S.; Yuangyai, C. Integrating Data Mining Techniques for Naïve Bayes Classification: Applications to Medical Datasets. Computation 2021, 9, 99. [Google Scholar] [CrossRef]
  72. Kumari, S.; Kumar, D.; Mittal, M. An ensemble approach for classification and prediction of diabetes mellitus using soft voting classifier. Int. J. Cogn. Comput. Eng. 2021, 2, 40–46. [Google Scholar] [CrossRef]
  73. Yang, H.; Luo, Y.; Ren, X.; Wu, M.; He, X.; Peng, B.; Deng, K.; Yan, D.; Tang, H.; Lin, H. Risk prediction of diabetes: Big data mining with fusion of multifarious physical examination indicators. Inf. Fusion 2021, 75, 140–149. [Google Scholar] [CrossRef]
  74. Haq, A.U.; Li, J.P.; Khan, J.; Memon, M.H.; Nazir, S.; Ahmad, S.; Khan, G.A.; Ali, A. Intelligent machine learning approach for effective recognition of diabetes in E-healthcare using clinical data. Sensors 2020, 20, 2649. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Illustration of the work.
Figure 2. Histogram of Detrend Fluctuation Analysis (DFA) Techniques in Diabetic Gene Class.
Figure 3. Histogram of Detrend Fluctuation Analysis (DFA) Techniques in Normal Gene Class.
Figure 4. Normal Probability plot for Chi Square DR Techniques Features for Diabetic Gene Class.
Figure 5. Normal Probability plot for Chi Square PDF DR Techniques Features for Non-Diabetic Gene Class.
Figure 6. Normal Probability plot for Firefly Algorithm DR Techniques Features for Diabetic Gene Class.
Figure 7. Normal Probability plot for Firefly algorithm DR techniques features for non-diabetic gene class.
Figure 8. Scatter plot for Cuckoo search DR technique with diabetic and non-diabetic gene classes.
Figure 9. Performance Analysis of Classifiers without Feature Selection Methods.
Figure 10. Performance Analysis of Classifiers with PSO Feature Selection Methods.
Figure 11. Performance analysis of classifiers with Harmonic Search feature selection methods.
Figure 12. Performance of MCC and Kappa parameters across the classifier for four DR techniques without and with two-feature selection methods.
Table 1. Description of Pancreas Microarray Gene Data set for Diabetic and Non-Diabetic Classes.
Data Set | No. of Genes | Class 1: Diabetic | Class 2: Non-Diabetic | Total
Pancreas | 28735 | 20 | 50 | 70
Table 2. Statistical Analysis for Different Dimensionality Reduction Techniques.
Statistical Parameters | DFA (DP) | DFA (NDP) | Chi2pdf (DP) | Chi2pdf (NDP) | Firefly Algorithm (DP) | Firefly Algorithm (NDP) | Cuckoo Search (DP) | Cuckoo Search (NDP)
Mean | 1.6302 | 1.6301 | 0.0690 | 0.0691 | 1.0278 | 0.1260 | 8.7158 | 12.6033
Variance | 0.1614 | 0.1665 | 0.0007 | 0.0008 | 0.0004 | 1.45 × 10−9 | 48.4751 | 71.7672
Skewness | 0.2319 | 0.2758 | −0.2771 | −0.3228 | 1.6527 | 4.0298 | 0.6087 | 0.2692
Kurtosis | 0.2706 | 0.1600 | 0.0301 | 0.0096 | 3.1033 | 78.3911 | −1.1233 | −1.3759
Pearson CC | 0.9516 | 0.9598 | 0.9781 | 0.9814 | 0.8803 | 0.5006 | 0.7048 | 0.6859
CCA | 0.05914 (DFA) | 0.06411 (Chi2pdf) | 0.04785 (Firefly Algorithm) | 0.05107 (Cuckoo Search)
Table 3. p-value significance for Feature Selection methods from t-test for different DR techniques.
Feature Selection (p value < 0.05) | DFA (DP) | DFA (NDP) | Chi Square pdf (DP) | Chi Square pdf (NDP) | Firefly Algorithm (DP) | Firefly Algorithm (NDP) | Cuckoo Search (DP) | Cuckoo Search (NDP)
PSO | 0.415 | 0.1906 | 0.3877 | 0.16074 | 0.38059 | 0.2435 | 0.4740 | 0.48824
Harmonic Search | 0.0004 | 0.4290 | 0.3836 | 0.43655 | 0.00031 | 0.469 | 0.3488 | 0.3979
Table 4. Confusion Matrix for Diabetic and Non-Diabetic Patient Detection.
Truth of Clinical Situation | Predicted: Diabetic | Predicted: Non-Diabetic
Actual: Diabetic | TP | FN
Actual: Non-Diabetic | FP | TN
Table 5. Training and Testing MSE Performance of the Classifiers without Feature Selection Method for Four Dimensionality Reduction Techniques.
Classifiers | DFA: Training MSE | DFA: Testing MSE | Chi2 pdf: Training MSE | Chi2 pdf: Testing MSE | Firefly Algorithm: Training MSE | Firefly Algorithm: Testing MSE | Cuckoo Search: Training MSE | Cuckoo Search: Testing MSE
NLR | 7.92 × 10−5 | 1.633 × 10−4 | 9.8 × 10−5 | 5.439 × 10−5 | 6.56 × 10−5 | 2.9 × 10−4 | 2.81 × 10−5 | 1.45 × 10−4
Linear Reg | 4.62 × 10−5 | 8.936 × 10−5 | 6.08 × 10−5 | 1.945 × 10−4 | 5.48 × 10−5 | 3.566 × 10−4 | 9.22 × 10−5 | 1.35 × 10−4
Logistic Regression | 2.6 × 10−5 | 1.325 × 10−4 | 2.21 × 10−5 | 1.565 × 10−4 | 8.84 × 10−5 | 4.625 × 10−4 | 9.8 × 10−5 | 1.685 × 10−4
GMM | 1.52 × 10−5 | 1.21 × 10−4 | 7.06 × 10−5 | 1.662 × 10−4 | 7.92 × 10−5 | 2.442 × 10−4 | 3.42 × 10−5 | 1.103 × 10−4
BDLC | 1.82 × 10−6 | 6.978 × 10−5 | 8.46 × 10−5 | 2.492 × 10−4 | 4.49 × 10−5 | 2.136 × 10−4 | 1.44 × 10−5 | 7.161 × 10−5
SDC | 2.72 × 10−5 | 1.567 × 10−4 | 1.21 × 10−5 | 1.095 × 10−4 | 4.22 × 10−5 | 3.081 × 10−4 | 1.21 × 10−5 | 1.04 × 10−4
SVM (RBF) | 1.96 × 10−5 | 1.656 × 10−4 | 4.62 × 10−5 | 1.454 × 10−4 | 3.48 × 10−5 | 1.924 × 10−4 | 1.26 × 10−8 | 5.141 × 10−6
Table 6. Training and Testing MSE Analysis of Classifiers for Various Dimensionality Reduction Techniques with PSO Feature Selection Methods.
Classifiers | DFA: Training MSE | DFA: Testing MSE | Chi2 pdf: Training MSE | Chi2 pdf: Testing MSE | Firefly Algorithm: Training MSE | Firefly Algorithm: Testing MSE | Cuckoo Search: Training MSE | Cuckoo Search: Testing MSE
NLR | 3.61 × 10−6 | 9.01 × 10−6 | 5.29 × 10−6 | 4.062 × 10−5 | 3.36 × 10−6 | 1.568 × 10−4 | 3.84 × 10−7 | 3.737 × 10−5
Linear Reg | 9.61 × 10−6 | 1.603 × 10−5 | 1.22 × 10−7 | 4.825 × 10−6 | 8.28 × 10−6 | 1.464 × 10−5 | 4.62 × 10−6 | 5.042 × 10−5
Logistic Regression | 1.44 × 10−5 | 1.422 × 10−6 | 2.6 × 10−8 | 6.806 × 10−6 | 6.89 × 10−5 | 1.486 × 10−4 | 7.4 × 10−7 | 2.906 × 10−5
GMM | 6.24 × 10−5 | 8.021 × 10−5 | 1.36 × 10−7 | 1.168 × 10−5 | 4.62 × 10−6 | 1.394 × 10−5 | 2.7 × 10−7 | 1.511 × 10−4
BDLC | 1.76 × 10−5 | 4.063 × 10−5 | 7.29 × 10−6 | 5.653 × 10−5 | 1.52 × 10−6 | 1.192 × 10−4 | 5.33 × 10−7 | 2.92 × 10−5
SDC | 9 × 10−6 | 2.003 × 10−5 | 2.89 × 10−6 | 2.493 × 10−5 | 1.02 × 10−5 | 1.243 × 10−4 | 8.65 × 10−8 | 5.525 × 10−5
SVM (RBF) | 2.12 × 10−7 | 8.5 × 10−6 | 1.94 × 10−9 | 1.885 × 10−6 | 2.56 × 10−6 | 1.889 × 10−5 | 7.22 × 10−7 | 1.022 × 10−5
Table 7. Training and Testing MSE Analysis of Classifiers for Various Dimensionality Reduction Techniques with Harmonic Search Feature Selection Methods.
Classifiers | DFA: Training MSE | DFA: Testing MSE | Chi2 pdf: Training MSE | Chi2 pdf: Testing MSE | Firefly Algorithm: Training MSE | Firefly Algorithm: Testing MSE | Cuckoo Search: Training MSE | Cuckoo Search: Testing MSE
NLR | 7.06 × 10−5 | 1.68 × 10−5 | 3.25 × 10−5 | 5.46 × 10−5 | 5.04 × 10−5 | 9.7 × 10−5 | 9.61 × 10−6 | 5.07 × 10−5
Linear Reg | 9.22 × 10−5 | 2.44 × 10−4 | 1.44 × 10−5 | 2.22 × 10−5 | 1.68 × 10−5 | 1.21 × 10−4 | 8.28 × 10−7 | 2.86 × 10−5
Logistic Regression | 5.93 × 10−8 | 1.68 × 10−5 | 6.25 × 10−6 | 3.03 × 10−5 | 2.81 × 10−7 | 2.86 × 10−5 | 3.25 × 10−5 | 3.14 × 10−5
GMM | 1.09 × 10−5 | 2.21 × 10−4 | 1.76 × 10−5 | 5.66 × 10−5 | 3.02 × 10−5 | 7.57 × 10−5 | 3.84 × 10−5 | 1.23 × 10−5
BDLC | 2.92 × 10−8 | 5.04 × 10−5 | 1.82 × 10−4 | 4.22 × 10−5 | 3.25 × 10−5 | 2.58 × 10−5 | 1.85 × 10−5 | 7.42 × 10−5
SDC | 6.56 × 10−8 | 1.72 × 10−5 | 1.69 × 10−4 | 2.5 × 10−5 | 5.76 × 10−6 | 2.25 × 10−4 | 1.52 × 10−5 | 3.72 × 10−5
SVM (RBF) | 4.36 × 10−5 | 4.88 × 10−6 | 7.4 × 10−5 | 8.13 × 10−6 | 6.56 × 10−5 | 1.02 × 10−5 | 1.86 × 10−7 | 1.7 × 10−6
Table 8. Selection of Optimum Parametric Values for Classifiers.
Classifiers | Description
NLR | Set distribution as N(0, σ²), g < 0.2, H < 0.014 with τ = 1; Convergence Criterion: MSE
LR | θᵀ < 0.6, β = 0.01; Convergence Criterion: MSE
LoR | Threshold Hθ(x) = 0.5; Criterion: MSE
GMM | Mean and covariance of the input samples with tuning parameters as in Expectation Maximization: test point likelihood probability 0.15, cluster probability 0.6, convergence rate 0.6; Criterion: MSE
BLDC | Prior probability P(x): 0.5, class means µx = 0.8 and µy = 0.1; Criterion: MSE
SDC | C: 0.5, Coefficient of the kernel function (gamma): 10, Class weights: 0.5; Convergence Criterion: MSE
SVM–RBF | C: 1, Coefficient of the kernel function (gamma): 100, Class weights: 0.86; Convergence Criterion: MSE
Table 9. Average Performance of Classifiers with Different DM Techniques Without Feature Selection Methods.
Dimensionality Reduction | Classifiers | Accuracy (%) | Recall | Precision | F1 Score (%) | MCC | Error Rate (%) | Jaccard Metric (%) | Kappa
Detrend Fluctuation Analysis (DFA)NLR55.71426034.285743.63630.126444.285727.90690.1142
Linear Reg55.71425533.333341.50940.099544.285726.19040.0920
Logistic Regression55.71426034.285743.63630.126444.285727.90690.1142
GMM55.71425533.333341.50940.099544.285726.19040.0920
BDLC58.57146036.363645.28300.162841.428529.26820.1506
SDC54.28575532.352940.74070.081345.714225.58130.0743
SVM (RBF)57.14287036.842148.27580.199542.857131.81810.1732
Chi2pdfNLR61.4285704050.9090.252938.571434.14630.2285
Linear Reg57.14287036.842148.27580.199542.857131.81810.1732
Logistic Regression54.28575532.352940.74070.081345.714225.58130.0743
GMM54.28575532.352940.74070.081345.714225.58130.0743
BDLC52.85715030.303037.73580.036147.142823.25580.0334
SDC55.71425533.333341.50940.099544.285726.19040.0920
SVM (RBF)65.71425542.307647.8260.233734.285731.42850.2293
Firefly AlgorithmNLR52.85715030.303037.73580.036147.142823.25580.0334
Linear Reg52.85715531.4285400.063247.1428250.0571
Logistic Regression51.42855029.411737.0370.018048.571422.72720.0165
GMM54.28575532.352940.74070.081345.714225.58130.0743
BDLC52.85715531.4285400.063247.1428250.0571
SDC52.85715030.303037.73580.036147.142823.25580.0334
SVM (RBF)54.28576033.333342.85710.108445.714227.27270.0967
Cuckoo SearchNLR54.28575031.2538.46150.054445.714223.80950.0508
Linear Reg57.14285534.37542.30760.117842.857126.82920.1101
Logistic Regression57.14285534.37542.30760.117842.857126.82920.1101
GMM55.71425533.333341.50940.099544.285726.19040.0920
BDLC58.57146036.363645.28300.162841.428529.26820.1506
SDC57.14285534.37542.30760.117842.857126.82920.1101
SVM (RBF)65.71426042.8571500.258134.285733.33330.25
Table 10. Average Performance of Classifiers with Different DM Techniques With PSO Feature Selection Method.
Dimensionality Reduction | Classifiers | Accuracy (%) | Recall | Precision | F1 Score (%) | MCC | Error Rate (%) | Jaccard Metric (%) | Kappa
Detrend Fluctuation Analysis (DFA)NLR84.2857856875.55550.650515.714260.71420.6418
Linear Reg60804053.33330.29214036.36360.2461
Logistic Regression81.42857565.217369.76740.567418.571453.57140.5645
GMM57.14286035.294144.44440.144642.857128.57140.1322
BDLC65.71428544.736858.62060.389934.285741.46340.3385
SDC74.28578553.12565.38460.498725.714248.57140.4661
SVM (RBF)82.85719064.2857750.645417.1428600.625
Chi2pdfNLR65.71428544.7368458.62060.389934.285741.46340.3385
Linear Reg87.14289072800.716512.857166.66660.7069
Logistic Regression606037.546.15380.181340300.1694
GMM84.28577571.428573.17070.621015.714257.69230.6206
BDLC65.71425542.307647.82600.233734.285731.42850.2293
SDC81.42856568.421066.66660.538418.5714500.5380
SVM (RBF)91.42859081.818185.71420.79798.57142750.7961
Firefly AlgorithmNLR62.85718542.556.66660.356037.142839.53480.3
Linear Reg77.14289056.2569.23070.562222.857152.94110.5254
Logistic Regression55.71426034.285743.63630.126444.285727.90690.1142
GMM75.71428554.838766.66660.518324.2857500.4892
BDLC58.57147037.837849.12280.217141.428532.55810.1912
SDC62.85715038.461543.47820.168237.142827.77770.1651
SVM (RBF)75.71428554.838766.66660.518324.2857500.4892
Cuckoo SearchNLR72.85716052.1739155.81390.365427.142838.70960.3636
Linear Reg62.85717041.1764751.85180.271137.1428350.2479
Logistic Regression806565650.512048.14810.51
GMM57.14286536.111146.42850.171742.857130.23250.1532
BDLC707048.275857.14280.366830400.3524
SDC68.57145545.8333500.276031.428533.33330.2735
SVM (RBF)81.42858562.962972.34040.603218.571456.66660.5882
Table 11. Average Performance of Classifiers with Different DM Techniques with Harmonic Search Feature Selection Method.
Dimensionality Reduction | Classifiers | Accuracy (%) | Recall | Precision | F1 Score (%) | MCC | Error Rate (%) | Jaccard Metric (%) | Kappa
Detrend Fluctuation Analysis (DFA)NLR78.57148059.259268.08510.538221.428551.61290.5248
Linear Reg52.85715030.303037.73580.036147.142823.25580.0334
Logistic Regression78.57148059.259268.08510.538221.428551.61290.5248
GMM54.28575532.352940.74070.081345.714225.58140.0743
BDLC64.28576541.935450.98030.263735.714234.21050.2489
SDC75.71428554.838766.66670.518324.2857500.4892
SVM (RBF)85.71428570.833377.27270.675714.285762.96290.6698
Chi2pdfNLR61.42856539.393949.05660.226238.571432.50.2092
Linear Reg72.85717551.724161.22440.431027.142844.11760.4140
Logistic Regression708048.484860.37730.41623043.24320.3849
GMM64.28576041.379348.97950.238435.714232.43240.2290
BDLC67.14286544.827553.06120.302632.857136.11110.2907
SDC71.4285705058.33330.387228.571441.17640.375
SVM (RBF)84.2857856875.55560.650515.714260.71420.6418
Firefly AlgorithmNLR55.71425533.333341.50940.099544.285726.19040.0920
Linear Reg55.71425533.333341.50940.099544.285726.19040.0920
Logistic Regression71.4285855062.96290.460928.571445.94590.4214
GMM58.57146036.363645.28300.162841.428529.26820.1506
BDLC75.7142705662.22220.452524.285745.16120.4465
SDC54.28575031.2538.46150.054445.714223.80950.0508
SVM (RBF)85.71428072.727276.19040.661714.285761.53840.6601
Cuckoo SearchNLR62.85716540.625500.244837.142833.33330.2288
Linear Reg71.4285705058.33330.387228.571441.17640.375
Logistic Regression71.4285705058.33330.387228.571441.17640.375
GMM808061.538469.56520.56092053.33330.5504
BDLC58.57146036.363645.28300.162841.428529.26820.1506
SDC68.57147046.6666560.346831.428538.88890.3304
SVM (RBF)909078.260883.72090.769410720.7655
Table 12. Computational Complexity of the Classifiers for Different Dimensionality Reduction Method without Feature Selection Methods.
Classifiers | DFA DR Method | Chi2 pdf DR Method | Firefly Algorithm DR Method | Cuckoo Search DR Method
NLR | O(2n² log 2n) | O(2n² log 2n) | O(2n⁴ log 2n) | O(2n⁴ log 2n)
Linear Reg | O(2n log 2n) | O(2n² log 2n) | O(2n³ log 2n) | O(2n³ log 2n)
Logistic Regression | O(2n² log 2n) | O(2n² log 2n) | O(2n³ log 2n) | O(2n³ log 2n)
GMM | O(2n³ log 2n) | O(2n³ log 2n) | O(2n⁴ log 2n) | O(2n⁴ log 2n)
BDLC | O(2n³ log 2n) | O(2n³ log 2n) | O(2n⁴ log 2n) | O(2n⁴ log 2n)
SDC | O(2n² log 2n) | O(2n² log 2n) | O(2n⁴ log 2n) | O(2n⁴ log 2n)
SVM (RBF) | O(2n² log 4n) | O(2n² log 4n) | O(2n⁴ log 4n) | O(2n³ log 4n)
Table 13. Computational Complexity of the Classifiers for Different Dimensionality Reduction Method with PSO Feature Selection Method.
Classifiers | DFA DR Method | Chi2 pdf DR Method | Firefly Algorithm DR Method | Cuckoo Search DR Method
NLR | O(2n⁴ log 2n) | O(2n⁴ log 2n) | O(2n⁶ log 2n) | O(2n⁶ log 2n)
Linear Reg | O(2n³ log 2n) | O(2n⁴ log 2n) | O(2n⁵ log 2n) | O(2n⁵ log 2n)
Logistic Regression | O(2n⁴ log 2n) | O(2n⁴ log 2n) | O(2n⁴ log 2n) | O(2n⁵ log 2n)
GMM | O(2n⁵ log 2n) | O(2n⁵ log 2n) | O(2n⁶ log 2n) | O(2n⁶ log 2n)
BDLC | O(2n⁵ log 2n) | O(2n⁵ log 2n) | O(2n⁶ log 2n) | O(2n⁶ log 2n)
SDC | O(2n⁴ log 2n) | O(2n⁴ log 2n) | O(2n⁴ log 2n) | O(2n⁴ log 2n)
SVM (RBF) | O(2n⁴ log 2n) | O(2n⁴ log 4n) | O(2n⁶ log 4n) | O(2n⁵ log 4n)
Table 14. Computational Complexity of the Classifiers for Different Dimensionality Reduction Method with Harmonic Search Feature Selection Method.
Classifiers | DFA DR Method | Chi2 pdf DR Method | Firefly Algorithm DR Method | Cuckoo Search DR Method
NLR | O(2n³ log 2n) | O(2n³ log 2n) | O(2n⁴ log 2n) | O(2n⁴ log 2n)
Linear Reg | O(2n² log 2n) | O(2n³ log 2n) | O(2n⁵ log 2n) | O(2n⁵ log 2n)
Logistic Regression | O(2n³ log 2n) | O(2n³ log 2n) | O(2n⁴ log 2n) | O(2n⁴ log 2n)
GMM | O(2n⁴ log 2n) | O(2n⁵ log 2n) | O(2n⁵ log 2n) | O(2n⁵ log 2n)
BDLC | O(2n⁴ log 2n) | O(2n⁴ log 2n) | O(2n⁵ log 2n) | O(2n⁵ log 2n)
SDC | O(2n³ log 2n) | O(2n³ log 2n) | O(2n³ log 2n) | O(2n⁴ log 2n)
SVM (RBF) | O(2n⁴ log 2n) | O(2n⁴ log 2n) | O(2n⁴ log 2n) | O(2n⁴ log 2n)
Table 15. Comparison with Previous Works.
S. No | Author (With Year) | Description of the Population | Data Sampling | Machine Learning | Parameter Accuracy (%)
1Kumar et al. (2017)
[52]
Diagnosis lab, Warangal—IN.
Diabetes: 650
N-fold (N = 10) cross validationSupport Vector Machine,
Naive Bayes,
K-nearest neighbor
C4.5 Decision tree
Accuracy: 69, 67, 70, 74
2Olivera et al. (2017) [53]Diabetes: 12,447 unknown:
1359
age: 35–74
Training set (70%) test set
(30%) tenfold cross-validation
Logistic Regression
Artificial Neural Network
K-nearest neighbor
Naïve Bayes
Balanced Accuracy: 69.3, 69.47, 68.74, 68.95
3Alghamdi et al. (2017) [54]Total: 32,555 diabetes: 5099 imbalancedN-fold cross validationNaïve Bayes tree, random forest, and logistic model tree, j48 decision treeAccuracy: 83.9, 84.1, 79.9, 84.3, 84.1
4Xie et al. (2017) [55]Total: 21,285 diabetes: 1124 age: 35–65Training set (75%) test set (25%)K2 structure-learning algorithmAccuracy = 82.48
5Sarwar et al. (2018) [56]Pima Indians Diabetes Dataset = 768Training set (70%) test set
(30%) tenfold cross-validation
K nearest neighbors, Naïve Bayes,
support vector machine,
decision tree,
Logistic Regression,
random forest
Accuracy: 77, 74, 77, 71, 74, 71
6Zou et al. (2018) [57]Physical Examination data in Luzhou, China. Total data: 164431Fivefold cross-validationRandom forest
J48 decision tree
Deep Neural Network
Accuracy: 81, 79, 78
7Perveen et al. (2019) [58]Total: 667, 907 age: 22–74 diabetes: 8.13% imbalanceK-medoids under samplingJ48 decision tree,
Naïve Bayes
Accuracy: 88.3, 87.3, 83.6, 82.6
8Yuvaraj et al. (2019) [59]Total: 75,664Training set (70%) test set (30%)Decision tree Naïve Bayes random forestAccuracy: 88
9Jakka et al. (2019) [60]Pima Indians Diabetes datasetNoneK nearest neighbor, decision tree, Naive Bayes, support vector machine, Logistic Regression, random forestAccuracy: 73, 70, 75, 66, 78, 74
10Radja et al. (2019) [61]Total: 768
diabetes: 500
control: 268
Tenfold cross-validationNaive Bayes,
support vector machine,
decision table,
J48 decision tree
Accuracy: 80, 79, 76, 79
11Xiong et al. (2019) [62]Total: 11845 diabetes: 845 age: 20–100Training set (60%) test set (20%) tenfold cross-validation set (20%)Multilayer perceptron, Ada-Boost,
Random Forest,
Support Vector Machine,
Gradient boosting
Accuracy: 87, 86, 86, 86, 86
12Dinh et al. (2019) [63]Case 1: 21,131 diabetes: 5532
case 2: 16,426 prediabetes:
6482
Training set (80%) test set
(20%) tenfold cross-validation
Support vector machine,
random forest, gradient boosting, Logistic Regression
Accuracy: 89, 84, 86, 72
13Yang et al. (2020) [64]Total =8057 age: 20–89 imbalanced
Non-Diabetic = 6721
Diabetic = 1336
Training set: (80%,
2011–2014), test set: (20%,
2011–2014) and validation set:
(2015–2016) fivefold cross-validation
Linear discriminant analysis, support vector machine random forestAccuracy: 75, 74, 74
14Muhammad et al. (2020) [65]Total: 383 age: 1–150 diabetes: 51.9%Tenfold cross-validationLogistic Regression
Support vector machine
K-nearest neighbor
Random forest
Naive Bayes
Gradient boosting
Accuracy: 81, 85, 82, 89, 77, 86
15Lam et al. (2021) [66]Control: 19,852 diabetes: 3103 age: 40–69Tenfold cross-validationRandom forest
Logistic Regression
extreme gradient boosting GBT
Accuracy: 86
16De Silva et al. (2021) [67]Total: 16,429 diabetes: 5.6%
age: >20
Training set (30%) validation (30%) test set (40%)Logistic RegressionAccuracy: 62
17Kim et al. (2021) [68]Total: 3889 diabetes: 746 age: 40–69Fivefold cross-validationDeep neural network, Logistic Regression, decision treeAccuracy: 80, 80, 71
18Ramesh et al. (2021) [69]Pima IndiansTenfold cross-validationSupport vector machineAccuracy: 83
19This articleNordic islet transplantation programme10-fold cross-validationLR, NLR, LoR, GMM, BLDC, SDC, SVM (RBF)Accuracy: 91
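For context on the tenfold cross-validation protocol reported in the last row of Table 15, the sketch below shows a generic 10-fold evaluation of an RBF-kernel SVM with scikit-learn. It is only a minimal illustration: the synthetic data from make_classification stands in for the dimensionality-reduced gene features, and the hyperparameters (C, gamma) are assumed defaults rather than the tuned values used in this study.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the reduced microarray features
# (binary target: diabetic vs. non-diabetic).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# RBF-kernel SVM; C and gamma are illustrative defaults, not tuned values.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))

# Stratified 10-fold cross-validation, averaging the per-fold accuracies.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(f"10-fold CV accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```

Stratified folds keep the diabetic/non-diabetic class proportions roughly constant across folds, which matters when the two classes are not equally represented.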
Table 16. Comparison of classifier performance with different datasets.

| S. No. | Authors | Data Set Used | Machine Learning Models/Classifiers | Accuracy (%) |
|---|---|---|---|---|
| 1 | Methaporn et al. (2023) [70] | Department of Medical Services, Bangkok, Thailand, 2019–2021 | Support vector machine (SVM) | 81.1 |
| | | | Decision tree (DT) | 81.3 |
| | | | k-Nearest neighbour (kNN) | 81.7 |
| | | | Random forest | 88.2 |
| 2 | Pannapa C et al. (2021) [71] | National Institute of Diabetes and Digestive and Kidney Diseases—PIMA | Random forest | 76.30 |
| | | | SVM (L) | 77.08 |
| | | | SVM (P) | 74.74 |
| | | | SVM (RBF) | 75.78 |
| | | | kNN | 74.87 |
| | | Thyroid—University of California Irvine (UCI) | Random forest | 96.74 |
| | | | SVM (L) | 96.27 |
| | | | SVM (P) | 91.16 |
| | | | SVM (RBF) | 95.81 |
| | | | kNN | 96.28 |
| 3 | Kumari D et al. (2021) [72] | PIMA Indian dataset—May 2008 | Logistic Regression | 74.89 |
| | | | Support vector machine | 74.08 |
| | | | k-Nearest neighbour | 71.92 |
| | | | Naive Bayes | 74.12 |
| | | | Decision tree | 71.42 |
| | | | Random forest | 77.48 |
| | | | AdaBoost | 75.32 |
| | | | XGBoost | 75.75 |
| | | | Gradient boost | 75.32 |
| | | | CatBoost | 75.32 |
| 4 | Hui Yang et al. (2021) [73] | Physical examination data from the EMR of Luzhou Municipal Health Commission, China, 2011–2017 | Linear regression | 80 |
| | | | XGBoost | 72 |
| | | | Random forest | 80 |
| 5 | Amin Ul Haq et al. (2020) [74] | Diabetes dataset from the hospital in Frankfurt, Germany | ID3 | 99 |
| | | | AdaBoost | 98.5 |
| | | | Random forest | 98.3 |