Article

Predicting Genetic Disorder and Types of Disorder Using Chain Classifier Approach

1 Department of Computer Science, Khwaja Fareed University of Engineering and Information Technology, Rahim Yar Khan 64200, Pakistan
2 School of Computer Science, University College Dublin, D04 V1W8 Dublin, Ireland
3 Department of Signal Theory and Communications and Telematic Engineering, University of Valladolid, Paseo de Belén 15, 47011 Valladolid, Spain
4 Department of Computer Science, Electronics and Telecommunications, University of Deusto, 48007 Bilbao, Spain
5 College of Engineering and Technology, Miami Dade College, Miami, FL 33132, USA
6 Department of Information and Communication Engineering, Yeungnam University, Gyeongsan 38541, Republic of Korea
* Authors to whom correspondence should be addressed.
Genes 2023, 14(1), 71; https://doi.org/10.3390/genes14010071
Submission received: 5 December 2022 / Revised: 16 December 2022 / Accepted: 16 December 2022 / Published: 26 December 2022
(This article belongs to the Special Issue Genetics of Blood Disorders)

Abstract

Genetic disorders are the result of mutations in the deoxyribonucleic acid (DNA) sequence, which can be acquired or inherited from parents. Such mutations may lead to fatal diseases such as Alzheimer’s, cancer, hemochromatosis, etc. Recently, artificial intelligence-based methods have shown remarkable success in the prediction and prognosis of different diseases, and their potential can be utilized to predict genetic disorders at an early stage using genome data for timely treatment. This study focuses on the multi-label multi-class problem and makes two major contributions to genetic disorder prediction. First, a novel feature engineering approach is proposed in which the class probabilities from an extra tree (ET) and random forest (RF) are joined to make a feature set for model training. Second, the study utilizes the classifier chain approach, in which multiple classifiers are joined in a chain and each classifier uses the predictions of all preceding classifiers to make the final prediction. Because of the multi-label multi-class data, macro accuracy, Hamming loss, and α-evaluation score are used to evaluate performance. Results suggest that extreme gradient boosting (XGB) produces the best scores, with a 92% α-evaluation score and an 84% macro accuracy score. XGB also outperforms state-of-the-art approaches in terms of both performance and computational complexity.

1. Introduction

A genetic disorder is caused by a mutation in the genome or by a change in gene structure [1]. As the genome carries an organism's information, a change in the genome can result in a change in the structure or function of the organism [2]. Genes are made of deoxyribonucleic acid (DNA), and changes in the DNA sequence result in genetic disorders. Genome data contain important information and health care indicators that can be used to analyze the genetic disorders that cause diseases. A dedicated branch of bioinformatics, genomics, focuses on the study of genomes, their structure, abnormalities, etc. [3]. There are several types of genetic disorders, examined on the basis of the DNA structure [7]: single-gene inheritance disorders [4], chromosomal or mitochondrial genetic inheritance disorders [5], and complex or multifactorial genetic inheritance disorders [6]. A single-gene disorder is caused by a mutation in a single gene in the DNA. A chromosomal disorder is caused when a chromosome, or a part of one, is deleted or replaced in the DNA structure. Complex disorders are caused by mutations in more than one gene.
The genes present in the DNA carry important information that governs the formation of different types of proteins [8]. Some changes in the structural properties of a gene can result in the formation of an abnormal protein, which does not work properly in the cell. Such abnormalities lead to different genetic disorders such as cancer [9], diabetes, Alzheimer’s, etc. In 2020, 15,000 people were diagnosed with syndrome B disorder; about 100,000 children currently have syndrome B, while 12,000 people died from syndrome C around the world [10]. Approximately 2% to 5% of all childbirths are diagnosed with genetic disorders [11], which may account for 5% to 50% of deaths in childhood [12]. Genome data contain useful information and several health-related indicators that can be used to predict genetic disorders. However, given the complex nature of DNA data, the number of features, and the volume of the data, manual prediction is laborious, error-prone, and inefficient.
Recently, machine learning-based models have achieved great success in different fields, including prognosis, prediction, medicine, automation, etc. [13]. Such models are trained using good-quality historical data; a machine learning model finds relationships or patterns in the data to make predictive decisions. Such models can provide automated prediction and can also play an assistive role for medical experts, given the sensitivity and importance of the task. The choice of machine learning technique depends on the type of real-world problem [14]. Machine learning techniques have a large variety of potential applications in bioinformatics [15]. The exponential growth of biological data raises the problem of management and useful information extraction, and the transformation of heterogeneous data into biological knowledge is the main challenge in computational biology [16]. Machine learning models are applied to make predictive decisions based on large gene sequences and to manage large biological datasets. Many biological domains currently use machine learning approaches for knowledge extraction; applications include analysis of genome-wide associations [17], X-rays [18], enzyme function prediction [19], protein function prediction, and many more.
Machine learning models can help with precision medicine; however, they are often limited by low accuracy. When a single feature extraction approach is used, the sensitivity and specificity of the models are lower than expected, which requires further improvement. This study contributes to improving the predictive capabilities of machine learning models and makes the following key contributions:
  • Genetic exploratory data analysis (GEDA) is performed to obtain useful insights and discover important information from the genome data. Various attributes and their distributions are investigated to analyze their trends regarding different disorders.
  • A novel approach to extracting features from the genome data is designed where the extra tree (ET) and random forest (RF) are used to extract features that are combined to enrich the feature set.
  • A chain classifier approach is adopted in this study to obtain higher prediction accuracy. Machine learning models, equal in number to the classes, are placed in the chain, and each classifier predicts in the specified sequence. Each succeeding model uses the predictions of all preceding models as input to make its own prediction.
  • Eight machine learning models are used for performance analysis, including logistic regression (LR), multi-layer perceptron (MLP), decision tree classifier (DTC), random forest classifier (RFC), k nearest neighbors (KNN), extra tree classifier (ETC), extreme gradient boosting (XGB), and support vector classifier (SVC). Hyperparameter fine-tuning is performed for performance optimization. In addition, data balancing is applied to the genome data to reduce the probability of model overfitting.
  • Extensive experiments are performed to analyze precision, recall, accuracy, and F1 score. Moreover, performance comparison with existing studies is carried out in terms of training time, macro accuracy, Hamming loss, and α evaluation score.
The remainder of this article is structured as follows: the related literature is examined in Section 2. Section 3 contains a description of the methodology, multi-label multi-class chain classifier approach, and the proposed ETRF technique. Results and evaluations of the proposed approach are examined in Section 4. Finally, the conclusions are elaborated in Section 5.

2. Related Work

The literature related to the proposed research approach is examined in this section. Previously applied techniques and proposed approaches are analyzed with respect to the datasets used, limitations, applied approaches, and performance results.
Alzheimer’s, one of the diseases caused by genetic disorders, has been investigated by several researchers [20]. For example, [21] presents a stacked machine learning model for Alzheimer’s prediction. The AD genetic dataset of the neuroimaging project [22] is utilized for experiments. Results suggest that the proposed stacked model can obtain an accuracy score of 93%, proving the effectiveness of stacking-based models for predicting Alzheimer’s disease. Similarly, classification of Alzheimer’s disease is performed using neuroimaging initiatives in [22]. Experiments are performed using the genetic dataset [23]. For this purpose, machine learning-based feature selection from the gene data is utilized, with age and number of education years added as additional features; non-genetic factors are also considered for Alzheimer’s classification. The study proposes a novel XGBoost model for classification [24]. The proposed approach achieves 64% for the area under the curve (AUC).
The prediction of complex genes using supervised machine learning methods is carried out in [25]. Complex genes lead to a large number of diseases [26]. The GEO dataset is utilized for model testing. The study develops a machine learning-based genetic disease analyzer (GDA) using principal component analysis (PCA), Naive Bayes (NB), random forest (RF), and decision tree (DT) techniques. The proposed approach achieves a 98% accuracy score.
The study [27] uses a supervised machine learning approach to predict dementia, cancer, and diabetes. The multifactorial genetic inheritance disorder multiclass dataset is used for the experiments. The employed techniques are KNN and SVM, where SVM proves superior with a 92% accuracy score. Inflammatory bowel disease prediction using machine learning techniques is proposed in [28]. The metagenomic dataset on inflammatory bowel disease multi-omics is utilized for model building and experimental evaluation. Several machine learning classifiers are applied, and RF outperforms the others with a 90% accuracy score.
The study [29] uses machine learning techniques to predict COVID-19 infection and related diseases. RF and neural networks are applied to a genetic SNP mutation dataset; RF shows superior performance with a 92% accuracy score. The prediction of familial hypercholesterolemia, a genetic disorder of lipid metabolism, using machine learning techniques is performed in [30]. A virtual genetic, clinical test of familial hypercholesterolemia is performed for experimental evaluation. Of the machine learning models used, the gradient boosting classifier shows an 83% accuracy score. The study [31] proposes a machine learning-based algorithm, DOMINO, to predict dominant (monoallelic) mutations in genes for Mendelian disorders. DOMINO is based on linear discriminant analysis. Experimental results reveal a 92% accuracy, which is better than existing approaches.
In biomedical areas, gene-based disease prediction is a prominent issue and several researchers are working in this domain [32]. A machine learning-based model for the prediction of gene diseases is proposed in [33]. The study focuses on the binary class problem and classifies disease genes versus healthy genes. For the experiments, 12 representative machine learning-based models are examined in terms of comparison and interpretability. Table 1 contains a summary of the related work.
The prediction of autism spectrum disorder from genome data is investigated in [35]. A novel gene selection technique is utilized to find candidate biomarker genes [38]. Phenotypic group associative genetic markers are utilized for the prediction task. Gene expression carries the species' genetic information, and gene patterns show the relationship of genes associated with numerous diseases. The optimal features are identified by regularized genetic algorithms. The proposed approach achieves a 97% accuracy score.
Alzheimer’s disease prediction is carried out in [36], where a machine learning-based model is designed. Next-generation sequencing techniques are utilized to identify disease biomarkers that help early diagnosis. The proposed method achieves an 81% accuracy score using 10-fold cross-validation. A network-based technique named brainMI is proposed for brain disease gene prediction [37]. The prediction is performed by integrating the brain connectome and the molecular network, and the brain connectome data are utilized for model building. A support vector machine model is utilized to predict brain disease genes. The proposed model achieves a 72% accuracy score.

3. Methods

The multi-label multi-class genomes and genetics dataset is utilized for the proposed approach. Figure 1 shows the steps followed. GEDA is applied to reveal the factors that cause genetic disorders and to obtain useful insights regarding genes. Feature engineering techniques are employed for feature-data mapping and to select high-importance features to achieve better model performance. Data balancing of the genetic disorder classes is applied so that the learning models are trained on an equal number of samples per class, which also helps to improve performance. The novel ETRF feature extraction technique is applied to enrich the feature set, which is later used for training all the models.

3.1. Genomes Dataset

The genomes and genetics dataset is based on the medical information of children and adult patients who have genetic disorders [39]. The dataset is multi-label multi-class: the first label is the genetic ‘disorder’ and the second sub-label is the ‘disorder subclass’. The dataset contains a total of 44 attributes; dataset-related information is summarized in Table 2.

3.2. Genetic Exploratory Data Analysis (GEDA)

GEDA is applied to the genomes dataset to find hidden patterns and important information that may be helpful in predicting genetic disorders. GEDA is based on several graphs, such as pair plots, 3-D data distribution analyses, bar charts, and scatter plots, and it proves helpful for finding statistical insights in the gene data.
The data distributions of the genetic disorder label and the disorder sublabel are analyzed and visualized in Figure 2. The analysis shows that the dataset has an unequal distribution. The genetic disorder attribute has three classes: single-gene inheritance diseases, mitochondrial genetic inheritance disorders, and multifactorial genetic inheritance disorders. The mitochondrial genetic inheritance disorders class has the highest number of samples, while the multifactorial genetic inheritance disorders class has the lowest. The subclass category has nine classes: Leber’s hereditary optic neuropathy, diabetes, Leigh syndrome, cancer, cystic fibrosis, Tay-Sachs, hemochromatosis, mitochondrial myopathy, and Alzheimer’s. Leber’s hereditary optic neuropathy and diabetes have the lowest numbers of samples; the number of samples for Tay-Sachs is also comparatively low.
The 3-D scatter distribution of the white blood cell count (thousand per microliter) and blood cell count (mcL) features of the genome data against the genetic disorder label is analyzed in Figure 3. The analysis demonstrates that when the white blood cell count is less than zero, genetic disorders of all types are found. When the white blood cell count is between 0 and 2, there is no chance of a type 0 (mitochondrial) genetic disorder; however, type 1 (multifactorial) and type 2 (single-gene) disorders are found in patients. Blood cell count (mcL) values of 4.2 or less indicate no genetic disorders. White blood cell count values from 2 to 12 together with blood cell count values from 4.3 to 5.6 show that genetic disorders of all types are found in patients. This analysis provides the significant ranges of blood cell values associated with genetic disorder diseases.
Figure 3b shows the scatter plot of blood cell count (mcL) and white blood cell count (thousand per microliter) against the disorder subclass. No subclass disease is found when the blood cell count ranges from 4.2 to 4.4. Leber’s hereditary optic neuropathy is found when the blood cell count varies between 4.4 and 4.8. White blood cell count values of 0 to 2 show the lowest chances of a genetic disorder subclass. Blood cell count values above 4.8 show the occurrence of all subclass disorder diseases. This analysis provides the significant blood cell values involved in the sub-types of genetic disorders and diseases.
The analysis of inherited genes that cause genetic disorders is given in Figure 4. The analyzed attributes are the maternal gene, the paternal gene, genes from the mother’s side, and genes inherited from the father. This analysis demonstrates that genes are prominent factors in causing genetic disorders: when the maternal and paternal gene values are 0 or 1, a mitochondrial disorder has a higher probability while a single-gene disorder has a lower chance of occurring. Similarly, Figure 4c,d demonstrate that a mitochondrial disorder has a higher chance when the values of the genes from the mother’s side and those inherited from the father are 0 or 1.
The gene analysis for the disorder subclass is shown in Figure 5. It demonstrates that diabetes occurs frequently when the maternal and paternal genes have the value 0 or 1, while Leigh syndrome has low chances for all gene values. This analysis is based on the 8 disorder subclass diseases.
The age factor of patients is analyzed by genetic disorder category and visualized in Figure 6. The ages of the mother, the father, and the patient are examined. The analysis demonstrates that there are high chances of genetic disorders when the mother’s age is between 20 and 60 years; when the mother’s age is less than 20 years, the probability of a genetic disorder is low. A high chance of genetic disorder is associated with a father's age between 20 and 70 years. Analysis of the patient's age shows that genetic disorder diseases occur within the first 15 years of life. This analysis shows that age is an important factor that can be used to study genetic disorders.

3.3. Data Normalization and Feature Engineering

Feature engineering is a crucial process for machine learning models [40]. Feature engineering techniques are applied to encode and map the data of the genomes dataset. The best-fit optimal features are selected for training and testing the learning models; important features are retained while unimportant and irrelevant features are dropped. In the current dataset, several features do not contribute to gene disorder prediction and can be dropped to reduce the feature space, which improves both the computational complexity and the performance of the models.
Feature importance is determined using the DT model, and feature correlation is shown in Figure 7. Irrelevant features or features with low importance are not considered for the experiments. The features ‘patient Id’, ‘patient first name’, ‘family name’, ‘father’s name’, ‘institute name’, ‘location of institute’, ‘place of birth’, and ‘parental consent’ are dropped due to their low or no contribution to predicting genetic disorders. The features ‘test 1’, ‘test 2’, ‘test 3’, ‘test 5’, and ‘autopsy shows birth defect (if applicable)’ are dropped due to low feature importance values.
The null values in the dataset are filled with zeros. The selected features are encoded with proper categorical data values. The features ‘genes in mother’s side’, ‘inherited from father’, ‘maternal gene’, ‘paternal gene’, ‘assisted conception IVF/ART’, ‘history of anomalies in previous pregnancies’, ‘folic acid details (peri-conceptional)’, and ‘H/O serious maternal illness’ contain the values ‘Yes’ and ‘No’, which are mapped to 1 and 0, respectively. The features ‘H/O radiation exposure (X-ray)’ and ‘H/O substance abuse’ contain the values ‘Yes’, ‘No’, and ‘Not applicable’, which are mapped to 1, 0, and −1, respectively. The feature ‘status’ contains the values ‘deceased’ and ‘alive’, mapped to 0 and 1, respectively. The feature ‘respiratory rate (breaths/min)’ contains the values ‘normal (30–60)’ and ‘Tachypnea’, mapped to 0 and 1, respectively. The feature ‘heart rate (rates/min)’ contains the values ‘normal’ and ‘Tachycardia’, mapped to 0 and 1, respectively. The feature ‘follow-up’ contains the values ‘Low’ and ‘High’, mapped to 0 and 1, respectively. The feature ‘gender’ contains ‘male’, ‘female’, and ‘ambiguous’, mapped to 0, 1, and 2, respectively. The feature ‘birth asphyxia’ contains the values ‘No record’, ‘Not available’, ‘No’, and ‘Yes’, which are mapped to 0, 0, 0, and 1, respectively. The feature ‘birth defects’ contains the values ‘singular’ and ‘multiple’, mapped to 0 and 1, respectively. The feature ‘blood test result’ contains the values ‘normal’ and ‘abnormal’, mapped to 0 and 1, respectively.
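A minimal pandas sketch of these mappings is shown below; the column names and file name are paraphrased from the dataset description and may differ in the raw file, so treat them as assumptions.

```python
import pandas as pd

df = pd.read_csv("train_genomes.csv")  # hypothetical file name

# Binary Yes/No attributes are mapped to 1/0, as described above.
yes_no_cols = ["Genes in mother's side", "Inherited from father",
               "Maternal gene", "Paternal gene"]
for col in yes_no_cols:
    df[col] = df[col].map({"Yes": 1, "No": 0})

# Three-valued history attributes: 'Not applicable' becomes -1.
for col in ["H/O radiation exposure (x-ray)", "H/O substance abuse"]:
    df[col] = df[col].map({"Yes": 1, "No": 0, "Not applicable": -1})

df["Gender"] = df["Gender"].map({"Male": 0, "Female": 1, "Ambiguous": 2})

# Null values are filled with zeros, as stated in Section 3.3.
df = df.fillna(0)
```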

3.4. Data Balancing

Dataset balancing is applied to achieve high accuracy from the applied learning techniques [41]. With data balancing, the learning models are trained on an equal number of samples per class, which leads to better results. Before balancing, the mitochondrial genetic inheritance disorders, multifactorial genetic inheritance disorders, and single-gene inheritance classes have 10,202, 2071, and 7664 samples, respectively. We balanced the dataset by randomly dropping samples of the other classes down to the lowest class count.
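A random undersampling sketch consistent with this description (the target column name is assumed) might look like:

```python
# Randomly reduce every class to the size of the smallest class
# (the multifactorial class, 2071 samples in this dataset).
min_count = df["Genetic Disorder"].value_counts().min()
balanced = (df.groupby("Genetic Disorder", group_keys=False)
              .apply(lambda g: g.sample(n=min_count, random_state=42)))
```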

3.5. Data Splitting

Data splitting divides the data into training and test sets, which reduces model overfitting and makes the model generalize. For the experiments, the dataset is split using different ratios as cross-validation to validate the performance of the machine learning techniques. The split ratios 0.7:0.3, 0.8:0.2, 0.85:0.15, and 0.9:0.1 are utilized for the genomes dataset to determine the most suitable split for achieving the best learning model.
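The splits can be produced with scikit-learn's train_test_split; a sketch assuming a feature matrix X and label matrix y:

```python
from sklearn.model_selection import train_test_split

# Evaluate each of the four split ratios used in the experiments.
for test_size in (0.30, 0.20, 0.15, 0.10):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42)
    # ...train and evaluate a model on this split...
```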

3.6. Applied Learning Techniques

Several machine learning models are applied to analyze the performance of the proposed feature engineering approach. Eight well-known machine learning models, which are reported to show good performance for tasks similar to genetic disorder prediction, are utilized. A brief overview of each model's architecture and working mechanism is provided here.
DTC is a supervised machine learning algorithm used for classification tasks [42]. DTC has a tree-like structure containing nodes and leaves: the inner nodes hold the data attributes and split on them, the outcome label is placed on the leaf nodes, and the root node is the topmost node. DTC algorithms construct decision trees from input data automatically, with the goal of identifying the optimal decision tree by reducing the generalization error. Attribute selection is a major challenge in DTC; two measures, information gain and the Gini index, are widely utilized. Information gain calculates the change in entropy and is computed as
$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{\upsilon \in \mathrm{Values}(A)} \frac{|S_{\upsilon}|}{|S|} \cdot \mathrm{Entropy}(S_{\upsilon})$$
where S represents a set of instances and attributes are indicated by A.
The Gini index measures how often a randomly chosen attribute would be incorrectly identified; an attribute with a lower Gini index is preferred. It is calculated as
$$\mathrm{Gini\ Index} = 1 - \sum_{j} p_j^2$$
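As a worked illustration of the two split measures (not code from the paper), entropy, the Gini index, and information gain can be computed as:

```python
import numpy as np

def entropy(labels):
    # Shannon entropy: -sum(p * log2(p)) over the class proportions.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # Gini index: 1 - sum(p_j ** 2) over the class proportions.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, subsets):
    # Gain(S, A) = Entropy(S) - sum(|S_v|/|S| * Entropy(S_v)).
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in subsets)

# A perfect split of [0, 0, 1, 1] into pure halves yields a gain of 1.0.
print(information_gain(np.array([0, 0, 1, 1]),
                       [np.array([0, 0]), np.array([1, 1])]))
```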
RFC is a supervised machine learning model based on multiple decision trees [43]. RFC takes predictions from multiple trees, and the final prediction is selected based on majority voting, known as the bagging approach [44]. RFC randomly chooses observations to build the decision trees. As an ensemble learning technique, RFC shows better results than individual classifiers: it reduces overfitting and improves classification performance.
ETC is another ensemble-based, bagged decision tree technique similar to RFC [45]. It uses the same parameters as RFC, yet proves superior to RFC as it reduces model variance. The key difference between ETC and RFC is in how the trees are built: ETC follows random split selection of values rather than bootstrap observations [46]. ETC utilizes a meta-estimator that fits randomized decision trees on sample data, which results in improved accuracy and reduced overfitting.
LR is a supervised statistical learning method used for classification [47]. For multi-label classification, the ordinal type of LR is utilized. LR predicts the dependent categorical variable using the independent variables. The sigmoid function is used to map the predicted output to probabilities, and the predicted class is determined by a threshold value. It can be expressed as
$$\log\left[\frac{y}{1-y}\right] = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_n x_n$$
MLP is a classification algorithm that uses a feedforward neural network [48]. MLP consists of multiple fully connected layers whose nodes are called neurons. The training process of MLP is iterative, and stochastic gradient descent is utilized to optimize the loss function. The output of a layer depends on its neurons and the network weights. Compared to several more complex models [49], MLP has shown superior performance on several tasks.
KNN is a non-parametric supervised learning technique [50] that uses data from the nearest neighbors to predict the class of a sample. KNN groups data based on points that lie near each other, and classification is performed based on the similarity of data points. As a lazy learner, KNN has no explicit training phase, but prediction is slow because distances to the stored training data must be computed. The similarity of data points is calculated using Euclidean distance or similar distance metrics [51].
XGB utilizes boosting techniques for the classification task [52]. XGB is flexible, efficient, and portable. It is based on the parallel gradient boosting tree technique to solve classification problems, and it uses improved regularization to reduce overfitting. A prediction from XGB can be made using
$$F_2(x) = \sigma\left(\beta_0 + \beta_1 h_1(x) + \beta_2 h_2(x)\right)$$
where $h_1$ and $h_2$ are the first two boosted trees and $\sigma$ is the sigmoid function.
SVC is a supervised learning algorithm mostly used for classification tasks [53]. SVC finds the best-fit or optimal hyperplane that separates the input data points into two or more classes by maximizing the margin between samples of different classes. The data points on either side of the classification line represent the categories, and the points closest to the hyperplane are the support vectors. Predictions using the hyperplane are calculated by the hypothesis function h, which can be represented as
$$h(x_i) = \begin{cases} +1 & \text{if } w \cdot x + b \geq 0 \\ -1 & \text{if } w \cdot x + b < 0 \end{cases}$$
The applied machine learning approaches are fully hyperparameter-tuned. An iterative tuning process is performed for the employed models to find the best-fit parameters, which help achieve better accuracy scores. A complete list of the model hyperparameters is provided in Table 3.
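The tuning loop itself is not detailed in the paper; a typical grid search sketch (illustrative grid only, the actual search spaces are listed in Table 3) is:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical grid; substitute the parameter ranges from Table 3.
param_grid = {"n_estimators": [100, 200, 300],
              "max_depth": [10, 50, None]}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```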

3.7. Multi-Label Multi-Class Chain Classifier Approach

The genomes and genetics dataset utilized in this study is multi-label multi-class: we solve the classification of the genetic disorder and the type of disorder as a subclass. For this purpose, the classifier chain approach is utilized to build multi-label multi-class machine learning techniques. The classifier chain preserves the label correlations in the dataset during the training and testing of the models. A connected chain of multiple classifiers is created for a machine learning model: each model predicts in the order specified by the chain, and the predictions of earlier models in the chain are incorporated as inputs by the later models [54,55]. The total number of classifiers in the chain is equal to the number of classes in the dataset used in this study [56]. The architecture of the classifier chain approach is shown in Figure 8. Macro accuracy, α-evaluation score, and Hamming loss are the evaluation metrics used for the multi-label multi-class data [57,58].
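scikit-learn ships a ClassifierChain for binary indicator labels; for the two multi-class targets here, a minimal hand-rolled chain (a sketch assuming integer-encoded targets y_train_disorder and y_train_subclass, not the paper's exact implementation) conveys the idea:

```python
import numpy as np
from xgboost import XGBClassifier

# First link: predict the genetic disorder from the features alone.
clf_disorder = XGBClassifier().fit(X_train, y_train_disorder)

# Second link: the subclass model sees the features plus the first
# model's prediction, as described for the chain above.
X_train_ext = np.column_stack([X_train, clf_disorder.predict(X_train)])
clf_subclass = XGBClassifier().fit(X_train_ext, y_train_subclass)

# Inference follows the same chain order.
pred_disorder = clf_disorder.predict(X_test)
pred_subclass = clf_subclass.predict(
    np.column_stack([X_test, pred_disorder]))
```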

3.8. Novel Proposed ETRF Feature Engineering Approach

The novel ETRF technique is analyzed in this section. The ETRF approach is formed by combining the ET and RF algorithms. In this research, ETRF is used as a feature extraction technique for building the learning models and predicting genetic disorders [59]. The feature set formation and extraction mechanism is visualized in Figure 9. The genome data samples are input to the ET and RF algorithms separately, and the predicted class probabilities are extracted from each. A hybrid feature set is formed by combining the extracted class probabilities, and this set is later used as input to the applied learning techniques for predicting the genetic disorder and type of disorder.
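A sketch of the ETRF idea, assuming a single target vector y_train for brevity: the predicted class probabilities of ET and RF are concatenated into the hybrid feature set.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

et = ExtraTreesClassifier(random_state=42).fit(X_train, y_train)
rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# The hybrid ETRF feature set: ET and RF class probabilities side by side.
train_features = np.hstack([et.predict_proba(X_train),
                            rf.predict_proba(X_train)])
test_features = np.hstack([et.predict_proba(X_test),
                           rf.predict_proba(X_test)])
```

In practice, using out-of-fold probabilities for the training split would reduce the leakage that comes from scoring the same samples the trees were fitted on.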

4. Results and Evaluations

4.1. Experimental Setup

Results and evaluations of the proposed research approach are examined in this section. Experiments are performed on an Intel i5-8265U CPU with 12 GB of random access memory (RAM) and an NVIDIA Tesla K80 graphics card. Python and scikit-learn [60] are utilized for building the machine learning models, which are used to predict the genetic disorders and types of genetic disorders.

4.2. Evaluation Metrics

Macro accuracy, α-evaluation score, recall, precision, Hamming loss, and F1 score are used as evaluation metrics. The following factors are used in the evaluation metrics:
  • True Positive (TP): the number of positive samples correctly classified by the model.
  • True Negative (TN): the number of negative samples correctly classified by the model.
  • False Positive (FP): the number of negative samples incorrectly classified as positive by the model.
  • False Negative (FN): the number of positive samples incorrectly classified as negative by the model.
For multi-label problems, label-based metrics are evaluated for each label and then averaged over all labels. The macro accuracy metric is computed on the individual class labels and then averaged over all classes. Multi-label macro accuracy is calculated as
$$\mathrm{Accuracy}_{macro} = \frac{1}{n} \sum_{i=1}^{n} \frac{\left| y^{(i)} \cap \hat{y}^{(i)} \right|}{\left| y^{(i)} \cup \hat{y}^{(i)} \right|}$$
where $n$ is the number of training instances, $y^{(i)}$ is the true label set, and $\hat{y}^{(i)}$ is the predicted label set.
Hamming loss calculates the proportion of incorrectly predicted labels to the total number of labels. For multi-label classification, the number of FNs and FPs per instance is computed and then averaged over the total number of instances. The mathematical expression of Hamming loss is
$$\mathrm{Hamming\ loss} = \frac{1}{nL} \sum_{i=1}^{n} \sum_{j=1}^{L} I\left(y_j^{(i)} \neq \hat{y}_j^{(i)}\right)$$
where $n$ is the number of training instances, $L$ is the number of labels, $y_j^{(i)}$ is the true value of label $j$ for instance $i$, and $\hat{y}_j^{(i)}$ is the predicted value.
For evaluating each multi-label prediction, the α-evaluation score is used as a generalized version of the Jaccard similarity. The α-evaluation score provides a principled way to evaluate the multi-label classification results of a learning approach. It is expressed as
$$\alpha\text{-}\mathrm{evaluation\ score} = \left(1 - \frac{\beta M_x + \gamma F_x}{\left| Y_x \cup P_x \right|}\right)^{\alpha}, \quad \alpha \geq 0,\ 0 \leq \beta, \gamma \leq 1$$
where $M_x$ is the number of FNs, $F_x$ is the number of FPs, $Y_x$ is the set of true labels (TPs and FNs), and $P_x$ is the set of predicted labels (TPs and FPs).
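The α-evaluation score is not available in scikit-learn; a minimal NumPy sketch following the formula above, assuming binary indicator matrices for the label sets (and scoring an instance with empty true and predicted label sets as 1), is:

```python
import numpy as np

def alpha_score(y_true, y_pred, alpha=1.0, beta=1.0, gamma=1.0):
    scores = []
    for yt, yp in zip(np.asarray(y_true), np.asarray(y_pred)):
        fn = np.sum((yt == 1) & (yp == 0))     # M_x: missed labels
        fp = np.sum((yt == 0) & (yp == 1))     # F_x: spurious labels
        union = np.sum((yt == 1) | (yp == 1))  # |Y_x union P_x|
        if union == 0:
            scores.append(1.0)  # nothing to predict and nothing predicted
            continue
        scores.append((1 - (beta * fn + gamma * fp) / union) ** alpha)
    return float(np.mean(scores))
```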
Precision and recall are also utilized as evaluation metrics. Precision is the proportion of samples predicted as positive that truly belong to the positive class, and recall is the proportion of all positive samples that are correctly predicted. They are calculated as
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
F1 score is based on the combination of precision and recall scores to measure model performance. F1 score is the harmonic mean of precision and recall scores and is calculated as
$$F1\ \mathrm{score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
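The label-based metrics have ready-made scikit-learn implementations; a sketch assuming y_test and y_pred are binary indicator matrices:

```python
from sklearn.metrics import (hamming_loss, precision_score,
                             recall_score, f1_score)

print("Hamming loss:", hamming_loss(y_test, y_pred))
print("Precision (macro):", precision_score(y_test, y_pred, average="macro"))
print("Recall (macro):", recall_score(y_test, y_pred, average="macro"))
print("F1 (macro):", f1_score(y_test, y_pred, average="macro"))
```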
The comparative accuracy of the applied machine learning techniques with splits of 70:30, 80:20, 85:15, and 90:10 is visualized in the following sections. Accuracy results on the imbalanced dataset with and without the proposed approach are examined in Figure 10 and Figure 11. With data balancing applied, the comparative accuracy results with and without the proposed approach are examined in Figure 12 and Figure 13. This analysis demonstrates the significance of the proposed approach in increasing accuracy.

4.3. Experimental Results with Imbalanced Dataset

The applied machine learning techniques are comparatively evaluated on the imbalanced dataset. To evaluate performance with different train–test splits, the ratio is varied as 0.7:0.3, 0.8:0.2, 0.85:0.15, and 0.9:0.1.

4.3.1. Results Using 70:30 Split

Table 4 demonstrates the comparative results of the machine learning models with and without the proposed approach. Precision, recall, F1 score, and accuracy are examined label-wise. The comparison shows that model performance increases with the proposed technique. ETC achieves a 59% accuracy score, 54% precision, 48% recall, and a 49% F1 score for label 1; with the proposed technique, its performance is elevated to 66% accuracy, 74% precision, 72% recall, and a 71% F1 score. In the same way, the metric scores are improved for label 2. All metric scores are increased by using the proposed technique.
The multi-label multi-class evaluation parameters for the 70:30 data split are analyzed in Table 5. The performance metrics used are training time (seconds), macro accuracy, Hamming loss, and α-evaluation score, compared with and without the proposed technique. The analysis demonstrates that the proposed approach increases the metric results: the SVC model achieves a 59% accuracy score, which increases to 64% with the proposed approach, while the Hamming loss decreases from 0.24 to 0.18 and the α-evaluation score increases from 86% to 91%. This shows that the proposed approach is very effective in achieving higher results.

4.3.2. Results Using 80:20 Train–Test Split

Performance of machine learning techniques is analyzed using an 80:20 split as well. The performance results of models with and without using the proposed approach for label 1 and label 2 are examined in Table 6.
The analysis demonstrates that the performance metrics improve with the proposed approach. MLP achieves 60% accuracy, 54% precision, 50% recall, and a 51% F1 score for label 1. By using the proposed technique, its performance is significantly improved to 67%, 75%, 73%, and 72% for accuracy, precision, recall, and F1 score, respectively. Another important observation is the increase in model performance with the change in train–test split ratio: the accuracy of MLP increases from 66% to 67%, DTC from 66% to 67%, RFC from 66% to 67%, KNN from 53% to 55%, ETC from 66% to 67%, XGB from 66% to 67%, and SVC from 64% to 65%. An increase in training data size provides more samples for training, which results in improved accuracy.
The multi-label multi-class performance comparison with a data split of 80:20 on the imbalanced dataset is examined in Table 7. The analysis reveals that model performance increases when the proposed features are used. For example, the macro accuracy score of DTC increases from 68% to 69% and the α-evaluation score from 83% to 90%, while the Hamming loss decreases from 0.22 to 0.16. Similarly, the performance of the other techniques also improves with the proposed approach.

4.3.3. Results Using 85:15 Split Ratio

Table 8 shows the comparative analysis of all the models using the 85:15 train–test split. MLP achieves 61% accuracy, 55% precision, 53% recall, and a 53% F1 score, which is the best performance without the proposed approach. However, when the proposed feature engineering approach is used, MLP's performance is elevated to 66% accuracy, 74% precision, 72% recall, and a 71% F1 score. The performance results for label 2 also increase. This analysis demonstrates that the results of all metrics improve with the proposed approach; in addition, model performance increases with the size of the training data.
The multi-label multi-class comparative analysis of the applied learning techniques with and without the proposed approach is also examined in Table 9. The macro accuracy of KNN improves from 61% to 66% with the proposed approach, while the Hamming loss decreases from 0.25 to 0.17 and the α-evaluation score increases from 84% to 91%. The results of all other learning techniques are likewise improved.

4.3.4. Results Using 90:10 Train–Test Split

In addition to the previous train–test splits, this study utilizes a 90:10 split ratio to analyze the performance of the models, and the results are given in Table 10. The best performance is obtained using ETC with the proposed approach. ETC achieves 59% accuracy, 56% precision, 48% recall, and a 49% F1 score, which improve to 67%, 76%, 73%, and 72%, respectively, when the proposed approach is used. The same holds for the label 2 scores. In addition, an increase in performance is observed due to the change in the size of the training data.
The multi-label multi-class performance comparison on the imbalanced dataset using a data split of 90:10 is examined in Table 11. The analysis covers the multi-label multi-class metric results with and without the proposed technique. The SVC technique obtains better performance, which is further improved with the proposed approach: its macro accuracy score increases from 56% to 65% and the α-evaluation score from 90% to 92%, while the Hamming loss decreases from 0.23 to 0.17. This analysis shows that the proposed approach helps to achieve higher accuracy scores.

4.3.5. Performance Comparison of All Split Ratios for Imbalanced Dataset

Figure 10 and Figure 11 summarize the performance of the machine learning models without and with the proposed approach, respectively. It can be observed that the performance of all the models is elevated by the proposed approach even on the imbalanced dataset. Instead of using single features, the proposed approach combines features from ET and RF, which are more suitable for training the models; the class probabilities from these models serve as features that improve performance. Another important observation is the similar performance of the models when used with the proposed approach: all models tend to show similar performance with slight variations.

4.4. Experimental Results with Balanced Dataset

Similar to the experiments on the imbalanced dataset, the experiments on the balanced dataset involve the train–test splits 70:30, 80:20, 85:15, and 90:10. However, the best results are obtained with the 80:20 split, so only those results are discussed here.
The label-wise comparative performance of the learning techniques on the balanced dataset is examined in Table 12. MLP shows better performance than the other models, and its performance further improves with the proposed approach: its accuracy improves from 58% to 74%, while the precision, recall, and F1 scores improve from 57%, 58%, and 57% to 74%, 74%, and 74%, respectively. These results are for label 1; a similar trend is observed for label 2, indicating the effectiveness of the proposed approach.
The multi-label multi-class metric results with and without the proposed approach on the balanced dataset are examined in Table 13, using the 80:20 split. The XGB model obtains the highest macro accuracy of 79% without the proposed approach, which is further elevated to 82% with it. Similarly, its Hamming loss decreases from 0.17 to 0.14 and the α-evaluation score increases from 88% to 89%.
For the 85:15 train–test split, the performance of XGB is superior with 73% accuracy, 73% precision, 72% recall, and a 72% F1 score. In the multi-label multi-class comparison, DTC obtains a macro accuracy score of 77%, a Hamming loss of 0.14, and an α-evaluation score of 91%. Using a 90:10 train–test split, XGB obtains the highest accuracy of 76% with the proposed approach; in the multi-label multi-class analysis, XGB achieves a macro accuracy of 84%, a Hamming loss of 0.12, and an α-evaluation score of 92%.
The best individual performance is obtained by XGB when used with the proposed ETRF feature engineering, as shown in Figure 12. The figures further show that with the proposed approach all the models improve and the differences among them shrink, whereas without the proposed approach the performance of the models varies significantly, as shown in Figure 13.

4.5. Results for High Dimensional Real Genomic Data

We applied the proposed XGB model to a multi-omics, high-dimensional real genomic dataset taken from [61] to validate the performance of the proposed approach. The dataset, available at [62], contains the expression levels of 22,284 genes (columns) for 64 samples (rows). Experimental results of the proposed approach are provided in Table 14. The performance analysis demonstrates that 100% accuracy can be obtained; the precision, recall, and F1 scores are also 100%. The classification report indicates superior performance of the proposed model on this additional dataset.

4.6. Performance in Comparison to Existing Studies

The performance of the proposed approach is compared with existing approaches to show the effectiveness of the current study and to place its performance within the context of the existing literature.
A comparative analysis is shown in Table 15 to demonstrate the importance of the current study. For a fair comparison, the models from the selected studies are implemented on the dataset used in this study. Results suggest that the proposed approach is superior in terms of both overall performance and computational complexity: using less training time, it outperforms the existing approaches with higher macro accuracy and α-evaluation score and lower Hamming loss.

5. Conclusions and Future Work

Machine learning-based approaches are increasingly needed to build prediction models for the medical field and have the potential to assist medical experts with timely decisions. The prediction of genetic disorders is very important to reduce the risk of fatal outcomes. This study proposed a novel approach to enhance the performance of predictive models for genetic disorders. The study makes two contributions. First, it uses hybrid features from ET and RF, whose class probabilities are combined to make a feature set that is used to train the machine learning models. Second, it utilizes a classifier chain approach in which the predictions of the preceding models are utilized by the succeeding models, each model predicting according to its position in the chain. Extensive experiments are carried out on both the imbalanced and the balanced dataset, with and without the proposed approach. Results indicate that the proposed ETRF technique produces the best results with the XGB model, with a 92% α-evaluation score, an 84% macro accuracy score, and a 0.12 Hamming loss. These results are far better than existing state-of-the-art approaches, regarding both performance and computational complexity. This study considers only machine learning models; the performance analysis of deep learning models is left for future work, and we also intend to apply transfer learning techniques for multi-label multi-class classification to further enhance genetic disorder prediction.

Author Contributions

Conceptualization, A.R. and F.R.; Data curation, F.R. and E.L.; Formal analysis, A.R., H.U.R.S., and E.L.; Funding acquisition, I.d.l.T.D.; Investigation, B.G.-Z.; Methodology, F.R. and H.U.R.S.; Project administration, I.d.l.T.D.; Resources, I.d.l.T.D.; Software, B.G.-Z. and E.L.; Supervision, H.U.R.S. and I.A.; Validation, I.A.; Visualization, B.G.-Z.; Writing—original draft, A.R. and F.R.; Writing—review and editing, I.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the European University of the Atlantic.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Lamandé, S.R.; Bateman, J.F. Genetic disorders of the extracellular matrix. Anat. Rec. 2020, 303, 1527–1542. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  2. Zhu, Z.; Lu, L.; Yao, X.; Zhang, W.; Liu, W. ’Rescue Mutations’ that Suppress Harmful DNA Changes Could Shed Light on Genetic Disorders 2021. Available online: http://resp.llas.ac.cn/C666/handle/2XK7JSWQ/327337 (accessed on 25 June 2022).
  3. Orlov, Y.L.; Baranova, A.V.; Tatarinova, T.V. Bioinformatics methods in medical genetics and genomics. Int. J. Mol. Sci. 2020, 21, 6224. [Google Scholar] [CrossRef] [PubMed]
  4. Shaw, J.; Scotchman, E.; Chandler, N.; Chitty, L. Preimplantation genetic testing: Non-invasive prenatal testing for aneuploidy, copy-number variants and single-gene disorders. Reproduction 2020, 160, A1–A11. [Google Scholar] [CrossRef] [PubMed]
  5. Sangkitporn, S.; Wanapirak, C.; Khorchai, A.; Yodtup, C.; Duangruang, S.; Panajamnong, N.; Nopprang, P.; Dambua, A.; Boonchu, P.; Sangkitporn, S. Prenatal Diagnosis of Down Syndrome and Common Chromosomal Disorders Using Molecular Karyotyping. Bull. Dep. Med. Sci. 2022, 64, 1–13. [Google Scholar]
  6. Maxwell, J.M.; Russell, R.A.; Wu, H.M.; Sharapova, N.; Banthorpe, P.; O’Reilly, P.F.; Lewis, C.M. Multifactorial disorders and polygenic risk scores: Predicting common diseases and the possibility of adverse selection in life and protection insurance. Ann. Actuar. Sci. 2021, 15, 488–503. [Google Scholar] [CrossRef]
  7. Spiegel, J.; Adhikari, S.; Balasubramanian, S. The structure and function of DNA G-quadruplexes. Trends Chem. 2020, 2, 123–136. [Google Scholar] [CrossRef] [Green Version]
  8. Stephanopoulos, N. Hybrid nanostructures from the self-assembly of proteins and DNA. Chem 2020, 6, 364–405. [Google Scholar] [CrossRef]
  9. Atlam, M.; Torkey, H.; Salem, H.; El-Fishawy, N. A New Feature Selection Method for Enhancing Cancer Diagnosis Based on DNA Microarray. In Proceedings of the 2020 37th National Radio Science Conference (NRSC), Cairo, Egypt, 8–10 September 2020; pp. 285–295. [Google Scholar]
  10. What Information Can Statistics Provide about a Genetic Condition: MedlinePlus Genetics. Available online: https://medlineplus.gov/genetics/understanding/mutationsanddisorders/statistics/ (accessed on 28 May 2022).
  11. Kaplanis, J.; Samocha, K.E.; Wiel, L.; Zhang, Z.; Arvai, K.J.; Eberhardt, R.Y.; Gallone, G.; Lelieveld, S.H.; Martin, H.C.; McRae, J.F.; et al. Evidence for 28 genetic disorders discovered by combining healthcare and research data. Nature 2020, 586, 757–762. [Google Scholar] [CrossRef]
  12. Hamamy, H.; Alwan, A. Genetic disorders and congenital abnormalities: Strategies for reducing the burden in the region. East Mediterr Health J. 1997, 3, 123–132. [Google Scholar] [CrossRef]
  13. Rustam, F.; Siddique, M.A.; Siddiqui, H.U.R.; Ullah, S.; Mehmood, A.; Ashraf, I.; Choi, G.S. Wireless capsule endoscopy bleeding images classification using CNN based model. IEEE Access 2021, 9, 33675–33688. [Google Scholar] [CrossRef]
  14. Kwekha-Rashid, A.S.; Abduljabbar, H.N.; Alhayani, B. Coronavirus disease (COVID-19) cases analysis using machine-learning applications. Appl. Nanosci. 2021. [Google Scholar] [CrossRef] [PubMed]
  15. Shastry, K.A.; Sanjay, H. Machine learning for bioinformatics. In Statistical Modelling and Machine Learning Principles for Bioinformatics Techniques, Tools, and Applications; Springer: Berlin/Heidelberg, Germany, 2020; pp. 25–39. [Google Scholar]
  16. Munir, A.; Vedithi, S.C.; Chaplin, A.K.; Blundell, T.L. Genomics, computational biology and drug discovery for mycobacterial infections: Fighting the emergence of resistance. Front. Genet. 2020, 11, 965. [Google Scholar] [CrossRef] [PubMed]
  17. Lee, S.; Liang, X.; Woods, M.; Reiner, A.S.; Concannon, P.; Bernstein, L.; Lynch, C.F.; Boice, J.D.; Deasy, J.O.; Bernstein, J.L.; et al. Machine learning on genome-wide association studies to predict the risk of radiation-associated contralateral breast cancer in the WECARE Study. PLoS ONE 2020, 15, e0226157. [Google Scholar] [CrossRef] [PubMed]
  18. Zhang, F. Application of machine learning in CT images and X-rays of COVID-19 pneumonia. Medicine 2021, 100, e26855. [Google Scholar] [CrossRef] [PubMed]
  19. Watanabe, N.; Murata, M.; Ogawa, T.; Vavricka, C.J.; Kondo, A.; Ogino, C.; Araki, M. Exploration and evaluation of machine learning-based models for predicting enzymatic reactions. J. Chem. Inf. Model. 2020, 60, 1833–1843. [Google Scholar] [CrossRef]
  20. Vaz, M.; Silvestre, S. Alzheimer’s disease: Recent treatment strategies. Eur. J. Pharmacol. 2020, 887, 173554. [Google Scholar] [CrossRef]
  21. Alatrany, A.S.; Hussain, A.; Jamila, M.; Al-Jumeiy, D. Stacked Machine Learning Model for Predicting Alzheimer’s Disease Based on Genetic Data. In Proceedings of the 2021 14th International Conference on Developments in eSystems Engineering (DeSE), Sharjah, United Arab Emirates, 7–10 December 2021; pp. 594–598. [Google Scholar]
  22. Huckvale, E.D.; Hodgman, M.W.; Greenwood, B.B.; Stucki, D.O.; Ward, K.M.; Ebbert, M.T.; Kauwe, J.S.; Initiative, A.D.N.; Consortium, A.D.M.; Miller, J.B. Pairwise correlation analysis of the Alzheimer’s disease neuroimaging initiative (ADNI) dataset reveals significant feature correlation. Genes 2021, 12, 1661. [Google Scholar] [CrossRef]
  23. Torkey, H.; Atlam, M.; El-Fishawy, N.; Salem, H. A novel deep autoencoder based survival analysis approach for microarray dataset. PeerJ Comput. Sci. 2021, 7, e492. [Google Scholar] [CrossRef]
  24. Deng, X.; Li, M.; Deng, S.; Wang, L. Hybrid gene selection approach using XGBoost and multi-objective genetic algorithm for cancer classification. Med. Biol. Eng. Comput. 2022, 60, 663–681. [Google Scholar] [CrossRef]
  25. Dhanalaxmi, B.; Anirudh, K.; Nikhitha, G.; Jyothi, R. A Survey on Analysis of Genetic Diseases Using Machine Learning Techniques. In Proceedings of the 2021 Fifth International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC), Palladam, India, 11–13 November 2021; pp. 496–501. [Google Scholar]
26. Lattmann, E.; Deng, T.; Walser, M.; Widmer, P.; Rexha-Lambert, C.; Prasad, V.; Eichhoff, O.; Daube, M.; Dummer, R.; Levesque, M.P.; et al. A DNA replication-independent function of pre-replication complex genes during cell invasion in C. elegans. PLoS Biol. 2022, 20, e3001317.
27. Ghazal, T.M.; Al Hamadi, H.; Umar Nasir, M.; Gollapalli, M.; Zubair, M.; Adnan Khan, M.; Yeob Yeun, C. Supervised Machine Learning Empowered Multifactorial Genetic Inheritance Disorder Prediction. Comput. Intell. Neurosci. 2022, 2022, 1051388.
28. Mihajlović, A.; Mladenović, K.; Lončar-Turukalo, T.; Brdar, S. Machine Learning Based Metagenomic Prediction of Inflammatory Bowel Disease. Stud. Health Technol. Inform. 2021, 285, 165–170.
29. Wang, R.Y.; Guo, T.Q.; Li, L.G.; Jiao, J.Y.; Wang, L.Y. Predictions of COVID-19 Infection Severity Based on Co-associations between the SNPs of Co-morbid Diseases and COVID-19 through Machine Learning of Genetic Data. In Proceedings of the 2020 IEEE 8th International Conference on Computer Science and Network Technology (ICCSNT), Dalian, China, 20–22 November 2020; pp. 92–96.
30. Pina, A.; Helgadottir, S.; Mancina, R.M.; Pavanello, C.; Pirazzi, C.; Montalcini, T.; Henriques, R.; Calabresi, L.; Wiklund, O.; Macedo, M.P.; et al. Virtual genetic diagnosis for familial hypercholesterolemia powered by machine learning. Eur. J. Prev. Cardiol. 2020, 27, 1639–1646.
31. Quinodoz, M.; Royer-Bertrand, B.; Cisarova, K.; Di Gioia, S.A.; Superti-Furga, A.; Rivolta, C. DOMINO: Using machine learning to predict genes associated with dominant disorders. Am. J. Hum. Genet. 2017, 101, 623–629.
32. Boulogeorgos, A.A.A.; Trevlakis, S.E.; Tegos, S.A.; Papanikolaou, V.K.; Karagiannidis, G.K. Machine learning in nano-scale biomedical engineering. IEEE Trans. Mol. Biol. Multi-Scale Commun. 2020, 7, 10–39.
33. Le, D.H. Machine learning-based approaches for disease gene prediction. Briefings Funct. Genom. 2020, 19, 350–363.
34. Khanal, S.; Chen, J.; Jacobs, N.; Lin, A.L. Alzheimer's Disease Classification Using Genetic Data. In Proceedings of the 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Houston, TX, USA, 9–12 December 2021; pp. 2245–2252.
35. Sekaran, K.; Sudha, M. Predicting autism spectrum disorder from associative genetic markers of phenotypic groups using machine learning. J. Ambient Intell. Humaniz. Comput. 2021, 12, 3257–3270.
36. Rangaswamy, U.; Dharshini, S.A.P.; Yesudhas, D.; Gromiha, M.M. VEPAD: Predicting the effect of variants associated with Alzheimer's disease using machine learning. Comput. Biol. Med. 2020, 124, 103933.
37. Wang, W.; Han, R.; Zhang, M.; Wang, Y.; Wang, T.; Wang, Y.; Shang, X.; Peng, J. A network-based method for brain disease gene prediction by integrating brain connectome and molecular network. Briefings Bioinform. 2022, 23, bbab459.
38. Zhang, X.; Jonassen, I.; Goksøyr, A. Machine Learning Approaches for Biomarker Discovery Using Gene Expression Data. Bioinformatics 2021, 53–64.
39. Of Genomes and Genetics: HackerEarth Machine Learning Challenge. 2021. Available online: https://www.hackerearth.com/challenges/competitive/hackerearth-machine-learning-challenge-genetic-testing/ (accessed on 28 May 2022).
40. Dai, D.; Xu, T.; Wei, X.; Ding, G.; Xu, Y.; Zhang, J.; Zhang, H. Using machine learning and feature engineering to characterize limited material datasets of high-entropy alloys. Comput. Mater. Sci. 2020, 175, 109618.
41. Pecorelli, F.; Di Nucci, D.; De Roover, C.; De Lucia, A. A large empirical assessment of the role of data balancing in machine-learning-based code smell detection. J. Syst. Softw. 2020, 169, 110693.
42. Charbuty, B.; Abdulazeez, A. Classification based on decision tree algorithm for machine learning. J. Appl. Sci. Technol. Trends 2021, 2, 20–28.
43. Palimkar, P.; Shaw, R.N.; Ghosh, A. Machine learning technique to prognosis diabetes disease: Random forest classifier approach. In Advanced Computing and Intelligent Technologies; Springer: Berlin/Heidelberg, Germany, 2022; pp. 219–244.
44. Zhan, C.; Zheng, Y.; Zhang, H.; Wen, Q. Random-forest-bagging broad learning system with applications for COVID-19 pandemic. IEEE Internet Things J. 2021, 8, 15906–15918.
45. Bhati, B.S.; Rai, C. Ensemble based approach for intrusion detection using extra tree classifier. In Intelligent Computing in Engineering; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–220.
46. Vrigazova, B.P.; Ivanov, I.G. The bootstrap procedure in classification problems. Int. J. Data Min. Model. Manag. 2020, 12, 428–446.
47. Daghistani, T.; Alshammari, R. Comparison of statistical logistic regression and random forest machine learning techniques in predicting diabetes. J. Adv. Inf. Technol. 2020, 11, 78–83.
48. Feng, X.; Ma, G.; Su, S.F.; Huang, C.; Boswell, M.K.; Xue, P. A multi-layer perceptron approach for accelerated wave forecasting in Lake Michigan. Ocean Eng. 2020, 211, 107526.
49. Raza, A.; Munir, K.; Almutairi, M. A Novel Deep Learning Approach for Deepfake Image Detection. Appl. Sci. 2022, 12, 9820.
50. Chen, L.; Wang, C.; Chen, J.; Xiang, Z.; Hu, X. Voice Disorder Identification by Using Hilbert-Huang Transform (HHT) and K-Nearest Neighbor (KNN). J. Voice 2021, 35, 932.e1.
51. Jones, A.H.S.; Hardiyanti, C. Case Based Reasoning Using K-Nearest Neighbor with Euclidean Distance for Early Diagnosis of Personality Disorder. IJISTECH Int. J. Inf. Syst. Technol. 2021, 5, 23–30.
52. Mateo, J.; Rius-Peris, J.; Marana-Perez, A.; Valiente-Armero, A.; Torres, A. Extreme gradient boosting machine learning method for predicting medical treatment in patients with acute bronchiolitis. Biocybern. Biomed. Eng. 2021, 41, 792–801.
53. Pisner, D.A.; Schnyer, D.M. Support vector machine. In Machine Learning; Elsevier: Amsterdam, The Netherlands, 2020; pp. 101–121.
54. Kariuki, C. Multi-Label Classification with Scikit-MultiLearn. Section Engineering Education (EngEd) Program, 2021. Available online: https://www.section.io/engineering-education/multi-label-classification-with-scikit-multilearn/ (accessed on 28 May 2022).
55. Read, J.; Pfahringer, B.; Holmes, G.; Frank, E. Classifier chains for multi-label classification. Mach. Learn. 2011, 85, 333–359.
56. George, J.A. An Introduction to Multi-Label Text Classification. Analytics Vidhya, Medium, 2020. Available online: https://medium.com/analytics-vidhya/an-introduction-to-multi-label-text-classification-b1bcb7c7364c (accessed on 28 May 2022).
57. Arat, M.M. Metrics for Multilabel Classification. 2020. Available online: https://mmuratarat.github.io/2020-01-25/multilabel_classification_metrics (accessed on 28 May 2022).
58. Jadhav, P. Evaluation Metrics for Multi-Label Classification. DataDrivenInvestor, 2021. Available online: https://medium.datadriveninvestor.com/a-survey-of-evaluation-metrics-for-multilabel-classification-bb16e8cd41cd (accessed on 28 May 2022).
59. Raza, A.; Siddiqui, H.U.R.; Munir, K.; Almutairi, M.; Rustam, F.; Ashraf, I. Ensemble learning-based feature engineering to analyze maternal health during pregnancy and health risk prediction. PLoS ONE 2022, 17, e0276525.
60. Tran, M.K.; Panchal, S.; Chauhan, V.; Brahmbhatt, N.; Mevawalla, A.; Fraser, R.; Fowler, M. Python-based scikit-learn machine learning models for thermal and electrical performance prediction of high-capacity lithium-ion battery. Int. J. Energy Res. 2022, 46, 786–794.
61. Rupapara, V.; Rustam, F.; Aljedaani, W.; Shahzad, H.F.; Lee, E.; Ashraf, I. Blood cancer prediction using leukemia microarray gene data and hybrid logistic vector trees model. Sci. Rep. 2022, 12, 1–15.
62. Grisci, B. Leukemia Gene Expression (CuMiDa). Kaggle, 2019. Available online: https://www.kaggle.com/datasets/brunogrisci/leukemia-gene-expression-cumida (accessed on 20 July 2022).
63. Wu, G.; Zheng, R.; Tian, Y.; Liu, D. Joint ranking SVM and binary relevance with robust low-rank learning for multi-label classification. Neural Netw. 2020, 122, 24–39.
64. Bayati, H.; Dowlatshahi, M.B.; Paniri, M. MLPSO: A Filter Multi-label Feature Selection Based on Particle Swarm Optimization. In Proceedings of the 2020 25th International Computer Conference, Computer Society of Iran (CSICC), Tehran, Iran, 1–2 January 2020; pp. 1–6.
65. Paniri, M.; Dowlatshahi, M.B.; Nezamabadi-Pour, H. MLACO: A multi-label feature selection algorithm based on ant colony optimization. Knowl. Based Syst. 2020, 192, 105285.
66. Kouchaki, S.; Yang, Y.; Lachapelle, A.; Walker, T.M.; Walker, A.S.; Consortium, C.; Peto, T.E.; Crook, D.W.; Clifton, D.A. Multi-label random forest model for tuberculosis drug resistance classification and mutation ranking. Front. Microbiol. 2020, 11, 667.
67. Kang, J.; Fernandez-Beltran, R.; Hong, D.; Chanussot, J.; Plaza, A. Graph Relation Network: Modeling Relations Between Scenes for Multilabel Remote-Sensing Image Classification and Retrieval. IEEE Trans. Geosci. Remote Sens. 2021, 59, 4355–4369.
Figure 1. Workflow of the proposed approach for predicting genetic disorders and disorder types.
Figure 2. Distribution of samples across classes in the dataset: (a) genetic disorder main classes; (b) genetic disorder subclasses.
Figure 3. 3D scatter analysis of white blood cell count (thousand per microliter) and blood cell count (mcL): (a) genetic disorder category; (b) genetic disorder sub-category.
Figure 4. Genome data distribution by genetic disorder category: (a) maternal gene; (b) paternal gene; (c) genes from the mother's side; (d) inherited from the father.
Figure 5. Genome data distribution by genetic disorder sub-category: (a) maternal gene; (b) paternal gene; (c) genes from the mother's side; (d) inherited from the father.
Figure 6. Age analysis of patients by disorder category.
Figure 7. Feature correlation graphs of the genome data.
Figure 8. Architecture of the multi-label multi-class classifier chain approach.
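The chain in Figure 8 can be realized with two multi-class classifiers, where the second model receives the first label as an extra feature during training and the first model's prediction at inference, following the classifier chain scheme of [55]. The following minimal Python sketch illustrates the idea; the base model choice and function names are illustrative rather than the paper's exact pipeline.

```python
# Minimal sketch of the two-link classifier chain in Figure 8 for the two
# multi-class targets (genetic disorder, disorder subclass). Labels are
# assumed to be integer-encoded; the base estimator is illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fit_chain(X_train, y_disorder, y_subclass):
    # Link 1 predicts the genetic disorder category from the features.
    link1 = RandomForestClassifier(n_estimators=100, random_state=5)
    link1.fit(X_train, y_disorder)
    # Link 2 receives the first label as an extra feature (as in
    # classifier chains [55]) and predicts the disorder subclass.
    link2 = RandomForestClassifier(n_estimators=100, random_state=5)
    link2.fit(np.column_stack([X_train, y_disorder]), y_subclass)
    return link1, link2

def predict_chain(link1, link2, X_test):
    # At inference, link 2 consumes link 1's predicted label.
    y1 = link1.predict(X_test)
    y2 = link2.predict(np.column_stack([X_test, y1]))
    return y1, y2
```

Hand-rolling the chain keeps each link free to be any multi-class estimator, which scikit-learn's binary-label ClassifierChain does not directly allow.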
Figure 9. Architecture of the proposed ETRF technique for forming the hybrid feature set.
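The ETRF construction in Figure 9 amounts to concatenating the class-probability outputs of an extra tree (ET) model and a random forest (RF) model into a new feature matrix. A sketch follows; the use of out-of-fold probabilities via cross_val_predict is an assumption made here to limit target leakage, as this excerpt does not spell out how the probabilities are generated.

```python
# Sketch of the ETRF hybrid feature set (Figure 9): class probabilities
# from ET and RF models are joined into a single feature matrix that the
# downstream classifiers are trained on.
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_predict

def etrf_features(X, y):
    et = ExtraTreesClassifier(n_estimators=300, random_state=5)
    rf = RandomForestClassifier(n_estimators=100, random_state=5)
    # Out-of-fold probabilities (an assumption here) keep each sample's
    # features from being produced by a model that saw its own label.
    p_et = cross_val_predict(et, X, y, cv=5, method="predict_proba")
    p_rf = cross_val_predict(rf, X, y, cv=5, method="predict_proba")
    return np.hstack([p_et, p_rf])  # one column per class per model
```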
Figure 10. Comparative performance of the applied techniques across data split ratios without the proposed technique, using imbalanced data.
Figure 11. Comparative performance of the applied techniques across data split ratios with the proposed technique, using imbalanced data.
Figure 12. Comparative performance of the applied techniques across data split ratios with the proposed technique, using balanced data.
Figure 13. Comparative performance of the applied techniques across data split ratios without the proposed technique, using balanced data.
Table 1. Summary of genetic disorder-related literature.

Ref. | Year | Approach | Dataset | Accuracy (%) | Aim
[21] | 2021 | Stacked ML model | AD genetic data of neuroimaging project (ADNI-1) | 93 | Classify Alzheimer's disease type of brain disorders using ML.
[25] | 2021 | Genetic Disease Analyzer (GDA) | GEO dataset | 98 | Predict complex genes and identify genetic classifications that cause complex diseases.
[34] | 2021 | XGBoost | Alzheimer's Disease Neuroimaging Initiative (ADNI) | 64 | Alzheimer's disease classification using genetic data.
[33] | 2020 | Machine learning-based model | Genes data | - | Disease gene prediction using machine learning.
[35] | 2020 | Machine learning-based model | Microarray gene expression dataset of autism spectrum disorder (ASD) | 97 | Predict autism spectrum disorder from associative genetic markers of phenotypic groups.
[36] | 2020 | Random forest classifier | GWAS and GTEx portal data | 81 | ML-based model for predicting the effect of deleterious and neutral variants for Alzheimer's disease.
[37] | 2021 | Support vector machine | Molecular-based network and brain connectome data | 72 | Framework for integrating brain connectome data and molecular-based gene association networks to predict brain disease genes.
[27] | 2022 | Support vector machine | Multifactorial genetic inheritance disorder multiclass dataset | 92 | ML approaches to predict dementia, cancer, and diabetes.
[28] | 2021 | Random forest classifier | Metagenomic dataset based on inflammatory bowel disease multi-omics | 90 | Inflammatory bowel disease prediction using ML techniques.
[29] | 2020 | Random forest classifier | Genetic SNP mutation dataset | 92 | ML techniques to predict COVID-19 infection and related diseases.
[30] | 2020 | Gradient boosting classifier | Virtual genetic, clinical test data | 83 | Prediction of familial hypercholesterolemia genetic disorder using ML techniques.
[31] | 2017 | DOMINO | Genomic dataset | 92 | ML-based DOMINO to predict dominant mutations in genes for Mendelian disorders.
Table 2. Descriptive analysis of the features in the genome dataset.

Sr No. | Feature | Count | Data Type
1 | Patient Id | 31,548 | object
2 | Patient Age | 30,121 | float64
3 | Genes in mother's side | 31,548 | object
4 | Inherited from father | 30,691 | object
5 | Maternal gene | 25,015 | object
6 | Paternal gene | 31,548 | object
7 | Blood cell count (mcL) | 31,548 | float64
8 | Patient First Name | 31,548 | object
9 | Family Name | 12,540 | object
10 | Father's name | 31,548 | object
11 | Mother's age | 25,512 | float64
12 | Father's age | 25,562 | float64
13 | Institute Name | 24,406 | object
14 | Location of Institute | 31,548 | object
15 | Status | 31,548 | object
16 | Respiratory Rate (breaths/min) | 26,513 | object
17 | Heart Rate (rates/min) | 26,535 | object
18 | Test 1 | 29,421 | float64
19 | Test 2 | 29,396 | float64
20 | Test 3 | 29,401 | float64
21 | Test 4 | 29,408 | float64
22 | Test 5 | 29,378 | float64
23 | Follow-up | 29,382 | object
24 | Gender | 29,375 | object
25 | Birth asphyxia | 29,409 | object
26 | Autopsy shows birth defect (if applicable) | 30,522 | object
27 | Place of birth | 29,424 | object
28 | Folic acid details (peri-conceptional) | 29,431 | object
29 | H/O serious maternal illness | 29,396 | object
30 | H/O radiation exposure (X-ray) | 29,395 | object
31 | H/O substance abuse | 29,353 | object
32 | Assisted conception IVF/ART | 29,426 | object
33 | History of anomalies in previous pregnancies | 29,376 | object
34 | No. of previous abortion | 29,386 | float64
35 | Birth defects | 29,394 | object
36 | White Blood cell count (thousand per microliter) | 29,400 | float64
37 | Blood test result | 29,403 | object
38 | Symptom 1 | 29,393 | object
39 | Symptom 2 | 29,326 | object
40 | Symptom 3 | 29,447 | object
41 | Symptom 4 | 29,435 | object
42 | Symptom 5 | 29,395 | object
43 | Genetic Disorder | 19,937 | object
44 | Disorder Subclass | 19,915 | object
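In Table 2, "Count" is the number of non-null entries per feature and "Data Type" is the pandas dtype; both can be reproduced directly from the challenge dataset [39]. A short sketch follows, where the file name is illustrative.

```python
# Reproducing the Count and Data Type columns of Table 2 from the
# HackerEarth genomes dataset [39]; the CSV path is illustrative.
import pandas as pd

df = pd.read_csv("train_genetic_disorders.csv")
summary = pd.DataFrame({"Count": df.notna().sum(), "Data Type": df.dtypes})
print(summary)
```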
Table 3. Configuration of hyperparameters for the employed machine learning models.

Technique | Hyperparameters
ETC | n_estimators = 300, random_state = 5, max_depth = 300, criterion = "gini", max_features = "sqrt", bootstrap = False, oob_score = False, ccp_alpha = 0.0
SVC | penalty = 'l2', loss = 'squared_hinge', tol = 1e-4, C = 1.0, multi_class = 'ovr', fit_intercept = True, max_iter = 1000
LR | penalty = 'l2', tol = 1e-4, C = 1.0, fit_intercept = True, solver = 'lbfgs', random_state = None, max_iter = 100, multi_class = 'auto'
DTC | max_depth = 300, criterion = "gini", splitter = "best", ccp_alpha = 0.0, random_state = None
RFC | max_depth = 300, n_estimators = 100, criterion = "gini", max_features = "sqrt", random_state = None, bootstrap = True, ccp_alpha = 0.0
XGB | use_label_encoder = False, eval_metric = 'mlogloss', max_depth = 300, objective = 'multi:softprob'
KNN | n_neighbors = 5, weights = 'uniform', leaf_size = 30, metric = 'minkowski', algorithm = 'auto', p = 2
MLP | hidden_layer_sizes = 100, max_iter = 300, activation = 'relu', solver = 'adam', alpha = 0.0001, learning_rate = 'constant', tol = 1e-4, epsilon = 1e-8, max_fun = 15000
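The Table 3 settings map directly onto scikit-learn and XGBoost constructor arguments. As an example, a sketch instantiating the ETC and XGB configurations; the remaining models follow the same pattern, and the use_label_encoder flag assumes an xgboost 1.x release.

```python
# Instantiating two of the Table 3 configurations; the other models are
# built analogously from their listed hyperparameters.
from sklearn.ensemble import ExtraTreesClassifier
from xgboost import XGBClassifier

etc = ExtraTreesClassifier(n_estimators=300, random_state=5, max_depth=300,
                           criterion="gini", max_features="sqrt",
                           bootstrap=False, oob_score=False, ccp_alpha=0.0)
xgb = XGBClassifier(use_label_encoder=False, eval_metric="mlogloss",
                    max_depth=300, objective="multi:softprob")
```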
Table 4. Comparative analysis of machine learning models using an unbalanced dataset with a data split of 70:30. All values are percentages; the first four metric columns refer to Label 1 and the last four to Label 2.

Technique | Accuracy | Precision | Recall | F1 Score | Accuracy | Precision | Recall | F1 Score
Results without the proposed technique:
LR | 53 | 51 | 42 | 36 | 33 | 19 | 17 | 14
MLP | 58 | 52 | 51 | 50 | 35 | 26 | 25 | 24
DTC | 50 | 44 | 44 | 44 | 29 | 23 | 24 | 24
RFC | 58 | 54 | 46 | 46 | 37 | 28 | 22 | 23
KNN | 46 | 36 | 34 | 33 | 21 | 13 | 12 | 12
ETC | 59 | 54 | 48 | 49 | 37 | 40 | 24 | 26
XGB | 57 | 51 | 49 | 49 | 36 | 30 | 25 | 26
SVC | 49 | 48 | 35 | 34 | 27 | 20 | 14 | 12
Results with the proposed technique:
LR | 64 | 73 | 69 | 66 | 43 | 39 | 38 | 36
MLP | 66 | 74 | 72 | 71 | 45 | 58 | 41 | 40
DTC | 66 | 74 | 72 | 71 | 44 | 51 | 41 | 41
RFC | 66 | 74 | 72 | 71 | 44 | 51 | 41 | 41
KNN | 53 | 67 | 67 | 65 | 35 | 38 | 40 | 37
ETC | 66 | 74 | 72 | 71 | 44 | 51 | 41 | 41
XGB | 66 | 74 | 72 | 71 | 44 | 51 | 41 | 41
SVC | 64 | 73 | 69 | 65 | 42 | 35 | 38 | 36
Table 5. The multi-label multi-class performance comparative analysis with an imbalanced dataset using a data split of 70:30.

Technique | Training Time (s) | Macro Accuracy (%) | Hamming Loss | α-Evaluation Score (%)
Results without the proposed technique:
LR | 3.20 | 55 | 0.22 | 90
MLP | 75.34 | 69 | 0.18 | 88
DTC | 0.25 | 69 | 0.21 | 83
RFC | 4.04 | 66 | 0.18 | 89
KNN | 0.05 | 61 | 0.25 | 84
ETC | 2.39 | 68 | 0.17 | 89
XGB | 75.94 | 69 | 0.18 | 87
SVC | 12.06 | 59 | 0.24 | 86
Results with the proposed technique:
LR | 2.05 | 65 | 0.18 | 91
MLP | 13.17 | 68 | 0.18 | 90
DTC | 0.01 | 68 | 0.17 | 90
RFC | 0.79 | 68 | 0.17 | 90
KNN | 0.02 | 71 | 0.23 | 78
ETC | 2.18 | 68 | 0.17 | 90
XGB | 7.49 | 68 | 0.17 | 90
SVC | 0.28 | 64 | 0.18 | 91
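For the multi-label metrics reported in Tables 5, 7, 9, 11, and 13, a hedged sketch of plausible definitions follows: macro accuracy as the mean of the per-label accuracies, Hamming loss as the fraction of mispredicted label slots, and the α-evaluation score from the multi-label literature [57,58] with α = 1, where it reduces to an averaged per-sample Jaccard ratio. These definitions are assumptions made here for illustration; the paper's exact formulations are given in its methods section.

```python
# Hedged sketch of the three multi-label metrics; y_true and y_pred are
# (n_samples, n_labels) arrays of integer-encoded labels.
import numpy as np
from sklearn.metrics import accuracy_score

def macro_accuracy(y_true, y_pred):
    # Mean of the per-label (per-column) accuracies.
    return np.mean([accuracy_score(y_true[:, j], y_pred[:, j])
                    for j in range(y_true.shape[1])])

def hamming_loss_ml(y_true, y_pred):
    # Fraction of label slots predicted incorrectly.
    return np.mean(y_true != y_pred)

def alpha_evaluation(y_true, y_pred, alpha=1.0):
    # With alpha = 1 this is the per-sample Jaccard ratio |Y ∩ P| / |Y ∪ P|,
    # averaged over samples; an assumed simplification of [57,58].
    scores = []
    for t, p in zip(y_true, y_pred):
        inter = len(set(t) & set(p))
        union = len(set(t) | set(p))
        scores.append((inter / union) ** alpha if union else 1.0)
    return float(np.mean(scores))
```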
Table 6. Performance analysis of machine learning models using an imbalanced dataset with a data split of 80:20. All values are percentages; the first four metric columns refer to Label 1 and the last four to Label 2.

Technique | Accuracy | Precision | Recall | F1 Score | Accuracy | Precision | Recall | F1 Score
Results without the proposed technique:
LR | 53 | 52 | 42 | 37 | 33 | 19 | 17 | 14
MLP | 60 | 54 | 50 | 51 | 37 | 27 | 24 | 25
DTC | 49 | 43 | 43 | 43 | 29 | 21 | 21 | 21
RFC | 58 | 54 | 45 | 45 | 36 | 27 | 21 | 22
KNN | 47 | 36 | 34 | 34 | 22 | 12 | 12 | 12
ETC | 59 | 55 | 48 | 49 | 37 | 34 | 23 | 25
XGB | 57 | 52 | 48 | 49 | 36 | 32 | 26 | 28
SVC | 50 | 45 | 35 | 32 | 19 | 17 | 16 | 12
Results with the proposed technique:
LR | 65 | 75 | 71 | 68 | 43 | 39 | 39 | 37
MLP | 67 | 75 | 73 | 72 | 45 | 48 | 40 | 39
DTC | 67 | 75 | 73 | 72 | 45 | 55 | 41 | 41
RFC | 67 | 75 | 73 | 72 | 45 | 54 | 41 | 40
KNN | 55 | 69 | 69 | 66 | 37 | 43 | 42 | 40
ETC | 67 | 75 | 73 | 72 | 45 | 55 | 41 | 41
XGB | 67 | 75 | 73 | 72 | 45 | 55 | 41 | 41
SVC | 65 | 75 | 70 | 67 | 43 | 40 | 38 | 36
Table 7. The multi-label multi-class performance comparative analysis with an imbalanced dataset using a data split of 80:20.

Technique | Training Time (s) | Macro Accuracy (%) | Hamming Loss | α-Evaluation Score (%)
Results without the proposed technique:
LR | 3.47 | 55 | 0.22 | 90
MLP | 69.57 | 70 | 0.17 | 89
DTC | 0.29 | 68 | 0.22 | 83
RFC | 4.74 | 65 | 0.19 | 89
KNN | 0.03 | 61 | 0.24 | 84
ETC | 14.45 | 68 | 0.18 | 89
XGB | 91.50 | 69 | 0.18 | 87
SVC | 14.04 | 55 | 0.24 | 88
Results with the proposed technique:
LR | 2.16 | 66 | 0.17 | 91
MLP | 10.95 | 69 | 0.16 | 90
DTC | 0.01 | 69 | 0.16 | 90
RFC | 0.88 | 69 | 0.16 | 90
KNN | 0.03 | 72 | 0.26 | 79
ETC | 2.44 | 69 | 0.16 | 90
XGB | 9.13 | 69 | 0.16 | 90
SVC | 0.34 | 65 | 0.17 | 92
Table 8. Performance comparative analysis of models using an imbalanced dataset and 85:15 split. All values are percentages; the first four metric columns refer to Label 1 and the last four to Label 2.

Technique | Accuracy | Precision | Recall | F1 Score | Accuracy | Precision | Recall | F1 Score
Results without the proposed technique:
LR | 54 | 52 | 42 | 37 | 32 | 15 | 16 | 13
MLP | 61 | 55 | 53 | 53 | 38 | 30 | 29 | 29
DTC | 51 | 44 | 44 | 44 | 30 | 24 | 25 | 24
RFC | 58 | 54 | 46 | 46 | 36 | 27 | 21 | 22
KNN | 47 | 33 | 33 | 32 | 22 | 13 | 13 | 12
ETC | 59 | 54 | 47 | 48 | 37 | 33 | 23 | 25
XGB | 57 | 51 | 48 | 49 | 35 | 27 | 23 | 24
SVC | 41 | 44 | 34 | 29 | 25 | 20 | 20 | 14
Results with the proposed technique:
LR | 65 | 75 | 70 | 67 | 42 | 40 | 38 | 36
MLP | 66 | 74 | 72 | 71 | 43 | 45 | 40 | 39
DTC | 66 | 74 | 72 | 70 | 42 | 47 | 40 | 40
RFC | 66 | 74 | 72 | 71 | 43 | 48 | 40 | 40
KNN | 65 | 74 | 70 | 68 | 39 | 44 | 38 | 38
ETC | 66 | 74 | 72 | 70 | 42 | 47 | 40 | 40
XGB | 66 | 74 | 72 | 71 | 42 | 45 | 39 | 39
SVC | 64 | 74 | 70 | 66 | 41 | 35 | 38 | 35
Table 9. The multi-label multi-class performance comparative analysis with an imbalanced dataset using a data split of 85:15.

Technique | Training Time (s) | Macro Accuracy (%) | Hamming Loss | α-Evaluation Score (%)
Results without the proposed technique:
LR | 4.08 | 55 | 0.22 | 91
MLP | 104.1 | 70 | 0.16 | 89
DTC | 0.31 | 69 | 0.21 | 84
RFC | 5.42 | 66 | 0.18 | 89
KNN | 0.04 | 61 | 0.25 | 84
ETC | 16.8 | 69 | 0.18 | 89
XGB | 99.37 | 69 | 0.18 | 87
SVC | 16.4 | 70 | 0.25 | 77
Results with the proposed technique:
LR | 2.49 | 66 | 0.17 | 92
MLP | 13.58 | 68 | 0.17 | 90
DTC | 0.01 | 68 | 0.17 | 90
RFC | 0.95 | 68 | 0.17 | 90
KNN | 0.03 | 66 | 0.17 | 91
ETC | 2.55 | 68 | 0.17 | 90
XGB | 9.40 | 68 | 0.17 | 90
SVC | 0.38 | 65 | 0.18 | 92
Table 10. Performance comparative analysis using an imbalanced dataset with a data split of 90:10. All values are percentages; the first four metric columns refer to Label 1 and the last four to Label 2.

Technique | Accuracy | Precision | Recall | F1 Score | Accuracy | Precision | Recall | F1 Score
Results without the proposed technique:
LR | 53 | 57 | 42 | 37 | 34 | 24 | 17 | 14
MLP | 57 | 56 | 49 | 48 | 36 | 31 | 24 | 24
DTC | 50 | 44 | 45 | 44 | 30 | 23 | 24 | 24
RFC | 58 | 56 | 46 | 47 | 38 | 39 | 23 | 25
KNN | 46 | 34 | 33 | 32 | 22 | 13 | 12 | 12
ETC | 59 | 56 | 48 | 49 | 38 | 40 | 26 | 28
XGB | 57 | 52 | 49 | 50 | 36 | 33 | 26 | 28
SVC | 51 | 40 | 41 | 35 | 31 | 15 | 19 | 15
Results with the proposed technique:
LR | 66 | 77 | 71 | 68 | 43 | 40 | 38 | 37
MLP | 67 | 76 | 73 | 71 | 45 | 45 | 40 | 40
DTC | 67 | 76 | 73 | 72 | 45 | 50 | 41 | 41
RFC | 67 | 76 | 73 | 73 | 45 | 51 | 41 | 41
KNN | 62 | 71 | 71 | 70 | 41 | 50 | 42 | 42
ETC | 67 | 76 | 73 | 72 | 45 | 50 | 41 | 41
XGB | 67 | 76 | 73 | 72 | 45 | 50 | 42 | 42
SVC | 65 | 76 | 70 | 66 | 43 | 41 | 38 | 36
Table 11. The multi-label multi-class performance comparative analysis using a data split of 90:10.

Technique | Training Time (s) | Macro Accuracy (%) | Hamming Loss | α-Evaluation Score (%)
Results without the proposed technique:
LR | 5.23 | 54 | 0.22 | 91
MLP | 40.5 | 63 | 0.19 | 90
DTC | 0.33 | 70 | 0.20 | 84
RFC | 5.70 | 66 | 0.18 | 90
KNN | 0.04 | 61 | 0.25 | 84
ETC | 17.62 | 68 | 0.18 | 89
XGB | 106.58 | 70 | 0.18 | 88
SVC | 19.0 | 56 | 0.23 | 90
Results with the proposed technique:
LR | 2.76 | 66 | 0.17 | 92
MLP | 17.6 | 69 | 0.16 | 91
DTC | 0.01 | 70 | 0.16 | 90
RFC | 1.02 | 70 | 0.16 | 90
KNN | 0.03 | 69 | 0.19 | 87
ETC | 2.82 | 70 | 0.16 | 90
XGB | 15.7 | 70 | 0.16 | 90
SVC | 0.37 | 65 | 0.17 | 92
Table 12. Comparative analysis of applied machine learning models using a balanced dataset with a data split of 80:20. All values are percentages; the first four metric columns refer to Label 1 and the last four to Label 2.

Technique | Accuracy | Precision | Recall | F1 Score | Accuracy | Precision | Recall | F1 Score
Results without the proposed technique:
LR | 55 | 54 | 56 | 54 | 41 | 18 | 22 | 19
MLP | 58 | 57 | 58 | 57 | 41 | 26 | 24 | 24
DTC | 47 | 47 | 47 | 47 | 32 | 24 | 26 | 25
RFC | 57 | 56 | 57 | 55 | 42 | 23 | 23 | 21
KNN | 37 | 37 | 37 | 36 | 25 | 14 | 14 | 14
ETC | 58 | 57 | 58 | 57 | 43 | 47 | 26 | 26
XGB | 54 | 53 | 54 | 53 | 39 | 41 | 26 | 27
SVC | 37 | 44 | 37 | 27 | 25 | 20 | 20 | 15
Results with the proposed technique:
LR | 72 | 73 | 73 | 72 | 57 | 38 | 41 | 39
MLP | 74 | 74 | 74 | 73 | 51 | 59 | 42 | 41
DTC | 73 | 73 | 73 | 73 | 59 | 50 | 45 | 44
RFC | 73 | 73 | 73 | 73 | 59 | 50 | 44 | 44
KNN | 71 | 71 | 71 | 71 | 56 | 44 | 44 | 44
ETC | 73 | 73 | 73 | 73 | 59 | 50 | 44 | 44
XGB | 73 | 73 | 73 | 73 | 59 | 49 | 44 | 44
SVC | 72 | 73 | 73 | 72 | 38 | 58 | 41 | 39
Table 13. The multi-label multi-class performance comparative analysis using balanced data with an 80:20 split.

Technique | Training Time (s) | Macro Accuracy (%) | Hamming Loss | α-Evaluation Score (%)
Results without the proposed technique:
LR | 0.99 | 81 | 0.16 | 87
MLP | 17.21 | 82 | 0.14 | 89
DTC | 0.08 | 77 | 0.19 | 87
RFC | 1.63 | 79 | 0.16 | 89
KNN | 0.01 | 71 | 0.24 | 85
ETC | 4.54 | 81 | 0.14 | 90
XGB | 19.98 | 79 | 0.17 | 88
SVC | 3.41 | 81 | 0.17 | 82
Results with the proposed technique:
LR | 0.50 | 81 | 0.14 | 91
MLP | 6.81 | 82 | 0.14 | 91
DTC | 0.01 | 82 | 0.14 | 89
RFC | 0.54 | 82 | 0.14 | 89
KNN | 0.01 | 81 | 0.15 | 88
ETC | 1.32 | 82 | 0.14 | 89
XGB | 3.32 | 82 | 0.14 | 89
SVC | 0.09 | 81 | 0.14 | 91
Table 14. Results of the proposed approach with XGB as the classification model on the multi-omics, high-dimensional real genomic dataset.

Category | Precision | Recall | F1 Score
0 | 1.00 | 1.00 | 1.00
1 | 1.00 | 1.00 | 1.00
2 | 1.00 | 1.00 | 1.00
3 | 1.00 | 1.00 | 1.00
4 | 1.00 | 1.00 | 1.00
Average | 1.00 | 1.00 | 1.00
Accuracy | 1.00
Table 15. Comparative analysis of the proposed approach with state-of-the-art approaches.

Reference | Year | Technique | Training Time (s) | Macro Accuracy (%) | Hamming Loss | α-Evaluation Score (%)
[63] | 2020 | SVM | 7.10 | 73 | 0.22 | 88
[64] | 2020 | KNN | 0.01 | 70 | 0.25 | 86
[65] | 2020 | KNN | 0.01 | 70 | 0.25 | 86
[66] | 2020 | RF | 2.48 | 82 | 0.14 | 90
[67] | 2021 | KNN | 0.01 | 70 | 0.25 | 86
Proposed | 2022 | ETRF + XGB | 3.59 | 84 | 0.12 | 92
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
