Using Recurrent Neural Networks for Predicting Type-2 Diabetes from Genomic and Tabular Data
Abstract
1. Introduction
- The reference gene sequence is compared against the trained genomic data to find matching gene patterns, and the correlation between the reference gene and the gene patterns associated with diabetes is then assessed.
- The softmax layer estimates the probability of future illness from the gene correlation, and the risk factor outcome is derived from these probabilities.
- The proposed RNN model is also evaluated on tabular patient data, such as the PIMA dataset, for risk analysis; auxiliary memory components such as GRU and LSTM are integrated for better prediction performance.
- Feature selection and weight optimization are performed on the features of the PIMA dataset for better prediction outcomes.
- The outcome of the present study is evaluated against conventional classification techniques such as Decision Tree, J48, K Nearest Neighbor, Logistic Regression, Naive Bayes, Random Forest, and Support Vector Machine.
2. Literature Review
2.1. ML Models for Smart Diagnosis of Type-2 Diabetes
2.2. Deep Learning for Type-2 Diabetes
2.3. Genomics for Type 2 Diabetes
- Step 1: Preliminary genome-wide analysis and data preprocessing;
- Step 2: Identifying the gene-set definitions whose patterns have to be recognized;
- Step 3: Processing the genomic data, such as filtering and identifying gene patterns;
- Step 4: Identifying gene set analysis models, such as identifying the statistical hypothesis;
- Step 5: Assessing the statistical magnitude;
- Step 6: Report summarization and visualization.
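Steps 4 and 5 above can be illustrated with one common choice of statistical hypothesis for gene set analysis, the over-representation (hypergeometric) test. This is a minimal sketch under stated assumptions: the source does not say which test is used, and the function name `hypergeom_enrichment_p` and all counts are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(universe, gene_set, hits, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap).

    universe: total number of genes considered
    gene_set: number of those genes that belong to the gene set
    hits:     number of genes flagged as significant
    overlap:  number of flagged genes that fall inside the gene set
    """
    upper = min(gene_set, hits)
    total = comb(universe, hits)
    tail = sum(
        comb(gene_set, k) * comb(universe - gene_set, hits - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts (not taken from the study): 20 genes in the universe,
# 5 in the gene set, 4 flagged genes, 3 of which fall inside the gene set.
p_value = hypergeom_enrichment_p(universe=20, gene_set=5, hits=4, overlap=3)
print(f"enrichment p-value = {p_value:.4f}")
```

A small p-value suggests that the gene set is over-represented among the flagged genes, which is the "statistical magnitude" assessed in Step 5.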
3. Methodology
3.1. Recurrent Neural Network Model for Type 2 Diabetes Forecasting Based on Genomic Data
3.1.1. Data Collection and Processing
3.1.2. Feature Selection
3.1.3. Layered Architecture of RNN-Based Prediction Model
- The pooling layer is often applied after the convolutional layer. Its purpose is to reduce the volume of the input matrix for subsequent layers. The current study uses the MaxPooling function.
- A flattening operation transforms the data into a one-dimensional array for use in the subsequent layer, so that the CNN’s output can be passed to a fully connected network.
- A neural network is a collection of non-linear, mutually dependent functions, each of which is built from neurons (or perceptrons). In a fully connected layer, the neuron applies a weight matrix to the input vector as a linear transformation. The result is then passed through a non-linear activation function, as shown in Equation (6).
- The softmax function represents a set of numbers as probabilities: it exponentiates each value in the vector and divides it by the sum of the exponentials of all values, so that the outputs are positive and sum to one. The likelihood of belonging to each class is calculated from the softmax output.
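The four layers described above can be traced with a small forward pass. This is an illustrative sketch only, not the study's implementation: the feature maps, weights, biases, window size, and the choice of `tanh` as the non-linear activation are all assumed values.

```python
import math


def max_pool_1d(values, size):
    """Keep the maximum of each non-overlapping window of `size` values."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


def flatten(feature_maps):
    """Concatenate several feature maps into one one-dimensional list."""
    return [v for fmap in feature_maps for v in fmap]


def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def softmax(scores):
    """Exponentiate and normalize so the outputs are positive and sum to one."""
    peak = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical output of a convolutional layer: two feature maps of length 4.
feature_maps = [[0.1, 0.9, 0.3, 0.4], [0.5, 0.2, 0.8, 0.7]]
pooled = [max_pool_1d(fmap, size=2) for fmap in feature_maps]
flat = flatten(pooled)

# Hypothetical weights for one hidden layer and a two-class output layer.
hidden_w = [[0.5, -0.2, 0.1, 0.3], [-0.4, 0.6, 0.2, -0.1]]
hidden_b = [0.0, 0.1]
hidden = [math.tanh(z) for z in dense(flat, hidden_w, hidden_b)]

out_w = [[1.0, -1.0], [-1.0, 1.0]]
out_b = [0.0, 0.0]
probs = softmax(dense(hidden, out_w, out_b))
print(probs)
```

Each entry of `probs` can be read as the estimated likelihood of the corresponding class.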
3.1.4. RNN Component Structure
3.1.5. GRU Component Structure
3.1.6. LSTM Component Structure
3.1.7. Working Procedure of the Proposed Approach
- Step 1: Acquire gene data from the annotated miRBase dataset;
- Step 2: Preprocess the data to remove outliers and fill gaps in the acquired data;
- Step 3: Convert the data into 1D data, then align the genomic patterns;
- Step 4: Divide the data into a training set (80% of the data) and a testing set (20% of the data);
- Step 5: Label the patterns based on the sequence patterns of various illnesses. Weights are assigned in the later phases according to the correlation between the input sequence and the trained gene pattern;
- Step 6: When a new gene sequence is fed as input for testing the algorithm, extract features through the mRMR approach, which is pivotal in the prediction process;
- Step 7: Evaluate the cumulative weight from the assigned weights, based on the correlation of gene sequences between the input and the trained set;
- Step 8: Assess the probability of a future illness based on the approximated weight of the gene sequence;
- Step 9: Make the final assumptions based on the probabilistic approximations.
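The mRMR selection named in Step 6 can be sketched as a greedy search that trades relevance to the label against redundancy with already selected features. The sketch below assumes discretized features and the mutual-information difference criterion; the source does not state which mRMR variant is used, and the feature names and values are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def mrmr(features, labels, k):
    """Greedily pick k feature names by relevance minus mean redundancy."""
    relevance = {name: mutual_information(col, labels)
                 for name, col in features.items()}
    selected = []

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(
            mutual_information(features[name], features[s]) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while len(selected) < min(k, len(features)):
        candidates = [name for name in features if name not in selected]
        selected.append(max(candidates, key=score))
    return selected


# Hypothetical discretized features; motif_b duplicates motif_a on purpose.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "motif_a": [0, 0, 0, 1, 1, 1, 1, 1],
    "motif_b": [0, 0, 0, 1, 1, 1, 1, 1],
    "motif_c": [0, 0, 1, 0, 1, 1, 0, 1],
}
selected = mrmr(features, labels, k=2)
print(selected)
```

In this toy data the duplicate feature is skipped in the second round because its redundancy with the first pick outweighs its relevance, which is the behavior mRMR is designed to produce.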
3.2. RNN Model for Illness Prediction from Tabular Data (PIMA Dataset)
3.2.1. Feature Weight Initialization
3.2.2. Weight Optimization
3.3. Dataset Description
3.4. Implementation Environments
4. Results and Discussion
4.1. Experimental Outcome of Genomic Data
4.2. Experimental Outcome with Tabular Data (PIMA Dataset)
4.3. Practical Implications
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Zou, Q.; Qu, K.; Luo, Y.; Yin, D.; Ju, Y.; Tang, H. Predicting Diabetes Mellitus with Machine Learning Techniques. Front. Genet. 2018, 9, 515.
- Hemu, A.A.; Mim, R.B.; Ali, M.; Nayer, M.; Ahmed, K.; Bui, F.M. Identification of Significant Risk Factors and Impact for ASD Prediction among Children Using Machine Learning Approach. In Proceedings of the 2022 Second International Conference on Advances in Electrical, Computing, Communication and Sustainable Technologies (ICAECT), Bhilai, India, 21–22 April 2022; pp. 1–6.
- Ravaut, M.; Sadeghi, H.; Leung, K.K.; Volkovs, M.; Rosella, L.C. Diabetes mellitus forecasting using population health data in Ontario, Canada. arXiv 2019, arXiv:1904.04137.
- Deberneh, H.; Kim, I. Prediction of Type 2 Diabetes Based on Machine Learning Algorithm. Int. J. Environ. Res. Public Health 2021, 18, 3317.
- Arshad, A.; Khan, Y.D. DNA Computing: A Survey. In Proceedings of the 2019 International Conference on Innovative Computing (ICIC), Lahore, Pakistan, 1–2 November 2019; pp. 1–5.
- Cho, Y.S.; Chen, C.-H.; Hu, C.; Long, J.; Ong, R.T.H.; Sim, X.; Takeuchi, F.; Wu, Y.; Go, M.J.; et al.; DIAGRAM Consortium. Meta-analysis of genome-wide association studies identifies eight new loci for type 2 diabetes in east Asians. Nat. Genet. 2011, 44, 67–72.
- The Coronary Artery Disease (C4D) Genetics Consortium. A genome-wide association study in Europeans and South Asians identifies five new loci for coronary artery disease. Nat. Genet. 2011, 43, 339–344.
- Duncan, L.; Shen, H.; Gelaye, B.; Meijsen, J.; Ressler, K.; Feldman, M.; Peterson, R.; Domingue, B. Analysis of polygenic risk score usage and performance in diverse human populations. Nat. Commun. 2019, 10, 3328.
- Jordan, D.M.; Do, R. Using Full Genomic Information to Predict Disease: Breaking Down the Barriers Between Complex and Mendelian Diseases. Annu. Rev. Genom. Hum. Genet. 2018, 19, 289–301.
- Rahaman, A.; Ali, M.; Ahmed, K.; Bui, F.M.; Mahmud, S.M.H. Performance Analysis between YOLOv5s and YOLOv5m Model to Detect and Count Blood Cells: Deep Learning Approach. In Proceedings of the 2nd International Conference on Computing Advancements (ICCA’22). Association for Computing Machinery, Dhaka, Bangladesh, 10–12 March 2022; pp. 316–322.
- Ontor, Z.H.; Ali, M.; Ahmed, K.; Bui, F.M.; Al-Zahrani, F.A.; Mahmud, S.M.H.; Azam, S. Early-Stage Cervical Cancerous Cell Detection from Cervix Images Using YOLOv5. Comput. Mater. Contin. 2023, 74, 3727–3741.
- So, H.-C.; Sham, P.C. Exploring the predictive power of polygenic scores derived from genome-wide association studies: A study of 10 complex traits. Bioinformatics 2016, 33, 886–892.
- Sarra, R.R.; Dinar, A.M.; Mohammed, M.A.; Ghani, M.K.A.; Albahar, M.A. A Robust Framework for Data Generative and Heart Disease Prediction Based on Efficient Deep Learning Models. Diagnostics 2022, 12, 2899.
- Ali, M.; Ahmed, K.; Bui, F.M.; Paul, B.K.; Ibrahim, S.M.; Quinn, J.M.; Moni, M.A. Machine learning-based statistical analysis for early stage detection of cervical cancer. Comput. Biol. Med. 2021, 139, 104985.
- Ali, M.M.; Paul, B.K.; Ahmed, K.; Bui, F.M.; Quinn, J.M.W.; Moni, M.A. Heart disease prediction using supervised machine learning algorithms: Performance analysis and comparison. Comput. Biol. Med. 2021, 136, 104672.
- Bell, C.G.; Teschendorff, A.E.; Rakyan, V.K.; Maxwell, A.P.; Beck, S.; Savage, D.A. Genome-wide DNA methylation analysis for diabetic nephropathy in type 1 diabetes mellitus. BMC Med. Genom. 2010, 3, 33.
- Konishi, T.; Matsukuma, S.; Fuji, H.; Nakamura, D.; Satou, N.; Okano, K. Principal Component Analysis applied directly to Sequence Matrix. Sci. Rep. 2019, 9, 19297.
- Mallik, S.; Mukhopadhyay, A.; Maulik, U.; Bandyopadhyay, S. Integrated analysis of gene expression and genome-wide DNA methylation for tumor prediction: An association rule mining-based approach. In Proceedings of the 2013 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), Singapore, 16–19 April 2013; pp. 120–127.
- Mallik, S.; Bhadra, T.; Mukherji, A. DTFP-Growth: Dynamic Threshold-Based FP-Growth Rule Mining Algorithm Through Integrating Gene Expression, Methylation, and Protein–Protein Interaction Profiles. IEEE Trans. NanoBiosci. 2018, 17, 117–125.
- Huang, S.; Cai, N.; Pacheco, P.P.; Narandes, S.; Wang, Y.; Xu, W. Applications of Support Vector Machine (SVM) Learning in Cancer Genomics. Cancer Genom.-Proteom. 2018, 15, 41–51.
- Parry, R.M.; Jones, W.; Stokes, T.H.; Phan, J.H.; Moffitt, R.; Fang, H.; Shi, L.; Oberthuer, A.; Fischer, M.; Tong, W.; et al. k-Nearest neighbor models for microarray gene expression analysis and clinical outcome prediction. Pharmacogenom. J. 2010, 10, 292–309.
- Wright, M.N.; Ziegler, A. Ranger: A fast implementation of random forests for high dimensional data in C++ and R. J. Stat. Softw. 2017, 77, 1–17.
- Nagaraj, P.; Deepalakshmi, P.; Ijaz, M.F. Optimized adaptive tree seed Kalman filter for a diabetes recommendation system—Bilevel performance improvement strategy for healthcare applications. In Intelligent Data-Centric Systems, Cognitive and Soft Computing Techniques for the Analysis of Healthcare Data; Academic Press: Cambridge, MA, USA, 2022; pp. 191–202.
- Mantzaris, D.H.; Anastassopoulos, G.C.; Lymberopoulos, D.K. Medical disease prediction using Artificial Neural Networks. In Proceedings of the 8th IEEE International Conference on BioInformatics and BioEngineering, Athens, Greece, 8–10 October 2008; pp. 1–6.
- Huang, P.-J.; Chang, J.-H.; Lin, H.-H.; Li, Y.-X.; Lee, C.-C.; Su, C.-T.; Li, Y.-L.; Chang, M.-T.; Weng, S.; Cheng, W.-H.; et al. DeepVariant-on-Spark: Small-Scale Genome Analysis Using a Cloud-Based Computing Framework. Comput. Math. Methods Med. 2020, 2020, 7231205.
- Koumakis, L. Deep learning models in genomics; are we there yet? Comput. Struct. Biotechnol. J. 2020, 18, 1466–1473.
- Van Dam, S.; Võsa, U.; Van Der Graaf, A.; Franke, L.; De Magalhães, J.P. Gene co-expression analysis for functional classification and gene–disease predictions. Brief. Bioinform. 2018, 19, 575–592.
- Travnik, J.B.; Mathewson, K.W.; Sutton, R.S.; Pilarski, P.M. Reactive Reinforcement Learning in Asynchronous Environments. Front. Robot. AI 2018, 5, 79.
- Battineni, G.; Sagaro, G.G.; Chinatalapudi, N.; Amenta, F. Applications of Machine Learning Predictive Models in the Chronic Disease Diagnosis. J. Pers. Med. 2020, 10, 21.
- Yue, C.; Xin, L.; Kewen, X.; Chang, S. An Intelligent Diagnosis to Type 2 Diabetes Based on QPSO Algorithm and WLS-SVM. In Proceedings of the 2008 International Symposium on Intelligent Information Technology Application Workshops, Shanghai, China, 21–22 December 2008; pp. 117–121.
- Srinivasu, P.N.; Rao, T.S.; Dicu, A.M.; Mnerie, C.A.; Olariu, I. A comparative review of optimisation techniques in segmentation of brain MR images. J. Intell. Fuzzy Syst. 2020, 38, 6031–6043.
- Nadesh, R.K.; Arivuselvan, K. Type 2: Diabetes mellitus prediction using Deep Neural Networks classifier. Int. J. Cogn. Comput. Eng. 2020, 1, 55–61.
- Abedini, M.; Bijari, A.; Banirostam, T. Classification of Pima Indian Diabetes Dataset using Ensemble of Decision Tree, Logistic Regression and Neural Network. Int. J. Adv. Res. Comput. Commun. Eng. 2020, 9, 1–4.
- Kundu, N.; Rani, G.; Dhaka, V.S.; Gupta, K.; Nayak, S.C.; Verma, S.; Ijaz, M.F.; Woźniak, M. IoT and Interpretable Machine Learning Based Framework for Disease Prediction in Pearl Millet. Sensors 2021, 21, 5386.
- Reddy, G.S.; Chittineni, S. Entropy based C4.5-SHO algorithm with information gain optimization in data mining. PeerJ Comput. Sci. 2021, 7, e424.
- Luukka, P. Feature selection using fuzzy entropy measures with similarity classifier. Expert Syst. Appl. 2011, 38, 4600–4607.
- Szmidt, E.; Kacprzyk, J. Some Problems with Entropy Measures for the Atanassov Intuitionistic Fuzzy Sets. In Applications of Fuzzy Theory, Proceedings of the WILF 2007, Camogli, Italy, 7–10 July 2007; Lecture Notes in Computer Science; Masulli, F., Mitra, S., Pasi, G., Eds.; Springer: Berlin/Heidelberg, Germany, 2007; Volume 4578.
- Choubey, D.K.; Paul, S. GA_RBF NN: A classification system for diabetes. Int. J. Biomed. Eng. Technol. 2017, 23, 71–93.
- Jackins, V.; Vimal, S.; Kaliappan, M.; Lee, M.Y. AI-based smart prediction of clinical disease using random forest classifier and Naive Bayes. J. Supercomput. 2020, 77, 5198–5219.
- Almustafa, K.M. Prediction of heart disease and classifiers’ sensitivity analysis. BMC Bioinform. 2020, 21, 278.
- Tayeb, S.; Pirouz, M.; Sun, J.; Hall, K.; Chang, A.; Li, J.; Song, C.; Chauhan, A.; Ferra, M.; Sager, T.; et al. Toward predicting medical conditions using k-nearest neighbors. In Proceedings of the 2017 IEEE International Conference on Big Data (Big Data), Boston, MA, USA, 11–14 December 2017; pp. 3897–3903.
- Xu, W.; Zhao, Y.; Nian, S.; Feng, L.; Bai, X.; Luo, X.; Luo, F. Differential analysis of disease risk assessment using binary logistic regression with different analysis strategies. J. Int. Med. Res. 2018, 46, 3656–3664.
- Wei, W.; Visweswaran, S.; Cooper, G.F. The application of naive Bayes model averaging to predict Alzheimer’s disease from genome-wide data. J. Am. Med. Inform. Assoc. 2011, 18, 370–375.
- Benbelkacem, S.; Atmani, B. Random Forests for Diabetes Diagnosis. In Proceedings of the 2019 International Conference on Computer and Information Sciences (ICCIS), Sakaka, Saudi Arabia, 10–11 April 2019; pp. 1–4.
- Son, Y.-J.; Kim, H.-G.; Kim, E.-H.; Choi, S.; Lee, S.-K. Application of Support Vector Machine for Prediction of Medication Adherence in Heart Failure Patients. Health Inform. Res. 2010, 16, 253–259.
- Ghaheri, A.; Shoar, S.; Naderan, M.; Hoseini, S.S. The Applications of Genetic Algorithms in Medicine. Oman Med. J. 2015, 30, 406–416.
- Swapna, G.; Soman, K.P.; Vinayakumar, R. Diabetes Detection Using ECG Signals: An Overview. In Deep Learning Techniques for Biomedical and Health Informatics; Studies in Big Data; Dash, S., Acharya, B., Mittal, M., Abraham, A., Kelemen, A., Eds.; Springer: Cham, Switzerland, 2019; Volume 68.
- Available online: https://www.forrester.com/webinar/AI+Software+Market+Sizing+Understand+Forresters+Four+Segments+To+Invest+Wisely/-/E-WEB32605?utm_source=prnewswire&utm_medium=pr&utm_campaign=cio20 (accessed on 7 October 2021).
- Mooney, M.A.; Wilmot, B. Gene set analysis: A step-by-step guide. Am. J. Med. Genet. Part B Neuropsychiatr. Genet. 2015, 168, 517–527.
- Mathur, R.; Rotroff, D.; Ma, J.; Shojaie, A.; Motsinger-Reif, A. Gene set analysis methods: A systematic comparison. BioData Min. 2018, 11, 8.
- Leevy, J.L.; Khoshgoftaar, T.M.; Villanustre, F. Survey on RNN and CRF models for de-identification of medical free text. J. Big Data 2020, 7, 73.
- Yadav, S.S.; Jadhav, S.M. Deep convolutional neural network based medical image classification for disease diagnosis. J. Big Data 2019, 6, 113.
- SivaSai, J.G.; Srinivasu, P.N.; Sindhuri, M.N.; Rohitha, K.; Deepika, S. An Automated Segmentation of Brain MR Image through Fuzzy Recurrent Neural Network. In Bio-Inspired Neurocomputing; Studies in Computational Intelligence; Bhoi, A., Mallick, P., Liu, C.M., Balas, V., Eds.; Springer: Singapore, 2020; Volume 903.
- Ahmed, S.; Srinivasu, P.N.; Alhumam, A.; Alarfaj, M. AAL and Internet of Medical Things for Monitoring Type-2 Diabetic Patients. Diagnostics 2022, 12, 2739.
- Kozomara, A.; Birgaoanu, M.; Griffiths-Jones, S. miRBase: From microRNA sequences to function. Nucleic Acids Res. 2019, 47, D155–D162.
- Guan, Z.-X.; Li, S.-H.; Zhang, Z.-M.; Zhang, D.; Yang, H.; Ding, H. A Brief Survey for MicroRNA Precursor Identification Using Machine Learning Methods. Curr. Genom. 2020, 21, 11–25.
- Hira, Z.M.; Gillies, D.F. A Review of feature selection and feature extraction methods applied on microarray data. Adv. Bioinform. 2015, 2015, 198363.
- Shirzad, M.B.; Keyvanpour, M.R. A feature selection method based on minimum redundancy maximum relevance for learning to rank. In Proceedings of the 2015 AI & Robotics (IRANOPEN), Qazvin, Iran, 12 April 2015; pp. 1–5.
- Fang, H.; Tang, P.; Si, H. Feature Selections Using Minimal Redundancy Maximal Relevance Algorithm for Human Activity Recognition in Smart Home Environments. J. Health Eng. 2020, 2020, 8876782.
- Carrara, F.; Elias, P.; Sedmidubsky, J.; Zezula, P. LSTM-based real-time action detection and prediction in human motion streams. Multimed. Tools Appl. 2019, 78, 27309–27331.
- Che, Z.; Purushotham, S.; Cho, K.; Sontag, D.; Liu, Y. Recurrent Neural Networks for Multivariate Time Series with Missing Values. Sci. Rep. 2018, 8, 6085.
- Srinivasu, P.N.; JayaLakshmi, G.; Jhaveri, R.H.; Praveen, S.P. Ambient Assistive Living for Monitoring the Physical Activity of Diabetic Adults through Body Area Networks. Mob. Inf. Syst. 2022, 2022, 3169927.
- Wu, L.; Kong, C.; Hao, X.; Chen, W. A Short-Term Load Forecasting Method Based on GRU-CNN Hybrid Neural Network Model. Math. Probl. Eng. 2020, 2020, 1428104.
- Swapna, G.; Soman, K.; Vinayakumar, R. Automated detection of diabetes using CNN and CNN-LSTM network and heart rate signals. Procedia Comput. Sci. 2018, 132, 1253–1262.
- Srinivasu, P.N.; SivaSai, J.G.; Ijaz, M.F.; Bhoi, A.K.; Kim, W.; Kang, J.J. Classification of skin disease using deep learning neural networks with MobileNet V2 and LSTM. Sensors 2021, 21, 2852.
- Ijaz, M.F.; Alfian, G.; Syafrudin, M.; Rhee, J. Hybrid Prediction Model for Type 2 Diabetes and Hypertension Using DBSCAN-Based Outlier Detection, Synthetic Minority Over Sampling Technique (SMOTE), and Random Forest. Appl. Sci. 2018, 8, 1325.
- Zhang, L.; Wang, Y.; Niu, M.; Wang, C.; Wang, Z. Machine learning for characterizing risk of type 2 diabetes mellitus in a rural Chinese population: The Henan Rural Cohort Study. Sci. Rep. 2020, 10, 4406.
- Naz, H.; Ahuja, S. Deep learning approach for diabetes prediction using PIMA Indian dataset. J. Diabetes Metab. Disord. 2020, 19, 391–403.
- Hertzog, M.I.; Correa, U.B.; Araujo, R.M. SpreadOut: A Kernel Weight Initializer for Convolutional Neural Networks. In Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary, 14–19 July 2019; pp. 1–7.
- Yang, B.; Urtasun, R. Learning to Reweight Examples for Robust Deep Learning. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018.
- Mitra, A.; Mohanty, D.; Ijaz, M.F.; Rana, A.u.H.S. Deep Learning Approach for Object Features Detection. In Advances in Communication, Devices, and Networking; Lecture Notes in Electrical Engineering; Dhar, S., Mukhopadhyay, S.C., Sur, S.N., Liu, C.M., Eds.; Springer: Singapore, 2022; Volume 776.
- Pranto, B.; Mehnaz, S.M.; Mahid, E.B.; Sadman, I.M.; Rahman, A.; Momen, S. Evaluating Machine Learning Methods for Predicting Diabetes among Female Patients in Bangladesh. Information 2020, 11, 374.
- Lai, H.; Huang, H.; Keshavjee, K.; Guergachi, A.; Gao, X. Predictive models for diabetes mellitus using machine learning techniques. BMC Endocr. Disord. 2019, 19, 101.
- Web-Based Data-Science Environment. Available online: https://www.kaggle.com/ (accessed on 8 January 2022).
- Ontor, Z.H.; Ali, M.; Hossain, S.S.; Nayer, M.; Ahmed, K.; Bui, F.M. YOLO_CC: Deep Learning based Approach for Early Stage Detection of Cervical Cancer from Cervix Images Using YOLOv5s Model. In Proceedings of the 2022 Second International Conference on Advances in Electrical, Computing, Communication and Sustainable Technologies (ICAECT), Bhilai, India, 21–22 April 2022; pp. 1–5.
- Srinivasu, P.N.; Rao, T.S.; Balas, V.E. A systematic approach for identification of tumor regions in the human brain through HARIS algorithm. In Deep Learning Techniques for Biomedical and Health Informatics; Academic Press: Cambridge, MA, USA, 2020; pp. 97–118.
- Tigga, N.P.; Garg, S. Prediction of Type 2 Diabetes using Machine Learning Classification Methods. Procedia Comput. Sci. 2020, 167, 706–716.
- Larabi-Marie-Sainte, S.; Aburahmah, L.; Almohaini, R.; Saba, T. Current Techniques for Diabetes Prediction: Review and Case Study. Appl. Sci. 2019, 9, 4604.
- Ijaz, M.F.; Attique, M.; Son, Y. Data-Driven Cervical Cancer Prediction Model with Outlier Detection and Over-Sampling Methods. Sensors 2020, 20, 2809.
- Vulli, A.; Srinivasu, P.N.; Sashank, M.S.K.; Shafi, J.; Choi, J.; Ijaz, M.F. Fine-Tuned DenseNet-169 for Breast Cancer Metastasis Prediction Using FastAI and 1-Cycle Policy. Sensors 2022, 22, 2988.
- Chae, S.; Kwon, S.; Lee, D. Predicting Infectious Disease Using Deep Learning and Big Data. Int. J. Environ. Res. Public Health 2018, 15, 1596.
- Pinto, M.F.; Oliveira, H.; Batista, S.; Cruz, L.; Pinto, M.; Correia, I.; Martins, P.; Teixeira, C. Prediction of disease progression and outcomes in multiple sclerosis with machine learning. Sci. Rep. 2020, 10, 21038.
- Srinivasu, P.N.; Balas, V.E. Self-Learning Network-based segmentation for real-time brain M.R. images through HARIS. PeerJ Comput. Sci. 2021, 7, e654.
Approach | Type of Data | Applicability | Limitations |
---|---|---|---|
Polygenic scores-based approach [12] | Genomic Data | Used in the evaluation of clinical trials and illness screening mechanisms. | The polygenic score approach needs larger samples and extensive training for considerable accuracy. |
Singular Value Decomposition [13] | Genomic Data, Tabular Data, Image Data | Used in ranking the feature set and in compressing the data through least-squares fitting. Gene sequences are ranked based on the probability of illness. | SVD is not designed as a prediction algorithm; it is a matrix decomposition mechanism. There are various neural ranking models that perform much better than SVD. |
Principal Component Analysis [14] | Genomic Data, Tabular Data | PCA is extensively used in gene analysis to discover regional and ethnic patterns of genetic variation. | The independent gene expressions are less interpretable, and information loss is possible if the number of components is not carefully chosen. |
Gene Co-Expression model [27] | Genomic Data | The Gene Co-Expression model analyzes insights in the genomic data through similarity assessment of expressions and topologies. | The Gene Co-Expression model may not deal with feature sets larger than the data size or with non-linearity in the network architecture. |
Reinforcement approaches (SARSA, DDPG, DQN) [28] | Genomic Data, Tabular Data, Image Data | Reinforcement learning models are widely used in studies where the states in the problem are deterministic and in situations where control over the environment is needed. RL models are reported to exhibit better non-linearity in gene analysis. | Excessive reinforcement may result in an overflow of states, which might reduce the effectiveness of the findings. RL models are also data-hungry. |
Decision Tree [39] | Tabular Data, Image Data | With Decision Trees, the preprocessing effort is reduced because normalization and scaling are not required, and missing values do not influence the model’s outcome. | DT models take more time to train and require more effort. |
J48 [40] | Tabular Data, Image Data | J48 is a decision tree that handles outliers effectively and robustly in non-linear problems. | The J48 model is less stable, and noisy data compromises its efficiency. |
K Nearest Neighbor [41] | Tabular Data, Image Data | The K Nearest Neighbor model does not need prior training to classify the data. It requires less computational effort and yields results faster. | The KNN model fails to work with larger datasets and high-dimensional data. Feature scaling is crucial for an optimal classification level and requires considerable effort. |
Logistic Regression [42] | Tabular Data, Image Data | Logistic Regression is a predominantly used classification technique. The model efficiently classifies the data based on the likelihood and the association among the data items. The model can withstand overfitting and underfitting issues. | Logistic Regression assumes that the data are linearly separable, and it often overfits when the observations are few relative to the feature set size. |
Naive Bayes [43] | Tabular Data, Image Data | Naive Bayes algorithms perform well for multi-class classification models with minimal training. | NB assumes that all features are mutually independent in the classification process. NB may not perform well on problems with an interdependent feature set. |
Random Forest [44] | Tabular Data, Image Data | Random Forest models perform bagging for classification. RF models efficiently reduce over-fitting and can handle missing values effectively. Moreover, feature scaling need not be performed. | RF models need extensive training, and frequent hyperparameter tuning is required for considerable accuracy. |
Support Vector Machine [45] | Tabular Data, Image Data | Support Vector Machine handles high-dimensional data efficiently and is memory-efficient. | SVM is inappropriate for larger datasets with larger feature sets. The outcome of the SVM model depends largely on the objective function. Too many support vectors are generated when a larger kernel is chosen, which might impact the model’s training process. |
Genetic Algorithm [46] | Genomic Data, Tabular Data, Image Data | A genetic algorithm is an evolutionary algorithm that uses probabilistic transition rules; non-linearity in the search process can yield better model accuracy. It can also effectively handle a larger search space. | The genetic algorithm is susceptible to local maxima and minima and, similarly, to global maxima and minima, which might result in poor prediction performance. |
Gene Data | Type 2 Diabetes | Fasting Glucose | Alleles | SNP | Megabase |
---|---|---|---|---|---|
GLS2 | ✔ | G/A | rs2657879 | 55.2 | |
P2RX2 | ✔ | A/G | rs10747083 | 131.6 | |
WARS | ✔ | G/T | rs3783347 | 99.9 | |
BCAR1 | ✔ | T/G | rs7202877 | 73.8 | |
ANKRD55 | ✔ | G/A | rs459193 | 55.8 | |
TLE1 | ✔ | G/A | rs2796441 | 83.5 | |
KLHDC5 | ✔ | C/T | rs10842994 | 27.9 | |
ANK1 | ✔ | C/T | rs516946 | 41.6 | |
ZMIZ1 | ✔ | A/G | rs12571751 | 80.6 |
Feature | Data Type | Min Value | Max Value | Information Gain | Mean Rank |
---|---|---|---|---|---|
Glucose (mg/dL) | Integer | 0 | 199 | 0.2497 | 3 |
Pregnancies | Integer | 0 | 17 | ~ | ~ |
Age | Integer | 21 | 81 | 0.0761 | 3.17 |
Heart Rate | Integer | | | | 7.67 |
Waist | Integer | ~ | ~ | 0.0356 | 9.5 |
Pulse Pressure | Integer | | | ~ | 12.33 |
Insulin (µU/mL) | Integer | 0 | 846 | ~ | 13.33 |
Hypertension (Blood Pressure) (mm Hg) | Integer | 0 | 122 | 0.0304 (bp1), 0 (bp2) | 15 |
BMI (weight) (kg/m²) | Real | 0 | 67.1 | ~ | ~ |
Diabetes Pedigree Function | Real | 0.08 | 2.42 | ~ | ~ |
Skin thickness (mm) | Real | 0 | 99 | ~ | ~ |
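
The Information Gain column above can be read as an entropy reduction. The following sketch shows one way such a value could be computed for a single numeric feature split at a threshold; the source does not describe how the column was derived, so the thresholding scheme, the function names, and the readings are all hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting on `value <= threshold`."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part)
                    for part in (left, right) if part)
    return entropy(labels) - remainder


# Hypothetical glucose readings (mg/dL) and outcomes (1 = diabetic).
glucose = [85, 90, 110, 150, 160, 180]
outcome = [0, 0, 0, 1, 1, 1]

gain_clean_split = information_gain(glucose, outcome, threshold=120)
gain_poor_split = information_gain(glucose, outcome, threshold=100)
print(gain_clean_split, gain_poor_split)
```

A threshold that separates the classes cleanly recovers the full label entropy, while a poorer threshold yields a smaller gain.
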
Metric | Estimated Value |
---|---|
Sensitivity | 83.66 |
Specificity | 49.38 |
Precision | 75.73 |
Accuracy | 71.79 |
Matthews Correlation Coefficient | 35.09 |
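
The metrics in the table above all follow from the four cells of a confusion matrix. The sketch below uses the standard definitions; the counts are hypothetical and are not the study's data, and the function name `classification_metrics` is introduced only for illustration.

```python
from math import sqrt


def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=40, fp=10, tn=35, fn=15)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The function returns fractions between 0 and 1 (MCC ranges from -1 to 1); multiplying by 100 gives percentages in the form used in the table above.
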
Model | Sensitivity | Specificity | Accuracy | F1-Score | MCC |
---|---|---|---|---|---|
RNN Model | 0.800 | 0.690 | 0.753 | 0.825 | 0.473 |
RNN + GRU | 0.786 | 0.652 | 0.744 | 0.809 | 0.426 |
RNN + LSTM | 0.826 | 0.679 | 0.774 | 0.823 | 0.505 |
RNN Model (WO) | 0.819 | 0.742 | 0.796 | 0.848 | 0.541 |
RNN + GRU(WO) | 0.833 | 0.733 | 0.800 | 0.849 | 0.558 |
RNN + LSTM(WO) | 0.815 | 0.793 | 0.810 | 0.856 | 0.568 |
Model | Sensitivity | Specificity | Accuracy | F1-Score | MCC |
---|---|---|---|---|---|
Decision Tree | 0.781 | 0.561 | 0.697 | 0.762 | 0.349 |
J48 | 0.688 | 0.695 | 0.691 | 0.754 | 0.383 |
K Nearest Neighbour | 0.748 | 0.603 | 0.708 | 0.787 | 0.331 |
Logistic Regression | 0.775 | 0.666 | 0.744 | 0.813 | 0.416 |
Naive Bayes | 0.820 | 0.687 | 0.689 | 0.830 | 0.502 |
Random Forest | 0.789 | 0.661 | 0.750 | 0.813 | 0.436 |
Support Vector Machine | 0.775 | 0.666 | 0.744 | 0.813 | 0.416 |
REPTree | 0.530 | 0.744 | 0.590 | ||
SMO | 0.280 | 0.724 | 0.410 | ||
BayesNet | 0.570 | 0.738 | 0.600 | ||
RNN model | 0.837 | 0.774 | 0.818 | 0.864 | 0.591 |
Value of K | RNN Model | RNN + GRU | RNN + LSTM | RNN Model (WO) | RNN + GRU (WO) | RNN + LSTM (WO) |
---|---|---|---|---|---|---|
2 | 0.716 | 0.704 | 0.723 | 0.752 | 0.771 | 0.789 |
5 | 0.745 | 0.739 | 0.770 | 0.791 | 0.799 | 0.812 |
10 | 0.774 | 0.762 | 0.798 | 0.810 | 0.821 | 0.824 |
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Srinivasu, P.N.; Shafi, J.; Krishna, T.B.; Sujatha, C.N.; Praveen, S.P.; Ijaz, M.F. Using Recurrent Neural Networks for Predicting Type-2 Diabetes from Genomic and Tabular Data. Diagnostics 2022, 12, 3067. https://doi.org/10.3390/diagnostics12123067