Supervised Learning Models for the Preliminary Detection of COVID-19 in Patients Using Demographic and Epidemiological Parameters
Abstract
1. Introduction
 Extensive review of background research: We review recent literature on diagnostic procedures for COVID-19 that use AI and ML. Emphasis is placed on articles that include demographic and epidemiological parameters in their data.
 Preprocessing: The data are preprocessed to identify the most important parameters. Correlation techniques are used to highlight the most important columns in the dataset.
 Balancing: We use the Borderline-SMOTE technique to balance the data.
 Feature importance: We highlight relevant feature importance derivation techniques.
 Application of ML models: Machine learning and deep learning techniques have been used to derive insights from the data. As demonstrated below, the models tend to perform quite well for the considered data.
 Analysis of parameters: Information about the various parameters is obtained, and their effect on COVID-19 patients is studied. The results obtained are compared with state-of-the-art studies in the literature using similar data.
 Future directions: We provide an overview of some challenges faced and potential future directions to extend the work.
Motivation and Contributions
2. Related Work
3. Materials and Methods
3.1. Dataset Description
3.2. Data Pre-Processing
3.3. Some Machine Learning Algorithms and Related Terminologies
 Logistic regression: Logistic regression is a widely used statistical classification approach for binary and multiclass problems. It uses the logistic (sigmoid) function to estimate the probability of a class label [70]. Despite its name, it is a classification model, not a regression model. It is simple to implement and performs well when the classes are linearly separable, particularly for binary labels. In terms of the log-odds, where P(Y) is the probability of the positive class, X is the predictor and β0 and β1 are the fitted coefficients, the model can be written as:$$\mathrm{log}\left(\frac{P\left(Y\right)}{1-P\left(Y\right)}\right)={\beta}_{0}+{\beta}_{1}X$$
 Random forest: The random forest (RF) method is a widely used machine learning technique that combines the output of numerous decision trees (DT) to produce a single result [71]. It is based on ensemble learning, which integrates several weak classifiers to solve a complex problem, and it can be used for both regression and classification. RF extends the bagging approach by combining it with feature randomization to generate a forest of largely uncorrelated decision trees. Each tree is trained on a bootstrap sample, drawn with replacement from the training data. For classification, the final prediction is the majority vote across the trees; for regression, it is the average of their outputs. Adding more trees generally makes the predictions more stable and reduces the risk of overfitting. RF can also be used to estimate the importance of each feature. Its main disadvantage is that large forests are complex and require considerable time and memory to train. The Gini impurity and entropy, both measures of the impurity of a node, are given in Equations (4) and (5), where f_k is the fraction of samples at the node that belong to class k and C is the number of classes.$$Gini\text{ }Impurity={\displaystyle \sum}_{k=1}^{C}{f}_{k}\left(1-{f}_{k}\right)$$$$Entropy=-{\displaystyle \sum}_{k=1}^{C}{f}_{k}\mathrm{log}\left({f}_{k}\right)$$
 XGBoost: The extreme gradient boosting (XGBoost) [72] algorithm is another ensemble learning algorithm, which can be applied to classification, regression and ranking problems. Gradient boosting algorithms may suffer from overfitting, particularly on imbalanced data [72]; the regularisation term in XGBoost mitigates this risk. It is an iterative tree-based ensemble classifier that adds trees sequentially, with each new tree fitted to reduce the errors left by the previous iterations. The algorithm has a number of hyperparameters, and a well-chosen combination improves the model’s performance. XGBoost can use several CPU cores, allowing parallel computation during training. The objective function is the sum of a loss term and a regularisation term, as described in Equation (6), where n is the number of training instances, l is the loss between the true label y_k and the prediction y′_k, and ω penalises the complexity of each of the J trees f_j.$$obj\left(\theta \right)={\displaystyle \sum}_{k=1}^{n}l\left({y}_{k},{y}_{k}^{\prime}\right)+{\displaystyle \sum}_{j=1}^{J}\omega \left({f}_{j}\right)$$
 AdaBoost: Adaptive boosting, also referred to as AdaBoost, is an ensemble machine learning approach [73]. It is a meta-algorithm for statistical classification that may be used in combination with a variety of learning algorithms to enhance performance [73]. It typically uses decision stumps, which are single-level decision trees (decision trees with just one split), as weak learners. A key feature of AdaBoost is that it adapts to the results of the previous classifiers. The algorithm first constructs a model in which all data points are assigned equal weights. Points that are misclassified then receive larger weights, so that the models trained subsequently focus on the harder cases. Training continues until the error is sufficiently low or a preset number of weak learners is reached. However, AdaBoost’s performance degrades when irrelevant features are added. It is also generally slower than XGBoost since it is not optimised for speed. The model function for AdaBoost is described in Equation (7), where h_t is the t-th weak learner, α_t is its weight and T is the number of weak learners.$$H\left(x\right)=\mathrm{sign}\left({\displaystyle \sum}_{t=1}^{T}{\alpha}_{t}{h}_{t}\left(x\right)\right)$$
 KNN: The k-nearest neighbours algorithm (k-NN or KNN) is a simple non-parametric supervised ML algorithm used for both regression and classification [74]. The prediction for a new record is based on the k closest training instances. It is known as a “lazy learner” algorithm because it does not build a model during training; it stores the training dataset and defers computation until a prediction is requested. For classification, KNN relies on majority voting among the neighbours. The first step is to select the number of neighbours k. The Euclidean distance (or the Hamming distance for text classification) is then calculated to find the nearest data points. Choosing a suitable value of k is crucial, as it affects the behaviour of the algorithm, and it often requires expertise and experimentation. The benefits of the KNN model include its robustness and ease of implementation. However, because all distance computations happen at prediction time, it increases the computational time during testing, especially for large datasets.
 ANN: Artificial neural networks (ANNs) are loosely modelled on the functioning of the human brain, enabling software programs to discover patterns in large datasets [75]. They use nodes referred to as artificial neurons, interconnected over multiple layers of varying sizes, to mimic the activity of biological neural networks. ANNs can capture complex, non-linear relationships between variables that are difficult to model with simpler statistical methods. The ANN architecture consists of a single input layer connected to one or more hidden layers, which are in turn connected to an output layer. The nodes link to one another, and each has a weight and threshold associated with it. A node is activated, and passes data to the next layer of the network, only when its output exceeds a certain threshold. The node architecture for the ANN model is described in Figure 4.
 SMOTE: Data imbalance is a common problem in medical machine learning and often results in models that are biased towards the majority class. Most traditional classifier learning techniques assume a roughly balanced class distribution and equal misclassification costs, so an imbalanced class distribution carries a considerable performance penalty. An effective method to overcome dataset imbalance in ML is the synthetic minority oversampling technique (SMOTE) [76]. SMOTE uses oversampling to adjust the initial training set. Rather than just replicating minority class cases, SMOTE’s central concept is to generate new artificial instances that are similar to the minority class. These instances are constructed by interpolating between a minority class instance and its nearest minority class neighbours. In this research, a variant called Borderline-SMOTE was used. It is based on the principle that instances near the class borderline are the most likely to be misclassified, whereas instances far from the borderline contribute little to classification, so only borderline minority instances are oversampled [77]. The models are more reliable when the data are balanced. Figure 5 shows the dataset before and after the use of the Borderline-SMOTE algorithm. Further, the data were split randomly in an 80:20 ratio, with the larger partition reserved for training the models and the smaller partition used for testing their performance. Care was taken that both subsets maintained a similar composition and lacked bias.
 SHapley Additive exPlanations (SHAP): SHAP is based on Shapley values from cooperative game theory and is used to increase the interpretability and transparency of ML models [78]. Most ML and deep learning models are compatible with SHAP. The ‘TreeExplainer’ procedure is mainly used with tree-based classifiers such as decision trees, random forests and boosting algorithms. SHAP employs a variety of visualisations to convey the importance of attributes and how they influence the model’s decision making. Each prediction is explained as a baseline (expected) value plus the contribution of each feature.
 Local Interpretable Model-Agnostic Explanations (LIME): LIME is independent of any particular model and can be used with any classifier [79]. The technique seeks to understand an individual prediction by perturbing the input data point and observing how the model’s predictions vary. It then fits a simple, interpretable model to these perturbed samples, which approximates the black-box model in the neighbourhood of the instance and indicates the impact of each attribute on the outcome.
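The Gini impurity and entropy described for the random forest in Section 3.3 can be illustrated with a few lines of Python. This is a minimal sketch rather than the implementation used in this study: the function names and the toy label lists are hypothetical, and a base-2 logarithm is assumed for the entropy, since the text does not specify the base.

```python
import math
from collections import Counter


def class_fractions(labels):
    """Return the fraction of samples at a node that belong to each class."""
    counts = Counter(labels)
    total = len(labels)
    return [count / total for count in counts.values()]


def gini_impurity(labels):
    """Gini impurity: sum over classes of f_k * (1 - f_k)."""
    return sum(f * (1.0 - f) for f in class_fractions(labels))


def entropy(labels):
    """Entropy: minus the sum over classes of f_k * log2(f_k) (base 2 assumed)."""
    return -sum(f * math.log2(f) for f in class_fractions(labels))


# Hypothetical nodes: one containing a single class, one split evenly.
pure_node = [1, 1, 1, 1]
mixed_node = [0, 0, 1, 1]

print(gini_impurity(pure_node), gini_impurity(mixed_node))
print(entropy(pure_node), entropy(mixed_node))
```

Both measures are zero for a node that contains a single class and reach their maximum when the classes are evenly mixed, which is why tree-building algorithms choose the splits that reduce them the most.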
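The KNN procedure described in Section 3.3 (choose k, compute distances, take a majority vote) can be sketched as follows. The data points, labels and function names are hypothetical and serve only to illustrate the steps; a library implementation would normally be used in practice.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbours."""
    # No model is fitted beforehand: all distances are computed at prediction time.
    distances = sorted(
        (euclidean_distance(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical two-feature training set with two well-separated classes.
train_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_points, train_labels, (0.15, 0.10), k=3))
print(knn_predict(train_points, train_labels, (0.95, 1.00), k=3))
```

Because every query requires a distance to each stored training point, the cost of a prediction grows with the size of the training set, which is the testing-time overhead noted above.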
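The interpolation step at the heart of SMOTE can be sketched in plain Python. Note that this is a simplified illustration of basic SMOTE only: it does not include the borderline-instance selection of Borderline-SMOTE, and the function names, the value of k, the seed and the toy minority points are all assumptions made for the example.

```python
import random


def interpolate_sample(sample, neighbour, rng):
    """Create one synthetic point on the segment between `sample` and `neighbour`."""
    gap = rng.random()  # uniform in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbour)]


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Generate `n_new` synthetic minority points by neighbour interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        sample = rng.choice(minority)
        # Find the k nearest minority neighbours of the chosen sample (excluding itself).
        others = [p for p in minority if p is not sample]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, sample)))
        neighbour = rng.choice(others[:k])
        synthetic.append(interpolate_sample(sample, neighbour, rng))
    return synthetic


# Hypothetical minority-class points with two features.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_like_oversample(minority, n_new=3)
print(new_points)
```

Each synthetic point lies between two existing minority instances, so the new data stay within the region already occupied by the minority class instead of duplicating existing records.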
4. Results and Discussion
4.1. Performance Metrics
 Accuracy: It measures the proportion of cases that are diagnosed correctly out of the total number of cases. A diagnosis is correct when the case is predicted as positive and is actually positive, or when it is predicted as negative and is actually negative. It is an important metric to understand whether the model is accurately diagnosing the virus. It is given by the formula:$$\mathrm{Accuracy}=\frac{tp+tn}{tp+tn+fn+fp}$$
 Precision: It is the ratio of patients correctly diagnosed as COVID-19 positive to the total patients predicted as COVID-19 positive by the ML models. It therefore accounts for the false-positive cases, which are the patients incorrectly diagnosed as COVID-19 positive. This metric indicates how reliable a positive prediction is: if a patient is predicted as COVID-19 positive by the model, how likely they are to actually be infected. It is given by the formula below:$$\mathrm{Precision}=\frac{tp}{tp+fp}$$
 Recall: It is the ratio of the patients correctly diagnosed as COVID-19 positive to the total patients infected by the virus. This metric emphasizes the false-negative cases: recall is high when the number of false-negative cases is low. It is calculated by the formula given below:$$\mathrm{Recall}/\mathrm{Sensitivity}=\frac{tp}{tp+fn}$$
 F1-score: It is the harmonic mean of the precision and recall values, giving equal importance to both. It gives a balanced view of the performance on the positive COVID-19 cases. It is given by the following formula:$$\mathrm{F1\text{-}score}=2\times \frac{precision\times recall}{precision+recall}$$
 AUC (area under curve): The ROC (receiver operating characteristic) curve plots the true-positive rate against the false-positive rate at various classification thresholds. It indicates how well the models differentiate between the two classes. The area under this curve is the AUC. High values for AUC indicate that the classifier is performing well.
 Confusion matrix: For binary classification, the confusion matrix is a 2 × 2 matrix that contains all the classified instances. The diagonal elements count the correctly classified instances (true positives and true negatives). The off-diagonal elements count the wrongly classified instances (false positives and false negatives). All the mentioned performance metrics except AUC, which requires the predicted scores, can be calculated directly from the confusion matrix.
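The accuracy, precision, recall and F1-score formulas in Section 4.1 can be computed directly from the four confusion matrix counts, as the following sketch shows. The function name and the counts are hypothetical and are not results from this study; the sketch also assumes that none of the denominators is zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the Section 4.1 metrics from confusion matrix counts.

    Assumes every denominator is non-zero (at least one predicted
    positive and at least one actual positive).
    """
    accuracy = (tp + tn) / (tp + tn + fn + fp)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1_score = 2 * (precision * recall) / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1_score": f1_score,
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=80, tn=90, fp=10, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, 170 of the 200 cases are classified correctly, so the accuracy is 0.85, while the recall of 0.80 shows that 20 of the 100 infected patients were missed.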
4.2. Model Evaluation
4.3. Feature Importance using SHAP and LIME
4.4. Further Discussion
5. Challenges and Future Directions
5.1. Challenges
 Data from a single country: For this research, data were collected from Mexico. However, data from all geographic areas must be considered for better validation. This is not a trivial task as there are clear differences in reporting standards and authenticity across different countries.
 Imbalance in data: In much of medical AI research, data imbalance is a persistent issue. The number of healthy patients is usually greater than the number of infected people, whereas the models perform best when the classes are roughly equally represented. In this research, the Borderline-SMOTE technique was used to balance the data. Appropriate preprocessing should precede model training when working with such data.
 Original values: The data obtained for this research were already normalized. However, original data are required to form accurate medical intuitions.
 Missing blood and clinical markers: Clinical markers, such as CRP (C-reactive protein), D-dimer, ferritin and lactate dehydrogenase (LDH), are known to be extremely useful in diagnosing COVID-19. However, these markers were not available in the dataset.
 Variance in computer equipment: There is no single uniform standard architecture followed universally by machines. The data are quite sensitive to software and hardware changes of the setup.
 Distributional shift in test data: An ML model will struggle to perform well if it is unable to adapt to novel scenarios. Trained models in supervised learning are notoriously bad at detecting meaningful changes in context or data, which leads to inaccurate predictions based on out-of-scope data. When the ML method is incorrectly applied to an unexpected patient situation, it might cause a disparity between the learning and operational data.
 Difficulties in deploying AI systems on a logistical level: Numerous existing difficulties in converting AI applications to clinical practice are due to the fact that the majority of healthcare data are not easily accessible for machine learning. Data are often compartmentalised in a plethora of medical imaging archiving systems, electronic health records (EHR), pathology systems, electronic prescription tools and insurance databases, making integration very challenging.
 Interpreting the result: The model may be able to derive complex and hidden patterns. However, sometimes these patterns might have no meaning. This might be problematic in medical applications, where there is a high need for techniques that are not just effective, but also clear, interpretable and explainable.
 Quality of data: It is essential to obtain reliable input from authentic sources. It is also necessary to filter out the noise which may have crept in while feeding the data.
 Data privacy: Most of the medical data obtained from the patients are highly confidential. A leak, attack or misuse of it can be catastrophic.
5.2. Future Directions
 Improving the dataset: For further research, a more balanced dataset can be collected. Important clinical markers mentioned in the previous section can also be considered. COVID-19 severity can also be predicted.
 Using different algorithms: This research can be expanded by experimenting with different ML algorithms and combining them. Each model has its own pros and cons, so there could be a model which is tailor-made for this dataset.
 Medical validation: Medical validation can be performed by doctors to comment on the authenticity of the models. Further, the models can be deployed in medical facilities and feedback on accuracy can be incorporated.
 Combining other AI methodologies: CT scans, X-rays, MRIs, ultrasound and cough sound analysis also use AI to diagnose COVID-19. The integration of these models is expected to produce compelling results.
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
 Woo, P.C.; Huang, Y.; Lau, S.K.; Yuen, K.Y. Coronavirus genomics and bioinformatics analysis. Viruses 2010, 2, 1804–1820.
 Hayden, F.; Richman, D.; Whitley, R. Clinical Virology, 4th ed.; ASM Press: Washington, DC, USA, 2017.
 Huang, C.; Wang, Y.; Li, X.; Ren, L.; Zhao, J.; Hu, Y.; Zhang, L.; Fan, G.; Xu, J.; Gu, X.; et al. Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. Lancet 2020, 395, 497–506.
 Coronaviridae Study Group of the International Committee on Taxonomy of Viruses. The species Severe acute respiratory syndrome-related coronavirus: Classifying 2019-nCoV and naming it SARS-CoV-2. Nat. Microbiol. 2020, 5, 536–544.
 Yuki, K.; Fujiogi, M.; Koutsogiannaki, S. COVID-19 pathophysiology: A review. Clin. Immunol. 2020, 215, 108427.
 Liu, K.; Chen, Y.; Lin, R.; Han, K. ReviewClinical features of COVID-19 in elderly patients: A comparison with young and middle-aged patients. J. Infect. 2020, 80, e14–e18.
 Singh, A.K.; Gupta, R.; Ghosh, A.; Misra, A. Diabetes in COVID-19: Prevalence, pathophysiology, prognosis and practical considerations. Diabetes Metab. Syndr. 2020, 14, 303–310.
 Zhang, J.; Wang, X.; Jia, X.; Li, J.; Hu, K.; Chen, G.; Wei, J.; Gong, Z.; Zhou, C.; Yu, H.; et al. Risk factors for disease severity, unimprovement, and mortality in COVID-19 patients in Wuhan, China. Clin. Microbiol. Infect. 2020, 26, 767–772.
 Lu, H.; Stratton, C.W.; Tang, Y.W. Outbreak of pneumonia of unknown etiology in Wuhan, China: The mystery and the miracle. J. Med. Virol. 2020, 92, 401–402.
 Johns Hopkins Coronavirus Resource Center. Available online: https://coronavirus.jhu.edu/ (accessed on 1 June 2022).
 Lei, S.; Jiang, F.; Su, W.; Chen, C.; Chen, J.; Mei, W.; Zhan, L.; Jia, Y.; Zhang, L.; Liu, D.; et al. Clinical characteristics and outcomes of patients undergoing surgeries during the incubation period of COVID-19 infection. EClinicalMedicine 2020, 21, 100331.
 Li, Q.; Guan, X.; Wu, P.; Wang, X.; Zhou, L.; Tong, Y.; Ren, R.; Leung, K.; Lau, E.; Wong, J.; et al. Early Transmission Dynamics in Wuhan, China, of Novel Coronavirus–Infected Pneumonia. N. Engl. J. Med. 2020, 382, 1199–1207.
 Habibzadeh, P.; Mofatteh, M.; Silawi, M.; Ghavami, S.; Faghihi, M. Molecular diagnostic assays for COVID-19: An overview. Crit. Rev. Clin. Lab. Sci. 2021, 58, 385–398.
 Mahendiratta, S.; Batra, G.; Sarma, P.; Kumar, H.; Bansal, S.; Kumar, S.; Prakash, A.; Sehgal, R.; Medhi, B. Molecular diagnosis of COVID-19 in different biologic matrix, their diagnostic validity and clinical relevance: A systematic review. Life Sci. 2020, 258, 118207.
 Goudouris, E.S. Laboratory diagnosis of COVID-19. J. Pediatr. 2021, 97, 7–12.
 Zhu, H.; Zhang, H.; Xu, Y.; Laššáková, S.; Korabečná, M.; Neužil, P. PCR past, present and future. BioTechniques 2020, 69, 317–325.
 Falzone, L.; Gattuso, G.; Tsatsakis, A.; Spandidos, D.A.; Libra, M. Current and innovative methods for the diagnosis of COVID-19 infection (Review). Int. J. Mol. Med. 2021, 47, 100.
 Yang, Y.; Yang, M.; Yuan, J.; Wang, F.; Wang, Z.; Li, J.; Zhang, M.; Xing, L.; Wei, J.; Peng, L.; et al. Laboratory Diagnosis and Monitoring the Viral Shedding of SARS-CoV-2 Infection. Innovation 2020, 1, 100061.
 Kucirka, L.M.; Lauer, S.A.; Laeyendecker, O.; Boon, D.; Lessler, J. Variation in False-Negative Rate of Reverse Transcriptase Polymerase Chain Reaction–Based SARS-CoV-2 Tests by Time Since Exposure. Ann. Intern. Med. 2020, 173, 262–267.
 Burog, A.; Yacapin, C.; Maglente, R.; Macalalad-Josue, A.; Uy, E.; Dans, A.; Dans, L. Should IgM/IgG rapid test kit be used in the diagnosis of COVID-19? Acta Med. Philipp. 2020, 54, 1–12.
 Yu, K.H.; Beam, A.L.; Kohane, I.S. Artificial intelligence in healthcare. Nat. Biomed. Eng. 2018, 2, 719–731.
 Rustam, F.; Reshi, A.A.; Mehmood, A.; Ullah, S.; On, B.; Aslam, W.; Choi, G.S. COVID-19 Future Forecasting Using Supervised Machine Learning Models. IEEE Access 2020, 8, 101489–101499.
 Kotsiantis, S.B. Supervised Machine Learning: A Review of Classification Techniques. Emerg. Artif. Intell. Appl. Comput. Eng. 2007, 160, 3–24.
 Quinlan, R. C4.5: Programs for Machine Learning; Morgan Kaufmann Publishers: San Mateo, CA, USA, 1993.
 Liu, D.; Clemente, L.; Poirier, C.; Ding, X.; Chinazzi, M.; Davis, J.T.; Vespignani, A.; Santillana, M. A machine learning methodology for real-time forecasting of the 2019–2020 COVID-19 outbreak using Internet searches, news alerts, and estimates from mechanistic models. arXiv 2020, arXiv:2004.04019.
 Saravanan, R.; Sujatha, P. A state of art techniques on machine learning algorithms: A perspective of supervised learning approaches in data classification. In Proceedings of the IEEE 2018 Second International Conference on Intelligent Computing and Control Systems (ICICCS), Madurai, India, 14–15 June 2018; pp. 945–949.
 Kaelbling, L.; Littman, M.; Moore, A. Reinforcement Learning: A Survey. J. Artif. Intell. Res. 1996, 4, 237–285.
 Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; MIT Press: Cambridge, MA, USA, 2018.
 LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.
 Young, T.; Hazarika, D.; Poria, S.; Cambria, E. Recent Trends in Deep Learning Based Natural Language Processing. IEEE Comput. Intell. Mag. 2018, 13, 55–75.
 Pak, M.S.; Kim, S.H. A review of deep learning in image recognition. In Proceedings of the International Conference on Computer Applications and Information Processing Technology, Kuta Bali, Indonesia, 8–10 August 2017; pp. 1–3.
 Shokeen, J.; Rana, C. An Application-oriented Review of Deep Learning in Recommender Systems. Int. J. Intell. Syst. Appl. 2019, 11, 46–54.
 Lee, W.; Seong, J.J.; Ozlu, B.; Shim, B.S.; Marakhimov, A.; Lee, S. Biosignal Sensors and Deep Learning-Based Speech Recognition: A Review. Sensors 2021, 21, 1399.
 Chadaga, K.; Prabhu, S.; Vivekananda, B.K.; Niranjana, S.; Umakanth, S. Battling COVID-19 using machine learning: A review. Cogent Eng. 2021, 8, 1958666.
 Zou, Q.; Qu, K.; Luo, Y.; Yin, D.; Ju, Y.; Tang, H. Predicting diabetes mellitus with machine learning techniques. Front. Genet. 2018, 9, 515.
 Toğaçar, M.; Ergen, B.; Cömert, Z.; Özyurt, F. A Deep Feature Learning Model for Pneumonia Detection Applying a Combination of mRMR Feature Selection and Machine Learning Models. IRBM 2020, 41, 212–222.
 Kourou, K.; Exarchos, T.; Exarchos, K.P.; Karamouzis, M.V.; Fotiadis, D.I. Machine learning applications in cancer prognosis and prediction. Comput. Struct. Biotechnol. J. 2015, 13, 8–17.
 Pellegrini, E.; Ballerini, L.; Hernandez, M.D.C.V.; Chappell, F.M.; González-Castro, V.; Anblagan, D.; Danso, S.; Muñoz-Maniega, S.; Job, D.; Pernet, C.; et al. Machine learning of neuroimaging for assisted diagnosis of cognitive impairment and dementia: A systematic review. Alzheimer Dement. Diagn. Assess. Dis. Monit. 2018, 10, 519–535.
 Bind, S.; Tiwari, A.K.; Sahani, A.K. A Survey of Machine Learning Based Approaches for Parkinson Disease Prediction. Int. J. Comput. Sci. Inf. Technol. 2015, 6, 1648–1655.
 Musunuri, B.; Shetty, S.; Shetty, D.K.; Vanahalli, M.K.; Pradhan, A.; Naik, N.; Paul, R. Acute-on-Chronic Liver Failure Mortality Prediction using an Artificial Neural Network. Eng. Sci. 2021, 15, 187–196.
 Lalmuanawma, S.; Hussain, J.; Chhakchhuak, L. Applications of machine learning and artificial intelligence for COVID-19 (SARS-CoV-2) pandemic: A review. Chaos Solitons Fractals 2020, 139, 110059.
 Zu, Z.Y.; Jiang, M.D.; Xu, P.P.; Chen, W.; Ni, Q.Q.; Lu, G.M.; Zhang, L.J. Coronavirus Disease 2019 (COVID-19): A Perspective from China. Radiology 2020, 296, E15–E25.
 Lee, E.Y.P.; Ng, M.Y.; Khong, P.L. COVID-19 pneumonia: What has CT taught us? Lancet Infect. Dis. 2020, 20, 384–385.
 Narin, A.; Kaya, C.; Pamuk, Z. Automatic Detection of Coronavirus Disease (COVID-19) Using X-ray Images and Deep Convolutional Neural Networks. Pattern Anal. Appl. 2021, 24, 1207–1220.
 Ozturk, T.; Talo, M.; Yildirim, E.A.; Baloglu, U.B.; Yildirim, O.; Acharya, U. Automated detection of COVID-19 cases using deep neural networks with X-ray images. Comput. Biol. Med. 2020, 121, 103792.
 Smith-Bindman, R.; Yu, S.; Wang, Y.; Kohli, M.D.; Chu, P.; Chung, R.; Luong, J.; Bos, D.; Stewart, C.; Bista, B.; et al. An Image Quality–informed Framework for CT Characterization. Radiology 2022, 302, 380–389.
 Muhammad, L.J.; Algehyne, E.A.; Usman, S.S.; Ahmad, A.; Chakraborty, C.; Mohammed, I.A. Supervised Machine Learning Models for Prediction of COVID-19 Infection using Epidemiology Dataset. SN Comput. Sci. 2020, 2, 11.
 Franklin, M.R. Mexico COVID-19 Clinical Data. Available online: https://www.kaggle.com/marianarfranklin/mexicocovid19clinicaldata/metadata (accessed on 26 June 2020).
 Quiroz-Juárez, M.A.; Torres-Gómez, A.; Hoyo-Ulloa, I.; León-Montiel, R.D.J.; U’Ren, A.B. Identification of high-risk COVID-19 patients using machine learning. PLoS ONE 2021, 16, e0257234.
 Prieto, K. Current forecast of COVID-19 in Mexico: A Bayesian and machine learning approaches. PLoS ONE 2022, 17, e0259958.
 Iwendi, C.; Huescas, C.; Chakraborty, C.G.Y.; Mohan, S. COVID-19 health analysis and prediction using machine learning algorithms for Mexico and Brazil patients. J. Exp. Theor. Artif. Intell. 2022, 1, 1–21.
 Martinez-Velazquez, R.; Tobon, V.D.P.; Sanchez, A.; El Saddik, A.; Petriu, E. A Machine Learning Approach as an Aid for Early COVID-19 Detection. Sensors 2021, 21, 4202.
 Rezapour, M.; Varady, C.A. A machine learning analysis of the relationship between some underlying medical conditions and COVID-19 susceptibility. arXiv 2021, arXiv:2112.12901.
 Maouche, I.; Terrissa, S.L.; Benmohammed, K.; Zerhouni, N.; Boudaira, S. Early Prediction of ICU Admission Within COVID-19 Patients Using Machine Learning Techniques. In Innovations in Smart Cities Applications; Springer: Cham, Switzerland, 2021; Volume 5, pp. 507–517.
 Delgado-Gallegos, J.L.; Avilés-Rodriguez, G.; Padilla-Rivas, G.R.; Cosio-León, M.D.l.Á.; Franco-Villareal, H.; Zuñiga-Violante, E.; Romo-Cardenas, G.S.; Islas, J.F. Clinical applications of machine learning on COVID-19: The use of a decision tree algorithm for the assessement of perceived stress in mexican healthcare professionals. medRxiv 2020.
 Yadav, A. Predicting Covid-19 using Random Forest Machine Learning Algorithm. In Proceedings of the 2021 12th International Conference on Computing Communication and Networking Technologies (ICCCNT), Khargpur, India, 6 July 2021; pp. 1–6.
 Mukherjee, R.; Kundu, A.; Mukherjee, I.; Gupta, D.; Tiwari, P.; Khanna, A.; Shorfuzzaman, M. IoTcloud based healthcare model for COVID19 detection: An enhanced kNearest Neighbour classifier based approach. Computing 2021, 1–21. [Google Scholar] [CrossRef]
 Chaudhary, L.; Singh, B. Community detection using unsupervised machine learning techniques on COVID19 dataset. Soc. Netw. Anal. Min. 2021, 11, 28. [Google Scholar] [CrossRef]
 Cornelius, E.; Akman, O.; Hrozencik, D. COVID19 Mortality Prediction Using Machine LearningIntegrated Random Forest Algorithm under Varying Patient Frailty. Mathematics 2021, 9, 2043. [Google Scholar] [CrossRef]
 WollensteinBetech, S.; Cassandras, C.G.; Paschalidis, I.C. Personalized predictive models for symptomatic COVID19 patients using basic preconditions: Hospitalizations, mortality, and the need for and ICU or ventilator. Int. J. Med. Inform. 2020, 123, 11–22. [Google Scholar] [CrossRef] [PubMed]
 Durden, B.; Shulman, M.; Reynolds, A.; Phillips, T.; Moore, D.; Andrews, I.; Pouriyeh, S. Using Machine Learning Techniques to Predict RTPCR Results for COVID19 Patients. In Proceedings of the 2021 IEEE Symposium on Computers and Communications (ISCC), Athens, Greece, 5–8 September 2021; pp. 1–4. [Google Scholar]
 GuzmánTorres, J.A.; AlonsoGuzmán, E.M.; DomínguezMota, F.J.; TinocoGuerrero, G. Estimation of the Main Conditions in (SARSCoV2) COVID19 Patients That Increase the Risk of Death Using Machine Learning, the Case of Mexico; Elsevier: Amsterdam, The Netherlands, 2021; Volume 27. [Google Scholar]
 Chadaga, K.; Prabhu, S.; Umakanth, S.; Bhat, V.K.; Sampathila, N.; Chadaga, R.P.; Prakasha, K.K. COVID19 Mortality Prediction among Patients Using Epidemiological Parameters: An Ensemble Machine Learning Approach. Eng. Sci. 2021, 16, 221–233. [Google Scholar] [CrossRef]
 Chadaga, K.; Chakraborty, C.; Prabhu, S.; Umakanth, S.; Bhat, V.; Sampathila, N. Clinical and laboratory approach to diagnose COVID19 using machine learning. Interdiscip. Sci. Comput. Life Sci. 2022, 14, 452–470. [Google Scholar] [CrossRef] [PubMed]
 Almansoor, M.; Hewahi, N.M. Exploring the Relation between Blood Tests and COVID19 Using Machine Learning. In Proceedings of the 2020 International Conference on Data Analytics for Business and Industry: Way Towards a Sustainable Economy (ICDABI), Sakheer, Bahrain, 26–27 October 2020; pp. 1–6. [Google Scholar]
 Open Data General Directorate of Epidemiology. Available online: https://www.gob.mx/salud/documentos/datosabiertos152127 (accessed on 26 March 2022).
 Ahlgren, P.; Jarneving, B.; Rousseau, R. Requirements for a cocitation similarity measure, with special reference to Pearson’s correlation coefficient. J. Am. Soc. Inf. Sci. Technol. 2003, 54, 550–560.
 Devillanova, G.; Solimini, S. Min-max solutions to some scalar field equations. Adv. Nonlinear Stud. 2012, 12, 173–186.
 Thara, T.D.K.; Prema, P.S.; Xiong, F. Auto-detection of epileptic seizure events using deep neural network with different feature scaling techniques. Pattern Recognit. Lett. 2019, 128, 544–550.
 Nick, T.G.; Campbell, K.M. Logistic regression. Methods Mol. Biol. 2007, 404, 273–301.
 Belgiu, M.; Drăguţ, L. Random Forest in remote sensing: A review of applications and future directions. ISPRS J. Photogramm. Remote Sens. 2016, 114, 24–31.
 Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In KDD ’16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; Association for Computing Machinery: New York, NY, USA, 2016; pp. 785–794.
 Schapire, R.E. Explaining AdaBoost. In Empirical Inference; Springer: Berlin/Heidelberg, Germany, 2013; pp. 37–52.
 Zhang, M.; Zhou, Z. ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognit. 2007, 40, 2038–2048.
 Krogh, A. What are Artificial Neural Networks? Nat. Biotechnol. 2008, 26, 195–197.
 Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
 Han, H.; Wang, W.Y.; Mao, B.H. Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. Adv. Intell. Comput. 2005, 3644, 878–887.
 Parsa, A.B.; Movahedi, A.; Taghipour, H.; Derrible, S.; Mohammadian, A. Toward Safer Highways, Application of XGBoost and SHAP for Real-Time Accident Detection and Feature Analysis. Accid. Anal. Prev. 2019, 136, 105405.
 Visani, G.; Bagli, E.; Chesani, F.; Poluzzi, A.; Capuzzo, D. Statistical stability indices for LIME: Obtaining reliable explanations for machine learning models. J. Oper. Res. Soc. 2020, 73, 91–101.
 Hatwell, J.; Gaber, M.M.; Azad, R.M.A. Ada-WHIPS: Explaining AdaBoost classification with applications in the health sciences. BMC Med. Inform. Decis. Mak. 2020, 20, 250.
 Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
 Dhanabal, S.; Chandramathi, S. A review of various K-nearest neighbor query processing techniques. Int. J. Comput. Appl. Technol. 2011, 31, 14–22.
| Reference | Models | Accuracy | Critical Analysis/Findings |
|---|---|---|---|
| [58] | K-Means and Principal Component Analysis | | The use of unsupervised learning in COVID-19 diagnosis. The use of principal component analysis in feature selection is also highlighted. |
| [59] | Naïve Bayes, Decision Tree, KNN, Support Vector Machine, Random Forest and Multilayer Perceptron | 96% | The use of data mining to assist machine learning. |
| [60] | Logistic Regression and Support Vector Machine | 72% | Accurate severity classification. |
| [61] | Decision Tree, Random Forest, Rotation Forest, Multi-Layer Perceptron, Naïve Bayes, KNN | 87% | The use of rotation forest in diagnosing COVID-19. |
| [62] | Many ML models | 87% | The main factors in COVID-19 deaths in Mexico were age, chronic diseases, bad eating habits and unnecessary contact with infected people. |
| [63] | Ensemble Algorithms | 96% | The use of feature importance techniques such as Shapley additive explanations (SHAP). |
| [64] | Random Forest, XGBoost, KNN and Logistic Regression | 92% | The use of local interpretable model-agnostic explanations (LIME). |
| [65] | Ensemble Algorithms | 85% | The use of SMOTE-Tomek in data balancing. |

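| Model | Training Accuracy (%) | Training Precision (%) | Training Recall (%) | Training F1-Score (%) | Testing Accuracy (%) | Testing Precision (%) | Testing Recall (%) | Testing F1-Score (%) |
|---|---|---|---|---|---|---|---|---|
| XGBoost | 94.5 | 94.7 | 93.8 | 94.2 | 92 | 92 | 91 | 91.4 |
| AdaBoost | 92.1 | 88.9 | 91.2 | 90 | 90.4 | 90.1 | 89.5 | 89.8 |
| ANN | 86.6 | 84.9 | 83.2 | 84.1 | 86.2 | 88.2 | 83.1 | 85.7 |
| Random forest | 91 | 91.6 | 89.9 | 90.7 | 89 | 88.3 | 88.1 | 88.2 |
| KNN | 91.9 | 92.3 | 90.6 | 91.3 | 91.6 | 91.7 | 90.5 | 91 |
| Logistic Regression | 84.2 | 73.3 | 63.8 | 68.2 | 78.4 | 70 | 60.1 | 64.7 |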
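
The F1-scores in the results table above are consistent with the usual definition of F1 as the harmonic mean of precision and recall. The short sketch below recomputes two of the training F1-scores from the reported precision and recall. The helper name `f1_from_precision_recall` is hypothetical and introduced only for illustration; the inputs are already rounded to one decimal place, so small rounding differences against the reported values can be expected for other rows.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (output uses the same units as the inputs)."""
    if precision + recall == 0:
        return 0.0  # common convention when both precision and recall are zero
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs in percent, copied from the training columns of the table.
training_rows = {
    "XGBoost": (94.7, 93.8),
    "Logistic Regression": (73.3, 63.8),
}

for model, (precision, recall) in training_rows.items():
    f1 = f1_from_precision_recall(precision, recall)
    print(f"{model}: F1 = {f1:.1f}")
```

For these two rows the recomputed values round to 94.2 and 68.2, matching the reported training F1-scores.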

| Reference | Dataset Origin | ML Models Used | No. of Parameters Considered | Accuracy | Feature Importance |
|---|---|---|---|---|---|
| [48] | Mexico | Five | 10 | 94.99% | No |
| [49] | Mexico | Various ML models | 21 | 93.50% | No |
| [51] | Mexico | Various ML models | | 69% | No |
| [52] | Mexico | Various ML models | 22 | Sensitivity 75% | Gini Index |
| [53] | Mexico | Various ML models | 14 | Qualitative | No |
| Proposed | Mexico | Six | 10 | 94.50% | SHAP and LIME |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations. 
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Pradhan, A.; Prabhu, S.; Chadaga, K.; Sengupta, S.; Nath, G. Supervised Learning Models for the Preliminary Detection of COVID-19 in Patients Using Demographic and Epidemiological Parameters. Information 2022, 13, 330. https://doi.org/10.3390/info13070330