# A Novel Approach to Decision-Making on Diagnosing Oncological Diseases Using Machine Learning Classifiers Based on Datasets Combining Known and/or New Generated Features of a Different Nature

## Abstract


## 1. Introduction

- Transition from standardized clinical protocols to a personalized approach to patient care due to the accumulation of a large amount of medical data, as well as the widespread use of individual biomonitoring devices;
- Disease prevention through early diagnosis and regular health monitoring using wearable devices;
- A focus on the patient and their active involvement in the treatment process.

- Approaches using various class balancing algorithms that implement oversampling techniques (for example, the SMOTE (Synthetic Minority Oversampling Technique) algorithm [21,22,23] and the ADASYN (Adaptive Synthetic Sampling Approach) algorithm [24]), undersampling techniques (for example, the Tomek Links algorithm [23,25]), and their combinations;
- Approaches that apply algorithms sensitive to the cost of wrong decisions (cost-sensitive algorithms) [10];

- Algorithms for engineering (generation) of new features for data patterns, which make it possible to move to a space of higher dimension [38].

- Avoid loss of information during undersampling or reducing the dimension of the data space;
- Exclude the introduction of false or redundant information during oversampling or increasing the dimension of the data space.
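The interpolation idea behind the SMOTE family of oversampling algorithms [21] can be sketched in a few lines of pure Python. This is a simplified illustration with a hypothetical helper name, not the reference implementation: each synthetic sample is placed on the segment between a minority sample and one of its k nearest minority neighbors.

```python
import random


def smote_oversample(minority, n_new, k=5, seed=42):
    """Generate n_new synthetic minority samples by interpolating between
    each chosen sample and one of its k nearest minority-class neighbors
    (the core SMOTE idea; assumes at least two minority samples)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbors of `base` within the minority class only
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbor)))
    return synthetic
```

Borderline-SMOTE [22] restricts the choice of `base` to minority samples near the class boundary, and ADASYN [24] draws more synthetic samples where the minority class is harder to learn; both keep this interpolation step.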

## 2. Materials and Methods

#### 2.1. Aspects of Development of Classifiers

#### 2.1.1. kNN Classifier Development
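The kNN classifiers in the experiments below are tuned over two parameters, n_neighbors and weights. A minimal pure-Python sketch of a distance-weighted kNN vote (a hypothetical helper, not the author's code; with weights = "distance" an exact match is given weight 1.0 for simplicity) might look as follows:

```python
import math
from collections import defaultdict


def knn_predict(train_X, train_y, x, n_neighbors=6, weights="distance"):
    """Classify x by a vote among its n_neighbors nearest training samples;
    with weights="distance" each vote is weighted by inverse distance."""
    nearest = sorted(
        (math.dist(x, xi), yi) for xi, yi in zip(train_X, train_y)
    )[:n_neighbors]
    votes = defaultdict(float)
    for d, label in nearest:
        votes[label] += 1.0 / d if (weights == "distance" and d > 0) else 1.0
    return max(votes, key=votes.get)
```

Here `weights = "uniform"` reduces to a plain majority vote, which is the other setting reported in the tables below.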

#### 2.1.2. SVM Classifier Development
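The SVM classifiers below use the radial basis function (RBF) kernel and are tuned over its parameter gamma together with the regularization parameter C. The kernel itself is a one-liner (values here are illustrative, matching the order of magnitude reported in the tables below):

```python
import math


def rbf_kernel(x, z, gamma=1.2):
    """RBF kernel K(x, z) = exp(-gamma * ||x - z||^2). gamma controls how
    fast similarity decays with distance; the regularization parameter C
    (used by the SVM trainer, not by the kernel) trades margin width
    against training errors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))
```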

#### 2.2. Quality Metrics of Multiclass Classification
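The headline metric of the experiments, ${\mathit{MacroF}}_{1}-\mathit{score}$, averages per-class F1 values with equal class weight, so minority classes count as much as the majority class. A minimal sketch (the function name is illustrative):

```python
def macro_f1(y_true, y_pred):
    """Macro F1: compute precision, recall and F1 per class, then average
    the per-class F1 values with equal weight for every class."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)
```

MacroRecall and MacroPrecision, also reported in the tables below, are the same equal-weight averages of the per-class recall and precision values.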

#### 2.3. Solving the Class Imbalance Problem

- Algorithms accounting for the sensitivity to the cost of wrong decisions [10];

#### 2.4. UMAP Algorithm

#### 2.5. Entropies, Hjorth Parameters and Fractal Dimensions
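Two of the feature families used for feature generation can be computed directly from a data pattern treated as a signal. A minimal sketch of the Hjorth mobility/complexity parameters [46] and the Petrosian fractal dimension [49], using the standard textbook formulas (helper names are illustrative):

```python
import math


def _variance(x):
    m = sum(x) / len(x)
    return sum((v - m) ** 2 for v in x) / len(x)


def hjorth_mobility_complexity(x):
    """Hjorth mobility = sqrt(var(dx)/var(x));
    complexity = mobility(dx)/mobility(x), with dx the first difference."""
    dx = [b - a for a, b in zip(x, x[1:])]
    ddx = [b - a for a, b in zip(dx, dx[1:])]
    mobility = math.sqrt(_variance(dx) / _variance(x))
    complexity = math.sqrt(_variance(ddx) / _variance(dx)) / mobility
    return mobility, complexity


def petrosian_fd(x):
    """Petrosian fractal dimension from the number of sign changes
    N_delta in the first difference of a sequence of length n:
    PFD = log10(n) / (log10(n) + log10(n / (n + 0.4 * N_delta)))."""
    dx = [b - a for a, b in zip(x, x[1:])]
    n_delta = sum(1 for a, b in zip(dx, dx[1:]) if a * b < 0)
    n = len(x)
    return math.log10(n) / (math.log10(n) + math.log10(n / (n + 0.4 * n_delta)))
```

A monotonic pattern has no sign changes in its first difference and so yields PFD = 1, while a rapidly oscillating pattern yields a larger value; this is what makes such descriptors usable as class-discriminating features in Tables 1 and 2.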

## 3. A Novel Approach to the Generation of Datasets and the Development of Classifiers

- Adding all possible combinations of three groups of features based on the UMAP algorithm, one entropy and two fractal dimensions (as 1 of 3, 2 of 3, 3 of 3) to the features of the original dataset;
- Adding all possible combinations of two feature groups based on one entropy and two fractal dimensions (as 1 of 2, 2 of 2) to the features based on the UMAP algorithm.
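The enumeration of feature-group combinations described above (1 of 3, 2 of 3, 3 of 3) is a walk over the non-empty subsets of three feature groups, which is easy to express with itertools. The group labels below are illustrative placeholders:

```python
from itertools import combinations

# Three new feature groups from the scheme above: a UMAP embedding,
# one entropy, and two fractal dimensions (labels are placeholders).
groups = ["UMAP", "entropy", "fractal_dims"]

# All non-empty combinations (1 of 3, 2 of 3, 3 of 3) that can be
# appended to the features of the original dataset:
added_sets = [c for r in (1, 2, 3) for c in combinations(groups, r)]
```

For three groups this yields 7 combinations; together with the original dataset and the UMAP-only variants, this is how the 12 datasets C1–C12 of Section 4.3 arise.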

## 4. Experimental Studies

#### 4.1. Data Analysis Based on the UMAP Algorithm

#### 4.2. Generation of New Features Based on Entropies, Hjorth Parameters and Fractal Dimensions of Data Patterns

#### 4.3. Generation of Datasets Used in the Development of Classifiers

- C1 is the original dataset (it contains 39 features);
- C2 is a dataset based on the UMAP algorithm (it contains from 2 to H features as a result of embedding in a space of lower dimension);
- C3 is a dataset based on the original dataset and the UMAP algorithm (it generates from 2 to H features);
- C4 is a dataset based on the UMAP algorithm (it generates from 2 to H features) and one entropy;
- C5 is a dataset based on the UMAP algorithm (it generates from 2 to H features) and two fractal dimensions;
- C6 is a dataset based on the UMAP algorithm (it generates from 2 to H features), one entropy and two fractal dimensions;
- C7 is a dataset based on the original dataset, the UMAP algorithm (it generates from 2 to H features) and one entropy;
- C8 is a dataset based on the original dataset and one entropy;
- C9 is a dataset based on the original dataset and two fractal dimensions;
- C10 is a dataset based on the original dataset, one entropy and two fractal dimensions;
- C11 is a dataset based on the original dataset, the UMAP algorithm (it generates from 2 to H features) and two fractal dimensions;
- C12 is a dataset based on the original dataset, the UMAP algorithm (it generates from 2 to H features), one entropy and two fractal dimensions.

#### 4.4. Aspects of k-Fold Cross-Validation
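With the strong class imbalance in the data, plain k-fold splits can leave a fold with no samples of a rare class, so stratified k-fold cross-validation [70,71] is the natural choice. A minimal sketch of the stratification step (without shuffling; the function name is illustrative):

```python
from collections import defaultdict


def stratified_kfold_indices(y, k=5):
    """Split sample indices into k folds that preserve the class
    proportions of y: the indices of every class are dealt round-robin
    across the folds."""
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for j, i in enumerate(indices):
            folds[j % k].append(i)
    return folds
```

Each fold then serves once as the test set while the remaining k − 1 folds form the training set; any class balancing is applied to the training folds only.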

#### 4.5. Development of the Classifiers

#### 4.6. Development of kNN Classifiers

#### 4.6.1. Experiment without Class Balancing

The value of the metric ${\mathit{MacroF}}_{1}-\mathit{score}$ of the best classifier in [10] was equal to 0.819, and the values of such metrics as Accuracy, Recall and Precision were equal to 0.952, 0.807 and 0.833, respectively. Unfortunately, the rules for choosing the best classifier in our study and in [10] may differ somewhat (for example, we do not know whether standard deviation estimates were calculated in that study), but the results suggest that our best classifiers (those with the maximum values of the quality metrics) clearly outperform the best classifier in [10]. To confirm these conclusions, we provide additional information on classifiers C2 (with h = 36), C4 (with h = 11) and C5 (with h = 19) (for classifier C6 (with h = 24), this information is given in Table 5).

#### 4.6.2. Class Balancing Experiment

#### 4.7. Development of SVM Classifiers

#### 4.7.1. Experiment without Class Balancing

#### 4.7.2. Class Balancing Experiment

## 5. Discussion

## 6. Conclusions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## References

1. Global Health Care Outlook. 2021. Available online: https://www2.deloitte.com/cn/en/pages/life-sciences-and-healthcare/articles/2021-global-healthcare-outlook.html (accessed on 3 January 2023).
2. Li, G.; Hu, J.; Hu, G. Biomarker Studies in Early Detection and Prognosis of Breast Cancer. Adv. Exp. Med. Biol. 2017, 1026, 27–39.
3. Loke, S.Y.; Lee, A.S.G. The future of blood-based biomarkers for the early detection of breast cancer. Eur. J. Cancer 2018, 92, 54–68.
4. Cohen, J.D.; Li, L.; Wang, Y.; Thoburn, C.; Afsari, B.; Danilova, L.; Douville, C.; Javed, A.A.; Wong, F.; Mattox, A.; et al. Detection and localization of surgically resectable cancers with a multi-analyte blood test. Science 2018, 359, 926–930.
5. Killock, D. CancerSEEK and destroy—a blood test for early cancer detection. Nat. Rev. Clin. Oncol. 2018, 15, 133.
6. Hao, Y.; Jing, X.Y.; Sun, Q. Joint learning sample similarity and correlation representation for cancer survival prediction. BMC Bioinform. 2022, 23, 553.
7. Núñez, C. Blood-based protein biomarkers in breast cancer. Clin. Chim. Acta 2019, 490, 113–127.
8. Du, Z.; Liu, X.; Wei, X.; Luo, H.; Li, P.; Shi, M.; Guo, B.; Cui, Y.; Su, Z.; Zeng, J.; et al. Quantitative proteomics identifies a plasma multi-protein model for detection of hepatocellular carcinoma. Sci. Rep. 2020, 10, 15552.
9. Kalinich, M.; Haber, D.A. Cancer detection: Seeking signals in blood. Science 2018, 359, 866–867.
10. Song, C.; Li, X. Cost-Sensitive KNN Algorithm for Cancer Prediction Based on Entropy Analysis. Entropy 2022, 24, 253.
11. Huang, S.; Cai, N.; Pacheco, P.P.; Narrandes, S.; Wang, Y.; Xu, W. Applications of Support Vector Machine (SVM) Learning in Cancer Genomics. Cancer Genom. Proteom. 2018, 15, 41–51.
12. Sepehri, M.M.; Khavaninzadeh, M.; Rezapour, M.; Teimourpour, B. A data mining approach to fistula surgery failure analysis in hemodialysis patients. In Proceedings of the 2011 18th Iranian Conference of Biomedical Engineering (ICBME), Tehran, Iran, 14–16 December 2011; pp. 15–20.
13. Rezapour, M.; Zadeh, M.K.; Sepehri, M.M. Implementation of Predictive Data Mining Techniques for Identifying Risk Factors of Early AVF Failure in Hemodialysis Patients. Comput. Math. Methods Med. 2013, 2013, 830745.
14. Rezapour, M.; Zadeh, K.M.; Sepehri, M.M.; Alborzi, M. Less primary fistula failure in hypertensive patients. J. Hum. Hypertens. 2018, 32, 311–318.
15. Toth, R.; Schiffmann, H.; Hube-Magg, C.; Büscheck, F.; Höflmayer, D.; Weidemann, S.; Lebok, P.; Fraune, C.; Minner, S.; Schlomm, T.; et al. Random forest-based modelling to detect biomarkers for prostate cancer progression. Clin. Epigenet. 2019, 11, 148.
16. Savareh, B.A.; Aghdaie, H.A.; Behmanesh, A.; Bashiri, A.; Sadeghi, A.; Zali, M.; Shams, R. A machine learning approach identified a diagnostic model for pancreatic cancer through using circulating microRNA signatures. Pancreatology 2020, 20, 1195–1204.
17. Lv, J.; Wang, J.; Shang, X.; Liu, F.; Guo, S. Survival prediction in patients with colon adenocarcinoma via multi-omics data integration using a deep learning algorithm. Biosci. Rep. 2020, 40, BSR20201482.
18. Chaudhary, K.; Poirion, O.B.; Lu, L.; Garmire, L.X. Deep learning-based multi-omics integration robustly predicts survival in liver cancer. Clin. Cancer Res. 2018, 24, 1248–1259.
19. Lee, T.Y.; Huang, K.Y.; Chuang, C.H.; Lee, C.Y.; Chang, T.H. Incorporating deep learning and multi-omics autoencoding for analysis of lung adenocarcinoma prognostication. Comput. Biol. Chem. 2020, 87, 107277.
20. Qadri, S.F.; Shen, L.; Ahmad, M.; Qadri, S.; Zareen, S.S.; Akbar, M.A. SVseg: Stacked Sparse Autoencoder-Based Patch Classification Modeling for Vertebrae Segmentation. Mathematics 2022, 10, 796.
21. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357.
22. Han, H.; Wang, W.Y.; Mao, B.H. Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. In Advances in Intelligent Computing, ICIC 2005; Lecture Notes in Computer Science; Huang, D.S., Zhang, X.P., Huang, G.B., Eds.; Springer: Berlin/Heidelberg, Germany, 2005; Volume 3644, pp. 878–887.
23. Swana, E.F.; Doorsamy, W.; Bokoro, P. Tomek Link and SMOTE Approaches for Machine Fault Classification with an Imbalanced Dataset. Sensors 2022, 22, 3246.
24. He, H.; Bai, Y.; Garcia, E.A.; Li, S. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, 1–8 June 2008; pp. 1322–1328.
25. Tomek, I. Two modifications of CNN. IEEE Trans. Syst. Man Cybern. 1976, 6, 769–772.
26. Candès, E.J.; Li, X.; Ma, Y.; Wright, J. Robust principal component analysis? J. ACM 2011, 58, 1–37.
27. Jolliffe, I.T.; Cadima, J. Principal component analysis: A review and recent developments. Phil. Trans. R. Soc. A 2016, 374, 20150202.
28. van der Maaten, L.; Hinton, G.E. Visualizing Data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605.
29. McInnes, L.; Healy, J.; Melville, J. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv 2018, arXiv:1802.03426.
30. Dorrity, M.W.; Saunders, L.M.; Queitsch, C.; Fields, S.; Trapnell, C. Dimensionality reduction by UMAP to visualize physical and genetic interactions. Nat. Commun. 2020, 11, 1537.
31. Becht, E.; McInnes, L.; Healy, J.; Dutertre, C.A.; Kwok, I.W.H.; Ng, L.G.; Ginhoux, F.; Newell, E.W. Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. 2019, 37, 38–44.
32. Demidova, L.A.; Gorchakov, A.V. Fuzzy Information Discrimination Measures and Their Application to Low Dimensional Embedding Construction in the UMAP Algorithm. J. Imaging 2022, 8, 113.
33. Yu, W.; Liu, T.; Valdez, R.; Gwinn, M.; Khoury, M.J. Application of support vector machine modeling for prediction of common diseases: The case of diabetes and pre-diabetes. BMC Med. Inform. Decis. Mak. 2010, 10, 16.
34. Demidova, L.A. Two-stage hybrid data classifiers based on SVM and kNN algorithms. Symmetry 2021, 13, 615.
35. Khan, S.S.; Madden, M.G. One-class classification: Taxonomy of study and review of techniques. Knowl. Eng. Rev. 2014, 29, 345–374.
36. Scholkopf, B.; Williamson, R.C.; Smola, A.J.; Shawe-Taylor, J.; Platt, J. Estimating the support of a high-dimensional distribution. Neural Comput. 2001, 13, 1443–1471.
37. Liu, F.T.; Ting, K.M.; Zhou, Z.-H. Isolation-Based Anomaly Detection. ACM Trans. Knowl. Discov. Data 2012, 6, 1–39.
38. Zheng, A.; Casari, A. Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists, 1st ed.; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2018; p. 201.
39. COSMIC | Catalogue of Somatic Mutations in Cancer. Available online: https://cancer.sanger.ac.uk/cosmic (accessed on 3 January 2023).
40. Zanin, M.; Zunino, L.; Rosso, O.A.; Papo, D. Permutation Entropy and Its Main Biomedical and Econophysics Applications: A Review. Entropy 2012, 14, 1553–1577.
41. Zhang, A.; Yang, B.; Huang, L. Feature Extraction of EEG Signals Using Power Spectral Entropy. In Proceedings of the International Conference on BioMedical Engineering and Informatics, Sanya, China, 27–30 May 2008; Volume 2, pp. 435–439.
42. Weng, X.; Perry, A.; Maroun, M.; Vuong, L.T. Singular Value Decomposition and Entropy Dimension of Fractals. arXiv 2022, arXiv:2211.12338.
43. Pincus, S.M. Approximate entropy as a measure of system complexity. Proc. Natl. Acad. Sci. USA 1991, 88, 2297–2301.
44. Pincus, S.M.; Gladstone, I.M.; Ehrenkranz, R.A. A regularity statistic for medical data analysis. J. Clin. Monit. Comput. 1991, 7, 335–345.
45. Delgado-Bonal, A.; Marshak, A. Approximate Entropy and Sample Entropy: A Comprehensive Tutorial. Entropy 2019, 21, 541.
46. Hjorth, B. EEG Analysis Based on Time Domain Properties. Electroencephalogr. Clin. Neurophysiol. 1970, 29, 306–310.
47. Galvão, F.; Alarcão, S.M.; Fonseca, M.J. Predicting Exact Valence and Arousal Values from EEG. Sensors 2021, 21, 3414.
48. Shi, C.-T. Signal Pattern Recognition Based on Fractal Features and Machine Learning. Appl. Sci. 2018, 8, 1327.
49. Petrosian, A. Kolmogorov Complexity of Finite Sequences and Recognition of Different Preictal EEG Patterns. In Proceedings of the Computer-Based Medical Systems, Lubbock, TX, USA, 9–10 June 1995; pp. 212–217.
50. Katz, M.J. Fractals and the analysis of waveforms. Comput. Biol. Med. 1988, 18, 145–156.
51. Gil, A.; Glavan, V.; Wawrzaszek, A.; Modzelewska, R.; Tomasik, L. Katz Fractal Dimension of Geoelectric Field during Severe Geomagnetic Storms. Entropy 2021, 23, 1531.
52. Higuchi, T. Approach to an irregular time series on the basis of the fractal theory. Phys. D Nonlinear Phenom. 1988, 31, 277–283.
53. Hall, P.; Park, B.U.; Samworth, R.J. Choice of neighbor order in nearest-neighbor classification. Ann. Stat. 2008, 36, 2135–2152.
54. Nigsch, F.; Bender, A.; Van Buuren, B.; Tissen, J.; Nigsch, A.E.; Mitchell, J.B. Melting point prediction employing k-nearest neighbor algorithms and genetic parameter optimization. J. Chem. Inf. Model. 2006, 46, 2412–2422.
55. Xing, W.; Bei, Y. Medical Health Big Data Classification Based on KNN Classification Algorithm. IEEE Access 2020, 8, 28808–28819.
56. Mohanty, S.; Mishra, A.; Saxena, A. Medical Data Analysis Using Machine Learning with KNN. In International Conference on Innovative Computing and Communications; Gupta, D., Khanna, A., Bhattacharyya, S., Hassanien, A.E., Anand, S., Jaiswal, A., Eds.; Advances in Intelligent Systems and Computing; Springer: Singapore, 2020; Volume 1166.
57. Chapelle, O.; Vapnik, V.; Bousquet, O.; Mukherjee, S. Choosing multiple parameters for support vector machines. Mach. Learn. 2002, 46, 131–159.
58. Demidova, L.; Nikulchev, E.; Sokolova, Y. Big data classification using the SVM classifiers with the modified particle swarm optimization and the SVM ensembles. Int. J. Adv. Comput. Sci. Appl. 2016, 7, 294–312.
59. Schober, P.; Vetter, T.R. Logistic Regression in Medical Research. Anesth. Analg. 2021, 132, 365–366.
60. Dai, B.; Chen, R.-C.; Zhu, S.-Z.; Zhang, W.-W. Using Random Forest Algorithm for Breast Cancer Diagnosis. In Proceedings of the 2018 International Symposium on Computer, Consumer and Control (IS3C), Taichung, Taiwan, 6–8 December 2018; pp. 449–452.
61. Acharjee, A.; Larkman, J.; Xu, Y.; Cardoso, V.R.; Gkoutos, G.V. A random forest based biomarker discovery and power analysis framework for diagnostics research. BMC Med. Genom. 2020, 13, 178.
62. Cheng, S.; Liu, B.; Ting, T.O.; Qin, Q.; Shi, Y.; Huang, K. Survey on data science with population-based algorithms. Big Data Anal. 2016, 1, 3.
63. Demidova, L.A.; Gorchakov, A.V. Application of bioinspired global optimization algorithms to the improvement of the prediction accuracy of compact extreme learning machines. Russ. Technol. J. 2022, 10, 59–74.
64. Liu, J.-Y.; Jia, B.-B. Combining One-vs-One Decomposition and Instance-Based Learning for Multi-Class Classification. IEEE Access 2020, 8, 197499–197507.
65. Grandini, M.; Bagli, E.; Visani, G. Metrics for Multi-class Classification: An Overview. arXiv 2020, arXiv:2008.05756.
66. Haibo, H.; Yunqian, M. Imbalanced Learning: Foundations, Algorithms, and Applications; Wiley-IEEE Press: Hoboken, NJ, USA, 2013; p. 216.
67. Krawczyk, B. Learning from imbalanced data: Open challenges and future directions. Prog. Artif. Intell. 2016, 5, 221–232.
68. Dong, W.; Moses, C.; Li, K. Efficient k-nearest neighbor graph construction for generic similarity measures. In Proceedings of the 20th International Conference on World Wide Web, Hyderabad, India, 28 March–1 April 2011; pp. 577–586.
69. Damrich, S.; Hamprecht, F.A. On UMAP’s true loss function. Adv. Neural Inf. Process. Syst. 2021, 34, 12.
70. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. Available online: https://umap-learn.readthedocs.io/en/latest/_modules/umap/umap_.html (accessed on 4 January 2023).
71. Prusty, S.; Patnaik, S.; Dash, S. SKCV: Stratified K-fold cross-validation on ML classifiers for predicting cervical cancer. Front. Nanotechnol. 2022, 4, 972421.
72. Slamet, W.; Herlambang, B.; Samudi, S. Stratified K-fold cross validation optimization on machine learning for prediction. Sink. J. Dan Penelit. Tek. Inform. 2022, 7, 2407–2414.

**Figure 3.** Visualization of the nine-class dataset of ODs using the UMAP algorithm (n_neighbors = 15, min_dist = 0.1, random_state = 42, metric = ‘euclidean’).

**Figure 4.** Visualization of the three-class dataset of ODs using the UMAP algorithm (n_neighbors = 15, min_dist = 0.1, random_state = 42, metric = ‘euclidean’).

**Figure 5.** Visualization of the results of the experiment of choosing the best kNN classifier based on 12 datasets without using a class balancing algorithm (n_neighbors is the number of nearest neighbors; weights is the parameter that assigns weight coefficients to the neighbors; q is the dimension of the space corresponding to the dataset used for the development of the classifier; h is the dimension of the space into which the UMAP algorithm embeds the 39-dimensional feature space of the original dataset; the colored background shows the standard deviation around the mean of the metric ${\mathit{MacroF}}_{1}-\mathit{score}$).

**Figure 6.** Visualization of the results of the experiment of choosing the best kNN classifier based on 12 datasets using the Borderline SMOTE-1 class balancing algorithm (n_neighbors is the number of nearest neighbors; weights is the parameter that assigns weight coefficients to the neighbors; q is the dimension of the space corresponding to the dataset used for the development of the classifier; h is the dimension of the space into which the UMAP algorithm embeds the 39-dimensional feature space of the original dataset; the colored background shows the standard deviation around the mean of the metric ${\mathit{MacroF}}_{1}-\mathit{score}$).

**Figure 7.** Visualization of the results of the experiment of choosing the best SVM classifier based on 12 datasets without using class balancing algorithms (gamma is the parameter of the radial basis function kernel; C is the regularization parameter; q is the dimension of the space corresponding to the dataset used for the development of the classifier; h is the dimension of the space into which the UMAP algorithm embeds the 39-dimensional feature space of the original dataset; the colored background shows the standard deviation around the mean of the metric ${\mathit{MacroF}}_{1}-\mathit{score}$).

**Figure 8.** Visualization of the results of the experiment of choosing the best SVM classifier based on 12 datasets using the SMOTE algorithm for class balancing (gamma is the parameter of the radial basis function kernel; C is the regularization parameter; q is the dimension of the space corresponding to the dataset used for the development of the classifier; h is the dimension of the space into which the UMAP algorithm embeds the 39-dimensional feature space of the original dataset; the colored background shows the standard deviation around the mean of the metric ${\mathit{MacroF}}_{1}-\mathit{score}$).

**Table 1.** Mean values for each of the three classes for each potential new generated feature.

| Class | PE | SPE | SVDE | AE ^{1} | SE | HM | HC | PFD | KFD | HFD |
|---|---|---|---|---|---|---|---|---|---|---|
| Normal | 0.977 | 0.921 | 0.943 | 0.454 | 0.508 | 1.335 | 1.334 | 1.068 | 1.609 | 2.337 |
| Liver | 0.981 | 0.913 | 0.937 | 0.257 | 0.425 | 1.326 | 1.336 | 1.069 | 1.548 | 2.271 |
| Ovary | 0.978 | 0.924 | 0.975 | 0.353 | 0.348 | 1.313 | 1.306 | 1.065 | 1.696 | 2.372 |

PE, SPE, SVDE, AE and SE are entropies; HM and HC are Hjorth parameters; PFD, KFD and HFD are fractal dimensions. ^{1} Bold type indicates the mean values of the metrics that make it possible to distinguish between classes.

**Table 2.** Mean standard deviations for each of the three classes for each potential new generated feature.

| Class | PE | SPE | SVDE | AE ^{2} | SE | HM | HC | PFD | KFD | HFD |
|---|---|---|---|---|---|---|---|---|---|---|
| Normal | 0.018 | 0.054 | 0.043 | 0.091 | 0.164 | 0.131 | 0.510 | 0.005 | 0.174 | 0.068 |
| Liver | 0.013 | 0.085 | 0.082 | 0.105 | 0.105 | 0.098 | 0.021 | 0.005 | 0.266 | 0.138 |
| Ovary | 0.015 | 0.026 | 0.021 | 0.097 | 0.131 | 0.110 | 0.207 | 0.004 | 0.230 | 0.091 |

PE, SPE, SVDE, AE and SE are entropies; HM and HC are Hjorth parameters; PFD, KFD and HFD are fractal dimensions. ^{2} Bold type indicates the mean standard deviations of the metrics that make it possible to distinguish between classes.

**Table 3.** Maximum mean values of the metric ${\mathit{MacroF}}_{1}-\mathit{score}$ achieved by the kNN classifiers with AE and SE under different class balancing algorithms.

| Type of Classification Algorithm/Class Balancing Algorithm | AE | SE |
|---|---|---|
| kNN/no class balancing | 0.842 | 0.849 ^{3} |
| kNN/SMOTE | 0.866 | 0.864 |
| kNN/Borderline SMOTE-1 | 0.878 | 0.878 |
| kNN/Borderline SMOTE-2 | 0.861 | 0.861 |
| kNN/ADASYN | 0.842 | 0.842 |

^{3} The largest metric value in a row is highlighted in bold. Matching metric values in a row are italicized.

**Table 4.** Maximum mean values of the metric ${\mathit{MacroF}}_{1}-\mathit{score}$ achieved by the SVM classifiers with AE and SE under different class balancing algorithms.

| Type of Classification Algorithm/Class Balancing Algorithm | AE | SE |
|---|---|---|
| SVM/no class balancing | 0.883 | 0.884 ^{4} |
| SVM/SMOTE | 0.912 | 0.912 |
| SVM/Borderline SMOTE-1 | 0.911 | 0.911 |
| SVM/Borderline SMOTE-2 | 0.886 | 0.886 |
| SVM/ADASYN | 0.905 | 0.905 |

^{4} The largest metric value in a row is highlighted in bold. Matching metric values in a row are italicized.

**Table 5.**Characteristics of kNN classifiers C1 and C6 (with h = 24) in the experiment without class balancing.

| Characteristic | C1 | C6 (with h = 24) |
|---|---|---|
| Number of features in the dataset | 39 | 27 |
| Number of neighbors (n_neighbors) | 6 | 12 |
| weights | ‘distance’ | ‘distance’ |
| ${\mathit{MacroF}}_{1}-\mathit{score}$ (mean/std) | 0.756/0.101 | 0.842/0.093 |
| Accuracy (mean/std) | 0.948/0.017 | 0.966/0.016 |
| MacroRecall (mean/std) | 0.687/0.106 | 0.803/0.091 |
| MacroPrecision (mean/std) | 0.938/0.063 | 0.919/0.098 |
| Training time (mean/std), s | 0.002/0.001 | 0.004/0.002 |
| Quality metrics calculation time (mean/std), s | 0.009/0.003 | 0.009/0.002 |

**Table 6.**Characteristics of kNN classifiers C1 and C8 (independent of h) in the experiment using the Borderline SMOTE-1 class balancing algorithm.

| Characteristic | C1 | C8 (independent of h) |
|---|---|---|
| Number of features in the dataset | 39 | 40 |
| Number of neighbors (n_neighbors) | 10 | 6 |
| weights | ‘uniform’ | ‘uniform’ |
| ${\mathit{MacroF}}_{1}-\mathit{score}$ (mean/std) | 0.847/0.079 | 0.878/0.050 |
| Accuracy (mean/std) | 0.957/0.022 | 0.968/0.013 |
| MacroRecall (mean/std) | 0.870/0.079 | 0.877/0.066 |
| MacroPrecision (mean/std) | 0.846/0.085 | 0.896/0.063 |
| Training time (mean/std), s | 0.012/0.006 | 0.028/0.002 |
| Quality metrics calculation time (mean/std), s | 0.024/0.009 | 0.017/0.003 |

**Table 7.** Characteristics of SVM classifiers C1 and C8 (independent of h) in the experiment without class balancing.

| Characteristic | C1 | C8 (independent of h) |
|---|---|---|
| Number of features in the dataset | 39 | 40 |
| gamma | 1.2 | 1.2 |
| C | 2.0 | 2.0 |
| ${\mathit{MacroF}}_{1}-\mathit{score}$ (mean/std) | 0.877/0.078 | 0.885/0.079 |
| Accuracy (mean/std) | 0.973/0.015 | 0.974/0.016 |
| MacroRecall (mean/std) | 0.843/0.088 | 0.850/0.090 |
| MacroPrecision (mean/std) | 0.950/0.053 | 0.957/0.051 |
| Training time (mean/std), s | 0.123/0.008 | 0.131/0.007 |
| Quality metrics calculation time (mean/std), s | 0.007/0.001 | 0.009/0.001 |

**Table 8.**Characteristics of SVM classifiers C1 and C7 (with h = 28) in the experiment with class balancing.

| Characteristic | C1 | C7 (with h = 28) |
|---|---|---|
| Number of features in the dataset | 39 | 68 |
| gamma | 1 | 0.7 |
| C | 0.4 | 0.7 |
| ${\mathit{MacroF}}_{1}-\mathit{score}$ (mean/std) | 0.910/0.064 | 0.914/0.050 |
| Accuracy (mean/std) | 0.977/0.015 | 0.978/0.012 |
| MacroRecall (mean/std) | 0.907/0.081 | 0.907/0.065 |
| MacroPrecision (mean/std) | 0.929/0.058 | 0.937/0.048 |
| Training time (mean/std), s | 0.886/0.214 | 0.489/0.021 |
| Quality metrics calculation time (mean/std), s | 0.013/0.004 | 0.008/0.001 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Demidova, L.A.
A Novel Approach to Decision-Making on Diagnosing Oncological Diseases Using Machine Learning Classifiers Based on Datasets Combining Known and/or New Generated Features of a Different Nature. *Mathematics* **2023**, *11*, 792.
https://doi.org/10.3390/math11040792
