Review

Analysis of Colorectal and Gastric Cancer Classification: A Mathematical Insight Utilizing Traditional Machine Learning Classifiers

by Hari Mohan Rai * and Joon Yoo *
School of Computing, Gachon University, Seongnam-si 13120, Republic of Korea
* Authors to whom correspondence should be addressed.
Mathematics 2023, 11(24), 4937; https://doi.org/10.3390/math11244937
Submission received: 19 November 2023 / Revised: 9 December 2023 / Accepted: 11 December 2023 / Published: 12 December 2023

Abstract

Cancer remains a formidable global health challenge, claiming millions of lives annually. Timely and accurate cancer diagnosis is imperative. While numerous reviews have explored cancer classification using machine learning and deep learning techniques, scant literature focuses on traditional ML methods. In this manuscript, we undertake a comprehensive review of colorectal and gastric cancer detection specifically employing traditional ML classifiers. This review emphasizes the mathematical underpinnings of cancer detection, encompassing preprocessing techniques, feature extraction, machine learning classifiers, and performance assessment metrics. We provide mathematical formulations for these key components. Our analysis is limited to peer-reviewed articles published between 2017 and 2023, exclusively considering medical imaging datasets. Benchmark and publicly available imaging datasets for colorectal and gastric cancers are presented. This review synthesizes findings from 20 articles on colorectal cancer and 16 on gastric cancer, culminating in a total of 36 research articles. A significant focus is placed on mathematical formulations for commonly used preprocessing techniques, features, ML classifiers, and assessment metrics. Crucially, we introduce our optimized methodology for the detection of both colorectal and gastric cancers. Our performance metrics analysis reveals remarkable results: 100% accuracy in both cancer types, but with the lowest sensitivity recorded at 43.1% for gastric cancer.

1. Introduction

Cancer, a longstanding enigma in human history, has experienced a notable upsurge in its prevalence in recent decades due to several contributing causes. These reasons encompass the inexorable aging of populations, the adoption of detrimental lifestyles, and heightened exposure to carcinogens in the environment, food, and beverages [1,2]. The term “cancer” has its origins in the Greek word “καρκίνος” (karkinos), which carries a dual meaning, referring to both a neoplasm and a crustacean of the crab genus. This nomenclature was first introduced in the medical lexicon in the 17th century and signifies a condition characterized by the invasive spread of cells to different anatomical sites, potentially causing harm [3,4,5]. In the human anatomy, composed of countless cells, cancer can emerge in diverse locations, from the extremities to the brain. While cells typically divide and multiply to meet the body’s needs and undergo programmed cell death when necessary, deviations can lead to the uncontrolled replication of damaged or abnormal cells, resulting in the formation of a neoplasm or tumor. These tumors can be categorized as benign (non-malignant) or malignant (cancerous), with the latter having the potential to travel to distant body parts from the original location, often affecting nearby tissues along the way. Notably, blood cancers, like leukemia, do not follow the typical pattern of solid tumor formation but rather tend to involve the proliferation of abnormal blood cells that circulate within the body and may not form solid masses as seen in other types of cancer. Cancer arises from genetic anomalies that disrupt the regulation of cellular proliferation. These genetic anomalies compromise the natural control mechanisms that prevent excessive cell proliferation. 
The body has inherent mechanisms designed to remove cells that possess damaged DNA, but, in certain cases, these fail, allowing abnormal cells to thrive and potentially develop into tumors, disrupting regular bodily functions; these defenses can diminish with age or due to various factors [6].
Each instance of cancer exhibits a distinct genetic modification that evolves as the tumor grows. Tumors often showcase a diversity of genetic mutations across various cells existing within the same cluster. Genetic abnormalities primarily affect three types of genes: DNA repair genes, proto-oncogenes, and tumor suppressor genes. Proto-oncogenes are typically involved in healthy cell division and proliferation. The transformation of these genes into oncogenes, brought on by specific alterations or increased activity, fuels uncontrolled cell growth and plays a role in cancer development. Meanwhile, tumor suppressor genes regulate cellular division and restrain unregulated cellular proliferation; mutations in these genes disable their inhibitory function, increasing the risk of cancer. DNA repair genes are responsible for rectifying DNA damage, and mutations in these genes can lead to the accumulation of further genetic abnormalities, making cells more prone to developing cancer. Metastasis is the movement of cancer cells from the initial site to new parts of the body. It includes cell detachment, local tissue invasion, blood or lymph system entry, and growth in distant tissues [7,8]. Understanding the genetic and cellular mechanisms underlying cancer development and metastasis is crucial for improving diagnostics, developing effective treatments, and advancing cancer research. Researchers can work toward better strategies for prevention, early detection, and targeted therapies by unraveling the intricacies of cancer at the molecular level. The early diagnosis of cancer developments across different body areas requires accurate and automated computerized techniques. While numerous researchers have made significant strides in cancer detection, there remains substantial scope for improvement in this field. 
In this manuscript, we have scrutinized colorectal and gastric cancers employing conventional ML techniques solely based on medical imaging datasets. Medical images offer finer and more specific details compared to other medical data sources.

Literature Review

This section provides an evaluative comparison of the most recent review articles available, analyzing current review articles dedicated to the utilization of machine learning and deep learning classifiers for cancer detection across diverse types. The objective is to summarize the positive aspects and limitations of these review articles, as per the review presented, on various cancer types. The papers selected for analysis include those that cover more than two cancer types, are peer-reviewed, and were published between 2019 and 2023. This present study extends our prior works [9,10] by providing an extensive review that now encompasses seven distinct cancer types. Levine et al. (2019) [9] focused on cutaneous, mammary, pulmonary, and various other malignant conditions, emphasizing radiological practices and diagnostic workflows. The study detailed the construction and deployment of a convolutional neural network for medical image analysis. However, limitations included a relative underemphasis on malignancy detection, sparse literature sources, and examination of a limited set of performance parameters. Huang et al. (2020) [10] explored prostatic, mammary, gastric, colorectal, solid, and non-solid malignancies. The study presented a comparative analysis of artificial intelligence algorithms and human pathologists in terms of prognostic and diagnostic performance across various cancer classifications. However, limitations included a lack of literature for each malignancy category, the absence of consideration for machine learning and deep learning classifiers, and a lack of an in-depth literature review. Saba (2020) [11] examined mammary, encephalic, pulmonary, hepatic, cutaneous, and leukemic cancers, offering concise explanations of benchmark datasets and a comprehensive evaluation of diverse performance metrics. 
However, limitations included a combined treatment of machine learning and deep learning without a separate analysis and the absence of a comparative exploration between the two methodologies. Shah et al. (2021) [12] proposed predictive systems for various cancer types but had limitations. It used a data, prediction technique, and view (DPV) framework to assess cancer detection. The focus was on data type, modality, and acquisition. However, the study included a limited number of articles for each cancer type, lacked a performance evaluation, and only considered deep learning-based methods.
Majumder and Sen (2021) [13] centered its focus on the domains of mammary, pulmonary, solid, and encephalic malignancies. The findings embraced the demonstration of artificial intelligence’s application in the domains of oncopathology and translational oncology. However, limitations included a limited consideration of cancer types and literature sources, along with variations in performance metrics across different sources. Tufail et al. (2021) [14] evaluated astrocytic, mammary, colorectal, ovarian, gastric, hepatic, thyroid, and various other cancer types, emphasizing publicly accessible datasets, cancer detection, and segmentation. However, the exclusive focus on deep learning-based cancer detection limited a comprehensive examination of each cancer type. Kumar and Alqahtani (2022) [15] examined mammary, encephalic, pulmonary, cutaneous, prostatic, and various other malignancies, detailing diverse deep learning models and architectures based on image types. However, limitations included the exclusive focus on deep learning methods and variations in performance metrics across different literature sources. Kumar et al. (2022) [3] evaluated various malignancies, offering comprehensive coverage across diverse cancer categories. The study drew from numerous literature sources, presenting a wide array of performance metrics and acknowledging challenges. However, limitations included the amalgamation of all cancer types in a single analysis and the absence of a separate assessment of machine learning and deep learning approaches. Painuli et al. (2022) [16] concentrated on mammary, pulmonary, hepatic, cutaneous, encephalic, and pancreatic malignancies. The study examined benchmark datasets for these cancer types and provided an overview of the utilization of machine learning and deep learning methodologies. 
The research identified the most proficient classifiers based on accuracy but unified the examination of deep learning and machine learning techniques instead of offering individual assessments.
Rai (2023) [17] conducted a comprehensive analysis of cancer detection and segmentation, utilizing both deep neural network (DNN) and conventional machine learning (CML) methods, covering seven cancer types. The review separately scrutinized the strengths and challenges of DNN and CML classifiers. Despite limitations, such as a limited number of research articles and the absence of a database and feature extraction analysis, the study provided valuable insights into cancer detection, laying the foundation for future research directions. Maurya et al. (2023) [18] assessed encephalic, cervical, mammary, cutaneous, and pulmonary cancers, providing a comprehensive analysis of the performance parameters and inherent challenges. However, it lacked an independent assessment of machine learning and deep learning techniques and a dataset description. Mokoatle et al. (2023) [19] focused on pulmonary, mammary, prostatic, and colorectal cancers, proposing novel detection methodologies utilizing SBERT and the SimCSE approach. However, limitations included the study’s focus on four cancer types, the lack of a dataset analysis, and reliance on a single assessment metric. Rai and Yoo (2023) [20] enhanced cancer diagnostics by classifying four cancer types with conventional machine learning (CML) and deep neural network (DNN) methods. The study reviewed 130 articles, outlined benchmark datasets and features, and presented a comparative analysis of CML and DNN models. Limitations included a focus on four cancer types and reliance on a single metric (accuracy) for classifier validation.
This study offers an expansive and in-depth examination of the current landscape and potential prospects for diagnosing colorectal and gastric cancers through the application of traditional machine learning methodologies. The key contributions and highlights of this review can be distilled into the following key points.
  • Mathematical Formulations to Augment Cognizance: Inaugurating the realm of mathematical formulations, meticulously addressing the most frequently utilized preprocessing techniques, features, machine learning classifiers, and the intricate domain of assessment metrics.
  • Mathematical Deconstruction of ML Classifiers: Engaging in a profound exploration of the mathematical intricacies underpinning machine learning classifiers commonly harnessed in the arena of cancer detection.
  • Colorectal and Gastric Cancer Detection: Dedicating an analytical focus to the nuanced landscape of colorectal and gastric cancer detection. Our scrutiny unfurled a detailed examination of the methodologies and techniques germane to the diagnosis and localization of these particular cancer types.
  • Preprocessing Techniques and Their Formulation: Penetrating the intricate realm of preprocessing techniques and probing their pivotal role in elevating the quality and accuracy of models employed in cancer detection.
  • Feature Extraction Strategies and Informative Features: Embarking on a comprehensive journey, scrutinizing the multifaceted domain of feature extraction techniques, meticulously counting and discerning the number of features wielded in research articles.
  • A Multidimensional Metrics Analysis: Conducting a holistic examination encompassing a spectrum of performance evaluation metrics, encapsulating accuracy, sensitivity, specificity, precision, negative predictive value, F-measure (F1), area under the curve, and the Matthews correlation coefficient (MCC).
  • Evaluation Parameters for Research Articles: Systematically analyzing diverse parameters, including publication year, preprocessing techniques, features, techniques, image count, modality nuances, dataset details, and integral metrics (%).
  • Prominent Techniques and Their Effectiveness: Expertly identifying the techniques most prevalently harnessed by researchers in the realm of cancer detection and meticulously pinpointing the most effective among the gamut of options.
  • Key Insights and Ongoing Challenges: Highlighting key insights from the scrutinized research papers, encompassing advances, groundbreaking revelations, and challenges in cancer detection using traditional machine learning techniques.
  • Architectural Design of Proposed Methodology: Laying out in meticulous detail an architectural blueprint derived from the reviewed literature. These architectural formulations present invaluable guides for the enhancement of cancer detection models.
  • Recognizing Opportunities for Improvement: Executing a methodical comparative analysis of an array of metrics, meticulously scrutinizing their zenith and nadir values, as well as the interstitial chasm. This granular evaluation aids in the strategic pinpointing of areas harboring untapped potential for enhancement in cancer detection practices.
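To make the evaluation metrics named in the list above concrete, the following minimal sketch computes them from a binary confusion matrix using their standard textbook definitions. The function names and the example counts are hypothetical and chosen only for illustration; they are not taken from any reviewed study. The area under the curve is omitted because it requires ranked prediction scores rather than counts.

```python
import math

def safe_div(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def binary_metrics(tp, fp, tn, fn):
    """Confusion-matrix metrics for a two-class (cancer vs. non-cancer) task."""
    sensitivity = safe_div(tp, tp + fn)   # recall / true positive rate
    specificity = safe_div(tn, tn + fp)   # true negative rate
    precision = safe_div(tp, tp + fp)     # positive predictive value
    npv = safe_div(tn, tn + fn)           # negative predictive value
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "npv": npv,
        "f1": safe_div(2 * precision * sensitivity, precision + sensitivity),
        "mcc": safe_div(tp * tn - fp * fn, mcc_denominator),
    }

# Hypothetical counts, chosen only for illustration.
metrics = binary_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Reporting several of these values together matters because a classifier can reach high accuracy on an imbalanced dataset while its sensitivity remains low.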

2. Materials and Methods

2.1. Literature Selection Process

In this section, we will provide a broad overview of the procedures involved in selecting and employing research articles for the purpose of cancer classification through traditional ML approaches. These selection criteria encompass both inclusion and exclusion standards, which we will delineate in depth. The PRISMA flow diagram delineates the systematic review process employed for the detection of colorectal and stomach (gastric) cancer utilizing conventional machine learning (CML) methodologies, as illustrated in Figure 1. Commencing with an initial identification of 571 records through meticulous database searching, the subsequent removal of 188 duplicates yielded 383 distinct records. Through a rigorous screening process, 197 records were deemed ineligible, prompting a detailed assessment of eligibility for 186 full-text articles. Within this subset, the exclusion of 150 articles on various grounds culminated in the inclusion of 36 studies. This select group of 36 studies served as the foundational basis for the scoping review, offering a comprehensive exploration of cancer detection methods employing CML approaches for both colorectal and stomach cancers.
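The record counts reported for the PRISMA flow can be checked with simple arithmetic. The short sketch below only restates the numbers given in the text; the variable names are illustrative.

```python
# Record counts as reported in the text for the PRISMA flow (Figure 1).
identified = 571
duplicates_removed = 188
excluded_at_screening = 197
full_text_excluded = 150

after_deduplication = identified - duplicates_removed              # 383 distinct records
full_text_assessed = after_deduplication - excluded_at_screening   # 186 full-text articles
included = full_text_assessed - full_text_excluded                 # 36 included studies

# The abstract reports 20 colorectal and 16 gastric cancer articles.
colorectal_studies, gastric_studies = 20, 16

print(after_deduplication, full_text_assessed, included)
print(colorectal_studies + gastric_studies == included)
```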

2.1.1. Inclusion Criteria

The inclusion criteria for the review of research articles focused on cancer detection were defined across several specific parameters. Firstly, the articles had to pertain exclusively to the classification of cancer using conventional machine learning classifiers. These articles were specifically chosen if they were peer-reviewed and published between 2017 and 2023. The selection was limited to journal articles, omitting conference papers, book chapters, and similar sources to maintain the analytical scope. The studies selected for review utilized medical image datasets related to colorectal and gastric cancers. Additionally, a key criterion was the inclusion of accuracy as a performance metric in the chosen articles. Accuracy stands as a fundamental measure in evaluating the effectiveness of cancer detection models. The selected studies also strictly employed traditional machine learning classifiers for their classification tasks. The review was narrowed down to studies covering two specific high-mortality cancer types: colorectal and gastric cancer. Furthermore, articles were required to be in the English language, a criterion implemented to ensure the enhanced accessibility and comprehension of the research, thereby contributing to clarity and accuracy in the assessment process. Figure 2 illustrates the parameters governing the inclusion and exclusion of research articles in the selection process employed in this manuscript.

2.1.2. Exclusion Criteria

The exclusion criteria, a pivotal aspect of the research review process for cancer detection, served as a strategic filter to ensure the selection of high-quality, pertinent articles. Omitting conference papers and book chapters was a deliberate choice to uphold a superior standard, guided by the in-depth scrutiny and comprehensive nature typically associated with peer-reviewed journal articles. Additionally, the requirement for digital object identifiers (DOIs) within the selected studies aimed to guarantee the reliability and accessibility of the articles, facilitating easy citation, retrieval, and verification processes. The temporal boundary set the scope within a specific timeframe, excluding research published before 2017 or after 2023, with the intention of focusing on the most recent advancements within the field of cancer detection. Language limitations were incorporated, allowing only English publications to ensure a consistent understanding and analysis. Moreover, the exclusion of deep learning classifiers in favor of traditional machine learning methods aligned with the specific objective of assessing the performance and effectiveness of the latter in cancer detection. By narrowing the focus exclusively to colorectal and gastric cancers, the exclusion criteria aimed to ensure a concentrated and comprehensive analysis across these specific high-mortality cancer types. This approach facilitated a deeper understanding of the efficacy of traditional machine learning methods in the context of different cancer types.
To illuminate the research hotspots, we have detailed the quantity of literature references pertaining annually to each cancer category (colorectal and gastric), along with the cumulative total, visually represented in Figure 3. This visual aid is designed to aid readers in identifying pertinent literature related to these specific cancer categories, fostering a more nuanced analysis within the specified years.

2.2. Medical Imaging Datasets

Data collection is the essential first step in any machine learning endeavor, and the performance of classifiers and detection tasks depends on the characteristics of the datasets used. The approach for identifying or classifying diseases, particularly cancers, is closely linked to the nature of the dataset. Various data types, such as images, text, and signal data, may require distinct processing methods. In the context of cancer detection, medical image datasets are of paramount importance. These datasets contain images that provide valuable information about the presence and characteristics of cancerous tissues. Specialized techniques, including image segmentation and feature extraction, are applied to extract relevant information for classification or detection. Analyzing image datasets differs significantly from text or signal datasets due to differences in data structures and feature extraction techniques. Dataset availability can be categorized as offline or real-time. In the domain of cancer detection, most research relies on offline datasets sourced from healthcare institutions, research centers, and platforms like Kaggle and Mendeley. Researchers often use local datasets from these sources to conduct studies and develop innovative cancer detection methods. Table 1 describes some benchmark imaging datasets for colorectal and gastric cancers.

2.3. Preprocessing

In cancer detection, preprocessing is essential to prepare data for analysis and classification. It refines diverse data types, like medical images and genetic and clinical data, addressing noise and inconsistencies. Medical image preprocessing includes noise reduction, enhancement, normalization, and format standardization. Augmentation enhances data diversity. Quality preprocessed data improves cancer detection model performance. Common tasks include noise reduction, data cleaning, transformation, normalization, and standardization. Preprocessing optimizes data for analysis, contributing to effective cancer diagnosis. Key preprocessing techniques are summarized in Table 2.
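As an illustration of two of the preprocessing steps named above, normalization and standardization, the following minimal sketch rescales a flattened list of pixel intensities. It uses their common textbook definitions (min-max scaling to [0, 1] and z-score scaling); the function names and the sample intensities are hypothetical, and the formulations summarized in Table 2 may differ in detail.

```python
import statistics

def min_max_normalize(pixels):
    """Rescale intensities linearly to the range [0, 1]."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:                      # flat image: avoid division by zero
        return [0.0 for _ in pixels]
    return [(p - lo) / (hi - lo) for p in pixels]

def standardize(pixels):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.fmean(pixels)
    std = statistics.pstdev(pixels)
    if std == 0:                      # flat image: nothing to scale
        return [0.0 for _ in pixels]
    return [(p - mean) / std for p in pixels]

# Hypothetical 8-bit intensities from a flattened image patch.
patch = [12, 40, 40, 95, 130, 200, 255, 18]
normalized = min_max_normalize(patch)
standardized = standardize(patch)
print([round(v, 3) for v in normalized])
print([round(v, 3) for v in standardized])
```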

2.4. Feature Engineering

Feature engineering is a critical component in solving classification problems, particularly with traditional machine learning methods. Features represent dataset attributes used by the model to classify or predict. Instead of using the entire dataset, relevant features are extracted and serve as classifier inputs, delivering the desired outcomes. Proper preprocessing is essential before feature engineering to ensure data quality. Feature engineering involves selecting which features to extract, choosing methods, defining the domain, and specifying the number of features. Categories of feature engineering include extraction, selection, reduction, fusion, and enhancement. Commonly used features for predicting colorectal and gastric cancers in medical images are outlined below.

2.4.1. Histogram-Based First-Order Features (FOFs)

These are statistical features extracted from an image’s histogram, providing valuable information about the distribution and characteristics of pixel intensities [24]. Here are some significant FOFs, along with their mathematical formulae presented in Equations (1)–(4).
Skewness (s): Skewness quantifies the asymmetry of the histogram and is calculated as:
$$ s = \frac{1}{\sigma^{3}} \sum_{i=1}^{G_{max}} \left\{ (i - \mu)^{3} h_i \right\} \tag{1} $$
Here, $i$ is the gray level, $h_i$ is its frequency, $G_{max}$ is the highest grayscale intensity, and $\mu$ and $\sigma^{2}$ are the mean and variance, respectively.
Excess Kurtosis (k): Excess kurtosis measures the peakedness of the histogram and is calculated as:
$$ k = \frac{1}{\sigma^{4}} \sum_{i=1}^{G_{max}} \left\{ (i - \mu)^{4} h_i \right\} - 3 \tag{2} $$
Energy: Energy reflects the overall intensity in the image and is computed as:
$$ \mathrm{Energy} = \sum_{i=1}^{G_{max}} \left\{ [h_i]^{2} \right\} \tag{3} $$
Entropy (HIST): Entropy quantifies the information or randomness in the histogram and is calculated as:
$$ \mathrm{Entropy} = -\sum_{i=1}^{G_{max}} \left\{ h_i \ln(h_i) \right\} \tag{4} $$
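The four first-order features can be computed directly from a normalized histogram. The sketch below is a minimal, dependency-free illustration that assumes $h_i$ is the relative frequency of gray level $i$ (so the frequencies sum to 1) and that gray levels are integers from 1 to $G_{max}$; the function name and the sample pixel values are hypothetical.

```python
import math

def first_order_features(pixels, g_max):
    """Histogram features for integer gray levels in 1..g_max (flat pixel list)."""
    n = len(pixels)
    h = [0.0] * (g_max + 1)              # h[i] = relative frequency; index 0 unused
    for p in pixels:
        h[p] += 1.0 / n
    levels = range(1, g_max + 1)
    mu = sum(i * h[i] for i in levels)                   # histogram mean
    var = sum((i - mu) ** 2 * h[i] for i in levels)      # histogram variance
    if var == 0:                                         # constant image: undefined
        skewness = kurtosis = float("nan")
    else:
        sigma = math.sqrt(var)
        skewness = sum((i - mu) ** 3 * h[i] for i in levels) / sigma ** 3
        kurtosis = sum((i - mu) ** 4 * h[i] for i in levels) / sigma ** 4 - 3
    energy = sum(h[i] ** 2 for i in levels)
    entropy = -sum(h[i] * math.log(h[i]) for i in levels if h[i] > 0)
    return {"skewness": skewness, "kurtosis": kurtosis,
            "energy": energy, "entropy": entropy}

# Hypothetical pixels with a histogram that is symmetric around gray level 3.
sample_pixels = [1, 2, 2, 3, 3, 3, 4, 4, 5]
fof = first_order_features(sample_pixels, g_max=5)
print(fof)
```

Empty histogram bins are skipped in the entropy sum, following the usual convention that $0 \ln 0 = 0$.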

2.4.2. Gray-Level Co-Occurrence Matrix (GLCM) Features

GLCM is a technique used for texture analysis in image processing. It assesses the association between pixel values in an image, relying on the likelihood of specific pixel pairs with particular gray levels occurring within a defined spatial proximity [25,26,27]. Here are some important GLCM features, along with their mathematical formulas as provided in Equations (5)–(10).
Here, (x, y) pairs typically refer to the intensity values of adjacent or neighboring pixels.
Sum of Squares Variance (SSV): SSV quantifies the variance in gray levels within the texture.
$$ \mathrm{SSV} = \sum_{x,y} (x - \mu)^{2}\, \mathrm{GLCM}(x,y) \tag{5} $$
Inverse Difference Moment (IDM): IDM measures the local homogeneity and is higher for textures with similar gray levels.
$$ \mathrm{IDM} = \sum_{x,y} \frac{1}{1 + (x - y)^{2}}\, \mathrm{GLCM}(x,y) \tag{6} $$
Correlation (Corr): Correlation quantifies the linear dependency between pixel values in the texture. It ranges from −1 to 1, with 1 signifying perfect positive correlation.
$$ \mathrm{Corr} = \frac{\sum_{x,y} \left( x\, y\, \mathrm{GLCM}(x,y) \right) - \mu_a \mu_b}{\sigma_a \sigma_b} \tag{7} $$
Here, $\mu_a$, $\mu_b$ and $\sigma_a$, $\sigma_b$ denote the means and standard deviations of the row and column marginal distributions of the GLCM.
Dissimilarity: Dissimilarity quantifies how different neighboring pixel values are.
$$ \mathrm{Dissimilarity} = \sum_{x,y} |x - y|\, \mathrm{GLCM}(x,y) \tag{8} $$
Autocorrelation (AuCorr): Autocorrelation measures the similarity between pixel values at different locations in the texture.
$$ \mathrm{AuCorr} = \sum_{x,y} x\, y\, \mathrm{GLCM}(x,y) \tag{9} $$
Inverse Difference (ID): ID measures the local homogeneity and is higher for textures with similar gray levels at different positions.
$$ \mathrm{ID} = \sum_{x,y} \frac{\mathrm{GLCM}(x,y)}{1 + |x - y|} \tag{10} $$
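A small worked example may help connect these definitions to an actual image. The sketch below builds a normalized co-occurrence matrix for horizontally adjacent pixel pairs and evaluates the six features using their standard Haralick-style forms. Several choices are assumptions made for illustration: gray levels are indexed from 0, only the (left, right) neighbor offset is used, the matrix is not symmetrized, and $\mu$ in the SSV is taken as the row-marginal mean. The function names and the toy image are hypothetical.

```python
import math

def glcm_horizontal(image, levels):
    """Normalized co-occurrence matrix for (left, right) neighboring pixel pairs."""
    counts = [[0.0] * levels for _ in range(levels)]
    total = 0
    for row in image:
        for a, b in zip(row, row[1:]):
            counts[a][b] += 1
            total += 1
    return [[c / total for c in r] for r in counts]

def glcm_features(p):
    """Texture features from a normalized GLCM given as a nested list."""
    n = len(p)
    cells = [(x, y, p[x][y]) for x in range(n) for y in range(n)]
    mu_a = sum(x * v for x, _, v in cells)               # row-marginal mean
    mu_b = sum(y * v for _, y, v in cells)               # column-marginal mean
    sd_a = math.sqrt(sum((x - mu_a) ** 2 * v for x, _, v in cells))
    sd_b = math.sqrt(sum((y - mu_b) ** 2 * v for _, y, v in cells))
    autocorrelation = sum(x * y * v for x, y, v in cells)
    return {
        "ssv": sum((x - mu_a) ** 2 * v for x, _, v in cells),
        "idm": sum(v / (1 + (x - y) ** 2) for x, y, v in cells),
        "corr": (autocorrelation - mu_a * mu_b) / (sd_a * sd_b),
        "dissimilarity": sum(abs(x - y) * v for x, y, v in cells),
        "autocorrelation": autocorrelation,
        "inverse_difference": sum(v / (1 + abs(x - y)) for x, y, v in cells),
    }

# Hypothetical 4x4 image with four gray levels (0..3).
toy_image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
glcm = glcm_horizontal(toy_image, levels=4)
glcm_result = glcm_features(glcm)
print(glcm_result)
```

In practice, features are often averaged over several offsets and directions; a single offset is used here only to keep the example short.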

2.4.3. Gray-Level Run Length Matrix (GLRLM)

This is a statistical procedure employed in image processing and texture assessment to quantify the distribution of run lengths of specific gray levels within an image. Here are some significant GLRLM features along with their corresponding mathematical formulas, as presented in Equations (11)–(22).
Short Run Emphasis (SRE): SRE evaluates the dispersion of shorter runs characterized by lower gray-level values.
$$ \mathrm{SRE} = \sum_{x,y} \frac{C(x,y)}{x^{2}} \tag{11} $$
Here, $(x, y)$ are gray levels, and $C(x,y)$ is the co-occurrence matrix value reflecting the frequency of each gray-level combination.
Long Run Emphasis (LRE): LRE assesses the presence of extended runs marked by higher gray-level values [28].
$$ \mathrm{LRE} = \sum_{x,y} C(x,y)\, x^{2} \tag{12} $$
Gray Level Nonuniformity (GLN): GLN quantifies the nonuniformity of gray-level values in runs.
$$ \mathrm{GLN} = \sum_{x,y} C(x,y)^{2} \tag{13} $$
Run Length Nonuniformity (RLN): RLN evaluates the irregularity in the lengths of runs.
$$ \mathrm{RLN} = \sum_{x,y} C(x,y)\, y^{2} \tag{14} $$
Run Percentage (RP): RP represents the percentage of runs in the matrix.
$$ \mathrm{RP} = \sum_{x,y} \frac{C(x,y)}{N^{2}} \tag{15} $$
Run Entropy (RE): RE calculates the entropy of run lengths and gray levels.
$$ \mathrm{RE} = -\sum_{x,y} \left( C(x,y) \log\left( C(x,y) + \epsilon \right) \right) \tag{16} $$
Low Gray-Level Run Emphasis (LGRE): LGRE accentuates shorter runs with lower gray-level values.
$$ \mathrm{LGRE} = \sum_{x,y} \frac{C(x,y)}{y^{2}}, \quad \text{for } y \le \frac{N+1}{2} \tag{17} $$
High Gray-Level Run Emphasis (HGRE): HGRE highlights longer runs with higher gray-level values.
$$ \mathrm{HGRE} = \sum_{x,y} C(x,y)\, y^{2}, \quad \text{for } y > \frac{N+1}{2} \tag{18} $$
Short Run Low Gray-Level Emphasis (SRLGLE): SRLGLE highlights shorter runs that contain lower gray-level values.
$$ \mathrm{SRLGLE} = \sum_{x,y} \frac{C(x,y)}{x^{2} y^{2}}, \quad \text{for } x, y \le \frac{N+1}{2} \tag{19} $$
Short Run High Gray-Level Emphasis (SRHGLE): SRHGLE highlights shorter runs that contain higher gray-level values.
$$ \mathrm{SRHGLE} = \sum_{x,y} \frac{C(x,y)}{x^{2}}\, y^{2}, \quad \text{for } x \le \frac{N+1}{2},\ y > \frac{N+1}{2} \tag{20} $$
Long Run Low Gray-Level Emphasis (LRLGLE): LRLGLE emphasizes longer runs featuring lower gray-level values.
$$ \mathrm{LRLGLE} = \sum_{x,y} \frac{C(x,y)\, x^{2}}{y^{2}}, \quad \text{for } x > \frac{N+1}{2},\ y \le \frac{N+1}{2} \tag{21} $$
Long Run High Gray-Level Emphasis (LRHGLE): LRHGLE highlights extended sequences with higher gray-level values.
$$ \mathrm{LRHGLE} = \sum_{x,y} C(x,y)\, x^{2} y^{2}, \quad \text{for } x, y > \frac{N+1}{2} \tag{22} $$
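To illustrate how run-length statistics are obtained in practice, the following sketch builds a horizontal run-length matrix and computes short-run and long-run emphasis. It follows the widely used Galloway-style definitions, in which each sum is divided by the total number of runs, so its normalization and index conventions may differ from the equations in this section. The function names and the toy image are hypothetical.

```python
from itertools import groupby

def run_length_matrix(image, levels):
    """rlm[g][r - 1] = number of horizontal runs of gray level g with length r."""
    max_run = max(len(row) for row in image)
    rlm = [[0] * max_run for _ in range(levels)]
    for row in image:
        for gray, run in groupby(row):           # consecutive equal gray levels
            length = sum(1 for _ in run)
            rlm[gray][length - 1] += 1
    return rlm

def run_emphasis(rlm):
    """Short-run and long-run emphasis, normalized by the total number of runs."""
    n_runs = sum(sum(row) for row in rlm)
    sre = sum(c / (j + 1) ** 2 for row in rlm for j, c in enumerate(row)) / n_runs
    lre = sum(c * (j + 1) ** 2 for row in rlm for j, c in enumerate(row)) / n_runs
    return sre, lre

# Hypothetical 4x4 image with four gray levels (0..3).
toy_image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
rlm = run_length_matrix(toy_image, levels=4)
sre, lre = run_emphasis(rlm)
print(rlm)
print(round(sre, 4), lre)
```

Images dominated by short runs (fine texture) push SRE up, while long homogeneous runs (coarse texture) push LRE up.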

2.4.4. Neighborhood Gray-Tone Difference Matrix (NGTDM)

This is another texture analysis method used in image processing to characterize the spatial arrangement of gray tones in an image. Here are some key NGTDM features along with their respective mathematical formulas, as outlined in Equations (23)–(27).
Coarseness: Measures the coarseness of the texture based on differences in gray tones.
$$ \mathrm{Coars} = \sum_{x=1}^{N_g} \frac{C(x,y)}{\Delta x^{2}} \tag{23} $$
$N_g$ refers to the highest achievable discrete intensity level within the image.
Contrast (NGTD): Quantifies the contrast or sharpness in the texture.
$$ \mathrm{Contrast}_{NGTD} = \sum_{x=1}^{N_g} \sum_{y=1}^{N_g} C(x,y)\, |x - y| \tag{24} $$
Busyness: Represents the level of activity or complexity in the texture.
$$ \mathrm{Busyness} = \sum_{x=1}^{N_g} \sum_{y=1}^{N_g} C(x,y)\, y \tag{25} $$
Complexity: Measures the complexity or intricacy of the texture.
$$ \mathrm{Complexity} = \sum_{x=1}^{N_g} \sum_{y=1}^{N_g} \frac{P(x,y)}{1 + (x - y)^{2}} \tag{26} $$
Texture Strength (TS): Quantifies the strength or intensity of the texture.
$$ \mathrm{TS} = \sum_{x=1}^{N_g} \sum_{y=1}^{N_g} P(x,y) \left( \frac{x}{N_g} - \frac{y}{N_g} \right)^{2} \tag{27} $$
These features provide a detailed analysis of texture patterns in images, making them valuable for various applications, including image classification, quality control, and texture discrimination in fields such as geology, material science, and medical imaging.

2.5. Traditional Machine Learning Classifiers

Machine learning-based classifiers, renowned for their advanced capabilities in detecting cancer, notably stand out in their effectiveness when harmonized with non-invasive diagnostic techniques, providing a significant edge in the domain of cancer detection. Researchers have employed a range of ML classifiers to identify different malignancies and disorders. Some commonly used classifiers include:

2.5.1. K-Nearest Neighbors (KNN)

K-Nearest Neighbors (KNN) is a widely used and simple machine learning algorithm, suitable for classification and regression tasks. It relies on the assumption that similar inputs lead to similar outputs, assigning a class label to a test input based on the prevalent class among its k closest neighbors. The formal definition involves representing a test point ‘x’ and determining its set of ‘k’ nearest neighbors, denoted as ‘Nx’, where ‘k’ is a user-defined parameter.
The Minkowski distance is a flexible distance metric that can be tailored by adjusting the value of the parameter ‘p.’ The Minkowski distance between two data points ‘x’ and ‘z’ in a ‘d’-dimensional space is defined by Equation (28):
$$ \mathrm{dist}(x, z) = \left( \sum_{r=1}^{d} |x_r - z_r|^{p} \right)^{1/p} \tag{28} $$
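The distance in Equation (28) and the majority-vote rule described above can be sketched in a few lines. This is a minimal, dependency-free illustration rather than an optimized implementation; the function names, the two-dimensional feature vectors, and the labels are hypothetical.

```python
from collections import Counter

def minkowski(x, z, p=2):
    """Minkowski distance; p=1 gives Manhattan distance and p=2 Euclidean distance."""
    return sum(abs(a - b) ** p for a, b in zip(x, z)) ** (1 / p)

def knn_predict(train, query, k=3, p=2):
    """Label a query point by majority vote among its k nearest training points.

    train is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(train, key=lambda item: minkowski(item[0], query, p))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical two-feature training set, for illustration only.
train = [
    ((1.0, 1.0), "benign"), ((1.2, 0.8), "benign"), ((0.9, 1.1), "benign"),
    ((3.0, 3.2), "malignant"), ((3.1, 2.9), "malignant"), ((2.8, 3.0), "malignant"),
]
prediction = knn_predict(train, query=(2.9, 3.1), k=3)
print(prediction)
```

Choosing an odd k for a two-class problem avoids tied votes; with more classes, a tie-breaking rule would be needed.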
The “1-NN Convergence Proof” states that, as the dataset grows infinitely large, the 1-Nearest Neighbor (1-NN) classifier’s error will not be more than twice the error of the Bayes optimal classifier, which represents the best possible classification performance. This also holds for k-NN with larger values of k. It highlights the ability of the K-Nearest Neighbors algorithm to approach optimal performance with increasing data [29]. As n approaches infinity, $Z_{NN}$ converges to $Z_t$, and the probability of different labels for $Z_t$ when returning $Z_{NN}$’s label is described in Equation (29) [30].
$$ \epsilon_{NN} = P(y \mid Z_t)\,(1 - P(y \mid Z_{NN})) + P(y \mid Z_{NN})\,(1 - P(y \mid Z_t)) \le (1 - P(y \mid Z_{NN})) + (1 - P(y \mid Z_t)) = 2\,(1 - P(y \mid Z_t)) = 2\,\epsilon_{BO} \tag{29} $$
Here, BO is the Bayes optimal classifier, and $\epsilon_{NN}$ and $\epsilon_{BO}$ denote the errors of the 1-NN and Bayes optimal classifiers, respectively. If the test point and its nearest neighbor are indistinguishable, misclassification occurs if they have different labels. This probability is outlined in Equation (30) and Figure 4 [29,31].
$$ (1 - p(s \mid x))\,p(s \mid x) + p(s \mid x)\,(1 - p(s \mid x)) = 2\,p(s \mid x)\,(1 - p(s \mid x)) \tag{30} $$
Equation (30) represents the misclassification probability when the test point and its nearest neighbor have differing labels.
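The Minkowski distance of Equation (28) and the majority vote over the $k$ nearest neighbors can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in any reviewed study; the function names `minkowski` and `knn_predict` and the toy data are assumptions made for the example.

```python
from collections import Counter


def minkowski(x, z, p=2):
    """Minkowski distance of Equation (28); p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, z)) ** (1.0 / p)


def knn_predict(train_x, train_y, query, k=3, p=2):
    """Return the most common label among the k training points nearest to query."""
    # Sort training indices by distance to the query and keep the first k.
    order = sorted(range(len(train_x)),
                   key=lambda i: minkowski(train_x[i], query, p))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two well-separated clusters of 2-D feature vectors.
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
y = ["benign", "benign", "benign", "malignant", "malignant", "malignant"]

print(knn_predict(X, y, (0.5, 0.5)))
print(knn_predict(X, y, (5.5, 5.5)))
```

Sorting all training points costs $O(n \log n)$ per query, which is acceptable for a sketch; larger datasets would typically use a spatial index instead.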

2.5.2. Multilayered Perceptron (MLP)

In contrast to static kernels, neural network units have adaptable internal parameters for an adjustable structure. A perceptron, inspired by biological neurons, comprises three components: (i) weighted edges for individual multiplications, (ii) a summation unit for calculating the sum, and (iii) an activation unit applying a non-linear function [32,33,34]. The single-layer unit function involves a linear combination passed through a non-linear activation, represented by Equation (31) and Figure 5 [33,34].
$$y^{(1)}(x) = f\left( w_0^{(1)} + \sum_{j=1}^{N} w_j^{(1)} x_j \right) \tag{31}$$
In a single-layer neural network unit, $y^{(1)}$ is the output, $f$ is the activation function, $w_0^{(1)}$ is the bias, and $\sum_{j=1}^{N} w_j^{(1)} x_j$ is the weighted sum of inputs. In general, we compute $U_1$ such units as feature transformations in learning models, as described in Equation (32) [33,34].
$$\mathrm{model}(x, w) = w_0 + y_1^{(1)}(x)\, w_1 + \cdots + y_{U_1}^{(1)}(x)\, w_{U_1} \tag{32}$$
The input vector $x$ can be written as in Equation (33) [33,34].
$$x = \begin{bmatrix} 1 \\ x_1 \\ \vdots \\ x_N \end{bmatrix} \tag{33}$$
The vector representation comprises the input values $x_1$ to $x_N$ and an additional element of 1. The internal parameters of the $j$th single-layer unit are the bias $w_{0,j}^{(1)}$ and the weights $w_{1,j}^{(1)}$ through $w_{N,j}^{(1)}$. These parameters form the $j$th column of a matrix $W_1$ with dimensions $(N+1) \times U_1$, as shown in Equation (34) [34]:
$$W_1 = \begin{bmatrix} w_{0,1}^{(1)} & w_{0,2}^{(1)} & \cdots & w_{0,U_1}^{(1)} \\ w_{1,1}^{(1)} & w_{1,2}^{(1)} & \cdots & w_{1,U_1}^{(1)} \\ \vdots & \vdots & \ddots & \vdots \\ w_{N,1}^{(1)} & w_{N,2}^{(1)} & \cdots & w_{N,U_1}^{(1)} \end{bmatrix} \tag{34}$$
Notably, the matrix–vector product $W_1^{T} x$ contains all linear combinations within our $U_1$ units, as given in Equation (35) [33].
$$\left[ W_1^{T} x \right]_j = w_{0,j}^{(1)} + \sum_{n=1}^{N} w_{n,j}^{(1)} x_n, \quad j = 1, \ldots, U_1 \tag{35}$$
We extend the activation function $f$ to handle a general $d \times 1$ vector $v$ in Equation (36) [34]:
$$f(v) = \begin{bmatrix} f(v_1) \\ \vdots \\ f(v_d) \end{bmatrix} \tag{36}$$
In Equation (37), $f(W_1^{T} x)$ is a $U_1 \times 1$ vector containing all $U_1$ single-layer units [33,34]:
$$\left[ f\left( W_1^{T} x \right) \right]_j = f\left( w_{0,j}^{(1)} + \sum_{n=1}^{N} w_{n,j}^{(1)} x_n \right), \quad j = 1, \ldots, U_1 \tag{37}$$
The mathematical expression for an $L$-layer unit in a general multilayer perceptron, built recursively from the units of the previous layer, is given by Equation (38) [33,34].
$$y^{(L)}(x) = f\left( w_0^{(L)} + \sum_{i=1}^{U_{L-1}} w_i^{(L)}\, y_i^{(L-1)}(x) \right) \tag{38}$$
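The recursive construction in Equations (35)–(38) amounts to repeatedly prepending a constant 1 to the input, multiplying by the transposed weight matrix, and applying the activation element-wise. The sketch below illustrates this under stated assumptions: the helper names `layer_forward` and `mlp_forward`, the use of `tanh` as the default activation, and the example weights are all hypothetical.

```python
import math


def layer_forward(W, x, f=math.tanh):
    """Compute f(W^T x_aug), where x_aug = [1, x_1, ..., x_N] and W is (N+1) x U."""
    x_aug = [1.0] + list(x)           # prepend the constant 1 of Equation (33)
    n_units = len(W[0])
    out = []
    for j in range(n_units):
        # Row 0 of W holds the biases w_{0,j}; rows 1..N hold the weights w_{n,j}.
        v = sum(W[n][j] * x_aug[n] for n in range(len(x_aug)))
        out.append(f(v))
    return out


def mlp_forward(weights, x, f=math.tanh):
    """Apply the layers in sequence, feeding each layer's units to the next layer."""
    for W in weights:
        x = layer_forward(W, x, f)
    return x


# Hypothetical weights: N = 2 inputs, U_1 = 2 hidden units, one output unit.
W1 = [[0.0, 1.0],
      [1.0, 0.0],
      [0.0, -1.0]]
W2 = [[0.5],
      [1.0],
      [1.0]]
print(mlp_forward([W1, W2], [2.0, 3.0]))
```

Passing `f=lambda v: v` makes every layer linear, which is a convenient way to check the matrix arithmetic by hand.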

2.5.3. Support Vector Machine (SVM)

SVMs, employed for regression and classification tasks, stand out in supervised machine learning for their precision with complex datasets. Particularly effective in binary classification, SVMs aim to find an optimal separating hyperplane that maximizes the margin between classes. As a linear classifier, the SVM builds on the perceptron introduced by Rosenblatt in 1958 [35,36,37]. Unlike the perceptron, which accepts any separating hyperplane, the SVM identifies the hyperplane (H) with the maximum separation margin. The resulting classifier is defined in Equation (39).
$$h(x) = \mathrm{sign}\left( w^{T} x + b \right) \tag{39}$$
The SVM assigns labels in $\{+1, -1\}$, and its key concept is finding the hyperplane with maximum margin $\sigma$. Figure 6 illustrates this, with the margin expressed in Equation (40) [35]:
$$\sigma = \min_{(x_j, y_j) \in D} \left| w \cdot x_j \right| \tag{40}$$
where the input vectors $x_j$ lie within the unit sphere, $\sigma$ is the distance from the hyperplane to the closest data point, and the vector $w$ lies on the unit sphere.
Max Margin Classifier: We formulate the search for the maximum-margin hyperplane as a constrained optimization task that enlarges the margin while ensuring correct classification of all data points. This is expressed in Equation (41) [35,37]:
$$\underbrace{\max_{u, \delta}\ \sigma(u, \delta)}_{\text{maximize margin}} \quad \text{such that} \quad \underbrace{\forall i:\ y_i \left( u^{T} x_i + \delta \right) \ge 0}_{\text{separating hyperplane}} \tag{41}$$
Upon substituting the definition of $\sigma$, Equation (42) is derived, as given below.
$$\underbrace{\max_{u, \delta}\ \underbrace{\frac{1}{\|u\|_2} \min_{x_i \in D} \left| u^{T} x_i + \delta \right|}_{\sigma(u, \delta)}}_{\text{maximize margin}} \quad \text{s.t.} \quad \underbrace{\forall i:\ y_i \left( u^{T} x_i + \delta \right) \ge 0}_{\text{separating hyperplane}} \tag{42}$$
Because the hyperplane is invariant to a rescaling of $u$ and $\delta$, the scale can be chosen such that $\min_{x \in D} \left| u^{T} x + \delta \right| = 1$. Introducing this as an equality constraint simplifies the objective, per Equation (43) [37]:
$$\max_{u, \delta} \frac{1}{\|u\|_2} \cdot 1 = \min_{u, \delta} \|u\|_2 = \min_{u, \delta} u^{T} u \tag{43}$$
The last step uses the fact that $f(z) = z^2$ is monotonically increasing for $z \ge 0$ and that $\|u\|_2 \ge 0$, so the $u$ minimizing $\|u\|_2$ also minimizes $u^{T} u$. This reformulates the optimization problem as Equation (44); a structural diagram of a multi-SVM is shown in Figure 7.
$$\min_{u, \delta}\ u^{T} u \quad \text{subject to} \quad \forall i:\ y_i \left( u^{T} x_i + \delta \right) \ge 0, \quad \min_i \left| u^{T} x_i + \delta \right| = 1 \tag{44}$$
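The role of the rescaling step can be checked numerically: for any separating $(u, \delta)$, dividing both by $\min_i |u^{T} x_i + \delta|$ leaves the hyperplane unchanged, and the margin then equals $1 / \|u\|_2$. The sketch below assumes a hand-picked hyperplane and toy data; it does not solve the optimization problem, and the function names are hypothetical.

```python
import math


def score(u, delta, x):
    """Signed value u^T x + delta for one point."""
    return sum(ui * xi for ui, xi in zip(u, x)) + delta


def geometric_margin(u, delta, X, y):
    """Margin sigma(u, delta), or None if (u, delta) does not separate the data."""
    scores = [score(u, delta, x) for x in X]
    # Strict inequality: points lying exactly on the hyperplane are rejected here.
    if any(yi * s <= 0 for yi, s in zip(y, scores)):
        return None
    norm = math.sqrt(sum(ui * ui for ui in u))
    return min(abs(s) for s in scores) / norm


def rescale(u, delta, X):
    """Rescale (u, delta) so that min_i |u^T x_i + delta| = 1."""
    m = min(abs(score(u, delta, x)) for x in X)
    return [ui / m for ui in u], delta / m


# Toy data (hypothetical) and a hand-picked separating hyperplane.
X = [(2.0, 0.0), (3.0, 1.0), (-2.0, 0.0), (-3.0, -1.0)]
y = [1, 1, -1, -1]
u, delta = [2.0, 0.0], 0.0

u1, d1 = rescale(u, delta, X)
print(geometric_margin(u, delta, X, y), 1.0 / math.sqrt(sum(v * v for v in u1)))
```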

2.5.4. Bayes and Naive Bayes (NB) Classifier

The Bayes classifier, an ideal algorithm, assigns class labels based on class probabilities given observed features and prior knowledge. It predicts the class with the highest estimated probability and is often used as a benchmark, but it requires complete knowledge of the underlying probability distributions. To estimate $P(y|\bar{x})$ for the Bayes classifier, the common approach is maximum likelihood estimation (MLE), especially for a discrete variable $y$, as outlined in Equation (45) [37]:
$$P(y|\bar{x}) = \frac{\sum_{i=1}^{n} I(\bar{x}_i = \bar{x} \wedge y_i = y)}{\sum_{i=1}^{n} I(\bar{x}_i = \bar{x})} \tag{45}$$
Naive Bayes addresses the limitations of MLE with sparse data by assuming feature independence. It estimates $P(y)$ and $P(\bar{x}|y)$ instead of $P(y|\bar{x})$, using Bayes' rule (Equation (46)) [37]:
$$P(y|\bar{x}) = \frac{P(\bar{x}|y)\, P(y)}{P(\bar{x})} \tag{46}$$
Generative learning estimates $P(y)$ and $P(\bar{x}|y)$. For discrete labels, estimating $P(y)$ amounts to tallying how often each class occurs (Equation (47)).
$$P(y = c) = \frac{\sum_{i=1}^{n} I(y_i = c)}{n} = \pi_c \tag{47}$$
To simplify estimation, the Naive Bayes (NB) assumption is introduced, a key element of the NB classifier. It assumes feature independence given the class label, formalized in Equation (48) for $P(\bar{x}|y)$.
$$P(\bar{x}|y) = \prod_{\alpha=1}^{d} P(x_\alpha|y) \tag{48}$$
Here, $x_\alpha$ is the value of feature $\alpha$, and the feature values are assumed to be entirely independent given the class label $y$. NB classifiers are often effective even when the features have more complex relationships. The Bayes classifier is defined in Equation (49), where the denominator $P(\bar{x})$ is dropped because it does not depend on $y$; applying the NB assumption gives Equation (50), and taking the logarithm gives Equation (51).
$$h(\bar{x}) = \operatorname*{argmax}_{y} P(y|\bar{x}) = \operatorname*{argmax}_{y} P(\bar{x}|y)\, P(y) \tag{49}$$
$$h(\bar{x}) = \operatorname*{argmax}_{y} \prod_{\alpha=1}^{d} P(x_\alpha|y)\, P(y) \tag{50}$$
$$h(\bar{x}) = \operatorname*{argmax}_{y} \sum_{\alpha=1}^{d} \log P(x_\alpha|y) + \log P(y) \tag{51}$$
Estimating $\log P(x_\alpha|y)$ is straightforward because it involves only one dimension. $P(y)$ is unaffected by the assumption and is calculated independently. In Gaussian NB, where the features are continuous ($x_\alpha \in \mathbb{R}$), $P(x_\alpha|y)$ follows a Gaussian distribution (Equation (52)). Each feature $x_\alpha$ is assumed to follow a class-conditional Gaussian distribution with mean $\mu_{\alpha c}$ and variance $\sigma_{\alpha c}^2$, which are estimated for each class by Equations (53) and (54), where $n_c = \sum_{i=1}^{n} I(y_i = c)$ is the number of samples in class $c$ [37].
$$P(x_\alpha|y = c) = \frac{1}{\sqrt{2 \pi \sigma_{\alpha c}^2}} \exp\left( -\frac{(x_\alpha - \mu_{\alpha c})^2}{2 \sigma_{\alpha c}^2} \right) \tag{52}$$
$$\mu_{\alpha c} = \frac{1}{n_c} \sum_{i=1}^{n} I(y_i = c)\, x_{i\alpha} \tag{53}$$
$$\sigma_{\alpha c}^2 = \frac{1}{n_c} \sum_{i=1}^{n} I(y_i = c)\, (x_{i\alpha} - \mu_{\alpha c})^2 \tag{54}$$
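A Gaussian NB classifier reduces to computing one prior, one mean and one variance per class and feature, and then scoring with the log form of the decision rule. The following sketch is illustrative only: the function names, the small variance floor `eps`, and the toy data are assumptions, not part of any reviewed method.

```python
import math
from collections import defaultdict


def fit_gnb(X, y):
    """Estimate the class prior and the per-feature mean and variance for each class."""
    groups = defaultdict(list)
    for xi, yi in zip(X, y):
        groups[yi].append(xi)
    model = {}
    for c, rows in groups.items():
        n_c, d = len(rows), len(rows[0])
        mu = [sum(r[a] for r in rows) / n_c for a in range(d)]
        var = [sum((r[a] - mu[a]) ** 2 for r in rows) / n_c for a in range(d)]
        model[c] = (n_c / len(X), mu, var)
    return model


def predict_gnb(model, x, eps=1e-9):
    """Return argmax over classes of sum_a log P(x_a | y) + log P(y)."""
    best, best_score = None, -math.inf
    for c, (prior, mu, var) in model.items():
        s = math.log(prior)
        for xa, m, v in zip(x, mu, var):
            v = v + eps   # assumption: a small floor avoids division by zero
            s += -0.5 * math.log(2 * math.pi * v) - (xa - m) ** 2 / (2 * v)
        if s > best_score:
            best, best_score = c, s
    return best


# Toy data (hypothetical): two classes with two continuous features each.
X = [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2), (4.0, 5.0), (4.2, 4.8), (3.8, 5.2)]
y = [0, 0, 0, 1, 1, 1]
model = fit_gnb(X, y)
print(predict_gnb(model, (1.1, 2.1)), predict_gnb(model, (4.1, 5.1)))
```

Working with log-probabilities avoids numerical underflow when the number of features is large.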

2.5.5. Logistic Regression (LR)

Logistic regression, commonly used in classification, calculates the probability of a binary label based on input features. In logistic regression (LR), the logistic (sigmoid) function transforms a linear combination of input features $x$, weights $w$, and a bias term $b$ into a likelihood estimate between 0 and 1. Mathematically, logistic regression is defined in Equation (55) [38]:
$$P(y = 1|x) = \frac{1}{1 + e^{-(w^{T} x + b)}} \tag{55}$$
In Equation (55), $P(y = 1|x)$ is the likelihood of class 1 given the features $x$. The parameters $w$ and $b$ are estimated using statistical methods; because the model makes few assumptions about $P(x_i|y)$, it allows flexibility in the underlying distributions [38].
The Maximum Likelihood Estimate (MLE): MLE finds the parameters that maximize $P(y|x, w)$, the probability of observing the labels $y_1, \ldots, y_m$ given the feature values $x_k$. The expressions below take the labels as $y_k \in \{+1, -1\}$, absorb the bias $b$ into $w$, and assume the $y_k$ are independent given $x_k$ and $w$. Equation (56) gives the conditional data likelihood.
$$P(y|x, w) = \prod_{k=1}^{m} P(y_k|x_k, w) \tag{56}$$
Taking the logarithm of the product in Equation (56), we obtain Equation (57):
$$\log \prod_{k=1}^{m} P(y_k|x_k, w) = -\sum_{k=1}^{m} \log\left( 1 + e^{-y_k w^{T} x_k} \right) \tag{57}$$
To find the MLE for $w$, we maximize this log-likelihood, which is equivalent to minimizing its negative, as in Equation (58):
$$w_{MLE} = \operatorname*{argmax}_{w} -\sum_{k=1}^{m} \log\left( 1 + e^{-y_k w^{T} x_k} \right) = \operatorname*{argmin}_{w} \sum_{k=1}^{m} \log\left( 1 + e^{-y_k w^{T} x_k} \right) \tag{58}$$
The minimization in Equation (58) is carried out through gradient descent on the negative log-likelihood in Equation (59).
$$L(w) = \sum_{k=1}^{m} \log\left( 1 + e^{-y_k w^{T} x_k} \right) \tag{59}$$
Maximum a Posteriori (MAP): In maximum a posteriori (MAP) estimation, assuming a Gaussian prior on $w$, the objective is to find the $w_{MAP}$ that maximizes the posterior probability, as represented in Equation (60). Reformulated, this becomes the optimization problem in Equation (61), where $\lambda = \frac{1}{2\sigma^2}$, and gradient descent is employed on the negative log posterior for parameter optimization [32,37].
$$w_{MAP} = \operatorname*{argmax}_{w} \log\bigl( P(y|x, w)\, P(w) \bigr) \tag{60}$$
$$w_{MAP} = \operatorname*{argmin}_{w} \sum_{k=1}^{m} \log\left( 1 + e^{-y_k w^{T} x_k} \right) + \lambda\, w^{T} w \tag{61}$$
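The MAP objective can be minimized with a few lines of plain gradient descent. The sketch below is illustrative rather than a reference implementation: it assumes labels in $\{-1, +1\}$, absorbs the bias into $w$ through a constant first feature, and uses hypothetical values for the learning rate, the number of steps, and $\lambda$.

```python
import math


def neg_log_posterior(w, X, y, lam):
    """Sum_k log(1 + exp(-y_k w^T x_k)) + lam * w^T w."""
    loss = sum(
        math.log1p(math.exp(-yk * sum(wi * xi for wi, xi in zip(w, xk))))
        for xk, yk in zip(X, y)
    )
    return loss + lam * sum(wi * wi for wi in w)


def fit_logreg(X, y, lam=0.1, lr=0.1, steps=500):
    """Plain gradient descent; labels y_k must be in {-1, +1}."""
    w = [0.0] * len(X[0])
    for _ in range(steps):
        grad = [2 * lam * wi for wi in w]            # gradient of lam * w^T w
        for xk, yk in zip(X, y):
            margin = yk * sum(wi * xi for wi, xi in zip(w, xk))
            # d/dw log(1 + e^{-margin}) = -y_k x_k / (1 + e^{margin})
            coeff = -yk / (1.0 + math.exp(margin))
            for j, xj in enumerate(xk):
                grad[j] += coeff * xj
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w


# Toy data (hypothetical); the first feature is a constant 1 acting as the bias.
X = [(1.0, -2.0), (1.0, -1.0), (1.0, 1.0), (1.0, 2.0)]
y = [-1, -1, 1, 1]
w = fit_logreg(X, y)
print(w, neg_log_posterior(w, X, y, 0.1))
```

Setting `lam=0` recovers the MLE objective, although on perfectly separable data the unregularized weights then grow without bound.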

2.5.6. Decision Tree (DT)

Decision trees, used for regression and classification, form a hierarchical structure with nodes for decisions, branches for outcomes, and leaves for predictions. The goal is a compact tree with pure leaves, that is, leaves that each contain instances from a single class. Finding a minimum-size tree that is consistent with the data is NP-hard, so this goal is computationally challenging [37]. Tree quality is therefore assessed with impurity functions, evaluated on a dataset $D$ of pairs $(a_1, b_1), \ldots, (a_n, b_n)$, where $b_i$ takes values in $\{1, \ldots, k\}$ representing $k$ classes.
Gini Impurity: The Gini impurity of a leaf is calculated using Equation (62), where $q_m$ is the fraction of inputs in $D$ with label $m$ and the sum runs over all classes. The Gini impurity of the entire tree is given by Equation (63).
$$I(D) = \sum_{m=1}^{k} q_m (1 - q_m) \tag{62}$$
$$G^{T}(D) = \frac{|D_L|}{|D|}\, G^{T}(D_L) + \frac{|D_R|}{|D|}\, G^{T}(D_R) \tag{63}$$
where $D = D_L \cup D_R$, $D_L \cap D_R = \emptyset$, $\frac{|D_L|}{|D|}$ represents the fraction of inputs in the left subtree, and $\frac{|D_R|}{|D|}$ represents the fraction of inputs in the right subtree. The binary decision tree with class levels is visualized in Figure 8.
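As a brief illustration of how the Gini impurity drives split selection, the sketch below computes the impurity of a label set, the weighted impurity of a binary split, and the best threshold on a single feature. The function names and the toy data are assumptions for this example; practical tree learners scan thresholds more efficiently.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: sum over classes of q_m * (1 - q_m)."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())


def split_gini(left, right):
    """Impurity of a binary split, weighted by the fraction of inputs on each side."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_threshold(values, labels):
    """Scan midpoints between sorted feature values; return (threshold, impurity)."""
    pairs = sorted(zip(values, labels))
    best = (None, gini(labels))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                      # no threshold fits between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v <= t]
        right = [lab for v, lab in pairs if v > t]
        g = split_gini(left, right)
        if g < best[1]:
            best = (t, g)
    return best


# Toy feature values and labels (hypothetical).
print(gini([0, 0, 1, 1]))
print(best_threshold([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1]))
```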
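Entropy in Decision Trees: Entropy measures the disorder of a leaf using its class fractions $p_n$. The least informative leaf has a uniform class distribution $q$, so a good split should move $p$ as far from uniform as possible. The KL-divergence $KL(p\|q)$ gauges how far $p$ is from the uniform distribution $q$, and maximizing it is equivalent to minimizing the entropy, as shown in Equation (64).
$$\begin{aligned} KL(p\|q) &= \sum_{n=1}^{c} p_n \log \frac{p_n}{q_n} \ge 0 && \leftarrow \text{KL-divergence} \\ &= \sum_{n} p_n \log p_n - p_n \log q_n, && \text{where } q_n = \tfrac{1}{c} \\ &= \sum_{n} p_n \log p_n + p_n \log c \\ &= \sum_{n} p_n \log p_n + \log c \sum_{n} p_n, && \text{where } \log c \text{ is constant and } \sum_{n} p_n = 1 \\ \max_{p} KL(p\|q) &= \max_{p} \sum_{n} p_n \log p_n \\ &= \min_{p} -\sum_{n} p_n \log p_n \\ &= \min_{p} H(s) && \leftarrow \text{Entropy} \end{aligned} \tag{64}$$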
ID3 Algorithm: The ID3 algorithm stops tree-building when all labels are the same or no more attributes can split further. If all samples share the same label, a leaf with that label is created. If no more splitting attributes exist, a leaf with the most frequent label is generated (Equation (65)) [39].
$$\mathrm{ID3}(S): \begin{cases} \text{if } \exists\, \bar{y} \text{ s.t. } \forall (x, y) \in S,\ y = \bar{y} & \Rightarrow \text{return leaf with label } \bar{y} \\ \text{if } \exists\, \bar{x} \text{ s.t. } \forall (x, y) \in S,\ x = \bar{x} & \Rightarrow \text{return leaf with } \mathrm{mode}(y : (x, y) \in S) \end{cases} \tag{65}$$
CART (Classification and Regression Trees): CART is suitable for continuous labels ($y_i \in \mathbb{R}$), using the squared loss function (Equation (66)). It efficiently finds the best split (attribute and threshold) by minimizing the average squared difference from the average label $\bar{y}_S$ [37].
$$L(S) = \underbrace{\frac{1}{|S|} \sum_{(x, y) \in S} \left( y - \bar{y}_S \right)^2}_{\text{average squared difference from average label}}, \quad \text{where } \bar{y}_S = \underbrace{\frac{1}{|S|} \sum_{(x, y) \in S} y}_{\text{average label}} \tag{66}$$

2.5.7. Ensemble Classifier (EC)

Ensemble classifiers represent a sophisticated class of machine learning techniques aimed at enhancing the precision and resilience of predictive models. Their fundamental premise revolves around the amalgamation of predictions from multiple foundational models. Below, we delve into several prominent types of ensemble classifiers, each with its distinct modus operandi.
Bagging (Bootstrap Aggregating): Bagging trains multiple base models in parallel, each operating independently on a distinct, resampled subset of the training data. To see why this helps, it is useful to decompose the expected error of a model into its sources. The bias/variance decomposition is described by Equation (67) [37,40]:
$$\underbrace{E\left[ (f_k(x) - d(x))^2 \right]}_{\text{Error}} = \underbrace{E\left[ (f_k(x) - \bar{f}(x))^2 \right]}_{\text{Variance}} + \underbrace{E\left[ (\bar{f}(x) - c(x))^2 \right]}_{\text{Bias}^2} + \underbrace{E\left[ (c(x) - d(x))^2 \right]}_{\text{Noise}} \tag{67}$$
Equation (67) decomposes the error into three components: variance, squared bias, and noise. Here $f_k$ is the model trained on the $k$th dataset, $\bar{f}$ is the expected model, $c(x)$ is the expected label, and $d(x)$ is the observed label. Our primary objective is to minimize the variance term, which is expressed as Equation (68):
$$\underbrace{E\left[ (f_k(x) - \bar{f}(x))^2 \right]}_{\text{Variance}} \tag{68}$$
Ensemble learning reduces variance by averaging the individual predictions $f_k(x)$. Bagging creates multiple datasets, trains an individual classifier $h_i(\cdot)$ on each, and aggregates their predictions in the final ensemble classifier $h(z)$ through averaging (Equation (69)) [40]:
$$h(z) = \frac{1}{n} \sum_{i=1}^{n} h_i(z) \tag{69}$$
In practice, a larger value of n often leads to a better-performing ensemble, as it leverages diverse base models for more robust predictions.
Random Forest (RF): RF stands as one of the most renowned and beneficial bagging algorithms. The RF algorithm entails creating multiple datasets, building decision trees with random feature subsets for each dataset, and averaging their predictions for the final classifier, $h(x) = \frac{1}{m} \sum_{j=1}^{m} h_j(x)$ [37,40].
Boosting: Boosting addresses high bias in machine learning models drawn from a hypothesis class $\mathbb{H}$. It reduces bias by iteratively constructing an ensemble of weak learners, $H_T(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$, with each iteration introducing a new classifier guided by gradient descent in function space [37,41].
Gradient descent: Gradient descent in function space optimizes the loss function $\ell$ within the hypothesis class $\mathbb{H}$ by finding the step size $\alpha$ and the weak learner $h$ that minimize $\ell(H + \alpha h)$. The technique uses a Taylor approximation to find the optimal weak learner $h$ for a small fixed $\alpha$ (around 0.1), as in Equation (70) [34].
$$\operatorname*{argmin}_{h \in \mathbb{H}} \ell(H + \alpha h) \approx \operatorname*{argmin}_{h \in \mathbb{H}} \langle \nabla \ell(H), h \rangle = \operatorname*{argmin}_{h \in \mathbb{H}} \sum_{i=1}^{n} \frac{\partial \ell}{\partial [H(x_i)]}\, h(x_i) \tag{70}$$
Here, each prediction serves as an input to the loss function. The function $\ell(H)$ can be expressed by Equation (71).
$$\ell(H) = \sum_{i=1}^{n} \ell(H(x_i)) = \ell(H(x_1), \ldots, H(x_n)) \tag{71}$$
This approximation enables the utilization of boosting as long as there exists a method, denoted as $A$, capable of solving Equation (72).
$$h_{t+1} = \operatorname*{argmin}_{h \in \mathbb{H}} \sum_{i=1}^{n} \underbrace{\frac{\partial \ell}{\partial [H(x_i)]}}_{r_i}\, h(x_i) \tag{72}$$
where $A(\{(x_1, r_1), \ldots, (x_n, r_n)\}) = \operatorname*{argmin}_{h \in \mathbb{H}} \sum_{i=1}^{n} r_i h(x_i)$; progress is made as long as $\sum_{i=1}^{n} r_i h(x_i) < 0$, even if $h$ is not an excellent learner.
AnyBoost (Generic Boosting): AnyBoost, a versatile boosting technique, iteratively combines weak learners, prioritizing challenging data points for enhanced accuracy. It creates a strong learner from weak ones, effectively reducing bias and improving predictions. See Algorithm 1 for the pseudo-code [41].
Algorithm 1: Pseudo-code for AnyBoost.
Input: $\ell$, $\alpha$, $\{(x_i, y_i)\}$, $A$
$H_0 = 0$
for $t = 0 : T - 1$ do
   $\forall i:\ r_i = \frac{\partial \ell\left( (H_t(x_1), y_1), \ldots, (H_t(x_n), y_n) \right)}{\partial H(x_i)}$
   $h_{t+1} = A(\{(x_1, r_1), \ldots, (x_n, r_n)\}) = \operatorname*{argmin}_{h \in \mathbb{H}} \sum_{i=1}^{n} r_i h(x_i)$
   if $\sum_{i=1}^{n} r_i h_{t+1}(x_i) < 0$ then
      $H_{t+1} = H_t + \alpha_{t+1} h_{t+1}$
   else
      return $H_t$ (negative gradient orthogonal to descent direction)
   end
end
return $H_T$
Gradient Boosted Regression Trees (GBRT): GBRT, a sequential regression algorithm, combines decision trees to correct errors iteratively for precise predictions. Applicable to both classification and regression, it uses weak learners, often shallow regression trees, with a fixed depth. The step size (α) is a small constant, and the loss function (l) must be differentiable, convex, and decomposable over individual samples. The ensemble’s overall loss is defined in Equation (73) [41].
$$L(H) = \sum_{i=1}^{n} \ell(H(x_i)) \tag{73}$$
GBRT minimizes the loss by iteratively adding weak learners to the ensemble. Pseudo-code is in Algorithm 2 [41].
Algorithm 2: Pseudo-code for GBRT
Input: $\ell$, $\alpha$, $\{(x_i, y_i)\}$, $A$
$H = 0$
for $t = 1 : T$ do
   $\forall i:\ t_i = y_i - H(x_i)$
   $h = \operatorname*{argmin}_{h \in \mathbb{H}} \sum_{i=1}^{n} (h(x_i) - t_i)^2$
   $H \leftarrow H + \alpha h$
end
return $H$
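Algorithm 2 can be made concrete with the simplest possible weak learner. The sketch below assumes the squared loss, a single real-valued feature, and depth-1 regression trees (stumps); these choices, the function names, and the toy data are illustrative assumptions rather than the configuration of any reviewed study.

```python
def fit_stump(xs, targets):
    """Least-squares stump on a 1-D feature: returns (threshold, left_mean, right_mean).

    Assumes xs contains at least two distinct values.
    """
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    return best[1:]


def stump_predict(stump, x):
    t, lm, rm = stump
    return lm if x <= t else rm


def fit_gbrt(xs, ys, alpha=0.1, rounds=100):
    """Each round fits a stump to the residuals t_i = y_i - H(x_i)."""
    ensemble, preds = [], [0.0] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        ensemble.append(stump)
        # H <- H + alpha * h, evaluated on the training points.
        preds = [p + alpha * stump_predict(stump, x) for p, x in zip(preds, xs)]
    return ensemble


def gbrt_predict(ensemble, x, alpha=0.1):
    return alpha * sum(stump_predict(s, x) for s in ensemble)


# Toy regression data (hypothetical): a step function.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
model = fit_gbrt(xs, ys)
print(gbrt_predict(model, 2), gbrt_predict(model, 5))
```

With a step size of 0.1, each round removes about 10% of the remaining residual on this data, which is why many rounds are needed even for a simple target.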
AdaBoost: AdaBoost is a binary classification algorithm that uses weak learners $h$ producing binary predictions. Its key components are the step size $\alpha$ and the exponential loss $\ell(H)$, given by Equation (74):
$$\ell(H) = \sum_{i=1}^{n} e^{-y_i H(x_i)} \tag{74}$$
The gradient $r_i$ needed to find the optimal weak learner is computed using Equation (75).
$$r_i = \frac{\partial \ell}{\partial H(x_i)} = -y_i\, e^{-y_i H(x_i)} \tag{75}$$
For clarity and convenience, we introduce the normalized weights $w_i = \frac{1}{Z} e^{-y_i H(x_i)}$, where $Z = \sum_{i=1}^{n} e^{-y_i H(x_i)}$ is the normalizing constant. Each $w_i$ signifies the contribution of $(x_i, y_i)$ to the global loss. To find the next weak learner, we solve the optimization problem in Equation (76) with $h(x_i) \in \{+1, -1\}$ [42].
$$\begin{aligned} h &= \operatorname*{argmin}_{h \in \mathbb{H}} \sum_{i=1}^{n} r_i h(x_i) && \text{substitute } r_i = -y_i e^{-H(x_i) y_i} \\ &= \operatorname*{argmin}_{h \in \mathbb{H}} -\sum_{i=1}^{n} y_i e^{-H(x_i) y_i} h(x_i) && \text{substitute } w_i = \tfrac{1}{Z} e^{-H(x_i) y_i} \\ &= \operatorname*{argmin}_{h \in \mathbb{H}} -\sum_{i=1}^{n} w_i y_i h(x_i) && y_i h(x_i) \in \{+1, -1\},\ h(x_i) y_i = 1 \iff h(x_i) = y_i \\ &= \operatorname*{argmin}_{h \in \mathbb{H}} \sum_{i: h(x_i) \ne y_i} w_i - \sum_{i: h(x_i) = y_i} w_i && \sum_{i: h(x_i) = y_i} w_i = 1 - \sum_{i: h(x_i) \ne y_i} w_i \\ &= \operatorname*{argmin}_{h \in \mathbb{H}} \sum_{i: h(x_i) \ne y_i} w_i && \text{this is the weighted classification error} \end{aligned} \tag{76}$$
In (76), $\epsilon = \sum_{i: h(x_i) \ne y_i} w_i$ is the weighted classification error. AdaBoost seeks a classifier minimizing this error; the classifier need not be highly accurate, only better than chance ($\epsilon < \frac{1}{2}$). The optimal step size $\alpha$ is the one that minimizes the loss $\ell$ and can be found in closed form from the optimization problem (77) [41].
$$\alpha = \operatorname*{argmin}_{\alpha} \ell(H + \alpha h) = \operatorname*{argmin}_{\alpha} \sum_{i=1}^{n} e^{-y_i [H(x_i) + \alpha h(x_i)]} \tag{77}$$
Taking the derivative with respect to $\alpha$ and setting it to zero gives Equations (78)–(80):
$$-\sum_{i=1}^{n} y_i h(x_i)\, e^{-(y_i H(x_i) + \alpha y_i h(x_i))} = 0 \qquad \bigl( y_i h(x_i) \in \{+1, -1\} \bigr) \tag{78}$$
$$-\sum_{i: h(x_i) y_i = 1} e^{-(y_i H(x_i) + \alpha \underbrace{y_i h(x_i)}_{1})} + \sum_{i: h(x_i) y_i = -1} e^{-(y_i H(x_i) + \alpha \underbrace{y_i h(x_i)}_{-1})} = 0 \qquad \left( w_i = \tfrac{1}{Z} e^{-y_i H(x_i)} \right) \tag{79}$$
$$-\sum_{i: h(x_i) y_i = 1} w_i e^{-\alpha} + \sum_{i: h(x_i) y_i = -1} w_i e^{\alpha} = 0 \qquad \left( \epsilon = \sum_{i: h(x_i) y_i = -1} w_i \right) \tag{80}$$
With $\epsilon$ representing the sum of the weights of the misclassified examples, this simplifies to Equation (81):
$$-(1 - \epsilon)\, e^{-\alpha} + \epsilon\, e^{\alpha} = 0 \tag{81}$$
Solving for $\alpha$ gives Equations (82) and (83):
$$e^{2\alpha} = \frac{1 - \epsilon}{\epsilon} \tag{82}$$
$$\alpha = \frac{1}{2} \ln \frac{1 - \epsilon}{\epsilon} \tag{83}$$
The optimal step size $\alpha$, given in closed form by Equation (83), facilitates rapid convergence in AdaBoost. After each step $H_{t+1} = H_t + \alpha h$, all weights must be recalculated and re-normalized. The pseudo-code for the AdaBoost ensemble classifier is presented in Algorithm 3 [37,41].
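Algorithm 3: Pseudo-code for AdaBoost
Input: $\ell$, $\alpha$, $\{(x_i, y_i)\}$, $A$
$H = 0$
$\forall i:\ w_i = \frac{1}{n}$
for $t = 1 : T$ do
    $h = A(\{(w_1, x_1, y_1), \ldots, (w_n, x_n, y_n)\})$
    $\epsilon = \sum_{i: h(x_i) \ne y_i} w_i$
    if $\epsilon < \frac{1}{2}$ then
       $\alpha = \frac{1}{2} \ln\left( \frac{1 - \epsilon}{\epsilon} \right)$
       $H \leftarrow H + \alpha h$
       $\forall i:\ w_i \leftarrow \frac{w_i\, e^{-\alpha h(x_i) y_i}}{2 \sqrt{\epsilon (1 - \epsilon)}}$
    else
       return $H$
    end
end
return $H$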
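The AdaBoost loop can be sketched with decision stumps on a single feature as the weak learner. The stump search, the function names, the guard against a zero error, and the toy data are assumptions made for illustration; weights are renormalized by their sum, which is equivalent to dividing by $2\sqrt{\epsilon(1-\epsilon)}$.

```python
import math


def best_stump(xs, ys, w):
    """Weak learner: threshold/sign stump with the lowest weighted error."""
    best = None
    for t in sorted(set(xs)):
        for sign in (+1, -1):
            # The stump predicts `sign` for x > t and `-sign` otherwise.
            err = sum(wi for wi, x, y in zip(w, xs, ys)
                      if (sign if x > t else -sign) != y)
            if best is None or err < best[0]:
                best = (err, t, sign)
    return best


def adaboost(xs, ys, rounds=10):
    """AdaBoost on a 1-D feature with labels in {-1, +1}."""
    n = len(xs)
    w = [1.0 / n] * n
    ensemble = []                                   # list of (alpha, threshold, sign)
    for _ in range(rounds):
        eps, t, sign = best_stump(xs, ys, w)
        if eps >= 0.5:
            break                                   # weak learner no better than chance
        eps = max(eps, 1e-10)                       # assumption: guard against log(1/0)
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, t, sign))
        # Misclassified points gain weight; then normalize so the weights sum to 1.
        w = [wi * math.exp(-alpha * (sign if x > t else -sign) * y)
             for wi, x, y in zip(w, xs, ys)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble


def predict(ensemble, x):
    score = sum(a * (s if x > t else -s) for a, t, s in ensemble)
    return 1 if score >= 0 else -1


# Toy labels (hypothetical) that no single stump can fit.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
model = adaboost(xs, ys, rounds=3)
print([predict(model, x) for x in xs])
```

On this data a single stump misclassifies at least two of the six points, while the weighted combination of three stumps fits all of them.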

2.6. Assessment Metrics

The crucial next step in evaluating machine learning classifiers is the use of a separate test dataset that has not been part of the training process. Evaluation involves various parameters, with the confusion matrix being a widely adopted tool. This matrix forms the basis for determining assessment metrics, which are essential for validating model performance, whether for a traditional or a deep neural network classifier. In cancer prediction tasks, numerous metrics are employed to assess effectiveness, including error rate, accuracy, sensitivity, specificity, recall, precision, predictivity, F1 score, area under the curve (AUC), negative predictive value (NPV), false positive rate (FPR), false negative rate (FNR), and Matthews correlation coefficient (MCC) [43]. These metrics quantify predictive capabilities and are vital for diverse prediction tasks. Multiple performance evaluation metrics rely on the confusion matrix, as visualized in Figure 9 for multiclass classification.
Accuracy (Acc): This metric is a fundamental indicator of a model’s overall performance. It measures the ratio of accurately categorized cases (both cancer and non-cancer) to the overall cases in the test dataset. It may not be suitable when the dataset is imbalanced.
$$\text{Accuracy}\ (\%ACC) = \frac{TP + TN}{\text{Total Samples}} \times 100$$
Error Rate (ER): The error rate is the complement of accuracy. It quantifies the proportion of instances that the model classifies incorrectly. A lower error rate indicates a more accurate model, and the metric is especially useful for stating how often the model makes incorrect predictions.
$$\text{Error rate}\ (ER) = 1 - Acc$$
$$\%ER = \frac{FP + FN}{\text{Total Samples}} \times 100 = 100 - \%ACC$$
Specificity (% Spe): True negative rate, commonly known as specificity, is a metric that evaluates a model’s accuracy in correctly identifying true negative cases. This is crucial in minimizing false alarms.
$$\text{Specificity}\ (\%Sp) = \text{True Negative Rate}\ (\%TNR) = \frac{TN}{\text{Total Negative}} \times 100$$
Sensitivity (% Sen): This metric, also termed recall or the true positive rate (TPR), gauges the model’s capability to accurately identify true positive values, which correspond to cases of cancer, among the total positive cases within a dataset [42].
$$\text{Sensitivity}\ (\%Sen) = \text{Recall}\ (\%Re) = \text{True Positive Rate}\ (\%TPR) = \frac{TP}{\text{Total Positive}} \times 100$$
Precision (% Pr): Precision, also known as positive predictivity (PP) or positive predictive value, is the fraction of positive predictions that are truly positive. A high precision score signifies that the model makes few false positive errors.
$$\text{Precision}\ (\%Pr) = \text{Positive Predictivity}\ (\%PP) = \frac{TP}{\text{True Prediction}} \times 100, \quad \text{where True Prediction} = TP + FP$$
F1 Score (% F1): An equitable metric that amalgamates positive predictive value and recall forms the F1 score [44]. It is particularly valuable when you require a singular metric that contemplates both incorrect positive predictions and missed positive predictions.
$$\text{F1-score}\ (\%F1) = \frac{2 \times TP}{2 \times TP + FP + FN} \times 100 = \frac{2 \times PP \times TPR}{PP + TPR} \times 100$$
Area Under the Curve (AUC): The AUC assesses the classifier’s capacity to differentiate between affirmative and negative occurrences. It gauges the general efficacy of the model concerning receiver operating characteristic (ROC) graphs. A superior AUC score signifies enhanced differentiation capability.
Negative Predictive Value (% NPV): It measures the classifier’s capability to accurately predict negative instances among all instances classified as negative. A high NPV suggests that the classifier is effective at identifying non-cancer cases when it predicts them as such, reducing the likelihood of unnecessary treatments.
$$\text{Negative Predictive Value}\ (\%NPV) = \frac{TN}{TN + FN} \times 100$$
False Positive Rate (%FPR): This quantifies how often the classifier falsely identifies a negative instance as positive. It provides insights into the model’s propensity for false positive errors. In cancer detection, a high FPR can lead to unnecessary distress and treatments for individuals who do not have cancer.
$$\text{False Positive Rate}\ (\%FPR) = \frac{FP}{\text{Total Negative}} \times 100$$
False Negative Rate (%FNR): It determines the classifier’s tendency to falsely identify a positive instance as negative. It reveals the model’s performance regarding false negative errors, which is critical in cancer detection to avoid missing real cases. High FNR can lead to undiagnosed cancer cases and potentially delayed treatments.
$$\text{False Negative Rate}\ (\%FNR) = \frac{FN}{\text{Total Positive}} \times 100$$
Matthews Correlation Coefficient (MCC): The Matthews correlation coefficient (MCC) represents a pivotal metric utilized for evaluating the effectiveness of binary (two class) predictions, prominently beneficial when dealing with scenarios where classes are asymmetrically distributed in their volume and representation within the dataset. The formula to calculate MCC is:
$$MCC = \frac{(TN \times TP) - (FN \times FP)}{\sqrt{(\text{True Prediction})\,(\text{False Prediction})\,(\text{Total Positive})\,(\text{Total Negative})}}$$
where TN (True Negative) is the number of accurately recognized negatives, TP (True Positive) is the number of accurately recognized positives, FP (False Positive) is the number of negatives incorrectly identified as positives, and FN (False Negative) is the number of positives incorrectly recognized as negatives. Total Positive is the sum of TP and FN (all actual positives), Total Negative is the sum of TN and FP (all actual negatives), True Prediction is the sum of TP and FP (all instances predicted as positive), False Prediction is the sum of FN and TN (all instances predicted as negative), and Total Samples is the sum of TP, TN, FP, and FN (the entire dataset).
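The metrics of this section all follow from the four confusion-matrix counts, so they can be computed together. The sketch below is a minimal illustration with hypothetical counts; the function name and dictionary keys are assumptions, NPV is computed as TN / (TN + FN) in line with its verbal definition, and the code assumes that no class and no prediction group is empty (otherwise a division by zero occurs).

```python
import math


def confusion_metrics(tp, tn, fp, fn):
    """Percent metrics from binary confusion-matrix counts (MCC lies in [-1, 1])."""
    total = tp + tn + fp + fn
    pos, neg = tp + fn, tn + fp              # all actual positives / negatives
    pred_pos, pred_neg = tp + fp, tn + fn    # all predicted positives / negatives
    return {
        "ACC": 100 * (tp + tn) / total,
        "ER": 100 * (fp + fn) / total,
        "SEN": 100 * tp / pos,
        "SPE": 100 * tn / neg,
        "PRE": 100 * tp / pred_pos,
        "NPV": 100 * tn / pred_neg,
        "F1": 100 * 2 * tp / (2 * tp + fp + fn),
        "FPR": 100 * fp / neg,
        "FNR": 100 * fn / pos,
        "MCC": (tn * tp - fn * fp) / math.sqrt(pred_pos * pred_neg * pos * neg),
    }


# Hypothetical counts, chosen only for illustration.
m = confusion_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.2f}")
```

In this example the accuracy is 85% while the sensitivity is only 80%, which illustrates why accuracy alone can hide missed positive cases.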

3. Review Analysis

In this section, we present a thorough and extensive analysis of cancer detection utilizing conventional machine learning models applied to medical imaging datasets. Our study is focused exclusively on the detection of two specific types of cancer: colorectal and stomach cancer. For each of these cancer types, we have meticulously compiled a comprehensive review table that encompasses the relevant literature published during the period spanning 2017 to 2023. This table encompasses a range of crucial review parameters, including the year of publication, the datasets utilized, preprocessing methods, feature extraction techniques, machine learning classifiers employed, the number of images involved, the imaging modality, and various performance metrics. In total, our review encompasses 36 research articles that have harnessed medical imaging datasets to detect these specific types of cancer. Our primary emphasis lies in scrutinizing the utilization of traditional machine learning methodologies in the context of cancer detection using image datasets. We have conducted this analysis based on the meticulously assembled review tables. Subsequent subsections provide in-depth and comprehensive reviews for both colorectal and stomach cancer. Within our analysis, we delve into the intricate application of machine learning approaches for the intent of cancer prediction. Our overarching goal is to furnish valuable insights into the efficacy and constraints of conventional machine learning models when applied to the realm of cancer detection using medical imaging datasets. Through a meticulous examination and comparative analysis of results derived from various studies, our objective is to make a meaningful contribution to the evolution of cancer detection methodologies and to offer guidance for future research endeavors in this critical domain.

3.1. Analysis of Colorectal Cancer Prediction

Table 3 showcases 20 studies conducted from 2017 to 2023, focusing on machine learning-based colorectal cancer detection. These studies underscore the vital role of preprocessing methods in enhancing detection accuracy. The highest accuracy achieved is 100%, and the lowest is 76.00%. Various techniques, including cropping, stain normalization, contrast enhancement, smoothing, and filtering, were employed in conjunction with segmentation, feature extraction, and machine learning algorithms such as SVM, MLP, RF, and KNN. These approaches detect colorectal cancer using modalities such as endocytoscopy, histopathological images, and clinical data. The studies employed varying quantities of images, patients, or slices, ranging from 54 to 100,000. The “KCRC-16” dataset features prominently in these analyses.
In a comparative analysis of colorectal cancer detection studies, (Talukder et al., 2022) [45] stood out with an accuracy of 100%. Their approach included preprocessing steps such as resizing, BGR2RGB conversion, and normalization. Deep learning models such as DenseNet169, MobileNet, VGG19, VGG16, and DenseNet201 were employed. Performance assessment was conducted using a combination of voting, XGB, EC, MLP, LGB, RF, SVM, LR, and hybrid techniques on a dataset comprising 2800 H&E images from the LC25000 dataset. In contrast, (Ying et al., 2022) [46] achieved the lowest accuracy, 76.0%, in colorectal cancer detection. Their approach involved manual region of interest (ROI) selection and various preprocessing techniques. They leveraged multiple features, including FOS, shape, GLCM, GLSZM, GLRLM, NGTDM, GLDM, LoG, and WT. Classification was carried out using the MLR technique on a dataset consisting of 276 CECT images from a private dataset. Their study also reported a sensitivity of 65.00%, a specificity of 80.00%, and a precision of 54.00%, indicating relatively weak performance in identifying colorectal cancer cases.
(Khazaee Fadafen and Rezaee 2023) [47] conducted a colorectal cancer detection study on the largest dataset among all reviewed studies, comprising a total of 100,000 medical images sourced from the H&E NCT-CRC-HE-100K dataset. Their preprocessing methodology encompassed the conversion of RGB images to the HSV color space and the utilization of the lightness space. For classification, they used the dResNet architecture in conjunction with DSVM, which resulted in an accuracy of 99.76%. (Jansen-winkeln et al., 2021) [48] conducted a study with the smallest dataset, comprising only 54 medical images. Their preprocessing approach included smoothing and normalization. For classification, they employed a combination of MLP, SVM, and RF techniques. This approach yielded an accuracy of 94.00%, a sensitivity of 86.00%, and a specificity of 95.00%. Notably, their analysis identified MLP as the most effective model in their study.
Within the corpus of 20 studies dedicated to colorectal cancer detection, researchers have deployed a diverse array of preprocessing strategies, encompassing endocytoscopy, cropping, IPP, stain normalization, CEI, smoothing, normalization, filtering, THN, DRR, augmentation, UM-SN, resizing, BGR2RGB, scaling, labeling, RGBG, VTI, HOG, RGB to HSV, lightness space, edge preserving, and linear transformation. These methodologies collectively contributed to optimizing machine learning-based colorectal cancer detection. However, within these 20 studies, four chose to forgo any specific preprocessing technique, including the works of (Bora et al., 2021) [49], (Fan et al., 2021) [50], and (Lo et al., 2023) [51]. These studies nevertheless attained accuracies ranging from 94.00% to 99.44%. Such outcomes suggest that, when the dataset is already clean and well aligned with the demands of the classification task, preprocessing may have only a marginal influence on the classifier's performance.
In the comprehensive analysis of the research studies under scrutiny, it is noteworthy that only the works of (Grosu et al., 2021) [52] and (Ying et al., 2022) [46] registered accuracy figures falling below the 90% threshold, specifically at 84.7% and 76%, respectively. This observation underscores the intriguing possibility that traditional machine learning models can indeed yield highly accurate cancer detection performance, provided they are meticulously optimized.
Table 3. Performance comparison of traditional ML-based colorectal cancer prediction methods.
| Year | References | Pre-Processing | Features | Techniques | Dataset | Data Samples | Train / Test Data | Modality | Metrics (%) |
|---|---|---|---|---|---|---|---|---|---|
| 2017 | [53] | Endocytoscopy | Texture, nuclei | SVM | Private | 5843 | 5643 / 200 | ENI | Acc 94.1; Sen 89.4; Spe 98.9; Pre 98.8; NPV 90.1 |
| 2019 | [54] | IPP | CSQ, Color histogram | WSVMCS | Private | 180 | 108 / 72 | H&E | Acc 96.0 |
| 2019 | [55] | Cropping | Biophysical characteristic, WLD | NB, MLP | OMIS data | 316 | 237 / 79 | OMIS | Acc 92.6; Sen 96.3; Spe 88.9 |
| 2021 | [56] | Filtering | HOS, FOS, GLCM, Gabor, WPT, LBP | ANN, RSVM | KCRC-16 | 5000 | 4550 / 450 | H&E | Acc 95.3 |
| 2021 | [57] | IPP, Augmentation | VGG-16 | MLP | KCRC-16 | 5000 | 4825 / 175 | H&E | Acc 99.0; Sen 96.0; Spe 99.0; Pre 96.0; NPV 99.0; F1 96.0 |
| 2021 | [50] | --- | AlexNet | EC, SVM, AlexNet | LC25000 | 10,000 | 4-fold cross validation | H&E | Acc 99.4 |
| 2021 | [58] | THN, DRR | BmzP | NN | MALDI MSI | 559 | Leave-One-Out cross-validation | H&E | Acc 98.0; Sen 98.2; Spe 98.6 |
| 2021 | [52] | Filtering | Filters, Texture, GLHS, Shape | RF | Private | 287 | 169 / 77 | CT | Acc 84.7 *; Sen 82.0; Spe 85.0; AUC 91.0 |
| 2021 | [49] | --- | GFD, NSCT, Shape | MLP, LSSVM | Private | 734 | five-fold cross-validation | NBI, WLI | Acc 95.7; Sen 95.3; Spe 95.0; Pre 93.2; F1 90.5 |
| 2021 | [48] | Normalization, smoothing | Spatial Information | MLP, SVM, RF | Private | 54 | Leave-One-Out cross-validation | HSI | Acc 94.0; Sen 86.0; Spe 95.0 |
| 2022 | [59] | VTI | Haralick, VTF | RF | Private | 63 | cross-validation method | CT | Acc 92.2; Sen 88.4; Spe 96.0; AUC 96.2 |
| 2022 | [60] | RGBG | GLCM | ANN, RF, KNN | KCRC-16 | 5000 | 4500 / 500 | H&E | Acc 98.7; Sen 98.6; Spe 99.0; Pre 98.9 |
| 2022 | [45] | Resize, BGR2RGB, Normalization | Deep Features | EC, Hybrid, LR, LGB, MLP, RF, SVM, XGB, Voting | LC25000 | 2800 | 10-fold cross-validation | H&E | Acc 100.0 |
| 2022 | [46] | ROI | FOS, GLCM, GLDM, GLRLM, GLSZM, LoG, NGTDM, Shape, WT | MLR | Private | 276 | 194 / 82 | CECT | Acc 76.0; Sen 65.0; Spe 80.0; Pre 54.0; NPV 86.0 |
| 2022 | [61] | UM-SN | HIM, GLCM, Statistical | LDA, MLP, RF, SVM, XGB, LGB | LC25000 | 1000 | 900 / 100 | H&E | Acc 99.3; Sen 99.5; Pre 99.5; F1 99.5 |
| 2022 | [26] | --- | Color Spaces, Haralick | ANN, DT, KNN, QDA, SVM | KCRC-16 | 5000 | 3504 / 1496 | H&E | Acc 97.3; Sen 97.3; Spe 99.6; Pre 97.4 |
| 2023 | [62] | Filtering, linear Transformation, normalization | Color characteristic, DBCM, SMOTE | CatBoost, DT, GNB, KNN, RF | NCT-CRCHE-7K | 12,042 | 8429 / 3613 | H&E | Acc 90.7; Sen 97.6; Spe 97.4; Pre 90.6; Rec 90.5; F1 90.5 |
| 2023 | [51] | --- | Clinical, FEViT | SEKNN | Private | 1729 | tenfold cross-validation | ENI | Acc 94.0; Sen 74.0; Spe 98.0; AUC 93.0 |
| 2023 | [47] | Lightness space, RGB to HSV | dResNet | DSVM | KCRC-16 | 5000 | 4000 / 1000 | H&E | Acc 98.8 |
| | | | | | NCT-CRC-HE-100K | 100,000 | 80,003 / 19,997 | H&E | Acc 99.8 |
| 2023 | [63] | HOG, RGBG, Resizing | Morphological | SVM | Private | 540 | 420 / 120 | ENI | Acc 97.5 |
* Not given in the paper; calculated from the result table. Bold font signifies the best model in the ‘Techniques’ column. Abbreviations: BGR2RGB, Blue-Green-Red to Red-Green-Blue; BmzP, Binning of m/z Points; CatBoost, Categorical Boosting; CECT, Contrast-Enhanced CT; CSQ, Color Space Quantization; DBCM, Differential Box Count Method; dResNet, Dilated ResNet; DRR, Dynamic Range Reduction; DSVM, Deep Support Vector Machine; ENI, Endomicroscopy Images; FEViT, Feature Ensemble Vision Transformer; FOS, First-Order Statistics; GFD, Generic Fourier Descriptor; GLDM, Gray-Level Dependence Matrix; GLHS, Gray Level Histogram Statistics; GLSZM, Gray Level Size Zone Matrix; GNB, Gaussian Naive Bayes; HIM, Hu Invariants Moments; HOG, Histogram of Oriented Gradients; HOS, Higher-Order Statistic; HSI, Hyperspectral Imaging; HSV, Hue-Saturation-Value; LBP, Local Binary Pattern; LDA, Linear Discriminant Analysis; LGB, Light Gradient Boosting; LoG, Laplacian of Gaussian; LSSVM, Least Square Support Vector Machine; MLR, Multivariate Logistic Regression; NGTDM, Neighboring Gray Tone Difference Matrix; NSCT, Non-Subsampled Contourlet Transform; OMIS, Optomagnetic Imaging Spectroscopy; QDA, Quadratic Discriminant Analysis; SEKNN, Subspace Ensemble K-Nearest Neighbor; THN, TopHat and Normalization; UM-SN, Unsharp Masking and Stain Normalization; VTF, Vector Texture Features; VTI, Vector Texture Images; WLD, Wavelength Difference; WLI, White Light Imaging; WPT, Wavelet Packet Transform; WSVMCS, Wavelet Kernel SVM with Color Histogram; XGB, Extreme Gradient Boosting.
The analysis of colorectal cancer detection using traditional machine learning techniques reveals a wide spread in model performance across the key metrics, as shown in Figure 10. The best model achieved an accuracy of 100.0% and the least effective model 76.0%, a gap of 24.0 percentage points. For sensitivity, the top-performing model reached 99.5% and the lowest-performing model 65.0%, a gap of 34.5 percentage points. For specificity, the best model attained 99.6% and the worst 80.0%, a gap of 19.6 percentage points. For precision, the best model reached 99.5% and the worst only 54.0%, a gap of 45.5 percentage points. For the F1-score, the best model achieved 99.5% and the least proficient 63.2%, a gap of 36.3 percentage points. Lastly, for the area under the curve (AUC), the top model scored 96.2% and the bottom model 76.0%, a gap of 20.2 percentage points. These differences underscore the role that the choice of machine learning technique and feature set plays in the effectiveness of colorectal cancer detection. Effective cancer detection influences not only patient outcomes but also the operational efficiency of healthcare systems and the allocation of medical resources.
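The gaps quoted above are simple differences between the highest and lowest reported values, expressed in percentage points. As a minimal sketch, the arithmetic can be reproduced as follows; the dictionary and function names are illustrative only, and the numbers are the ones stated in the paragraph above.

```python
# (highest, lowest) reported values in %, as quoted in the paragraph above
reported = {
    "accuracy": (100.0, 76.0),
    "sensitivity": (99.5, 65.0),
    "specificity": (99.6, 80.0),
    "precision": (99.5, 54.0),
    "f1_score": (99.5, 63.2),
    "auc": (96.2, 76.0),
}


def gaps_in_points(values):
    """Return the best-minus-worst gap per metric, in percentage points."""
    return {name: round(best - worst, 1) for name, (best, worst) in values.items()}


gaps = gaps_in_points(reported)

# List the metrics from the widest to the narrowest gap
for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:12s} {gap:5.1f} percentage points")
```

Sorting by gap makes it easy to see that precision shows the widest spread among the reported metrics.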

3.2. Analysis of Gastric Cancer Prediction

Table 4 summarizes 16 studies, published between 2018 and 2023, on machine learning-based gastric cancer detection. Together, these studies highlight the role of preprocessing in improving the accuracy of stomach cancer detection models. The highest reported accuracy was 100.0% and the lowest 71.2%. The preprocessing techniques used include resizing, filtering, cropping, and color enhancement. These were combined with segmentation, feature extraction, and machine learning algorithms such as SVM, MLP, RF, and KNN, across modalities including endoscopy, CT, MRI, and histopathology images. The number of images, patients, or slices per study ranged from 30 to 245,196. Private datasets were the most frequently used data source in this analysis.
(Ayyaz et al., 2022) [64] reported outstanding results in stomach cancer detection, with an accuracy of 99.80%. Their preprocessing comprised resizing, contrast enhancement, binarization, and filtering; the segmentation method was not specified. Features were extracted with the deep learning models VGG19 and AlexNet, and several classifiers were compared, including DT, NB, KNN, and SVM. The cubic SVM performed best, achieving an accuracy of 99.80% together with high sensitivity, precision, and F1-score, and an AUC of 100.0%.
In contrast, (Mirniaharikandehei et al., 2021) [65] reported comparatively lower performance, with an accuracy of 71.20%. Their preprocessing involved filtering and ROI selection, and they used the HTS segmentation method. Radiomics features such as GLRLM, GLDM, and WT LoG were extracted, and classification was carried out with SVM, LR, RF, DT, and GBM. The worst-performing model in their analysis was GBM, with an accuracy of 71.20%. This model had lower sensitivity but a higher specificity, precision, and F1-score.
(Hu et al., 2022) [66] worked with a large dataset of 245,196 medical images. Their preprocessing included ROI selection, cropping, filtering, rotation, and disruption, and the extracted features were color histograms, LBP, and GLCM. They applied RF and LSVM classifiers and achieved an accuracy of 85.99%, with RF as the best-performing model. At the other extreme of dataset size, (Naser and Zeki 2021) [67] used only 30 medical images. They applied DIFQ-based preprocessing and FCM for classification, achieving an accuracy of 85.00%.
Three of the 16 studies in Table 4, namely (Korkmaz and Esmeray 2018) [68], (Nayyar et al., 2021) [69], and (Hu et al., 2022a) [70], did not employ any preprocessing technique, yet achieved accuracies of 87.77%, 99.8%, and 85.24%, respectively. This demonstrates that effective stomach cancer detection is possible even without preprocessing. Most of the studies in the table, however, did apply preprocessing, including CEI, filtering, resizing, Fourier transform, cropping, ROI selection, rotation, disruption, binarization, augmentation, and RSA. The prevalence of these steps reflects the role preprocessing plays in improving the performance of machine learning models for stomach cancer detection.
Of the 16 studies on gastric cancer detection, 50% (8 studies) achieved an accuracy above 90%, while the other 50% reported accuracies below 90%. This discrepancy might be attributed to the use of private datasets. Private datasets may not undergo the same level of processing or standardization as publicly available datasets, which can lead to variations in data quality and affect the performance of the machine learning models.
Table 4. Performance comparison of traditional ML-based gastric cancer prediction methods.
Year | Reference | Preprocessing | Features | Techniques | Dataset | Data Samples | Train Data / Test Data | Modality | Metrics (%)
2018 | [71] | Fourier transform | BRISK, SURF, MSER | DT, DA | Private | 180 | 90 / 90 | H&E | Acc 86.7
2018 | [72] | Resizing | LBP, HOG | ANN, RF | Private | 180 | 90 / 90 | H&E | Acc 100.0
2018 | [68] | --- | SURF, DFT | NB | Private | 180 | 90 / 90 | H&E | Acc 87.8
2018 | [68] | --- | SURF, DFT | NB | Private | 720 | 360 / 360 | H&E | Acc 90.3
2018 | [73] | CEI, filtering, resizing | GLCM | SVM | Private | 207 | 126 / 81 | NBI | Acc 96.3; Sen 96.7; Spe 95.0; Pre 98.3
2019 | [74] | Resizing, cropping | GLCM, Shape, FOF, GLSZM | SVM | Private | 490 | 326 / 164 | CT | Acc 71.3; Sen 72.6; Spe 68.1; Pre 82.0; NPV 50.0
2021 | [67] | DIFQ | SMI | FCM, KMC | Private | 30 | --- / --- | MRI | Acc 85.0
2021 | [75] | Resizing | Extract HOG | RF, MLP | Private | 180 | 90 / 90 | H&E | Acc 98.1
2021 | [76] | Resizing | TSS | BP, BPSVM, SVM | Private | 78 | --- / --- | MRI | Acc 94.6
2021 | [69] | --- | Deep features | CSVM, Bagged Trees, KNNs, SVMs | Private | 4000 | 2800 / 1200 | WCE | Acc 99.8; Sen 99.0; Pre 99.3; F1 99.1; AUC 100
2021 | [65] | Filtering, ROI | LoG, WT, GLDM, GLRLM | GBM, DT, RF, LR, SVM | Private | 159 | Leave-one-out cross-validation | CT | Acc 71.2; Sen 43.1; Spe 87.1; Pre 65.8
2022 | [77] | Augmentation, resizing, filtering | InceptionNet, VGGNet | SVM, RF, KNN | HKD | 10,662 (47,398 augmented) | 37,788 / 9610 | Endoscopy | Acc 98.0; Sen 100; Pre 100; F1 100; MCC 97.8
2022 | [70] | --- | GLCM, LBP, HOG, histogram, luminance, Color histogram | NSVM, LSVM, LR, NB, RF, ANN, KNN | GasHisSDB | 245,196 | 196,157 / 49,039 | H&E | Acc 85.2; Sen 84.9 #; Pre 84.6 #; Spe 84.9 #; F1 84.8 #
2022 | [64] | Binarization, CEI, filtering, resizing | VGG19, AlexNet | Bagged Tree, Coarse Tree, CSVM, CKNN, DT, Fine Tree, KNN, NB | Private | 2590 | 10-fold cross-validation | EUS | Acc 99.8; Sen 99.8; Pre 99.8; F1 99.8; AUC 100
2022 | [66] | Cropping, disruption, filtering, ROI, Rotation | Color histogram, GLCM, LBP | LSVM, RF | GasHisSDB | 245,196 | 196,157 / 49,039 | H&E | Acc 85.9; Sen 86.2 #; Spe 86.2 #; Pre 85.7 #; F1 85.9 #
2023 | [78] | Augmentation, CEI | MobileNet-V2 | Bayesian, CSVM, LSVM, QSVM, Softmax | KV2D | 4854 | 10-fold cross-validation | Endoscopy | Acc 96.4; Pre 97.6; Sen 93.0; F1 95.2
2023 | [79] | RSA | RSF | PLS-DA, LOO, SVM | Private | 450 | Leave-one-out cross-validation | H&E | Acc 94.8; Sen 91.0; Spe 100; AUC 95.8
# Calculated by averaging the normal and abnormal class, Bold Font techniques represent the best model. Abbreviations: BPSVM, Binary Robust Invariant Scalable Keypoints; BRISK, Binary Robust Invariant Scalable Keypoints; CKNN, Cosine K-Nearest Neighbor; CSVM, Cubic SVM; DA, Discriminant Analysis; DIFQ, Dividing an image into four quarters; FCM, Fuzzy C-Means; GGF, Global Graph Features; HOG, Histogram of Oriented Gradients; HTSS, Hybrid Tumor Segmentation; KMC, K-Means Clustering; LOO, Leave-One-Out; LSVM, Linear Support Vector Machine; MSER, Maximally Stable Extremal Regions; NSVM, Non-Linear Support Vector Machine; OAT, Otsu Adaptive Thresholding; PLS-DA, Partial Least-Squares Discriminant Analysis; QSVM, Quadratic SVM; RSA, Raman Spectral Analysis; RSF, Raman Spectral Feature; SM, Seven Moments Invariants; SMI, Seven Moments Invariants; SURF, Speeded Up Robust Features; TSS, Tumor Scattered Signal.
The analysis of gastric cancer detection reveals substantial variation in model performance across key metrics, as shown in Figure 11. Accuracy (Acc) ranged from 100.00% for the best-performing model to 71.20% for the least effective one, a gap of 28.80 percentage points that underscores the role of model selection in accurate gastric cancer detection. Sensitivity (Sen) ranged from 100.00% to 43.10%, a gap of 56.90 percentage points, which emphasizes the need for detection techniques that do not miss cancer cases. Specificity (Spe) ranged from 100.00% to 68.10%, a gap of 31.90 percentage points that highlights the importance of correctly identifying non-cancer cases. Precision (Pre) ranged from 100.00% to 65.80%, a gap of 34.20 percentage points. The negative predictive value (NPV) was reported in only a single article, at 50.00%, so its highest and lowest values coincide and no comparison between models is possible; its impact on the overall analysis is therefore limited.
The F1-score ranged from 100.00% for the top model to 84.80% for the lowest, a gap of 15.20 percentage points that reflects the balance between precision and sensitivity. Lastly, for the area under the curve (AUC), the best model achieved 100.00% and the lowest 95.80%. This modest gap of 4.20 percentage points indicates that both models distinguished well between gastric cancer and non-cancer cases. However, AUC was reported in only three articles, so conclusions drawn from it generalize less well. These findings underscore the role of model choice and feature selection in the effective detection of gastric cancer. Accurate and sensitive diagnostic tools are crucial for improving patient outcomes and optimizing healthcare resources. While NPV and AUC carry limited weight here because so few studies reported them, the other metrics highlight the importance of selecting appropriate models for reliable gastric cancer detection.
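To make the relationship between these metrics concrete, the following minimal sketch derives them from the four confusion-matrix counts. The counts are hypothetical and are not taken from any reviewed study; they were chosen only to show how an imbalanced test set can combine reasonable accuracy with low sensitivity. The function name `binary_metrics` is an illustrative assumption.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return diagnostic metrics (as fractions) from confusion-matrix counts.

    Assumes every denominator is non-zero, i.e. each class occurs and each
    class is predicted at least once.
    """
    sen = tp / (tp + fn)              # sensitivity (recall)
    spe = tn / (tn + fp)              # specificity
    pre = tp / (tp + fp)              # precision (positive predictive value)
    npv = tn / (tn + fn)              # negative predictive value
    acc = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * pre * sen / (pre + sen)  # harmonic mean of precision and sensitivity
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"accuracy": acc, "sensitivity": sen, "specificity": spe,
            "precision": pre, "npv": npv, "f1": f1, "mcc": mcc}


# Hypothetical imbalanced test set: 20 cancer cases and 80 non-cancer cases
m = binary_metrics(tp=9, fp=5, tn=75, fn=11)
for name, value in m.items():
    print(f"{name:12s} {100 * value:6.2f}%")
```

In this illustration the accuracy is 84% while the sensitivity is only 45%, because the majority (non-cancer) class dominates the accuracy figure. This is one reason to report sensitivity, specificity, and F1-score alongside accuracy.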

4. Proposed Methodology

In this section, we describe our proposed methodology for detecting colorectal and gastric cancer with traditional machine learning techniques. The approaches are based on the observations drawn from the review tables. Our primary goal is to introduce a proposed (optimized) approach, together with the most suitable parameters, in order to attain the best possible results. We aim to provide an efficient, effective, automated, and highly precise technique for the detection of colorectal and gastric cancer.

4.1. Detection of Colorectal Cancer

Figure 12 shows the architecture of our proposed model for the detection of colorectal cancer. The design draws on Table 3, which indicates the methodologies that have proven effective in this domain. Although we use the H&E modality as an illustrative example, the model can accommodate other modalities, which allows diverse data sources to be incorporated into the analysis. The methodology begins with a preprocessing phase consisting of four steps: Image Enhancement, Pixel Enhancement, RGB-to-Gray Conversion, and Image Segmentation. These sequential operations prepare the input images so that they are in a suitable form for efficient feature extraction and subsequent analysis.
For feature engineering, instead of relying on a single type of feature, we merge two distinct categories: deep learning-based features, often referred to as “deep features”, and a varied set of handcrafted features. The latter includes Discrete Wavelet Transform (DWT), Gray Level Co-occurrence Matrix (GLCM), Local Binary Pattern (LBP), Texture, and Gray Level Size Zone Matrix (GLSZM). This fusion is a deliberate choice: it lets the model capture both the high-level representations obtained through deep learning and handcrafted features tailored to specific tumor characteristics. With both feature types, the model can identify patterns in the data that may not be discernible from either type alone. We expect this to improve the model’s ability to interpret the complex information contained in medical images, which in turn contributes to the accuracy and efficiency of colorectal cancer detection and helps the model adapt to different scenarios and datasets.
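As a hedged illustration of the handcrafted side of this fusion, the sketch below computes a gray-level co-occurrence matrix (GLCM) for a single horizontal offset on a tiny 4-level patch, derives three common texture descriptors (contrast, angular second moment, homogeneity), and concatenates them with a placeholder deep-feature vector. The patch, the function names, and the `deep` values are assumptions made for illustration; in practice the deep features would come from a pretrained network and the GLCM would be computed over several offsets and angles.

```python
def glcm(image, levels, dx=1, dy=0):
    """Normalized gray-level co-occurrence matrix for one pixel offset (dx, dy)."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
    total = sum(map(sum, counts))
    return [[v / total for v in row] for row in counts]


def glcm_features(p):
    """Contrast, angular second moment, and homogeneity of a normalized GLCM."""
    n = len(p)
    cells = [(i, j, p[i][j]) for i in range(n) for j in range(n)]
    contrast = sum(v * (i - j) ** 2 for i, j, v in cells)
    asm = sum(v * v for _, _, v in cells)
    homogeneity = sum(v / (1 + (i - j) ** 2) for i, j, v in cells)
    return [contrast, asm, homogeneity]


# Tiny illustrative grayscale patch with gray levels 0..3
patch = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]

handcrafted = glcm_features(glcm(patch, levels=4))
deep = [0.12, 0.80, 0.05]   # placeholder values standing in for CNN features
fused = deep + handcrafted  # feature fusion by simple concatenation
print(fused)
```

Concatenation is the simplest fusion strategy; it keeps both feature families intact and leaves the decision about which attributes matter to the subsequent selection step.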
The next stages of the workflow are feature selection and optimization. This process serves a dual role: it reduces feature redundancy and improves overall model performance by focusing on the most distinctive attributes. Model evaluation rests on a data-partitioning strategy that splits the dataset into training and testing subsets. The training subset is further subjected to k-fold cross-validation, which strengthens training and supports a robust performance assessment. This helps to guard against overfitting and to assess how well the model adapts to different data. The test subset is then used to predict colorectal cancer, with the cubic support vector machine (SVM) performing the classification. The SVM is a well-established traditional machine learning classifier, known for handling high-dimensional data and binary classification tasks, which makes it well suited to cancer detection. In summary, the proposed architecture integrates image preprocessing, fused deep and handcrafted features, and a traditional machine learning classifier into a framework intended to be efficient and accurate for colorectal cancer detection. Pending further validation and testing on diverse datasets, this approach has the potential to improve early cancer detection and diagnosis, and thereby patient outcomes.
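The partitioning scheme described above can be sketched with the standard library alone. The helper names, the 80/20 split, and k = 5 are illustrative assumptions rather than parameters stated in this article. The last function shows the degree-3 polynomial kernel that a "cubic SVM" is commonly understood to use; it is the kernel only, not a trained classifier.

```python
import random


def train_test_split(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and hold out a test subset."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]


def k_fold(indices, k=5):
    """Yield (train, validation) index lists; each index is validated exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


def cubic_kernel(x, z, gamma=1.0, coef0=1.0):
    """Degree-3 polynomial kernel: (gamma * <x, z> + coef0) ** 3."""
    return (gamma * sum(a * b for a, b in zip(x, z)) + coef0) ** 3


# Hold out 20% of 100 hypothetical samples, then cross-validate on the rest
train_idx, test_idx = train_test_split(100, test_fraction=0.2)
folds = list(k_fold(train_idx, k=5))
for fold_no, (tr, va) in enumerate(folds, start=1):
    print(f"fold {fold_no}: {len(tr)} training samples, {len(va)} validation samples")
```

Keeping the test indices out of the cross-validation loop means that feature selection and hyperparameter tuning never see the held-out samples, which is what makes the final test estimate trustworthy.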

4.2. Detection of Gastric Cancer

The system architecture flow diagram in Figure 13 outlines our adaptable approach to gastric (stomach) cancer detection with traditional machine learning classifiers. Informed by the top-performing models in Table 4, the architecture is designed to accommodate both endoscopy video datasets, which have gained prominence in recent years, and static image datasets. Starting from endoscopy video as the primary data source, the architecture extends to image datasets by extracting individual frames from the video sequences. The extracted frames are then preprocessed with techniques such as noise reduction, RGB-to-grayscale conversion, or other methods chosen according to the application and the dataset. Because video datasets may be limited in size, we include data augmentation, which generates additional image samples so that the model can be trained on a more diverse and representative set. This helps the model to generalize better. In the feature extraction phase, we advocate the simultaneous use of deep features and texture-based features. Deep features are obtained from state-of-the-art deep learning models, while texture-based features such as GLCM, GLRLM, and GLSZM are obtained with conventional feature extraction methods. Fusing these feature types allows the model to capture both abstract high-level representations and the specific characteristics of the stomach cancer data.
After the features are fused, the next step is feature optimization, in which suitable algorithms select the most pertinent attributes. This serves a dual function: it mitigates the risk of overfitting, and it improves the overall efficiency of the model. A well-chosen feature subset improves the model’s capacity to discriminate between classes and hence its classification accuracy. The dataset is then partitioned into a training set and a testing set, in a way that prevents data leakage and provides a reliable basis for assessing the traditional machine learning classifiers. Depending on the classification task and the requirements of the application, we employ a range of classifiers, including but not limited to support vector machines (SVM), Random Forest (RF), logistic regression (LR), backpropagation neural networks (BPNN), and artificial neural networks (ANN). Each classifier is chosen to suit the characteristics of the dataset and the task at hand. These classifiers categorize stomach cancer into distinct types, providing information that supports accurate diagnosis and tailored treatment.
A notable feature of the proposed architecture is its adaptability: it accommodates both image and video datasets, which makes it suitable for a wide range of applications. By combining traditional machine learning methods with feature fusion and optimization, the architecture has substantial potential to improve the efficiency and accuracy of stomach cancer detection. Nonetheless, further validation and in-depth evaluation of the system’s performance remain essential.
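Two of the preprocessing steps described in this subsection, RGB-to-grayscale conversion and simple geometric augmentation, can be sketched without external libraries. The 2 × 2 "frame" below is a toy stand-in for a frame extracted from an endoscopy video; the function names are assumptions, and the grayscale weights are the widely used ITU-R BT.601 luma coefficients, which this article does not itself prescribe.

```python
def rgb_to_gray(image):
    """Convert rows of (R, G, B) tuples to gray levels with BT.601 luma weights."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
        for row in image
    ]


def augment(gray):
    """Return the original image plus three simple geometric variants."""
    h_flip = [row[::-1] for row in gray]             # mirror left-right
    v_flip = [row[:] for row in gray[::-1]]          # mirror top-bottom
    rot90 = [list(col) for col in zip(*gray[::-1])]  # rotate 90 degrees clockwise
    return [gray, h_flip, v_flip, rot90]


# Toy 2 x 2 RGB frame standing in for one extracted video frame
frame = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]

gray = rgb_to_gray(frame)
variants = augment(gray)
print(gray)
print(len(variants), "images after augmentation")
```

Flips and right-angle rotations are label-preserving for most endoscopic and histopathological images, which is why they are a common first choice when a dataset is small; augmentation should be applied only to the training subset so that augmented copies of a test image never leak into training.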

4.3. Key Observations

The comprehensive assessment of colorectal and gastric cancer detection techniques using traditional machine learning methods and medical image datasets has revealed several key insights:
  • Dataset Diversity: The evaluation covers colorectal and gastric cancer datasets ranging from 30 to 245,196 samples. The varied dataset sizes show that machine learning classifiers can be effective at different scales with appropriate tuning.
  • Exceptional Model Performances: Models achieve 100% accuracy for both colorectal and gastric cancer, and perfect scores in sensitivity, specificity, precision, and F1-score are reported for gastric cancer, showcasing the potential of traditional ML classifiers with optimal parameters.
  • Preprocessing Techniques: Researchers employ various preprocessing techniques, including image filtering, denoising, wavelet transforms, RGB-to-gray conversion, normalization, cropping (ROI), sampling, and binarization, to optimize model performance and minimize biases during data manipulation.
  • Literature Review Significance: This analysis spans 36 literature sources related to colorectal and gastric cancer, underscoring the significant interest in cancer detection through traditional ML classifiers. Researchers have explored an extensive range of cancer types, diverse evaluation metrics, and datasets, collectively advancing the field.
  • Dominant Traditional ML Techniques: SVM is a commonly used traditional ML classifier in cancer detection tasks, emphasizing the need to understand each classifier’s strengths and limitations for optimal selection.
  • Insightful Dataset and Feature Analysis: Reviewed studies predominantly utilized benchmark medical image datasets, with researchers employing feature extraction techniques like GLCM for informative feature extraction in cancer detection.
  • Prudent Model Architecture Design: Optimal results in cancer detection require thoughtful and optimized model architectures, which can enhance accuracy, generalizability, and interpretability, addressing challenges in medical image analysis.

4.4. Key Challenges and Future Scope

Traditional ML classifiers have shown remarkable potential in cancer detection. However, several challenges and the future scope in their application have been identified:
  • Variability in Accuracy: Traditional ML classifiers exhibit variable accuracy across the reviewed studies, ranging from 71.2% to 100%. Reducing this variability is a challenge that calls for improved models. Future research should prioritize refining models for consistent and accurate performance across diverse cancer types.
  • Metric Disparities: Metric variations, especially in sensitivity (43.1% to 100%) for gastric cancer, suggest potential data imbalance challenges. Addressing these issues is crucial for accurate model assessments. Future research should focus on developing strategies to handle imbalanced data and improve model robustness.
  • Preprocessing Challenges: Balancing raw and preprocessed data is crucial to ensure input data quality and reliability, contributing to robust cancer detection model performance. Future research should explore advanced preprocessing techniques and optimization methods to further enhance model robustness.
  • Limited use of evaluation metrics: Limited use of metrics like NPV, AUC, and MCC in the reviewed literature highlights the challenge of comprehensive model assessment. Addressing this limitation and exploring a broader range of metrics is crucial for future research to enhance understanding and effectiveness in cancer detection tasks.
  • Generalizing to novel cancer types: The literature primarily focuses on colorectal and gastric cancers, posing a challenge for extending traditional ML classifiers to less-explored cancer types. Future research should aim to develop versatile ML models with robust feature extraction techniques to adapt to diverse cancer types and domains.
  • Addressing overfitting and model selection: The diversity in ML classifiers poses challenges in model selection for specific cancers, emphasizing the need for careful evaluation to avoid overfitting. Future research should focus on refining model selection strategies to enhance the robustness of cancer detection techniques and improve diagnostic accuracy.
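One widely used way to address the data imbalance mentioned under Metric Disparities is to weight each class inversely to its frequency when training a classifier. The sketch below uses the common heuristic weight = n_samples / (n_classes × class_count); the label distribution is hypothetical, and this is offered as one possible strategy rather than a method used in the reviewed studies.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency in the label list."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical label distribution: 90 non-cancer (0) and 10 cancer (1) samples
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)
```

With these weights, a misclassified minority-class sample costs nine times as much as a misclassified majority-class sample in this example, which pushes a cost-sensitive classifier toward higher sensitivity.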

5. Conclusions

This manuscript presents a thorough review and analysis of colorectal and gastric cancer detection using traditional machine learning techniques. We examined 36 research papers published between 2017 and 2023 that use medical imaging datasets to detect these cancers. Mathematical formulations are provided for frequently employed preprocessing techniques, feature extraction methods, traditional machine learning classifiers, and assessment metrics. These formulations offer guidance to researchers when selecting suitable techniques for their cancer detection studies. The analysis considered publication year, preprocessing methods, dataset particulars, image quantities, modality, techniques, best models, and metrics (%). A wide range of metrics was used to evaluate model performance, and the study examines the highest and lowest values of each metric and the gaps between them, highlighting opportunities for improvement. The highest reported accuracy reached 100% for both cancer types, whereas the lowest sensitivity, 43.10%, was recorded for gastric cancer detection. This underscores the potential of traditional ML classifiers while indicating areas for further refinement. Drawing on these insights, we present a proposed (optimized) methodology for both colorectal and gastric cancer detection to aid the selection of an approach in future cancer detection research. The manuscript concludes by outlining key findings and challenges that offer directions for future research.
In our future research endeavors, we plan to implement the proposed optimized methodology for the detection of colorectal and gastric cancer within the specified experimental framework. This proactive approach aligns with our commitment to enhancing the effectiveness of cancer detection methodologies. Furthermore, we will conscientiously incorporate and address the challenges and limitations identified in this study, ensuring a comprehensive and iterative improvement in our investigative efforts.

Author Contributions

Original Draft Preparation: H.M.R.; Review and Editing: H.M.R.; Visualization: H.M.R.; Supervision: J.Y.; Project Administration: J.Y.; Funding Acquisition: J.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Research Foundation of Korea (NRF) Grant funded by the Korea government (MSIT) (NRF-2021R1F1A1063640).

Data Availability Statement

Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Faguet, G.B. A brief history of cancer: Age-old milestones underlying our current knowledge database. Int. J. Cancer 2014, 136, 2022–2036.
  2. Afrash, M.R.; Shafiee, M.; Kazemi-Arpanahi, H. Establishing machine learning models to predict the early risk of gastric cancer based on lifestyle factors. BMC Gastroenterol. 2023, 23, 6.
  3. Kumar, Y.; Gupta, S.; Singla, R.; Hu, Y.-C. A systematic review of artificial intelligence techniques in cancer prediction and diagnosis. Arch. Comput. Methods Eng. 2021, 29, 2043–2070.
  4. Nguon, L.S.; Seo, K.; Lim, J.-H.; Song, T.-J.; Cho, S.-H.; Park, J.-S.; Park, S. Deep learning-based differentiation between mucinous cystic neoplasm and serous cystic neoplasm in the pancreas using endoscopic ultrasonography. Diagnostics 2021, 11, 1052.
  5. Kim, S.H.; Hong, S.J. Current status of image-enhanced endoscopy for early identification of esophageal neoplasms. Clin. Endosc. 2021, 54, 464–476.
  6. NCI. What Is Cancer?—NCI. National Cancer Institute. Available online: https://www.cancer.gov/about-cancer/understanding/what-is-cancer (accessed on 9 June 2023).
  7. Zhi, J.; Sun, J.; Wang, Z.; Ding, W. Support vector machine classifier for prediction of the metastasis of colorectal cancer. Int. J. Mol. Med. 2018, 41, 1419–1426.
  8. Zhou, H.; Dong, D.; Chen, B.; Fang, M.; Cheng, Y.; Gan, Y.; Zhang, R.; Zhang, L.; Zang, Y.; Liu, Z.; et al. Diagnosis of Distant Metastasis of Lung Cancer: Based on Clinical and Radiomic Features. Transl. Oncol. 2017, 11, 31–36.
  9. Levine, A.B.; Schlosser, C.; Grewal, J.; Coope, R.; Jones, S.J.; Yip, S. Rise of the Machines: Advances in Deep Learning for Cancer Diagnosis. Trends Cancer 2019, 5, 157–169.
  10. Huang, S.; Yang, J.; Fong, S.; Zhao, Q. Artificial intelligence in cancer diagnosis and prognosis: Opportunities and challenges. Cancer Lett. 2019, 471, 61–71.
  11. Saba, T. Recent advancement in cancer detection using machine learning: Systematic survey of decades, comparisons and challenges. J. Infect. Public Health 2020, 13, 1274–1289.
  12. Shah, B.; Alsadoon, A.; Prasad, P.; Al-Naymat, G.; Beg, A. DPV: A taxonomy for utilizing deep learning as a prediction technique for various types of cancers detection. Multimed. Tools Appl. 2021, 80, 21339–21361.
  13. Majumder, A.; Sen, D. Artificial intelligence in cancer diagnostics and therapy: Current perspectives. Indian J. Cancer 2021, 58, 481–492.
  14. Bin Tufail, A.; Ma, Y.-K.; Kaabar, M.K.A.; Martínez, F.; Junejo, A.R.; Ullah, I.; Khan, R. Deep Learning in Cancer Diagnosis and Prognosis Prediction: A Minireview on Challenges, Recent Trends, and Future Directions. Comput. Math. Methods Med. 2021, 2021, 9025470.
  15. Kumar, G.; Alqahtani, H. Deep Learning-Based Cancer Detection-Recent Developments, Trend and Challenges. Comput. Model. Eng. Sci. 2022, 130, 1271–1307.
  16. Painuli, D.; Bhardwaj, S.; Köse, U. Recent advancement in cancer diagnosis using machine learning and deep learning techniques: A comprehensive review. Comput. Biol. Med. 2022, 146, 105580.
  17. Rai, H.M. Cancer detection and segmentation using machine learning and deep learning techniques: A review. Multimed. Tools Appl. 2023, 1–35.
  18. Maurya, S.; Tiwari, S.; Mothukuri, M.C.; Tangeda, C.M.; Nandigam, R.N.S.; Addagiri, D.C. A review on recent developments in cancer detection using Machine Learning and Deep Learning models. Biomed. Signal Process. Control. 2023, 80, 104398.
  19. Mokoatle, M.; Marivate, V.; Mapiye, D.; Bornman, R.; Hayes, V.M. A review and comparative study of cancer detection using machine learning: SBERT and SimCSE application. BMC Bioinform. 2023, 24, 112.
  20. Rai, H.M.; Yoo, J. A comprehensive analysis of recent advancements in cancer detection using machine learning and deep learning models for improved diagnostics. J. Cancer Res. Clin. Oncol. 2023, 149, 14365–14408.
  21. Ullah, A.; Chen, W.; Khan, M.A. A new variational approach for restoring images with multiplicative noise. Comput. Math. Appl. 2016, 71, 2034–2050.
  22. Azmi, K.Z.M.; Ghani, A.S.A.; Yusof, Z.M.; Ibrahim, Z. Natural-based underwater image color enhancement through fusion of swarm-intelligence algorithm. Appl. Soft Comput. 2019, 85, 105810.
  23. Alruwaili, M.; Gupta, L. A statistical adaptive algorithm for dust image enhancement and restoration. In Proceedings of the 2015 IEEE International Conference on Electro/Information Technology (EIT), Dekalb, IL, USA, 21–23 May 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 286–289.
  24. Cai, J.-H.; He, Y.; Zhong, X.-L.; Lei, H.; Wang, F.; Luo, G.-H.; Zhao, H.; Liu, J.-C. Magnetic Resonance Texture Analysis in Alzheimer’s disease. Acad. Radiol. 2020, 27, 1774–1783.
  25. Chandrasekhara, S.P.R.; Kabadi, M.G.; Srivinay, S. Wearable IoT based diagnosis of prostate cancer using GLCM-multiclass SVM and SIFT-multiclass SVM feature extraction strategies. Int. J. Pervasive Comput. Commun. 2021. ahead-of-print.
  26. Alqudah, A.M.; Alqudah, A. Improving machine learning recognition of colorectal cancer using 3D GLCM applied to different color spaces. Multimed. Tools Appl. 2022, 81, 10839–10860.
  27. Vallabhaneni, R.B.; Rajesh, V. Brain tumour detection using mean shift clustering and GLCM features with edge adaptive total variation denoising technique. Alex. Eng. J. 2018, 57, 2387–2392.
  28. Rego, C.H.Q.; França-Silva, F.; Gomes-Junior, F.G.; de Moraes, M.H.D.; de Medeiros, A.D.; da Silva, C.B. Using Multispectral Imaging for Detecting Seed-Borne Fungi in Cowpea. Agriculture 2020, 10, 361.
  29. Cover, T.; Hart, P. Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 1967, 13, 21–27.
  30. Callen, J.L.; Segal, D. An Analytical and Empirical Measure of the Degree of Conditional Conservatism. J. Account. Audit. Financ. 2013, 28, 215–242.
  31. Weinberger, K. Lecture 2: K-Nearest Neighbors. Available online: https://www.cs.cornell.edu/courses/cs4780/2017sp/lectures/lecturenote02_kNN.html (accessed on 12 November 2023).
  32. Weinberger, K. Lecture 3: The Perceptron. Available online: https://www.cs.cornell.edu/courses/cs4780/2017sp/lectures/lecturenote03.html (accessed on 12 November 2023).
  33. Watt, J.; Borhani, R.; Katsaggelos, A.K. Machine Learning Refined; Cambridge University Press (CUP): Cambridge, UK, 2020; ISBN 9781107123526.
  34. Watt, R.B.J. 13.1 Multi-Layer Perceptrons (MLPs). Available online: https://kenndanielso.github.io/mlrefined/blog_posts/13_Multilayer_perceptrons/13_1_Multi_layer_perceptrons.html (accessed on 12 November 2023).
  35. Weinberger, K. Lecture 9: SVM. Available online: https://www.cs.cornell.edu/courses/cs4780/2017sp/lectures/lecturenote09.html (accessed on 13 November 2023).
  36. Balas, V.E.; Mastorakis, N.E.; Popescu, M.-C.; Balas, V.E. Multilayer Perceptron and Neural Networks. 2009. Available online: https://www.researchgate.net/publication/228340819 (accessed on 18 September 2023).
  37. Murphy, K.P. Machine Learning: A Probabilistic Perspective; MIT Press: Cambridge, MA, USA, 2012.
  38. Islam, U.; Al-Atawi, A.; Alwageed, H.S.; Ahsan, M.; Awwad, F.A.; Abonazel, M.R. Real-Time Detection Schemes for Memory DoS (M-DoS) Attacks on Cloud Computing Applications. IEEE Access 2023, 11, 74641–74656.
  39. Houshmand, M.; Hosseini-Khayat, S.; Wilde, M.M. Minimal-Memory, Noncatastrophic, Polynomial-Depth Quantum Convolutional Encoders. IEEE Trans. Inf. Theory 2012, 59, 1198–1210.
  40. Bagging. Available online: https://www.cs.cornell.edu/courses/cs4780/2017sp/lectures/lecturenote18.html (accessed on 13 November 2023).
  41. Boosting. Available online: https://www.cs.cornell.edu/courses/cs4780/2017sp/lectures/lecturenote19.html (accessed on 13 November 2023).
  42. Dewangan, S.; Rao, R.S.; Mishra, A.; Gupta, M. Code Smell Detection Using Ensemble Machine Learning Algorithms. Appl. Sci. 2022, 12, 10321.
  43. Tharwat, A. Classification assessment methods. Appl. Comput. Inform. 2018, 17, 168–192.
  44. Leem, S.; Oh, J.; So, D.; Moon, J. Towards Data-Driven Decision-Making in the Korean Film Industry: An XAI Model for Box Office Analysis Using Dimension Reduction, Clustering, and Classification. Entropy 2023, 25, 571.
  45. Talukder, A.; Islam, M.; Uddin, A.; Akhter, A.; Hasan, K.F.; Moni, M.A. Machine learning-based lung and colon cancer detection using deep feature extraction and ensemble learning. Expert Syst. Appl. 2022, 205, 117695.
  46. Ying, M.; Pan, J.; Lu, G.; Zhou, S.; Fu, J.; Wang, Q.; Wang, L.; Hu, B.; Wei, Y.; Shen, J. Development and validation of a radiomics-based nomogram for the preoperative prediction of microsatellite instability in colorectal cancer. BMC Cancer 2022, 22, 524.
  47. Fadafen, M.K.; Rezaee, K. Ensemble-based multi-tissue classification approach of colorectal cancer histology images using a novel hybrid deep learning framework. Sci. Rep. 2023, 13, 8823.
  48. Jansen-Winkeln, B.; Barberio, M.; Chalopin, C.; Schierle, K.; Diana, M.; Köhler, H.; Gockel, I.; Maktabi, M. Feedforward artificial neural network-based colorectal cancer detection using hyperspectral imaging: A step towards automatic optical biopsy. Cancers 2021, 13, 967.
  49. Bora, K.; Bhuyan, M.K.; Kasugai, K.; Mallik, S.; Zhao, Z. Computational learning of features for automated colonic polyp classification. Sci. Rep. 2021, 11, 4347.
  50. Fan, J.; Lee, J.; Lee, Y. A Transfer learning architecture based on a support vector machine for histopathology image classification. Appl. Sci. 2021, 11, 6380.
  51. Lo, C.-M.; Yang, Y.-W.; Lin, J.-K.; Lin, T.-C.; Chen, W.-S.; Yang, S.-H.; Chang, S.-C.; Wang, H.-S.; Lan, Y.-T.; Lin, H.-H.; et al. Modeling the survival of colorectal cancer patients based on colonoscopic features in a feature ensemble vision transformer. Comput. Med. Imaging Graph. 2023, 107, 102242.
  52. Grosu, S.; Wesp, P.; Graser, A.; Maurus, S.; Schulz, C.; Knösel, T.; Cyran, C.C.; Ricke, J.; Ingrisch, M.; Kazmierczak, P.M. Machine learning–based differentiation of benign and premalignant colorectal polyps detected with CT colonography in an asymptomatic screening population: A proof-of-concept study. Radiology 2021, 299, 326–335.
  53. Takeda, K.; Kudo, S.-E.; Mori, Y.; Misawa, M.; Kudo, T.; Wakamura, K.; Katagiri, A.; Baba, T.; Hidaka, E.; Ishida, F.; et al. Accuracy of diagnosing invasive colorectal cancer using computer-aided endocytoscopy. Endoscopy 2017, 49, 798–802.
  54. Yang, K.; Zhou, B.; Yi, F.; Chen, Y.; Chen, Y. Colorectal Cancer Diagnostic Algorithm Based on Sub-Patch Weight Color Histogram in Combination of Improved Least Squares Support Vector Machine for Pathological Image. J. Med. Syst. 2019, 43, 306.
  55. Dragicevic, A.; Matija, L.; Krivokapic, Z.; Dimitrijevic, I.; Baros, M.; Koruga, D. Classification of Healthy and Cancer States of Colon Epithelial Tissues Using Opto-magnetic Imaging Spectroscopy. J. Med. Biol. Eng. 2018, 39, 367–380.
  56. Trivizakis, E.; Ioannidis, G.S.; Souglakos, I.; Karantanas, A.H.; Tzardi, M.; Marias, K. A neural pathomics framework for classifying colorectal cancer histopathology images based on wavelet multi-scale texture analysis. Sci. Rep. 2021, 11, 15546.
  57. Damkliang, K.; Wongsirichot, T.; Thongsuksai, P. Tissue classification for colorectal cancer utilizing techniques of deep learning and machine learning. Biomed. Eng. Appl. Basis Commun. 2021, 33, 2150022.
  58. Mittal, P.; Condina, M.R.; Klingler-Hoffmann, M.; Kaur, G.; Oehler, M.K.; Sieber, O.M.; Palmieri, M.; Kommoss, S.; Brucker, S.; McDonnell, M.D.; et al. Cancer tissue classification using supervised machine learning applied to MALDI mass spectrometry imaging. Cancers 2021, 13, 5388.
  59. Cao, W.; Pomeroy, M.J.; Liang, Z.; Abbasi, A.F.; Pickhardt, P.J.; Lu, H. Vector textures derived from higher order derivative domains for classification of colorectal polyps. Vis. Comput. Ind. Biomed. Art 2022, 5, 16.
  60. Deif, M.A.; Attar, H.; Amer, A.; Issa, H.; Khosravi, M.R.; Solyman, A.A.A. A New Feature Selection Method Based on Hybrid Approach for Colorectal Cancer Histology Classification. Wirel. Commun. Mob. Comput. 2022, 2022, 7614264.
  61. Chehade, A.H.; Abdallah, N.; Marion, J.-M.; Oueidat, M.; Chauvet, P. Lung and colon cancer classification using medical imaging: A feature engineering approach. Phys. Eng. Sci. Med. 2022, 45, 729–746.
  62. Tripathi, A.; Misra, A.; Kumar, K.; Chaurasia, B.K. Optimized Machine Learning for Classifying Colorectal Tissues. SN Comput. Sci. 2023, 4, 461.
  63. Kara, O.C.; Venkatayogi, N.; Ikoma, N.; Alambeigi, F. A Reliable and Sensitive Framework for Simultaneous Type and Stage Detection of Colorectal Cancer Polyps. Ann. Biomed. Eng. 2023, 51, 1499–1512.
  64. Ayyaz, M.S.; Lali, M.I.U.; Hussain, M.; Rauf, H.T.; Alouffi, B.; Alyami, H.; Wasti, S. Hybrid deep learning model for endoscopic lesion detection and classification using endoscopy videos. Diagnostics 2021, 12, 43.
  65. Mirniaharikandehei, S.; Heidari, M.; Danala, G.; Lakshmivarahan, S.; Zheng, B. Applying a random projection algorithm to optimize machine learning model for predicting peritoneal metastasis in gastric cancer patients using CT images. Comput. Methods Programs Biomed. 2021, 200, 105937.
  66. Hu, W.; Li, C.; Li, X.; Rahaman, M.; Ma, J.; Zhang, Y.; Chen, H.; Liu, W.; Sun, C.; Yao, Y.; et al. GasHisSDB: A new gastric histopathology image dataset for computer aided diagnosis of gastric cancer. Comput. Biol. Med. 2022, 142, 105207.
  67. Naser, E.F.; Zeki, S.M. Using Fuzzy Clustering to Detect the Tumor Area in Stomach Medical Images. Baghdad Sci. J. 2021, 18, 1294.
  68. Korkmaz, S.A.; Esmeray, F. A New Application Based on GPLVM, LMNN, and NCA for Early Detection of the Stomach Cancer. Appl. Artif. Intell. 2018, 32, 541–557.
  69. Nayyar, Z.; Khan, M.A.; Alhussein, M.; Nazir, M.; Aurangzeb, K.; Nam, Y.; Kadry, S.; Haider, S.I. Gastric tract disease recognition using optimized deep learning features. Comput. Mater. Contin. 2021, 68, 2041–2056.
  70. Hu, W.; Chen, H.; Liu, W.; Li, X.; Sun, H.; Huang, X.; Grzegorzek, M.; Li, C. A comparative study of gastric histopathology sub-size image classification: From linear regression to visual transformer. Front. Med. 2022, 9, 1072109.
  71. Korkmaz, S.A. Recognition of the Gastric Molecular Image Based on Decision Tree and Discriminant Analysis Classifiers by using Discrete Fourier Transform and Features. Appl. Artif. Intell. 2018, 32, 629–643.
  72. Korkmaz, S.A.; Binol, H. Classification of molecular structure images by using ANN, RF, LBP, HOG, and size reduction methods for early stomach cancer detection. J. Mol. Struct. 2018, 1156, 255–263.
  73. Kanesaka, T.; Lee, T.-C.; Uedo, N.; Lin, K.-P.; Chen, H.-Z.; Lee, J.-Y.; Wang, H.-P.; Chang, H.-T. Computer-aided diagnosis for identifying and delineating early gastric cancers in magnifying narrow-band imaging. Gastrointest. Endosc. 2018, 87, 1339–1344.
  74. Feng, Q.-X.; Liu, C.; Qi, L.; Sun, S.-W.; Song, Y.; Yang, G.; Zhang, Y.-D.; Liu, X.-S. An Intelligent Clinical Decision Support System for Preoperative Prediction of Lymph Node Metastasis in Gastric Cancer. J. Am. Coll. Radiol. 2019, 16, 952–960.
  75. Korkmaz, S.A. Classification of histopathological gastric images using a new method. Neural Comput. Appl. 2021, 33, 12007–12022.
  76. Dai, H.; Bian, Y.; Wang, L.; Yang, J. Support Vector Machine-Based Backprojection Algorithm for Detection of Gastric Cancer Lesions with Abdominal Endoscope Using Magnetic Resonance Imaging Images. Sci. Program. 2021, 2021, 9964203.
  77. Haile, M.B.; Salau, A.; Enyew, B.; Belay, A.J. Detection and classification of gastrointestinal disease using convolutional neural network and SVM. Cogent Eng. 2022, 9, 2084878.
  78. Noor, M.N.; Nazir, M.; Khan, S.A.; Song, O.-Y.; Ashraf, I. Efficient Gastrointestinal Disease Classification Using Pretrained Deep Convolutional Neural Network. Electronics 2023, 12, 1557.
  79. Yin, F.; Zhang, X.; Fan, A.; Liu, X.; Xu, J.; Ma, X.; Yang, L.; Su, H.; Xie, H.; Wang, X.; et al. A novel detection technology for early gastric cancer based on Raman spectroscopy. Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 2023, 292, 122422.
Figure 1. PRISMA flow diagram for the literature selection process.
Figure 2. Parameters governing the inclusion and exclusion of research articles in the selection process.
Figure 3. Temporal Analysis of Literature Utilization Across Cancer Categories (2017–2023).
Figure 4. Probabilistic analysis of misclassification for identical test point and nearest neighbor scenario.
Figure 5. Contrasts (a) biological neurons, showcasing intricate neural architecture, with (b) artificial perceptrons in neural networks, depicting simplified representations and emphasizing structural differences.
Figure 6. Separating Hyperplanes and Maximum Margin Hyperplane in Support Vector Machines.
Figure 7. Structural diagram of the multi-class support vector machine (SVM).
Figure 8. Binary Decision Tree with Sole Storage of Class Labels.
Figure 9. Confusion Matrix for Multiclass Classification Evaluation.
Figure 10. Metrics comparison for the prediction of colorectal cancer.
Figure 11. Metrics comparison for the prediction of gastric cancer.
Figure 12. Proposed architectural flow diagram for the detection of colorectal cancer using traditional machine learning models from imaging database.
Figure 13. Proposed architectural flow diagram for the detection of stomach cancer using traditional machine learning models from imaging dataset.
Table 1. Benchmark and public medical imaging datasets for colorectal and gastric cancer with download links.

| Dataset | Cancer Category | Modality | Download Link | No. of Data Samples | Image Size (pixels) |
|---|---|---|---|---|---|
| NCT-CRC-HE-100K | Colorectal | H&E | https://zenodo.org/record/1214456 (accessed on 15 September 2023) | 100,000 | 224 × 224 |
| Lung and colon histopathological images (LC25000) | Colorectal | H&E | https://academictorrents.com/details/7a638ed187a6180fd6e464b3666a6ea0499af4af (accessed on 15 September 2023) | 10,000 | 768 × 768 |
| CRC-VAL-HE-7K | Colorectal | H&E | https://zenodo.org/record/1214456 (accessed on 15 September 2023) | 7180 | 224 × 224 |
| Kather-CRC-2016 (KCRC-16) | Colorectal | H&E | https://zenodo.org/record/53169#.W6HwwP4zbOQ (accessed on 15 September 2023) | 5000 and 10 | 150 × 150 and 5000 × 5000, respectively |
| Kvasir V-2 dataset (KV2D) | Stomach (Gastric) | Endoscopy | https://dl.acm.org/do/10.1145/3193289/full/ (accessed on 15 September 2023) | 4000 | 720 × 576 to 1920 × 1072 |
| HyperKvasir dataset (HKD) | Stomach (Gastric) | Endoscopy | https://osf.io/mh9sj/ (accessed on 15 September 2023) | 110,079 images and 374 videos | Not specified |
| Gastric histopathology sub-size image database (GasHisSDB) | Stomach (Gastric) | H&E | https://gitee.com/neuhwm/GasHisSDB | 245,196 | 160 × 160, 120 × 120, 80 × 80 |
Table 2. Fundamental preprocessing techniques, associated formulas, and detailed descriptions.

| Preprocessing Technique | Formula | Description |
|---|---|---|
| Image Filtering | $I_{\text{filtered}}(A,B)=\sum_{x=-N}^{N}\sum_{y=-N}^{N} I(A-x,\,B-y)\,K(x,y)$ | $I_{\text{filtered}}(A,B)$ is the filtered image pixel at location $(A,B)$. $I(A-x,\,B-y)$ is the pixel value at location $(A-x,\,B-y)$ in the original image. $K(x,y)$ is the value of the convolution kernel at location $(x,y)$. The summation is performed over a window of size $(2N+1)\times(2N+1)$ centered at $(A,B)$. |
| Image Denoising | $I_{\text{denoised}}=\arg\min\left[E(I_{\text{denoised}})+R(I_{\text{denoised}})\right]$ | $I_{\text{denoised}}$ represents the denoised image. $E(I_{\text{denoised}})$ is the data fidelity term, which measures how well the denoised image matches the noisy input image. $R(I_{\text{denoised}})$ is the regularization term, which imposes a prior on the structure of the denoised image [21]. |
| Gaussian Filtering | $\text{Filtered}_{\text{value}}=\dfrac{1}{2\pi\sigma^{2}}\,e^{-\frac{x^{2}+y^{2}}{2\sigma^{2}}}$ | $\text{Filtered}_{\text{value}}$ represents the resulting value after applying Gaussian filtering. $x$ and $y$ are the spatial coordinates. $\sigma$ is the standard deviation, controlling the amount of smoothing or blurring. |
| Contrast Enhancement of Images (CEI) | $\text{Pixel}_{OP}=\dfrac{\text{Pixel}_{IP}-\text{Min}_{IP}}{\text{Max}_{IP}-\text{Min}_{IP}}\times\left(\text{Max}_{OP}-\text{Min}_{OP}\right)+\text{Min}_{OP}$ | $\text{Pixel}_{OP}$ is the enhanced pixel value, derived from $\text{Pixel}_{IP}$ in the input image. $\text{Min}_{IP}$ and $\text{Max}_{IP}$ are the minimum and maximum pixel values in the input image. $\text{Min}_{OP}$ and $\text{Max}_{OP}$ represent the desired minimum and maximum pixel values in the output image [22]. |
| Linear Transformation | $T(v)=Av$ | $T$ is the transformation operator, $v$ is the input vector, and $A$ is a matrix defining the transformation. |
| Contrast Limited Adaptive Histogram Equalization (CLAHE) | $O(A,B)=T\left(I(A,B)\right)$ | $O(A,B)$ is the enhanced output pixel at $(A,B)$, obtained with the contrast-enhancing transformation function $T$, which is based on the cumulative distribution function (CDF) of the pixel intensities. |
| Discrete Cosine Transform (DCT) | $X(m)=\sum_{k=0}^{N-1} x(k)\cos\left[\dfrac{\pi(2k+1)m}{2N}\right]$ | $X(m)$ represents the DCT coefficient at frequency index $m$. $x(k)$ is the input signal. $N$ is the number of samples in the signal. The summation is performed over all samples in the signal. |
| Wavelet Transform (WT) | $W(x,y)=\sum_{a=0}^{N-1}\sum_{b=0}^{M-1} I(a,b)\,\psi_{x,y}(a,b)$ | $W(x,y)$ is the DWT coefficient, $I(a,b)$ is the pixel value at $(a,b)$, and $\psi_{x,y}(a,b)$ is the 2D wavelet function. |
| RGB to Gray Conversion (RGBG) | $\text{Gray}_{\text{value}}=0.2989\,\text{Red}_{\text{value}}+0.5870\,\text{Green}_{\text{value}}+0.1140\,\text{Blue}_{\text{value}}$ | $\text{Gray}_{\text{value}}$ is the gray value converted from the RGB channels ($\text{Red}_{\text{value}}$, $\text{Green}_{\text{value}}$, $\text{Blue}_{\text{value}}$). The coefficients 0.2989, 0.5870, and 0.1140 are the weights assigned to the R, G, and B channels, respectively [23]. |
| Cropping (ROI) | $I_{\text{cropped}}=I[\,y:y+h,\;x:x+w\,]$ | The cropped image $I_{\text{cropped}}$ is obtained by cropping the input image $I$ at coordinates $(x,y)$ with width $w$ and height $h$. |
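Several of the formulas in Table 2 map directly onto array operations. The following is a minimal NumPy sketch of four of them (RGB to gray conversion, contrast enhancement, Gaussian filtering, and cropping), offered as an illustration rather than as the implementation used in any of the reviewed studies. The function names, the zero padding at the image border, and the normalization of the Gaussian kernel so that it sums to 1 are assumptions made here and are not part of the original table.

```python
import numpy as np

# Channel weights for RGB-to-gray conversion, as listed in Table 2.
GRAY_WEIGHTS = np.array([0.2989, 0.5870, 0.1140])


def rgb_to_gray(rgb):
    """Weighted sum of the R, G, B channels of an (H, W, 3) array."""
    return rgb[..., :3].astype(float) @ GRAY_WEIGHTS


def stretch_contrast(img, out_min=0.0, out_max=255.0):
    """Min-max contrast enhancement (CEI row of Table 2)."""
    img = img.astype(float)
    in_min, in_max = img.min(), img.max()
    if in_max == in_min:
        # Flat image: the formula would divide by zero (assumed fallback).
        return np.full_like(img, out_min)
    return (img - in_min) / (in_max - in_min) * (out_max - out_min) + out_min


def gaussian_kernel(n, sigma):
    """(2n+1) x (2n+1) Gaussian kernel; normalized to sum to 1 (assumption)."""
    ax = np.arange(-n, n + 1)
    x, y = np.meshgrid(ax, ax)
    k = np.exp(-(x**2 + y**2) / (2.0 * sigma**2)) / (2.0 * np.pi * sigma**2)
    return k / k.sum()


def filter_image(img, kernel):
    """I_filtered(A, B) = sum_x sum_y I(A - x, B - y) K(x, y), zero padded."""
    n = kernel.shape[0] // 2
    rows, cols = img.shape
    padded = np.pad(img.astype(float), n)  # zeros outside the image
    out = np.zeros((rows, cols), dtype=float)
    for x in range(-n, n + 1):
        for y in range(-n, n + 1):
            # This slice holds I(A - x, B - y) for every output pixel (A, B).
            shifted = padded[n - x : n - x + rows, n - y : n - y + cols]
            out += shifted * kernel[x + n, y + n]
    return out


def crop(img, x, y, w, h):
    """I_cropped = I[y : y + h, x : x + w]."""
    return img[y : y + h, x : x + w]
```

A typical chain would be `crop(filter_image(stretch_contrast(rgb_to_gray(image)), gaussian_kernel(2, 1.0)), x, y, w, h)`. For real workloads, a library routine with explicit border handling is likely preferable to the explicit loop shown here, which is written to mirror the double summation in the table.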
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
