Article

Utilizing Nearest-Neighbor Clustering for Addressing Imbalanced Datasets in Bioengineering

1 Department of Computer Science and Engineering, National Sun Yat-sen University, Kaohsiung 833, Taiwan
2 Division of Cardiology, Department of Internal Medicine, Kaohsiung Chang Gung Memorial Hospital, Kaohsiung 833, Taiwan
3 Department of Pediatrics, Kaohsiung Chang Gung Memorial Hospital, Kaohsiung 833, Taiwan
* Authors to whom correspondence should be addressed.
Bioengineering 2024, 11(4), 345; https://doi.org/10.3390/bioengineering11040345
Submission received: 20 February 2024 / Revised: 19 March 2024 / Accepted: 24 March 2024 / Published: 31 March 2024
(This article belongs to the Special Issue Computer Vision and Machine Learning in Medical Applications)

Abstract

Imbalanced classification is common in scenarios such as fault diagnosis, intrusion detection, and medical diagnosis, where abnormal data are difficult to obtain. This article addresses a one-class problem by implementing and refining the One-Class Nearest-Neighbor (OCNN) algorithm. The original inter-quartile-range mechanism is replaced with the K-means with outlier removal (KMOR) algorithm for efficient outlier identification in the target class. Parameters are optimized by treating these outliers as non-target-class samples. A new algorithm, the Location-Based Nearest-Neighbor (LBNN) algorithm, clusters the one-class training data using KMOR and, for each test data point, compares its distance to each cluster's reference point against a percentile of within-cluster distances to determine whether it belongs to the target class. Experiments cover parameter studies, validation on eight standard imbalanced datasets from KEEL, and three applications on real imbalanced medical datasets. Results show superior performance in precision, recall, and G-means compared with traditional classification models, making the approach effective for handling imbalanced data.

1. Introduction

The problem of one-class classification poses a unique challenge in machine learning: obtaining data from the target class is relatively easy, but data from the non-target class are either extremely scarce or entirely absent. Identifying samples from the non-target class is crucial; in the Stroke-Poor dataset, for example, correctly identifying at-risk patients enables more proactive medical interventions by healthcare professionals. In practical scenarios like rare-disease identification, target-class data (non-rare-disease cases) often dominate the dataset, while non-target-class data are difficult to acquire due to cost constraints or physiological characteristics. In such highly imbalanced [1] or one-class situations, building a reasonable model with traditional supervised learning algorithms becomes a formidable task.
This article introduces a K-means-based method to replace the inter-quartile-range mechanism in the One-Class Nearest-Neighbor (OCNN) algorithm. Additionally, we propose a novel strategy, the Location-Based Nearest-Neighbor (LBNN) algorithm, which aims to improve model performance at comparable time complexity. Experimental validation assesses the performance gains on KEEL datasets [2] and compares the methods with traditional algorithms on real medical data. The remainder of the article presents the research motivation and objectives, reviews the problem background and relevant literature, describes the OCNN mechanisms and the LBNN strategy, details the experimental process with analysis and interpretation of the results for each experiment type, and concludes with a summary of the findings and directions for future research. We believe the key contribution of this new strategy is better predictive performance on imbalanced applications, such as heart disease, diabetes mellitus, or septicemia, compared with current bioengineering algorithms. We emphasize that the target class referred to above is the class whose samples are easier to acquire; the non-target class is the minority, hard-to-acquire class.

2. Materials and Methods

2.1. One-Class Classification

The assumption underlying the one-class classification problem [3] is that only one class, referred to as the target class, is available during training, while the remaining classes are considered non-target classes. One-class classifiers leverage the distinctive characteristics of the target class to identify a boundary encompassing all, or at least the majority of, the target-class data. In practical applications, such as wearable devices for individual electrocardiogram monitoring in precision healthcare (predicting conditions like high blood sugar or potassium levels), data may initially be collected only from healthy patients. It is impractical to wait until a sufficient amount of data has been gathered before initiating predictions, a typical scenario for a one-class classification problem.
An alternative solution involves using public data to predict individual health conditions, but this approach has proven ineffective due to the inconsistent physiological characteristics among individuals (e.g., variations in electrocardiogram wavelength, peak positions, and heights). To overcome the limitations of traditional classification algorithms in such situations, one-class classification algorithms play a crucial role.
A suitable one-class classifier must exhibit strong generalization capabilities, maintaining a high recognition rate for non-target classes while avoiding overutilization of target class information to prevent overfitting. Proper handling of target class and outlier [4,5] values is essential to derive effective decision boundaries. The one-class classification problem can be mathematically expressed as follows:
$$f(z) = \mathbb{I}\big(d(z) < \theta_z\big)$$

Here, $d(z)$ is a measurement of data point $z$ with respect to the target-class group (e.g., a distance or density), $\theta_z$ is the threshold on $d(z)$, and $f(z)$ is the indicator function determining whether $z$ is accepted as the target class.

One-Class Support Vector Machine (OC-SVM)

The OC-SVM [6] is an unsupervised algorithm based on the SVM [7,8] that can be used for novelty and anomaly detection. Its objective is to find a decision function or hyperplane that separates the target-class data from the non-target-class data. As illustrated in Figure 1, most of the training data are allocated to one region and assigned the value +1, while data outside this region are assigned the value −1. The OC-SVM uses a kernel (typically Gaussian) to map the input data into a higher-dimensional space, where it seeks a hyperplane that maximally separates the mapped data from the origin. Given a set of target-class training data $x_i \in \mathbb{R}^d$, $i = 1, \ldots, N$, the problem can be written as the quadratic program:

$$\min_{w,\,\xi,\,\rho} \; \frac{1}{2}\lVert w \rVert^2 + \frac{1}{\nu N}\sum_{i=1}^{N}\xi_i - \rho \qquad \text{subject to} \quad w \cdot \Phi(x_i) \ge \rho - \xi_i, \;\; \xi_i \ge 0$$

Here, $N$ is the total number of data points, $\nu \in (0, 1)$ sets an upper bound on the fraction of outliers, $\xi_i$ is the slack variable for each data point, $\rho$ is the offset of the hyperplane, and $\Phi$ is the feature map induced by the kernel.
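For illustration, the following is a minimal scikit-learn sketch of training a one-class SVM on target-class data only. The toy data are placeholders, and the nu and gamma values (mirroring the settings later listed in Table 1) are our assumptions rather than the exact experimental setup:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_target = rng.normal(0.0, 1.0, size=(200, 2))        # target-class training data only
X_test = np.vstack([rng.normal(0.0, 1.0, (10, 2)),    # points resembling the target class
                    rng.normal(6.0, 1.0, (10, 2))])   # points far from the target class

# nu bounds the fraction of training outliers; gamma = 1/d mirrors Table 1.
clf = OneClassSVM(kernel="rbf", nu=0.5, gamma=1.0 / X_target.shape[1])
clf.fit(X_target)

print(clf.predict(X_test))   # +1 = accepted as target class, -1 = rejected
```

The +1/−1 output convention matches the region assignment described above for Figure 1.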

2.2. One-Class Nearest-Neighbor (OCNN) Algorithm

The OCNN algorithm can be classified into four types based on the number of nearest neighbors [9,10] chosen:
  • Find the nearest neighbor of the test data in the target class, and then find the nearest neighbor of this nearest neighbor (11NN).
  • Find the nearest neighbor of the test data in the target class, and then find the K-nearest neighbors of this nearest neighbor (1KNN).
  • Find J-nearest neighbors of the test data in the target class, and then find the nearest neighbor of each of these J-nearest neighbors (J1NN).
  • Find J-nearest neighbors of the test data in the target class, and then find the K-nearest neighbors of each of these J-nearest neighbors (JKNN).
Figure 2 illustrates the four OCNN methods. The black circles represent target-class data, and the red asterisk represents an unknown data point. To determine whether the unknown point belongs to the target class, different numbers of nearest neighbors are selected according to the parameters J and K. After calculating the average distance to the J-nearest neighbors and the average distance to their K-nearest neighbors, the ratio of these values is compared with the threshold θ. Taking JKNN as an example, the detailed process is given in Algorithm 1:
Algorithm 1 Pseudo-code of JKNN
Input: N target-class data points with d dimensions, test data point z, nearest-neighbor parameters J and K, and threshold θ.
Output: Accept or reject the test data point z as target-class data.
1. Calculate the distances from z to its J-nearest neighbors and take the average, where $NN_j^{tr}(z)$ denotes the j-th nearest neighbor of z in the training data:
$$D_J = \frac{1}{J}\sum_{j=1}^{J} \left\lVert z - NN_j^{tr}(z) \right\rVert$$
2. Calculate the distances from each of the J-nearest neighbors of z to its own K-nearest neighbors and take the average, where $NN_k^{tr}(NN_j^{tr}(z))$ denotes the k-th nearest neighbor of the j-th neighbor:
$$D_K = \frac{1}{JK}\sum_{j=1}^{J}\sum_{k=1}^{K} \left\lVert NN_j^{tr}(z) - NN_k^{tr}\big(NN_j^{tr}(z)\big) \right\rVert$$
3. If $D_J / D_K < \theta$, accept the test data point z as target class; otherwise, reject it as non-target class.
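As a concrete illustration, the following NumPy sketch implements the JKNN rule of Algorithm 1, assuming Euclidean distance; the helper name jknn_accept and its default arguments are ours:

```python
import numpy as np

def jknn_accept(X_train, z, J=3, K=2, theta=1.0):
    """Return True if test point z is accepted as target class by JKNN."""
    # Distances from z to all training points; take the J nearest neighbors.
    d_z = np.linalg.norm(X_train - z, axis=1)
    j_idx = np.argsort(d_z)[:J]
    D_J = d_z[j_idx].mean()

    # For each of the J neighbors, average the distances to its own
    # K nearest neighbors (index 0 is the point itself, so skip it).
    dists = []
    for j in j_idx:
        d_j = np.linalg.norm(X_train - X_train[j], axis=1)
        k_idx = np.argsort(d_j)[1:K + 1]
        dists.append(d_j[k_idx])
    D_K = np.mean(dists)

    # Accept z as target class when the distance ratio stays below theta.
    return (D_J / D_K) < theta
```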
In the OCNN algorithm, the threshold θ can be fixed at 1 or chosen arbitrarily. Here, we discuss the relationship between 11NN under different threshold values θ and the other OCNN methods.
In Figure 2a, when the 11NN threshold θ is set to 1, the test data point is classified as non-target (an outlier) whenever $D_1 > D_{11}$, even if $D_1$ is only slightly larger than $D_{11}$. Intuitively, the distance $D_1$ from a non-target point (outlier) to its nearest neighbor should be much larger than the distance $D_{11}$ from that neighbor to its own nearest neighbor. This can be expressed mathematically as:

$$D_1 > \theta\, D_{11}$$

When $\theta > 1$, some data that would be classified as non-target under the rule $D_1 > D_{11}$ are instead accepted as target class. This situation aligns better with our intuition about outliers. Finding the optimal θ is therefore an important issue, and the optimum changes with the dataset and the evaluation criteria.
Figure 2b shows 1KNN for a non-target data point; its distance to the nearest neighbor satisfies:

$$D_1 > \frac{D_{11} + D_{12} + \cdots + D_{1K}}{K}$$

where $D_{1i}$ ($1 \le i \le K$) is the distance from the nearest neighbor to its i-th nearest neighbor, and we can observe that $D_{1i}$ increases as i increases. Expanding the inequality $D_{1i} \ge D_{1(i-1)} \ge \cdots \ge D_{11}$, we obtain:

$$\frac{D_{11} + D_{12} + \cdots + D_{1K}}{K} \;\ge\; \frac{D_{11} + D_{11} + \cdots + D_{11}}{K} = D_{11}, \qquad \frac{D_{11} + D_{12} + \cdots + D_{1K}}{K} = \alpha\, D_{11} \;\; (\alpha > 1)$$

Combining the two inequalities above, we obtain:

$$D_1 > \alpha\, D_{11} \quad (\alpha > 1)$$
This demonstrates that the 1KNN method produces an effect similar to 11NN with $\theta > 1$. J1NN: in contrast to 1KNN, we consider J-nearest neighbors of the test data but only one nearest neighbor for each of them. As shown in Figure 2c, by a derivation similar to that of 1KNN, we obtain:

$$\alpha\, D_1 > D_{11} \;\Rightarrow\; D_1 > \frac{D_{11}}{\alpha} \quad (\alpha > 1)$$

This shows that the J1NN method produces an effect similar to 11NN with $\theta < 1$. JKNN: we calculate the average distance to the J-nearest neighbors and the average over their respective K-nearest neighbors. Based on the previous derivations for 1KNN and J1NN, we obtain:

$$\alpha\, D_1 > \beta\, D_{11} \;\Rightarrow\; D_1 > \frac{\beta}{\alpha}\, D_{11} \quad (\alpha > 1,\ \beta > 1)$$

As the parameters J and K vary, these two parameters offset each other's influence. When $\beta / \alpha > 1$, JKNN behaves like 11NN with $\theta > 1$, accepting more outliers as target class. When $\beta / \alpha < 1$, JKNN behaves like 11NN with $\theta < 1$, making the criterion more stringent, so more data are considered outliers.

2.3. One-Class Nearest Neighbor (OCNN) Parameter Optimization

Based on the above discussion, we understand the relationships between different types of OCNN classifiers. The settings of parameters J , K , and threshold θ will be an important topic.
Parameter optimization is a challenging issue in one-class classifiers because, in the training data, only data from the target class can be used, unlike the situation with multi-class data, where traditional classifiers utilize data from different classes to make decision boundaries. Here, we use some methods to identify outliers in the target class data for parameter optimization of the OCNN classifier. Regarding the selection of nearest neighbors and their distance calculations, we can identify the following issues faced by different OCNN classifiers:
First, we treat the target class as negative data and the non-target class as positive data.
  • False negatives: In real-world datasets, noise samples may be generated by human error (incorrect labeling, operational negligence, etc.), and the OCNN classifier described earlier cannot detect this phenomenon. When the target-class samples exhibit a tight configuration, noise samples far from the cluster lead to unknown non-target-class data being incorrectly classified as target-class data.
  • False positives: If an appropriate decision threshold θ is not found after removing noise samples from the dataset, the OCNN classifier will identify much of the test data as non-target-class data. False positives also arise when the target class in the training data is not sufficiently representative.
Yin et al. [11] noted that in one-class classification problems, designing an error detection system that simultaneously reduces false negatives and false positives is difficult due to the lack of non-target-class data. In general, one-class classifiers are sensitive to parameter settings [12]. A common approach to optimizing the parameters of one-class classifiers is to use generated synthetic samples: Ref. [13] attempts to model the distribution of the non-target class using artificially generated samples. However, generating artificial data has drawbacks: it requires in-depth domain knowledge and may lead to overfitting.
In this article, we optimize the parameters of OCNN classifiers using only one-class data: the outliers identified by the methods above are treated as proxies for the non-target class, and cross-validation is used to optimize the parameters $J$, $K$, and $\theta$.

K-Means with Outlier Removal (KMOR) Algorithm

KMOR [14] is a K-means-based algorithm that simultaneously detects outliers and performs clustering [15,16,17]. Traditional K-means can experience drastic changes in its clustering results due to the presence of outliers, as illustrated in Figure 3: circles represent normal data, while asterisks represent outliers. With the number of clusters set to two, the left panel shows the clustering result when the presence of outliers is not considered, while the right panel shows the result when outliers are accounted for, resulting in a more asymmetric clustering outcome.
To address the aforementioned issue, KMOR introduces the concept of the K + 1 cluster, where data identified as outliers are assigned to the K + 1 cluster. Additionally, these outliers are independently treated in the objective function. The objective function of KMOR is defined as follows:
$$P(U, Z) \;=\; \sum_{i=1}^{n} \left( \sum_{j=1}^{k} u_{i,j}\, \lVert x_i - z_j \rVert^2 \;+\; u_{i,k+1}\, D(U, Z) \right)$$

subject to

$$\sum_{i=1}^{n} u_{i,k+1} \;\le\; n_0$$

In the equation, $U$ represents the membership of all data points in the clusters, $Z$ denotes the cluster centers, and $u_{i,j} = 1$ if data point $x_i$ belongs to the j-th cluster. $n_0$ bounds the maximum number of data points identified as outliers in the entire dataset.
$D(U, Z)$ can be expressed as follows:

$$D(U, Z) \;=\; \gamma\, \frac{\sum_{l=1}^{k} \sum_{j=1}^{n} u_{j,l}\, \lVert x_j - z_l \rVert^2}{\,n - \sum_{j=1}^{n} u_{j,k+1}\,}$$
The parameter $\gamma > 0$ is the weight on the average distance from all non-outliers to their respective cluster centers; $\gamma$ and $n_0$ together control the number of outliers in the dataset. Updating the cluster centers and cluster assignments differs slightly from the original K-means, following these rules:
  • Rule 1: Calculate the distance from each data point to all cluster centers. If the distance from the data point to its nearest cluster center is less than $D(U, Z)$, assign it to the cluster with the shortest distance; otherwise, assign it to cluster K + 1.
  • Rule 2: Update the centers of clusters 1 to K by averaging the data points in each cluster. Data points assigned to cluster K + 1 do not participate in the calculation.
The detailed process is given in Algorithm 2:
Algorithm 2 Algorithm flow for KMOR
Input: $X$, $k$, $\gamma$, $n_0$, $\delta$, $iter_{max}$
Output: Optimal $U$ and $Z$
1. Initialize $Z$ by selecting k points from X at random.
2. For each $i \in \{1, \ldots, n\}$: update U by assigning $x_i$ to its nearest center.
3. $s \leftarrow 0$, $p_0 \leftarrow 0$
4. While true:
   (a) Update U by Rule 1.
   (b) Update Z by Rule 2.
   (c) $s \leftarrow s + 1$; $p_s \leftarrow P(U, Z)$
   (d) If $|p_s - p_{s-1}| < \delta$ or $s \ge iter_{max}$, break.
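A compact NumPy sketch of the KMOR loop is given below. It assumes squared Euclidean distances and adds one simplification of ours: per iteration, at most n0 of the farthest points are flagged as outliers (label k); all names are illustrative, not the authors' implementation.

```python
import numpy as np

def kmor(X, k, gamma=2.5, n0=None, delta=1e-6, iter_max=100, seed=0):
    """Sketch of KMOR: returns labels in {0..k-1}, with label k marking outliers."""
    rng = np.random.default_rng(seed)
    n = len(X)
    n0 = max(1, int(0.05 * n)) if n0 is None else n0    # outlier cap, Table 1 style
    Z = X[rng.choice(n, size=k, replace=False)].copy()  # initial centers
    labels = np.zeros(n, dtype=int)
    p_prev, s = np.inf, 0
    while True:
        sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)  # (n, k) squared distances
        nearest_d, nearest_c = sq.min(axis=1), sq.argmin(axis=1)
        # D(U, Z): gamma * mean squared distance of the current non-outliers
        D = gamma * nearest_d[labels < k].mean()
        # Rule 1: points farther than D from every center go to cluster k (at most n0)
        labels = nearest_c.copy()
        cand = np.where(nearest_d > D)[0]
        cand = cand[np.argsort(nearest_d[cand])[::-1][:n0]]
        labels[cand] = k
        # Rule 2: recompute centers from non-outlier members only
        for j in range(k):
            if np.any(labels == j):
                Z[j] = X[labels == j].mean(axis=0)
        # Objective P(U, Z) with the updated centers
        sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
        d_assigned = sq[np.arange(n), np.minimum(labels, k - 1)]
        p = d_assigned[labels < k].sum() + (labels == k).sum() * D
        s += 1
        if abs(p_prev - p) < delta or s >= iter_max:
            return labels, Z
        p_prev = p
```

Calling `labels, Z = kmor(X_target, k=2)` would yield the points with `labels == k` as the detected outliers, which the following sections use as proxy non-target samples.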

2.4. Optimal Parameters J and K

We initially employ an outlier detection method (such as IQR or KMOR) to identify outliers in the target-class training data. These outliers are then treated as non-target-class instances, forming a binary dataset. Using K-fold cross-validation (this fold count is unrelated to the neighbor parameter K), we divide the dataset into K subsets, reserving one subset for testing and the remaining K − 1 subsets for training. For each combination of J and K values in JKNN, we evaluate the performance on each fold and store the results in a two-dimensional matrix indexed by J and K. Finally, we select the indices (the J and K values) with the highest average performance, as sketched below.
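A sketch of this search, assuming proxy labels produced by IQR or the kmor sketch above (0 = target, 1 = outlier) and reusing the hypothetical jknn_accept helper; geometric_mean_score comes from the imbalanced-learn package:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from imblearn.metrics import geometric_mean_score

def select_jk(X, y_proxy, J_grid=(1, 2, 3, 4, 5), K_grid=(1, 2, 3, 4, 5), theta=1.0):
    """Pick (J, K) maximizing mean G-means over stratified folds.
    y_proxy: 0 = target-class point, 1 = detected outlier (proxy non-target)."""
    skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
    scores = np.zeros((len(J_grid), len(K_grid)))
    for tr_idx, va_idx in skf.split(X, y_proxy):
        X_tr = X[tr_idx][y_proxy[tr_idx] == 0]   # fit on target-class points only
        for a, J in enumerate(J_grid):
            for b, K in enumerate(K_grid):
                pred = np.array([0 if jknn_accept(X_tr, z, J, K, theta) else 1
                                 for z in X[va_idx]])
                scores[a, b] += geometric_mean_score(y_proxy[va_idx], pred)
    a, b = np.unravel_index(scores.argmax(), scores.shape)
    return J_grid[a], K_grid[b]
```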

2.5. Optimal Parameter θ

After identifying outliers in the training dataset and reorganizing it into a binary dataset, we use cross-validation: for each validation data point, we compute the distance to its nearest neighbor and the distance between that neighbor and its own nearest neighbor (11NN). Dividing these distances yields a candidate threshold θ. After computing this over all folds, we obtain a θ vector of length N. We then evaluate each candidate threshold, producing an array of G-means performances; the index of the best G-means value in this array corresponds to the optimal threshold θ, as sketched below.
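In the same spirit, a sketch of the threshold search (helper names ours): each validation point contributes its 11NN distance ratio as a candidate θ, and the candidate with the best G-means on the proxy labels is kept.

```python
import numpy as np
from imblearn.metrics import geometric_mean_score

def d1_d11_ratio(X_train, z):
    """D1: distance from z to its nearest training point; D11: from that
    point to its own nearest neighbor. Their ratio is a candidate theta."""
    d = np.linalg.norm(X_train - z, axis=1)
    nn = np.argmin(d)
    d2 = np.linalg.norm(X_train - X_train[nn], axis=1)
    d11 = np.sort(d2)[1]                      # skip the zero self-distance
    return d[nn] / d11

def select_theta(X_train, X_val, y_val):
    """y_val holds proxy labels for the validation points (0 = target, 1 = outlier)."""
    ratios = np.array([d1_d11_ratio(X_train, z) for z in X_val])
    best_theta, best_g = 1.0, -1.0
    for theta in ratios:
        pred = (ratios >= theta).astype(int)  # ratio >= theta -> reject as non-target
        g = geometric_mean_score(y_val, pred)
        if g > best_g:
            best_theta, best_g = theta, g
    return best_theta
```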

2.6. Location-Based Nearest Neighbor (LBNN) Algorithm

Whether 11NN or JKNN is used, each unknown data point requires two rounds of nearest-neighbor searches over the entire training dataset: the first round searches for the J-nearest neighbors and the second round searches for their J × K nearest neighbors, resulting in a time complexity of $O(2dn) + J \times O(2dn) \approx O\big((J+1) \cdot 2dn\big)$. The searches for neighbors of neighbors can be expected to stay mostly within adjacent blocks; in other words, the unknown data point and these nearest neighbors are compared only locally (Figure 4), without considering the overall distribution of the data, which may affect the final performance of the model. We therefore propose a clustering-based nearest-neighbor search strategy, LBNN, and compare it with 11NN and JKNN.
Our LBNN strategy first applies KMOR [18] clustering to the training data and sets a percentile $Q$, $0 \le Q \le 1$. For an unknown data point $u_i$, $i \in \{1, 2, \ldots, n\}$, we find one nearest neighbor $P_{i,c}$, $c \in \{1, 2, \ldots, k\}$, in each cluster; these nearest neighbors serve as reference points for their respective clusters. We then calculate the distances $L_{i,c}$ between the reference point $P_{i,c}$ and the other data points in the same cluster. If the distance $d_{i,c}$ from the unknown point to any cluster's reference point is less than the Q-th percentile of the corresponding $L_{i,c}$, we classify the unknown point as target class. The LBNN strategy has a search time complexity of $O(kdn + kdn) \approx O(2kdn)$, where k is the number of clusters. The LBNN process is illustrated in Algorithm 3.
Algorithm 3 Pseudo-code of LBNN
Input: target training data D, number of clusters k, percentile rank Q, testing data T
Output: a result array R (predictions for T)
1. Apply KMOR clustering to the target training data D, obtaining k clusters.
2. N ← number of testing data points in T
3. For each i ∈ {1, …, N}: R[i] ← 0
4. For each i ∈ {1, …, N}:
      For each j ∈ {1, …, k}:
         Find the nearest neighbor P of T[i] in cluster j and record this distance d.
         Compute the distances between P and the other data points in cluster j to obtain the length vector L.
         If d < the Q-th percentile of L, then R[i] ← 1 and break.
5. Return R
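A NumPy sketch of Algorithm 3, assuming KMOR labels of the form produced by the earlier kmor sketch (label k marks outliers, which are excluded from the clusters); np.quantile implements the percentile test, and the function name is ours:

```python
import numpy as np

def lbnn_predict(X_train, labels, k, X_test, Q=0.9):
    """Predict 1 (target class) or 0 (non-target) for each test point."""
    preds = np.zeros(len(X_test), dtype=int)
    clusters = [X_train[labels == c] for c in range(k)]   # KMOR outliers excluded
    for i, u in enumerate(X_test):
        for C in clusters:
            if len(C) < 2:
                continue
            d = np.linalg.norm(C - u, axis=1)
            p = np.argmin(d)                       # reference point for this cluster
            L = np.linalg.norm(C - C[p], axis=1)   # distances from the reference point
            L = np.delete(L, p)                    # drop the zero self-distance
            if d[p] < np.quantile(L, Q):           # within the cluster's Q-th percentile
                preds[i] = 1                       # accept as target class
                break
    return preds
```

Combined with the earlier sketch, `labels, _ = kmor(X_target, k)` followed by `lbnn_predict(X_target, labels, k, X_test, Q=0.9)` would reproduce the flow of Figure 6 under these assumptions.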

2.7. Feature Selection

With the rapid development of modern technology, improvements in hardware and software performance, and the widespread application of the Internet of Things, data are generated at an unprecedented rate, including high-definition video, images, text, audio, and data arising from social networks and IoT devices. Such data often possess high-dimensional features, making accurate data analysis and decision making a challenging task. Feature selection can effectively handle multidimensional data and enhance learning efficiency, a notion that has been proven in both theory and practice.
Feature selection refers to obtaining a subset of the original features according to certain criteria, gathering the features relevant to the dataset. It plays a crucial role in reducing the computational cost of data processing by eliminating unnecessary and irrelevant features. Feature selection is considered a preprocessing step for data and learning algorithms; good feature selection can improve model accuracy and reduce training time.
In this article, the real-world medical data we use contain a large number of features (64 in total). To enhance the performance of the various algorithm models, we employed a stepwise feature selection method ("Stepwise"): in each round, exactly one feature is added, retaining the combination of features with the best performance, until a specified number of features is reached. Assuming a dataset with ten features and the area under the receiver operating characteristic (ROC) curve (AUC) as the evaluation metric, the detailed process is as follows (a sketch is given after the list):
  • In the first round, we select one feature at a time for model training. After testing all features, we retain the feature with the best AUC performance.
  • As in the first step, we choose one feature at a time from the remaining nine features, but now in combination with the feature retained from the first round. In the end, we obtain the two features with the best performance.
  • We repeat these steps until the specified number of retained features is reached.
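As referenced above, a sketch of the stepwise procedure under the assumption of a logistic-regression scorer (any estimator with a cross-validated AUC would do; the function name and defaults are ours):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def stepwise_forward(X, y, n_keep, estimator=None):
    """Greedy forward selection: in each round, add the single feature that
    most improves cross-validated AUC, until n_keep features are retained."""
    est = estimator or LogisticRegression(max_iter=1000)
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_keep:
        best_f, best_auc = None, -np.inf
        for f in remaining:
            cols = selected + [f]
            auc = cross_val_score(est, X[:, cols], y, cv=5,
                                  scoring="roc_auc").mean()
            if auc > best_auc:
                best_f, best_auc = f, auc
        selected.append(best_f)
        remaining.remove(best_f)
    return selected
```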

3. Results

This section provides an overview of the experimental environment, including details on the dataset utilized and the configuration of relevant parameters.

3.1. Experimental Environment and Settings

3.1.1. Execution Environment

The experiments were conducted in a controlled setting to ensure consistency and reproducibility. The hardware was manufactured by HP, headquartered in Palo Alto, CA, USA. The hardware and software specifications are outlined below:

3.1.2. Experimental Parameter Configuration and Evaluation Metrics

For all algorithms incorporating the K-Means with Outlier Removal (KMOR) technique, the setting of the cluster number k is a critical consideration. We employ the KMOR technique to perform clustering on the data, selecting the cluster number k based on the minimum value obtained from the objective function.
  • Evaluation metrics (terms: true positive (TP), true negative (TN), false positive (FP), false negative (FN)); a sketch computing these metrics follows the list:
  • AUC (area under the ROC curve)
  • Accuracy: (TP + TN) / (TP + TN + FP + FN)
  • Precision: TP / (TP + FP)
  • Recall (sensitivity, true-positive rate): TP / (TP + FN)
  • Specificity (true-negative rate): TN / (TN + FP)
  • G-means: √(Specificity × Recall)
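As noted above, a small sketch computing these metrics from the confusion counts (AUC, which needs ranked scores rather than hard labels, is omitted):

```python
import numpy as np

def evaluation_metrics(y_true, y_pred):
    """Compute the listed metrics from confusion counts (1 = positive class)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "recall": recall,
        "specificity": specificity,
        "g_means": np.sqrt(recall * specificity),
    }
```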
The specified parameters and evaluation metrics serve as the foundation for assessing the performance of the KMOR-integrated algorithms on the KEEL datasets. The rigorous parameter tuning and metric selection aim to provide a comprehensive evaluation framework for the experiments. The algorithm-specific parameter configuration for the KEEL datasets is given in Table 1.

3.1.3. Dataset Utilization

KEEL is a software tool (Version 3.0, released 9 April 2018) that assesses evolutionary algorithms for data-mining problems such as regression, classification, clustering, and pattern mining. KEEL provides a repository of preprocessed datasets for classification problems, including imbalanced datasets. Imbalanced datasets are a special case of classification problems in which the class distribution is not uniform; typically, they comprise two classes, the majority (negative) class and the minority (positive) class. We leveraged eight standard imbalanced datasets from the KEEL repository, collected from various domains and all comprising two classes. Datasets such as glass2 and glass4 are for glass classification, with the '2' in glass2 indicating that the second class is positive and the rest negative, forming a one-versus-rest two-class dataset; datasets such as glass4, Yeast4, and ecoli4 follow the same convention. The Yeast series and ecoli4 datasets are for biological applications (protein classification). The segment0 dataset is for outdoor object image classification, with features consisting of pixel information for the images, while the pageblocks0 dataset is for document classification, with features comprising document layout information.
Regarding the real medical datasets, we obtained three types of medical data from Dr. Tsai, one of the authors, who works at Kaohsiung Chang Gung Memorial Hospital; these datasets may be requested from Dr. Tsai at tcmnor@cgmh.org.tw.
ROSC (return of spontaneous circulation) indicates that an emergency patient had no breathing or heartbeat upon admission. For such patients, we predict two different outcomes. ROSC-CPC12 predicts whether a patient will have an excellent prognosis (able to live independently) 12 months after discharge; the minority samples are the surviving patients and the majority samples are the patients unable to survive. ROSC-30DayS predicts whether a patient will survive 30 days after discharge; the minority samples are the patients who survived past 30 days and the majority samples are those who did not. Stroke-Poor predicts whether a stroke patient will have another severe stroke after discharge; the minority samples are the patients who will not have another stroke, and the majority samples are the patients who will. The features of these medical datasets are all derived from the patients' physiological test results (blood sugar, blood pressure, etc.) and from the drugs and treatment methods used.
Table 2 and Table 3 below list the dataset structures, including the number of features, samples, minority samples, majority samples, and the imbalance ratio.

3.2. Experimental Framework

The datasets utilized in this article are exclusively binary. In training the One-Class Nearest-Neighbor (OCNN) algorithm, one of the classes is designated as the target class. The dataset is first divided into training and testing sets using cross-validation. The training data are then processed with the methodology outlined in Section 2, separating them into target-class data and outliers; the outliers, treated as non-target-class data, are used for parameter optimization. The final evaluation of the model uses the testing data, generating the various performance metrics. This experimental framework is illustrated in Figure 5. In contrast to the OCNN training architecture, our LBNN approach begins by clustering the one-class training data with KMOR, setting a percentile Q in the process; this framework is illustrated in Figure 6.

3.3. Experimental Result

The SVM baseline uses an RBF kernel optimized via grid search. The tables below present our experimental results: for each dataset, we list the AUC, accuracy, TPR, TNR, and G-means of the One-Class SVM, 11NN, 11NN(θ), JKNN, 11NN(θ)-KMOR, JKNN-KMOR, and LBNN algorithms.
From the experimental results in Table 4, Table 5, Table 6, Table 7 and Table 8, the improved OCNN and LBNN methods exhibit better performance on most datasets. Although the LBNN method has poorer TPR, it compensates with a significantly improved TNR, resulting in more stable variation between TPR and TNR and hence better G-means performance. On a few datasets, the LBNN method has an AUC below 0.6, which may be due to insufficient representativeness of the one-class data, overlapping classes, or unprocessed features.

3.4. Real Medical Dataset Results

In Table 9, Table 10 and Table 11 below, we list the evaluation metrics for three real datasets from a local hospital. We selected three traditional methods for comparison with LBNN: logistic regression, support vector machine, and random forest [19].
On the ROSC-CPC12 dataset, the LBNN method obtains better results than the three traditional methods on every metric except accuracy and specificity. In particular, the recall of the LBNN model improves significantly. The higher recall indicates a more stringent evaluation of unknown data, even at the cost of sacrificing accuracy on the target class.
For the ROSC-30DayS dataset, the LBNN results are better than those of all three traditional methods across the board; the insight here is that, with a lower imbalance ratio, the LBNN method achieves a larger improvement than it did on the ROSC-CPC12 dataset.
On the Stroke-Poor dataset, the LBNN method obtains better precision and recall, which matters most because it means fewer false-positive and fewer false-negative cases; for the Stroke-Poor application, it means fewer misdiagnosed patients. We therefore conclude that the LBNN method delivers enhanced performance on all three real medical datasets.

4. Discussion

For the KEEL dataset experiments, our enhanced OCNN method and the proposed LBNN method exhibit superior performance across most datasets. While the LBNN method may slightly lag behind the other methods in true-positive rate (TPR), its substantial improvement in true-negative rate (TNR) contributes to a more stable overall performance: the variations in TPR and TNR are relatively steady, so the LBNN method outperforms most methods on G-means. In terms of the area under the curve (AUC), a few datasets show values below 0.6, possibly due to inherent overlap in the dataset or insufficient representation of class characteristics. Comparing G-means, the LBNN method demonstrates superior stability.
For the real-dataset experiments, we investigated the ROSC-CPC12 dataset first; given its substantial imbalance, we preprocessed the data for the traditional models through sampling, such as oversampling [20,21,22,23] and undersampling [15,24,25]. The results revealed a notable improvement in recall, but at the expense of precision and specificity. According to the experimental findings, the LBNN method exhibited the best performance in precision, recall, and G-means across all medical datasets. However, in terms of accuracy and specificity, the LBNN method generally lagged behind the traditional algorithms. This is attributed to the LBNN method's use of a stricter standard to enhance recall, sacrificing the precision of the target class and subsequently reducing overall accuracy. Nevertheless, as medical professionals describe, healthcare applications emphasize precision and recall as the crucial metrics (typically requiring a recall above 0.8). This implies that the LBNN method is better suited to identifying individuals truly afflicted with a disease, aligning more closely with the practical requirements of medical applications. Future research may further optimize the LBNN method to improve its performance on the other metrics and apply integrated approaches to strike a balance between accuracy and recall.
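For reference, the resampling preprocessing used for the traditional baselines might look like the following imbalanced-learn sketch; the synthetic data stand in for a ROSC-CPC12 training split and are our placeholder, not the actual data:

```python
import numpy as np
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Toy imbalanced data standing in for a ROSC-CPC12 training split.
X_train, y_train = make_classification(n_samples=1000, weights=[0.92],
                                       random_state=0)

X_os, y_os = SMOTE(random_state=0).fit_resample(X_train, y_train)       # oversampling [20]
X_us, y_us = RandomUnderSampler(random_state=0).fit_resample(X_train,   # undersampling
                                                             y_train)

print(np.bincount(y_os), np.bincount(y_us))   # both splits are now balanced
```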

5. Conclusions

In practical applications, the collection of industrial and medical data often encounters challenges related to one-class or imbalanced scenarios. Traditional algorithms typically require data sampling, weighting adjustments, or cost-sensitive learning to achieve reasonable results aligned with specific context needs. However, these methods have drawbacks, such as information loss in sampling, potential overfitting with added minority class data, and questionable interpretability when generating synthetic samples. Weight adjustments and cost-sensitive methods focus on ensuring accurate classification of the minority class, leading to decreased recognition rates for the majority class.
To address imbalanced data, we approach the problem from a one-class perspective. In the medical data experiments, we used most of the majority-class data for training, with the remaining majority-class data and all minority-class data mixed for testing. The experimental results indicate that the LBNN method outperforms the other traditional algorithms in terms of precision, recall, and G-means. The LBNN method can also be applied to personal wearable devices in advanced healthcare: in this application, the initially collected patient data are predominantly normal, and requiring patients to wear the device until sufficient data are collected for both classes is impractical. The LBNN method provides a viable solution during this transition period of single-class or extremely imbalanced data.
While the LBNN method exhibits good precision and recall for the non-target class, it sacrifices some specificity. Future work may integrate other traditional algorithms to reinforce accuracy on the target class. In terms of computational complexity, the nearest-neighbor search is the most time-consuming step; approximate nearest-neighbor methods could be explored to observe the trade-off between algorithm acceleration and accuracy changes. A challenge also remains in applying the OCNN and LBNN methods in edge computing: they lack the online learning capability, common in some neural network models, of quickly updating internal model parameters as new data arrive. In the OCNN and LBNN models, every prediction requires searching the entire training dataset, leading to time-intensive predictions.

Author Contributions

Conceptualization, W.-H.Z., C.-M.H. and C.-H.L.; methodology, Y.-C.Z. and C.-M.T.; software, C.-S.H.; resources, C.-M.T. and Y.-C.Z.; writing—original draft preparation, W.-H.Z.; writing—review and editing, C.-M.H.; supervision, C.-H.L.; funding acquisition, C.-M.T. and C.-H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This study received funding from the grants of National Science and Technology Council of TAIWAN: NSTC 112-2221-E-182A-005 and NSTC 112-2622-E-110-013.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The KEEL datasets are open source and available at https://sci2s.ugr.es/keel/imbalanced.php (accessed on 19 February 2024); the real medical datasets are not available due to privacy restrictions.

Conflicts of Interest

The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. He, H.; Garcia, E.A. Learning from Imbalanced Data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284. [Google Scholar] [CrossRef]
  2. Alcalá-Fdez, J.; Fernandez, A.; Luengo, J.; Derrac, J.; García, S.; Sánchez, L.; Herrera, F. KEEL Data-Mining Software Tool: Data Set Repository, Integration of Algorithms and Experimental Analysis Framework. J. Mult.-Valued Log. Soft Comput. 2011, 17, 255–287. [Google Scholar]
  3. Sun, W.; Hu, W.; Xue, Z.; Cao, J. Overview of one-class classification. In Proceedings of the 2019 IEEE 4th International Conference on Signal and Image Processing, Wuxi, China, 19–21 July 2019; pp. 6–10. [Google Scholar]
  4. Boukerche, A.; Zheng, L.; Alfandi, O. Outlier Detection: Methods, Models, and Classification. ACM Comput. Surv. CSUR 2020, 53, 1–37. [Google Scholar] [CrossRef]
  5. Breunig, M.M.; Kriegel, H.P.; Ng, R.T.; Sander, J. LOF: Identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, TX, USA, 16–18 May 2000; pp. 93–104. [Google Scholar]
  6. Schölkopf, B.; Williamson, R.C.; Smola, A.J.; Shawe-Taylor, J.; Platt, J.C. Support vector method for novelty detection. Neural Inf. Process. Syst. 1999, 12, 582–588. [Google Scholar]
  7. Hearst, M.A.; Dumais, S.T.; Osuna, E.; Platt, J.; Scholkopf, B. Support vector machines. IEEE Intell. Syst. Their Appl. 1998, 13, 18–28. [Google Scholar] [CrossRef]
  8. Tax, D.M.; Duin, R.P. Support Vector Data Description. Mach. Learn. 2004, 54, 45–66. [Google Scholar] [CrossRef]
  9. Wilson, D.L. Asymptotic Properties of Nearest Neighbor Rules Using Edited Data. IEEE Trans. Syst. Man Cybern. 1972, SMC-2, 408–421. [Google Scholar]
  10. Tomek, I. An experiment with the edited nearest-neighbor rule. IEEE Trans. Syst. Man Cybern. 1976, SMC-6, 448–452. [Google Scholar]
  11. Yin, J.; Yang, Q.; Pan, J.J. Sensor-based abnormal human activity detection. IEEE Trans. Knowl. Data Eng. 2008, 20, 1082–1090. [Google Scholar] [CrossRef]
  12. Mack, B.; Roscher, R.; Waske, B. Can i trust my one-class classification? Remote Sens. 2014, 6, 8779–8802. [Google Scholar] [CrossRef]
  13. Liu, J.; Miao, Q.; Sun, Y.; Song, J.; Quan, Y. Modular ensembles for one-class classification based on density analysis. Neurocomputing 2016, 171, 262–276. [Google Scholar] [CrossRef]
  14. Gan, G.; Ng, M.K.P. K-means clustering with outlier removal. Pattern Recognit. Lett. 2017, 90, 8–14. [Google Scholar] [CrossRef]
  15. Lin, W.C.; Tsai, C.F.; Hu, Y.H.; Jhang, J.S. Clustering-based undersampling in class-imbalanced data. Inf. Sci. 2017, 409, 17–26. [Google Scholar] [CrossRef]
  16. Shah, A.; Ali, B.; Wahab, F.; Ullah, I.; Alqahtani, F.; Tolba, A. A Three-Way Clustering Mechanism to Handle Overlapping Regions. IEEE Access 2024, 12, 6546–6559. [Google Scholar] [CrossRef]
  17. Mohi ud din dar, G.; Bhagat, A.; Ansarullah, S.I.; Othman, M.T.B.; Hamid, Y.; Alkahtani, H.K.; Ullah, I.; Hamam, H. A Novel Framework for Classification of Different Alzheimer's Disease Stages Using CNN Model. Electronics 2023, 12, 469. [Google Scholar] [CrossRef]
  18. Khan, S.S.; Ahmad, A. Relationship between variants of one-class nearest neighbors and creating their accurate ensembles. IEEE Trans. Knowl. Data Eng. 2018, 30, 1796–1809. [Google Scholar] [CrossRef]
  19. Ho, T.K. Random Decision Forests. In Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, Canada, 14–18 August 1995; pp. 278–282. [Google Scholar]
  20. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
  21. Han, H.; Wang, W.Y.; Mao, B.H. Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In Proceedings of the International Conference on Intelligent Computing, Hefei, China, 23–26 August 2005; pp. 878–887. [Google Scholar]
  22. Douzas, G.; Bacao, F.; Last, F. Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE. Inf. Sci. 2018, 465, 1–20. [Google Scholar] [CrossRef]
  23. Sharma, S.; Bellinger, C.; Krawczyk, B.; Zaiane, O.; Japkowicz, N. Synthetic Oversampling with the Majority Class: A New Perspective on Handling Extreme Imbalance. In Proceedings of the 2018 IEEE International Conference on Data Mining (ICDM), Singapore, 17–20 November 2018; pp. 447–456. [Google Scholar] [CrossRef]
  24. Tsai, C.F.; Lin, W.C.; Hu, Y.H.; Yao, G.T. Under-sampling class imbalanced datasets by combining clustering analysis and instance selection. Inf. Sci. 2019, 477, 47–54. [Google Scholar] [CrossRef]
  25. Vuttipittayamongkol, P.; Elyan, E. Neighbourhood-based undersampling approach for handling imbalanced and overlapped data. Inf. Sci. 2020, 509, 47–70. [Google Scholar] [CrossRef]
Figure 1. OC-SVM.
Figure 2. Four kinds of OCNN. The red five-pointed star represents a new data sample. (a) 11NN: the distance from the new sample to its closest data point is $D_1$, and the distance from that point to its own closest data point is $D_{11}$; (b) 1KNN, here with K = 3; (c) J1NN, here with J = 3; (d) JKNN, with J = 3 and K = 2.
Figure 3. K-means and KMOR illustration. The five-pointed stars mark the outliers.
Figure 4. JKNN search illustration.
Figure 5. OCNN experimental framework.
Figure 6. LBNN experimental framework.
Table 1. Relevant parameter settings for algorithms using KMOR on the KEEL datasets.

| Configuration Name | Value |
| --- | --- |
| Number of folds | 5 |
| Inner folds for 1NN(θ), 1NN(θ)-KMOR, JKNN, JKNN-KMOR | 2 |
| Number of rounds | 10 |
| Nu for One-Class SVM | 0.5 |
| Gamma for One-Class SVM | 1/d (d = number of features) |
| ω for 1NN(θ), JKNN | 1.5 |
| θ for 1NN, 3NN, JKNN-KMOR | 1 |
| J, K for 1NN(θ), 1NN(θ)-KMOR | 1 |
| n₀ for 1NN(θ)-KMOR, JKNN-KMOR, LBNN | 0.05 × samples |
| λ for 1NN(θ)-KMOR, JKNN-KMOR, LBNN | 2.5 |
| Iteration (max) for KMOR | 100 |
Table 2. KEEL datasets.

| Dataset | Features | Samples | Minority Samples | Majority Samples | Imbalance Ratio |
| --- | --- | --- | --- | --- | --- |
| Yeast4 | 8 | 1484 | 51 | 1433 | 28.09 |
| Yeast5 | 8 | 1484 | 44 | 1440 | 32.72 |
| Yeast6 | 8 | 1484 | 35 | 1449 | 41.4 |
| Ecoli4 | 7 | 336 | 20 | 316 | 15.8 |
| Glass2 | 9 | 214 | 17 | 197 | 11.58 |
| Glass4 | 9 | 214 | 15 | 199 | 15.47 |
| Segment0 | 19 | 2308 | 329 | 1979 | 6.02 |
| Pageblocks0 | 10 | 5472 | 559 | 4913 | 8.79 |
Table 3. Real datasets.

| Dataset | Features | Samples | Minority Samples | Majority Samples | Imbalance Ratio |
| --- | --- | --- | --- | --- | --- |
| ROSC-CPC12 | 64 | 1071 | 86 | 985 | 11.45 |
| ROSC-30DayS | 64 | 1071 | 207 | 864 | 4.17 |
| Stroke-Poor | 25 | 617 | 164 | 453 | 2.76 |
Table 4. KEEL dataset AUC.

| Dataset | One-Class SVM | 11NN | 11NN(θ) | JKNN | 11NN(θ) KMOR | JKNN KMOR | LBNN |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Yeast4 | 0.473 | 0.511 | 0.501 | 0.503 | 0.562 | 0.511 | 0.534 |
| Yeast5 | 0.502 | 0.589 | 0.540 | 0.599 | 0.579 | 0.600 | 0.605 |
| Yeast6 | 0.467 | 0.640 | 0.600 | 0.584 | 0.597 | 0.607 | 0.626 |
| Ecoli4 | 0.494 | 0.668 | 0.627 | 0.699 | 0.656 | 0.719 | 0.753 |
| Segment0 | 0.652 | 0.726 | 0.802 | 0.803 | 0.810 | 0.814 | 0.861 |
| Glass2 | 0.481 | 0.717 | 0.767 | 0.789 | 0.758 | 0.776 | 0.813 |
| Glass4 | 0.555 | 0.623 | 0.601 | 0.689 | 0.596 | 0.694 | 0.764 |
| Pageblocks0 | 0.521 | 0.771 | 0.781 | 0.807 | 0.784 | 0.813 | 0.855 |
Table 5. KEEL dataset accuracy.

| Dataset | One-Class SVM | 11NN | 11NN(θ) | JKNN | 11NN(θ) KMOR | JKNN KMOR | LBNN |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Yeast4 | 0.501 | 0.574 | 0.183 | 0.427 | 0.612 | 0.468 | 0.681 |
| Yeast5 | 0.577 | 0.706 | 0.317 | 0.517 | 0.613 | 0.551 | 0.701 |
| Yeast6 | 0.497 | 0.675 | 0.515 | 0.525 | 0.612 | 0.518 | 0.777 |
| Ecoli4 | 0.620 | 0.576 | 0.695 | 0.591 | 0.645 | 0.666 | 0.741 |
| Segment0 | 0.808 | 0.965 | 0.892 | 0.976 | 0.934 | 0.969 | 0.869 |
| Glass2 | 0.635 | 0.632 | 0.771 | 0.735 | 0.727 | 0.724 | 0.855 |
| Glass4 | 0.736 | 0.552 | 0.661 | 0.669 | 0.641 | 0.653 | 0.773 |
| Pageblocks0 | 0.542 | 0.960 | 0.936 | 0.979 | 0.943 | 0.982 | 0.883 |
Table 6. KEEL dataset TPR.

| Dataset | One-Class SVM | 11NN | 11NN(θ) | JKNN | 11NN(θ) KMOR | JKNN KMOR | LBNN |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Yeast4 | 0.445 | 0.447 | 0.828 | 0.581 | 0.491 | 0.555 | 0.384 |
| Yeast5 | 0.426 | 0.470 | 0.766 | 0.682 | 0.533 | 0.648 | 0.507 |
| Yeast6 | 0.437 | 0.605 | 0.685 | 0.642 | 0.572 | 0.697 | 0.474 |
| Ecoli4 | 0.365 | 0.846 | 0.759 | 0.907 | 0.676 | 0.823 | 0.775 |
| Segment0 | 0.486 | 0.470 | 0.706 | 0.617 | 0.943 | 0.648 | 0.770 |
| Glass2 | 0.323 | 0.930 | 0.775 | 0.925 | 0.835 | 0.909 | 0.705 |
| Glass4 | 0.370 | 0.761 | 0.718 | 0.727 | 0.509 | 0.775 | 0.746 |
| Pageblocks0 | 0.499 | 0.574 | 0.620 | 0.628 | 0.618 | 0.638 | 0.825 |
Table 7. KEEL dataset TNR.

| Dataset | One-Class SVM | 11NN | 11NN(θ) | JKNN | 11NN(θ) KMOR | JKNN KMOR | LBNN |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Yeast4 | 0.501 | 0.575 | 0.178 | 0.426 | 0.634 | 0.467 | 0.683 |
| Yeast5 | 0.578 | 0.707 | 0.315 | 0.516 | 0.625 | 0.551 | 0.702 |
| Yeast6 | 0.497 | 0.676 | 0.514 | 0.525 | 0.622 | 0.517 | 0.779 |
| Ecoli4 | 0.623 | 0.490 | 0.495 | 0.491 | 0.635 | 0.622 | 0.731 |
| Segment0 | 0.819 | 0.981 | 0.898 | 0.988 | 0.676 | 0.980 | 0.951 |
| Glass2 | 0.640 | 0.503 | 0.755 | 0.652 | 0.680 | 0.616 | 0.921 |
| Glass4 | 0.741 | 0.484 | 0.484 | 0.651 | 0.684 | 0.643 | 0.782 |
| Pageblocks0 | 0.543 | 0.968 | 0.943 | 0.986 | 0.950 (0.038) | 0.989 | 0.884 |
Table 8. KEEL dataset G-means.

| Dataset | One-Class SVM | 11NN | 11NN(θ) | JKNN | 11NN(θ) KMOR | JKNN KMOR | LBNN |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Yeast4 | 0.464 | 0.499 | 0.333 | 0.489 | 0.553 | 0.492 | 0.495 |
| Yeast5 | 0.474 | 0.557 | 0.436 | 0.575 | 0.572 | 0.581 | 0.586 |
| Yeast6 | 0.443 | 0.628 | 0.491 | 0.554 | 0.591 | 0.582 | 0.593 |
| Ecoli4 | 0.432 | 0.642 | 0.564 | 0.658 | 0.632 | 0.703 | 0.751 |
| Segment0 | 0.629 | 0.677 | 0.782 | 0.778 | 0.793 | 0.795 | 0.855 |
| Glass2 | 0.380 | 0.682 | 0.758 | 0.769 | 0.744 | 0.760 | 0.805 |
| Glass4 | 0.373 | 0.605 | 0.561 | 0.673 | 0.564 | 0.678 | 0.756 |
| Pageblocks0 | 0.520 | 0.744 | 0.758 | 0.784 | 0.762 | 0.792 | 0.854 |
Table 9. Evaluation metrics for ROSC-CPC12.

| Evaluation | Logistic Regression | Support Vector Machine | Random Forest | LBNN |
| --- | --- | --- | --- | --- |
| AUC | 0.724 | 0.787 | 0.905 | 0.934 |
| Accuracy | 0.810 | 0.542 | 0.958 | 0.935 |
| Precision | 0.251 | 0.139 | 0.719 | 0.898 |
| Recall | 0.620 | 0.857 | 0.841 | 0.989 |
| Specificity | 0.827 | 0.514 | 0.969 | 0.880 |
| G-means | 0.711 | 0.679 | 0.901 | 0.932 |
Table 10. Evaluation metrics for ROSC-30DayS.

| Evaluation | Logistic Regression | Support Vector Machine | Random Forest | LBNN |
| --- | --- | --- | --- | --- |
| AUC | 0.701 | 0.741 | 0.819 | 0.870 |
| Accuracy | 0.493 | 0.577 | 0.808 | 0.865 |
| Precision | 0.261 | 0.300 | 0.519 | 0.955 |
| Recall | 0.820 | 0.804 | 0.837 | 0.861 |
| Specificity | 0.411 | 0.521 | 0.801 | 0.876 |
| G-means | 0.596 | 0.683 | 0.815 | 0.868 |
Table 11. Evaluation metrics for Stroke-Poor.

| Evaluation | Logistic Regression | Support Vector Machine | Random Forest | LBNN |
| --- | --- | --- | --- | --- |
| AUC | 0.870 | 0.864 | 0.803 | 0.820 |
| Accuracy | 0.830 | 0.806 | 0.797 | 0.829 |
| Precision | 0.710 | 0.583 | 0.608 | 0.882 |
| Recall | 0.540 | 0.769 | 0.532 | 0.849 |
| Specificity | 0.926 | 0.818 | 0.885 | 0.791 |
| G-means | 0.753 | 0.796 | 0.744 | 0.818 |