Article

Cluster Validity Index for Uncertain Data Based on a Probabilistic Distance Measure in Feature Space

Changwan Ko, Jaeseung Baek, Behnam Tavakkol and Young-Seon Jeong
1 Department of Industrial Engineering, Chonnam National University, Gwangju 61186, Republic of Korea
2 College of Business, Northern Michigan University, Marquette, MI 49855, USA
3 Department of Industrial & Systems Engineering, Rutgers University, Piscataway, NJ 08854, USA
4 School of Business, Stockton University, Galloway, NJ 08205, USA
5 Interdisciplinary Program of Arts and Design Technology, Chonnam National University, Gwangju 61186, Republic of Korea
* Author to whom correspondence should be addressed.
Sensors 2023, 23(7), 3708; https://doi.org/10.3390/s23073708
Submission received: 11 March 2023 / Revised: 29 March 2023 / Accepted: 29 March 2023 / Published: 3 April 2023
(This article belongs to the Topic AI and Data-Driven Advancements in Industry 4.0)

Abstract

Cluster validity indices (CVIs), which evaluate clustering results to determine the optimal number of clusters, are critical measures in clustering problems. Most CVIs are designed for typical data objects, called certain data objects, which take only a single value and carry no uncertainty; real-world data, however, frequently include uncertainty. In this study, new CVIs for uncertain data are proposed based on kernel probabilistic distance measures, which calculate the distance between two distributions in feature space, for uncertain clusters with arbitrary shapes, sub-clusters, and noisy objects. By transforming original uncertain data into kernel spaces, the proposed CVIs accurately measure the compactness and separability of a cluster for arbitrary cluster shapes and are robust to noise and outliers in a cluster. The proposed CVIs were evaluated on diverse types of simulated and real-life uncertain objects, confirming that the proposed validity indices in feature space outperform the pre-existing ones in the original space.

1. Introduction

The purpose of clustering is to partition objects into groups such that similarity within a group and dissimilarity between groups are maximized [1,2]. Although clustering methods have been widely used in many applications, most clustering algorithms do not provide the optimal number of clusters. Partition-based clustering algorithms such as K-means clustering [3] must preset the number of clusters [4]. As cluster information is rarely known in the real world, it is crucial to evaluate clustering results under different numbers of clusters. Although many clustering methods exist for diverse applications, such as pattern recognition [5], semiconductor manufacturing [6], and healthcare [7], they have been developed primarily for certain data, i.e., fixed values. However, the uncertainty embedded in data is essential in many applications. For instance, a patient's blood pressure may not be consistent because of environmental conditions and instrument errors. Furthermore, measurement values change continuously with the positions of instrumentation devices or workers' conditions. Beyond these examples, data randomness, missing data, delayed updates, and worker fatigue are other sources of data uncertainty [8,9].
Uncertain data are assumed to be prevalent in the real world, arising, e.g., from measurement errors and environmental conditions. The uncertainty of uncertain data can be expressed by probability density functions (PDFs). Figure 1 illustrates two uncertain data objects, each described by a PDF. The standard way of converting uncertain data is to reduce each object to a summary statistic (e.g., the mean or median), that is, to certain data. However, such statistics discard the extra information that is essential for capturing the uncertainty of uncertain objects.
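To make this loss concrete, the following minimal Python sketch (with hypothetical values) contrasts two one-dimensional uncertain objects that share the same mean but differ in spread: the mean-based distance between them is zero, while a probabilistic distance such as the Bhattacharyya distance is strictly positive.

```python
import numpy as np

# Two hypothetical uncertain objects modeled as univariate normal PDFs:
# the same mean but different uncertainty (standard deviation).
mu1, sigma1 = 5.0, 0.5
mu2, sigma2 = 5.0, 2.0

# Collapsing each object to its mean discards the uncertainty entirely:
mean_distance = abs(mu1 - mu2)            # 0.0 -- the objects look identical

# The Bhattacharyya distance between two univariate normals retains it:
bhatt = (0.25 * (mu1 - mu2) ** 2 / (sigma1 ** 2 + sigma2 ** 2)
         + 0.5 * np.log((sigma1 ** 2 + sigma2 ** 2) / (2.0 * sigma1 * sigma2)))
print(mean_distance, bhatt)               # 0.0 vs. about 0.38
```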
Cluster validity indices (CVIs), which are indicators for validating the quality of clustering results, have been widely used to determine the correct number of clusters for given data. As CVIs only use input data information, they must be chosen according to the characteristics of the data. The two components of a CVI are compactness and separability measures: the former refers to an intra-cluster distance, and the latter to an inter-cluster distance. Most CVIs indicate that a good partition produces a small compactness value and a high separability value. However, the existing CVIs are vulnerable when validating clustering results for clusters that are not spherical [10,11].
For certain data, several CVIs, such as the Dunn [12], Calinski–Harabasz [13], Davies–Bouldin [14], and Xie–Beni [15] indices, have been proposed based on combinations of compactness and separability measures. However, most existing CVIs have been developed for certain data, and there have been few studies on uncertain data. More recent CVIs incorporate additional mathematical machinery into pre-existing CVIs, such as the K-nearest neighbor algorithm, which computes compactness and separation by taking shared/non-shared data pairs into account [10], and principal component analysis, which captures the geometry of the clusters [16]; others develop clustering algorithms that produce more well-separated clusters [1].
To apply the existing CVI formulas to uncertain data, their compactness and separability distance measures must be changed. In a study of uncertain CVIs, Tavakkol et al. [17] proposed CVIs for uncertain data that calculate the distance between two uncertain objects using probabilistic distance measures in the original space. However, distances in the original space are sensitive to arbitrary cluster shapes, sub-clusters, and outliers, which may yield inaccurate compactness and separability values [11].
Consequently, this study proposes new CVIs for uncertain data objects based on kernel probabilistic distance measures in feature space. The proposed CVIs adopt the kernel-based Bhattacharyya probabilistic distance in kernel space. There, the proposed CVIs produce accurate compactness and separability values for arbitrary cluster shapes, which are transformed into approximately elliptical shapes in feature space. Figure 2 illustrates that an ambiguous shape in the original space is transformed into a relatively elliptical, circular shape in feature space; thus, the kernel transformation improves the accuracy of the compactness and separability calculations. Furthermore, the proposed approaches are robust to noise and outliers in a cluster. The performance of the proposed CVIs was evaluated through diverse experiments, including simulated and real-life datasets.
This paper is organized as follows. Section 2 reviews the previous studies on CVIs. New CVIs for uncertain data based on a kernel probabilistic distance measure are proposed in Section 3. After the extensive experiments are presented in Section 4, the conclusions and future studies are provided in Section 5.

2. Related Work

2.1. CVI for Certain Data

In the past few decades, many CVIs have been developed to determine the optimal number of clusters. Most CVIs focus on calculating compactness and separability measures. The combination of the two measures is composed of a ratio-type or summation-type index. This section presents several popular CVIs that have been evaluated in many applications.
The Dunn (DU) index [12]:
$$DU(K) = \frac{\min_{i,j=1,\dots,K,\; i\neq j}\ \min_{x\in C_i,\ y\in C_j} d(x,y)}{\max_{i=1,\dots,K}\ \max_{x,y\in C_i} d(x,y)} \qquad (1)$$
Compactness is computed as the maximum diameter among all clusters, and separability as the minimum pair-wise distance between objects in different clusters. The DU index is the ratio of separability to compactness; thus, the number of clusters that maximizes the DU index is selected as optimal (max. S/C).
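For reference, a minimal Python sketch of the DU index computed from a precomputed pairwise distance matrix and cluster labels; the function name and inputs are our own illustration, not code from the paper.

```python
import numpy as np

def dunn_index(D, labels):
    """DU index (Eq. (1)) from a symmetric pairwise distance matrix D.

    Separability: minimum distance between objects in different clusters.
    Compactness: maximum diameter over all clusters.
    """
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    min_between, max_diameter = np.inf, 0.0
    for ci in clusters:
        in_i = labels == ci
        # Largest within-cluster distance (the diameter of cluster ci).
        max_diameter = max(max_diameter, D[np.ix_(in_i, in_i)].max())
        for cj in clusters:
            if ci < cj:   # each unordered cluster pair once
                min_between = min(min_between,
                                  D[np.ix_(in_i, labels == cj)].min())
    return min_between / max_diameter   # larger is better (max. S/C)
```

The proposed UFSDU index in Section 3.2 reuses this structure, with the Euclidean distance replaced by the kernel probabilistic distance.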
Calinski–Harabasz (CH) index [13]:
$$CH(K) = \frac{\sum_{i=1}^{K} n_i\, d(z_i, z_{tot})^2 / (K-1)}{\sum_{i=1}^{K} \sum_{x\in C_i} d(x, z_i)^2 / (n-K)} \qquad (2)$$
Like the DU index, the CH index is a ratio of separability to compactness, where $n_i$ and $z_i$ are the size and centroid of cluster $C_i$, $z_{tot}$ is the centroid of the entire dataset, and $n$ is the total number of objects. Compactness and separability are computed using the within- and between-cluster sums of squares. Thus, the number of clusters that maximizes CH is optimal (max. S/C).
The Davies–Bouldin (DB) index [14]:
$$DB(K) = \frac{1}{K}\sum_{i=1}^{K} \max_{j=1,\dots,K,\; j\neq i} \frac{\frac{1}{n_i}\sum_{x\in C_i} d(x,z_i)^2 + \frac{1}{n_j}\sum_{y\in C_j} d(y,z_j)^2}{d(z_i, z_j)} \qquad (3)$$
where $z_i$ and $z_j$ are the centroids of clusters $C_i$ and $C_j$. Unlike the DU index, which considers compactness and separability over the total clustering, the DB index computes them per cluster: compactness is the mean squared distance of each cluster's objects to its centroid, and separability is the pair-wise distance between cluster centroids. The DB index is a ratio of compactness to separability, so the number of clusters that minimizes DB is optimal (min. C/S).
The pre-existing CVIs are sensitive to sub-clusters, arbitrary shapes, and noise in clusters for the compactness measure [18]. This study overcomes those drawbacks by conducting a spatial transformation from the original space into feature space using a kernel function that correctly measures cluster compactness and separability.

2.2. CVI for Uncertain Data

Most CVIs have focused on certain data, or fixed values [19]. Certain data carry no uncertainty from factors such as sensor measurement error, repeated measurements by workers, or equipment operating environments. Uncertain data objects come in two possible forms: (1) multiple points for each object and (2) a PDF for each object, either given or obtained by fitting the multiple points [20]. Several studies have addressed clustering of uncertain data; however, CVIs for uncertain data have rarely been studied. CVIs are crucial criteria for validating clustering results [21,22] and finding the appropriate number of clusters. Therefore, the study of CVIs for uncertain data is necessary.
In this study, the proposed CVIs use kernel probabilistic distance measures to compute the distance between two uncertain data objects. Among the popular probabilistic distance measures, such as the Bhattacharyya distance [23], the Wasserstein distance, and the Kullback–Leibler divergence [24], this study adopts the Bhattacharyya distance, which is widely used in diverse applications.
The Bhattacharyya distance between two probability distributions can be calculated for discrete and continuous cases. Let p and q be continuous probability distributions over the same space. The Bhattacharyya distance for the continuous case in the original space is defined as follows:
$$PD_{Bhatt}(p,q) = -\ln \int \sqrt{p(x)\,q(x)}\,dx \qquad (4)$$
There are closed-form solutions for many probabilistic distance measures, including the Bhattacharyya distance, when uncertain data objects are modeled with multivariate normal distributions. As probabilistic distance measures capture the distance between PDFs, they can also be used to capture the distance between uncertain data objects [25]. The Bhattacharyya distance is a special case of the Chernoff distance with parameters $\alpha_1 = \alpha_2 = 1/2$, and its closed form for multivariate normal PDFs is defined in Equation (5):
$$PD_{Bhatt}(p,q) = \frac{1}{8}(\mu_p - \mu_q)^T \left(\frac{\Sigma_p + \Sigma_q}{2}\right)^{-1} (\mu_p - \mu_q) + \frac{1}{2}\ln\frac{\left|\frac{\Sigma_p + \Sigma_q}{2}\right|}{\sqrt{|\Sigma_p||\Sigma_q|}} \qquad (5)$$
where $\mu_p$ and $\mu_q$ are the means, and $\Sigma_p$ and $\Sigma_q$ the covariance matrices, of $P \sim MVN(\mu_p, \Sigma_p)$ and $Q \sim MVN(\mu_q, \Sigma_q)$.
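Equation (5) is straightforward to implement. The following NumPy sketch (an illustrative helper of our own, not code from the paper) computes the closed-form distance for two uncertain objects modeled as multivariate normals.

```python
import numpy as np

def bhattacharyya_mvn(mu_p, cov_p, mu_q, cov_q):
    """Closed-form Bhattacharyya distance (Eq. (5)) between two
    multivariate normal distributions."""
    mu_p, mu_q = np.asarray(mu_p, float), np.asarray(mu_q, float)
    cov_p, cov_q = np.asarray(cov_p, float), np.asarray(cov_q, float)
    cov_avg = 0.5 * (cov_p + cov_q)
    diff = mu_p - mu_q
    mean_term = 0.125 * diff @ np.linalg.solve(cov_avg, diff)
    # slogdet is numerically safer than log(det(...)) for the log term.
    _, logdet_avg = np.linalg.slogdet(cov_avg)
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    cov_term = 0.5 * (logdet_avg - 0.5 * (logdet_p + logdet_q))
    return mean_term + cov_term

# Example: identical means but different covariances still yield a
# strictly positive distance.
d = bhattacharyya_mvn([0, 0], np.eye(2), [0, 0], 4 * np.eye(2))
```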
This study models the Bhattacharyya distance between two uncertain data objects in kernel space; that is, the probabilistic distance between two uncertain data objects is computed in feature space using a kernel function.

3. Proposed CVIs for Uncertain Data

3.1. Kernel Probabilistic Distance Measure in Feature Space

Computing the probabilistic distance is a nontrivial problem. We can compute the Bhattacharyya distance in feature space by following the steps developed by Zhou and Chellappa [26]. Suppose that $x_1 = \{x_{11}, x_{21}, \dots, x_{N1}\}$ and $x_2 = \{x_{12}, x_{22}, \dots, x_{N2}\}$ are the given objects in the original space $\mathbb{R}^d$, with the multivariate normal density function
$$N(x; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right) \qquad (6)$$
The radial basis function (RBF) kernel displayed in Equation (7) can be used to transfer the original data into feature space for calculating the distance between uncertain data objects $x_1$ and $x_2$. The RBF kernel is commonly used in various fields and algorithms because it outperforms other kernel functions [27,28].
$$K_{ij} = \exp\left(-\frac{1}{2\sigma^2}\|x_i - x_j\|^2\right), \quad i,j = 1, 2 \qquad (7)$$
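As a small illustration, Equation (7) can be computed for two sample matrices as follows; `rbf_gram` is our own helper name, and it is reused in the sketch at the end of this subsection.

```python
import numpy as np

def rbf_gram(X, Y, sigma):
    """RBF kernel Gram matrix (Eq. (7)) between the rows of X (N1 x d)
    and the rows of Y (N2 x d)."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))
```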
Given the kernel function $K(x_1, x_2)$ with $x_1, x_2 \in \mathbb{R}^d$, the non-linear mapping $\phi$ and the kernel Gram matrix $K$ are defined by $K = \Phi^T \Phi$, where $\Phi := \Phi_N = [\phi(x_1), \phi(x_2), \dots, \phi(x_N)] \in \mathbb{R}^{f \times N}$ and $f \gg d$ is the dimension of the feature space. The mean $\mu$ and covariance matrix $\Sigma$ in feature space are estimated as:
$$\mu = N^{-1}\sum_{n=1}^{N} \phi(x_n) = \Phi s, \qquad (8)$$
$$\Sigma = N^{-1}\sum_{n=1}^{N} (\phi(x_n) - \mu)(\phi(x_n) - \mu)^T = \Phi J J^T \Phi^T, \qquad (9)$$
where $J = \frac{1}{\sqrt{N}}\left(I_N - s\mathbf{1}^T\right)$, with $s_{N\times 1} = \frac{1}{N}\mathbf{1}$ and $\mathbf{1} = (1, 1, \dots, 1)^T$.
The covariance matrix $\Sigma$ must be converted into an approximate form because it is rank-deficient in the feature space ($f \gg d$). Therefore, we use the following approximation:
$$C = \Phi J J^T \Phi^T + \rho I_f \approx W W^T + \rho I_f = \Phi A \Phi^T + \rho I_f, \qquad (10)$$
where $W \doteq \Phi J Q$, $A \doteq J Q Q^T J^T$, and $\rho$ is a user parameter that must be specified in advance.
Obtaining the matrix $Q$ requires computing the matrix $\Lambda_r$ of the top $r$ eigenvalues and the matrix $V_r$ of the top $r$ eigenvectors of $\bar{K} = J^T K J$, where $r$ is a pre-specified parameter; in this study, $r = 3$ is used. $Q$ is an $N \times r$ matrix calculated as follows:
$$Q \doteq V_r \left(I_r - \rho \Lambda_r^{-1}\right)^{1/2} \qquad (11)$$
Define the matrix $P$ as:
$$P_{(N_1+N_2)\times(r_1+r_2)} = \begin{bmatrix} \alpha_1 J_1 Q_1 & 0 \\ 0 & \alpha_2 J_2 Q_2 \end{bmatrix} \qquad (12)$$
Since the Bhattacharyya distance is a special case of the Chernoff distance, $\alpha_1 = \alpha_2 = 1/2$ is set for all experiments. The $\tau_i$, $i = 1, \dots, r_1 + r_2$, are the eigenvalues of the $(r_1 + r_2) \times (r_1 + r_2)$ matrix $L_{ch}$ given by
$$L_{ch} = P^T \begin{bmatrix} \Phi_1^T \\ \Phi_2^T \end{bmatrix} \begin{bmatrix} \Phi_1 & \Phi_2 \end{bmatrix} P = P^T \begin{bmatrix} K_{11} & K_{12} \\ K_{21} & K_{22} \end{bmatrix} P \qquad (13)$$
The scalar values $\varepsilon_{11}, \varepsilon_{12}, \varepsilon_{22}$ are computed by Equation (14):
$$\varepsilon_{ij} = s_i^T K_{ij} s_j - s_i^T \begin{bmatrix} K_{i1} & K_{i2} \end{bmatrix} B_{ch} \begin{bmatrix} K_{1j} \\ K_{2j} \end{bmatrix} s_j \qquad (14)$$
where $B_{ch} = P\left(\rho I_{r_1+r_2} + L_{ch}\right)^{-1} P^T$, with dimensions $(N_1 + N_2) \times (N_1 + N_2)$.
The kernel-based probabilistic Bhattacharyya distance between two uncertain data objects $x_1$ and $x_2$ in feature space is calculated as follows:
$$KPD_{Bhatt} = 0.5\,\alpha_1 \alpha_2\, \rho^{-1}\left(\varepsilon_{11} + \varepsilon_{22} - 2\varepsilon_{12}\right) + 0.5\left(\sum_{i=1}^{r_1+r_2} \log\frac{\rho + \tau_i}{\lambda_{i,1}} + \sum_{i=1}^{r_1+r_2} \log\frac{\rho + \tau_i}{\lambda_{i,2}}\right), \qquad (15)$$
where $\lambda_{i,j}$, $i = 1, \dots, r_j$, are the eigenvalues of $C_j$, padded with $\rho$:
$$\lambda_{i,j} = \begin{cases} \lambda_{i,j}, & i = 1, \dots, r_j \\ \rho, & i = r_j + 1, \dots, r_1 + r_2 \end{cases} \qquad (16)$$
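Putting the steps of this subsection together, the sketch below transcribes Equations (8)–(16) into NumPy for two uncertain objects given as sample matrices, with $\alpha_1 = \alpha_2 = 1/2$. It reuses the `rbf_gram` helper defined above; the variable names, eigenvalue ordering, and numerical safeguards are our own choices, and the transcription should be verified against the original derivation in [26] before serious use.

```python
import numpy as np

def kpd_bhatt(X1, X2, sigma=1.0, rho=1e-3, r=3, a1=0.5, a2=0.5):
    """Kernel Bhattacharyya distance between two uncertain objects given
    as sample matrices X1 (N1 x d) and X2 (N2 x d); a sketch of
    Equations (8)-(16). Requires r <= min(N1, N2)."""
    Xs, Ns = [X1, X2], [len(X1), len(X2)]
    # Gram blocks K_ij from Eq. (7), reusing rbf_gram defined above.
    K = [[rbf_gram(Xs[i], Xs[j], sigma) for j in range(2)] for i in range(2)]
    Js, Qs, lams, svecs = [], [], [], []
    for j in range(2):
        N = Ns[j]
        s = np.full((N, 1), 1.0 / N)                      # s = (1/N) * 1
        J = (np.eye(N) - s @ np.ones((1, N))) / np.sqrt(N)
        w, V = np.linalg.eigh(J.T @ K[j][j] @ J)          # eigenpairs of K-bar
        idx = np.argsort(w)[::-1][:r]                     # keep the top r
        Lr = np.maximum(w[idx], 1e-12)                    # guard tiny eigenvalues
        Q = V[:, idx] @ np.diag(np.sqrt(np.clip(1.0 - rho / Lr, 0.0, None)))
        Js.append(J); Qs.append(Q); lams.append(Lr); svecs.append(s)
    # Block-diagonal P (Eq. (12)) and L_ch (Eq. (13)).
    P = np.block([[a1 * Js[0] @ Qs[0], np.zeros((Ns[0], r))],
                  [np.zeros((Ns[1], r)), a2 * Js[1] @ Qs[1]]])
    Kfull = np.block([[K[0][0], K[0][1]], [K[1][0], K[1][1]]])
    Lch = P.T @ Kfull @ P
    tau = np.sort(np.linalg.eigvalsh(Lch))[::-1]
    Bch = P @ np.linalg.inv(rho * np.eye(2 * r) + Lch) @ P.T
    def eps(i, j):                                        # Eq. (14)
        left = np.hstack([K[i][0], K[i][1]])
        right = np.vstack([K[0][j], K[1][j]])
        return (svecs[i].T @ K[i][j] @ svecs[j]
                - svecs[i].T @ left @ Bch @ right @ svecs[j]).item()
    e11, e22, e12 = eps(0, 0), eps(1, 1), eps(0, 1)
    # Eigenvalues of C_j padded with rho (Eq. (16)).
    lam1 = np.concatenate([lams[0], np.full(r, rho)])
    lam2 = np.concatenate([lams[1], np.full(r, rho)])
    term1 = 0.5 * a1 * a2 / rho * (e11 + e22 - 2.0 * e12)
    term2 = 0.5 * (np.log((rho + tau) / lam1).sum()
                   + np.log((rho + tau) / lam2).sum())    # Eq. (15)
    return term1 + term2
```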

3.2. New CVI for Uncertain Data

The uncertain data objects in each cluster are transformed into feature space by applying a kernel function, and compactness and separability are computed there. The mapped uncertain data objects are used to compute the distances within and between clusters, from which compactness, separability, and hence the values of the proposed CVIs are obtained. The calculated index values change with the number of clusters K. The proposed uncertain feature space DU (UFSDU) and uncertain feature space CH (UFSCH) indices are defined in Equations (17) and (18), respectively:
UFSDU index:
$$UFSDU(K) = \frac{\min_{i,j=1,\dots,K,\; i\neq j}\ \min_{x\in C_i,\ y\in C_j} KPD_{Bhatt}(x,y)}{\max_{i=1,\dots,K}\ \max_{x,y\in C_i} KPD_{Bhatt}(x,y)} \qquad (17)$$
UFSCH index:
$$UFSCH(K) = \frac{\sum_{i=1}^{K} n_i\, KPD_{Bhatt}(z_i, z_{tot})^2 / (K-1)}{\sum_{i=1}^{K} \sum_{x\in C_i} KPD_{Bhatt}(x, z_i)^2 / (n-K)} \qquad (18)$$
These proposed CVI equations are similar to the DU and CH indices, except that the distance term $KPD_{Bhatt}(x, y)$ is the distance between two uncertain data objects in feature space, computed by Equation (15).
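Continuing the illustrative sketches above (and reusing the hypothetical `dunn_index` and `kpd_bhatt` helpers), the proposed indices can be evaluated from a precomputed matrix of kernel probabilistic distances. Because uncertain objects have no mean object in feature space, the sketch uses medoid indices as stand-ins for the cluster representatives $z_i$ and $z_{tot}$, which is an assumption on our part consistent with the K-medoids procedure in Section 4.

```python
import numpy as np

def kpd_matrix(X_objs, sigma=1.0, rho=1e-3, r=3):
    """Pairwise KPD_Bhatt matrix over a list of per-object sample matrices."""
    n = len(X_objs)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = kpd_bhatt(X_objs[i], X_objs[j], sigma, rho, r)
    return D

def ufsdu(D, labels):
    """UFSDU (Eq. (17)): the DU index over kernel probabilistic distances."""
    return dunn_index(D, labels)

def ufsch(D, labels, medoids, tot_medoid):
    """UFSCH (Eq. (18)) with medoids standing in for the centroids;
    labels[i] in {0, ..., K-1} indexes the `medoids` list."""
    labels = np.asarray(labels)
    n, K = len(labels), len(medoids)
    between = sum((labels == k).sum() * D[m, tot_medoid] ** 2
                  for k, m in enumerate(medoids)) / (K - 1)
    within = sum(D[i, medoids[labels[i]]] ** 2 for i in range(n)) / (n - K)
    return between / within
```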

4. Experimental Results

In this study, we propose two CVIs that calculate probabilistic distances between uncertain data objects in feature space. The K-medoids clustering algorithm proposed by Jiang et al. [19] was used to compare the performance of the proposed CVIs. The K-medoids algorithm is widely used in clustering problems and employs probabilistic distance measures to capture the similarity between uncertain objects. It differs from the popular K-means algorithm in its robustness to outliers: K-means represents each cluster by the mean of all objects in the cluster, whereas K-medoids calculates the distance between every uncertain data object and the medoid within a cluster [19]. The uncertain data object with the smallest total distance is then assigned as the new medoid of the cluster. In the experiments, we varied the number of clusters K and used the Bhattacharyya distance measure to compute distances between different uncertain data objects in feature space.

4.1. Experimental Procedure for Uncertain Data

Experiments were performed with artificial and real-world datasets that may contain sub-clusters and clusters with asymmetrical, arbitrary, and noisy shapes to evaluate the performance of the proposed CVIs. Each feature of the datasets was normalized to reduce the scale gap between different features, as defined in Equation (19):
$$x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}}, \qquad (19)$$
where $x_{min}$ and $x_{max}$ are the minimum and maximum values of a feature of the dataset. We then simulated uncertain data objects from certain data objects by following the methodology of [20].
The pre-existent DU and CH indices were applied to uncertain data objects in the original space, denoted uncertain original space DU (UOSDU) and uncertain original space CH (UOSCH), to provide baselines for the proposed CVIs. The overall experimental procedure is represented by Algorithm 1. The procedure used to compare the performance of the proposed CVIs with that of the previous CVIs was as follows. The inputs included the number of uncertain data objects $N$, the number of object features $M$, and the number of clusters $K$. We modeled the uncertain data with multivariate normal distributions; the means of the distributions were the original certain data points, and the covariances were generated as follows:
$$f(S_{ik} \mid \Psi_k, df_k) = \frac{|\Psi_k|^{df_k/2}}{2^{p\,df_k/2}\,\Gamma_p(df_k/2)}\,|S_{ik}|^{-\frac{df_k+p+1}{2}}\,e^{-\frac{1}{2}tr(\Psi_k S_{ik}^{-1})}, \quad i = 1, \dots, n_k,\ k = 1, \dots, K \qquad (20)$$
where $S_{ik}$ is the covariance matrix for object $i$ in class $k$, following the inverse Wishart PDF [29] defined in Equation (20) [20]. $\Psi_k$ is a positive definite scale matrix, $df_k$ is the degrees of freedom, $p$ is the dimension of $S_{ik}$, $tr$ is the trace of a matrix, and $\Gamma_p$ is the multivariate gamma function.
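For instance, covariance matrices following Equation (20) can be drawn with SciPy's inverse Wishart sampler and paired with samples around each original certain point; the parameter values below are illustrative, not those used in the paper.

```python
import numpy as np
from scipy.stats import invwishart, multivariate_normal

rng = np.random.default_rng(0)
p = 2                            # feature dimension
df_k = p + 2                     # degrees of freedom (illustrative)
Psi_k = 0.05 * np.eye(p)         # positive definite scale matrix (illustrative)

def simulate_uncertain_object(x_certain, n_samples=30):
    """Turn one certain point into an uncertain object: draw its covariance
    from the inverse Wishart distribution (Eq. (20)) and sample around it."""
    S = invwishart.rvs(df=df_k, scale=Psi_k, random_state=rng)
    return multivariate_normal.rvs(mean=x_certain, cov=S,
                                   size=n_samples, random_state=rng)

obj = simulate_uncertain_object(np.array([0.3, 0.7]))  # a (30 x 2) sample matrix
```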
Algorithm 1: K-medoids for uncertain data using a probabilistic distance measure in feature space.
1.  Input: n: the number of objects; K: the number of clusters; iter = 0
2.  Randomly select the cluster medoids C^0 = {c_1^0, ..., c_K^0} from the initial clusters
3.  Initialize CVIs = {cvi_2, ..., cvi_K}, obtained for UOSDU, UOSCH, UFSDU, and UFSCH
4.  Repeat
5.    for k = 2 to K
6.      c_k^old = c_k^0; c_k^new = 0
7.      Compute the new medoids:
8.      while c_k^old ≠ c_k^new
9.        p = argmin_{1 ≤ i ≤ n} Σ_{j=1}^{k} KPD_Bhatt(x_i, c_j^k), where j indexes the cluster medoids in c^k
10.       c_k^new = x_p
11.     end
12.     Calculate cvi_k using Equations (1), (2), (17), and (18)
13.   end
14.   iter = iter + 1
15. Until iter = Maxiter
Step 1: Randomly set K initial clusters of uncertain objects for a given dataset and run the K-medoids clustering algorithm for different values of K (2 ≤ K ≤ 10).
Step 2: Obtain the medoid of each cluster, i.e., the object with the smallest sum of probabilistic distances to the other objects in the cluster.
Step 3: Calculate the CVIs for all partitions. Compactness and separability were computed in kernel space using an RBF kernel with bandwidth σ; the value of σ was selected from {0.1, 0.2, …, 4} through a set of preliminary experiments.
Step 4: To increase the reliability of the results, each experiment was replicated 100 times with different random seeds for the initial medoids in Step 1, and the average CVI value was used for each number of clusters.
Step 5: Finally, the number of clusters suggested by each CVI was compared with the actual number of clusters in the dataset.
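A compact Python sketch of the loop described by Algorithm 1 and Steps 1–5: K-medoids over a precomputed KPD matrix, replicated over random restarts. The names and the simplified restart logic are our own, and the sketch assumes no cluster becomes empty.

```python
import numpy as np

def k_medoids(D, K, max_iter=100, rng=None):
    """K-medoids over a precomputed distance matrix D (here, KPD values)."""
    rng = rng or np.random.default_rng()
    n = D.shape[0]
    medoids = rng.choice(n, size=K, replace=False)
    for _ in range(max_iter):
        labels = np.argmin(D[:, medoids], axis=1)     # nearest-medoid assignment
        # New medoid of each cluster: the member minimizing the
        # within-cluster distance sum.
        new_medoids = np.array([
            np.where(labels == k)[0][
                np.argmin(D[np.ix_(labels == k, labels == k)].sum(axis=1))]
            for k in range(K)])
        if np.array_equal(np.sort(new_medoids), np.sort(medoids)):
            break                                     # medoids stabilized
        medoids = new_medoids
    return labels, medoids

# Steps 1-4: replicate over random seeds and average a CVI for each K, e.g.
# scores = {K: np.mean([ufsdu(KPD, k_medoids(KPD, K)[0]) for _ in range(100)])
#           for K in range(2, 11)}
```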

4.2. Experiments with Artificial and Real-World Datasets

Experiments were conducted to evaluate the proposed CVIs in comparison to the pre-existent CVIs. These experiments used 10 datasets with challenging characteristics, containing arbitrariness, sub-clusters, asymmetry, and noise, provided by the UCI repository (https://archive.ics.uci.edu/, accessed on 10 March 2023) [30] and the Tomas Barton repository (https://github.com/deric/clustering-benchmark, accessed on 10 March 2023), which holds 122 artificial datasets with arbitrary, sub-clustered, and asymmetric shapes in two or three features. The datasets from the UCI repository (e.g., D3, D4, D5, and D7) were collected under real environmental conditions, whereas the other datasets were artificially created, as documented in the Tomas Barton repository.
The summary of datasets used for the experiments is presented in Table 1. Two-dimensional (2D) and three-dimensional (3D) dataset shapes are illustrated in Figure 3. The CVI values were computed by changing the number of clusters (K) in each dataset and then comparing the predicted labels of experiments to the actual labels in the datasets.

4.3. Performance Comparison of the Proposed CVIs

The experimental results are given in Tables 2–11. The actual number of clusters is shown below the name of each dataset and is marked with an asterisk (*) in the header row. All the results are summarized in Table 12, which quantifies the performance of the proposed CVIs; each cell in Table 12 shows the number of clusters K selected by the corresponding CVI, with a circled dot (⨀) marking a correct selection. For each CVI, the selected K is the one with the maximum index value in its row.
As presented in Table 2, three of the CVIs succeeded in estimating the number of clusters as two in D1; UOSCH failed. The proposed UFSDU and UFSCH also successfully predicted the number of clusters in D2, whereas UOSDU failed.
Although the proposed UFSDU index and the pre-existent CVIs failed to predict the number of clusters in D3, UFSCH was successful. All CVIs correctly predicted the number of clusters for some datasets; see Tables 5, 7 and 8. In contrast, the proposed UFSDU index is the only CVI that correctly predicted the actual number of clusters in D5, as presented in Table 6. Furthermore, the UFSDU index predicted the actual number of clusters of D8. The shape of D8 (Figure 3) visually separates into two distinct classes, but calculating the compactness and separability of such clusters in the original space is challenging. Nevertheless, the UFSDU index succeeded, and UFSCH predicted three clusters, close to the actual number of two. The kernel transformation yields more accurate compactness and separability in the feature space than in the original space, leading to high-performance clustering.
The UOSCH index and the new CVIs predicted the number of clusters to be three in D9, and the UOSDU and UFSCH indices successfully estimated the number of clusters in D10. Table 12 summarizes the results for the 10 datasets above, where a circled dot (⨀) indicates that the CVI accurately predicted the actual number of clusters. As presented in Table 12, the pre-existent CVIs precisely estimated the number of clusters for five experimental datasets, whereas the newly proposed CVIs accurately predicted the number of clusters for eight datasets, three more than the pre-existent CVIs.

5. Conclusions

In this study, we proposed novel cluster validity indices (CVIs) for uncertain data objects in feature space. Unlike conventional CVIs in the original space, the proposed CVIs handle uncertain data objects in clusters with arbitrary shapes, sub-clusters, and noise, which are hard to evaluate, by transforming the uncertain data from the original space into the feature space via a kernel function. The proposed CVIs measure the compactness and separability of each cluster in kernel space, which maps the original data into a higher-dimensional space, making them less sensitive to arbitrary cluster shapes and more robust to noise and outliers. We compared the performance of the proposed CVIs with that of pre-existent CVIs that only consider the original space. The Bhattacharyya distance, one of the most widely used probabilistic distance measures, was used to capture the distances between probability density functions in experiments with several artificial and real-life datasets. These numerical examples confirmed that the proposed CVIs are robust to arbitrary cluster shapes, especially sub-clusters, and are promising alternatives for evaluating the fitness of clustering results and finding the optimal number of clusters K. The proposed CVIs outperform the pre-existent CVIs because the kernel function transforms the uncertain data from the original space into the feature space.
As for practical significance, the proposed CVIs could be utilized in diverse applications. For example, Kim et al. proposed a new multivariate kernel density estimator for uncertain data classification of mixed defect patterns on DRAM wafer maps [31]; the proposed CVIs could be applied to evaluate the number of defect patterns on wafer maps. However, there are some limitations to the proposed CVIs. The uncertain data are assumed in advance to follow multivariate normal distributions when computing the distances between different uncertain data objects. In practice, the uncertainty may follow a variety of probability functions (normal distribution, exponential distribution, etc.), and some cannot be strictly modeled by PDFs. This might be overcome through methods for generating random variables or through support-measure data description, a non-parametric machine learning method that does not require a prior distributional assumption.
Future research should incorporate the compactness measure in kernel space into advanced machine learning algorithms, such as support vector data description or its Bayesian framework. The concepts of our CVIs can also be applied to other clustering algorithms.

Author Contributions

Conceptualization, Y.-S.J.; data curation, C.K.; formal analysis, Y.-S.J.; investigation, B.T. and Y.-S.J.; methodology, C.K. and Y.-S.J.; resources, B.T.; software, B.T.; supervision, Y.-S.J.; validation, J.B.; visualization, J.B.; writing—original draft, C.K.; writing—review and editing, J.B., B.T. and Y.-S.J. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the LG Yonam Foundation (of the Republic of Korea) and by National Research Foundation of Korea Grants (No. NRF-2021S1A5A8060639 and NRF-2022R1F1A1063174).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The real-world datasets used in this study are available at: https://archive.ics.uci.edu/ml/index.php accessed on 10 March 2023; the artificial datasets that contain data sensitive to shapes are available at: https://github.com/deric/clustering-benchmark/tree/master/ accessed on 10 March 2023.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
Bhatt: Bhattacharyya distance measure
C/S: Compactness/Separability
CH: Calinski–Harabasz
CVIs: Cluster validity indices
DB: Davies–Bouldin
DU: Dunn
KPD: Kernel-based probabilistic distance
PD: Probabilistic distance
PDF: Probability density function
RBF: Radial basis function
S/C: Separability/Compactness
UFSCH: Uncertain feature space CH
UFSDU: Uncertain feature space DU
UOSCH: Uncertain original space CH
UOSDU: Uncertain original space DU

References

1. Abdalameer, A.K.; Alswaitti, M.; Alsudani, A.A.; Isa, N.A. A new validity clustering index-based on finding new centroid positions using the mean of clustered data to determine the optimum number of clusters. Expert Syst. Appl. 2022, 191, 116329.
2. Irani, J.; Pise, N.; Phatak, M. Clustering techniques and the similarity measures used in clustering: A survey. Int. J. Comput. Appl. Technol. 2016, 134, 9–14.
3. MacQueen, J.B. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 27 December 1965–7 January 1966; The Regents of the University of California: Santa Barbara, CA, USA, 1967; pp. 281–297.
4. Li, M.J.; Ng, M.K.; Cheung, Y.-m.; Huang, J.Z. Agglomerative fuzzy k-means clustering algorithm with selection of number of clusters. IEEE Trans. Knowl. Data Eng. 2008, 20, 1519–1534.
5. Mahesh Kumar, K.; Rama Mohan Reddy, A. A fast DBSCAN clustering algorithm by accelerating neighbor searching using groups method. Pattern Recognit. 2016, 58, 39–48.
6. Chien, C.-F.; Wang, W.-C.; Cheng, J.-C. Data mining for yield enhancement in semiconductor manufacturing and an empirical study. Expert Syst. Appl. 2007, 33, 192–198.
7. El-shafeiy, E.; Sallam, K.M.; Chakrabortty, R.K.; Abohany, A.A. A clustering based swarm intelligence optimization technique for the internet of medical things. Expert Syst. Appl. 2021, 173, 114648.
8. Aggarwal, C.C.; Yu, P.S. A survey of uncertain data algorithms and applications. IEEE Trans. Knowl. Data Eng. 2009, 21, 609–623.
9. Shou, L.; Zhang, X.; Chen, G.; Gao, Y.; Chen, K. Mud: Mapping-based query processing for high-dimensional uncertain data. Inf. Sci. 2012, 198, 147–168.
10. Duan, X.; Ma, Y.; Zhou, Y.; Huang, H.; Wang, B. A novel cluster validity index based on augmented non-shared nearest neighbors. Expert Syst. Appl. 2023, 223, 119784.
11. Lee, S.-H.; Jeong, Y.-S.; Kim, J.-Y.; Jeong, M.K. A new clustering validity index for arbitrary shape of clusters. Pattern Recognit. Lett. 2018, 112, 263–269.
12. Dunn, J.C. Well-separated clusters and optimal fuzzy partitions. J. Cybern. 1974, 4, 95–104.
13. Calinski, T.; Harabasz, J. A dendrite method for cluster analysis. Commun. Stat.-Theory Methods 1974, 3, 1–27.
14. Davies, D.L.; Bouldin, D.W. A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. 1979, PAMI-1, 224–227.
15. Xie, X.L.; Beni, G. A validity measure for fuzzy clustering. IEEE Trans. Pattern Anal. Mach. Intell. 1991, 13, 841–847.
16. Rojas-Thomas, J.C.; Santos, M.; Mora, M. New internal index for clustering validation based on graphs. Expert Syst. Appl. 2017, 86, 334–349.
17. Tavakkol, B.; Jeong, M.K.; Albin, S.L. Validity indices for clusters of uncertain data objects. Ann. Oper. Res. 2018, 303, 321–357.
18. Wang, J.-S.; Chiang, J.-C. A cluster validity measure with a hybrid parameter search method for the support vector clustering algorithm. Pattern Recognit. 2008, 41, 506–520.
19. Jiang, B.; Pei, J.; Tao, Y.; Lin, X. Clustering uncertain data based on probability distribution similarity. IEEE Trans. Knowl. Data Eng. 2013, 25, 751–763.
20. Tavakkol, B.; Jeong, M.K.; Albin, S.L. Object-to-group probabilistic distance measure for uncertain data classification. Neurocomputing 2017, 230, 143–151.
21. Arbelaitz, O.; Gurrutxaga, I.; Muguerza, J.; Pérez, J.M.; Perona, I. An extensive comparative study of cluster validity indices. Pattern Recognit. 2013, 46, 243–256.
22. Rezaee, B. A cluster validity index for fuzzy clustering. Fuzzy Sets Syst. 2010, 161, 3014–3025.
23. Bhattacharyya, A. On a measure of divergence between two multinomial populations. Sankhya Indian J. Stat. 1946, 7, 401–406.
24. Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
25. Tavakkol, B.; Son, Y. Fuzzy kernel K-medoids clustering algorithm for uncertain data objects. Pattern Anal. Appl. 2021, 24, 1287–1302.
26. Zhou, S.K.; Chellappa, R. From sample similarity to ensemble similarity: Probabilistic distance measures in reproducing kernel Hilbert space. IEEE Trans. Pattern Anal. Mach. Intell. 2006, 28, 917–929.
27. Patle, A.; Chouhan, D.S. SVM kernel functions for classification. In Proceedings of the 2013 International Conference on Advances in Technology and Engineering (ICATE), Mumbai, India, 23–25 January 2013.
28. Tbarki, K.; Ben Said, S.; Ksantini, R.; Lachiri, Z. RBF kernel based SVM classification for landmine detection and discrimination. In Proceedings of the 2016 International Image Processing, Applications and Systems (IPAS), Sfax, Tunisia, 5–7 November 2016.
29. Nydick, S.W. The Wishart and inverse Wishart distributions. Electron. J. Stat. 2012, 6, 1–19.
30. UCI Machine Learning Repository. Available online: https://archive.ics.uci.edu/ (accessed on 28 March 2023).
31. Kim, B.; Jeong, Y.-S.; Jeong, M.K. New multivariate kernel density estimator for uncertain data classification. Ann. Oper. Res. 2020, 303, 413–431.
Figure 1. Two uncertain data objects, each expressed by a PDF.
Figure 2. Visualization of kernel transformation: (a) asymmetric shape in original space; (b) transformed shape in feature space.
Figure 3. Shapes of 2D and 3D datasets: (a) D1 dataset; (b) D2 dataset; (c) D7 dataset; (d) D8 dataset; (e) D9 dataset; (f) D10 dataset.
Table 1. Summary of datasets.
Dataset Index | Dataset Name | # of Obs. | # of Dim. | # of Clusters | Projection Shape
D1 | A.K. Jain's Toy | 373 | 2 | 2 | Asymmetry, arbitrary shape
D2 | Flame | 240 | 2 | 2 | Sub-cluster, noise
D3 | Iris | 150 | 4 | 3 | -
D4 | Thyroid | 215 | 5 | 2 | -
D5 | Wine | 178 | 13 | 3 | -
D6 | Wisconsin | 683 | 9 | 2 | -
D7 | Haberman | 301 | 3 | 2 | Random shape
D8 | Chainlink | 1000 | 3 | 2 | Sub-cluster, arbitrary shape
D9 | Lsun | 400 | 2 | 3 | Asymmetry, arbitrary shape
D10 | Zelnik1 | 299 | 2 | 3 | Sub-cluster
Table 2. Performance results for D1 (actual number of clusters: 2).
CVI | K = 2 * | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
UOSDU | 0.00075 | 0.00063 | 0.00049 | 0.00046 | 0.00043 | 0.00047 | 0.00044 | 0.00042 | 0.000410
UOSCH | 554.4796 | 537.8279 | 573.5387 | 586.5310 | 576.5872 | 562.0666 | 575.2021 | 566.6556 | 567.6008
UFSDU | 0.011830 | 0.00727 | 0.007410 | 0.006350 | 0.006920 | 0.006390 | 0.006740 | 0.00580 | 0.005630
UFSCH | 256.0945 | 204.767 | 167.9338 | 149.5915 | 138.3076 | 128.206 | 122.4676 | 117.0263 | 112.4593
Table 3. Performance results for D2 (actual number of clusters: 2).
CVI | K = 2 * | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
UOSDU | 0.00578 | 0.00581 | 0.00583 | 0.00533 | 0.00494 | 0.00494 | 0.00452 | 0.00454 | 0.00448
UOSCH | 218.9052 | 188.6698 | 201.7685 | 195.0877 | 190.2412 | 190.7961 | 192.3785 | 187.7774 | 186.0032
UFSDU | 0.01875 | 0.01433 | 0.01619 | 0.01386 | 0.01284 | 0.01261 | 0.01263 | 0.0125 | 0.01271
UFSCH | 246.7711 | 190.3472 | 184.7522 | 163.52 | 150.3938 | 143.1108 | 138.9139 | 131.6189 | 127.3284
Table 4. Performance results for D3 (actual number of clusters: 3).
CVI | K = 2 | 3 * | 4 | 5 | 6 | 7 | 8 | 9 | 10
UOSDU | 0.57393 | 0.18691 | 0.06671 | 0.04599 | 0.03375 | 0.03045 | 0.02475 | 0.02443 | 0.02427
UOSCH | 393.8149 | 340.7616 | 288.9103 | 257.4766 | 227.8328 | 211.7321 | 193.9894 | 179.4227 | 172.1492
UFSDU | 0.78121 | 0.05291 | 0.0332 | 0.02818 | 0.0201 | 0.02217 | 0.02033 | 0.01676 | 0.01503
UFSCH | 97.24412 | 100.9677 | 83.47847 | 74.68629 | 65.08186 | 59.80128 | 55.42499 | 51.32508 | 48.54411
Table 5. Performance results for D4 (actual number of clusters: 2).
CVI | K = 2 * | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
UOSDU | 0.01059 | 0.00702 | 0.00447 | 0.00389 | 0.00338 | 0.00285 | 0.0029 | 0.00264 | 0.00254
UOSCH | 52.44662 | 49.27229 | 45.29772 | 44.23136 | 46.29286 | 43.05835 | 40.0334 | 38.99379 | 36.43862
UFSDU | 0.09045 | 0.02678 | 0.02097 | 0.01941 | 0.0186 | 0.0166 | 0.01728 | 0.01604 | 0.01577
UFSCH | 88.16833 | 63.62494 | 54.54528 | 48.32164 | 43.33752 | 38.65073 | 35.53446 | 32.89777 | 30.6346
Table 6. Performance results for D5 (actual number of clusters: 3).
CVI | K = 2 | 3 * | 4 | 5 | 6 | 7 | 8 | 9 | 10
UOSDU | 0.28546 | 0.19218 | 0.16953 | 0.13451 | 0.13042 | 0.12188 | 0.1222 | 0.11775 | 0.11498
UOSCH | 46.98845 | 41.61822 | 34.08324 | 29.45127 | 26.66111 | 23.71564 | 21.97848 | 20.8878 | 19.0692
UFSDU | 0.1351 | 0.13992 | 0.12361 | 0.11102 | 0.1058 | 0.10544 | 0.10343 | 0.10402 | 0.10242
UFSCH | 166.5115 | 94.11775 | 70.17926 | 57.15066 | 48.48219 | 42.44803 | 38.19718 | 34.55733 | 31.1674
Table 7. Performance results for D6 (actual number of clusters: 2).
CVI | K = 2 * | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
UOSDU | 0.10223 | 0.04719 | 0.02262 | 0.01209 | 0.00742 | 0.00342 | 0.0014 | 0.00109 | 0.00075
UOSCH | 237.829 | 186.8503 | 145.4631 | 119.3866 | 98.36381 | 89.72379 | 80.18472 | 70.83073 | 66.12163
UFSDU | 0.22631 | 0.10763 | 0.04928 | 0.03902 | 0.01416 | 0.01228 | 0.0084 | 0.00605 | 0.00391
UFSCH | 349.3685 | 261.4169 | 205.8692 | 171.2457 | 144.4285 | 124.5582 | 109.2653 | 97.50292 | 88.98401
Table 8. Performance results for D7 (actual number of clusters: 2).
CVI | K = 2 * | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
UOSDU | 0.00198 | 0.0014 | 0.00112 | 0.00086 | 0.00078 | 0.00089 | 0.00069 | 0.00079 | 0.00076
UOSCH | 128.8359 | 117.8517 | 104.8203 | 97.56451 | 95.82686 | 92.17925 | 86.85381 | 84.98897 | 82.71107
UFSDU | 0.13021 | 0.02577 | 0.01681 | 0.01199 | 0.01108 | 0.01122 | 0.01132 | 0.01028 | 0.00945
UFSCH | 319.3255 | 171.7169 | 127.0638 | 104.5919 | 90.63319 | 80.94442 | 72.86994 | 67.51974 | 62.62473
Table 9. Performance results for D8 (actual number of clusters: 2).
CVI | K = 2 * | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
UOSDU | 0.00019 | 0.00017 | 0.00017 | 0.00017 | 0.00017 | 0.00018 | 0.00021 | 0.00019 | 0.00017
UOSCH | 419.8882 | 371.9768 | 388.8548 | 430.2229 | 426.5956 | 430.8854 | 449.3122 | 438.7834 | 417.3569
UFSDU | 0.00439 | 0.00237 | 0.00204 | 0.00114 | 0.0013 | 0.00118 | 0.00149 | 0.00153 | 0.0014
UFSCH | 445.5408 | 463.2664 | 449.4758 | 439.8262 | 425.4487 | 411.5018 | 422.1565 | 428.8755 | 437.9047
Table 10. Performance results for D9 (actual number of clusters: 3).
CVI | K = 2 | 3 * | 4 | 5 | 6 | 7 | 8 | 9 | 10
UOSDU | 0.01277 | 0.00168 | 0.00087 | 0.00081 | 0.00069 | 0.00062 | 0.0006 | 0.00063 | 0.00054
UOSCH | 316.7407 | 406.3877 | 395.188 | 401.578 | 380.968 | 363.1193 | 365.4242 | 349.9761 | 351.8199
UFSDU | 0.01439 | 0.02006 | 0.01697 | 0.01119 | 0.00658 | 0.00574 | 0.00485 | 0.00472 | 0.00416
UFSCH | 190.3465 | 205.1745 | 189.6315 | 175.6124 | 164.8462 | 154.2108 | 149.907 | 141.5702 | 133.6363
Table 11. Performance results for D10 (actual number of clusters: 3).
CVI | K = 2 | 3 * | 4 | 5 | 6 | 7 | 8 | 9 | 10
UOSDU | 0.030644 | 0.049296 | 0.048849 | 0.048798 | 0.046752 | 0.044594 | 0.042478 | 0.037749 | 0.041905
UOSCH | 235.4205 | 161.3342 | 142.117 | 135.4194 | 127.012 | 126.4954 | 125.6673 | 123.9964 | 132.4379
UFSDU | 0.00368 | 0.00123 | 0.00123 | 0.00115 | 0.00103 | 0.00087 | 0.00073 | 0.00077 | 0.00056
UFSCH | 102.6013 | 106.5976 | 99.7133 | 98.79822 | 97.68495 | 95.67929 | 95.82844 | 96.62246 | 102.6371
Table 12. Difference between the actual and estimated numbers of clusters in lower-dimensional datasets (⨀ = correct estimate).
Dataset | Dim | # of Clusters | UOSDU | UOSCH | UFSDU | UFSCH
D1 | 2 | 2 | ⨀ | 5 | ⨀ | ⨀
D2 | 2 | 2 | 4 | ⨀ | ⨀ | ⨀
D3 | 4 | 3 | 2 | 2 | 2 | ⨀
D4 | 5 | 2 | ⨀ | ⨀ | ⨀ | ⨀
D5 | 13 | 3 | 2 | 2 | ⨀ | 2
D6 | 9 | 2 | ⨀ | ⨀ | ⨀ | ⨀
D7 | 3 | 2 | ⨀ | ⨀ | ⨀ | ⨀
D8 | 3 | 2 | 8 | 8 | ⨀ | 3
D9 | 2 | 3 | 2 | ⨀ | ⨀ | ⨀
D10 | 2 | 3 | ⨀ | 2 | 2 | ⨀
# of successes in estimating the optimal number of clusters | | | 5 | 5 | 8 | 8