Article

An Improved Three-Way Clustering Based on Ensemble Strategy

1 School of Computer, Jiangsu University of Science and Technology, Zhenjiang 212003, China
2 School of Science, Jiangsu University of Science and Technology, Zhenjiang 212003, China
* Author to whom correspondence should be addressed.
Mathematics 2022, 10(9), 1457; https://doi.org/10.3390/math10091457
Submission received: 22 March 2022 / Revised: 17 April 2022 / Accepted: 18 April 2022 / Published: 26 April 2022
(This article belongs to the Special Issue Data Mining: Analysis and Applications)

Abstract

As a powerful data analysis technique, clustering plays an important role in data mining. Traditional hard clustering uses a single set with a crisp boundary to represent each cluster, which cannot resolve the inaccurate decisions caused by imprecise information or insufficient data. To address this problem, three-way clustering was proposed to express the uncertainty in a dataset by introducing the concept of a fringe region. In this paper, we present an improved three-way clustering algorithm based on an ensemble strategy. Unlike existing clustering ensemble methods, which use various clustering algorithms to produce the base clustering results, the proposed algorithm randomly extracts feature subsets of samples and uses a traditional clustering algorithm to obtain diverse base clustering results. Based on these base clustering results, label matching is used to align all clustering results in a given order, and a voting method is used to obtain the core region and the fringe region of each three-way cluster. The proposed algorithm can be applied on top of any existing hard clustering algorithm to generate the base clustering results. As demonstrations, we apply the proposed algorithm on top of K-means and spectral clustering, respectively. The experimental results show that the proposed algorithm is effective in revealing cluster structures.

1. Introduction

Clustering aims to group similar elements into the same cluster and dissimilar elements into different clusters by computing the similarity between elements, where elements in the same cluster must have a high similarity and elements in different clusters have a low similarity [1,2]. As a key technology of machine learning, clustering is widely used in fields such as information granulation [3,4,5], information fusion [6,7,8], attribute reduction [9,10,11,12], feature selection [13,14,15], etc. Many clustering methods have been developed over the past decades. Most existing clustering algorithms can be divided into five categories: partition-based methods [16,17], hierarchy-based methods [18,19,20], density-based methods [21,22,23,24], grid-based methods [25] and model-based methods [26,27,28].
Although there are many different clustering algorithms, the lack of prior knowledge means that clustering analysis remains a very challenging problem. It is widely accepted that a single clustering approach cannot always describe the structural characteristics of data accurately; moreover, even the same approach may produce poor clusters under different initial parameters. To alleviate this problem, ensemble clustering [29,30,31] has been developed to improve the robustness, stability and quality of a clustering solution. Compared with a single clustering approach, the results obtained by ensemble clustering approaches have better stability, robustness and accuracy. In recent years, ensemble clustering has received increasing attention and many new ensemble clustering approaches have been developed [32,33,34,35].
The above works have achieved good performance in solving the clustering ensemble problem. However, most existing ensemble clustering results are hard clusterings, in which an element either belongs to a cluster or does not, and there is a crisp boundary between different clusters. Hard clustering algorithms often carry a higher decision risk when the information about a sample is insufficient. To address this problem, three-way decision [36] was proposed to describe the uncertainty of information. The main idea of three-way decision is to divide a universe into three disjoint regions and adopt different strategies for different regions [37,38,39]. Many soft computing models for learning uncertain concepts, such as rough sets [40], fuzzy sets [41], shadowed sets [42] and concept learning [43,44], can be reinvestigated within the framework of three-way decision. By integrating three-way decision with clustering, Yu [45,46,47] proposed three-way clustering, which uses a core region and a fringe region to represent a cluster. These two sets divide the universe into three parts, corresponding to three types of relationship between objects and a cluster: objects belonging to the cluster, objects not belonging to the cluster, and objects partially belonging to the cluster. Recently, three-way clustering has attracted much research interest, and many three-way clustering algorithms have been developed [48,49,50,51,52,53,54,55,56].
This paper presents an improved three-way clustering algorithm based on an ensemble strategy to address the inaccurate decisions caused by imprecise information or insufficient data. Unlike existing clustering ensemble methods, which use various clustering algorithms to produce the base clustering results, the proposed algorithm randomly extracts feature subsets of samples and uses a traditional clustering algorithm to obtain diverse base clustering results. Based on the base clustering results, we develop a three-way clustering method using a voting scheme. The main process of the proposed algorithm has two steps. First, we use part of the sample's features to obtain the base clustering results. Second, we use label matching to align all clustering results to a given order and a voting method to obtain the core region and the fringe region of the three-way clustering. A sample is assigned to the core region of the corresponding cluster when its frequency of appearing in that cluster exceeds a given threshold. The difference between the union of the clusters with the same label and the core region is regarded as the fringe region of the specific cluster. Therefore, a three-way clustering is naturally formed. The proposed strategy can be applied on top of any existing hard clustering algorithm. Three-way ensemble K-means and NJW are given as demonstrations in this paper.
The remainder of this paper is organized as follows. In Section 2, we mainly introduce the concepts of ensemble clustering and three-way clustering. The process of the proposed algorithm is presented in Section 3. The performances of the proposed three-way ensemble clustering algorithm are illustrated through some UCI datasets in Section 4. Conclusions and future works are given in Section 5.

2. Related Work

In this section, we review some concepts and related works about ensemble clustering and three-way clustering.

2.1. Ensemble Clustering

Each clustering algorithm has its own method for discovering the data structure, so different clustering algorithms can produce different clustering results even on the same data. A single clustering algorithm cannot deal with all types of data structure, and it is also difficult to choose a specialized clustering algorithm because of insufficient prior class information. Hence, researchers have devoted themselves to integrating multiple clustering results into one clustering result, which is called ensemble clustering. Compared with a single clustering algorithm, an ensemble clustering algorithm can obtain a better clustering result with higher robustness, stability and quality.
The concept of ensemble clustering was first proposed by Strehl and Ghosh [29], who combined cluster labels without accessing the original features. Wang et al. [57] developed an ensemble clustering method based on probability accumulation by considering factors such as cluster size, sample dimension and density. Punera and Ghosh [58] proposed several consensus algorithms suitable for soft clustering by extending relatively hard clustering approaches. Sevillano et al. [59] presented a set of fuzzy consensus functions that combine multiple soft clustering results into a final soft clustering result by applying positional and confidence voting techniques. Li et al. [60] developed an ensemble clustering algorithm based on the sample's stability. Yu et al. [55] presented a framework of three-way ensemble clustering based on Spark and proposed a consensus clustering algorithm based on cluster units.
In general, ensemble clustering can be roughly divided into two stages: base cluster generation and base cluster aggregation. Base cluster generation is the first step of an ensemble clustering algorithm, in which a set of clusterings to be combined is generated. There are no restrictions on how to obtain the base clusters, so many approaches are possible, such as using different clustering algorithms or using the same clustering algorithm with different parameters. In this paper, we mainly study the process of base cluster aggregation and how to convert hard clustering results into a soft clustering result. The process of ensemble clustering is shown in Figure 1.

2.2. Three-Way Clustering

Three-way decision [36] is an extension of two-way decision, in which a definite decision is made for elements with sufficient information and a deferred decision is adopted when the information about an element is insufficient, so as to avoid decision risk. Three-way decision uses three disjoint regions to represent a set, namely the positive region, the negative region and the boundary region, which correspond to the acceptance decision, the rejection decision and the delayed decision, respectively. Inspired by the idea of three-way decision, Yu [45] presented the framework of three-way clustering by combining clustering with three-way decision.
We introduce some basic knowledge about three-way clustering. Given a universe U = {x_1, x_2, ..., x_n} with n elements, the result of a hard clustering algorithm can be denoted as C = {C_1, C_2, ..., C_K}. In contrast to the hard clustering representation, a three-way cluster C_i is represented as a pair of sets:

C_i = {Co(C_i), Fr(C_i)},

where Co(C_i) is the core region of cluster C_i and Fr(C_i) is the fringe region of cluster C_i. The third region is the trivial region Tr(C_i) = U − (Co(C_i) ∪ Fr(C_i)), i.e., the complement of the union of Co(C_i) and Fr(C_i).
We summarize some properties of three-way clustering. If an element v ∈ Co(C_i), then v certainly belongs to cluster C_i; if v ∈ Fr(C_i), then v might belong to cluster C_i; if v ∈ Tr(C_i), then v does not belong to cluster C_i. The three regions satisfy the following properties:

Co(C_i) ∩ Fr(C_i) = ∅,
Co(C_i) ∩ Tr(C_i) = ∅,
Fr(C_i) ∩ Tr(C_i) = ∅,
Co(C_i) ∪ Fr(C_i) ∪ Tr(C_i) = U.

When Fr(C_i) = ∅, the cluster C_i is represented only by its core region Co(C_i), which is exactly a hard clustering result. Hence, a hard clustering result is a special case of a three-way clustering result.
In this paper, we adopt the following three conditions:

(1) Co(C_i) ≠ ∅, i = 1, 2, ..., K;
(2) ⋃_{i=1}^{K} (Co(C_i) ∪ Fr(C_i)) = U;
(3) Co(C_i) ∩ Co(C_j) = ∅, i ≠ j.

Condition (1) means that each cluster must have a nonempty core region. Condition (2) indicates that every element belongs to at least one cluster. Condition (3) demands that the core regions of different clusters must be disjoint. Therefore, the three-way clustering result can be represented as:

C = {(Co(C_1), Fr(C_1)), (Co(C_2), Fr(C_2)), ..., (Co(C_K), Fr(C_K))}.
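To make the notation concrete, the following minimal Python sketch (with a hypothetical universe and hand-picked regions, not data from the paper) checks that the core, fringe and trivial regions of a single cluster satisfy the properties above.

```python
# A minimal sketch (hypothetical universe and hand-picked regions) of the
# three-way representation of a single cluster C_i: the trivial region is
# everything outside the core and fringe regions, and the three regions
# partition the universe U.
U = {1, 2, 3, 4, 5, 6, 7, 8}
core = {1, 2, 3}                      # Co(C_i): objects certainly in C_i
fringe = {4, 5}                       # Fr(C_i): objects possibly in C_i
trivial = U - (core | fringe)         # Tr(C_i): objects not in C_i

assert core & fringe == set()         # pairwise disjoint ...
assert core & trivial == set()
assert fringe & trivial == set()
assert core | fringe | trivial == U   # ... and jointly exhaustive
```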
Several three-way clustering algorithms have been developed since three-way clustering was proposed. Wang [49] proposed a three-way clustering framework by combining ideas of three-way decision with erosion and dilation from mathematical morphology. Wang et al. [50] improved the K-means algorithm and developed a three-way k-means method. Zhang [61] presented a three-way c-means clustering algorithm by integrating three-way weights and three-way assignment. Yu et al. [47] proposed an efficient three-way clustering algorithm based on the idea of universal gravitation, which can adjust the thresholds automatically during clustering. Jia et al. [62] developed an automatic three-way clustering approach by combining a threshold selection method, based on the roughness degree computed from sample similarity, with a cluster number selection method.

3. The Proposed Three-Way Clustering Based on Ensemble Strategy

Three-way clustering was proposed to express the uncertainty in a dataset by introducing the concept of a fringe region. Although many three-way clustering algorithms have achieved good performance, there is still much room for improvement. In this section, we present an improved three-way clustering algorithm based on an ensemble strategy (TWCE for short). Compared with existing algorithms, the proposed algorithm randomly extracts feature subsets of samples and uses a traditional clustering algorithm to obtain diverse base clustering results, so its computational complexity is lower than that of existing clustering ensemble methods that use all of the sample's features. In addition, the proposed strategy can be applied on top of any existing hard clustering algorithm. The proposed algorithm consists of three steps: generation of base clustering results, label matching and derivation of the three-way clustering results.

3.1. Generation of Base Clustering Results

The first task in a clustering ensemble is to obtain a set of base clustering results. There are many approaches to generating base clusters, among which utilizing different clustering algorithms is the most common strategy. Each clustering algorithm has its own specific view on how to discover the underlying structure of a dataset, so multiple clustering algorithms can be used to generate diverse base clustering results. Another common method is to use one clustering algorithm with different parameters; for example, base clustering results can be obtained by setting different numbers of clusters or different initial centers for K-means-type algorithms. These methods use all of the sample's features during clustering, which is time-consuming for a high-dimensional dataset.
Different from the existing clustering ensemble methods, the proposed TWCE algorithm uses part of the sample's features to obtain the base clustering results. For a high-dimensional dataset, different subsets of features describe the dataset from different views, so a set of diverse clustering results is obtained when distinct feature subsets are utilized. Suppose U is a dataset with m features; we randomly extract part of the features and use a traditional clustering algorithm to obtain one clustering result. Repeating this process t times yields various clustering results C_1*, C_2*, ..., C_t*. The generation of the base clustering results is depicted in Algorithm 1.
Algorithm 1: Generation of base clustering results.
Input: Data set U = {v_1, v_2, ..., v_n}, ensemble size t, cluster number K
Output: Base clustering results C_1*, C_2*, ..., C_t*
[The pseudocode body of Algorithm 1 is provided as an image in the original article.]
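Since the body of Algorithm 1 is only available as an image, the following Python sketch shows one plausible reading of the procedure described above: t base clusterings are produced, each by clustering a randomly drawn feature subset. The function name, the use of scikit-learn's KMeans as the traditional clusterer and the 50–90% subset ratios (borrowed from Section 4) are illustrative choices, not the authors' code.

```python
import numpy as np
from sklearn.cluster import KMeans

def generate_base_clusterings(X, t, K, ratios=(0.5, 0.6, 0.7, 0.8, 0.9),
                              random_state=None):
    """Produce t hard base clusterings, each from a random feature subset."""
    rng = np.random.default_rng(random_state)
    n_features = X.shape[1]
    base_results = []
    for _ in range(t):
        ratio = rng.choice(ratios)                      # e.g. 50%-90% of the features
        d = max(1, int(round(ratio * n_features)))
        cols = rng.choice(n_features, size=d, replace=False)
        labels = KMeans(n_clusters=K, n_init=10).fit_predict(X[:, cols])
        base_results.append(labels)                     # one base result C_i*
    return base_results
```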

3.2. Label Matching

Based on the base clustering results, we use a voting method to obtain the core region and the fringe region of the three-way clustering. The base clustering results C_1*, C_2*, ..., C_t* cannot be used directly for voting because of the lack of prior category information. As an example, consider the dataset U = {v_1, v_2, v_3, v_4, v_5, v_6} and let C_1, C_2 and C_3 be three clustering results of U, which are shown in Table 1.
Although the three clustering results are expressed with different label orders, they represent the same partition. In order to combine the clustering results, the cluster labels must be matched to establish the correspondence between them. Zhou and Tang [31] pointed out that two clusters in correspondence should cover the largest number of common elements. Therefore, for two base clustering results C_i = {C_{i1}, ..., C_{ik}, ..., C_{iK}} and C_j = {C_{j1}, ..., C_{jk}, ..., C_{jK}}, 1 ≤ i, j ≤ t, where t is the number of base clusterings, we record the number of identical elements covered by each pair of clusters C_{ik_1} and C_{jk_2} (1 ≤ k_1, k_2 ≤ K) in a K × K matrix OVERLAP. We then select the pair of cluster labels covering the largest number of identical elements, establish their correspondence, and delete the corresponding row and column from OVERLAP. This process is repeated until all cluster labels have been matched, and is referred to as label matching. When there are t (t > 2) clustering results, we select the first clustering result as the matching reference and match every other clustering result with it. The procedure of label matching is shown in Algorithm 2.
Algorithm 2: Label matching.
[The pseudocode body of Algorithm 2 is provided as an image in the original article.]
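The pseudocode of Algorithm 2 is likewise only an image; the sketch below implements the greedy OVERLAP-matrix matching described above for two label vectors. The function name and the NumPy label-array representation are illustrative assumptions.

```python
import numpy as np

def match_labels(reference, other, K):
    """Relabel `other` so its cluster labels correspond to those of `reference`,
    greedily pairing the clusters that share the most common elements."""
    overlap = np.zeros((K, K))
    for k1 in range(K):
        for k2 in range(K):
            overlap[k1, k2] = np.sum((reference == k1) & (other == k2))
    mapping = {}
    for _ in range(K):
        k1, k2 = np.unravel_index(np.argmax(overlap), overlap.shape)
        mapping[k2] = k1               # cluster k2 of `other` matches k1 of `reference`
        overlap[k1, :] = -1            # "delete" the matched row and column
        overlap[:, k2] = -1
    return np.array([mapping[label] for label in other])
```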

3.3. Results of Three-Way Clustering

After label matching, the updated clustering results C_1, C_2, ..., C_t are obtained. We then use a voting method to determine the core region and fringe region of the three-way clustering. In the voting process, each clustering result is regarded as a voter that votes once and only once for each data point. Let C_j = ⋃_{i=1}^{t} C_{ij} (j = 1, 2, ..., K). For any v ∈ C_j, we count the votes of v for C_{ij} (i = 1, 2, ..., t) and denote the number of votes as count(v). Suppose p is a given threshold. If count(v) ≥ p, we assign v to the core region of cluster C_j; otherwise, we assign v to the fringe region of cluster C_j. Finally, we obtain the three-way clustering results. The process of finding the core region and the fringe region is depicted in Algorithm 3.
Algorithm 3: Finding core region and fringe region of TWCE.
[The pseudocode body of Algorithm 3 is provided as an image in the original article.]
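As with the previous algorithms, the published pseudocode is an image; the following sketch implements the voting rule described above for aligned label arrays. Names and the set-based output format are illustrative.

```python
import numpy as np

def three_way_vote(aligned, K, p):
    """Given t aligned base clusterings, assign a sample to the core region of
    cluster j if it received at least p votes for j, and to the fringe region
    of j if it received at least one (but fewer than p) votes for j."""
    votes = np.stack(aligned)                    # shape (t, n) matrix of labels
    n = votes.shape[1]
    core = [set() for _ in range(K)]
    fringe = [set() for _ in range(K)]
    for v in range(n):
        counts = np.bincount(votes[:, v], minlength=K)
        for j in range(K):
            if counts[j] >= p:
                core[j].add(v)                   # count(v) >= p: core of C_j
            elif counts[j] > 0:
                fringe[j].add(v)                 # some votes, but below p: fringe of C_j
    return core, fringe
```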
    Based on Algorithms 1–3, a three-way clustering result that describes the data distribution will be generated. Sequentially executing Algorithms 1–3 forms the framework of the proposed TWCE, which is shown as Algorithm 4.   
Algorithm 4: Three-way clustering based on ensemble strategy (TWCE).
Input: Data set U = {v_1, v_2, ..., v_n}, ensemble size t, cluster number K, threshold p
Output: C = {(Co(C_1), Fr(C_1)), (Co(C_2), Fr(C_2)), ..., (Co(C_K), Fr(C_K))}
1 C_1*, C_2*, ..., C_t* ← Algorithm 1;
2 C_1, C_2, ..., C_t ← Algorithm 2;
3 {(Co(C_1), Fr(C_1)), (Co(C_2), Fr(C_2)), ..., (Co(C_K), Fr(C_K))} ← Algorithm 3;
4 Return C = {(Co(C_1), Fr(C_1)), (Co(C_2), Fr(C_2)), ..., (Co(C_K), Fr(C_K))}.
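Combining the three sketches gives a hypothetical end-to-end run of Algorithm 4; the Wine dataset, the ensemble size t = 50 and the threshold p = 30 are chosen here purely for illustration.

```python
from sklearn.datasets import load_wine

X = load_wine().data                  # Wine is one of the 12 UCI datasets used later
K, t, p = 3, 50, 30                   # cluster number, ensemble size, vote threshold

base_results = generate_base_clusterings(X, t=t, K=K, random_state=0)   # Algorithm 1
aligned = [base_results[0]] + [match_labels(base_results[0], c, K)      # Algorithm 2
                               for c in base_results[1:]]
core, fringe = three_way_vote(aligned, K, p)                            # Algorithm 3
# core[j] and fringe[j] now hold the core and fringe regions of cluster C_j.
```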

4. Experimental Analyses

In this section, we verify the performance of the proposed TWCE strategy. In the experiments, K-means [16] and NJW [63] are used to generate the base clustering results, respectively, and the percentage of selected features is randomly chosen from 50%, 60%, 70%, 80% and 90%. This section consists of three parts. In the first part, we introduce some popular clustering evaluation indices. In the second part, 12 datasets from the UCI Machine Learning Repository are employed to show the working mechanism of the TWCE strategy. The relation between the clustering performance and the percentage of selected features is discussed in the third part.

4.1. Evaluation Indices

Evaluating clustering results is an effective way to assess the performance of clustering algorithms. We compare the proposed algorithm with existing clustering algorithms by computing several evaluation indices, namely Normalized Mutual Information (NMI) [64], Adjusted Rand Index (ARI) [65] and Accuracy (Acc) [50]. All three validity metrics are positive indices, that is, the larger the value, the better the clustering result.
• Normalized Mutual Information (NMI)

NMI = I(X, Y) / √(H(X) H(Y)),

where X is the test label and Y is the real label, H(X) and H(Y) represent the entropy of X and Y, respectively, and I(X, Y) is the mutual information between X and Y.
• Adjusted Rand Index (ARI)

ARI = 2(ad − bc) / [(a + b)(b + d) + (a + c)(c + d)],

where a is the number of pairs of data points that belong to the same cluster in both the real and the experimental partition; b is the number of pairs that belong to the same cluster in the real partition but not in the experimental one; c is the number of pairs that belong to the same cluster in the experimental partition but not in the real one; and d is the number of pairs that do not belong to the same cluster in either partition.
• Accuracy (Acc)

Acc = (1/N) Σ_{i=1}^{k} n_i,

where N is the total number of elements, n_i is the number of elements correctly assigned to the corresponding cluster i, and k is the number of clusters. Acc represents the ratio between the number of correctly partitioned elements and the total number of elements; a higher Acc value means a better clustering result.
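Assuming the clustering result has been converted into a hard labelling (as described in Section 4.2), NMI and ARI can be computed with scikit-learn; the geometric averaging option matches the √(H(X)H(Y)) normalisation above, while Acc additionally requires matching cluster labels to class labels (e.g., with the label-matching sketch from Section 3.2).

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def evaluate(y_true, y_pred):
    """NMI with geometric normalisation (matching sqrt(H(X)H(Y)) above) and ARI.
    Acc would additionally require matching cluster labels to class labels."""
    nmi = normalized_mutual_info_score(y_true, y_pred, average_method="geometric")
    ari = adjusted_rand_score(y_true, y_pred)
    return nmi, ari

# Example: a perfect clustering up to relabelling scores 1.0 on both indices.
print(evaluate(np.array([0, 0, 1, 1]), np.array([1, 1, 0, 0])))   # (1.0, 1.0)
```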

4.2. Performances of TWCE Strategy

To test the performance of our proposed TWCE strategy, we employ 12 datasets from the UCI Machine Learning Repository: Cardiotocography, Congressional voting, Dermatology, Forrest, Landsat, Optical recognition, Synthetic, Urban Land Cover, Vehicle, Waveform, Wdbc and Wine. Table 2 shows the details of these datasets. The first step of the TWCE strategy is to obtain base clustering results, and different clustering algorithms may produce different results. We use the K-means algorithm and the NJW algorithm to generate base clusters in our experiments, each of which randomly selects 50%, 60%, 70%, 80% or 90% of the features for clustering.
Because the evaluation indices NMI, ARI and Acc only apply to hard clustering results, they cannot be computed directly from a three-way clustering result. To present the performance of the proposed TWCE algorithm, we use the core regions to form a hard clustering result and then calculate the NMI, ARI and Acc values by using each core region to represent the corresponding cluster. The average NMI, ARI and Acc values are obtained by running the algorithm 30 times on each dataset with an ensemble size of 50. Table 3 and Table 4 show the performance of the TWCE strategy based on the K-means and NJW algorithms, respectively. For comparison, the performances of K-means, NJW, Voting [31] and CSPA [29] are also presented in Table 5, Table 6 and Table 7.
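The following sketch illustrates the conversion described above: each cluster is represented by its core region only, and samples that fall in no core region are excluded before the indices are computed. The helper name and the reuse of the evaluate function from Section 4.1 are illustrative.

```python
import numpy as np

def core_labelling(core, n):
    """Build a hard labelling from the core regions only; samples outside every
    core region are marked -1 and excluded when NMI, ARI and Acc are computed."""
    labels = np.full(n, -1)
    for j, region in enumerate(core):
        labels[list(region)] = j       # core regions are disjoint by condition (3)
    kept = labels >= 0
    return labels, kept

# labels, kept = core_labelling(core, n=X.shape[0])
# nmi, ari = evaluate(y_true[kept], labels[kept])   # y_true: ground-truth classes
```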
From the experimental results in Table 3, Table 4, Table 5, Table 6 and Table 7, we can make the following conclusions.
(1)
It is obvious that the NMI and ARI performances of the TWCE strategy based on K-means and NJW are better than those of K-means and NJW alone, so the TWCE strategy can indeed obtain a better clustering result than a single clustering algorithm. Compared with the other two ensemble clustering algorithms, the NMI and ARI values obtained by the TWCE algorithm are better on most of the 12 datasets. Specifically, TWCE based on K-means is superior to the other algorithms on all datasets, and TWCE based on NJW is only slightly worse than the Voting algorithm on Waveform. The improvement can be attributed to the fact that the clustering result of the proposed TWCE algorithm is represented by the core regions when calculating the NMI and ARI values, which increases the separation between clusters and reduces the dispersion within clusters.
(2)
TWCE based on K-means and TWCE based on NJW both achieve a better Acc value than the other algorithms on most datasets, except on Waveform. The increase in Acc can be attributed to each cluster being represented only by its core region when calculating Acc, which excludes the elements in the fringe regions, that is, both n_i and N in the definition of Acc become smaller.
(3)
Comparing the performances of TWCE using K-means and TWCE using NJW, we find that the choice of clustering algorithm used to generate the base clusters has little effect on the performance of the TWCE algorithm.

4.3. The Influences of the Selected Feature Percentage

The TWCE strategy uses a traditional clustering algorithm on part of the features to obtain the base clustering results, so the percentage of selected features has an impact on the experimental performance. In this subsection, we discuss the relation between the clustering performance and the percentage of selected features. We use the same datasets as in the previous subsection, and the percentage of the feature subset is set to 50%, 60%, 70%, 80% and 90% with an ensemble size of 50. The average NMI, ARI and Acc values are obtained by running the TWCE algorithm 30 times, using the K-means algorithm and the NJW algorithm to generate base clusters on each dataset, respectively. Figures 2–13 list the NMI, ARI and Acc values of the twelve datasets for the TWCE strategy based on K-means when the percentage of the feature subset takes different values. Figures 14–25 show the relation between the clustering performance and the percentage of features selected by the TWCE strategy based on the NJW algorithm.
From the experimental results recorded in Figures 2–25, we can see that different datasets achieve their best performance at different percentages. For example, Dermatology achieves its best performance with the TWCE strategy based on K-means when the percentage is 60%, while most of the other datasets achieve their best performance when the percentage is 50%. Even for the same dataset, the TWCE strategy based on different clustering algorithms achieves the best performance at different percentages. For example, Synthetic achieves its best performance with the TWCE strategy based on K-means when the percentage is 70%, while the best performance with the TWCE strategy based on NJW is obtained at a different percentage. Although different datasets achieve their best performance at different percentages, the performance on most datasets deteriorates when the percentage is 90%. This is because the diversity of the base clustering results becomes smaller as the percentage grows, and low diversity limits the improvement achievable by the ensemble. The choice of a reasonable percentage needs to be explored further in future research.

5. Conclusions and Future Work

It has been recognized that a single clustering algorithm cannot identify all types of data structure, and ensemble clustering is an effective approach to the problem that a single clustering algorithm may not obtain good results on all datasets. The three-way clustering method uses a core region and a fringe region to address the inaccurate decisions caused by imprecise information or insufficient data. Integrating the ideas of three-way clustering and ensemble clustering, we propose a new three-way ensemble clustering strategy in this paper. In the proposed strategy, we randomly extract part of the features and use a traditional clustering algorithm to obtain one clustering result; different feature subsets lead to different clustering results, so diverse base clustering results can be obtained by using different feature subsets. Based on the base clustering results, we use label matching to align all clustering results in a given order and a voting method to obtain the core region and the fringe region of the three-way clustering. A sample is assigned to the core region of the corresponding cluster when its frequency of appearing in that cluster exceeds a given threshold, and the difference between the union of the clusters with the same label and the core region is regarded as the fringe region of the specific cluster. Therefore, a three-way clustering is obtained. As demonstrations, we apply the proposed strategy on top of K-means and spectral clustering, respectively. The experimental results on UCI datasets demonstrate its effectiveness in revealing data structures.
The following topics will deserve further investigation:
(1)
In this paper, the cluster number K is kept constant during the generation of the base cluster members. Since clustering is an unsupervised method, how to apply the proposed algorithm when the base clusterings use different values of K is our next piece of future work.
(2)
The base clusters generated from different feature subsets may be of low quality, which may affect the final ensemble clustering result. We could evaluate the quality of the base clusters with an evaluation function and remove low-quality base cluster members. This will be a good research direction.
(3)
In the process of three-way decision, the strategy needs more refined division guidelines, so that the proposed algorithm can achieve clustering results with higher performance.

Author Contributions

Conceptualization, P.W.; Data curation, T.W.; Formal analysis, J.F.; Funding acquisition, P.W.; Methodology, P.W.; Supervision, P.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Natural Science Foundation of China (nos. 62076111, 61906078) and the Key Laboratory of Oceanographic Big Data Mining and Application of Zhejiang Province (no. OBDMA202002).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. These data can be found here: http://archive.ics.uci.edu/ml/datasets (accessed on 21 March 2022).

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Ding, S.F.; Jia, H.J.; Du, M.J.; Xue, Y. A semi-supervised approximate spectral clustering algorithm based on HMRF model. Inf. Sci. 2018, 429, 215–228.
2. Shi, H.; Wang, P.X.; Yang, X.B.; Yu, H.L. An improved mean imputation clustering algorithm for incomplete data. Neural Process. Lett. 2021.
3. Yang, X.B.; Qi, Y.S.; Song, X.N.; Yang, J.Y. Test cost sensitive multigranulation rough set: Model and minimal cost selection. Inf. Sci. 2013, 250, 184–199.
4. Xu, W.H.; Guo, Y.T. Generalized multigranulation double-quantitative decision-theoretic rough set. Knowl.-Based Syst. 2016, 105, 190–205.
5. Li, W.T.; Xu, W.H.; Zhang, X.Y.; Zhang, J. Updating approximations with dynamic objects based on local multigranulation rough sets in ordered information systems. Artif. Intell. Rev. 2021, 55, 1821–1855.
6. Xu, W.H.; Yu, J.H. A novel approach to information fusion in multi-source datasets: A granular computing viewpoint. Inf. Sci. 2017, 378, 410–423.
7. Chen, X.W.; Xu, W.H. Double-quantitative multigranulation rough fuzzy set based on logical operations in multi-source decision systems. Int. J. Mach. Learn. Cybern. 2022, 13, 1021–1048.
8. Xu, W.H.; Yuan, K.H.; Li, W.T. Dynamic updating approximations of local generalized multigranulation neighborhood rough set. Appl. Intell. 2022.
9. Yang, X.B.; Yao, Y.Y. Ensemble selector for attribute reduction. Appl. Soft Comput. 2018, 70, 1–11.
10. Jiang, Z.; Yang, X.; Yu, H.; Liu, D.; Wang, P.; Qian, Y. Accelerator for multi-granularity attribute reduction. Knowl.-Based Syst. 2019, 177, 145–158.
11. Li, J.; Yang, X.; Song, X.; Li, J.; Wang, P.; Yu, D.J. Neighborhood attribute reduction: A multi-criterion approach. Int. J. Mach. Learn. Cybern. 2019, 10, 731–742.
12. Liu, K.; Yang, X.; Yu, H.; Fujita, H.; Chen, X.; Liu, D. Supervised information granulation strategy for attribute reduction. Int. J. Mach. Learn. Cybern. 2020, 11, 2149–2163.
13. Xu, S.; Yang, X.; Yu, H.; Yu, D.J.; Yang, J.; Tsang, E.C. Multi-label learning with label-specific feature reduction. Knowl.-Based Syst. 2016, 104, 52–61.
14. Liu, K.Y.; Yang, X.B.; Fujita, H. An efficient selector for multi-granularity attribute reduction. Inf. Sci. 2019, 505, 457–472.
15. Liu, K.; Yang, X.; Yu, H.; Mi, J.; Wang, P.; Chen, X. Rough set based semi-supervised feature selection via ensemble selector. Knowl.-Based Syst. 2020, 165, 282–296.
16. MacQueen, J. Some methods for classification and analysis of multivariate observations. Berkeley Symp. Math. Stat. Probab. 1967, 5, 281–297.
17. Maulik, U.; Bandyopadhyay, S. Genetic algorithm-based clustering technique. Pattern Recognit. 2000, 33, 1455–1465.
18. Gurrutxaga, I.; Albisua, I.; Arbelaitz, O.; Martín, J.I.; Muguerza, J.; Pérez, J.M.; Perona, I. An efficient method to find the best partition in hierarchical clustering based on a new cluster validity index. Pattern Recognit. 2010, 43, 3364–3373.
19. Fred, A.L.; Leitão, M.N. Partitional vs. hierarchical clustering using a minimum grammar complexity approach. In Proceedings of the SSPR 2000 & SPR 2000, Alicante, Spain, 30 August–1 September 2000; pp. 193–202.
20. Azzag, H.; Lebbah, M. A New Way for Hierarchical and Topological Clustering; Guillet, F., Pinaud, B., Venturini, G., Zighed, D., Eds.; Advances in Knowledge Discovery and Management; Springer: Berlin/Heidelberg, Germany, 2013; pp. 85–97.
21. Lotfi, A.; Moradi, P.; Beigy, H. Density peaks clustering based on density backbone and fuzzy neighborhood. Pattern Recognit. 2020, 107, 107449.
22. Ankerst, M.; Breunig, M.M.; Kriegel, H.P.; Sander, J. OPTICS: Ordering points to identify the clustering structure. In Proceedings of the International Conference on Management of Data and Symposium on Principles of Database Systems, Philadelphia, PA, USA, 31 May–3 June 1999; pp. 49–60.
23. Hinneburg, A.; Keim, D.A. An efficient approach to clustering in large multimedia databases with noise. In Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 27–31 August 1998; pp. 58–65.
24. Birant, D.; Kut, A. ST-DBSCAN: An algorithm for clustering spatial-temporal data. Data Knowl. Eng. 2007, 60, 208–221.
25. Agrawal, R.; Gehrke, J.; Gunopulos, D.; Raghavan, P. Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications; ACM: New York, NY, USA, 1998.
26. Govaert, G.; Nadif, M. An EM algorithm for the block mixture model. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 643–647.
27. Pavlidis, P.; Li, Q.; Noble, W.S. The effect of replication on gene expression microarray experiments. Bioinformatics 2003, 19, 1620–1627.
28. Yen, L.C.; Lu, M.C.; Chen, J.L. Applying the self-organization feature map (SOM) algorithm to AE-based tool wear monitoring in micro-cutting. Mech. Syst. Signal Process. 2013, 34, 353–366.
29. Strehl, A.; Ghosh, J. Cluster ensembles—A knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res. 2003, 3, 583–617.
30. Fred, A.L.N.; Jain, A.K. Combining multiple clusterings using evidence accumulation. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 835–850.
31. Zhou, Z.H.; Tang, W. Clusterer ensemble. Knowl.-Based Syst. 2006, 19, 77–83.
32. Huang, D.; Lai, J.H.; Wang, C.D. Ensemble clustering using factor graph. Pattern Recognit. 2016, 50, 131–142.
33. Huang, D.; Wang, C.D.; Lai, J.H. Locally weighted ensemble clustering. IEEE Trans. Cybern. 2018, 48, 1460–1473.
34. Xu, L.; Ding, S.F. A novel clustering ensemble model based on granular computing. Appl. Intell. 2021, 51, 5474–5488.
35. Zhou, P.; Wang, X.; Du, L.; Li, X.J. Clustering ensemble via structured hypergraph learning. Inf. Fusion 2022, 78, 171–178.
36. Yao, Y.Y. The superiority of three-way decisions in probabilistic rough set models. Inf. Sci. 2011, 181, 1080–1096.
37. Yao, Y.Y. Three-way decisions and cognitive computing. Cogn. Comput. 2016, 8, 543–554.
38. Yao, Y.Y. Tri-level thinking: Models of three-way decision. Int. J. Mach. Learn. Cybern. 2020, 11, 947–959.
39. Yao, Y.Y. The geometry of three-way decision. Appl. Intell. 2021, 51, 6298–6325.
40. Pawlak, Z. Rough sets. Int. J. Comput. Inf. Sci. 1982, 11, 341–356.
41. Zadeh, L.A. Fuzzy sets. Inf. Control 1965, 8, 338–353.
42. Pedrycz, W. Shadowed sets: Representing and processing fuzzy sets. IEEE Trans. Syst. Man Cybern. B 1998, 28, 103–109.
43. Xu, W.H.; Li, W.T. Granular computing approach to two-way learning based on formal concept analysis in fuzzy datasets. IEEE Trans. Cybern. 2016, 46, 366–379.
44. Yuan, K.H.; Xu, W.H.; Li, W.T.; Ding, W.T. An incremental learning mechanism for object classification based on progressive fuzzy three-way concept. Inf. Sci. 2022, 584, 127–147.
45. Yu, H. A framework of three-way cluster analysis. In Proceedings of the International Joint Conference on Rough Sets, Olsztyn, Poland, 3–7 July 2017; pp. 300–312.
46. Yu, H.; Jiao, P.; Yao, Y.Y.; Wang, G.Y. Detecting and refining overlapping regions in complex networks with three-way decisions. Inf. Sci. 2016, 373, 21–41.
47. Yu, H.; Wang, X.C.; Wang, G.Y.; Zeng, X.H. An active three-way clustering method via low-rank matrices for multi-view data. Inf. Sci. 2020, 507, 823–839.
48. Afridi, M.K.; Azam, N.; Yao, J.T. A three-way clustering approach for handling missing data using GTRS. Int. J. Approx. Reason. 2018, 98, 11–24.
49. Wang, P.X.; Yao, Y.Y. CE3: A three-way clustering method based on mathematical morphology. Knowl.-Based Syst. 2018, 155, 54–65.
50. Wang, P.X.; Shi, H.; Yang, X.B.; Mi, J.S. Three-way k-means: Integrating k-means and three-way decision. Int. J. Mach. Learn. Cybern. 2019, 10, 2767–2777.
51. Wang, P.X.; Chen, X.J. Three-way ensemble clustering for incomplete data. IEEE Access 2020, 8, 91855–91864.
52. Wang, P.X.; Yang, X.B. Three-way clustering method based on stability theory. IEEE Access 2021, 9, 33944–33953.
53. Zhu, J.; Jiang, D.Q.; Wang, P.X. A three-step method for three-way clustering by similarity-based sample's stability. Math. Probl. Eng. 2022, 2022, 6555501.
54. Chu, X.; Sun, B.; Li, X.; Han, K.; Wu, J.; Zhang, Y.; Huang, Q. Neighborhood rough set-based three-way clustering considering attribute correlations: An approach to classification of potential gout groups. Inf. Sci. 2020, 535, 28–41.
55. Yu, H.; Chen, Y.; Lingras, P.; Wang, G.Y. A three-way cluster ensemble approach for large-scale data. Int. J. Approx. Reason. 2019, 115, 32–49.
56. Shah, A.; Azam, N.; Alanazi, E.; Yao, J.T. Image blurring and sharpening inspired three-way clustering approach. Appl. Intell. 2022.
57. Wang, X.; Yang, C.Y.; Zhou, J. Clustering aggregation by probability accumulation. Pattern Recognit. 2009, 42, 668–675.
58. Punera, K.; Ghosh, J. Consensus-based ensembles of soft clusterings. Appl. Artif. Intell. 2008, 22, 780–810.
59. Sevillano, X.; Alías, F.; Socoró, J.C. Positional and confidence voting-based consensus functions for fuzzy cluster ensembles. Fuzzy Sets Syst. 2012, 193, 1–32.
60. Li, F.J.; Qian, Y.H.; Wang, J.T.; Dang, C.Y.; Li, L.P. Clustering ensemble based on sample's stability. Artif. Intell. 2019, 273, 37–55.
61. Zhang, K. A three-way c-means algorithm. Appl. Soft Comput. 2019, 82, 105336.
62. Jia, X.Y.; Rao, Y.; Li, W.W.; Yang, S.C.; Yu, H. An automatic three-way clustering method based on sample similarity. Int. J. Mach. Learn. Cybern. 2021, 12, 1545–1556.
63. Ng, A.Y.; Jordan, M.I.; Weiss, Y. On spectral clustering: Analysis and an algorithm. In Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic, Vancouver, BC, Canada, 3–8 December 2001; pp. 849–856.
64. Vinh, L.T.; Lee, S.; Park, Y.T.; D'Auriol, B.J. A novel feature selection method based on normalized mutual information. Appl. Intell. 2012, 37, 100–120.
65. Hubert, L.; Arabie, P. Comparing partitions. J. Classif. 1985, 2, 193–218.
Figure 1. Flow chart of ensemble clustering.
Figure 2. Results of Cardiotocography by TWCE strategy based on K-means.
Figure 3. Results of Congressional Voting by TWCE strategy based on K-means.
Figure 4. Results of Dermatology by TWCE strategy based on K-means.
Figure 5. Results of Forrest by TWCE strategy based on K-means.
Figure 6. Results of Landsat by TWCE strategy based on K-means.
Figure 7. Results of Optical recognition by TWCE strategy based on K-means.
Figure 8. Results of Synthetic by TWCE strategy based on K-means.
Figure 9. Results of Urban Land Cover by TWCE strategy based on K-means.
Figure 10. Results of Vehicle by TWCE strategy based on K-means.
Figure 11. Results of Waveform by TWCE strategy based on K-means.
Figure 12. Results of Wdbc by TWCE strategy based on K-means.
Figure 13. Results of Wine by TWCE strategy based on K-means.
Figure 14. Results of Cardiotocography by TWCE strategy based on NJW.
Figure 15. Results of Congressional Voting by TWCE strategy based on NJW.
Figure 16. Results of Dermatology by TWCE strategy based on NJW.
Figure 17. Results of Forrest by TWCE strategy based on NJW.
Figure 18. Results of Landsat by TWCE strategy based on NJW.
Figure 19. Results of Optical recognition by TWCE strategy based on NJW.
Figure 20. Results of Synthetic by TWCE strategy based on NJW.
Figure 21. Results of Urban Land Cover by TWCE strategy based on NJW.
Figure 22. Results of Vehicle by TWCE strategy based on NJW.
Figure 23. Results of Waveform by TWCE strategy based on NJW.
Figure 24. Results of Wdbc by TWCE strategy based on NJW.
Figure 25. Results of Wine by TWCE strategy based on NJW.
Table 1. Different representations of the same clustering result.

       C_1   C_2   C_3
v_1    1     2     3
v_2    1     2     3
v_3    2     3     2
v_4    2     3     2
v_5    3     1     1
v_6    3     1     1
Table 2. A description of the datasets used.

ID   Data Sets              Samples   Attributes   Classes
1    Cardiotocography       2126      21           10
2    Congressional voting   435       16           2
3    Dermatology            366       34           6
4    Forrest                523       27           4
5    Landsat                6435      36           6
6    Optical recognition    5620      64           10
7    Synthetic              600       60           6
8    Urban Land Cover       675       147          9
9    Vehicle                846       18           4
10   Waveform               5000      21           3
11   Wdbc                   569       30           2
12   Wine                   178       13           3
Table 3. The performances of TWCE using K-means.

ID   Data Sets              NMI      ARI      Acc
1    Cardiotocography       0.6795   0.5746   0.7611
2    Congressional voting   0.5648   0.6475   0.9026
3    Dermatology            0.9777   0.9776   0.9748
4    Forrest                0.6040   0.5709   0.8221
5    Landsat                0.7151   0.6491   0.7829
6    Optical recognition    0.9573   0.9654   0.9789
7    Synthetic              0.9008   0.8761   0.8947
8    Urban Land Cover       0.6851   0.6543   0.8233
9    Vehicle                0.2812   0.1888   0.4792
10   Waveform               0.4478   0.3108   0.5574
11   Wdbc                   0.6875   0.7763   0.9412
12   Wine                   0.9219   0.9418   0.9800
Table 4. The performances of TWCE using NJW.

ID   Data Sets              NMI      ARI      Acc
1    Cardiotocography       0.6320   0.5135   0.7721
2    Congressional voting   0.5619   0.6430   0.9012
3    Dermatology            0.9850   0.9902   0.9914
4    Forrest                0.6446   0.6410   0.8525
5    Landsat                0.7776   0.7444   0.8440
6    Optical recognition    0.9623   0.9700   0.9798
7    Synthetic              0.9121   0.8885   0.9053
8    Urban Land Cover       0.6870   0.6380   0.8059
9    Vehicle                0.2648   0.1874   0.5081
10   Waveform               0.3326   0.1462   0.5361
11   Wdbc                   0.6835   0.7920   0.9452
12   Wine                   0.9230   0.9437   0.9812
Table 5. The performances of average NMI value.

ID   Data Sets              K-Means   NJW      Voting   CSPA
1    Cardiotocography       0.3305    0.3314   0.3772   0.3256
2    Congressional voting   0.4574    0.4462   0.4541   0.4640
3    Dermatology            0.8130    0.8226   0.9133   0.8614
4    Forrest                0.5424    0.5428   0.5439   0.4898
5    Landsat                0.6125    0.6028   0.6482   0.6134
6    Optical recognition    0.7335    0.7312   0.7750   0.7385
7    Synthetic              0.7418    0.7587   0.8035   0.7292
8    Urban Land Cover       0.5770    0.5239   0.6029   0.5491
9    Vehicle                0.1126    0.0844   0.1418   0.1058
10   Waveform               0.3642    0.3668   0.2682   0.3639
11   Wdbc                   0.6232    0.6075   0.6190   0.6175
12   Wine                   0.8249    0.8782   0.8369   0.8354
Table 6. The performances of average ARI value.

ID   Data Sets              K-Means   NJW      Voting   CSPA
1    Cardiotocography       0.1611    0.1547   0.2700   0.1567
2    Congressional voting   0.5287    0.5263   0.5381   0.5368
3    Dermatology            0.6544    0.7487   0.8446   0.7685
4    Forrest                0.4956    0.4904   0.4952   0.4299
5    Landsat                0.5264    0.4961   0.5725   0.5276
6    Optical recognition    0.6358    0.6583   0.7215   0.6525
7    Synthetic              0.5902    0.6336   0.7135   0.5730
8    Urban Land Cover       0.4587    0.4021   0.5245   0.4181
9    Vehicle                0.0800    0.0657   0.0965   0.0829
10   Waveform               0.2535    0.2496   0.2178   0.2524
11   Wdbc                   0.7302    0.7251   0.7259   0.7242
12   Wine                   0.8367    0.8992   0.8564   0.8449
Table 7. The performances of average Acc value.

ID   Data Sets              K-Means   NJW      Voting   CSPA
1    Cardiotocography       0.3701    0.3384   0.5484   0.3653
2    Congressional voting   0.8610    0.8631   0.8671   0.8667
3    Dermatology            0.6996    0.8077   0.8508   0.8117
4    Forrest                0.7795    0.7540   0.7807   0.7130
5    Landsat                0.6682    0.6827   0.7429   0.6707
6    Optical recognition    0.7556    0.7716   0.8450   0.7654
7    Synthetic              0.6685    0.7475   0.8337   0.6697
8    Urban Land Cover       0.6119    0.5838   0.7333   0.5910
9    Vehicle                0.3691    0.3738   0.4243   0.3845
10   Waveform               0.5011    0.5049   0.5607   0.5013
11   Wdbc                   0.9279    0.9262   0.9267   0.9262
12   Wine                   0.9409    0.9663   0.9517   0.9292
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Citation: Wu, T.; Fan, J.; Wang, P. An Improved Three-Way Clustering Based on Ensemble Strategy. Mathematics 2022, 10, 1457. https://doi.org/10.3390/math10091457