Article

Combined Gaussian Mixture Model and Pathfinder Algorithm for Data Clustering

1 College of Artificial Intelligence, Guangxi Minzu University, Nanning 530006, China
2 Guangxi Key Laboratory of Hybrid Computation and IC Design Analysis, Nanning 530006, China
* Author to whom correspondence should be addressed.
Entropy 2023, 25(6), 946; https://doi.org/10.3390/e25060946
Submission received: 6 April 2023 / Revised: 2 June 2023 / Accepted: 5 June 2023 / Published: 16 June 2023
(This article belongs to the Section Signal and Data Analysis)

Abstract

Data clustering is one of the most influential branches of machine learning and data analysis, and Gaussian Mixture Models (GMMs) are frequently adopted in data clustering due to their ease of implementation. However, this approach has certain limitations that need to be acknowledged. GMMs require the number of clusters to be set manually, and they may fail to extract the information within the dataset during initialization. To address these issues, we propose a new clustering algorithm called PFA-GMM. PFA-GMM is based on GMMs and the Pathfinder algorithm (PFA), and it aims to overcome the shortcomings of GMMs. The algorithm automatically determines the optimal number of clusters based on the dataset. Subsequently, PFA-GMM treats the clustering problem as a global optimization problem so as to avoid getting trapped in local convergence during initialization. Finally, we conducted a comparative study of our proposed clustering algorithm against other well-known clustering algorithms using both synthetic and real-world datasets. The results of our experiments indicate that PFA-GMM outperformed the competing approaches.

1. Introduction

Clustering is a fundamental tool in machine learning that facilitates the extraction of underlying similarities from data by grouping similar data points into clusters based on their features. Clustering has been widely applied in diverse domains, ranging from network analysis and business to marketing, education, data science, and medical diagnosis [1]. There are two primary approaches to clustering: partitional clustering and hierarchical clustering [2]. Each method has its own advantages and disadvantages, and the appropriate method should be chosen based on the characteristics of the data and the problem at hand [3]. Clustering can be conceptualized as an optimization problem that aims to partition the data into distinct groups such that the inter-cluster distance is maximized and the intra-cluster distance is minimized. This process is executed using various algorithms that may produce different clustering outcomes [4]. Gaussian Mixture Models (GMMs) are a powerful statistical tool for analyzing and clustering complex datasets that do not conform to a single distribution. One issue with GMMs is that they require the number of components to be specified in advance; choosing an inappropriate number of components can lead to overfitting or underfitting, which compromises the accuracy of the model. In other words, manual selection of the number of clusters is necessary for GMMs. In addition, GMMs may fail to capture complex patterns within the data during initialization, which can result in suboptimal performance. To enhance the performance of GMMs, it is therefore crucial to explore techniques and algorithms that can capture complex patterns more efficiently [5].
The shortcomings of GMMs can be summarized as follows:
  • To perform clustering with GMMs, the number of clusters must be configured manually. This task can be challenging and can significantly affect the outcome of the clustering process;
  • The initialization phase of GMMs may encounter difficulties in capturing complex structures within the data and become trapped in local convergence, leading to suboptimal clustering performance.
The Pathfinder algorithm (PFA) is introduced as a remedy for these deficiencies of GMMs, owing to its strong global search ability and its simple yet effective concept, which enables it to converge to the optimal solution in a relatively small number of iterations [6]. We propose a novel clustering algorithm called PFA-GMM, which combines GMMs and PFA to address the aforementioned issues.
  • To address the issue of Gaussian Mixture Models potentially getting stuck in local convergence during initialization, we introduce PFA, with its powerful global search, into the clustering analysis. PFA identifies the optimal solution during initialization, thereby avoiding the local convergence trap;
  • To address the issue of manually selecting the number of clusters in GMMs, we employ the Davies–Bouldin Index as the fitness function of PFA, allowing the number of clusters to be selected automatically.
The paper is structured as follows: Section 2 provides a summary of the related work relevant to this study. Section 3 presents the theoretical basis and some related concepts. In Section 4, we propose a new clustering algorithm based on GMMs and PFA called PFA-GMM. Section 5 analyzes the experimental results on different datasets. Finally, we provide a summary of our work in Section 6.

2. Related Work

Finite mixture modeling is a statistical technique that examines whether model parameters vary over unmeasured groups of individuals. The goal is to estimate the parameters of a mixture distribution, which is a probability distribution that results from the combination of two or more probability distributions [7,8].
Gaussian Mixture Models (GMMs) are a well-known method for modeling probability distributions of continuous variables. They are flexible and powerful tools for a wide range of applications, such as image processing, speech recognition, and data clustering. Chen et al. [9] established the information-theoretic threshold for the separation of cluster centers, which ensures the exact retrieval of cluster labels in a K-component Gaussian mixture model with equal cluster sizes. Qu et al. [10] proposed a novel GMM-based algorithm for anomaly detection in hyperspectral images, which aims to improve the detection accuracy of anomalous pixels, together with an effective GMM-based weighting approach for fusing the extracted anomaly results. Fu et al. [11] introduced a novel embedded feature selection approach for GMMs by incorporating a relevancy index, a metric that quantifies the probability of assigning a data point to a specific cluster and thereby facilitates feature selection. Patel and Kushwaha [12] compared the cluster representativeness of K-means and GMMs for heterogeneity in the resource usage of cloud workloads. Their experimental results suggest that GMMs are superior to K-means when clustering with distinct usage boundaries: although GMMs require more computational time, they are more effective for fine-grained workload characterization and analysis.
Projection pursuit is a statistical technique used for exploratory data analysis, information visualization, and feature selection [13]. It is a broader framework that has a form of an additive model of the derived features rather than the inputs themselves [14].
Nature-inspired algorithms, also known as metaheuristic algorithms, have gained widespread popularity in the field of engineering due to their remarkable ability to solve complex problems. These algorithms seek the optimal solution by balancing exploitation and exploration in the search space [15]. Compared to traditional clustering, which is susceptible to local convergence and initialization, metaheuristic algorithms have a higher probability of reaching the global optimum [16]. Yapici and Cetinkaya [6] proposed the Pathfinder algorithm in 2019, a swarm intelligence algorithm inspired by the leadership behavior of animal swarms during hunting. Animals living in groups often make decisions based on the social hierarchy among members and may need to make decisions with or without a leader. Varaprasad et al. [17] utilized the Pathfinder algorithm to optimize the allocation and integration of a solar photovoltaic system. The Pathfinder algorithm was applied to determine the optimal configuration of Interline-Photovoltaic (I-PV) systems among multiple feeders to enhance the performance and resilience of the distribution system operation and control while maintaining various operational and radiality constraints. Oladipo et al. [18] employed an innovative approach to optimize the control efficiency of electrical systems: they combined the flower pollination algorithm (FPA) with the Pathfinder algorithm (PFA) to achieve maximum efficiency, leveraging the PFA's strong search for optimal solutions to exploit the full potential of the combined approach. Tang et al. [19] devised a novel approach that combines the teaching–learning-based optimization algorithm (TLBO) with the Pathfinder algorithm (PFA) to enhance the exploration and mining abilities of the original algorithm. The pathfinder of PFA acts as the teacher in the TLBO teaching phase, which increases the exploration ability of the algorithm, while the followers of PFA perform the TLBO learning phase within the mathematical model of PFA, which enhances its mining ability.

3. Theoretical Background

We provide a comprehensive introduction to the Pathfinder algorithm and Gaussian Mixture Models.

3.1. Pathfinder Algorithm

The searching, exploiting, and hunting abilities of animal swarms have always been a focus of interest for many scientists, and all behaviors in a swarm are carried out on the basis of the common action of all individuals [20]. Mimicking this social movement within animal species, the Pathfinder algorithm simulates the search for prey or feeding areas under the guidance of a leader within animal herds. There is a leader corresponding to multiple members of the population, and the members follow the leader according to the location of a neighbor and the behavior of the leader [21]. The algorithm begins by randomly initializing the positions of the herd members. Afterward, the fitness of each individual is calculated, and the position of the individual with the best fitness is chosen as the pathfinder to be followed [22]. The pathfinder updates its location through Equation (1), and the members update their locations through Equation (2).
$x_p^{K+1} = x_p^{K} + 2 r_3 \left( x_p^{K} - x_p^{K-1} \right) + A$   (1)
$x_i^{K+1} = x_i^{K} + R_1 \left( x_j^{K} - x_i^{K} \right) + R_2 \left( x_p^{K} - x_i^{K} \right) + \varepsilon, \quad i \ge 2$   (2)
where $K$ represents the current iteration, $x_i^K$ is the position vector of the $i$-th member, $x_p^K$ is the position of the pathfinder, and $x_j^K$ is the position vector of the $j$-th member. $R_1 = \alpha r_1$ and $R_2 = \beta r_2$, where $r_1$, $r_2$, $r_3$ are random variables uniformly generated in the range $[0, 1]$, and $\alpha$ and $\beta$ are randomly selected in the range $[1, 2]$ over the course of the iterations. $A$ and $\varepsilon$ represent the random movement of the pathfinder and the members, respectively, and are given by Equations (3) and (4):
$A = u_2 \, e^{-\frac{2K}{K_{max}}}$   (3)
$\varepsilon = \left( 1 - \frac{K}{K_{max}} \right) u_1 D_{ij}, \quad D_{ij} = \left\lVert x_i - x_j \right\rVert$   (4)
where the variables $u_1$ and $u_2$ are set in the range $[-1, 1]$, which also allows members to move back toward their previous positions, $D_{ij}$ is the distance between two members, and $K_{max}$ represents the maximum number of iterations. Locating the global optimum of an optimization problem is a significant challenge. We therefore assume that the best solution detected so far is the global optimum and treat it as the food or hunting area to be exploited by the herd. In the Pathfinder algorithm, the objective is to locate the optimal food source or hunting ground, i.e., the global optimum. During each iteration, the pathfinder's position is designated as the current optimal location, and the other members of the group converge toward it. The pseudo-code is given as follows (Algorithm 1).
Algorithm 1: Pseudo-code of the Pathfinder algorithm
1.  Initialize the population
2.  Calculate the fitness of initial population
3.  Find the Pathfinder
4.  While K < maximum number of iterations
5.           α and β = random number in [1, 2]
6.           Update the position of Pathfinder using Equation (1)
7.           If new Pathfinder is better than old
8.                   Update Pathfinder
9.            End
10.          For i = 2 to maximum number of populations
11.                  Update positions of members using Equation (2)
12.          End
13.          Calculate new fitness of members
14.          Find the best fitness
15.          If best fitness < fitness of Pathfinder
16.                  Pathfinder = best member
17.                  Fitness = best fitness
18.          End
19.          For i = 2 to maximum number of populations
20.                  If new fitness of member(i) < fitness of member(i)
21.                          Update members
22.                  End
23.          End
24.          Generate new A and ε
25. End
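For readers who want to experiment with the update rules, the following is a minimal NumPy sketch of Equations (1)-(4) applied to a generic minimization problem. The function signature, search bounds, and greedy acceptance details are our own illustrative assumptions rather than the authors' implementation.

```python
# A minimal NumPy sketch of the PFA updates in Equations (1)-(4); names and
# parameter defaults are illustrative assumptions, not the authors' code.
import numpy as np

def pfa_minimize(fitness, dim, pop_size=10, k_max=100, lb=-10.0, ub=10.0, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(lb, ub, size=(pop_size, dim))        # random herd positions
    fit = np.array([fitness(x) for x in X])
    p = X[fit.argmin()].copy()                           # pathfinder = best member
    p_prev, p_fit = p.copy(), fit.min()

    for k in range(1, k_max + 1):
        alpha, beta = rng.uniform(1, 2, size=2)
        # Equation (1): pathfinder move with random vibration A from Equation (3)
        A = rng.uniform(-1, 1, dim) * np.exp(-2 * k / k_max)
        p_new = np.clip(p + 2 * rng.random() * (p - p_prev) + A, lb, ub)
        f_new = fitness(p_new)
        if f_new < p_fit:
            p_prev, p, p_fit = p, p_new, f_new
        # Equation (2): follower moves (i >= 2 in the paper's 1-based indexing),
        # guided by a random neighbour j and the pathfinder, with eps from Equation (4)
        for i in range(1, pop_size):
            j = rng.integers(pop_size)
            R1, R2 = alpha * rng.random(), beta * rng.random()
            eps = (1 - k / k_max) * rng.uniform(-1, 1, dim) * np.linalg.norm(X[i] - X[j])
            cand = np.clip(X[i] + R1 * (X[j] - X[i]) + R2 * (p - X[i]) + eps, lb, ub)
            f_cand = fitness(cand)
            if f_cand < fit[i]:                          # keep improving moves only
                X[i], fit[i] = cand, f_cand
        if fit.min() < p_fit:                            # best member becomes pathfinder
            p_prev, p, p_fit = p, X[fit.argmin()].copy(), fit.min()
    return p, p_fit

# Example: minimise the sphere function in five dimensions
best_x, best_f = pfa_minimize(lambda x: float(np.sum(x ** 2)), dim=5)
```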

3.2. Gaussian Mixture Models and Expectation Maximization

Gaussian Mixture Models (GMMs) are one of many unsupervised clustering techniques that are typically trained using an Expectation–Maximization (EM) algorithm to maximize the likelihood [23]. In GMMs, each cluster is considered an independent Gaussian distribution, also known as a normal distribution. This approach is used to assign data points to clusters based on the probability distribution [24], and the Gaussian distribution is defined as Equation (5):
$N(X \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2} \left| \Sigma \right|^{1/2}} \exp\left( -\frac{(X - \mu)^{T} \Sigma^{-1} (X - \mu)}{2} \right)$   (5)
where $\mu$ is a $D$-dimensional mean vector, and $\Sigma$ is a $D \times D$ covariance matrix. The Gaussian distribution is a probability distribution defined by its mean and covariance. While a unimodal Gaussian distribution is inadequate to represent the multiple density regions found in multimodal datasets, such datasets can be effectively modeled using GMMs. The GMMs are defined as Equation (6):
$P(X) = \sum_{k=1}^{K} \pi_k \, N(X \mid \mu_k, \Sigma_k)$   (6)
where $K$ is the number of Gaussian mixture components, $\pi_k$ is the weight of the $k$-th Gaussian distribution (with the $\pi_k$ summing to one), $\mu_k$ denotes the mean of the $k$-th Gaussian distribution, and $\Sigma_k$ represents the covariance matrix of the $k$-th Gaussian distribution. The Gaussian mixture distribution comprises a convex combination of Gaussian distributions, providing significantly more flexibility to model complex densities than a single Gaussian distribution. To obtain the maximum likelihood estimate, we derive the log-likelihood function as Equation (7):
$\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k \, N(x_n \mid \mu_k, \Sigma_k)$   (7)
Because the summation over components appears inside the logarithm, we cannot obtain a closed-form solution through maximum likelihood estimation. However, the problem can be simplified by introducing a latent variable, which leads to the EM algorithm for GMMs. The E-step and M-step for GMMs are given by Equations (8) and (9), respectively.
$Q_i(z_{ik}) = p(z_{ik} \mid x_i; \theta) = \frac{\pi_k \, N(x_i \mid \mu_k, \Sigma_k)}{\sum_{k'=1}^{K} \pi_{k'} \, N(x_i \mid \mu_{k'}, \Sigma_{k'})}$   (8)
$\pi_k = \frac{\sum_{i=1}^{M} Q_i(z_{ik})}{M}, \quad \mu_k = \frac{\sum_{i=1}^{M} x_i \, Q_i(z_{ik})}{\sum_{i=1}^{M} Q_i(z_{ik})}, \quad \Sigma_k = \frac{\sum_{i=1}^{M} (x_i - \mu_k)(x_i - \mu_k)^{T} \, Q_i(z_{ik})}{\sum_{i=1}^{M} Q_i(z_{ik})}$   (9)
The fundamental principle of the EM algorithm is to revise one parameter at a time while keeping the others fixed. The algorithm proceeds iteratively by calculating the E-step and M-step until convergence is reached.
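As a concrete illustration of Equations (8) and (9), the following NumPy sketch runs the E-step and M-step for a GMM. The initialization scheme, the fixed iteration count, and the small regularization term added to the covariances are simplifying assumptions for this sketch, not part of the original formulation.

```python
# A compact sketch of EM for a GMM: E-step (Equation (8)) and M-step (Equation (9)).
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(k, 1.0 / k)                               # mixing weights
    mu = X[rng.choice(n, k, replace=False)]                # means from random points
    sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d)] * k) # shared initial covariance
    for _ in range(n_iter):
        # E-step: responsibilities Q_i(z_ik)
        dens = np.column_stack([
            pi[j] * multivariate_normal.pdf(X, mu[j], sigma[j]) for j in range(k)
        ])
        q = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate pi_k, mu_k, Sigma_k from the responsibilities
        nk = q.sum(axis=0)
        pi = nk / n
        mu = (q.T @ X) / nk[:, None]
        for j in range(k):
            diff = X - mu[j]
            sigma[j] = (q[:, j, None] * diff).T @ diff / nk[j] + 1e-6 * np.eye(d)
    return pi, mu, sigma, q.argmax(axis=1)                 # hard labels from posteriors
```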

4. Pathfinder Algorithm Based on GMMs for Data Clustering

In this paper, we propose PFA-GMM, a new clustering algorithm that integrates the Pathfinder algorithm with GMMs for data clustering. As GMMs are highly susceptible to initialization, we introduce the Pathfinder algorithm to mitigate this issue. To measure the intra-cluster compactness and inter-cluster separation, we employ the Davies–Bouldin (DB) index as the fitness function [25].

4.1. Internal Validation Criteria

We can categorize criteria for evaluating the quality of clusters into two types: internal validation and external validation. These criteria assess the effectiveness of clustering and help validate the resulting clusters [26]. The fundamental distinction between internal and external validation of cluster analysis lies in the use of external information. In practical applications, external information such as clustering labels may be unavailable; consequently, internal validation, which relies only on the information within the dataset, is often the sole recourse for evaluating clusters [27]. Internal validation evaluates the effectiveness of clustering algorithms by assessing inter-cluster separation and intra-cluster compactness.
  • Compactness: It measures the proximity of intra-cluster data. Variance has been utilized for measuring compactness in some methods, with lower variance representing better compactness. Similarity also has been applied to measure compactness, such as pairwise distance;
  • Separation: The inter-cluster data distinction is measured using a similarity metric to determine the distance between sets of clusters. This distance is used to evaluate separation, such as the pairwise distance of intra-cluster data points and the distance between cluster centroids.
Davies–Bouldin (DB) index: The DB index is calculated as the average similarity of each cluster with the cluster most similar to it, where similarity relates intra-cluster scatter to inter-cluster separation. Clustering results improve as this index is minimized: the lower the Davies–Bouldin Index, the better the clusters are separated, and the better the clustering result. This reduction in the DB value highlights the distinctiveness of each cluster, thereby producing an optimal clustering outcome. The Davies–Bouldin Index is defined as follows:
$DB = \frac{1}{c} \sum_{i=1}^{c} \max_{i \ne j} \frac{d(x_i) + d(x_j)}{d(c_i, c_j)}$
where $c$ is the number of clusters; $i$ and $j$ are cluster labels; $d(x_i)$ and $d(x_j)$ denote the average distance of all the entities in clusters $i$ and $j$ to their respective cluster centroids; and $d(c_i, c_j)$ is the distance between the cluster centroids.
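In practice, the DB index can be computed directly. The snippet below is a usage sketch on toy data; the toy data, the label rule, and the choice of scikit-learn as tooling are assumptions for illustration, not the authors' implementation.

```python
# Evaluate the DB index on a toy partition; lower values indicate better separation.
import numpy as np
from sklearn.metrics import davies_bouldin_score

X = np.random.rand(300, 2)                           # toy 2-D data
labels = (X[:, 0] > np.median(X[:, 0])).astype(int)  # toy partition with two clusters
print(davies_bouldin_score(X, labels))
```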

4.2. PFA-GMM

PFA-GMM comprises three components. First, the proposed method selects cluster centers from the dataset by calculating fitness values to determine the number of resulting clusters, and PFA-GMM introduces the Pathfinder algorithm for clustering analysis. The population is initialized based on the candidate cluster centers. The PFA’s global search ability is leveraged to discover multiple optimal solutions. These optimal solutions are then utilized as the initial cluster centers. By doing so, the PFA-GMM algorithm successfully accomplishes the task of automatically selecting the cluster centers, thus eliminating the subjectivity that arises from manual selection processes.
Second, the candidate solutions are iteratively applied to data clustering using the EM algorithm. By updating the population through the Pathfinder algorithm, it is possible to obtain the optimal solution during initialization and avoid getting trapped in local convergence.
Finally, the Pathfinder algorithm dynamically updates the population and the pathfinder until the termination condition is reached, while searching for an optimal solution for data clustering. Based on the above description, the main framework of PFA-GMM is presented below (Algorithm 2):
Algorithm 2: Pseudo-code of PFA-GMM
Input: The set of data points X = { x 1 , x 2 , , x n } , Maximum iterations, Population numbers
Output: The optimal clustering result
1.  Initialize the population
2.  Select data points randomly and determine the number of clusters C
3.  Generate the initial population and apply it to the GMM
4.  Calculate the fitness function according to EM
5.  Choose population with the best fitness value as Pathfinder
6.  While K < maximum number of iterations
7.       For i = 1 to maximum number of populations
8.              Update positions of members using Equation (2)
9.              Calculate fitness value of members through EM
10.      End
11.      If best fitness < fitness of Pathfinder
12.             Pathfinder = best member
13.             Fitness = best fitness
14.      End
15.      α and β = random number in [1, 2]
16.      Generate new A and ε
17.      Update the position of Pathfinder using Equation (1)
18.      If new Pathfinder is better than old
19.             Update Pathfinder
20.             Calculate fitness value
21.      End
22. Assign data points to final cluster centroids
Suppose that $N$ denotes the total number of points, $P$ the number of populations, $T$ the number of iterations, and $D$ the dimension of the dataset. The time complexity of PFA-GMM mainly depends on the following parts: (1) calculating the covariance matrix with the EM algorithm at the initialization stage is $O(D^3 P)$; (2) calculating the fitness values with the EM algorithm during the iterations is $O(T(P + D^3))$; and (3) assigning data points to clusters takes $O(N \log N)$. Thus, the overall time complexity is $O(D^3 P + T(P + D^3) + N \log N)$.
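To make the overall flow of Algorithm 2 concrete, the following skeleton shows how candidate solutions, the EM fitting, and the DB-index fitness could be wired together. The solution encoding (cluster centres drawn from the data for each candidate number of clusters) and the use of scikit-learn's GaussianMixture for the EM step are assumptions made for this sketch; the PFA update loop itself is the one sketched in Section 3.1, and a full implementation would evolve the population with it rather than stop after initialization.

```python
# Illustrative PFA-GMM skeleton (not the authors' code): DB index as fitness,
# candidate centres drawn from the data for each candidate number of clusters.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.metrics import davies_bouldin_score

def fitness(candidate, X, k):
    """Fit a GMM from candidate means and score the partition with the DB index."""
    means = candidate.reshape(k, X.shape[1])
    gmm = GaussianMixture(n_components=k, means_init=means, max_iter=20).fit(X)
    labels = gmm.predict(X)
    if len(np.unique(labels)) < 2:            # degenerate partition
        return np.inf
    return davies_bouldin_score(X, labels)

def pfa_gmm(X, k_candidates=range(2, 8), pop_size=10, seed=0):
    rng = np.random.default_rng(seed)
    best = (np.inf, None, None)
    for k in k_candidates:                    # candidate cluster numbers
        # population of candidate centre sets drawn from the data points
        pop = np.stack([X[rng.choice(len(X), k, replace=False)].ravel()
                        for _ in range(pop_size)])
        # a full implementation would now evolve `pop` with the PFA updates
        # (Equations (1)-(4)) and re-evaluate the fitness at every iteration
        scores = np.array([fitness(c, X, k) for c in pop])
        if scores.min() < best[0]:
            best = (scores.min(), k, pop[scores.argmin()])
    return best                               # (DB index, chosen k, centre encoding)
```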

5. Experimental Results and Analysis

In this section, we validate the performance of PFA-GMM and compare it with other related clustering algorithms, including K-means [2], DBSCAN [28], GMM [23], Fuzzy C-means (FCM) [29], and Affinity Propagation (AP) [30]. To ensure a comprehensive comparison, we also combine K-means with the Pathfinder algorithm (PFA-KM). In addition, we adopt several metaheuristic algorithms, namely Particle Swarm Optimization (PSO) [31], PSO-FCM [32], the Genetic Algorithm (GA) [33], the Artificial Bee Colony (ABC) [34], and Differential Evolution (DE) [35]. The parameters of the compared algorithms are listed in Table 1. All datasets used in this section were obtained from the UCI repository and from synthetic benchmark datasets [36]. Furthermore, the numerical experiments used to verify the effectiveness of the proposed algorithm report the mean, the standard deviation, and the Wilcoxon rank-sum test [37] of the fitness values. The maximum number of iterations and the population size are set to 100 and 10, respectively. All algorithms were coded in Python on Windows 10 with an AMD Ryzen 5 2500U @ 2 GHz and 8 GB of RAM.

5.1. Clustering Evaluation Criteria

We have adopted four evaluation indices to test the performance of the clustering results, including Accuracy (ACC) [38], Normalized Mutual Information (NMI) [39], Rand Index (RI) [40], and Adjusted Rand Index (ARI) [41].
Accuracy (ACC): ACC is the proportion of correctly assigned labels to actual labels, and its range is [0, 1]. An ACC of 1 indicates that all the correct labels have been found by the algorithm. The ACC is defined as follows:
$ACC = \frac{\sum_{i=1}^{K} x_i}{N}$
where $x_i$ denotes the samples classified correctly, $K$ is the number of clusters, and $N$ is the number of data points in the dataset.
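Because predicted cluster labels are arbitrary, ACC is usually computed after matching clusters to classes with an assignment algorithm (cf. [38]). The helper below is an illustrative sketch using SciPy's Hungarian solver; the function name and the assumption of integer labels starting at zero are ours.

```python
# Clustering accuracy via optimal cluster-to-class matching (Hungarian algorithm).
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        cost[t, p] += 1                        # contingency table of labels
    rows, cols = linear_sum_assignment(-cost)  # best one-to-one label mapping
    return cost[rows, cols].sum() / len(y_true)
```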
Normalized Mutual Information (NMI): NMI is a measure of the mutual dependence between two random variables, which is expressed as a ratio of the Mutual Information score and the average entropy of the two variables. Mutual Information measures the amount of information obtained about one random variable by observing the other variable, while entropy quantifies the expected amount of information held in a random variable. The NMI is defined as follows:
$NMI(X, Y) = \frac{I(X, Y)}{\sqrt{H(X)\,H(Y)}}$
where $I(X, Y)$ represents the mutual information of variables $X$ and $Y$, and $H(X)$ and $H(Y)$ are the information entropies of $X$ and $Y$, respectively. The NMI score ranges between 0 and 1, where a score of 1 indicates perfect correlation and a score of 0 indicates no correlation.
Rand Index (RI): RI is a similarity measure used to compare two different clustering methods. RI combines two sources of information, object pairs put together and object pairs assigned to different clusters in both partitions.
$RI = \frac{a + b}{\binom{n}{2}}$
RI can be viewed as a measure of binary classification accuracy over the pairs of elements in a set, where $a$ is the number of pairs correctly labeled as belonging to the same subset, $b$ is the number of pairs correctly labeled as belonging to different subsets, and $\binom{n}{2}$ is the number of unordered pairs in a set of $n$ elements. RI takes a value between 0 and 1, where 0 indicates that the two clusterings do not agree on any pair of elements, and 1 indicates that they agree perfectly on every pair of elements.
Adjusted Rand Index (ARI): The ARI is a corrected-for-chance version of the Rand Index. The ARI rescales the index, taking into account that random chance will cause some objects to occupy the same clusters, so the RI will never actually be zero. The ARI is an improvement over the RI because it considers the expected value of the RI for random clustering.
$ARI = \frac{RI - E[RI]}{1 - E[RI]}$
where RI is the Rand Index and E(RI) is the expected value of the Rand index when the partitions are made at random while keeping the same marginal clustering distributions.
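For the remaining external indices, off-the-shelf implementations exist. The snippet below is a usage sketch with placeholder labels; the choice of scikit-learn is an assumption about tooling, not the paper's evaluation script.

```python
# NMI, RI, and ARI on toy label vectors.
from sklearn.metrics import normalized_mutual_info_score, rand_score, adjusted_rand_score

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
print(normalized_mutual_info_score(y_true, y_pred))  # NMI in [0, 1]
print(rand_score(y_true, y_pred))                    # RI in [0, 1]
print(adjusted_rand_score(y_true, y_pred))           # ARI, chance-corrected
```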

5.2. Experiments on Synthetical Datasets

Six synthetic datasets are selected for variety (the numbers of instances and features are given with each dataset below). The clustering results are displayed as color figures in Figure 1, Figure 2, Figure 3, Figure 4, Figure 5 and Figure 6, and the best value for each dataset is highlighted in bold. Table 2 shows the ARI, NMI, and ACC values obtained by the proposed algorithm and the compared algorithms on the synthetic datasets.
Table 2 shows that PFA-KM achieved superior results on the Jain and Pathbased datasets, while PFA-GMM obtained the best results on the S2, Compound, Four Lines, and Aggregation datasets. In addition, the GMM algorithm leverages posterior probabilities for a soft assignment of data points to clusters.
In the Aggregation dataset, K-means and its variants fail to consider the connectivity between clusters. K-means allocates data points to the nearest cluster center without regard for the boundaries between the points and the centers. As a result, PFA-GMM produces the most accurate clustering results among the algorithms tested, although it still misallocates some boundary points to the wrong clusters.
In the S2 dataset, the PFA-GMM algorithm outperformed the other algorithms. In contrast, the K-means algorithm failed to correctly differentiate data points in closely related clusters, which it treated as coming from the same distribution. This issue led to a poor clustering result, and the other algorithms that share the same weakness regarding cluster differentiation performed similarly.
In the Compound dataset, PFA-GMM obtained a better result than the other algorithms, although it is not completely consistent with the actual distribution of the dataset. Most of the algorithms failed to differentiate the clusters in the lower left corner of Figure 2.
In the Four Lines dataset, while K-means and its variants fail to accurately discriminate clusters due to their allocation of some data points to the wrong cluster centers based on the mean of data points, PFA-GMM results in the closest approximation to the actual distribution of the dataset.
In the Pathbased dataset, PFA-KM obtained a better result than the other algorithms. Most of the algorithms cannot assign the data points correctly because this dataset contains manifold clusters, which they cannot differentiate accurately. K-means takes the mean of the data points as a candidate center, which fails to capture such clusters and allocates data points incorrectly.
In the Jain dataset, the clustering results of PFA-KM seem to closely resemble the actual distribution of the data. However, handling this dataset can be challenging due to the presence of non-convex clusters and closely related clusters. Most algorithms struggle to handle such datasets.
To evaluate the performance of the PFA-GMM algorithm, we not only employ the ACC, ARI, and NMI values but also use the Davies–Bouldin (DB) index and the Rand Index (RI) to compare the different algorithms. The Rand Index measures the similarity between two clusterings and ranges from 0 to 1, while a DB index value closer to 0 indicates a better result. Table 3 shows the RI and DB index values obtained by the proposed algorithm and the compared algorithms on the synthetic datasets.

5.3. Experiments on Real-World Datasets

Ten datasets from the UCI machine learning repository were used to evaluate algorithm performance. Each algorithm was run independently ten times, and the best value, worst value, mean, and standard deviation of the fitness values were calculated and are displayed in Table 4. Table 5 shows the Wilcoxon rank-sum test results for our proposed algorithm against the other algorithms.
In this experiment, the results indicate that the optimal value of PFA-GMM on Glass, Liver, Wine, and Zoo datasets is superior to that of other algorithms. Similarly, the optimal value of ABC on Spect, Seeds, Iris, Breast, and Banknote datasets is better than that of other algorithms. Additionally, the GA algorithm outperforms other algorithms on the Heart dataset. On the other hand, the worst value of PFA-GMM on Glass, Heart, Liver, and Wine datasets is lower than that of other algorithms, while the worst value of ABC on Spect, Seeds, Iris, and Breast datasets is lower than other algorithms. In the case of PFA-KM, the worst value is lower than other algorithms on Zoo and Banknote datasets.
Furthermore, the mean of ABC on the Spect, Seeds, Iris, Breast, and Banknote datasets is better than that of the other algorithms. In contrast, the mean of PFA-GMM on the Glass, Heart, Liver, and Wine datasets is better than that of the other algorithms. Additionally, PFA-KM performs better than the other algorithms on the Zoo dataset. Overall, PFA-GMM and ABC achieve the best mean fitness values on the majority of the datasets.
The standard deviation of PFA-GMM on Spect, Glass, and Liver datasets is superior to that of other algorithms. Similarly, the standard deviation of ABC is better than other algorithms on the Seeds dataset, and the standard deviation of PSO-FCM is better than other algorithms on the Iris dataset. Lastly, the standard deviation of PFA-KM is better than other algorithms on the Breast, Heart, Wine, Zoo, and Banknote datasets.
The Wilcoxon rank-sum test was used to analyze the fitness results of the 10 independent runs. Table 5 shows the p-values generated by the test for the pairwise comparisons between PFA-GMM and the other algorithms. The test evaluates the null hypothesis that the two samples are drawn from the same continuous distribution against the alternative that they differ. The results are considered statistically significant if the values in Table 5 are below 0.05.
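The pairwise test itself can be reproduced with SciPy; the two arrays below are placeholder fitness values from 10 runs, not the paper's actual results.

```python
# Wilcoxon rank-sum test between two sets of 10-run fitness values (placeholders).
from scipy.stats import ranksums

pfa_gmm_runs = [0.28, 0.27, 0.29, 0.28, 0.30, 0.27, 0.28, 0.29, 0.28, 0.27]
other_runs   = [0.51, 0.48, 0.55, 0.50, 0.49, 0.52, 0.53, 0.47, 0.50, 0.51]
stat, p_value = ranksums(pfa_gmm_runs, other_runs)
print(p_value < 0.05)   # True indicates a statistically significant difference
```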

6. Conclusions

In this paper, we introduce a novel clustering method that combines the Pathfinder algorithm with the GMM algorithm. The proposed algorithm aims to leverage the strengths of both algorithms to enhance the clustering performance. To evaluate the clustering performance, we compared our method, named PFA-GMM, with traditional clustering algorithms and swarm intelligence algorithms. We adopted the ACC, NMI, and ARI clustering criteria to evaluate performance on six synthetic datasets. Moreover, we calculated the fitness values to test the performance of PFA-GMM on ten UCI datasets. Our proposed algorithm outperforms the other algorithms on most datasets.
It is worth noting that PFA-GMM is an iterative algorithm, and therefore its running time on large sample datasets will be longer than that of traditional clustering methods. In future research, we will focus on improving the time complexity of this algorithm. Furthermore, we aim to enhance its accuracy on non-convex data and improve its performance on complex datasets. To achieve this, we plan to study allocation strategies that assign data points correctly and enhance the algorithm's ability to cope with datasets of different shapes and densities.

Author Contributions

Software, Z.L.; Methodology, Y.Z.; Writing—original draft, H.H.; Writing—review & editing, X.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Natural Science Foundation of China (62266007, 61662005); the Guangxi Natural Science Foundation (2021GXNSFAA220068, 2018GXNSFAA294068).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Hosseinzadeh, M.; Hemmati, A.; Rahmani, A.M. Clustering for smart cities in the internet of things: A review. Clust. Comput. 2022, 25, 4097–4127.
  2. Jain, A.K. Data clustering: 50 years beyond k-means. Pattern Recognit. Lett. 2010, 31, 651–666.
  3. Saraswathi, S.; Sheela, M.I. A comparative study of various clustering algorithms in data mining. Int. J. Comput. Sci. Mob. Comput. 2014, 11, 422–428.
  4. Tao, X.; Guo, W.; Li, X.; He, Q.; Liu, R.; Zou, J. Fitness peak clustering based dynamic multi-swarm particle swarm optimization with enhanced learning strategy. Expert Syst. Appl. 2022, 191, 116301.
  5. Xie, Y.; Shekhar, S.; Li, Y. Statistically-robust clustering techniques for mapping spatial hotspots: A survey. ACM Comput. Surv. (CSUR) 2022, 55, 1–38.
  6. Yapici, H.; Cetinkaya, N. A new metaheuristic optimizer: Pathfinder algorithm. Appl. Soft Comput. 2019, 78, 545–568.
  7. Loperfido, N. Kurtosis-Based Projection Pursuit for Outlier Detection in Financial Time Series. Eur. J. Financ. 2020, 26, 142–164.
  8. Peña, D.; Prieto, F.J. Cluster identification using projections. J. Am. Statist. Assoc. 2001, 96, 1433–1445.
  9. Chen, X.; Yang, Y. Cutoff for exact recovery of Gaussian Mixture Models. IEEE Trans. Inf. Theory 2021, 67, 4223–4238.
  10. Qu, J.; Du, Q.; Li, Y.; Tian, L.; Xia, H. Anomaly detection in hyperspectral imagery based on Gaussian mixture model. IEEE Trans. Geosci. Remote Sens. 2020, 59, 9504–9517.
  11. Fu, Y.; Liu, X.; Sarkar, S.; Wu, T. Gaussian mixture model with feature selection: An embedded approach. Comput. Ind. Eng. 2021, 152, 107000.
  12. Patel, E.; Kushwaha, D.S. Clustering cloud workloads: K-means vs gaussian mixture model. Procedia Comput. Sci. 2020, 171, 158–167.
  13. Hennig, C. Asymmetric Linear Dimension Reduction for Classification. J. Comput. Graph. Stat. 2004, 13, 930–945.
  14. De Luca, G.; Loperfido, N. A Skew-in-Mean GARCH Model for Financial Returns. In Skew-Elliptical Distributions and Their Applications: A Journey Beyond Normality; CRC/Chapman & Hall: Boca Raton, FL, USA, 2004; pp. 205–222.
  15. Abu Khurma, R.; Aljarah, I.; Sharieh, A.; Abd Elaziz, M.; Damaševičius, R.; Krilavičius, T. A review of the modification strategies of the nature inspired algorithms for feature selection problem. Mathematics 2022, 10, 464.
  16. Abd Elaziz, M.; Al-Qaness, M.A.A.; Abo Zaid, E.O.; Lu, S.; Ali Ibrahim, R.; Ewees, A.A. Automatic clustering method to segment COVID-19 CT images. PLoS ONE 2021, 16, e0244416.
  17. Janamala, V.; Reddy, D.S. Coyote optimization algorithm for optimal allocation of interline–Photovoltaic battery storage system in islanded electrical distribution network considering EV load penetration. J. Energy Storage 2021, 41, 102981.
  18. Oladipo, S.; Sun, Y.; Wang, Z. Application of a new fusion of flower pollinated with Pathfinder algorithm for AGC of multi-source interconnected power system. IEEE Access 2021, 9, 94149–94168.
  19. Tang, C.; Zhou, Y.; Tang, Z.; Luo, Q. Teaching-learning-based Pathfinder algorithm for function and engineering optimization problems. Appl. Intell. 2021, 51, 5040–5066.
  20. Lee, P.C.; Moss, C.J. Wild female African elephants (Loxodonta africana) exhibit personality traits of leadership and social integration. J. Comp. Psychol. 2012, 126, 224.
  21. Peterson, R.O.; Jacobs, A.K.; Drummer, T.D.; Mech, L.D.; Smith, D.W. Leadership behavior in relation to dominance and reproductive status in gray wolves, Canis lupus. Can. J. Zool. 2002, 80, 1405–1412.
  22. Priyadarshani, S.; Subhashini, K.R.; Satapathy, J.K. Pathfinder algorithm optimized fractional order tilt-integral-derivative (FOTID) controller for automatic generation control of multi-source power system. Microsyst. Technol. 2021, 27, 23–35.
  23. Bilmes, J.A. A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Int. Comput. Sci. Inst. 1998, 4, 126.
  24. Ng, S.K.; McLachlan, G.J. An EM-based semi-parametric mixture model approach to the regression analysis of competing-risks data. Stat. Med. 2003, 22, 1097–1111.
  25. Davies, D.L.; Bouldin, D.W. A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. 1979, PAMI-1, 224–227.
  26. Maulik, U.; Bandyopadhyay, S. Performance evaluation of some clustering algorithms and validity indices. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 1650–1654.
  27. Ullmann, T.; Hennig, C.; Boulesteix, A.L. Validation of cluster analysis results on validation data: A systematic framework. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2022, 12, e1444.
  28. Ester, M.; Kriegel, H.P.; Sander, J.; Xu, X. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise; KDD: Washington, DC, USA, 1996; Volume 96, pp. 226–231.
  29. Bezdek, J.C.; Ehrlich, R.; Full, W. FCM: The fuzzy c-means clustering algorithm. Comput. Geosci. 1984, 10, 191–203.
  30. Frey, B.J.; Dueck, D. Clustering by passing messages between data points. Science 2007, 315, 972–976.
  31. Izakian, Z.; Mesgari, M.S.; Abraham, A. Automated clustering of trajectory data using a particle swarm optimization. Comput. Environ. Urban Syst. 2016, 55, 55–65.
  32. Li, L.; Liu, X.; Xu, M. A novel fuzzy clustering based on particle swarm optimization. In Proceedings of the 2007 First IEEE International Symposium on Information Technologies and Applications in Education, Kunming, China, 23–25 November 2007; pp. 88–90.
  33. Doval, D.; Mancoridis, S.; Mitchell, B.S. Automatic clustering of software systems using a genetic algorithm. In Proceedings of the Ninth International Workshop Software Technology and Engineering Practice (STEP’99), Pittsburgh, PA, USA, 30 August–2 September 1999; IEEE: New York, NY, USA, 1999; pp. 73–81.
  34. Zhang, C.; Ouyang, D.; Ning, J. An artificial bee colony approach for clustering. Expert Syst. Appl. 2010, 37, 4761–4767.
  35. Storn, R.; Price, K. Differential evolution—A simple and efficient heuristic for global optimization over continuous spaces. J. Glob. Optim. 1997, 11, 341–359.
  36. Fränti, P.; Sieranoja, S. K-means properties on six clustering benchmark datasets. Appl. Intell. 2018, 48, 4743–4759.
  37. Derrac, J.; García, S.; Molina, D.; Herrera, F. A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms. Swarm Evol. Comput. 2011, 1, 3–18.
  38. Carpaneto, G.; Toth, P. Algorithm 548: Solution of the assignment problem [H]. ACM Trans. Math. Softw. (TOMS) 1980, 6, 104–111.
  39. Strehl, A.; Ghosh, J. Cluster ensembles—A knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res. 2002, 3, 583–617.
  40. Morey, L.C.; Agresti, A. The measurement of classification agreement: An adjustment to the Rand statistic for chance agreement. Educ. Psychol. Meas. 1984, 44, 33–37.
  41. Hubert, L.; Arabie, P. Comparing partitions. J. Classif. 1985, 2, 193–218.
Figure 1. Clustering results on synthetic dataset-Aggregation.
Figure 2. Clustering results for the eight methods on synthetic dataset-Compound.
Figure 3. Clustering results for the eight methods on synthetic dataset-Pathbased.
Figure 4. Clustering results for the eight methods on synthetic dataset-Four Lines.
Figure 5. Clustering results for the eight methods on synthetic dataset-Jain.
Figure 6. Clustering results for the eight methods on synthetic dataset-S2.
Table 1. Parameter values of compared algorithms.
Algorithm | Parameter | Value
PSO | $C_1$, $C_2$, Weight factor | 0.5, 0.5, 1.2
PSO-FCM | $C_1$, $C_2$, Weight factor | 2.0, 2.0, 0.4
GA | Crossover factor, Mutation factor | 0.8, 0.01
ABC | Predetermined cycles | 5
DE | Weight factor, Crossover probability | 0.3, 0.8
Table 2. Clustering results of the algorithms on synthetic datasets (in each block, the first three value columns refer to the first dataset and the last three to the second).
Dataset: Aggregation (788 × 2) / S2 (5000 × 2)
Algorithm | ARI | NMI | ACC | ARI | NMI | ACC
PFA-GMM | 0.97 | 0.93 | 0.94 | 0.97 | 0.94 | 0.93
PFA-KM | 0.75 | 0.84 | 0.69 | 0.80 | 0.87 | 0.77
GMM | 0.87 | 0.93 | 0.91 | 0.66 | 0.79 | 0.56
K-MEANS | 0.65 | 0.81 | 0.61 | 0.53 | 0.79 | 0.59
FCM | 0.73 | 0.73 | 0.67 | 0.40 | 0.68 | 0.42
DBSCAN | 0.35 | 0.00 | 0.00 | 0.07 | 0.00 | 0.00
AP | 0.39 | 0.71 | 0.35 | 0.41 | 0.69 | 0.44
PSO | 0.61 | 0.68 | 0.49 | 0.69 | 0.79 | 0.63
PSO-FCM | 0.83 | 0.84 | 0.72 | 0.97 | 0.94 | 0.93
GA | 0.35 | 0.00 | 0.00 | 0.07 | 0.00 | 0.00
ABC | 0.82 | 0.75 | 0.67 | 0.76 | 0.82 | 0.67
DE | 0.42 | 0.16 | 0.05 | 0.14 | 0.28 | 0.06
Dataset: Compound (399 × 2) / Pathbased (300 × 2)
Algorithm | ARI | NMI | ACC | ARI | NMI | ACC
PFA-GMM | 0.88 | 0.86 | 0.84 | 0.47 | 0.31 | 0.14
PFA-KM | 0.78 | 0.80 | 0.77 | 0.75 | 0.55 | 0.47
GMM | 0.86 | 0.85 | 0.83 | 0.45 | 0.21 | 0.13
K-MEANS | 0.55 | 0.68 | 0.46 | 0.55 | 0.55 | 0.45
FCM | 0.61 | 0.60 | 0.44 | 0.70 | 0.48 | 0.42
DBSCAN | 0.40 | 0.00 | 0.00 | 0.37 | 0.00 | 0.00
AP | 0.36 | 0.63 | 0.30 | 0.28 | 0.52 | 0.24
PSO | 0.50 | 0.49 | 0.31 | 0.48 | 0.26 | 0.18
PSO-FCM | 0.66 | 0.70 | 0.54 | 0.75 | 0.55 | 0.47
GA | 0.40 | 0.00 | 0.00 | 0.37 | 0.00 | 0.00
ABC | 0.65 | 0.48 | 0.42 | 0.72 | 0.53 | 0.45
DE | 0.59 | 0.33 | 0.26 | 0.51 | 0.20 | 0.10
Dataset: Four Lines (511 × 2) / Jain (373 × 2)
Algorithm | ARI | NMI | ACC | ARI | NMI | ACC
PFA-GMM | 1.00 | 0.99 | 0.99 | 0.68 | 0.28 | 0.13
PFA-KM | 0.73 | 0.68 | 0.51 | 0.88 | 0.55 | 0.59
GMM | 0.57 | 0.42 | 0.33 | 0.58 | 0.20 | 0.01
K-MEANS | 0.66 | 0.83 | 0.72 | 0.28 | 0.40 | 0.17
FCM | 0.64 | 0.64 | 0.48 | 0.86 | 0.51 | 0.51
DBSCAN | 0.29 | 0.00 | 0.00 | 0.74 | 0.00 | 0.00
AP | 0.24 | 0.52 | 0.18 | 0.18 | 0.36 | 0.10
PSO | 0.57 | 0.65 | 0.47 | 0.72 | 0.29 | 0.19
PSO-FCM | 0.85 | 0.76 | 0.67 | 0.88 | 0.55 | 0.59
GA | 0.29 | 0.00 | 0.00 | 0.74 | 0.00 | 0.00
ABC | 0.38 | 0.50 | 0.23 | 0.74 | 0.01 | 0.01
DE | 0.47 | 0.48 | 0.26 | 0.42 | 0.38 | 0.26
Table 3. The DB index and RI results of each algorithm on synthetic datasets (in each block, the first DB/RI pair refers to the first dataset and the second pair to the second).
Dataset: Aggregation (788 × 2) / S2 (5000 × 2)
Algorithm | DB | RI | DB | RI
PFA-GMM | 0.12 | 0.98 | 0.08 | 0.94
PFA-KM | 0.12 | 0.92 | 0.09 | 0.94
GMM | 0.45 | 0.84 | 0.09 | 0.94
K-MEANS | 0.13 | 0.89 | 0.11 | 0.93
FCM | 0.19 | 0.89 | 0.12 | 0.84
DBSCAN | 0.34 | 0.22 | 0.47 | 0.07
AP | 0.17 | 0.84 | 0.41 | 0.69
PSO | 0.15 | 0.89 | 0.09 | 0.94
PSO-FCM | 0.13 | 0.91 | 0.04 | 0.93
GA | 0.23 | 0.77 | 0.36 | 0.07
ABC | 0.13 | 0.83 | 0.08 | 0.93
DE | 0.14 | 0.61 | 0.08 | 0.34
Dataset: Compound (399 × 2) / Pathbased (300 × 2)
Algorithm | DB | RI | DB | RI
PFA-GMM | 0.14 | 0.92 | 0.22 | 0.74
PFA-KM | 0.15 | 0.90 | 0.22 | 0.75
GMM | 0.77 | 0.87 | 1.75 | 0.56
K-MEANS | 0.17 | 0.83 | 0.23 | 0.74
FCM | 0.20 | 0.85 | 0.38 | 0.69
DBSCAN | 0.42 | 0.25 | 0.49 | 0.33
AP | 0.17 | 0.80 | 0.06 | 0.73
PSO | 0.22 | 0.73 | 0.22 | 0.72
PSO-FCM | 0.16 | 0.84 | 0.23 | 0.75
GA | 0.44 | 0.74 | 0.35 | 0.58
ABC | 0.16 | 0.81 | 0.23 | 0.73
DE | 0.22 | 0.75 | 0.36 | 0.59
Dataset: Four Lines (511 × 2) / Jain (373 × 2)
Algorithm | DB | RI | DB | RI
PFA-GMM | 0.07 | 1.0 | 0.48 | 0.57
PFA-KM | 0.19 | 0.83 | 0.39 | 0.79
GMM | 0.27 | 1.0 | 0.53 | 0.51
K-MEANS | 0.07 | 0.91 | 0.53 | 0.51
FCM | 0.30 | 0.75 | 0.81 | 0.59
DBSCAN | 0.54 | 0.25 | 0.59 | 0.61
AP | 0.32 | 0.83 | 0.65 | 0.46
PSO | 0.33 | 0.83 | 0.53 | 0.51
PSO-FCM | 0.35 | 0.82 | 0.39 | 0.78
GA | 0.45 | 0.59 | 0.59 | 0.61
ABC | 0.46 | 0.57 | 0.58 | 0.62
DE | 0.38 | 0.62 | 0.63 | 0.48
Table 4. Results obtained from 10 runs using real-world datasets (dataset sizes given as Number × Feature).
Algorithm | Metric | Spect (267 × 22) | Seeds (210 × 7) | Iris (150 × 4) | Breast (699 × 9) | Glass (214 × 9)
PFA-GMM | Best | 0.8058 | 0.3209 | 0.2033 | 0.5876 | 0.0512
PFA-GMM | Worst | 0.8058 | 0.4650 | 0.3456 | 1.0417 | 0.0809
PFA-GMM | Mean | 0.8058 | 0.3609 | 0.3010 | 0.6707 | 0.0603
PFA-GMM | Std. | 0.0000 | 0.0397 | 0.0433 | 0.1443 | 0.0087
PFA-KM | Best | 1.1421 | 0.3268 | 0.2303 | 0.8663 | 0.2151
PFA-KM | Worst | 1.1791 | 0.3297 | 0.3321 | 1.0950 | 0.4132
PFA-KM | Mean | 1.1585 | 0.3288 | 0.2723 | 0.9406 | 0.2632
PFA-KM | Std. | 0.0127 | 0.0011 | 0.0440 | 0.0998 | 0.0604
PSO | Best | 1.0737 | 0.3286 | 0.2303 | 0.8069 | 0.2882
PSO | Worst | 2.1367 | 0.7367 | 0.7684 | 1.4967 | 0.4787
PSO | Mean | 1.4930 | 0.4889 | 0.4553 | 1.1616 | 0.3921
PSO | Std. | 0.2983 | 0.1318 | 0.1503 | 0.2063 | 0.0582
PSO-FCM | Best | 1.1227 | 0.3317 | 0.3242 | 1.1236 | 0.3117
PSO-FCM | Worst | 1.4690 | 0.3363 | 0.3247 | 1.6832 | 0.3432
PSO-FCM | Mean | 1.2241 | 0.3332 | 0.3244 | 1.3712 | 0.3279
PSO-FCM | Std. | 0.1166 | 0.0017 | 0.0002 | 0.1896 | 0.0096
GA | Best | 0.8218 | 0.2488 | 0.1902 | 0.5347 | 0.2022
GA | Worst | 1.1003 | 0.3350 | 0.2529 | 1.0344 | 0.2815
GA | Mean | 1.0089 | 0.2985 | 0.2136 | 0.7128 | 0.2527
GA | Std. | 0.0957 | 0.0304 | 0.0221 | 0.1294 | 0.0225
ABC | Best | 0.3079 | 0.1756 | 0.1798 | 0.1907 | 0.1531
ABC | Worst | 0.8413 | 0.3021 | 0.2303 | 0.6749 | 0.2628
ABC | Mean | 0.5714 | 0.2315 | 0.1996 | 0.3864 | 0.2095
ABC | Std. | 0.1727 | 0.0392 | 0.0176 | 0.1808 | 0.0384
DE | Best | 1.1238 | 0.3412 | 0.2999 | 0.9923 | 0.2284
DE | Worst | 1.6833 | 0.7104 | 0.6976 | 1.5695 | 0.5714
DE | Mean | 1.3066 | 0.4789 | 0.4749 | 1.2934 | 0.4302
DE | Std. | 0.1587 | 0.1161 | 0.1260 | 0.1534 | 0.0944
Algorithm | Metric | Heart (303 × 13) | Liver (345 × 6) | Wine (178 × 13) | Zoo (101 × 16) | Banknote (1372 × 4)
PFA-GMM | Best | 0.2673 | 0.1176 | 0.1983 | 0.1230 | 0.3596
PFA-GMM | Worst | 0.4289 | 0.4183 | 0.2975 | 0.1742 | 0.6053
PFA-GMM | Mean | 0.2834 | 0.1927 | 0.2261 | 0.1571 | 0.5016
PFA-GMM | Std. | 0.0485 | 0.0794 | 0.0274 | 0.0137 | 0.1089
PFA-KM | Best | 1.0394 | 0.1819 | 0.4875 | 0.1539 | 0.6004
PFA-KM | Worst | 1.0394 | 0.5950 | 0.4937 | 0.1583 | 0.6004
PFA-KM | Mean | 1.0394 | 0.2232 | 0.4913 | 0.1560 | 0.6004
PFA-KM | Std. | 0.0000 | 0.1240 | 0.0029 | 0.0015 | 0.0000
PSO | Best | 1.0871 | 0.7896 | 0.5331 | 0.1927 | 0.5152
PSO | Worst | 2.2302 | 1.7341 | 1.3147 | 0.3888 | 0.8781
PSO | Mean | 1.6378 | 1.0881 | 0.8028 | 0.2905 | 0.6749
PSO | Std. | 0.3839 | 0.2783 | 0.2102 | 0.0539 | 0.1150
PSO-FCM | Best | 1.0908 | 0.7688 | 0.4701 | 0.1917 | 0.5998
PSO-FCM | Worst | 2.1732 | 0.8308 | 0.4801 | 0.3179 | 0.6064
PSO-FCM | Mean | 1.3155 | 0.7945 | 0.4752 | 0.2315 | 0.6023
PSO-FCM | Std. | 0.3145 | 0.0187 | 0.0026 | 0.0313 | 0.0020
GA | Best | 0.1545 | 0.1306 | 0.4699 | 0.1833 | 0.2714
GA | Worst | 1.1754 | 0.5463 | 0.4956 | 0.2547 | 0.4616
GA | Mean | 0.9042 | 0.3455 | 0.4787 | 0.2250 | 0.3775
GA | Std. | 0.3762 | 0.1507 | 0.0065 | 0.0212 | 0.0673
ABC | Best | 0.2558 | 0.2558 | 0.2081 | 0.1523 | 0.1845
ABC | Worst | 0.8741 | 0.6353 | 0.4676 | 0.1983 | 0.6263
ABC | Mean | 0.5053 | 0.4117 | 0.3322 | 0.1668 | 0.2944
ABC | Std. | 0.1933 | 0.1418 | 0.1014 | 0.0137 | 0.1259
DE | Best | 0.9603 | 1.0209 | 0.4824 | 0.2177 | 0.5461
DE | Worst | 2.1271 | 1.8712 | 1.2315 | 0.4304 | 0.8467
DE | Mean | 1.4722 | 1.3898 | 0.7852 | 0.3022 | 0.6600
DE | Std. | 0.3907 | 0.3084 | 0.2398 | 0.0612 | 0.0993
Table 5. p-values produced by the Wilcoxon rank-sum test (PFA-GMM vs. each compared algorithm).
Dataset | PFA-GMM | PFA-KM | PSO | PSO-FCM | GA | ABC | DE
Spect | 0.001953 | 0.001953 | 0.001953 | 0.001953 | 0.001953 | 0.009766 | 0.001953
Seeds | 0.001953 | 0.003906 | 0.001953 | 0.037109 | 0.003906 | 0.001953 | 0.019531
Iris | 0.001953 | 0.232422 | 0.001953 | 0.037109 | 0.009766 | 0.001953 | 0.005859
Breast | 0.001953 | 0.001953 | 0.001953 | 0.001953 | 0.232422 | 0.019531 | 0.001953
Glass | 0.001953 | 0.001953 | 0.001953 | 0.001953 | 0.001953 | 0.001953 | 0.001953
Heart | 0.001953 | 0.001953 | 0.001953 | 0.001953 | 0.009766 | 0.064453 | 0.001953
Liver | 0.001953 | 0.461451 | 0.001953 | 0.001953 | 0.037109 | 0.951915 | 0.001953
Wine | 0.001953 | 0.001953 | 0.001953 | 0.001953 | 0.001953 | 0.027344 | 0.001953
Zoo | 0.001953 | 0.556641 | 0.001953 | 0.001953 | 0.001953 | 0.431641 | 0.001953
Banknote | 0.001953 | 0.232422 | 0.001953 | 0.232422 | 0.037109 | 0.037109 | 0.009766