Article

Hypergraph-Clustering Method Based on an Improved Apriori Algorithm

1 School of Computer, Qinghai Normal University, Xining 810008, China
2 The State Key Laboratory of Tibetan Intelligent Information Processing and Application, School of Computer, Xining 810008, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(19), 10577; https://doi.org/10.3390/app131910577
Submission received: 1 August 2023 / Revised: 13 September 2023 / Accepted: 18 September 2023 / Published: 22 September 2023

Abstract

With the complexity and variability of data structures and dimensions, traditional clustering algorithms face various challenges, and the integration of network science and clustering has become a popular field of exploration. One of the main challenges is how to handle large-scale, complex, high-dimensional data effectively. Hypergraphs can accurately represent multidimensional heterogeneous data, making them important for improving clustering performance. In this paper, we propose a hypergraph-clustering method, the high-dimensional data clustering method based on hypergraph partitioning using an improved Apriori algorithm (HDHPA). First, the method constructs a hypergraph based on the improved Apriori association rule algorithm, where frequent itemsets found in the high-dimensional data are treated as hyperedges. Different frequent itemsets are mined in parallel to obtain hyperedges of the corresponding ranks, avoiding the generation of redundant rules and improving mining efficiency. Next, the dense subgraph partition (DSP) algorithm divides the hypergraph into multiple subclusters. Finally, we merge the subclusters through dense sub-hypergraphs to obtain the clustering results. The advantage of this method lies in its use of the hypergraph model to make the spatial associations among data explicit, which further enhances the effectiveness and accuracy of clustering. We comprehensively compare the proposed HDHPA method with several advanced hypergraph-clustering methods on seven different types of high-dimensional datasets and then compare their running times. The results show that the clustering evaluation index values of the HDHPA method are generally superior to those of all other methods; the maximum ARI value reaches 0.834, an increase of 42%, and the average running time is lower than that of the other methods. Overall, HDHPA exhibits excellent, competitive performance on multiple real networks. The results of this paper provide an effective solution for processing and analyzing large-scale network datasets and help broaden the application range of clustering techniques.

1. Introduction

With the explosive growth of information, the huge amount of data generated [1] has permeated every industry and field. Moreover, most of these data exhibit high dimensionality and complex structures. How to handle such massive and complex data has become an urgent challenge in machine learning. Clustering is one of the most important methods in machine learning; it assigns data to clusters according to a chosen similarity measure [2]. Traditional clustering methods mainly build on classical machine learning theory, such as partition-based clustering methods [3] and hierarchy-based clustering methods [4]. Although many scholars have since improved traditional clustering algorithms on this basis, the special characteristics of complex data mean that such algorithms often cannot achieve ideal clustering results, which is mainly reflected in poor cluster divisions [5]. Because graph-based data analysis can identify well-connected vertex clusters [6], it makes up for the shortcomings of ordinary high-dimensional data clustering, and clustering analysis through graphs has therefore become a common processing approach. However, ordinary graphs cannot comprehensively describe the various complex relationships in data, such as those in the World Wide Web or airline networks, which often have intricate internal structures [7,8,9,10]. Inspired by this, hypergraphs have emerged. Hypergraphs [11] generalize graph topology to multi-way relationships and are widely used in many fields of information technology. One popular direction is to use hypergraphs for clustering analysis [12] and to effectively represent high-order correlations between data in a visualized form [13].
In recent years, many new clustering algorithms have been proposed to handle complex data by constructing hypergraphs. Strehl et al. [14] transformed the clustering problem into a hypergraph-partitioning combinatorial optimization problem and proposed three consensus functions for better clustering. Yang et al. [15] systematically discussed cluster ensembles, compared the advantages and disadvantages of traditional clustering algorithms, and extended their analysis to hypergraph partitioning. Based on the evolutionary model of the hypernetwork dynamics process, Suo et al. [16] provided a comprehensive overview of future development directions in the field of hypergraphs. Tian et al. [17] introduced knowledge hypergraphs using heterogeneous hypergraphs, constructed a three-layer architecture to better represent and extract hyper-featured relationships, and modeled hyper-relational data. Liu et al. [18] constructed a hypergraph to embed sample representations and proposed an optimized constrained clustering algorithm with upper and lower limits on the number of clusters. All these studies have demonstrated the superiority and universality of hypergraph-clustering methods.
Due to the complexity and variability of today's data, it is still necessary to adopt more scientific and objective methods for constructing hypergraphs. Association rules are commonly used, as they yield frequent itemsets and rules that clarify the relationships between data. Wei et al. [19] proposed the HOT algorithm, which combines outlier detection based on the hypergraph model with association rules to calculate the support degree, membership degree, and other information for each point; however, it may suffer from local optima. Cui et al. [20] comprehensively discussed several applications of hypergraphs in data mining, including association rules, clustering, and spatial data mining. Kadir et al. [21] constructed a hypergraph model and investigated the association relationships in hypergraphs based on the hypergraph-partitioning algorithm hMETIS and the association rule algorithm Apriori. However, the efficiency of the Apriori algorithm [22] is relatively low due to its iterative, bottom-up search for frequent itemsets, which also makes it unable to analyze rare information. Lately, many achievements related to intelligent clustering technology have emerged. Althuwaynee et al. [23] combined machine learning with geospatial technology by studying landslides using nonlinear t-SNE clustering and Apriori association rule mining. Esmaeili et al. [24] proposed a combination method based on the fuzzy firefly algorithm and random forest and applied it to clustering in wireless sensor networks (WSNs). Zhao et al. [25] proposed several higher-order centralities in high-order complex networks to reflect the impact of ranking on clustering and propagation. Chen et al. [26] proposed a DPC algorithm based on Chebyshev inequality and differential privacy and discussed its data privacy protection function. These studies have expanded our research ideas on clustering.
Based on these studies, we propose a new clustering method based on improved Apriori and hypergraph partitioning. This method first uses t-SNE to reduce the dimensionality of the dataset and combines the improved Apriori algorithm with the hypergraph construction method to construct a hypergraph for model clustering. Then, it employs the dense subgraph partition algorithm based on multi-level hypergraph partitioning to divide the hypergraph into numerous sub-clusters. After merging the sub-clusters, the final clustering results are obtained. Finally, various methods are compared on multiple datasets through clustering evaluation indicators, and the operating efficiency of all methods is analyzed, which demonstrates the effectiveness of the proposed method from many aspects.
The specific contributions of this paper are as follows:
(1)
This paper extends association rules to hypergraphs, then proposes a hypergraph-partitioning high-dimensional data-clustering method based on the improved Apriori algorithm (HDHPA), providing a solution for identifying clustering results of high-dimensional data.
(2)
In order to avoid the generation of redundant rules and improve the mining efficiency, the parallel mining of different frequent itemsets is used to obtain hyperedges with corresponding ranks.
(3)
The HDHPA method improves clustering accuracy by constructing a hypergraph structure of high-dimensional data. In order to demonstrate the superiority of the method, this paper conducted numerous experiments on multiple datasets.
The remaining sections of this paper are organized as follows. Section 2 presents the theoretical foundations related to the proposed method. Section 3 describes the HDHPA method and its three stages. Section 4 analyzes experiments from different perspectives to validate the feasibility and superiority of the method. Section 5 summarizes the findings and outlines future research directions.

2. Related Theoretical Foundations

2.1. The Concept of Hypergraph

The hypergraph $H$ can be defined as a pair $H = (V, E)$, where $V = \{v_1, v_2, \dots, v_n\}$ is a finite set of vertices and $E = \{e_1, e_2, \dots, e_m\}$ is a finite set of hyperedges, with $e_i \neq \emptyset$ $(i = 1, 2, \dots, m)$ and $\bigcup_{i=1}^{m} e_i = V$. The number of vertices contained in a hyperedge is called its rank. $r(H) = \max_i |e_i|$ is called the upper rank of the hypergraph, and $s(H) = \min_i |e_i|$ is called the lower rank. If $r(H) = s(H)$, then $H$ is a uniform hypergraph. In hypergraph $H$, the number of hyperedges containing node $v_i$ is called its hyperdegree, denoted $d_H(v_i)$.
The incidence matrix $I(H)$ of hypergraph $H$ is a matrix whose rows correspond to the vertices of $H$ and whose columns correspond to the hyperedges. The adjacency matrix $A(H)$ of hypergraph $H$ is a square symmetric matrix whose element $a_{ij}$ is the number of hyperedges containing both nodes $v_i$ and $v_j$; the diagonal entries of $A(H)$ are zero. For a set $A \subseteq V$, $H_A = (A, \{e_i \cap A \mid e_i \in E, e_i \cap A \neq \emptyset\})$ is called a sub-hypergraph of the hypergraph $H = (V, E)$; in other words, the hypergraph on a vertex subset of $H$ is called a sub-hypergraph of $H$, such that for each hyperedge in the sub-hypergraph, there exists only one hyperedge in $H$ that contains it.
Figure 1 shows an example of a hypergraph. In this hypergraph, the vertices with hyperdegree 1 are $v_1$, $v_3$, $v_4$, $v_7$, and $v_8$; the vertices with hyperdegree 2 are $v_2$ and $v_5$; and the vertex with hyperdegree 3 is $v_6$. The four hyperedges in Figure 1 do not all contain the same number of vertices, making it a non-uniform hypergraph.
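To make these definitions concrete, the following sketch (our own NumPy illustration, not part of the original article) builds the incidence matrix, adjacency matrix, and hyperdegrees of the Figure 1 hypergraph:

```python
import numpy as np

# Hypergraph from Figure 1: 8 vertices, 4 hyperedges (0-indexed here).
n_vertices = 8
hyperedges = [{1, 2, 5},      # e1 = {v2, v3, v6}
              {0, 1, 4, 5},   # e2 = {v1, v2, v5, v6}
              {3, 4, 5},      # e3 = {v4, v5, v6}
              {6, 7}]         # e4 = {v7, v8}

# Incidence matrix I(H): rows are vertices, columns are hyperedges.
I = np.zeros((n_vertices, len(hyperedges)), dtype=int)
for j, e in enumerate(hyperedges):
    for v in e:
        I[v, j] = 1

# Adjacency matrix A(H): a_ij = number of hyperedges containing both
# v_i and v_j; the diagonal is zero by definition.
A = I @ I.T
np.fill_diagonal(A, 0)

# Hyperdegree d_H(v): number of hyperedges containing vertex v.
print(I.sum(axis=1))  # [1 2 1 1 2 3 1 1] -- v6 has hyperdegree 3
```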

2.2. Apriori of Association Rules

Association rules [27] are mainly used to find rules of the form $A \Rightarrow B$ in a given dataset. This process can be decomposed into two sub-processes: determining frequent itemsets and establishing association rules. Apriori, as a classic association rule algorithm, generates frequent itemsets based on the prior principle and extracts high-confidence rules with only one item in the consequent based on confidence pruning. (In this paper, according to the dataset characteristics, we used a trial-and-error sampling method to determine the support threshold.) The concrete algorithm steps are as follows, with a minimal code sketch after the list:
  • Scan the dataset L, calculate the support of each item, discard items that do not meet the support threshold, and generate one-dimensional frequent itemsets.
  • Iterate: generate two-dimensional candidate itemsets from the one-dimensional frequent itemsets, discard those that do not meet the support threshold, and generate two-dimensional frequent itemsets.
  • Repeat: establish k-dimensional candidate itemsets from the (k−1)-dimensional frequent itemsets, discard those that do not meet the support threshold, and generate k-dimensional frequent itemsets.
  • Prune each k-itemset; count and retain the association rule itemset that satisfies the set confidence threshold.
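A minimal, self-contained sketch of these steps (our own Python illustration; the paper's improved, parallelized variant appears as Algorithm 2 in Section 3.2):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Minimal Apriori: return frequent itemsets as {frozenset: support}."""
    n = len(transactions)
    # Step 1: one-dimensional frequent itemsets.
    freq = {}
    for item in {i for t in transactions for i in t}:
        s = sum(item in t for t in transactions) / n
        if s >= min_support:
            freq[frozenset([item])] = s
    result, k = dict(freq), 2
    # Steps 2-3: iteratively grow candidates and keep the frequent ones.
    while freq:
        candidates = {a | b for a in freq for b in freq if len(a | b) == k}
        freq = {}
        for c in candidates:
            # Prior principle: every (k-1)-subset must itself be frequent.
            if all(frozenset(s) in result for s in combinations(c, k - 1)):
                supp = sum(c <= t for t in transactions) / n
                if supp >= min_support:
                    freq[c] = supp
        result.update(freq)
        k += 1
    return result

# Toy transactions as sets of items.
ts = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
print(apriori(ts, min_support=0.6))
```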
Let $I = \{i_1, i_2, \dots, i_N\}$ be a set of $N$ different items, referred to as an itemset. The length of an itemset is the number of items it contains; a $k$-itemset is an itemset of length $k$, also called its dimension. $T = \{t_1, t_2, \dots, t_n\}$ is a transaction set, where each transaction $t_i$ is a set of items with $t_i \subseteq I$. An association rule has the form $A \Rightarrow B$, where $A, B \subset I$ and $A \cap B = \emptyset$. For a given itemset $X \subseteq I$, the number of transactions containing $X$ is called the support count of $X$, denoted $\sigma(X)$ and defined as $\sigma(X) = |\{t_i \mid X \subseteq t_i, t_i \in T\}|$. The support of the rule $A \Rightarrow B$ is therefore defined as $\mathrm{support}(A \Rightarrow B) = \sigma(A \cup B)/n$, and its confidence as $\mathrm{confidence}(A \Rightarrow B) = \sigma(A \cup B)/\sigma(A)$; the higher the confidence, the more credible it is that the appearance of $A$ implies the appearance of $B$. To avoid ignoring the support of the itemset in the rule consequent, the lift is defined as
$$\mathrm{lift}(A \Rightarrow B) = \frac{\mathrm{confidence}(A \Rightarrow B)}{\mathrm{support}(B)} \tag{1}$$
$\mathrm{lift}(A \Rightarrow B)$ reflects the degree to which the occurrence of $A$ influences the occurrence of $B$ in the hyperedge $e_i$ and is a correlation metric. If the lift $l > 1$, $A$ and $B$ are positively correlated; if $l = 1$, $A$ and $B$ are independent of each other and there is no correlation; if $l < 1$, $A$ and $B$ are negatively correlated.

2.3. Hypergraph Model Definition with Association Rules

Combining the descriptions in Section 2.1 and Section 2.2, we can place the hypergraph $H = (V, E)$ in correspondence with association rules. For the hypergraph $H = (V, E)$, each vertex $v_i$ corresponds to an item in the dataset, and each hyperedge $e_i$ $(i = 1, 2, \dots, m)$ corresponds to a transaction containing an itemset, with vertex set denoted $V(e_i)$ (also called $V$) and hyperedge set denoted $E(e_i)$. Therefore, the support of an itemset $X$ is
$$\mathrm{support}(X) = \sigma(X)/n \tag{2}$$
where the support count of itemset $X$ $(X \subseteq E)$ is $\sigma(X) = |\{t_i \mid X \subseteq t_i, t_i \in T\}|$.
Let the association rule $r$ discovered in itemset $X$ be $A \Rightarrow B$, where $A \subset X$ and $B \subset X$; the confidence of this rule is
$$\mathrm{confidence}(r) = \sigma(A \cup B)/\sigma(A) \tag{3}$$
If the itemset $X$ satisfies the minimum support threshold, $X$ is called a frequent itemset, and a hyperedge $e_i$ is constructed from it; the weight of hyperedge $e_i$ is defined as follows:
$$\mathrm{weight}(e_i) = \sum_{r} \mathrm{confidence}(r) / R \tag{4}$$
where $R$ is the number of association rules discovered in itemset $X$.
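As an illustration of Equations (2)–(4), the sketch below computes the weight of a hyperedge as the average confidence of all rules $A \Rightarrow B$ formed by splitting a frequent itemset; the function names and toy transactions are our own assumptions:

```python
from itertools import combinations

def support_count(itemset, transactions):
    """sigma(X): number of transactions containing itemset X."""
    return sum(itemset <= t for t in transactions)

def hyperedge_weight(itemset, transactions):
    """Equation (4): average confidence over all rules A => B,
    where A and B form a nonempty split of the itemset (|X| >= 2)."""
    items = frozenset(itemset)
    sigma_x = support_count(items, transactions)
    confidences = []
    for r in range(1, len(items)):
        for antecedent in map(frozenset, combinations(items, r)):
            # B = X \ A; confidence(r) = sigma(A u B) / sigma(A).
            confidences.append(sigma_x / support_count(antecedent, transactions))
    return sum(confidences) / len(confidences)  # average over R rules

ts = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
print(hyperedge_weight({"a", "b"}, ts))  # weight of hyperedge {a, b}
```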
Figure 2 shows a simple example of constructing a hypergraph based on association rules, where the dataset is represented in binary (0/1) and the minimum support is set to 60%.

2.4. DSP Algorithm in Hypergraph Partitioning

In this section, we introduce a hypergraph-partitioning algorithm based on dense subgraph partition, abbreviated as DSP [28]; this algorithm can spontaneously and effectively partition hypergraphs into dense subgraphs, called DPs. Based on the idea of divide-and-conquer, DSP is suitable for parallel processing and can be applied to large-scale hypergraphs efficiently. Therefore, we used it to obtain enough sub-clusters.
A permutation $P$ of a set of elements represents a specific order in which the elements are arranged according to a certain rule, and a sub-permutation $R$ of $P$ is a contiguous subsequence of $P$. For a hypergraph $H = (V, E)$, the density of a hyperedge is $\rho(e_i) = \mathrm{weight}(e_i)/|V(e_i)|$, the ratio of the weight of hyperedge $e_i$ to the number of vertices it contains; the density of a hypergraph is obtained analogously. The sub-hypergraph with the maximum density in hypergraph $H$ is called the densest sub-hypergraph, and the unique densest sub-hypergraph with the largest number of vertices is called the kernel sub-hypergraph. When conditions are imposed and multiple sub-hypergraphs meeting them attain the highest density, the one with the largest number of vertices is called the conditional kernel sub-hypergraph. If $H_U$ is the conditional kernel sub-hypergraph of $H_S$, then $H_U$ is partitioned into the maximum number $t$ of disconnected sub-hypergraphs, written $DP(H_U \mid H_S) = \{H_{U_1}, \dots, H_{U_t}\}$. The dense sub-hypergraph partition of hypergraph $H$ is defined as follows:
$$\mathrm{DSP}(H) = \left\{ DP(H_{V_1} \mid \emptyset), \dots, DP\!\left(H_{V_i} \mid H_{\cup_{j=1}^{i-1} V_j}\right), \dots, DP\!\left(H_{V_n} \mid H_{\cup_{j=1}^{n-1} V_j}\right) \right\} \tag{5}$$
where $\bigcup_{i=1}^{n} V_i = V$ and $H_{V_1}$ is the kernel sub-hypergraph of $H$. The $H_{V_i}$ $(i > 1)$ are conditional kernel sub-hypergraphs conditioned on the sub-hypergraph $H_{\cup_{j=1}^{i-1} V_j}$.
Based on the above, the DSP partition algorithm includes two layers. First, the hypergraph $H$ is successively partitioned into a series of conditional kernel sub-hypergraphs $H_{V_1}, \dots, H_{V_n}$; the corresponding partition of the vertex set $V$ is written $\Psi(V) = \{V_1, \dots, V_n\}$. Then, each sub-hypergraph $H_{V_i}$ is partitioned into dense sub-hypergraphs. $\Psi(V)$ defines the order in which the vertex set $V$ is partitioned: among the permutations of $V$, there must be one that respects the vertex order of $\Psi(V)$, denoted $\Theta(H)$. Based on $\Theta(H)$, it is easy to find the dense sub-hypergraphs of the partition. The DSP partitioning algorithm can therefore be described as follows:
The time complexity of Algorithm 1 is $O(\tau d(n_e n_v + n_e^2))$, where $\tau$ is the number of iterations of the algorithm and $n_v$ and $n_e$ are the numbers of vertices and hyperedges, respectively, of the dense sub-hypergraph of maximum size.
Algorithm 1: Partitioning DSP algorithm
Input: Hypergraph H and an initial permutation P
Output: DSP(H)
1:  Apply min-partition() on P to get MP(P)
2:  Set Ω = MP(P), Ω̃ = ∅, and P̃ = P
3:  repeat
4:    for each R ∈ MP(P) do
5:      if R ∉ Ω̃ then
6:        Apply permutation-reorder() on R to get R̃
7:        if R̃ ≠ R then
8:          Replace R with R̃ in P̃
9:          Apply min-partition() on R̃ to get MP(R̃)
10:         Replace R with MP(R̃) in Ω
11:       end if
12:     end if
13:   end for
14:   Apply min-merge() on Ω to get MP(P̃)
15:   Set Ω̃ = MP(P), P = P̃, and MP(P) = MP(P̃)
16: until P does not change
17: For each R ∈ MP(P), apply disjoint-partition() on H_R
18:  return DSP(H) by Equation (5)
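As a small illustration of the density criterion $\rho(e_i) = \mathrm{weight}(e_i)/|V(e_i)|$ that drives DSP, the sketch below ranks hyperedges by density, mimicking how DSP orders dense sub-hypergraphs; this is not a full DSP implementation, and all names and values are illustrative:

```python
def density(weight, vertex_set):
    """rho(e) = weight(e) / |V(e)|: weight per covered vertex."""
    return weight / len(vertex_set)

# Hyperedges as (weight, vertex set) pairs; values are made up.
hyperedges = [(0.9, {1, 2, 3}), (0.4, {4, 5}), (0.8, {1, 2, 3, 4, 5, 6})]

# DSP emits dense sub-hypergraphs in decreasing density; the denser parts
# are the more likely true clusters, and sparse tails the likely outliers.
for w, vs in sorted(hyperedges, key=lambda e: density(*e), reverse=True):
    print(f"density={density(w, vs):.3f}, vertices={sorted(vs)}")
```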

3. Method Design of HDHPA

3.1. Basic Idea of HDHPA

This paper proposes a new complex-data clustering method, the high-dimensional data clustering method based on hypergraph partitioning using an improved Apriori algorithm (HDHPA). First, the method constructs a hypergraph based on the improved Apriori algorithm for association rule mining and then uses the DSP algorithm for multi-level hypergraph partitioning to divide the hypergraph into many subclusters. Finally, the subclusters are merged to obtain the final clustering results. Although previous studies, such as those in [17,18], have utilized hypergraphs to extract data information, they are limited by high computational complexity. For example, the authors of [23] used the Apriori algorithm to determine the relationships between factors, but they did not improve the algorithm, which suffers from excessive iterations.
We use t-distributed Stochastic Neighbor Embedding (t-SNE) [29] for the preprocessing and feature extraction of the original high-dimensional data. After obtaining low-dimensional data, HDHPA discovers clusters in the dataset through three stages: in the first stage, the improved Apriori algorithm is parallelized to speed up the mining of association rules and construction of the hypergraph model; in the second stage, the dense sub-hypergraph partitioning algorithm is applied to partition the hypergraph into many relatively small dense sub-hypergraphs (or subclusters); in the third stage, two evaluation functions (Conformance function C and Aggregation function A) are used to merge dense sub-hypergraphs and obtain clustering results.
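A sketch of the preprocessing step, assuming scikit-learn's TSNE with the 3000-iteration cap mentioned in Section 4.3 (the `n_iter` keyword applies to older scikit-learn releases; newer versions rename it `max_iter`); the toy data are our own:

```python
import numpy as np
from sklearn.manifold import TSNE

def preprocess(X, n_components=2, max_iterations=3000, seed=0):
    """Reduce high-dimensional data with t-SNE before building the hypergraph."""
    tsne = TSNE(n_components=n_components, n_iter=max_iterations,
                random_state=seed)
    return tsne.fit_transform(X)

X = np.random.rand(100, 50)  # 100 samples, 50 features (toy data)
print(preprocess(X).shape)   # (100, 2)
```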

3.2. The Stage of Constructing a Hypergraph

During the hypergraph construction stage, the goal is to find sub-hypergraphs that meet the clustering conditions and are mutually connected. Based on the generated frequent sub-hypergraphs, the relationships (hyperedges) between the entities (vertices) of the encapsulated objects are established. To ensure the objectivity of selecting frequent sub-hypergraphs, we follow the structure of the Apriori algorithm to find frequent itemsets. To avoid additional computational burden, we improved the frequent-itemset discovery of the Apriori algorithm, which largely avoids excessive iterations and achieves effective pruning with minimal time complexity compared with other improved association rule algorithms (such as FP-stream [30] and EDFS [31]).
Based on this idea, in the hypergraph construction stage of the HDHPA method, the vertices correspond to the items contained in transactions, and each frequent itemset is encapsulated as a hyperedge. This stage first enumerates all frequent single-hyperedge and double-hyperedge sub-hypergraphs based on Equation (2). Then, parallel retrieval is performed on these two sets. In each iteration round, candidate sub-hypergraphs are first generated, each one edge larger than the previous frequent sub-hypergraphs (line 7 of the algorithm). Next, the support count of each candidate is calculated, the correlation between hyperedges is measured by Equation (3), and hyperedges that meet the threshold are merged into sub-hypergraphs. Finally, sub-hypergraphs that do not satisfy the prior principle are pruned to generate hypergraphs by Equation (1) (lines 13–16). The discovered frequent sub-hypergraphs satisfy the downward closure property of the support condition, making the constructed hypergraph robust to noisy data.
The method for the hypergraph construction stage is shown in Algorithm 2 below. Given a minimum support threshold count $\sigma$ for constructing the hypergraph, $C_k$ is a candidate set with $k$ hyperedges, and $H(C_k) = (L, C_k)$ is a sub-hypergraph with $k$ candidate sets. A hyperedge $e_i$ that meets the minimum support is called a frequent itemset $F(e_i)$, and $H(F_k) = (L, F)$ is a sub-hypergraph with $k$ frequent sets, where $L$ is a set of frequent transactions and $F$ is a set of $F(e_i)$. Sparse graphs are used to store the input transactions, intermediate candidates, and frequent sub-hypergraphs; this representation saves memory and speeds up computation. The topology, hyperedges, and vertex labels of each frequent sub-hypergraph are randomly selected, resulting in a complete set of valid candidates with no duplication. The weights follow a unit-mean exponential distribution, and the total weight of all frequent sub-hypergraphs is normalized to 1. Based on the previous analysis, the time complexity of Algorithm 2 is $O(N \cdot C_k \log k)$.
Algorithm 2: HDHPA method for constructing the hypergraph stage
Input: The set of all items in dataset L: I = {i_1, i_2, …, i_N}
Output: H(F_1), H(F_2), …, H(F_k), H(C_k).count
1:  Compute HF_1 ← {i | i ∈ I, σ(i) ≥ N × minsup} by Equation (2)
2:  Compute HF_2 ← {i | i ∈ I, σ(i) ≥ N × minsup} by Equation (2)
3:  Set HF_1 ← detect all frequent 1-sub-hypergraphs in V
4:  Set HF_2 ← detect all frequent 2-sub-hypergraphs in V
5:  k ← 3
6:  while HF_{k−1} ≠ ∅ do
7:    k ← k + 1
8:    C_k ← genCandidate(HF_{k−1}) by Equation (3)
9:    C_k ← pruneCandidate(C_k, HF_{k−1})
10:   for each candidate HC_k ∈ C_k do
11:     HC_k.count ← 0
12:     for each transaction v ∈ V do
13:       C_v = subset(C_k, v) by Equation (1)
14:       if candidate HC_k is included in transaction v then
15:         HC_k.count ← HC_k.count + 1
16:   HF_k ← {HC_k ∈ C_k | σ(HC_k)/N ≥ minsup}
17: until HF_k = ∅
18: return HF_1, HF_2, …, HF_k

3.3. Hypergraph Partitioning Stage

After constructing the hypergraph, we use the hypergraph-partitioning algorithm DSP introduced in Section 2.4 to partition it. The DSP algorithm adopts a divide-and-conquer approach to perform parameter-free, top-down partitioning of the hypergraph, resulting in relatively low time and space complexity. Due to the uniqueness of the conditional kernel sub-hypergraphs and their non-intersecting partitions, the DSP result is unique. A denser sub-hypergraph is more likely to represent a true cluster, while a less dense sub-hypergraph is usually composed of outliers. The result of the DSP algorithm is an ordered list of dense sub-hypergraphs of decreasing density, which reveals all potential clusters and outliers. Therefore, it can achieve good clustering results even when there are many outliers.
The algorithm first calculates the weight of each hyperedge in the hypergraph, and then divides the hypergraph into a sequence of conditional kernel sub-hypergraphs in decreasing order, and calculates the corresponding density. Then, each conditional kernel sub-hypergraph is divided into disjoint dense sub-hypergraphs by a non-overlapping partitioning algorithm. The dense sub-hypergraphs generated using DSP usually have strong connections between vertices and can be considered as potential clusters to obtain clustering results [32].

3.4. Dense Sub-Hypergraph Merging

Once the entire hypergraph is partitioned into $k$ parts, we use the following cluster fitness criterion to eliminate bad clusters. Let $v_i$ denote the vertex set of a connected hyperedge and $Z$ the vertex set of a partition. The conformance function $C$, which measures the fitness of the partition $Z$, is defined as follows:
$$\mathrm{Conformance}(Z) = \frac{\sum_{v_i \subseteq Z} \mathrm{Weight}(v_i)}{\sum_{|v_i \cap Z| > 0} \mathrm{Weight}(v_i)} \tag{6}$$
The conformance function measures the ratio of the weights of hyperedges lying entirely within a partition to the weights of hyperedges touching any vertex of that partition. A higher value indicates that vertices within the partition are more likely to be retained in the current cluster. Partitions whose conformance exceeds a given threshold are considered clusters with good clustering effects. We set the conformance threshold to 0.1 (reasonably adjusting this threshold can effectively reduce the number of un-clustered points in the results), which allows the function to be easily combined with Apriori-based partitioning algorithms: only when the conformance of a partition falls below the threshold is the cluster further divided, until all such clusters are found.
After finding the clusters, each cluster needs to be checked to filter out vertices that are not highly connected to other nodes in the cluster. Based on the ratio of the hyperdegree of node v j in the partitioned vertex set Z to the other vertices in the same cluster, the aggregation function A is defined as follows:
$$\mathrm{Aggregation}(v_j, Z) = \frac{d_H(v_j)}{d_H(Z)} \tag{7}$$
The higher the value of aggregation, the more connected hyperedges there are, which connect most of the vertices in the cluster. Points with aggregation values greater than a given threshold are considered to belong to the cluster; the rest of the vertices are discarded from the cluster. In this paper, we set the aggregation threshold to 0.1.
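The sketch below implements one plausible reading of Equations (6) and (7), treating each $v_i$ as the vertex set of a hyperedge and $d_H(Z)$ as the number of hyperedges touching the partition $Z$; the paper's exact definition of $d_H(Z)$ may differ, the data are illustrative, and the 0.1 thresholds follow the description above:

```python
def conformance(partition, hyperedges):
    """Eq. (6): weight of hyperedges fully inside the partition,
    relative to the weight of all hyperedges touching it."""
    inside = sum(w for w, e in hyperedges if e <= partition)
    touching = sum(w for w, e in hyperedges if e & partition)
    return inside / touching if touching else 0.0

def aggregation(vertex, partition, hyperedges):
    """Eq. (7): hyperdegree of the vertex relative to the partition's
    hyperedge count (our reading of d_H(Z))."""
    d_v = sum(1 for _, e in hyperedges if vertex in e)
    d_z = sum(1 for _, e in hyperedges if e & partition)
    return d_v / d_z if d_z else 0.0

H = [(0.8, {1, 2, 3}), (0.5, {3, 4}), (0.9, {5, 6})]  # (weight, vertices)
Z = {1, 2, 3}
print(conformance(Z, H) > 0.1)     # keep this cluster?
print(aggregation(3, Z, H) > 0.1)  # keep vertex 3 in the cluster?
```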

4. Experiments and Analysis

4.1. Dataset Description

The HDHPA method was validated for clustering problems using 6 different protein–protein interaction (PPI) network datasets and the MNIST dataset.
  • Protein–protein interaction network: The vertices, attribute values, and edges correspond to proteins, GO (gene ontology [33]) terms, and protein–protein interactions, respectively. The GO terms mainly include molecular functions, cellular components, and biological processes. The experiments selected 6 PPI network datasets from real protein interaction network datasets [34], including FAA4 (from yeast), Natoc_0297 (from Cryptococcal Natrophomonas), 16 items (from Nattococcus), HOXA10 (from chromosome 7), NNMT (from cancer cells), and MBP (from Escherichia coli). Partial PPI networks constructed from the datasets are shown in Figure 3.
  • MNIST: A classic handwritten digit recognition dataset, consisting of 60,000 grayscale images divided into 10 classes from “0” to “9”. In this study, 1000 samples were selected from the dataset for experimentation.
In Figure 3, the larger a blue vertex, the more proteins it interacts with in the network; conversely, the smaller the vertex, the fewer proteins it interacts with [35]. Edges represent protein–protein associations that are clear and meaningful; that is, the proteins jointly contribute to a shared function, although this does not necessarily mean that they physically bind to each other [36]. Colored edges represent the predicted degree of protein–protein interaction, increasing from orange through yellow, green, and blue to purple (see Appendix A for supplementary material on the PPI networks); the closer an edge's color is to purple, the more homogeneous the features and the more similar the biological information it carries. Unlike MNIST, the PPI networks in Figure 3 basically follow a power-law distribution, with sparser edge structures that are closer to real networks, allowing the robustness of HDHPA to be verified from multiple perspectives.

4.2. Clustering Evaluation Metrics

In order to quantitatively estimate the effectiveness of clustering in a computational model, performance was assessed using different evaluation metrics: the Fowlkes–Mallows index (FM-index) [37] and the Adjusted Rand Index (ARI) [38]. Before introducing these two metrics, we first introduce the relevant terms. Consider a dataset with n points. For binary classification, the rare class is referred to as the positive class and the majority class as the negative class. The confusion matrix, summarizing the correct and incorrect predictions of the classification model, is shown in Table 1 below:
From the above, clustering usually uses indicators to evaluate the degree of consistency between clustering results and actual data. For network data, we want to assign two nodes to the same cluster if and only if they are similar. Therefore, in cluster evaluation metrics, TP represents the assignment of two similar nodes to the same cluster. TN represents the assignment of two dissimilar nodes to different clusters. FP represents the assignment of two dissimilar nodes to the same cluster. FN represents the assignment of two similar nodes to different clusters.
The equation of FM-index value [37] is
$$\mathrm{FM\text{-}index} = \sqrt{\frac{TP}{TP+FP} \cdot \frac{TP}{TP+FN}} \tag{8}$$
The equation of ARI value [38] is
$$\mathrm{ARI} = \frac{RI - E[RI]}{\max(RI) - E[RI]} \tag{9}$$
where $RI = (TP + TN)/C_{n_{\mathrm{samples}}}^{2}$ and $C_{n_{\mathrm{samples}}}^{2}$ is the total number of element pairs that can be formed in the dataset. The value range of the FM-index is [0, 1]; the larger the FM-index, the better the clustering effect and the more accurately it reflects the factual situation. The value range of the ARI is [−1, 1]. Specifically, a value of 1 indicates that the clustering result matches the ground truth almost perfectly, a value of 0 indicates that the clustering performs no better than random assignment, and a value of −1 indicates that the clustering result completely contradicts the ground truth.
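Both metrics are available off the shelf; a minimal sketch using scikit-learn's implementations on toy label vectors (our own example, not the paper's data):

```python
from sklearn.metrics import adjusted_rand_score, fowlkes_mallows_score

# Ground-truth labels and predicted cluster assignments (toy data).
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 1, 2, 2, 0]

print(f"FM-index: {fowlkes_mallows_score(y_true, y_pred):.3f}")
print(f"ARI:      {adjusted_rand_score(y_true, y_pred):.3f}")
```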

4.3. Data Processing

To objectively and comprehensively verify the performance of the HDHPA method, all experiments in this paper were conducted on the same operating system in a Python 3.8.13 parallel computing environment. In the data preprocessing stage, duplicate hyperedges were removed and the original data were dimensionally reduced using t-SNE, with a maximum of 3000 iterations. In the PPI network datasets, we set the protein set as the vertex set and protein–protein interactions as hyperedges, then constructed a hypergraph on this basis. Table 2 summarizes some statistics of the PPI network datasets:

4.4. Parameter Settings

The purpose of this section is to ensure the robustness of the experimental performance, study parameter sensitivity, and understand how changes in the support count σ affect the weighting of data relationships and the selection of hyperedges, as well as the clustering results. Specifically, the FAA4 dataset was used for the experiment, and the results are shown in Figure 4. When the parameter σ is relatively small compared to k, the dense sub-hypergraphs generated by DSP partitioning are particularly large, so their number is correspondingly small. In addition, the number of dense sub-hypergraphs is linearly related to σ, and when σ > 5, the number of dense sub-hypergraphs generated by DSP is relatively stable. Therefore, for simplicity, the minimum σ is set to 7 in this experiment.

4.5. Experimental Comparison Methods

Since the HDHPA method combines hypergraph partitioning and clustering, we selected two classic methods for partitioning and clustering, and combined them pairwise as comparisons.
  • k-medoids [39]: A classic clustering method based on representative objects, which selects objects located at the centers of clusters as center points. It is more suitable for processing small datasets with noise and outliers.
  • hMETIS [40]: A hypergraph-partitioning method based on the METIS graph-partitioning software package version 5.1.0 that can efficiently partition hypergraphs through k-way multilevel partitioning.
Five methods, including HDHPA, use seven datasets for clustering comparison: k-medoids + hMETIS (kh), k-medoids + DSP (kD), Apriori + hMETIS (Ah), Apriori + DSP (AD), and HDHPA. Since the cluster centers of the k-medoids method are randomly defined, each method was run 100 times, and the average of these 100 results was taken as the final result.

4.6. Experimental Comparison Results

4.6.1. Comparison of Experimental Results on PPI Networks

In this section, the five hypergraph-clustering methods were applied to six protein–protein interaction network datasets of varying sizes. After clustering the selected datasets, we analyzed their enrichment in GO terms associated with biological processes. The results are shown in Figure 5.
In the enrichment analysis, the size of a circle represents the significance of the enrichment, and the circle color indicates significance decreasing from red to blue. Because the p-values are small, the scale and color are based on the log p-value of each term's importance. Comparing the six PPI network datasets in Figure 5, they exhibit similar characteristics: GO terms that cluster more similarly are also closer in space, and vertices with richer colors show changes in enrichment.
Then, we summarized the information of the six PPI network datasets, shown in Table 3. The Natoc_0297 dataset has the highest average clustering coefficient, the HOXA10 dataset has the maximum average number of neighbors, the NNMT dataset has the highest network density and network centralization, and the 16 items dataset has the maximum network heterogeneity. According to the relevant theory of PPI networks [41], the clustering results of the Natoc_0297 and HOXA10 datasets are easier to distinguish. For the FAA4, 16 items, NNMT, and MBP datasets, the clustering results may contain local optima, but this situation is not within the scope of consideration.
The evaluation indices FM-index and ARI values of the clustering results are summarized in Table 4 and Table 5.
As shown in Table 4 and Table 5, for the FAA4 dataset, the FM-index value of the HDHPA method is higher than that of the other methods, reaching 0.805; the ARI value of the kD method is the highest and is very close to that of the HDHPA method, far exceeding the other methods. In the Natoc_0297 dataset, the FM-index value of the kD method is the highest, with the HDHPA method very close behind (0.741 and 0.734, respectively), and the ARI value of the HDHPA method is higher than that of the other methods. For the 16 items dataset, both the FM-index and ARI values of the HDHPA method are the largest. In the HOXA10 dataset, the FM-index value of the HDHPA method far exceeds that of the other methods, but the ARI value of the kD method is the greatest. In the NNMT dataset, the FM-index value of the kh method is the largest, with the HDHPA method very close behind (0.502 and 0.496, respectively); the ARI values of the AD and HDHPA methods are very close to each other (0.590 and 0.576, respectively). For the MBP dataset, the FM-index and ARI values of the HDHPA method are higher than those of the other methods (0.732 and 0.799, respectively).
The experimental results show that no single method performs best on every dataset, but in most cases, the proposed HDHPA method produces the best results. The clustering results of the other four comparison methods are similar on the 16 items dataset, indicating that the precision gain is relatively limited on datasets with high clustering coefficients. The kD method produces better clustering results on the HOXA10 and Natoc_0297 datasets than the Ah and AD methods, indicating that when there are more noise and isolated points, the k-medoids method is more efficient than Apriori-based construction. In most cases, the Ah method is more sensitive to data types and less robust than the other four methods.

4.6.2. Comparison of Experimental Results on MNIST

To verify the universality and accuracy of the HDHPA method from multiple perspectives, this section selects the MNIST dataset and compares it with the same method combination as the benchmark. The FM-index and ARI values of the clustering results are summarized in Table 6.
As shown in Table 6, on the MNIST dataset, the FM-index and ARI values of the HDHPA method are much higher than those of the other methods, demonstrating superior performance. In addition, compared with the six PPI network datasets, the clustering quality of the HDHPA method increased, with the ARI reaching 0.834. The Ah method increased in both FM-index and ARI values, by at least 14.77% and 22.9%, respectively. This result indicates that methods based on support for finding frequent itemsets are more suitable for clustering such data in practice, possibly because the Apriori algorithm reduces the weight of noisy data when dealing with the same hyperedge with different weights. Based on the experimental results of the first two sections, the method proposed in this paper is both general and effective. Furthermore, we compared it with the latest research: using the same evaluation indicators as [26], the indicator values obtained in the two experiments were very close, reflecting the effectiveness of the proposed method.

4.6.3. Running Time Comparison

In this section, we compare the computational efficiency of the hypergraph-clustering methods by comparing the running times of all experiments. All experiments were conducted on the same server with the same configuration: Windows 10 Professional OS, CUDA 11.6, an Intel(R) Xeon(R) E-2244G CPU (4.80 GHz), and an Nvidia GeForce RTX 4070 Ti GPU (12 GB), with Python 3.8.13 as the development language and a parallel computing environment implemented in the PyCharm 2021.3.1 IDE. Figure 6 shows the running times of the five methods on the six protein–protein interaction network datasets (FAA4, Natoc_0297, 16 items, HOXA10, NNMT, and MBP) and the MNIST handwritten digit recognition dataset (only the training part is selected).
From the results, the HDHPA method has a lower overall running time than the other methods on all datasets, in each case under 10 s. On the HOXA10 dataset, all methods except HDHPA run for more than 10 s, with a maximum running time of 18.4 s. The running times of all methods show a consistent pattern: ranked from fastest to slowest, the hypergraph-clustering methods are HDHPA, kD, AD, kh, and Ah. Using association rules to construct hypergraphs requires iterative processing of complex, large-scale datasets; in a non-parallel search regime, this leads to longer running times, so parallel search can significantly reduce the running time on large-scale data. Comparing only the six PPI network datasets, it can be seen that HDHPA is more efficient than AD and Ah. Therefore, considering both running time and accuracy, HDHPA is more competitive.
Based on the clustering evaluation indicator values in the experimental results, we next discuss clustering efficiency. From Figure 7, the clustering efficiency of the AD method changes significantly with the characteristics of the datasets, and the Ah method generally scores lower than the other methods on the clustering evaluation indicators. The FM-index and ARI values of the HDHPA method are substantially higher than those of the other methods, with the highest values approaching 0.9. Therefore, the HDHPA method has lower time costs and more stable performance across different datasets. Because the construction and partitioning of the hypergraph occupy most of the running time, improving the traditional Apriori algorithm makes hypergraph construction a parallel computation, saving considerable time and achieving a better trade-off between effectiveness and efficiency.

5. Conclusions and Discussion

This paper focuses on clustering multi-dimensional datasets using a constructed hypergraph model. We propose a hypergraph-clustering method called HDHPA (high-dimensional data clustering method based on hypergraph partitioning using an improved Apriori algorithm). The method starts by reducing the dimensionality of the high-dimensional data and then performs clustering based on the hypergraph model. During hypergraph construction, drawing on hypergraph theory, the traditional Apriori association rule algorithm is improved: HDHPA uses parallel computing to overcome the limitations of the Apriori algorithm in the iterative search of candidate sets and in algorithmic efficiency. HDHPA can also reasonably measure the similarity between data in high-dimensional space. Then, the dense subgraph partition (DSP) algorithm is used to partition the hypergraph and merge similar dense sub-hypergraphs, which improves clustering accuracy and yields the final clustering result.
Experiments were conducted on six protein–protein interaction (PPI) network datasets and the MNIST handwritten digit recognition dataset. The experimental results demonstrate that the HDHPA method outperforms traditional clustering algorithms and other hypergraph-partitioning algorithms, achieving efficient clustering on datasets of different modalities. This indicates that the HDHPA method has broad applicability in various domains. Finally, the running times of all methods were compared, and the results show the superior performance of the HDHPA method in terms of time complexity. It is worth noting that the performance of the proposed method depends on the constructed hypergraph structure, so the HDHPA method still has potential issues. First, the method does not yet comprehensively integrate more features of nodes and hyperedges from high-dimensional data into the hypergraph construction. Second, the reduction of time complexity in the improved Apriori algorithm could be treated more comprehensively. Therefore, our future research has two main directions. On the one hand, we will continue to conduct experiments on datasets from different fields, including high-dimensional network datasets such as social networks and brain networks, adjusting the method's parameters appropriately to the characteristics of different modal datasets. On the other hand, combining the latest intelligent clustering technology, we will further study mechanisms to shorten the method's running time.

Author Contributions

Conceptualization, R.C. and F.H.; methodology, R.C., F.H., F.W. and L.B.; software, R.C.; validation, F.W., L.B. and F.H.; formal analysis, R.C., F.H. and L.B.; investigation, R.C. and F.W.; resources, L.B. and F.H.; writing—original draft preparation, R.C.; writing—review and editing, R.C., F.H. and L.B.; visualization, R.C.; supervision, F.H. and L.B. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (Grant No. 61663041) and the Basic Research Program of Qinghai Province (Grant No. 2023-ZJ-916M).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Supplementary Explanation to Figure 3

We know that the primary goal of the protein–protein interaction (PPI) network is to find similar evolutionary historical bases, such as protein sequence, GO terms (related gene ontology), structural and genomic information, and extract homogeneous features from it to obtain useful biological information. Many scholars in related fields often construct feature vectors for PPI by processing the biological information of proteins. Therefore, this process can apply conventional clustering techniques. Characterizing the PPI network is essential for understanding how proteins form functional units in the proteome. Therefore, this paper constructs a protein–protein interaction network hypergraph model and performs clustering, which can preserve the original biological information of proteins and provide powerful technical solutions for protein prediction. Then, we added the scales for the network connection edge colors in Figure 3, as shown below:
Figure A1. Network connection edge color scale in Figure 3.
Finally, for Figure 3, network nodes represent protein splice isoforms or post-translational modifications folded together; that is, each node represents all proteins generated by a single protein-coding gene. The size of a node represents the order of the protein. If proteins interact with each other, they are connected, and the color of the connection represents the degree of interaction between the two. Taking the FAA4 network in Figure 3a as an example, the PPI network has fewer nodes and connections, so it is clear from the figure whether the interaction between proteins is strong or weak: the larger the nodes, the more purple the color of the connecting lines, although such nodes are often few in PPI networks. This PPI network therefore satisfies the degree distribution of networks generated by both preferential and non-preferential attachment; it follows a power-law distribution and is closer to the relationships between network nodes in the real world.

References

  1. Guo, X.; Liu, X.; Zhu, E. Adaptive self-paced deep clustering with data augmentation. IEEE Trans. Knowl. Eng. 2019, 32, 1680–1693. [Google Scholar] [CrossRef]
  2. Mago, N.; Shirwaikar, R.D.; Acharya, U.D.; Hegde, K.G.; Lewis, L.E.S.; Shivakumar, M. Partition and Hierarchical Based Clustering Techniques for Analysis of Neonatal Data. In Proceedings of International Conference on Cognition and Recognition; Springer: Berlin/Heidelberg, Germany, 2017; pp. 345–355. [Google Scholar]
  3. von Luxburg, U. A tutorial on spectral clustering. Stat. Comput. 2007, 17, 395–416. [Google Scholar]
  4. Zeng, J. Analysis of data mining K-means clustering algorithm based on partitioning. Moder. Electron. Technol. 2020, 3, 14–17. [Google Scholar]
  5. Wang, G.Y. A Preliminary Study on Uncertainty-Oriented Data Clustering. Master’s Thesis, Jilin University, Changchun, China, 2020. [Google Scholar]
  6. Ackermann, M.R.; Blömer, J.; Kuntze, D.; Sohler, C. Analysis of agglomerative clustering. Algorithmica 2014, 69, 184–215. [Google Scholar] [CrossRef]
  7. Menche, J.; Sharma, A.; Kitsak, M.; Ghiassian, S.D.; Vidal, M.; Loscalzo, J.; Barabási, A.-L. Uncovering disease-disease relationships through the incomplete interactome. Science 2015, 347, 1257601. [Google Scholar] [CrossRef]
  8. Guo, L.; Cui, Y.; Liang, H.; Zhou, Z. Spectral bisection community detection method for urban road networks. In Proceedings of the 2021 40th Chinese Control Conference (CCC), Shanghai, China, 26–28 July 2021; pp. 806–811. [Google Scholar]
  9. Newman, M.E.J. Fast algorithm for detecting community structure in networks. Phys. Rev. E 2004, 69, 066133. [Google Scholar] [CrossRef]
  10. Newman, M.E.J. Spectral methods for community detection and graph partitioning. Phys. Rev. E 2013, 88, 042822. [Google Scholar] [CrossRef]
  11. Berge, C. Graphs and Hypergraphs; North-Holland: Amsterdam, The Netherlands, 1973. [Google Scholar]
  12. Brusa, L.; Matias, C. Model-based clustering in simple hypergraphs through a stochastic blockmodel. arXiv 2022, arXiv:2210.05983. [Google Scholar]
  13. Wang, S.; Li, X.; Liu, D. Hyper-network Model of Architecture for Weapon Equipment System of Systems Based on Granular Computing. J. Syst. Eng. Electron. 2016, 38, 836–843. [Google Scholar]
  14. Strehl, A.; Ghosh, J.; Cardie, C. Cluster ensembles: A knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res. 2002, 3, 583–617. [Google Scholar]
  15. Yang, C.; Liu, D.; Yang, B.; Chi, S.; Jin, D. Research on clustering ensemble methods. Comput. Sci. 2011, 38, 166–170. [Google Scholar]
  16. Suo, Q.; Guo, J. Hypernetworks: Structure and evolution mechanism. Syst. Eng. Theory Pract. 2017, 37, 720–734. [Google Scholar]
  17. Tian, L.; Zhang, J.; Zhang, J.; Zhou, W.; Zhou, X. Knowledge graph: Representation, construction, reasoning, and hypergraph theory. J. Comput. Appl. 2021, 41, 2161–2186. [Google Scholar]
  18. Liu, S.; Huang, X.; Xian, Z.; Zuo, W. Commodity warehouse model based on hypergraph embedding representation. Chin. J. Manag. Sci. 2023, 1–12. [Google Scholar] [CrossRef]
  19. Wei, L.; Gong, X.; Qian, W.; Zhou, A. Outlier detection in high-dimensional space. J. Softw. 2002, 2, 280–290. [Google Scholar]
  20. Cui, Y.; Yang, B. Several applications of hypergraphs in data mining. Comput. Sci. 2010, 37, 220–222. [Google Scholar]
  21. Kadir, M.; Sobhan, S.; Islam, M.Z. Temporal relation extraction using Apriori algorithm. In Proceedings of the 2016 5th International Conference on Informatics, Electronics and Vision (ICIEV), Dhaka, Bangladesh, 13–14 May 2016; pp. 915–920. [Google Scholar]
  22. Agrawal, R.; Imielinski, T.; Swami, A. Mining Associations between Sets of Items in Massive Databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington, DC, USA, 25–28 May 1993; pp. 207–216. [Google Scholar]
  23. Althuwaynee, O.F.; Aydda, A.; Hwang, I.T.; Lee, Y.-K.; Kim, S.-W.; Park, H.-J.; Lee, M.-S.; Park, Y. Uncertainty reduction of unlabeled features in landslide inventory using machine learning t-SNE clustering and data mining apriori association rule algorithms. Appl. Sci. 2021, 11, 556. [Google Scholar] [CrossRef]
  24. Esmaeili, H.; Hakami, V.; Bidgoli, B.M.; Shokouhifar, M. Application-specific clustering in wireless sensor networks using combined fuzzy firefly algorithm and random forest. Expert Syst. Appl. 2022, 210, 118365. [Google Scholar] [CrossRef]
  25. Zhao, Y.; Li, C.; Shi, D.H.; Chen, G.; Li, X. Ranking cliques in higher-order complex networks. Chaos 2023, 33, 073139. [Google Scholar] [CrossRef] [PubMed]
  26. Chen, H.; Zhou, Y.; Mei, K.; Wang, N.; Tang, M.; Cai, G. An Improved Density Peak Clustering Algorithm Based on Chebyshev Inequality and Differential Privacy. Appl. Sci. 2023, 13, 8674. [Google Scholar] [CrossRef]
  27. Liu, B.; Hsu, W.; Ma, Y. Integrating classification and association rule mining. In Proceedings of the KDD’98: Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 27–31 August 1998; pp. 80–86. [Google Scholar]
  28. Liu, H.; Latecki, L.J.; Yan, S. Dense subgraph partition of positive hypergraphs. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 541–554. [Google Scholar] [CrossRef] [PubMed]
  29. van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
  30. Giannella, C.; Han, J.; Pei, J.; Yan, X.; Yu, P.S. Mining frequent patterns in data streams at multiple time granularities. Next Gener. Data Min. 2006, 35, 61–84. [Google Scholar]
  31. Hu, J.; He, L.; Mao, Y.; Yang, J. Research on improved algorithm for mining uncertain frequent subgraphs. Comput. Eng. Appl. 2015, 51, 112–116. [Google Scholar]
  32. Lin, Z. Research on Hierarchical Structure Construction and Maintenance Based on Dense Subgraph Approximation Mode. Ph.D. Thesis, East China Normal University, Shanghai, China, 2022; p. 000745. [Google Scholar]
  33. Barabási, A.L.; Oltvai, Z.N. Network biology: Understanding the cell’s functional organization. Nature Rev. Gene. 2004, 5, 101–113. [Google Scholar] [CrossRef] [PubMed]
  34. Johnson, S. Data Repository. 2020. Available online: https://www.samuel-johnson.org/data (accessed on 1 January 2022).
  35. Hu, F.; Liu, M.; Zhao, J.; Lei, L. Analysis and application of protein complex hypernetwork characteristics. Complex Syst. Complex. Sci. 2018, 4, 31–38. [Google Scholar]
  36. Pareek, V.; Tian, H.; Winograd, N.; Benkovic, S.J. Metabolomics and mass spectrometry imaging reveal channeled de novo purine synthesis in cells. Science 2020, 368, 283–290. [Google Scholar] [CrossRef]
  37. Fowlkes, E.B.; Mallows, C.L. A method for comparing two hierarchical clustering. J. Amer. Statist. Assoc. 1983, 78, 553–569. [Google Scholar] [CrossRef]
  38. Chicco, D.; Jurman, G. A statistical comparison between Matthews correlation coefficient (MCC), prevalence threshold, and Fowlkes–Mallows index. J. Biomed. Inform. 2023, 144, 104426. [Google Scholar]
  39. Kaufman, L.; Rousseeuw, P. Clustering by Means of Medoids; North-Holland: Amsterdam, The Netherlands, 1987. [Google Scholar]
  40. Karypis, G.; Aggarwal, R.; Kumar, V.; Shekhar, S. Multilevel hypergraph partitioning: Applications in VLSI domain. IEEE Trans. VLSI Sys. 1999, 7, 69–79. [Google Scholar] [CrossRef]
  41. Cong, Q.; Anishchenko, I.; Ovchinnikov, S.; Baker, D. Protein interaction networks revealed by proteome coevolution. Science 2019, 365, 185–189. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Example of a hypergraph. (a) The vertex set $V = \{v_1, v_2, \dots, v_8\}$ and the hyperedge set $E = \{e_1, e_2, e_3, e_4\}$, where $e_1 = \{v_2, v_3, v_6\}$, $e_2 = \{v_1, v_2, v_5, v_6\}$, $e_3 = \{v_4, v_5, v_6\}$, and $e_4 = \{v_7, v_8\}$. (b) The incidence matrix of (a); red indicates that the hyperedge contains the vertex. (c) The adjacency matrix of (a); blue indicates that hyperedges contain the two vertices, and the number gives how many hyperedges contain them.
Figure 2. An example of constructing hypergraphs with association rules. The gray parts of the table represent itemsets deleted for falling below the support threshold, and the gray arrows point to these deleted itemsets. The colored circles represent the connected hyperedges, and the hyperedges in the frequent 2-itemsets serve as transitions. Finally, only $\{v_1, v_2, v_3\}$ is retained, connected by hyperedge $e_1$.
Figure 3. Schematic diagram of the 6 protein–protein interaction networks (only networks composed of up to 110 randomly selected protein interactions are shown). The colored lines indicate the extent of protein–protein interactions; the closer the color is to purple, the more similar the biological information contained.
Figure 4. Relationship between σ value and the number of dense sub-hypergraphs produced by DSP.
Figure 5. Enrichment analysis of GO terms for the 6 protein–protein interaction networks.
Figure 6. Comparison of run times of different methods on different datasets.
Figure 7. Comparison of FM-index and ARI values on all datasets.
Table 1. Confusion matrix for non-equally important two-class classification issues.

                       Predicted Class
                       +          −
Actual class    +      TP 1       FN 2
                −      FP 3       TN 4
1 TP (True Positive): the number of correctly predicted positive sample pairs. 2 FN (False Negative): the number of positive sample pairs incorrectly predicted as negative classes. 3 FP (False Positive): the number of negative sample pairs incorrectly predicted as positive classes. 4 TN (True Negative): the number of correctly predicted negative sample pairs.
Table 2. PPI network datasets used for the experiments.

Dataset        Number of Proteins    Number of Interactions    Attribute Values
FAA4           789                   636                       1140
Natoc_0297     991                   1031                      1227
16 items       1106                  2611                      2115
HOXA10         2151                  4367                      2446
NNMT           2612                  5258                      2676
MBP            3121                  7423                      3235
Table 3. Summary information of 6 protein–protein interaction networks.

Dataset       Avg. Clustering Coefficient   Avg. Number of Neighbors   Network Density   Network Heterogeneity   Network Centralization
FAA4          0.651                         13.978                     0.155             0.562                   0.341
Natoc_0297    0.701                         22.659                     0.252             0.636                   0.470
16 items      0.695                         11.638                     0.112             0.740                   0.180
HOXA10        0.630                         44.596                     0.297             0.584                   0.489
NNMT          0.652                         27.648                     0.307             0.566                   0.493
MBP           0.631                         23.521                     0.196             0.677                   0.445
Table 4. FM-index values of clustering results using 5 methods (unit: %).

Method    FAA4     Natoc_0297   16 Items   HOXA10   NNMT    MBP
kh        70.97    66.19        55.18      60.77    50.15   62.95
kD        77.80    74.16        50.29      51.40    46.64   61.01
Ah        44.14    47.51        24.34      38.51    19.33   49.20
AD        56.47    40.58        21.90      25.11    26.72   33.17
HDHPA     80.46    73.39        70.13      65.60    49.55   73.18
Table 5. ARI values of clustering results using 5 methods (unit: %).

Method    FAA4     Natoc_0297   16 Items   HOXA10   NNMT    MBP
kh        77.62    74.50        34.98      56.97    46.95   67.06
kD        82.83    78.15        47.60      70.47    49.92   75.40
Ah        34.34    33.56        13.47      18.52    14.23   29.60
AD        58.67    50.91        38.12      46.33    58.95   48.71
HDHPA     81.13    80.60        49.29      67.81    57.56   79.88
Table 6. The clustering evaluation index of each method on the MNIST dataset (unit: %).

Method    FM-Index   ARI
kh        45.96      43.06
kD        46.52      48.40
Ah        63.97      57.24
AD        39.51      34.22
HDHPA     70.60      83.37