Article

Advancing Spectral Clustering for Categorical and Mixed-Type Data: Insights and Applications

Department of Economics and Business, University of Catania, Corso Italia, 55, 95129 Catania, Italy
Mathematics 2024, 12(4), 508; https://doi.org/10.3390/math12040508
Submission received: 31 December 2023 / Revised: 2 February 2024 / Accepted: 5 February 2024 / Published: 6 February 2024

Abstract

This study focuses on adapting spectral clustering, a numeric data-clustering technique, for categorical and mixed-type data. The method enhances spectral clustering for categorical and mixed-type data with novel kernel functions, showing improved accuracy in real-world applications. Despite achieving better clustering for datasets with mixed variables, challenges remain in identifying suitable kernel functions for categorical relationships.
MSC:
62H30; 62H20; 62-07; 62-09

1. Introduction

Clustering for categorical and mixed-type data has been a topic of increasing interest in recent years. Several research articles have explored the application of clustering methods to datasets that involve categorical and mixed-type variables. The researchers on this topic have contributed to the development of techniques and algorithms tailored to the challenges presented by such data. Clustering categorical and mixed-type data is a complex task, due to the discrete and non-ordinal nature of categorical variables, as categorical variables cannot be ordered. For this reason, research based on categorical similarity measures is an important theme in data mining. Categorical data clustering has proved to be a particularly complicated topic to be addressed. Some categorical clustering methods have been discussed in [1,2,3,4,5]. Specifically, in [1], the author compared model-based and distance-based clustering approaches for categorical data, assessing their performance through simulation studies. The work addressed the crucial question of whether the objectives of classifying similar objects and identifying observations from the same distribution yield consistent clustering results. In [2], introducing the multiple correspondence k-means methodology, this work simultaneously applied discrete and continuous factorial models to categorical data. Its primary contribution lay in achieving a dual objective: optimal partitioning of objects and identification of orthogonal linear combinations of factors, providing a valuable alternative to tandem analysis for categorical data. In [3], addressing the need for formalized cluster descriptions in categorical data, the authors proposed CACTUS, a summarization-based algorithm. Noteworthy for its speed and scalability, CACTUS outperformed previous algorithms and enabled subspace clustering, demonstrating its efficiency in discovering clusters in subsets of attributes. The work of [4] introduced a novel approach to clustering sets of categorical data, leveraging weighted iterative methods. The methodology assigned and propagated weights on categorical values, deriving a similarity measure from value co-occurrence. The study highlighted the quick convergence of the proposed iterative methods through analytical evaluation of non-linear dynamical systems. Finally, in [5], proposing the clustering-and-dimension-reduction model, this research addressed the challenge of handling mixed qualitative and quantitative variables in two-way two-mode data matrices. The model integrated various clustering-and-dimension-reduction techniques, showcasing versatility and performance through simulation studies and real-data analyses. These works have collectively contributed to advancing the methodology of clustering categorical data, offering insights into various approaches and techniques for uncovering meaningful patterns in diverse datasets.
In addressing the intricacies of clustering categorical and mixed-type data, spectral clustering has proved to be a promising methodology. It was initially designed for continuous numerical data. This adaptation responds to the growing demand for clustering in diverse fields, such as data mining and social-network analysis. The research effort is not limited to introducing a spectral-clustering method for categorical data; it also extends the application of these kernel functions to handling categorical and mixed datasets. This adaptation involves transforming the functional forms of the self-tuning kernel (5) [6] and the adaptive density-aware kernel (6) [7], to accommodate four distinct categorical similarities, following the approach introduced in [8].
Building on the methodology proposed in [8], which successfully constructed a kernel function suitable for textual data, this article replicates the idea for categorical and mixed-type data. The adaptation includes constructing kernel functions aligned with the functional forms of the original numerical kernels, thus demonstrating the versatility of the proposed spectral-clustering approach. In terms of the kernel functions chosen in this analysis, the introduced kernel functions, initially designed for numerical data in [6,7], exhibited significant improvements in clustering outcomes compared to other commonly used kernel functions in this context.
A notable feature of (5) was its introduction of the concept of scaling the distance between two objects with the distance of those objects to their h-th nearest neighbor. Furthermore, (6) extended the idea from [6] by considering the density of objects among themselves, incorporating an additional parameter τ for enhanced flexibility in capturing relationships between data points. Therefore, the results obtained on seven categorical datasets (including three mixed-type datasets) showcase the efficacy of spectral clustering in the categorical context, proving competitive with other methods.
Spectral-clustering methods for categorical data have been already analyzed in [9] and, more recently, in [10]; however, here, a different spectral-clustering method is introduced.
The rest of the work is organized as follows: in Section 2, some similarity measures for categorical data are presented; in Section 3, the standard spectral-clustering method is summarized; in Section 4, the categorical and mixed-type spectral-clustering approach is introduced; in Section 5, the results on categorical and mixed-type real data are presented; finally, in Section 6, some concluding remarks are discussed.

2. Measures for Categorical and Mixed-Type Data

In this section, exploration and comparison of similarity measures for categorical and mixed-type data are undertaken, with the specific goal of adapting them for spectral clustering.
Selecting suitable similarity measures is crucial when clustering categorical data. Unlike numerical data, categorical variables do not possess an inherent order, which necessitates the development of distinct similarity measures. In this section, we review several important similarity measures for categorical and mixed-type data.
In the pursuit of identifying the most appropriate categorical measure for spectral clustering, a comparative analysis of 15 measures utilizing the k-means algorithm on the similarity/dissimilarity matrix was conducted and is outlined in Table 1. Further insights into the specifics of the measures referenced in Table 1 can be found in [11,12].
Table 1 presents accuracy measures across various datasets and the categorical similarity measures considered. Specifically, accuracy provides a straightforward assessment of a classification model’s performance by indicating the proportion of correctly classified instances out of the total predictions. A higher accuracy value signifies better predictive ability, while a lower accuracy value indicates a higher rate of misclassification. Please note that in Table 1, the rows represent the different categorical similarity measures considered, while the columns represent the different datasets taken into consideration in our study. Our empirical investigations were conducted across seven distinct datasets, with the summarized details available in Table 2. Specifically, our experimentation encompassed three mixed-type datasets: Credit Approval data, Dermatology data, and Heart data. Notably, barring the Gower distance, which has a predefined definition, all other categorical similarity measures were exclusively applied to categorical variables, omitting numerical variables. Consequently, within the mixed-type datasets, the computation of the Gower coefficient was directly performed on the data. Conversely, for all other categorical similarity measures, a prior categorization of continuous variables was undertaken before computing the categorical similarity/dissimilarity measures, as detailed in Section 5.
Furthermore, for computing similarities utilizing the Dice and Jaccard coefficients, an initial binarization of the data was applied, followed by the computation of the respective coefficients. Specifically, for the Dice coefficient, the binarization typically includes converting the categorical data into binary vectors, where each element corresponds to the presence or absence of a specific category. The coefficient is then calculated, based on the count of common categories between two vectors, normalized by the total count of categories in each vector. Similarly, in the case of the Jaccard coefficient, the binarization process entails representing categorical data as binary sets. Each set contains elements corresponding to the categories present in the data. The Jaccard coefficient is subsequently computed by measuring the intersection of these sets divided by their union. It is crucial to note that the binarization process is fundamental in enabling the application of these similarity measures to categorical or mixed-type data.
The k-means algorithm was executed with 100 starting points (the algorithm is quite robust regarding the number of starting points), and Table 1 presents the maximum values attained with respect to accuracy. The computation of categorical similarity/dissimilarity measures featured in Table 1 was facilitated through the utilization of the nomclust R package and the proxy R package.
Observing Table 1, it becomes evident that no categorical similarity/dissimilarity measure distinctly outperformed the others. Consequently, in order to ascertain an appropriate weighting of relationships among categorical variables within the context of spectral clustering, the LIN1 and VE dissimilarity measures, representing the most promising options, were selected. Note that among all the measures analyzed in Table 1, we chose to consider, for adapting the data to categorical spectral clustering, the two similarity measures LIN1 and VE, as they were the measures that most frequently provided the best results (see bold values in Table 1). Moreover, note that LIN1 and VE are considered promising dissimilarity measures for spectral clustering, due to their ability to capture semantic relationships, assess commonality between categories, and quantify information differences. These characteristics make them well-suited for the challenges posed by clustering categorical and mixed-type data, which led to their selection in this context.
Below, we summarize four measures for categorical data that we will consider in our categorical spectral-clustering studies.
Let X = {x_1, …, x_n} be a categorical dataset with n objects and m variables. Let k_l, for l = 1, …, m, be the number of categories of the l-th variable. Let p(u), for u = 1, …, k_l, be the relative frequency of the u-th category of the l-th variable.

2.1. Hamming Distance

The Hamming distance is a fundamental measure for categorical data. It quantifies dissimilarity between two data points based on the proportion of categorical attributes for which they differ. It is defined as
\[
\Delta(x_i, x_j) = \sum_{l=1}^{m} \frac{\delta(x_{il}, x_{jl})}{k_l}, \quad \text{for all } x_{il}, x_{jl} \in X_l, \tag{1}
\]
where X_l denotes the l-th column of the data matrix, k_l is the number of categories of the l-th variable, and δ(x_{il}, x_{jl}) represents the binary difference between two values, i.e.,
\[
\delta(x_{il}, x_{jl}) =
\begin{cases}
1 & \text{if } x_{il} \neq x_{jl}, \\
0 & \text{otherwise},
\end{cases}
\quad \text{for } l = 1, \ldots, m,
\]
and x_{il}, x_{jl} are the i-th and j-th objects of the l-th variable.
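For illustration, a minimal R sketch of the Hamming distance (1) is given below; it assumes the data are stored in a data frame of factors and uses the mismatch indicator δ defined above.

```r
# Minimal sketch of the Hamming distance (1); assumes X is a data.frame of factors
# (n objects, m categorical variables). Returns an n x n dissimilarity matrix.
hamming_dist <- function(X) {
  X <- as.data.frame(X)
  n <- nrow(X)
  D <- matrix(0, n, n)
  for (l in seq_along(X)) {
    x   <- as.character(X[[l]])
    k_l <- length(unique(x))                     # number of categories of variable l
    D   <- D + outer(x, x, FUN = "!=") / k_l     # add delta(x_il, x_jl) / k_l
  }
  D
}

# Example on hypothetical toy data:
# X <- data.frame(a = c("x", "y", "x"), b = c("u", "u", "v"))
# hamming_dist(X)
```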

2.2. Chi-Square Distance

The chi-square distance [13] measures dissimilarity based on the chi-square statistic, which is adapted to categorical data. It is particularly useful for mixed-type datasets, combining categorical and numerical attributes. The chi-square distance is defined as
\[
d(x_i, x_j) = \sqrt{\sum_{h=1}^{n} \sum_{l=1}^{m} x_{hl}} \; \sqrt{\sum_{l=1}^{m} \frac{1}{\sum_{h=1}^{n} x_{hl}} \left( \frac{x_{il}}{\sum_{l=1}^{m} x_{il}} - \frac{x_{jl}}{\sum_{l=1}^{m} x_{jl}} \right)^{2}}. \tag{2}
\]
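As a sketch, the chi-square distance (2) can be computed as follows; it is assumed here, purely for illustration, that the categorical variables have first been recoded as a nonnegative numeric matrix (e.g., a 0/1 indicator coding), since the distance operates on row and column totals.

```r
# Minimal sketch of the chi-square distance (2); assumes X is a nonnegative numeric
# matrix (e.g., an indicator coding of the categorical variables).
chisq_dist <- function(X) {
  X <- as.matrix(X)
  total   <- sum(X)            # grand total sum_{h,l} x_hl
  col_tot <- colSums(X)        # column totals sum_h x_hl
  P <- X / rowSums(X)          # row profiles x_il / sum_l x_il
  n <- nrow(X)
  D <- matrix(0, n, n)
  for (i in seq_len(n)) {
    diff2  <- sweep((P[rep(i, n), , drop = FALSE] - P)^2, 2, col_tot, "/")
    D[i, ] <- sqrt(total * rowSums(diff2))
  }
  D
}
```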

2.3. Variable Entropy Similarity

The Variable Entropy (VE) similarity [14] is a measure that takes into account the entropy of categorical attributes. It assigns higher weights to rare categories. The VE similarity for variables is defined as
\[
s_l(x_{il}, x_{jl}) =
\begin{cases}
-\dfrac{1}{\ln k_l} \displaystyle\sum_{u=1}^{k_l} p(u) \ln p(u) & \text{if } x_{il} = x_{jl}, \\
0 & \text{otherwise},
\end{cases} \tag{3}
\]
where p(u) denotes the relative frequency of the u-th category. This similarity is computed for all variables and ranges between 0 and 1. The VE similarity assigns a higher value to a match of two categories in a variable with large variability. In order to compute the proximity matrix, the following steps are required:
- the total similarity s(x_i, x_j) = \frac{1}{m} \sum_{l=1}^{m} s_l(x_{il}, x_{jl});
- the dissimilarity transformation d(x_i, x_j) = 1 - s(x_i, x_j).
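A minimal R sketch of the VE dissimilarity, following the two steps above (assuming a data frame of factors; variables with a single category should be dropped beforehand):

```r
# Minimal sketch of the VE dissimilarity; assumes X is a data.frame of factors.
ve_dissim <- function(X) {
  X <- as.data.frame(X)
  n <- nrow(X)
  m <- ncol(X)
  S <- matrix(0, n, n)
  for (l in seq_len(m)) {
    x   <- as.character(X[[l]])
    p   <- table(x) / n                        # relative frequencies p(u)
    k_l <- length(p)
    w   <- -sum(p * log(p)) / log(k_l)         # normalized entropy of variable l, Eq. (3)
    S   <- S + w * outer(x, x, FUN = "==")     # contribute only to matching pairs
  }
  S <- S / m        # total similarity s(x_i, x_j)
  1 - S             # dissimilarity d(x_i, x_j) = 1 - s(x_i, x_j)
}
```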

2.4. LIN1 Similarity

The LIN1 similarity [11] is designed to consider the frequency of categories in variables. It assigns different weights to categories based on their prevalence. This measure is particularly effective when one of the mismatched values is very frequent. The LIN1 similarity uses a subset Q of the categories of the variable X_l such that, for all q ∈ Q, p(x_{il}) ≤ p(q) ≤ p(x_{jl}). It is defined by the following expression:
\[
s_l(x_{il}, x_{jl}) =
\begin{cases}
\displaystyle\sum_{q \in Q} \ln p(q) & \text{if } x_{il} = x_{jl}, \\
2 \displaystyle\sum_{q \in Q} \ln p(q) & \text{otherwise},
\end{cases} \tag{4}
\]
where x i l , x j l are the i-th and j-th object of the l-th variable.
Also in this case, we convert this similarity into a dissimilarity, following these steps:
- compute the total similarity by
\[
s(x_i, x_j) = \frac{\sum_{l=1}^{m} s_l(x_{il}, x_{jl})}{\sum_{l=1}^{m} \left[ \ln p(x_{il}) + \ln p(x_{jl}) \right]};
\]
- as the LIN1 measure in (4) has values greater than 1, the following dissimilarity transformation is used:
\[
d(x_i, x_j) = \frac{1}{s(x_i, x_j)} - 1.
\]
The LIN1 measure captures the dissimilarity or similarity between categorical variables by assessing how often the categories align or diverge across different instances. In essence, it quantifies the agreement or disagreement in the categorical attributes, providing insight into the degree of similarity or dissimilarity in the patterns exhibited by those variables.
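The LIN1 dissimilarity can be sketched in R as below; the set Q is rebuilt for every pair of objects, so this naive version is quadratic in n per variable and is meant only to make the definition concrete, not to be efficient.

```r
# Minimal sketch of the LIN1 dissimilarity; assumes X is a data.frame of factors.
lin1_dissim <- function(X) {
  X <- as.data.frame(X)
  n <- nrow(X)
  num <- matrix(0, n, n)   # accumulates sum_l s_l(x_il, x_jl), Eq. (4)
  den <- matrix(0, n, n)   # accumulates sum_l [ln p(x_il) + ln p(x_jl)]
  for (l in seq_along(X)) {
    x  <- as.character(X[[l]])
    p  <- table(x) / n     # relative frequencies p(u), indexed by category name
    lp <- log(p)
    for (i in seq_len(n)) {
      for (j in seq_len(n)) {
        lo <- min(p[x[i]], p[x[j]])
        hi <- max(p[x[i]], p[x[j]])
        Q  <- names(p)[p >= lo & p <= hi]   # categories q with p(x_il) <= p(q) <= p(x_jl)
        s  <- sum(lp[Q])
        num[i, j] <- num[i, j] + if (x[i] == x[j]) s else 2 * s
      }
    }
    den <- den + outer(lp[x], lp[x], FUN = "+")
  }
  S <- num / den           # total LIN1 similarity s(x_i, x_j)
  1 / S - 1                # dissimilarity transformation
}
```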
The measures defined in (1)–(4) are valuable for describing a similarity measure between statistical objects when composed of categorical variables. In this study, these measures were employed to compute the kernel function of spectral clustering, i.e., to calculate the similarity matrix associated with pairs of objects in the data. Within the realm of graph theory, this matrix can also be construed as a weight matrix determining the edge weight between two vertices in a graph.

3. Spectral Clustering

In this section, we refer to the spectral-clustering method for numerical data; further details on this methodology can be found in [8,15,16].
Consider a set of objects X = {x_1, x_2, …, x_n} within 𝒳 ⊆ R^m; in the spectral-clustering framework, they can also be interpreted as vertices of an undirected graph. The initial step in grouping the data X into K clusters involves defining a symmetric and continuous function κ: 𝒳 × 𝒳 → [0, ∞), called the kernel function. Subsequently, a weighted matrix W = (w_ij) can be established, where w_ij = κ(x_i, x_j) ≥ 0 for x_i, x_j ∈ 𝒳. In terms of graph theory, each element w_ij of the weighted matrix W represents the weight of the edge connecting the vertices x_i and x_j. Moreover, by definition, the matrix W can be considered a similarity matrix. In spectral-clustering algorithms, the frequently adopted choice is the Gaussian kernel κ(x_i, x_j) = exp(−‖x_i − x_j‖² / (2ϵ)), where ϵ > 0 denotes a fixed parameter.
The selection of the kernel function plays a pivotal role in spectral-clustering algorithms, as it profoundly influences the data structure within the graph and, consequently, impacts the structure of the Laplacian and its corresponding eigenvectors. An optimal kernel function should ideally result in a weighted matrix W exhibiting distinct diagonal blocks, facilitating well-separated groups and aiding in determining the number of groups within the dataset by merely counting these blocks.
However, within the Gaussian kernel, the primary concern arises in choosing the scale parameter ϵ . Addressing this, [6] proposed computing a local scaling parameter ϵ i for each x i , introducing the following kernel function:
\[
\kappa_{ST}(x_i, x_j) = \exp\left( -\frac{\| x_i - x_j \|^2}{\epsilon_i \, \epsilon_j} \right), \tag{5}
\]
where ε_i = ‖x_i − x_h‖ and x_h represents the h-th neighbor of point x_i (similarly for ε_j). This approach, termed self-tuning, generates a similarity matrix dependent on pairwise proximities among points. Yet, despite its nomenclature, the self-tuning approach necessitates selecting the number h of neighbors for point x_i. Although [6] suggested h = 7, this choice cannot be universally adopted, as highlighted in [16].
In [7], in order to amplify the intra-cluster similarity, the local density between the common neighbors of the points x i and x j was considered. The adaptive density-aware kernel function is defined by
\[
\kappa_{ADA}(x_i, x_j) = \exp\left( -\frac{\| x_i - x_j \|^2}{\epsilon_i \, \epsilon_j \left( \mathrm{CNN}(x_i, x_j) + 1 \right)} \right), \tag{6}
\]
where ε_i, ε_j are defined as in (5), and where CNN(x_i, x_j) = |B(x_i, τ) ∩ B(x_j, τ)| is the number of points in the common region between the spheres with radius τ centered at x_i and x_j, respectively. The advantage of this kernel is its ability to capture group membership in cases where point density is very important, as, for example, when the clusters are spiral- or U-shaped. The selection of the parameters τ and h for this kernel was also discussed in [16].
Once the similarity matrix W has been computed using a kernel function, the normalized graph Laplacian, denoted as L_sym ∈ R^{n×n}, is introduced. This matrix is formulated as
\[
L_{sym} = I - D^{-1/2} W D^{-1/2},
\]
where D = diag(d_1, d_2, …, d_n) represents the degree matrix and d_i denotes the degree of vertex x_i, defined as
\[
d_i = \sum_{j} w_{ij}.
\]
Subsequently, the K smallest eigenvalues λ_1 ≤ λ_2 ≤ … ≤ λ_K and the corresponding eigenvectors v_1, v_2, …, v_K of L_sym are computed.
The normalized Laplacian embedding in the K principal subspace maps the data from the input space 𝒳 to a feature space defined by the K principal subspace of L_sym.
The map is defined as Φ_Γ : {x_1, …, x_n} → R^K, given by
\[
\Phi_\Gamma(x_i) = (v_{1i}, \ldots, v_{Ki}), \quad i = 1, \ldots, n,
\]
where v_{1i}, …, v_{Ki} are the i-th components of the eigenvectors v_1, …, v_K of L_sym, respectively.
Subsequently, the n × K matrix Y = (y_1, …, y_n)′ of the embedded data in the feature space is computed, where
\[
y_i = \Phi_\Gamma(x_i) = (v_{1i}, \ldots, v_{Ki}), \quad i = 1, \ldots, n.
\]
The embedded data Y then undergo a clustering procedure, typically employing the k-means algorithm.
The spectral-clustering algorithm is outlined in Algorithm 1. Regarding the clustering of the embedded data, typically Step 6 of Algorithm 1 utilizes the k-means algorithm. However, ref. [8] proposed Gaussian mixtures as an alternative. Notably, both the k-means and mixture models generate convex clusters, with the latter accommodating more flexible shapes. The selection between the k-means algorithm and Gaussian mixtures for clustering embedded data involves trade-offs. Known for its simplicity and speed, k-means is computationally efficient and robust against noise but assumes spherical clusters and equal-sized cluster assignments. It may struggle with more complex data structures and varying cluster densities. On the other hand, Gaussian mixtures provide greater flexibility in modeling clusters with different shapes and sizes, accommodating overlapping clusters and offering probabilistic cluster assignments. Further investigations in [15] suggest that Gaussian-mixture models offer a balanced trade-off between model simplicity and efficacy after rigorous trials with diverse non-Gaussian component densities.
Algorithm 1 Spectral-Clustering Algorithm [16]
Input: dataset X , number of clusters K, kernel function κ .
  • Compute the similarity matrix W . The first step involves creating an affinity matrix, capturing the pairwise similarities between data points. This matrix is crucial as it forms the basis for identifying relationships within the dataset.
  • Compute the normalized graph Laplacian L sym based on W . Note that the degree matrix is computed to represent the importance of each data point in the dataset. It provides a measure of the connectivity of each point to the rest of the dataset, aiding in the identification of densely connected regions. Therefore, the Laplacian matrix is then normalized to ensure consistent scaling across different datasets. This step enhances the stability and robustness of the algorithm, making it applicable to diverse data scenarios.
  • Eigendecomposition: compute the K smallest eigenvalues 0 = λ_1 ≤ λ_2 ≤ … ≤ λ_K and consider the corresponding eigenvectors v_1, v_2, …, v_K of L_sym. This eigendecomposition of the normalized Laplacian matrix yields eigenvalues and corresponding eigenvectors, which encapsulate essential information about the dataset’s structure and connectivity.
  • Embed the data in the K principal subspace: compute Φ_Γ(x_i) according to the normalized Laplacian embedding and build the matrix Y. A subset of eigenvectors is chosen based on the eigenvalues, and they play a crucial role in defining the cluster structure within the data.
  • Data Clustering: run a clustering algorithm on the matrix Y.
Output: the data clustering C 1 , , C K .
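The steps of Algorithm 1 translate almost directly into R. The sketch below assumes a precomputed similarity matrix W (from any of the kernels discussed) and a known number of clusters K; the final step uses k-means for brevity, although the paper also considers Gaussian mixtures.

```r
# Minimal sketch of Algorithm 1; assumes W is an n x n similarity matrix and K the
# desired number of clusters.
spectral_clustering <- function(W, K) {
  n <- nrow(W)
  d <- rowSums(W)                                   # vertex degrees d_i
  D_is <- diag(1 / sqrt(d))                         # D^(-1/2)
  L_sym <- diag(n) - D_is %*% W %*% D_is            # normalized graph Laplacian
  eig <- eigen(L_sym, symmetric = TRUE)             # eigenvalues returned in decreasing order
  idx <- order(eig$values)[1:K]                     # indices of the K smallest eigenvalues
  Y <- eig$vectors[, idx, drop = FALSE]             # Laplacian embedding Phi_Gamma
  kmeans(Y, centers = K, nstart = 100)$cluster      # cluster the embedded data
}
```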

4. Categorical Spectral Clustering

Spectral clustering, a widely-used method for clustering numerical data, can be adapted to categorical and mixed-type data by employing appropriate kernel functions and similarity measures. This section delves into the adaptation of spectral clustering to cater to the unique characteristics of categorical data.
In spectral clustering, the central idea is to transform the data into a graph representation, where each data point becomes a node, and the edges between nodes represent the pairwise similarities or distances between data points.
Spectral clustering requires defining appropriate kernel functions, which transform data into a form suitable for spectral analysis. In the context of categorical data, we introduce novel kernel functions, specifically designed for categorical and mixed-type datasets. These kernel functions are extensions of the self-tuning kernel and the adaptive density-aware kernel. The main contribution of this work advocates employing the kernel functions expounded in (8) and (9), wherein instead of utilizing the Euclidean distance, we incorporate specific categorical measures.
The spectral-clustering algorithm for categorical data has already been previously considered in [9], where the weighted Hamming distance introduced in (1) was used as a similarity measure to build the kernel function. Nevertheless, upon applying the kernel function suggested in [9] to the categorical datasets delineated in Table 2, the outcomes exhibited inferior performance compared to those achieved by the k-means method, concerning accuracy. Consequently, the main contribution of this work was to consider the categorical versions of the self-tuning kernel function (5) and the adaptive density-aware kernel function (6).
In order to group the categorical data X in K clusters via the spectral-clustering approach, the following appropriate kernel functions κ S T and κ A D A are defined:
\[
\kappa_{ST}(x_i, x_j) = \exp\left( -\frac{\mathrm{dist}(x_i, x_j)}{\epsilon_i \, \epsilon_j} \right), \tag{8}
\]
\[
\kappa_{ADA}(x_i, x_j) = \exp\left( -\frac{\mathrm{dist}(x_i, x_j)}{\epsilon_i \, \epsilon_j \left( \mathrm{CNN}(x_i, x_j) + 1 \right)} \right), \tag{9}
\]
where
  • dist(x_i, x_j) is a distance or dissimilarity measure for the categorical data presented in Section 2, i.e., the Hamming distance, the chi-square distance, or the dissimilarities derived from the variable entropy and LIN1 similarities;
  • ε_i = dist(x_i, x_h), where x_h is the h-th neighbor of point x_i (similarly for ε_j);
  • CNN(x_i, x_j) = |B(x_i, τ) ∩ B(x_j, τ)|, where B(x_i, τ) is the sphere centered at x_i with radius τ (analogously for B(x_j, τ)).
Following the spectral-clustering algorithm (Algorithm 1), once the similarity matrix W has been built through these kernel functions, one has to compute the symmetric Laplacian matrix L_sym, which is a representation of the graph's structure and provides insights into the data's connectedness. Afterwards, eigenvalue decomposition of the Laplacian matrix is performed, yielding the symmetric Laplacian embedding Φ_Γ. This step produces the eigenvectors and eigenvalues, which provide valuable information about the data's intrinsic structure. Moreover, creating a low-dimensional embedding of the data using the eigenvectors associated with the smallest eigenvalues yields an embedded representation suitable for clustering. Finally, Gaussian-mixture models are employed to cluster the low-dimensional embedding.
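To make the construction concrete, the following R sketch builds the two categorical kernels (8) and (9) from any dissimilarity matrix D produced by the measures of Section 2. The default values of h and τ follow the heuristics discussed in Section 5 and are assumptions of this sketch, not fixed choices of the method.

```r
# Minimal sketch of the categorical kernels (8) and (9); assumes D is an n x n
# categorical distance/dissimilarity matrix (Section 2).
categorical_kernels <- function(D,
                                h   = floor(nrow(D) / 2),
                                tau = quantile(D[upper.tri(D)], 0.10)) {
  # eps_i: distance of x_i to its h-th nearest neighbour (position h+1 skips x_i itself)
  eps <- apply(D, 1, function(di) sort(di)[h + 1])
  K_ST <- exp(-D / outer(eps, eps))                         # self-tuning kernel (8)
  B <- D <= tau                                             # tau-balls around each point
  CNN <- B %*% t(B)                                         # common-neighbour counts CNN(x_i, x_j)
  K_ADA <- exp(-D / (outer(eps, eps) * (CNN + 1)))          # adaptive density-aware kernel (9)
  list(ST = K_ST, ADA = K_ADA)
}

# Example (using the hamming_dist sketch from Section 2):
# W <- categorical_kernels(hamming_dist(X))$ADA
```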
Moreover, note that the specific contributions of this work lie in the adaptation of kernel functions: namely, the self-tuning kernel function and adaptive density-aware kernel, for spectral clustering on categorical and mixed-type data. The novelty stems from extending these kernel functions, originally designed for numerical data, to handle the nuances of categorical variables. These adaptations address a gap in the literature, where existing kernel functions are typically designed for numerical data, making them unsuitable for nominal data. The advantages of these new kernel functions include their effectiveness in improving clustering outcomes for categorical and mixed-type data, as demonstrated by the clustering results. They provide a more nuanced and data-adaptive approach, contributing to the advancement of spectral-clustering methodologies tailored to handling diverse data types.
Remark 1. 
The choice of parameters, such as the number of clusters and the parameter(s) in the kernel functions, is essential for effective spectral clustering. These parameters are typically determined based on the characteristics of the dataset. Here, the number of clusters and the best parameter(s) in the kernel functions are assumed known values; see Section 5 for details.
The proposed approach has demonstrated its effectiveness on various real datasets, yielding competitive results compared to traditional clustering methods. The choice of specific similarity measures and kernels played a key role in the outcomes. Visualizations of embedded data and eigengap analysis provided an intuitive understanding of clustering structures. Additionally, the methodology outperformed alternatives specifically designed for categorical data, showcasing the adaptability of spectral clustering in mixed-data contexts.

5. Results on Real Data

Spectral clustering has demonstrated its effectiveness in various applications, and its adaptation for categorical data extends its utility to domains characterized by categorical and mixed-type attributes. In this section, we provide examples of real-world applications and demonstrate the practical utility of categorical spectral clustering.
The experiments were conducted using seven public datasets from the UCI Machine Learning Repository https://archive.ics.uci.edu/ml/datasets.php (accessed on 30 December 2023) and Kaggle https://www.kaggle.com/nareshbhat/health-care-data-set-on-heart-attack-possibility (accessed on 30 December 2023). These datasets include a mix of purely categorical and mixed-type data. They are described in Table 2.
Processing Mixed-Type Data. Some of the selected datasets (the Credit Approval, Dermatology, and Heart datasets) involve mixed-type attributes, including both categorical and numerical variables. The categorical measures described in Section 2 are, by definition, computed on nominal variables, not on numerical variables. To accommodate this mixed-type data in spectral clustering, a categorization technique based on the Calinski–Harabasz index was applied. This approach categorizes numerical variables to make them compatible with categorical attributes, a critical step in ensuring the effective application of spectral clustering. This idea was introduced in [9]. The Calinski–Harabasz index was used as the validity index for the clustering of the continuous variables; it is defined by
\[
S_{k,n}^{l} = \frac{(n - k)\, S_b^{l}(k)}{(k - 1)\, S_w^{l}(k)},
\]
where S_b^l is the between-cluster sum of squares and S_w^l is the within-cluster sum of squares. The optimal number of categories k_best^l was obtained by applying a clustering method to the numerical variable x_l and calculating the corresponding validity index S_{k,n}^l; k_best^l was then chosen as the smallest k at which S_{k,n}^l attained a local maximum. In particular, each numerical variable was categorized by one-dimensional k-means applied to that single variable.
Choice of Parameters. Selecting appropriate parameters plays a crucial role in the success of spectral clustering. The choice of the number of clusters K and other kernel function parameters can significantly impact the clustering results.
The approach outlined in [16] for determining the number of clusters in spectral clustering distinguishes itself from conventional methods through the incorporation of a collective graphical analysis encompassing the similarity matrix, eigengap values, and eigenvector configurations post-Laplacian embedding. Accentuating the interdependence of proximity parameters (h and/or τ ) in the kernel function with the number of clusters (K), it facilitates a simultaneous analysis. The proposed multi-step procedure entails investigating a subset of proximity parameters and evaluating the similarity matrix, eigengap values, and scatter plot of the embedded data for each parameter. The key steps involve: (1) Identifying a distinct diagonal block structure or aligned/star structure. (2) Scrutinizing the eigengap plot and determining K based on a unique maximum eigengap. (3) In the presence of multiple maxima, analyzing embedded data plots for different K values based on local maxima, ensuring K is not smaller than the number of spikes.
Therefore, while there is no one-size-fits-all solution for parameter selection, some of the experiments carried out on the considered datasets indicated that setting the number of nearest neighbors h at half of the data size and determining the radius τ as the 10th percentile of the distance distribution of W often leads to promising results.
Selecting the optimal number of clusters K is a critical step in spectral clustering. The eigengap method is used to determine the number of clusters. This method examines the first eigengap, which is the difference between consecutive eigenvalues in the spectral decomposition. Figure 1 shows that the eigengap analysis effectively identified the correct number of clusters in the Heart and Dermatology datasets. Moreover, visualizing the results of spectral clustering in categorical datasets can provide insights into the cluster structures. Figure 2 and Figure 3 show the shape of the embedded data for two example datasets: Dermatology data and Congressional Voting data. These visual representations help in understanding the clusters formed by the spectral-clustering algorithm; see [16] for further details.
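The first-eigengap rule can be sketched as follows, assuming the eigenvalues of L_sym are available in increasing order:

```r
# Minimal sketch of the first-eigengap rule; lambda is the vector of eigenvalues of
# L_sym sorted in increasing order.
choose_K_eigengap <- function(lambda, K_max = min(10, length(lambda) - 1)) {
  gaps <- diff(lambda[1:(K_max + 1)])   # lambda_{k+1} - lambda_k for k = 1, ..., K_max
  which.max(gaps)                       # K = position of the largest gap
}
```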
Results and discussion. The performance of spectral clustering using different combinations of categorical similarity measures (1)–(4) and kernel functions (5) and (6) was evaluated. These combinations encompassed the Hamming distance, chi-square distance, variable entropy, and LIN1 similarity defined in Section 2 and the self-tuning kernel (5) and adaptive density-aware kernel (6). For completeness, we compared our approach to alternative clustering techniques, including k-Modes, h-Prototypes, k-Means applied to similarity matrices, and the spectral-clustering method proposed in a previous work [9] specifically tailored for categorical data.
Table 3 presents the accuracy of the spectral-clustering method employing the Gaussian-mixture model on the embedded data while exploring varied combinations of categorical measures and distinct kernel functions. Specifically, categorical measures from Section 2 substituted the generic distance dist(x_i, x_j) within the kernel functions. Notably, while the SM, LIN, and VM measures performed well (as indicated in Table 1), their results across all datasets were inferior to those reported in Table 3; hence their omission from this discussion.
The comparative analysis in Table 3 underscores spectral clustering’s enhanced classification performance over k-Means, k-Modes, h-Prototypes, and the SpectralCat algorithm by [9]. All the experiments detailed in Table 3 were executed using R software. Our spectral-clustering algorithms were self-implemented, employing the kmodes function from the klaR R package for the k-Modes method, and the kproto function from the clustMixType R package for the h-Prototypes method.
Regarding kernel functions, the adaptive density-aware kernel (9) consistently matched or outperformed the self-tuning kernel (8). However, no single categorical measure universally fitted all the datasets. Nonetheless, the chi-square distance (2) demonstrated commendable performance across most cases, leading us to assert that, in combination with the adaptive density-aware kernel (9), it stood as the most effective categorical measure for spectral clustering.
For completeness, we provide visual representations in Figure 2 and Figure 3 showcasing the embedded data shapes of two categorical datasets: the Dermatology dataset and the Congressional Voting dataset, respectively. Furthermore, Figure 1 displays the first eigengaps for the Dermatology and Heart datasets. Notably, Figure 1a reveals a maximum eigengap corresponding to K = 6 , while Figure 1b indicates a maximum eigengap at K = 2 , representing the actual number of clusters in these datasets. Additionally, Figure 4 illustrates the similarity matrix derived from the self-tuning kernel (8) with the chi-square distance for the Heart dataset, confirming the existence of two diagonal blocks and validating the actual number of clusters within the Heart data.
In our evaluations, the choice of kernel function and categorical similarity measure played a significant role in achieving the best clustering results. Notably, the chi-square distance, in combination with the adaptive density-aware kernel, consistently delivered competitive performance across different datasets. Our method outperformed other clustering techniques, as demonstrated in the results presented in Table 3. Moreover, for the sake of clarity and completeness, from results obtained in Table 3, the best spectral-clustering algorithm for categorical data can be summarized as follows:
  • Compute the chi-squared distance from (2).
  • Compute the similarity matrix W from the adaptive density-aware kernel (9), using chi-square distance with h = n / 2 and τ as the 10th percentile of the chi-squared distance distribution.
  • Calculate the normalized graph Laplacian L sym .
  • Compute the eigenvalues of L sym and the number of clusters K * thanks to the first eigengap method.
  • Consider the eigenvectors of L sym and introduce the matrix Y with columns equal to the eigenvectors v 1 , , v K * associated with the K * smallest eigenvalues of L sym .
  • Re-normalize the rows of Y to have unit length yielding Z R n × K * .
  • Fit a Gaussian-mixture model to the rows of the matrix Z (an end-to-end sketch of these steps is given below).
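Putting the previous sketches together, the recommended configuration reads roughly as follows in R. Function names refer to the illustrative sketches above, binarize() stands for a hypothetical helper producing an indicator coding of the categorical variables, and the Gaussian-mixture step uses the mclust package.

```r
# Illustrative end-to-end pipeline (function names from the sketches above; binarize()
# is a hypothetical indicator-coding helper, not part of the paper).
library(mclust)                                    # Gaussian-mixture clustering (Mclust)

D   <- chisq_dist(binarize(X))                     # chi-square distance, Eq. (2)
W   <- categorical_kernels(D)$ADA                  # adaptive density-aware kernel, Eq. (9)
d   <- rowSums(W)
Ls  <- diag(nrow(W)) - diag(1 / sqrt(d)) %*% W %*% diag(1 / sqrt(d))
eig <- eigen(Ls, symmetric = TRUE)
K   <- choose_K_eigengap(sort(eig$values))         # first-eigengap rule
Y   <- eig$vectors[, order(eig$values)[1:K], drop = FALSE]
Z   <- Y / sqrt(rowSums(Y^2))                      # re-normalize rows to unit length
cl  <- Mclust(Z, G = K)$classification             # Gaussian-mixture model on Z
```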
The experiments presented in the previous section have provided valuable insights into the performance of spectral clustering in the context of categorical and mixed-type data. Our findings underscore the adaptability of spectral clustering to diverse data domains and the importance of selecting appropriate similarity measures and kernel functions for optimal results. Moreover, note that the computational complexity of the suggested spectral-clustering approach depends on various factors, including the size of the dataset, the number of clusters, and the chosen similarity measures and kernel functions. The overall computational complexity is often dominated by the eigenvalue-decomposition step, making it potentially high for large datasets.
One of the critical aspects in spectral clustering for categorical data is the choice of similarity measures. Categorical data involve variables that are not ordered, making it essential to rely on appropriate similarity measures. Our analysis of different similarity measures, including Hamming distance, chi-square distance, variable entropy, and LIN1 similarity, demonstrated that not all measures perform equally well across all datasets. In particular, the chi-square distance, which considers the frequency distributions of categorical variables, consistently delivered competitive results. This result supports the notion that similarity measures that capture the variability within categories can provide effective representations for spectral clustering. The variability-based approach enables the model to identify the importance of rare categories in forming clusters.
Our experiments introduced two categorical versions of kernel functions: the self-tuning kernel and the adaptive density-aware kernel. These kernels played a crucial role in shaping the similarity matrix, affecting the clustering outcomes. The adaptive density-aware kernel, in most cases, demonstrated superiority over the self-tuning kernel. The kernel-function selection has a direct impact on how the data are represented and separated in the spectral embedding space. Our results suggest that kernel functions that adapt to the data’s density provide a more flexible and effective approach for spectral clustering.
The performance of our categorical spectral-clustering approach was compared to other established clustering techniques. These alternatives included k-Modes, h-Prototypes, k-Means applied to similarity matrices, and the spectral-clustering method specifically designed for categorical data [9]. The results clearly indicated the superiority of our approach over these alternatives. This outcome underlines the potential of spectral clustering as a robust and versatile technique for clustering categorical and mixed-type data. Moreover, it highlights the importance of tailored methods for these types of data, as traditional techniques may not provide competitive results.
Visualizations of the embedded data for specific datasets, as presented in Figure 2 and Figure 3, offer an intuitive understanding of the cluster structures formed by spectral clustering. Visual representations provide a practical way to inspect the results and assess their quality. Furthermore, determining the optimal number of clusters is essential for spectral clustering. The eigengap analysis, as demonstrated in Figure 1, effectively identified the correct number of clusters for specific datasets. This highlights the value of the eigengap method as a useful tool for number-of-clusters selection in spectral clustering. Furthermore, it is observed that, as demonstrated in [16], an alternative method to determine the optimal number of clusters involves counting the visible diagonal blocks within the similarity matrix W , as exemplified in Figure 4.
Moreover, across the various scenarios considered in Table 3, it can be observed that the method consistently demonstrated superiority over the alternatives, as indicated by the reported accuracy values. This consistent out-performance suggests that the spectral-clustering approach is not overly sensitive to specific dataset characteristics and maintains its effectiveness across diverse scenarios. The spectral-clustering approach appears to be robust and capable of generalizing well across different types of datasets, showcasing consistent and competitive performance across various scenarios. Rather, its sensitivity may lie in the choice of the initial categorical similarity measure used to transform the categorical data into numerical form.
In conclusion, our experiments emphasized the effectiveness of spectral clustering in categorical and mixed-type data, providing competitive results compared to other traditional clustering methods. The choice of similarity measures, kernel functions, and the visual inspection of results play pivotal roles in the success of spectral clustering. These findings are particularly significant for practitioners dealing with categorical and mixed-type data, where tailored clustering approaches are essential. Finally, we note that the computational complexity of spectral clustering, primarily due to eigenvalue decomposition, can be addressed for large datasets through various strategies. However, it is important to underline that the investigation into the overall computational complexity of the algorithm is beyond the scope of this work.
In summary, from the results obtained, one can state that the choice of similarity measure showed that the chi-square distance often provided superior performance, emphasizing the importance of capturing variability in categorical variables. The analysis highlighted that density-adaptive kernels, especially the density-aware kernel, often led to better results than self-tuning kernels. Furthermore, the evaluation of the optimal number of clusters, through eigengap analysis, proved to be an effective strategy. The discussion emphasizes that despite the computational complexity associated with spectral clustering, the promising results justify the use of this methodology, especially when dealing with categorical and mixed data.

6. Conclusions

In this article, we explored the application of spectral clustering to the challenging domain of categorical and mixed-type data. Spectral clustering effectively revealed intricate cluster structures in such data, providing competitive results when compared to alternative clustering methods.
Role of Similarity Measures. Selecting appropriate similarity measures for categorical data is critical. Among the measures we evaluated, the chi-square distance consistently yielded competitive results. By considering the frequency distributions of categorical variables, this measure captured the nuances of categorical data and proved to be a suitable choice.
Kernel Functions. The choice of kernel functions played a crucial role in spectral clustering. The adaptive density-aware kernel outperformed the self-tuning kernel in most cases. These kernel functions had a significant impact on the data representation and clustering results.
Competitive Edge. Our categorical spectral-clustering approach outperformed other traditional clustering methods, demonstrating the potential of spectral clustering as a robust and versatile technique for clustering categorical and mixed-type data.
Visual Analysis. Visualizing the embedded data provided valuable insights into the cluster structures formed by spectral clustering. This visual inspection was essential for assessing the quality of the clustering results.
Optimal number of clusters. The eigengap analysis effectively determined the optimal number of clusters for specific datasets. This technique assisted in selecting the correct cluster count, a crucial step in spectral clustering.
In conclusion, spectral clustering for categorical and mixed-type data is a promising and effective approach. Its adaptability to various data domains, versatility in handling categorical variables, and competitive performance make it a valuable tool for data-clustering tasks. Understanding the nuances of categorical data, selecting appropriate similarity measures and kernel functions, and visually inspecting the results are essential steps in achieving successful spectral-clustering outcomes.
Applications in Economic and Financial Datasets. Spectral clustering holds particular importance in economic and financial datasets, where variables range from categorical attributes, like industry sectors, to numerical data, such as financial-performance metrics. Its versatility in accommodating this amalgamation of data types enables effective pattern recognition and segmentation, elucidating nuanced relationships and inherent groupings. Spectral clustering emerges as a potent tool capable of integrating categorical, numerical, and mixed data, facilitating insightful analyses in risk assessment, fraud detection, portfolio optimization, and market segmentation.
Future investigations should focus on automating parameter-selection methods for spectral clustering in categorical and mixed-type data. This entails developing approaches to determining optimal parameters, including the number of clusters and tuning parameters for similarity measures and kernel functions. Additionally, exploring innovative similarity measures specifically designed for categorical data is crucial to enhancing clustering efficacy. Overall, the future trajectory involves automating parameter selection, exploring novel similarity measures, and conducting thorough investigations into the algorithm’s robustness for improved practical applications. The potential of spectral clustering in categorical and mixed-type data calls for future research emphasis on automated parameter selection and the creation of specialized similarity measures. This will further enhance the adaptability and effectiveness of spectral clustering in diverse domains. Categorical spectral clustering is a promising approach, providing new perspectives and insights into complex data structures.

Funding

This study was funded by the European Union-NextGenerationEU, in the framework of the GRINS-Growing Resilient, INclusive and Sustainable project (GRINS PE00000018—CUP E63C22002120006). The views and opinions expressed are solely those of the author and do not necessarily reflect those of the European Union, nor can the European Union be held responsible for them.

Data Availability Statement

The experiments were conducted using seven public datasets from the UCI Machine Learning Repository https://archive.ics.uci.edu/ml/datasets.php (accessed on 28 February 2022) and Kaggle https://www.kaggle.com/nareshbhat/health-care-data-set-on-heart-attack-possibility (accessed on 28 February 2022).

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Anderlucci, L. Comparing Different Approaches for Clustering Categorical Data. Ph.D. Thesis, Alma Mater Studiorum Università di Bologna, Bologna, Italy, 2012.
  2. Fordellone, M.; Vichi, M. Multiple Correspondence K-Means: Simultaneous Versus Sequential Approach for Dimension Reduction and Clustering. In Data Science and Social Research. Studies in Classification, Data Analysis, and Knowledge Organization; Lauro, N., Amaturo, E., Grassia, M., Aragona, B., Marino, M., Eds.; Springer: Cham, Switzerland, 2017.
  3. Ganti, V.; Gehrke, J.; Ramakrishnan, R. Cactus—Clustering categorical data using summaries. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’99, New York, NY, USA, 15–18 August 1999; pp. 73–83.
  4. Gibson, D.; Kleinberg, J.; Raghavan, P. Clustering categorical data: An approach based on dynamical systems. VLDB J. 2000, 8, 222–236.
  5. Vichi, M.; Vicari, D.; Kiers, H.A.L. Clustering and dimension reduction for mixed variables. Behaviormetrika 2019, 46, 243–269.
  6. Zelnik-Manor, L.; Perona, P. Self-tuning spectral clustering. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Vancouver, BC, Canada, 13–18 December 2004; Volume 17.
  7. John, C.R.; Watson, D.; Barnes, M.R.; Pitzalis, C.; Lewis, M.J. Spectrum: Fast density-aware spectral clustering for single and multi-omic data. Bioinformatics 2019, 36, 1159–1166.
  8. Di Nuzzo, C.; Ingrassia, S. A mixture model approach to spectral clustering and application to textual data. Stat. Methods Appl. 2022, 31, 1071–1097.
  9. David, G.; Averbuch, A. Spectralcat: Categorical spectral clustering of numerical and nominal data. Pattern Recognit. 2012, 45, 416–433.
  10. Mbuga, F.; Tortora, C. Spectral Clustering of Mixed-Type Data. Stats 2022, 5, 1–11.
  11. Boriah, S.; Chandola, V.; Kumar, V. Similarity measures for categorical data: A comparative evaluation. In Proceedings of the SIAM International Conference on Data Mining, Atlanta, GA, USA, 24–26 April 2008; Volume 30, pp. 243–254.
  12. Santos, T.; Zárate, L. Categorical data clustering: What similarity measure to recommend? Expert Syst. Appl. 2015, 42, 1247–1260.
  13. Legendre, P.; De Cáceres, M. Beta diversity as the variance of community data: Dissimilarity coefficients and partitioning. Ecol. Lett. 2013, 16, 951–963.
  14. Šulc, Z.; Řezanková, H. Comparison of similarity measures for categorical data in hierarchical clustering. J. Classif. 2019, 36, 58–72.
  15. Di Nuzzo, C. Model Selection and Mixture Approaches in the Spectral Clustering Algorithm. Ph.D. Thesis, University of Messina, Messina, Italy, 2022. Available online: https://iris.unime.it/handle/11570/3222428 (accessed on 28 February 2022).
  16. Di Nuzzo, C.; Ingrassia, S. A graphical approach for the selection of the number of clusters in the spectral clustering algorithm. In Studies in Theoretical and Applied Statistics, SIS 2021; Springer Proceedings in Mathematics & Statistics; Salvati, N., Perna, C., Marchetti, S., Chambers, R., Eds.; Springer: Cham, Switzerland, 2022; Volume 406.
Figure 1. Eigenvalues of the similarity matrix W: (a) eigenvalues of the Dermatology data according to the self-tuning kernel (8), using the chi-square distance (2); (b) eigenvalues of the Heart data according to the self-tuning kernel (8), using the chi-square distance (2).
Figure 2. Dermatology dataset. Scatter plots of the embedded data according to the self-tuning kernel (8), using the chi-square distance (2), from two different perspectives: (a) eigenvectors v_2, v_3, and v_4; (b) eigenvectors v_3, v_4, and v_5. The colors highlight the data in the real classes.
Figure 3. Congressional Voting Records dataset. Scatter plots of the embedded data according to the self-tuning kernel (8), using the chi-square distance (2). Colors highlight the data in the real classes.
Figure 4. Heart dataset. Plot of the similarity matrix according to the self-tuning kernel (8), using the chi-square distance (2).
Table 1. Accuracy of the k-means algorithm computed on the similarity matrix for the categorical datasets.

| Measure | BAL | BC | CA | CV | DER | HEA | LYM |
|---|---|---|---|---|---|---|---|
| Gower | 0.5712 | 0.7292 | 0.7917 | 0.8782 | 0.8547 | 0.7723 | 0.5252 |
| Dice | 0.6096 | 0.6101 | 0.5222 | 0.8644 | 0.8408 | 0.7954 | 0.7465 |
| Jaccard | 0.5824 | 0.5776 | 0.5237 | 0.8621 | 0.8827 | 0.7954 | 0.7253 |
| Eskin | 0.4992 | 0.7076 | 0.8239 | 0.8759 | 0.7206 | 0.7855 | 0.4507 |
| Goodall 1 | 0.5184 | 0.4837 | 0.8254 | 0.8736 | 0.8128 | 0.8119 | 0.6549 |
| Goodall 2 | 0.5392 | 0.4874 | 0.8239 | 0.8804 | 0.8659 | 0.8152 | 0.6479 |
| Goodall 3 | 0.5408 | 0.4874 | 0.8239 | 0.8759 | 0.8659 | 0.8185 | 0.6479 |
| Goodall 4 | 0.544 | 0.7292 | 0.4717 | 0.8782 | 0.676 | 0.6898 | 0.5915 |
| IOF | 0.5072 | 0.7292 | 0.8192 | 0.8736 | 0.6788 | 0.8152 | 0.5563 |
| OF | 0.5312 | 0.7148 | 0.7917 | 0.8782 | 0.6592 | 0.7987 | 0.662 |
| LIN | 0.52 | 0.74 | 0.8178 | 0.8759 | 0.8187 | 0.8218 | 0.669 |
| LIN1 | 0.464 | 0.74 | 0.536 | 0.8667 | 0.933 | 0.8251 | 0.7253 |
| SM | 0.5392 | 0.7328 | 0.8269 | 0.8759 | 0.8715 | 0.8251 | 0.5493 |
| VE | 0.5392 | 0.74 | 0.83 | 0.8759 | 0.7598 | 0.8086 | 0.5422 |
| VM | 0.5216 | 0.74 | 0.83 | 0.8759 | 0.7934 | 0.8119 | 0.5493 |
Table 2. The properties of the categorical datasets.

| Dataset Name | n | m | Type | Classes | Class Distribution (in %) |
|---|---|---|---|---|---|
| Balance scale (BAL) | 625 | 4 | Nominal | 3 | 7.84, 46.08, 46.08 |
| Breast Cancer (BC) | 277 | 9 | Nominal | 2 | 70.76, 29.24 |
| Credit Approval (CA) | 653 | 15 | Mixed | 2 | 54.67, 45.33 |
| Congressional Voting Records (CV) | 435 | 16 | Nominal | 2 | 61.38, 38.62 |
| Dermatology (DER) | 358 | 34 | Mixed | 6 | 31.0, 16.76, 19.83, 13.41, 13.41, 5.59 |
| Heart (HEA) | 303 | 16 | Mixed | 2 | 45.54, 54.46 |
| Lymphography (LYM) | 142 | 18 | Nominal | 2 | 57.04, 42.96 |
Table 3. Accuracy of the spectral-clustering algorithm for different combinations of the categorical measures introduced in Section 2 and different kernel functions. In order to compare the results obtained, the accuracy of the k-means, k-Modes, h-Prototypes, and SpectralCAT algorithms is also shown.

| Method | Measure | BAL | BC | CA | CV | DER | HEA | LYM |
|---|---|---|---|---|---|---|---|---|
| SpectralCAT [9] | – | 0.4112 | 0.5054 | 0.8269 | 0.8736 | 0.87 | 0.8053 | 0.4577 |
| k-Modes | – | 0.5312 | 0.5523 | 0.7932 | 0.8667 | 0.5726 | 0.7888 | 0.6268 |
| h-Prototypes | – | – | – | 0.7976 | – | 0.7737 | 0.5875 | – |
| k-means | Hamming | 0.512 | 0.7076 | 0.8116 | 0.8586 | 0.8519 | 0.7987 | 0.4577 |
| k-means | Chi-square | 0.4688 | 0.6895 | 0.5237 | 0.8804 | 0.7905 | 0.8053 | 0.7042 |
| k-means | Lin1 | 0.464 | 0.7401 | 0.536 | 0.8667 | 0.933 | 0.8251 | 0.7253 |
| k-means | VE | 0.5392 | 0.7401 | 0.8392 | 0.8759 | 0.7598 | 0.8086 | 0.5422 |
| Spectral clustering, self-tuning kernel (8) | Hamming | 0.4928 | 0.6967 | 0.8239 | 0.8827 | 0.905 | 0.8053 | 0.5775 |
| Spectral clustering, self-tuning kernel (8) | Chi-square | 0.6048 | 0.7765 | 0.5268 | 0.8873 | 0.9721 | 0.8086 | 0.7746 |
| Spectral clustering, self-tuning kernel (8) | Lin1 | 0.4688 | 0.7473 | 0.5528 | 0.8713 | 0.9134 | 0.835 | 0.7324 |
| Spectral clustering, self-tuning kernel (8) | VE | 0.4928 | 0.7294 | 0.8407 | 0.8804 | 0.8296 | 0.7987 | 0.5704 |
| Spectral clustering, adaptive density-aware kernel (9) | Hamming | 0.5248 | 0.7292 | 0.8331 | 0.8827 | 0.905 | 0.8053 | 0.6056 |
| Spectral clustering, adaptive density-aware kernel (9) | Chi-square | 0.6608 | 0.7765 | 0.6171 | 0.8919 | 0.9721 | 0.8086 | 0.7958 |
| Spectral clustering, adaptive density-aware kernel (9) | Lin1 | 0.4688 | 0.7545 | 0.5804 | 0.8759 | 0.9469 | 0.835 | 0.831 |
| Spectral clustering, adaptive density-aware kernel (9) | VE | 0.5392 | 0.7509 | 0.8407 | 0.8804 | 0.9022 | 0.8152 | 0.7958 |