1. Introduction
Data visualization is a fundamental tool for exploring and understanding complex datasets [1,2]. High-dimensional data, which often arise in fields such as bioinformatics, genomics, and image processing, pose significant challenges for visualization [3,4,5]. ISOMAP is a widely used nonlinear dimensionality reduction technique that maps high-dimensional data into a lower-dimensional space while preserving the intrinsic geometric structure of the data. However, computing geodesic distances for very large datasets can be computationally and memory-intensive. When working with extremely high-dimensional data, ISOMAP may still succumb to the curse of dimensionality, like many other dimensionality reduction approaches [6]. To mitigate these problems, an approximate method can be used to reduce the computational and memory costs. For this purpose, we propose a random kitchen sinks (RKS) [7,8,9] approximation-based distance computation method in this paper. By incorporating RKS into the ISOMAP algorithm, we aim to significantly reduce the computational complexity of the traditional ISOMAP method while still obtaining a meaningful low-dimensional representation of the data. This approach has the potential to make ISOMAP more scalable and practical for analyzing large datasets.
The motivation behind this research is to improve the efficiency of the ISOMAP algorithm by leveraging an approximate method based on random kitchen sinks (RKS) [7,8]. RKS is a technique that allows for faster computation of the kernel matrix, which can be used to approximate the pairwise distances between data points; it leverages random features to approximate the geodesic distances efficiently without computing them explicitly. For each data point, ISOMAP identifies its k-nearest neighbors based on a distance metric and constructs a nearest-neighbor graph. This graph, which joins nearby data points, efficiently captures the local structure of the data. The geodesic distances between the data points in the nearest-neighbor graph are then calculated [10]; this critical phase accounts for the curvature and nonlinearity of the underlying manifold. Once the pairwise geodesic distances have been computed [11], the data are projected onto a lower-dimensional space that adheres as closely as possible to these distances using classical multidimensional scaling (MDS) [12].
RKS begins by generating a set of random features from a random distribution, such as a Gaussian or Rademacher distribution [9]. Using these random features, RKS applies a kernel approximation method to estimate the pairwise inner products between the mapped data points in the high-dimensional space [7]; the random features serve as the basis functions for mapping the data points. The main benefit of RKS is that it can approximate geodesic distances well without explicitly constructing the nearest-neighbor graph or calculating all pairwise geodesic distances [13]. Compared with more established methods like ISOMAP, this can dramatically lower the computational complexity and memory requirements [14], especially for large datasets. RKS is particularly helpful because it reduces the computational burden of calculating geodesic distances in the original high-dimensional space.
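To make the random-features idea concrete, the following is a minimal NumPy sketch of the classical random Fourier features construction of Rahimi and Recht [7], which approximates an RBF kernel by an inner product of randomized features. The function name and the demo data are illustrative, not part of the paper's method.

```python
import numpy as np

def rff_features(X, n_components, gamma=1.0, seed=0):
    """Map X to random features so that z(x) . z(y) ~ exp(-gamma * ||x - y||^2)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Frequencies drawn from the Gaussian spectral density of the RBF kernel.
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_components))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_components)
    return np.sqrt(2.0 / n_components) * np.cos(X @ W + b)

# The approximate kernel matrix is then just an inner product of the features.
X = np.random.default_rng(1).normal(size=(50, 10))
Z = rff_features(X, n_components=2000)
K_approx = Z @ Z.T
```

The key point, which RKS exploits, is that the expensive pairwise computation is replaced by one random projection followed by a cheap matrix product.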
To assess the performance of the approximate ISOMAP approach using RKS, we compared it with the widely used t-SNE [15] algorithm. Both ISOMAP and t-SNE are popular techniques for visualizing high-dimensional data, but they differ in their underlying principles and computational characteristics. By comparing the results of ISOMAP with RKS against those of t-SNE, we gained insights into the strengths and limitations of each method and their suitability for different types of datasets. To evaluate the quality of the low-dimensional embeddings generated by ISOMAP with RKS and by t-SNE, we employed several metrics: mean squared error (MSE), mean absolute error (MAE), and explained variance score (EVS). These metrics quantify the reconstruction error by comparing the distance matrix D computed from the high-dimensional original data with the approximate distance matrix ${D}_{approx}$ obtained from the low-dimensional embeddings. Additionally, we measured the runtime of each algorithm to assess its computational efficiency.
The comparison and evaluation were performed on a set of protein sequences, which are used in typical bioinformatics tasks such as characterizing SARS-CoV-2 variants, location of infection, or host specificity. Protein sequences are fundamental to understanding protein structure and function, and they are often represented as high-dimensional data. In this study, we employed different embedding methods based on k-mers [16], minimizers [17], and the position weight matrix (PWM) [18] to capture various aspects of the underlying structure and relationships between the protein sequences. This allowed us to explore the effectiveness of the approximate ISOMAP approach using RKS across different representations of the protein sequences. By evaluating the approximate ISOMAP approach using RKS and comparing it against t-SNE on a protein sequence dataset, this research aims to provide insights into the use of the RKS-based method for approximating ISOMAP. The evaluation focused on the quality of the low-dimensional embeddings and their ability to accurately capture the important characteristics of the protein sequences. The findings of this study contribute to the field of data visualization by helping with the choice of an appropriate dimensionality reduction method for bioinformatics and other domains dealing with high-dimensional data.
This paper makes the following key contributions to the field of data visualization and dimensionality reduction:
Proposed Approximate ISOMAP using RKS: We developed an approximate ISOMAP algorithm using random kitchen sinks (RKS) to improve the computational efficiency of ISOMAP while preserving the quality of the low-dimensional embeddings.
Comparative Analysis with t-SNE: We conducted a comprehensive comparative analysis between the approximate ISOMAP approach with RKS and the traditional t-SNE algorithm. This comparison provides insights into the relative strengths and weaknesses of these methods in terms of accuracy, computational efficiency, and ability to capture the intrinsic structure of the data.
Evaluation on Protein Sequences: We evaluated the performance of the approximate ISOMAP approach and t-SNE on a dataset of SARS-CoV-2 spike protein sequences. This evaluation included different embedding methods based on k-mers, minimizers, and the position weight matrix (PWM), allowing for a comprehensive assessment of the algorithms' effectiveness on diverse representations of the protein sequences.
Assessment Metrics: We utilized evaluation metrics, including mean squared error (MSE), mean absolute error (MAE), explained variance score (EVS), and runtime, to measure the quality of the low-dimensional embeddings and compare the performance of ISOMAP with RKS against that of t-SNE.
The remainder of this paper is organized as follows: In Section 2, we provide a detailed background on ISOMAP, t-SNE, and the random kitchen sinks (RKS) technique. Section 3 presents the methodology of the proposed approximate ISOMAP approach using RKS. In Section 4, we describe the experimental setup, including the dataset and the different embedding methods employed. Section 5 presents the results and analysis of the comparative evaluation between approximate ISOMAP and t-SNE, including different performance metrics. Finally, in Section 6, we discuss the findings, limitations, and potential future directions of this research.
2. Related Work
In recent years, various dimensionality reduction techniques have been developed to tackle the challenges posed by high-dimensional data visualization. In this section, we review relevant work in the field, focusing on ISOMAP, t-SNE, and related methods. ISOMAP, introduced in [11], is a widely used algorithm for nonlinear dimensionality reduction. It aims to preserve the geodesic distances between data points in the low-dimensional space. ISOMAP constructs a neighborhood graph based on pairwise distances and then uses graph-based algorithms, such as shortest-path algorithms, to compute the geodesic distances. While ISOMAP has been successful in preserving the global structure of the data, its computational complexity grows quadratically with the number of data points, making it inefficient for large-scale datasets.
To address the computational limitations of ISOMAP, several approximate methods have been proposed. For example, the authors of [19] introduced an approximation algorithm that leverages graph sparsification techniques to reduce the computational complexity of ISOMAP. They demonstrated that their approach achieves performance comparable to that of the original ISOMAP algorithm while significantly reducing the computation time. Another approach, proposed in [20], utilizes landmark-based approximation, where a small set of landmark points is selected to approximate the pairwise distances in the high-dimensional space. By constructing a low-dimensional embedding based on the landmark distances, the computational complexity of ISOMAP is effectively reduced.
While these approximate methods have produced promising results, they still suffer from certain limitations, such as a loss of accuracy in preserving the global structure or a requirement for additional parameter tuning. In this paper, we propose a novel approach that leverages random kitchen sinks (RKS) to approximate the pairwise distances, aiming to improve the efficiency of ISOMAP while maintaining the quality of the low-dimensional embeddings. The authors of [21] proposed an approximate geodesic distance matrix, which implies that ISOMAP can be solved as a kernel eigenvalue problem. In another work [22], the authors proposed upgraded landmark-isometric mapping (UL-Isomap), but their aim was to address the problems of landmark selection for hyperspectral imagery (HSI) classification. Other authors proposed a hybrid approach, the quantum kitchen sink (QKS) [23], which uses quantum circuits to nonlinearly transform classical inputs into features. It is inspired by the random kitchen sinks technique, whereby random nonlinear transformations can greatly simplify the optimization of machine learning (ML) tasks [24]. However, QKS is more helpful for complex optimization problems that arise in machine learning or in simulating quantum systems than for tasks like visualization.
t-SNE, introduced by van der Maaten and Hinton [15], is another popular dimensionality reduction algorithm, known for its ability to preserve the local structure of the data. Unlike ISOMAP, which focuses on preserving global distances, t-SNE emphasizes the preservation of pairwise similarities between nearby data points. It constructs a probability distribution over pairs of high-dimensional data points and a similar distribution over pairs of their low-dimensional counterparts. The algorithm then minimizes the Kullback–Leibler divergence between the two distributions to obtain the low-dimensional embedding. t-SNE has been widely adopted in various domains, including image analysis, natural language processing, and bioinformatics. Many studies have explored the effectiveness of t-SNE in visualizing high-dimensional biological data, such as gene expression profiles [25], single-cell RNA sequencing data [26], and protein structures [27]. t-SNE has demonstrated its ability to reveal meaningful patterns and clusters in complex biological datasets. While t-SNE is effective at preserving local similarities, it may struggle to capture global structures. Additionally, t-SNE's computational complexity scales quadratically with the number of data points, making it challenging to apply to large-scale datasets.
3. Proposed Approach
In this section, we present the proposed approach for improving the efficiency of the ISOMAP algorithm using random kitchen sinks (RKS) [8]. The RKS technique allows for faster computation of the kernel matrix, which approximates the pairwise distances between data points.
The isometric feature mapping (ISOMAP) algorithm is a classical dimensionality reduction technique that aims to preserve the global geometric structure of high-dimensional data in a lower-dimensional space. The algorithm is based on the concept of geodesic distances, which measure the shortest path between two points along the manifold on which the data lie. In the original ISOMAP approach, the algorithm starts by constructing a neighborhood graph, where each data point is connected to its nearest neighbors. Then, ISOMAP computes the pairwise geodesic distances between all data points using graph-based algorithms such as Dijkstra's algorithm. These pairwise distances are used to construct a low-dimensional embedding of the data using techniques like multidimensional scaling (MDS). MDS is responsible for finding a configuration of points in a lower-dimensional space that best approximates the pairwise geodesic distances observed in the high-dimensional space. By mapping the data points to a lower-dimensional space while preserving their pairwise geodesic distances, ISOMAP can reveal the underlying manifold structure of the data and enable effective visualization of its intrinsic geometry.
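The classical pipeline described above (k-NN graph, shortest-path geodesics, MDS) can be run in a few lines using scikit-learn's Isomap; this is an illustrative sketch on synthetic curve data, not the paper's implementation.

```python
import numpy as np
from sklearn.manifold import Isomap

# Toy data: a 1-D curve embedded in 3-D space.
t = np.linspace(0, 3 * np.pi, 300)
X = np.column_stack([t * np.cos(t), t * np.sin(t), np.sin(2 * t)])

# Isomap internally builds the k-NN graph, computes shortest-path
# geodesic distances, and applies MDS to embed the points.
iso = Isomap(n_neighbors=10, n_components=2)
Y = iso.fit_transform(X)
print(Y.shape)  # (300, 2)
```

The geodesic distance matrix computed by the shortest-path step is available afterwards as `iso.dist_matrix_`, which is the quantity the RKS approximation targets.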
3.1. Random Kitchen Sinks (RKS)
The RKS algorithm is employed to compute the kernel matrix approximation. Given an input data matrix X and the desired number of components ${n}_{\mathrm{components}}$, the RKS algorithm proceeds as given in Algorithm 1.
Remark 1. Note that we use the approximate distance matrix within the traditional ISOMAP instead of the default distance matrix, which requires more computational time.
Algorithm 1 Random Kitchen Sinks (RKS)
1: Input: X (input data matrix), ${n}_{\mathrm{components}}$ (number of components)
2: Output: ${D}_{\mathrm{all}}$ (kernel matrix approximation)
3: function Procedure(X, ${n}_{\mathrm{components}}$)
4:     ${D}_{\mathrm{all}}$ = initializeKernelMatrix(X.shape[0]) ▹ Kernel matrix ${D}_{\mathrm{all}}$ has shape $(n\times n)$, where n is the number of data points
5:     d = length(X[0,:]) ▹ Column length (number of features in the input data)
6:     P = initializeProjectionMatrix(d, ${n}_{\mathrm{components}}$) ▹ Projection matrix P has shape $(d\times {n}_{\mathrm{components}})$
7:     for i from 1 to ${n}_{\mathrm{components}}$ do
8:         r = generateRandomVector(d) ▹ Random vector r with the same number of elements as a data point
9:         P[:, i] = r / Norm(r) ▹ Normalize r and store it as the ith column of P
10:     end for
11:     for i from 1 to n do ▹ Loop over the rows and columns of ${D}_{\mathrm{all}}$
12:         for j from 1 to n do
13:             ${D}_{\mathrm{all}}[i,j]=(X[i]@P)\cdot (X[j]@P)$ ▹ @ is the operator for matrix multiplication
14:         end for
15:     end for
16:     Return ${D}_{\mathrm{all}}$
17: end function
The RKS algorithm initializes the kernel matrix ${D}_{\mathrm{all}}$ and the projection matrix P. It then generates random vectors, normalizes them, and stores them as the columns of P. The algorithm computes each element of ${D}_{\mathrm{all}}$ as the dot product between the projected data points $X[i]@P$ and $X[j]@P$.
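Algorithm 1 translates directly into NumPy; the sketch below vectorizes the double loop of lines 11–15 as a single matrix product, which is mathematically equivalent. The function name is ours.

```python
import numpy as np

def rks_kernel(X, n_components, seed=0):
    """Kernel matrix approximation following Algorithm 1 (vectorized)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Projection matrix P: each column is a normalized random direction.
    P = rng.standard_normal((d, n_components))
    P /= np.linalg.norm(P, axis=0)
    # D_all[i, j] = (X[i] @ P) . (X[j] @ P), computed for all pairs at once.
    Z = X @ P
    return Z @ Z.T
```

Computing `Z = X @ P` once and forming `Z @ Z.T` costs O(n·d·n_components + n²·n_components), avoiding the per-pair projections implied by a literal reading of the double loop.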
3.2. Approximate ISOMAP
To improve the efficiency of the ISOMAP algorithm, we utilize the RKS algorithm to approximate the pairwise distances between data points. The following steps outline the proposed approximate ISOMAP approach:
Set the desired number of components ${n}_{\mathrm{components}}$ and the number of nearest neighbors ${n}_{\mathrm{neighbors}}$.
Compute the kernel matrix approximation D using the RKS algorithm: $D=\mathrm{RKS}(X,{n}_{\mathrm{components}})$.
Perform ISOMAP on the approximated distances:
 (a)
Initialize the ISOMAP algorithm with the desired number of components and the number of nearest neighbors: $\mathrm{Isomap}({n}_{\mathrm{components}},{n}_{\mathrm{neighbors}})$.
 (b)
Fit the ISOMAP model on D.
Compute the reconstruction error using several different metrics:
 (a)
Compute the distance matrix approximation ${D}_{approx}$ obtained from the ISOMAP model.
 (b)
Calculate the mean squared error (MSE) between the original distance matrix D and the approximated distance matrix ${D}_{approx}$: $\mathrm{MSE}=\mathrm{mean}\_\mathrm{squared}\_\mathrm{error}(D,{D}_{approx})$.
 (c)
Calculate the mean absolute error (MAE) between D and ${D}_{approx}$: $\mathrm{MAE}=\mathrm{mean}\_\mathrm{absolute}\_\mathrm{error}(D,{D}_{approx})$.
 (d)
Calculate the explained variance score (EVS) between D and ${D}_{approx}$: $\mathrm{EVS}=\mathrm{explained}\_\mathrm{variance}\_\mathrm{score}(D,{D}_{approx})$.
Record the ending time and compute the running time.
Display the reconstruction error metrics and the running time.
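The steps above can be sketched end to end in Python. This is a minimal illustration under our reading of step 3(b) (fitting ISOMAP directly on the RKS matrix D); the use of scikit-learn's Isomap and metric functions is an assumption about the implementation, not the paper's exact code.

```python
import time
import numpy as np
from sklearn.manifold import Isomap
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             explained_variance_score)
from sklearn.metrics.pairwise import euclidean_distances

def approximate_isomap(X, n_components=2, n_neighbors=5, seed=0):
    start = time.time()
    # Step 2: RKS kernel matrix approximation (normalized random projections).
    rng = np.random.default_rng(seed)
    P = rng.standard_normal((X.shape[1], n_components))
    P /= np.linalg.norm(P, axis=0)
    Z = X @ P
    D = Z @ Z.T
    # Step 3: fit ISOMAP on the approximated matrix D.
    Y = Isomap(n_components=n_components,
               n_neighbors=n_neighbors).fit_transform(D)
    # Step 4: reconstruction errors between the original pairwise distances
    # and those recovered from the low-dimensional embedding.
    D_orig = euclidean_distances(X)
    D_approx = euclidean_distances(Y)
    mse = mean_squared_error(D_orig, D_approx)
    mae = mean_absolute_error(D_orig, D_approx)
    evs = explained_variance_score(D_orig, D_approx)
    runtime = time.time() - start
    return Y, mse, mae, evs, runtime
```

The same timing wrapper applied to a plain `TSNE(...).fit_transform(X)` call gives the baseline numbers for comparison.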
The proposed approximate ISOMAP approach starts by recording the starting time. The number of components ${n}_{\mathrm{components}}$ and the number of nearest neighbors ${n}_{\mathrm{neighbors}}$ are set. The RKS algorithm is then used to compute the kernel matrix approximation D based on the input data matrix X and ${n}_{\mathrm{components}}$. The ending time is recorded, and the running time is computed.
Next, the ISOMAP algorithm is initialized with the desired number of components and nearest neighbors. The ISOMAP model is fitted to the approximated distances D, and the data are transformed using the fit_transform function. The ending time is recorded again, and the running time is computed.
This proposed approach combines the efficiency of the RKS algorithm for approximating the pairwise distances with the ISOMAP algorithm’s ability to preserve the intrinsic structure of the data. By utilizing the approximate ISOMAP approach, we aim to achieve a computationally efficient yet effective dimensionality reduction technique for highdimensional datasets.
4. Experimental Setup
In this section, we provide details about the datasets used in our experiments and discuss the embedding methods used to convert biological sequences into fixed-dimensional numerical representations. The experiments were conducted on a system with an Intel Core i5 processor running at 2.40 GHz and 32 GB of memory, operating under Windows. For hyperparameter tuning, we used 5-fold cross-validation. Specifically, we used 5 nearest neighbors for Spike7k, 6 for Coronavirus Host, and 5 for the Protein Subcellular dataset. Moreover, we used the Euclidean distance metric for computing nearest neighbors.
4.1. Dataset Statistics
We used the following datasets for experimentation.
4.1.1. Spike7k Dataset
The Spike7k dataset consists of aligned spike protein sequences obtained from the GISAID database (https://www.gisaid.org/, accessed on 15 July 2023). The dataset comprises a total of 7000 sequences, which represent 22 different lineages of coronaviruses (class labels). Each sequence in the dataset has a length of 1274 amino acids. The distribution of lineages in the Spike7k dataset is presented in Table 1.
4.1.2. Coronavirus Host
The Coronavirus Host dataset consists of unaligned spike protein sequences along with information about the genus/subgenus and infected host of the clades within the Coronaviridae family. These data were extracted from ViPR [34] and GISAID. The dataset comprises a total of 5558 spike sequences, corresponding to 21 unique hosts, including Bats, Bovines, Cats, Cattle, Equine, Fish, Humans, Pangolins, Rats, Turtles, Weasels, Birds, Camels, Canis, Dolphins, the Environment, Hedgehogs, Monkeys, Pythons, and Swines. In this dataset, the classification tasks are based on the host names, which serve as the class labels. The maximum, minimum, and average lengths of the sequences in this dataset are 1584, 9, and 1272.36, respectively. Additional statistics for this dataset can be found in Table 2.
4.1.3. Protein Subcellular
The Protein Subcellular dataset [35] comprises unaligned protein sequences annotated with 11 distinct subcellular locations, which are used as class labels for classification tasks. The dataset contains a total of 5959 sequences. The subcellular locations, along with their respective counts, are presented in Table 3.
4.2. Baseline
As a baseline, we used the standard t-distributed Stochastic Neighbor Embedding (t-SNE) approach [15]. The t-SNE algorithm is a popular dimensionality reduction technique for visualizing high-dimensional data. The algorithm aims to preserve the local structure of the data points in the low-dimensional space. In the original t-SNE approach, the algorithm starts by computing pairwise similarities between data points in the high-dimensional space. The similarities are typically calculated using a Gaussian kernel function, where points that are close to each other have higher similarity values. Then, t-SNE constructs a probability distribution over pairs of high-dimensional points and a similar probability distribution over pairs of low-dimensional points in a lower-dimensional space. The algorithm iteratively minimizes the divergence between these two probability distributions by adjusting the positions of the low-dimensional points. During each iteration, t-SNE employs a gradient-based optimization technique to update the positions of the low-dimensional points, aiming to better match the pairwise similarities observed in the high-dimensional space. By the end of the optimization process, t-SNE generates a low-dimensional representation of the data in which nearby points in the high-dimensional space remain close to each other, allowing for effective visualization of the data's structure.
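The baseline described above can be reproduced with scikit-learn's TSNE; this is an illustrative call on random data (the parameter choices here are defaults, not the paper's settings).

```python
import numpy as np
from sklearn.manifold import TSNE

# Illustrative input: 200 points in a 30-dimensional feature space.
X = np.random.default_rng(0).normal(size=(200, 30))

# t-SNE minimizes the KL divergence between high- and low-dimensional
# pairwise-similarity distributions via gradient descent.
emb = TSNE(n_components=2, perplexity=30,
           init="pca", random_state=0).fit_transform(X)
print(emb.shape)  # (200, 2)
```

The `perplexity` parameter controls the effective neighborhood size used when converting pairwise distances into similarity probabilities.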
4.3. Evaluation Metrics
To assess the quality of the low-dimensional embeddings obtained from the approximate ISOMAP approach, reconstruction error metrics were computed. The distance matrix approximation ${D}_{approx}$ obtained from the ISOMAP model was compared with the original distance matrix D using the mean squared error (MSE), mean absolute error (MAE), and explained variance score (EVS). We also recorded the running time of each method.
4.3.1. Mean Squared Error (MSE)
The MSE is defined as follows:
$$\mathrm{MSE}=\frac{1}{{N}^{2}}\sum _{i=1}^{N}\sum _{j=1}^{N}{\left({D}_{ij}-{D}_{approx\left(ij\right)}\right)}^{2}$$
where N is the total number of data points, ${D}_{ij}$ is the pairwise distance between data points i and j in the original high-dimensional space, and ${D}_{approx\left(ij\right)}$ is the pairwise distance between the corresponding low-dimensional embeddings.
4.3.2. Mean Absolute Error (MAE)
The MAE is defined as follows:
$$\mathrm{MAE}=\frac{1}{{N}^{2}}\sum _{i=1}^{N}\sum _{j=1}^{N}\left|{D}_{ij}-{D}_{approx\left(ij\right)}\right|$$
where N is the total number of data points, ${D}_{ij}$ is the pairwise distance between data points i and j in the original high-dimensional space, and ${D}_{approx\left(ij\right)}$ is the pairwise distance between the corresponding low-dimensional embeddings.
4.3.3. Explained Variance Score (EVS)
The EVS is defined as follows:
$$\mathrm{EVS}=1-\frac{Var(D-{D}_{approx})}{Var(D)}$$
where D is the distance matrix computed from the original high-dimensional data, ${D}_{approx}$ is the distance matrix calculated from the low-dimensional embeddings, $Var(D)$ is the variance of the original distance matrix, and $Var(D-{D}_{approx})$ is the variance of the difference between the original distance matrix and the distance matrix computed from the low-dimensional embeddings.
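The three reconstruction-error definitions in this section translate directly into NumPy; below is a minimal sketch (averaging over all ${N}^{2}$ entries is our assumption about the normalization).

```python
import numpy as np

def reconstruction_errors(D, D_approx):
    """MSE, MAE, and EVS between two pairwise-distance matrices."""
    diff = D - D_approx
    mse = np.mean(diff ** 2)              # mean of (D_ij - D_approx(ij))^2
    mae = np.mean(np.abs(diff))           # mean of |D_ij - D_approx(ij)|
    evs = 1.0 - np.var(diff) / np.var(D)  # 1 - Var(D - D_approx) / Var(D)
    return mse, mae, evs
```

Note that EVS is insensitive to a constant offset between the two matrices: shifting every entry of ${D}_{approx}$ by the same amount leaves $Var(D-{D}_{approx})$ unchanged, whereas MSE and MAE both grow.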
4.4. Embedding Generation
To generate the numerical embeddings from the biological sequences, we used the following methods.
4.4.1. Spike2Vec
The Spike2Vec method, proposed by Ali et al. [36], operates on a protein sequence and produces a spectrum over the alphabet $\Sigma$, which comprises the 21 amino acid characters ACDEFGHIKLMNPQRSTVWXY. Specifically, the spectrum is a vector with one bin for every possible k-mer, where each bin contains the count/frequency of the corresponding k-mer within the protein sequence. Following alphabetical order, the first bin is for the k-mer $AAA$, while the last bin of the spectrum is for the k-mer $YYY$. The total number of bins (i.e., the length of the spectrum) is defined by the following expression:
$$|\Sigma {|}^{k}$$
This spectrum captures the frequency of occurrence of k-mers within the sequence. In our implementation, we set the value of k to 3, which was determined using a validation set approach; with $|\Sigma |=21$ and $k=3$, the spectrum therefore has ${21}^{3}=9261$ bins. This choice ensured that the spectrum effectively represents the patterns and relationships of trimer subsequences in the protein sequence.
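A k-mer spectrum of this kind can be computed with a few lines of Python; this sketch follows the description above (bins in alphabetical order, first bin AAA, last bin YYY), with function and variable names of our choosing.

```python
from itertools import product
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWXY"  # the 21 amino acid characters

def kmer_spectrum(seq, k=3):
    """Frequency vector over all |Sigma|^k k-mers, in alphabetical order."""
    index = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}
    vec = np.zeros(len(ALPHABET) ** k)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in index:  # skip k-mers containing characters outside Sigma
            vec[index[kmer]] += 1
    return vec
```

For k = 3 the resulting vector has 21³ = 9261 entries, matching the spectrum length given above.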
4.4.2. Min2Vec
The Min2Vec embedding method utilizes the concept of minimizers to represent biological sequences more concisely [17]. A minimizer is derived from a k-mer, which is a substring of consecutive nucleotides/amino acids: it is the lexicographically smallest substring (also called an m-mer, where $m<k$) over both the forward and backward directions of the k-mer. This approach helps reduce the complexity of representing the sequence.
Formally, for a given k-mer, a minimizer (also known as an m-mer) is defined as the substring of length m chosen from the k-mer that is lexicographically smallest in both the forward and backward directions of the k-mer. The values of k and m are chosen empirically; for our experiments, we set $k=9$ and $m=3$. The choice of these values was determined using a standard validation set approach [37].
Once the minimizers are computed for a given biological sequence, a frequency vector/spectrum-based representation is generated. This representation counts the occurrence of each m-mer (similar to the k-mer spectrum described in Section 4.4.1) within the sequence. The resulting frequency vector captures the distribution of these minimizers and provides a compact representation of the sequence. The length of the m-mer-based spectrum is as follows:
$$|\Sigma {|}^{m}$$
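The minimizer selection just described can be sketched as follows; this is our reading of the definition (smallest m-mer over the k-mer and its reverse), with illustrative function names.

```python
def minimizer(kmer, m=3):
    """Lexicographically smallest m-mer over the k-mer and its reverse."""
    candidates = []
    for s in (kmer, kmer[::-1]):  # forward and backward directions
        candidates += [s[i:i + m] for i in range(len(s) - m + 1)]
    return min(candidates)

def sequence_minimizers(seq, k=9, m=3):
    """One minimizer per k-mer window of the sequence."""
    return [minimizer(seq[i:i + k], m) for i in range(len(seq) - k + 1)]
```

The resulting list of m-mers is then binned into a frequency vector of length $|\Sigma {|}^{m}$, exactly as in the k-mer spectrum of Section 4.4.1.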
4.4.3. PWM2Vec
The PWM2Vec [18] approach involves assigning weights to each k-mer within a protein sequence. Specifically, given a sequence, PWM2Vec calculates a position weight matrix (PWM) with dimensions $|\Sigma |\times k$, where $\Sigma$ represents the 21-character amino acid alphabet and k is the chosen k-mer length. The PWM captures the frequency of each amino acid at each position across all k-mers within the sequence. Subsequently, PWM2Vec assigns a weight to each k-mer based on its corresponding count values in the PWM. These weights are then utilized as input vectors for machine learning models. In our implementation, we set k to 9 using the standard validation set approach [37].
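A simplified sketch of the PWM idea is shown below: counts per position form the PWM, and each k-mer is scored from those counts. The exact weighting scheme of PWM2Vec [18] may differ; the scoring rule here (summing a k-mer's per-position counts) is our illustrative choice.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWXY"
IDX = {aa: i for i, aa in enumerate(ALPHABET)}

def pwm2vec_sketch(seq, k=9):
    """Score each k-mer of seq by its per-position counts in the sequence's PWM."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    # PWM: count of each amino acid at each of the k positions, over all k-mers.
    pwm = np.zeros((len(ALPHABET), k))
    for km in kmers:
        for pos, aa in enumerate(km):
            pwm[IDX[aa], pos] += 1
    # Weight of a k-mer: sum of its letters' PWM counts (one simple choice).
    return np.array([sum(pwm[IDX[aa], pos] for pos, aa in enumerate(km))
                     for km in kmers])
```

Unlike the k-mer spectrum, the length of this vector depends on the sequence length (one weight per k-mer), which is why PWM2Vec-based embeddings have the lowest dimensionality of the three methods.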
4.5. Limitation
Since RKS begins by generating a set of random features from a random distribution, the selection of these random features can be challenging. The distance matrix generated on line 4 of Algorithm 1 can be expensive in terms of memory consumption when the number of data points (i.e., biological sequences) in the input data is very large. For computing reconstruction errors, we used MSE, MAE, and EVS; a more comprehensive list of metrics could yield additional insights into the proposed method, which we will explore in future extensions. Moreover, since we obtained results only for datasets comprising protein sequences, the generalizability of the proposed method could not be fully tested. Hence, it is not clear how the proposed method will behave on other biological datasets, e.g., nucleotide or short-read sequence datasets.
5. Results and Discussion
The results for the approximate ISOMAP and the baseline t-SNE are reported in Table 4 for different embedding methods and datasets. For the Spike7k data, PWM2Vec with ISOMAP outperformed the other methods on all evaluation metrics (other than EVS) and embedding methods, including t-SNE. In the case of MSE, Spike2Vec with ISOMAP showed a $98.7\%$ improvement in performance compared with Spike2Vec with traditional t-SNE. A similar pattern was observed for MAE, EVS, and computational runtime. For the Coronavirus Host dataset, we again observed that all embedding methods with ISOMAP significantly outperformed the same embedding methods with traditional t-SNE (for all evaluation metrics other than EVS). In the case of MSE, Spike2Vec with ISOMAP achieved a $99.4\%$ improvement in performance compared with the same Spike2Vec embedding with traditional t-SNE. Similarly, for the Protein Subcellular dataset, we again observed that, apart from EVS, all embedding methods with ISOMAP outperformed the same embeddings with t-SNE; Spike2Vec with ISOMAP achieved an $86.8\%$ improvement in performance compared with the same Spike2Vec with traditional t-SNE. Note that to compute the performance gain, we used the following expression:
$$\mathrm{Gain}\left(\%\right)=\frac{Va{l}_{tSNE}-Va{l}_{ISOMAP}}{Va{l}_{tSNE}}\times 100$$
where $Va{l}_{tSNE}$ is the value (i.e., MSE, MAE, EVS, or runtime) computed from t-SNE, while $Va{l}_{ISOMAP}$ is the value computed from ISOMAP.
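For concreteness, the performance-gain computation used here can be written as a one-line helper; the lower-is-better convention (as for MSE, MAE, and runtime) is an assumption of this sketch.

```python
def performance_gain(val_tsne, val_isomap):
    """Percentage improvement of ISOMAP over t-SNE for a lower-is-better metric."""
    return (val_tsne - val_isomap) / val_tsne * 100.0
```

For example, an ISOMAP MSE of 0.013 against a t-SNE MSE of 1.0 corresponds to a 98.7% gain, matching the scale of the improvements reported above.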
Discussion
Since the approximate ISOMAP outperforms traditional t-SNE in terms of reconstruction error (which suggests that ISOMAP may be a more suitable choice when preserving the global data structure is the priority), we think that analyzing protein sequences using our approach, as an alternative to t-SNE, could help uncover patterns/clusters in the original data that may remain hidden or unexplored with traditional methods. More specifically, if a new coronavirus variant is emerging in the given data (i.e., a set of protein sequences), the traditional t-SNE method may fail to highlight a separate cluster for the new variant, whereas this is possible with the proposed method due to its better performance compared with t-SNE.
Embedding methods like Spike2Vec and Min2Vec produce sparse information, i.e., most of the bins in the spectrum may contain very small or zero values. These methods work by transforming complex data into lower-dimensional representations, and in this process, some intricate details of the data might be lost or underrepresented, leading to sparse representations. For example, in the context of Spike2Vec and Min2Vec, these methods operate on spectral data, where each bin in the spectrum represents certain features of the data; due to the nature of the data, many of these bins end up with very small or zero values. When using t-SNE on such sparse embeddings, several issues can arise. t-SNE excels at capturing and preserving local and global patterns within data, but its effectiveness relies on the distribution of data points in the embedded space, and sparse regions in the embeddings can pose challenges. Specifically, t-SNE might not allocate enough attention to these sparse areas, potentially leading to the omission of crucial patterns, outliers, or rare data instances. In essence, sparsity in the embeddings can cause t-SNE to overlook important data characteristics, resulting in an incomplete representation. In contrast, ISOMAP offers a different approach: instead of focusing on the values of individual data points, it prioritizes the relationships and distances between data points. It constructs the pairwise distances between data points and then maps these points to a lower-dimensional space while preserving these distances as accurately as possible. By maintaining the relative distances between data points, ISOMAP can capture the overall structure and relationships within the data, even if certain regions are sparse, and it does not risk overlooking important patterns that might be masked by sparsity.
For the computational runtime results in Table 4, since the dimensionality of the embedding is lowest for the PWM2Vec-based embeddings (i.e., it equals the number of k-mers within a protein sequence), its runtime is the shortest among the three embedding methods. Since the embedding dimensionalities for Spike2Vec and Min2Vec are similar, their computational runtimes are close to each other.
6. Conclusions
In this paper, we proposed a fast and efficient method for the visualization of high-dimensional data. The proposed method is based on the idea of approximating the traditional ISOMAP method using random kitchen sinks (RKS). We compared the performance of the approximate ISOMAP approach with RKS against that of the traditional t-SNE algorithm and evaluated their effectiveness using three different protein sequence datasets. Our findings, based on different evaluation metrics and embedding methods (including k-mers, minimizers, and the position weight matrix), demonstrate that the approximate ISOMAP approach using RKS offers promising results for dimensionality reduction. By leveraging RKS, we were able to reduce the computational complexity of ISOMAP while still obtaining meaningful representations of biological sequences. The comparative analysis with the traditional t-SNE method revealed that the approximate ISOMAP approach performed favorably in terms of capturing the intrinsic structure of the data while providing a faster runtime by approximating the geodesic distance computation. This suggests that the RKS-based method is versatile and can be applied to various domains dealing with high-dimensional data. The results of the proposed method can guide researchers and practitioners in selecting the most suitable dimensionality reduction method for their specific datasets and applications. Future work can explore further enhancements to the approximate ISOMAP approach using other methods, such as quantum kitchen sinks, Nystroem kernel approximation, and random Fourier features. Optimizing the hyperparameters or investigating performance on other types of high-dimensional data could also be areas of future investigation. RKS relies on random features, and techniques to reduce this randomness could be a potential extension of this work. Additionally, incorporating other evaluation metrics or conducting user studies to assess the visual interpretability of the low-dimensional embeddings could provide a more comprehensive understanding of the proposed approach.