Article

Assessing the Resilience of Machine Learning Classification Algorithms on SARS-CoV-2 Genome Sequences Generated with Long-Read Specific Errors

1
Department of Computer Science, Georgia State University, Atlanta, GA 30303, USA
2
IBM Research, IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, USA
*
Authors to whom correspondence should be addressed.
Biomolecules 2023, 13(6), 934; https://doi.org/10.3390/biom13060934
Submission received: 18 April 2023 / Revised: 19 May 2023 / Accepted: 31 May 2023 / Published: 2 June 2023

Abstract

The emergence of third-generation single-molecule sequencing (TGS) technology has revolutionized the generation of long reads, which are essential for genome assembly and have been widely employed in sequencing the SARS-CoV-2 virus during the COVID-19 pandemic. Although long-read sequencing has been crucial in understanding the evolution and transmission of the virus, the high error rate associated with these reads can lead to inadequate genome assembly and downstream biological interpretation. In this study, we evaluate the accuracy and robustness of machine learning (ML) models using six different embedding techniques on SARS-CoV-2 error-incorporated genome sequences. Our analysis includes two types of error-incorporated genome sequences: those generated using simulation tools to emulate error profiles of long-read sequencing platforms and those generated by introducing random errors. We show that the spaced k-mers embedding method achieves high accuracy in classifying error-free SARS-CoV-2 genome sequences, and the spaced k-mers and weighted k-mers embedding methods are highly accurate in predicting error-incorporated sequences. The fixed-length vectors generated by these methods contribute to the high accuracy achieved. Our study provides valuable insights for researchers to effectively evaluate ML models and gain a better understanding of the approach for accurate identification of critical SARS-CoV-2 genome sequences.

1. Introduction

During the COVID-19 pandemic, whole genome sequencing (WGS) of the SARS-CoV-2 virus has played a crucial role in unraveling important biological information. Phylogenetic analysis has revealed that SARS-CoV-2 shares 50% and 79% sequence similarity with MERS-CoV and SARS-CoV, respectively, indicating their evolutionary connections [1]. Notably, the genome sequence of SARS-CoV-2 exhibits an 85% similarity to a bat coronavirus, establishing its zoonotic origin within the Coronaviridae family and the Betacoronavirus genus [2]. These genomic data have been instrumental in confirming the virus’s source and classification. Researchers worldwide swiftly recognized the need to gather comprehensive genome information from diverse SARS-CoV-2 sequences and variants [3,4]. The Centers for Disease Control and Prevention’s Office of Advanced Molecular Detection (AMD) released details regarding SARS-CoV-2 whole genome sequencing on various platforms, including PacBio, Illumina, and Ion Torrent. The World Health Organization (WHO) emphasizes the importance of publicly accessible genome sequences and strongly supports their use in developing novel public health strategies and conducting research to combat the spread of COVID-19. A valuable resource in this endeavor is the Global Initiative on Sharing All Influenza Data (GISAID), which hosts one of the largest international databases of SARS-CoV-2 genome sequences [5]. Leveraging GISAID, along with the open-source tools NextStrain and NextClade, researchers have made significant advancements in their investigations [6,7]. These resources have proven instrumental in understanding the evolution and characteristics of the virus, aiding in the development of efficient strategies to mitigate the spread of COVID-19 [8,9,10].
Third-generation sequencing technology has emerged as a widely used method for sequencing SARS-CoV-2 during the pandemic. These technologies, known for their ability to generate long reads, are increasingly employed in transcriptomics studies. Advancements in long-read sequencing enable the comprehensive sequencing of RNA molecules, utilizing cDNA or direct RNA protocols from Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio) [11,12,13]. However, the high error rates associated with long-read technologies pose challenges for accurate and efficient downstream analysis, such as genome assembly. Insertions and deletions (indels) are the primary error types that complicate alignment processes. While various error correction tools exist, there remains a need for further development in this computational biology domain. To combat COVID-19 effectively and facilitate research, an increased number of SARS-CoV-2 genome sequences are required [14,15]. Researchers worldwide rely on third-generation sequencing technologies to sequence the virus, and virus tracking relies heavily on the resulting SARS-CoV-2 genomic sequences. To analyze genomic data effectively, scientists employ machine learning (ML) and deep learning (DL) algorithms along with embedding methods for classification purposes [16,17,18,19]. ML and DL algorithms have become valuable tools even for novice bioinformatics practitioners and core data analysts who may lack prior knowledge of sequencing technologies and associated challenges. These algorithms enable comprehensive analysis of SARS-CoV-2 sequencing data, contributing to advancements in classification techniques and aiding in our understanding of the virus’s genetic characteristics and behavior. Therefore, it is crucial to establish a robust benchmark report on SARS-CoV-2 genome sequences generated using third-generation sequencing technology, which will serve as a guide for future genomic research involving long-read sequencing.
The current study aims to evaluate the performance of current classification models in handling third-generation sequencer-specific errors present in SARS-CoV-2 genome sequences. Specifically, the study investigates the effectiveness of various embedding methods under specified levels of disturbance. The evaluation of machine learning models on SARS-CoV-2 genomic sequences remains limited, with only a few existing studies in this area. For instance, a previous study [20] conducted a benchmark of ML and DL models using different embedding methods for classifying SARS-CoV-2 genome sequences that included sequencer-specific errors. However, that study did not identify the best ML model for SARS-CoV-2 genome sequence classification. Following a similar approach, our current study focuses exclusively on SARS-CoV-2 genomes generated using long reads obtained from third-generation sequencing (TGS) technologies, such as PacBio and Nanopore, while also considering the possibility of random errors occurring by chance. To assess the effectiveness of machine learning algorithms on SARS-CoV-2 genome sequences, we conducted simulations that accounted for various error types. Our simulations employed two primary approaches: one involved generating SARS-CoV-2 genome sequences with platform-specific errors (PacBio or ONT), while the other introduced random errors. The workflow for these simulations is depicted in Figure 1. To analyze the SARS-CoV-2 sequences, we employed six distinct embedding methods: one-hot encoding (OHE), Wasserstein-distance-guided representation learning (WDGRL), string kernel, spaced k-mers, weighted k-mers, and weighted position weight matrix (PWM). Leveraging these embedding methods, we performed supervised analyses using a variety of linear and non-linear classifiers, considering both clean and error-incorporated SARS-CoV-2 sequences. This comprehensive methodology enabled us to evaluate the effectiveness of these methods in detecting errors and classifying sequences.
The remainder of this paper is organized as follows. Section 2 details the dataset statistics, the dataset generation methodology, and the embedding techniques used to convert SARS-CoV-2 genome sequences to fixed-length numerical representations. Our results for accuracy and robustness are reported in Section 3. Finally, the study concludes in Section 4.

2. Dataset and Methodology

This section is devoted to the elucidation of the datasets utilized in this study and the process through which the validation dataset incorporating long-read specific error models (PacBio and ONT) and random errors was generated (refer to Section 2.1). In addition, Section 2.2 provides a succinct overview of the different types of embedding methods employed. The methodology adopted for the development of machine learning classification algorithms and the computation of their accuracy and robustness is presented in Section 2.3. Finally, Section 2.4 expounds on the visualization of the high-dimensional SARS-CoV-2 sequencing data.

2.1. Dataset Generation

In this study, four distinct datasets were employed. One of these datasets encompasses all genomes from the Global Initiative on Sharing All Influenza Data (GISAID), which have been meticulously curated to ensure their accuracy. The remaining three datasets were derived from distinct error models. Specifically, two datasets were generated through the use of PacBio and ONT models, while the fourth dataset was produced via a random error model. A detailed exposition of the dataset properties and characteristics is provided in their corresponding sections, namely Section 2.1.1, Section 2.1.2, Section 2.1.3, and Section 2.1.4.

2.1.1. Dataset 1: High-Quality SARS-CoV-2 Genome Sequences

To create a dataset of high-sequencing quality SARS-CoV-2 whole genome sequences, we analyzed 8172 sequences from GISAID between September and December 2021. Our selection criteria focused on complete and high-coverage genome sequences to ensure the collection of high-quality genomes. We specifically limited our sequence collection to those obtained from the human host. Additionally, we gathered lineage information for the sequences, resulting in 41 unique Pango lineages within our dataset. For detailed sequence statistics, please refer to Table 1.

2.1.2. Dataset 2: SARS-CoV-2 Genome Sequences Generated from Long Reads Incorporating PacBio Sequencing Errors

To generate the second SARS-CoV-2 genome sequence dataset, we simulated long reads with Pacific Biosciences (PacBio) sequencing errors. This was accomplished using PBSIM, a tool specifically designed to simulate PacBio sequencing reads with varying error rates [21]. PBSIM can generate two types of reads associated with the PacBio sequencer: continuous long reads (CLR) and circular consensus sequencing (CCS) short reads. CCS reads generally exhibit lower error rates compared to CLR reads. PBSIM offers two simulation approaches: sampling-based and model-based simulation, which facilitate the generation of PacBio CCS and CLR reads. In the sampling-based simulation, PBSIM considers the quality and length of the input read set, while the model-based simulation incorporates a built-in error model.
In our study, we employed the model-based approach of PBSIM, utilizing the pbsim-v1.0.3 tool to simulate PacBio long reads with errors based on the genomic sequence of SARS-CoV-2. Subsequently, these erroneous long reads were aligned to the SARS-CoV-2 reference genome (GenBank accession number NC_045512.2) using Minimap2 v2.24 [22], and variants were called from the aligned reads using bcftools v1.6 [23]. This process resulted in the generation of a consensus sequence, which represents a SARS-CoV-2 genome sequence incorporating typical long-read errors. We generated simulated reads with errors at two distinct depths, namely 5× and 10×, from the sequences in Dataset 1, leading to the creation of Dataset 2.

2.1.3. Dataset 3: SARS-CoV-2 Genome Sequences Generated from Long Reads Incorporating Oxford Nanopore Technology (ONT) Sequencing Errors

The third dataset was generated using long reads simulated with an Oxford Nanopore Technologies (ONT) sequencing error profile. To simulate long reads with Nanopore sequencing errors, we employed the Badread software tool, known for incorporating realistic artifacts introduced by Nanopore sequencers, including chimeras, junk reads, glitches, and adapters [24]. Badread utilizes a gamma distribution for read length, allowing for user-specified mean and standard deviation parameters.
We utilized Badread v0.2.0 to simulate ONT long reads with errors based on the genome sequence of SARS-CoV-2. Following this, we aligned these error-prone long reads to the SARS-CoV-2 reference genome (GenBank accession number NC_045512.2) using Minimap2 v2.24 [22]. By using bcftools v1.6 [23] to call variants from the aligned reads, we obtained a consensus SARS-CoV-2 sequence that incorporated errors typically associated with Nanopore sequencing technologies. We generated erroneous simulated reads at two distinct depths: 5× and 10×. These steps were performed on Dataset 1, resulting in the creation of Dataset 3.

2.1.4. Dataset 4: SARS-CoV-2 Genome Sequences Generated from Long Reads Incorporating Random Errors

The fourth dataset of the SARS-CoV-2 genome sequence was generated using long reads with random errors. For this purpose, we utilized the random option in the Badread software tool, known for its ability to simulate long reads with various types of errors, including random errors [24].
By utilizing the random option available in Badread v0.2.0, we simulated long reads with random errors. Subsequently, these long reads were aligned to the SARS-CoV-2 reference genome (GenBank accession number NC_045512.2) using Minimap2 v2.24 [22]. We then performed variant calling using bcftools v1.6 [23] and derived a consensus SARS-CoV-2 sequence incorporating the variants introduced by the random errors. We repeated this procedure at two different read depths, 5× and 10×, in line with the approach taken for Datasets 2 and 3. These steps were executed on Dataset 1, resulting in the creation of Dataset 4.

2.2. Embedding Generation Methods

This section describes the methods used to analyze the datasets introduced in the previous section. Six distinct embedding methods, namely one-hot encoding (OHE), Wasserstein-distance-guided representation learning (WDGRL), string kernel, spaced k-mers, weighted k-mers, and weighted position weight matrix (PWM), were implemented to transform the sequences into machine-readable numerical embeddings (also known as feature vectors). The specifics of each method are given in Section 2.2.1, Section 2.2.2, Section 2.2.3, Section 2.2.4, Section 2.2.5, and Section 2.2.6, respectively.

2.2.1. One-Hot Encoding (OHE)

One-hot encoding (OHE) is a common method for generating a numerical embedding from a nucleotide sequence. OHE represents each nucleotide in the sequence as a binary (0–1) vector; in this case, the nucleotides are A, T, C, and G [17,18]. To illustrate this mathematically, consider a function $f$ that maps each nucleotide to its one-hot encoding vector:
$$f(\sigma) = (\text{one-hot encoded vector}) \tag{1}$$
In Equation (1), $\sigma$ is one of the nucleotides in the alphabet $\Sigma = \{A, T, C, G\}$. For instance,
$$f(A) = (1, 0, 0, 0), \quad f(C) = (0, 1, 0, 0), \quad f(G) = (0, 0, 1, 0), \quad f(T) = (0, 0, 0, 1)$$
We then concatenate the one-hot encoded vectors that $f$ produces for the individual nucleotides of a DNA sequence to obtain the final embedding vector.
Concretely, a DNA sequence $X$ of length $n$ is represented by the binary vector $\phi_x$ in Equation (2), obtained by concatenating the outputs of $f$ for every position of $X$:
$$\phi_x = \left( f(X_1), f(X_2), \ldots, f(X_n) \right) \tag{2}$$
In the above equation, $X_i$ is the nucleotide in DNA sequence $X$ at position $i$. The dimension of $\phi_x$ in this instance is $|\Sigma| \times n$, where $|\Sigma|$ is the size of the nucleotide alphabet, i.e., 4 in this case.
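As an illustration of Equations (1) and (2), the following minimal Python sketch concatenates per-nucleotide one-hot vectors. The function name `one_hot_encode` and the handling of ambiguous bases (mapped to an all-zero vector) are assumptions made for this example and are not taken from the study's implementation.

```python
# Per-nucleotide vectors, matching the examples given for f in the text.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def one_hot_encode(sequence):
    """Concatenate f(X_1), ..., f(X_n) into one flat binary vector."""
    vector = []
    for nucleotide in sequence.upper():
        # Assumption: any symbol outside {A, C, G, T} (e.g. N) maps to zeros.
        vector.extend(ONE_HOT.get(nucleotide, (0, 0, 0, 0)))
    return vector


phi = one_hot_encode("ACGT")
print(len(phi))  # 16, i.e. |Sigma| * n = 4 * 4
```

Because the vector length grows as $4n$, sequences of different lengths would have to be padded or truncated to a common length before being passed to a classifier.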

2.2.2. Wasserstein-Distance-Guided Representation Learning (WDGRL)

Wasserstein-distance-guided representation learning (WDGRL) is an unsupervised, adversarially trained method intended to generate a low-dimensional embedding [25]. It accomplishes this by extracting features from the input data using a neural network-based model that takes advantage of the source and encoded target data distribution. The model determines the Wasserstein distance (WD) between the original high-dimensional vector and the low-dimensional representation. WDGRL takes one-hot encoded (OHE) vectors as input and generates a very-low-dimensional representation that consists solely of essential features.
Let the input data $D$ to the neural network model $M_\theta$ be a set $X$ of one-hot encoded (OHE) SARS-CoV-2 genome sequences. The model $M_\theta$ with parameters $\theta$ maps every OHE vector $X_i$ from $X$ to a low-dimensional representation $h_i$ in $\mathbb{R}^d$. During this process, $M_\theta$ learns to generate an $h_i$ that captures the essential features of $X_i$ by considering the encoded distributions of the source and the target data.
Suppose the distributions of the encoded representations for the source and target data are $P_s(h)$ and $P_t(h)$. Here, $P_s(h)$ and $P_t(h)$ can be estimated by a density estimation method. The loss function for WDGRL can be written as
$$L(\theta) = \max_{f} \left( \mathbb{E}_{h_s \sim P_s}\left[ f(h_s) \right] - \mathbb{E}_{h_t \sim P_t}\left[ f(h_t) \right] \right) \tag{3}$$
In Equation (3), $f$ is a 1-Lipschitz function used to approximate the Wasserstein distance between the distributions $P_s(h)$ and $P_t(h)$. The function $f(h)$ can be parameterized by another neural network $D_\alpha$, which takes the encoded representation $h_i$ as input and outputs a scalar value. The loss function is optimized by minimizing the negative of its value with respect to the parameters $\theta$ and $\alpha$:
$$L(\theta, \alpha) = -L(\theta) = \min_{\theta, \alpha} \, -\left( \mathbb{E}_{h_s \sim P_s}\left[ D_\alpha(h_s) \right] - \mathbb{E}_{h_t \sim P_t}\left[ D_\alpha(h_t) \right] \right) \tag{4}$$
The neural networks $M_\theta$ and $D_\alpha$ are jointly trained to minimize the WDGRL loss function using gradient descent or other optimization methods. The resulting low-dimensional representation captures important features of the input data that are useful for downstream tasks, such as classification.
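To make the quantity inside the WDGRL loss concrete, the sketch below evaluates the empirical difference of critic means, $\mathbb{E}[f(h_s)] - \mathbb{E}[f(h_t)]$, for one fixed critic. It is only an illustration: the helper names, the toy two-dimensional representations, and the unit-norm linear critic are all hypothetical, and the optimization over the critic and encoder parameters that actual training requires is not shown.

```python
def wasserstein_estimate(critic, source_reps, target_reps):
    """Empirical value of E[f(h_s)] - E[f(h_t)] for a fixed critic f."""
    mean_source = sum(critic(h) for h in source_reps) / len(source_reps)
    mean_target = sum(critic(h) for h in target_reps) / len(target_reps)
    return mean_source - mean_target


def linear_critic(h):
    # Hypothetical critic: a linear map whose weight vector (0.6, 0.8) has
    # Euclidean norm 1, so the map is 1-Lipschitz.
    return 0.6 * h[0] + 0.8 * h[1]


# Toy 2-D encoded representations (illustrative values only).
source = [(1.0, 1.0), (3.0, 1.0)]
target = [(0.0, 0.0), (0.0, 2.0)]

gap = wasserstein_estimate(linear_critic, source, target)
print(round(gap, 6))
```

In training, the critic parameters would be adjusted to make this gap as large as possible while the encoder is updated in the opposite direction; the snippet only shows how a single evaluation of the objective is formed.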

2.2.3. String Kernel

The string kernel method operates in the non-Euclidean space and measures the similarity between the SARS-CoV-2 genome sequences by computing a kernel matrix, also known as a Gram matrix [26].
Given a set of SARS-CoV-2 genome sequences $X = \{X_1, X_2, X_3, \ldots, X_n\}$, the string kernel method first computes the k-mers of length k for each genome sequence; for the current scenario, k is 3. The term k-mer refers to a substring of length k that occurs in the SARS-CoV-2 genome sequence.
Let $M$ be the matrix of all possible k-mers for a SARS-CoV-2 genome sequence $j$, where $M_{ij}$ is the number of times the k-mer $i$ occurs in the sequence $j$. Then, the similarity between two SARS-CoV-2 genome sequences $X_i$ and $X_j$ can be computed using the kernel function in Equation (5).
$$K(X_i, X_j) = \sum_{i \in X_i} \sum_{j \in X_j} M_{ik} M_{jk} \tag{5}$$
Here, $M_{ik}$ and $M_{jk}$ are the number of occurrences of k-mer $i$ and $j$ in sequences $X_i$ and $X_j$, respectively.
To reduce the computational complexity, the string kernel method uses a locality-sensitive hashing-based approach to estimate the k-mers of two sequences at distance m from each other. This approach hashes k-mers into bins considering their locality and further uses the bin information to estimate the matching k-mers between the two SARS-CoV-2 sequences. The resulting kernel matrix $K$ from Equation (5) is a symmetric matrix, and $K_{ij}$ is the kernel value between the genome sequences $X_i$ and $X_j$.
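For intuition about the kernel matrix, the sketch below builds an exact Gram matrix from k-mer count vectors. Note that this swaps in a plain spectrum-style kernel (dot products of count vectors) and does not reproduce the hashing-based approximation described above; the function names are hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence, k=3, alphabet="ACGT"):
    """Count vector over all |alphabet|^k possible k-mers, in a fixed order."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return [counts["".join(p)] for p in product(alphabet, repeat=k)]


def gram_matrix(sequences, k=3):
    """K[i][j] is the dot product of the k-mer count vectors of i and j."""
    rows = [kmer_counts(s, k) for s in sequences]
    return [[sum(a * b for a, b in zip(u, v)) for v in rows] for u in rows]


# Toy sequences (illustrative only; real genomes are about 30,000 nt long).
K = gram_matrix(["ACGTACGT", "ACGTTTTT", "GGGGGGGG"])
for row in K:
    print(row)
```

The matrix is symmetric by construction, and sequences that share no 3-mers receive a kernel value of zero. Exact computation scales quadratically in the number of sequences, which is the cost the approximate method is meant to reduce.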
To generate a low-dimensional representation of the input genome sequences, we performed kernel principal component analysis (PCA) on the kernel matrix K. Kernel PCA is a nonlinear dimensionality reduction technique that maps the input data to a new space defined by the eigenvectors of the kernel matrix. The top components are further considered as the reduced dimensional feature vector for the downstream analysis.
Suppose that, for the kernel matrix $K$, $V$ is the matrix of eigenvectors and $\Lambda$ is the diagonal matrix of the corresponding eigenvalues. Then, the $k$-th of the top principal components is given by:
$$\mathrm{PC}_k = (\lambda_k)^{1/2} \, V[:, k]$$
where V[:,k] is the k-th column of the matrix V and λ k is the k-th diagonal element of Lambda. For this analysis, the top 500 principal components are selected by considering a standard validation approach, and these components are used as the final feature vector for each SARS-CoV-2 genome sequence for downstream tasks such as classification.

2.2.4. Spaced k-Mers

The spaced k-mers method is used to reduce the sparsity and size of k-mers (nucleotide substrings of length k) in the SARS-CoV-2 genome sequence [19]. Given a SARS-CoV-2 genome sequence S, the spaced k-mers method first computes g-mers. Here, g-mers are nucleotide subsequences of length g (where g is an integer > 1). From those g-mers, the method then computes k-mers, where k < g. While generating k-mers from g-mers, the method skips some of the characters (nucleotides) between adjacent g-mers. The size of the gap between adjacent g-mers is determined by g-k.
Formally, suppose S is a SARS-CoV-2 genome sequence of length L. Then, the definition of g-mers is as follows:
$$G = \{ S_{i:i+g-1} \mid i = 1, 2, \ldots, L - g + 1 \}$$
(Here, $S_{i:i+g-1}$ represents the genomic subsequence of $S$ starting at position $i$ and ending at position $i + g - 1$.)
From the set of g-mers computed above, the method computes the set of k-mers as follows:
$$K = \{ S_{i:i+k-1} \mid i = 1, 2, \ldots, g - k + 1 \}$$
(Here, $S_{i:i+k-1}$ represents the genomic subsequence of $S$ starting at position $i$ and ending at position $i + k - 1$.)
After computing the k-mers, the method generates a numerical vector of length $|\Sigma|^k$ (where $\Sigma$ corresponds to the alphabet A, C, G, T).
In our case, the spaced k-mer method considers k = 4 and g = 9, which is determined using a standard validation approach. This means we compute k-mers of length 4 from SARS-CoV-2 genomic subsequences of length 9, with a gap of 5 nucleotides between adjacent subsequences.
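The sketch below shows one possible reading of this procedure: slide a window of length g over the sequence, keep the first k nucleotides of each g-mer, skip the remaining g − k, and count the kept k-mers in a vector of length $|\Sigma|^k$. This interpretation and the function name `spaced_kmer_vector` are assumptions; the published implementation may select positions within each g-mer differently.

```python
from itertools import product


def spaced_kmer_vector(sequence, k=4, g=9, alphabet="ACGT"):
    """Count the leading k-mer of every sliding g-mer (assumed reading)."""
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    vector = [0] * len(index)  # fixed length |alphabet|^k
    for start in range(len(sequence) - g + 1):
        gmer = sequence[start:start + g]
        kmer = gmer[:k]  # keep the first k symbols, skip the other g - k
        if kmer in index:  # ignore k-mers containing ambiguous bases
            vector[index[kmer]] += 1
    return vector


vec = spaced_kmer_vector("ACGTACGTACGT")
print(len(vec), sum(vec))  # 256 entries; 4 g-mers in a 12-nt sequence
```

Whatever the exact position-selection rule, the output length depends only on k and the alphabet size, which is what makes the embedding fixed-length across genomes of different lengths.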

2.2.5. Weighted k-Mers

The weighted k-mers-based spectrum method is used to denote biological sequences as fixed-length vectors that capture the occurrences of all possible k-mers (k represents the length of the subsequences). The method allocates weights to the k-mers that are calculated based on their inverse document frequency (IDF), which determines how uncommon a particular k-mer is across all sequences.
Formally, to compute the IDF weights, we first calculate the total number of input SARS-CoV-2 genome sequences, N, and the number of input genome sequences that contain a specific k-mer, n i . The IDF weight for the k-mer i is then provided by:
IDF ( i ) = log ( N / n i )
Then, we generate a list of all possible k-mers based on the nucleotide set A, C, G, T of the input SARS-CoV-2 genome sequences. For each SARS-CoV-2 genome sequence, we calculate the frequency of each k-mer and multiply it by the corresponding IDF weight to obtain a weighted frequency, which is given by:
w ( i , j ) = f ( i , j ) * IDF ( i )
Here, w ( i , j ) is the weighted frequency of k-mer i in SARS-CoV-2 genome sequence j, f ( i , j ) is the frequency of k-mer i in SARS-CoV-2 genome sequence j, and IDF ( i ) is the IDF weight of k-mer i.
The above weighted frequency values are then used to construct a frequency vector, where each element represents the frequency of a particular k-mer in the sequence. The frequency vector for sequence j is given by:
$$v(j) = \left[ w(1, j), w(2, j), \ldots, w(m, j) \right]$$
where m is the total number of possible k-mers. In our experiment, we considered k = 3, which is selected by a standard validation approach.
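The IDF weighting above can be sketched in a few lines of standard-library Python. The natural logarithm and the choice to give weight 0 to k-mers that occur in no sequence (where $n_i = 0$ would otherwise divide by zero) are assumptions of this example, as is the function name.

```python
import math
from collections import Counter
from itertools import product


def idf_weighted_kmer_vectors(sequences, k=3, alphabet="ACGT"):
    """Return one vector v(j) = [w(1, j), ..., w(m, j)] per input sequence."""
    all_kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = [
        Counter(s[i:i + k] for i in range(len(s) - k + 1)) for s in sequences
    ]
    n_total = len(sequences)  # N in the IDF formula
    idf = {}
    for kmer in all_kmers:
        n_i = sum(1 for c in counts if c[kmer] > 0)
        # Assumption: k-mers absent from every sequence get weight 0.
        idf[kmer] = math.log(n_total / n_i) if n_i else 0.0
    # w(i, j) = f(i, j) * IDF(i)
    return [[c[kmer] * idf[kmer] for kmer in all_kmers] for c in counts]


vectors = idf_weighted_kmer_vectors(["AAAA", "AAAC"])
print(len(vectors[0]))  # 64 = 4^3 possible 3-mers
```

A k-mer present in every sequence receives weight log(1) = 0, so with this weighting only k-mers that distinguish some sequences from others contribute to the vector.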

2.2.6. The Weighted Position Weight Matrix (PWM)

The weighted position weight matrix (PWM) technique generates PWM scores for all possible k-mers (k is the length of subsequences) in SARS-CoV-2 genome sequences [27]. The PWM is produced by a two-step method: In the first step, the method calculates the occurrence of each base at every position of the k-mers. In the second step, it computes a weighted score for each k-mer considering the log-odds ratio of its observed frequency compared to the background frequency. The background frequency is estimated using the Laplace pseudocount and the equal probability assumption for each nucleotide.
The formula for computing the PWM is as follows:
$$\mathrm{PWM}(k\text{-mer}) = \sum_{i=1}^{K} \log_2 \frac{f(k\text{-mer}_i, b)}{b_i}$$
where $k\text{-mer}$ is the k-mer sequence, $K$ is the length of the k-mer, $f(k\text{-mer}_i, b)$ is the count of the nucleotide $b$ at position $i$ in the k-mer, $b_i$ is the background frequency of the nucleotide $b$, and $\log_2$ is the base-2 logarithm function.
The final output is a list of scores for each k-mer in each input sequence, where the score is the sum of the weight scores for each base in the k-mer. For the current experiment, k = 3 was selected using the standard validation set approach.
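A minimal sketch of this two-step scoring is given below, under stated assumptions: position-specific counts are taken over all k-mers of a single sequence, a Laplace pseudocount of 1 is added to every count, counts are normalized to frequencies, and the background frequency is 1/4 per nucleotide. The function name `pwm_scores` and these normalization details are illustrative choices rather than the authors' exact procedure.

```python
import math


def pwm_scores(sequence, k=3, alphabet="ACGT", pseudocount=1.0):
    """Score each distinct k-mer as a sum of per-position log2 odds."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

    # Step 1: count each base at every k-mer position (with pseudocounts).
    counts = [{b: pseudocount for b in alphabet} for _ in range(k)]
    for kmer in kmers:
        for pos, base in enumerate(kmer):
            if base in counts[pos]:
                counts[pos][base] += 1

    # Step 2: log-odds of observed frequency against a uniform background.
    background = 1.0 / len(alphabet)
    scores = {}
    for kmer in set(kmers):
        if not all(b in alphabet for b in kmer):
            continue  # skip k-mers with ambiguous bases
        score = 0.0
        for pos, base in enumerate(kmer):
            total = sum(counts[pos].values())
            score += math.log2(counts[pos][base] / total / background)
        scores[kmer] = score
    return scores


scores = pwm_scores("AAAA")
print(scores)
```

In this toy input, A has frequency (2 + 1) / (2 + 4) = 0.5 at every position, so each position contributes log2(0.5 / 0.25) = 1 to the score of the 3-mer AAA.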

2.3. Machine Learning Classification Algorithms: SVM, NB, MLP, KNN, RF, LR, and DT

To perform the classification task, we utilize seven machine learning algorithms: Support Vector Machine (SVM), Naïve Bayes (NB), Multi-Layer Perceptron (MLP), K-Nearest Neighbors (KNN), Random Forest (RF), Logistic Regression (LR), and Decision Tree (DT). Our objective is to evaluate the performance of these algorithms by employing two different approaches.

2.3.1. Approach 1: Accuracy

Here, we compute the average accuracy, precision, recall, F1 (weighted), F1 (Macro), and ROC-AUC for the entire dataset, including all the class labels mentioned in Table 1. We exclude error sequences from the dataset.

2.3.2. Approach 2: Robustness

Robustness is crucial to machine learning models. It represents their ability to generate reasonable outputs for input examples not included in the training data. As our test set, we consider only PacBio, ONT, and random protocol-specific errors incorporating noisy examples, whereas the training set uses non-errored sequences. We then calculate the average accuracy, precision, recall, F1 (weighted), F1 (Macro), and ROC-AUC for the ML models based on the test set. Overall, these two strategies provide comprehensive evaluations of machine learning algorithms’ performance. This allows us to compare and identify the most-suitable algorithm for our classification task.
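The metrics used in both approaches can be computed from true and predicted lineage labels as sketched below. This is a plain-Python illustration of accuracy, macro-averaged F1, and support-weighted F1; the function names and the toy labels are hypothetical, and ROC-AUC is omitted because it requires class probabilities.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of sequences whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_f1(y_true, y_pred):
    """F1 score of every label that appears in y_true or y_pred."""
    scores = {}
    pairs = list(zip(y_true, y_pred))
    for label in set(y_true) | set(y_pred):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 scores weighted by each class's true support."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[c] * support[c] for c in scores) / len(y_true)


# Toy labels standing in for lineages (illustrative only).
y_true = ["A", "A", "B", "C"]
y_pred = ["A", "B", "B", "C"]
print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred), weighted_f1(y_true, y_pred))
```

For the robustness setting, `y_pred` would come from a model fitted on clean sequences and applied to the error-incorporated test set; with many unevenly sized lineages, macro F1 and weighted F1 can differ noticeably, which is why both are reported.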

2.4. Data Visualization

In order to ascertain if there exists any inherent clustering in our dataset, we employ the t-distributed stochastic neighbor embedding (t-SNE) approach to produce a two-dimensional representation of the feature embeddings [28].

3. Results and Discussion

This section provides an overview of the outcomes achieved by our methods on the datasets employed in this study. The first subsection, labeled Section 3.1, discusses the accuracy evaluation of machine learning classification algorithms that utilized various embedding methods. The second subsection, labeled Section 3.2, covers the robustness evaluation of machine learning classification algorithms that used different embedding methods. The third subsection, labeled Section 3.3, focuses on the comparison of predictive performance of machine learning models on SARS-CoV-2 sequences with errors obtained from PacBio and ONT sequencers. Lastly, Section 3.4 explores the analysis of coronavirus variants using various embedding vector generation methods with the aid of t-SNE visualization.

3.1. Accuracy Evaluation of Machine Learning Classification Algorithms Using Different Embedding Methods

We considered 8172 clean (error-free) full-length SARS-CoV-2 nucleotide sequences from the GISAID database. These sequences were used to evaluate the machine learning models with embedding methods. To do so, we split the sequences into training and test sets with a 70/30 ratio. We then executed each analysis five times and considered the average results, reported in Table 2 and Figure 2. The results show that the performance of the machine learning classification algorithms varies considerably depending on the embedding method employed. Specifically, the one-hot embedding method leads to an accuracy of 0.773 for the SVM algorithm, whereas the WDGRL embedding method only results in an accuracy of 0.327. The spaced k-mers embedding method with the SVM, RF, LR, and DT classification algorithms achieves an accuracy of up to 0.956. This method employs g-mers and k-mers to decrease the sparsity and size of k-mers in the genome sequence. As a result, it generates fixed-length vectors that capture the occurrences of all possible k-mers, which are then used to construct frequency vectors representing the frequency of each k-mer in the sequence. This method performs well with the error-free set of SARS-CoV-2 genome sequences. However, the NB algorithm yields the worst results, with an accuracy of only 0.017 when the weighted k-mers embedding method is used. Additionally, some algorithms, such as SVM and LR, have significantly longer training times compared to others. Thus, when selecting an algorithm and embedding method, one should consider both performance and training time.

3.2. Robustness Evaluation of Machine Learning Classification Algorithms Using Different Embedding Methods

We considered 8172 clean SARS-CoV-2 sequences and incorporated errors specific to PacBio, ONT, and the random protocol, as described in the methods section. This approach helped to generate three different types of datasets: genome sequences with typical PacBio sequencing errors, ONT sequencing errors, and random errors. To evaluate the robustness of the machine learning models with embedding methods on the three different datasets, we train the models with clean SARS-CoV-2 sequences and test them on error-incorporated sequences.

3.2.1. The Robustness Results for PacBio Sequencing Error-Incorporated Datasets

Table 3 displays the accuracy values of various machine learning classification algorithms that used different embedding methods on SARS-CoV-2 genome sequence datasets simulated at two different depths, 5× and 10×, with PacBio sequencer-specific errors incorporated. Furthermore, Figure 3 reveals that the accuracy values for machine learning algorithms ranged from 0.001 to 0.276 across all embedding methods. The spaced k-mers embedding method, in general, performed better than other embedding methods, achieving the highest accuracy value of 0.276 for the maximum number of algorithms for the depth-5× sequencing dataset, and a similar trend was observed for the depth-10× dataset. The reason for this is that the spaced k-mers method employs g-mers and k-mers to decrease the sparsity and size of k-mers in the genome sequence. As a result, it generates fixed-length vectors that capture the occurrences of all possible k-mers, which are then used to construct frequency vectors representing the frequency of each k-mer in the sequence. The accuracy results are consistent with the expectation that, as the depth decreases, the error rate increases, resulting in a decrease in the performance of machine learning models. However, the models’ performance did not improve significantly when the sequencing depth was increased from 5× to 10× between the two SARS-CoV-2 genome sequence datasets.

3.2.2. The Robustness Results for Oxford Nanopore Technologies (ONT) Sequencing Error-Incorporated Datasets

Table 4 reports the accuracy values obtained from the machine learning algorithms with each embedding method on two SARS-CoV-2 genome sequence datasets, at depths of 5 and 10, which were generated from long reads containing Oxford Nanopore Technologies (ONT) sequencer-specific errors. Figure 4 presents a heatmap of the accuracy values, which ranged from 0.001 to 0.276. The weighted k-mers embedding method gave the highest accuracy values for the majority of the machine learning algorithms on both datasets, i.e., at depths of 5 and 10. Under the weighted k-mers technique, each k-mer is given a weight based on its inverse document frequency, and the method generates fixed-length vectors that cover all possible k-mers. These vectors are then used to create frequency vectors that indicate the frequency of each k-mer in the sequence. However, the low sequencing depth combined with ONT sequencer-specific errors produced poor-quality SARS-CoV-2 genome sequences, leading to a significant decrease in the predictive performance of the machine learning algorithms.
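The inverse-document-frequency weighting can be sketched as follows. The section does not state the exact weighting formula, so the plain log(N / df) form, the helper names, and the toy sequences are assumptions for illustration only.

```python
import math
from collections import Counter

def kmers(seq, k):
    """All overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def idf_weights(seqs, k):
    """Assumed weighting: idf(kmer) = log(N / number of sequences containing it)."""
    n = len(seqs)
    doc_freq = Counter()
    for s in seqs:
        doc_freq.update(set(kmers(s, k)))  # count each k-mer once per sequence
    return {km: math.log(n / df) for km, df in doc_freq.items()}

def weighted_kmer_vector(seq, k, weights, vocab):
    """k-mer counts scaled by their IDF weight, in a fixed vocabulary order."""
    counts = Counter(kmers(seq, k))
    return [counts[km] * weights.get(km, 0.0) for km in vocab]

# Hypothetical toy corpus.
seqs = ["AAC", "AAG"]
weights = idf_weights(seqs, k=2)
vocab = sorted(weights)  # ['AA', 'AC', 'AG']
vec = weighted_kmer_vector("AAC", 2, weights, vocab)
print(vocab, vec)
```

In this toy corpus the k-mer "AA" occurs in every sequence and receives weight zero, while "AC" and "AG" each occur in one sequence and are weighted up, which is the usual effect of IDF weighting.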

3.2.3. The Robustness Results for Random-Error-Incorporated Datasets

In this section, we evaluated the accuracy of the machine learning algorithms with each embedding method on two SARS-CoV-2 genome sequence datasets. These datasets were generated by incorporating random errors into long reads at depths of 5 and 10. The results, presented in Table 5 and Figure 5, indicate that the weighted k-mers method achieved the highest accuracy of 0.276 for the majority of machine learning classification algorithms on both datasets. The main objective of incorporating random errors into the SARS-CoV-2 datasets was to compare the performance of machine learning models on datasets generated with different types of errors, namely sequencer-specific errors and random errors. Interestingly, we found little difference in accuracy between these two types of errors.
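The random-error protocol itself is defined in the methods section, not here. The sketch below shows only one plausible reading of it: uniform random substitutions at a fixed per-base rate. The function name, the rate parameter, the seed handling, and the substitution-only error model are all assumptions for illustration.

```python
import random

def add_random_errors(seq, rate, seed=0, alphabet="ACGT"):
    """Substitute each base with a different base with probability `rate`.

    Assumption for illustration: errors are substitutions only; insertions
    and deletions are not modeled, so the sequence length is unchanged.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in alphabet if b != base]))
        else:
            out.append(base)
    return "".join(out)

clean = "ACGT" * 5
unchanged = add_random_errors(clean, rate=0.0)
all_changed = add_random_errors(clean, rate=1.0)
print(unchanged)
print(all_changed)
```

A rate of 0 returns the input unchanged and a rate of 1 alters every position, so intermediate rates give a controllable amount of noise for robustness tests.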

3.3. Comparison of Predictive Performance of Machine Learning Models on SARS-CoV-2 Sequences with Errors from PacBio and ONT Sequencers

Third-generation sequencing (TGS) technologies such as PacBio and Oxford Nanopore Technologies (ONT) are widely used for generating long reads, which come with high error rates. However, PacBio technology sequences a DNA molecule multiple times, whereas ONT sequences it only twice, so PacBio generates higher-quality data with lower error rates than ONT. Through our analysis, we found that the errors specific to the PacBio sequencer have a more significant impact on the predictive performance of machine learning (ML) models on SARS-CoV-2 sequences than the errors specific to ONT. The predictive performance of our ML models indicated that PacBio sequences have a lower error rate than ONT sequences, but that the low predictive power was due to low coverage. We also compared the predictive performance of ML models on SARS-CoV-2 sequences incorporated with random errors against the other datasets and observed that the results were similar to the ONT scenario.
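To put the recurring accuracy value of 0.276 in context, it helps to compute what a classifier would score if it assigned every sequence to a single lineage. The sketch below does this using the lineage counts from Table 1 (2253 AY.103 and 1407 AY.44 sequences out of 8172, across 41 lineages). This is a derived observation, not a claim made in the source, and the function name is hypothetical.

```python
def constant_prediction_metrics(class_count, total, n_classes):
    """Metrics of a classifier that always predicts one class.

    class_count: number of test sequences in the predicted class
    total:       number of test sequences
    n_classes:   number of distinct classes
    """
    p = class_count / total        # accuracy = weighted recall
    f1_class = 2 * p / (1 + p)     # F1 of the predicted class (precision p, recall 1)
    return {
        "accuracy": round(p, 3),
        "precision_weighted": round(p * p, 3),  # all other classes score 0
        "f1_weighted": round(p * f1_class, 3),
        "f1_macro": round(f1_class / n_classes, 3),
    }

# Counts taken from Table 1.
always_ay103 = constant_prediction_metrics(2253, 8172, 41)
always_ay44 = constant_prediction_metrics(1407, 8172, 41)
print(always_ay103)
print(always_ay44)
```

The first result reproduces the 0.276 / 0.076 / 0.119 / 0.011 pattern and the second the 0.172 / 0.030 / 0.051 / 0.007 pattern that recur in Table 3. Together with ROC-AUC values of 0.500, this is consistent with models that collapse to predicting one lineage on the error-incorporated sequences.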

3.4. Analysis of Coronavirus Variants Based on Different Embedding Vector Generation Methods Using t-SNE Visualization

The t-distributed stochastic neighbor embedding (t-SNE) method is a widely used data visualization technique that aims to preserve pairwise distance information, particularly among neighboring points, when high-dimensional vectors are mapped to a lower-dimensional space. In this study, we employed t-SNE to visualize the clustering patterns of the coronavirus variants under the different embedding vector generation methods: one-hot encoding (OHE), Wasserstein-distance-guided representation learning (WDGRL), string kernel, spaced k-mers, weighted k-mers, and weighted position weight matrix (PWM). As depicted in Figure 6, t-SNE captured the pairwise distance information and revealed distinct grouping patterns of the coronavirus variants in a two-dimensional space. Specifically, the t-SNE plot based on the OHE vector showed that AY.44 variants were more clearly grouped than the other variants, while the WDGRL vector maintained a smaller group of variants than the OHE vector. The string-kernel-based t-SNE plot exhibited clearer grouping patterns of AY.44 and other variants than the OHE vector. The spaced k-mers vector showed a more distinct grouping of variants than the other embedding vector generation methods. The weighted k-mers vector exhibited grouping of the variants similar to the WDGRL vector, whereas the weighted PWM vector showed grouping patterns more similar to the string kernel vector.
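For readers unfamiliar with how t-SNE uses pairwise distances, the standard formulation from the t-SNE literature is summarized below. The notation is not from this article: $x_i$ denotes a high-dimensional embedding vector, $y_i$ its two-dimensional image, $n$ the number of sequences, and $\sigma_i$ a per-point bandwidth set by the perplexity parameter.

```latex
% Similarity of x_j to x_i in the high-dimensional space (Gaussian kernel)
p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}

% Similarity of y_i and y_j in the low-dimensional map (Student-t kernel)
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

% Objective minimized over the map points y_1, ..., y_n
C = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```

Because the objective penalizes placing similar points far apart much more than placing dissimilar points close together, the resulting plots are most reliable for local groupings; distances between well-separated clusters should be read with caution.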

4. Conclusions

In summary, the COVID-19 pandemic has emphasized the importance of transitioning from second-generation to third-generation sequencing technology. Long-read sequencing has emerged as a critical tool for unraveling various genomic features of the SARS-CoV-2 virus. With the ability to read longer DNA fragments, ranging from 5000 to 30,000 base pairs, long-read sequencing addresses a major challenge faced by short-read sequencing methods. This extended read length has enabled researchers to detect complex structural variations, including large insertions/deletions, inversions, repeats, duplications, and translocations. Long-read sequencing has also facilitated the phasing of SNPs into haplotypes and de novo genome assembly. However, it is important to acknowledge that the high error rate associated with long-read sequencing may impact the interpretation of SARS-CoV-2’s biology.
In this study, we have demonstrated that the accuracy of machine learning classification algorithms in analyzing SARS-CoV-2 genome sequences greatly depends on the selection of appropriate embedding methods. Our analysis of simulated SARS-CoV-2 viral sequences underscores the value of employing robust embedding techniques that can tolerate errors and accurately categorize genome sequences under both long-read sequencer-specific errors and random errors. Specifically, consistent with the results in Section 3.2, the spaced k-mers and weighted k-mers embedding methods performed best when classifying error-incorporated sequences. These findings highlight the potential of machine learning in analyzing SARS-CoV-2 genomic data, contributing to a deeper understanding of the virus’s evolution and spread.
In the future, we want to explore more sequence embedding and advanced deep learning methods on SARS-CoV-2 genomic sequences generated at different long-read sequencing depths with third-generation sequencer-specific errors. These experiments will help us develop robust models and improve our ability to adapt long-read sequencing technology (PacBio and ONT) to produce error-free SARS-CoV-2 genome sequences, so that critical biological questions can be understood and answered.

Author Contributions

B.S. prepared the datasets, analyzed the results, and wrote the manuscript; S.A. performed the ML analysis, selected the embedding methods, and wrote the manuscript; P.-Y.C. analyzed the results; M.P. analyzed the results and wrote the manuscript; A.Z. supervised the project. All authors have read and agreed to the published version of the manuscript.

Funding

B.S. and S.A. were partially supported by a GSU Molecular Basis of Disease Fellowship, M.P. was supported by a GSU/Department of Computer Science startup grant, and A.Z. was partially supported by NSF grant CCF-2212508 and NIH grant 1R21CA241044-01A1.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data utilized in this research study were acquired from the publicly accessible database known as the Global Initiative on Sharing All Influenza Data (GISAID) (https://www.gisaid.org/ (accessed on 19 May 2023)) for the period spanning September to December 2021. To facilitate replication of the findings, the source codes and pipelines employed in the analysis can be accessed at https://github.com/sarwanpasha/Long_Read_Noisy_Sequences (accessed on 19 May 2023).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
WGS: Whole Genome Sequencing
TGS: Third-Generation Sequencing
GISAID: Global Initiative on Sharing All Influenza Data
ML: Machine Learning
DL: Deep Learning
WDGRL: Wasserstein-Distance-Guided Representation Learning
PacBio: Pacific Biosciences
ONT: Oxford Nanopore Technologies
OHE: One-Hot Encoding
PWM: Position Weight Matrix
SVM: Support Vector Machine
NB: Naïve Bayes
MLP: Multi-Layer Perceptron
KNN: K-Nearest Neighbors
RF: Random Forest
LR: Logistic Regression
DT: Decision Tree

References

  1. Wu, F.; Zhao, S.; Yu, B.; Chen, Y.M.; Wang, W.; Song, Z.G.; Hu, Y.; Tao, Z.W.; Tian, J.H.; Pei, Y.Y.; et al. A new coronavirus associated with human respiratory disease in China. Nature 2020, 579, 265–269.
  2. Kim, D.; Lee, J.Y.; Yang, J.S.; Kim, J.W.; Kim, V.N.; Chang, H. The Architecture of SARS-CoV-2 Transcriptome. Cell 2020, 181, 914–921.
  3. Park, S.E. Epidemiology, virology, and clinical features of severe acute respiratory syndrome -coronavirus-2 (SARS-CoV-2; Coronavirus Disease-19). Clin. Exp. Pediatr. 2020, 63, 119–124.
  4. Rambaut, A.; Holmes, E.C.; O’Toole, Á.; Hill, V.; McCrone, J.T.; Ruis, C.; du Plessis, L.; Pybus, O.G. A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology. Nat. Microbiol. 2020, 5, 1403–1407.
  5. GISAID Website. 2021. Available online: https://www.gisaid.org/ (accessed on 9 June 2022).
  6. Hadfield, J.; Megill, C.; Bell, S.M.; Huddleston, J.; Potter, B.; Callender, C.; Sagulenko, P.; Bedford, T.; Neher, R.A. Nextstrain: Real-time tracking of pathogen evolution. Bioinformatics 2018, 34, 4121–4123.
  7. Aksamentov, I.; Roemer, C.; Hodcroft, E.; Neher, R. Nextclade: Clade assignment, mutation calling and quality control for viral genomes. J. Open Source Softw. 2021, 6, 3773.
  8. Gardy, J.L.; Loman, N.J. Towards a genomics-informed, real-time, global pathogen surveillance system. Nat. Reviews. Genet. 2018, 19, 9–20.
  9. Arons, M.M.; Hatfield, K.M.; Reddy, S.C.; Kimball, A.; James, A.; Jacobs, J.R.; Taylor, J.; Spicer, K.; Bardossy, A.C.; Oakley, L.P.; et al. Presymptomatic SARS-CoV-2 Infections and Transmission in a Skilled Nursing Facility. N. Engl. J. Med. 2020, 382, 2081–2090.
  10. Korber, B.; Fischer, W.; Gnanakaran, S. Tracking changes in SARS-CoV-2 Spike: Evidence that D614G increases infectivity of the COVID-19 virus. Cell 2020, 182, 812–827.e19.
  11. Rhoads, A.; Au, K.F. PacBio Sequencing and Its Applications. Genom. Proteom. Bioinform. 2015, 13, 278–289.
  12. Laver, T.; Harrison, J.; O’Neill, P.; Moore, K.; Farbos, A.; Paszkiewicz, K.; Studholme, D. Assessing the performance of the Oxford Nanopore Technologies MinION. Biomol. Detect. Quantif. 2015, 3, 1–8.
  13. McCarthy, A. Third Generation DNA Sequencing: Pacific Biosciences’ Single Molecule Real Time Technology. Chem. Biol. 2010, 17, 675–676.
  14. Lu, H.; Giordano, F.; Ning, Z. Oxford Nanopore MinION Sequencing and Genome Assembly. Genom. Proteom. Bioinform. 2016, 14, 265–279.
  15. Reuter, J.; Spacek, D.V.; Snyder, M. High-Throughput Sequencing Technologies. Mol. Cell 2015, 58, 586–597.
  16. Singh, O.P.; Vallejo, M.; El-Badawy, I.M.; Aysha, A.; Madhanagopal, J.; Mohd Faudzi, A.A. Classification of SARS-CoV-2 and non-SARS-CoV-2 using machine learning algorithms. Comput. Biol. Med. 2021, 136, 104650.
  17. Ali, S.; Sahoo, B.; Ullah, N.; Zelikovskiy, A.; Patterson, M.; Khan, I. A k-mer based approach for sars-cov-2 variant identification. In Proceedings of the International Symposium on Bioinformatics Research and Applications, Shenzhen, China, 26–28 November 2021; pp. 153–164.
  18. Kuzmin, K.; Adeniyi, A.E.; DaSouza, A.K., Jr.; Lim, D.; Nguyen, H.; Molina, N.R.; Xiong, L.; Weber, I.T.; Harrison, R.W. Machine learning methods accurately predict host specificity of coronaviruses based on spike sequences alone. Biochem. Biophys. Res. Commun. 2020, 533, 553–558.
  19. Singh, R.; Sekhon, A.; Kowsari, K.; Lanchantin, J.; Wang, B.; Qi, Y. Gakco: A fast gapped k-mer string kernel using counting. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Skopje, Macedonia, 18–22 September 2017; pp. 356–373.
  20. Ali, S.; Sahoo, B.; Zelikovsky, A.; Chen, P.Y.; Patterson, M. Benchmarking machine learning robustness in COVID-19 genome sequence classification. Sci. Rep. 2023, 13, 4154.
  21. Ono, Y.; Asai, K.; Hamada, M. PBSIM: PacBio reads simulator—toward accurate genome assembly. Bioinformatics 2012, 29, 119–121.
  22. Li, H. Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics 2018, 34, 3094–3100.
  23. Danecek, P.; Bonfield, J.K.; Liddle, J.; Marshall, J.; Ohan, V.; Pollard, M.O.; Whitwham, A.; Keane, T.; McCarthy, S.A.; Davies, R.M.; et al. Twelve years of SAMtools and BCFtools. GigaScience 2021, 10, giab008.
  24. Wick, R. Badread: Simulation of error-prone long reads. J. Open Source Softw. 2019, 4, 1316.
  25. Shen, J.; Qu, Y.; Zhang, W.; Yu, Y. Wasserstein distance guided representation learning for domain adaptation. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018.
  26. Farhan, M.; Tariq, J.; Zaman, A.; Shabbir, M.; Khan, I. Efficient Approximation Algorithms for Strings Kernel Based Sequence Classification. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 6935–6945.
  27. Ali, S.; Bello, B.; Chourasia, P.; Punathil, R.T.; Zhou, Y.; Patterson, M. PWM2Vec: An Efficient Embedding Approach for Viral Host Specification from Coronavirus Spike Sequences. Biology 2022, 11, 418.
  28. Van Der Maaten, L. Accelerating t-SNE using tree-based algorithms. J. Mach. Learn. Res. 2014, 15, 3221–3245.
Figure 1. Workflow employed for generating a dataset incorporating long-read (PacBio and ONT) specific errors, which was subsequently used to evaluate the robustness of machine learning models employing different embedding techniques. The workflow comprises four distinct steps: 1. collection of high-quality SARS-CoV-2 reference genome sequences from GISAID, 2. generating long-read sequences tailored to PacBio and ONT sequencers, incorporating sequencer-specific errors (represented as star), utilizing the reference genome for the ORF1a gene, 3. aligning the error-incorporated long reads (reads marked with star) to the reference genome, and 4. creating the final compilation of SARS-CoV-2 genome sequences, encompassing long-read sequencer-specific errors (represented as star) in the ORF1a gene.
Figure 2. Accuracies of the machine learning classification algorithms on the error-free set of 8172 SARS-CoV-2 genome sequences with different embedding generation methods.
Figure 3. Robustness of the machine learning classification algorithms to PacBio sequencer-specific errors on a set of 8172 SARS-CoV-2 genome sequences. The analysis was carried out using various embedding generation techniques, and the results were obtained separately for two sequencing depths, 5 and 10. The top panel shows the results for a depth of 5 and the bottom panel the results for a depth of 10.
Figure 4. Robustness of the machine learning classification algorithms to Oxford Nanopore sequencer-specific errors on a set of 8172 SARS-CoV-2 genome sequences. The analysis was carried out using various embedding generation techniques, and the results were obtained separately for two sequencing depths, 5 and 10. The top panel shows the results for a depth of 5 and the bottom panel the results for a depth of 10.
Figure 5. Robustness of the machine learning classification algorithms to random errors incorporated into a set of 8172 SARS-CoV-2 genome sequences. The analysis was carried out using various embedding generation techniques, and the results were obtained separately for two sequencing depths, 5 and 10. The top panel shows the results for a depth of 5 and the bottom panel the results for a depth of 10.
Figure 6. The t-SNE visualizations of the original set of 8172 error-free SARS-CoV-2 sequences employing different embedding techniques, including OHE, WDGRL, string kernel, spaced k-mers, weighted k-mers, and weighted PWM. The visualizations offer a comparative analysis of the effectiveness of each embedding method, with notable prominence observed for AY.44 variants in the t-SNE plot based on the string kernel vector. For optimal viewing experience, it is advised to refer to the figure in color.
Table 1. Dataset statistics for different lineages in our data. The total number of SARS-CoV-2 genome sequences (and corresponding lineages) was 8172 after preprocessing.
| Lineage | No. Sequences | Lineage | No. Sequences |
|---|---|---|---|
| AY.103 | 2253 | AY.121 | 40 |
| AY.44 | 1407 | AY.75 | 36 |
| AY.100 | 715 | AY.3.1 | 30 |
| AY.3 | 705 | AY.3.3 | 28 |
| AY.25 | 582 | AY.107 | 27 |
| AY.25.1 | 381 | AY.34.1 | 25 |
| AY.39 | 247 | AY.46.6 | 21 |
| AY.119 | 241 | AY.98.1 | 20 |
| B.1.617.2 | 173 | AY.13 | 19 |
| AY.20 | 129 | AY.116.1 | 18 |
| AY.26 | 107 | AY.126 | 17 |
| AY.4 | 99 | AY.114 | 15 |
| AY.117 | 93 | AY.46.1 | 14 |
| AY.113 | 92 | AY.34 | 14 |
| AY.118 | 86 | AY.125 | 14 |
| AY.43 | 85 | AY.92 | 13 |
| AY.122 | 84 | AY.46.4 | 12 |
| BA.1 | 79 | AY.98 | 12 |
| AY.119.2 | 74 | AY.127 | 12 |
| AY.47 | 73 | AY.111 | 10 |
| AY.39.1 | 70 | | |
Table 2. Accuracy results obtained from an error-free set of 8172 SARS-CoV-2 genome sequences. The optimal values are highlighted in bold for ease of interpretation.
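| Embed. Method | ML Algo. | Acc. | Prec. | Recall | F1 Weigh. | F1 Macro | ROC-AUC | Train. Runtime (s) |
|---|---|---|---|---|---|---|---|---|
| OHE | SVM | 0.773 | 0.772 | 0.773 | 0.766 | 0.571 | 0.760 | 19,231.462 |
| | NB | 0.086 | 0.192 | 0.086 | 0.091 | 0.194 | 0.595 | 615.813 |
| | MLP | 0.360 | 0.344 | 0.360 | 0.252 | 0.043 | 0.514 | 1222.237 |
| | KNN | 0.669 | 0.689 | 0.669 | 0.649 | 0.409 | 0.666 | 38.431 |
| | RF | 0.774 | 0.774 | 0.774 | 0.765 | 0.574 | 0.758 | 224.910 |
| | LR | 0.721 | 0.741 | 0.721 | 0.707 | 0.555 | 0.741 | 37,539.362 |
| | DT | 0.844 | 0.845 | 0.844 | 0.842 | 0.610 | 0.796 | 219.236 |
| WDGRL | SVM | 0.327 | 0.159 | 0.327 | 0.203 | 0.036 | 0.511 | 2.789 |
| | NB | 0.166 | 0.169 | 0.166 | 0.167 | 0.018 | 0.510 | 0.028 |
| | MLP | 0.413 | 0.318 | 0.413 | 0.327 | 0.077 | 0.530 | 21.971 |
| | KNN | 0.463 | 0.431 | 0.463 | 0.437 | 0.192 | 0.581 | 0.118 |
| | RF | 0.449 | 0.446 | 0.449 | 0.445 | 0.207 | 0.601 | 1.671 |
| | LR | 0.323 | 0.261 | 0.323 | 0.201 | 0.036 | 0.510 | 0.752 |
| | DT | 0.440 | 0.441 | 0.440 | 0.438 | 0.195 | 0.596 | 0.028 |
| String Kernel | SVM | 0.881 | 0.880 | 0.881 | 0.878 | 0.776 | 0.880 | 7.200 |
| | NB | 0.033 | 0.309 | 0.033 | 0.032 | 0.048 | 0.531 | 0.556 |
| | MLP | 0.715 | 0.700 | 0.715 | 0.704 | 0.369 | 0.690 | 34.899 |
| | KNN | 0.749 | 0.757 | 0.749 | 0.735 | 0.544 | 0.738 | 0.648 |
| | RF | 0.672 | 0.750 | 0.672 | 0.634 | 0.395 | 0.652 | 10.258 |
| | LR | 0.864 | 0.858 | 0.864 | 0.854 | 0.675 | 0.817 | 50.100 |
| | DT | 0.572 | 0.578 | 0.572 | 0.573 | 0.337 | 0.669 | 3.636 |
| Spaced k-mers | SVM | 0.956 | 0.956 | 0.956 | 0.955 | 0.890 | 0.933 | 8.761 |
| | NB | 0.068 | 0.305 | 0.068 | 0.059 | 0.184 | 0.606 | 0.553 |
| | MLP | 0.825 | 0.832 | 0.825 | 0.826 | 0.539 | 0.771 | 14.855 |
| | KNN | 0.796 | 0.808 | 0.796 | 0.784 | 0.611 | 0.776 | 0.754 |
| | RF | 0.915 | 0.920 | 0.915 | 0.908 | 0.749 | 0.835 | 2.107 |
| | LR | 0.956 | 0.956 | 0.956 | 0.954 | 0.881 | 0.921 | 19.964 |
| | DT | 0.834 | 0.836 | 0.834 | 0.832 | 0.647 | 0.816 | 0.739 |
| Weighted k-mers | SVM | 0.293 | 0.201 | 0.293 | 0.158 | 0.037 | 0.510 | 30.304 |
| | NB | 0.017 | 0.014 | 0.017 | 0.014 | 0.027 | 0.523 | 0.088 |
| | MLP | 0.275 | 0.168 | 0.275 | 0.159 | 0.037 | 0.511 | 16.368 |
| | KNN | 0.193 | 0.186 | 0.193 | 0.159 | 0.052 | 0.516 | 0.661 |
| | RF | 0.278 | 0.198 | 0.278 | 0.180 | 0.061 | 0.519 | 1.066 |
| | LR | 0.285 | 0.178 | 0.285 | 0.164 | 0.043 | 0.513 | 1.648 |
| | DT | 0.265 | 0.174 | 0.265 | 0.174 | 0.055 | 0.517 | 0.064 |
| Weighted PWM | SVM | 0.852 | 0.850 | 0.852 | 0.849 | 0.741 | 0.863 | 3.257 |
| | NB | 0.028 | 0.015 | 0.028 | 0.018 | 0.048 | 0.553 | 0.093 |
| | MLP | 0.760 | 0.740 | 0.760 | 0.748 | 0.393 | 0.695 | 28.944 |
| | KNN | 0.751 | 0.755 | 0.751 | 0.738 | 0.555 | 0.743 | 0.635 |
| | RF | 0.801 | 0.830 | 0.801 | 0.783 | 0.622 | 0.759 | 3.285 |
| | LR | 0.861 | 0.859 | 0.861 | 0.856 | 0.755 | 0.882 | 8.702 |
| | DT | 0.651 | 0.662 | 0.651 | 0.653 | 0.391 | 0.706 | 0.450 |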
Table 3. Robustness results for the 8172 SARS-CoV-2 genome sequences at two sequencing depths (5 and 10) with PacBio sequencer-specific errors incorporated. The optimal values are highlighted in bold for ease of interpretation.
| Embed. Method | ML Algo. | Depth-5 Acc. | Depth-5 Prec. | Depth-5 Recall | Depth-5 F1 Weigh. | Depth-5 F1 Macro | Depth-5 ROC-AUC | Depth-5 Train. Runtime (s) | Depth-10 Acc. | Depth-10 Prec. | Depth-10 Recall | Depth-10 F1 Weigh. | Depth-10 F1 Macro | Depth-10 ROC-AUC | Depth-10 Train. Runtime (s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OHE | SVM | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 181,443.9 | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 188,143.3 |
| | NB | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 1452.37 | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 1450.059 |
| | MLP | 0.087 | 0.008 | 0.087 | 0.014 | 0.004 | 0.500 | 2833.82 | 0.172 | 0.030 | 0.172 | 0.051 | 0.007 | 0.500 | 1763.060 |
| | KNN | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 126.632 | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 119.041 |
| | RF | 0.172 | 0.030 | 0.172 | 0.051 | 0.007 | 0.500 | 300.639 | 0.276 | 0.220 | 0.276 | 0.120 | 0.011 | 0.500 | 306.980 |
| | LR | 0.021 | 0.000 | 0.021 | 0.001 | 0.001 | 0.500 | 77099.6 | 0.021 | 0.000 | 0.021 | 0.001 | 0.001 | 0.500 | 71666.1 |
| | DT | 0.172 | 0.030 | 0.172 | 0.051 | 0.007 | 0.500 | 890.873 | 0.030 | 0.087 | 0.030 | 0.002 | 0.001 | 0.500 | 441.523 |
| WDGRL | SVM | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 8.056 | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 8.117 |
| | NB | 0.001 | 0.000 | 0.001 | 0.000 | 0.000 | 0.500 | 0.053 | 0.001 | 0.000 | 0.001 | 0.000 | 0.000 | 0.500 | 0.052 |
| | MLP | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 9.374 | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 16.305 |
| | KNN | 0.172 | 0.030 | 0.172 | 0.051 | 0.007 | 0.500 | 1.724 | 0.172 | 0.030 | 0.172 | 0.051 | 0.007 | 0.500 | 1.936 |
| | RF | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 0.711 | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 0.552 |
| | LR | 0.172 | 0.030 | 0.172 | 0.051 | 0.007 | 0.500 | 0.500 | 0.172 | 0.030 | 0.172 | 0.051 | 0.007 | 0.500 | 0.574 |
| | DT | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 0.006 | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 0.006 |
| String Kernel | SVM | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 19.994 | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 18.068 |
| | NB | 0.001 | 0.000 | 0.001 | 0.000 | 0.000 | 0.500 | 1.873 | 0.001 | 0.000 | 0.001 | 0.000 | 0.000 | 0.500 | 1.957 |
| | MLP | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 71.414 | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 59.946 |
| | KNN | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 2.390 | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 2.409 |
| | RF | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 14.867 | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 16.127 |
| | LR | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 75.829 | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 75.556 |
| | DT | 0.087 | 0.008 | 0.087 | 0.014 | 0.004 | 0.500 | 6.013 | 0.029 | 0.001 | 0.029 | 0.002 | 0.001 | 0.500 | 5.979 |
| Spaced k-mers | SVM | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 73.149 | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 60.317 |
| | NB | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 6.302 | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 4.487 |
| | MLP | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 126.596 | 0.016 | 0.000 | 0.016 | 0.000 | 0.001 | 0.500 | 101.059 |
| | KNN | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 1.945 | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 2.076 |
| | RF | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 8.278 | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 3.487 |
| | LR | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 23.153 | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 19.397 |
| | DT | 0.086 | 0.007 | 0.086 | 0.014 | 0.004 | 0.500 | 1.455 | 0.172 | 0.030 | 0.172 | 0.051 | 0.007 | 0.500 | 0.433 |
| Weighted k-mers | SVM | 0.276 | 0.076 | 0.276 | 0.120 | 0.011 | 0.500 | 63.260 | 0.276 | 0.076 | 0.276 | 0.120 | 0.011 | 0.500 | 62.970 |
| | NB | 0.001 | 0.000 | 0.001 | 0.000 | 0.000 | 0.500 | 0.870 | 0.001 | 0.000 | 0.001 | 0.000 | 0.000 | 0.500 | 0.697 |
| | MLP | 0.276 | 0.076 | 0.276 | 0.120 | 0.011 | 0.500 | 29.469 | 0.276 | 0.076 | 0.276 | 0.120 | 0.011 | 0.500 | 25.876 |
| | KNN | 0.071 | 0.005 | 0.071 | 0.009 | 0.003 | 0.500 | 1.970 | 0.071 | 0.005 | 0.071 | 0.009 | 0.003 | 0.500 | 1.979 |
| | RF | 0.276 | 0.076 | 0.276 | 0.120 | 0.011 | 0.500 | 1.683 | 0.276 | 0.076 | 0.276 | 0.120 | 0.011 | 0.500 | 1.640 |
| | LR | 0.276 | 0.076 | 0.276 | 0.120 | 0.011 | 0.500 | 2.404 | 0.276 | 0.076 | 0.276 | 0.120 | 0.011 | 0.500 | 2.378 |
| | DT | 0.276 | 0.076 | 0.276 | 0.120 | 0.011 | 0.500 | 0.098 | 0.276 | 0.076 | 0.276 | 0.120 | 0.011 | 0.500 | 0.097 |
| Weighted PWM | SVM | 0.276 | 0.079 | 0.276 | 0.120 | 0.013 | 0.501 | 8.651 | 0.276 | 0.076 | 0.276 | 0.120 | 0.011 | 0.500 | 10.205 |
| | NB | 0.002 | 0.000 | 0.002 | 0.000 | 0.000 | 0.500 | 0.787 | 0.002 | 0.000 | 0.002 | 0.000 | 0.000 | 0.500 | 0.622 |
| | MLP | 0.005 | 0.000 | 0.005 | 0.000 | 0.000 | 0.500 | 24.250 | 0.005 | 0.000 | 0.005 | 0.000 | 0.000 | 0.500 | 31.357 |
| | KNN | 0.276 | 0.076 | 0.276 | 0.120 | 0.011 | 0.500 | 2.358 | 0.276 | 0.076 | 0.276 | 0.120 | 0.011 | 0.500 | 2.828 |
| | RF | 0.172 | 0.030 | 0.172 | 0.051 | 0.007 | 0.500 | 4.826 | 0.276 | 0.076 | 0.276 | 0.120 | 0.011 | 0.500 | 6.342 |
| | LR | 0.013 | 0.000 | 0.013 | 0.000 | 0.001 | 0.500 | 13.294 | 0.013 | 0.000 | 0.013 | 0.000 | 0.001 | 0.500 | 17.801 |
| | DT | 0.071 | 0.005 | 0.071 | 0.009 | 0.003 | 0.500 | 0.694 | 0.046 | 0.002 | 0.046 | 0.004 | 0.002 | 0.500 | 0.793 |
Table 4. Robustness results for the 8172 SARS-CoV-2 genome sequences at two sequencing depths (5 and 10) with Oxford Nanopore Technologies (ONT) sequencer-specific errors incorporated. The optimal values are highlighted in bold for ease of interpretation.
Table 4. Provides a comprehensive analysis of the robustness of 8172 SARS-CoV-2 genome sequences under two different sequencing depths (5 and 10) and specific errors associated with the Oxford Nanopore Technology (ONT) sequencer. The results of this analysis, which are based on the identification of optimal values, have been highlighted in bold for ease of interpretation.
Embed. MethodML Algo.n2020 5× Simulated Errorn2020 10× Simulated Error
Acc.Prec.RecallF1 Weigh.F1 MacroROC-AUCTrain. Runtime (s)Acc.Prec.RecallF1 Weigh.F1 MacroROC-AUCTrain. Runtime (s)
OHESVM0.1220.1330.1220.1120.0220.50170,679.00.1060.1270.1060.0940.0180.500103,909.2
NB0.2760.0760.2760.1190.0110.5001075.200.2760.0760.2760.1190.0110.500833.594
MLP0.0710.0970.0710.0100.0030.5003906.450.2760.0760.2760.1190.0110.5001539.707
KNN0.1890.1300.1890.1340.0190.50182.1950.1890.1330.1890.1340.0170.500108.649
RF0.0720.1410.0720.0760.0080.500276.7730.2410.1160.2410.1470.0160.500319.020
LR0.2720.1080.2720.1220.0110.50067103.90.2700.1300.2700.1240.0120.50068286.4
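| | DT | 0.166 | 0.134 | 0.166 | 0.103 | 0.017 | 0.501 | 460.552 | 0.163 | 0.127 | 0.163 | 0.087 | 0.013 | 0.500 | 411.342 |
| WDGRL | SVM | 0.244 | 0.108 | 0.244 | 0.147 | 0.015 | 0.501 | 8.906 | 0.214 | 0.119 | 0.214 | 0.138 | 0.016 | 0.502 | 8.812 |
| | NB | 0.011 | 0.094 | 0.011 | 0.019 | 0.002 | 0.496 | 0.063 | 0.184 | 0.129 | 0.184 | 0.089 | 0.011 | 0.501 | 0.060 |
| | MLP | 0.268 | 0.085 | 0.268 | 0.122 | 0.012 | 0.500 | 30.652 | 0.192 | 0.137 | 0.192 | 0.116 | 0.013 | 0.502 | 28.833 |
| | KNN | 0.261 | 0.150 | 0.261 | 0.127 | 0.013 | 0.500 | 0.382 | 0.185 | 0.145 | 0.185 | 0.120 | 0.018 | 0.502 | 0.374 |
| | RF | 0.109 | 0.147 | 0.109 | 0.080 | 0.012 | 0.499 | 2.572 | 0.153 | 0.151 | 0.153 | 0.090 | 0.014 | 0.500 | 2.383 |
| | LR | 0.125 | 0.073 | 0.125 | 0.074 | 0.009 | 0.500 | 1.088 | 0.191 | 0.167 | 0.191 | 0.088 | 0.011 | 0.501 | 1.058 |
| | DT | 0.103 | 0.148 | 0.103 | 0.074 | 0.012 | 0.499 | 0.048 | 0.166 | 0.157 | 0.166 | 0.093 | 0.014 | 0.501 | 0.047 |
| String Kernel | SVM | 0.144 | 0.134 | 0.144 | 0.137 | 0.024 | 0.500 | 20.179 | 0.154 | 0.142 | 0.154 | 0.147 | 0.025 | 0.501 | 18.168 |
| | NB | 0.004 | 0.000 | 0.004 | 0.000 | 0.001 | 0.503 | 2.019 | 0.004 | 0.000 | 0.004 | 0.000 | 0.001 | 0.499 | 1.803 |
| | MLP | 0.137 | 0.135 | 0.137 | 0.135 | 0.026 | 0.501 | 59.144 | 0.132 | 0.139 | 0.132 | 0.134 | 0.028 | 0.502 | 49.297 |
| | KNN | 0.195 | 0.123 | 0.195 | 0.132 | 0.020 | 0.500 | 2.877 | 0.189 | 0.137 | 0.189 | 0.147 | 0.022 | 0.500 | 2.807 |
| | RF | 0.249 | 0.116 | 0.249 | 0.144 | 0.015 | 0.500 | 15.951 | 0.263 | 0.157 | 0.263 | 0.153 | 0.016 | 0.501 | 15.812 |
| | LR | 0.157 | 0.130 | 0.157 | 0.141 | 0.023 | 0.500 | 78.382 | 0.171 | 0.140 | 0.171 | 0.153 | 0.024 | 0.500 | 76.744 |
| | DT | 0.130 | 0.129 | 0.130 | 0.129 | 0.023 | 0.500 | 5.910 | 0.137 | 0.140 | 0.137 | 0.137 | 0.026 | 0.501 | 5.969 |
| Spaced k-mers | SVM | 0.132 | 0.147 | 0.132 | 0.133 | 0.039 | 0.508 | 23.890 | 0.193 | 0.208 | 0.193 | 0.198 | 0.092 | 0.535 | 20.470 |
| | NB | 0.004 | 0.000 | 0.004 | 0.000 | 0.002 | 0.503 | 3.843 | 0.006 | 0.003 | 0.006 | 0.001 | 0.004 | 0.502 | 4.476 |
| | MLP | 0.112 | 0.142 | 0.112 | 0.118 | 0.025 | 0.502 | 58.597 | 0.176 | 0.192 | 0.176 | 0.178 | 0.051 | 0.514 | 44.872 |
| | KNN | 0.225 | 0.150 | 0.225 | 0.146 | 0.018 | 0.500 | 3.024 | 0.222 | 0.177 | 0.222 | 0.181 | 0.035 | 0.508 | 2.920 |
| | RF | 0.206 | 0.148 | 0.206 | 0.162 | 0.033 | 0.506 | 4.026 | 0.232 | 0.196 | 0.232 | 0.207 | 0.062 | 0.519 | 4.038 |
| | LR | 0.135 | 0.151 | 0.135 | 0.137 | 0.040 | 0.509 | 70.516 | 0.198 | 0.211 | 0.198 | 0.202 | 0.097 | 0.538 | 57.721 |
| | DT | 0.131 | 0.140 | 0.131 | 0.128 | 0.028 | 0.502 | 1.050 | 0.145 | 0.176 | 0.145 | 0.157 | 0.050 | 0.516 | 1.011 |
| Weighted k-mers | SVM | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 73.858 | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 74.108 |
| | NB | 0.001 | 0.000 | 0.001 | 0.000 | 0.000 | 0.500 | 0.778 | 0.001 | 0.000 | 0.001 | 0.000 | 0.000 | 0.500 | 0.805 |
| | MLP | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 30.145 | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 31.810 |
| | KNN | 0.172 | 0.030 | 0.172 | 0.051 | 0.007 | 0.500 | 2.079 | 0.172 | 0.030 | 0.172 | 0.051 | 0.007 | 0.500 | 1.937 |
| | RF | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 1.763 | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 1.645 |
| | LR | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 2.557 | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 2.473 |
| | DT | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 0.116 | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 0.116 |
| Weighted PWM | SVM | 0.203 | 0.141 | 0.203 | 0.157 | 0.023 | 0.501 | 11.123 | 0.240 | 0.170 | 0.240 | 0.177 | 0.028 | 0.504 | 9.825 |
| | NB | 0.002 | 0.000 | 0.002 | 0.000 | 0.000 | 0.498 | 0.808 | 0.003 | 0.000 | 0.003 | 0.000 | 0.000 | 0.500 | 0.814 |
| | MLP | 0.086 | 0.141 | 0.086 | 0.091 | 0.022 | 0.501 | 36.325 | 0.119 | 0.180 | 0.119 | 0.132 | 0.035 | 0.513 | 27.304 |
| | KNN | 0.193 | 0.127 | 0.193 | 0.129 | 0.017 | 0.501 | 2.745 | 0.188 | 0.148 | 0.188 | 0.151 | 0.025 | 0.503 | 2.475 |
| | RF | 0.175 | 0.141 | 0.175 | 0.151 | 0.031 | 0.505 | 5.887 | 0.203 | 0.175 | 0.203 | 0.181 | 0.042 | 0.510 | 4.699 |
| | LR | 0.084 | 0.152 | 0.084 | 0.093 | 0.026 | 0.505 | 16.420 | 0.143 | 0.194 | 0.143 | 0.155 | 0.051 | 0.519 | 12.654 |
| | DT | 0.069 | 0.128 | 0.069 | 0.077 | 0.021 | 0.501 | 0.830 | 0.101 | 0.152 | 0.101 | 0.115 | 0.030 | 0.504 | 0.670 |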
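In every row above, the Recall value equals the Acc. value. Support-weighted averaging of per-class recall always yields the overall accuracy, so the sketch below assumes that the Prec., Recall, and F1 Weigh. columns use support-weighted averaging; that convention is an assumption, not something these rows state. The function name `classification_metrics` and the toy labels are hypothetical and serve only to illustrate how such columns can be computed.

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Accuracy plus support-weighted and macro-averaged per-class scores."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)        # true instances per class
    predicted = Counter(y_pred)      # predictions per class
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    n = len(y_true)

    per_class = {}
    for c in labels:
        tp = true_pos[c]
        # Undefined ratios (no predictions or no true instances) are set to 0.
        prec = tp / predicted[c] if predicted[c] else 0.0
        rec = tp / support[c] if support[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        per_class[c] = (prec, rec, f1)

    def weighted(i):
        # Average of the i-th per-class score, weighted by class support.
        return sum(support[c] * per_class[c][i] for c in labels) / n

    return {
        "accuracy": sum(true_pos.values()) / n,
        "precision_weighted": weighted(0),
        "recall_weighted": weighted(1),
        "f1_weighted": weighted(2),
        "f1_macro": sum(per_class[c][2] for c in labels) / len(labels),
    }


# Toy labels (hypothetical, not taken from the study's data).
y_true = ["A", "A", "A", "B", "B", "C"]
y_pred = ["A", "A", "B", "B", "C", "C"]
m = classification_metrics(y_true, y_pred)
print(m["accuracy"], m["recall_weighted"])  # the two values coincide
```

Because weighted recall collapses to accuracy, the Recall column carries no information beyond Acc. under this convention; the Prec. and F1 columns are the ones that distinguish rows with equal accuracy.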
Table 5. Robustness analysis of 8172 SARS-CoV-2 genome sequences with random errors incorporated at two sequencing depths (5× and 10×). Optimal values are highlighted in bold.
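| Embed. Method | ML Algo. | Random 5× Simulated Error | | | | | | | Random 10× Simulated Error | | | | | | |
| | | Acc. | Prec. | Recall | F1 Weigh. | F1 Macro | ROC-AUC | Train. Runtime (s) | Acc. | Prec. | Recall | F1 Weigh. | F1 Macro | ROC-AUC | Train. Runtime (s) |
| OHE | SVM | 0.101 | 0.138 | 0.101 | 0.078 | 0.016 | 0.500 | 157054.2 | 0.107 | 0.145 | 0.107 | 0.093 | 0.019 | 0.501 | 74697.7 |
| | NB | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 1113.271 | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 1119.210 |
| | MLP | 0.013 | 0.000 | 0.013 | 0.000 | 0.002 | 0.500 | 1503.268 | 0.267 | 0.085 | 0.267 | 0.125 | 0.012 | 0.500 | 1625.962 |
| | KNN | 0.223 | 0.131 | 0.223 | 0.136 | 0.018 | 0.500 | 91.708 | 0.204 | 0.135 | 0.204 | 0.139 | 0.018 | 0.501 | 100.443 |
| | RF | 0.202 | 0.112 | 0.202 | 0.126 | 0.014 | 0.500 | 259.562 | 0.186 | 0.114 | 0.186 | 0.112 | 0.013 | 0.500 | 405.555 |
| | LR | 0.271 | 0.109 | 0.271 | 0.122 | 0.011 | 0.500 | 70035.07 | 0.272 | 0.123 | 0.272 | 0.122 | 0.012 | 0.500 | 83434.3 |
| | DT | 0.157 | 0.126 | 0.157 | 0.075 | 0.012 | 0.500 | 776.263 | 0.164 | 0.136 | 0.164 | 0.086 | 0.014 | 0.500 | 943.475 |
| WDGRL | SVM | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 8.776 | 0.275 | 0.118 | 0.275 | 0.158 | 0.017 | 0.502 | 8.842 |
| | NB | 0.001 | 0.000 | 0.001 | 0.000 | 0.000 | 0.500 | 0.068 | 0.005 | 0.105 | 0.005 | 0.007 | 0.001 | 0.500 | 0.051 |
| | MLP | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 31.487 | 0.272 | 0.129 | 0.272 | 0.125 | 0.012 | 0.500 | 21.765 |
| | KNN | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 0.348 | 0.169 | 0.142 | 0.169 | 0.119 | 0.016 | 0.500 | 0.383 |
| | RF | 0.087 | 0.008 | 0.087 | 0.014 | 0.004 | 0.500 | 2.488 | 0.072 | 0.138 | 0.072 | 0.086 | 0.015 | 0.499 | 2.510 |
| | LR | 0.087 | 0.008 | 0.087 | 0.014 | 0.004 | 0.500 | 1.069 | 0.267 | 0.115 | 0.267 | 0.127 | 0.012 | 0.500 | 1.059 |
| | DT | 0.001 | 0.000 | 0.001 | 0.000 | 0.000 | 0.500 | 0.046 | 0.068 | 0.135 | 0.068 | 0.083 | 0.015 | 0.499 | 0.046 |
| String Kernel | SVM | 0.169 | 0.132 | 0.169 | 0.142 | 0.024 | 0.501 | 19.308 | 0.140 | 0.128 | 0.140 | 0.133 | 0.023 | 0.500 | 18.115 |
| | NB | 0.003 | 0.004 | 0.003 | 0.001 | 0.002 | 0.501 | 1.894 | 0.004 | 0.000 | 0.004 | 0.000 | 0.001 | 0.500 | 1.811 |
| | MLP | 0.138 | 0.135 | 0.138 | 0.134 | 0.023 | 0.500 | 83.035 | 0.124 | 0.130 | 0.124 | 0.127 | 0.024 | 0.500 | 60.808 |
| | KNN | 0.213 | 0.128 | 0.213 | 0.130 | 0.017 | 0.500 | 2.773 | 0.211 | 0.136 | 0.211 | 0.142 | 0.020 | 0.501 | 2.816 |
| | RF | 0.263 | 0.140 | 0.263 | 0.140 | 0.014 | 0.500 | 15.649 | 0.257 | 0.122 | 0.257 | 0.146 | 0.015 | 0.500 | 15.715 |
| | LR | 0.180 | 0.132 | 0.180 | 0.148 | 0.022 | 0.500 | 79.243 | 0.158 | 0.129 | 0.158 | 0.141 | 0.022 | 0.500 | 76.412 |
| | DT | 0.134 | 0.129 | 0.134 | 0.129 | 0.023 | 0.500 | 6.318 | 0.125 | 0.130 | 0.125 | 0.127 | 0.022 | 0.499 | 5.773 |
| Spaced k-mers | SVM | 0.080 | 0.139 | 0.080 | 0.080 | 0.012 | 0.501 | 17.010 | 0.124 | 0.155 | 0.124 | 0.118 | 0.022 | 0.501 | 22.282 |
| | NB | 0.005 | 0.004 | 0.005 | 0.001 | 0.002 | 0.502 | 3.879 | 0.006 | 0.000 | 0.006 | 0.001 | 0.002 | 0.500 | 4.803 |
| | MLP | 0.133 | 0.111 | 0.133 | 0.060 | 0.011 | 0.500 | 41.950 | 0.116 | 0.163 | 0.116 | 0.107 | 0.019 | 0.502 | 47.611 |
| | KNN | 0.263 | 0.092 | 0.263 | 0.121 | 0.012 | 0.500 | 2.763 | 0.246 | 0.143 | 0.246 | 0.125 | 0.014 | 0.500 | 2.646 |
| | RF | 0.210 | 0.128 | 0.210 | 0.148 | 0.019 | 0.501 | 3.921 | 0.233 | 0.139 | 0.233 | 0.159 | 0.023 | 0.502 | 3.587 |
| | LR | 0.147 | 0.144 | 0.147 | 0.116 | 0.018 | 0.501 | 64.852 | 0.135 | 0.171 | 0.135 | 0.128 | 0.027 | 0.505 | 60.750 |
| | DT | 0.156 | 0.128 | 0.156 | 0.088 | 0.015 | 0.501 | 0.952 | 0.127 | 0.135 | 0.127 | 0.096 | 0.020 | 0.501 | 1.006 |
| Weighted k-mers | SVM | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 76.220 | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 78.497 |
| | NB | 0.001 | 0.000 | 0.001 | 0.000 | 0.000 | 0.500 | 0.774 | 0.001 | 0.000 | 0.001 | 0.000 | 0.000 | 0.500 | 0.749 |
| | MLP | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 23.639 | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 34.219 |
| | KNN | 0.172 | 0.030 | 0.172 | 0.051 | 0.007 | 0.500 | 1.932 | 0.172 | 0.030 | 0.172 | 0.051 | 0.007 | 0.500 | 1.915 |
| | RF | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 1.851 | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 1.749 |
| | LR | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 2.706 | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 2.626 |
| | DT | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 0.100 | 0.276 | 0.076 | 0.276 | 0.119 | 0.011 | 0.500 | 0.102 |
| Weighted PWM | SVM | 0.203 | 0.133 | 0.203 | 0.145 | 0.020 | 0.500 | 11.330 | 0.196 | 0.143 | 0.196 | 0.151 | 0.024 | 0.501 | 8.605 |
| | NB | 0.002 | 0.000 | 0.002 | 0.000 | 0.000 | 0.500 | 0.594 | 0.002 | 0.000 | 0.002 | 0.000 | 0.000 | 0.500 | 0.766 |
| | MLP | 0.016 | 0.087 | 0.016 | 0.015 | 0.004 | 0.499 | 43.680 | 0.037 | 0.110 | 0.037 | 0.036 | 0.012 | 0.501 | 29.312 |
| | KNN | 0.214 | 0.108 | 0.214 | 0.122 | 0.015 | 0.500 | 2.758 | 0.201 | 0.143 | 0.201 | 0.130 | 0.017 | 0.500 | 2.618 |
| | RF | 0.163 | 0.122 | 0.163 | 0.129 | 0.017 | 0.501 | 5.649 | 0.181 | 0.134 | 0.181 | 0.136 | 0.019 | 0.501 | 4.915 |
| | LR | 0.061 | 0.104 | 0.061 | 0.058 | 0.010 | 0.499 | 16.697 | 0.070 | 0.131 | 0.070 | 0.074 | 0.014 | 0.502 | 12.835 |
| | DT | 0.061 | 0.138 | 0.061 | 0.058 | 0.015 | 0.500 | 0.875 | 0.073 | 0.136 | 0.073 | 0.070 | 0.018 | 0.500 | 0.673 |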
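Many rows above repeat the same values: 0.276 accuracy, 0.076 precision, 0.276 recall, and 0.119 weighted F1, with ROC-AUC at the chance level of 0.500. Under support-weighted averaging (an assumption, as the averaging convention is not stated in these rows), those numbers are what a classifier that always predicts a single class covering 27.6% of the test set would produce. Whether the models in those rows actually collapsed to one class is not stated here; the sketch below only checks the arithmetic, and `constant_prediction_scores` is a hypothetical helper name.

```python
def constant_prediction_scores(p):
    """Weighted scores for a classifier that always predicts one class
    whose share of the test set is p (0 < p <= 1).

    Assumes support-weighted averaging, with undefined per-class
    precision (classes that are never predicted) counted as 0.
    """
    precision_c = p    # a fraction p of the constant predictions is correct
    recall_c = 1.0     # every instance of the predicted class is recovered
    f1_c = 2 * precision_c * recall_c / (precision_c + recall_c)
    # All other classes score 0, so only the predicted class (weight p) contributes.
    return {
        "accuracy": p,
        "precision_weighted": p * precision_c,   # p squared
        "recall_weighted": p * recall_c,         # equals accuracy
        "f1_weighted": p * f1_c,                 # 2 * p^2 / (1 + p)
    }


scores = constant_prediction_scores(0.276)
print(round(scores["precision_weighted"], 3))  # 0.076
print(round(scores["f1_weighted"], 3))         # 0.119
```

Comparing a row against this closed form is a quick way to tell a constant-prediction baseline apart from a model whose scores merely happen to be low.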
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Sahoo, B.; Ali, S.; Chen, P.-Y.; Patterson, M.; Zelikovsky, A. Assessing the Resilience of Machine Learning Classification Algorithms on SARS-CoV-2 Genome Sequences Generated with Long-Read Specific Errors. Biomolecules 2023, 13, 934. https://doi.org/10.3390/biom13060934