Machine Learning-Based Models for Detection of Biomarkers of Autoimmune Diseases by Fragmentation and Analysis of miRNA Sequences

Ali, Nehal M.; Shaheen, Mohamed; Mabrouk, Mai S.; Aborizka, Mohamed

doi:10.3390/app12115583


¹ College of Computing and Information Technology, Arab Academy for Science Technology and Maritime Transport, Cairo 2033, Egypt

² College of Computing and Information Technology, Arab Academy for Science Technology and Maritime Transport, Alexandria 1029, Egypt

³ Department of Biomedical Engineering, Faculty of Computer Science, Misr University for Science and Technology, Cairo P.O. Box 77, Egypt

* Author to whom correspondence should be addressed.

Appl. Sci. 2022, 12(11), 5583; https://doi.org/10.3390/app12115583

Submission received: 12 April 2022 / Revised: 23 May 2022 / Accepted: 27 May 2022 / Published: 31 May 2022


Abstract

Thanks to high-throughput data technology, microRNA analysis studies have advanced early disease detection. This work introduces two complete models to detect the biomarkers of two autoimmune diseases, multiple sclerosis and rheumatoid arthritis, via miRNA analysis. Building on work the authors published previously, both models involve complete pipelines of text mining methods integrated with traditional machine learning methods and LSTM deep learning. This work also studies the fragmentation of miRNA sequences to reduce the required processing time and computational power. Moreover, this work studies the impact of using two different library preparation kits (NEBNEXT and NEXTFLEX) on the detection accuracy for rheumatoid arthritis. Additional experiments are applied to the proposed models based on three different transcriptomic datasets. The results show that the transcriptomic fragmentation model reported a biomarker detection accuracy of 96.45% at a sequence fragment size of 0.2, indicating a significant reduction in execution power while retaining biomarker detection accuracy. On the other hand, the LSTM model obtained a promising detection accuracy of 72%, implying savings in feature engineering processing. Additionally, the fragmentation model reported 22.4% less execution time than the previously published model and 87.5% less than the LSTM model, denoting a considerable execution power reduction.

Keywords: miRNA analysis; multiple sclerosis; rheumatoid arthritis; machine learning; deep learning; LSTM; text mining

1. Introduction

Multiple sclerosis (MS) is a chronic autoimmune disease that damages the human central nervous system (CNS) and causes severe physical disabilities, including partial or total vision loss and significant motor disabilities, in addition to psychological impacts, including deep depression. On the other hand, rheumatoid arthritis (RA) is a long-term autoimmune disease that severely impacts joints, causing swollen, warm, painful joints and stiff wrists. It also affects the heart, nerves, blood, lungs, eyes, and skin. RA can also result in inflammation around the lungs, a low red blood cell count, or inflammation around the heart [1,2,3].

Both RA and MS are disabling autoimmune diseases in humans. MS’s breach in immune tolerance leads to inflammation in the CNS, causing the dysfunction of peripheral organs. On the other hand, RA’s breach in immune tolerance causes a peripheral autoimmune disease that primarily affects the CNS via neuroimmune mechanisms [4,5].

Next-generation sequencing (NGS) is a massively parallel sequencing technology that enables the sequencing of whole genomes. Thanks to NGS, human genome data have evolved, and biological science has been revolutionized. NGS technology generates significant amounts of output data. Hence, studying and analyzing these sequenced data and applying this analysis in genetic studies have developed significantly and grown in importance [6,7].

RNA sequencing (RNA-Seq) utilizes the capabilities of high-throughput sequencing methods to provide insight into the transcriptome of a studied cell. RNA-Seq offers significantly higher coverage and superior resolution of the dynamic nature of the transcriptome, and consequently provides better early disease biomarker detection compared to microarray-based methods and Sanger sequencing.

miRNAs are small non-coding molecules that primarily regulate the expression of genes at the transcript level. Sequencing miRNAs isolated from exosomes represents a high potential for the detection of disease biomarkers. However, exosomes with a low RNA amount obstruct proper quantification and analysis. Hence, library preparation protocols are utilized with miRNA NGS to enhance the samples and, consequently, the miRNA analysis results [8,9].

Hence, the use of different library preparation protocols calls for studying their influence on the sequenced data and how these protocols impact the disease biomarker detection accuracy and the results of genetic studies [10,11].

This work proposes two complete miRNA analysis models for early MS and RA biomarker detection based on a comprehensive MS disease biomarker detection model that the authors published earlier. The presented model, introduced in the authors’ previous work, primarily converted the NGS analysis problem into a text mining problem; promising results have been reported upon studying an miRNA dataset of multiple sclerosis (MS) patients. In addition, a novel feature extraction method (kmerFIDF) for detecting disease biomarkers was introduced in our previous work [11].

In this work, the first model introduces transcriptomic sample fragmentation and studies the feasibility of fragmenting transcriptomic samples while retaining high disease prediction accuracy. The second model utilizes LSTM as a deep learning model for analyzing this type of data. Both models are additionally examined against three different datasets. In addition, the model that was introduced previously is also tested on the same three datasets to study its robustness with more extensive datasets [11]. Comparative analysis is applied to the results of the three models.

Moreover, this work studies two commonly used library preparation kits, NEBNEXT and NEXTFLEX, on a transcriptomic dataset that consists of RA patients, healthy controls, and synthesized samples [12]. This study analyzes the impact of both preparation kits on the accuracy of RA biomarker detection. This study also evaluates the robustness of the model’s capabilities to classify the dataset from several perspectives, including classifying the samples of the two kits, determining the synthesized samples, and identifying the ten subclasses of the synthesized samples.

Hence, the critical contributions of this study are:

Introducing the transcriptomic fragmentation model for miRNA sequence analysis and autoimmune disease biomarker detection (Model_I);
Introducing an LSTM model for the detection of biomarkers of autoimmune diseases (Model_II);
Studying the impact of two library preparation kits (NEBNEXT and NEXTFLEX) on biomarker detection;
Applying further experiments to a previously published model (Model_0) using a more extensive dataset;
Conducting further experiments on both introduced models using three different datasets of autoimmune diseases, which produced the following findings:
○ Model_I saves 22.4% of the execution time compared to Model_0.
○ Using Model_I, a fragmentation size of 0.2 of a whole sequence file is sufficient for autoimmune disease biomarker detection.
○ Model_I reported sensitivity, specificity, precision, accuracy, and F1 scores of (92.7, 92.8, 94.8, 95.7, 95.2), respectively, in RA biomarker detection.
○ Model_I reported sensitivity, specificity, precision, accuracy, and F1 scores of (94.6, 94.7, 95.6, 96.8, 96.1), respectively, in MS biomarker detection.
○ Model_I reported an accuracy score of 89% in analyzing sensitive synthetic data.
○ The sequences prepared by NEXTFLEX demonstrated relatively higher detection accuracy.
○ Model_II reported a promising accuracy score of 0.72 for MS biomarker detection, considering the study’s relatively small dataset.

1.1. Background

The first step of next-generation sequencing is library preparation, which enables DNA or RNA to adhere to the sequencing flowcell and permits the sample to be identified. Thus, library preparation protocols can impact the NGS experiment results. Library preparation protocols primarily involve two main steps (“fragmentation and end repair” and “adapter ligation”), in addition to one optional step (polymerase chain reaction (PCR) amplification). These steps are briefly elaborated on in this section [13].

1.1.1. Fragmentation and End Repair

In this step, samples are fragmented into pieces of consistent size to make them suitable for sequencing. This fragmentation is needed because short-read sequencing technologies such as Illumina’s cannot analyze very long DNA strands.

After fragmentation, the DNA fragments are end repaired. A single adenine base is added to form an overhang through an A-tailing reaction. The A-overhang allows adapters carrying a single overhanging thymine base to pair with the DNA fragments [14].

1.1.2. Adapter Ligation

In this step, a ligase enzyme covalently binds the adapters to the DNA fragments, forming a complete library molecule. Binding these adapters is beneficial, as they permit sequencing by attaching the sequences to the flowcell. They can also contain barcodes or indexes that identify the samples and enable multiplexing [15].

1.1.3. Polymerase Chain Reaction (PCR) Amplification

Library amplification is an optional step that primarily depends on the adapter type and input. PCR amplification can be considered a sample replication step that supports the results. PCR clean-up is applied after PCR amplification to remove the small fragments and remaining oligonucleotides; a spin column or magnetic beads can be used for this clean-up [14].

The two standard library preparation methods are ligation-based library prep and tagmentation-based library prep. In ligation-based libraries, the DNA fragmentation and the adapter ligation to the ends of the fragments are performed in two separate steps. On the other hand, tagmentation-based libraries combine DNA fragmentation and adapter ligation in a single reaction step [16].

2. Materials and Methods

2.1. Datasets

In this work, three transcriptomic datasets were studied. All datasets were provided by the National Center for Biotechnology Information (NCBI), and all datasets were labeled.

Dataset_I consisted of 42 miRNA files: 6 miRNA files of rheumatoid arthritis (RA) cases, 6 miRNA files of controls, and 30 miRNA files of synthetic samples classified into five classes (A, B, C, D, and E). The NEBNEXT kit prepared 50% of the entire dataset, and 50% was prepared by the NEXTflex kit [12].

Dataset_VI was created by combining two datasets (a literature dataset, Dataset_II, of 215 miRNA files of 54 MS patients before and after fingolimod treatment [17], and Dataset_III, a dataset of 24 MS cases/control [18]). Hence, three classes were obtained in Dataset_VI (117 cases, 110 after fingolimod treatment, and 12 controls).

Table 1 summarizes the main specifications of each dataset. Dataset_II and Dataset_III were highlighted to indicate that they were combined to compose Dataset_VI. The class distribution of each dataset’s samples is illustrated in Figure 1, Figure 2, Figure 3 and Figure 4.

2.2. Development Environment

The specifications of the development environment are summarized in Table 2, while Table 3 summarizes the key libraries/platforms used in the models’ development.

2.3. Implemented Models

Two transcriptomic analysis models were implemented in this work. The first (Model_I) was based on the model introduced in our previous work (Model_0) and added a sequence fragmentation step before the feature extraction step. Its purpose was to study the feasibility of retaining the prediction accuracy while reducing the execution power by analyzing fragments of the sequence files rather than the entire sequence. The second model (Model_II) leveraged the preprocessing steps introduced in our previous work and applied the LSTM network [19]. Additionally, the fragmentation step was added before applying the LSTM, and the model was examined with and without fragmentation.

The preprocessing pipeline of the analytical model introduced in our previous work (Model_0) was implemented as follows. It started with evaluating the sequence quality using FastQC to determine the parts of the libraries that should be trimmed in the following steps, which improves the accuracy of the analysis results. Afterwards, the low-quality sequences were trimmed accordingly using the Trimmomatic tool [20]. The studied sequences were then converted into flat text FASTA files, and the following steps of each model were applied accordingly [11].
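The final conversion step of this pipeline can be illustrated with a minimal sketch. The source does not state which tool performed the FASTQ-to-FASTA conversion, so the function names below (`read_fastq`, `fastq_to_fasta`) are hypothetical, and the sketch assumes well-formed four-line FASTQ records that have already been quality-trimmed.

```python
from typing import Iterable, Iterator, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (read_id, sequence) pairs from 4-line FASTQ records.

    Assumes each record is: @id, sequence, +, quality (no line wrapping).
    """
    records = [ln.rstrip("\n") for ln in lines if ln.strip()]
    for i in range(0, len(records) - 3, 4):
        header, sequence = records[i], records[i + 1]
        if not header.startswith("@"):
            raise ValueError(f"malformed FASTQ header: {header!r}")
        yield header[1:], sequence


def fastq_to_fasta(lines: Iterable[str]) -> str:
    """Drop the quality lines and emit flat-text FASTA."""
    return "".join(f">{rid}\n{seq}\n" for rid, seq in read_fastq(lines))


if __name__ == "__main__":
    example = ["@read1\n", "ACGTAC\n", "+\n", "IIIIII\n"]
    print(fastq_to_fasta(example), end="")
```

The quality scores are discarded here because the downstream steps treat the sequences purely as text.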

Model_0 is recapped in the following section, and the two introduced models (Model_I and Model_II) are elaborated on in the sections after it.

2.3.1. Model_0

As shown in the block diagram in Figure 5, in Model_0 (the model that the authors published earlier), a labeled dataset was built from the flat files produced by the sequence preprocessing pipeline. The sequence data were treated as text data in this stage, and KmerFIDF was then computed. KmerFIDF is a novel feature extraction method that primarily leverages the Kmer counting method, which splits (i.e., tokenizes) the sequences into terms and counts the terms’ frequencies with respect to their order in the original sequence. This step produced a matrix of counts for each kmer term in the datasets. This matrix was then processed using the KmerFIDF equation given by Equation (1) [11].

$$\mathrm{KmerFIDF}(t, d, D) = \mathrm{KmerCount}(t, d) \cdot \mathrm{idf}(t, D) \qquad (1)$$

Afterwards, dimensionality reduction using the LDA algorithm was applied to the obtained KmerFIDF matrix to clean the sparse matrix and to eliminate the kmers with low KmerFIDF values. Finally, as per the literature, a traditional predictive method, random forest (RF), was applied. The experiments used Datasets I, II, and VI since Dataset_III was tested in our previous work [12]. Figure 5 summarizes the processing steps of Model_0.
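The weighting in Equation (1) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the k-mers are taken as overlapping windows that slide one base at a time, and idf(t, D) is assumed to be the common form log(N / df(t)), since the source does not spell out either choice. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def kmer_counts(sequence: str, k: int) -> Counter:
    """KmerCount(t, d): count overlapping k-mers (ASSUMPTION: stride of 1)."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_fidf(documents: List[str], k: int) -> List[Dict[str, float]]:
    """Weight each k-mer count by its inverse document frequency."""
    counts = [kmer_counts(doc, k) for doc in documents]
    n_docs = len(documents)
    # Number of documents in which each k-mer appears at least once.
    doc_freq = Counter(term for c in counts for term in c)
    # ASSUMPTION: idf(t, D) = log(N / df(t)); the source gives no formula.
    idf = {t: math.log(n_docs / df) for t, df in doc_freq.items()}
    return [{t: n * idf[t] for t, n in c.items()} for c in counts]


if __name__ == "__main__":
    for weights in kmer_fidf(["ACGTAC", "ACGGGG"], k=3):
        print(weights)
```

Under this idf form, a k-mer that occurs in every sample receives a weight of zero, which is the sense in which low-weight k-mers carry little class information.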

2.3.2. Fragmented Sequence Model (Model_I)

The first model (Model_I) proposed in this work fragments the studied sequence files to determine the indicative parts of an miRNA sequence and leverages them to reduce the data processing power. The block diagram of Model_I is shown in Figure 6.

In this model, the sequence files were fragmented into fragments representing a percentage of the total file size, as illustrated in Figure 7. For instance, when the fragmentation size f was set to 0.2, each sequence file (A) in the dataset was split into five files (A₁, A₂, A₃, A₄, and A₅).

Hence, if the examined dataset had files (A–M), each sequence file was split into n files based on the fragmentation size f, and the fragments of all dataset files were then grouped by position, forming smaller datasets (A₁, B₁, C₁, …, M₁), (A₂, B₂, C₂, …, M₂), (A₃, B₃, C₃, …, M₃), …, (A_n, B_n, C_n, …, M_n). Afterward, each group was examined as a sub-dataset (DS₁, DS₂, …, DS_n).

Python Seqkit was used to perform this fragmentation. The obtained sequence fragments were then processed as separate datasets, using KmerFIDF for feature extraction, and then the predictive models were obtained accordingly.
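The splitting and grouping described above can be sketched in plain Python. This does not reproduce the Seqkit calls used in the study; it assumes that each file is split into contiguous groups of sequence records and that the number of fragments is 1/f rounded to a whole number. The helper names are hypothetical.

```python
from typing import Dict, List


def split_into_fragments(records: List[str], f: float) -> List[List[str]]:
    """Split one file's records into n contiguous fragments, n = round(1 / f)."""
    n = max(1, round(1 / f))  # ASSUMPTION: whole number of fragments per file
    size = -(-len(records) // n)  # ceiling division
    return [records[i * size:(i + 1) * size] for i in range(n)]


def build_sub_datasets(files: Dict[str, List[str]], f: float) -> List[Dict[str, List[str]]]:
    """Group the i-th fragment of every file into sub-dataset DS_i."""
    n = max(1, round(1 / f))
    fragments = {name: split_into_fragments(recs, f) for name, recs in files.items()}
    return [{name: fragments[name][i] for name in files} for i in range(n)]


if __name__ == "__main__":
    files = {
        "A": [f"a{i}" for i in range(10)],
        "B": [f"b{i}" for i in range(10)],
    }
    sub_datasets = build_sub_datasets(files, f=0.2)
    print(len(sub_datasets))  # number of sub-datasets DS_1..DS_n
    print(sub_datasets[0])    # DS_1: first fragment of every file
```

Each element of the returned list corresponds to one sub-dataset DS_i, which can then go through feature extraction and classification independently.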

2.3.3. Analyzing Datasets Using a Deep Learning Model (Model_II)

The second model introduced in this work utilized long short-term memory (LSTM). Since LSTM leverages the sequential information being studied, it has a memory that captures what has been calculated so far. As per the literature, the LSTM network was built with embedding, dense, and activation layers [21,22]. The TensorFlow platform was utilized for the model implementation. The parameters were set to the values recommended by the literature, and otherwise to the default values of the development platforms [23].

The embedding layer facilitated the application of machine learning on the sparse vectors representing the examined terms, as it is a relatively low-dimensional space where high-dimensional vectors can be translated. After constructing the embedding vectors in the embedding layer, the output vector with a fixed length was then piped through a dense layer that represents the projection of the term into a continuous vector space. The dense layer was implemented with 16 hidden units as per the literature [24].

For activation, TensorFlow LSTM primarily uses the sigmoid as a non-linear activation function, which maps the complex input data to an output node value between 0.0 and 1.0 [25].
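The behavior of the sigmoid output can be illustrated with a small sketch in plain Python rather than TensorFlow. The 0.5 decision threshold and the function names are assumptions for illustration; the source does not state how the output value was converted into a class label.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function, with output in the range 0 to 1."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def predict_label(score: float, threshold: float = 0.5) -> int:
    """Map a raw output score to a binary class (ASSUMPTION: 0.5 threshold)."""
    return 1 if sigmoid(score) >= threshold else 0


if __name__ == "__main__":
    for score in (-3.0, 0.0, 3.0):
        print(score, round(sigmoid(score), 4), predict_label(score))
```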

In Model_II, the same preprocessing pipeline of Model_0 was applied, followed by the elaborated layers of LSTM. A dataset fragmentation step was also integrated with Model_II, as shown in Figure 8.

Model_II was examined with the original datasets before fragmentation. It was also studied with the sequences’ fragmentation to increase the dataset size and enhance the performance of Model_II, as elaborated on in detail in the following experiments section. Figure 8 summarizes the processing steps of the LSTM model.

3. Results

Ten experiments were conducted using the three studied datasets and the three implemented models. Table 4 summarizes the performed experiments and gives each experiment a label that shall be used in the following discussion section.

All datasets were split so that 70% were used for training and 30% for testing. Additionally, ten-fold cross-validation was applied. Each experiment was run ten times, and the average accuracy, sensitivity, specificity, precision, and F1 scores were calculated accordingly.
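For reference, the five reported scores can be derived from confusion-matrix counts as in the sketch below. It covers the binary case only; the source does not state how the scores were averaged across classes in the multi-class experiments, so that part is left out. The function name and the example counts are hypothetical.

```python
from typing import Dict


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard scores from true/false positive and negative counts."""
    sensitivity = tp / (tp + fn)          # recall, true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    for name, value in binary_metrics(tp=8, fp=2, tn=9, fn=1).items():
        print(f"{name}: {value:.3f}")
```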

Table 4. Summary of the conducted experiments and results.

Model	Dataset	Experiment Name	Objective	Description	Result
Model_I	Dataset_VI	Ex01	To determine the miRNA sequence fragment size that obtains the highest classification accuracy and retains the detection accuracy of the nonfragmented dataset; to determine the accuracy of early MS disease detection using the fragmentation model; to determine which sub-dataset (DS_i) can provide the highest detection accuracy.	Apply sequence fragmentation among the sub-datasets produced from Dataset_VI (DS₁–DS_n) with different fragmentation sizes f = (0.1, 0.2, 0.3, and 0.4) and evaluate the model accuracy.	The sequence fragmentation size that reported the highest detection accuracy was 0.2; the detection accuracy was 96.4; DS₁ was the sub-dataset that provided the highest detection accuracy (Figure 9 and Figure 10).
Model_I	Dataset_I	Ex02	To determine the miRNA sequence fragment size that obtains the highest classification accuracy on a different disease dataset (Dataset_I).	Apply sequence fragmentation among the sub-datasets produced from Dataset_I (DS₁–DS_n) with different fragmentation sizes f = (0.1, 0.2, 0.3, and 0.4) and evaluate the model accuracy.	The fragmentation size that reported the highest classification accuracy was f = 0.2, which confirms the results of EX01 (Figure 11).
Model_I	Dataset_I	Ex03	To detect the biomarkers of RA disease using Model_I.	Classifying Dataset_I into Cases, Controls, and Synthetic across the entire dataset.	The reported accuracy of RA biomarker detection was 95.7 (Figure 12).
Model_I	Dataset_I	Ex04	To determine the impact of library preparation kits on disease biomarker detection.	Classifying the NEBNEXT samples into (RA Case/Control/Synthetic) and the NEXTflex samples into (RA Case/Control/Synthetic) and comparing the resulting classification accuracy.	NEXTflex-prepared samples have a higher potential for detecting RA biomarkers (Figure 13).
Model_I	Dataset_I	Ex05	To evaluate the model with sensitive data.	Distinguish between synthetic data classes (A, B, C, D, and E) prepared by NEBNEXT/NEXTflex kits.	The classification accuracy scores reported for samples prepared by NEXTflex were relatively higher than for NEBNEXT (Figure 14). EX03, EX04, and EX05 were consolidated in Figure 15.
Model_I	Dataset_II	Ex06	To detect the biomarkers of MS disease	Classifying Dataset_II into MS Cases/Controls	Figure 16
Model_I	Dataset_VI	Ex07	To detect the biomarkers of MS disease and examine Model_0 with more extensive datasets.	Classifying Dataset_VI into MS Cases/Controls	Figure 16
Model_II	Dataset_VI	Ex08	To detect the biomarkers of MS disease.	Classifying Dataset_VI into MS Cases/Controls	Figure 16
Model_II	Dataset_VI (Fragmented)	Ex09	To determine the miRNA sequence fragment size that obtains the highest classification accuracy; to increase the number of samples in the studied dataset.	Run the model on Dataset_VI after fragmenting the sequences with fragmentation sizes of 10%, 20%, and 30% to multiply the number of analyzed sequence samples, and study the impact of increasing the number of samples on the disease detection accuracy of Model_II.	Figure 17 and Table 5
Model_0, Model_I, and Model_II	Dataset_VI	EX10	To have a comparative analysis of the execution time of the three models.	Apply the three proposed models to the exact dataset and analyze the execution times accordingly.	Figure 18

Figure 9. Examining each sub-dataset (DS_n) and determining the biomarker detection accuracy of each sub-dataset. Fragmentation sizes f = {0.1, 0.2, 0.3, and 0.4} were applied to determine which sub-dataset DS_n has the highest accuracy scores of biomarker detection.

Figure 10. The reported accuracy scores upon sequence fragmentation (f = 0.1, 0.2, 0.3, 0.4, 1) of applying Model_I to Dataset_VI (EX01).

Figure 11. The reported accuracy scores of biomarker detection upon sequence fragmentation (f = 0.1, 0.2, 0.3, 0.4, 1) using Model_I and Dataset_I (EX02).

Figure 12. Reported accuracy scores of applying Model_I on Dataset_I (EX03). Classifying Dataset_I into (a) RA cases; (b) controls; (c) synthetic samples.

Figure 13. Accuracy scores of biomarker detection of: (a) Applying Model_0 on the samples of Dataset_I prepared by NEBNEXT library (21 samples). (b) Applying Model_0 on the samples of Dataset_I prepared by NEXTflex (21 samples).

Figure 14. The resulting accuracy scores of EX05, classifying the synthetic samples of Dataset_I into A, B, C, D, and E classes. (a) Samples prepared by NEBNEXT (15 samples); (b) samples prepared by NEXTflex (15 samples).

Figure 15. Consolidated results of all experiments conducted on Model_I over Dataset_I (EX03, EX04, and EX05).

Figure 16. Obtained accuracy scores of MS biomarkers’ detection by: (a) applying Model_I on Dataset_II (EX06); (b) applying Model_I on Dataset_VI (EX07); (c) applying Model_II on Dataset_VI (EX08).

Figure 17. Obtained prediction accuracies of fragmenting Dataset_VI with fragmentation sizes of 0.1, 0.2, and 0.3 and applying Model_II with each resulting dataset (EX09).

Figure 18. The reported average execution times of: (a) studying Dataset_VI with Model_0; (b) applying Model_I, Model_II, and Model_II to fragmented Dataset_VI (fragmentation size = 0.2).

Table 5. The results of Ex09.

Sequence Fragmentation Size	Number of Dataset_VI Samples after Fragmentation	Prediction Accuracy
0.1	2390	0.61
0.2	1195	0.72
0.3	717	0.65

4. Discussion

Model_I was tested in EX01 to determine the feasibility of fragmenting the studied miRNA sequences and to find the minimum fragment size of the miRNA sequence that retains the model accuracy. Dataset_VI was used with this fragmentation model since it has the highest number of sequence samples and hence yields more credible results. Consequently, the sequences of Dataset_VI were fragmented with fragmentation sizes f = (0.1, 0.2, 0.3, 0.4, and 1), and no further fragmentation sizes between 0.4 and 1 were tested since the accuracy was already sustained.

Accordingly, Dataset_VI was subdivided using Model_I after determining the fragmentation size in each run. Afterward, each sub-dataset DS_x resulting from the fragmentation was examined with the prediction model, and the MS biomarker detection accuracy was reported for each DS_x.

The obtained accuracy scores of Model_I denote that the first sub-dataset, DS₁, obtained the highest prediction accuracy across all fragmentation sizes, as illustrated in Figure 9. This implies that the first sub-dataset of the fragmented files, DS₁, contains the most indicative disease biomarkers.

Additionally, the results of Ex01 denote that f = 0.2 is the minimum sequence fragmentation size that has enough distinctive biomarkers for MS disease detection. The scores reported for this fragmentation size are comparable to the scores obtained upon analyzing the entire dataset with no fragmentation (f = 1), whereas the scores start to diverge upon lowering the fragmentation size to 0.1. This result implies that considerable processing time can be saved when analyzing transcriptomic data using this fragmentation model (22.4% lower execution time, as elaborated on in EX10). Figure 10 illustrates the reported accuracy scores for the tested fragmentation sizes with Dataset_VI.

To ensure that the obtained results of Ex01 are not only related to MS disease, Ex02 was conducted to determine the fragmentation size f on a different disease dataset. Hence, Model_I was applied on Dataset_I to determine the f on RA disease. The obtained results were aligned with the findings of Ex01 and confirmed that DS₁ with f = 0.2 retains the prediction accuracy of the non-fragmented dataset (f = 1). As illustrated in Figure 11, fragmentation sizes were f = (0.1, 0.2, 0.3, 0.4, and 1), and no further fragmentation between 0.4 and 1 was tested since accuracy sustainability was reached.

Consequently, all following experiments that used the fragmentation model were applied using f = 0.2 and DS₁.

Hence, the results of EX01 and EX02 show that with a fragmentation size of f = 0.2, the first fragmented sub-dataset (DS₁) can be used for miRNA analysis and biomarker detection using the proposed models instead of running the analysis over the entire dataset being studied, which saves significant computational time as shall be discussed in EX10.

EX03 was conducted upon Dataset_I, using Model_I, and detected the biomarkers of RA disease by classifying the entire dataset, with no consideration of the library preparation kit type, into cases (six samples), controls (six samples), and synthetic samples (30 samples). The reported sensitivity, specificity, precision, accuracy, and F1 scores were 92.7, 92.8, 94.8, 95.7, and 95.2, respectively, as illustrated in Figure 12. These scores indicate the high potential of the introduced model in the early detection of RA biomarkers.

The objective of EX04 was to study the impact of the used samples’ preparation kits on the accuracy of the detection of the disease’s biomarkers (Model_I classified the samples into cases, controls, and synthetic samples).

Thus, Model_I was applied to Dataset_I twice, first with the samples prepared by NEBNEXT only and second with the samples prepared by NEXTFLEX only, and the detection accuracy scores were recorded accordingly.

As Figure 13 shows, the obtained results from Ex04 imply that the samples that are prepared using NEXTFLEX report higher accuracy scores of RA biomarker detection than the samples prepared by NEBNEXT.

In Ex05, the objective was to test Model_I on sensitive data. The synthetic data prepared by the NEBNEXT and NEXTFLEX kits were classified into classes A, B, C, D, and E. The accuracy scores reported in Ex05 were relatively low. However, the classification accuracy scores reported for samples prepared by NEXTFLEX were relatively higher, which supports the results of Ex04. The low scores can be explained by the fact that the data classified in this experiment are synthetic, and the biomarkers that distinguish classes A, B, C, D, and E are not very pronounced. Figure 14 illustrates the reported scores when classifying the data prepared by the NEBNEXT and NEXTFLEX kits. Figure 15 consolidates the obtained accuracy scores for Dataset_I with Model_I, indicating that the highest scores were obtained in EX04 using the NEXTFLEX preparation kit.

Consequently, the results of EX04 and EX05 denote that the transcriptomic samples prepared by NEXTFLEX obtained higher biomarker detection accuracy when used with Model_I.

The obtained results for Ex06 (Model_I with Dataset_II) and Ex07 (Model_0 with Dataset_VI) confirm the robustness of detecting MS disease biomarkers with both smaller and larger datasets. The results are also aligned with the accuracy scores in our previous work. Figure 16 summarizes the results of both EX06 and EX07.

On the other hand, the experiments applied on the deep learning model (Model_II) using Dataset_VI reported a relatively lower disease detection accuracy. This is explained by the fact that the number of studied samples (239 sequence files) was small compared to the number of features obtained from the dataset (hundreds of thousands). The sensitivity, specificity, precision, accuracy, and F1 scores of EX08 were 66.4, 66.4, 66.1, 68, and 67.8, respectively, as shown in Figure 16.

To increase the number of studied samples in the dataset, in Ex09, Dataset_VI was fragmented using the fragmentation technique of Model_I. The fragmentation sizes were f = (0.1, 0.2, and 0.3). The size f = 0.2 was chosen based on the results of EX01, and one step above and one step below this value (0.3 and 0.1) were tested as well. No further fragmentation sizes were tested since no increase in the detection accuracy was reported with the values 0.1 or 0.3.

After fragmentation, each fragment was treated as one miRNA sequence sample in the dataset used for the experiment (the first layer of fragmentation in Model_I, before grouping into sub-datasets was applied). This increase in the dataset size elevated the model accuracy score to 0.72. However, the enlarged dataset is still small compared to the number of extracted features.
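The sample counts in Table 5 follow from multiplying the 239 files of Dataset_VI by the number of fragments per file. A short sketch of this arithmetic is given below; it assumes only whole fragments are counted, which is inferred from the table (717 = 239 × 3 for f = 0.3) rather than stated in the text. The function name is hypothetical.

```python
def samples_after_fragmentation(n_files: int, f: float) -> int:
    """Number of samples when each file is cut into whole fragments of size f."""
    # ASSUMPTION: fragments per file = floor(1 / f); the small epsilon guards
    # against floating-point results such as 4.999999 for values like 1 / 0.2.
    fragments_per_file = int(1 / f + 1e-9)
    return n_files * fragments_per_file


if __name__ == "__main__":
    for f in (0.1, 0.2, 0.3):
        print(f, samples_after_fragmentation(239, f))
```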

Hence, the obtained results denote that the highest accuracy score of 0.72 reported with Model_II was with a fragmentation size of 0.2, implying that the resulting sequences of fragmentation size (0.2) have distinctive features, while reducing the fragmentation size to 0.1 significantly reduces the number of indicative features and consequently impacts the model’s prediction accuracy. On the other hand, setting the fragmentation size to 0.3 increases the number of analyzed features with a relatively small number of samples, reducing the prediction accuracy of Model_II. Table 5 summarizes the results of EX09. Additionally, the graph in Figure 17 demonstrates these results.

Furthermore, the execution times of the three studied models were compared in EX10. The three models processed the same dataset (Dataset_VI), and the execution times were monitored accordingly. The obtained results were 97, 75, 600, and 370 min for Model_0, Model_I, Model_II, and Model_II with fragmented Dataset_VI (f = 0.2), respectively. This denotes that Model_I has the lowest execution time, at 22.4% less than Model_0 and 87.5% less than Model_II. This is explained by the fact that Model_I analyzed smaller files with the traditional predictive model, whereas Model_0 examined the entire sequence and therefore took longer.
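The relative savings can be recomputed from the reported minutes with the sketch below; the function name is hypothetical. With the rounded values quoted above, the Model_I versus Model_II saving evaluates to exactly 87.5%, while the Model_I versus Model_0 saving evaluates to roughly 22.7%, close to the reported 22.4%. The small gap may come from rounding of the reported timings, which is an assumption and not something the source states.

```python
def relative_saving(baseline_min: float, new_min: float) -> float:
    """Percentage reduction in execution time relative to a baseline."""
    return 100.0 * (baseline_min - new_min) / baseline_min


if __name__ == "__main__":
    # Reported average execution times in minutes (EX10).
    model_0, model_i, model_ii = 97, 75, 600
    print(f"Model_I vs Model_0:  {relative_saving(model_0, model_i):.1f}%")
    print(f"Model_I vs Model_II: {relative_saving(model_ii, model_i):.1f}%")
```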

These results imply that Model_I retained the high accuracy scores of the detection of diseases’ biomarkers seen in Model_0 with 22.4% less execution time.

On the other hand, Model_II had the highest execution time, considering the file sizes analyzed with this model. Additionally, Model_II displayed a lower execution time on the fragmented Dataset_VI since the LSTM network analyzed smaller fragments. Figure 18 summarizes the execution times obtained in EX10.

Finally, prior work that analyzed Dataset_II using conventional univariate/multivariate modeling for feature extraction and RF as the predictive model reported highest average accuracy scores of 0.77 and 0.91 [12,18].

Hence, the prediction accuracy scores of Model_0 (published earlier) and Model_I (introduced in this study) outperform the results reported in the literature on the same dataset (EX06). Model_II, on the other hand, reported lower prediction accuracy than the literature; as explained earlier, this can be attributed to the small size of the data analyzed with the deep learning model (EX10). Figure 19 summarizes the comparison of the average accuracy scores reported by Model_0, Model_I, Model_II, and the literature work.

5. Conclusions and Future Work

This work introduced two models (Model_I and Model_II) for the early analysis and detection of diseases. Both are based on the model the authors published previously (Model_0). Model_I combines the miRNA fragmentation method with Model_0 to retain high disease detection accuracy while reducing the processing time and, consequently, the required hardware specifications. Model_II, on the other hand, combines an LSTM model with the preprocessing pipeline of Model_0 and the fragmentation method of Model_I. Additional experimental work was applied on the three different datasets. The reported accuracy of MS and RA biomarker detection using Model_I was 96.5% and 95.7%, respectively. Additionally, the study findings denote that a fragmentation size of 0.2, using the first subset of the fragmented dataset, retains the same accuracy as analyzing an entire miRNA sequence while reducing the execution time by 22.4% compared to Model_0 and by 87.5% compared to Model_II. In addition, the experimental results imply that samples prepared with NEXTFLEX yield better disease detection when processed using the introduced models.

Furthermore, the additional experimental work applied on Model_0 supported the previously published results, with an early MS detection accuracy of 96.5%. Finally, the results obtained by the deep learning model (Model_II), a detection accuracy of 0.72, indicate its high potential for early MS detection given the limited size of the studied dataset. In future work, Model_II will be evaluated on a dataset whose number of samples is high relative to the number of extracted features. In addition, the three introduced models will be studied with other diseases.

6. Limitations

A dataset with thousands of transcriptomic data samples is required to provide more confidence in the results obtained from Model_II. In this study, Dataset_VI was shown to be the most extensive available miRNA dataset of MS disease on NCBI in terms of sample counts. In addition, the dataset used to study RA had a limited number of samples, and an extensive transcriptomic RA dataset could not be used in this work. Further experimental work will be conducted to explain the results of EX01 and why the first parts of the studied files contain more indicative disease biomarkers.

Author Contributions

Conceptualization, N.M.A. and M.S.M.; methodology, N.M.A.; software, N.M.A.; validation, N.M.A., M.S. and M.S.M.; formal analysis, N.M.A.; investigation, N.M.A.; resources, N.M.A.; data curation, N.M.A.; writing—original draft preparation, N.M.A.; writing—review and editing, N.M.A.; visualization, N.M.A.; supervision, M.S. and M.S.M.; project administration, M.A.; funding acquisition, M.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Study datasets are available at https://www.ncbi.nlm.nih.gov/bioproject/PRJNA594317, https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA588268, and https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA514238 (all accessed on 10 March 2022).

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Grothe, M.; Ellenberger, D.; von Podewils, F.; Stahmann, A.; Rommer, P.S.; Zettl, U.K. Epilepsy as a Predictor of Disease Progression in Multiple Sclerosis. Mult. Scler. J. 2021.
2. Mo, J.J.; Zhang, W.; Wen, Q.W.; Wang, T.H.; Qin, W.; Zhang, Z.; Huang, H.; Cen, H.; Wu, X. Di Genetic Association Analysis of ATG16L1 Rs2241880, Rs6758317 and ATG16L2 Rs11235604 Polymorphisms with Rheumatoid Arthritis in a Chinese Population. Int. Immunopharmacol. 2021, 93, 107378.
3. Lo, J.; Chan, L.; Flynn, S. A Systematic Review of the Incidence, Prevalence, Costs, and Activity and Work Limitations of Amputation, Osteoarthritis, Rheumatoid Arthritis, Back Pain, Multiple Sclerosis, Spinal Cord Injury, Stroke, and Traumatic Brain Injury in the United States: A 2019 Update. Arch. Phys. Med. Rehabil. 2021, 102, 115–131.
4. Schorr, E.M.; Kurz, D.; Rossi, K.C.; Zhang, M.; Yeshokumar, A.K.; Jette, N.; Dhamoon, M.S. Depression Readmission Risk Is Elevated in Multiple Sclerosis Compared to Other Chronic Illnesses. Mult. Scler. J. 2022, 28, 139–148.
5. Olivares, D.; Perez-Hernandez, J.; Perez-Gil, D.; Chaves, F.J.; Redon, J.; Cortes, R. Optimization of Small RNA Library Preparation Protocol from Human Urinary Exosomes. J. Transl. Med. 2020, 18, 132.
6. El Hamid, M.M.A.; Ali, N.M.; Saad, M.N.; Mabrouk, M.S.; Shaker, O.G. Multiple Sclerosis: An Associated Single-Nucleotide Polymorphism Study on Egyptian Population. Netw. Model. Anal. Health Inform. Bioinform. 2020, 9, 48.
7. Li, G.; Zhu, N.; Zhou, J.; Kang, K.; Zhou, X.; Ying, B.; Yi, Q.; Wu, Y. A Magnetic Surface-Enhanced Raman Scattering Platform for Performing Successive Breast Cancer Exosome Isolation and Analysis. J. Mater. Chem. B 2021, 9, 2709–2716.
8. Zhang, Z.; Liu, D.; Wang, D.; Wu, Q. Library Preparation Based on Transposase Assisted RNA/DNA Hybrid Co-Tagmentation for Next-Generation Sequencing of Human Noroviruses. Viruses 2021, 13, 65.
9. Shtratnikova, V.; Naumov, V.; Bezuglov, V.; Zheludkevich, A.; Smigulina, L.; Dikov, Y.; Denisova, T.; Suvorov, A.; Pilsner, J.R.; Hauser, R.; et al. Optimization of Small RNA Extraction and Comparative Study of NGS Library Preparation from Low Count Sperm Samples. Syst. Biol. Reprod. Med. 2021, 67, 230–243.
10. Raymond-Bouchard, I.; Maggiori, C.; Brennan, L.; Altshuler, I.; Manchado, J.M.; Parro, V.; Whyte, L.G. Assessment of Automated Nucleic Acid Extraction Systems in Combination with MinION Sequencing As Potential Tools for the Detection of Microbial Biosignatures. Astrobiology 2022, 22, 87–103.
11. Ali, N.M.; Shaheen, M.; Mabrouk, M.S.; Aborizka, M.A. A Novel Approach of Transcriptomic MicroRNA Analysis Using Text Mining Methods: An Early Detection of Multiple Sclerosis Disease. IEEE Access 2021, 9, 120024–120033.
12. Heinicke, F.; Zhong, X.; Zucknick, M.; Breidenbach, J.; Sundaram, A.Y.M.; Flåm, S.T.; Leithaug, M.; Dalland, M.; Rayner, S.; Lie, B.A.; et al. An Extension to: Systematic Assessment of Commercially Available Low-Input MiRNA Library Preparation Kits. RNA Biol. 2020, 17, 1284–1292.
13. Kapp, J.D.; Green, R.E.; Shapiro, B. A Fast and Efficient Single-Stranded Genomic Library Preparation Method Optimized for Ancient DNA. J. Hered. 2021, 2021, 1–9.
14. Psonis, N.; Vassou, D.; Kafetzopoulos, D. Testing a Series of Modifications on Genomic Library Preparation Methods for Ancient or Degraded DNA. Anal. Biochem. 2021, 623, 114193.
15. Hu, T.; Chitnis, N.; Monos, D.; Dinh, A. Next-Generation Sequencing Technologies: An Overview. Hum. Immunol. 2021, 82, 801–811.
16. Shi, H.; Zhou, Y.; Jia, E.; Pan, M.; Bai, Y.; Ge, Q. Bias in RNA-Seq Library Preparation: Current Challenges and Solutions. BioMed Res. Int. 2021, 2021, 6647597.
17. Ebrahimkhani, S.; Beadnall, H.N.; Wang, C.; Suter, C.M.; Barnett, M.H.; Buckland, M.E.; Vafaee, F. Serum Exosome MicroRNAs Predict Multiple Sclerosis Disease Activity after Fingolimod Treatment. Mol. Neurobiol. 2020, 57, 1245–1258.
18. Baulina, N.; Osmak, G.; Kiselev, I.; Popova, E.; Boyko, A.; Kulakova, O.; Favorova, O. MiRNAs from DLK1-DIO3 Imprinted Locus at 14q32 Are Associated with Multiple Sclerosis: Gender-Specific Expression and Regulation of Receptor Tyrosine Kinases Signaling. Cells 2019, 8, 133.
19. Mohamed Ali, N.; El Hamid, M.M.A.; Youssif, A. Sentiment analysis for movies reviews dataset using deep learning models. Int. J. Data Min. Knowl. Manag. Process 2019.
20. Saif, R.; Ejaz, A.; Mahmood, T.; Zia, S. Differential Gene Expression Pipeline for Whole Transcriptome RNA-Seq Data Using Personal Computer. bioRxiv 2021, bioRxiv:2021.01.26.428352.
21. Taghavi Namin, S.; Esmaeilzadeh, M.; Najafi, M.; Brown, T.B.; Borevitz, J.O. Deep Phenotyping: Deep Learning for Temporal Phenotype/Genotype Classification. Plant Methods 2018, 14, 1–14.
22. Xia, J.; Pan, S.; Zhu, M.; Cai, G.; Yan, M.; Su, Q.; Yan, J.; Ning, G.; Duggento, A. A Long Short-Term Memory Ensemble Approach for Improving the Outcome Prediction in Intensive Care Unit. Comput. Math. Methods Med. 2019, 2019, 8152713.
23. Haghighat, E.; Juanes, R. SciANN: A Keras/TensorFlow Wrapper for Scientific Computations and Physics-Informed Deep Learning Using Artificial Neural Networks. Comput. Methods Appl. Mech. Eng. 2021, 373, 113552.
24. Sherstinsky, A. Fundamentals of Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) Network. Phys. D Nonlinear Phenom. 2020, 404, 132306.
25. Xiao, Y.; Yin, H.; Zhang, Y.; Qi, H.; Zhang, Y.; Liu, Z. A Dual-Stage Attention-Based Conv-LSTM Network for Spatio-Temporal Correlation and Multivariate Time Series Prediction. Int. J. Intell. Syst. 2021, 36, 2036–2057.

Figure 1. Dataset_I samples’ class distribution. In total, 50% of the samples were prepared using the NEBNEXT kit and 50% using the NEXTFLEX kit. Dataset_I is composed of: (a) six RA case samples; (b) six controls; (c) thirty synthetic samples categorized into five classes (A, B, C, D, and E), each consisting of six samples.

Figure 2. Sample distribution of Dataset_II (MS Disease): (a) 110 samples of MS patients treated by Fingolimod; (b) 105 samples of MS patients before Fingolimod treatment.

Figure 3. Sample distribution of Dataset_III (MS Disease): (a) 12 controls; (b) 12 MS cases.

Figure 4. Sample distribution of Dataset_VI (created by combining Dataset_II and Dataset_III). Dataset_VI consists of: (a) 12 samples of controls; (b) 110 samples of treated MS patients; (c) 117 samples of untreated MS patients.

Figure 5. The implemented model using traditional predictive machine learning methods (Model_0).

Figure 6. The proposed model using miRNA sequence fragmentation (Model_I).

Figure 7. Sequence file fragmentation step of Model_I.

Figure 8. The proposed LSTM model used in miRNA sequences’ analysis (Model_II).

Figure 19. The average accuracy scores reported for Dataset_II by: (a) literature work; (b) Model_0 (the base model); (c) Model_I; (d) Model_II.

Table 1. Key specifications of the four studied datasets. (Dataset_VI was created by combining Dataset_II and Dataset_III).

Parameter	Dataset_I	Dataset_II (part of Dataset_VI)	Dataset_III (part of Dataset_VI)
BioProject	PRJNA594317	PRJNA588268	PRJNA514238
Organism	Synthetic construct; Homo Sapiens	Homo Sapiens	Homo Sapiens
Datastore filetype	FASTQ, SRA	FASTQ, SRA	FASTQ, SRA
Datastore provider	GS, NCBI, S3	GS, NCBI, S3	GS, NCBI, S3
Library Source	Transcriptomic	Transcriptomic	Transcriptomic
Instrument	Illumina HiSeq 2500	Illumina HiSeq 2000	Illumina MiSeq
Library Layout	SINGLE	SINGLE	SINGLE
Number of samples	42	215	24
Number of cases	6	105 (before treatment)	12
Number of Controls	6	110 (after treatment)	12
Number of Synthesized samples	30	N/A	N/A
Disease	RA	MS	MS
Registration Date	9 December 2019	7 November 2019	23 January 2019
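As a consistency check on how Dataset_VI is assembled, the per-class counts of Dataset_II and Dataset_III can be summed and compared with the distribution given in the caption of Figure 4. This is only a sketch: the class label strings are hypothetical, and it assumes that the 12 MS cases of Dataset_III are counted as untreated, which is what the totals in Figure 4 suggest.

```python
from collections import Counter

# Per-class sample counts taken from Table 1 (label names are illustrative).
dataset_ii = Counter({"MS_untreated": 105, "MS_treated": 110})
dataset_iii = Counter({"control": 12, "MS_untreated": 12})

# Counter addition sums the counts class by class.
dataset_vi = dataset_ii + dataset_iii

print(dict(dataset_vi))
print("total samples:", sum(dataset_vi.values()))
```

The combined counts (12 controls, 110 treated, 117 untreated) and the total of 239 samples agree with 215 + 24 from Table 1.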

Table 2. Specifications of development environment.

Specification	Value
Processor	TPU
RAM	35.35 GB
Disk Space	2 TB
Data Storage space	Google Drive
Development platform	Python 3.6

Table 3. Libraries used in the implementation.

Implementation Step	Tool/Library Used
Sequence files’ download	pysradb
Sequences’ quality control	FastQC
Sequences’ trimming	Trimmomatic
Fastq files’ conversion and manipulation	Biopython
Files’ fragmentation	seqkit
Random forest model implementation	scikit-learn (sklearn)
LSTM	TensorFlow

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
