Article

The Design and Implementation of an Improved Lightweight BLASTP on CUDA GPU

1 College of Urban Rail Transit and Logistics, Beijing Union University, Beijing 100101, China
2 Joint Jiaotong College of Beijing Union University & Russian University of Transport, Beijing Union University, Beijing 100101, China
3 Department of Computer Science and Information Engineering, National Changhua University of Education, Changhua 50007, Taiwan
4 National Chung-Shan Institute of Science & Technology, Taoyuan 32546, Taiwan
* Author to whom correspondence should be addressed.
Symmetry 2021, 13(12), 2385; https://doi.org/10.3390/sym13122385
Submission received: 8 September 2021 / Revised: 12 November 2021 / Accepted: 2 December 2021 / Published: 10 December 2021

Abstract:
In the field of computational biology, sequence alignment is a fundamental methodology. BLAST, provided by the National Center for Biotechnology Information (NCBI) in the USA, is one of the most widely used tools for performing sequence alignment in bioinformatics. The BLAST server receives tens of thousands of queries every day on average. Among the procedures of BLAST, the hit detection process, whose core data structure is a lookup table, is the most time-consuming. In the latest work, a lightweight BLASTP on a CUDA GPU with a hybrid query-index table was proposed to serve query sequences shorter than 512, which effectively improved query efficiency. According to the reported protein sequence length distribution, about 90% of sequences are of length 1024 or smaller. In this paper, we propose an improved lightweight BLASTP to speed up hit detection for longer query sequences. The maximum sequence length is enlarged from 512 to 1024. As a result, one more bit is required to encode each sequence position. To meet this requirement, an extended hybrid query-index table (EHQIT) is proposed to accommodate three sequence positions in a four-byte table entry, so that a single memory access suffices to retrieve all the position information as long as the number of hits is three or fewer. Moreover, if there are more than three hits for a possible word, all the position information is stored in contiguous table entries, which eliminates branch divergence and reduces the memory space needed for pointers to an overflow buffer. A square symmetric scoring matrix, Blosum62, is used to determine the relative score obtained by matching two characters in a sequence alignment. The experimental results show that, for queries shorter than 512, our improved lightweight BLASTP outperforms the original lightweight BLASTP with a speedup of 1.2 on average. When the number of hit overflows increases, the speedup can be as high as two.
For queries shorter than 1024, our improved lightweight BLASTP provides speedups ranging from 1.56 to 3.08 over CUDA-BLASTP. In short, the improved lightweight BLASTP can replace the original one because it supports longer query sequences and provides better performance.

1. Introduction

In order to cope with the explosive, high-dimensional growth of biological information, biological databases and analysis tools are critical for bioinformatics research. Among them, sequence alignment is a very important methodology in the field of computational biology [1,2]. The purpose of sequence alignment is to infer the structural, functional, and evolutionary relationships between sequences of nucleic acids or proteins by identifying regions of similarity [3]. Traditionally, the Smith–Waterman algorithm, first proposed by Temple F. Smith and Michael S. Waterman in 1981, has been applied to perform local sequence alignment [4]. However, because the Smith–Waterman algorithm is very time-consuming [5,6], NCBI (the National Center for Biotechnology Information) in the USA provides a rapid processing tool called BLAST (Basic Local Alignment Search Tool), which employs embedded input-sensitive heuristics [7,8]. In the past 20 years, BLAST has accumulated over 118,000 citations, and the BLAST server at NCBI receives tens of thousands of queries every day on average [9,10]. Moreover, users can also download the executable files from the NCBI website to conduct BLAST searches on their own servers [11]. Therefore, any algorithmic improvement that accelerates the search speed will have a wide and high impact on bioinformatics.
Today’s complex genomic research requires deeper information beyond the capabilities of traditional DNA/RNA or protein sequencing technologies. In addition, the size of sequence databases has increased rapidly, leading to an urgent need for innovative technologies that improve the query efficiency of BLAST. Massively parallel techniques have been applied to meet the demands of next-generation sequencing (NGS), including ScalaBLAST on CPU clusters [12], muBLASTP on a multicore CPU [13], SparkBLAST on clouds [14], and CUDA-BLASTP, GPU-BLAST, H-BLAST, and cuBLASTP on many-core GPUs (Graphics Processing Units) [9,15,16,17,18,19,20,21].
Modern GPUs are very suitable for processing a large amount of uniform data with millions of parallel threads. At present, GPUs are widely used in various application domains and can be used to solve many practical and complicated problems, including sequence alignment in the field of bioinformatics. In particular, the Compute Unified Device Architecture (CUDA) [22,23,24,25] has made GPU programming much easier. The main GPU optimization techniques include reducing bank conflicts and branch divergence and enhancing memory coalescing [26].
Four stages are performed sequentially in BLAST: seed generation, ungapped extension, gapped extension, and alignment with traceback. There are many examples to accelerate BLAST on GPU architecture. CUDA-BLASTP [18] and GPU-BLAST [19] mainly exploit coarse-grained parallelism in which a sequence alignment is mapped to only one thread. CUDA-BLASTP optimizes the Deterministic Finite-state Automaton (DFA), a two-level lookup table proposed in FSA-BLAST [27,28]. In GPU-BLAST, a vector of bits is allocated in shared memory to store the information about each possible sequence word. H-BLAST [9] and cuBLASTP [15,16], on the other hand, exploit fine-grained parallelism to execute each stage of BLAST with multiple threads. H-BLAST integrates the computing powers of the CPU, as well as the GPU; cuBLASTP reduces branch divergence in the seed generation and ungapped extension stages, and it also reorders data access patterns with multiple buffers for better performance.
NCBI BLAST is actually a family consisting of several members. For instance, BLASTP is for protein alignment. According to the study of NCBI BLASTP [19], 75% of the execution time is spent on the two stages of seed generation and ungapped extension. Similarly, when NCBI BLASTX is performed, these two stages account for about 85% of the entire execution time [9]. Consequently, in order to use GPU parallel computing to accelerate sequence alignment, researchers mainly focus on optimizing the first two stages of BLAST, including H-BLAST, cuBLASTP, CUDA-BLASTP, and GPU-BLAST. It should be emphasized that the hit detection process, belonging to the seed generation stage, accounts for 77% of the total execution time of the first two stages [29]. Therefore, the design of a lookup table for hit detection is very critical for algorithm efficiency.
In recent work [26], a lightweight BLASTP was proposed especially for queries of lengths smaller than 512. Because the hit detection process is the most time-consuming procedure in BLASTP and the core structure of the process is a lookup table, the lightweight BLASTP aims at designing a much smaller lookup table that increases table utilization and lowers the cache miss ratio. Experimental results show that, for the first two stages, the lightweight BLASTP outperforms CUDA-BLASTP with a speedup of 3.37. Moreover, a Compact BLASTP was also proposed in the same work [26], especially for query sequences of lengths smaller than 129. The size of its lookup table can be further reduced, which is more suitable for exploiting the power of the fast but small shared memory on a CUDA GPU.
In this paper, we extend the service scope of the lightweight BLASTP by enlarging the maximum length from 512 to 1024, which can cover about 90% of protein sequences. However, when the maximum length is increased to 1024, each sequence position requires one more bit to encode, so that one memory word can no longer accommodate three positions. Therefore, we need to redesign the hybrid query table to avoid additional memory accesses. In the experiments, we adopted Blosum62, a square symmetric scoring matrix, to determine the relative score obtained by matching two characters in a sequence alignment. Experimental results show that our improved lightweight BLASTP can provide speedups ranging from 1.56 to 3.08 over CUDA-BLASTP. Moreover, it is worth noting that for queries of length equal to or smaller than 512, both the new and the original versions of the lightweight BLASTP can perform the sequence alignment. According to the experimental results, at best our new BLASTP runs two times faster than the original lightweight BLASTP when there are many hit overflows. In other words, the new version not only expands the service coverage to longer sequences but also provides better performance for the shorter sequences supported in the original version. Therefore, the improved lightweight BLASTP proposed in this paper can fully replace the original lightweight BLASTP [26].
The paper is organized as follows. Section 2 presents the related work. Section 3 introduces the design of our improved lightweight BLASTP. Section 4 carries on the implementation of how to parallelize our improved lightweight BLASTP on CUDA GPU. Section 5 demonstrates the experimental results and analyzes the performance. Section 6 gives the conclusions of the paper.

2. Related Work

2.1. CUDA GPU

With the rapid development of GPUs, more and more researchers use GPUs to solve many kinds of application problems; this type of GPU is called a general-purpose GPU (GPGPU) [30]. GPGPU adopts the heterogeneous model of CPU + GPU, in which the GPU is considered the co-processor of the CPU. A GPU is mainly responsible for the parallelization of computationally intensive large-scale data, while a CPU mainly handles logic and transaction processing that is not suitable for data parallelization.
CUDA is a general-purpose computing platform proposed by NVIDIA in 2006, which uses the Single Instruction Multiple Threads (SIMT) architecture to divide a job into several sub-jobs and assign them to many threads for parallel processing, greatly improving system performance [23,24,25]. Compared with previous platforms, the main difference of CUDA is the use of C language libraries and CUDA functions to write programs, coupled with many powerful interfaces and a low entry barrier, making it relatively easy to develop GPU parallel programs. With CUDA, NVIDIA GPUs have been widely used in various fields. A CUDA program is mainly divided into two parts: Host and Device. During the execution of the program, the Host side pre-processes the data required by the Device side and then copies them to the memory of the Device side. After the program is executed, the computational results are returned from the Device to the Host; each subroutine that performs parallel calculation inside the Device is called a kernel. A kernel is organized in the form of a Grid composed of a one- or two-dimensional array of thread blocks. Every 32 threads in each block are grouped into a warp and execute the same instruction at the same time. Branch divergence should therefore be avoided when designing CUDA programs. Data transfer between blocks must be carried out through the global memories in the CUDA framework.
There are many different types of memories on a CUDA GPU [31]. Global Memory, as a bridge of communication between host and device, is the most used memory on the GPU. It can be accessed by all threads in the grid but requires a long access latency of about 300–400 cycles. Therefore, two optimization techniques, Aligned Memory Access and Coalesced Memory Access, should be paid attention to when using Global Memory. Both Texture Memory and Constant Memory can realize communication between the host and the device like Global Memory, but they are both read-only. Texture Memory, with its own cache, can achieve good efficiency without conforming to coalesced memory access. Since Constant Memory sends data to threads in the form of a broadcast, better performance can be obtained when all threads read the same block of data. Shared Memory enables threads in the same block to cooperate and exchange data quickly. Since Shared Memory is divided into 16 banks of equal size, under the premise of avoiding bank conflicts, its access speed can reach 3–4 cycles, which is faster than the other memories and even close to the fastest, the Register. Each thread has its own Local Memory, whose performance is like Global Memory; data are spilled to Local Memory only when the number of Registers a thread needs exceeds the limit or the memory occupied by variables is too large. The last one is the Register, which has the fastest access among all the memories on the GPU. Each thread has its own Registers. When the Registers are not enough, the system automatically spills to the lower-performance Local Memory.

2.2. BLAST

In bioinformatics, sequence alignment is the use of a specific mathematical model or algorithm to find the maximum matching nucleotide or amino acid residues between two or more sequences. Sequence alignment is widely applied in gene prediction, analysis of the function of genes or proteins, analysis of species evolution, and detection of mutation, insertion, or loss. The Smith–Waterman algorithm for local sequence alignment is a common method [4,5,6], but it takes a long time to query large-scale datasets due to its quadratic complexity [26]. In order to remedy the long runtime of the Smith–Waterman algorithm, and in response to the explosive, high-dimensional growth of biological information, the sequence analysis tool BLAST, developed by NCBI in the USA, is currently the most widely used database search algorithm. In line with the different types of sequences, BLAST includes BLASTP, BLASTX, BLASTN, TBLASTN, and TBLASTX [7,8]. By comparing a protein or nucleotide query sequence with a library or database of subject sequences, BLAST uses a filtration-based heuristic to quickly determine the level of homology between them. The heuristic is based on an important observation: a good alignment always contains a short exact or high-scoring match between the query sequence and the subject sequence, which is called a hit. Consequently, BLAST can provide fast processing speed and high alignment accuracy.
The algorithm of BLAST is commonly divided into four stages: seed generation, ungapped extension, gapped extension, and traceback. As shown in Figure 1, in the pre-processing the query sequence is converted into a table, and fixed-length overlapping subsequences of length W are extracted from the query sequence and every possible subject sequence. In Stage 1, BLAST compares the query sequence, organized as a query index, with every possible subject sequence in the hit detection process to discern the matching words (i.e., hits). After the comparison in the seed generation of Stage 1, most of the impossible solutions can be eliminated. Stage 2 determines, via a virtual matrix, whether two or more hits on the same diagonal can become a local alignment without any insertion or deletion. This process finds the ungapped alignments with scores higher than a threshold and then passes the resulting high-scoring segment pairs (HSPs) to the next stage. In Stage 3, the Smith–Waterman algorithm is applied to the HSPs obtained in the previous stage, where gaps are added when necessary in order to reach higher scores. The results of this stage are called High Scoring Alignments (HSAs). The last stage, Stage 4, performs a traceback algorithm on the results from the gapped extension of Stage 3, and the top scores are reported to the user.
According to the analysis of the experimental results of CUDA-BLASTP in the literature [29], when the smaller database SWISS-PROT is used, the average execution time of data transfer between CPU and GPU accounts for 47.3% of the total execution time, and the kernel calculation time in Stage 1 and Stage 2 takes up to 51.8% of the total execution time. However, when using the larger database env_nr [32], data movement only consumes 32% of the execution time, while the kernel calculation in Stage 1 and Stage 2 takes up to 66.8%. As the size of the sequence database grows, the percentage of the total execution time consumed by seed generation and ungapped extension is expected to increase significantly. Especially with the emergence of NGS technology, the execution time of the first two stages of BLAST will profoundly affect the overall query efficiency.
Moreover, further analysis shows that the hit detection time of Stage 1 consumes 77% of the kernel execution time on average. When the length of the query sequence is smaller than 256, more than 90% of the execution time is used for hit detection. It can be noted that hit detection spends the most execution time among the four stages, and the shorter the query sequence, the larger the fraction of time it consumes. Therefore, our research is mainly aimed at improving the hit detection algorithm of Stage 1, thereby improving the performance of seed generation.

2.3. Hit Detection

Hit detection is also known as the seed generation of Stage 1. Generally, at this stage, the Query sequence demanded by users and the Subject sequence in the database are divided into words of 11 characters for DNA/RNA or three characters for proteins. Then comparisons between the words of the query and the subject are conducted with the help of an alignment-scoring matrix (e.g., BLOSUM62). When the final score is equal to or greater than a threshold, the word pair is considered a hit. As an example shown in Figure 2, there is a query sequence “FGHDEGF…” and a subject sequence “EHDFGED…”, and the sequences are divided into three-character words. Next, each word in the query sequence is compared with all words in the subject sequence one by one through the alignment-scoring matrix. Word pairs with scores higher than a threshold are recorded as hits (e.g., Word 2 in Query and Word 1 in Subject, as shown in Figure 3, assuming the threshold is 13). In this way, all the hits will be passed to the next stage.
The time complexity of the above word-comparison algorithm is O(n²). In order to reduce the execution time of hit detection, a lookup table is constructed from the query sequence, which enables hash-based matching during preprocessing in Stage 1. The lookup table can be a query-index table or a deterministic finite automaton (DFA) table [9,15,16,18,19,27,28]. FSA-BLAST proposed the use of a DFA to speed up the process of hit detection [27,28]. However, the original DFA requires an extra pointer for each word, which takes up a lot of storage space; on average, pointers can account for more than half of the entire DFA size. FSA-BLAST addresses this problem by improving the original DFA algorithm of BLAST [26]. There are two pointers for each transition: one points to the words that share a common prefix, and the other points to the next state. FSA-BLAST rearranges the query positions of each word so that words with the same prefix can share position information as much as possible, reducing the size of the lookup structure in advance. CUDA-BLASTP [29] put forward the idea of applying the DFA on an NVIDIA GPU to solve BLASTP. In addition to applying GPU parallel computing, CUDA-BLASTP optimizes the DFA lookup table according to the properties of protein sequences.

2.4. Lightweight BLASTP

Because the hit detection process in CUDA-BLASTP consumes a large share of the execution time, and most protein sequences are no longer than 512, Huang et al. [26] recently proposed lightweight BLASTP, which cuts down the hit detection time with a hybrid query-index table (HQIT), especially for serving not-too-long queries. There are four improvements in lightweight BLASTP. Firstly, each query position is encoded with nine bits instead of 16 bits according to the longest query length supported, making the lookup table much smaller. Secondly, each table entry is composed of four bytes that store up to three query positions. According to the statistics, most queries shorter than 512 have three or fewer hits for each possible word, so only one memory fetch is enough to extract all hit information for a word. Thirdly, in the original BLAST, many empty entries in the query location slots are never used. In order to decrease the storage space required for overflow and improve the utilization rate of the lookup table, lightweight BLASTP uses empty entries to buffer spilled query locations. Fourthly, dummy entries for buffering more overflows are embodied in the table and interleaved with valid entries.
Moreover, a Compact BLASTP is also proposed for serving the queries with a length shorter than 129 in the work based on the lightweight BLAST [26]. Compact BLASTP uses a condensed hybrid query-index table (CHQIT) that is much smaller than the HQIT in lightweight BLASTP, which can be fit into the faster shared memory on a GPU.

3. Our Improved Lightweight BLASTP

The lightweight BLASTP based on a much smaller lookup table HQIT can effectively improve the query efficiency especially for the sequences of lengths equal to or smaller than 512. In this paper, we aim at proposing an improved lightweight BLASTP to extend service scope: the maximum query sequence length is increased from 512 to 1024. According to the protein sequence length distribution, about 90% of protein sequences are smaller than 1024, as shown in Figure 4 [33].
When the maximum query sequence length is increased from 512 to 1024, one more bit is needed to encode each word position. As a result, the four-byte entry of the original lightweight BLASTP can no longer accommodate three word positions in one memory word, which would cause many additional memory accesses. To address this problem, we need to modify the original design.
Our proposed extended hybrid query-index table (EHQIT) is based on HQIT in the lightweight BLASTP [26], similar to the query index table of GPU-BLAST [19]. Only one EHQIT table is sufficient for hit detection, instead of multiple tables required by DFA. The new data structure of the index table we proposed is shown in Figure 5.
Each table entry is composed of five fields: Flag #1, Flag #2, Position #1, Position #2, and Position #3, as shown in Figure 5. Flags #1 and #2, each consisting of one bit, are used to identify different statuses. Positions #1, #2, and #3 store word positions on the query sequence, and each position is encoded with 10 bits because the maximum query length is 1024.
EHQIT consists of two types of areas, valid and dummy, which are interleaved in the table. Valid areas store the entries corresponding to all possible words. In total, there are 20 × 20 × 20 = 8000 possible three-letter words, because there are 20 different amino acids and each is represented with one letter. Each entry in a valid area stores the position information of the query sequence for a possible word, and can be accessed directly via the three letters of the word. Each valid area entry can store at most three positions. If the number of hits is larger than three, an overflow takes place and the spilled hit information is stored in dummy areas. The spilled position information for the same word is stored contiguously in dummy area entries and can be accessed via a link from the valid area. Unlike GPU-BLAST, which requires an overflow buffer in addition to the index table, our design needs only one table. In GPU-BLAST there is an extra table for storing the overflow information whenever there are more than three hits for a query sequence word. In that design, the overflow table is shared by all query sequence words, and pointers in the index table link the words to their spilled position information in the overflow table. In a conventional design, a pointer consists of four or even eight bytes for storing memory addresses, which enlarges the index table and thus increases the cache miss ratio. We instead partition the index table and the overflow table into valid areas and dummy areas, respectively, and then interleave these two types of areas in one integrated table. Spilled position information for a word is placed in a dummy area next to the valid area in which the word resides. An offset between a valid area entry and a dummy area entry, rather than a complete memory address, is recorded in the valid area to link to the spilled information.
In our approach, only 10 bits are allocated to an offset and the 10 bits can be exclusively used to store a query sequence position depending on the two flags in the same entry. In this way, the integrated table requires much smaller memory space and has a lower cache miss ratio.
The improved lightweight BLASTP is designed for queries with lengths equal to or smaller than 1024. Ten bits are enough to encode a position. In fact, if the word length defined by users is three, although the largest query sequence position is 1024, the actual query sequence length that the improved lightweight BLASTP can support is 1026 (= 1024 + 3 − 1). Since only four bytes are allocated for every table entry and ten bits are required for each query position, there are at most three positions that can be stored within one table entry. Therefore, if the number of hits is equal to or smaller than three, only one instance of memory access is enough to retrieve all the position information for a possible word. If a possible word occurs more than three times in a query sequence, an overflow takes place. The number of occurrences of the word should be recorded in the corresponding entry and one position field will be transformed into an offset field that links to the dummy area entries buffering the spilled position information.
Flags #1 and #2 determine how many hits the corresponding word has. If the values of Flags #1 and #2 are both zero, there are no hits for the corresponding word. If the values of Flags #1 and #2 are 0 and 1, respectively, there is one hit for the word and the position is stored in Position #1. Similarly, if the values of Flags #1 and #2 are 1 and 0, respectively, there are two hits for the word and the positions are stored in Positions #1 and #2.
If both flag values are set, the value in Position #3 is used to tell whether there is any hit overflow for the word. If the value of Position #3 is smaller than 1023, there are three hits for the word and the three positions are recorded in Positions #1, #2, and #3. That is, Position #3 is also used to store a position for the word. On the other hand, if the value of Position #3 is equal to 1023, there is an overflow. In such a case, Position #1 contains the number of hits the word has, and Position #2 stores the offset that leads to the first dummy area entry buffering the spilled position information. The offset is equal to the distance between the valid area entry and the dummy area entry. For a word that has more than three hits, all the positions are stored in contiguous entries of a dummy area. Each dummy area entry can store up to three positions. For instance, if there are seven positions to be recorded, three contiguous dummy area entries will be allocated for the corresponding word. The 10-bit offset is a signed integer, ranging from −512 to 511. Since valid areas and dummy areas are interleaved, the displacement between a valid area entry and a dummy area entry can be as far as about 512 entries. Table 1 summarizes the meaning of different combinations of the flag and position values.
A simplified example is used to illustrate the function of EHQIT in our improved lightweight BLASTP, as shown in Figure 6. The gray areas in the figure are dummy areas, and the remainder are valid areas. When hit overflows take place, the entries in dummy areas are accessed. According to the values of Flags #1 and #2, Wn+2 has no hits, W0 has two hits with positions P01 and P02, and W1 has one hit with position P11. Since both Flags #1 and #2 are set for W3, Wn, and Wn+1, these words have three or more hits, and Position #3 must be checked to determine whether there are overflows. W3 and Wn have hit overflows, since their Position #3 values are 1023. Wn+1 has no overflow; its Position #3 value is not 1023, so the number of hits is three and the three positions are Pn+11, Pn+12, and Pn+13. For the overflowed word entry W3, the value of Position #1 is 9, meaning there are nine hits in total, and the value of Position #2 is 26, indicating that the offset between the word entry and the first dummy area entry is 26. The nine hits of W3 are stored sequentially in three consecutive dummy area entries, W3′, W4′, and W5′, reached via the offset recorded in Position #2, with each entry containing at most three positions. That is, positions P31 through P39 are stored in W3′, W4′, and W5′ in the dummy area. Similarly, there are a total of seven hits for Wn. Since the value of Position #2 is −49 (the leftmost bit is a sign bit), the dummy area allocated for this overflow is located at a lower memory address than that of the resident valid area. In this example, W0′ in the dummy area is the first entry chosen for buffering the hit positions of Wn.
In addition to W0′, the adjacent entries W1′ and W2′ are also allocated to buffer the seven-hit positions. Through this continuous storage allocation for buffering spilled hit positions in the dummy area, it is very efficient to fetch continuous data without the need for a lot of pointers to different entries.

4. Parallelize Our Improved Lightweight BLASTP on CUDA GPU

When implementing on a CUDA GPU, we let threads grab all word entries at the same time, without needing to know beforehand how many hits each word has. Conceptually, threads first process the words with only one hit, then grab the words with two hits, and finally handle the words with three hits or overflows. In order to avoid branch divergence in the design, which would greatly reduce the execution efficiency of the threads, we adopt the parallel method shown in Figure 7. The parallel procedure is divided into three steps. In the first step, threads simultaneously grab the information in Position #1 of the word entries in valid areas containing at least one hit. The threads then grab the corresponding information in Position #2 in the second step. Finally, in the third step, to determine whether there is an overflow, the Position #3 values are checked. If the value is not equal to 1023, the retrieved value is kept for use in the next stage. If the value is equal to 1023, the hit overflow subroutine is executed.
During Step 3, if the value fetched from Position #3 is 1023, indicating a hit overflow, the total number of hits and the offset must be re-captured from the Position #1 and Position #2 fields of the related word entries. The offset in Position #2 leads to the first table entry in the target dummy area, and dividing the total number of hits in Position #1 by three tells how many dummy area entries must be read to retrieve all the position information for the word. With this information, threads fetch the hit positions through a loop structure. As shown in Figure 8, at Step 1′ of the overflow subroutine, each thread executes the internal loop to fetch the hit positions of every word whose position information is stored in a dummy area due to overflow. The number of iterations in the loop is set to the number of entries if the number of hits is divisible by three; otherwise, it is set to the number of entries minus one. For the last entry of an overflowed word, there are three possible storage layouts. In the first, exactly one extra hit position is stored in Position #1 of the last entry, like Pn7 in Position #1 of W2′. In the second, two hits are stored in Positions #1 and #2 of the last entry. In the third, as in the entry corresponding to the word W5′, exactly three hits, P57, P58 and P59, are stored in Positions #1, #2 and #3. For these three cases, we retrieve the hit position values in a synchronous manner. After Step 1′ is completed, Step 2′ is executed: according to the result of the mod operation, the threads simultaneously fetch the value in Position #1 of the last entry if the mod result is smaller than or equal to 2.
Then Step 3′ is executed to judge whether the result is equal to 1 or 0, which determines whether Position #2 of the last entry holds valid position information. If so, the corresponding threads fetch it. Finally, Step 4′ checks whether the result is equal to 0; if the condition is met, the threads fetch the contents of Position #3 in parallel. More details about the algorithm of the above procedures are shown in Figure 9.
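Under one consistent reading of Steps 1′–4′ (the loop covers every full dummy-area entry, and the last, possibly partial, entry is finished by the mod-based steps), the spilled-position fetch can be sketched as follows. The representation of a dummy-area entry as a (Position #1, Position #2, Position #3) triple is an assumption for illustration.

```python
def spilled_entry_plan(num_hits):
    """Return (number of dummy-area entries, positions held by the last entry)."""
    full, rem = divmod(num_hits, 3)
    return (full, 3) if rem == 0 else (full + 1, rem)

def fetch_spilled(dummy, start, num_hits):
    """Read num_hits spilled positions from consecutive dummy-area entries.

    `dummy` models the dummy area as a list of (pos1, pos2, pos3) triples,
    and `start` is the index of the first entry for the overflowed word."""
    entries, last = spilled_entry_plan(num_hits)
    positions = []
    for i in range(entries - 1):                      # Step 1': full entries
        positions.extend(dummy[start + i])
    # Steps 2'-4': the last entry contributes 1, 2 or 3 remaining positions
    positions.extend(dummy[start + entries - 1][:last])
    return positions
```

For the seven hits of Wn in Figure 6, spilled_entry_plan(7) yields (3, 1): two full entries (W0′ and W1′) plus a last entry (W2′) contributing only its Position #1, matching the Pn7 case described above.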

5. Performance Evaluation

Our experimental evaluations are conducted on a computer node composed of an Intel Core i7 4790 quad-core processor and an NVIDIA Tesla K20 GPU, a massively parallel graphics card built to NVIDIA's workstation specifications. The hardware configuration of our experimental platform is shown in Table 2. The system runs 64-bit Ubuntu Linux 12.04 and NVIDIA CUDA Toolkit 5.0.
We use SwissProt as the test database. All sequence entries in the SwissProt database are carefully verified by molecular biologists and protein chemists using computer tools and the related literature. Each entry in the database has detailed annotations, and the database includes cross-references to more than ten secondary databases, such as the nucleic acid sequence databases EMBL/GenBank/DDBJ, the protein structure database PDB, and Prosite and PRINTS. For the input query sequences, we choose five sequences with lengths larger than 128 and less than 1024 for the comparison with the CUDA-BLASTP algorithm. In addition, five sequences with lengths smaller than or equal to 128 are selected for the comparison with the original lightweight BLASTP. Finally, Blosum62, a square symmetric scoring matrix, is used to determine the relative score made by matching two characters in a sequence alignment. Since our design aims to shorten the hit detection time and CUDA-BLASTP also implements the first two stages in the same kernel, the following comparisons all concern the execution time of the kernel.

5.1. Comparison with CUDA-BLASTP

We selected five query sequences with lengths ranging between 128 and 1024; their names and lengths are listed in Table 3. The number of hits for each of them, counted with a threshold value of 12, is shown in Figure 10, and the number of hit overflows observed during the experiment is shown in Figure 11.
We compare the execution times of our improved lightweight BLASTP and CUDA-BLASTP; the resulting speedups are shown in Figure 12. Different combinations of the number of blocks and the number of threads are tested: Figure 12a–c correspond to 13, 32 and 64 blocks, respectively, and for each block setting the number of threads is set to 32, 62 or 96. For all combinations, our improved lightweight BLASTP achieves higher speedups over CUDA-BLASTP for every test query sequence. For shorter queries the speedups exceed 2.26, and even when the number of hit overflows is large, the speedups are at least 1.56.

5.2. Comparison with Lightweight BLASTP

Four query sequences with lengths smaller than or equal to 128, used in the literature [26], were selected for the experiment, and one additional sequence was customized to test overflow handling with a larger hit count. They are listed in Table 4. The threshold value is again set to 12, and the number of hits counted for each of the five query sequences is shown in Figure 13. In addition, Figure 14 shows the number of hit overflows for each query sequence; YP_000123 has the largest number of hit overflows.
We test several combinations of the numbers of blocks and threads, similar to those described in Section 5.1. Figure 15 shows the speedups of our improved lightweight BLASTP over the original lightweight BLASTP for the different combinations. Our improved lightweight BLASTP is more efficient than the previous version, especially for queries with more hit overflows; most test cases achieve a speedup of 1.13 or higher. The original lightweight BLASTP utilizes empty entries near the overflowed entry to buffer spilled positions, which requires multiple pointers to link them. In contrast, we use only one pointer to link all the spilled positions for the same entry, because they are stored contiguously. As a result, the improved lightweight BLASTP outperforms the original one.
In addition, when there are overflows, the speedup can reach 1.22 or even higher. In Figure 15a–c, the query sequence with the largest number of hit overflows achieves the highest speedup. In particular, in Figure 15c, with 64 blocks and 96 threads, the speedup for query sequence YP_000123, which has the largest number of hit overflows, reaches 1.86. In summary, the more overflows there are, the higher the speedup we obtain.

6. Conclusions

BLAST is a very popular tool in bioinformatics. With the development of NGS, many studies have parallelized BLAST to shorten its execution time. According to the reported protein sequence length distribution, about 90% of protein sequences are shorter than 1024. Based on this observation, in this paper we extended our previous work, the lightweight BLASTP [26], to cover a wider service scope: the maximum query sequence length is increased from 512 to 1024 in the improved lightweight BLASTP. Moreover, because hit detection consumes most of the execution time in BLAST and the lookup table is the core of the hit detection process, an extended hybrid query-index table (EHQIT) is proposed in the new version to shorten the execution time. EHQIT addresses the problem that an HQIT entry can no longer accommodate three sequence positions once the maximum length is enlarged from 512 to 1024, since one more bit is required for each sequence position. We also proposed a new approach to handle overflow: all the sequence positions for an overflowed possible word are stored contiguously in a dummy area, so they can be fetched more efficiently. Each EHQIT entry is four bytes long and consists of five fields. The first two fields are 1-bit flags that together select among different encoding formats. The last three fields, each occupying 10 bits, store hit positions, the number of hits, or an offset to the first dummy area entry holding spilled positions, depending on the flags. This structure requires only one memory fetch as long as the number of hits is equal to or smaller than three. EHQIT is composed of interleaved valid areas and dummy areas: the possible words exist only in the entries of valid areas, while the dummy areas store spilled query positions. EHQIT is much smaller than a traditional lookup table.
Once a hit overflow occurs, the hit positions can be fetched via consecutive entries in the dummy areas. In this way, threads can read the corresponding positions in parallel, resulting in higher utilization of the lookup table and a lower cache miss ratio.
On the basis of the performance evaluations, our improved lightweight BLASTP outperforms CUDA-BLASTP with speedups ranging from 1.56 to 3.08. Compared with the original lightweight BLASTP, our improved method is more efficient, especially when there are more hit overflows. Among the tested query sequences, the speedup is 1.2 on average when the number of hit overflows is small, and it increases with the number of hit overflows, reaching up to two. In short, the improved lightweight BLASTP can replace the original one because it not only supports longer query sequences but also provides better performance.
The key design of the approach is the extended hybrid query-index table (EHQIT), which has three advantages. (1) It can provide at most three query sequence positions through only one memory access. (2) Interleaving valid areas and dummy areas in one table significantly reduces the memory space required for pointers linking to the overflow buffer. (3) All the spilled positions for a possible word are recorded in contiguous table entries, making the retrieval of position information more efficient.
The EHQIT is a data structure that can be accessed sequentially or in parallel, and executing BLASTP in parallel can fully exploit data parallelism to reduce the total execution time significantly. Moreover, the proposed approach can be implemented on any computing platform: the improved lightweight BLASTP can run on a single computer, a computer cluster, or a parallel system. In this paper, we take the NVIDIA GPU as a case study to measure how much speedup the improved lightweight BLASTP can obtain. Implementing the proposed approach on other parallel systems takes time but would be very appealing. In the future, we will study more optimization techniques to accelerate BLAST on CUDA GPUs, as well as how to implement BLAST on different architectures for wider usage.

Author Contributions

Conceptualization, C.-C.W. and X.S.; methodology, C.-C.W., X.S. and Y.-F.L.; software, Y.-F.L. and C.-C.W.; validation, C.-C.W., X.S. and Y.-F.L.; formal analysis, C.-C.W., X.S. and Y.-F.L.; investigation, C.-C.W., X.S. and Y.-F.L.; resources, C.-C.W., X.S. and Y.-F.L.; data curation, C.-C.W., X.S. and Y.-F.L.; writing—original draft preparation, X.S. and C.-C.W.; writing—review and editing, C.-C.W. and X.S.; visualization, C.-C.W. and X.S.; supervision, C.-C.W.; project administration, C.-C.W.; funding acquisition, C.-C.W. and X.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Ministry of Science and Technology, Taiwan under Grant MOST109-2221-E-018-016-MY2, in part by Premium Funding Project for Academic Human Resources Development in Beijing Union University under Grant No. BPHR2020CZ05.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Mathews, D.H.; Turner, D.H.; Zuker, M. RNA Secondary Structure Prediction. Curr. Protoc. Nucleic Acid Chem. 2007, 28, 11.2.
  2. Chang, D.-J.; Kimmer, C.; Ouyang, M. Accelerating the Nussinov RNA folding algorithm with CUDA/GPU. In Proceedings of the 10th IEEE International Symposium on Signal Processing and Information Technology, Luxor, Egypt, 15–18 December 2010; pp. 120–125.
  3. Gollery, M. Bioinformatics: Sequence and Genome Analysis. Clin. Chem. 2005, 51, 2219–2220.
  4. Smith, T.F.; Waterman, M.S. Identification of common molecular subsequences. J. Mol. Biol. 1981, 147, 195–197.
  5. Altschul, S.F.; Erickson, B.W. Optimal Sequence Alignment Using Affine Gap Costs. Bull. Math. Biol. 1986, 48, 603–616.
  6. Myers, E.W.; Miller, W. Optimal alignments in linear space. Bioinformatics 1988, 4, 11–17.
  7. Altschul, S.F.; Gish, W.; Miller, W.; Myers, E.W.; Lipman, D.J. Basic local alignment search tool. J. Mol. Biol. 1990, 215, 403–410.
  8. Altschul, S.F.; Madden, T.L.; Schäffer, A.A.; Zhang, J.; Zhang, Z.; Miller, W.; Lipman, D.J. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 1997, 25, 3389–3402.
  9. Ye, W.; Chen, Y.; Zhang, Y.; Xu, Y. H-BLAST: A fast protein sequence alignment toolkit on heterogeneous computers with GPUs. Bioinformatics 2017, 33, 1130–1138.
  10. Rangwala, H.; Lantz, E.; Musselman, R.; Pinnow, K.; Smith, B.; Wallenfelt, B. Massively parallel BLAST for the Blue Gene/L. In Proceedings of the High Availability and Performance Workshop, Santa Fe, NM, USA, 11 October 2005.
  11. Basic Local Alignment Search Tool. Available online: https://blast.ncbi.nlm.nih.gov/Blast.cgi/ (accessed on 6 August 2021).
  12. Oehmen, C.S.; Baxter, D.J. ScalaBLAST 2.0: Rapid and robust BLAST calculations on multiprocessor systems. Bioinformatics 2013, 29, 797–798.
  13. Zhang, J.; Misra, S.; Wang, H.; Feng, W.-C. muBLASTP: Database-indexed protein sequence search on multicore CPUs. BMC Bioinform. 2016, 17, 1–14.
  14. de Castro, M.R.; Tostes, C.D.S.; Dávila, A.M.; Senger, H.; da Silva, F.A. SparkBLAST: Scalable BLAST processing using in-memory operations. BMC Bioinform. 2017, 18, 318.
  15. Zhang, J.; Wang, H.; Lin, H.; Feng, W.-C. cuBLASTP: Fine-grained parallelization of protein sequence search on a GPU. In Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, Phoenix, AZ, USA, 19–23 May 2014; pp. 251–260.
  16. Zhang, J.; Wang, H.; Feng, W.-C. cuBLASTP: Fine-Grained Parallelization of Protein Sequence Search on CPU + GPU. IEEE/ACM Trans. Comput. Biol. Bioinform. 2017, 14, 830–843.
  17. Xiao, S.; Lin, H.; Feng, W.-C. Accelerating Protein Sequence Search in a Heterogeneous Computing System. In Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium, Anchorage, AK, USA, 16–20 May 2011.
  18. Liu, W.; Schmidt, B.; Muller-Wittig, W. CUDA-BLASTP: Accelerating BLASTP on CUDA-Enabled Graphics Hardware. IEEE/ACM Trans. Comput. Biol. Bioinform. 2011, 8, 1678–1684.
  19. Vouzis, P.D.; Sahinidis, N.V. GPU-BLAST: Using graphics processors to accelerate protein sequence alignment. Bioinformatics 2010, 27, 182–188.
  20. Kaiyong, Z.; Xiaowen, C. G-BLASTN: Accelerating nucleotide alignment by graphics processors. Bioinformatics 2014, 10, 1384–1391.
  21. Wan, N.; Xie, H.B.; Zhang, Q.; Zhao, K.Y.; Chu, X.W.; Jun, Y.U. A preliminary exploration on parallelized BLAST algorithm using GPU. Comput. Eng. Sci. 2009, 31, 98–112.
  22. Halfhill, T.R. Parallel Processing with CUDA, Nvidia's High-Performance Computing Platform Uses Massive Multithreading. Microprocess. Rep. 2008, 28. Available online: http://cs.brown.edu/courses/cs295-11/2008/cuda.pdf/ (accessed on 12 August 2021).
  23. NVIDIA. What Is GPU Accelerated Computing? 2012. Available online: http://www.nvidia.com/object/what-is-gpu-computing.html (accessed on 12 July 2019).
  24. NVIDIA. NVIDIA CUDA 2.0 Programming Guide. 2009. Available online: http://developer.download.nvidia.com/compute/cuda/2_0/docs/NVIDIA_CUDA_Programming_Guide_2.0.pdf (accessed on 12 July 2019).
  25. NVIDIA. CUDA Parallel Computing Platform. 2012. Available online: http://www.nvidia.com/object/cuda_home_new.html (accessed on 12 July 2019).
  26. Huang, L.-T.; Wei, K.-C.; Wu, C.-C.; Chen, C.-Y.; Wang, J.-A. A lightweight BLASTP and its implementation on CUDA GPUs. J. Supercomput. 2021, 77, 322–342.
  27. FSA-BLAST. Available online: http://fsa-blast.sourceforge.net/ (accessed on 15 August 2021).
  28. Cameron, M.; Williams, H.E.; Cannane, A. A Deterministic Finite Automaton for Faster Protein Hit Detection in BLAST. J. Comput. Biol. 2006, 13, 965–978.
  29. Glasco, D. An Analysis of BLASTP Implementation on NVIDIA GPUs. 2012. Available online: https://biochem218.stanford.edu/Projects%202012/Glasco.pdf/ (accessed on 26 August 2021).
  30. De Verdiere, G.C. Introduction to GPGPU, a hardware and software background. Comptes Rendus Mécanique 2011, 339, 78–89.
  31. Wei, K.-C.; Sun, X.; Chu, H.; Wu, C.-C. Reconstructing permutation table to improve the Tabu Search for the PFSP on GPU. J. Supercomput. 2017, 73, 4711–4738.
  32. NCBI GenBank. Available online: ftp://ftp.ncbi.nlm.nih.gov/genbank/ (accessed on 26 August 2021).
  33. Duvaud, S.; Gabella, C.; Lisacek, F.; Stockinger, H.; Ioannidis, V.; Durinx, C. Expasy, the Swiss Bioinformatics Resource Portal, as designed by its users. Nucleic Acids Res. 2021, 49, W216–W227.
Figure 1. The flow chart of the BLAST algorithm.
Figure 2. An example of comparing a query sequence and a subject sequence.
Figure 3. Hit detection process for word comparison through the alignment-scoring matrix.
Figure 4. The protein sequence length distribution (Source: https://web.expasy.org/docs/relnotes/relstat.html/, accessed on 7 November 2021).
Figure 5. The structure of our designed EHQIT.
Figure 6. An example of the usage of EHQIT. Spilled query positions can be stored in dummy areas contiguously.
Figure 7. Main procedures of parallelizing our improved lightweight BLASTP on a CUDA GPU.
Figure 8. Parallelized subprogram of hit overflow processing in our proposed method on a CUDA GPU.
Figure 9. The algorithm for fetching hit positions in the lightweight BLASTP.
Figure 10. The statistics of query sequences whose lengths are between 128 and 1024 (the threshold value is 12).
Figure 11. Distribution of the number of hit overflows for each query sequence.
Figure 12. Comparison of speedups of our improved lightweight BLASTP over CUDA-BLASTP with different numbers of blocks: (a) 13 blocks; (b) 32 blocks; (c) 64 blocks.
Figure 13. The statistics of query sequences whose lengths are smaller than or equal to 128 (the threshold value is 12).
Figure 14. Distribution of the number of hit overflows for each query sequence with a length smaller than or equal to 128.
Figure 15. Comparison of speedups of our improved lightweight BLASTP over the original lightweight BLASTP with different numbers of blocks: (a) 13 blocks; (b) 32 blocks; (c) 64 blocks.
Table 1. Flag #2 and Position #3 in valid areas.

| Flag #1 | Flag #2 | Position #3 | Position #1 | Position #2 | Note |
| 0 | 0 | Invalid | Invalid | Invalid | No hit |
| 0 | 1 | Invalid | Query position | Invalid | One hit |
| 1 | 0 | Invalid | Query position | Query position | Two hits |
| 1 | 1 | 1023 | The number of hits (N) | Offset | N spilled hits and one pointer |
| 1 | 1 | Query position (<1023) | Query position | Query position | Three hits |
Table 2. Hardware configuration of the experimental platform.

| Intel Core i7 4790 | | Tesla K20 | |
| Number of cores | 4 | Number of streaming processors | 2496 |
| Number of hyperthreads | 8 | Number of streaming multiprocessors | 13 |
| Clock speed | 3600 MHz | Clock speed | 706 MHz |
| Memory size | 8 GB | Memory size | 4.8 GB |
| Memory type | DDR3 | Memory type | GDDR5 |

Table 3. Query sequences with lengths between 128 and 1024.

| Query Sequences | Length |
| NP_344578 | 252 |
| NP_344554 | 453 |
| NP_344666 | 637 |
| NP_344582 | 877 |
| NP_345072 | 958 |

Table 4. Query sequences with lengths smaller than or equal to 128.

| Query Sequences | Length |
| NP_345015 | 36 |
| NP_000338 | 87 |
| NP_952492 | 94 |
| NP_085504 | 106 |
| NP_000123 | 78 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Sun, X.; Wu, C.-C.; Liu, Y.-F. The Design and Implementation of an Improved Lightweight BLASTP on CUDA GPU. Symmetry 2021, 13, 2385. https://doi.org/10.3390/sym13122385

