Article

Efficiency of Various Tiling Strategies for the Zuker Algorithm Optimization

by Piotr Blaszynski *,†, Marek Palkowski, Wlodzimierz Bielecki and Maciej Poliwoda

Faculty of Computer Science and Information Systems, West Pomeranian University of Technology, Zolnierska 49, 72210 Szczecin, Poland
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Mathematics 2024, 12(5), 728; https://doi.org/10.3390/math12050728
Submission received: 28 January 2024 / Revised: 12 February 2024 / Accepted: 27 February 2024 / Published: 29 February 2024
(This article belongs to the Special Issue Numerical Algorithms: Computer Aspects and Related Topics)

Abstract

This paper focuses on optimizing the Zuker RNA folding algorithm, a bioinformatics task with non-serial polyadic dynamic programming and non-uniform loop dependencies. The intricate dependence pattern is represented using affine formulas, enabling the automatic application of tiling strategies via the polyhedral method. Three source-to-source compilers—PLUTO, TRACO, and DAPT—are employed, utilizing techniques such as affine transformations, the transitive closure of dependence relation graphs, and space–time tiling to generate cache-efficient codes, respectively. A dedicated transpose code technique for non-serial polyadic dynamic programming codes is also examined. The study evaluates the performance of these optimized codes for speed-up and scalability on multi-core machines and explores energy efficiency using RAPL. The paper provides insights into related approaches and outlines future research directions within the context of bioinformatics algorithm optimization.

1. Introduction

RNA secondary structure prediction poses a fundamental and computationally intensive challenge in biological computing. The objective is to predict the secondary non-crossing RNA structure for a given RNA sequence, minimizing the total free energy. Early dynamic programming algorithms by Smith and Waterman [1] and Nussinov et al. [2] focused on maximizing the number of complementary base pairs.
Zuker et al. [3] introduced a sophisticated dynamic programming algorithm that predicts the most stable secondary structure for a single RNA sequence by computing its minimal free energy, utilizing a “nearest neighbor” model. The algorithm estimates thermodynamic parameters for neighboring interactions, scoring all possible structures based on loop entropies. The RNA secondary structure is composed of four independent substructures—stack, hairpin, internal loop, and multi-branched loop—with the energy of the structure being the sum of these substructure energies.
Zuker’s algorithm comprises two steps. The first, most time-consuming step involves calculating the minimal free energy of the input RNA sequence using recurrence relations outlined in the provided formulas. The second step entails a trace-back to recover the secondary structure with the base pairs. While the trace-back step is computationally less demanding, optimizing the energy matrix calculation in the initial step is critical for enhancing overall algorithm performance [4].
This paper examines the performance of tiled Zuker loop nest codes generated by selected automatic optimizers based on the polyhedral model.
The polyhedral model represents loop nests as polyhedra with affine loop bounds and schedules, offering advanced loop transformations and the ability to analyze data dependencies. By utilizing this model, compilers can automatically optimize loops, enhance performance (especially in locality using loop tiling), and exploit parallelism [5].
Loop tiling, also known as loop blocking or loop partitioning, is a compiler optimization technique that improves cache utilization and enhances the performance of loop-based computations [6]. It partitions a loop into smaller sub-loops or blocks, known as tiles, which fit effectively into the cache. The primary goal of loop tiling is to capitalize on locality: iterations within each tile access data elements stored close together in memory and reuse them before they are evicted from the cache, reducing cache misses and optimizing memory access patterns.
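As a minimal illustration of this transformation (a hypothetical example, not one of the Zuker codes discussed below), the following sketch tiles a simple two-dimensional loop nest with an assumed tile size B; the tiled version enumerates exactly the same iteration space, tile by tile:

```c
#include <assert.h>

#define N 64
#define B 16   /* assumed tile size, chosen only for illustration */

static int a[N][N], b[N][N];

/* original loop nest: scans the whole N x N iteration space row by row */
static void original(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = i * N + j;
}

/* tiled loop nest: the outer loops (ii, jj) enumerate B x B tiles and the
 * inner loops (i, j) scan one tile; the "&& i < N" / "&& j < N" guards
 * handle the case where B does not divide N evenly */
static void tiled(void) {
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int i = ii; i < ii + B && i < N; i++)
                for (int j = jj; j < jj + B && j < N; j++)
                    b[i][j] = i * N + j;
}
```

Both versions execute the same statement instances, so their outputs are identical; only the traversal order, and therefore the cache behavior, differs.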
Bioinformatics algorithms such as those of Zuker, Nussinov, or Smith–Waterman can benefit from parallelization using the widely recognized loop skewing method implemented in polyhedral compilers. This technique for loop transformation modifies the iteration order of loop nests to create a more favorable schedule. The fundamental concept behind loop skewing involves altering the original iteration space of a loop nest by applying an affine transformation to the loop indices. This transformation introduces a skewing factor that dictates the new relationship between the loop indices, ultimately changing the execution order of loop iterations. It is worth noting that the effectiveness of loop skewing depends on the tiling algorithms implemented in compilers, and the parallelization itself may vary in the resulting codes.
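A minimal sketch of loop skewing on a generic wavefront kernel (a hypothetical example, not Zuker's code): the statement S[i][j] = S[i-1][j] + S[i][j-1] forbids parallelizing either original loop, but after skewing with t = i + j, all iterations on one anti-diagonal t are independent and the inner loop can run in parallel:

```c
#include <assert.h>

#define N 32

static long seq[N][N], par[N][N];

/* original order: each cell depends on its upper and left neighbors,
 * so neither the i nor the j loop is parallel as written */
static void wavefront_seq(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            seq[i][j] = (i == 0 || j == 0) ? 1
                      : seq[i - 1][j] + seq[i][j - 1];
}

/* skewed order: the new outer index t = i + j enumerates anti-diagonals;
 * every cell on one anti-diagonal only reads cells from diagonal t - 1,
 * so the inner loop carries no dependence and can be parallelized */
static void wavefront_skewed(void) {
    for (int t = 0; t <= 2 * (N - 1); t++) {
        int lo = t < N ? 0 : t - N + 1;
        int hi = t < N ? t : N - 1;
        #pragma omp parallel for
        for (int i = lo; i <= hi; i++) {
            int j = t - i;
            par[i][j] = (i == 0 || j == 0) ? 1
                      : par[i - 1][j] + par[i][j - 1];
        }
    }
}
```

Compiled without OpenMP, the pragma is ignored and the code still computes the same result; the skew only changes the iteration order, not the set of executed instances.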
Several numerical approaches within the polyhedral model aim to enhance structure prediction accuracy for RNA sequences. The algorithm by Zhi J. Lu and colleagues introduces the Maximum Expected Accuracy (MEA) concept, incorporating base pair and unpaired probabilities [7]. The NPDP Benchmark Suite, encompassing Nussinov, Zuker, and MEA algorithms, serves as a comprehensive resource for numerical sources and aspects, emphasizing the challenges of optimizing these tasks with conventional tiling strategies [8,9,10].
Efforts to optimize RNA folding algorithms include the “Transpose” technique proposed by Li et al. for Nussinov’s RNA folding, later extended to optimize Zuker’s code [11]. Zhao et al. improved the “Transpose” method and conducted an experimental study of energy-efficient codes for Zuker’s algorithm based on the LRU cache model [12].
While PLUTO, an advanced polyhedral code generation tool, has been widely used for optimizing C/C++ programs, it faces limitations in achieving maximal code locality and performance for NPDP problems, such as Nussinov’s RNA folding and the McCaskill probabilistic RNA folding kernel [13,14]. The inability of PLUTO to tile the innermost loop of Nussinov’s RNA folding and the third loop nest in Zuker’s code highlights these challenges.
As Mullapudi and Bondhugula presented for Zuker’s optimal RNA secondary structure prediction, dynamic tiling involves 3D iterative tiling for dynamic scheduling, calculated using reduction chains [13]. However, this approach emphasizes dynamic scheduling of tiles rather than generating a static schedule.
Wonnacott et al. introduced 3D tiling for “mostly-tileable” loop nests in RNA secondary-structure prediction codes, focusing on serial codes due to limitations in handling more complex instances [14]. Tchendji et al. explored tiling correction and Four-Russian RNA Folding, proposing a parallel tiled and sparsified four-Russians algorithm for Nussinov’s RNA folding [15].
In our previous endeavors, we introduced a tiling technique aimed at transforming original rectangular tiles into target ones, ensuring validity under lexicographic order [16]. The process of tile correction was conducted through the utilization of the transitive closure of loop dependence graphs. This innovative approach yielded a substantial speed-up in the generated tiled code compared to state-of-the-art source-to-source optimizing compilers. The technique found its practical implementation within the polyhedral TRACO compiler.
Our exploration into tiling correction and Four-Russian RNA Folding attracted the attention of Tchendji et al., leading to an in-depth study. They proposed a parallel, tiled, and sparsified four-Russians algorithm tailored for Nussinov’s RNA folding [15]. This novel approach is deemed more cache-friendly, strategically organizing blocks of four Russians into parallelogram-shaped tiles. The experimental study, spanning CPUs and massively parallel GPU architectures, showcased superior performance compared to the outcomes of our previous work [16,17]. While the authors concentrated only on the Nussinov loop nest manually, they expressed a commitment to exploring other NPDP problems in the future.
In a more recent contribution, our team introduced a space–time loop-tiling approach in a paper published in 2019 [18]. This methodology generates target tiles by applying the intersection operation to sets representing sub-spaces and time slices. Each time partition encompasses independent iterations, facilitating parallel execution, while the enumeration of time partitions follows a lexicographical order. This approach extends our ongoing efforts in space–time tiling, showcasing promising prospects for the development of new polyhedral optimizing compilers. The corresponding codes were generated using the DAPT compiler as introduced in a previous publication [19].
There are other state-of-the-art polyhedral compilers, such as Tiramisu [20], AlphaZ [21], Pencil [22], Halide [23], AutoGen [24], and Apollo [25], that generate code for CPUs and GPUs. However, they are either not applicable to NPDP problems, based on the PLUTO framework, or not fully documented and maintained source-to-source projects. As a result, we will not further explore their capabilities in this article.
In the realm of RNA folding algorithms, the Zuker kernel and Nussinov RNA folding pose challenges for optimizing compilers due to their involvement in mathematical operations over affine control loops within the polyhedral model [13]. The acceleration of Zuker RNA folding proves particularly intricate, residing in the domain of non-serial polyadic dynamic programming (NPDP), a subset with non-uniform data dependencies [16]. Moreover, Zuker’s loop structure is more intricate for automatic tiling strategies than Nussinov’s algorithm, featuring quadruple nested loops with more instructions and data dependencies.
The paper by Yuan et al. [26] introduces a novel two-level tessellation scheme for stencil computations, aiming to explore data locality and parallelism more efficiently than traditional blocking methods. It designs a set of blocks that tessellate the spatial space in various ways, allowing for parallel processing without redundant computation. Experimental results demonstrate up to 12% performance improvement over existing concurrent schemes. The paper by Bertolacci et al. [27] leverages Chapel parallel iterators to implement advanced tiling techniques, including time dimension tiling, to improve the parallel scaling of stencil computations on multicore processors. It proposes parameterized space and time tiling iterators through libraries, facilitating code reuse and easier tuning for improved programmer productivity. The approach demonstrates better scaling compared to traditional data parallel schedules. There is also a study [28] which presents a method for constructing tiled computational processes organized as a two-dimensional structure for algorithms represented by multidimensional loops. The method allows for data exchange operations to be confined within rows or columns of processes, optimizing parallel computations on distributed memory computers. Some studies have investigated the problem of obtaining global dependencies, i.e., informational dependencies between tiles, in the context of parametrized hexagonal tiling applied to algorithms with a two-dimensional computational domain. The paper [29] contributes to the efficient use of multilevel memory and optimization of data exchanges in both sequential and parallel programming. Experimental results demonstrate that the model-based tiled Sparse Matrix–Dense Matrix Multiplication (SpMM) and Sampled Dense–Dense Matrix Multiplication (SDDMM) achieve high performance relative to the current state-of-the-art methods [30]. 
In paper [31], the authors introduce monoparametric tiling, a restricted parametric tiling transformation for polyhedral programs that retains the closure properties of the polyhedral model. The technique facilitates efficient autotuning and run-time adaptability without breaking the mathematical closure properties of the polyhedral model.
The upcoming sections of the paper will be dedicated to various aspects of our research. In the next section, we delve into the polyhedral representation of the Zuker loop nests, elucidating the application of three polyhedral compilers: PLUTO, TRACO, and DAPT. The Results section comprehensively explores the time and energy benefits, as well as the locality and scalability of the generated codes. The final section analyzes the experimental study with our previous work, concluding the paper with insights into future work.

2. Materials and Methods

Zuker defines two energy matrices, W(i, j) and V(i, j), with 𝒪(n²) pairs (i, j) satisfying the constraints 1 ≤ i ≤ N and i ≤ j ≤ N, where N is the length of a sequence. W(i, j) represents the total free energy of a sub-sequence defined by indices i and j, while V(i, j) represents the total free energy of a sub-sequence starting at index i and ending at index j if i and j form a pair; otherwise, V(i, j) = ∞.
The main recursion of Zuker's algorithm for all i, j with 1 ≤ i < j ≤ N, where N is the length of a sequence, is the following:

W(i, j) = min {
  (1) W(i+1, j),
  (2) W(i, j−1),
  (3) V(i, j),
  (4) min_{i<k<j} { W(i, k) + W(k+1, j) }
}
Below, we present the computation of V:
V(i, j) = min {
  (5) eH(i, j),
  (6) V(i+1, j−1) + eS(i, j),
  (7) min_{i<i′<j′<j, 2<(i′−i)+(j−j′)<d} { V(i′, j′) + eL(i, j, i′, j′) },
  (8) min_{i<k<j−1} { W(i+1, k) + W(k+1, j−1) }
}
eH (hairpin loop), eS (stacking), and eL (internal loop) are the energy contributions of the structural elements in the Zuker algorithm.
The computation of Equations (1), (2), (3), (5), and (6) takes 𝒪(n²) steps. Equations (4) and (8) require 𝒪(n³) steps. The time complexity of a direct implementation of this algorithm is 𝒪(n⁴) because 𝒪(n⁴) operations are needed to compute Equation (7). This formulation as a computational kernel involves float arrays and operations.
The computation domain and dependencies for Zuker's recurrence cell (i, j) are more complex than those of Nussinov's recurrence. Equations (4), (7), and (8) generate long-range (non-local) dependencies for cell (i, j), while the other equations have short-range (local) dependencies. The computation of the element V(i, j) in Equation (7) spans a triangular area of several dozen to hundreds of cells.
Listing 1 shows the affine loop nest for finding the minimums of the V and W energy matrices.
The Zuker affine loop nest implies that loop bounds, conditional statements, and array addresses are represented by affine expressions. Within the Zuker statements, there are numerous non-uniform dependencies characteristic of NPDP problems. Non-uniform dependencies refer to dependencies between iterations in a computation where the relationship or distance between dependent iterations varies during execution. Examples include expressions like i → i + j or i → 2·i, where the dependency pattern changes dynamically based on the varying values of the involved variables. Another non-uniform dependency arises from the conditional expression based on Equation (8). Non-uniform dependencies in the context of NPDP problems present significant challenges for parallelization and optimization because they require algorithms and optimization strategies that can dynamically adapt to the data-dependent nature of these dependencies.
Listing 1. Zuker’s recurrence loop nest.
for (i = N - 1; i >= 0; i--) {
    for (j = i + 1; j < N; j++) {
        for (k = i + 1; k < j; k++) {
            for (m = k + 1; m < j; m++) {
                if (k - i + j - m > 2 && k - i + j - m < 30)
                    V[i][j] = MIN(V[k][m] + EL(i,j,k,m), V[i][j]); // Equation (7)
            }
            W[i][j] = MIN(MIN(W[i][k], W[k + 1][j]), W[i][j]); // Equation (4)
            if (k < j - 1)
                V[i][j] = MIN(W[i + 1][k] + W[k + 1][j - 1], V[i][j]); // Equation (8)
        }
        V[i][j] = MIN(MIN(V[i + 1][j - 1] + ES(i,j), EH(i,j)), V[i][j]); // Equations (5) and (6)
        W[i][j] = MIN(MIN(MIN(W[i + 1][j], W[i][j - 1]), V[i][j]), W[i][j]); // Equations (1), (2) and (3)
    }
}
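The contrast between uniform and non-uniform dependencies discussed above can be sketched with two toy loops (hypothetical examples, unrelated to the Zuker arrays): in the first, every iteration depends on its immediate predecessor at a constant distance of 1; in the second, iteration i reads a[i/2], so the dependence distance i − i/2 grows as the loop runs, which is exactly the property that defeats fixed-shape tiling:

```c
#include <assert.h>

#define N 16

/* uniform dependence: a[i] <- a[i-1]; the distance is always 1 */
static void uniform_chain(long a[N]) {
    a[0] = 1;
    for (int i = 1; i < N; i++)
        a[i] = a[i - 1] + 1;      /* distance: i - (i - 1) = 1 */
}

/* non-uniform dependence: a[i] <- a[i/2]; the distance i - i/2 changes
 * with i (1, 1, 2, 2, 3, ...), so no single tile shape covers all
 * producer-consumer pairs */
static void nonuniform_chain(long a[N]) {
    a[0] = 1;
    for (int i = 1; i < N; i++)
        a[i] = a[i / 2] + 1;
}
```

In the first loop, a fixed tile shape always contains each iteration's producer or its boundary; in the second, the producer of iteration i may lie arbitrarily far behind it, which is why NPDP codes need the correction, approximation, or space–time techniques described below.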
Fortunately, Zuker loop nests can be transformed to exploit parallelism and locality within the polyhedral model. They do not contain non-linear expressions or break and continue expressions. Most of the opportunities for optimizing computational efficiency arise from the application of loop tiling. Consequently, this improvement in data locality reduces the number of memory access operations to the slower main memory, leading to reduced execution times and increased overall performance of the algorithm.
Polyhedral compilers utilize a sophisticated approach for analyzing and transforming loop structures in programs, leveraging mathematical abstractions, such as polyhedra. This method represents loop nests and their iteration spaces as polyhedra, a geometric form that captures the multidimensional space of loop indices. By doing so, compilers can precisely understand the dependencies and execution order of iterations within loops. Once the loops are represented as polyhedra, polyhedral compilers use operations such as the following:
  • Affine transformations: These are mathematical transformations that maintain the straight-line nature of code segments and allow the compiler to reorder, fuse, or tile loops for optimization purposes.
  • Dependency analysis: By examining the vertices and edges within the polyhedral model, compilers can determine which iterations of the loop depend on others, allowing for safe parallelization or reordering of loops without altering the program’s semantics.
  • Scheduling: The compiler can generate an optimized execution schedule that improves data locality and parallel execution by analyzing the polyhedral representation. This is particularly effective for optimizing memory access patterns and leveraging cache memory more efficiently.
This methodology facilitates the optimization of nested loops in a way that is provably correct and often results in significant improvements in execution time, especially for high-performance computing applications.
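As a small concrete instance of the first operation above (an illustrative example, not drawn from the Zuker kernel), loop interchange is the affine map (i, j) → (j, i) applied to the iteration space; it is legal here because the statement instances carry no dependences, so any execution order yields the same result:

```c
#include <assert.h>

#define N 8

/* original schedule: iterate rows first */
static void scan_ij(int c[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            c[i][j] = 10 * i + j;
}

/* after the affine transformation (i, j) -> (j, i): columns first;
 * the set of executed statement instances is identical, only their
 * order changes, so the final array contents must match */
static void scan_ji(int c[N][N]) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            c[i][j] = 10 * i + j;
}
```

A polyhedral compiler proves such a transformation safe by checking that the new schedule preserves the direction of every dependence; here there are none, so the map is trivially valid.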
Well-known polyhedral compilers like PLUTO, TRACO and DAPT realize dependence analysis and code generation by applying similar polyhedral techniques. These optimizing compilers start by extracting the relevant loops from the input C program and representing them in the polyhedral model. This representation captures the loop iteration space, dependencies, and data access patterns as mathematical entities. Next, they perform a comprehensive dependency analysis to understand the data dependencies within the program. This analysis ensures that any transformations preserve the program's semantics, meaning the transformed program will produce the same output as the original. Using the polyhedral model, they automatically apply transformations aimed at improving data locality and exposing parallelism. This involves optimizing the loop order, tiling (breaking down loops into smaller blocks), and loop fusion or fission to better utilize the cache memory and enable parallel execution. All three compilers operate with default settings; for PLUTO, the options --tile and --parallel are specified, while the other two tools work in such a mode by default.
Dependency analysis in loop optimization involves identifying relationships between loop iterations to determine data dependencies. PET (Polyhedral Expression Translator) [5] is a tool that aids in this process, translating source code into a polyhedral model for advanced dependency analysis. PET facilitates various optimization strategies, including loop interchange and loop unrolling, ultimately generating optimized code for different hardware architectures.
Code generation in the context of libraries like Chunky Loop Generator (CLooG) [32] and Integer Set Library (isl) [5] relies on the polyhedral model. CLooG provides tools for generating efficient loop code based on complex polyhedral models, enabling optimizations such as loop interchange and loop space transformations. ISL, often used in conjunction with CLooG, is a library for manipulating sets of integers, crucial for the polyhedral model. These tools are utilized for the automatic generation of optimized code, particularly in the case of nested loops in numerical programs.
The primary distinction among compilers like PLUTO, TRACO, and DAPT lies in loop program transformations, particularly the techniques applied for loop blocking to generate cache-efficient code. Nevertheless, all these tools share a common foundation, as they are implemented using isl. The library provides operations on sets and relations, including union, intersection, negation, transitive closure of relations, projection, and others. These operations are employed in the form of matrices or sets/relations, offering a unified basis for the functionalities of these compilers.
The advanced PLUTO compiler [8] utilizes the affine transformation framework (ATF) to generate parallel tiled code, employing execution-reordering loop transformations to facilitate multi-threading and improve locality. An embedded Integer Linear Programming (ILP) cost function helps create effective tiling hyperplanes, optimizing parallelism while minimizing communication and enhancing code locality in the processor space. PLUTO supports both one-dimensional and multi-dimensional time schedules for loop nest statement instances.
TRACO [33], on the other hand, employs the transitive closure of dependence relation graphs to form valid target tiles. This process involves partitioning the iteration space into original tiles and correcting tiles by removing invalid dependence destinations. TRACO achieves this by applying the transitive closure of the dependence graph to the iteration subspace, eliminating invalid dependence destinations, and redistributing them to tiles with lexicographically greater identifiers. The compiler uses loop skewing for parallelizing NPDP tiled codes but does not implement ATF techniques.
DAPT [19] addresses non-uniform dependencies by approximating them to uniform counterparts, simplifying the complexities associated with nonlinear time-tiling constraints. DAPT has successfully normalized non-uniform dependencies and has an advantage over PLUTO, as it supports three-dimensional tiling and benchmarks like Nussinov, nw, and sw [8].
To assess the performance of tiling polyhedral techniques, we manually crafted code inspired by Li’s transformation [11], employing transposed arrays to reduce inefficient column reading. Li originally introduced the code for the Nussinov RNA folding algorithm. In Listing 2, we introduce a customized version designed for Zuker’s loop nests. Adhering to Li’s methodology [11], towards the conclusion of the loop body, cells originating from the unused left lower triangle are overwritten to the upper right one, reinstating valid W array values.
Listing 2. Optimized Zuker’s recurrence loop nest with the Transpose technique.
for (diag = 2; diag <= N - 1; diag++)
    #pragma omp parallel for shared(diag) private(col, row, k, m)
    for (row = 0; row <= N - diag - 1; row++) {
        col = diag + row;
        for (k = row; k < col; k++) {
            for (m = k + 1; m < col; m++) {
                if (k - row + col - m > 2)
                    V[row][col] = MIN(V[k][m] + EFL[row][col], V[row][col]);
            }
            W[row][col] = MIN(MIN(W[row][k], W[col][k + 1]), W[row][col]);
            if (k < col - 1)
                V[row][col] = MIN(W[row + 1][k] + W[col - 1][k + 1], V[row][col]);
        }
        V[row][col] = MIN(MIN(V[row + 1][col - 1], EHF[row][col]), V[row][col]);
        W[row][col] = MIN(MIN(MIN(W[row + 1][col], W[row][col - 1]), V[row][col]), W[row][col]);
        W[col][row] = W[row][col];
    }

3. Results

To carry out the experiments, we used a machine with a processor AMD Epyc 7542, 2.35 GHz, 32 cores, 64 threads, 128 MB Cache, and a machine with a processor Intel Xeon Gold 6240, 2.6 GHz (3.9 GHz turbo), 18 cores, 36 threads, and 25 MB Cache. The optimized codes were compiled using the GNU C++ compiler version 9.3.0 with the -O3 flag.
AMD EPYC processors offer several advantages, including a higher core count per socket, improved memory bandwidth, lower power consumption, and more PCIe 4.0 lanes. However, they may exhibit lower single-core performance compared to Intel Xeon processors, and some software may not be optimized for AMD architectures. On the other hand, Intel Xeon processors excel in higher single-core performance, superior virtualization support, a strong brand reputation, and broad software compatibility. Nonetheless, they come with fewer cores per socket, higher power consumption, and support for fewer PCIe lanes than their AMD EPYC counterparts. Hence, in our research, we examine time, locality, scalability, and energy efficiency. Other parameters, such as cache misses or RAM usage, can also be measured; however, they are more challenging to observe and, at the same time, are strongly correlated with energy consumption and execution time. Memory consumption remains invariant and depends solely on the problem size rather than on the algorithm variant. The main benefit of the reduced execution time is attributable to improved data locality. The algorithm's complexity is 𝒪(n⁴); variations in runtime stem principally from the choice of tile sizes that yield improved locality.
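Energy was measured with RAPL; a minimal sketch of one way to sample a RAPL counter on Linux is to read the cumulative energy_uj file exposed by the powercap interface (the path below is an assumption and varies by machine, and the counter wraps around, which a real harness must handle):

```c
#include <stdio.h>
#include <assert.h>

/* Read one cumulative energy counter (in microjoules) from a
 * powercap-style file, e.g. /sys/class/powercap/intel-rapl:0/energy_uj
 * (the exact path is platform-dependent). Returns -1 on any failure. */
static long long read_energy_uj(const char *path) {
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    long long uj = -1;
    if (fscanf(f, "%lld", &uj) != 1)
        uj = -1;
    fclose(f);
    return uj;
}
```

A measurement then brackets the region of interest: take e1 = read_energy_uj(path), run the kernel, take e2 = read_energy_uj(path), and report e2 − e1 microjoules (assuming no counter wrap during the run).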
Tests were conducted using ten randomly generated RNA sequences with lengths ranging from 1000 to 5000. The discussion in papers [11,16] shows that cache-efficient code performance does not change based on the strings themselves, but it depends on the size of the string.
We compared the performance of tiled codes generated with the presented approaches: (i) PLUTO parallel tiled code (based on affine transformations) [34], (ii) tile code based on the space–time technique [18] generated with DAPT, (iii) tiled code based on the correction technique TileCorr [16] generated with TRACO, and (iv) Li’s manual cache-efficient implementation of Zuker’s RNA folding Transpose [11]. All codes are multi-threaded within the OpenMP standard [35]. All three compilers utilize affine transformations; additionally, TRACO employs transitive closure to enhance the discovery of optimization opportunities. In contrast, DAPT uses dependency approximation for finding affine transformations, which allows for further increased locality. All compilers allow the specification of block size; additionally, DAPT enables the partitioning of the time space, which facilitates further optimization of the loop boundaries, thereby enabling better cache memory fitting.
The tile size 16 × 16 × 1 × 16 for the PLUTO code was chosen empirically (PLUTO does not tile the third loop) as the best among the many sizes examined. The tile size of 16 × 16 × 16 × 16 for the tile correction technique was chosen according to paper [36]. For the space–time tiled code, we chose the same tile sizes. Our preliminary empirical testing did not yield improved tile sizes for this algorithm.
Table 1 presents execution times in seconds for ten sizes of the RNA sequence using AMD Epyc 7542. Problem sizes from 1000 to 5000 (roughly the size of the longest human mRNA) are chosen to illustrate advantages for smaller and larger instances. Output codes are executed for 64 threads. We can observe that the presented space–time tiling approach allows for obtaining cache-efficient tiled code, which significantly outperforms the other examined implementations for each RNA strand length. The second most efficient code is loop tiling, which is produced by the PLUTO compiler. Figure 1 depicts the speed-up for the times presented in Table 1.
Table 2 presents execution times in seconds using two Xeon Gold 6240 processors and 36 threads. The presented space–time tiling strategy strongly outperforms the other studied techniques for all RNA strand lengths. On this machine, the Transpose technique yields faster code than the ATF tiled code and the tile correction code. Figure 2 depicts speed-ups for the execution times in Table 2.
Table 3 and Table 4 present energy consumption in kJ for the AMD Epyc and Intel Xeon Gold processors, respectively. The optimized codes significantly reduce power consumption. We observe that the AMD machine is more energy-efficient than the Intel Xeon. However, the Intel Xeon completes the calculations about 4 min faster, despite consuming about twice the energy for all optimized codes. We can also observe a stronger correlation between shorter time and lower energy consumption on the AMD machine. Figure 3 and Figure 4 illustrate shorter processing times for larger thread counts on the AMD Epyc and Intel Xeon Gold. Hence, we can assert that the presented codes are scalable with respect to the problem size and the number of threads.
All source codes used in the experimental study are available at the address https://github.com/markpal/zuker (accessed on 28 January 2024).

4. Discussion

The most favorable time results are achieved through the implementation of time–space tiling within the DAPT compiler. The initial application of this technique was implemented in the TRACO compiler and tested with Nussinov’s algorithm in 2019 [18]. Subsequently, the algorithm underwent a rewrite in the newer DAPT compiler, incorporating additional capabilities and replacing the Petit dependence analyzer [37] and cloog [32] modules with pet and isl [19]. The Zuker algorithm, a task with 𝒪 ( n 4 ) complexity, along with another NPDP RNA folding task, the Maximum Expected Accuracy (MEA) prediction, were analyzed in a paper published in 2020 [38]. This paper solely presented a comparison of ATF with tile correction, excluding time–space tiling.
The main challenge in leveraging tiling strategies for the Zuker algorithm lies in its sequential dependencies, as computations for subsequent stages rely on the outcomes of earlier ones. Moreover, the irregular memory access patterns and the need for synchronization between different units of the algorithm can hinder optimal performance gains when applying tiling strategies. However, given the rapid increase in biological data volume, speeding up the execution of bioinformatics algorithms through parallelization and tiling becomes essential. Dimensionality expansion also presents a significant opportunity for bioinformatics: transforming the problem space into higher dimensions can potentially overcome some of the inherent sequential constraints.
To continue our research, this paper explores the application of polyhedral techniques to space–time tiling for NPDP kernels where ATF fails to expose the potential parallelism or locality. In the future, we plan to explore other 𝒪(n⁴) NPDP tasks, such as Multiple Sequence Alignment (MSA) or the Smith–Waterman algorithm applied to three DNA sequences.
Also, the use of GPUs to enhance the efficiency of parallel computations with tiling presents promising prospects. With their high degree of parallelism and numerous cores, GPUs are well suited to executing many parallel threads, making them ideal for tiled algorithms, where computations can be distributed across cores. For GPU computations, the appropriate partitioning of loops into blocks is crucial, as it directly impacts the efficiency and scalability of parallel processing and the utilization of the GPU's computational resources.
In summary, the introduced space–time tiled code demonstrates enhanced, scalable performance on multi-core processors across thread counts and problem sizes. We also observed a significant improvement in the energy efficiency of the automatically generated code. The space–time tiling strategy implemented in the polyhedral compiler DAPT emerges as a promising solution for optimizing NPDP tasks. Future work includes exploring its application to other NPDP bioinformatics problems on both CPU and GPU platforms.

Author Contributions

Conceptualization and methodology, W.B., M.P. (Marek Palkowski) and P.B.; software, P.B., M.P. (Marek Palkowski) and M.P. (Maciej Poliwoda); validation, P.B. and M.P. (Marek Palkowski); data curation, M.P. (Marek Palkowski) and P.B.; original draft preparation, M.P. (Marek Palkowski); writing—review and editing, P.B. and M.P. (Marek Palkowski); visualization, P.B. and M.P. (Marek Palkowski). All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Source codes to reproduce all the results described in this article can be found at: https://github.com/markpal/zuker (accessed on 28 January 2024).

Acknowledgments

Wlodzimierz Bielecki’s contributions and insights were instrumental in initiating this article, profoundly shaping its early development. His dedication to academic excellence and mentoring have left an indelible mark on all who had the privilege to work with him. Although he is no longer with us, his legacy continues to resonate in our research and teachings. We dedicate this work to his memory, hoping to reflect even a fraction of his brilliance.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ATF	Affine Transformation Framework
CLooG	The Chunky Loop Generator
CPU	Central Processing Unit
DAPT	Dependence Approximation for Parallelism and Tiling
GPU	Graphics Processing Unit
LRU	Least Recently Used
MEA	Maximum Expected Accuracy
NPDP	Non-serial Polyadic Dynamic Programming
OpenMP	Open Multi-Processing
RAPL	Running Average Power Limit
RNA	RiboNucleic Acid
TRACO	compiler based on the TRAnsitive ClOsure of dependence graphs

References

  1. Smith, T.; Waterman, M. Identification of common molecular subsequences. J. Mol. Biol. 1981, 147, 195–197. [Google Scholar] [CrossRef]
  2. Nussinov, R.; Pieczenik, G.; Griggs, J.R.; Kleitman, D.J. Algorithms for loop matchings. Siam J. Appl. Math. 1978, 35, 68–82. [Google Scholar] [CrossRef]
  3. Zuker, M.; Stiegler, P. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res. 1981, 9, 133–148. [Google Scholar] [CrossRef]
  4. Lei, G.; Dou, Y.; Wan, W.; Xia, F.; Li, R.; Ma, M.; Zou, D. CPU-GPU hybrid accelerating the Zuker algorithm for RNA secondary structure prediction applications. BMC Genom. 2012, 13, S14. [Google Scholar] [CrossRef]
  5. Verdoolaege, S. isl: An Integer Set Library for the Polyhedral Model; Mathematical Software; Springer: Berlin/Heidelberg, Germany, 2010. [Google Scholar]
  6. Xue, J. Loop Tiling for Parallelism; Kluwer Academic Publishers: Norwell, MA, USA, 2000. [Google Scholar]
  7. Lu, Z.J.; Gloor, J.W.; Mathews, D.H. Improved RNA secondary structure prediction by maximizing expected pair accuracy. RNA 2009, 15, 1805–1813. [Google Scholar] [CrossRef] [PubMed]
  8. Palkowski, M.; Bielecki, W. NPDP Benchmark Suite for Loop Tiling Effectiveness Evaluation. In Parallel Processing and Applied Mathematics; Springer International Publishing: Berlin/Heidelberg, Germany, 2023; pp. 51–62. [Google Scholar] [CrossRef]
  9. Malas, T.; Hager, G.; Ltaief, H.; Stengel, H.; Wellein, G.; Keyes, D. Multicore-Optimized Wavefront Diamond Blocking for Optimizing Stencil Updates. SIAM J. Sci. Comput. 2015, 37, C439–C464. [Google Scholar] [CrossRef]
  10. Bondhugula, U.; Bandishti, V.; Pananilath, I. Diamond Tiling: Tiling Techniques to Maximize Parallelism for Stencil Computations. IEEE Trans. Parallel Distrib. Syst. 2017, 28, 1285–1298. [Google Scholar] [CrossRef]
  11. Li, J.; Ranka, S.; Sahni, S. Multicore and GPU algorithms for Nussinov RNA folding. BMC Bioinform. 2014, 15, S1. [Google Scholar] [CrossRef] [PubMed]
  12. Zhao, C.; Sahni, S. Efficient RNA folding using Zuker’s method. In Proceedings of the 2017 IEEE 7th International Conference on Computational Advances in Bio and Medical Sciences (ICCABS), Orlando, FL, USA, 19–21 October 2017. [Google Scholar] [CrossRef]
  13. Mullapudi, R.T.; Bondhugula, U. Tiling for Dynamic Scheduling. In Proceedings of the 4th International Workshop on Polyhedral Compilation Techniques, Vienna, Austria, 20–22 January 2014. [Google Scholar]
  14. Wonnacott, D.; Jin, T.; Lake, A. Automatic tiling of “mostly-tileable” loop nests. In Proceedings of the 5th International Workshop on Polyhedral Compilation Techniques, Amsterdam, The Netherlands, 19–21 January 2015. [Google Scholar]
  15. Tchendji, V.K.; Youmbi, F.I.K.; Djamegni, C.T.; Zeutouo, J.L. A Parallel Tiled and Sparsified Four-Russians Algorithm for Nussinov’s RNA Folding. IEEE/ACM Trans. Comput. Biol. Bioinform. 2022, 20, 1795–1806. [Google Scholar] [CrossRef]
  16. Palkowski, M.; Bielecki, W. Parallel tiled Nussinov RNA folding loop nest generated using both dependence graph transitive closure and loop skewing. BMC Bioinform. 2017, 18, 290. [Google Scholar] [CrossRef]
  17. Frid, Y.; Gusfield, D. An improved Four-Russians method and sparsified Four-Russians algorithm for RNA folding. Algorithms Mol. Biol. 2016, 11, 22. [Google Scholar] [CrossRef] [PubMed]
  18. Palkowski, M.; Bielecki, W. Tiling Nussinov’s RNA folding loop nest with a space-time approach. BMC Bioinform. 2019, 20, 208. [Google Scholar] [CrossRef]
  19. Bielecki, W.; Palkowski, M.; Poliwoda, M. Automatic code optimization for computing the McCaskill partition functions. In Proceedings of the Annals of Computer Science and Information Systems, Sofia, Bulgaria, 4–7 September 2022. [Google Scholar] [CrossRef]
  20. Baghdadi, R.; Ray, J.; Romdhane, M.B.; Sozzo, E.D.; Akkas, A.; Zhang, Y.; Suriana, P.; Kamil, S.; Amarasinghe, S.P. Tiramisu: A Polyhedral Compiler for Expressing Fast and Portable Code. arXiv 2018, arXiv:1804.10694. [Google Scholar]
  21. Yuki, T.; Gupta, G.; Kim, D.; Pathan, T.; Rajopadhye, S.V. AlphaZ: A System for Design Space Exploration in the Polyhedral Model. In Proceedings of the LCPC, Tokyo, Japan, 11–13 September 2012; pp. 17–31. [Google Scholar]
  22. Baghdadi, R.; Beaugnon, U.; Cohen, A.; Grosser, T.; Kruse, M.; Reddy, C.; Verdoolaege, S.; Betts, A.; Donaldson, A.F.; Ketema, J.; et al. PENCIL: A Platform-Neutral Compute Intermediate Language for Accelerator Programming. In Proceedings of the 2015 International Conference on Parallel Architecture and Compilation (PACT), San Francisco, CA, USA, 18–21 October 2015; pp. 138–149. [Google Scholar] [CrossRef]
  23. Ragan-Kelley, J.; Barnes, C.; Adams, A.; Paris, S.; Durand, F.; Amarasinghe, S. Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, New York, NY, USA, 16–19 June 2013; PLDI ’13. pp. 519–530. [Google Scholar] [CrossRef]
  24. Chowdhury, R.; Ganapathi, P.; Tithi, J.J.; Bachmeier, C.; Kuszmaul, B.C.; Leiserson, C.E.; Solar-Lezama, A.; Tang, Y. Autogen: Automatic discovery of cache-oblivious parallel recursive algorithms for solving dynamic programs. ACM SIGPLAN Not. 2016, 51, 1–12. [Google Scholar] [CrossRef]
  25. Caamaño, J.M.M.; Sukumaran-Rajam, A.; Baloian, A.; Selva, M.; Clauss, P. APOLLO: Automatic speculative polyhedral loop optimizer. In Proceedings of the IMPACT 2017-7th International Workshop on Polyhedral Compilation Techniques, Stockholm, Sweden, 23–25 January 2017; p. 8. [Google Scholar]
  26. Yuan, L.; Zhang, Y.; Guo, P.; Huang, S. Tessellating Stencils. In Proceedings of the SC17: International Conference for High Performance Computing, Networking, Storage and Analysis, Denver, CO, USA, 12–17 November 2017; pp. 1–13. [Google Scholar] [CrossRef]
  27. Bertolacci, I.J.; Olschanowsky, C.; Harshbarger, B.; Chamberlain, B.; Wonnacott, D.; Strout, M. Parameterized Diamond Tiling for Stencil Computations with Chapel parallel iterators. In Proceedings of the 29th ACM on International Conference on Supercomputing, Irvine, CA, USA, 8–11 June 2015. [Google Scholar] [CrossRef]
  28. Likhoded, N.A.; Paliashchuk, M.A. Tiled parallel 2D computational processes. Proc. Natl. Acad. Sci. Belarus. Phys. Math. Ser. 2019, 54, 417–426. [Google Scholar] [CrossRef]
  29. Sobolevsky, P.I.; Bakhanovich, S.V. Global dependences in hexagonal tiling. Proc. Natl. Acad. Sci. Belarus. Phys. Math. Ser. 2020, 56, 114–126. [Google Scholar] [CrossRef]
  30. Kurt, S.E.; Sukumaran-Rajam, A.; Rastello, F.; Sadayappan, P. Efficient Tiled Sparse Matrix Multiplication through Matrix Signatures. In Proceedings of the SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, Atlanta, GA, USA, 9–19 November 2020; pp. 1–14. [Google Scholar] [CrossRef]
  31. Iooss, G.; Alias, C.; Rajopadhye, S. Monoparametric Tiling of Polyhedral Programs. Int. J. Parallel Program. 2021, 49, 376–409. [Google Scholar] [CrossRef]
  32. Bastoul, C. Code Generation in the Polyhedral Model Is Easier Than You Think. In Proceedings of the PACT’13 IEEE International Conference on Parallel Architecture and Compilation Techniques, Juan-les-Pins, France, 29 September–3 October 2004; pp. 7–16. [Google Scholar]
  33. Bielecki, W.; Palkowski, M. TRACO: Source-to-Source Parallelizing Compiler. Comput. Inform. 2017, 35, 1277–1306. [Google Scholar]
  34. Bondhugula, U.; Hartono, A.; Ramanujam, J.; Sadayappan, P. A practical automatic polyhedral parallelizer and locality optimizer. SIGPLAN Not. 2008, 43, 101–113. [Google Scholar] [CrossRef]
  35. OpenMP Architecture Review Board. OpenMP Application Program Interface Version 5.2; The OpenMP Forum: Beaverton, OR, USA, 2022. [Google Scholar]
  36. Palkowski, M.; Bielecki, W. Parallel Tiled Cache and Energy Efficient Code for Zuker’s RNA Folding. In Parallel Processing and Applied Mathematics; Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 25–34. [Google Scholar] [CrossRef]
  37. Kelly, W.; Maslov, V.; Pugh, W.; Rosser, E.; Shpeisman, T.; Wonnacott, D. New User Interface for Petit and Other Extensions. User Guide 1996, 1, 996. [Google Scholar]
  38. Palkowski, M.; Bielecki, W. Parallel tiled cache and energy efficient codes for O(n4) RNA folding algorithms. J. Parallel Distrib. Comput. 2020, 137, 252–258. [Google Scholar] [CrossRef]
Figure 1. Speed-up for AMD Epyc 7542 and 64 threads.
Figure 2. Speed-up for Intel Xeon Gold 6240 and 36 threads.
Figure 3. Execution times for different thread configurations and size = 3000 for AMD Epyc.
Figure 4. Execution times for different thread configurations and size = 3000 for Intel Xeon Gold 6240.
Table 1. Execution times (in seconds) for AMD Epyc 7542 and 64 threads.

| Size | Classic | Transpose | PLUTO | TileCorr | Space–Time |
|------|-----------|---------|--------|---------|--------|
| 1000 | 26.48 | 3.37 | 3.30 | 8.06 | 1.81 |
| 1500 | 132.08 | 14.07 | 12.63 | 25.74 | 8.28 |
| 2000 | 412.58 | 42.54 | 31.75 | 57.83 | 22.64 |
| 2500 | 1011.07 | 99.48 | 68.62 | 118.57 | 54.32 |
| 3000 | 2133.66 | 201.09 | 137.07 | 214.70 | 107.80 |
| 3500 | 3824.40 | 375.20 | 249.97 | 354.73 | 205.71 |
| 4000 | 6473.75 | 621.05 | 411.77 | 590.88 | 345.90 |
| 4500 | 10,774.62 | 1010.73 | 637.25 | 864.74 | 544.16 |
| 5000 | 15,484.41 | 1486.96 | 958.28 | 1254.46 | 850.98 |
Table 2. Execution times (in seconds) for Intel Xeon Gold 6240 and 36 threads.

| Size | Classic | Transpose | PLUTO | TileCorr | Space–Time |
|------|-----------|---------|--------|---------|--------|
| 1000 | 26.97 | 4.34 | 2.96 | 6.89 | 2.18 |
| 1500 | 137.79 | 17.18 | 19.30 | 29.54 | 14.61 |
| 2000 | 385.95 | 42.95 | 23.89 | 41.46 | 18.97 |
| 2500 | 938.39 | 102.76 | 49.33 | 87.04 | 40.96 |
| 3000 | 1936.38 | 206.76 | 98.42 | 155.55 | 80.32 |
| 3500 | 3624.42 | 375.28 | 179.91 | 263.93 | 150.41 |
| 4000 | 6091.16 | 655.14 | 300.65 | 422.26 | 255.97 |
| 4500 | 9802.57 | 1066.72 | 499.87 | 650.51 | 419.34 |
| 5000 | 15,074.18 | 1565.50 | 835.31 | 953.10 | 641.13 |
Table 3. Energy consumption (in kJ) for AMD Epyc.

| Size | Classic | Transpose | PLUTO | TileCorr | Space–Time |
|------|----------|---------|---------|---------|---------|
| 1000 | 1.84 | 0.35 | 0.28 | 0.60 | 0.22 |
| 1500 | 9.11 | 1.68 | 1.18 | 2.13 | 0.97 |
| 2000 | 28.59 | 5.03 | 3.16 | 5.31 | 2.70 |
| 2500 | 70.08 | 12.08 | 7.55 | 11.69 | 6.46 |
| 3000 | 144.85 | 25.41 | 15.95 | 22.68 | 13.97 |
| 3500 | 268.28 | 49.38 | 28.69 | 40.08 | 26.91 |
| 4000 | 540.58 | 82.11 | 50.88 | 65.63 | 45.68 |
| 4500 | 731.20 | 132.12 | 81.36 | 100.70 | 73.55 |
| 5000 | 1116.53 | 197.20 | 131.31 | 149.57 | 115.03 |
Table 4. Energy consumption (in kJ) for Intel Xeon Gold.

| Size | Classic | Transpose | PLUTO | TileCorr | Space–Time |
|------|----------|---------|---------|---------|---------|
| 1000 | 3.95 | 0.64 | 0.73 | 1.45 | 0.65 |
| 1500 | 19.37 | 4.66 | 5.29 | 7.61 | 4.62 |
| 2000 | 60.12 | 7.49 | 7.47 | 10.56 | 7.32 |
| 2500 | 144.63 | 15.09 | 16.36 | 23.49 | 16.83 |
| 3000 | 299.12 | 33.91 | 30.77 | 44.58 | 31.91 |
| 3500 | 555.09 | 57.25 | 58.36 | 78.61 | 60.93 |
| 4000 | 935.63 | 96.50 | 96.85 | 128.78 | 99.44 |
| 4500 | 1494.08 | 158.46 | 155.38 | 200.72 | 160.10 |
| 5000 | 2287.78 | 238.16 | 228.60 | 300.73 | 236.83 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

Blaszynski, P.; Palkowski, M.; Bielecki, W.; Poliwoda, M. Efficiency of Various Tiling Strategies for the Zuker Algorithm Optimization. Mathematics 2024, 12, 728. https://doi.org/10.3390/math12050728