# DiamondTorre Algorithm for High-Performance Wave Modeling


## Abstract


## 1. Introduction

## 2. Computation Models

## 3. Problem Statement

## 4. Algorithm as a Rule of Subdividing a Dependency Graph

- The most compact shape that encompasses the stencil in the space coordinates is a diamond, so the 2D computational domain is subdivided into diamond-shaped tiles. For ${N}_{O}=2$, each tile contains two vertices.
- One diamond tile is chosen on the initial layer, and its influence conoid is plotted. After $Nt$ layers, another tile is chosen near the edge of the influence conoid base, on the far side in the positive direction of the x-axis, and its dependence conoid is plotted.
- The intersection of the two conoids forms a prism (Figure 5).
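The resulting traversal order can be illustrated with a minimal 1D analogue (a sketch under our own assumptions, not the paper's CUDA implementation; all names here are ours): towers are processed from right to left, each tower's tile shifting by one cell in the positive x direction per time tier, and the result matches a plain layer-by-layer leapfrog update of the wave equation.

```python
import numpy as np

def leapfrog_reference(u0, u1, c2, nt):
    """Plain layer-by-layer second-order leapfrog update of the 1D wave equation."""
    N = len(u0)
    u = np.zeros((nt + 2, N))
    u[0], u[1] = u0, u1
    for t in range(1, nt + 1):
        for x in range(1, N - 1):
            u[t+1, x] = 2*u[t, x] - u[t-1, x] + c2*(u[t, x+1] - 2*u[t, x] + u[t, x-1])
    return u[nt+1]

def diamond_torre_1d(u0, u1, c2, nt, dts=4):
    """1D DiamondTorre analogue: towers run right to left, tile shifts +1 per tier."""
    N = len(u0)
    u = np.zeros((nt + 2, N))            # keep every time layer to keep the sketch simple
    u[0], u[1] = u0, u1
    start = ((N + dts - 1)//dts)*dts     # rightmost tower base
    for s in range(start, -nt - dts, -dts):      # towers processed right to left
        for t in range(1, nt + 1):               # time tiers within one tower
            for x in range(s + t, s + t + dts):  # tile shifted by +1 per tier
                if 1 <= x <= N - 2:              # fixed (zero) boundary cells stay untouched
                    u[t+1, x] = 2*u[t, x] - u[t-1, x] + c2*(u[t, x+1] - 2*u[t, x] + u[t, x-1])
    return u[nt+1]

rng = np.random.default_rng(0)
u0 = rng.standard_normal(32); u0[0] = u0[-1] = 0.0
u1 = u0.copy()
same = np.allclose(leapfrog_reference(u0, u1, 0.4, 10),
                   diamond_torre_1d(u0, u1, 0.4, 10))
print(same)  # the skewed traversal visits every (x, t) node only after its dependencies
```

Each cell of each layer is written exactly once, and by the time a tower needs a right neighbor, the tower to its right has already finished all its tiers; the left neighbor was produced by the same tower one tier earlier.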

## 5. Benefits of the LRnLA Approach

## 6. CUDA Implementation

```c
__shared__ float2 ExchZ[8][Nz];
#define SH_c(i) ExchZ[i][threadIdx.x]

struct Cell { float F[Nz], G[Nz]; };

__global__ void __launch_bounds__(Nz, 1+(Nz<=320)) // register limit for Nz>320
weq_calc_O2_DD(int Ntime, int ixs0, int ixa0) {
  Cell* c0=...;                                // set pointer to the tower's bottom base cell
  register float2 f00={LS(F,-5,0),LS(F,-4,0)}; // load the tower's bottom data
  register float2 f10={LS(F,-4,1),LS(F,-3,1)}; // from device memory
  register float2 g00={LS(G,-4,0),LS(G,-3,0)}; // using the macro LS and pointer c0,
  register float2 g10={LS(G,-3,1),LS(G,-2,1)}; // then localize in 64 registers
  ...
  for(int it=0; it<Ntime; it+=8) { // DTS=4, 4 pair tiers per loop iteration, 4*4*4*2=128 cell steps
    SH_c(0) = make_float2(f00.y,f01.x); // put data into shared memory for the z-derivative
    SH_c(1) = make_float2(f10.y,f11.x); // float2 for Kepler optimization
    SH_c(2) = make_float2(f20.y,f21.x);
    SH_c(3) = make_float2(f01.y,f02.x);
    __syncthreads(); // calculation chunk separation
    // cross-stencil; SH_p, SH_m are macros reading the iz+1, iz-1 data from shared memory
    g00 = K1*make_float2(f00.y,f01.x) - g00 + K2*(SH_p(0)+SH_m(0)+f00+f01+f10+f31);
    LS(G,-4,0) = g00.x; LS(G,-3,0) = g00.y;   // store recalculated (up to 8 times) data
    f00.x = LS(F,-1,4); f00.y = LS(F, 0,4);   // load data from device memory
    g10 = K1*make_float2(f10.y,f11.x) - g10 + K2*(SH_p(1)+SH_m(1)+f10+f11+f20+f01);
    LS(G,-3,1) = g10.x; LS(G,-2,1) = g10.y; f10.x = LS(F, 0,5); f10.y = LS(F, 1,5);
    g20 = K1*make_float2(f20.y,f21.x) - g20 + K2*(SH_p(2)+SH_m(2)+f20+f21+f30+f11);
    LS(G,-2,2) = g20.x; LS(G,-1,2) = g20.y; f20.x = LS(F, 1,6); f20.y = LS(F, 2,6);
    g01 = K1*make_float2(f01.y,f02.x) - g01 + K2*(SH_p(3)+SH_m(3)+f01+f02+f11+f32);
    SH_c(4) = make_float2(f31.y,f32.x);
    SH_c(5) = make_float2(f22.y,f23.x);
    SH_c(6) = make_float2(f13.y,f10.x);
    SH_c(7) = make_float2(f32.y,f33.x);
    __syncthreads();
    ...
    c0 += 2*Ny; // jump to the next tower tier
  }
  ... // store data from the top tower tier
}
```

```c
for(int ixs0=NS-Nt; ixs0>=0; ixs0--) {
  weq_calc_O2_DD<<<NA, Nz>>>(Nt, ixs0, 0); // even stage
  weq_calc_O2_DD<<<NA, Nz>>>(Nt, ixs0, 1); // odd stage
}
```

## 7. Results

## 8. Generalization

## 9. Conclusions

## Acknowledgments

## Author Contributions

## Conflicts of Interest

## Abbreviations

| Abbreviation | Meaning |
|---|---|
| LRnLA | locally-recursive non-locally-asynchronous |
| GPGPU | general-purpose graphics processing unit |
| FDTD | finite difference time domain |
| DTS | diamond tile size |
| CUDA | compute unified device architecture, a parallel computing platform and application programming interface |
| MPI | message passing interface, a communication protocol for programming parallel computers |

## References


**Figure 2.** Dependency graph of the problem (arrows are omitted), with an example stencil for the sixth order of approximation.

**Figure 4.** Algorithm as a rule of subdividing a dependency graph: stepwise (**left**), domain decomposition (**center**), and LRnLA example (**right**). Arrows show data dependencies between subdivision shapes.

**Figure 5.** DiamondTorre algorithm construction as the intersection of an influence cone (orange) and a dependence cone (red) of two tiles. If a point falls on a shape border, it is considered inside the shape if it lies on the bottom face, and outside the shape if it lies on the top face.

**Figure 6.** Data dependencies are directed from the green prisms into the purple ones, and from the purple prisms into the yellow ones. The calculations of the red and purple prisms are completely independent from each other and may be processed asynchronously.

**Figure 8.** DiamondTorre algorithm implementation with CUDA. First stage (**left**) and second stage (**right**).

**Figure 9.** After the boundary prisms are processed, in some cells of the computation domain the field has the values of the $Nt$-th iteration step, in some cells it still has its initial values, and the remaining cells hold values at time steps between zero and $Nt$ (**left**). After all stages are processed, all field values reach the $Nt$-th time step (**right**).

**Figure 11.** Performance results for different parameters. Horizontal axis labels are in the form ${N}_{O}/DTS$.

| ${N}_{O}$ | ${C}_{0}$ | ${C}_{1}$ | ${C}_{2}$ | ${C}_{3}$ | ${C}_{4}$ |
|---|---|---|---|---|---|
| 2 | $-1$ | $1$ | — | — | — |
| 4 | $-5/4$ | $4/3$ | $-1/12$ | — | — |
| 6 | $-49/36$ | $3/2$ | $-3/20$ | $1/90$ | — |
| 8 | $-205/144$ | $8/5$ | $-1/5$ | $8/315$ | $-1/560$ |
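The table's weights can be checked numerically. A minimal sketch, assuming the centre weight $C_0$ is counted once per axis (so a 1D second-derivative stencil uses $2C_0$ at the centre, which matches the classical $-2, -5/2, -49/18, -205/72$ values); the helper names are ours:

```python
import numpy as np

# Per-axis stencil weights from the table above.
COEFFS = {
    2: (-1.0, [1.0]),
    4: (-5/4, [4/3, -1/12]),
    6: (-49/36, [3/2, -3/20, 1/90]),
    8: (-205/144, [8/5, -1/5, 8/315, -1/560]),
}

def second_derivative(f, h, order):
    """Central-difference d2f/dx2 on a periodic grid using the table's weights."""
    c0, cs = COEFFS[order]
    d = 2*c0*f.copy()                    # centre weight enters once per axis
    for k, ck in enumerate(cs, start=1):
        d += ck*(np.roll(f, k) + np.roll(f, -k))
    return d / h**2

# (sin x)'' = -sin x; halving h should shrink the error by about 2**order
rates = {}
for order in COEFFS:
    errs = []
    for n in (16, 32):
        x = np.linspace(0, 2*np.pi, n, endpoint=False)
        h = x[1] - x[0]
        errs.append(np.max(np.abs(second_derivative(np.sin(x), h, order) + np.sin(x))))
    rates[order] = np.log2(errs[0] / errs[1])
print({o: round(r, 1) for o, r in rates.items()})  # each rate is close to its nominal order
```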

© 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Levchenko, V.; Perepelkina, A.; Zakirov, A.
DiamondTorre Algorithm for High-Performance Wave Modeling. *Computation* **2016**, *4*, 29.
https://doi.org/10.3390/computation4030029
