Article

Accelerating Electromagnetic Field Simulations Based on Memory-Optimized CPML-FDTD with OpenACC

by Diego Padilla-Perez, Isaac Medina-Sanchez, Jorge Hernández and Carlos Couder-Castañeda *
Instituto Politécnico Nacional, Centro de Desarrollo Aeroespacial, Belisario Domínguez 22, Centro, Cuauhtémoc, Ciudad de México 06610, Mexico
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Appl. Sci. 2022, 12(22), 11430; https://doi.org/10.3390/app122211430
Submission received: 8 September 2022 / Revised: 4 November 2022 / Accepted: 6 November 2022 / Published: 10 November 2022
(This article belongs to the Special Issue High Performance Computing, Modeling and Simulation)

Abstract: Although GPUs can offer higher computing power at low power consumption, their low-level programming can be relatively complex and consume programming time. For this reason, directive-based alternatives such as OpenACC can be used to specify high-level parallelism without modifying the original code, giving very accurate results. Nevertheless, in the FDTD method, absorbing boundary conditions are commonly used, and the key to successful performance is correctly implementing these boundary conditions, which play an essential role in memory use. This work accelerates simulations of electromagnetic wave propagation that solve the Maxwell curl equations by FDTD with a CPML boundary in TE mode, using OpenACC directives. A gain in acceleration obtained by optimizing the use of memory is shown, checking the loop intensities, and the use of single precision to improve the performance is also analyzed, producing accelerations of around 5X for double precision and 11X for single precision, respectively, compared with the serial vectorized version, without introducing errors in long-term simulations. The simulation scenarios established are of common interest and are solved at different frequencies on mid-range cards, a GeForce RTX 3060 and a Titan RTX.
Keywords:
CPML; FDTD; Maxwell; OpenACC

1. Introduction

The paradigm revolution of graphics cards began with the introduction of CUDA, an extension of the C programming language that allows general-purpose applications to be coded. With CUDA, many codes were migrated to speed up computing processes, taking advantage of better performance at a low energy cost [1,2].
GPUs are not suited to all types of applications: even with CUDA C, the applications that can benefit have to express a high degree of parallelism, and the code must be rewritten almost entirely during migration because it is heavily architecture-bound. To facilitate the programming of these devices, OpenACC was introduced as a parallel programming model based on user-directed directives managed by the compiler; in this sense, the compiler is in charge of generating the parallel code [3].
The introduction of GPUs as general-purpose devices to meet the increasing computing demand has led to the evolution of programming paradigms and methodologies. In fact, the GPU is considered a core part of some HPC systems, working at the lowest level. Specific frameworks for this type of system have been developed to facilitate the implementation of programming models based on MPI+OpenMP+CUDA/OpenACC [4,5].
The initial purpose of OpenACC is to abstract away the architectural details and focus on the application intent at a lower development effort, without modifying the existing CPU implementation [6]; nevertheless, many applications are still developed in CUDA because of the need to speed up the calculations [7,8].
Despite the facilities offered by OpenACC, the most challenging part of accelerator programming is to have a strategy before the first line of code is written and to avoid errors that can decrease the performance [9]. Having the right design is essential for acceleration success, so to take advantage of the GPU the code has to be highly parallel; fortunately, the FDTD algorithm is highly parallelizable [10,11]. OpenACC is a reliable standard tool for the design of parallel programs for GPUs [12]. For this reason, some applications have been accelerated using OpenACC, for example, materials modeling [13], computational fluid dynamics frameworks [14], mixing layer simulation [15], neural networks [16], a high-order spectral element fluid dynamics solver [17], supersonic flow simulation [18], and indoor propagation [19], among others.
Over the last few years, there have been multiple research projects involving the FDTD method concerning its complexity, meshing techniques, and computing time; for example, the use of a spatially filtered FDTD via a subgridding technique to analyze multiscale objects [20]. Similar studies approach the same issue by utilizing space transformations to overcome the conventional FDTD issues in a complex domain [21]. Another approach for this problem is made in [22], where the authors propose a stable, accurate, and fast numerical method by arranging a triangular mesh and using space transformations for electrodynamics problems with arbitrary boundaries. In the same sense, implicit forms of the FDTD have been applied [23,24,25].
This shows that there is constant research into improved techniques for the calculation of electromagnetic characteristics in microwave devices, which therefore proves to be an essential and relevant scientific problem. Given the relevance of the FDTD, many researchers have developed their own codes for specific applications. However, the FDTD is very time-consuming computationally; hence, it is necessary to accelerate the execution. Waiting a week for simulation results implies undesirable idle times.
This research shows how an in-house code written in FORTRAN can be accelerated using OpenACC directives to reduce the wait time for a simulation, which is necessary because the simulations consume too much time. Being a 2D application, it is unsuitable for OpenMP because it is memory bound and a multi-threaded CPU implementation does not scale; consequently, the execution time is not reduced. This can be analyzed through the computational intensity, which is relatively low, so the code is more suitable for a GPU architecture.
The first thing to take into account when porting to OpenACC is to keep the data region persistent, avoiding unnecessary data transfers between the CPU and GPU; however, many FDTD applications use absorbing boundary conditions to simulate an infinite domain. The energy dissipation has to be handled correctly to reduce memory use and improve performance, because GPU devices have a small global memory compared with the CPU.
In this respect, it is shown how the absorbing boundary condition is handled, specifically the Convolutional Perfectly Matched Layer (CPML), in conjunction with single precision, to minimize the memory used and obtain excellent performance without introducing errors in long-term simulations.
The study cases selected in this work have already been studied in [26], and the scenarios consist of simulating wave propagation in transverse electric (TE) mode in free space, in a parabolic reflector, and in a coplanar nanowaveguide at different frequencies, based on the Maxwell curl equations solved by Yee's finite-difference algorithm [27].
The definition of the boundary conditions is very important for electromagnetic simulations because it allows an infinite region to be simulated without using an infinite amount of memory, and many formulations of Absorbing Boundary Conditions (ABC) for the FDTD solution of Maxwell's equations can be found [28].
In this sense, the ABC is essential in electromagnetic simulations to avoid spurious energy reflections, at different frequencies, back into the physical domain, given the thinnest possible absorption region. There are various formulations of PML boundary conditions for the FDTD solution of Maxwell's equations [29,30], but the Convolutional Perfectly Matched Layer (CPML) is very efficient and sufficient for our purposes [31,32]. The CPML plays an essential role in the performance of the code design. For this reason, this work focuses on the CPML implementation, showing that proper management yields a reduction in both the memory used and the execution time [18].
This paper is organized as follows. Section 2 presents the governing electromagnetic equations, the FDTD method, and the CPML formulation. Section 3 analyzes the algorithm considering the CPML boundary and the loop intensities. Section 4 establishes the scenarios and presents the simulations carried out. Finally, the conclusions are given, showing an acceleration of up to 12X versus the highly vectorized serial CPU version.

2. Propagation Equations and Algorithm

The governing equations for the transverse electric (TE) mode in the absence of sources are:
$$\frac{\partial E_z}{\partial t} = \frac{1}{\varepsilon}\left(\frac{\partial H_y}{\partial x} - \frac{\partial H_x}{\partial y} - \sigma E_z\right), \qquad (1)$$
$$\frac{\partial H_y}{\partial t} = \frac{1}{\mu}\,\frac{\partial E_z}{\partial x}, \qquad (2)$$
$$\frac{\partial H_x}{\partial t} = -\frac{1}{\mu}\,\frac{\partial E_z}{\partial y}. \qquad (3)$$
Equations (1)–(3) are set in finite differences over a staggered cell mesh (see Figure 1) as follows. At time $n+1/2$, $\partial_t E_z$, $\partial_x H_y$ and $\partial_y H_x$ take the following form:
$$\left.\frac{\partial E_z}{\partial t}\right|_{i,j}^{n+1/2} \approx \frac{E_z(i,j)^{n+1} - E_z(i,j)^{n}}{\Delta t}, \qquad (4)$$
$$\left.\frac{\partial H_y}{\partial x}\right|_{i,j}^{n+1/2} \approx \frac{H_y(i,j+1/2)^{n+1/2} - H_y(i,j-1/2)^{n+1/2}}{\Delta x}, \qquad (5)$$
$$\left.\frac{\partial H_x}{\partial y}\right|_{i,j}^{n+1/2} \approx \frac{H_x(i+1/2,j)^{n+1/2} - H_x(i-1/2,j)^{n+1/2}}{\Delta y}. \qquad (6)$$
Replacing these approximations, (4)–(6), in Equation (1) yields:
$$\begin{aligned} E_z(i,j)^{n+1} = {} & E_z(i,j)^{n}\,\frac{1 - \frac{\Delta t\,\sigma(i,j)}{2\varepsilon(i,j)}}{1 + \frac{\Delta t\,\sigma(i,j)}{2\varepsilon(i,j)}} + \frac{\Delta t/\varepsilon(i,j)}{1 + \frac{\Delta t\,\sigma(i,j)}{2\varepsilon(i,j)}}\,\frac{H_y(i,j+1/2)^{n+1/2} - H_y(i,j-1/2)^{n+1/2}}{\Delta x} \\ & - \frac{\Delta t/\varepsilon(i,j)}{1 + \frac{\Delta t\,\sigma(i,j)}{2\varepsilon(i,j)}}\,\frac{H_x(i+1/2,j)^{n+1/2} - H_x(i-1/2,j)^{n+1/2}}{\Delta y} \qquad (7) \end{aligned}$$
Similarly, approximating $\partial_t H_y$ and $\partial_x E_z$ in Equation (2) as
$$\left.\frac{\partial H_y}{\partial t}\right|_{i,j+1/2}^{n} \approx \frac{H_y(i,j+1/2)^{n+1/2} - H_y(i,j+1/2)^{n-1/2}}{\Delta t}, \qquad (8)$$
$$\left.\frac{\partial E_z}{\partial x}\right|_{i,j+1/2}^{n} \approx \frac{E_z(i,j+1)^{n} - E_z(i,j)^{n}}{\Delta x}, \qquad (9)$$
it takes the form
$$H_y(i,j+1/2)^{n+1/2} = H_y(i,j+1/2)^{n-1/2}\,\frac{1 - \frac{\Delta t\,\sigma(i,j+1/2)}{2\mu(i,j+1/2)}}{1 + \frac{\Delta t\,\sigma(i,j+1/2)}{2\mu(i,j+1/2)}} + \frac{\Delta t/\mu(i,j+1/2)}{1 + \frac{\Delta t\,\sigma(i,j+1/2)}{2\mu(i,j+1/2)}}\,\frac{E_z(i,j+1)^{n} - E_z(i,j)^{n}}{\Delta x} \qquad (10)$$
By approximating $\partial_t H_x$ and $\partial_y E_z$ in Equation (3) as
$$\left.\frac{\partial H_x}{\partial t}\right|_{i+1/2,j}^{n} \approx \frac{H_x(i+1/2,j)^{n+1/2} - H_x(i+1/2,j)^{n-1/2}}{\Delta t}, \qquad (11)$$
$$\left.\frac{\partial E_z}{\partial y}\right|_{i+1/2,j}^{n} \approx \frac{E_z(i+1,j)^{n} - E_z(i,j)^{n}}{\Delta y}, \qquad (12)$$
it takes the form
$$H_x(i+1/2,j)^{n+1/2} = H_x(i+1/2,j)^{n-1/2}\,\frac{1 - \frac{\Delta t\,\sigma(i+1/2,j)}{2\mu(i+1/2,j)}}{1 + \frac{\Delta t\,\sigma(i+1/2,j)}{2\mu(i+1/2,j)}} - \frac{\Delta t/\mu(i+1/2,j)}{1 + \frac{\Delta t\,\sigma(i+1/2,j)}{2\mu(i+1/2,j)}}\,\frac{E_z(i+1,j)^{n} - E_z(i,j)^{n}}{\Delta y} \qquad (13)$$

CPML ABC Formulation

According to [31], the convolutional term $\Psi$ of the CPML ABC can be obtained as a time recursion, updating the memory variable of the field $F$ ($H$ or $E$) in the $x$ or $y$ direction, for each time step $n$, as
$$\Psi_x^{n}(F) = b_x\,\Psi_x^{n-1}(F) + a_x\left(\partial_x F\right)^{n-1}. \qquad (14)$$
This formulation is suitable for implementation within an FDTD code, simply substituting the spatial derivative $\partial_x$ with $\frac{1}{k_x}\,\partial_x + \Psi_x$; the time evolution of $\Psi$ follows that of the other variables. For example, the recursion for $\partial_x H_y$ with the CPML is expanded as:
$$\Psi\!\left(\partial_x H_y|_{(i,j)}^{\,n+1/2}\right) = b_x(j)\,\Psi\!\left(\partial_x H_y|_{(i,j)}^{\,n-1/2}\right) + a_x(j)\left(\partial_x H_y|_{(i,j)}^{\,n+1/2}\right). \qquad (15)$$
The arrays $k_x$, $b_x$ and $a_x$ are calculated, inside the absorption region, as:
$$k_x(q) = 1 + (k_{\max} - 1.0)\,x_{\text{norm}}(q)^{m}, \qquad (16)$$
where $x_{\text{norm}}(q) = \frac{T_h - q\,\Delta x}{T_h}$, $T_h$ is the CPML thickness, $q = 0, 1, 2, \ldots, N$, and $N$ is the number of points in the absorbing region. One must note that $b_x = 0$, $a_x = 0$ and $k_x = 1$ within the physical domain; within the CPML:
$$b_x(q) = e^{-\left(\frac{\sigma_x(q)}{k_x(q)} + \alpha_x(q)\right)\frac{\Delta t}{\varepsilon_0}}, \qquad (17)$$
$$a_x(q) = \frac{\sigma_x(q)\,\left(b_x(q) - 1\right)}{\sigma_x(q)\,k_x(q) + k_x(q)^2\,\alpha_x(q)}, \qquad (18)$$
where $\sigma_x(q) = \sigma_{\max}\,x_{\text{norm}}(q)^{m}$ and $\alpha_x(q) = \alpha_{\max}\,(1.0 - x_{\text{norm}}(q))^{m_a}$.
The vectors $k_{xh}$, $b_{xh}$ and $a_{xh}$ have to be interpolated at the points $j+1/2$, while the vectors $k_{yh}$, $b_{yh}$ and $a_{yh}$ are interpolated at the points $i+1/2$.
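These one-dimensional profiles are computed once before the time loop. A minimal Fortran sketch of how Equations (16)–(18) could be filled is given below; the routine and variable names are illustrative and not taken from the authors' code, and sigma_max, k_max, alpha_max, m and ma are the usual CPML tuning parameters.

```fortran
! Illustrative construction of the x-direction CPML profiles, Equations (16)-(18).
subroutine cpml_profiles(kx, bx, ax, npml, dx, dt, sigma_max, k_max, alpha_max, m, ma)
   implicit none
   integer, intent(in)  :: npml
   real(8), intent(in)  :: dx, dt, sigma_max, k_max, alpha_max, m, ma
   real(8), intent(out) :: kx(0:npml), bx(0:npml), ax(0:npml)
   real(8), parameter   :: eps0 = 8.8541878128d-12
   real(8) :: th, xnorm, sigma, alpha
   integer :: q

   th = npml * dx                        ! CPML thickness
   do q = 0, npml
      xnorm = (th - q * dx) / th         ! 1 at the outer boundary, 0 at the interface
      sigma = sigma_max * xnorm**m
      alpha = alpha_max * (1.0d0 - xnorm)**ma
      kx(q) = 1.0d0 + (k_max - 1.0d0) * xnorm**m                       ! Eq. (16)
      bx(q) = exp(-(sigma / kx(q) + alpha) * dt / eps0)                ! Eq. (17)
      if (sigma > 0.0d0) then
         ax(q) = sigma * (bx(q) - 1.0d0) / (sigma * kx(q) + kx(q)**2 * alpha)   ! Eq. (18)
      else
         ax(q) = 0.0d0                   ! a_x vanishes at the CPML/physical interface
      end if
   end do
end subroutine cpml_profiles
```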
With the CPML terms included, Equations (7), (10) and (13) become:
$$\begin{aligned} E_z(i,j)^{n+1} = {} & E_z(i,j)^{n}\,\frac{1 - \frac{\Delta t\,\sigma(i,j)}{2\varepsilon(i,j)}}{1 + \frac{\Delta t\,\sigma(i,j)}{2\varepsilon(i,j)}} + \frac{\Delta t/\varepsilon(i,j)}{1 + \frac{\Delta t\,\sigma(i,j)}{2\varepsilon(i,j)}}\left[\frac{1}{k_x(j)}\,\frac{H_y(i,j+1/2)^{n+1/2} - H_y(i,j-1/2)^{n+1/2}}{\Delta x} + \Psi\!\left(\partial_x H_y|_{(i,j)}^{\,n+1/2}\right)\right] \\ & - \frac{\Delta t/\varepsilon(i,j)}{1 + \frac{\Delta t\,\sigma(i,j)}{2\varepsilon(i,j)}}\left[\frac{1}{k_y(i)}\,\frac{H_x(i+1/2,j)^{n+1/2} - H_x(i-1/2,j)^{n+1/2}}{\Delta y} + \Psi\!\left(\partial_y H_x|_{(i,j)}^{\,n+1/2}\right)\right] \qquad (19) \end{aligned}$$
$$H_y(i,j+1/2)^{n+1/2} = H_y(i,j+1/2)^{n-1/2}\,\frac{1 - \frac{\Delta t\,\sigma(i,j+1/2)}{2\mu(i,j+1/2)}}{1 + \frac{\Delta t\,\sigma(i,j+1/2)}{2\mu(i,j+1/2)}} + \frac{\Delta t/\mu(i,j+1/2)}{1 + \frac{\Delta t\,\sigma(i,j+1/2)}{2\mu(i,j+1/2)}}\left[\frac{1}{k_{xh}(j)}\,\frac{E_z(i,j+1)^{n} - E_z(i,j)^{n}}{\Delta x} + \Psi\!\left(\partial_x E_z|_{(i,j+1/2)}^{\,n}\right)\right] \qquad (20)$$
$$H_x(i+1/2,j)^{n+1/2} = H_x(i+1/2,j)^{n-1/2}\,\frac{1 - \frac{\Delta t\,\sigma(i+1/2,j)}{2\mu(i+1/2,j)}}{1 + \frac{\Delta t\,\sigma(i+1/2,j)}{2\mu(i+1/2,j)}} - \frac{\Delta t/\mu(i+1/2,j)}{1 + \frac{\Delta t\,\sigma(i+1/2,j)}{2\mu(i+1/2,j)}}\left[\frac{1}{k_{yh}(i)}\,\frac{E_z(i+1,j)^{n} - E_z(i,j)^{n}}{\Delta y} + \Psi\!\left(\partial_y E_z|_{(i+1/2,j)}^{\,n}\right)\right] \qquad (21)$$
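As an illustration of how Equations (15) and (19) translate into code, a Fortran sketch of the E_z update with the convolutional variables allocated over the whole domain is shown below. The array and routine names are ours, not the authors'; the H_y and H_x updates, Equations (20) and (21), are analogous. Because b and a are zero outside the absorbing region, the same loop is valid over the entire domain.

```fortran
! Illustrative E_z update with CPML, Equations (15) and (19); psi arrays cover the whole domain.
subroutine update_ez_cpml(ez, hx, hy, ca, cb, kx, ky, ax, ay, bx, by,   &
                          psi_hyx, psi_hxy, nx, ny, dx, dy)
   implicit none
   integer, intent(in) :: nx, ny
   real(8), intent(in) :: dx, dy
   real(8), intent(inout) :: ez(ny, nx), psi_hyx(ny, nx), psi_hxy(ny, nx)
   real(8), intent(in)    :: hx(ny, nx), hy(ny, nx), ca(ny, nx), cb(ny, nx)
   real(8), intent(in)    :: kx(nx), ax(nx), bx(nx), ky(ny), ay(ny), by(ny)
   real(8) :: dhy_dx, dhx_dy
   integer :: i, j

   do j = 2, nx - 1
      do i = 2, ny - 1
         dhy_dx = (hy(i, j) - hy(i, j - 1)) / dx
         dhx_dy = (hx(i, j) - hx(i - 1, j)) / dy
         ! recursive convolution, Equation (15); b and a vanish outside the CPML
         psi_hyx(i, j) = bx(j) * psi_hyx(i, j) + ax(j) * dhy_dx
         psi_hxy(i, j) = by(i) * psi_hxy(i, j) + ay(i) * dhx_dy
         ! stretched-coordinate update, Equation (19)
         ez(i, j) = ca(i, j) * ez(i, j)                                  &
                  + cb(i, j) * (dhy_dx / kx(j) + psi_hyx(i, j))          &
                  - cb(i, j) * (dhx_dy / ky(i) + psi_hxy(i, j))
      end do
   end do
end subroutine update_ez_cpml
```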

3. Algorithm and Flux Diagram

The CPML formulation plays a significant role in the code implementation; in fact, any PML formulation is crucial in the coding. A typical ABC setup over a computational domain is shown in Figure 2, where the PML region in which the boundary conditions are applied can be observed.
As previously mentioned, the CPML can be implemented with a recursive sum in time, only in the absorbing region [31]; for this reason, there are two implementation options: consider the variables involved in the CPML as local to the absorbing region, or define them over the whole domain, which simplifies the number of loops and variables but increases the amount of memory used.
The CPML formulation used reduces the number of memory arrays within the algorithm [33]. It can therefore be implemented easily on GPU architectures, which have a limited memory capacity.
Coding the CPML in the FDTD method requires saving the spatial derivatives in convolutional variables. As mentioned, this can be implemented in two different ways: reserving memory space for them over the whole domain, or only in the absorbing region.
The advantage of allocating the convolutional variables over the whole domain is that only four computational cycles are required for the spatial derivatives, and vectorization techniques can be applied to gain performance (see the diagram depicted in Figure 3).
Allocating the convolutional variables only in the CPML region saves memory but increases the implementation complexity and the number of computational cycles (see the diagram depicted in Figure 4) [18,34]; the sketch below illustrates the difference between the two allocation strategies.
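For instance, for the 5.00 GHz free-space case of Section 4 (a 5004 × 5004 mesh with a 20-cell CPML), the two allocation strategies differ roughly as follows for a single convolutional array. The names are illustrative; the real code needs one such array per spatial derivative and per direction.

```fortran
program cpml_allocation_sketch
   implicit none
   real(8), allocatable :: psi_full(:, :)                     ! option 1: whole domain
   real(8), allocatable :: psi_left(:, :), psi_right(:, :)    ! option 2: CPML strips only
   integer :: nx = 5004, ny = 5004, npml = 20

   ! Option 1: one array per spatial derivative over the full mesh (~200 MB in double precision)
   allocate(psi_full(ny, nx))

   ! Option 2: the same memory variable restricted to the two x-direction strips (~1.6 MB total),
   ! at the cost of extra loops that sweep only the absorbing region
   allocate(psi_left(ny, npml), psi_right(ny, npml))

   print *, 'full domain:', ny*nx*8/1e6, 'MB   strips only:', 2*ny*npml*8/1e6, 'MB'
end program cpml_allocation_sketch
```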
To avoid getting into low-level programming, OpenACC is used as a directive-based model to generate parallel code [3]. It is similar to the well-established OpenMP [35]: a set of compiler directives, routines, libraries, and environment variables that can be used to specify high-level parallelism. In the same way, OpenACC lets the programmer tell the compiler how to generate parallel code (built on top of CUDA) instead of writing it directly at a low level.
For code migration to OpenACC, the first action is to identify where the most significant computational load is located in order to reduce the execution time, so it is necessary to discern the degree of parallelization of the algorithm. In this case, the numerical method is a finite-difference scheme on a staggered grid, whose characteristics make the algorithm highly parallelizable with OpenACC, as previously studied in [36,37]. The time spent reading the initial conditions, allocating memory, or initializing variables does not cause overhead, since these operations are executed only once during the entire program.
The metric used to check whether the loops carry enough computation is the intensity. The intensity of a loop refers to the relationship between floating-point operations and memory accesses. A relatively easy way to obtain this parameter is through the compiler (NVIDIA HPC SDK) using the flag -Minfo=intensity. The intensity of a loop considered for parallelization should be at least 1; still, if a loop has an intensity lower than 1, it can be included in the parallelization if it is part of a larger context. The intensity is defined as I = f/m, where f is the number of floating-point operations performed and m is the number of data movements. The intensities reported by the NVIDIA HPC compiler for both implementation options are shown in Table 1 and Table 2, respectively.
For both versions, almost all the loops have an intensity greater than one, except for some cycles such as M6, M9, and M13, since they handle the boundary-condition regions. As an unwritten rule, every loop with an intensity I ≥ 1.0 is a candidate for parallelization. However, it is necessary to mention that an intensity above 1.0 does not by itself mean that a loop is highly parallelizable, since there may be a high intensity but little data on which the loop operates; in that case, there will be no satisfactory acceleration.
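As a rough illustration of the definition (a hand count, which need not match the compiler's accounting), consider an E_z update of the form of Equation (7): each grid point performs about eight floating-point operations (two multiplications by the precomputed coefficients, two divisions by Δx and Δy, three subtractions and one addition) while touching about eight array elements (seven reads of E_z, the H components and the coefficients, plus one write of E_z), giving I ≈ 8/8 = 1. This is why these loops sit close to the memory-bound threshold and benefit from the high memory bandwidth of the GPU rather than from additional CPU threads.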
It is known that, to obtain good performance with graphics cards, it is necessary to reduce the data transfers between the memory of the CPU and that of the GPU. Therefore, the essence of the strategy is to keep the variables (data) persistent on the GPU. The directive for that purpose is acc data, which is used to define a persistent region of data on the GPU. Any code included in the data region keeps the variables resident on the GPU, without transferring them in and out of the compute regions (kernels); they are copied back only when they have to be recorded to disk, so the data region must be declared outside of the time loop (see the diagram in Figure 5 and the sketch below).
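The overall structure can be sketched as follows. This is illustrative code, not the authors'; for brevity the CPML terms of Section 2 are omitted from the E_z loop (they would be added exactly as in the sketch above), and a simple print stands in for the disk output.

```fortran
! Skeleton of the OpenACC design: one persistent data region around the time loop,
! one parallel loop per update cycle, and host updates only when a snapshot is written.
subroutine run_fdtd(ez, hx, hy, ca, cb, da, db, nx, ny, dx, dy, nsteps, ndisk)
   implicit none
   integer, intent(in) :: nx, ny, nsteps, ndisk
   real(8), intent(in) :: dx, dy
   real(8), intent(inout) :: ez(ny, nx), hx(ny, nx), hy(ny, nx)
   real(8), intent(in)    :: ca(ny, nx), cb(ny, nx), da(ny, nx), db(ny, nx)
   integer :: i, j, n

   !$acc data copyin(ca, cb, da, db) copy(ez, hx, hy)
   do n = 1, nsteps

      !$acc parallel loop collapse(2) present(ez, hy, hx, ca, cb)
      do j = 2, nx - 1
         do i = 2, ny - 1
            ez(i, j) = ca(i, j) * ez(i, j)                               &
                     + cb(i, j) * ( (hy(i, j) - hy(i, j - 1)) / dx       &
                                  - (hx(i, j) - hx(i - 1, j)) / dy )
         end do
      end do

      !$acc parallel loop collapse(2) present(ez, hy, hx, da, db)
      do j = 1, nx - 1
         do i = 1, ny - 1
            hy(i, j) = da(i, j) * hy(i, j) + db(i, j) * (ez(i, j + 1) - ez(i, j)) / dx
            hx(i, j) = da(i, j) * hx(i, j) - db(i, j) * (ez(i + 1, j) - ez(i, j)) / dy
         end do
      end do

      ! Copy back to the host only when a snapshot has to be written to disk
      if (mod(n, ndisk) == 0) then
         !$acc update host(ez)
         print *, 'snapshot at step', n, ' E_z sample = ', ez(ny/2, nx/2)
      end if
   end do
   !$acc end data
end subroutine run_fdtd
```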

4. Simulation Settings and Results

Two GPUs were used: an NVIDIA GeForce RTX 3060 with a compute capability of 8.6 and 12 GB of memory, and an NVIDIA TITAN RTX with a compute capability of 7.5 and 24 GB of memory, together with the NVIDIA HPC Fortran, C++, and C compilers with OpenACC, version 22.9. The processor is an Intel Xeon E5-2630 v4 at 2.2 GHz with ten physical cores.
Simulations were carried out on computationally demanding domains in single and double precision, and the reference solution was calculated on the CPU in double precision to check the accuracy of the solutions obtained from the GPU. Clearly, single precision reduces memory storage by a factor of almost two, and the GPU performs the floating-point operations faster; indeed, the first GPUs for general-purpose applications only supported single precision.
Single precision is very convenient for saving memory but can introduce precision errors. Notwithstanding, there are no studies reported in the literature using single precision for FDTD solutions of Maxwell's equations on GPUs. Long-term simulations require many time steps, potentially introducing oscillations or spurious waves returning into the computational domain; for this reason, the simulation results have been compared against those obtained on the CPU.
For a fair comparison, the experiments carried out on the CPU were highly optimized with vectorization; the compiler flag used to generate the vectorization report is -Minfo=vect, and according to the report all the cycles were successfully vectorized. The CPU versions are labeled CPU-M (memory-saving version) and CPU-V (vectorized version), respectively. On the GPU, experiments with memory saving (GPU-MD) and without memory saving (GPU-VD) were executed in double precision, as well as with memory saving (GPU-MS) and without memory saving (GPU-VS) in single precision.
This work considered three simulation scenarios at different frequencies, giving seven experiments performed on two cards:
  • Free space propagation at 2.45 GHz, 5.00 GHz and 20 GHz.
  • Parabolic reflector at 2.45 GHz, 5.00 GHz and 20 GHz.
  • Coplanar nano-waveguide (CNWG) at 100 THz.

4.1. Free Space Propagation

In this experiment, an empty environment without absorbing obstacles or reflecting surfaces is set up, and different frequencies were tested: 20.00 GHz, 5.00 GHz, and 2.45 GHz, using computational domains of 20 m × 20 m, 10 m × 10 m, and 5 m × 5 m, respectively, with the source located in the center. Table 3 shows the configuration of the scenarios and Table 4 the quantity of memory required for each case. It is necessary to mention that the time used to write to the disk is negligible because, once the data are transferred to CPU memory, they are written in parallel; this means that the kernel executions continue while disk writing is in progress. Of course, the time used to transfer the data from the GPU to the CPU is considered part of the GPU total time, but the transfer is done asynchronously, taking on average 13∼14 μs.
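A minimal sketch of this kind of asynchronous transfer is given below; the queue number, loop variables and the write_snapshot routine are illustrative and not taken from the authors' code.

```fortran
! Inside the time loop: start the device-to-host copy without blocking the kernels
if (mod(n, ndisk) == 0) then
   !$acc update host(ez) async(1)   ! asynchronous transfer on queue 1
end if
! ... further !$acc parallel loop kernels can be launched here while the copy runs ...
!$acc wait(1)                       ! block only when the snapshot is actually needed
call write_snapshot(ez, n)          ! hypothetical host-side I/O routine
```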
The location of the numerical control viewers for the free space propagation is shown in Figure 6.
Computing times measured for the simulations using one CPU core with Hyper-Threading disabled are shown in Table 5. It is important to mention that the serial version is highly optimized with vectorization and that an OpenMP implementation does not improve the performance, because the computational intensity is low and the problem is memory bound. Given the benefits vectorization provides for speeding up codes, it is necessary to apply this technique to the serial version of the code executed on the CPU, since, for a fair performance comparison, the best compiled version of the code should be used.
Table 6 shows the execution times obtained using the GPUs, and Table 7 the corresponding speed-ups. The vector length is 1024 (block size), which gives the best performance. Figure 7 depicts a bar graph comparing the times for the 2.45 GHz case; similar computing-time behavior was found for 5.00 GHz (see Figure 8) and 20.00 GHz (see Figure 9). Speed-up factors were calculated taking the CPU-MD times as the baseline.
The simulation results for the 2.45 GHz case are depicted in Figure 10. Very similar results are obtained for the 5.00 GHz and 20.00 GHz cases.
The purpose of using OpenACC is to accelerate the execution; however, the quality of the solution obtained with the GPU must be verified. An advantage of the CPU over the GPU is its direct access to storage devices and its much more flexible memory management, so that on the CPU the values of the variables at points of interest (viewers) can be recorded at every time step, but not on the GPU.
Table 8 shows the MSE calculated between the control-point values of the CPU in double precision and the fastest GPU execution (single precision). It must be taken into account that the comparison at the control points is not made at every time step, since information is transferred only every certain number of time steps. Nevertheless, for one viewer, the solution was compared with the reference solution at the instants when the values are copied to CPU memory, and they match precisely, validating the precision over time (Figure 11).
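The MSE values reported in Table 8 (and in the analogous tables of the following scenarios) correspond to the usual definition over the N samples recorded at a control point; a small helper of the kind that could compute them is sketched below (illustrative, not the authors' code).

```fortran
! Illustrative MSE between the CPU reference trace and the GPU trace at one control point.
pure function mse(ref, gpu, n) result(err)
   implicit none
   integer, intent(in) :: n
   real(8), intent(in) :: ref(n), gpu(n)
   real(8) :: err
   err = sum((ref - gpu)**2) / real(n, 8)
end function mse
```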
The numerical reliability of the fastest execution (GPU-MS) produced by the GPUs is compared with its sequential counterpart on the CPU. The most error-prone precision is single precision, since it provides only about seven significant decimal digits. Figure 12 shows the absolute errors calculated for the last snapshots of the simulations. As can be observed, the errors are in the expected range of 10^-6–10^-7. Both GPUs give exactly the same numerical solution.

4.2. Parabolic Plate Reflector

In this experiment, the feed of the antenna radiates electromagnetic waves that reach the parabolic antenna, distributing the electromagnetic field along the surface and reflecting the field towards the primary radiator (i.e., the focus of the parabola). The domains are of size 5 m × 5 m, 10 m × 10 m, and 20 m × 20 m for 2.45 GHz, 5.00 GHz, and 20.00 GHz, respectively. The parabolic plate is considered to be made of silver (σ = 6.3 × 10^7 S/m). The electromagnetic field source is isotropic, spherically symmetric, and placed within a hood feeder. Figure 13 shows the experimental setup, where the green marks depict the location of the control points used to check the quality of the solution. The behavior of the resultant field within the physical domain is as expected and demonstrates the adequate conversion from spherical to plane wavefronts. Moreover, the largest energy occurs over the parabola's focus when the plane wavefront illuminates it. Table 9 shows the configuration of the numerical scenarios for the parabolic plate, and Table 10 the memory required.
Computing times measured for the simulations of the parabolic reflector using one CPU are shown in Table 11. Table 12 shows the computing times obtained using the GPUs. Figure 14, Figure 15 and Figure 16 depict bar graphs comparing the times for the 2.45 GHz, 5.00 GHz and 20.00 GHz cases, respectively. Speed-up factors are shown in Table 13.
Table 14 shows the MSE calculated between the control-point values of the CPU in double precision and the fastest GPU execution in single precision. The simulation results for the parabolic plate are depicted in Figure 17. Figure 18 shows the absolute errors for the last snapshots of the simulations; the errors are around 10^-6–10^-7.
In this study, a memory-saving design makes sense because the memory in the GPU is limited: the GPU-MD (double) and GPU-VD (double) instances could not be run at 20.00 GHz due to the 12 GB memory limit of the RTX 3060 card.

4.3. Coplanar Nano-Waveguide

The propagation of the electromagnetic field at operating frequencies in the THz band is studied in this section. Waveguides are commonly used as microwave devices through which information travels in various communication systems; it is necessary to correctly describe the propagation of waves within them to determine their filtering effects, as well as their limitations on electric field intensity [38].
The structure of the coplanar nanowaveguide consists of three silver rectangular parallel plates, the central one fixed and two equidistantly separated at its sides. Figure 20 shows the described configuration of this experiment, which is simulated at a frequency of 100 THz. The plates are located over a squared dielectric substrate of side 6 × 10^-5 m, with plates 4.15 × 10^-5 m long by 7.50 × 10^-6 m wide. The distance between plates is 9.375 × 10^-6 m. The field sources (orange dots) are located at a distance of 2.537 × 10^-5 m from the plates. The substrate is 150 × 10^-6 m thick. Five numerical control points are located within the physical domain to record the fields at those locations (green dots). Specifically, two of them are located between the plates to analyze the behavior of the field within the nanometric waveguide, while the remaining ones are outside the waveguide. Furthermore, the sources are strategically located to measure waves along the waveguide. For this experiment, the computational domain extends from [6 × 10^-5, 6 × 10^-5] to [6.5 × 10^-4, 6.5 × 10^-4]. Table 15 shows the configuration of the scenario and Table 16 the quantity of memory required.
Computing times measured for the simulations of the nanowaveguide using one CPU are shown in Table 17. Table 18 shows the computing times obtained using the GPUs. Figure 21 shows three snapshots of the simulation results for the nanowaveguide case. Figure 22 depicts a bar graph comparing the times for the 100 THz case. Speed-up factors are shown in Table 19.
Table 20 shows the mean squared error calculated for each control point between the CPU values in double precision and the fastest GPU execution (single precision). Figure 23 shows the absolute error for the last snapshot of the simulation; the errors are in the expected range of 10^-6–10^-7.

5. Conclusions

Currently, GPUs have become a core part of computing equipment oriented to science and engineering, and for this reason programmability should be facilitated so that the focus is on the details of the application code rather than on the details of the hardware implementation. With OpenACC, this goal is achieved by reducing the coding effort through a directive-based methodology. Nevertheless, in OpenACC it is always necessary to keep the data regions persistent while the kernels are executed, since this largely avoids transfers between the CPU and the GPU, which are highly penalized at run time; finally, it is necessary to reduce the memory used, since GPUs have a small quantity of memory compared with the CPU.
Particularly with this type of application, the implementation has to be performed carefully to improve performance, because the problem is memory bound. The best option is to split the work into many cycles and use single precision. It was shown that using single precision does not introduce precision errors in the different scenarios. In fact, the presence of obstacles or reflecting surfaces does not affect the performance.
The performance gain obtained on the RTX 3060 in double precision is around 5.0X, i.e., a factor of five compared with the serial vectorized version. However, using single precision, almost half the memory is saved and no significant errors are introduced. Indeed, the results are in the expected range, and the performance is around 11X.
The Titan RTX is similar in performance to an RTX 2080, with a compute capability of 7.5, lower than the 8.6 of the RTX 3060, but with 24 GB of memory; with this quantity of memory it was possible to run the parabolic plate experiments in double precision that could not be carried out on the RTX 3060. For this card, a solid 5.0X speed-up was found in double precision for all experiments, but a lower performance in single precision, with a speed-up of around 7.0X; this could be explained by its being an older architecture than the RTX 3060.
In computation, and specifically in parallel computation, it is difficult to establish a measurement standard [39]; the results obtained can vary, as in all parallel-computing benchmarking studies, when different hardware or accelerators are used, and the results shown have to be taken as a guide. Nevertheless, reducing the use of memory and using single precision are the keys to gaining performance in this study.
Regarding the numerical reliability, the results produced by the GPUs were compared with their sequential counterpart on the CPU. The most error-prone precision is single precision, with only about seven significant decimal digits; however, it is the fastest and does not introduce oscillations in the solution. Both GPUs give exactly the same solution. Finally, we can establish that OpenACC is an excellent tool to accelerate wave propagation simulation codes based on FDTD [36,37].
Future work on this porting will use several GPUs integrated into the same computational node, controlled using OpenMP. A combination of OpenMP+OpenACC can be used if the GPUs are integrated into the same system: if there are n GPUs, we can create n OpenMP threads, each thread controlling one GPU, as sketched below. If the GPUs are located in different nodes or workstations, it is necessary to use MPI or a combination of MPI+OpenMP+OpenACC, depending on the system architecture. In all cases, the latency caused by bottlenecks has to be studied, based on the need to transfer information among the GPUs.
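A minimal sketch of that single-node scheme, under the assumption of one OpenMP thread per GPU, could look as follows; the domain decomposition and halo exchanges are only indicated by comments, and the 0-based device numbering follows the NVIDIA runtime convention.

```fortran
program multi_gpu_sketch
   use openacc
   use omp_lib
   implicit none
   integer :: ngpus, tid

   ngpus = acc_get_num_devices(acc_device_nvidia)
   if (ngpus < 1) stop 'no GPU found'

   !$omp parallel num_threads(ngpus) private(tid)
   tid = omp_get_thread_num()
   call acc_set_device_num(tid, acc_device_nvidia)   ! bind this OpenMP thread to one GPU
   ! Each thread would now run the FDTD update of its own subdomain inside its
   ! own !$acc data region, exchanging halo rows with neighbours every time step.
   !$omp end parallel
end program multi_gpu_sketch
```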

Author Contributions

Conceptualization, C.C.-C.; formal analysis of the scenarios J.H.; coding, D.P.-P.; validation, I.M.-S.; writing—original draft preparation, C.C.-C.; writing-review and editing, C.C.-C. and I.M.-S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially funded by Secretaría de Investigación y Posgrado, Instituto Politécnico Nacional, project numbers 20220176 and 20220907.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors acknowledge the Secretaría de Investigación y Posgrado for the EDI grant given to all authors.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

References

  1. Arora, M.; Nath, S.; Mazumdar, S.; Baden, S.B.; Tullsen, D.M. Redefining the Role of the CPU in the Era of CPU-GPU Integration. IEEE Micro 2012, 32, 4–16. [Google Scholar] [CrossRef] [Green Version]
  2. Papadrakakis, M.; Stavroulakis, G.; Karatarakis, A. A new era in scientific computing: Domain decomposition methods in hybrid CPU–GPU architectures. Comput. Methods Appl. Mech. Eng. 2011, 200, 1490–1508. [Google Scholar] [CrossRef]
  3. Wienke, S.; Springer, P.; Terboven, C. OpenACC—first experiences with real-world applications. In Proceedings of the European Conference on Parallel Processing; Springer: Berlin/Heidelberg, Germany, 2012; pp. 859–870. [Google Scholar]
  4. Chen, Y.; Xiao, G.; Li, K.; Piccialli, F.; Zomaya, A.Y. FgSpMSpV: A Fine-Grained Parallel SpMSpV Framework on HPC Platforms. ACM Trans. Parallel Comput. 2022, 9, 1–29. [Google Scholar] [CrossRef]
  5. Xiao, G.; Li, K.; Chen, Y.; He, W.; Zomaya, A.Y.; Li, T. CASpMV: A Customized and Accelerative SpMV Framework for the Sunway TaihuLight. IEEE Trans. Parallel Distrib. Syst. 2021, 32, 131–146. [Google Scholar] [CrossRef]
  6. Kraus, J.; Schlottke, M.; Adinetz, A.; Pleiter, D. Accelerating a C++ CFD code with OpenACC. In Proceedings of the 2014 First Workshop on Accelerator Programming Using Directives, New Orleans, LA, USA, 17 November 2014; pp. 47–54. [Google Scholar]
  7. Sanchez-Noguez, J.; Couder-Castañeda, C.; Hernández-Gómez, J.J.; Navarro-Reyes, I. Solving the Heat Transfer Equation by a Finite Difference Method Using Multi-dimensional Arrays in CUDA as in Standard C. In Proceedings of the Latin American High Performance Computing Conference; Springer: Berlin/Heidelberg, Germany, 2022; pp. 221–235. [Google Scholar]
  8. Wang, X.M.; Xiong, L.L.; Liu, S.; Peng, Z.Y.; Zhong, S.Y. GPU-Accelerated Parallel Finite-Difference Time-Domain Method for Electromagnetic Waves Propagation in Unmagnetized Plasma Media. 2017. Available online: https://www.researchgate.net/profile/Ximin-Wang/publication/319478533_GPU-Accelerated_Parallel_Finite-Difference_Time-Domain_Method_for_Electromagnetic_Waves_Propagation_in_Unmagnetized_Plasma_Media/links/59affe74458515150e4ce8af/GPU-Accelerated-Parallel-Finite-Difference-Time-Domain-Method-for-Electromagnetic-Waves-Propagation-in-Unmagnetized-Plasma-Media.pdf (accessed on 18 October 2022).
  9. Alghamdi, A.M.; Eassa, F.E. OpenACC Errors Classification and Static Detection Techniques. IEEE Access 2019, 7, 113235–113253. [Google Scholar] [CrossRef]
  10. Sonoda, J.; Koseki, Y.; Sato, M. Evaluation of Various FDTD Method Using OpenACC Directive on GPU. IEICE Tech. Rep. 2013, 113, 21–26. [Google Scholar]
  11. Le Bras, R. Acceleration in Acoustic Wave Propagation Modelling Using OpenACC/OpenMP and Its Hybrid for the Global Monitoring System. In Proceedings of the Accelerator Programming Using Directives: 6th International Workshop, WACCPD 2019, Denver, CO, USA, 18 November 2019; Revised Selected Papers. Springer Nature: Berlin/Heidelberg, Germany, 2020; Volume 12017, p. 25. [Google Scholar]
  12. Aldinucci, M.; Cesare, V.; Colonnelli, I.; Martinelli, A.R.; Mittone, G.; Cantalupo, B.; Cavazzoni, C.; Drocco, M. Practical parallelization of scientific applications with OpenMP, OpenACC and MPI. J. Parallel Distrib. Comput. 2021, 157, 13–29. [Google Scholar] [CrossRef]
  13. Smith, M.; Tamerus, A.; Hasnip, P. Portable Acceleration of Materials Modeling Software: CASTEP, GPUs, and OpenACC. Comput. Sci. Eng. 2022, 24, 46–55. [Google Scholar] [CrossRef]
  14. Xue, W.; Jackson, C.W.; Roy, C.J. An improved framework of GPU computing for CFD applications on structured grids using OpenACC. J. Parallel Distrib. Comput. 2021, 156, 64–85. [Google Scholar] [CrossRef]
  15. Da Silva, H.U.; Schepke, C.; Lucca, N.; Da Cruz Cristaldo, C.F.; De Oliveira, D.P. Parallel OpenMP and OpenACC Mixing Layer Simulation. In Proceedings of the 2022 30th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), Valladolid, Spain, 9–11 March 2022; pp. 181–188. [Google Scholar]
  16. Fujita, K.; Kikuchi, Y.; Ichimura, T.; Hori, M.; Maddegedara, L.; Ueda, N. GPU Porting of Scalable Implicit Solver with Green’s Function-Based Neural Networks by OpenACC. In International Workshop on Accelerator Programming Using Directives; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2022; pp. 73–91. [Google Scholar]
  17. Vincent, J.; Gong, J.; Karp, M.; Peplinski, A.; Jansson, N.; Podobas, A.; Jocksch, A.; Yao, J.; Hussain, F.; Markidis, S.; et al. Strong Scaling of OpenACC enabled Nek5000 on several GPU based HPC systems. arXiv 2022, arXiv:2109.03592. [Google Scholar]
  18. Couder-Castañeda, C.; Barrios-Piña, H.; Gitler, I.; Arroyo, M. Performance of a Code Migration for the Simulation of Supersonic Ejector Flow to SMP, MIC, and GPU Using OpenMP, OpenMP+LEO, and OpenACC Directives. Sci. Program. 2015, 2015, 739107. [Google Scholar] [CrossRef] [Green Version]
  19. Rodríguez Sánchez, A.; Enciso Aguilar, M.; Sosa Pedroza, J.R.; Benavides Cruz, A.M.; Coss Domínguez, S.; Peña Ruíz, S.; Couder Castañeda, C. Full 3D-FDTD analysis and validation for indoor propagation at 2.45 GHz. Microw. Opt. Technol. Lett. 2016, 58, 2880–2884. [Google Scholar] [CrossRef]
  20. Xu, J.; Xie, G. A Novel Hybrid Method of Spatially Filtered FDTD and Subgridding Technique. IEEE Access 2019, 7, 85622–85626. [Google Scholar] [CrossRef]
  21. Kazemzadeh, M.; Xu, W.; Broderick, N.G. Faster and More Accurate Time Domain Electromagnetic Simulation Using Space Transformation. IEEE Photonics J. 2020, 12, 1–13. [Google Scholar] [CrossRef]
  22. Kazemzadeh, M.R.; Broderick, N.G.R.; Xu, W. Novel Time-Domain Electromagnetic Simulation Using Triangular Meshes by Applying Space Curvature. IEEE Open J. Antennas Propag. 2020, 1, 387–395. [Google Scholar] [CrossRef]
  23. Sun, G.; Trueman, C.W. Efficient implementations of the Crank-Nicolson scheme for the finite-difference time-domain method. IEEE Trans. Microw. Theory Tech. 2006, 54, 2275–2284. [Google Scholar]
  24. Jiang, H.L.; Wu, L.T.; Zhang, X.G.; Wang, Q.; Wu, P.Y.; Liu, C.; Cui, T.J. Computationally efficient CN-PML for EM simulations. IEEE Trans. Microw. Theory Tech. 2019, 67, 4646–4655. [Google Scholar] [CrossRef]
  25. Sun, G.; Trueman, C. Unconditionally-stable FDTD method based on Crank-Nicolson scheme for solving three-dimensional Maxwell equations. Electron. Lett. 2004, 40, 589–590. [Google Scholar] [CrossRef]
  26. Rodríguez-Sánchez, A.; Couder-Castañeda, C.; Hernández-Gómez, J.; Medina, I.; Peña-Ruiz, S.; Sosa-Pedroza, J.; Enciso-Aguilar, M. Analysis of electromagnetic propagation from MHz to THz with a memory-optimised CPML-FDTD algorithm. Int. J. Antennas Propag. 2018, 2018, 5710943. [Google Scholar] [CrossRef] [Green Version]
  27. Yee, K.S.; Chen, J.S. The finite-difference time-domain (FDTD) and the finite-volume time-domain (FVTD) methods in solving Maxwell’s equations. IEEE Trans. Antennas Propag. 1997, 45, 354–363. [Google Scholar] [CrossRef]
  28. Berenger, J.P. A perfectly matched layer for the absorption of electromagnetic waves. J. Comput. Phys. 1994, 114, 185–200. [Google Scholar] [CrossRef]
  29. Xie, G.; Fang, M.; Huang, Z.; Ren, X.; Wu, X. A unified 3-D simulating framework for Debye-type dispersive media and PML technique based on recursive integral method. Comput. Phys. Commun. 2022, 280, 108463. [Google Scholar] [CrossRef]
  30. Wang, J.; Li, G.; Chen, Z. Convolutional Implementation and Analysis of the CFS-PML ABC for the FDTD Method Based on Wave Equation. IEEE Microw. Wirel. Components Lett. 2022, 32, 811–814. [Google Scholar] [CrossRef]
  31. Martin, R.; Komatitsch, D. An unsplit convolutional perfectly matched layer technique improved at grazing incidence for the viscoelastic wave equation. Geophys. J. Int. 2009, 179, 333–344. [Google Scholar] [CrossRef] [Green Version]
  32. Martin, R.; Couder-Castaneda, C. An improved unsplit and convolutional perfectly matched layer absorbing technique for the navier-stokes equations using cut-off frequency shift. CMES-Comput. Model. Eng. Sci. 2010, 63, 47–77. [Google Scholar]
  33. Martin, R.; Komatitsch, D.; Ezziani, A. An unsplit convolutional perfectly matched layer improved at grazing incidence for seismic wave propagation in poroelastic media. Geophysics 2008, 73, T51–T61. [Google Scholar] [CrossRef] [Green Version]
  34. Arroyo, M.; Couder-Castañeda, C.; Trujillo-Alcantara, A.; Herrera-Diaz, I.E.; Vera-Chavez, N. A performance study of a dual Xeon-Phi cluster for the forward modelling of gravitational fields. Sci. Program. 2015, 2015, 316012. [Google Scholar] [CrossRef] [Green Version]
  35. Dagum, L.; Menon, R. OpenMP: An industry standard API for shared-memory programming. IEEE Comput. Sci. Eng. 1998, 5, 46–55. [Google Scholar] [CrossRef] [Green Version]
  36. Mohammadi, S.; Karami, H.; Azadifar, M.; Rachidi, F. On the Efficiency of OpenACC-aided GPU-Based FDTD Approach: Application to Lightning Electromagnetic Fields. Appl. Sci. 2020, 10, 2359. [Google Scholar] [CrossRef] [Green Version]
  37. Liu, S.; Chen, C.; Sun, H. Fast 3D transient electromagnetic forward modeling using BEDS-FDTD algorithm and GPU parallelization. Geophysics 2022, 87, E359–E375. [Google Scholar] [CrossRef]
  38. Medina, I.; Couder-Castaneda, C.; Hernandez-Gomez, J.; Saucedo-Jimenez, D. On Waveguides Critical Corona Breakdown Thresholds Dependence on the Collision Frequency between Electrons and Air. IEEE Trans. Plasma Sci. 2019, 47, 1611–1615. [Google Scholar] [CrossRef]
  39. Hoefler, T.; Belli, R. Scientific benchmarking of parallel computing systems: Twelve ways to tell the masses when reporting performance results. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Austin, TX, USA, 15–20 November 2015; pp. 1–12. [Google Scholar]
Figure 1. Mesh used in the algorithm and notation.
Figure 2. CPML Absorbing Boundary Condition implementation options.
Figure 3. Flux diagram showing the four cycles necessary if the convolutional variables are allocated in all the computational domain. The cycles are labeled as V1, V2, V3 and V4.
Figure 4. Flux diagram showing the fourteen cycles necessary when the convolutional variables are allocated only in the absorption region, saving memory but increasing the number of cycles. The cycles are numbered as M1–M14.
Figure 5. Flux diagram of the design in OpenACC maintaining persistent data regions between loops.
Figure 6. Free space propagation experiment. The orange dot indicates the position of the source, while green dots show the location of the control points.
Figure 7. Comparison between the computation times obtained for propagation in open free space at 2.45 GHz.
Figure 8. Comparison between the computation times (×10) obtained for propagation in open free space at 5.00 GHz.
Figure 9. Comparison between the computation times (×10) obtained for propagation in open free space at 20.00 GHz.
Figure 10. Snapshots of the electric field distribution (E_z) for the free space at 2.45 GHz frequency. (A–C) show the images at 150, 650, and 700 time steps, respectively. The simulation was carried out during 1160 time steps.
Figure 11. Comparison of the solution along the time obtained in the GPU vs. CPU in viewer 3. Very similar solutions were obtained for viewers 1 and 2. The solution can only be compared when the information is transferred to the CPU.
Figure 12. Absolute error between the CPU (double precision in serial) vs. the GPU (single precision) for the open space propagation at the last time step at 2.45 GHz (A), 5.00 GHz (B), and 20.00 GHz (C).
Figure 13. Configuration for the parabolic plate propagation scenario. The green dots indicate the location of the control points. The source is located inside the hood feeder.
Figure 14. Comparison between the execution times obtained for the parabolic propagation at 2.45 GHz.
Figure 15. Comparison between the execution times (×10) obtained for the parabolic propagation at 5.00 GHz.
Figure 16. Comparison between the execution times (×10^2) obtained for the parabolic propagation at 20.00 GHz.
Figure 17. The parabolic plate experiment's electric field distribution (E_z) is depicted. Snapshots at 400 (A), 750 (B), and 1160 (C) time steps are shown. The simulation lasted 1160 time steps, and the source was turned off at 580. Very similar results are obtained for the 5.00 GHz and 20 GHz frequencies.
Figure 18. Absolute error between the CPU (double precision in serial) vs. the GPU (single precision) for the parabolic plate experiment at the last time step at 2.45 GHz (A), 5.00 GHz (B), and 20.00 GHz (C).
Figure 19. Comparison of the solution along the time obtained in the GPU vs. CPU in viewer 3. Very similar solutions were obtained for all other viewers. The solution can only be compared when the information is transferred to the CPU.
Figure 20. Configuration for the nanowaveguide simulation scenario. The orange dots indicate the sources, and the green dots the position of the numerical control points. The simulation lasted 3000 time steps, and the sources were turned off at 500.
Figure 21. The nanowaveguide experiment's electric field distribution (E_z) is depicted. Snapshots at 500 (A), 1500 (B) and 3000 (C) time steps are shown.
Figure 22. Comparison between the execution times (×10) obtained for the nano-waveguide simulation.
Figure 23. The absolute error between the CPU (double precision in serial mode) and the GPU (single precision) for the nanowaveguide experiment.
Figure 24. Comparison of the solution along the time obtained in the GPU vs. CPU in the control point 3. Very similar solutions were obtained for all other viewers. The solution can only be compared when the information is transferred to the CPU.
Table 1. Intensities corresponding to the cycles depicted in the diagram of Figure 3. Every loop in the time cycle is numbered as Vi.

Loop   Intensity
V1     1.62
V2     1.64
V3     1.29
V4     3.67

Table 2. Intensities corresponding to the cycles depicted in the diagram of Figure 4. Every loop in the time cycle is numbered as Mi.

Loop   Intensity     Loop   Intensity     Loop   Intensity
M1     2.00          M6     0.67          M11    1.29
M2     2.25          M7     2.00          M12    1.29
M3     2.25          M8     2.25          M13    0.60
M4     1.29          M9     0.60          M14    3.67
M5     1.29          M10    2.00
Table 3. Experiment setup for the free-space scenario. Iterations refers to the number of time steps performed; Disk, to the number of iterations between transfers of the information to the CPU for saving to disk; Source, to the time step at which the source is turned off.

Case             Mesh Size     CPML Thickness   Δx, Δy (m)       Δt (s)            Iterations   Disk   Source
2.45 GHz case    3269 × 3269   20               6.1182 × 10^-3   1.4286 × 10^-11   1160         50     580
5.00 GHz case    5004 × 5004   20               2.9979 × 10^-3   7.0004 × 10^-12   3480         100    1160
20.00 GHz case   5338 × 5388   20               7.4948 × 10^-4   1.7501 × 10^-12   3480         150    380

Table 4. Memory used for each scenario in the free-space propagation, with and without memory saving and using single and double precision.

Version                          2.45 GHz   5.00 GHz   20.00 GHz
With Memory Saving (double)      1155 MB    2585 MB    2923 MB
Without Memory Saving (double)   1483 MB    3353 MB    3795 MB
With Memory Saving (single)      635 MB     1337 MB    1519 MB
Without Memory Saving (single)   803 MB     1721 MB    1959 MB
Table 5. Execution times obtained in serial form with the CPU, expressed in mm:ss, for the open free-space propagation.

Version             2.45 GHz   5.00 GHz   20.00 GHz
CPU-M (reference)   04:08      28:21      32:13
CPU-V (reference)   04:35      32:32      37:06

Table 6. Execution times obtained using the RTX 3060 and Titan RTX, expressed in mm:ss, for the free-space propagation.

Version                     2.45 GHz   5.00 GHz   20.00 GHz
GPU-MD (double) RTX 3060    00:49      05:22      06:05
GPU-VD (double) RTX 3060    01:56      16:17      18:57
GPU-MS (single) RTX 3060    00:22      02:45      02:46
GPU-VS (single) RTX 3060    01:10      09:28      10:41
GPU-MD (double) Titan RTX   00:46      05:16      05:53
GPU-VD (double) Titan RTX   02:10      14:33      15:50
GPU-MS (single) Titan RTX   00:21      04:07      04:39
GPU-VS (single) Titan RTX   01:00      10:24      11:36
Table 7. Speed-up factors calculated using as a base the execution time utilized by CPU-MD, for the open free-space simulation.

Version                     2.45 GHz   5.00 GHz   20.00 GHz
CPU-MD (double) RTX 3060    1.00X      1.00X      1.00X
CPU-VD (double) RTX 3060    0.90X      0.87X      0.87X
GPU-MD (double) RTX 3060    5.06X      5.28X      5.30X
GPU-VD (double) RTX 3060    2.14X      1.74X      1.70X
GPU-MS (single) RTX 3060    11.27X     10.31X     11.64X
GPU-VS (single) RTX 3060    3.54X      2.99X      3.02X
GPU-MD (double) Titan RTX   5.39X      5.38X      5.48X
GPU-VD (double) Titan RTX   1.91X      1.95X      2.03X
GPU-MS (single) Titan RTX   11.81X     6.89X      6.93X
GPU-VS (single) Titan RTX   4.13X      2.73X      2.78X

Table 8. Comparison of the reference solution versus the solution obtained in the GPU (fastest, in single precision) for the 2.45 GHz, 5.00 GHz, and 20.00 GHz frequencies, using the MSE. The values used for the error calculation are transferred every fixed number of iterations, as shown in Figure 11.

Control Points   2.45 GHz           5.00 GHz           20.00 GHz
Point 1          3.020517 × 10^-7   1.073611 × 10^-7   5.041632 × 10^-7
Point 2          5.257362 × 10^-7   1.189589 × 10^-7   2.217147 × 10^-7
Point 3          4.047071 × 10^-7   9.549111 × 10^-7   3.002896 × 10^-7
Table 9. Experiment setup for the parabolic reflector scenario. Iterations refers to the number of time steps performed; Disk, to the number of iterations between transfers of the information to the CPU for saving to disk; Source, to the time step at which the source is turned off.

Case             Mesh Size     CPML Thickness   Δx, Δy (m)       Δt (s)            Iterations   Disk   Source
2.45 GHz case    3269 × 3269   20               6.1182 × 10^-3   1.4286 × 10^-11   2000         500    580
5.00 GHz case    5004 × 5004   20               2.9979 × 10^-3   7.0004 × 10^-12   4000         100    870
20.00 GHz case   5338 × 5338   20               7.4948 × 10^-4   1.7501 × 10^-12   8000         4000   870

Table 10. Memory used in the GPU for the parabolic reflector propagation.

Version           2.45 GHz   5.00 GHz   20.00 GHz
GPU-MD (double)   1155 MB    2585 MB    12,648 MB
GPU-VD (double)   1465 MB    3335 MB    16,488 MB
GPU-MS (single)   635 MB     1337 MB    6329 MB
GPU-VS (single)   785 MB     1703 MB    8326 MB
Table 11. Computing times obtained in serial form with the CPU, expressed in mm:ss (h:mm:ss for the 20.00 GHz case), for the parabolic reflector propagation.

Version   2.45 GHz   5.00 GHz   20.00 GHz
CPU-M     07:24      32:46      5:36:39
CPU-V     07:39      36:56      6:27:05

Table 12. Execution times obtained using the RTX 3060 and Titan RTX, expressed in mm:ss (h:mm:ss for the longest runs), for the parabolic propagation.

Version                     2.45 GHz   5.00 GHz   20.00 GHz
GPU-MD (double) RTX 3060    01:37      07:38      Out of memory
GPU-VD (double) RTX 3060    03:19      15:31      Out of memory
GPU-MS (single) RTX 3060    00:42      03:31      36:21
GPU-VS (single) RTX 3060    01:54      11:05      2:05:45
GPU-MD (double) Titan RTX   01:17      05:49      59:59
GPU-VD (double) Titan RTX   03:18      15:35      47:31
GPU-MS (single) Titan RTX   00:37      04:42      47:11
GPU-VS (single) Titan RTX   01:42      11:51      2:01:41
Table 13. Speed-up factors calculated using as a base the execution time utilized by CPU-MD, for the parabolic propagation.

Version                     2.45 GHz   5.00 GHz   20.00 GHz
CPU-MD (double) RTX 3060    1.00X      1.00X      1.00X
CPU-VD (double) RTX 3060    0.90X      0.89X      0.87X
GPU-MD (double) RTX 3060    4.58X      4.29X      Out of memory
GPU-VD (double) RTX 3060    2.23X      2.11X      Out of memory
GPU-MS (single) RTX 3060    10.57X     9.32X      9.27X
GPU-VS (single) RTX 3060    3.89X      2.96X      2.68X
GPU-MD (double) Titan RTX   5.77X      5.63X      5.62X
GPU-VD (double) Titan RTX   2.24X      2.10X      7.09X
GPU-MS (single) Titan RTX   12.00X     6.97X      7.14X
GPU-VS (single) Titan RTX   4.35X      2.77X      5.46X

Table 14. MSE calculated for each control point compared with the reference solution (CPU). The values used for the error calculation are transferred every fixed number of iterations, as shown in Figure 19.

Control Points   2.45 GHz         5.00 GHz         20.00 GHz
Point 1          7.5934 × 10^-7   3.0083 × 10^-7   4.3917 × 10^-7
Point 2          3.6161 × 10^-7   1.8743 × 10^-7   3.4687 × 10^-7
Point 3          4.6658 × 10^-7   2.7481 × 10^-7   4.1319 × 10^-7
Point 4          7.2940 × 10^-7   5.0781 × 10^-7   6.6099 × 10^-7
Point 5          3.6715 × 10^-7   3.1612 × 10^-7   3.8112 × 10^-7
Point 6          3.5750 × 10^-7   1.6050 × 10^-7   3.0605 × 10^-7
Table 15. Experiment setup for the coplanar nanowaveguide scenario. Iterations refers to the number of time steps performed; Disk, to the number of iterations between transfers of the information to the CPU for saving to disk; Source, to the time step at which the source is turned off.

Case           Mesh Size     CPML Thickness   Δx, Δy (m)       Δt (s)            Iterations   Disk   Source
100 THz case   4337 × 4337   20               1.4990 × 10^-7   3.5002 × 10^-16   3000         500    200

Table 16. Memory used for the nanowaveguide propagation.

Version           100 THz
GPU-MD (double)   1961 MB
GPU-VD (double)   2537 MB
GPU-MS (single)   1025 MB
GPU-VS (single)   1313 MB
Table 17. Execution times obtained using the CPU in serial form, expressed in h:mm:ss, for the nanowaveguide propagation.

Version   100 THz
CPU-M     0:18:23
CPU-V     0:20:23

Table 18. Execution times obtained using the RTX 3060 and Titan RTX, expressed in mm:ss, for the nanowaveguide propagation.

Version                     100.00 THz
GPU-MD (double) RTX 3060    00:49
GPU-VD (double) RTX 3060    01:56
GPU-MS (single) RTX 3060    00:22
GPU-VS (single) RTX 3060    01:10
GPU-MD (double) Titan RTX   00:46
GPU-VD (double) Titan RTX   02:10
GPU-MS (single) Titan RTX   00:21
GPU-VS (single) Titan RTX   01:00
Table 19. Speed-up factors calculated using as a base the execution time utilized by CPU-MD, for the nanowaveguide simulation.

Version                     100 THz
CPU-MD (double) RTX 3060    1.00X
CPU-VD (double) RTX 3060    0.90X
GPU-MD (double) RTX 3060    5.06X
GPU-VD (double) RTX 3060    2.05X
GPU-MS (single) RTX 3060    11.61X
GPU-VS (single) RTX 3060    4.03X
GPU-MD (double) Titan RTX   5.60X
GPU-VD (double) Titan RTX   2.05X
GPU-MS (single) Titan RTX   11.61X
GPU-VS (single) Titan RTX   4.03X

Table 20. MSE for each control point between the reference solution (CPU, double precision) and the solution obtained in the GPUs at 100 THz. The values used for the error calculation are transferred every fixed number of iterations, as shown in Figure 24.

Control Points   100 THz
Point 1          3.09058 × 10^-7
Point 2          1.02595 × 10^-7
Point 3          1.02590 × 10^-7
Point 4          1.29402 × 10^-7
Point 5          1.29402 × 10^-7
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
