1. Introduction
Low-frequency electromagnetic problems in the so-called magneto-quasi-static (MQS) limit (such as, for instance, eddy current problems) can be conveniently solved through numerical models derived from integral formulations, with the advantage of limiting the meshing to the conducting regions only. For real-world applications, however, the final formulation typically leads to large linear systems of equations, and suitable strategies are required to lower the computational cost of the numerical solution, such as those based on the fast Fourier transform [1], the fast multipole method [2], and H-matrices [3]. However, there are cases of interest where the problem dimensions are not so large as to require such techniques. For these cases, a direct solution method is preferable, since it provides intrinsic robustness and accuracy at a modest increase in computational cost.
The final goal of this work is the optimization of the solution of integral formulations of eddy current problems solved with direct methods, in applications where long transients must be studied and where effective preconditioning is not possible. The applications targeted here come from the field of nuclear fusion machines, and specifically refer to the evaluation of the current density induced in the conducting structures as a consequence of the electromagnetic transients associated with plasma disruptive events. The proposed optimization relies on the parallelization of the tasks associated with the numerical implementation of the integral formulation, using a hybrid OpenMP–MPI approach.
The classical approach to parallel computing was based on the assumption that systems were homogeneous, with every node sharing the same characteristics: in this case, the computation time required by each node would be the same, provided the same tasks were assigned to each node. The present approach is based on the use of high-performance computational resources, made of multicore cluster systems [4], where variability and heterogeneity between cores and between nodes introduce major issues, such as load imbalance, that strongly affect the effectiveness of parallel computation [5]. Optimizing the resources is of primary importance for an efficient use of supercomputers, which are usually shared among several users. Traditional load-balancing techniques, such as those based on a global re-partitioning of the threads, require a priori estimates of the time needed to complete each single task or, alternatively, require a dynamic reallocation of the loads based on a real-time check of the status of the nodes [6].
More recently, this problem has been addressed by using hybrid parallel programming paradigms, such as those that adopt both a standardized message-passing communication paradigm, such as the Message Passing Interface, MPI [7], and a shared-memory application programming interface, such as OpenMP [8,9]. Such a hybrid programming approach has been used in several fields, such as computational fluid dynamics [10,11], chemical simulations [12,13], geotechnics [14], high-efficiency video coding [15], and electromagnetic simulations [16,17,18], due to its very interesting performance [19]. As pointed out in [19,20], the advantages of using a hybrid OpenMP–MPI programming approach are many, as follows: (i) suitability to the architecture of modern supercomputers, with interconnected shared-memory nodes; (ii) improved scalability; (iii) optimization of the total consumed memory; (iv) reduction in the number of MPI processes. On the other hand, a few disadvantages can be observed, as follows: (i) complexity of the implementation; (ii) possibly limited total gains in performance. These two issues require special care in the implementation of a hybrid approach, to handle the extra costs added to the existing code.
The main contribution of this paper is that of defining and assessing the implementation of such an approach for the solution of the integral formulation of eddy current problems by means of a Galerkin-based finite element method, namely the CARIDDI code, which is widely used in the fusion community [21]. The proposed solution fully exploits the main advantages of both paradigms: the MPI approach is more efficient in the process-level parallelization, whereas the OpenMP approach optimizes the lower-level parallelization. The performance improvement is not only related to memory saving, but also to the reduction in communication [22], with a consequent dramatic reduction in the latency time for the job to enter the running state. The effectiveness of the proposed approach is witnessed by the speed-up values obtained for the analyzed real-world applications, of about ×30 and ×50.
The paper is organized as follows: Section 2 provides a short description of the numerical formulation of the MQS problem, and the tasks associated with building and solving the final numerical system are discussed. In Section 3, two different approaches for parallelizing the assembly of the system's matrices are presented and compared: a pure MPI approach and the proposed hybrid OpenMP–MPI one. In Section 4, a benchmark test to validate the hybrid approach is first provided. Then, the analysis of case studies referring to large nuclear fusion machines is carried out, comparing the computational performance of the pure MPI and the hybrid OpenMP–MPI approaches.
2. Numerical Formulation of the Magneto-Quasi-Static Problem
An electromagnetic problem in the so-called magneto-quasi-static (MQS) limit is studied here; namely, a low-frequency eddy current problem in non-magnetic conductors: these are the typical conditions encountered in the applications analyzed in this paper, which are related to plasma fusion machines, such as the fusion reactor ITER [23]. The mathematical model may be cast in integral form by introducing a vector potential, A, and a scalar one, ϕ, to define the electric field, E, and the magnetic flux density, B, as follows:
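A standard form of this potential representation, given here as a sketch consistent with the notation above, is

\[
\mathbf{E} = -\frac{\partial \mathbf{A}}{\partial t} - \nabla \phi, \qquad \mathbf{B} = \nabla \times \mathbf{A}.
\]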
In this way, Faraday's law is implicitly imposed. The vector potential is calculated from the current density, J, flowing inside the conductors' region, VC, and from the density, JS, of the currents located outside that region, as follows:
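Under the usual free-space assumptions of the integral MQS formulation, a sketch of this relation is

\[
\mathbf{A}(\mathbf{r},t) = \frac{\mu_0}{4\pi}\int_{V_C}\frac{\mathbf{J}(\mathbf{r}',t)}{\left|\mathbf{r}-\mathbf{r}'\right|}\,dV' + \frac{\mu_0}{4\pi}\int_{V_S}\frac{\mathbf{J}_S(\mathbf{r}',t)}{\left|\mathbf{r}-\mathbf{r}'\right|}\,dV',
\]

where VS denotes, for illustration, the region occupied by the source currents.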
The integral formulation of the problem comes from the imposition of the constitutive equation (Ohm's law) in the ohmic conductors by means of the weighted residual approach, as follows:
The subspace S contains the functions that are solenoidal in VC and satisfy the interface condition for the current at the boundary (zero normal component on the boundary of VC). By replacing (1) and (2) in (3), the final weak form is obtained, as follows:
The above formulation is implemented in the code CARIDDI [21], which is here adopted to perform the numerical analysis of the considered case studies. The numerical model implemented in CARIDDI solves Equation (4) by applying Galerkin's method, starting from the following decomposition of the unknown current density in edge elements, Nk, as follows:
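A sketch of this decomposition is

\[
\mathbf{J}(\mathbf{r},t) = \sum_{k} I_k(t)\,\mathbf{N}_k(\mathbf{r}).
\]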
By using (5) in (4), under the assumption of time-harmonic signals, the vector, I, of the unknown coefficients, Ik, can be calculated by solving the following linear system:
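For a time-harmonic excitation at angular frequency ω, the system takes the typical impedance form (a sketch; the actual sign and normalization conventions of (6) may differ):

\[
\left(\mathbf{R} + j\omega\,\mathbf{L}\right)\mathbf{I} = \mathbf{V}_0.
\]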
The resistance, R, and inductance, L, matrices in (6) are calculated by means of the following integrals in the conducting regions, VC:
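In the usual Galerkin form, a sketch of these entries (with η the resistivity) is

\[
R_{jk} = \int_{V_C}\mathbf{N}_j(\mathbf{r})\cdot\eta\,\mathbf{N}_k(\mathbf{r})\,dV, \qquad
L_{jk} = \frac{\mu_0}{4\pi}\int_{V_C}\int_{V_C}\frac{\mathbf{N}_j(\mathbf{r})\cdot\mathbf{N}_k(\mathbf{r}')}{\left|\mathbf{r}-\mathbf{r}'\right|}\,dV\,dV'.
\]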
whereas V0 is a known vector that is computed from the externally imposed sources, as follows:
Once the numerical problem (6) is solved, the flux density, B, can be obtained at any given position as the product of a matrix, Q, and the solution vector, I, where the generic entry of Q is defined as follows:
The solution of the linear system (6) via a direct method requires a first step of matrix assembly, followed by a final step of matrix inversion. Specifically, the following four tasks may be identified:
- (a) assembly of the known-terms vector, V0;
- (b) assembly of the resistance and inductance matrices, R and L;
- (c) assembly of the flux density matrix, Q;
- (d) inversion of the impedance matrix defined in (6), via factorization and back substitution.
In real-world applications such as those analyzed in this paper, related to fusion machines, the high computational burden requires the use of efficient parallel computing techniques, as discussed in the next section. In this paper, we focus on the parallelization of the assembly phase, whereas the matrix inversion is performed by using the best available routines, such as the Scalapack ones [24]. The "inversion phase" (task (d)) is usually the most time consuming when using a direct solver, but it is worth noticing that, in many practical cases, tasks (a)–(c) in the "assembly phase" can also be very demanding in terms of both CPU and RAM resources. Indeed, if we denote with N the number of degrees of freedom of the numerical problem, the time required for the inversion, Tinv, scales as N^3, whereas the assembly time, Tass, scales as N^2. Following this consideration, one could be led to conclude that the assembly time is always lower than the inversion one. However, in many applications, the actual CPU times for the two phases are comparable, since the inversion cost depends only on the computational resources, whereas the assembly cost also depends on the required accuracy. Indeed, there are cases of practical interest among MQS problems where the required accuracy leads to values of Tass comparable to, or even greater than, Tinv. This happens, for instance, when the density of the electrical current varies rapidly in space and time, and consequently a high accuracy in the evaluation of the inductance matrix is required. Let us note that a higher accuracy may be achieved in two ways, as follows: (i) by refining the mesh, hence increasing N; (ii) by keeping the same mesh (hence the same N), but increasing the number of Gauss integration points, n. It is evident that option (i) is not preferable when direct methods are used, given the cost of the inversion. Let us then study the sensitivity of the assembly cost to increasing values of n. The simulation times required for the inversion of (6) and for the assembly of the matrix, L, are reported in Figure 1, as a function of N. For the lowest accuracy (n = 1), Tinv is always greater than Tass, but for a better accuracy (n > 1), the assembly time is longer than the inversion one for N below a threshold N* that increases with n: it is 6500 for n = 2, 70,000 for n = 3, and 351,000 for n = 4.
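The crossover can be made explicit with a simple cost model (a sketch, not a result of the paper): writing Tinv ≈ a N^3 and Tass ≈ b(n) N^2, with b(n) increasing with the number of Gauss points, the two times coincide at

\[
N^{*} = \frac{b(n)}{a},
\]

so that the assembly dominates for N < N*, consistently with the behaviour observed in Figure 1.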
Another source of computational burden is the calculation of the matrix L, i.e., task (b): the double integration in (8) requires a proper handling of the singularities in the kernels, in order to preserve the positive definiteness of L. An adaptive procedure for integrating these singularities is given in [25].
3. Parallel Computing Based on a Hybrid OpenMP–MPI Approach
3.1. Parallelization Strategy Based on MPI Approach: Description and Limits
The numerical model of the MQS problem described in the previous section has been implemented in a parallel-computing scheme based on a pure MPI approach, which is briefly summarized here (further details are given in [26]).
As for task (a), when assembling the known-terms vector, V0, the sources are the impressed currents, given on an external mesh (the source mesh). The computation of the entries of V0 requires a double loop: the first on the mesh elements, the second on the source mesh elements (as shown in Algorithm 1). Actually, V0 is a small-size vector; hence, although each MPI process allocates the whole vector in V0_loc, only a part of its memory is used. A reduction operation is then required in order to gather the final vector.
Algorithm 1 Evaluation of V0, pure MPI approach
// Initialization
Split source points among MPI processes
// Parallel pure MPI computation
for each MPI process do
  for each mesh element iel do
    for each source mesh element iel0 do
      compute the V0 contribution between iel and iel0
    end for
  end for
end for
// All-reduce of the local contributions on the global V0
mpi_allreduce(V0_loc, V0_global)
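As a minimal C sketch of this pure MPI pattern (hypothetical names and signatures, not the actual CARIDDI routines), each process accumulates its share of contributions into a full-size local vector, and a single all-reduce produces the global V0 on every process:

/* Sketch of the pure MPI assembly of V0 (hypothetical names). */
#include <mpi.h>
#include <stdlib.h>

void assemble_V0(double *V0_global, int n_dof, int n_elems, int n_src_elems,
                 void (*add_contribution)(int iel, int iel0, double *v))
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* each process allocates the whole (small) vector, as noted in the text */
    double *V0_loc = calloc((size_t)n_dof, sizeof(double));

    /* the source elements are split among the MPI processes */
    for (int iel = 0; iel < n_elems; ++iel)
        for (int iel0 = rank; iel0 < n_src_elems; iel0 += size)
            add_contribution(iel, iel0, V0_loc);   /* accumulate into V0_loc */

    /* gather the final vector on all processes */
    MPI_Allreduce(V0_loc, V0_global, n_dof, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    free(V0_loc);
}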
The assembly cost for task (b) is mainly related to the inductance matrix L, R being a sparse matrix [26]. The choice of the most suitable parallelization approach for computing the L matrix comes from a trade-off between the needs of optimizing the costs of computation, memory writing, and communication. For this purpose, two main strategies have been investigated in the past by the authors [26]: in the first one, the element-by-element interactions needed to compute the integrals in (8) are distributed over all the available MPI processes; in the second one, each process takes care of building a sub-block of the L matrix. Overall, the numerical experiments in [26] show that the first approach is the most effective one, given the characteristics of the transient electromagnetic problems of interest in the study of nuclear fusion machines. For this reason, we have adopted such an approach, which is here implemented by means of a hybrid OpenMP–MPI paradigm.
Looking at its definition in (8), the assembly of L requires two nested loops on the mesh elements, to compute the element–element interactions, as shown in Algorithm 2. To this end, a critical point is the need to accumulate and store these interactions correctly. Three dummy memory areas are defined, as follows:
- MDMESH, used to store the geometrical mesh information. These data have a size of the order of the number of mesh elements;
- MDL, used to temporarily store the entries Lij produced during the main loop of element–element interactions. This memory has the same size as MLOC, that is, the chunk memory required for the matrix storage at each node;
- MDEE, used to carry on the main loop of element–element interactions: such a memory is of the order of the square of the number of degrees of freedom per element.
Algorithm 2 Assembly of L matrix, pure MPI approach
Equally distribute the element–element interactions among the MPI processes
// Initialization
for each MPI process do
  Allocate dummy memory MdMesh
  Allocate dummy memory MdEE
  Allocate dummy memory MdL
end for
Broadcast the geometrical information
// Parallel pure MPI computation
for each MPI process do
  for each element iel1 do
    for each element iel2 do
      Compute the local iel1–iel2 interactions
      Accumulate the local interactions in MdL
    end for
  end for
end for
// Final communication step
for each MPI process do
  Allocate local memory Mloc
end for
for each MPI process do
  Send and receive the local matrices MdL
  Accumulate in Mloc
end for
deallocate(MdMesh, MdEE, MdL)
As for task (c), the assembly of the flux density matrix, Q, is obtained through a double loop, the first on the mesh elements, and the second on the requested field points, as shown in Algorithm 3. In a pure MPI approach, the set of field points is equally distributed among the MPI processes. Memory is then allocated in each MPI process.
Algorithm 3 Evaluation of the Q matrix, pure MPI approach
// Initialization
Equally distribute the field points among the MPI processes
Broadcast the geometrical information
// Parallel pure MPI computation
for each MPI process do
  for each mesh element iel do
    for each field point ifp in the set belonging to the current process do
      Compute the magnetic field or vector potential
      Accumulate the values
    end for
  end for
end for
// Final all-reduce
mpi_allreduce(MagField, VectorPotential)
As a general comment, we stress that, with the pure MPI paradigm, it may not be possible to launch a number of MPI processes equal to the number of physical cores available at the node, due to the limited per-node memory and to the need for each MPI process to allocate not only the memory required to store its local portion of the matrices, but also the dummy memory required to carry on the computation. For this reason, many of these cores are forced to stay idle during the job execution, because there is no more memory to allocate. This fact limits the maximum achievable speed-up.
For example, to compute the L matrix, the mesh information should be available to each MPI process in the dummy memory MdMesh defined above: in practical applications, this memory easily reaches a size of some GBs. In the following, we refer to the typical case of homogeneous supercomputing systems, made of nodes that are identical in terms of number of cores, memory, and operating frequency. In order to assess the performance, let us define the following quantities:
- NP—number of MPI processes;
- NN—number of cluster nodes;
- Mtot—total memory required to store the global matrix (for example, L);
- Mnode—total memory available at each node;
- MD—dummy memory per MPI process;
- MDT—dummy memory per thread;
- Mav—memory available at each node (for all that is needed at the node);
- M′av—actual available memory at each node.
In the ideal case, the maximum computation speed-up would be equal to the total number of MPI processes, NP. At each node, the number of running processes is NP/NN, and the memory available at the node, Mav, must be shared among them. Note that the available memory must be used not only for the matrix storage (chunk memory), but also for any other data required by the processes. Unfortunately, the dummy memory, MD, is replicated NP/NN times, due to the distributed memory approach adopted here. Therefore, the actual available memory at each node, M′av, is reduced with respect to Mav, and must be larger than the requested chunk memory, Mtot/NN. These conditions are summarized in (11), which sets the limit for the pure MPI approach by imposing a maximum value on NP:
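With the above notation, a plausible form of this condition (a sketch of (11), under the stated assumptions) is

\[
M'_{\mathrm{av}} = M_{\mathrm{av}} - \frac{N_P}{N_N}\,M_D \;\ge\; \frac{M_{\mathrm{tot}}}{N_N}
\quad\Longrightarrow\quad
N_P \;\le\; \frac{N_N\,M_{\mathrm{av}} - M_{\mathrm{tot}}}{M_D}.
\]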
3.2. Parallelization Strategy Based on Hybrid OpenMP–MPI Approach
A hybrid OpenMP–MPI approach is here proposed to overcome the abovementioned limits of the pure MPI approach. The underlying idea is to allocate only one instance of the memory in each node, and then to use all the physical cores for the computation. This means that allocation must be logically separated from computation, which is not possible in a pure MPI paradigm, where cores and processes are mapped one-to-one. The solution is provided by the OpenMP environment directives: the node computation is broken down among the physical cores (threads) present in the node, which share the same memory on the node thanks to the shared memory model implemented by OpenMP.
A hybrid OpenMP–MPI approach may then be implemented, based on the following two steps:
- (1) as in the pure MPI paradigm, the overall computation is partitioned into MPI processes, limiting the number of processes per node (in the ideal case, this number would be 1);
- (2) the computational burden of each MPI process at node level is divided (again) among several threads, in accordance with the characteristics of the OpenMP paradigm.
In order to investigate the memory allocation scheme of the hybrid approach, we first recall that, in the OpenMP environment, each thread should allocate its own local memory in order to carry out its share of the computation. Note that the size of the dummy memory per thread, MDT, is much lower than MD. In order to speed up the inter-thread computation and to fully benefit from modern CPU architectures, MDT should match the local core cache size (cache coherence, e.g., [27]). Let us now define NT as the number of threads per node: if we use one MPI process per node, and after considering that MD is allocated only once, Condition (11) on the available memory for the pure MPI approach is replaced, for the hybrid OpenMP–MPI one, by the following:
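A plausible form of this condition (a sketch of (12), with one MPI process and NT threads per node) is

\[
M'_{\mathrm{av}} = M_{\mathrm{av}} - M_D - N_T\,M_{DT} \;\ge\; \frac{M_{\mathrm{tot}}}{N_N},
\]

where the dummy memory MD is now allocated only once per node, and only the much smaller per-thread memory MDT is replicated.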
In view of comparing the two approaches, it is convenient to introduce the following speed-up parameters: SMPI (for the pure MPI approach) and Shyb (for the hybrid OpenMP–MPI one), which can be expressed from (11) and (12), as in (13). These relations set the upper bounds for the two speed-up parameters, which depend linearly on the number of nodes, NN.
Figure 2 shows the corresponding straight lines, whose intersection occurs at a number of nodes denoted here as NN*. For NN > NN*, it is always Shyb > SMPI, and the distance between the two bounds increases monotonically, as shown in Figure 2.
Indeed, this result can be demonstrated by considering the ratio between the slopes of the two straight lines obtained from (13): this ratio is greater than one, since the dummy memory per thread, MDT, is much smaller than the dummy memory per MPI process, MD.
As clearly shown in Figure 2, this hybrid OpenMP–MPI approach provides two advantages over the pure MPI approach, as follows:
- (i) resource saving—for a fixed speed-up, S, the required number of nodes, NN, is lower;
- (ii) speed-up advantage—for fixed resources (NN), the speed-up, S, is higher.
In addition, as pointed out in the Introduction, the approach also reduces the communication burden, because the communication among the threads of a node takes place through the shared memory, which is much faster than inter-process message passing.
It is worth noticing that it is not trivial to implement the OpenMP–MPI paradigm, since three critical issues arise, as follows:
- (i) Need for local thread memory—the main loop is distributed among all the available threads, each of them requiring its own local memory to work properly. This memory is related to mesh element information (e.g., the curls of the elements, the shape functions, geometrical information, element local output, and so on). In any case, it is usually very small (a few GBs) and, in cases of practical interest, much smaller than the global output memory required at the node.
- (ii) Global matrix memory update—once a single thread has completed its own job, the thread-local output must be accumulated into the global memory of the node. This is a non-trivial operation and a bottleneck for the method, because the local update must be carried out by each thread with sequential access, so as to guarantee consistency. To this end, the CRITICAL OpenMP directive is used [6].
- (iii) Global input memory access—this access is critical, being shared among the threads, and it may cause cache misses.
Of course, all of the above issues cause some degradation of the performance; nevertheless, as shown in Section 4, they do not significantly affect the obtained speed-up parameters.
Let us now describe the changes introduced by the proposed hybrid approach in the algorithms for assembling the matrices and vectors of the numerical problem (6). In building the known-terms vector, V0, the external loop (over the mesh elements) is handled by all of the threads in the hybrid implementation. The local memory V0_loc is automatically obtained by means of OpenMP reduction operations (Algorithm 4).
The new algorithm used to assemble the matrix L is described in Algorithm 5. In the MQS applications, such as those analyzed in this paper, the fully populated matrix L is by far the largest quantity. The entries of this matrix are arranged in a 2D block-cyclic fashion, in view of using the Scalapack inversion routines [24,26]. To this end, the entries of MdL are reorganized into the local memory Mloc in the last step (final communication step). Once again, we stress the advantage of the hybrid approach over the pure MPI one: since a single MPI process per node is typically used, the memories MdMesh and MdL (by far the largest ones) are allocated only once in the node, instead of being allocated by each process as in the pure MPI approach.
Finally, the Q matrix can be very large if the computation of the flux density is required at a huge number of points and/or in each element of the mesh. With the hybrid approach, the external element loop is distributed among the threads of the current MPI process; see Algorithm 6.
Algorithm 4 Evaluation of V0, hybrid OpenMP–MPI approach
// Initialization
Split source points among MPI processes
// Hybrid MPI–OpenMP computation
for each MPI process do
  for each mesh element iel do
    #pragma omp parallel for reduction(+:V0_loc)
    for each source mesh element iel0 do
      compute the V0_loc contribution between iel and iel0
    end for
  end for
end for
// Synchronization point
mpi_barrier
// All-reduce of the local contributions on the global V0
mpi_allreduce(V0_loc, V0_global)
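A minimal C sketch of the corresponding hybrid pattern is given below (hypothetical names; an OpenMP 4.5 array-section reduction is assumed for the thread-level sum):

/* Sketch of the hybrid OpenMP-MPI assembly of V0 (hypothetical names). */
#include <mpi.h>
#include <stdlib.h>

void assemble_V0_hybrid(double *V0_global, int n_dof, int n_elems, int n_src_elems,
                        void (*add_contribution)(int iel, int iel0, double *v))
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *V0_loc = calloc((size_t)n_dof, sizeof(double));

    for (int iel = 0; iel < n_elems; ++iel) {
        /* the loop over the source elements owned by this process is shared
           by the OpenMP threads; the per-thread partial sums are combined
           with an array-section reduction (OpenMP 4.5 or later) */
        #pragma omp parallel for reduction(+:V0_loc[:n_dof])
        for (int iel0 = rank; iel0 < n_src_elems; iel0 += size)
            add_contribution(iel, iel0, V0_loc);
    }

    MPI_Barrier(MPI_COMM_WORLD);   /* synchronization point */
    MPI_Allreduce(V0_loc, V0_global, n_dof, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    free(V0_loc);
}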
Algorithm 5 Assembly of L matrix, hybrid OpenMP–MPI approach
Equally distribute the element–element interactions among the MPI processes
// Initialization
for each MPI process do
  Allocate dummy memory MdMesh
  Allocate dummy memory MdL
  Allocate dummy memory MdEE
end for
for each MPI process do
  Broadcast the geometrical information
end for
// Hybrid MPI–OpenMP computation
for each MPI process do
  Declare MdEE as private for each thread
  #pragma omp parallel for
  for each element iel1 do
    for each element iel2 do
      Compute the local iel1–iel2 interactions on the private dummy memory MdEE
      #pragma omp critical
      Accumulate the local interactions on the shared memory MdL
    end for
  end for
end for
for each MPI process do
  Allocate local memory Mloc
end for
// Final communication step
for each MPI process do
  Send and receive the local matrices MdL
  Accumulate the contributions on Mloc
end for
deallocate(MdMesh, MdEE, MdL)
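The core of this scheme can be sketched in C as follows (hypothetical names and a simplified dense layout; the actual CARIDDI data structures and the 2D block-cyclic mapping are not reproduced): each thread computes the element–element interaction blocks in a private buffer and accumulates them into the shared node-level matrix inside a critical section.

/* Sketch of the hybrid assembly of L (hypothetical names, simplified layout). */
#include <string.h>

#define NDOF_EL 12   /* assumed number of degrees of freedom per element */

void assemble_L_hybrid(double *MdL, int n_dof,               /* shared node memory */
                       int iel_start, int iel_end, int n_elems,
                       void (*pair_interaction)(int iel1, int iel2, double *MdEE),
                       const int (*dof_map)[NDOF_EL])        /* element-to-dof map */
{
    #pragma omp parallel
    {
        double MdEE[NDOF_EL * NDOF_EL];          /* private per-thread buffer */

        #pragma omp for schedule(dynamic)
        for (int iel1 = iel_start; iel1 < iel_end; ++iel1) {
            for (int iel2 = 0; iel2 < n_elems; ++iel2) {
                /* compute the local iel1-iel2 interaction block */
                memset(MdEE, 0, sizeof MdEE);
                pair_interaction(iel1, iel2, MdEE);

                /* accumulate into the shared matrix; the critical section
                   serializes the update and guarantees consistency */
                #pragma omp critical
                for (int a = 0; a < NDOF_EL; ++a)
                    for (int b = 0; b < NDOF_EL; ++b)
                        MdL[(size_t)dof_map[iel1][a] * n_dof + dof_map[iel2][b]]
                            += MdEE[a * NDOF_EL + b];
            }
        }
    }
}

A dynamic schedule is used in the sketch to mitigate the load imbalance between element rows; finer-grained atomic updates could replace the critical section if contention proved significant.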
Algorithm 6 Evaluation of Q matrix, hybrid OpenMP–MPI approach
// Initialization
Equally distribute the field points among the MPI processes
Broadcast the geometrical information
// Hybrid MPI–OpenMP computation
for each MPI process do
  for each mesh element iel do
    #pragma omp parallel for
    for each field point ifp do
      Compute the magnetic field or vector potential
      Accumulate the values
    end for
  end for
end for
// Final all-reduce
mpi_allreduce(MagField, VectorPotential)