Article

Network Coding Approaches for Distributed Computation over Lossy Wireless Networks

Bin Fan, Bin Tang, Zhihao Qu and Baoliu Ye
1 Key Laboratory of Water Big Data Technology of Ministry of Water Resources, Hohai University, Nanjing 211100, China
2 School of Computer and Information, Hohai University, Nanjing 211100, China
* Author to whom correspondence should be addressed.
Entropy 2023, 25(3), 428; https://doi.org/10.3390/e25030428
Submission received: 10 January 2023 / Revised: 20 February 2023 / Accepted: 23 February 2023 / Published: 27 February 2023
(This article belongs to the Special Issue Information Theory and Network Coding II)

Abstract

In wireless distributed computing systems, worker nodes connect to a master node wirelessly and perform large-scale computational tasks that are parallelized across them. However, the common phenomenon of straggling (i.e., worker nodes often experience unpredictable slowdown during computation and communication) and packet losses due to severe channel fading can significantly increase the latency of computational tasks. In this paper, we consider a heterogeneous, wireless, distributed computing system performing large-scale matrix multiplications which form the core of many machine learning applications. To address the aforementioned challenges, we first propose a random linear network coding (RLNC) approach that leverages the linearity of matrix multiplication, which has many salient properties, including ratelessness, maximum straggler tolerance and near-ideal load balancing. We then theoretically demonstrate that its latency converges to the optimum in probability when the matrix size grows to infinity. To combat the high encoding and decoding overheads of the RLNC approach, we further propose a practical variation based on batched sparse (BATS) code. The effectiveness of our proposed approaches is demonstrated by numerical simulations.

1. Introduction

In recent years, due to the proliferation of computationally intensive applications at the wireless edge, such as federated learning [1] and image recognition [2], wireless distributed computing, where large-scale computational tasks are carried out collaboratively by a cluster of wireless devices, has drawn great interest [3,4]. Meanwhile, due to the inherent randomness of the wireless environment, wireless distributed computing systems face multiple challenges. One main challenge is the straggler issue: computing devices often experience unpredictable slowdown or even dropout during computation and communication, which can lead to much higher latency or even failure of the computational task [5]. Another challenge is the packet-loss issue: packets can be lost during transmission due to severe channel fading in wireless networks.
In this paper, we consider a typical wireless distributed computing system consisting of multiple worker nodes and a master node. We focus on distributed matrix multiplication y = A x , which forms the core of many computation-intensive machine learning applications, such as linear regression, and aim to tackle the two challenges above. One common approach to mitigating the effect of stragglers is to provide redundancy through replication [6,7,8], which has been widely used in large distributed systems such as MapReduce [9] and Spark [10]. However, this kind of r-replication strategy can only tolerate r stragglers, and using a larger r increases the computation redundancy, which can lead to poor performance.
Recently, Lee et al. [11] first introduced a coding-based computation framework and proposed an ( n , k ) maximum-distance-separable (MDS) code approach, such that the master node can recover the desired result from the local computation results of any k out of n worker nodes. Based on this, Das et al. further proposed a fine-grained model such that the partial results of stragglers can be leveraged. However, MDS codes fail to make full use of the partial work done by stragglers. Ferdinand et al. [12] and Kiani et al. [13] proposed approaches that make use of stragglers by allocating more fine-grained computing tasks to each worker. Very recently, Mallick et al. [14] proposed the use of rateless codes such as LT codes [15] and Raptor codes [16] and demonstrated that a rateless coding approach can achieve asymptotically optimal latency. However, all these approaches assume that the communication between each worker node and the master node is reliable, and hence they can only achieve inferior performance in wireless distributed computing.
In fact, the packet-loss issue has been widely investigated in communication networks, and existing approaches roughly fall into two categories. The first is automatic repeat-request (ARQ)-based, which employs feedback-based retransmissions to combat packet losses; it has been adopted by Han et al. [17] in an MDS-code-based wireless distributed computing system. However, the feedback from the master node can increase the computation latency significantly due to the inherent delays of feedback, especially when the communication traffic between the worker nodes and the master node is heavy. The other is forward error correction (FEC)-based, which employs error-correcting codes to combat packet losses. Traditional FEC approaches mainly focus on achieving reliable transmission over each communication link, but in the context of distributed matrix multiplication, the objective is to recover the desired computation result. How to tackle both the straggler issue and the packet-loss issue for distributed matrix multiplication in wireless distributed computing systems remains an open problem.
In this paper, by leveraging the linearity of matrix multiplication, we show how network coding [18] can be applied to solve the two issues efficiently in a joint manner. The main contributions of this paper are summarized as follows:
  • We first propose a random linear network coding (RLNC) [19] based approach. In this approach, the matrix A to be multiplied is first split into multiple submatrices A 1 , …, A k , and each worker node is assigned multiple coded submatrices, each of which is a random linear combination of A 1 , …, A k . Each worker node multiplies each assigned submatrix with the input x and generates, for transmission, random linear combinations of the submatrix-vector products it has computed so far. Once it has received enough packets with linearly independent global encoding vectors, the master node can recover the desired result A x by Gaussian elimination. We model the computation and communication process as a continuous-time trellis, and by conducting a probabilistic analysis of the connectivity of the trellis, we theoretically show that the latency of the RLNC approach converges to the optimum in probability as the matrix size grows to infinity.
  • Since the RLNC approach has high encoding and decoding costs, we further propose a practical variation of it based on batched sparse (BATS) codes [20] and show how to optimize the performance of this BATS approach.
  • We conducted numerical simulations to evaluate the proposed RLNC and BATS approaches. The simulation results show that both approaches can overcome the straggler issue and the packet-loss issue effectively and achieve near-optimal performance.
The remainder of the paper is organized as follows. Section 2 introduces the system model. Section 3 and Section 4 introduce the RLNC approach and the BATS approach, respectively. Section 5 presents the numerical evaluation results. Finally, Section 6 concludes the paper.

2. System Model

2.1. Coding-Based Wireless Distributed Computation

As shown in Figure 1, we consider a wireless distributed computing system consisting of a master node and n heterogeneous worker nodes. These worker nodes, denoted by w 1 , w 2 , …, w n , are connected wirelessly to the master node. We focus on the matrix-vector multiplication problem, whose goal is to compute the result y = A x for a given matrix A ∈ R m × d and an arbitrary vector x ∈ R d × 1 , where R denotes the set of real numbers. Our results can be directly extended to matrix-matrix multiplication, where x is a small matrix.
In order to mitigate the effect of unpredictable node slowdown during computation and communication, we consider an error-correcting code based computing framework which consists of four components:
  • Encoding before computation: The matrix A is first split along its rows equally into k submatrices A 1 , …, A k , i.e., $\mathbf{A}^{\mathsf{T}} = [\mathbf{A}_1^{\mathsf{T}}\ \mathbf{A}_2^{\mathsf{T}}\ \cdots\ \mathbf{A}_k^{\mathsf{T}}]$. Without loss of generality, here we assume that m / k is an integer. These submatrices are encoded into more submatrices using an error-correcting code, which are further placed on worker nodes. The submatrices assigned to worker node w i are denoted as A ˜ i , 1 , A ˜ i , 2 , …, A ˜ i , k i , where k i is the number of submatrices assigned to w i . Here, we emphasize that, in many applications, such as linear regression, this encoding will be used for multiple computations with different inputs x [11], so that the encoding is often required to be executed before the arrival of any x .
  • Computation at each worker node: When an input x arrives at the master node, the master node broadcasts x to all the worker nodes. Once worker node w i receives x , it computes A ˜ i , 1 x , A ˜ i , 2 x , …, A ˜ i , k i x in a sequential manner.
  • Communication from each worker node: During the computation, each worker node also keeps sending its local computation results to the master node in some manner. For this, each submatrix-vector product, which is a vector of length m / k , is encapsulated into a packet. We assume that the communication link between worker node w i and the master node can be modeled as a packet erasure channel, where each packet is erased independently with probability ε i . In order to combat these packet losses, each worker node can transmit its local computation results using a coding-based approach.
  • Decoding at the master node: Once the master node receives enough information, it will recover the desired result y = A x and notify all the worker nodes to stop the computation.

2.2. Delay Model

In this paper, we mainly focus on minimizing the latency, which is the time required by the wireless computing system so that the result y = A x can be successfully decoded at the master node by aggregating the results sent from the worker nodes. For the characterization of the latency, we consider the following two models, one for computation delay and the other for communication delay.
As in [14], we consider a computation delay model as follows. The computation delay at each worker node w i consists of two parts. The first is an initial setup time before w i starts to perform any submatrix-vector multiplication, denoted by X i , which is assumed to follow an exponential distribution with rate λ i . The second is a constant time for calculating each submatrix-vector product, which is denoted by τ i . Hence, the delay for computing r submatrix-vector products by w i is X i + τ i r .
In order to characterize the straggling effect during communication, we model the communication time of a packet from worker node w i to the master node as a shifted exponential distribution with rate μ i and shift parameter θ i . Additionally, the communication times of all packets are mutually independent. This model has also been adopted in [17,21].
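As a concrete illustration of these two delay models, the following is a minimal Python sketch; the helper names are illustrative, and the example parameter values correspond to Scenario I in Section 5.

```python
import numpy as np

rng = np.random.default_rng(0)

def computation_finish_time(r, lam_i, tau_i):
    """Time at which worker w_i finishes its r-th submatrix-vector product:
    an exponential setup time X_i (rate lam_i) plus a constant tau_i per product."""
    X_i = rng.exponential(1.0 / lam_i)
    return X_i + tau_i * r

def packet_arrival_time(t_send, mu_i, theta_i, eps_i):
    """Arrival time of a packet sent at t_send, or None if it is erased.
    The transmission time is shifted exponential with rate mu_i and shift theta_i."""
    if rng.random() < eps_i:          # packet lost with probability eps_i
        return None
    return t_send + theta_i + rng.exponential(1.0 / mu_i)

# Example: a worker with lam_i = 0.1, tau_i = 0.2, mu_i = 20, theta_i = 0.05, eps_i = 0.2
print(computation_finish_time(r=5, lam_i=0.1, tau_i=0.2))
print(packet_arrival_time(t_send=1.0, mu_i=20, theta_i=0.05, eps_i=0.2))
```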

3. A Network Coding Approach

In order to combat the straggling effects during both computation and communication, as well as the packet losses during communication, in this section we propose a random linear network coding (RLNC)-based approach and show that it achieves optimal latency performance in the asymptotic sense, i.e., as the number of rows of A goes to infinity, provided that the incurred overheads are ignored. A practical version of this approach is given in the next section.

3.1. Description

We describe the RLNC-based approach following the computing framework given in Section 2.1:
Encoding before computation: In the RLNC-based approach, each submatrix A ˜ i , j assigned to worker node w i is a random linear combination of A 1 , , A k ; i.e.,
$$\tilde{\mathbf{A}}_{i,j} = \sum_{e=1}^{k} c_{i,j,e}\,\mathbf{A}_e, \qquad j = 1, 2, \ldots, k_i,$$
where c i , j , e is chosen randomly and independently according to a standard normal distribution. Since this encoding approach is rateless, k i can be arbitrarily large.
Computation at each worker node: When the worker node w i receives an input x , it starts to compute the local results y ˜ i , 1 = A ˜ i , 1 x , y ˜ i , 2 = A ˜ i , 2 x , , y ˜ i , k i = A ˜ i , k i x , in a sequential manner.
Communication from each worker node: For each packet transmission starting at time t, the worker node w i will generate a linear combination of all the local computation results in hand as
$$\hat{\mathbf{y}}_{i,t} = \sum_{j=1}^{d_i(t)} c_j\,\tilde{\mathbf{y}}_{i,j},$$
where d i ( t ) is the number of local results that have been computed before time t by w i . Here, ( c 1 , , c d i ( t ) ) is referred to as the local encoding vector of y ^ i , t .
Decoding at the master node: Due to the linearity of matrix-vector multiplication, we can see that
$$\hat{\mathbf{y}}_{i,t} = \sum_{j=1}^{d_i(t)} c_j\,\tilde{\mathbf{y}}_{i,j} = \sum_{j=1}^{d_i(t)} c_j\,\tilde{\mathbf{A}}_{i,j}\,\mathbf{x} = \sum_{j=1}^{d_i(t)} c_j \sum_{e=1}^{k} c_{i,j,e}\,\mathbf{A}_e\,\mathbf{x} = \sum_{e=1}^{k} \left( \sum_{j=1}^{d_i(t)} c_j\,c_{i,j,e} \right) \mathbf{A}_e\,\mathbf{x};$$
i.e., each packet received by the master node is a linear combination of A 1 x , A 2 x , , A k x . Here,
$$\left( \sum_{j=1}^{d_i(t)} c_j\,c_{i,j,1},\ \sum_{j=1}^{d_i(t)} c_j\,c_{i,j,2},\ \ldots,\ \sum_{j=1}^{d_i(t)} c_j\,c_{i,j,k} \right)$$
is referred to as the global encoding vector of y ^ i , t . Hence, when the master node receives enough packets that have k linearly independent global encoding vectors, it can recover the desired results A 1 x , A 2 x , , A k x by Gaussian elimination.
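To make the whole pipeline concrete, the following is a minimal NumPy sketch of the RLNC approach described above (stragglers, erasures and the packet format are ignored, and all sizes and variable names are illustrative). It checks that the master recovers A x once it has collected packets whose global encoding vectors have rank k.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, k, n = 60, 8, 6, 3                   # rows, columns, number of splits, workers
A = rng.standard_normal((m, d))
x = rng.standard_normal(d)
blocks = np.split(A, k, axis=0)            # A_1, ..., A_k, each of size (m/k) x d

# Encoding before computation: each worker stores random linear combinations of A_1..A_k.
workers = []
for i in range(n):
    k_i = 4                                # submatrices per worker (rateless: any number works)
    C = rng.standard_normal((k_i, k))      # coefficients c_{i,j,e}
    coded = [sum(C[j, e] * blocks[e] for e in range(k)) for j in range(k_i)]
    workers.append((C, coded))

# Computation + communication: each packet is a random combination of the products in hand,
# and its global encoding vector follows from (3).
received_vecs, received_payloads = [], []
for C, coded in workers:
    products = [At @ x for At in coded]                # \tilde{y}_{i,j}
    for _ in range(3):                                 # a few packets per worker
        c = rng.standard_normal(len(products))         # local encoding vector
        received_vecs.append(c @ C)                    # global encoding vector
        received_payloads.append(sum(cj * yj for cj, yj in zip(c, products)))

# Decoding at the master: solve for A_1 x, ..., A_k x once k independent packets arrive.
G = np.array(received_vecs)                            # (num_packets, k)
Y = np.array(received_payloads)                        # (num_packets, m/k)
sol, *_ = np.linalg.lstsq(G, Y, rcond=None)            # plays the role of Gaussian elimination
print(np.allclose(sol.reshape(-1), A @ x))             # True (up to floating-point error)
```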
Overhead: Our RLNC approach suffers from high encoding and decoding complexities, just like RLNC for communication. More specifically, in our approach, the encoding cost per submatrix is $O(k \cdot \frac{m}{k} \cdot d) = O(md)$, and the total decoding cost is $O(k^3 + k^2 \cdot \frac{m}{k}) = O(k^3 + mk)$. The encoding cost is high, but the encoding only needs to be done once, before any computation, and can then be reused for computing A x with arbitrarily many different inputs x . Meanwhile, the decoding cost is also high when k is large, but it is independent of d, the number of columns of A . Thus, when d is very large, the decoding cost at the master node can be much lower than the computation cost at each worker node. In addition, the decoding at the master node can be done in an incremental fashion using Gauss–Jordan elimination, which can further reduce the decoding latency.
Note that the global encoding vector of each packet is required by the master node for decoding. To convey it efficiently, we use a pseudo-random number generator to generate the local encoding vector of each transmitted packet, and append to the packet the random seed together with the number of local results that have been computed. The master node can then reconstruct the global encoding vector according to (3). In this way, the coefficient overhead is negligible, in contrast to traditional RLNC for communication networks.
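A minimal sketch of this seed-based mechanism (the header fields assumed here are illustrative): both sides derive the same local encoding vector from the seed carried in the packet, so only the seed and the count of local results need to be transmitted.

```python
import numpy as np

def local_encoding_vector(seed, num_results):
    """Worker and master derive identical coefficients from the same seed."""
    return np.random.default_rng(seed).standard_normal(num_results)

# Worker side: pick a fresh seed per packet and send (seed, d_it) in the header.
seed, d_it = 12345, 7
c_worker = local_encoding_vector(seed, d_it)
# Master side: regenerate the local encoding vector from the received header.
c_master = local_encoding_vector(seed, d_it)
assert np.array_equal(c_worker, c_master)
```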
Remark 1.
Lin et al. [22] have also applied RLNC in distributed training on mobile devices. They used RLNC to create coded data partitions among mobile devices so as to tolerate computational uncertainties, and their main purpose is to reduce the need to exchange data partitions across mobile devices. Differently from [22], the use of RLNC in this paper is for straggler mitigation and packet-loss tolerance in a joint manner, while leveraging the computation and communication capabilities of all worker nodes.
Remark 2.
Since random linear network coding is performed over the field of real numbers as opposed to a finite field, the entries of the generated matrices could be very large, making the whole computation numerically unstable. In fact, this issue is present in any coded distributed computation over the field of real numbers and is not limited to our approaches. There are two basic ways of dealing with it. One is to use very small coefficients to avoid the emergence of large numbers, which is possible because the encoding operations in our proposed approach are linear in these coefficients. This is significantly different from the Reed–Solomon-code/polynomial-code-based approaches that have been widely adopted in coded distributed computation (see, e.g., [11,23]), where the coefficients are powers of evaluation points. In particular, the numerical instability issue for the RLNC approach is much less severe than that for Reed–Solomon-code/polynomial-code-based approaches, since Vandermonde matrices have exponentially large condition numbers. The other is to employ the finite field embedding technique [24,25], where the entries are quantized into a finite number of digits and then embedded into a finite field. Nevertheless, both approaches incur numerical errors. How to guarantee numerical stability in coded distributed computation is still an open problem and requires further study.

3.2. Latency Analysis

Let $r_i' = \frac{1}{\theta_i + 1/\mu_i}$ and $r_i = \min\left\{\frac{1}{\tau_i},\ r_i'(1-\varepsilon_i)\right\}$. Define
$$T_0 = \frac{k + \sum_{i=1}^{n} r_i X_i}{\sum_{i=1}^{n} r_i}.$$
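Intuitively, $T_0$ is the time $T$ at which $\sum_{i=1}^{n} r_i (T - X_i) = k$, i.e., the point where the workers' aggregate service, at per-worker rate $r_i$ after setup time $X_i$, amounts to k packets' worth of useful work. A small Python sketch (illustrative parameters) that evaluates $T_0$ for one random draw of the setup times:

```python
import numpy as np

def T0(k, lam, tau, mu, theta, eps, rng=np.random.default_rng(2)):
    """Compute T_0 = (k + sum_i r_i X_i) / sum_i r_i for one draw of the setup times X_i."""
    lam, tau, mu, theta, eps = map(np.asarray, (lam, tau, mu, theta, eps))
    r_prime = 1.0 / (theta + 1.0 / mu)               # packet transmission rate r_i'
    r = np.minimum(1.0 / tau, r_prime * (1 - eps))   # effective per-worker rate r_i
    X = rng.exponential(1.0 / lam)                   # setup times X_i
    return (k + np.sum(r * X)) / np.sum(r)

# Scenario-I-like parameters for 10 homogeneous workers (illustrative)
n, k = 10, 1000
print(T0(k, [0.1] * n, [0.2] * n, [20] * n, [0.05] * n, [0.2] * n))
```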
The following result characterizes an upper bound on the latency of the proposed RLNC-based approach.
Theorem 1.
For any constant δ > 0, the latency of the proposed RLNC-based approach, denoted by $T_{\mathrm{RLNC}}$, satisfies
$$\lim_{k\to\infty} \Pr\left(T_{\mathrm{RLNC}} \le (1+\delta)T_0\right) = 1.$$
The following result establishes a lower bound on the latency of any scheme under the coding framework.
Theorem 2.
For any scheme under the coding framework, the probability that its latency T any is less than T 0 decays exponentially with k; i.e., for any constant δ > 0 , there exists some constant η > 1 that does not depend on k, such that
$$\Pr\left(T_{\mathrm{any}} \ge (1-\delta)T_0\right) = 1 - O(\eta^{-k}).$$
From Theorems 1 and 2, it is straightforward to see that the proposed RLNC-based approach is asymptotically optimal. In the following, we will formally prove Theorems 1 and 2 by a connectivity analysis of a continuous-time trellis, which models the computation and communication processes.
For any scheme under the coding framework, as illustrated in Figure 2, we model the computation and communication processes of each worker node $w_i$ up to time $t$ using a continuous-time trellis $G_i(t)$ [26], where edges are classified into three types: computation edges, transmission edges and memory edges. Each computation edge models the computation of a submatrix-vector product. Suppose $w_i$ computes a submatrix-vector product from time $t_0$ to $t_0 + \tau_i \le t$. Then, two nodes, $w_i(t_0)$ and $w_i'(t_0 + \tau_i)$, will be introduced (if they do not already exist), and there is a computation edge from $w_i(t_0)$ to $w_i'(t_0 + \tau_i)$. Similarly, suppose a packet is transmitted from $w_i$ at time $t_0$ and received successfully by the master node at time $t_1 \le t$. Then, two nodes, $w_i'(t_0)$ and $m_i(t_1)$, will be introduced (if they do not already exist), and there is a transmission edge from $w_i'(t_0)$ to $m_i(t_1)$. We also introduce nodes $w_i(0)$ and $m_i(t)$. Nodes $\{w_i(\cdot)\}$ are connected through the timeline, and so are nodes $\{w_i'(\cdot)\}$ and nodes $\{m_i(\cdot)\}$; the edges for such connections are called memory edges. Each computation edge and each transmission edge is associated with unit capacity, and each memory edge is associated with infinite capacity. Finally, we construct a global continuous-time trellis $G(t)$, which consists of the union of all the $G_i(t)$ and two auxiliary nodes $w(0)$ and $m(t)$. In addition, there is an edge of infinite capacity from $w(0)$ to each $w_i(0)$, and an edge of infinite capacity from each $m_i(t)$ to $m(t)$.
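One way to realize this construction in code for a single worker, and to evaluate the maximum flow used in Proposition 1 below, is sketched here (networkx is assumed to be available; the node labels and the tiny example are purely illustrative):

```python
import networkx as nx

def worker_trellis(X_i, tau_i, packets, T, inf_cap=10**9):
    """Build the continuous-time trellis G_i(T) for one worker.
    `packets` is a list of (send_time, receive_time or None if erased) pairs."""
    G = nx.DiGraph()
    # computation edges: one unit-capacity edge per product finished by time T
    t0 = X_i
    while t0 + tau_i <= T:
        G.add_edge(('w', t0), ('w_prime', t0 + tau_i), capacity=1)
        t0 += tau_i
    # transmission edges: one unit-capacity edge per packet received by time T
    for s, r in packets:
        if r is not None and r <= T:
            G.add_edge(('w_prime', s), ('m', r), capacity=1)
    # memory edges along each timeline (infinite capacity)
    for layer in ('w', 'w_prime', 'm'):
        times = sorted(t for (l, t) in G.nodes if l == layer)
        for a, b in zip(times, times[1:]):
            G.add_edge((layer, a), (layer, b), capacity=inf_cap)
    # hook up the auxiliary source and sink of the global trellis
    G.add_edge('src', ('w', X_i), capacity=inf_cap)
    for node in [n for n in G.nodes if isinstance(n, tuple) and n[0] == 'm']:
        G.add_edge(node, 'sink', capacity=inf_cap)
    return G

# One worker: setup X_i = 1, tau_i = 2; each product is sent right after it is finished,
# and the packet sent at time 5 is erased.
pkts = [(3, 3.5), (5, None), (7, 7.4)]
G = worker_trellis(X_i=1, tau_i=2, packets=pkts, T=8)
flow_value, _ = nx.maximum_flow(G, 'src', 'sink')
print(flow_value)  # 2: at most two independent packets can reach the master by T = 8
```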
The usefulness of the continuous-time trellis model is summarized in the following result.
Proposition 1.
For any scheme that achieves a latency of T, the maximum flow from $w(0)$ to $m(T)$ in its continuous-time trellis $G(T)$ must be at least k. Moreover, for our RLNC approach, if the maximum flow from $w(0)$ to $m(T)$ in its continuous-time trellis $G(T)$ is at least k, then the master node can recover the desired computation result at time T with probability one.
Proof. 
It is straightforward to see that the first part holds. The second part is inherited from the optimality of RLNC in communication networks [19] and the fact that all the operations are over the real field R . □
Now, we proceed to prove Theorems 1 and 2. We start by presenting some concentration results regarding the communication between worker nodes and the master node.
Lemma 1.
Suppose $Y_1, Y_2, \ldots$ independently follow a shifted exponential distribution with rate μ and shift parameter θ. Then, for any constant δ > 0, there exists some constant $\eta_1 > 1$ such that
$$\Pr\left(\left|\sum_{i=1}^{s} Y_i - (\theta + \mu^{-1})s\right| > \delta(\theta + \mu^{-1})s\right) = O(\eta_1^{-s}).$$
Proof. 
The result can be proved by a Chernoff-like argument based on the moment generating function [27]; we show the bound for the lower tail, and the upper tail follows analogously.
For any $h > 0$, the moment generating function of $Y_i$ evaluated at $-h$ is
$$\mathbb{E}\left[e^{-hY_i}\right] = \frac{\mu}{\mu+h}\,e^{-h\theta}.$$
Hence,
$$\Pr\left(\sum_{i=1}^{s} Y_i < (1-\delta)(\theta+\mu^{-1})s\right) = \Pr\left(e^{-h\sum_{i=1}^{s} Y_i} > e^{-h(1-\delta)(\theta+\mu^{-1})s}\right) \le \frac{\mathbb{E}\left[e^{-h\sum_{i=1}^{s} Y_i}\right]}{e^{-h(1-\delta)(\theta+\mu^{-1})s}} = \frac{\prod_{i=1}^{s}\mathbb{E}\left[e^{-hY_i}\right]}{e^{-h(1-\delta)(\theta+\mu^{-1})s}} = \left(\frac{\mu}{\mu+h}\,e^{-h\theta}\right)^{s} e^{h(1-\delta)(\theta+\mu^{-1})s},$$
where the inequality holds by applying Markov's inequality. Let $h = \frac{1}{(1-\delta)(\theta+\mu^{-1})-\theta} - \mu$, which is positive whenever $(1-\delta)(\theta+\mu^{-1}) > \theta$ (otherwise the probability above is trivially zero, since $Y_i \ge \theta$). We then have
$$\Pr\left(\sum_{i=1}^{s} Y_i < (1-\delta)(\theta+\mu^{-1})s\right) \le \left[\mu\big((1-\delta)(\theta+\mu^{-1})-\theta\big)\, e^{\,1-\mu\left((1-\delta)(\theta+\mu^{-1})-\theta\right)}\right]^{s} = \left[\big(1-\delta(1+\theta\mu)\big)\, e^{\,\delta(1+\theta\mu)}\right]^{s}.$$
By setting $\eta_1 = \left[\big(1-\delta(1+\theta\mu)\big)\, e^{\,\delta(1+\theta\mu)}\right]^{-1} > 1$, we get the desired result. □
For a scheme, let $N_i(t)$ (resp. $N_i'(t)$) be the number of packet transmissions (resp. successful packet transmissions) from worker node $w_i$ to the master node during the time interval $(X_i, X_i + t)$.
Lemma 2.
For any scheme and any constant δ > 0 , there exists some constant η 2 > 1 , such that
$$\Pr\left(N_i(t) \ge (1+\delta)\,r_i'\,t\right) = O(\eta_2^{-t}).$$
Proof. 
Let $Y_1, Y_2, \ldots$ be the communication times of the packets successively sent from $w_i$, which are i.i.d. shifted exponential random variables with rate $\mu_i$ and shift parameter $\theta_i$, and let $s = \lceil (1+\delta)\,r_i'\,t \rceil$. According to Lemma 1, there exist constants $\eta_1 > 1$ and $\eta_2 = \eta_1^{(1+\delta)r_i'} > 1$ such that
$$\Pr\left(N_i(t) \ge (1+\delta)\,r_i'\,t\right) \le \Pr\left(\sum_{j=1}^{s} Y_j \le t\right) \le \Pr\left(\sum_{j=1}^{s} Y_j \le \left(1 - \frac{\delta}{1+\delta}\right)(\theta_i + \mu_i^{-1})\,s\right) = O(\eta_1^{-s}) = O(\eta_2^{-t}). \qquad \square$$
Lemma 3.
For any scheme and any constant δ > 0 , there exists some constant η 3 > 1 such that
$$\Pr\left(N_i'(t) \ge (1+\delta)\,r_i'(1-\varepsilon_i)\,t\right) = O(\eta_3^{-t}).$$
Proof. 
Let $A$ denote the event that $N_i(t) \ge (1+\delta/2)\,r_i'\,t$. By the law of total probability,
$$\Pr\left(N_i'(t) \ge (1+\delta)\,r_i'(1-\varepsilon_i)\,t\right) \le \Pr(A) + \Pr\left(N_i'(t) \ge (1+\delta)\,r_i'(1-\varepsilon_i)\,t \mid \bar{A}\right).$$
According to Lemma 2, there exists some constant $\eta_2 > 1$ such that $\Pr(A) = O(\eta_2^{-t})$. Let $N$ be a binomial random variable with parameters $\lceil(1+\delta/2)\,r_i'\,t\rceil$ and $1-\varepsilon_i$. Then, there exists some constant $\eta_3' > 1$ such that
$$\Pr\left(N_i'(t) \ge (1+\delta)\,r_i'(1-\varepsilon_i)\,t \mid \bar{A}\right) \le \Pr\left(N \ge (1+\delta)\,r_i'(1-\varepsilon_i)\,t\right) = O(\eta_3'^{-t}),$$
where the second step follows by applying the Chernoff bound for a binomial random variable [27]. Finally, by letting $\eta_3 = \min\{\eta_2, \eta_3'\}$, we have
$$\Pr\left(N_i'(t) \ge (1+\delta)\,r_i'(1-\varepsilon_i)\,t\right) = O(\eta_3^{-t}). \qquad \square$$
Lemma 4.
For any scheme, let F i ( t ) be the maximum flow from w i ( 0 ) to m ( t ) in its continuous-time trellis G ( t ) . Then, for any constant δ > 0 , there exists some constant η 4 > 1 such that
$$\Pr\left(F_i\big((1-\delta)T_0\big) \ge \frac{r_i\,k}{\sum_{j=1}^{n} r_j}\right) = O(\eta_4^{-k}).$$
Proof. 
Let $B$ be the event that $(1-\delta)T_0 - X_i > (1-\delta/2)\frac{k}{\sum_{j=1}^{n} r_j}$. Then,
$$\Pr(B) \le \Pr\left((1-\delta)T_0 > (1-\delta/2)\frac{k}{\sum_{i=1}^{n} r_i}\right) = \Pr\left(\sum_{i=1}^{n} r_i X_i > \frac{\delta}{2(1-\delta)}k\right) \le \sum_{i=1}^{n}\Pr\left(r_i X_i > \frac{\delta}{2n(1-\delta)}k\right) = \sum_{i=1}^{n} e^{-\frac{\lambda_i \delta}{2n(1-\delta)\, r_i}k} = O(\eta_4'^{-k})$$
for some constant $\eta_4' > 1$. By the law of total probability,
$$\Pr\left(F_i\big((1-\delta)T_0\big) \ge \frac{r_i k}{\sum_{j=1}^{n} r_j}\right) \le \Pr(B) + \Pr\left(F_i\big((1-\delta)T_0\big) \ge \frac{r_i k}{\sum_{j=1}^{n} r_j} \,\Big|\, \bar{B}\right).$$
We consider two cases. In the first case, $\frac{1}{\tau_i} \le r_i'(1-\varepsilon_i)$, and thus $r_i = \frac{1}{\tau_i}$. Since $F_i(t)$ cannot exceed the number of computation edges $\lfloor\frac{t - X_i}{\tau_i}\rfloor$, it is straightforward to check that
$$\Pr\left(F_i\big((1-\delta)T_0\big) \ge \frac{r_i k}{\sum_{j=1}^{n} r_j} \,\Big|\, \bar{B}\right) = 0.$$
Thus, $\Pr\left(F_i((1-\delta)T_0) \ge \frac{r_i k}{\sum_{j=1}^{n} r_j}\right) = O(\eta_4'^{-k})$. In the second case, $\frac{1}{\tau_i} > r_i'(1-\varepsilon_i)$, and thus $r_i = r_i'(1-\varepsilon_i)$. Since $F_i((1-\delta)T_0) \le N_i'\big((1-\delta)T_0 - X_i\big)$,
$$\Pr\left(F_i\big((1-\delta)T_0\big) \ge \frac{r_i k}{\sum_{j=1}^{n} r_j} \,\Big|\, \bar{B}\right) \le \Pr\left(N_i'\big((1-\delta)T_0 - X_i\big) \ge \frac{r_i k}{\sum_{j=1}^{n} r_j} \,\Big|\, \bar{B}\right) \le \Pr\left(N_i'\left((1-\delta/2)\frac{k}{\sum_{j=1}^{n} r_j}\right) \ge \frac{r_i k}{\sum_{j=1}^{n} r_j}\right) = O(\eta_5^{-k})$$
for some constant $\eta_5 > 1$, where the last step follows from Lemma 3. Thus, we can show that $\Pr\left(F_i((1-\delta)T_0) \ge \frac{r_i k}{\sum_{j=1}^{n} r_j}\right) = O(\eta_4^{-k})$ for the constant $\eta_4 = \min\{\eta_4', \eta_5\}$. □
Now we are ready to prove Theorem 2.
Proof 
(Proof of Theorem 2). For any scheme, since the maximum flow from $w(0)$ to $m((1-\delta)T_0)$ in its continuous-time trellis $G((1-\delta)T_0)$ is equal to $\sum_{i=1}^{n} F_i((1-\delta)T_0)$, according to Proposition 1, its latency $T_{\mathrm{any}}$ satisfies
$$\Pr\left(T_{\mathrm{any}} \le (1-\delta)T_0\right) \le \Pr\left(\sum_{i=1}^{n} F_i\big((1-\delta)T_0\big) \ge k\right) \le \sum_{i=1}^{n} \Pr\left(F_i\big((1-\delta)T_0\big) \ge \frac{r_i k}{\sum_{j=1}^{n} r_j}\right) = O(\eta_4^{-k}),$$
where the last step follows from Lemma 4. □
Next, we turn to prove Theorem 1. For the RLNC approach and $t \ge X_i$, let $N_i'(t, t+\Delta t)$ be the number of successful packet transmissions from worker node $w_i$ to the master node during the time interval $(t, t+\Delta t)$. We have the following result.
Lemma 5.
For any $t \ge X_i$,
$$\frac{N_i'(t, t+\Delta t)}{\Delta t} \xrightarrow{P} r_i'(1-\varepsilon_i), \quad \text{as } \Delta t \to \infty;$$
i.e., $N_i'(t, t+\Delta t)/\Delta t$ converges to $r_i'(1-\varepsilon_i)$ in probability as $\Delta t$ goes to infinity, or equivalently, for any constant $\epsilon > 0$, $\lim_{\Delta t \to \infty} \Pr\left(\left|N_i'(t, t+\Delta t)/\Delta t - r_i'(1-\varepsilon_i)\right| > \epsilon\right) = 0$.
Proof. 
The result can be shown similarly to that of Lemma 3. □
Lemma 6.
Let $F_i(t)$ be the maximum flow from $w_i(0)$ to $m(t)$ in the continuous-time trellis $G(t)$ of the RLNC approach. Then,
$$\frac{F_i(t)}{t - X_i} \xrightarrow{P} \min\left\{\frac{1}{\tau_i},\ r_i'(1-\varepsilon_i)\right\} = r_i, \quad \text{as } t \to \infty.$$
Proof. 
According to Theorem 1 of [26], Lemma 5 implies this result immediately. □
Now we can prove Theorem 1.
Proof. 
(Proof of Theorem 1). According to Lemma 6, $\frac{F_i(T_0)}{(T_0 - X_i)\,r_i} \xrightarrow{P} 1$ as $k \to \infty$. Hence, noting that $\sum_{i=1}^{n} r_i (T_0 - X_i) = k$ by the definition of $T_0$, we have $\frac{1}{k}\sum_{i=1}^{n} F_i(T_0) \xrightarrow{P} 1$ as $k \to \infty$. Since
$$\frac{F_i\big((1+\delta)T_0\big)}{F_i(T_0)} \ge \frac{F_i\big((1+\delta)T_0\big)}{(1+\delta)T_0 - X_i} \cdot \frac{(1+\delta)(T_0 - X_i)}{F_i(T_0)} \xrightarrow{P} 1+\delta,$$
it is straightforward to check that
$$\lim_{k\to\infty} \Pr\left(\sum_{i=1}^{n} F_i\big((1+\delta)T_0\big) < k\right) = 0.$$
According to Proposition 1, this implies that
$$\lim_{k\to\infty} \Pr\left(T_{\mathrm{RLNC}} > (1+\delta)T_0\right) = 0.$$
This completes the proof. □

4. BATS-Code-Based Approach

As mentioned earlier, despite its optimality, the RLNC-based approach suffers from high encoding and decoding overheads. In this section, we propose a new approach based on batched sparse (BATS) codes [20], a variation of RLNC with low encoding and decoding overheads.

4.1. Description

In the BATS-code-based approach, the k submatrices $A_1, \ldots, A_k$ are first encoded into $A_1, \ldots, A_k, A_{k+1}, \ldots, A_{k'}$ using a fixed-rate systematic erasure code (called a precode), where $k' = (1+\epsilon)k$ and $\epsilon$ is a small positive constant (e.g., 0.02). BATS codes are rateless, as an infinite number of batches can be generated. The generation of each batch is as follows:
  • Sample a degree $deg$ according to a given degree distribution $\Psi = (\Psi_1, \ldots, \Psi_D)$, where D is the maximum degree;
  • Select $deg$ distinct submatrices uniformly at random from $A_1, \ldots, A_k, A_{k+1}, \ldots, A_{k'}$;
  • Generate M random linear combinations of the $deg$ selected submatrices, which are referred to as a batch.
Based on the BATS code, batches of submatrices are assigned to worker nodes, and each worker node performs its local computation batch by batch, where the computation for one batch consists of M submatrix-vector multiplications. In order to forward the computation result of a batch to the master node, each worker node generates a number of packets, each of which is a random linear combination of the M submatrix-vector products of that batch. For decoding, the master node first recovers $A_1x, \ldots, A_kx, A_{k+1}x, \ldots, A_{k'}x$ using Gaussian-elimination-based belief propagation (BP) decoding; once any k, or slightly more than k, of these have been recovered, the master node can recover all of $A_1x, \ldots, A_kx$ by decoding the precode. See [20] for more details.
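The following is a minimal NumPy sketch of batch generation and worker-side recoding under the above description (the toy degree distribution and sizes are illustrative, the precode step is omitted, and the BP decoder is not shown).

```python
import numpy as np

rng = np.random.default_rng(3)

def generate_batch(blocks, Psi, M):
    """Generate one batch: sample a degree from Psi, pick deg distinct (precoded)
    submatrices, and form M random linear combinations of them."""
    deg = rng.choice(np.arange(1, len(Psi) + 1), p=Psi)
    idx = rng.choice(len(blocks), size=deg, replace=False)
    C = rng.standard_normal((M, deg))                       # batch generator matrix
    batch = [sum(C[j, e] * blocks[idx[e]] for e in range(deg)) for j in range(M)]
    return idx, C, batch

def recode_packets(batch_products, num_packets):
    """Worker-side recoding: each packet is a random combination of the M products of a batch."""
    M = len(batch_products)
    coeffs = rng.standard_normal((num_packets, M))
    return coeffs, [sum(c[j] * batch_products[j] for j in range(M)) for c in coeffs]

# Illustrative usage with six submatrices of a 60 x 8 matrix and batch size M = 4
A = rng.standard_normal((60, 8)); x = rng.standard_normal(8)
blocks = np.split(A, 6, axis=0)                              # pretend these are already precoded
Psi = np.array([0.0, 0.1, 0.4, 0.5])                         # toy degree distribution, D = 4
idx, C, batch = generate_batch(blocks, Psi, M=4)
products = [B @ x for B in batch]
coeffs, packets = recode_packets(products, num_packets=5)
```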
Overhead: In the BATS-code-based approach, the encoding cost per submatrix is $O(deg \cdot \frac{m}{k} \cdot d) = O(\frac{md}{k})$, and the total decoding cost is $O\big((M^3 + M^2\,\frac{m}{k})\cdot\frac{k}{M}\big) = O(M^2 k + Mm)$. Clearly, both the encoding cost and the decoding cost are much lower than those of the RLNC approach, especially when M is a small constant (e.g., 8 or 16). As with the RLNC approach, the decoding cost is independent of d, and the coefficient overhead is negligible when leveraging the pseudo-random-number-generator-based approach.
Remark 3.
There have been many other sparse variants of random linear network coding, including chunked codes (e.g., [28,29]), tunable sparse network coding (e.g., [30,31]) and sliding-window coding (e.g., [32,33,34,35,36]). While many of these codes could also be applied, BATS codes are more suitable for this distributed computing scenario. On the one hand, BATS codes are rateless; thus, all the worker nodes can keep computing and forwarding local results to the master node until the whole computation is completed, as long as enough batches are placed on each worker node. In contrast, chunked codes (e.g., [28,29]) usually have fixed coding rates or require a lot of feedback from the master node. On the other hand, as mentioned in Section 2, in many applications the step of encoding before computation must be performed before the arrival of any input x; in other words, this encoding step should be independent of the uncertain computation and communication processes of the worker nodes. Sliding-window codes, unlike BATS codes, are often generated on the fly and are therefore less suitable in this setting.

4.2. Performance Optimization

The performance of BATS code heavily depends on how the M computation results of each batch are transmitted to the master node, and which degree distribution is used.
Suppose that worker node $w_i$ sends $Z_i$ coded packets to the master node for the computation results of each batch $B_j$. Let $H_j$ be a $Z_i \times M$ matrix, where each row corresponds to a transmitted packet: if the packet is successfully received by the master node, the row is the packet's local encoding vector; otherwise, the row is the zero vector. Let $\mathbf{h}_i = (h_{i,0}, \ldots, h_{i,M})$ denote the rank distribution of $H_j$, where $h_{i,r}$ is the probability that $H_j$ has rank r. We can show that
$$h_{i,r} = \begin{cases} \displaystyle\sum_{\ell=r}^{u_b} \Pr(Z_i = \ell)\binom{\ell}{r}(1-\varepsilon_i)^{r}\varepsilon_i^{\ell-r}, & r \le M-1, \\[2ex] \displaystyle\sum_{\ell=M}^{u_b} \Pr(Z_i = \ell)\sum_{s=M}^{\ell}\binom{\ell}{s}(1-\varepsilon_i)^{s}\varepsilon_i^{\ell-s}, & r = M, \end{cases}$$
where u b is an upper bound of Z i . In order to maximize the transmission efficiency for BATS code, we apply the linear programming method [37] to optimize the distribution of Z i :
$$\begin{aligned} \max \quad & \sum_{r=1}^{M} r\, h_{i,r} \\ \text{s.t.} \quad & \sum_{\ell=0}^{u_b} \ell \,\Pr(Z_i = \ell)\,(\theta_i + \mu_i^{-1}) \le M \tau_i, \\ & \sum_{\ell=0}^{u_b} \Pr(Z_i = \ell) = 1, \\ & 0 \le \Pr(Z_i = \ell) \le 1, \quad \ell = 0, \ldots, u_b. \end{aligned}$$
Here, the objective is to maximize the expected rank of $H_j$. The first constraint requires that the expected time for transmitting the $Z_i$ packets to the master node be no larger than the time for computing M submatrix-vector multiplications, and the last two constraints ensure that $\Pr(Z_i = \ell)$, $\ell = 0, \ldots, u_b$, form a probability distribution.
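Since $h_{i,r}$ is linear in the probabilities $\Pr(Z_i = \ell)$, the objective $\sum_{r=1}^{M} r\,h_{i,r}$ equals $\sum_{\ell} \Pr(Z_i = \ell)\,\mathbb{E}[\min\{\mathrm{Bin}(\ell, 1-\varepsilon_i), M\}]$, so the whole program is a standard linear program. A sketch using scipy (assumed to be available; the parameter values are illustrative):

```python
import numpy as np
from scipy.optimize import linprog
from scipy.stats import binom

def optimize_Zi(M, tau_i, mu_i, theta_i, eps_i, ub):
    """Solve the LP over Pr(Z_i = l), l = 0..ub, maximizing the expected rank of H_j."""
    # c[l] = E[min(Binomial(l, 1 - eps_i), M)] = sum_r r * Pr(rank = r | Z_i = l)
    c = np.empty(ub + 1)
    for l in range(ub + 1):
        s = np.arange(l + 1)
        c[l] = np.sum(np.minimum(s, M) * binom.pmf(s, l, 1 - eps_i))
    # expected transmission time of the Z_i packets must not exceed M * tau_i
    A_ub = [(theta_i + 1.0 / mu_i) * np.arange(ub + 1)]
    b_ub = [M * tau_i]
    A_eq, b_eq = [np.ones(ub + 1)], [1.0]
    res = linprog(-c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, 1)] * (ub + 1))    # maximize by minimizing -c
    return res.x, -res.fun                       # distribution of Z_i and its expected rank

p, expected_rank = optimize_Zi(M=8, tau_i=0.2, mu_i=20, theta_i=0.05, eps_i=0.2, ub=30)
print(expected_rank)
```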
When time goes to infinity, the proportion of batches whose computation results have been sent to the master node by worker node $w_i$ approaches $\frac{1/\tau_i}{\sum_{j=1}^{n} 1/\tau_j}$. Hence, we can derive the empirical rank distribution $\mathbf{h}$ over all the batches completed by the worker nodes as
$$\mathbf{h} = \sum_{i=1}^{n} \frac{1/\tau_i}{\sum_{j=1}^{n} 1/\tau_j}\, \mathbf{h}_i.$$
Based on the empirical rank distribution, we can find a good degree distribution Ψ such that the BATS code can achieve a coding rate close to h ¯ / M , where h ¯ is the expected value corresponding to the empirical rank distribution (c.f. [20]).

5. Performance Evaluation

In this section, we first evaluate the decoding cost incurred by our proposed approaches, and then we present simulations conducted to evaluate the overall computational performances of these approaches in comparison to some state-of-the-art approaches.
We first ran some experiments on a computer with an Intel(R) Core(TM) i7-10700 CPU at 2.90 GHz using Python 3.7. In these experiments, the matrix A had 50,000 rows and d columns, where d ranged from 1000 to 32,000. Matrix A was split into 1000 sub-matrices of the same size, so each submatrix consisted of 50 rows and each transmitted packet consisted of 50 real numbers. In the BATS-code-based approach, the batch size was set to eight. We simulated the decoding process and evaluated the decoding delays (in seconds) of both the RLNC-based approach and the BATS-code-based approach. The delay of the original matrix multiplication was also evaluated. The results are presented in Table 1.
Note that the decoding latencies of both the RLNC-based approach and the BATS-code-based approach are independent of d, whereas the latency of the original matrix multiplication grows linearly with d. From the table, we can see that even when d = 1000, the decoding latency of the BATS-code-based approach is only about 1.58% of the latency of the original computation, and when d grows larger, this latency becomes negligible. In contrast, when d = 1000 or d = 2000, the decoding cost of the RLNC-based approach is prohibitive.
We also conducted simulations to evaluate the latency performance of our proposed approaches. In our simulations, the number of worker nodes was 10, and the settings of matrix A remained the same as above, except that the number of columns d does not affect the simulated latencies. We simulated four scenarios. In the first three scenarios, the worker nodes were homogeneous, and the relationship between the computation time per submatrix-vector product and the average communication time of a packet varied across scenarios. In the last scenario, the worker nodes were heterogeneous. The parameters of these scenarios are given as follows.
  • Scenario I, where ( λ i , τ i ) = ( 0.1 , 0.2 ) , ( μ i , θ i ) = ( 20 , 0.05 ) and ε i = 0.2 ;
  • Scenario II, where ( λ i , τ i ) = ( 0.1 , 0.15 ) , ( μ i , θ i ) = ( 10 , 0.05 ) and ε i = 0.2 ;
  • Scenario III, where ( λ i , τ i ) = ( 0.1 , 0.1 ) , ( μ i , θ i ) = ( 10 , 0.1 ) and ε i = 0.2 ;
  • Scenario IV, where for each worker i, parameters λ i , τ i , μ i , θ i and ε i were uniformly distributed at random over intervals [0.07, 0.2], [0.1, 0.3], [10, 20], [0.05, 0.2] and [0.1, 0.4], respectively.
For these scenarios, we evaluated the following six methods.
  • Uniform uncoded, where the divided sub-matrices were equally assigned to 10 worker nodes—i.e., each worker node computed 100 sub-matrices.
  • Two-Replication, where the divided sub-matrices were equally assigned to five worker nodes, and the computing tasks of these worker nodes were replicated at another five worker nodes.
  • ( 10 , 8 ) MDS code, where the divided 1000 sub-matrices were encoded into 1250 sub-matrices and then equally assigned to 10 worker nodes.
  • LT code [14], where the 1000 original sub-matrices were encoded using LT codes, and an infinite number of coded sub-matrices was assigned to each worker node.
  • RLNC: The details are introduced in Section 3. The time cost of recoding and decoding operations was ignored.
  • BATS code: The details are introduced in Section 4, and a batch size of eight was used.
While our proposed schemes tackle the packet-loss issue, the first four schemes above do not consider it at all. For these four schemes, we therefore used an ideal retransmission (IR) scheme, in which each worker node knows immediately whether a transmitted packet has been lost. This makes these schemes perform better than they would with practical retransmissions. In the following, we refer to the first four schemes as Uncoded + IR, Rep + IR, (10,8)MDS + IR and LT + IR, respectively.
The latency performance levels of these approaches under the four scenarios are plotted in Figure 3, where the decoding latency at the master node is ignored. From this figure, we observe the following.
  • Among the first four schemes, LT + IR achieved the best performance for all four scenarios. Note that IR eliminates the packet-loss issue, and this result has also been demonstrated in [14], where only the straggler issue was considered. This is because LT codes can achieve near-perfect load balance among the worker nodes in the presence of stragglers.
  • For all these scenarios, the proposed RLNC approach achieved the best latency performance among all these schemes. In particular, the performance of the RLNC approach was slightly better than that of LT + IR. Just like LT + IR, our RLNC approach also achieved near-perfect load balance among the worker nodes. Meanwhile, LT + IR incurred a small precode overhead, whereas the RLNC approach did not. This result also demonstrates the near-optimality of the RLNC approach.
  • Our BATS approach performed much better than Uncoded + IR, Rep + IR, and (10,8) MDS + IR in all these scenarios, but slightly worse than LT + IR and RLNC. Since LT + IR assumes an ideal retransmission scheme, which is impractical, and the RLNC approach incurs high encoding and decoding costs, the BATS approach is much more practical.
In summary, both our RLNC approach and our BATS approach can overcome both the straggler issue and the packet-loss issue effectively and can achieve near-optimal performance in different scenarios when the number of columns d is large enough.

6. Conclusions

In this paper, we focused on jointly addressing the straggler issue and the packet-loss issue for distributed matrix multiplication in wireless distributed computing systems. We proposed an RLNC approach and proved its asymptotic optimality using a continuous-time-trellis-based argument. We further proposed a more practical variation of the RLNC approach based on BATS codes. The effectiveness of both approaches was demonstrated through numerical simulations.

Author Contributions

Methodology, B.F., B.T. and Z.Q.; Validation, B.F.; Formal analysis, B.T.; Writing—original draft, B.F. and B.T.; Writing—review & editing, Z.Q. and B.Y.; Supervision, B.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This paper is supported by the Water Conservancy Project of Jiangsu Province under Grant No. 2021053, the National Natural Science Foundation of China under Grant No. 61872171, the Fundamental Research Funds for the Central Universities under Grant No. B210201053, the Natural Science Foundation of Jiangsu Province under Grant No. BK20190058, and the Future Network Scientific Research Fund Project under Grant No. FNSRFP-2021-ZD-07.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Kairouz, P.; McMahan, H.B.; Avent, B.; Bellet, A.; Bennis, M.; Bhagoji, A.N.; Zhao, S. Advances and open problems in federated learning. Found. Trends Mach. Learn. 2021, 14, 2179–2217.
  2. Drolia, U.; Guo, K.; Narasimhan, P. Precog: Prefetching for image recognition applications at the edge. In Proceedings of the Second ACM/IEEE Symposium on Edge Computing, San Jose, CA, USA, 12–14 October 2017; pp. 1–13.
  3. Datla, D.; Chen, X.; Tsou, T.; Raghunandan, S.; Hasan, S.S.; Reed, J.H.; Kim, J.H. Wireless distributed computing: A survey of research challenges. IEEE Commun. Mag. 2012, 50, 144–152.
  4. Li, S.; Yu, Q.; Maddah-Ali, M.A.; Avestimehr, A.S. A scalable framework for wireless distributed computing. IEEE-ACM Trans. Netw. 2017, 25, 2643–2654.
  5. Dean, J.; Barroso, L.A. The tail at scale. Commun. ACM 2013, 56, 74–80.
  6. Zaharia, M.; Konwinski, A.; Joseph, A.D.; Katz, R.H.; Stoica, I. Improving MapReduce performance in heterogeneous environments. In Proceedings of the 8th USENIX Symposium on Operating Systems Design and Implementation, San Diego, CA, USA, 8–10 December 2008; pp. 7–21.
  7. Wang, D.; Joshi, G.; Wornell, G. Efficient task replication for fast response times in parallel computation. In Proceedings of the 2014 ACM International Conference on Measurement and Modeling of Computer Systems, Austin, TX, USA, 16–20 June 2014; pp. 599–600.
  8. Wang, D.; Joshi, G.; Wornell, G. Using straggler replication to reduce latency in large-scale parallel computing. ACM Sigmetrics Perform. Eval. Rev. 2015, 43, 7–11.
  9. Dean, J.; Ghemawat, S. MapReduce: Simplified data processing on large clusters. Commun. ACM 2008, 51, 107–113.
  10. Zaharia, M.; Chowdhury, M.; Franklin, M.J.; Shenker, S.; Stoica, I. Spark: Cluster computing with working sets. HotCloud 2010, 10, 10.
  11. Lee, K.; Lam, M.; Pedarsani, R.; Papailiopoulos, D.; Ramchandran, K. Speeding up distributed machine learning using codes. IEEE Trans. Inf. Theory 2017, 64, 1514–1529.
  12. Ferdinand, N.; Draper, S.C. Hierarchical coded computation. In Proceedings of the 2018 IEEE International Symposium on Information Theory, Vail, CO, USA, 17–22 June 2018; pp. 1620–1624.
  13. Kiani, S.; Ferdinand, N.; Draper, S.C. Exploitation of stragglers in coded computation. In Proceedings of the 2018 IEEE International Symposium on Information Theory, Vail, CO, USA, 17–22 June 2018; pp. 1988–1992.
  14. Mallick, A.; Chaudhari, M.; Sheth, U.; Palanikumar, G.; Joshi, G. Rateless codes for near-perfect load balancing in distributed matrix-vector multiplication. Commun. ACM 2022, 65, 111–118.
  15. Luby, M. LT codes. In Proceedings of the 43rd Annual IEEE Symposium on Foundations of Computer Science, Vancouver, BC, Canada, 16–19 November 2002; pp. 271–282.
  16. Shokrollahi, A. Raptor codes. IEEE Trans. Inf. Theory 2006, 52, 2551–2567.
  17. Han, D.J.; Sohn, J.Y.; Moon, J. Coded wireless distributed computing with packet losses and retransmissions. IEEE Trans. Wirel. Commun. 2021, 20, 8204–8217.
  18. Ahlswede, R.; Cai, N.; Li, S.Y.; Yeung, R.W. Network information flow. IEEE Trans. Inf. Theory 2000, 46, 1204–1216.
  19. Ho, T.; Médard, M.; Koetter, R.; Karger, D.R.; Effros, M.; Shi, J.; Leong, B. A random linear network coding approach to multicast. IEEE Trans. Inf. Theory 2006, 52, 4413–4430.
  20. Yang, S.; Yeung, R.W. Batched sparse codes. IEEE Trans. Inf. Theory 2014, 60, 5322–5346.
  21. Park, H.; Lee, K.; Sohn, J.Y.; Suh, C.; Moon, J. Hierarchical coding for distributed computing. In Proceedings of the 2018 IEEE International Symposium on Information Theory, Vail, CO, USA, 17–22 June 2018; pp. 1630–1634.
  22. Lin, Z.; Narra, K.G.; Yu, M.; Avestimehr, S.; Annavaram, M. Train where the data is: A case for bandwidth efficient coded training. arXiv 2019, arXiv:1910.10283.
  23. Yu, Q.; Maddah-Ali, M.; Avestimehr, S. Polynomial codes: An optimal design for high-dimensional coded matrix multiplication. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017.
  24. Ramamoorthy, A.; Tang, L.; Vontobel, P.O. Universally decodable matrices for distributed matrix-vector multiplication. In Proceedings of the 2019 IEEE International Symposium on Information Theory (ISIT), Paris, France, 7–12 July 2019; pp. 1777–1781.
  25. Ramamoorthy, A.; Tang, L. Numerically stable coded matrix computations via circulant and rotation matrix embeddings. IEEE Trans. Inf. Theory 2022, 68, 2684–2703.
  26. Wu, Y. A trellis connectivity analysis of random linear network coding with buffering. In Proceedings of the IEEE International Symposium on Information Theory, Seattle, WA, USA, 9–14 July 2006; pp. 768–772.
  27. Motwani, R.; Raghavan, P. Randomized Algorithms; Cambridge University Press: Cambridge, UK, 1995.
  28. Tang, B.; Yang, S.; Ye, B.; Yin, Y.; Lu, S. Expander chunked codes. EURASIP J. Adv. Signal Process. 2015, 1, 106.
  29. Tang, B.; Yang, S. An LDPC approach for chunked network codes. IEEE ACM Trans. Netw. 2018, 26, 605–617.
  30. Feizi, S.; Lucani, D.E.; Médard, M. Tunable sparse network coding. In Proceedings of the 22nd International Zurich Seminar on Communications (IZS), Zürich, Switzerland, 29 February–2 March 2012.
  31. Garrido, P.; Sørensen, C.W.; Lucani, D.E.; Agüero, R. Performance and complexity of tunable sparse network coding with gradual growing tuning functions over wireless networks. In Proceedings of the 2016 IEEE 27th Annual International Symposium on Personal, Indoor, and Mobile Radio Communications (PIMRC), Valencia, Spain, 4–8 September 2016.
  32. Garrido, P.; Gómez, D.; Lanza, J.; Agüero, R. Exploiting sparse coding: A sliding window enhancement of a random linear network coding scheme. In Proceedings of the 2016 IEEE International Conference on Communications (ICC), Kuala Lumpur, Malaysia, 22–27 May 2016.
  33. Wunderlich, S.; Gabriel, F.; Pandi, S.; Fitzek, F.H.; Reisslein, M. Caterpillar RLNC (CRLNC): A practical finite sliding window RLNC approach. IEEE Access 2017, 5, 20183–20197.
  34. Yang, J.; Shi, Z.P.; Wang, C.X.; Ji, J.B. Design of optimized sliding-window BATS codes. IEEE Commun. Lett. 2019, 23, 410–413.
  35. Karetsi, F.; Papapetrou, E. Lightweight network-coded ARQ: An approach for ultra-reliable low latency communication. Comput. Commun. 2022, 185, 118–129.
  36. Tasdemir, E.; Nguyen, V.; Nguyen, G.T.; Fitzek, F.H.; Reisslein, M. FSW: Fulcrum sliding window coding for low-latency communication. IEEE Access 2022, 10, 54276–54290.
  37. Tang, B.; Yang, S.; Ye, B.; Guo, S.; Lu, S. Near-optimal one-sided scheduling for coded segmented network coding. IEEE Trans. Comput. 2015, 65, 929–939.
Figure 1. Illustration of the wireless distributed computing system for matrix multiplication.
Figure 2. Illustration of a continuous-time trellis, $G_i(t)$.
Figure 3. The latency performances of different approaches under four scenarios, where the error bar indicates the standard deviation.
Table 1. The decoding delays (in seconds) of our proposed approaches in comparison with the delay of the original matrix multiplication.

d                               1000     2000     4000     8000     16,000   32,000
matrix multiplication delay     34.16    69.59    138.69   280.86   550.43   1116.84
decoding delay (RLNC)           34.51    34.51    34.51    34.51    34.51    34.51
decoding delay (BATS)           0.54     0.54     0.54     0.54     0.54     0.54

