Article

A Federated Learning and Deep Reinforcement Learning-Based Method with Two Types of Agents for Computation Offload

School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an 710049, China
* Author to whom correspondence should be addressed.
Sensors 2023, 23(4), 2243; https://doi.org/10.3390/s23042243
Submission received: 10 January 2023 / Revised: 14 February 2023 / Accepted: 14 February 2023 / Published: 16 February 2023

Abstract
With the rise of latency-sensitive and computationally intensive applications in mobile edge computing (MEC) environments, computation offloading strategies have been widely studied to meet the low-latency demands of these applications. However, the uncertainty of various tasks and the time-varying conditions of wireless networks make it difficult for mobile devices to make efficient decisions. The existing methods also face the problems of long-delay decisions and user data privacy disclosures. In this paper, we present FDRT, a federated learning and deep reinforcement learning-based method with two types of agents for computation offload, to minimize the system latency. FDRT uses a multi-agent collaborative computation offloading strategy, namely, DRT. DRT divides the offloading decision into whether to compute tasks locally and where to offload tasks among the MEC servers and the cloud. The designed DDQN agent considers the task information, its own resources, and the network status of the mobile device, and the designed D3QN agent considers these conditions of all MEC servers in the collaborative cloud–edge–end MEC system; both jointly learn the optimal decision. FDRT also applies federated learning to reduce communication overhead and optimize the model training of DRT by designing a new parameter aggregation method, while protecting user data privacy. The simulation results showed that DRT effectively reduced the average task execution delay by up to 50% compared with several baseline and state-of-the-art offloading strategies. FDRT also accelerated the convergence of multi-agent training and reduced the training time of DRT by 61.7%.

1. Introduction

In recent years, with the development of mobile smart devices and 5G, many computationally intensive applications with low latency requirements have emerged, such as autonomous driving [1], virtual reality and augmented reality [2], online interactive gaming [3], and video streaming analysis [4]. These applications all have high requirements for quality of service (QoS). However, mobile devices (MDs) have limited computing power and are challenged by the growing demands for application computing power and increasingly stringent latency requirements.
To overcome this challenge, the mobile edge computing (MEC) [5] paradigm, as a core technology of 5G, pushes computing resources to the network edge, much closer to the MDs, thus relieving the network congestion and task delay of traditional centralized cloud computing. Unlike traditional cloud servers, MEC servers are not very resource-rich. Therefore, MDs need an efficient computation offloading strategy to determine whether to offload their tasks to MEC servers or the cloud server for execution, so as to fully utilize the computational resources, meet the quality of experience (QoE) of MDs, and minimize the task execution latency. However, the uncertainty of computing tasks and the time-varying nature of wireless channels make it difficult to make accurate and appropriate computation offloading decisions.
Reinforcement learning (RL) is a method for learning “what to do (i.e., how to map the current environment into an action) to maximize the numerical revenue signal” [6]. With the rise of artificial intelligence, deep reinforcement learning (DRL), which combines RL and deep learning, is considered an effective method for finding asymptotically optimal solutions in time-varying edge environments [7]. Without any prior knowledge, DRL can capture the hidden dynamics of the environment by enhancing the intelligence of the edge network, learning strategies and achieving optimal long-term goals through repeated, context-specific interactions. This property allows DRL to show its unique potential in designing computation offloading strategies for dynamic MEC systems. In centralized DRL methods, the user data are transmitted to a central server for model training, and the agent deployed on MDs learns the strategy. Centralized DRL not only puts pressure on the wireless network but also leads to the risk of user data leakage. However, the existing DRL-based methods [8,9,10] rarely consider the issue of data privacy protection.
To address these problems, this paper proposes a federated learning (FL) and DRL-based method with two types of agents for computation offload, named FDRT. FL [11] decouples the model from the user data and aggregates the model according to the uploaded local model parameters, thus achieving the balance between data privacy protection and data sharing. Applying FL to the MEC environment can realize decentralized distributed training and accelerate the parameter transmission and training speed of DRL agents. FDRT uses two types of DRL agents to better explore other MEC server resources for making the optimal offloading decision, thus achieving the goal of minimizing the task execution delay of each MD. The major contributions of this paper are summarized as follows.
  • We proposed a multi-agent collaborative computation offloading strategy, named DRT. By building a collaborative cloud–edge–end MEC system, DRT was used to design a double deep Q-network (DDQN)-based mobile device agent and a dueling DDQN (D3QN)-based MEC server agent, which respectively decided whether to compute a task locally on the MD and where to execute an offloaded task. DRT enabled the offloading strategy to consider task information, network status, and nearby MEC server resources to ensure the optimal decision for minimizing the execution delay of tasks.
  • We proposed an FL-based multi-agent training method, FDRT, to optimize the training of DRT. In FDRT, the MEC server aggregated the network parameters of MDs within its coverage to obtain the semi-global model, and the global model was aggregated from the semi-global models of nearby MEC servers, which reduced the parameter transmission of traditional FL training and the network overhead and thus improved the system’s QoS. Meanwhile, FDRT enabled the privacy protection of user data.
  • We conducted two sets of simulation experiments. The experimental results showed that the proposed DRT achieved significant performance improvements over several baseline and state-of-the-art computation offloading strategies, reducing the average task execution delay by over 50%. We also demonstrated the effectiveness of FDRT, which reduced the training time of DRT by 61.7% and the average task execution delay by a further 2.8%.
The rest of the paper is organized as follows. Section 2 introduces the related work. Section 3 gives a detailed description of the system model, followed by the methodology in Section 4. In Section 5, we describe the proposed training optimization method. Section 6 presents the extensive experiments and result analysis. Finally, we conclude the paper in Section 7.

2. Related Work

Mathematical optimization methods, such as mixed integer programming (MIP), game theory, and heuristic search, are a class of classical solutions with a long history in computation offloading. Recently, Hazra et al. [12] propose a heuristic-based transmission scheduling strategy and a graph-based task offloading strategy using MIP for industrial IoT applications in high-traffic scenarios with a collaborative fog and cloud environment. Chen et al. [13] propose a non-cooperative game algorithm to find the Nash equilibrium solution, with energy consumption and time delay as the optimization objectives, for computation offloading of MEC applications. These methods are suitable for solving static optimization problems but are not sufficient for dynamic or sequential optimization problems. When a system disturbance occurs, mathematical methods need to reschedule or reoptimize. Moreover, when there are many user devices, these methods have a high computational complexity. For example, the time complexities of [12] and [13] are $O(N^3)$ and $O(2^N)$, respectively, where N is the number of user devices in the network.
In recent years, learning-based approaches have become the dominant solutions to computation offloading strategies for MEC environments. Learning-based approaches mainly include multidisciplinary techniques, such as collective intelligence, constraint satisfaction problem (CSP), and machine learning, which have been widely used in different computer systems and network scenarios.
The properties of collective intelligence make it a practical model for algorithms to solve complex problems. The most commonly used algorithms are particle swarm optimization (PSO) algorithms [14,15]. Hussein et al. [16] efficiently distribute tasks to edge nodes and improve the response time of IoT applications by leveraging the global search capability of the ant colony algorithm. Rodrigues et al. [17] use the PSO to reduce the transmission and processing latency for minimizing the total end-to-end latency of partially offloaded tasks. Yadav et al. [18] combine a genetic algorithm and a particle swarm algorithm and propose a hybrid model to achieve near-optimal computation offloading for IoT applications while minimizing the total completion time and energy consumption.
The computation offloading problem can be defined as CSP, which has constraints such as QoS, heterogeneity of MDs, and dynamics of task generation. Wang et al. [19] formulate the CSP of computation offloading by combining user mobility, task properties, and network constraints to reduce the task execution delay in the MEC infrastructure. Kamal et al. [20] use the CSP formulation in conjunction with a minimal conflict scheduling algorithm to achieve a balanced load on MEC server resources with minimal energy consumption.
Due to their strong generalization ability, machine learning models are also widely used for task offloading strategies. Rahbari et al. [21] use a classification and regression tree to select the most suitable edge device for offloading, and they also consider parameters such as authentication, integrity, availability, capacity, speed, and cost to minimize time and energy consumption. Bashir et al. [22] use logistic regression to calculate the load of each edge node and propose a dynamic resource allocation policy. Ullah et al. [23] use a K-means clustering approach to provide efficient task scheduling based on resource requirements in terms of CPU, I/O, and communication, thus improving the utilization of edge devices. To determine the combination of different devices and dynamic tasks, Rani et al. [24] propose a deep learning model to address the speed, power, and security challenges while meeting the QoS.
RL is often combined with deep learning in order to generalize to previously unseen environments, states, and behaviors. In the MEC environment, Huang et al. [25] implement a DRL model that determines whether a task is computed locally or entirely handled by the MEC server. Zhou et al. [26] design an agent based on the double deep Q-network (DDQN) approach and deploy the agent on the MEC server to perform computation offloading decisions for all users simultaneously. In the context of vehicular networking, Zhao et al. [27] propose an algorithm based on dueling deep Q-learning (DDQL) to implement distributed agent decision making, increasing the scalability and flexibility of the practical implementation of the algorithm. Yun et al. [8] use a genetic algorithm for user devices to determine the optimal offloading decision and propose a multi-actor deep Q-network agent for the MEC server to achieve dynamic resource allocation. Zhou et al. [9] formulate the optimization problem of computation offloading, service caching, and resource allocation as mixed-integer non-linear programming (MINLP) and use the Deep Deterministic Policy Gradient (DDPG) algorithm to find the optimal strategy to minimize the long-term energy consumption. Zhang et al. [10] use the reconfigurable intelligent surface (RIS) technique to adjust the phase shift and amplitude of reflective elements to improve the wireless network link status and energy efficiency, and they also use the DDPG algorithm for the computation offloading strategy. Li et al. [28] also design an offloading strategy based on a DDPG approach where all computing tasks sequentially decide the computation location through an agent. Koo et al. [29] adopt a Q-learning algorithm to find a task offloading strategy through device-to-device communication in the MEC environment. They cluster nearby devices and use the head device of each cluster to make the decision, reducing the computing complexity of agents. Under the constraints of IoT device power and wireless charging, Wei et al. [30] propose agents based on post-state learning algorithms, which are deployed on MDs for decision making. In cyber-twin networks, Hou et al. [31] achieve fast task processing, dynamic real-time task allocation, and low training overhead by using a multi-agent deep deterministic policy gradient approach.
There are several main drawbacks to these machine-learning-based methods. First, most of the existing methods use a single type of DRL agent to make the computation offloading decisions. They mainly consider the resources of MDs, connected MEC servers, and cloud servers but ignore the resources of other MEC servers in the MEC environment. In addition, it is difficult for MDs to access the resource information of MEC servers and the cloud server, and accessing this information also adds decision latency. Second, in traditional machine learning architectures for the MEC environment, data producers must frequently send and share data with third parties (e.g., MEC servers or cloud servers) to train their models. This not only causes the risk of user data privacy disclosure but also places a high demand on network bandwidth. In the time-varying wireless network environment, high communication overhead will undoubtedly have a negative impact on the decision delay.
To address these problems, we design two types of agents, a DDQN-based mobile device agent and a D3QN-based MEC server agent, to make the optimal offloading decision by jointly considering the resources of neighboring MEC servers. We apply FL to the multi-agent collaborative training. FL can effectively protect user data privacy by keeping the data localized while only transmitting local model parameters for global model training. It has been applied in multiple domains in the MEC environment, including privacy protection and communication optimization for large-scale model training [32,33], content caching [34], malware and anomaly detection [35], task scheduling and resource allocation [36,37], computation offloading [38], etc. However, in most MEC federated learning systems, the parameters of all MDs are transmitted to an MEC server or the central cloud, which not only causes high network overhead and occupies network resources in the core network or edge layer but also increases the latency of users waiting for the global model parameters, therefore reducing the system’s QoS. We also optimize the FL training efficiency by building a decentralized federated learning system model. Each MEC server and MD deploys an MEC server agent and a mobile device agent, respectively. Each MEC server additionally deploys a mobile device agent model as the semi-global model of all mobile devices within its current base station range. The MEC server aggregates the semi-global models from the nearby MEC servers to generate the global model, which reduces the parameter transmission of traditional FL training and the network overhead, especially compared with a central cloud-based federated learning model. Table 1 summarizes the comparison between our work and recent representative related work. For the complexity, N denotes the number of devices in the network, T denotes the number of episodes for model training, and H denotes the number of steps per episode.

3. System Model

3.1. Network Model

In this paper, we considered the MEC system model with cloud–edge–end collaboration shown in Figure 1. In the model, the scenario was seamlessly covered by M BSs that provided computation offloading services to N user MDs distributed within their range through a 5G wireless communication network, where the BSs were connected to each other via optical fiber. Each BS was equipped with an MEC server to provide computing power, so that the BS could compute a variety of tasks to meet users’ needs. A user’s MD generated computing tasks that could be computed on the device or offloaded to an MEC server for execution. In addition, the BSs were connected to the core network via high-speed optical fiber, which in turn exchanged data with the central cloud, so that the MD could also offload tasks to the central cloud through the BS. The MDs were denoted as a set $U = \{u_1, u_2, \ldots, u_n, \ldots, u_N\}$, and the BSs were denoted as a set $E = \{e_1, e_2, \ldots, e_m, \ldots, e_M\}$.
To facilitate the subsequent modeling, a time slot model was used to discretize the time into equal intervals. The length of the time slot was denoted as $T_{len}$, and the index was denoted as $t = 0, 1, 2, \ldots$. The size of $T_{len}$ was set as the coherence time, during which the channel could be kept constant. In communication systems, the communication channel may change with time. This channel variation is more significant in wireless communication systems due to the Doppler effect. Table 2 presents the main notations used in this paper.

3.2. Communication Model

In the MEC system network with cloud–edge–end collaboration, two main communication methods were included, i.e., edge-to-end wireless communication and edge-to-edge and edge-to-cloud wired communication. The transmission rate between BSs $e_m$ and $e_{m'}$ was denoted as $v_{m,m'}^E$, and the transmission rate between the BS $e_m$ and the central cloud was denoted as $v_m^{E,C}$. The two kinds of communications both obeyed stable and independent random processes with probability distribution functions $fun^E(v^E)$ and $fun^{E,C}(v^{E,C})$, respectively. Next, the wireless communication method between MDs and BSs is described and modeled in detail.
In order to minimize the execution delay of the computing tasks of MDs, we constructed a wireless communication model based on 5G technology. MDs may need to offload computation and transmit data to the BS in every time slot; therefore, the orthogonal frequency division multiplexing (OFDM) technique was used to assign different sub-channels to different devices to reduce the mutual interference between sub-channels and ensure the devices’ transmission requirements.
In this paper, a Rayleigh fading model was constructed based on the free-space path loss model to simulate the channel in a dense building scenario that is very common in cities. In this scenario, there was no direct path between the transmitter and receiver, and the signal was attenuated, reflected, refracted, and diffracted by buildings or other objects. The channel remained stable within a time slot but changed from time slot to time slot. The channel gain $h_{n,m}(t)$ of the wireless channel is calculated by Equation (1) according to [26],

$$h_{n,m}(t) = \beta_{n,m}(t)\, A_d \left( \frac{c_0}{4 \pi f_c d_{n,m}(t)} \right)^{d_e}, \tag{1}$$

where $\beta_{n,m}(t)$ denotes the channel fading factor of the Rayleigh distribution between $u_n$ and $e_m$ in the $t$th time slot, and its probability distribution function is $fun^B(\beta)$; $A_d$ denotes the antenna gain of the BS connected to $e_m$; $c_0$ denotes the light speed in vacuum; $f_c$ denotes the carrier frequency of the BS connected to $e_m$; $d_{n,m}(t)$ denotes the distance between $u_n$ and $e_m$ in the $t$th time slot; and $d_e$ denotes the path loss index.
Since the distance between a BS and its deployed MEC server was close, the transmission delay between them was negligible. According to the Shannon equation $C = B \log_2(1 + S/N)$, combined with Equation (1), the transmission rate $v_{n,m}(t)$ between $u_n$ and $e_m$ in the $t$th time slot can be calculated by Equation (2),

$$v_{n,m}(t) = B \log_2\left(1 + \frac{p_0\, h_{n,m}(t)}{N_0}\right), \tag{2}$$

where $B$ denotes the channel bandwidth between $u_n$ and $e_m$, $p_0$ denotes the transmission power of $u_n$, and $N_0$ denotes the Gaussian white noise power. Since the OFDM subchannels did not interfere with each other, the noise in the channel was only Gaussian white noise.
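To make Equations (1) and (2) concrete, the following is a minimal Python sketch of the channel gain and uplink rate computation. The function names and the default values for the carrier frequency, antenna gain, path loss index, bandwidth, transmission power, and noise power are illustrative placeholders chosen by us, not the simulation settings of Table 3, and the exponential draw is one common way to generate a Rayleigh power fading factor.

```python
import math
import random

def channel_gain(d_nm, f_c=4.9e9, A_d=3.0, d_e=3.0):
    """Rayleigh-faded free-space channel gain per Equation (1).
    d_nm: distance between MD u_n and BS e_m in meters."""
    c0 = 3.0e8                        # light speed in vacuum (m/s)
    beta = random.expovariate(1.0)    # Rayleigh power fading factor (exponentially distributed)
    return beta * A_d * (c0 / (4 * math.pi * f_c * d_nm)) ** d_e

def uplink_rate(d_nm, B=20e6, p0=0.2, N0=1e-13):
    """Transmission rate between u_n and e_m per Equation (2) (Shannon capacity)."""
    return B * math.log2(1 + p0 * channel_gain(d_nm) / N0)

# Example: instantaneous uplink rate at 100 m from the BS
print(f"{uplink_rate(100.0) / 1e6:.2f} Mbit/s")
```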

3.3. Task Model

At the beginning of the $t$th time slot, the N MDs generate N indivisible and independent tasks at the same time. The computing task generated by $u_n$ at the beginning of the $t$th time slot is denoted by $d_n(t) = (s_n(t), c_n(t))$, where $s_n(t)$ denotes the size of the task data and $c_n(t)$ denotes the number of CPU cycles required to process 1 bit of the data. If no data were generated at the beginning of the $t$th time slot, $s_n(t) = 0$. Since the data were actually generated in the $(t-1)$th time slot of $u_n$ and processed from the $t$th time slot, they could be considered to be generated at the beginning of the $t$th time slot for the convenience of model representation.

3.4. Computation Model

The computing power of the central cloud is thousands of times greater than that of MDs and MEC servers. Therefore, the CPU frequency of the central cloud was set to infinity, and the CPU frequencies of the MDs and MEC servers were set to fixed values, denoted as $f^U$ and $f^E$, respectively, where $f^E > f^U$. The computing power of the MDs and MEC servers was relatively limited. If a task was computed on an MD or an MEC server, other tasks may have been running, and the task could not be executed immediately. Therefore, we set task queues on both MDs and MEC servers for the queuing tasks, where the “first in, first out” principle was applied to the queues.
The computing tasks of an MD could be executed locally on the device, offloaded to the MEC server of its BS or the MEC server of a neighboring BS, or offloaded to the central cloud for execution. Different computing modes led to different latencies, and the four computing modes are analyzed as follows (an illustrative sketch of the delay computations follows the list).
(1)
Local Computing. In the $t$th time slot, $u_n$’s task queue is denoted as $q_n^U(t) = (b_n^U(t), c_n^U(t))$. When $t = 0$, $q_n^U(0) = (0, 0)$, i.e., there is no task in the queue. If the computing task $d_n(t)$ of $u_n$ is executed locally, the task will be added to the local queue, and its execution delay $l_n^L(t)$ is calculated by Equation (3) according to [26].

$$l_n^L(t) = \frac{s_n(t)\, c_n(t) + c_n^U(t)}{f^U}. \tag{3}$$
(2)
MEC Computing. In the $t$th time slot, $e_m$’s task queue is denoted as $q_m^E(t) = (b_m^E(t), c_m^E(t))$. If $d_n(t)$ is executed on $e_m$, the task will be added to the task queue of the MEC server, and its execution delay $l_n^E(t)$ is calculated by Equation (4) according to [26]. Since the resulting data of the task are usually very small, the transmission delay of the resulting data can be neglected.

$$l_n^E(t) = \frac{s_n(t)}{v_{n,m}(t)} + \frac{s_n(t)\, c_n(t) + c_m^E(t)}{f^E}. \tag{4}$$
(3)
Near MEC Computing. If $d_n(t)$ is executed on the MEC server of a neighboring BS $e_{m'}$, the task will be added to the task queue of that MEC server, and the task execution delay $l_n^{E'}(t)$ can be calculated by Equation (5) according to [26].

$$l_n^{E'}(t) = \frac{s_n(t)}{v_{n,m}(t)} + \frac{s_n(t)}{v_{m,m'}^E} + \frac{s_n(t)\, c_n(t) + c_{m'}^E(t)}{f^E}. \tag{5}$$
(4)
Cloud Computing. If $d_n(t)$ is executed in the central cloud, its execution delay $l_n^C(t)$ can be calculated by Equation (6) according to [26]. Since the computing power of the central cloud is overwhelmingly strong, the computation time of the task in the central cloud can be ignored.

$$l_n^C(t) = \frac{s_n(t)}{v_{n,m}(t)} + \frac{s_n(t)}{v_m^{E,C}}. \tag{6}$$
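The four delay models in Equations (3)–(6) reduce to simple arithmetic. The sketch below is an illustrative Python rendering with variable names of our own choosing (s: task data size in bits, c: CPU cycles per bit, c_queue_*: CPU cycles already queued, f_u/f_e: CPU frequencies, v_*: link rates); it is not taken from the authors’ implementation.

```python
def local_delay(s, c, c_queue_u, f_u):
    """Equation (3): compute on the MD, behind whatever is already queued locally."""
    return (s * c + c_queue_u) / f_u

def mec_delay(s, c, v_nm, c_queue_m, f_e):
    """Equation (4): upload to the connected MEC server, then compute there."""
    return s / v_nm + (s * c + c_queue_m) / f_e

def near_mec_delay(s, c, v_nm, v_mm, c_queue_m2, f_e):
    """Equation (5): one extra wired hop to a neighboring MEC server before computing."""
    return s / v_nm + s / v_mm + (s * c + c_queue_m2) / f_e

def cloud_delay(s, v_nm, v_mc):
    """Equation (6): cloud computation time is neglected; only the two transmissions count."""
    return s / v_nm + s / v_mc
```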

4. Methodology

4.1. Problem Statement

Based on the established system model, we first formalized the optimization problem of minimizing the long-term computational latency of tasks for the entire MEC system under limited resource constraints. The execution location of the computing task $d_n(t)$ of $u_n$ in the $t$th time slot is denoted as $a_n(t)$. When $a_n(t) = 0$, $d_n(t)$ is computed locally; when $a_n(t) = 1$, the task is executed on the MEC server of its connected BS; when $a_n(t) = 2$, the task is executed on the MEC server of a neighboring BS; and when $a_n(t) = 3$, the task is executed on the central cloud server. Jointly with Equations (3)–(6), the execution delay of the computing task $d_n(t)$ is calculated by Equation (7),

$$l_n(t) = \mathbb{1}\{a_n(t)=0\}\, l_n^L(t) + \mathbb{1}\{a_n(t)=1\}\, l_n^E(t) + \mathbb{1}\{a_n(t)=2\}\, l_n^{E'}(t) + \mathbb{1}\{a_n(t)=3\}\, l_n^C(t), \tag{7}$$

where $\mathbb{1}\{\cdot\}$ is the indicator function: when the condition in the braces is true, the value of the function is 1; otherwise, it is 0. Then, the optimization problem can be expressed as Equation (8),

$$P1: \; \min_{A(t)} \; \lim_{\tau \to \infty} \frac{1}{\tau} \sum_{t=0}^{\tau-1} \sum_{n=1}^{N} l_n(t) \qquad \text{s.t.} \;\; C1: a_n(t) \in \{0, 1, 2, 3\}, \;\; C2: 0 \le d_{n,m}(t) \le D, \tag{8}$$

where $A(t)$ denotes the set of execution locations of all MDs’ tasks in the $t$th time slot, $d_{n,m}(t)$ denotes the distance between $u_n$ and its BS, and $D$ denotes the coverage radius of a BS. Constraint $C1$ ensures that each task is executed at exactly one location. Constraint $C2$ ensures that all user MDs are active within the coverage area of a BS, and another BS continues to provide service when an MD exceeds the maximum service distance of the current BS.
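For illustration, Equation (7) simply selects one of the four delays according to the chosen execution location, and the objective of P1 averages the summed delays over time. A minimal sketch under our own naming, reusing the delay values from Section 3.4, is given below.

```python
def task_delay(a_n, l_local, l_mec, l_near, l_cloud):
    """Equation (7): the task incurs the delay of its chosen execution location a_n(t)."""
    assert a_n in (0, 1, 2, 3)        # constraint C1: exactly one of the four locations
    return (l_local, l_mec, l_near, l_cloud)[a_n]

def average_system_delay(per_slot_delays):
    """Finite-horizon version of the objective in Equation (8): the average, over time
    slots, of the summed task delays of all MDs (per_slot_delays[t] lists l_n(t) per MD)."""
    return sum(sum(slot) for slot in per_slot_delays) / len(per_slot_delays)
```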
This optimization problem is NP-hard. To effectively solve it, we proposed a DRL-based strategy with two types of agents for computation offload, i.e., DRT. DRT first divided the problem into two sub-problems: (i) whether the MD executes the computing task locally, and (ii) for an offloaded task, which of the three computing modes (MEC computing, near-MEC computing, or cloud computing) should be used. According to these two sub-problems, we approximately modeled them as Markov decision processes (MDPs) and designed two types of DRL agents.

4.2. DDQN-Based Mobile Device Agent

We designed a DDQN-based mobile device agent, which was deployed on the MD to decide whether the MD executed computing tasks locally. The agent could make offloading decisions based only on the current computing task information, its task queues, and the network transmission rate between the MD and its connected MEC server. Since the agent was deployed on the MD, it could easily obtain such information. The three key elements of using the MDP to model the agent, including state space, action space, and reward function, are described as follows.

4.2.1. State

Due to the limited computing power of MDs, it was not suitable to deploy a complex agent on the MD. We minimized the state space and defined the state $s_n^U(t) \in S^U$ of the mobile device agent in the $t$th time slot as shown in Equation (9). The state includes the size of the data generated by a task, the number of CPU cycles required to process 1 bit of the data, the number of tasks in the local queue, the total number of CPU cycles to compute the tasks in the local queue, and the transmission rate between the MD and its connected MEC server.

$$s_n^U(t) = (d_n(t), q_n^U(t), v_{n,m}(t)) = (s_n(t), c_n(t), b_n^U(t), c_n^U(t), v_{n,m}(t)). \tag{9}$$

4.2.2. Action

The goal of the mobile device agent was to choose the optimal action to minimize the task execution delay based on the current state. The agent was responsible for deciding whether to execute the computing task locally at each time slot. The action of the agent is represented by $a_n^U(t)$, which is defined by Equation (10),

$$a_n^U(t) \in A^U = \{0, 1\}, \tag{10}$$

where $A^U$ denotes the action set of the agent. When $a_n^U(t) = 0$, the task is executed locally; when $a_n^U(t) = 1$, the task is offloaded, and the subsequent decision is made by the agent of the MEC server.

4.2.3. Reward

In general, the reward function is related to the optimization goal, which here was to minimize the long-term computation latency of all tasks in the entire MEC system. Since RL maximizes a numerical return, we constructed the reward function so that its value was negatively correlated with the value of the optimization objective. The value of the reward function, $r^U(s_n^U(t), a_n^U(t)) \in R^U$, where $R^U$ is the reward space, for the mobile device agent taking action $a_n^U(t)$ under the state $s_n^U(t)$ in the $t$th time slot is calculated by Equation (11) according to [39],

$$r^U(s_n^U(t), a_n^U(t)) = \begin{cases} -l_n^L(t), & a_n^U(t) = 0, \\[1mm] -\left(\dfrac{s_n(t)}{v_{n,m}(t)} + time_{n,m}(t)\right), & a_n^U(t) = 1, \end{cases} \tag{11}$$

where $time_{n,m}(t)$ indicates the subsequent execution time of the task $d_n(t)$ when it is offloaded for execution, as described in Section 4.3.3.
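Read concretely, Equation (11) returns the negative of the delay incurred by the chosen action, so maximizing the return minimizes latency. The sketch below uses our own parameter names; time_nm stands for the subsequent execution time $time_{n,m}(t)$ reported for offloaded tasks (Section 4.3.3).

```python
def md_agent_reward(a_u, s, c, c_queue_u, f_u, v_nm, time_nm):
    """Equation (11): negative delay of the action taken by the mobile device agent."""
    if a_u == 0:                       # execute locally, cf. Equation (3)
        return -(s * c + c_queue_u) / f_u
    return -(s / v_nm + time_nm)       # offload; the MEC server agent decides where
```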

4.2.4. DDQN Model Training

We used the DDQN model [39] to train the mobile device agent to obtain the optimal offloading decision, which could reduce the high computational complexity caused by the state space explosion in the MDP. The DDQN model adopted an experience replay method and decoupled action selection from Q-value estimation, where the Q value is the estimated return of taking an action in a state. During the training process, the agent maintained an experience replay pool and saved the transition quadruple (s, a, r, s′) of each time slot in the pool. The elements of the quadruple were the current state, the action, the reward for taking the action in the current state, and the next state after taking the action, respectively. The agent randomly selected a small batch of samples from the experience replay pool to update the parameters of the network, that is, it randomly selected some previous experience to learn from. The update of the experience replay pool followed the “first in, first out” principle. The agent maintained the current network $Q(s, a; \theta)$ and the target network $\hat{Q}(s, a; \hat{\theta})$, where $Q$ is used to select actions, $\hat{Q}$ is used to evaluate the value of the selected action, and $\theta$ and $\hat{\theta}$ represent the parameters of $Q$ and $\hat{Q}$. The target network regularly updated its own parameters by copying the parameters of the current network. The experience replay method was beneficial for accelerating training convergence.
Figure 2 shows the training process of the DDQN-based mobile device agent. In each episode, we first initialized the system model according to the task information, the task queue information, and the wireless network transmission rate, including the parameter $\theta$ of the current network $Q$, the parameter $\hat{\theta} = \theta$ of the target network $\hat{Q}$, and the experience replay pool $M$. The model started iterative training once the initial state was obtained. In each time slot, the “ε-greedy” strategy was used to select actions: a random value was drawn from [0, 1); if the value was greater than ε, a preset exploration rate, a random action was selected for execution; otherwise, the action with the maximum Q value output by the current network was selected. Then, the model executed the action, obtained the corresponding reward and the next state, and stored the transition into $M$. The agent selected a batch of samples from $M$ to calculate the gradient of the loss function and used gradient descent to back-propagate the gradient and minimize the loss. Finally, the current network converged to the optimal action-value function through continuous iteration of the above steps.
The DDQN model could ensure that the agent still had the ability to explore other actions after the network roughly converged, so as to prevent it from falling into the local optimum.
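The update described above can be written compactly in PyTorch. The following is a minimal sketch of one DDQN gradient step under our own assumptions about the replay pool layout (a list of (s, a, r, s′) tuples of plain Python numbers and lists) and the optimizer; the batch size and discount factor shown are illustrative rather than the values used in the experiments.

```python
import random
import torch
import torch.nn as nn

def ddqn_update(q_net, target_net, optimizer, replay_pool, batch_size=32, gamma=0.9):
    """One DDQN step: the current network selects the next action, the target
    network evaluates it, decoupling action selection from value estimation."""
    batch = random.sample(replay_pool, batch_size)
    states, actions, rewards, next_states = zip(*batch)
    s = torch.as_tensor(states, dtype=torch.float32)
    a = torch.as_tensor(actions, dtype=torch.int64).unsqueeze(1)
    r = torch.as_tensor(rewards, dtype=torch.float32)
    s_next = torch.as_tensor(next_states, dtype=torch.float32)

    q_sa = q_net(s).gather(1, a).squeeze(1)                 # Q(s, a; theta)
    with torch.no_grad():
        a_next = q_net(s_next).argmax(dim=1, keepdim=True)  # argmax_a Q(s', a; theta)
        target = r + gamma * target_net(s_next).gather(1, a_next).squeeze(1)

    loss = nn.functional.mse_loss(q_sa, target)             # mean square error loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

As in the procedure above, the target network parameters would then be periodically copied from the current network.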

4.3. D3QN-Based MEC Server Agent

The dueling DDQN (D3QN) model [40] has advantages in generalizing learning across actions without imposing any change to the underlying algorithm. This feature of D3QN is very suitable for the MEC server agent with a large action space. Therefore, we designed the D3QN-based MEC server agent, which is deployed on the MEC server, to make decisions for task offloading, i.e., it decided to compute the task on the MEC server or offload the task to a neighboring MEC server or to the central cloud for execution. Similar to the mobile device agent, we used the MDP to model the MEC server agent.

4.3.1. State

The state $s_{n,m}^E(t) \in S^E$ of the MEC server agent of $e_m$ at the arrival of a computing task from $u_n$ is defined by Equation (12), where $\tilde{M}$ indicates the number of MEC servers near $e_m$. The agent made offloading decisions based on the tasks offloaded from MDs, the task queue status of its own MEC server, and the task queue status of its neighboring MEC servers. Since the server agent was deployed on the MEC server of a BS, and the BSs of telecom operators are mutually trusted, the agent could easily access the task queue status of its neighboring MEC servers.

$$s_{n,m}^E(t) = \left(d_{n,m}(t),\, q_{n,m}^E(t),\, q_{n,m_1}^E(t),\, \ldots,\, q_{n,m_{\tilde{M}}}^E(t)\right). \tag{12}$$

4.3.2. Action

The goal of the MEC server agent was to decide the execution location of the offloaded task, i.e., to choose the optimal action to minimize the task execution delay based on the current state. The MEC server agent was responsible for deciding the computing mode of the task. The action for task $d_{n,m}(t)$ is defined by Equation (13), where $A^E$ denotes the action set of the server agent. When $a_{n,m}^E(t) = 0$, the task is executed on the local MEC server; $a_{n,m}^E(t) = 1$ means that the task is offloaded to the central cloud for execution; and $a_{n,m}^E(t) = 2, \ldots, \tilde{M}+1$ means that the task is offloaded to a neighboring MEC server for execution, where the numerical value indexes the neighboring server.

$$a_{n,m}^E(t) \in A^E = \{0, 1, 2, \ldots, \tilde{M}+1\}. \tag{13}$$

4.3.3. Reward

Similar to the mobile device agent, the value of the reward function, $r^E(s_{n,m}^E(t), a_{n,m}^E(t)) \in R^E$, for the server agent taking action $a_{n,m}^E(t)$ under state $s_{n,m}^E(t)$ in the $t$th time slot is calculated by Equation (14) according to [40]. The subsequent execution time of the task, $time_{n,m}(t)$, is given by Equation (15) according to [40] for the different cases, i.e., when the task is executed on the local MEC server ($a_{n,m}^E(t) = 0$), offloaded to the central cloud for execution ($a_{n,m}^E(t) = 1$), or offloaded to a neighboring MEC server for execution ($a_{n,m}^E(t) = 2, \ldots, \tilde{M}+1$). It is related to the task information, the task queue information, and the network transmission rates.

$$r^E(s_{n,m}^E(t), a_{n,m}^E(t)) = -time_{n,m}(t). \tag{14}$$

$$time_{n,m}(t) = \begin{cases} \dfrac{s_{n,m}(t)\, c_{n,m}(t) + c_m^E(t)}{f^E}, & a_{n,m}^E(t) = 0, \\[2mm] \dfrac{s_{n,m}(t)}{v_m^{E,C}}, & a_{n,m}^E(t) = 1, \\[2mm] \dfrac{s_{n,m}(t)}{v_{m,m'}^E} + \dfrac{s_{n,m}(t)\, c_{n,m}(t) + c_{m'}^E(t)}{f^E}, & a_{n,m}^E(t) = 2, \ldots, \tilde{M}+1. \end{cases} \tag{15}$$

4.3.4. D3QN Model Training

The MEC server agent had a larger and more complex state space and action space than the mobile device agent, especially the action space. When the action space is very large, D3QN performs better than traditional DRL networks [40]. Therefore, we used the D3QN model to train the MEC server agent, so as to avoid slow convergence and the resulting inability to make timely computation offloading decisions. The training process of the D3QN model was similar to that of the DDQN model. The current network and the target network of the D3QN model had the same functions as those of the DDQN model, but their structures were slightly different: each had two sets of output parameters, which were aggregated to output the Q value of each action. Due to space limitations, we do not describe this in detail; the relevant network structure can be found in [40].
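For reference, a dueling head of the kind introduced in [40] can be sketched in PyTorch as follows. The hidden width follows the 128 units mentioned in Section 6.2, while the trunk depth and the mean-subtracted aggregation of the advantage stream are common choices we assume here rather than details specified in the paper.

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Dueling architecture: a shared trunk feeds a state-value branch V(s) and an
    advantage branch A(s, a), aggregated into per-action Q values."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # V(s)
        self.advantage = nn.Linear(hidden, n_actions)  # A(s, a)

    def forward(self, s):
        x = self.trunk(s)
        adv = self.advantage(x)
        return self.value(x) + adv - adv.mean(dim=1, keepdim=True)

# Example: with M~ = 4 neighboring servers, the action set {0, 1, ..., M~ + 1} has 6 actions
net = DuelingQNet(state_dim=11, n_actions=6)           # state_dim is illustrative
q_values = net(torch.zeros(1, 11))
```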

4.4. DRT

Based on these two types of agents and Equation (8), DRT obtained the optimal computation offloading strategy for the MEC system through multi-agent cooperative training. The training process of DRT over multiple episodes is described as follows.
First, the system model and the parameters of each agent’s networks were initialized before the multi-episode iterative training. In a single episode, each mobile device agent obtained its current initial state and started training from that state. If the mobile device agent offloaded a computing task to its connected MEC server, the MEC server agent started its own single-episode training. Finally, through continuous iterative training and learning over multiple episodes, the current network of each mobile device agent and each MEC server agent converged to approximate the optimal action-value function; that is, the optimal offloading strategy for all tasks in the entire MEC system was learned, which minimized the execution delay of all tasks in the entire MEC system.
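At inference time, the cooperation of the two agents on a single task can be summarized by the sketch below; the object names and the act method are our own shorthand for an ε-greedy query of the corresponding trained Q-network.

```python
def drt_decide(md_agent, mec_agent, md_state, build_server_state):
    """Two-stage DRT decision for one task: the mobile device agent first chooses
    between local execution and offloading; only if it offloads does the connected
    MEC server agent choose among its own server, a neighbor, or the central cloud."""
    a_u = md_agent.act(md_state)            # Equation (10): 0 = local, 1 = offload
    if a_u == 0:
        return ("local", None)
    server_state = build_server_state()     # Equation (12): task info + queue states
    a_e = mec_agent.act(server_state)       # Equation (13): 0 = this MEC, 1 = cloud, 2..M~+1 = neighbor
    return ("offload", a_e)
```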

5. Optimization Method of Multi-Agent Training

5.1. Deployment of Two Types of Agents

Although DRT could dynamically and efficiently find the optimal computation offloading strategy, it also required a lot of computing resources to train agents. Therefore, the deployment of the two types of agents should be carefully considered in order to effectively use the computing resources in the MEC systems.
As described in Section 4, the MEC server agent is suitable to be deployed on the MEC server. For the mobile device agent, training it on the MD with the traditional RL training method would introduce two shortcomings. First, additional energy would be wasted by training a separate agent for each MD. Second, MDs have limited computing power, and the cost of training the agent from scratch is too high. There are also several drawbacks if the mobile device agent is trained on the MEC server that the MD is connected to. (a) There will be communication overhead between the MD and the MEC server, which causes delays in making offloading decisions and makes it difficult to make decisions in real time. (b) If only the connected MEC server maintains the mobile device agent, the agent data must be migrated when the user moves to another BS’s coverage. If all MEC servers maintain agents for each MD, it will not only result in resource waste but also bring in extra synchronization of agent data. (c) The privacy of MDs may be compromised, as the uploaded training data may be privacy-sensitive, especially in industrial information scenarios. (d) Although the training data can be transformed to protect privacy, the data received by the MEC server would lose some relevance compared with the source data, making it difficult to optimize the offloading decision or even leading to a worse offloading decision. (e) A large amount of training data is continually transmitted from MDs to the MEC server, which puts a heavy burden on the wireless channels of BSs.
Therefore, we deployed the DDQN-based agent on MDs and the D3QN-based agent on the MEC server and proposed the FDRT optimization method for multi-agent training.

5.2. FDRT

We applied the FL to achieve the optimization method of the multi-agent training for the proposed DRT, named FDRT. As a cooperative framework of machine learning, FL can train models without accessing users’ private data, and it can achieve decentralized distributed training with the mechanism of aggregation model. DRT needs to interact with the environment frequently for multi-agent collaborative training, which will generate a large number of network communications, increase the computing pressure of resource-constrained mobile devices, and thus increase the latency of computation offload decisions. When combining the FL with DRT, due to the decentralized topology of FL, the mobile device agent only needs to communicate with its connected MEC server. The MEC server aggregates the local model parameters of MDs within its coverage and transmits the aggregation results to them. This efficient decentralized training model can greatly reduce network overhead, improve users’ QoS, and reduce the training time of the multi-agent model.
In the federated learning system model for multi-agent training, each MEC server and each mobile device deployed an MEC server agent and a mobile device agent, respectively. Each MEC server also deployed an additional mobile device agent, which served as the global model of all MDs within the range of the server’s BS. The mobile device agent can be regarded as a client of the FL model, and its connected MEC server can be regarded as the parameter server. For the mobile device agent, the state space is the initial data set of the FL client. The target network of the mobile device agent used the discounted sum of its output $\hat{Q}$ value and the reward as the target value of the action in the current state and used the difference between the target value and the Q value of the current network to calculate the local loss function of the client, so as to update the current network parameters. The client updated the neural network parameters based on its local data set and uploaded the trained network parameters to the server. The server aggregated these updated parameters to obtain a global parameter and then transmitted it back to the clients for the next round of local training. Before parameter aggregation, each client performed multiple local training and parameter updates in a round of training.

5.3. FDRT Training and Semi-Global Aggregation

In the FL model for multi-agent training, the workflow was basically the same for all BSs. The workflow for a single BS with multiple MDs is shown in Figure 3. First, a global model of the mobile device agent was initialized on the MEC server. The initial model parameters may differ between MEC servers. The current network parameters of the model were distributed to all MDs within the coverage area of the BS. Each MD synchronized the received parameters to its local current network and target network. Then, the mobile device agent started local training, and its parameters were transmitted to the MEC server to which the MD was connected after every F training iterations in a round. If the device moved into the coverage of another BS during training, it transmitted the parameters of its current network to the newly connected MEC server, rather than transmitting them back to the previously connected MEC server. The MEC server received the network parameters of all MDs in its coverage area and averaged them to obtain a new network parameter, which is referred to as the semi-global model parameter. Further, the MEC server obtained the semi-global model parameters from its neighboring MEC servers and generated the global model parameter by a weighted average based on the number of devices in each coverage area. If a neighboring MEC server did not cover any MDs, the semi-global model parameters of this server were not aggregated. Finally, the MEC server transmitted the global model parameters back to the MDs. The MDs synchronized the new parameters to their local current and target networks and then started a new round of training.
In traditional FL, local model parameters of all clients are transferred to a server for global parameter aggregation. In the proposed FDRT, mobile devices transmitted local model parameters to their connected MEC servers to calculate a semi-global parameter, and the MEC server aggregated the semi-global parameters of their neighboring MEC servers to obtain the global parameter. Therefore, compared with traditional FL, FDRT can alleviate the pressure of wireless network communication, reduce the delay of global model parameter transmission, and improve the efficiency of model training.
Algorithm 1 describes the training process of the mobile device agent of the FDRT. The training process of MEC server agent was similar. The time complexity of FDRT is O(TH), where T is the number of training episodes, and H is the number of steps per episode. The complexity of FDRT mainly came from the training process of the DDQN agent and the D3QN agent, which was consistent with the complexity of other DQN-based methods [41].
Algorithm 1 FDRT’s Mobile Device Agent Training.
INPUT: maximum number of training episodes MAX_EPISODE, maximum number of training steps per episode MAX_STEP, learning rate $\alpha$, discount factor $\gamma$, exploration rate $\varepsilon$, capacity CAPACITY of the experience replay pool $M$, number of batch samples BATCH_SIZE, current network $Q$, target network $\hat{Q}$, target network update frequency $C$, number $F$ of training iterations in a round of federated learning
OUTPUT: parameter $\theta_t$ of the current network
// random(a, b) is a function that generates a random number in the range [a, b)
// randint(a, b) is a function that generates a random integer in the range [a, b)
// memory.isfull() indicates whether the experience replay pool $M$ is full
// $\theta^e$ denotes the global model parameters of the MEC server
Initialization: wireless communication model between MDs and BSs; task model and queue model of the MD; parameter $\theta$ of the current network $Q$; parameter $\hat{\theta} = \theta$ of the target network $\hat{Q}$; experience replay pool $M$.
1:  for episode = 0 : MAX_EPISODE by 1 do
2:    Get initial state $s_0$;
3:    for t = 0 : MAX_STEP by 1 do
4:      $x \leftarrow$ random(0, 1);
5:      if ($x > \varepsilon$) then
6:        $a_t \leftarrow$ randint(0, 2);
7:      else
8:        $a_t \leftarrow \arg\max_a Q(s_t, a; \theta_t)$;
9:      end if
10:      Perform action $a_t$ in the system model, get reward $r(s_t, a_t)$ and next state $s_{t+1}$;
11:      Put $I_t = (s_t, a_t, r(s_t, a_t), s_{t+1})$ in $M$;
12:      $s_t \leftarrow s_{t+1}$;
13:      if (not memory.isfull()) then
14:        continue;
15:      end if
16:      if (episode × MAX_STEP + t) mod F == 0 then
17:         Upload $\theta_t$ to the connected MEC server;
18:         $\theta_t \leftarrow \theta^e$;
19:      else
20:        Randomly choose a batch of samples from $M$ to update the parameter, $\theta_t \leftarrow \theta_t - \alpha \nabla L(\theta_t)$;
21:      end if
22:      if t mod C == 0 then
23:         $\hat{\theta}_t \leftarrow \theta_t$;
24:      end if
25:    end for
26:  end for
27:  return $\theta_t$;
Algorithm 2 describes the process of MEC server parameter aggregation of the FDRT. First, the MEC server obtained the parameters of all MDs in its coverage range and calculated the average value to generate the semi-global model parameters. Then, the MEC server obtained the semi-global model parameters of its neighboring MEC servers and calculated the global model parameter.
Algorithm 2 FDRT’s MEC Server Parameter Aggregation.
INPUT: mobile device agent parameters $\theta^u$
OUTPUT: global model parameter $\theta^e$
// $u \in U^e$ means that MD $u$ is within the coverage of MEC server $e$, where $U^e$ denotes the set of MDs within the coverage of $e$
// $e' \in E^e$ means that MEC server $e'$ is a server near $e$, where $E^e$ denotes the set of servers near $e$
// get_MD_num($e$) denotes the number of MDs within the range of MEC server $e$
1:   $\theta_{temp}^e \leftarrow 0$;
2:  for $u$ in $U^e$ do
3:       $\theta_{temp}^e \leftarrow \theta_{temp}^e + \theta^u$;
4:  end for
5:   $\theta_{temp}^e \leftarrow \theta_{temp}^e / |U^e|$;
6:   count ← get_MD_num($e$);
7:   $\theta^e \leftarrow$ count × $\theta_{temp}^e$;
8:  for $e'$ in $E^e$ do
9:      if get_MD_num($e'$) > 0 then
10:          $\theta^e \leftarrow \theta^e +$ get_MD_num($e'$) × $\theta_{temp}^{e'}$;
11:          count ← count + get_MD_num($e'$);
12:      end if
13:  end for
14:   $\theta^e \leftarrow \theta^e /$ count;
15:  return $\theta^e$;
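A Python rendering of Algorithm 2 might look as follows. The covered_mds attribute and the semi_global_params() accessor are our own assumptions about how a server object exposes its covered devices and its neighbors’ semi-global models; parameters are assumed to be PyTorch state_dict-style mappings of tensors.

```python
def aggregate_global_model(server, neighbor_servers):
    """Algorithm 2: average the MD parameters within one BS into a semi-global model,
    then take a device-count-weighted average over this server and its neighbors."""
    md_params = [md.agent.state_dict() for md in server.covered_mds]
    semi = {k: sum(p[k] for p in md_params) / len(md_params) for k in md_params[0]}

    count = len(server.covered_mds)
    global_params = {k: count * v for k, v in semi.items()}
    for nb in neighbor_servers:
        n_nb = len(nb.covered_mds)
        if n_nb == 0:                          # skip neighbors that serve no MDs
            continue
        nb_semi = nb.semi_global_params()      # the neighbor's semi-global model
        global_params = {k: global_params[k] + n_nb * nb_semi[k] for k in global_params}
        count += n_nb
    return {k: v / count for k, v in global_params.items()}
```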

6. Experimental Results

6.1. Comparison Strategies

We conducted two sets of simulation experiments to evaluate the performance of the proposed DRT and the FDRT, respectively. To verify the effectiveness of DRT, we compared it with six baseline strategies and two state-of-the-art offloading strategies. To verify the training efficiency of FDRT, we compared it with DRT and a traditional federated learning method (TFL) [11]. As this work aimed to minimize the system latency, we mainly evaluated the reward value and the task execution delay under different strategies. The reward value could not only reflect the system latency but also reflect the convergence trend of the agent.
The baseline strategies are denoted as MDL, MDO, MDR, MSL, MSC, and MSR. MDL represents the strategy that all computing tasks are executed on MDs locally; MDO represents the strategy that all MDs offload their tasks to MEC servers to decide the specific execution location; MDR represents the strategy that all MDs randomly select tasks to compute locally or offload tasks; MSL represents the strategy that all MEC servers execute tasks locally; MSC represents the strategy that all MEC servers offload tasks to the central cloud; and MSR represents the strategy that all MEC servers randomly select tasks to compute locally, offload tasks to a neighboring MEC server, or offload tasks to the central cloud.
In addition, the two state-of-the-art computing task offloading strategies are the single MD agent (SMDA) method [27] and the one MEC server agent (OMSA) method [26]. SMDA designs a DDQL-based agent deployed on MDs to make offloading decisions based on the task information generated in the current time slot, the task queue information of the MD and its connected MEC server, and the network conditions of both. SMDA adopts the collaborative training of FL and blockchain to learn the optimal offloading strategy, which is consistent with the goal of our strategy, that is, to minimize the system latency. OMSA designs a DDQN-based agent deployed on the MEC server, which finds the offloading decision through DRL training to solve an MINLP optimization problem based on the task information, the task queue information of MDs and all MEC servers, and the network status of MDs and their connected MEC servers. The optimization objective of OMSA is to minimize the energy consumption under delay constraints.

6.2. Simulation Setting

We simulated a collaborative cloud–edge–end MEC system, which consisted of multiple BSs and multiple MDs. Figure 4 shows the topology of the MEC servers. In the simulation system, the number of MEC servers was set to six. The shaded circles indicate that there are MDs within the range of the corresponding BSs, and the blank circles indicate that there are none. The number of MDs in each BS was set to five. In the training process of the agents, the maximum number of episodes and the maximum number of steps per episode were both set to 200.
Table 3 shows the model parameters and their simulation values. We used the parameters of Huawei’s 5G AAU5619 wireless product (Huawei Technologies Co., Ltd., Shenzhen, China) to simulate the BS transmitter and built the wireless communication model between MDs and BSs on this basis. In the MEC system, the wired communication model included the wired communication between BSs and between BSs and the central cloud. The computing model included the task model of MDs and the CPU frequencies of MDs and MEC servers. Both the mobile device agent and the MEC server agent used the Adam optimizer and the mean square error loss function. The current network and target network of the mobile device agent were four-layer fully connected neural networks, in which the numbers of hidden layer neurons were 128 and 64, respectively. The current network and target network of the MEC server agent consisted of a three-layer fully connected neural network followed by a one-layer two-branch network, in which the number of hidden layer neurons was 128.
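Following this description, the mobile device agent’s current (and target) network can be read as a small fully connected model; the sketch below is one plausible PyTorch construction, where the input size 5 matches the state in Equation (9) and the output size 2 matches the action set in Equation (10), while the remaining details are our assumptions.

```python
import torch.nn as nn

def build_md_agent_network():
    """Four-layer fully connected network for the mobile device agent:
    5 state features -> 128 -> 64 hidden units -> 2 action values."""
    return nn.Sequential(
        nn.Linear(5, 128), nn.ReLU(),
        nn.Linear(128, 64), nn.ReLU(),
        nn.Linear(64, 2),
    )
```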
We designed a simple movement model to simulate the movement of MDs, which was used to generate the distance between BSs and MDs. The MD initially moved randomly in a clockwise or counterclockwise direction along the connection line between the MEC servers in Figure 4, and there was a probability of 0.00001 that the MD would turn around and move in the opposite direction during the movement. The moving speed of the MD obeyed a uniform distribution of U(29.6, 30.4) m/s.
The simulation experiments were performed on a server equipped with two Intel(R) Xeon(R) Gold 6248 processors at 2.50 GHz (Intel Corporation, Santa Clara, CA, USA). All experimental code was implemented in Python, version 3.6.8, with the PyTorch library, version 1.8.2.

6.3. Performance of DRT

We used the reward values to evaluate the performances of different computation offloading strategies. Since the change trend of the task execution delay was exactly opposite to that of the reward, the larger the reward value was, the smaller the task execution delay was.
Figure 5 shows the comparison results of the reward for the DRT, MDL, MDO, and MDR strategies during the training process. The ordinate represents the average of the cumulative reward sum of all MDs per episode. As the number of training episodes increased, the rewards of all strategies tended to converge. DRT achieved the highest reward compared with the other baseline strategies, which means DRT could learn the best offloading decision with minimal latency for MDs. In the first 50 episodes, the rewards of MDO and MDR increased gradually. This was because both MDO and MDR offloaded tasks to the MEC server for further decision making by the MEC server agent, i.e., the increasing trends reflected the training process of the D3QN-based MEC server agent. MDO obtained a reward close to that of DRT. This was because, when the wireless network was stable most of the time, it was more beneficial for the mobile device agent to offload tasks to the more powerful MEC server to reduce the execution delay of all tasks. Due to the limited computing power of MDs, local computing had a limited effect on reducing the overall task execution delay. For the same reason, MDL received the lowest reward and the worst task execution delay.
Figure 6 shows the comparison results of the reward for the DRT, MSL, MSC, and MSR strategies. DRT still achieved the highest reward compared with the other baseline strategies, which means DRT could fully utilize the computing resources of the MEC servers and the central cloud to make the best offloading decision. In the first 25 episodes, the rewards of MSL, MSC, and MSR all increased gradually. The reason was that the DDQN-based mobile device agent was learning the offloading strategy, which also reflected the convergence behavior of the MD agent. The reward obtained by MSL was close to that of DRT. Although DRT considered the computing resources of neighboring MEC servers when offloading tasks, these resources were also used by other MEC servers, so the overall improvement over MSL was slight. However, compared with MSC and MSR, DRT showed significant performance improvements.
Table 4 shows the comparison results of the average task delay between DRT and baseline strategies. The results of the single episode were selected from the multiple episodes before convergence, and the results of multiple episodes were calculated from the convergent episode. DRT reduced the average task execution delay by 41.2%, 5.9%, 17.2%, 2.8%, 31.7%, and 24.2% compared with MDL, MDO, MDR, MSL, MSC, and MSR, respectively. This was due to the fact that DRT adopted two types of agents to jointly learn the optimal offloading strategy, which could more comprehensively evaluate the entire MEC system resources to minimize the system latency.
Figure 7 shows the comparison results of the reward for DRT, SMDA, and OMSA during the training process. With the increase in training episodes, both SMDA and OMSA tended to converge, and DRT obtained the highest reward. Compared with SMDA and OMSA, DRT could fully utilize the computing resources of MDs, MEC servers, and the central cloud. The reason for the large fluctuation of OMSA was that it filled the experience replay pool in the first episode, and the agent had already started to learn and update the network parameters in this episode, which caused overfitting. Therefore, when OMSA encountered a large number of states different from previous ones, the learned strategies did not necessarily perform well. When the reward during episodes 0–100 is zoomed in on, it can be observed that SMDA was close to DRT, but DRT converged faster than SMDA and OMSA. This was because SMDA applied FL to achieve decentralized training, which effectively reduced the communication overhead. However, DRT divided the computation offloading decision problem into two sub-decision problems, which were solved by the two types of agents, respectively. In this way, the state and action spaces of the DRT agents were much smaller than those of SMDA and OMSA. This could not only enable the agents to learn the optimal strategy faster but also reduce unnecessary data transmission between MDs and MEC servers. Since SMDA and OMSA used a single type of agent, they needed to obtain MEC server information or mobile device information, which increased the burden on wireless networks and led to a higher decision delay.
Table 5 shows the comparison of the average task delay between DRT, SMDA, and OMSA. DRT reduced the average task execution delay by 8.0% and 50.3% compared with SMDA and OMSA, respectively. When the FDRT optimization method was applied to DRT, the task execution delay was further reduced, as presented in Section 6.4.

6.4. Performance of FDRT

Since FDRT combined FL with DRT to optimize the training of the multi-agent model, we compared FDRT with DRT and TFL. DRT used the original DRL method to train the model, and TFL [11] used a traditional federated learning method to train the mobile device agents.
Figure 8 shows the comparison of the loss of FDRT, DRT, and TFL during training. The ordinate denotes the average cumulative loss of all mobile device agents in each episode. The loss values remained at 0 at first and then rose sharply, because the experience replay pools of the mobile device agents were not yet full and the agents had not started learning. Both FDRT and TFL accelerated the convergence of the mobile device agents, and the losses of the three methods eventually stabilized. As the zoomed-in sub-figure shows, DRT converged at about 60 episodes, while FDRT converged at about 23 episodes, saving 61.7% of the training time of the mobile device agents.
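The flat-then-rising loss curve described above can be illustrated with a small sketch of the replay-buffer warm-up: no learning (and hence zero reported loss) occurs until the experience pool is full. The pool capacity and batch size below follow CAPACITY^U = 2000 and BATCH_SIZE^U = 64 from Table 3; everything else is an illustrative assumption.

```python
# Sketch of the replay-pool warm-up: the agent reports zero loss until its
# experience pool is full, then starts sampling minibatches and learning.
# CAPACITY^U = 2000 and BATCH_SIZE^U = 64 are taken from Table 3.
from collections import deque
import random

class ReplayPool:
    def __init__(self, capacity=2000):
        self.capacity = capacity
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):           # transition = (s, a, r, s_next, done)
        self.buffer.append(transition)

    def ready(self):                      # learning starts only once the pool is full
        return len(self.buffer) >= self.capacity

    def sample(self, batch_size=64):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

def training_step(pool, learn_fn):
    """Return 0.0 while warming up (the flat segment in Figure 8), else the loss."""
    if not pool.ready():
        return 0.0
    return learn_fn(pool.sample())
```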
Figure 9 shows the comparison of the average task execution delay of FDRT, DRT, and TFL. The strategies learned by FDRT and TFL were basically the same as that of DRT; FDRT sped up the training convergence of the mobile device agents while still maintaining the optimal learned strategy. In addition, FDRT aggregated the network parameters of fewer mobile device agents at each MEC server and thus reduced the network transmission of parameters relative to DRT and TFL. Compared with DRT, FDRT reduced the average task execution delay by 2.8%, indicating that the strategy learned by FDRT was better. This was because the semi-global parameter aggregation mechanism of FDRT enabled each mobile device agent to learn more state information, so that it was more likely to take the optimal action when the state of the MD changed.
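As a rough illustration of the difference between global aggregation (as in TFL [11]) and the per-server, semi-global aggregation used by FDRT, the sketch below averages mobile device agent parameters only within each MEC server's coverage. The grouping and equal weighting are simplifying assumptions for illustration, not the paper's exact aggregation rule.

```python
# Rough illustration: global FedAvg-style aggregation vs. per-server
# ("semi-global") aggregation. Grouping and equal weights are assumptions.
import numpy as np

def fedavg(param_sets, weights=None):
    """Weighted average of flat parameter vectors, FedAvg-style [11]."""
    weights = np.ones(len(param_sets)) if weights is None else np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return sum(w * p for w, p in zip(weights, param_sets))

def global_aggregation(md_params):
    """TFL-style: one aggregator averages the parameters of every MD agent."""
    return fedavg(list(md_params.values()))

def semi_global_aggregation(md_params, coverage):
    """FDRT-style: each MEC server averages only the MD agents it serves,
    so fewer parameters have to cross the wireless network per round."""
    return {server: fedavg([md_params[md] for md in mds])
            for server, mds in coverage.items()}

# Toy usage: 4 mobile device agents split across 2 MEC servers
md_params = {f"md{i}": np.random.randn(8) for i in range(4)}
coverage = {"mec0": ["md0", "md1"], "mec1": ["md2", "md3"]}
print(global_aggregation(md_params).shape)                          # (8,)
print(semi_global_aggregation(md_params, coverage)["mec0"].shape)   # (8,)
```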

6.5. Discussion

The superior performance of this work over existing methods mainly comes from the following aspects. Compared with existing DRL-based methods with a single type of agent, DRT used two types of agents, for MDs and MEC servers, to collaboratively learn the offloading strategy. Hence, DRT could find the optimal decision by comprehensively considering all MEC server resources and network conditions in the MEC environment to minimize the system latency. The DDQN agent and the D3QN agent also considered the application characteristics of MDs and MEC servers to reduce the high dimensionality of the MDP. In addition, FDRT applied FL to DRT to accelerate the multi-agent collaborative training while effectively protecting user data privacy. The decentralized FL training method distributed the model aggregation among the MEC servers to relieve the computing pressure on resource-constrained MDs. The semi-global aggregation mechanism reduced the parameter transmission and network overhead, which further shortened the decision delay. These optimizations are very effective for latency-sensitive and computationally intensive applications that must quickly learn a computation offloading strategy in a time-varying MEC environment. Therefore, this work is well suited to real-world applications with strict task execution delay requirements, such as autonomous driving, online interactive gaming, virtual reality, and video streaming analysis.
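For reference, the sketch below shows what the D3QN agent's value function could look like: the standard dueling head of [40] combined with the double-DQN target of [39]. The layer sizes and state/action dimensions are illustrative; only the discount factor 0.9 is taken from Table 3.

```python
# Sketch of a dueling double DQN (D3QN) head: dueling architecture [40]
# plus the double-DQN target [39]. Layer sizes are illustrative; the
# discount factor 0.9 follows Table 3.
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # state-value stream V(s)
        self.advantage = nn.Linear(hidden, n_actions)  # advantage stream A(s, a)

    def forward(self, state):
        h = self.feature(state)
        v, a = self.value(h), self.advantage(h)
        # Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
        return v + a - a.mean(dim=1, keepdim=True)

def double_dqn_target(reward, next_state, done, online_net, target_net, gamma=0.9):
    """Action chosen by the online network, evaluated by the target network."""
    with torch.no_grad():
        best_action = online_net(next_state).argmax(dim=1, keepdim=True)
        next_q = target_net(next_state).gather(1, best_action).squeeze(1)
    return reward + gamma * (1.0 - done) * next_q
```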
Since this work focuses on system latency, one of its main limitations is that it may not fully satisfy application scenarios with strict energy consumption requirements. Incorporating energy consumption into the current model requires considering the heterogeneity of different resources to formulate a multi-objective optimization problem, and achieving a balanced offloading strategy among objectives such as energy consumption, task delay, and system utilization needs further study; this will be one of our future work directions. In addition, this work adopts a binary computation offloading model. For many large-scale applications, a partial computation offloading strategy is more suitable, and we will also study partial offloading in the future.

7. Conclusions

In this paper, we proposed a computation offloading method based on DRL and FL with the goal of minimizing the task execution delay in a multi-base-station, multi-mobile-device MEC network with cloud–edge–end collaboration. First, we proposed a multi-agent computation offloading strategy that uses a DDQN-based MD agent and a D3QN-based MEC server agent to collaboratively make decisions based on their own task information, resources, and time-varying network conditions. Second, we proposed a new FL-based parameter aggregation model that greatly reduced the communication overhead and improved the training efficiency of the multi-agent DRL model while avoiding user data privacy disclosure. Extensive simulation experiments validated the advantages of the proposed method over several baseline and state-of-the-art computation offloading methods in terms of task execution delay and training efficiency. In future work, we will extend the scope of this work by jointly considering an energy consumption model to maximize the system utility under energy and latency constraints. For large-scale real-world applications, applying the collaboration of DRL and FL to the partial computation offloading problem is also a future research direction.

Author Contributions

Conceptualization, S.L. and S.Y.; methodology, S.L.; software, S.Y.; validation, S.Y. and H.Z.; formal analysis, S.Y.; investigation, S.Y.; resources, S.Y.; data curation, H.Z.; writing—original draft preparation, S.L. and S.Y.; writing—review and editing, S.L. and H.Z.; visualization, S.Y.; supervision, S.L. and W.W.; project administration, W.W.; funding acquisition, S.L. and W.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (No. 62002279 and No. 61972311), Natural Science Basic Research Program of Shaanxi (Program No. 2020JQ-077), and Shandong Provincial Natural Science Foundation (No. ZR2021LZH009).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available in Table 3, Table 4 and Table 5 and Figure 5, Figure 6, Figure 7, Figure 8 and Figure 9.

Acknowledgments

We thank the reviewers for their constructive suggestions to improve the quality of this paper.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Wang, J.; Liu, J.; Kato, N. Networking and communications in autonomous driving: A survey. IEEE Commun. Surv. Tutor. 2018, 21, 1243–1274. [Google Scholar] [CrossRef]
  2. Xiong, J.; Hsiang, E.L.; He, Z.; Zhan, T.; Wu, S.T. Augmented reality and virtual reality displays: Emerging technologies and future perspectives. Light Sci. Appl. 2021, 10, 216. [Google Scholar] [CrossRef] [PubMed]
  3. Esteves, J.J.A.; Boubendir, A.; Guillemin, F.; Sens, P. Optimized network slicing proof-of-concept with interactive gaming use case. In Proceedings of the 2020 23rd Conference on Innovation in Clouds, Internet and Networks and Workshops (ICIN), Paris, France, 24–27 February 2020; pp. 150–152. [Google Scholar]
  4. Al-Abbasi, A.O.; Aggarwal, V.; Ra, M.R. Multi-tier caching analysis in CDN-based over-the-top video streaming systems. IEEE/ACM Trans. Netw. 2019, 27, 835–847. [Google Scholar] [CrossRef] [Green Version]
  5. Hu, Y.C.; Patel, M.; Sabella, D.; Sprecher, N.; Young, V. Mobile edge computing—A key technology towards 5G. ETSI White Pap. 2015, 11, 1–16. [Google Scholar]
  6. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  7. Zhang, W.; Zhang, Z.; Zeadally, S.; Chao, H.C.; Leung, V.C. MASM: A multiple-algorithm service model for energy-delay optimization in edge artificial intelligence. IEEE Trans. Ind. Inform. 2019, 15, 4216–4224. [Google Scholar] [CrossRef]
  8. Yun, J.; Goh, Y.; Yoo, W.; Chung, J.M. 5G multi-RAT URLLC and eMBB dynamic task offloading with MEC resource allocation using distributed deep reinforcement learning. IEEE Internet Things J. 2022, 9, 20733–20749. [Google Scholar] [CrossRef]
  9. Zhou, H.; Zhang, Z.; Wu, Y.; Dong, X.; Leung, V.C.M. Energy efficient joint computation offloading and service caching for mobile edge computing: A deep reinforcement learning approach. IEEE Trans. Green Commun. Netw. 2022. [Google Scholar] [CrossRef]
  10. Zhang, X.; Wu, W.; Liu, S.; Wang, J. An efficient computation offloading and resource allocation algorithm in RIS empowered MEC. Comput. Commun. 2023, 197, 113–123. [Google Scholar] [CrossRef]
  11. McMahan, H.B.; Moore, E.; Ramage, D.; Hampson, S.; Arcas, B.A. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), Ft. Lauderdale, FL, USA, 20–22 April 2016; pp. 1273–1282. [Google Scholar]
  12. Hazra, A.; Donta, P.K.; Amgoth, T.; Dustdar, S. Cooperative transmission scheduling and computation offloading with collaboration of fog and cloud for industrial IoT applications. IEEE Internet Things J. 2022. [Google Scholar] [CrossRef]
  13. Chen, J.; Deng, Q.; Yang, X. Non-cooperative game algorithms for computation offloading in mobile edge computing environments. J. Parallel Distrib. Comput. 2023, 172, 18–31. [Google Scholar] [CrossRef]
  14. Dorigo, M. Optimization, Learning and Natural Algorithms. Ph.D. Thesis, Politecnico di Milano, Milan, Italy, 1992. [Google Scholar]
  15. Poli, R.; Kennedy, J.; Blackwell, T. Particle swarm optimization. Swarm Intell. 2007, 1, 33–57. [Google Scholar] [CrossRef]
  16. Hussein, M.K.; Mousa, M.H. Efficient task offloading for IoT-based applications in fog computing using ant colony optimization. IEEE Access 2020, 8, 37191–37201. [Google Scholar] [CrossRef]
  17. Rodrigues, T.G.; Suto, K.; Nishiyama, H.; Kato, N. A PSO model with VM migration and transmission power control for low Service Delay in the multiple cloudlets ECC scenario. In Proceedings of the 2017 IEEE International Conference on Communications (ICC), Paris, France, 21–25 May 2017; pp. 1–6. [Google Scholar]
  18. Yadav, V.; Natesha, B.V.; Guddeti, R.M.R. Ga-pso: Service allocation in fog computing environment using hybrid bio-inspired algorithm. In Proceedings of the TENCON 2019–2019 IEEE Region 10 Conference (TENCON), Kochi, India, 17–20 October 2019; pp. 1280–1285. [Google Scholar]
  19. Wang, Z.; Zhao, Z.; Min, G.; Huang, X.; Ni, Q.; Wang, R. User mobility aware task assignment for mobile edge computing. Future Gener. Comput. Syst. 2018, 85, 1–8. [Google Scholar] [CrossRef] [Green Version]
  20. Kamal, M.B.; Javaid, N.; Naqvi, S.A.A.; Butt, H.; Saif, T.; Kamal, M.D. Heuristic min-conflicts optimizing technique for load balancing on fog computing. In Proceedings of the International Conference on Intelligent Networking and Collaborative Systems, Bratislava, Slovakia, 7–9 September 2018; Springer: Cham, Switzerland, 2018; pp. 207–219. [Google Scholar]
  21. Rahbari, D.; Nickray, M. Task offloading in mobile fog computing by classification and regression tree. Peer-to-Peer Netw. Appl. 2020, 13, 104–122. [Google Scholar] [CrossRef]
  22. Bashir, H.; Lee, S.; Kim, K.H. Resource allocation through logistic regression and multicriteria decision making method in IoT fog computing. Trans. Emerg. Telecommun. Technol. 2022, 33, e3824. [Google Scholar] [CrossRef]
  23. Ullah, I.; Youn, H.Y. Task classification and scheduling based on K-means clustering for edge computing. Wirel. Pers. Commun. 2020, 113, 2611–2624. [Google Scholar] [CrossRef]
  24. Rani, D.S.; Pounambal, M. Deep learning based dynamic task offloading in mobile cloudlet environments. Evol. Intell. 2021, 14, 499–507. [Google Scholar] [CrossRef]
  25. Huang, L.; Bi, S.; Zhang, Y.J.A. Deep reinforcement learning for online computation offloading in wireless powered mobile-edge computing networks. IEEE Trans. Mob. Comput. 2019, 19, 2581–2593. [Google Scholar] [CrossRef] [Green Version]
  26. Zhou, H.; Jiang, K.; Liu, X.; Li, X.; Leung, V.C. Deep reinforcement learning for energy-efficient computation offloading in mobile-edge computing. IEEE Internet Things J. 2022, 9, 1517–1530. [Google Scholar] [CrossRef]
  27. Zhao, N.; Wu, H.; Yu, F.R.; Wang, L.; Zhang, W.; Leung, V.C. Deep-reinforcement-learning-based latency minimization in edge intelligence over vehicular networks. IEEE Internet Things J. 2022, 9, 1300–1312. [Google Scholar] [CrossRef]
  28. Li, Y.; Qi, F.; Wang, Z.; Yu, X.; Shao, S. Distributed edge computing offloading algorithm based on deep reinforcement learning. IEEE Access 2020, 8, 85204–85215. [Google Scholar] [CrossRef]
  29. Koo, S.; Lim, Y. A cluster-based optimal computation offloading decision mechanism using RL in the IIoT field. Appl. Sci. 2022, 12, 384. [Google Scholar] [CrossRef]
  30. Wei, Z.; Zhao, B.; Su, J.; Lu, X. Dynamic edge computation offloading for Internet of Things with energy harvesting: A learning method. IEEE Internet Things J. 2018, 6, 4436–4447. [Google Scholar] [CrossRef]
  31. Hou, W.; Wen, H.; Song, H.; Lei, W.; Zhang, W. Multiagent deep reinforcement learning for task offloading and resource allocation in Cybertwin-based networks. IEEE Internet Things J. 2021, 8, 16256–16268. [Google Scholar] [CrossRef]
  32. Ren, J.; Yu, G.; Ding, G. Accelerating DNN training in wireless federated edge learning systems. IEEE J. Sel. Areas Commun. 2021, 39, 219–232. [Google Scholar] [CrossRef]
  33. Liu, S.; Yu, G.; Chen, X.; Bennis, M. Joint user association and resource allocation for wireless hierarchical federated learning with IID and non-IID data. IEEE Trans. Wirel. Commun. 2022, 21, 7852–7866. [Google Scholar] [CrossRef]
  34. Wang, X.; Wang, C.; Li, X.; Leung, V.C.; Taleb, T. Federated deep reinforcement learning for Internet of Things with decentralized cooperative edge caching. IEEE Internet Things J. 2020, 7, 9441–9455. [Google Scholar] [CrossRef]
  35. Sater, R.A.; Hamza, A.B. A federated learning approach to anomaly detection in smart buildings. ACM Trans. Internet Things 2021, 2, 1–23. [Google Scholar] [CrossRef]
  36. Sheng, S.; Chen, P.; Chen, Z.; Wu, L.; Yao, Y. Deep reinforcement learning-based task scheduling in IoT edge computing. Sensors 2021, 21, 1666. [Google Scholar] [CrossRef]
  37. Samarakoon, S.; Bennis, M.; Saad, W.; Debbah, M. Distributed federated learning for ultra-reliable low-latency vehicular communications. IEEE Trans. Commun. 2019, 68, 1146–1159. [Google Scholar] [CrossRef] [Green Version]
  38. Ren, J.; Wang, H.; Hou, T.; Zheng, S.; Tang, C. Federated learning-based computation offloading optimization in edge computing-supported internet of things. IEEE Access 2019, 7, 69194–69201. [Google Scholar] [CrossRef]
  39. Hasselt, H.V.; Guez, A.; Silver, D. Deep reinforcement learning with double Q-learning. In Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI), Phoenix, AZ, USA, 12–17 February 2016; pp. 2094–2100. [Google Scholar]
  40. Wang, Z.; Schaul, T.; Hessel, M.; Hasselt, H.V.; Lanctot, M.; Freitas, N.D. Dueling network architectures for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML), New York, NY, USA, 19–24 June 2016; Volume 48, pp. 1995–2003. [Google Scholar]
  41. Jin, C.; Zhu, Z.; Bubeck, S.; Jordan, M.I. Is Q-learning provably efficient? In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montréal, QC, Canada, 2–8 December 2018; MIT Press: Cambridge, MA, USA, 2018; pp. 4868–4878. [Google Scholar]
Figure 1. MEC system model of cloud–edge–end collaboration.
Figure 2. Training process of the DDQN-based mobile device agent.
Figure 3. Federated learning system workflow for multi-agent training.
Figure 4. MEC server topology diagram.
Figure 5. Comparison of reward with different strategies on MDs during training.
Figure 6. Comparison of reward with different strategies on MEC servers during training.
Figure 7. Comparison of reward with SMDA and OMSA.
Figure 8. Comparison of losses of mobile device agents with FDRT, DRT, and TFL.
Figure 9. Comparison of average task execution delay of FDRT, DRT, and TFL.
Table 1. Comparison with existing works.
Existing Works | Main Method | Device Mobility | Data Privacy Protection | Network Training Optimization | Optimization Goal | Complexity
[12] | Heuristic, MIP | Static | No | — | Energy consumption and latency | O(N^3)
[13] | Game theory | Dynamic | No | — | Energy consumption and latency | O(2^N)
[8] | Genetic algorithm, multi-actor DQN with a single type of agent | Dynamic | No | Original DQN training | Energy consumption and system utility | O(N)
[9] | MINLP, DDPG with a single type of agent | Dynamic | No | Original DQN training | System energy consumption | O(TH)
[29] | Q-learning with a single type of agent | Static | No | Original Q-learning training with device-to-device communication | Energy consumption and latency | O(TH)
[26] | MINLP, DDQN with a single type of agent | Dynamic | No | Original DDQN training | System energy consumption | O(TH)
[27] | Collaboration of FL, blockchain, and DDQL with a single type of agent | Dynamic | Yes | Decentralized FL training | System latency | O(TH)
Ours | Collaboration of FL with two types of agents (DDQN and D3QN) | Dynamic | Yes | Decentralized FL training with semi-global aggregation | System latency | O(TH)
Table 2. Main used notations.
Symbols | Definition
h_{n,m}(t) | Wireless channel gain between mobile device u_n and MEC server e_m in the t-th time slot
v_{n,m}(t) | Transmission rate between u_n and e_m in the t-th time slot
d_n(t) | Computing task generated by u_n at the beginning of the t-th time slot
s_n(t) | Data size of d_n(t)
c_n(t) | Number of CPU cycles required to process 1 bit of data of d_n(t)
q_n^U(t) | Task queue of u_n
b_n^U(t) | Number of tasks in q_n^U(t)
c_n^U(t) | Total number of CPU cycles required to compute the tasks in q_n^U(t)
q_m^E(t) | Task queue of e_m
b_m^E(t) | Number of tasks in q_m^E(t)
c_m^E(t) | Total number of CPU cycles required to compute the tasks in q_m^E(t)
l_n^L(t) | Execution delay of d_n(t) on u_n
l_n^E(t) | Execution delay of d_n(t) on e_m
l_n^{E'}(t) | Execution delay of d_n(t) on a nearby MEC server e_{m'}
l_n^C(t) | Execution delay of d_n(t) on the central cloud
s_n^U(t) | State of u_n in the t-th time slot
a_n^U(t) | Action of u_n in the t-th time slot
r^U(s_n^U(t), a_n^U(t)) | Reward of u_n for taking a_n^U(t) under s_n^U(t)
s_{n,m}^E(t) | State of e_m for d_n(t) in the t-th time slot
a_{n,m}^E(t) | Action of e_m for d_n(t) in the t-th time slot
r^E(s_{n,m}^E(t), a_{n,m}^E(t)) | Reward of e_m for taking a_{n,m}^E(t) under s_{n,m}^E(t)
time_{n,m}(t) | Subsequent execution time of d_n(t) on e_m
Table 3. All model parameters and simulation values.
Models | Parameters | Simulation Values
Wireless communication model | A_d | 24.5 dB
 | f_c | 2595 MHz
 | D | 200 m
 | c_0 | 3 × 10^8 m/s
 | β_{n,m}(t) | Rayleigh(1)
 | d_e | 2.8
 | B | 20 MHz
 | p_0 | 23 dBm
 | N_0 | −90 dBm
Wired communication model | v_{m,m'}^E | U(1250, 1350) Mbits/s
 | v_m^{E,C} | U(15, 25) Mbits/s
Computing model | s_n | p(0.3) Mbits
 | c_n | U(4, 15) cycles/bit
 | f^U | 450 MHz
 | f^E | 1500 MHz
Mobile device agent | α^U | 0.0004
 | γ^U | 0.9
 | ε^U | 0.9
 | CAPACITY^U | 2000
 | BATCH_SIZE^U | 64
 | C^U | 100
MEC server agent | α^E | 0.0003
 | γ^E | 0.9
 | ε^E | 0.9
 | CAPACITY^E | 1000
 | BATCH_SIZE^E | 32
 | C^E | 100
Table 4. Average task execution delay compared with baseline strategies (×10^−3 ms).
Strategies | Single Episode | Multiple Episodes
DRT | 5.125 | 4.968
MDL | 8.468 | 8.445
MDO | 5.889 | 5.280
MDR | 6.056 | 6.000
MSL | 5.510 | 5.113
MSC | 8.064 | 7.271
MSR | 6.833 | 6.550
Table 5. Average task execution delay compared with SMDA and OMSA (×10^−3 ms).
Strategies | Single Episode | Multiple Episodes
DRT | 5.125 | 4.968
SMDA | 5.549 | 5.401
OMSA | 10.992 | 9.993
