Article

FedRDR: Federated Reinforcement Distillation-Based Routing Algorithm in UAV-Assisted Networks for Communication Infrastructure Failures

1 School of Computer Science and Engineering, Northeastern University, Shenyang 110167, China
2 Department of Information and Communication Systems, Hohai University, Changzhou 213022, China
3 Software College, Northeastern University, Shenyang 110167, China
* Authors to whom correspondence should be addressed.
Drones 2024, 8(2), 49; https://doi.org/10.3390/drones8020049
Submission received: 20 December 2023 / Revised: 18 January 2024 / Accepted: 30 January 2024 / Published: 4 February 2024
(This article belongs to the Special Issue UAV-Assisted Internet of Things)

Abstract: Traditional Internet of Things (IoT) networks have limited coverage and may fail when natural disasters damage critical IoT devices, making it difficult for them to provide communication services. This paper therefore addresses the problem of re-establishing network communication services efficiently in the presence of such failure points. To this end, we construct a hierarchical multi-domain data transmission architecture for an emergency network in which unmanned aerial vehicles (UAVs) serve as the core communication devices. This architecture expands the functionality of UAVs as key network devices and provides a theoretical basis for their feasibility as intelligent network controllers and switches. Firstly, the UAV controllers perceive the network status and learn the spatio-temporal characteristics of air-to-ground network links. Secondly, an intra-domain routing algorithm based on federated reinforcement distillation (FedRDR) is developed, which enhances the generalization capability of the routing decision model by increasing the number of training samples. Simulation experiments show that the average communication data size between each domain controller and the server is approximately 45.3 KB when using the FedRDR algorithm. Compared with transmitting parameters through a federated reinforcement learning algorithm, FedRDR reduces the transmitted parameter size by approximately 29%. The FedRDR routing algorithm therefore facilitates knowledge transfer, accelerates the training of intelligent agents within each domain, and reduces communication costs in resource-constrained UAV network scenarios, giving it practical value.

1. Introduction

In emergency situations and under natural disaster conditions, traditional IoT communication infrastructure, including sensor facilities, may be damaged, leading to communication service disruptions. One urgent issue is how to transition quickly from a faulty state back to a normal state and ensure that ongoing communication services continue smoothly without being affected by network failures. After the infrastructure is damaged, drones carrying communication devices can communicate over wireless channels using wireless communication protocols, enabling the entire IoT system to restore its normal communication state. For example, in September 2023, certain regions in China were affected by Typhoon Dusuwei, resulting in floods and geological disasters; the relevant departments used drones, which played a crucial role in communication support, disaster reconnaissance, and material delivery in challenging rescue environments and complex weather conditions [1]. The authors of [2] addressed the inability to rescue disaster victims in remote areas when traditional emergency communications cannot be established in time; they therefore used drones to build a reliable communication network for emergency rescue operations.
Unmanned aerial vehicles (UAVs) can create collaborative air-to-ground networks above disaster areas, regions with infrastructure challenges, or densely populated communication hotspots. They support and enhance existing cellular infrastructure to connect previously unconnected networks [3,4]. Due to their high mobility, strong line-of-sight signal transmission, rapid deployment, and strong adaptability to various environments, UAVs play a critical role in the Internet of Things. For instance, after Hurricane Ida hit Louisiana, AT&T used UAV-mounted cells on wings (COWs) to restore LTE coverage for cellular users [5]. Moreover, He et al. utilized multiple UAVs to provide services to a wide range of users [6]. By implementing load balancing between adjacent UAVs and sharing resource consumption, the overall coverage time of the UAV network can be significantly extended. In addition, UAV cluster networks are a crucial component of air-to-ground collaborative networks and provide an essential foundation for achieving aerial networking and data transmission [7,8].
In the construction of a ground-based collaborative network for drone clusters, priority should be given to the privacy protection of user data. Federated learning (FL) provides an effective method for training machine learning models while ensuring user privacy. In an FL system, clients have various computing and communication resources [9]. In recent years, researchers have been devoted to improving the communication efficiency of federated learning. One commonly used method is gradient compression, which directly reduces the size of model updates. Another widely adopted method is collaborative learning, where clients share local model predictions instead of transmitting model updates, thus reducing communication costs. Collaborative learning is one mode of knowledge distillation (KD). For example, in [10], FL introduces KD to achieve efficient and low-cost information exchange, especially when dealing with heterogeneous models, with the aim of reducing communication expenses.
In the design of air-to-ground collaborative networking, drones mainly provide an access service to ground nodes. Because the service must cover a wide area while the communication range of a single drone is limited, we use multiple drones in multi-domain collaboration to achieve multi-domain coverage. At the same time, the drones are equipped with domain controllers, and relay communication in the collaborative network is achieved through these domain controllers. Managing distributed and heterogeneous nodes is challenging, and because network conditions change dynamically, link interruptions can easily occur, making transmission reliability difficult to ensure. This paper focuses on the application of air-to-ground collaborative networking and uses mobile edge computing technology to study network transmission strategies and routing mechanisms in highly dynamic networks. Based on the above network architecture, an intra-domain routing mechanism is designed. To intelligently perceive the network state within a domain, we design a quality of service (QoS)-aware routing mechanism based on deep reinforcement learning (DRL) to obtain the optimal forwarding path and improve the quality of service in the domain. To address the low intra-domain data volume and the poor generalization ability of local decision models, a federated strategy is proposed to aggregate multi-domain data and accelerate intra-domain training while protecting intra-domain privacy and avoiding the leakage of network information. Unlike existing UAV rescue networks, we adopt a hierarchical multi-domain air-ground cooperative network architecture based on SDN and use high-performance UAVs as domain controllers, which simplifies management and improves the scalability of large-scale networks. Moreover, our enhanced distillation learning algorithm serves as the intra-domain routing algorithm, improving the communication performance of the network and allowing it to adapt to the disaster relief network more quickly.
The main contributions of this paper are as follows:
(1) First, we design an air–ground collaborative network architecture for UAV-assisted networks. This architecture utilizes SDN’s hierarchical multi-domain networking technology with a super domain controller responsible for obtaining global network state information and cross-domain business requests. The domain controller is responsible for collecting network state information within its domain and aggregating both local and cross-domain business requests, ultimately enhancing the network’s information processing capabilities and reducing communication latency. This architecture is the basis of the design of the intra-domain routing algorithm proposed in this paper.
(2) In order to achieve the real-time awareness of the network status, we introduce unmanned aerial vehicles (UAVs) for communication transmission, enabling the entire IoT system to handle tasks more flexibly. We adopt a federated reinforcement learning approach where the server sends the obtained model parameters to each domain, and the intelligent agent in the local domain controller is further trained and updated based on these parameters. This approach allows for the use of independent model parameters within each domain while protecting the privacy of network states within the domain.
(3) Considering the challenges of energy consumption and limited computational resources in UAVs, as well as the need to reduce the burden of handling large parameters, we introduce knowledge distillation and develop a routing algorithm based on federated reinforcement distillation (FedRDR). By combining federated reinforcement learning with knowledge distillation, we can address the issue of transmission overload caused by the presence of a large amount of parameter data in intelligent agent models. This enables us to achieve real-time awareness of the network status and adjust the data forwarding rules accordingly.
(4) To validate our method, we construct a simulation environment for a drone network using the Mininet network simulator and the Ryu controller and conduct the algorithm analysis in Python. Our experiments demonstrate that FedRDR outperforms other federated learning-based routing algorithms. Specifically, compared to the federated reinforcement learning routing algorithm, FedRDR reduces parameter transmission by approximately 29% and significantly accelerates the convergence speed of the model.

2. Related Work

In recent years, natural disasters have had a significant impact on infrastructure, including communication base stations. Essential resources become scarce, and disrupted communication links make the delivery of critical services to individuals challenging [11]. Reconstructing a fully operational communication backbone network after a disaster, which is crucial for restarting telecommunications and internet services, can take anywhere from several days to weeks [12]. It is therefore imperative to harness emerging artificial intelligence technologies to restore such emergency communication networks.
UAVs are highly effective in search and rescue missions due to their rapid deployment capabilities. By deploying drones in the air, it is possible to establish a local area network (LAN) and backbone network, even in locations without existing network infrastructure. This approach significantly reduces the data transmission time compared to relying solely on satellite links [13]. Once a mission is completed, the retrieval of the drones minimizes the consumption of resources required for network construction. In the context of drone control strategies, G. B. Tarekegn et al. proposed a dynamic and mobile control strategy utilizing DRL [14]. The objective was to maximize communication coverage and network connectivity for multiple real-time users within a specified timeframe. Additionally, Zhang et al. studied a communication system assisted by multiple drones equipped with base stations (BSs) to minimize the number of drones required and improve coverage by optimizing the three-dimensional positions of drones [15], user clusters, and frequency band allocation. The above research relies on the perspective of ground users and fully leverages their geographical positions. However, GPS location errors resulting from disasters can pose challenges for air–ground drone communication.
Deploying multiple drones can significantly enhance network coverage within a specific area. Wang et al. utilized a particle swarm optimization algorithm to optimize the deployment positions of multiple drones with the aim of maximizing network coverage [16]. Shi et al. employed a distributed deep Q network (DQN) approach in which each drone possesses its own DQN and shares action decisions. This algorithm focuses on maximizing throughput while considering fair service constraints [17]. In a similar vein, Dai et al. proposed a drone network resource allocation algorithm based on multi-agent cooperative environment learning [18]. This method adopts a distributed architecture, where each drone is treated as an independent agent. By dynamically making decisions related to deployment positions, transmission power, and occupied sub-channels, this algorithm enhances the utility of the drone network. However, in the aforementioned research, drones are required to maintain continuous communication to process network information, which can potentially result in reduced work efficiency.
In disaster areas, there is a significant demand for wireless communication, and the privacy of various types of sensitive data must be protected. Federated learning (FL) is a privacy-preserving approach that transmits only the model, rather than the raw data, during the training process. However, if model updates contain a large number of parameters, they can become excessively large, resulting in high communication costs that burden clients. To address this issue, Wu et al. proposed a federated learning method called FedKD [19], which utilizes adaptive knowledge distillation and dynamic gradient compression techniques to enhance communication efficiency and effectiveness. Wang et al. introduced ProgFed [20], a progressive training framework that reduces both the communication and computation costs of federated learning. Furthermore, Yang et al. proposed another privacy-preserving method called partial variable training (PVT) [21], which trains only a small subset of variables on edge devices to reduce memory usage and communication costs. However, all of these cost-reduction methods were proposed for their respective scenarios and do not consider the overall workload from the perspective of the volume of transmitted parameter data.
Most existing studies focus on a single UAV, whose communication coverage is limited, so they are not applicable to resource-constrained networks; many studies of multi-UAV networks also require the support of ground base stations, which may be destroyed at any time in disaster scenarios, so those studies are likewise unsuitable. Federated reinforcement learning makes it possible to protect user privacy and reduce communication costs. However, the studies discussed above target their own specific scenarios and do not consider the overall workload from the perspective of the volume of transmitted parameter data, so they do not apply to resource-constrained networks. We therefore propose a UAV-assisted software-defined networking framework to enhance the network's information processing capability and reduce communication latency between the air and the ground. Additionally, we design a state-aware routing algorithm based on deep reinforcement learning to achieve real-time perception of the network status and adjust the data forwarding rules accordingly. Finally, we develop FedRDR to address the transmission link overload caused by the large volume of intelligent agent model parameters while also ensuring the privacy of the network status within each domain.

3. System Model

Based on the unique characteristics of UAV edge computing networks, this paper presents a hierarchical multi-domain network architecture based on SDN. This architecture establishes network connections for mobile nodes, creating a mobile edge computing network system in which UAVs act as edge controllers. We stipulate that all network communication must be completed before the UAVs' energy is exhausted. The network architecture consists of two main components: the control plane and the data plane. The control plane comprises a global server and high-performance UAVs acting as domain controllers; we denote the set of domain controllers by $N = \{1, 2, 3, \ldots, n\}$. The data plane is composed of ordinary UAVs, which act as switches that provide mobile node access and data forwarding services; we denote the set of switches by $M = \{1, 2, 3, \ldots, m\}$. The global controller maintains the domain controller information for all high-performance UAVs, and we assume that all high-performance UAVs can communicate with the global controller deployed on the ground or on a satellite. When infrastructure is damaged by a natural disaster, the network status information obtained through the network processors (such as the topology and the link, node, or traffic status) is converted into a traffic matrix, on which our routing decisions are based. To protect the privacy of the network status information of the different domains, a federated learning architecture for the overall decision-making model is deployed on the domain controllers and the global controller. This enhances the data privacy and security of the network while reducing network transmission costs.
The domain controller plays a crucial role in collecting status information within its coverage domain and ensuring information consistency throughout the network. This helps to reduce the delay of business control within the domain and reduces reliance on the server for the control architecture. When the domain controller receives an intra-domain traffic flow request, it processes the request, identifies the forwarding path for the traffic flow, and then sends control messages to the switches within the domain to modify their status and complete the forwarding of the request. The server interacts with the domain controller to establish global data and achieves the establishment of a logically centralized control plane with global knowledge using a physically distributed hierarchical approach.
Multi-domain routing is facilitated by the domain controller and the global server through the federated learning model. The domain controller is responsible for maintaining the routing information within its domain and updating it within the global server as a parameter for federated learning. The global server collects parameters from all domain controllers and maintains a global federated learning model. As the model parameters are transmitted during the learning process, the specific routing path information within the domain is protected, ensuring privacy preservation. The intelligent routing system model, based on the SDN hierarchical multi-domain data transmission architecture, is illustrated in Figure 1.

4. Algorithm Model

This paper proposes a routing algorithm based on federated enhanced distillation. In real network environments, each domain generates relatively small amounts of data. Agents deployed in a single-domain controller require extensive training to obtain a stable model, which significantly increases the model’s training time and the computational energy consumption of the UAVs acting as controllers. Moreover, since the agent can only use data within its own domain for training, the trained model exhibits personalized characteristics and is only suitable for local routing decisions. To address these challenges, we designed a federated reinforcement learning routing algorithm. This algorithm constructs multiple model training domains and aggregates data from multiple domains. Each domain controller can participate in the training of the complete model, thereby expanding the data samples and avoiding the leakage of network state information within each domain. This approach enhances the universality and generalization ability of the model.
To ensure that the model can better adapt to the task requirements and possess a greater generalization ability, larger-scale models are often utilized. However, the exchange of model parameters incurs significant costs. Communication between the server and the domain controller can lead to excessive link load when uploading large amounts of data. This can result in data loss within the domain, which, in turn, impacts the training of the decision-making model.
Moreover, under standard federated reinforcement learning, a complex model of the same scale must be deployed in every domain in order to perform its tasks, which results in resource wastage. In the proposed federated reinforcement distillation approach, it is not the model parameters that are transmitted but rather the proxy experience memory. Through the federated reinforcement distillation algorithm, knowledge is transferred from one domain controller to a heterogeneous model in another domain controller. In complex task domains, agents with larger-scale neural networks are deployed to achieve better task performance, while smaller-scale models are deployed in simpler task domains. This enables the model in each domain to complete tasks more efficiently within its local domain, thus reducing the computational energy consumption of the domain controller.
In this study, the routing path is directly determined by the deep reinforcement learning algorithm with the aim of achieving the fastest forwarding of traffic from the source to the destination node. Furthermore, quality of service (QoS) is considered by taking into account link delay. After taking an action, the agent modifies the current network environment. Network state parameters are collected through the network state monitoring module and subsequently updated to reflect changes in the network state.
Therefore, we have designed a federated reinforcement distillation [22] routing method—a distributed machine learning architecture that protects privacy and has high communication efficiency. We constructed a proxy for domain experience memory and exchanged data between the domain controller and the server. The proposed federated reinforcement distillation routing algorithm does not leak sensitive intra-domain data when combining multi-domain proxy experience memories and reduces the amount of data that needs to be transmitted compared to traditional federated distillation algorithms, thus lowering communication costs.
When there are data to be transmitted in a switch queue, the domain controller obtains the network state through the network state monitoring module and computes a routing strategy that guides the forwarding of the data. For each input state, the algorithm produces an output according to the current policy and selects an action based on that output. The input state (state), output action (action), and reward value (reward) of the algorithm are defined as follows (a short code sketch follows these definitions).
(1) The input state: In most existing studies, the reinforcement learning routing algorithm is trained on the network traffic matrix. In an actual network, however, it is difficult to obtain a real-time, accurate traffic matrix. In this paper, the switch traffic is used as the state input of the agent so that the resulting actions are more timely. The input state is represented as a matrix. Every time the agent takes an action, the domain controller obtains the current state parameters, including the short-term traffic of each switch. We assume that there are n nodes in the network, where $d_i$ denotes the i-th node. The input state $S_t$ at time t is given by Equation (1):
S_t = \{ d_1 : f_1^t,\; d_2 : f_2^t,\; \ldots,\; d_n : f_n^t \}    (1)
where $f_i^t$ represents the amount of traffic flowing through $d_i$ from time $t-1$ to time $t$.
(2) The output action: After the agent receives the input state, the output values are calculated by the neural network and mapped to the corresponding actions. The authors of [23] also take the path of the data flow as the action output, but they use a routing algorithm based on Q-learning; when the traffic in the network increases, the state and action spaces grow sharply, resulting in a poor learning effect for the agent. This study instead considers all nodes from the source to the destination node as the candidate action set. The size of this action space is fixed, which facilitates the training of the agent. The action candidate set for the current state contains all the next-hop nodes connected to the current traffic node, expressed according to Equation (2):
A = \{ a_i \rightarrow [\, a_{ij}, \ldots, a_{ik} \,] \}    (2)
where $a_i$ represents the current traffic node, and $[a_{ij}, \ldots, a_{ik}]$ represents all the next-hop nodes of node i, from node j to node k.
(3) The reward value: The reward value is an important parameter that affects the training of the agent. After the agent selects an action in the current state, the environment is affected, and a reward value is obtained from the environment's feedback to evaluate the quality of the action; at the same time, the network transitions to the next state. Common reward values for network traffic are designed around the link bandwidth, latency, and throughput. Since the goal of this study is to minimize link delay, the reward is based on the average link delay, as expressed in Equation (3):
R = -\frac{1}{m} \sum_{j=1}^{m} D_j    (3)
where m represents the number of flows in the network, $D_j$ represents the link delay of the j-th flow, and R is the negative of the average link delay. Because the agent seeks the maximum reward value during training, the reward value is set as a negative value.
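To make these definitions concrete, the following minimal Python sketch shows one way the state of Equation (1), the candidate action set of Equation (2), and the reward of Equation (3) could be assembled from per-switch traffic counters and measured flow delays. The data structures and names (`traffic`, `topology`, `link_delays`) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def build_state(traffic, prev_traffic):
    """Eq. (1): short-term traffic f_i^t = bytes seen at switch d_i in (t-1, t]."""
    return np.array([traffic[d] - prev_traffic[d] for d in sorted(traffic)],
                    dtype=np.float32)

def candidate_actions(topology, current_node):
    """Eq. (2): the candidate action set is every next-hop neighbour of the
    node currently holding the flow (topology is an adjacency dict)."""
    return sorted(topology[current_node])

def reward(link_delays):
    """Eq. (3): negative mean link delay over the m active flows."""
    return -float(np.mean(link_delays))

# Hypothetical usage with three switches and two measured flow delays (ms).
prev = {"d1": 0, "d2": 0, "d3": 0}
curr = {"d1": 1200, "d2": 300, "d3": 850}
state = build_state(curr, prev)                                   # [1200., 300., 850.]
acts = candidate_actions({"d1": ["d2", "d3"], "d2": ["d3"], "d3": []}, "d1")
r = reward([22.0, 28.0])                                          # -25.0
```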
This study introduces a double deep Q-network (DDQN) to compute routing decisions. The main network (the Q-network) and the target network (the target Q-network) are both four-layer networks, consisting of an input layer, two hidden layers, and an output layer. The values produced by the output layer are mapped into the action space, corresponding one-to-one with the designed candidate actions. During training, the intelligent agent first interacts with the environment, takes an action $A_t$ based on the state $S_t$, and receives a reward value $r_t$. After the agent takes the action, the environment enters the next state, and these four parameters are stored in the experience pool. The network parameters are updated once the experience pool is full. Firstly, both $S_t$ and $S_{t+1}$ from a stored tuple $(S_t, A_t, r_t, S_{t+1})$ are input into the main Q-network, obtaining the value function $Q_{main}(A_t)$ and the maximum Q-value $\max Q_{main}(A_{t+1})$; the corresponding action is selected as shown in Equation (4):
A_{t+1} = \arg\max_{A} Q_{main}(S_{t+1}, A, w_{main})    (4)
Next, the selected action $A_{t+1}$ is input into the target network, which yields $Q_{target}(A_{t+1})$. Then, $Q_{target}(A_{t+1})$, $Q_{main}(A_t)$, and $r_t$ are incorporated into the loss function to calculate the loss, as illustrated by Equation (5):
loss = E\left[ \left( r_t + \gamma\, Q_{target}(S_{t+1}, A_{t+1}, w_{target}) - Q_{main}(S_t, A_t, w_{main}) \right)^2 \right]    (5)
The main network parameters are then updated according to the loss function and are copied to the target network at regular intervals. According to the Bellman equation, the target value can be calculated from $r_t$ and the estimated value $Q_{target}(S_{t+1}, A_{t+1}, w_{target})$, as shown in Equation (6):
q_t = r_t + \gamma\, Q_{target}\left( S_{t+1},\, \arg\max_{A_{t+1}} Q_{main}(S_{t+1}, A_{t+1}, w_{main}),\, w_{target} \right)    (6)
The parameters of the main network are then updated according to Equation (7):
w_{main} \leftarrow w_{main} + \alpha \left( q_t - Q_{main}(S_t, A_t, w_{main}) \right) \nabla_{w_{main}} Q_{main}(S_t, A_t, w_{main})    (7)
During the training process of the primary Q-network, the experience replay technique is used to store a series of states, actions, rewards, and next states obtained by the intelligent agent interacting with the environment in an experience replay pool. During training, fixed batches of data can be randomly selected from the experience pool to increase the training speed of the intelligent agent. However, since each datum stored after the interaction between the intelligent agent and the environment contains the next moment state, there is a certain correlation between the samples. To reduce the correlation between data samples and prevent the intelligent agent from falling into the local optimum, a random strategy is adopted when selecting data sets. In addition, because each tuple has different contributions to training, some scholars have used the priority-based experience replay technique. This study adopts the method of directly selecting fixed batches of data from the total experience memory for training using the experience replay technique. When the memory is full, the principle of first-in-first-out (FIFO) is used to replace old data with new data.
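As a concrete illustration of the DDQN update described above, the following PyTorch sketch implements the four-layer Q-networks, the FIFO experience replay pool, and the double-DQN target of Equations (4)-(7). The layer width, optimizer, and hyperparameter values are illustrative assumptions only.

```python
import random
from collections import deque

import torch
import torch.nn as nn

class QNet(nn.Module):
    """Four-layer network: input layer, two hidden layers, output layer."""
    def __init__(self, n_states, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_states, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions))

    def forward(self, x):
        return self.net(x)

q_main, q_target = QNet(8, 4), QNet(8, 4)
q_target.load_state_dict(q_main.state_dict())
optimizer = torch.optim.Adam(q_main.parameters(), lr=1e-3)
# FIFO replay pool; each entry is (state, action, reward, next_state) tensors,
# with the action stored as a long (int64) tensor.
memory = deque(maxlen=2000)

def train_step(batch_size=32, gamma=0.9):
    """One DDQN update following Eqs. (4)-(7)."""
    if len(memory) < batch_size:
        return
    batch = random.sample(memory, batch_size)        # break sample correlation
    s, a, r, s_next = map(torch.stack, zip(*batch))
    # Eq. (4): action selection with the main network ...
    a_next = q_main(s_next).argmax(dim=1, keepdim=True)
    # ... evaluated by the target network (double DQN), Eq. (6).
    q_next = q_target(s_next).gather(1, a_next).squeeze(1).detach()
    target = r + gamma * q_next
    q_pred = q_main(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_pred, target)     # Eq. (5)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                  # gradient step, Eq. (7)

def sync_target():
    """Copy the main network into the target network every ζ steps."""
    q_target.load_state_dict(q_main.state_dict())
```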
The algorithm in this paper is divided into four stages: local update, parameter upload, parameter aggregation, and parameter delivery. These stages are defined as follows. (1) Local update: the local update process adopts the DDQN-based routing algorithm to train the local model and update its parameters. (2) Parameter upload: the agent in each domain controller completes training in its local environment, and the domain controllers upload their own model parameters $w_i(\varepsilon)$ to the server, where i denotes the ID of the domain controller and $\varepsilon$ denotes the number of local iterations. (3) Parameter aggregation: D represents the total delay of all links, as shown in Equation (8). The formula for parameter aggregation is presented in Equation (9), in which $\rho_i$ represents the ratio of the parameter data volume of the i-th domain controller to the total data volume and n represents the number of domain controllers; the calculation of $\rho_i$ is shown in Equation (10). The server trains the global model with the aggregated parameters to generate the global model parameters, and model convergence is checked before the global parameters are issued. If the model has converged, the global model has been learned and the federated reinforcement learning routing algorithm ends; otherwise, the algorithm enters the parameter delivery stage. (4) Parameter delivery: the global parameters are delivered to each domain controller, which assigns them to its local model and continues training with local data. This four-stage process is executed cyclically.
The global loss function $F(w)$ is the weighted average of the loss functions of the domain controllers, as shown in Equation (11). The goal of the federated reinforcement learning routing algorithm is to minimize this loss function, as shown in Equation (12). A code sketch of the aggregation step follows the equations below.
D = \sum_{i=1}^{n} D_i    (8)
w^{t+1}(\varepsilon) = \frac{ \sum_{i=1}^{n} D_i\, w_i^{t}(\varepsilon) }{ D } = \sum_{i=1}^{n} \rho_i\, w_i^{t}(\varepsilon)    (9)
\rho_i = \frac{D_i}{D}    (10)
F(w) = \frac{ \sum_{i=1}^{n} D_i\, F_i(w_i) }{ D } = \sum_{i=1}^{n} \rho_i\, F_i(w_i)    (11)
w_{final}(\varepsilon) = \arg\min \frac{ \sum_{i=1}^{n} D_i\, F_i\left( w_{global}(\varepsilon) \right) }{ D } = \arg\min \sum_{i=1}^{n} \rho_i\, F_i\left( w_{global}(\varepsilon) \right)    (12)
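A minimal sketch of the server-side aggregation step of Equations (8)-(10) is given below, assuming each domain controller uploads a PyTorch state_dict together with its total intra-domain link delay; the function name and data layout are illustrative assumptions.

```python
import torch

def aggregate_parameters(local_models, delays):
    """Delay-weighted averaging of domain-controller parameters.

    local_models : list of state_dicts w_i(ε) uploaded by the n domain controllers
    delays       : list of per-domain total link delays D_i
    """
    total = sum(delays)                                   # Eq. (8): D = Σ D_i
    rhos = [d / total for d in delays]                    # Eq. (10): ρ_i = D_i / D
    global_model = {}
    for key in local_models[0]:                           # Eq. (9): weighted sum
        global_model[key] = sum(rho * sd[key] for rho, sd in zip(rhos, local_models))
    return global_model

# Hypothetical round: three domains upload parameters; the server aggregates
# them and would then redistribute global_sd for further local training, e.g.
# global_sd = aggregate_parameters([m1.state_dict(), m2.state_dict(), m3.state_dict()],
#                                  delays=[120.0, 95.0, 160.0])
```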
The formula for the proxy experience memory is shown in Equation (13):
M = \left\{ \left( s_i^p,\; \pi^p(a_i \mid s_i^p) \right) \right\}_{i=0}^{N^p}    (13)
where $s_i^p$ indicates a proxy state in each domain controller, i represents the ID of the domain controller, and $\pi^p$ represents the average proxy strategy. Each proxy state set $C_i \subset C$ is a further division of the state space, and the union of the proxy state sets $C_i$ is the entire state space C. N indicates the number of participating domains.
The interaction process of proxy experience memory between the server and domain controllers in the FedRDR algorithm is as follows:
  • First, each domain trains its model based on local data and stores the tuples $(S_t, A_t, r_t, S_{t+1})$. The states in these tuples are then aggregated to form a proxy state set $C_j$. The policy of a proxy state is the average of the policies $\pi_{\theta_i}(a \mid s)$ of the multiple states that make up the proxy state. The average strategy is calculated as shown in Equation (14):
    \pi_{\theta_i}^{p}(a_k \mid s_k^p) = \frac{ \sum_{k=i}^{j} \pi_{\theta_i}(a_k \mid s_k) }{ j - i }    (14)
    where $\theta_i$ represents the local model and $s_k^p$ represents a proxy state covering a series of states $(s_i, \ldots, s_k, \ldots, s_j)$, with $s_k^p \in C_j$. All proxy states are combined to form a set.
  • After all the agents have filled their experience memories, the average proxy strategy is calculated. The proxy experience memory, consisting of the proxy states and the corresponding average strategies, is constructed and uploaded to the server. The proxy experience memory of the i-th domain controller is formulated as shown in Equation (15):
    M_i^p = \left\{ \left( s_k^p,\; \pi_{\theta_i}^{p}(a_k \mid s_k^p) \right) \right\}_{k=0}^{N_i^p}    (15)
    where $N_i^p$ represents the size of the local proxy experience memory and the superscript p refers to the proxy policy.
  • When the local proxy experience memories are uploaded to the server, they are aggregated: identical proxy states from multiple domains are combined into one proxy state, and the corresponding policies are averaged once again. The global average proxy experience memory is given by Equation (16):
    M^p = \left\{ \left( s_k^p,\; \pi^{p}(a_k \mid s_k^p) \right) \right\}_{k=0}^{N^p}    (16)
  • The global average proxy experience memory is then delivered from the server to the domain controller.
  • After the domain controller receives the global average proxy experience memory, it calculates the KL divergence loss between the local model and the global average proxy experience memory so that the agent can learn the distilled knowledge and accelerate the training of the local agent (see the sketch after this list). The KL divergence loss function is shown in Equation (17):
    L_i^p(M^p, \theta_i) = - \sum_{k=1}^{N^p} \pi^{p}(a_k \mid s_k^p)\, \log\left( \pi_{\theta_i}(a_k \mid s_k^p) \right)    (17)
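The following sketch illustrates, under simplifying assumptions, how a proxy experience memory could be built from raw states and policies (Equations (14) and (15)), merged on the server (Equation (16)), and used to compute the distillation loss (Equation (17)). Grouping states by coarse rounding is an assumed stand-in for whatever partition of the state space into sets $C_j$ is actually used; all function names are illustrative.

```python
from collections import defaultdict

import torch

def build_proxy_memory(states, policies, n_bins=10):
    """Eqs. (14)-(15): group raw states into proxy states and average the
    corresponding policies. Bucketing by coarse rounding is an assumption."""
    buckets = defaultdict(list)
    for s, pi in zip(states, policies):
        key = tuple((s / s.max() * n_bins).round().int().tolist())
        buckets[key].append(pi)
    return {k: torch.stack(v).mean(dim=0) for k, v in buckets.items()}

def merge_proxy_memories(memories):
    """Eq. (16): identical proxy states from different domains are combined and
    their average policies averaged again on the server."""
    merged = defaultdict(list)
    for mem in memories:
        for k, pi in mem.items():
            merged[k].append(pi)
    return {k: torch.stack(v).mean(dim=0) for k, v in merged.items()}

def distillation_loss(global_memory, local_policy_fn):
    """Eq. (17): cross-entropy between the global average policy and the local
    model's policy evaluated on each proxy state."""
    loss = 0.0
    for key, pi_global in global_memory.items():
        s = torch.tensor(key, dtype=torch.float32)
        log_pi_local = torch.log_softmax(local_policy_fn(s), dim=-1)
        loss = loss - (pi_global * log_pi_local).sum()
    return loss
```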
The proxy states in each domain's proxy experience memory are not the original states, and the policy associated with each proxy state is an average rather than the actual per-state policy. Therefore, sharing the local proxy experience memory not only protects data privacy within the domain but also reduces the amount of communication data. The FedRDR algorithm model is shown in Figure 2.
Firstly, the DDQN-based intra-domain routing algorithm is introduced, as shown in Algorithm 1. Secondly, the FedRDR algorithm is introduced, as shown in Algorithm 2.
Algorithm 1 Federated reinforcement learning model training process.
Require:
  • The short-term switch traffic F, the network topology T, and the target network update frequency ζ
Ensure:
  • The routing path
  • Initialize the hyperparameters of the DDQN agent;
  • Initialize the virtual simulation environment env based on Mininet; initialize the current state S, the next state S′, the action A, and the reward value R;
  • Initialize the switch flow table F, the link delay, and the network topology T;
  • T ← Obtain the network topology;
  • P ← Find all candidate nodes from the source node to the destination node according to T;
  • F ← Obtain the traffic information from the switches j ∈ {1, 2, …, m};
  • for t = 1, 2, 3, …, T do
  •    Agent input state ← S_t;
  •    A_t ← Agent output action;
  •    According to the action A_t, the domain controller selects the next hop from the candidate nodes P, changes the corresponding switch flow table, and forwards the data to the next-hop node;
  •    r_t, TMP ← The domain controller obtains the network link delay and the switch flow table;
  •    S_{t+1} ← (TMP − F), F ← TMP;
  •    S_t ← S_{t+1};
  •    The agent puts (S_t, A_t, r_t, S_{t+1}) into the experience replay pool;
  •    if Memory == M then
  •      Store the newly obtained result (S_t, A_t, r_t, S_{t+1}) and overwrite the old result using the FIFO principle;
  •      Select a fixed batch of data from Memory and update the main network parameters w_main(t) by minimizing the loss function;
  •    end if
  •    if t % ζ == 0 then
  •      Update the target network parameters w_target(t) = w_main(t);
  •    end if
  •    if t % φ == 0 then
  •      Construct the local proxy experience memory, upload the proxy experience memory of each domain controller to the server, and obtain the global proxy experience memory M^p;
  •      After the global proxy experience memory has been built, it is distributed to the domain controllers;
  •      All agents in the domain controllers update their local models;
  •      if M_global^i is constant then
  •         Federated reinforcement distillation ends;
  •      end if
  •    end if
  • end for
  • return M_global^i;
Algorithm 2 FedRDR: Federated reinforcement distillation-based routing algorithm.
Require:
  • The main network parameters w_main and target network parameters w_target in the DDQN, the network parameter update frequency ζ, the federated reinforcement distillation model update frequency φ, the number of domain controllers n, and the local model for federated reinforcement distillation M_global^i.
Ensure:
  • The local decision model
  • Initialize the local model parameters w_main^i(0) = w_target^i(0) for all i ∈ {1, 2, …, n} and M_global^i;
  • Set the Memory = M used to store the optimal results;
  • Set the hyperparameters of the DDQN according to Algorithm 1, with a time frame T;
  • for t = 1, 2, 3, …, T do
  •    Perform the following operations on the n domain controllers simultaneously (taking domain controller i as an example):
  •    Initialize the switch flow table F, the link delay, and the network topology T;
  •    T ← Obtain the network topology;
  •    P ← Find all candidate nodes from the source node to the destination node according to T;
  •    F ← Obtain the traffic information from the switches j ∈ {1, 2, …, m};
  •    S = F;
  •    Agent input state ← S_t;
  •    A_t ← Agent output action;
  •    According to the action A_t, the controller selects the next hop from the candidate nodes P, changes the corresponding switch flow table, and forwards the data to the next-hop node;
  •    r_t, TMP ← The controller obtains the network link delay and the switch flow table;
  •    S_{t+1} ← (TMP − F), F ← TMP;
  •    S_t ← S_{t+1};
  •    The agent puts (S_t, A_t, r_t, S_{t+1}) into the experience replay pool;
  •    if Memory == M then
  •      Store the newly obtained result (S_t, A_t, r_t, S_{t+1}) and overwrite the old result using the FIFO principle;
  •      Select a fixed batch of data from Memory and update the main network parameters w_main(t) by minimizing the loss function;
  •    end if
  •    if t % ζ == 0 then
  •      Update the target network parameters w_target(t) = w_main(t);
  •    end if
  •    if t % φ == 0 then
  •      Construct the local proxy experience memory, upload it to the server, and update the local model M_global^i using the returned global average proxy experience memory M^p;
  •    end if
  • end for

5. Evaluation and Discussion of Results

5.1. Experimental Environment

The simulation system described in this paper was implemented using the Mininet network simulator and the Ryu controller. The algorithm itself was implemented in Python, with PyTorch serving as the overall framework. The experimental environment used in this study is presented in Table 1.

5.2. Simulation System Design

This paper considers a distributed multi-domain data transmission architecture. To make the routing decision process efficient, we needed to minimize the number of parameters to be transmitted, thus enabling the quick generation of decision models; the number of drones required for the backbone network does not have to be very large. The system was implemented through the design of four modules, as follows:
  • Network status monitoring module: The main function of this module is to obtain network status information through echo and LLDP messages in the controller. These two types of messages are exchanged between the SDN controller and the switches, providing the states, rewards, and actions required by our DDQN-based intelligent agent—more specifically, the short-term traffic of the switches, the network topology, and the link latency over that topology (a monitoring sketch follows this list).
  • Smart routing decision module: The DDQN-based intra-domain routing algorithm is the core of the decision module. The main function of this module is to make routing decisions based on the network status. This module can be divided into three parts: environment perception, the self-learning of intelligent agents, and handling of experience memory.
  • Routing decision execution module: This module is mainly used to guide the switches to forward traffic according to the routing decisions made by the controller. After the routing decision module calculates the routing paths, the controller installs the flow table to the switch. The switch updates its flow table and forwards traffic based on the new flow table.
  • Server processing module: The main task of the server processing module is to connect to the domain controllers and receive the parameters they transmit. The Update_params function aggregates all the parameters uploaded by the domain controllers and transmits the global parameters back to each domain controller to accelerate intra-domain training; this process iterates until the model converges.
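For illustration, a minimal sketch of the network status monitoring module is given below, following the standard Ryu traffic-monitor pattern: it polls per-port byte counters from the in-domain switches and derives the short-term traffic used as the agent's input state. The class name, polling period, and stored fields are assumptions; the latency measurement via echo messages and the LLDP-based topology discovery used in our module are omitted here.

```python
from ryu.base import app_manager
from ryu.controller import ofp_event
from ryu.controller.handler import MAIN_DISPATCHER, DEAD_DISPATCHER, set_ev_cls
from ryu.lib import hub


class StatusMonitor(app_manager.RyuApp):
    """Periodically polls per-port traffic counters from the in-domain switches."""

    def __init__(self, *args, **kwargs):
        super(StatusMonitor, self).__init__(*args, **kwargs)
        self.datapaths = {}
        self.port_bytes = {}                  # (dpid, port) -> last byte counter
        self.monitor_thread = hub.spawn(self._monitor)

    @set_ev_cls(ofp_event.EventOFPStateChange, [MAIN_DISPATCHER, DEAD_DISPATCHER])
    def _state_change_handler(self, ev):
        dp = ev.datapath
        if ev.state == MAIN_DISPATCHER:
            self.datapaths[dp.id] = dp
        elif ev.state == DEAD_DISPATCHER:
            self.datapaths.pop(dp.id, None)

    def _monitor(self):
        while True:
            for dp in self.datapaths.values():
                req = dp.ofproto_parser.OFPPortStatsRequest(
                    dp, 0, dp.ofproto.OFPP_ANY)
                dp.send_msg(req)
            hub.sleep(1)                      # polling period (assumed 1 s)

    @set_ev_cls(ofp_event.EventOFPPortStatsReply, MAIN_DISPATCHER)
    def _port_stats_reply_handler(self, ev):
        dpid = ev.msg.datapath.id
        for stat in ev.msg.body:
            key = (dpid, stat.port_no)
            prev = self.port_bytes.get(key, stat.tx_bytes)
            self.port_bytes[key] = stat.tx_bytes
            # Short-term traffic f_i^t fed to the DDQN agent as its input state.
            self.logger.info("switch %s port %s short-term tx bytes: %s",
                             dpid, stat.port_no, stat.tx_bytes - prev)
```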

5.3. Performance Evaluation

Setting appropriate hyperparameters, such as the learning rate, batch size, and memory size for experience replay, during the process of training an intelligent agent can make the model converge faster. Therefore, we conducted experiments on the hyperparameters of the intelligent agent to select the optimal parameters.
From Figure 3, it can be seen that with learning rates of 0.1 and 0.01, the intelligent agent descends rapidly in the early stages, but the algorithm converges poorly in the later stages. Between the smaller learning rates of 0.0001 and 0.001, the agent has already converged after 200 iterations when the learning rate is 0.001. Therefore, we set the learning rate to 0.001.
We set the batch sizes to 16, 32, and 64, and the experimental results are shown in Figure 4. From the graph, it can be seen that when the batch size is set to 16, the algorithm cannot converge. Compared with a batch size of 64, the model has better convergence when the batch size is set to 32. Therefore, we set the batch size to 32.
We determined the other hyperparameters of the agent through experiments and finally set the hyperparameters of the agent as shown in Table 2:
To verify the performance of the intra-domain routing algorithm based on deep reinforcement learning, after determining the model hyperparameters of the intelligent agent, we conducted comparative experiments. The compared algorithms were the depth-first search (DFS) algorithm and a reinforcement learning routing algorithm based on Q-learning.
Network traffic packets were sent using iperf: three flows were sent from $h_{11}$, $h_{12}$, and $h_{13}$ to $h_{41}$, $h_{42}$, and $h_{43}$, respectively. The bandwidth requirements of each flow were set as shown in Equations (18)–(20), and a flow-generation sketch follows the equations:
F_1: h_{11} \rightarrow h_{41}: 3\ \mathrm{Mbit/s}    (18)
F_2: h_{12} \rightarrow h_{42}: 2\ \mathrm{Mbit/s}    (19)
F_3: h_{13} \rightarrow h_{43}: 2\ \mathrm{Mbit/s}    (20)
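One way these three flows could be launched from a Mininet script is sketched below; the host names follow Equations (18)-(20), while the use of UDP iperf, the background execution, and the 60 s duration are assumptions made for illustration.

```python
from mininet.net import Mininet

def start_flows(net: Mininet):
    """Launch the three flows of Eqs. (18)-(20) on an existing topology that
    contains hosts h11-h13 (sources) and h41-h43 (sinks)."""
    flows = [("h11", "h41", "3M"), ("h12", "h42", "2M"), ("h13", "h43", "2M")]
    for src, dst, bw in flows:
        server, client = net.get(dst), net.get(src)
        server.cmd("iperf -s -u &")                                     # sink
        client.cmd("iperf -u -c %s -b %s -t 60 &" % (server.IP(), bw))  # offered load
```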
A comparative analysis of the three routing algorithms was conducted under this network state, and the experimental results are shown in Figure 5. From the graph, we can see that the data transmission delay under the routing strategy produced by the DFS algorithm is about 130 ms, as DFS does not find the optimal forwarding path; when the traffic in the network increases, the DFS algorithm cannot provide an optimal forwarding strategy. For the two state-aware intelligent routing algorithms, the final link delay in the converged state is about 25 ms. On the same device, the Q-learning algorithm requires more time to reach convergence than the DDQN routing algorithm.
To demonstrate the advantages of our proposed algorithm, we increased the number of switches to 8 and the number of flows to be forwarded in the network to 8 and compared the performance of the three routing algorithms. The experimental results are shown in Figure 6.
From the graph, it is evident that the routing strategy calculated by the DFS algorithm still leads to significant network delay. Furthermore, it was discovered in this study that the Q-learning-based routing algorithm failed to converge within 1000 iterations. When the number of switches and flows increases, the number of Q-table items also increases significantly, resulting in slow or even non-convergence of the algorithm. In contrast, our proposed DDQN-based routing algorithm is capable of quickly converging and guiding traffic forwarding, even in the presence of a large state space and action space.
In our approach, we designed a process for uploading the local model parameters to the server after every 100 iterations in each domain. The server then aggregates the parameters from the three domains. Based on the aggregated parameters, the server undergoes an additional 100 iterations of training before distributing the parameters back to the domain controllers. The domain controllers continue training the model based on the received parameters, and this process repeats until the model reaches a stable state. The experimental results of the federated reinforcement learning routing algorithm are depicted in Figure 7. After receiving the aggregated parameters from the server, the model of each domain controller achieved convergence after around 20 iterations of training.
When a new task domain emerges, the model parameters that already exist in the server are sent to the new domain. The new domain does not need to train the local model from scratch, which saves time when it comes to agent training. We selected one of two domains with the same task for local training. After local model training is completed, the model parameters migrate to the other domain controller via the server to verify the impact of the model parameters within one domain on the acceleration effect of the agent in another domain. The experimental results are shown in Figure 8.
In the experiment, Controller 1 represents an intelligent agent that has not undergone any parameter assignment training, while Controller 2 represents an intelligent agent after parameter assignment training. From the graph, it is evident that the intelligent agent in Controller 1 reaches a convergence state after almost 70 iterations, whereas the intelligent agent in Controller 2 achieves convergence after approximately 60 iterations. This experimental result demonstrates that the federated reinforcement learning-based intelligent routing algorithm can expedite the training process of intelligent agents within a single domain.
Next, we examined the impact of the number of domains on the convergence of the global model. We conducted experiments with 3, 5, and 8 domains. The training process was consistent across all experiments, where parameters were sent from the domain controller to the server every 100 iterations. The server then performed parameter aggregation and trained for an additional 100 iterations based on the aggregated parameters before distributing the parameters back to the domain controllers. This communication process was repeated three times.
The experimental results are depicted in Figure 9. From the graph, it can be observed that when there are more domains, the model exhibits less fluctuation after nearly 100 iterations and achieves a slight advantage in terms of convergence speed. However, irrespective of whether 3, 5, or 8 domains participate in the federated reinforcement learning-based multi-domain joint process, the accuracy of the model is hardly affected. A smaller number of domains does not lead to inaccurate models in the federated reinforcement learning-based multi-domain joint process. However, as the number of domains increases, the amount of uploaded data also increases, thereby enhancing the generalizability and applicability of the trained model. The experimental results demonstrate the significant utility advantages of the federated reinforcement learning-based intelligent routing algorithm.
Next, we verified the effect of the federated reinforcement learning-based multi-domain intelligent routing algorithm on reducing the training energy consumption within a single domain. The results are shown in Figure 10.
The horizontal axis represents time, with each recording interval being 100 ms. The vertical axis represents energy consumption, measured in milliwatt-hours (mWh). From the graph, it is evident that the energy consumption of the agents in both controllers is nearly identical until the 250th recording. However, since we assigned the model parameters from Controller 1 to Controller 2, the training speed in the second domain controller is accelerated, leading to faster model convergence. Consequently, the training process can be terminated once the model converges. As a result, the second controller concludes training earlier, reducing the energy consumption of the UAV acting as the domain controller due to reduced computation requirements. The federated reinforcement learning-based intelligent routing algorithm necessitates interaction and communication between domain controllers and servers to update the model parameters of intelligent agents. During the communication between the server and domain controllers, we collected data on the amount of exchanged data in model parameter interactions between domain controllers and servers, as illustrated in Figure 11.
From the figure, we can observe the specific size of the data volume for each communication between the domain controllers and the server. The average communication data volume is approximately 63.6 KB.
In summary, the federated reinforcement learning routing algorithm can expedite the training process of intelligent agents within domains while reducing the computational energy consumption of domain controllers. The experimental results provide evidence of the significant utility advantages of the federated reinforcement learning-based multi-domain intelligent routing algorithm.
Initially, the experiment validated the effectiveness of the FedRDR algorithm in accelerating the training of intelligent agents within domains. In federated reinforcement distillation, the agent’s experience memory stored in domain controllers is uploaded to the server, and intelligent agents from other domains learn from this experience memory. The comparison results are depicted in Figure 12.
The intelligent agent of domain Controller 1 undergoes local environment training to generate a local experience memory. Once the experience memory storage is full, it is sent to domain Controller 2 to observe its effect on accelerating the training of intelligent agents within the domain. From Figure 12, it can be observed that the model in domain Controller 1 reaches convergence after approximately 70 iterations of training, while the intelligent agent in domain Controller 2 achieves convergence after around 40 iterations of training. This indicates that knowledge transfer can enhance the learning speed of intelligent agents. During the training process in domain Controller 2, the training data are not obtained through the interaction between the local intelligent agent and the environment; instead, they are provided by domain Controller 1. This reduction in the interaction process between the intelligent agent and the environment contributes to the acceleration of training. Therefore, the proposed federated reinforcement distillation-based intelligent routing algorithm can expedite the training process of intelligent agents within domains.
Both the federated reinforcement learning routing algorithm and the FedRDR algorithm demonstrate the capability of accelerating the training of intelligent agents within domains. We compare their effects on accelerating the training of intelligent agents, and the experimental results are illustrated in Figure 13.
From the graph, it is evident that the federated reinforcement learning model achieves convergence after approximately 60 iterations of training. On the other hand, the federated reinforcement distillation model enters the convergence state after about 40 iterations of training. This indicates that knowledge transfer through federated reinforcement distillation has a better acceleration effect and reduces the training time for single-domain intelligent agents compared to parameter transfer through federated reinforcement learning.
In the federated reinforcement learning routing algorithm, we also examined the size of communications between the server and domain controllers during their interactions, which averaged around 63.6 KB. Additionally, we collected the amount of data transferred between the five domain controllers and the server during each communication, as depicted in Figure 14.
According to Figure 15, the intelligent routing algorithm proposed in this paper can reduce communication volume by 25–33% compared to FRL. The results indicate that this algorithm reduces the amount of data that needs to be transmitted, improving model training performance.
The maximum amount of data transferred between any domain controller and the server during a single FedRDR communication was 47.6 KB, whereas in the federated reinforcement learning routing algorithm, the lowest amount of communication data was 59.3 KB. Federated reinforcement distillation therefore demonstrates better communication efficiency than federated reinforcement learning. By proxying the states and aggregating their policies, federated reinforcement distillation reduces the communication overhead between the server and the domain controllers. In the multi-domain framework utilizing federated reinforcement learning, the average amount of communication between each domain and the server per iteration is approximately 63.6 KB, while in the multi-domain framework based on federated reinforcement distillation, it is around 45.3 KB. Compared with parameter transmission through the federated reinforcement learning algorithm, FedRDR thus reduces the amount of transmitted parameters by approximately 29%. The average communication amounts are presented in Table 3.
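For reference, the reported reduction of roughly 29% follows directly from the two measured averages:

\frac{63.6\ \text{KB} - 45.3\ \text{KB}}{63.6\ \text{KB}} \approx 0.288 \approx 29\%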
As presented in the above table, the FedRDR algorithm offers a significant reduction in the transmitted data volume, thereby enhancing the training speed of single-domain intelligent agents. Additionally, when it comes to federated reinforcement distillation, the transmitted data only comprise the training data, irrespective of the intricacy of the models within each domain.
In short, the FedRDR algorithm not only expedites the training process of intelligent agents in individual domains but also reduces the communication data exchanged between the domain controllers and the server. Our empirical findings suggest that the efficacy of the FedRDR algorithm becomes more pronounced when confronted with intricate environments and tasks.

6. Discussion

In order to achieve the rapid deployment of communication networks in disaster-stricken areas and transportation hotspots in the Internet of Things (IoT) and to provide link services for ground nodes, this paper presents a temporary network with drones as communication nodes. In addition, we propose a hierarchical, multi-domain data transmission architecture based on SDN. We further divide the control layer of SDN, enhancing the network control capability and simplifying network management. By utilizing the computing and communication capabilities of drones, communication devices in the network can communicate more rapidly. Moreover, when high-performance drones serve as domain controllers, deploying efficient routing algorithms on them can make the entire IoT system more flexible and efficient. Therefore, this study focuses on the design of the FedRDR algorithm and its deployment on high-performance drones. Compared to the federated reinforcement learning routing algorithm, the FedRDR algorithm reduces the volume of transmitted parameters by approximately 29% and further accelerates the convergence speed of the model. Based on the simulation system, this paper verifies the excellent performance of the proposed routing algorithm in drone-assisted IoT systems and demonstrates the significance of the proposed architecture and algorithm for temporary networking supported by drones, thereby resolving the communication transmission obstacles of traditional IoT systems in disaster-stricken areas and transportation hotspots. The network architecture and routing algorithm proposed in this paper are general and can be adapted to more complex networks.

Author Contributions

Conceptualization, J.L. and A.L.; methodology, J.L. and A.L.; software, A.L. and S.C.; validation, J.L., A.L. and S.C.; formal analysis, A.L. and S.C.; investigation, J.L., A.L. and S.C.; resources, A.L., F.W. and S.C.; data curation, F.W. and A.L.; writing—original draft preparation, A.L. and S.C.; writing—review and editing, A.L., X.W. and S.C.; visualization, A.L., X.W. and S.C.; supervision, A.L., X.W., G.H. and S.C.; project administration, J.L.; funding acquisition, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key Research and Development Projects (2022YFB4500800); the Applied Basic Research Program Project of Liaoning Province (2023JH2/101300192); the Fundamental Research Funds for the Central Universities (N2116014); the National Natural Science Foundation of China (62032013, 62072094); and the New Generation Information Technology Innovation Project of the Ministry of Education (2021ITA10011).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. China Emergency Management Report. Available online: https://baijiahao.baidu.com/s?id=1776459815771608944&wfr=spider&for=pc (accessed on 1 October 2023).
  2. Ganesh, S.; Gopalasamy, V.; Shibu, N.S. Architecture for Drone Assisted Emergency Ad-hoc Network for Disaster Rescue Operations. In Proceedings of the 2021 International Conference on COMmunication Systems & NETworkS (COMSNETS), Bangalore, India, 5–9 January 2021. [Google Scholar] [CrossRef]
  3. Xie, J.; Fu, Q.; Jia, R.; Lin, F.; Li, M.; Zheng, Z. Optimal Energy and Delay Tradeoff in UAV-Enabled Wireless Sensor Networks. Drones 2023, 7, 368. [Google Scholar] [CrossRef]
  4. Wang, Y.; Farooq, J. Resilient UAV formation for coverage and connectivity of spatially dispersed users. In Proceedings of the ICC 2022-IEEE International Conference on Communications, Seoul, Republic of Korea, 16–20 May 2022; Volume 10, pp. 225–230. [Google Scholar]
  5. Take a Look at AT&T’s ‘Flying COWs’—Drones That Returned Cell Service to Hurricane Ida-Hit Louisiana. Available online: https://news.yahoo.com/look-ts-flying-cows-drones-113500471.html (accessed on 10 September 2021).
  6. He, G.; Bao, W.; Hui, Y. A UAV Emergency Network User Allocation Method for Load Balancing. In Proceedings of the 2022 8th International Conference on Big Data and Information Analytics (BigDIA), Guiyang, China, 24–25 August 2022. [Google Scholar]
  7. Xu, Z.; Li, X.; Lin, X. Research on the Construction of Forestry Protection Drone Project-Take the Construction of Forest Fire Monitoring Project of Huizhou Engineering Vocational College as an Example. In Proceedings of the 6GN for Future Wireless Networks: 4th EAI International Conference, 6GN 2021, Huizhou, China, 30–31 October 2021; pp. 230–244. [Google Scholar]
  8. Deng, X.; Wang, L.; Gui, J.; Jiang, P.; Chen, X.; Zeng, F.; Wan, S. A review of 6G autonomous intelligent transportation systems: Mechanisms, applications and challenges. J. Syst. Archit. 2023, 142, 102929. [Google Scholar] [CrossRef]
  9. Abdellatif, A.A.; Mhaisen, N.; Mohamed, A.; Erbad, A.; Guizani, M.; Dawy, Z.; Nasreddine, W. Communication-efficient hierarchical federated learning for IoT heterogeneous systems with imbalanced data. Future Gener. Comput. Syst. 2022, 128, 406–419. [Google Scholar] [CrossRef]
  10. Liu, L.; Zhang, J.; Song, S.H.; Letaief, K.B. Communication-Efficient Federated Distillation with Active Data Sampling. In Proceedings of the ICC 2022-IEEE International Conference on Communications, Seoul, Republic of Korea, 16–20 May 2022; pp. 201–206. [Google Scholar]
  11. Paguem Tchinda, A. Optimisation of Wireless Disaster Telecommunication Network based on Network Functions Virtualisation under special consideration of Energy Consumption. Doctoral Dissertation, University of Plymouth, Plymouth, UK, 2022. [Google Scholar]
  12. Alsamhi, S.H.; Shvetsov, A.V.; Kumar, S.; Shvetsova, S.V.; Alhartomi, M.A.; Hawbani, A.; Rajput, N.S.; Srivastava, S.; Saif, A.; Nyangaresi, V.O. UAV computing-assisted search and rescue mission framework for disaster and harsh environment mitigation. Drones 2022, 6, 154. [Google Scholar] [CrossRef]
  13. Liu, C.; Feng, W.; Chen, Y.; Wang, C.X.; Ge, N. Cell-Free Satellite-UAV Networks for 6G Wide-Area Internet of Things. IEEE J. Sel. Areas Commun. 2021, 39, 1116–1131. [Google Scholar] [CrossRef]
  14. Tarekegn, G.B.; Juang, R.-T.; Lin, H.-P.; Munaye, Y.Y.; Wang, L.-C.; Bitew, M.A. Deep-Reinforcement-Learning-Based Drone Base Station Deployment for Wireless Communication Services. IEEE Internet Things J. 2022, 9, 21899–21915. [Google Scholar] [CrossRef]
  15. Zhang, C.; Zhang, L.; Zhu, L.; Zhang, T.; Xiao, Z.; Xia, X.G. 3D deployment of multiple UAV-mounted base stations for UAV communications. IEEE Trans. Commun. 2021, 69, 2473–2488. [Google Scholar] [CrossRef]
  16. Wang, L.; Zhang, H.; Guo, S.; Yuan, D. 3D UAV Deployment in Multi-UAV Networks with Statistical User Position Information. IEEE Commun. Lett. 2022, 26, 1363–1367. [Google Scholar] [CrossRef]
  17. Shi, W.; Li, J.; Wu, H.; Zhou, C.; Cheng, N.; Shen, X. Drone-Cell Trajectory Planning and Resource Allocation for Highly Mobile Networks: A Hierarchical DRL Approach. IEEE Internet Things J. 2020, 8, 9800–9813. [Google Scholar] [CrossRef]
  18. Dai, Z.; Zhang, Y.; Zhang, W.; Luo, X.; He, Z. A Multi-Agent Collaborative Environment Learning Method for UAV Deployment and Resource Allocation. IEEE Trans. Signal Inf. Process. Over Netw. 2022, 8, 120–130. [Google Scholar] [CrossRef]
  19. Wu, C.; Wu, F.; Lyu, L.; Huang, Y.; Xie, X. Communication-efficient federated learning via knowledge distillation. Nat. Commun. 2022, 13, 2032. [Google Scholar] [CrossRef] [PubMed]
  20. Wang, H.P.; Stich, S.; He, Y. Communication-efficient federated learning via knowledge distillation. In Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 23034–23054. [Google Scholar]
  21. Yang, T.-J.; Guliani, D.; Beaufays, F.; Motta, G. Partial Variable Training for Efficient on-Device Federated Learning. In Proceedings of the ICASSP 2022–2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 4348–4352. [Google Scholar]
  22. Cha, H.; Park, J.; Kim, H.; Kim, S.L.; Bennis, M. Federated Reinforcement Distillation with Proxy Experience Memory. arXiv 2019, arXiv:1907.06536. [Google Scholar]
  23. Rischke, J.; Sossalla, P.; Salah, H.; Fitzek, F.H.; Reisslein, M. QR-SDN: Towards Reinforcement Learning States, Actions, and Rewards for Direct Flow Routing in Software-Defined Networks. IEEE Access 2020, 8, 174773–174791. [Google Scholar] [CrossRef]
Figure 1. The intelligent routing system model.
Figure 2. The FedRDR algorithm model.
Figure 3. The effect of the learning rate on the algorithm.
Figure 4. The impact of the data batch size on the algorithm.
Figure 5. A performance comparison of the three algorithms based on Figure 4.
Figure 6. Performance comparison of the three algorithms based on 8 switches.
Figure 7. Three-domain federated reinforcement learning routing.
Figure 8. The accelerated effect of training based on federated reinforcement learning.
Figure 9. The effect of the number of domains on the model.
Figure 10. The impact of energy consumption based on federated reinforcement learning.
Figure 11. The amount of data communicated between the server and the domain controller.
Figure 12. The effects of federated reinforcement distillation on accelerating model training.
Figure 13. A comparison of the acceleration effects of federated reinforcement learning and federated reinforcement distillation.
Figure 14. Traffic size based on federated reinforcement distillation.
Figure 15. The communication traffic between FRL and FRD.
Table 1. Experimental environment.
Type | Description
Hardware | Intel Core i7-8700 CPU @ 3.20 GHz (3.19 GHz)
Random-access memory | 8.00 GB
System type | Ubuntu 20.0.4
Development language | Python 3
Development tool | PyCharm 2019.1.1 (Professional Edition)
Virtual environment | Mininet + Ryu
Table 2. Agent model hyperparameters.
Hyperparameters | Value
Learning rate | 0.001
Batch size | 32
Memory | 10,000
Reward discount | 0.9
Max steps | 1000
Network update frequency | 100
Table 3. Traffic based on the federated reinforcement learning framework and the federated reinforcement distillation framework.
Algorithm | Amount of Communication Data per Round
Federated reinforcement learning | 63.6 KB
Federated reinforcement distillation | 45.3 KB
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
