Article

UAV-Assisted Fair Communication for Mobile Networks: A Multi-Agent Deep Reinforcement Learning Approach

1 School of Artificial Intelligence, Henan University, Zhengzhou 450046, China
2 International Joint Research Laboratory for Cooperative Vehicular Networks of Henan, Zhengzhou 450046, China
3 Department of Electrical and Computer Engineering, Queen’s University, Kingston, ON K7L 3N6, Canada
4 College of Electronic and Information Engineering, Tongji University, Shanghai 201804, China
* Author to whom correspondence should be addressed.
Remote Sens. 2022, 14(22), 5662; https://doi.org/10.3390/rs14225662
Submission received: 5 October 2022 / Revised: 30 October 2022 / Accepted: 5 November 2022 / Published: 9 November 2022
(This article belongs to the Special Issue Satellite and UAV for Internet of Things (IoT))

Abstract: Unmanned Aerial Vehicles (UAVs) can be employed as low-altitude aerial base stations (UAV-BSs) to provide communication services for ground users (GUs). However, most existing works mainly focus on optimizing coverage and maximizing throughput, without considering the fairness of the GUs in communication services. This may result in certain GUs being underserved by UAV-BSs in the pursuit of maximum throughput. In this paper, we study the problem of UAV-assisted communication with the consideration of user fairness. We first design a Ratio Fair (RF) metric by weighting fairness and throughput to evaluate the trade-off between fairness and communication efficiency when UAV-BSs serve GUs. The problem is formulated as a mixed-integer non-convex optimization problem based on the RF metric, and we propose a UAV-Assisted Fair Communication (UAFC) algorithm based on multi-agent deep reinforcement learning to maximize the fair throughput of the system. The UAFC algorithm comprehensively considers fair throughput, UAV-BS coverage, and flight status to design a reasonable reward function. In addition, the UAFC algorithm establishes an information-sharing mechanism based on gated functions by sharing neural networks, which effectively reduces the distributed decision-making uncertainty of the UAV-BSs. To reduce the impact of state dimension imbalance on the convergence of the algorithm, we design a new state decomposing and coupling actor network architecture. Simulation results show that the proposed UAFC algorithm increases fair throughput by 5.62% and 26.57% and the fair index by 1.99% and 13.82% compared to the MATD3 and MADDPG algorithms, respectively. Meanwhile, UAFC can also meet the energy consumption limitation and network connectivity requirements.

Graphical Abstract

1. Introduction

At present, the establishment and realization of mobile communication networks mainly rely on terrestrial base stations and other fixed communication equipment, which requires time-consuming network planning that accounts for many practical factors. Unmanned Aerial Vehicles (UAVs), with their high flexibility, low cost, and wide coverage, have attracted widespread attention in academia and industry [1,2], e.g., UAV-assisted base station communications [3], relay communications [4], data collection [5], and secure communications [6]. In the area of UAV-assisted communications, UAVs are mainly regarded as mobile base stations that provide high-quality communication services to ground users (GUs) [7]. The mobility and flexibility of UAV Base Stations (UAV-BSs) make it possible to establish communication connections quickly and to improve data transmission efficiency and communication range significantly [8,9,10,11]. For example, when the ground communication infrastructure is damaged by natural disasters, UAVs can be employed as temporary base stations to provide emergency communication services for GUs.
UAVs as aerial base stations have many advantages compared to terrestrial base stations. (1) UAVs can establish good Line-of-Sight (LoS) links with GUs [12]. (2) For mobile GUs, UAVs can adjust flight trajectory or follow the GUs to provide better communication services [13]. It is worth noting that due to limited communication resources and coverages, UAV-BSs cannot serve all GUs by merely optimizing the locations of UAV-BSs. Thus, UAV-BSs mainly face two challenges in UAV-assisted communication. (1) How to select the GUs to be served by optimizing the locations of the UAVs to maximize the communication efficiency (e.g., throughput). (2) How to adopt a service strategy to ensure fairness among GUs when providing communication services for multiple GUs.
For the problem of maximizing the efficiency of UAV-assisted communication, most studies regard energy efficiency and throughput as the primary optimization objectives. In [14], the authors proposed a centralized multi-agent Q-learning algorithm to maximize the energy efficiency of wireless communication. In [15], a Deep Reinforcement Learning (DRL) algorithm based on Q-learning and convolutional neural networks is proposed to maximize spectral efficiency. However, maximizing communication efficiency makes UAV-BSs tend to hover near a small subset of GUs, which interrupts the communication tasks of the other GUs that fall outside the communication range of the UAV-BSs. For the fair communication problem, previous works focus on fair coverage. For example, an algorithm based on DRL is proposed in [16] to deploy UAV-BSs, which provides fair coverage and reduces collisions among UAVs. In [17], a DRL-based control algorithm is proposed to implement energy-efficient and fair coverage with energy recharge. A mean-field game model is proposed in [18] to maximize the coverage score while ensuring a fair communication range and network connectivity. These studies [16,17,18] focus on fair coverage of ground areas or users. In the pursuit of regional fairness, the UAVs need to serve all cells as much as possible. Thus, focusing only on fair coverage causes partial task interruption and degrades communication efficiency. It is necessary to consider both communication efficiency (throughput) and user fairness in UAV-assisted communications, which has been overlooked in the literature [14,15,16,17,18].

1.1. Related Work

In this subsection, we review relevant work on UAV-assisted communication and point out its inadequacies.

1.1.1. UAV-Assisted Communication

UAV-assisted communication has been extensively researched. For example, Jeong et al. [19] proposed an optimization algorithm based on a concave–convex procedure to maximize the transmission rate by designing the flight trajectory and the transmit power of the UAV. Yin et al. [20] studied UAVs as aerial base stations serving multiple GUs and proposed a deterministic policy gradient algorithm to maximize the total uplink transmission rate. These works [19,20] focus only on the communication performance of UAV-assisted communication networks; however, the energy consumption of the UAV is also a crucial issue due to the limited onboard energy. In [21], an Actor–Critic-based deep stochastic online scheduling algorithm is proposed to minimize the overall energy consumption of the communication network by optimizing the data transmission and hovering time of the UAV. The proposed algorithm can reduce energy consumption by 29.94% and 24.84% compared to the DDPG and PPO algorithms, respectively. Yang et al. [22] investigated UAV-assisted data collection, where the data transmission rate and the energy consumption of the UAV conflict with each other during the data collection process. Theoretical expressions for the energy consumption of the UAV and GUs are derived to achieve different Pareto-optimal trade-offs, which provides new insight into the energy efficiency of future UAV-assisted communication. Zhang et al. [23] considered post-disaster rescue scenarios where energy is limited due to the collapse of the power system. To satisfy energy and obstacle constraints, a safe deep Q-learning-based UAV trajectory optimization algorithm is proposed to maximize uplink throughput. Its weakness is that it cannot be applied to larger disaster areas due to the limited communication range and onboard energy of a single UAV. The above studies [21,22,23] take full consideration of the energy consumption of the UAV during trajectory design, enabling the UAV to perform communication services with greater energy efficiency. However, these studies [19,20,21,22,23] consider single-UAV scenarios with stationary GUs. Thus, the methods designed in the above references are only applicable to small-scale, simple scenarios.
For complex scenarios, it is particularly important that multiple UAVs collaborate with each other to accomplish complex communication tasks. Shi et al. [24] proposed a dynamic deployment algorithm based on hierarchical Reinforcement Learning (RL) to maximize the long-term desired average throughput of the communication network. The proposed algorithm can increase throughput by 40%. To meet the Quality of Experience (QoE) of all GUs with limited system resources and energy, Zeng et al. [25] jointly optimized GUs scheduling, UAVs trajectory, and transmit power to maximize energy efficiency and meet GUs QoE. The proposed algorithm increases energy efficiency by 12.5% compared to the baseline algorithm. Ding et al. [26] modeled the UAVs and GUs as a hybrid cooperative-competitive game problem, maximizing throughput by simultaneously optimizing the trajectory of UAVs and the access of GUs. The above studies [24,25,26] focus on the multiple UAVs and multiple GUs scenario to maximize throughput and energy efficiency by designing the flight trajectory. However, the optimization objectives of [19,20,21,22,23,24,25,26] focus on communication performance and ignore fairness among GUs. This leads to UAVs allocating more communication resources to GUs with high throughput or following the movement of the GUs. As a result, the UAV-BSs ignore service requests of other critical GUs.

1.1.2. UAV-Assisted Fair Communication

It is also a crucial issue to consider service fairness in UAV-assisted communication. Diao et al. [27] studied the problem of fair perceptual task allocation and trajectory optimization in UAV-assisted edge computing, where the non-convex optimization problem is transformed into multiple convex sub-problems by introducing auxiliary variables, minimizing energy consumption while meeting fairness requirements. In [28], Ding et al. derived an expression for the energy consumption of the UAV. Based on the energy consumption of the UAV and the fairness of the GUs, a DRL-based algorithm is proposed to maximize the system throughput by jointly optimizing the trajectory and bandwidth allocation; the proposed method increases the fair index by 15.5%. The works in [27,28] study single-UAV-assisted fair communication, and the proposed algorithms are not applicable to multi-UAV scenarios.
For multi-UAV scenarios, Liu et al. [29] studied the fair coverage problem of UAVs and proposed a DRL-based control algorithm to maximize energy efficiency while guaranteeing communication coverage, fairness, energy consumption, and connectivity; the study improves the coverage score and fairness index by 26% and 206%, respectively. A novel distributed control scheme is proposed in [30] to deploy multiple UAVs in an area to improve coverage with minimum energy consumption and maximum fairness; the proposed algorithm covers 84.6% of the area and improves the fair index by 3% under the same network conditions. Liu et al. [31] investigated how to deploy UAVs to improve the service quality for GUs and maximize fair coverage according to the designed fair index. The above studies [29,30,31] focus on the problem of fair coverage in multi-UAV-assisted communication by optimizing the locations of the UAVs to cover the ground area in a fair manner. In contrast with the above studies, we are concerned with maximizing system throughput while considering user fairness in a multi-UAV mobile network.

1.2. Motivation and Contribution

UAV-assisted mobile wireless communication networks have many advantages compared with traditional terrestrial fixed communication infrastructures. However, UAVs acting as mobile base stations still face the following two problems. (1) Existing research mainly focuses on communication efficiency: the objective is to maximize communication metrics such as throughput, transmission rate, and energy efficiency by optimizing the flight trajectories and resource allocation of the UAVs, while ignoring the fairness among GUs. For example, suppose user A can obtain high throughput while user B has a more urgent need for communication service. If throughput is selected as the service goal, the UAVs will allocate more communication resources to user A to maximize that goal, and the service request of user B is ignored despite being more urgent. (2) Another important problem in UAV-assisted communication is the optimization of UAV locations, which is a typical sequential decision problem. The computational complexity of heuristic and convex optimization algorithms grows exponentially with the number of GUs and UAVs, which makes them unsuitable for multi-UAV and multi-user scenarios.
Motivated by communication efficiency and user fairness, this paper investigates fair communication in UAV-BSs-assisted communication systems. We propose an information sharing mechanism based on gated functions and incorporate it into Multi-Agent Deep Reinforcement Learning (MADRL) to obtain near-optimal strategies. The main contributions of this paper are summarized as follows:
  • To evaluate the trade-off between fairness and communication efficiency, we design a Ratio Fair (RF) metric by weighing fairness and throughput when UAV-BSs serve GUs. Based on the RF metric, we formulate the UAV-assisted fair communication as a non-convex problem and utilize DRL to acquire a near-optimal solution for this problem;
  • To solve the above continuous control problem with an infinite action space, we propose the UAV-Assisted Fair Communication (UAFC) algorithm. The UAFC algorithm establishes an information sharing mechanism based on gated functions to reduce the distributed decision uncertainty of UAV-BSs;
  • To address the dimension imbalance and training difficulty due to the high dimension of the state space, we design a novel actor network structure of decomposing and coupling. The actor network utilizes dimension spread and state aggregation to obtain high-quality state information.
The remainder of this paper is organized as follows. Section 2 introduces energy consumption and communication models. In Section 3, we describe the problem and formulate it as a Markov decision process. Section 4 describes the implementation process of the UAFC algorithm. The results and analyses of the experiments are presented in Section 5. We discuss some of the limitations of this paper and future research work in Section 6. Section 7 concludes our paper.
Notations: In this paper, variables are denoted by italics and vectors by bold symbols. $\|\cdot\|$ denotes the L2 norm; $\mathbb{R}^W$ denotes a W-dimensional vector space; $\{\cdot\}$ denotes a set. For convenience of reading, the important symbols are listed in Table 1 with their corresponding descriptions.

2. System Model

In this paper, we consider a wireless communication scenario in an area containing multiple UAVs and mobile GUs. The UAVs provide communication services for the GUs by optimizing their trajectories, as shown in Figure 1. Air-to-Ground (A2G) links exist between UAV-BSs and GUs. K GUs are randomly distributed, and the set of GUs is denoted as $\mathcal{K} \triangleq \{1, 2, \ldots, K\}$. The location of $GU_k$ ($k \in \mathcal{K}$) at time t is denoted as $\omega_k(t) = [x_k(t), y_k(t), 0] \in \mathbb{R}^3$. D UAVs are deployed as mobile base stations to provide communication services for the GUs, and the set of UAV-BSs is denoted as $\mathcal{D} \triangleq \{1, 2, \ldots, D\}$. In the three-dimensional Cartesian coordinate system, the location of $UAV_d$ ($d \in \mathcal{D}$) at time t is denoted as $u_d(t) = [x_d(t), y_d(t), z_d(t)] \in \mathbb{R}^3$. To avoid the additional energy overhead of climbing, we assume that the UAVs fly at a fixed altitude H, i.e., $z_d = H$ ($d \in \mathcal{D}$).

2.1. UAV Movement and Energy Consumption Model

The movement of the GUs changes the channel quality between UAVs and GUs. Thus, the positions of the UAV-BSs need to be optimized to provide better communication services. As shown in Figure 1, the coordinate of $UAV_d$ at time t is denoted as $[x_d(t), y_d(t), z_d(t)]$. $UAV_d$ flies towards its next position according to the moving distance $dist_d(t)$ and the flight angle $\vartheta_d(t)$, where $dist_d(t) \in [0, dist_{\max}]$ and $\vartheta_d(t) \in [0, 2\pi]$. Therefore, the next position of $UAV_d$ is calculated as
$$x_d(t+1) = x_d(t) + dist_d(t) \cdot \cos(\vartheta_d(t)), \quad y_d(t+1) = y_d(t) + dist_d(t) \cdot \sin(\vartheta_d(t)). \tag{1}$$
The energy consumption of UAVs mainly depends on propulsion energy consumption and communication energy consumption. According to [32], the propulsion energy consumption is related to the speed and acceleration of the UAV. To simplify the system model and reduce computational complexity, the effect of acceleration on propulsion energy consumption is ignored [33]. Over the period [0, t], the energy consumption $E_m(t)$ due to the movement of $UAV_d$ can be expressed as
$$E_m(t) = \int_0^t P_d(\tau)\, d\tau, \tag{2}$$
where $P_d(\tau)$ is the propulsion power of $UAV_d$ at time $\tau$. The remaining energy of $UAV_d$ (denoted as $E_d(t)$) is calculated as
$$E_d(t) = E_{\max} - \big(E_m(t) + E_c(t)\big), \tag{3}$$
where $E_{\max}$ is the energy of $UAV_d$ when fully charged, and $E_c(t)$ denotes the communication energy consumption of $UAV_d$ in the [0, t] period.
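The following minimal NumPy sketch illustrates Equations (1)–(3). The propulsion power, communication energy, and elapsed time used in the example are illustrative assumptions, not values from the paper.

```python
import numpy as np

def step_uav(x, y, dist, theta):
    """Eq. (1): move a UAV by dist metres along heading theta (radians)."""
    return x + dist * np.cos(theta), y + dist * np.sin(theta)

def remaining_energy(E_max, P_d, E_c, elapsed):
    """Eqs. (2)-(3) with constant propulsion power: E_d = E_max - (E_m + E_c)."""
    E_m = P_d * elapsed            # integral of a constant power over [0, t]
    return E_max - (E_m + E_c)

# Example: one UAV flying 20 m at 45 degrees, 60 s into the mission (illustrative values).
x1, y1 = step_uav(100.0, 100.0, 20.0, np.pi / 4)
E_d = remaining_energy(E_max=5e5, P_d=110.0, E_c=1.2e3, elapsed=60.0)
```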

2.2. Communication Model

The transmission links between UAVs and GUs are modeled with a probabilistic channel model [34], and the probability $P_{LoS}(t)$ of establishing a LoS connection between $UAV_d$ and $GU_k$ is given by
$$P_{LoS}(t) = \frac{1}{1 + X \exp\left\{-Y\left(\arctan\left(\frac{z_d(t)}{r_{d,k}(t)}\right) - X\right)\right\}}, \tag{4}$$
where X and Y are coefficients related to the environment, and $r_{d,k}(t)$ denotes the horizontal distance between $UAV_d$ and $GU_k$.
The path loss models of the link between $UAV_d$ and $GU_k$ are given for the LoS link and the Non-Line-of-Sight (NLoS) link (denoted as $L_{LoS}$ and $L_{NLoS}$), respectively, as follows:
$$L_{LoS} = 20\log\left(\frac{4\pi f_c d_{d,k}(t)}{c}\right) + E_{LoS}, \quad L_{NLoS} = 20\log\left(\frac{4\pi f_c d_{d,k}(t)}{c}\right) + E_{NLoS}, \tag{5}$$
where $d_{d,k}(t)$ denotes the distance between $UAV_d$ and $GU_k$, $f_c$ denotes the carrier frequency, and c is the speed of light. $E_{LoS}$ and $E_{NLoS}$ denote the additional path losses of the LoS and NLoS links [35], respectively, and $20\log\left(\frac{4\pi f_c d_{d,k}(t)}{c}\right)$ is the free-space path loss.
Therefore, the average path loss between $UAV_d$ and $GU_k$ (denoted as $PL_{d,k}(t)$) is calculated by
$$PL_{d,k}(t) = P_{LoS}(t) \times L_{LoS} + \big(1 - P_{LoS}(t)\big) \times L_{NLoS}. \tag{6}$$
In this model, a path loss threshold $\gamma_d^k$ is defined, and the link is considered broken when $PL_{d,k}(t) \geq \gamma_d^k$. Thus, a binary variable $\beta_{d,k}(t)$ is defined, which denotes the association of $UAV_d$ with $GU_k$ at time t:
$$\beta_{d,k}(t) = \begin{cases} 1, & \text{if } UAV_d \text{ is connected to } GU_k, \\ 0, & \text{otherwise}. \end{cases} \tag{7}$$
The transmission rate between $UAV_d$ and $GU_k$ (denoted as $R_d^k(t)$) is expressed as
$$R_d^k(t) = B_k \beta_{d,k}(t) \log_2\left(1 + \frac{P_t}{n_0} G_{d,k}(t)\right), \tag{8}$$
where $G_{d,k}(t) = 10^{-PL_{d,k}(t)/10}$ denotes the channel power gain, $P_t$ is the fixed transmit power of $UAV_d$, $B_k$ indicates the communication bandwidth allocated to $GU_k$, and $n_0$ represents the noise power spectral density.
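As an illustration of Equations (4)–(8), the following NumPy sketch computes the LoS probability, average path loss, and achievable rate for one UAV–GU pair. The environment coefficients, excess losses, bandwidth, transmit power, noise level, and path loss threshold are placeholder values rather than the paper's parameter settings, and the elevation angle is converted to degrees as in the common probabilistic LoS model.

```python
import numpy as np

def a2g_rate(r_xy, z_d, X=11.95, Y=0.14, f_c=2e9, E_los=1.0, E_nlos=20.0,
             P_t=0.1, B_k=1e6, n0=1e-15, gamma=110.0):
    """Average-path-loss based rate for one UAV-GU pair, Eqs. (4)-(8)."""
    c = 3e8
    d_3d = np.hypot(r_xy, z_d)                              # 3D UAV-GU distance
    theta_deg = np.degrees(np.arctan2(z_d, r_xy))           # elevation angle (degrees)
    p_los = 1.0 / (1.0 + X * np.exp(-Y * (theta_deg - X)))  # Eq. (4)
    fspl = 20.0 * np.log10(4.0 * np.pi * f_c * d_3d / c)    # free-space path loss (dB)
    L_los, L_nlos = fspl + E_los, fspl + E_nlos             # Eq. (5)
    pl = p_los * L_los + (1.0 - p_los) * L_nlos             # Eq. (6): average path loss (dB)
    beta = 1.0 if pl < gamma else 0.0                       # Eq. (7): link alive below threshold
    gain = 10.0 ** (-pl / 10.0)                             # channel power gain
    return beta * B_k * np.log2(1.0 + P_t * gain / n0)      # Eq. (8), bit/s

rate = a2g_rate(r_xy=150.0, z_d=100.0)
```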

3. Problem Formulation and Transformation

3.1. Problem Formulation

To maximize the system throughput while guaranteeing user fairness, we design an RF metric to evaluate the trade-off between fairness and communication efficiency when UAVs serve GUs. The aim is to provide communication services for GUs with high communication efficiency. The throughput priority of $GU_k$ (denoted as $f_k(t)$) is calculated by
$$f_k(t) = \frac{\bar{R}_k(t)}{\bar{T}(t)}, \tag{9}$$
where $\bar{R}_k(t)$ represents the total throughput of $GU_k$ over the period [0, t], and $\bar{T}(t)$ is the total throughput of all GUs. $\bar{R}_k(t)$ and $\bar{T}(t)$ are given by Equations (10) and (11), respectively:
$$\bar{R}_k(t) = \int_0^t R_d^k(\tau)\, d\tau, \tag{10}$$
$$\bar{T}(t) = \sum_{k=1}^{K} \bar{R}_k(t). \tag{11}$$
However, UAVs tend to follow GUs with high throughput to maximize the total throughput and ignore service requests from other GUs, leading to unfairness among GUs. Due to this unfair behavior, the communication resource allocation of the UAVs becomes unbalanced, which degrades the quality of service for the GUs. To maximize throughput while ensuring user fairness, we design the RF metric based on the priority and Jain's index [36] to measure user fairness, which can be calculated by the following formula:
$$W_k(t) = \frac{\left(\sum_{k=1}^{K} f_k(t)\right)^2}{K \sum_{k=1}^{K} f_k(t)^2}. \tag{12}$$
Then, the RF index is employed to obtain a weight coefficient that comprehensively considers fairness and priority, and the total fair throughput is defined as the weighted throughput:
$$\bar{R}_{total}(t) = \sum_{k=1}^{K} \int_0^{T_t} W_k(\tau) \bar{R}_k(\tau)\, d\tau. \tag{13}$$
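A minimal NumPy sketch of the RF metric follows, computing the throughput priority of Equation (9), Jain's index of Equation (12), and the fair throughput of Equation (13) in its discretized (per-slot) form. The per-slot throughput matrix, and reading $\bar{R}_k(n)$ as the per-slot throughput of each GU, are illustrative assumptions.

```python
import numpy as np

def fair_throughput(R_slot, delta_t=1.0):
    """R_slot: shape (N_t, K), per-slot throughput of each GU (bit/s)."""
    total = 0.0
    for n in range(R_slot.shape[0]):
        R_bar = R_slot[: n + 1].sum(axis=0) * delta_t          # Eq. (10): cumulative throughput
        f = R_bar / max(R_bar.sum(), 1e-12)                    # Eq. (9): throughput priority
        W = f.sum() ** 2 / (len(f) * (f ** 2).sum() + 1e-12)   # Eq. (12): Jain's fairness index
        total += W * R_slot[n].sum() * delta_t                 # Eq. (13): weighted fair throughput
    return total

rng = np.random.default_rng(0)
R_slot = rng.random((100, 6)) * 1e6    # 100 slots, 6 GUs (illustrative)
print(fair_throughput(R_slot))
```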
The optimization objective of this paper is to maximize the fair throughput by optimizing the locations of the UAVs (referred to as problem $P1$). Problem $P1$ can be formulated as follows:
$$\begin{aligned}
(P1): \ & \max_{\{u_d(t)\}_{d \in \mathcal{D}}} \ \bar{R}_{total}(t) = \sum_{k=1}^{K} \int_0^{T_t} W_k(\tau) \bar{R}_k(\tau)\, d\tau \\
\text{s.t.} \ & C1: E_d(0) = E_{\max},\ E_d(T_t) \geq E_{\min}, \\
& C2: \sum_{k=1}^{K} \beta_{d,k}(t) \leq 1, \quad \forall d \in \mathcal{D},\ k \in \mathcal{K}, \\
& C3: \beta_{d,k}(t) \in \{0, 1\}, \quad \forall d \in \mathcal{D},\ k \in \mathcal{K}, \\
& C4: PL_{d,k}(t) \leq \gamma_d^k, \\
& C5: \|u_i(t) - u_j(t)\|^2 \geq d_{\min}, \quad \forall i, j \in \mathcal{D},\ i \neq j, \\
& C6: x_d(t), x_k(t) \in [X_{\min}, X_{\max}], \quad \forall d \in \mathcal{D},\ k \in \mathcal{K}, \\
& C7: y_d(t), y_k(t) \in [Y_{\min}, Y_{\max}], \quad \forall d \in \mathcal{D},\ k \in \mathcal{K},
\end{aligned} \tag{14}$$
where constraint C1 represents the energy consumption constraint of the UAV-BSs. Constraints C2 and C3 indicate that a GU can only connect to one UAV-BS at time t. Constraint C4 requires that the path loss between $UAV_d$ and $GU_k$ should not be greater than the threshold, to avoid transmission interruptions. Constraint C5 represents the safe distance between $UAV_i$ and $UAV_j$. Constraints C6 and C7 represent the movement area constraints of the UAVs and GUs.

3.2. Problem Transformation

Since the locations of the UAV-BSs change continuously, the optimization variables are continuous and exhibit nonlinear coupling. To make problem $P1$ tractable, the entire task time is divided into $N_t$ timeslots, and the duration of each timeslot is $\delta_t = T_t / N_t$. Thus, the continuous optimization problem $P1$ can be transformed into the discrete problem $P2$:
$$(P2): \ \max_{\{u_d(n)\}_{d \in \mathcal{D}}} \ \bar{R}_{total}(n) = \sum_{n=1}^{N_t} \sum_{k=1}^{K} W_k(n) \bar{R}_k(n) \delta_t \quad \text{s.t. } C1\text{–}C7. \tag{15}$$
Since the constraints are non-convex, problem $P2$ is a complex non-convex optimization problem. Traditional heuristic algorithms [37,38,39] can obtain optimal strategies only at the expense of high computational complexity and are not suitable for dynamic environments. DRL is a learning-based approach in which agents obtain optimal strategies by interacting with the environment, and it requires little prior experience; thus, DRL is commonly employed to solve optimal decision problems. However, single-agent DRL algorithms are not applicable to multi-agent problems, mainly because a centralized controller is needed to collect global information and control all the agents, which increases communication costs [40]. To address this problem, an MADRL algorithm can be employed, in which each UAV acts as an agent to learn the optimal cooperative policy.
First, problem $P2$ is described as a multi-agent Markov Decision Process (MDP), which consists of five elements $\langle S, A, P, R, \gamma \rangle$ [41]: the state set S, the action set A, the state transition probability function P, the reward function R, and the reward discount factor $\gamma$. The state space, action space, and reward function are designed as follows:
State: In time slot $n \in [0, N_t]$, the state $s_d = \{\{u_d(n)\}_{d \in \mathcal{D}}, \{\omega_k(n)\}_{k \in \mathcal{K}}, \{E_d(n)\}_{d \in \mathcal{D}}\}$ consists of three parts:
  • $\{u_d(n)\}_{d \in \mathcal{D}}$ represents the coordinates of the UAVs at time slot n;
  • $\{\omega_k(n)\}_{k \in \mathcal{K}}$ represents the coordinates of the GUs at time slot n;
  • $\{E_d(n)\}_{d \in \mathcal{D}}$ represents the remaining energy of the UAVs at time slot n.
Action: In time slot $n \in [0, N_t]$, the action $a_d = \{dist_d(n), \vartheta_d(n)\}_{d \in \mathcal{D}}$ consists of two parts:
  • $dist_d(n) \in [0, V_d(t)\delta_t]$ represents the distance that $UAV_d$ flies in time slot n, where $V_d(t)$ represents the maximum flight speed;
  • $\vartheta_d(n) \in [0, 2\pi]$ represents the flight direction of $UAV_d$ in time slot n.
Reward: Since the goal of the action taken by the agents is to maximize the system reward, the setting of the reward function plays an important role in MADRL. The reward mainly includes the following three components:
  • Fair throughput $r_1 = \sum_{k=1}^{K} W_k(n) \bar{R}_k(n) \delta_t$: In the UAV-assisted fair communication problem, to trade off user fairness and communication efficiency, we define the weighted sum of the fairness index and throughput as the fair throughput and use it as part of the reward function. $W_k(n)$ is the RF metric utilized to weigh communication efficiency against communication fairness.
  • Coverage reward $r_2 = \sum_{d=1}^{D} \sum_{k=1}^{K} e_{d,k}$: To accelerate the convergence of the UAFC algorithm, we include a coverage reward in the reward function. The coverage reward is proportional to the number of GUs covered by the UAVs, where $e_{d,k} = 1$ indicates that $GU_k$ is covered by $UAV_d$, and $e_{d,k} = 0$ otherwise. Note that the coverage range is not strictly the communication range, and covering more GUs only provides a direction for the UAVs to search for the optimal strategy.
  • Punishment: The UAVs receive a large negative reward when any of the following occurs:
    (1) A UAV flies out of the mission boundary area, i.e., $x_{d,k}(t) \notin [X_{\min}, X_{\max}]$ or $y_{d,k}(t) \notin [Y_{\min}, Y_{\max}]$, where $X_{\min}$, $X_{\max}$, $Y_{\min}$, and $Y_{\max}$ bound the abscissa and ordinate of the mission area, respectively;
    (2) $UAV_i$ and $UAV_j$ collide with each other, i.e., $\|u_i(t) - u_j(t)\|^2 \leq d_{\min}$, where $d_{\min}$ represents the safety distance threshold;
    (3) The remaining energy of $UAV_d$ is lower than the threshold, i.e., $E_d \leq E_{\min}$. A binary variable $\xi_i \in \{0, 1\}$ is employed to indicate whether violation i occurs: $\xi_i = 1$ ($i \in \{1, 2, 3\}$) means that the violation occurs, and a fixed penalty $p_i$ ($i \in \{1, 2, 3\}$) is imposed on the UAVs.
In summary, the reward function is formulated as
$$r = r_1 + r_2 - \xi_1 p_1 - \xi_2 p_2 - \xi_3 p_3. \tag{16}$$
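The sketch below assembles the reward of Equation (16) from the fair-throughput term $r_1$, the coverage term $r_2$, and the three penalty flags. The penalty magnitudes and the coverage matrix are illustrative assumptions, not the values used in the experiments.

```python
def compute_reward(fair_tp, coverage, out_of_bounds, collision, low_energy,
                   p=(100.0, 100.0, 100.0)):
    """Eq. (16): r = r1 + r2 - xi1*p1 - xi2*p2 - xi3*p3."""
    r1 = fair_tp                                   # fair-throughput term r1
    r2 = sum(sum(row) for row in coverage)         # coverage term r2: number of covered GUs
    penalty = out_of_bounds * p[0] + collision * p[1] + low_energy * p[2]
    return r1 + r2 - penalty

# Example: 2 UAVs x 4 GUs coverage indicators e_{d,k}, one boundary violation.
e = [[1, 0, 1, 0], [0, 1, 0, 0]]
r = compute_reward(fair_tp=3.2, coverage=e, out_of_bounds=1, collision=0, low_energy=0)
```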
In the MDP, the UAVs aim to maximize the expected cumulative reward by optimizing the policy $\pi$, and thus problem $P2$ is rephrased as
$$\max_{\pi} \ \mathbb{E}\left(\sum_{n=1}^{N_t} r \,\Big|\, \pi, s, a\right) \quad \text{s.t. } C1\text{–}C7. \tag{17}$$

4. The UAFC Algorithm

Since multi-agent systems are sensitive to changes in the training environment [42], the policies obtained by the agents may fall into local optima. The Multi-Agent Twin Delayed Deep Deterministic policy gradient (MATD3) algorithm [43] is based on the Actor–Critic architecture and incorporates a target policy smoothing technique, which is used to compute the target Q value; this improves the accuracy of the target Q value and stabilizes the training process. Thus, the proposed UAFC algorithm employs MATD3 as its basic algorithm and adopts the MADRL framework of centralized training and distributed execution [44], as shown in Figure 2. In the centralized training stage, the observations and actions of all agents are available to the centralized critic networks, so the non-stationarity of the environment is mitigated. In the distributed execution stage, the UAVs cannot fully obtain the state information of the environment and the other agents due to their limited perception ability. The unknown state information results in policy uncertainty and makes it challenging for an agent to quickly obtain the optimal policy. To reduce the distributed decision-making uncertainty of the UAVs, an information-sharing mechanism based on gated functions is designed in the UAFC algorithm.

4.1. MATD3 Algorithm

As shown in Figure 2, agents adopt the TD3 algorithm. Two main techniques are introduced to enhance the performance of the TD3 algorithm: clipped double-Q learning and target policy smoothing.
  • Clipped Double-Q Learning: The TD3 algorithm consists of an actor network with parameter $\mu_d$ and two critic networks with parameters $\theta_d^1$ and $\theta_d^2$, respectively. We assume that the actions, states, and rewards of all agents are accessible during training. The actor network makes decisions based on local state information, and the critic networks utilize the state–action pair to learn two centralized evaluation functions $Q_d^{\theta_i}(s(t), a(t))$ ($i \in \{1, 2\}$) to evaluate the policy. To avoid overestimation of the Q value by a single critic network, the Q value is updated with the minimum of the two critic networks. Thus, the target value $y_i$ can be formulated as
$$y_i = r_i + \delta \min_{i=1,2} Q_d^{\theta_i'}(s', \tilde{a}), \quad i = 1, 2, \tag{18}$$
where $s'$ indicates the state at the next moment and $\tilde{a}$ denotes the action generated by the target actor network.
  • Target Policy Smoothing: Furthermore, clipped Gaussian noise $\xi$ is added to the target action to prevent overfitting of the Q value, which yields a smoother state–action estimate when computing the modified target action. A minimal sketch of both techniques follows.
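The following PyTorch sketch illustrates the two techniques: the target action is perturbed with clipped Gaussian noise (target policy smoothing), and the target value of Equation (18) takes the minimum of the two target critics (clipped double-Q learning). The network interfaces and hyper-parameters are assumptions for illustration; the paper's own experiments are implemented in TensorFlow.

```python
import torch

def matd3_target(r, s_next, target_actor, target_critics, gamma=0.95,
                 noise_std=0.2, noise_clip=0.5, act_low=-1.0, act_high=1.0):
    """Compute the target value y of Eq. (18) with target policy smoothing."""
    with torch.no_grad():
        a_next = target_actor(s_next)
        noise = (torch.randn_like(a_next) * noise_std).clamp(-noise_clip, noise_clip)
        a_tilde = (a_next + noise).clamp(act_low, act_high)   # smoothed target action
        q1 = target_critics[0](s_next, a_tilde)               # two target critics
        q2 = target_critics[1](s_next, a_tilde)
        return r + gamma * torch.min(q1, q2)                  # clipped double-Q target
```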

4.2. Information Sharing Mechanism Based on Gated Functions

In addition, to reduce the uncertainty of distributed decision-making, an information-sharing mechanism based on gated functions is designed, which enables the UAVs to establish state information sharing through a central memory $\mathcal{M}$ with a storage capacity of M [45]. The memory stores the collective state information $m \in \mathbb{R}^M$ of the UAVs. As shown in Figure 2, with the information-sharing mechanism, the policy of each UAV is conditioned on $s_d \times \mathcal{M}$ ($d \in \mathcal{D}$), i.e., it is determined by the observation $s_d$ and the information in the memory. Each UAV accesses the central memory to retrieve the information shared by the other UAVs before taking an action. Neural networks are utilized to build the policy networks for DRL, and gated functions are employed to characterize the information interaction between the agents and the memory.

4.2.1. Encoding and Reading Operations

The encoding and reading operations are shown in Figure 3. Each UAV maps its own state vector to an embedding vector (denoted as $e_d$) representing the state information, given by
$$e_d = \varphi_{\theta_d^e}^{enc}(s_d), \tag{19}$$
where $\varphi_{\theta_d^e}^{enc}$ is a neural network with parameters $\theta_d^e$.
After encoding the current information, the UAVs perform the reading operation to extract the associated information stored in $\mathcal{M}$. A latent vector $h_d$ is generated to learn the temporal and spatial dependency information of the embedding vector $e_d$:
$$h_d = W_d^h e_d, \quad h_d \in \mathbb{R}^H, \ W_d^h \in \mathbb{R}^{H \times E}, \tag{20}$$
where $W_d^h$ denotes the parameters of the linear mapping, H denotes the dimension of the context vector, and E denotes the dimension of the embedding vector. The state embedding vector $e_d$, the context vector $h_d$, and the content m currently in the memory $\mathcal{M}$ each contain different information.
$e_d$, $h_d$, and m are jointly employed as input to learn a gating mechanism. $k_d$ is utilized as a weighting factor to adjust the information read from the memory and is given by
$$k_d = \sigma\big(W_d^k [e_d, h_d, m]\big), \quad k_d \in [0, 1]^M, \ W_d^k \in \mathbb{R}^{M \times (E + H + M)}, \tag{21}$$
where $[e_d, h_d, m]$ denotes the concatenation of the vectors, $\sigma(\cdot)$ is the sigmoid activation function, and M represents the dimension of the content m. Thus, the information read from $\mathcal{M}$ (denoted as $m_d$) is given by
$$m_d = m \odot k_d, \tag{22}$$
where $\odot$ indicates the Hadamard product.

4.2.2. Writing Operation and Action Selection

The writing operation regulates the keeping and discarding of information through gated functions, and the framework is shown in Figure 4. $UAV_d$ obtains a candidate storage vector $c_d$ from the state embedding vector $e_d$ and the shared information m through a nonlinear mapping:
$$c_d = \tanh\big(W_d^c [e_d, m]\big), \quad c_d \in [-1, 1]^M, \ W_d^c \in \mathbb{R}^{M \times (E + M)}, \tag{23}$$
where $W_d^c$ is the network parameter. The input gate $g_d$ is employed to regulate the contents of the candidate, and $f_d$ is utilized to decide which information to keep. These operations can be expressed as
$$g_d = \sigma\big(W_d^g [e_d, m]\big), \quad g_d \in [0, 1]^M, \ W_d^g \in \mathbb{R}^{M \times (E + M)}, \qquad f_d = \sigma\big(W_d^f [e_d, m]\big), \quad f_d \in [0, 1]^M, \ W_d^f \in \mathbb{R}^{M \times (E + M)}. \tag{24}$$
Then, $UAV_d$ generates the newly updated information $m'$ by weighting the historical state information and the real-time state information:
$$m' = g_d \odot c_d + f_d \odot m. \tag{25}$$
After completing the reading and writing operations, $UAV_d$ obtains the action $a_d$, which depends on the current state and the information read from $\mathcal{M}$:
$$a_d = \pi_d^\mu(s_d, m_d). \tag{26}$$
According to the above description, the pseudo code of the reading and writing operation based on gated functions is given in Algorithm 1.
Algorithm 1 Memory-Based Reading and Writing Operations
Input: State information of the UAVs: $s_d = \{\{u_d(n)\}_{d \in \mathcal{D}}, \{\omega_k(n)\}_{k \in \mathcal{K}}, \{E_d(n)\}_{d \in \mathcal{D}}\}$;
Output: Decisions of the UAVs: $a_d = \{dist_d, \vartheta_d\}_{d \in \mathcal{D}}$;
1: Initialize the state $s_d$ and the memory $\mathcal{M}$;
2: Initialize the actor networks of $UAV_d$ with weights $\mu_d$ and $\mu_d'$, respectively;
3: for d = 1 to D do
4:    Obtain the state $s_d$ and the shared information m;
5:    Set $m_d = m$;
6:    Generate the observation encoding $e_d$ according to Equation (19);
7:    Generate the read vector $m_d$ according to Equation (22);
8:    Generate the new message $m'$ according to Equation (25);
9:    Update the information in the memory;
10:   Select the action $a_d = \pi_d^\mu(s_d, m_d)$ according to Equation (26);
11: end for
The reading and writing operations in Algorithm 1 are the core of the information-sharing mechanism. The agents utilize gated functions to select the required information from the memory based on their own observations, so unknown state information can be obtained through the reading operation. The read information and the observations are jointly used as input to the policy network; hence, the actions depend on the observations and on the state information of the other agents. As both the agents and the environment change dynamically, the information in the memory needs to be updated dynamically, and the writing operation regulates the keeping and discarding of information through gated functions. As a result, Algorithm 1 enables the sharing of state information among UAVs and avoids the policy uncertainty caused by partial state information. A minimal sketch of these operations follows.
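As an illustration of Equations (19)–(26), the following PyTorch sketch implements the gated read and write operations of Algorithm 1 as a single module. The encoder depth, the dimensions E, H, and M, and the use of bias terms are illustrative choices rather than the exact architecture of the paper.

```python
import torch
import torch.nn as nn

class GatedMemoryAccess(nn.Module):
    def __init__(self, state_dim, E=64, H=64, M=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(state_dim, E), nn.ReLU())  # Eq. (19)
        self.to_context = nn.Linear(E, H, bias=False)                     # Eq. (20)
        self.read_gate = nn.Linear(E + H + M, M)                          # Eq. (21)
        self.candidate = nn.Linear(E + M, M)                              # Eq. (23)
        self.input_gate = nn.Linear(E + M, M)                             # Eq. (24), g_d
        self.forget_gate = nn.Linear(E + M, M)                            # Eq. (24), f_d

    def forward(self, s_d, m):
        e_d = self.encoder(s_d)
        h_d = self.to_context(e_d)
        k_d = torch.sigmoid(self.read_gate(torch.cat([e_d, h_d, m], dim=-1)))
        m_d = m * k_d                                                     # Eq. (22): read
        c_d = torch.tanh(self.candidate(torch.cat([e_d, m], dim=-1)))
        g_d = torch.sigmoid(self.input_gate(torch.cat([e_d, m], dim=-1)))
        f_d = torch.sigmoid(self.forget_gate(torch.cat([e_d, m], dim=-1)))
        m_new = g_d * c_d + f_d * m                                       # Eq. (25): write
        return m_d, m_new

# Usage: each UAV reads from and writes to the shared memory before acting.
mem = GatedMemoryAccess(state_dim=20)
m = torch.zeros(1, 64)
m_d, m = mem(torch.randn(1, 20), m)
```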

4.3. The Architecture of Actor Network

Furthermore, the actor network of the UAFC algorithm consists of more than one network. The input of the actor network can be divided into three categories:
(1) The remaining energy of the UAVs ($s_e$), which determines whether the UAVs can continue the mission.
(2) The locations of the UAVs and GUs ($s_l$), which determine whether the UAVs should move to better locations to provide communication services.
(3) The information read from the memory ($m_d$), which helps the UAVs form cooperative policies.
The final actions of the UAVs depend on the combined impact of these three categories of input. If all the state information and shared information were fed directly into a single actor network, it could hardly output a desirable policy due to the imbalance and high dimension of the state information. Thus, we design a novel actor network architecture of decomposing and coupling. The architecture decouples the input vector into three categories, expands the dimension of part of the state information ($s_e$), and aggregates the three parts into a total input vector. This method of state dimension spreading and state aggregation addresses the dimension imbalance problem and reduces the state dimension to generate higher-quality policies.
The actor network architecture is shown in Figure 5. The UAVs must avoid crashes and service interruptions due to insufficient power; thus, the energy state of dimension D is very important. However, the dimension of the energy state is much smaller than that of the position information of the UAVs and GUs. This dimension imbalance makes the algorithm difficult to converge. Dimension spreading and linear mapping are utilized to process the energy state, the location state, and the information read from the memory, yielding three state vectors with the same dimension, denoted as $N_e$, $N_l$, and $N_d$, respectively. After this state decomposing and linear mapping, the input dimension is reduced. Then, network 4 combines $N_e$, $N_l$, and $N_d$ into a new vector as its input and outputs the final action. A minimal sketch of this architecture follows.
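A minimal PyTorch sketch of the decomposing-and-coupling actor follows: the energy state, the location state, and the memory read-out are mapped to equal-sized vectors $N_e$, $N_l$, and $N_d$ and then fused by a final network that outputs the flight action. Layer widths, activations, and the normalized two-dimensional output (later rescaled to $[0, dist_{\max}]$ and $[0, 2\pi]$) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DecoupledActor(nn.Module):
    def __init__(self, energy_dim, loc_dim, mem_dim, hidden=64):
        super().__init__()
        self.energy_net = nn.Sequential(nn.Linear(energy_dim, hidden), nn.ReLU())  # spread s_e -> N_e
        self.loc_net = nn.Sequential(nn.Linear(loc_dim, hidden), nn.ReLU())        # map s_l -> N_l
        self.mem_net = nn.Sequential(nn.Linear(mem_dim, hidden), nn.ReLU())        # map m_d -> N_d
        self.fuse = nn.Sequential(nn.Linear(3 * hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, 2), nn.Tanh())  # normalized (distance, angle)

    def forward(self, s_e, s_l, m_d):
        n_e, n_l, n_d = self.energy_net(s_e), self.loc_net(s_l), self.mem_net(m_d)
        return self.fuse(torch.cat([n_e, n_l, n_d], dim=-1))        # coupled state -> action

actor = DecoupledActor(energy_dim=3, loc_dim=18, mem_dim=64)
a = actor(torch.randn(1, 3), torch.randn(1, 18), torch.randn(1, 64))
```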

4.4. Training of UAV-Assisted Fair Communication

Algorithm 2 summarizes the UAFC algorithm for UAV-assisted fair communication. First, the training data are randomly sampled from the experience replay pool. $s^j$ and $s'^j$ are fed into the evaluation and target critic networks to generate the state–action value function $Q_d^{\theta_i}$ and the target state–action value function $Q_d^{\theta_i'}$, respectively. The loss function is constructed from $Q_d^{\theta_i}$ and $Q_d^{\theta_i'}$ to train the critic networks.
Algorithm 2 UAFC Algorithm
Input: State information of the UAVs: $s_d = \{\{u_d(n)\}_{d \in \mathcal{D}}, \{\omega_k(n)\}_{k \in \mathcal{K}}, \{E_d(n)\}_{d \in \mathcal{D}}\}$;
Output: Decisions of the UAVs: $a_d = \{dist_d, \vartheta_d\}_{d \in \mathcal{D}}$;
1: ⊳ Parameter initialization
2: Initialize the actor and critic network parameters $\mu_d$, $\mu_d'$, $\{\theta_d^i\}_{i=1,2}$, and $\{\theta_d^{i'}\}_{i=1,2}$, respectively;
3: Initialize the replay buffer B;
4: for each episode do
5:    ⊳ Action generation
6:    Obtain the action of $UAV_d$ from Algorithm 1;
7:    Set $s = (s_1, s_2, \ldots, s_D)$ and $\Phi = (m_1, m_2, \ldots, m_D)$;
8:    ⊳ Experience storage
9:    $UAV_d$ takes the selected actions $a = (a_1, a_2, \ldots, a_D)$;
10:   $UAV_d$ obtains the reward R, and the state s transfers to the new state $s'$;
11:   The experience $(s, a, \Phi, R, s')$ is stored in the replay pool B;
12:   ⊳ Parameter updating
13:   for d = 1 to D do
14:      Sample a random mini-batch $(s^j, a^j, \Phi^j, R^j, s'^j)$ from B;
15:      Update the weights $\{\theta_d^i\}_{i=1,2}$ of the evaluation critic networks by minimizing the loss function $Loss(\theta_d^i)$ according to Equation (28);
16:      Update the weights $\mu_d$ of the evaluation actor network according to Equation (30);
17:      Update the weights of the three target networks according to Equations (31) and (32);
18:   end for
19: end for
Initialization (lines 2–3): During the centralized training phase, the actor and critic network parameters are randomly initialized. Furthermore, the two storage spaces, the experience replay pool and the memory, are initialized.
Action generation (lines 4–7): Each UAV obtains its action from the policy function $\pi_d^\mu(s_d, m_d)$ based on its current observation $s_d$ and the information $m_d$ shared by the other UAVs.
Experience storage (lines 9–11): The experience of each UAV can be expressed as a tuple $(s_d, a_d, m_d, R_d, s_d')$. After performing the action, $UAV_d$ obtains the reward $R_d$, and the current state transfers to the new state at the next moment. Finally, the experience $(s_d, a_d, m_d, R_d, s_d')$ is stored in the replay pool B with a capacity of $M_r$.
Parameter update (lines 13–17): During the training process, experiences $(s^j, a^j, m^j, R^j, s'^j)$ of mini-batch size $M_b$ are randomly sampled from the experience replay pool. The evaluation actor network generates a policy $\pi_d^\mu(s^j, m^j)$ according to $s^j$ and $m^j$. The parameters of the evaluation actor network are updated according to the following policy gradient [46]:
$$\nabla_{\mu_d} J(\mu_d) = \frac{1}{M_b} \sum_{j=1}^{M_b} \nabla_{\mu_d} \pi_d^\mu(s_d^j, m_d^j)\, \nabla_{a_d} Q_d^{\theta_1}(s^j, a_1^j, a_2^j, \ldots, a_D^j)\Big|_{a_d = \pi_d^\mu(s_d^j, m_d^j)}. \tag{27}$$
Based on the policy $\pi_d^\mu(s^j, m^j)$, two Q values, i.e., $Q_d^{\theta_1}(s^j, \pi_d^\mu(s^j, m^j))$ and $Q_d^{\theta_2}(s^j, \pi_d^\mu(s^j, m^j))$, are obtained from the two evaluation critic networks. The parameters of the critic networks are updated by minimizing the loss function $Loss(\theta_d^i)$:
$$Loss(\theta_d^i) = \frac{1}{M_b} \sum_{j=1}^{M_b} \left[ y_i - Q_d^{\theta_i}(s^j, a^j) \right]^2, \quad i = 1, 2. \tag{28}$$
According to the above loss function, each UAV updates its three evaluation networks:
$$\theta_d^i \leftarrow \theta_d^i - \lambda \cdot \nabla_{\theta_d^i} L(\theta_d^i), \quad i = 1, 2, \tag{29}$$
$$\mu_d \leftarrow \mu_d + \lambda \cdot \nabla_{\mu_d} J(\mu_d), \tag{30}$$
where $\lambda$ denotes the learning rate. The target network parameters are updated as follows:
$$\mu_d' = u \cdot \mu_d + (1 - u) \cdot \mu_d', \tag{31}$$
$$\theta_d^{i'} = u \cdot \theta_d^i + (1 - u) \cdot \theta_d^{i'}, \quad i = 1, 2, \tag{32}$$
where u denotes the updating rate.
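The sketch below illustrates the update step of lines 13–17 of Algorithm 2 for a single agent: critic regression against the smoothed, clipped double-Q target (Equations (18) and (28)–(29)), the deterministic policy gradient update of the actor (Equations (27) and (30)), and the soft target updates (Equations (31) and (32)). The network signatures, optimizers, and hyper-parameters are assumptions for illustration only.

```python
import torch

def update_agent(batch, actor, critics, target_actor, target_critics,
                 actor_opt, critic_opts, gamma=0.95, tau=0.01):
    """One UAFC-style update for a single agent (lines 13-17 of Algorithm 2)."""
    s, a, m, r, s_next = batch                     # tensors sampled from the replay pool
    with torch.no_grad():                          # Eq. (18): smoothed, clipped double-Q target
        a_next = target_actor(s_next, m)
        noise = (torch.randn_like(a_next) * 0.2).clamp(-0.5, 0.5)
        a_tilde = (a_next + noise).clamp(-1.0, 1.0)
        y = r + gamma * torch.min(target_critics[0](s_next, a_tilde),
                                  target_critics[1](s_next, a_tilde))
    for critic, opt in zip(critics, critic_opts):  # Eqs. (28)-(29): critic regression
        loss = ((y - critic(s, a)) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    actor_loss = -critics[0](s, actor(s, m)).mean()   # Eqs. (27), (30): ascend Q via -Q descent
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    with torch.no_grad():                          # Eqs. (31)-(32): soft target updates
        for net, tgt in [(actor, target_actor), *zip(critics, target_critics)]:
            for p, tp in zip(net.parameters(), tgt.parameters()):
                tp.data.mul_(1.0 - tau).add_(tau * p.data)
```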

4.5. Complexity Analysis

We evaluate the efficiency of the UAV-assisted fair communication algorithm through a complexity analysis. The nonlinear mapping from states to actions is achieved by deep neural networks in both the offline training and online execution phases. The actor and critic networks contain J and F fully connected layers, respectively. Thus, the time complexity of the UAFC algorithm (denoted as $T_{UAFC}$) is given by
$$T_{UAFC} = 2 \times \sum_{j=1}^{J} U_{actor,j} \cdot U_{actor,j+1} + 4 \times \sum_{f=1}^{F} U_{critic,f} \cdot U_{critic,f+1} = O\Big(\sum_{j=1}^{J} U_{actor,j} \cdot U_{actor,j+1} + \sum_{f=1}^{F} U_{critic,f} \cdot U_{critic,f+1}\Big), \tag{33}$$
where $U_{actor,j}$ represents the number of neurons in the j-th layer of the actor network, and $U_{critic,f}$ represents the number of neurons in the f-th layer of the critic network.
A fully connected layer with P inputs and Q outputs has a $P \times Q$ weight matrix and a bias of size Q. Therefore, the number of storage units required by a fully connected layer is $(P + 1) \times Q$, and the space complexity of the networks is $O(G)$, where G is the total number of network parameters. In addition, storage space must be allocated to the experience replay pool and the memory during training, with space complexities $O(M_r)$ and $O(M)$, respectively. Hence, the space complexity of the UAFC algorithm (denoted as $S_{UAFC}$) is formulated as
$$S_{UAFC} = \sum_{j=1}^{J} (U_{actor,j} + 1) \cdot U_{actor,j+1} + 2 \times \sum_{f=1}^{F} (U_{critic,f} + 1) \cdot U_{critic,f+1} + M_r + M = O(G) + O(M_r) + O(M). \tag{34}$$
In the distributed execution stage, only the trained actor network is needed. Thus, the space complexity of the execution phase is
$$O\Big(\sum_{j=1}^{J} U_{actor,j} \cdot U_{actor,j+1}\Big) + O(M), \tag{35}$$
and the time complexity is
$$O\Big(\sum_{j=1}^{J} U_{actor,j} \cdot U_{actor,j+1}\Big). \tag{36}$$
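For a concrete (illustrative) choice of layer widths, the small helper below evaluates the layer-width sums appearing in the complexity expressions above; the layer sizes are assumptions, not the sizes used in the paper.

```python
def fc_macs(widths):
    """Sum of U_j * U_{j+1} over consecutive fully connected layers."""
    return sum(a * b for a, b in zip(widths[:-1], widths[1:]))

actor_widths = [84, 64, 64, 2]       # illustrative layer sizes
critic_widths = [90, 64, 64, 1]
T_uafc = 2 * fc_macs(actor_widths) + 4 * fc_macs(critic_widths)            # counts in Eq. (33)
S_net = sum((a + 1) * b for a, b in zip(actor_widths[:-1], actor_widths[1:])) \
        + 2 * sum((a + 1) * b for a, b in zip(critic_widths[:-1], critic_widths[1:]))
print(T_uafc, S_net)
```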

5. Performance Evaluation

In this section, we introduce the detailed settings of the algorithm and simulation parameters, and conduct extensive simulation experiments to verify the effectiveness of the UAFC algorithm.

5.1. Simulation Settings

We verify the performance of the UAFC algorithm through extensive experiments. The experimental platform is built on an Intel Core i9-11900H CPU, an NVIDIA GeForce RTX 3090 GPU, and TensorFlow-CPU 1.14. GUs are randomly deployed in a target area (500 m × 500 m) and move with random directions and speeds. The UAVs initialize their positions randomly to provide communication services for the GUs. The experimental parameters are shown in Table 2. Two metrics are chosen for performance evaluation: the fairness index $W_k$ and the fair throughput, expressed in Equations (12) and (13), respectively.

5.2. Training Results

To verify the impact of the RF metric and of state decomposing and coupling on algorithm performance, we compare the accumulative reward, fair throughput, and fair index of the UAFC algorithm with those of the UAFC-NSDC and UAFC-NRF algorithms.
  • No state decomposing and coupling (UAFC-NSDC): Compared with the UAFC algorithm, this algorithm feeds the complete state information directly into the actor network without employing state decomposing and coupling.
  • No RF (UAFC-NRF): The UAFC-NRF algorithm maximizes throughput while ignoring the fairness of the GUs in communication services.
From Figure 6a, we can observe that the UAFC algorithm achieves a higher accumulative reward than the other two algorithms. This is because the UAFC algorithm takes GU fairness into consideration, serving more GUs and obtaining a higher coverage reward. Furthermore, the UAFC algorithm utilizes state decomposing and coupling to eliminate the influence of state dimension imbalance. Thus, the UAVs can obtain high-quality policies and achieve a higher reward.
From Figure 6b, we can observe that the UAFC-NRF algorithm converges to its optimal value quickly. This is because the UAFC-NRF algorithm ignores fairness among GUs, so the UAVs tend to hover close to a subset of GUs to achieve higher throughput. Compared to the UAFC-NRF algorithm, the throughput values of both UAFC and UAFC-NSDC are lower, since these two algorithms trade off communication efficiency and fairness. To ensure fair communication services for the GUs, part of the throughput is sacrificed, especially when the number of UAVs is limited.
The mean fairness index is employed as the evaluation indicator. From Figure 6c, we can observe that the mean fair index of the UAFC algorithm outperforms the other two algorithms, because UAFC considers the fairness of GUs and the state information is involved into the actor network after state decomposing and coupling.

5.3. Performance Comparisons with Two Related Algorithms

In this subsection, we compare the accumulative reward, fair throughput, and fair index of the UAFC algorithm with another two existing algorithms.
  • MADDPG: The MADDPG algorithm in [47] is utilized as the benchmark for designing the trajectory of UAV-BSs to maximize throughput without considering fairness.
  • MATD3: The MATD3 algorithm in [48] is employed as a UAV trajectory planning and resource allocation algorithm based on MATD3 to minimize the time delay and energy consumption of UAV tasks.
Figure 7 shows the convergence curves of the UAFC algorithm and the two baseline algorithms in terms of accumulative reward, total fair throughput, and mean fair index, respectively.
  • Figure 7a shows the results for accumulative reward. The reward value of UAFC converges to around 6500 at 30,000 episodes, and UAFC outperforms both MATD3 and MADDPG. This is because the information-sharing mechanism makes full use of the shared state information among UAVs, which is conducive to finding the optimal policy and avoiding local optima. Furthermore, the training curves of the three algorithms show obvious oscillations, because MADRL differs from supervised learning in that it has no clear label information.
  • Figure 7b shows the convergence curves of fair throughput. We can observe that the total fair throughput of the MATD3 algorithm is better than that of the UAFC algorithm in the first 25,000 episodes. This is because the actor network of UAFC contains reading and writing operations, which are implemented through a multi-layer fully connected (FC) neural network. In the FC network, the gradient becomes smaller as it propagates backwards through the hidden layers, which means that neurons in the earlier hidden layers learn more slowly than those in the later hidden layers. Thus, UAFC is harder to train than MATD3. The total fair throughput curve of UAFC converges at 40,000 episodes and outperforms both MADDPG and MATD3 as training proceeds. This is because: (1) the UAVs share state information with each other to obtain the optimal policy; (2) the state information is processed by state decomposing and coupling, so the actor network receives a low-dimensional state vector that includes the complete state information and the information read from $\mathcal{M}$.
  • The experimental results of the three algorithms on the mean fair index are shown in Figure 7c. In the first 20,000 episodes, the UAFC algorithm fluctuates greatly, and its value is lower than those of the MATD3 and MADDPG algorithms. By decomposing and coupling the input of the actor network, the overall actor network has more layers and a more complex structure; furthermore, the distribution of GUs keeps changing, so it is difficult for the UAVs to obtain the best cooperative strategy early in training. Both MATD3 and MADDPG converge to local optima quickly, which can also be seen from their final convergence values. The fairness index of UAFC keeps increasing and finally converges to 0.72. Compared with the MADDPG and MATD3 algorithms, the mean fairness index of the UAFC algorithm is improved by 18.13% and 6.31%, respectively.
To demonstrate the performance of the UAFC algorithm more intuitively, we present the results of the UAFC algorithm and the four compared algorithms on the evaluation metrics in Table 3. It can be seen that the UAFC algorithm outperforms the comparison algorithms in terms of reward and fair index. Table 4 shows a comparison of the results on reward, fair throughput, and fair index. Compared with UAFC-NRF, UAFC shows a 9.09% decrease in fair throughput, mainly because the UAFC-NRF algorithm does not consider fairness among GUs and its UAVs can always serve the GUs with high throughput. The results also show that the UAFC algorithm obtains a higher fair index by sacrificing part of the throughput.
Figure 8 shows the impact of GU numbers on system performance. The results indicate the following:
  • From Figure 8a, we can observe that the average fair throughput of the three algorithms increases as the number of GUs increases. Since the number of GUs served by the UAVs grows with the total number of GUs, the average fair throughput shows an upward tendency. It is worth noting that, as the number of GUs increases, the UAFC algorithm outperforms both MADDPG and MATD3 in fair throughput. This is because each UAV can obtain the state information of the other UAVs and performs state decomposing and coupling, so the UAVs obtain complete state information about the environment and the other UAVs;
  • Figure 8b shows the changing trend of the fairness index of the three algorithms as the number of GUs increases. As the number of mobile GUs increases, the fairness of the three algorithms does not change much. The fairness index of the MADDPG algorithm is the lowest, because the absence of target policy smoothing regularization in MADDPG leads to convergence to a local optimum. Furthermore, the UAFC algorithm numerically outperforms the MATD3 algorithm in fairness index, since the UAVs can provide fair communication services for the GUs in a collaborative way by sharing their state information.
Figure 9 shows the impact of number of UAV-BSs on system performance. From the results, we can conclude that:
  • More GUs can be served as the number of UAVs increases; thus, the fair throughput of the three algorithms gradually increases. The UAFC algorithm outperforms both MADDPG and MATD3 in terms of fair throughput and can therefore provide fairer communication services. This is because: (1) the information-sharing mechanism reduces the distributed decision-making uncertainty of the UAVs, so the UAFC algorithm still performs well when the number of UAVs increases; (2) the dimension of the state information increases with the number of UAVs, and state decomposing and coupling addresses the dimension imbalance problem and reduces the state dimension to generate higher-quality policies;
  • As shown in Figure 9b, the fair index of the three algorithms increases as the number of UAVs increases. The reason is that more UAVs provide wider coverage and can cover almost all GUs. When the number of UAVs is 3, the UAFC algorithm improves the fairness index by 16.4% and 9.6% compared to the MADDPG and MATD3 algorithms, respectively. This is because the information-sharing mechanism and the actor network architecture help the UAVs make better decisions.

6. Discussion

In this section, we discuss some of the limitations of this paper. The main limitations are as follows: (1) We model the UAV-assisted fair communication problem as a complex non-convex optimization problem, which is NP-hard, so it is difficult to find an analytic solution. We utilize a DRL algorithm to solve it; the DRL algorithm cannot guarantee an optimal solution, but it can be trained to obtain an approximately optimal solution, and the experimental results show that our algorithm is more effective than the other algorithms. (2) UAVs can flexibly adjust their positions to establish good LoS communication links with GUs and provide a reliable wireless communication environment. However, in some complex environments (e.g., urban scenarios), the LoS link between the UAVs and the GUs will inevitably be blocked by tall buildings or trees, affecting the communication quality. In this paper, we assume that the links between the UAVs and the GUs are unaffected by obstructions. (3) In addition, UAVs carry very limited energy due to their limited size. This paper considers the residual energy of the UAVs only as a constraint and does not design methods to extend the flight time of the UAVs, such as utilizing wireless power transfer technology to recharge them. In future work, we will consider building more realistic mathematical models and designing more accurate solution algorithms.

7. Conclusions

UAV-assisted communication is expected to be a suitable method for wireless communication. In this paper, we have studied the problem of UAV-assisted communication with the consideration of user fairness. First, a novel metric to evaluate the trade-off between fairness and communication efficiency is presented to maximize the fair system throughput while ensuring user fairness. Then, the UAV-assisted fair communication problem is modeled as a mixed-integer non-convex optimization problem. We reformulate the problem as an MDP and propose the UAFC algorithm based on MADRL. Further, inspired by communication among agents, an information-sharing mechanism based on gated functions is designed to reduce the distributed decision-making uncertainty of the UAVs. To solve the problem of state dimension imbalance, a new actor network architecture is designed to reduce the impact of dimension imbalance and the curse of dimensionality on policy search through dimension expansion and linear mapping techniques. Finally, we have verified the effectiveness of the proposed algorithm through extensive experiments. Simulation results show that the proposed UAFC algorithm increases fair throughput by 5.62% and 26.57% and the fair index by 1.99% and 13.82% compared to the MATD3 and MADDPG algorithms, respectively. Intelligent Reflecting Surfaces (IRSs) are a new 6G technology that has received widespread academic attention, and combining IRS with wireless power and information transmission technology is a promising option to further improve the performance of UAV-assisted communication. In the future, we will extend this paper to design a novel IRS-based algorithm.

Author Contributions

Conceptualization, Y.Z.; methodology, Y.Z.; software, Z.J.; validation, Z.J., H.S., and Z.W.; formal analysis, H.S.; investigation, Z.J.; resources, Y.Z.; writing—original draft preparation, Y.Z.; writing—review and editing, H.S., N.L., and F.L.; visualization, Z.W.; supervision, N.L.; project administration, F.L.; funding acquisition, Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by National Natural Science Foundation of China (No. 62176088), the Program for Science & Technology Development of Henan Province (Nos. 212102210412, 222102210067, 222102210022), and Young Elite Scientist Sponsorship Program by Henan Association for Science and Technology (No. 2022HYTP013).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

UAVs: Unmanned Aerial Vehicles
GUs: Ground Users
RF: Ratio Fair
UAFC: UAV-Assisted Fair Communication
MADRL: Multi-Agent Deep Reinforcement Learning
UAV-BSs: UAV Base Stations
DRL: Deep Reinforcement Learning
A2G: Air-to-Ground
LoS: Line-of-Sight
NLoS: Non-Line-of-Sight
MDP: Markov Decision Process
MATD3: Multi-Agent Twin Delayed Deep Deterministic policy gradient

References

1. Xiao, Z.; Zhu, L.; Liu, Y.; Yi, P.; Zhang, R.; Xia, X.G.; Schober, R. A survey on millimeter-wave beamforming enabled UAV communications & networking. IEEE Commun. Surv. Tutor. 2022, 24, 557–610.
2. Liu, Y.; Yan, J.; Zhao, X. Deep reinforcement learning based latency minimization for mobile edge computing with virtualization in maritime UAV communication network. IEEE Trans. Veh. Technol. 2022, 71, 4225–4236.
3. Zhang, C.; Zhang, L.; Zhu, L.; Zhang, T.; Xiao, Z.; Xia, X. 3D deployment of multiple UAV-mounted base stations for UAV communications. IEEE Trans. Commun. 2021, 69, 2473–2488.
4. Zhong, X.; Guo, Y.; Li, N.; Chen, Y. Joint optimization of relay deployment, channel allocation, and relay assignment for UAVs-aided D2D networks. IEEE Trans. Commun. 2020, 28, 804–817.
5. Zhang, J.; Zeng, Y.; Zhang, R. Multi-antenna UAV data harvesting: Joint trajectory and communication optimization. J. Commun. Inf. Netw. 2020, 5, 86–99.
6. Lu, W.; Ding, Y.; Gao, Y.; Su, H.; Wu, Y.; Zhao, N.; Gong, Y. Resource and trajectory optimization for secure communications in dual unmanned aerial vehicle mobile edge computing systems. IEEE Trans. Commun. 2022, 18, 2704–2713.
7. Al-Ahmed, S.A.; Shakir, M.Z.; Zaidi, S.A.R. Optimal 3D UAV base station placement by considering autonomous coverage hole detection, wireless backhaul and user demand. J. Commun. Netw. 2020, 22, 467–475.
8. Hao, C.; Chen, Y.; Mai, Z.; Chen, G.; Yang, M. Joint optimization on trajectory, transmission and time for effective data acquisition in UAV-enabled IoT. IEEE Trans. Veh. Technol. 2022, 71, 7371–7384.
9. Liu, Y.; Xiong, K.; Lu, Y.; Ni, Q.; Fan, P.; Letaief, K.B. UAV-aided wireless power transfer and data collection in Rician fading. IEEE J. Sel. Areas Commun. 2021, 39, 3097–3113.
10. Li, X.; Yao, H.; Wang, J.; Xu, X.; Jiang, C.; Hanzo, L. A near-optimal UAV-aided radio coverage strategy for dense urban areas. IEEE Trans. Veh. Technol. 2019, 68, 9098–9109.
11. Zhang, X.; Duan, L. Energy-saving deployment algorithms of UAV swarm for sustainable wireless coverage. IEEE Trans. Veh. Technol. 2020, 69, 10320–10335.
12. Alkama, D.; Ouamri, M.A.; Alzaidi, M.S.; Shaw, R.N.; Azni, M.; Ghoneim, S.S.M. Downlink performance analysis in MIMO UAV-cellular communication with LOS/NLOS propagation under 3D beamforming. IEEE Access 2022, 10, 6650–6659.
13. Wu, Z.; Yang, Z.; Yang, C.; Lin, J.; Liu, Y.; Chen, X. Joint deployment and trajectory optimization in UAV-assisted vehicular edge computing networks. J. Commun. Netw. 2022, 24, 47–58.
14. Wang, L.; Zhang, H.; Guo, S.; Yuan, D. Deployment and association of multiple UAVs in UAV-assisted cellular networks with the knowledge of statistical user position. IEEE Wirel. Commun. 2022, 21, 6553–6567.
15. Wang, C.; Deng, D.; Xu, L.; Wang, W. Resource scheduling based on deep reinforcement learning in UAV assisted emergency communication networks. IEEE Trans. Commun. 2022, 70, 3834–3848.
16. Abeywickrama, H.V.; He, Y.; Dutkiewicz, E.; Jayawickrama, B.A.; Mueck, M. A reinforcement learning approach for fair user coverage using UAV mounted base stations under energy constraints. IEEE Open J. Veh. Technol. 2020, 1, 67–81.
17. Qi, H.; Hu, Z.; Huang, H.; Wen, X.; Lu, Z. Energy efficient 3-D UAV control for persistent communication service and fairness: A deep reinforcement learning approach. IEEE Access 2020, 8, 53172–53184.
18. Chen, D.; Qi, Q.; Zhuang, Z.; Wang, J.; Liao, J.; Han, Z. Mean field deep reinforcement learning for fair and efficient UAV control. IEEE Internet Things J. 2021, 8, 813–828.
19. Jeong, C.; Chae, S.H. Simultaneous wireless information and power transfer for multiuser UAV-enabled IoT networks. IEEE Internet Things J. 2021, 8, 8044–8055.
20. Yin, S.; Zhao, S.; Zhao, Y.; Yu, F.R. Intelligent trajectory design in UAV-aided communications with reinforcement learning. IEEE Trans. Veh. Technol. 2019, 68, 8227–8231.
21. Yuan, Y.; Lei, L.; Vu, T.X.; Chatzinotas, S.; Sun, S.; Ottersten, B. Energy minimization in UAV-aided networks: Actor-critic learning for constrained scheduling optimization. IEEE Trans. Veh. Technol. 2021, 70, 5028–5042.
22. Yang, D.; Wu, Q.; Zeng, Y. Energy tradeoff in ground-to-UAV communication via trajectory design. IEEE Trans. Veh. Technol. 2018, 67, 6721–6726.
23. Zhang, T.; Lei, J.; Liu, Y.; Feng, C.; Nallanathan, A. Trajectory optimization for UAV emergency communication with limited user equipment energy: A safe-DQN approach. IEEE Trans. Green Commun. Netw. 2021, 5, 1236–1247.
24. Shi, W.; Li, J.; Wu, H.; Zhou, C.; Cheng, N.; Shen, X. Drone-cell trajectory planning and resource allocation for highly mobile networks: A hierarchical DRL approach. IEEE Internet Things J. 2021, 8, 9800–9813.
25. Zeng, F.; Hu, Z.; Xiao, Z.; Jiang, H.; Zhou, S.; Liu, W.; Liu, D. Resource allocation and trajectory optimization for QoE provisioning in energy-efficient UAV-enabled wireless networks. IEEE Trans. Veh. Technol. 2020, 69, 7634–7647.
26. Ding, R.; Xu, Y.; Gao, F.; Shen, X. Trajectory design and access control for air–ground coordinated communications system with multiagent deep reinforcement learning. IEEE Internet Things J. 2022, 9, 5785–5798.
27. Diao, X.; Zheng, J.; Cai, Y.; Wu, Y.; Anpalagan, A. Fair data allocation and trajectory optimization for UAV-assisted mobile edge computing. IEEE Commun. Lett. 2019, 23, 2357–2361.
28. Ding, R.; Gao, F.; Shen, X.S. 3D UAV trajectory design and frequency band allocation for energy-efficient and fair communication: A deep reinforcement learning approach. IEEE Trans. Wirel. Commun. 2020, 19, 7796–7809.
29. Liu, C.H.; Chen, Z.; Tang, J.; Xu, J.; Piao, C. Energy-efficient UAV control for effective and fair communication coverage: A deep reinforcement learning approach. IEEE J. Sel. Areas Commun. 2018, 36, 2059–2070.
30. Nemer, I.A.; Sheltami, T.R.; Belhaiza, S.; Mahmoud, A.S. Energy-efficient UAV movement control for fair communication coverage: A deep reinforcement learning approach. Sensors 2022, 22, 1919.
31. Liu, Y.; Huangfu, W.; Zhou, H.; Zhang, H.; Liu, J.; Long, K. Fair and energy-efficient coverage optimization for UAV placement problem in the cellular network. IEEE Trans. Commun. 2022, 70, 4222–4235.
32. Zeng, Y.; Xu, J.; Zhang, R. Energy minimization for wireless communication with rotary-wing UAV. IEEE Trans. Wirel. Commun. 2019, 18, 2329–2345.
33. Cheng, Z.; Hong, L. Energy minimization in internet-of-things system based on rotary-wing UAV. IEEE Wirel. Commun. Lett. 2019, 8, 1341–1344.
34. Al-Hourani, A.; Kandeepan, S.; Lardner, S. Optimal LAP altitude for maximum coverage. IEEE Wirel. Commun. Lett. 2014, 3, 569–572.
35. Al-Hourani, A.; Kandeepan, S.; Jamalipour, A. Modeling air-to-ground path loss for low altitude platforms in urban environments. In Proceedings of the 2014 IEEE Global Communications Conference, Austin, TX, USA, 8–12 December 2014.
36. Jain, R.K.; Chiu, D.M.W.; Hawe, W.R. A quantitative measure of fairness and discrimination for resource allocation in shared computer systems. arXiv 1998, arXiv:cs/9809099.
37. Lin, L.; Goodrich, M.A. Hierarchical heuristic search using a Gaussian mixture model for UAV coverage planning. IEEE Trans. Cybern. 2014, 44, 2532–2544.
38. Wang, H.; Wang, J.; Ding, G.; Chen, J.; Gao, F.; Han, Z. Completion time minimization with path planning for fixed-wing UAV communications. IEEE Trans. Wirel. Commun. 2019, 18, 3485–3499.
39. Dong, L.; Liu, Z.; Jiang, F.; Wang, K. Joint optimization of deployment and trajectory in UAV and IRS-assisted IoT data collection system. IEEE Internet Things J. 2022, 9, 21583–21593.
40. Zhang, W.; Yang, D.; Wu, W.; Peng, H.; Zhang, N.; Zhang, H.; Shen, X. Optimizing federated learning in distributed industrial IoT: A multi-agent approach. IEEE J. Sel. Areas Commun. 2021, 39, 3688–3703.
41. Meshgi, H.; Zhao, D. Opportunistic scheduling for a two-way relay network using Markov decision process. IET Commun. 2016, 10, 1846–1854.
42. Papoudakis, G.; Christianos, F.; Rahman, A.; Albrecht, S.V. Dealing with non-stationarity in multi-agent deep reinforcement learning. arXiv 2019, arXiv:1906.04737.
43. Ackermann, J.; Gabler, V.; Osa, T.; Sugiyama, M. Reducing overestimation bias in multi-agent domains using double centralized critics. arXiv 2019, arXiv:1910.01465.
44. Yuan, T.; Neto, W.d.R.; Rothenberg, C.E.; Obraczka, K.; Barakat, C.; Turletti, T. Dynamic controller assignment in software defined internet of vehicles through multi-agent deep reinforcement learning. IEEE Trans. Netw. Serv. Manag. 2021, 18, 585–596.
45. Pesce, E.; Montana, G. Improving coordination in small-scale multi-agent deep reinforcement learning through memory-driven communication. In Proceedings of the Conference and Workshop on Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018.
46. Ding, F.; Xu, L.; Meng, D.; Jin, X.B.; Alsaedi, A.; Hayat, T. Gradient estimation algorithms for the parameter identification of bilinear systems using the auxiliary model. J. Comput. Appl. Math. 2020, 369, 112575–112588.
47. Zhao, N.; Liu, Z.; Cheng, Y. Multi-agent deep reinforcement learning for trajectory design and power allocation in multi-UAV networks. IEEE Access 2020, 8, 139670–139679.
48. Zhao, N.; Ye, Z.; Pei, Y.; Liang, Y.C.; Niyato, D. Multi-agent deep reinforcement learning for task offloading in UAV-assisted mobile edge computing. IEEE Trans. Wirel. Commun. 2022, 21, 6949–6960.
Figure 1. Multi-UAV-assisted communication scenario.
Figure 2. The architecture of the UAFC algorithm.
Figure 3. Encoding and reading operations.
Figure 4. The writing operation framework.
Figure 5. The network architecture of the actor in the UAFC algorithm.
Figure 6. The training process of the UAFC, UAFC-NSDC, and UAFC-NRF algorithms. (a) Accumulative reward per episode. (b) Total fair throughput per episode. (c) Mean fair index per episode.
Figure 7. The training process of the UAFC, MATD3, and MADDPG algorithms. (a) Accumulative reward per episode. (b) Total fair throughput per episode. (c) Mean fair index per episode.
Figure 8. The impact of the number of GUs on system performance. (a) Average fair throughput. (b) Average fair index.
Figure 9. The impact of the number of UAV-BSs on system performance. (a) Average fair throughput. (b) Average fair index.
Table 1. Table of Important Symbols.

Symbol | Description
D, D, d | Set, number, and index of UAVs
H | The hover altitude of the UAVs
K, K, k | Set, number, and index of GUs
d_min | Minimum safe distance
u_d(t), ω_k(t) | Location of the UAVs and GUs
s_d, a_d | The states and actions of UAV d
dist_d(t), ϑ_n(t) | Flight distance and flight direction
r | Reward value
E_m(t) | Propulsion energy consumption
π_d^μ, Q_d^{θ_i} | Actor and critic networks
E_c(t) | Communication energy consumption
π_d^{μ′}, Q_d^{θ_i′} | Target actor and critic networks
E_max, E_d(t) | Maximum energy and residual energy of UAV d
μ_n, μ_n′ | Parameters of the actor and target actor networks
f_c, P_t | Carrier frequency and transmit power
θ_n^1, θ_n^2 | Parameters of the critic and target critic networks
X, Y | The parameters of the path loss model
λ, u | Learning and updating rates
E_LoS, E_NLoS | Additional path loss for LoS and NLoS links
M_b, M_r | Buffer size and mini-batch size
Table 2. Simulation Settings.

Parameter | Value
Number of UAVs (D) | {2, 3}
Number of GUs (K) | {10∼15}
Carrier frequency (f_c) | 2.4 GHz
Maximum and minimum energy of UAVs (E_max, E_min) | 500 kJ, 50 kJ
Parameters of the channel model (X, Y) | 4.88, 0.33
Additional path loss for LoS and NLoS (E_LoS, E_NLoS) | 1.6, 2.1
Hover altitude and minimum safe distance of the UAVs (H, d_min) | 100 m, 10 m
Learning rate (λ) | 0.001
Buffer size and mini-batch (M_b, M_r) | 60,000, 256
Memory capacity (M) | 256
Discount factor (γ) | 0.99
Updating rate (u) | 0.01
Penalty values (p_i, i ∈ {1, 2, 3}) | {500, 100, 100}
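For readers who wish to reproduce the setup, the settings in Table 2 can be gathered into a single configuration object. The Python sketch below is only an illustrative mapping of those values; the field names are our own assumptions and are not taken from the authors' code.

# Illustrative configuration mirroring Table 2; field names are assumptions, not the authors' code.
SIM_CONFIG = {
    "num_uavs": (2, 3),                            # D
    "num_gus_range": (10, 15),                     # K
    "carrier_frequency_hz": 2.4e9,                 # f_c
    "uav_energy_max_kj": 500,                      # E_max
    "uav_energy_min_kj": 50,                       # E_min
    "channel_model_params": (4.88, 0.33),          # X, Y
    "extra_path_loss": {"LoS": 1.6, "NLoS": 2.1},  # E_LoS, E_NLoS
    "hover_altitude_m": 100,                       # H
    "min_safe_distance_m": 10,                     # d_min
    "learning_rate": 1e-3,                         # lambda
    "buffer_size": 60_000,                         # M_b
    "mini_batch_size": 256,                        # M_r
    "memory_capacity": 256,                        # M
    "discount_factor": 0.99,                       # gamma
    "soft_update_rate": 0.01,                      # u
    "penalty_values": (500, 100, 100),             # p_1, p_2, p_3
}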
Table 3. Results of five algorithms on reward function, fair throughput, and fair index.

Algorithm | Reward | Fair Throughput | Fair Index
UAFC | 6290.319 | 4558.494 | 0.667
UAFC-NSDC | 5902.323 | 4465.863 | 0.645
UAFC-NRF | 6098.576 | 5014.304 | 0.559
MATD3 | 5323.756 | 4341.374 | 0.654
MADDPG | 4588.57 | 3601.502 | 0.586
Table 4. Comparison of results on reward function, fair throughput, and fair index.

Comparison | Reward | Fair Throughput | Fair Index
UAFC vs. UAFC-NSDC | 6.57% | 2.07% | 3.41%
UAFC vs. UAFC-NRF | 3.14% | −9.09% | 19.32%
UAFC vs. MATD3 | 18.15% | 5.62% | 1.99%
UAFC vs. MADDPG | 37.09% | 26.57% | 13.82%
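Each entry in Table 4 is the relative improvement of UAFC over the corresponding baseline in Table 3, computed as (UAFC value − baseline value)/baseline value. For example, the reward gain over MATD3 is (6290.319 − 5323.756)/5323.756 ≈ 18.15%, and the fair-index gain over MADDPG is (0.667 − 0.586)/0.586 ≈ 13.82%. The negative fair-throughput entry for UAFC vs. UAFC-NRF indicates that the UAFC-NRF variant attains a higher fair throughput (5014.304 vs. 4558.494) but a substantially lower fair index (0.559 vs. 0.667).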
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
