
Joint Beamforming Design for RIS-Assisted Integrated Satellite-HAP-Terrestrial Networks Using Deep Reinforcement Learning

1 School of Space Information, Space Engineering University, Beijing 101416, China
2 College of Information Engineering, Yancheng Institute of Technology, Yancheng 224051, China
* Author to whom correspondence should be addressed.
Sensors 2023, 23(6), 3034; https://doi.org/10.3390/s23063034
Submission received: 1 December 2022 / Revised: 7 March 2023 / Accepted: 8 March 2023 / Published: 11 March 2023
(This article belongs to the Special Issue Integration of Satellite-Aerial-Terrestrial Networks)

Abstract

In this paper, we consider reconfigurable intelligent surface (RIS)-assisted integrated satellite high-altitude platform terrestrial networks (IS-HAP-TNs), which can improve network performance by exploiting the stability of the HAP and the reflection capability of the RIS. Specifically, a reflective RIS is installed on the side of the HAP to reflect signals from multiple ground user equipment (UE) to the satellite. Aiming to maximize the system sum rate, we jointly optimize the transmit beamforming matrix at the ground UEs and the RIS phase shift matrix. Owing to the unit-modulus constraint on the RIS reflective elements, this combinatorial optimization problem is difficult to tackle effectively with traditional solution methods. Therefore, this paper studies a deep reinforcement learning (DRL) algorithm to achieve online decision making for the joint optimization problem. Simulation experiments verify that the proposed DRL algorithm outperforms standard schemes in terms of system performance, execution time, and computing speed, making real-time decision making truly feasible.

1. Introduction

As fifth-generation mobile communication systems enter commercial operation worldwide, terrestrial wired and wireless networks are beginning to provide instant, high-speed data transmission services to users in densely populated areas. However, owing to geographical conditions and business models, networks in remote areas still cannot meet the needs of multiple users for full-area coverage and ubiquitous access [1,2]. Compared with traditional terrestrial wireless communication systems, the satellite–aerial–ground integrated network (SAGIN) has emerged as a high-potential infrastructure for future wireless communication networks, capable of establishing seamless coverage and massive connectivity for the explosive growth of terrestrial users [3]. In SAGIN, high-altitude platform (HAP)-based relaying is expected to be the primary choice for aerial communication over UAV-based relaying, owing to its lower operating costs, longer residence time, deployment flexibility, and the larger number of communication devices a HAP can carry. Thus, the integration of HAPs to enable unobstructed connectivity in integrated satellite–terrestrial networks (ISTNs) has attracted widespread attention from academia and industry [4]. Nevertheless, integrated satellite high-altitude platform terrestrial networks (IS-HAP-TNs) also raise serious concerns about rapidly growing energy consumption and wireless security during transmission, both of which are of great significance for maintaining green and reliable communication.
Among the various candidates, a novel energy-efficient technology known as the reconfigurable intelligent surface (RIS) has been widely applied to improve communication security and network performance [5,6]. Each RIS reflective element is a varactor diode whose effect on the amplitude and/or phase shift of the incident signal can be independently controlled by an embedded RIS central controller [7]. An extensive study in [8] showed that the RIS has already been applied in many communication network scenarios, for example as an ambient reflector, a signal transmitter, or even a signal receiver. Meanwhile, the RIS has also been used in ambient forward-scatter/backscatter communication systems, a seminal contribution presented in [9].
The ability of the RIS to reconfigure transmission paths in real time at low cost provides a new solution to the time-varying radio channels caused by highly maneuverable IS-HAP-TNs and the severe path loss caused by long-distance propagation [10]. Furthermore, unlike traditional active beamforming in satellite communications, which requires additional active antennas or radio-frequency (RF) chains, the RIS-assisted IS-HAP-TNs use only passive reflective elements for phase and amplitude control, so the RIS enables full-duplex passive beamforming with no hardware modifications and low energy cost [11]. These attractive features have prompted researchers to apply the RIS in a variety of communication systems, such as orthogonal frequency division multiplexing (OFDM) systems [12], covert communication [13], multi-relaying networks [14], non-orthogonal multiple access (NOMA) [15], and so on. In recent years, several research efforts have gradually introduced the RIS concept into SAGINs, for example by placing the RIS under the satellite's solar panel, on the side of the UAV/HAP, or close to the ground receiving users, in order to improve the system's communication performance [16].
On the other hand, installing the RIS on the side of the HAP offers several advantages over the traditional approach of installing it on ground buildings. Firstly, although HAPs are likely to establish LoS links in most cases, they still face the challenge of blockage caused by obstacles. In addition, the satellite-HAP-ground transmission link may be exposed to severe interference and malicious eavesdropping. To address these issues, a RIS can be employed at the HAP to strengthen the desired signals at the intended users and overcome blockage by adjusting the phase shifts. Moreover, the RIS is able to degrade the unfavorable signals at unintended users to mitigate interference and information leakage. In [17], the authors considered RIS-assisted air–ground networks in two case studies. In the first case, the RIS was mounted on a UAV to improve the communication quality from the base station to the user. In the second scenario, the RIS was used to assist the transmission from the UAV to the ground. Meanwhile, the authors in [18,19] considered auxiliary communication scenarios in which the UAV carries the RIS. From the aspect of RIS implementation, with the help of the HAP, the RIS can be deployed more flexibly at an elevated position, where LoS links between the RIS and the transceiver ends can be easily established, especially in crowded urban scenarios. Secondly, since the transmitted signal is more likely to be blocked, a movable RIS is better suited to enhancing communication than a fixed RIS: a persistent LoS link to the transmitter and receiver can be maintained by constantly adjusting the position of the reflector to changes in the environment.
Thus, assuming that the RIS is installed on the HAP, there are three main modes of communication: the HAP serves directly as a communication relay, the RIS is installed on the side of the HAP for auxiliary communication, or both the HAP and the RIS serve as communication relays. Motivated by the above analysis, in our paper the HAP only serves as a mobile platform and is equipped with the RIS to increase line-of-sight transmission. This lays a foundation for subsequent research.
However, the demanding requirements of emerging applications in RIS-assisted IS-HAP-TNs are difficult to meet using traditional mathematical optimization algorithms alone, for instance in satellite communication scenarios that are too complex due to unknown channel models, or in scenarios that cannot be described by accurate mathematical expressions, such as the inevitable non-linearities caused by hardware losses [20,21]. Additionally, in the dynamic RIS configuration scenario, the RIS reflection coefficients are assumed to be reconfigured many times during one transmission instance. The capacity region corresponding to the dynamic RIS configuration is obtained as the union of all achievable rate tuples over all possible combinations of reflection coefficient values for each RIS element. However, determining this capacity region is prohibitively complex, since the size of the set of all possible combinations of reflection coefficients grows exponentially with the number of reconfigurations. Considering that future IS-HAP-TNs need to serve a large number of real-time Internet of Things (IoT) or Internet of Vehicles (IoV) devices, while their channel environment is complex and highly dynamic, the optimization time of conventional methods is too long, so more intelligent algorithms are urgently needed.
On the contrary, deep reinforcement learning (DRL) is a novel approach that combines deep learning (DL) and reinforcement learning (RL). It has proven to be a significant breakthrough for non-convex optimization problems, including hybrid beamforming design [22], intelligent spectrum sensing [23], channel state estimation [24], and power allocation strategy optimization [25]. Compared with DL, the DRL algorithm does not require a large amount of labeled training data as input and is therefore well suited to the optimization of wireless communication systems, where obtaining data is more tedious. By interacting with the environment to obtain rewards from the network, DRL can learn and construct wireless channel knowledge without knowing the complete channel model information and the precise movement pattern, while implementing efficient algorithm design through embedded neural networks to sequentially find optimal solutions to complex multi-objective optimization problems. In [26], a deep Q-network (DQN) with greedy characteristics was proposed for the joint optimization of beamforming design, power allocation strategy, and interference coordination to maximize the signal to interference plus noise ratio (SINR). In [27], using a DRL framework, the user distribution model is tracked and predicted to autonomously and dynamically optimize the MIMO broadcast beam and propose the optimal broadcast beam for each served cell. The results confirm that optimal coverage can be achieved using the DRL framework in both single-sector and multi-sector environments, and in both periodic and Markovian mobility modes.

1.1. Related Work and Motivation

To date, many prior works have investigated the performance of RIS-assisted wireless communication.
(1)
For RIS-assisted wireless communication: In future wireless communications, the RIS can be deployed as a reconfigurable signal transmitter, receiver, or passive reflector array. Thus, by optimizing the RIS reflector element phase shifts, for example by selecting optimal reflector element positions and passive beamforming designs, RIS-assisted intelligent radio environments are a promising paradigm to transform the design of modern wireless networks. In [28], the authors jointly optimized the RIS reflection coefficients and the number of tunable reflective elements to improve the quality of service (QoS) of RIS-assisted edge networks by considering a practical amplitude and phase shift model of the RIS. The authors in [29] proposed an efficient alternating algorithm based on fractional programming (FP), majorization–minimization (MM), and manifold optimization methods to jointly optimize active and passive beamforming under multiple constraints of radar sensing similarity, RIS constraints, and transmit power constraints. Moreover, the authors in [30] considered the performance of RIS-assisted integrated satellite unmanned aerial vehicle (UAV)-terrestrial networks, where a closed-form expression for the outage probability (OP) was obtained to evaluate the impact of introducing the RIS on the system performance.
(2)
For DRL in RIS-assisted wireless communication: By combining the function-fitting benefits of deep learning with the environment-interaction decision-making benefits of reinforcement learning, DRL is believed to be able to solve non-linear problems directly, without the prior relaxation required by conventional mathematical methods. Recently, DRL-based methods have also been applied to solve RIS-assisted wireless communication optimization problems. The authors in [31] presented a RIS-assisted multi-user full-duplex secure communication system with hardware impairments and maximized the total secrecy rate by using DRL. In [32], the authors considered a deep deterministic policy gradient (DDPG)-based framework to tackle the problem of maximizing the receiver signal-to-noise ratio (SNR) in a RIS-assisted single-user wireless communication system. The simulation results showed that the proposed DDPG algorithm can achieve higher performance in a shorter running time compared to traditional methods based on semi-definite relaxation.
Owing to these advantages, model-free DRL has emerged as a remarkable technology for addressing massive data, mathematically intractable non-linear non-convex problems, and high-computation issues. DRL is most appealing for large-scale MIMO systems, such as satellite communication or IS-HAP-TNs with a massive number of array elements, where optimization problems become non-trivial due to the extremely large optimization dimension involved. DL-based approaches are able to significantly reduce the complexity and computation time through offline prediction, but they often require an exhaustive sample library for training. Meanwhile, the DRL technique, which embraces the advantage of DL in neural network training while improving the learning speed and performance of RL algorithms, has also been adopted in designing wireless communication systems. All of the above work verifies that the DRL algorithm can effectively solve optimization problems of various wireless communication systems with respect to different performance metrics and achieve high performance in a short run-time with an appropriate algorithmic framework design.

1.2. Major Contributions and Novelty

Currently, the joint beamforming design technique is also a hot issue, which can greatly improve the communication efficiency, system capacity and transmission rate of wireless communication systems to some extent. The efficiency of applying DRL to RIS-assisted wireless communication systems has been demonstrated in simulation experiments in the relevant literature. Related meta-surface techniques, including RIS, have also recently been widely used for IS-HAP-TNs [33,34,35,36].
Motivated by the above analysis, we investigate the joint beamforming design for RIS-assisted IS-HAP-TNs, in which more constraints must be considered, such as the angle of arrival (AoA) and the angle of departure (AoD) at the RIS, the maximum transmit power, the optimal RIS phase shift matrix, and the highly dynamic signal transmission environment. Moreover, the statistical characteristics of the shadowed-Rician (SR) channel are far more complex than those of the terrestrial Rayleigh channel, which is also part of the difficulty and novelty of our work. Since the IS-HAP-TNs environment is harsh as well as time varying, and the RIS hardware may suffer unpredictable damage, it is challenging to perceive accurate and complete channel state information (CSI), making conventional optimization-based methods no longer appropriate for IS-HAP-TNs operation. In addition, conventional optimization-based methods require a large number of iterations to achieve a satisfactory solution, which makes them impracticable for real-time decision making in time-varying IS-HAP-TNs. Targeted at solving this difficult problem under uncertainty, DRL has been applied to solve the problems in RIS-aided wireless communications. By applying an advanced DRL framework to optimize the ground user transmit beamforming and the RIS phase shift matrix so that the desired signal is directed to the target satellite, the overall network sum rate is maximized [37]. To the best of the authors' knowledge, there is no prior work focusing on the beamforming design for RIS-assisted integrated satellite-HAP-terrestrial networks, especially with the RIS installed on the HAP, which motivates our work. In this paper, considering the high dynamics of the RIS-assisted IS-HAP-TNs transmission process, the main work and contributions are summarized as follows:
(1)
Firstly, an innovative system model in which the RIS is installed on the side of the HAP is proposed for IS-HAP-TNs. Considering the time-varying characteristics of the IS-HAP-TNs fading channel model and signal transmission model, the system sum rate is formulated under these system model constraints in terms of the active transmit beamforming at the ground user equipment and the phase shift matrix at the RIS, and the corresponding maximization problem is stated under the proposed constraints.
(2)
Secondly, a parameter soft-update strategy framework based on the DDPG framework is designed to solve the above system sum rate maximization problem. The framework does not need to know an explicit model of the wireless environment or a specific mobility model, and it solves the formulated problem by a rational design of the state space, action space, and reward function in the DDPG algorithm, so that continuous-variable problems can be handled well.
(3)
Finally, the simulation experiments on the number of RIS elements as well as the average reward show that the designed DRL algorithm framework outperforms other comparative baseline algorithms, which illustrates the effectiveness of the DRL algorithm in solving joint beam optimization problems and provides guidance for real-time decision making in dynamic IS-HAP-TNs communication environments.
The remainder of this paper is arranged as follows. Section 2 describes the considered system model and identifies the optimization objective problem under the constraints. Section 3 gives the basic framework of the soft update parameter strategy and gives the design flow for the optimization of the active transmit beamforming matrix and the RIS phase shift matrix under this framework. Section 4 plots the network performance simulation results under this framework and provides a detailed theoretical analysis. Finally, Section 5 concludes the whole work.

2. System Model Description

In this illustration, we envision an uplink transmission communication system that includes a geosynchronous Earth orbit (GEO) satellite, a high altitude platform (HAP) deployed with a RIS, and K ground user equipment (UE), each employing a single antenna, as shown in Figure 1. In our proposed system model, the UEs transmit their communication information through RF links to the RIS with M reflective elements installed on the HAP, which acts as a passive reflection relay with changeable transmission links and forwards the received signal to the satellite.
It is noted that the satellite is linked to a cloud data computing center by free-space optical (FSO) links, which serve as the system control link and collect global communication information, such as the users' requirements. Instead of coding the satellite and HAPs separately, the baseband processing of the entire network is centralized in the cloud, with the cloud as the core, taking into account resource management and environmental feedback [38].
In order to realistically simulate the UE-RIS link, where the RIS is mounted on the HAP in the air, we consider the small-scale path loss model of [39]; the channel vector of the UE-RIS link can then be expressed as
$$\mathbf{h}_{UR} = \sqrt{\frac{MK}{L_{total}}} \sum_{l=1}^{L_{total}} \alpha_l\, \mathbf{g}\left(m, \varphi_{AR,l}\right) \mathbf{g}^{T}\left(k, \varphi_{DU,l}\right)$$
where $L_{total}$ denotes the number of total transmission paths, $\alpha_l$ represents the Nakagami-m channel random variable, and $\varphi_{AR,l}$ and $\varphi_{DU,l}$ denote the angle of arrival (AoA) at the RIS and the angle of departure (AoD) at the UEs in the l-th transmission path, respectively. The steering vector $\mathbf{g}(L, \varphi)$, as a function of the array dimension L and the AoA or AoD $\varphi$, can be expressed as
$$\mathbf{g}(L, \varphi) \triangleq \frac{1}{\sqrt{L}}\left[1,\; e^{j\pi\cos\varphi},\; e^{2j\pi\cos\varphi},\; \ldots,\; e^{(L-1)j\pi\cos\varphi}\right]^{T}$$
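As an illustration of the steering vector above, the following is a minimal NumPy sketch; the function name and the 16-element example are ours, not the paper's, and it assumes the half-wavelength element spacing implied by the π phase factor.

```python
import numpy as np

def steering_vector(L, phi):
    # g(L, phi) = (1/sqrt(L)) * [1, e^{j*pi*cos(phi)}, ..., e^{j*(L-1)*pi*cos(phi)}]^T
    # Half-wavelength element spacing is assumed (hence the pi factor).
    l = np.arange(L)
    return np.exp(1j * np.pi * l * np.cos(phi)) / np.sqrt(L)

# Example: a 16-element array and a 60-degree angle (illustrative values only)
g = steering_vector(16, np.deg2rad(60))
print(g.shape, np.round(np.linalg.norm(g), 6))  # (16,), unit norm
```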
The RIS-satellite uplink channel is denoted by $\mathbf{H}_{RS}$, which can be expressed as
$$\mathbf{H}_{RS} = \sqrt{M N_s P_r}\, \mathbf{g}\left(N_s, \varphi_{AS}\right) \mathbf{g}^{T}\left(M, \varphi_{DR}\right)$$
where $N_s$ denotes the number of antennas of the uniform linear array (ULA) at the satellite, and $\varphi_{AS}$ and $\varphi_{DR}$ are the AoA at the satellite and the AoD at the RIS, respectively. Meanwhile, $P_r$ is the free-space path loss between the RIS and the satellite [40]. Note that in the RIS-satellite uplink channel model, considering that the HAP flies at a higher altitude than most ground buildings and the RIS is mounted on the HAP, we only assume a line-of-sight (LoS) transmission path between the RIS and the satellite, and $P_r$ can be expressed by the following formula:
$$P_r = \frac{\lambda^2 G_{sr} G_{st}}{(4\pi)^2 d_{sr}^2\, \kappa_a T_a B_W}$$
where $\lambda$, $G_{sr}$, $G_{st}$, $d_{sr}$, $\kappa_a$, $T_a$, and $B_W$ denote the carrier wavelength of the signal, the gain of each RIS reflection unit, the antenna gain of the satellite, the transmission distance between the RIS and the center of the satellite coverage area, the Boltzmann constant, the noise temperature, and the signal bandwidth, respectively. The direct uplink channel $\mathbf{H}_{US}$ from the UEs to the satellite is basically a standard MIMO channel and can be characterized by existing methods [41].
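For concreteness, a small sketch of the path-loss formula above follows; the gain and distance values are hypothetical assumptions (only the wavelength, noise temperature, and bandwidth come from Table 1).

```python
import numpy as np

def ris_satellite_path_loss(wavelength, G_sr, G_st, d_sr, T_a, B_W, kappa=1.380649e-23):
    # P_r = lambda^2 * G_sr * G_st / ((4*pi)^2 * d_sr^2 * kappa * T_a * B_W)
    # Gains are linear (not dB), d_sr in metres, B_W in Hz, kappa is Boltzmann's constant.
    return (wavelength ** 2 * G_sr * G_st) / ((4 * np.pi) ** 2 * d_sr ** 2 * kappa * T_a * B_W)

# The GEO slant distance and antenna gains below are hypothetical example values
P_r = ris_satellite_path_loss(wavelength=0.15, G_sr=10.0, G_st=1e4,
                              d_sr=35_786e3, T_a=300.0, B_W=15e6)
print(P_r)
```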

2.1. Signal Transmission Model

We assume that the k-th UE intends to transmit the signal $w_k s_k$, where $w_k$ is the transmit beamforming coefficient of the k-th UE in the total transmit beamforming matrix $\mathbf{W} = [w_1, w_2, \ldots, w_K]$, subject to the maximum power constraint $P_{max}$, and $s_k$ satisfies $\mathbb{E}\{|s_k(t)|^2\} = 1$, i.e., it has zero-mean, unit-variance entries at the t-th transmission moment [42]. The signal transmitted from a ground user propagates through the direct link and the reflected link, the latter reaching the satellite via the RIS mounted on the HAP, which changes the phases of its reflecting elements. We define $\boldsymbol{\Phi}$ as the phase shift diagonal matrix of the reflective RIS, with $\boldsymbol{\Phi}_{m,m} = \phi_m = \chi_m e^{j\varphi_m}$, where $\chi_m$ is the magnitude and $\varphi_m$ denotes the phase shift caused by the passive reflection of each reflective element of the RIS [43]. Thus, the signal received by the satellite at the t-th time step can be further expressed as follows:
$$y(t) = \left(\mathbf{H}_{US}^{k} + \mathbf{H}_{RS}^{k} \boldsymbol{\Phi} \mathbf{h}_{UR}^{k}\right) w_k s_k + \sum_{j=1, j\neq k}^{K} \left(\mathbf{H}_{US}^{j} + \mathbf{H}_{RS}^{j} \boldsymbol{\Phi} \mathbf{h}_{UR}^{j}\right) w_j s_j + n_0$$
where $n_0$ denotes the zero-mean additive white Gaussian noise (AWGN) of the system, following $n_0 \sim \mathcal{CN}(0, \sigma_n^2)$. As can be seen from Equation (5) above, the introduction of the RIS does not introduce additional AWGN compared to traditional relay-based communication systems. This is because the RIS is only a passive mirror relay that merely reflects the signal incident on its plane without decoding or re-encoding it. Once the RIS receives the signal, its phase is reconfigured by a central microprocessor controller connected to the RIS. Thus, the received signal-to-interference-plus-noise ratio (SINR) of the $UE_k$ signal at the satellite is given by
$$\gamma_k = \frac{\left| \mathbf{H}_{US}^{k} + \mathbf{H}_{RS}^{k} \boldsymbol{\Phi} \mathbf{h}_{UR}^{k} \right|^2 w_k}{\sum_{j=1, j\neq k}^{K} \left| \mathbf{H}_{US}^{j} + \mathbf{H}_{RS}^{j} \boldsymbol{\Phi} \mathbf{h}_{UR}^{j} \right|^2 w_j + \sigma_n^2}$$
Since the UEs transmit signals to the satellite, we set the performance metric for evaluating the RIS-assisted wireless system as the system ergodic sum rate, which can be modeled as
$$C\left(\mathbf{H}_{US}, \mathbf{H}_{RS}, \boldsymbol{\Phi}, \mathbf{w}, \mathbf{h}_{UR}\right) = \sum_{k=1}^{K} R_k = \sum_{k=1}^{K} \log_2\left(1 + \gamma_k\right)$$
where $R_k$ represents the data transmission rate of the k-th UE, given by $R_k = \log_2(1 + \gamma_k)$.
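The SINR and sum-rate expressions above translate directly into code. The sketch below assumes particular array shapes (one direct-channel row per UE, an $N_s \times M$ RIS-satellite channel, and an $M \times K$ UE-RIS channel) purely for illustration.

```python
import numpy as np

def sum_rate(H_US, H_RS, h_UR, Phi, w, sigma2):
    # H_US: (K, Ns) direct UE->satellite channels, H_RS: (Ns, M) RIS->satellite channel,
    # h_UR: (M, K) UE->RIS channels, Phi: (M,) unit-modulus reflection coefficients,
    # w: (K,) per-UE transmit powers, sigma2: noise power.
    K = H_US.shape[0]
    # Effective gain of the combined direct + RIS-reflected link for each UE
    eff = np.array([np.linalg.norm(H_US[k] + H_RS @ (Phi * h_UR[:, k])) ** 2 for k in range(K)])
    rate = 0.0
    for k in range(K):
        interference = np.sum(np.delete(eff * w, k))   # other UEs' signals
        gamma_k = eff[k] * w[k] / (interference + sigma2)
        rate += np.log2(1.0 + gamma_k)
    return rate
```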

2.2. Problem Formulation

In this section, the aim is to formulate the optimization objective as the maximization of the sum rate of the considered RIS-assisted IS-HAP-TNs. From the above, $\mathbf{W} = [w_1, w_2, \ldots, w_K]$ denotes the transmit beamforming matrix, and the proposed optimization problem can be modeled as
$$\begin{aligned} \max_{\mathbf{W}, \boldsymbol{\Phi}} \quad & C\left(\mathbf{H}_{US}, \mathbf{H}_{RS}, \boldsymbol{\Phi}, \mathbf{w}, \mathbf{h}_{UR}\right) \\ \text{s.t.} \quad & C1: w_k \leq P_{\max}, \quad k = 1, 2, \ldots, K, \\ & C2: \left|\phi_m\right| = 1, \quad m = 1, 2, \ldots, M. \end{aligned}$$
where $P_{max}$ represents the maximum link transmission power. Constraint C1 limits the maximum transmission power of the UEs, and constraint C2 imposes the unit-modulus constraint on the RIS reflective elements. Obviously, the above optimization problem based on the proposed system model is non-convex; owing to the high-dimensional phase shifts of the reflecting elements, it is a non-trivial constrained problem that can hardly be solved by traditional approaches. If traditional mathematical tools were used, the problem would have to be solved by an exhaustive search to obtain the optimal solution, which implies a large amount of computational resources and processing, and is almost impossible for large-scale network communication scenarios, especially for networks with high real-time requirements such as the IS-HAP-TNs considered here. Thus, this paper only considers the use of intelligent algorithms for the beamforming solution; it does not mathematically solve the challenging optimization problem through non-convex to convex conversions, etc.
Meanwhile, to demonstrate the superiority of the proposed algorithm, we refer to the traditional alternating optimization (AO) algorithm in the literature for solving the joint active transmit beamforming and passive beamforming optimization problem. First, the unit-modulus constraint in Equation (9) greatly increases the difficulty of solving this problem, so it is relaxed to $|\phi_m| \leq 1$, $m = 1, 2, \ldots, M$. After obtaining the local optimal solution, the reflection coefficients are projected back onto the unit-modulus constraint. In each iteration of the traditional AO algorithm, a sub-optimal $\mathbf{W}$ is solved by first fixing $\boldsymbol{\Phi}$, and then a sub-optimal $\boldsymbol{\Phi}$ is solved by fixing $\mathbf{W}$, until the algorithm converges. For the design of high-dimensional continuous variables, such as the per-element phase shifts in large-scale networks, including the transmit power matrix and the phase shift matrix, traditional mathematical optimization methods such as the AO algorithm and the water filling (WF) algorithm cannot solve these problems effectively and often get stuck in local optima. Thus, to efficiently tackle the considered non-convex joint optimization problem, we propose a soft-update DDPG algorithm in which, through continuous trial-and-error interaction with the environment, the DRL agent gradually learns a deterministic strategy that leads to the optimal action.
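For reference, a minimal sketch of the relax-then-project step used by the AO baseline above is shown here; the sub-problem solvers in the comments are hypothetical placeholders for the routines described in the text, not the paper's exact implementations.

```python
import numpy as np

def project_unit_modulus(phi):
    # Project relaxed coefficients (|phi_m| <= 1) back onto |phi_m| = 1
    # by keeping only the phase of each element.
    return np.exp(1j * np.angle(phi))

# Sketch of one AO outer iteration (placeholder sub-solvers, not the paper's routines):
#   W   = solve_W_given_Phi(Phi)     # e.g., water-filling with Phi fixed
#   Phi = solve_Phi_given_W(W)       # relaxed phase optimization with W fixed
#   Phi = project_unit_modulus(Phi)  # restore the unit-modulus constraint
```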

3. Soft-DDPG-Based Joint Active and Passive Beamforming Design

In this section, the DRL method is used to jointly optimize the transmit beamforming matrix and the phase shift matrix, utilizing the DDPG structure shown in Figure 2. First, we briefly discuss the soft-DDPG principle and operation process. Then, we introduce the proposed DRL architecture and provide a detailed description of the state, action, reward, and the algorithm framework.

3.1. Overview of Soft-DDPG

It is supposed that there exists a central controller or learning agent in this network that can collect the channel information or communication data immediately, such as the RIS-to-satellite channel $\mathbf{H}_{RS}$ and the UE-to-RIS channel $\mathbf{h}_{UR}$. Figure 2 displays the soft-DDPG architecture, which is suitable for learning agents interacting with highly dynamic communication environments to obtain pre-defined rewards or punishments. The core concept of the soft-DDPG framework proposed in this paper is to perform effective beamforming design and phase shift conversion under unforeseen circumstances, such as local state observations at the RIS. The algorithm mainly includes two kinds of deep neural networks (DNNs), namely, the training network and the target network. To avoid or mitigate the issue of unstable state-action value updates, we assume that the target and training networks have the same neural network architecture.
Based on the above, we can more clearly portray the framework covered in this article, with four DNNs drawn in detail: the training actor network, the training critic network, the target actor network, and the target critic network. Their functions are as follows. The training actor network takes the current state s(t) as input and outputs the current action a(t), while the training critic network takes the state s(t) and the action a(t) as inputs and outputs the Q value $Q^{\pi}(s(t), a(t))$. The target actor network takes the updated state s(t+1) as input and outputs a(t+1), and the target critic network takes the updated s(t+1) and a(t+1) as inputs and outputs the target Q value $Q^{\pi}(s(t+1), a(t+1))$.
Considering that the neural network inputs involve complex values, the proposed model uses tanh as the activation function of the hidden layers, which restricts the action space to the interval $(0, 2\pi)$ after mapping. To eliminate the effect of the shift in the hidden-layer data distribution brought by parameter updates, the proposed DRL framework introduces a batch normalization layer after each hidden layer to process its output. The batch normalization layer can effectively combat the vanishing-gradient phenomenon, improve training efficiency, and make the training process of the deep network more stable. In addition, according to the constraints on the transmit power and phase shift coefficients, the proposed model adds a tanh activation function to the output layer of the actor network to restrict the output to the interval [−1, 1], and subsequently transforms the action into the data format required by the optimization problem through absolute-value normalization and range mapping, so as to meet the power allocation and phase shift constraints and compute the system sum rate as in Equation (8).
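A minimal PyTorch sketch of such an actor network is given below; the layer widths and dimensions are illustrative assumptions, while the tanh activations, batch normalization after each hidden layer, and tanh output layer follow the description above.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    # Hidden layers use tanh followed by batch normalization; the output layer
    # uses tanh so that every action entry lies in [-1, 1] before being mapped
    # to the power and phase-shift ranges.
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(), nn.BatchNorm1d(hidden),
            nn.Linear(hidden, hidden), nn.Tanh(), nn.BatchNorm1d(hidden),
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )

    def forward(self, state):
        return self.net(state)

# Example forward pass with a batch of 16 states of (assumed) dimension 64
actor = Actor(state_dim=64, action_dim=24)
a = actor(torch.randn(16, 64))
print(a.shape, float(a.min()) >= -1.0, float(a.max()) <= 1.0)
```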
When the channel state information (CSI) and the previous actions $\mathbf{W}(t-1)$ and $\boldsymbol{\Phi}(t-1)$ are known at the t-th time step, we generate the different transmission link channels following the channel model features described earlier, and the learning agent can establish knowledge about the current state space s(t) at the t-th time step. The joint optimal design of the active transmit beamforming and the passive RIS phase shift matrix poses a great challenge to the continuous state-space and action-space settings. Next, the details of the DRL-based algorithm's state space S, action space A, and instant reward function R are explained below.
State: The state space is generally a description of the environmental observations at the t-th time step. In this paper, the DRL state space includes three parts, i.e., the k-th UE's transmit power at the (t−1)-th time step, the CSI of all communication links, including the direct links and the cascaded reflective links containing the RIS reflective elements, at the (t−1)-th time step, and the action from the (t−1)-th time step, which are represented by $w_k(t-1)$, $\mathbf{G}$ and $a(t-1)$, respectively, where
$$\mathbf{G} \triangleq \left\{ \mathbf{H}_{US}, \mathbf{H}_{RS}, \mathbf{H}_{RS}^{H}, \mathbf{h}_{UR}, \mathbf{h}_{UR}^{H} \right\}$$
Thus, the state space can be expressed as $s(t) = \{w_k(t-1), \mathbf{G}, a(t-1)\}$.
Action: The action space is generally the set of choices for the next action. Once the agent performs the current action a(t) according to the policy $\pi$ at the t-th time slot, the environment state is shifted from s(t) to the next state s(t+1). The action space is designed as the UE transmit power matrix $\mathbf{W}$ and the RIS phase shift matrix $\boldsymbol{\Phi}$. Considering that neural networks can only take real-valued inputs, to match the neural network input format, the real and imaginary parts of the transmit beamforming and phase shift variables, i.e., $\mathbf{W} = \mathrm{Re}\{\mathbf{W}\} + j\,\mathrm{Im}\{\mathbf{W}\}$ and $\boldsymbol{\Phi} = \mathrm{Re}\{\boldsymbol{\Phi}\} + j\,\mathrm{Im}\{\boldsymbol{\Phi}\}$, are separated into separate input ports when constructing the action space. Thus, the action space can be expressed as
$$a(t) = \left\{ \mathrm{Re}\{w_k(t)\},\; \mathrm{Im}\{w_k(t)\},\; \mathrm{Re}\{\phi_m(t)\},\; \mathrm{Im}\{\phi_m(t)\} \right\}$$
where $w_k(t)$ represents the k-th UE's transmit power coefficient, and $\phi_m(t)$ denotes the m-th reflective element's phase shift at the t-th time step.
Reward: The purpose of this paper is to maximize the system sum rate, so Equation (10) is adopted as the reward function:
$$r(t) = C\left(\mathbf{H}_{US}, \mathbf{H}_{RS}, \boldsymbol{\Phi}, \mathbf{w}, \mathbf{h}_{UR}\right)$$
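To make the state and action definitions concrete, the sketch below shows one possible way to assemble the real-valued state vector and to map the actor output in [−1, 1] back to the transmit coefficients and unit-modulus RIS coefficients; the exact ordering and scaling are our assumptions, not the paper's specification.

```python
import numpy as np

def build_state(w_prev, G, a_prev):
    # s(t) = {w_k(t-1), G, a(t-1)}: the channels in G are complex, so their real
    # and imaginary parts are stacked; w_prev (powers) and a_prev (previous
    # actor output) are assumed real-valued.
    chan = np.concatenate([np.concatenate([H.real.ravel(), H.imag.ravel()]) for H in G])
    return np.concatenate([np.ravel(w_prev), chan, np.ravel(a_prev)])

def action_to_variables(a, K, M, P_max):
    # First 2K entries -> (real, imag) transmit coefficients, scaled so that
    # |w_k|^2 <= P_max; last 2M entries -> (real, imag) RIS coefficients,
    # normalized to satisfy |phi_m| = 1.  This mapping is one possible choice.
    w = a[:K] + 1j * a[K:2 * K]
    w = np.sqrt(P_max) * w / max(np.abs(w).max(), 1e-12)
    phi = a[2 * K:2 * K + M] + 1j * a[2 * K + M:2 * K + 2 * M]
    phi = phi / np.maximum(np.abs(phi), 1e-12)
    return w, phi
```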

3.2. The Process of Algorithm Training

In order to break the coupling between experiences and adapt to a highly dynamic environment, the experience replay approach adopted in the DDPG framework considered in this article allows the agent to access previous historical experiences in subsequent training. For purely policy-based algorithms, the agent collects experience within an episode, and once the episode ends, that experience is lost. A multi-threaded parallel architecture works better: it not only addresses this problem but also makes efficient use of computing resources and improves training efficiency. A minimal replay-buffer sketch is given below.
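In the sketch, the capacity and mini-batch size match Table 1, while the class and method names are our own illustration.

```python
import random
from collections import deque

class ReplayBuffer:
    # Transitions (s, a, r, s') from all workers are stored together and uniform
    # random mini-batches are drawn for training, breaking temporal coupling.
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size=16):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s_next = zip(*batch)
        return s, a, r, s_next

    def __len__(self):
        return len(self.buffer)
```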
In the proposed DDPG framework, the entire agent consists of a global network and multiple parallel independent workers, each comprising an actor network and a critic network. Each worker interacts independently with its own environment, gaining sampling experiences that are independent of each other, thus breaking the coupling between experiences to match the experience replay. Most baseline DRL algorithms are single threaded, that is, a single learning agent interacts with the environment to generate experience, including the baseline versions of the actor and critic networks; because the environment is fixed and the agent's actions must be continuous, the gathered experience has strong temporal correlations, and only part of the state and action space can be explored in a limited amount of time. To solve this problem, we adopt the soft-DDPG-based scheme to optimize the design process and present the corresponding pseudo-code in Algorithm 1.
In the initial stage of the algorithm, the experience replay buffer D, the training actor network $\psi(\cdot)$ and the training critic network $Q(\cdot)$ are initialized randomly (Lines 1–2) and copied to the target networks (Line 2). After initializing and randomly generating the RIS-assisted IS-HAP-TNs communication channel environment state, the state is processed via the DNN (Lines 5–6). The action is then derived, where N denotes random exploration noise added to encourage efficient exploration (Lines 7–8). We employ mini-batches to reduce the amount of sample training and to ensure the quality of the gradient descent. After the transition is saved in the replay buffer D (Line 10), to obtain the optimal action that maximizes the output of the training critic network, the two training networks are updated using mini-batches of size $\zeta$ randomly sampled from the replay buffer D (Line 11). We update the critic network parameters by minimizing the squared loss (Line 14) and use the chain rule to update the actor network parameters $\psi(\cdot)$ (Line 15). Finally, the parameters of the target actor and critic networks are slowly soft-updated using the control factor $\tau$ as the decaying rate (Line 16).
Algorithm 1: Soft-DDPG-based Algorithm
1: Initialize experience memory D to empty;
2: Randomly initialize the actor training/target network $\psi(\cdot)$ and the critic training/target network $Q(\cdot)$ with parameters $\xi_a$ and $\xi_c$, respectively;
3: Input: $\mathbf{w}$, $\boldsymbol{\phi}$, $\mathbf{H}_{RS}$ and $\mathbf{h}_{UR}$;
4: Output: Optimal action $a_{opt}(t)$;
5: for each episode do:
6:   Initialize state $s(0) \in S$;
7:   for $t = 0, 1, 2, \ldots, T-1$ do:
8:     Choose action $a(t) = \pi(s(t)|\theta^{\pi}) + N$;
9:     Take action $a(t)$, get reward $r(t)$, and let $s(t)$ evolve into the new state $s(t+1)$;
10:    Save $(s(t), a(t), r(t), s(t+1))$ into D;
11:    Randomly sample $\zeta$ transitions from D;
12:    Train the framework via the DNNs;
13:    Compute the target value for the critic's evaluation network by
       $y(i) = r(i) + \gamma Q^{\pi}\left(s(i+1), \pi(s(i+1)|\theta^{\pi}) \,\middle|\, \theta^{Q}\right)$;
14:    Update the parameters of the critic's evaluation network by minimizing
       $L(\theta^{Q}) = \frac{1}{\zeta} \sum_{i=1}^{\zeta} \left( y(i) - Q^{\pi}(s(i), a(i)|\theta^{Q}) \right)^2$;
15:    Update the parameters of the actor network with the sampled policy gradient
       $\nabla_{\theta^{\pi}} J = \frac{1}{\zeta} \sum_{i=1}^{\zeta} \nabla_a Q^{\pi}(s, a|\theta^{Q})\big|_{a=\pi(s(i)|\theta^{\pi})} \nabla_{\theta^{\pi}} \pi(s|\theta^{\pi})$;
16:    Soft-update the parameters of the DDPG target networks by
       $\theta_c^{(target)} \leftarrow \tau_c \theta_c^{(train)} + (1 - \tau_c)\theta_c^{(target)}$,
       $\theta_a^{(target)} \leftarrow \tau_a \theta_a^{(train)} + (1 - \tau_a)\theta_a^{(target)}$;
17:    Update the state $s(t+1)$;
18:  end for;
19: end for;
During each iteration t of the learning process, the actor training network selects the action from the continuous action space based on the current state s(t). During this training process, in order to explore the optimal action effectively, stochastic noise $N_a$ is also added to the deterministic strategy, i.e., $a(t) = \pi(s(t)|\theta^{\pi}) + N_a$, where $\theta^{\pi}$ is the actor training network parameter and $\pi$ is the policy. When the step ends, the environment transitions under the last action to the next state s(t+1) and yields the instant reward r(t), which provides an evaluation of the action; the state–action value function, parameterized by $\theta^{Q}$, is updated as
$$Q^{\pi}\left(s(t), a(t) \,\middle|\, \theta^{Q}\right) \leftarrow (1-\alpha)\, Q^{\pi}\left(s(t), a(t) \,\middle|\, \theta^{Q}\right) + \alpha \left[ r(t) + \gamma \max_{a} Q^{\pi}\left(s(t+1), a \,\middle|\, \theta^{Q}\right)\Big|_{a = \pi(s(t+1)|\theta^{\pi})} \right]$$
where $\alpha$ denotes the learning rate in this algorithm framework. To ensure stability, the target actor network is parameterized by $\theta^{\pi'}$ and the target critic network by $\theta^{Q'}$, both of which are updated at intervals from the online network parameters. Since the parameter update strategy is a soft-update method, the algorithm is called soft-DDPG. The soft-update method ensures a slow change of the parameters and alleviates the instability of the policy network during the learning process.
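A minimal PyTorch sketch of this soft (Polyak) update is shown below; $\tau$ corresponds to the decaying rate listed in Table 1, and the function name is ours.

```python
import torch

@torch.no_grad()
def soft_update(target_net, train_net, tau=1e-4):
    # theta_target <- tau * theta_train + (1 - tau) * theta_target,
    # applied element-wise to every parameter of the target network.
    for p_target, p_train in zip(target_net.parameters(), train_net.parameters()):
        p_target.mul_(1.0 - tau).add_(tau * p_train)
```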

4. Numerical Simulation Results

In this section, we evaluate the performance improvement of the proposed DRL-based algorithm framework for the proposed system model from different perspectives. First, we randomly generate channel matrices following the shadowed-Rician fading distribution and the corresponding channel models mentioned above [44]. The system parameters and the hyperparameters of the DDPG algorithm are listed in Table 1 [45]. To test whether our algorithm improves network performance, we also consider four other baseline solutions:
(1)
Hard-DDPG: In this scheme, the parameters in the DRL framework are updated with a hard-update strategy, in which, after every $t_u$ training steps, all parameters of the training network are copied directly into the target network, with $t_u$ a pre-set parameter update interval.
(2)
Random RIS: The scheme denotes that the RIS phase shift matrix Φ is randomly generated.
(3)
Without RIS: In this scheme there is no RIS and the UEs send signals directly to the satellite. Considering that the process is a continuous transmission, we assume that the probability of successful signal transmission is 1/2.
(4)
Traditional optimization scheme: The traditional algorithm is an alternating optimization algorithm. Specifically, with the RIS reflection coefficients fixed, the transmit power matrix is obtained by the water filling algorithm; the corresponding joint beamforming design is then obtained according to Equation (9), and the two steps alternate until the objective function value converges. A sketch of the water-filling step is given after this list.
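As referenced in scheme (4), the following is a minimal sketch of a standard water-filling power allocation over fixed effective channel gains; the bisection search and the example gains are our assumptions, not the paper's exact routine.

```python
import numpy as np

def water_filling(gains, P_total, tol=1e-9):
    # p_k = max(mu - 1/g_k, 0), with the water level mu chosen by bisection
    # so that the allocated powers sum to P_total.
    gains = np.asarray(gains, dtype=float)
    lo, hi = 0.0, P_total + 1.0 / gains.min()
    while hi - lo > tol:
        mu = 0.5 * (lo + hi)
        if np.maximum(mu - 1.0 / gains, 0.0).sum() > P_total:
            hi = mu
        else:
            lo = mu
    return np.maximum(0.5 * (lo + hi) - 1.0 / gains, 0.0)

# Example: three users with (hypothetical) effective channel gains
print(water_filling([2.0, 1.0, 0.5], P_total=3.0))
```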
Figure 3 plots the relationship between the number of RIS reflection elements and the system sum rate. As can be seen from the figure, the system sum rate increases significantly for all algorithms as the number of RIS elements grows, because more RIS reflection elements increase the reflection channel gain, at the cost of a more complex RIS deployment at the HAP. In addition, we can observe that the soft-update parameter strategy obtains a higher system sum rate than the hard-update parameter strategy, since it alleviates the instability of the Q-value network in the learning process and interacts with the environment more flexibly to design the phase shift matrix. It can also be seen that the traditional alternating optimization algorithm obtains a fairly high system sum rate, and the objective function value it obtains does not decrease as the number of elements increases, which also indicates the convergence of the algorithm.
The setting of hyper-parameters has a great impact on performance, such as the stability and convergence speed of the neural networks. This paper therefore also explores the effect of different learning rates on the performance and convergence speed of the model in our proposed DRL framework. The average reward is used to measure performance and is defined as
$$\text{average\_reward}(T_i) = \frac{\sum_{t=1}^{T_i} r(t)}{T_i}, \quad T_i = 1, 2, \ldots, T$$
where T is the maximum number of training steps. All parameters initialized in the training phase and the channel samples are kept the same across the compared settings, and a comparison of the network performance at different learning rates, i.e., 0.01, 0.001, 0.0001 and 0.00001, is shown in Figure 4. Figure 4 shows the average reward versus the time step under different learning rates, and it can be seen that the learning rate setting has a large effect on the performance of the DRL algorithm. For larger learning rates, such as 0.01, the performance of the network decreases, and the convergence is unstable and prone to oscillations. The considered DRL framework with a learning rate of 0.001 performs best, but converges more slowly than the others. As the number of RIS reflection elements increases, the average system reward also increases gradually, as expected from the additional reflection channels, but this does not significantly increase the convergence time of the proposed DRL framework [46]. In summary, DRL-based algorithms, such as the DDPG used in this paper, are very sensitive to the setting of the neural network hyper-parameters, and the optimal hyper-parameter settings can vary considerably for different model structures [47]. Therefore, careful experimental tuning is required to obtain parameter settings that significantly improve the performance and convergence speed of the algorithm [48,49].
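The helper below is a direct transcription of the average-reward definition above into NumPy; the function name is our own.

```python
import numpy as np

def average_reward(rewards):
    # average_reward(T_i) = (1/T_i) * sum_{t=1..T_i} r(t), evaluated for every T_i
    r = np.asarray(rewards, dtype=float)
    return np.cumsum(r) / np.arange(1, len(r) + 1)
```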
Figure 5 compares the average reward performance against the outdated CSI coefficient. In the proposed DRL framework, we choose the CSI of the last moment as the state-space input, and we can see that the average reward of all algorithms decreases gradually as the outdated CSI coefficient decreases. However, the proposed soft-DDPG framework remains at a favorable level compared to the existing scheme and the hard-DDPG scheme. Unlike the DRL schemes, which do not require exact channel model information, the existing alternating optimization scheme relies on knowledge of a static, exact channel model; because of the highly dynamic communication scenario, its system performance is therefore not as good as that of the soft-DDPG and hard-DDPG schemes.

5. Conclusions

This paper discussed the joint optimal design of active transmit beamforming and passive beamforming for maximizing the system sum rate. In RIS-assisted IS-HAP-TNs, it is hard to sense the channel state information accurately and comprehensively in a dynamic environment. On this basis, a novel DRL architecture, namely the soft-DDPG algorithm, is proposed. With the help of the network parameter soft-update strategy, a coordinated phase shift matrix can be obtained even as the number of RIS reflective elements and their amplitudes change. Simulation results show that the proposed framework achieves better network performance with a shorter running time and can be applied to the real-time control of IS-HAP-TN systems.

Author Contributions

M.W., S.Z., C.L., Y.C. and F.Z. conceived and designed the experiments; M.W. performed the experiments; S.Z., C.L. and Y.C. analyzed the data; F.Z. contributed analysis tools; M.W., S.Z. and F.Z. wrote the paper. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Science Foundation of China under Grants 62001517, 61901502 and 62071352, and by the National Postdoctoral Program for Innovative Talents under Grant BX20200101.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. An, K.; Lin, M.; Ouyang, J.; Zhu, W.-P. Secure Transmission in Cognitive Satellite Terrestrial Networks. IEEE J. Sel. Areas Commun. 2016, 34, 3025–3037.
  2. Lin, Z.; Lin, M.; de Cola, T.; Wang, J.-B.; Zhu, W.-P.; Cheng, J. Supporting IoT with Rate-Splitting Multiple Access in Satellite and Aerial-Integrated Networks. IEEE Internet Things J. 2021, 8, 11123–11134.
  3. Liu, R.; Guo, K.; An, K.; Zhu, S.; Shuai, H. NOMA-Based Integrated Satellite-Terrestrial Relay Networks Under Spectrum Sharing Environment. IEEE Wirel. Commun. Lett. 2021, 10, 1266–1270.
  4. Shafin, R.; Chen, H.; Nam, Y.-H.; Hur, S.; Park, J.; Zhang, J.; Reed, J.H.; Liu, L. Self-Tuning Sectorization: Deep Reinforcement Learning Meets Broadcast Beam Optimization. IEEE Trans. Wirel. Commun. 2020, 19, 4038–4053.
  5. Yang, L.; Li, P.; Yang, Y.; Li, S.; Trigui, I.; Ma, R. Performance Analysis of RIS-Aided Networks with Co-Channel Interference. IEEE Commun. Lett. 2021, 26, 49–53.
  6. Luo, H.; Lv, L.; Wu, Q.; Ding, Z.; Al-Dhahir, N.; Chen, J. Beamforming Design for Active IOS Aided NOMA Networks. IEEE Wirel. Commun. Lett. 2022, 12, 282–286.
  7. Niu, H.; Chu, Z.; Zhou, F.; Zhu, Z.; Zhen, L.; Wong, K.-K. Robust Design for Intelligent Reflecting Surface-Assisted Secrecy SWIPT Network. IEEE Trans. Wirel. Commun. 2022, 21, 4133–4149.
  8. Gong, S.; Lu, X.; Hoang, D.T.; Niyato, D.; Shu, L.; Kim, D.I.; Liang, Y.-C. Toward Smart Wireless Communications via Intelligent Reflecting Surfaces: A Contemporary Survey. IEEE Commun. Surv. Tutor. 2020, 22, 2283–2314.
  9. Li, X.; Zhao, M.; Zeng, M.; Mumtaz, S.; Menon, V.G.; Ding, Z.; Dobre, O.A. Hardware Impaired Ambient Backscatter NOMA Systems: Reliability and Security. IEEE Trans. Commun. 2021, 69, 2723–2736.
  10. Wang, J.; Liang, Y.-C.; Han, S.; Pei, Y. Robust beamforming and phase shift design for IRS-enhanced multi-user MISO downlink communication. In Proceedings of the 2020 IEEE International Conference on Communications (ICC), Dublin, Ireland, 7–11 June 2020; pp. 1–6.
  11. Nadeem, Q.-U.; Kammoun, A.; Chaaban, A.; Debbah, M.; Alouini, M.-S. Asymptotic Max-Min SINR Analysis of Reconfigurable Intelligent Surface Assisted MISO Systems. IEEE Trans. Wirel. Commun. 2020, 19, 7748–7764.
  12. Yan, W.; Yuan, X.; Cao, X. Frequency Reflection Modulation for Reconfigurable Intelligent Surface Aided OFDM Systems. IEEE Trans. Wirel. Commun. 2022, 21, 9381–9393.
  13. Wu, Y.; Wang, S.; Luo, J.; Chen, W. Passive Covert Communications Based on Reconfigurable Intelligent Surface. IEEE Wirel. Commun. Lett. 2022, 11, 2445–2449.
  14. Xu, C.; Xiang, L.; An, J.; Dong, C.; Sugiura, S.; Maunder, R.G.; Yang, L.-L.; Hanzo, L. OTFS-Aided RIS-Assisted SAGIN Systems Outperform Their OFDM Counterparts in Doubly Selective High-Doppler Scenarios. IEEE Internet Things J. 2023, 10, 682–703.
  15. Li, X.; Xie, Z.; Chu, Z.; Menon, V.G.; Mumtaz, S.; Zhang, J. Exploiting Benefits of IRS in Wireless Powered NOMA Networks. IEEE Trans. Green Commun. Netw. 2022, 6, 175–186.
  16. Guo, K.; An, K.; Zhang, B.; Huang, Y.; Tang, X.; Zheng, G.; Tsiftsis, T.A. Physical Layer Security for Multiuser Satellite Communication Systems with Threshold-Based Scheduling Scheme. IEEE Trans. Veh. Technol. 2020, 69, 5129–5141.
  17. Pang, X.; Sheng, M.; Zhao, N.; Tang, J.; Niyato, D.; Wong, K.-K. When UAV meets IRS: Expanding air-ground networks via passive reflection. IEEE Wirel. Commun. 2021, 28, 164–170.
  18. Jiao, S.; Fang, F.; Zhou, X.; Zhang, H. Joint Beamforming and Phase Shift Design in Downlink UAV Networks with IRS-assisted NOMA. J. Commun. Inf. Netw. 2020, 5, 138–149.
  19. Zhang, Q.; Saad, W.; Bennis, M. Reflections in the sky: Millimeter wave communication with UAV-carried intelligent reflectors. In Proceedings of the IEEE Global Communications Conference (GLOBECOM), Waikoloa, HI, USA, 9–13 December 2019; pp. 1–6.
  20. Letaief, K.B.; Chen, W.; Shi, Y.; Zhang, J.; Zhang, Y.-J.A. The Roadmap to 6G: AI Empowered Wireless Networks. IEEE Commun. Mag. 2019, 57, 84–90.
  21. Yue, X.; Liu, Y. Performance Analysis of Intelligent Reflecting Surface Assisted NOMA Networks. IEEE Trans. Wirel. Commun. 2022, 21, 2623–2636.
  22. Fozi, M.; Sharafat, A.R.; Bennis, M. Fast MIMO Beamforming via Deep Reinforcement Learning for High Mobility mmWave Connectivity. IEEE J. Sel. Areas Commun. 2022, 40, 127–142.
  23. Li, Y.; Zhang, W.; Wang, C.-X. Deep reinforcement learning for dynamic spectrum sensing and aggregation in multi-channel wireless networks. IEEE Trans. Cogn. Commun. Netw. 2020, 6, 464–475.
  24. Ren, H.; Pan, C.; Wang, L.; Liu, W.; Kou, Z.; Wang, K. Long-Term CSI-Based Design for RIS-Aided Multiuser MISO Systems Exploiting Deep Reinforcement Learning. IEEE Commun. Lett. 2022, 26, 567–571.
  25. Zhang, M.; Fu, S.; Fan, Q. Joint 3D Deployment and Power Allocation for UAV-BS: A Deep Reinforcement Learning Approach. IEEE Wirel. Commun. Lett. 2021, 10, 2309–2312.
  26. Mismar, F.B.; Evans, B.L.; Alkhateeb, A. Deep reinforcement learning for 5G networks: Joint beamforming, power control, and interference coordination. IEEE Trans. Commun. 2020, 68, 1581–1592.
  27. Shafin, R.; Jiang, M.; Ma, S.; Piazzi, L.; Liu, L. Joint Parametric Channel Estimation and Performance Characterization for 3D Massive MIMO OFDM Systems. In Proceedings of the IEEE International Conference on Communications (ICC), Kansas City, MO, USA, 20 May 2018; pp. 1–6.
  28. Tripathi, S.; Pandey, O.J.; Cenkeramaddi, L.R.; Hegde, R.M. Optimal active elements selection in RIS-assisted edge networks for improved QoS. In Proceedings of the 2022 IEEE 12th Sensor Array and Multichannel Signal Processing Workshop (SAM), Trondheim, Norway, 23 June 2022; pp. 21–25.
  29. Luo, H.; Liu, R.; Li, M.; Liu, Y.; Liu, Q. Joint Beamforming Design for RIS-Assisted Integrated Sensing and Communication Systems. IEEE Trans. Veh. Technol. 2022, 71, 13393–13397.
  30. Guo, K.; An, K. On the Performance of RIS-Assisted Integrated Satellite-UAV-Terrestrial Networks with Hardware Impairments and Interference. IEEE Wirel. Commun. Lett. 2022, 11, 131–135.
  31. Peng, Z.; Zhang, Z.; Kong, L.; Pan, C.; Li, L.; Wang, J. Deep Reinforcement Learning for RIS-Aided Multiuser Full-Duplex Secure Communications with Hardware Impairments. IEEE Internet Things J. 2022, 9, 21121–21135.
  32. Feng, K.; Wang, Q.; Li, X.; Wen, C. Deep reinforcement learning based intelligent reflecting surface optimization for MISO communication systems. IEEE Wirel. Commun. Lett. 2020, 9, 745–749.
  33. González-Ovejero, D.; Yurduseven, O.; Chattopadhyay, G.; Chahat, N. Metasurface antennas: Flat antennas for small satellites. In CubeSat Antenna Design; Wiley: Hoboken, NJ, USA, 2020; pp. 255–313.
  34. Yurduseven, O.; Podilchak, S.; Khalily, M. Towards holographic beam-forming metasurface technology for next generation CubeSats. In Proceedings of the International Conference on UK-China Emerging Technologies (UCET), Glasgow, UK, 21 August 2020; pp. 1–4.
  35. Wang, J.; Li, Y.; Jiang, Z.H.; Shi, T.; Tang, M.-C.; Zhou, Z.; Chen, Z.N.; Qiu, C.-W. Metantenna: When Metasurface Meets Antenna Again. IEEE Trans. Antennas Propag. 2020, 68, 1332–1347.
  36. Rotshild, D.; Abramovich, A. Wideband reconfigurable entire Ku-band metasurface beam-steerable reflector for satellite communications. IET Microw. Antennas Propag. 2019, 13, 334–339.
  37. Huang, C.; Mo, R.; Yuen, C. Reconfigurable Intelligent Surface Assisted Multiuser MISO Systems Exploiting Deep Reinforcement Learning. IEEE J. Sel. Areas Commun. 2020, 38, 1839–1850.
  38. Yang, L.; Zhu, Q.; Li, S.; Ansari, I.S.; Yu, S. On the Performance of Mixed FSO-UWOC Dual-Hop Transmission Systems. IEEE Wirel. Commun. Lett. 2021, 10, 2041–2045.
  39. Yang, L.; Meng, F.; Wu, Q.; da Costa, D.B.; Alouini, M.-S. Accurate closed-form approximations to channel distributions of RIS-aided wireless systems. IEEE Wirel. Commun. Lett. 2020, 9, 985–989.
  40. He, J.; Leinonen, M.; Wymeersch, H.; Juntti, M. Channel estimation for RIS-aided mmWave MIMO systems. In Proceedings of the GLOBECOM 2020—2020 IEEE Global Communications Conference, Taipei, Taiwan, 7 December 2020; pp. 1–6.
  41. Yan, X.; Xiao, H.; An, K.; Zheng, G.; Chatzinotas, S. Ergodic Capacity of NOMA-Based Uplink Satellite Networks with Randomly Deployed Users. IEEE Syst. J. 2020, 14, 3343–3350.
  42. He, Y.; Cai, Y.; Mao, H.; Yu, G. RIS-Assisted Communication Radar Coexistence: Joint Beamforming Design and Analysis. IEEE J. Sel. Areas Commun. 2022, 40, 2131–2145.
  43. Li, X.; Xie, Z.; Huang, G.; Zhang, J.; Zeng, M.; Chu, Z. Sum Rate Maximization for RIS-Aided NOMA with Direct Links. IEEE Netw. Lett. 2022, 4, 55–58.
  44. Bletsas, A.; Shin, H.; Win, M.Z. Cooperative communication with outage-optimal opportunistic relaying. IEEE Trans. Wirel. Commun. 2007, 6, 3450–3460.
  45. Guo, K.; Lin, M.; Zhang, B.; Zhu, W.-P.; Wang, J.-B.; Tsiftsis, T.A. On the Performance of LMS Communication with Hardware Impairments and Interference. IEEE Trans. Commun. 2019, 67, 1490–1505.
  46. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971.
  47. Sun, H.; Chen, X.; Shi, Q.; Hong, M.; Fu, X.; Sidiropoulos, N.D. Learning to Optimize: Training Deep Neural Networks for Interference Management. IEEE Trans. Signal Process. 2018, 66, 5438–5453.
  48. Zhou, F.; Lu, G.; Wen, M.; Liang, Y.-C.; Chu, Z.; Wang, Y. Dynamic spectrum management via machine learning: State of the art, taxonomy, challenges, and open research issues. IEEE Netw. 2019, 33, 54–62.
  49. Zhang, R.; Gao, F.; Liang, Y.-C. Cognitive beamforming made practical: Effective interference channel and learning-throughput tradeoff. IEEE Trans. Commun. 2010, 58, 706–718.
Figure 1. Illustration of a RIS-assisted IS-HAP-TNs system.
Figure 2. The DRL-based active transmit matrix and phase shift design framework using DDPG.
Figure 3. Sum rate performance relative to the increasing number of elements on the RIS.
Figure 4. Variation of the average reward under different learning rates.
Figure 5. Average reward against the last-moment CSI coefficient.
Table 1. System and DNN parameters.

System Parameters:
Frequency band: f = 2 GHz
Wavelength: λ = 150 mm
Noise power spectral density: −169 dBm/Hz
Link bandwidth: W = 15 MHz
Noise temperature: T = 300 K
Height of HAP: 20 km
Number of UEs: K = 3
Transmission paths: L = 3

DNN Hyperparameters in DDPG:
Reward discount rate: 0.99
Mini-batch size: 16
Learning rate: 0.0001
Decaying rate: 0.0001
Experience replay buffer size: 100,000
Steps per training episode: 10,000