Joint Data Transmission and Energy Harvesting for MISO Downlink Transmission Coordination in Wireless IoT Networks

Liu, Jain-Shing; Lin, Chun-Hung; Hu, Yu-Chen; Donta, Praveen Kumar

doi:10.3390/s23083900

Open AccessArticle

Joint Data Transmission and Energy Harvesting for MISO Downlink Transmission Coordination in Wireless IoT Networks

¹

Department of Computer Science and Information Engineering, Providence University, Taichung 433, Taiwan

²

Department of Computer Science and Engineering, National Sun Yat-sen University, Kaohsiung 804, Taiwan

³

Department of Computer Science and Information Management, Providence University, Taichung 433, Taiwan

⁴

Research Unit of Distributed Systems, TU Wien, 1040 Vienna, Austria

^*

Author to whom correspondence should be addressed.

Sensors 2023, 23(8), 3900; https://doi.org/10.3390/s23083900

Submission received: 24 February 2023 / Revised: 7 April 2023 / Accepted: 9 April 2023 / Published: 11 April 2023

(This article belongs to the Special Issue Wireless Sensing and Networking for the Internet of Things: 2nd Edition)

Download

Browse Figures

Versions Notes

Abstract

:

The advent of simultaneous wireless information and power (SWIPT) has been regarded as a promising technique to provide power supplies for an energy sustainable Internet of Things (IoT), which is of paramount importance due to the proliferation of high data communication demands of low-power network devices. In such networks, a multi-antenna base station (BS) in each cell can be utilized to concurrently transmit messages and energies to its intended IoT user equipment (IoT-UE) with a single antenna under a common broadcast frequency band, resulting in a multi-cell multi-input single-output (MISO) interference channel (IC). In this work, we aim to find the trade-off between the spectrum efficiency (SE) and energy harvesting (EH) in SWIPT-enabled networks with MISO ICs. For this, we derive a multi-objective optimization (MOO) formulation to obtain the optimal beamforming pattern (BP) and power splitting ratio (PR), and we propose a fractional programming (FP) model to find the solution. To tackle the nonconvexity of FP, an evolutionary algorithm (EA)-aided quadratic transform technique is proposed, which recasts the nonconvex problem as a sequence of convex problems to be solved iteratively. To further reduce the communication overhead and computational complexity, a distributed multi-agent learning-based approach is proposed that requires only partial observations of the channel state information (CSI). In this approach, each BS is equipped with a double deep Q network (DDQN) to determine the BP and PR for its UE with lower computational complexity based on the observations through a limited information exchange process. Finally, with the simulation experiments, we verify the trade-off between SE and EH, and we demonstrate that, apart from the FP algorithm introduced to provide superior solutions, the proposed DDQN algorithm also shows its performance gain in terms of utility to be up to 1.23-, 1.87-, and 3.45-times larger than the Advantage Actor Critic (A2C), greedy, and random algorithms, respectively, in comparison in the simulated environment.

Keywords:

IoT; SWIPT; joint optimization; beamforming; power control; energy harvesting; transmission coordination; deep reinforcement learning

1. Introduction

Given the explosive growth of smart phones and other new applications that result in huge amounts of data transmission apart from the conventional telephone voice service, the massive Internet of Things (IoT) is currently facing significant challenges, such as achieving intelligent implementations [1] and ensuring secure and trustworthy operations [2]. To address these challenges, technologies, such as semi-federated learning [1] and blockchain [2], can be employed. Cellular-based mobile networks will continue to play a crucial role in the development of fifth-generation (5G) and beyond 5G (B5G) wireless communications for IoT, enabling innovative solutions to these challenges.

In such networks, frequency bands are usually reused to mitigate inter-cell interference. Herein, a frequency band shared by all cells is usually considered to have a harmful impact on communication. However, owing to the excessive increase of data traffic, such sharing becomes a possible solution to the problem of scarce radio resources to be used in ultra-dense cellular networks. For this, coordinated multi-point (CoMP) [3] is a promising concept to manage the resulting interference. Specifically, if each BS in the cellular network can perform downlink beamforming [4] for transmitting to its UE appropriately, the intra-cell and inter-cell interference would be mitigated. Given the significant advantage, CoMP is included in the specifications of long term evolution-advanced (LTE-A) [5].

Apart from the interference issue, user equipment (UE) in 5G or B5G is still energy-constrained due to its battery with limited capacity, which is especially true for low-power IoT devices acting as femto UEs within these networks. Despite the slow progress of the battery capacity in recent decades, energy harvesting techniques have emerged to address the crucial issue. As expected, various renewable energy resources could be adopted to refill batteries, such as wind and solar, but their usability is restricted to weather, position, and many other conditions.

In view of these problems, the radio frequency (RF)-based wireless energy transfer (WET) technique would be an alternative that can charge low-power devices over the air, simplify the maintenance procedure, and significantly contribute to the realization of scalable wireless networks [6]. As an extension, WET combined with the wireless network for transmitting information by default results in simultaneous wireless information and power transfer (SWIPT), which enables a UE to harvest energy from the electromagnetic waves in RF from its surroundings while it simultaneously performs information decoding (ID) for the data transmitted from its source [7,8].

1.1. Related Work

Based on SWIPT, many related works have been performed. Among them, a pioneering work [9] with a multi-antenna BS transmitting to its UE in downlink was proposed that provides the rate-energy trade-offs for the broadcast SWIPT system involved. In addition, it is shown that each UE can perform ID and EH at the same time with a power splitting (PS) scheme or at different time slots with a time switching (TS) scheme. As an extension of TS, the authors in [10] proposed two new time-splitting schemes, namely time-division mode switching (TDMS) and time-division multiple access (TDMA) for a multi-input single-output (MISO) interference channel (IC). With the possibility of simplifying the receiver design, TS, however, does not actually perform ID and EH simultaneously and would only provide limited exploitation of radio resources [11,12], which motivates the use of PS in this work.

As an example adopting PS, the work [13] resolves a throughput maximization problem subject to energy and temperature constraints at transmitting and receiving nodes, respectively, for a hybrid SWIPT relay system. Extending its viewpoint beyond throughput, the work [14] addresses a fundamental problem to characterize the trade-offs for maximizing energy efficiency (EE) vs. spectrum efficiency (SE) under a point-to-point additive white Gaussian noise (AWGN) channel.

In addition, with respect to orthogonal frequency division multiple access (OFDMA) systems, the related work [15] considered a resource-allocation problem to maximize EE in SWIPT with a PS scheme, and developed fractional programming models and sub-optimal iterative resource allocation algorithms to tackle the nonconvex problems encountered. In [16], with the assumption of using zero-forcing (ZF) beamforming patterns (BPs), the authors aimed to maximize EE under a PS-based MISO downlink system. In [17], a multi-user MISO SWIPT system was considered, and an iterative algorithm was proposed, which is guaranteed to achieve a Karush–Kuhn–Tucker solution for maximizing the EE of this system. Similarly, by focusing on wireless sensor networks, the authors in [18] tackled nonconvex EE optimization problems and proposed sub-optimal iterative algorithms through nonlinear fractional programming and Lagrangian dual decomposition.

Apart from the above, different EH-enabled frameworks can be also found in the literature. For example, the authors in [19] proposed a MOO formulation for a multi-pair two-way relay network to maximize the achievable rates of all K UE pairs involved. In that work, by using zero-forcing to null the multi-user interference, the achievable rate of a UE pair can only depend on their own PRs, and the MOO problem can be converted to K independent single objective optimization problems. Thus, the trade-off on data rate can be made between UE pairs. However, in our MOO formulation, the inter-cell interference would be involved, and the trade-off between SE and EH in the system is mainly considered rather than the trade-off for the rate between UE pairs in [19].

As another example, a wirelessly powered IoT system was also investigated in [20], wherein sensors harvested energies from the distributed access points (APs) and then transmitted data to the APs with the harvested energies. Although this is different from the SWIPT scenario considered here, how to extend the current work based on the results in [20] giving a higher WET efficiency could be an interesting future work. To see more related works on SWIPT, WET, or both, one may refer to survey papers, such as [21,22].

Despite the various mathematical approaches adopted in the related works that we mentioned, the computational complexity of mobile wireless network has made it impossible to decide all the system parameters required in time. To meet the time constraint, deep learning is a promising data-driven approach that adopts a deep neural network (DNN) to resolve complex nonlinear problems without explicitly formulating complicated mathematical models [23]. Recently, DNN-based learning algorithms have also been developed to resolve different problems in SWIPT-enabled networks as another way to find the solutions in time apart from the analytical-based methods under consideration, which may be not sufficiently time-efficient in usual cases.

As a method based on learning with DNN, the work in [24] proposed a long short-term memory (LSTM) recurrent neural network (RNN)-based mode-switching algorithm to maximize the achievable rate under the energy-causality constraint for its dual mode SWIPT system. In [25], the authors determine the subchannel allocation, power splitting ratio (PR), and transmit power for the SWIPT-based device-to-device (D2D) networks through the deep-reinforcement-learning (DRL)-based algorithm developed therein. For similar D2D SWIPT-based networks, an EE optimization problem was formulated in [26], and the authors adopted exhaustive search (ES) and gradient search (GS), respectively, to obtain the global optimum and local optimum for the formulated nonconvex optimization problem.

In [27], by clustering the antennas into two multiple-input multiple-output (MIMO) subsystems, the authors developed a sub-optimal method and a hybrid DRL method to resolve the combinatorial problem for the full-duplex MIMO system involved, which jointly optimized the antenna clusters and pre-coding matrices for ID and EH so that the weighted sum of their performance metrics can be maximized. In [28], with the multi-user MISO SWIPT-enabled heterogeneous wireless networks as the target, the authors maximized the achievable sum information rate of the femtocells by jointly optimizing BP and PR under the achievable data rate requirements through a multi-agent DDQN algorithm.

1.2. The Motivations and Characteristics of This Work

Taking both ID and EH into account, the previous works on SWIPT usually focused on throughput maximization [10,13], EE optimization [15,16,18], or both [14]. As a complement to the above, our work concerns the trade-off between SE and EH in the SWIPT-enabled networks with MISO channels, which is similar to the objective given in [29] for D2D networks without BP decision.

However, the objective considered here is to decide both BP and PR, and our work further reveals that, in addition to the interference management concerned by CoMP, the decisions on BP and PR in SWIPT lead to an overall system utility reflecting both SE and EH with weights to achieve the optimal trade-off subject to the transmit power constraint and the feasible PR constraint. As we know, such a trade-off for the coordinated beamforming in the MISO downlink SWIPT-enabled networks with FP and DRL under the logarithmic nonliner EH model [30,31] is not explicitly explored in the previous works. Specifically, the contributions of this work can be summarized as follows.

We derive a multi-objective optimization (MOO) formulation to obtain the optimal BP and PR for the MISO downlink SWIPT-enabled wireless networks under the logarithmic nonliner EH model. Then, with a weighted sum approach, we transform this formulation to obtain an objective function for the resulting multiple-ratio FP problem.
To solve the non-convex FP problem, instead of using the Dinkelbach’s transformation that is usually considered, we develop an evolutionary algorithm (EA)-aided quadratic transform technique that can obtain the desired PR with EA first, and then feed it to an effective iterative algorithm for near-optimal solutions.
To further reduce the computational complexity while avoiding the collection of global channel state information (CSI), we propose a distributed multi-agent learning-based approach that requires only partial observations of CSI. Specifically, we develop a multi-agent double DQN (DDQN) algorithm for each BS to decide its BP and PR based only on local observations with lower overheads of communication and computation.
Instead of centralized operations, such as centralized training centralized executing (CTCE) and centralized training distributed executing (CTDE), we adopt a distributed training distributed executing (DTDE) scheme, which makes the offline training and online decision making performed by each single agent or BS distributive and independent and limits the amount of information to be exchanged between neighboring BSs.
We verify the trade-off between SE and EH with simulations and show that our proposal can outperform the state-of-the-art centralized learning-based algorithm, Advantage Actor Critic (A2C), and baseline approaches, such as greedy and random algorithms. More specifically, it can be seen that, in addition to the introduced FP algorithm to provide superior solutions, the proposed DDQN algorithm can also show its performance gain in terms of utility up to 1.23-, 1.87-, and 3.45-times larger than the A2C, greedy, and random algorithms, respectively, in comparison.

The rest of this paper is structured as follows. In Section 2, we introduce the network, channel model, and problem formulation for this work. Next, we present the EA-aided quadratic transform technique and the FP-based iterative algorithm in Section 3. Then, the limited channel information exchange mechanism is summarized in Section 4, and the distributed multi-agent learning-based DDQN approach is introduced in Section 5. After that, the proposed algorithms are numerically examined in Section 6 to show the trade-offs between SE and EH and their performance differences when compared with other DRL-based algorithms and baseline approaches. Finally, our conclusions are drawn in Section 7.

2. System Model and Problem Formulation

2.1. Network and Channel Models

As an example shown in Figure 1, the downlink wireless network in question is composed of L cells, and, in each cell, there is a BS equipped with

N_{t}

antennas to transmit to a single-antenna IoT-UE (or UE for short in the sequel). In fact, each cell can support multiple UEs by using orthogonal frequency bands; thus, no intra-cell interference is considered here. However, as noted previously, a frequency band shared by all the cells involved is possible, and inter-cell interference would be concerned. Consequently, when focusing on a frequency band adopted, we can model the channel of this system as multi-cell MISO-IC, in which the received signal at the UE associated with i-th BS (or say, direct link i) at time t can be formulated as

y_{i} (t) = h_{i, i}^{†} (t) ω_{i} (t) x_{i} (t) + \sum_{j \neq i} h_{i, j}^{†} ω_{j} (t) x_{j} (t) + n_{i} (t)

(1)

where

x_{i}

and

x_{j}

are the transmitted signals from BS i and BS j, and their transmit powers

P_{i}

and

P_{j}

would satisfy the power constraints

E {| x_{i} |} = P_{i}

and

E {| x_{j} |} = P_{j}

, respectively. In addition,

h_{i, i} (t)

and

ω_{i} (t) \in C^{N_{t} \times 1}

denote, respectively, the downlink channel vector and BP of BS i toward its UE during time slot t, while

h_{i, j} (t)

and

ω_{j} (t) \in C^{N_{t} \times 1}

represent the cross-link channel between UE i and BS j, and BP of BS j, respectively. Finally,

n_{i} \in CN (0, σ^{2})

is the overall noise at UE i.

In the above, we assume that

N_{t}

antennas of each BS are arranged as a uniform linear array (ULA). In addition, similar to [9,11,25,26], we consider that each UE i with PS on the received signal

y_{i} (t)

can simultaneously perform ID and EH as shown in Figure 1. Specifically, with

θ_{i} (t) \in (0, 1)

to denote the PR adopted by UE i at time t, the instantaneous signal to interference and noise ratio (SINR) for ID can be formulated as [32]:

Y_{i} (t) = (1 - θ_{i} (t)) \frac{| h_{i, i}^{†} (t) ω_{i} {(t) |}^{2}}{\sum_{j \neq i} {| h_{i, j}^{†} ω_{j} (t) |}^{2} + σ^{2}}

(2)

Consequently, the achievable data rate of UE i would be

C_{i}^{d} (t) = log (1 + Y_{i} (t))

(3)

On the other hand, the signal split for EH can be denoted by

y_{i}^{E H} (t) = \sqrt{θ_{i} (t)} (h_{i, i}^{†} (t) ω_{i} (t) x_{i} (t) + \sum_{j \neq i} h_{i, j}^{†} ω_{j} (t) x_{j} (t) + n_{i} (t))

(4)

Given this, the conventional works, such as [9,11,25,33,34], usually convert the received signal

y_{i}^{E H} (t)

into the DC power with a linear function. However, a nonlinear function for the energy conversion would be more practical, and the previous works, such as [30,31], adopted the logarithmic nonliner EH model for the ith IoT device on the jth sub-carrier as follows:

e_{i}^{h} (t) = a_{i} log (1 + b_{i} {| h_{i, j} |}^{2} p_{i, j})

(5)

where

a_{i}

and

b_{i}

are the nonlinear model parameters, and

p_{i, j}

is the transmission power for the ith device on the jth sub-carrier with the assumption that the noise power is negligible [30,31]. Following the model without its assumption, this work considers

h_{i, j}

as the channel between UE i and BS j on the same frequency band as shown previously, and in terms of these notations, the energy harvested through the split part for EH would be denoted by

E_{i}^{h} (t) = θ_{i} (t) (a_{i} log (1 + b_{i} (\sum_{\forall j} | h_{i, j}^{†} ω_{j} (t) |^{2} + σ^{2})))

(6)

2.2. Multi-Objective Optimization

Based on the model with SWIPT, our aim is to jointly optimize BP and PR to obtain the maximal SE and EH simultaneously subject to the transmit power constraint and the feasible PR constraint in the MISO downlink network, which can be classified as a MOO problem. As noted in [35], MOO refers to as a type of optimization that involves multiple objective functions to be optimized simultaneously. In general, a nontrivial MOO problem does not have a single solution to concurrently optimize each of the objective functions involved. In such a general case known as conflicting, a Pareto optimization solution is usually pursued wherein none of the objective functions can be improved without degrading some of the other objectives in value. More specifically, it can be defined as a maximization problem as follows [35]:

Definition 1.

Given

f_{i} \in C \to R, 1 \leq i \leq I

, and

X

being the feasible set of constraints, a multi-objective optimization problem can be represented by

\begin{matrix} \begin{matrix} \overset{_{maximize}}{_{x}} & f (x) = (f_{1} (x), \dots, f_{I} (x)) \\ subject to & x \in X \end{matrix} \end{matrix}

(7)

For such an optimization problem, there may be feasible solutions to be obtained, which are denoted by

Y = f (x)

. In particular, these solutions are considered efficient if they satisfy the following definition (as Definition 2.1 of [35]):

Definition 2.

A point

x \in X

is called Pareto optimal if there does not exist other

x^{'} \in X

such that

f (x^{'}) ⪰ f (x)

, where ⪰ denotes the component-wise inequality.

In some cases, it would be easier to find the solutions that are called weakly Pareto optimal for the problems to be relaxed. Consequently, the following definition (as in Definition 2.24 of [35]) could be considered more often:

Definition 3.

A point

x \in X

is called weakly Pareto optimal if there does not exist other

x^{'} \in X

such that

f (x^{'}) ≻ f (x)

, where ≻ denotes the strict component-wise inequality.

Given these definitions, relevant works could aim to find (weakly) Pareto optimal points or solutions of their MOO problems. Similarly, for our problem, the weighted sum method exemplifying a simple scalarization technique as typically adopted is considered here and can collapse the vector objective into a single-objective sum as

\begin{matrix} \overset{_{maximize}}{_{_{x \in X}}} & \sum_{i = 1}^{I} W_{i} f_{i} (x) \end{matrix}

(8)

where each

W_{i}, 1 \leq i \leq I

denotes a non-negative real-valued weight for function

f_{i}

. In particular, as noted in Proposition 3.9 of [35], the optimal solution of problem (8) and the Pareto optimal points of problem (7) have the following relationship:

Proposition 1.

If

x^{*}

is an optimal solution of problem (8), then

x^{*}

is weakly efficient for the MOO problem (7).

2.3. Problem Formulation

As shown above, the MOO problem in question is to simultaneously maximize SE and EH from L cells in the MISO downlink network by jointly optimizing BP

\{ω_{i}\}

and PR

\{θ_{i}\}, \forall i

, subject to the transmit power constraint, and the feasible constraint for PR. Specifically, to simplify our representation in the following, this MOO problem is formulated without the time index t as follows:

\begin{matrix} \begin{matrix} \overset{_{\max}}{_{ω_{i}, θ_{i} \forall i}} & (C^{d} (Ω, θ), E^{h} (Ω, θ)) & (a) \\ subject to & P_{m i n} \leq | | ω_{i} {| |}^{2} \leq P_{m a x}, & \forall i & (b) \\ 0 \leq θ_{i} \leq 1, & \forall i & (c) \end{matrix} \end{matrix}

(9)

where the sum of the data rates and that of the harvested energies, i.e.,

C^{d} (Ω, θ) = \sum_{\forall i} C_{i}^{d}

and

E^{h} (Ω, θ) = \sum_{\forall i} E_{i}^{h}

in (9a), are the two objective functions to be maximized with

Ω = \{ω_{1}, ω_{2}, \dots, ω_{L}\}

and

θ = \{θ_{1}, θ_{2}, \dots, θ_{L}\}

. Apart from the above, it is worth noting that, due to the MOO formulation to maximize the metrics concurrently, no minimum data rate and harvested power are required to be the constraints for each cell involved. Instead, it applies (9b) to enforce that the transmit power

P_{i}

should be given within the range between the minimum transmit power

P_{m i n}

and the maximum transmit power

P_{m a x}

and uses (9c) to ensure

θ_{i}

is a nonnegative real number that is no larger than 1.

By means of the weighted sum approach (8) introduced in Section 2.2, which can produce a single-objective sum for the vector objective, the objective of this MOO problem (9) is represented here by

U (Ω, θ) = W \frac{C^{d} (Ω, θ)}{\bar{C^{d}}} + (1 - W) \frac{E^{h} (Ω, θ)}{\bar{E^{h}}}

(10)

where

W = W_{i} \in (0, 1], \forall i

represents the weight for all the cells or BSs. Clearly, it determines the importance between SE and EH in the system objective. In addition,

\bar{C^{d}}

and

\bar{E^{h}}

denote the estimated maximal values of

C^{d} (Ω, θ)

and

E^{h} (Ω, θ)

, respectively, which could be obtained from the initial phase with random

Ω

and

θ

, which is performed many times in our simulation. These values are utilized here to normalize the two metrics (the data rate and the harvested energy) lying in very different numerical scales. Thus, even with only their estimations, the resulting utility could still be fine-tuned by adjusting W and

1 - W

in (10) to meet the specific balance requirements from users on these metrics if required.

3. Fractional Programming-Based Approach

In this work, instead of using the classic Dinkelbach’s transformation [36] that is typically adopted for single-ratio FP problems, we adopt the quadratic transform technique developed in [37] for multi-ratio FP problems. Specifically, for the first objective in (10) aiming at SE, which involves SINR with fractional terms in the logarithm function, we adopt a Lagrangian dual reformulation with a set of dual or auxiliary variables

γ = \{γ_{1}, γ_{2}, \dots, γ_{L}\}

. According to Proposition 2 of [37], the SE objective can be reformulated as

U^{S E} (Ω, θ) = \sum_{i} (W_{1} log (1 + (1 - θ_{i}) γ_{i}) - W_{1} (1 - θ_{i}) γ_{i} + \frac{W_{1} (1 + (1 - θ_{i}) γ_{i}) {| h_{i, i}^{†} ω_{i} |}^{2}}{\sum_{j} {| h_{i, j}^{†} ω_{j} |}^{2} + σ^{2}})

(11)

where

W_{1} = W / \bar{C^{d}}

, and this ignores the time index t as noted previously. Then, by taking partial differentiation with respect to

γ_{i}

and leading the result to zero, i.e.,

\frac{\partial U^{S E}}{\partial γ_{i}} = 0

, we can obtain the optimal dual variable for SE as

γ_{i} = (\frac{| h_{i, i}^{†} ω_{i} |^{2}}{\sum_{j \neq i} {| h_{i, j}^{†} ω_{j} |}^{2} + σ^{2}}) / (1 - θ_{i})

(12)

On the other hand, the EH objective in (10) can be also denoted by

W_{2} (log (1 + b_{i} (\sum_{\forall j} | h_{i, j}^{†} ω_{j} (t) |^{2} + σ^{2})))

with

W_{2} = (1 - W) a_{i} θ_{i} / \bar{E^{h}}

. Then, as the SE counterpart, we can conduct a set of dual variables

α = \{α_{1}, α_{2}, \dots, α_{L}\}

, and apply the transform similar to that in Proposition 2 of [37] to reformulate the EH objective as

U^{e h} (Ω, θ) = \sum_{i} (W_{2} log (1 + α_{i}) - W_{2} α_{i} + \frac{W_{2} (1 + α_{i}) b_{i} (\sum_{j} | h_{i, j}^{†} ω_{j} |^{2} + σ^{2})}{b_{i} (\sum_{j} | h_{i, j}^{†} ω_{j} |^{2} + σ^{2}) + 1})

(13)

Similarly, by

\frac{\partial U^{e h}}{\partial α_{i}} = 0

, the optimal dual variable for EH with respect to i can be given by

α_{i} = b_{i} (\sum_{j} | h_{i, j}^{†} ω_{j} |^{2} + σ^{2})

(14)

However, for the consistency with

U^{S E}

, we adopt

U^{E H} \approx U^{e h}

to have the same denominator in the last term of

U^{S E}

as follows:

U^{E H} (Ω, θ) = \sum_{i} (W_{2} log (1 + α_{i}) - W_{2} α_{i} + \frac{W_{2} (1 + α_{i}) b_{i} (\sum_{j} | h_{i, j}^{†} ω_{j} |^{2} + σ^{2}) - 1}{b_{i} (\sum_{j} | h_{i, j}^{†} ω_{j} |^{2} + σ^{2})})

(15)

Finally, (11) and (15) can be combined, leading to the new overall utility as

\bar{U} (Ω, θ) = \frac{(W_{1} (1 + (1 - θ_{i}) γ_{i}) + W_{2} (1 + α_{i})) {| h_{i i}^{†} ω_{i} |}^{2}}{\sum_{j} {| h_{i, j}^{†} ω_{j} |}^{2} + σ^{2}} + C

(16)

where

C = C_{1} + C_{2} + C_{3}

is the independent part that does not directly relate to the transmit signal

h_{i, i}^{†} ω_{i}

in the numerator of (20), including

\begin{matrix} C_{1} & = & W_{1} log (1 + (1 - θ_{i}) γ_{i}) - W_{1} (1 - θ_{i}) γ_{i} \end{matrix}

(17)

\begin{matrix} C_{2} & = & {\hat{W}}_{2} log (1 + α_{i}) - W_{2}^{'} α_{i} \end{matrix}

(18)

\begin{matrix} C_{3} & = & \frac{W_{2} (1 + α_{i}) (\sum_{j \neq i} {| h_{i, j}^{†} ω_{j} |}^{2} + σ^{2} - 1 / b)}{\sum_{j} {| h_{i, j}^{†} ω_{j} |}^{2} + σ^{2}} \end{matrix}

(19)

where

{\hat{W}}_{2} = W_{2} / b

, and

b = b_{i}, \forall i

. However, with the signal from BS i to its receiver, i.e.,

h_{i, i}^{†} ω_{i}

, as the major part to be optimized, this formulation would lead to a BP focusing on the data rate to its receiver while ignoring the interference powers from the others to be harvested. To resolve this problem, the numerator part of

C_{3}

is modified to account for the powers transmitted from BS i to the others as

W_{2} (1 + α_{i}) (\sum_{j \neq i} {| h_{j, i}^{†} ω_{i} |}^{2} + σ^{2} - 1)

rather than the powers received from the others that cannot be controlled by BS i itself in the original form. Consequently, the overall utility function is modified as

\hat{U} (Ω, θ) = \frac{W_{1} (1 + (1 - θ_{i}) γ_{i}) {| h_{i i}^{t} ω_{i} |}^{2} + W_{2} (1 + α_{i}) (\sum_{j \neq i} {| h_{j, i}^{†} ω_{i} |}^{2} + σ^{2} - 1)}{\sum_{j} {| h_{i, j}^{†} ω_{j} |}^{2} + σ^{2}} + \hat{C}

(20)

where

\hat{C} = C_{1} + C_{2} - {\hat{W}}_{2} (1 + α_{i}) / (\sum_{j} | h_{i, j}^{†} ω_{j} |^{2} + σ^{2})

is not directly related to the transmit signals,

h_{j, i}^{†} ω_{i}, \forall j

, of BS i. Then, by using the quadratic transform in the multidimensional and complex case in Theorem 2 of [37] on the UE part and the SE part of (20) without

\hat{C}

, respectively, we have the system objective as

\begin{matrix} \hat{Q} (Ω, θ) & = & \sum_{i = 1}^{L} (2 (\sqrt{W_{1} (1 + (1 - θ_{i}) γ_{i})} Re \{ω_{i}^{†} h_{i, i}^{†} y_{i}\} + \\ \sqrt{W_{2} (1 + α_{i})} \sum_{j} Re \{ω_{i}^{†} h_{j, i}^{†} y_{i}\}) - y_{i}^{†} (σ^{2} I + \sum_{j \neq i} h_{i, j} ω_{j} ω_{j}^{†} h_{i, j}^{†}) y_{i}) \end{matrix}

(21)

where

y_{i}

is the dual variable in this case. Essentially, the objective is developed to facilitate solving this problem iteratively. That is, when

Ω

and the other variables are fixed, the optimal

y_{i}

can be found by solving the first-order optimality, i.e.,

\frac{\partial \hat{Q}}{\partial y_{i}} = 0

, and the result is

y_{i} = {(σ^{2} \hat{I} + \sum_{j} h_{i, j} ω_{j} ω_{j}^{†} h_{i, j}^{†})}^{- 1} (\sqrt{W_{1} (1 + (1 - θ_{i}) γ_{i})} h_{i, i} ω_{i} + \sqrt{W_{2} (1 + α_{i})} \sum_{j} h_{j, i} ω_{i})

(22)

Similarly, the optimal

ω_{i}

can be obtained by

ω_{i} = {(η_{i} \hat{I} + \sum_{j} h_{j, i}^{†} y_{j} y_{j}^{†} h_{j, i})}^{- 1} (\sqrt{W_{1} (1 + (1 - θ_{i}) γ_{i})} h_{i, i}^{†} y_{i} + \sqrt{W_{2} (1 + α_{i})} \sum_{j} h_{j, i}^{†} y_{i})

(23)

In the above,

η_{i}

is the dual variable introduced for the power constraint, and its optimal value can be denoted by

η_{i} = min \{η_{i} \geq 0 : P_{m i n} \leq | | ω_{i} (η_{i}) {| |}^{2} \leq P_{m a x}\}

(24)

which can be efficiently determined by means of a bisection search algorithm.

Apart from the above, it can also be seen that the formulations for

γ_{i}, α_{i}, y_{i}

, and

ω_{i}

explored so far all involve

θ_{i}

. In fact,

θ_{i}

is highly coupled among these formulas, and could not be easily resolved through them. For the resulting non-convexity, we resort to evolutionary algorithms (EAs) to find its value to approach the overall optimal solution. Specifically, we develop a simulated annealing (SA) algorithm for this aim as was implemented in [38]. Given this, the FP algorithm to maximize the objective (21) is summarized in Algorithm 1.

Algorithm 1 EA-aided FP algorithm.

1:: Provide $ℓ_{m}$ , $ℓ_{η}$ , $δ$ , $P_{m i n}$ , and $P_{m a x}$ ;
2:: Initialize $θ, γ, α, y, ω$ , $η_{m i n}$ , $η_{m a x}$ , and set $c_{1} = 0$ ;
3:: repeat
4:: Obtain $θ$ with SA;
5:: Update $y_{i}, \forall i$ , with (22) while fixing $γ$ , $α$ , $ω$ , and $θ$ ;
6:: for each BS or direct link i do
7:: Set $c_{2} = 0$ , $\bar{P_{m i n}} = P_{m i n}$ , and $\bar{P_{m a x}} = P_{m a x}$ ;
8:: while $| \bar{P_{m a x}} - \bar{P_{m i n}} | > δ$ and $c_{2} \leq ℓ_{η}$ do
9:: Obtain $ω_{m i n}$ and $ω_{m a x}$ with $η_{m i n}$ and $η_{m a x}$ , respectively, through (23) while fixing $γ$ , $α$ , $y$ , and $θ$ ;
10:: Let $\bar{P_{m i n}} = | | ω_{m i n} {| |}^{2}$ and $\bar{P_{m a x}} = | | ω_{m a x} {| |}^{2}$ ;
11:: Let $η_{m i d} = \frac{η_{m i n} + η_{m a x}}{2}$ ;
12:: Obtain $ω_{m i d}$ with $η_{m i d}$ through (23) while fixing $γ$ , $α$ , $y$ , and $θ$ ;
13:: Let $\bar{P_{m i d}} = | | ω_{m i d} {| |}^{2}$ ;
14:: if $\bar{P_{m i d}} > \bar{P_{m a x}}$ then
15:: Let $η_{m i n} = η_{m i d}$ ;
16:: else
17:: Let $η_{m a x} = η_{m i d}$ ;
18:: end if
19:: $c_{2} = c_{2} + 1$ ;
20:: end while
21:: Update $ω_{i}$ as $ω_{m i d}$ ;
22:: end for
23:: Update $γ_{i} and α_{i}, \forall i,$ with (12) and (14), respectively, while fixing $ω$ and $θ$ ;
24:: $c_{1} = c_{1} + 1$ ;
25:: until convergence or $c_{1} > ℓ_{m}$

Note that, although SA is well defined in the literature, our work still requires the FP iterative update procedure with certain modifications to be the fitness function for SA. Specifically, by regarding

θ_{i}

as the variable to be updated by the SA algorithm with the same FP iterative update process on the others (i.e.,

γ_{i}

,

α_{i}

,

y_{i}

, and

ω_{i}

), the resulting iterative-based fitness function, for example, the SA-Fitness function, can output the desired

θ_{i}

with a very limited number of iterations. More explicitly, let

{\hat{ℓ}}_{m}

be the iteration number of the outer loop and

{\hat{ℓ}}_{η}

be that of the inner loop in the SA-Fitness function.

Through our experiments,

{\hat{ℓ}}_{m} = 1

and

{\hat{ℓ}}_{η} = 100

can be found to quickly estimate

θ_{i}

, and we can then input the obtained

θ_{i}

into the EA-aided FP algorithm. Given this, our simulations in Section 6.2 confirm the effectiveness of the FP algorithm to provide the system performance metrics outperforming those from the learning-based algorithms and the baseline approaches in comparison.

In summary, the FP-based approach is developed to be an iterative algorithm, which involves (1) obtaining

θ_{i}

through SA, (2) updating

y_{i}

with (22), (3) updating

ω_{i}

with (23), (4) updating

γ_{i}

with (12), (5) updating

α_{i}

with (14), and (6) finding

η_{i}

with the bisection search under the limit of

ℓ_{η}

iterations, while fixing the other variables in each step within the total number of

ℓ_{m}

iterations. In the iterative updates, the inverse operation is required to find, e.g.,

ω_{i}

, with the time complexity

O (L N_{t}^{3})

, and the number of

ℓ_{η}

bisection-search iterations is also required to find

η

. Further, to obtain

θ_{i}

, SA implemented in [38] would expand

O (I_{c} N_{g})

steps to perform the cost evaluation, where

I_{c}

is the number of individuals to evaluate in a chain for every generation of SA, and

N_{g}

is the number of generations to evolve. Given this, its total time complexity would be

O (ℓ_{m} I_{c} N_{g} ℓ_{η} L N_{t}^{3}

).

4. Limited Channel Information Exchange

In the networks with MISO downlink channels, a practical approach that is frequently adopted is using BSs to collect the channel information. That is, a BS will obtain the channel measurement through the feedback from UE. To this end, there would exist a backhaul network to carry the global instantaneous CSI collected and transmit it to the central controller for global optimization. However, the signal overhead can be huge, which makes a centralized optimization approach infeasible in a highly dynamic environment.

To alleviate the problem in a practical way, our distributed learning-based approach will utilize only the basic operations of BS to exchange information with other BSs through predefined interfaces, such as

X 2

in LTE, resulting in a considerably lower signal overhead than that of the backhaul network for centralized optimization. Given this, we consider that each direct link k has two limited sets, namely interferers and interfered neighbors, similar to those in [39,40]. Specifically, we limit the number of neighbor U of link k with the dynamic thresholds

φ_{I_{k}}

and

φ_{O_{k}}

in the following two limited sets:

\begin{matrix} I_{k} = \{j \neq k : | h_{k, j}^{†} ω_{j} |^{2} \geq φ_{I_{k}}\} \\ O_{k} = \{i \neq k : | h_{i, k}^{†} ω_{k} |^{2} \geq φ_{O_{k}}\} \end{matrix}

(25)

where the two thresholds lead to

| I_{k} | = U

and

| O_{k} | = U

, respectively.

Now, with a control channel to return the feedback, BS k at current time t can obtain the channel gain

| h_{k, k}^{†} (t) ω_{k} {(t - 1) |}^{2}

and the interference-plus-noise

\sum_{j \neq k} {| h_{k, j}^{†} (t) ω_{j} (t - 1) |}^{2} + σ^{2}

through

ω_{j} (t - 1), \forall j

, measured by UE k at the previous time

t - 1

as well as the current channel vector

h_{k, j} (t), \forall j

. Similarly, BS k can send its own measurements to its interferers

j \in I_{k}

and interfered neighbors

i \in O_{k}

and receive the measurements from the two sets of neighbors as conducted in the previous works. The information for these measurements locally exchanged among the neighbors would then be utilized in the following multi-agent DDQN algorithm, which details the measurements to be adopted therein.

5. Learning-Based Approach

In addition to the indicated signal overhead, an optimization-based approach could also have a computational complexity for solving the MOO problem that is non-deterministic polynomial time (NP) in general. Although the FP-based algorithm could be computationally-efficient with the iterative update procedure proposed, to further reduce the signal overhead as well as the computational complexity, we develop a deep-reinforcement-learning-based algorithm to track the fast time-varying channels involved and provide its solutions in a time that could hardly be achieved by using the traditional optimization methods. Specifically, a multi-agent DDQN algorithm is introduced next to make each single agent or BS share only limited information exchanged among its neighbors, effectively reducing the overhead and complexity as mentioned.

5.1. Overview of DDQN

In principle, a reinforcement-learning (RL) algorithm has one or more agents to interact with the environment and to take actions based on certain strategies so that the accumulated reward can be maximized in the long term. The interaction between agent(s) and the environment is usually modeled as a Markov decision process (MDP). The well-known Q-learning algorithm is a MDP-based approach, represented here by a four-tuple structure <

S, A, R, P

>, where

S

is the set of states,

A

is the set of discrete actions,

R

is the reward, and P is the transition probability. Specifically, given r as the instant reward and

ν \in [0, 1)

as the discount factor, the cumulative discounted reward can be obtained by

R_{t} = \sum_{τ = 0}^{\infty} ν^{τ} r (t + τ + 1)

(26)

Given this, the Q-function associated with a policy

π

is the expected reward defined by

Q_{π} (s, a) = E_{π} \{R_{t} | s_{t} = s, a_{t} = a\}

(27)

where

a \in A

is an action taken in state

s \in S

in time t, and the optimal policy

π^{*} (a | s)

is a mapping from states to actions that maximizes the long-term cumulative discount reward. Then, through the concept of a one-step Markov process, it considers

R (s, a) = E_{π} \{r_{t + 1} | s_{t} = s, a_{t} = a\}

as the expected instant reward resulting from taking action a in state s and the transition probability

P_{s s^{'}}^{a} = Pr (s_{t + 1} = s^{'} | s_{t} = s, a_{t} = a)

. Given this, the Q-function can be iteratively obtained by using the Bellman Equation [41]

Q_{π} (s, a) = R (s, a) + ν \sum_{s^{'} \in S} P_{s s^{'}}^{a} (\sum_{a^{'} \in A} π (s^{'}, a^{'}) Q_{π} (s^{'}, a^{'}))

(28)

Accordingly, to find the optimal policy

π^{*}

, the Q-learning algorithm is conducted to find the optimal action a in state s.Through the Bellman equation shown in above, the optimal Q-function associated with the optimal policy

π^{*} (a | s)

can be represented by

Q^{π^{*}} (s, a) = R (s, a) + ν \sum_{s^{'} \in S} P_{s s^{'}}^{a} max_{a^{'}} Q^{π^{*}} (s^{'}, a^{'})

(29)

Clearly, to obtain the optimal results, all state–action pairs should be stored in a place, namely the Q-table, in this algorithm, whose dimensions are

| S | \times | A |

, and this could be huge for a general application. Thus, the primitive Q-learning algorithm may be useful only when the state–action space is relatively small, which seriously limits its applicability. Fortunately, by replacing the Q-table with a neural network to find the optimum, the deep-learning algorithm that results, namely DQN, can significantly reduce the overhead, where the Q-function is denoted by

Q (s, a | ϕ)

with

ϕ

to denote the weight of DNN. Now, with the learning rate

α \in (0, 1]

, the Q-value can be updated by

Q (s, a | ϕ) = (1 - α) Q (s, a | ϕ) + α (r + ν max_{a^{'}} Q (s^{'}, a^{'} | ϕ))

(30)

The weights of DNN, however, can diverge due to a high correlation between the actions and states that exist, and the algorithm is not guaranteed to converge on the optimal value function. To resolve this problem, apart from the introduced DNN,

Q_{t r a i n}

, another DNN,

Q_{t a r g e t}

, is added to keep a copy of DNN and use it for the Q-value update in the Bellman equation. The two different DNNs have different Q-functions,

Q (s, a | ϕ_{1})

and

Q (s, a | ϕ_{2})

. The loss between them can then be defined by

L = \sum_{〈 s, a, r, s^{'} 〉} (Q_{t a r g e t}^{D Q N} - Q (s, a | ϕ_{1}))

(31)

where

Q_{t a r g e t}^{D Q N} = r^{'} + ν {max}_{a^{'}} Q (s^{'}, a^{'} | ϕ_{2})

, and minimizing this loss would lead to the optimal solution. Now, even given the loss function, the DQN algorithm may still significantly diverge by overestimating the value of

Q_{t a r g e t}

. The overestimating problem with respect to the deep deterministic policy gradient (DDPG) algorithm was also indicated in [42,43]. Additionally, DDPG has the potential to become unstable, and its performance may rely on finding the appropriate hyperparameters for a given problem [42]. Therefore, it is currently not being considered in this work.

Instead, a variant approach, namely double DQN (DDQN) as proposed in [44], is considered to select the actions and evaluate the Q-values separately. In particular, unlike DQN directly using the maximum Q-value for the target network, DDQN selects the action from the train network that yields the maximum Q-value, i.e.,

{arg max}_{a^{'}} Q (s^{'}, a^{'} | ϕ_{1})

and then identifies the Q-value in the target network by means of the selected action, i.e.,

Q (s^{'}, {arg max}_{a^{'}} Q (s^{'}, a^{'} | ϕ_{1}) | ϕ_{2})

. Finally, the Q-value for

Q_{t a r g e t}

in DDQN can be obtained by

Q_{t a r g e t}^{D D Q N} = r^{'} + ν Q (s^{'}, \underset{a^{'}}{arg max} Q (s^{'}, a^{'} | ϕ_{1}) | ϕ_{2})

(32)

Apart from the potential to resolve the overestimating problem, DDQN was also shown to obtain the best results through certain datasets for training [45] and the lowest cost for the dynamic context delivery when compared with the others [46]. In addition, as shown in [44], the lower bound on the absolute error of DDQN estimate is zero. Given these good properties, we develop, in the sequel, a distributed multi-agent DDQN algorithm to resolve the MOO problem (9) with the objective (10).

5.2. Distributed Multi-Agent DDQN Algorithm

In Section 3, the FP-based algorithm is introduced to represent a baseline to be obtained by an optimization-based algorithm. Given its merits on the centralized process, a distributed approach with lower time complexity is still considered better if each BS can independently determine its BP and PR with only limited information shared among their neighbors.

To this end, the proposed DDQN algorithm is conducted to follow the concept of DTDE as shown in Figure 2, wherein each agent k takes its action

a_{k}

based on its current state

s_{k}

obtained from the information exchanged among its neighbors, representing the concept of distributed executing (DE). In addition, each agent k trains its own DNNs,

Q_{t r a i n}

and

Q_{t a r g e t}

, by using the experiences

〈 s_{k}, a_{k}, r_{k}, s_{k}^{'} 〉

stored in its replay buffer

D_{k}

, representing distributed training (DT) in this algorithm. Specifically, the main MDP components for the proposed DDQN algorithm are summarized as follows:

(1): Action: In this algorithm, each action of agent k or $a_{k}$ is composed of BP $\{ω_{k}\}$ and PR $\{θ_{k}\}$ . As the action space of value-based DRL algorithm must be finite, the feasible actions should be taken from a set of discrete values of $\{ω_{k}\}$ and $\{θ_{k}\}$ , respectively. Here, as each BP is a complex vector, it should be discretized with real values. To this end, it is first decomposed into two parts as

$ω_{k} = \sqrt{P_{k}} {\bar{ω}}_{k}$

(33)

wherein the first part, $P_{k} = | | ω_{k} {| |}^{2}$ , is the transmit power of BS k, and the second part, ${\bar{ω}}_{k}$ , represents the beam direction of BS k. On the one hand, the transmit power can be discretized linearly to constitute a set of values, such as ${P_{m i n}, P_{m i n} + \frac{P_{m a x} - P_{m i n}}{N_{p} - 1}, P_{m i n} + \frac{2 (P_{m a x} - P_{m i n})}{N_{p} - 1}, \dots, P_{m a x}}$ of $N_{p}$ equal-spacing values.
On the other hand, ${\bar{ω}}_{k}$ could be discretized by using a codebook $C = \{c_{0}, \dots, c_{N_{c o d e} - 1}\}$ composed of $N_{c o d e}$ code vectors $c_{k} \in C^{N_{t} \times 1}$ , each specifying a beam direction in $[0, 2 π)$ . Providing a sufficient number of code $N_{c o d e} \geq N_{t}$ to be adopted and a number of S available phase values for each antenna element, we can consider a codebook matrix $C$ similar to that in [47]. Specifically, for the $n_{t}$ -th antenna element in the q-th code, its value can be given by

$C [n_{t}, q] = \frac{exp (j \frac{2 π}{S} ⌊ \frac{n_{t} mod (q + \frac{N_{c o d e}}{2}, N_{c o d e})}{N_{n o d e} / S} ⌋)}{\sqrt{N_{t}}}$

(34)

Apart from BP, we can similarly discretize each PR $θ_{k}$ into $N_{e h}$ levels with a set $E = \{0, \frac{1}{N_{e h} - 1}, \frac{2}{N_{e h} - 1}, \dots, 1\}$ , representing its values to be selected. Finally, by taking all the discrete-value sets into account, we have the action space for each agent as

$A = \{(p, c, e) | p \in P, c \in C, e \in E\}$

(35)

from which an agent k can choose its action $a_{k} (t)$ at time t.
(2): Reward: Apart from the above to select PR within $[0, 1]$ from $E$ to comply with the feasible PR constraint, for the MOO problem, which is also required to meet the transmit power constraint, we conduct a dual form of this optimization by conceptually lifting the power constraint as the penalty term added in the objective to represent a reward to be obtained by the distributed multi-agent DDQN algorithm. Specifically, the reward function is denoted by

$r = W \frac{C^{d} (Ω, θ)}{\bar{C^{d}}} + (1 - W) \frac{E^{h} (Ω, θ)}{\bar{E^{h}}} - W_{c} P_{s u m}$

(36)

where $W_{c}$ is the penalty weight, and $P_{s u m} = \sum_{\forall i} | | ω_{i} {| |}^{2}$ is the total transmit power consumption in the network. Given this, the reward of agent k at time t can be denoted by $r_{k} (t) = W \frac{C_{k}^{d} (t - 1)}{\bar{C^{d}}} + (1 - W) \frac{E_{k}^{h} (t - 1)}{\bar{E^{h}}} - W_{c} P_{s u m} (t - 1)$ .
(3): State: Conventionally, a state in MDP for RL-based algorithms is designed to represent the environmental information perceived by an agent. Given the same aim to represent as much available information as possible in the environment, the different problems involved, however, could realize their state spaces differently in the different related works, such as [39,40,48]. Here, to construct a state for this algorithm, an agent or BS k at time t will provide its local information about the direct link k at the previous time slot $t - 1$ to its interferers $j \in I_{k} (t), \forall j$ , including (1) the interference power received from j, $| h_{k, j}^{†} (t - 1) ω_{j} {(t - 1) |}^{2}$ ; (2) the interference-plus-noise power, $\sum_{l \neq k} {| h_{k, l}^{†} (t - 1) ω_{l} (t - 1) |}^{2} + σ^{2}$ ; (3) the achievable data rate, $C_{k}^{d} (t - 1)$ ; and (4) the channel gain, $h_{k, k}^{†} (t) {\bar{ω}}_{k} (t - 1)$ . At the same time, it will also send the information to its interfered neighbors $i \in O_{k} (t), \forall i$ , including the index $ℓ_{k} (t - 1)$ for the beam direction ${\bar{ω}}_{k} (t - 1)$ adopted and the achievable data rate $C_{k}^{d} (t - 1)$ .

In parallel, each interferer

j \in I_{k} (t)

will send the index

ℓ_{j} (t - 1)

for the beam direction

{\bar{ω}}_{j} (t - 1)

and the achievable data rate

C_{j}^{d} (t - 1)

to agent k. Similarly, each interfered neighbor

i \in O_{k} (t)

will send its measurements to agent k, including (1) the interference power,

| h_{i, k}^{†} (t - 1) ω_{k} {(t - 1) |}^{2}

; (2) the interference-plus-noise power,

\sum_{l \neq i} {| h_{i, l}^{†} (t - 1) ω_{l} (t - 1) |}^{2} + σ^{2}

; (3) the achievable data rate,

C_{i}^{d} (t - 1)

; and (4) the channel gain,

h_{i, i}^{†} (t - 1) {\bar{ω}}_{i} (t - 1)

.

Given this, each agent k includes the following as the local information of its state, denoted by

s_{k}^{l} (t)

, as

the normalized identity of BS, $k / N_{b}^{l}$ ;
the normalized channel gain, $(| h_{k, k}^{†} (t) {\bar{ω}}_{k} (t - 1) |^{2}) / N_{c}^{l}$ ;
the normalized interference-plus-noise power,
$(\sum_{l \neq k} | h_{k, l}^{†} (t) ω_{l} (t - 1) |^{2} + σ^{2}) / N_{i}^{l}$ ;
the normalized reward, $(W \frac{C_{k}^{d} (t - 1)}{\bar{C^{d}}} + (1 - W) \frac{E_{k}^{h} (t - 1)}{\bar{E^{h}}} - W_{c} P_{s u m} (t - 1)) / N_{r}^{l}$ ,

where

N_{b}^{l}, N_{c}^{l}, N_{i}^{l}

, and

N_{r}^{l}

denote the normalization factors corresponding to the above four items, respectively. These factors (as well as the others to be introduced) for state normalization actually play a key role on preprocessing the training sample sets to lead to a much easier and faster training process as noted in [49,50]. Apart from that, the state of agent k also includes a set of information from its interferers, denoted by

s_{k}^{i} (t)

. Specifically, for each interferer

j \in I_{k} (t)

, it involves

the normalized identity of the interferer BS, $j / N_{b}^{i}$ ;
the normalized beam direction index adopted by the interferer BS, $ℓ_{j} (t - 1) / N_{i}^{i}$ ;
the normalized interference power, $(| h_{k, j}^{†} (t - 1) ω_{j} (t - 1) |^{2}) / N_{c}^{i}$ ;
the normalized utility, $(W \frac{C_{j}^{d} (t - 1)}{\bar{C^{d}}} + (1 - W) \frac{E_{j}^{h} (t - 1)}{\bar{E^{h}}}) / N_{u}^{i}$ ,

where

N_{b}^{i}, N_{i}^{i}, N_{c}^{l}

, and

N_{u}^{l}

denote the corresponding normalization factors. In addition, a set of information from the interfered neighbors, denoted by

s_{k}^{d} (t)

, is also included in the state to completely describe the interference-limited environment for the MISO transmission. Specifically, the information for each interfered neighbor

i \in O_{k} (t)

is represented by

the normalized channel gain, $(| h_{i, i}^{†} (t - 1) {\bar{ω}}_{i} (t - 1) |^{2}) / N_{c}^{n}$ ;
the normalized utility, $(W \frac{C_{i}^{d} (t - 1)}{\bar{C^{d}}} + (1 - W) \frac{E_{i}^{h} (t - 1)}{\bar{E^{h}}}) / N_{u}^{n}$ ;
the normalized SINR with respect to k,
$\frac{| h_{i, k}^{†} (t - 1) ω_{k} {(t - 1) |}^{2}}{\sum_{l \neq i} {| h_{i, l}^{†} (t - 1) ω_{l} (t - 1) |}^{2} + σ^{2}} / N_{s}^{n}$ ;
the normalized totally-received power,
$(\sum_{\forall l} {| h_{i, l}^{†} (t - 1) ω_{l} (t - 1) |}^{2} + σ^{2}) / N_{e}^{n}$ ,

where

N_{c}^{n}, N_{u}^{n}, N_{s}^{n}

, and

N_{e}^{n}

are the normalization factors for the above four items, respectively. Note that, if agent k is not active in tim

t - 1

, the numerator

| h_{i, k}^{†} (t - 1) ω_{k} {(t - 1) |}^{2}

as well as the whole SINR shown in the above are zero and will be excluded from the total received power as well.

Concatenating all three parts, we now have the state

s_{k} (t)

=

\{s_{k}^{l} (t), s_{k}^{i} (t), s_{k}^{d} (t)\}

for each agent k. Here,

| s_{k} | = | s_{k}^{l} | + | s_{k}^{i} | + | s_{k}^{d} | = 4 + 4 U + 4 U

is the state size for each agent k to include the information from its U neighbors. Given this, the system state at time t can be denoted by

\{s_{1} (t), s_{2} (t), \dots, s_{L} (t)\}

. Then, following the principle of MDP, each agent k at time t will observe its own state

s_{k} (t)

and choose its action

a_{k} (t)

with the transition probability

P_{s_{k}, s_{k}^{'}}^{a_{k}}

determined by its DNN to move to the next state

s_{k}^{'}

.

(4): Selection policy and experience replay: Apart from MDP, the DDQN algorithm also adopts the same mechanisms usually found in DQN, such as $ϵ$ -greedy selection policy and experience replay. First, by using the $ϵ$ -greedy selection policy, each agent can explore the environment with the probability $ϵ$ and can exploit with the probability $1 - ϵ$ , where $ϵ$ is a hyperparameter for the trade-off between exploration and exploitation and decays with a rate of $λ_{ϵ}$ to its minimum value $ϵ_{m i n}$ , similar to that in [51]. Further, by means of experience replay, each agent k can store its transactions $(s_{k} (t), a_{k} (t), r_{k} (t), s_{k}^{'})$ in a buffer memory $D_{k}$ , and then randomly sample $D_{k}$ to construct a mini-batch for training its DNNs through, e.g., a stochastic gradient descent (SGD) algorithm to update the weights $ϕ_{1}$ and $ϕ_{2}$ for $Q_{t r a i n}$ and $Q_{t a r g e t}$ , respectively. As a summary, the proposed multi-agent DDQN algorithm is is shown in Algorithm 2 for reference.

Algorithm 2 Multi-agent DDQN algorithm.

1:: (Input) Simulated SWIPT MISO network and hyperparameters for the DDQN algorithm;
2:: (Output) Learned DDQN to decide $P_{k}, {\bar{ω}}_{k}, θ_{k}, \forall k$ , for MOO in (9) with objective in (10);
3:: Initialize a pair of $Q_{t r a i n}$ and $Q_{t a r g e t}$ with $ϕ_{1}^{k}$ and $ϕ_{2}^{k}$ for each agent/BS $k \in \{1, \dots, L\}$
4:: Initialize state $s_{k} (0)$ , action $a_{k} (0)$ and replay buffer $D_{k} = \emptyset$ for each agent k;
5:: for each time slot t do
6:: for each agent/BS k do
7:: Observe current state $s_{k} (t)$ in time slot t;
8:: generate a random number $n_{r}$ ;
9:: if $n_{r} < ϵ$ then
10:: Randomly select $a_{k} (t)$ from the action space $A$ ;
11:: else
12:: Select $a_{k} (t) = arg {max}_{a \in A} Q (s_{k}, a | ϕ_{1}^{k})$ ;
13:: end if
14:: Observe next state $s_{k}^{'}$ , and obtain reward $r_{k} (t)$ ;
15:: Store the new transition $(s_{k} (t), a_{k} (t), r_{k} (t), s_{k}^{'})$ in $D_{k}$ ;
16:: Randomly sample a mini-batch $(s_{k} (j), a_{k} (j)$ $, r_{k} (j), s_{k}^{'} (j))$ with $j \in J \subset D_{k}$ for experience;
17:: Compute the Q-value for DDQN with (32)
18:: Perform SGD to minimize the loss in (31), finding the optimal weights $ϕ_{1}^{k}$ and $ϕ_{2}^{k}$ of agent k;
19:: Update weight $ϕ_{1}^{k}$ (for $Q_{t r a i n}$ );
20:: Update weight $ϕ_{2}^{k}$ (for $Q_{t a r g e t}$ ) with $ϕ_{1}^{k}$ every $T_{s t e p}$ time slots;
21:: end for
22:: end for

Now, to evaluate its time complexity, we can assume that the neural network involved has J fully connected layers at most, in which

n_{j}

denotes the number of neural units at the j layer, and

n_{0}

is the input state size, leading to the complexity

O (\sum_{j = 0}^{j = J - 1} n_{j} n_{j + 1})

for its operations as noted in [49]. In addition, the DDQN algorithm is assumed to have

T_{m}

time slots to learn, and, in each time slot, there are L distributed agents/BSs to train their own neural networks. Given this, the total complexity would be

O (T_{m} L \sum_{j = 0}^{j = J - 1} n_{j} n_{j + 1})

.

Apart from the time complexity, each agent or BS requires at most four U messages from its neighbors with the limited channel information exchange. Otherwise, if a centralized approach in convention is adopted, the signal overhead would include the collection of

L^{2}

N_{t}

-dimension complex vectors. In general, the number of neighbors for an agent or BS (i.e., U) is much less than the number of cells or BSs (i.e., L); thus, our approach can pay a lower signal overhead than can the centralized counterpart.

6. Numerical Experiments

In this section, we conduct simulation experiments to evaluate the proposed EA-aided FP algorithm (denoted by “FP”) and distributed multi-agent DDQN algorithm (denoted by “dis-DDQN”). To validate the proposed algorithms, we include a greedy-based algorithm and a random-based algorithm (denoted by “greedy” and “random”, respectively) as the comparison baselines. In addition, to verify the effectiveness of the DDQN algorithm based on DTDE, we introduce a CTDE variant (denoted by “glo-DDQN”), which uses the global state

s

=

\{s_{1}, s_{2}, \dots, s_{L}\}

introduced in Section 5.2, to be the state for training each BS k instead of using only its local state

s_{k}

. Furthermore, to show the effectiveness of distributed computing, we also compare the Advantage Actor Critic (denoted by “A2C”) algorithm, which represents the state-of-the-art centralized RL algorithm to resolve this problem.

6.1. Simulation Setup

With the network and channel models introduced in Section 2, we set a simulation environment with 19 hexagonal cells with BS 0 located at the center, BSs 1–6 located in the first tier, and BSs 7–18 located in the second tier as shown in Figure 3, similar to the environment in [40]. However, unlike the previous, the cell radius was limited to 20 m for SWIPT to resemble that in a small cell, wherein the harvested energy would be significant enough in addition to the data transmitted.

Each UE is randomly located in each cell, and the path loss between BS k and UE j is similarly given by

β_{j, k} = 120.9 + 37.6 {log}_{10} d_{j, k}

dB, where the distance between them,

d_{j, k}

, is denoted in kilometers. Apart from the path loss, the signal was also generated with the log-normal shadowing effect, which had a standard deviation of 8 dB and AWGN noise power of −114 dBm. In addition, the number of multi-path was set to 4, and the difference between the maximum angle and the minimum angle, i.e., the angular spread, was

3^{\circ}

. Further, as UEs are located with random positions initially, the azimuth angle of UE to its BS serves as the direction of departure (DoD) of the wireless channel.

Apart from that, each channel had a time slot duration of 20 ms and a correlation coefficient of 0.64 for the successive time slots. As a summary, the important radio parameters with respect to the environment are tabulated in Table 1, and the import parameters and hyperparameters for DDQN are summarized in Table 2. Finally, along with

W = 0.5

for fairly weighting SE and EH in the first set of experiments and

W_{c}

=

10^{- 4}

for the penalty of power consumption, the DDQN algorithms were conducted by a DNN with two hidden layers composed of 128 and 64 neurons, respectively.

In the parametric analysis, we first conducted different experiments to find the most suitable parameters for the multi-agent DDQN algorithm to be compared in the following, including the number of transmit power levels (

N_{p}

), the number of beam directions (

N_{c o d e}

), and the number of power splitting ratios (

N_{e h}

). After that, we compared the proposed algorithms with the other schemes, and the results obtained confirm our proposal to outperform these benchmark schemes in terms of the utility

U (Ω, θ)

, data rate

C^{d} (Ω, θ) = \sum_{\forall i} C_{i}^{d}

, and harvested energy

E^{h} (Ω, θ) = \sum_{\forall i} E_{i}^{h}

.

6.2. Parametric Analysis

6.2.1. The Number of Power Levels

As shown in (33), there are two parts to constitute a BP. With respect to the first part of BP, transmit power, we set the transmit power to have 4, 8, and 16 levels of value for the Q learning to see its impact on the system performance. The results are summarized in Figure 4, showing that the different numbers of power levels

N_{p}

provided similar utilities, data rates, and harvested energies. It implies that the algorithm may not, in this case, find the optimum represented through the values shown in these power sets even if

N_{p}

and the overall state space increase. Thus,

N_{p} = 4

is considered sufficient in the sequel as it pays the lowest overhead for the algorithm to converge.

6.2.2. The Number of Beam Directions

For the second part of BP, the beam direction, we set the codebook to have 4, 8, and 16 vectors or directions, respectively, to see its impact on the system performance. The results are now summarized in Figure 5, showing that

N_{c o d e} = 8

could produce a higher data rate to compensate for a lower harvested energy and that

N_{c o d e} = 16

could obtain a higher harvested energy to compensate for a lower data rate when compared with that of

N_{c o d e} = 4

. However, the trend is still the same in that increasing

N_{c o d e}

would provide similar utility as that on

N_{p}

. This suggests that, despite the slight trade-off between the data rate and harvested energy,

N_{c o d e} = 4

would be sufficient for the algorithm to converge for the desired overall utility without further increasing its learning overhead.

6.2.3. The Number of Power Splitting Ratios (PR)

Apart from BP, PR is another objective in our MOO problem. For the distributed DDQN algorithm, the number of PR level has the same importance as the former. To see its impact on the system performance, we provided a set of 4, 8, and 16 real values equally distributed between 0 and 1, for the experiments. As shown in Figure 6,

N_{e h} > 4

(i.e.,

N_{e h}

= 8 and 16) provided higher harvested energies and lower data rates, which eventually led to higher utilities compared with that of

N_{e h} = 4

. However, to conduct the baseline for comparison without loss of generality, we adopted

N_{e h} = 4

as well as

N_{p} = N_{c o d e} = 4

, which exhibited the performance differences significantly enough for the DDQN algorithm in comparison and had a reasonable overall computational overhead.

Note that, as indicated in [52], when a multi-agent setting is modified by the actions of all agents, the environment becomes non-stationary from a single agent perspective, in which the effectiveness of most reinforcement-learning algorithms would not hold [53]. Thus, the performance of a multi-agent DRL algorithm does not guarantee an increase as the number of action increases through a trial-and-error mechanism in such environments [40] but could be explored by selecting suitable numbers of actions to constitute the action space as when performed for the proposed DDQN algorithm with the above experiments.

6.3. Performance Comparison

In this subsection, we exhibit the performance differences between the proposed algorithms and the other schemes. Specifically, based on the parametric analysis that we introduced, we set

N_{p} = N_{c o d e} = N_{e h} = 4

for the multi-agent DDQN algorithm as well as a CTDE counterpart for a benchmark to be introduced in the following and

ℓ_{m} = ℓ_{η} = 100

for the FP algorithm. Then, we conducted a performance comparison between these algorithms and the other four benchmark schemes shown as follows:

Global state information-based scheme: In principle, this scheme is the same as the distributed multi-agent DDQN algorithm. However, instead of adopting its own state $s_{k}$ only, each agent k adopts the full state information, i.e., $\{s_{1}, s_{2}, \dots, s_{L}\}$ for its own DDQN operations, based on the concept of centralized training distributed executing (CTDE). Clearly, collecting such information would require a centralized processor or a full information exchange mechanism to exist in the network and, thus, is denoted as “glo-DDQN” as noted at the beginning of this section.
Single-agent DRL scheme: As a branch of machine learning, DRL is conventionally developed with a single agent operated centrally in a processor. Here, the state-of-the-art RL algorithm, Advantage Actor Critic, is adopted as a centralized DRL-based benchmark scheme for resolving the MOO problem and is simply denoted as “A2C”.
Random-based scheme: As a baseline algorithm, the scheme leads each agent to randomly choose an action in each time slot and is denoted here as “random”.
Greedy-based scheme: As another baseline algorithm, each agent in this scheme adopts the beam direction with the maximum channel gain and the maximum transmit power while randomly selecting its PR from the set of $N_{e h}$ elements for the DDQN. For easy reference, this scheme is denoted as “greedy” in the sequel.

For these algorithms, we set

W = 0.1

, 0.5, and 0.9 in (10) to represent a “low”, “middle”, and “high” weight on the data rate (or a “high”, “middle”, and “low” weight on the harvested energy), and we examined the performance differences on these weights applied to these algorithms. Their results are summarized in Figure 7. Specifically, in Figure 7b,c, the random algorithm, which randomly chooses BP from the codebook despite W is shown to retain the same performance on these metrics, as expected. Similarly, given a non-zero W, each agent with the greedy algorithm chooses the best BP for its data rate despite the harvestable powers from the others, which are out of its control on BP, and this is also shown to remain the same on the two metrics when varying the weight.

Apart from these, the other algorithms exhibited similar trends, where increasing W increased the data rate and decreased the harvested energy, thus confirming the design aim of W. However, as the amount of the increased rate can be different from that of the decreased energy, their weighted sum or the resulting utility cannot be guaranteed to increase when W increases as shown in Figure 7a.

Given the similar trend, the FP-based algorithm (FP), which represents an optimization-based approach, is shown to provide the most effective solutions for the MOO problem, confirming our design aim. As shown in Figure 7a as well, the distributed multi-agent DDQN algorithm (dis-DDQN) has its overall utility under that of FP but outperforms the other schemes in comparison through the following viewpoints.

First, with respect to its variant (glo-DDQN), it can be observed that both algorithms (dis-DDQN and glo-DDQN) converge to similar results, and glo-DDQN can barely obtain a higher utility. The latter is possible because equipped with the global state information, each agent may need even more time to learn the strategy approaching the optimal system performance. It implies further that, with a higher overhead for learning, the large system state caused by glo-DDQN may not lead to a better result, a faster converging speed, or both, in time.

From Figure 8, which exemplifies the converging progresses of these algorithms with

W = 0.5

, it can be observed with more evidence that glo-DDQN actually converges more slowly than dis-DDQN in the time domain for all the metrics involved. Apart from the above, it can be also seen that the DDQN-based algorithms can obtain higher rates but provide relatively lower energies, which eventually leads to the overall utilities being lower than those obtained by the FP algorithm.

Second, with respect to A2C, which represents a state-of-the-art single-agent algorithm for the conventional environment to be evaluated centrally, it can be seen that such an algorithm may not work well in the distributed network with multiple BSs for a large state space, a large action space, or both. In other words, although A2C can handle the spaces involving both discrete and continuous variables (e.g., the beam direction is discretized while the transmit power, and the PR remains continuous in this case), its solution is not always efficient for the dynamic network environment. In contrast, by suitably discretizing the spaces involved, the distributed multi-agent DDQN (dis-DDQN) can be more easily handled by each agent to learn its strategy based on the limited discrete values in these spaces to approach the optimal solution.

Finally, in addition to the performance trends shown in the beginning, the greedy algorithm exhibits itself as a baseline scheme to provide a higher low-bound when no specific learning mechanism other than a greedy approach is adopted to resolve the MOO problem, and the random algorithm is shown to provide a lower low-bound on the performance if only randomly choosing an action is considered for solving this problem. As a summary, apart from the FP introduced, which represents an optimization-based approach to obtain outperforming solutions, the proposed DDQN algorithm (dis-DDQN) can also outperform the others in terms of the utility up to 1.23-, 1.87-, and 3.45-times larger than that of the A2C, greedy, and random algorithms, respectively, in comparison in the case of

W = 0.5

.

Apart from the above, we show, in Figure 9, the reward and loss for the RL-based algorithms in comparison. As can be easily seen, the reward increases and the loss decreases as time elapses, and dis-DDQN and glo-DDQN have higher rewards and lower losses compared to A2C, as expected. In particular, the lower losses found for the two DDQN algorithms suggest that the obtained models would perform better compared to A2C. To further validate the trained models from these RL-based algorithms, we prepared a set of 5000 test data by randomly generating channel fading conditions different from those of the training set.

By reacting to the random data, each trained model can provide its own BPs and PRs, leading to the performance results summarized in Figure 10. From this figure, we can see that the test can consistently give outputs similar to those at the end of training, despite the different random unseen data for testing. This observation indicates that the trained models would have good generalization performance as expected.

7. Conclusions

In this work, a MOO problem was formulated that aims to obtain the optimal BP and PR concurrently for MISO downlink SWIPT-enabled wireless networks. For this problem, a weighted sum approach was conducted to make a trade-off between SE and EH in the Pareto-optimal sense. Given this, an EA-aided quadratic transform technique was proposed to conduct an FP-based algorithm that can obtain near-optimal solutions with the computationally-efficient iterative update procedure introduced. At the same time, a DTDE scheme was adopted to introduce a multi-agent DDQN algorithm that requires only partial observations of CSI for local computation in each agent to further reduce the communication overhead and the computational complexity.

With the simulated environment, our experimental results demonstrated that, among the benchmark schemes conducted, the introduced FP-based algorithm was the most effective approach for solving the MOO problem. Apart from the FP algorithm, the proposed multi-agent DDQN algorithm was also shown to outperform A2C, which represents the state-of-the-art single-agent DRL algorithm and the other baseline schemes while providing lower overhead and complexity compared with that of FP. This reveals the possibility that a programming-based method and a DRL-based algorithm can complement each other to solve various optimization problems in networking, and a joint design to take benefits from both will be our future work.

Author Contributions

J.-S.L.: the main research idea, software implementation, validation, and manuscript preparation; C.-H.L.: research idea discussion, review, and manuscript preparation; Y.-C.H.: research idea discussion, edit, and manuscript preparation; P.K.D.: research idea discussion, and manuscript preparation. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Ministry of Science and Technology, Republic of China, under grant MOST 111-2221-E-126-003.

Institutional Review Board Statement

Not Applicable.

Informed Consent Statement

Not Applicable.

Data Availability Statement

Not Applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Notations

Following the writing convention, vectors and matrices in this work are denoted by boldface lowercase and uppercase symbols (e.g.,

x

and

X

), respectively. The Hermitian transposition and inverse are denoted by the superscripts

{(\cdot)}^{†}

and

{(\cdot)}^{- 1}

, respectively. In addition,

| \cdot |

denotes the absolute value operator, and

| | \cdot | |

is that for the Euclidean norm. Further,

Re (\cdot)

is the operation to take the real part, and

\hat{I}

denotes an identity matrix.

References

Ni, W.; Zheng, J.; Tian, H. Semi-federated learning for collaborative intelligence in massive IoT networks. IEEE Internet Things J. 2023. [Google Scholar] [CrossRef]
Wang, W.; Chen, J.; Jiao, Y.; Kang, J.; Dai, W.; Xu, Y. Connectivity-aware contract for incentivizing IoT devices in complex wireless blockchain. IEEE Internet Things J. 2023. [Google Scholar] [CrossRef]
Irmer, R.; Droste, H.; Marsch, P.; Grieger, M.; Fettweis, G.; Brueck, S.; Mayer, H.; Thiele, L.; Jungnickel, V. Coordinated multipoint: Concepts, performance, and field trial results. IEEE Commun. Mag. 2011, 49, 102–111. [Google Scholar] [CrossRef] [Green Version]
Rashid-Farrokhi, F.; Liu, K.J.R.; Tassiulas, L. Transmit beamforming and power control for cellular wireless systems. IEEE J. Sel. Areas Commun. 1998, 16, 1437–1450. [Google Scholar] [CrossRef] [Green Version]
3GPP TR36.814. Evolved Universal Terrestrial Radio Access (E-UTRA); Further Advancements for E-UTRA Physical Layer Aspects. Available online: https://portal.3gpp.org/desktopmodules/Specifications/SpecificationDetails.aspx?specificationId=2493 (accessed on 22 December 2022).
López, O.L.A.; Alves, H.; Souza, R.D.; Montejo-Sánchez, S.; Fernández, E.M.G.; Latva-Aho, M. Massive wireless energy transfer: Enabling sustainable IoT toward 6G era. IEEE Internet Things J. 2021, 8, 8816–8835. [Google Scholar] [CrossRef]
Ku, M.-L.; Li, W.; Chen, Y.; Liu, K.J.R. Advances in energy harvesting communications: Past, present, and future challenges. IEEE Commun. Surv. Tutor. 2016, 18, 1384–1412. [Google Scholar] [CrossRef]
Clerckx, B.; Zhang, R.; Schober, R.; Ng, D.W.K.; Kim, D.I.; Poor, H.V. Fundamentals of wireless information and power transfer: From RF energy harvester models to signal and system designs. IEEE J. Sel. Areas Commun. 2019, 37, 4–33. [Google Scholar] [CrossRef] [Green Version]
Zhang, R.; Ho, C.K. MIMO broadcasting for simultaneous wireless information and power transfer. IEEE Trans. Wirel. Commun. 2013, 12, 1989–2001. [Google Scholar] [CrossRef] [Green Version]
Shen, C.; Li, W.-C.; Chang, T.-H. Wireless information and energy transfer in multi-antenna interference channel. IEEE Trans. Signal Process. 2014, 62, 6249–6264. [Google Scholar] [CrossRef] [Green Version]
Zhou, X.; Zhang, R.; Ho, C.K. Wireless information and power transfer: Architecture design and rate-energy tradeoff. IEEE Trans. Commun. 2013, 61, 4754–4767. [Google Scholar] [CrossRef] [Green Version]
Kumar, D.; López, O.L.A.; Tölli, A.; Joshi, S. Latency-aware joint transmit beamforming and receive power splitting for SWIPT systems. In Proceedings of the 2021 IEEE 32nd Annual International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC), Helsinki, Finland, 13–16 September 2021; pp. 490–494. [Google Scholar]
Oshaghi, M.; Emadi, M.J. Throughput maximization of a hybrid EH-SWIPT relay system under temperature constraints. IEEE Trans. Veh. Technol. 2020, 69, 1792–1801. [Google Scholar] [CrossRef]
Xu, J.; Zhang, R. Throughput optimal policies for energy harvesting wireless transmitters with non-ideal circuit power. IEEE J. Sel. Areas Commun. 2014, 32, 322–332. [Google Scholar]
Ng, D.W.K.; Lo, E.S.; Schober, R. Wireless information and power transfer: Energy efficiency optimization in OFDMA systems. IEEE Trans. Wirel. Commun. 2013, 12, 6352–6370. [Google Scholar] [CrossRef] [Green Version]
Shi, Q.; Peng, C.; Xu, W.; Hong, M.; Cai, Y. Energy efficiency optimization for MISO SWIPT systems with zero-forcing beamforming. IEEE Trans. Signal Process. 2016, 64, 842–854. [Google Scholar] [CrossRef]
Vu, Q.-D.; Tran, L.-N.; Farrell, R.; Hong, E.-K. An efficiency maximization design for SWIPT. IEEE Signal Process. Lett. 2015, 22, 2189–2193. [Google Scholar] [CrossRef] [Green Version]
Yu, H.; Zhang, Y.; Guo, S.; Yang, Y.; Ji, L. Energy efficiency maximization for WSNs with simultaneous wireless information and power transfer. Sensors 2017, 17, 1906. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Wang, X.; Liu, J.; Zhai, C. Wireless power transfer-based multi-pair two-way relaying with massive antennas. IEEE Trans. Wirel. Commun. 2017, 16, 7672–7684. [Google Scholar] [CrossRef]
Wang, X.; Ashikhmin, A.; Wang, X. Wirelessly powered cell-free IoT: Analysis and optimization. IEEE Internet Things J. 2020, 7, 8384–8396. [Google Scholar] [CrossRef]
Lu, X.; Wang, P.; Niyato, D.; Kim, D.I.; Han, Z. Wireless networks with RF energy harvesting: A contemporary survey. IEEE Commun. Surv. Tutor. 2015, 17, 757–789. [Google Scholar] [CrossRef] [Green Version]
Huda, S.M.A.; Arafat, M.Y.; Moh, S. Wireless power transfer in wirelessly powered sensor networks: A review of recent progress. Sensors 2022, 22, 2952. [Google Scholar] [CrossRef] [PubMed]
LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
Park, J.J.; Moon, J.H.; Lee, K.; Kim, D.I. Transmitter-oriented dual-mode SWIPT with deep-learning-based adaptive mode switching for iot sensor networks. IEEE Internet Things J. 2020, 7, 8979–8992. [Google Scholar] [CrossRef]
Han, E.-J.; Sengly, M.; Lee, J.-R. Balancing fairness and energy efficiency in SWIPT-based D2D networks: Deep reinforcement learning based approach. IEEE Access 2022, 10, 64495–64503. [Google Scholar] [CrossRef]
Muy, S.; Ron, D.; Lee, J.-R. Energy efficiency optimization for SWIPT-based D2D-underlaid cellular networks using multiagent deep reinforcement learning. IEEE Syst. J. 2022, 16, 3130–3138. [Google Scholar] [CrossRef]
Al-Eryani, Y.; Akrout, M.; Hossain, E. Antenna clustering for simultaneous wireless information and power transfer in a MIMO full-duplex system: A deep reinforcement learning-based design. IEEE Trans. Commun. 2021, 69, 2331–2345. [Google Scholar] [CrossRef]
Zhang, R.; Xiong, K.; Lu, Y.; Gao, B.; Fan, P.; Letaief, K.B. Joint coordinated beamforming and power splitting ratio optimization in MU-MISO SWIPT-enabled hetnets: A multi-agent DDQN-based approach. IEEE J. Sel. Areas Commun. 2022, 40, 677–693. [Google Scholar] [CrossRef]
Sengly, M.; Lee, K.; Lee, J.-R. Joint optimization of spectral efficiency and energy harvesting in D2D networks using deep neural network. IEEE Trans. Veh. Technol. 2021, 70, 8361–8366. [Google Scholar] [CrossRef]
Han, J.; Lee, G.H.; Park, S.; Choi, J.K. Joint subcarrier and transmission power allocation in OFDMA-based WPT system for mobile-edge computing in iot environment. IEEE Internet Things J. 2022, 9, 15039–15052. [Google Scholar] [CrossRef]
Han, J.; Lee, G.H.; Park, S.; Choi, J.K. Joint orthogonal band and power allocation for energy fairness in WPT system with nonlinear logarithmic energy harvesting model. arXiv 2020, arXiv:2003.13255. [Google Scholar]
Huang, J.; Xing, C.-C.; Guizani, M. Power allocation for D2D communications with SWIPT. IEEE Trans. Wirel. Commun. 2020, 19, 2308–2320. [Google Scholar] [CrossRef]
Lu, W.; Liu, G.; Si, P.; Zhang, G.; Li, B.; Peng, H. Joint resource optimization in simultaneous wireless information and power transfer (SWIPT) enabled multi-relay internet of things (IoT) system. Sensors 2019, 19, 2536. [Google Scholar] [CrossRef] [Green Version]
Lee, K. Distributed transmit power control for energy-efficient wireless-powered secure communications. Sensors 2021, 21, 5861. [Google Scholar] [CrossRef] [PubMed]
Ehrgott, M. Multicriteria Optimization; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2005; Volume 491. [Google Scholar]
Dinkelbach, W. On nonlinear fractional programming. Manag. Sci. 1967, 13, 492–498. [Google Scholar] [CrossRef]
Shen, K.; Yu, W. Fractional programming for communication systems-part I: Power control and beamforming. IEEE Trans. Signal Process. 2018, 66, 2616–2630. [Google Scholar] [CrossRef] [Green Version]
Radaideh, M.I.; Du, K.; Seurin, P.; Seyler, D.; Gu, X.; Wang, H.; Shirvan, K. Neorl: Neuroevolution optimization with reinforcement learning. arXiv 2021, arXiv:2112.07057. [Google Scholar]
Nasir, Y.S.; Guo, D. Multi-agent deep reinforcement learning for dynamic power allocation in wireless networks. IEEE J. Sel. Areas Commun. 2019, 37, 2239–2250. [Google Scholar] [CrossRef] [Green Version]
Ge, J.; Liang, Y.-C.; Joung, J.; Sun, S. Deep reinforcement learning for distributed dynamic MISO downlink-beamforming coordination. IEEE Trans. Commun. 2020, 68, 6070–6085. [Google Scholar] [CrossRef]
Bertsekas, D.P. Dynamic Programming and Optimal Control; Athena Scientific: Nashua, NH, USA, 1995; Volume 1. [Google Scholar]
Tiong, T.; Saad, I.; Teo, K.T.K.; Lago, H.b. Deep reinforcement learning with robust deep deterministic policy gradient. In Proceedings of the 2020 Second International Conference on Electrical, Control and Instrumentation Engineering (ICECIE), London, UK, 31 August–3 September 2020; pp. 1–5. [Google Scholar]
Fujimoto, S.; Hoof, H.V.; Meger, D. Addressing function approximation error in actor-critic methods. In Proceedings of the 35th International Conference on Machine Learning, Helsinki, Finland, 13–16 September 2018; pp. 2587–2601. [Google Scholar]
Hasselt, H.V.; Guez, A.; Silver, D. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30. [Google Scholar]
Ren, J.; Wang, H.; Hou, T.; Zheng, S.; Tang, C. Collaborative edge computing and caching with deep reinforcement learning decision agents. IEEE Access 2020, 8, 120604–120612. [Google Scholar] [CrossRef]
Nan, Z.; Jia, Y.; Ren, Z.; Chen, Z.; Liang, L. Delay-aware content delivery with deep reinforcement learning in internet of vehicles. IEEE Trans. Intell. Transp. Syst. 2022, 23, 8918–8929. [Google Scholar] [CrossRef]
Zou, W.; Cui, Z.; Li, B.; Zhou, Z.; Hu, Y. Beamforming codebook design and performance evaluation for 60 GHz wireless communication. In Proceedings of the 2011 11th International Symposium on Communications and Information Technologies (ISCIT), Hangzhou, China, 12–14 October 2011; pp. 30–35. [Google Scholar]
Simsek, M.; Bennis, M.; Guvenc, I. Learning based frequency- and time-domain inter-cell interference coordination in HetNets. IEEE Trans. Veh. Technol. 2015, 64, 4589–4602. [Google Scholar] [CrossRef] [Green Version]
Qiu, C.; Hu, Y.; Chen, Y.; Zeng, B. Deep deterministic policy gradient (DDPG)-based energy harvesting wireless communications. IEEE Internet Things J. 2019, 6, 8577–8588. [Google Scholar] [CrossRef]
Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 7–9 July 2015; Bach, F., Blei, D., Eds.; PMLR: Cambridge, MA, USA, 2015; pp. 448–456. [Google Scholar]
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing Atari with Deep Reinforcement Learning. arXiv 2013, arXiv:1312.5602. [Google Scholar]
Canese, L.; Cardarilli, G.C.; Nunzio, L.D.; Fazzolari, R.; Giardino, D.; Re, M.; Spano, S. Multi-agent reinforcement learning: A review of challenges and applications. Appl. Sci. 2021, 11, 4948. [Google Scholar] [CrossRef]
Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 1998. [Google Scholar]

Figure 1. An example of a MISO SWIPT-enabled wireless IoT network model.

Figure 2. Structure of the proposed distributed DDQN algorithm in the multi-agent system.

Figure 3. Simulation topology.