Article

Attention-Shared Multi-Agent Actor–Critic-Based Deep Reinforcement Learning Approach for Mobile Charging Dynamic Scheduling in Wireless Rechargeable Sensor Networks

1 School of Automation and Electrical Engineering, University of Science and Technology Beijing, Beijing 100083, China
2 Beijing Engineering Research Center of Industrial Spectrum Imaging, Beijing 100083, China
3 Department of Automation, Tsinghua University, Beijing 100084, China
* Author to whom correspondence should be addressed.
Entropy 2022, 24(7), 965; https://doi.org/10.3390/e24070965
Submission received: 9 June 2022 / Revised: 9 July 2022 / Accepted: 9 July 2022 / Published: 12 July 2022
(This article belongs to the Topic Machine and Deep Learning)

Abstract

The breakthrough of wireless energy transmission (WET) technology has greatly promoted the development of wireless rechargeable sensor networks (WRSNs). A promising method to overcome the energy constraint problem in WRSNs is mobile charging, which employs a mobile charger to charge sensors via WET. Recently, more and more studies have addressed mobile charging scheduling under dynamic charging environments, but they ignore the joint optimization of charging sequence scheduling and charging ratio control (JSSRC). This paper proposes a novel attention-shared multi-agent actor–critic-based deep reinforcement learning approach for JSSRC (AMADRL-JSSRC). In AMADRL-JSSRC, we employ two heterogeneous agents, a charging sequence scheduler and a charging ratio controller, each with an independent actor network and critic network. Meanwhile, we design their reward functions by considering the tour length and the number of dead sensors. AMADRL-JSSRC trains decentralized policies in a multi-agent environment, using a centralized critic network that shares an attention mechanism and selects relevant policy information for each agent at every charging decision. Simulation results demonstrate that the proposed AMADRL-JSSRC can efficiently prolong the lifetime of the network and reduce the number of dead sensors compared with the baseline algorithms.

1. Introduction

Wireless sensor networks (WSNs) have been widely applied in target tracking, environment monitoring, intelligent medical care, military monitoring, etc. [1,2], with advantages including fast construction, self-organization, fault tolerance, and low-cost deployment [3]. WSNs are usually composed of a large number of sensors deployed in an area. However, sensors in a WSN are always powered by batteries, and the capacity of these batteries is constrained by the volume of the sensor, which limits the lifetime of the sensors. Furthermore, the energy constraint problem directly affects the quality of service of the WSN and greatly hinders its development. In recent years, the breakthrough of wireless energy transmission (WET) technology has greatly promoted the development of wireless rechargeable sensor networks (WRSNs) [4], since it provides a highly reliable and efficient energy supplement for the sensors. In particular, a promising method to overcome the energy constraint problem in WRSNs is mobile charging, which employs one or more mobile chargers (MCs) with a high capacity to charge sensors via WET. The MC can move to sensors autonomously and charge them according to a mobile charging scheduling scheme, which is formulated by the MC based on the status information of the sensors, including their residual energy, energy consumption rate, and position in the WRSN. The status information of sensors is highly controllable and predictable. Theoretically, WRSNs could work indefinitely under a well-designed charging scheme [5]. Therefore, the design of the charging scheme in a WRSN is critical, and it has drawn extensive attention from the research community.
Many works have been presented to design mobile charging schemes for WRSNs. According to whether the MC carries a predetermined charging scheme before starting from the base station, the existing works can be divided into two categories [6,7,8,9,10,11,12,13,14,15,16]: (1) offline methods [7,8,9,10,11,12] and (2) online methods [6,13,14,15,16]. In offline methods, before starting from the base station, the MC formulates a complete charging scheme according to the status of the sensors, including their accurate locations, fixed energy consumption rates, regular information transmission rates, etc. The MC then charges sensors along a trajectory determined by the charging scheme. Offline methods ignore the dynamic change in the status of sensors. Hence, they are not suitable for application scenarios where the energy consumption rate of the sensors changes in real time, nor for large-scale WRSNs. For example, Yan et al. [17] first attempted to introduce particle swarm optimization into optical wireless sensor networks, which could optimize the positioning of nodes, reduce the energy consumption of nodes effectively, and converge faster. In [18], Shu et al. made the first attempt to jointly address energy replenishment and operation scheduling in WRSNs. They proposed an f-Approximate algorithm and verified that it obtains an average 39.2% improvement in network lifetime over the baseline approaches. In [19], Feng et al. designed a novel algorithm called the newborn particle swarm optimization algorithm for charging scheduling in industrial rechargeable sensor networks, which adds new particles to improve particle diversity. This improvement gives the algorithm better global optimization ability and a faster search speed. V.K. Chawra et al. proposed a novel algorithm for scheduling multiple mobile chargers using a hybrid meta-heuristic technique in [20], which combines the best features of the Cuckoo Search and Genetic Algorithm to optimize the path scheduling problem and achieve shorter charging latency and higher energy usage efficiency. To enhance the charging efficiency [21,22,23,24], Zhang et al., Liang et al., and Wu et al. proposed hierarchical charging methods for multiple MCs to charge the sensors and themselves.
Different from offline methods, in some application scenarios the energy consumption rate of sensors is time-variant and there are many uncertain factors in the network, which prevent offline approaches from obtaining an acceptable charging scheduling scheme from the network information, whereas online approaches can successfully deal with these issues. Specifically, the MC does not need to know the status of the sensors before starting from the base station; it only needs to build candidate charging queues. When the residual energy of a sensor falls below a set threshold, the sensor sends a charging request and its energy information to the MC. The MC accepts the charging request and inserts it into the candidate charging queues. Then, the charging sequence is updated according to the status of the sensors. For example, Lin et al. aimed to maximize the charging efficiency while minimizing the number of dead sensors in order to prolong the lifetime of the WRSN in [16]. To this end, they developed a temporal–spatial real-time charging scheduling algorithm (TSCA) for the on-demand charging architecture. They also verified that TSCA obtains better charging throughput, charging efficiency, and successful charging rate than existing online algorithms, including the Nearest-Job-Next with Preemption scheme and the Double Warning thresholds with Double Preemption charging scheme. Feng et al. [25] proposed a mobile energy charging scheme that improves the charging performance in WRSNs by merging the advantages of the online and offline modes: it accounts for the dynamicity of the sensors' energy consumption as in the online mode while retaining the lower charging consumption obtained by optimizing the charging path of the mobile charger in the offline mode. Kaswan et al. converted a charging scheduling problem into a linear programming problem and presented a gravitational search algorithm [26], with a novel agent representation scheme and an efficient fitness function. In [27], Tomar et al. proposed a novel scheduling scheme for on-demand charging in WRSNs to jointly consider multiple mobile chargers and the issue of ill-timed charging responses to nodes with variable energy consumption rates.
Unfortunately, although online methods can address the mobile charging dynamic scheduling problem, they still have disadvantages, including short-sightedness, non-global optimization, and unfairness. Specifically, most recent works assume that the sensor closest to the MC is inserted into the current charging queue first. Meanwhile, sensors with low energy consumption rates are often ignored, resulting in their premature death and a reduction in the service quality of the WRSN. It is generally known that the mobile charging path planning problem in WRSNs is a Markov decision process, which has been proved to be NP-hard in [28]. Therefore, the most difficult problem is how to design an effective scheduling scheme that finds the optimal or a near-optimal solution quickly and reliably as the size of the network increases.
It is known that Reinforcement Learning (RL) is an effective method for addressing Markov decision processes. As mentioned above, the charging scheduling problem in WRSNs is NP-hard; thus, it cannot provide optimal labels for supervised learning. However, the quality of a set of charging decisions can be evaluated via reward feedback. Therefore, we need to design a reasonable reward function for RL according to the states of the WRSN. During the interaction between agent and environment, a charging scheduling scheme is found by learning strategies that maximize the reward. Several works have tried to solve the charging scheduling problem with RL algorithms. For example, Wei et al. [29] and Soni and Shrivastava [30] proposed a charging path planning algorithm (CSRL) combining RL and an MC to extend the network lifetime and improve the autonomy of the MC. However, CSRL only suits the offline mode, where the energy consumption of sensor nodes is time-invariant. Moreover, it can only handle small-scale networks, since the Q-learning algorithm generally fails on high-dimensional or large state spaces. Cao et al. [28] proposed a deep reinforcement learning-based on-demand charging algorithm to maximize the sum of rewards collected by the mobile charger in a WRSN, subject to the energy capacity constraint of the mobile charger and the charging times of all sensor nodes. A novel charging scheme for dynamic WRSNs based on an actor–critic reinforcement learning algorithm was proposed by Yang et al. [31], which aimed to maximize the charging efficiency while minimizing the number of dead sensors to prolong the network lifetime. The above works made significant model and algorithm innovations, yet they ignore the impact of the sensor charging energy on the optimization performance. Although Yang et al. [31] proposed a charging coefficient to constrain the upper charging energy threshold, they assumed that all sensors have a fixed charging coefficient during scheduling, which cannot be adjusted according to the needs of the sensors. Specifically, the charging coefficient directly determines the charging energy for the sensor. Therefore, how to select the next sensor to be charged and determine its corresponding charging energy brings new challenges to the design of the charging scheme.
We study a joint mobile charging sequence scheduling and charging ratio control problem (JSSRC) to address the challenges mentioned above, where the charging ratio is a parameter introduced to determine the charging energy for each sensor and to replace on-demand charging requests with real-time changing demands. JSSRC provides timely, reliable, and global charging schemes for WRSNs in which the sensors' energy changes dynamically. Meanwhile, we propose the attention-shared multi-agent actor–critic deep reinforcement learning approach for JSSRC, abbreviated as AMADRL-JSSRC. We assume that the network deployment scenarios are friendly, barrier-free, and accessible, and that the transmission of information about real-time changes in energy consumption is reliable and deterministic. When the residual energy of the MC is insufficient, it is allowed to return to the depot to renew its battery.
Table 1 highlights the performance comparison of the existing approaches and the proposed approach with respect to four key attributes.
The main contributions of this work are summarized as follows.
(1)
Different from the existing works, we consider both charging sequence and charging ratio optimization simultaneously. We introduce two heterogeneous agents, named the charging sequence scheduler and the charging ratio controller. These two agents make their charging decisions separately under dynamically changing environments, aiming to prolong the lifetime of the network and minimize the number of dead sensors.
(2)
We design a novel reward function with a penalty coefficient by comprehensively considering the tour length of MC and the number of dead sensors for AMADRL-JSSRC, so as to promote the agents to make better decisions.
(3)
We introduce the attention-shared mechanism in AMADRL-JSSRC to address the fact that the charging sequence and the charging ratio make different contributions to the reward function.
The rest of the paper is organized as follows: Section 2 describes the system models of the WRSN and formulates the JSSRC problem. The proposed AMADRL-JSSRC approach is described in Section 3. Simulation results are reported in Section 4. The impacts of the parameters on the charging performance are discussed in Section 5. Conclusions and future work are given in Section 6.

2. System Model and Problem Formulation

In this section, we present the network structure, energy consumption model of sensors, energy analysis of MC, and the formulation of the charging scheduling problem in WRSNs.

2.1. Network Structure

In Figure 1, a WRSN with $n$ sensors $SN = \{sn_1, sn_2, \ldots, sn_n\}$, an MC, a base station (BS), and a depot is adopted. It is assumed that, due to different information transmission tasks, all sensors have the same energy capacity $E_{sn}$ and sensing ability but different energy consumption rates. They are deployed in a 2D area without obstacles; the positions of all sensors are fixed and can be determined accurately, recorded as $(x_i, y_i)$, $i \in [1, n]$, and the position of the BS is set as $(x_0, y_0)$. Therefore, a weighted undirected graph $G = (\{BS, SN\}, D_{sn}, E_0, E_c)$ is used to describe the network model of the WRSN, where $D_{sn}$ is the set of distances between sensors, expressed as $D_{sn} = \{d_{ij} \mid d_{ij} = d(sn_i, sn_j)\}$, $i, j \in [1, n]$, with $d(sn_i, sn_j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$. The sets of initial residual energies and energy consumption rates of the sensors are represented by $E_0$ and $E_c$, respectively. $(x_D, y_D)$ is defined as the position of the depot.
It is assumed that each sensor in the WRSN collects data and communicates with the BS via ad hoc communication. The BS could estimate their residual energy according to the data sampling frequency and transmission flow. The MC can obtain the state information of the sensors but will not interfere with their working state. Meanwhile, the total moving distance of the MC during the charging tour is defined as $Dis$.
Although, in theory, the lifetime of the network can be extended indefinitely with a single MC or multiple MCs, the network will eventually shut down, since the energy modules of the sensors age. Therefore, inspired by [28,31], we define the lifetime in this article as below.
Definition 1 (Lifetime).
The lifetime of a WRSN is defined as the period from the beginning of network operation until the number of dead sensors reaches a threshold.
The lifetime and the threshold are denoted by $T_{life}$ and $\omega\%$, respectively. Furthermore, the abbreviations used in this paper are summarized in Table 2.

2.2. Energy Consumption Model of Sensors

The energy of a sensor is mainly consumed in data transmission and reception. Therefore, based on [32,33], the energy consumption model at time slot $t$ is adopted as below:

$$e_c^i(t) = \rho \sum_{k=1, k \neq i}^{n} f_{k,i}^{r}(t) + \sum_{j=1, j \neq i}^{n} \left[ \varsigma_{i,j}^{t} f_{i,j}^{t}(t) + \varsigma_{i,B}^{t} f_{i,B}^{t}(t) \right] \quad (1)$$

where $\rho$ is the energy consumption for receiving or transmitting 1 kb of data from sensor $sn_i$ to sensor $sn_j$ (or the BS). $\varsigma_{i,j}^{t} = \xi_1 + \xi_2 d_{i,j}^{r}$ represents the energy consumption for transmitting 1 kb of data between sensors, where $d_{i,j}$ is the distance between $sn_i$ and $sn_j$. $\xi_1$ and $\xi_2$ represent the distance-free and distance-related energy consumption indices, respectively, and $r$ is the signal attenuation coefficient. $f_{k,i}^{r}$ is the data flow received by $sn_i$ from sensor $sn_k$, while $f_{i,j}^{t}$ ($1 \leq j \leq n$) and $f_{i,B}^{t}$ are the data flows transmitted from $sn_i$ to $sn_j$ and to the BS, respectively. Hence, $\rho \sum_{k=1, k \neq i}^{n} f_{k,i}^{r}(t)$ represents the energy consumed by $sn_i$ for receiving information from all other sensor nodes, and $\sum_{j=1, j \neq i}^{n} [\varsigma_{i,j}^{t} f_{i,j}^{t}(t) + \varsigma_{i,B}^{t} f_{i,B}^{t}(t)]$ is the energy consumed by $sn_i$ for sending information to other sensors and the BS.
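To make the bookkeeping concrete, the following sketch evaluates Eq. (1) for one sensor and one time slot. The function name, argument layout, and default coefficient values are illustrative assumptions, not taken from the paper.

```python
def sensor_energy_consumption(i, recv_flow, send_flow, send_flow_bs,
                              dist, dist_bs, rho=50e-9, xi1=50e-9,
                              xi2=100e-12, r=2):
    """Energy consumed by sensor i in one time slot, following Eq. (1).

    recv_flow[k] : data (kb) received from sensor k
    send_flow[j] : data (kb) transmitted to sensor j
    send_flow_bs : data (kb) transmitted to the base station
    dist[j]      : distance between sensor i and sensor j
    dist_bs      : distance between sensor i and the base station
    """
    # reception cost: rho per kb received from every other sensor
    e_rx = rho * sum(f for k, f in enumerate(recv_flow) if k != i)
    # transmission cost: distance-free plus distance-related term per link
    e_tx = sum((xi1 + xi2 * dist[j] ** r) * f
               for j, f in enumerate(send_flow) if j != i)
    # transmission cost toward the base station
    e_tx_bs = (xi1 + xi2 * dist_bs ** r) * send_flow_bs
    return e_rx + e_tx + e_tx_bs
```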

2.3. Charging Model of MC

In this paper, the sensors in the WRSN are charged wirelessly by the MC, and the empirical wireless charging model is defined as [34]

$$P_c = \frac{G_s G_r \eta}{L_p} \left( \frac{\lambda}{4\pi (d_{ms} + \beta)} \right)^2 P_0 \quad (2)$$

where $d_{ms}$ represents the distance between the mobile charger and the sensor, $P_0$ is the output power, $G_s$ is the gain of the source antenna equipped on the mobile charger, $G_r$ is the gain of the receiver antenna, $\eta$ and $L_p$ denote the rectifier efficiency and the polarization loss, $\lambda$ is the wavelength, and $\beta$ is the parameter used to adjust the Friis free space equation for short-distance transmission.

Since the MC moves to a position near the sensor, the distance can be regarded as a constant. Therefore, (2) can be simplified to (3)

$$P_c = \Delta \mu P_0, \quad (3)$$

in which $\Delta = G_s G_r \eta \lambda^2 / (16 \pi^2 L_p)$ and $\mu = (d_{ms} + \beta)^{-2}$.
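As a quick illustration of Eqs. (2)-(3), the sketch below computes the received charging power for a given charger-sensor distance. The antenna gains, rectifier efficiency, polarization loss, wavelength, and Friis adjustment parameter used as defaults are placeholders, not values reported in the paper.

```python
import math

def charging_power(d_ms, P0, Gs=2.0, Gr=2.0, eta=0.9, Lp=1.0,
                   lam=0.33, beta=0.2316):
    """Received charging power P_c following Eqs. (2)-(3).

    d_ms : distance between the mobile charger and the sensor (m)
    P0   : output power of the charger (W)
    """
    delta = Gs * Gr * eta * lam ** 2 / (16 * math.pi ** 2 * Lp)
    mu = 1.0 / (d_ms + beta) ** 2      # short-range Friis-style attenuation
    return delta * mu * P0             # Eq. (3): P_c = delta * mu * P0
```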
The moving speed of the MC is set as $v_m$, and the energy consumed per meter of movement is $e_m$ J. The capacity of the MC is $E_{mc}$, and the target sensor is charged in one-to-one charging mode only when the MC reaches it.

2.4. Problem Formulation

We define three labels to describe the working state of the visited point at time slot $t$, $i \in [0, n]$: $i = visit$, $i \neq visit$, and $i = dead$. They represent that $sn_i$ is selected to be charged, not selected, or dead, respectively, while $i = 0$ represents that the visited point is the depot. The residual energy of sensor $sn_i$ is defined as $e_r^i(t)$, its charging demand is defined as $e_d^i(t)$, and the residual energy of the MC is defined as $e_{mc}^r(t)$.
At time slot $t$, the residual energy of the sensor is described by (4), and the charging demand is updated with (5)

$$e_r^i(t) = \begin{cases} e_r^i(t-1) - e_c^i(t) & i \neq visit,\ i \in [1, n] \\ e_r^i(t-1) + P_c & i = visit,\ i \in [1, n] \\ 0 & i = dead,\ i \in [1, n] \end{cases} \quad (4)$$

$$e_d^i(t) = \varepsilon E_{sn} - e_r^i(t) \quad (5)$$

where $\varepsilon$ is the charging ratio; it decides the upper threshold of the charging energy, and its value ranges in $(0, 1]$.
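The per-slot bookkeeping of Eqs. (4)-(5) can be sketched as follows; the function and its status labels are illustrative assumptions rather than the authors' implementation.

```python
def sensor_slot_update(e_r_prev, e_c, P_c, status, eps, E_sn):
    """One-slot update of residual energy (Eq. 4) and charging demand (Eq. 5).

    status : 'visit', 'not_visit', or 'dead'
    eps    : charging ratio in (0, 1]
    E_sn   : battery capacity of the sensor
    """
    if status == 'dead':
        e_r = 0.0
    elif status == 'visit':
        e_r = e_r_prev + P_c        # being charged by the MC in this slot
    else:
        e_r = e_r_prev - e_c        # normal consumption, Eq. (4)
    e_d = eps * E_sn - e_r          # demand toward eps * capacity, Eq. (5)
    return e_r, e_d
```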
To effectively charge the sensors, more energy of the MC should be used on charging sensors, while the energy wasted on moving between the sensors and the depot should be minimized. Hence, within the network lifetime $T_{life}$, the JSSRC problem under WRSNs with dynamic energy changing is defined as below.
Definition 2 (JSSRC).
The joint mobile charging sequence scheduling and the charging ratio control problem, which aims to prolong the lifetime of the network and minimize the number of dead sensors in WRSNs with dynamic energy changing, is defined as the JSSRC problem.
The relevant notations are defined as follows: at time slot $t$, the current state of sensor $i$ is defined by (6) according to its residual energy; $\tau_i(t) = 1$ indicates that the sensor is alive, and $\tau_i(t) = 0$ indicates that the sensor has died.

$$\tau_i(t) = \begin{cases} 1, & e_r^i(t) > 0 \\ 0, & e_r^i(t) \leq 0 \end{cases} \quad (6)$$

Furthermore, the number of dead sensors is defined as $N_d(t)$, which is obtained with (7)

$$N_d(t) = n - \sum_{i=1}^{n} \tau_i(t). \quad (7)$$
There are three termination conditions for the JSSRC scheme, and they are described with (8):
(1)
The number of dead sensors reaches $\omega\%$ of the total number, $\omega \in (0, 100]$.
(2)
The remaining energy of the MC is insufficient to return to the depot.
(3)
The target lifetime or the base time is reached.

$$\begin{cases} N_d(t) = \omega\% \cdot n \\ e_{mc}^r(t) < d \cdot e_m \\ t = T_{target} \end{cases} \quad (8)$$

where $d$ represents the distance from the MC's current location to the depot, $t$ is the running time of the test, and $T_{target}$ is a given base time. Specifically, when any of the termination conditions in (8) is met, the charging process ends.
Then, within the network lifetime, the JSSRC problem can be formulated as
$$\begin{aligned} \min\ & Dis \\ \min\ & N_d \\ \text{s.t.}\ & (4), (5), (6), (7), (8) \end{aligned} \quad (9)$$

3. Details of the Attention-Shared Multi-Agent Actor-Critic Based Deep Reinforcement Learning Approach for JSSRC (AMADRL-JSSRC)

JSSRC is a joint scheduling problem with sequence scheduling and charging ratio control; it is difficult to schedule them simultaneously with the traditional single-agent reinforcement learning algorithm. Therefore, the multi-agent reinforcement learning algorithm is introduced to solve this problem. In this section, we first briefly introduce multi-agent reinforcement learning algorithms. Then, we model the provided problem and propose the AMADRL-JSSRC.

3.1. Basis of Multi-Agent Reinforcement Learning

Multi-agent reinforcement learning is developed on the basis of reinforcement learning and is often described as a Markov game (or stochastic game) [35]. Multi-agent reinforcement learning is also an important branch of machine learning and deep learning, which aims to overcome the limitation that multi-objective control cannot be achieved by a single agent. The agents can be in cooperative, competitive, or mixed relationships, and they learn how to make decisions in an environment by observing the rewards obtained after performing actions in the environment. Specifically, with $m$ agents, each agent first receives its own observation $o_\vartheta$ ($\vartheta \in [1, m]$). Then, it selects an action $a_\vartheta$ from its action space, which is sent to the environment. After that, the environment state transits from $S$ to $S'$, and each agent receives a reward $r_\vartheta$ associated with this transition. The purpose of training the agents is to maximize the accumulated rewards collected by all agents.

3.2. Learning Model Construction for JSSRC

The tuple $\{S, A_1, A_2, R, S'\}$ is used to define the JSSRC scheme, where $S$ is the state space of the two agents, $A_1$ and $A_2$ are the action spaces, $R$ is the sum of rewards obtained by the two agents after performing actions, and $S'$ is the state of the environment after the actions are executed [36]. A state transition function is defined as $T$ with $T: S \times A_1 \times A_2 \to P(S')$, which gives the probability distribution over the possible next states. Furthermore, the two agents in JSSRC have their own sets of observations, $O_1$ and $O_2$. The environment state is defined as $S = (O_1, O_2)$, and the new environment state is defined as $S' = (O_1', O_2')$. The reward for each agent also depends on the global state and the actions of all agents; thus, we have the reward function $R_\vartheta: S \times A_1 \times A_2 \to \mathcal{R}$, where $\vartheta$ is the index of the agent and $\mathcal{R}$ is the set of all possible rewards.
The time step is defined as the time slot at which a scheduling decision is made. Hence, at the $k$-th time step, the MC visits position $i$ and completes the charging decision, where $i \in [0, n]$. $K$ is defined as the maximum time step at which any of the termination conditions is met. The time slot corresponding to the $k$-th time step is defined as $t(k)$; when the action of the $k$-th time step is completed, the corresponding time slot is recorded as $t(\underline{k})$.
A scheduling example of JSSRC is shown in Figure 2. For clarity, we omit the information communication process between sensors, leaving only the scheduling decisions and the charging path. The relationship between the time slot and the time step is described in the upper part of the figure. Within the network lifetime $T_{life}$, the two agents determine two actions $a_1(k)$ and $a_2(k)$ according to their observations $o_1(k)$ and $o_2(k)$ in state $S(k)$ at time step $k$. $a_1$ represents the decision made by agent 1 to choose the next sensor to be charged, and $a_2$ represents the decision made by agent 2 to control the charging ratio. The agents obtain their policies through continuous exploration and calculate the rewards $R$ from the obtained strategies at the end of $T_{life}$. The states, actions, policies, and rewards of the environment are defined as follows.
States of the environment: The state space of the environment in JSSRC includes the state information of the MC and the sensors, which are defined as $S_{mc}$ and $S_{net}$, respectively. At time step $k$, $S_{mc}$ is $(p_{mc}(k), e_{mc}^r(k))$ and $S_{net}$ is $(p_{sn_i}(k), e_d^i(k), e_c^i(k))$, where $p_{mc}(k)$ and $e_{mc}^r(k)$ are the position and the residual energy of the MC, $p_{sn_i}(k)$ is the position of the $sn_i$ to be visited, and $e_d^i(k)$ and $e_c^i(k)$ are the charging demand and the energy consumption rate of $sn_i$, with $k \in [0, K]$ and $i \in [0, n]$. $sn_0$ represents the depot, and the value of $e_d^0$ is 0 because the depot does not need to be charged. The state embedding is a $5 \times K$-dimensional vector at time step $k$ with $S(k) = (S_{mc}(k), S_{net}(k))$; only the positions of the sensors are static elements, while the others are dynamic.
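The following sketch assembles such a state; the exact row layout (x, y, charging demand, consumption rate, MC residual energy broadcast to every node) is an assumption about how the per-node features are organized, not a detail taken from the paper.

```python
import numpy as np

def build_state(mc_energy, node_pos, node_demand, node_rate):
    """Assemble the environment state S(k) = (S_mc(k), S_net(k)).

    node_pos    : (n+1, 2) array of positions, index 0 is the depot
    node_demand : (n+1,)   charging demands (0 for the depot)
    node_rate   : (n+1,)   energy consumption rates
    """
    n_nodes = node_pos.shape[0]
    mc_col = np.full((n_nodes, 1), mc_energy)     # broadcast MC residual energy
    state = np.hstack([node_pos,
                       node_demand.reshape(-1, 1),
                       node_rate.reshape(-1, 1),
                       mc_col])
    return state.astype(np.float32)               # shape (n+1, 5)
```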
Actions of the environment: The actions in JSSRC represent the decision of the target sensor and the charging ratio, which are determined by two agents.
Policies of the environment: The policy of a single agent is described by $a = \pi(o)$, where $a$ is an action, $o$ is the observation of the agent, and $\pi$ is the policy. In JSSRC, there are two agents; we define the two agents with policies parameterized by $\theta = \{\theta_1, \theta_2\}$ and let $\pi = \{\pi_1, \pi_2\}$ with $\pi_\vartheta: O_\vartheta \to P(A_\vartheta)$, where $P(A_\vartheta) \in [0, 1]$, $\vartheta = 1, 2$. The main goal of JSSRC is to learn a set of optimal policies that maximize the two agents' expected discounted rewards.
Rewards of the environment: Reward is used to evaluate the action; its value is obtained by the agent after executing an action. In this paper, our goal is to improve the charging performance of WRSN, which includes minimizing the moving distance of MC and reducing the number of dead sensors. Since the total number of dead sensors is inversely proportional to the reward, if the performed actions lead to more sensors being dead, we will give a penalty for this behavior. Therefore, the expected discounted rewards for two agents are defined with (10), and the immediate reward obtained after performing the actions at the k-th time step is defined with (11).
The expected discounted rewards for two agents can be defined as
$$R_\vartheta(\pi_\vartheta) = \mathbb{E}_{a_1 \sim \pi_1, a_2 \sim \pi_2, S \sim T} \left[ \sum_{k=0}^{K} \gamma^k r_\vartheta\big(S(k), a_1(k), a_2(k)\big) \right], \quad \vartheta \in [1, 2]. \quad (10)$$

$$r_\vartheta\big(S(k), a_1(k), a_2(k)\big) = -\varpi\, d(k-1, k) + (-\epsilon)\, N_d(k). \quad (11)$$

$$Dis_{a_1 \sim \pi_1, a_2 \sim \pi_2} = \sum_{k=1}^{K} d(k-1, k) \quad (12)$$

where the action space of $a_1$ is $A_1 = \{a_1 \mid a_1 \in \{0, 1, \ldots, n\}\}$, and the action space of $a_2$ is $A_2 = \{a_2 \mid a_2 \in \{0.5, 0.6, \ldots, 1\}\}$. $\varpi$ is a reward coefficient between 0 and 1, which ensures that the shorter the moving distance, the greater the obtained reward. $N_d(k)$ indicates the number of newly dead sensors after the actions at the $k$-th time step are performed, and $\epsilon$ is the penalty coefficient. In (12), $Dis_{a_1 \sim \pi_1, a_2 \sim \pi_2}$ represents the total moving distance accumulated when the termination conditions are met. Obviously, the charging sequence decision and the charging ratio decision contribute differently to the reward function, which complicates the design of the algorithm.
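A minimal sketch of the immediate reward of Eq. (11) and the discounted return inside Eq. (10) follows; the penalty and discount values mirror the choices reported in Section 4.1, while the reward coefficient is an illustrative placeholder.

```python
def step_reward(dist_moved, new_dead, omega=0.5, penalty=10.0):
    """Immediate reward of Eq. (11): r = -omega * d(k-1, k) - penalty * N_d(k)."""
    return -omega * dist_moved - penalty * new_dead


def discounted_return(rewards, gamma=0.9):
    """Discounted return inside the expectation of Eq. (10) for one episode."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```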
State Space Update of the environment: One episode of the JSSRC can be formed as a finite sequence of decisions, observations, actions, and immediate rewards, which is described in Table 3.
To display the specific update process of the states, we assume that the MC is located at the depot at time step 0. At each time step, the MC decides the next sensor to be charged from $SN$ and determines the corresponding charging ratio for it. The residual energy of sensor $sn_i$ before and after charging is defined as $e_r^i(k)$ and $e_r^i(\underline{k})$, respectively. The charging demand of each sensor and the residual energy of the MC are updated after performing the charging operation at time step $k$, as follows:
$$e_r^{i(i=visit)}(k) = \max\big\{0,\ e_r^i(\underline{k-1}) - t_m(k)\, e_c^i(k)\big\} \quad (13)$$

$$e_r^{i(i=visit)}(\underline{k}) = \varepsilon_i(k) E_{sn} \quad (14)$$

$$e_r^{i(i \neq visit)}(k) = \max\big\{0,\ e_r^i(\underline{k-1}) - t_m(k)\, e_c^i(k)\big\}. \quad (15)$$

$$e_r^{i(i \neq visit)}(\underline{k}) = e_r^{i(i \neq visit)}(k) - t_c(k)\, e_c^i(k). \quad (16)$$

where $t_m(k)$ is the moving duration of the MC between the $(k-1)$-th and $k$-th time steps.
It is assumed that at the $(k-1)$-th time step, the MC is located at $sn_j$, and at the $k$-th time step, the MC is located at $sn_i$. Therefore, we have $d(k, k-1) = d_{ij}$, and $t_m(k)$ can be obtained by (17)

$$t_m(k) = \frac{d(k, k-1)}{v_m} \quad (17)$$
If s n i is alive at the k-th time step, the charging time is
$$t_c^{i(i=visit)}(k) = \frac{\varepsilon_i(k) E_{sn} - e_r^{i(i=visit)}(k)}{P_c - e_c^i(k)} \quad (18)$$

where $\varepsilon_i(k)$ is the unique charging ratio of $sn_i$ at the $k$-th time step.
Therefore, the charging demands of $sn_i$ under the three working states are

$$e_d^{i(i=visit)}(k) = \varepsilon_i(k) E_{sn} - e_r^{i(i=visit)}(k) \quad (19)$$

$$e_d^{i(i \neq visit)}(k) = \varepsilon_i(k) E_{sn} - e_r^{i(i \neq visit)}(\underline{k}). \quad (20)$$

$$e_d^{i(i=dead)}(k) = 0 \quad (21)$$
The residual energy of the MC before and after performing the charging operation is defined as $e_r^{mc}(k)$ and $e_r^{mc}(\underline{k})$, respectively; they are updated with (22) and (23)

$$e_r^{mc}(k) = \max\big\{0,\ e_r^{mc}(\underline{k-1}) - d(k-1, k)\, e_m\big\}. \quad (22)$$

$$e_r^{mc}(\underline{k}) = \max\big\{0,\ e_r^{mc}(k) - e_d^{i(i=visit)}(k)\big\} \quad (23)$$
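Putting Eqs. (13)-(23) together, one charging decision can be simulated with the bookkeeping sketch below. The `env` object and its attribute names (positions, residual energies, consumption rates, MC state) are a hypothetical container introduced here for illustration, not the authors' simulator.

```python
import math

def charge_step(env, i, eps_i):
    """Move the MC to sensor i and charge it up to eps_i * E_sn (Eqs. 13-23)."""
    d = math.dist(env.mc_pos, env.pos[i])
    t_move = d / env.v_m                                      # Eq. (17)
    # every sensor keeps consuming while the MC moves, Eqs. (13)/(15)
    env.e_r = [max(0.0, e - t_move * c) for e, c in zip(env.e_r, env.e_c)]
    env.mc_energy = max(0.0, env.mc_energy - d * env.e_m)     # Eq. (22)
    # charging time until sensor i reaches eps_i * E_sn, Eq. (18)
    t_charge = (eps_i * env.E_sn - env.e_r[i]) / (env.P_c - env.e_c[i])
    delivered = eps_i * env.E_sn - env.e_r[i]
    env.e_r[i] = eps_i * env.E_sn                             # Eq. (14)
    # the other alive sensors keep draining during charging, Eq. (16)
    for j in range(len(env.e_r)):
        if j != i and env.e_r[j] > 0:
            env.e_r[j] = max(0.0, env.e_r[j] - t_charge * env.e_c[j])
    env.mc_energy = max(0.0, env.mc_energy - delivered)       # Eq. (23)
    env.mc_pos = env.pos[i]
    return t_move + t_charge
```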
To speed up the training and obtain feasible solutions, we impose the following constraints (a feasibility-check sketch follows the list):
(1)
The MC could visit any position in the network as long as its residual energy could satisfy the charging demand of the next selected sensor or is enough to move back to the depot.
(2)
All sensors with a charging demand greater than 0 have a certain probability of being selected as the next one to be charged.
(3)
The MC does not charge the sensors whose charging demands are zero.
(4)
If the residual energy of MC does not satisfy the charging demand of the next selected sensor, but it is enough to return to the depot, the MC is allowed to return to the depot to charge itself, and the charging time of the MC is ignored.
(5)
The charging decision of two adjacent time steps cannot be the same sensor or depot.
(6)
If the residual energy of the MC does not meet the charging demand for the next sensor, is not enough to return to the depot, or the preset network lifetime is reached, the charging plan will be ended no matter whether the sensors are still alive or not.
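The constraints above can be enforced as a simple action mask before each decision. The sketch below assumes the same hypothetical `env` container as in the earlier sketches and illustrates constraints (1)-(5) in a simplified form.

```python
import math

def feasible_targets(env, last_target):
    """Boolean mask over visit positions; index 0 is the depot."""
    mask = [False] * len(env.e_d)
    for i in range(1, len(env.e_d)):
        # constraints (2)/(3): positive demand; constraint (5): no repeat
        if env.e_d[i] <= 0 or i == last_target:
            continue
        cost_move = math.dist(env.mc_pos, env.pos[i]) * env.e_m
        cost_back = math.dist(env.pos[i], env.depot) * env.e_m
        # constraint (1): MC can serve the demand and still reach the depot
        if env.mc_energy >= cost_move + env.e_d[i] + cost_back:
            mask[i] = True
    # constraint (4): returning to the depot is allowed when it is reachable
    mask[0] = (last_target != 0 and
               env.mc_energy >= math.dist(env.mc_pos, env.depot) * env.e_m)
    return mask
```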

3.3. AMADRL-JSSRC Algorithm

As depicted in Figure 3, AMADRL-JSSRC’s implementation consists of the environment, the experience replay buffer (D), the mini-batch (B), the obtained rewards, and the different neural networks. The environment can be partially observed by each agent, where the actor and critic networks estimate the optimal control policies for the charging sequence scheduler and the charging ratio controller. The detail of training AMADRL-JSSRC is described in Algorithm 1.
In traditional methods such as MADDPG [36] and MAPPO [37], each agent receives information from the other agents without discrimination and calculates the corresponding Q-value. In JSSRC, however, the contributions of the charging sequence scheduler and the charging ratio controller to the Q-value are different: compared with the charging ratio, the charging sequence decision has a greater impact on the reward. To calculate the Q-value function $Q_\vartheta^\varphi(s, a)$ for agent $\vartheta$, we introduce an attention mechanism with a differentiable key-value memory model [38,39]. This kind of mechanism does not need to make any assumptions about the temporal or spatial locality of the inputs, which makes it well suited to the difficulty that each agent has a different action space and contributes a different reward in this article.
At each time step, the critic network of each agent receives the observation information $s = (o_1, o_2)$ and the action information $a = (a_1, a_2)$, for all $\vartheta \in [1, 2]$. We define the set of all agents except $\vartheta$ as $\backslash\vartheta$, and we use $\hat{\vartheta}$ to index this set. $Q_\vartheta^\varphi(s, a)$ is defined as the value function of agent $\vartheta$, which is obtained by combining the observation information, the action information, and the contribution from the other agents:

$$Q_\vartheta^\varphi(s, a) = f_\vartheta\big(g_\vartheta(o_\vartheta, a_\vartheta), c_\vartheta\big) \quad (24)$$
where $f_\vartheta$ is a two-layer multi-layer perceptron (MLP) [40], and $g_\vartheta$ is a one-layer MLP embedding function. $c_\vartheta$ is the contribution from the other agents, which is a weighted sum of the values of the other agents given by (25)

$$c_\vartheta = \sum_{\hat{\vartheta} \neq \vartheta} \kappa_{\hat{\vartheta}}\, v_{\hat{\vartheta}} = \sum_{\hat{\vartheta} \neq \vartheta} \kappa_{\hat{\vartheta}}\, h\big(V g_{\hat{\vartheta}}(o_{\hat{\vartheta}}, a_{\hat{\vartheta}})\big) \quad (25)$$
In (25), $v_{\hat{\vartheta}}$ is the value of agent $\hat{\vartheta}$, encoded with its embedding function and then linearly transformed by the shared matrix $V$. $h$ is an element-wise nonlinear activation function, the leaky ReLU, which retains some negative-axis values to prevent all negative-axis information from being lost. $h$ is realized by (26)

$$h(x) = \begin{cases} x & x > 0 \\ \phi x & \text{otherwise} \end{cases} \quad (26)$$

where $\phi$ is a very small constant.
The attention weight $\kappa_{\hat{\vartheta}}$ uses a bilinear mapping (i.e., a query-key system) to compare the embedding $e_{\hat{\vartheta}}$ with $e_\vartheta = g_\vartheta(o_\vartheta, a_\vartheta)$, and it passes the similarity value between these two embeddings into a softmax function:

$$\kappa_{\hat{\vartheta}} \propto \exp\big(e_{\hat{\vartheta}}^{T} W_k^{T} W_q e_\vartheta\big) \quad (27)$$

where $e_\vartheta$ is transformed into a "query" with $W_q$, and $e_{\hat{\vartheta}}$ is transformed into a "key" with $W_k$ [41].
To prevent vanishing gradients, the matching is scaled by the dimensionality of these two matrices. A multiple-attention-head mechanism is introduced in AMADRL-JSSRC; each head has a separate set of parameters ($W_k$, $W_q$, $V$) and gives rise to an aggregated contribution from the other agent to agent $\vartheta$. We concatenate the contributions of all heads into a single vector. Most importantly, each head can focus on a different weighted mixture of agents.
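The sketch below reproduces one attention head of Eqs. (24)-(27) in plain numpy; the matrix shapes and the explicit score scaling are assumptions made for illustration, and the full critic concatenates several such heads before the two-layer MLP $f_\vartheta$.

```python
import numpy as np

def leaky_relu(x, phi=0.01):
    """Element-wise activation h(x) of Eq. (26)."""
    return np.where(x > 0, x, phi * x)

def attention_contribution(e_self, e_others, Wq, Wk, V):
    """Attention-weighted contribution c of Eq. (25) for a single head.

    e_self   : (d,)   embedding g(o, a) of the current agent
    e_others : (m, d) embeddings of the other agents
    Wq, Wk, V: (d, d) shared query/key/value matrices
    """
    query = Wq @ e_self
    keys = e_others @ Wk.T                       # one key per other agent
    scores = keys @ query / np.sqrt(len(query))  # scaled bilinear scores, Eq. (27)
    kappa = np.exp(scores - scores.max())
    kappa /= kappa.sum()                         # softmax attention weights
    values = leaky_relu(e_others @ V.T)          # h(V g(o, a))
    return kappa @ values                        # weighted sum c
```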
In AMADRL-JSSRC, the weights for extracting selectors, keys, and values are shared between two agents, because the multi-agent value function is essentially a multi-task regression problem. This parameter sharing in the critic network enables our method to learn effectively in an environment where the action space and reward for individual agents are different but share common observation features. The structure of the critic network and the structure of the multiple head attention mechanism are clearly shown in the left part of Figure 3.

3.4. Parameters Update in AMADRL-JSSRC

The parameters $\bar{\varphi}$ and $\bar{\theta}$ used in the critic networks and the policy gradients are updated according to lines 17 to 24 and lines 28 to 32 in Algorithm 1, respectively.
Since the parameters are shared among critic networks in AMADRL-JSSRC, all critic networks are updated together to minimize a joint regression loss function:
$$L_Q(\varphi) = \sum_{\vartheta=1}^{2} \mathbb{E}_{(s, a, r, s') \sim D}\left[\big(Q_\vartheta^\varphi(s, a) - y_\vartheta\big)^2\right] \quad (28)$$

In (28), $y_\vartheta$ is obtained by (29)

$$y_\vartheta = r_\vartheta + \gamma\, \mathbb{E}_{a' \sim \pi_{\bar{\theta}}(s')}\left[Q_\vartheta^{\bar{\varphi}}(s', a') - \alpha \log\big(\pi_{\bar{\theta}_\vartheta}(a'_\vartheta \mid o'_\vartheta)\big)\right] \quad (29)$$

It is worth noting that $Q_\vartheta^{\bar{\varphi}}$ is used to estimate the action value for agent $\vartheta$ by receiving the observation information and action information from all agents. $D$ is a replay buffer that stores past experiences. In (29), $\alpha$ is a parameter that trades off maximizing entropy against maximizing rewards.
Since the charging sequence decision has a greater impact on the expected reward than the charging ratio decision, in order to derive the optimal policies objectively, we need to compare the value of a specific action with the value of the agent's average action while keeping the other agent fixed. We can then determine whether the action leads to an increase in the expected return, or whether any increase in reward should instead be attributed to the actions of the other agent. This problem is called multi-agent credit assignment. An effective solution is to introduce an advantage function [42] with a baseline that marginalizes out only the actions of the given agent from $Q_\vartheta^\varphi(s, a)$; the form of this advantage function is shown below:
$$A_\vartheta(s, a) = Q_\vartheta^\varphi(s, a) - b(s, a_{\hat{\vartheta}}), \quad (30)$$

where

$$b(s, a_{\hat{\vartheta}}) = \mathbb{E}_{a_\vartheta \sim \pi_\vartheta(o_\vartheta)}\left[Q_\vartheta^\varphi\big(s, (a_\vartheta, a_{\hat{\vartheta}})\big)\right] \quad (31)$$

In (31), $b(s, a_{\hat{\vartheta}})$ is the multi-agent baseline used to calculate the advantage function.
We calculate this baseline within the AMADRL-JSSRC algorithm in a single forward pass by outputting the expected return $Q_\vartheta^\varphi(s, (a_\vartheta, a_{\hat{\vartheta}}))$ for every possible action $a_\vartheta \in A_\vartheta$. The expectation can then be calculated exactly with (32)

$$\mathbb{E}_{a_\vartheta \sim \pi_\vartheta(o_\vartheta)}\left[Q_\vartheta^\varphi\big(s, (a_\vartheta, a_{\hat{\vartheta}})\big)\right] = \sum_{a_\vartheta \in A_\vartheta} \pi(a_\vartheta \mid o_\vartheta)\, Q_\vartheta\big(s, (a_\vartheta, a_{\hat{\vartheta}})\big) \quad (32)$$
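In code, the counterfactual baseline and advantage of Eqs. (30)-(32) reduce to a dot product over this agent's action space; the sketch below assumes the critic already outputs one Q-value per action of the agent, with the other agent's action fixed.

```python
import numpy as np

def counterfactual_advantage(q_all_actions, pi_probs, a_taken):
    """Advantage A(s, a) of Eq. (30) with the baseline of Eqs. (31)-(32).

    q_all_actions : (|A|,) critic outputs, one per action of this agent
    pi_probs      : (|A|,) current policy probabilities of this agent
    a_taken       : index of the action actually executed
    """
    baseline = float(np.dot(pi_probs, q_all_actions))  # Eq. (32)
    return q_all_actions[a_taken] - baseline           # Eq. (30)
```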
To achieve this goal, we make the following four adjustments:
(1)
We must remove $a_\vartheta$ from the input of $Q_\vartheta$ and output a value for every action.
(2)
We need to add an observation encoder, $e_\vartheta = g_\vartheta^s(o_\vartheta)$, to replace the $e_\vartheta = g_\vartheta(o_\vartheta, a_\vartheta)$ in (24) described above.
(3)
We also modify $f_\vartheta$ to output the Q-values of all possible actions rather than that of the single input action.
(4)
To avoid overgeneralization [43], we sample all actions from the current strategies of all agents to calculate the gradient estimation of agent $\vartheta$, rather than sampling the actions of other agents from the experience replay buffer as in [36,39].
Algorithm 1 AMADRL-JSSRC
1: Initialize the number of parallel environments for the two agents as $N_p$, the update counter of parallel operation as $T_{update}$, the experience replay buffer as $D$ and the minibatch as $B$, the number of episodes as $N_e$, the number of steps per episode as $N_{pe}$, the number of critic updates as $N_{cu}$, the number of policy updates as $N_{pu}$, and the number of attention heads as $N_m$; initialize the critic network $Q^\varphi$ and the actor network $\pi_\theta$ with random parameters $\varphi$, $\theta$; initialize the target networks, $\bar{\varphi} \leftarrow \varphi$ and $\bar{\theta} \leftarrow \theta$, $T_{update} \leftarrow 0$
2: for  i e p = 1 , , N e    do
3: Reset environments, and obtain the initial o ϑ e n v for each agent, ϑ
4:  for $k = 1, \ldots, N_{pe}$ do
5:   Sample actions $a_\vartheta^{env} \sim \pi_\vartheta(\cdot \mid o_\vartheta^{env})$ for each agent $\vartheta$, in each environment ($env$),
     with a greedy search strategy
6:   Send the actions to all parallel environments, then obtain $o_\vartheta'^{env}$ and $r_\vartheta^{env}$ for all agents
7:   Store the transitions of all environments in $D$
8:    $T_{update} = T_{update} + N_p$
9:   if $T_{update} \geq$ min steps per update then
10:    for $j = 1, \ldots, N_{cu}$ do
11:    Sample the mini-batch $B$
12:    function Update Critic ($B$):
13:     Unpack the mini-batch ($B$)
14:      $(o_{1,2}^B, a_{1,2}^B, r_{1,2}^B, o_{1,2}'^B) \leftarrow B$
15:     Calculate $Q_\vartheta^\varphi(o_{1,2}^B, a_{1,2}^B)$ for the two agents in parallel
16:     Calculate $a_1'^B \sim \pi_1^{\bar{\theta}}(o_1'^B)$ and $a_2'^B \sim \pi_2^{\bar{\theta}}(o_2'^B)$ with the target policies
17:     Calculate $Q_\vartheta^{\bar{\varphi}}(o_{1,2}'^B, a_{1,2}'^B)$ for the two agents in parallel with the target critic
18:     Update the critic with $L_Q(\varphi)$ shown in (28) and the Adam optimizer [44]
19:    end function Update Critic
20:    end for
21:    for $j = 1, \ldots, N_{pu}$ do
22:    Sample $N_m \times (o_{1,2}) \sim D$
23:    function Update Policies ($o_{1,2}^B$)
24:     Calculate $a_{1,2}^B \sim \pi_\vartheta^{\bar{\theta}}(o_\vartheta^B)$, $\vartheta = 1, 2$
25:     Calculate $Q_\vartheta^\varphi(o_{1,2}^B, a_{1,2}^B)$ for the two agents in parallel
26:     Update the policies with $\nabla_{\theta_\vartheta} J(\pi_\theta)$ shown in (33) and the Adam optimizer [44]
27:     end function Update Policies
28:   end for
29:    Update the target parameters:
          $\bar{\varphi} = \tau \bar{\varphi} + (1 - \tau)\varphi$, $\bar{\theta} = \tau \bar{\theta} + (1 - \tau)\theta$
30:      $T_{update} \leftarrow 0$
31:      end if
32:    end for
33:  end for
34: Output: The parameters of target actor
Therefore, the policies of each agent will be updated by:
$$\nabla_{\theta_\vartheta} J(\pi_\theta) = \mathbb{E}_{s \sim D,\, a \sim \pi}\Big[\nabla_{\theta_\vartheta} \log\big(\pi_{\theta_\vartheta}(a_\vartheta \mid o_\vartheta)\big)\Big(-\alpha \log\big(\pi_{\theta_\vartheta}(a_\vartheta \mid o_\vartheta)\big) + Q_\vartheta^\varphi(s, a) - b(s, a_{\hat{\vartheta}})\Big)\Big] \quad (33)$$
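For a single sampled transition, the bracketed weight of Eq. (33) can be sketched as follows; the entropy temperature value is an illustrative default, and automatic differentiation would normally supply the log-probability gradient.

```python
def policy_gradient_term(log_prob_grad, log_prob, q_value, baseline, alpha=0.2):
    """Single-sample estimate of the gradient inside Eq. (33).

    log_prob_grad : gradient of log pi(a|o) w.r.t. the policy parameters
    log_prob      : log pi(a|o) of the sampled action
    q_value       : Q(s, a) from the attention critic
    baseline      : counterfactual baseline b(s, a_other) of Eq. (31)
    alpha         : entropy temperature trading off entropy and reward
    """
    weight = -alpha * log_prob + q_value - baseline
    return log_prob_grad * weight
```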

4. Experimental Setup and Results

In this section, we will conduct experiments to evaluate the performance of AMADRL-JSSRC. The simulations are divided into two phases: (1) the training phase of AMADRL-JSSRC and (2) the testing phase for a comparative study with baseline algorithms. The experiment setting and training details are described in Section 4.1. The testing details of the comparison with the baseline algorithms are described in Section 4.2.

4.1. Experimental Environment and Details

We implement AMADRL-JSSRC using Python 3.9.7 and TensorFlow 2.7.0 and train it over 10,000 episodes, where each episode is divided into 100 time slots. Then, AMADRL-JSSRC is tested over ten episodes, and the average values of the important metrics are calculated.
We use the same simulation settings as described in [31], with some details supplemented here. We assume that the locations of the sensors are assigned uniformly at random in the unit square $[0, 1] \times [0, 1]$, and the initial residual energy of each sensor is randomly generated between 10 and 20 J. The moving speed of the MC is 0.1 m/s, and the energy consumed per unit distance of movement is 0.1 J. The rate at which the MC charges a sensor is 1 J/s, and the time the MC takes to return to the depot to charge itself is ignored here. The main simulation settings are provided in Table 4. In addition, the relevant data on the real-time energy consumption rates of the sensors are shown in Table 5.
After the network environment is initialized, we conduct simulation training on the environment. Our implementation uses an experience replay buffer of size $10^5$, and the size of the minibatch is 1024. As for the neural networks, all networks (the separate policies and those contained within the centralized critic networks) use a hidden dimension of 128, and Leaky Rectified Linear Units are used as the nonlinear activation. We train our models with the Adam optimizer [44] and set different initial learning rates for different network sizes. The key parameters used in the training stage are described in Table 6.
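For reference, the stated training settings can be collected into a single configuration; the learning rate entry is a placeholder, since the paper varies it with the network size.

```python
# Training configuration; gamma and the penalty coefficient follow the
# choices reported in Section 4.1, the rest are as stated above.
train_config = {
    "episodes": 10_000,
    "time_slots_per_episode": 100,
    "replay_buffer_size": int(1e5),
    "minibatch_size": 1024,
    "hidden_dim": 128,
    "activation": "leaky_relu",
    "optimizer": "adam",
    "reward_discount_gamma": 0.9,
    "penalty_coefficient": 10,
    "learning_rate": 1e-3,   # placeholder; varied with the network size
}
```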
We trained our model for three different environment settings on four NVIDIA GeForce GTX 2080ti GPUs for 10 h, after which the observed qualitative differences between the results of consecutive training iterations were negligible. We present one set of experimental results to describe the relationship between episodes and reward, which is shown in Figure 4. We can see that the obtained rewards increase gradually over the episodes and reach their peak values after about 240 training episodes. This is mainly due to the efficient learning of AMADRL-JSSRC in the WRSN with dynamic energy changes, which enables the agents to make reasonable decisions and obtain a greater reward.
Since the reward discount factor and the penalty coefficient have a great impact on the performance of the algorithm during training and testing, we conducted two sets of experiments, and the results are shown in Table 7 and Table 8, respectively. These experimental results show that AMADRL-JSSRC strives for a long-term reward rather than a short-sighted reward when the reward discount factor approaches 1. Furthermore, with the increase in the penalty coefficient, the number of dead sensors gradually decreases. The reason is that, in order to obtain a high global charging reward, AMADRL-JSSRC preferentially charges the sensors with low residual energy to avoid sensor death when a large value is assigned to $\epsilon$. Therefore, in this paper, the penalty coefficient is set to 10, and the reward discount factor is set to 0.9.

4.2. Comparison Results against the Baselines

In this section, we compare the performance of AMADRL-JSSRC with that of the ACRL algorithm, the GREEDY algorithm, the dynamic programming (DP) algorithm, and two typical online charging scheme algorithms, NJNP and TSCA [16]. The detailed execution processes of the above algorithms are shown in [31]. It is noted that some details of the baseline algorithms need to be adjusted. For example, we have replaced the reward calculation equation in line 13 of Algorithm 1, line 6 of Algorithm 2, and line 10 of Algorithm 3 described in [31] with $r(S(k), a(k)) = -\varpi\, d(k-1, k) + (-\epsilon)\, N_d(k)$, and we change their seeking rule from the minimum global reward to the maximum global reward.
We consider three networks with different scales, including 50, 100, and 200 sensors; these environments are denoted as JSSRC50, JSSRC100, and JSSRC200. We have run our tests on WRSNs based on these environments, and the corresponding MC capacity is set as 50, 80, and 150 J. In addition, the base time of these three tests is set as 100 s, 200 s, and 300 s, respectively. Unless otherwise specified, these parameters are fixed during the test.
The tour length, the extra time, and the number of dead sensors obtained by the different algorithms in the different JSSRC environments are shown in Table 9. It is observed that when the network size is small, such as the network with 50 sensors, the exact heuristic algorithms are better than the AMADRL-JSSRC and ACRL algorithms in terms of average tour length and the average number of dead sensors. Meanwhile, ACRL performs slightly better than AMADRL-JSSRC on JSSRC50. However, as the network scale increases, the results of AMADRL-JSSRC and ACRL significantly outperform GREEDY, DP, NJNP, and TSCA, and the AMADRL-JSSRC and ACRL algorithms begin to show their superiority. The AMADRL-JSSRC algorithm is better than the ACRL algorithm, especially in terms of the number of dead sensors. The reason for this phenomenon is that the charging ratio of the ACRL algorithm is fixed and cannot adapt to the real-time charging demand, which leads to some sensors dying while the MC is charging the selected sensors. The extra-time comparisons are also presented in this table, where all times are reported on one NVIDIA GeForce GTX 2080ti. We find that our proposed approach significantly improves the solution while only adding a small computational cost in runtime. Moreover, the extra time of AMADRL-JSSRC is longer than that of ACRL, verifying that multi-agent collaborative decision making consumes more computational resources.

5. Discussions

The impacts of the parameters on the charging performance, including the capacity of the sensors and the capacity of the MC, as well as the performance comparison in terms of lifetime, are discussed in this section. The test environment is set as 100 sensors, the baseline time is 300 s, the initial capacity of each sensor is 50 J, and the initial capacity of the MC is 100 J. Meanwhile, the initial residual energy of the sensors changes with the sensor capacity. Since the baseline algorithms do not have the ability to adaptively control the charging ratio for each sensor, for a fair comparison, we introduce the optimal charging ratio from Table 4, which is named the charging coefficient in [31]. Therefore, the charging ratios of the baseline algorithms are ACRL $\varepsilon = 0.7$, GREEDY $\varepsilon = 0.8$, DP $\varepsilon = 0.9$, NJNP $\varepsilon = 0.8$, and TSCA $\varepsilon = 0.8$, respectively.

5.1. The Impacts of the Capacity of the Sensor

As depicted in Figure 5, NJNP has the lowest tour length, and the average tour length gradually decreases with the increase in sensor capacity. The reason is that, with the increase in sensor capacity, the number of charging requests decreases as long as the residual energy is sufficient. Moreover, the charging time for each sensor is also prolonged due to the increased sensor capacity. Given the fixed baseline time, the more time the MC spends on charging sensors, the less time it spends on moving. Therefore, the average tour length decreases gradually. The fluctuation in Figure 5 is caused by the random distribution of sensor positions and the dynamic change of their energy consumption rates in each test. Furthermore, the NJNP algorithm has the lowest tour length because it preferentially charges the sensors close to the MC. It is noted that the moving distance of AMADRL-JSSRC is slightly longer than that of ACRL. This is because AMADRL-JSSRC determines different charging ratios for the selected sensors according to the real-time charging demand to avoid the punishment caused by dead sensors. Therefore, AMADRL-JSSRC spends slightly less time on charging than ACRL, and when the base time is fixed, more time is spent on moving, resulting in a longer moving distance.
Figure 6 shows that, with the increase in sensor capacity, the average number of dead sensors exhibits the opposite trend to the average tour length. Obviously, the average number of dead sensors of AMADRL-JSSRC is always the smallest. This is because the charging ratios of the baseline algorithms are fixed, whereas AMADRL-JSSRC adapts the charging ratio for each selected sensor. With the increase in sensor capacity, the charging time for each selected sensor is prolonged, increasing the risk of death for subsequent sensors with low residual energy. This result proves that the adaptive control of the charging ratio for each sensor can effectively improve the charging performance of the network.

5.2. The Impacts of the Capacity of MC

Figure 7 and Figure 8 show the impacts of the MC capacity on the average tour length and the average number of dead sensors, respectively. With the increase in MC capacity, when the baseline time is fixed, the MC can reduce the number of times it returns to the depot to charge itself. This change shortens the moving distance of the MC and decreases the risk of death for subsequent low-residual-energy sensors while it returns to or leaves the depot. Meanwhile, the figures verify that, compared with the ACRL algorithm, AMADRL-JSSRC obtains a smaller number of dead sensors at the cost of a certain increase in moving distance.

5.3. Performance Comparison in Terms of Lifetime

We have analyzed the test results of the six schemes under a fixed baseline lifetime. In this section, we explore the lifetime of the six schemes under different JSSRC environments, namely JSSRC50, JSSRC100, and JSSRC200, until the termination condition is satisfied. The six algorithms were each run 50 times independently, and the test results are shown in Figure 9, Figure 10 and Figure 11. It can be seen from the figures that, although the fluctuation range of the network lifetime obtained by the AMADRL-JSSRC algorithm is large, both the lower and the upper bounds of the network lifetime are still significantly higher than those of the other five algorithms. Moreover, with the increase in the number of sensors, this advantage becomes more significant. It is noted that the network lifetime obtained by AMADRL-JSSRC is better than that of ACRL, which further proves that adaptively adjusting the charging ratio for each sensor can effectively prolong the network lifetime.

6. Conclusions

In this paper, a novel joint charging sequence scheduling and charging ratio control problem is studied, and an attention-shared multi-agent actor–critic-based deep reinforcement learning approach (AMADRL-JSSRC) is proposed, where a charging sequence scheduler and a charging ratio controller are employed to determine the target sensor and the charging ratio by interacting with the environment. AMADRL-JSSRC trains decentralized policies in a multi-agent environment, using a centralized critic network that shares an attention mechanism and selects relevant policy information for each agent. AMADRL-JSSRC significantly prolongs the lifetime of the WRSN and minimizes the number of dead sensors, and its advantage is more significant when dealing with large-scale WRSNs. In future work, a multi-agent reinforcement learning approach for multiple MCs to complete the charging tasks jointly is a key point for further study.

Author Contributions

Conceptualization, C.J. and Z.W.; methodology, J.L.; data curation, S.C. and H.W.; investigation, J.X.; project administration, W.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant 62173032, Grant 62173028 and the Foshan Science and Technology Innovation Special Project under Grant BK20AF005.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Liu, G.; Su, X.; Hong, F.; Zhong, X.; Liang, Z.; Wu, X.; Huang, Z. A Novel Epidemic Model Base on Pulse Charging in Wireless Rechargeable Sensor Networks. Entropy 2022, 24, 302.
  2. Ayaz, M.; Ammad-Uddin, M.; Baig, I.; Aggoune, M. Wireless Sensor’s Civil Applications, Prototypes, and Future Integration Possibilities: A Review. IEEE Sens. J. 2018, 18, 4–30.
  3. Raza, M.; Aslam, N.; Le-Minh, H.; Hussain, S.; Cao, Y.; Khan, N.M. A Critical Analysis of Research Potential, Challenges, and Future Directives in Industrial Wireless Sensor Networks. IEEE Commun. Surv. Tutor. 2018, 20, 39–95.
  4. Liu, G.; Peng, Z.; Liang, Z.; Li, J.; Cheng, L. Dynamics Analysis of a Wireless Rechargeable Sensor Network for Virus Mutation Spreading. Entropy 2021, 23, 572.
  5. Liu, G.; Huang, Z.; Wu, X.; Liang, Z.; Hong, F.; Su, X. Modelling and Analysis of the Epidemic Model under Pulse Charging in Wireless Rechargeable Sensor Networks. Entropy 2021, 23, 927.
  6. Liang, H.; Yu, G.; Pan, J.; Zhu, T. On-Demand Charging in Wireless Sensor Networks: Theories and Applications. In Proceedings of the IEEE International Conference on Mobile Ad-Hoc & Sensor Systems, Hangzhou, China, 14–16 October 2013; pp. 28–36.
  7. Wang, C.; Yang, Y.; Li, J. Stochastic Mobile Energy Replenishment and Adaptive Sensor Activation for Perpetual Wireless Rechargeable Sensor Networks. In Proceedings of the 2013 IEEE Wireless Communications and Networking Conference (WCNC), Shanghai, China, 7–10 April 2013; pp. 974–979.
  8. Feng, Y.; Liu, N.; Wang, F.; Qian, Q.; Li, X. Starvation Avoidance Mobile Energy Replenishment for Wireless Rechargeable Sensor Networks. In Proceedings of the IEEE International Conference on Communications (ICC), Kuala Lumpur, Malaysia, 22–27 May 2016; pp. 1–6.
  9. Liang, W.; Xu, Z.; Xu, W.; Shi, J.; Mao, G.; Das, S.K. Approximation Algorithms for Charging Reward Maximization in Rechargeable Sensor Networks via a Mobile Charger. IEEE/ACM Trans. Netw. 2017, 25, 3161–3174.
  10. Peng, Y.; Li, Z.; Zhang, W.; Qiao, D. Prolonging Sensor Network Lifetime through Wireless Charging. In Proceedings of the 2010 31st IEEE Real-Time Systems Symposium, RTSS 2010, San Diego, CA, USA, 30 November–3 December 2010; pp. 129–139.
  11. Li, Z.; Peng, Y.; Zhang, W.; Qiao, D. J-RoC: A Joint Routing and Charging Scheme to Prolong Sensor Network Lifetime. In Proceedings of the 2011 19th IEEE International Conference on Network Protocols, Vancouver, BC, Canada, 17–20 October 2011; pp. 373–382.
  12. Chen, F.; Zhao, Z.; Min, G.; Wu, Y. A Novel Approach for Path Plan of Mobile Chargers in Wireless Rechargeable Sensor Networks. In Proceedings of the 2016 12th International Conference on Mobile Ad-Hoc and Sensor Networks (MSN), Hefei, China, 16–18 December 2016; pp. 63–68.
  13. Ping, Z.; Yiwen, Z.; Shuaihua, M.; Xiaoyan, K.; Jianliang, G. RCSS: A Real-Time on-Demand Charging Scheduling Scheme for Wireless Rechargeable Sensor Networks. Sensors 2018, 18, 1601.
  14. He, L.; Kong, L.; Gu, Y.; Pan, J.; Zhu, T. Evaluating the on-Demand Mobile Charging in Wireless Sensor Networks. IEEE Trans. Mob. Comput. 2015, 14, 1861–1875.
  15. Lin, C.; Han, D.; Deng, J.; Wu, G. P2S: A Primary and Passer-By Scheduling Algorithm for On-Demand Charging Architecture in Wireless Rechargeable Sensor Networks. IEEE Trans. Veh. Technol. 2017, 66, 8047–8058.
  16. Chi, L.; Zhou, J.; Guo, C.; Song, H.; Obaidat, M.S. TSCA: A Temporal-Spatial Real-Time Charging Scheduling Algorithm for on-Demand Architecture in Wireless Rechargeable Sensor Networks. IEEE Trans. Mob. Comput. 2018, 17, 211–224.
  17. Yan, Z.; Goswami, P.; Mukherjee, A.; Yang, L.; Routray, S.; Palai, G. Low-Energy PSO-Based Node Positioning in Optical Wireless Sensor Networks. Opt.-Int. J. Light Electron Opt. 2018, 181, 378–382.
  18. Shu, Y.; Shin, K.G.; Chen, J.; Sun, Y. Joint Energy Replenishment and Operation Scheduling in Wireless Rechargeable Sensor Networks. IEEE Trans. Ind. Inform. 2017, 13, 125–134.
  19. Feng, Y.; Zhang, W.; Han, G.; Kang, Y.; Wang, J. A Newborn Particle Swarm Optimization Algorithm for Charging-Scheduling Algorithm in Industrial Rechargeable Sensor Networks. IEEE Sens. J. 2020, 20, 11014–11027.
  20. Chawra, V.K.; Gupta, G.P. Correction to: Hybrid Meta-Heuristic Techniques Based Efficient Charging Scheduling Scheme for Multiple Mobile Wireless Chargers Based Wireless Rechargeable Sensor Networks. Peer-Peer Netw. Appl. 2021, 14, 1316.
  21. Zhang, S.; Wu, J.; Lu, S. Collaborative Mobile Charging. IEEE Trans. Comput. 2015, 64, 654–667.
  22. Liang, W.; Xu, W.; Ren, X.; Jia, X.; Lin, X. Maintaining Large-Scale Rechargeable Sensor Networks Perpetually via Multiple Mobile Charging Vehicles. ACM Trans. Sens. Netw. 2016, 12, 1–26.
  23. Wu, J. Collaborative Mobile Charging and Coverage. J. Comp. Sci. Technol. 2014, 29, 550–561.
  24. Madhja, A.; Nikoletseas, S.; Raptis, T.P. Hierarchical, Collaborative Wireless Charging in Sensor Networks. In Proceedings of the 2015 IEEE Wireless Communications and Networking Conference (WCNC), New Orleans, LA, USA, 9–12 March 2015; pp. 1285–1290.
  25. Feng, Y.; Guo, L.; Fu, X.; Liu, N. Efficient Mobile Energy Replenishment Scheme Based on Hybrid Mode for Wireless Rechargeable Sensor Networks. IEEE Sens. J. 2019, 19, 10131–10143.
  26. Kaswan, A.; Tomar, A.; Jana, P.K. An Efficient Scheduling Scheme for Mobile Charger in on-Demand Wireless Rechargeable Sensor Networks. J. Netw. Comput. Appl. 2018, 114, 123–134.
  27. Tomar, A.; Muduli, L.; Jana, P.K. A Fuzzy Logic-Based On-Demand Charging Algorithm for Wireless Rechargeable Sensor Networks with Multiple Chargers. IEEE Trans. Mob. Comput. 2021, 20, 2715–2727.
  28. Cao, X.; Xu, W.; Liu, X.; Peng, J.; Liu, T. A Deep Reinforcement Learning-Based on-Demand Charging Algorithm for Wireless Rechargeable Sensor Networks. Ad Hoc Netw. 2021, 110, 102278.
  29. Wei, Z.; Liu, F.; Lyu, Z.; Ding, X.; Shi, L.; Xia, C. Reinforcement Learning for a Novel Mobile Charging Strategy in Wireless Rechargeable Sensor Networks. In Wireless Algorithms, Systems, and Applications; Chellappan, S., Cheng, W., Li, W., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2018; pp. 485–496.
  30. Soni, S.; Shrivastava, M. Novel Wireless Charging Algorithms to Charge Mobile Wireless Sensor Network by Using Reinforcement Learning. SN Appl. Sci. 2019, 1, 1052. [Google Scholar] [CrossRef] [Green Version]
  31. Yang, M.; Liu, N.; Zuo, L.; Feng, Y.; Liu, M.; Gong, H.; Liu, M. Dynamic Charging Scheme Problem with Actor-Critic Reinforcement Learning. IEEE Internet Things J. 2021, 8, 370–380. [Google Scholar] [CrossRef]
  32. Xie, L.; Shi, Y.; Hou, Y.T.; Sherali, H.D. Making Sensor Networks Immortal: An Energy-Renewal Approach with Wireless Energy Transmission. IEEE/ACM Trans. Netw. 2012, 20, 1748–1761. [Google Scholar] [CrossRef]
  33. Hou, Y.T.; Shi, Y.; Sherali, H.D. Rate Allocation and Network Lifetime Problems for Wireless Sensor Networks. IEEE/ACM Trans. Netw. 2008, 16, 321–334. [Google Scholar] [CrossRef]
  34. Shu, Y.; Yousefi, H.; Cheng, P.; Chen, J.; Gu, Y.J.; He, T.; Shin, K.G. Near-Optimal Velocity Control for Mobile Charging in Wireless Rechargeable Sensor Networks. IEEE Trans. Mob. Comput. 2016, 15, 1699–1713. [Google Scholar] [CrossRef] [Green Version]
  35. Littman, M.L. Markov Games as a Framework for Multi-Agent Reinforcement Learning. In Machine Learning Proceedings 1994; Cohen, W.W., Hirsh, H., Eds.; Morgan Kaufmann: San Francisco, CA, USA, 1994; pp. 157–163. [Google Scholar] [CrossRef] [Green Version]
  36. Lowe, R.; Wu, Y.; Tamar, A.; Harb, J. Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. Available online: https://doi.org/10.48550/arXiv.1706.02275 (accessed on 7 June 2017).
  37. Yu, C.; Velu, A.; Vinitsky, E.; Wang, Y.; Wu, Y. The Surprising Effectiveness of PPO in Cooperative, Multi-Agent Games. Available online: https://arxiv.org/abs/2103.01955 (accessed on 2 March 2021).
  38. Graves, A.; Wayne, G.; Danihelka, I. Neural Turing Machines. Available online: https://arxiv.org/abs/1410.5401v1 (accessed on 20 October 2014).
  39. Oh, J.; Chockalingam, V.; Singh, S.; Lee, H. Control of Memory, Active Perception, and Action in Minecraft. Available online: https://arxiv.org/abs/1605.09128 (accessed on 30 May 2016).
  40. Foerster, J.; Farquhar, G.; Afouras, T.; Nardelli, N.; Whiteson, S. Counterfactual Multi-Agent Policy Gradients. Available online: https://arxiv.org/abs/1705.08926 (accessed on 24 May 2017).
  41. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. Available online: https://arxiv.org/abs/1706.03762 (accessed on 12 June 2017).
  42. Iqbal, S.; Sha, F. Actor-Attention-Critic for Multi-Agent Reinforcement Learning. Available online: https://arxiv.org/abs/1810.02912 (accessed on 5 October 2018).
  43. Wei, E.; Wicke, D.; Freelan, D.; Luke, S. Multiagent Soft Q-Learning. Available online: https://arxiv.org/abs/1804.09817 (accessed on 25 April 2018).
  44. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. Available online: https://arxiv.org/abs/1412.6980 (accessed on 22 December 2014).
Figure 1. An example WRSN with a mobile charger.
Figure 2. A scheduling example of JSSRC.
Figure 3. The structure of the AMADRL-JSSRC algorithm.
Figure 4. Reward per episode.
Figure 5. The impact of the capacity of the sensor on the average tour length.
Figure 6. The impact of the capacity of the sensor on the average number of dead sensors.
Figure 7. The impact of the capacity of MC on the average tour length.
Figure 8. The impact of the capacity of MC on the average number of dead sensors.
Figure 9. The lifetime of different algorithms on JSSRC50.
Figure 10. The lifetime of different algorithms on JSSRC100.
Figure 11. The lifetime of different algorithms on JSSRC200.
Table 1. Performance Comparison of the Existing Approaches and the Proposed Approach.
Category | Approach | Dynamic Change of the Sensor Energy Consumption | Charging Sequence Scheduling | Charging Ratio Control | Charging Sequence Scheduling and Charging Ratio Control Simultaneously
Off-line | [17] | No | Yes | No | No
Off-line | [18] | No | Yes | No | No
Off-line | [19] | No | Yes | No | No
Off-line | [20] | No | Yes | No | No
Off-line | [21] | No | Yes | No | No
On-line | [16] | Yes | Yes | No | No
On-line | [25] | Yes | Yes | No | No
On-line | [26] | Yes | Yes | No | No
On-line | [27] | Yes | Yes | Yes | No
RL | [28] | No | Yes | No | No
RL | [29] | No | Yes | No | No
RL | [30] | No | Yes | No | No
RL | [31] | Yes | Yes | No | No
— | Ours | Yes | Yes | Yes | Yes
Table 2. Abbreviations used in this paper.
Abbreviation | Description
WRSN | Wireless rechargeable sensor network
MC | Mobile charger
BS | Base station
Dis | Total moving distance of MC during the charging tour
JSSRC | Joint mobile charging sequence scheduling and charging ratio control problem
AMADRL | Attention-shared multi-agent actor–critic-based deep reinforcement learning
Smc | State information of MC
Snet | State information of network
ACRL | Actor–critic reinforcement learning
DP | Dynamic programming
NJNP | Nearest-job-next with preemption
TSCA | Temporal–spatial real-time charging scheduling algorithm
Table 3. State Space Update.
Time Step | Observation | Agent 1 Action Space (A1) | Agent 1 Individual Reward (r1) | Agent 2 Action Space (A2) | Agent 2 Individual Reward (r2) | Immediate Reward
1 | O(1) = (o1(1), o2(1)) | a1(1) | r1(1) | a2(1) | r2(1) | R(1) = R(o1(1), o2(1), a1(1), a2(1))
k | O(k) = (o1(k), o2(k)) | a1(k) | r1(k) | a2(k) | r2(k) | R(k) = R(o1(k), o2(k), a1(k), a2(k))
K | O(K) = (o1(K), o2(K)) | a1(K) | r1(K) | a2(K) | r2(K) | R(K) = R(o1(K), o2(K), a1(K), a2(K))
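The per-step bookkeeping in Table 3 maps naturally onto a replay-buffer record that stores the joint observation, each agent's action and individual reward, and the shared immediate reward. Below is a minimal Python sketch of such a record; the class and field names are illustrative assumptions, not identifiers from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Transition:
    """One row of Table 3: joint observation, per-agent actions/rewards, shared reward."""
    obs: Tuple[List[float], List[float]]       # O(k) = (o1(k), o2(k))
    action_seq: int                            # a1(k): sensor index chosen by the charging sequence scheduler
    action_ratio: float                        # a2(k): charging ratio chosen by the charging ratio controller
    reward_seq: float                          # r1(k)
    reward_ratio: float                        # r2(k)
    reward_joint: float                        # R(k) = R(o1(k), o2(k), a1(k), a2(k))
    next_obs: Tuple[List[float], List[float]]  # O(k+1), kept for bootstrapping

# Example: store one (hypothetical) decision step in a simple list-based buffer.
buffer: List[Transition] = []
buffer.append(Transition(obs=([0.40, 0.70], [0.20, 0.90]),
                         action_seq=3, action_ratio=0.6,
                         reward_seq=-1.2, reward_ratio=0.5, reward_joint=-0.7,
                         next_obs=([0.38, 0.66], [0.19, 0.85])))
```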
Table 4. The Parameters of the Simulation Settings.
Parameter | Description | Value
— | Network size | [0, 1] × [0, 1]
— | Number of sensors | 50–200
E_mc | MC initial energy | 100 J
v_ms | Moving speed of MC | 0.1 m/s
e_m | Energy consumed by MC per unit moving distance | 0.1 J/m
P_c | Charging speed of MC | 1 J/s
ε | Charging ratio | —
E_sn | Energy capacity of sensor | 50 J
ω | Threshold of the number of dead sensors | 0.5
E_0 | Set of initial residual energy | 10–20 J
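For readers who want to reproduce the simulation environment, the settings in Table 4 can be collected into a single configuration object. The sketch below simply mirrors the table; the class and field names are hypothetical, and the charging ratio ε is omitted because it is an action rather than a fixed parameter.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class SimulationConfig:
    # Values copied from Table 4; names are illustrative.
    field_size: Tuple[float, float] = (1.0, 1.0)   # sensors deployed in the unit square [0,1] x [0,1]
    num_sensors_range: Tuple[int, int] = (50, 200)
    mc_initial_energy_j: float = 100.0             # E_mc
    mc_speed_m_per_s: float = 0.1                  # v_ms
    mc_move_cost_j_per_m: float = 0.1              # e_m
    mc_charge_power_j_per_s: float = 1.0           # P_c
    sensor_capacity_j: float = 50.0                # E_sn
    dead_sensor_threshold: float = 0.5             # ω
    init_residual_energy_j: Tuple[float, float] = (10.0, 20.0)  # E_0 range

cfg = SimulationConfig()
print(cfg.mc_initial_energy_j)  # 100.0
```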
Table 5. Energy parameters of sensor.
Parameter | Description | Value
ξ1 | Distance-free energy consumption index | 5 × 10⁻¹² J/bit
ξ2 | Distance-related energy consumption index | 1.3 × 10⁻⁴ J/bit
ρ | Energy consumption for receiving or transmitting one bit | 5 × 10⁻⁸ J/bit
— | Number of bits | 2 × 10⁴
r | Signal attenuation coefficient | 4
— | Packet generation probability per second | 0.2–0.5
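As a rough illustration of how the coefficients in Table 5 enter a sensor's energy budget, the sketch below evaluates a per-packet transmission/reception cost. The additive form used here, b·(ρ + ξ1 + ξ2·d^r) for transmitting and b·ρ for receiving, is an assumption for illustration only; the exact consumption model is the one defined earlier in the paper.

```python
# Coefficients from Table 5 (negative exponents as restored in the table above).
XI_1 = 5e-12    # distance-free energy consumption index (J/bit)
XI_2 = 1.3e-4   # distance-related energy consumption index (J/bit)
RHO = 5e-8      # energy for receiving or transmitting one bit (J/bit)
R_ATT = 4       # signal attenuation coefficient
BITS = 2 * 10**4

def tx_energy(bits: int, distance_m: float) -> float:
    """Energy (J) to transmit `bits` over `distance_m`, under the assumed additive model."""
    return bits * (RHO + XI_1 + XI_2 * distance_m ** R_ATT)

def rx_energy(bits: int) -> float:
    """Energy (J) to receive `bits`, under the assumed model."""
    return bits * RHO

# Example: one packet sent over 0.3 m in the normalized deployment field.
print(tx_energy(BITS, 0.3), rx_energy(BITS))
```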
Table 6. Key Parameters of the Training Stage.
Parameter | Description | Value
D | Size of experience replay buffer | 10⁵
B | Size of mini-batch | 1024
π_lr | Actor learning rate | 5 × 10⁻⁴ (JSSRC50, JSSRC100); 5 × 10⁻⁵ (JSSRC200)
Q_lr | Critic learning rate | 5 × 10⁻⁴ (JSSRC50, JSSRC100); 5 × 10⁻⁵ (JSSRC200)
N_p | Number of parallel environments | 4
N_e | Number of episodes | 10⁴
N_pe | Number of steps per episode | 100
N_cu | Number of critic updates | 4
N_pu | Number of policy updates | 4
N_m | Number of attention heads | 4
N_ut | Number of target updates | 10³
— | Optimizer method | Adam
γ | Reward discount | 0.9
ϖ | Reward coefficient | 0.5
ϵ | Penalty coefficient | 10
τ | Update rate of target parameters | 0.005
α | Temperature parameter | 0.01
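Two entries in Table 6 that benefit from a concrete reading are the Adam learning rates and the target-update rate τ. The PyTorch sketch below shows one standard way these values would be wired up, with Polyak (soft) target updates; the tiny networks are placeholders and do not reflect the paper's actor and critic architectures.

```python
import copy
import torch
import torch.nn as nn

# Placeholder networks standing in for an actor and the attention-shared critic.
actor = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4))
critic = nn.Sequential(nn.Linear(12, 64), nn.ReLU(), nn.Linear(64, 1))
target_critic = copy.deepcopy(critic)

# Learning rates from Table 6 (JSSRC50/JSSRC100 setting).
actor_opt = torch.optim.Adam(actor.parameters(), lr=5e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=5e-4)

TAU = 0.005  # update rate of target parameters (Table 6)

def soft_update(target: nn.Module, source: nn.Module, tau: float = TAU) -> None:
    """Polyak averaging: theta_target <- tau * theta_source + (1 - tau) * theta_target."""
    with torch.no_grad():
        for t_param, s_param in zip(target.parameters(), source.parameters()):
            t_param.mul_(1.0 - tau).add_(tau * s_param)

soft_update(target_critic, critic)
```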
Table 7. Impact of the reward discount (ϵ = 10).
γ | 0.5 | 0.6 | 0.7 | 0.8 | 0.9
Reward | −60.72 | −18.05 | 0.19 | 20.89 | 52.45
Number of dead sensors | 25 | 18 | 13 | 8 | 5
Moving distance (m) | 11.37 | 13.15 | 14.85 | 15.55 | 16.77
Table 8. Impact of the penalty coefficient (γ = 0.9).
ϵ | 0 | 1 | 5 | 8 | 10
Reward | 60.33 | 38.29 | 37.64 | 40.12 | 52.45
Number of dead sensors | 20 | 15 | 11 | 7 | 5
Moving distance (m) | 14.54 | 15.09 | 15.88 | 16.23 | 16.77
Table 9. The Results of Different Algorithms over the Test Set.
Environment | Algorithm | Mean Length | Std | Mean N_d | Base Time | Extra Time
JSSRC50 | AMADRL-JSSRC | 13.918 | 0.802 | 3 | 100 | 0.905
JSSRC50 | ACRL | 13.878 | 0.798 | 4 | 100 | 0.788
JSSRC50 | GREEDY | 13.902 | 0.834 | 2 | 100 | 0.647
JSSRC50 | DP | 14.068 | 0.856 | 6 | 100 | 0.743
JSSRC50 | NJNP | 13.834 | 0.815 | 5 | 100 | 0.516
JSSRC50 | TSCA | 14.028 | 0.755 | 4 | 100 | 0.498
JSSRC100 | AMADRL-JSSRC | 17.454 | 1.228 | 5 | 200 | 1.463
JSSRC100 | ACRL | 16.768 | 1.266 | 8 | 200 | 1.32
JSSRC100 | GREEDY | 18.233 | 1.445 | 13 | 200 | 1.38
JSSRC100 | DP | 18.088 | 1.328 | 13 | 200 | 1.12
JSSRC100 | NJNP | 16.891 | 1.306 | 12 | 200 | 0.995
JSSRC100 | TSCA | 17.718 | 1.205 | 11 | 200 | 0.936
JSSRC200 | AMADRL-JSSRC | 36.769 | 1.813 | 8 | 300 | 1.828
JSSRC200 | ACRL | 36.126 | 1.998 | 12 | 300 | 1.482
JSSRC200 | GREEDY | 37.856 | 3.162 | 19 | 300 | 1.635
JSSRC200 | DP | 37.532 | 2.376 | 18 | 300 | 1.864
JSSRC200 | NJNP | 35.513 | 2.265 | 17 | 300 | 1.465
JSSRC200 | TSCA | 35.921 | 2.169 | 16 | 300 | 1.416
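The Mean Length, Std, and Mean N_d columns of Table 9 are aggregates over the test instances of each environment. A small helper along the following lines would produce them from raw per-instance results; the dictionary keys are illustrative.

```python
import statistics
from typing import Dict, List

def summarize(results: List[Dict[str, float]]) -> Dict[str, float]:
    """Aggregate per-instance test results into the columns reported in Table 9."""
    lengths = [r["tour_length"] for r in results]
    dead = [r["num_dead_sensors"] for r in results]
    return {
        "mean_length": statistics.mean(lengths),
        "std_length": statistics.stdev(lengths),
        "mean_dead": statistics.mean(dead),
    }

# Example with three hypothetical JSSRC50 test instances.
print(summarize([
    {"tour_length": 13.7, "num_dead_sensors": 3},
    {"tour_length": 14.1, "num_dead_sensors": 2},
    {"tour_length": 13.9, "num_dead_sensors": 4},
]))
```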
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
