Peer-Review Record

ACUTE: Attentional Communication Framework for Multi-Agent Reinforcement Learning in Partially Communicable Scenarios

Electronics 2022, 11(24), 4204; https://doi.org/10.3390/electronics11244204
by Chengzhang Zhao 1,2, Jidong Zhao 2, Zhekai Du 2 and Ke Lu 1,2,*
Submission received: 18 November 2022 / Revised: 11 December 2022 / Accepted: 13 December 2022 / Published: 16 December 2022
(This article belongs to the Section Artificial Intelligence)

Round 1

Reviewer 1 Report

In this paper, the authors proposed the Attentional Communication Framework for multi-agent reinforcement learning. At each step, each individual agent receives encoded messages from a random subset of other agents. An attention mechanism is used to combine these messages as well as the observation of the agent itself. The proposed framework is based on deep Q-learning. All agents share the same network parameters and are trained via experience replay. 
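For concreteness, the following is a minimal sketch of the per-agent step summarized above. It assumes a PyTorch implementation; the module names, layer sizes, and the scaled dot-product attention variant are illustrative assumptions, not the authors' actual code.

```python
import torch
import torch.nn as nn


class AttentionalCommAgent(nn.Module):
    """Sketch of one agent step: attention over own observation plus received messages."""

    def __init__(self, obs_dim, msg_dim, n_actions, hidden=64):
        super().__init__()
        self.obs_encoder = nn.Linear(obs_dim, msg_dim)  # encode own observation into message space
        self.query = nn.Linear(msg_dim, hidden)
        self.key = nn.Linear(msg_dim, hidden)
        self.value = nn.Linear(msg_dim, hidden)
        self.q_head = nn.Linear(hidden, n_actions)      # Q-values over discrete actions (DQN-style)

    def forward(self, own_obs, received_msgs):
        # own_obs: (obs_dim,); received_msgs: (k, msg_dim), from a random subset of other agents
        own = self.obs_encoder(own_obs).unsqueeze(0)                # (1, msg_dim)
        keys_values = torch.cat([own, received_msgs], dim=0)        # own encoding + k messages
        q = self.query(own)                                         # query comes from own observation
        k, v = self.key(keys_values), self.value(keys_values)
        attn = torch.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)  # (1, k + 1) attention weights
        fused = attn @ v                                            # (1, hidden) combined feature
        return self.q_head(fused).squeeze(0)                        # (n_actions,) Q-values
```

In this reading, parameter sharing simply means that every agent evaluates the same `AttentionalCommAgent` instance, and the resulting transitions go into a common replay buffer for Q-learning updates.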

Overall, the paper was well structured and clearly presented. The proposed algorithm was compared with existing ones in several experimental settings and showed advantages. I have the following comments and questions for the authors regarding some details of this paper:

1. The analogy between continuous / discrete communication channels and continuous / discrete actions is unnecessary and seems off in lines 99-100.

2. Long and awkward sentence in lines 102-106.

3. Delete “too” from line 109.

4. Add relevant citations after “reinforcement learning” in line 120.

5. Line 131, “Neighboring” 

6. Line 155, the definition of T in Algorithm 1 is better than the “time cursor for the game” here.

7. In equations (1) and (2), observations are denoted as “omega”, but in figure 1, observations are denoted as “o”.

8. Please consolidate the use of letters j, mu, m, and C in Section 4.

9. May consider using bold letters to indicate vectors and bold capital letters to indicate matrices in Section 4.

10. y means very different things in Equations (2) and (4) and in Section 4.1. Please reduce the overuse of this letter.

11. For the proposed framework, it doesn’t seem to make a difference whether the states are only partially observed or we treat each observation directly as a state itself. Please confirm and comment.

Author Response

Response to Reviewer 1:                                                                    

General comment: 

In this paper, the authors proposed the Attentional Communication Framework for multi-agent reinforcement learning. At each step, each individual agent receives encoded messages from a random subset of other agents. An attention mechanism is used to combine these messages as well as the observation of the agent itself. The proposed framework is based on deep Q-learning. All agents share the same network parameters and are trained via experience replay. Overall, the paper was well structured and clearly presented. The proposed algorithm was compared with existing ones in several experimental settings and showed advantages. 

Response: Thank you very much for your general comment. We have revised our manuscript accordingly. Changes to the paper are highlighted in the revised manuscript. Below are our point-by-point responses.

 

Specific Comments:

Comments 1.1: The analogy between continuous / discrete communication channels and continuous / discrete actions is unnecessary and seems off in lines 99-100.

Response: Thanks for your valuable comment. In the revised manuscript, we have removed this analogy and restated the notions of discrete and continuous communication channels. Please refer to lines 103-109, page 3 in the revised manuscript.

Comments 1.2: Long and awkward sentence in lines 102-106.

Response: Thank you for pointing this out. We have revised the sentence to address your concerns and hope that it is now more concise and clear. Please refer to lines 111-114 on page 3.

 

Comments 1.3: Delete “too” from line 109.

Response: Thanks, we have corrected the typo.

 

Comments 1.4: Add relevant citations after “reinforcement learning” in line 120.

Response: Thanks for the comment. We have added a reference (Line 129, Page 3) published this year by DeepMind, i.e., AlphaTensor, which treats the problem of finding efficient matrix multiplication algorithms as a reinforcement learning problem, with an architecture based on the transformer.

 

Comments 1.5: Line 131, “Neighboring”

Response: Thanks for your comment. We have thoroughly checked and corrected the grammatical errors and typos we found in our revised manuscript. 

 

Comments 1.6: Line 155, the definition of T in Algorithm 1 is better than the “time cursor for the game” here.

Response: Thanks for pointing that out. We have modified the definition of T from “T is the time cursor for the game” to “T is the time horizon for the game”.

Comments 1.7: In equations (1) and (2), observations are denoted as “omega”, but in figure 1, observations are denoted as “o”.

Response: Thanks for your comment. We have revised Figure 1 to make it consistent with Equations (1) and (2).

 

Comments 1.8: Please consolidate the use of letters j, mu, m, and C in Section 4.

Response: Thanks, we have carefully checked the use of letters in Section 4.

 

Comments 1.9: May consider using bold letters to indicate vectors and bold capital letters to indicate matrices in Section 4.

Response: Thanks for your suggestion. We have marked all matrix symbols in Section 4 with bold letters.

 

Comments 1.10: y means very different things in Equations (2) and (4) and in Section 4.1. Please reduce the overuse of this letter.

Response: Thanks for your comment. All the confusing y in Fig. 1 and Section 4 have been replaced with the variable Msg to distinguish it from the symbol y in Eqs. (2) and (4).

 

Comments 1.11: For the proposed framework, it doesn’t seem to make a difference whether the states are only partially observed or we treat each observation directly as a state itself. Please confirm and comment.

Response: In the two MARL environments used in our experiments, i.e., Combat and LBF, the visual range of the agents is limited. For example, in the Combat environment, an agent can only observe other agents within the 3x3 surrounding area centered on it. In LBF, each agent can also specify the size of its visual range if the sight parameter is set. In our experiments on "Foraging-19x19-19p-9f-v2", we set the observation range of agents to the 5x5 surrounding area. Therefore, it is not reasonable to treat the partial observation as the ground-truth state of the environment, since the observation of each agent is only a part of the state. We have added the description of the parameters for the agent visual range in these environments (Page 8, Table 1).
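To make the distinction concrete, here is a minimal sketch of how such a limited visual range turns the global state into a partial observation; the grid representation, padding value, and function name are assumptions made purely for illustration.

```python
import numpy as np


def local_observation(grid, pos, view=3, pad_value=0):
    """Crop a view x view window centered on the agent at `pos` from the global grid.

    Cells outside the map are padded, so the agent never sees the full state:
    view=3 mimics a 3x3 visual range (as in Combat), view=5 a 5x5 range (as in LBF).
    """
    r = view // 2
    padded = np.pad(grid, r, constant_values=pad_value)
    row, col = pos[0] + r, pos[1] + r
    return padded[row - r: row + r + 1, col - r: col + r + 1]


# Example: on a 19x19 map, an agent at (0, 0) observes only a padded 5x5 patch,
# which clearly cannot stand in for the ground-truth global state.
state = np.arange(19 * 19).reshape(19, 19)
obs = local_observation(state, (0, 0), view=5)
assert obs.shape == (5, 5)
```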

Reviewer 2 Report

1. In the introduction, related work, and background, it is not enough to state the current work; these sections should be expanded with justification and reconstructed. The motivation, the main work, and the improvements compared with previous related work should be emphasized in this section, and the authors should explain how the present work differs from previously published work. Also, the methodology given in this paper is poor and old for stating the contribution of the present work.

 

2. The motivation of the research is not clear, and the innovation of the paper is insufficient; if this is not the case, then these should each be clearly given.

 3. The caption of figure 1 is too long. Please revise and be specific.

4. This is a technical research article, so the authors should validate the work. Please provide a comparison table with others' latest work, mentioning their methods, parameters, and results, and compare them with your work for a clearer and quicker understanding. The authors compare with other methods in Table 1; however, this is not enough to validate the work.

5. (Minimal justification/description, more limitation acknowledgment, concerns about the approach.) A detailed discussion of the proposed method should be given.

6. The application is not clear with the current content and needs more justification.

7. In the Experiment section, it is unclear what the authors mean by the Results and Discussion section. In this section, Figure 2(b), Figure 4(b), Figure 5, and Figure 6 are not clear; please revise. For example, if the authors utilized any specific software framework, tools, or platforms, please mention the name.

 

Given the aforementioned issues and based on my expertise, I conclude that, although the present paper has some originality and contribution, it is not suitable for publication in its current format. I suggest that the authors revise the article before acceptance.

 

 

Author Response

Response to Reviewer 2:                                                                    

General comment: 

Given the aforementioned issues and based on my expertise, I conclude that, although the present paper has some originality and contribution, it is not suitable for publication in its current format. I suggest that the authors revise the article before acceptance.

Response: Thank you for your comments. We have revised our manuscript accordingly. Changes to the paper are highlighted in the revised manuscript. Below are our point-by-point responses.

 

Specific Comments:

Comments 2.1: In the introduction, related work, and background, it is not enough to state the current work; these sections should be expanded with justification and reconstructed. The motivation, the main work, and the improvements compared with previous related work should be emphasized in this section, and the authors should explain how the present work differs from previously published work. Also, the methodology given in this paper is poor and old for stating the contribution of the present work.

Response: Thanks for your comment. Following your suggestion, we have modified the statement of our current work. In the Introduction section, we have added the motivation for our study (Lines 57-63, Page 2) and revised the statement of the main work and contributions (Lines 149-150 and 156-162, Page 4).

 

In addition, we have refined the analysis of previous work in the Related Work section. Specifically, previous communication-based methods focus on designing effective communication protocols to achieve SOTA performance in ideal experimental environments. However, they overlook the unreliable channels and bandwidth constraints in real-world scenarios. Our work is among the first attempts at partially communicable MARL scenarios. Technically, our work is an extension of existing independent reinforcement learning algorithms and is based on a DQN baseline. Therefore, the reader only needs the background knowledge of POMDPs and Deep Q-Learning. 

 

Indeed, DQN is a relatively early algorithm; however, it is straightforward to replace the baseline DQN with other up-to-date SOTA algorithms, such as TD3 [R2] and SAC [R3]. In future work, we will combine these algorithms with ACUTE (Lines 414-415, Page 13). Besides, we have also tried many approaches to the input-output (IO) design of the attention module (Lines 231-238, Page 6). During the experiments, we found that only our proposed IO method can effectively extract data features and performs better.

 

Comments 2.2: The motivation of the research is not clear, and the innovation of the paper is insufficient; if this is not the case, then these should each be clearly given.

Response: Thank you for your comment. Our work is inspired by the UAV (unmanned aerial vehicle) swarms widely used in modern combat [R1]. UAV swarms are known for their large scale, low individual production costs, and resistance to communication interference. Inspired by this, we aim to investigate how large-scale UAV swarms can efficiently accomplish global awareness and policy-making with their limited communication capabilities under interference. Our proposed broadcast-receive asynchronous communication model

  1. makes it easy to measure the level of communication interference, and
  2. is easy to implement and can be applied to various multi-agent environments without complex modifications to the environment.

ACUTE, under this proposed communication mode, can effectively downscale and extract features from received messages. We have tried various input-output modes for the attention module, but none of them performs as well as the one proposed in the paper (Lines 231-238, Page 6). In response to your comments, we have revised Section 1 (Lines 57-60, Page 2; Lines 89-90, Page 1), adding and detailing our motivation and innovation.
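As an illustration of this communication model, here is a minimal sketch of a broadcast-receive exchange in which a single drop probability stands in for the level of interference; the function and parameter names are our own assumptions, not the paper's implementation.

```python
import random


def broadcast_receive(messages, drop_prob=0.5, rng=random):
    """Deliver each agent's broadcast to every other agent, dropping each link independently.

    messages: dict mapping agent id -> encoded message vector.
    drop_prob: probability that a given message is lost, i.e. a direct knob for the interference level.
    Returns a dict mapping agent id -> list of messages it actually received.
    """
    inbox = {agent: [] for agent in messages}
    for sender, msg in messages.items():
        for receiver in messages:
            if receiver != sender and rng.random() >= drop_prob:
                inbox[receiver].append(msg)
    return inbox
```

Each agent then feeds whatever ended up in its inbox (possibly nothing) into the attention module, which is why the scheme can be applied to different environments without modifying them.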

 

Comments 2.3: The caption of figure 1 is too long. Please revise and be specific.

Response: Thank you for pointing out this problem in the manuscript. We intended the reader to get a quick overview of the ACUTE architecture from Fig. 1 and its caption. We have rewritten the caption of Fig. 1 to make it concise and specific.

 

Comments 2.4: This is a technical research article, so the authors should validate the work. Please provide a comparison table with others' latest work, mentioning their methods, parameters, and results, and compare them with your work for a clearer and quicker understanding. The authors compare with other methods in Table 1; however, this is not enough to validate the work.

Response: Thanks for your comment. The MFQ and QMIX methods we compared in the experiments are both strong baselines in MARL and have achieved state-of-the-art performance in many MARL tasks. In this work, we mainly compare our work with MFQ since it is also a local communication-based MARL method and thus closely related to our method. However, we show in the experimental results that our method is more suitable for communication-restricted scenarios, which verifies the effectiveness of our design. In addition, as an early attempt at communication-restricted MARL, we focus on how our method performs compared with methods that are not explicitly designed for communication-restricted scenarios, rather than on outperforming methods with advanced learning strategies. Therefore, we implement our method on a DQN baseline. The experimental comparison with DQN can be regarded as an ablation experiment of our attentive communication, which is enough to verify our claim. In addition, our attention module is not specifically designed for DQN and can therefore easily be used as a plug-and-play module in other advanced MARL baseline methods to boost their performance in communication-restricted scenarios. We have added the corresponding justifications in the revised manuscript.
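To illustrate the plug-and-play claim, a brief sketch of how the aggregation step could wrap an arbitrary baseline network is given below; the wrapper class and its interface are hypothetical and only meant to show that the attention module and the baseline head are decoupled.

```python
import torch.nn as nn


class CommWrapper(nn.Module):
    """Hypothetical wrapper: attentive message aggregation in front of any baseline head."""

    def __init__(self, aggregator: nn.Module, baseline_head: nn.Module):
        super().__init__()
        self.aggregator = aggregator        # e.g. an attention module over received messages
        self.baseline_head = baseline_head  # a DQN Q-head here, but a TD3/SAC critic would also fit

    def forward(self, own_obs, received_msgs):
        fused = self.aggregator(own_obs, received_msgs)  # communication-aware feature
        return self.baseline_head(fused)                 # baseline network is left unchanged
```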

 

Comments 2.5: (Minimal justification/description, more limitation acknowledgment, concerns about the approach.) A detailed discussion of the proposed method should be given.

Response: Considering the Reviewer’s suggestion, we have supplemented a subsection describing the limitations of our method (Section 4.4, Lines 266-274, Page 7).

 

Comments 2.6: The application is not clear with the current content and needs more justification.

Response: Thanks for pointing out this inadequacy. Our work can be applied to policy-making for UAV swarms, as discussed in the response to Comment 2.2 and in Section 6 (Lines 411-413, Page 13).

 

Comments 2.7: In the Experiment section, it is unclear what the authors mean by the Results and Discussion section. In this section, Figure 2(b), Figure 4(b), Figure 5, and Figure 6 are not clear; please revise. For example, if the authors utilized any specific software framework, tools, or platforms, please mention the name.

Response: Thanks. We have revised Section 4.4 (Lines 284-287, Page 8) and Section 5 (Line 313, Page 8; Line 319, Page 9; Lines 362-363, Page 10) according to the Reviewer’s comment, adding detailed descriptions of the experimental conditions and revising the analysis of the experimental results.

 

[R1] Bridley, R.; Pastor, S. Military Drone Swarms and the Options to Combat Them. Small Wars 2022.

[R2] Fujimoto, S.; Hoof, H.; Meger, D. Addressing Function Approximation Error in Actor-Critic Methods. In Proceedings of the International Conference on Machine Learning, PMLR, 2018; pp. 1587–1596.

[R3] Haarnoja, T.; Zhou, A.; Hartikainen, K.; Tucker, G.; Ha, S.; Tan, J.; Kumar, V.; Zhu, H.; Gupta, A.; Abbeel, P.; et al. Soft Actor-Critic Algorithms and Applications. arXiv preprint arXiv:1812.05905, 2018.

Round 2

Reviewer 2 Report

The authors have revised the paper according to the comments.
