Article

Multi-Agent Chronological Planning with Model-Agnostic Meta Reinforcement Learning

College of Systems Engineering, National University of Defense Technology, Changsha 410073, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(16), 9174; https://doi.org/10.3390/app13169174
Submission received: 24 June 2023 / Revised: 4 August 2023 / Accepted: 9 August 2023 / Published: 11 August 2023
(This article belongs to the Special Issue Intelligent Unmanned System Technology and Application)

Abstract

In this study, we propose an innovative approach to a chronological planning problem in which multiple agents must complete tasks under precedence constraints. We model this problem as a stochastic game and solve it with multi-agent reinforcement learning algorithms. However, these algorithms must relearn from scratch when the chronological order of tasks changes, since each order yields a distinct stochastic game, consuming a substantial amount of time. To overcome this challenge, we present a novel framework that incorporates meta-learning into a multi-agent reinforcement learning algorithm. This approach enables the extraction of meta-parameters from past experiences, facilitating rapid adaptation to new tasks with altered chronological orders and circumventing the time-intensive nature of reinforcement learning. We then demonstrate the proposed framework by implementing a method named Reptile-MADDPG. The performance of the pre-trained model is evaluated using average rewards before and after fine-tuning. In two testing tasks, our method improves the average rewards from −44 to −37 through 10,000 steps of fine-tuning, significantly surpassing the two baseline methods, which only attained −51 and −44, respectively. The experimental results demonstrate the superior generalization capabilities of our method across various tasks, thus constituting a significant contribution towards the design of intelligent unmanned systems.

1. Introduction

In recent years, Multi-Agent Reinforcement Learning (MARL) has attracted a great deal of interest in the AI community. As a cutting-edge research direction, MARL is closely related to decision theory, game theory, optimization methods, agent-based modeling, and so forth. In real-world scenarios, MARL has a broad range of promising applications, such as robotics control [1,2,3], power grids [4,5,6], real-time strategy games [7,8], etc. MARL has achieved remarkable performance in these applications and exhibits potential economic benefits.
Multi-agent chronological planning problems, where multiple agents need to complete sequential tasks with precedence constraints, are important in both theoretical and real-world domains. However, these problems are yet to be well addressed. In this paper, we consider a chronological scenario where several agents have to cooperatively occupy some landmarks in sequence. This scenario is applicable in many multi-agent systems, such as real-time strategy games and pickup-and-delivery problems.
Many problems related to multi-agent systems can be described with Markov Decision Processes (MDPs) and addressed within the framework of MARL. However, when it comes to chronological planning, training optimal policies for multiple agents can be extremely time-consuming because there may be a large number of possible task completion orders. Meanwhile, learned policies can be unstable in performance and easily deteriorate when faced with small disturbances or other noise. The existing bottleneck in this domain is that learned policies may suffer from unsatisfactory performance in a new environment (a new task completion order) even when the drift of the underlying Markov decision process, e.g., the system dynamics or the reward mechanism, is quite slight. These challenges render the direct deployment of a previously learned policy in a new task order impractical. However, learning from scratch is not only time-consuming but also fails to leverage prior experience effectively.
To address the above-mentioned concerns about MARL, researchers have proposed a novel paradigm, meta-learning, to improve the generalizability of machine learning algorithms when adapting to new environments. Meta-learning is a universal paradigm that can be combined with supervised learning and reinforcement learning. Recent advances in meta-learning enable reinforcement learning to learn from a distribution of different yet related tasks and to solve new tasks within a few trials. In the realm of reinforcement learning, meta-learning can be used to learn good initialization parameters [9,10], improve exploration efficiency [11,12], and learn proper hyperparameters [13]. Meta-learning techniques enable us to leverage skills learned in the past by MARL and simultaneously improve robustness to disturbances of the MDP.
Although meta-learning has proved to be effective in adapting policies to new environments, applying meta-learning to multi-agent chronological planning remains underinvestigated. To close this knowledge gap, we first design a reward setting that can effectively describe scenarios with different chronological task orders. Then, we use a classical MARL algorithm, Multi-Agent Deep Deterministic Policy Gradients (MADDPG), to train the policies of several cooperative agents in the environment. To improve the generalization of the learned policies across different tasks, we encapsulate the MARL part into a meta-learning framework, Reptile, to produce more general policies that can adapt to a new chronological planning scenario with only a few interactions.
We examine MADDPG with and without the meta-learning framework. We pre-train these two methods on training tasks and then fine-tune the learned policies on two testing tasks to assess their generalization capability. We also consider a random policy as a baseline when fine-tuning in the testing stage. The empirical results show that our method, Reptile-MADDPG, adapts to the testing tasks fastest and achieves the highest rewards. This demonstrates that the meta-learning framework can help MARL to quickly adapt to different but related tasks.
In summary, this work aims to contribute to these aspects:
  • First, we propose a framework for addressing chronological planning problems by bridging the knowledge gap through the application of meta-learning to multi-agent reinforcement learning techniques. Based on the framework, we instantiate a method called Reptile-MADDPG to improve the generalizability of chronological planning across different tasks.
  • Second, we design a reward setting that can effectively describe scenarios of different chronological tasks.
  • Finally, we conduct extensive testing and comparison with existing methods. Our method adapts to the testing tasks faster and obtains higher rewards, proving the efficiency of the meta-learning framework in quickly adapting to different but related tasks.
The rest of this paper is organized as follows: we first introduce related work on MARL and meta-reinforcement learning in Section 2 and then propose our method, referred to as Reptile-MADDPG, in Section 3. In Section 4 and Section 5, we describe the design of our experiments and present the results to show the effectiveness of our method. Section 6 contains the concluding remarks about the whole work and directions for future research.

2. Related Work

2.1. Reinforcement Learning

The field of reinforcement learning (RL) holds significant importance in the realm of machine learning and has achieved noteworthy advancements across various domains. In recent years, RL has greatly benefited from the progress made in deep learning. Combined with deep learning, RL can solve tough decision-making tasks that were previously intractable. In a landmark paper in the field of reinforcement learning, DeepMind introduced DQN in 2015 [14]. DQN uses a deep neural network to represent Q-values more effectively and achieves remarkable performance in Atari games. Lin et al. introduced an innovative method termed Reinforcement Q-learning-based Deep Neural Network (RQDNN), which synergistically integrates the Deep Principal Component Analysis Network (DPCANet) and Q-learning for playing strategy video games [15]. Srinivasu et al. presented a robust reinforcement learning-based algorithm utilizing Probabilistic Roadmap and Inverse Kinematics for accurate path recognition and approximation in real-time surgical procedures, offering an optimal solution for performing precise surgeries on soft tissues [16]. To improve the applicability of DQN in real-world environments, Kim et al. strategically incorporated prior knowledge through a Bayesian-based loss function, notably improving the learning convergence performance [17].

2.2. Multi-Agent Reinforcement Learning

MARL extends RL to a Multi-Agent System (MAS). MARL is much more complex because the agents need to make decisions based on observations of and interactions with dynamic environments and their partners [18]. The most natural approach to finding policies for MARL is for each agent to learn its policy independently and treat the rest of the agents as part of the environment. This idea is implemented in the Independent Q-Learning (IQL) method [19]. In 2015, Tampuu et al. [20] proposed Independent Deep Q-Networks (IDQN), which extends IQL with DQN. These methods suffer from the non-stationarity of the environment: the policies of other agents keep changing during the training process, resulting in unstable training for the learning algorithm. Non-stationarity also prevents the straightforward use of experience replay in methods like DQN. Meanwhile, policy gradient RL methods usually exhibit very high variance when the coordination of multiple agents is required [21].
Many approaches in MARL adopt the framework of centralized training and decentralized execution (CTDE). In this framework, the policies of a group of agents are trained in a centralized way and granted access to other agents’ information and the global states during the centralized training process, while in the decentralized execution phase, each agent makes its own decision based on its local action-observation information. Among these works, DIAL [22], CommNet [23], ATOC [24], and SchedNet [25] aim to enable agents to learn how to communicate with each other in multi-agent systems. Other approaches use a fully observable critic to let agents learn to cooperate directly from their own local observations. MADDPG is the first general-purpose MARL algorithm for stabilizing training. It uses an actor-critic learning framework and can be applied to mixed competitive and cooperative environments. With fully observable critics, MADDPG eliminates the non-stationarity in MARL by explicitly conditioning on the actions of other agents. As the critics in MADDPG concatenate all the local observations, the method faces the curse of dimensionality as the number of agents increases. To alleviate this problem, Iqbal et al. [26] proposed Multi-Actor-Attention-Critic (MAAC), which uses an attention mechanism to select relevant information for each agent during training.
In cooperative settings, a group of agents has to coordinate to maximize a shared team reward. Therefore, the global Q function $Q^{\pi}(\mathbf{o}, \mathbf{a})$ is conditioned on the joint observations and actions of all the agents. However, in a multi-agent system, some agents may get lazy and not learn to cooperate as they are supposed to [27], which may lead to the breakdown of the whole system. To solve this issue, some works in MARL focus on the factorization of this global reward. Sunehag et al. [28] proposed Value-Decomposition Networks (VDN), which learn an optimal linear decomposition of the Q values: $Q^{\pi}(\mathbf{o}, \mathbf{a}) := \sum_{i=1}^{N} Q_i(o_i, a_i)$. The implicit value function learned by each agent depends only on local observations. As the structure of Q values considered by VDN is too simple, Rashid et al. [7] proposed QMIX, which improves upon VDN by adding the constraint that the joint-action value is monotonic in the per-agent values. Wang et al. [29] also proposed QPLEX, which adopts a duplex dueling network architecture to factorize the joint value function, making value function learning more efficient.
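As a minimal illustration of the value-decomposition idea, the following PyTorch-style sketch sums per-agent chosen-action values into a joint value in the VDN fashion; the class name and tensor layout are our own assumptions for illustration, not code from the cited works.

```python
import torch
import torch.nn as nn

class VDNMixer(nn.Module):
    """Joint action value as the sum of per-agent utilities: Q_tot = sum_i Q_i(o_i, a_i)."""
    def forward(self, agent_qs: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_agents) tensor of each agent's chosen-action value
        return agent_qs.sum(dim=1, keepdim=True)  # (batch, 1) joint value
```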
Even though the above MARL methods perform well in some scenarios, when the task varies, one has to retrain the policies from scratch, which takes plenty of time.

2.3. Meta Reinforcement Learning

Early works on the idea of meta-learning (also known as learning to learn) date back to the 1990s [30,31]. Recently, meta-learning has received much attention and has been applied in few-shot recognition [32], network routing [33], traffic control [34], and in biomedical [35] and other domains [36,37,38,39,40,41]. Meta-learning is regarded as the key to achieving human-level intelligence because it raises the learning level from data to tasks [42].
Here, we focus on meta-learning in the reinforcement learning community, meta reinforcement learning (meta-RL). The goal of meta-RL is to adapt to a new test task quickly using only a small amount of experience in the test setting [9] without learning from scratch. In meta-RL, the train and test tasks are usually different but drawn from the same family of problems [43]. In this paper, we divide the recent methods of meta-reinforcement learning into recurrent neural network-based and optimization-based methods.
RNN-based meta-RL is very similar to standard RL algorithms, except that the current state $s_t$, the last reward $r_{t-1}$, and the last action $a_{t-1}$ are all fed into the policy network. Duan et al. [44] proposed a meta-learning method that trains a gated recurrent unit (GRU) to remember the structure of the current MDP, so that it can quickly adapt to an unseen but familiar MDP task by fine-tuning the parameters of the GRU. During the training procedure, the hidden states of the GRU are not cleared between episodes. Similar to [44], Wang et al. [45] used an LSTM as the memory module. Frans et al. [46] presented a hierarchical meta reinforcement learning method that learns hierarchically structured policies and uses shared primitives to improve sample efficiency on new tasks.
Optimization-based meta-RL aims to update the model parameters to achieve good generalization across new tasks. Finn et al. [9] proposed a general method called Model-Agnostic Meta-Learning (MAML). MAML optimizes initialization parameters on a set of training tasks so that the parameters can quickly adapt to similar unseen tasks. MAML is model-agnostic, so it can be combined with any learning model trained with gradient descent. To avoid the expensive computation of second derivatives in MAML, Nichol et al. [10] proposed a first-order meta-learning method, Reptile, which performs SGD on each task and moves the initialization towards the average of the resulting task-specific weights. Evolved Policy Gradients builds a differentiable loss function that is parameterized via temporal convolutions over the agent’s historical information [47]. When tackling new tasks, one can optimize the policy by minimizing this loss function. Meta Q-Learning (MQL) [48] is an off-policy meta-learning method that draws upon ideas from propensity estimation to amplify the amount of data available for adaptation.
Recently, some works tried to apply meta learning to multi-agent systems. Al-Shedivat et al. [49] studied the continuous adaptation problem in a multi-agent environment, RoboSumo, and enabled a learning agent to adapt to different enemies in a short timeframe. However, they did not actually research multi-agent learning, but focused on a single agent in a two-agent competitive environment. Jia et al. [50] combined meta learning with MADDPG and successfully enabled MADDPG to adapt to new tasks, with different friction coefficients or numbers of agents. Li et al. [18] trained a meta-actor and a meta-critic to distill the meta-knowledge of a team that can help a new agent to better integrate into the group. Both [18,50] tested their meta learning methods in multi-agent systems. Existing works primarily focus on variations in kinetic parameters or the number of agents between training and testing tasks, neglecting the interrelationships among agents. In contrast, our study addresses the complexity of chronological constraints between agents. This leads to a more challenging scenario where the reward function in the stochastic game may vary across different tasks.

3. Methods

In this section, we first formulate the chronological planning problem in a mathematical form. Then, we elaborate on how to address the problem together with a reward function we designed. Furthermore, we show a meta-learning framework combined with multi-agent reinforcement learning to acquire learned policies for chronological planning with varying orders.

3.1. Chronological Planning Problem Formulation

We consider a multi-agent chronological planning problem in a navigation task with precedence constraints, where agents autonomously plan their routes toward their targets in a specific chronological order. There are N agents and N landmarks in our scenario. The task is to navigate the N agents to occupy all N landmarks in a given order while avoiding collisions with each other. At the start of each episode, the positions of both agents and landmarks are randomly reset. Moreover, we exclude generated positions in which any two landmarks are too close, since this would make collisions unavoidable as the agents approach their destinations.
A simple example is shown in Figure 1. In addition, the precedence constraint requires that agents must occupy the N landmarks in a certain order. An agent can access the position and velocity of itself, as well as the positions of other agents and all the landmarks. Each agent can also observe whether each landmark is occupied.
As the agents in our scenario must consider interactions with other agents when planning their routes (i.e., making decisions), we use a stochastic game (SG), also known as a Markov game, to describe the multi-agent decision-making process in our problem. A stochastic game can be regarded as a multi-player extension of an MDP in which agents move simultaneously [51].
The stochastic game of our problem is defined by the number of agents $N$, the state set $\mathcal{S}$, the action sets $\{\mathcal{A}_1, \mathcal{A}_2, \ldots, \mathcal{A}_N\}$, and the observation sets $\{\mathcal{O}_1, \ldots, \mathcal{O}_N\}$. For each agent $i$, $o_i \in \mathcal{O}_i$ is an observation of the global state $s \in \mathcal{S}$. In our scenario, $o_i$ contains the absolute position and velocity of the $i$-th agent itself and the relative locations of all the other agents and landmarks. For the sake of universality and reasonability, we define the global state $s \in \mathcal{S}$ as the combination of observations from the $N$ agents, $(o_1, \ldots, o_N)$, where $o_i \in \mathcal{O}_i$. To describe the environmental dynamics, the state transition $P(s' \mid s, a)$ gives the distribution of the next state $s'$ when taking a joint action $a = (a_1, \ldots, a_N)$ under the current state $s$, and the per-agent reward function $R^{i}(s, a, s')$ returns a scalar value for agent $i$ for a transition from $(s, a)$ to $s'$. Our reward function will be explained in detail in Section 3.2. The policy of agent $i$ is represented by $\pi^{i} = P(a_i \mid s)$, which gives the action probabilities with regard to a given state, and we denote the joint policy of all the $N$ agents by $\pi = (\pi^{1}, \ldots, \pi^{N})$.
In a stochastic game, the goal of each agent is to find a behavioral policy $\pi^{i}$ that takes sequential actions at every step $t$ such that the discounted cumulative reward in Equation (1) is maximized, where $\gamma$ is the decay factor. Here, we use the superscripts $(\cdot^{i}, \cdot^{-i})$ to distinguish between agent $i$ and all the other $N-1$ teammates. In $(\pi^{i}, \pi^{-i})$, the former denotes the policy of agent $i$ while the latter denotes the joint policy of all the agents except agent $i$.
$$V^{\pi^{i},\pi^{-i}}(s) = \mathbb{E}_{\, s_{t+1}\sim P(\cdot \mid s_t, a_t),\ a_t^{-i}\sim \pi^{-i}(\cdot \mid s_t)}\!\left[\, \sum_{t\ge 0}\gamma^{t} R^{i}\big(s_t, a_t, s_{t+1}\big) \,\middle|\, a_t^{i}\sim \pi^{i}(\cdot \mid s_t),\ s_0 = s \right]. \qquad (1)$$
As shown in Equation (1), the optimal strategy of each agent is determined not only by its own policy $\pi^{i}$ but also by the joint policy of the other agents, $\pi^{-i}$. This brings non-stationarity to the training process, which is the fundamental difference in the solution concept between single-agent RL and multi-agent RL. Furthermore, we can clearly see in Equation (1) that the key to optimizing the strategies is to design a reward function R that can correctly guide agents to complete the task in chronological order. In the next section, we explain our design of the reward function in detail.

3.2. Reward Function Design

In reinforcement learning, a reward function gives an assessment of the actions of agents and can guide the agents to maximize their objectives. The design of the reward function largely affects the convergence of RL algorithms [52,53]. Based on our chronological scenario, we design a reward function that well reflects tasks with chronological orders.
The reward function consists of three parts. First, the agents need to reach their landmarks as soon as possible to achieve efficient task completion. Therefore, we give a negative reward at each step according to the distance $d_i$ between agent $i$ and its corresponding landmark:
$$r^{i}_{\mathrm{dis}} = -\,d_i. \qquad (2)$$
This part of the reward motivates the agents to approach their landmarks as soon as possible to reduce the penalty (negative reward).
Second, the agents should not collide with each other. Therefore, we add a penalty term whenever the distance between any two agents is smaller than the sum of their radii:
$$r^{i}_{\mathrm{collide}} = \begin{cases} 0, & d_{ij} \ge Rad_i + Rad_j \ \ \text{for all } j \ne i, \\ -1, & d_{ij} < Rad_i + Rad_j \ \ \text{for some } j \ne i, \end{cases} \qquad (3)$$
where $d_{ij}$ denotes the distance between agent $i$ and agent $j$, and $Rad_i$ and $Rad_j$ represent their radii.
The last part is a negative reward that penalizes agents who break the precedence order. In our setting, the N agents are required to occupy their landmarks one by one. We represent a precedence constraint as $(v_1, v_2, \ldots, v_N)$. This means that, for any $i > 1$, agent $v_i$ must occupy its landmark after agent $v_{i-1}$, which is called the precedent agent of agent $v_i$. In our setting, we describe this chronological constraint using the distances between agents and their target landmarks. We give a penalty, i.e., a negative reward, to agent $i$ (whose precedent agent is agent $j$) if the distance from agent $i$ to its landmark is less than that of agent $j$. The penalty is described in Equation (4):
$$r^{i}_{\mathrm{order}} = -\max\big(0,\; d_j - d_i\big). \qquad (4)$$
Considering all three penalties together, the reward function in our scenario is:
$$R^{i} = r^{i}_{\mathrm{dis}} + r^{i}_{\mathrm{collide}} + \beta\, r^{i}_{\mathrm{order}}, \qquad (5)$$
where $\beta$ is the coefficient of the precedence penalty. For simplicity, the coefficients of $r^{i}_{\mathrm{dis}}$ and $r^{i}_{\mathrm{collide}}$ are set to 1 by default, as in previous research. Specifically, the third term, $r^{i}_{\mathrm{order}}$, differs across tasks because each task imposes a different precedence order.
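To make Equations (2)–(5) concrete, the following sketch computes the per-agent reward in NumPy. It is our own illustrative implementation: the function name, argument layout, and sign conventions are assumptions rather than the authors' code.

```python
import numpy as np

def chronological_reward(i, pos, landmarks, radii, order, beta=1.0):
    """Per-agent reward for agent i: distance penalty + collision penalty + precedence penalty.

    pos:       (N, 2) agent positions;   landmarks: (N, 2) target positions
    radii:     (N,)   agent radii;       order: sequence (v1, ..., vN) of agent indices
    """
    d = np.linalg.norm(pos - landmarks, axis=1)      # distance of every agent to its landmark

    r_dis = -d[i]                                    # Eq. (2): approach the landmark quickly

    r_collide = 0.0                                  # Eq. (3): penalize overlapping pairs
    for j in range(len(pos)):
        if j != i and np.linalg.norm(pos[i] - pos[j]) < radii[i] + radii[j]:
            r_collide -= 1.0

    r_order = 0.0                                    # Eq. (4): penalize overtaking the precedent agent
    rank = list(order).index(i)
    if rank > 0:
        j = order[rank - 1]                          # precedent agent of agent i
        r_order = -max(0.0, d[j] - d[i])

    return r_dis + r_collide + beta * r_order        # Eq. (5)
```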

3.3. Framework

Our proposed solution to the aforementioned stochastic game involves the application of a multi-agent reinforcement learning algorithm to train cooperative policies for the N agents. To address the generalizability problem, where tasks may present in various chronological orders, we incorporate a meta-learning method. This leads to a two-part framework comprised of the MARL and meta components, as illustrated in Figure 2.
This dual-stage framework operates in the following manner. The MARL component is responsible for handling agent interactions and coordination to solve a specific stochastic game (depicted as the inner loop in Figure 2). The meta component, on the other hand, concentrates on generalization across different tasks (represented as the outer loop in Figure 2). This element of the framework distills meta-knowledge from the training outcomes of the inner loop. This meta-knowledge is instrumental in swiftly adapting to new tasks.
We use MADDPG [21] and Reptile [10] to instantiate a method called Reptile-MADDPG based on the proposed framework. In the rest of this section, we will explain our method in detail.

3.3.1. Multi-Agent Deep Deterministic Policy Gradients

MADDPG is an actor-critic method and adopts the paradigm of centralized training and distributed execution. It extends the Deep Deterministic Policy Gradient (DDPG) algorithm to the multi-agent context. The extension facilitates decentralized execution where each agent takes actions based only on its local observations while still permitting centralized training with access to the observations and actions of all agents.
In the MADDPG framework shown in Figure 3, each agent employs two primary components: a critic $V_i$ and an actor $\pi_i$. The actor takes the current observation $o$ as input and outputs an action $a$, dictating the policy the agent should follow. The critic, meanwhile, predicts the Q-value of a given state–action pair, estimating the expected return for taking an action in a particular state while following the actor’s policy.
Unlike traditional DDPG, where the critic only evaluates the Q-value of a state–action pair from a single agent’s perspective, in MADDPG the critic of each agent takes the actions and observations of all other agents into account. With this global information, MADDPG eliminates the non-stationarity in MARL by explicitly conditioning on the actions of other agents: once the actions of the other agents are known, the environment is stationary even as their policies change, since $P(s' \mid s, a_1, \ldots, a_N, \pi_1, \ldots, \pi_N) = P(s' \mid s, a_1, \ldots, a_N) = P(s' \mid s, a_1, \ldots, a_N, \pi'_1, \ldots, \pi'_N)$ for any $\pi_i \ne \pi'_i$. This design enables agents to learn policies that are aware of the actions of other agents and promotes cooperative behavior.
Overall, MADDPG represents an efficient and effective method for training multi-agent systems, leveraging the power of deep learning and reinforcement learning within a framework specifically designed to handle the complexities of multi-agent environments.
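To make the centralized-critic idea concrete, here is a minimal PyTorch sketch of one agent's actor and critic consistent with the description above; the class names, layer sizes, and input layout are our own assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Decentralized policy: maps the agent's local observation to a continuous action."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),   # forces bounded in [-1, 1]
        )
    def forward(self, obs):
        return self.net(obs)

class CentralCritic(nn.Module):
    """Centralized critic: scores the joint observations and actions of all N agents."""
    def __init__(self, joint_obs_dim, joint_act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(joint_obs_dim + joint_act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )
    def forward(self, joint_obs, joint_act):
        return self.net(torch.cat([joint_obs, joint_act], dim=-1))
```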

3.3.2. Meta Learning for MARL

To quickly adapt to unseen but related tasks, we incorporate MADDPG into a meta-learning method. Researchers have proposed many meta-learning methods, and some of them aim to find a proper initialization of the neural network’s parameters that can quickly be fine-tuned for new tasks. Their objective is to maximize the expected reward across a task set:
$$\max_{\theta}\; \mathbb{E}_{\mathcal{T}_i \sim p(\mathcal{T})}\big[R(\phi)\big], \quad \text{where } \phi = f_{\theta}(\mathcal{T}_i), \qquad (6)$$
where $p(\mathcal{T})$ is the distribution over all possible tasks and $\mathcal{T}_i$ is a specific task sampled from $p(\mathcal{T})$; $\theta$ denotes the meta-parameters to be trained, $\phi$ denotes the parameters adapted from $\theta$ for the specific task $\mathcal{T}_i$, and $f_{\theta}$ is the base learner parameterized by $\theta$.
In our method, MADDPG is the base learner that learns parameters $\phi$, starting from $\theta$, to reduce the loss on a specific task $\mathcal{T}_i$.
Here, we take Reptile [10] as the meta-learning part in our method. Reptile is a simple but effective meta-learning method, involving only first-order gradient information, bringing a faster training speed with negligible accuracy loss.
Similar to other meta-learning methods, Reptile consists of two stages, as illustrated in Figure 4. During the inner update stage, Reptile uses the base learner (a policy gradient reinforcement learning algorithm) to train on a certain task, updating the network parameters according to the gradient $g_i$ at each step. After several steps, the network parameters $\phi_i$ specific to that task are obtained. Subsequently, the meta update stage takes place. In this phase, the meta-parameters $\theta$ are updated based on the results of the training performed on the individual task, as Equation (7) shows. Through the gradient descent of the outer update, Reptile learns model parameters that are easily adaptable to various tasks. This two-step iterative process allows Reptile to generalize across a variety of tasks and adapt swiftly to new ones. Reptile is suitable for any gradient-based learning algorithm and does not introduce any extra parameters to be learned.
$$\theta \leftarrow \theta + \epsilon\,(\phi_i - \theta) \qquad (7)$$
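Equation (7) translates into only a few lines of code. The sketch below applies the Reptile outer update to a PyTorch model; the function name and the existence of a separately adapted copy of the model are assumptions made for illustration.

```python
import torch

@torch.no_grad()
def reptile_meta_update(meta_model, adapted_model, epsilon=0.1):
    """theta <- theta + epsilon * (phi_i - theta), applied parameter-wise (Equation (7))."""
    for theta, phi in zip(meta_model.parameters(), adapted_model.parameters()):
        theta.add_(epsilon * (phi - theta))
```

In practice, one would copy the meta-model (e.g., with copy.deepcopy), run several inner-loop updates on a sampled task, and then call this update to pull the meta-parameters toward the adapted weights.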

3.4. Procedure

Our method consists of two stages, meta-training and meta-testing. In the meta-training stage, we optimize the model parameters of each agent to minimize their losses on a set of training tasks. In the meta-testing stage, we use the learned meta-parameters to adapt to unseen but related tasks. We collect experiences from a new task and use them to fine-tune the meta parameters to reduce the loss on a specific task.
As we have shown in the framework of our proposed method (Figure 2), the overall method is a two-level optimization, where the inner loop corresponds to the base learner that only considers a single task, while the outer loop trains the meta parameters θ that determine the base learner in the inner loop. In Figure 2, different tasks are implicitly expressed in their corresponding environments. The pseudo-code of meta training is shown in Algorithm 1.
Algorithm 1. The pseudo-code of the meta-training procedure of Reptile-MADDPG (presented as a figure in the original article).
For every cycle of the outer loop, one or more tasks are sampled from a set of tasks that share some common structure. The current meta-parameters $\theta$ and the sampled task(s) are given to the inner loop. In the inner loop, agents initialize their network parameters $\phi$ with $\theta$, interact with the sampled training tasks, and optimize their parameters for the maximal average reward. In detail, the $N$ agents collect experiences from each training task and store them in the corresponding replay buffers. For every training step, we randomly sample a batch of transitions $\{(s_k, a_k, r_k, s'_k)\}_{k=1}^{B}$ from the replay buffer, where $B$ is the batch size. Then, each agent uses the batch to update its actor and critic based on Equations (8) and (9).
$$\nabla_{\phi} J(\mu_i) = \frac{1}{B} \sum_{k} \Big[ \nabla_{\phi}\, \mu_i(o_k^{i})\; \nabla_{a^{i}} Q_i^{\mu}\big(x_k, a_k^{1}, \ldots, a_k^{N}\big)\big|_{a_k^{i} = \mu_i(o_k^{i})} \Big] \qquad (8)$$
$$L(\phi) = \frac{1}{B} \sum_{k} \Big[ \big( Q_i^{\mu}\big(x_k, a_k^{1}, \ldots, a_k^{N}\big) - y_k \big)^2 \Big], \quad y_k = r_k^{i} + \gamma\, Q_i^{\mu'}\big(x'_k, a'^{1}_k, \ldots, a'^{N}_k\big)\big|_{a'^{j}_k = \mu'_j(o_k^{j})}. \qquad (9)$$
Every time the policy networks and Q networks are updated, the corresponding target networks (with parameters $\phi'$) are soft-updated with hyperparameter $\tau$, as in Equation (10):
$$\phi' \leftarrow (1 - \tau)\,\phi' + \tau\,\phi. \qquad (10)$$
When the inner loop finishes training, the outer loop aggregates the results on the batch of tasks to update the meta-parameters. To be specific, the outer loop collects the learned policy parameters $\{\phi_i\}_{i=1}^{|\mathcal{T}|}$ and updates the meta-parameters $\theta$ as in Equation (11):
$$\theta \leftarrow \theta + \epsilon \sum_{i=1}^{|\mathcal{T}|} \big( \phi_i - \theta \big). \qquad (11)$$
After training, we obtain meta-parameters $\theta$ that can quickly adapt to a specific task in the meta-testing stage. The process of meta-testing is the same as the inner loop, i.e., a typical MARL training process. Initialized with $\theta$, meta-testing can potentially learn good parameters quickly.
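Putting Equations (8)–(11) together, the meta-training procedure of Algorithm 1 can be paraphrased as the following Python sketch. The helper interfaces (`sample_tasks`, `make_learner`, and the learner methods) are placeholders we introduce for illustration; this is a structural sketch of the loop under our own naming assumptions, not the authors' released code.

```python
import copy

def meta_train(meta_params, sample_tasks, make_learner,
               outer_epochs, tasks_per_epoch, inner_episodes, epsilon, tau=0.01):
    """Reptile-MADDPG meta-training sketch: MADDPG inner loop per task, Reptile outer update.

    meta_params:  list of parameter arrays for the agents' networks
    sample_tasks: callable returning `tasks_per_epoch` chronological orders
    make_learner: callable building a MADDPG learner from initial parameters
    """
    for _ in range(outer_epochs):
        adapted = []
        for task in sample_tasks(tasks_per_epoch):             # sample a batch of precedence orders
            learner = make_learner(copy.deepcopy(meta_params))  # inner loop starts from theta
            for _ in range(inner_episodes):
                learner.collect_episode(task)                   # fill per-agent replay buffers
                learner.update_actors_and_critics()             # Eqs. (8) and (9)
                learner.soft_update_targets(tau)                # Eq. (10)
            adapted.append(learner.params)
        # Outer (Reptile) update toward the task-adapted parameters, Eq. (11)
        meta_params = [theta + epsilon * sum(phi[i] - theta for phi in adapted)
                       for i, theta in enumerate(meta_params)]
    return meta_params
```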

4. Experiments

In this section, we introduce the experimental settings in which we evaluate the effectiveness of our method for solving the chronological planning problem, as well as the details of the state and action features, network architectures, and training and testing regimes. As explained in the Introduction, the experiments mainly focus on whether the trained policies generalize well and adapt quickly across various tasks.

4.1. Experimental Settings

To evaluate our method, we use a popular test-bed in MARL community, Multi-Agent Particle Environment (MPE) (https://github.com/openai/multiagent-particle-envs (accessed on 20 April 2022)), to instantiate our chronological planning problem. MPE is a simulation environment for multi-agent systems and provides several built-in scenarios that represent some typical cases of collaboration and competition. MPE is widely used in the validation of MARL algorithms due to its usability and extensibility.
As MPE is not intentionally designed for meta-learning, we design our own scenario that supports multiple tasks. In our scenario, there are N cooperative agents in a continuous environment of size 2 × 2, and the radius of each agent is 0.15. The agents choose continuous actions and move omnidirectionally by setting forces in two orthogonal directions within the range [−1, 1], aiming to cover N landmarks without collisions. Each agent can obtain the (x, y) locations of all the agents and landmarks and makes decisions by itself without any communication with others. The positions of agents and landmarks are randomly reset at the beginning of each episode. To evaluate the ability to adapt to unseen tasks, we obtain six tasks by changing the chronological orders of the three agents in our scenario. We randomly select four of them as training tasks and the other two as testing tasks in the experiments.
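Because each task is simply a different chronological order of the three agents, the task set can be generated by enumerating permutations. The snippet below reproduces that setup under our own assumptions about how the random split is made; the function name and seed are illustrative.

```python
import itertools
import random

def make_task_split(n_agents=3, n_test=2, seed=0):
    """Enumerate all chronological orders of the agents and split them into train/test tasks."""
    tasks = list(itertools.permutations(range(n_agents)))   # 3! = 6 precedence orders
    random.Random(seed).shuffle(tasks)
    return tasks[n_test:], tasks[:n_test]                   # 4 training tasks, 2 testing tasks

train_tasks, test_tasks = make_task_split()
```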

4.2. Experimental Conditions

The proposed Reptile-MADDPG approach is referred to as META in our experiments. To evaluate the generalizability and the ability of quick adaptation, we select two baseline conditions.
As MADDPG is not specifically designed for multiple tasks, we introduce Domain Randomization [54,55] and provide MADDPG with diverse training data from different tasks to improve its generalization (referred to as DR).
Furthermore, during the testing stage, we also consider the policy consisting of random parameters without pre-training (referred to as RANDOM).

4.3. Implementation Details

We train an actor and a critic for each agent. For all experiments, we use three-layer MLPs with ReLU activations for both the actor and critic networks, with 64 neurons in each hidden layer.
We pre-train our method and original MADDPG without meta-learning on the four training tasks and then evaluate them on the two testing tasks with fine-tuning. At the pre-training stage, we run the two methods for 10,000 epochs. In each epoch, we randomly sample three training tasks and train the models for an episode (25 steps) in each task. For every 25 epochs, we validate the current policies on the sampled tasks without fine-tuning, and record the average rewards.
In the following testing stage, we fine-tune the two pre-trained policies on each testing task for 400 episodes and record their rewards during adaptation. Every five episodes during fine-tuning, we evaluate the three policies and record the average rewards on the current testing task. We repeat the fine-tuning process for five runs and report the average results with 95% confidence intervals (95% CI). All the experiments are run on an AliCloud server with an Intel(R) Xeon(R) Platinum 8163 CPU and 16 GB of memory.
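For reference, the reported mean and 95% confidence interval over the five runs can be computed as in the following small sketch (our own code, using a t-interval over run means; the exact CI computation used by the authors is not specified in the article).

```python
import numpy as np
from scipy import stats

def mean_ci95(run_rewards):
    """Mean and 95% confidence half-width across independent runs (t-distribution)."""
    runs = np.asarray(run_rewards, dtype=float)
    half_width = stats.t.ppf(0.975, df=len(runs) - 1) * stats.sem(runs)
    return runs.mean(), half_width

# Example: mean_ci95([-44.2, -45.0, -43.8, -44.6, -44.4])
```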

4.4. Parameter Study

We perform a parameter study on β in the reward function of our setting to show the compatibility of our method. We first remove the chronological constraints by simply setting the coefficient of chronological penalty β to zero. Therefore, the six tasks become indistinguishable. Then, we run the pre-training process of META and DR for 5000 epochs and fine-tune the learned policies for 10,000 steps, and we keep other details unchanged from the preceding experiments. We repeat the experiments for five runs with different random seeds and compare the achieved rewards of META and DR on the tasks without chronological constraints.
We also study the impact of different β . We separately set β to each of [ 1.0 , 1.5 , 2.0 , 3.0 ] , pre-train META and DR for 10,000 epochs, and then fine-tune the learned meta-policies for 10,000 steps and we keep the other details unchanged from the preceding experiments.

5. Results and Discussion

We present our results in terms of the meta-training stage, the meta-testing stage, and the parameter study.

5.1. Results of Meta Training

We show the learning curves during the pre-training process in Figure 5. We compare our method, meta-learning framework (Reptile-MADDPG, denoted as META), with the baseline condition MADDPG with Domain Randomization (denoted as DR). The learning curves are the average results of five runs with different random seeds; for each seed, it takes about 50 min for both META and DR in the pre-training process.
The figure indicates that the learning curves of both META and DR converge after around 2000 training epochs. META and DR perform similarly during pre-training, although META improves slightly faster and reaches slightly higher rewards; that is, in the pre-training stage, META achieves better rewards within a comparable convergence time compared to DR.

5.2. Results of Meta Testing

The learning curves of fine-tuning on testing tasks are demonstrated in Figure 6, and the statistical results are also shown in Table 1 and Table 2, including the average rewards and standard deviations averaged over five runs. For each run, it takes about 90 min for the meta-testing stage.
The learning curves in Figure 6 show that, at the beginning of fine-tuning, the rewards achieved by both META and DR drop a little compared with those on the training tasks because the chronological constraint has changed. In the following episodes, the rewards of the two pre-trained policies decrease further because they have to adjust their policies to escape local minima and find a set of parameters more suitable for the current testing task. After about 50 episodes, the performance of META starts to increase quickly and achieves high rewards on the testing tasks. On the contrary, the performance of DR grows slowly and cannot even reach the pre-fine-tuning level within 400 episodes.
The quantitative results, reported separately for the two testing tasks, further demonstrate the advantage of our method. We can see from Table 1 and Table 2 that META always performs the best from 0 to 10,000 steps in comparison to the DR and RANDOM baselines. META acquires about 15% higher rewards at 10,000 steps than at 0 steps, while DR cannot adapt to the testing tasks within the given fine-tuning steps and only achieves rewards of about −50 at 10,000 steps. Although the rewards of RANDOM grow most rapidly, META reaches −45 before 6000 steps, whereas RANDOM needs 10,000 steps to do so, by which point META already achieves rewards above −40. This means that META can achieve acceptable results in fewer fine-tuning steps during fast adaptation.
These results prove that the meta-learning framework can effectively help MARL to adapt to unseen tasks quickly.

5.3. Results of Parameter Study

Our parameter study first tests META and DR on a task without chronological constraints to show the compatibility of our method. We show the learning curves of pre-training and fine-tuning separately in Figure 7 and Figure 8. Both are averaged over five runs with different random seeds. We can see that, without the precedence penalty (β = 0), META and DR perform very similarly. At the pre-training stage, both their rewards converge to about −10 after 1000 epochs, while at the fine-tuning stage, the curves of META and DR continue to fluctuate and remain very close to each other. This indicates that our method can also handle a single task and that the meta-learning framework does not harm the performance of the MARL algorithm in the inner loop.
In addition, the impact of varying the parameter β is illustrated in Figure 9. We expected that, with a larger β, the learned policies would obey the chronological constraints more strictly. However, training became very unstable as β increased. As the fourth sub-figure in Figure 9 shows, the training curve of META drops at 8000 epochs. We will investigate the reason for this in future work.

6. Conclusions

In this study, we addressed a complex multi-agent chronological planning problem by proposing an innovative method that combines multi-agent reinforcement learning with a meta-learning framework. Recognizing the critical role that a properly defined reward function plays, we designed one that encapsulates the chronological constraints inherent in the problem. Additionally, leveraging meta-learning allows our model to efficiently generate adaptive policies for agents, enabling them to swiftly adjust to new tasks with altered chronological sequences.
One of the primary contributions of this work lies in its successful enhancement of the generalization abilities of the learned policies. As a metric, we compared the average rewards before and after fine-tuning, as other works have done. Before fine-tuning, our method achieves slightly higher rewards than domain randomization (−44.21/−43.78 versus −45.01/−46.04 on the two testing tasks); after fine-tuning, our method improves to −37.32/−38.18, which is much higher than the −52.22/−50.28 of DR. These findings validate the efficacy and practical utility of our model, making a strong case for its necessity and utility in the field of multi-agent chronological planning.
Although our method has improved the adaptability of learned policies, it might face limitations in its ability to generalize to all types of chronological planning tasks. The performance might depend on the specific nature of the tasks and the underlying Markov decision process. In addition, our method requires significant computational resources to extract the meta-parameters of past experiences, which could limit its practical application in large-scale or resource-constrained settings. Therefore, future work will focus on the application of our proposed method to more challenging scenarios. Specifically, we aim to scrutinize its adaptability in tasks featuring higher degrees of disturbance, pushing the boundaries of its potential utility. We will also consider optimizing the computational efficiency of the method. The work we present here paves the way for more sophisticated implementations of our methodology, promising advancements in the sphere of multi-agent systems.

Author Contributions

Conceptualization, C.H. and K.X.; methodology, C.H. and Z.Z.; validation, Q.Y. and K.X.; formal analysis, Z.Z.; investigation, K.X.; data curation, L.Q.; writing—original draft preparation, C.H.; writing—review and editing, C.H. and K.X.; supervision, K.X.; funding acquisition, L.Q. and Q.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Natural Science Foundation of China [grant number 62103420].

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data supporting the reported results can be found at https://github.com/openai/multiagent-particle-envs (accessed on 20 April 2022).

Acknowledgments

We thank Qi Wang and Sihang Qiu at the National University of Defense Technology for attending discussions and giving us important advice.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Wei, Y.; Zheng, R. Multi-Robot Path Planning for Mobile Sensing through Deep Reinforcement Learning. In Proceedings of the IEEE INFOCOM—IEEE Conference on Computer Communications, Vancouver, BC, Canada, 10–13 May 2021; pp. 1–10. [Google Scholar] [CrossRef]
  2. Jestel, C.; Surmann, H.; Stenzel, J.; Urbann, O.; Brehler, M. Obtaining Robust Control and Navigation Policies for Multi-robot Navigation via Deep Reinforcement Learning. In Proceedings of the 7th International Conference on Automation, Robotics and Applications (ICARA), Prague, Czech Republic, 4–6 February 2021; pp. 48–54. [Google Scholar] [CrossRef]
  3. Liu, Z.; Liu, Q.; Tang, L.; Jin, K.; Wang, H.; Liu, M.; Wang, H. Visuomotor Reinforcement Learning for Multirobot Cooperative Navigation. IEEE Trans. Autom. Sci. Eng. 2021, 19, 3234–3245. [Google Scholar] [CrossRef]
  4. Wang, S.; Duan, J.; Shi, D.; Xu, C.; Li, H.; Diao, R.; Wang, Z. A Data-Driven Multi-Agent Autonomous Voltage Control Framework Using Deep Reinforcement Learning. IEEE Trans. Power Syst. 2020, 35, 4644–4654. [Google Scholar] [CrossRef]
  5. Yan, Z.; Xu, Y. A Multi-Agent Deep Reinforcement Learning Method for Cooperative Load Frequency Control of a Multi-Area Power System. IEEE Trans. Power Syst. 2020, 35, 4599–4608. [Google Scholar] [CrossRef]
  6. Wang, J.; Xu, W.; Gu, Y.; Song, W.; Green, T.C. Multi-Agent Reinforcement Learning for Active Voltage Control on Power Distribution Networks. In Proceedings of the Advances in Neural Information Processing Systems; Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2021; Volume 34, pp. 3271–3284. [Google Scholar]
  7. Rashid, T.; Samvelyan, M.; de Witt, C.S.; Farquhar, G.; Foerster, J.; Whiteson, S. QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. arXiv 2018, arXiv:cs.LG/1803.11485. [Google Scholar]
  8. Ye, D.; Chen, G.; Zhang, W.; Chen, S.; Yuan, B.; Liu, B.; Chen, J.; Liu, Z.; Qiu, F.; Yu, H.; et al. Towards Playing Full MOBA Games with Deep Reinforcement Learning. In Proceedings of the NeurIPS; NeurIPS: New Orleans, LA, USA, 2020; pp. 621–632. [Google Scholar]
  9. Finn, C.; Abbeel, P.; Levine, S. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; Volume 70, pp. 1126–1135. [Google Scholar]
  10. Nichol, A.; Achiam, J.; Schulman, J. On First-Order Meta-Learning Algorithms. arXiv 2018, arXiv:cs.LG/1803.02999. [Google Scholar]
  11. Gupta, A.; Mendonca, R.; Liu, Y.; Abbeel, P.; Levine, S. Meta-Reinforcement Learning of Structured Exploration Strategies. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 3–8 December 2018; Volume 31, pp. 5307–5316. [Google Scholar]
  12. Rakelly, K.; Zhou, A.; Finn, C.; Levine, S.; Quillen, D. Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; Volume 97, pp. 5331–5340. [Google Scholar]
  13. Du, Y.; Han, L.; Fang, M.; Liu, J.; Dai, T.; Tao, D. LIIR: Learning Individual Intrinsic Reward in Multi-Agent Reinforcement Learning. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Volume 32, pp. 4403–4414. [Google Scholar]
  14. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
  15. Lin, C.J.; Jhang, J.Y.; Lin, H.Y.; Lee, C.L.; Young, K.Y. Using a Reinforcement Q-Learning-Based Deep Neural Network for Playing Video Games. Electronics 2019, 8, 1128. [Google Scholar] [CrossRef] [Green Version]
  16. Srinivasu, P.N.; Bhoi, A.K.; Jhaveri, R.H.; Reddy, G.T.; Bilal, M. Probabilistic Deep Q Network for real-time path planning in censorious robotic procedures using force sensors. J. Real-Time Image Process. 2021, 18, 1773–1785. [Google Scholar] [CrossRef]
  17. Kim, C. Deep Q-Learning Network with Bayesian-Based Supervised Expert Learning. Symmetry 2022, 14, 2134. [Google Scholar] [CrossRef]
  18. Li, Y.; Zhou, W.; Wang, H.; Ding, B.; Xu, K. Improving Fast Adaptation for Newcomers in Multi-Robot Reinforcement Learning System. In Proceedings of the 2019 IEEE SmartWorld, Ubiquitous Intelligence Computing, Advanced Trusted Computing, Scalable Computing Communications, Cloud Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI), Leicester, UK, 19–23 August 2019; pp. 753–760. [Google Scholar] [CrossRef]
  19. Tan, M. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the Tenth International Conference on International Conference on Machine Learning, Amherst, MA, USA, 27–29 June 1993; pp. 330–337. [Google Scholar]
  20. Tampuu, A.; Matiisen, T.; Kodelja, D.; Kuzovkin, I.; Korjus, K.; Aru, J.; Aru, J.; Vicente, R. Multiagent Cooperation and Competition with Deep Reinforcement Learning. arXiv 2015, arXiv:cs.AI/1511.08779. [Google Scholar] [CrossRef] [Green Version]
  21. Lowe, R.; Wu, Y.; Tamar, A.; Harb, J.; Abbeel, P.; Mordatch, I. Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. In Proceedings of the NIPS, Long Beach, CA, USA, 4–9 December 2017; pp. 6382–6393. [Google Scholar]
  22. Foerster, J.N.; Assael, Y.M.; de Freitas, N.; Whiteson, S. Learning to Communicate with Deep Multi-Agent Reinforcement Learning. arXiv 2016, arXiv:cs.AI/1605.06676. [Google Scholar]
  23. Sukhbaatar, S.; Szlam, A.; Fergus, R. Learning Multiagent Communication with Backpropagation. arXiv 2016, arXiv:cs.LG/1605.07736. [Google Scholar]
  24. Jiang, J.; Lu, Z. Learning Attentional Communication for Multi-Agent Cooperation. In Proceedings of the Advances in Neural Information Processing Systems; Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2018; Volume 31, pp. 7254–7264. [Google Scholar]
  25. Kim, D.; Moon, S.; Hostallero, D.; Kang, W.J.; Lee, T.; Son, K.; Yi, Y. Learning to Schedule Communication in Multi-agent Reinforcement Learning. arXiv 2019, arXiv:cs.AI/1902.01554. [Google Scholar]
  26. Iqbal, S.; Sha, F. Actor-Attention-Critic for Multi-Agent Reinforcement Learning. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; Chaudhuri, K., Salakhutdinov, R., Eds.; Proceedings of Machine Learning Research; Volume 97, pp. 2961–2970. [Google Scholar]
  27. Sunehag, P.; Lever, G.; Gruslys, A.; Czarnecki, W.M.; Zambaldi, V.; Jaderberg, M.; Lanctot, M.; Sonnerat, N.; Leibo, J.Z.; Tuyls, K.; et al. Value-Decomposition Networks For Cooperative Multi-Agent Learning Based On Team Reward. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS ’18, Stockholm, Sweden, 10–15 July 2018; pp. 2085–2087. [Google Scholar]
  28. Sunehag, P.; Lever, G.; Gruslys, A.; Czarnecki, W.M.; Zambaldi, V.; Jaderberg, M.; Lanctot, M.; Sonnerat, N.; Leibo, J.Z.; Tuyls, K.; et al. Value-Decomposition Networks For Cooperative Multi-Agent Learning. arXiv 2017, arXiv:cs.AI/1706.05296. [Google Scholar]
  29. Wang, J.; Ren, Z.; Liu, T.; Yu, Y.; Zhang, C. QPLEX: Duplex Dueling Multi-Agent Q-Learning. arXiv 2021, arXiv:2008.01062. [Google Scholar]
  30. Bengio, Y.; Bengio, S.; Cloutier, J. Learning a synaptic learning rule. In Proceedings of the IJCNN-91-Seattle International Joint Conference on Neural Networks, Seattle, WA, USA, 8–12 July 1991; IEEE: Manhattan, NY, USA, 1991; p. 969. [Google Scholar] [CrossRef]
  31. Schmidhuber, J. Learning to Control Fast-Weight Memories: An Alternative to Dynamic Recurrent Networks. Neural Comput. 1992, 4, 131–139. [Google Scholar] [CrossRef]
  32. Liu, X.; Zhou, F.; Liu, J.; Jiang, L. Meta-Learning based prototype-relation network for few-shot classification. Neurocomputing 2020, 383, 224–234. [Google Scholar] [CrossRef]
  33. Sun, S.; Kiran, M.; Ren, W. MAMRL: Exploiting Multi-agent Meta Reinforcement Learning in WAN Traffic Engineering. arXiv 2021, arXiv:cs.NI/2111.15087. [Google Scholar]
  34. Wang, M.; Wu, L.; Li, M.; Wu, D.; Shi, X.; Ma, C. Meta-learning based spatial-temporal graph attention network for traffic signal control. Knowl.-Based Syst. 2022, 250, 109166. [Google Scholar] [CrossRef]
  35. Fagerblom, F. Model-Agnostic Meta-Learning for Digital Pathology. Ph.D. Thesis, Linkoping University, Linkoping, Sweden, 2020. [Google Scholar]
  36. Li, Q.; Peng, Z.; Xue, Z.; Zhang, Q.; Zhou, B. MetaDrive: Composing Diverse Driving Scenarios for Generalizable Reinforcement Learning. arXiv 2021, arXiv:cs.LG/2109.12674. [Google Scholar] [CrossRef]
  37. Yang, J.; Wang, E.; Trivedi, R.; Zhao, T.; Zha, H. Adaptive Incentive Design with Multi-Agent Meta-Gradient Reinforcement Learning. arXiv 2021, arXiv:cs.MA/2112.10859. [Google Scholar]
  38. Shi, J.; Yao, H.; Wu, X.; Li, T.; Lin, Z.; Wang, T.; Zhao, B. Relation-aware Meta-learning for E-commerce Market Segment Demand Prediction with Limited Records. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, Online, 8–12 March 2021; ACM: New York, NY, USA, 2021; pp. 220–228. [Google Scholar] [CrossRef]
  39. Alshehri, M.; Reyes, N.; Barczak, A. Evolving Meta-Level Reasoning with Reinforcement Learning and A* for Coordinated Multi-Agent Path-Planning. In Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS ’20, Auckland, New Zealand, 9–13 May 2020; pp. 1744–1746. [Google Scholar]
  40. Feng, Y.; Chen, J.; Xie, J.; Zhang, T.; Lv, H.; Pan, T. Meta-learning as a promising approach for few-shot cross-domain fault diagnosis: Algorithms, applications, and prospects. Knowl.-Based Syst. 2022, 235, 107646. [Google Scholar] [CrossRef]
  41. Liu, T.; Ma, X.; Li, S.; Li, X.; Zhang, C. A stock price prediction method based on meta-learning and variational mode decomposition. Knowl.-Based Syst. 2022, 252, 109324. [Google Scholar] [CrossRef]
  42. Xu, Z.; Chen, X.; Tang, W.; Lai, J.; Cao, L. Meta weight learning via model-agnostic meta-learning. Neurocomputing 2021, 432, 124–132. [Google Scholar] [CrossRef]
  43. Weng, L. Meta Reinforcement Learning. 2019. Available online: https://lilianweng.github.io/posts/2019-06-23-meta-rl/ (accessed on 7 July 2022).
  44. Duan, Y.; Schulman, J.; Chen, X.; Bartlett, P.L.; Sutskever, I.; Abbeel, P. RL2: Fast Reinforcement Learning via Slow Reinforcement Learning. arXiv 2016, arXiv:cs.AI/1611.02779. [Google Scholar]
  45. Wang, J.X.; Kurth-Nelson, Z.; Tirumala, D.; Soyer, H.; Leibo, J.Z.; Munos, R.; Blundell, C.; Kumaran, D.; Botvinick, M. Learning to reinforcement learn. arXiv 2016, arXiv:cs.LG/1611.05763. [Google Scholar]
  46. Frans, K.; Ho, J.; Chen, X.; Abbeel, P.; Schulman, J. Meta Learning Shared Hierarchies. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018; pp. 468–475. [Google Scholar]
  47. Houthooft, R.; Chen, R.Y.; Isola, P.; Stadie, B.C.; Wolski, F.; Ho, J.; Abbeel, P. Evolved policy gradients. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montreal, BC, Canada, 3–8 December 2018; pp. 5405–5414. [Google Scholar]
  48. Fakoor, R.; Chaudhari, P.; Soatto, S.; Smola, A.J. Meta-Q-Learning. arXiv 2019, arXiv:cs.LG/1910.00125v2. [Google Scholar]
  49. Al-Shedivat, M.; Bansal, T.; Burda, Y.; Sutskever, I.; Mordatch, I.; Abbeel, P. Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments. In Proceedings of the 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, 30 April–3 May 2018; pp. 1–21. [Google Scholar]
  50. Jia, H.; Ding, B.; Wang, H.; Gong, X.; Zhou, X. Fast Adaptation via Meta Learning in Multi-agent Cooperative Tasks. In Proceedings of the IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI), Leicester, UK, 19–23 August 2019; IEEE: Manhattan, NY, USA, 2019; pp. 707–714. [Google Scholar] [CrossRef]
  51. Yang, Y.; Wang, J. An Overview of Multi-Agent Reinforcement Learning from Game Theoretical Perspective. arXiv 2020, arXiv:cs.MA/2011.00583. [Google Scholar]
  52. Zou, H.; Ren, T.; Yan, D.; Su, H.; Zhu, J. Reward Shaping via Meta-Learning. arXiv 2019, arXiv:cs.LG/1901.09330. [Google Scholar]
  53. Ng, A.Y.; Harada, D.; Russell, S. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the 16th International Conference on Machine Learning, San Francisco, CA, USA, 27–30 June 1999; pp. 278–287. [Google Scholar]
  54. Tobin, J.; Fong, R.; Ray, A.; Schneider, J.; Zaremba, W.; Abbeel, P. Domain randomization for transferring deep neural networks from simulation to the real world. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; pp. 23–30. [Google Scholar] [CrossRef] [Green Version]
  55. Slaoui, R.B.; Clements, W.R.; Foerster, J.N.; Toth, S. Robust Visual Domain Randomization for Reinforcement Learning. arXiv 2019, arXiv:cs.LG/1910.10537. [Google Scholar]
Figure 1. An illustrative case of our problem. The numbers inside landmarks indicate the order in which to occupy them.
Figure 2. The framework of the proposed method that combines meta-learning and multi-agent reinforcement learning.
Figure 3. The paradigm of centralized training and distributed execution in MADDPG.
Figure 4. Overview of the Reptile algorithm.
Figure 5. Average rewards on training tasks during pre-training, with 95% CI.
Figure 6. Average rewards on testing tasks during fine-tuning, with 95% CI.
Figure 7. Average rewards on training tasks during pre-training, with 95% CI (β = 0).
Figure 8. Average rewards on testing tasks during fine-tuning, with 95% CI (β = 0).
Figure 9. Training and testing curves for different values of β.
Table 1. Quantitative results of meta-testing on task 1.

Policy  | 0 Steps        | 2000 Steps    | 4000 Steps    | 6000 Steps    | 10,000 Steps
META    | −44.21 ± 2.86  | −49.26 ± 3.66 | −45.79 ± 1.20 | −43.13 ± 2.51 | −37.32 ± 1.85
DR      | −45.01 ± 2.08  | −53.17 ± 2.76 | −52.08 ± 2.33 | −52.15 ± 3.10 | −52.22 ± 4.12
RANDOM  | −121.32 ± 3.54 | −70.35 ± 3.24 | −58.12 ± 3.24 | −52.13 ± 1.78 | −43.32 ± 2.66

Table 2. Quantitative results of meta-testing on task 2.

Policy  | 0 Steps        | 2000 Steps    | 4000 Steps    | 6000 Steps    | 10,000 Steps
META    | −43.78 ± 2.81  | −47.44 ± 3.42 | −47.10 ± 2.73 | −43.62 ± 2.59 | −38.18 ± 2.05
DR      | −46.04 ± 4.04  | −55.44 ± 2.47 | −53.09 ± 2.32 | −52.09 ± 2.74 | −50.28 ± 2.88
RANDOM  | −122.00 ± 6.68 | −75.98 ± 2.98 | −59.56 ± 3.19 | −52.83 ± 2.03 | −44.38 ± 3.48