Multi-Agent Reinforcement Learning for Extended Flexible Job Shop Scheduling
Abstract
1. Introduction
Dual flexibility. FJSPDT considers both technology flexibility and path flexibility. As shown in Figure 1, the blue operation may be processed before the red operation, or vice versa, which results in technology flexibility. The same operation can be processed by different robots, e.g., the orange operation can be processed by Robot ${m}_{1}$ or Robot ${m}_{4}$, which leads to path flexibility.
 Varied transportation time. One job needs to be processed through multiple stations, and the transportation time between operations varies depending on the distance between stations.
Cooperative gaming and uncertain environment. Multiple jobs must coordinate their operations to achieve a shorter makespan while competing to preempt shared resources such as robots and stations. The environment also suffers from uncertain factors, e.g., robot failures.
 Generalization. The scheduling algorithm should be adaptable to changing conditions, such as altered operation sequences, modified stations, robot distributions, or varying job quantities.
Real-time. Due to environmental uncertainty, the scheduler must respond to robot failures within seconds and make real-time scheduling decisions.
 Accuracy. Significant operational risks may arise in the application scenario corresponding to FJSPDT. In automobile assembly scheduling, poor coordination between operations may lead to collisions. Therefore, the algorithm needs to ensure operational safety.
Curse of dimensionality. The solution space expands as the number of agents and resources (such as robots and stations) increases, resulting in a computational dilemma. Here, M, S, and T denote the sets of robots, stations, and time steps, respectively, and N and O denote the sets of agents and operations. The solution space of FJSPDT is ${A}_{\left|M\right|\times \left|S\right|\times \left|T\right|}^{\left|N\right|\times \left|O\right|}$, which causes a combinatorial explosion (a back-of-the-envelope sketch follows this list).
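To make the scale concrete, the following sketch evaluates $(\left|M\right|\times\left|S\right|\times\left|T\right|)^{\left|N\right|\times\left|O\right|}$ for hypothetical instance sizes (illustrative only, not taken from the case study):

```python
import math

# Hypothetical instance sizes (illustrative only, not the paper's cases).
n_robots, n_stations, n_timesteps = 5, 5, 20   # |M|, |S|, |T|
n_agents, n_operations = 3, 12                 # |N|, |O|

# Each (agent, operation) pair picks among |M|*|S|*|T| assignments, so the
# nominal solution space holds (|M|*|S|*|T|)^(|N|*|O|) candidate schedules.
choices_per_decision = n_robots * n_stations * n_timesteps
n_decisions = n_agents * n_operations
log10_size = n_decisions * math.log10(choices_per_decision)
print(f"~10^{log10_size:.0f} candidate schedules")   # ~10^97 for this tiny case
```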
The FJSPDT, abstracted from real and complex scenarios, is proposed. To the best of our literature review, the problem is new and remains to be fully investigated. The FJSPDT is modeled as a decentralized partially observable Markov decision process (Dec-POMDP) because, in real applications, an agent may only observe the information around it. A MARL-based method can then be applied to achieve high adaptability and real-time decisions.
We build an adaptive and stable multi-agent learning framework by combining a graph convolutional network (GCN) and the actor–critic structure. The GCN extracts the structural features of the embedding graph and represents system states in non-Euclidean space, thereby mitigating the curse of dimensionality and adapting to various scheduling environments. The actor–critic structure can update the network parameters in a single step without running a full episode, making it faster than the policy gradient algorithm.
A double Q-value mixing algorithm (DQMIX) under the above framework is proposed to address the challenges of fast convergence and high adaptability. The algorithm combines an unrestricted optimal network with a monotonic mixing network to improve exploration and exploitation capabilities. It also integrates mechanistic constraints into data-based learning, mitigating the curse of dimensionality by eliminating invalid actions. In addition, the reward function is designed as the reduction in the makespan estimate, mitigating the learning challenges caused by sparse feedback (a minimal sketch of this reward shaping follows).
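As a minimal sketch of the reward-shaping idea only (the estimator below is a hypothetical placeholder; the paper defines its own makespan estimate in Section 3.5), a dense step reward can be written as the drop in the estimated makespan:

```python
def step_reward(estimate_makespan, s_t, s_t1):
    """Dense reward: reduction of the makespan estimate over one step.

    `estimate_makespan` is a hypothetical callable that maps a (partial)
    schedule state to an estimated final makespan, e.g., a lower bound from
    remaining processing plus transportation times. A positive reward means
    the joint action tightened the estimate, so agents receive feedback at
    every step instead of a single sparse end-of-episode signal.
    """
    return estimate_makespan(s_t) - estimate_makespan(s_t1)
```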
2. Related Works
Single-Agent Reinforcement Learning (SARL): The algorithm contains only one agent, which makes all the decisions for a control system.
Multi-Agent Reinforcement Learning (MARL): The algorithm comprises multiple agents that interact with the environment through their respective policies.
2.1. SARL for Scheduling
2.1.1. SARL with Value Iteration
2.1.2. SARL with Policy Iteration
2.2. MARL for Scheduling
Distributed Training Paradigm (DTP): In the distributed paradigm, agents learn independently of other agents and do not rely on explicit information exchange.
Centralized Training Paradigm (CTP): The centralized paradigm allows agents to exchange additional information during training; this information is abandoned at test time. During execution, agents receive only locally observable information and independently determine actions according to their policies.
2.2.1. MARL with DTP
2.2.2. MARL with CTP
3. Problem Description and Model Formulation
3.1. Problem Description
All jobs are available after their release dates.
 A robot can only process one job at a time.
 Each operation can only be processed on one robot at a time.
 Each operation cannot be interrupted during processing.
There is no buffer in the stations, and each station can only accommodate one job at a time (a sketch of these assumptions as a feasibility check follows this list).
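The following sketch shows one way these modeling assumptions could be enforced as a feasibility check before dispatching an operation; all field and function names are hypothetical, since the paper enforces them through its environment and MASK layer:

```python
from dataclasses import dataclass, field

@dataclass
class ShopState:
    # Hypothetical bookkeeping; names are illustrative, not from the paper.
    robot_busy: dict = field(default_factory=dict)    # robot id -> job id or None
    station_job: dict = field(default_factory=dict)   # station id -> job id or None
    processing: set = field(default_factory=set)      # jobs that must not be interrupted

def can_start(state: ShopState, job: int, robot: int, station: int) -> bool:
    """Check the modeling assumptions before dispatching an operation."""
    if state.robot_busy.get(robot) is not None:       # one job per robot at a time
        return False
    occupant = state.station_job.get(station)
    if occupant is not None and occupant != job:      # no buffer: one job per station
        return False
    if job in state.processing:                       # operations are non-preemptive
        return False
    return True
```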
3.2. Dec-POMDP Framework
 $\mathcal{N}=\{1,\cdots ,n\}$ is the set of $n>1$ agents;
$\mathcal{S}$ is the set of all possible joint states of the multi-agent system;
 $\mathcal{A}={\times}_{i\in \mathcal{N}}{\mathcal{A}}^{i}$ is the set of joint actions, where ${\mathcal{A}}^{i}$ denotes the set of actions for agent i;
 $\mathcal{P}$ is the transition probability function;
 $R=R(s,a)$ is the reward function, mapping states and joint actions to real numbers;
 $\mathcal{O}={\times}_{i\in \mathcal{N}}{\mathcal{O}}^{i}$ is the set of joint observations, where ${\mathcal{O}}^{i}$ is the set of observations available to agent i;
 $\Omega $ is the observation probability function;
$\gamma \in \left[0,1\right)$ is a discount factor. A minimal data-structure sketch of this tuple is given after the list.
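To make the tuple concrete, here is a minimal container, assuming finite sets stored as sequences and the probability functions represented as callables; the typing is illustrative, not part of the formal model:

```python
from typing import Callable, NamedTuple, Sequence

class DecPOMDP(NamedTuple):
    """Container mirroring the tuple above; types are illustrative."""
    n_agents: int                      # |N|, with n > 1
    states: Sequence                   # S: all possible joint states
    actions: Sequence[Sequence]        # A = x_i A^i: per-agent action sets
    transition: Callable               # P(s' | s, a)
    reward: Callable                   # R(s, a) -> float
    observations: Sequence[Sequence]   # O = x_i O^i: per-agent observation sets
    obs_fn: Callable                   # Omega(o | s', a)
    gamma: float                       # discount factor in [0, 1)
```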
3.3. State and Observations
3.3.1. State
3.3.2. Observations
3.4. Actions
3.5. Reward
4. Algorithm
4.1. Algorithm Overview
We utilize a learning algorithm to satisfy the real-time and adaptive requirements of FJSPDT scheduling. The scheduling procedure of FJSPDT is modeled as a Dec-POMDP, and a real-time scheduling policy that can adapt to an uncertain environment is obtained through reinforcement learning.
Regarding accuracy, we propose double critic networks to assist agent training, enhancing the stability and quality of learning. In addition, a MASK layer is added at the DQMIX output; it outputs the set of feasible actions based on the current state and conflict relationships, ensuring that the output actions are valid, as sketched below.
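A minimal sketch of such a masking step (the feasibility mask is assumed to come from the environment's conflict checks; the tensor interface is hypothetical):

```python
import torch

def mask_q_values(q_values: torch.Tensor, feasible: torch.Tensor) -> torch.Tensor:
    """Suppress infeasible actions before greedy selection.

    q_values: (batch, n_actions) per-agent action values.
    feasible: (batch, n_actions) boolean mask derived from the current state
              and conflict relationships (hypothetical interface).
    Infeasible entries are driven to -inf so that argmax never selects them.
    """
    return q_values.masked_fill(~feasible, float("-inf"))

# Greedy selection restricted to feasible actions:
# actions = mask_q_values(q, mask).argmax(dim=-1)
```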
The MARL architecture is adopted to reduce the dimensionality of the action space. Each job agent chooses actions based only on its own observations, so each agent model outputs a $1\times \left|{\mathcal{A}}^{i}\right|$-dimensional vector instead of an $n\times \left|{\mathcal{A}}^{i}\right|$-dimensional one. The GCN layer compresses the number of input parameters from $\left|V\right|\times \left|V\right|\times p$ to $\left|V\right|\times p$, which further alleviates the curse of dimensionality; a sketch of such a layer follows.
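The compression can be seen in a single Kipf–Welling graph convolution, sketched below; this is a generic GCN layer under our assumptions, not necessarily the paper's exact architecture:

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One Kipf-Welling graph convolution: H' = ReLU(A_hat @ H @ W).

    Node features H are (|V|, p) and the normalized adjacency A_hat is
    (|V|, |V|), so the learned input stays |V| x p instead of a flattened
    |V| x |V| x p tensor; the graph structure enters only through A_hat.
    """
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, a_hat: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.linear(a_hat @ h))

# Example: 12 operation nodes with 8 features each, embedded to 512 dims
# (512 matches the hidden width reported in the experimental settings).
layer = GCNLayer(8, 512)
a_hat = torch.eye(12)            # placeholder normalized adjacency
h = torch.randn(12, 8)
print(layer(a_hat, h).shape)     # torch.Size([12, 512])
```

The overall interaction procedure between the agents and the environment is then as follows.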
 Firstly, an environment instance is generated according to FJSPDT.
 Secondly, starting at time $t=0$, each agent chooses an action ${a}_{t}^{i}$ based on its own observation ${o}_{t}^{i}$ and its own policy. The actions of all agents form the joint action vector ${a}_{t}$.
 Thirdly, according to the processing state ${s}_{t}$ and scheduling action ${a}_{t}$ at time t, the environment updates the processing state ${s}_{t+1}$ and observations ${o}_{t+1}$ at time $t+1$. The environment outputs a reward r according to the reward function, and stores the tuple $({s}_{t},{o}_{t},{a}_{t},{s}_{t+1},{o}_{t+1})$ as a POMDP instance in the experience memory.
 Next, the agents continue to interact with the environment until all jobs’ tasks are completed, or the makespan exceeds ${C}_{up}$.
Finally, when the scheduling ends, a complete POMDP chain is formed, and the environment gives the final reward r. A minimal sketch of this interaction loop is given below.
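The sketch below assumes hypothetical `env`, `agents`, and `memory` objects standing in for the paper's components:

```python
def run_episode(env, agents, memory, c_up: float):
    """Decentralized execution: each agent acts on its own observation."""
    state, obs = env.reset()                 # environment instance from FJSPDT
    while not env.all_jobs_done() and env.makespan() <= c_up:
        joint_action = [agent.act(o) for agent, o in zip(agents, obs)]
        next_state, next_obs, r = env.step(joint_action)
        memory.store((state, obs, joint_action, next_state, next_obs))
        state, obs = next_state, next_obs
    return env.makespan()                    # final reward follows at episode end
```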
4.2. Agent Network
4.3. Critic Network
4.3.1. Motivation for Proposing Double Critic Networks
The non-negative monotonicity of the mixing value function should be maintained once the optimal policy has been recovered, which is consistent with the goal pursued in cooperative games: each agent's marginal return under the optimal policy is non-negative, i.e.,$$\frac{\partial {Q}^{mix}\left(s,\left[{Q}^{i}\right]\right)}{\partial {Q}^{i}\left({\tau}^{i},{a}^{i}\right)}\ge 0.$$
To overcome the limitations of ${Q}^{mix}$ when the optimal policy has not yet been recovered, a joint action-value function ${Q}^{*}(s,\tau ,a)$ is introduced. ${Q}^{*}(s,\tau ,a)$ is a function of the state s, the action–observation history $\tau$, and the joint action a; it reduces the dependence on the agents' individual utilities. A QMIX-style sketch of the monotonic mixing network is given below.
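For reference, a QMIX-style monotonic mixer can be sketched as follows: hypernetworks conditioned on the state emit non-negative weights, so the partial derivative above is non-negative by construction. This is a generic QMIX sketch, not the full DQMIX critic, which additionally pairs the mixer with the unrestricted ${Q}^{*}$ network:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonotonicMixer(nn.Module):
    """QMIX-style mixing network with non-negative hypernetwork weights."""
    def __init__(self, n_agents: int, state_dim: int, embed: int = 32):
        super().__init__()
        self.w1 = nn.Linear(state_dim, n_agents * embed)
        self.b1 = nn.Linear(state_dim, embed)
        self.w2 = nn.Linear(state_dim, embed)
        self.b2 = nn.Linear(state_dim, 1)
        self.n_agents, self.embed = n_agents, embed

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_agents); state: (batch, state_dim)
        w1 = self.w1(state).abs().view(-1, self.n_agents, self.embed)  # weights >= 0
        b1 = self.b1(state).unsqueeze(1)
        hidden = F.elu(agent_qs.unsqueeze(1) @ w1 + b1)
        w2 = self.w2(state).abs().view(-1, self.embed, 1)              # weights >= 0
        b2 = self.b2(state).unsqueeze(1)
        return (hidden @ w2 + b2).squeeze(-1).squeeze(-1)              # Q_mix(s, [Q^i])
```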
4.3.2. Mixing Network
4.3.3. Optimal Network
4.4. Loss Function
Algorithm 1 DQMIX Algorithm. 

5. Case Study
5.1. Case Description and Algorithm Settings
5.1.1. Case Description
5.1.2. Experimental Settings
5.2. Solution Quality
5.3. Computation Time
5.4. Convergence
5.5. Scalability
5.6. Generalization
5.6.1. Generalization to Robot Breakdowns
5.6.2. Generalization to Varied Quantity of Jobs
5.6.3. Generalization to Changes in Operations
5.7. GCN Effectiveness Verification
5.8. Ablation on Reward Function
6. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
Chaudhry, I.A.; Khan, A.A. A research survey: Review of flexible job shop scheduling techniques. Int. Trans. Oper. Res. 2016, 23, 551–591.
Xiong, H.; Shi, S.; Ren, D.; Hu, J. A survey of job shop scheduling problem: The types and models. Comput. Oper. Res. 2022, 142, 105731.
Luo, Q.; Deng, Q.; Xie, G.; Gong, G. A Pareto-based two-stage evolutionary algorithm for flexible job shop scheduling problem with worker cooperation flexibility. Robot. Comput.-Integr. Manuf. 2023, 82, 102534.
Wei, Z.; Liao, W.; Zhang, L. Hybrid energy-efficient scheduling measures for flexible job-shop problem with variable machining speeds. Expert Syst. Appl. 2022, 197, 116785.
Li, Y.; Gu, W.; Yuan, M.; Tang, Y. Real-time data-driven dynamic scheduling for flexible job shop with insufficient transportation resources using hybrid deep Q network. Robot. Comput.-Integr. Manuf. 2022, 74, 102283.
Du, Y.; Li, J.-Q.; Chen, X.-L.; Duan, P.-Y.; Pan, Q.-K. Knowledge-based reinforcement learning and estimation of distribution algorithm for flexible job shop scheduling problem. IEEE Trans. Emerg. Top. Comput. Intell. 2023, 7, 1036–1050.
Burggräf, P.; Wagner, J.; Saßmannshausen, T.; Ohrndorf, D.; Subramani, K. Multi-agent-based deep reinforcement learning for dynamic flexible job shop scheduling. Procedia CIRP 2022, 112, 57–62.
Zhang, Y.; Zhu, H.; Tang, D.; Zhou, T.; Gui, Y. Dynamic job shop scheduling based on deep reinforcement learning for multi-agent manufacturing systems. Robot. Comput.-Integr. Manuf. 2022, 78, 102412.
Wang, X.; Zhang, L.; Lin, T.; Zhao, C.; Wang, K.; Chen, Z. Solving job scheduling problems in a resource preemption environment with multi-agent reinforcement learning. Robot. Comput.-Integr. Manuf. 2022, 77, 102324.
Jing, X.; Yao, X.; Liu, M.; Zhou, J. Multi-agent reinforcement learning based on graph convolutional network for flexible job shop scheduling. J. Intell. Manuf. 2022, 1–19.
Ku, W.-Y.; Beck, J.C. Mixed integer programming models for job shop scheduling: A computational analysis. Comput. Oper. Res. 2016, 73, 165–173.
Gao, K.; Cao, Z.; Zhang, L.; Chen, Z.; Han, Y.; Pan, Q. A review on swarm intelligence and evolutionary algorithms for solving flexible job shop scheduling problems. IEEE/CAA J. Autom. Sin. 2019, 6, 904–916.
Tian, Y.; Si, L.; Zhang, X.; Cheng, R.; He, C.; Tan, K.C.; Jin, Y. Evolutionary large-scale multi-objective optimization: A survey. ACM Comput. Surv. 2021, 54, 174.
Afshin, O.; Davood, H. A review of cooperative multi-agent deep reinforcement learning. Appl. Intell. 2023, 53, 13677–13722.
Lihu, A.; Holban, S. Top five most promising algorithms in scheduling. In Proceedings of the 2009 5th International Symposium on Applied Computational Intelligence and Informatics, Timisoara, Romania, 28–29 May 2009; pp. 397–404.
Wang, X.; Zhang, L.; Ren, L.; Xie, K.; Wang, K.; Ye, F.; Chen, Z. Brief review on applying reinforcement learning to job shop scheduling problems. J. Syst. Simul. 2021, 33, 2782.
Liu, Y.-K.; Zhang, X.-S.; Zhang, L.; Tao, F.; Wang, L.-H. A multi-agent architecture for scheduling in platform-based smart manufacturing systems. Front. Inf. Technol. Electron. Eng. 2019, 20, 1465–1492.
Zhang, W.; Dietterich, T.G. A reinforcement learning approach to job-shop scheduling. In Proceedings of the IJCAI, Montreal, QC, Canada, 20–25 August 1995; Volume 95, pp. 1114–1120.
Aydin, M.E.; Öztemel, E. Dynamic job-shop scheduling using reinforcement learning agents. Robot. Auton. Syst. 2000, 33, 169–178.
Waschneck, B.; Reichstaller, A.; Belzner, L.; Altenmüller, T.; Bauernhansl, T.; Knapp, A.; Kyek, A. Optimization of global production scheduling with deep reinforcement learning. Procedia CIRP 2018, 72, 1264–1269.
Luo, S. Dynamic scheduling for flexible job shop with new job insertions by deep reinforcement learning. Appl. Soft Comput. 2020, 91, 106208.
Lang, S.; Behrendt, F.; Lanzerath, N.; Reggelin, T.; Müller, M. Integration of deep reinforcement learning and discrete-event simulation for real-time scheduling of a flexible job shop production. In Proceedings of the 2020 Winter Simulation Conference (WSC), Orlando, FL, USA, 14–18 December 2020; pp. 3057–3068.
Gu, Y.; Chen, M.; Wang, L. A self-learning discrete salp swarm algorithm based on deep reinforcement learning for dynamic job shop scheduling problem. Appl. Intell. 2023, 53, 18925–18958.
Wang, L.; Hu, X.; Wang, Y.; Xu, S.; Ma, S.; Yang, K.; Liu, Z.; Wang, W. Dynamic job-shop scheduling in smart manufacturing using deep reinforcement learning. Comput. Netw. 2021, 190, 107969.
Gronauer, S.; Diepold, K. Multi-agent deep reinforcement learning: A survey. Artif. Intell. Rev. 2022, 55, 895–943.
Chandak, Y.; Theocharous, G.; Kostas, J.; Jordan, S.; Thomas, P. Learning action representations for reinforcement learning. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 941–950.
Liu, Y.; Li, Z.; Jiang, Z.; He, Y. Prospects for multi-agent collaboration and gaming: Challenge, technology, and application. Front. Inf. Technol. Electron. Eng. 2022, 23, 1002–1009.
Aissani, N.; Trentesaux, D.; Beldjilali, B. Multi-agent reinforcement learning for adaptive scheduling: Application to multi-site company. IFAC Proc. Vol. 2009, 42, 1102–1107.
Martínez Jiménez, Y.; Coto Palacio, J.; Nowé, A. Multi-agent reinforcement learning tool for job shop scheduling problems. In Proceedings of the International Conference on Optimization and Learning, Cadiz, Spain, 17–19 February 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 3–12.
Hameed, M.S.A.; Schwung, A. Reinforcement learning on job shop scheduling problems using graph networks. arXiv 2020, arXiv:2009.03836.
Zhou, T.; Tang, D.; Zhu, H.; Zhang, Z. Multi-agent reinforcement learning for online scheduling in smart factories. Robot. Comput.-Integr. Manuf. 2021, 72, 102202.
Popper, J.; Motsch, W.; David, A.; Petzsche, T.; Ruskowski, M. Utilizing multi-agent deep reinforcement learning for flexible job shop scheduling under sustainable viewpoints. In Proceedings of the 2021 International Conference on Electrical, Computer, Communications and Mechatronics Engineering (ICECCME), Mauritius, 7–8 October 2021; pp. 1–6.
Wang, S.; Li, J.; Luo, Y. Smart scheduling for flexible and hybrid production with multi-agent deep reinforcement learning. In Proceedings of the 2021 IEEE 2nd International Conference on Information Technology, Big Data and Artificial Intelligence (ICIBA), Chongqing, China, 17–19 December 2021; Volume 2, pp. 288–294.
Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907.
Zhang, C.; Song, W.; Cao, Z.; Zhang, J.; Tan, P.S.; Chi, X. Learning to dispatch for job shop scheduling via deep reinforcement learning. Adv. Neural Inf. Process. Syst. 2020, 33, 1621–1632.
Rashid, T.; Samvelyan, M.; Schroeder, C.; Farquhar, G.; Foerster, J.; Whiteson, S. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 4295–4304.
Son, K.; Kim, D.; Kang, W.J.; Hostallero, D.E.; Yi, Y. QTRAN: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 5887–5896.
Mahajan, A.; Rashid, T.; Samvelyan, M.; Whiteson, S. MAVEN: Multi-agent variational exploration. Adv. Neural Inf. Process. Syst. 2019, 32, 7611–7622.
Rashid, T.; Farquhar, G.; Peng, B.; Whiteson, S. Weighted QMIX: Expanding monotonic value function factorisation for deep multi-agent reinforcement learning. Adv. Neural Inf. Process. Syst. 2020, 33, 10199–10210.
Son, K.; Ahn, S.; Reyes, R.D.; Shin, J.; Yi, Y. QTRAN++: Improved value transformation for cooperative multi-agent reinforcement learning. arXiv 2020, arXiv:2006.12010.
Ha, D.; Dai, A.M.; Le, Q.V. HyperNetworks. In Proceedings of the 5th International Conference on Learning Representations, ICLR, Toulon, France, 24–26 April 2017.
Li, X.; Gao, L. An effective hybrid genetic algorithm and tabu search for flexible job shop scheduling problem. Int. J. Prod. Econ. 2016, 174, 93–110.
Li, X.; Gao, L.; Pan, Q.; Wan, L.; Chao, K.-M. An effective hybrid genetic algorithm and variable neighborhood search for integrated process planning and scheduling in a packaging machine workshop. IEEE Trans. Syst. Man Cybern. Syst. 2018, 49, 1933–1945.
Liu, Q.; Li, X.; Gao, L.; Li, Y. A modified genetic algorithm with new encoding and decoding methods for integrated process planning and scheduling problem. IEEE Trans. Cybern. 2020, 51, 4429–4438.
| Type | Number ^{1} |
| --- | --- |
| Classical JSP | 68 |
| Dynamic JSP | 48 |
| JSP considering the machine availability | 52 |
| Flexible JSP with alternative machines (FJSP) | 225 |
| JSP with alternative routings | 32 |
| JSP considering batches | 39 |
| JSP considering setup times | 69 |
| JSP considering transportation time | 1 |
| JSP with non-deterministic or non-constant processing time | 32 |
| Distributed JSP (DSJSP) | 25 |
| JSP with dual-resource constraints (DRJSP) | 30 |
| JSP considering energy and pro-environment | 132 |
| JSP with a prior job | 5 |
| JSP with dependent jobs | 2 |
| JSP with no-wait constraint for operations on the same job | 18 |
| JSP with blocking constraint for capacities of buffer | 8 |
| JSP with reentrancy | 9 |
| JSP with preemption | 2 |
| JSP considering overtime work | 2 |
| JSP with limited buffer capacity | 2 |
| JSP considering outsourcing (subcontracting) | 6 |
| JSP considering robot or automated guided vehicle (AGV) | 28 |
| FJSP with worker cooperation flexibility | 1 |
| FJSP with technology and path flexibility (FJSPDT) | — |
| Symbol | Definition |
| --- | --- |
| $N$ | A set of $\left|N\right|$ jobs, $N=\{{n}_{1},{n}_{2},\cdots ,{n}_{\left|N\right|}\}$. |
| ${n}_{i}$ | The $i$th job in the set $N$. |
| $O$ | A set of all jobs' operations, $O=\{{o}_{1},{o}_{2},\cdots ,{o}_{\left|O\right|}\}$. |
| $S$ | A set of all stations for processing, $S=\{{s}_{1},{s}_{2},\cdots ,{s}_{\left|S\right|}\}$. |
| ${s}_{k}$ | The $k$th station in the set $S$. |
| $M$ | A set of $\left|M\right|$ robots, $M=\{{m}_{1},{m}_{2},\cdots ,{m}_{\left|M\right|}\}$. |
| ${m}_{k}$ | The $k$th robot in the set $M$. |
| ${O}_{i}$ | A set of operations for job ${n}_{i}$, ${O}_{i}=\{{o}_{i1},{o}_{i2},\cdots ,{o}_{i\left|{O}_{i}\right|}\}$. |
| ${o}_{ij}$ | The $j$th operation of job ${n}_{i}$. |
| ${M}_{j}$ | A set of robots that can support operation ${o}_{j}$, ${M}_{j}=\{{m}_{j1},{m}_{j2},\cdots ,{m}_{j\left|{M}_{j}\right|}\}$. |
| ${m}_{jk}$ | The $k$th robot that can support operation ${o}_{j}$. |
| ${M}_{l}$ | A set of robots that station ${s}_{l}$ can cover. |
| ${m}_{lk}$ | The $k$th robot that station ${s}_{l}$ can cover. |
| ${S}_{k}$ | A set of stations that robot ${m}_{k}$ can reach. |
| ${s}_{kl}$ | The $l$th station that robot ${m}_{k}$ can reach. |
| Action | Description |
| --- | --- |
| ${a}_{i}\ (i\le K)$ | Specifying a station and a robot that are available for the agent to perform the next operation. |
| ${a}_{K+1}$ | Performing the transportation action. If the current station cannot support the next operation, the agent must move. |
| ${a}_{K+2}$ | Staying on processing. The agent must not be interrupted to carry out other things. |
| ${a}_{K+3}$ | Taking a wait action. The agent performs nothing. |
| ${a}_{K+4}$ | Representing stop. At this point, the agent has completed all operations. |
| Case | Number of Jobs | Number of Operations | Number of Stations | Number of Robots | ${C}_{up}$ | LB ^{1} |
| --- | --- | --- | --- | --- | --- | --- |
| N3 | 3 | 12 | 5 | 5 | 18 | 13 |
| N6 | 6 | 41 | 22 | 35 | 90 | 54 |
| N8 | 8 | 55 | 22 | 35 | 100 | 58 |
| N12 | 12 | 83 | 22 | 35 | 120 | 66 |
| Operation | Alternative Robots (Processing Time) |
| --- | --- |
| 1 | M1 (1), M4 (1) |
| 2 | M3 (5), M4 (5), M5 (5) |
| 3 | M1 (4), M2 (4) |
| 4 | M2 (1), M3 (1), M5 (1) |
| Robot | Available Stations |
| --- | --- |
| M1 | S1, S4, S5 |
| M2 | S2 |
| M3 | S2, S3, S5 |
| M4 | S4 |
| M5 | S2, S3, S5 |
| Station | S1 | S2 | S3 | S4 | S5 |
| --- | --- | --- | --- | --- | --- |
| S1 | 0 | 1 | 1 | 2 | 2 |
| S2 | 1 | 0 | 1 | 1 | 2 |
| S3 | 1 | 1 | 0 | 1 | 1 |
| S4 | 2 | 1 | 1 | 0 | 1 |
| S5 | 2 | 2 | 1 | 1 | 0 |
| Network | Parameters |
| --- | --- |
| GCN | One layer, with a hidden dimension of 512. |
| RNN | Two layers, with a hidden dimension of 512 in each layer. |
| Mixing network | Two layers, with a hidden dimension of 512 in each layer. |
| Optimal network | Three layers, with a hidden dimension of 512 in each layer. |
| Case | DQMIX (Best / Mean) | hGATS (Best / Mean) | hGAVNS (Best / Mean) | MGA (Best / Mean) | CWQMIX (Best / Mean) | QTRAN (Best / Mean) | MILP |
| --- | --- | --- | --- | --- | --- | --- | --- |
| N3 | 14 / 14 | 13 / 13 | 13 / 13 | 13 / 14.0 | 14.0 / 17.1 | 14.0 / 16.2 | 13 |
| N6 | 63 / 64.5 | 61 / 63.6 | 61 / 62 | 62 / 63.3 | 68.0 / 83.6 | 72.0 / 84.6 | — |
| N8 | 68 / 70.3 | 71 / 71.6 | 67 / 68 | 70 / 71 | 73 / 75.7 | 76 / 78.7 | — |
| N12 | 81 / 84.7 | 85 / 85.6 | 85 / 85.6 | 83 / 84 | 81 / 85.7 | 82 / 89.7 | — |
| Case | DQMIX (Best / Mean) | hGATS (Best / Mean) | hGAVNS (Best / Mean) | MGA (Best / Mean) | CWQMIX (Best / Mean) | QTRAN (Best / Mean) | Best Known |
| --- | --- | --- | --- | --- | --- | --- | --- |
| N3 | 7.7% / 7.7% | 0.0% / 0.0% | 0.0% / 0.0% | 0.0% / 0.0% | 7.7% / 31.5% | 7.7% / 24.5% | 13 |
| N6 | 3.3% / 5.7% | 0.0% / 4.3% | 0.0% / 1.6% | 1.6% / 3.8% | 11.5% / 37.0% | 18.0% / 38.7% | 61 |
| N8 | 1.5% / 4.9% | 6.0% / 6.9% | 0.0% / 1.5% | 4.5% / 6.0% | 9.0% / 13.0% | 13.4% / 17.5% | 67 |
| N12 | 0.0% / 4.6% | 4.9% / 5.7% | 4.9% / 5.7% | 2.5% / 3.7% | 0.0% / 8.8% | 1.2% / 10.7% | 81 |
| Model | Training Duration | Execution Duration |
| --- | --- | --- |
| DQMIX | 5.0 h 47.08 min | 4.59 s |
| hGATS | 47.77 min | 47.77 min |
| hGAVNS | 51.58 min | 51.58 min |
| MGA | 32.86 min | 32.86 min |
| Items | DQMIX | CWQMIX | QTRAN |
| --- | --- | --- | --- |
| Static makespan ^{1} | 69 | 73 | 76 |
| Number of instances ^{2} | 150 | 150 | 150 |
| Makespan mean ^{3} | 73.0 | 78.5 | 79.3 |
| Makespan variance | 6.0 | 3.7 | 3.6 |
| $MIR$ | 5.7% | 7.5% | 4.4% |
| $NMIR$ | 5.7% | 10.2% | 7.2% |
| Success rate | 99.3% | 100% | 98.7% |
| Cases | Number of Jobs | DQMIX | CWQMIX | QTRAN |
| --- | --- | --- | --- | --- |
| N4 | 4 | 68 | 65 | 81 |
| N5 | 5 | 68 | 66 | 79 |
| N6 | 6 | 70 | 68 | 81 |
| N7 | 7 | 71 | 79 | 76 |
| N8 | 8 | 81 | 79 | 83 |
| Mean | — | 71.6 | 71.4 | 80 |
| Variance | — | 29.3 | 49.3 | 7 |
| Case | DQMIX | CWQMIX | QTRAN |
| --- | --- | --- | --- |
| 1 | 77 | 73 | 74 |
| 2 | 89 | 73 | 72 |
| 3 | 71 | 72 | 88 |
| 4 | 79 | 72 | 74 |
| 5 | 68 | 69 | 69 |
| 6 | 75 | 100 | 72 |
| 7 | 66 | 100 | 72 |
| 8 | 69 | 66 | 78 |
| 9 | 68 | 67 | 71 |
| 10 | 67 | 100 | 72 |
| Mean | 72.9 | 79.2 | 74.2 |
| Success rate | 100% | 70% | 100% |