# Dynamic Job-Shop Scheduling Based on Transformer and Deep Reinforcement Learning

## Abstract


## 1. Introduction

**Table 1.** Value-function-based reinforcement learning algorithms for solving the dynamic job-shop scheduling problem.

| Work | State Representation | Action Space | Dynamic Factors | Problem | Objective | Algorithm |
|---|---|---|---|---|---|---|
| Turgut et al. [15] | Matrix | Eligible operations | Random job arrival | DJSP | Tardiness | DQN |
| Chang et al. [16] | Self-designed | Self-designed rules | Random job arrival | DFJSP | Earliness penalty and tardiness | DDQN |
| Wang et al. [17] | Matrix | Dispatching rules | Random job arrival | DJSP | Premature completion and tardiness | Q-learning |
| Bouazza et al. [5] | Vector | Dispatching rules | Random job arrival | DFJSP | Makespan and total weighted completion time | Q-learning |
| Luo et al. [19] | Matrix | Operations | Random job arrival | DFJSP | Tardiness | DDQN |
| Luo et al. [20] | Matrix | Self-designed rules | Random job arrival | DJSP | Tardiness and machine utilization rate | DDQN |
| Shahrabi et al. [18] | Variables | Actions | Random job arrival | DJSP | Mean flow time | Q-learning |
| Ours | Transformer | Dispatching rules-DEA | Random job arrival | DJSP | Makespan | D3QPN |

- (1) This paper proposes an innovative deep reinforcement learning framework that handles both static scheduling and dynamic events. Taking the disjunctive graph as the state space and using a transformer model to extract graph features, the state is mapped to the most appropriate dispatching rule by a double dueling Q-network with prioritized experience replay (D3QPN), which selects the highest-priority job to execute. A reward function equivalent to minimizing the makespan is designed to evaluate the scheduling results.
- (2) This paper uses data envelopment analysis to select appropriate dispatching rules from the general dispatching rules to compose the action space, aiming to minimize the makespan and maximize machine utilization. It also proposes a dynamic target strategy with an elite strategy, in which an objective function $\mathcal{L}$ with values between 0 and 1 is introduced. The experimental results show that this strategy improves scheduling performance by 15.93%.
- (3) Taking the OR-Library as the dataset, this paper comprehensively compares the proposed method with various dispatching rules, the genetic algorithm (GA), and other reinforcement learning methods. The experimental results show that the proposed method achieves the best scheduling performance among the compared methods.

## 2. Problem Formulation

#### 2.1. DJSP Description

- (1) Each machine can perform only one operation at a time;
- (2) Each operation of a job can be performed by only one machine at a time;
- (3) All operations of the same job must be performed in a predetermined order;
- (4) An operation that has started cannot be interrupted or terminated;
- (5) Transfer and setup times between machines are ignored.
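Constraints (1)–(4) can be checked mechanically on a candidate schedule. The sketch below uses a hypothetical minimal encoding (a job is an ordered list of (machine, processing time) pairs, and a schedule maps each operation to a start time); it is an illustration, not the paper's implementation.

```python
# Hypothetical minimal encoding of a job-shop instance: job j is an ordered
# list of (machine, processing_time) pairs, which enforces constraint (3).
jobs = [
    [(0, 3), (1, 2)],   # job 0: machine 0 for 3 time units, then machine 1 for 2
    [(1, 4), (0, 1)],   # job 1: machine 1 for 4 time units, then machine 0 for 1
]

def is_feasible(schedule, jobs):
    """Check constraints (1)-(4). `schedule` maps (job, op_index) -> start time.
    Operations run to completion (no preemption), so each occupies a fixed
    [start, start + p) interval on its machine."""
    intervals = {}  # machine -> list of (start, end) intervals
    for j, ops in enumerate(jobs):
        prev_end = 0
        for k, (m, p) in enumerate(ops):
            start = schedule[(j, k)]
            if start < prev_end:        # constraint (3): precedence within the job
                return False
            prev_end = start + p
            intervals.setdefault(m, []).append((start, start + p))
    for ivs in intervals.values():      # constraints (1)-(2): no overlap per machine
        ivs.sort()
        for (s1, e1), (s2, e2) in zip(ivs, ivs[1:]):
            if s2 < e1:
                return False
    return True
```

For example, `is_feasible({(0, 0): 0, (0, 1): 4, (1, 0): 0, (1, 1): 4}, jobs)` holds, while starting job 0's second operation at time 2 violates precedence and is rejected.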

#### 2.2. MDP Formulation for DJSP

#### 2.3. Disjunctive Graph

## 3. Methodology

#### 3.1. Overall Framework

**Algorithm 1** Proposed framework: D3QPN and transformer

Input: Environment and randomly initialized network weights

1. Initialize the Feature Extraction module, replay buffer, prioritized replay exponent, and minibatch size
2. Formulate the DJSP as an MDP $\left(S, A, P, R, \gamma\right)$
3. for epoch $= 1, 2, \dots, max\_epoch$ do
4. Extract the state $s_t$ from the disjunctive graph using the Feature Extraction module
5. Select the dispatching rules using DEA
6. for schedule cycle $i = 1, 2, \dots, k$ do
7. Execute dispatching rule $a_t$ and observe the new disjunctive graph
8. Extract the state $s_{t+1}$ from the new disjunctive graph
9. end for
10. Store $(s_t, a_t, r_t, s_{t+1})$ in the replay buffer with maximal priority
11. Sample a minibatch of transitions according to priority
12. Update the learning algorithm and the Feature Extraction module using Algorithm 2
13. end for

Output: The learned DRL module and the Feature Extraction module
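The control flow of Algorithm 1 can be sketched as a plain training loop. Everything below is a placeholder: `ToyEnv`, `extract_features`, and the random action choice stand in for the scheduling environment, the transformer module, and the D3QPN policy, which the paper defines separately.

```python
import random

class ToyEnv:
    """Stand-in for the scheduling environment: the 'disjunctive graph' is
    just a step counter, and an episode ends after five decisions."""
    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        reward = -1.0                    # e.g., a per-step makespan penalty
        done = self.t >= 5
        return self.t, reward, done

def extract_features(graph):
    # Placeholder for the transformer Feature Extraction module.
    return float(graph)

replay_buffer = []
env = ToyEnv()
for epoch in range(3):                   # max_epoch
    s = env.reset()
    done = False
    while not done:
        state = extract_features(s)
        a = random.randrange(12)         # choose one of the 12 DEA-selected rules
        s_next, r, done = env.step(a)
        replay_buffer.append((state, a, r, extract_features(s_next)))
        s = s_next
# A learning step (Algorithm 2) would periodically sample minibatches from
# replay_buffer by priority and update the Q-network and feature extractor.
```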

#### 3.2. D3QN with Prioritized Experience Replay

**Algorithm 2** The training procedure of D3QPN

1. Initialize minibatch size $k$, step size $\eta$, exponents $\alpha$ and $\beta$, replay memory $D$, replay period $K$ and capacity $N$, budget $T$.
2. Initialize target network $\widehat{Q}$ with weights $\theta^{-} = \theta$.
3. Initialize the Feature Extraction module.
4. for $e = 1$ to $M$ do
5. Reset the schedule and observe state $s_1$
6. for $t = 1$ to $T$ do
7. Select and execute action $a_t$ according to the proposed strategy.
8. Observe reward $r_t$ and next state $s_{t+1}$.
9. Store transition $\left(s_t, a_t, r_t, s_{t+1}\right)$ in $D$ with maximal priority $p_t = \max_{i<t} p_i$.
10. if $t \equiv 0 \bmod K$ then
11. for $j = 1$ to $k$ do
12. Sample transition $j$ with probability $P(j)$ and compute its importance-sampling weight $w_j = \left(N \cdot P(j)\right)^{-\beta} / \max_i w_i$.
13. Set $y_j = \begin{cases} r_j & \text{terminal} \\ r_j + \gamma \widehat{Q}\left(s_{j+1}, \mathrm{argmax}_a Q\left(s_{j+1}, a; \theta\right); \theta^{-}\right) & \text{non-terminal} \end{cases}$
14. Compute the TD error $\delta_j = y_j - Q\left(s_j, a_j; \theta\right)$.
15. Update transition priority $p_j \leftarrow \left|\delta_j\right|$.
16. Accumulate weight change $\Delta \leftarrow \Delta + w_j \cdot \delta_j \cdot \nabla_{\theta} Q\left(s_j, a_j\right)$.
17. end for
18. Update weights $\theta \leftarrow \theta + \eta \cdot \Delta$; reset $\Delta = 0$.
19. Every $C$ steps, reset $\widehat{Q} = Q$.
20. end if
21. end for
22. end for
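Two pieces of Algorithm 2 lend themselves to a compact sketch: proportional prioritized sampling with importance-sampling weights (steps 11 and 12) and the double-Q target (step 13). The toy lookup-table Q-functions and stored priorities below are illustrative stand-ins for the networks, not the paper's code.

```python
import random

alpha, beta, gamma = 0.6, 0.4, 0.95       # exponents and discount from the paper
priorities = [1.0, 0.5, 2.0]              # |TD errors| of three stored transitions

def sample_index():
    """Sample transition j with probability P(j) proportional to p_j^alpha and
    return it with its (unnormalised) weight (N * P(j))^(-beta)."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    probs = [p / total for p in scaled]
    r, acc = random.random(), 0.0
    for j, pj in enumerate(probs):
        acc += pj
        if r <= acc:
            return j, (len(priorities) * pj) ** (-beta)
    return len(probs) - 1, (len(priorities) * probs[-1]) ** (-beta)

def double_q_target(r, s_next, Q, Q_target, terminal):
    """Step 13: the action is chosen by the online network but evaluated by
    the target network, which reduces over-estimation of Q-values."""
    if terminal:
        return r
    a_star = max(Q[s_next], key=Q[s_next].get)   # argmax_a Q(s_{j+1}, a; theta)
    return r + gamma * Q_target[s_next][a_star]  # Q_hat(s_{j+1}, a*; theta^-)
```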

#### 3.3. Feature Extraction—Transformer

#### 3.4. Action Space under Data Envelopment Analysis

Obj$_1$ and Obj$_2$ denote the completion time and machine load related to each dispatching rule, respectively, and the strongly efficient and weakly efficient dispatching rules are obtained according to their efficiency scores. This paper evaluated and ranked 20 general dispatching rules through data envelopment analysis and finally selected 12 optimal dispatching rules (FIFO, LSO, LPT, SRPT, SSO, MOR, FHALF, NINQ, WINQ, LIFO, LRPT, and LOR), as shown in Figure 4.
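As a rough illustration of the DEA scoring idea: in the degenerate case of a single input to be minimized (completion time) and a single output to be maximized (machine utilization), the CCR efficiency score reduces to each rule's output/input ratio normalized by the best ratio. The figures below are invented for illustration; the paper's full model uses the multi-input/multi-output DEA formulation over Obj$_1$ and Obj$_2$.

```python
# Illustrative inputs: (completion time, machine utilisation) per rule.
rules = {
    "FIFO": (100.0, 0.80),
    "SPT":  (90.0, 0.85),
    "LPT":  (120.0, 0.75),
}

def dea_scores(rules):
    """Single-input/single-output CCR efficiency: each rule's output/input
    ratio divided by the best observed ratio, so scores lie in (0, 1]."""
    ratios = {name: out / inp for name, (inp, out) in rules.items()}
    best = max(ratios.values())
    return {name: r / best for name, r in ratios.items()}

scores = dea_scores(rules)
strongly_efficient = [n for n, s in scores.items() if s >= 0.999]
```

With these made-up numbers, SPT attains the best utilisation-per-unit-time ratio and is the only strongly efficient rule; the others receive scores strictly below 1.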

#### 3.5. Reward Function

#### 3.6. Strategy

## 4. Experiments

#### 4.1. Dataset

#### 4.2. Training Environment

#### 4.3. Performance Evaluation

- (1) A total of 14 dispatching rules: FIFO, LIFO, LPT, SPT, LRPT, SRPT, LSO, SSO, LOR, MOR, LHALF, FHALF, NINQ, and WINQ;
- (2) GA: a search algorithm that simulates the principles of natural selection and genetic inheritance. Its basic idea is to find the optimal solution by simulating the evolutionary process; the core steps are selection, crossover, and mutation. Using the genetic algorithm to solve the job-shop scheduling problem requires defining a coding strategy, a fitness function, and the genetic operations;
- (3) Advantage actor-critic (A2C): a reinforcement learning method based on the policy gradient and value function, usually used for problems with continuous action spaces and high-dimensional state spaces. The algorithm combines an actor network and a critic network: the actor generates actions, the critic estimates the state-value or action-value function, and both networks are trained with the policy gradient algorithm;
- (4) Proximal policy optimization (PPO): has some advantages over the vanilla policy gradient and trust region policy optimization (TRPO). It alternates between sampling data and optimizing a surrogate objective function with stochastic gradient ascent. Whereas the standard policy gradient method performs one gradient update per data sample, PPO's objective function enables multiple epochs of minibatch updates;
- (5) DQN: one of the first widely used deep reinforcement learning algorithms, proposed by the DeepMind research team in 2013 by combining deep neural networks with the classical reinforcement learning algorithm Q-learning. It can handle high-dimensional, continuous state spaces and has the ability to learn and plan;
- (6) Rainbow DQN: a deep reinforcement learning method proposed by DeepMind that integrates six extensions of the DQN on a single agent: double DQN, dueling DQN, prioritized replay, multistep learning, distributional RL, and noisy networks.

#### 4.4. Ablation Experiment

#### 4.5. Feature Extraction Evaluation

#### 4.6. Action Space Evaluation

#### 4.7. Strategy Evaluation

## 5. Conclusions

## Author Contributions

## Funding

## Data Availability Statement

## Conflicts of Interest

## References

- Garey, M.R.; Johnson, D.S.; Sethi, R. The Complexity of Flowshop and Jobshop Scheduling. Math. Oper. Res.
**1976**, 1, 117–129. [Google Scholar] [CrossRef] - Haupt, R. A survey of priority rule-based scheduling. OR Spectr.
**1989**, 11, 3–16. [Google Scholar] [CrossRef] - Sutton, R.; Barto, A. Reinforcement learning: An introduction (Adaptive computation and machine learning). IEEE Trans. Neural Netw.
**1998**, 9, 1054. [Google Scholar] [CrossRef] - Samsonov, V.; Kemmerling, M.; Paegert, M.; Lütticke, D.; Sauermann, F.; Gützlaff, A.; Schuh, G.; Meisen, T. Manufacturing Control in Job Shop Environments with Reinforcement Learning. In Proceedings of the 13th International Conference on Agents and Artificial Intelligence, Virtual, 4–6 February 2021. [Google Scholar] [CrossRef]
- Bouazza, W.; Sallez, Y.; Beldjilali, B. A distributed approach solving partially flexible job-shop scheduling problem with a Q-learning effect. IFAC Pap.
**2017**, 50, 15890–15895. [Google Scholar] [CrossRef] - He, Z.; Tran, K.P.; Thomassey, S.; Zeng, X.; Xu, J.; Yi, C. Multi-Objective Optimization of the Textile Manufacturing Process Using Deep-Q-Network Based Multi-Agent Reinforcement Learning. J. Manuf. Syst.
**2020**, 62, 939–949. [Google Scholar] [CrossRef] - Wang, X.; Zhang, L.; Lin, T.; Zhao, C.; Wang, K.; Chen, Z. Solving job scheduling problems in a resource preemption environment with multi-agent reinforcement learning. Robot. Comput. Integr. Manuf.
**2022**, 77, 102324. [Google Scholar] [CrossRef] - Lang, S.; Behrendt, F.; Lanzerath, N.; Reggelin, T.; Müller, M. Integration of Deep Reinforcement Learning and Discrete-Event Simulation for Real-Time Scheduling of a Flexible Job Shop Production. In Proceedings of the 2020 Winter Simulation Conference (WSC), Orlando, FL, USA, 14–18 December 2020. [Google Scholar] [CrossRef]
- Chen, R.; Yang, B.; Li, S.; Wang, S. A Self-Learning Genetic Algorithm based on Reinforcement Learning for Flexible Job-shop Scheduling Problem. Comput. Ind. Eng.
**2020**, 149, 106778. [Google Scholar] [CrossRef] - Park, J.; Chun, J.; Kim, S.H.; Kim, Y.; Park, J. Learning to schedule job-shop problems: Representation and policy learning using graph neural network and reinforcement learning. Int. J. Prod. Res.
**2021**, 59, 3360–3377. [Google Scholar] [CrossRef] - Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing Atari with Deep Reinforcement Learning. arXiv
**2013**, arXiv:1312.5602. [Google Scholar] [CrossRef] - Baer, S.; Turner, D.; Mohanty, P.; Samsonov, V.; Bakakeu, R.; Meisen, T. Multi Agent Deep Q-Network Approach for Online Job Shop Scheduling in Flexible Manufacturing. In Proceedings of the ICMSMM 2020: International Conference on Manufacturing System and Multiple Machines, Opfikon, Switzerland, 13–14 January 2020. [Google Scholar]
- Zhao, M.; Li, X.; Gao, L.; Wang, L.; Xiao, M. An improved Q-learning based rescheduling method for flexible job-shops with machine failures. In Proceedings of the 2019 IEEE 15th International Conference on Automation Science and Engineering (CASE), Vancouver, BC, Canada, 22–26 August 2019. [Google Scholar] [CrossRef]
- Luo, B.; Wang, S.; Yang, B.; Yi, L. An improved deep reinforcement learning approach for the dynamic Job-Shop scheduling problem with random job arrivals. In Proceedings of the 4th International Conference on Advanced Algorithms and Control Engineering, Online, 21–23 February 2020; IOP Publishing Press: Bristol, UK, 2021; pp. 1–8. [Google Scholar]
- Turgut, Y.; Bozdag, C.E. Deep Q-Network Model for Dynamic Job Shop Scheduling Problem Based on Discrete Event Simulation. In Proceedings of the 2020 Winter Simulation Conference (WSC), Orlando, FL, USA, 14–18 December 2020. [Google Scholar] [CrossRef]
- Chang, J.; Yu, D.; Hu, Y.; He, W.; Yu, H. Deep Reinforcement Learning for Dynamic Flexible Job Shop Scheduling with Random Job Arrival. Processes
**2022**, 10, 760. [Google Scholar] [CrossRef] - Wang, Y.F. Adaptive job shop scheduling strategy based on weighted Q-learning algorithm. J. Intell. Manuf.
**2018**, 31, 417–432. [Google Scholar] [CrossRef] - Shahrabi, J.; Adibi, M.A.; Mahootchi, M. A reinforcement learning approach to parameter estimation in dynamic job shop scheduling. Comput. Ind. Eng.
**2017**, 110, 75–82. [Google Scholar] [CrossRef] - Luo, S. Dynamic scheduling for flexible job shop with new job insertions by deep reinforcement learning. Appl. Soft Comput.
**2020**, 91, 106208. [Google Scholar] [CrossRef] - Luo, S.; Zhang, L.; Fan, Y. Dynamic multi-objective scheduling for flexible job shop by deep reinforcement learning. Comput. Ind. Eng.
**2021**, 159, 107489. [Google Scholar] [CrossRef] - Braglia, M.; Petroni, A. Data envelopment analysis for dispatching rule selection. Prod. Plan. Control. Manag. Oper.
**1999**, 10, 454–461. [Google Scholar] [CrossRef] - Oukil, A.; El-Bouri, A. Ranking dispatching rules in multiobjective dynamic flow shop scheduling: A multi-faceted perspective. Int. J. Prod. Res.
**2019**, 59, 388–411. [Google Scholar] [CrossRef] - Oukil, A.; El-Bouri, A.; Emrouznejad, A. Energy-aware job scheduling in a multi-objective production environment—An integrated DEA-OWA model. Comput. Ind. Eng.
**2022**, 168, 108065. [Google Scholar] [CrossRef] - Bellman, R. Dynamic Programming. Science
**1966**, 153, 34–37. [Google Scholar] [CrossRef] - Demange, M.; Paschos, V.T. Extremal values of a combinatorial optimization problem and polynomial approximation. Mathématiques Inform. Sci. Hum.
**1996**, 135, 51–66. [Google Scholar] [CrossRef] - Watkins, C.J.C.H. Learning From Delayed Rewards. Robot. Auton. Syst.
**1989**, 15, 233–235. [Google Scholar] [CrossRef] - Hasselt, H.V.; Guez, A.; Silver, D. Deep Reinforcement Learning with Double Q-learning. arXiv
**2015**, arXiv:1509.06461. [Google Scholar] [CrossRef] - Wang, Z.; Freitas, N.D.; Lanctot, M. Dueling Network Architectures for Deep Reinforcement Learning. arXiv
**2015**, arXiv:1511.06581. [Google Scholar] [CrossRef] - Schaul, T.; Quan, J.; Antonoglou, I.; Silver, D. Prioritized Experience Replay. arXiv
**2015**, arXiv:1511.05952. [Google Scholar] - Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv
**2017**, arXiv:1706.03762. [Google Scholar] [CrossRef] - Charnes, A.; Cooper, W.W.; Rhodes, E. Measuring the efficiency of decision making units. Eur. J. Oper. Res.
**1978**, 2, 429–444. [Google Scholar] [CrossRef] - Watkins, C.J.C.H.; Dayan, P. Q-learning. Mach. Learn.
**1992**, 8, 279–292. [Google Scholar] [CrossRef] - Auer, P.; Cesa-Bianchi, N.; Fischer, P. Finite-time Analysis of the Multiarmed Bandit Problem. Mach. Learn.
**2002**, 47, 235–256. [Google Scholar] [CrossRef] - Thompson, W.R. On the Likelihood That One Unknown Probability Exceeds Another in View of the Evidence of Two Samples. Biometrika
**1933**, 25, 285–294. [Google Scholar] [CrossRef] - Osband, I.; Blundell, C.; Pritzel, A.; Van Roy, B. Deep Exploration via Bootstrapped DQN. arXiv
**2016**, arXiv:1602.04621. [Google Scholar] [CrossRef] - Fortunato, M.; Azar, M.G.; Piot, B.; Menick, J.; Osband, I.; Graves, A.; Mnih, V.; Munos, R.; Hassabis, D.; Pietquin, O.; et al. Noisy Networks for Exploration. arXiv
**2017**, arXiv:1706.10295. [Google Scholar] [CrossRef]

| Dispatching Rule | Description |
|---|---|
| FIFO | First in, first out |
| LPT | The longer the processing time, the higher the priority |
| LRPT | Select the job with the longest remaining processing time |
| LSO | Select the job whose next operation has the longest processing time |
| LOR | Select the job with the fewest remaining operations |
| LHALF | Select a job with less than half of its total operations still unperformed |
| NINQ | Select the job whose next machine has the shortest queue |
| WINQ | Select the job whose next machine has the least work in its queue |
| LIFO | Last in, first out |
| SPT | The shorter the processing time, the higher the priority |
| SRPT | Select the job with the shortest remaining processing time |
| SSO | Select the job whose next operation has the shortest processing time |
| MOR | Select the job with the most remaining operations |
| FHALF | Select a job with more than half of its total operations still unperformed |
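Several of the rules in the table reduce to a priority key over the jobs currently queued at a machine. A minimal sketch of this view (the job encoding and field names are illustrative, not the paper's code):

```python
# Each rule is a key function over a job dict; `min` picks the job whose key
# is smallest, so "longest ... first" rules negate their quantity.
def spt(job):  return job["ops"][0][1]               # shortest next processing time
def lpt(job):  return -job["ops"][0][1]              # longest next processing time
def srpt(job): return sum(p for _, p in job["ops"])  # shortest remaining work
def mor(job):  return -len(job["ops"])               # most operations remaining

def pick(queue, rule):
    """Select the highest-priority job from `queue` under `rule`."""
    return min(queue, key=rule)

queue = [
    {"id": 0, "ops": [(0, 3), (1, 2)]},  # remaining (machine, time) pairs
    {"id": 1, "ops": [(1, 4)]},
]
```

On this queue, `pick(queue, spt)` selects job 0 (next processing time 3 vs. 4), while `pick(queue, srpt)` selects job 1 (remaining work 4 vs. 5).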

| Hyperparameter | Value |
|---|---|
| Number of episodes | 8000 |
| Schedule cycle | 10 |
| Buffer size | 100,000 |
| Discount factor γ | 0.95 |
| Target Q update frequency | 200 |
| Batch size | 128 |
| Prioritized replay α | 0.6 |
| Prioritized replay β | 0.4 |
| Number of layers of feature extraction module | 3 |
| Number of attention heads | 5 |
| Learning rate | 10^{−5} |

| Instance | Scale | Ours | FIFO | LIFO | LPT | SPT | LRPT | SRPT | MOR | LOR | LSO | SSO | LHALF | FHALF | NINQ | WINQ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ft06 | 6 × 6 | 59 | 68 | 70 | 77 | 88 | 68 | 83 | 61 | 68 | 61 | 87 | 79 | 66 | 67 | 68 |
| ft10 | 10 × 10 | 1052 | 1262 | 1281 | 1295 | 1074 | 1190 | 1262 | 1163 | 1352 | 1424 | 1471 | 1366 | 1253 | 1341 | 1262 |
| swv01 | 20 × 10 | 1718 | 1889 | 2123 | 2145 | 1737 | 1961 | 1751 | 1971 | 1838 | 1997 | 2027 | 1989 | 2011 | 1977 | 1889 |
| swv06 | 20 × 15 | 2042 | 2243 | 2331 | 2542 | 2140 | 2327 | 2360 | 2287 | 2383 | 2488 | 2452 | 2684 | 2751 | 2133 | 2243 |
| abz5 | 10 × 10 | 1338 | 1388 | 1605 | 1586 | 1352 | 1483 | 1624 | 1336 | 1559 | 1725 | 1855 | 1735 | 1466 | 1564 | 1370 |
| abz7 | 20 × 15 | 806 | 902 | 946 | 903 | 849 | 918 | 923 | 775 | 933 | 971 | 1101 | 984 | 951 | 914 | 900 |
| la01 | 10 × 5 | 695 | 830 | 764 | 822 | 751 | 835 | 933 | 763 | 941 | 808 | 843 | 865 | 896 | 836 | 830 |
| la06 | 15 × 5 | 926 | 1078 | 1031 | 1125 | 1200 | 1098 | 1012 | 926 | 1095 | 1011 | 1211 | 1234 | 1053 | 1134 | 1078 |
| la21 | 15 × 10 | 1198 | 1417 | 1479 | 1451 | 1324 | 1278 | 1541 | 1251 | 1547 | 1419 | 1965 | 1763 | 1877 | 1563 | 1417 |
| la31 | 30 × 10 | 1764 | 2148 | 2256 | 2245 | 1951 | 2083 | 2270 | 1836 | 2129 | 2147 | 2296 | 2345 | 1988 | 2238 | 2148 |
| orb01 | 10 × 10 | 1163 | 1456 | 1495 | 1410 | 1478 | 1308 | 1458 | 1307 | 1410 | 1383 | 1294 | 1326 | 1671 | 1504 | 1456 |
| orb02 | 10 × 10 | 933 | 1157 | 1264 | 1293 | 1175 | 1067 | 1166 | 1047 | 1194 | 1265 | 1347 | 1096 | 1255 | 1244 | 1157 |
| yn01 | 20 × 20 | 1009 | 1158 | 1177 | 1115 | 1196 | 1177 | 1188 | 1045 | 1205 | 1314 | 1361 | 1198 | 1358 | 1146 | 1123 |
| yn02 | 20 × 20 | 1074 | 1356 | 1283 | 1195 | 1256 | 1225 | 1199 | 1098 | 1446 | 1364 | 1679 | 1612 | 1421 | 1489 | 1356 |

| Instance | Scale | GA | PPO | A2C | DQN | Rainbow | Ours |
|---|---|---|---|---|---|---|---|
| ft06 | 6 × 6 | 59 | 67 | 69 | 65 | 63 | 59 |
| ft10 | 10 × 10 | 1061 | 1139 | 1276 | 1223 | 1231 | 1052 |
| swv01 | 20 × 10 | 2331 | 1986 | 1979 | 1962 | 2061 | 1718 |
| swv06 | 20 × 15 | 2971 | 2354 | 2369 | 2311 | 2333 | 2042 |
| abz5 | 10 × 10 | 1377 | 1755 | 1477 | 1635 | 1552 | 1338 |
| abz7 | 20 × 15 | 807 | 985 | 961 | 897 | 904 | 806 |
| la01 | 10 × 5 | 741 | 828 | 830 | 785 | 935 | 695 |
| la06 | 15 × 5 | 994 | 1021 | 1043 | 984 | 1066 | 926 |
| la21 | 15 × 10 | 1511 | 1345 | 1334 | 1347 | 1494 | 1198 |
| la31 | 30 × 10 | 2443 | 2047 | 2075 | 1958 | 1846 | 1764 |
| orb01 | 10 × 10 | 1463 | 1343 | 1344 | 1327 | 1473 | 1165 |
| orb02 | 10 × 10 | 1010 | 1311 | 1154 | 1230 | 1098 | 933 |
| yn01 | 20 × 20 | 1488 | 1132 | 1250 | 1109 | 1110 | 1009 |
| yn02 | 20 × 20 | 1131 | 1261 | 1312 | 1455 | 1354 | 1074 |

| Instance | DQN | DDQN | Dueling DQN | Prioritized Replay | Multistep Learning | Distributional RL | Noisy Net | Ours |
|---|---|---|---|---|---|---|---|---|
| ft06 | 65 | 63 | 61 | 62 | 62 | 64 | 60 | 59 |
| ft10 | 1223 | 1310 | 1307 | 1321 | 1421 | 1334 | 1294 | 1052 |
| swv01 | 1962 | 1812 | 1785 | 1801 | 1894 | 1981 | 1794 | 1718 |
| swv06 | 2311 | 2216 | 2177 | 2183 | 2197 | 2431 | 2274 | 2042 |
| abz5 | 1635 | 1469 | 1397 | 1401 | 1557 | 1463 | 1576 | 1338 |
| abz7 | 897 | 952 | 904 | 881 | 991 | 895 | 975 | 806 |
| la01 | 785 | 761 | 752 | 779 | 757 | 804 | 743 | 695 |
| la06 | 984 | 973 | 943 | 951 | 1064 | 972 | 958 | 926 |
| la21 | 1347 | 1254 | 1269 | 1240 | 1367 | 1575 | 1276 | 1198 |
| la31 | 1958 | 1951 | 1901 | 1876 | 1864 | 1934 | 1802 | 1764 |
| orb01 | 1327 | 1371 | 1298 | 1366 | 1365 | 1631 | 1309 | 1165 |
| orb02 | 1230 | 993 | 964 | 959 | 1074 | 1361 | 1001 | 933 |
| yn01 | 1109 | 1127 | 1118 | 1123 | 1216 | 1398 | 1109 | 1009 |
| yn02 | 1455 | 1278 | 1275 | 1307 | 1254 | 1462 | 1241 | 1074 |
| Average | 1306 | 1252 | 1225 | 1233 | 1292 | 1379 | 1244 | 1127 |

| Instance | Dispatching Rules-DEA | Candidate Dispatching Rules | Eligible Operations | Unconstrained Operations |
|---|---|---|---|---|
| ft06 | 59 | 68 | 164 | 5600 |
| ft10 | 1052 | 1352 | 2163 | 9520 |
| swv01 | 1718 | 1838 | 3596 | 27,026 |
| swv06 | 2042 | 2383 | 4124 | 35,196 |
| abz5 | 1338 | 1559 | 2764 | 19,650 |
| abz7 | 806 | 933 | 1726 | 6504 |
| la01 | 695 | 941 | 1367 | 5699 |
| la06 | 926 | 1095 | 2063 | 7955 |
| la21 | 1198 | 1547 | 2587 | 11,220 |
| la31 | 1894 | 2129 | 3759 | 29,853 |
| orb01 | 1165 | 1410 | 2431 | 13,552 |
| orb02 | 933 | 1194 | 1925 | 8954 |
| yn01 | 1009 | 1205 | 2165 | 9860 |
| yn02 | 1074 | 1446 | 2335 | 10,745 |


© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Song, L.; Li, Y.; Xu, J.
Dynamic Job-Shop Scheduling Based on Transformer and Deep Reinforcement Learning. *Processes* **2023**, *11*, 3434.
https://doi.org/10.3390/pr11123434
