# Generally Applicable Q-Table Compression Method and Its Application for Constrained Stochastic Graph Traversal Optimization Problems


## Abstract


## 1. Introduction

- We define the class of constrained stochastic graph traversal problems by identifying several real-life problems that belong to it. Although pairwise relationships between the individual problems were already known, we argue that their joint investigation is worthwhile.
- We present a general end-to-end process for collecting observations, creating and fine-tuning the discretization function based on DBSCAN clustering results, building a Q-table, and using it for optimal decisions. We call our framework the Q-table compression method.
- Our solution delivers a human-interpretable model without a complex preliminary study of distance correlations or further hidden dependencies.
- We demonstrate the usability of our method in three selected use cases belonging to the problem class of constrained stochastic graph traversal problems: a constrained stochastic shortest pathfinding problem, a constrained stochastic Hamiltonian pathfinding problem, and a stochastic disassembly line balancing problem. We also verify the performance of the Q-table compression method against a simple grid-based discretization method.

## 2. Constrained Graph Traversal Problem Formulation

- Every possible loop-free route $v\subseteq \mathcal{V}$ defines a vertex $\widehat{v}$ in the transformed graph, including ${\widehat{v}}_{0}=\left(\right)$, which represents the empty subset.
- ${\widehat{e}}_{ij}=({\widehat{v}}_{i},{\widehat{v}}_{j})$ is an edge in the transformed graph if and only if:
  - The path in $\mathcal{G}$ represented by ${\widehat{v}}_{i}\subset \mathcal{V}$ is a sub-path of the one represented by ${\widehat{v}}_{j}\subset \mathcal{V}$: ${\widehat{v}}_{i}\subset {\widehat{v}}_{j}$;
  - The path of ${\widehat{v}}_{j}$ is exactly one edge longer than the path of ${\widehat{v}}_{i}$: ${\widehat{v}}_{i}\cup \left({v}_{y}\right)={\widehat{v}}_{j}$, where ${v}_{y}\in \mathcal{V}$;
  - Marking by ${v}_{x}\in \mathcal{V}$ the last element of ${\widehat{v}}_{i}$, the additional edge of path ${\widehat{v}}_{j}$ is appended at the end of the path of ${\widehat{v}}_{i}$: $({v}_{x},{v}_{y})\in \mathcal{E}$.
- The initial vertex can be freely chosen from all vertices $v\in \mathcal{V}$: $(\left(\right),\left(v\right))\in \widehat{\mathcal{E}}$ for all $v\in \mathcal{V}$.

- As in the classical Hamiltonian pathfinding problem, our formulation has no distances in the original graph $\mathcal{G}$; since the SPP formulation requires edge distances, we assign a constant value of 1 to every existing edge $({\widehat{v}}_{i},{\widehat{v}}_{j})$ of the transformed graph $\widehat{\mathcal{G}}$. In real-world problems, however, distances are given, $\widehat{d}({\widehat{v}}_{i},{\widehat{v}}_{j})=d({v}_{x},{v}_{y})$, and we are interested in solving the shortest Hamiltonian pathfinding problem (SHPP).
- Finally, an optimal solution to the shortest pathfinding problem on $\widehat{\mathcal{G}}$ provides the shortest Hamiltonian path, with objective function $\min\left|\mathcal{H}\right|=\min{\sum}_{i=1}^{k}{d}_{i}$.
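The reduction above can be sketched in a few lines of Python. This is an illustrative implementation, not the authors' code: transformed-graph vertices are tuples of visited vertices, edges from the empty path are given distance 0 (an assumption for the free choice of initial vertex), and Dijkstra's algorithm on $\widehat{\mathcal{G}}$ returns the shortest Hamiltonian path.

```python
import heapq

def transform(V, E, d):
    """Build the transformed graph: vertices are loop-free paths (tuples);
    an edge extends a path by exactly one vertex appended at its end.
    V: vertices, E: set of directed edges (vx, vy), d: (vx, vy) -> distance."""
    V_hat = {()}                                  # start from the empty path
    E_hat = {}                                    # (path_i, path_j) -> distance
    frontier = [()]
    while frontier:
        path = frontier.pop()
        if path == ():
            # the initial vertex can be freely chosen (distance 0 assumed)
            candidates = [(v, 0.0) for v in V]
        else:
            vx = path[-1]
            candidates = [(vy, d[(vx, vy)]) for vy in V
                          if vy not in path and (vx, vy) in E]
        for vy, dist in candidates:
            new_path = path + (vy,)
            E_hat[(path, new_path)] = dist
            if new_path not in V_hat:
                V_hat.add(new_path)
                frontier.append(new_path)
    return V_hat, E_hat

def shortest_hamiltonian_path(V, E, d):
    """Dijkstra on the transformed graph from the empty path; any path
    visiting every vertex of the original graph is a valid target."""
    V = list(V)
    _, E_hat = transform(V, E, d)
    adj = {}
    for (p, q), w in E_hat.items():
        adj.setdefault(p, []).append((q, w))
    dist = {(): 0.0}
    pq = [(0.0, ())]
    while pq:
        du, u = heapq.heappop(pq)
        if du > dist[u]:
            continue
        if len(u) == len(V):                      # first full path popped is optimal
            return du, u
        for v, w in adj.get(u, []):
            if du + w < dist.get(v, float("inf")):
                dist[v] = du + w
                heapq.heappush(pq, (du + w, v))
    return None                                   # no Hamiltonian path exists
```

Note that $|\widehat{\mathcal{V}}|$ grows exponentially with $|\mathcal{V}|$, which is exactly why the state-compression discussed later matters.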

## 3. Q-Table Compression Method

#### 3.1. Reinforcement Learning (RL)

#### 3.2. Q-Table Compression Process

- Let us assume that the optimal action is determined for each and every ${o}_{j}^{i}$ and denote it by ${a}_{j}^{i}$. If ${a}_{j}^{i}\ne {a}_{j}^{k}$, then $\mathcal{F}\left({o}_{j}^{i}\right)\ne \mathcal{F}\left({o}_{j}^{k}\right)$. In practice, this means that the mapping function merges observation states only if their optimal actions do not differ.
- The mapping function $\mathcal{F}$ should be consistent in the sense that if action ${a}_{j}^{i}$ moves the agent from observation state ${o}_{j}^{i}$ to ${o}_{j+1}^{i}$, then action ${a}_{j}^{i}$ should move the agent from $\mathcal{F}\left({o}_{j}^{i}\right)$ to $\mathcal{F}\left({o}_{j+1}^{i}\right)$.
- $\left|S\right|\ll \left|O\right|$; in other words, the size of the state representation $S$ should be significantly smaller than the size of the observation space $O$. The smaller $S$ is, the better the representation.
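The first requirement can be checked mechanically for any candidate mapping. The sketch below is illustrative (the name `is_valid_compression` and the dict-based encoding are ours, not from the paper): it verifies that $\mathcal{F}$ never merges two observations whose optimal actions differ.

```python
def is_valid_compression(F, optimal_action):
    """Check the merge condition: F may map two observations to the same
    state only if their optimal actions agree.
    F: callable observation -> state; optimal_action: observation -> action."""
    rep_action = {}                       # state -> the one action it represents
    for o, a in optimal_action.items():
        s = F(o)
        if s in rep_action and rep_action[s] != a:
            return False                  # two different optimal actions merged
        rep_action[s] = a
    return True
```

For example, with `F = lambda o: o // 10`, observations 1 and 2 merge into state 0, which is valid only while they share the same optimal action.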

`get state`). The learning process is based on the $\epsilon$-greedy algorithm: the agent decides whether to take a random action or the best-known action (`choose action method`). The agent randomly chooses an action from the feasible actions with probability $\epsilon$ (`get random action`), or chooses the optimal action from the feasible actions with probability $(1-\epsilon)$ (`get optimal action`). In the latter case, the agent first determines the expected cumulative reward to reach the target state for all feasible actions, and then chooses the action with the best total expected reward (if several actions share the best total expected reward, one of them is chosen at random). If the Q-table contains no relevant entry for the current state because the trajectory is undiscovered, the action selection falls back to the random method. The value of the parameter $\epsilon$ decreases linearly from 1 to 0 as the number of episodes approaches its predefined limit. We would like to highlight that restricting the set of potential actions improves the efficiency of the learning process compared to enabling any action and assigning a bad reward to infeasible actions.
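The $\epsilon$-greedy selection with the random fallback described above can be sketched as follows. This is an illustrative helper, not the authors' implementation; `q_table` is assumed to be a dict keyed by `(state, action)` pairs.

```python
import random

def choose_action(q_table, state, feasible_actions, epsilon):
    """Epsilon-greedy selection restricted to feasible actions.
    Falls back to a random feasible action when the Q-table has no
    entry for the current state (undiscovered trajectory)."""
    if random.random() < epsilon:
        return random.choice(feasible_actions)          # explore
    entries = {a: q_table[(state, a)] for a in feasible_actions
               if (state, a) in q_table}
    if not entries:                                     # undiscovered: fall back
        return random.choice(feasible_actions)
    best = max(entries.values())
    # break ties uniformly among the maximal-reward actions
    return random.choice([a for a, q in entries.items() if q == best])
```

The linear $\epsilon$ schedule from the text corresponds to calling this with `epsilon = 1 - i / l_sim` in episode `i`.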

`take action`). After that, the agent saves the quadruple (old state, action, new state, reward) into the observation history (`save observation`) and determines the new discretized state by applying the discretization function to the observation (`update state`). Finally, the agent checks whether the target state has been reached (`check exit criterion`). If not, the cycle repeats.

`check update criterion`), the agent queries the observation history (`get observation history`). To let the agent act adaptively, the history is accessed through a moving observation window, so that only the most relevant part of the history is used and old, invalid records are not processed. The major goal at this stage is to create a new or updated discretization ruleset that satisfies the requirements described above in this subsection and projects the mixed discrete-continuous observation space onto a discrete state space (`update discretization`). Denote by ${s}_{t}=({v}_{t},{u}_{t})\in \mathcal{V}\times \mathbf{R}$ an observation, where ${v}_{t}$ is the currently visited vertex of the graph and ${u}_{t}$ is the current range utilization. The agent processes the observations in a loop: it fixes the discrete part of the observation space (practically, a vertex) and queries all matching records $({v}_{i},{u}_{i})$ with ${v}_{i}={v}_{fixed}$. It then uses the DBSCAN algorithm to assign the continuous range utilization values of the observations to clusters. Denote by ${R}_{i}^{v}$ the $i$th range of the corresponding cluster and by ${k}_{v}$ the number of identified clusters of vertex $v$. The ranges must be non-overlapping, ${R}_{i}\cap {R}_{j}=\varnothing$ for all $i\ne j\in \{1,\dots ,{k}_{v}\}$, and they are widened to completely cover the potential value range: ${\bigcup}_{i=1}^{{k}_{v}}{R}_{i}=\mathbf{R}$. Figure 2 demonstrates the result of applying the DBSCAN algorithm to determine the clusters for two different vertices.
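For one-dimensional utilization values, the clustering-and-widening step can be illustrated with a simplified stand-in for DBSCAN (consecutive sorted values closer than `eps` join one cluster, roughly DBSCAN with `min_samples = 1`; the real method would use the full DBSCAN algorithm). Range boundaries are taken here as midpoints between adjacent clusters, which is one way, not necessarily the authors' way, of widening the clusters into a complete disjoint covering of $\mathbf{R}$.

```python
def cluster_1d(values, eps):
    """Simplified 1-D DBSCAN-style grouping: consecutive sorted values
    within eps of each other fall into the same cluster."""
    vs = sorted(values)
    clusters = [[vs[0]]]
    for v in vs[1:]:
        if v - clusters[-1][-1] <= eps:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters

def covering_ranges(clusters):
    """Widen clusters into disjoint ranges covering all of R: boundaries
    are midpoints between adjacent clusters, and the outermost ranges
    extend to +/- infinity. Returns (low, high, centroid) triples."""
    bounds = [float("-inf")]
    for left, right in zip(clusters, clusters[1:]):
        bounds.append((left[-1] + right[0]) / 2)
    bounds.append(float("inf"))
    return [(lo, hi, sum(c) / len(c))
            for lo, hi, c in zip(bounds, bounds[1:], clusters)]

def discretize(u, ranges):
    """Map a continuous utilization value to its cluster centroid (F_v)."""
    for lo, hi, centroid in ranges:
        if lo <= u < hi:
            return centroid
    return ranges[-1][2]
```

Because every range carries its centroid, the discretization function $\mathcal{F}_v$ reduces to a single lookup per observation.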

`discretize observation history`). Finally, the agent truncates the Q-table and rebuilds it from the discretized observation history using the discretized states (`update Q-table`). Algorithm 1 presents the Q-table compression method in pseudo-code.

```
Algorithm 1: Q-table compression reinforcement learning method

function Q-COMPRESSION(D, l_sim, l_range)
  inputs:
    D: n x n matrix                          ▹ represents the graph distances
    l_sim: constant parameter                ▹ defines the simulation length
    l_range: constant parameter              ▹ defines the single-range limit
                                             ▹ initialize learning parameters
  Q() <- [.]                                 ▹ initialize Q-table
  O() <- [.]                                 ▹ initialize observation history
  for i = 1, 2, ..., l_sim do                ▹ loop for iterating episodes
    (v, u)_curr <- (v_start, 0)              ▹ reset state (current vertex, range utilization)
    while v_curr != v_target do              ▹ check episode exit criterion
      xi <- ~U(0, 1)                         ▹ generate standard uniform random number
      a <- {v | D(v_curr, v) > 0}            ▹ get feasible action options list
      if xi < i / l_sim then                 ▹ optimal action selection criterion
        m <- max{Q(s, a) | s = (v_curr, F_v_curr(u))}           ▹ get maximal expected cumulative reward from Q-table
        v_max <- {a | Q(s, a) = m, s = (v_curr, F_v_curr(u))}   ▹ get all maximal-reward actions from Q-table
        if |v_max| > 0 then                  ▹ if no applicable action is found, fall back to a random action
          a <- a ∩ v_max                     ▹ restrict the action list to the optimal actions
        end if
      end if
      a <- ~U(a)                             ▹ choose the next action uniformly from the options
      v_next <- a, d ~ D(s_curr, a), u_next <- u + d   ▹ determine the next observation
      r <- REWARD(s_curr, a, d)              ▹ get reward
      O <-+ (s_curr, a, s_next, r)           ▹ save observation into history
      v_curr <- v_next, u_curr <- u_next     ▹ update state
    end while
    if mod(i, l_range) == 0 then             ▹ Q-table update due criterion
      UPDATE-Q(O)                            ▹ call discretization and Q-table update sub-process
    end if
  end for
end function

function REWARD(s_curr, a, d, l_range)
  inputs:
    s_curr: pair of current vertex and range utilization
    a: single vertex                         ▹ action a determines the next vertex to visit
    d: dynamic value                         ▹ distance realized on the performed action a
    l_range: constant parameter              ▹ defines the utilization limit
  r_d <- c_d * d                             ▹ reward term of the distance-proportional cost
  r_r <- c_r * (d == u_curr)                 ▹ reward term of the range-proportional cost
  r_o <- c_o * (l_range < u_curr)            ▹ reward term of the range over-utilization cost
  return r_d + r_r + r_o
end function

function UPDATE-Q(O)
  inputs:
    O: list of quad-tuples                   ▹ stores the observation history up to episode i
  for v ∈ V do                               ▹ loop for iterating vertices
    O_v <- {O(s_curr, a, s_next, r) | s_curr = (v, .)}   ▹ filter for the relevant observations
    C_v <- DBSCAN(O_v(u))                    ▹ determine clusters by DBSCAN on the utilization values
    R_v <- C_v                               ▹ widen the clusters into complete disjoint covering ranges
    for i = 1, 2, ..., k_v do                ▹ loop over the identified clusters of vertex v
      F_v(u ∈ R_v^i) <- AVERAGE(R_v^i)       ▹ update the discretization function with the cluster centroid
    end for
  end for
  Q() <- [.]                                 ▹ reset Q-table
  for j = 1, 2, ..., |O| do                  ▹ loop for iterating the observation history
    ((v, u)_curr, a, r, (v, u)_next) <- O_j(s_curr, a, r, s_next)   ▹ read the j-th observation
    s~_curr <- (v_curr, F(u_curr))           ▹ determine the discretized current state
    s~_next <- (v_next, F(u_next))           ▹ determine the discretized next state
    Q(s~_curr, a) <-+ (1 / |O(s~_curr, a, ., .)|) * (r + γ max_{a_next} Q(s~_next, a_next) − Q(s~_curr, a))
                                             ▹ expected reward over the intervals discretized by DBSCAN
  end for
end function
```
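The Q-table rebuild at the end of UPDATE-Q can be sketched in executable Python. This is an illustrative translation, not the authors' code: the learning rate $1/|O(\tilde{s},a,.,.)|$ is realized as a running-average update over the visit count of each discretized state-action pair.

```python
from collections import defaultdict

def rebuild_q_table(history, F, gamma=0.9):
    """Rebuild the Q-table from the observation history after a new
    discretization function F has been fitted (sketch of UPDATE-Q).
    history: list of ((v, u), a, (v_next, u_next), r) tuples;
    F(v, u): maps a continuous utilization to its cluster centroid."""
    Q = defaultdict(float)
    counts = defaultdict(int)
    for (v, u), a, (v2, u2), r in history:
        s, s2 = (v, F(v, u)), (v2, F(v2, u2))
        counts[(s, a)] += 1
        # expected cumulative reward of the best action in the next state
        best_next = max((q for (ss, a2), q in list(Q.items()) if ss == s2),
                        default=0.0)
        # running-average update: the learning rate is 1/N(s, a)
        Q[(s, a)] += (r + gamma * best_next - Q[(s, a)]) / counts[(s, a)]
    return dict(Q)
```

Because the table is truncated and replayed from the whole (windowed) history, a change in the discretization ruleset immediately propagates to every affected entry.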

## 4. Results and Discussion

#### 4.1. Constrained Shortest Path Finding Use Case

#### 4.2. Constrained Hamiltonian Path Finding Use Case

#### 4.3. Disassembly Line Balancing Use Case

## 5. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## Abbreviations

| Abbreviation | Meaning |
|---|---|
| RL | Reinforcement Learning |
| SPP | Shortest Pathfinding Problem |
| CSPP | Constrained Shortest Pathfinding Problem |
| HPP | Hamiltonian Pathfinding Problem |
| CSHPP | Constrained Shortest Hamiltonian Pathfinding Problem |
| DLBP | Disassembly Line Balancing Problem |


| | CSPP | CSHPP | DLBP |
|---|---|---|---|
| Vertex setup (state space) | current vertex | visited vertices + current vertex | removed components |
| Edge context (action space) | integrates the vertex connectivity | integrates the vertex connectivity | integrates the precedence graph |
| Constraints (restrictions for action selection) | range utilization ≤ battery capacity | range utilization ≤ battery capacity | workstation utilization ≤ cycle time |
| Objective | $\min\left({c}_{d}{\sum}_{j=1}^{l}\left\vert{\mathcal{P}}_{j}\right\vert+{c}_{r}l\right)$ | $\min\left({c}_{d}{\sum}_{j=1}^{l}\left\vert{\mathcal{H}}_{j}\right\vert+{c}_{r}l\right)$ | $\min\left({c}_{i}{\sum}_{i=1}^{l}{({t}_{c}-{\sum}_{j=1}^{{l}_{j}}{t}_{{w}_{i}^{j}})}^{2}+{c}_{h}{\sum}_{i=1}^{n}{h}_{i}{r}_{i}+{c}_{d}{\sum}_{i=1}^{n}{d}_{i}{r}_{i}\right)$ |

| Discretization Method \ Episode | 50 | 100 | 150 | 200 | 250 | 300 | 350 | 400 | 450 | 500 |
|---|---|---|---|---|---|---|---|---|---|---|
| Grid-based | 135 | 203 | 254 | 292 | 320 | 340 | 356 | 366 | 372 | 375 |
| Q-compression | 39 | 42 | 43 | 43 | 43 | 43 | 43 | 43 | 43 | 43 |

| Discretization Method \ Episode | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | 900 | 1000 |
|---|---|---|---|---|---|---|---|---|---|---|
| Grid-based | 649 | 1127 | 1526 | 1863 | 2167 | 2425 | 2650 | 2844 | 3008 | 3145 |
| Q-compression | 411 | 615 | 788 | 944 | 1093 | 1223 | 1339 | 1428 | 1504 | 1563 |

| Discretization Method \ Episode | 1100 | 1200 | 1300 | 1400 | 1500 | 1600 | 1700 | 1800 | 1900 | 2000 |
|---|---|---|---|---|---|---|---|---|---|---|
| Grid-based | 3266 | 3362 | 3442 | 3504 | 3554 | 3589 | 3613 | 3630 | 3640 | 3647 |
| Q-compression | 1615 | 1651 | 1678 | 1698 | 1713 | 1723 | 1730 | 1733 | 1735 | 1735 |

| Task No. | Disassembly Task | Removal Time | Demand | Hazardousness |
|---|---|---|---|---|
| 1 | PC top cover | 14 | 360 | No |
| 2 | Floppy drive | 10 | 500 | No |
| 3 | Hard drive | 12 | 620 | No |
| 4 | Backplane | 18 | 480 | No |
| 5 | PCI cards | 23 | 540 | No |
| 6 | RAM modules (2) | 16 | 750 | No |
| 7 | Power supply | 20 | 295 | Yes |
| 8 | Motherboard | 36 | 720 | No |

| Discretization Method \ Episode | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | 900 | 1000 |
|---|---|---|---|---|---|---|---|---|---|---|
| Grid-based | 702 | 1158 | 1522 | 1834 | 2087 | 2320 | 2485 | 2609 | 2716 | 2807 |
| Q-compression | 144 | 151 | 152 | 150 | 147 | 147 | 146 | 146 | 146 | 146 |


© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Kegyes, T.; Kummer, A.; Süle, Z.; Abonyi, J.
Generally Applicable Q-Table Compression Method and Its Application for Constrained Stochastic Graph Traversal Optimization Problems. *Information* **2024**, *15*, 193.
https://doi.org/10.3390/info15040193
