# Scaling Up Q-Learning via Exploiting State–Action Equivalence


## Abstract


## 1. Introduction

**Contributions.** We make the following contributions. We study off-policy learning in discounted finite MDPs that admit some equivalence structure in their state–action space. We introduce a new model-free algorithm, called QL-ES (Q-learning with equivalence structure), a variant of (asynchronous) Q-learning tailored to exploit the equivalence structure in the MDP when prior knowledge of the structure is provided to the agent. We report a non-asymptotic PAC-type sample complexity bound for QL-ES, thereby establishing its sample efficiency. This bound also allows us to quantify the superiority of QL-ES over Q-learning analytically. As it turns out, the sample efficiency gain of QL-ES over Q-learning is captured by an MDP-dependent quantity $\xi $ defined in terms of the associated cover times in the MDP; see Section 5 for details. Analytically establishing the dependence of the gain ratio $\xi $ on the number S of states in a given MDP seems difficult, although it is possible to compute it numerically. Nonetheless, we present a simple example where $\xi =O\left(S\right)$, showcasing that in some domains QL-ES may require far fewer samples (by a factor of S) than Q-learning. Furthermore, we numerically compute $\xi $ for a few families of MDPs built on standard environments (with increasing S), thereby illustrating the theoretical superiority of QL-ES over Q-learning. Through extensive numerical experiments on standard domains, we show that Q-function estimates under QL-ES converge much faster than those obtained from (structure-oblivious) Q-learning. These results demonstrate that the empirical performance gain from exploiting the equivalence structure can be massive, even in simple domains. To the best of our knowledge, QL-ES is the first provably efficient model-free algorithm to exploit equivalence structure in MDPs.

## 2. Related Work

**Similarity and equivalence in MDPs.** There is a rich literature on learning and exploiting various notions of structure in MDPs, where the aim is to leverage structure to alleviate the computational cost of finding an optimal policy (in the known MDP setting) or to speed up exploration (in the RL setting). Many such algorithms fall into the category of state abstraction (or aggregation) [26,27]. Approximate homomorphism has been proposed to construct beneficial abstract models in MDPs [28]. In the known MDP setting, Refs. [29,30] appear to be the first to present a notion of equivalence between states based on stochastic bi-simulation. The authors of [31,32] use bi-simulation metrics, as quantitative analogues of equivalence relations, to partition the state space by capturing similarities. In the RL setting, Refs. [18,19,20,33,34] investigate model-based algorithms that rely on grouping similar states (or state–action pairs) to speed up exploration. Ref. [20] is the first to present an average-reward RL algorithm (in the regret setting) in which the confidence intervals of similar states are aggregated. Ref. [18] studies regret minimization in average-reward MDPs with equivalence structure and presents the C-UCRL algorithm, which is capable of exploiting the structure; the regret bound for C-UCRL depends on the number of classes in the MDP rather than the size of the state–action space. A similar equivalence structure was studied in [17] in the context of multi-task RL, where similarities of the transition dynamics across tasks were extracted and exploited to speed up learning. Ref. [24] studies the efficiency of hierarchical RL in the regret setting in scenarios where the hierarchical structure is defined with respect to the notion of equivalence; more precisely, it assumes that the underlying MDP can be decomposed into equivalent sub-MDPs, i.e., smaller MDPs with identical reward and transition functions up to some known bijection mappings.
Closest to our work, in terms of the structure definition, is [18]. However, we restrict ourselves to a model-free approach, where the model-based machinery presented in [18] does not apply. Finally, we mention that there is some literature on exploiting equivalence in deep RL (e.g., [21,22]); however, to the best of our knowledge, none of these works study provably efficient learning methods.

**Q-learning and its variants.** We provide a very brief overview of works on the theoretical analysis of Q-learning and its variants. Q-learning [2] has been around for more than three decades as a cheap and popular model-free method to solve finite, unknown discounted MDPs without estimating the model. Its convergence was investigated in an asymptotic flavor [35,36], and more recently in the non-asymptotic (finite-sample) regime in a series of works, including [9,37,38,39,40]. To the best of our knowledge, Ref. [9] reports the sharpest PAC-type sample complexity bound for classical Q-learning. Some of these works present variants of Q-learning with improved sample complexity bounds using a variety of techniques, such as acceleration and variance reduction [9,40,41]. Although the concept of equivalence in MDPs is not new, to our knowledge no work reports PAC-type sample complexity bounds for model-free algorithms combined with equivalence relations.

## 3. Problem Formulation

#### 3.1. Discounted Markov Decision Processes

#### 3.2. The Off-Policy Learning Problem and Q-Learning

**The Q-learning algorithm.** The Q-learning algorithm [35] is perhaps the most famous model-free algorithm for learning an optimal policy in unknown tabular MDPs. As a model-free method, it directly learns the optimal Q-function ${Q}^{\star}$ of the MDP (without estimating P and R), from which a policy can be derived. The algorithm maintains an estimate ${Q}_{t}$ of ${Q}^{\star}$ at each time step t. Specifically, it starts from an arbitrary choice of ${Q}_{0}\in {\mathbb{R}}^{S\times A}$ and updates ${Q}_{t}$ at each $t\ge 0$ according to the update rule (1).

**Algorithm 1** Q-learning [2].

- **Input:** dataset $\mathcal{D}$, maximum iterations $T$, learning rates ${\left({\alpha}_{t}\right)}_{t\ge 0}$
- **Initialization:** ${Q}_{0}=0\in {\mathbb{R}}^{S\times A}$
- **for** $t=0,1,\dots ,T$ **do**
  - Sample action ${a}_{t}\sim {\pi}_{\mathrm{b}}\left({s}_{t}\right)$ and observe ${r}_{t}\sim R({s}_{t},{a}_{t})$ and ${s}_{t+1}\sim P(\cdot \mid {s}_{t},{a}_{t})$.
  - Compute ${Q}_{t+1}$ using (1).
- **end for**
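For the classical algorithm, the update rule (1) is the standard one-step asynchronous update: ${Q}_{t+1}({s}_{t},{a}_{t})=(1-{\alpha}_{t}){Q}_{t}({s}_{t},{a}_{t})+{\alpha}_{t}\left({r}_{t}+\gamma \,{\max}_{a}{Q}_{t}({s}_{t+1},a)\right)$, leaving all other entries unchanged. A minimal sketch in Python (the function name and in-place interface are illustrative, not the paper's code):

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, alpha, gamma):
    """One asynchronous Q-learning update on the visited pair (s, a):
    Q(s, a) <- (1 - alpha) * Q(s, a) + alpha * (r + gamma * max_a' Q(s', a')).
    All other entries of Q are left unchanged."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] = (1.0 - alpha) * Q[s, a] + alpha * target
    return Q
```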

#### 3.3. Similarity and Equivalence Classes

**Definition 1** ($\theta $-similarity). Two pairs $(s,a),({s}^{\prime},{a}^{\prime})\in \mathcal{S}\times \mathcal{A}$ are **$\theta $-similar** if there exist mappings ${\sigma}_{s,a}:\{1,\dots ,S\}\to \mathcal{S}$ and ${\sigma}_{{s}^{\prime},{a}^{\prime}}:\{1,\dots ,S\}\to \mathcal{S}$ such that

$$\sum_{i=1}^{S}\left|P\left({\sigma}_{s,a}(i)\mid s,a\right)-P\left({\sigma}_{{s}^{\prime},{a}^{\prime}}(i)\mid {s}^{\prime},{a}^{\prime}\right)\right|\le \theta .$$

We call ${\sigma}_{s,a}$ the **profile mapping** (or, for short, the **profile**) for $(s,a)$, and denote by $\sigma ={\left({\sigma}_{s,a}\right)}_{s,a}$ the set of profile mappings across $\mathcal{S}\times \mathcal{A}$.

**Definition 2** (equivalence structure). When the pairs in each cell of a partition of $\mathcal{S}\times \mathcal{A}$ are 0-similar, we call the partition an **equivalence structure** and denote it by $\mathcal{C}$. We further define $C:=\left|\mathcal{C}\right|$.

**Off-policy learning in MDPs with equivalence structures.** In this work, we assume that the underlying MDP M admits an equivalence structure $\mathcal{C}$ as introduced above. In other words, the transition function P is such that $\mathcal{S}\times \mathcal{A}$ can be partitioned into $C:=\left|\mathcal{C}\right|$ classes, where the pairs in each class $c\in \mathcal{C}$ are 0-similar. We make the following assumption regarding the agent’s prior knowledge about $\mathcal{C}$.
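When P is known (or estimated), such a partition can be recovered by grouping pairs whose next-state distributions coincide up to a reordering of states, i.e., 0-similarity under the profile mappings. A minimal sketch (the function name and rounding tolerance are illustrative assumptions, not the paper's construction):

```python
import numpy as np

def build_equivalence_structure(P, decimals=8):
    """Group state-action pairs whose next-state distributions coincide
    up to reordering of states (0-similarity). P has shape (S, A, S).
    Returns a dict mapping each sorted profile to its list of (s, a) pairs."""
    S, A, _ = P.shape
    classes = {}
    for s in range(S):
        for a in range(A):
            # Sorting P(.|s,a) in decreasing order plays the role of
            # applying the profile mapping sigma_{s,a}.
            key = tuple(np.round(np.sort(P[s, a])[::-1], decimals))
            classes.setdefault(key, []).append((s, a))
    return classes
```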

**Assumption A1.**

**Assumption A2.**

## 4. The QL-ES Algorithm

**Algorithm 2** QL-ES.

- **Input:** dataset $\mathcal{D}$, maximum iterations $T$, learning rates ${\left({\alpha}_{t}\right)}_{t\ge 0}$, equivalence structure $\mathcal{C}$
- **Initialization:** ${Q}_{0}=0\in {\mathbb{R}}^{S\times A}$
- **for** $t=0,1,2,\dots ,T$ **do**
  - Sample action ${a}_{t}\sim {\pi}_{\mathrm{b}}\left({s}_{t}\right)$ and observe ${s}_{t+1}\sim P(\cdot \mid {s}_{t},{a}_{t})$.
  - Find $c({s}_{t},{a}_{t})$.
  - **for** $(s,a)\in c({s}_{t},{a}_{t})$ **do**
    - ${s}_{t+1}^{\left(sa\right)}={\sigma}_{s,a}^{-1}\left({\sigma}_{{s}_{t},{a}_{t}}\left({s}_{t+1}\right)\right)$
    - Compute ${Q}_{t+1}$ using (2).
  - **end for**
- **end for**
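The inner loop of Algorithm 2 can be sketched as follows: a single observed transition at $({s}_{t},{a}_{t})$ is translated, via the profile mappings, into a synthetic next state for every pair in the same class, each of which then receives a Q-learning-style update. This sketch assumes that update (2) has the standard Q-learning form applied class-wide and that the observed reward can be shared within the class; the data structures (`classes`, `sigma`, `sigma_inv`) are illustrative:

```python
import numpy as np

def ql_es_step(Q, classes, sigma, sigma_inv, s_t, a_t, r_t, s_next, alpha, gamma):
    """Propagate one observed transition (s_t, a_t, r_t, s_next) to every
    pair in its equivalence class (sketch of Algorithm 2's inner loop).

    classes:   dict mapping (s, a) -> list of equivalent (s', a') pairs
    sigma:     dict mapping (s, a) -> profile mapping (state -> rank)
    sigma_inv: dict mapping (s, a) -> inverse profile mapping (rank -> state)
    """
    for (s, a) in classes[(s_t, a_t)]:
        # Translate the observed next state into this pair's coordinates:
        # s_next^{(sa)} = sigma_{s,a}^{-1}(sigma_{s_t,a_t}(s_next)).
        s_next_sa = sigma_inv[(s, a)][sigma[(s_t, a_t)][s_next]]
        target = r_t + gamma * np.max(Q[s_next_sa])
        Q[s, a] = (1.0 - alpha) * Q[s, a] + alpha * target
    return Q
```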

**Remark 1.**

## 5. Theoretical Guarantee for QL-ES

**Definition 3.**

**Theorem 1.**

**Comparison with the sample complexity of Q-learning.** Theorem 1 tells us that the number of steps needed to have $\parallel {Q}^{\star}-{Q}_{T}{\parallel}_{\infty}\le \epsilon $ with high probability scales with ${t}_{\mathrm{cover},\mathcal{C}}\,{\epsilon}^{-2}{(1-\gamma )}^{-5}$ (up to logarithmic factors), where ${t}_{\mathrm{cover},\mathcal{C}}$, defined in Definition 3, is the cover time with respect to $\mathcal{C}$. Comparing this result against the sample complexity of Q-learning (e.g., Theorem 2 in [9]) reveals that QL-ES yields an improvement over Q-learning by a factor of $\mathrm{const}.\times \xi $, where

$$\xi :=\frac{{t}_{\mathrm{cover}}}{{t}_{\mathrm{cover},\mathcal{C}}},$$

with ${t}_{\mathrm{cover}}$ denoting the cover time of the state–action space.
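The cover times in this ratio can be estimated empirically by simulation, as in the tables of Section 6. A minimal Monte Carlo sketch, assuming ${t}_{\mathrm{cover}}$ is the first time a trajectory has visited every state–action pair and ${t}_{\mathrm{cover},\mathcal{C}}$ the first time it has visited every class (the environment interface `step`, the class labeling `class_of`, and the uniform-random behavior policy are illustrative assumptions):

```python
import numpy as np

def estimate_cover_times(step, n_states, n_actions, class_of,
                         n_runs=100, seed=0, max_steps=10**6):
    """Monte Carlo estimates of t_cover (all (s, a) pairs visited) and
    t_cover_C (all classes visited) under a uniform-random behavior policy.
    step(s, a, rng) returns the next state; class_of[s, a] is the class
    index of each pair. Assumes every run covers within max_steps."""
    rng = np.random.default_rng(seed)
    n_classes = int(class_of.max()) + 1
    covers, covers_c = [], []
    for _ in range(n_runs):
        seen = np.zeros((n_states, n_actions), dtype=bool)
        seen_c = np.zeros(n_classes, dtype=bool)
        s, t_cover, t_cover_c = 0, None, None
        for t in range(1, max_steps + 1):
            a = rng.integers(n_actions)
            seen[s, a] = True
            seen_c[class_of[s, a]] = True
            if t_cover_c is None and seen_c.all():
                t_cover_c = t
            if seen.all():
                t_cover = t
                break
            s = step(s, a, rng)
        covers.append(t_cover)
        covers_c.append(t_cover_c)
    return np.mean(covers), np.mean(covers_c)
```

Since a class is covered as soon as any one of its pairs is visited, ${t}_{\mathrm{cover},\mathcal{C}}\le {t}_{\mathrm{cover}}$ always holds, so $\xi \ge 1$.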

## 6. Simulation Results

#### 6.1. Evaluation Metrics

- (i) Max-norm Q-value error, defined as $\parallel {Q}^{\star}-{Q}_{t}{\parallel}_{\infty}$;
- (ii) Total policy error, defined as $\parallel {\pi}^{\star}-{\pi}_{t}^{\mathsf{greedy}}{\parallel}_{1}$, where ${\pi}_{t}^{\mathsf{greedy}}$ denotes the greedy policy w.r.t. ${Q}_{t}$, i.e., ${\pi}_{t}^{\mathsf{greedy}}\left(s\right):=\mathrm{arg}\,{\mathrm{max}}_{a}\,{Q}_{t}(s,a)$ for all $s$.
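Both metrics are straightforward to compute from ${Q}_{t}$. A small sketch, where the total policy error is read as the number of states at which the greedy policy disagrees with ${\pi}^{\star}$ (one common reading of the $\ell_1$ distance between deterministic policies, and assuming a unique argmax in ${Q}^{\star}$):

```python
import numpy as np

def max_norm_q_error(Q_star, Q_t):
    """Max-norm Q-value error ||Q* - Q_t||_inf over all (s, a)."""
    return np.max(np.abs(Q_star - Q_t))

def total_policy_error(Q_star, Q_t):
    """Number of states where the greedy policy w.r.t. Q_t disagrees
    with the optimal (greedy w.r.t. Q*) policy."""
    return int(np.sum(np.argmax(Q_star, axis=1) != np.argmax(Q_t, axis=1)))
```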

#### 6.2. Environments

**RiverSwim and variants.** A generic RiverSwim MDP with L states is shown in Figure 1, which extends the classical 6-state RiverSwim presented in [45]. This MDP is constructed so that efficient exploration is required to find the optimal policy: the larger the number L of states, the more exploration is required. The L-state RiverSwim (with $L\ge 3$) admits an equivalence structure with $C=3$ regardless of L. We consider RiverSwim instances with various L so as to obtain MDPs of progressive difficulty while keeping the number of classes fixed. In some experiments, we consider a slightly modified version of RiverSwim, which we call Perturbed RiverSwim. It is identical to RiverSwim (Figure 1) except that in any state ${s}_{i}$, where $i<L$ is even, $p\left({s}_{i}\mid {s}_{i},\mathrm{R}\right)=0.65$ and $p\left({s}_{i+1}\mid {s}_{i},\mathrm{R}\right)=0.3$. It follows that an L-state Perturbed RiverSwim has $C=4$ classes.

**GridWorld.** We also consider 2-room and 4-room grid-world MDPs with different grid sizes. Figure 3 shows a $7\times 7$ 2-room grid-world and a $9\times 9$ 4-room grid-world. In both environments, the agent starts at the upper-left corner (in red) and must reach the lower-right corner (in yellow), where it receives a reward of 1 and is then sent back to the initial red state. At each step, the agent has four possible actions (hence, $A=4$): up, left, down, or right. Black squares indicate walls, through which the agent cannot pass. After executing a given action, the agent stays in the same state with probability $0.1$, moves in the intended direction with probability $0.7$, and moves in the other two possible directions with probabilities $0.06$ and $0.14$. If a wall blocks a move, the agent stays where it is, and the transition probability of the blocked next state is added to that of the current state.
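The transition model described above can be sketched as follows. Which perpendicular direction receives mass $0.06$ versus $0.14$ is not specified in the text, so the assignment below is an assumption of this sketch:

```python
import numpy as np

# Action deltas: up, left, down, right (A = 4, as in the text).
DELTAS = [(-1, 0), (0, -1), (1, 0), (0, 1)]

def transition_probs(grid, s, a):
    """Transition distribution over cells for one (state, action) pair.
    Noise model from the text: 0.1 stay, 0.7 intended direction, 0.06 and
    0.14 to the two perpendicular directions (the 0.06/0.14 assignment is
    an assumption); moves blocked by a wall or the boundary return their
    probability mass to the current cell.
    grid is a 2D bool array with True marking walls; s is a (row, col) tuple."""
    rows, cols = grid.shape
    probs = {s: 0.1}
    perp = [(a + 1) % 4, (a + 3) % 4]  # the two perpendicular actions
    for direction, p in [(a, 0.7), (perp[0], 0.06), (perp[1], 0.14)]:
        dr, dc = DELTAS[direction]
        nxt = (s[0] + dr, s[1] + dc)
        blocked = not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols) or grid[nxt]
        target = s if blocked else nxt
        probs[target] = probs.get(target, 0.0) + p
    return probs
```

Under this model, every row of the transition kernel sums to 1, and cells adjacent to walls accumulate the blocked mass on the current state, which is what creates the small number of distinct transition profiles (classes) reported in the tables below.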

#### 6.3. Bounds on the Ratio $\xi $

#### 6.4. Experimental Results with Exact Equivalence Structure

#### 6.5. The Gain in the Case of $\theta $-Similar Pairs

#### 6.6. The Impact of Partially Using the Structure

## 7. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## Appendix A. Sample Complexity of QL-ES: Proof of Theorem 1

#### Appendix A.1. Notations

**Lemma A1.**

#### Appendix A.2. Proof of Theorem 1

**Lemma A2.**

**Lemma A3.**

**Lemma A4.**

**Lemma A5.**

**Theorem A1.**

## Appendix B. Proofs of Technical Lemmas

#### Appendix B.1. Proof of Lemma A1

#### Appendix B.2. Proof of Lemma A2

#### Appendix B.3. Proof of Lemma A4

#### Appendix B.4. Proof of Lemma A5

## References

- Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018.
- Watkins, C.J.C.H. Learning from Delayed Rewards. Ph.D. Thesis, University of Cambridge, Cambridge, UK, 1989.
- Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing Atari with Deep Reinforcement Learning. arXiv **2013**, arXiv:1312.5602.
- Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-Level Control through Deep Reinforcement Learning. Nature **2015**, 518, 529–533.
- Gheshlaghi Azar, M.; Munos, R.; Kappen, H.J. Minimax PAC Bounds on the Sample Complexity of Reinforcement Learning with a Generative Model. Mach. Learn. **2013**, 91, 325–349.
- Azar, M.G.; Osband, I.; Munos, R. Minimax Regret Bounds for Reinforcement Learning. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 263–272.
- Zhou, D.; Gu, Q.; Szepesvari, C. Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes. In Proceedings of the Conference on Learning Theory, Boulder, CO, USA, 15–19 August 2021; pp. 4532–4576.
- Agarwal, A.; Kakade, S.; Yang, L.F. Model-Based Reinforcement Learning with a Generative Model is Minimax Optimal. In Proceedings of the Conference on Learning Theory, Virtual, 9–12 July 2020; pp. 67–83.
- Li, G.; Wei, Y.; Chi, Y.; Gu, Y.; Chen, Y. Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and Variance Reduction. arXiv **2021**, arXiv:2006.03041.
- Ortner, R.; Ryabko, D. Online Regret Bounds for Undiscounted Continuous Reinforcement Learning. Adv. Neural Inf. Process. Syst. **2012**, 25, 1763–1771.
- Qian, J.; Fruit, R.; Pirotta, M.; Lazaric, A. Exploration Bonus for Regret Minimization in Discrete and Continuous Average Reward MDPs. Adv. Neural Inf. Process. Syst. **2019**, 32, 4891–4900.
- Asadi, K.; Misra, D.; Littman, M. Lipschitz Continuity in Model-Based Reinforcement Learning. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 264–273.
- Ok, J.; Proutiere, A.; Tranos, D. Exploration in Structured Reinforcement Learning. Adv. Neural Inf. Process. Syst. **2018**, 31, 8888–8896.
- Osband, I.; Van Roy, B. Near-Optimal Reinforcement Learning in Factored MDPs. Adv. Neural Inf. Process. Syst. **2014**, 27, 604–612.
- Talebi, M.S.; Jonsson, A.; Maillard, O. Improved Exploration in Factored Average-Reward MDPs. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Virtual, 13–15 April 2021; pp. 3988–3996.
- Rosenberg, A.; Mansour, Y. Oracle-Efficient Regret Minimization in Factored MDPs with Unknown Structure. Adv. Neural Inf. Process. Syst. **2021**, 34, 11148–11159.
- Sun, Y.; Yin, X.; Huang, F. Temple: Learning Template of Transitions for Sample Efficient Multi-Task RL. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 9765–9773.
- Asadi, M.; Talebi, M.S.; Bourel, H.; Maillard, O.A. Model-Based Reinforcement Learning Exploiting State-Action Equivalence. In Proceedings of the Asian Conference on Machine Learning, Nagoya, Japan, 17–19 November 2019; pp. 204–219.
- Leffler, B.R.; Littman, M.L.; Edmunds, T. Efficient Reinforcement Learning with Relocatable Action Models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 22–26 July 2007; Volume 7, pp. 572–577.
- Ortner, R. Adaptive Aggregation for Reinforcement Learning in Average Reward Markov Decision Processes. Ann. Oper. Res. **2013**, 208, 321–336.
- Van der Pol, E.; Worrall, D.; van Hoof, H.; Oliehoek, F.; Welling, M. MDP Homomorphic Networks: Group Symmetries in Reinforcement Learning. Adv. Neural Inf. Process. Syst. **2020**, 33, 4199–4210.
- Mondal, A.K.; Nair, P.; Siddiqi, K. Group Equivariant Deep Reinforcement Learning. arXiv **2020**, arXiv:2007.03437.
- Ortner, R.; Ryabko, D.; Auer, P.; Munos, R. Regret Bounds for Restless Markov Bandits. Theor. Comput. Sci. **2014**, 558, 62–76.
- Wen, Z.; Precup, D.; Ibrahimi, M.; Barreto, A.; Van Roy, B.; Singh, S. On Efficiency in Hierarchical Reinforcement Learning. Adv. Neural Inf. Process. Syst. **2020**, 33, 6708–6718.
- Dulac-Arnold, G.; Levine, N.; Mankowitz, D.J.; Li, J.; Paduraru, C.; Gowal, S.; Hester, T. Challenges of Real-World Reinforcement Learning: Definitions, Benchmarks and Analysis. Mach. Learn. **2021**, 110, 2419–2468.
- Li, L.; Walsh, T.J.; Littman, M.L. Towards a Unified Theory of State Abstraction for MDPs. In Proceedings of the International Symposium on Artificial Intelligence and Mathematics, Fort Lauderdale, FL, USA, 4–6 January 2006.
- Abel, D.; Hershkowitz, D.; Littman, M. Near Optimal Behavior via Approximate State Abstraction. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 2915–2923.
- Ravindran, B.; Barto, A.G. Approximate Homomorphisms: A Framework for Non-Exact Minimization in Markov Decision Processes. In Proceedings of the KBCS, New York, NY, USA, 17–18 May 2004.
- Dean, T.; Givan, R.; Leach, S. Model Reduction Techniques for Computing Approximately Optimal Solutions for Markov Decision Processes. In Proceedings of the Uncertainty in Artificial Intelligence, Providence, RI, USA, 1–3 August 1997; pp. 124–131.
- Givan, R.; Dean, T.; Greig, M. Equivalence Notions and Model Minimization in Markov Decision Processes. Artif. Intell. **2003**, 147, 163–223.
- Ferns, N.; Panangaden, P.; Precup, D. Metrics for Finite Markov Decision Processes. In Proceedings of the Uncertainty in Artificial Intelligence, Banff, AB, Canada, 7–11 July 2004; pp. 162–169.
- Ferns, N.; Panangaden, P.; Precup, D. Bisimulation Metrics for Continuous Markov Decision Processes. SIAM J. Comput. **2011**, 40, 1662–1714.
- Brunskill, E.; Li, L. Sample Complexity of Multi-task Reinforcement Learning. In Proceedings of the Uncertainty in Artificial Intelligence, Bellevue, WA, USA, 11–15 August 2013; p. 122.
- Mandel, T.; Liu, Y.E.; Brunskill, E.; Popovic, Z. Efficient Bayesian Clustering for Reinforcement Learning. In Proceedings of the International Joint Conference on Artificial Intelligence, New York, NY, USA, 9–16 July 2016; pp. 1830–1838.
- Watkins, C.J.; Dayan, P. Q-Learning. Mach. Learn. **1992**, 8, 279–292.
- Tsitsiklis, J.N. Asynchronous Stochastic Approximation and Q-Learning. Mach. Learn. **1994**, 16, 185–202.
- Even-Dar, E.; Mansour, Y. Learning Rates for Q-Learning. J. Mach. Learn. Res. **2003**, 5, 1–25.
- Azar, M.G.; Munos, R.; Ghavamzadeh, M.; Kappen, H. Reinforcement Learning with a Near Optimal Rate of Convergence. Technical Report inria-00636615v2, 2011. Available online: https://www.researchgate.net/publication/265653590_Reinforcement_Learning_with_a_Near_Optimal_Rate_of_Convergence (accessed on 30 January 2023).
- Qu, G.; Wierman, A. Finite-Time Analysis of Asynchronous Stochastic Approximation and Q-Learning. In Proceedings of the Conference on Learning Theory, Virtual, 9–12 July 2020; pp. 3185–3205.
- Devraj, A.M.; Meyn, S.P. Q-Learning with Uniformly Bounded Variance: Large Discounting is Not a Barrier to Fast Learning. arXiv **2020**, arXiv:2002.10301.
- Wang, Y.; Dong, K.; Chen, X.; Wang, L. Q-Learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020.
- Puterman, M.L. Markov Decision Processes: Discrete Stochastic Dynamic Programming; John Wiley & Sons: Hoboken, NJ, USA, 2014.
- Levin, D.A.; Peres, Y. Markov Chains and Mixing Times; American Mathematical Society: Providence, RI, USA, 2017; Volume 107.
- Robbins, H.; Monro, S. A Stochastic Approximation Method. Ann. Math. Stat. **1951**, 22, 400–407.
- Strehl, A.L.; Littman, M.L. An Analysis of Model-Based Interval Estimation for Markov Decision Processes. J. Comput. Syst. Sci. **2008**, 74, 1309–1331.
- Azar, M.G.; Munos, R.; Ghavamzadeh, M.; Kappen, H.J. Speedy Q-Learning. Adv. Neural Inf. Process. Syst. **2011**, 24, 2411–2419.
- Jaksch, T.; Ortner, R.; Auer, P. Near-Optimal Regret Bounds for Reinforcement Learning. J. Mach. Learn. Res. **2010**, 11, 1563–1600.
- Bourel, H.; Maillard, O.; Talebi, M.S. Tightening Exploration in Upper Confidence Reinforcement Learning. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 1056–1066.

**Figure 3.** The 2-room grid world (**left**) and 4-room grid world (**right**) with walls in black, the initial state in red, and the goal state in yellow.

**Figure 8.** Results with $\theta $-similar pairs: Modified RiverSwim (**left**) and Modified GridWorld (**right**).

**Table 1.** Number of state–action pairs $SA$ and number of classes $C$ for the 2-room and 4-room GridWorlds of increasing grid size.

| Environment | States | $7\times 7$ | $9\times 9$ | $11\times 11$ | $20\times 20$ | $50\times 50$ | $100\times 100$ |
|---|---|---|---|---|---|---|---|
| 2-room | $SA$ | 84 | 172 | 292 | 1228 | 9028 | $3.8\times {10}^{4}$ |
| 2-room | $C$ | 8 | 8 | 8 | 8 | 8 | 8 |
| 4-room | $SA$ | 80 | 160 | 272 | 1172 | 8852 | $3.7\times {10}^{4}$ |
| 4-room | $C$ | 8 | 8 | 8 | 8 | 8 | 8 |

**Table 2.** Empirical values of ${t}_{\mathrm{cover}}$, ${t}_{\mathrm{cover},\mathcal{C}}$, and ${\xi}_{\mathrm{LCB}}$ for RiverSwim with S states.

| $S$ | 6 | 10 | 14 | 20 |
|---|---|---|---|---|
| ${t}_{\mathrm{cover}}$ | 131, $\mathrm{CI}=[97,158]$ | 513, $\mathrm{CI}=[340,658]$ | 2529, $\mathrm{CI}=[1867,3196]$ | 12,792, $\mathrm{CI}=[7577,15{,}688]$ |
| ${t}_{\mathrm{cover},\mathcal{C}}$ | 12, $\mathrm{CI}=[9,16]$ | 33, $\mathrm{CI}=[26,38]$ | 56, $\mathrm{CI}=[46,66]$ | 113, $\mathrm{CI}=[94,133]$ |
| ${\xi}_{\mathrm{LCB}}$ | $\frac{97}{16}\approx 6.1$ | $\frac{340}{38}\approx 8.9$ | $\frac{1867}{66}\approx 28.3$ | $\frac{7577}{133}\approx 57.0$ |

**Table 3.** Empirical values of ${t}_{\mathrm{cover}}$, ${t}_{\mathrm{cover},\mathcal{C}}$, and ${\xi}_{\mathrm{LCB}}$ for Perturbed RiverSwim with S states.

| $S$ | 6 | 10 | 14 | 20 |
|---|---|---|---|---|
| ${t}_{\mathrm{cover}}$ | 116, $\mathrm{CI}=[97,136]$ | 557, $\mathrm{CI}=[351,655]$ | 2011, $\mathrm{CI}=[1616,2451]$ | 11,856, $\mathrm{CI}=[7282,17{,}103]$ |
| ${t}_{\mathrm{cover},\mathcal{C}}$ | 14, $\mathrm{CI}=[12,17]$ | 32, $\mathrm{CI}=[23,37]$ | 58, $\mathrm{CI}=[46,67]$ | 114, $\mathrm{CI}=[99,130]$ |
| ${\xi}_{\mathrm{LCB}}$ | $\frac{97}{17}\approx 5.7$ | $\frac{351}{37}\approx 9.5$ | $\frac{1616}{67}\approx 24.1$ | $\frac{7282}{130}\approx 56.0$ |

**Table 4.** Empirical values of ${t}_{\mathrm{cover}}$, ${t}_{\mathrm{cover},\mathcal{C}}$, and ${\xi}_{\mathrm{LCB}}$ for 2-room GridWorld with S states.

| $S$ | 21 | 43 | 71 | 111 |
|---|---|---|---|---|
| ${t}_{\mathrm{cover}}$ | 2164, $\mathrm{CI}=[1939,2355]$ | 5480, $\mathrm{CI}=[4890,5889]$ | 11,310, $\mathrm{CI}=[9817,12{,}735]$ | 22,793, $\mathrm{CI}=[20{,}571,24{,}882]$ |
| ${t}_{\mathrm{cover},\mathcal{C}}$ | 205, $\mathrm{CI}=[125,251]$ | 418, $\mathrm{CI}=[329,481]$ | 877, $\mathrm{CI}=[613,1011]$ | 1338, $\mathrm{CI}=[1101,1551]$ |
| ${\xi}_{\mathrm{LCB}}$ | $\frac{1939}{251}\approx 7.7$ | $\frac{4890}{481}\approx 10.2$ | $\frac{9817}{1011}\approx 9.7$ | $\frac{20{,}571}{1551}\approx 13.3$ |


© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

Lyu, Y.; Côme, A.; Zhang, Y.; Talebi, M.S. Scaling Up Q-Learning via Exploiting State–Action Equivalence. *Entropy* **2023**, *25*, 584. https://doi.org/10.3390/e25040584