# Deep Reinforcement Learning for Dynamic Stock Option Hedging: A Review

## Abstract


## 1. Introduction

## 2. Reinforcement Learning

#### 2.1. Fundamentals

#### 2.2. Value-Based RL Methods

#### 2.3. Policy-Based RL Methods

## 3. Similar Work

## 4. Methodology

| Source | Reviewed | Method | State | Action | Reward | Train Data | Test Data | Benchmark |
|---|---|---|---|---|---|---|---|---|
| [5,30] | Yes | Q-Learning | $S_t, \tau, \sigma_t$ | Disc. | $\delta w_t - \lambda (\delta w_t)^2$ | GBM | GBM | BSD |
| [31] | Yes | SARSA | $S_t, \tau, n_t$ | Disc. | $\delta w_t - \lambda (\delta w_t)^2$ | GBM | GBM | BSD |
| [32] | Yes | DQN, Pop-Art, PPO | $S_t, \tau, n_t, K$ | Disc. | $\delta w_t - \lambda (\delta w_t)^2$ | GBM | GBM | BSD |
| [52] | No | TRVO | $C_t, S_t, \Delta_t, n_t$ | Cont. | $\delta w_t$ | GBM | GBM | BSD |
| [1] | Yes | DDPG | $S_t, \tau, n_t$ | Cont. | Min. $\mathbb{E}[w_t] + \lambda \sqrt{\mathbb{V}[w_t]}$ | GBM, SABR | GBM, SABR | BSD, Bartlett |
| [53] | Yes | IMPALA | $S_t, \tau$ | Disc. | +1, −1 | HSX, HNX | HSX, HNX | Market Return |
| [49] | No | DQN, DDPG | $C_t, S_t, \Delta_t, n_t, \sigma_t$ | Disc. | $\delta w_t - \lambda (\delta w_t)^2$ | GBM, Heston | GBM, Heston, S&P | BSD, Wilmott |
| [54] | No | PG w/ Baseline | $S_t, \tau, n_t$ | Disc. | $\delta w_t$ | GBM, Heston | GBM, Heston, S&P | BSD |
| [50] | No | Dir. Policy Search | $S_t, \tau$ | Cont. | CVaR | GBM, GAN | GBM, GAN | BSD |
| [55] | No | DDPG | $S_t, \tau, n_t$ | Cont. | Payoff | GBM | GBM | BSD |
| [56] | No | Actor-Critic | $C_t, S_t, n_t, \tau$ | Cont. | $\delta w_t$ | Heston | Heston | BSD |
| [51] | Yes | DDPG | $S_t, \tau, \Delta_t, n_t, K, \nu_t, \Gamma_t$ | Cont. | $\delta w_t - \lambda (\delta w_t)^2$ | S&P, DJIA | S&P, DJIA | BSD |
| [57] | Yes | TD3 | $S_t, \tau, n_t, \sigma_t$ | Cont. | $\delta w_t$ | GBM, Heston, S&P | GBM, Heston, S&P | BSD |
| [58] | Yes | D4PG-QR | $S_t, \Gamma_t^{port}, \nu_t^{port}, \Gamma_t^{hedge}, \nu_t^{hedge}$ | Cont. | CVaR and modified mean-var. | SABR | SABR | BSD, BSDG, BSDV |
| [59] | No | DDPG, DDPG-U | $S_t, \tau, n_t, \sigma_t, \Delta_t, \frac{dC}{dt}$ | Cont. | $\delta w_t + \lambda \mathrm{Var}[\delta w_t]$ | GBM, S&P | GBM, S&P | BSD |
| [33] | Yes | CMAB | $S_t, \tau, n_t$ | Disc. | $\delta w_t - \lambda (\delta w_t)^2$ | GBM | GBM | CMAB vs. DQN |
| [60] | No | DDPG | $C_t, S_t, \Delta_t, n_t, \tau$ | Cont. | Min. $c_t$ | GBM | GBM | BSD |
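Most of the tabulated papers share the same episodic setup: the agent observes a state such as $(S_t, \tau, n_t)$, chooses a new hedge position, the underlying evolves one step, and the agent receives a mean-variance reward $\delta w_t - \lambda (\delta w_t)^2$. The following is a minimal sketch of that setup under stated simplifying assumptions (GBM dynamics, no transaction costs, the short-option leg of the wealth change omitted); the class name and all parameter values are illustrative and not taken from any single reviewed paper.

```python
import numpy as np

class GBMHedgingEnv:
    """Sketch of the episodic hedging environment common to the reviewed
    papers: re-hedge a stock position over a fixed horizon under GBM and
    receive the mean-variance reward delta_w - lambda * delta_w**2.
    Simplified: no transaction costs, option-value changes omitted."""

    def __init__(self, s0=100.0, mu=0.05, sigma=0.2, T=1.0, steps=52,
                 risk_aversion=0.1, seed=0):
        self.s0, self.mu, self.sigma = s0, mu, sigma
        self.steps = steps
        self.dt = T / steps
        self.lam = risk_aversion        # lambda in the reward formulation
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.t = 0
        self.s = self.s0
        self.holding = 0.0              # current hedge position n_t
        # state: (S_t, time to maturity tau, holding n_t)
        return np.array([self.s, self.steps * self.dt, self.holding])

    def step(self, action):
        # action = new hedge position, clipped to [0, 1] (continuous control)
        prev_s = self.s
        self.holding = float(np.clip(action, 0.0, 1.0))
        # one GBM step: S' = S * exp((mu - sigma^2/2) dt + sigma sqrt(dt) Z)
        z = self.rng.standard_normal()
        self.s = prev_s * np.exp((self.mu - 0.5 * self.sigma ** 2) * self.dt
                                 + self.sigma * np.sqrt(self.dt) * z)
        self.t += 1
        # wealth change of the hedge over the step, then mean-variance reward
        delta_w = self.holding * (self.s - prev_s)
        reward = delta_w - self.lam * delta_w ** 2
        tau = (self.steps - self.t) * self.dt
        return np.array([self.s, tau, self.holding]), reward, self.t >= self.steps
```

A discrete-action variant ("Disc." in the table) would simply restrict `action` to a grid of allowed hedge positions; the reward formulation is unchanged.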

## 5. Analysis

#### 5.1. RL Methods

#### 5.2. State and Action Spaces

- The ratio of the post-hedging BS Gamma to the pre-hedging BS Gamma, which lies in the range [0, 1].
- The ratio of the post-hedging BS Vega to the pre-hedging BS Vega, which lies in the range [0, 1].
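These bounded ratio features can be computed from the standard Black-Scholes Greeks. The sketch below is illustrative: the helper function and all parameter values (spot, strike, maturity, rate, volatility, and the 70% Gamma offset) are hypothetical, not drawn from any reviewed paper.

```python
import math

def bs_gamma_vega(s, k, tau, r, sigma):
    """Black-Scholes Gamma and Vega of a European option
    (identical for calls and puts)."""
    d1 = (math.log(s / k) + (r + 0.5 * sigma ** 2) * tau) / (sigma * math.sqrt(tau))
    pdf = math.exp(-0.5 * d1 ** 2) / math.sqrt(2.0 * math.pi)  # standard normal density
    gamma = pdf / (s * sigma * math.sqrt(tau))
    vega = s * pdf * math.sqrt(tau)
    return gamma, vega

# Hypothetical example: a hedge that offsets 70% of the portfolio Gamma
# leaves a post/pre ratio of 0.3, inside the [0, 1] state-feature range.
g_pre, v_pre = bs_gamma_vega(s=100.0, k=100.0, tau=0.5, r=0.01, sigma=0.2)
g_post = 0.3 * g_pre
gamma_ratio = g_post / g_pre
```

Because a hedge can at best neutralize the pre-hedging exposure, both ratios stay in [0, 1], which makes them convenient normalized state inputs.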

#### 5.3. Reward Formulations

#### 5.4. Data Generation Processes

#### 5.5. Comparison of Results

## 6. Conclusions

## Author Contributions

## Funding

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## References

- Cao, J.; Chen, J.; Hull, J.; Poulos, Z. Deep Hedging of Derivatives Using Reinforcement Learning. J. Financ. Data Sci.
**2021**, 3, 10–27. [Google Scholar] [CrossRef] - Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing Atari with Deep Reinforcement Learning. arXiv
**2013**, arXiv:1312.5602. [Google Scholar] - Black, F.; Scholes, M. The Pricing of Options and Corporate Liabilities. J. Polit. Econ.
**1973**, 81, 637–654. [Google Scholar] [CrossRef] - Hull, J. Options, Futures, and Other Derivatives, 8th ed.; Prentice Hall: Boston, MA, USA, 2012. [Google Scholar]
- Halperin, I. QLBS: Q-Learner in the Black-Scholes(-Merton) Worlds. J. Deriv.
**2017**, 28, 99–122. [Google Scholar] [CrossRef] - Leland, H.E. Option Pricing and Replication with Transactions Costs. J. Financ.
**1985**, 40, 1283–1301. [Google Scholar] [CrossRef] - Rogers, L.C.G.; Singh, S. The Cost of Illiquidity and Its Effects on Hedging. Math. Financ.
**2010**, 20, 597–615. [Google Scholar] [CrossRef] - Daly, K. Financial Volatility: Issues and Measuring Techniques. Phys. Stat. Mech. Its Appl.
**2008**, 387, 2377–2393. [Google Scholar] [CrossRef] - Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; MIT Press: Cambridge, MA, USA, 2018; ISBN 0-262-03924-9. [Google Scholar]
- Zou, L. Meta-Learning: Theory, Algorithms and Applications; Academic Press: Cambridge, MA, USA, 2022; ISBN 978-0-323-90370-7. [Google Scholar]
- François-Lavet, V.; Henderson, P.; Islam, R.; Bellemare, M.G.; Pineau, J. An Introduction to Deep Reinforcement Learning. Found. Trends Mach. Learn.
**2018**, 11, 219–354. [Google Scholar] [CrossRef] - Hambly, B.; Xu, R.; Yang, H. Recent Advances in Reinforcement Learning in Finance. Math. Financ.
**2023**, 33, 437–503. [Google Scholar] [CrossRef] - Al Mahamid, F.; Grolinger, K. Reinforcement Learning Algorithms: An Overview and Classification. In Proceedings of the 2021 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE), Online, 12–17 September 2021; pp. 1–7. [Google Scholar]
- Watkins, C.J.C.H. Learning from Delayed Rewards. Ph.D. Thesis, Cambridge University, Cambridge, UK, 1989. [Google Scholar]
- Tesauro, G. TD-Gammon, a Self-Teaching Backgammon Program, Achieves Master-Level Play. Neural Comput.
**1994**, 6, 215–219. [Google Scholar] [CrossRef] - Ruder, S. An Overview of Gradient Descent Optimization Algorithms. arXiv
**2016**, arXiv:1609.04747. [Google Scholar] - Lin, L.-J. Self-Improving Reactive Agents Based on Reinforcement Learning, Planning and Teaching. Mach. Learn.
**1992**, 8, 293–321. [Google Scholar] [CrossRef] - Fedus, W.; Ramachandran, P.; Agarwal, R.; Bengio, Y.; Larochelle, H.; Rowland, M.; Dabney, W. Revisiting Fundamentals of Experience Replay. arXiv
**2020**, arXiv:2007.06700. [Google Scholar] - Bellemare, M.G.; Dabney, W.; Munos, R. A Distributional Perspective on Reinforcement Learning. arXiv
**2017**, arXiv:1707.06887. [Google Scholar] - Lillicrap, T.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous Control with Deep Reinforcement Learning. arXiv
**2015**, arXiv:1509.02971. [Google Scholar] - Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; Riedmiller, M. Deterministic Policy Gradient Algorithms. In Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 21–26 June 2014; Volume 32, pp. 387–395. [Google Scholar]
- Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv
**2017**, arXiv:1707.06347. [Google Scholar] - Schulman, J.; Levine, S.; Moritz, P.; Jordan, M.; Abbeel, P. Trust Region Policy Optimization. arXiv
**2015**, arXiv:1502.05477. [Google Scholar] - Dayan, P.; Niv, Y. Reinforcement Learning: The Good, The Bad and The Ugly. Cogn. Neurosci.
**2008**, 18, 185–196. [Google Scholar] [CrossRef] - Arulkumaran, K.; Deisenroth, M.P.; Brundage, M.; Bharath, A.A. Deep Reinforcement Learning: A Brief Survey. IEEE Signal Process. Mag.
**2017**, 34, 26–38. [Google Scholar] [CrossRef] - Mousavi, S.S.; Schukat, M.; Howley, E. Deep Reinforcement Learning: An Overview. In Lecture Notes in Networks and Systems, Proceedings of SAI Intelligent Systems Conference (IntelliSys) 2016, London, UK, 21–22 September 2016; Bi, Y., Kapoor, S., Bhatia, R., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 426–440. [Google Scholar]
- Wang, H.; Liu, N.; Zhang, Y.; Feng, D.; Huang, F.; Li, D.; Zhang, Y. Deep Reinforcement Learning: A Survey. Front. Inf. Technol. Electron. Eng.
**2020**, 21, 1726–1744. [Google Scholar] [CrossRef] - Botvinick, M.; Ritter, S.; Wang, J.X.; Kurth-Nelson, Z.; Blundell, C.; Hassabis, D. Reinforcement Learning, Fast and Slow. Trends Cogn. Sci.
**2019**, 23, 408–422. [Google Scholar] [CrossRef] [PubMed] - Sivamayil, K.; Rajasekar, E.; Aljafari, B.; Nikolovski, S.; Vairavasundaram, S.; Vairavasundaram, I. A Systematic Study on Reinforcement Learning Based Applications. Energies
**2023**, 16, 1512. [Google Scholar] [CrossRef] - Halperin, I. The QLBS Q-Learner Goes NuQLear: Fitted Q Iteration, Inverse RL, and Option Portfolios. Quant. Financ.
**2019**, 19, 1543–1553. [Google Scholar] [CrossRef] - Kolm, P.N.; Ritter, G. Dynamic Replication and Hedging: A Reinforcement Learning Approach. J. Financ. Data Sci.
**2019**, 1, 159–171. [Google Scholar] [CrossRef] - Du, J.; Jin, M.; Kolm, P.N.; Ritter, G.; Wang, Y.; Zhang, B. Deep Reinforcement Learning for Option Replication and Hedging. J. Financ. Data Sci.
**2020**, 2, 44–57. [Google Scholar] [CrossRef] - Cannelli, L.; Nuti, G.; Sala, M.; Szehr, O. Hedging Using Reinforcement Learning: Contextual k-Armed Bandit versus Q-Learning. J. Financ. Data Sci.
**2023**, 9, 100101. [Google Scholar] [CrossRef] - Malibari, N.; Katib, I.; Mehmood, R. Systematic Review on Reinforcement Learning in the Field of Fintech. arXiv
**2023**, arXiv:2305.07466. [Google Scholar] - Charpentier, A.; Élie, R.; Remlinger, C. Reinforcement Learning in Economics and Finance. Comput. Econ.
**2023**, 62, 425–462. [Google Scholar] [CrossRef] - Singh, V.; Chen, S.-S.; Singhania, M.; Nanavati, B.; Kar, A.K.; Gupta, A. How Are Reinforcement Learning and Deep Learning Algorithms Used for Big Data Based Decision Making in Financial Industries—A Review and Research Agenda. Int. J. Inf. Manag. Data Insights
**2022**, 2, 100094. [Google Scholar] [CrossRef] - Pricope, T.V. Deep Reinforcement Learning in Quantitative Algorithmic Trading: A Review. arXiv
**2021**, arXiv:2106.00123. [Google Scholar] - Sun, S.; Wang, R.; An, B. Reinforcement Learning for Quantitative Trading. Assoc. Comput. Mach.
**2023**, 14, 1–29. [Google Scholar] [CrossRef] - Gašperov, B.; Begušić, S.; Posedel Šimović, P.; Kostanjčar, Z. Reinforcement Learning Approaches to Optimal Market Making. Mathematics
**2021**, 9, 2689. [Google Scholar] [CrossRef] - Atashbar, T.; Aruhan Shi, R. Deep Reinforcement Learning: Emerging Trends in Macroeconomics and Future Prospects; IMF Working Papers; International Monetary Fund: Washington, DC, USA, 2022; Volume 2022. [Google Scholar]
- Mosavi, A.; Faghan, Y.; Ghamisi, P.; Duan, P.; Ardabili, S.F.; Salwana, E.; Band, S.S. Comprehensive Review of Deep Reinforcement Learning Methods and Applications in Economics. Mathematics
**2020**, 8, 1640. [Google Scholar] [CrossRef] - Sato, Y. Model-Free Reinforcement Learning for Financial Portfolios: A Brief Survey. arXiv
**2019**, arXiv:1904.04973. [Google Scholar] - Liu, P. A Review on Derivative Hedging Using Reinforcement Learning. J. Financ. Data Sci.
**2023**, 5, 136–145. [Google Scholar] [CrossRef] - Buehler, H.; Gonon, L.; Teichmann, J.; Wood, B. Deep Hedging. Quant. Financ.
**2019**, 19, 1271–1291. [Google Scholar] [CrossRef] - Buehler, H.; Gonon, L.; Teichmann, J.; Wood, B.; Mohan, B.; Kochems, J. Deep Hedging: Hedging Derivatives Under Generic Market Frictions Using Reinforcement Learning, Swiss Finance Institute Research Paper No. 19-80. SSRN 2020. preprint.
- Chong, W.F.; Cui, H.; Li, Y. Pseudo-Model-Free Hedging for Variable Annuities via Deep Reinforcement Learning. Ann. Actuar. Sci.
**2023**, 17, 503–546. [Google Scholar] [CrossRef] - Mandelli, F.; Pinciroli, M.; Trapletti, M.; Vittori, E. Reinforcement Learning for Credit Index Option Hedging. arXiv
**2023**, arXiv:2307.09844. [Google Scholar] - Carbonneau, A. Deep Hedging of Long-Term Financial Derivatives. Insur. Math. Econ.
**2021**, 99, 327–340. [Google Scholar] [CrossRef] - Giurca, B.; Borovkova, S. Delta Hedging of Derivatives Using Deep Reinforcement Learning, SSRN 2021. preprint.
- Kim, H. Deep Hedging, Generative Adversarial Networks, and Beyond. arXiv
**2021**, arXiv:2103.03913. [Google Scholar] - Xu, W.; Dai, B. Delta-Gamma–Like Hedging with Transaction Cost under Reinforcement Learning Technique. J. Deriv.
**2022**, 29, 60–82. [Google Scholar] [CrossRef] - Vittori, E.; Trapletti, M.; Restelli, M. Option Hedging with Risk Averse Reinforcement Learning. In Proceedings of the ICAIF’ 20: Proceedings of the First ACM International Conference on AI in Finance, New York, NY, USA, 15–16 October 2020; Association for Computing Machinery: New York, NY, USA; 2021. [Google Scholar]
- Pham, U.; Luu, Q.; Tran, H. Multi-Agent Reinforcement Learning Approach for Hedging Portfolio Problem. Soft Comput.
**2021**, 25, 7877–7885. [Google Scholar] [CrossRef] [PubMed] - Xiao, B.; Yao, W.; Zhou, X. Optimal Option Hedging with Policy Gradient. In Proceedings of the 2021 International Conference on Data Mining Workshops (ICDMW), Auckland, New Zealand, 7–10 December 2021; pp. 1112–1119. [Google Scholar]
- Assa, H.; Kenyon, C.; Zhang, H. Assessing Reinforcement Delta Hedging; SSRN preprint, 2021. [Google Scholar]
- Murray, P.; Wood, B.; Buehler, H.; Wiese, M.; Pakkanen, M. Deep Hedging: Continuous Reinforcement Learning for Hedging of General Portfolios across Multiple Risk Aversions. In Proceedings of the ICAIF’ 22: Proceedings of the Third ACM International Conference on AI in Finance, New York, NY, USA, 2–4 November 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 361–368. [Google Scholar]
- Mikkilä, O.; Kanniainen, J. Empirical Deep Hedging. Quant. Financ.
**2023**, 23, 111–122. [Google Scholar] [CrossRef] - Cao, J.; Chen, J.; Farghadani, S.; Hull, J.; Poulos, Z.; Wang, Z.; Yuan, J. Gamma and Vega Hedging Using Deep Distributional Reinforcement Learning. Front. Artif. Intell.
**2023**, 6, 1129370. [Google Scholar] [CrossRef] [PubMed] - Zheng, C.; He, J.; Yang, C. Option Dynamic Hedging Using Reinforcement Learning. arXiv
**2023**, arXiv:2306.10743. [Google Scholar] - Fathi, A.; Hientzsch, B. A Comparison of Reinforcement Learning and Deep Trajectory Based Stochastic Control Agents for Stepwise Mean-Variance Hedging. arXiv
**2023**, arXiv:2302.07996. [Google Scholar] [CrossRef] - Ashraf, N.M.; Mostafa, R.R.; Sakr, R.H.; Rashad, M.Z. Optimizing Hyperparameters of Deep Reinforcement Learning for Autonomous Driving Based on Whale Optimization Algorithm. PLoS ONE
**2021**, 16, e0252754. [Google Scholar] [CrossRef] [PubMed] - Wang, N.; Zhang, D.; Wang, Y. Learning to Navigate for Mobile Robot with Continual Reinforcement Learning. In Proceedings of the 2020 39th Chinese Control Conference (CCC), Shenyang, China, 27–29 July 2020; pp. 3701–3706. [Google Scholar]
- Fujimoto, S.; van Hoof, H.; Meger, D. Addressing Function Approximation Error in Actor-Critic Methods. arXiv
**2018**, arXiv:1802.09477. [Google Scholar] - Van Hasselt, H.; Guez, A.; Silver, D. Deep Reinforcement Learning with Double Q-Learning. arXiv
**2015**, arXiv:1509.06461. [Google Scholar] [CrossRef] - Barth-Maron, G.; Hoffman, M.W.; Budden, D.; Dabney, W.; Horgan, D.; TB, D.; Lillicrap, T. Distributed Distributional Deterministic Policy Gradients. arXiv
**2018**, arXiv:1804.08617. [Google Scholar] - Dabney, W.; Rowland, M.; Bellemare, M.G.; Munos, R. Distributional Reinforcement Learning with Quantile Regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
- Espeholt, L.; Soyer, H.; Munos, R.; Simonyan, K.; Mnih, V.; Ward, T.; Doron, Y.; Firoiu, V.; Harley, T.; Dunning, I.; et al. IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures. arXiv
**2018**, arXiv:1802.01561. [Google Scholar] - Markowitz, H. Portfolio Selection. J. Financ.
**1952**, 7, 77–91. [Google Scholar] [CrossRef] - Rockafellar, R.T.; Uryasev, S. Conditional Value-at-Risk for General Loss Distributions. J. Bank. Financ.
**2002**, 26, 1443–1471. [Google Scholar] [CrossRef] - Hagan, P.; Kumar, D.; Lesniewski, A.; Woodward, D. Managing Smile Risk. Wilmott Mag.
**2002**, 1, 84–108. [Google Scholar] - Bartlett, B. Hedging under SABR Model. Wilmott Mag.
**2006**, 4, 2–4. [Google Scholar] - Heston, S.L. A Closed-Form Solution for Options with Stochastic Volatility with Applications to Bond and Currency Options. Rev. Financ. Stud.
**1993**, 6, 327–343. [Google Scholar] [CrossRef] - Wachowicz, E. Wharton Research Data Services (WRDS). J. Bus. Financ. Librariansh.
**2020**, 25, 184–187. [Google Scholar] [CrossRef] - Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Networks. Commun. ACM
**2020**, 63, 139–144. [Google Scholar] [CrossRef] - Whalley, A.E.; Wilmott, P. An Asymptotic Analysis of an Optimal Hedging Model for Option Pricing with Transaction Costs. Math. Financ.
**1997**, 7, 307–324. [Google Scholar] [CrossRef]

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Pickard, R.; Lawryshyn, Y.
Deep Reinforcement Learning for Dynamic Stock Option Hedging: A Review. *Mathematics* **2023**, *11*, 4943.
https://doi.org/10.3390/math11244943
