## 1. Introduction

## 2. Reinforcement Learning

#### 2.1. Fundamentals

#### 2.2. Value-Based RL Methods

#### 2.3. Policy-Based RL Methods

## 3. Similar Work

## 4. Methodology

Source | Reviewed | Method | State | Action | Reward | Train Data | Test Data | Benchmark |
---|---|---|---|---|---|---|---|---|
[5,30] | Yes | Q-Learning | $S_t, \tau, \sigma_t$ | Disc. | $\delta w_t - \lambda (\delta w_t)^2$ | GBM | GBM | BSD |
[31] | Yes | SARSA | $S_t, \tau, n_t$ | Disc. | $\delta w_t - \lambda (\delta w_t)^2$ | GBM | GBM | BSD |
[32] | Yes | DQN, Pop-Art, PPO | $S_t, \tau, n_t, K$ | Disc. | $\delta w_t - \lambda (\delta w_t)^2$ | GBM | GBM | BSD |
[52] | No | TRVO | $C_t, S_t, \Delta_t, n_t$ | Cont. | $\delta w_t$ | GBM | GBM | BSD |
[1] | Yes | DDPG | $S_t, \tau, n_t$ | Cont. | $\min \mathbb{E}[w_t] + \lambda \sqrt{\mathbb{V}[w_t]}$ | GBM, SABR | GBM, SABR | BSD, Bartlett |
[53] | Yes | IMPALA | $S_t, \tau$ | Disc. | +1, −1 | HSX, HNX | HSX, HNX | Market Return |
[49] | No | DQN, DDPG | $C_t, S_t, \Delta_t, n_t, \sigma_t$ | Disc. | $\delta w_t - \lambda (\delta w_t)^2$ | GBM, Heston | GBM, Heston, S&P | BSD, Wilmott |
[54] | No | PG w/Baseline | $S_t, \tau, n_t$ | Disc. | $\delta w_t$ | GBM, Heston | GBM, Heston, S&P | BSD |
[50] | No | Dir. Policy Search | $S_t, \tau$ | Cont. | CVaR | GBM, GAN | GBM, GAN | BSD |
[55] | No | DDPG | $S_t, \tau, n_t$ | Cont. | Payoff | GBM | GBM | BSD |
[56] | No | Actor Critic | $C_t, S_t, n_t, \tau$ | Cont. | $\delta w_t$ | Heston | Heston | BSD |
[51] | Yes | DDPG | $S_t, \tau, \Delta_t, n_t, K, \nu_t, \Gamma_t$ | Cont. | $\delta w_t - \lambda (\delta w_t)^2$ | S&P, DJIA | S&P, DJIA | BSD |
[57] | Yes | TD3 | $S_t, \tau, n_t, \sigma_t$ | Cont. | $\delta w_t$ | GBM, Heston, S&P | GBM, Heston, S&P | BSD |
[58] | Yes | D4PG-QR | $S_t, \Gamma_t^{\mathrm{port}}, \nu_t^{\mathrm{port}}, \Gamma_t^{\mathrm{hedge}}, \nu_t^{\mathrm{hedge}}$ | Cont. | CVaR and modified mean-var. | SABR | SABR | BSD, BSDG, BSDV |
[59] | No | DDPG, DDPG-U | $S_t, \tau, n_t, \sigma_t, \Delta_t, \frac{dC}{dt}$ | Cont. | $\delta w_t + \lambda \mathrm{Var}[\delta w_t]$ | GBM, S&P | GBM, S&P | BSD |
[33] | Yes | CMAB | $S_t, \tau, n_t$ | Disc. | $\delta w_t - \lambda (\delta w_t)^2$ | GBM | GBM | CMAB vs. DQN |
[60] | No | DDPG | $C_t, S_t, \Delta_t, n_t, \tau$ | Cont. | Min. $c_t$ | GBM | GBM | BSD |
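
The reward that appears most often in the table above is the per-step mean-variance form $\delta w_t - \lambda (\delta w_t)^2$. As a minimal sketch of how such a reward could be evaluated, the snippet below takes an array of wealth increments and a risk-aversion weight. The function name `mean_variance_reward` and the sample numbers are illustrative assumptions, not taken from any of the surveyed papers, and the computation of $\delta w_t$ itself (which differs between studies) is deliberately left out.

```python
import numpy as np

def mean_variance_reward(dw, lam):
    """Per-step reward dw - lam * dw**2 (hypothetical helper).

    dw  : wealth increment(s) of the hedged position over one step
    lam : non-negative risk-aversion weight (lambda in the table)
    """
    dw = np.asarray(dw, dtype=float)
    # The quadratic term penalizes large moves in either direction,
    # so a gain of +x is rewarded less than a loss of -x is punished.
    return dw - lam * dw ** 2

# Illustrative wealth increments, not data from any study.
rewards = mean_variance_reward([0.5, -0.5, 0.0], lam=0.1)
```

With `lam = 0` the expression reduces to the plain $\delta w_t$ reward used by several of the other entries in the table.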

## 5. Analysis

#### 5.1. RL Methods

#### 5.2. State and Action Spaces

- The ratio of the post-hedging BS Gamma to the pre-hedging BS Gamma lies in the range [0, 1].
- The ratio of the post-hedging BS Vega to the pre-hedging BS Vega lies in the range [0, 1].
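
One possible reading of the two constraints above is that they bound the position in the hedging instrument. The sketch below assumes a single hedging option held in quantity `n`, so that the post-hedging Greek is `port_greek + n * hedge_greek`. Under that assumption each ratio constraint becomes an interval for `n`, and the admissible actions are the intersection of the two intervals. The function names and the sample Greeks are hypothetical and are not drawn from the surveyed work.

```python
def ratio_bounds(port_greek, hedge_greek):
    """Interval of n with 0 <= (port_greek + n*hedge_greek)/port_greek <= 1.

    Assumes one hedging instrument with a nonzero Greek per unit.
    """
    if hedge_greek == 0:
        raise ValueError("hedging instrument has zero exposure to this Greek")
    if port_greek == 0:
        # Nothing to hedge; the ratio is undefined, so allow no position.
        return (0.0, 0.0)
    # ratio = 1 + n*hedge/port, so the bounds are n = 0 (ratio 1)
    # and n = -port/hedge (ratio 0, i.e. fully neutralized).
    n_full = -port_greek / hedge_greek
    return (min(0.0, n_full), max(0.0, n_full))

def feasible_hedge_interval(gamma_port, vega_port, gamma_hedge, vega_hedge):
    """Positions n that satisfy both the Gamma and the Vega ratio constraint."""
    lo_g, hi_g = ratio_bounds(gamma_port, gamma_hedge)
    lo_v, hi_v = ratio_bounds(vega_port, vega_hedge)
    # Both intervals contain n = 0, so the intersection is never empty.
    return (max(lo_g, lo_v), min(hi_g, hi_v))

# Illustrative Greeks only: a short-Gamma, short-Vega portfolio.
interval = feasible_hedge_interval(-2.0, -30.0, 0.5, 10.0)
```

In this example the Gamma constraint alone would allow up to 4 units and the Vega constraint up to 3, so the tighter Vega bound determines the feasible range. When the portfolio Gamma and Vega have opposite signs relative to the hedging option, the intersection collapses to `n = 0`.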

#### 5.3. Reward Formulations

#### 5.4. Data Generation Processes

#### 5.5. Comparison of Results

## 6. Conclusions

