# Estimating the Value-at-Risk by Temporal VAE


## Abstract


## 1. Introduction

- A signal identification analysis of the model applied to financial time series, where we show in particular that the TempVAE possesses the auto-pruning property;
- A test procedure to demonstrate that the TempVAE identifies the correct number of latent factors to adequately model the data;
- The demonstration that our newly developed TempVAE approach for the VaR estimation performs excellently and beats the benchmark models; and
- The detailed documentation of the hyperparameter choice in the appendix (the ablation study).

## 2. Comparison to Related Work

## 3. The Temporal Variational Autoencoder

#### The Auto-Pruning Property of VAEs and the Posterior Collapse

## 4. Implementation, Experiments and VaR-Estimation

#### 4.1. Description of the Used Data Sets

#### 4.2. Signal Identification

#### 4.3. Fit to Financial Data and Application to Risk Management

## 5. Conclusions

## Author Contributions

## Funding

## Data Availability Statement

## Conflicts of Interest

## Appendix A. β-Annealing, Model Implementation and Data Preprocessing

#### Appendix A.1. β-Annealing

**Figure A1.** On the left: The KL influence is adjusted via $\beta $. During training, we start with $\beta =0$ and increase it over time. We implement this by subtracting an exponentially decaying term from the final $\beta $ value, using a decay rate of $0.96$ and 20 decay steps. On the right: The evolution of the (negative) KL-divergence over time. We can see how the values start to decrease after the annealing has increased $\beta $ enough.
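A minimal sketch of such a schedule; the decay rate of 0.96 and the 20 decay steps are taken from the caption, while the exact functional form is our assumption:

```python
def beta_schedule(step, beta_final=1.0, decay_rate=0.96, decay_steps=20):
    # Subtract an exponentially decaying term from the final beta value,
    # so beta starts at 0 and approaches beta_final during training.
    return beta_final - beta_final * decay_rate ** (step / decay_steps)
```

With these defaults, `beta_schedule(0)` is exactly 0 and the schedule rises monotonically towards `beta_final`.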

#### Appendix A.2. Model Implementation

#### Appendix A.3. Data Preprocessing

- For the DAX data, 5061 observations were split at $t=3340$ into 3340 training observations and 1721 test observations.
- For the S&P500 data, 4872 observations were split at $t=3215$ into 3215 training observations and 1657 test observations.
- For the noise data, 5050 observations were split at $t=3333$ into 3333 training observations and 1717 test observations.
- For each of the oscillating PCA datasets, 9979 observations were split at $t=6586$ into 6586 training observations and 3393 test observations.
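The chronological splits above amount to a simple index cut; the array below is a hypothetical placeholder for the DAX return matrix (the number of assets is assumed):

```python
import numpy as np

# Hypothetical placeholder for the DAX return matrix (30 assets assumed)
data = np.zeros((5061, 30))
t_split = 3340
train, test = data[:t_split], data[t_split:]  # 3340 train, 1721 test rows
```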

## Appendix B. Ablation Studies

#### Appendix B.1. Preventing Posterior Collapse with β-Annealing

**Table A1.** The average number of active units given by (A2) for the two models with and without annealing. With annealing, the average number of active dimensions does not decrease as the number of latent signals increases; without annealing, it does.

Model | DAX | Noise | Osc. PCA 2 | Osc. PCA 5 | Osc. PCA 10 |
---|---|---|---|---|---|
TempVAE | 20% | 0% | 20% | 20% | 20% |
TempVAE noAnneal | 0% | 0% | 10% | 20% | 10% |

**Table A2.** The negative log-likelihood for the two models on the test set. The TempVAE outperforms the version without annealing on every relevant dataset. Only for the noise case, where there is nothing to learn, are the results comparable.

Model | DAX | Noise | Osc. PCA 2 | Osc. PCA 5 | Osc. PCA 10 |
---|---|---|---|---|---|
TempVAE | 20.29 | 31.48 | −38.51 | −1.40 | 11.07 |
TempVAE noAnneal | 27.49 | 31.44 | 3.58 | 2.60 | 13.28 |

#### Appendix B.2. Comparison to a Model with Trainable Prior Parameters

**Figure A2.**The activity statistics for the “DAX” data for the model ‘TempVAE trainPrior’. We consider sequences of size $M=21$ as input. The graph shows the activity values of statistic (21) for the $\kappa =10$ dimensional latent space.

**Figure A3.**The activity statistics for the “oscillating PCA” data for the model ‘TempVAE trainPrior’. We consider sequences of size $M=21$ as input. The graph shows the activity values of statistic (21) for the $\kappa =10$ dimensional latent space.

**Table A3.** The average exceedances for the model TempVAE and the version with trainable parameters for the prior ${p}_{\mathit{\theta}}(Z)$. TempVAE outperforms the trainable-prior version. For the average exceedances, the respective quantile level is the optimal value: for VaR95, the best value is 5, and for VaR99, the best value is 1.

Model | AE95 | AE99 |
---|---|---|
TempVAE | 4.8 | 1.3 |
TempVAE trainPrior | 8.2 | 3.1 |

**Figure A4.**VaR95 forecasts of both models ‘TempVAE’ and ‘TempVAE trainPrior’ on the test set. The VaR forecasts for the model with trainable prior are more volatile.

#### Appendix B.3. Autoregressive Structure for the Observables Distribution

**Table A4.** The negative log-likelihood for the two models TempVAE and ‘TempVAE AR’ on the test set. The TempVAE outperforms the version with an autoregressive structure, even on the Noise dataset. This is because the ‘TempVAE AR’ model overfit these data.

Model | DAX | Noise | Osc. PCA 2 | Osc. PCA 5 | Osc. PCA 10 |
---|---|---|---|---|---|
TempVAE | 20.29 | 31.48 | −38.51 | −1.40 | 11.07 |
TempVAE AR | 22.68 | 32.91 | −34.72 | 25.33 | 69.26 |

#### Appendix B.4. Using a Diagonal Covariance Matrix ${\mathbf{\Sigma}}_{t}^{r}$

**Table A5.**The average exceedances for the model TempVAE and the version with diagonal covariance matrix for ${\mathbf{\Sigma}}_{t}^{r}$. For the average exceedances, the respective quantile assumption is the optimal value. Hence, for VaR95, the best value is 5, and for VaR99, the best value is 1.

Model | AE95 | AE99 |
---|---|---|
TempVAE | 4.8 | 1.3 |
TempVAE diag | 5.3 | 1.6 |

#### Appendix B.5. Setting ${\mathit{\mu}}_{t}^{r}\equiv 0$

**Figure A5.**The activity statistics for the “DAX” data for the model ‘TempVAE zeroMean’. We consider sequences of size $M=21$ as input. The graph shows the activity values of statistic (21) for the $\kappa =10$ dimensional latent space.

**Figure A6.**The activity statistics for the “Oscillating PCA 2” data for the model ‘TempVAE zeroMean’. We consider sequences of size $M=21$ as input. The graph shows the activity values of statistic (21) for the $\kappa =10$ dimensional latent space.

**Figure A7.**The kernel density estimated distribution of the “oscillating PCA 2” data. On the left, we see the model ‘TempVAE zeroMean’ and on the right, the historical data.

#### Appendix B.6. Regularization: L2, Dropout and KL-Divergence

- ‘TempVAE’.
- ‘TempVAE det’: The KL-divergence is switched off and the bottleneck uses only a mean parameter, while the covariance is set to zero. The bottleneck is therefore deterministic and the auto-pruning is switched off.
- ‘TempVAE no dropout/L2’: Dropout and L2 regularization are switched off.
- ‘TempVAE no L2’: L2 regularization is switched off.
- ‘TempVAE no dropout’: Dropout is switched off.
- ‘TempVAE det no dropout/L2’: ‘TempVAE det’ and ‘TempVAE no dropout/L2’ combined.

**Table A6.**The average number of active units given by (A2) for the models where different regularizations are excluded.

Model | DAX | Noise | Osc. PCA 2 | Osc. PCA 5 | Osc. PCA 10 |
---|---|---|---|---|---|
TempVAE | 20% | 0% | 20% | 20% | 20% |
TempVAE det | 0% | 0% | 90% | 81% | 90% |
TempVAE no dropout/L2 | 20% | 0% | 31% | 76% | 30% |
TempVAE no L2 | 20% | 6% | 20% | 20% | 30% |
TempVAE no dropout | 20% | 0% | 31% | 41% | 40% |
TempVAE det no dropout/L2 | 90% | 90% | 90% | 80% | 90% |

**Figure A8.**The kernel density estimated distribution of the “oscillating PCA 2” data for the model ‘TempVAE no dropout’. On the left, we see the ‘TempVAE no dropout’ model and on the right, we see the historical data.

#### Appendix B.7. Encoder Dependency

**Figure A9.**The kernel density estimated distribution of the “oscillating PCA 2” data for the model ‘TempVAE backwards’. On the left, we see the ‘TempVAE backwards’ model and on the right, we see the historical data.

**Table A7.**The average number of active units given by (A2) for the two models with different encoder architecture.

Model | DAX | Noise | Osc. PCA 2 | Osc. PCA 5 | Osc. PCA 10 |
---|---|---|---|---|---|
TempVAE | 20% | 0% | 20% | 20% | 20% |
TempVAE backwards | 10% | 0% | 10% | 0% | 0% |

## Appendix C. GARCH and DCC-GARCH

- GARCH: A GARCH(1,1) model, introduced by Bollerslev (1986).
- DCC-GARCH-MVt: A Dynamic Conditional Correlation GARCH(1,1), with a multivariate distribution assumption for the error term.
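For reference, the conditional-variance recursion of the GARCH(1,1) benchmark (Bollerslev 1986) can be sketched as follows; initializing with the sample variance is a common convention, not necessarily the choice made in the paper:

```python
import numpy as np

def garch11_variance(returns, omega, alpha, beta):
    """GARCH(1,1) conditional variance recursion:
    sigma2_t = omega + alpha * r_{t-1}^2 + beta * sigma2_{t-1}."""
    sigma2 = np.empty(len(returns))
    sigma2[0] = np.var(returns)  # sample variance as a common initialization
    for t in range(1, len(returns)):
        sigma2[t] = omega + alpha * returns[t - 1] ** 2 + beta * sigma2[t - 1]
    return sigma2
```

The DCC-GARCH benchmark extends this by a time-varying correlation recursion on top of the univariate GARCH(1,1) volatilities.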

## Appendix D. Activities on the Oscillating PCA Data Sets

**Figure A10.**The activity statistics for the “oscillating PCA 5” data. We consider sequences of size $M=21$ as input. The graph shows the activity values of statistic (21) for the $\kappa =10$ dimensional latent space.

**Figure A11.**The activity statistics for the “oscillating PCA 10” data. We consider sequences of size $M=21$ as input. The graph shows the activity values of statistic (21) for the $\kappa =10$ dimensional latent space.

**Table A8.** The different fit scores for the model TempVAE on the “oscillating PCA” datasets. Comparable scores are achieved for the Portfolio NLL score. The multivariate scores are not comparable, as the structure of the data is not Gaussian.

Score | Osc. PCA 2 | Osc. PCA 5 | Osc. PCA 10 |
---|---|---|---|
TempVAE Portfolio NLL | −5.21 | −5.77 | −4.80 |

**Figure A12.**The kernel density estimated distribution of the first two dimensions ${R}_{t,1:2}$ of the “oscillating PCA 2” data. On the left, we see the TempVAE model, and on the right, we see the historical data.

**Figure A13.**The kernel density estimated distribution of the first two dimensions ${R}_{t,1:2}$ of the “oscillating PCA 5” data. On the left, we see the TempVAE model, and on the right, we see the historical data.

**Figure A14.**The kernel density estimated distribution of the first two dimensions ${R}_{t,1:2}$ of the “oscillating PCA 10” data. On the left, we see the TempVAE model, and on the right, we see the historical data.

## Appendix E. Activities on the Stock Market Data and Model Adaptions for High-Dimensional Data

**Figure A15.**The activity statistics for the “DAX” data. We consider sequences of size $M=21$ as input. The graph shows the activity values of statistic (21) for the $\kappa =10$ dimensional latent space.

**Figure A16.** The activity statistics for the “S&P500” data. We consider sequences of size $M=21$ as input. The graph shows the activity values of statistic (21) for the $\kappa =10$ dimensional latent space.

## Appendix F. VaR and Scatterplots

**Figure A17.**The VaR99 estimates for the models TempVAE, DCC-GARCH-MVN and HS on a fraction of the test data.

**Figure A18.** The VaR95 estimates for the models TempVAE, DCC-GARCH-MVN and HS on the whole test data.

**Figure A19.** The VaR99 estimates for the models TempVAE, DCC-GARCH-MVN and HS on the whole test data.

**Figure A20.**The kernel density estimated distribution of the first two dimensions ${R}_{t,1:2}$ of the “DAX” data. On the left, we see the TempVAE model and on the right, we see the historical data.

**Figure A21.**The kernel density estimated distribution of the first two dimensions ${R}_{t,1:2}$ of the “DAX” data. On the left, we see the GARCH model, and on the right, we see the historical data.

**Figure A22.**The kernel density estimated distribution of the first two dimensions ${R}_{t,1:2}$ of the “DAX” data. On the left, we see the DCC-GARCH-MVN model, and on the right, we see the historical data.

**Figure A23.**The kernel density estimated distribution of the first two dimensions ${R}_{t,1:2}$ of the “DAX” data. On the left, we see the DCC-GARCH-MVt model, and on the right, we see the historical data.

## Notes

1. i.e., the fact that drift parameters are notoriously hard to estimate while volatility parameters are much easier to obtain.

2. E.g., $p(R|Z)={\prod}_{t=1}^{T}p({R}_{t}|{R}_{1:t-1},{Z}_{1:t})$.

3. Then, the assumption is ${p}_{\mathit{\theta}}(Z)={\prod}_{t=1}^{T}{p}_{\mathit{\theta}}({Z}_{t})$, with ${p}_{\mathit{\theta}}({Z}_{t})\sim \mathcal{N}(\mathbf{0},\mathit{I})$ for all $t=1,\dots ,T$.

4. This is common practice when modeling financial data, as the increments of the log-stock prices are typically assumed to be independent (or at least uncorrelated), while the price increments are definitely not. For our later application in risk management (the VaR estimation), we therefore transform the log-return forecasts of the models to actual returns using ${R}_{t,i}^{\mathrm{nonlog}}:=\mathrm{exp}\left({R}_{t,i}\right)-1$ to model the extremes in the data adequately.

5. The benchmark models are not feasible on the “S&P500” data due to its high dimension.

6. We consider a time window of 180 days.

7. Note that the VaR estimates for GARCH models can usually be calculated analytically. However, as we apply a non-linear transform to the modeled log-returns (see expression (24)), this task cannot be performed trivially.

8. Indeed, we could get the fractions of mispredictions correct by always predicting the VaR as $-\infty $ for the initial fraction of the predictions and then equal to 1 (i.e., a total loss). Of course, this is no reasonable predictor!
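The return transform of note 4 is a one-liner in code (the input values below are purely illustrative):

```python
import numpy as np

log_returns = np.array([0.0, 0.01, -0.02])  # illustrative values
simple_returns = np.exp(log_returns) - 1    # R^nonlog = exp(R) - 1
```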

## References

- Arian, Hamidreza, Mehrdad Moghimi, Ehsan Tabatabaei, and Shiva Zamani. 2020. Encoded Value-at-Risk: A Predictive Machine for Financial Risk Management. arXiv arXiv:2011.06742. [Google Scholar] [CrossRef]
- Arimond, Alexander, Damian Borth, Andreas G. F. Hoepner, Michael Klawunn, and Stefan Weisheit. 2020. Neural Networks and Value at Risk. SSRN Electronic Journal, 20–7. [Google Scholar] [CrossRef]
- Bayer, Justin, and Christian Osendorfer. 2014. Learning Stochastic Recurrent Networks. arXiv arXiv:1411.7610v3. [Google Scholar]
- Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Berlin/Heidelberg: Springer. [Google Scholar]
- Bollerslev, Tim. 1986. Generalized Autoregressive Conditional Heteroskedasticity. Journal of Econometrics 31: 307–27. [Google Scholar] [CrossRef]
- Bowman, Samuel R., Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. 2015. Generating Sentences from a Continuous Space. In CoNLL 2016—20th SIGNLL Conference on Computational Natural Language Learning, Proceedings. Berlin: Association for Computational Linguistics (ACL), pp. 10–21. [Google Scholar]
- Burda, Yuri, Roger Grosse, and Ruslan Salakhutdinov. 2016. Importance Weighted Autoencoders. Paper presented at International Conference on Learning Representations, ICLR, San Juan, PR, USA, May 2–4. [Google Scholar]
- Chen, Luyang, Markus Pelger, and Jason Zhu. 2023. Deep Learning in Asset Pricing. Management Science. (online first). [Google Scholar] [CrossRef]
- Chen, Xiaoliang, Kin Keung Lai, and Jerome Yen. 2009. A statistical neural network approach for value-at-risk analysis. Paper presented at 2009 International Joint Conference on Computational Sciences and Optimization, CSO 2009, Sanya, China, April 24–26; vol. 2, pp. 17–21. [Google Scholar] [CrossRef]
- Cho, Kyunghyun, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. Paper presented at EMNLP 2014—2014 Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, October 25–29; pp. 1724–34. [Google Scholar] [CrossRef]
- Chung, Junyoung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C. Courville, and Yoshua Bengio. 2015. A Recurrent Latent Variable Model for Sequential Data. Advances in Neural Information Processing Systems 28: 2980–8. [Google Scholar]
- Engle, Robert. 2002. Dynamic Conditional Correlation. Journal of Business & Economic Statistics 20: 339–50. [Google Scholar] [CrossRef]
- Fatouros, Georgios, Georgios Makridis, Dimitrios Kotios, John Soldatos, Michael Filippakis, and Dimosthenis Kyriazis. 2022. DeepVaR: A framework for portfolio risk assessment leveraging probabilistic deep neural networks. Digital Finance 2022: 1–28. [Google Scholar] [CrossRef] [PubMed]
- Fraccaro, Marco, Simon Kamronn, Ulrich Paquet, and Ole Winther. 2017. A Disentangled Recognition and Nonlinear Dynamics Model for Unsupervised Learning. Advances in Neural Information Processing Systems 30: 3602–11. [Google Scholar]
- Fraccaro, Marco, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. 2016. Sequential Neural Models with Stochastic Layers. Advances in Neural Information Processing Systems 29: 2207–15. [Google Scholar]
- Ghalanos, Alexios. 2019. rmgarch: Multivariate GARCH models. R Package Version 1.3-7. Available online: https://cran.microsoft.com/snapshot/2020-04-15/web/packages/rmgarch/index.html (accessed on 1 December 2022).
- Girin, Laurent, Simon Leglaive, Xiaoyu Bie, Julien Diard, Thomas Hueber, and Xavier Alameda-Pineda. 2020. Dynamical Variational Autoencoders: A Comprehensive Review. Foundations and Trends in Machine Learning 15: 1–175. [Google Scholar] [CrossRef]
- Goyal, Anirudh, Alessandro Sordoni, Marc-Alexandre Côté, Nan Rosemary Ke, and Yoshua Bengio. 2017. Z-Forcing: Training Stochastic Recurrent Networks. Advances in Neural Information Processing Systems 2017: 6714–24. [Google Scholar]
- He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. Paper presented at The IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, December 7–13. [Google Scholar]
- Kingma, Diederik P., and Jimmy Lei Ba. 2015. Adam: A method for stochastic optimization. Paper presented at 3rd International Conference on Learning Representations, ICLR, San Diego, CA, USA, May 7–9. [Google Scholar]
- Kingma, Diederik P., and Max Welling. 2014. Auto-Encoding Variational Bayes. Paper presented at 2nd International Conference on Learning Representations, ICLR, Banff, AB, Canada, April 14–16; Technical Report. Available online: https://arxiv.org/pdf/1312.6114.pdf (accessed on 1 December 2022).
- Kingma, Diederik P., Danilo J. Rezende, Shakir Mohamed, and Max Welling. 2014. Semi-supervised Learning with Deep Generative Models. Paper presented at the 27th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, December 8–13. [Google Scholar]
- Krishnan, Rahul, Uri Shalit, and David Sontag. 2017. Structured Inference Networks for Nonlinear State Space Models. Paper presented at AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, February 4–9; vol. 31. [Google Scholar]
- Laloux, Laurent, Pierre Cizeau, Marc Potters, and Jean-Phillippe Bouchard. 2000. Random Matrix and Financial Correlations. International Journal of Theoretical and Applied Finance 3: 391–97. [Google Scholar] [CrossRef]
- Liu, Yan. 2005. Value-at-Risk Model Combination Using Artificial Neural Networks. In Emory University Working Paper Series. Atlanta: Emory University. [Google Scholar]
- Luo, Rui, Weinan Zhang, Xiaojun Xu, and Jun Wang. 2018. A Neural Stochastic Volatility Model. Paper presented at AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, February 2–7; vol. 32. [Google Scholar]
- Rezende, Danilo Jimenez, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. Paper presented at the 31st International Conference on Machine Learning, Beijing, China, June 21–26; pp. 1278–86. [Google Scholar]
- Sarma, Mandira, Susan Thomas, and Ajay Shah. 2003. Selection of Value-at-Risk Models. Journal of Forecasting 22: 337–58. [Google Scholar] [CrossRef]
- Sicks, Robert, Ralf Korn, and Stefanie Schwaar. 2021. A Generalised Linear Model Framework for β-Variational Autoencoders based on Exponential Dispersion Families. Journal of Machine Learning Research 22: 1–41. [Google Scholar]
- Xu, Xiuqin, and Ying Chen. 2021. Deep Stochastic Volatility Model. arXiv arXiv:2102.12658. [Google Scholar]

**Figure 1.**The dependency structure of the generative model. Realizations of Z influence the realizations of R as well as all future realizations of R and Z. The Gaussian distribution parameters from the MLP as well as the sampling are represented by the dashed lines. Furthermore, information of former time points is propagated through the RNN hidden states ${h}_{.}^{r}$ and ${h}_{.}^{z}$.
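The generative pass described in Figure 1 can be illustrated with a minimal numpy sketch. All layer sizes, the randomly initialized weights standing in for trained MLP/RNN parameters, and the unit-variance sampling are hypothetical placeholders, not the TempVAE's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d_r, d_z, d_h = 3, 2, 8  # observable, latent and hidden sizes (hypothetical)

# Random weights stand in for trained RNN/MLP parameters
W_hz = rng.normal(size=(d_h, d_h + d_z), scale=0.3)
W_hr = rng.normal(size=(d_h, d_h + d_z), scale=0.3)
W_mu_z = rng.normal(size=(d_z, d_h), scale=0.3)
W_mu_r = rng.normal(size=(d_r, d_h), scale=0.3)

h_z = np.zeros(d_h)  # RNN hidden state propagating latent information
h_r = np.zeros(d_h)  # RNN hidden state for the observables path
R, Z = [], []
for t in range(5):
    # Z_t is drawn from a Gaussian whose mean comes from an MLP on h^z
    z_t = W_mu_z @ h_z + rng.normal(size=d_z)
    # z_t influences R_t and, via the hidden states, all future R and Z
    h_r = np.tanh(W_hr @ np.concatenate([h_r, z_t]))
    r_t = W_mu_r @ h_r + 0.1 * rng.normal(size=d_r)
    h_z = np.tanh(W_hz @ np.concatenate([h_z, z_t]))
    R.append(r_t)
    Z.append(z_t)
```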

**Figure 2.**The first 100 steps of two oscillating signals. The two signals have different amplitude, frequency and origin.
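Two such signals could be generated as follows; the amplitudes, frequencies and phase offsets ("origins") below are made up for illustration:

```python
import numpy as np

t = np.arange(100)
# Amplitudes, frequencies and phase offsets chosen for illustration only
signal_1 = 1.0 * np.sin(2 * np.pi * 0.05 * t + 0.0)
signal_2 = 0.5 * np.sin(2 * np.pi * 0.02 * t + 1.0)
```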

**Figure 3.**The activity statistics for the “Noise” data. We consider sequences of size $M=21$ as input. The graph shows the activity values of statistic (21) for the $\kappa =10$ dimensional latent space.

**Figure 4.**The activity statistics for the “Oscillating PCA 2” data. We consider sequences of size $M=21$ as input. The graph shows the activity values of statistic (21) for the $\kappa =10$ dimensional latent space.

**Figure 5.** The VaR95 estimates for the models TempVAE, DCC-GARCH-MVN and HS on a fraction of the test data. The blue line shows the original return data.

**Table 1.** Fit scores for the models TempVAE, GARCH and two multivariate GARCH models. In all cases, lowest is best.

Model | Diagonal NLL | NLL | Portfolio NLL |
---|---|---|---|
GARCH | 22.76 | 22.84 | −0.51 |
TempVAE | 25.79 | 20.35 | −3.04 |
DCC-GARCH-MVN | 25.03 | 18.69 | −3.07 |
DCC-GARCH-MVt | 25.44 | 19.17 | −3.05 |

**Table 2.** The table displays the RLF scores (all of which are to be multiplied by ${10}^{-2}$) as well as the average number of exceedances for the VaR95 and the VaR99 within the test set. For the RLF values, lowest is best. For the average exceedances, the respective quantile level is the optimal value: for VaR95, the best value is 5, and for VaR99, the best value is 1.

Model | RLF95 | RLF99 | AE95 | AE99 |
---|---|---|---|---|
GARCH | 44.47 | 32.35 | 24.3 | 17.8 |
TempVAE | 12.64 | 5.96 | 4.8 | 1.3 |
DCC-GARCH-MVN | 13.15 | 7.10 | 5.4 | 2.1 |
DCC-GARCH-MVt | 10.23 | 4.14 | 3.9 | 0.6 |
HS | 14.57 | 7.43 | 5.1 | 1.3 |
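The average-exceedance statistic could be computed as sketched below; the exact definition used in the paper may differ, here it is simply the percentage of days on which the realized return falls below the VaR forecast:

```python
import numpy as np

def average_exceedances(returns, var_forecasts):
    # Percentage of days on which the realized portfolio return falls
    # below the VaR forecast; ideal values: 5 for VaR95, 1 for VaR99.
    return 100.0 * np.mean(returns < var_forecasts)
```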


© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Buch, R.; Grimm, S.; Korn, R.; Richert, I.
Estimating the Value-at-Risk by Temporal VAE. *Risks* **2023**, *11*, 79.
https://doi.org/10.3390/risks11050079
