Peer-Review Record

Exploiting Distributional Temporal Difference Learning to Deal with Tail Risk

by Peter Bossaerts *, Shijie Huang and Nitin Yadav
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 23 September 2020 / Revised: 15 October 2020 / Accepted: 21 October 2020 / Published: 26 October 2020
(This article belongs to the Special Issue Machine Learning in Finance, Insurance and Risk Management)

Round 1

Reviewer 1 Report

In the paper, a modification of the distributional RL (disRL) based on the MLE (maximum likelihood estimator) for the mean and the EM algorithm is proposed. This new method is described in a detailed way together with a pseudo-code algorithm. Then its efficiency is analyzed and compared with two other approaches (the TD Learning and the classical disRL) using three sets of data: two artificial ones and the third one related to a real-life financial stock. Of course, the main parts (like disRL, MLE, EM) are not new, but the introduced connection of these methods seems very interesting. I have only a few minor remarks concerning the reviewed paper:

  1. The proposed approach is analyzed using a two-arm contextual bandit problem. Please provide some insight if and how this new method can be generalized if the considered financial problem is more complicated.
  2. As noted by the authors, their method achieves even 100% effectiveness (e.g., in the Gaussian environment, the leptokurtic environment II). Please explain if this very high level is related to the quality of the method, simplicity of the considered problem or some other factors.
  3. The reviewed paper is focused almost only on the RL (and similar) approaches to the widely-known problem of the leptokurtic environment in financial data. Therefore some other methods, models and algorithms should be also mentioned in the introduction together with respective references, e.g.,

Glasserman P., Monte Carlo Methods in Financial Engineering

Jurczenko E., Maillet B., The Four-moment Capital Asset Pricing Model: Between Asset Pricing and Asset Allocation

Nowak P., Romaniuk M., A fuzzy approach to option pricing in a Levy process setting

Scherer M., Rachev S.T., Kim Y.S., Fabozzi F.J., Approximation of skewed and leptokurtic return distributions

Simonato J.-G., GARCH processes with skewed and leptokurtic innovations: Revisiting the Johnson S_u case

Author Response

We thank the reviewer for the constructive comments. We hope that our way of addressing them improved the paper, to the reviewer's satisfaction. Here is how we responded.

Q 1. The proposed approach is analyzed using a two-arm contextual bandit problem. Please provide some insight if and how this new method can be generalized if the considered financial problem is more complicated.

We addressed this issue by adding a paragraph after line 77:

"The framework may appear simple, but it is generic. The contextual two-arm bandit can readily be extended to handle more involved, and hence, more realistic situations, by augmenting the state vector or the number of states, and/or increasing the number of control options beyond two (arms). The bandit does not have to be stationary; it can change randomly over time, to form a so-called restless bandit. Continuous states and large state spaces can be accommodated through deep learning (Mnih et al. 2013; Moravčík et al. 2017; Silver et al. 2016). We chose a simple, canonical setting, in order to illustrate how easy it is for traditional RL to fail under leptokurtosis, and how powerful our version of distributional RL is to address the failure."

Q 2. As noted by the authors, their method achieves even 100% effectiveness (e.g., in the Gaussian environment, the leptokurtic environment II). Please explain if this very high level is related to the quality of the method, simplicity of the considered problem or some other factors.

See response to Q1. Simplicity allows us to verify that it is the method that causes improvement rather than other factors, such as ability of distributional RL to deal with large state spaces.
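
For readers of this record, the canonical setting discussed in the responses to Q1 and Q2 can be illustrated with a small sketch. It is not the authors' code: it runs a traditional tabular TD-style learner (not the distributional method of the paper) on a contextual bandit with Student t rewards. The function names, mean rewards, degrees of freedom, learning rate and exploration rate are all assumptions chosen for illustration. The number of states and arms are parameters, which mirrors the extension described in the response to Q1.

```python
import math
import random


def student_t(rng, df):
    """Draw one Student t variate as Z / sqrt(chi2 / df)."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(df / 2.0, 2.0)  # chi-square with df degrees of freedom
    return z / math.sqrt(chi2 / df)


def run_bandit(n_states=2, n_arms=2, df=3.0, steps=5000,
               alpha=0.1, epsilon=0.1, seed=0):
    """Epsilon-greedy TD-style learning on a contextual bandit (illustrative only)."""
    rng = random.Random(seed)
    # Hypothetical mean rewards: in state s, arm (s mod n_arms) pays 1, others pay 0.
    means = [[1.0 if a == s % n_arms else 0.0 for a in range(n_arms)]
             for s in range(n_states)]
    q = [[0.0] * n_arms for _ in range(n_states)]
    for _ in range(steps):
        s = rng.randrange(n_states)              # context drawn uniformly at random
        if rng.random() < epsilon:
            a = rng.randrange(n_arms)            # explore
        else:
            a = max(range(n_arms), key=lambda i: q[s][i])  # exploit
        r = means[s][a] + student_t(rng, df)     # leptokurtic reward noise
        q[s][a] += alpha * (r - q[s][a])         # one-step prediction-error update
    return q, means


if __name__ == "__main__":
    q, means = run_bandit()
    for s, row in enumerate(q):
        print("state", s, "estimated values", [round(v, 2) for v in row])
```

Raising `n_states` or `n_arms` changes nothing else in the loop, which is the sense in which the simple setting is generic. Lowering `df` makes the reward tails heavier, so single outliers can move the value estimates more.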

Q 3. The reviewed paper is focused almost only on the RL (and similar) approaches to the widely-known problem of the leptokurtic environment in financial data. Therefore some other methods, models and algorithms should be also mentioned in the introduction together with respective references, e.g., (5 references)

It is important to distinguish between RL and the other procedures you mention. Those procedures focus on estimation and prediction, and not on action/decision/control. Our problem is how leptokurtosis affects discovery and maintenance of the optimal policy, not merely how it affects estimation and prediction. As such, RL stands out; it is fundamentally different.

We altered the paragraph after (old) line 84 and added a new paragraph to make sure the reader appreciates this:

"One could argue that there are other solutions to the problems leptokurtosis causes. This could be GARCH or stochastic volatility modelling (Simonato 2012), Monte Carlo approaches (Glasserman 2013), moment methods (Jurczenko and Maillet 2012), or parametric return process approximation and modelling (Nowak and Romaniuk 2013; Scherer et al. 2012). These procedures would effectively filter the data before application of RL. But it is known that mere filtering, while alleviating the impact of leptokurtosis, does not eliminate tail risk. Indeed, the filtered risk appears to be best modeled with a t distribution (which we use here), or the stable Paretian distribution (for which variance does not even exist). These distributions still entail tail risk. See, e.g., Curto et al. (2009); Simonato (2012).

More importantly, none of the aforementioned procedures deals with control, which is what RL is made for. The procedures aim only at forecasting. As such, they do not provide a good comparison to RL. RL is engaged in forecasting as well, but its prediction subserves the goal of finding the best actions. The problem we address here is whether leptokurtosis affects discovery and maintenance of the optimal policy, not merely that of finding the best prediction of the future reward."
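
The reviewer's summary mentions a maximum likelihood estimator for the mean computed with the EM algorithm, and the revised text above models the filtered risk with a t distribution. As a rough illustration only, and not the authors' algorithm, the standard EM iteration for the location of a Student t distribution could look like the sketch below. It assumes the degrees of freedom and scale (`df`, `scale`) are known; the function name `t_location_em` and the sample data are hypothetical.

```python
import statistics


def t_location_em(xs, df=3.0, scale=1.0, iters=100, tol=1e-10):
    """EM estimate of the location of a Student t sample (df and scale assumed known)."""
    mu = statistics.median(xs)  # robust starting point
    for _ in range(iters):
        # E-step: observations far from the current location get small weights.
        weights = [(df + 1.0) / (df + ((x - mu) / scale) ** 2) for x in xs]
        # M-step: the new location is the weighted average of the data.
        new_mu = sum(w * x for w, x in zip(weights, xs)) / sum(weights)
        converged = abs(new_mu - mu) < tol
        mu = new_mu
        if converged:
            break
    return mu


if __name__ == "__main__":
    sample = [1.9, 2.0, 2.1, 2.0, 50.0]  # made-up data with one extreme outlier
    print("sample mean:", sum(sample) / len(sample))
    print("t location estimate:", t_location_em(sample))
```

With the made-up sample, the ordinary average is pulled toward the outlier, while the reweighted estimate stays near the bulk of the data. That is the intuition behind replacing the sample mean when rewards are leptokurtic.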

 

Reviewer 2 Report

Dear authors,


The article you have presented is very interesting and pertinent research and is dealing with a topic of high interest. The paper addresses distributional temporal difference learning to deal with tail risk. When it comes to the paper itself, the paper is very well structured and presented in all its parts. The aim of the paper as well as the research gap is clearly stated. The chapters have a corresponding order. The literature review is carried out very well and most of the relevant papers are cited. Discussion and conclusions are supported by the results.
It was a pleasure to review your manuscript. My sincere congratulations to the authors on an excellent paper.

Author Response

Thank you for your heads-up! 

Reviewer 3 Report

It is a very good and well-written paper -- a real joy to read. A state of the art too. Well done.

Author Response

Thank you for the heads-up!
