Article

An Iteration Algorithm for American Options Pricing Based on Reinforcement Learning

School of Mathematics, Jilin University, Changchun 130012, China
Symmetry 2022, 14(7), 1324; https://doi.org/10.3390/sym14071324
Submission received: 30 May 2022 / Revised: 18 June 2022 / Accepted: 23 June 2022 / Published: 27 June 2022

Abstract

In this paper, we present an iteration algorithm for the pricing of American options based on reinforcement learning. At each iteration, the method approximates the expected discounted payoff of a family of stopping times and produces a new family that is closer to optimal. In the convergence analysis, a finite sample bound for the algorithm is derived. The algorithm is evaluated on a multi-dimensional Black-Scholes model and a symmetric stochastic volatility model; the numerical results show that our algorithm is accurate and efficient for pricing high-dimensional American options.

1. Introduction

The pricing of American options is an important issue in quantitative finance and stochastic processes [1,2]. Many popular derivative products in various financial sectors are of the American type, and can be exercised at any time before maturity. Therefore, considerable effort has been spent to obtain accurate and efficient methods for pricing American options (see, e.g., Hull [3]). When the dimension of the option is small, methods based on partial differential equations [4] and binomial trees [5] can be applied. However, the calculation costs of these methods increase exponentially as the dimension gets larger, thus making them inefficient for pricing options on many underlying assets, such as the widely used high-dimensional symmetric stochastic volatility models [6].
To treat American options on multi-dimensional underlying assets, many pricing methods based on Monte Carlo simulation have been proposed. The most popular are the regression-based methods of Longstaff and Schwartz [7] and Tsitsiklis and Roy [8]. Through a backward iteration scheme, these methods approximate the continuation value and a feasible exercise policy using techniques such as linear regression [7], neural networks [9], Gaussian process regression [10] and kernel ridge regression [11], all of which produce lower price bounds for American options. Dual approaches for American options were developed by Rogers [12] and Haugh and Kogan [13]; these methods produce upper price bounds. However, in these methods, the continuation value at each exercise date is approximated by a separate function, and only the data at that date are used.
A different strand of the literature focuses on finding the optimal exercise policy from Monte Carlo samples [14,15]. These approaches consider a parametric class of exercise regions and maximize an estimate of the value function within the parametric class. Through this optimization, all the sample data are used to approximate the optimal exercise policy. Recently, Bayer et al. [16] and Becker et al. [17] considered randomized stopping times in approximating the optimal exercise regions. However, in these approaches, the resulting loss function may be non-concave and exhibit isolated local optima; thus, it is difficult to find the global optimum [18,19].
Reinforcement learning, especially the policy iteration method, has achieved empirical success in high-dimensional control problems [20,21,22]. The basic idea of policy iteration is to compute the evaluation function of the current policy in each iteration, after which an improved policy is computed from that function for the next iteration [23]. The pricing of American options can be seen as a control problem, but there are only two possible actions, and they do not influence the underlying process. Reinforcement learning methods have already been applied to optimal stopping problems. Tsitsiklis and Roy [8] introduced the fitted Q-iteration for American option pricing based on the least-squares method. Yu and Bertsekas [24] proposed an algorithm based on projected value iteration and established its convergence for finite-state models. Li et al. [25] considered least squares policy iteration in the pricing of American options and gave a finite-time bound for the algorithm. Becker et al. [26] developed an algorithm related to policy optimization for high-dimensional optimal stopping problems. Chen et al. [27] applied Zap Q-learning to the optimal stopping problem and established consistency of the algorithm for linear function approximation. Herrera et al. [28] considered a fitted Q-iteration based on randomized neural networks for optimal stopping problems. However, most of these methods are direct adaptations of existing reinforcement learning algorithms, and there is a lack of analysis on the accuracy and efficiency of reinforcement learning in pricing high-dimensional American options.
In this paper, we propose an iteration algorithm for American options based on reinforcement learning. In each iteration, the expected discounted payoff of a family of stopping times is approximated by regression, so that the data of all dates are used to improve the approximation at all dates; an improved family of stopping times is then obtained from the constructed function. After this procedure, an approximate optimal exercise policy is obtained. To provide theoretical guarantees, we develop a finite sample error bound for the algorithm. In the numerical experiments, we consider data generated by the multi-dimensional Black–Scholes model and a symmetric stochastic volatility model. The results show that (a) our algorithm is accurate and efficient in pricing high-dimensional American options; (b) by using a single function of time and the underlying process, the continuation values can be approximated with a fraction of the parameters; (c) the methods based on reinforcement learning outperform the state-of-the-art methods in the pricing of American options.
The paper is organized as follows. In Section 2, we introduce the problem of pricing American options and illustrate the relationship between continuation values and stopping times. The iteration algorithm is described in Section 3. In Section 4, convergence rates of the algorithm are discussed. Numerical experiments with high-dimensional American options on the multi-dimensional Black–Scholes model and a symmetric stochastic volatility model are given in Section 5. Finally, we conclude in Section 6. All proofs are given in Appendix A.

2. Pricing of American Options and Stopping Times

In this section, we introduce the pricing of American options. Let $\{X_t, 0 \le t \le T\}$ be an $\mathbb{R}^d$-valued Markov process defined on a filtered probability space with a risk-neutral measure $P$. We assume that the process records all relevant financial variables. In practice, the price of the American option is approximated by the price of a Bermudan option [11], which can be exercised only at discrete time points $0 < t_1 < \cdots < t_N = T$. For $0 \le n \le N$, we write $t_n$ simply as $n$ in the following. We assume that the risk-free discount factor between consecutive time points is a constant $\gamma \in (0, 1)$. The price $V_n(x)$ of the option at $n = 1, \ldots, N$ is given by the optimal stopping problems
$$V_n(x) = \sup_{\tau \in \mathcal{T}_n} E\big[\gamma^{\tau - n} g(X_\tau) \,\big|\, X_n = x\big], \qquad (1)$$
where $g(x)$ is the non-negative payoff function and $\mathcal{T}_n$ denotes the set of stopping times $\tau$ with $\tau \ge n$. The price at time 0 is given by $V_0(x) = E[\gamma V_1(X_1) \mid X_0 = x]$. We assume that $g$ satisfies $\|g(X_n)\|_\infty \le B$ for $n = 1, \ldots, N$, where $\|\cdot\|_p$ denotes the $L^p$-norm and $B > 0$. As we will see later, this assumption can be relaxed.
The optimal stopping problem (1) is solved by a family of optimal stopping times $\tau_n^*$, $n = 1, \ldots, N$, that satisfies the consistency property $\tau_n^* > n \Rightarrow \tau_n^* = \tau_{n+1}^*$ [16]. By the dynamic programming principle, $\tau_n^*$ can be determined from the continuation values [29]. The continuation value in state $x$ at time $n$ is $C^*(N, x) \equiv 0$ for $n = N$ and
$$C^*(n, x) = E\big[\gamma^{\tau_{n+1}^* - n} g(X_{\tau_{n+1}^*}) \,\big|\, X_n = x\big], \qquad (2)$$
for $n = 0, \ldots, N-1$. Then, $\tau_n^*$ can be written as
$$\tau_n^* = \inf\big\{\, i \ge n : g(X_i) \ge C^*(i, X_i) \,\big\}. \qquad (3)$$
In other words, the option should be exercised as soon as the current payoff is at least the continuation value. A meaningful family of suboptimal stopping times can be obtained by replacing the continuation values with a good approximation.
Motivated by the fitted policy iteration method in reinforcement learning [23], we consider iteratively approximating the family of optimal stopping times. In this paper, we deal with consistent families of stopping times $\tau_n$, $n = 1, \ldots, N$; these satisfy $n \le \tau_n \le N$ with $\tau_N = N$ and $\tau_n > n \Rightarrow \tau_n = \tau_{n+1}$. We define the function $C^\tau : \{0, 1, \ldots, N-1\} \times \mathbb{R}^d \to \mathbb{R}$ by
$$C^\tau(n, x) = E\big[\gamma^{\tau_{n+1} - n} g(X_{\tau_{n+1}}) \,\big|\, X_n = x\big]. \qquad (4)$$
This function represents the expected discounted payoff achieved when $X_n = x$, the option is not exercised at $n$, and the stopping time $\tau_{n+1}$ is followed thereafter. Conversely, given $C : \{0, 1, \ldots, N-1\} \times \mathbb{R}^d \to \mathbb{R}$, we define a new family of stopping times $\tilde{\tau}_n$ by
$$\tilde{\tau}_N = N, \qquad \tilde{\tau}_n = \begin{cases} n, & \text{if } g(X_n) \ge C(n, X_n), \\ \tilde{\tau}_{n+1}, & \text{otherwise}. \end{cases} \qquad (5)$$
It is immediate that the obtained family of stopping times $\tilde{\tau}_n$, $1 \le n \le N$, is consistent. The following result shows that the exercise policy $\tilde{\tau}_1$ constructed from $C^\tau$ yields an expected discounted payoff at least as high as that of the original policy $\tau_1$.
Theorem 1.
For any family of consistent stopping times $\tau_n$, $n = 1, \ldots, N$, the stopping time $\tilde{\tau}_1$ constructed from $C^\tau$ by (5) satisfies
$$E\big[\gamma^{\tilde{\tau}_1} g(X_{\tilde{\tau}_1})\big] \ge E\big[\gamma^{\tau_1} g(X_{\tau_1})\big]. \qquad (6)$$
By Theorem 1, if we can approximate $C^\tau(n, X_n)$ for a family of stopping times $\tau_n$, $1 \le n \le N-1$, we can construct an improved family of stopping times that is closer to the optimal family.
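To make the improvement step concrete, the construction (5) can be carried out along simulated paths by a single backward pass. The following is a minimal sketch, assuming NumPy arrays of simulated states and a given approximation cont(n, x) of the continuation value; the function and argument names (improve_stopping_times, payoff, cont) are illustrative and not taken from the paper.

import numpy as np

def improve_stopping_times(paths, payoff, cont, gamma):
    # Construct the improved stopping times (5) from an approximate continuation value cont(n, x).
    # paths: array of shape (M, N+1, d) with paths[:, n] = X_n along M simulated paths.
    # payoff: function g applied row-wise; cont(n, x) returns one value per path.
    M, Np1, _ = paths.shape
    N = Np1 - 1
    tau = np.full(M, N)                           # tilde-tau_N = N
    for n in range(N - 1, 0, -1):                 # backward pass over n = N-1, ..., 1
        stop = payoff(paths[:, n]) >= cont(n, paths[:, n])   # exercise iff g(X_n) >= C(n, X_n)
        tau = np.where(stop, n, tau)              # otherwise keep tilde-tau_{n+1}
    cashflows = gamma ** tau * payoff(paths[np.arange(M), tau])
    return tau, cashflows                         # stopping dates tilde-tau_1 and discounted payoffs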

3. Iteration Algorithm

In this section, we propose a two-step iteration algorithm for American options. In the evaluation step, $C^\tau(n, x)$ of a family of stopping times is estimated. In the improvement step, the estimated function is used to construct an improved family of stopping times by (5).
We first define the approximation architecture used for estimating $C^\tau(n, x)$ in the algorithm. In contrast to the regression-based algorithms, we use a single function that takes time as an additional argument throughout the computation; let $\mathcal{F} := \{ f : \mathbb{R}^{d+1} \to \mathbb{R} \}$ denote the chosen set of real-valued functions. We also introduce a truncation operator for the approximation architecture. Let $\psi_B$ denote the truncation operator at level $B$, defined by
$$\psi_B f = \begin{cases} f, & \text{if } |f| \le B, \\ \operatorname{sign}(f) \cdot B, & \text{otherwise}. \end{cases} \qquad (7)$$
For a set of functions $\mathcal{F}$, we set $\psi_B \mathcal{F} = \{ \psi_B f : f \in \mathcal{F} \}$.
To obtain a good approximation of $C^\tau$, it is natural to consider minimizing, over candidate functions $f \in \mathcal{F}$,
$$\frac{1}{N} \sum_{n=0}^{N-1} E\Big[ \big( f(n, X_n) - E\big[\gamma^{\tau_{n+1} - n} g(X_{\tau_{n+1}}) \,\big|\, X_n\big] \big)^2 \Big]. \qquad (8)$$
To obtain a practical procedure, we consider a sample-based approximation to (8) in the algorithm.
To approximate the optimal stopping times numerically, our method is initialized with an arbitrary $C^0 \in \mathcal{F}$ and the corresponding stopping times $\tau_n^1$, $n = 1, \ldots, N$, constructed from (5). For $j = 1, \ldots, J-1$, we generate a set of Monte Carlo paths $(x_0^i, \ldots, x_N^i)$, $i = 1, \ldots, M$, of the process $X_n$; this sample set is independent of all previously generated paths. If the stopping time $\tau_n^j$ is applied at time $n$, the discounted payoff along the $i$-th simulated path is $\gamma^{\tau_n^{j,i} - n} g(x_{\tau_n^{j,i}}^i)$. To obtain the approximation of $C^{\tau^j}$, we minimize the empirical counterpart of (8). Let $\hat{f}_j \in \mathcal{F}$ satisfy
$$\hat{f}_j = \arg\min_{f \in \mathcal{F}} \frac{1}{NM} \sum_{i=1}^{M} \sum_{n=0}^{N-1} \Big( f(n, x_n^i) - \gamma^{\tau_{n+1}^{j,i} - n} g\big(x_{\tau_{n+1}^{j,i}}^i\big) \Big)^2, \qquad (9)$$
and we use the truncation $\hat{C}^{\tau^j} = \psi_B \hat{f}_j$ as the approximation. In the next iteration, an improved family of stopping times $\tau_n^{j+1}$, $n = 1, \ldots, N$, is obtained by (5). Starting from an arbitrary family of consistent stopping times and computing inductively, we finally construct the exercise policy $\tau_1^J$. Note that the optimization problem (9) is easily solved for linear function spaces such as those spanned by polynomial basis functions. For other approximation architectures such as neural networks, gradient-based methods can be applied to find the minimum, since (9) is differentiable with respect to $f$.
To estimate $V_0$, we generate another independent set of Monte Carlo sample paths $(x_0^i, \ldots, x_N^i)$, $i = 1, \ldots, M'$, and approximate $V_0$ by the average
$$\hat{V}_0 = \frac{1}{M'} \sum_{i=1}^{M'} \gamma^{\tau_1^{J,i}} g\big(x_{\tau_1^{J,i}}^i\big). \qquad (10)$$
Our method is summarized in Algorithm 1. In the next section, we discuss the convergence of the algorithm and derive a finite sample bound.
Algorithm 1 Iteration algorithm for pricing American options.
Require: the numbers of sample paths $M$, $M'$, the number of iterations $J$ and the function space $\mathcal{F}$
Ensure: the approximate optimal stopping time $\tau_1^J$, the price estimate $\hat{V}_0$
1: Generate sample paths of the underlying process;
2: Generate a random function $C^0 \in \mathcal{F}$;
3: for $j = 1, \ldots, J-1$ do
4:     Obtain $\tau_n^j$, $n = 1, \ldots, N$, using $\hat{C}^{\tau^{j-1}}$ from (5);
5:     Construct $\hat{f}_j$ by the regression optimization problem
       $\hat{f}_j = \arg\min_{f \in \mathcal{F}} \frac{1}{NM} \sum_{i=1}^{M} \sum_{n=0}^{N-1} \big( f(n, x_n^i) - \gamma^{\tau_{n+1}^{j,i} - n} g(x_{\tau_{n+1}^{j,i}}^i) \big)^2;$
6:     Obtain the approximation $\hat{C}^{\tau^j} = \psi_B \hat{f}_j$;
7: end for
8: Obtain $\tau_1^J$ using $\hat{C}^{\tau^{J-1}}$ from (5);
9: Generate another independent set of sample paths of the underlying process;
10: Calculate the option price by (10);
11: return $\tau_1^J$ and $\hat{V}_0$;
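For readers who prefer code, the following is a minimal sketch of Algorithm 1 with a linear approximation architecture given by polynomial features of $(n, x)$. It assumes a user-supplied path sampler and payoff function; all names (poly_features, fit_iteration, price_estimate) are ours, and the sketch is illustrative rather than the exact implementation used in the experiments.

import numpy as np
from itertools import combinations_with_replacement

def poly_features(n, x, degree=2):
    # Polynomial features of (n, x) up to the given degree, including an intercept.
    # x has shape (M, d); the time index n is prepended as an extra coordinate.
    z = np.column_stack([np.full(len(x), n, dtype=float), x])
    cols = [np.ones(len(x))]
    for deg in range(1, degree + 1):
        for idx in combinations_with_replacement(range(z.shape[1]), deg):
            cols.append(np.prod(z[:, list(idx)], axis=1))
    return np.column_stack(cols)

def fit_iteration(sample_paths, payoff, gamma, J=5, degree=2, B=None, seed=0):
    # Minimal sketch of Algorithm 1 with a polynomial regression architecture.
    # sample_paths() must return a fresh array of shape (M, N+1, d) on each call,
    # so that every iteration uses an independent Monte Carlo sample.
    rng = np.random.default_rng(seed)
    theta = None  # coefficients of the current approximation; C^0 is drawn at random below

    def cont(n, x):
        c = poly_features(n, x, degree) @ theta
        return np.clip(c, -B, B) if B is not None else c   # truncation psi_B

    for _ in range(J - 1):
        paths = sample_paths()
        M, Np1, _ = paths.shape
        N = Np1 - 1
        if theta is None:
            theta = rng.normal(size=poly_features(0, paths[:1, 0], degree).shape[1])
        # Improvement step: stopping times tau_{n+1} along each path, from the current C-hat.
        tau_next = np.empty((N, M), dtype=int)
        running = np.full(M, N)
        for n in range(N - 1, -1, -1):
            if n + 1 < N:
                stop = payoff(paths[:, n + 1]) >= cont(n + 1, paths[:, n + 1])
                running = np.where(stop, n + 1, running)
            tau_next[n] = running
        # Evaluation step: the single regression (9) over all dates and paths at once.
        rows, targets = [], []
        for n in range(N):
            tau = tau_next[n]
            rows.append(poly_features(n, paths[:, n], degree))
            targets.append(gamma ** (tau - n) * payoff(paths[np.arange(M), tau]))
        theta, *_ = np.linalg.lstsq(np.vstack(rows), np.concatenate(targets), rcond=None)
    return cont

def price_estimate(eval_paths, payoff, gamma, cont):
    # Lower-biased price estimate (10) on an independent set of evaluation paths.
    M, Np1, _ = eval_paths.shape
    N = Np1 - 1
    tau = np.full(M, N)
    for n in range(N - 1, 0, -1):                # exercise policy tau_1^J from (5)
        stop = payoff(eval_paths[:, n]) >= cont(n, eval_paths[:, n])
        tau = np.where(stop, n, tau)
    return np.mean(gamma ** tau * payoff(eval_paths[np.arange(M), tau]))

Given a sampler for one of the models in Section 5 and a payoff function, price_estimate(eval_paths, payoff, gamma, fit_iteration(sampler, payoff, gamma)) would return the lower-biased estimate (10); the number of iterations J is a hyperparameter, as discussed in Section 5.3.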

4. Convergence Analysis

In this section, we consider the convergence of the algorithm introduced in Section 3. Before describing the main result, we present some necessary definitions. To measure the complexity of a functional class, we introduce the notion of covering numbers. For a class of functions $\mathcal{F}$ and points $z_1^M := (z_1, \ldots, z_M)$, the covering number $\mathcal{N}_1(\epsilon, \mathcal{F}, z_1^M)$ is the minimal number $Q \in \mathbb{N}$ such that there exist functions $f_1, \ldots, f_Q$ with the property that for every $f \in \mathcal{F}$ there is a $q \in \{1, \ldots, Q\}$ such that
$$\frac{1}{M} \sum_{i=1}^{M} \big| f(z_i) - f_q(z_i) \big| < \epsilon. \qquad (11)$$
For $f : \mathbb{R}^{d+1} \to \mathbb{R}$, we introduce the norm $\|\cdot\|$ by
$$\|f\|^2 = \frac{1}{N} \sum_{n=0}^{N-1} \big\| f(n, X_n) \big\|_2^2. \qquad (12)$$
Denote by $\tau(f)$ the family of stopping times obtained from (5) with respect to $f$. Let $E_M$ stand for the expectation conditional on the samples used to approximate the functions $C^\tau$. We now state our main result on the convergence of the algorithm.
Theorem 2.
Assume that $B < \infty$. Fix the set of admissible functions $\mathcal{F}$ and a positive integer $M$. For $j = 1, \ldots, J$, define $\tau^j$ by (5) and $\hat{C}^{\tau^j}$ by (9). Then
$$E_M \big\| C^* - C^{\tau^J} \big\| \le c_1 \log M \sup_{0 \le n \le N-1} \sup_{x_1^M \in (\{n\} \times \mathbb{R}^d)^M} \frac{\log^{1/2} \mathcal{N}_1\big(\tfrac{1}{MB}, \psi_B \mathcal{F}, x_1^M\big)}{M^{1/2}} + c_2 \sup_{f \in \mathcal{F}} \inf_{f' \in \mathcal{F}} \big\| f' - C^{\tau(f)} \big\| + c_3 \gamma^{J/2} B, \qquad (13)$$
where $c_1, c_2, c_3 > 0$.
There are three terms in the bound (13). The first is the estimation error caused by the sampling step in the approximation. The second is the approximation error of $\mathcal{F}$ with respect to the functions $C^{\tau^j}(n, x)$ appearing in the iteration. The third comes from the error remaining after running the iteration algorithm for $J$ iterations; this term decays at a geometric rate.
Remark 1.
Under some mild conditions, $\log \mathcal{N}_1(\epsilon, \psi_B \mathcal{F}, x_1^M)$ is bounded by $\log M \cdot \nu_{\psi_B \mathcal{F}^+}$, where $\nu_{\psi_B \mathcal{F}^+}$ is the VC-dimension of $\psi_B \mathcal{F}^+$ (see the definition in Kohler and Langer [30]). The theorem immediately applies to linear, finite-dimensional approximation architectures, since the corresponding VC-dimension is bounded [31]. For neural networks with $L$ hidden layers, $\lambda$ neurons per layer and the ReLU activation function, a bound of $c_4 \lambda L \log \lambda$ with $c_4 > 0$ on the corresponding VC-dimension is also known [30]. Hence, the theorem applies to deep neural networks as well.
The next corollary provides a bound on the difference between $V_0$ and $E[\hat{V}_0]$.
Corollary 1.
Assume that $B < \infty$ and $X_0 = x_0$ a.s. for some $x_0 \in \mathbb{R}^d$. Fix the set of admissible functions $\mathcal{F}$ and a positive integer $M$. Define $\tau_1^J$ by (5) with respect to $\hat{C}^{\tau^{J-1}}$ and define $\bar{V}_0 := E\big[\gamma^{\tau_1^J} g(X_{\tau_1^J})\big]$. Then
$$E_M \big| V_0 - \bar{V}_0 \big| \le c_5 \log M \sup_{0 \le n \le N-1} \sup_{x_1^M \in (\{n\} \times \mathbb{R}^d)^M} \frac{\log^{1/2} \mathcal{N}_1\big(\tfrac{1}{MB}, \psi_B \mathcal{F}, x_1^M\big)}{M^{1/2}} + c_6 \sup_{f \in \mathcal{F}} \inf_{f' \in \mathcal{F}} \big\| f' - C^{\tau(f)} \big\| + c_7 \gamma^{J/2} B, \qquad (14)$$
where $c_5, c_6, c_7 > 0$ and $0 < \gamma < 1$.

5. Numerical Examples

In this section, the performance of the algorithm is tested on various American options with bounded and unbounded payoffs.
The computations were carried out on a laptop with an Intel i5-10300H 2.50 GHz CPU and an NVIDIA GeForce GTX 1650 GPU.
To evaluate our method, we consider two function spaces: linear spaces spanned by polynomial basis functions, and neural networks. Polynomial basis functions were used in Longstaff and Schwartz [7] and are a popular choice for regression-based methods. To include interaction terms in the basis, we consider the classical polynomial basis functions up to the third order. A neural network approximates nonlinear functions by successive compositions of affine transformations and nonlinear activation functions; this model has shown good performance for pricing American options, especially in high dimensions [32].
We compared our method with two state-of-the-art methods: the least squares Monte Carlo (LSM) method proposed in Longstaff and Schwartz [7] and deep optimal stopping (DOS) proposed in Becker et al. [26]. To allow a fair comparison of accuracy and efficiency, we used the same number of sample paths and time steps for all methods. Furthermore, we used the same network architecture in DOS and in our method with a neural network, except for the activation function. There were 3 hidden layers and $40 + d$ neurons per hidden layer in the networks. The activation function in our method was the leaky ReLU function.
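For concreteness, the architecture described above can be written as the following minimal PyTorch sketch (3 hidden layers, 40 + d neurons per hidden layer, leaky ReLU); the class name and constructor arguments are ours, and the snippet only indicates the architecture, not the exact training code.

import torch.nn as nn

class ContinuationNet(nn.Module):
    # f(n, x): a single network taking the time index and the state as one input vector.
    def __init__(self, d, state_dim=None, hidden_layers=3):
        super().__init__()
        state_dim = d if state_dim is None else state_dim   # e.g. 2*d for the Heston model
        width = 40 + d                                       # 40 + d neurons per hidden layer
        layers, in_dim = [], state_dim + 1                   # +1 for the time input
        for _ in range(hidden_layers):
            layers += [nn.Linear(in_dim, width), nn.LeakyReLU()]
            in_dim = width
        layers.append(nn.Linear(in_dim, 1))                  # scalar continuation value
        self.net = nn.Sequential(*layers)

    def forward(self, t_x):                                  # t_x: (batch, state_dim + 1)
        return self.net(t_x).squeeze(-1)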

5.1. Multi-Dimensional Black–Scholes Model

In this subsection, we consider high-dimensional American options in the Black–Scholes model. Assume the risk-neutral dynamics of the asset prices $S_t = (S_t^1, \ldots, S_t^d)$ are given by
$$S_t = S_0 \exp\Big( \big(r - \delta - \tfrac{1}{2}\sigma^2\big) t + \sigma W_t \Big), \qquad (15)$$
where $S_0$ is the initial value, $r$ is the risk-free interest rate, $\delta$ is the dividend rate, and $W_t$ is a $d$-dimensional Brownian motion with covariance matrix $\rho$. The parameters were set to $T = 1$ and $N = 10$, and we used $r = 5\%$ and $\delta = 0$ for each asset. We assumed that $\rho$ was a diagonal matrix and that all the assets had the same volatility $\sigma = 0.2$ and initial value $S_0 = 100$.
We considered three types of high-dimensional options: max call options with payoff $\big(\max_{1 \le i \le d} S_t^i - K\big)^+$, arithmetic put options with payoff $\big(K - \frac{1}{d} \sum_{i=1}^{d} S_t^i\big)^+$, and geometric put options with payoff $\big(K - \big(\prod_{i=1}^{d} S_t^i\big)^{1/d}\big)^+$. We considered $d \in \{5, 10, 20, 30, 40, 60, 80, 100\}$ and set $K = 100$. In the computation we set $M = 20{,}000$, $M' = 100{,}000$, $J = 5$, and we used the payoff as a regressor. For American options with unbounded payoff, we omitted the truncation step.
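As an illustration of this experimental setup, the training paths and the max call payoff can be generated as in the sketch below, which assumes independent assets (diagonal ρ) as stated above; the function names are ours and the code is a simplified stand-in for the actual experiment scripts.

import numpy as np

def simulate_bs_paths(M, N, d, T=1.0, S0=100.0, r=0.05, delta=0.0, sigma=0.2, seed=0):
    # Simulate M paths of d independent Black-Scholes assets on N equally spaced exercise dates.
    rng = np.random.default_rng(seed)
    dt = T / N
    drift = (r - delta - 0.5 * sigma ** 2) * dt
    dW = rng.standard_normal((M, N, d)) * np.sqrt(dt)
    log_increments = drift + sigma * dW
    log_paths = np.concatenate([np.zeros((M, 1, d)), np.cumsum(log_increments, axis=1)], axis=1)
    return S0 * np.exp(log_paths)                 # shape (M, N+1, d), time 0 in the first slot

def max_call_payoff(K=100.0):
    # Payoff (max_i S^i - K)^+ applied row-wise to an array of shape (..., d).
    return lambda x: np.maximum(x.max(axis=-1) - K, 0.0)

gamma = np.exp(-0.05 * 1.0 / 10)                  # one-period discount factor for r = 5%, T = 1, N = 10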
Table 1, Table 2 and Table 3 report the pricing results. The column Ref. provides the benchmark value computed by Premia (https://www.rocq.inria.fr/mathfi/Premia, accessed on 29 September 2021), a freely available software package for derivative pricing and hedging. The columns Time report the computation times in seconds. For large $d$, the computation times of LSM and of our method with a polynomial basis exceeded a reasonable amount of time, so those results were omitted. The columns labeled LSM-2, LSM-3, LFPI and NNFPI correspond, respectively, to LSM with a second-order polynomial basis, LSM with a third-order polynomial basis, our method with a second-order polynomial basis, and our method with a neural network. Because $\hat{V}_0$ is a lower-biased price estimate, higher price estimates imply better experimental performance.
We observed that both LFPI and NNFPI provided accurate results. LFPI was relatively fast in low-dimensional cases, while the computation time of NNFPI increased little with $d$. For the linear approximation architecture, the results show that our method outperformed LSM with respect to the polynomial degree required for accurate pricing; this advantage is crucial when the number of underlying assets is large. Regarding computation time, although LFPI is slower than LSM with the same polynomial degree, LSM needed larger polynomial degrees for accurate results, so our method returned accurate results faster than LSM in the high-dimensional examples. In the case of the nonlinear architecture, NNFPI generally outperformed DOS, which has a similar network architecture.

5.2. Stochastic Volatility Model

This subsection is devoted to American options in the Heston model [33], a well-known symmetric stochastic volatility model in option pricing. The evolution of the underlying asset $S_t$ and the instantaneous variance $\nu_t$ is described by the stochastic differential equations
$$dS_t = r S_t \, dt + \sqrt{\nu_t}\, S_t \, dW_t^1, \qquad d\nu_t = \kappa(\theta - \nu_t)\, dt + \xi \sqrt{\nu_t} \, dW_t^2, \qquad (16)$$
where $r \ge 0$, $\kappa > 0$, $\theta > 0$, $\xi > 0$, and $W_t^1$ and $W_t^2$ are $d$-dimensional Brownian motions. To have a reliable price reference for high-dimensional American options, we used the same parameter settings as in the cases studied in Herrera et al. [28]. Max call options and geometric put options were considered in the experiments. Specifically, we chose the parameters $T = 1$, $N = 10$, $\kappa = 2$, $\theta = 0.01$, $\xi = 0.2$, the initial stock price $S_0 = 100$, the initial variance $\nu_0 = 0.01$, $r = 0\%$ for max call options, and $r = 2\%$ for geometric put options. We assumed that the dynamics of different assets were independent and that the correlation between the Brownian motions driving the price process and the variance process of a single asset was $\rho = -0.3$. In the computation we used $M = 20{,}000$, $M' = 100{,}000$, $J = 5$, and the same network architecture as in the last subsection.
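For reference, paths of a single asset-variance pair under (16) with the parameters above can be simulated with a full-truncation Euler scheme as sketched below; repeating the simulation independently for each of the d assets gives the model used in the experiments. The function name and the scheme are our illustrative choices, not a prescription from the paper.

import numpy as np

def simulate_heston_paths(M, N, T=1.0, S0=100.0, v0=0.01, r=0.0,
                          kappa=2.0, theta=0.01, xi=0.2, rho=-0.3, seed=0):
    # Full-truncation Euler scheme for one asset-variance pair of the Heston model (16).
    rng = np.random.default_rng(seed)
    dt = T / N
    S = np.full((M, N + 1), S0)
    v = np.full((M, N + 1), v0)
    for n in range(N):
        z1 = rng.standard_normal(M)
        z2 = rho * z1 + np.sqrt(1.0 - rho ** 2) * rng.standard_normal(M)
        v_pos = np.maximum(v[:, n], 0.0)          # truncate the variance at zero
        S[:, n + 1] = S[:, n] * np.exp((r - 0.5 * v_pos) * dt + np.sqrt(v_pos * dt) * z1)
        v[:, n + 1] = v[:, n] + kappa * (theta - v_pos) * dt + xi * np.sqrt(v_pos * dt) * z2
    return S, v                                   # both of shape (M, N+1)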
To obtain a Markovian model, we included both the price and the variance as inputs. In practice, stochastic volatility models need to be calibrated from observed data; our algorithm can then be applied to sample data generated from the calibrated models. In the experiments, we tested our method on max call options and geometric put options with $K = 100$. The results are reported in Table 4 and Table 5 and are similar to those under the multi-dimensional Black–Scholes model. The pricing results obtained from LFPI and NNFPI were close to the reference values. The LFPI computation time was smaller than that of NNFPI in low-dimensional situations, but NNFPI was more efficient for $d \ge 10$. NNFPI was generally the most accurate method for high-dimensional American options, and its computation time was close to that of DOS, especially for large $d$. For the linear approximation architecture, the results show that LFPI outperformed LSM. Note that our method had far fewer trainable parameters.

5.3. Convergence with Respect to the Number of Iterations

In this subsection, we study numerically the convergence of our method with respect to the hyperparameter $J$. We considered max call options under the Black–Scholes model with the same parameter settings as in Section 5.1. Figure 1 presents the results for LFPI with $d = 5$ and NNFPI with $d = 20$; the errors were computed with respect to the final value. Our method converged quickly with respect to the number of iterations, and the results confirm that limiting the number of iterations to 5 is reasonable.

6. Conclusions

We introduced a novel method for American options based on reinforcement learning that is accurate and efficient in high-dimensional situations. We provided a convergence analysis of the algorithm in terms of the number of training samples and iterations. We also considered the applicability of the algorithm and carried out comprehensive numerical experiments in multivariate Black–Scholes and Heston models. The results showed that (a) NNFPI achieved good performance in high-dimensional situations, and LFPI outperformed LSM in the pricing of American options; (b) the algorithm had high efficiency and accuracy under different model assumptions; (c) our algorithm had a fast convergence rate with respect to the number of iterations; (d) the continuation values can be approximated with a fraction of the parameters by using a single function of time and the underlying process. To summarize, the results reconfirm that reinforcement learning methods can surpass backward induction methods for the pricing of high-dimensional American options.
There are several directions for future research. First, upper price bounds and confidence intervals could be constructed based on the approximate optimal stopping time. Furthermore, it would be desirable to remove the condition that the payoff function is bounded in $L^\infty$; one idea is to use the truncation technique in Zanger [34]. The corresponding proofs are technically more challenging and are left for future research.

Funding

This work was partially supported by the Fundamental Research Funds for the Central Universities, JLU (93K172020K26).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

We thank the editor and reviewers for their valuable suggestions and comments which greatly improved the article.

Conflicts of Interest

The author declares no conflict of interest.

Appendix A. Proofs

Proof of Theorem 1. 
Let $I_A$ be the indicator function of a set $A$. Because $\tilde{\tau}_n = n\, I_{\{g(X_n) \ge C^\tau(n, X_n)\}} + \tilde{\tau}_{n+1}\, I_{\{g(X_n) < C^\tau(n, X_n)\}}$, we have
$$\begin{aligned}
E\big[\gamma^{\tau_n} g(X_{\tau_n}) \mid X_n = x\big]
&= \gamma^n g(x)\, I_{\{\tau_n = n\}} + E\big[\gamma^{\tau_{n+1}} g(X_{\tau_{n+1}}) \mid X_n = x\big]\, I_{\{\tau_n > n\}} \\
&\le \gamma^n g(x)\, I_{\{\tilde{\tau}_n = n\}} + E\big[\gamma^{\tau_{n+1}} g(X_{\tau_{n+1}}) \mid X_n = x\big]\, I_{\{\tilde{\tau}_n > n\}} \\
&= \gamma^n g(x)\, I_{\{\tilde{\tau}_n = n\}} + E\big[\gamma^{\tau_{n+1}} g(X_{\tau_{n+1}})\, I_{\{\tilde{\tau}_n > n\}} \mid X_n = x\big] \\
&\le \gamma^n g(x)\, I_{\{\tilde{\tau}_n = n\}} + E\Big[\gamma^{n+1} g(X_{n+1})\, I_{\{\tilde{\tau}_n = n+1\}} + E\big[\gamma^{\tau_{n+2}} g(X_{\tau_{n+2}})\, I_{\{\tilde{\tau}_n > n+1\}} \mid X_{n+1}\big] \,\Big|\, X_n = x\Big] \\
&= \gamma^n g(x)\, I_{\{\tilde{\tau}_n = n\}} + E\Big[\gamma^{n+1} g(X_{n+1})\, I_{\{\tilde{\tau}_n = n+1\}} + \gamma^{\tau_{n+2}} g(X_{\tau_{n+2}})\, I_{\{\tilde{\tau}_n > n+1\}} \,\Big|\, X_n = x\Big].
\end{aligned}$$
By induction, we have
$$E\big[\gamma^{\tau_1} g(X_{\tau_1})\big] \le E\Big[ \sum_{n=1}^{N-1} \gamma^n g(X_n)\, I_{\{\tilde{\tau}_1 = n\}} + \gamma^N g(X_N)\, I_{\{\tilde{\tau}_1 > N-1\}} \,\Big|\, X_0 = x_0 \Big] = E\Big[ \sum_{n=1}^{N-1} \gamma^n g(X_n)\, I_{\{\tilde{\tau}_1 = n\}} + \gamma^N g(X_N)\, I_{\{\tilde{\tau}_1 = N\}} \,\Big|\, X_0 = x_0 \Big] = E\big[\gamma^{\tilde{\tau}_1} g(X_{\tilde{\tau}_1})\big].$$
The last step follows from (5). □
Our aim is to derive a bound on the difference between $C^{\tau^J}$ and $C^*$. To this end, we define the operator $T^\tau$ by
$$T^\tau C(n, x) = \gamma\, E\big[ g(X_{n+1})\, I_{\{\tau_{n+1} = n+1\}} + C(n+1, X_{n+1})\, I_{\{\tau_{n+1} > n+1\}} \,\big|\, X_n = x\big].$$
It is easy to see that $T^\tau$ is a contraction operator with constant $\gamma$, and hence it has a unique fixed point $C^\tau(n, x)$,
$$T^\tau C^\tau = C^\tau.$$
For $j = 1, \ldots, J$, we define $\varepsilon_j = \hat{C}^{\tau^j} - T^{\tau^j} \hat{C}^{\tau^j}$.
Lemma A1.
Let $J$ be a positive integer. Then, for the sequence of functions $\hat{C}^{\tau^j}$ with $\|\hat{C}^{\tau^j}\|_\infty \le B$, $0 \le j < J$, and the errors $\varepsilon_j$, the following inequality holds
$$\big\| C^* - C^{\tau^J} \big\| \le c_8 \max_{1 \le j < J} \|\varepsilon_j\| + \gamma^{J/2} B,$$
where $c_8 > 0$ and $0 < \gamma < 1$.
Proof. 
We interpret $(n, X_n)$ as a random variable, where $n$ is uniformly distributed on $\{0, \ldots, N-1\}$. We have
$$\big\| C^* - C^{\tau^J} \big\| = \Big( \frac{1}{N} \sum_{n=0}^{N-1} E\big[ \big( C^*(n, X_n) - C^{\tau^J}(n, X_n) \big)^2 \big] \Big)^{1/2} = \big\| C^*(n, X_n) - C^{\tau^J}(n, X_n) \big\|_2.$$
By Lemma 12 in [23], the conclusion follows. □
Lemma A2.
Assume that $B < \infty$ and that $\tau_n$, $n = 1, \ldots, N$, is an arbitrary family of consistent stopping times. Let $(x_0^i, \ldots, x_N^i)$, $i = 1, \ldots, M$, be a set of Monte Carlo paths of the process $X_n$. Let $\hat{f}$ be defined by
$$\hat{f} = \arg\min_{f \in \mathcal{F}} \frac{1}{NM} \sum_{i=1}^{M} \sum_{n=0}^{N-1} \Big( f(n, x_n^i) - \gamma^{\tau_{n+1}^i - n} g\big(x_{\tau_{n+1}^i}^i\big) \Big)^2,$$
and set $\hat{C}^\tau = \psi_B \hat{f}$. Then we have
$$E_M \big\| \hat{C}^\tau - T^\tau \hat{C}^\tau \big\|^2 \le c_9 (\log M)^2 \sup_{0 \le n \le N-1} \sup_{x_1^M \in (\{n\} \times \mathbb{R}^d)^M} \frac{\log \mathcal{N}_1\big(\tfrac{1}{MB}, \psi_B \mathcal{F}, x_1^M\big)}{M} + 2 \inf_{f \in \mathcal{F}} \| f - C^\tau \|^2,$$
where $c_9 > 0$.
Proof. 
For any $a, b \in \mathbb{R}$, we have $(a + b)^2 \le 2a^2 + 2b^2$. Thus, we have
$$\begin{aligned}
&\big\| \hat{C}^\tau(n, X_n) - T^\tau \hat{C}^\tau(n, X_n) \big\|_2^2 \\
&\quad = \big\| \hat{C}^\tau(n, X_n) - \gamma E\big[ g(X_{n+1}) I_{\{\tau_{n+1} = n+1\}} + \hat{C}^\tau(n+1, X_{n+1}) I_{\{\tau_{n+1} > n+1\}} \mid X_n \big] \big\|_2^2 \\
&\quad = \big\| \hat{C}^\tau(n, X_n) - \gamma E\big[ g(X_{n+1}) I_{\{\tau_{n+1} = n+1\}} + C^\tau(n+1, X_{n+1}) I_{\{\tau_{n+1} > n+1\}} \mid X_n \big] \\
&\qquad\quad + \gamma E\big[ C^\tau(n+1, X_{n+1}) I_{\{\tau_{n+1} > n+1\}} - \hat{C}^\tau(n+1, X_{n+1}) I_{\{\tau_{n+1} > n+1\}} \mid X_n \big] \big\|_2^2 \\
&\quad \le 2 \big\| \hat{C}^\tau(n, X_n) - \gamma E\big[ g(X_{n+1}) I_{\{\tau_{n+1} = n+1\}} + C^\tau(n+1, X_{n+1}) I_{\{\tau_{n+1} > n+1\}} \mid X_n \big] \big\|_2^2 \\
&\qquad + 2 \big\| \gamma E\big[ C^\tau(n+1, X_{n+1}) I_{\{\tau_{n+1} > n+1\}} - \hat{C}^\tau(n+1, X_{n+1}) I_{\{\tau_{n+1} > n+1\}} \mid X_n \big] \big\|_2^2 \\
&\quad \le 2 \big\| \hat{C}^\tau(n, X_n) - C^\tau(n, X_n) \big\|_2^2 + 2 \big\| C^\tau(n+1, X_{n+1}) - \hat{C}^\tau(n+1, X_{n+1}) \big\|_2^2,
\end{aligned}$$
where the last inequality follows from $T^\tau C^\tau = C^\tau$ and Jensen's inequality. Thus, averaging over $n$, we have
$$\big\| \hat{C}^\tau - T^\tau \hat{C}^\tau \big\|^2 = \frac{1}{N} \sum_{n=0}^{N-1} \big\| \hat{C}^\tau(n, X_n) - T^\tau \hat{C}^\tau(n, X_n) \big\|_2^2 \le \frac{4}{N} \sum_{n=0}^{N-1} \big\| \hat{C}^\tau(n, X_n) - C^\tau(n, X_n) \big\|_2^2 = 4 \big\| \hat{C}^\tau - C^\tau \big\|^2.$$
Since $C^\tau(n, x) = E\big[\gamma^{\tau_{n+1} - n} g(X_{\tau_{n+1}}) \mid X_n = x\big]$, we have the following error decomposition:
$$\begin{aligned}
\big\| \hat{C}^\tau - C^\tau \big\|^2 = {} & \frac{1}{N} \sum_{n=0}^{N-1} \bigg[ E\Big( \hat{C}^\tau(n, X_n) - \gamma^{\tau_{n+1} - n} g\big(X_{\tau_{n+1}}\big) \Big)^2 - E\Big( C^\tau(n, X_n) - \gamma^{\tau_{n+1} - n} g\big(X_{\tau_{n+1}}\big) \Big)^2 \\
& \qquad - \frac{2}{M} \sum_{i=1}^{M} \Big( \big( \hat{C}^\tau(n, x_n^i) - \gamma^{\tau_{n+1}^i - n} g\big(x_{\tau_{n+1}^i}^i\big) \big)^2 - \big( C^\tau(n, x_n^i) - \gamma^{\tau_{n+1}^i - n} g\big(x_{\tau_{n+1}^i}^i\big) \big)^2 \Big) \bigg] \\
& + \frac{2}{NM} \sum_{n=0}^{N-1} \sum_{i=1}^{M} \Big( \big( \hat{C}^\tau(n, x_n^i) - \gamma^{\tau_{n+1}^i - n} g\big(x_{\tau_{n+1}^i}^i\big) \big)^2 - \big( C^\tau(n, x_n^i) - \gamma^{\tau_{n+1}^i - n} g\big(x_{\tau_{n+1}^i}^i\big) \big)^2 \Big). \qquad (\mathrm{A1})
\end{aligned}$$
Using Lemma 1 in [35], the first term in (A1) is bounded by
$$c_{10} (\log M)^2 \sup_{0 \le n \le N-1} \sup_{x_1^M \in (\{n\} \times \mathbb{R}^d)^M} \frac{\log \mathcal{N}_1\big(\tfrac{1}{MB}, \psi_B \mathcal{F}, x_1^M\big)}{M}, \qquad (\mathrm{A2})$$
for some $c_{10} > 0$. Because $|\psi_B a - b| \le |a - b|$ holds for $|b| \le B$, the second term in (A1) is bounded by
$$\inf_{f \in \mathcal{F}} \frac{2}{NM} \sum_{n=0}^{N-1} \sum_{i=1}^{M} \Big( \big( f(n, x_n^i) - \gamma^{\tau_{n+1}^i - n} g\big(x_{\tau_{n+1}^i}^i\big) \big)^2 - \big( C^\tau(n, x_n^i) - \gamma^{\tau_{n+1}^i - n} g\big(x_{\tau_{n+1}^i}^i\big) \big)^2 \Big).$$
If we choose $\tilde{f} \in \mathcal{F}$ such that
$$\| \tilde{f} - C^\tau \|^2 \le \inf_{f \in \mathcal{F}} \| f - C^\tau \|^2 + \frac{1}{M},$$
we can conclude
$$\begin{aligned}
& E_M \Big[ \inf_{f \in \mathcal{F}} \frac{1}{NM} \sum_{n=0}^{N-1} \sum_{i=1}^{M} \big( f(n, x_n^i) - \gamma^{\tau_{n+1}^i - n} g(x_{\tau_{n+1}^i}^i) \big)^2 \Big] - E_M \Big[ \frac{1}{NM} \sum_{n=0}^{N-1} \sum_{i=1}^{M} \big( C^\tau(n, x_n^i) - \gamma^{\tau_{n+1}^i - n} g(x_{\tau_{n+1}^i}^i) \big)^2 \Big] \\
&\quad \le E_M \Big[ \frac{1}{NM} \sum_{n=0}^{N-1} \sum_{i=1}^{M} \big( \tilde{f}(n, x_n^i) - \gamma^{\tau_{n+1}^i - n} g(x_{\tau_{n+1}^i}^i) \big)^2 \Big] - E_M \Big[ \frac{1}{NM} \sum_{n=0}^{N-1} \sum_{i=1}^{M} \big( C^\tau(n, x_n^i) - \gamma^{\tau_{n+1}^i - n} g(x_{\tau_{n+1}^i}^i) \big)^2 \Big] \\
&\quad = \frac{1}{N} \sum_{n=0}^{N-1} E\big( \tilde{f}(n, X_n) - \gamma^{\tau_{n+1} - n} g(X_{\tau_{n+1}}) \big)^2 - \frac{1}{N} \sum_{n=0}^{N-1} E\big( C^\tau(n, X_n) - \gamma^{\tau_{n+1} - n} g(X_{\tau_{n+1}}) \big)^2 \\
&\quad = \| \tilde{f} - C^\tau \|^2 \le \inf_{f \in \mathcal{F}} \| f - C^\tau \|^2 + \frac{1}{M}. \qquad (\mathrm{A3})
\end{aligned}$$
The conclusion follows from (A2) and (A3). □
Proof of Theorem 2. 
Fix M , J > 0 . Lemma A1 gives
$$E_M \big\| C^* - C^{\tau^J} \big\| \le c_{11}\, E_M \Big[ \max_{0 \le j < J} \|\varepsilon_j\| \Big] + \gamma^{J/2} B, \qquad (\mathrm{A4})$$
where $c_{11} > 0$. For any $a, b > 0$, we have $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$. By Jensen's inequality and Lemma A2, we conclude that for any fixed integer $0 \le j < J$,
$$E_M \|\varepsilon_j\| \le \big( E_M \|\varepsilon_j\|^2 \big)^{1/2} \le c_{12} \log M \sup_{0 \le n \le N-1} \sup_{x_1^M \in (\{n\} \times \mathbb{R}^d)^M} \frac{\log^{1/2} \mathcal{N}_1\big(\tfrac{1}{MB}, \psi_B \mathcal{F}, x_1^M\big)}{M^{1/2}} + c_{13} \inf_{f \in \mathcal{F}} \| f - C^{\tau^j} \|,$$
for $c_{12} > 0$, $c_{13} > 0$. Combining this with (A4), we get
$$E_M \big\| C^* - C^{\tau^J} \big\| \le c_1 \log M \sup_{0 \le n \le N-1} \sup_{x_1^M \in (\{n\} \times \mathbb{R}^d)^M} \frac{\log^{1/2} \mathcal{N}_1\big(\tfrac{1}{MB}, \psi_B \mathcal{F}, x_1^M\big)}{M^{1/2}} + c_2 \sup_{f \in \mathcal{F}} \inf_{f' \in \mathcal{F}} \| f' - C^{\tau(f)} \| + c_3 \gamma^{J/2} B. \qquad \square$$
Proof of Corollary 1. 
By the dynamic programming principle, we have
$$V_0 = C^*(0, x_0).$$
By the definition, we have
$$\bar{V}_0 = C^{\tau^J}(0, x_0).$$
We have the following error bound,
$$\begin{aligned}
E_M \big| V_0 - \bar{V}_0 \big| &= E_M \big| C^*(0, x_0) - C^{\tau^J}(0, x_0) \big| \le c_{14}\, E_M \big\| C^* - C^{\tau^J} \big\| \\
&\le c_5 \log M \sup_{0 \le n \le N-1} \sup_{x_1^M \in (\{n\} \times \mathbb{R}^d)^M} \frac{\log^{1/2} \mathcal{N}_1\big(\tfrac{1}{MB}, \psi_B \mathcal{F}, x_1^M\big)}{M^{1/2}} + c_6 \sup_{f \in \mathcal{F}} \inf_{f' \in \mathcal{F}} \| f' - C^{\tau(f)} \| + c_7 \gamma^{J/2} B,
\end{aligned}$$
where $c_{14} > 0$ and $0 < \gamma < 1$. □

References

1. Andriyanov, N. Forming a taxi service order price using neural networks with multi-parameter training. J. Phys. Conf. Ser. 2020, 1661, 012165.
2. Ullrich, T. On the Autoregressive Time Series Model Using Real and Complex Analysis. Forecasting 2021, 3, 44.
3. Hull, J.C. Options, Futures, and Other Derivatives; Pearson Education: New York, NY, USA, 2018.
4. Achdou, Y.; Pironneau, O. Computational Methods for Option Pricing; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2005.
5. Cox, J.C.; Ross, S.A.; Rubinstein, M. Option pricing: A simplified approach. J. Financ. Econ. 1979, 7, 229–263.
6. Casas, I.; Veiga, H. Exploring option pricing and hedging via volatility asymmetry. Comput. Econ. 2021, 57, 1015–1039.
7. Longstaff, F.A.; Schwartz, E.S. Valuing American Options by Simulation: A Simple Least-Squares Approach. Rev. Financ. Stud. 2001, 14, 113–147.
8. Tsitsiklis, J.; Roy, B.V. Regression methods for pricing complex American-style options. IEEE Trans. Neural Netw. 2001, 12, 694–703.
9. Kohler, M.; Krzyżak, A.; Todorovic, N. Pricing of High-Dimensional American Options by Neural Networks. Math. Financ. 2010, 20, 383–410.
10. Goudenege, L.; Molent, A.; Zanette, A. Machine learning for pricing American options in high-dimensional Markovian and non-Markovian models. Quant. Financ. 2020, 20, 573–591.
11. Hu, W.; Zastawniak, T. Pricing high-dimensional American options by kernel ridge regression. Quant. Financ. 2020, 20, 851–865.
12. Rogers, L.C. Monte Carlo valuation of American options. Math. Financ. 2002, 12, 271–286.
13. Haugh, M.B.; Kogan, L. Pricing American options: A duality approach. Oper. Res. 2004, 52, 258–270.
14. Andersen, L. A simple approach to the pricing of Bermudan swaptions in the multifactor LIBOR market model. J. Comput. Financ. 2000, 3, 5–32.
15. Belomestny, D. On the rates of convergence of simulation-based optimization algorithms for optimal stopping problems. Ann. Appl. Probab. 2011, 21, 215–239.
16. Bayer, C.; Belomestny, D.; Hager, P.; Pigato, P.; Schoenmakers, J. Randomized optimal stopping algorithms and their convergence analysis. SIAM J. Financ. Math. 2021, 12, 1201–1225.
17. Becker, S.; Cheridito, P.; Jentzen, A.; Welti, T. Solving high-dimensional optimal stopping problems using deep learning. Eur. J. Appl. Math. 2021, 32, 470–514.
18. García, D. Convergence and biases of Monte Carlo estimates of American option prices using a parametric exercise rule. J. Econ. Dyn. Control 2003, 27, 1855–1879.
19. Bayer, C.; Tempone, R.; Wolfers, S. Pricing American options by exercise rate optimization. Quant. Financ. 2020, 20, 1749–1760.
20. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018.
21. Wang, R.; Salakhutdinov, R.R.; Yang, L. Reinforcement learning with general value function approximation: Provably efficient approach via bounded eluder dimension. Adv. Neural Inf. Process. Syst. 2020, 33, 6123–6135.
22. Wang, R.; Du, S.S.; Yang, L.; Salakhutdinov, R.R. On reward-free reinforcement learning with linear function approximation. Adv. Neural Inf. Process. Syst. 2020, 33, 17816–17826.
23. Antos, A.; Szepesvári, C.; Munos, R. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Mach. Learn. 2008, 71, 89–129.
24. Yu, H.; Bertsekas, D.P. Q-learning algorithms for optimal stopping based on least squares. In Proceedings of the European Control Conference, Kos, Greece, 2–5 July 2007.
25. Li, Y.; Szepesvari, C.; Schuurmans, D. Learning exercise policies for American options. In Proceedings of the Conference on Artificial Intelligence and Statistics, Clearwater Beach, FL, USA, 16–18 April 2009.
26. Becker, S.; Cheridito, P.; Jentzen, A. Deep optimal stopping. J. Mach. Learn. Res. 2019, 20, 74.
27. Chen, S.; Devraj, A.M.; Bušić, A.; Meyn, S. Zap Q-Learning for optimal stopping. In Proceedings of the American Control Conference, Denver, CO, USA, 1–3 July 2020.
28. Herrera, C.; Krach, F.; Ruyssen, P.; Teichmann, J. Optimal Stopping via Randomized Neural Networks. arXiv 2021, arXiv:2104.13669.
29. Glasserman, P. Monte Carlo Methods in Financial Engineering; Springer Science & Business Media: New York, NY, USA, 2003; Volume 53.
30. Kohler, M.; Langer, S. On the rate of convergence of fully connected very deep neural network regression estimates. arXiv 2019, arXiv:1908.11133.
31. Zanger, D.Z. Quantitative error estimates for a least-squares Monte Carlo algorithm for American option pricing. Financ. Stoch. 2013, 17, 503–534.
32. Beck, C.; Weinan, E.; Jentzen, A. Machine learning approximation algorithms for high-dimensional fully nonlinear partial differential equations and second-order backward stochastic differential equations. J. Nonlinear Sci. 2019, 29, 1563–1619.
33. Heston, S.L. A closed-form solution for options with stochastic volatility with applications to bond and currency options. Rev. Financ. Stud. 1993, 6, 327–343.
34. Zanger, D.Z. General error estimates for the Longstaff–Schwartz least-squares Monte Carlo algorithm. Math. Oper. Res. 2020, 45, 923–946.
35. Bauer, B.; Kohler, M. On deep learning as a remedy for the curse of dimensionality in nonparametric regression. Ann. Stat. 2019, 47, 2261–2285.
Figure 1. Convergence with respect to the number of iterations.
Table 1. Pricing results for d-dimensional max call options.
d    Ref.    LSM-2    Time   LSM-3    Time    LFPI    Time    DOS     Time   NNFPI   Time
5    29.63   29.635   2.3    29.722   2.0     29.732  2.6     27.483  12.8   29.683  19.1
10   38.96   39.062   5.9    39.131   9.2     39.219  7.1     35.772  11.9   39.097  18.9
20   47.84   47.279   14.4   47.912   91.6    48.000  23.2    45.839  12.8   47.968  22.8
30   52.91   50.082   28.7   52.786   667.2   52.965  53.8    51.301  13.1   52.991  23.6
40   56.37   56.175   90.6   –        –       56.347  138.7   56.132  14.4   56.496  24.3
60   61.24   –        –      –        –       –       –       60.416  16.7   61.362  32.0
80   64.72   –        –      –        –       –       –       63.457  19.3   64.704  35.7
100  67.28   –        –      –        –       –       –       66.333  20.4   67.323  40.4
Table 2. Pricing results for d-dimensional arithmetic put options.
d    Ref.   LSM-2   Time   LSM-3   Time    LFPI    Time    DOS    Time   NNFPI   Time
5    2.05   2.045   1.4    2.046   1.2     2.047   2.6     2.046  12.6   2.048   19.0
10   1.39   1.376   3.3    1.378   4.7     1.378   7.0     1.378  11.7   1.378   18.7
20   1.06   1.042   6.7    1.045   35.7    1.046   25.5    1.047  12.5   1.047   22.7
30   0.64   0.626   11.7   0.628   215.0   0.630   55.1    0.629  13.2   0.630   23.5
40   0.66   0.645   26.6   –       –       0.646   141.0   0.646  14.1   0.646   24.3
60   0.89   –       –      –       –       –       –       0.864  16.8   0.865   31.9
80   0.74   –       –      –       –       –       –       0.712  19.2   0.713   35.6
100  0.32   –       –      –       –       –       –       0.311  20.5   0.312   40.2
Table 3. Pricing results for d-dimensional geometric put options.
d    Ref.   LSM-2   Time   LSM-3   Time    LFPI    Time    DOS    Time   NNFPI   Time
5    2.05   2.033   1.3    2.036   1.5     2.038   2.6     2.036  12.6   2.038   19.1
10   1.39   1.368   3.8    1.376   5.1     1.379   6.9     1.373  11.9   1.379   18.8
20   1.06   1.027   7.7    1.043   40.7    1.049   25.2    1.051  12.6   1.052   22.8
30   0.64   0.604   13.7   0.616   271.5   0.625   54.9    0.625  13.2   0.625   23.7
40   0.66   0.645   35.1   –       –       0.654   135.8   0.646  14.3   0.655   24.4
60   0.89   –       –      –       –       –       –       0.869  16.6   0.887   32.0
80   0.74   –       –      –       –       –       –       0.719  19.3   0.722   35.6
100  0.32   –       –      –       –       –       –       0.308  20.8   0.311   40.3
Table 4. Pricing results for d-dimensional max-call options in the Heston model.
d    Ref.    LSM-2    Time   LSM-3    Time   LFPI     Time    DOS     Time   NNFPI    Time
5    8.33    8.252    5.7    8.258    6.3    8.262    6.1     8.192   12.8   8.207    19.8
10   11.83   11.460   27.4   11.662   76.6   11.623   23.1    11.286  13.5   11.796   21.0
20   –       14.992   89.0   –        –      15.058   132.0   14.885  15.6   15.330   26.5
30   –       –        –      –        –      –        –       17.091  18.4   17.465   27.5
40   –       –        –      –        –      –        –       18.485  21.5   18.974   31.7
50   20.09   –        –      –        –      –        –       19.563  23.7   20.100   35.0
80   –       –        –      –        –      –        –       21.993  34.3   22.487   43.5
100  23.69   –        –      –        –      –        –       22.927  40.3   23.613   50.1
Table 5. Pricing results for d-dimensional geometric put options in the Heston model.
d    Ref.   LSM-2   Time   LSM-3   Time   LFPI    Time    DOS    Time   NNFPI   Time
5    2.43   2.366   4.9    2.368   5.0    2.420   6.0     2.309  12.8   2.443   19.7
10   2.01   1.773   21.1   1.959   52.6   1.986   22.5    1.931  13.6   2.015   21.0
20   1.71   1.631   67.5   –       –      1.658   134.0   1.643  15.4   1.712   26.5
30   –      –       –      –       –      –       –       1.292  18.1   1.593   27.5
40   –      –       –      –       –      –       –       1.304  20.8   1.522   31.6
50   1.48   –       –      –       –      –       –       1.291  22.5   1.474   34.9
80   –      –       –      –       –      –       –       1.194  33.3   1.426   43.3
100  1.40   –       –      –       –      –       –       1.141  39.9   1.402   49.8
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
