# Smart Robotic Strategies and Advice for Stock Trading Using Deep Transformer Reinforcement Learning

## Abstract

## 1. Introduction

#### 1.1. Contributions of the Paper

- We propose a transformer-based DRL framework for stock trading. The model does not require a sliding look-back window to track price movements, as it uses a transformer network to pick the best trading policy, which the network's intrinsic attention mechanism identifies automatically.
- We enhance our data set with several widely used technical trading indicators. These indicators add relevant information that complements the forecasted data produced by our prediction model, and this combination of features gives the proposed model a well-balanced set of observations.
- The proposed technique optimizes the reward during training by integrating risk-adjusted return metrics, including the max drawdown, Sortino ratio, Omega, cumulative returns, annual volatility, and Calmar ratio, alongside a normal reward function without risk adjustment.
- Using several reward functions offers many ways to explore the policy space and discourages the agent from settling for an imperfect but acceptable action.

#### 1.2. Outline of the Paper

## 2. Related Work

- It does not require previous knowledge of the environment to understand the trade rules;
- it can continually adapt to changing environmental scenarios; and
- it prioritizes long-term advantages, rather than quick rewards.

## 3. Methodology and Proposed Model

#### 3.1. Preliminaries

#### 3.1.1. Reinforcement Learning

#### 3.1.2. The Trading Environment

#### 3.1.3. Reinforcement Learning as Sequence Modelling

#### 3.1.4. State Space $\in {\mathbb{R}}^{32}$

- Market Features $\mathfrak{f}\in {\mathbb{R}}^{28\times D}$: A set of features gathered for each ticker and its corresponding market index, including the transaction date, close price, volume, and six technical indicators. The feature set therefore contains each quantity for both the ticker and the linked index (for example, both closing prices), except for the transaction date, which the ticker and its index share. Several technical indicators produce more than one value (feature), such as the Moving Average Convergence Divergence (MACD), which generates two MACD values and a signal line. In Section 3.3, we discuss the 28 market features in detail.
- Held shares $\mathfrak{h}\in {\mathbb{Z}}_{+}^{3\times D}$: The total number of shares owned in the stocks; this amount, which must be an integer, covers all shares held.
- Available Balance $\mathfrak{b}\in {\mathbb{R}}^{+}$: The liquid funds available for buying or selling a given stock at each time step. The balance must be positive or zero, and permitted actions must not result in a negative balance.
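
The three components above can be concatenated into a single observation. The following is a minimal sketch for one time step, assuming that the 32 dimensions in the heading of this subsection break down as 28 market features, 3 held-share entries, and 1 balance value. The `Observation` class and its field names are hypothetical and are not taken from the original implementation.

```python
from dataclasses import dataclass
from typing import List

N_MARKET = 28  # market features per time step (Section 3.1.4)
N_HELD = 3     # held-share entries per time step (Section 3.1.4)


@dataclass
class Observation:
    """One time step of the trading state (hypothetical container)."""
    market: List[float]  # prices, volumes, and indicator values
    held: List[int]      # share counts; must be non-negative integers
    balance: float       # available cash; must never go negative

    def to_vector(self) -> List[float]:
        """Validate the components and flatten them into one state vector."""
        if len(self.market) != N_MARKET:
            raise ValueError(f"expected {N_MARKET} market features")
        if len(self.held) != N_HELD:
            raise ValueError(f"expected {N_HELD} held-share entries")
        if any((not isinstance(h, int)) or h < 0 for h in self.held):
            raise ValueError("held shares must be non-negative integers")
        if self.balance < 0:
            raise ValueError("balance must be non-negative")
        return [*self.market, *(float(h) for h in self.held), self.balance]
```

Rejecting a negative balance at this point mirrors the constraint that permitted actions must never overdraw the account.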

#### 3.1.5. Action Space

#### 3.1.6. Reward Function

- Profit and Loss (PnL): The profit function is one of the most widely used reward functions. Its mathematical formula is as follows:$${r}_{t}=\left(1+{a}_{t}\times \frac{{p}_{t}-{p}_{t-1}}{{p}_{t-1}}\right)\frac{{p}_{t-1}}{{p}_{t-n}}.$$
- Volatility-Based Metrics (Sharpe and Sortino ratios): According to [34], the Sharpe ratio is a frequently used measure of risk-adjusted return that reflects both profit and volatility. It is calculated by dividing the investor's average return by the standard deviation of that return:$${S}_{t}=\frac{\mathrm{Average}\left({\sum}_{i=1}^{W}{R}_{i}\right)}{\sigma \left({\sum}_{i=1}^{W}{R}_{i}\right)}.$$At time $t$, the Sharpe ratio reward function is ${S}_{t}$, the daily return on multiple shares of a stock is ${R}_{i}$, and the average and standard deviation of the returns are estimated over a window of size $W$. The Sharpe ratio thus accounts for volatility in portfolio values. However, it treats upward and downward movements equally; that is, it also penalizes upward volatility [35,36,37]. In fact, upward volatility (upward price movement) contributes to positive returns, while downward volatility causes losses. In contrast to the Sharpe ratio, the Sortino ratio treats only downward volatility as risk, so upward volatility is not penalized. Mathematically, it is formulated as follows:$$S{R}_{t}=\frac{\mathrm{Average}\left({\sum}_{i=1}^{W}{R}_{i}\right)}{{\sigma}_{down}\left({\sum}_{i=1}^{W}{R}_{i}\right)}.$$However, neither the PnL, the Sharpe ratio, nor the Sortino ratio takes into account the maximum drawdown, defined as the maximum loss between a peak and a bottom of a portfolio before a new peak is reached. It provides a measure of the rate of change in price over a specified time period, and is an indicator of downside risk [39,40]. The Calmar ratio uses the maximum drawdown as its only measure of risk [37], and a high Calmar ratio indicates better portfolio performance [41].
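
The reward functions discussed above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the returns passed in are taken to be the per-period returns inside one window of size $W$, the standard deviation is the population form, and the downside deviation is computed as the root mean square of the negative returns (one common convention). A zero denominator yields a reward of 0.0, which is also an assumption.

```python
import math
from statistics import mean, pstdev
from typing import Sequence


def pnl_reward(action: float, p_t: float, p_prev: float, p_ref: float) -> float:
    """PnL reward: (1 + a_t * (p_t - p_{t-1}) / p_{t-1}) * p_{t-1} / p_{t-n}."""
    return (1.0 + action * (p_t - p_prev) / p_prev) * (p_prev / p_ref)


def sharpe(returns: Sequence[float]) -> float:
    """Mean return over the window divided by its standard deviation."""
    sd = pstdev(returns)
    return mean(returns) / sd if sd > 0 else 0.0


def sortino(returns: Sequence[float]) -> float:
    """Mean return divided by the downside deviation (negative returns only)."""
    downside = math.sqrt(mean([min(r, 0.0) ** 2 for r in returns]))
    return mean(returns) / downside if downside > 0 else 0.0


def max_drawdown(values: Sequence[float]) -> float:
    """Largest peak-to-trough loss, as a fraction of the running peak."""
    peak, worst = values[0], 0.0
    for v in values:
        peak = max(peak, v)                    # highest value seen so far
        worst = max(worst, (peak - v) / peak)  # loss relative to that peak
    return worst
```

For a window with both gains and losses, `sortino` returns a larger value than `sharpe`, because the gains no longer inflate the denominator.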

#### 3.2. Data

#### 3.2.1. Data Pre-Processing

#### 3.2.2. Data Imputation

- Missing Completely at Random (MCAR): The missing data are not related to the values of any other variable in the data set.
- Missing at Random (MAR): The probability that a value of a variable is missing depends on another variable in the data set, but not on the variable itself.
- Missing Not at Random (MNAR): The missing values of a variable are closely related to the variable itself and not to the other variables in the data set. This is the most concerning type of missing value.

#### 3.2.3. Data Stationarity

#### 3.2.4. Data Normalization

#### 3.3. Feature Extraction

- Exponential moving average (EMA): These are used to reduce noise and highlight the short- and long-term trends in time-series data. The exponential moving average $EMA(x,\alpha )$ is generated by exponentially reducing the weight of observations ${x}_{i}$ according to their distance from ${x}_{t}$, using a weighting multiplier $\alpha $:$$\begin{array}{c}EMA\left({x}_{t},\alpha \right)=\alpha {x}_{t}+(1-\alpha )EMA\left({x}_{t-1},\alpha \right),\\ \alpha =\frac{2}{N+1},\end{array}$$where $N$ is the number of periods in the averaging window.
- Money flow index (MFI): Based on price and volume, the money flow index determines the amount of money moving into and out of a specific ticker or, to put it another way, whether a given stock has been over-bought or over-sold. This is what is known as a momentum indicator. When the MFI is over 80, it indicates an over-bought condition; when it is below 20, it suggests an over-sold condition. The MFI may be computed using the following formula:$$\begin{array}{c}\mathrm{MFI}=100-\frac{100}{1+\mathrm{MFR}},\\ \text{where}\ \mathrm{MFR}=\frac{\text{Positive Money Flow}}{\text{Negative Money Flow}},\\ \text{Money Flow}=\left(\frac{\mathrm{High}+\mathrm{Low}+\mathrm{Close}}{3}\right)\times \mathrm{Volume}.\end{array}$$
- Relative strength index (RSI): This index is also used as a momentum indicator, which determines whether a ticker is over-bought or over-sold by considering both the velocity and magnitude of price fluctuations. Its value ranges from 0 to 100, with low values indicating an over-sold stock and high values indicating an over-bought stock. The value of this indicator is given by:$$\begin{array}{c}\mathrm{RSI}=100-\frac{100}{1+\mathrm{RS}},\\ \text{where}\ \mathrm{RS}=\frac{\text{Average of Up closes}}{\text{Average of Down closes}}.\end{array}$$
- Moving average convergence-divergence (MACD): The MACD illustrates the relationship between two exponential moving averages (EMAs): slow (${\theta}_{1}$) and fast (${\theta}_{2}$). According to [54], ${\theta}_{1}$ spans 26 periods (the market standard) and ${\theta}_{2}$ spans 12 periods (usual for the financial markets), and the MACD line is the fast EMA minus the slow EMA. Finally, a moving average of the MACD line (standardized at 9 periods) gives the signal line. This indicator can be used to assess the trend-following momentum within a stock. The formula for calculating this indicator is:$$\mathrm{MACD}=\mathrm{EMA}\left({\theta}_{2}\right)-\mathrm{EMA}\left({\theta}_{1}\right).$$
- Commodity channel index (CCI): The CCI is a trend indicator that measures the difference between the current price and the average of historical prices. When the CCI is greater than zero, the current price is higher than the historical average; conversely, when the CCI is lower than zero, the price is lower than the historical average. A reading above 100 exceeds the buy threshold, and a reading below −100 falls below the sell threshold.$$\mathrm{CCI}=\frac{1}{0.015}\frac{{P}_{typicalprice}-\mathrm{SMA}\left({P}_{typicalprice}\right)}{\mathrm{MAD}\left({P}_{typicalprice}\right)},$$where SMA is the simple moving average and MAD is the mean absolute deviation.
- Ichimoku: The Ichimoku Kinko Hyo indicator identifies the trend direction and determines accurate support and resistance levels. There are five main components of the Ichimoku Cloud indicator that provide reliable trade signals: Kijun-Sen, Senkou Span B, Senkou Span A, Tenkan-Sen, and Chikou Span.
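
Three of the indicators above (EMA, MACD, and RSI) follow directly from their formulas. The functions below are a minimal sketch in plain Python and are not the library cited in [55]. The function names are hypothetical, the EMA is seeded with the first observation, the MACD uses the conventional 12/26/9 periods with the line computed as the fast EMA minus the slow EMA, and the RSI uses simple averages over a 14-period default window; all of these are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def ema(values: Sequence[float], n: int) -> List[float]:
    """EMA with alpha = 2 / (n + 1), seeded with the first observation."""
    alpha = 2.0 / (n + 1)
    out = [float(values[0])]
    for x in values[1:]:
        out.append(alpha * x + (1.0 - alpha) * out[-1])
    return out


def macd(closes: Sequence[float], fast: int = 12, slow: int = 26,
         signal: int = 9) -> Tuple[List[float], List[float]]:
    """Return (MACD line, signal line) for a series of closing prices."""
    fast_ema = ema(closes, fast)
    slow_ema = ema(closes, slow)
    line = [f - s for f, s in zip(fast_ema, slow_ema)]  # fast minus slow
    return line, ema(line, signal)                      # signal = EMA of the line


def rsi(closes: Sequence[float], n: int = 14) -> float:
    """RSI over the last n price changes, using simple averages."""
    if len(closes) < 2:
        raise ValueError("need at least two closing prices")
    window = closes[-(n + 1):]
    changes = [b - a for a, b in zip(window, window[1:])]
    avg_up = sum(c for c in changes if c > 0) / len(changes)
    avg_down = sum(-c for c in changes if c < 0) / len(changes)
    if avg_down == 0:
        return 100.0  # no down closes in the window
    rs = avg_up / avg_down
    return 100.0 - 100.0 / (1.0 + rs)
```

A series that rises and falls by equal amounts gives an RSI of 50, while a strictly rising series gives 100, matching the over-sold and over-bought reading described above.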

#### 3.4. Decision Transformer Model

## 4. Experiments and Analysis

#### 4.1. Hyperparameters

#### 4.2. Model Training

#### 4.3. Complexity Analysis

## 5. Discussion

#### Comparison with Related Works

## 6. Conclusions

## Author Contributions

## Funding

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## References

1. Kaya, O.; Schildbach, J.; AG, D.B.; Schneider, S. Robo-Advice—A True Innovation in Asset Management. 2017. Available online: https://www.dbresearch.com/PROD/RPS_EN-PROD/PROD0000000000449125/Robo-advice_%E2%80%93_a_true_innovation_in_asset_managemen.PDF (accessed on 20 October 2022).
2. Méndez-Suárez, M.; García-Fernández, F.; Gallardo, F. Artificial Intelligence Modelling Framework for Financial Automated Advising in the Copper Market. J. Open Innov. Technol. Mark. Complex. **2019**, 5, 81.
3. So, M.K.P. Robo-Advising Risk Profiling through Content Analysis for Sustainable Development in the Hong Kong Financial Market. Sustainability **2021**, 13, 1306.
4. Sutton, R.S.; Barto, A.G. Reinforcement learning: An Introduction Second edition. Learning **2012**, 3, 322.
5. ArgaamPlus. Tadawul’s Market Cap Slips 2.3% to SAR 12.178 trln Last Week. Available online: https://www.argaam.com/en/article/articledetail/id/1559878 (accessed on 10 October 2022).
6. Capital Market Overview. Available online: https://www.saudiexchange.sa/wps/portal/tadawul/knowledge-center/about/Capital-Market-Overview?locale=en (accessed on 10 October 2022).
7. Wang, Y.; Wang, D.; Zhang, S.; Feng, Y.; Li, S.; Zhou, Q. Deep Q-trading. Cslt. Riit. Tsinghua. Edu. Cn **2017**, 1–9.
8. Lee, K.C.; Lee, S. A causal knowledge-based expert system for planning an Internet-based stock trading system. Expert Syst. Appl. **2012**, 39, 8626–8635.
9. Guresen, E.; Kayakutlu, G.; Daim, T.U. Using artificial neural network models in stock market index prediction. Expert Syst. Appl. **2011**, 38, 10389–10397.
10. Selvin, S.; Vinayakumar, R.; Gopalakrishnan, E.A.; Menon, V.K.; Soman, K.P. Stock price prediction using LSTM, RNN and CNN-sliding window model. In Proceedings of the 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Udupi, India, 13–16 September 2017; pp. 1643–1647.
11. Nikou, M.; Mansourfar, G.; Bagherzadeh, J. Stock price prediction using DEEP learning algorithm and its comparison with machine learning algorithms. Intell. Syst. Account. Financ. Manag. **2019**, 26, 164–174.
12. Malibari, N.; Katib, I.; Mehmood, R. Predicting Stock Closing Prices in Emerging Markets with Transformer Neural Networks: The Saudi Stock Exchange Case. Int. J. Adv. Comput. Sci. Appl. **2021**, 12, 876–886.
13. Moody, J.; Wu, L.; Liao, Y.; Saffell, M. Performance functions and reinforcement learning for trading systems and portfolios. J. Forecast. **1998**, 17, 441–470.
14. Xiong, Z.; Liu, X.Y.; Zhong, S.; Yang, H.; Walid, A. Practical Deep Reinforcement Learning Approach for Stock Trading. arXiv **2018**, arXiv:1811.07522.
15. Gudelek, M.U.; Boluk, S.A.; Ozbayoglu, A.M. A deep learning based stock trading model with 2-D CNN trend detection. In Proceedings of the 2017 IEEE Symposium Series on Computational Intelligence (SSCI), Honolulu, HI, USA, 27 November–1 December 2017; pp. 1–8.
16. Deng, Y.; Bao, F.; Kong, Y.; Ren, Z.; Dai, Q. Deep Direct Reinforcement Learning for Financial Signal Representation and Trading. IEEE Trans. Neural Netw. Learn. Syst. **2017**, 28, 653–664.
17. Luo, S.; Lin, X.; Zheng, Z. A novel CNN-DDPG based AI-trader: Performance and roles in business operations. Transp. Res. Part E Logist. Transp. Rev. **2019**, 131, 68–79.
18. Li, Y.; Ni, P.; Chang, V. Application of deep reinforcement learning in stock trading strategies and stock forecasting. Computing **2020**, 102, 1305–1322.
19. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 2017, pp. 5999–6009.
20. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing Atari with Deep Reinforcement Learning. arXiv **2013**, arXiv:1312.5602.
21. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature **2015**, 518, 529–533.
22. Fengqian, D.; Chao, L. An Adaptive Financial Trading System Using Deep Reinforcement Learning with Candlestick Decomposing Features. IEEE Access **2020**, 8, 63666–63678.
23. Zarkias, K.S.; Passalis, N.; Tsantekidis, A.; Tefas, A. Deep reinforcement learning for financial trading using price trailing. In Proceedings of the ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 3067–3071.
24. Taghian, M.; Asadi, A.; Safabakhsh, R. Learning financial asset-specific trading rules via deep reinforcement learning. Expert Syst. Appl. **2022**, 195, 116523.
25. Thomas, G. Reinforcement learning in financial markets—A survey, FAU Discussion Papers in Economics, No. 12/2018, Friedrich-Alexander-Universität Erlangen-Nürnberg, Institute for Economics, Nürnberg. 2018. Available online: https://ideas.repec.org/p/zbw/iwqwdp/122018.html (accessed on 16 October 2022).
26. Dang, Q.V. Reinforcement Learning in Stock Trading. International Conference on Computer Science, Applied Mathematics and Applications Hanoi, Vietnam. 2019. hal-02306522. Available online: https://link.springer.com/chapter/10.1007/978-3-030-38364-0_28 (accessed on 16 October 2022).
27. Jeong, G.; Kim, H.Y. Improving financial trading decisions using deep Q-learning: Predicting the number of shares, action strategies, and transfer learning. Expert Syst. Appl. **2019**, 117, 125–138.
28. Yang, H.; Liu, X.Y.; Zhong, S.; Walid, A. Deep reinforcement learning for automated stock trading. In Proceedings of the First ACM International Conference on AI in Finance, New York, NY, USA, 15–16 October 2020; ACM: New York, NY, USA, 2020; pp. 1–8.
29. Li, Y.; Zheng, W.; Zheng, Z. Deep Robust Reinforcement Learning for Practical Algorithmic Trading. IEEE Access **2019**, 7, 108014–108021.
30. Chen, L.; Lu, K.; Rajeswaran, A.; Lee, K.; Grover, A.; Laskin, M.; Abbeel, P.; Srinivas, A.; Mordatch, I. Decision transformer: Reinforcement learning via sequence modeling. Adv. Neural Inf. Process. Syst. **2021**, 34, 15084–15097.
31. Janner, M.; Li, Q.; Levine, S. Offline reinforcement learning as one big sequence modeling problem. Adv. Neural Inf. Process. Syst. **2021**, 34, 1273–1286.
32. Sadighian, J. Extending Deep Reinforcement Learning Frameworks in Cryptocurrency Market Making. arXiv **2020**, arXiv:2004.0698.
33. Gašperov, B.; Begušić, S.; Šimović, P.P.; Kostanjčar, Z. Reinforcement learning approaches to optimal market making. Mathematics **2021**, 9, 2689.
34. Sharpe, W.F. The sharpe ratio. Streetwise–Best J. Portf. Manag. **1998**, 3745, 169–185.
35. Jiang, Z.; Xu, D.; Liang, J. A Deep Reinforcement Learning Framework for the Financial Portfolio Management Problem. arXiv **2017**, arXiv:1706.10059.
36. Leem, J.B.; Kim, H.Y. Action-specialized expert ensemble trading system with extended discrete action space using deep reinforcement learning. PLoS ONE **2020**, 15, e0236178.
37. Millea, A. Deep reinforcement learning for trading—A critical survey. Data **2021**, 6, 119.
38. Yashaswi, K. Deep Reinforcement Learning for Portfolio Optimization using Latent Feature State Space (LFSS) Module. arXiv **2021**, arXiv:2102.06233.
39. Bu, S.J.; Cho, S.B. Learning Optimal Q-Function Using Deep Boltzmann Machine for Reliable Trading of Cryptocurrency; Springer International Publishing: Cham, Switzerland, 2018; Volume 11314, pp. 468–480.
40. Wang, R.; Wei, H.; An, B.; Feng, Z.; Yao, J. Commission Fee is not Enough: A Hierarchical Reinforced Framework for Portfolio Management. Proc. AAAI Conf. Artif. Intell. **2021**, 35, 626–633.
41. Zhang, Z.; Zohren, S.; Roberts, S. Deep Reinforcement Learning for Trading. J. Financ. Data Sci. **2020**, 2, 25–40.
42. Argaam (Ed.) Al Rajhi Bank’s Board Proposes 60% Capital Increase to SAR 40 bln via Bonus Issue, ArgaamPlus, 20 February 2022. Available online: https://www.argaam.com/en/article/articledetail/id/1536623 (accessed on 16 October 2022).
43. Zhang, S.; Zhang, C.; Yang, Q. Data preparation for data mining. Appl. Artif. Intell. **2003**, 17, 375–381.
44. Rahm, E.; Do, H.H. Data Cleaning: Problems and Current Approaches. IEEE Data Eng. Bull. Tech. Comm. Data Eng. **2001**, 24, 1–56.
45. Silva-Ramírez, E.L.; Pino-Mejías, R.; López-Coello, M.; Cubiles-de-la Vega, M.D. Missing value imputation on missing completely at random data using multilayer perceptrons. Neural Netw. **2011**, 24, 121–129.
46. Sinharay, S.; Stern, H.S.; Russell, D. The use of multiple imputation for the analysis of missing data. Psychol. Methods **2001**, 6, 317–329.
47. Kuznetsov, V.; Mohri, M. Discrepancy-Based Theory and Algorithms for Forecasting Non-Stationary Time Series. Ann. Math. Artif. Intell. **2020**, 88, 367–399.
48. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
49. Dickey, D.A.; Fuller, W.A. Distribution of the Estimators for Autoregressive Time Series with a Unit Root. J. Am. Stat. Assoc. **1979**, 74, 427.
50. Raymaekers, J.; Rousseeuw, P.J. Transforming variables to central normality. Mach. Learn. **2021**.
51. Loh, L.K.Y.; Kueh, H.K.; Parikh, N.J.; Chan, H.; Ho, N.J.H.; Chua, M.C.H. An Ensembling Architecture Incorporating Machine Learning Models and Genetic Algorithm Optimization for Forex Trading. FinTech **2022**, 1, 100–124.
52. Smyl, S.; Kuber, K. Data Preprocessing and Augmentation for Multiple Short Time Series Forecasting with Recurrent Neural Networks 2016; Technical Report; In 36th International Symposium on Forecasting. Available online: https://www.researchgate.net/publication/309385800_Data_Preprocessing_and_Augmentation_for_Multiple_Short_Time_Series_Forecasting_with_Recurrent_Neural_Networks (accessed on 16 October 2022).
53. Borovkova, S.; Tsiamas, I. An ensemble of LSTM neural networks for high-frequency stock market classification. J. Forecast. **2019**, 38, 600–619.
54. Halilbegovic, S. Macd—Analysis of Weaknesses of the Most Powerful Technical Analysis Tool. Indep. J. Manag. Prod. **2016**, 7, 367–379.
55. Padial, D.L. Technical Analysis Library Using Pandas and Numpy. Available online: https://github.com/bukosabino/ta (accessed on 6 October 2022).
56. Truong, A.; Walters, A.; Goodsitt, J.; Hines, K.; Bruss, C.B.; Farivar, R. Towards automated machine learning: Evaluation and comparison of AutoML approaches and tools. Proc. Int. Conf. Tools Artif. Intell. ICTAI **2019**, 2019, 1471–1479.
57. Dabérius, K.; Granat, E.; Karlsson, P. Deep Execution—Value and Policy Based Reinforcement Learning for Trading and Beating Market Benchmarks. SSRN Electron. J. **2019**.
58. Wu, H.; Xu, J.; Wang, J.; Long, M. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Adv. Neural Inf. Process. Syst. **2021**, 34, 22419–22430.
59. Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond efficient transformer for long sequence time-series forecasting. Proc. AAAI Conf. Artif. Intell. **2021**, 35, 11106–11115.
60. Yarats, D.; Brandfonbrener, D.; Liu, H.; Laskin, M.; Abbeel, P.; Lazaric, A.; Pinto, L. Don’t Change the Algorithm, Change the Data: Exploratory Data for Offline Reinforcement Learning. arXiv **2022**, arXiv:2201.13425.
61. Liu, X.y.; Xia, Z.; Rui, J.; Gao, J.; Yang, H. FinRL-Meta: Market Environments and Benchmarks for Data-Driven Financial Reinforcement Learning. arXiv **2022**, arXiv:2211.03107.
62. Cobbe, K.; Klimov, O.; Hesse, C.; Kim, T.; Schulman, J. Quantifying generalization in reinforcement learning. In Proceedings of the International Conference on Machine Learning PMLR, Long Beach, CA, USA, 10–15 June 2019; pp. 1282–1289.
63. Lee, K.; Lee, K.; Shin, J.; Lee, H. Network randomization: A simple technique for generalization in deep reinforcement learning. arXiv **2019**, arXiv:1910.05396.
64. Huotari, T.; Savolainen, J.; Collan, M. Deep reinforcement learning agent for S&P 500 stock selection. Axioms **2020**, 9, 130.
65. Théate, T.; Ernst, D. An application of deep reinforcement learning to algorithmic trading. Expert Syst. Appl. **2021**, 173, 114632.

**Figure 4.** Comparison of performance of the four selected indices with that of the Tadawul All Share Index (TASI).

**Figure 7.** Behaviour of Al-Rajhi Banking and Investment (1120) during the training and testing periods.

**Figure 8.** Behaviour of the Saudi Basic Industries Corporation (2010) during the training and testing periods.

**Figure 14.** Outcome of 7010 stock net worth during model training with different reward functions (Initial portfolio value $10,000).

**Figure 15.** Outcome of 5110 stock net worth during model training with different reward functions (Initial portfolio value $10,000).

**Figure 16.** Outcome of 1120 stock net worth during model training with different reward functions (Initial portfolio value $10,000).

**Figure 17.** Outcome of 2010 stock net worth during model training with different reward functions (Initial portfolio value $10,000).

**Figure 18.** Outcome of 7010 stock target return during model training with different reward functions (Initial portfolio value $10,000).

**Figure 19.** Outcome of 5110 stock target return during model training with different reward functions (Initial portfolio value $10,000).

**Figure 20.** Outcome of 1120 stock target return during model training with different reward functions (Initial portfolio value $10,000).

**Figure 21.** Outcome of 2010 stock target return during model training with different reward functions (Initial portfolio value $10,000).

Reference | Year | Data Set | Model | Application | Results |
---|---|---|---|---|---|
Guresen et al. [9] | 2011 | NASDAQ stock | MLP, DAN2, Hybrid ANN | Forecasting of stock indices | MSE: 0.54% |
Selvin et al. [10] | 2017 | National Stock Exchange | RNN, CNN, LSTM | Prediction of future stock price | MSE: RNN 3.90%, CNN 2.36%, LSTM 4.18% |
Nikou et al. [11] | 2019 | iShares MSCI United Kingdom | ANN, RF, SVR, LSTM | Prediction of closing price of stock | RMSE: ANN 0.45, SVR 0.34, RNN 0.38, LSTM 0.30 |
Malibari et al. [12] | 2021 | Saudi Stock Exchange | Transformer network | Prediction of closing price of stock | Accuracy over 90% |
Moody et al. [13] | 1998 | S&P 500 stock index | RRL | Performance check of RRL for trading and profit | Hold strategy: 0.34; Average strategy: 0.84; Voting strategy: 0.83 |
Xiong et al. [14] | 2018 | Dow Jones 30 stocks | DRL | Prediction of future stock price | Annualized Std. Error: 13.62%; Sharpe ratio: 1.79 |
Gudelek et al. [15] | 2017 | Google finance | 2D-CNN | Prediction of future stock price | 70% accuracy |
Wang et al. [7] | 2017 | Dow Jones 30 stocks | Deep Q-learning | Portfolio management | - |
Deng et al. [16] | 2017 | Stock IF-contract | DRL | Financial signal representation and trading | 0.523 PR |
Luo et al. [17] | 2019 | Stock IF-contract | CNN-DDPG | AI-trader’s performance | - |
Li et al. [18] | 2020 | 10 US equities | DQN, Double DQN, Dueling DQN | Single-stock trading strategies | DQN outperformed the other two techniques |
Our Work | 2022 | Saudi Stock Market (Tadawul) | Decision Transformer | Trading strategies for 4 stocks; robo-advice | An increase of around 20% was seen in net worth |

Index/Stocks | ADF Statistic | p-Value | Critical Values | Decision |
---|---|---|---|---|
Telecommunication Services Index (TTSI) | −1.161312 | 0.690013 | 1%: −3.435; 5%: −2.864; 10%: −2.568 | Failed to reject $H_0$; time series is non-stationary |
Saudi Telecom Company (7010) | −1.594104 | 0.486555 | 1%: −3.435; 5%: −2.864; 10%: −2.568 | Failed to reject $H_0$; time series is non-stationary |
Banks Index (TBNI) | 1.187172 | 0.995896 | 1%: −3.435; 5%: −2.864; 10%: −2.568 | Failed to reject $H_0$; time series is non-stationary |
Al Rajhi Banking and Investment Corp (1120) | 2.972744 | 1.000000 | 1%: −3.435; 5%: −2.864; 10%: −2.568 | Failed to reject $H_0$; time series is non-stationary |
Materials Index (TMTI) | −0.562991 | 0.879127 | 1%: −3.435; 5%: −2.864; 10%: −2.568 | Failed to reject $H_0$; time series is non-stationary |
Saudi Basic Industries Corporation (2010) | −1.878952 | 0.342027 | 1%: −3.435; 5%: −2.864; 10%: −2.568 | Failed to reject $H_0$; time series is non-stationary |
Utilities Index (TUTI) | 0.334678 | 0.978881 | 1%: −3.435; 5%: −2.864; 10%: −2.568 | Failed to reject $H_0$; time series is non-stationary |
Saudi Electricity Company (5110) | −1.661178 | 0.451217 | 1%: −3.435; 5%: −2.864; 10%: −2.568 | Failed to reject $H_0$; time series is non-stationary |
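
The decision column of the table above follows from a simple rule: the augmented Dickey-Fuller test rejects the unit-root null hypothesis only when the test statistic is more negative than the critical value at the chosen significance level. The sketch below applies that rule to the statistics in the table. It does not compute the ADF statistic itself, and the function and dictionary names are hypothetical.

```python
# Critical values and test statistics copied from the ADF table above.
CRITICAL_VALUES = {"1%": -3.435, "5%": -2.864, "10%": -2.568}

ADF_STATISTICS = {
    "TTSI": -1.161312, "7010": -1.594104,
    "TBNI": 1.187172, "1120": 2.972744,
    "TMTI": -0.562991, "2010": -1.878952,
    "TUTI": 0.334678, "5110": -1.661178,
}


def adf_decision(statistic: float, level: str = "5%") -> str:
    """Reject the unit-root null only if the statistic is below the critical value."""
    if statistic < CRITICAL_VALUES[level]:
        return "stationary"
    return "non-stationary"


if __name__ == "__main__":
    for name, stat in ADF_STATISTICS.items():
        print(f"{name}: {adf_decision(stat)}")
```

Every statistic in the table lies above even the 10% critical value, so none of the series rejects the null at any of the three levels.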

Hyperparameter | Value |
---|---|
State dimensionality | The state size for the DRL environment |
Action dimensionality | The size of the action space = 3 |
Number of hidden layers | 12 |
Number of attention heads | 12 |
Learning rate | 0.001 |
Activation function | GELU |
Batch size | 256 |
Dropout probability | 0.1 |
Layer normalization epsilon | $1 \times 10^{-5}$ |
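
For reference, the values in the table above can be collected into a single configuration object. This is a sketch only: the class and field names are hypothetical, the state size of 32 is taken from Section 3.1.4, and GELU is stored as the activation, since GELU is an activation function rather than an optimizer. The optimizer itself is not specified here and is left out.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TradingTransformerConfig:
    """Hyperparameters from the table (field names are illustrative)."""
    state_dim: int = 32            # state size, from Section 3.1.4
    action_dim: int = 3            # size of the action space
    n_layers: int = 12             # number of hidden layers
    n_heads: int = 12              # number of attention heads
    learning_rate: float = 1e-3
    activation: str = "gelu"
    batch_size: int = 256
    dropout: float = 0.1
    layer_norm_eps: float = 1e-5

    def __post_init__(self) -> None:
        # Basic sanity checks on the values.
        if not 0.0 <= self.dropout < 1.0:
            raise ValueError("dropout must lie in [0, 1)")
        if min(self.n_layers, self.n_heads, self.batch_size) <= 0:
            raise ValueError("layer, head, and batch counts must be positive")


if __name__ == "__main__":
    print(asdict(TradingTransformerConfig()))
```

Freezing the dataclass keeps one training run from silently changing a value that another component has already read.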

**Table 4.** Reward functions used in this work, in order of their performance (best to worst from top to bottom). The columns 7010, 5110, 2010, and 1120 give the average increase in net worth for each stock.

Reward Function | Description | 7010 | 5110 | 2010 | 1120 |
---|---|---|---|---|---|
Sortino ratio | Measures the risk on return by penalizing downside volatility. | 21.54% | 18.54% | 17% | 9.36% |
Cumulative returns | Indicates the total return at the end of the trading phase. | 8.02% | 10.61% | 12.92% | 14.57% |
Annual volatility | Indicator of model robustness; shows the annual standard deviation of portfolio return. | 7.84% | 10.61% | 12.92% | 14.57% |
Calmar ratio | Risk quantification measure. A high ratio indicates better portfolio performance. | 1.02% | 5.81% | 10.90% | 4.75% |
Omega | Weighted gain-to-loss probability ratio at a specific value of expected return. | 8.12% | 0.61% | 2.92% | 1.57% |
Max Drawdown | Maximum loss between a peak and a bottom of the portfolio before a new peak is reached. | 0.024 | 1.61% | 1.28% | 0.07% |
Normal | Reward without risk adjustment. | 10.02% | 1.68% | 15.46% | 2.67% |

Reference | Year | Data Set | Rewards | Performance |
---|---|---|---|---|
Huotari et al. [64] | 2020 | 415 stocks in the S&P 500 | Sharpe and differential Sharpe ratios | Return: 328.9%; Sharpe ratio: 0.91 |
Yang et al. [28] | 2020 | 30 Dow Jones stocks | Portfolio value change, turbulence | Return: 328.9%; Sharpe ratio: 0.91 |
Théate and Ernst [65] | 2021 | 30 stocks from a variety of industries in North America, Europe, and Asia | Daily returns | All 30 stocks: Sharpe ratio 0.401; AAPL: Sortino ratio 1.84, annualised return 32%; Tesla: Sortino ratio 0.35, annualised return 12% |
Our Work | 2022 | 4 stocks from Saudi Stock Market (Tadawul) | Sortino ratio, cumulative returns, annual volatility | Net worth: 20%; Sortino ratio: 21.54%; Cumulative returns: 14.5%; Annual volatility: 11.1% |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Malibari, N.; Katib, I.; Mehmood, R.
Smart Robotic Strategies and Advice for Stock Trading Using Deep Transformer Reinforcement Learning. *Appl. Sci.* **2022**, *12*, 12526.
https://doi.org/10.3390/app122412526
