Proceeding Paper

Decision Support from Financial Disclosures with Deep Reinforcement Learning Considering Different Countries and Exchange Rates †

Yi-Hsin Cheng and Hei-Chia Wang

1 Institute of Information Management, National Cheng Kung University, Tainan 701, Taiwan
2 Center for Innovative FinTech Business Models, National Cheng Kung University, Tainan 701, Taiwan
* Author to whom correspondence should be addressed.
Presented at the IEEE 5th Eurasia Conference on Biomedical Engineering, Healthcare and Sustainability, Tainan, Taiwan, 2–4 June 2023.
Eng. Proc. 2023, 55(1), 63; https://doi.org/10.3390/engproc2023055063
Published: 7 December 2023

Abstract

The era of low interest rates has arrived. In addition to their basic salary, people hope to increase their income through part-time work, by understanding how to use the assets already on hand, and by investing those assets to earn extra returns. Goldman Sachs reports that over the past 140 years, the 10-year stock market return has averaged 9.2%; the firm also noted that the S&P 500 outperformed its 10-year historical average with an annual average return of 13.6% between 2010 and 2020. Nowadays, with increased computing power and advances in artificial intelligence algorithms, the effective use of computing power for investment has become an important topic. In the investment process, investors form portfolios, a practice that improves investment efficiency and reduces risk in a relatively safe manner. Investment today is not limited to one country, so attention must be paid to Uncovered Equity Parity (UEP) conditions. We therefore aim to find optimal dynamic trading strategies with deep reinforcement learning while accounting for these properties.

1. Introduction

The era of low interest rates has arrived. In addition to their regular jobs, people hope to take on part-time work to increase their income, understand how to use the assets already on hand, and invest them. Central banks around the world are adopting negative interest rate policies, and people who previously saved their money now tend to invest more [1]. Goldman Sachs reports that the 10-year stock market return averages 9.2%, and that the S&P 500 outperformed its 10-year historical average from 2010 to 2020 with an annual average return of 13.6%.
However, while the S&P 500 index achieved a high return, the average market return for the typical investor was only 5.19% over the same period. The main reason is that investors are sometimes illogical and irrational when investing [2]. People typically invest in various stocks, real estate, or gold, and different investments have different attributes and features. In the past, investors invested based on experience but lacked a complete formula or exact rules. Nowadays, with increased computing power and advances in artificial intelligence algorithms, the effective use of computing power for investment is an important topic.
Investors constantly adjust their portfolio holdings. Past studies focused on univariate time series analysis because excessive noise in multivariate data caused inaccuracies [3]. However, daily stock prices are highly positively correlated with each other, and with recent advances in neural network research, particularly Long Short-Term Memory (LSTM) networks, multivariate models have been shown to outperform univariate ones [4].
In the context of global investing, it is essential to consider Uncovered Equity Parity (UEP), which suggests that a country's stock market often moves in the opposite direction to its exchange rate [5]. This means the exchange rate can significantly affect investment performance, making it crucial to incorporate this factor into portfolio design. Previous studies have shown that adding USD and gold to a portfolio can improve its hedging effect, with the optimal weighted stock-dollar-gold combination emerging as the best portfolio strategy regardless of the reference return or risk [6].
To tackle the challenge of investing across different countries, we propose a deep reinforcement learning-based approach. Specifically, we use an A2C-based model to adjust the shares owned of each stock at each time step, with the goal of maximizing the portfolio return per period by selecting sequential trading directions for individual assets based on time-varying market characteristics.

2. Methodology

Because stock prices change daily, forecasting stock returns is a time series problem, and many researchers have studied it extensively [3,7]. Recurrent Neural Networks (RNNs) are a type of neural network that contains feedback loops in its recurrent layers; through these loops, RNNs can store information in "memory cells". Several studies have demonstrated that various forms of RNNs outperform conventional financial time series models across different markets and financial investment applications [8,9]. However, they perform poorly when learning involves long-term temporal dependencies.
Reinforcement learning (RL) is a type of machine learning in which an agent learns by interacting with an environment and receiving feedback in the form of rewards or penalties. In contrast to supervised learning, where feedback provides explicit instruction on how to perform a task, RL learns through trial and error from signals indicating whether its actions were positive or negative. In essence, RL enables an agent to make optimal decisions based on past experience and feedback from the environment, learning both when to act and how the environment dispenses rewards.
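To make this interaction loop concrete, the following is a minimal sketch using the Gymnasium API. The CartPole environment and the random policy are placeholders for illustration only; they are not part of the paper's method.

```python
import gymnasium as gym

# Stand-in environment; a trading environment would expose the same API.
env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)

for step in range(1_000):
    action = env.action_space.sample()  # a trained policy would choose here
    obs, reward, terminated, truncated, info = env.step(action)
    # The scalar reward is the only feedback the agent receives.
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```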
Recent research seeks learning methods that optimize trading strategies while interacting with the dynamic real-world financial environment. Pendharkar and Cusatis [10] compared the performance of two RL methods, on-policy (SARSA) and off-policy (Q-learning). Almahdi and Yang [11] considered real-world constraints and high transaction costs to build a hybrid method combining recurrent RL and particle swarm optimization.
These studies used different RL-based methods depending on the problem settings they faced. Advantage Actor-Critic (A2C) is a typical actor-critic algorithm that improves policy gradient updates. In the basic actor-critic algorithm, the critic network estimates the value function, but the policy gradient suffers from large variance. The critic network in A2C not only estimates the value function of an action but also evaluates how the action could be performed better; to address the high variance of the policy gradient, A2C employs a baseline mechanism that makes the model more robust. In A2C, multiple agents work independently and pass averaged gradients to a global network.
The gradient of the A2C objective function $J(\theta)$ can be written as follows:

$\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t)\right]$

where $\pi_\theta(a_t \mid s_t)$ is the policy network at time step $t$ and $A(s_t, a_t)$ is the advantage function, which can be written as

$A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$

or

$A(s_t, a_t) = r(s_t, a_t, s_{t+1}) + \gamma V(s_{t+1}) - V(s_t)$
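As a minimal PyTorch sketch of how the second form of the advantage feeds the actor and critic updates, consider the function below. It is an illustration under simplifying assumptions (no entropy bonus, no terminal-state masking), not the paper's implementation.

```python
import torch

def a2c_losses(log_probs, values, next_values, rewards, gamma=0.99):
    """Actor and critic losses for one rollout; all inputs are 1-D tensors of length T."""
    # One-step advantage: A(s_t, a_t) = r_t + gamma * V(s_{t+1}) - V(s_t)
    advantages = rewards + gamma * next_values - values
    # Actor maximizes log pi(a_t | s_t) * A; detach() keeps advantage
    # gradients from flowing into the critic through this term.
    actor_loss = -(log_probs * advantages.detach()).mean()
    # Critic regresses V(s_t) toward the one-step bootstrapped target.
    critic_loss = advantages.pow(2).mean()
    return actor_loss, critic_loss
```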
Portfolio optimization involves adjusting the allocation of shares in a portfolio to maximize returns while minimizing risk, taking into account factors such as expected returns, volatility, and the correlation between assets. In our study, we adjust the volume of shares in the portfolio to find the allocation that maximizes returns while minimizing risk, striking a balance between generating high returns and managing the risk of losses through efficient diversification and asset allocation.
The Sharpe ratio is a commonly used metric for evaluating the risk-adjusted performance of an investment strategy. It measures the excess return of an asset or portfolio over the risk-free rate per unit of volatility; a higher Sharpe ratio indicates better risk-adjusted performance. In our study, we use the Sharpe ratio to measure how well the proposed model optimizes portfolio return with respect to risk, which lets us compare its performance with other models and benchmark indices.
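A standard computation of the annualized Sharpe ratio from daily returns is sketched below. The 252-trading-day convention is a common default; the paper does not state its annualization choice.

```python
import numpy as np

def sharpe_ratio(daily_returns, risk_free_rate=0.0, periods_per_year=252):
    """Annualized Sharpe ratio from a series of daily returns.

    risk_free_rate is an annual rate, spread evenly across trading days.
    """
    excess = np.asarray(daily_returns) - risk_free_rate / periods_per_year
    return np.sqrt(periods_per_year) * excess.mean() / excess.std(ddof=1)
```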
In this study, we assume that individual investor trades are small enough to have no impact on the market price of portfolio assets; buying or selling any number of shares does not change the stock price. In the portfolio trading process, we are free to buy and sell any number of shares at any time. To maximize the portfolio return per period, we formulate this process as a generalized Markov decision process (MDP) in which the agent selects the sequential trading direction of individual assets in the portfolio based on time-varying market characteristics.
The state space describes what the agent observes in the financial environment. The state at period $t$ can be expressed as follows:

$S_t = (X_t, cp_t, sh_t, ch_t, b_t)$

where:
$X_t$: the encoded features at time step $t$;
$cp_t$: the closing price of each stock at time step $t$;
$sh_t$: the shares owned of each stock at time step $t$;
$ch_t$: the exchange rate at time step $t$;
$b_t$: the available investment balance at time step $t$.
For each stock, the action space is defined as {−i, …, −1, 0, 1, …, j}, where i is the maximum number of shares to sell and j is the maximum number of shares to buy at each action step (negative actions sell, positive actions buy, and zero holds). For continuous control, we normalize the action space to [−1, 1]. In each state, the agent first selects an action and then performs the corresponding sell, buy, or hold operation.
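One way to map the normalized continuous action back to an integer trade size is sketched below. The paper only states that the action space is normalized to [−1, 1]; the rounding scheme here is an assumption for illustration.

```python
import numpy as np

def decode_action(raw_action: float, max_shares: int) -> int:
    """Map a normalized action in [-1, 1] to an integer trade size.

    Negative results sell shares, positive results buy, 0 holds.
    """
    raw_action = float(np.clip(raw_action, -1.0, 1.0))
    return int(round(raw_action * max_shares))
```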
The encoded features consist of five technical indicators plus OHLCV (Open, High, Low, Close, Volume) data for each stock. The five indicators are the Simple Moving Average (SMA), Exponential Moving Average (EMA), Moving Average Convergence Divergence (MACD), Relative Strength Index (RSI), and On-Balance Volume (OBV); they are widely used financial technical indicators for predicting stock price trends. Table 1 summarizes the technical indicators used in the proposed model.
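A pandas sketch of these five indicators is shown below. The window lengths (20 for SMA/EMA, 12/26 for MACD, 14 for RSI) are common defaults, not values stated in the paper.

```python
import numpy as np
import pandas as pd

def add_indicators(df: pd.DataFrame) -> pd.DataFrame:
    """Append SMA, EMA, MACD, RSI, and OBV columns to an OHLCV frame.

    Expects columns named Open/High/Low/Close/Volume.
    """
    close, volume = df["Close"], df["Volume"]
    df["SMA"] = close.rolling(window=20).mean()
    df["EMA"] = close.ewm(span=20, adjust=False).mean()
    # MACD: fast EMA(12) minus slow EMA(26)
    df["MACD"] = (close.ewm(span=12, adjust=False).mean()
                  - close.ewm(span=26, adjust=False).mean())
    # RSI(14): average gains relative to average losses, scaled to 0-100
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(window=14).mean()
    loss = (-delta.clip(upper=0)).rolling(window=14).mean()
    df["RSI"] = 100 - 100 / (1 + gain / loss)
    # OBV: cumulative volume signed by the direction of the price change
    df["OBV"] = (np.sign(delta).fillna(0) * volume).cumsum()
    return df
```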
From the state processing network, the agent observes the financial environment and leverages its experience through the Actor-Critic network. After updating the averaged gradients to the global network, the agent performs the final action, as depicted in Figure 1.
Our goal is to obtain the maximum return at the end of the investment process. To this end, we define the reward as the change in total portfolio value when moving from state $s_t$ to the new state $s_{t+1}$ under action $a_t$:

$R(s_t, a_t, s_{t+1}) = \left(b_{t+1} + cp_{t+1} \cdot sh_{t+1} \cdot ch_{t+1}\right) - \left(b_t + cp_t \cdot sh_t \cdot ch_t\right)$
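The reward computation amounts to a difference of portfolio values in the base currency, as in the sketch below; all argument names are illustrative.

```python
import numpy as np

def step_reward(balance_next, close_next, shares_next, fx_next,
                balance, close, shares, fx):
    """Reward as the change in total portfolio value (equation above).

    close and shares are per-stock arrays; fx converts each stock's
    local currency into the base currency.
    """
    value_next = balance_next + np.sum(close_next * shares_next * fx_next)
    value_prev = balance + np.sum(close * shares * fx)
    return value_next - value_prev
```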

3. Experiment and Evaluation

To evaluate the effectiveness of the proposed A2C-based model for financial portfolio management, we conducted experiments on a real-world stock market dataset. The dataset includes historical stock prices of various assets, as well as economic indicators and news sentiment data that may affect the stock market performance.
We used Python, NumPy, pandas, and the scikit-learn machine learning library to implement and train the A2C model on this dataset.
To accelerate computation, we used a PC with a four-core 2.7 GHz Intel CPU and 16 GB of DDR4 RAM, running the Ubuntu Desktop 20.04.5 LTS operating system. The experimental environment is further described in Table 2.
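The paper lists Python, NumPy, pandas, and scikit-learn but does not name its RL library, so the following training setup is a hypothetical sketch using Stable-Baselines3; PortfolioEnv and train_data are placeholders for a custom Gymnasium-style trading environment built from the state, action, and reward definitions in Section 2.

```python
from stable_baselines3 import A2C

env = PortfolioEnv(train_data)        # placeholder environment and data
model = A2C("MlpPolicy", env, gamma=0.99, verbose=1)
model.learn(total_timesteps=100_000)  # budget chosen for illustration
model.save("a2c_portfolio")
```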
To construct the portfolio, we selected Dow Jones Industrial Average (DJIA), SPDR S&P 500 ETF Trust (SPY), and Yuanta/P-shares Taiwan Top 50 ETF (0050.tw). In the portfolio, the first two stocks are in the US stock market and the last one is in the Taiwan stock market. We used daily data from 1 January 2010 to 1 January 2020 for training and data from 1 January 2020 to 7 June 2022 for validation (back-testing). Figure 2 shows the relative price of these three stocks from 1 January 2020 to 8 July 2022.
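One way to assemble this train/validation split is shown below, assuming the data came from Yahoo Finance via the yfinance package (the paper does not name its data source); "^DJI" is the DJIA index ticker on Yahoo Finance.

```python
import yfinance as yf

tickers = ["SPY", "^DJI", "0050.TW"]
train = yf.download(tickers, start="2010-01-01", end="2020-01-01")
test = yf.download(tickers, start="2020-01-01", end="2022-06-07")
```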
The closing price of SPY is the highest at the end of this back-testing period, which is why we chose SPY as the baseline: buying SPY on 1 January 2020 and holding it until 7 June 2022 would have returned almost 30%.
The portfolio comprises SPY, DJIA, and 0050.tw, and the agent uses the trained A2C model to adjust the shares owned of each stock at each time step. The time step in this research is one day: each day, the agent observes the financial environment and uses the trained A2C model to take an action. Table 3 presents the experimental results on the real financial dataset, covering the period from 1 January 2020 to 7 June 2022.
The portfolio outperforms each individual asset under a buy-and-hold strategy. Figure 3 shows the cumulative returns of the portfolio and the baseline, where the baseline is a buy-and-hold strategy that invests only in SPY.

4. Conclusions and Future Work

Investments in today's global economy are not limited to a single country. As a result, when constructing a portfolio, it is essential to consider Uncovered Equity Parity (UEP), which can affect the portfolio's performance. In this study, we propose an A2C-based model for financial portfolio management that takes different countries and exchange rates into account. Although exchange fees and stock transaction costs were not factored into the model, our findings indicate that investing in a portfolio yields better returns than investing in a single stock. Future research should incorporate exchange fees and stock transaction costs into the model, and we will explore other deep reinforcement learning models that can be integrated into our proposed framework.

Author Contributions

Conceptualization, Y.-H.C. and H.-C.W.; methodology, Y.-H.C.; investigation, H.-C.W.; writing, Y.-H.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The authors declare that all data supporting the results of this research are available in this article.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Chandrasekhar, C.P. Negative interest rates: Symptom of crisis or instrument for recovery. Econ. Political Wkly. 2017, 52, 53–60.
2. Li, Y.-M.; Lin, L.F.; Hsieh, C.Y.; Huang, B.S. A social investing approach for portfolio recommendation. Inf. Manag. 2021, 58, 103536.
3. Moghaddam, A.H.; Moghaddam, M.H.; Esfandyari, M. Stock market index prediction using artificial neural network. J. Econ. Finance Adm. Sci. 2016, 21, 89–93.
4. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780.
5. Jung, J.; Jung, K.M. Stock market uncertainty and uncovered equity parity deviation: Evidence from Asia. J. Asian Econ. 2021, 73, 101271.
6. Dong, X.Y.; Li, C.H.; Yoon, S.M. How can investors build a better portfolio in small open economies? Evidence from Asia's Four Little Dragons. N. Am. J. Econ. Finance 2021, 58, 19.
7. Chiang, W.-C.; Enke, D.; Wu, T.; Wang, R. An adaptive stock index trading decision support system. Expert Syst. Appl. 2016, 59, 195–207.
8. Bao, W.; Yue, J.; Rao, Y. A deep learning framework for financial time series using stacked autoencoders and long-short term memory. PLoS ONE 2017, 12, e0180944.
9. Sermpinis, G.; Karathanasopoulos, A.; Rosillo, R.; de la Fuente, D. Neural networks in financial trading. Ann. Oper. Res. 2021, 297, 293–308.
10. Pendharkar, P.C.; Cusatis, P. Trading financial indices with reinforcement learning agents. Expert Syst. Appl. 2018, 103, 1–13.
11. Almahdi, S.; Yang, S.Y. A constrained portfolio trading system using particle swarm algorithm and recurrent reinforcement learning. Expert Syst. Appl. 2019, 130, 145–156.
Figure 1. Flow diagram of the proposed model.
Figure 2. Flow diagram of the A2C-based model.
Figure 3. Returns and baseline of the portfolio.
Table 1. Summary of five features.

Type            Financial Technical Indicator
Moving average  Simple Moving Average (SMA)
Moving average  Exponential Moving Average (EMA)
Trend           Moving Average Convergence Divergence (MACD)
Momentum        Relative Strength Index (RSI)
Volume          On-Balance Volume (OBV)
Table 2. Experimental environment.

Numerical and Machine Learning Packages
Python 3.9.2: scikit-learn, imbalanced-learn, NumPy, SciPy, pandas
Table 3. Performance of portfolio for stock trading.

                    SPY      DJIA     0050.tw  Portfolio
Annual return       0.11728  0.05538  0.12267  0.21484
Cumulative return   0.30851  0.13962  0.30875  0.60423
Sharpe ratio        0.56783  0.33789  0.65887  0.81368