# Measuring the Impact of Financial News and Social Media on Stock Market Modeling Using Time Series Mining Techniques


## Abstract


## 1. Introduction

## 2. Previous Work

## 3. Theoretical Background

#### 3.1. Sentiment Analysis

#### 3.2. Symbolic Aggregate Approximation

#### 3.3. Dynamic Time Warping

## 4. Methodology

#### Pattern Discovery Method

- Identify patterns within the stock closing price signal of length $N$. Each pattern ${p}_{i}$ has the form:$${p}_{i}=[({t}_{i},{t}_{j}),\cdots ,({t}_{k},{t}_{l})]$$
- Compute the mean DTW distance over all extracted patterns, denoted $MDT{W}_{all}$.
- For each pattern:
  - (a) Calculate the DTW distance between the two time series (closing vs. sentiment, and closing vs. number of tweets) in every interval contained in the rule, extended by ±3 days. Let each distance be $MDT{W}_{i}^{m}$, where $i$ refers to the pattern and $m$ to the distinct rule intervals.
  - (b) Average the $MDT{W}_{i}^{m}$ values to obtain the mean DTW distance for the whole pattern, denoted $MDT{W}_{i}$.
- If $MDT{W}_{i}<MDT{W}_{all}$, the rule is considered valid for both time series.
- Return this pattern.
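The steps above can be sketched in a few lines of Python. This is a minimal illustration, assuming absolute differences as the local DTW cost and normalized NumPy arrays as inputs; the function and variable names are illustrative, not the paper's implementation:

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-programming DTW between two 1-D series,
    using the absolute difference as the local distance (delta)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)   # gamma: cumulative distances
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j - 1],
                                 cost[i - 1, j],
                                 cost[i, j - 1])
    return cost[n, m]

def rule_is_valid(close, other, intervals, mdtw_all, pad=3):
    """A rule is valid when the mean DTW distance over its intervals
    (each widened by +/- `pad` days) is below the global mean MDTW_all."""
    dists = [dtw_distance(close[max(0, s - pad):e + pad],
                          other[max(0, s - pad):e + pad])
             for s, e in intervals]
    return float(np.mean(dists)) < mdtw_all
```

The same check is run twice per pattern, once against the sentiment series and once against the tweet-count series.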

## 5. Experimental Results

#### 5.1. Data

#### 5.2. Preprocessing: Time Series Representation

#### 5.2.1. Preprocessing of the Companies’ News Data: Sentiment Analysis

#### 5.2.2. Preprocessing of Twitter Data (Tweets)

#### 5.3. Time Series Representation

#### 5.4. Pattern Detection

#### 5.5. Correlation Discovery: Dynamic Time Warping

- close and sentimentScore
- close and numTweets

#### 5.5.1. Forecasting Methods and Models

#### ARIMA

#### LR and GLM

#### SVM

#### ANN

#### 5.5.2. Can News and Tweets Improve the Prediction of the Next Closing Price?

- the sentiment score of news
- the number of tweets
- both of them

- **Read Excel.** This operator loads data from Microsoft Excel spreadsheets. In our case, the Excel file loaded into the RapidMiner tool has the following columns (attributes): date, close, volume, open, high, low, sentiment and tweets.
- **Select Attributes** (http://docs.rapidminer.com/studio/operators/blending/attributes/selection/select_attributes.html). This operator selects which attributes of an ExampleSet should be kept and which should be removed, for cases when not all attributes of an ExampleSet are required. In our case, we selected “date” as the attribute filter and enabled the “invert selection” option, because we needed to keep the remaining subset of attributes.
- **Windowing.** This operator transforms a given example set containing series data into a new example set of single-valued examples. Windows with a specified window and step size are moved across the series, and the attribute value lying *horizon* values after the window end is used as the label to be predicted. In simpler terms, we select the step in order to make the prediction; we chose to predict the next closing price based on the three previous days.
- **Cross Validation.** This operator performs a cross-validation to estimate the statistical performance of a learning operator, i.e., how accurately a model learned by that operator will perform in practice on unseen data. As explained previously, the two most accurate regression types, linear regression and Support Vector Machines (SVM), were used in our experiments.
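The windowing transformation can be reproduced outside RapidMiner with a short helper. The sketch below is illustrative only, not RapidMiner's implementation; the function name and defaults are assumptions:

```python
import numpy as np

def make_windows(series, window=3, horizon=1):
    """Slide a window across the series; each row of X holds `window`
    consecutive values, and y holds the value lying `horizon` steps
    after the window end (the label to be predicted)."""
    X, y = [], []
    for start in range(len(series) - window - horizon + 1):
        X.append(series[start:start + window])
        y.append(series[start + window + horizon - 1])
    return np.array(X), np.array(y)
```

With `window=3` and `horizon=1`, each row contains three consecutive closing prices and the label is the next day's close, matching the setup described above.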

#### 5.5.3. Results

## 6. Shortcomings of the Study

## 7. Conclusions

## Supplementary Materials

## Author Contributions

## Funding

## Conflicts of Interest

## References

- Technical-Analysis. The Trader’s Glossary of Technical Terms and Topics. Available online: http://www.traders.com (accessed on 10 August 2018).
- Anny, N.; Wai-Chee, F.A. Mining frequent episodes for relating financial events and stock trends. In Pacific-Asia Conference on Knowledge Discovery and Data Mining; Springer: Berlin/Heidelberg, Germany, 2003; pp. 27–39. [Google Scholar]
- Liu, Y.; Huang, X.; An, A.; Yu, X. ARSA: A sentiment-aware model for predicting sales performance using blogs. In Proceedings of the 2007 ACM 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, The Netherlands, 23–27 July 2007; pp. 607–614. [Google Scholar]
- Jacob, B.; Ronen, F.; Shimon, K.; Matthew, R. Which News Moves Stock Prices? A Textual Analysis; National Bureau of Economic Research: Cambridge, MA, USA, January 2013. [Google Scholar]
- Roll, R. R2. J. Financ.
**1988**, 43, 541–566. Available online: https://doi.org/10.1111/j.1540-6261.1988.tb04591.x (accessed on 6 November 2018). [CrossRef] - Ruiz Eduardo, J.; Hristidis, V.; Castillo, C.; Gionis, A.; Jaimes, A. Correlating financial time series with micro-blogging activity. In Proceedings of the 2012 ACM Fifth ACM International Conference on Web Search and Data Mining, Seattle, WA, USA, 8–12 February 2012; pp. 513–522. [Google Scholar]
- Mao, Y.; Wei, W.; Wang, B. Correlating SandP 500 stocks with Twitter data. In Proceedings of the 2012 ACM First ACM International Workshop on Hot Topics on Interdisciplinary Social Networks Research, Beijing, China, 12–16 August 2012; pp. 69–72. [Google Scholar]
- Xue, Z.; Hauke, F.; Gloor, P.A. Predicting stock market indicators through Twitter “I hope it is not as bad as I fear”. Procedia Soc. Behav. Sci.
**2011**, 26, 55–62. [Google Scholar] - Robert, S.; Hsinchun, C. Textual analysis of stock market prediction using financial news articles. In Proceedings of the 2006 AMCIS Americas Conference on Information Systems, Acapulco, Mexico, 4–6 August 2006; Volume 185. [Google Scholar]
- Li, X.; Xie, H.; Chen, L.; Wang, J.; Deng, X. News impact on stock price return via sentiment analysis. Knowl. Based. Syst.
**2014**, 69, 14–23. [Google Scholar] [CrossRef] - Carlo, S.; Alessandro, V. Wordnet Affect: An Affective Extension of Wordnet; Lrec: Lisbon, Portugal, 2004; pp. 1083–1086. [Google Scholar]
- Erik, C.; Daniel, O.; Dheeraj, R. SenticNet 3: A common and common-sense knowledge base for cognition-driven sentiment analysis. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, Quebec City, QC, Canada, 27–31 July 2014. [Google Scholar]
- Stefano, B.; Andrea, E.; Fabrizio, S. Sentiwordnet 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining; Lrec: Valletta, Malta, 2010; pp. 2200–2204. [Google Scholar]
- Text Analysis and Sentiment Polarity on FIFA World Cup 2014 Tweets. ACM SIGKDD 2005. Chicago, IL, USA. Available online: http://aylien.com/sentiment-analysis/ (accessed on 6 November 2018).
- Lin, J.; Eamonn, K.; Stefano, L.; Bill, C. A symbolic representation of time series, with implications for streaming algorithms. In Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, San Diego, CA, USA, 13 June 2003; pp. 2–11. [Google Scholar]
- Berndt, D.J.; James, C. Using Dynamic Time Warping to Find Patterns in Time Series. 1994, pp. 359–370. Available online: https://www.aaai.org/Papers/Workshops/1994/WS-94-03/WS94-03-031.pdf (accessed on 25 October 2018).
- Lin, J.; Keogh, E.; Patel, P.; Lonardi, S. Finding Motifs in Time Series. In Proceedings of the 8th ACM International Conference on KDD 2nd Workshop on Temporal Data Mining, Riverside, CA, USA, 23–26 July 2002; pp. 53–68. [Google Scholar]
- Senin, P.; Lin, J.; Wang, X.; Oates, T.; Gandhi, S.; Boedihardjo, P.A.; Chen, C.; Frankenstein, S.; Lerner, M. Grammarviz 2.0: A tool for grammar-based pattern discovery in time series. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases; Springer: Berlin/Heidelberg, Germany, 2014; pp. 468–472. [Google Scholar]
- Bliemel, F. Theil’s forecast accuracy coefficient: A clarification. J. Mark. Res.
**1973**, 10, 444–446. [Google Scholar] [CrossRef] - Jacobs, W.; Souza, A.M.; Zanini, R.R. Combination of Box-Jenkins and MLP/RNA Models for Forecasting Combining Forecasting of Box-Jenkins. IEEE Lat. Am. Trans.
**2016**, 14, 1870–1878. [Google Scholar] [CrossRef] - Vagropoulos, S.I.; Chouliaras, G.I.; Kardakos, E.G.; Simoglou, C.K.; Bakirtzis, A.G. Comparison of SARIMAX, SARIMA, Modified SARIMA and ANN-based Models for short-term PV generation forecasting. In Proceedings of the IEEE International Energy Conference (ENERGYCON), Leuven, Belgium, 4–8 April 2016. [Google Scholar]
- Stock, J.H.; Watson, M. Combination Forecasts of Output Growth in a Seven-Country Dataset. J. Forecast.
**2004**, 23, 405–430. [Google Scholar] [CrossRef] - Cao, L.J.; Tay, F.E.H. Financial forecasting using support vector machines. Neural Comput. Appl.
**2001**, 10, 184–192. [Google Scholar] [CrossRef] - Guresen, E.; Kayakutlu, G.; Daim, T.U. Using Artificial Neural Network Models in Stock Market Prediction. Expert Syst. Appl.
**2011**, 38, 10389–10397. [Google Scholar] [CrossRef] - Diebold, F.X. Comparing predictive accuracy, twenty years later: A personal perspective on the use and abuse of Diebold–Mariano tests. J. Bus. Econ. Stat.
**2015**, 33, 1. [Google Scholar] [CrossRef] - YALE: Rapid Prototyping for Complex Data Mining Tasks. KDD 2006. Philadelphia, Pennsylvania, USA. Available online: https://rapidminer.com/ (accessed on 6 November 2018).

**Figure 1.** Example of the Symbolic Aggregate Approximation (SAX) method producing a symbolic representation of a time series. Dimensionality is first reduced via Piecewise Aggregate Approximation (PAA). The resulting symbolic representation is: baabccbc [15].
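The SAX procedure illustrated in Figure 1 can be sketched as follows. This is a minimal version assuming the standard Gaussian breakpoints for a three-letter alphabet (approximately ±0.43) and a series length divisible by the segment count:

```python
import numpy as np

# Breakpoints splitting N(0, 1) into three equiprobable regions,
# yielding the alphabet {a, b, c} used in the figure.
BREAKPOINTS = np.array([-0.43, 0.43])

def sax(series, n_segments, alphabet="abc"):
    """Minimal SAX sketch: z-normalize, reduce dimensionality with PAA
    (the mean of each segment), then map each segment mean to a symbol.
    The series length must be divisible by n_segments."""
    x = np.asarray(series, dtype=float)
    x = (x - x.mean()) / x.std()
    paa = x.reshape(n_segments, -1).mean(axis=1)
    return "".join(alphabet[np.searchsorted(BREAKPOINTS, v)] for v in paa)
```

Larger alphabets use more breakpoints, again chosen so each symbol is equiprobable under the normal distribution.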

**Figure 2.**Example of the similarity comparison of two sequences using DTW. The $\delta $ distance measures the distance between two points in the time series. $\gamma $ is the cumulative distance for each point. The closer to the diagonal the warping path is located, the more similar the two sequences are.

**Figure 4.** Time series of close, sentimentScore and numTweets for AAPL, normalized into the [0, 1] interval.

**Figure 5.** Rule $R\#5$ from the AAPL close time series, as illustrated by GrammarViz 2.0. The pattern occurs three times, on intervals [35, 61], [74, 93] and [100, 121]; rule $R\#5$ represents these three intervals.

**Figure 6.** If the mean DTW value over the rule intervals (R) is smaller than the mean DTW value over windows (w) of random size, a correlation between the two time series is assumed.

**Figure 7.** SVM and LR outperform all other models, while ARIMA and ANN perform significantly worse.

**Figure 9.**Improvement rates (expressed in %) of pattern intervals (rules) about AAPL by using linear regression.

**Figure 10.**Improvement rates (expressed in %) of pattern intervals (rules) about AAPL by using SVM regression.

**Figure 11.**Improvement rates (expressed in %) at random intervals about AAPL by using linear regression.

**Figure 12.**Improvement rates (expressed in %) at random intervals about AAPL by using SVM regression.

**Figure 13.**Topic modeling for AAPL and GE stocks. The y-axis represents the number of texts in each topic, and the x-axis represents the topicId.

**Figure 14.**Topic modeling for IBM and MSFT stocks. The y-axis represents the number of texts in each topic, and the x-axis represents the topicId.

**Figure 15.**Topic modeling for ORCL stock. The y-axis represents the number of texts in each topic, and the x-axis represents the topicId.

| | AAPL | GE | IBM | MSFT | ORCL |
|---|---|---|---|---|---|
| Number of news | 4282 | 1506 | 1549 | 2577 | 575 |
| Number of tweets | 310,503 | 46,237 | 56,804 | 67,107 | 16,494 |

| Metric | Formula |
|---|---|
| Root Mean Squared Error (RMSE) | $RMSE=\sqrt{\frac{{\sum}_{i=1}^{N}{e}_{i}^{2}}{N}}$ |
| Mean Absolute Error (MAE) | $MAE=\frac{{\sum}_{i=1}^{N}\left|{e}_{i}\right|}{N}$ |
| Theil’s $U2$ | $U2=\frac{\sqrt{\frac{{\sum}_{i=1}^{N-1}{\left(\frac{{f}_{i+1}-{y}_{i+1}}{{y}_{i}}\right)}^{2}}{N}}}{\sqrt{\frac{{\sum}_{i=1}^{N-1}{\left(\frac{{y}_{i+1}-{y}_{i}}{{y}_{i}}\right)}^{2}}{N}}}$ |
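These metrics are straightforward to compute from the actual series $y$ and forecasts $f$. In the Theil's U2 helper below, the $1/N$ factors cancel in the ratio, so using means over the $N-1$ terms is algebraically equivalent to the tabulated formula:

```python
import numpy as np

def rmse(y, f):
    """Root mean squared error of forecasts f against actuals y."""
    return float(np.sqrt(np.mean((f - y) ** 2)))

def mae(y, f):
    """Mean absolute error."""
    return float(np.mean(np.abs(f - y)))

def theil_u2(y, f):
    """Theil's U2: relative errors of the model divided by those of
    the naive no-change forecast; values below 1 beat the naive one."""
    num = np.sqrt(np.mean(((f[1:] - y[1:]) / y[:-1]) ** 2))
    den = np.sqrt(np.mean(((y[1:] - y[:-1]) / y[:-1]) ** 2))
    return float(num / den)
```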

**Theil U Decomposition**

| Stock (Avg. Price in US$) | ARIMA | GLM | LR | SVM | ANN |
|---|---|---|---|---|---|
| AAPL (116.29 $) | 0.0263 | 0.0205 | 0.0217 | 0.0211 | 0.0263 |
| GE (29.23 $) | 0.0107 | 0.0149 | 0.0102 | 0.0103 | 0.0195 |
| IBM (142.25 $) | 0.0591 | 0.0270 | 0.0267 | 0.0264 | 0.0504 |
| MSFT (50.81 $) | 0.0496 | 0.0337 | 0.0316 | 0.0316 | 0.0424 |
| ORCL (37.89 $) | 0.0146 | 0.0119 | 0.0118 | 0.0117 | 0.0184 |

**Table 4.**Diebold–Mariano (DM)-test results on AAPL for all five forecasting models, carried out in pairs. Green colors represent high p-values, while red corresponds to cases where the null hypothesis is rejected due to almost zero p-values.

**Null Hypothesis: Both Forecasts Have the Same Accuracy**

| p-value | ARIMA | GLM | LR | SVM | ANN |
|---|---|---|---|---|---|
| ARIMA | | 0.6067 | 0.6506 | 0.6280 | 0.5403 |
| GLM | | | 0.8132 | 0.9198 | 0.0086 |
| LR | | | | 0.9636 | 0.0069 |
| SVM | | | | | 0.0076 |

| DM-statistic | ARIMA | GLM | LR | SVM | ANN |
|---|---|---|---|---|---|
| ARIMA | | 0.5357 | 0.4704 | 0.5038 | −0.6397 |
| GLM | | | 0.1442 | 0.0520 | −3.4563 |
| LR | | | | 0.0473 | −3.6067 |
| SVM | | | | | −3.5415 |
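The pairwise comparisons above follow the Diebold–Mariano procedure. The paper does not specify its exact estimator, so the sketch below is an assumption: squared-error loss, a simple autocovariance-corrected variance, and a normal approximation for the p-value:

```python
import math
import numpy as np

def diebold_mariano(e1, e2, h=1):
    """Diebold-Mariano test sketch with squared-error loss.
    e1, e2: forecast error arrays of the two competing models.
    Returns (DM statistic, two-sided p-value) under a normal
    approximation, using h-1 autocovariance lags in the variance."""
    d = e1 ** 2 - e2 ** 2                 # loss differential
    n = len(d)
    d_bar = d.mean()
    var = np.sum((d - d_bar) ** 2) / n    # gamma_0
    for k in range(1, h):                 # correction for h-step forecasts
        cov = np.sum((d[k:] - d_bar) * (d[:-k] - d_bar)) / n
        var += 2.0 * cov
    dm = d_bar / math.sqrt(var / n)
    p = math.erfc(abs(dm) / math.sqrt(2))  # 2 * (1 - Phi(|dm|))
    return dm, p
```

A strongly negative statistic means the first model's squared errors are systematically smaller, matching the sign convention of the table.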

**Table 5.**The improvement rates of the next closing price of the five companies using rules (in RapidMiner).

| | Sentiment and Tweets | Sentiment Only | Tweets Only |
|---|---|---|---|
| **AAPL Linear** | | | |
| R#2 | 0.00 | 0.00 | 0.00 |
| R#5 | 0.00 | 0.00 | 2.93 |
| R#6 | 11.13 | 30.39 | 6.36 |
| R#7 | 0.00 | 0.00 | 0.00 |
| **AAPL SVM** | | | |
| R#2 | 96.22 | 96.81 | 95.06 |
| R#5 | 97.89 | 78.52 | 96.15 |
| R#6 | 43.00 | 36.40 | 67.54 |
| R#7 | 97.95 | 97.64 | 96.34 |
| **GE Linear** | | | |
| R#4 | 0.00 | 0.54 | 0.00 |
| R#5 | 0.00 | 1.21 | 0.00 |
| R#6 | 0.00 | 0.00 | 0.00 |
| R#7 | 22.52 | 25.74 | 0.00 |
| **GE SVM** | | | |
| R#4 | 83.18 | 70.25 | 65.67 |
| R#5 | 99.77 | 99.66 | 99.61 |
| R#6 | 51.01 | 52.78 | 55.37 |
| R#7 | 45.66 | 41.68 | 32.28 |
| **IBM Linear** | | | |
| R#4 | 0.00 | 0.89 | 0.00 |
| R#5 | 33.41 | 1.00 | 37.61 |
| R#6 | 0.00 | 0.00 | 0.00 |
| R#7 | 0.00 | 0.00 | 0.00 |
| **IBM SVM** | | | |
| R#4 | 98.53 | 85.69 | 93.56 |
| R#5 | 88.13 | 90.06 | 75.38 |
| R#6 | 90.19 | 90.59 | 77.66 |
| R#7 | 0.00 | 0.00 | 0.00 |
| **MSFT Linear** | | | |
| R#1 | 0.00 | 0.00 | 0.00 |
| R#2 | 0.00 | 0.00 | 0.00 |
| R#3 | 13.71 | 12.82 | 14.16 |
| R#4 | 0.00 | 2.52 | 0.00 |
| **MSFT SVM** | | | |
| R#1 | 71.76 | 64.36 | 0.00 |
| R#2 | 34.45 | 25.07 | 27.95 |
| R#3 | 7.83 | 7.41 | 0.00 |
| R#4 | 76.76 | 73.09 | 65.78 |
| **ORCL Linear** | | | |
| R#5 | 0.00 | 0.00 | 0.00 |
| R#6 | 0.00 | 12.60 | 17.26 |
| R#7 | 5.12 | 0.00 | 26.40 |
| **ORCL SVM** | | | |
| R#5 | 92.91 | 24.63 | 88.53 |
| R#6 | 0.00 | 0.00 | 0.00 |
| R#7 | 44.75 | 12.35 | 0.00 |

**Table 6.**The improvement rates of the next closing price of the five companies using random intervals, without using rules (in RapidMiner).

| | Sentiment and Tweets | Sentiment Only | Tweets Only |
|---|---|---|---|
| **AAPL Linear** | | | |
| Random Interval #1 | 12.92 | 20.97 | 20.97 |
| Random Interval #2 | 0.00 | 0.00 | 0.00 |
| Random Interval #3 | 0.00 | 0.00 | 0.00 |
| Random Interval #4 | 0.00 | 1.57 | 0.00 |
| Random Interval #5 | 0.00 | 0.00 | 0.00 |
| Random Interval #6 | 1.52 | 0.00 | 0.00 |
| Random Interval #7 | 0.00 | 4.56 | 0.00 |
| **AAPL SVM** | | | |
| Random Interval #1 | 0.00 | 6.52 | 0.00 |
| Random Interval #2 | 43.90 | 37.40 | 47.85 |
| Random Interval #3 | 50.49 | 46.30 | 41.83 |
| Random Interval #4 | 24.89 | 33.73 | 13.49 |
| Random Interval #5 | 70.92 | 66.03 | 35.56 |
| Random Interval #6 | 57.96 | 52.69 | 60.56 |
| Random Interval #7 | 7.33 | 42.67 | 0.00 |
| **GE Linear** | | | |
| Random Interval #1 | 26.94 | 26.94 | 26.94 |
| Random Interval #2 | 0.00 | 0.00 | 0.00 |
| Random Interval #3 | 0.78 | 0.00 | 0.00 |
| Random Interval #4 | 0.00 | 0.00 | 0.00 |
| Random Interval #5 | 0.00 | 0.00 | 0.00 |
| Random Interval #6 | 0.00 | 0.00 | 5.95 |
| **GE SVM** | | | |
| Random Interval #1 | 33.33 | 27.24 | 21.51 |
| Random Interval #2 | 47.04 | 37.57 | 39.05 |
| Random Interval #3 | 64.55 | 70.30 | 61.59 |
| Random Interval #4 | 45.21 | 0.00 | 39.32 |
| Random Interval #5 | 25.04 | 0.00 | 24.40 |
| Random Interval #6 | 73.27 | 70.68 | 52.27 |
| **IBM Linear** | | | |
| Random Interval #1 | 0.00 | 0.00 | 0.00 |
| Random Interval #2 | 1.10 | 0.00 | 0.00 |
| Random Interval #3 | 0.00 | 30.31 | 30.31 |
| Random Interval #4 | 6.95 | 2.66 | 14.87 |
| Random Interval #5 | 37.16 | 39.00 | 38.43 |
| Random Interval #6 | 60.88 | 60.88 | 60.88 |
| Random Interval #7 | 5.14 | 5.14 | 5.14 |
| **IBM SVM** | | | |
| Random Interval #1 | 45.45 | 56.35 | 0.00 |
| Random Interval #2 | 25.32 | 25.53 | 24.20 |
| Random Interval #3 | 49.15 | 26.10 | 50.07 |
| Random Interval #4 | 49.57 | 61.39 | 15.28 |
| Random Interval #5 | 66.10 | 26.16 | 40.78 |
| Random Interval #6 | 41.55 | 12.83 | 21.00 |
| Random Interval #7 | 6.03 | 30.80 | 0.00 |
| **MSFT Linear** | | | |
| Random Interval #1 | 72.98 | 74.57 | 74.57 |
| Random Interval #2 | 25.13 | 27.26 | 27.26 |
| Random Interval #3 | 57.90 | 57.49 | 57.90 |
| Random Interval #4 | 65.32 | 65.32 | 65.32 |
| Random Interval #5 | 0.30 | 14.56 | 14.56 |
| Random Interval #6 | 17.86 | 3.97 | 17.86 |
| Random Interval #7 | 40.63 | 42.07 | 42.71 |
| **MSFT SVM** | | | |
| Random Interval #1 | 21.59 | 33.89 | 7.25 |
| Random Interval #2 | 59.45 | 54.10 | 50.23 |
| Random Interval #3 | 20.03 | 57.34 | 0.00 |
| Random Interval #4 | 39.64 | 51.13 | 27.63 |
| Random Interval #5 | 45.47 | 34.48 | 4.25 |
| Random Interval #6 | 22.45 | 22.54 | 10.11 |
| Random Interval #7 | 91.55 | 67.40 | 92.20 |
| **ORCL Linear** | | | |
| Random Interval #1 | 67.89 | 69.39 | 69.39 |
| Random Interval #2 | 0.00 | 13.29 | 0.00 |
| Random Interval #3 | 0.00 | 0.00 | 0.00 |
| Random Interval #4 | 0.00 | 4.77 | 4.77 |
| Random Interval #5 | 58.34 | 58.34 | 58.34 |
| Random Interval #6 | 0.00 | 0.00 | 0.00 |
| Random Interval #7 | 14.63 | 30.23 | 30.02 |
| **ORCL SVM** | | | |
| Random Interval #1 | 0.00 | 19.61 | 0.00 |
| Random Interval #2 | 60.23 | 56.87 | 20.32 |
| Random Interval #3 | 50.80 | 0.00 | 28.25 |
| Random Interval #4 | 60.24 | 62.06 | 38.49 |
| Random Interval #5 | 76.92 | 73.75 | 18.81 |
| Random Interval #6 | 0.00 | 0.00 | 0.00 |
| Random Interval #7 | 54.75 | 49.19 | 52.59 |

**Table 7.**DM-test between two forecasting models, based on SVM and using sentiments and tweets as additional features. The leftmost model is the one considering the intervals denoted by rules, while the rightmost represents the model of random intervals.

**Null Hypothesis: Both Forecasts Have the Same Accuracy (Rules vs. Random Intervals)**

| Company Index | p-Value | DM Statistic |
|---|---|---|
| AAPL (116.29 $) | 0.1107 | −1.6839 |
| GE (29.23 $) | 0.149 | −1.5968 |
| IBM (142.25 $) | 0.081 | −2.2125 |
| MSFT (50.81 $) | 0.09359 | −2.4001 |
| ORCL (37.89 $) | 0.1367 | −1.5357 |

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Kollintza-Kyriakoulia, F.; Maragoudakis, M.; Krithara, A.
Measuring the Impact of Financial News and Social Media on Stock Market Modeling Using Time Series Mining Techniques. *Algorithms* **2018**, *11*, 181.
https://doi.org/10.3390/a11110181
