Article

A Temporal Window Attention-Based Window-Dependent Long Short-Term Memory Network for Multivariate Time Series Prediction

College of Computer Science and Technology, Harbin Engineering University, Harbin 150001, China
*
Author to whom correspondence should be addressed.
Entropy 2023, 25(1), 10; https://doi.org/10.3390/e25010010
Submission received: 16 October 2022 / Revised: 30 November 2022 / Accepted: 15 December 2022 / Published: 21 December 2022
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract

Multivariate time series prediction models perform the required operation on a specific window length of a given input. However, capturing complex and nonlinear interdependencies within each temporal window remains challenging. Typical attention mechanisms assign weights to variables at the same time step or to the features of each previous time step to capture spatio-temporal correlations, but they fail to directly extract each time step’s relevant features that affect future values and thus cannot learn the spatio-temporal pattern from a global perspective. To this end, a temporal window attention-based window-dependent long short-term memory network (TWA-WDLSTM), which exploits the encoder–decoder framework, is proposed to enhance the temporal dependencies. In the encoder, we design a temporal window attention mechanism to select relevant exogenous series within a temporal window. Furthermore, we introduce a window-dependent long short-term memory network (WDLSTM) to encode the input sequences in a temporal window into a feature representation and to capture very long-term dependencies. In the decoder, we use WDLSTM to generate the prediction values. We applied our model to four real-world datasets and compared it against a variety of state-of-the-art models. The experimental results suggest that TWA-WDLSTM outperforms the comparison models. In addition, the temporal window attention mechanism has good interpretability: we can observe which variables contribute to the future value.

1. Introduction

Predicting multivariate time series precisely is of great significance in real life, for example, in traffic prediction, virus spread prediction, energy prediction, and air quality prediction [1]. For instance, policymakers can plan interventions by predicting epidemic trajectories [2], and people can plan outdoor activities by predicting future air quality [3]. For multivariate time series prediction, it is common practice to use a temporal window size to divide the time series data into multiple feature matrices with fixed dimensions [4] as the input. That is to say, no information before the window is available. However, in practice, it is hard for an appropriately sized window to contain all useful information. Therefore, considering the information of the previous window can help to find meaningful insights into time series data.
Capturing spatio-temporal correlation is the main problem in multivariate time series prediction tasks. The spatio-temporal correlations within a temporal window represent the influence of different exogenous series on prediction at different time steps. Most studies capture one-dimensional spatio-temporal correlations, such as the spatial correlations among variables at the same time step and temporal correlation among different time steps, as shown in Figure 1. The one-dimensional spatio-temporal correlations focus only on locally relevant variables, which leads to ignoring some important information. For instance, although variable 1 at time t is more important than variable 2 at time i, the weight coefficient of variable 1 is smaller than that of variable 2. Therefore, it is necessary to consider the global information (i.e., two-dimensional spatio-temporal correlations) within a temporal window for multivariate time series prediction.
Historically, a vital issue of time series prediction is capturing and exploiting the dynamic temporal dependencies in a temporal window [5]. The autoregressive integrated moving average (ARIMA) [6,7] focuses only on capturing the temporal dependence of the target series itself and ignores the exogenous series. Hence, ARIMA is not suitable for modeling dynamic relationships among variables for multivariate time series. The vector autoregressive (VAR) [8] model is developed to utilize exogenous series features. The hidden Markov model (HMM) is a classic statistical model that contributes to extracting the dynamic temporal behavior of the multivariate time series [9]. However, VAR and HMM suffer from the problem of overfitting with large-scale time series. To overcome this limitation, recurrent neural network (RNN) and its variants (e.g., long short-term memory network (LSTM) [10] and gated recurrent unit (GRU) [11]) have been developed for multivariate time series prediction. Nevertheless, they fail to differentiate the contribution of the exogenous series to the target series, resulting in limited information.
Most single time series prediction networks are no longer sufficient for fitting a given multivariate time series due to the complexity and variability of the time series structure [9,12]. Encoder–decoder networks based on LSTM or GRU units are popular due to their ability to handle complex time series. However, they are limited in capturing the dependencies of long input sequences. Inspired by the learning theory of human attention, attention-based encoder–decoder networks have been applied to select the most relevant series. For instance, the two-stage attention mechanism [13] has been designed to predict multivariate time series; it employs input attention to select the exogenous series and temporal attention to capture historical information. Subsequently, the multistage attention mechanism [14] has been developed to capture the influence information from different time stages. Recently, the temporal self-attention mechanism [1] has been introduced into the encoder–decoder to extract the temporal dependence. These attention-based encoder–decoder networks achieve state-of-the-art performance by extracting the spatial–temporal dependence for multivariate time series prediction. Nevertheless, these models focus on important information in the spatial and temporal dimensions separately. In addition, LSTM and GRU summarize the information only within a temporal window, so they cannot memorize very long-term information.
To address these issues, we propose a temporal window attention-based window-dependent long short-term memory network (TWA-WDLSTM) to predict future values. The contributions of our work are summarized as follows:
(1)
Temporal window attention mechanism. We develop a new attention concept, namely a temporal window attention mechanism, to learn the spatio-temporal pattern. The temporal window attention mechanism learns the weight distribution strategy to select the relevant variables in a temporal window; hence, it can capture two-dimensional spatio-temporal correlations from a global perspective.
(2)
Window-dependent long short-term memory network (WDLSTM). We design a novel recurrent neural network as the encoder and decoder to enforce the long-term temporal dependencies among time series because it utilizes the previous temporal window hidden state matrix.
(3)
Real evaluation. We construct extensive experiments on four real-world datasets to demonstrate the effectiveness of our model. The results show that our model outperforms state-of-the-art comparison models.
The remainder of this paper is organized as follows. We provide a literature review on time series prediction methods in Section 2. We define the notations and problem formulation and introduce the specific details of our model in Section 3. We design experiments to test the validity of our model in different fields and analyse the experimental results in Section 4. We summarize our work and outline future work in Section 5.

2. Related Works

2.1. Recurrent Neural Network

As deep learning develops, more models for multivariate time series prediction have been proposed. The recurrent neural network (RNN), with its capacity to capture short-term dependencies of time series, has become one of the most prominent multivariate time series prediction models. Recently, some advanced RNN variants were proposed to overcome the vanishing gradient problem and capture long-term dependencies. For instance, Feng et al. [15] introduced the clockwork RNN, which runs the hidden layer at different clock speeds to solve the long-term dependency problem. Zhang et al. [16] modified the GRU architecture, in which gates explicitly regulate two distinct types of memories, to predict medical records and multi-frequency phonetic time series. Ma et al. [17] designed a temporal pyramid RNN to capture long-term and multi-scale temporal dependencies. Harutyunyan et al. [18] introduced a modification of LSTM, namely channel-wise LSTM, which is most remarkable in length-of-stay prediction. Zhang et al. [19] designed CloudLSTM, which employs a dynamic point-cloud convolution operator as the core component for spatial–temporal point-cloud stream forecasting. The above RNN methods focus on utilizing the information within a temporal window. Limited by the temporal window size, they cannot fully exploit very long-term temporal dependencies.
The input length limits the performance of an RNN: as the length increases, its ability to extract temporal features decreases. To address this issue, combining an attention mechanism with an RNN to select relevant features provides excellent results. For instance, Wang et al. [20] used two stacked LSTM layers with two attention layers to improve prediction performance. Liu et al. [21] introduced the dual-stage two-phase attention-based LSTM to enhance spatial features and discover long-term dependencies to predict long-term time series. Huang et al. [3] designed a spatial attention-embedded LSTM to tackle the intricate spatial–temporal interactions for air quality prediction. Liang et al. [22] developed a multi-level attention-based LSTM to capture the complex correlations between variables for geo-sensory time series prediction. Deng et al. [23] employed a graph attention mechanism to learn the dependence relationships for anomaly detection. Preeti et al. [24] used self-attention to focus more on relevant parts of the time series in the spatial dimension. However, these studies pay more attention to designing various attention mechanisms to obtain relationships in the spatial dimension. In the temporal dimension, the typical temporal attention mechanism selects the relevant time steps but ignores the relevant variables within them.

2.2. Convolutional Neural Network

Convolutional neural networks (CNNs) are popular because of their efficient multi-threaded GPU computation. Wang et al. [25] designed multiple CNNs to integrate the correlations among multiple periods for periodic multivariate time series prediction. Wu et al. [26] developed temporal convolution modules to learn temporal dependencies. However, CNNs fail to capture long-term temporal dependencies for multivariate time series prediction, so most studies additionally use other methods to extract global temporal correlations. For instance, Lai et al. [27] employed convolutional neural network models to extract local temporal features of time series and a GRU to discover long-term dependencies. Cao et al. [28] introduced a temporal convolutional network to mine short-term temporal correlations and an encoder–decoder attention module to capture long-term temporal correlations for traffic flow prediction.

3. Model

3.1. Notation and Problem Statement

Multivariate time series data are composed of multiple exogenous series and one target series. Given N exogenous series and the temporal window size T, we divide the time series data into multiple feature matrices. A matrix with fixed dimensions is as follows:
$$\mathbf{X}_{t'} = (\boldsymbol{x}^1, \boldsymbol{x}^2, \ldots, \boldsymbol{x}^N) = (\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_T) = \begin{bmatrix} x_1^1 & \cdots & x_t^1 & \cdots & x_T^1 \\ x_1^2 & \cdots & x_t^2 & \cdots & x_T^2 \\ \vdots & & \vdots & & \vdots \\ x_1^N & \cdots & x_t^N & \cdots & x_T^N \end{bmatrix}$$

where $x_t^i$ is the value of the $i$-th exogenous series at time step $t$, and $t'$ denotes the $t'$-th temporal window.
Given the previous values of the target series, $\boldsymbol{y}_{t'} = (y_1, y_2, \ldots, y_T)$ with $y_t \in \mathbb{R}$, and the past values of the $N$ exogenous series, i.e., $\mathbf{X}_{t'} \in \mathbb{R}^{T \times N}$, our aim is to design a model for learning the complex nonlinear relationships among the time series:

$$\hat{y}_{t', T+1} = F(\boldsymbol{y}_{t'}, \mathbf{X}_{t'})$$

where $F(\cdot)$ is the prediction model we aim to construct.
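To make this formulation concrete, the following sketch (ours, not the authors' code; the array names are illustrative) slices a multivariate series into the fixed-size windows $\mathbf{X}_{t'} \in \mathbb{R}^{T \times N}$ together with the target history $\boldsymbol{y}_{t'}$ and the next-step target $y_{t', T+1}$ that $F(\cdot)$ must predict.

```python
import numpy as np

def make_windows(exogenous: np.ndarray, target: np.ndarray, T: int):
    """exogenous: shape (L, N); target: shape (L,).
    Returns X of shape (M, T, N), y_hist of shape (M, T), y_next of shape (M,), M = L - T."""
    L, _ = exogenous.shape
    X, y_hist, y_next = [], [], []
    for start in range(L - T):
        X.append(exogenous[start:start + T])    # one temporal window of exogenous series
        y_hist.append(target[start:start + T])  # previous target values within the window
        y_next.append(target[start + T])        # value to predict, y_{t', T+1}
    return np.stack(X), np.stack(y_hist), np.array(y_next)

# toy example: 100 time steps, 6 exogenous series, window size T = 10
rng = np.random.default_rng(0)
X, y_hist, y_next = make_windows(rng.normal(size=(100, 6)), rng.normal(size=100), T=10)
print(X.shape, y_hist.shape, y_next.shape)      # (90, 10, 6) (90, 10) (90,)
```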

3.2. Exhaustive Description of the Model

We employ the attention-based encoder–decoder network to perform multivariate time series prediction, as depicted in Figure 2. Specifically, in the encoder, we propose a temporal window attention mechanism to adaptively select relevant variables in a temporal window. To capture very long-term temporal dependencies, WDLSTM is introduced to encode the input features into a hidden state matrix. In the decoder, WDLSTM decodes the encoded input features to predict future values. The detailed structure of WDLSTM is shown in Figure 3.

3.2.1. Temporal Window Attention Mechanism

Most existing work focuses mainly on designing different attention mechanisms that select the relevant variables at the same time step or capture historical information at different time steps to improve prediction performance. However, there is a critical defect in applying existing attention mechanisms to multivariate time series prediction because they fail to select the essential variables in terms of prediction utility. Moreover, temporal attention mechanisms perform a weighted average of historical information and fail to detect the time steps that are useful for prediction. Ref. [20] has proved that utilizing both temporal and spatial attention can improve prediction accuracy. Hence, we design a temporal window attention mechanism to capture two-dimensional spatio-temporal correlations in a temporal window.
For the prediction task, we use the features within the temporal window to generate the value for the next time step. We take the series matrix within a temporal window, $\mathbf{X}_{t'} = (\boldsymbol{x}^1, \boldsymbol{x}^2, \ldots, \boldsymbol{x}^N) = (\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_T) \in \mathbb{R}^{T \times N}$, and a variable $x_t^k$ as the input. We define the temporal window attention mechanism via a multilayer perceptron that scores the importance of the $k$-th input variable at time $t$ within the temporal window, as follows:

$$e_t^k = \boldsymbol{v}_e^{\top} \tanh\big([\mathbf{H}_{t'-1}; \mathbf{S}_{t'-1}]\,\boldsymbol{w}_e + \mathbf{X}_{t'}\,\boldsymbol{w}'_e + \boldsymbol{w}''_e\, x_t^k + \boldsymbol{b}_e\big)$$

where $[\mathbf{H}_{t'-1}; \mathbf{S}_{t'-1}] \in \mathbb{R}^{T \times 2n}$ is a concatenation, $\mathbf{H}_{t'-1}$ and $\mathbf{S}_{t'-1}$ are the hidden state matrix and cell state matrix of the WDLSTM unit, respectively, and $\boldsymbol{v}_e \in \mathbb{R}^T$, $\boldsymbol{w}_e \in \mathbb{R}^{2n}$, $\boldsymbol{w}'_e \in \mathbb{R}^N$, $\boldsymbol{w}''_e \in \mathbb{R}^T$, and $\boldsymbol{b}_e \in \mathbb{R}^T$ are parameters to learn.
After that, the attention weights need to be normalized. The softmax function helps to capture long-term dependencies, yet its time complexity prohibits scaling up. To allow more variables to be useful for prediction, we use the Taylor softmax [29] function instead of the softmax function, as follows:
$$\alpha_t^k = \frac{1 + e_t^k + 0.5\,(e_t^k)^2}{\sum_{\tau=1}^{T}\sum_{j=1}^{N}\big(1 + e_\tau^j + 0.5\,(e_\tau^j)^2\big)}$$
where $\alpha_t^k$ is the attention weight measuring the importance of the $k$-th series at time $t$ in temporal window $t'$ to the future value. The Taylor softmax function employs the second-order Taylor expansion of the exponential, $\exp(e_t^k) \approx 1 + e_t^k + 0.5\,(e_t^k)^2$. Moreover, the numerator $1 + e_t^k + 0.5\,(e_t^k)^2$ is strictly positive and never becomes zero because its minimum value is 0.5. Hence, the Taylor softmax function can enhance numerical stability. Once we obtain the attention weights, the new input matrix is computed as follows:
$$\widetilde{\mathbf{X}}_{t'} = \begin{bmatrix} \alpha_1^1 x_1^1 & \cdots & \alpha_t^1 x_t^1 & \cdots & \alpha_T^1 x_T^1 \\ \alpha_1^2 x_1^2 & \cdots & \alpha_t^2 x_t^2 & \cdots & \alpha_T^2 x_T^2 \\ \vdots & & \vdots & & \vdots \\ \alpha_1^N x_1^N & \cdots & \alpha_t^N x_t^N & \cdots & \alpha_T^N x_T^N \end{bmatrix}$$
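The following PyTorch sketch reflects our reading of the scoring, Taylor softmax, and rescaling steps above; it is a simplified illustration rather than the authors' implementation (the parameterization of the scoring MLP and the summary of $[\mathbf{H}_{t'-1}; \mathbf{S}_{t'-1}]$ are our own assumptions). Each entry $(t, k)$ of the window receives a scalar score, the scores are normalized over the whole $T \times N$ window, and the window is rescaled entry-wise.

```python
import torch
import torch.nn as nn

def taylor_softmax(e: torch.Tensor) -> torch.Tensor:
    """Second-order Taylor approximation of exp, normalized over all elements of e."""
    num = 1.0 + e + 0.5 * e ** 2                          # always >= 0.5, never vanishes
    return num / num.sum()

class TemporalWindowAttention(nn.Module):
    def __init__(self, T: int, N: int, n: int):
        super().__init__()
        # illustrative scoring MLP: one scalar score per (t, k) entry of the window
        self.score = nn.Sequential(nn.Linear(2 * n + N + 1, T), nn.Tanh(), nn.Linear(T, 1))

    def forward(self, X, H_prev, S_prev):
        T, N = X.shape
        state = torch.cat([H_prev, S_prev], dim=-1).mean(dim=0)  # summary of [H_{t'-1}; S_{t'-1}]
        rows = []
        for t in range(T):                                       # score every (t, k) entry
            row = [self.score(torch.cat([state, X[t], X[t, k:k + 1]])) for k in range(N)]
            rows.append(torch.cat(row))
        alpha = taylor_softmax(torch.stack(rows))                # weights over the whole window
        return alpha * X, alpha                                  # rescaled window and weights

T, N, n = 10, 6, 32
attn = TemporalWindowAttention(T, N, n)
X_tilde, alpha = attn(torch.randn(T, N), torch.randn(T, n), torch.randn(T, n))
print(X_tilde.shape, round(float(alpha.sum()), 4))               # torch.Size([10, 6]) 1.0
```

Because the normalization runs over all $T \times N$ entries at once, a weight at one time step competes directly with weights at every other time step, which is what gives the mechanism its global, two-dimensional view of the window.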

3.2.2. Encoder–Decoder

In both the encoder and decoder, LSTM, which can memorize historical information, is typically employed to encode/decode the input sequences into high-level feature representations. In practice, however, LSTM usually fails to memorize very long-term dependencies because it only considers the features within a temporal window. We propose a window-dependent long short-term memory network (WDLSTM) to enhance the learning ability of long-term temporal dependencies. The idea of WDLSTM is to encapsulate the feature matrix of a temporal window as a hidden state matrix. Like a standard LSTM network, WDLSTM has memory cells $\mathbf{S}_{t'}$ and gate control units, namely the forget gate $\mathbf{F}_{t'}$, input gate $\mathbf{I}_{t'}$, and output gate $\mathbf{O}_{t'}$. The gate matrices and hidden state matrices of WDLSTM are denoted in boldface uppercase to distinguish them from the gate vectors and hidden state vector of LSTM. In the encoder, given the newly computed $\widetilde{\mathbf{X}}_{t'}$ at temporal window $t'$ and the previous window hidden state matrix $\mathbf{H}_{t'-1}$, the WDLSTM unit update is summarized as follows:
$$\mathbf{F}_{t'} = \sigma\big([\mathbf{H}_{t'-1}; \widetilde{\mathbf{X}}_{t'}]\,\mathbf{W}_F + \mathbf{b}_F\big)$$
$$\mathbf{I}_{t'} = \sigma\big([\mathbf{H}_{t'-1}; \widetilde{\mathbf{X}}_{t'}]\,\mathbf{W}_I + \mathbf{b}_I\big)$$
$$\mathbf{O}_{t'} = \sigma\big([\mathbf{H}_{t'-1}; \widetilde{\mathbf{X}}_{t'}]\,\mathbf{W}_O + \mathbf{b}_O\big)$$
$$\widetilde{\mathbf{S}}_{t'} = \tanh\big([\mathbf{H}_{t'-1}; \widetilde{\mathbf{X}}_{t'}]\,\mathbf{W}_S + \mathbf{b}_S\big)$$
$$\mathbf{S}_{t'} = \mathbf{F}_{t'} \circ \mathbf{S}_{t'-1} + \mathbf{I}_{t'} \circ \widetilde{\mathbf{S}}_{t'}$$
$$\mathbf{H}_{t'} = \mathbf{O}_{t'} \circ \tanh\big(\mathbf{S}_{t'}\big)$$
where $[\mathbf{H}_{t'-1}; \widetilde{\mathbf{X}}_{t'}] \in \mathbb{R}^{T \times (n+N)}$ is the concatenation of the hidden state matrix $\mathbf{H}_{t'-1} \in \mathbb{R}^{T \times n}$ and the current input matrix $\widetilde{\mathbf{X}}_{t'} \in \mathbb{R}^{T \times N}$; $\mathbf{W}_F, \mathbf{W}_I, \mathbf{W}_O, \mathbf{W}_S \in \mathbb{R}^{(n+N) \times n}$ and $\mathbf{b}_F, \mathbf{b}_I, \mathbf{b}_O, \mathbf{b}_S \in \mathbb{R}^{T \times n}$ are parameters to learn. The symbols $\sigma$ and $\circ$ denote the logistic sigmoid function and elementwise multiplication, respectively. Figure 3 presents the structure of WDLSTM and the difference between WDLSTM and LSTM. The input of WDLSTM is a matrix $\widetilde{\mathbf{X}}_{t'} \in \mathbb{R}^{T \times N}$ at temporal window $t'$, and the hidden state matrix $\mathbf{H}_{t'} \in \mathbb{R}^{T \times n}$ is its output. The input of LSTM is a vector $\tilde{\boldsymbol{x}}_t \in \mathbb{R}^N$ at time step $t$, and the hidden state vector $\boldsymbol{h}_t \in \mathbb{R}^n$ is its output; $\tilde{\boldsymbol{x}}_t$ is an element vector of $\widetilde{\mathbf{X}}_{t'}$. Different from the gate units of LSTM, those of WDLSTM control the input and output of information within the temporal window, which changes the value range of the memory, so that the information flows across temporal windows can preserve very long-term dependencies.
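A minimal PyTorch sketch of the WDLSTM update above, under our reading of the dimensions (it is not the authors' implementation): the hidden and cell states are $T \times n$ matrices carried across temporal windows, and every gate is computed from the row-wise concatenation $[\mathbf{H}_{t'-1}; \widetilde{\mathbf{X}}_{t'}]$ of shape $T \times (n + N)$.

```python
import torch
import torch.nn as nn

class WDLSTMCell(nn.Module):
    def __init__(self, N: int, n: int):
        super().__init__()
        # one linear map per gate, applied row-wise to the (T, n + N) concatenation
        self.W_F = nn.Linear(N + n, n)   # forget gate
        self.W_I = nn.Linear(N + n, n)   # input gate
        self.W_O = nn.Linear(N + n, n)   # output gate
        self.W_S = nn.Linear(N + n, n)   # candidate cell state

    def forward(self, X_tilde, H_prev, S_prev):
        Z = torch.cat([H_prev, X_tilde], dim=-1)   # [H_{t'-1}; X_tilde_{t'}], shape (T, n + N)
        F = torch.sigmoid(self.W_F(Z))             # forget gate matrix
        I = torch.sigmoid(self.W_I(Z))             # input gate matrix
        O = torch.sigmoid(self.W_O(Z))             # output gate matrix
        S_cand = torch.tanh(self.W_S(Z))           # candidate cell state matrix
        S = F * S_prev + I * S_cand                # new cell state matrix, shape (T, n)
        H = O * torch.tanh(S)                      # new hidden state matrix, shape (T, n)
        return H, S

T, N, n = 10, 6, 32
cell = WDLSTMCell(N, n)
H = S = torch.zeros(T, n)                          # states carried over from the previous window
H, S = cell(torch.randn(T, N), H, S)
print(H.shape, S.shape)                            # torch.Size([10, 32]) torch.Size([10, 32])
```

The key design choice is that `H` and `S` are passed from one window to the next instead of being reset, which is how the information flow preserves dependencies beyond a single window.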
In the decoder, we combine the encoder hidden state matrix with the target series $\boldsymbol{y}_{t'}$ as the input and then employ WDLSTM to decode the concatenation. That is to say, the hidden state matrix of the encoder updates the decoder hidden state matrix, which can be defined as follows:

$$\mathbf{D}_{t'} = f_{\mathrm{WDLSTM}}\big(\mathbf{D}_{t'-1}, [\mathbf{H}_{t'}; \boldsymbol{y}_{t'}]\big)$$

where $\mathbf{D}_{t'} \in \mathbb{R}^{T \times m}$ is the decoder hidden state matrix, and $f_{\mathrm{WDLSTM}}$ is a WDLSTM unit. Then $\mathbf{D}_{t'}$ is updated as:
$$\ddot{\mathbf{F}}_{t'} = \sigma\big([\mathbf{D}_{t'-1}; [\mathbf{H}_{t'}; \boldsymbol{y}_{t'}]]\,\ddot{\mathbf{W}}_F + \ddot{\mathbf{b}}_F\big)$$
$$\ddot{\mathbf{I}}_{t'} = \sigma\big([\mathbf{D}_{t'-1}; [\mathbf{H}_{t'}; \boldsymbol{y}_{t'}]]\,\ddot{\mathbf{W}}_I + \ddot{\mathbf{b}}_I\big)$$
$$\ddot{\mathbf{O}}_{t'} = \sigma\big([\mathbf{D}_{t'-1}; [\mathbf{H}_{t'}; \boldsymbol{y}_{t'}]]\,\ddot{\mathbf{W}}_O + \ddot{\mathbf{b}}_O\big)$$
$$\widetilde{\ddot{\mathbf{S}}}_{t'} = \tanh\big([\mathbf{D}_{t'-1}; [\mathbf{H}_{t'}; \boldsymbol{y}_{t'}]]\,\ddot{\mathbf{W}}_S + \ddot{\mathbf{b}}_S\big)$$
$$\ddot{\mathbf{S}}_{t'} = \ddot{\mathbf{F}}_{t'} \circ \ddot{\mathbf{S}}_{t'-1} + \ddot{\mathbf{I}}_{t'} \circ \widetilde{\ddot{\mathbf{S}}}_{t'}$$
$$\mathbf{D}_{t'} = \ddot{\mathbf{O}}_{t'} \circ \tanh\big(\ddot{\mathbf{S}}_{t'}\big)$$
where $[\mathbf{D}_{t'-1}; [\mathbf{H}_{t'}; \boldsymbol{y}_{t'}]] \in \mathbb{R}^{T \times (m+n+1)}$ is the concatenation of the previous hidden state matrix $\mathbf{D}_{t'-1}$ and the input $[\mathbf{H}_{t'}; \boldsymbol{y}_{t'}]$. Finally, we employ a linear function to generate the future value:

$$\hat{y}_{t', T+1} = F(\boldsymbol{y}_{t'}, \mathbf{X}_{t'}) = \boldsymbol{v}_y^{\top}\big(\mathbf{D}_{t'}\,\mathbf{W}_y + \boldsymbol{b}_w\big) + b_y$$

where $\boldsymbol{v}_y \in \mathbb{R}^T$, $\mathbf{W}_y \in \mathbb{R}^{m}$, $\boldsymbol{b}_w \in \mathbb{R}^T$, and $b_y \in \mathbb{R}$ are parameters to learn.
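A minimal sketch (our simplification, not the authors' code) of this readout: the decoder state matrix $\mathbf{D}_{t'} \in \mathbb{R}^{T \times m}$ is first mapped row-wise to a length-$T$ vector and then collapsed over the time dimension into a single scalar prediction; the per-row bias $\boldsymbol{b}_w$ is approximated here by the first layer's scalar bias.

```python
import torch
import torch.nn as nn

class WindowReadout(nn.Module):
    def __init__(self, T: int, m: int):
        super().__init__()
        self.W_y = nn.Linear(m, 1)   # D_{t'} W_y + bias, applied row-wise -> shape (T, 1)
        self.v_y = nn.Linear(T, 1)   # v_y^T (.) + b_y, collapses the time dimension

    def forward(self, D: torch.Tensor) -> torch.Tensor:
        return self.v_y(self.W_y(D).squeeze(-1))   # scalar prediction for the next time step

readout = WindowReadout(T=10, m=64)
y_hat = readout(torch.randn(10, 64))
print(y_hat.shape)                                 # torch.Size([1])
```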

4. Experimental Studies

4.1. Datasets

We utilize four real-world time series datasets to evaluate our model. All datasets contain missing values caused by sensor power outages or communication errors, so we employ linear interpolation to fill them in. We partition each dataset into training and test sets at a ratio of 8:2. The datasets are listed below, followed by a short preprocessing sketch.
  • Photovoltaic (PVP) power dataset: The dataset is derived from the National Energy Day-ahead PV power competition data. The data are collected every 15 min. We take the PV power as the target series and choose six relevant features as exogenous series. We select the first 24,000 time steps as the training set and the remaining 6000 time steps as the test set.
  • SML 2010 dataset: This dataset collected 17 features from a house monitoring system. We utilize 16 exogenous series to predict the room temperature. We select the first 2211 time steps as the training set and the remaining 553 time steps as the test set.
  • Beijing PM2.5 dataset: This dataset contains hourly PM2.5 concentrations and meteorological readings from air-quality monitoring sites in Beijing, China. The time period is from 1 January 2013 to 31 December 2014, in which the first 14,016 time steps are used to train the models and the remaining 3504 time steps are employed to test them.
  • NASDAQ 100 stock dataset: The dataset contains the per-minute stock prices of 81 major companies. The NASDAQ 100 index is regarded as the target series. We select the first 32,448 time steps as the training set and the remaining 8112 time steps as the test set.
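A minimal pandas sketch (column names and sizes are illustrative, not the authors' pipeline) of the preprocessing described above: linear interpolation of missing sensor values followed by a chronological 8:2 train/test split.

```python
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame, train_ratio: float = 0.8):
    df = df.interpolate(method="linear", limit_direction="both")  # fill gaps from outages/errors
    split = int(len(df) * train_ratio)                            # no shuffling: keep time order
    return df.iloc[:split], df.iloc[split:]

# toy frame with injected missing values
rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.normal(size=(1000, 7)),
                     columns=[f"exog_{i}" for i in range(6)] + ["target"])
frame.iloc[::50, 2] = np.nan
train, test = preprocess(frame)
print(len(train), len(test), int(train.isna().sum().sum()))       # 800 200 0
```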

4.2. Methods for Comparison

We select seven state-of-the-art models as comparison models. The models are introduced as follows:
ARIMA [30]: ARIMA is a typical statistical model for univariate time series prediction. The ARIMA model converts nonstationary time series to stationary data utilizing difference processing.
LSTM [31]: LSTM is a widely applied RNN variant designed to mine the long-term temporal dependence hidden in time series.
DA-RNN [13]: The attention-based encoder–decoder network for time series prediction employs an input attention mechanism to gain spatial correlations and temporal attention to capture temporal dependencies.
DSTP-RNN [21]: The model employs a two-phase attention mechanism to strengthen the spatial correlations and a temporal attention mechanism to capture temporal dependencies for long-term and multivariate time series prediction.
MTNet [32]: This model trains a nonlinear neural network and an autoregressive component in parallel to improve robustness. The nonlinear neural network uses a memory component to memorize long-term historical information.
DA-Conv-LSTM [33]: This exploits the convolutional layer and two-stage attention model to extract the complex spatial-temporal correlation among nearby values.
CGA-LSTM [12]: The model employs the correlational attention mechanism to gain the spatial correlation and the graph attention network to learn temporal dependencies.

4.3. Parameter and Evaluation Metrics

We execute a grid search strategy and choose the best values for three types of key hyperparameters in our model. For a fair comparison, all models share the hyperparameters listed in Table 1. For the temporal window size T, we set T ∈ {6, 12, 24, 48} for the PV power dataset, which is periodic, and T ∈ {5, 10, 15, 25} for the other datasets. For the size of the hidden states of the encoder and decoder, we set m = n ∈ {16, 32, 64, 128}. The models are trained for 30 epochs with a batch size of 128. The initial learning rate is set to 0.001 and decays by 10% every 10 epochs.
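A minimal PyTorch sketch of the schedule above (the choice of Adam is our assumption; the paper states only the learning rate, decay, epochs, and batch size): the initial learning rate of 0.001 is decayed by 10% every 10 epochs over 30 epochs, and the grid of candidate T and hidden-size values is enumerated for each dataset.

```python
import itertools
import torch

params = [torch.nn.Parameter(torch.zeros(1))]          # stand-in for the model parameters
opt = torch.optim.Adam(params, lr=1e-3)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=10, gamma=0.9)   # -10% every 10 epochs

for epoch in range(30):
    # ... one pass over mini-batches of size 128 would go here ...
    opt.step()
    sched.step()
    if epoch % 10 == 9:
        print(f"epoch {epoch + 1}: lr = {sched.get_last_lr()[0]:.6f}")  # 0.000900, 0.000810, 0.000729

# grid of candidate settings searched per dataset (T values shown for the non-periodic datasets)
grid = list(itertools.product([5, 10, 15, 25], [16, 32, 64, 128]))
print(len(grid), "configurations")                     # 16 configurations
```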
To assess the performance of our model and the comparison models, we adopt three evaluation metrics: mean absolute error (MAE), root mean square error (RMSE), and R-squared (R2). MAE and RMSE measure the error between the predicted and observed values. R2 is chosen as the indicator of the model's goodness of fit; the closer R2 is to 1, the higher the prediction accuracy of the model. MAE, RMSE, and R2 are defined as follows:
$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right|$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}$$

$$R^2 = 1 - \frac{\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{N}\left(y_i - \bar{y}\right)^2}$$

where $\hat{y}_i$ and $y_i$ are the predicted and observed values of the $i$-th sample, $\bar{y}$ is the average of the observed values, and $N$ represents the number of samples. The scikit-learn package (https://scikit-learn.org/stable/index.html, accessed on 17 August 2022) provides the utilities for the three evaluation metrics. Moreover, we implemented our model in the PyTorch framework (https://pytorch.org, accessed on 17 August 2022).
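A minimal sketch (toy numbers) of the three metrics using the scikit-learn utilities mentioned above; RMSE is computed as the square root of the mean squared error.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([0.9, 1.2, 1.5, 1.1, 0.8])
y_pred = np.array([1.0, 1.1, 1.4, 1.2, 0.9])

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # sqrt of MSE
r2 = r2_score(y_true, y_pred)
print(f"MAE={mae:.4f}  RMSE={rmse:.4f}  R2={r2:.4f}")
```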

4.4. Experimental Results and Analyses

In this section, we report the experimental results on the four real-world datasets, evaluated with all three metrics, as shown in Table 2 and Table 3. To clearly observe the differences in the experimental results, we also present the MAE and R2 values as bar charts in Figure 4, Figure 5, Figure 6 and Figure 7. We observe that TWA-WDLSTM achieves better performance on all datasets.
As seen in Table 2 and Table 3, the MAE and RMSE of ARIMA are higher, and its R2 is lower, than those of the other comparison models and TWA-WDLSTM. TWA-WDLSTM shows 44.9%, 88.2%, 20.4% and 80.9% improvements over ARIMA in MAE on the PV power, SML2010, Beijing PM2.5 and NASDAQ100 stock datasets, respectively. This suggests that ignoring exogenous factors can degrade model performance. Although LSTM achieves better performance than ARIMA, TWA-WDLSTM outperforms LSTM by 38.0%, 85.3%, 10.4% and 79.0% in MAE on the four datasets, respectively. This is because the LSTM network focuses on extracting long-term dependencies of all time series rather than selecting relevant features.
DA-RNN, DSTP, MTNet, and DA-Conv-LSTM are state-of-the-art models for multivariate time series that pay more attention to obtaining the relevant variables at each time step and memorizing long-term dependencies among time series. Hence, their performance is better than that of ARIMA and LSTM. Nevertheless, these models exhibit different performance on the four datasets. In detail, the DSTP model outperforms the other state-of-the-art models on most tasks because its two-stage attention mechanism learns more stable spatial correlations. The performances of DA-RNN, MTNet, and DA-Conv-LSTM are comparable. CGA-LSTM outperforms the other comparison models because it nests a correlational attention into the graph attention mechanism to select the relevant variables.
For visual comparison, we display the MAE of the comparison models and TWA-WDLSTM on the four datasets in Figure 4 and Figure 5. Compared with the state-of-the-art models, TWA-WDLSTM has the best performance on all datasets. For instance, the MAE value gained by TWA-WDLSTM (0.0358) is 61.6%, 53.1%, 79.9%, 55.4%, and 50.7% lower than that of DA-RNN (0.0932), DSTP (0.0764), MTNet (0.1779), DA-Conv-LSTM (0.0803), and CGA-LSTM (0.0726) on the SML2010 dataset. This suggests that selecting the relevant variables in a temporal window helps to achieve accurate predictions. Moreover, WDLSTM employs the historical information of a temporal window to update the hidden state, which can capture very long-term dependencies.
Figure 6 and Figure 7 visually present the fitting effects of the different models on the different datasets. We observe that TWA-WDLSTM has different fitting effects on the four datasets. The data in the SML2010 and NASDAQ 100 stock datasets are more stable and controllable; hence, the R2 of TWA-WDLSTM exceeds 0.999 on them. Although the PV power dataset is periodic, no power is generated at night, which makes it more difficult to predict; nevertheless, TWA-WDLSTM still works better. TWA-WDLSTM performs relatively worse on the Beijing PM2.5 dataset but still outperforms the comparison models, because the randomness of the Beijing PM2.5 data is stronger than that of the other data.

4.5. Interpretability of Temporal Window Attention Mechanism

The temporal window attention mechanism is employed to select the relevant variables in a temporal window for making the prediction. Hence, we verify its performance for different temporal window sizes using the grid search strategy. We plot the MAE and RMSE versus different temporal window sizes in Figure 8 and Figure 9. We observe that the minimum MAE and RMSE values are obtained when T = 48 on the PV power dataset (m = n = 32), T = 15 on the SML2010 dataset (m = n = 32), T = 5 on the Beijing PM2.5 dataset (m = n = 64), and T = 10 on the NASDAQ100 stock dataset (m = n = 128). To further investigate the temporal window attention mechanism, we visualize its weight distribution on the SML2010 dataset in Figure 10. The weights semantically represent the contribution of each variable in a temporal window to the future values: the more the corresponding variable contributes, the darker the color. For instance, it can be clearly observed that variable No. 15 at time step 5 (red box) exhibits a maximum contribution value of 0.0122, and variable No. 8 at time step 6 (green box) has a minimum contribution of about 0.0022. Moreover, the variables have different attention weights over different time steps. Specifically, the weights of variable No. 6 vary in the range of (0.0085, 0.0112), and variable No. 14 dynamically changes in the range of (0.0059, 0.0098). The above results illustrate that the temporal window attention mechanism successfully captures the relevant variables in a temporal window.
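A minimal matplotlib sketch (random weights for illustration only, not the values shown in Figure 10) of how a $T \times N$ attention weight matrix can be rendered as the kind of heatmap used in this analysis, with the maximum-weight cell outlined.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
T, N = 15, 16                                    # SML2010 setting: window of 15, 16 exogenous series
alpha = rng.random((T, N))
alpha /= alpha.sum()                             # weights over the whole window sum to 1

fig, ax = plt.subplots(figsize=(8, 5))
im = ax.imshow(alpha.T, aspect="auto", cmap="Blues")
ax.set_xlabel("time step within the window")
ax.set_ylabel("exogenous variable index")
fig.colorbar(im, label="attention weight")
t_max, k_max = np.unravel_index(alpha.argmax(), alpha.shape)
ax.add_patch(plt.Rectangle((t_max - 0.5, k_max - 0.5), 1, 1, fill=False, edgecolor="red"))
plt.savefig("attention_weights.png", dpi=150)
```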

4.6. Evaluation on WDLSTM

To further evaluate the ability of WDLSTM to capture long-term dependencies, we compare its performance with LSTM on the four real-world datasets described in Section 4.1, as presented in Table 4. Though the two networks share a similar intuition, WDLSTM outperforms LSTM because it benefits from the information flows across temporal windows. Specifically, WDLSTM shows 17.2%, 28.9%, 8.5% and 14.9% improvements over LSTM in MAE on the four datasets. This is a significant outcome.
Moreover, TWA-WDLSTM shows 25.1%, 79.3%, 0.9%, and 75.2% improvements over WDLSTM in MAE on all datasets. The most striking result to emerge from the data is that the temporal window attention mechanism can adaptively extract the relevant variables to achieve accurate prediction.

5. Conclusions

In this paper, we propose the temporal window attention-based window-dependent long short-term memory network (TWA-WDLSTM), which consists of an encoder with a temporal window attention mechanism and a decoder, to make multivariate time series predictions. Extensive experiments on four real-world datasets strongly support our idea and show that TWA-WDLSTM outperforms the seven state-of-the-art models. Interpreting the temporal window attention mechanism further helps to comprehend two-dimensional spatio-temporal patterns.
We summarize the significant advantages of TWA-WDLSTM as follows:
(1)
In many actual cases, capturing the spatio-temporal correlations in multivariate time series is a challenge. However, most studies focus on capturing one-dimensional spatio-temporal correlations from a local perspective and may therefore ignore some important information. The newly introduced temporal window attention mechanism can pick the important variables within a temporal window to capture two-dimensional spatio-temporal correlations from a global perspective.
(2)
RNNs cannot memorize very long term information because they only summarize the information within a temporal window. To this end, we design WDLSTM as an encoder and decoder to enhance the learning ability of the long-term temporal dependencies.
Future work will further study whether the proposed model can be extended to solve the problem of long-term multivariate time series prediction by capturing more complex spatio-temporal patterns.

Author Contributions

Conceptualization, S.H. and H.D.; methodology, S.H.; software, S.H.; validation, H.D.; formal analysis, H.D.; investigation, S.H.; resources, H.D.; writing—original draft preparation, S.H.; writing—review and editing, S.H.; visualization, S.H.; supervision, H.D.; project administration, H.D.; funding acquisition, H.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under Grant 61472095 and the Natural Science Foundation of Heilongjiang Province under Grant LH2020F023.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used and analyzed during the current study are available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Fu, E.; Zhang, Y.; Yang, F.; Wang, S. Temporal self-attention-based Conv-LSTM network for multivariate time series prediction. Neurocomputing 2022, 501, 162–173.
  2. Kamarthi, H.; Rodríguez, A.; Prakash, B.A. Back2Future: Leveraging Backfill Dynamics for Improving Real-time Predictions in Future. In Proceedings of the Tenth International Conference on Learning Representations, Virtual, 25–29 April 2022; pp. 1–14.
  3. Huang, Y.; Ying, J.J.C.; Tseng, V.S. Spatio-attention embedded recurrent neural network for air quality prediction. Knowl. Based Syst. 2021, 233, 107416.
  4. Chen, Y.; Ding, F.; Zhai, L. Multi-scale temporal features extraction based graph convolutional network with attention for multivariate time series prediction. Expert Syst. Appl. 2022, 200, 117011.
  5. Shih, S.Y.; Sun, F.K.; Lee, H.Y. Temporal pattern attention for multivariate time series forecasting. Mach. Learn. 2019, 108, 1421–1441.
  6. Mahmoudi, M.R.; Baroumand, S. Modeling the stochastic mechanism of sensor using a hybrid method based on seasonal autoregressive integrated moving average time series and generalized estimating equations. ISA Trans. 2022, 125, 300–305.
  7. Dimitrios, D.; Anastasios, P.; Thanassis, K.; Dimitris, K. Do confidence indicators lead Greek economic activity? Bull. Appl. Econ. 2021, 8, 1–15.
  8. Guefano, S.; Tamba, J.G.; Azong, T.E.W.; Monkam, L. Forecast of electricity consumption in the Cameroonian residential sector by Grey and vector autoregressive models. Energy 2021, 214, 118791.
  9. Li, J. A Hidden Markov Model-based fuzzy modeling of multivariate time series. Soft Comput. 2022, 6, 1–18.
  10. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780.
  11. Ozdemir, A.C.; Buluş, K.; Zor, K. Medium- to long-term nickel price forecasting using LSTM and GRU networks. Resour. Policy 2022, 78, 102906.
  12. Han, S.; Dong, H.; Teng, X.; Li, X.; Wang, X. Correlational graph attention-based Long Short-Term Memory network for multivariate time series prediction. Appl. Soft Comput. 2021, 106, 107377.
  13. Qin, Y.; Song, D.; Cheng, H.; Cheng, W.; Jiang, G.; Cottrell, G.W. A dual-stage attention-based recurrent neural network for time series prediction. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2017; pp. 2627–2633.
  14. Hu, J.; Zheng, W. Multistage attention network for multivariate time series prediction. Neurocomputing 2020, 383, 122–137.
  15. Feng, X.; Chen, J.; Zhang, Z.; Miao, S.; Zhu, Q. State-of-charge estimation of lithium-ion battery based on clockwork recurrent neural network. Energy 2021, 236, 121360.
  16. Zhang, Y.; Peng, N.; Dai, M.; Zhang, J.; Wang, H. Memory-Gated Recurrent Networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 10956–10963.
  17. Ma, Q.; Lin, Z.; Chen, E.; Cottrell, G.W. Temporal pyramid recurrent neural network. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 5061–5068.
  18. Harutyunyan, H.; Khachatrian, H.; Kale, D.C.; Steeg, G.V.; Galstyan, A. Multitask learning and benchmarking with clinical time series data. Sci. Data 2019, 6, 96.
  19. Zhang, C.; Fiore, M.; Murray, I.; Patras, P. CloudLSTM: A Recurrent Neural Model for Spatiotemporal Point-cloud Stream Forecasting. In Proceedings of the 35th AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 12B, pp. 10851–10858.
  20. Wang, L.; Sha, L.; Lakin, J.R.; Bynum, J.; Bates, D.W.; Hong, P.; Zhou, L. Development and Validation of a Deep Learning Algorithm for Mortality Prediction in Selecting Patients with Dementia for Earlier Palliative Care Interventions. JAMA Netw. Open 2019, 2, e196972.
  21. Liu, Y.; Gong, C.; Yang, L.; Chen, Y. DSTP-RNN: A dual-stage two-phase attention-based recurrent neural networks for long-term and multivariate time series prediction. Expert Syst. Appl. 2020, 143, 113082.
  22. Liang, Y.; Ke, S.; Zhang, J.; Yi, X.; Zheng, Y. GeoMAN: Multi-level attention networks for geo-sensory time series prediction. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 13–19 July 2018; pp. 3428–3434.
  23. Deng, A.; Hooi, B. Graph Neural Network-Based Anomaly Detection in Multivariate Time Series. In Proceedings of the 35th AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 5A, pp. 4027–4035.
  24. Preeti; Bala, R.; Singh, R.P. A dual-stage advanced deep learning algorithm for long-term and long-sequence prediction for multivariate financial time series. Appl. Soft Comput. 2022, 126, 109317.
  25. Wang, K.; Li, K.; Zhou, L.; Hu, Y.; Cheng, Z.; Liu, J.; Chen, C. Multiple convolutional neural networks for multivariate time series prediction. Neurocomputing 2019, 360, 107–119.
  26. Wu, Z.; Pan, S.; Long, G.; Jiang, J.; Chang, X.; Zhang, C. Connecting the Dots: Multivariate Time Series Forecasting with Graph Neural Networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual, 6–10 July 2020; pp. 753–763.
  27. Lai, G.; Chang, W.C.; Yang, Y.; Liu, H. Modeling long- and short-term temporal patterns with deep neural networks. In Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR 2018, Ann Arbor, MI, USA, 8–12 July 2018; pp. 95–104.
  28. Cao, S.; Wu, L.; Wu, J.; Wu, D.; Li, Q. A spatio-temporal sequence-to-sequence network for traffic flow prediction. Inf. Sci. 2022, 610, 185–203.
  29. De Brébisson, A.; Vincent, P. An exploration of softmax alternatives belonging to the spherical loss family. In Proceedings of the 4th International Conference on Learning Representations; arXiv 2015, arXiv:1511.05042.
  30. Siami-Namini, S.; Tavakoli, N.; Namin, A.S. A Comparison of ARIMA and LSTM in Forecasting Time Series. In Proceedings of the 17th IEEE International Conference on Machine Learning and Applications (ICMLA), Orlando, FL, USA, 17–20 December 2018; pp. 1394–1401.
  31. Gers, F. Long Short-Term Memory in Recurrent Neural Networks. Volume 2366.
  32. Chang, Y.-Y.; Sun, F.-Y.; Wu, Y.-H.; Lin, S.-D. A Memory-Network Based Solution for Multivariate Time-Series Forecasting. arXiv 2018, arXiv:1809.02105.
  33. Xiao, Y.; Yin, H.; Zhang, Y.; Qi, H.; Zhang, Y.; Liu, Z. A dual-stage attention-based Conv-LSTM network for spatio-temporal correlation and multivariate time series prediction. Int. J. Intell. Syst. 2021, 36, 2036–2057.
Figure 1. One-dimensional spatio-temporal correlation. A black circle represents a variable.
Figure 2. Overall framework of the proposed TWA-WDLSTM model. The temporal window attention mechanism is calculated in the dashed box. The green box denotes the temporal window attention mechanism, which computes the weight coefficient $\alpha_t^k$ based on the variable $x_t^k$, the feature matrix $\mathbf{X}_{t'}$, and the previous hidden state matrix $\mathbf{H}_{t'-1}$. The newly computed $\widetilde{\mathbf{X}}_{t'}$ is fed into the encoder WDLSTM unit. The encoded hidden state matrix $\mathbf{H}_{t'}$ is used as an input to the decoder WDLSTM unit, which generates the final prediction result $\hat{y}_{t', T+1}$.
Figure 3. The structure of WDLSTM and LSTM. The black and grey circles represent elements of the current input $\widetilde{\mathbf{X}}_{t'}$ and the previous input $\widetilde{\mathbf{X}}_{t'-1}$, respectively. The memory cells $\mathbf{S}_{t'}$, forget gate $\mathbf{F}_{t'}$, input gate $\mathbf{I}_{t'}$, and output gate $\mathbf{O}_{t'}$ are the components of WDLSTM. The matrix $\widetilde{\mathbf{X}}_{t'}$ is fed into WDLSTM to generate the hidden state matrix $\mathbf{H}_{t'}$. The memory cells $s_t$, forget gate $f_t$, input gate $i_t$, and output gate $o_t$ are the components of LSTM. $\tilde{\boldsymbol{x}}_t$, which is an element vector of $\widetilde{\mathbf{X}}_{t'}$, is the input of LSTM.
Figure 4. MAE values comparison of different models on PV power dataset and SML2010 dataset.
Figure 5. MAE values comparison of different models on Beijing PM2.5 dataset and NASDAQ100 stock dataset.
Figure 6. Fitting effect of different models on PV power dataset and SML2010 dataset.
Figure 7. Fitting effect of different models on Beijing PM2.5 dataset and NASDAQ100 stock dataset.
Figure 8. Influence of different temporal window sizes on model performance. When T = 48 on PV power dataset and T = 10 on SML2010 dataset, the model achieved the best performance.
Figure 9. Influence of different temporal window sizes on TWA-WDLSTM performance. When T = 5 on Beijing PM2.5 dataset and T = 10 on NASDAQ100 stock dataset, the model achieved the best performance.
Figure 10. The attention weight matrix on SML2010 dataset. The red and green boxes highlight the maximum and minimum values, respectively.
Table 1. Hyperparameter settings.

Hyperparameter          Value
Training epochs         30
Batch size              128
Hidden states           {16, 32, 64, 128}
Temporal window size    {6, 12, 24, 48} (PV power); {5, 10, 15, 25} (other datasets)
Initial learning rate   0.001
Table 2. Experimental results on PV power dataset and SML2010 dataset.

Dataset           PV Power                       SML2010
Model             MAE      RMSE     R2           MAE      RMSE     R2
ARIMA             0.4630   0.7951   0.9308       0.3033   0.3816   0.9754
LSTM              0.4119   0.7744   0.9365       0.2437   0.3080   0.9839
DA-RNN            0.3508   0.6356   0.9442       0.0932   0.1202   0.9962
DSTP              0.3280   0.6417   0.9523       0.0764   0.1106   0.9974
MTNet             0.3013   0.6251   0.9575       0.1779   0.2313   0.9903
DA-Conv-LSTM      0.3020   0.6218   0.9576       0.0803   0.1084   0.9980
CGA-LSTM          0.2796   0.6133   0.9582       0.0726   0.0959   0.9982
TWA-WDLSTM        0.2552   0.6084   0.9594       0.0358   0.0546   0.9994
Table 3. Experimental results on Beijing PM2.5 dataset and NASDAQ100 stock dataset.

Dataset           Beijing PM2.5                  NASDAQ100 Stock
Model             MAE       RMSE      R2         MAE      RMSE     R2
ARIMA             22.2014   34.2323   0.8771     7.4350   9.3034   0.9811
LSTM              19.7313   31.9733   0.8928     6.7567   8.7375   0.9833
DA-RNN            18.5451   31.2088   0.8982     3.8339   4.9586   0.9923
DSTP              19.1360   31.4934   0.8900     1.5583   2.2816   0.9988
MTNet             18.6874   30.1577   0.8974     4.7974   5.6801   0.9904
DA-Conv-LSTM      18.4088   31.0415   0.8990     2.2974   3.1040   0.9978
CGA-LSTM          18.2674   31.2421   0.8993     1.4952   2.2392   0.9990
TWA-WDLSTM        17.6751   30.2252   0.9042     1.4189   2.1681   0.9990
Table 4. Comparison result of LSTM and WDLSTM.

Dataset            Metric   LSTM      WDLSTM    TWA-WDLSTM
PV Power           MAE      0.4119    0.3409    0.2552
                   RMSE     0.7744    0.6859    0.6084
                   R2       0.9365    0.9485    0.9594
SML2010            MAE      0.2437    0.1732    0.0358
                   RMSE     0.3080    0.2214    0.0546
                   R2       0.9839    0.9917    0.9994
Beijing PM2.5      MAE      19.7313   18.038    17.6751
                   RMSE     31.9733   30.5179   30.2252
                   R2       0.8928    0.9024    0.9042
NASDAQ100 Stock    MAE      6.7567    5.7436    1.4189
                   RMSE     8.7375    7.7110    2.1681
                   R2       0.9833    0.9870    0.9990