Article

Multidimensional Feature-Based Graph Attention Networks and Dynamic Learning for Electricity Load Forecasting

1 Guangdong Power Grid Co., Ltd., Shantou Supply Bureau, Shantou 515063, China
2 Department of Computer Science, College of Engineering, Shantou University, Shantou 515063, China
* Author to whom correspondence should be addressed.
Energies 2023, 16(18), 6443; https://doi.org/10.3390/en16186443
Submission received: 4 August 2023 / Revised: 29 August 2023 / Accepted: 31 August 2023 / Published: 6 September 2023

Abstract: Electricity load forecasting is of great significance for the overall operation of the power system and the subsequent orderly use of electricity. However, traditional load forecasting does not consider the change in load quantity at each time point, even though the temporal difference information of the load data reflects its dynamic evolution, which is a very important factor for load forecasting. In addition, recent research has mainly focused on using graph neural networks to learn the complex relationships of load sequences in the time dimension, while the relationships between different variables of the load sequences are not explicitly captured. In this paper, we propose a model that combines a differential learning network and a multidimensional feature graph attention layer; it models the time dependence and dynamic evolution of load sequences by learning the amount of load variation at different time points, while representing the correlation of different variable features of the load sequences through the graph attention layer. Comparative experiments show that the prediction errors of the proposed model are 5–26% lower than those of other advanced methods on the UC Irvine Machine Learning Repository Electricity Load Diagrams public dataset.

1. Introduction

Load forecasting refers to mining the intrinsic connection between the load data and different influencing factors, such as changes in the power load itself, weather, date type and the economy, to find how past load change patterns affect future load values [1], so as to make more accurate predictions of future load. As shown in Figure 1, the load volume of future load data is modeled by analyzing the characteristics of the historical load data (including static characteristics as well as dynamic time-series characteristics). Power system load forecasting can be divided into long-term, medium-term, short-term and ultra-short-term according to the forecasting horizon. Short-term power load forecasting is the most critical type of forecasting in the power system. It involves forecasting the power load over the next few hours to several days, and its accuracy directly affects the safe, stable and economical operation of the power system. Accurate short-term load forecasting is very important for improving the operating efficiency of power operators. It can help each power plant reasonably arrange its daily schedule and weekly plan, optimize the power generation plan, and avoid surplus or shortage of generation. Through accurate prediction, power operators can better control the balance of supply and demand and adjust power production and consumption strategies, thereby improving the operating efficiency and economy of the power system.
In the past decades, research on short-term load forecasting has made significant progress. Early work was mainly based on traditional statistics. Traditional statistical methods include regression analysis, the state space method [2], the time-series method [3], etc.; the classic example is the autoregressive integrated moving average (ARIMA) model [4], a widely used statistical method for time-series prediction that assumes a stationary data series. Traditional statistical methods analyze historical load data in isolation and do not take into account the impact of factors such as environment, weather and date type on the load; they are, therefore, poorly adaptable and not very effective in forecasting.
With the development of deep learning technology, recurrent neural networks (RNN) and convolutional neural networks (CNN) have also been widely used in the field of load forecasting [5,6]. These networks comprehensively analyze the impact of the load data and load-related influencing factors on the forecasting results through a deep neural network, which can better improve the accuracy of the results. In particular, the application of convolutional neural networks has brought a major breakthrough in short-term load forecasting. Convolutional neural networks can effectively process time-series data and capture the spatio-temporal dependence in the load series; research based on them brings higher accuracy and reliability to short-term load forecasting and provides better support for the operation and management of power systems. In addition, there are research works based on machine learning algorithms, including support vector machines [7], random forests [8] and other methods, that further improve the accuracy and robustness of short-term load forecasting. However, the variables related to the load are numerous and complex, and it is difficult for deep neural networks to fully capture the dependencies between them. How to capture the correlation between these variables is a difficult problem in current research.
The temporal difference information in the load data can reflect the dynamic evolution of the load data, which is a very important factor in load forecasting [9]. In order to account for the variation in load at each time point, this paper uses a differential learning network to learn the variation in load between consecutive time points. A multidimensional feature graph attention layer is also constructed to model the correlation of different variable features of the load sequence.
The contributions of our paper are summarized as follows:
  • We performed data preprocessing on the original data to extract the trend terms and seasonal features of the load data. For the load data series with natural cyclical characteristics, subsequent load forecasting based on trend terms and seasonal features can improve the accuracy of the forecasts.
  • We construct the differential features of the load data based on the GRU module and combine the original features of the load data with the differential features for subsequent modeling, thereby accounting for the influence of the temporal difference information of the load data on load forecasting.
  • We consider the load series incorporating multidimensional features such as seasonal trends and holidays as a complete graph and model the correlation of different variable features of the load series by constructing a multidimensional feature graph attention layer where each node in the graph represents a feature of the load sequence, and the edge between adjacent nodes represents the relationship between two features.
The rest of the paper is organized as follows. Section 2 presents the related work, Section 3 describes the proposed model, Section 4 gives our methodology, Section 5 presents the experimental studies, and Section 6 summarizes our work and future prospects.

2. Related Works

The related studies in this paper include a time-series prediction model based on RNN, a time-series prediction model based on Transformer, and a time-series prediction model based on Graph.

2.1. Time-Series Prediction Model Based on RNN

The RNN network is a serial structure composed of consecutive time units. Each time unit takes an input x and outputs a prediction y. Salinas et al. [10] first introduced deep learning to the time-series prediction problem with DeepAR, which adopts the classical RNN model for time-series prediction. In the training stage, the true value and external features from the previous time step are fed into each time unit, and the value of the next time step is predicted after passing through the RNN unit. Rangapuram et al. [11] proposed an RNN-based deep state space model in which the forecasting model is divided into two parts: modeling the relationship between two consecutive hidden states, and modeling the relationship between the hidden state at the current moment and the forecast at the current moment. However, DeepAR and the deep state space model are both single-step prediction models, which can only predict the value at one future moment at a time. In many practical situations, it is not only necessary to predict the value at a certain moment; more importantly, the load over a future time period should be predicted. The model proposed in this paper predicts the values of multiple future time steps through the dilated causal convolution in temporal convolutional networks and the structure of fully connected layers.

2.2. Time-Series Prediction Model Based on Transformer

The Transformer model [12,13] first achieved significant results in the field of natural language processing. Since both natural language and time series are sequential data, the Transformer has also been gradually applied to time-series prediction tasks. Wu et al. [13] used a Transformer structure similar to GPT for time-series prediction and achieved good results. However, in time-series prediction tasks, the positional relationship of sample points is very important, and positional information can only be supplied through positional encoding, which limits the application of the Transformer in time-series prediction. Lim et al. [15] proposed a combined LSTM and Transformer approach to address the problem of information forgetting in sequence models, using the ultra-long-period information extraction capability of attention. Zhou et al. [16] proposed the Informer model, an upgraded version of the Transformer for long-horizon forecasting that improves operational efficiency in long-term forecasting. Whereas RNN-based methods cannot completely eliminate the vanishing and exploding gradient problems for long sequences, the Transformer architecture can model both long-term and short-term dependence [17]. Furthermore, the model proposed in this paper uses dilated causal convolution to model long-term sequences, thereby avoiding the above problem, and also avoids the inability of traditional convolutional neural networks to capture long-term dependencies.

2.3. Time-Series Prediction Model Based on Graph

RNN is the most fundamental time-series prediction model, and its natural sequence modeling capability can handle time-series data well. On the basis of RNN, Li et al. [18] proposed the DCRNN model, which enables the RNN to input time series from multiple spatial nodes at the same time by introducing the idea of a graph into the RNN. The DCRNN model is able to learn the spatial relationships between different nodes while learning the time-dimension dependency. Another approach, Graph WaveNet [19], combines TCN and GCN. TCN is a CNN model applied to time-series prediction, which increases the receptive field through dilated convolution and prevents data leakage through causal convolution. GCN [20,21] is a graph convolutional neural network, i.e., a convolutional network that can act directly on a graph and use its structural information. The first layer is fed with the time-series data from all nodes in the graph, and an independent representation of each sequence is extracted by the TCN [22]. After obtaining the representation of each sequence at each moment, the representations of all nodes within each moment are spatially fused using GCN. In addition to GCN-based models, the attention-based graph temporal prediction model GMAN [14] is also a mainstream approach. The modeling framework adopted by GMAN differs from both DCRNN and Graph WaveNet: GMAN extracts information in the temporal and spatial dimensions in parallel at each layer, and the two are then fused by a gating mechanism. Since GCN is difficult to scale to large-scale networks and converges slowly during training, the model proposed in this paper adds an attention mechanism to the GCN by assigning different attention weights to the edges between nodes in the graph and models the correlation between multidimensional feature variables through graph attention.

3. TCN_DLGAT Model

In this section, we describe the three important components of the proposed model: the differential learning network, the time-series convolutional network and the multidimensional feature graph attention network, and conclude by outlining the overall structure of the model.

3.1. Differential Learning Network

The differential learning network proposed in this paper is built on the gated recurrent unit (GRU) [23]. The GRU is a kind of recurrent neural network and a variant of the long short-term memory network (LSTM) [24] that can be used to process sequential data. The GRU can alleviate the problems of plain RNNs, such as the inability to retain long-term memory and the vanishing gradient in back propagation; it plays a similar role to the LSTM but is easier to train. There are two gates in the GRU model, the reset gate and the update gate, as shown in Figure 2.
In Figure 2, $x_t \in \mathbb{R}^m$ denotes the information input at the current moment, where m is the dimension of the input data and $\mathbb{R}^m$ denotes the set of m-dimensional input vectors. $h_{t-1}$ denotes the hidden state at the previous moment, and $h_t$ denotes the hidden state passed to the next moment. $r_t$ and $z_t$ denote the reset gate and the update gate. $\sigma$ and $\tanh$ denote the sigmoid activation function and the hyperbolic tangent activation function. $\otimes$ denotes element-wise multiplication and $\oplus$ denotes vector concatenation (splicing). The GRU memory cell is implemented as follows:
$$ z_t = \sigma\left(W_z \cdot [h_{t-1}, x_t]\right) \qquad (1) $$
$$ r_t = \sigma\left(W_r \cdot [h_{t-1}, x_t]\right) \qquad (2) $$
$$ \tilde{h}_t = \tanh\left(W \cdot [r_t \otimes h_{t-1}, x_t]\right) \qquad (3) $$
$$ h_t = (1 - z_t) \otimes h_{t-1} + z_t \otimes \tilde{h}_t \qquad (4) $$
where $W_z$, $W_r$ and $W$ are weight matrices rather than scalar values. Equation (1) represents the state change of information through the update gate, and Equation (4) is the expression for the memory update. The update gate controls the extent to which information from the previous state is carried into the current state; that is, it helps the model decide how much past information should be passed on to the future. Equation (2) represents the state change of information through the reset gate, which determines how much past information should be forgotten.
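For concreteness, the following is a minimal sketch of the gate computations in Equations (1)–(4), assuming PyTorch; the class and variable names are illustrative rather than the exact implementation used in the paper.

```python
import torch
import torch.nn as nn

class GRUCellSketch(nn.Module):
    """Manual GRU cell following Equations (1)-(4)."""
    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        # W_z, W_r and W each act on the concatenation [h_{t-1}, x_t]
        self.W_z = nn.Linear(hidden_dim + input_dim, hidden_dim, bias=False)
        self.W_r = nn.Linear(hidden_dim + input_dim, hidden_dim, bias=False)
        self.W = nn.Linear(hidden_dim + input_dim, hidden_dim, bias=False)

    def forward(self, x_t: torch.Tensor, h_prev: torch.Tensor) -> torch.Tensor:
        hx = torch.cat([h_prev, x_t], dim=-1)
        z_t = torch.sigmoid(self.W_z(hx))                                     # update gate, Eq. (1)
        r_t = torch.sigmoid(self.W_r(hx))                                     # reset gate, Eq. (2)
        h_tilde = torch.tanh(self.W(torch.cat([r_t * h_prev, x_t], dim=-1)))  # candidate state, Eq. (3)
        return (1 - z_t) * h_prev + z_t * h_tilde                             # memory update, Eq. (4)
```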
The differential learning network framework proposed in this paper is shown in Figure 3. The differential features of the load data are added to the traditional GRU to learn the dynamic evolution information of the load sequences by combining the original feature information of the load data and the differential feature information. The time difference information in the load data can reflect the dynamic evolution information of the load data, which is a very important factor for load prediction. The differential learning network proposed in this paper can model the time dependence and dynamic evolution of load sequences by learning the amount of load change at different time points.
The differential learning network is divided into two GRU parts. First, in the first GRU module, the differential feature representation of the load data is computed from the differences of the load data through the GRU state-update process. Given the load sequence $X = \{x_1, x_2, \ldots, x_T\}$, where T denotes the time step, the differential sequence $\hat{X} = \{\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_T\}$ is computed as follows:
$$ \hat{x}_t = \begin{cases} 0, & \text{if } t = 1 \\ x_t - x_{t-1}, & \text{if } t = 2, 3, \ldots, T \end{cases} \qquad (5) $$
where $x_t, \hat{x}_t \in \mathbb{R}^m$.
The learned differential feature representation $\hat{h}_t$ is then combined with the original information of the load sequence and fed into the GRU module of the second part. The state transition process is computed as follows:
$$ z_t = \sigma\left(W_z \cdot [h_{t-1}, \hat{h}_t, x_t]\right) \qquad (6) $$
$$ r_t = \sigma\left(W_r \cdot [h_{t-1}, \hat{h}_t, x_t]\right) \qquad (7) $$
$$ \tilde{h}_t = \tanh\left(W \cdot [r_t \otimes h_{t-1}, x_t]\right) \qquad (8) $$
$$ h_t = (1 - z_t) \otimes h_{t-1} + z_t \otimes \tilde{h}_t \qquad (9) $$
Using the differential learning network, the differential features of the load data are encoded as $\hat{H} = \{\hat{h}_1, \hat{h}_2, \ldots, \hat{h}_T\}$, and the original features of the load data are encoded as $H = \{h_1, h_2, \ldots, h_T\}$.
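The two-part structure above can be sketched as follows, assuming PyTorch and the standard torch.nn.GRUCell; feeding the differential hidden state together with the raw input into the second cell approximates the gates of Equations (6)–(9). Class and variable names are illustrative.

```python
import torch
import torch.nn as nn

class DifferentialLearningSketch(nn.Module):
    """First GRU encodes the differenced sequence (Eq. (5)); the second GRU
    consumes [x_t, h_hat_t] so its gates see the differential features."""
    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        self.diff_gru = nn.GRUCell(input_dim, hidden_dim)
        self.main_gru = nn.GRUCell(input_dim + hidden_dim, hidden_dim)

    def forward(self, x: torch.Tensor):
        # x: (batch, T, input_dim)
        x_diff = torch.zeros_like(x)
        x_diff[:, 1:] = x[:, 1:] - x[:, :-1]      # Eq. (5): the first difference is zero
        B, T, _ = x.shape
        h_hat = x.new_zeros(B, self.diff_gru.hidden_size)
        h = x.new_zeros(B, self.main_gru.hidden_size)
        H_hat, H = [], []
        for t in range(T):
            h_hat = self.diff_gru(x_diff[:, t], h_hat)               # differential features
            h = self.main_gru(torch.cat([x[:, t], h_hat], -1), h)    # fused update, Eqs. (6)-(9)
            H_hat.append(h_hat)
            H.append(h)
        return torch.stack(H_hat, 1), torch.stack(H, 1)               # H_hat and H
```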

3.2. Time-Series Convolutional Network

Traditional convolutional neural networks cannot capture long-term dependencies well due to the limitation of their convolutional kernel size and are generally not suitable for modeling long time series, so in this paper we use a temporal convolutional network to capture the temporal trend of each time node. The temporal convolutional network (TCN) builds on the dilated convolution proposed by Yu and Koltun, using dilated causal convolution to increase the receptive field of the convolution [25]. Unlike traditional convolution, dilated convolution samples the input at intervals during convolution. The temporal causal order is maintained by padding the input with zeros, so that the prediction at the current time point involves only previous historical information, as shown in Figure 4. The sampling rate is controlled by d in the figure: the bottom layer with d = 1 samples every point of the input, and the middle layer with d = 2 samples every second point. Generally speaking, the higher the layer, the larger the value of d used. The dilated causal convolution therefore makes the effective window size grow exponentially with the number of layers, so the convolutional network can obtain a large receptive field with a relatively small number of layers [26].
Since traditional convolutional neural networks cannot model long load sequences, this paper introduces a temporal convolutional network for extracting the temporal features of the load data. The input of the temporal convolutional network is defined as $X = \{x_1, x_2, \ldots, x_T\}$, where T denotes the time step. The temporal convolutional network outputs the temporal features $H = \{h_1, h_2, \ldots, h_T\}$ of the load data.
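A minimal sketch of a dilated causal convolution block is given below, assuming PyTorch; the kernel size of 3 follows Figure 4, while the channel count and dilation factors are illustrative.

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """1-D convolution that is causal: output at time t depends only on inputs up to t."""
    def __init__(self, channels: int, kernel_size: int, dilation: int):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation          # amount of left padding
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, T); pad on the left only to preserve causality
        x = nn.functional.pad(x, (self.pad, 0))
        return self.conv(x)

# Stacking layers with dilations 1, 2, 4 grows the receptive field exponentially.
tcn = nn.Sequential(*[CausalConv1d(channels=3, kernel_size=3, dilation=d) for d in (1, 2, 4)])
out = tcn(torch.rand(8, 3, 672))                          # (batch, channels, T) preserved
```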

3.3. Multidimensional Feature Graph Attention Network

To capture the spatial correlation among multidimensional feature variables, the graph attention network (GAT) [27] is used for spatial modeling of the load data. The GAT network is capable of modeling correlations between nodes in arbitrary graphs. Graph attention networks compute a weighted summation of neighboring node features using an attention mechanism; the weights assigned to neighboring node features depend solely on the node features themselves, independent of the graph structure. GAT was developed after another graph neural network, the graph convolutional network (GCN), and the core difference between GAT and GCN is how the feature representations of neighboring nodes at distance 1 are collected and aggregated. In GCN, one graph convolution operation consists of a normalized summation of the neighboring node features:
$$ h_i^{(l+1)} = \sigma\left(\sum_{j \in N(i)} \frac{1}{c_{ij}} W^{(l)} h_j^{(l)}\right) \qquad (10) $$
where $N(i)$ is the set of nodes that are 1-distance neighbors of node i. We usually add an edge connecting node i to itself so that i itself is included in $N(i)$. $c_{ij}$ is a normalization constant based on the graph structure, $\sigma$ is an activation function, and $W^{(l)} \in \mathbb{R}^{m}$ is the weight matrix of the node feature transformation, shared by all nodes, where m denotes the dimensionality of the node feature vector. The graph attention network (GAT) replaces the fixed normalization operation in the graph convolution with an attention mechanism. Figure 5 and the following equations define how the l-th layer node features are updated to obtain the (l+1)-th layer node features.
First, for node i, the similarity coefficients between its neighbors ($j \in N(i)$) and itself are computed one by one:
$$ e_{ij}^{l} = \alpha\left(\left[W^{(l)} h_i^{(l)} \,\big\|\, W^{(l)} h_j^{(l)}\right]\right), \quad j \in N(i) \qquad (11) $$
The above equation computes the raw attention score between pairs of nodes. First, the features of nodes i and j are transformed and concatenated; then, the dot product of the concatenated embedding and a learnable weight vector $\alpha$ is taken. Finally, the attention coefficients are normalized to obtain the attention weights as follows:
$$ a_{ij}^{l} = \frac{\exp\left(e_{ij}^{l}\right)}{\sum_{k \in N(i)} \exp\left(e_{ik}^{l}\right)} \qquad (12) $$
After obtaining the attention weights, similar to the node feature update rule of GCN, an attention-based weighted summation of the features of all neighboring nodes is performed as follows:
$$ h_i^{(l+1)} = \sigma\left(\sum_{j \in N(i)} a_{ij}^{l} W^{(l)} h_j^{(l)}\right) \qquad (13) $$
where $h_i^{(l+1)}$ is the new feature output by GAT for node i, incorporating the information of its neighboring nodes, and $\sigma$ is the activation function.
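The following is a minimal single-head sketch of the attention update of Equations (11)–(13) on a complete graph, assuming PyTorch; the LeakyReLU and ELU nonlinearities follow the standard GAT formulation, and all names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayerSketch(nn.Module):
    """Single-head graph attention layer over a complete graph."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)   # shared feature transform W^(l)
        self.a = nn.Linear(2 * out_dim, 1, bias=False)    # learnable attention vector alpha

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (num_nodes, in_dim); on a complete graph every node attends to every node
        Wh = self.W(h)                                     # (N, out_dim)
        N = Wh.size(0)
        pairs = torch.cat([Wh.unsqueeze(1).expand(N, N, -1),
                           Wh.unsqueeze(0).expand(N, N, -1)], dim=-1)
        e = F.leaky_relu(self.a(pairs).squeeze(-1))        # raw scores e_ij, Eq. (11)
        alpha = torch.softmax(e, dim=-1)                   # normalized weights a_ij, Eq. (12)
        return F.elu(alpha @ Wh)                           # weighted aggregation, Eq. (13)
```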
In the multidimensional feature graph attention network, we treat the load sequence that incorporates multidimensional features such as seasonal trends and holidays as a complete graph. Each node in the graph represents a feature of the load sequence, and the edge between adjacent nodes represents the relationship between two features. In this way, the correlation between the multidimensional features of the load sequence can be learned through graph attention. As shown in Figure 6, $\xi_t^1, \xi_t^2, \ldots, \xi_t^m$ represent the feature variables of the load sequence at moment t, and m is the total number of multidimensional features. $\alpha_{12}, \alpha_{13}, \ldots, \alpha_{1m}$ represent the attention weights of the other features on feature $\xi_t^1$, which express the correlation dependence between features. The output of the multidimensional feature graph attention network is an n × m matrix, where n represents the time step and m represents the total number of nodes.

3.4. TCN_DLGAT Model Framework

The proposed TCN_DLGAT framework is shown in Figure 7. The input consists of two parts: the original load data x and the differential load data $\hat{x}$. The model consists of three components: the TCN time-series convolutional network, the GRU differential learning network and the multidimensional feature graph attention network. The TCN is used to model the time dependence of the long load series, the GRU differential learning network learns the dynamically changing features of the load data, and the multidimensional feature graph attention network extracts the spatial features between the multidimensional feature variables. We also propose a learning algorithm that minimizes the loss function so that the model converges to the optimum.

4. Methods

In this section, we address the application scenario by first giving the mathematical definition of the problem to be solved, then introducing the data preprocessing part in the pre-training phase of the model and, finally, giving the model training procedure for minimizing the loss optimization function.

4.1. Problem Definition

In this paper, we focus on the task of multivariate load forecasting. Let $x_t \in \mathbb{R}^N$ denote the vector of multidimensional variables of dimension N at moment t. Here $x_t$ consists of the load at moment t and other variables that affect future loads (e.g., seasonal characteristics, holiday characteristics and temperature), and $x_t[i] \in \mathbb{R}$ denotes the value of the i-th variable at moment t. Our goal is to predict the loads $Y = \{y_{p+1}, y_{p+2}, \ldots, y_{p+n}\}$ at n future time points based on a historical input sequence $X = \{x_1, x_2, \ldots, x_p\}$ with time step p.

4.2. STL Decomposition to Extract Trend and Seasonal Features

Seasonal and Trend decomposition using LOESS (STL) is a very general and robust method for decomposing time series, where LOESS (locally estimated scatterplot smoothing) is a method for estimating nonlinear relationships. The STL decomposition method was proposed by Cleveland et al. in 1990 [28]. STL decomposes a time series into three main components: trend, seasonal term and residual, and uses LOESS to extract smoothed estimates of the three components.
The load data studied in this paper also have a growth or decline trend and seasonal characteristics over a fixed period of time. To improve the accuracy of model prediction, this paper uses the STL method to decompose the load data into long-term trend, seasonal and random residuals as the trend and seasonal characteristics of the load series.
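A minimal usage sketch of this step with the statsmodels STL implementation is shown below, assuming a 15-min resolution load series (96 points per day); the file name and column name are placeholders.

```python
import pandas as pd
from statsmodels.tsa.seasonal import STL

# Hypothetical datetime-indexed load series at 15-min resolution.
load = pd.read_csv("load.csv", index_col=0, parse_dates=True)["load"]

# period=96 captures the daily cycle of 15-min data; robust fitting damps outliers.
result = STL(load, period=96, robust=True).fit()
features = pd.DataFrame({
    "load": load,
    "trend": result.trend,        # long-term growth or decline
    "seasonal": result.seasonal,  # recurring daily pattern
    "resid": result.resid,        # random residual
})
```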

4.3. Extract the Year, Month, Day and Holiday Characteristics

In reality, the load data have a certain correlation with external factors such as temperature and electricity prices [29]. When these external characteristics can be accurately obtained, they can be used to improve the accuracy of load forecasting. The load patterns of holidays and non-working days are different from those of normal days, and their overall load curves and peaks differ significantly from those of working days. During special holidays such as the Spring Festival, the Dragon Boat Festival and the Mid-Autumn Festival, major social events and traditional ways of celebrating lead to higher overall electricity consumption than usual. In order to consider the impact of holidays and non-working days on electricity consumption in the forecasting process, this paper extracts the year, month and day of the load data in the dataset, as well as the corresponding holiday characteristics, in the data preprocessing stage.
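A minimal sketch of this preprocessing step with pandas is given below, assuming a datetime-indexed DataFrame; the holiday list is a placeholder that would be filled with the relevant festival dates.

```python
import pandas as pd

def add_calendar_features(df: pd.DataFrame, holidays: set) -> pd.DataFrame:
    """Append year/month/day columns and a binary holiday flag to a datetime-indexed frame."""
    df = df.copy()
    df["year"] = df.index.year
    df["month"] = df.index.month
    df["day"] = df.index.day
    holiday_index = pd.to_datetime(sorted(holidays))
    df["is_holiday"] = df.index.normalize().isin(holiday_index).astype(int)
    return df

# Illustrative holiday dates only; in practice these come from the official calendar.
holidays = {"2013-02-10", "2013-06-12", "2013-09-19"}
```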

4.4. Data Normalization

As described above, the load data in this paper exhibit trend, seasonal and holiday-related characteristics. To improve the accuracy of the model prediction, we represent the periodicity and trend of the load series with the STL features and the holiday behavior with the calendar features.
In order to improve the robustness of the model, we use the MinMaxScaler to normalize the data as follows:
$$ \dot{x} = \frac{x - \min(X_{train})}{\max(X_{train}) - \min(X_{train})} \qquad (14) $$
where $\max(X_{train})$ and $\min(X_{train})$ denote the maximum and minimum values in the training set.
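A minimal sketch of Equation (14) with scikit-learn's MinMaxScaler is shown below; note that the scaler is fitted on the training split only, and the arrays here are random placeholders for the actual feature matrices.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Placeholder arrays of shape (time, features) standing in for the real splits.
X_train, X_val, X_test = np.random.rand(100, 3), np.random.rand(20, 3), np.random.rand(20, 3)

scaler = MinMaxScaler()
X_train_s = scaler.fit_transform(X_train)   # learns min(X_train) and max(X_train), Eq. (14)
X_val_s = scaler.transform(X_val)           # validation and test reuse the training statistics
X_test_s = scaler.transform(X_test)
```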

4.5. Training Loss Function

Using MAE as the loss function in time-series prediction has the advantages of robustness, intuitiveness and interpretability, stable gradient behavior and straightforward evaluation of predicted values, which helps us better optimize the training of the model and evaluate the prediction performance. Therefore, our model is trained by minimizing the mean absolute error (MAE), which is defined as:
$$ \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| \hat{y}_i - y_i \right| \qquad (15) $$
where $\hat{y}_i$ denotes the predicted value of the model, and $y_i$ denotes the actual value.
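A minimal training-loop sketch with the MAE objective of Equation (15) is given below, assuming PyTorch; the model and data are placeholders standing in for TCN_DLGAT and the windowed load dataset, and Adam is used as stated in Section 5.1.

```python
import torch
import torch.nn as nn

# Placeholder model and batches; in practice these are TCN_DLGAT and the real DataLoader.
model = nn.Linear(672 * 3, 96)
loader = [(torch.rand(32, 672 * 3), torch.rand(32, 96)) for _ in range(4)]

criterion = nn.L1Loss()                                      # mean absolute error, Eq. (15)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)    # Adam optimizer

for x_batch, y_batch in loader:
    optimizer.zero_grad()
    loss = criterion(model(x_batch), y_batch)                # MAE between prediction and truth
    loss.backward()
    optimizer.step()
```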

5. Experiments

In this section, we validate the effectiveness of the proposed model on the UCI Electricity Load Diagrams public dataset (https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014, accessed on 1 August 2023). The dataset records the electricity consumption of 370 customers every 15 min from 2011 to 2014. We use the STL method to extract the periodicity, trend and holiday features of the load data. The differential features of the load data are first calculated, and then data normalization is performed on both the differential features and the raw data. The dataset is divided chronologically, with 70% for training, 10% for validation and 20% for testing. We use the past week (i.e., 168 h, 672 time points) to predict the load consumption for the next 24 h (96 time points).
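A minimal sketch of the chronological split and sliding-window sample construction described above is given below, assuming numpy; the data array is a random placeholder for the preprocessed load matrix, with the load in the first column.

```python
import numpy as np

def make_windows(series: np.ndarray, n_in: int = 672, n_out: int = 96):
    """Build (input, target) pairs with a sliding window; the target is the load column."""
    X, Y = [], []
    for start in range(len(series) - n_in - n_out + 1):
        X.append(series[start:start + n_in])                       # past 672 points, all features
        Y.append(series[start + n_in:start + n_in + n_out, 0])     # next 96 load values
    return np.stack(X), np.stack(Y)

data = np.random.rand(10000, 3)            # placeholder for the preprocessed load matrix
n = len(data)
train = data[: int(0.7 * n)]               # chronological 70/10/20 split
val = data[int(0.7 * n): int(0.8 * n)]
test = data[int(0.8 * n):]
X_train, Y_train = make_windows(train)
```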

5.1. Experimental Setup

Our experiments were conducted on a computer with an AMD Ryzen 7 5800X 8-core processor and an NVIDIA GeForce RTX 3080 GPU (GA102). For all models, we used the same sliding window size n = 672. The temporal convolution part uses TCN networks with dilation factors 1, 2, 4, 1, 2, 4, 1, 2, 4, and the convolution kernel size is set to 25. The dropout is uniformly set to 0.2 for all experimental models, and the initial learning rate is 0.001.
We compare TCN_DLGAT with the following models:
  • LSTM: A temporal recurrent neural network suitable for processing and predicting events with relatively long intervals and delays in the time series.
  • CNN_LSTM: A combination of the convolutional neural network and the long short-term memory network.
  • LSTNet [30]: The LSTM model combined with a temporal attention mechanism.
  • Transformer: An encoder–decoder network combining attention mechanisms.
  • TCN: An advanced TCN model called WaveNet in which an additional gating mechanism is applied after the dilated causal convolution.
The selection of each model parameter during the experiments is shown in Table 1. The error-reduction optimization of the neural networks is carried out using a gradient descent optimization algorithm; the adaptive moment estimation (Adam) algorithm is chosen in this paper.

5.2. Metrics

Our evaluation metrics include mean absolute error (MAE), mean square error (MSE), and mean absolute percentage error (MAPE).
MAE is specifically defined in Equation (15) in Section 4, and mean square error (MSE) and mean absolute percentage error (MAPE) are defined as:
$$ \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2 $$
$$ \mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{\hat{y}_i - y_i}{y_i} \right| $$
where $\hat{y}_i$ denotes the predicted value of the model, and $y_i$ denotes the actual value. Mean absolute error (MAE) measures the average absolute difference between the predicted value and the true value and indicates the average size of the model's prediction error; a smaller MAE represents better accuracy and precision. Mean square error (MSE) computes the average squared difference between the predicted and true values and is more sensitive to larger errors; a smaller MSE indicates a better overall fit of the model to the true values. Mean absolute percentage error (MAPE) measures the average absolute percentage difference between the predicted and true values and serves as a measure of relative error; a smaller MAPE indicates a lower relative prediction error. For all these metrics, smaller values indicate better predictive performance.
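For reference, the three metrics can be computed as in the following sketch, assuming numpy arrays of predictions and ground-truth values.

```python
import numpy as np

def mae(y_pred: np.ndarray, y_true: np.ndarray) -> float:
    """Mean absolute error."""
    return float(np.mean(np.abs(y_pred - y_true)))

def mse(y_pred: np.ndarray, y_true: np.ndarray) -> float:
    """Mean square error."""
    return float(np.mean((y_pred - y_true) ** 2))

def mape(y_pred: np.ndarray, y_true: np.ndarray) -> float:
    """Mean absolute percentage error; assumes y_true contains no zeros."""
    return float(np.mean(np.abs((y_pred - y_true) / y_true)))
```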

5.3. Experimental Results

In the experiments on the UCI Electricity Load Diagrams public dataset, we compare TCN_DLGAT with several advanced methods; the experimental results are shown in Table 2 and Figure 8. Across the week, TCN_DLGAT performs much better than the recurrent neural network models, including LSTM and CNN_LSTM, and its MAE, MSE and MAPE are lower than those of these models. Taking Monday as an example, the MAE of LSTM and CNN_LSTM is 15% and 9% higher than that of TCN_DLGAT. Furthermore, TCN_DLGAT outperforms the models based on the attention mechanism (LSTNet and Transformer), whose MAE is 5% and 26% higher, respectively, than that of TCN_DLGAT. Compared with the temporal convolutional network (TCN), whose MAE is 5% higher than that of TCN_DLGAT, the temporal convolutional network only enlarges the receptive field and does not learn the dynamic change characteristics of the load data or the spatial correlation between the other variables; therefore, it does not perform as well as TCN_DLGAT. This also validates the importance of the dynamic information of the load data and the spatial correlation between the multiple variables that we propose.
We plotted the predicted and true values output by the above experimental models for a particular day, as shown in Figure 9. The predicted MSE error of the LSTM model is 0.00197, that of the CNN_LSTM model is 0.00183, that of the LSTNet model is 0.00165, that of the Transformer model is 0.00178, that of the TCN model is 0.00158, and that of the proposed TCN_DLGAT model is 0.00142. By analyzing these plots, we find that among the six compared models, the MSE of the model proposed in this paper is the smallest, indicating that the difference between the predicted and true values of the TCN_DLGAT output is the smallest. Furthermore, for the several peaks of the load curve within a day, only TCN and TCN_DLGAT predict such changes. Both TCN and TCN_DLGAT use dilated causal convolution to increase the receptive field of the convolution, which better captures the long-term dependencies and dynamic evolution in the time series; these two models can learn when users' electricity consumption is highest and lowest each day over a long time series, so such special points can be captured readily. Being able to capture the peaks of the daily load curve is of great importance for implementing orderly power consumption in the power system at a later stage.

5.4. Ablation Experiments

In this section, we verify the effectiveness of each module of the proposed model by disabling the two GRU modules, the temporal convolutional network and the multidimensional feature graph attention layer. The descriptions of these ablation models are as follows:
  • TCN_GAT: Omitting the differential learning network, the input containing only the raw load data information is used as one input channel of the temporal convolutional network to extract the long-series time dependence of the load data. The spatial correlation between the multidimensional feature variables is extracted using the multidimensional feature graph attention network, and the two parts of the feature outputs are then connected through a convolutional neural network to obtain the prediction.
  • TCN_DL: Omitting the multidimensional feature graph attention network, the original and differential sequences of the load data are fed into the time-series convolutional network and the differential learning network to model the dynamic variation characteristics of the load sequences and the time dependence of the long sequences. The two parts of the feature outputs are then connected through a convolutional neural network to obtain the prediction.
  • DLGAT: Omitting the time-series convolutional network, the original and differential sequences of the load data are first fed into the differential learning network and the multidimensional feature graph attention network. The two parts of the feature outputs are then connected through a convolutional neural network to obtain the prediction.
Table 3 shows the MAE, MSE and MAPE for predicting the 96 time points of the next day. According to Table 3, the MAE, MSE and MAPE of TCN_GAT are higher than those of TCN_DLGAT, with a 29% increase in MAE, which illustrates the importance of the dynamically changing characteristics learned from the differential features by the differential learning network for the model's prediction results. The MAE of TCN_DL is 8% higher than that of TCN_DLGAT. This indicates that the multidimensional feature graph attention network is better at capturing features such as holidays and seasonality and can reflect the spatial correlation between these different features through graph attention. At the same time, omitting the temporal convolutional network in DLGAT, which is capable of capturing long time-series features, leaves the model unable to capture long-range features of the load data. Therefore, it is important to use the temporal convolutional network to learn the temporal correlation of the load data.
In the ablation experiments, we also found that the MSE and MAPE of TCN_DL are 115% and 330% higher than those of TCN_DLGAT, so the absence of GAT significantly increases the variability of the prediction results. This suggests that establishing a graph structure between multiple feature variables can effectively capture the complex relationships and dependencies between nodes in the graph and that such dependencies improve the prediction accuracy of the model. Without the GAT module, the model cannot fully exploit this correlation information, leading to inaccurate predictions. Particularly for load series data with complex relationships and dependencies, a missing GAT can prevent the model from capturing important feature correlations, increasing the prediction error.

6. Conclusions

In this paper, we propose a method that combines a differential learning network and a multidimensional feature graph attention network to model the time dependence and dynamic evolution of load sequences. By learning the amount of load variation at different time points, we are able to better capture the evolutionary trend of the load sequence. We also introduce a graph attention layer to represent the correlation between different variable features of the load sequence. This attention mechanism helps the model focus on the features that are important for the prediction goal.
We conducted experiments on the UC Irvine Machine Learning Repository Electricity Load Diagrams public dataset and verified that the proposed model outperforms other state-of-the-art methods. The ablation experiments also demonstrate the effectiveness of each module of the proposed model.
In future work, we plan to investigate the application of the proposed TCN_DLGAT model to larger datasets. In addition, we may also consider adding a feature selection module to the multidimensional feature graph attention network to exclude those features that negatively affect the model predictions. This can reduce the influence of unnecessary feature variables on the model prediction and improve the performance and interpretability of the model.

Author Contributions

Conceptualization, C.H., N.D., J.H. and N.L.; methodology, C.H., N.D. and Y.F.; software, C.H., N.D., J.H., N.L. and W.C.; validation, C.H., N.D., J.H. and W.C.; writing—original draft preparation, C.H. and N.L.; visualization, N.D.; supervision, J.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Guangdong Provincial Science and Technology Program under No.030500KK52220013 (GDKJXM20220802).

Data Availability Statement

The original data are available at https://archive.ics.uci.edu/dataset/321/electricityloaddiagrams20112014, accessed on 1 August 2023. The preprocessed data can be obtained from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
RNN	Recurrent neural network
GRU	Gated recurrent unit
LSTM	Long short-term memory
TCN	Time-series convolution network
GAT	Graph attention network
GCN	Graph convolutional network

References

  1. Zhu, X.C. Research on short-term power load forecasting method based on IFOA-GRNN. Power Syst. Prot. Control 2020, 48, 121–127. [Google Scholar]
  2. Zheng, L.; Shuxin, T.; Dawei, Y. A medium- and long-term load forecasting method based on ARIMA-TARCH-BP neural network model. Electron. Devices 2020, 43, 175–179. [Google Scholar]
  3. Wang, Y.; Du, X.; Lu, Z.; Duan, Q.; Wu, J. Improved LSTM-Based Time-Series Anomaly Detection in Rail Transit Operation Environments. IEEE Trans. Ind. Inform. 2022, 18, 9027–9036. [Google Scholar] [CrossRef]
  4. Aamir, M.; Shabri, A. Modelling and forecasting monthly crude oil price of Pakistan: A comparative study of ARIMA, GARCH and ARIMA Kalman model. AIP Conf. Proc. 2016, 1750, 060015. [Google Scholar]
  5. Wan, R.; Mei, S.; Wang, J.; Liu, M.; Yang, F. Multivariate Temporal Convolutional Network: A Deep Neural Networks Approach for Multivariate Time Series Forecasting. Electronics 2019, 8, 876. [Google Scholar] [CrossRef]
  6. Wang, X.; Liu, H.; Du, J.; Dong, X.; Yang, Z. A long-term multivariate time series forecasting network combining series decomposition and convolutional neural networks. Appl. Soft Comput. 2023, 139, 110214. [Google Scholar] [CrossRef]
  7. Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn 1995, 20, 273–297. [Google Scholar] [CrossRef]
  8. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  9. Lewandowski, M.; Makris, D.; Velastin, S.A.; Nebel, J.C. Structural Laplacian Eigenmaps for Modeling Sets of Multivariate Sequences. IEEE Trans. Cybern. 2014, 44, 936–949. [Google Scholar] [CrossRef] [PubMed]
  10. Salinas, D.; Flunkert, V.; Gasthaus, J. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. Int. J. Forecast. 2020, 36, 1181–1191. [Google Scholar] [CrossRef]
  11. Rangapuram, S.S.; Seeger, M.; Gasthaus, J.; Stella, L.; Wang, Y.; Januschowski, T. Deep State Space Models for Time Series Forecasting. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; pp. 7796–7805. [Google Scholar]
  12. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010. [Google Scholar]
  13. Wu, N.; Green, B.; Xue, B.; O’Banion, S. Deep Transformer Models for Time Series Forecasting: The Influenza Prevalence Case. arXiv 2020, arXiv:2001.08317. [Google Scholar]
  14. Zheng, C.; Fan, X.; Wang, C.; Qi, J. GMAN: A Graph Multi-Attention Network for Traffic Prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 1234–1241. [Google Scholar]
  15. Lim, B.; Arık, S.; Loeff, N.; Pfister, T. Temporal Fusion Transformers for interpretable multi-horizon time series forecasting. Int. J. Forecast. 2021, 37, 1748–1764. [Google Scholar] [CrossRef]
  16. Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Zhang, W. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021. [Google Scholar]
  17. Du, S.; Li, T.; Yang, Y.; Horng, S.J. Multivariate time series forecasting via attention-based encoder—decoder framework. Neurocomputing 2020, 388, 269–279. [Google Scholar] [CrossRef]
  18. Li, Y.; Yu, R.; Shahabi, C.; Liu, Y. Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  19. Wu, Z.; Pan, S.; Long, G.; Jiang, J.; Zhang, C. Graph Wavenet for Deep Spatial-Temporal Graph Modeling. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019; pp. 1907–1913. [Google Scholar]
  20. Wu, Z.; Pan, S.; Chen, F.; Long, G.; Zhang, C.; Yu, P.S. A Comprehensive Survey on Graph Neural Networks. IEEE Trans. Neural Netw. Learn. Systems. 2021, 32, 4–24. [Google Scholar] [CrossRef] [PubMed]
  21. Chen, F.; Pan, S.; Jiang, J.; Huo, H.; Long, G. DAGCN: Dual Attention Graph Convolutional Networks. In Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary, 14–19 July 2019; pp. 1–8. [Google Scholar]
  22. Dauphin, Y.N.; Fan, A.; Auli, M.; Grangier, D. Language Modeling with Gated Convolutional Networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; Volume 70, pp. 933–941. [Google Scholar]
  23. Chung, J.; Gulcehre, C.; Cho, K.H.; Bengio, Y. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv 2014, arXiv:1412.3555. [Google Scholar]
  24. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  25. Yu, F.; Koltun, V. Multi-Scale Context Aggregation by Dilated Convolutions. In Proceedings of the International Conference on Learning Representations, San Juan, Puerto Rico, 2–4 May 2016. [Google Scholar]
  26. Fan, J.; Zhang, K.; Huang, Y.; Zhu, Y.; Chen, B. Parallel spatio-temporal attention-based TCN for multivariate time series prediction. Neural Comput. Appl. 2023, 35, 13109–13118. [Google Scholar] [CrossRef]
  27. Cirstea, R.G.; Guo, C.; Yang, B. Graph Attention Recurrent Neural Networks for Correlated Time Series Forecasting. arXiv 2021, arXiv:2103.10760. [Google Scholar]
  28. Cleveland, R.B.; Cleveland, W.S. STL: A seasonal-trend decomposition procedure based on Loess. J. Off. Stat. 1990, 6, 3–73. [Google Scholar]
  29. Li, J.; Wei, S.; Dai, W. Combination of Manifold Learning and Deep Learning Algorithms for Mid-Term Electrical Load Forecasting. IEEE Trans. Neural Netw. Learn. Syst. 2021, 34, 2584–2593. [Google Scholar] [CrossRef] [PubMed]
  30. Lai, G.; Chang, W.C.; Yang, Y.; Liu, H. Modeling Long and Short-Term Temporal Patterns with Deep Neural Networks. In Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, Ann Arbor, MI, USA, 8–12 July 2018; pp. 95–104. [Google Scholar]
Figure 1. The static and dynamic time-series characteristics of the load data are used to predict the load for a future period.
Figure 2. Gated recurrent unit (GRU) structure.
Figure 3. The differential learning network framework is divided into two parts: the first part is used to extract the differential features of the load data, and the second part takes the differential features of the load data together with the original features as input to model the dynamic features of the overall load sequence.
Figure 4. The kernel size is 3 for the dilated causal convolution. Since the dilated causal convolution makes the size of the effective window grow exponentially with the number of layers, a large receptive field can be obtained with a relatively small number of layers.
Figure 5. $h_i$ in the figure denotes the features of node i, and $h_{j1}$–$h_{j4}$ denote the features of the neighboring nodes of node i. The attention coefficient $e_{ij}$ is obtained by calculating the similarity coefficient, and finally the attention coefficient is normalized to obtain the final attention weight $a_{ij}$.
Figure 6. $\xi_t^2, \ldots, \xi_t^m$ denote the neighboring node features of $\xi_t^1$, and the new feature that incorporates the information of the neighboring nodes for each feature node at time t is output by calculating the attention weight of each neighboring node.
Figure 7. The model framework. The input is divided into two parts: the original feature sequence $X = \{x_1, x_2, \ldots, x_T\}$ of the load data, which incorporates trend features, seasonal features, year–month–day features and holiday features; and the differential feature sequence $\hat{X} = \{\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_T\}$ of the load data. The original and differential feature sequences are input into the three components of the model to obtain the dynamic and spatio-temporal correlation representations of the load data.
Figure 8. The predicted load MAE and MSE comparison curves for each comparison model for one week. The MAE and MSE of the proposed model in this paper (green curve in the figure) are lower than those of the other comparison models.
Figure 9. The comparison curves of the predicted and true values of the output of each of the six models on a particular day in the comparison experiment. The horizontal coordinates indicate the predicted time steps (96 time points in a day), and the vertical coordinates indicate the predicted load volumes. For several peaks of the load curve in a day (marked by red dots in the figure), only TCN and TCN_DLGAT predict such changes, and the MSE error of TCN_DLGAT is 10% less than that of TCN.
Table 1. Description of model parameters.

Model Parameter | Parameter Meaning | Value
input_seq | Sliding window size | 672
input_size | Dimension of input data | 3
hidden_size | Dimension of hidden layer state | 96
output_size | Size of output data | 96
batch_size | Batch size | 32
num_layers | Number of layers in the LSTM stack | 1
learning_rate | Learning rate | 0.001
dropout | Dropout | 0.2
encoder_layers | Number of encoder layers in Transformer | 3
decoder_layers | Number of decoder layers in Transformer | 3
n_heads | Number of attention heads in Transformer | 3
Table 2. MAE, MSE and MAPE errors on the UCI Electricity Load Diagrams public dataset. The percentages in parentheses reflect the percentage increase compared to TCN_DLGAT, which outperformed the other methods in all comparison experiments: its MAE is 5% lower, its MSE 10% lower and its MAPE 24% lower than the next best alternative.

Metric | LSTM | CNN_LSTM | LSTNet | Transformer | TCN | TCN_DLGAT
MAE | 0.039 (+15%) | 0.037 (+9%) | 0.036 (+5%) | 0.043 (+26%) | 0.036 (+5%) | 0.034
MSE | 0.0026 (+30%) | 0.0023 (+15%) | 0.0022 (+10%) | 0.0024 (+20%) | 0.0027 (+35%) | 0.0020
MAPE | 13.098 (+355%) | 14.923 (+418%) | 3.562 (+24%) | 8.095 (+181%) | 14.838 (+415%) | 2.881
Table 3. The ablation analysis results show the effect of ablation on MAE, MSE and MAPE. Overall, each module of the model contributes to the prediction results; the percentage increase in MAE for each ablation model ranges from 8% to 29%.

Metric | TCN_GAT | TCN_DL | DLGAT | TCN_DLGAT
MAE | 0.044 (+29%) | 0.037 (+8%) | 0.043 (+26%) | 0.034
MSE | 0.0031 (+55%) | 0.0043 (+115%) | 0.0029 (+45%) | 0.0020
MAPE | 3.818 (+33%) | 22.477 (+330%) | 3.038 (+5%) | 2.881
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
