Article

Deep Spatio-Temporal Graph Network with Self-Optimization for Air Quality Prediction

1 Artificial Intelligence College, Beijing Technology and Business University, Beijing 100048, China
2 China Light Industry Key Laboratory of Industrial Internet and Big Data, Beijing Technology and Business University, Beijing 100048, China
3 Department of Computer Science and Engineering, ITM SLS Baroda University, Vadodara 391510, India
* Author to whom correspondence should be addressed.
Entropy 2023, 25(2), 247; https://doi.org/10.3390/e25020247
Submission received: 17 November 2022 / Revised: 17 January 2023 / Accepted: 25 January 2023 / Published: 30 January 2023
(This article belongs to the Special Issue Improving Predictive Models with Expert Knowledge)

Abstract

The environment and development are major issues of general concern. Having suffered the harm of environmental pollution, humanity has begun to pay attention to environmental protection and to carry out pollutant prediction research. Many air pollutant prediction studies have tried to predict pollutants by revealing their evolution patterns, emphasizing the fitting analysis of time series while ignoring the spatial transmission effect of adjacent areas, which leads to low prediction accuracy. To solve this problem, we propose a spatio-temporal graph neural network with self-optimization ability (BGGRU) that can mine the changing patterns of the time series and the spatial propagation effect. The proposed network includes spatial and temporal modules. The spatial module uses a graph sampling and aggregation network (GraphSAGE) to extract the spatial information of the data. The temporal module uses a Bayesian graph gated recurrent unit (BGraphGRU), which applies a graph network to the gated recurrent unit (GRU) to fit the data's temporal information. In addition, this study used Bayesian optimization to solve the problem of model inaccuracy caused by inappropriate hyperparameters. The high accuracy of the proposed method was verified with actual PM2.5 data from Beijing, China, providing an effective method for predicting the PM2.5 concentration.

1. Introduction

In recent years, with the development of the economy, people's living standards have been improving, and people are paying more attention to their quality of life, which air quality directly affects. Time series prediction technology has therefore been widely used in air quality prediction. Among the many factors affecting air quality, PM2.5 is among the most significant. PM2.5 refers to particles with an aerodynamic equivalent diameter of 2.5 microns or less in the ambient air; such particles can remain suspended in the air for a long time, and the higher their concentration, the more serious the air pollution. PM2.5 comes from various sources, mainly human activities such as coal combustion, urban dust, automobile exhaust emissions, and industrial pollution. Although PM2.5 particles are tiny and account for only a small fraction of the Earth's atmospheric composition, compared with coarser and more abundant atmospheric particles, PM2.5 covers a large area, is highly active, remains in the atmosphere for a long time, and is transported over long distances. It easily carries toxic and harmful substances (heavy metals, microorganisms, etc.) and can alter precipitation and temperature patterns. Thus, it has a more significant impact on human health and atmospheric environmental quality [1]. Accurately predicting the concentration of PM2.5 is therefore of great significance.
With the increasing impact of air pollution on people, air pollution has become a topic of concern, and pollution prediction has become an essential means of controlling air pollution. With the development of science and technology, obtaining PM2.5 concentration and other air pollutant data has become convenient. By analyzing past PM2.5 concentration data, researchers can, to a certain extent, determine the changing patterns of the PM2.5 concentration. However, because of the very complex formation mechanism and changing process of PM2.5, together with the effects of the temporal dimension (time series) and the spatial dimension (geographical distance), these data form non-stationary time series with complex nonlinear and noise characteristics, which increases the difficulty of predicting the PM2.5 concentration. Therefore, establishing an air quality prediction model with high accuracy and real-time performance that fully considers spatio-temporal nonlinearity has always been a hotspot in the time series prediction field.
Researchers have explored air pollution prediction from different research directions and established a variety of prediction methods. In the 1960s and 1970s, some developed countries first began to study methods of predicting air pollution, the most important of which is potential prediction [2]. This is mainly based on the diffusion of pollutants, but potential prediction cannot reveal quantitative pollutant concentrations. In order to obtain better prediction results, statistical methods and machine learning technology have been widely used in air prediction modeling. Statistical methods build statistical probability models from data and use the models to analyze and predict. Simple statistical methods mainly use the autoregressive model (AR) [3] and the autoregressive integrated moving average model (ARIMA) [4] to predict air quality. Machine learning technology uses computer algorithms that improve automatically with experience. Machine learning techniques applied to air quality prediction include artificial neural networks (ANNs) [5], support vector regression (SVR) [6], random forest (RF) [7], etc. Although the above methods are easy to implement for predicting air quality, and the models are relatively simple and interpretable, they have difficulty modeling nonlinear data and effectively representing highly complex data. Therefore, they cannot handle the large, highly complex, and strongly nonlinear datasets encountered in actual PM2.5 prediction.
In addition to algorithms based on traditional statistical and machine learning technologies, more and more research has begun to use deep learning technology to build prediction models. Relying on its powerful modeling ability and nonlinear processing ability, the deep learning method has been widely used in air quality prediction. Standard deep learning network models include the recurrent neural network (RNN) [8], gated recurrent unit (GRU) [9], long short-term memory network (LSTM) [10], temporal convolution network (TCN) [11], etc. These networks and their variants have been widely used in air quality prediction and have achieved good prediction accuracy. However, most of the above deep learning methods emphasized the fitting analysis of time series but neglected or simplified the impact of spatial transmission between different regions, thus decreasing the prediction accuracy.
Compared with the above methods, the graph neural network (GNN) is a framework that uses deep learning technology to learn graph-structured data directly. Its excellent performance has greatly attracted scholars' attention and in-depth exploration. Commonly used graph neural networks include the graph convolution network (GCN) [12], graph attention network (GAT) [13], graph sampling and aggregation network (GraphSAGE) [14], etc. These networks and their variants have been successfully applied to air quality prediction. Because of their strong ability to fit nonlinear graph-structured data, they show high accuracy and robustness. However, these networks often consider only the spatial dimension and fail to model the temporal and spatial dependence simultaneously. In order to solve this problem, many researchers began to combine recurrent neural networks and graph neural networks to form graph recurrent neural networks (GRNs) [15]. A GRN usually converts graph data into sequences that recursively evolve and change during training. This combination can fully model the dependence in both the spatial and temporal dimensions and can also effectively solve the common problems of gradient disappearance/explosion in time series prediction, thus improving the prediction accuracy.
Although the graph recurrent neural network has shown its effectiveness in many fields because of its excellent nonlinear fitting ability and powerful spatio-temporal information capture ability, like other deep learning methods, its performance may be seriously affected by hyperparameters selected on the basis of empirical knowledge or repeated trials. This approach requires a great deal of experimental time and computational resources, and the obtained hyperparameters may not be optimal, which can lead to unstable performance and limited accuracy [16]. Therefore, applying an effective strategy to find the optimal hyperparameter set of the model is the critical step in solving the above problems.
To sum up, if air quality prediction only focuses on the information in a single time or space dimension while ignoring or simplifying the fitting of time series and the analysis of common spatial transmission effects between different locations, the prediction accuracy will decrease. Therefore, based on the ability of graph neural networks to capture spatial information, as well as the advantages of a recurrent neural network to effectively solve the common problems of gradient disappearance/explosion in time series prediction, this paper proposes a new time series prediction network with the self-optimization ability of spatio-temporal graph gated recurrent units called BGGRU. The main contributions of this paper are as follows:
(1)
BGraphSAGE for the spatial dimension: A GNN model (GraphSAGE) is used to extract the spatial features, and the Bayesian method is used to optimize the model’s hyperparameters. The model can extract hidden spatial information while eliminating the unstable performance of the model caused by the experience-based selection of hyperparameters.
(2)
BGraphGRU for the temporal dimension: Graph convolution is used to replace the linear operations of the GRU, and the Bayesian hyperparameter optimization method is applied so that the network can extract the temporal dependency with suitable hyperparameters.
(3)
BGGRU with the combined model: This consists of a spatial dimension (BGraphSAGE) module and temporal dimension (BGraphGRU) module, which can thoroughly learn the data from the time and space dimensions so as to effectively improve the prediction accuracy and generalization performance of the prediction model.
The rest of this paper is organized in the following way: Section 2 introduces the related work in this field; Section 3 presents the method proposed in this study in detail; and Section 4 describes the setup and results of the experiment, in which the proposed combined network model was used to predict PM2.5 in Beijing, China. Section 5 is the conclusion of this paper and suggests prospective work.

2. Related Work

2.1. The Traditional PM2.5 Prediction Method

Traditional PM2.5 prediction methods include statistical methods and machine learning methods. The statistical method, which mainly considers the formation mechanism of PM2.5, has a relatively simple structure and is widely used in air quality prediction. Liu et al. [17] combined ARIMA with numerical prediction to predict the daily and hourly PM2.5 concentration in Hong Kong. Zeng et al. [18] studied the relationship between PM2.5 and meteorological factors in Chengdu within 24 h and used a generalized additive model to predict the concentration of PM2.5. Traditional statistical prediction methods have limitations because the formation of PM2.5 is very complex.
Based on historical data, machine learning for air pollution prediction can model the nonlinearity of actual air pollution data, thus producing higher prediction accuracy. Shahriar et al. [19] evaluated hybrid models consisting of ARIMA, ANN, SVM, PCR, DT, and CatBoost; among these, CatBoost performed best, while the ARIMA-ANN and DT methods also provided acceptable results. Caroline et al. [20] constructed and evaluated an SVM to predict ground-level PM2.5 in densely populated cities with complex terrain; the final results demonstrated the potential of the SVM as a prediction model in other tropical cities. Rui et al. [21] used a multi-layer perceptron (MLP) to analyze and predict ambient PM2.5 in eight core regional cities in China and found that reducing gaseous pollutants is crucial to regulating PM2.5.
With the explosion of data, statistical models and machine learning methods with simple structures can no longer provide the complex modeling capability required for big data, and they are prone to overfitting.

2.2. PM2.5 Prediction Method Based on Deep Learning

In recent years, the deep learning method has attracted significant attention for air quality prediction because of its strong learning ability and its ability to fit nonlinear data. Dongming Qin et al. [22] used a combined model consisting of two deep networks (CNN and LSTM), in which the CNN extracted the features of the input data and the LSTM modeled the time dependence of the pollutants; the prediction results show that it improves the prediction performance compared with classic models. H. Kaimian et al. [23] used LSTM to predict the PM2.5 concentration in Tehran, and the predictions could explain 80% of the PM2.5 variability. K. Nagrecha et al. [24] proposed a CNN-based system to analyze past sensor measurements and predict air pollutant concentrations, with results comparable to, and often better than, the most advanced prediction systems in this field.
Although these networks have been widely applied in air quality prediction, most networks do not fully mine data characteristics from the two dimensions of time and space. Moreover, most deep learning models focus on Euclidean space, but the air quality monitoring station distribution often fails to match the ideal conditions of Euclidean space.

2.3. PM2.5 Prediction Method Based on Graph Neural Network

A graph neural network is an algorithm that uses a neural network to learn, extract, and mine the features and patterns in graph-structured data to meet the needs of graph learning tasks such as clustering, classification, prediction, segmentation, generation, etc. A graph is composed of nodes and edges, usually denoted as $G = (V, E)$, where $V = \{v_1, v_2, \ldots, v_n\}$ represents the node set, and $E = \{e_1, e_2, \ldots, e_m\}$ represents the edge set. Graph neural networks can adapt to complex structural priors, such as defining the relationships between multiple concepts or describing complex nonlinear structures.
In addition, compared with other neural network models, graph neural networks can model the overall characteristics of data from two aspects, i.e., structure and function. Therefore, a graph neural network has greater universality in spatio-temporal data modeling and information mining.
In recent years, graph neural networks have been widely used in air quality prediction on account of the above characteristics. P. Zhao et al. [25] proposed a method using multi-attention spatio-temporal graph networks (MASTGN) to predict the air pollutant concentration in China and Japan. Shuo Wang et al. [26] proposed a graph-based model called PM2.5-GNN, which can capture long-term dependencies. To combine the advantages of strong interpretability and feature extraction ability, Hz et al. [27] combined the PM2.5 dispersion partial differential equation with a deep learning method based on the newly proposed DPGN model, using hourly PM2.5 monitoring data. Zhao [28] proposed a comprehensive prediction method to extract the spatial propagation between adjacent places. The methods mentioned above try to mine information in the spatial dimension but do not address gradient disappearance/explosion, common issues in time series prediction that reduce prediction accuracy. Therefore, they fail to combine the advantages of graph neural networks in capturing spatial information with those of recurrent neural networks in processing time series.
To sum up, traditional prediction methods and those based on deep learning have difficulty mining the data features in the space and time dimensions. In recent years, graph neural networks have provided a new idea for air quality prediction due to their ability to mine spatial characteristics. In this paper, we propose a new time series prediction network with the self-optimization ability of spatio-temporal graph gated recurrent units, which combines three elements: the spatial information extraction ability of a graph neural network, the advantages of a recurrent neural network that can effectively solve the gradient problem in time series prediction, and the powerful hyperparameter optimization ability of the Bayesian method.

3. Methods

3.1. Overall Structure of BGGRU

BGGRU is composed of two modules: the spatial dimension module (BGraphSAGE) and the temporal dimension module (BGraphGRU). They jointly exploit the spatio-temporal relationship in the data. As shown in Figure 1, the first part of BGGRU is the spatial dimension module, using BGraphSAGE to aggregate local features, generate embedding for each node in the graph, and extract the spatial features from the data. The second part of BGGRU is the temporal dimension module, using BGraphGRU to jointly model spatio-temporal dependencies while maintaining the spatial characteristics of the input. Lastly, the future concentration of PM2.5 is predicted through two fully connected layers. The model also uses the Bayesian method to optimize the hyperparameters of each module in the model and the model as a whole. Additionally, the learning diagram for BGGRU is shown in Figure 2, which indicates the learning process of the model. In the training stage, we first preprocess the training and validation datasets (see details in Section 4.1), randomly initialize the network parameters, and generate the graph adjacency matrix describing the graph structure through known geographic information. Then, the BGGRU network proposed in this paper is used to extract the spatio-temporal dimension features from the data and predict the future concentration of PM2.5. Finally, the loss function (see details in Section 4.2) is used to calculate the prediction error, and the network weights are adjusted later. The above steps are repeated until the number of epochs is reached. In the test stage, the test dataset is sent to the trained BGGRU, and the future concentration of PM2.5 is predicted through the network. The performance of the network is evaluated by the evaluation function (see details in Section 4.2).
In addition, the model uses a skip connection across the spatial and temporal modules. In deep neural networks, when there are many layers, the model often suffers from overfitting and gradient disappearance/explosion, so the parameters in the shallow layers fail to update. With a skip connection, the input to the next layer includes not only the nonlinearly transformed output of the current layer but also the current layer's input passed through unchanged. This method alleviates the overfitting and gradient disappearance/explosion problems to a certain extent.
As shown in Figure 1 and Figure 2, the proposed model adds a skip connection mechanism: the final fully connected layer receives not only the output of BGraphGRU but also the output of the first BGraphSAGE. The skip connection can stabilize training and prevent overfitting.
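The skip connection feeding the prediction head can be sketched in plain NumPy; the feature dimensions and random weights below are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def fully_connected(x, w, b):
    # Plain affine layer, standing in for the final prediction head.
    return x @ w + b

# Hypothetical shapes: 16 districts (nodes), 32-dim module outputs.
sage_out = rng.standard_normal((16, 32))   # first BGraphSAGE output (spatial)
gru_out = rng.standard_normal((16, 32))    # BGraphGRU output (temporal)

# Skip connection: the head sees both the temporal output and the
# first spatial module's output, concatenated feature-wise.
head_in = np.concatenate([gru_out, sage_out], axis=1)  # (16, 64)

w = rng.standard_normal((64, 1)) * 0.1
b = np.zeros(1)
pred = fully_connected(head_in, w, b)      # one PM2.5 value per district
```

Because `sage_out` bypasses BGraphGRU entirely, gradients reach the spatial module directly, which is what stabilizes training in deep stacks.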

3.1.1. Spatial Dimension Module of BGGRU (BGraphSAGE)

Traditional graph embedding algorithms (based on matrix decomposition and random walk) need to use the information of all nodes in the iterative process to learn the vector representation. These previous approaches are inherently transductive. GraphSAGE [29] includes sampling and aggregation. First, the connection information between nodes is used to sample their neighbors, and then the information of adjacent nodes is continuously fused through multi-layer aggregation functions. The structure of GraphSAGE is shown in Figure 3. The method proposed in this paper also uses GraphSAGE to aggregate the features of local nodes, generate embedding for each node in the graph, and extract the spatial features from the data.
The graph is represented as $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the set of nodes, and $\mathcal{E}$ is the set of edges. In each iteration $k$, each node aggregates the embeddings of all of its adjacent nodes by taking the average of these vectors. The aggregated vector is then concatenated with the node's current embedding vector, and the output is calculated through a linear layer with a sigmoid activation function.
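The mean-aggregation step can be illustrated with a minimal NumPy sketch of one GraphSAGE iteration; the node features, neighbor lists, and weight shapes are hypothetical, and a real GraphSAGE implementation adds neighbor sampling and per-layer trainable weights.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def graphsage_mean_step(h, neighbors, w):
    """One GraphSAGE iteration with a mean aggregator.

    h:         (num_nodes, d) current node embeddings
    neighbors: dict node -> list of neighbor node ids
    w:         (2*d, d) weights applied to [own embedding ; neighbor mean]
    """
    out = np.empty_like(h)
    for v in range(h.shape[0]):
        agg = h[neighbors[v]].mean(axis=0)   # average the neighbor embeddings
        cat = np.concatenate([h[v], agg])    # concatenate with own embedding
        out[v] = sigmoid(cat @ w)            # linear layer + sigmoid activation
    # L2-normalize each embedding, as is conventional in GraphSAGE
    return out / np.linalg.norm(out, axis=1, keepdims=True)

# Tiny 3-node example with 2-dim embeddings
h = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
neighbors = {0: [1, 2], 1: [0], 2: [0, 1]}
rng = np.random.default_rng(1)
w = rng.standard_normal((4, 2))
h_next = graphsage_mean_step(h, neighbors, w)
```

Stacking $k$ such steps lets each node's embedding absorb information from its $k$-hop neighborhood.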
We also used the Bayesian method to automatically optimize the hyperparameters of the spatial dimension module (see Section 3.2 for details).

3.1.2. Temporal Dimension Module of BGGRU (BGraphGRU)

The GRU is a variant of LSTM, also proposed to solve gradient problems in long-term memory and backpropagation. Like LSTM, the GRU is mainly used for temporal feature extraction from time series data, and its performance is similar to that of LSTM in many cases, but it is computationally simpler and easier to implement. The BGraphGRU proposed here therefore has the same chain structure as the GRU, but a graph convolution operation (GraphSAGE) replaces the GRU's linear transformations and extracts the structural features of the snapshot at each time step. BGraphGRU thus addresses the problem of long-term dependence and effectively learns the temporal characteristics of the input graphs, so that the spatio-temporal correlation can be jointly modeled while the spatial structure of the input is maintained. BGraphGRU's formulas are given below.
The GRU introduces the concepts of the reset gate and update gate, thus modifying how hidden states are calculated in the recurrent neural network. The reset gate acts on the previous hidden state, and its function is to determine how much past information needs to be forgotten. The formula for the reset gate is as follows:
$r_t = \sigma\left(W_r A_t + \mathrm{GraphSAGE}_r^k(h_{t-1}, \hat{A}_{t-1}) + b_r\right)$
where $A_t \in \mathbb{R}^{N \times N}$ is the input of BGraphGRU at time $t$, $h_{t-1} \in \mathbb{R}^{N \times d}$ is the hidden state at time $t-1$, and $W_r \in \mathbb{R}^{N \times d}$ and $b_r \in \mathbb{R}^{d}$ are the weight matrix and bias of the reset gate.
In fact, the update gate can be understood as a combination of the forget gate and input gate in LSTM. The subject of its function is the hidden unit of the current time and the previous time, and its function is to decide how much useful information needs to be transferred down at the current time and previous time. The formula for the update gate is as follows:
$z_t = \sigma\left(W_z A_t + \mathrm{GraphSAGE}_z^k(h_{t-1}, \hat{A}_{t-1}) + b_z\right)$
where $A_t \in \mathbb{R}^{N \times N}$ is the input of BGraphGRU at time $t$, $h_{t-1} \in \mathbb{R}^{N \times d}$ is the hidden state at time $t-1$, and $W_z \in \mathbb{R}^{N \times d}$ and $b_z \in \mathbb{R}^{d}$ are the weight matrix and bias of the update gate.
Next, BGraphGRU will calculate the candidate hidden state to assist in the later hidden state calculation. The formula is as follows:
$\tilde{h}_t = \tanh\left(W_{hA} A_t + W_{hh}\left(r_t \odot \mathrm{GraphSAGE}_h^k(h_{t-1})\right) + b_h\right)$
It can be seen from the above formula that the reset gate controls how the hidden state of the previous time step flows to the candidate hidden state of the current time step. The hidden state of the last time step may contain all of the historical information of the time series up to the last time step. Therefore, the reset gate can discard historical information irrelevant to the prediction.
Finally, the hidden state $h_t$ at time step $t$ is calculated by using the update gate $z_t$ of the current time step to combine the hidden state $h_{t-1}$ of the previous time step with the candidate hidden state $\tilde{h}_t$ of the current time step. The formula is as follows:
$h_t = (1 - z_t) \odot \mathrm{GraphSAGE}_h^k(h_{t-1}) + z_t \odot \tilde{h}_t$
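Putting the four formulas together, one BGraphGRU step can be sketched as follows. Here `graph_op` is a simplified stand-in for the $\mathrm{GraphSAGE}^k$ operator (a single neighborhood-mean step over the adjacency matrix), and the input/weight shapes are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def graph_op(h, adj):
    # Stand-in for GraphSAGE^k: a single neighborhood-mean step.
    deg = adj.sum(axis=1, keepdims=True)
    return (adj @ h) / np.maximum(deg, 1.0)

def bgraphgru_cell(a_t, h_prev, adj, params):
    """One BGraphGRU step following the reset/update/candidate/state formulas."""
    Wr, br, Wz, bz, WhA, Whh, bh = params
    g = graph_op(h_prev, adj)                         # graph operator on h_{t-1}
    r = sigmoid(a_t @ Wr + g + br)                    # reset gate
    z = sigmoid(a_t @ Wz + g + bz)                    # update gate
    h_cand = np.tanh(a_t @ WhA + (r * g) @ Whh + bh)  # candidate hidden state
    return (1.0 - z) * g + z * h_cand                 # new hidden state h_t

# Tiny example: 3 districts, 2 hidden features, hypothetical parameters.
rng = np.random.default_rng(0)
n, d = 3, 2
adj = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
a_t = rng.standard_normal((n, d))
h_prev = np.zeros((n, d))
params = tuple(rng.standard_normal(s) for s in
               [(d, d), (d,), (d, d), (d,), (d, d), (d, d), (d,)])
h_t = bgraphgru_cell(a_t, h_prev, adj, params)
```

The gating logic is identical to a standard GRU; only the transformation applied to the previous hidden state changes from a linear map to a graph operation, which is what preserves the spatial structure through time.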
We also used the Bayesian method to automatically optimize the hyperparameters of the temporal dimension module (see Section 3.2 for details).

3.2. Bayesian Hyperparameter Optimization

While the graph neural network is widely used in air quality prediction, the selection of the model’s hyperparameters, like in other deep learning methods, often depends on experience; not only does this consume lots of time and resources, but the obtained hyperparameters are not necessarily optimal.
Therefore, the critical step in solving the above problems is to use an effective strategy to find the optimal hyperparameter set of the model. In recent years, Bayesian optimization has become widely used in solving black-box function problems and has become mainstream for hyperparameter optimization. Bayesian optimization is a global optimization method: the objective function only needs to meet local smoothness assumptions, such as uniform continuity or Lipschitz continuity, and the introduction of an acquisition function for effective exploration and exploitation makes it possible to obtain an approximate solution of a complex objective function with fewer evaluations. This is why we optimized the model's hyperparameters with the Bayesian method to ensure the prediction model's performance [30]. The formulas are as follows.
The first step in the Bayesian method is to assume a functional relationship between the loss function to be optimized and the hyperparameter of the model:
$p^* = \underset{p \in P}{\arg\min}\; \mathrm{loss}(p)$
In this formula, $p^*$ is the optimal combination of hyperparameters obtained from the Bayesian hyperparameter optimization algorithm, $P$ is the set of all hyperparameters, $p$ is a set of input hyperparameters, and $\mathrm{loss}(\cdot)$ is the objective function to be optimized. In our model, the hyperparameters to be optimized included the number of training epochs (num_epoch), the learning rate of the overall model (learning rate), the number of BGraphSAGE layers (num_layers of BGraphSAGE), the number of edges of each node (edges_per_node) in the graph, and the number of BGraphGRU layers (num_layers of BGraphGRU). The loss function is defined as the mean absolute error, with the following formula:
$\mathrm{loss}(p_j) = \frac{1}{n} \sum_{i=1}^{n} \left| \hat{y}_i(p_j) - y_i \right|$
In this formula, $p_j$ is the $j$-th hyperparameter combination, $y_i$ is the actual value, and $\hat{y}_i(p_j)$ is the model output using the hyperparameter combination $p_j$.
The next step of the Bayesian method is to construct a dataset $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_i, y_i)\}$, in which $x_i$ is the $i$-th hyperparameter set, and $y_i$ is the error of the output result under that set of hyperparameters:
$y_i = \mathrm{loss}(p_i)$
Next, the Bayesian method uses the limited observation points to estimate the function distribution and obtain the surrogate model $M$, which obeys a Gaussian distribution $G$ with mean $\mu$ and covariance $k$. The posterior probability $p(y \mid x_i, D)$ is derived from dataset $D$, and the expression of the specific function $M$ is obtained from dataset $D$:
$p(\mathrm{loss}) = G(\mathrm{loss}; \mu, k)$
$p(\mathrm{loss} \mid D) = G(\mathrm{loss}; \mu_{\mathrm{loss} \mid D}, k_{\mathrm{loss} \mid D})$
The function used to determine the next observation point is called the acquisition function $a(p)$. It measures the impact of each candidate observation point on the fitting of the surrogate model $M$ and selects the point with the most significant impact as the next observation.
$p^* = \arg\max\; a(P, p(y \mid x))$
The ultimate goal of Bayesian hyperparameter optimization is to continuously refine the estimate of the objective function $\mathrm{loss}(\cdot)$ as observation points are gradually added under the influence of the acquisition function $a(p)$ and, finally, to estimate the minimum value of the objective function $\mathrm{loss}(\cdot)$. The above steps are repeated until the optimal hyperparameters are selected or the maximum number of iterations is reached.
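The loop described above, fit a Gaussian-process surrogate, maximize an acquisition function, observe, and repeat, can be sketched on a one-dimensional toy problem. The quadratic `toy_loss` is a hypothetical stand-in for the model's validation error as a function of a single hyperparameter; a production setup would search all five hyperparameters jointly with a library such as scikit-optimize or Optuna.

```python
import math
import numpy as np

def rbf(a, b, ls=0.5):
    # Squared-exponential kernel over 1-D inputs.
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def gp_posterior(x_obs, y_obs, x_query, noise=1e-6):
    # Gaussian-process posterior mean and variance at the query points.
    K = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))
    Ks = rbf(x_obs, x_query)
    K_inv = np.linalg.inv(K)
    mu = Ks.T @ K_inv @ y_obs
    var = 1.0 - np.sum(Ks * (K_inv @ Ks), axis=0)   # k(x, x) = 1 for this kernel
    return mu, np.maximum(var, 1e-12)

def expected_improvement(mu, var, best):
    # Expected-improvement acquisition for minimization.
    sigma = np.sqrt(var)
    z = (best - mu) / sigma
    cdf = 0.5 * (1.0 + np.array([math.erf(v / math.sqrt(2)) for v in z]))
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2 * math.pi)
    return (best - mu) * cdf + sigma * pdf

def toy_loss(p):
    # Hypothetical validation MAE as a function of one hyperparameter.
    return (p - 0.3) ** 2 + 0.05

grid = np.linspace(0.0, 1.0, 201)          # candidate hyperparameter values
x_obs = np.array([0.0, 0.5, 1.0])          # initial observations
y_obs = toy_loss(x_obs)

for _ in range(10):                        # Bayesian optimization loop
    mu, var = gp_posterior(x_obs, y_obs, grid)
    x_next = grid[int(np.argmax(expected_improvement(mu, var, y_obs.min())))]
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, toy_loss(x_next))

best_p = float(x_obs[int(np.argmin(y_obs))])
```

Each iteration spends the next expensive evaluation where the surrogate predicts either a low loss or high uncertainty, which is what lets the search converge in far fewer trials than a grid search.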

4. Experiment and Results

4.1. Dataset and Experimental Environment

This study used PM2.5 concentration data obtained from 16 districts in Beijing, China (Dongcheng District, Xicheng District, Chaoyang District, Fengtai District, Shijingshan District, Haidian District, Shunyi District, Tongzhou District, Daxing District, Fangshan District, Mentougou, Changping District, Pinggu District, Miyun District, Huairou District, and Yanqing District) from 1 January 2017 to 1 April 2017 as the target variable. The sampling interval is 1 h, giving a total of 2160 samples, and the proportion of the training, validation, and test sets is 8:1:1. Figure 4 shows the data from Haidian District and Xicheng District; the data in the other districts follow a similar pattern. The model takes 24 data points as an input sample to predict the 25th data point. The datasets are publicly available at https://github.com/btbuIntelliSense/Beijing-PM2.5-Concentration-Data (accessed on 14 January 2023).
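The windowing and 8:1:1 split described above can be sketched as follows; the chronological split and the 24-in/1-out window follow the text, while the helper names are our own.

```python
def make_windows(series, window=24):
    """Build (input, target) pairs: 24 past points predict the 25th."""
    xs, ys = [], []
    for i in range(len(series) - window):
        xs.append(series[i:i + window])
        ys.append(series[i + window])
    return xs, ys

def split_811(xs, ys):
    """8:1:1 chronological split into train / validation / test sets."""
    n = len(xs)
    n_train, n_val = int(n * 0.8), int(n * 0.1)
    train = (xs[:n_train], ys[:n_train])
    val = (xs[n_train:n_train + n_val], ys[n_train:n_train + n_val])
    test = (xs[n_train + n_val:], ys[n_train + n_val:])
    return train, val, test

# Stand-in for one district's 2160 hourly PM2.5 readings
series = list(range(2160))
xs, ys = make_windows(series)
train, val, test = split_811(xs, ys)
```

Splitting chronologically (rather than randomly) prevents future observations from leaking into the training windows.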
This study conducted all experiments on a computer equipped with a 12th Gen Intel(R) Core(TM) i5-12500H processor at 4800 MHz and 16 GB of memory. All models were built using PyTorch, an open-source deep learning framework.
Some values were missing from the actual dataset used in this experiment. The method for dealing with missing values in this work was to replace the missing value at time t with the data at time t − 1. If 24 consecutive data points were missing (a whole day), the data from the previous day were used in place of the missing data.
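The two imputation rules can be sketched as below; this simplified version assumes the series does not start with missing values, represents a missing reading as `None` (a NaN-based dataset would need a different check), and applies the previous-day rule to any gap of 24 points or more.

```python
def fill_missing(values, day=24):
    """Fill gaps: previous-hour value for short gaps, previous-day
    values for gaps spanning a whole day (>= 24 points)."""
    filled = list(values)
    i = 0
    while i < len(filled):
        if filled[i] is None:
            j = i
            while j < len(filled) and filled[j] is None:
                j += 1                      # find the end of the missing run
            if j - i >= day:
                for k in range(i, j):
                    filled[k] = filled[k - day]   # copy same hour, previous day
            else:
                for k in range(i, j):
                    filled[k] = filled[k - 1]     # carry forward previous hour
            i = j
        else:
            i += 1
    return filled

short_gap = fill_missing([1, 2, None, 4])
day_gap = fill_missing(list(range(24)) + [None] * 24 + [99])
```

Filling sequentially means that within a multi-day gap, each day copies the (already filled) day before it.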
In this experiment, the nodes of the graph refer to the 16 districts in Beijing, China. The edges connecting the nodes are determined by the distance between the separate districts (the distance between them was calculated by the longitude and latitude of the districts). Each node is assigned to its N (the number of edges_per_node) nearest adjacent districts.
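Building the graph from geographic coordinates can be sketched as follows: great-circle (haversine) distances between district centers, then the edges_per_node nearest neighbors of each node become its edges. The three coordinates below are rough illustrative values, not the ones used in the paper.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in km."""
    r = 6371.0  # mean Earth radius
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def knn_edges(coords, edges_per_node):
    """Connect each district to its edges_per_node nearest districts."""
    edges = []
    for i, (la1, lo1) in enumerate(coords):
        dists = sorted((haversine_km(la1, lo1, la2, lo2), j)
                       for j, (la2, lo2) in enumerate(coords) if j != i)
        edges.extend((i, j) for _, j in dists[:edges_per_node])
    return edges

# Illustrative (lat, lon) values for three districts, for demonstration only.
coords = [(39.93, 116.42), (39.91, 116.37), (39.92, 116.44)]
edges = knn_edges(coords, edges_per_node=1)
```

The resulting edge list (or its adjacency-matrix form) is what the GraphSAGE layers consume; edges_per_node itself is one of the Bayesian-optimized hyperparameters.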

4.2. Loss Function and Evaluation Index

The loss function used in this experiment was the mean absolute scaled error (MASE), and the formula is as follows:
$\mathrm{loss} = \frac{\sum_{i=0}^{N-1} \left| \hat{y}_i - y_i \right|}{\sum_{i=0}^{N-1} y_i}$
In this formula, y ^ i represents the predicted value of PM2.5 obtained through the model, and y i represents the actual value of PM2.5.
In this experiment, four evaluation functions were used to evaluate the prediction performance of the model, that is, the root-mean-square error (RMSE), mean square error (MSE), mean absolute error (MAE), and coefficient of determination (R2). Their formulas are as follows:
$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2}$
$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$
$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{y}_i - y_i \right|$
$R^2 = 1 - \frac{\sum_{i=1}^{N} (\hat{y}_i - y_i)^2}{\sum_{i=1}^{N} (\bar{y} - y_i)^2}$
Here, $N$ represents the total number of samples in the dataset, $y_i$ and $\bar{y}$ represent the actual value and the average actual value of PM2.5, and $\hat{y}_i$ represents the predicted value of the PM2.5 concentration obtained through the model. For RMSE, MSE, and MAE, the smaller the value, the better the model's prediction performance. For R2, the closer the value is to 1, the better the model's prediction performance.
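The four evaluation indices can be computed directly from the predicted and actual series; a minimal pure-Python version:

```python
import math

def mse(pred, true):
    # Mean square error
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)

def rmse(pred, true):
    # Root-mean-square error: square root of the MSE
    return math.sqrt(mse(pred, true))

def mae(pred, true):
    # Mean absolute error
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

def r2(pred, true):
    # Coefficient of determination: 1 - SS_res / SS_tot
    mean_t = sum(true) / len(true)
    ss_res = sum((p - t) ** 2 for p, t in zip(pred, true))
    ss_tot = sum((mean_t - t) ** 2 for t in true)
    return 1.0 - ss_res / ss_tot

pred, true = [2.0, 4.0], [1.0, 3.0]
scores = (rmse(pred, true), mse(pred, true), mae(pred, true), r2(pred, true))
```

For a constant offset of 1, the three error measures all equal 1, and R2 is 0 because the residual sum of squares matches the total sum of squares.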

4.3. Experimental Results

4.3.1. Bayesian Hyperparameter Optimization Results

The Bayesian method was used to optimize the five hyperparameters: num_layers of BGraphSAGE, num_layers of BGraphGRU, num_epoch, learning rate, and edges_per_node. The results of Bayesian optimization are shown in Table 1. The following experiments used the optimal hyperparameters to predict the PM2.5 concentration.
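The core of each Bayesian optimization step is an acquisition function that trades off exploring uncertain hyperparameter settings against exploiting promising ones. The expected-improvement acquisition, a common choice, can be sketched as follows. This is an illustration of the general technique, not the authors' implementation; the candidate values and surrogate predictions in the test are hypothetical, and a full implementation would also fit a Gaussian-process surrogate to the observed trials:

```python
import math

def expected_improvement(mu, sigma, best):
    """Expected improvement (for minimisation) of a candidate whose
    surrogate prediction is N(mu, sigma^2), given the best loss so far."""
    if sigma <= 0:
        return max(best - mu, 0.0)
    z = (best - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal pdf
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal cdf
    return (best - mu) * cdf + sigma * pdf

def pick_next(candidates, best):
    """candidates: list of (hyperparameter value, surrogate mean, surrogate std).
    Returns the candidate value with the highest expected improvement."""
    return max(candidates, key=lambda c: expected_improvement(c[1], c[2], best))[0]
```

Here each candidate (e.g., a learning rate) comes with the surrogate's predicted mean and standard deviation of the validation loss; the candidate maximising the expected improvement is evaluated next, and the surrogate is refitted with the new observation.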

4.3.2. Comparative Prediction Results

Nine models were used as the baseline, namely, TCN [31], ConvLSTM [32], CNN-LSTM [33], GRU [34], BidirectionalLSTM (BiLSTM) [35], LSTM [36], BidirectionalGRU (BiGRU) [37], BayesLSTM (bLSTM) [38], and BayesGRU (bGRU) [39]. The dataset and data division ratio used in the baseline models were the same as those of the proposed model. We compared the average PM2.5 prediction results of the above baseline models and the proposed model for 16 districts in Beijing, China. The errors in the prediction results are shown in Table 2.
As shown in Table 2, BGGRU ranks first overall, followed by bGRU [39] and bLSTM [38]. The proposed model outperforms the baseline models on all four evaluation indicators used in this experiment: its R2 is very close to 1, and compared with the second- and third-ranked models, its RMSE, MSE, and MAE decreased by 61%, 84%, and 61% on average, respectively. The baseline models focus on information in the time dimension but ignore the air transmission effect between different locations. In contrast, the proposed model first uses a graph neural network to extract the spatial features of the data and then extracts temporal information through a model combining a graph neural network and a recurrent neural network, thus fully considering the spatio-temporal characteristics of the data. Therefore, the proposed model achieves better air quality prediction results than the other models.

4.3.3. Ablation Study

First, the prediction results of Beijing’s average PM2.5 using only the spatial dimension module of BGGRU were compared with the model proposed in this paper. The comparison results are shown in Table 3.
As shown in Table 3, the prediction results of the proposed model are better than those of the spatial-dimension network alone. The RMSE, MSE, and MAE of the proposed model are 64%, 87%, and 71% lower, respectively, than those of the spatial module alone, and its R2 is closer to 1. This shows that considering the spatial dimension alone is not sufficient; the spatial and temporal dimensions must be considered together.
In addition, we also compared the impact of using Bayesian hyperparameter optimization with that of selecting hyperparameters through experience on the prediction results of Beijing’s average PM2.5. The results of the hyperparameter set obtained with Bayesian optimization and empirical selection are shown in Table 4, and the comparison results are shown in Table 5.
These results show that the model using Bayesian hyperparameter optimization outperforms the model using empirically selected hyperparameters, indicating that Bayesian optimization can improve the model's prediction ability to a certain extent. Compared with selecting hyperparameters through experience, Bayesian optimization improves the model's performance while saving considerable time and computing resources, and it rests on a firmer mathematical foundation with better interpretability.

4.3.4. Display and Analysis

This section presents the average and per-district PM2.5 prediction results in Beijing obtained with the proposed model on the above data, as shown in Figure 5 and Figure 6.
It can be seen from Figure 5 and Figure 6 that the predictions of the proposed model are consistent with the trend of the actual values, reflecting the model's effectiveness in PM2.5 prediction after extracting spatio-temporal information.

5. Conclusions and Future Work

Because changes in air quality data are affected by both the time dimension and spatial dimension, this paper proposes a deep spatio-temporal graph network with self-optimization to predict the concentration of PM2.5. In the combined network model, firstly, a graph neural network is used to extract the characteristics of the spatial dimension information from the data. Then, while maintaining the spatial structure of the data, a graph recurrent neural network is used to extract the time dimension information from the data. The model can fully extract and mine the characteristics of the data from the two dimensions of time and space and solve the common gradient problems in time series prediction to improve prediction accuracy. In addition, we also effectively solve the model’s inaccuracy caused by inappropriate hyperparameters by using Bayesian optimization to select the optimal hyperparameters of the model. The effectiveness of the proposed method was proven through an experiment on predicting PM2.5 in 16 districts of Beijing, China.
In future research, we will test our proposed model on a more complex time series dataset, and we will try to combine the graph neural network with different recurrent neural networks and their variants to further improve the overall performance of the prediction model.

Author Contributions

Conceptualization, X.-B.J. and Z.-Y.W.; methodology, X.-B.J. and Z.-Y.W.; software, Z.-Y.W.; validation, X.-B.J. and Z.-Y.W.; formal analysis, J.-L.K.; investigation, Y.-T.B. and T.-L.S.; resources, X.-B.J. and J.-L.K.; data curation, Z.-Y.W., Y.-T.B. and T.-L.S.; writing—original draft preparation, X.-B.J. and Z.-Y.W.; writing—review and editing, X.-B.J. and Z.-Y.W.; visualization, Z.-Y.W.; supervision, J.-L.K., Y.-T.B. and T.-L.S.; project administration, X.-B.J. and J.-L.K.; funding acquisition, H.-J.M., P.C. and X.-B.J. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China, Nos. 62173007, 62006008, and 61903009.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Tsai, F.; Smith, K.; Vichit, N. Indoor/outdoor PM10 and PM2.5 in Bangkok, Thailand. J. Expo. Sci. Environ. Epidemiol. 2000, 10, 15–26. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  2. Gregory, R.; Adrian, S. Predicting air quality: Improvements through advanced methods to integrate models and measurements. J. Comput. Phys. 2007, 227, 3540–3571. [Google Scholar]
  3. Li, J.; Li, X.; Wang, K. Atmospheric PM2.5 concentration prediction based on time series and interactive multiple model approach. Adv. Meteorol. 2019, 2019, 1279565. [Google Scholar] [CrossRef] [Green Version]
  4. Guo, N.; Chen, W.; Wang, M. Appling an Improved Method Based on ARIMA Model to Predict the Short-Term Electricity Consumption Transmitted by the Internet of Things (IoT). Wirel. Commun. Mob. Comput. 2021, 2021, 6610273. [Google Scholar] [CrossRef]
  5. Conde, R.A.; Colorado, D.; Hernández, S.L. Comparison of an artificial neural network and Gompertz model for predicting the dynamics of deaths from COVID-19 in México. Nonlinear Dyn. 2021, 104, 4655–4669. [Google Scholar] [CrossRef] [PubMed]
  6. Zhang, Z.; Hong, W.C. Electric load forecasting by complete ensemble empirical mode decomposition adaptive noise and support vector regression with quantum-based dragonfly algorithm. Nonlinear Dyn. 2019, 98, 1107–1136. [Google Scholar] [CrossRef]
  7. Fang, Z.; Yang, H.; Li, C. Prediction of PM2.5 hourly concentrations in Beijing based on machine learning algorithm and ground-based LiDAR. Arch. Environ. Prot. 2021, 47, 98–107. [Google Scholar]
  8. Connor, J.T.; Martin, R.D.; Atlas, L.E. Recurrent neural networks and robust time series prediction. IEEE Trans. Neural Netw. 1994, 5, 240–254. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  9. Pan, C.; Tan, J.; Feng, D. Prediction Intervals Estimation of Solar Generation Based on Gated Recurrent Unit and Kernel Density Estimation. Neurocomputing 2020, 453, 552–562. [Google Scholar] [CrossRef]
  10. Tian, Y.; Zhang, K.; Li, J. LSTM-based Traffic Flow Prediction with Missing Data. Neurocomputing 2018, 318, 297–305. [Google Scholar] [CrossRef]
  11. Shao, B.J.; Zico, K.; Vladlen, K. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. arXiv 2018, arXiv:1803.01271. [Google Scholar]
  12. Bruna, J.; Zaremba, W.; Szlam, A. Spectral networks and locally connected networks on graphs. In Proceedings of the 2nd International Conference on Learning Representations, Banff, AB, Canada, 14–16 April 2014. [Google Scholar]
  13. Veličković, P.; Cucurull, G.; Casanova, A. Graph attention networks. In Proceedings of the 6th International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  14. Zhiwen, C.; Qiao, D.; Zhengrun, Z.; Bei, S.; Tao, P.; Chunhua, Y. Energy consumption prediction of cold source system based on GraphSAGE. IFAC-PapersOnLine 2021, 54, 37–42. [Google Scholar]
  15. Gori, M.; Monfardini, G.; Scarselli, F. A new model for learning in graph domains. In Proceedings of the International Joint Conference on Neural Networks, Montreal, QC, Canada, 31 July–4 August 2005; pp. 729–734. [Google Scholar]
  16. Snoek, J.; Larochelle, H.; Adams, R.P. Practical Bayesian optimization of machine learning algorithms. In Proceedings of the 25th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 2951–2959. [Google Scholar]
  17. Tong, L.; Alexis, K.H.; Lau, K.S. Time Series Forecasting of Air Quality Based on Regional Numerical Modeling in Hong Kong. J. Geophys. Res. Atmos. 2018, 123, 4175–4196. [Google Scholar]
  18. Zeng, Y.Y.; Daniel, A.J.; Xue, Q. Prediction of Potentially High PM2.5 Concentrations in Chengdu, China. Aerosol Air Qual. Res. 2020, 20, 956–965. [Google Scholar] [CrossRef] [Green Version]
  19. Shahriar, S.A.; Kayes, I.; Hasan, K. Potential of ARIMA-ANN, ARIMA-SVM, DT and CatBoost for atmospheric PM2.5 forecasting in Bangladesh. Atmosphere 2021, 12, 100. [Google Scholar] [CrossRef]
  20. Caroline, M.; Alejandro, C.; Sergio, V. A support vector machine model to forecast ground-level PM2.5 in a highly populated city with a complex terrain. Air Qual. Atmos. Health 2021, 14, 399–409. [Google Scholar] [CrossRef]
  21. Rui, F.; Han, G.; Kun, L. Analysis and accurate prediction of ambient PM2.5 in China using Multi-layer Perceptron. Atmos. Environ. 2020, 232, 117534. [Google Scholar]
  22. Qin, D.; Yu, J.; Zou, G.D. A Novel Combined Prediction Scheme Based on CNN and LSTM for Urban PM2.5 Concentration. IEEE Access 2019, 7, 20050–20059. [Google Scholar] [CrossRef]
  23. Hamed, K.; Qi, L.; Chunlin, W. Evaluation of Different Machine Learning Approaches to Forecasting PM2.5 Mass Concentrations. Aerosol Air Qual. Res. 2019, 19, 1400–1410. [Google Scholar]
  24. Nagrecha, K. Sensor-Based Air Pollution Prediction Using Deep CNN-LSTM. In Proceedings of the 2020 International Conference on Computational Science and Computational Intelligence (CSCI), Las Vegas, NV, USA, 16–18 December 2020; pp. 694–696. [Google Scholar]
  25. Zhao, P.; Zettsu, K. MASTGN: Multi-Attention Spatio-Temporal Graph Networks for Air Pollution Prediction. In Proceedings of the 2020 IEEE International Conference on Big Data (Big Data), Atlanta, GA, USA, 10–13 December 2020; pp. 1442–1448. [Google Scholar]
  26. Wang, S.; Li, Y.; Zhang, J. PM2.5-GNN: A Domain Knowledge Enhanced Graph Neural Network for PM2.5 Forecasting. In Proceedings of the 28th International Conference on Advances in Geographic Information Systems, Seattle, WA, USA, 3–6 November 2020. [Google Scholar]
  27. Hz, A.; Feng, Z.; Zda, B. A theory-guided graph networks based PM2.5 forecasting method. Environ. Pollut. 2022, 293, 118569. [Google Scholar]
  28. Zhao, G.; He, H.; Huang, Y. Near-surface PM2.5 prediction combining the complex network characterization and graph convolution neural network. Neural Comput. Appl. 2021, 33, 17081–17101. [Google Scholar] [CrossRef]
  29. Hamilton, W.L.; Ying, R.; Leskovec, J. Inductive Representation Learning on Large Graphs. arXiv 2017, arXiv:1706.02216. [Google Scholar]
  30. Jin, X.B.; Zheng, W.Z.; Kong, J.L. Deep-Learning Forecasting Method for Electric Power Load via Attention-Based Encoder-Decoder with Bayesian Optimization. Energies 2021, 14, 1596. [Google Scholar] [CrossRef]
  31. Chen, Y.; Kang, Y.; Chen, Y. Probabilistic forecasting with temporal convolutional neural network. Neurocomputing 2020, 399, 491–501. [Google Scholar] [CrossRef] [Green Version]
  32. Wang, K.; Qi, X.; Liu, H. Photovoltaic power forecasting based LSTM-Convolutional Network. Energy 2019, 189, 116225. [Google Scholar]
  33. Gilik, A.; Ogrenci, A.S.; Ozmen, A. Air quality prediction using CNN+LSTM-based hybrid deep learning architecture. Environ. Sci. Pollut. Res. 2022, 29, 11920–11938. [Google Scholar] [CrossRef]
  34. Wang, Y.; Liao, W.; Chang, Y. Gated Recurrent Unit Network-Based Short-Term Photovoltaic Forecasting. Energies 2018, 11, 2163. [Google Scholar] [CrossRef] [Green Version]
  35. Saeed, A. Hybrid Bidirectional LSTM Model for Short-Term Wind Speed Interval Prediction. IEEE Access 2020, 8, 182283–182294. [Google Scholar] [CrossRef]
  36. Navares, R.; Aznarte, J.L. Predicting air quality with deep learning LSTM: Towards comprehensive models. Ecol. Inform. 2019, 55, 101019. [Google Scholar]
  37. Yu, S.; Wang, J.; Liu, J.S. Rapid Prediction of Respiratory Motion Based on Bidirectional Gated Recurrent Unit Network. IEEE Access 2020, 8, 49424–49435. [Google Scholar] [CrossRef]
  38. Li, J.W.; Zhu, H. A Practical Application for Text-Based Sentiment Analysis Based on Bayes-LSTM Model. J. Phys. Conf. Ser. 2020, 1631, 012035. [Google Scholar] [CrossRef]
  39. Jin, X.B.; Gong, W.T.; Kong, J.L. A Variational Bayesian Deep Network with Data Self-Screening Layer for Massive Time-Series Data Forecasting. Entropy 2022, 24, 335. [Google Scholar] [CrossRef] [PubMed]
Figure 1. The overall structure of BGGRU.
Figure 2. The learning diagram of BGGRU.
Figure 3. Visual illustration of the GraphSAGE sample and aggregate approach.
Figure 4. Examples of data from different districts in Beijing, China. (a) Haidian; (b) Xicheng.
Figure 5. Average PM2.5 prediction results in Beijing, China.
Figure 6. PM2.5 prediction results for 16 districts in Beijing, China.
Table 1. Bayesian hyperparameter optimization results.

| Hyperparameter | Hyperparameter Set with Bayesian Optimization |
|---|---|
| num_layers of GraphSAGE | 4 |
| num_layers of GraphGRU | 1 |
| num_epoch | 11 |
| learning rate | 0.0327 |
| edges_per_node | 3 |
Table 2. Errors in prediction results of different models.

| Model | RMSE | MSE | MAE | R2 |
|---|---|---|---|---|
| TCN [31] | 26.91 | 724.33 | 21.65 | 0.40 |
| ConvLSTM [32] | 25.46 | 648.65 | 20.05 | 0.47 |
| CNN-LSTM [33] | 24.85 | 617.76 | 19.76 | 0.49 |
| GRU [34] | 15.93 | 253.85 | 11.49 | 0.86 |
| BiLSTM [35] | 15.78 | 248.88 | 11.93 | 0.87 |
| LSTM [36] | 15.27 | 233.35 | 10.66 | 0.87 |
| BiGRU [37] | 13.22 | 174.65 | 9.74 | 0.91 |
| bLSTM [38] | 7.02 | 49.28 | 5.15 | 0.97 |
| bGRU [39] | 6.71 | 45.09 | 5.07 | 0.97 |
| BGGRU (proposed) * | **2.66** | **7.11** | **1.99** | **0.99** |

* The bold part shows the best performance among the different models for each indicator.
Table 3. Comparison of prediction results between the spatial module alone and the overall model.

| Model | RMSE | MSE | MAE | R2 |
|---|---|---|---|---|
| GGRU | 7.39 | 54.75 | 6.88 | 0.96 |
| BGGRU | **2.66** | **7.11** | **1.99** | **0.99** |

The bold part shows the best performance for each indicator.
Table 4. The results of the hyperparameter set obtained with Bayesian optimization and empirical selection.

| Hyperparameter | Bayesian Optimization | Empirical Selection |
|---|---|---|
| num_layers of GraphSAGE | 4 | 3 |
| num_layers of GraphGRU | 1 | 3 |
| num_epoch | 11 | 10 |
| learning rate | 0.0327 | 0.05 |
| edges_per_node | 3 | 3 |
Table 5. Comparison of prediction results between the hyperparameter set optimized by the Bayesian method and that selected through experience.

| Model | RMSE | MSE | MAE | R2 |
|---|---|---|---|---|
| GGRU | 3.37 | 11.35 | 2.78 | 0.99 |
| BGGRU | **2.66** | **7.11** | **1.99** | **0.99** |

The bold part shows the best performance for each indicator.

Jin, X.-B.; Wang, Z.-Y.; Kong, J.-L.; Bai, Y.-T.; Su, T.-L.; Ma, H.-J.; Chakrabarti, P. Deep Spatio-Temporal Graph Network with Self-Optimization for Air Quality Prediction. Entropy 2023, 25, 247. https://doi.org/10.3390/e25020247
