Self-Attention ConvLSTM for Spatiotemporal Forecasting of Short-Term Online Car-Hailing Demand

Ge, Hongxia; Li, Siteng; Cheng, Rongjun; Chen, Zhenlei

doi:10.3390/su14127371


¹ Faculty of Maritime and Transportation, Ningbo University, Ningbo 315211, China

² Jiangsu Province Collaborative Innovation Center for Modern Urban Traffic Technologies, Nanjing 210096, China

³ National Traffic Management Engineering and Technology Research Centre, Ningbo University Sub-Center, Ningbo 315211, China

* Author to whom correspondence should be addressed.

Sustainability 2022, 14(12), 7371; https://doi.org/10.3390/su14127371

Submission received: 29 May 2022 / Revised: 13 June 2022 / Accepted: 13 June 2022 / Published: 16 June 2022

(This article belongs to the Special Issue Application of Emerging Simulation Technologies in Achieving Sustainable Transportation Systems)


Abstract:

As a flourishing basic transportation service in recent years, online car-hailing has achieved great success in metropolitan cities. Accurate spatiotemporal forecasting plays a significant role in deploying service networks for online car-hailing demand. This paper proposes embedding a self-attention mechanism in convolutional long short-term memory (ConvLSTM) to accurately predict online car-hailing demand. The mechanism addresses the weakness of ConvLSTM in capturing spatial correlation over a large spatial extent. Furthermore, it generates features by aggregating pair-wise similarity scores of features at all positions of the input and memory, and thus captures long-range spatiotemporal dependencies. First, the online car-hailing trajectory dataset was converted into images after geographic grid matching, and image enhancement was performed by cropping. Then, the effectiveness of the ConvLSTM embedded with a self-attention mechanism (SA-ConvLSTM) was demonstrated by comparing it to existing models. The experimental results showed that the proposed model performed better than the existing models, and that encoding spatiotemporal information in images yielded better predictions than encoding spatial information in time-series pixels.

Keywords:

online car-hailing demand; spatiotemporal forecasting; ConvLSTM; attention mechanism; deep learning

1. Introduction

Building an intelligent, green, efficient, and safe integrated transportation system requires a new form of transportation service that injects new vitality into intelligent transportation technology innovation and market expansion. The full use of app-based ridesharing mobility resources can relieve various traffic pressures and help create a new pattern of sustainable transportation. Online car-hailing is a supplement to urban public transit, and accurate prediction of it can guide users to avoid peaks and congestion, and improve the quality of passenger travel experience. The research on the spatiotemporal data of online car-hailing travel demand can not only help residents make more reasonable travel decisions and improve their travel efficiency, but also assist the city planners in the improvement of route usage and help drivers make wise travelling decisions [1].

Short-term demand forecasting for online car-hailing is a typical short-term time-series forecasting problem. Existing short-term prediction methods and models in the traffic field focus on the prediction of traffic flow, average speed of road sections, and so on [2]. They fall into four main categories: prediction models based on linear system theory [3,4,5,6,7,8,9], prediction models based on nonlinear system theory [10,11,12,13], intelligent prediction models based on machine learning [14,15,16,17,18], and combined forecast models [19].

Online car-hailing is a relatively new industry, but taxis and online car-hailing have strong similarities, so demand forecasting for each can draw on the other. Sitorus et al. [20] used a support vector machine (SVM) algorithm to analyze telematic data and predict the risk of online transportation trips. Stefano et al. [21] considered multi-source datasets and the correlation between travel demand and flow, and extended Kalman filter theory to the Local Ensemble Transformed Kalman Filter (LETKF) to predict the travel demand of online car-hailing. Nie [22] used a long short-term memory (LSTM) network to predict demand based on regional division, and verified that the prediction accuracy was high in practice. Yu et al. [23] used an LSTM model to predict the trend of car-hailing demand over time, taking into account the characteristics of time period, week, weather conditions, and so on; they found that this model can better predict the short-term demand. Jiang et al. [24] built an online car-hailing travel demand prediction model based on a least squares support vector machine (LS-SVM), and their experiments showed that its prediction results were better than those of other methods. Rahman et al. [25] used a one-dimensional convolutional neural network (1D CNN) and a zone-distributed independently recurrent neural network (IndRNN) to build a spatiotemporal deep learning structure, and tested it on real online car-hailing data; their network outperformed traditional time series models and machine learning models. To analyze the spatiotemporal correlation of car-hailing data, Ke et al. [26] combined the ConvLSTM model and the LSTM network to extract the spatiotemporal and temporal characteristics of car-hailing travel demand, and verified the effectiveness of the approach through experiments. Chen et al. [27] proposed a model for predicting the departure and arrival flows of taxis based on graph convolutional networks (GCN), taking road sections as nodes and taxi demand as the node features. The departure flow prediction task and the arrival flow prediction task were then combined, and a multi-task learning strategy was adopted to speed up the learning process and obtain a more generalized result. Geng et al. [28] proposed the spatiotemporal multi-graph convolution network (ST-MGCN) for ride-hailing demand forecasting; evaluated on two real large-scale ride-hailing demand datasets, the model achieved over 10% improvement over state-of-the-art baselines.

The forecasting models reviewed above have been successful in short-term forecasting of online car-hailing demand. However, two main problems remain: (1) when travel demand is converted into images for prediction, the granularity of some models was too fine, which brought problems such as data sparseness and increased computational complexity; and (2) with the development of deep learning, more advanced models, methods, and mechanisms have been proposed, such as bidirectional LSTM (Bi-LSTM), ConvLSTM, the attention mechanism, the self-attention mechanism, and the graph neural network (GNN). It is necessary to use these new models and mechanisms to propose more reasonable prediction methods that cover different situations.

Taking the defects in previous studies as our point of departure, we add our own innovations. Our contributions are summarized as follows: (1) We obtained the global spatial dependence of online car-hailing demand by adding a self-attention module within a single layer rather than stacking convolution operations. (2) Additional memory units in the self-attention module were used to capture the spatiotemporal information contained in the online car-hailing demand images. (3) We conducted extensive experiments on a large-scale real-world online car-hailing dataset from Didi Chuxing. The results show that our method consistently outperforms the competing baselines.

2. Related Work

The spatiotemporal prediction of online car-hailing demand based on deep neural network models is the focus of our research. Liu et al. [29] proposed a spatiotemporal data integration model to predict online car-hailing demand, treating the predicted results as individual channels of the image; the proposed integration module first compresses the results and then restores them using a fully convolutional network (CNN). The framework of this model can be used to improve the predictive capabilities of other models. Wang et al. [30] constructed a deep learning travel demand forecasting framework based on deep spatiotemporal ConvLSTM and CNN, and compared the three components of closeness, period, and trend to prove the validity of the model. Tan et al. [31] and Lu et al. [32] both used the ConvLSTM model for spatiotemporal prediction of online car-hailing. Yao et al. [33] proposed a deep multi-view spatiotemporal network (DMVST-Net) to model spatiotemporal relationships; the model includes three views: a temporal view (modeling the correlation between future demand values and recent time points through LSTM), a spatial view (modeling local spatial correlations through a local CNN), and a semantic view (modeling correlations between regions with similar temporal patterns).

Previous studies on spatiotemporal prediction that convert online car-hailing demand into time-series image groups mostly chose the fully convolutional network model [29,33] or the ConvLSTM model [30,31,32]. However, Liu et al. [29] only provided the evaluation index of CNN prediction and did not provide a visualization of the prediction. Wang et al. [30] used heat maps to compare the predicted and true values of ConvLSTM and CNN; although the predicted results were better in areas with large data coverage, there were deviations in low-density areas. Tan et al. [31] proposed a model with good predictive effect, but the spatial granularity of the grid is too coarse to reflect the spatiotemporal characteristics of demand in small areas. The spatial granularity of the grid set by Lu et al. [32] is too fine, and the ConvLSTM cannot capture the long-range spatiotemporal relationship, resulting in a mediocre final prediction effect. Yao et al. [33] considered the views comprehensively, but the features of each view were flattened through a fully connected layer to produce the forecast, so only the difference in prediction effect along the temporal dimension was shown. To solve these problems, this paper selected an appropriate granularity according to the area size, data density, and precision during grid division; an additional memory unit was added to the original self-attention mechanism, which was then embedded into ConvLSTM to capture long-range spatiotemporal dependencies; and finally the spatial prediction images arranged in time series were obtained.

3. Preliminaries

In this section, we give explicit definitions of several key concepts and formulate the modules of the model for the car-hailing demand forecasting problem. The methodology framework of this study is shown in Figure 1.

3.1. Detailed Definitions

The time data have only one-dimensional attributes and are strictly sequential, so the time range can be divided into several equal-sized time windows according to the start time and interval. The spatial data generally have multi-dimensional attributes, and the spatial relationships between units are more complex, which often requires grid division.

Definition 1 (Geographic grid).

As shown in Figure 2, grid division refers to dividing areas into checkerboard-shaped grids, and classifying spatial data points into individual grids [34]. In the experiment, we divided the longitude and latitude into $p \times q$ geographic grids. In addition, the grids correspond to the pixels in the image, so neural networks can be used to capture the association between spatial demand information.

Definition 2 (Travel demand).

Travel demand is defined as follows [30]:

T D_{t}^{p, q} = \sum_{D_{t} \in O} |\{O (p, q)\}|

(1)

where $|\{O (p, q)\}|$ denotes the set of geographic grids $(p, q)$; $D_{t} \in O$ indicates that the demand in time slot $t$ belongs to $O$; and $T D_{t}^{p, q}$ is the cumulative demand in time slot $t$ in $O$.

Mathematically, if travel demand is described in the form of a spatiotemporal demand matrix, then $T D_{t}^{p, q}$ can be rewritten as follows:

T D_{t}^{p, q} = \begin{bmatrix} d_{t}^{1, 1} & d_{t}^{2, 1} & \dots & d_{t}^{p, 1} \\ d_{t}^{1, 2} & d_{t}^{2, 2} & \dots & d_{t}^{p, 2} \\ \vdots & \vdots & \ddots & \vdots \\ d_{t}^{1, q} & d_{t}^{2, q} & \dots & d_{t}^{p, q} \end{bmatrix}

(2)

where $d_{t}^{p, q}$ denotes the cumulative demand in the $(p, q)$-th grid in time slot $t$.
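As a minimal sketch of Definitions 1 and 2, the snippet below counts demand points per grid cell for a single time slot. The helper name `demand_matrix`, its arguments, and the toy coordinates are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def demand_matrix(lons, lats, lon_range, lat_range, p, q):
    """Count points per cell of a p x q grid (hypothetical helper).

    Returns an array of shape (q, p): rows index latitude bins and
    columns index longitude bins, mirroring the layout of Eq. (2).
    """
    counts, _, _ = np.histogram2d(
        lats, lons, bins=[q, p], range=[lat_range, lon_range]
    )
    return counts.astype(int)

# Toy example: three pick-up points on a 2 x 2 grid over the unit square.
lons = np.array([0.1, 0.2, 0.9])
lats = np.array([0.1, 0.3, 0.8])
td = demand_matrix(lons, lats, (0.0, 1.0), (0.0, 1.0), p=2, q=2)
```

Two points fall in the first cell and one in the opposite corner, so `td` holds the per-cell cumulative demand for this slot; stacking one such matrix per time slot gives the image sequence used later.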

Definition 3 (Demand prediction task).

According to the above, the travel demand matrices are arranged in chronological order to obtain the travel demand time series $X_{t}^{O}$ containing spatial information:

X_{t}^{O} = \{T D_{t - n_{1} τ}^{p, q}, T D_{t - (n_{1} - 1) τ}^{p, q}, \dots, T D_{t - (n_{1} - n_{2}) τ}^{p, q}, \dots, T D_{t - τ}^{p, q}, T D_{t}^{p, q}\}

(3)

where $n_{1}$ and $n_{2}$ denote two different time slots, and $τ$ is the time interval of each time slot.

The prediction task is to forecast $\{T D_{t - (n_{1} - n_{2}) τ}^{p, q}, \dots, T D_{t - τ}^{p, q}, T D_{t}^{p, q}\}$ using the historical demand data $\{T D_{t - n_{1} τ}^{p, q}, T D_{t - (n_{1} - 1) τ}^{p, q}, \dots, T D_{t - (n_{1} - n_{2} + 1) τ}^{p, q}\}$.
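The index bookkeeping in Definition 3 can be sketched as follows. The function name is hypothetical, and the values $n_{1} = 23$, $n_{2} = 12$ are an assumption chosen so that the window matches the 12-in, 12-out setup described in Section 4.3.

```python
def split_history_and_target(frames, n1, n2):
    """Split the n1 + 1 most recent frames into inputs and targets.

    frames[-1] plays the role of TD_t and frames[-(n1 + 1)] the role of
    TD_{t - n1*tau}. The first n2 frames form the history; the remaining
    n1 - n2 + 1 frames are the forecast targets, as in Definition 3.
    """
    window = frames[-(n1 + 1):]
    return window[:n2], window[n2:]

# With n1 = 23 and n2 = 12 the window holds 24 frames: 12 in, 12 out.
frames = list(range(100))          # stand-ins for demand matrices
history, target = split_history_and_target(frames, n1=23, n2=12)
```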

3.2. ConvLSTM Model

The ConvLSTM model was originally developed for precipitation nowcasting, that is, the problem of predicting future rainfall intensity [35]. Its authors extended the idea of LSTM so that the state transitions in the model have a convolutional structure. As shown in Figure 3, it consists of one memory unit and three gates, and the memory unit can be regarded as a memory update gate. The input, forget, and output gates, as well as the memory information, can be expressed by the following formulas [35]:

{\tilde{i}}_{t} = {\tilde{σ}}^{i} ({\tilde{W}}_{i} \otimes [{\tilde{h}}_{t - 1}, {\tilde{X}}_{t}] + {\tilde{b}}_{i})

(4)

{\tilde{f}}_{t} = {\tilde{σ}}^{f} ({\tilde{W}}_{f} \otimes [{\tilde{h}}_{t - 1}, {\tilde{X}}_{t}] + {\tilde{b}}_{f})

(5)

{\tilde{o}}_{t} = {\tilde{σ}}^{o} ({\tilde{W}}_{o} \otimes [{\tilde{h}}_{t - 1}, {\tilde{X}}_{t}] + {\tilde{b}}_{o})

(6)

{\overset{⌢}{g}}_{t} = \tanh ({\tilde{W}}_{c} \otimes [{\tilde{h}}_{t - 1}, {\tilde{X}}_{t}] + {\tilde{b}}_{c})

(7)

{\tilde{c}}_{t} = {\tilde{f}}_{t} ⊙ {\tilde{c}}_{t - 1} + {\tilde{i}}_{t} ⊙ {\overset{⌢}{g}}_{t}

(8)

{\tilde{h}}_{t} = {\tilde{o}}_{t} ⊙ \tanh ({\tilde{c}}_{t})

(9)

where ${\tilde{X}}_{t}$ denotes the input in time slot $t$; ${\tilde{h}}_{t - 1}$ is the hidden state from the previous time slot; ${\tilde{i}}_{t}$, ${\tilde{f}}_{t}$, ${\tilde{o}}_{t}$, and ${\overset{⌢}{g}}_{t}$ are the input gate, forget gate, output gate, and update gate in time slot $t$, respectively; ${\tilde{c}}_{t}$ is the state of the candidate memory in time slot $t$; $\otimes$ is the convolution operation; $⊙$ is the Hadamard product; ${\tilde{σ}}^{i}$, ${\tilde{σ}}^{f}$, and ${\tilde{σ}}^{o}$ are the activation functions of the input gate, forget gate, and output gate, respectively; and ${\tilde{W}}_{i}$, ${\tilde{W}}_{f}$, ${\tilde{W}}_{c}$, ${\tilde{W}}_{o}$ and ${\tilde{b}}_{i}$, ${\tilde{b}}_{f}$, ${\tilde{b}}_{c}$, ${\tilde{b}}_{o}$ are the weights and biases for calculating ${\tilde{i}}_{t}$, ${\tilde{f}}_{t}$, ${\overset{⌢}{g}}_{t}$, and ${\tilde{o}}_{t}$, respectively.
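To make Eqs. (4)-(9) concrete, here is a minimal NumPy sketch of a single ConvLSTM step on one sample. It assumes sigmoid gate activations (the standard ConvLSTM choice; Section 3.2 leaves the activations generic), zero "same" padding, and a channels-last layout. The helper names and the parameter dictionary are illustrative, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv2d_same(x, w, b):
    """Zero-padded 'same' 2D cross-correlation.

    x: (H, W, C_in), w: (k, k, C_in, C_out) with odd k, b: (C_out,).
    """
    k = w.shape[0]
    pad = k // 2
    h, wd = x.shape[:2]
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros((h, wd, w.shape[3])) + b
    for i in range(k):
        for j in range(k):
            # Shifted view of the input times the kernel slice at (i, j).
            out += xp[i:i + h, j:j + wd, :] @ w[i, j]
    return out

def convlstm_step(x_t, h_prev, c_prev, params):
    """One ConvLSTM step following Eqs. (4)-(9).

    params maps each of "i", "f", "o", "c" to a (weight, bias) pair.
    """
    z = np.concatenate([h_prev, x_t], axis=-1)      # [h_{t-1}, X_t]
    i_t = sigmoid(conv2d_same(z, *params["i"]))     # Eq. (4)
    f_t = sigmoid(conv2d_same(z, *params["f"]))     # Eq. (5)
    o_t = sigmoid(conv2d_same(z, *params["o"]))     # Eq. (6)
    g_t = np.tanh(conv2d_same(z, *params["c"]))     # Eq. (7)
    c_t = f_t * c_prev + i_t * g_t                  # Eq. (8)
    h_t = o_t * np.tanh(c_t)                        # Eq. (9)
    return h_t, c_t

# Toy usage: a 20 x 20 single-channel frame, 10 hidden channels, 3 x 3 kernels.
rng = np.random.default_rng(0)
H, W, C_in, C_hid, k = 20, 20, 1, 10, 3
params = {
    gate: (0.1 * rng.standard_normal((k, k, C_hid + C_in, C_hid)), np.zeros(C_hid))
    for gate in "ifoc"
}
x = rng.random((H, W, C_in))
h, c = convlstm_step(x, np.zeros((H, W, C_hid)), np.zeros((H, W, C_hid)), params)
```

Because every gate is computed by a small local kernel, each output position only sees a $3 \times 3$ neighbourhood per step, which is the limitation the self-attention module in Section 3.3 is meant to address.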

3.3. Self-Attention Module

Figure 4 shows the pipeline of the self-attention module. The module receives two inputs: the input feature $H_{t}$, which is the hidden state in ConvLSTM in time slot $t$, and the input feature $M_{t - 1}$, which is the memory in time slot $t - 1$. The original feature maps $H_{t}$ are mapped into different feature spaces as the query $Q_{h} = W_{h q} H_{t} \in ℝ^{\hat{C} \times N}$, the key $K_{h} = W_{h k} H_{t} \in ℝ^{\hat{C} \times N}$, and the value $V_{h} = W_{h v} H_{t} \in ℝ^{C \times N}$. The memory is mapped into the key $K_{m} = W_{m k} M_{t - 1} \in ℝ^{\hat{C} \times N}$ and the value $V_{m} = W_{m v} M_{t - 1} \in ℝ^{C \times N}$. Here, $\{W_{h q}, W_{h k}, W_{h v}, W_{m k}, W_{m v}\}$ is a set of weights for $1 \times 1$ convolutions, $C$ and $\hat{C}$ are numbers of channels, and $N = H \times W$ is the number of spatial positions. The output is ${\hat{H}}_{t}$ and the updated memory is $M_{t}$ [36].

The whole pipeline can be separated into three parts: the feature aggregation to obtain the global information, the memory updating, and the output [36].

(1) Feature Aggregation. The similarity scores of each pair of points are calculated by applying the matrix product as:

e_{Δ} = Q_{h}^{T} K_{Δ} \in ℝ^{N \times N}, Δ \in \{h, m\}

(10)

The similarity between the $i$-th point and the $j$-th point can be indexed as:

e_{h; i, j} = (H_{t, i}^{T} W_{h q}^{T}) (W_{h k} H_{t, j})

(11)

e_{m; i, j} = (H_{t, i}^{T} W_{h q}^{T}) (W_{m k} M_{t - 1, j})

(12)

where $H_{t, i}$, $H_{t, j}$, $M_{t - 1, i}$, and $M_{t - 1, j}$ are feature vectors with the shape $C \times 1$. Then, all weights used for aggregating features are normalized by applying the SoftMax function along columns:

α_{Δ; i, j} = \frac{\exp e_{Δ; i, j}}{\sum_{k = 1}^{N} \exp e_{Δ; i, k}}, Δ \in \{h, m\}, i, j \in \{1, 2, \dots, N\}

(13)

The aggregated feature of the $i$-th location is calculated by a weighted sum across all $N$ locations:

Z_{Δ; i} = \sum_{j = 1}^{N} α_{Δ; i, j} V_{Δ; j} = \{\begin{matrix} \sum_{j = 1}^{N} α_{h; i, j} (W_{h v} H_{t, j}) \\ \sum_{j = 1}^{N} α_{m; i, j} (W_{m v} M_{t - 1, j}) \end{matrix}

(14)

Finally, the aggregated feature $Z$ can be obtained with $Z = W_{z} [Z_{h}; Z_{m}]$.
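The feature aggregation step of Eqs. (10)-(14) can be sketched in NumPy as below. The $1 \times 1$ convolutions are written as plain matrix products over flattened feature maps; the shape of $W_{z}$ (mapping $2C$ channels back to $C$) and all function and key names are assumptions made for illustration.

```python
import numpy as np

def softmax_rows(e):
    """Softmax over the second index, so each row sums to 1 (Eq. 13)."""
    e = e - e.max(axis=1, keepdims=True)   # numerical stability
    w = np.exp(e)
    return w / w.sum(axis=1, keepdims=True)

def aggregate(H_t, M_prev, W):
    """Feature aggregation of Eqs. (10)-(14).

    H_t, M_prev: (C, N) feature maps flattened over N = H * W positions.
    W: dict of 1x1-convolution weights written as matrices:
       "hq", "hk", "mk": (C_hat, C); "hv", "mv": (C, C); "z": (C, 2C).
    """
    Q_h = W["hq"] @ H_t                              # (C_hat, N)
    K_h, V_h = W["hk"] @ H_t, W["hv"] @ H_t
    K_m, V_m = W["mk"] @ M_prev, W["mv"] @ M_prev
    a_h = softmax_rows(Q_h.T @ K_h)                  # (N, N), Eqs. (10), (13)
    a_m = softmax_rows(Q_h.T @ K_m)
    Z_h = V_h @ a_h.T                                # column i = sum_j a[i, j] * V[:, j]
    Z_m = V_m @ a_m.T
    Z = W["z"] @ np.concatenate([Z_h, Z_m], axis=0)  # (C, N)
    return Z, a_h, a_m

# Toy usage on a 3 x 3 feature map with C = 4 and C_hat = 2.
rng = np.random.default_rng(0)
C, C_hat, N = 4, 2, 9
shapes = {"hq": (C_hat, C), "hk": (C_hat, C), "mk": (C_hat, C),
          "hv": (C, C), "mv": (C, C), "z": (C, 2 * C)}
W = {name: rng.standard_normal(shape) for name, shape in shapes.items()}
Z, a_h, a_m = aggregate(rng.standard_normal((C, N)), rng.standard_normal((C, N)), W)
```

Each row of `a_h` and `a_m` is a distribution over all $N$ positions, so every location can draw on every other location in a single step, regardless of distance on the grid.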

(2) Memory Updating. A gating mechanism is used to update the memory $M$ adaptively, so that the self-attention module can capture long-range dependencies in both the spatial and temporal domains. The aggregated feature $Z$ and the original input $H_{t}$ are used to produce the values of the input gate ${i^{'}}_{t}$ and the fused feature ${g^{'}}_{t}$. In addition, the forget gate is replaced by $1 - {i^{'}}_{t}$ to reduce parameters. The updating process can be formulated as follows [36]:

{i^{'}}_{t} = σ (W_{m; z i} \otimes Z + W_{m; h i} \otimes H_{t} + b_{m; i})

(15)

{g^{'}}_{t} = \tanh (W_{m; z g} \otimes Z + W_{m; h g} \otimes H_{t} + b_{m; g})

(16)

M_{t} = (1 - {i^{'}}_{t}) ⊙ M_{t - 1} + {i^{'}}_{t} ⊙ {g^{'}}_{t}

(17)

Compared with the original memory cell $c$ in the ConvLSTM, which is updated by convolution operations only, the memory $M$ is updated by not only convolution operations but also the aggregated features $Z$, obtaining the global spatial dependency in a timely manner. Therefore, we argue that $M_{t - 1}$ is able to contain global past spatiotemporal information.

(3) Output. Finally, the output feature ${\hat{H}}_{t}$ is the element-wise (Hadamard) product of the output gate ${o^{'}}_{t}$ and the updated memory $M_{t}$, which can be formulated as follows [36]:

{o^{'}}_{t} = σ (W_{m; z o} \otimes Z + W_{m; h o} \otimes H_{t} + b_{m; o})

(18)

{\hat{H}}_{t} = {o^{'}}_{t} ⊙ M_{t}

(19)
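The memory update and output steps in Eqs. (15)-(19) can be sketched as follows. To keep the example short, every convolution is simplified to a $1 \times 1$ kernel and written as a matrix product over flattened feature maps; this simplification, the function name, and the parameter keys are assumptions, not the configuration used in the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def update_memory(Z, H_t, M_prev, P):
    """Eqs. (15)-(19) with each convolution simplified to a 1x1 kernel.

    Z, H_t, M_prev: (C, N). P holds (C, C) matrices "zi", "hi", "zg",
    "hg", "zo", "ho" and (C, 1) biases "bi", "bg", "bo".
    """
    i_t = sigmoid(P["zi"] @ Z + P["hi"] @ H_t + P["bi"])    # Eq. (15)
    g_t = np.tanh(P["zg"] @ Z + P["hg"] @ H_t + P["bg"])    # Eq. (16)
    M_t = (1.0 - i_t) * M_prev + i_t * g_t                  # Eq. (17)
    o_t = sigmoid(P["zo"] @ Z + P["ho"] @ H_t + P["bo"])    # Eq. (18)
    H_hat = o_t * M_t                                       # Eq. (19)
    return H_hat, M_t

# Sanity check with all-zero parameters: both gates equal 0.5.
C, N = 4, 9
P = {k: np.zeros((C, C)) for k in ("zi", "hi", "zg", "hg", "zo", "ho")}
P.update({k: np.zeros((C, 1)) for k in ("bi", "bg", "bo")})
M_prev = np.ones((C, N))
H_hat, M_t = update_memory(np.zeros((C, N)), np.zeros((C, N)), M_prev, P)
```

With all-zero parameters the input and output gates are both 0.5 and the fused feature is 0, so $M_{t} = 0.5\,M_{t - 1}$ and ${\hat{H}}_{t} = 0.25\,M_{t - 1}$. This makes the role of $1 - {i^{'}}_{t}$ visible: the memory is always a convex combination of its previous value and the new fused feature.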

3.4. Self-Attention ConvLSTM Model

We embed the self-attention module (SAM) into the ConvLSTM to construct the SA-ConvLSTM, as illustrated in Figure 5. If the SAM is removed, the SA-ConvLSTM degenerates into the standard ConvLSTM. The internal calculation formulas for the self-attention module were introduced in Section 3.3. The units of SA-ConvLSTM shown in Figure 5 can be represented by the following formulas [36]:

{\overset{⌢}{X}}_{t} = S A ({\tilde{X}}_{t})

(20)

{\overset{⌢}{H}}_{t - 1} = S A ({\tilde{h}}_{t - 1})

(21)

i_{t} = σ^{i} (W_{i} \otimes [{\overset{⌢}{H}}_{t - 1}, {\overset{⌢}{X}}_{t}] + b_{i})

(22)

f_{t} = σ^{f} (W_{f} \otimes [{\overset{⌢}{H}}_{t - 1}, {\overset{⌢}{X}}_{t}] + b_{f})

(23)

o_{t} = σ^{o} (W_{o} \otimes [{\overset{⌢}{H}}_{t - 1}, {\overset{⌢}{X}}_{t}] + b_{o})

(24)

g_{t} = \tanh (W_{c} \otimes [{\overset{⌢}{H}}_{t - 1}, {\overset{⌢}{X}}_{t}] + b_{c})

(25)

{\overset{⌢}{c}}_{t} = f_{t} ⊙ {\overset{⌢}{c}}_{t - 1} + i_{t} ⊙ g_{t}

(26)

{\overset{⌢}{H}}_{t} = o_{t} ⊙ \tanh ({\overset{⌢}{c}}_{t})

(27)

In the experiment, a Batch Normalization layer was added after each SA-ConvLSTM layer [37]. On the one hand, it can reduce the training time, make deep networks easier to train, and allow a larger learning rate; on the other hand, it can make the learning process less sensitive to initialization, help prevent overfitting, and reduce the need for regularization. The formulas for this process are as follows [37]:

For the $l$-th layer with $a$-dimensional input $x_{l} = (x_{l}^{(1)}, x_{l}^{(2)}, \dots, x_{l}^{(a)})$, each dimension will be normalized using the following formula:

{\overset{⌢}{x}}_{l}^{(b)} = \frac{x_{l}^{(b)} - E [x_{l}^{(b)}]}{\sqrt{\mathrm{Var} [x_{l}^{(b)}]}}

(28)

where $x_{l}^{(b)}$ denotes each activation; ${\overset{⌢}{x}}_{l}^{(b)}$ is its normalized value; $E [\cdot]$ stands for the calculated mean; and $\mathrm{Var} [\cdot]$ represents the calculated variance.

For each activation $x_{l}^{(b)}$, a pair of parameters will be used to scale and shift the normalized value:

y_{l}^{(b)} = γ_{l}^{(b)} {\overset{⌢}{x}}_{l}^{(b)} + β_{l}^{(b)}

(29)

where $γ_{l}^{(b)}$ and $β_{l}^{(b)}$ denote the parameters to be learned along with the original model parameters, and they restore the representation power of the network.
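A minimal NumPy sketch of the normalize-then-scale-and-shift computation in Eqs. (28) and (29) is given below. The small constant `eps` added under the square root is a common numerical safeguard and is an assumption here, since it does not appear in Eq. (28); the function name and toy data are likewise illustrative.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each dimension over the batch axis, then scale and shift.

    x: (batch, a) activations; gamma, beta: (a,) learnable parameters.
    """
    mean = x.mean(axis=0)                      # E[x^(b)]
    var = x.var(axis=0)                        # Var[x^(b)]
    x_hat = (x - mean) / np.sqrt(var + eps)    # Eq. (28), eps for stability
    return gamma * x_hat + beta                # Eq. (29)

# Toy usage: 256 samples of a 10-dimensional activation with mean 5, std 3.
rng = np.random.default_rng(0)
x = 3.0 * rng.standard_normal((256, 10)) + 5.0
y = batch_norm(x, gamma=np.ones(10), beta=np.zeros(10))
```

With $γ = 1$ and $β = 0$ the output has approximately zero mean and unit variance per dimension; other values of $γ$ and $β$ let the network recover any affine rescaling it needs.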

In the experiment, we used the three-dimensional convolution (Conv3D) to output the predicted image group. The specific experimental process is shown in Figure 6, where the image enhancement method will be introduced in the next section.

4. Experiments

In this section, we will introduce methods on how to process the online car-hailing trajectories data and fully extract the required spatiotemporal information from them, and then we will verify the effectiveness of the proposed model and compare it with existing methods.

4.1. Dataset

The dataset used in the experiments contained real-world online car-hailing trajectories from the city of Chengdu, a large city in southwestern China [38]. It was provided by the Didi Chuxing GAIA Initiative and covers online car-hailing trajectories from 1 November 2016 to 30 November 2016. An example of the original online car-hailing trajectory data is shown in Table 1, where the first and second columns have both been encrypted.

After the transformation of the spatial coordinate system, the spatial and temporal features hidden in the data are mined and the spatiotemporal data are fused by dividing the spatial and temporal units. The spatiotemporal grid division process of a trajectory is shown in Figure 7. For the online car-hailing trajectory data, while the positioning points are classified, the vector trajectories are also rasterized into grid sequences; this scatters the trajectory data into point coordinates and converts them into tensors. Each grid unit is a pixel in the image. The smaller the granularity of the pixel, the higher the definition. When the granularity of the grid is fine enough, it can truly and accurately reflect the features in the area. However, a granularity that is too small can cause problems such as data sparseness and increased computational complexity.

First, the longitude range of 104.081083° to 104.111122° and the latitude range of 30.653965° to 30.674219° were selected for research, each geographic grid cell was 2500 m² in size, and the research area was divided into 53 × 67 grids. Figure 8 illustrates the process of trajectory data acquisition and image conversion. Second, with a time interval of 10 min, the online car-hailing demand in different regions was counted. Three cumulative demand images are shown in Figure 8, for the time units 00:00, 11:00, and 23:00 on 1 November. In the spatial dimension, these three images show that many grids have missing data. There are three main reasons for this incomplete data: first, the travel proportion of online car-hailing is not high, and the vehicle trajectories cannot cover all areas; second, a large amount of land in the city is not urban road, such as vegetation or buildings, where the demand for online car-hailing is 0; and third, the grid size is too small. Similarly, in the time dimension, there are also many gaps in the data, caused by fluctuations in the demand for online car-hailing over time. For example, at midnight, the demand for online car-hailing is far lower than during the morning and evening rush hours.

As mentioned above, due to the fine granularity of the initial grid, many grids lack data, or the volume in the grid is extremely low. Data with values that are too small are susceptible to random perturbations and are unpredictable. In addition, for the online car-hailing service platform, a fine-grained spatiotemporal grid is not necessarily conducive to vehicle scheduling and may create great difficulties for the scheduling system. Therefore, the initial spatiotemporal grids need to be aggregated. In the spatial dimension, 2 × 2 original grids were merged into a new grid; in the time dimension, 6 original time periods were merged into a new time unit. We completed missing grids and filled in "0" for missing values at this grid granularity. These spatiotemporal integration operations are shown in Figure 9. In addition, the online car-hailing dataset covers only one month. In order to expand the number of samples and improve the training effect, it is necessary to perform image enhancement on the training data. One of the most commonly used image enhancement methods is cropping [39]. As shown in Figure 10, the experiment uses a square sliding window with a length and width of 20 to further clip the spatial grid, increasing the amount of data to 120 times the original.
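The aggregation and cropping steps can be sketched as below. Two details are assumptions rather than statements from the text: the 53 × 67 grid is zero-padded to 54 × 68 before the 2 × 2 merge, and the sliding window moves with a stride of 1. They are chosen because together they reproduce the reported factor of 120, since $(27 - 20 + 1) \times (34 - 20 + 1) = 8 \times 15 = 120$. The function names are hypothetical.

```python
import numpy as np

def aggregate_grid(frames, space=2, time=6):
    """Sum demand over space x space cells and `time` consecutive slots.

    frames: (T, rows, cols) counts. Rows and cols are zero-padded up to a
    multiple of `space` (an assumption); T must be a multiple of `time`.
    """
    t, r, c = frames.shape
    pad_r, pad_c = (-r) % space, (-c) % space
    frames = np.pad(frames, ((0, 0), (0, pad_r), (0, pad_c)))
    r, c = r + pad_r, c + pad_c
    blocks = frames.reshape(t // time, time, r // space, space, c // space, space)
    return blocks.sum(axis=(1, 3, 5))

def sliding_crops(image, size=20, stride=1):
    """All size x size windows of a 2D image, in row-major order."""
    rows, cols = image.shape
    return np.stack([
        image[i:i + size, j:j + size]
        for i in range(0, rows - size + 1, stride)
        for j in range(0, cols - size + 1, stride)
    ])

# One day of 10-min slots on the 53 x 67 grid -> 24 hourly 27 x 34 images.
day = np.ones((144, 53, 67), dtype=int)
hourly = aggregate_grid(day)
crops = sliding_crops(hourly[0])
```

Summing rather than averaging keeps the values as cumulative demand counts, consistent with Definition 2, and the total demand is unchanged by the aggregation.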

4.2. Evaluation Metrics

We use four evaluation metrics that are commonly used to assess traffic forecasting performance: Structural Similarity (SSIM), Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and Coefficient of Determination (R²), which are defined as follows:

SSIM = {[S_{1} (T D_{t}^{i}, {\hat{T D}}_{t}^{i})]}^{ω} {[S_{2} (T D_{t}^{i}, {\hat{T D}}_{t}^{i})]}^{ξ} {[S_{3} (T D_{t}^{i}, {\hat{T D}}_{t}^{i})]}^{ψ}

(30)

MAE = \frac{1}{U} \sum_{i = 1}^{U} |T D_{t}^{i} - {\hat{T D}}_{t}^{i}|

(31)

RMSE = \sqrt{\frac{1}{U} \sum_{i = 1}^{U} {(T D_{t}^{i} - {\hat{T D}}_{t}^{i})}^{2}}

(32)

R^{2} = 1 - \frac{\sum_{i = 1}^{U} {(T D_{t}^{i} - {\hat{T D}}_{t}^{i})}^{2}}{\sum_{i = 1}^{U} {(T D_{t}^{i} - {\bar{T D}}_{t}^{i})}^{2}}

(33)

where $T D_{t}^{i}$ and ${\hat{T D}}_{t}^{i}$ represent the true value and predicted value of the $i$-th grid in time slot $t$; ${\bar{T D}}_{t}^{i}$ represents the mean value of the sample; $U$ is the total number of geographic grids; $S_{1} (T D_{t}^{i}, {\hat{T D}}_{t}^{i})$, $S_{2} (T D_{t}^{i}, {\hat{T D}}_{t}^{i})$, and $S_{3} (T D_{t}^{i}, {\hat{T D}}_{t}^{i})$ compare the brightness, contrast, and structure of the true and predicted images, based on the pixel mean, variance, and covariance, respectively; and $ω$, $ξ$, and $ψ$ are parameters that adjust the relative importance of $S_{1}$, $S_{2}$, and $S_{3}$, respectively.
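The three pointwise metrics in Eqs. (31)-(33) are straightforward to compute; a small NumPy sketch with toy values follows. SSIM is omitted because its component functions and stabilizing constants are not specified here. The function names and sample arrays are illustrative.

```python
import numpy as np

def mae(y, y_hat):
    return float(np.mean(np.abs(y - y_hat)))            # Eq. (31)

def rmse(y, y_hat):
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))    # Eq. (32)

def r2(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)                   # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)              # total sum of squares
    return float(1.0 - ss_res / ss_tot)                 # Eq. (33)

# Toy example: one grid is over-predicted by 2, the rest are exact.
y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.0, 2.0, 3.0, 6.0])
scores = (mae(y, y_hat), rmse(y, y_hat), r2(y, y_hat))
```

For this toy case the scores are MAE = 0.5, RMSE = 1.0, and R² = 0.2, which shows how RMSE penalizes a single large error more heavily than MAE does.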

4.3. Training Configuration

In this section, we introduce the parameter settings and details of the proposed models. Each model is trained with the Adam optimizer at a learning rate of $0.001$. During training, the batch size is set to $32$ and the number of training epochs is set to $50$. The activation function in each Conv3D layer is the Tanh function; the activation function in each ConvLSTM and SA-ConvLSTM layer is the SoftMax function; and the activation function in each layer of the comparison models that take time-series pixels as input is the ReLU function. In the experiment, each ConvLSTM and SA-ConvLSTM layer had $10$ filters of size $3 \times 3$; when Conv3D layers were used as the input and hidden layers, there were $10$ filters per layer; and when a Conv3D layer was used as the output layer, there was only one filter. Each Conv3D filter had a size of $3 \times 3 \times 3$. An early stopping strategy was used to prevent overfitting and end the training. The loss function used to update the model parameters was the Mean Square Error (MSE). TensorFlow and Keras were used to implement the deep learning models described in Section 3.2 and Section 3.4.

Combining the definitions in Section 3.1 with the preprocessing in Section 4.1, the resulting image size was $20 \times 20 \times 1$. Each sample contains a total of $24$ images as input and output groups, of which the first $12$ images were used as input and the last $12$ as output. The experiments used the $2880$ images from the last of the $30$ days as the test set, and the images from the remaining $29$ days were divided into training and validation sets in a ratio of 8 to 2. The model structure of SA-ConvLSTM is shown in Table 2.
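The size of the test set follows from the preprocessing in Section 4.1, as the short calculation below shows. It assumes that the 8 to 2 split is applied to image counts; the text does not say whether the split is made per image or per 24-image sample, so the training and validation figures are indicative only.

```python
HOURS_PER_DAY = 24      # 10-min slots merged in groups of 6 (Section 4.1)
CROPS_PER_IMAGE = 120   # sliding-window crops per hourly image (Section 4.1)
DAYS = 30

images_per_day = HOURS_PER_DAY * CROPS_PER_IMAGE    # 24 * 120 = 2880
test_images = images_per_day                        # the last day
remaining = (DAYS - 1) * images_per_day             # the other 29 days
train_images = remaining * 8 // 10                  # assumed per-image 8:2 split
val_images = remaining - train_images
```

The product $24 \times 120 = 2880$ matches the stated test-set size, which supports reading the time unit after aggregation as one hour.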

4.4. Analysis and Discussion of Prediction Performance

In this section, to demonstrate the validity of the proposed model, we compare it with previous models. For image prediction containing spatiotemporal information, the ConvLSTM and CNN models, which perform well in spatiotemporal prediction, were selected; for time-series pixel prediction containing spatial information, the LSTM, GRU, Bi-LSTM, and bidirectional GRU (Bi-GRU) models, which perform well in time-series prediction, were selected. The time-series pixel prediction proceeded as follows: first, we flattened the pixels in each image into a vector ordered by the grid location of each pixel; second, we arranged the vectors in chronological order to form time-series vectors; finally, according to the settings in the experiment, every 12 historical time-series vectors were used to predict the next 12 time-series vectors.
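The flattening and windowing used for the time-series pixel baselines can be sketched as follows. The stride-1 sliding over time and the function names are assumptions; the text only specifies that 12 historical vectors predict the next 12.

```python
import numpy as np

def to_pixel_series(images):
    """(T, H, W, 1) image sequence -> (T, H*W) time-series vectors.

    Row-major flattening keeps a fixed pixel order, so each column
    tracks one grid location through time.
    """
    return images.reshape(images.shape[0], -1)

def make_windows(series, n_in=12, n_out=12):
    """Slide over time to build (inputs, targets) pairs."""
    xs, ys = [], []
    for s in range(series.shape[0] - n_in - n_out + 1):
        xs.append(series[s:s + n_in])
        ys.append(series[s + n_in:s + n_in + n_out])
    return np.stack(xs), np.stack(ys)

# Toy usage: 30 frames of 20 x 20 x 1 images.
images = np.arange(30 * 20 * 20, dtype=float).reshape(30, 20, 20, 1)
series = to_pixel_series(images)
X, Y = make_windows(series)
```

After flattening, neighbouring pixels are no longer adjacent in any way the recurrent model can exploit, which is why these baselines carry spatial information only implicitly.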

As shown in Table 3, the proposed SA-ConvLSTM prediction network obtained the lowest MAE and RMSE, as well as the highest SSIM and R², mainly because our method accounts for both the local and global correlations of online car-hailing demand in the research area and thus fully explores its spatial and temporal characteristics. Although LSTM and GRU could extract long-term dependence effectively, they were not good at spatial feature mining. Compared to LSTM and GRU, the bidirectional mechanism made Bi-LSTM and Bi-GRU superior in extracting the periodic temporal features of time-series pixels and the contextual relationships between pixels, but they were also insufficient in capturing spatial features. The CNN was not good at temporal feature mining. ConvLSTM was superior to all of these in spatiotemporal feature mining. Figure 11 shows the prediction results for the 1st, 40th, 80th, and 120th cropped sub-images at 12:00 on 30 November. These results indicate that the self-attention module with additional memory units was effective.

In the comparative experiment, we compared the spatiotemporal prediction models of online car-hailing demand images from the previous studies in Section 2 to the proposed model, and reached the following conclusions: the image enhancement method not only increased the sample size, but also solved the problems of fine spatial granularity and irregular image size; and SA-ConvLSTM can capture long-range spatial dependencies and achieve good prediction results. Therefore, our research is not only applicable to the prediction of online car-hailing demand, but may also be applied to object detection in traffic scenes, such as vehicle detection, traffic light detection, and pedestrian detection. The images obtained from traffic scenes are complex, and good prediction results are not only useful for automatic driving, but also help drivers drive safely. However, SA-ConvLSTM has more parameters than ConvLSTM and CNN and takes longer to train, which prevented our equipment from running experiments with larger images.

5. Conclusions

With the purpose of predicting short-term online car-hailing demand, this study investigated a variant of ConvLSTM based on a self-attention mechanism, which mainly addresses the complex spatiotemporal prediction problem. The model adds memory units to the traditional self-attention mechanism so that it can capture long-term temporal dependence. Experiments showed that adding a self-attention module is more efficient than stacking convolutional layers. Compared to convolution operations, the self-attention module showed the capacity to capture the features of special locations far away from the data concentration area, as well as the ability to remember global temporal and spatial dependence.
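To make the idea of aggregating pair-wise similarity scores over all positions concrete, here is a simplified NumPy sketch of one attention branch over a feature map. It is an illustration under stated assumptions, not the module of [36]: the projection matrices `w_q`, `w_k`, and `w_v` are random stand-ins for learned 1 × 1 convolutions, the function names are hypothetical, and the fusion of the hidden-state and memory branches and the gated memory update are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_self_attention(query_map, memory_map, w_q, w_k, w_v):
    """Attend from every position of query_map to every position of memory_map.

    query_map, memory_map: arrays of shape (H, W, C).
    Returns the aggregated features (H, W, C_v) and the (N, N) weights, N = H * W.
    """
    h, w, c = query_map.shape
    q = query_map.reshape(h * w, c) @ w_q    # (N, d) queries
    k = memory_map.reshape(h * w, c) @ w_k   # (N, d) keys
    v = memory_map.reshape(h * w, c) @ w_v   # (N, C_v) values
    scores = q @ k.T                         # pair-wise similarity of all positions
    weights = softmax(scores, axis=-1)       # each row sums to 1
    out = weights @ v                        # weighted sum over all positions
    return out.reshape(h, w, -1), weights

# Toy input matching the 20 x 20 x 10 feature maps in Table 2 (random values).
rng = np.random.default_rng(0)
hidden = rng.normal(size=(20, 20, 10))
memory = rng.normal(size=(20, 20, 10))
w_q, w_k = rng.normal(size=(10, 4)), rng.normal(size=(10, 4))
w_v = rng.normal(size=(10, 10))
features, weights = spatial_self_attention(hidden, memory, w_q, w_k, w_v)
print(features.shape, weights.shape)
```

Because every output position is a weighted sum over all 400 positions, a single layer can relate distant grid cells, whereas a 3 × 3 convolution needs several stacked layers to cover the same extent.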

A reasonable forecast of demand distribution also improves the travel efficiency of users and allows more resources to be utilized effectively without increasing supply. With the development of machine learning, new mechanisms and models continue to emerge, such as transformer models, which can be considered in future work. In addition, more kinds of data related to ride-hailing demand should be considered, such as season, customer income, charging policy, and point-of-interest data. In an actual road network, complex and large-scale online car-hailing demand is more difficult to predict, and more suitable deep learning models are needed, such as those based on graph convolutional networks [27]. Moreover, complex trajectory data contain many characteristics of traffic conditions and driver behavior; for example, some features of route selection decisions can be extracted from them [40], or the behavior of traffic participants can be estimated and their future trajectories predicted [41,42]. These directions will be considered in our future work.

Author Contributions

Methodology, S.L. and Z.C.; software, R.C.; validation, S.L., R.C., H.G. and Z.C.; writing—original draft preparation, H.G.; writing—review and editing, R.C.; visualization, S.L.; supervision, H.G.; funding acquisition, H.G. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Program of Humanities and Social Science of Education Ministry of China (Grant No. 20YJA630008), the Natural Science Foundation of Zhejiang Province, China (Grant No. LY20G010004), and the K.C. Wong Magna Fund in Ningbo University, China.

Data Availability Statement

The data can be obtained from https://gaia.didichuxing.com.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Lu, M.; Lai, C.; Ye, T.; Liang, J.; Yuan, X. Visual Analysis of Multiple Route Choices Based on General GPS Trajectories. IEEE Trans. Big Data 2017, 3, 234–247.
2. Vlahogianni, E.I.; Karlaftis, M.G.; Golias, J.C. Short-term traffic forecasting: Where we are and where we’re going. Transp. Res. Part C Emerg. Technol. 2014, 43, 3–19.
3. Okutani, I.; Stephanedes, Y.J. Dynamic prediction of traffic volume through Kalman filtering theory. Transp. Res. Part B Methodol. 1984, 18, 1–11.
4. Chen, H.; Grant-Muller, S. Use of sequential learning for short-term traffic flow forecasting. Transp. Res. Part C Emerg. Technol. 2001, 9, 319–336.
5. Cai, L.R.; Zhang, Z.C.; Yang, J.J.; Yu, Y.D.; Zhou, T.; Qin, J. A noise-immune Kalman filter for short-term traffic flow forecasting. Phys. A Stat. Mech. Appl. 2019, 536, 122601.
6. Chien, S.I.J.; Kuchipudi, C.M. Dynamic travel time prediction with real-time and historic data. J. Transp. Eng. 2003, 129, 608–616.
7. Ahmed, M.S.; Cook, A.R. Analysis of freeway traffic time-series data by using Box-Jenkins techniques. Transp. Res. Rec. 1979, 773, 1–9.
8. Nihan, N.L.; Holmesland, K.O. Use of the Box and Jenkins Time Series Technique in Traffic Forecasting. Transportation 1980, 9, 125–143.
9. Xu, J.; Rahmatizadeh, R.; Boloni, L.; Turgut, D. Real-Time Prediction of Taxi Demand Using Recurrent Neural Networks. IEEE Trans. Intell. Transp. Syst. 2018, 19, 2572–2581.
10. Jiang, X.M.; Adeli, H. Dynamic wavelet neural network model for traffic flow forecasting. J. Transp. Eng. 2005, 131, 771–779.
11. Zhang, H.; Wang, X.M.; Cao, J.; Tang, M.N.; Guo, Y.R. A multivariate short-term traffic flow forecasting method based on wavelet analysis and seasonal time series. Appl. Intell. 2018, 48, 3827–3838.
12. Disbro, J.E.; Frame, M. Traffic Flow Theory and Chaotic Behavior. Transp. Res. Rec. J. Transp. Res. Board 1989, 1225, 109–115.
13. Forbes, G.J.; Hall, F. The applicability of catastrophe theory in modelling freeway traffic operations. Transp. Res. Part A Gen. 1990, 24, 335–344.
14. Smith, B.L.; Demetsky, M.J. Short-term traffic flow prediction: Neural network approach. Transp. Res. Rec. 1994, 1453, 98–104.
15. Cheng, T.; Haworth, J.; Wang, J. Spatio-temporal autocorrelation of road network data. J. Geogr. Syst. 2012, 14, 389–413.
16. Dia, H. An object-oriented neural network approach to short-term traffic forecasting. Eur. J. Oper. Res. 2001, 131, 253–261.
17. Yu, F.; Xu, X.Z. A short-term load forecasting model of natural gas based on optimized genetic algorithm and improved BP neural network. Appl. Energy 2014, 134, 102–113.
18. Castro-Neto, M.; Jeong, Y.S.; Jeong, M.K.; Han, L.D. Online-SVR for short-term traffic flow prediction under typical and atypical traffic conditions. Expert Syst. Appl. 2009, 36, 6164–6173.
19. An, J.Y.; Fu, L.; Hu, M.; Chen, W.H. A Novel Fuzzy-Based Convolutional Neural Network Method to Traffic Flow Prediction with Uncertain Traffic Accident Information. IEEE Access 2019, 7, 20708–20722.
20. Sitorus, C.M.; Rizal, A.; Jajuli, M. Prediksi Risiko Perjalanan Transportasi Online Dari Data Telematik Menggunakan Algoritma Support Vector Machine. J. Tek. Inform. Sist. Inf. 2020, 6, 254–265.
21. Stefano, C.; Ernesto, C.; Livia, M.; Marialisa, N. Dynamic demand estimation and prediction for traffic urban networks adopting new data sources. Transp. Res. Part C Emerg. Technol. 2017, 81, 83–98.
22. Nie, Y.M. How can the taxi industry survive the tide of ridesourcing? Evidence from Shenzhen, China. Transp. Res. Part C Emerg. Technol. 2017, 79, 242–256.
23. Yu, D.B.; Li, Z.P.; Zhong, Q.L.; Yi, A.; Chen, W. Demand Management of Station-Based Car Sharing System Based on Deep Learning Forecasting. J. Adv. Transp. 2020, 2020, 8935857.
24. Jiang, S.; Chen, W.T.; Li, Z.H.; Yu, H.Y. Short-Term Demand Prediction Method for Online Car-Hailing Services Based on a Least Squares Support Vector Machine. IEEE Access 2019, 7, 11882–11891.
25. Rahman, M.H.; Rifaat, S.M. Using spatio-temporal deep learning for forecasting demand and supply-demand gap in ride-hailing system with anonymized spatial adjacency information. IET Intell. Transp. Syst. 2021, 15, 941–957.
26. Ke, J.T.; Zheng, H.Y.; Yang, H.; Chen, X.Q. Short-term forecasting of passenger demand under on-demand ride services: A spatio-temporal deep learning approach. Transp. Res. Part C Emerg. Technol. 2017, 85, 591–608.
27. Chen, Z.; Zhao, B.; Wang, Y.H.; Duan, Z.T.; Zhao, X. Multitask Learning and GCN-Based Taxi Demand Prediction for a Traffic Road Network. Sensors 2020, 20, 3776.
28. Geng, X.; Li, Y.G.; Wang, L.Y.; Zhang, L.Y.; Liu, Y. Spatiotemporal Multi-Graph Convolution Network for Ride-Hailing Demand Forecasting. Proc. AAAI Conf. Artif. Intell. 2019, 33, 3656–3663.
29. Liu, Y.; Lyu, C.; Khadka, A.; Zhang, W.B. Spatio-Temporal Ensemble Method for Car-Hailing Demand Prediction. IEEE Trans. Intell. Transp. Syst. 2020, 21, 5328–5333.
30. Wang, D.J.; Yang, Y.; Ning, S.M. DeepSTCL: A Deep Spatio-temporal ConvLSTM for Travel Demand Prediction. In Proceedings of the 2018 International Joint Conference on Neural Networks (IJCNN), Rio de Janeiro, Brazil, 8–13 July 2018; pp. 1–8.
31. Tan, L.Y.; Zhang, Z.Y.; Jiang, W.W. Ride-Hailing Service Prediction Based on Deep Learning. Int. J. Mach. Learn. Comput. 2022, 12, 1.
32. Lu, X.J.; Ma, C.X.; Qiao, Y.H. Short-term demand forecasting for online car-hailing using ConvLSTM networks. Phys. A 2021, 570, 125838.
33. Yao, H.X.; Wu, F.; Ke, J.T.; Tang, X.F.; Jia, Y.T.; Lu, S.Y.; Gong, P.H.; Ye, J.P.; Li, Z.H. Deep Multi-View Spatial-Temporal Network for Taxi Demand Prediction. Proc. AAAI Conf. Artif. Intell. 2018, 32, 2588–2595.
34. Savchuk, O. Large-Scale Dynamics of Hypoxia in the Baltic Sea. Chem. Struct. Pelagic Redox Interfaces 2010, 22, 137–160.
35. Shi, X.J.; Chen, Z.R.; Wang, H.; Yeung, D.Y.; Wong, W.K.; Woo, W.C. Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting. Adv. Neural Inf. Processing Syst. 2015, 28, 802–810. Available online: https://proceedings.neurips.cc/paper/2015/file/07563a3fe3bbe7e3ba84431ad9d055af-Paper.pdf (accessed on 28 May 2022).
36. Lin, Z.H.; Li, M.M.; Zheng, Z.B.; Cheng, Y.Y.; Yuan, C. Self-Attention ConvLSTM for Spatiotemporal Prediction. Proc. AAAI Conf. Artif. Intell. 2020, 34, 11531–11538.
37. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. Int. Conf. Mach. Learn. 2015, 37, 448–456. Available online: http://proceedings.mlr.press/v37/ioffe15.pdf (accessed on 28 May 2022).
38. Didi Chuxing. Available online: https://outreach.didichuxing.com/app-vue/HaiKou? (accessed on 1 March 2021).
39. Krizhevsky, A.; Sutskever, I.; Hinton, G. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90.
40. Schlaich, J. Analyzing route choice behavior with mobile phone trajectories. Transp. Res. Rec. 2010, 2157, 78–85.
41. Gindele, T.; Brechtel, S.; Dillmann, R. A probabilistic model for estimating driver behaviors and vehicle trajectories in traffic environments. IEEE Trans. Intell. Transp. Syst. 2010, 17, 2751–2766.
42. Tinessa, F.; Marzano, V.; Papola, A.; Montanino, M.; Simonelli, F. CoNL route choice model: Numerical assessment on a real dataset of trajectories. In Proceedings of the 2019 6th International Conference on Models and Technologies for Intelligent Transportation Systems (MT-ITS), Cracow, Poland, 5–7 June 2019; pp. 1–10.

Figure 1. Framework of methodology.

Figure 2. Geographic grids.

Figure 3. ConvLSTM cell diagram [35].

Figure 4. The illustration of the self-attention module [36].

Figure 5. The usage of self-attention module for the ConvLSTM (SA-ConvLSTM) [36].

Figure 6. Spatiotemporal prediction of SA-ConvLSTM.

Figure 7. Example of spatiotemporal grid division process of a trajectory.

Figure 8. Explanation of trajectory data collection and image conversion process.

Figure 9. The cumulative online car-hailing demand. The three images show the cumulative demand at 00:00, 11:00, and 23:00, respectively, on 1 November after merging the grid granularity.

Figure 10. Grid spatial granularity cropping.

Figure 11. The prediction results of four image subgraphs at 12:00 on 30 November. (a–d) represent the predicted values of each subgraph, respectively; (a’–d’) represent the ground truth values of each subgraph, respectively.

Table 1. Example of original online car-hailing trajectories data.

Driver ID	Order ID	Timestamp	Longitude	Latitude
XXXXXX	XXXXXX	1477969147	104.0751	30.72724
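As an illustration of how a record such as the one in Table 1 could be assigned to a time slot and a grid cell, the sketch below converts the Unix timestamp to local time and bins the coordinates. Everything apart from the record itself is an assumption made for the example: the UTC+8 time zone, the grid origin (`lon_min`, `lat_min`), the cell size, and the function names are hypothetical and are not the grid parameters used in the paper.

```python
import math
from datetime import datetime, timedelta, timezone

# Assumption: timestamps are interpreted in China Standard Time (UTC+8).
CST = timezone(timedelta(hours=8))

def to_local_time(unix_seconds):
    """Convert a Unix timestamp in seconds to a timezone-aware datetime."""
    return datetime.fromtimestamp(unix_seconds, tz=CST)

def grid_cell(lon, lat, lon_min, lat_min, cell_deg):
    """Return the (row, col) of the square cell containing (lon, lat)."""
    col = math.floor((lon - lon_min) / cell_deg)
    row = math.floor((lat - lat_min) / cell_deg)
    return row, col

# The example record from Table 1.
when = to_local_time(1477969147)
print(when.isoformat())  # 2016-11-01T10:59:07+08:00

# Hypothetical grid origin and 0.01-degree cells, for illustration only.
print(grid_cell(104.0751, 30.72724, lon_min=104.0, lat_min=30.6, cell_deg=0.01))
```

Counting the records that fall in each (hour, row, col) bin would then yield one demand image per time slot.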

Table 2. The model structure of SA-ConvLSTM.

Layer (Type)	Output Shape	Parameter
InputLayer	(None, None, 20, 20, 1)	0
SaConvLSTM2D_1	(None, None, 20, 20, 10)	4800
BatchNormalization_1	(None, None, 20, 20, 10)	40
SaConvLSTM2D_2	(None, None, 20, 20, 10)	8040
BatchNormalization_2	(None, None, 20, 20, 10)	40
SaConvLSTM2D_3	(None, None, 20, 20, 10)	8040
BatchNormalization_3	(None, None, 20, 20, 10)	40
Conv3D	(None, None, 20, 20, 1)	271
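The parameter counts in Table 2 can be partly checked by hand. The sketch below assumes a Keras-style summary (batch normalization counted as four values per channel), a 3 × 3 × 3 kernel for the final Conv3D layer, and a 3 × 3 kernel for the ConvLSTM gates; the kernel sizes and function names are assumptions that are not restated in this section. Under these assumptions, the Conv3D and BatchNormalization rows are reproduced exactly, and each SaConvLSTM2D layer has 800 parameters beyond a plain ConvLSTM2D layer of the same size, which would correspond to the self-attention memory module.

```python
def conv_params(kernel_volume, in_channels, out_channels):
    """Weights plus one bias per output channel for a convolution layer."""
    return kernel_volume * in_channels * out_channels + out_channels

def batchnorm_params(channels):
    """Gamma, beta, moving mean, and moving variance per channel."""
    return 4 * channels

def convlstm2d_params(kernel_size, in_channels, filters):
    """Four gates, each convolving the concatenated input and hidden state."""
    per_gate = kernel_size * kernel_size * (in_channels + filters) + 1
    return 4 * filters * per_gate

print(conv_params(3 * 3 * 3, 10, 1))      # 271, matches the Conv3D row
print(batchnorm_params(10))               # 40, matches the BatchNormalization rows
print(4800 - convlstm2d_params(3, 1, 10))   # 800 extra in SaConvLSTM2D_1
print(8040 - convlstm2d_params(3, 10, 10))  # 800 extra in SaConvLSTM2D_2
```

The identical remainder in the first and later layers is consistent with an attention module whose size depends only on the 10 hidden channels, not on the number of input channels.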

Table 3. Performance comparison of different models for demand prediction.

Model	SSIM (%)	MAE	RMSE	R² (%)
LSTM	32.97	75.47	134.03	40.37
GRU	34.25	73.85	133.36	41.89
Bi-GRU	46.22	52.92	101.26	65.56
Bi-LSTM	48.03	52.81	101.34	65.62
CNN	93.00	27.11	51.78	91.23
ConvLSTM	94.47	20.86	43.31	93.83
SA-ConvLSTM	95.15	19.31	38.62	95.11
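To put the gap between the two strongest models in Table 3 into relative terms, the short calculation below computes the percentage reduction in MAE and RMSE of SA-ConvLSTM with respect to ConvLSTM. The input values are copied from the table; the helper name is hypothetical.

```python
def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline

# Values from Table 3: ConvLSTM versus SA-ConvLSTM.
mae_gain = relative_reduction(20.86, 19.31)
rmse_gain = relative_reduction(43.31, 38.62)
print(f"MAE reduced by {mae_gain:.2f}%")    # MAE reduced by 7.43%
print(f"RMSE reduced by {rmse_gain:.2f}%")  # RMSE reduced by 10.83%
```

The larger relative reduction in RMSE than in MAE suggests that the gain comes disproportionately from fewer large errors, since RMSE penalizes them more heavily.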

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

