Spatio-Temporal Heterogeneous Graph Neural Networks for Estimating Time of Travel

Wu, Lei; Tang, Yong; Zhang, Pei; Zhou, Ying

doi:10.3390/electronics12061293


by Lei Wu ¹,², Yong Tang ¹,*, Pei Zhang ² and Ying Zhou ²

¹ School of Information Science and Engineering, Yanshan University, Qinhuangdao 054000, China

² School of Economics and Law, Shijiazhuang Tiedao University, Shijiazhuang 050043, China

* Author to whom correspondence should be addressed.

Electronics 2023, 12(6), 1293; https://doi.org/10.3390/electronics12061293

Submission received: 2 February 2023 / Revised: 4 March 2023 / Accepted: 5 March 2023 / Published: 8 March 2023

(This article belongs to the Special Issue Intelligent Analysis and Security Calculation of Multisource Data)


Abstract

Estimating Time of Travel (ETT) is a crucial element of intelligent transportation systems. In most previous studies, time of travel is estimated by identifying the spatio-temporal features of road segments or intersections independently. However, due to continuous changes in road segments and intersections in a path, dynamic features should be coupled and interactive. Therefore, employing only road segment or intersection features is inadequate for improving the accuracy of ETT. To address this issue, we proposed a novel deep learning framework for ETT based on a spatio-temporal heterogeneous graph neural network (STHGNN). Specifically, a heterogeneous traffic graph was first created based on intersections and road segments, which implies an adjacency correlation. Next, a learning approach for spatio-temporal heterogeneous convolutional attention networks was proposed to obtain the spatio-temporal correlations of joint intersections and road segments. This approach integrates temporal and spatial features. Finally, a fusion prediction approach was employed to estimate the travel time of a given path. Experiments were conducted on real-world path datasets to evaluate our proposed model. The results showed that STHGNN significantly outperformed the baselines.

Keywords:

Estimating Time of Travel (ETT); heterogeneous graph neural network; spatio-temporal correlation

1. Introduction

With the process of modernization and urbanization, increasingly powerful sensors and application terminals are used to collect travel path data. This trend is conducive to fully exploiting these data for real-time monitoring and prediction of traffic dynamics in urban areas, thereby promoting the construction of smart cities. Estimating Time of Travel (ETT) is a core function of intelligent transportation systems: it estimates the time a given vehicle needs to travel from a starting point to an endpoint. Accurate ETT benefits route planning, vehicle navigation, etc. [1,2].

A user’s travel path is a sequence of road segments and intersections that appear alternately. Therefore, the features of road segments and intersections are very important in ETT. Specifically, at least 10% of users’ travel time is delayed by stopping at intersections temporarily due to traffic control and congestion [3]. Additionally, the average speeds of vehicles on road segments have a more direct impact on ETT. However, most previous studies utilize the features of road segments or intersections separately to estimate travel time [4,5,6,7]. If the features of the two are integrated, the accuracy of ETT can be improved.

However, it is challenging to model road segments and intersections together. The reasons are as follows: First, intersections and road segments have their own static attributes, such as road segment distances, traffic lights of intersections, and types of intersections. These could affect traffic conditions on road networks and impact travel time. Second, road segments and intersections appear alternately on road networks and have mutual influences on traffic conditions. These mutual influences show non-linear dependence. If congestion occurs at a certain intersection (or road segment), the traffic condition of road segments (or intersections) adjacent to the intersection (or road segment) will be affected due to spatial extensibility and temporal continuity of the congestion.

In this paper, we introduced a heterogeneous network to jointly model intersections and road segments and proposed a novel deep learning framework, STHGNN, based on heterogeneous networks. The general process is as follows: First, we constructed a heterogeneous traffic network containing two kinds of nodes (intersections and road segments) and two relationships (<intersection, road segment> and <road segment, intersection>). Then, we integrated the attributes of adjacent intersections and road segments and completed the initialization and mutual augmentation of the attributes of intersections and road segments. Next, we constructed a spatio-temporal feature learning block (STFL block) composed of a two-layer temporal gated convolutional neural network and a heterogeneous graph attention network to obtain the latent representations of intersections and road segments. Lastly, we estimated the travel time from a given starting point to an endpoint through a two-layer fully connected neural network.

Our main contributions are summarized as follows:

(1): We utilized a heterogeneous information network to model intersections and road segments simultaneously and fused the spatio-temporal features of intersections and road segments to obtain the latent correlation of interactions.
(2): We proposed the STHGNN framework based on deep learning and extracted spatial and temporal features to estimate travel time.

The rest of this paper is organized as follows: The related work is introduced in Section 2. The definitions are given in Section 3. The details of STHGNN that we proposed are described in Section 4. The experiments designed to evaluate our model and the parameters we applied are described in Section 5. Finally, a conclusion is made in Section 6.

2. Related Work

ETT plays an important role in intelligent transportation systems. Traditional solutions for ETT mainly fall into two categories. One is the route-based approach, which is used to estimate travel time by considering road segments and intersections in the path route. SMA modeled the correlation of different road segments based on historical patterns [8]. Wang et al. aggregated the roads adjacent to the starting points and endpoints to estimate the travel time of the entire route [2]. Zheng Wang presented ETT as a regression issue and proposed a wide-deep-recurrent (WDR) architecture [9], where the recurrent module was specially designed to handle time sequences. The other solution is the statistical learning approach, which is used to improve estimation accuracy in a data-driven manner. B. Gupta proposed a gradient-boosting tree regression approach for improving the accuracy of ETT on a road [10]. D. Wang proposed an integrative learning approach for predicting the travel time of a taxi using multiple variants of gradient-boosted models [11]. However, these approaches cannot effectively capture the dynamic spatio-temporal correlations of the traffic on a given path. As a result, prediction accuracy is limited.

In recent years, approaches based on deep learning have become increasingly important for ETT. Classical approaches based on deep learning mainly employ variations of classical deep learning models, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs). Z. Wang proposed an efficient deep learning framework to capture the spatio-temporal dependence on a given path [9]. H. Zhang modeled a road network as a grid graph [12] and designed a deep learning framework to obtain spatial and temporal patterns from the grid graph for ETT. T.-y. Fu proposed a novel deep framework to efficiently integrate multiple information elements for ETT [13]. Although these classical deep learning approaches effectively capture the spatio-temporal correlation, they ignore the spatial structure of real-world road networks.

Graph neural networks (GNNs) can extract important information from graphs and make useful predictions, which makes them effective for representation learning on non-Euclidean data. In the past two years, GNNs have drawn significant attention in ETT research, and a large amount of related work has been carried out. Some researchers attempted to model the spatial and temporal dependence of road traffic using spatio-temporal graph convolutions. In DCRNN [14], traffic flow was regarded as a diffusion process on a directed graph, and diffusion convolutions were introduced to capture spatial dependencies. In addition, a sequence-to-sequence architecture with a gated recurrent unit (GRU) was used to capture temporal dependence. Yu et al. proposed an end-to-end STGNN framework to capture spatial and temporal correlations using GNNs and 1D convolutional neural networks, respectively [15]. Guo et al. proposed an attention-based STGNN framework to better capture long-distance temporal correlations [16]. However, these studies only considered the spatio-temporal attributes of road segments, ignoring the interactive relationship between intersections and road segments. Unlike these studies, our study proposed a new deep learning model based on a heterogeneous traffic network, jointly modeling intersections and road segments for fusion learning.

3. Related Definitions

Definition 1.

Heterogeneous traffic network. The heterogeneous traffic network is a traffic network that is modeled as the heterogeneous graph $G = (V, E, X)$, where $V = \{v \mid v \in SI \text{ or } v \in SS\}$ is the set of nodes, in which $SI$ is the set of intersections and $SS$ is the set of road segments. $E = \{e_{ij} \mid e_{ij} = \langle v_i, v_j \rangle,\ v_i, v_j \in V \text{ and } ((v_i \in SI \text{ and } v_j \in SS) \text{ or } (v_i \in SS \text{ and } v_j \in SI))\}$ is the set of edges, where the edge $\langle v_i, v_j \rangle$ indicates that an intersection and a road segment are connected, so $v_i$ and $v_j$ are nodes of different types. $X = \{x_i\}_{i=1}^{|V|} \in R^{|V| \times n}$ is the feature matrix of nodes, where $x_i$ represents the $n$-dimensional static and dynamic features of node $v_i$.

The heterogeneous traffic network in Definition 1 contains two kinds of nodes, i.e., intersections and road segments, and two kinds of edges (relationships), i.e., R = {<intersection, road segment>, <road segment, intersection>}. Node features can be classified into static features and dynamic features. The static features of intersections reflect intersection type, traffic light type, and quantity of traffic lights. The dynamic features mainly involve average transit time. The static features of road segments reflect indicators such as the length of road segments and the number of lanes. The dynamic features mainly involve time-related traffic flow and average speed. A schema of a simple heterogeneous traffic network is shown in Figure 1.
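As a minimal sketch of Definition 1, the node sets and the two edge relationships could be held in plain Python structures. The helper `build_hetero_graph`, the node names (`I1`, `S1`, ...) and the toy topology are hypothetical and serve only to illustrate the constraint that every edge joins an intersection and a road segment; they are not taken from the authors' implementation.

```python
from collections import defaultdict

def build_hetero_graph(intersections, segments, links):
    """Build a heterogeneous traffic graph from typed nodes.

    intersections: iterable of intersection ids (the set SI)
    segments:      iterable of road-segment ids (the set SS)
    links:         iterable of (intersection, segment) pairs that touch
    Each link yields both <intersection, segment> and <segment, intersection>.
    """
    SI, SS = set(intersections), set(segments)
    adj = defaultdict(set)
    for i, s in links:
        # Edges may only join nodes of different types.
        if i not in SI or s not in SS:
            raise ValueError(f"invalid link: {(i, s)}")
        adj[i].add(s)
        adj[s].add(i)
    node_type = {v: "intersection" for v in SI}
    node_type.update({v: "segment" for v in SS})
    return node_type, dict(adj)

# Hypothetical toy network: a chain I1 - S1 - I2 - S2 - I3.
node_type, adj = build_hetero_graph(
    ["I1", "I2", "I3"],
    ["S1", "S2"],
    [("I1", "S1"), ("I2", "S1"), ("I2", "S2"), ("I3", "S2")],
)
```

Storing the adjacency per node keeps both relationship directions available, which is what a meta path traversal needs.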

To better capture the latent relationship between nodes of the same type in the heterogeneous traffic network, the road network meta path is defined as follows:

Definition 2.

Meta path of a road network. Given the heterogeneous traffic network $G$ and the edge relationship R = {<intersection, segment>, <segment, intersection>}, the meta path Φ of the road network is defined as $v_1 \overset{r}{\to} v_2$ (abbreviated as $v_1 r v_2$), where $v \in V$ and $r \in R$.

Figure 1b shows two examples of meta paths of a road network: ISI (intersection-segment-intersection) and SIS (segment-intersection-segment).

Definition 3.

nth-order neighbors based on meta paths. For a node $i$ in a heterogeneous traffic network, its nth-order neighbors $N_i^n$ are defined as the set of nodes connected to node $i$ through $n$ hops of the meta path Φ of the road network. Figure 1d,e show, respectively, the 1-order neighbors of I₁, which are $N_i^1 = \{I_1, I_2\}$, and the 2-order neighbors of I₁, which are $N_i^2 = \{I_1, I_3, I_4\}$. Note that, as these examples show, the nth-order neighbor set of a node includes the node itself.
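The neighbor sets in Definition 3 can be illustrated with a short traversal. This is only a sketch under an assumed reading: the nth-order neighbors are taken to be the nodes whose shortest meta path distance is exactly n, plus the node itself, which reproduces the two example sets quoted above. The function name `meta_path_neighbors` and the topology are hypothetical, since Figure 1 is not reproduced here.

```python
def meta_path_neighbors(adj, node, n):
    """Return the nth-order neighbors of `node` along same-type meta paths.

    adj maps every node to its directly connected nodes of the other type,
    so one meta path hop (ISI or SIS) is two steps in adj.
    Assumed reading: nodes at shortest meta path distance exactly n,
    plus the node itself.
    """
    frontier, seen = {node}, {node}
    for _ in range(n):
        reached = set()
        for u in frontier:
            for mid in adj[u]:        # step to the other node type
                reached |= adj[mid]   # step back to the original type
        frontier = reached - seen     # keep only newly reached nodes
        seen |= frontier
    return frontier | {node}

# Hypothetical topology: I1 -S1- I2, I2 -S2- I3, I2 -S3- I4.
adj = {
    "I1": {"S1"}, "I2": {"S1", "S2", "S3"}, "I3": {"S2"}, "I4": {"S3"},
    "S1": {"I1", "I2"}, "S2": {"I2", "I3"}, "S3": {"I2", "I4"},
}
first = meta_path_neighbors(adj, "I1", 1)    # {"I1", "I2"}
second = meta_path_neighbors(adj, "I1", 2)   # {"I1", "I3", "I4"}
```

The same function covers the SIS meta path when it is started from a road segment.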

The historical paths of vehicles provide dynamic features of intersections and road segments on road networks, such as average travel time on intersections and road segments, and average speeds on road segments. The vehicle paths are formalized as follows:

Definition 4.

Vehicle path. A vehicle path is a series of continuous nodes consisting of intersections and road segments that appear alternately: $T = \{S_1, I_1, S_2, I_2, \dots, S_i, I_j\}$, where $S_i \in SS$ and $I_j \in SI$.

Problem Formulation: Given a query $Q = (O_q, D_q, T_q)$, our target is to develop a function $F$ that predicts the travel time $t_{Θ}$ from the starting point $O_q$ to the endpoint $D_q$ at time $T_q$, based on the heterogeneous traffic network $G$.

The formulation is defined as follows:

t_{Θ} \leftarrow F (O_{q}, D_{q}, T_{q}, G)

For convenience, we list the interpretations of primary symbols throughout the paper in Table 1.

4. Methods

To jointly model intersections and road segments, a deep learning framework based on heterogeneous information networks, i.e., STHGNN, is proposed in this section. The framework consists of three main layers: a feature augmentation layer, a spatio-temporal feature learning layer, and a travel time estimation layer. The design of the model is shown in Figure 2.

Feature augmentation layer: In the heterogeneous traffic network, intersections and road segments appear alternately, and their attributes will inevitably affect the traffic conditions of adjacent road segments or intersections. Therefore, in this layer, we integrated the attributes of intersections and road segments and mapped them to a high-dimensional latent space to initialize feature representations of intersections and road segments, which laid a good foundation for the next layer to effectively learn more complex spatio-temporal heterogeneous features.
Spatio-temporal feature learning layer: To jointly learn the heterogeneous spatio-temporal features of intersections and road segments, an embedded learning layer based on temporal gated convolutions and heterogeneous graph attention networks was constructed. This layer consists of STFL blocks, as shown in Figure 3a. First, STFL blocks take advantage of temporal gated convolutions to learn the temporal features of different networks. Then, the heterogeneous graph attention convolution layer is used to fuse the features of a node and its neighbor nodes based on the meta path of the road network. Finally, a temporal gated convolution is used to augment the temporal features.
Travel time estimation layer: In this layer, the latent representations of the corresponding intersections and road segments are selected from the spatio-temporal feature learning layer according to the index of the starting point and the endpoint in the query. Then, the selected latent representations are fused and imported into the two-layer fully connected network with the vector representations of the starting point, the endpoint, and the query time.

4.1. Road Network Feature Augmentation Layer

The features of the intersections and road segments are mapped to the high-dimensional latent feature space, which is used to initialize the feature representation and serve as the input for the next layer. Intersections and road segments influence each other. For this reason, when the features of intersections and road segments are initialized, the features of the adjacent road segments and intersections are integrated as well. To map the input features to the high-dimensional latent space, each node goes through a shared linear transformation followed by an activation function (tanh is used in this paper). This is a common feature augmentation approach [17,18], which can improve representation ability through embedding. The specific approach is explained for road segments and intersections separately as follows:

Given that a road segment is connected to two intersections, the features of the road segment and the two intersections are integrated. The initialized representation of a road segment is defined as:

X_{s}^{(t)} = \tanh (W_{s} \cdot [v_{s}^{(t)}, d_{s}, h_{I_{1}}^{(t)}, p_{I_{1}}, h_{I_{2}}^{(t)}, p_{I_{2}}])

(1)

where $[\cdot]$ is the concatenation operator, which here yields a 6-dimensional vector; $v_s^{(t)}$ is the average speed of the road segment at time $t$; $d_s$ is the distance of the road segment; $h_{I_1}^{(t)}$ and $h_{I_2}^{(t)}$ are the average travel times of the two intersections connected to the road segment at time $t$; and $p_{I_1}$ and $p_{I_2}$ indicate whether there is a traffic light at the respective intersection (0 means no, and 1 means yes). $W_s$ is a learnable parameter that maps the features of the road segment to a high-dimensional latent space. The first two items in Equation (1) reflect the dynamic and static attributes of the road segment; the last four items are in pairs, representing the dynamic and static attributes of the intersections directly connected to the road segment.

Given that an intersection is connected to several road segments, the representation of the intersection can be initialized by integrating the features of the connected road segments. Therefore, the representation of an intersection is defined as:

X_{I}^{(t)} = \tanh (W_{I} \cdot [[\frac{\sum_{s = 1}^{n} v_{s}^{(t)}}{n}, \frac{\sum_{s = 1}^{n} d_{s}}{n}], h_{I}^{(t)}, p_{I}])

(2)

where $v_s^{(t)}$ is the average speed of each of the $n$ road segments adjacent to the intersection at time $t$; $d_s$ is the length of the road segment; $h_I^{(t)}$ is the average travel time of the intersection at time $t$; $p_I$ indicates whether there are traffic lights at the intersection; $W_I$ is a learnable parameter that maps the features of the intersection to a high-dimensional latent space; and $[\cdot]$ is the concatenation operator. The first two items in Equation (2) reflect the dynamic and static attributes of all the road segments connected to the intersection, and the last two items reflect the dynamic and static attributes of the intersection itself.

Finally, Equation (3) is used to fuse the representation of the intersection $X_I^{(t)}$ and the representation of the road segment $X_s^{(t)}$. $X^{(t)}$ is used as the initial feature representation of the node and imported into the spatio-temporal feature learning layer.

X^{(t)} = [X_{I}^{(t)} \| X_{s}^{(t)}]

(3)

In this way, each road segment not only explores its own attributes but also integrates the attributes of the adjacent intersections. Similarly, each intersection integrates the static and dynamic features of the adjacent road segments and its own static and dynamic features. These attributes and features are mapped to a high-dimensional latent feature space through feature augmentation.
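To make Equations (1) to (3) concrete, the following NumPy sketch initializes one road segment and one intersection. It is an illustration under stated assumptions rather than the authors' code: the function names, the random stand-ins for the learnable parameters, the input values and their units are all hypothetical, and Equation (3) is read here as stacking the two kinds of node embeddings into one node feature matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 20  # embedding size; 20 is the value reported in the experiment settings

def init_segment(W_s, v_s, d_s, h_i1, p_i1, h_i2, p_i2):
    """Equation (1): embed a road segment together with its two intersections."""
    raw = np.array([v_s, d_s, h_i1, p_i1, h_i2, p_i2], dtype=float)
    return np.tanh(W_s @ raw)

def init_intersection(W_i, speeds, lengths, h_i, p_i):
    """Equation (2): embed an intersection with the mean speed and mean
    length of its adjacent road segments."""
    raw = np.array([np.mean(speeds), np.mean(lengths), h_i, p_i], dtype=float)
    return np.tanh(W_i @ raw)

# Randomly initialised stand-ins for the learnable parameters W_s and W_I.
W_s = rng.normal(scale=0.1, size=(D, 6))
W_i = rng.normal(scale=0.1, size=(D, 4))

# Hypothetical inputs: speed (m/s), length (km), transit time (s), light flag.
x_s = init_segment(W_s, v_s=8.5, d_s=0.4, h_i1=30.0, p_i1=1, h_i2=12.0, p_i2=0)
x_i = init_intersection(W_i, speeds=[8.5, 6.0, 11.2], lengths=[0.4, 0.9, 0.3],
                        h_i=30.0, p_i=1)

# Equation (3), read as stacking intersection and segment rows into a
# node feature matrix of shape (number of nodes, D).
X_t = np.vstack([x_i[None, :], x_s[None, :]])
```

Because tanh saturates for large inputs, raw features on very different scales would presumably be normalized before this step; the text does not specify how.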

4.2. Spatio-Temporal Feature Learning Layer

The spatio-temporal feature learning layer is the core layer of STHGNN. Its purpose is to obtain the latent representation of road segments and intersections for the next time interval on the basis of the previous M time steps. As shown in Figure 3a, this layer consists of a heterogeneous graph attention neural network and two temporal gated convolutional neural networks, which are used to process, respectively, the spatial features and temporal features from the neighbor nodes of a meta path on the road network.

4.2.1. Temporal Gated CNNs

Although RNN and LSTM models are widely used in time series analysis, they suffer from problems such as time-consuming iterations, complex gating mechanisms, and slow responses to dynamic changes. Compared with RNN and LSTM, CNN has the advantages of fast training speed, a simple structure, and no dependence on the computation of previous steps. Causal convolution is a convolutional structure specially designed for time series prediction, which allows a CNN to model sequential dependence in the way an LSTM does.

Therefore, in this paper, causal convolutions are used on the time axis to capture the temporal dynamic behavior of traffic flow. To adapt to the dynamic temporal correlations, a gated mechanism, which is also crucial in recurrent neural networks, is used to control the temporal information flow in convolutions. Taking the first temporal gated convolution in Figure 3b as an example, $(X^{(t_q - M + 1)}, X^{(t_q - M + 2)}, \dots, X^{(t_q)}) \in R^{M \times |V| \times n}$ is used as the input, and the results of two causal convolutions are fused through the Hadamard product. One of the causal convolutions goes through the Sigmoid function, and the other uses a residual connection to mitigate vanishing gradients. Finally, the hidden state $H_1 \in R^{(M - L_K + 1) \times |V| \times C_1}$ is the output. The equation is:

H_{1} = (K * X^{(t)}) ⊙ σ (K * X^{(t)}) \in R^{(M - L_{K} + 1) \times |V| \times C_{1}}

(4)

where K is the convolution kernel with a size of [L_K × 1 × C_in, C₁], in which C_in and C₁ represent the number of input channels and output channels of the causal convolutions, respectively. In this paper, C_in is assumed to be equal to C₁. ⊙ represents the Hadamard product (element-wise product). The Sigmoid function σ(·) is used to control the proportion of information flowing to the next layer.

The second temporal gated convolution in Figure 3a is used to augment temporal features. Its structure is exactly the same as the first one, except that the size of its convolution kernel is [(M − L_K + 1) × 1 × C₃, C₃], and the output is X^(tq)′ ∈ R^{|V| × C₃}, where X^(tq)′ is the final representation of all intersections and road segments at time t_q.
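The shape bookkeeping of Equation (4) can be checked with a small NumPy sketch of a gated temporal convolution. It is a simplified illustration, not the authors' implementation: it assumes two separate kernels (one for the value branch, one for the gate), omits the residual connection, and uses toy sizes for the number of nodes and input channels. All function names are hypothetical.

```python
import numpy as np

def temporal_conv(X, K):
    """Unpadded 1-D convolution along the time axis.

    X: (M, V, C_in) input sequence; K: (L_K, C_in, C_out) kernel.
    Output index t is computed from input steps t .. t+L_K-1 and is aligned
    with the last of them, so no later step leaks in. The time axis shrinks
    from M to M - L_K + 1.
    """
    L_K = K.shape[0]
    M = X.shape[0]
    out = np.zeros((M - L_K + 1, X.shape[1], K.shape[2]))
    for t in range(M - L_K + 1):
        for k in range(L_K):
            out[t] += X[t + k] @ K[k]
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_temporal_conv(X, K_value, K_gate):
    """Equation (4): value branch gated element-wise by a sigmoid branch."""
    return temporal_conv(X, K_value) * sigmoid(temporal_conv(X, K_gate))

rng = np.random.default_rng(0)
# M = 12, L_K = 2 and C_1 = 8 follow Section 5.1; V and C_in are toy values.
M, V, C_in, C_out, L_K = 12, 5, 8, 8, 2
X = rng.normal(size=(M, V, C_in))
K_value = rng.normal(scale=0.1, size=(L_K, C_in, C_out))
K_gate = rng.normal(scale=0.1, size=(L_K, C_in, C_out))
H1 = gated_temporal_conv(X, K_value, K_gate)   # shape (M - L_K + 1, V, C_out)
```

With these sizes the time axis shrinks from 12 to 11 steps. A second kernel spanning all remaining steps would collapse it to a single step, consistent with the output size given for the second temporal gated convolution.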

4.2.2. Spatial Feature Learning Based on Heterogeneous Graph Attention Mechanism

The travel time of a road segment is affected not only by the adjacent intersections but also by the upstream road segments. The same holds for intersections. Therefore, it is necessary to learn the spatial information between nodes using heterogeneous information networks and meta paths. The heterogeneous graph attention mechanism has proved to be an effective approach for obtaining spatial features in heterogeneous information networks [19,20]. Therefore, in this paper, a heterogeneous graph attention mechanism is used to identify the spatial features of road segments and intersections.

The basic idea of identifying the latent spatial features of intersections and road segments in the heterogeneous graph attention mechanism is as follows: First, the importance between node pairs is identified through the meta path of a road network. Then, the attention coefficient is identified through the softmax function. Finally, the feature representation of the node is obtained by making a weighted summation of the attention coefficient with the features of the neighbors around the node.

Specifically, in a heterogeneous traffic network, $e_{ij}^n$ is used to express the importance between the intersection or road segment $X_i^{(t)}$ and its nth-order neighbor $X_j^{(t)}$ that are connected by meta path Φ, which is computed as:

e_{ij}^{n} = a_{fnn} (X_{i}^{(t)}, X_{j}^{(t)})

(5)

where $a_{fnn}$ represents a feedforward neural network.

Then, the node features of the network are embedded into the model through the self-attention mechanism, i.e., $e_{ij}^n$ is calculated only for $j \in N_i^n$, where $N_i^n$ is the set of all nth-order neighbor nodes of node $i$ based on the meta path Φ. Next, the importance values obtained above are normalized, and the attention coefficient $a_{ij}^n$ is obtained through the softmax function. The equation is as follows:

a_{ij}^{n} = \mathrm{softmax} (e_{ij}^{n}) = \frac{\exp (σ (a_{Φ}^{T} \cdot [X_{i} \| X_{j}]))}{\sum_{k \in N_{i}^{n}} \exp (σ (a_{Φ}^{T} \cdot [X_{i} \| X_{k}]))}

(6)

where σ represents the activation function, $\|$ represents the concatenation operation, and $a_{Φ}$ is the attention vector of the meta path Φ, with the superscript $T$ denoting transposition. It can be concluded from the above equation that the attention coefficient of the pair $(i, j)$ depends on their latent features. The attention coefficient $a_{ij}^n$ is asymmetric, which means that nodes $i$ and $j$ contribute differently to one another, both because the concatenation order in the numerator differs and because the two nodes have different neighbors, so the normalization term in the denominator differs.

Finally, the feature representation of the node is obtained by making a weighted summation of the attention coefficient obtained above with the features of the neighbors around the node and by going through the activation function. The equation is as follows:

z_{i}^{Φ} = σ (\sum_{j \in N_{i}^{n}} a_{ij}^{n} \cdot X_{j})

(7)

where $z_i^{Φ}$ is the learned representation of node $i$ based on meta path Φ.

Given that a heterogeneous graph possesses the properties of a scale-free network, the variance of the graph data is quite high. To solve the above problems, multi-head attention is applied to make the training process more stable. Specifically, the above process is repeated K times, and finally, the feature representation of node i is obtained:

z_{i}^{Φ} = \|_{k = 1}^{K} σ (\sum_{j \in N_{i}^{n}} a_{ij}^{n} \cdot X_{j})

(8)

To capture the influence of the latent features of neighbor nodes, neighbors up to the third order based on the meta path of the road network are considered. Specifically, the output H₁ of the temporal gated convolution in the previous layer is used as the input. After three rounds of heterogeneous graph attention convolutions, the final feature dimension of each node becomes C₂, and $H_4 \in R^{(M - L_K + 1) \times |V| \times C_2}$ is exported.
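The attention steps in Equations (5) to (8) can be sketched in NumPy for a single time step. This is a hedged illustration only: LeakyReLU for the scoring activation and tanh for the output activation are assumptions (the text only writes σ), the neighbor lists are hypothetical, and, following the equations as printed, the heads differ only in their attention vectors.

```python
import numpy as np

def leaky_relu(z, slope=0.2):
    return np.where(z > 0, z, slope * z)

def attention_head(X, neighbors, a):
    """One head of Equations (5)-(7).

    X: (V, C) node features; neighbors[i]: indices of the meta path
    neighbors of node i (including i itself); a: (2C,) attention vector.
    """
    Z = np.zeros_like(X)
    for i, nbrs in enumerate(neighbors):
        # Score each neighbor j from the concatenation [X_i || X_j].
        pairs = np.concatenate(
            [np.repeat(X[i][None, :], len(nbrs), axis=0), X[nbrs]], axis=1)
        e = leaky_relu(pairs @ a)
        # Softmax over the neighborhood, shifted for numerical stability.
        w = np.exp(e - e.max())
        alpha = w / w.sum()
        Z[i] = np.tanh(alpha @ X[nbrs])   # weighted sum, then activation
    return Z

def multi_head(X, neighbors, heads):
    """Equation (8): concatenate the outputs of K independent heads."""
    return np.concatenate(
        [attention_head(X, neighbors, a) for a in heads], axis=1)

rng = np.random.default_rng(0)
V, C, K = 4, 8, 3                        # toy sizes, not from the paper
X = rng.normal(size=(V, C))
neighbors = [[0, 1], [0, 1, 2, 3], [1, 2], [1, 3]]   # hypothetical ISI neighbors
heads = [rng.normal(size=2 * C) for _ in range(K)]
Z = multi_head(X, neighbors, heads)      # shape (V, K * C)
```

Restricting the softmax to each node's neighbor list is what makes the coefficients asymmetric: nodes with different neighborhoods are normalized over different sets.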

4.3. Travel Time Estimation Layer

A two-layer fully connected network with the size of [λ, 1] is used to estimate the travel time. The latent representations of the corresponding intersections and road segments are selected from the spatio-temporal feature learning layer according to the index of the starting point and the endpoint in the query. Then, the selected latent representations are imported to the fully connected network to estimate the travel time $t_{Θ}$.
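A minimal sketch of such an estimation head is given below. Several details are assumptions rather than statements from the paper: the two selected representations are fused by concatenation, the hidden activation is ReLU, the weights are random stand-ins for trained parameters, and the function and variable names are hypothetical.

```python
import numpy as np

def estimate_travel_time(z_origin, z_dest, params):
    """Fuse two latent node representations and regress a travel time.

    params holds W1 (2C, lam), b1 (lam,), W2 (lam,) and b2 (scalar): a hidden
    layer of width lam followed by a single output unit, i.e. sizes [lam, 1].
    """
    z = np.concatenate([z_origin, z_dest])                       # assumed fusion
    hidden = np.maximum(0.0, z @ params["W1"] + params["b1"])    # assumed ReLU
    return float(hidden @ params["W2"] + params["b2"])

rng = np.random.default_rng(0)
C, lam = 11, 60        # C_3 = 11 and 60 hidden units follow Section 5.1
params = {
    "W1": rng.normal(scale=0.1, size=(2 * C, lam)),
    "b1": np.zeros(lam),
    "W2": rng.normal(scale=0.1, size=lam),
    "b2": 0.0,
}
H = rng.normal(size=(7, C))     # stand-in for the learned node representations
t_est = estimate_travel_time(H[2], H[5], params)   # origin index 2, destination index 5
```

With untrained weights the output is meaningless; the sketch only shows how node indices from the query select rows of the learned representation.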

Loss function:

MAPE is used as the loss function, which is defined as follows:

\mathrm{Loss} (t_{Θ}; G) = \frac{|t_{Θ} - \hat{t}_{Θ}|}{\hat{t}_{Θ}}

where $t_{Θ}$ is the estimated travel time, and $\hat{t}_{Θ}$ is the real travel time.

5. Experiment Settings and Results Analysis

5.1. Experiment

We used a real taxi trajectory dataset collected in Shenzhen and released in GISCUP 2021 [21]. The data from 1 August to 31 August were used as the training dataset, and the data from 1 September were used as the testing dataset. Table 2 lists the statistics of the dataset.

Evaluation Metric. Mean Absolute Percentage Error (MAPE) computes the average percentage deviation between predicted values and ground truth:

\mathrm{MAPE} = \frac{1}{| D |} \sum_{i = 1}^{| D |} \frac{|t_{i} - \hat{t}_{i}|}{t_{i}}

MAPE is one of the most popular metrics for ETT tasks. Mean Absolute Error (MAE) calculates the average absolute residual over the data points:

\mathrm{MAE} = \frac{1}{| D |} \sum_{i = 1}^{| D |} | t_{i} - \hat{t}_{i} |

A smaller MAE indicates better prediction. Root Mean Squared Error (RMSE) describes the spread of residuals:

\mathrm{RMSE} = \sqrt{\frac{1}{| D |} \sum_{i = 1}^{| D |} (| t_{i} - \hat{t}_{i} |)^{2}}
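The three metrics can be written as short NumPy functions. The sketch below assumes that the denominator of MAPE is the real travel time, as in the loss definition of Section 4.3; the function names and the sample values are hypothetical.

```python
import numpy as np

def mape(t_true, t_pred):
    """Mean absolute percentage error, relative to the real travel time."""
    t_true, t_pred = np.asarray(t_true, float), np.asarray(t_pred, float)
    return float(np.mean(np.abs(t_true - t_pred) / t_true))

def mae(t_true, t_pred):
    """Mean absolute error, in the unit of the inputs."""
    t_true, t_pred = np.asarray(t_true, float), np.asarray(t_pred, float)
    return float(np.mean(np.abs(t_true - t_pred)))

def rmse(t_true, t_pred):
    """Root mean squared error, in the unit of the inputs."""
    t_true, t_pred = np.asarray(t_true, float), np.asarray(t_pred, float)
    return float(np.sqrt(np.mean((t_true - t_pred) ** 2)))

# Hypothetical travel times in seconds.
t_true = [600.0, 900.0, 1200.0]
t_pred = [540.0, 990.0, 1200.0]
scores = (mape(t_true, t_pred), mae(t_true, t_pred), rmse(t_true, t_pred))
```

For these sample values the relative errors are 10%, 10% and 0%, so MAPE is about 0.067, MAE is 50 s, and RMSE is the square root of 3900, roughly 62.4 s. RMSE exceeds MAE because it weights the larger residual more heavily.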

Baselines: We compared the performance with the following baselines.

AVG: This is a classic method that has been deployed in multiple map services. The average speed of each link in a given city during a specific time interval is calculated. The travel time of a given path can be estimated using the historical average speed and a given departure time.
DeepTTE [11]: This is a hybrid end-to-end deep learning model that adopts a Geo-Conv layer to capture spatial dependencies and a GRU layer to capture temporal dependencies. DeepTTE learns the spatio-temporal correlations based on the consecutive sampling points along the query path.
DCRNN [14]: This is a spatio-temporal graph-based deep framework. This model exploits diffusion graph convolution to capture spatial dependencies and then uses the recurrent neural networks to model temporal dependencies. We leverage its core architecture for travel time estimation in our experiments.

In the preprocessing stage, the trajectories that were extremely long or extremely short were removed from the trajectory set. We used the trajectory duration as the pruning metric. If a trajectory duration was less than 3 min, the trajectory was considered too short to be meaningful; for instance, a passenger can cancel a taxi reservation within 3 min. On the other hand, we also deleted the trajectories with durations of more than 30 min. After the data preprocessing, 94% of the dataset remained.
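The duration-based pruning rule can be expressed as a simple filter. The function name and the sample durations below are hypothetical; only the 3 min and 30 min thresholds come from the text.

```python
def keep_trajectory(duration_s, min_s=3 * 60, max_s=30 * 60):
    """Keep trajectories that last between 3 and 30 minutes.

    Durations below 3 min or above 30 min are discarded, so both
    bounds themselves are kept.
    """
    return min_s <= duration_s <= max_s

durations = [95, 180, 640, 1799, 1800, 2400]   # hypothetical durations in seconds
kept = [d for d in durations if keep_trajectory(d)]
```

Here the 95 s and 2400 s trajectories are dropped, while the boundary values of exactly 3 min and 30 min are retained.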

For our model, the embedding dimensions of intersections and road segments were both set to 20. The number of spatio-temporal learning cells was set to three by default. The size L_K of the causal convolution kernels in the gated CNN was set to two, with the output channels C₁ = 8 and C₃ = 11. We set a time slice of 5 min so that each day is divided into 288 data points. We used the historical information on intersections and road segments from the previous 12 time steps, with slots of 5 min, for temporal learning. We adopted two-layer fully connected networks with the number of hidden units set to 60. Additionally, the learning rate was set to 0.001, the batch size was set to 100, and Adam was employed to optimize the parameters.
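The time discretization described above can be sketched with the standard library. The helper names and the query timestamp are hypothetical (the year is arbitrary), and the window is assumed to end at the slice containing the query time, in line with the input sequence ending at t_q in Section 4.2.1.

```python
from datetime import datetime

SLOT_MINUTES = 5
SLOTS_PER_DAY = 24 * 60 // SLOT_MINUTES    # 288 slices per day
HISTORY_STEPS = 12                         # M: one hour of history

def slot_index(ts):
    """Index of the 5-minute slice of the day that contains `ts`."""
    return (ts.hour * 60 + ts.minute) // SLOT_MINUTES

def history_slots(ts):
    """Indices of the 12 slices ending at the slice that contains `ts`.

    Indices wrap around midnight; tracking which day a slice belongs to
    is omitted in this sketch.
    """
    q = slot_index(ts)
    return [(q - k) % SLOTS_PER_DAY for k in range(HISTORY_STEPS - 1, -1, -1)]

query = datetime(2021, 9, 1, 8, 17)   # hypothetical query time
window = history_slots(query)         # slices 88 .. 99
```

A query at 08:17 falls into slice 99, so the model input would cover slices 88 to 99, i.e., 07:20 to 08:20.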

5.2. Experimental Evaluation

Table 3 shows the experimental results of our model and three baselines. The lower these metrics, the better the performance. The parameters of the baseline algorithms followed the best settings reported in their original papers.

From the experimental results, we make the following observations. STHGNN shows the best prediction performance under all settings. Among the baselines, the neural network-based methods DeepTTE and DCRNN have better prediction precision than AVG.

For instance, the MAPE of DeepTTE, DCRNN, and STHGNN was approximately 10.4%, 19%, and 22.1% lower than that of AVG, respectively. This is mainly because AVG is a classic method that has difficulty handling complex time series data. Moreover, it can exploit neither the spatial correlation hidden in the trajectories nor the spatio-temporal features. Among the three neural methods, DeepTTE and DCRNN use recurrent units (GRU) to identify the temporal features, whereas STHGNN uses temporal gated convolutions. For the spatial aspect, DeepTTE uses a Geo-Conv layer to identify the spatial features, while DCRNN and STHGNN both use graph structures to capture the spatial features. Therefore, the MAPE of DCRNN and STHGNN is slightly lower than that of DeepTTE. After all, taxis run on the road network, and the graph model shows the benefit of spatial correlation learning.

For the graph-based methods, STHGNN models road segments and intersections as nodes on the same graph and learns the spatial features of road segments and intersections at the same time. It uses meta path-based neighbors to mine more spatial information. Thus, the MAPE of STHGNN is lower than that of DCRNN.

5.3. Ablation Study

In order to verify the effectiveness of our proposed heterogeneous information network, we conducted an ablation study, including (1) w/o intersections, which only adopts the spatio-temporal features from the road segments; (2) w/o road segments, which only adopts the spatio-temporal features from the intersections; and (3) w/o a feature augmentation layer, which adopts the spatio-temporal features from the intersections and road segments but does not augment and fuse.

Table 4 shows the performance of STHGNN against its ablations on the ETT task. From Table 4, we find that the performance of w/o feature augmentation is worse than STHGNN. For example, its performance degrades by 3.13%, 4.87%, and 4.72% in terms of RMSE, MAE, and MAPE when applied to the Shenzhen dataset compared with our model. This is because STHGNN integrated the attributes of intersections and road segments and mapped them to a high-dimensional latent space to initialize feature representations of intersections and road segments.

Table 4 also shows obvious declines in estimation performance when the road segments or intersections are removed: the MAPE rises from 0.127 to 0.145 without intersections and to 0.138 without road segments. This is because both road segments and intersections play a central role in affecting traffic. Based on the above ablation study, we conclude that the proposed heterogeneous traffic network and feature augmentation layer can effectively improve the accuracy of estimating the time of travel.

6. Conclusions

To integrate road segments and intersections, we proposed a novel spatio-temporal heterogeneous graph neural network framework, called STHGNN, for estimating travel time. STHGNN combines gated convolutional neural networks and graph neural networks to capture the correlations in spatio-temporal information. To evaluate the effectiveness of our model, we conducted experiments on a real-world dataset, and the experimental results showed that STHGNN achieves higher accuracy in estimating travel time than the baselines. However, some limitations remain to be addressed. First, we did not consider certain external data, such as weather conditions and special events, which could have a significant impact on traffic states. Second, we may have obtained less heterogeneous information due to limited data resources. In the future, we will consider more external factors that have a significant impact on travel time and attempt to apply more heterogeneous information about the road network.

Author Contributions

Methodology, L.W.; Formal analysis, L.W.; Investigation, L.W. and Y.Z.; Writing—original draft, L.W.; Writing—review & editing, L.W., Y.T. and P.Z.; Supervision, Y.T.; Funding acquisition, Y.T. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Hebei Provincial Department of Education (ZD2018040, 2018GJJG240) and the National Natural Science Foundation of China (F2021210005).

Conflicts of Interest

The authors declare no conflict of interest.

References

Grotenhuis, J.W.; Wiegmans, B.W.; Rietveld, P. The desired quality of integrated multimodal travel information in public transport: Customer needs for time and effort savings. Transport Policy 2007, 14, 27–38.
Yuan, N.J.; Zheng, Y.; Zhang, L.; Xie, X. T-finder: A recommender system for finding passengers and vacant taxis. IEEE Trans. Knowl. Data Eng. 2012, 25, 2390–2403.
Tirachini, A. Estimation of travel time and the benefits of up-grading the fare payment technology in urban bus services. Transp. Res. Part C Emerg. Technol. 2013, 30, 239–256.
Wang, H.; Tang, X.; Kuo, Y.H.; Kifer, D.; Li, Z. A simple baseline for travel time estimation using large-scale trip data. ACM TIST 2019, 10, 1–22.
Wang, M.X.; Lee, W.C.; Fu, T.Y.; Yu, G. Learning embeddings of intersections on road networks. In Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, Chicago, IL, USA, 5–8 November 2019; pp. 309–318.
Chen, W.; Chen, L.; Xie, Y.; Cao, W.; Gao, Y.; Feng, X. Multirange attentive bicomponent graph convolutional network for traffic forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 3529–3536.
Wu, C.-H.; Ho, J.-M.; Lee, D. Travel-time prediction with support vector regression. IEEE Trans. Intell. Transp. Syst. 2004, 5, 276–281.
Jenelius, E.; Koutsopoulos, H.N. Travel time estimation for urban road networks using low frequency probe vehicle data. Transp. Res. Part B Methodol. 2013, 53, 64–81.
Wang, Z.; Fu, K.; Ye, J. Learning to estimate the travel time. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, ACM, London, UK, 19–23 August 2018; pp. 858–866.
Gupta, B.; Awasthi, S.; Gupta, R.; Ram, L.; Kumar, P.; Prasad, B.R.; Agarwal, S. Taxi travel time prediction using ensemble-based random forest and gradient boosting model. In Advances in Big Data and Cloud Computing; Springer: Berlin/Heidelberg, Germany, 2018; pp. 63–78.
Wang, D.; Zhang, J.; Cao, W.; Li, J.; Zheng, Y. When will you arrive? Estimating travel time based on deep neural networks. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; pp. 2500–2507.
Zhang, H.; Wu, H.; Sun, W.; Zheng, B. Deeptravel: A neural network based travel time estimation model with auxiliary supervision. arXiv 2018, arXiv:1802.02147.
Fu, T.Y.; Lee, W.C. Deepist: Deep image-based spatio-temporal network for travel time estimation. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, Beijing, China, 3–7 November 2019; pp. 69–78.
Li, Y.; Yu, R.; Shahabi, C.; Liu, Y. Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting. In Proceedings of the International Conference on Learning Representations (ICLR’18), Vancouver, BC, Canada, 30 April–3 May 2018.
Wu, Z.; Pan, S.; Chen, F.; Long, G.; Zhang, C.; Yu, P.S. A comprehensive survey on graph neural networks. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 4–24.
Zhang, J.; Shi, X.; Xie, J.; Ma, H.; King, I.; Yeung, D.Y. GaAN: Gated attention networks for learning on large and spatio-temporal graphs. arXiv 2018, arXiv:1803.07294.
Velickovic, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P. Graph attention networks. In Proceedings of the International Conference on Learning Representations (ICLR’18), Vancouver, BC, Canada, 30 April–3 May 2018.
Jin, G.; Yan, H.; Li, F.; Huang, J.; Li, Y. Spatio-Temporal Dual Graph Neural Networks for Travel Time Estimation. arXiv 2021, arXiv:2105.13591.
Zhang, C.; Song, D.; Huang, C.; Swami, A.; Chawla, N.V. Heterogeneous Graph Neural Network. In Proceedings of the 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Anchorage, AK, USA, 4–8 August 2019.
Wang, X.; Ji, H.; Shi, C.; Wang, B.; Ye, Y.; Cui, P.; Yu, P.S. Heterogeneous Graph Attention Network. In Proceedings of the International World Wide Web Conference (WWW’19), San Francisco, CA, USA, 13–17 May 2019.
Available online: https://sigspatial2021.sigspatial.org/ (accessed on 11 November 2022).

Figure 1. An illustrative example of a heterogeneous traffic network.

Figure 2. The overview of STHGNN.

Figure 3. Spatio-Temporal Heterogeneous Graph Learning Layer.

Table 1. Interpretation of symbols.

Term	Description	Term	Description
G	the heterogeneous traffic network	Q	a query
V	the set of nodes	$t_{Θ}$	the estimated travel time
E	the set of edges	$X_{s}^{(t)}$	the initialized representation of a road segment at time t
X	the feature matrix of nodes	$X_{I}^{(t)}$	the initialized representation of an intersection at time t
R	the edge relationships	$X^{(t)}$	the initial feature representation of the node at time t
Φ	the meta path	H	the output hidden state
$N_{i}^{n}$	the nth-order neighbors of node i based on the meta path	$a_{fnn}$	a feedforward neural network
T	the vehicle path	$z_{i}^{Φ}$	the learning representation of meta path Φ-based node i

Table 2. The numerical statistics of real-world datasets.

Dataset	Shenzhen
number of trajectories	8,562,030
number of intersections	356,208
number of road segments	378,984
average travel time (s)	236.54
average moving distance (m)	1325.78

Table 3. Performance of STHGNN and baselines.

Method	RMSE (s)	MAE (s)	MAPE
AVG	256.2	152.67	0.163
DeepTTE	232.7	138.35	0.146
DCRNN	208.1	124.01	0.132
STHGNN	201.3	119.96	0.127

Table 4. Ablation results.

Method	RMSE (s)	MAE (s)	MAPE
w/o intersections	218.7	132.28	0.145
w/o road segments	213.5	129.39	0.138
w/o feature augmentation	207.6	125.81	0.133
STHGNN	201.3	119.96	0.127

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

