Article

Missing Data Estimation in Temporal Multilayer Position-Aware Graph Neural Network (TMP-GNN)

by Bahareh Najafi *, Saeedeh Parsaeefard and Alberto Leon-Garcia
Department of Electrical and Computer Engineering, University of Toronto, BA4120, 40 St. George Street, Toronto, ON M5S 3G4, Canada
* Author to whom correspondence should be addressed.
Mach. Learn. Knowl. Extr. 2022, 4(2), 397-417; https://doi.org/10.3390/make4020017
Submission received: 10 April 2022 / Revised: 23 April 2022 / Accepted: 27 April 2022 / Published: 30 April 2022
(This article belongs to the Section Network)

Abstract

Graph neural networks (GNNs) have been shown to perform highly effectively in various node-level, edge-level and graph-level prediction tasks in several domains. Existing approaches mainly focus on static graphs. However, many graphs change over time: edges may appear or disappear, and node/edge attributes may change from one time step to the next. It is essential to account for such evolution when learning node representations in time-varying graphs. In this paper, we propose a Temporal Multilayer Position-Aware Graph Neural Network (TMP-GNN), a node embedding approach for dynamic graphs that incorporates the interdependence of temporal relations into the embedding computation. We evaluate the performance of TMP-GNN on two different representations of temporal multilayer graphs. The performance is assessed against the most popular GNNs on a node-level prediction task. We then incorporate TMP-GNN into a deep learning framework to estimate missing data and compare the performance with that of the corresponding competent GNNs from our earlier experiment and with a baseline method. Experimental results on four real-world datasets yield up to 58% improvement in ROC AUC for the pair-wise node classification task and up to 96% lower MAE in missing feature estimation, particularly for graphs with a relatively high number of nodes and a lower mean degree of connectivity.

1. Introduction

Graph neural networks (GNNs) have recently become a standard approach for developing machine learning methods on graphs. These graphs are formed from sources such as transportation networks [1,2,3,4], brain networks [5], social media community networks, etc. The GNN architecture effectively combines node/edge features and graph topology to build a distributed representation. The resulting representation can be used to solve node-level, edge-level [6] and graph-level prediction tasks [5]. Several recent works have represented nodes, edges and graphs in many domains, such as bioinformatics, natural language processing and transportation. In [7], the authors used a hybrid method of a decomposed LSTM and a GCN to predict next-period prescriptions. The authors in [8] proposed a GCN-based Relational Fusion Network (RFN) for speed limit classification and speed estimation in road networks. The study introduced the notion of volatile homophily in the road segments of adjacent regions and enumerated the substantial differences between road networks and other networks previously learned by GCNs. The underlying differences include a smaller number of node attributes, potentially abrupt behavioral changes among neighboring segments and a lower mean node degree. Another study [9] learned graph representations by applying LSTM autoencoders to graph sequences generated from random walks, and utilized them for a classification task.
The goal of node embedding methods is to identify a vector representation that captures the node's location within the broader topological structure of the graph. Most node embeddings learned from GNN architectures focus on single static graphs. These methods assume that the number and position of nodes, as well as their interactions, do not change over time. However, in many applications we tackle spatio-temporal measurements collected over time, for which time-varying graphs become necessary. In such graphs, the number of nodes, their connecting edges and the edge weights vary over time; thus, dynamic node embeddings should be learned accordingly. Node embedding approaches for such graphs can be broadly categorized as temporal-first [10,11] and structural-first [12,13] according to [14]: the former prioritizes learning temporal information and the latter structural information, before feeding static topological and dynamic temporal features into the model. Time-varying graphs can be defined as continuous- or discrete-time dynamic graphs, where the latter can be represented as a sequence of interdependent time layers. Each layer is a graph built from the nodes and weighted edges existing at a given time. For instance, the study in [5] proposed graphSAGE-based representation learning for time-varying EEG signals to perform brain state classification, where the dynamic graph is modeled as a sequence of individual graphs, one per time stamp. Another work [8] combines temporal, spatial and semantic views to predict taxi demand at a future time stamp.
One straightforward way of computing node embeddings in discrete time-varying graphs is to use a static GNN-based node embedding for each individual time layer and aggregate the results across layers through recurrent neural network variants. However, this method computes each layer's embedding independently and consequently ignores the inter-layer correlation. In this paper, we present TMP-GNN, a Temporal Multilayer Position-Aware GNN-based node embedding, which extends its static counterpart, the Position-Aware GNN (P-GNN) [15]. The goal of P-GNN is to learn position-aware node embeddings that combine the local network structure with the global network position of a given node relative to randomly selected sets of nodes called anchor sets. This enables us to distinguish between isomorphic nodes (nodes located in very different parts of the graph that nevertheless have topologically identical local structure). The distinguishing power improves if node/edge features are available. The resulting embeddings can later be used to approximate the shortest-path distance between embedded nodes in the graph. The contributions of this paper are summarized as follows:
(1) We jointly learn the short-term temporal dependencies, global position and feature information of the graph through our TMP-GNN embedding component and utilize the derived representation in a missing data estimation framework.
(2) Instead of modeling a dynamic graph via multiple graphs, we exploit a supra-adjacency matrix to encode a temporal graph, with its intra-layer and inter-layer coupling, as a single graph, which facilitates faster learning and a higher area under the receiver operating characteristic curve (ROC AUC) for pair-wise node classification.
(3) We deploy the concept of conditional centrality, derived from eigenvector-based centrality, to distinguish nodes of higher influence, and integrate it into message aggregation across the graph.
(4) We identify a new experimental setting to enhance the training of the multi-graph deployment of dynamic graphs so that it better traces the behavioral changes of nodes in the graph.
(5) We use hidden states learned from a bi-directional GRU (bi-GRU) to capture long-term temporal dependencies in both forward and backward directions and estimate missing values.
(6) We conduct several experiments on four real-world datasets with a wide range of node counts, degrees of connectivity, edge dynamics and areas of study. The results illustrate that TMP-GNN significantly improves the area under the ROC curve for classification tasks compared to the best baseline and reduces the MAE of missing data estimation.

2. Notations and Preliminaries

Before presenting the details of our proposed architecture, we provide some background on temporal multilayer graphs.

2.1. Notation

Figure 1 illustrates a temporal multilayer graph. We can represent the graph as G(t) = (V, ρ(E × t), E(t), Ẽ), where V is the set of nodes. Here, ρ(E × t) ∈ {0, 1} indicates whether a given edge is present at time t. E(t) is the set of intra-layer weighted edges at time t, E(t) = {e_uv(t) : u, v ∈ V}, and Ẽ is the set of inter-layer edges {(t, t′, x̃_tt′) ∈ ε̃}, where x̃_tt′ is the edge weight between time layers (t, t′). The multilayer graph is also denoted by a sequence of adjacency matrices A(t) ∈ ℝ^{N×N}, where A_uv(t) denotes the directed edge from node u to v in time layer t, as follows:

$$A_{uv}(t) = \begin{cases} x_{uv}(t), & \text{if } u, v \text{ are connected at time } t,\\ 0, & \text{otherwise,} \end{cases}$$

where x_uv(t) is the feature vector associated with e_uv(t). In scenarios where node attributes are available, another set is defined as W(t) = {w_v(t), t ∈ ℕ_T}, where w_v(t) is the feature vector associated with node v in time layer t.
The set Ẽ encodes the weight and topology of the coupling between individual instances of the same node across pairs of time layers (t, t′).
The inter-layer coupling described above is diagonal and uniform [16], which means that inter-layer edges are restricted to separate instances of the same node from one layer to another. For instance, there is no edge from node v in layer t to node u in layer t′. The layers are uniformly coupled, meaning that the inter-layer edge weight between two layers is identical for all nodes in those layers. Note that this assumption can be generalized so that different inter-layer edge weights are assigned to different subsets of nodes; that is, x̃_tt′^u ≠ x̃_tt′^v.
There are two possible ways to form the coupling between the layers: an undirected and a directed chain. The former couples adjacent time layers while neglecting the directionality of time, as in the following equation:

$$\tilde{A}_{uv}(tt') = \begin{cases} 1, & \text{if } u = v,\ |t - t'| = \delta,\\ 0, & \text{otherwise,} \end{cases}$$

where δ is the sampling rate. The latter respects the direction of time and additionally includes weighted connections, with teleportation weight γ, between the same nodes of all pairs of time layers:

$$\tilde{A}_{uv}(tt') = \begin{cases} 1 + \gamma, & \text{if } u = v,\ |t - t'| = \delta,\\ \gamma, & \text{otherwise,} \end{cases}$$

where Ã is a T × T inter-layer matrix and γ is the node teleportation probability [17]. In this paper, we use the former approach for inter-layer coupling. The effective coupling between instances of the same node decreases as |t − t′| increases, and we adjust δ according to the data characteristics and inter-sequence correlation in our experiments.
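For concreteness, the following minimal sketch (assuming NumPy; the helper names are illustrative, not taken from the implementation described in Section 3) builds the T × T inter-layer matrix Ã for both chain types:

```python
import numpy as np

def undirected_chain(T, delta=1):
    # Eq. (2): couple instances of the same node whose layers are delta apart.
    A_tilde = np.zeros((T, T))
    for t in range(T - delta):
        A_tilde[t, t + delta] = A_tilde[t + delta, t] = 1.0
    return A_tilde

def chain_with_teleportation(T, delta=1, gamma=0.1):
    # Eq. (3): the same coupling pattern plus a teleportation weight gamma
    # between every pair of layers.
    return np.full((T, T), gamma) + undirected_chain(T, delta)

print(undirected_chain(4))
```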
A node embedding function, e.g., SAGE, can be represented as f : V(t) → Z(t), which maps a given node v in layer t to a d-dimensional vector z_v(t) ∈ ℝ^d, where Z(t) = {z_v(t) : v ∈ V}. Once the node embeddings are computed, the corresponding edge embedding can be calculated as z_uv(t) = g(z_u(t), z_v(t)), where the function g is the mean in the case of our study.

2.2. Supracentrality Matrix for Temporal Multilayer Position-Aware Graph

As explained, a temporal graph is represented through a sequence of adjacency matrices, each of which refers to one layer of a dynamic network at a specific point in time. We then construct a supracentrality matrix C(ω) by linking centrality matrices across time layers through a weighted inter-layer parameter ω, which adjusts the coupling strength between pairs of time layers. The entries of the dominant eigenvector of C(ω) give the joint centrality, the importance of every node–layer pair (v, t). Additionally, marginal and conditional centralities are defined to represent the relative importance of a node compared to other nodes at time layer t.
The supracentrality framework focuses on eigenvector-based centrality, obtained by taking the centralities to be the elements of the dominant eigenvector corresponding to the largest-magnitude eigenvalue of a centrality matrix C(A), which is defined as a function of the network adjacency matrix A. We select one of the most popular choices of centrality measure, the adjacency matrix itself (C(A) = A).
C(ω), the supracentrality matrix of a dynamic graph, is formed as follows:

$$C(\omega) = \hat{C} + \omega \hat{A},$$

where

$$\hat{C} = \begin{bmatrix} C^{(1)} & 0 & \cdots & 0\\ 0 & C^{(2)} & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & C^{(T)} \end{bmatrix}$$

and

$$\hat{A} = \begin{bmatrix} \tilde{A}(11)\,I & \tilde{A}(12)\,I & \cdots & \tilde{A}(1T)\,I\\ \tilde{A}(21)\,I & \tilde{A}(22)\,I & \cdots & \tilde{A}(2T)\,I\\ \vdots & \vdots & \ddots & \vdots\\ \tilde{A}(T1)\,I & \tilde{A}(T2)\,I & \cdots & \tilde{A}(TT)\,I \end{bmatrix}.$$
Here, C(ω) in (4) consists of two components: Ĉ and ωÂ. The former (block-diagonal) component, Diag[C(1), …, C(T)], is the set of T weighted centrality matrices of the individual layers, and the latter (off-diagonal) component, Â = Ã ⊗ I, encodes the uniform and diagonal coupling with strength parameter ω between the time layers, where Ã is defined in (2), ⊗ is the Kronecker product, and I is the N × N identity matrix, since we only consider coupling between instances of the same node in consecutive pairs of layers. We study the dominant eigenvector V(ω) of C(ω) corresponding to the largest eigenvalue λ_max:

$$C(\omega)\, V(\omega) = \lambda_{\max}(\omega)\, V(\omega).$$
The elements of the dominant right eigenvector V(ω) are interpreted as scores measuring the importance of node–layer pairs (v, t). The joint centrality W_{v,t}(ω) in Equation (6) is the entry of V(ω) corresponding to node v at layer t, where V_{N(t−1)+n}(ω) is the (N(t−1)+n)-th entry of V(ω) and n is the index of node v_n (the subscript n is omitted elsewhere for simplicity):

$$W_{v,t}(\omega) = V_{N(t-1)+n}(\omega).$$
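To make this concrete, a minimal sketch (assuming NumPy and SciPy; function and variable names are illustrative, not from the paper's code) assembles C(ω) via the Kronecker product and extracts the dominant eigenpair whose entries give the joint centralities:

```python
import numpy as np
from scipy.linalg import block_diag
from scipy.sparse.linalg import eigs

def supracentrality_matrix(layer_centralities, omega, delta=1):
    """Build C(omega) = C_hat + omega * (A_tilde kron I) from per-layer
    centrality matrices (here C(A) = A, as chosen in the text)."""
    T = len(layer_centralities)
    N = layer_centralities[0].shape[0]
    C_hat = block_diag(*layer_centralities)        # Diag[C(1), ..., C(T)]
    A_tilde = np.zeros((T, T))                     # undirected chain coupling
    for t in range(T - delta):
        A_tilde[t, t + delta] = A_tilde[t + delta, t] = 1.0
    return C_hat + omega * np.kron(A_tilde, np.eye(N))

# toy example: 4 layers of a 5-node graph in the strong-coupling regime
layers = [np.random.rand(5, 5) for _ in range(4)]
C = supracentrality_matrix(layers, omega=50.0)
lam, V = eigs(C, k=1, which='LM')                  # dominant eigenpair of Eq. (5)
joint = np.abs(V[:, 0].real)                       # W_{v,t} = V_{N(t-1)+n}(omega)
```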
Inspired by probability theory, the authors in [16] defined the marginal layer centrality (MLC) and the conditional centrality in Equations (7) and (8) below. The MLC aggregates the joint centralities of all nodes at time layer t. The conditional centrality of node v reflects the importance of the node relative to the other nodes at layer t:

$$\mathrm{MLC}_t(\omega) = \sum_{v=1}^{|V|} W_{v,t}(\omega),$$

$$CC_{v,t}(\omega) = \frac{W_{v,t}(\omega)}{\mathrm{MLC}_t(\omega)}.$$
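From the joint centralities, Equations (7) and (8) follow by a reshape and a normalization; a small NumPy sketch (names ours):

```python
import numpy as np

def marginal_and_conditional(joint, N, T):
    """joint: length N*T vector ordered layer by layer (entry N*(t-1)+n)."""
    W = joint.reshape(T, N).T   # W[v, t]: joint centrality of node v at layer t
    MLC = W.sum(axis=0)         # Eq. (7): aggregate joint centrality per layer
    CC = W / MLC                # Eq. (8): importance of v relative to layer t
    return MLC, CC
```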

Choice of ω for Steady-State Node Ranking

A number of factors should be taken into account when choosing the value of ω; the choice should suit the targeted application or research question.
For large ω, the centrality measures change slowly over time. Conversely, for small ω, CC_v(ω) fluctuates from one time layer to the next.
We consider ω in the strong-coupling regime. If ω is sufficiently large (e.g., ω > 100), CC_v(ω) becomes stationary and equals the average of CC_v(ω) over all time layers [18]. Moreover, the dominant eigenvalue of C(ω) converges to that of ωÂ; that is, λ_max(ω) → μ̃₁, where λ_max(ω) and μ̃₁ are the dominant eigenvalues of C(ω) and ωÂ, respectively.
Using the power iteration method for eigenvector computation [19], we find the minimum value of ω that satisfies the above requirements (Algorithm 1). In the next section, we utilize the stationary CC_v(ω) to identify the highly important nodes belonging to P-GNN anchor sets, and then exploit it to aggregate information across the anchor sets.
Algorithm 1 Find the appropriate value of ω
Require: an initial value of ω (we start from ω = 10); construct B = ωÂ and C = C(ω).
 Choose initial estimates b_k, c_k of the dominant eigenvectors of B and C; we set b_k, c_k = [1 1 … 1]^T, with length equal to the number of columns of B and C, which is N × T in our case.
 b_k ← b_k/‖b_k‖, c_k ← c_k/‖c_k‖
 λ_k ← b_k* B b_k, μ_k ← c_k* C c_k
 k ← 0
 while |λ_k − μ_k| ≥ 0.01 do
   while |λ_{k+1} − λ_k| ≥ 0.01 do      ▹ power iteration on B
     k ← k + 1
     b_k ← B b_{k−1}/‖B b_{k−1}‖
     λ_k ← b_k* B b_k
   end while
   while |μ_{k+1} − μ_k| ≥ 0.01 do      ▹ power iteration on C
     k ← k + 1
     c_k ← C c_{k−1}/‖C c_{k−1}‖
     μ_k ← c_k* C c_k
   end while
   increase ω and rebuild B = ωÂ, C = C(ω)
 end while
 return ω
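In code, Algorithm 1 amounts to running power iteration on C(ω) and ωÂ and enlarging ω until their dominant eigenvalues agree. The sketch below (NumPy) uses a step size and tolerance that are our assumptions rather than values from the paper:

```python
import numpy as np

def power_iteration(M, tol=1e-2, max_iter=1000):
    b = np.ones(M.shape[0])                # b_k = [1 1 ... 1], as in Algorithm 1
    lam = 0.0
    for _ in range(max_iter):
        b = M @ b
        b /= np.linalg.norm(b)             # normalize the eigenvector estimate
        lam_new = b @ (M @ b)              # Rayleigh-quotient eigenvalue estimate
        if abs(lam_new - lam) < tol:       # eigenvalue estimate has converged
            return lam_new, b
        lam = lam_new
    return lam, b

def find_omega(C_hat, A_hat, omega=10.0, step=10.0, tol=1e-2):
    while True:
        lam, _ = power_iteration(C_hat + omega * A_hat)  # lambda_max of C(omega)
        mu, _ = power_iteration(omega * A_hat)           # dominant eigenvalue of omega*A_hat
        if abs(lam - mu) < tol:            # stationary regime reached
            return omega
        omega += step
```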

2.3. TMP-GNN: Multilayer Position-Aware-Based Graph Neural Network

Several GNN methods are distinguished by the way they aggregate information from a node's neighborhood to form its representation at each layer k. The approach can be summarized by two functions, AGGREGATE and COMBINE [20]: the former decides how information is aggregated from adjacent nodes, and the latter updates the node representation from layer k − 1 to layer k. For some GNNs, the two functions can be combined and represented jointly, as follows:

$$a_{v}^{(k)} = \mathrm{AGGREGATE}^{(k)}\big(\{h_{u}^{(k-1)} : u \in \mathrm{Neighbors}(v)\}\big), \qquad h_{v}^{(k)} = \mathrm{COMBINE}^{(k)}\big(h_{v}^{(k-1)},\ a_{v}^{(k)}\big),$$

where h_v^(k) is the feature representation vector of node v at iteration k (the k-th GNN layer), and Neighbors(v) is the set of nodes adjacent to v in G.
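For illustration, a minimal sketch (PyTorch; sizes and names are illustrative assumptions) of one GNN layer in this AGGREGATE/COMBINE form, using mean aggregation:

```python
import torch
import torch.nn as nn

class MeanAggLayer(nn.Module):
    """One GNN layer in AGGREGATE/COMBINE form: mean-aggregate neighbor
    states, then combine with the node's own previous state."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.combine = nn.Linear(2 * in_dim, out_dim)

    def forward(self, h, neighbors):
        # h: [num_nodes, in_dim]; neighbors[v]: LongTensor of v's neighbor ids
        # (assumed non-empty for every node in this illustration).
        agg = torch.stack([h[nbr].mean(dim=0) for nbr in neighbors])   # AGGREGATE
        return torch.relu(self.combine(torch.cat([h, agg], dim=1)))   # COMBINE

# toy usage: 3 nodes on a path 0-1-2
h = torch.randn(3, 4)
neighbors = [torch.tensor([1]), torch.tensor([0, 2]), torch.tensor([1])]
print(MeanAggLayer(4, 8)(h, neighbors).shape)  # torch.Size([3, 8])
```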
We use a position-aware GNN which, instead of aggregating information from the nearest neighbors, aggregates the positional and feature information of each node with respect to randomly selected sets of nodes called anchor sets. The computed message is then aggregated across the anchor sets for each node to incorporate the global topological information of the graph into the node embedding. Figure 2 illustrates in detail how the node representation is learned through one P-GNN layer. The goal is to find the best position-aware embedding z_v^t, with minimum distortion, for a given node v at time layer t.
We have made the following modifications to P-GNN:
  • Generalization of P-GNN to Time-Varying Graphs:
    We adopt as the input of P-GNN the supracentrality matrix C(ω), which represents a temporal multilayer graph with N × T nodes. We compute an embedding for all nodes in all time layers, since e_uv, x_uv and w_v defined in (1) can change from t to t′. The embedding z_v^t is later combined with an RNN-based representation to estimate missing data.
  • Modification of Message Computation Function (F):
    In an Intelligent Transportation System (ITS), the average speeds of neighboring nodes may correlate more or less strongly at different time layers due to a variety of factors, e.g., different types of residential zones, special events, accidents, etc. Using an attention mechanism while computing an anchor set's message with respect to a given node v can prevent misinformative messages from the anchor set from influencing node v's embedding. We therefore use the attention mechanism to learn the relative weights between the feature vector of v and that of its nearest neighbor from the anchor set. The same effect, to varying degrees, applies to other application domains as well. As such, we modify our message computation function to incorporate the attention mechanism (a code sketch of this function and of the modified aggregation follows this list).
    In P-GNN, we have multi-level aggregation, as demonstrated in Figure 2. First, the message of each node v is computed with regard to each anchor set R_j through a function F. From each anchor set, only the nodes within a two-hop distance of node v are considered for message computation, as shown in the following equation:

    $$M_j = \big\{F(u, v, h_u^{k-1}, h_v^{k-1}) \mid u \in R_j,\ R_j \in R,\ R \subset V,\ d(v, R_j) = \min_{u \in R_j} d(u, v) \le 2\big\}, \qquad M_v^k[j] = \mathrm{MEAN}(M_j),$$
    where M_v^k[j] indicates the aggregated message of v corresponding to R_j, computed as the average of the individual outputs of F(u, v, h_u^{k−1}, h_v^{k−1}), with F defined as:

    $$F(u, v, h_u^{k-1}, h_v^{k-1}) = \frac{1}{d_{sp}^{q}(u, v) + 1}\, a_v,$$

    where d_sp^q(u, v) is defined as:

    $$d_{sp}^{q}(u, v) = \begin{cases} d_{sp}(u, v), & \text{if } d_{sp}(u, v) \le q,\\ \infty, & \text{otherwise,} \end{cases}$$

    where d_sp(u, v) is the shortest-path distance between the pair of nodes (u, v), and we write a_v = w_c(CONCAT(c_v, h_v)), where the context vector is c_v = Σ_u α_vu h_u and α_vu is computed as follows:

    $$\alpha_{vu} = \frac{\exp(\mathrm{score}(h_v^{k-1}, h_u^{k-1}))}{\sum_{u'} \exp(\mathrm{score}(h_v^{k-1}, h_{u'}^{k-1}))},$$

    where the score is calculated as:

    $$\mathrm{score}(h_v^{k-1}, h_u^{k-1}) = V^{T} \tanh(W_1 h_v^{k-1} + W_2 h_u^{k-1}),$$

    where V, W_1, W_2 and w_c are trainable weights, and c_v and a_v are the context vector and attention coefficient, respectively. This formulation is inspired by [22].
  • Modification of AGGR:
    In P-GNN, M_v[j], associated with anchor set j, is averaged across the anchor sets to generate h_v. We choose instead to differentiate nodes based on their conditional centrality in the stationary regime (CC_v(ω) as ω → ∞), since a higher conditional eigenvector centrality indicates a higher influence of a given node v and its surrounding nodes compared to nodes with lower eigenvector centrality. Moreover, the informative anchor sets contain at least one node within two hops of node v, and those with higher CC_v(ω) deserve higher weights in the aggregation. As such, we substitute AGGR by the weighted mean of M_v[j], with weights proportional to the stationary CC_v(ω), as follows:

    $$h_v^{(k)} = \frac{1}{J} \sum_{j=1}^{J} r_j\, M_v[j],$$

    where r_j is calculated as:

    $$r_j = \frac{\sum_{u \in R_j} CC_u^{j}(\omega)}{\sum_{j'=1}^{J} \sum_{u \in R_{j'}} CC_u^{j'}(\omega)}\Bigg|_{\omega \to \infty}, \qquad d(v, u) \le 2,$$

    where J is the number of anchor sets. The conditional node centralities in the stationary regime are computed separately from the main algorithm, and the results are used in AGGR to aggregate the computed messages across the anchor sets.
    Anchor sets are selected randomly based on Bourgain's theorem; they come in different sizes that are distributed exponentially. M is computed for node v from anchor set R_j if R_j hits v, meaning d(v, R_j) ≤ 1. Large anchor sets have a higher probability of hitting v but are less informative about the node's positional information, since v hits at least one of many nodes in the set. Small anchor sets, on the other hand, have a lower chance of hitting v but provide positional information with high certainty [15]. In terms of message communication, each node in a P-GNN layer shares messages with O(log² n) anchor sets for a graph with n nodes. Suppose we have a temporal graph with n = |V| × T and an average anchor set size of m; the total number of communicated messages is then O(mn log² n). This is because each node in TMP-GNN only communicates with nodes that are at most two hops away within each anchor set R_j, so m remains limited (1 ≤ m ≪ n).
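As referenced in the list above, the following compact sketch (PyTorch; the class name, tensor shapes and the offline-computed anchor-set weights r are illustrative assumptions, not the authors' released code) implements the attention-based message function F and the centrality-weighted aggregation across anchor sets:

```python
import torch
import torch.nn as nn

class AttentiveMessage(nn.Module):
    """Message function F: attention-weighted combination of v's state and
    the anchor nodes' states, scaled by 1 / (shortest-path distance + 1)."""
    def __init__(self, dim):
        super().__init__()
        self.W1 = nn.Linear(dim, dim, bias=False)
        self.W2 = nn.Linear(dim, dim, bias=False)
        self.V = nn.Linear(dim, 1, bias=False)
        self.wc = nn.Linear(2 * dim, dim)

    def forward(self, h_v, h_anchor, d_sp):
        # h_v: [dim]; h_anchor: [n_u, dim], anchor nodes within q = 2 hops of v;
        # d_sp: [n_u], shortest-path distances d_sp^q(u, v).
        scores = self.V(torch.tanh(self.W1(h_v) + self.W2(h_anchor))).squeeze(-1)
        alpha = torch.softmax(scores, dim=0)                  # attention weights
        c_v = (alpha.unsqueeze(-1) * h_anchor).sum(dim=0)     # context vector
        a_v = self.wc(torch.cat([c_v, h_v]))                  # attention coefficient
        msgs = a_v.unsqueeze(0) / (d_sp.unsqueeze(-1) + 1.0)  # F per anchor node
        return msgs.mean(dim=0)                               # AGG_M: M_v[j]

def weighted_aggr(M_v, r):
    # Modified AGG_R: M_v is [J, dim]; r holds the stationary conditional-
    # centrality weights r_j (computed offline) and sums to 1.
    return (r.unsqueeze(-1) * M_v).mean(dim=0)

f = AttentiveMessage(dim=8)
M_v = torch.stack([f(torch.randn(8), torch.randn(3, 8), torch.tensor([1., 2., 2.]))
                   for _ in range(4)])                        # J = 4 anchor sets
h_v = weighted_aggr(M_v, torch.full((4,), 0.25))
```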
At the end of the last P-GNN layer, the node embedding z_v is calculated by applying a nonlinearity σ to the inner product of a weight vector w and M_v, i.e., z_v = σ(w · M_v). The output embeddings of all nodes at all time layers t can be arranged into a supraembedding matrix Z as follows:

$$Z = \begin{bmatrix} z_{v_1}^{1} & z_{v_1}^{2} & \cdots & z_{v_1}^{T}\\ z_{v_2}^{1} & z_{v_2}^{2} & \cdots & z_{v_2}^{T}\\ \vdots & \vdots & \ddots & \vdots\\ z_{v_{|V|}}^{1} & z_{v_{|V|}}^{2} & \cdots & z_{v_{|V|}}^{T} \end{bmatrix},$$
where z_v^t is the embedding vector of node v at time t. In some scenarios, an edge embedding is needed rather than a node embedding. We estimate the embedding of e_uv by averaging the embeddings of its end nodes:

$$z_{uv} = \frac{1}{2}(z_u + z_v).$$
We utilize the result of TMP-GNN embedding for missing data estimation. Before proceeding to our proposed architecture, we briefly review the other components of our framework for missing data.

2.4. Bi-Directional Recurrent Neural Network (Bi-GRU)

We use a bi-directional GRU as a component in our proposed architecture to estimate missing points in the temporal multilayer graph G. The Bi-GRU [23] (Figure 3) captures the temporal correlation within a time layer in both forward and backward directions. The input to the Bi-GRU is a triplet of arrays Y_e, M_e and Δ_e produced from the graph feature set X = {x_uv^t : u, v ∈ V, t ∈ ℕ_T}, where x_uv^t is the feature associated with e_uv at time layer t. We sometimes refer to x_uv as x_e interchangeably, where e can be any edge in the graph. We also assume that the set of graph nodes is constant, although the number of edges and their corresponding features may change from one layer to another. We set x_uv^t to 0 if e_uv does not exist at time t (ρ(e_uv × t) = 0). We also randomly remove observed points and set them to 0 in Y_e, to serve as held-out missing values. M_e is the mask array of 0 s and 1 s, which indicates the coordinates of missing and observed points, respectively, corresponding to Y_e. Additionally, each element of Δ_e gives the time difference between the current layer and the last layer at which a measurement was recorded; Δ_e is defined to handle the different sampling rates associated with data heterogeneity from different sources [24,25]. Each edge carries D streams of features, and we use x_e^d to represent stream d of the features associated with edge e in graph G.
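For illustration, a minimal NumPy sketch of constructing the triplet for one edge (NaN marking missing entries is our convention here, as is the function name):

```python
import numpy as np

def make_triplet(X, dt=1.0):
    """X: [T, D] feature streams for one edge, NaN = missing. Returns the
    zero-filled observations Y, the 0/1 mask M and the time-gap array Delta."""
    M = (~np.isnan(X)).astype(float)   # 1 = observed, 0 = missing
    Y = np.nan_to_num(X, nan=0.0)      # missing points set to 0
    Delta = np.zeros_like(Y)
    for t in range(1, X.shape[0]):
        # time since the last recorded measurement: reset after an observation,
        # accumulate across consecutive missing steps
        Delta[t] = np.where(M[t - 1] == 1, dt, Delta[t - 1] + dt)
    return Y, M, Delta
```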
Our goal is to find the best estimate x̂_ed^t, with minimum RMSE, for a particular missing point by solving the following optimization problem for the estimator function f:

$$\min_{f}\ \mathbb{E}_{F}\left[\sum_{t=1}^{T} \sum_{d=1}^{D} (1 - m_{ed}^{t})\, \mathcal{L}(X_{ed}^{t}, \hat{X}_{ed}^{t})\right],$$

where the loss function is defined as L(X_ed^t, X̂_ed^t) = (x_ed^t − f_ed^t(X, T))², with X = {x_ed^t : ∀ e, d, t} and T = {1, 2, …, T}. Here, 𝔼, f and F denote the expectation, the desired estimator function and the unknown probability distribution from which the records are sampled, respectively. X_ed^t denotes stream d of all edges' features at time t, and m_ed^t is an element of the mask array M_e defined above.
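This objective can be implemented as a masked loss evaluated only at the held-out points; a short PyTorch sketch (names ours):

```python
import torch

def masked_mse(x_true, x_hat, mask):
    """mask: 1 = observed, 0 = missing. The objective above weights errors by
    (1 - m), so only artificially removed points, for which ground truth is
    known, contribute to the loss."""
    held_out = 1.0 - mask
    denom = held_out.sum().clamp(min=1.0)   # guard against an all-observed batch
    return (held_out * (x_true - x_hat) ** 2).sum() / denom
```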

2.5. E-TMP-GNN: Extended TMP-GNN for Missing Data Estimation

Figure 4 and Figure 5 illustrate our proposed hybrid architectures combining TMP-GNN and Bi-GRU: E-TMP-GNN I and E-TMP-GNN II. In E-TMP-GNN I, the input is the supracentrality matrix C(ω), which encodes the intra-layer and inter-layer edge weights of the multilayer graph G. The TMP-GNN output Z encodes the topological structure and the positional and feature information across nodes within a layer, and between instances of the same node in consecutive layers. The resulting d-dimensional embedding Z is used to calculate the edge embedding as in (16), which is then passed to the input of the Bi-GRU, where it is merged with the existing edge features to form the following array:

$$X_{\text{TMP-GNN}} = \{\, x_{e}^{d_i}(t),\ z_{e}^{d_i}(t) \mid i \in \mathbb{N}_D,\ e \in E(t),\ \rho(e \times t) = 1 \,\}.$$
The Bi-GRU is an RNN variant in which the output of the previous layer is part of the current layer's input. This characteristic allows information to propagate step by step [23] and enables the Bi-GRU to capture long-term temporal correlations of edges in two directions, which is advantageous for missing data imputation. We use one layer of Bi-GRU and form the triplet of arrays M_e, Y_e and Δ_e, as explained earlier in this section. An initial estimate, x̃_ed^t, is obtained by applying a nonlinear activation function to the last hidden states of the Bi-GRU, h_hybrid_ed(t):

$$\tilde{x}_{ed}^{\,t} = \sigma\big(U_d\,[\overrightarrow{h}_{\text{hybrid}_{ed}}(t);\ \overleftarrow{h}_{\text{hybrid}_{ed}}(t)] + g_d\big),$$

where the two arrows indicate the forward and backward directions of the Bi-GRU, and U_d and g_d are weight and bias vectors. In order to capture the correlation among the sequences x_e^{d_i}, i = 1, …, D, a fully connected layer with hidden dimension equal to the number of feature streams D is placed afterward. The output of this layer, x̂_ed^t, is the final estimate of x_ed^t:

$$\hat{x}(t) = \sigma(w\, h(t) + \beta_0),$$

where h(t) is defined as:

$$h(t) = \phi(U\, x(t) + V\, Y(t) + \gamma_0),$$

where Y(t) = [x(t), M(t)]. The Bi-GRU and FC components perform interpolation and imputation, respectively, as they capture the dynamics within a time layer ({x_ed^t, t ∈ ℕ_T}) and across the streams of features ({{x, z}_ed(t), d = 1, 2, …, D}) simultaneously. The combination of Bi-GRU and FC, named the Multi-directional RNN (M-RNN), is inspired by [24] (see Figure 6).
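A minimal end-to-end sketch (PyTorch; layer sizes, names and the sigmoid output nonlinearity are our assumptions) of this interpolation-then-imputation stage:

```python
import torch
import torch.nn as nn

class BiGRUImputer(nn.Module):
    """Bi-GRU interpolation over time followed by a fully connected layer
    across feature streams, in the spirit of the equations above."""
    def __init__(self, d_streams, hidden):
        super().__init__()
        self.gru = nn.GRU(3 * d_streams, hidden, bidirectional=True, batch_first=True)
        self.interp = nn.Linear(2 * hidden, d_streams)     # x_tilde from [h_fwd; h_bwd]
        self.impute = nn.Linear(2 * d_streams, d_streams)  # FC across the D streams

    def forward(self, Y, M, Delta):
        inp = torch.cat([Y, M, Delta], dim=-1)             # triplet input per step
        h, _ = self.gru(inp)                               # [batch, T, 2*hidden]
        x_tilde = torch.sigmoid(self.interp(h))            # initial estimate
        return torch.sigmoid(self.impute(torch.cat([x_tilde, M], dim=-1)))

imp = BiGRUImputer(d_streams=2, hidden=16)
Y = torch.rand(4, 10, 2)
M = (torch.rand(4, 10, 2) > 0.2).float()
Delta = torch.rand(4, 10, 2)
print(imp(Y, M, Delta).shape)   # torch.Size([4, 10, 2])
```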
Our second proposed architecture is illustrated in Figure 5. As can be seen, the two pipelines have distinct inputs: the input to TMP-GNN is the temporal multilayer network represented by the supracentrality matrix C(ω), while the Bi-GRU is fed with X. The last hidden representation h_TMP-GNN_e(t) output by TMP-GNN is passed through a fully connected layer to reduce its dimension:

$$h'_{\text{TMP-GNN}_e}(t) = \sigma(w_p\, h_{\text{TMP-GNN}_e}(t) + b_p),$$

where w_p and b_p are weight and bias vectors, respectively. The output is then merged with $h_{\text{Bi-GRU}_e}(t) = [\overrightarrow{h}_{\text{Bi-GRU}_e}(t);\ \overleftarrow{h}_{\text{Bi-GRU}_e}(t)]$ through softmax attention, generating h_pgb_e as follows:

$$w'_p,\ w'_b = \mathrm{Softmax}(w_p, w_b),$$

$$h_{\text{pgb}_e}(t) = w'_p\, h'_{\text{TMP-GNN}_e}(t) + w'_b\, h_{\text{Bi-GRU}_e}(t),$$

and an initial estimate of the missing value is made as:

$$\tilde{x}_{ed}(t) = \sigma(Q_d\, h_{\text{pgb}_{ed}} + \alpha_0^{d}).$$

A fully connected neural network is used at the end to finalize the prediction of the missing values:

$$\hat{x}_{ed}(t) = \sigma(w_1\, h_{\text{pgb}_{ed}} + \gamma_1).$$
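A minimal sketch (PyTorch; dimensions and names are ours, not the released code) of this fusion head:

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Project the TMP-GNN embedding down to the Bi-GRU width, then fuse the
    two representations with softmax-normalized scalar weights."""
    def __init__(self, d_gnn, d_rnn, d_out):
        super().__init__()
        self.proj = nn.Linear(d_gnn, d_rnn)    # dimension-reducing FC layer
        self.w = nn.Parameter(torch.zeros(2))  # [w_p, w_b] before the softmax
        self.out = nn.Linear(d_rnn, d_out)     # final FC prediction layer

    def forward(self, h_gnn, h_rnn):
        h_p = torch.sigmoid(self.proj(h_gnn))
        w = torch.softmax(self.w, dim=0)       # w'_p, w'_b
        h_pgb = w[0] * h_p + w[1] * h_rnn      # attentive merge
        return torch.sigmoid(self.out(h_pgb))  # estimate of the missing value

head = FusionHead(d_gnn=64, d_rnn=32, d_out=2)
print(head(torch.randn(5, 64), torch.randn(5, 32)).shape)  # torch.Size([5, 2])
```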
E-TMP-GNN I aims to extract additional features from the embedding yielded by TMP-GNN and uses them to further enrich the edge feature set. We demonstrate in Section 3 that the added streams provide a significant improvement in MAE compared to the baseline. E-TMP-GNN II, by contrast, aims to reduce the number of feature streams by implementing a softmax layer that takes h_Bi-GRU_e(t) and h_TMP-GNN_e(t) as inputs, since h_TMP-GNN_e(t) is expected to contain valuable information about the temporal correlation between the same nodes in consecutive time layers.

3. Performance Evaluation

We divide this section into three parts. First, we review our real-world datasets and their associated characteristics in depth. Then, we discuss two potential inputs to our proposed TMP-GNN pipeline and their impact on node embedding performance. Finally, we evaluate the performance of E-TMP-GNN I and E-TMP-GNN II on missing data estimation. We implemented E-TMP-GNN I and E-TMP-GNN II in Python using NetworkX (https://github.com/networkx, accessed on 9 April 2022), PyTorch (https://github.com/pytorch, accessed on 9 April 2022) and TensorFlow (https://github.com/tensorflow/tensorflow, accessed on 9 April 2022). We built on the P-GNN implementation [15] and added various components to it.

3.1. Datasets

We run our experiments on four datasets: TomTom, COVID-19 Mobility (Mobility) [26], PhD Exchange (PhD) and the Seattle Inductive Loop Detector (Loop Detector).
TomTom Dataset: This dataset [27] contains the space mean speed and average travel time during peak congestion for hundreds of approximately one-km-long road segments across the Greater Toronto Area (GTA). The measurements are collected on a one-minute basis during congestion, where the average space mean speed is less than 70% of the vehicular speed under free-flow conditions. We select a three-hour interval from 4:30 p.m. to 7:30 p.m. on Thursday, 8 September 2016, which is expected to have a high proportion of collected records. Within this interval, we select road segments with a significant number of mutual timestamps, at which measurements for a sufficient number of segments were conducted. From a multilayer-network point of view, we select 10 layers of the GTA road network with 1867 nodes and 985 edges, where each node is the start/end point of a road segment and each edge is a road segment. We define the edge features as the ratios (space mean speed)/(free-flow speed) and (free-flow travel time)/(travel time), both in the range [0, 1]. To reduce the correlation between consecutive records and better assess the performance of pair-wise node classification in TMP-GNN and of missing data estimation, the records are downsampled at a rate of 8:1 (δ = 1/8). The edge weights are also normalized so that the features are comparable during the learning process. Moreover, the weighted adjacency matrix is created using the highly accurate geolocations of the road segments in the GTA. Once the node embeddings are calculated through TMP-GNN, the edge embedding is estimated by averaging the embeddings of the nodes the edge connects, as in (17). The resulting embedding is then used for missing data estimation.
PhD Dataset: This case study utilizes a network that represents the exchange of PhD graduates between universities in the mathematical sciences. The features we use in our experiments include the graduation year of each student who obtained a doctoral degree, his/her official academic advisor(s) and the degree-granting university. The aim of former studies was to estimate the flow of doctorate exchanges between universities, i.e., individuals who graduate from university a and are then hired at university b. We consider the years 1950–2010, which results in T = 61 time layers, over which there is a set of N = 231 connected universities in the United States. To build the graph, a directed intra-layer edge represents a doctoral degree obtained by a graduate from university a in year t who is later hired as a professor at university b and has advised at least one student there. In our case, the edge weight indicates the number of doctoral graduates from university a in year t who were later hired at university b as faculty members. The edge direction opposes the flow of PhD graduates moving between universities (b → a) [16,28].
Loop Detector Dataset: Our third case study is the data collected by inductive loop detectors on freeways in the Seattle area [3,29]. The dataset contains temporal sequences of the time mean speed of the freeway system. The speed information at a desired reference point (milepost) is averaged from multiple loop detectors on the main lanes of the same direction at that reference point. Measurements are recorded every 5 min. The adjacency matrix calculated in [29] is used to convert the data into an undirected graph in which an edge indicates that a pair of reference points is connected, without specifying the direction of connectivity. We downsampled the dataset at a rate of 2:1 (δ = 1/2) and selected 11 time layers of the graph.
Mobility Dataset: Our last case study is the human mobility flow dataset in the U.S. during the COVID-19 epidemic [26]. To build the graph, a directed intra-layer edge represents the flow of people between a pair of states. Each edge weight indicates the ratio of visits to the population count between the states. We focus on 1–22 January 2020, which leads to 22 time layers.
Table 1 lists the characteristics of the graphs built from the four selected datasets.
Among them, TomTom and Mobility have the highest and the lowest number of nodes, respectively. From an edge density perspective, the graphs ordered from most to least connected are: Mobility, Loop Detector, PhD and TomTom. From a topological point of view, TomTom and Loop Detector are static graphs; that is, E(t) = E(t′) for all t ≠ t′, and only the values of x_uv(t)/w_v(t) change. The PhD and Mobility graphs, however, change across time layers in terms of |V(t)|, x_uv(t) and w_v(t). Of the two, PhD has the higher dynamic rate and becomes the sparsest graph in multiple time layers. Figure 7 shows |E(t)| per layer t for these graphs.
The third row of the table shows the number of nodes in the temporal multilayer graph explained in Section 2.2, which equals |V| × T. TomTom and PhD have the largest numbers of nodes in the temporal graph, given their numbers of available time layers. The fourth row indicates the size of the largest connected component (LCC) of the dynamic graph. This parameter is an important topological invariant of a graph and can be computed using a variant of depth-first search for directed and undirected graphs [30]. Ordering the graphs by the ratio LCC/|V| in descending order gives: Loop Detector, PhD, Mobility and TomTom. The entire Loop Detector undirected graph forms one connected component, whereas TomTom, as a directed graph, has multiple weakly connected components, each as small as 2% of the total |V|. We calculate the number of isolated nodes, i.e., nodes not connected to the largest component, as: no. of isolated nodes = N × T − LCC, which is another indicator of lower node degrees and, thus, of sparser graphs. TomTom has the largest number of isolated nodes, as expected.

3.2. Potential Inputs to TMP-GNN: Single Graph vs. Multiple Graphs

There are several ways in which we can feed P-GNN:
  • Supracentrality Graph C(ω): In this case, we represent the centrality of a temporal multilayer graph via a supracentrality matrix, which is analogous to a single graph with N × T nodes. The main advantage of this approach is that it represents the graph with all weighted intra-layer and inter-layer edges. However, the matrix can become relatively large when T is large. In the next section, we demonstrate that this approach outperforms the multi-graph input.
  • Multiple Graphs: In this approach, a sequence of centrality/adjacency matrices associated with each time layer, { A ( 1 ) , A ( 2 ) , , A ( T ) } , is fed into P-GNN. That is, the multilayer graph is treated as individual instances of a single graph without consideration of inter-layer coupling.
  • Single Graph with Time-Series Edge Features: The input to P-GNN can also be a single graph whose adjacency matrix has the following entries:

    $$a_{uv} = \begin{cases} [x_{uv}(1), x_{uv}(2), \ldots, x_{uv}(T)], & \text{if } u, v \text{ are connected},\\ 0, & \text{otherwise.} \end{cases}$$

    In this approach, we treat the multilayer graph as a single graph in which each element of the adjacency matrix is a time series of features associated with e_uv. In order to capture the temporal characteristics of a node v or edge e_uv, a recurrent neural network would be required before applying the function F in Figure 2, which leads to high computational complexity. We have not implemented this method.
Training: There are two different ways to train the time-varying graph to compute node embeddings. In the case of a single graph, the graph nodes are split into training, evaluation and test sets at 80%, 10% and 10%, respectively. The classification accuracy on the test set is recorded once the best performance on the evaluation set is reached. For multiple graphs, there are two training techniques: first, 80% of the graphs can be used for training and the remainder for the evaluation and test sets; second, 80% of the nodes of all available graphs are used for training, 10% of the remaining nodes in all time layers are used for testing, and an equal number for the validation set. Although both techniques can be used for time-invariant multiple graphs, e.g., Protein [15,31], we adopt the second option for our multilayer graph task (Figure 8; a split sketch follows below). It enables us to capture change trajectories throughout the time layers, whereas the first approach splits the graphs without considering the time ordering, and the behavioral changes of the nodes in 20% of the graphs are never observed during training. To the best of our knowledge, this is the first time that such an experimental setting has been used for node embedding in multiple graphs.
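A minimal sketch (NumPy) of the second technique, reusing one 80/10/10 node partition across all time layers:

```python
import numpy as np

def split_nodes(num_nodes, seed=0):
    """80/10/10 node split; the same node indices are used in every time
    layer, so each node's whole trajectory stays in one partition."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(num_nodes)
    n_train, n_val = int(0.8 * num_nodes), int(0.1 * num_nodes)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train, val, test = split_nodes(231)   # e.g., the PhD graph with |V| = 231
```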
Inductive vs. Transductive Learning: There are two settings for learning node embeddings: inductive and transductive. In the latter, the graph is trained with a fixed node ordering, and the model must be retrained when nodes are added, the ordering changes, or a new graph is given. Moreover, node attributes are augmented with one-hot identifiers, which can restrict the generalization ability of the model [32]. The authors in [15] assess the transductive learning performance of P-GNN on a link prediction task, i.e., whether a pair of nodes is connected. In inductive learning, by contrast, the learned positional information can be transferred to an unseen graph. In particular, instead of augmenting node attributes with one-hot identifiers, an order-invariant scalar can be assigned to nodes whenever node attributes are unavailable. We choose to train TMP-GNN in an inductive learning setting and use the pair-wise node classification task, as in [15], to demonstrate the performance of the TMP-GNN embedding.
As discussed in Section 2, our goal is to compute embeddings for time-varying graphs with potential changes in edge existence ρ(E × t) or in the intra-layer edge weights represented in A(t). Pair-wise node classification is therefore a more challenging task than link prediction, since the node class is determined by the status of its surrounding edges and/or by the status of other isomorphic nodes, both of which are dynamic in such graphs. When the learning task also requires node positional information, structure-aware embeddings alone are not sufficient.
Figure 9 illustrates the ROC AUC of TMP-GNN for pair-wise node classification under the two input types of Section 3.2: a sequence of individual graphs, one per time layer (Figure 9a), and a single supragraph (Figure 9b). All datasets show significantly better performance with the single supragraph input than with the multi-graph input, indicating that the single supragraph better represents the chronological structure of a multilayer temporal graph. For the same reason, the performance improvement achieved by TMP-GNN is significantly larger in the case of a single supragraph. For TomTom, Loop Detector and PhD, the other four GNNs perform relatively similarly. As for Mobility, all five GNN-based approaches perform well, with relatively negligible differences in classification accuracy. This could be due to its lower number of nodes, its lower dynamics in the number of edges per time layer |E(t)| (shown in Figure 7) and its stronger connectivity within time layers compared to the other case studies, which leaves less room for the expressive power of TMP-GNN to stand out. TMP-GNN demonstrates the largest AUC improvement from multi-graph to single supragraph on TomTom. As noted in Section 3.1, TomTom has the highest number of nodes, over five times more than PhD, the second largest graph, together with the lowest degree of connectivity and the largest number of isolated nodes, which further underlines the discriminating power of TMP-GNN. It is worth mentioning that the TomTom temporal graph, with its relatively large number of nodes, requires more message passing in the TMP-GNN layer, as described in Section 2.3, and thus more computational effort than the other three datasets.
For three out of four case studies in the multi-graph implementation (Figure 9a), the best performer is TMP-GNN and the second best is SAGE. The reason could be the similarity in the message computation function F, where the feature information of the two nodes (h_v, h_u), or the attended message from u towards node v, is concatenated as in (11).
Figure 10 shows the impact of edge feature availability on the ROC AUC of all five GNN-based models for the Loop Detector dataset. From a topological point of view, Loop Detector is an undirected static graph. When edge features are not available, all GNNs perform relatively the same. When the time-varying edge attributes x_uv(t) are taken into account, it becomes a dynamic graph in which the node class also depends on time-varying attributes, and TMP-GNN demonstrates its superiority over the other widely used structure-aware embeddings. This is where the power of TMP-GNN for time-varying graphs is highlighted.
Table 2 shows the percentage of ROC AUC improvement achieved by TMP-GNN for the multi-graph and single-supragraph implementations, as well as the decrease in MAE achieved by E-TMP-GNN I and II compared to the baselines, for all four datasets. We define the η̂ metric in the following equation as a measure of the improvement:

$$\hat{\eta}_y^{\text{metric}} = \frac{\left|\eta_y^{\text{metric}} - \eta_{y'}^{\text{metric}}\right|}{\eta_{y'}^{\text{metric}}} \times 100,$$

where η denotes performance, metric indicates ROC AUC or MAE depending on the experiment, and |·| is the absolute value. y represents TMP-GNN, E-TMP-GNN I or E-TMP-GNN II, and y′ is the baseline method compared against. A value of 0 indicates no improvement.
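In code form, the metric is simply (the example values are hypothetical):

```python
def improvement(eta_y, eta_baseline):
    """Percentage improvement of method y over baseline y'."""
    return abs(eta_y - eta_baseline) / eta_baseline * 100

print(round(improvement(0.93, 0.59), 1))  # hypothetical ROC AUCs -> 57.6
```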
In Figure 11, we evaluate the performance of our proposed architectures, E-TMP-GNN I and E-TMP-GNN II, on missing data estimation, comparing them to an RNN-variant architecture named M-RNN (Figure 6) and to the second-best classifier from Figure 9b, which is CGN, GAT, GIN and SAGE for the TomTom, PhD, Loop and Mobility datasets, respectively. We use the mean absolute error (MAE) as the measure of estimation accuracy and choose to remove values along the temporal dimension. The removed points are chosen according to a uniform distribution in conjunction with a prescribed missing-rate threshold τ.
All case studies experience a higher MAE as τ increases. Moreover, either E-TMP-GNN I or E-TMP-GNN II outperforms both M-RNN and the counterpart GNN in each case. This is due to the high expressive power of the additional features learned through the TMP-GNN representation. These time-varying features prove significantly more effective for missing data estimation than those derived from the second-best node classifier in the previous experiment, since TMP-GNN takes positional information and inter-layer coupling into account simultaneously; the latter makes the architecture more powerful than those using an RNN variant to capture long-term temporal dependencies. TomTom has the highest MAE, followed by PhD, Loop and Mobility, respectively. This follows the same order as the corresponding multilayer graphs when ranked by degree of connectivity: the sparser the graph, the larger the estimation error. In addition, the percentage improvement is larger for TomTom and PhD than for the other two, which means the proposed method is more effective for graphs with weaker connectivity.
For three out of four datasets (TomTom, Loop, Mobility), E-TMP-GNN II outperforms E-TMP-GNN I. For the PhD dataset, E-TMP-GNN I is the best performer. The dynamic nature is more pronounced in PhD and Mobility, since the number of edges changes significantly per time point (Figure 7), in addition to the variation in edge weights present in the other two datasets. In E-TMP-GNN I, the time-aware node embedding generated by TMP-GNN plays a more effective role once added to the existing features of the dataset and fed into the interpolation component implemented by the Bi-GRU, whereas E-TMP-GNN II reduces the dimension of TMP-GNN's time-aware embedding and then combines it with the Bi-GRU's representation in an attentive way. In summary, the dimension reduction and softmax function might reduce the effectiveness of the time-aware embedding, limiting the MAE decrease in missing data estimation. The more time-varying the dataset, the more efficient the time-aware representation must be to estimate missing data points well.

4. Conclusions

We propose TMP-GNN, a generalized version of P-GNN node embedding for temporal multilayer graphs. It takes short-term temporal correlations, as well as feature and positional information, into account. We also incorporate the notion of eigenvector-based centrality to distinguish nodes with a higher influence on their neighbors. We then extend the design to E-TMP-GNN I and II to tackle the missing data challenge in dynamic networks with multiple streams of node/edge features. Experiments conducted on four real-world datasets with diverse characteristics demonstrate significant improvements on node-level classification and missing data estimation tasks. For future work, it would be interesting to formulate the missing data problem directly as an edge representation task.

Author Contributions

Conceptualization, B.N., S.P. and A.L.-G.; methodology, B.N., S.P. and A.L.-G.; software, B.N. and S.P.; validation, B.N., S.P. and A.L.-G.; formal analysis, B.N. and S.P.; investigation, B.N., S.P. and A.L.-G.; resources, A.L.-G.; data curation, B.N. and A.L.-G.; writing—original draft preparation, B.N., S.P. and A.L.-G.; writing—review and editing, B.N., S.P. and A.L.-G.; visualization, B.N., S.P. and A.L.-G.; supervision, A.L.-G.; project administration, B.N. and A.L.-G.; funding acquisition, A.L.-G. All authors have read and agreed to the published version of the manuscript.

Funding

This work and its Article Processing Charge (APC) were funded by A.L.-G.’s University of Toronto operating grant (2020–2022).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

Not applicable.

Acknowledgments

The authors are grateful to the journal reviewers for their constructive comments and suggestions on the first draft of the manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
GNN: Graph Neural Network
TMP: Temporal Multilayer Position-Aware
ROC: Receiver Operating Characteristic
ROC AUC: Area Under the Receiver Operating Characteristic Curve
ITS: Intelligent Transportation System
GRU: Gated Recurrent Unit
MAE: Mean Absolute Error
MLC: Marginal Layer Centrality
RNN: Recurrent Neural Network
GTA: Greater Toronto Area
LCC: Largest Connected Component
CGN: Convolutional Graph Neural Network
GAT: Graph Attention Network
GIN: Graph Isomorphism Network
SAGE: Sample and Aggregate

References

  1. Lv, Y.; Duan, Y.; Kang, W.; Li, Z.; Wang, F.Y. Traffic flow prediction with big data: A deep learning approach. IEEE Trans. Intell. Transp. Syst. 2014, 16, 865–873. [Google Scholar] [CrossRef]
  2. Lin, Y.; Dai, X.; Li, L.; Wang, F. Pattern sensitive prediction of traffic flow based on generative adversarial framework. IEEE Trans. Intell. Transp. Syst. 2018, 20, 2395–2400. [Google Scholar] [CrossRef]
  3. Cui, Z.; Henrickson, K.; Ke, R.; Wang, Y. Traffic graph convolutional recurrent neural network: A deep learning framework for network-scale traffic learning and forecasting. IEEE Trans. Intell. Transp. Syst. 2019, 21, 4883–4894. [Google Scholar] [CrossRef] [Green Version]
  4. Yu, H.; Wu, Z.; Wang, S.; Wang, Y.; Ma, X. Spatiotemporal recurrent convolutional networks for traffic prediction in transportation networks. Sens. Multidiscip. Digit. Publ. Inst. 2017, 17, 1501. [Google Scholar] [CrossRef] [Green Version]
  5. Ghoroghchian, N.; Groppe, D.M.; Genov, R.; Valiante, T.A.; Draper, S.C. Node-Centric Graph Learning From Data for Brain State Identification. IEEE Trans. Signal Inf. Process. Netw. 2020, 6, 120–132. [Google Scholar] [CrossRef]
  6. Zhang, M.; Chen, Y. Link prediction based on graph neural networks. arXiv 2018, arXiv:1802.09691. [Google Scholar]
  7. Liu, S.; Li, T.; Ding, H.; Tang, B.; Wang, X.; Chen, Q.; Yan, J.; Zhou, Y. A hybrid method of recurrent neural network and graph neural network for next-period prescription prediction. Int. J. Mach. Learn. Cybern. 2020, 11, 2849–2856. [Google Scholar] [CrossRef] [PubMed]
  8. Yao, H.; Wu, F.; Ke, J.; Tang, X.; Jia, Y.; Lu, S.; Gong, P.; Ye, J.; Li, Z. Deep multi-view spatial-temporal network for taxi demand prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
  9. Taheri, A.; Gimpel, K.; Berger-Wolf, T. Learning graph representations with recurrent neural network autoencoders. In Proceedings of the KDD’18 Deep Learning Day, London, UK, 19–23 August 2018. [Google Scholar]
  10. Kumar, S.; Zhang, X.; Leskovec, J. Predicting dynamic embedding trajectory in temporal interaction networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 1269–1278. [Google Scholar]
  11. Xue, H.; Yang, L.; Jiang, W.; Wei, Y.; Hu, Y.; Lin, Y. Modeling dynamic heterogeneous network for link prediction using hierarchical attention with temporal rnn. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Wuerzburg, Germany, 16–20 September 2019; pp. 282–298. [Google Scholar]
  12. Xu, D.; Cheng, W.; Luo, D.; Gu, Y.; Liu, X.; Ni, J.; Zong, B.; Chen, H.; Zhang, X. Adaptive neural network for node classification in dynamic networks. In Proceedings of the 2019 IEEE International Conference on Data Mining (ICDM), Beijing, China, 8–11 November 2019; pp. 1402–1407. [Google Scholar]
  13. Xu, D.; Cheng, W.; Luo, D.; Liu, X.; Zhang, X. Spatio-Temporal Attentive RNN for Node Classification in Temporal Attributed Graphs. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19), Macao, China, 10–16 August 2019; pp. 3947–3953. [Google Scholar]
  14. Xie, Y.; Li, C.; Yu, B.; Zhang, C.; Tang, Z. A survey on dynamic network embedding. arXiv 2020, arXiv:2006.08093. [Google Scholar] [CrossRef]
  15. You, J.; Ying, R.; Leskovec, J. Position-aware graph neural networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 7134–7143. [Google Scholar]
  16. Taylor, D.; Myers, S.A.; Clauset, A.; Porter, M.A.; Mucha, P.J. Eigenvector-based centrality measures for temporal networks. Multiscale Model. Simul. 2017, 15, 537–574. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  17. Gleich, D.F. PageRank beyond the (WEB). SIAM Rev. 2015, 57, 321–363. [Google Scholar] [CrossRef]
  18. Shai, S.; Stanley, N.; Granell, C.; Taylor, D.; Mucha, P.J. Case studies in network community detection. In The Oxford Handbook of Social Networks; Oxford University Press: Oxford, UK, 2017. [Google Scholar]
  19. Ipsen, I.; Wills, R.M. Analysis and computation of google’s pagerank. In Proceedings of the 7th IMACS International Symposium on Iterative Methods in Scientific Computing, Fields Institute, Toronto, ON, Canada, 5–8 May 2005; Volume 5. [Google Scholar]
  20. Xu, K.; Hu, W.; Leskovec, J.; Jegelka, S. How powerful are graph neural networks? arXiv 2018, arXiv:1810.00826. [Google Scholar]
  21. Bourgain, J. On Lipschitz embedding of finite metric spaces in Hilbert space. Isr. J. Math. 1985, 52, 46–52. [Google Scholar] [CrossRef]
  22. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
  23. Schuster, M.; Paliwal, K.K. Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 1997, 45, 2673–2681. [Google Scholar] [CrossRef] [Green Version]
  24. Yoon, J.; Zame, W.R.; van der Schaar, M. Estimating missing data in temporal data streams using multi-directional recurrent neural networks. IEEE Trans. Biomed. Eng. 2018, 66, 1477–1490. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  25. Najafi, B.; Parsaeefard, S.; Leon-Garcia, A. Estimation of Missing Data in Intelligent Transportation System. In Proceedings of the 2020 IEEE 92nd Vehicular Technology Conference (VTC2020-Fall), Virtual, 18 November–16 December 2020; pp. 1–6. [Google Scholar]
  26. Kang, Y.; Gao, S.; Liang, Y.; Li, M.; Rao, J.; Kruse, J. Multiscale dynamic human mobility flow dataset in the US during the COVID-19 epidemic. Sci. Data 2020, 7, 1–13. [Google Scholar] [CrossRef]
  27. TomTom Road Analytics. Available online: https://www.tomtom.com/products/road-traffic-data-analytics/ (accessed on 1 January 2022).
  28. Burris, V. The academic caste system: Prestige hierarchies in PhD exchange networks. Am. Sociol. Rev. 2004, 69, 239–264. [Google Scholar] [CrossRef] [Green Version]
  29. Cui, Z.; Ke, R.; Wang, Y. Deep Bidirectional and Unidirectional LSTM Recurrent Neural Network for Network-wide Traffic Speed Prediction. arXiv 2018, arXiv:1801.02143. [Google Scholar]
  30. Tarjan, R. Depth-first search and linear graph algorithms. SIAM J. Comput. 1972, 1, 146–160. [Google Scholar] [CrossRef]
  31. Adhikari, B. DEEPCON: Protein contact prediction using dilated convolutional neural networks with dropout. Bioinformatics 2020, 36, 470–477. [Google Scholar] [CrossRef]
  32. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907. [Google Scholar]
Figure 1. A dynamic multilayer graph.
Figure 2. The input to our TMP-GNN layer is a temporal graph G(t), presented as a supragraph with its corresponding supracentrality matrix C(ω). As the first step, anchor sets R = {R_1, R_2, …, R_J}, subsets of the graph nodes with different sizes, are randomly selected according to Bourgain's theorem [21] (reproduced with written permission). To compute the embedding for v_n, the feature information of node v_n and of another node from anchor set R_j is fed into a message computation function F, whose output is aggregated through AGG_M to produce M_{v_n}[j], j = 1, 2, …, J, corresponding to R_j. Then, M_{v_n}[j] is further aggregated through AGG_R to output h_{v_n}, which is passed to the next P-GNN layer. The embedding z_{v_n} at the last layer is generated by applying the nonlinearity σ to the inner product of M_{v_n} and weight vector w (You, J., Ying, R. and Leskovec, J., 2019 [15]; modified, permission obtained).
Figure 3. Bi-directional gated recurrent unit.
Figure 4. E-TMP-GNN I.
Figure 5. E-TMP-GNN II.
Figure 6. M-RNN architecture in the time domain, based on the theory in [24].
Figure 7. Number of edges per time layer.
Figure 8. Train–test split for a temporal graph as (a) a single graph, (b) a sequence of single-layered graphs.
Figure 9. Pair-wise node classification accuracy of TMP-GNN vs. GNNs.
Figure 10. Pair-wise node classification accuracy of TMP-GNN vs. GNNs for the Loop Detector dataset, with and without edge features.
Figure 11. Performance evaluation of E-TMP-GNN I and II on missing data estimation vs. M-RNN and counterpart GNNs.
Table 1. Characteristics of the study datasets.

Characteristics (TomTom / Loop Detector / PhD / Mobility)
|V|: 1867 / 323 / 231 / 72
static: |E|, dynamic *: Σ_{t=1}^{T} |E(t)|: 985 / 1001 / 10,365 * / 2692
|V| × T: 18,670 / 3553 / 14,091 / 1584
Largest Connected Component (LCC): 50 / 323 / 13,847 / 1144
No. of Isolated Nodes: 18,620 / 3230 / 244 / 440

* indicates the graph with a dynamic number of edges over time.
Table 2. Performance comparison of E-TMP-GNN I and E-TMP-GNN II vs. counterpart methods.

Estimate (TomTom / Loop Detector / PhD / Mobility)
η̂_Multi-graph^ROC AUC (%): 7–42 / 0–22 / 18–35 / 6–20
η̂_Single-supragraph^ROC AUC (%): 28–41 / 20–25 / 30–37 / 6–58
η̂_E-TMP-GNN I^MAE (%): 59–69 / 0–12 / 17–27 / 0
η̂_E-TMP-GNN II^MAE (%): 94–96 / 29–96 / 0 / 0–15
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

