Article

Non-Autoregressive Sparse Transformer Networks for Pedestrian Trajectory Prediction

1 College of Information Science and Technology, Northeast Normal University, Changchun 130117, China
2 School of Computer Science, Northeast Electric Power University, Jilin City 132012, China
3 Key Laboratory of Applied Statistics of MOE, Northeast Normal University, Changchun 130024, China
* Authors to whom correspondence should be addressed.
† These authors contributed equally to this work.
Appl. Sci. 2023, 13(5), 3296; https://doi.org/10.3390/app13053296
Submission received: 26 December 2022 / Revised: 1 March 2023 / Accepted: 2 March 2023 / Published: 4 March 2023
(This article belongs to the Special Issue Applications of Deep Learning and Artificial Intelligence Methods)

Abstract: Pedestrian trajectory prediction is an important task in practical applications such as autonomous driving and surveillance systems. It is challenging to effectively model social interactions among pedestrians and capture temporal dependencies. Previous methods typically emphasized social interactions among pedestrians but ignored the temporal consistency of predictions, and their dense undirected graphs introduced superfluous interactions, resulting in considerable deviation from reality. In addition, autoregressive approaches predict future locations one by one, each conditioned on previous predictions, which leads to error accumulation and long inference times. To address these issues, we present Non-autoregressive Sparse Transformer (NaST) networks for pedestrian trajectory prediction. Specifically, NaST models sparse spatial interactions and sparse temporal dependencies via a sparse spatial transformer and a sparse temporal transformer, respectively. Unlike previous RNN-based approaches, the transformer decoder works in a non-autoregressive pattern and predicts all future locations at once from a query sequence, which avoids error accumulation and is less computationally intensive. We evaluate our proposed method on the ETH and UCY datasets, and the experimental results show that our method outperforms comparative state-of-the-art methods.

1. Introduction

Pedestrian trajectory prediction plays an important role in many fields such as autonomous driving [1,2], robotic motion planning [3], video surveillance [4,5], and computer vision [6,7,8,9,10]. It is a challenging task because pedestrian motion is influenced by both historical trajectories and social interactions with neighbors [9]. Crowd interactions often obey social norms. For example, strangers usually keep their distance from others to avoid collisions, whereas companions tend to walk together [11]. The motion of a pedestrian is easily influenced by neighbors in a scene [7]. Moreover, the environment also affects pedestrian trajectories, for example through surrounding obstacles or sudden events. Such interactions are complex and hard to model in learning systems.
Traditional methods adopted handcrafted energy functions [2,12,13] to model human–human interaction; these classic approaches are not easy to tune and are hard to generalize [9]. With the development of deep neural networks, a family of methods based on Recurrent Neural Networks (RNNs) [6,7,8,9,10] has been applied to pedestrian trajectory prediction. RNN-based approaches capture temporal features of pedestrians in their latent state and model spatial interaction by merging the features of nearby neighbors. The trajectory is predicted iteratively by repeatedly predicting the next location. Nevertheless, this autoregressive decoding pattern is not parallelizable, causes error accumulation, and increases computational complexity.
Distance-based methods [6,7,14] capture crowd interaction by merging latent states with a social pooling layer, while attention-based methods [10,15,16,17] dynamically generate the importance of neighbors using soft attention instead. Graph-based methods [8,9,18,19] consider the pedestrians in the scene as graph nodes and capture the spatial interaction using graph neural networks such as Graph Convolutional Networks (GCN) [8,19,20] or Graph Attention Networks (GAT) [21]. This is much more effective for modeling complex social interactions in the real world. However, most distance-based and attention-based methods let an agent interact with all the others in the neighborhood, constructing dense interactions; in fact, an agent is influenced by only a few neighbors in a real scenario. In addition, distance-based methods usually assume that the interaction between two agents is identical in both directions and therefore model undirected interactions. In some cases, a pedestrian might change their path to avoid collision with another person, while the other person goes straight along the original path without any change; an identical undirected interaction between them is therefore unreasonable. Most existing graph-based methods extract features by simply aggregating weighted features of nodes in the local spatial neighborhood, neglecting the relative relations among pedestrians. Therefore, how to effectively encode the spatial and temporal interactions of pedestrians remains a challenging problem.
Transformer neural networks were first applied in Natural Language Processing (NLP) domains [22,23], where they achieved great breakthroughs. They were designed for sequential tasks and are notable for a powerful self-attention mechanism that models long-range dependencies in data. Their success in NLP led researchers to adopt transformers for a series of computer vision tasks such as object detection [24], image classification [25], and joint vision–language modeling [26]. Existing works have adopted transformers for pedestrian trajectory prediction and obtained encouraging results [27]. Nevertheless, the self-attention in [27] is dense and brings superfluous interactions. In addition, the prediction is made in an autoregressive way, since the future locations are generated sequentially one at a time. Generally speaking, existing works on pedestrian trajectory prediction can be classified mainly into distance-based methods [6,7,14], attention-based methods [10,15,16,17], and graph-based methods [8,9,18,19]. According to the analysis above, these methods have shortcomings in effectively modeling social interactions and temporal dependencies.
To solve the problems mentioned above, we delved into non-autoregressive pedestrian trajectory prediction with transformers and propose a Non-autoregressive Sparse Transformer (NaST) network. Our work is in line with recent approaches [24,27,28]. To capture the spatial interaction, we designed a sparse spatial transformer. Instead of constructing dense interactions between a particular pedestrian and all of the others in a scene, we let the model discover the exact neighbors involved in the interaction and learn proper weights for them. More specifically, the spatial transformer first computes weights for each agent by multi-head self-attention. Then, the attention weights are filtered by a designed sparse interaction mask to prune irrelevant neighbors. Finally, the remaining weights are aggregated to compute the spatial interaction. In addition, the spatial interaction is directed, since the influence of social interaction is not identical in both directions of a pedestrian pair. Therefore, the sparse spatial transformer can learn adaptive self-attention and find the exact set of pedestrians involved in social interactions. In the same way, the pedestrian motion feature is captured by a sparse temporal transformer. In particular, the sparse spatial transformer and sparse temporal transformer are jointly combined to model the spatial social interactions and temporal motions. Furthermore, we designed the transformer decoder in a non-autoregressive way to predict all of the future locations at once, which avoids error accumulation and is less computationally intensive. Finally, our proposed NaST was evaluated on the commonly used ETH [29] and UCY [30] pedestrian trajectory prediction datasets. The experimental results indicated that our method outperforms several state-of-the-art methods. Compared with existing methods, this is the first proposal of sparse transformers and the first application of non-autoregressive inference to trajectory prediction tasks.
In summary, our contributions are three-fold: (1) We propose a Non-autoregressive Sparse Transformer network for pedestrian trajectory prediction tasks; its advantages are demonstrated by experiments. (2) We model social interactions between pedestrians with a sparse spatial transformer encoder and extract temporal dependencies with a sparse temporal transformer encoder. (3) We investigate the use of non-autoregressive inference with a transformer neural network to avoid error propagation.

2. Related Works

2.1. Pedestrian Trajectory Prediction

Pedestrian trajectory prediction can be considered a sequential prediction task that anticipates the future path of an agent from historical trajectories. A family of methods based on RNNs has been proposed to model temporal dependencies, such as Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs). Social-LSTM [6] modeled the temporal dependency and predicted future locations with an LSTM, computing the social interaction between a specific pedestrian and neighbors within a certain distance. Study [31] combined LSTM and the soft-attention mechanism [17] to model social interactions among pedestrians. Since the current prediction is computed based on the results of previous time steps, RNN/LSTM-based methods are often time-consuming at prediction time. In addition, they inevitably accumulate errors and make long-term predictions inaccurate. Graphs are another popular choice for trajectory prediction, combined with CNNs or RNN/LSTMs. Social-STGCNN [19] considered the pedestrians as graph nodes, with edges weighted by the relative distances among the agents. Combined with a graph attention network, STGAT [9] constructed a spatial-temporal model to predict pedestrians' future trajectories. However, the number of nodes in the graph depends on the number of pedestrians in a scene; hence, the size of the graph grows significantly in crowded scenes. Moreover, other works have elaborated on this task. SGAN [7] adopted a Generative Adversarial Network (GAN) [32] for multi-modal trajectory prediction. DDPTP [33] estimated the most likely destinations of a pedestrian using a destination classifier (DC) and predicted the future trajectory with a destination-specific trajectory model (DTM). Study [34] formulated the trajectory prediction task as a reverse process of motion indeterminacy diffusion. MemoNet [35] is an instance-based approach that predicts the movement intentions of agents by looking for similar scenarios in the training data. STAR [27] constructed spatial and temporal transformers [22] to model the spatial interactions and temporal dependencies, respectively. Inspired by [27], we adopted transformers to model trajectory sequences. Different from existing works, we constructed sparse spatial and temporal transformers and make predictions in a non-autoregressive way.

2.2. Human-Human Interactions

The early works on crowd interaction modeling are based on the Social Force model [12,36], which assumes that pedestrians are driven by virtual attractive and repulsive forces that shape their future trajectories. Other handcrafted methods [37,38] have been applied to crowd simulation [39,40], behavior detection [41], and pedestrian trajectory prediction [42]. With the popularity of deep learning in recent years, many approaches based on deep neural networks have been proposed to model social interaction. Distance-based methods [6,7,14] compute the geometric relations of the agents and pool the features of neighbors to obtain the representation of a particular agent. Attention-based methods [10,15,16,17] adopt a soft-attention mechanism to generate different weights for neighbors. Graph-based methods [8,9,18,19] consider the pedestrians in the scene as graph nodes and model social interaction with an adjacency matrix. STGAT [9] and Social-BiGAT [18] adopted graph attention networks to model the spatial interaction among agents. Social-STGCNN [19] constructed a spatial–temporal graph and designed a kernel function on the adjacency matrix to extract spatial interaction. However, previous works commonly model undirected interactions with all other pedestrians or with neighbors within a fixed distance; the needless dense interactions among agents may introduce deviations from reality. In contrast, we designed a method that captures social interactions with a sparse spatial transformer, integrating relative distance features into the multi-head self-attention. It is capable of discovering the salient spatial interactions and finding the exact pedestrians involved in social interaction.

2.3. Transformer Networks

The transformer [22] was first applied in Natural Language Processing [43,44,45] in place of RNNs [46,47] to model sequential data for tasks such as text generation [48] and machine translation [49]. The core of the transformer is a multi-head self-attention mechanism that computes queries, keys, and values from the input embeddings. Compared with RNNs, which operate iteratively, transformers take advantage of parallel computation and capture long-range dependencies. Following their success in NLP, transformers have also been applied to other fields such as stock prediction [50], robot decision making [51], and computer vision [52,53]. Studies [27,54] constructed spatial and temporal transformers for pedestrian trajectory prediction and achieved remarkable results compared with traditional sequence models. Other works [17,55,56] designed transformer-based modules for pedestrian and vehicle trajectory prediction with encouraging results. However, the self-attention computed in these methods is dense and introduces superfluous interactions. Consequently, we design a novel sparse transformer that pays attention only to the agents that need to be focused on.

2.4. Non-Autoregressive Inference

For sequence modelling tasks, most deep neural networks generate predictions one by one in an autoregressive way, as in RNN-based models: the prediction at each time step is made based on the prediction results of the previous steps. Such decoding is not parallelizable and increases computational complexity, even for transformer-based machine translation [22,57]. In addition, the autoregressive pattern inevitably accumulates errors. The original self-attention design in transformers is parallelizable in principle, but autoregressive decoding at the inference stage prevents parallelism. In light of this issue, some works have attempted parallel transformer decoding in a non-autoregressive pattern [24,28,58]. Study [28] adopted fertilities to parallelize decoding with transformers in machine translation. DETR [24] employed transformers with parallel decoding in end-to-end object detection. Study [58] explored non-autoregressive inference in human motion prediction. Non-autoregressive inference has also been applied to pedestrian trajectory prediction. NAP [59] designed a time-agnostic context generator and a time-specific context generator for non-autoregressive prediction. STPOTR [60] predicted human poses and trajectories with a non-autoregressive transformer architecture. PReTR [61] extracted features from multi-agent scenes with a factorized spatio-temporal attention module and solved trajectory prediction as a non-autoregressive task. Inspired by this idea, we constructed transformers in a non-autoregressive pattern for pedestrian trajectory prediction.

3. Methods

3.1. Overview

Trajectory prediction aims to predict the future locations of all pedestrians in a scene. The future path of an agent depends on two factors: the historical trajectory, i.e., the temporal dependency, and the influence of neighbors, i.e., the spatial interaction. Therefore, both spatial and temporal features are key information for predicting trajectories.
Our proposed Non-autoregressive Sparse Transformer (NaST) network comprises three main components: a sparse spatial transformer encoder, a sparse temporal transformer encoder, and a non-autoregressive transformer decoder. The transformer encoder and decoder comprise feed-forward networks and multi-head attention modules, as in the original transformer [22]. The encoder is built with a sparse strategy and is in charge of modeling social interaction and temporal dependency. The decoder makes inferences in a non-autoregressive pattern to generate all the predicted locations in parallel. The overall NaST architecture is shown in Figure 1.
Given a set of $N$ pedestrians in a scene with their corresponding observed positions over $T_{obs}$ steps, $X_t^i = (x_t^i, y_t^i)$ represents the location of pedestrian $i \in \{1, \dots, N\}$ at time $t \in \{1, \dots, T_{obs}\}$. Therefore, the observed trajectory of pedestrian $i$ can be represented as $X_1^i, \dots, X_{T_{obs}}^i$, and the future positions to be predicted are $\hat{Y}_t^i$, $t \in \{T_{obs}+1, \dots, T_{pred}\}$.
A pedestrian's future trajectory is highly related to the historical trajectory and the influence of nearby neighbors. Previous works based on dense interaction models constructed superfluous interactions and thus inevitably deviated considerably from reality. The original transformer computes dense self-attention between a pedestrian and all neighbors; however, in a real scenario, a pedestrian is not influenced by all of them. To find the neighbors who are actually involved in interactions, we designed a sparse spatial transformer to model the social interactions. Likewise, due to the temporal consistency of trajectories, not all time steps are necessary for modeling temporal dependencies; the sparse temporal transformer is designed to find the most important time steps for predicting the future trajectory.
In Figure 1, the input observation sequences are first embedded by a fully connected layer and then fed to the transformer encoder. The transformer encoder comprises a sparse spatial transformer and a sparse temporal transformer, which model the spatial and temporal interactions, respectively. The output features of the sparse spatial and temporal transformers are merged by a fully connected layer to form a set of new features with spatio-temporal encodings. The transformer decoder receives the outputs of the encoder along with a query sequence. The decoder outputs are then concatenated with random Gaussian noise and embedded by a fully connected layer to generate the prediction sequence $\hat{Y}_t^i$, $t \in \{T_{obs}+1, \dots, T_{pred}\}$. We elaborate on each part of our modular design in the rest of this section.
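To make the pipeline concrete, the following is a minimal PyTorch sketch of the data flow in Figure 1. The stock `nn.TransformerEncoder`/`nn.TransformerDecoder` modules are stand-ins for the sparse encoders of Sections 3.2 and 3.3, and all module names, dimensions, and shapes are illustrative assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn

class NaSTSketch(nn.Module):
    """Skeleton of the NaST pipeline: embed -> spatial/temporal encoders ->
    merge -> non-autoregressive decoder -> noise + residual prediction."""
    def __init__(self, d_model=32, t_fut=12, noise_dim=16):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(2, d_model), nn.ReLU())
        # Stand-ins for the sparse spatial/temporal transformer encoders.
        self.spatial_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
        self.temporal_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=1)
        self.merge = nn.Linear(2 * d_model, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.head = nn.Linear(d_model + noise_dim, 2)
        self.t_fut, self.noise_dim = t_fut, noise_dim

    def forward(self, obs):                       # obs: (N, T_obs, 2) coordinates
        h = self.embed(obs)                       # (N, T_obs, d)
        # Spatial: attend over the N pedestrians at each time step.
        h_sp = self.spatial_enc(h.transpose(0, 1)).transpose(0, 1)
        # Temporal: attend over the T_obs steps of each pedestrian.
        h_tp = self.temporal_enc(h)
        mem = self.merge(torch.cat([h_sp, h_tp], dim=-1))
        # Non-autoregressive query: repeat the last observed embedding.
        query = h[:, -1:, :].expand(-1, self.t_fut, -1)
        out = self.decoder(query, mem)            # all future steps in one pass
        z = torch.randn(obs.size(0), self.t_fut, self.noise_dim)
        offsets = self.head(torch.cat([out, z], dim=-1))
        return obs[:, -1:, :] + offsets           # residual from X_{T_obs}

preds = NaSTSketch()(torch.randn(5, 8, 2))        # (5 pedestrians, 12 steps, 2)
```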

3.2. Sparse Spatial Transformer

The spatial transformer encoder is composed of $L$ layers, each with a multi-head adaptive sparse self-attention module and a feed-forward network. The encoder receives the embeddings of the historical trajectory as input and produces a sequence of embeddings of the same dimension.
Formally, the pedestrians in one frame can be formulated as a directed spatial graph $G_t = (V_t, E_t)$, where each node $v_i^t \in V_t$, $i \in \{1, \dots, N\}$, corresponds to the $i$th pedestrian at time $t$, and the weighted edge $(v_i^t, v_j^t) \in E_t$ represents the potential influence from pedestrian $v_j$ on $v_i$ at time $t$.
The spatial relations of pedestrians at each time step are modeled as an asymmetric edge weight matrix, which reflects the unequal influence from node $v_i$ to $v_j$ and from $v_j$ to $v_i$. Therefore, instead of constructing graphs with undirected spatial distances, we introduce relative spatial locations as prior knowledge for the edge weight matrix at time $t$ (the superscript $t$ is omitted for simplicity):
$$e_{ij} = \begin{cases} \varphi\big((x_i - x_j),\,(y_i - y_j)\big), & d(v_i, v_j) < D \\ 0, & d(v_i, v_j) \geq D \end{cases} \quad (1)$$
where $d(v_i, v_j)$ represents the distance between pedestrians $i$ and $j$, and $D$ is the threshold. $\varphi(\cdot)$ embeds the relative distance features with a linear transformation. The process of message passing is illustrated in Figure 2. The feature of node $v_i$ at time $t$ is represented as $h_i$ ($i = 1, \dots, N$).
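As a concrete illustration, the sketch below computes the edge features of Equation (1) for one frame. The scalar output of $\varphi$ (chosen so that $e_{ij}$ can be added to the attention score in Equation (3)), the positions tensor, and the threshold value are assumptions.

```python
import torch
import torch.nn as nn

def edge_features(pos, phi, D=2.0):
    """pos: (N, 2) pedestrian positions at one time step -> e: (N, N)."""
    rel = pos[:, None, :] - pos[None, :, :]      # (N, N, 2): (x_i - x_j, y_i - y_j)
    dist = rel.norm(dim=-1)                      # pairwise Euclidean distances
    e = phi(rel).squeeze(-1)                     # linear embedding of each offset
    return torch.where(dist < D, e, torch.zeros_like(e))   # Eq. (1)

phi = nn.Linear(2, 1)                            # the linear transformation phi(.)
e = edge_features(torch.rand(4, 2) * 5, phi)     # (4, 4), asymmetric: e[i,j] != e[j,i]
```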
Then, we feed the learnable edge weights $e_{ij}$ into the spatial transformer to compute the spatial interaction together with the node embeddings. Given feature $h_i$ of node $v_i$, we represent its query vector as $Q_i = f_Q(h_i)$, its key vector as $K_i = f_K(h_i)$, and its value vector as $V_i = f_V(h_i)$. The self-attention in the original transformer [22] is computed by
$$\alpha_{ij}^0 = \frac{Q_i K_j^T}{\sqrt{d_k}} \quad (2)$$
where $d_k$ is the dimension of the query vector $Q_i$ and the key vector $K_i$. However, this attention is computed only from the features of the nodes themselves (the embedded coordinates in the trajectory prediction task) and neglects the relations between nodes. In order to consider the directed interactions between nodes, we define the message from node $v_j$ to $v_i$ in the directed spatial graph as
$$\hat{\alpha}_{ij} = \frac{Q_i K_j^T}{\sqrt{d_k}} + e_{ij} \quad (3)$$
where $d_k$ is the dimension of the query vector $Q_i$ and the key vector $K_i$. The self-attention of the original transformer (Equation (2)) is then modified as
$$\alpha_{ij} = \frac{\exp(\hat{\alpha}_{ij})}{\sum_{n \in N_i \cup \{i\}} \exp(\hat{\alpha}_{in})} \quad (4)$$
where $N_i$ is the neighbor set of node $v_i$. In this way, $\alpha_{ij}$ dynamically gives the importance weight of neighbor $j$ to pedestrian $i$ via the node features themselves and the spatial relation (edge feature) between the neighbors. Hence, the spatial interaction at time $t$ can be computed by Equation (4) and represented by the asymmetric attention score matrix $A_{SI}^t$, whose $(i,j)$th element $\alpha_{ij}$ represents the influence from node $v_j$ to node $v_i$. The $i$th row of $A_{SI}^t$ represents the $i$th pedestrian's active relations to others, and the $i$th column represents the passive relations from others.
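A minimal sketch of Equations (3) and (4): the scaled dot-product score is biased by the edge feature, and the softmax is restricted to each node's neighborhood. The neighbor mask and tensor shapes are assumptions.

```python
import torch

def directed_attention(Q, K, e, neighbor_mask):
    """Q, K: (N, d_k); e, neighbor_mask: (N, N) -> attention matrix (N, N)."""
    scores = Q @ K.T / Q.size(-1) ** 0.5 + e          # Eq. (3): alpha_hat_ij
    scores = scores.masked_fill(~neighbor_mask, float('-inf'))
    return torch.softmax(scores, dim=-1)              # Eq. (4): rows sum over N_i ∪ {i}

N, d_k = 4, 32
A = directed_attention(torch.randn(N, d_k), torch.randn(N, d_k),
                       torch.randn(N, N),             # edge biases from Eq. (1)
                       torch.ones(N, N, dtype=torch.bool))
print(A)   # asymmetric: A[i, j] != A[j, i] in general
```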
Since $A_{SI}^t$ is computed at every time step independently, it does not contain any temporal dependency information about the trajectories. Hence, we stack the dense interactions $A_{SI}^t$ from every time step, and then fuse these stacked interactions with a $1 \times 1$ convolution along the temporal channel and a nonlinear function $\sigma(\cdot)$. We thereby obtain the spatial–temporal dense interaction $A_{SI} \in \mathbb{R}^{T_{obs} \times N \times N}$.
$A_{SI}$ models spatial interactions between a given pedestrian and the neighbors whose distances are smaller than $D$. Nevertheless, not all these neighbors are involved in interactions that might impact the future trajectory. In other words, there are superfluous interactions in $A_{SI}$. To find the exact neighbors who are involved, we generate a sparse interaction mask $M_{SP}$ to exclude the irrelevant ones:
$$m_{ij} = \begin{cases} 1, & \alpha_{ij} \geq \gamma \ \text{ or } \ i = j \\ 0, & \alpha_{ij} < \gamma \end{cases} \quad (5)$$
where $m_{ij}$ and $\alpha_{ij}$ are the $(i,j)$th elements of $M_{SP}$ and $A_{SI}$ at time $t$, respectively (the superscript $t$ is omitted for simplicity), and $\gamma \in [0, 1]$ is the element-wise threshold. After obtaining the sparse spatial mask $M_{SP}$, the sparse spatial interaction matrix $A_{SI}^{sparse}$ can be computed as
$$A_{SI}^{sparse} = M_{SP} \odot A_{SI} \quad (6)$$
where $\odot$ denotes element-wise multiplication. Thus, a sparse spatial–temporal attention matrix representing the sparse directed interactions is eventually obtained from the inputs. The multi-head attention mechanism is also adopted to stabilize the training process. Let $h_i$ be the embedding of node $v_i$ and $N_i$ the neighbor set associated with $v_i$. The final output feature of node $v_i$ is calculated as:
$$h_i' = f\Big(h_i + \sum_{j \in N_i \cup \{i\}} \alpha_{ij} V_j\Big) + \Big(h_i + \sum_{j \in N_i \cup \{i\}} \alpha_{ij} V_j\Big), \quad i = 1, \dots, N \quad (7)$$
where $\alpha_{ij}$ is the $(i,j)$th element of the sparse spatial interaction matrix $A_{SI}^{sparse}$, and $f(\cdot)$ is a fully connected layer. Finally, we obtain the node representations $h' = \{h_1', \dots, h_N'\}$, where $h_i'$ is the updated embedding of node $v_i$ output by the sparse spatial transformer. It captures the relative importance of different pedestrians. Therefore, the sparse spatial transformer learns a self-adaptive attention that finds the exact set of pedestrians involved in social interactions in the scene.
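The sparsification in Equations (5) and (6) reduces to a few element-wise operations, as in the following sketch; the toy attention tensor and threshold value are assumptions.

```python
import torch

def sparsify_spatial(A, gamma=0.1):
    """A: (T_obs, N, N) dense attention -> masked sparse attention (Eqs. (5)-(6))."""
    eye = torch.eye(A.size(-1), dtype=torch.bool)   # keep self-loops (i == j)
    M = (A >= gamma) | eye                          # Eq. (5): sparse interaction mask
    return A * M                                    # Eq. (6): M ⊙ A, prunes weak links

A = torch.softmax(torch.randn(8, 5, 5), dim=-1)     # toy stacked attention, T_obs = 8
A_sparse = sparsify_spatial(A)                      # weak off-diagonal entries zeroed
```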

3.3. Sparse Temporal Transformer

Similar to the sparse spatial attention, we can obtain sparse temporal attention. The $n$th pedestrian across $T_{obs}$ time steps can be formulated as a directed temporal graph $G_{tp} = (V_n, E_n)$, $n = 1, \dots, N$, where each node $v_t^n \in V_n$, $t = 1, \dots, T_{obs}$, corresponds to the $n$th pedestrian at time $t$. The weighted edges $(v_t^n, v_{t-\tau}^n) \in E_n$, $\tau = 0, \dots, t-1$, represent the influence from time $t-\tau$ to $t$ for pedestrian $n$. We represent the temporal relations of the nodes as a lower triangular matrix since the current location of an agent is related only to the past. The edge features for pedestrian $n$ are initialized as in Equation (8); the superscript $n$ is omitted for simplicity.
$$e_{ij} = \begin{cases} \varphi\big((x_i - x_j),\,(y_i - y_j)\big), & i > j \\ 1, & i = j \\ 0, & i < j \end{cases}, \quad i, j = 1, \dots, T_{obs} \quad (8)$$
where $e_{ij}$ represents the temporal influence from time $j$ to time $i$, and $\varphi(\cdot)$ is a linear embedding function. The process of message passing is illustrated in Figure 3, where $h_t$ is the feature of node $v_t^n$ for pedestrian $n$ at time $t$ (the superscript $n$ is omitted for simplicity).
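A sketch of the lower-triangular initialization in Equation (8) for a single pedestrian follows; the scalar embedding and the toy trajectory are assumptions.

```python
import torch
import torch.nn as nn

def temporal_edges(traj, phi):
    """traj: (T_obs, 2) one pedestrian's positions -> e: (T_obs, T_obs), Eq. (8)."""
    T = traj.size(0)
    rel = traj[:, None, :] - traj[None, :, :]       # offset from time j to time i
    e = phi(rel).squeeze(-1)                        # embedded offsets
    past = torch.tril(torch.ones(T, T, dtype=torch.bool), diagonal=-1)
    e = e * past                                    # keep i > j entries, zero the rest
    return e + torch.eye(T)                         # diagonal entries set to 1

e = temporal_edges(torch.randn(8, 2), nn.Linear(2, 1))   # lower-triangular (8, 8)
```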
The temporal interaction between times $i$ and $j$ is learned from their embedded features and the edge feature $e_{ij}$. Given the node feature $h_t$ ($t = 1, \dots, T_{obs}$) for pedestrian $n$, we represent the corresponding query vector as $Q_t = f_Q(h_t)$, the key vector as $K_t = f_K(h_t)$, and the value vector as $V_t = f_V(h_t)$. We define the message from time node $v_j^n$ to $v_t^n$ ($t, j = 1, \dots, T_{obs}$) in the directed temporal graph as
$$\hat{\beta}_{tj} = \frac{Q_t K_j^T}{\sqrt{d_k}} + e_{tj} \quad (9)$$
where $d_k$ is the dimension of the query vector $Q_t$ and the key vector $K_t$. Then, we obtain the attention $\beta_{tj}$ by applying the softmax function as in Equation (4). Thus, we can obtain the temporal interaction matrix $A_{TI}^n$ for pedestrian $n$, whose $(t,j)$th element $\beta_{tj}$ represents the influence of time $j$ on time $t$. The $t$th column of $A_{TI}^n$ represents the active relations from the current time to the other time steps, while the $t$th row represents the passive relations from the other time steps to the current time. $A_{TI}^n$ is computed from the locations of pedestrian $n$ at each time step; we then stack the temporal interaction matrices of all $N$ pedestrians to obtain the dense temporal interaction matrix $A_{TI} \in \mathbb{R}^{N \times T_{obs} \times T_{obs}}$.
Similarly, we generate a sparse temporal interaction mask $M_{TP}$ to filter out the irrelevant time steps:
$$m_{ij} = \begin{cases} 1, & \beta_{ij} \geq \delta \ \text{ and } \ i \geq j \\ 0, & \beta_{ij} < \delta \ \text{ or } \ i < j \end{cases} \quad (10)$$
where $m_{ij}$ and $\beta_{ij}$ are the $(i,j)$th elements of $M_{TP}$ and $A_{TI}^n$, respectively, and $\delta \in [0, 1]$ is the element-wise threshold. After the sparse temporal mask $M_{TP}$ is obtained, the sparse temporal interaction $A_{TI}^{sparse}$ of all the pedestrians can be computed as
$$A_{TI}^{sparse} = M_{TP} \odot A_{TI} \quad (11)$$
Thus, a sparse temporal attention matrix representing the sparse directed temporal interactions is eventually obtained. We also adopt the multi-head attention mechanism to calculate the temporal feature at time step $t$ for pedestrian $n$:
$$h_t' = f\Big(h_t + \sum_{j=1}^{t} \beta_{tj} V_j\Big) + \Big(h_t + \sum_{j=1}^{t} \beta_{tj} V_j\Big), \quad t = 1, \dots, T_{obs} \quad (12)$$
where $\beta_{tj}$ is the corresponding element of $A_{TI}^{sparse}$. The sparse temporal transformer outputs the updated feature at each time step for pedestrian $n$, and the computation can be parallelized over all pedestrians. It captures the temporal dependencies for trajectory prediction. Finally, we concatenate the outputs of the sparse spatial transformer and the sparse temporal transformer and send them to the transformer decoder.
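Analogous to the spatial case, Equations (10) and (11) amount to a threshold combined with a causal (lower-triangular) constraint, as sketched below under the same assumptions.

```python
import torch

def sparsify_temporal(B, delta=0.5):
    """B: (N, T_obs, T_obs) dense temporal attention -> sparse (Eqs. (10)-(11))."""
    T = B.size(-1)
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool))   # keep i >= j only
    M = (B >= delta) & causal                                 # Eq. (10)
    return B * M                                              # Eq. (11): M ⊙ B

B = torch.softmax(torch.randn(3, 8, 8), dim=-1)   # toy attention for 3 pedestrians
B_sparse = sparsify_temporal(B)
```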

3.4. Non-Autoregressive Transformer Decoder

Traditional time series prediction methods often predict the future location $\hat{y}_t$ based on $\hat{y}_{t-\tau}$. This autoregressive fashion is prone to error propagation in future predictions and is computationally expensive in practice. As the individual steps of the decoder must run sequentially rather than in parallel, autoregressive decoding prevents architectures such as the transformer from fully realizing their training-time performance advantage during inference. An appropriate solution is to adopt a non-autoregressive inference pattern in the encoder–decoder model. Hence, we address these limitations by modelling the problem in a non-autoregressive pattern, as described in the following.
As illustrated in Figure 1, the transformer decoder comprises $L$ layers and receives the output of the sparse transformer encoder together with a query sequence to produce the output embedding of the predictions. Similar to the transformer encoder, the decoder stacks are composed entirely of feed-forward networks (MLPs) and multi-head attention modules. Since no RNNs are used, there is no inherent requirement for sequential execution, which makes non-autoregressive decoding possible. Before decoding starts, the transformer decoder needs to know how many time steps the predicted trajectory will take in order to generate all the locations in parallel. As discussed in the machine translation work [28], non-autoregressive decoding shows less conditional dependency between predicted elements $\hat{y}_t$. Therefore, the input of the transformer decoder should contain as much temporal dependency as possible. The output of the transformer encoder contains spatial and temporal features, which are extracted by the sparse spatial transformer and sparse temporal transformer separately. It provides a reliable dependency on which to base the predictions, especially regarding the relations of the locations at different times.
Additionally, given the observed sequence $X_1, \dots, X_{T_{obs}}$, the last observation $X_{T_{obs}}$ is the most relevant to the next time steps. Inspired by [28], we simply copy the input embedding of the transformer encoder at the last observed time step and fill the query sequence with it, obtaining the query sequence $q_{T_{obs}+1}, \dots, q_{T_{pred}}$. The transformer decoder generates the future trajectories in a non-autoregressive pattern and outputs the predictions for $T_{obs}+1, \dots, T_{pred}$ at once. Random Gaussian noise is concatenated to the output of the transformer decoder to generate diverse future predictions [9]. The concatenated features are then embedded by a fully connected layer. The final predictions are obtained by adding the encoder input at the last observed time step through a residual connection. Given this residual connection, the model predicts the location offsets from the last observed location $X_{T_{obs}}$ to each predicted $\hat{y}_t$, $t = T_{obs}+1, \dots, T_{pred}$.
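The decoding step described above can be sketched as follows. The stock `nn.TransformerDecoder`, the use of the encoder memory's last step as the copied embedding, and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

d_model, t_fut, noise_dim = 32, 12, 16
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
head = nn.Linear(d_model + noise_dim, 2)

memory = torch.randn(5, 8, d_model)                 # encoder output: 5 pedestrians, 8 steps
query = memory[:, -1:, :].expand(-1, t_fut, -1)     # copy the last observed embedding
out = decoder(query, memory)                        # one parallel pass over 12 future steps
z = torch.randn(5, t_fut, noise_dim)                # Gaussian noise for diverse futures
last_obs = torch.randn(5, 1, 2)                     # stand-in for X_{T_obs}
preds = last_obs + head(torch.cat([out, z], dim=-1))  # residual offsets -> (5, 12, 2)
```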

4. Experiments

4.1. Datasets and Metrics

Our model was evaluated on two public pedestrian trajectory datasets, ETH [29] and UCY [30], which are widely used for future trajectory prediction. The two datasets comprise five outdoor scenes recorded from a top view: ETH contains two scenes (ETH and HOTEL), and UCY contains three scenes (UNIV, ZARA1, and ZARA2). Since our task does not involve activity prediction or multi-future prediction, other datasets such as ActEV/VIRAT [62] and the Forking Paths dataset [63] were excluded. The number of pedestrians in each scene varies from 0 to 51 per frame. The datasets exhibit complex interactions such as nonlinear trajectories, collision avoidance, walking together, and movement from different directions. The videos were recorded at 25 frames per second, and the pedestrian trajectories were sampled every 0.4 s. The datasets provide the location coordinates of all pedestrians in each frame. We evaluated our model following the same "leave-one-out" [64] strategy commonly adopted by previous works: the model was trained and validated on four sets and tested on the remaining one. To be consistent with previous works, the model predicts the next 12 frames (4.8 s) conditioned on the 8 observed frames (3.2 s).
Two conventional metrics were employed to evaluate our model: the Average Displacement Error (ADE) and the Final Displacement Error (FDE). The ADE is defined as the average Euclidean distance between all estimated positions in the predicted trajectory and the ground-truth trajectory.
$$ADE = \frac{\sum_{n \in N} \sum_{t} \big\| \hat{Y}_n^t - Y_n^t \big\|_2}{N \times T}, \quad t = T_{obs}+1, \dots, T_{pred} \quad (13)$$
The FDE is the Euclidean distance between the predicted position and the ground-truth position at the final destination $T_{pred}$.
$$FDE = \frac{\sum_{n \in N} \big\| \hat{Y}_n^t - Y_n^t \big\|_2}{N}, \quad t = T_{pred} \quad (14)$$
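Both metrics reduce to a few tensor operations; the sketch below assumes prediction and ground-truth tensors of shape (N, 12, 2) for the 12 predicted steps.

```python
import torch

def ade_fde(pred, gt):
    """pred, gt: (N, T, 2) predicted/ground-truth future positions."""
    dist = (pred - gt).norm(dim=-1)   # (N, T) per-step Euclidean errors
    return dist.mean().item(), dist[:, -1].mean().item()   # ADE (Eq. 13), FDE (Eq. 14)

pred, gt = torch.randn(5, 12, 2), torch.randn(5, 12, 2)
ade, fde = ade_fde(pred, gt)
```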

4.2. Experimental Settings

Our proposed NaST network was trained with the PyTorch deep learning framework. In our experiments, the original coordinate data were first embedded into 32 dimensions by a fully connected layer followed by ReLU activation. During the training stage, the encoder received the embedding of each frame and extracted features for the observed time steps. The sparse spatial and temporal transformers accepted inputs with a feature size of 32. The numbers of encoder layers in the sparse spatial and sparse temporal encoders were two and one, respectively. The number of self-attention heads was set to eight in each encoder layer. The number of layers in the transformer decoder was set to two, with four attention heads.
The model was trained with the Adam optimizer [9] for 300 epochs with a batch size of 16 and a learning rate of 0.0015. The sparse spatial threshold $\gamma$ and the sparse temporal threshold $\delta$ were empirically set to 0.1 and 0.5, respectively. The model was trained by minimizing the Mean Squared Error (MSE) loss. During the inference stage, 20 trajectories were sampled for each test case, and the one closest to the ground truth was used to compute the ADE and FDE metrics.
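The training and best-of-20 evaluation protocol can be summarized as in the following sketch. The stand-in model and toy tensors replace NaST and the ETH/UCY loaders, and selecting the best sample per pedestrian is one plausible reading of the protocol.

```python
import torch
import torch.nn as nn

# Stand-ins for NaST and the dataset loaders.
model = nn.Sequential(nn.Flatten(), nn.Linear(8 * 2, 12 * 2), nn.Unflatten(1, (12, 2)))
train_loader = [(torch.randn(16, 8, 2), torch.randn(16, 12, 2)) for _ in range(4)]

optimizer = torch.optim.Adam(model.parameters(), lr=0.0015)
criterion = nn.MSELoss()
for epoch in range(300):                       # 300 epochs, batch size 16
    for obs, gt in train_loader:
        loss = criterion(model(obs), gt)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Best-of-20 evaluation: keep the sampled future closest to ground truth.
# (NaST's Gaussian noise makes the 20 samples differ; this stand-in is deterministic.)
obs, gt = torch.randn(16, 8, 2), torch.randn(16, 12, 2)
with torch.no_grad():
    samples = torch.stack([model(obs) for _ in range(20)])    # (20, B, 12, 2)
    err = (samples - gt).norm(dim=-1).mean(dim=-1)            # (20, B) per-sample ADE
    best = samples[err.argmin(dim=0), torch.arange(gt.size(0))]
```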

4.3. Comparison with State-of-the-Arts

We compare our proposed NaST with other models, including Social GAN [7], Sophie [15], Social-BiGAT [18], SR-LSTM [8], Social-STGCNN [19], RSGB [64], STAR [27], GraphTCN [65], and SGCN [66]. The results evaluated by the ADE and FDE metrics are shown in Table 1. Social GAN [7], Sophie [15], and Social-BiGAT [18] adopted Generative Adversarial Networks (GANs) for trajectory prediction. Social GAN [7] improved over Social-LSTM to generate multiple plausible trajectories. Sophie [15] introduced social and physical attention mechanisms into an LSTM-based GAN model. Social-BiGAT [18] combined a graph attention mechanism with a GAN for prediction. SR-LSTM [8] computed social interactions with an LSTM using pair-wise attention and motion gates. Social-STGCNN [19] modeled the interactions as a graph and proposed a kernel function to embed the social interactions between pedestrians within the adjacency matrix. STAR [27] constructed spatial and temporal transformers to capture social interactions among the crowd. SGCN [66] presented a Sparse Graph Convolution Network for pedestrian trajectory prediction. GraphTCN [65] is a CNN-based method that modeled the spatial interactions as social graphs and captured the spatio-temporal interactions with a modified temporal convolutional network.
We observed that our method outperformed all competing methods on these benchmark datasets in terms of the average ADE and FDE metrics. For the average ADE metric, our NaST surpassed the previous best method STAR [27] by 8%; for the average FDE metric, NaST improved on STAR [27] by a margin of 6%. In particular, the results showed that our model yielded better results than dense methods such as Sophie [15], Social-STGCNN [19], and STAR [27], especially on the more complex datasets UNIV and ZARA2, which contain dense crowd scenes. We speculate that the underlying reason is that dense methods construct superfluous social interactions, which interfere with trajectory prediction. In our model, the superfluous interactions are removed by the sparse transformer, which focuses on the exact neighbors involved in the interaction. Furthermore, the non-autoregressive inference pattern avoids accumulating errors and thus improves the efficiency of the model.

4.4. Ablation Study

4.4.1. The Individual Module in NaST

To verify the contribution of each component of our model, we conducted exhaustive ablation experiments on both the ETH and UCY datasets. Specifically, we each time removed one of the components, namely the sparse Spatial Transformer Encoder (STE), the sparse Temporal Transformer Encoder (TTE), or the Transformer Decoder (TD), from NaST and computed predictions; the results are shown in Table 2. The detailed experiments are described in the following.
In the degraded model (a), the sparse Temporal Transformer Encoder (TTE) is removed, keeping the sparse Spatial Transformer Encoder (STE) and the Transformer Decoder (TD). We observed that without the temporal transformer, model (a) suffered a performance reduction compared with the full NaST model, especially on HOTEL and ZARA1. We infer that this is because the scenes in HOTEL are relatively uncrowded and the spatial interactions in ZARA1 are much simpler, so temporal dependency is more important during inference. This illustrates that the temporal transformer provides an effective temporal modeling ability. In model (b), the sparse Spatial Transformer Encoder (STE) is removed while the sparse Temporal Transformer Encoder (TTE) and Transformer Decoder (TD) are kept. The results show that model (b) performs much worse on UNIV, which contains dense crowd scenes, suggesting that the spatial transformer is important for modeling social interactions in crowded scenarios. Model (c) removes the Transformer Decoder (TD) and retains the STE and TTE; it also suffers a performance decrease. In conclusion, the experimental results in Table 2 show that removing any component from our model results in a large performance reduction. The results clearly validate the contribution of each module of NaST to trajectory prediction.

4.4.2. Contribution of Spatial and Temporal Sparsity

In order to evaluate the contribution of sparsity to the transformers, we studied different extents of sparsity in the transformer encoder. We designed experiments for the sparsity hyper-parameters through a search protocol: the spatial and temporal sparsity thresholds were each evaluated at intervals of 0.1 in the range of 0 to 1. Representative results are shown in Table 3. Different variants of our model are obtained by setting different values of the spatial sparse threshold $\gamma$ and the temporal sparse threshold $\delta$. First, we fixed the temporal sparse threshold $\delta$ at 0.5 and observed the effect of different spatial sparsity. In NaST-sp0, the spatial sparse threshold $\gamma$ was set to 0, which means dense spatial interaction: the model captures social interactions between a given pedestrian and all neighbors in the scene. The performance of NaST-sp0 degraded further on complex scenes such as UNIV, since dense interactions may make the model overfit. When the spatial sparse threshold $\gamma$ was set to 1 in NaST-sp1, there was no interaction between pedestrian pairs. Evidently, the performance degraded, which illustrates the importance of social (spatial) interaction in predicting trajectories. NaST-sp0.5, with $\gamma$ set to 0.5, worked well on relatively dense scenes such as UNIV and ZARA2 but was not satisfactory on the other, simpler scenes. The reason may be that the sparse mask filters out important neighbors who interact with the given pedestrian in less crowded scenes. The results with the spatial sparse threshold $\gamma$ set to 0.1 are shown as NaST in Table 3.
Second, we fixed the spatial sparsity to $\gamma = 0.1$ and evaluated the effect of temporal sparsity $\delta$, again at intervals of 0.1 in the range of 0 to 1; representative results are shown in Table 3. NaST-tp0 refers to the temporal sparse threshold $\delta = 0$, i.e., dense temporal dependency: the prediction at the current time step depends on all previous time steps. In NaST-tp1, $\delta$ was set to 1, indicating no temporal interaction among different time steps. The model obtains the best average performance with the original NaST setting, $\gamma = 0.1$ and $\delta = 0.5$. This indicates that a proper degree of sparsity is conducive to precise predictions.
Different sparse spatial interaction matrices are obtained for different spatial sparsity thresholds. Figure 4 shows the sparse spatial interaction matrices for different spatial sparsity in part of the ETH scene. In Figure 4a, the spatial sparsity $\gamma = 0$ represents dense spatial interaction: each pedestrian interacts with all others. In Figure 4b, the spatial sparsity $\gamma = 0.1$ represents sparse spatial interaction: each pedestrian interacts only with certain neighbors. In Figure 4c, with spatial sparsity $\gamma = 0.5$, the pedestrians interact with even fewer neighbors.
Different temporal interaction matrices are obtained for different temporal sparsity $\delta$. Figure 5 shows part of the sparse temporal interaction matrices for different temporal sparsity for one pedestrian. In Figure 5a, the temporal sparsity $\delta = 0$ represents dense temporal interaction: the current time step interacts with all other time steps. In Figure 5b, the temporal sparsity $\delta = 0.5$ represents sparse temporal interaction: the current time step interacts with only some of the other time steps. In Figure 5c, with temporal sparsity $\delta = 0.7$, the current time step interacts with even fewer time steps.

4.4.3. Contribution of Non-Autoregressive Prediction

The effectiveness of the non-autoregressive pattern of the transformer decoder was also evaluated. For comparison, we constructed an autoregressive version of our model, named NaST-auto. In NaST, the decoder receives a query sequence generated in advance from the last observed time step $X_{T_{obs}}$ (see Figure 1). NaST-auto, by contrast, does not receive such a query sequence; it predicts one future location at a time. Once a prediction is made, it is appended to the historical trajectory sequence and sent back to the transformer encoder for subsequent processing. Hence, each next location is predicted based on the trajectories of the previous time steps. The results are shown in Table 4.
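For contrast, the feedback loop of such an autoregressive baseline looks roughly like the following; the one-step stand-in predictor is an assumption.

```python
import torch
import torch.nn as nn

one_step = nn.Sequential(nn.Flatten(), nn.Linear(8 * 2, 2))  # maps 8 steps -> next point
hist = torch.randn(1, 8, 2)                                  # observed trajectory
preds = []
for _ in range(12):                          # 12 sequential, non-parallel decode steps
    nxt = one_step(hist[:, -8:, :]).unsqueeze(1)
    preds.append(nxt)
    hist = torch.cat([hist, nxt], dim=1)     # prediction fed back: errors accumulate
preds = torch.cat(preds, dim=1)              # (1, 12, 2)
```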
The original non-autoregressive NaST exhibited lower errors than its autoregressive counterpart: NaST-auto showed a 17% error increase in ADE and 22% in FDE, performing particularly poorly on the FDE metric. This can be attributed to the fact that the autoregressive inference procedure is prone to error accumulation during prediction. Apart from that, using the last observed time step as the query sequence likely helps to predict the future locations and significantly reduces the error. The contribution of the non-autoregressive pattern to the final performance of pedestrian trajectory prediction is clearly validated.

4.5. Visualization

4.5.1. Trajectory Prediction Visualization

We visualized some common interaction scenes in Figure 6, where the yellow dotted lines represent observed trajectories, and the blue and red lines are the trajectories predicted by STAR [27] and our proposed NaST model, respectively. The green lines are the ground truth. The visualization reveals that our predictions (red dotted lines) follow the ground truth more closely, while the predictions from STAR [27] (blue dotted lines) deviate more from the ground truth.
In scenario (a) of Figure 6, two pedestrians were walking in parallel in the same direction; NaST (red dotted line) matched the ground truth better than STAR (blue dotted line). In scenario (b), pedestrians were walking in perpendicular directions; NaST captured their intentions and predicted more accurate directions for each pedestrian. Since one of the pedestrians in scenario (b) made a sharp turn, both methods failed to predict it accurately because only the observed trajectory information was given. From scenarios (c) and (d), we can see that STAR [27] suffered from an overlap issue with a high possibility of collision, while NaST considered both spatial interaction and temporal tendency to avoid collisions.
STAR [27] also adopted transformers for feature extraction, but it only constructed the encoder stack and predicted future locations in an autoregressive pattern. Errors accumulated during the inference procedure and made the final destination predictions deviate largely from the ground truth; examples are shown in Figure 6a,c. From the pictures at the top of Figure 6a,c, we can see that the final destinations predicted by STAR [27] (the last blue dot of each predicted trajectory) were far from the ground truth (the last green dot of each trajectory). The results were much better for NaST (pictures at the bottom of Figure 6a,c): the final destinations predicted by NaST (the last red dot of each trajectory) were much closer to the ground truth.

4.5.2. Sparse Directed Interaction Visualization

NaST can successfully extract the sparse social interactions of a crowd. We visualized the sparse directed interactions in different scenes, as shown in Figure 7. The images in the top row show different scenes from the ETH [29] and UCY [30] datasets; the corresponding sparse directed interaction graphs are shown in the bottom row. In the scene images, the solid lines represent observed trajectories, the colored dots indicate current locations, and the dashed lines represent future trajectories. The two graphs under each scene are the corresponding sparse directed interaction graphs, which illustrate that a pedestrian is influenced by only some of the surrounding neighbors. In Figure 7a, the red pedestrian is influenced only by the blue and green ones (the interaction is represented by the directed edges in the left sparse directed interaction graph in Figure 7a). However, the degrees of influence of the two neighbors on the red node differ: the thickness of each edge is proportional to the interaction weight. From the left graph in Figure 7a, we can see that the influence of the blue node on the red one is much larger than that of the green one. The right graph in Figure 7a shows the influence of the yellow and red nodes on the blue one. In addition, the two graphs in Figure 7a illustrate that the interaction between the red and blue nodes is asymmetric. This is also demonstrated in Figure 7c: the red and blue pedestrians approach from opposite directions; the red one is influenced by the green and blue pedestrians, and the blue one is also influenced by the red pedestrian, but the influence between the red and blue pedestrians is not of the same degree in both directions. When they meet, the red pedestrian walks straight without changing direction, while the blue one detours to avoid collision. Therefore, the interaction weight from the red pedestrian to the blue one is larger than that from blue to red. This is confirmed in the directed graph, where the edge from the red node to the blue node is much thicker than that from blue to red.

5. Conclusions

In this paper, we presented Non-autoregressive Sparse Transformer (NaST) networks for pedestrian trajectory prediction. NaST captures the sparse directed spatial and temporal interactions with sparse spatial and sparse temporal transformer encoders, respectively, and generates trajectory predictions with a transformer decoder in a non-autoregressive pattern, which effectively avoids error accumulation and reduces computational cost. Beyond that, we decoded the predictions from a query sequence generated in advance from the last observed time step, which yields more accurate predictions. The sparse spatial transformer was found to effectively model the directed social interactions in real scenes, and the sparse temporal transformer captures the motion dependencies. By combining sparsity with self-attention, we constructed the transformer encoder to extract features for future trajectory prediction, and we investigated non-autoregressive inference with transformer neural networks to avoid error propagation. Extensive experimental studies illustrated that our proposed model achieves better performance than previous methods. Since NaST uses only the past trajectories of pedestrians, it may fail to predict unexpected sharp turns. Additional information such as roads, vehicles, buildings, and the surrounding environment could be considered to address this issue; we leave this for future study.

Author Contributions

Conceptualization, D.L. and S.L.; methodology, D.L.; software, D.L.; validation, D.L., Q.L.; formal analysis, M.Q.; investigation, Q.L.; resources, J.K.; data curation, S.L.; writing—original draft preparation, D.L.; writing—review and editing, Q.L.; visualization, D.L.; supervision, J.K.; project administration, J.K.; funding acquisition, J.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China, grant number 62272096, and the Fund of Jilin Provincial Science and Technology Department, grant number 20210201077GX.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Bai, H.; Cai, S.; Ye, N.; Hsu, D.; Lee, W.S. Intention-aware online pomdp planning for autonomous driving in a crowd. In Proceedings of the 2015 IEEE International Conference on Robotics and Automation (ICRA), Seattle, WA, USA, 26–30 May 2015; pp. 454–460. [Google Scholar]
  2. Luo, Y.; Cai, P.; Bera, A.; Hsu, D.; Lee, W.S. Porca: Modeling and planning for autonomous driving among many pedestrians. IEEE Robot. Autom. Lett. 2018, 3, 3418–3425. [Google Scholar] [CrossRef] [Green Version]
  3. Luo, Y.; Cai, P. Gamma: A general agent motion prediction model for autonomous driving. arXiv 2019, arXiv:1906.01566. [Google Scholar] [CrossRef]
  4. Luber, M.; Stork, J.A.; Tipaldi, G.D.; Arras, K.O. People tracking with human motion predictions from social forces. In Proceedings of the 2010 IEEE International Conference on Robotics and Automation, Anchorage, AK, USA, 3–7 May 2010; pp. 464–469. [Google Scholar]
  5. Yasuno, M.; Yasuda, N.; Aoki, M. Pedestrian detection and tracking in far infrared images. In Proceedings of the 2004 Conference on Computer Vision and Pattern Recognition Workshop, Washington, DC, USA, 27 June–2 July 2004; p. 125. [Google Scholar]
  6. Alahi, A.; Goel, K.; Ramanathan, V.; Robicquet, A.; Fei-Fei, L.; Savarese, S. Social lstm: Human trajectory prediction in crowded spaces. In Proceedings of the 29th IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 961–971. [Google Scholar]
  7. Gupta, A.; Johnson, J.; Fei-Fei, L.; Savarese, S.; Alahi, A. Social gan: Socially acceptable trajectories with generative adversarial networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  8. Zhang, P.; Ouyang, W.; Zhang, P.; Xue, J.; Zheng, N. Sr-lstm: State refinement for lstm towards pedestrian trajectory prediction. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  9. Huang, Y.; Bi, H.; Li, Z.; Mao, T.; Wang, Z. Stgat: Modeling spatial-temporal interactions for human trajectory prediction. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  10. Ivanovic, B.; Pavone, M. The trajectron: Probabilistic multi-agent trajectory modeling with dynamic spatiotemporal graphs. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  11. Mehdi, M.; Niriaska, P.; Simon, G.; Helbing, D.; Theraulaz, G. The walking behavior of pedestrian social groups and its impact on crowd dynamics. PLoS ONE 2010, 5, e10047. [Google Scholar]
  12. Helbing, D.; Molnar, P. Social force model for pedestrian dynamics. Phys. Rev. E 1995, 51, 4282–4286. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  13. Helbing, D.; Buzna, L.; Johansson, A.; Werner, T. Self-organized pedestrian crowd dynamics: Experiments, simulations, and design solutions. Transp. Sci. 2005, 39, 1–24. [Google Scholar] [CrossRef] [Green Version]
  14. Liang, J.; Jiang, L.; Juan, C.N.; Hauptmann, A.G.; Li, F.F. Peeking into the future: Predicting future person activities and locations in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5725–5734. [Google Scholar]
  15. Amir, S.; Vineet, K.; Ali, S.; Noriaki, H.; Hamid, R.; Silvio, S. Sophie: An attentive gan for predicting paths compliant to social and physical constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1349–1358. [Google Scholar]
  16. Fernando, T.; Denman, S.; Sridharan, S.; Fookes, C. Soft+ hardwired attention: An lstm framework for human trajectory prediction and abnormal event detection. Neural Netw. 2018, 108, 466–478. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  17. Vemula, A.; Muelling, K.; Jean, O. Social attention: Modeling attention in human crowds. In Proceedings of the 2018 International Conference on Robotics and Automation, Brisbane, Australia, 21–25 May 2018; pp. 1–7. [Google Scholar]
  18. Kosaraju, V.; Sadeghian, A.; Roberto, M.; Reid, I.; Rezatofighi, H.; Savarese, S. Social bigat: Multimodal trajectory forecasting using bicycle-gan and graph attention networks. In Advances in Neural Information Processing Systems; The MIT Press: Cambridge, MA, USA, 2019; pp. 137–146. [Google Scholar]
  19. Mohamed, A.; Qian, K.; Elhoseiny, M.; Claudel, C. Social-stgcnn: A social spatio-temporal graph convolutional neural network for human trajectory prediction. arXiv 2020, arXiv:2002.11927. [Google Scholar]
  20. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
  21. Petar, V.; Cucurull, G.; Casanova, A.; Romero, A.; Pietro, L.; Bengio, Y. Graph Attention Networks. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
22. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; The MIT Press: Cambridge, MA, USA, 2017.
23. Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv 2019, arXiv:1909.11942.
24. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020.
25. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 4 May 2021.
26. Gu, J.; Hu, H.; Wang, L.; Wei, Y.; Dai, J. Learning region features for object detection. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018.
27. Yu, C.; Ma, X.; Ren, J.; Zhao, H.; Yi, S. Spatio-temporal graph transformer networks for pedestrian trajectory prediction. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020.
28. Gu, J.; Bradbury, J.; Xiong, C.; Li, V.O.K.; Socher, R. Non-autoregressive neural machine translation. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
29. Pellegrini, S.; Ess, A.; Schindler, K.; Van Gool, L. You'll never walk alone: Modeling social behavior for multi-target tracking. In Proceedings of the IEEE 12th International Conference on Computer Vision (ICCV), Kyoto, Japan, 27 September–4 October 2009; pp. 261–268.
30. Lerner, A.; Chrysanthou, Y.; Lischinski, D. Crowds by example. Comput. Graph. Forum 2007, 26, 655–664.
31. Ma, Y.; Zhu, X.; Zhang, S.; Yang, R.; Wang, W.; Manocha, D. TrafficPredict: Trajectory prediction for heterogeneous traffic-agents. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 28–29 January 2019; pp. 6120–6127.
32. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems 27 (NIPS 2014), Montreal, QC, Canada, 8–13 December 2014; pp. 2672–2680.
33. Lui, A.K.F.; Chan, Y.H.; Leung, M.F. Modelling of destinations for data-driven pedestrian trajectory prediction in public buildings. In Proceedings of the IEEE International Conference on Big Data, Orlando, FL, USA, 15–18 December 2021.
34. Gu, T.; Chen, G.; Li, J.; Lin, C.; Rao, Y.; Zhou, J.; Lu, J. Stochastic trajectory prediction via motion indeterminacy diffusion. In Proceedings of the 2022 Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022.
35. Xu, C.; Mao, W.; Zhang, W.; Chen, S. Remember intentions: Retrospective-memory-based trajectory prediction. In Proceedings of the 2022 Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022.
36. Löhner, R. On the modeling of pedestrian motion. Appl. Math. Model. 2010, 32, 366–382.
37. Antonini, G.; Bierlaire, M.; Weber, M. Discrete choice models of pedestrian walking behavior. Transp. Res. Part B Methodol. 2006, 40, 667–687.
38. Wang, J.M.; Fleet, D.J.; Hertzmann, A. Gaussian process dynamical models for human motion. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 30, 283–298.
39. Hou, L.; Liu, J.G.; Pan, X.; Wang, B.H. A social force evacuation model with the leadership effect. Phys. A Stat. Mech. Its Appl. 2014, 400, 93–99.
40. Saboia, P.; Goldenstein, S. Crowd simulation: Applying mobile grids to the social force model. Vis. Comput. 2012, 28, 1039–1048.
41. Mehran, R.; Oyama, A.; Shah, M. Abnormal crowd behavior detection using social force model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 935–942.
42. Yamaguchi, K.; Berg, A.C.; Ortiz, L.E.; Berg, T.L. Who are you with and where are you going? In Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR), Colorado Springs, CO, USA, 20–25 June 2011; pp. 1345–1352.
43. Dhariwal, P.; Nichol, A. Diffusion models beat GANs on image synthesis. In Proceedings of the Conference on Neural Information Processing Systems, Virtual, 6–14 December 2021.
44. Nichol, A.; Dhariwal, P. Improved denoising diffusion probabilistic models. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021.
45. Sohl-Dickstein, J.; Weiss, E.; Maheswaranathan, N.; Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), Lille, France, 6–11 July 2015; pp. 2256–2265.
46. Jozefowicz, R.; Zaremba, W.; Sutskever, I. An empirical exploration of recurrent network architectures. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), Lille, France, 6–11 July 2015; pp. 2342–2350.
47. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Gated feedback recurrent neural networks. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), Lille, France, 6–11 July 2015; pp. 2067–2075.
48. Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.; Le, Q.V. XLNet: Generalized autoregressive pretraining for language understanding. In Proceedings of the Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 5753–5763.
49. Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving language understanding by generative pre-training. 2018. Available online: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf (accessed on 25 December 2022).
50. Liu, J.; Lin, H.; Liu, X.; Xu, B.; Ren, Y.; Diao, Y.; Yang, L. Transformer-based capsule network for stock movement prediction. In Proceedings of the First Workshop on Financial Technology and Natural Language Processing, Macao, China, 12 August 2019.
51. Fang, K.; Toshev, A.; Fei-Fei, L.; Savarese, S. Scene memory transformer for embodied agents in long-horizon tasks. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
52. Chen, N.; Zhang, Y.; Zen, H.; Weiss, R.J.; Norouzi, M.; Chan, W. WaveGrad: Estimating gradients for waveform generation. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 4 May 2021.
53. Fang, L.; Jiang, Q.; Shi, J.; Zhou, B. TPNet: Trajectory proposal network for motion prediction. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020.
54. Zhao, H.; Gao, J.; Lan, T.; Sun, C.; Sapp, B.; Varadarajan, B.; Shen, Y.; Shen, Y.; Chai, Y.; Schmid, C.; et al. TNT: Target-driven trajectory prediction. arXiv 2020, arXiv:2008.08294.
55. Chen, G.; Li, J.; Zhou, N.; Ren, L.; Lu, J. Personalized trajectory prediction via distribution discrimination. In Proceedings of the International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 15580–15589.
56. Lee, N.; Choi, W.; Vernaza, P.; Choy, C.B.; Torr, P.H.S.; Chandraker, M. DESIRE: Distant future prediction in dynamic scenes with interacting agents. In Proceedings of the 2017 Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 336–345.
57. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019. Available online: https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf (accessed on 25 December 2022).
58. Martinez-Gonzalez, A.; Villamizar, M.; Odobez, J.-M. Pose Transformers (POTR): Human motion prediction with non-autoregressive transformers. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Virtual, 11–17 October 2021; pp. 2276–2284.
59. Xue, H.; Huynh, D.Q.; Reynolds, M. Take a NAP: Non-autoregressive prediction for pedestrian trajectories. In Proceedings of the 27th International Conference on Neural Information Processing (ICONIP 2020), Bangkok, Thailand, 23–27 November 2020; pp. 544–556.
60. Mahdavian, M.; Nikdel, P.; Taherahmadi, M.; Chen, M. STPOTR: Simultaneous human trajectory and pose prediction using a non-autoregressive transformer for robot following ahead. arXiv 2022, arXiv:2209.07600.
61. Achaji, L.; Barry, T.; Fouqueray, T.; Moreau, J.; Aioun, F.; Charpillet, F. PreTR: Spatio-temporal non-autoregressive trajectory prediction transformer. arXiv 2022, arXiv:2203.09293.
62. Awad, G.; Butt, A.; Curtis, K.; Lee, Y.; Fiscus, J.; Godil, A.; Joy, D.; Delgado, A.; Smeaton, A.F.; Graham, Y.; et al. TRECVID 2018: Benchmarking video activity detection, video captioning and matching, video storytelling linking and video search. In Proceedings of the TREC Video Retrieval Evaluation (TRECVID), Gaithersburg, MD, USA, 13–15 November 2018.
63. Liang, J.; Jiang, L.; Murphy, K.; Yu, T.; Hauptmann, A. The Garden of Forking Paths: Towards multi-future trajectory prediction. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10508–10518.
64. Sun, J.; Jiang, Q.; Lu, C. Recursive social behavior graph for trajectory prediction. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 660–669.
65. Wang, C.; Cai, S.; Tan, G. GraphTCN: Spatio-temporal interaction modeling for human trajectory prediction. In Proceedings of the WACV 2021: Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2021; pp. 3449–3458.
66. Shi, L.; Wang, L.; Long, C.; Zhou, S.; Zhou, M.; Niu, Z.; Hua, G. SGCN: Sparse graph convolution network for pedestrian trajectory prediction. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 8994–9003.
Figure 1. Overall architecture of the proposed Non-autoregressive Sparse Transformer (NaST) network for pedestrian trajectory prediction.
Figure 2. Illustration of the message-passing mechanism in the directed spatial graph.
Figure 3. Illustration of the message-passing mechanism for each pedestrian in the directed temporal graph.
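As a rough illustration of the directedness in Figure 3, the sketch below builds a causal attention mask so that each time step receives messages only from earlier steps. This is a minimal sketch of how such a directed temporal graph is typically enforced in an attention implementation, not the authors' released code; the function name `directed_temporal_mask` is our own.

```python
import torch

def directed_temporal_mask(T: int) -> torch.Tensor:
    """Mask for a directed temporal graph: step t may receive messages
    only from steps <= t. True entries are blocked, matching PyTorch's
    attention-mask convention."""
    return torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

# Example: with T = 4, row t is False (allowed) only for columns <= t,
# so information flows forward in time and never backward.
print(directed_temporal_mask(4))
```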
Figure 4. Spatial interaction matrices under different spatial sparsity values (γ). (a) Spatial interaction matrix for γ = 0; (b) sparse spatial interaction matrix for γ = 0.1; (c) sparse spatial interaction matrix for γ = 0.5.
Figure 5. Temporal interaction matrices under different temporal sparsity values (δ). (a) Temporal interaction matrix for δ = 0; (b) sparse temporal interaction matrix for δ = 0.5; (c) sparse temporal interaction matrix for δ = 0.7.
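To make the role of the sparsity thresholds concrete, the following minimal sketch shows one plausible way a dense attention matrix can be pruned into the sparse interaction matrices visualized in Figures 4 and 5. It is an illustrative reconstruction under our own assumptions, not the paper's implementation; the function name `sparse_interaction_matrix` and the row renormalization step are assumptions.

```python
import numpy as np

def sparse_interaction_matrix(scores: np.ndarray, sparsity: float) -> np.ndarray:
    """Prune weak interactions from an attention-score matrix.

    scores   : (N, N) raw pairwise attention scores.
    sparsity : threshold in [0, 1]; normalized weights below it are
               dropped (gamma for spatial, delta for temporal sparsity
               in the paper's notation).
    """
    # Row-wise softmax turns raw scores into interaction weights.
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = exp / exp.sum(axis=-1, keepdims=True)

    # Zero out entries below the threshold; the result stays directed,
    # since weights[i, j] may survive while weights[j, i] is pruned.
    pruned = np.where(weights >= sparsity, weights, 0.0)

    # Renormalize surviving rows so each pedestrian's incoming
    # interaction weights still sum to one.
    row_sum = pruned.sum(axis=-1, keepdims=True)
    return np.where(row_sum > 0, pruned / np.maximum(row_sum, 1e-12), 0.0)

# With sparsity = 0 nothing is pruned (panels (a) above); raising the
# threshold removes ever more superfluous interactions.
dense = sparse_interaction_matrix(np.random.randn(5, 5), sparsity=0.0)
sparse = sparse_interaction_matrix(np.random.randn(5, 5), sparsity=0.1)
```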
Figure 6. Trajectory visualization in different scenes. The yellow dotted lines represent the observed trajectories; the blue and red lines are the trajectories predicted by STAR [27] and our proposed model, respectively. The green lines are the ground truth. (a) Pedestrians walking in parallel from the same direction. (b) Pedestrians walking in perpendicular directions. (c) Multiple pedestrians walking from the same direction. (d) Multiple pedestrians walking from different directions.
Figure 7. Sparse directed interaction visualization in different scenes. Solid lines represent observed trajectories, colored dots indicate current locations, and dashed lines represent future trajectories. (a) A scene in ZARA1. (b) A scene in ETH. (c) A scene in HOTEL.
Table 1. Comparison with state-of-the-art models on the ETH and UCY datasets in terms of ADE/FDE (lower is better).

| Models | ETH | HOTEL | UNIV | ZARA1 | ZARA2 | AVG |
|---|---|---|---|---|---|---|
| Social GAN [7] | 0.87/1.62 | 0.67/1.37 | 0.76/1.52 | 0.35/0.68 | 0.42/0.84 | 0.61/1.21 |
| Sophie [15] | 0.70/1.43 | 0.76/1.67 | 0.54/1.24 | 0.30/0.63 | 0.38/0.78 | 0.51/1.15 |
| Social-BiGAT [18] | 0.69/1.29 | 0.49/1.01 | 0.55/1.32 | 0.30/0.62 | 0.36/0.75 | 0.48/1.00 |
| SR-LSTM [8] | 0.63/1.25 | 0.37/0.74 | 0.51/1.10 | 0.41/0.90 | 0.32/0.70 | 0.45/0.94 |
| Social-STGCNN [19] | 0.64/1.11 | 0.49/0.85 | 0.44/0.79 | 0.34/0.53 | 0.30/0.48 | 0.44/0.75 |
| RSBG w/o context [64] | 0.80/1.53 | 0.33/0.64 | 0.80/1.53 | 0.40/0.86 | 0.30/0.65 | 0.48/0.99 |
| STAR [27] | 0.36/0.65 | 0.17/0.36 | 0.31/0.62 | 0.26/0.55 | 0.22/0.46 | 0.26/0.53 |
| SGCN [66] | 0.63/1.03 | 0.32/0.55 | 0.37/0.70 | 0.37/0.70 | 0.25/0.45 | 0.37/0.65 |
| GraphTCN [65] | 0.59/1.12 | 0.27/0.52 | 0.42/0.87 | 0.30/0.62 | 0.23/0.48 | 0.36/0.72 |
| NaST (Ours) | 0.35/0.62 | 0.15/0.35 | 0.27/0.56 | 0.25/0.56 | 0.19/0.41 | 0.24/0.50 |
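For reference, the ADE/FDE numbers in Tables 1–4 follow the standard displacement-error definitions, which can be computed as in the short sketch below (the array shapes and the function name `ade_fde` are illustrative assumptions, not code from the paper).

```python
import numpy as np

def ade_fde(pred: np.ndarray, gt: np.ndarray) -> tuple[float, float]:
    """Average / Final Displacement Error for predicted trajectories.

    pred, gt : (N, T, 2) arrays of N pedestrians over T future steps.
    ADE averages the Euclidean error over all steps and pedestrians;
    FDE takes only the final step. Both are in the same units as the
    coordinates (meters for ETH/UCY).
    """
    dist = np.linalg.norm(pred - gt, axis=-1)  # (N, T) per-step errors
    ade = dist.mean()                          # mean over steps and pedestrians
    fde = dist[:, -1].mean()                   # mean error at the final step
    return float(ade), float(fde)
```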
Table 2. Ablation study on the ETH and UCY datasets in terms of ADE/FDE (lower is better). STE, TTE, and TD denote the spatial transformer encoder, the temporal transformer encoder, and the transformer decoder, respectively.

| Components | STE | TTE | TD | ETH | HOTEL | UNIV | ZARA1 | ZARA2 | AVG |
|---|---|---|---|---|---|---|---|---|---|
| (a) | – | ✓ | ✓ | 0.38/0.65 | 0.26/0.42 | 0.29/0.59 | 0.38/0.67 | 0.19/0.43 | 0.30/0.55 |
| (b) | ✓ | – | ✓ | 0.37/0.64 | 0.18/0.37 | 0.35/0.67 | 0.28/0.58 | 0.20/0.44 | 0.28/0.54 |
| (c) | ✓ | ✓ | – | 0.37/0.63 | 0.16/0.35 | 0.29/0.57 | 0.27/0.58 | 0.20/0.42 | 0.26/0.51 |
| NaST | ✓ | ✓ | ✓ | 0.35/0.62 | 0.15/0.35 | 0.27/0.56 | 0.25/0.56 | 0.19/0.41 | 0.24/0.50 |
Table 3. Results for different extents of spatial (sp) and temporal (tp) sparsity on the ETH and UCY datasets in terms of ADE/FDE (lower is better).

| Variants | ETH | HOTEL | UNIV | ZARA1 | ZARA2 | AVG |
|---|---|---|---|---|---|---|
| NaST-sp0 | 0.39/0.65 | 0.18/0.36 | 0.36/0.67 | 0.26/0.58 | 0.23/0.44 | 0.28/0.54 |
| NaST-sp0.3 | 0.35/0.63 | 0.16/0.35 | 0.28/0.59 | 0.25/0.56 | 0.19/0.41 | 0.25/0.51 |
| NaST-sp0.5 | 0.37/0.64 | 0.17/0.36 | 0.27/0.58 | 0.26/0.56 | 0.20/0.41 | 0.25/0.51 |
| NaST-sp0.8 | 0.40/0.67 | 0.21/0.42 | 0.33/0.67 | 0.27/0.58 | 0.23/0.40 | 0.29/0.55 |
| NaST-sp1 | 0.41/0.67 | 0.21/0.42 | 0.32/0.68 | 0.27/0.60 | 0.22/0.43 | 0.29/0.56 |
| NaST-tp0 | 0.38/0.65 | 0.18/0.37 | 0.31/0.65 | 0.25/0.59 | 0.22/0.45 | 0.27/0.54 |
| NaST-tp0.2 | 0.35/0.64 | 0.16/0.37 | 0.29/0.62 | 0.26/0.57 | 0.20/0.42 | 0.25/0.52 |
| NaST-tp0.7 | 0.40/0.66 | 0.19/0.44 | 0.33/0.61 | 0.29/0.63 | 0.21/0.44 | 0.28/0.56 |
| NaST-tp1 | 0.43/0.68 | 0.20/0.44 | 0.35/0.62 | 0.29/0.64 | 0.25/0.47 | 0.30/0.57 |
| NaST | 0.35/0.62 | 0.15/0.35 | 0.27/0.56 | 0.25/0.56 | 0.19/0.41 | 0.24/0.50 |
Table 4. Comparison of non-autoregressive and autoregressive inference patterns on the ETH and UCY datasets in terms of ADE/FDE (lower is better).

| Variants | ETH | HOTEL | UNIV | ZARA1 | ZARA2 | AVG |
|---|---|---|---|---|---|---|
| NaST-auto | 0.33/0.69 | 0.21/0.45 | 0.31/0.67 | 0.31/0.68 | 0.22/0.54 | 0.28/0.61 |
| NaST (ours) | 0.35/0.62 | 0.15/0.35 | 0.27/0.56 | 0.25/0.56 | 0.19/0.41 | 0.24/0.50 |
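The gap between NaST-auto and NaST reflects the two inference patterns contrasted below. This is a schematic sketch under the assumption of a generic PyTorch-style encoder–decoder; `decoder`, `memory`, and `queries` are placeholders of our own, not the paper's actual interfaces.

```python
import torch

def autoregressive_decode(decoder, memory, last_obs, horizon=12):
    """Autoregressive rollout: each new position is conditioned on the
    model's own previous outputs, so errors accumulate and the steps
    must run sequentially."""
    steps = last_obs.unsqueeze(1)               # (N, 1, d): last observed location
    preds = []
    for _ in range(horizon):
        out = decoder(steps, memory)            # attend to the encoded history
        nxt = out[:, -1:, :]                    # newest predicted location
        preds.append(nxt)
        steps = torch.cat([steps, nxt], dim=1)  # feed the prediction back in
    return torch.cat(preds, dim=1)              # (N, horizon, d)

def non_autoregressive_decode(decoder, memory, queries):
    """Non-autoregressive decoding: all future locations come out in a
    single pass from a query sequence, so there is no error feedback
    loop and the whole horizon is predicted in parallel."""
    return decoder(queries, memory)             # (N, horizon, d)
```

The single-pass variant is also the cheaper of the two at inference time, since the future steps do not have to be produced one by one.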