1. Introduction
With the rapid growth of the internet and the rise of short video platforms, there has been an exponential increase in the number of videos available online. Understanding and categorizing these massive volumes of videos has become an urgent challenge [1]. Action recognition is one of the challenging tasks in the field of video understanding [2], with wide-ranging applications in video classification, short video recommendations, and security surveillance [3].
Early action recognition methods primarily depended on handcrafted features, representing videos with multiple local descriptors [4,5]. Nevertheless, these handcrafted feature extraction methods were effective only in limited scenarios due to their lack of adaptability to different types of data. With the rapid development of deep learning in image recognition and detection, an increasing number of deep learning-based action recognition methods have emerged in recent years. These deep learning-based methods can generally be categorized into two groups: those based on convolutional neural networks (CNNs) [6] and those based on transformers [7].
Unlike static images, videos encompass feature information in both spatial and temporal dimensions. Effectively extracting and utilizing these two types of information from continuous video frames is crucial for action recognition [8]. Two-dimensional (2D) CNNs can effectively extract high-level semantic information in the spatial dimension from video frames, but they lack sufficient capability to integrate spatio-temporal features. Simonyan et al. [9] proposed a two-stream 2D CNN architecture, which enhances the network's spatio-temporal modeling capability by integrating features from both RGB frames and optical flow frames. However, extracting optical flow is time-consuming, and changes in scene and lighting can introduce errors into the optical flow estimation, thereby reducing the accuracy of action recognition.
In contrast to 2D CNNs, 3D CNNs demonstrate greater efficacy in representing video features, directly modeling samples in the spatio-temporal dimension and effectively capturing motion information in videos [10,11,12]. However, a prevalent limitation of CNN-based methods is their focus on local information. This focus inherently restricts their capacity to model global features, especially those beyond the scope of their receptive fields [13,14].
Originally developed for natural language processing, transformers compute attention for each token in relation to all other tokens in the input, enabling them to capture long-range dependencies effectively. This characteristic makes them particularly adept at understanding context within sequences. The application of vision transformers (ViTs) [15] to image classification has showcased the exceptional capability of the self-attention mechanism.
This has spurred an increasing number of works applying transformers to action recognition tasks. For instance, TimeSformer [16] utilizes temporal and spatial attention to process video data, effectively extracting temporal and spatial features. MViT [17] enhances network performance and reduces computational cost by integrating multiscale feature hierarchies with transformer models. Nevertheless, as the sequence length (number of tokens) increases, the computational complexity grows quadratically. This sensitivity to token quantity restricts the transformer's ability to model global features effectively.
In this paper, we introduce a 3D Longformer [18] structure and utilize it to construct a Multi-Scale Video Longformer (MSVL) network for action recognition. MSVL comprises four stages, progressively reducing the video feature resolution while increasing the feature dimension. In stages 1 and 2, it reduces local redundancy through local window attention and maintains global features with global tokens; in stages 3 and 4, the local window attention turns into a dense attention computation to enhance performance (a minimal sketch of this attention pattern is given after the contribution list). The main contributions of this paper are as follows:
- (1) Introduction of a 3D Longformer structure and its application in constructing a Multi-Scale Video Longformer action recognition network.
- (2) Implementation of learnable absolute 3D position encoding and relative 3D position encoding based on depthwise separable convolution within the 3D Longformer structure.
- (3) Creation of an assembly action dataset consisting of eight assembly actions: snipping, cutting, grinding, hammering, brushing, wrapping, turning screws, and turning wrenches.
- (4) Comprehensive experiments conducted on the UCF101, HMDB51, and assembly action datasets to validate the proposed approach.
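To make the attention pattern described above concrete, the following is a minimal PyTorch sketch of a single 3D Longformer attention layer: global tokens attend to all tokens, and each local token attends to the global tokens plus the local tokens within a 3D (t, h, w) window around it. The sketch uses an explicit mask over full attention for readability rather than an efficient sliding-window implementation, and the class name, window size, and head count are illustrative assumptions rather than the exact MSVL code.

```python
import torch
import torch.nn as nn

class Longformer3DAttention(nn.Module):
    """Illustrative 3D Longformer attention: global tokens attend everywhere,
    local tokens attend within a (t, h, w) window plus the global tokens."""

    def __init__(self, dim, num_heads=4, num_global=1, window_size=(2, 7, 7)):
        super().__init__()
        self.num_heads = num_heads
        self.num_global = num_global
        self.window_size = window_size      # maximum offset along (t, h, w)
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def _build_mask(self, T, H, W, device):
        G = self.num_global
        L = T * H * W
        # Coordinates of every local token on the (t, h, w) grid.
        t, h, w = torch.meshgrid(
            torch.arange(T, device=device),
            torch.arange(H, device=device),
            torch.arange(W, device=device),
            indexing="ij",
        )
        coords = torch.stack([t.flatten(), h.flatten(), w.flatten()], dim=1)   # (L, 3)
        wt, wh, ww = self.window_size
        diff = (coords[:, None, :] - coords[None, :, :]).abs()                 # (L, L, 3)
        local_ok = (diff[..., 0] <= wt) & (diff[..., 1] <= wh) & (diff[..., 2] <= ww)
        mask = torch.zeros(G + L, G + L, dtype=torch.bool, device=device)
        mask[:G, :] = True          # global tokens attend to everything
        mask[:, :G] = True          # every token attends to the global tokens
        mask[G:, G:] = local_ok     # local tokens attend inside their 3D window
        return mask

    def forward(self, x, thw):
        # x: (B, num_global + T*H*W, C); thw: (T, H, W) of the local token grid.
        B, N, C = x.shape
        mask = self._build_mask(*thw, x.device)
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                     # each (B, heads, N, d)
        attn = (q @ k.transpose(-2, -1)) / (C // self.num_heads) ** 0.5
        attn = attn.masked_fill(~mask, float("-inf")).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

In stages 3 and 4, enlarging the window to cover the whole (now low-resolution) token grid turns this local attention into the dense attention described above.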
2. Related Works
Existing action recognition methods can be classified into two categories based on how they extract features: convolution-based methods and attention-based methods. Convolution is effective at capturing local semantic information within a window. Simonyan et al. [9] proposed the two-stream convolutional network, which uses spatial and temporal stream networks to perform 2D convolution operations on RGB and optical flow frames, respectively, extracting appearance and motion information from videos. Wang et al. [19] proposed the TSN network to process longer video sequences by sparsely sampling the video and extracting a single frame from each segment. C3D [11] demonstrated the effectiveness of 3D convolutional networks in extracting spatiotemporal features from videos and identified 3 × 3 × 3 as the optimal 3D convolutional kernel size. Carreira et al. [12] presented the inflate and bootstrap methods, expanding 2D network structures and weights trained on ImageNet into 3D ones, making it easier to reuse mature 2D architectures and their pretrained weights. R(2 + 1)D [20] combined 2D and 3D convolutions to reduce computation and memory usage. ACRN [21] modeled spatiotemporal relations with weakly supervised learning to capture interactions between actors, objects, and scenes and thereby differentiate human actions. LFB [22] introduced a long-term feature bank, extracting supportive information from the entire span of a video to augment video models. To better balance the accuracy and speed of the network, Feichtenhofer et al. [23] proposed the SlowFast network, which includes a slow branch and a fast branch to extract scene and motion information from videos, respectively.
Convolutional methods are limited to capturing local information within a fixed window, which makes it difficult to comprehensively consider global information. In contrast, transformers can leverage self-attention mechanisms to represent global features. To model long-distance semantic information, Wang et al. [24] proposed an insertable non-local spatiotemporal self-attention module. Bertasius et al. [16] introduced TimeSformer, which applies self-attention separately in the temporal and spatial dimensions to reduce computational complexity. Girdhar et al. [25] proposed the action transformer model to capture semantic context from other people's actions. ViViT [26] investigated the impact of different spatiotemporal attention structures on network performance, while VidTr [27] used separable attention mechanisms to reduce redundancy in videos, thus decreasing memory consumption. TadTR [28] effectively reduced computation costs by selectively attending to sparse key segments in videos. STPT [29] balanced efficiency and accuracy by employing local window attention and global attention at early and late stages to extract semantic information. MeMViT [30] combined multimodal information by storing past visual and textual features to capture long-term dependencies in videos. MaskFeat [31] employed masks to guide feature learning in videos, enhancing the feature representation of objects in video frames. VideoMAE [32] enhanced the similarity between videos of the same class and distinguished videos of different classes through a self-supervised paradigm of masking and reconstruction. MViT [17] and VideoSwin [33] reduced transformer resource consumption through multi-scale feature hierarchies and shifted window operations, respectively. UniFormer [34] combined CNN and transformer structures to mitigate local redundancy and effectively model global features.
Transformers exhibit quadratic computational complexity as the token count increases, limiting their ability to model global features effectively. To address this issue, Longformer employed sliding window attention to compute local context, thereby reducing local redundancy and keeping the complexity linear in the sequence length. Additionally, Longformer utilized global tokens that perform global attention alongside the local tokens in the sequence, providing a global memory. VTN [35] transferred Longformer from natural language processing to video understanding, using a 1D Longformer structure to handle long video sequences. Zhang et al. [36] introduced a 2D Longformer structure, which serves as a fundamental feature extraction backbone known as ViL for image classification. Building upon the 2D Longformer structure, we introduce a temporal dimension T, proposing a 3D Longformer structure, and employ it to construct a multi-scale network for action recognition.
4. Experiments
In this section, we validate the effectiveness of our method through experiments conducted on multiple datasets. Firstly, in Section 4.1, we provide a detailed description of the datasets we used. In Section 4.2, we introduce the hardware and software environment for these experiments, and in Section 4.3 we define the evaluation metrics. Subsequently, in Section 4.4, we conduct ablation experiments to explore the optimal configuration of our method and use this configuration to train MSVL on our assembly action dataset. In Section 4.5, we compare our method with state-of-the-art approaches. Finally, in Section 4.6, we visualize the network and present cases where the network predictions were incorrect.
4.1. Dataset
4.1.1. Assembly Action Dataset
Unlike typical human action datasets, the assembly action dataset primarily encompasses upper-body operations. These actions often involve the use of specific assembly tools and exhibit high repeatability with significant similarities among different actions. Since there is currently no publicly available standard dataset for assembly action recognition in the industrial domain, we established an assembly action dataset consisting of eight different assembly actions: snipping, cutting, grinding, hammering, brushing, wrapping, turning screws, and turning wrenches, as shown in Figure 5. The dataset was captured using an Intel RealSense D435i camera (Intel Corporation, Santa Clara, CA, USA) at a frame rate of 30 frames per second and a resolution of 640 × 480. Each action was recorded in 100 video segments, with each segment lasting 5 s, resulting in a total of 800 video segments. These segments were organized into folders according to action category. The dataset was split into a training set (90%) and a test set (10%), with 20% of the training set reserved for validation purposes.
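For concreteness, a split of this kind can be produced with a few lines of Python; the directory layout, file extension, and function name below are assumptions rather than the dataset's released tooling.

```python
import random
from pathlib import Path

def split_assembly_dataset(root, train_ratio=0.9, val_of_train=0.2, seed=0):
    """Split videos organized as <root>/<action>/<clip>.mp4 into train/val/test lists."""
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for action_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        clips = sorted(action_dir.glob("*.mp4"))
        rng.shuffle(clips)
        n_train = int(len(clips) * train_ratio)           # e.g., 90 of 100 clips per action
        train, test = clips[:n_train], clips[n_train:]    # remaining clips become the test set
        n_val = int(len(train) * val_of_train)            # 20% of the training clips -> validation
        splits["val"] += [(c, action_dir.name) for c in train[:n_val]]
        splits["train"] += [(c, action_dir.name) for c in train[n_val:]]
        splits["test"] += [(c, action_dir.name) for c in test]
    return splits
```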
4.1.2. UCF101 Dataset
UCF101 [38] was created by a research team from the University of Central Florida (UCF) and is composed of 13,320 video clips downloaded from YouTube, with approximately 70% serving as the training set and about 30% as the validation set. It covers 101 action categories distributed across 5 major action types. UCF101 contains approximately 27 h of video content, and each action category typically includes between 100 and 800 video examples.
4.1.3. HMDB51 Dataset
HMDB51 [39] was released by Brown University. This dataset primarily comprises video clips extracted from movies, alongside some clips obtained from public databases and online video platforms such as YouTube. HMDB51 consists of 6766 video clips spanning 51 action categories, with approximately 70% serving as the training set and about 30% as the validation set. The dataset offers around 7 h of video content, with each action category containing at least 101 examples.
4.2. Experimental Environment
The experiment was conducted under the following conditions: Intel(R) Xeon(R) Gold 6248R CPU, 128 GB memory, NVIDIA RTX 3090 GPU, Windows 10 operating system, PyTorch 1.10.2, CUDA 11.4, and Python 3.8.12. The training configuration includes a batch size of 4, running for 100 epochs and utilizing the AdamW optimizer. The chosen loss function is SoftTargetCrossEntropy, the learning rate is updated using the CosineAnnealingLR strategy with WarmUp, the initial learning rate is set to 1 × , and the minimum learning rate is set to 1 × . The video is sampled with 16 frames at a sampling rate of 4. MixUp data augmentation is applied with an alpha value of 0.8, and label smoothing is set to 0.1. The image cropping size used for both training and testing is 224 × 224 pixels. During training, we follow the approach used in the SlowFast repository [40], where videos are randomly cropped for validation. For testing, we set the number of crops to 1 and the number of views to 4.
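For reference, the recipe above maps onto standard PyTorch and timm components roughly as follows. The model constructor, data loader, warm-up factor, weight decay, and the learning-rate values (whose exponents are not given in the text) are placeholders, not the authors' exact settings.

```python
import torch
from timm.data import Mixup
from timm.loss import SoftTargetCrossEntropy

model = build_msvl()                 # placeholder for the MSVL constructor
base_lr, min_lr = 1e-4, 1e-6         # illustrative values; exponents are not stated in the text
warmup_epochs, epochs = 10, 100

optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=0.05)  # weight decay assumed
# Linear warm-up followed by cosine annealing down to min_lr (epoch-wise schedule).
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.01, total_iters=warmup_epochs)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs - warmup_epochs, eta_min=min_lr)
scheduler = torch.optim.lr_scheduler.SequentialLR(optimizer, [warmup, cosine], milestones=[warmup_epochs])

mixup_fn = Mixup(mixup_alpha=0.8, label_smoothing=0.1, num_classes=101)  # e.g., UCF101; 51 for HMDB51
criterion = SoftTargetCrossEntropy()  # expects the soft targets produced by Mixup

for epoch in range(epochs):
    for clips, labels in train_loader:      # clips: (B, 3, 16, 224, 224), frames sampled at rate 4
        clips, soft_labels = mixup_fn(clips, labels)
        loss = criterion(model(clips), soft_labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()                        # advance the warm-up/cosine schedule once per epoch
```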
4.3. Evaluation Metrics
In the experiment, we utilized top-1 and top-5 accuracy as the evaluation metrics. Top-1 accuracy compares the predicted class with the highest confidence score to the true class label for each sample; if the top prediction matches the true label, the sample is considered correctly classified (Equation (7)):

$\mathrm{Accuracy}_{\text{top-1}} = \dfrac{N_{1}}{T}$,

where $N_{1}$ represents the number of correct predictions, and $T$ represents the total number of samples.

In contrast, top-5 accuracy takes a more lenient approach by evaluating whether the correct label is among the five predictions with the highest confidence scores (Equation (8)):

$\mathrm{Accuracy}_{\text{top-5}} = \dfrac{N_{5}}{T}$,

where $N_{5}$ represents the number of samples whose correct class is among the top five predictions.
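Both metrics can be computed directly from the network logits; the helper below is a generic sketch (names are ours) rather than a specific framework utility.

```python
import torch

def topk_accuracy(logits, labels, ks=(1, 5)):
    """Return top-k accuracies (in %) for a batch of logits (N, num_classes)."""
    max_k = max(ks)
    _, pred = logits.topk(max_k, dim=1)            # indices of the max_k most confident classes
    correct = pred.eq(labels.view(-1, 1))          # (N, max_k) boolean matches against true labels
    return [100.0 * correct[:, :k].any(dim=1).float().mean().item() for k in ks]
```

Averaging the returned values over all batches (weighted by batch size) yields the top-1 and top-5 accuracies reported below.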
4.4. Hyperparameter Tuning
We utilized the pretrained ViL weights on ImageNet-1K, provided by the EsViT repository [41], as the initial weights, and extended them to the temporal dimension for video tasks through inflation. Since training MSVL directly on the HMDB51 dataset leads to severe overfitting and poor accuracy, we first trained MSVL on the larger UCF101 dataset for 100 epochs with a warm-up of 10 epochs. Our MSVL achieved a top-1 accuracy of 97.6% on the UCF101 dataset. In Figure 6, we plot the top-1 accuracy, top-5 accuracy, and loss curves on the validation set.
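A common way to perform such inflation, in the spirit of I3D-style bootstrapping, is to repeat each 2D kernel along a new temporal axis and divide by the temporal size, so that the inflated network initially reproduces the 2D response on a static clip. The sketch below shows this idea for convolutional weights only; how MSVL handles attention and position-encoding parameters is not detailed in the text, so the function is illustrative.

```python
import torch

def inflate_2d_to_3d(state_dict_2d, temporal_size):
    """Repeat 2D conv kernels along a new temporal dimension (inflate/bootstrap)."""
    state_dict_3d = {}
    for name, w in state_dict_2d.items():
        if w.dim() == 4:  # 2D conv weight: (out_c, in_c, kH, kW)
            w3d = w.unsqueeze(2).repeat(1, 1, temporal_size, 1, 1) / temporal_size
            state_dict_3d[name] = w3d            # now (out_c, in_c, kT, kH, kW)
        else:
            state_dict_3d[name] = w.clone()      # biases, norms, and linear layers unchanged
    return state_dict_3d
```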
Table 2 presents the prediction accuracy of MSVL across the categories of the UCF101 dataset. While the accuracy for most categories reached 100%, the category with the lowest accuracy is “WalkingWithDog”, at only 80.62%. Upon reviewing the confusion matrix of MSVL on the UCF101 dataset (Appendix A), we identified that 12 samples of “WalkingWithDog” were incorrectly classified as “HorseRiding”, which directly accounts for the low accuracy in this category. Firstly, this can be attributed to the limited number of samples in the “WalkingWithDog” category: the UCF101 dataset contains only 123 videos in this category, fewer than the average of 130.59 videos per category, which hinders MSVL's capacity to learn sufficient features for precise recognition. Secondly, appearance features appear to matter more than motion characteristics in distinguishing “WalkingWithDog” from “HorseRiding”. Since MSVL is sensitive to motion and does not prioritize the extraction of appearance features of the target, and the two categories share similar motions in some samples, this similarity in motion features can confuse MSVL.
We then utilized the weights pretrained on the UCF101 dataset to train MSVL on the HMDB51 dataset. The training process spanned 100 epochs, with an initial warm-up period of 10 epochs, and the learning rate was set to 1 × . Consequently, MSVL achieved a top-1 accuracy of 72.94% on the HMDB51 dataset, and the distribution of accuracies across categories is presented in Figure 7. It can be observed that, although a relatively high overall accuracy is achieved, some actions remain prone to confusion. For example, MSVL sometimes classifies “walk” as “run”, or “stand” as “jump”. One reason is that actions occupying a small field of view tend to receive less attention, causing MSVL to focus on irrelevant information. Another factor is that the HMDB51 dataset consists of clips extracted from movies and, in some clips, background actors perform actions different from those of the main actors, introducing significant background interference.
To explore the effectiveness of the 3D Longformer and the optimal configuration for MSVL, we conducted hyperparameter tuning with different parameters and network configurations on the HMDB51 dataset. Firstly, we investigated the impact of different positional encodings. Then, we tested different numbers of 3D Longformer AttenBlocks and MlpBlocks. Finally, we investigated the role of global tokens in network feature extraction by determining whether they were passed to the next stage. We present the accuracy and loss of MSVL with four distinct network configurations in Figure 8. In the legend, “Rpe” denotes relative 3D position encoding, “Ape” denotes absolute 3D position encoding, “1191” denotes the (1, 1, 9, 1) stage configuration, and “Glo” indicates that global tokens are passed to the next stage.
4.4.1. Position Encoding
We tested the two position encoding methods of MSVL: absolute 3D position encoding (Ape) and relative 3D position encoding (Rpe). The top-1 and top-5 accuracies, FLOPs, and parameters on the HMDB51 dataset are presented in Table 3. The absolute 3D position encoding reduced FLOPs by 1G compared to the relative 3D position encoding while keeping the parameter count unchanged. However, the relative 3D position encoding improved top-1 accuracy by 2.76% and top-5 accuracy by 1.97%.
To investigate the reasons behind the superior performance of relative 3D position encoding over absolute 3D position encoding, we visualized the position encoding of the four stages in both scenarios, as depicted in Figure 9. Our observations revealed that relative 3D position encoding, which relies on 3D depthwise separable convolution, can capture spatiotemporal positional information from local tokens while simultaneously enhancing the features. On the contrary, the absolute 3D position encoding contains no feature information and solely provides positional information. We think this disparity leads to the superior generalization performance of relative 3D position encoding compared to absolute 3D position encoding.
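A minimal sketch of such a convolutional position encoding is given below: a depthwise 3D convolution (the depthwise half of a depthwise separable convolution) is applied over the local token grid and added back to the tokens, so every token receives position-dependent information from its spatio-temporal neighborhood. The module name and kernel size are assumptions, not the exact MSVL implementation.

```python
import torch
import torch.nn as nn

class RelPosEnc3D(nn.Module):
    """Depthwise 3D convolution used as a relative (conditional) position encoding."""

    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.dwconv = nn.Conv3d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)   # depthwise convolution

    def forward(self, local_tokens, thw):
        # local_tokens: (B, T*H*W, C); fold back to a 3D grid, convolve, add residually.
        B, N, C = local_tokens.shape
        T, H, W = thw
        x = local_tokens.transpose(1, 2).reshape(B, C, T, H, W)
        x = x + self.dwconv(x)                       # position information from the local 3D context
        return x.flatten(2).transpose(1, 2)          # back to (B, T*H*W, C)
```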
4.4.2. Number of 3D Longformer AttenBlocks and MlpBlocks
We investigated the impact of different numbers (n) of 3D Longformer AttenBlock and MlpBlock layers in each stage. To ensure a fair comparison, we kept the total number of blocks the same across configurations. Specifically, we employed two network structures, (1, 1, 9, 1), resembling a transformer hierarchy, and (1, 2, 8, 1), resembling a convolutional network hierarchy, both using relative 3D position encoding. As shown in Table 4, the (1, 1, 9, 1) structure reduced FLOPs by 2G but increased Params by 1.2M compared to (1, 2, 8, 1). Furthermore, the (1, 1, 9, 1) structure lagged behind (1, 2, 8, 1) by 1.97% in top-1 accuracy and 0.97% in top-5 accuracy.
From the data presented above, it is evident that the (1, 2, 8, 1) structure, which resembles a convolutional network hierarchy, is better suited for MSVL. We believe this is related to the feature extraction process and network architecture of MSVL. MSVL employs high-resolution feature maps to extract local information in shallow layers and low-resolution feature maps to capture global information in deeper layers, while progressively increasing the feature depth, which aligns closely with the characteristics of convolutional neural networks.
4.4.3. Passing Global Tokens to the Next Stage
In each stage, global tokens generate global features by attending to all local tokens. However, in the standard ViL structure, only the input features are passed to the next stage; old global tokens are discarded, and new global tokens are initialized in the patch embedding. We conducted experiments to evaluate the impact of sequentially passing global tokens to the next stage. To ensure the smooth transfer of global tokens from the previous stage to the next, we trimmed a portion of the global tokens' data during the transfer to match the dimensions of the next stage. The comparative results are presented in Table 5. It was observed that passing global tokens to the next stage had no impact on FLOPs and Params but improved top-1 accuracy by 1.56% and top-5 accuracy by 0.5%. Furthermore, we noticed that passing global tokens leads to faster network convergence during the early stages of training (within the first 20,000 steps), as shown in Figure 8.
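The text does not spell out the trimming rule, so the following is only one plausible reading: the global tokens' channel width is adjusted to the next stage's embedding dimension by slicing surplus channels (or zero-padding missing ones) before they are concatenated with the newly embedded local tokens. Function and variable names are ours.

```python
import torch
import torch.nn.functional as F

def carry_global_tokens(global_tokens, next_dim):
    """Adjust global tokens (B, G, C_prev) to the next stage's channel width C_next."""
    B, G, C_prev = global_tokens.shape
    if C_prev >= next_dim:
        return global_tokens[..., :next_dim]                 # trim surplus channels
    return F.pad(global_tokens, (0, next_dim - C_prev))      # otherwise zero-pad the channel dim

# At the end of a stage, instead of discarding the global tokens:
# new_global = carry_global_tokens(old_global, dim_of_next_stage)
# x_next = torch.cat([new_global, patch_embed(local_feature_map)], dim=1)
```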
We believe that one significant reason for the improvement in accuracy is that these global tokens contain global features, which represent essential spatiotemporal features of the video and allow local tokens to better understand the video context. In Figure 10, we visualize the global tokens passed at different stages. It can be observed that the global tokens passed from lower stages contain more specific spatial features, while those passed from higher stages contain more abstract features. Additionally, we noticed the presence of shadows of human actions from the previous frame in the global tokens passed from stage 1 to stage 2. This further validates our hypothesis that global tokens capture both temporal and spatial features.
4.4.4. Training on the Assembly Action Dataset
Using the optimal configuration identified in the ablation experiments, we trained MSVL on the assembly action dataset, obtaining the top-1 accuracy, top-5 accuracy, and loss curves shown in Figure 11a,b. MSVL achieved 100% top-1 and top-5 accuracy after only 4000 training steps on the assembly action dataset. Evaluating the trained MSVL on the test set, we obtained the confusion matrix for the assembly action dataset depicted in Figure 11c. MSVL achieved high accuracy, correctly classifying all categories.
In Figure 12, we employed Grad-CAM [42] to visualize MSVL's output in stage 4, using a video segment from the assembly action dataset demonstrating the “hammering” category. We observed that MSVL focused more attention on the hand area interacting with assembly tools and parts, differentiating the current assembly action category by assigning higher weights to these regions.
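Grad-CAM here follows the standard formulation: channel weights are the gradients of the target class score, global-average-pooled over the feature map, and the heatmap is the ReLU of the weighted sum of channels. The sketch below is a minimal, manual implementation for a 3D backbone (it assumes the hooked layer outputs a (B, C, T, H, W) feature map; for token outputs, reshape first). The function name and call pattern are ours, not the paper's.

```python
import torch
import torch.nn.functional as F

def grad_cam_3d(model, clip, target_layer, class_idx=None):
    """Grad-CAM for a 3D backbone; clip: (1, C_in, T, H, W). Returns (T', H', W') heatmaps."""
    store = {}
    fh = target_layer.register_forward_hook(lambda m, i, o: store.update(feat=o))
    bh = target_layer.register_full_backward_hook(lambda m, gi, go: store.update(grad=go[0]))
    try:
        logits = model(clip)
        idx = int(logits.argmax(dim=1)) if class_idx is None else class_idx
        model.zero_grad()
        logits[0, idx].backward()                                  # gradient of the target class score
        weights = store["grad"].mean(dim=(2, 3, 4), keepdim=True)  # GAP of gradients over T', H', W'
        cam = F.relu((weights * store["feat"]).sum(dim=1))         # (1, T', H', W') class activation map
        cam = cam / (cam.max() + 1e-8)                             # normalize to [0, 1]
    finally:
        fh.remove()
        bh.remove()
    return cam[0]
```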
Additionally, we visualized the feature maps of the “hammering” category produced by MSVL in each stage, as shown in Figure 13. We noticed that each patch in the feature maps had global features, which were particularly evident in stage 1 and stage 2. This indicates that global tokens played a crucial role, leveraging attention calculations between global tokens and all local tokens in the input video features, thus passing these global features to each local token, even for those located far apart.
4.5. Comparative Experiments
We conducted comparative experiments to assess the performance of the proposed MSVL in comparison with state-of-the-art methods on the HMDB51 and UCF101 datasets. The results for top-1 and top-5 accuracies, FLOPs, and parameters are presented in Table 6. The top-1 and top-5 accuracy results, with the best, second-best, and third-best performances, are highlighted in red, blue, and green, respectively. All networks used only RGB frames as input with a video frame resolution of 224 × 224 pixels. For feature extraction backbones, R(2 + 1)D, TSN, I3D, and TSM employed ResNet50. divST represents the version of TimeSformer that incorporates spatiotemporally separated attention, while VideoSwin utilized the Swin-S version. We initially attempted to train TimeSformer and VideoSwin directly on the HMDB51 and UCF101 datasets but achieved suboptimal accuracy. As a result, we decided to finetune them using pretrained weights provided by mmaction2 [43]. The sampling parameters for TimeSformer are as follows: a clip length of 16, a frame interval of 2, and one clip. For VideoSwin, the parameters are 8, 32, and 1, respectively.
In most cases, it can be observed that our MSVL achieves the best performance compared to other methods. In contrast to methods employing convolutional approaches [11,12,19,20,44], MSVL excels in extracting overall video features, demonstrating exceptional performance. In comparison to divST, MSVL harnesses 3D Longformer attention in a joint spatiotemporal manner to extract spatiotemporal features; it also lowers computational complexity and reduces the network's parameter count through a local attention mechanism. On the UCF101 dataset, MSVL outperforms TimeSformer by a margin of 2.9% while simultaneously reducing the parameter count by 77.6%. VideoSwin also employs a convolution-like mechanism that recombines patches from different positions in the input video features. In comparison to VideoSwin, MSVL achieves a notable reduction in the parameter count and FLOPs by 45.1% and 4.8%, respectively. Despite not undergoing pretraining on large-scale datasets, MSVL achieves highly competitive accuracy compared to VideoSwin. We believe that, if MSVL were pretrained on similarly large-scale datasets (such as K400), its performance would further improve.
4.6. Visualization
In Figure 14, we present examples of correct and incorrect predictions by our MSVL on the HMDB51 and UCF101 datasets, displaying both the RGB frames and the feature maps of the four stages of MSVL. We observe that MSVL performs well on videos in which the human body occupies a significant portion of the frame, whereas prediction errors may occur for videos in which the human body occupies only a small region. This could be attributed to the attention mechanism involving global and local tokens, which makes MSVL focus more on overall global features and may neglect local characteristics. We also measured the real-time performance of the network: MSVL took a total of 16 min and 56 s to perform inference on 3996 videos, with 8 frames extracted from each video as input, which corresponds to 31.45 FPS on an RTX 3090 GPU.
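The reported throughput is consistent with these figures: 3996 videos × 8 frames = 31,968 frames processed in 16 min 56 s = 1016 s, i.e.,

$\mathrm{FPS} = \dfrac{3996 \times 8}{16 \times 60 + 56} = \dfrac{31{,}968}{1016} \approx 31.5$,

which matches the stated 31.45 FPS up to rounding.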
We also observed an intriguing phenomenon: the feature map patches exclusively contain global features in models trained for an extended number of epochs, whereas models trained for fewer epochs tend to contain predominantly local features, as illustrated in Figure 15. We posit that this phenomenon arises because global tokens transfer these global features to each local token through attention calculations only after undergoing multiple epochs of training. Furthermore, in the feature maps of subsequent video frames, we can also observe shadows of human actions from the previous video frame, similar to those extracted by the global tokens. This further indicates that global tokens can pass the extracted spatiotemporal features to local tokens, thereby maintaining global features of action information from previous video frames and effectively integrating temporal and spatial features.