Article

HiTIM: Hierarchical Task Information Mining for Few-Shot Action Recognition

College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(9), 5277; https://doi.org/10.3390/app13095277
Submission received: 26 March 2023 / Revised: 16 April 2023 / Accepted: 19 April 2023 / Published: 23 April 2023
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract
Although the existing few-shot action recognition methods have achieved impressive results, they suffer from two major shortcomings. (a) During feature extraction, few-shot tasks are not distinguished and task-irrelevant features are obtained, resulting in the loss of task-specific critical discriminative information. (b) During feature matching, information critical to the features within the task, i.e., self-information and mutual information, is ignored, resulting in the accuracy being affected by redundant or irrelevant information. To overcome these two limitations, we propose a hierarchical task information mining (HiTIM) approach for few-shot action recognition that incorporates two key components: an inter-task learner ( K inter ) and an attention-matching module with an intra-task learner ( K intra ). The purpose of K inter is to learn the knowledge of different tasks and build a task-related feature space for obtaining task-specific features. The proposed matching module with K intra consists of two branches: spatiotemporal self-attention matching (STM) and correlated cross-attention matching (CM), which reinforce key spatiotemporal information in features and mine regions with strong correlations between features, respectively. The shared K intra can further optimize the STM and CM. Our method can use either a 2D convolutional neural network (CNN) or a 3D CNN as the embedding. In comparable experiments using these two different embeddings on the five-way one-shot and five-way five-shot tasks, the proposed method achieved recognition accuracy that outperformed other state-of-the-art (SOTA) few-shot action recognition methods on the HMDB51 dataset and was comparable to SOTA methods on the UCF101 and Kinetics datasets.

1. Introduction

The advancement of action recognition algorithms [1,2] has been significantly facilitated by the availability of large-scale datasets [3], enabling the interpretation of increasingly complex video sequences and the recognition of actions even when they are occluded [4]. Nevertheless, the process of video annotation is both labor-intensive and costly. Consequently, few-shot action recognition has attracted considerable attention as it aims to develop effective action classifiers by utilizing minimal labeled samples, thus reducing the cost of video annotation. State-of-the-art (SOTA) few-shot action recognition algorithms primarily adopt an efficient metric-based framework [5] that consists of two core components, i.e., feature extraction and matching. Videos are mapped to a feature space, and classification is performed based on the nearest distances between query (test) and support (reference) videos. To make training closely resemble testing, episodic training is employed.
Despite the impressive performance of current few-shot action recognition, there are still two limitations. First, during feature extraction, the information from episodic tasks is neglected, generating task-agnostic features. In different tasks, the same video sample may exhibit varying characteristics, and task-agnostic features can lose the most discriminative aspects within the current task. Additionally, studies [6] indicate that task-agnostic features weaken the generalization ability of the model. Second, in feature matching, critical information, such as self-information and mutual information, is overlooked. Research [7] indicates that accurate predictions may only require processing a small portion of key temporal segments or spatial regions. Videos contain redundant information, such as unrelated frames and backgrounds. Furthermore, videos are not temporally or spatially aligned; variations in action start times, scene locations, and camera angles lead to a nonuniform distribution of key information across videos. Therefore, neglecting the self-information and mutual information of features within tasks when conducting matching can severely impact the recognition accuracy.
To overcome the generalization limitations caused by task-agnostic features, research involving domain generalization [8] aims to enable models trained in the source domain to better adapt to tasks in other target domains. Existing methods for improving model generalization include techniques such as self-supervised learning [9] and meta-learning [10]. Self-supervised learning leverages freely available data labels and introduces auxiliary tasks to predict augmented labels. For instance, ARN [11] engages in self-supervision via jigsaw [12] and rotation to help obtain a more robust encoder through training. Meta-learning [13] exposes the model to a domain shift during the training process and continuously improves the algorithm across multiple learning stages, with the core idea of learning prior knowledge. Although both methods improve model generalization, they do not address the issue of features being task-agnostic. Other methods [14,15] employ dynamic networks [16] to mine task information, but these approaches have redundant model structures and weak modeling capabilities, limiting their performance.
To address the issue of weak self-information and mutual information in feature matching, existing methods generally employ feature enhancement techniques for insufficient self-information. For example, a second-order similarity network (SoSN) [17] uses second-order statistics to strengthen features; alternatively, optical flow [18,19,20] or skeletal motion [21,22] is added. To tackle the problem of insufficient mutual information, temporal alignment strategies [20,23,24] are typically employed. Perrett [23] proposed a frame tuple matching method that extracts various combinations of frame tuples from video segments to find the key frame tuples in the videos. Although these methods improve recognition accuracy, they increase computational complexity and reduce efficiency. The attention mechanism [25,26] has been widely used for mining key feature information, exhibiting excellent performance and computational efficiency. However, its optimization effect is limited because it focuses on either the temporal or the spatial dimension alone. Therefore, more efficient and flexible matching modules capable of effectively mining key information have to be designed.
Inspired by the above observations, we propose a hierarchical task information mining (HiTIM) approach, as shown in Figure 1, consisting of an inter-task learner ( K inter ) and a matching module with an intra-task learner ( K intra ). The proposed K inter can adaptively construct the feature space according to the task through episodic training, thereby extracting task-specific features. To enhance the modeling capability of K inter , we employ deformable three-dimensional convolution (D3D), which provides a more flexible receptive field than standard convolution. K inter can thus mine the critical discriminative information for the task. The matching module with K intra consists of two branches: spatiotemporal self-attention matching (STM) and correlated cross-attention matching (CM). We consider that the temporal and spatial information of videos is related; thus, we concatenate the temporal and spatial dimensions of features into a new spatiotemporal dimension. The STM emphasizes the key regions in the spatiotemporal dimension of features using self-attention, and the CM employs the cross-attention mechanism to emphasize the strongly correlated parts between support-set and query-set features. The shared K intra further optimizes the attention mechanism. Through these two components, HiTIM can integrate the critical discriminative knowledge between tasks and the key information of features within tasks. We evaluated the performance of our method using the UCF101 [27], HMDB51 [28], and Kinetics [3] datasets. The contributions of the study are as follows:
  • We designed K inter to mine the key discriminative information of tasks, generating task-specific features.
  • We designed a matching module with K intra , consisting of two branches for STM and CM. These branches mine the key spatiotemporal regions of features and the strongly correlated parts between features, with the shared K intra further optimizing both branches.
  • We propose an end-to-end few-shot action recognition method named HiTIM, which can use either a 2D convolutional neural network (CNN) or 3D CNN as embedding. In experiments involving the UCF101, HMDB51, and Kinetics datasets, HiTIM achieved recognition accuracy that outperformed or was comparable to other SOTA few-shot action recognition methods on five-way one-shot and five-way five-shot tasks.

2. Related Work

The topics related to our work are few-shot learning, few-shot action recognition, and dynamic networks. This section provides a brief overview of these methods.

2.1. Few-Shot Learning

Several few-shot learning methods [29] that primarily focus on image classification have been proposed [30]; they can be divided into optimization-based and metric-based methods. Few-shot learning methods adopt episodic training [5], where each episode represents an N-way K-shot task. In episodic training, K annotated support samples from each of the N classes are randomly selected as the support set for reference, and several unannotated query samples are selected as the query set for testing. In each episode, the model learns the knowledge in the support set and then attempts to make accurate predictions on the query set. Optimization-based methods train a meta-learning model to help it quickly adapt to new tasks, for example, by optimizing gradients [31] or learning initialization parameters [32]. Metric-based methods emphasize learning how to compare the support and query sets, first mapping samples to a feature space and then calculating the similarity between the support and query sets using mathematical distance functions [33,34] or a learnable metric network [35]. Prototypical networks [33], when faced with large K values, average the features of the support set belonging to the same class; the average value serves as the prototype of the class, which is then matched with the query set. Prototypes not only weaken the impact of noise, but also increase the computational efficiency, making them the most widely used architecture in existing metric-based few-shot learning [11,15,17,23,35]. We adopt a prototypical network, i.e., a metric-based framework, with episodic training, but our focus is the more challenging task of few-shot action recognition.
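As a minimal illustration of the prototype-based matching described above, the PyTorch sketch below averages the K support features of each class into a prototype and classifies a query by cosine similarity; the feature dimension and episode sizes are illustrative assumptions, not values from any specific method.

```python
import torch
import torch.nn.functional as F

def prototypes_from_support(support_feats, n_way, k_shot):
    # support_feats: (n_way * k_shot, dim), grouped by class.
    # Average the K support features of each class to form one prototype per class.
    return support_feats.view(n_way, k_shot, -1).mean(dim=1)        # (n_way, dim)

def classify_query(query_feats, prototypes):
    # Cosine similarity between every query feature and every class prototype.
    sims = F.cosine_similarity(query_feats.unsqueeze(1),            # (n_query, 1, dim)
                               prototypes.unsqueeze(0), dim=-1)     # (1, n_way, dim)
    return sims.argmax(dim=1)                                       # index of the nearest prototype

# Toy 5-way 5-shot episode with 64-dimensional features.
support = torch.randn(5 * 5, 64)
query = torch.randn(10, 64)
pred = classify_query(query, prototypes_from_support(support, n_way=5, k_shot=5))
```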

2.2. Few-Shot Action Recognition

Few-shot action recognition handles complex video data rather than 2D images. Mainstream networks for video feature extraction are based on 2D or 3D CNNs. Two-dimensional CNNs, such as ResNet [36] and InceptionNet, focus on mining information at the image level. Owing to effective pre-training on the large-scale ImageNet dataset, ResNet and InceptionNet are capable of efficiently and accurately extracting image features, and they are currently adopted by many SOTA methods [20,23,24,37]. For instance, Perrett [23] utilized ResNet-50 [36] to extract features from each frame, designed frame tuples for temporal alignment, and achieved excellent performance; however, when the video or frame tuple is long, the computational efficiency decreases significantly. Compared with 2D CNNs, 3D CNNs, such as C3D [38] and SlowFast [39], can more accurately capture information in time and space, recognizing actions and changes in videos. Currently, the mainstream few-shot action recognition methods [11,15,17,35] that adopt 3D CNNs are trained from scratch directly on the target dataset. The focus of our work is not on 2D or 3D CNNs, but on hierarchical task knowledge mining. We comprehensively and fairly compare existing methods by using two different embeddings: a 3D CNN without pre-training and a 2D CNN pre-trained on ImageNet. In feature matching, distance functions are divided into mathematical functions and deep-learning-based metric networks. Mathematical functions such as the Euclidean distance or cosine distance are intuitive, but their accuracy is limited because key feature information is not prominent. To address this, HyRSM [24] measured the distance between the query and support videos from the perspective of set matching and designed a bidirectional mean Hausdorff metric to enhance the resilience to misaligned instances. The attention mechanism enhances key information by setting different weights within the features to strengthen critical parts and weaken irrelevant parts, thereby highlighting the key areas of the visual representation. For instance, ARN [11] uses temporal and spatial attention modules to strengthen the key information in the time and space of features, and SoSN [17] enhances features by introducing second-order feature statistics. Metric networks, such as the relation network [35], concatenate support-set and query-set features and input them into a network to calculate similarity, which is currently the mainstream way to match features. The proposed matching module is based on the cosine distance and the attention mechanism. In contrast to conventional methods, our method focuses on mining the key spatiotemporal information of features and the parts with strong correlation between features.

2.3. Dynamic Network

Dynamic networks [16], which have recently attracted research attention, can modify the network structure according to input. They have better generalization and interpretability than static networks. Dynamic network strategies mainly include dynamic structures [40] and dynamic parameters [41,42]. Dynamic structures select different paths according to input, such as early exiting [43] and layer skipping [44]. Dynamic parameters adjust the size or weights of convolution kernels according to input, enhancing model generalization. In several studies, dynamic network ideas have been introduced into few-shot learning; e.g., learning to generate matching network (LGM) [14] for few-shot image classification adds a task encoder before the image feature extraction network to generate parameters. Accordingly, Meta-Relation (MR) [15] designed a meta-learner to generate parameters for three layers of 3D convolution. Introducing dynamic networks into few-shot learning can effectively enhance model generalization, but the improvement was limited in the aforementioned study owing to the redundant model structure and weak modeling capabilities. In contrast, K inter , designed in the present study, not only simplifies the structure, but also introduces the more powerful D3D [45,46] into the dynamic parameter generation module for the first time. The convolution kernel of D3D has an offset for each sampling point, enabling arbitrary sampling in the vicinity of the current position, which makes the receptive field more flexible.

3. Method

In this section, we first describe the few-shot action recognition task. Then, we present the proposed HiTIM method for few-shot action recognition.

3.1. Problem Definition

The goal of few-shot action recognition is to train a model that generalizes well to new classes with only a few labeled samples. To make the training closely resemble testing, we adopt episodic training [5]. For each episode, few-shot action recognition is formulated as an N-way K-shot task, which is divided into a support set and a query set. The support set consists of N classes, and each class has K labeled video clips, whereas the query set contains several unlabeled samples. In accordance with related works, we randomly select K different videos from each of the N categories as the support set S = {V_{1,1}, V_{1,2}, …, V_{K,N}}. Subsequently, we select a video segment from these N categories to serve as the query set V_q. The final goal is to predict to which of the N categories V_q belongs.
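A minimal sketch of this episodic task construction is shown below; the dictionary-based dataset interface is a hypothetical convenience for illustration, not part of the paper.

```python
import random

def sample_episode(videos_by_class, n_way=5, k_shot=5, n_query=1):
    """Sample one N-way K-shot episode from a dict {class_name: [video_path, ...]}.
    Returns (support, query); each entry is a (video_path, episode_label) pair."""
    classes = random.sample(list(videos_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        clips = random.sample(videos_by_class[cls], k_shot + n_query)
        support += [(v, label) for v in clips[:k_shot]]   # K labeled clips per class
        query += [(v, label) for v in clips[k_shot:]]     # held-out clips to be predicted
    return support, query
```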

3.2. HiTIM

The design purpose of the HiTIM is to extract discriminative information between tasks and essential self-information and mutual information of features within tasks. To overcome the challenge of mining richer hierarchical information with limited samples, we designed K inter and a matching module with K intra .
Pipeline: The overall architecture of HiTIM is illustrated in Figure 1. It mainly includes feature extraction with K inter and a feature matching module with K intra . The feature extraction is composed of an embedding, a task-related encoder, and K inter ; the proposed K inter dynamically generates the parameters of the task-related encoder. The feature matching module is composed of two branches, STM and CM, which enhance feature self-information and correlation and calculate the cosine distance as the similarity. Taking the 5-way 5-shot task as an example, the first step is to construct the support set and query set according to the problem definition. The second step involves sampling 16 frames from each video and randomly cropping each frame into image blocks of the same size. In the third step, the support-set and query-set video clips are input into the embedding to obtain universal features. Two types of embedding can be chosen: 3D convolution without pre-training or ResNet-50 pre-trained on the ImageNet dataset. The universal features of the support set are input into K inter to obtain the parameters of the 3D convolutions in the task-related encoder, and these parameters are then passed to the corresponding network. In the fourth step, the universal features of the support set and query set are input into the task-related encoder to obtain task-specific features, and the average of the support-set features in each category is calculated to obtain the prototype of each class. The fifth step is to match the features: the query-set features and prototypes are input into the STM and CM to calculate Sim STM and Sim CM . In the sixth step, Sim STM and Sim CM are separately used to calculate the cross-entropy losses ( Loss STM and Loss CM ), and the losses of the two branches are summed to obtain the final loss.
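The six steps above can be summarized in the following structural sketch; the module interfaces (embedding, k_inter, task_encoder, stm, cm) are assumed for illustration and do not correspond to released code.

```python
import torch.nn.functional as F

def hitim_episode(embedding, k_inter, task_encoder, stm, cm,
                  support_clips, query_clips, n_way, k_shot):
    # Step 3: universal features from the shared embedding (2D or 3D CNN).
    f_s = embedding(support_clips)          # (n_way * k_shot, C, T, H, W)
    f_q = embedding(query_clips)            # (n_query,        C, T, H, W)

    # K_inter generates the parameters of the task-related encoder from support features.
    params = k_inter(f_s)

    # Step 4: task-specific features and per-class prototypes.
    z_s = task_encoder(f_s, params)
    z_q = task_encoder(f_q, params)
    prototypes = z_s.view(n_way, k_shot, *z_s.shape[1:]).mean(dim=1)

    # Step 5: two attention-matching branches produce similarity scores.
    sim_stm = stm(z_q, prototypes)          # (n_query, n_way)
    sim_cm = cm(z_q, prototypes)            # (n_query, n_way)
    return sim_stm, sim_cm

# Step 6 (training): loss = F.cross_entropy(sim_stm, labels) + F.cross_entropy(sim_cm, labels)
```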

3.2.1. Inter-Task Learner

The design principle of K inter is to highlight the differences between few-shot tasks and improve the generalization performance of the model. The feature representations of videos from different tasks may vary in both temporal and visual dimensions. However, most current networks have fixed parameters after training. Our motivation is to learn the differences between tasks and build task-related feature space. Thus, we designed K inter , which dynamically generates extraction network parameters.
K inter consists of three components, as shown in Figure 2: task feature extraction, noise injection, and a fully connected layer. Task feature extraction aims to encode the support-set features. To enhance the modeling capability, we introduce D3D, which exhibits a highly flexible receptive field and is more suitable for extracting task-specific features than standard 3D convolution. To improve robustness against interference, we inject random noise into the task features, which makes the model less susceptible to small variations in the input data and provides a degree of regularization to prevent overfitting. We employ Gaussian noise to introduce variability into the feature representation. The purpose of the fully connected layer is to generate the parameters required by the feature extraction network. Furthermore, to decrease the degree of model redundancy and streamline the training process, K inter generates a single layer of 3D convolution parameters, which are then shared across the three layers of 3D convolution within the task-related feature extraction. The computation proceeds as follows:
The universal features of the support set F s obtained through the embedding are input to the task feature extraction T to obtain the task features TF, which are then decomposed into the expectation μ and standard deviation σ, as given by Equation (1). There are two reasons for inputting only the features of the support set into K inter . First, in few-shot learning tasks, label information is available only for the support-set videos; additionally, the category of the query set in practical scenarios is unknown and may not belong to the categories of the support set. Second, the number of videos in the query set is unknown, and adding query sets may limit the applicability of the method.
μ, σ = TF = T(F_s)    (1)
A Gaussian distribution based on μ and σ is constructed, and the noise GN is injected into the task features TF, as follows:
TF_G = TF + GN(μ, diag(σ²))    (2)
where GN represents the Gaussian function, and diag is the diagonal matrix construction function.
Then, TF_G is fed into the fully connected layer f to generate the parameters P adaptive of the task-related encoder, as given by Equation (3).
P_adaptive = f(TF_G)    (3)
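A minimal PyTorch sketch of K inter following Equations (1)–(3) is given below. A plain 3D convolution stands in for the D3D layer, the task-feature pooling and the averaging before the fully connected layer are assumptions, and all layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class InterTaskLearner(nn.Module):
    """Sketch of K_inter: task feature extraction, Gaussian noise injection,
    and a fully connected layer that outputs 3D-convolution parameters."""
    def __init__(self, in_ch=64, out_ch=64, ksize=3):
        super().__init__()
        self.task_encoder = nn.Sequential(                 # T(.) in Eq. (1); Conv3d stands in for D3D
            nn.Conv3d(in_ch, in_ch, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.weight_shape = (out_ch, in_ch, ksize, ksize, ksize)
        self.fc = nn.Linear(in_ch, out_ch * in_ch * ksize ** 3)   # f(.) in Eq. (3)

    def forward(self, support_feats):                      # (N*K, C, T, H, W)
        tf = self.task_encoder(support_feats).flatten(1)   # task features TF, one row per support video
        mu, sigma = tf.mean(dim=0), tf.std(dim=0)          # Eq. (1): expectation and standard deviation
        gn = torch.randn_like(tf) * sigma + mu             # noise drawn from GN(mu, diag(sigma^2))
        tf_g = tf + gn                                     # Eq. (2): noise-injected task features
        p_adaptive = self.fc(tf_g.mean(dim=0))             # Eq. (3); row-averaging is an assumption
        return p_adaptive.view(self.weight_shape)          # one Conv3d weight tensor
```

The generated tensor can then be applied with torch.nn.functional.conv3d and shared across the three 3D convolution layers of the task-related encoder, as described above.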

3.2.2. Intra-Task Learner

The design goal of the matching module with K intra is to improve the performance of video feature matching, which relies on reducing the intra-class distance while increasing the inter-class distance. Inspired by the attention mechanism, we propose two matching branches, i.e., the STM and CM, and optimize the attention calculation with K intra .
K intra aims to optimize the attention matrix and adaptively assign high weights to critical positions. As shown in Figure 3, K intra mainly consists of two 2D convolution layers; the output dimension of the second layer matches the input dimension of the first, so the size of the attention matrix is preserved. Additionally, a rectified linear unit (ReLU) activation improves the modeling capability, and softmax converts the values in the attention matrix into a probability distribution, achieving dynamic weighting of different features.
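A minimal sketch of K intra is shown below; treating the attention vector of length L as the channel dimension of a 1 × 1 map, and the 1 × 1 kernels themselves, are assumptions made only to keep the example self-contained.

```python
import torch
import torch.nn as nn

class IntraTaskLearner(nn.Module):
    """Sketch of K_intra: two 2D convolutions with a ReLU in between, followed by
    softmax, so the attention matrix keeps its size and becomes a probability
    distribution over positions."""
    def __init__(self, length, hidden):                   # length = L, hidden = L*
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(length, hidden, kernel_size=1),     # L  -> L*
            nn.ReLU(),
            nn.Conv2d(hidden, length, kernel_size=1),     # L* -> L, dimension preserved
        )

    def forward(self, m):                                 # m: (B, L) attention vector
        a = self.net(m[..., None, None]).squeeze(-1).squeeze(-1)
        return torch.softmax(a, dim=-1)                   # dynamic weights over the L positions
```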

3.2.3. Spatiotemporal Self-Attention Matching

In few-shot action recognition, videos usually contain irrelevant frames and interference information, such as background and unrelated actions, which significantly reduce the recognition accuracy. To overcome this challenge, we propose the STM, as shown in Figure 4, which aims to explore the key self-information in the temporal and spatial dimensions within the features. The attention mechanism can capture key information within a certain dimension, but current attention modules have difficulty taking both the temporal and spatial dimensions into account. When the 3D convolution task-related encoder extracts task-specific features, the receptive field of each convolution kernel covers both the temporal and spatial dimensions, so each point in the video feature contains both temporal and spatial information. Therefore, in the STM, the temporal and spatial dimensions of the features are concatenated into a new spatiotemporal dimension L, and self-attention is calculated on dimension L. The calculation process is as follows: concatenate the temporal and spatial dimensions of the features, and perform a matrix multiplication of the concatenated features F and their transpose F^T; then, reduce the dimensionality to obtain the spatiotemporal self-attention vector M STM as follows:
M^STM = mean(F · F^T)    (4)
Then, M STM is input into K intra for optimization, resulting in A STM. Subsequently, A STM is multiplied by the input feature F to enhance the key self-information. To preserve the information of F and learn the relationship between the input and output, we also add a residual term to obtain the feature F STM as follows:
A^STM = K_intra(M^STM)    (5)
F^STM = F · A^STM + F    (6)
The cosine distance between the strengthened query features F q STM and prototypes F p STM is calculated to obtain Sim STM, as given by Equation (7).
Sim^STM = D_cos(F_q^STM, F_p^STM)    (7)
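The STM branch of Equations (4)–(7) can be sketched as follows; the exact reshaping of the spatiotemporal dimension and the flattening before the cosine similarity are assumptions.

```python
import torch.nn.functional as F

def stm_similarity(f_q, f_p, k_intra):
    """f_q: query features (B_q, C, T, H, W); f_p: class prototypes (N, C, T, H, W);
    k_intra: a shared IntraTaskLearner-like module operating on (B, L) vectors."""
    def strengthen(f):
        b, c = f.shape[:2]
        x = f.reshape(b, c, -1).transpose(1, 2)       # (B, L, C) with L = T*H*W
        m = (x @ x.transpose(1, 2)).mean(dim=-1)      # Eq. (4): self-attention vector M^STM
        a = k_intra(m)                                # Eq. (5): optimized attention A^STM
        return x * a.unsqueeze(-1) + x                # Eq. (6): reweight plus residual

    zq = strengthen(f_q).flatten(1)                   # (B_q, L*C)
    zp = strengthen(f_p).flatten(1)                   # (N,   L*C)
    # Eq. (7): cosine similarity between every query and every prototype.
    return F.cosine_similarity(zq.unsqueeze(1), zp.unsqueeze(0), dim=-1)   # (B_q, N)
```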

3.2.4. Correlated Cross-Attention Matching

Owing to the diversity of videos, the location and timing of actions vary in different videos. Therefore, in few-shot learning tasks, the correlation between query-set features and prototypes has to be considered. The purpose of the CM, as shown in Figure 5, is to explore the correlation between features and align them to obtain improved matching performance. The attention mechanism obtains the attention matrix by multiplying features and uses it to learn the relationship between feature points. Multiplying the query-set features and prototypes of the same dimension can obtain a cross-correlation attention matrix, which helps align the key parts of the features and strengthen their correlation. The CM takes the query features and prototypes as input, and the calculation is bidirectional. Taking the strengthening of query features as an example, the calculation process is as follows:
Concatenate the temporal and spatial dimensions of the query features and prototypes separately. Compute the matrix multiplication between F q and the transpose of F p, and then reduce the resulting matrix to obtain the cross-correlation attention vector M q CM:
M_q^CM = mean(F_q · F_p^T)    (8)
M q CM is input into K intra for optimization, resulting in A CM. As in the STM, A CM is multiplied by the query feature, and a residual term is added, resulting in F q CM, as given by Equation (9). Similarly, we swap the order of the input prototypes and query-set features to obtain F p CM.
F_q^CM = F_q · A^CM + F_q    (9)
The cosine distance between the strengthened query features F q CM and prototypes F p CM is calculated to obtain the correlated cross-attention similarity Sim CM as follows:
Sim^CM = D_cos(F_q^CM, F_p^CM)    (10)
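A sketch of the CM branch of Equations (8)–(10) for a single query–prototype pair is given below; the shapes and the flattening before the cosine similarity are assumptions, and k_intra is the same learner shared with the STM.

```python
import torch.nn.functional as F

def cm_similarity(f_q, f_p, k_intra):
    """f_q, f_p: (C, T, H, W) query feature and class prototype."""
    c = f_q.shape[0]
    xq = f_q.reshape(c, -1).t()                       # (L, C), L = T*H*W
    xp = f_p.reshape(c, -1).t()                       # (L, C)

    def cross_strengthen(a, b):
        m = (a @ b.t()).mean(dim=-1)                  # Eq. (8): cross-attention vector over a's positions
        w = k_intra(m.unsqueeze(0)).squeeze(0)        # optimized weights A^CM
        return a * w.unsqueeze(-1) + a                # Eq. (9): reweight plus residual

    zq = cross_strengthen(xq, xp)                     # strengthen the query w.r.t. the prototype
    zp = cross_strengthen(xp, xq)                     # and the prototype w.r.t. the query
    return F.cosine_similarity(zq.flatten(), zp.flatten(), dim=0)   # Eq. (10)
```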
Finally, we compute the cross-entropy losses for Sim STM and Sim CM to obtain Loss STM and Loss CM, respectively. These two losses are added to obtain the final loss, as given by Equations (11)–(13).
Loss^STM = -log( exp(Sim_y^STM) / Σ_{j=1}^{N} exp(Sim_j^STM) )    (11)
Loss^CM = -log( exp(Sim_y^CM) / Σ_{j=1}^{N} exp(Sim_j^CM) )    (12)
Loss = Loss^STM + Loss^CM    (13)
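Given the two similarity matrices over the N classes of an episode, Equations (11)–(13) reduce to standard softmax cross-entropy on each branch, summed; a short sketch follows, where sim_stm and sim_cm are assumed to be (n_query, N) similarity matrices and labels the integer query labels.

```python
import torch.nn.functional as F

def hitim_loss(sim_stm, sim_cm, labels):
    # Eqs. (11) and (12): softmax cross-entropy treats the similarities as logits.
    loss_stm = F.cross_entropy(sim_stm, labels)
    loss_cm = F.cross_entropy(sim_cm, labels)
    return loss_stm + loss_cm                     # Eq. (13): final training loss
```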

4. Experiments

We evaluated the performance of the proposed method and modules through experimental comparisons with conventional methods and ablation experiments. The datasets used in the experiments were UCF101, HMDB51, and Kinetics, and sample videos are shown in Figure 6.

4.1. Datasets and Experimental Setups

Datasets: HMDB51 [28], UCF101 [27], and Kinetics [3] are benchmark datasets for evaluating the performance of action recognition algorithms and cover various types of actions including general daily activities and sports. The HMDB51 dataset contains 51 categories, with a total of 6849 videos. Each category contains an average of 137 video clips, and the length of the video clips varies from 17 to 266 frames, with an average of 26 frames. UCF101 contains 101 categories, with a total of 13,320 videos. Each category contains an average of 133 video clips, and the length of the video clips varies from 17 to 441 frames, with an average of 47 frames. Kinetics contains 100 categories, with a total of 40,000 videos. Each category contains an average of approximately 400 video clips, and the average length of the video clips is approximately 250 frames. These three datasets use video-level annotations and each video contains only one category of action. The training and testing sets of the model are non-overlapping in terms of categories, implying that there are differences between the source domain used for training and the target domain used for testing. Thus, the recognition accuracy on the testing set reflects the generalization ability of the model.
Implementation details: As described in the previous section, to make a fair comparison with other SOTA methods, we adopted two different embeddings: 3D convolution and ResNet-50. The 3D convolution was set up the same as in other comparative methods [11,15,17], without importing pre-trained parameters. The size of the convolution kernel is 3 × 3 × 3, and the output channels are 64. For ResNet-50, we imported the pre-trained parameters on the ImageNet dataset as in comparative methods [20,23,24,37]. We replaced the last average pooling and fully connected layers with a 2D convolution in ResNet-50, and the output channel dimension was reduced from 2048 to 64, ensuring that the output channel number of the two different embeddings was the same.
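A sketch of the modified ResNet-50 embedding described above is shown below; the 1 × 1 kernel of the replacement convolution and the way frames are folded into the batch dimension are assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

# Drop the final average pooling and fully connected layers of ResNet-50 and
# reduce the channel dimension from 2048 to 64 with a 2D convolution.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
embedding = nn.Sequential(
    *list(backbone.children())[:-2],           # keep everything up to the last residual stage
    nn.Conv2d(2048, 64, kernel_size=1),        # 2048 -> 64 output channels
)

# Frames are processed independently: fold the 16 sampled frames into the batch
# dimension, then restore the temporal axis to obtain a (C, T, H, W) feature volume.
clip = torch.randn(2, 16, 3, 224, 224)                        # (batch, T, 3, H, W)
feats = embedding(clip.flatten(0, 1))                         # (batch*T, 64, 7, 7)
feats = feats.view(2, 16, *feats.shape[1:]).transpose(1, 2)   # (batch, 64, T, 7, 7)
```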
For dividing the data into the training, validation, and testing sets, we followed the setting proposed by ARN [11]. For UCF101, the 101 categories were divided into 70 classes for training, 10 classes for validation, and 21 classes for testing. For HMDB51, the 51 categories were divided into 31 classes for training, 10 classes for validation, and 10 classes for testing. For Kinetics, the 100 categories were divided into 64 classes for training, 12 classes for validation, and 24 classes for testing. We randomly sampled a continuous sequence of 16 frames per video and performed random cropping and normalization on each frame of the video segments. The video frame cropping area for UCF101, HMDB51, and Kinetics was 224 × 224, 112 × 112, and 224 × 224, respectively. In the training phase, we used the Adam optimizer [47] with an initial learning rate of 0.001; the learning rate was decreased by 25% after every 20,000 iterations. After every 5000 iterations, we randomly generated 600 tasks from the validation set, tested the average recognition accuracy for these tasks, and retained the model with the highest recognition accuracy. In the inference phase, we randomly generated 10,000 tasks from the test set and calculated the average recognition accuracy with a 95% confidence interval over these tasks. For many-shot classification, e.g., 5-shot, we followed prototypical networks [33]: the mean of the support-video features in each class was calculated as the class prototype, and the query videos were classified according to their distances to the prototypes. To enhance the clarity of the experimental results, bold font is used to indicate the best result in the tables, and underlining is used to denote the second-best result.
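The optimizer, learning-rate schedule, and confidence-interval computation described above can be sketched as follows; reading "decreased by 25%" as multiplying the learning rate by 0.75 every 20,000 iterations is our interpretation, and the model itself is a placeholder.

```python
import math
import torch

def make_optimizer(model):
    # Adam with an initial learning rate of 0.001, decayed to 75% every 20,000 iterations.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20000, gamma=0.75)
    return optimizer, scheduler

def mean_with_ci95(accuracies):
    """Average episode accuracy and its 95% confidence interval over sampled tasks."""
    n = len(accuracies)
    mean = sum(accuracies) / n
    std = math.sqrt(sum((a - mean) ** 2 for a in accuracies) / (n - 1))
    return mean, 1.96 * std / math.sqrt(n)
```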

4.2. Comparative Experiments

In this section, we compare the performance of HiTIM with that of conventional methods. Two types of embedding are compared: a 3D CNN without pre-training and a 2D CNN (ResNet-50) pre-trained on ImageNet.

4.2.1. Using 3D CNN Embedding without Pre-Training

In Figure 7, the horizontal axis of the graph indicates the number of training epochs (each consisting of 5000 iterations), and the vertical axis indicates the average accuracy for the validation set. After nine epochs of training on the UCF101 dataset (i.e., 45,000 tasks), our method achieved the highest validation accuracy. By contrast, MR [15] and ARN [11] required 11 and 13 epochs of training, respectively. Similarly, our model required nine epochs of training on the HMDB51 dataset (i.e., 45,000 tasks), while MR [15] and ARN [11] required 15 and 16 epochs, respectively. Thus, our method required fewer epochs to train the model than the compared methods. This may be because the K inter , along with STM and CM, in our method provided more accurate information, accelerating the training.
As shown in Table 1, in the 5-way 1-shot and 5-way 5-shot experiments on the HMDB51, HiTIM achieved recognition accuracies of 46.92% and 61.74%, respectively, which were higher than those of the other methods by 1.22% to 8.87%. In the 5-way 1-shot and 5-way 5-shot experiments on the UCF101, HiTIM achieved accuracies of 66.54% and 84.71%, respectively, which were 0.22% to 9.49% higher than those of the other methods. Compared with the methods that use feature-enhanced techniques and relation network in matching module, the proposed method consistently achieved better performance under different settings on UCF101 and HMDB51 datasets, indicating that our STM and CM are efficient.
In Table 1, ARN+SsSA-rotation refers to ARN with self-supervised learning and data augmentation, which enhances the model’s generalization performance by increasing the amount of additional prior knowledge and training data. In contrast, our method does not use data augmentation or self-supervised learning; nonetheless, it achieved higher recognition accuracy than ARN+SsSA-rotation under different task settings for UCF101, HMDB51, and Kinetics. This may be because the dynamic generation of adaptive parameters for the K inter improved the model’s ability to learn differences between tasks, along with its generalization ability. The MR is based on ARN and adds three dynamic parameter generation networks between the four layers of the feature extraction module. Compared with the MR, the proposed method only includes a K inter and achieves higher recognition accuracies than MR under different task settings for UCF101 and HMDB51. This may be because our structure of K inter is more concise and efficient than the corresponding module in MR, and the use of D3D enhanced the model’s representation capability.

4.2.2. Using 2D CNN Embedding Pre-Trained on ImageNet

Complex 2D CNN embeddings such as ResNet-50, trained on the large-scale ImageNet dataset containing over one million images, have greatly improved the performance on tasks such as few-shot action recognition, owing to the rich prior knowledge that ImageNet provides. In this section, we replace the embedding with ResNet-50 and compare the model with SOTA algorithms. We also conduct comparison experiments on the Kinetics dataset to make the results more convincing.
As shown in Table 2 and Table 3, in the 5-way 1-shot and 5-way 5-shot experiments on HMDB51 and in the 5-way 1-shot experiments on Kinetics, HiTIM achieved recognition accuracies of 61.0%, 74.3%, and 77.4%, respectively, which were higher than those of the other SOTA methods. Under the other settings, the performance of our proposed method is comparable to that of the best-performing SOTA method, indicating that HiTIM is competitive with these methods. This may be because, although ResNet-50 extracts high-quality image features, these temporal alignment methods [20,23,24,37] lack discriminative information across tasks, and extracting frame features in isolation also loses motion information. HiTIM incorporates inter-task information into the prior knowledge of the ResNet-50 embedding and explores self-information in the spatiotemporal domain as well as correlations between features.
PAL [37] outperforms our method in the 5-way 1-shot experiments on UCF101 and the 5-way 5-shot experiment on Kinetics, possibly because it incorporates query-set information when building prototypes, whereas our approach treats the query set as completely unknown, which is more in line with the task setting. Although TRX [23] performs better than our method in the 5-way 5-shot experiments on UCF101, it is affected by the number of input frames when computing the combined frame tuples, whereas our method handles video-level features, which allows the number of input frames to be set more flexibly. Overall, our method achieves new SOTA performance in more settings than other SOTA algorithms and can be applied flexibly with different embeddings, demonstrating the advantages of our approach.

4.3. Ablation Study

To verify the effectiveness of the proposed modules in HiTIM, i.e., K inter , K intra , STM, and CM, we conducted ablation experiments under the 5-way 1-shot task settings of the UCF101 and HMDB51, with the 3D CNN embedding without pre-training.
To investigate the effectiveness of K inter and K intra , we introduced a comparative model called HiTIM*. The model is based on HiTIM, but with K inter and K intra removed, while the inputs and outputs remain the same, as shown in Table 4. In K inter * , D3D is replaced with 3D convolution.
The results of the ablation study of K inter and K intra are shown in Table 4.
Results for K inter : compared with HiTIM*, HiTIM*+ K inter achieved recognition-accuracy improvements of 1.41% and 1.11% for HMDB51 and UCF101, respectively. The additional parameters introduced by K inter implicitly capture the knowledge of inter-task differences in the few-shot scenario. As shown in Figure 8, in the task-related feature space established by K inter , the intra-class distance of video features from five random categories was reduced, while the inter-class distance was increased. The experimental results demonstrated that K inter could construct a task-related feature space and increase recognition accuracy, confirming its effectiveness.
Results for K intra : compared with HiTIM*, HiTIM*+ K intra achieved recognition-accuracy improvements of 0.67% and 0.86% for HMDB51 and UCF101, respectively. The parameters of K intra contain implicit feature information, which can be used to mine the key parts of the attention matrix. The experimental results indicated that the attention mechanism alone had a limited feature enhancement effect, and that K intra enhanced the distribution of key feature information and the correlation between features, thereby reducing the redundancy of task-specific features and aligning them.
Results for K inter + K intra : compared with HiTIM*+ K inter and HiTIM*+ K intra , HiTIM achieved recognition-accuracy improvements of 0.90% and 1.64% for HMDB51, and of 0.48% and 0.73% for UCF101, respectively. K inter extracts task-related features by exploring the differences between tasks, whereas K intra enhances the self-information and mutual information of features within the task. The experimental results indicated that the combination of K inter and K intra led to significantly better performance than the use of K intra or K inter alone. This is attributed to the fact that K inter generates task-specific features and K intra enhances them; the joint use of these two modules resulted in the extraction of more discriminative features, improving the recognition performance.
Results for D3D: compared with HiTIM*+ K inter *+ K intra , HiTIM achieved recognition-accuracy improvements of 0.74% and 0.99% for HMDB51 and UCF101, respectively. The D3D, with added offsets for each sampling point in the convolution kernel, increases the convolution receptive field and enhances the modeling capability. The experimental results demonstrate that the use of D3D helps K inter to mine task-related knowledge, further demonstrating the effectiveness of K inter .
To investigate the effects of the proposed metric module and its two branches (STM and CM) on the results, we conducted ablation experiments on the matching module.
The results of the ablation study of STM and CM are shown in Table 5.
Results for STM: compared with the cosine distance, the recognition accuracy of STM was 5.45% and 6.54% higher for HMDB51 and UCF101, respectively. STM enhances the importance of key regions by exploring the correlation between spatiotemporal feature points inside the features. The experimental results indicated that the STM significantly increased recognition accuracy, suggesting that key regions of features largely determine the accuracy of matching.
Results for CM: compared with the cosine distance, the recognition accuracy of CM was 5.13% and 6.14% higher for HMDB51 and UCF101, respectively. The CM explores the correlation between the query-set and support-set features, assigning larger weights to the key parts that are correlated. The experimental results indicated that CM significantly increased recognition accuracy, suggesting that the mutual information between features influences the accuracy of matching.
Results for STM and CM: the experimental results indicated that using both STM and CM increased the recognition accuracy compared with using STM or CM alone, confirming the effectiveness of the proposed matching module.

5. Discussion

Most of the current deep-learning-based video-action-recognition methods rely heavily on large amounts of annotated data. However, in the real world, obtaining and annotating data for some categories can be challenging due to factors such as filming difficulties, collection and annotation costs, and privacy and ethical concerns. Consequently, few-shot action recognition has significant application value in areas such as video surveillance and human–computer interaction.
We propose a hierarchical task information mining (HiTIM) method for few-shot action recognition. This method uses the K inter and the matching module with the K intra . The K inter adaptively generates network parameters according to different tasks, constructs a task-related feature space, and improves model generalization. The matching module with the K intra , which consists of two branches (the STM and CM), strengthens the self-information and mutual information of features. The K intra optimizes the two matching branches to enhance feature discriminability and improve matching performance. In experiments involving the HMDB51, UCF101, and Kinetics datasets, our method achieved new SOTA recognition accuracies, outperforming leading few-shot action recognition algorithms using a 3D CNN without pre-training and a 2D CNN pre-trained on ImageNet as embedding. In future work, we will try to develop improved structures and apply our approach to practical settings such as surveillance and medical applications.

Author Contributions

Conceptualization, P.C.; methodology, J.Y. and Y.D.; software, J.Y. and L.J.; validation, J.Y., Y.D. and R.H.; investigation, J.Y.; resources, P.C.; data curation, J.Y.; writing—original draft preparation, L.J. and J.Y.; writing—review and editing, P.C. and R.H.; visualization, L.J. and J.Y.; project administration, P.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation (Grants U1909203 and 62206250) and the Zhejiang Provincial Natural Science Foundation of China (Grants LY19F020032 and LQ22F020007).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The UCF101, HMDB51, and Kinetics datasets used in this study are available at https://tensorflow.google.cn/datasets/catalog/ucf101, https://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database, and https://www.deepmind.com/open-source/kinetics, respectively, accessed on 13 April 2023.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
HiTIM: Hierarchical task information mining
STM: Spatiotemporal self-attention matching
CM: Correlated cross-attention matching
3D: Three-dimensional
CNN: Convolutional neural network
C3D: 3D convolutional network
D3D: Deformable 3D convolution
ARN: Few-shot action recognition with permutation-invariant attention
SoSN: Power normalizing second-order similarity network for few-shot learning
MR: Few-shot action recognition using task-adaptive parameters

References

  1. Leong, M.C.; Prasad, D.K.; Lee, Y.T.; Lin, F. Semi-CNN architecture for effective spatio-temporal learning in action recognition. Appl. Sci. 2020, 10, 557. [Google Scholar] [CrossRef]
  2. Yang, H.; Gu, Y.; Zhu, J.; Hu, K.; Zhang, X. PGCN-TCA: Pseudo graph convolutional network with temporal and channel-wise attention for skeleton-based action recognition. IEEE Access 2020, 8, 10040–10047. [Google Scholar] [CrossRef]
  3. Carreira, J.; Zisserman, A. Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2017; pp. 6299–6308. [Google Scholar]
  4. Sahoo, S.P.; Modalavalasa, S.; Ari, S. DISNet: A sequential learning framework to handle occlusion in human action recognition with video acquisition sensors. Digit. Signal Process. 2022, 131, 103763. [Google Scholar] [CrossRef]
  5. Koch, G.; Zemel, R.; Salakhutdinov, R. Siamese neural networks for one-shot image recognition. In Proceedings of the ICML Deep Learning Workshop, Lille, France, 6–11 July 2015; Volume 2. [Google Scholar]
  6. Li, H.; Eigen, D.; Dodge, S.; Zeiler, M.; Wang, X. Finding task-relevant features for few-shot learning by category traversal. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 1–10. [Google Scholar]
  7. Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; Torralba, A. Learning deep features for discriminative localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 2921–2929. [Google Scholar]
  8. Zhou, K.; Liu, Z.; Qiao, Y.; Xiang, T.; Loy, C.C. Domain generalization: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 4396–4415. [Google Scholar] [CrossRef] [PubMed]
  9. Noroozi, M.; Favaro, P. Unsupervised learning of visual representations by solving jigsaw puzzles. In Proceedings of the European Conference Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 69–84. [Google Scholar]
  10. Balaji, Y.; Sankaranarayanan, S.; Chellappa, R. MetaReg: Towards domain generalization using meta-regularization. In Proceedings of the International Conference on Neural Information Processing Systems, Montreal, Canada, 3–8 December 2018; pp. 1006–1016. [Google Scholar]
  11. Zhang, H.; Zhang, L.; Qi, X.; Li, H.; Torr, P.H.; Koniusz, P. Few-shot action recognition with permutation-invariant attention. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 525–542. [Google Scholar]
  12. Carlucci, F.M.; D’Innocente, A.; Bucci, S.; Caputo, B.; Tommasi, T. Domain generalization by solving jigsaw puzzles. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 2229–2238. [Google Scholar]
  13. Liu, Q.; Dou, Q.; Heng, P.A. Shape-aware meta-learning for generalizing prostate MRI segmentation to unseen domains. In Proceedings of the Medical Image Computing and Computer Assisted Intervention(MICCAI), Lima, Peru, 4–8 October 2020; pp. 475–485. [Google Scholar]
  14. Li, H.; Dong, W.; Mei, X.; Ma, C.; Huang, F.; Hu, B.G. LGM-Net: Learning to generate matching networks for few-shot learning. In Proceedings of the International Conference on Machine Learning(ICML), Long Beach, CA, USA, 9–15 June 2019; pp. 3825–3834. [Google Scholar]
  15. Zong, P.; Chen, P.; Yu, T.; Yan, L.; Huan, R. Few-shot action recognition using task-adaptive parameters. Electron. Lett. 2021, 57, 848–850. [Google Scholar] [CrossRef]
  16. Han, Y.; Huang, G.; Song, S.; Yang, L.; Wang, H.; Wang, Y. Dynamic neural networks: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7436–7456. [Google Scholar] [CrossRef]
  17. Zhang, H.; Koniusz, P. Power normalizing second-order similarity network for few-shot learning. In Proceedings of the Winter Conference on Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, 7–11 January 2019; pp. 1185–1193. [Google Scholar]
  18. Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos. Adv. Neural Inf. Process. Syst. 2014, 27, 568–576. [Google Scholar]
  19. Chen, D.; Zhang, T.; Zhou, P.; Yan, C.; Li, C. OFPI: Optical Flow Pose Image for Action Recognition. Mathematics 2023, 11, 1451. [Google Scholar] [CrossRef]
  20. Cao, K.; Ji, J.; Cao, Z.; Chang, C.Y.; Niebles, J.C. Few-shot video classification via temporal alignment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 10618–10627. [Google Scholar]
  21. Li, C.; Zhong, Q.; Xie, D.; Pu, S. Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation. arXiv 2018, arXiv:1804.06055. [Google Scholar]
  22. Tasnim, N.; Islam, M.K.; Baek, J.H. Deep learning based human activity recognition using spatio-temporal image formation of skeleton joints. Appl. Sci. 2021, 11, 2675. [Google Scholar] [CrossRef]
  23. Perrett, T.; Masullo, A.; Burghardt, T.; Mirmehdi, M.; Damen, D. Temporal-relational crosstransformers for few-shot action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 19–25 June 2021; pp. 475–484. [Google Scholar]
  24. Wang, X.; Zhang, S.; Qing, Z.; Tang, M.; Zuo, Z.; Gao, C.; Jin, R.; Sang, N. Hybrid relation guided set matching for few-shot action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 19–24 June 2022; pp. 19948–19957. [Google Scholar]
  25. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010. [Google Scholar]
  26. Vaswani, A.; Ramachandran, P.; Srinivas, A.; Parmar, N.; Hechtman, B.; Shlens, J. Scaling local self-attention for parameter efficient visual backbones. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 19–25 June 2021; pp. 12894–12904. [Google Scholar]
  27. Soomro, K.; Zamir, A.R.; Shah, M. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv 2012, arXiv:1212.0402. [Google Scholar]
  28. Kuehne, H.; Jhuang, H.; Garrote, E.; Poggio, T.; Serre, T. HMDB: A large video database for human motion recognition. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Barcelona, Spain, 6–13 November 2011; pp. 2556–2563. [Google Scholar]
  29. Wang, Y.; Yao, Q.; Kwok, J.T.; Ni, L.M. Generalizing from a few examples: A survey on few-shot learning. ACM Comput. Surv. 2020, 53, 1–34. [Google Scholar] [CrossRef]
  30. Gao, Y. Efficiently comparing face images using a modified Hausdorff distance. IEEE-Proc.-Vision Image Signal Process. 2003, 150, 346–350. [Google Scholar] [CrossRef]
  31. Ravi, S.; Larochelle, H. Optimization as a model for few-shot learning. In Proceedings of the International Conference on Learning Representations, Sydney, Australia, 6–11 August 2017. [Google Scholar]
  32. Finn, C.; Abbeel, P.; Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 1126–1135. [Google Scholar]
  33. Snell, J.; Swersky, K.; Zemel, R. Prototypical networks for few-shot learning. In Proceedings of the International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 4080–4090. [Google Scholar]
  34. Vinyals, O.; Blundell, C.; Lillicrap, T.; Kavukcuoglu, K.; Wierstra, D. Matching Networks for One Shot Learning. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., Garnett, R., Eds.; Curran Associates, Inc.: Sydney, New South Wales, 2016; Volume 29. [Google Scholar]
  35. Sung, F.; Yang, Y.; Zhang, L.; Xiang, T.; Torr, P.H.; Hospedales, T.M. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 1199–1208. [Google Scholar]
  36. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  37. Zhu, X.; Toisoul, A.; Perez-Rua, J.M.; Zhang, L.; Martinez, B.; Xiang, T. Few-shot action recognition with prototype-centered attentive learning. arXiv 2021, arXiv:2101.08085. [Google Scholar]
  38. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 13–16 December 2015; pp. 4489–4497. [Google Scholar]
  39. Feichtenhofer, C.; Fan, H.; Malik, J.; He, K. Slowfast networks for video recognition. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6202–6211. [Google Scholar]
  40. Huang, G.; Liu, S.; Van der Maaten, L.; Weinberger, K.Q. Condensenet: An efficient densenet using learned group convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 2752–2761. [Google Scholar]
  41. Yang, B.; Bender, G.; Le, Q.V.; Ngiam, J. Condconv: Conditionally parameterized convolutions for efficient inference. Adv. Neural Inf. Process. Syst. 2019, 32, 1307–1318. [Google Scholar]
  42. Chen, Y.; Dai, X.; Liu, M.; Chen, D.; Yuan, L.; Liu, Z. Dynamic convolution: Attention over convolution kernels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognitio (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 11030–11039. [Google Scholar]
  43. Zhou, W.; Xu, C.; Ge, T.; McAuley, J.; Xu, K.; Wei, F. Bert loses patience: Fast and robust inference with early exit. Adv. Neural Inf. Process. Syst. 2020, 33, 18330–18341. [Google Scholar]
  44. Shen, J.; Wang, Y.; Xu, P.; Fu, Y.; Wang, Z.; Lin, Y. Fractional skipping: Towards finer-grained dynamic CNN inference. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 5700–5708. [Google Scholar]
  45. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar]
  46. Ying, X.; Wang, L.; Wang, Y.; Sheng, W.; An, W.; Guo, Y. Deformable 3d convolution for video super-resolution. IEEE Signal Process. Lett. 2020, 27, 1500–1504. [Google Scholar] [CrossRef]
  47. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  48. Li, S.; Liu, H.; Qian, R.; Li, Y.; See, J.; Fei, M.; Yu, X.; Lin, W. TA2N: Two-stage action alignment network for few-shot action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, Canada, 22 February–1 March 2022; Volume 36, pp. 1404–1411. [Google Scholar]
Figure 1. Architecture of proposed HiTIM method. Embedding is used to extract generic features from videos and can be performed using either 3D or 2D CNN. Based on the generic features from the support set, K inter adaptively generates the parameters of a task-related encoder composed mainly of three layers of 3D convolutions to generate task-specific features. STM and CM are two attention matching branches optimized based on K intra , which, respectively, enhance the spatiotemporal self-information and correlation among features, calculate the similarity between the task-specific features from the support and query sets, and each calculate the cross-entropy loss of the two similarities.
Figure 2. Architecture of K inter . K inter can be mainly divided into three steps: task feature extraction, noise injection, and parameter generation. During task feature extraction, D3D with flexible receptive fields is used. Noise is injected into task features to enhance the robustness of K inter , and Gaussian noise is adopted here. Finally, adaptive parameters P adaptive of the task-related encoder are generated through fully connected layers.
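A minimal PyTorch sketch of the three steps in Figure 2 is given below. For brevity, the deformable 3D convolution (D3D) is replaced by an ordinary Conv3d, and the descriptor size, noise level, and the single generated 3 × 3 × 3 layer are hypothetical choices; the paper's task-related encoder comprises three such layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KInterSketch(nn.Module):
    """Generates weights for one 3x3x3 conv layer of the task-related encoder.

    Keeps the three steps of the Figure 2 caption (task feature extraction,
    noise injection, parameter generation); the D3D of the paper is replaced
    by a plain Conv3d purely to keep the sketch short.
    """
    def __init__(self, c_in=64, c_out=64, hidden=256, noise_std=0.1):
        super().__init__()
        self.task_conv = nn.Conv3d(c_in, c_in, kernel_size=3, padding=1)  # stand-in for D3D
        self.noise_std = noise_std
        n_params = c_out * c_in * 3 * 3 * 3
        self.fc = nn.Sequential(nn.Linear(c_in, hidden), nn.ReLU(),
                                nn.Linear(hidden, n_params))
        self.c_in, self.c_out = c_in, c_out

    def forward(self, support_feats):
        # support_feats: (N*K, C, T, H, W) generic features of the support set.
        task = self.task_conv(support_feats).mean(dim=[0, 2, 3, 4])   # (C,) task descriptor
        if self.training:
            task = task + self.noise_std * torch.randn_like(task)     # Gaussian noise injection
        weight = self.fc(task).view(self.c_out, self.c_in, 3, 3, 3)   # adaptive parameters
        return weight

# Usage: encode a query feature with the task-adapted convolution.
gen = KInterSketch()
support = torch.randn(5, 64, 8, 7, 7)
w = gen(support)
query = torch.randn(10, 64, 8, 7, 7)
task_specific = F.conv3d(query, w, padding=1)   # one layer of the task-related encoder
print(task_specific.shape)
```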
Figure 3. Architecture of K intra. K intra is a flexible optimization module for attention matrices; it consists mainly of two 2D convolutions and adaptively assigns high weights to critical positions. The number of input channels of the first 2D convolution equals the number of output channels of the second, so the dimension of the attention matrix is unchanged. L* is the number of output channels of the first 2D convolution.
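The channel arithmetic of Figure 3 can be illustrated with the sketch below, again in PyTorch. Only the L → L* → L channel pattern, which keeps the attention matrix's dimension unchanged, is taken from the caption; the 1 × 1 kernels, the ReLU, and the final sigmoid are assumptions.

```python
import torch
import torch.nn as nn

class KIntraSketch(nn.Module):
    """Refines an attention matrix with a two-layer 2D-convolution bottleneck.

    Channel numbers follow the Figure 3 caption: the first conv maps L -> L_star
    and the second maps L_star -> L, so the attention matrix keeps its shape.
    """
    def __init__(self, L, L_star):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(L, L_star, kernel_size=1), nn.ReLU(),
            nn.Conv2d(L_star, L, kernel_size=1), nn.Sigmoid())

    def forward(self, m):
        # m: (B, L, L) attention matrix; add a trailing spatial dim for Conv2d.
        return self.refine(m.unsqueeze(-1)).squeeze(-1)   # (B, L, L) refined weights

# Usage with a spatiotemporal dimension of L = 8*7*7 and a bottleneck of 64.
L = 8 * 7 * 7
k_intra = KIntraSketch(L, L_star=64)
m = torch.rand(2, L, L)
print(k_intra(m).shape)    # torch.Size([2, 392, 392])
```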
Figure 4. Architecture of STM. First, the temporal and spatial dimensions of the input features are merged into a single spatiotemporal dimension L. The attention matrix M STM is then obtained by element-wise multiplication and dimension reduction over L. K intra optimizes M STM to obtain A STM. The optimized feature is obtained by multiplying the input feature with A STM and adding a residual term. The similarity Sim STM between F q STM and F p STM is calculated with the cosine distance.
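One plausible reading of the STM branch, under the assumption that M STM is an L × L matrix over the merged spatiotemporal positions, is sketched below. The scaling by the channel number and the row-wise softmax passed in place of K intra are illustrative choices, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def stm_branch(feat_q, feat_p, k_intra):
    """Spatiotemporal self-attention matching (STM) sketch; shapes assumed.

    feat_q: (Bq, C, T, H, W) query features; feat_p: (Bp, C, T, H, W) prototypes.
    k_intra: callable mapping a (B, L, L) attention matrix to refined weights.
    """
    def self_attend(x):
        b, c, t, h, w = x.shape
        x = x.reshape(b, c, t * h * w)                 # merge T, H, W into one dimension L
        m = torch.einsum('bcl,bck->blk', x, x) / c     # (B, L, L) self-attention matrix M_STM
        a = k_intra(m)                                 # A_STM after refinement by K_intra
        out = torch.einsum('bcl,blk->bck', x, a) + x   # re-weight the feature + residual term
        return out.flatten(1)

    fq, fp = self_attend(feat_q), self_attend(feat_p)
    # Cosine similarity between every query and every prototype.
    return F.normalize(fq, dim=1) @ F.normalize(fp, dim=1).t()

# Usage with a row-wise softmax standing in for K_intra (hypothetical shapes).
sim = stm_branch(torch.randn(10, 64, 8, 7, 7),
                 torch.randn(5, 64, 8, 7, 7),
                 lambda m: torch.softmax(m, dim=-1))
print(sim.shape)    # torch.Size([10, 5])
```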
Figure 5. Architecture of CM. CM has a similar structure to STM, but the attention matrix M CM is calculated by cross-multiplying the query features with the prototypes, rather than multiplying the features by themselves.
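Under the same assumptions, the cross-attention variant of Figure 5 can be sketched as follows: each query–prototype pair gets its own cross-correlation matrix, which re-weights both features before the cosine comparison. The explicit double loop and the softmax placeholder for K intra are for clarity only and are not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def cm_branch(feat_q, feat_p, k_intra):
    """Correlated cross-attention matching (CM) sketch; shapes assumed.

    Unlike STM, the attention matrix of each (query, prototype) pair is the
    cross-correlation between the two features rather than a self-product.
    """
    bq, c = feat_q.shape[:2]
    bp = feat_p.shape[0]
    q = feat_q.reshape(bq, c, -1)                          # (Bq, C, L)
    p = feat_p.reshape(bp, c, -1)                          # (Bp, C, L)
    sims = torch.empty(bq, bp)
    for i in range(bq):
        for j in range(bp):
            m = torch.einsum('cl,ck->lk', q[i], p[j]) / c      # (L, L) cross-attention matrix
            a = k_intra(m.unsqueeze(0)).squeeze(0)             # refined by K_intra
            qi = torch.einsum('cl,lk->ck', q[i], a) + q[i]     # re-weight query + residual
            pj = torch.einsum('cl,kl->ck', p[j], a) + p[j]     # re-weight prototype + residual
            sims[i, j] = F.cosine_similarity(qi.flatten(), pj.flatten(), dim=0)
    return sims

# Usage with a row-wise softmax standing in for K_intra (hypothetical shapes).
sim = cm_branch(torch.randn(4, 64, 8, 7, 7), torch.randn(5, 64, 8, 7, 7),
                lambda m: torch.softmax(m, dim=-1))
print(sim.shape)    # torch.Size([4, 5])
```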
Figure 6. Sample videos from HMDB51, UCF101, and Kinetics. These three benchmark datasets, all with video-level annotations, cover actions from various domains such as sports, music, and daily life, with complex and realistic scenes.
Figure 7. Accuracy curve of the validation set during training. The horizontal axis of the graph indicates the number of training epochs (each consisting of 5000 iterations), and the vertical axis indicates the average accuracy for the validation set. (a) Results under 5-way 1-shot on the UCF101. (b) Results under 5-way 1-shot on the HMDB51.
Figure 8. t-SNE plots of visual features for five randomly selected classes in HMDB51. (a) Task-agnostic feature space without K inter. (b) Task-related feature space using K inter.
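Plots such as Figure 8 are typically produced by projecting pooled video features to two dimensions with t-SNE and coloring points by class; a generic sketch with synthetic features (scikit-learn and matplotlib, with hypothetical shapes and class count) is shown below.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Synthetic stand-in for pooled video features of 5 classes; Figure 8 uses
# real HMDB51 features, so shapes and class separation here are illustrative.
rng = np.random.default_rng(0)
n_classes, per_class, dim = 5, 40, 256
feats = np.concatenate([rng.normal(loc=3 * c, scale=1.0, size=(per_class, dim))
                        for c in range(n_classes)])
labels = np.repeat(np.arange(n_classes), per_class)

# Project to 2D with t-SNE and scatter-plot by class.
emb = TSNE(n_components=2, perplexity=30, init='pca', random_state=0).fit_transform(feats)
for c in range(n_classes):
    pts = emb[labels == c]
    plt.scatter(pts[:, 0], pts[:, 1], s=10, label=f'class {c}')
plt.legend()
plt.title('t-SNE of video features (synthetic example)')
plt.savefig('tsne_features.png', dpi=150)
```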
Table 1. The 5-way 1-shot and 5-way 5-shot few-shot recognition accuracy (%) using 3D CNN embedding without pre-training on the HMDB51 and UCF101 datasets.

Methods | HMDB51 5-Way 1-Shot | HMDB51 5-Way 5-Shot | UCF101 5-Way 1-Shot | UCF101 5-Way 5-Shot
3D Prototypical Networks [33] | 38.05 ± 1.02 | 53.15 ± 0.73 | 57.05 ± 1.02 | 78.25 ± 0.73
3D Relation Network [35] | 38.23 ± 0.97 | 53.17 ± 0.86 | 58.21 ± 1.02 | 78.35 ± 0.72
3D SoSN [17] | 40.83 ± 0.96 | 55.18 ± 0.86 | 62.57 ± 1.03 | 81.51 ± 0.73
ARN [11] | 42.41 ± 0.99 | 56.81 ± 0.87 | 64.48 ± 1.06 | 82.37 ± 0.72
ARN [11] + SsSA-rotation | 45.15 ± 0.96 | 60.56 ± 0.86 | 66.32 ± 0.99 | 83.12 ± 0.70
MR [15] | 42.62 ± 0.52 | 58.39 ± 0.84 | 65.17 ± 0.38 | 83.51 ± 0.65
HiTIM | 46.92 ± 0.42 | 61.74 ± 0.67 | 66.54 ± 0.51 | 84.71 ± 0.53
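Assuming the ± values in Tables 1, 4, and 5 are 95% confidence intervals over randomly sampled test episodes, as is standard in the few-shot literature, such figures can be computed from per-episode accuracies with a short script like the following; the episode count and the synthetic accuracies are placeholders.

```python
import numpy as np

def mean_and_ci95(episode_acc):
    """Mean accuracy and 95% confidence half-width over test episodes."""
    acc = np.asarray(episode_acc, dtype=np.float64)
    mean = acc.mean()
    half_width = 1.96 * acc.std(ddof=1) / np.sqrt(len(acc))
    return mean, half_width

# Example with synthetic per-episode accuracies from 10,000 sampled episodes.
rng = np.random.default_rng(0)
episodes = rng.normal(loc=0.469, scale=0.21, size=10_000).clip(0, 1)
m, ci = mean_and_ci95(episodes)
print(f'{100 * m:.2f} ± {100 * ci:.2f}')
```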
Table 2. The 5-way 1-shot and 5-way 5-shot few-shot recognition accuracy (%) using 2D CNN embedding pre-trained on ImageNet on the HMDB51 and UCF101 datasets.

Methods | HMDB51 5-Way 1-Shot | HMDB51 5-Way 5-Shot | UCF101 5-Way 1-Shot | UCF101 5-Way 5-Shot
2D Prototypical Networks [33] | 54.2 | 68.4 | 74.0 | 89.6
OTAM [20] | 54.5 | 68.0 | 79.9 | 88.9
TRX [23] | 53.1 | 75.6 | 78.2 | 96.1
PAL [37] | 60.9 | 75.8 | 85.3 | 95.2
TA2N [48] | 59.7 | 73.9 | 81.9 | 95.1
HyRSM [24] | 60.3 | 76.0 | 83.9 | 94.7
HiTIM | 61.0 | 77.4 | 84.6 | 94.9
Table 3. The 5-way 1-shot and 5-way 5-shot few-shot recognition accuracy (%) using 2D CNN embedding pre-trained on ImageNet on the Kinetics dataset.

Methods | 5-Way 1-Shot | 5-Way 5-Shot
OTAM [20] | 73.0 | 85.8
TRX [23] | 63.6 | 85.9
PAL [37] | 74.2 | 87.1
TA2N [48] | 73.0 | 85.8
HyRSM [24] | 73.7 | 86.1
HiTIM | 74.3 | 86.0
Table 4. Ablation study of K inter and K intra under the 5-way 1-shot setting on the HMDB51 and UCF101 datasets (accuracy, %).

Setting | HMDB51 | UCF101
HiTIM* | 44.61 ± 0.92 | 63.95 ± 0.64
HiTIM* + Kinter | 46.02 ± 0.81 | 65.06 ± 0.50
HiTIM* + Kinter | 45.28 ± 0.86 | 64.81 ± 0.42
HiTIM* + Kinter* + Kintra | 46.18 ± 0.46 | 65.45 ± 0.38
HiTIM | 46.92 ± 0.42 | 66.54 ± 0.41
Table 5. Ablation study of STM and CM under the 5-way 1-shot setting on the HMDB51 and UCF101 datasets (accuracy, %).

Match | HMDB51 | UCF101
Cosine | 40.28 ± 0.95 | 58.71 ± 1.12
STM | 45.73 ± 0.68 | 65.37 ± 0.52
CM | 45.41 ± 0.56 | 64.81 ± 0.43
STM + CM | 46.92 ± 0.42 | 66.54 ± 0.41