Article

RCAT: Retentive CLIP Adapter Tuning for Improved Video Recognition

Zexun Xie, Min Xu, Shudong Zhang and Lijuan Zhou
1 Information Engineering College, Capital Normal University, 56 West 3rd Ring North Road, Beijing 100048, China
2 School of Cyberspace Security, Hainan University, 58 Renmin Avenue, Haikou 570228, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(5), 965; https://doi.org/10.3390/electronics13050965
Submission received: 21 January 2024 / Revised: 27 February 2024 / Accepted: 29 February 2024 / Published: 2 March 2024

Abstract

The advent of Contrastive Language-Image Pre-training (CLIP) models has revolutionized the integration of textual and visual representations, significantly enhancing the interpretation of static images. However, their application to video recognition poses unique challenges due to the inherent dynamism and multimodal nature of video content, which includes temporal changes and spatial details beyond the capabilities of traditional CLIP models. These challenges necessitate an advanced approach capable of comprehending the complex interplay between the spatial and temporal dimensions of video data. To this end, this study introduces an innovative approach, Retentive CLIP Adapter Tuning (RCAT), which synergizes the foundational strengths of CLIP with the dynamic processing prowess of a Retentive Network (RetNet). Specifically designed to refine CLIP’s applicability to video recognition, RCAT facilitates a nuanced understanding of video sequences by leveraging temporal analysis. At the core of RCAT is its specialized adapter tuning mechanism, which modifies the CLIP model to better align with the temporal intricacies and spatial details of video content, thereby enhancing the model’s predictive accuracy and interpretive depth. Our comprehensive evaluations on benchmark datasets, including UCF101, HMDB51, and MSR-VTT, underscore the effectiveness of RCAT. Our proposed approach achieves notable accuracy improvements of 1.4% on UCF101, 2.6% on HMDB51, and 1.1% on MSR-VTT compared to existing models, illustrating its superior performance and adaptability in the context of video recognition tasks.

1. Introduction

In the domain of computer vision, large-scale image–text pre-training models such as Contrastive Language-Image Pre-Training (CLIP) [1] have demonstrated remarkable efficacy in downstream applications such as image classification and caption generation. Departing from traditional research paradigms, this model is pre-trained via a contrastive learning approach and incorporates a fusion of image–text modal information. In parallel, within the realm of natural language processing, Sun et al. recently introduced the Retentive Network (RetNet) [2], posited as an evolutionary advancement over the Transformer [3] architecture. Compared with the Transformer, RetNet exhibits enhanced capabilities in three key dimensions: parallel training efficiency, strong performance, and low-cost inference. Inspired by these technological advancements, our objective centers on transferring these robust architectures to video recognition tasks. This strategy aims to leverage the unique strengths of each model to realize enhanced performance.
Existing methods in video recognition primarily fall into two categories: those leveraging 3D-CNN [4] and Transformer models, like Uniformer [5] and UniformerV2 [6], and those extending the dual-modal capabilities of CLIP for video tasks, such as X-CLIP [7] and ViFi-CLIP [8]. While these approaches have achieved certain milestones in video recognition accuracy, they grapple with challenges in computational costs and effective video–text feature alignment.
Building upon the robust capabilities of RetNet and the foundational work on the CLIP model by predecessors, such as CoOp [9] and CLIP-Adapter [10], we find inspiration to address the aforementioned challenges. To effectively transfer the CLIP model to video recognition tasks while facilitating video–text alignment, we introduce the RCAT model, a potent dual-modal video recognition technology. This model synergizes CLIP with RetNet to harness both spatial and temporal video features, employing a hard prompt approach for integrating textual modal features as benchmarks. Additionally, the RCAT model incorporates an adapter layer to fine-tune video features for alignment with text features. Summarily, our contributions through this study are threefold:
  • We introduce the Retentive CLIP Adapter Tuning (RCAT) model, an innovative integration that combines CLIP’s robust image–text pre-training with RetNet’s advanced temporal processing. This synergy enhances the analysis of complex video sequences, establishing a novel benchmark in the realm of video recognition tasks.
  • Inception of an Innovative Adapter Tuning Mechanism: The RCAT model pioneers an adapter tuning methodology specifically tailored to refine the pre-trained CLIP model for enhanced applicability in video recognition tasks. This mechanism proficiently synchronizes video and text features, facilitating a more profound and precise understanding of video content, a pivotal aspect in diverse video recognition scenarios.
  • Empirical Validation on Benchmark Datasets: We have meticulously evaluated the RCAT model’s efficacy using public video recognition datasets, including UCF101 [11] and HMDB51 [12], and the MSR-VTT [13] dataset for video retrieval purposes. Our empirical results reveal notable improvements in recognition precision, highlighting the model’s strength and flexibility across different video recognition and retrieval scenarios.

2. Related Works

2.1. Visual Language Pre-Training Model

In recent years, significant advancements in vision–language pre-training have ushered in new directions for the field of computer vision. Notable models such as VL-BERT [14], CoCa [15], CLIP, VideoCLIP [16], and variants of CLIP like SLIP [17], BLIP [18], and BLIP-2 [19] exemplify this progress. Among these, the CLIP model, grounded in contrastive learning, stands out for its exceptional generalization capabilities, demonstrating impressive performance across a variety of visual downstream tasks. Following CLIP’s foundational work, adaptations like CoOp [9], CLIP-Adapter [10], and Tip-Adapter [20] have emerged, fine-tuning CLIP for few-shot and zero-shot learning, yet these adaptations still grapple with the inherent limitations of resource-intensive training and generalization to dynamic content.
Since Google introduced the Transformer [3] architecture in 2017, it has been widely adopted in both the natural language processing and computer vision domains. The Transformer excels in these fields due to its parallel training capabilities and robust performance, making it a preferred choice for sequence modeling. However, it does face challenges, notably high inference costs. Recently, the Retentive Network (RetNet) [2] was introduced in the field of natural language processing, regarded as an evolutionary advance over the Transformer [3] architecture. RetNet simplifies computation by replacing the traditional self-attention mechanism with a retention mechanism and by using a position-dependent exponential decay term instead of the softmax. Because the attention map is obtained through a simple multiplication with a coefficient matrix, the model's computational load and complexity are reduced. RetNet also encodes positional information with a complex-space representation, replacing absolute or relative position encoding, and is more amenable to recursion. These innovations effectively address the issue of high inference costs, enabling parallel training, cost-effective inference, and maintained performance. Utilizing RetNet's architecture for video sequence processing, we achieve high performance with low-cost inference.

2.2. Video Recognition

In video recognition, a field distinct from image recognition, extracting image features is only part of the challenge. It is equally important to consider the temporal relationships between video frames to capture sequential information. In addressing this complex task, researchers have explored several approaches, each with its unique methodologies and challenges.
Initially, the focus was on using 3D convolution, as exemplified by models like CVRL [21]. This model uses a self-supervised method with the 3D-ResNet-50 [22] network to learn 2D image features and temporal features from unlabeled videos. Subsequently, attention shifted to leveraging the Transformer architecture, leading to the development of models such as MViTv2 [23], ViViT [24], and ViTTA [25]. These models address the challenges of processing long video sequences, with ViViT [24] separating spatial and temporal dimensions and MViTv2 [23] introducing a versatile multi-scale ViT [26] suitable for various tasks including image classification and object detection. ViTTA [25] innovates with a cross-frame attention mechanism for extracting temporal information from videos.
More recently, the trend has shifted towards a hybrid approach that combines 3D convolution with Transformer architectures. This approach is illustrated in models like Video Swin Transformer [27], Swin TransformerV2 [28], Uniformer [5], and UniformerV2 [6], each integrating unique mechanisms to process video data. For instance, the Video Swin Transformer [27] introduces a 3D attention mechanism specifically for video temporal data, and UniformerV2 [6] amalgamates local and global attention mechanisms to effectively extract both spatial and temporal sequence information.
Alongside these developments, the adaptation of CLIP from image to video recognition tasks has led to innovative models such as X-CLIP [7], ViFi-CLIP [8], Text4Vis [29], and CaFo [30]. X-CLIP [7] introduces a novel cross-frame communication attention mechanism, while ViFi-CLIP [8] employs a ‘bridging and prompting’ strategy for fine-tuning the CLIP model for video tasks. Text4Vis [29] leverages language model outputs for improved transfer learning in video contexts. CaFo [30] augments training data by generating semantic cues and synthesizing images, then boosting predictive performance through model caching. Despite their effectiveness, these methods require extensive pre-training on large video datasets, making them resource-intensive.
In this landscape, our RCAT model emerges as a significant advancement. It integrates the robust image–text pre-training capabilities of CLIP with the efficient temporal processing of RetNet. This combination addresses the challenges of computational efficiency and dynamic content processing in video recognition, offering a more resource-efficient and comprehensive solution for advanced tasks in this field.

3. Methodology

In this section, we elucidate the Retentive CLIP Adapter Tuning (RCAT) framework, our innovative approach to enhance video recognition. Initially, we introduce the foundation of RCAT, combining the temporal finesse of RetNet with CLIP’s multimodal proficiency. Subsequently, we delve into the integration of RetNet into CLIP, emphasizing advances in temporal feature processing. Following this, our focus shifts to the dual-path feature extraction and fusion strategy, a crucial component of RCAT’s effectiveness. Finally, we provide an overview of our adapter tuning mechanism, which plays a pivotal role in fine-tuning feature alignment and optimizing the model.

3.1. Overview of RCAT Framework

In our pursuit of crafting a high-performing video recognition model, we have pioneered the integration of the CLIP and RetNet models. This innovative fusion capitalizes on CLIP’s exceptional image recognition prowess to extract spatial features, while leveraging RetNet’s proficiency in processing temporal information to distill temporal features from videos. Simultaneously, the text encoder from CLIP is employed to derive textual cue features. These features are then interwoven with video features to facilitate contrastive learning, a key aspect of our comprehensive strategy leading to the development of the RCAT model.
As shown in Figure 1, the RCAT model is architecturally composed of three primary components: a video feature extractor, a textual feature extractor, and a bimodal fusion model. The video feature extractor itself is subdivided into three distinct modules: a spatial feature extractor, a temporal feature extractor, and a video adapter. To refine the model’s parameters, a cross-entropy loss function is employed, ensuring effective learning and adaptation.

3.2. Detailed Integration of CLIP and RetNet

CLIP: The CLIP model embodies an innovative bimodal architecture integrating images and text through the principles of contrastive learning. It computes similarities between image and text embeddings. As illustrated in Figure 2, during its pre-training phase, the model is exposed to a vast dataset of image–text pairs. The primary objective is to minimize the embedding distances for relevant pairs (positive samples) while maximizing them for irrelevant pairs (negative samples). During the testing phase, our text encoder leverages its acquired knowledge to construct a zero-shot learning linear classifier. This is accomplished by transforming the names or descriptions of the classes in the target dataset into an embedded form, thereby enabling the model to recognize and classify classes it has not encountered previously based on these textual descriptions. This approach exploits the model’s understanding of the similarities between text and image embeddings, allowing it to bridge the gap between the training and testing datasets to perform zero-shot classification tasks. This process is formalized in Equations (1) and (2):
$$\text{logits} = \mathrm{LN}\big(f_v(v)\big) \times \mathrm{LN}\big(f_t(t)\big)^{T} \tag{1}$$
$$\text{loss} = \mathrm{CEL}(\text{logits}, Y) \tag{2}$$
Here, $v$ and $t$ denote the visual and textual embeddings, respectively. The functions $f_v$ and $f_t$ are the visual and textual encoders. The resulting visual and textual features are then normalized through layer normalization (LN). The term "logits" refers to the similarity scores obtained by multiplying the two normalized vectors. CEL denotes the cross-entropy loss function, and $Y$ denotes the ground-truth labels; the calculated similarities are compared against $Y$ via the cross-entropy loss to optimize the model's weights.
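As a minimal illustration of Equations (1) and (2), the following PyTorch sketch computes the similarity logits and the cross-entropy loss from pre-computed embeddings. The tensor shapes and function names are our own assumptions, and the learned temperature and L2 normalization used in the released CLIP implementation are omitted so that the code mirrors the notation above.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(v_emb, t_emb, labels):
    """Sketch of Equations (1)-(2).
    v_emb: [B, D] visual embeddings f_v(v); t_emb: [C, D] textual embeddings f_t(t),
    one per class prompt; labels: [B] ground-truth class indices Y."""
    v_n = F.layer_norm(v_emb, v_emb.shape[-1:])   # LN(f_v(v))
    t_n = F.layer_norm(t_emb, t_emb.shape[-1:])   # LN(f_t(t))
    logits = v_n @ t_n.t()                        # similarity scores, [B, C]
    loss = F.cross_entropy(logits, labels)        # CEL(logits, Y)
    return logits, loss
```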
Retentive Network: The Retentive Network (RetNet) emerges as a potent successor to the Transformer model in the realm of language processing. The overall architecture of RetNet is shown in Figure 3. RetNet introduces a multi-scale retention attention mechanism as an alternative to the multi-head attention mechanism in Transformer. This mechanism, encompassing both parallel and recurrent forms, facilitates efficient inference and model training. As illustrated in Figure 3a, the parallel retention mechanism can be defined as follows:
$$Q = (X W_Q) \odot \Theta, \qquad K = (X W_K) \odot \bar{\Theta}, \qquad V = X W_V \tag{3}$$
$$\Theta_n = e^{i n \theta}, \qquad D_{nm} = \begin{cases} \gamma^{\,n-m}, & n \ge m \\ 0, & n < m \end{cases} \tag{4}$$
$$\mathrm{Retention}(X) = \big(Q K^{T} \odot D\big) V \tag{5}$$
where $\bar{\Theta}$ is the complex conjugate of $\Theta$, and $D \in \mathbb{R}^{|x| \times |x|}$ combines causal masking and exponential decay along the relative distance into one matrix. $W_Q, W_K, W_V \in \mathbb{R}^{d \times d}$ are learnable matrices, and $\gamma, \theta \in \mathbb{R}^{d}$. Equations (3)–(5) show that the parallel retention mechanism simplifies the computation by replacing complex softmax operations with a coefficient matrix. As illustrated in Figure 3b, the recurrent retention mechanism can be defined as follows:
$$S_n = \gamma S_{n-1} + K_n^{T} V_n \tag{6}$$
$$\mathrm{Retention}(X_n) = Q_n S_n, \qquad n = 1, \ldots, |x| \tag{7}$$
where $S_n \in \mathbb{R}^{d \times d}$ is the state matrix, $d$ is the hidden-layer dimension, and $\gamma$ is a real-valued weight. $K_n \in \mathbb{R}^{1 \times d}$ is the K projection at time step $n$, so that $K_n^{T} V_n \in \mathbb{R}^{d \times d}$. Similarly, $Q_n \in \mathbb{R}^{1 \times d}$ is the Q projection at time step $n$. $X_n$ denotes the input at time step $n$, and $|x|$ denotes the sequence length.
As can be inferred from Equations (3)–(7), the RetNet architecture demonstrates strong capabilities for handling sequence modeling problems. Consequently, we have integrated RetNet into the video recognition model, effectively utilizing its parallel and recursive retention mechanisms to manage the temporal information between video frames. This retention mechanism focuses on the association between the current and previous positions during each computation rather than on the entire global context. This approach not only enhances the model’s self-learning capacity but also reduces the computational load.
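As a concrete reference for Equations (3)–(7), the two retention forms can be sketched in PyTorch as follows. This is a simplified single-head version that omits the complex rotation $\Theta$, the gating, and the multi-scale heads of the full RetNet; shapes and names are assumptions rather than the authors' implementation.

```python
import torch

def parallel_retention(X, W_Q, W_K, W_V, gamma):
    """Parallel form (Equations (3)-(5)) without the rotation term:
    Retention(X) = (Q K^T ⊙ D) V, where D is the causal decay mask."""
    B, T, d = X.shape
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    n = torch.arange(T, device=X.device)
    diff = n[:, None] - n[None, :]                             # n - m
    D = (gamma ** diff.clamp(min=0).float()) * (diff >= 0)     # gamma^(n-m) if n >= m, else 0
    return (Q @ K.transpose(-1, -2) * D) @ V                   # [B, T, d]

def recurrent_retention(X, W_Q, W_K, W_V, gamma):
    """Recurrent form (Equations (6)-(7)): S_n = γ S_{n-1} + K_n^T V_n and
    Retention(X_n) = Q_n S_n; it matches the parallel sketch up to numerical precision."""
    B, T, d = X.shape
    S = X.new_zeros(B, d, d)                                   # state matrix S_0 = 0
    outs = []
    for t in range(T):
        q, k, v = X[:, t] @ W_Q, X[:, t] @ W_K, X[:, t] @ W_V  # [B, d] each
        S = gamma * S + k.unsqueeze(-1) * v.unsqueeze(-2)      # outer-product state update
        outs.append((q.unsqueeze(-2) @ S).squeeze(-2))         # Q_n S_n, [B, d]
    return torch.stack(outs, dim=1)                            # [B, T, d]
```

The parallel form is what makes training over all frames efficient, while the recurrent form processes one time step with a constant-size state, which is the property RCAT exploits for low-cost inference over frame sequences.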

3.3. Feature Extraction and Modality Fusion

Video Feature Extraction: In our study, we addressed the challenge of adapting the CLIP model, known for its robust performance in visual tasks, to the realm of video recognition. This adaptation necessitated enhancing the model’s capability in temporal feature extraction, essentially transitioning it from processing two-dimensional images to handling three-dimensional video data. Initially, the CLIP model’s visual encoder was employed to extract spatial features from individual video frames, deliberately omitting temporal elements. The extracted features were then positionally encoded along the temporal dimension. To further this process, we integrated the novel RetNet architecture. This architecture facilitated the parallel extraction of temporal features, culminating in a comprehensive representation of video features.
Text Feature Extraction: Considering that mainstream video classification datasets do not provide accompanying subtitles, our approach utilized the dataset's label information as a textual hard prompt. This method involved the use of a fixed template, specifically "a video of [CLS]," into which each label was inserted. The template was then processed by the text encoder, and the derived semantic information served as a critical criterion for video recognition, guiding the fine-tuning of the visual features to align with the textual context.
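A minimal sketch of this hard-prompt construction is shown below, assuming the open-source OpenAI clip package as the text encoder; the class names are illustrative.

```python
import torch
import clip  # OpenAI CLIP package, used here only as an illustrative text encoder

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)

class_names = ["archery", "bowling", "surfing"]            # illustrative [CLS] labels
prompts = [f"a video of {name}" for name in class_names]   # fixed hard-prompt template
tokens = clip.tokenize(prompts).to(device)                 # [C, 77] token ids

with torch.no_grad():
    text_features = model.encode_text(tokens)              # [C, D] textual benchmarks
```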
Visual–Textual Fusion: In line with the foundational principles of the original CLIP model, our method involved computing the similarity between the extracted visual and textual features. The synergy between these modalities was enhanced through the use of a cross-entropy loss function, as described in Equations (8) and (9). This bimodal fusion approach effectively leveraged the strengths of both visual and textual data, leading to a more robust and accurate video recognition system.
$$\text{logits} = \mathrm{LN}\big(f_r(f_v(v))\big) \times \mathrm{LN}\big(f_t(t)\big)^{T} \tag{8}$$
$$\text{loss} = \mathrm{CEL}(\text{logits}, Y) \tag{9}$$
where the function $f_r$ represents the temporal feature extractor, specifically the RetNet architecture. The other symbols retain their meanings as defined in Equations (1) and (2). By integrating RetNet for the extraction of temporal features, the original two-dimensional CLIP model was adeptly expanded to a three-dimensional framework, thereby facilitating its application to video recognition tasks. This enhancement not only signifies a pivotal advancement from static image analysis to dynamic video understanding but also establishes a novel paradigm in the domain of video recognition.
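A rough PyTorch sketch of Equations (8) and (9) is given below. The mean pooling over frames, the frozen CLIP encoders, and the module names are assumptions we make for illustration; retnet stands for the temporal feature extractor $f_r$.

```python
import torch
import torch.nn.functional as F

def video_text_logits(frames, text_tokens, clip_model, retnet):
    """Sketch of Equation (8): LN(f_r(f_v(v))) x LN(f_t(t))^T.
    frames: [B, T, 3, 224, 224]; retnet maps [B, T, D] -> [B, T, D]."""
    B, T = frames.shape[:2]
    with torch.no_grad():                                            # frozen CLIP backbone
        frame_feats = clip_model.encode_image(frames.flatten(0, 1))  # f_v per frame, [B*T, D]
        text_feats = clip_model.encode_text(text_tokens)             # f_t, [C, D]
    frame_feats = frame_feats.view(B, T, -1).float()
    video_feats = retnet(frame_feats).mean(dim=1)                    # f_r + temporal pooling, [B, D]
    v = F.layer_norm(video_feats, video_feats.shape[-1:])
    t = F.layer_norm(text_feats.float(), text_feats.shape[-1:])
    return v @ t.t()                                                 # logits for CEL(logits, Y), Eq. (9)
```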

3.4. Video Adapter Tuning

The CLIP model’s effectiveness in image–text bimodal contrastive learning is well established. To adapt CLIP for video recognition tasks, we have evolved its bidimensional architecture into a tridimensional framework. This expansion integrates the intrinsic image features with the dynamic temporal elements present across video frames, introducing a new challenge: the alignment of video features with textual data. Our approach, RCAT, introduces an innovative tuning methodology, named the Video Adapter. This technique meticulously recalibrates the pre-trained CLIP model, enhancing its compatibility with video recognition tasks by fine-tuning the alignment between video and textual features. Consequently, we evolve from Equations (8) and (9) to Equations (10) and (11):
$$\text{logits} = \mathrm{LN}\big(g(f_r(f_v(v)))\big) \times \mathrm{LN}\big(f_t(t)\big)^{T} \tag{10}$$
$$\text{loss} = \mathrm{CEL}(\text{logits}, Y) \tag{11}$$
where the function $g$ represents the video adapter. Upon integrating the video adapter, the model fine-tunes video features to achieve better alignment with the corresponding textual features. This refined synchronization significantly augments the model's capability in cross-modal analysis, bridging the semantic gap between visual and textual data. Specifically, the video adapter comprises two linear layers and the non-linear activation function GELU [31]. The linear layers adjust the alignment of video and textual features linearly, while the GELU activation function enhances the expressive capacity of the video adapter. To preserve the original video features and prevent forgetting, a residual connection is incorporated, with the residual ratio controlled by the coefficient $\alpha$. This process can be formalized as Equations (12) and (13):
$$VA_v(x) = \mathrm{GELU}\big(x^{T} W_1^{v}\big) W_2^{v} \tag{12}$$
$$x^{*} = \alpha \, VA_v(x) + (1 - \alpha)\, x \tag{13}$$
where $VA_v$ denotes the video feature adapter, $x$ the original video feature, $W_1^{v}$ and $W_2^{v}$ the weight matrices of the two linear layers, and $x^{*}$ the final output video feature.
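Equations (12) and (13) translate directly into a small module; the sketch below assumes a bottleneck hidden width and includes the bias terms of the linear layers, neither of which is specified in the text.

```python
import torch
import torch.nn as nn

class VideoAdapter(nn.Module):
    """Sketch of Equations (12)-(13): two linear layers, GELU, and a residual
    mix controlled by alpha (the best value was 0.2 in the Table 7 sweep)."""
    def __init__(self, dim, hidden_dim=None, alpha=0.2):
        super().__init__()
        hidden_dim = hidden_dim or dim // 4     # bottleneck width: an illustrative choice
        self.fc1 = nn.Linear(dim, hidden_dim)   # W_1^v
        self.fc2 = nn.Linear(hidden_dim, dim)   # W_2^v
        self.act = nn.GELU()
        self.alpha = alpha                      # residual ratio α

    def forward(self, x):
        adapted = self.fc2(self.act(self.fc1(x)))             # VA_v(x) = GELU(x W_1^v) W_2^v
        return self.alpha * adapted + (1 - self.alpha) * x    # x* = α VA_v(x) + (1 − α) x
```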

4. Experiment

To thoroughly evaluate the RCAT framework, we conducted comprehensive comparative and ablation studies, emphasizing key metrics like computational efficiency, inference speed, and model complexity. These evaluations were carried out on prominent video classification datasets: UCF101 [11] and HMDB51 [12]. Moreover, to illustrate the model's robust generalization ability, we performed video retrieval experiments on the MSR-VTT [13] dataset, where our framework achieved state-of-the-art performance.

4.1. Dataset

UCF101 [11] is an action recognition dataset comprising real-life action videos sourced from YouTube. It offers 13,320 videos from 101 action categories, totaling approximately 27 h. These actions are broadly categorized into five types: human–object interaction, purely body motions, human–human interaction, playing musical instruments, and sports. Each category is further divided into 25 groups, with each group containing 4 to 7 short videos, each around 6 s in duration.
The HMDB51 [12] dataset, drawn from sources like YouTube and Google Video, consists of 6849 videos across 51 action categories. These actions primarily include facial movements, facial and object manipulation, general body movements, interactive motions, and locomotion actions. Each action category comprises at least 101 videos, with a resolution of 320 × 240 and an average duration of around 5 s per video.
The MSR-VTT [13] dataset, originating from the 2016 ACM Challenge, encompasses 10,000 video clips, each approximately 15 s long, cumulatively amounting to 41.2 h and 200,000 clip–sentence pairs. It covers a comprehensive range of categories and diverse visual content and represents the largest such benchmark in terms of sentences and vocabulary. Each video clip is annotated with about 20 natural-language sentences by 1327 AMT workers. The dataset is divided into 20 categories, including music, cooking, sports, and TV shows, among others.

4.2. Experimental Setup

Baselines and Metrics: In our comprehensive evaluation of RCAT's capabilities in video recognition, we established a baseline using the original CLIP + Transformer architecture. This architecture, renowned for its efficiency in processing large-scale image and text data, served as a solid foundation for benchmarking our RCAT model. By leveraging this baseline, we were able to rigorously assess the advancements that RCAT brings to video recognition. For a broader comparative analysis, we included several other established methods in our study. The results for these methods were taken directly from their respective original publications, ensuring a fair and consistent comparison. Our comparison included methods for video classification, such as ViViT-L/16 × 2 [24], Swin-B [27], X-CLIP-B/16 [7], ViFi-CLIP-L/14 [8], and Text4Vis-L/14 [29], and methods for video retrieval, such as DiffusionRet [32], TS2-Net [33], STAN [34], MuLTI [35], HunYuan_tvr [36], DMAE [37], CLIP-ViP [38], and others. This comparison was critical in positioning RCAT within the current landscape of video recognition technologies and methodologies. By comparing RCAT against these varied approaches, we aimed to highlight its unique features and performance capabilities.
Our evaluation metrics were carefully chosen to provide a thorough understanding of the RCAT’s performance. We utilized TOP-1 accuracy and TOP-5 accuracy as primary indicators of the model’s effectiveness in recognizing and categorizing video content correctly. Furthermore, we extended our evaluation to include an analysis of the model’s complexity and operational efficiency. This involved an examination of Floating Point Operations (FLOPs) and the total number of model parameters (Params). Such metrics are crucial in understanding the computational demands and scalability of the RCAT. In addition to these, we also assessed low-cost inference performance by considering two critical indicators: GPU memory usage and the training duration. These indicators are essential for evaluating the practical applicability of the RCAT in scenarios where computational resources are limited or when rapid deployment is required.
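For reference, the accuracy metrics and the parameter count reduce to a few lines of PyTorch; FLOPs, by contrast, are usually obtained with a profiling tool and are not reproduced here. The helper names below are our own.

```python
import torch

def topk_accuracy(logits, targets, ks=(1, 5)):
    """TOP-k accuracy from [N, C] logits and [N] integer labels."""
    maxk = max(ks)
    _, pred = logits.topk(maxk, dim=1)            # [N, maxk] predicted class ids
    hits = pred.eq(targets.unsqueeze(1))          # [N, maxk] hit matrix
    return {k: hits[:, :k].any(dim=1).float().mean().item() for k in ks}

def count_params_m(model):
    """Total number of parameters in millions (the 'Params (M)' column)."""
    return sum(p.numel() for p in model.parameters()) / 1e6
```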
Implementation Details: Expanding on the foundational principles of the CLIP model, our study developed two advanced variants, RCAT-B/16 and RCAT-L/14, each tailored to meet the intricate challenges inherent in video recognition tasks. Our experimental setup was carefully designed, taking into account critical variables such as the number of input video frames, the duration of training epochs, and the chosen batch size. Utilizing OpenCV technology, we achieved robust and efficient video processing, which included the comprehensive extraction of frames from videos, resizing each frame to a resolution of 224 × 224 pixels. In our approach, we chose a 224 × 224 pixel resolution for video frames to balance computational efficiency with the preservation of critical visual details. This resolution is a widely recognized standard in deep learning for its ability to minimize computational demands while maintaining essential visual information necessary for accurate recognition. To assess the effect of different frame counts on the model’s performance, we methodically sampled 8, 16, and 32 frames from the entire range of video frames. We configured the training to span 30 epochs, with batch sizes of either 8 or 16. Notably, during training, we observed that the model tended to converge by approximately the 25th epoch. All experiments were conducted on a system equipped with four GeForce RTX 4090 GPUs. The core algorithm of the Retentive CLIP Adapter Tuning (RCAT) model is detailed in Algorithm 1, which outlines the essential processes including bimodal feature extraction, feature fusion, and the computation of the loss function. This algorithm efficiently integrates visual and textual information, leveraging the strengths of both modalities to enhance learning and representation capabilities.
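Before turning to Algorithm 1, the frame preparation described above can be sketched with OpenCV as follows; the uniform index sampling and the BGR-to-RGB conversion are assumptions about details the text does not spell out.

```python
import cv2
import numpy as np

def sample_frames(video_path, num_frames=16, size=224):
    """Uniformly sample num_frames frames and resize each to size x size (RGB)."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))         # seek to the sampled frame
        ok, frame = cap.read()
        if not ok or frame is None:                        # fall back to a black frame on decode failure
            frame = np.zeros((size, size, 3), dtype=np.uint8)
        frame = cv2.cvtColor(cv2.resize(frame, (size, size)), cv2.COLOR_BGR2RGB)
        frames.append(frame)
    cap.release()
    return np.stack(frames)                                # [num_frames, size, size, 3], uint8
```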
Algorithm 1 Retentive CLIP Adapter Tuning (RCAT)
Inputs:
    V[b × d, h, w, c] — a minibatch of aligned videos.
    T[b, l] — a minibatch of aligned texts.
Outputs:
    Optimized ImageEncoder, RetNet, and VideoAdapter models.
Procedure:
    1. Initialize ImageEncoder, TextEncoder, RetNet, and VideoAdapter with appropriate dimensions and parameters.
    2. While not converged do:
        a. Pass the video frames through the ImageEncoder to obtain flat features V_f.
        b. Reshape and adapt the video features V_f to match the embedding dimensions using RetNet and the VideoAdapter, resulting in V_f_rd.
        c. Encode the texts using the TextEncoder to obtain text features T_f.
        d. Normalize the embeddings V_f_rd and T_f, and project them to a common space.
        e. Compute scaled pairwise cosine similarities between the video and text embeddings.
        f. Calculate the symmetric loss function based on the similarities.
        g. Zero the gradients, perform the backward pass, and update the parameters using the optimizer.
    3. Return the optimized ImageEncoder, RetNet, and VideoAdapter models.
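A hedged PyTorch rendering of one optimization step in Algorithm 1 (steps a–g) might look as follows; the module names, the mean pooling over frames, and the fixed logit scale are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def train_step(videos, texts, clip_model, retnet, adapter, optimizer, logit_scale=100.0):
    """One RCAT step: videos [B, T, 3, H, W]; texts are tokenized aligned prompts [B, 77];
    retnet and adapter are the trainable modules, while the CLIP encoders stay frozen."""
    B, T = videos.shape[:2]
    with torch.no_grad():                                              # a, c: frozen CLIP encoders
        V_f = clip_model.encode_image(videos.flatten(0, 1)).float()    # [B*T, D]
        T_f = clip_model.encode_text(texts).float()                    # [B, D]
    V_f_rd = adapter(retnet(V_f.view(B, T, -1)).mean(dim=1))           # b: temporal model + adapter
    V_f_rd = F.normalize(V_f_rd, dim=-1)                               # d: common embedding space
    T_f = F.normalize(T_f, dim=-1)
    logits = logit_scale * V_f_rd @ T_f.t()                            # e: scaled cosine similarities
    labels = torch.arange(B, device=logits.device)                     # matched pairs on the diagonal
    loss = (F.cross_entropy(logits, labels) +                          # f: symmetric loss
            F.cross_entropy(logits.t(), labels)) / 2
    optimizer.zero_grad()                                              # g: zero grads, backward, update
    loss.backward()
    optimizer.step()
    return loss.item()
```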

4.3. Experimental Results

Video Classification: Under identical conditions with an input of 16 frames, each at a resolution of 224 × 224 pixels, the RCAT model achieves 97.3% TOP-1 accuracy and 99.8% TOP-5 accuracy, surpassing the Text4Vis model by 1.4% and 0.2%, respectively. Compared with the Text4Vis-L/14 model, RCAT improved by 1.5% and 1.8%. Notably, RCAT-B/16 demonstrated the lowest computational complexity, with only 275 G FLOPs and 83.2 M parameters. Furthermore, in contrast to alternative approaches that leverage additional training data from Kinetics-400 or Kinetics-600, RCAT achieved better results without the need for such supplementary datasets. Comparative testing, using the Text4Vis model as a benchmark and without the inclusion of extra training data, yielded a TOP-1 accuracy of merely 95.6%, as shown in Table 1.
Similarly, on the HMDB51 dataset, RCAT achieves 81.5% TOP-1 accuracy and 96.5% TOP-5 accuracy, improving by 2.6% and 1.8% over Text4Vis without extensive pre-training. Compared with the ViFi-CLIP-L/14 model, RCAT improved by 4.4% and 5.3%. Notably, RCAT also exceeded the performance of the Text4Vis model, which incorporates external data for pre-training, by 0.2%, as outlined in Table 2.
Figure 4a illustrates the training dynamics of the RCAT model on the UCF101 dataset, delineating the trajectories of mean loss and peak accuracy. It is observed that the model achieves an optimal accuracy of 97.3% at epoch 25, subsequent to which the loss value exhibits minor fluctuations around 0.04. Figure 4b illustrates the training dynamics of the RCAT model on the HMDB51 dataset. Notably, the model achieved its peak accuracy of 81.5% at epoch 26, with the loss fluctuating around 0.04. This indicates the stabilization in the model’s learning process, underscoring the efficacy of the RCAT framework in harnessing the intricacies of video data for robust feature extraction and classification performance.
Video Retrieval: To verify the generalization ability of the RCAT model, we conducted video retrieval experiments on the MSR-VTT dataset. The RCAT model demonstrated remarkable efficacy, achieving a TOP-1 accuracy of 58.8% and a TOP-5 accuracy of 82.3%. These figures represent improvements of 1.1% and 1.8% in TOP-1 and TOP-5 accuracy, respectively, over the CLIP-ViP [38] model. As highlighted in Table 3, RCAT outperformed its counterparts while achieving this performance without reliance on supplementary training data.
Figure 5 delineates the training trajectory of the RCAT model on the MSR-VTT dataset, where a pinnacle accuracy of 58.8% was attained at epoch 25, accompanied by a loss variance of around 0.05. This performance underscores the RCAT model’s robust generalization capability.

4.4. Ablation Studies

Video Adapter: The impact of the video adapter on video classification tasks was validated using two datasets, UCF101 and HMDB51. During the experiments, the original model parameters of CLIP were frozen, and only the parameters of the video adapter were trained to assess its effectiveness. The experimental results, as presented in Table 4 and Table 5, demonstrate a significant enhancement in model performance attributable to the video adapter.
Specifically, for the UCF101 dataset, we observed that the CLIP-VideoAdapter-B/16 model achieved remarkable accuracy rates of 96.0% for TOP-1 and 98.3% for TOP-5, marking enhancements of 1.7% and 0.1%, respectively, over its predecessors. Further elevating the performance benchmark, the CLIP-VideoAdapter-L/14 variant exhibited superior accuracy, attaining 96.9% in TOP-1 and an impressive 99.5% in TOP-5 accuracy, reflecting gains of 1.3% and 0.1%, correspondingly. Transitioning to the HMDB51 dataset, the enhancements afforded by our video adapter integration became even more pronounced. The CLIP-VideoAdapter-B/16 configuration demonstrated significant improvements, securing 79.8% TOP-1 accuracy and 94.3% TOP-5 accuracy, which represent increments of 2.0% and 1.2%, respectively. The advanced CLIP-VideoAdapter-L/14 model further showcased its efficacy, achieving 81.1% TOP-1 accuracy and 96.2% TOP-5 accuracy, surpassing baseline measures with improvements of 2.2% and 0.6%. These findings underscore the significant role of video adapters in bolstering model performance within video recognition tasks, highlighting their effectiveness in enhancing the accuracy of video content analysis.
RetNet: We evaluated the performance and inference cost differences between Transformer and RetNet. Using the UCF101 dataset as an example, training was conducted on 4 GeForce RTX 4090 GPUs with a batch size of 16 for 30 epochs. The results, presented in Table 6, include scenarios both with and without the incorporation of the video adapter.
Under identical input configurations, incorporating the video adapter into RetNet resulted in a TOP-1 accuracy of 97.3%, while requiring a modest 17.9 GB of GPU memory and completing training within 6.5 h. This configuration represents a marked improvement over the Transformer model, with an accuracy gain of 0.4%, a reduction in memory consumption of 1.7 GB (an 8.7% decrease), and a reduction in training time of 1.3 h (a 16.6% decrease). Even without the video adapter, RetNet surpassed the Transformer model, registering a TOP-1 accuracy of 96.8% while using 17.8 GB of GPU memory and requiring 6.3 h of training. This reflects a modest yet meaningful accuracy gain of 0.3%, together with a reduction in memory usage of 1.7 GB (8.7%) and a decrease in training time of 1.2 h (16.0%). These outcomes underscore RetNet's ability to reduce computational resource usage while sustaining high accuracy in video analysis, with or without the video adapter, and its potential as a resource-efficient alternative for video content analysis.
Residual Ratio α: We investigated the impact of varying the residual ratio on bimodal feature alignment. The values ranged from 0.1 to 0.9 in increments of 0.1. Experiments were conducted using the RCAT-L/14 model on the UCF101 dataset, with results detailed in Table 7. The model's performance peaked at a residual ratio of 0.2, where RCAT attains 97.3% TOP-1 accuracy and 99.8% TOP-5 accuracy on UCF101. Furthermore, the model performs best when the parameter α lies between 0.1 and 0.5, indicating that smaller values of α are more conducive to video recognition tasks. This insight underscores the importance of parameter optimization in enhancing model efficacy for complex video analyses.

5. Conclusions

In this work, we presented the RCAT framework, a novel advancement in video recognition that significantly enhances the original CLIP model by extending it into a three-dimensional analysis approach. By incorporating the RetNet architecture for temporal feature extraction, we adeptly transitioned the CLIP model from static image analysis to dynamic video understanding, thus paving the way for more sophisticated video recognition capabilities. The introduction of a video adapter further refines the synergy between video and textual features, leading to a marked improvement in the model’s performance across various datasets. Our extensive evaluations, including comparative and ablation studies, underscore the RCAT framework’s superior ability to capture and generalize complex video and text features, setting new benchmarks for accuracy and efficiency in video recognition tasks. In future work, we aim to further refine and extend the RCAT framework’s capabilities, focusing on enhancing the model’s efficiency and adaptability across a variety of video recognition contexts. By exploring advancements in feature extraction and fusion techniques, we intend to incrementally improve the model’s accuracy and robustness, particularly in handling complex video datasets with nuanced semantic content.

Author Contributions

Z.X.: Investigation, data collection, processing, methodology, experimental analysis, and writing the first draft. M.X.: methodology, experimental suggestions, and revised first draft. S.Z.: supervision, writing review, and editing. L.Z.: supervision, writing review, and editing. All authors have read and approved the published version of the manuscript.

Funding

This work was funded in part by the National Natural Science Foundation of China under Grant 62177034.

Data Availability Statement

The datasets employed in this research are publicly accessible and can be retrieved from the following URLs: MSR-VTT at https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip, UCF101 at https://www.crcv.ucf.edu/data/UCF101.php, and HMDB51 at https://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database, all accessed on 2 February 2024.

Conflicts of Interest

The research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  2. Sun, Y.; Dong, L.; Huang, S.; Ma, S.; Xia, Y.; Xue, J.; Wang, J.; Wei, F. Retentive network: A successor to transformer for large language models. arXiv 2023, arXiv:2307.08621. [Google Scholar]
  3. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  4. Karasawa, H.; Liu, C.L.; Ohwada, H. Deep 3d convolutional neural network architectures for alzheimer’s disease diagnosis. In Proceedings of the Intelligent Information and Database Systems: 10th Asian Conference, ACIIDS 2018, Dong Hoi City, Vietnam, 19–21 March 2018; Proceedings, Part I 10. Springer International Publishing: Berlin, Germany, 2018; pp. 287–296. [Google Scholar]
  5. Li, K.; Wang, Y.; Zhang, J.; Gao, P.; Song, G.; Liu, Y.; Li, H.; Qiao, Y. Uniformer: Unifying convolution and self-attention for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 12581–12600. [Google Scholar] [CrossRef]
  6. Li, K.; Wang, Y.; He, Y.; Li, Y.; Wang, Y.; Wang, L.; Qiao, Y. Uniformerv2: Unlocking the potential of image vits for video understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 1632–1643. [Google Scholar]
  7. Ni, B.; Peng, H.; Chen, M.; Zhang, S.; Meng, G.; Fu, J.; Ling, H. Expanding language-image pretrained models for general video recognition. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2022; pp. 1–18. [Google Scholar]
  8. Rasheed, H.; Khattak, M.U.; Maaz, M.; Khan, S.; Khan, F.S. Fine-tuned clip models are efficient video learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Oxford, UK, 15–17 September 2023; pp. 6545–6554. [Google Scholar]
  9. Zhou, K.; Yang, J.; Loy, C.C.; Liu, Z. Learning to prompt for vision-language models. Int. J. Comput. Vis. 2022, 2337–2348. [Google Scholar] [CrossRef]
  10. Gao, P.; Geng, S.; Zhang, R.; Ma, T.; Fang, R.; Zhang, Y.; Li, H.; Qiao, Y. Clip-adapter: Better vision-language models with feature adapters. Int. J. Comput. Vis. 2024, 581–595. [Google Scholar] [CrossRef]
  11. Safaei, M.; Balouchian, P.; Foroosh, H. UCF-STAR: A large scale still image dataset for understanding human actions. Proc. Aaai Conf. Artif. Intell. 2020, 34, 2677–2684. [Google Scholar] [CrossRef]
  12. Kuehne, H.; Jhuang, H.; Garrote, E.; Poggio, T.; Serre, T. HMDB: A large video database for human motion recognition. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2556–2563. [Google Scholar]
  13. Xu, J.; Mei, T.; Yao, T.; Rui, Y. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 5288–5296. [Google Scholar]
  14. Su, W.; Zhu, X.; Cao, Y.; Li, B.; Lu, L.; Wei, F.; Dai, J. VL-BERT: Pre-training of Generic Visual-Linguistic Representations. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020; pp. 1–16. [Google Scholar]
  15. Yu, J.; Wang, Z.; Vasudevan, V.; Yeung, L.; Seyedhosseini, M.; Wu, Y. CoCa: Contrastive Captioners are Image-Text Foundation Models. arXiv 2022, arXiv:2205.01917. [Google Scholar]
  16. Xu, H.; Ghosh, G.; Huang, P.-Y.; Okhonko, D.; Aghajanyan, A.; Metze, F.; Zettlemoyer, L.; Feichtenhofer, C. VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding. arXiv 2021, arXiv:2109.14084. [Google Scholar]
  17. Mu, N.; Kirillov, A.; Wagner, D.; Xie, S. Slip: Self-supervision meets language-image pre-training. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2022; pp. 529–544. [Google Scholar]
  18. Li, J.; Li, D.; Xiong, C.; Hoi, S. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 12888–12900. [Google Scholar]
  19. Li, J.; Li, D.; Savarese, S.; Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv 2023, arXiv:2301.12597. [Google Scholar]
  20. Zhang, R.; Zhang, W.; Fang, R.; Gao, P.; Li, K.; Dai, J.; Qiao, Y.; Li, H. Tip-adapter: Training-free adaption of clip for few-shot classification. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 24–28 October 2022; pp. 493–510. [Google Scholar]
  21. Qian, R.; Meng, T.; Gong, B.; Yang, M.; Wang, H.; Belongie, S.; Cui, Y. Spatiotemporal contrastive video representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 6964–6974. [Google Scholar]
  22. Hara, K.; Kataoka, H.; Satoh, Y. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6546–6555. [Google Scholar]
  23. Li, Y.; Wu, C.Y.; Fan, H.; Mangalam, K.; Xiong, B.; Malik, J.; Feichtenhofer, C. Mvitv2: Improved multiscale vision transformers for classification and detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 4804–4814. [Google Scholar]
  24. Arnab, A.; Dehghani, M.; Heigold, G.; Sun, C.; Lučić, M.; Schmid, C. Vivit: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 6836–6846. [Google Scholar]
  25. Lin, W.; Mirza, M.J.; Kozinski, M.; Possegger, H.; Kuehne, H.; Bischof, H. Video Test-Time Adaptation for Action Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Oxford, UK, 15–17 September 2023; pp. 22952–22961. [Google Scholar]
  26. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929. [Google Scholar]
  27. Liu, Z.; Ning, J.; Cao, Y.; Wei, Y.; Zhang, Z.; Lin, S.; Hu, H. Video swin transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 3202–3211. [Google Scholar]
  28. Liu, Z.; Hu, H.; Lin, Y.; Yao, Z.; Xie, Z.; Wei, Y.; Ning, J.; Cao, Y.; Zhang, Z.; Dong, L.; et al. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 12009–12019. [Google Scholar]
  29. Wu, W.; Sun, Z.; Ouyang, W. Revisiting classifier: Transferring vision-language models for video recognition. Proc. Aaai Conf. Artif. Intell. 2023, 37, 2847–2855. [Google Scholar] [CrossRef]
  30. Zhang, R.; Hu, X.; Li, B.; Huang, S.; Deng, H.; Qiao, Y.; Gao, P.; Li, H. Prompt, generate, then cache: Cascade of foundation models makes strong few-shot learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Oxford, UK, 15–17 September 2023; pp. 15211–15222. [Google Scholar]
  31. Hendrycks, D.; Gimpel, K. Gaussian error linear units (gelus). arXiv 2016, arXiv:1606.08415. [Google Scholar]
  32. Jin, P.; Li, H.; Cheng, Z.; Li, K.; Ji, X.; Liu, C.; Yuan, L.; Chen, J. DiffusionRet: Generative Text-Video Retrieval with Diffusion Model. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 2470–2481. [Google Scholar]
  33. Liu, Y.; Xiong, P.; Xu, L.; Cao, S.; Jin, Q. Ts2-net: Token shift and selection transformer for text-video retrieval. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2022; pp. 319–335. [Google Scholar]
  34. Liu, R.; Huang, J.; Li, G.; Feng, J.; Wu, X.; Li, T.H. Revisiting temporal modeling for clip-based image-to-video knowledge transferring. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Oxford, UK, 15–17 September 2023; pp. 6555–6564. [Google Scholar]
  35. Xu, J.; Liu, B.; Chen, Y.; Shi, X. MuLTI: Efficient Video-and-Language Understanding with MultiWay-Sampler and Multiple Choice Modeling. arXiv 2023, arXiv:2303.05707. [Google Scholar]
  36. Jiang, J.; Min, S.; Kong, W.; Wang, H.; Li, Z.; Liu, W. Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations. IEEE Access 2022. [Google Scholar] [CrossRef]
  37. Jiang, C.; Liu, H.; Yu, X.; Wang, Q.; Cheng, Y.; Xu, J.; Liu, Z.; Guo, Q.; Chu, W.; Yang, M.; et al. Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 4626–4636. [Google Scholar]
  38. Xue, H.; Sun, Y.; Liu, B.; Fu, J.; Song, R.; Li, H.; Luo, J. Clip-ViP: Adapting Pre-trained Image-text Model to Video-Language alignment. arXiv 2023, arXiv:2209.06430. [Google Scholar]
  39. Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P.; et al. The kinetics human action video dataset. arXiv 2017, arXiv:1705.06950. [Google Scholar]
  40. Carreira, J.; Noland, E.; Banki-Horvath, A.; Hillier, C.; Zisserman, A. A short note about kinetics-600. arXiv 2018, arXiv:1808.01340. [Google Scholar]
Figure 1. The overall architecture of our proposed RCAT Framework. ‘CLS’ denotes the video label, and ‘GELU’ [31] represents the non-linear activation function employed.
Figure 2. The overall architecture of the CLIP model.
Figure 3. Dual form of RetNet, including parallel representation and recurrent representation.
Figure 4. Epoch-wise training loss and TOP-1 accuracy of RCAT model on UCF101 and HMDB51 datasets.
Figure 5. Epoch-wise training loss and TOP-1 accuracy of RCAT model on the MSR-VTT dataset.
Table 1. Performance comparison between RCAT and other alternative methods on the UCF101 dataset.

Method | Pretrain Dataset | Input | TOP-1 (%) | TOP-5 (%) | FLOPs (G) | Params (M)
ViViT-L/16 × 2 [24] | Kinetics-400 [39] | 16 × 224 × 224 | 85.3 | 90.1 | 3992 | 310.8
Swin-B [27] | Kinetics-400 [39] | 16 × 224 × 224 | 88.7 | 92.4 | 282 | 88.1
X-CLIP-B/16 [7] | Kinetics-400 [39] & 600 [40] | 16 × 224 × 224 | 94.8 | 97.6 | 287 | 131.5
ViFi-CLIP-L/14 [8] | Kinetics-400 [39] | 16 × 224 × 224 | 95.9 | 98.5 | 281 | 124.7
Text4Vis-L/14 [29] | Kinetics-400 [39] | 16 × 224 × 224 | 98.1 (95.6) ¹ | 99.6 | 1661 | 430.7
RCAT-B/16 (Ours) | - | 16 × 224 × 224 | 96.0 | 98.7 | 275 | 83.2
RCAT-L/14 (Ours) | - | 16 × 224 × 224 | 97.3 | 99.8 | 431 | 182.3
¹ (95.6) represents the result without introducing external data.
Table 2. Performance comparison between RCAT and other alternative methods on the HMDB51 dataset.

Method | Pretrain Dataset | Input | TOP-1 (%) | TOP-5 (%)
ViViT-L/16 × 2 [24] | Kinetics-400 [39] | 16 × 224 × 224 | 54.3 | 70.1
Swin-B [27] | Kinetics-400 [39] | 16 × 224 × 224 | 56.1 | 76.6
X-CLIP-B/16 [7] | Kinetics-400 [39] & 600 [40] | 16 × 224 × 224 | 64.2 | 80.3
ViFi-CLIP-L/14 [8] | Kinetics-400 [39] | 16 × 224 × 224 | 77.1 | 91.2
Text4Vis-L/14 [29] | Kinetics-400 [39] | 16 × 224 × 224 | 81.3 (78.9) ¹ | 95.6
RCAT-B/16 (Ours) | - | 16 × 224 × 224 | 80.2 | 94.7
RCAT-L/14 (Ours) | - | 16 × 224 × 224 | 81.5 | 96.5
¹ (78.9) represents the result without introducing external data.
Table 3. Performance comparison between RCAT and other alternative methods on the MSR-VTT dataset.

Method | Extra Training Data | TOP-1 (%) | TOP-5 (%)
DiffusionRet [32] | × ¹ | 49.0 | 75.2
TS2-Net [33] | × | 54.0 | 79.3
STAN [34] | √ ² | 54.1 | 79.5
MuLTI [35] | × | 54.7 | 77.7
HunYuan_tvr [36] | √ | 55.0 | 78.4
DMAE [37] | × | 55.5 | 79.4
CLIP-ViP [38] | √ | 57.7 | 80.5
RCAT-L/14 (Ours) | × | 58.8 | 82.3
¹ × represents no additional training data. ² √ represents additional training data.
Table 4. Ablation study results: assessing the role of video adapters in the RCAT on the UCF101 dataset.

Method | Input | TOP-1 (%) | TOP-5 (%)
CLIP-B/16 [1] | 16 × 224 × 224 | 94.3 | 98.2
CLIP-L/14 [1] | 16 × 224 × 224 | 95.6 | 99.4
CLIP-VideoAdapter-B/16 | 16 × 224 × 224 | 96.0 | 98.3
CLIP-VideoAdapter-L/14 | 16 × 224 × 224 | 96.9 | 99.5
Table 5. Ablation study results: assessing the role of video adapters in the RCAT on the HMDB51 dataset.

Method | Input | TOP-1 (%) | TOP-5 (%)
CLIP-B/16 [1] | 16 × 224 × 224 | 77.8 | 93.1
CLIP-L/14 [1] | 16 × 224 × 224 | 78.9 | 95.6
CLIP-VideoAdapter-B/16 | 16 × 224 × 224 | 79.8 | 94.3
CLIP-VideoAdapter-L/14 | 16 × 224 × 224 | 81.1 | 96.2
Table 6. Ablation study results: assessing the role of RetNet in the RCAT on the UCF101 dataset.

Method | Video Adapter | Input | TOP-1 (%) | GPU Memory (GB) | Training Time (h)
Transformer [3] | √ | 16 × 224 × 224 | 96.9 | 19.6 | 7.8
RetNet [2] | √ | 16 × 224 × 224 | 97.3 | 17.9 | 6.5
Transformer [3] | × | 16 × 224 × 224 | 96.5 | 19.5 | 7.5
RetNet [2] | × | 16 × 224 × 224 | 96.8 | 17.8 | 6.3
Table 7. Results of parameter analysis: impact of various residual ratios on the classification performance of RCAT with the UCF101 dataset.

Method | Input | α | TOP-1 (%) | TOP-5 (%)
RCAT-L/14 | 16 × 224 × 224 | 0.1 | 96.8 | 99.4
 | 16 × 224 × 224 | 0.2 | 97.3 | 99.8
 | 16 × 224 × 224 | 0.3 | 96.7 | 99.4
 | 16 × 224 × 224 | 0.4 | 96.5 | 99.3
 | 16 × 224 × 224 | 0.5 | 96.2 | 99.1
 | 16 × 224 × 224 | 0.6 | 95.8 | 98.8
 | 16 × 224 × 224 | 0.7 | 95.1 | 98.2
 | 16 × 224 × 224 | 0.8 | 94.7 | 97.3
 | 16 × 224 × 224 | 0.9 | 93.3 | 95.9
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
