Article

RMFNet: Redetection Multimodal Fusion Network for RGBT Tracking

1 College of Information Science and Engineering, Xinjiang University, Urumqi 830017, China
2 Key Laboratory of Signal Detection and Processing, Xinjiang University, Urumqi 830017, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(9), 5793; https://doi.org/10.3390/app13095793
Submission received: 29 March 2023 / Revised: 23 April 2023 / Accepted: 4 May 2023 / Published: 8 May 2023

Abstract

The development of single-modality target tracking based on visible light has been limited in recent years because visible light images are highly susceptible to environmental and lighting influences. Thermal infrared images can compensate well for this defect, so RGBT tracking has attracted increasing attention. However, existing studies are limited to aggregating multimodal information through feature fusion, ignoring the role of decision-level fusion in tracking, and the original re-detection algorithm in the baseline model used is prone to the accumulation of failures. To deal with these problems, we propose the Redetection Multimodal Fusion Network (RMFNet). The network is divided into three branches: the visible light branch, the thermal infrared branch, and the fusion branch. This three-branch structure can fully exploit the complementary advantages of multimodal information as well as the commonalities and specific characteristics of the two modalities. We propose a multimodal feature fusion module (EFM), which can adaptively estimate the reliability of each modality and perform a weighted fusion of the two modalities' features. The existing re-detection algorithm is improved by adding a global-search re-detection step in the current frame to reduce the accumulation of failures. We have conducted extensive comparative validation on two widely used benchmark datasets, GTOT and RGBT234. The experimental results show that RMFNet outperforms other tracking methods.

1. Introduction

Object tracking is a computer vision task that, given the annotation of a target in the first frame of a video sequence, continuously estimates the target's position in each subsequent frame to recover its motion trajectory. Video tracking technology draws on theories from many fields and has great practical value in areas such as security monitoring, human–computer interaction, and intelligent life. While single-modality target tracking technology has made significant progress, it often performs poorly in complex scenes, such as those involving background clutter, occluded targets, low light, rain, and snow. In contrast, thermal infrared cameras, which are based on thermal radiation imaging, can effectively compensate for the shortcomings of visible light imaging and are better suited to dealing with such challenges. Furthermore, visible light images can provide the detailed texture and color information that may be lost in thermal infrared images. Therefore, the two modalities have their respective strengths and weaknesses and exhibit strong complementarity. Combining the information from both modalities can provide more comprehensive target information and improve the accuracy and robustness of tracking. Hence, research on target tracking using both visible light and thermal infrared modalities has substantial significance.
The current RGBT tracking algorithms based on deep networks can be broadly classified into two types of research. The first type investigates how to use deep networks to adaptively fuse the features of the visible and thermal infrared modalities. For example, Zhu et al. [1] proposed a recursive strategy to densely aggregate hierarchical features of the network and reduced redundant features and noise through channel pruning. Xu et al. [2] performed bilinear pooling on any two layers using cross-products (a second-order computation that effectively aggregates deep semantic and shallow texture information of the target) and aggregated the bilinear pooling features of different modalities using a quality-aware network. Li et al. [3] proposed multi-scale adapters and multi-modality fusion modules to adaptively aggregate features of different scales and modalities. The second direction aims to explore the commonalities and characteristics of the two modalities and fully leverage their potential value. For example, Xiao et al. [4] proposed an enhanced transformer structure for feature fusion, which can further strengthen both the fused feature and the modality-specific features. Lu et al. [5] proposed a hierarchical structure with a parallel combination of general adapters and modality adapters and integrated hierarchical divergence losses, which enables single-stage joint learning of modality-shared and modality-specific features. Additionally, a dynamic fusion module was designed in the instance adapter, which can perform quality-aware fusion of data from different sources.
Although these tracking networks demonstrate reliable performance and great potential across various environments and challenges, two issues remain. Firstly, these networks only use feature-level fusion and, subsequently, only the fused feature for classification. However, in certain scenarios (e.g., extreme lighting or thermal crossover), when one modality fails or is of poor quality, the fused features are easily contaminated by low-quality information and are not always superior to single-modality features. Decision-level fusion can address this issue and, to some extent, alleviate the impact of image misalignment; unfortunately, its importance has been overlooked by the aforementioned networks. Secondly, these networks are based on the MDNet architecture, whose target re-detection algorithm suffers from a cumulative failure issue.
To overcome these two problems, we propose a network for RGBT tracking: a redetection multimodal fusion network that improves the re-detection algorithm and combines feature-level and decision-level fusion. We extract features using the same configuration as the first three convolutional layers of VGG-M [6], modified to expand the receptive field. To fully exploit the complementarity of multimodal information and explore the differences between the two modalities, we divide the network into three branches: the visible light branch, the thermal infrared branch, and the fusion branch. The two modality-specific branches each use three convolutional layers to extract features, with no weight sharing between the convolutional layers of the two modalities. A fusion branch is introduced in parallel between the two modalities, where the two features output by the 7 × 7 convolutions are, respectively, fed into the second 5 × 5 convolution and third 3 × 3 convolution of the fusion branch to extract information shared between the two modalities. The ECA [7] attention-based feature fusion module designed in this paper can adaptively compute the weights of the two modalities and perform a weighted fusion of multimodal features to obtain more robust fused features. Each of the three resulting features is passed through three fully connected classification layers to obtain a sample score, and the three scores are summed to obtain the final sample score. We further improve the re-detection mechanism of existing algorithms. Specifically, when the tracker's pre-location is unreliable, we directly expand the search area and perform a global target search to obtain a target re-location. We then determine whether the re-location is reliable: if so, we use it as the target location for the frame; if not, we use whichever of the pre-location and re-location has the higher score as the target position for this frame. Finally, the Alpha-Refine [8] bounding box refinement module is added to precisely adjust the coarse positioning of the tracker.
This paper offers several key contributions, summarised as follows:
  • We propose an RGBT tracking network, the Redetection Multimodal Fusion Network (RMFNet). It combines two types of fusion, medium-term feature-level fusion and late-term decision-level fusion, and is organized into three branches (visible, thermal infrared, and fusion), fully exploiting the complementarity and correlation of multimodal information to achieve robust RGBT tracking.
  • We have designed an ECA attention-based multimodal feature fusion module that adaptively computes two modalities’ reliability and performs a weighted fusion of multimodal features to obtain a more favourable feature representation.
  • We improve the re-detection algorithm of the base tracker by adding a re-detection step that performs a global target search at the current frame, which mitigates the accumulation of failures and can increase the robustness and precision of the tracking algorithm while decreasing the amount of computation and making the tracking algorithm more efficient.
  • Extensive experimental results on the GTOT [9] and RGBT234 [10] datasets show that RMFNet outperforms other advanced RGBT tracking methods and obtains good performance.

2. Related Work

2.1. RGBT Tracking

Siamese network and MDNet network architectures are the two mainstream frameworks for RGBT tracking algorithms, especially the former. Li et al. [11] suggested a fusion tracking method based on the MDNet [12] framework in 2018, and, subsequently, many RGBT trackers based on the MDNet framework appeared. Zhu et al. [1] proposed the dense feature aggregation and pruning network (DAPNet), which recursively and densely aggregates the hierarchical features of the network and applies channel pruning to reduce redundant features and remove noise. Li et al. [13] proposed a multi-adapter convolutional network (MANet) to fully exploit the potential value of modality-shared, modality-specific, and instance-aware information. Xu et al. [2] proposed the cross-layer bilinear pooling network (CBPNet), which aggregates features at different levels by performing bilinear pooling operations on any two layers through cross-products and uses a quality-aware network to fuse the bilinear pooled features of different modalities. Zhang et al. [14] proposed the attribute-driven representation network (ADRNet), whose core is adaptive attribute representation and adaptive modality fusion, learning to distinguish the target from the background based on the target's attribute representation.
The first Siamese framework applied to RGBT tracking was SiamFT [15], which takes SiamFC [16] as its baseline and uses a dual Siamese network (a visible network and an infrared network) that processes visible and infrared images separately to meet real-time requirements. DSiamMFT [17] further employs a dynamic online learning transformation strategy and multi-level semantic features on top of SiamFT. SiamIVFN [18] utilizes a complementary feature fusion network and a contribution aggregation network to handle not only the misalignment of image pairs but also the fusion of features according to the contributions of the two modalities, giving the network high accuracy and a speed of 147.6 FPS, making it the fastest deep learning-based RGBT tracker available. It is worth mentioning that algorithms based on the Siamese framework rely on a large amount of data to train the network. However, since existing RGBT datasets are not that large, current Siamese-framework algorithms usually resort to a large-scale RGB dataset and train the network with grayscale maps generated from the RGB images. Because the imaging principle of thermal infrared images differs from that of visible images, the generated grayscale maps lack reliability to some extent. In this paper, we conduct our study using the MDNet framework.

2.2. Multimodal Fusion Network

A suitable fusion strategy can take full advantage of multimodal information and reinforce the robustness and accuracy of the tracker. Existing multimodal fusion methods can be divided into three categories: pixel-level, feature-level, and decision-level fusion. Early approaches used pixel-level fusion, directly concatenating the data channels of the multiple modalities and feeding the fused data into a convolutional neural network (CNN) for feature extraction, as shown in Figure 1a. With the outstanding performance of feature aggregation in multiple domains, Li et al. [11] designed a feature fusion network that directly integrates RGB and thermal infrared deep features, as shown in Figure 1b, fusing the two modality features by element-wise addition. Yang et al. [19] designed a fusion network that adds the output classification results, as shown in Figure 1c. Li et al. [13], as shown in Figure 1d, separate the multimodal information into shared and modality-exclusive information extracted by different adapters, proposing a multi-adapter fusion network. Gao et al. [20], as shown in Figure 1e, adaptively fuse modality features with hierarchical features in the form of a recursive fusion chain, improving both speed and accuracy. However, the above algorithms only use the fused features in the subsequent processing stage, without considering that the fused features may become unreliable when one modality is noisy or fails. Figure 1f shows the RMFNet network proposed in this paper: the parallel fusion branch fully exploits the information shared between the two modalities, and the three-branch fusion network gives full play to both the commonalities and the specific characteristics of multimodal features.

2.3. Attention Mechanism

The attention mechanism is a powerful tool in computer vision that helps neural networks focus on the most critical information in the input. By enhancing the learning and expression of this information, it has been extensively used for a broad range of tasks such as object tracking, image segmentation, image restoration, etc., and has proven to improve network performance significantly. SENet [21] employs an adaptive channel feature recalibration module to retain valuable features and improve network performance. ECANet [7] models the dependency between channels by using a more lightweight and faster one-dimensional convolution instead of the fully connected layers in SENet. SKNet [22] proposes a convolutional kernel attention mechanism that allows the network to select appropriate convolutional kernels and fuse multiple branches with different kernel sizes. CANet [23] embeds the position information of features into a channel attention mechanism, and enables the network to aggregate features along the horizontal and vertical directions through pooling operations, thereby improving the expressiveness and accuracy of the target. Non-local [24] and PSANet [25] each model the spatial relationships between all pixels in the feature map and embed attention modules after each block in CNN. The attention mechanism is utilized in this paper to adaptively adjust the weight of features to enhance the performance of the tracker under different modalities.

3. Our Method

We first present the proposed redetection multimodal fusion network and then describe our fusion methodology, the feature fusion module, the improved re-detection algorithm and the final used bounding box refinement Alpha-Refine module.

3.1. RMFNet Overall Architecture

The RMFNet architecture is shown in Figure 2. To fully leverage the shared and specific information of the two modality images and exploit their complementary strengths for more accurate and dependable target tracking, RMFNet consists of two modality-specific branches and one fusion branch. A pair of visible and thermal images of arbitrary size is taken as input to the network. In the two modality-specific branches, the backbone comprises the first three convolutional layers of VGG-M [6], ensuring accuracy while avoiding the inefficiency caused by too many layers. The backbone network was modified to improve the representation quality of the region of interest (ROI) and obtain a larger receptive field. Specifically, the first layer has a convolutional kernel size of 7 × 7 × 96 and the second layer 5 × 5 × 256; both are followed by a ReLU activation function and local response normalisation (LRN), and the max pooling layer after the second layer in the original network is removed. The third layer has a convolutional kernel size of 3 × 3 × 512 and is a dilated convolution with r = 3.
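For concreteness, the following PyTorch sketch shows one way to assemble the modified three-layer backbone described above. The kernel sizes, the dilation rate r = 3, and the removal of the pooling layer after the second convolution follow the text; the strides, the LRN parameters, and keeping a pooling layer after the first convolution are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class ModalityBackbone(nn.Module):
    """Sketch of the modified VGG-M front-end used by each modality branch.

    Kernel sizes (7x7x96, 5x5x256, dilated 3x3x512 with r = 3), ReLU + LRN after
    the first two layers, and the removed pooling after conv2 follow the text;
    the strides, LRN parameters, and keeping a pooling layer after conv1 are
    illustrative assumptions.
    """
    def __init__(self, in_channels: int = 3):
        super().__init__()
        self.conv1 = nn.Sequential(
            nn.Conv2d(in_channels, 96, kernel_size=7, stride=2),
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5),
            nn.MaxPool2d(kernel_size=3, stride=2),  # assumed: VGG-M pooling kept after conv1
        )
        self.conv2 = nn.Sequential(
            nn.Conv2d(96, 256, kernel_size=5, stride=2),
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5),
            # the max pooling layer after conv2 in the original network is removed
        )
        self.conv3 = nn.Sequential(
            nn.Conv2d(256, 512, kernel_size=3, dilation=3),  # dilated convolution, r = 3
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv3(self.conv2(self.conv1(x)))
```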
In addition, to obtain the information shared between the two modalities, the fusion branch adopts second and third convolutional layers similar to those of the modality-specific branches. The two input modality images are first passed through the 7 × 7 convolutions to extract features, which are then separately fed into the second and third layers of the fusion branch to mine shared information; the shared weights help explore more information common to the two modalities. Furthermore, using the designed feature fusion module, the two features output by the fusion branch are adaptively weighted and fused to enhance the target's feature expression capability, ultimately achieving more accurate and precise object tracking.
Finally, each of the three obtained features is passed through three fully connected classification layers that learn instance-aware representations of the target, followed by a softmax layer that classifies the optimized features. The FC6 layer is a binary fully connected layer with K domains (one domain represents a video sequence). The three classification scores obtained from the three features are summed to obtain the final sample score, and the mean of the top five scoring candidate boxes gives the location of the predicted target.

3.2. The Fusion Strategy

Visible and thermal infrared images perceive and express target information in different ways and have different advantages under different conditions. Modality fusion can comprehensively utilize the benefits of both modalities, improve the precision and robustness of object tracking, enhance the tracker's resistance to environmental changes and noise, and improve tracking speed by reducing unnecessary computation. This paper employs two modality fusion methods: medium-term feature-level fusion and late-term decision-level fusion.
(1) Medium-Term Feature-Level Fusion
A feature-level fusion tracking algorithm first extracts features from the visible and thermal infrared images with convolutional neural networks, then fuses them according to designed fusion rules to obtain a fused two-modality feature, and finally tracks with this fused feature. As shown in Figure 2, our fusion branch aims to fully exploit the information shared between the two modalities. To better utilize this shared information, we perform a feature-level fusion operation on the two features output by the fusion branch. In this paper, we propose a multimodal feature fusion module based on ECA attention (EFM), which can adaptively compute the contribution of the two modality features to target tracking and obtain a weighted fused feature based on the relative reliability of the modalities. The details of the EFM module are illustrated in Figure 3. The EFM module reduces noise and redundant information during fusion, yields a more robust feature representation, and improves tracking accuracy.
Inspired by the attentional fusion module proposed in AFF [26], we suggest an improved feature fusion module based on ECA attention [7]. We use fast one-dimensional convolution to capture inter-channel interactions effectively, without reducing the number of channels, with fast computation and minimal memory occupation. Unlike the fusion strategy in AFF, we do not add the two features and then weight the resulting fused feature. Instead, we first compute the global representation of each of the two features separately, apply one-dimensional convolution to each of these global representations, and then concatenate the two along the channel dimension before computing the weights.
The input of the EFM module comprises the two features to be weighted and fused. Firstly, global average pooling is performed on the two features separately to extract their global information, denoted as $f_c^t \in \mathbb{R}$ and $f_c^r \in \mathbb{R}$, respectively. The formulas are as follows:
$f_c^t = \frac{1}{W \times H} \sum_{j=1}^{W} \sum_{k=1}^{H} X_t^c(j,k)$  (1)
$f_c^r = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{k=1}^{H} X_r^c(i,k)$  (2)
W and H denote the width and height of the input features. One-dimensional fast convolution is then applied to $f_c^t$ and $f_c^r$ with a kernel size of 3, which allows the network to capture interactions among 3 adjacent channels. The two global feature representations are then concatenated along the channel dimension, and the channel weights are calculated through a softmax operation. Finally, the two features are weighted and fused using the obtained weights, resulting in a fused feature more conducive to subsequent tracking.
$f_c^a = [\,F_{1D}^r(f_c^t),\; F_{1D}^t(f_c^r)\,]$  (3)
$w_t = \sigma\big(f_c^a[0,\, 0{:}d]\big)$  (4)
$w_r = \sigma\big(f_c^a[0,\, d{:}2d]\big)$  (5)
$X_f = X_r \times w_r + X_t \times w_t$  (6)
The channel-wise concatenation of the two input features is denoted as [·, ·], where $w_t$ and $w_r$ are the weights of the two input features. $X_f$, $X_r$, and $X_t$ represent the fused feature and the two original input features, respectively.
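To make the weighting step concrete, below is a minimal PyTorch sketch of the EFM module following Equations (1)–(6). Interpreting σ as a per-channel softmax over the two modalities, and using a separate one-dimensional convolution for each modality, are our assumptions based on the description above, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class EFM(nn.Module):
    """Sketch of the ECA-based feature fusion module (EFM), following Eqs. (1)-(6).

    Treating sigma as a per-channel softmax over the two modalities and using a
    separate 1-D convolution per modality are assumptions based on the text.
    """
    def __init__(self, k_size: int = 3):
        super().__init__()
        # fast 1-D convolutions capturing interactions among 3 adjacent channels
        self.conv_t = nn.Conv1d(1, 1, kernel_size=k_size, padding=k_size // 2, bias=False)
        self.conv_r = nn.Conv1d(1, 1, kernel_size=k_size, padding=k_size // 2, bias=False)

    def forward(self, x_r: torch.Tensor, x_t: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x_r.shape
        # Eqs. (1)-(2): global average pooling of each modality -> (B, C)
        f_t = x_t.mean(dim=(2, 3))
        f_r = x_r.mean(dim=(2, 3))
        # Eq. (3): 1-D convolution over the channel dimension
        g_t = self.conv_t(f_t.unsqueeze(1)).squeeze(1)
        g_r = self.conv_r(f_r.unsqueeze(1)).squeeze(1)
        # Eqs. (4)-(5): per-channel softmax over the two modalities gives the weights
        weights = torch.softmax(torch.stack([g_t, g_r], dim=1), dim=1)  # (B, 2, C)
        w_t = weights[:, 0].view(b, c, 1, 1)
        w_r = weights[:, 1].view(b, c, 1, 1)
        # Eq. (6): weighted fusion of the two input features
        return x_r * w_r + x_t * w_t
```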
(2) Late Decision-Level Fusion
In target tracking, decision-level multimodal fusion refers to fusing the tracking results of different modalities according to certain rules during tracking. This approach can effectively utilize the tracking results from different modalities, thereby improving tracking accuracy and robustness. However, most existing RGBT tracking algorithms [3,4,14,27] are based on the feature-level fusion framework, ignoring the importance of decision-level fusion in object tracking. In this paper, we adopt a late decision fusion strategy: the visible light features, thermal infrared features, and fused features are each used for classification tracking, yielding three tracking results based on different features. We then combine the three results by addition to obtain the final prediction score. The advantage lies in utilizing tracking results from three different features, fully leveraging both the modality-specific and shared information for target tracking. This approach also mitigates the impact of the two modality images not being strictly aligned, thereby improving the tracker's accuracy, robustness, and stability. The formula for late decision fusion is presented below:
$X = F_{fc6}^v\big(F_{fc5}^v(F_{fc4}^v(X_v))\big) + F_{fc6}^f\big(F_{fc5}^f(F_{fc4}^f(X_f))\big) + F_{fc6}^t\big(F_{fc5}^t(F_{fc4}^t(X_t))\big)$  (7)
The symbol $F_{fc}$ denotes a fully connected layer, while $X_f$, $X_v$, and $X_t$ represent the fused feature, visible light feature, and thermal infrared feature, respectively. The late fusion process is depicted in Equation (7), in which three fully connected layers are employed to learn feature representations for each feature type. The scores obtained from the three feature streams are then combined by element-wise addition to produce a two-dimensional score for target/background binary classification.
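As a concrete illustration of Equation (7), the sketch below applies an independent three-layer fully connected head to each of the visible, fused, and thermal features and sums the resulting two-dimensional scores; the hidden width and the flattened input dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

def make_head(in_dim: int, hidden: int = 512) -> nn.Sequential:
    """Hypothetical FC4-FC6 head; the hidden width is an illustrative choice."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(inplace=True),  # FC4
        nn.Linear(hidden, hidden), nn.ReLU(inplace=True),  # FC5
        nn.Linear(hidden, 2),                              # FC6: target vs. background
    )

class LateDecisionFusion(nn.Module):
    """Sketch of Eq. (7): three independent heads whose 2-D scores are summed."""
    def __init__(self, in_dim: int):
        super().__init__()
        self.head_v = make_head(in_dim)   # visible light feature head
        self.head_f = make_head(in_dim)   # fused feature head
        self.head_t = make_head(in_dim)   # thermal infrared feature head

    def forward(self, x_v: torch.Tensor, x_f: torch.Tensor, x_t: torch.Tensor) -> torch.Tensor:
        # element-wise addition of the three scores gives the final sample score
        return self.head_v(x_v) + self.head_f(x_f) + self.head_t(x_t)
```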

3.3. Improved Re-Detection Algorithm

Figure 4 shows the flow of the re-detection algorithm of the original MDNet [12]: when the pre-positioning is judged to be unreliable, it is still used as the target position of the current frame, and when tracking the next frame the candidate boxes are generated by expanding the search area around this predetermined position. However, deformation, rapid motion, occlusion of the target, illumination changes, or background interference during tracking can cause the tracker to fail and lose the target. When the mean of the top 5 candidate boxes' scores is less than 0, the pre-positioning is deemed unreliable, and the tracker is judged to have possibly drifted and lost the target. The re-detection process of the original MDNet algorithm therefore causes a continuous accumulation of errors, which we improve in this paper, as shown in Figure 5. When the pre-positioning is unreliable, a global search is performed by expanding the search area around the previous frame's target location, and the target in the current frame is repositioned. Repositioning can detect the target again and reinitialize the tracker, avoiding the continuous accumulation of errors and improving tracking reliability. Moreover, repositioning allows the tracker to recover promptly, preventing it from continuously tracking the wrong target or losing the target, reducing computation and time consumption and improving tracking efficiency. The reliability of the repositioning is then evaluated: when it is reliable, samples are collected around this position and the tracker is updated; when it is unreliable, the scores of the pre-positioning and the repositioning are compared, and the one with the higher score is selected as the target position of the current frame.
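The control flow of the improved re-detection step can be summarised in the following Python sketch; the helper names (predict, global_search, collect_and_update) are hypothetical placeholders for the corresponding operations described above, and a location is treated as reliable when the mean score of its top-5 candidates is not less than 0.

```python
def track_frame(tracker, frame, prev_box):
    """Sketch of the improved re-detection flow.

    `predict`, `global_search`, and `collect_and_update` are hypothetical
    placeholders for the corresponding steps; a location is considered reliable
    when the mean score of its top-5 candidates is not less than 0.
    """
    pre_box, pre_score = tracker.predict(frame, prev_box)      # pre-positioning

    if pre_score >= 0:                                          # reliable pre-positioning
        return pre_box

    # unreliable: expand the search area around the previous target location
    re_box, re_score = tracker.global_search(frame, prev_box)

    if re_score >= 0:                                           # reliable re-positioning
        tracker.collect_and_update(frame, re_box)               # re-initialize the tracker
        return re_box

    # neither is reliable: keep whichever location scored higher for this frame
    return re_box if re_score > pre_score else pre_box
```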

3.4. Alpha-Refine Module

Object tracking aims to accurately estimate the bounding box of a given target in consecutive video frames. This paper adopts a multi-stage strategy to improve the accuracy of the predicted bounding boxes. First, the redetection multimodal fusion network is used as the base tracker to provide a coarse bounding box of the target, and then the Alpha-Refine module is used to adjust it into the final target bounding box. The process is described in Figure 6. Alpha-Refine is a generic and accurate refinement module that significantly improves the quality of the bounding boxes generated by the base tracker. In the data preparation stage, we use the target in the first frame to initialize the reference template of the Alpha-Refine module. During tracking, the coarse bounding box output by the base tracker is expanded to twice its size and used as the search area of the Alpha-Refine module. The template and search region are matched by pixel-wise correlation to compute a high-quality response map, and a keypoint prediction head predicts the bounding box, extracting and maintaining detailed spatial information for more accurate bounding box estimation.
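The two-stage box estimation can be sketched as follows; the refiner interface is a hypothetical placeholder standing in for the Alpha-Refine module, and the (cx, cy, w, h) box format is an assumption made for illustration.

```python
def refine_box(frame, coarse_box, refiner):
    """Two-stage box estimation: the coarse box from the base tracker is expanded
    to twice its size and passed to a refinement module for the final box.
    `refiner.refine` is a hypothetical placeholder, not the actual Alpha-Refine API."""
    cx, cy, w, h = coarse_box                      # coarse localization (cx, cy, w, h)
    search_region = (cx, cy, 2.0 * w, 2.0 * h)     # search area: twice the coarse box
    return refiner.refine(frame, search_region)    # refined bounding box
```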

4. Implementation Details

4.1. Offline Training

This section outlines the offline training process of RMFNet. Firstly, the weights of the first three convolutional layers of the VGG-M [6] model trained on the ImageNet dataset are loaded and used to initialize the convolutional layers of our two modality-specific branches and the fusion branch. The entire RMFNet is then trained with the Adam optimizer, with the learning rate of the convolutional layers set to 0.0001 and that of the fully connected classification layers set to 0.001, i.e., 10 times higher than that of the convolutional layers. For each training iteration, 8 frames and their corresponding ground-truth targets are randomly extracted from a video sequence, and samples are Gaussian-sampled around the ground-truth targets on each image. In total, 256 positive samples and 768 negative samples are collected from the 8 frames, forming a mini-batch for network training. The only criterion for dividing positive and negative samples is the IoU between the sample box and the ground-truth box, with two thresholds set at 0.7 and 0.5: boxes with IoU greater than 0.7 are positive samples, and those with IoU less than 0.5 are negative samples. RMFNet is trained for 120 and 220 epochs on the GTOT and RGBT234 datasets, respectively. These two datasets serve as each other's training and testing sets: the RGBT234 dataset is the testing set when GTOT is used as the training set, and vice versa. It is worth noting that, to save training time, we randomly select 78 video sequences per iteration when training the network on the RGBT234 dataset.
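The sample labelling criterion can be expressed with the short sketch below, which assigns a Gaussian-sampled box to the positive or negative set according to its IoU with the ground truth using the 0.7 and 0.5 thresholds; the (x1, y1, x2, y2) box format and discarding samples that fall between the two thresholds are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def label_sample(sample_box, gt_box):
    """Positive if IoU > 0.7, negative if IoU < 0.5; boxes in between are discarded."""
    overlap = iou(sample_box, gt_box)
    if overlap > 0.7:
        return "positive"
    if overlap < 0.5:
        return "negative"
    return None
```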

4.2. Online Tracking

During online tracking with RMFNet, the last multi-domain binary classification fully connected layer of each of the three branches is replaced with a new classification layer so that the tracker can learn instance representation features of the new video sequence. The network is fine-tuned on the target marked in the first frame before tracking a new video sequence; the training process is similar to MDNet [12]. The tracking procedure is shown in Algorithm 1. During tracking, the convolutional layers and the parameters of the first two classification layers, FC4 and FC5, trained on the previous dataset are loaded, and the network is fine-tuned using the first frame. Around the ground-truth box of the target in the first frame, 500 positive and 5000 negative sample boxes $S_1^+$ and $S_1^-$ are generated and used for 50 iterations of fine-tuning, updating only the parameters of the classification layers. To obtain precise bounding boxes, we also train a bounding box regression network on the first frame's target, which is used to adjust reliable pre-localizations in subsequent frames. We collect samples near the ground truth in the first frame for long-term and short-term network updates. When tracking the target at frame t, 256 candidate boxes are generated in frame t around the target position estimated in frame t − 1, the network computes their scores $f(X_t^i)$, and the highest-scoring candidate is the target location predicted by the tracker. The score calculation is shown in Equation (8):
$X_t^* = \operatorname*{argmax}_{i=0,\ldots,255} f(X_t^i)$  (8)
Algorithm 1: Online Tracking
Input: Pretrained CNN filters {$W_{conv}$, $W_{fc4-fc5}$}; initial target state $X_1$.
Output: Estimated target state $X_t^*$.
1: Randomly initialize $W_{fc6}$;
2: Train a bounding box regression model BR(·);
3: Draw positive and negative samples $S_1^+$, $S_1^-$;
4: Update {$W_{fc}$} using $S_1^+$ and $S_1^-$;
5: Initialize the short-term and long-term sample sets $\zeta_s \leftarrow \{1\}$ and $\zeta_l \leftarrow \{1\}$;
6: repeat
7:   Draw target candidate samples $X_t^i$;
8:   Compute the optimal target state $X_t^*$ by (8);
9:   if $f(X_t^*) > 0.5$ then
10:    Draw training samples $S_t^+$ and $S_t^-$;
11:    $\zeta_s \leftarrow \zeta_s \cup \{t\}$, $\zeta_l \leftarrow \zeta_l \cup \{t\}$;
12:    $X_t^*$ = BR($X_t^*$);
13:  if $f(X_t^*) < 0.5$ then
14:    Perform a short-term update of {$W_{fc}$} with $\zeta_s$;
15:  else if t mod 10 = 0 then
16:    Perform a long-term update of {$W_{fc}$} with $\zeta_l$;
17: until the end of the sequence
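For reference, the candidate evaluation step around Equation (8) can be sketched as follows; `score_fn` is a hypothetical stand-in for the network's target score f(·), and returning the mean of the top-5 scores as a reliability signal reflects the criterion used by the re-detection logic in Section 3.3.

```python
import numpy as np

def evaluate_candidates(score_fn, candidates, top_k=5):
    """Sketch of the candidate evaluation around Eq. (8).

    `score_fn` is a hypothetical stand-in for the network's target score f(.),
    and `candidates` is an (N, 4) array of boxes sampled around the previous
    target position. Returns the best-scoring box (Eq. (8)) and the mean of the
    top-k scores, used by the re-detection logic as a reliability signal.
    """
    scores = np.array([float(score_fn(box)) for box in candidates])
    best_box = candidates[int(np.argmax(scores))]        # X_t^* = argmax_i f(X_t^i)
    top_mean = float(np.sort(scores)[-top_k:].mean())    # mean of the top-5 scores
    return best_box, top_mean
```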

5. Performance Evaluation

We evaluate our method against popular tracking algorithms of recent years on two datasets, RGBT234 [10] and GTOT [9]. The validity of the proposed algorithm and of each module is demonstrated through comparative analysis and ablation experiments on these two benchmarks.

5.1. Datasets and Metrics

GTOT Dataset: The GTOT dataset consists of 50 pairs of visible and thermal infrared video sequences, including common targets, such as people, vehicles, ships, and aeroplanes, in various indoor, outdoor, daytime, and nighttime environments. The target location is manually annotated in each frame, and, based on factors such as the shooting environment, time, and target motion state, the sequences are labelled with seven different attributes. These attribute labels are used to evaluate how well an RGBT tracker copes with different challenges. We use two common tracking evaluation metrics for our experimental analysis: precision rate (PR) and success rate (SR). PR is the proportion of frames in which the centre location error between the bounding box output by the tracker and the ground truth is below a preset threshold, and SR is the proportion of frames in which the overlap (intersection over union) between the two exceeds a preset threshold. Since targets in the GTOT dataset are usually small, we set the PR threshold to 5 pixels.
RGBT234 Dataset: The RGBT234 dataset is a large-scale RGBT tracking dataset that extends the RGBT210 dataset, containing 234 pairs of spatially and temporally aligned visible and thermal infrared video sequences with approximately 210,000 frames, the longest sequence consisting of 8000 frames. The dataset encompasses a variety of challenging attributes, including, but not limited to, camera motion, deformation, motion blur, illumination changes, and occlusion; these challenges are labelled separately to enable a more comprehensive evaluation of the algorithms. The algorithms are evaluated fairly using the maximum precision rate (MPR) and maximum success rate (MSR) metrics, with the MPR threshold set to 20 pixels. MPR and MSR use the maximum value, whereas PR and SR use the average value, so MPR and MSR better reflect the accuracy and stability of the algorithm.
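Under the definitions above, PR and SR for a single sequence can be computed as in the sketch below; the 0.5 overlap threshold for SR is a common convention assumed here for illustration, and in practice SR is often summarised over a sweep of thresholds (e.g., as the area under the success curve).

```python
import numpy as np

def precision_rate(center_errors, threshold=5.0):
    """PR: fraction of frames whose centre location error is below the threshold
    (5 pixels for GTOT, 20 pixels for RGBT234)."""
    errors = np.asarray(center_errors, dtype=float)
    return float((errors < threshold).mean())

def success_rate(overlaps, threshold=0.5):
    """SR: fraction of frames whose IoU with the ground truth exceeds the threshold;
    the 0.5 value is an assumed illustration, and SR is often summarised over a
    sweep of thresholds (area under the success curve)."""
    ious = np.asarray(overlaps, dtype=float)
    return float((ious > threshold).mean())
```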

5.2. Evaluation on the GTOT Dataset

We first compared the performance of RMFNet with 12 RGB trackers to verify the necessity of incorporating thermal infrared modality information. The trackers include KCF [28], ACT [29], ECO [30], ACFN [31], CFnet [32], MDNet [12], RT-MDNet [33], SiameseFC [16], SiamDW [34], BACF [35], SRDCF [36], and DAT [37]. The results are shown in Figure 7a, and it is clear that RMFNet outperforms the above RGB tracking algorithms, which fully demonstrates the complementary nature of the thermal infrared modality to the visible modality. In particular, RMFNet outperforms MDNet and DAT in PR/SR by 9.9%/13.6% and 14%/15.1%, respectively.
In addition, we conducted a comparison experiment between RMFNet and state-of-the-art RGBT trackers, including SiamDW+RGBT [34], RT-MDNet+RGBT [33], MDNet+RGBT [12], DAT+RGBT [37], SGT [38], CMR [39], JMMAC [40], MANet [13], DAFNet [20], DAPNet [1], MACNet [41], APFNet [4], and HDINet [42]. The first four trackers are RGB trackers extended to RGBT. SGT and CMR are traditional RGBT trackers, while the last six are RGBT trackers based on the MDNet framework. We present the evaluation results in Figure 7b. According to the curves, RMFNet achieves the best performance among these RGBT trackers, reaching 91.1%/76.9% in PR/SR. In particular, RMFNet's PR/SR is 1.7%/4.5% and 0.6%/3.2% higher than MANet and APFNet, respectively.
Finally, the GTOT dataset can be divided into 7 challenge attributes, namely occlusion (OCC), thermal crossover (TC), fast motion (FM), low illumination (LI), deformation (DEF), small object (SO), and large scale variation (LSV). We compared RMFNet with 12 trackers, SiamDW+RGBT [34], RT-MDNet+RGBT [33], MDNet+RGBT [12], SRDCF [36], SGT [38], RT-MDNet [33], CMR [39], ECO [30], MANet [13], DAPNet [1], MACNet [41], and APFNet [4], on these 7 challenge attributes. As shown in Table 1, our proposed algorithm performs well under the various challenging attributes.

5.3. Evaluation on the RGBT234 Dataset

We also verified the performance of our RMFNet with other trackers on a larger-scale dataset, RGBT234, to further validate its effectiveness. The evaluation results on the RGBT234 dataset are analyzed for three aspects, overall performance, performance under 12 challenging attributes, and visualization results in some highly challenging scenarios.
The overall performance of RMFNet was compared with 11 RGB trackers and 14 RGBT trackers. The RGB trackers are MDNet [12], RT-MDNet [33], DAT [37], ECO [30], SOWP [43], SRDCF [36], CSR-DCF [44], DSST [45], CFnet [32], SAMF [46], and C-COT [47]. The RGBT trackers are KCF + RGBT [28], SiamDW + RGBT [34], MDNet+RGBT [12], RT-MDNet + RGBT [33], SGT [38], JSR [48], L1-PE [49], MANet [13], DAPNet [1], HDINet [42], JMMAC [40], APFNet [4], and MACNet [41]. As shown in Figure 8a, RMFNet outperforms all RGB trackers, especially compared to DAT and ECO, where RMFNet achieves 6.3%/8.2% and 9.2%/7.1% performance improvements on PR/SR, respectively. As shown in Figure 8b, although RMFNet performs 3.3% worse than APFNet on PR, it performs 0.6% better on SR.
The RGBT234 dataset comprises 12 challenge attributes: low resolution (LR), partial occlusion (PO), scale variation (SV), background clutter (BC), thermal crossover (TC), deformation (DEF), low illumination (LI), camera motion (CM), motion blur (MB), no occlusion (NO), fast motion (FM), and heavy occlusion (HO). We selected 12 trackers from the aforementioned RGB and RGBT trackers to compare the results on these 12 challenging attributes. As depicted in the evaluation results in Figure 9 and Figure 10, RMFNet outperforms the other trackers in most of these challenges. In the PR evaluation, RMFNet performs well in the BC, CM, DEF, LI, and FM challenges, indicating stable performance in these challenging scenarios. However, in the HO, SV, and NO challenges, RMFNet's performance is inferior to that of MACNet, as MACNet makes better use of appearance information than RMFNet. Moreover, RMFNet is outperformed by HDINet in the LR and PO challenges, implying that our model needs improvement in utilizing complementary information. In future work, motion location prediction and camera motion compensation can be incorporated to enhance the model's appearance modelling capability, and better fusion rules can be designed to fully exploit complementary information. In the SR evaluation, we are only inferior to ECO in the TC challenge. This is because, when thermal crossover occurs, the thermal infrared image can no longer distinguish the target, and the feature fusion operation increases the interference of low-quality information, resulting in a loss of tracking accuracy. In this case, deep features are not as helpful in locating the target as traditional hand-crafted features, such as color and texture. Notably, SR is a more comprehensive evaluation indicator than PR, since it measures bounding box overlap rather than only centre location error.
We performed comparative experiments on the visual tracking results of four video sequences (elecbike10, man4, Children4, and car41) from the RGBT234 dataset, with the compared algorithms being MANet [13], ECO [30], SGT [38], and MDNet + RGBT [12]. The visual tracking results of RMFNet and the other algorithms in Figure 11 show that RMFNet maintains good performance even in challenging environments such as cluttered backgrounds, camera motion, appearance changes, and occlusions. For example, in Figure 11a, the target undergoes obvious appearance changes due to a turn in a low-light environment; all trackers lost the target in the middle of the video, but RMFNet was able to relocate the target later. In Figure 11b, severe occlusion occurs, and all trackers except RMFNet lost the target; RMFNet could accurately locate the target despite repeated occlusions by the background. In Figure 11c, the target faces challenges such as low light, camera lens rotation, and deformation, but RMFNet can still adapt to the target's shape and track it well. In Figure 11d, multiple background individuals similar to the target move across it, and the target is partially occluded; RMFNet can still differentiate the target from other objects under the cluttered background and partial occlusion.

5.4. Ablation Experiments

We conducted ablation experiments on both the GTOT and RGBT234 datasets, which demonstrated the effectiveness of all the proposed components and module improvements presented in this chapter.

5.4.1. Component Analysis

As shown in Table 2, Baseline is the basic model used in this article. It uses the RT-MDNet network and takes visible light and thermal infrared images as inputs to extract features separately. It then performs element-wise addition on the features output by the third convolutional layer of the two modalities to obtain the fused features. Finally, the fused feature is fed into three fully connected classification layers for object tracking. LF stands for late-term decision-level fusion, which passes the visible light, thermal infrared, and fused features separately through three fully connected classification layers to obtain sample scores; the scores of the three streams are then added and averaged to obtain the final sample score. FB represents the fusion branch, which adds two convolutional layers in parallel between the two modality-specific branches to extract features shared by the two modalities. EFM represents the ECA-based multimodal feature fusion module, which adaptively weights the feature maps of the two modalities output by the shared branch according to their reliability and fuses them into a single representation. RD stands for our improved re-detection algorithm, which incorporates a re-detection process to mitigate the accumulation of failures. AR represents the Alpha-Refine [8] refinement module, which finely adjusts the tracker's coarse localization. As shown in Table 2, all of the proposed components exceed the results of the baseline model, and each component contributes significantly to the network's performance, proving the superiority of our network model.

5.4.2. Multimodal Fusion Module Analysis

Comparing the performance of III and IV in Table 2, it can be verified that the fusion features obtained using the EFM module enhance the tracker’s performance. To further validate the advanced nature of the fusion strategy used by the EFM module, the following experimental performance comparisons were performed: (1) baseline, which fused the two-modality features at the third convolutional layer output through element-wise addition; (2) baseline+EFM-AFF, which replaced the attention mechanism in AFF [26] with ECA attention and used the feature fusion strategy in AFF for feature fusion; and (3) baseline+EFM, which utilized the proposed EFM multimodal fusion module for feature fusion. Table 3 shows the results of the comparison. We found that baseline+EFM outperformed baseline, demonstrating that our multimodal fusion module effectively integrated different modality features and significantly improved the tracker’s performance. Moreover, the performance of baseline+EFM-AFF was inferior to that of baseline, which indicated that our approach was more suitable for multimodal feature fusion than the fusion strategy employed in AFF.

5.4.3. Efficiency Analysis

RMFNet was implemented with Python 3.6 and PyTorch 1.2.0 on a Tesla V100 GPU. Finally, we compared the performance and runtime of RMFNet on the RGBT234 and GTOT datasets with MDNet + RGBT [12], MANet [13], DAPNet [1], and CBPNet [2]. As shown in Table 4, the evaluation results demonstrate that RMFNet is superior to the other trackers in both speed and performance. RMFNet-noAR, which removes the AR module, achieves the highest speed. Notably, under the same conditions, we tested MANet on the RGBT234 and GTOT datasets at 0.88 fps and 0.64 fps, respectively.

6. Conclusions

This article proposes a new robust RGBT tracking network called the Redetection Multimodal Fusion Network (RMFNet). RMFNet uses a three-branch structure to fully exploit the potential value of the shared and specific information in multimodal data. By combining medium-term feature-level fusion with late-term decision-level fusion, RMFNet fully utilizes the complementary advantages of multimodal information and compensates for the shortcomings of single-modality information. Additionally, RMFNet uses an improved re-detection algorithm to address the failure accumulation problem of the base algorithm, as well as a multi-stage bounding box estimation strategy to further improve the robustness and accuracy of the tracker. Through extensive experiments, RMFNet has demonstrated good performance in various challenging scenarios and stable tracking capabilities. In the future, we will further investigate how to leverage the temporal and motion information of videos to improve RMFNet and enhance the performance of the tracker.

Author Contributions

Y.Z.: Investigation, Methodology, Validation, Writing—original draft. H.L.: Resources, Writing—review and editing. G.G.: Funding acquisition. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the Natural Science Foundation of China (Grant No. U1903213).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data presented in this study are available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zhu, Y.; Li, C.; Luo, B.; Tang, J.; Wang, X. Dense feature aggregation and pruning for rgbt tracking. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 465–472. [Google Scholar]
  2. Xu, Q.; Mei, Y.; Liu, J.; Li, C. Multimodal cross-layer bilinear pooling for RGBT tracking. IEEE Trans. Multimed. 2021, 24, 567–580. [Google Scholar] [CrossRef]
  3. Li, Y.; Lai, H.; Wang, L.; Jia, Z. Multibranch Adaptive Fusion Network for RGBT Tracking. IEEE Sens. J. 2022, 22, 7084–7093. [Google Scholar] [CrossRef]
  4. Xiao, Y.; Yang, M.; Li, C.; Liu, L.; Tang, J. Attribute-based progressive fusion network for rgbt tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 22 February–1 March 2022; Volume 36, pp. 2831–2838. [Google Scholar]
  5. Lu, A.; Li, C.; Yan, Y.; Tang, J.; Luo, B. RGBT tracking via multi-adapter network with hierarchical divergence loss. IEEE Trans. Image Process. 2021, 30, 5613–5625. [Google Scholar] [CrossRef] [PubMed]
  6. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  7. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar]
  8. Yan, B.; Zhang, X.; Wang, D.; Lu, H.; Yang, X. Alpha-refine: Boosting tracking performance by precise bounding box estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–27 June 2021; pp. 5289–5298. [Google Scholar]
  9. Li, C.; Cheng, H.; Hu, S.; Liu, X.; Tang, J.; Lin, L. Learning collaborative sparse representation for grayscale-thermal tracking. IEEE Trans. Image Process. 2016, 25, 5743–5756. [Google Scholar] [CrossRef] [PubMed]
  10. Li, C.; Liang, X.; Lu, Y.; Zhao, N.; Tang, J. RGB-T object tracking: Benchmark and baseline. Pattern Recognit. 2019, 96, 106977. [Google Scholar] [CrossRef]
  11. Li, C.; Wu, X.; Zhao, N.; Cao, X.; Tang, J. Fusing two-stream convolutional neural networks for RGB-T object tracking. Neurocomputing 2018, 281, 78–85. [Google Scholar] [CrossRef]
  12. Nam, H.; Han, B. Learning multi-domain convolutional neural networks for visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4293–4302. [Google Scholar]
  13. Li, C.; Lu, A.; Zheng, A.; Tu, Z.; Tang, J. Multi-Adapter RGBT Tracking. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019; pp. 2262–2270. [Google Scholar]
  14. Zhang, P.; Wang, D.; Lu, H.; Yang, X. Learning adaptive attribute-driven representation for real-time RGB-T tracking. Int. J. Comput. Vis. 2021, 129, 2714–2729. [Google Scholar] [CrossRef]
  15. Zhang, X.; Ye, P.; Peng, S.; Liu, J.; Gong, K.; Xiao, G. SiamFT: An RGB-infrared fusion tracking method via fully convolutional siamese networks. IEEE Access 2019, 7, 122122–122133. [Google Scholar] [CrossRef]
  16. Bertinetto, L.; Valmadre, J.; Henriques, J.F.; Vedaldi, A.; Torr, P.H. Fully-convolutional siamese networks for object tracking. In Proceedings of the Computer Vision–ECCV 2016 Workshops, Amsterdam, The Netherlands, 8–10 and 15–16 October 2016; Proceedings, Part II 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 850–865. [Google Scholar]
  17. Zhang, X.; Ye, P.; Peng, S.; Liu, J.; Xiao, G. DSiamMFT: An RGB-T fusion tracking method via dynamic Siamese networks using multi-layer feature fusion. Signal Process. Image Commun. 2020, 84, 115756. [Google Scholar] [CrossRef]
  18. Peng, J.; Zhao, H.; Hu, Z.; Zhuang, Y.; Wang, B. Siamese Infrared and Visible Light Fusion Network for RGB-T Tracking. arXiv 2021, arXiv:2103.07302. [Google Scholar]
  19. Yang, R.; Zhu, Y.; Wang, X.; Li, C.; Tang, J. Learning Target-Oriented Dual Attention for Robust RGB-T Tracking. In Proceedings of the 2019 IEEE International Conference on Image Processing, Taipei, Taiwan, 22–25 September 2019; pp. 3975–3979. [Google Scholar]
  20. Gao, Y.; Li, C.; Zhu, Y.; Tang, J.; He, T.; Wang, F. Deep Adaptive Fusion Network for High Performance RGBT Tracking. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 91–99. [Google Scholar]
  21. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  22. Li, X.; Wang, W.; Hu, X.; Yang, J. Selective kernel networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 510–519. [Google Scholar]
  23. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
  24. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803. [Google Scholar]
  25. Zhao, H.; Zhang, Y.; Liu, S.; Shi, J.; Loy, C.C.; Lin, D.; Jia, J. Psanet: Point-wise spatial attention network for scene parsing. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8 September 2018; pp. 267–283. [Google Scholar]
  26. Dai, Y.; Gieseke, F.; Oehmcke, S.; Wu, Y.; Barnard, K. Attentional feature fusion. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2021; pp. 3560–3569. [Google Scholar]
  27. Tu, Z.; Lin, C.; Zhao, W.; Li, C.; Tang, J. M5L: Multi-modal multi-margin metric learning for RGBT tracking. IEEE Trans. Image Process. 2021, 31, 85–98. [Google Scholar] [CrossRef] [PubMed]
  28. Henriques, J.F.; Caseiro, R.; Martins, P.; Batista, J. High-speed tracking with kernelized correlation filters. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 37, 583–596. [Google Scholar] [CrossRef] [PubMed]
  29. Chen, B.; Wang, D.; Li, P.; Wang, S.; Lu, H. Real-time ‘actor-critic’ tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 318–334. [Google Scholar]
  30. Danelljan, M.; Bhat, G.; Shahbaz Khan, F.; Felsberg, M. Eco: Efficient convolution operators for tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6638–6646. [Google Scholar]
  31. Choi, J.; Jin Chang, H.; Yun, S.; Fischer, T.; Demiris, Y.; Young Choi, J. Attentional correlation filter network for adaptive visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4807–4816. [Google Scholar]
  32. Valmadre, J.; Bertinetto, L.; Henriques, J.; Vedaldi, A.; Torr, P.H. End-to-end representation learning for correlation filter based tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2805–2813. [Google Scholar]
  33. Jung, I.; Son, J.; Baek, M.; Han, B. Real-time mdnet. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 83–98. [Google Scholar]
  34. Zhang, Z.; Peng, H. Deeper and wider siamese networks for real-time visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4591–4600. [Google Scholar]
  35. Kiani Galoogahi, H.; Fagg, A.; Lucey, S. Learning background-aware correlation filters for visual tracking. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1135–1143. [Google Scholar]
  36. Danelljan, M.; Hager, G.; Shahbaz Khan, F.; Felsberg, M. Learning spatially regularized correlation filters for visual tracking. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4310–4318. [Google Scholar]
  37. Pu, S.; Song, Y.; Ma, C.; Zhang, H.; Yang, M. Deep Attentive Tracking via Reciprocative Learning. In Proceedings of the Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; pp. 1935–1945. [Google Scholar]
  38. Li, C.; Zhao, N.; Lu, Y.; Zhu, C.; Tang, J. Weighted sparse representation regularized graph learning for RGB-T object tracking. In Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017; pp. 1856–1864. [Google Scholar]
  39. Li, C.; Zhu, C.; Huang, Y.; Tang, J.; Wang, L. Cross-modal ranking with soft consistency and noisy labels for robust RGB-T tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 808–823. [Google Scholar]
  40. Zhang, P.; Zhao, J.; Bo, C.; Wang, D.; Lu, H.; Yang, X. Jointly modeling motion and appearance cues for robust RGB-T tracking. IEEE Trans. Image Process. 2021, 30, 3335–3347. [Google Scholar] [CrossRef] [PubMed]
  41. Zhang, H.; Zhang, L.; Zhuo, L.; Zhang, J. Object tracking in RGB-T videos using modal-aware attention network and competitive learning. Sensors 2020, 20, 393. [Google Scholar] [CrossRef] [PubMed]
  42. Mei, J.; Zhou, D.; Cao, J.; Nie, R.; Guo, Y. Hdinet: Hierarchical dual-sensor interaction network for rgbt tracking. IEEE Sens. J. 2021, 21, 16915–16926. [Google Scholar] [CrossRef]
  43. Kim, H.U.; Lee, D.Y.; Sim, J.Y.; Kim, C.S. Sowp: Spatially ordered and weighted patch descriptor for visual tracking. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 3011–3019. [Google Scholar]
  44. Lukezic, A.; Vojir, T.; Cehovin Zajc, L.; Matas, J.; Kristan, M. Discriminative correlation filter with channel and spatial reliability. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6309–6318. [Google Scholar]
  45. Danelljan, M.; Häger, G.; Khan, F.; Felsberg, M. Accurate scale estimation for robust visual tracking. In Proceedings of the British Machine Vision Conference, Nottingham, UK, 1–5 September 2014. [Google Scholar]
  46. Li, Y.; Zhu, J. A scale adaptive kernel correlation filter tracker with feature integration. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 254–265. [Google Scholar]
  47. Danelljan, M.; Robinson, A.; Shahbaz Khan, F.; Felsberg, M. Beyond correlation filters: Learning continuous convolution operators for visual tracking. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part V 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 472–488. [Google Scholar]
  48. Liu, H.; Sun, F. Fusion tracking in color and infrared images using joint sparse representation. Sci. China Inf. Sci. 2012, 55, 590–599. [Google Scholar] [CrossRef]
  49. Wu, Y.; Blasch, E.; Chen, G.; Bai, L.; Ling, H. Multiple source data fusion via sparse representation for robust visual tracking. In Proceedings of the 14th International Conference on Information Fusion, Chicago, IL, USA, 5–8 July 2011; IEEE: Piscataway, NJ, USA, 2011; pp. 1–8. [Google Scholar]
Figure 1. Different multimodal fusion networks. (a) Early pixel-level fusion. (b) Mid-term feature-level fusion, the framework proposed by [11]. (c) Late decision-level fusion, the framework proposed by [19]. (d) Framework proposed by [13]. (e) Framework proposed by [20]. (f) Our framework. “C” and “+” denote concatenation and element-wise summation. “AFM” and “EFM” indicate multimodal feature fusion modules.
Figure 2. RMFNet architecture, including thermal infrared branch, visible light branch, and fusion branch.
Figure 3. Detail diagram of the EFM module. $X_r$, $X_t$, and $X_f$ denote the visible light, thermal infrared, and fused features, respectively. B × C × W × H represents the dimensions of the tensor, where B is the batch size, C is the number of channels, and W × H is the width and height of the feature map.
Figure 4. MDNet re-detection algorithm.
Figure 5. Improved re-detection algorithm.
Figure 6. Alpha-Refine module process.
Figure 7. PR (%) and SR (%) for different trackers in the GTOT dataset.
Figure 8. PR (%) and SR (%) for different trackers in the RGBT234 dataset.
Figure 9. MPR (%) curves for RMFNet and 15 trackers on 12 attributes in the RGBT234 dataset.
Figure 10. MSR (%) curves for RMFNet and 15 trackers on 12 attributes in the RGBT234 dataset.
Figure 11. Qualitative comparison of RMFNet with other trackers.
Table 1. PR/SR scores (%) for RMFNet and 12 trackers on 7 challenge attributes in the GTOT dataset. The best, second, and third results are denoted by red, green, and blue, respectively.
Attributes | OCC | LSV | FM | LI | TC | SO | DEF | ALL
ECO | 77.5/62.2 | 85.6/70.5 | 77.9/64.5 | 75.2/61.7 | 81.9/65.3 | 90.7/69.1 | 75.2/59.8 | 77.0/63.1
RT-MDNet | 73.3/57.6 | 79.1/63.7 | 78.1/64.1 | 77.2/63.8 | 73.7/59.0 | 85.6/63.4 | 73.1/61.0 | 74.5/61.3
SiamDW+RGBT | 67.5/53.6 | 68.9/56.6 | 71.1/57.6 | 70.0/58.8 | 63.5/51.7 | 76.4/58.5 | 69.1/58.2 | 68.0/56.5
SRDCF | 72.7/58.0 | 80.4/68.1 | 68.3/61.1 | 71.7/59.4 | 70.5/58.0 | 80.5/57.5 | 66.6/53.7 | 71.9/59.1
RT-MDNet+RGBT | 79.6/61.8 | 80.9/63.6 | 79.4/61.2 | 85.0/68.9 | 86.4/65.8 | 93.5/67.6 | 90.8/73.4 | 83.9/66.9
MDNet+RGBT | 82.9/64.1 | 77.0/57.3 | 80.5/59.8 | 79.5/64.3 | 79.5/60.9 | 87.0/62.2 | 81.6/73.3 | 80.0/63.7
SGT | 81.0/56.7 | 84.2/54.7 | 79.9/55.9 | 88.4/65.1 | 84.8/61.5 | 91.7/61.8 | 91.9/73.3 | 85.1/62.8
CMR | 82.5/62.6 | 85.3/66.7 | 83.5/65.0 | 88.7/67.8 | 81.1/62.2 | 86.5/61.0 | 84.7/65.2 | 82.7/64.3
MANet | 88.2/69.6 | 86.9/70.6 | 87.9/69.4 | 91.4/73.6 | 88.9/70.2 | 93.2/70.0 | 92.3/75.2 | 89.4/72.4
DAPNet | 87.3/67.4 | 84.7/54.8 | 82.3/61.9 | 90.0/72.2 | 89.3/69.0 | 93.7/69.2 | 91.9/77.1 | 88.2/70.7
MACNet | 87.6/68.7 | 84.6/67.3 | 82.3/65.9 | 89.4/73.1 | 89.2/69.7 | 95.0/69.5 | 92.6/76.5 | 88.0/71.4
APFNet | 90.3/71.3 | 87.7/71.2 | 86.5/68.4 | 91.4/74.8 | 90.4/71.6 | 94.3/71.3 | 94.6/78.0 | 90.5/73.7
RMFNet | 89.2/73.6 | 88.1/75.1 | 85.9/73.4 | 92.5/77.1 | 92.0/76.3 | 93.6/74.9 | 94.1/78.4 | 91.1/76.9
Table 2. The validity of the components was verified in the GTOT and RGBT234 datasets.
Method | Baseline | LF | FB | EFM | RD | AR | RGBT234 PR | RGBT234 SR | GTOT PR | GTOT SR
I | ✓ | × | × | × | × | × | 75.5 | 51.9 | 83.6 | 68.2
II | ✓ | ✓ | × | × | × | × | 76.2 | 52.9 | 86.5 | 70.0
III | ✓ | ✓ | ✓ | × | × | × | 77.2 | 53.5 | 87.4 | 70.3
IV | ✓ | ✓ | ✓ | ✓ | × | × | 77.4 | 54.0 | 87.7 | 71.1
V | ✓ | ✓ | ✓ | ✓ | ✓ | × | 78.2 | 54.4 | 87.7 | 71.5
VI | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 79.4 | 58.5 | 91.1 | 76.9
Table 3. EFM integration strategy analysis evaluation results.
Method | RGBT234 PR | RGBT234 SR | GTOT PR | GTOT SR
Baseline | 75.5 | 51.9 | 83.6 | 68.2
Baseline + EFM-AFF | 71.8 | 46.0 | 82.5 | 64.7
Baseline + EFM | 77.9 | 52.9 | 86.9 | 69.2
Table 4. Evaluation results for runtime and overall performance.
Dataset | Metric | MDNet + RGBT | MANet | DAPNet | CBPNet | RMFNet-noAR | RMFNet
GTOT | Speed | 3 fps | 1.1 fps | 2 fps | 3.7 fps | 14.2 fps | 10.9 fps
GTOT | PR/SR | 80.0/63.7 | 89.4/72.4 | 88.2/70.7 | 88.5/71.6 | 87.7/71.5 | 91.1/76.9
RGBT234 | Speed | 3 fps | 1.5 fps | 2 fps | 3.3 fps | 13.8 fps | 10.1 fps
RGBT234 | PR/SR | 72.2/49.5 | 77.7/53.9 | 76.6/53.7 | 79.4/54.1 | 78.2/54.4 | 79.4/58.5
