1. Introduction
Single object tracking (SOT) refers to the task of predicting the position and size of a given target in the subsequent frames of a video, given its initial information in the first frame [1]. It is an important research area in computer vision and has been utilized in various applications, such as moving object analysis [2], autonomous driving [3], and human–computer interaction [4]. As deep learning has demonstrated outstanding performance across visual tasks, more and more researchers have introduced it into single object tracking, achieving many impressive results. Tao et al. [5] proposed SINT, the first network to apply the Siamese architecture to single object tracking. SINT learns a matching function through a Siamese network, using the target information in the first frame as a template and computing matching scores between the template and candidate patches sampled from subsequent frames; the candidate with the highest score gives the target's position in the current frame. SiamFC [6] abstracts the tracking process into a similarity learning problem: by learning a function to compare the similarity between the template image Z and the search image X, the target's location can be predicted from the position with the highest score on the response map. Building upon SiamFC, SiamRPN [7] introduces the RPN module from Faster R-CNN [8], which eliminates the time-consuming multi-scale testing process, further improving both performance and speed. Li et al. [9] proposed SiamRPN++, which introduces ResNet-50 [10] to improve feature extraction capability; they also proposed Depth-wise Cross Correlation to replace the previous Up-Channel Cross Correlation, drastically reducing the number of parameters and improving training stability. Xu et al. [11] proposed an anchor-free tracking algorithm called SiamFC++, which enhances the original SiamFC tracker with added position regression, a quality score, and multiple joint training losses, resulting in significantly improved tracking performance. SiamCAR [12] is another anchor-free fully convolutional Siamese tracking network, which decomposes visual tracking into two sub-problems, pixel-wise classification and target bounding box regression, and solves end-to-end visual tracking in a pixel-wise manner. Guo et al. [13] proposed a simple Siamese graph attention network, SiamGAT, for generic object tracking. The tracker establishes part-to-part correspondences between the target and search regions using a complete bipartite graph and applies a graph attention mechanism to propagate target information from template features to search features. Yang et al. [14] proposed SiamMDM, a Siamese network dedicated to single object tracking in satellite videos. This network addresses the weak features exhibited by typical targets in satellite videos by incorporating feature map fusion and introducing a dynamic template branch; it further fuses motion model predictions and Siamese network predictions adaptively to alleviate issues commonly encountered in satellite videos, such as partial or full occlusion.
As the Transformer [15] has shown great potential in object detection, more and more researchers have tried to introduce it into object tracking. Because Siamese algorithms exploit only spatial features, they are not particularly suitable for scenarios involving target disappearance or significant object variation. To address this limitation, Yan et al. [16] proposed a transformer architecture that combines spatial and temporal characteristics, effectively resolving the issue of long-range interactions in sequence modeling. Chen et al. [17] pointed out that the correlation operation is an overly simple form of fusion and proposed TransT, a Transformer tracking method based on an attention fusion mechanism. SparseTT [18] designs a sparse attention mechanism that allows the network to focus on target information in the search area and proposes a double-head design to improve classification and regression accuracy. The ToMP [19] tracker replaces traditional optimization-based model predictors with transformers; it incorporates two novel encodings that carry both target position and extent information and proposes a parallel two-stage tracking scheme that decouples target localization from bounding box regression, achieving a balance between accuracy and efficiency. To fully leverage the capabilities of self-attention, Cui et al. [20] introduced MixFormer, a novel tracking framework that deviates from traditional tracking paradigms; its MAM module employs attention to perform feature extraction and feature interaction simultaneously, rendering MixFormer remarkably concise without the need for additional fusion modules. Ye et al. [21] introduced OSTrack, a concise and efficient one-stream, one-stage tracking framework that leverages the similarity scores obtained in early stages and proposes an in-network early candidate elimination module, thereby reducing inference time. To address the performance loss caused by independent correlation calculations in attention mechanisms, AiATrack [22] introduces an Attention in Attention (AiA) module that enhances appropriate correlations and suppresses erroneous ones by seeking consensus among all correlation vectors. SwinTrack [23] employs the Transformer for both feature extraction and fusion, enabling full interaction between the template region and the search region, and extensively investigates strategies for feature fusion, position encoding, and training loss to enhance performance.
In recent years, with the advancement of remote sensing and image processing technology [24,25,26], the resolution of satellite video has been continuously improving, enabling the tracking of trains, ships, airplanes, and even ordinary cars from a remote sensing perspective. Compared with general optical object tracking, remote sensing single object tracking (RSOT) [27,28,29,30] faces several challenges that lead to significant performance degradation when deep learning-based single object tracking methods are applied directly. Taking the vehicle, a common target in tracking datasets, as an example, we conducted a comparative analysis of vehicle targets captured in natural image scenes, aerial views from unmanned aerial vehicles (UAVs), and satellite perspectives; the results are depicted in Figure 1. The vehicle in Figure 1a is selected from the Car24 video sequence of the OTB [31] dataset. In the displayed frame, the vehicle target occupies 1848 pixels, accounting for 2.41% of the entire image. Despite the limited resolution of this image, the rich feature information at the rear of the vehicle provides sufficient discriminative cues for the tracker's decision-making. The car in Figure 1b is extracted from the car4 video sequence of the UAV123 [32] dataset. The car target occupies 1560 pixels, approximately 0.17% of the entire image. Although the proportion of the vehicle within the image diminishes from the aerial perspective of a UAV, its contour remains distinctly discernible. The car in Figure 1c is selected from the car_04 video sequence of the SatSOT [33] dataset. In complete contrast to the previous two images, the car target here occupies only 90 pixels, a mere 0.01% of the whole image.
The comparison of the three images in Figure 1 reveals that the minuscule size of vehicle targets in satellite videos results in a reduced amount of feature information. It is therefore essential to employ corresponding feature enhancement techniques to bolster the features of small objects in satellite videos. Moreover, given this scarcity of features in small objects, it is unreasonable to rely solely on the Depth-wise Cross Correlation matching approach inherited from the general object tracking domain. In summary, applying methods from the generic tracking field to RSOT confronts two critical issues, as follows.
- (1) Weak target features: Due to the altitude of the satellite and the spatial resolution of satellite videos, the target usually occupies only a tiny percentage of the entire image [34]. Compared to generic optical targets, targets to be tracked in satellite video are generally too small, leaving insufficient feature information for trackers to exploit [35].
- (2) Inappropriate matching strategy: The common Depth-wise Cross Correlation finds the best match within the search map based on the target appearance and texture information provided by the template map. However, objects in satellite videos are tiny, usually appear in clusters or lines, and lack noticeable contour features, so the Depth-wise Cross Correlation matching strategy is not fully applicable to RSOT.
Consequently, how to effectively suppress the background information around the target and enhance the target's own features has become an unavoidable problem in RSOT, and it is equally essential to design a matching method suited to target tracking in satellite videos. Based on these considerations, we design SiamTM, a tracker tailored to satellite video target tracking that builds on feature enhancement and a coarse-to-fine matching strategy.
The main contributions of this paper can be summarized as follows.
Firstly, we propose a novel target information enhancement (TIE) module that captures direction- and position-aware information along the horizontal and vertical dimensions, as well as inherent channel information from a global perspective of the image. The TIE module embeds position information into channel attention to strengthen the feature representation of small targets in satellite videos.
Secondly, we design a multi-level matching (MM) module better suited to the characteristics of satellite video targets. It combines coarse-grained semantic abstraction with fine-grained location detail, effectively exploiting template information to accurately locate the target in the search area and thereby improving the network's ability to track the target continuously in various complex scenarios.
Finally, extensive experiments are conducted on two large-scale satellite video single object tracking datasets, SatSOT and SV248S. The results show that the proposed SiamTM achieves state-of-the-art performance in both success and precision metrics while running at 89.76 FPS, exceeding the requirement for real-time tracking.
3. Proposed Approach
In this section, we introduce the proposed SiamTM network in detail. Firstly, in Section 3.1, we describe the SiamTM single object tracking network from a holistic perspective. To extract more valuable and discriminative features from the template feature map and the search feature map, we introduce a target information enhancement (TIE) module, detailed in Section 3.2. Furthermore, in Section 3.3 we propose a multi-level matching (MM) module that integrates target information into the search feature map to improve tracking performance.
3.1. Overall Architecture
The overall framework of the proposed SiamTM network is shown in Figure 2. The SiamTM network consists of three subnetworks: the feature extraction subnetwork, the feature enhancement and matching subnetwork, and the target prediction subnetwork. Similar to SiamCAR [12], the feature extraction network of SiamTM comprises two parts, the template branch and the search branch. The template image is of size $3 \times H_z \times W_z$ and the search image of size $3 \times H_x \times W_x$, where the first dimension is the number of channels and the last two dimensions are the height and width of the image. The template region image and the search region image are fed simultaneously into a modified ResNet-50 [9] network with shared weight parameters. After feature extraction, we obtain a template feature map and a search feature map. To enhance localization and discern foreground from background more effectively, we follow SiamCAR and leverage the features extracted from the final three residual blocks of the backbone. Extensive literature [9,12,50] has substantiated that the joint utilization of low-level and high-level feature maps significantly improves tracking accuracy. This is because low-level features carry a multitude of informative cues that facilitate precise positioning, such as the target's edge details and color attributes, while high-level features carry a greater wealth of semantic information that aids in differentiating foreground from background.
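For illustration, the following is a minimal PyTorch-style sketch of such multi-level feature aggregation. The SiamCAR-style concatenation of the last three residual-block outputs followed by a $1 \times 1$ channel reduction, the 256-channel output width, and the bilinear resizing are our assumptions here; the exact strides and dilations of the modified backbone are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class MultiLevelBackbone(nn.Module):
    """Hypothetical sketch: aggregate the last three residual blocks of a
    ResNet-50, SiamCAR-style. The real modified backbone differs in
    strides/dilations; this only illustrates the fusion idea."""
    def __init__(self, out_channels=256):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        self.stem = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu,
                                  resnet.maxpool, resnet.layer1)
        self.layer2, self.layer3, self.layer4 = resnet.layer2, resnet.layer3, resnet.layer4
        # Layers 2-4 of ResNet-50 output 512, 1024, and 2048 channels.
        self.reduce = nn.Conv2d(512 + 1024 + 2048, out_channels, kernel_size=1)

    def forward(self, x):
        x = self.stem(x)
        c2 = self.layer2(x)    # low-level: edge details, color cues
        c3 = self.layer3(c2)
        c4 = self.layer4(c3)   # high-level: semantic information
        # Resize the deeper maps to c2's resolution before concatenation.
        c3 = F.interpolate(c3, size=c2.shape[-2:], mode='bilinear', align_corners=False)
        c4 = F.interpolate(c4, size=c2.shape[-2:], mode='bilinear', align_corners=False)
        return self.reduce(torch.cat([c2, c3, c4], dim=1))
```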
To extract more beneficial and distinguishable features from the original feature maps and thereby facilitate subsequent processing, the template region feature map and the search region feature map are separately fed into the TIE module. These operations output new feature maps that focus more on the target itself while keeping the same size as the originals. Compared to targets in natural images, targets in satellite videos are smaller and more easily interfered with by other objects in the background. In response to these characteristics, we designed a multi-level matching (MM) module. The new template feature map and search feature map are fed into the MM module: first, the template feature is embedded into each position of the search feature map to more accurately locate the target center, yielding a preliminary, coarse response map; then, matching on coarse-grained semantic abstractions determines the target outline, yielding a more refined final response map.
Finally, the response map is fed into the target prediction subnetwork. In this subnetwork, the classification branch distinguishes the foreground from the background of the current frame and performs the center-ness calculation, while the regression branch determines the predicted bounding box. The classification branch first transforms the response map output by the feature enhancement and matching subnetwork through a four-layer CNN structure to obtain the feature map $F_{cls}$; each layer consists of a Convolution layer, a GroupNorm layer, and a ReLU layer in succession. $F_{cls}$ is subsequently channel-transformed by two independent single-layer convolutions to obtain the output feature maps $A_{cls}$ and $A_{cen}$: $A_{cls}$ differentiates the foreground from the background of the input image, whereas $A_{cen}$ denotes the center-ness score of each position. Similar to the classification branch, the regression branch first transforms the response map through an identical but independent CNN structure to obtain the feature map $F_{reg}$; a channel transformation through a convolution layer then yields the regression output $A_{reg}$. Each point in $A_{reg}$ represents the distances from the corresponding position to the four sides of the bounding box in the search region.
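The following is a minimal sketch of such a prediction subnetwork under the conventions above. The 256-channel towers and the choice of 32 GroupNorm groups are assumptions; the text specifies only the four-layer Conv-GroupNorm-ReLU structure and the branch outputs.

```python
import torch
import torch.nn as nn

def conv_tower(in_ch, out_ch, num_layers=4):
    """Four Conv-GroupNorm-ReLU layers, as described for both branches."""
    layers = []
    for i in range(num_layers):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.GroupNorm(32, out_ch),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class PredictionHead(nn.Module):
    def __init__(self, in_ch=256, mid_ch=256):
        super().__init__()
        self.cls_tower = conv_tower(in_ch, mid_ch)  # shared by cls + center-ness
        self.reg_tower = conv_tower(in_ch, mid_ch)  # identical but independent
        self.cls_out = nn.Conv2d(mid_ch, 2, 1)      # A_cls: foreground/background
        self.cen_out = nn.Conv2d(mid_ch, 1, 1)      # A_cen: center-ness score
        self.reg_out = nn.Conv2d(mid_ch, 4, 1)      # A_reg: distances to 4 box sides

    def forward(self, response):
        f_cls = self.cls_tower(response)
        f_reg = self.reg_tower(response)
        return self.cls_out(f_cls), self.cen_out(f_cls), self.reg_out(f_reg)
```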
3.2. Target Information Enhancement Module
Compared to objects in natural scenarios, objects in remote sensing images are smaller and contain less feature information [51,52]. How to effectively extract distinctive features from objects in remote sensing images has therefore become one of the critical issues affecting subsequent tracking performance. Studies on lightweight networks [53] have shown that channel attention can significantly improve model performance; however, channel attention often ignores the positional information that is crucial in visual tasks for capturing target structure [54]. It is therefore necessary to consider how to embed positional information into channel attention. Based on these problems and considerations, we designed a feature enhancement module that captures inter-channel information at the global 2D image level while also capturing direction- and position-aware information along the two spatial directions. The TIE module is shown in Figure 3.
Given a feature map $X \in \mathbb{R}^{C \times H \times W}$, a one-dimensional feature-encoding operation is first performed on each channel using pooling kernels of size $(1, W)$ and $(H, 1)$ along the horizontal and vertical directions, respectively. The output of the $c$-th channel at height $h$ is shown below:

$$z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i),$$

where $c$ and $h$ represent the channel and height of the current operation, respectively, while $W$ represents the width of the image. Similarly, the output of the $c$-th channel at width $w$ is shown as follows:

$$z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w),$$

where $H$ represents the height of the image.
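In implementation terms, the two directional poolings can be expressed with adaptive average pooling; below is a small sketch illustrating the resulting shapes (the feature map sizes are made up for the example).

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 31, 31)           # a C x H x W feature map (sizes assumed)
pool_h = nn.AdaptiveAvgPool2d((None, 1))  # average over width  -> z^h
pool_w = nn.AdaptiveAvgPool2d((1, None))  # average over height -> z^w
z_h, z_w = pool_h(x), pool_w(x)
print(z_h.shape, z_w.shape)  # torch.Size([1, 256, 31, 1]) torch.Size([1, 256, 1, 31])
```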
Then, one of the two pooled feature maps is transposed so that it can be concatenated with the other, and the concatenation is transformed into an intermediate feature map through a $1 \times 1$ convolution. The intermediate feature map retains spatial information from the original feature map in both the vertical and horizontal directions and also captures relationships between channels.
After that, the intermediate feature map $f$ is split along the spatial dimension into two separate tensors, $f^h$ and $f^w$. After a $1 \times 1$ convolution operation, the attention weights $g^h$ and $g^w$ are obtained. The final output expression for the 1D part is

$$g^h = \sigma\big(F_h(f^h)\big), \quad g^w = \sigma\big(F_w(f^w)\big), \quad Y^{1D} = X \otimes g^h \otimes g^w,$$

where $F_h$ and $F_w$ denote the $1 \times 1$ convolution layers, and $g^h$ and $g^w$ denote the attention weights along the horizontal direction and the vertical direction, respectively.
In addition, in order to capture more distinctive target information from a global perspective and highlight more effective target features, we start from the 2D level of the image and perform global average pooling on the original feature map to obtain the feature map $z^g \in \mathbb{R}^{C \times 1 \times 1}$. Another intermediate feature map is then generated, whose output expression is

$$Y^{2D} = X \otimes \sigma\big(X \oplus z^g\big),$$

where $\oplus$ means the broadcasting addition, $\otimes$ means the element-wise multiplication, and $\sigma$ denotes the Sigmoid operation.
Finally, the one-dimensional output $Y^{1D}$ and the two-dimensional output $Y^{2D}$ obtained from the above encoding operations are concatenated and reduced in channel dimension to obtain the final output feature map.
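Putting the pieces together, here is a minimal PyTorch-style sketch of the TIE module as we read the description above. The channel reduction ratio, the use of BatchNorm in the shared transform, and the exact form of the 2D branch are assumptions.

```python
import torch
import torch.nn as nn

class TIE(nn.Module):
    """Sketch of the Target Information Enhancement module: a coordinate-
    attention-style 1D branch plus a global (2D) branch, then fusion."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        mid = max(8, channels // reduction)            # reduction ratio assumed
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # z^h: C x H x 1
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # z^w: C x 1 x W
        self.conv1 = nn.Sequential(nn.Conv2d(channels, mid, 1),
                                   nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        self.conv_h = nn.Conv2d(mid, channels, 1)      # F_h
        self.conv_w = nn.Conv2d(mid, channels, 1)      # F_w
        self.pool_g = nn.AdaptiveAvgPool2d(1)          # z^g: C x 1 x 1
        self.fuse = nn.Conv2d(2 * channels, channels, 1)  # concatenate + reduce

    def forward(self, x):
        n, c, h, w = x.shape
        # --- 1D branch: direction- and position-aware attention ---
        z_h = self.pool_h(x)                           # (n, c, h, 1)
        z_w = self.pool_w(x).transpose(2, 3)           # (n, c, w, 1)
        f = self.conv1(torch.cat([z_h, z_w], dim=2))   # shared 1x1 transform
        f_h, f_w = f.split([h, w], dim=2)              # split along spatial dim
        g_h = torch.sigmoid(self.conv_h(f_h))                   # (n, c, h, 1)
        g_w = torch.sigmoid(self.conv_w(f_w.transpose(2, 3)))   # (n, c, 1, w)
        y_1d = x * g_h * g_w
        # --- 2D branch: global channel attention (exact form assumed) ---
        z_g = self.pool_g(x)                           # (n, c, 1, 1)
        y_2d = x * torch.sigmoid(x + z_g)              # broadcast add, sigmoid, multiply
        # --- concatenate the two branches and reduce channels ---
        return self.fuse(torch.cat([y_1d, y_2d], dim=1))
```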
3.3. Multi-Level Matching Module
Feature matching in Siamese networks refers to integrating the feature maps obtained from the template branch and the search branch, calculating the similarity between each region of the template feature map and the search feature map, and outputting a response map, which is then sent to the subsequent target prediction subnetwork for classification, regression, and other operations. As one of the most essential steps in a single object tracking network, the quality of feature matching directly determines tracking performance. In existing work, since SiamRPN++ introduced Depth-wise Cross Correlation to Siamese tracking networks, most trackers have used this correlation operation as their feature matching subnetwork, and few modify the matching module to suit the characteristics of remote sensing targets. However, Depth-wise Cross Correlation uses the target template as a spatial filter to convolve over the search area, emphasizing coarse-grained semantic abstractions such as target contours and appearance while ignoring position information. Therefore, we propose a multi-level matching (MM) module, whose architecture is shown in Figure 4.
In order to achieve more precise localization and tracking of tiny-sized objects in remote sensing videos, the proposed MM module employs a coarse-to-fine feature matching scheme. Initially, the segment containing solely the features of the target is extracted from the template feature map, denoted as $T$, followed by an ROI Align operation. The resulting feature map $T'$ is then embedded into the search region feature map, denoted as $S$, yielding a preliminary response map $R_p$. Subsequently, a Depth-wise Cross Correlation operation is conducted between the template feature map $T$ and the preliminary response map $R_p$, yielding the final response map. Within this multi-level matching mechanism, each position of the preliminary response map assimilates information from the template features, so pixels corresponding to the object's location receive higher response values. Moreover, the Depth-wise Cross Correlation operation imparts further constraints on the object's contours. The MM module thus enhances the tracker through a two-step approach: first coarsely localizing the target's position and then refining the target's bounding box. This yields a more refined response map and, consequently, improved tracking performance. The specific steps of the MM module are delineated as follows.
Firstly, the feature map of the template region and the feature map of the search region, both of which have undergone feature extraction and enhancement, are denoted as $T$ and $S$, respectively. Then, according to the label information, the feature map containing only the target itself, denoted $T_{obj}$, is extracted from the template region feature map, discarding the interfering background information. Next, ROI Align transforms $T_{obj}$ into the feature map $T'$. $T'$ is embedded at the position corresponding to each pixel of the search region feature map, resulting in a feature map $S'$ with $2C$ channels; this ensures that every position in the feature map contains target features for later processing. To control the channel dimension efficiently and avoid complex matrix operations, a $1 \times 1$ convolution reduces the dimensionality of $S'$ to obtain a response map $R_p$ with only $C$ channels. After these operations, the network obtains a preliminary matching result that focuses on localizing the target center rather than confirming the outline of the target rectangle.
Subsequently, a Depth-wise Cross Correlation operation is performed between the template region feature map $T$ and the preliminary response map $R_p$, which emphasizes locating the target bounding box, and the resulting final response map is fed into the target prediction subnetwork. Through this multi-level matching method, more precise prediction of both the center point location and the target rectangle can be achieved. The entire formula of the MM module is shown as follows:

$$R = T \star \mathrm{Conv}_{1 \times 1}\big(\mathrm{Cat}\big(\varphi(T),\, S\big)\big),$$

where $R$ denotes the final output response map, $\star$ denotes the Depth-wise Cross Correlation, $\mathrm{Cat}(\cdot)$ denotes the concatenation operation described above that embeds $T'$ into $S$, and $\varphi(\cdot)$ denotes the process of converting the template feature map $T$ into the feature map $T'$ of size $C \times 1 \times 1$.
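To make these steps concrete, here is a minimal sketch of the MM module under the notation above; `torchvision.ops.roi_align` stands in for the ROI Align step, and the $1 \times 1$ ROI output size and 256-channel width are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import roi_align

def dw_xcorr(x, kernel):
    """Depth-wise cross correlation: each template channel acts as a filter
    for the corresponding channel of x."""
    batch, c = kernel.shape[:2]
    x = x.reshape(1, batch * c, x.size(2), x.size(3))
    kernel = kernel.reshape(batch * c, 1, kernel.size(2), kernel.size(3))
    out = F.conv2d(x, kernel, groups=batch * c)
    return out.reshape(batch, c, out.size(2), out.size(3))

class MMModule(nn.Module):
    """Sketch of multi-level matching: fine target-center embedding first,
    then coarse Depth-wise Cross Correlation on the preliminary response."""
    def __init__(self, channels=256):
        super().__init__()
        self.reduce = nn.Conv2d(2 * channels, channels, kernel_size=1)  # 2C -> C

    def forward(self, T, S, target_boxes):
        # phi: crop the target-only region T_obj from T (via the label box)
        # and pool it to a C x 1 x 1 vector T' with ROI Align.
        idx = torch.arange(T.size(0), device=T.device, dtype=T.dtype).unsqueeze(1)
        rois = torch.cat([idx, target_boxes], dim=1)      # (N, 5): batch_idx, x1, y1, x2, y2
        t_prime = roi_align(T, rois, output_size=(1, 1))  # (N, C, 1, 1)
        # Embed T' at every spatial position of S and reduce 2C -> C to get
        # the preliminary response map R_p (target-center matching).
        s_aug = torch.cat([S, t_prime.expand(-1, -1, S.size(2), S.size(3))], dim=1)
        r_p = self.reduce(s_aug)
        # Coarse matching: DW-XCorr between template features and R_p
        # constrains the target contours, yielding the final response map R.
        return dw_xcorr(r_p, T)
```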