Article

STFTrack: Spatio-Temporal-Focused Siamese Network for Infrared UAV Tracking

Department of Automation, Rocket Force University of Engineering, Xi’an 710025, China
* Author to whom correspondence should be addressed.
Drones 2023, 7(5), 296; https://doi.org/10.3390/drones7050296
Submission received: 25 March 2023 / Revised: 21 April 2023 / Accepted: 26 April 2023 / Published: 28 April 2023
(This article belongs to the Special Issue Advances in UAV Detection, Classification and Tracking-II)

Abstract

The rapid popularity of UAVs has encouraged the development of Anti-UAV technology. Infrared-detector-based visual tracking provides an encouraging solution for Anti-UAV missions. However, it still faces the problem of tracking instability caused by environmental thermal crossover and similar distractors. To address these issues, we propose a spatio-temporal-focused Siamese network for infrared UAV tracking, called STFTrack. This method employs a two-level target focusing strategy from global to local. First, a feature pyramid-based Siamese backbone is constructed to enhance the feature expression of infrared UAVs through cross-scale feature fusion. By combining template and motion features, we guide prior anchor boxes towards the suspicious region to enable adaptive search region selection, thus effectively suppressing background interference and generating high-quality candidates. Furthermore, we propose an instance-discriminative RCNN based on metric learning to focus on the target UAV among candidates. By calculating the feature distance between the candidates and the template, it assists in discriminating the optimal target from the candidates, thus improving the discriminative ability of the proposed method for infrared UAVs. Extensive experiments on the Anti-UAV dataset demonstrate that the proposed method achieves outstanding performance for infrared tracking, with 91.2% precision, 66.6% success rate, and 67.7% average overlap accuracy, exceeding the baseline algorithm by 2.3%, 2.7%, and 3.5%, respectively. The attribute-based evaluation demonstrates that the proposed method achieves robust tracking in challenging scenes such as fast motion, thermal crossover, and similar distractors. Evaluation on the LSOTB-TIR dataset shows that the proposed method reaches a precision of 77.2% and a success rate of 63.4%, outperforming other advanced trackers.

1. Introduction

With the rapid spread of UAV technology [1,2], the risk of UAV abuse has significantly increased, posing a considerable threat to public safety [3,4]. Effectively countering non-cooperative UAVs has become an urgent social security problem.
As a critical part of the Anti-UAV system, accurately tracking non-cooperative UAVs forms the basis for further countermeasures. In recent years, infrared-detector-based visual tracking has become an increasingly popular solution for Anti-UAV missions thanks to its numerous advantages, including its compact size, good visual quality, and round-the-clock operation. Thus, infrared UAV tracking has served as a hot topic in the Anti-UAV field [5].
Given the initialized bounding box of the UAV in the first frame, the purpose of infrared UAV tracking is to automatically locate the UAV of interest in successive frames. Compared with natural object tracking, infrared UAV tracking poses three distinct challenges: (1) the small scale and sparse texture features of infrared UAVs make them easily susceptible to environmental disturbances; (2) the fast motion and frequent out-of-view attributes of infrared UAVs often result in tracking failures; (3) environmental thermal crossover and distractors make it difficult to distinguish UAVs from other aerial objects in infrared images. Given these challenges, designing a robust and efficient tracking method for infrared UAVs remains difficult.
Recently, visual object tracking has undergone rapid development, especially for Siamese-based trackers, which have achieved many remarkable outcomes. For instance, SiamFC [6] reformulated the visual tracking task as a template matching problem, providing a new paradigm for subsequent Siamese-based trackers. SiamRPN [7] introduced the RPN module into visual tracking, realizing accurate localization and precise box regression. Since then, many improvements have been made regarding online updating [8,9], robust feature extraction [10,11,12], and similar distractor suppression [13,14,15]. Despite these developments, the aforementioned algorithms, known as short-term tracking methods, are based on a local smoothing assumption, which limits the search range to a fixed neighborhood around the target center in the previous frame. When the UAV suddenly moves or goes out of view, as is common in infrared UAV tracking, this assumption results in error accumulation and even tracking failure.
Long-term tracking methods exhibit global search capabilities to recapture a lost target, making them well suited for infrared UAV tracking. Existing long-term tracking algorithms can be divided into two types: combined tracking methods and global tracking methods. The former, represented by TLD [16], SPLT [17], and LTMU [18], mainly consist of a short-term tracker and a global detector. When the local tracker fails, the global detector is activated to relocate the lost target, forming a local-to-global search pipeline. Some work has aimed to accomplish infrared UAV tracking by enhancing such algorithms. For instance, Zhao et al. [19] proposed a unified infrared UAV tracking framework combining a local tracker, a camera motion estimation module, a bounding box refinement module, a re-detection module, and a model updater. Despite performance gains, this algorithm entails multiple independent modules and a loose overall structure, hindering optimal parameter tuning. Moreover, such combined tracking methods struggle to accurately determine when to switch, given the weak semantic features and environmental interference in complex infrared scenes. The latter, represented by GlobalTrack [20] and SiamRCNN [21], employ Siamese-based global search tracking, foregoing local search and tracking targets across the entire frame to avoid drift caused by unreliable switching. Owing to their compact structures and end-to-end optimization, global tracking methods are gaining popularity in infrared UAV tracking. Fang H. et al. [22] designed a YOLO-based global tracker called SiamYOLO, achieving efficient tracking of infrared UAVs. Chen J. et al. [23] incorporated a local tracking strategy based on spatio-temporal attention into SiamRCNN and realized adaptive switching between local tracking and global detection, leading to robust tracking of infrared UAVs. Shi X. et al. [24] developed a graph attention-based Siamese tracker for infrared UAVs based on SiamRCNN, using Lucas-Kanade optical flow to select either local tracking or global re-detection. These methods adapt the global tracking paradigm with different strategies and enhance the performance of infrared UAV tracking. However, these strategies are post-processing approaches applied to the tracking results and do not improve the feature-level representation of infrared UAV targets. Therefore, tracking remains unstable in complex environments with thermal crossover and similar distractors.
We contend that two critical factors contribute to the above phenomenon. One is the fixed search region. Neither local search nor global search is optimal for infrared UAV tracking. Local search improves search accuracy by focusing on a small region, but at the cost of reduced recall due to the limited search range. Global search improves search recall by recapturing lost targets, but it also introduces more background interference, causing infrared UAVs to become submerged in cluttered backgrounds and decreasing accuracy. Consequently, it is favorable to incorporate local constraints into a global framework for more robust tracking. The other is the weak discrimination of target features. The lack of color and texture of infrared UAVs makes it challenging to describe the difference between targets and distractors. Moreover, expanding the search region leads to a large increase in negative samples. Excessive simple negative samples interfere with the discriminative ability of features, making it more difficult to distinguish the target from hard negative samples, i.e., similar distractors. Therefore, it is necessary to strengthen the discriminability of target features to accurately identify the target among confusing candidates.
Motivated by the above analysis, we propose a novel spatio-temporal-focused Siamese network for infrared UAV tracking called STFTrack. Different from the aforementioned methods, STFTrack realizes an adaptive target search region mechanism, effectively reducing the interference from background negative samples. Additionally, STFTrack adopts a two-stage tracking framework with a two-level target-focusing mechanism and incorporates instance-level feature metric learning to enhance the discrimination of hard negative samples. Specifically, to achieve search region focusing, we propose a spatio-temporal information-guided region proposal network (STG-RPN). Under the guidance of template target features and motion cues, the STG-RPN predicts a probability distribution of the target location over the entire image. This distribution further guides prior anchors to focus on candidate regions, enabling a soft adaptive search region selection mechanism. With this focused search region, the interference from negative samples caused by an expanded search region is alleviated, enhancing the discriminability of target features and generating high-quality candidate targets. To further focus on the infrared UAV among the candidate targets, we design an instance discrimination region-CNN (ID-RCNN). Using metric learning, the target characteristics are comprehensively described from both the classification and measurement perspectives, which further enlarges the feature difference between the target and similar distractors. Experiments on the Anti-UAV and LSOTB-TIR datasets demonstrate that the proposed method can effectively handle interference caused by similar distractors and thermal crossover, thereby enabling the tracking of infrared UAVs in challenging scenarios.
In summary, this paper makes the following contributions:
(1)
We propose a spatio-temporal-focused Siamese network for infrared UAV tracking, which incorporates spatio-temporal constraints within a Siamese-based global search tracking framework, employing a two-level global-to-local focusing strategy.
(2)
To achieve search region focusing, a spatio-temporal information-guided region proposal network is proposed, which implements an adaptive region selection strategy to generate high-quality candidate targets.
(3)
To improve the accuracy of the target discrimination, we propose an instance discrimination region-CNN, which specifically focuses on the target among the candidate targets. Through metric learning, we further enhance the discrimination of the target features.

2. Proposed Method

The proposed method is a two-stage tracking framework, including a candidate target generation stage and an instance discrimination stage. The overall framework, as shown in Figure 1, consists of a Siamese feature extraction network, an STG-RPN, and an ID-RCNN. The Siamese feature extraction network extracts deep features from the template and search images. The STG-RPN generates candidate targets that are spatio-temporally similar to the template target. The ID-RCNN further identifies the UAV among the proposed candidates.

2.1. Deep Feature Extraction Based on the Siamese Network

Due to factors such as the small scale, lack of color, and inconspicuous texture of infrared UAVs, combining semantic and location characteristics within single-scale features is challenging. One commonly employed approach to address this issue is fusing low-level and high-level features. For this, we construct a feature pyramid-based Siamese network, which performs multi-scale feature fusion to obtain more robust deep features. The network comprises a template branch and a search branch, with shared weights in both branches. Given a pair of infrared images, i.e., the template image $I_Z$ and the search image $I_x$, a weight-shared backbone $\varphi$ is first used to extract the respective feature maps $z = \varphi(I_Z)$ and $x = \varphi(I_x)$. Then, we adopt the FPN [25], denoted as $\phi$, to fuse the multi-level features from the backbone, obtaining multi-scale template image features $Z = \{Z_i \mid i = 1, 2, 3\} = \phi(z)$ and search image features $X = \{X_i \mid i = 1, 2, 3\} = \phi(x)$ from three levels (P3, P4, and P5). In addition, we extract multi-scale template target features $F_i$ from the template image features $Z_i$ using
$F_i = \psi(Z_i, B_1)$,  (1)
where $\psi$ represents the ROIAlign operation and $B_1$ is the initial bounding box. By virtue of the weight sharing, the output features of the two branches are semantically consistent, facilitating the identification of common objects in both images.
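For concreteness, the following is a minimal sketch of this feature-extraction branch, assuming a torchvision ResNet-50 backbone, an FPN over the C3-C5 stages, and ROIAlign for Equation (1); the module names, strides, and example box are illustrative rather than the authors' implementation.

```python
import torch
import torchvision
from collections import OrderedDict
from torchvision.ops import FeaturePyramidNetwork, roi_align


class SiameseFPNExtractor(torch.nn.Module):
    """Weight-shared backbone + FPN used by both the template and search branches."""

    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet50()
        self.stem = torch.nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu,
                                        resnet.maxpool, resnet.layer1)
        self.layer2, self.layer3, self.layer4 = resnet.layer2, resnet.layer3, resnet.layer4
        self.fpn = FeaturePyramidNetwork([512, 1024, 2048], out_channels=256)

    def forward(self, image):
        c2 = self.stem(image)
        c3 = self.layer2(c2)
        c4 = self.layer3(c3)
        c5 = self.layer4(c4)
        feats = self.fpn(OrderedDict([("p3", c3), ("p4", c4), ("p5", c5)]))
        return list(feats.values())            # [P3, P4, P5], each with 256 channels


extractor = SiameseFPNExtractor()
template_img = torch.randn(1, 3, 512, 640)     # I_Z (H x W = 512 x 640)
search_img = torch.randn(1, 3, 512, 640)       # I_x
Z = extractor(template_img)                    # multi-scale template image features
X = extractor(search_img)                      # multi-scale search image features

# Template target features F_i via ROIAlign on each pyramid level (Eq. (1)).
B1 = torch.tensor([[0, 300.0, 220.0, 340.0, 250.0]])   # [batch_idx, x1, y1, x2, y2]
strides = [8, 16, 32]                                   # P3, P4, P5 strides
F = [roi_align(Z[i], B1, output_size=7, spatial_scale=1.0 / strides[i])
     for i in range(3)]
```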

2.2. Candidate Target Generation Based on STG-RPN

During infrared UAV tracking, the target search region plays a critical role in the tracking performance. The traditional local search can be viewed as a hard motion-constrained selection strategy that conducts a search only in a limited area and ignores the remaining area information. While effective at mitigating background interference, it struggles to cope with the rapid movement and out-of-view attributes of UAVs. Conversely, the global search employs a uniform selection strategy that searches the entire frame without motion assumptions. While this strategy is effective in recovering lost UAVs, it suffers from more background interference, leading to tracking instability.
To address this issue, we introduce local constraints into the global search framework and propose a spatio-temporal information-guided RPN (STG-RPN). Under the guidance of spatio-temporal information, target search region focusing is achieved in a soft-selection manner, which further alleviates the sample imbalance. The overview of the STG-RPN is depicted in Figure 2; it mainly includes three phases: target location prediction, anchor shape prediction, and feature adaptation.

2.2.1. Target Location Prediction

Target tracking aims to search for the specific object that matches the appearance of the template target and its previous motion pattern. To achieve this, we use spatio-temporal cues to locate the infrared UAV in three stages: spatial location prediction, temporal location prediction, and spatio-temporal fusion prediction.
From a spatial perspective, the target infrared UAV is often found in regions that share similarities with the template target. Thus, spatial location prediction aims to identify the locations in the search image that resemble the template target and to generate a location probability map. Given the template target features $F_t \in \mathbb{R}^{k \times k \times c}$ and the search image features $F_s \in \mathbb{R}^{w \times h \times c}$, the spatial location probability map $F_{sl} \in \mathbb{R}^{w \times h \times 1}$ is formulated as
$F_{sl} = \sigma\left(f_c\left(F_t \star F_s\right)\right)$,  (2)
where $\star$ denotes the depth-wise cross-correlation operator [10], which is used to encode the template target information into the search image features; $f_c$ indicates the channel aggregation function, defined as a $1 \times 1$ convolution that converts the channel number from $c$ to 1; and $\sigma$ represents the sigmoid function, which compresses the output range to $[0, 1]$.
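A short sketch of the spatial location prediction of Equation (2) is given below, assuming the depth-wise cross-correlation is implemented as a grouped convolution with the template target feature as the kernel; the channel and map sizes are illustrative.

```python
import torch
import torch.nn.functional as F


def spatial_location_map(F_t, F_s, channel_agg):
    """F_t: (1, c, k, k) template target feature; F_s: (1, c, h, w) search feature;
    channel_agg: 1x1 conv reducing c channels to 1 (f_c in Eq. (2))."""
    c, k = F_t.shape[1], F_t.shape[-1]
    # Depth-wise cross-correlation: groups=c keeps channels independent.
    corr = F.conv2d(F_s, F_t.view(c, 1, k, k), groups=c, padding=k // 2)
    return torch.sigmoid(channel_agg(corr))        # F_sl in [0, 1], shape (1, 1, h, w)


f_c = torch.nn.Conv2d(256, 1, kernel_size=1)
F_t = torch.randn(1, 256, 7, 7)
F_s = torch.randn(1, 256, 64, 80)
F_sl = spatial_location_map(F_t, F_s, f_c)
```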
From a temporal perspective, the ideal motion of the infrared UAV is a continuous process. However, due to camera jitter and the shooting interval, the UAV's motion state can be classified into two categories: smooth motion and sudden motion. In the case of smooth motion, the motion states at adjacent moments are highly correlated, allowing a motion model to predict the current position of the UAV from its previous state. Conversely, if the UAV experiences sudden motion changes, the prediction becomes unreliable, and the UAV may appear at any location in the search image.
In this study, the Kalman filter framework is adopted to estimate the motion state of the infrared UAV. We formulate the target motion state as a four-dimensional vector $X = [x_c, y_c, \dot{x}_c, \dot{y}_c]^T$, where $(x_c, y_c)$ represents the coordinates of the target center point and $(\dot{x}_c, \dot{y}_c)$ denotes the velocity of the center point. Thus, the target position in the next frame can be predicted as
$X_t = A X_{t-1} + B U_t$,  (3)
where $A$ is the state transition matrix corresponding to the target motion model, $B$ is the control matrix, and $U_t$ is the control input at time $t$ (zero in this study). Here, we approximate the inter-frame displacement of the UAV as a uniform (constant-velocity) motion model. Accordingly, the observation model is expressed as $Z_t = H X_t + v_t$, where $H$ is the observation matrix and $v_t$ is the observation noise with covariance $R$ (assumed to be Gaussian). The covariance of the prediction error is $\tilde{P}_t = A P_{t-1} A^T + Q$, where $P_{t-1}$ is the Kalman estimation error covariance and $Q$ is the process noise covariance.
In order to determine the motion state of the UAV, a maneuver detection method based on the pseudo-innovation is adopted. In the Kalman filtering framework, the innovation refers to the difference between the predicted observation and the true observation. In our case, the highest peak position of the spatial location map $F_{sl}$ is selected as the estimate of the observation. Thus, the pseudo-innovation can be formulated as $n_t = \arg\max(F_{sl}) - \tilde{Z}_t$, where $\tilde{Z}_t$ denotes the predicted position. Based on $n_t$, we define the normalized innovation squared (NIS) as the indicator for maneuver detection. The NIS is calculated as
$\varepsilon_t = n_t^T S_t^{-1} n_t$,  (4)
where $S_t$ indicates the pseudo-innovation covariance, $S_t = H \tilde{P}_t H^T + R$. Since $\varepsilon_t$ follows a chi-square distribution, we set the maneuver detection threshold as
$P\left(\varepsilon_t \le \varepsilon_{max}\right) = 1 - \eta$,  (5)
where $\eta$ denotes the confidence level, set to $\eta = 0.05$ in this study, and $\varepsilon_{max}$ is the chi-square critical value corresponding to $\eta$. If $\varepsilon_t > \varepsilon_{max}$, the UAV is considered to have maneuvered.
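The prediction and maneuver-detection steps of Equations (3)-(5) can be sketched as follows, assuming a constant-velocity state transition and illustrative noise covariances; the chi-square critical value is taken at the $1 - \eta$ quantile for the 2-D observation.

```python
import numpy as np
from scipy.stats import chi2

dt = 1.0
A = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)      # constant-velocity transition (Eq. (3))
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)      # observe the center position only
Q = np.eye(4) * 1e-2                            # process noise covariance (assumed)
R = np.eye(2) * 1.0                             # observation noise covariance (assumed)


def predict_and_detect(x_prev, P_prev, z_peak, eta=0.05):
    """x_prev: previous state estimate; P_prev: its covariance;
    z_peak: argmax position of the spatial location map F_sl."""
    x_pred = A @ x_prev                          # Eq. (3) with U_t = 0
    P_pred = A @ P_prev @ A.T + Q
    z_pred = H @ x_pred
    n_t = z_peak - z_pred                        # pseudo-innovation
    S_t = H @ P_pred @ H.T + R
    nis = float(n_t.T @ np.linalg.inv(S_t) @ n_t)   # Eq. (4)
    eps_max = chi2.ppf(1.0 - eta, df=2)             # Eq. (5) threshold
    maneuvered = nis > eps_max
    return x_pred, P_pred, z_pred, maneuvered


x_pred, P_pred, z_pred, maneuvered = predict_and_detect(
    np.array([40.0, 30.0, 1.0, 0.5]), np.eye(4), np.array([42.0, 31.0]))
```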
When the target moves smoothly, we Gaussianize the motion-predicted position $\tilde{Z}_t$ based on the prediction error covariance $\tilde{P}_t$ to obtain the temporal location prediction map $F_{tl} \in \mathbb{R}^{w \times h \times 1}$. However, when the UAV motion suddenly changes, we set all entries of $F_{tl}$ to 1. Therefore, the temporal location prediction map $F_{tl} = \{f_{tl}^i \mid i \in (w, h)\}$ can be calculated as
$f_{tl}^i = \begin{cases} \exp\left(-\left\| m_i - \tilde{Z}_t \right\|_2 / (\alpha p_t)\right), & \text{if moving smoothly} \\ 1, & \text{otherwise} \end{cases}$,  (6)
where $m_i$ is the coordinate of the $i$-th point in $F_{tl}$, $\|\cdot\|_2$ represents the Euclidean distance, $p_t$ is the variance of the position prediction error, and $\alpha$ denotes a tradeoff hyperparameter ($\alpha = 5$ in our case). After obtaining the spatial and temporal location maps, we fuse them by element-wise multiplication. Thus, the target location map is calculated as
$F_l = F_{sl} \odot F_{tl}$,  (7)
where $\odot$ denotes the Hadamard product.
Remark 1: 
When the infrared UAV moves smoothly, the regions far from the predicted position in the spatial location map are increasingly suppressed by the temporal location map, which helps to reduce interference from similar objects. When the UAV motion suddenly changes, the temporal location map becomes an all-ones map. In this case, the target location probability map degenerates into the spatial location map, serving as a preventive measure against inaccurate motion predictions.
Subsequently, a probability threshold is applied to binarize the target location probability map. We select the high-confidence regions to establish anchor points, thus achieving region focusing at the anchor level. Figure 3 illustrates an example of the target location prediction process. It can be seen that a focused search region for setting anchor points is extracted under the guidance of spatio-temporal information.
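A compact sketch of the temporal map of Equation (6), the fusion of Equation (7), and the subsequent thresholding might look as follows; the map size, predicted position, variance, and spatial map are placeholders.

```python
import torch


def temporal_location_map(h, w, z_pred, p_t, smooth, alpha=5.0):
    """Eq. (6): Gaussian map around the Kalman-predicted center z_pred = (x, y)
    when motion is smooth; an all-ones map otherwise."""
    if not smooth:
        return torch.ones(h, w)
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    dist = torch.sqrt((xs - z_pred[0]) ** 2 + (ys - z_pred[1]) ** 2)
    return torch.exp(-dist / (alpha * p_t))


F_sl = torch.rand(64, 80)                        # spatial location map from Eq. (2) (placeholder)
F_tl = temporal_location_map(64, 80, z_pred=(40.0, 30.0), p_t=4.0, smooth=True)
F_l = F_sl * F_tl                                # Eq. (7): Hadamard product
anchor_region = F_l > 0.1                        # binarized high-confidence region for anchor points
```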

2.2.2. Anchor Shape Prediction

Following the prediction of the anchor location, the optimal anchor shape is estimated to obtain the anchor box that is most similar to the ground truth. Given the encoded features $F_m \in \mathbb{R}^{w \times h \times c}$, we employ a nonlinear transformation to map $F_m$ into the shape prediction map $F_{sh} \in \mathbb{R}^{w \times h \times 2}$, whose two channels represent the length and width of the anchor box. Here, we model the nonlinear transformation using a $1 \times 1$ convolution layer with two output channels and a Leaky ReLU activation function.
To improve the stability of the prediction process, we adopt the deflation transformation described in [26] to convert the length and width of the anchor box into $dh$ and $dw$. The deflation transformation is formulated as
$w = \tau s e^{dw}, \quad h = \tau s e^{dh}$,  (8)
where $s$ is the stride of the feature map and $\tau$ is the scale factor ($\tau = 8$ in this study). Combining the anchor points and anchor shapes, we can derive preset anchor boxes that are focused on the target.
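As a rough illustration of the shape head and the deflation transform of Equation (8), assuming a 256-channel encoded feature and the stated $1 \times 1$ convolution with Leaky ReLU; the channel count and feature size are assumptions.

```python
import torch

# Shape head: 1x1 conv with two output channels + Leaky ReLU (as described above).
shape_head = torch.nn.Sequential(torch.nn.Conv2d(256, 2, kernel_size=1),
                                 torch.nn.LeakyReLU())
F_m = torch.randn(1, 256, 64, 80)          # encoded search features
d = shape_head(F_m)                        # (1, 2, h, w): per-location (dw, dh)

stride, tau = 8, 8                         # feature stride s and scale factor tau
anchor_w = tau * stride * torch.exp(d[:, 0])   # Eq. (8): w = tau * s * exp(dw)
anchor_h = tau * stride * torch.exp(d[:, 1])   # Eq. (8): h = tau * s * exp(dh)
```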

2.2.3. Feature Adaptation

Infrared UAVs exhibit variable scales, affecting the shape prediction of anchor boxes at different locations. This leads to a feature mismatch problem, which is detrimental to tracking accuracy. To tackle this problem, we implement the feature adaptation module proposed in [26], which enhances the adaptability of features to anchor boxes of different shapes. First, we employ a $1 \times 1$ convolution layer to calculate an offset field according to the anchor shape. Then, a deformable convolution layer using this offset is applied to adjust the original features. The feature adaptation process is expressed as
$F_a = N_T\left(F_m, (w, h)\right)$,  (9)
where $F_a$ is the adapted feature, $F_m$ is the encoded feature, and $(w, h)$ denotes the anchor shape. $N_T$ represents the feature adaptation module, which is implemented by a $3 \times 3$ deformable convolution layer.
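A sketch of this adaptation step under the stated design (a $1 \times 1$ convolution predicting offsets from the anchor shape, followed by a $3 \times 3$ deformable convolution) is shown below using torchvision's DeformConv2d; channel sizes are assumptions, not the authors' code.

```python
import torch
from torchvision.ops import DeformConv2d


class FeatureAdaptation(torch.nn.Module):
    """Eq. (9): F_a = N_T(F_m, (w, h)), following the guided-anchoring design [26]."""

    def __init__(self, channels=256):
        super().__init__()
        # Offset field: 2 offsets (dx, dy) per position of the 3x3 kernel.
        self.offset_conv = torch.nn.Conv2d(2, 2 * 3 * 3, kernel_size=1)
        self.adapt_conv = DeformConv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, F_m, anchor_shape):
        """F_m: (b, c, h, w) encoded features; anchor_shape: (b, 2, h, w) predicted (w, h)."""
        offset = self.offset_conv(anchor_shape)
        return self.adapt_conv(F_m, offset)


adapt = FeatureAdaptation()
F_a = adapt(torch.randn(1, 256, 64, 80), torch.randn(1, 2, 64, 80))
```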
We use the adapted features to perform classification and regression on the predefined anchor boxes, obtaining high-quality candidate targets. The structure of the classification and regression heads is consistent with the traditional RPN [27], except that we use masked convolution [28] instead of traditional convolution. During tracking, the masked convolution uses the anchor region as a filter, confining the feature computation to the region of interest while ignoring irrelevant regions. This leads to a significant increase in computational efficiency compared to classical point-by-point calculation.

2.3. Target Discrimination Based on ID-RCNN

Extracting discriminative features from infrared images is challenging for neural networks due to the weak color and texture information. Therefore, it is difficult to differentiate infrared UAVs from other similar targets. In particular, when the infrared UAV is immersed in background clutter, the candidates generated by the STG-RPN may include many similar distractors. These distractors seriously affect the determination of the tracking result, causing tracking instability.
Based on the QG-RCNN [20], we design the ID-RCNN module to handle the above challenge. Our ID-RCNN incorporates deep metric learning (DML) to enhance the discriminability of target descriptions, allowing a better focus on the desired target among the candidates. The structure of the ID-RCNN is illustrated in Figure 4 and mainly consists of two branches, i.e., the QG-RCNN and the DML. The QG-RCNN branch is responsible for the classification and regression of candidate proposals, while the DML branch is leveraged to further discriminate confusable candidates.

2.3.1. QG-RCNN Branch

The QG-RCNN consists of two stages: feature modulation and the RCNN. Given the template target feature $F_t \in \mathbb{R}^{k \times k \times c}$ and the $j$-th candidate feature $F_p^j \in \mathbb{R}^{k \times k \times c}$, the feature modulation can be expressed as
$\tilde{F}_p^j = h_{out}\left(h_t(F_t) \odot h_p(F_p^j)\right)$,  (10)
where $\odot$ denotes the Hadamard product, $h_t$ and $h_p$ are feature projection functions realized by a $3 \times 3$ convolution with one-pixel padding, and $h_{out}$ is the output projection function that retains the same size and channels as $F_p^j$. Following this, the classical RCNN is adopted to predict the classification score $S_p^j$ and regression parameters $B_p^j$ of each candidate.
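The modulation of Equation (10) can be sketched as below, assuming 256-channel ROI features of size 7 × 7 and treating the output projection as another padded convolution; these choices are illustrative.

```python
import torch
import torch.nn as nn


class FeatureModulation(nn.Module):
    """Eq. (10): template-guided modulation of candidate ROI features."""

    def __init__(self, channels=256):
        super().__init__()
        self.h_t = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.h_p = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.h_out = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, F_t, F_p):
        """F_t: (1, c, k, k) template target feature; F_p: (N, c, k, k) candidate features."""
        return self.h_out(self.h_t(F_t) * self.h_p(F_p))   # Hadamard product, broadcast over N


mod = FeatureModulation()
F_t = torch.randn(1, 256, 7, 7)
F_p = torch.randn(32, 256, 7, 7)          # 32 test-time candidates (Section 3.2)
F_p_mod = mod(F_t, F_p)
```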

2.3.2. DML Branch

Let $X \in \mathbb{R}^{k \times k \times c}$ be the source space feature and $\hat{X} \in \mathbb{R}^{1 \times 1 \times n}$ be the embedding space feature. The DML aims to find a feature mapping function $f_{emb}: \mathbb{R}^{1 \times 1 \times c} \rightarrow \mathbb{R}^{1 \times 1 \times n}$ that ensures positive candidates are close to the template target and far from negative ones in the embedding space [29]. To facilitate the measurement between the template target and the candidates, we first use global average pooling (GAP) to convert the source space feature into a vector. Then, $f_{emb}$ is applied to map this vector into the embedding feature $\hat{X} \in \mathbb{R}^{1 \times 1 \times n}$. We model the feature mapping function as
$f_{emb}(X) = f_2\left(f_a\left(f_1(X)\right)\right)$,  (11)
where $f_1$ and $f_2$ indicate fully connected layers and $f_a$ denotes the Leaky ReLU function.
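A minimal sketch of this embedding head (GAP followed by Equation (11)) is given below; the embedding dimension n = 128 is an assumption, not a value stated in the paper.

```python
import torch
import torch.nn as nn


class EmbeddingHead(nn.Module):
    """GAP + two fully connected layers with a Leaky ReLU (Eq. (11))."""

    def __init__(self, channels=256, embed_dim=128):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.f1 = nn.Linear(channels, channels)
        self.f_a = nn.LeakyReLU()
        self.f2 = nn.Linear(channels, embed_dim)

    def forward(self, x):                        # x: (N, c, k, k) source space feature
        v = self.gap(x).flatten(1)               # (N, c) pooled vector
        return self.f2(self.f_a(self.f1(v)))     # (N, n) embedding feature


emb = EmbeddingHead()
F_hat_t = emb(torch.randn(1, 256, 7, 7))         # template embedding
F_hat_p = emb(torch.randn(32, 256, 7, 7))        # candidate embeddings
```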

2.3.3. Target Discrimination

We apply $f_{emb}$ to convert the template target and the candidates into a consistent embedding space, where a smaller feature distance indicates higher similarity. Accordingly, we define the embedding similarity between two targets as
$D_p^j = \varphi\left(\hat{F}_p^j, \hat{F}_t\right) = \exp\left(-\left\| \hat{F}_p^j - \hat{F}_t \right\|_2^2\right)$,  (12)
where $\hat{F}_t$ and $\hat{F}_p^j$ represent the embedding features of the template target and the $j$-th candidate, respectively. Following the QG-RCNN and DML processing, the affinity between the template target and the candidates is comprehensively described from both the classification and measurement perspectives. We further integrate the similarity and the classification scores to generate the final discriminant scores, denoted by
$\kappa_p = \left\{ D_p^j \times S_p^j \mid j = 1, 2, \ldots, N \right\}$,  (13)
where $N$ indicates the total number of candidates. We adopt a similar approach to GlobalTrack to determine the final tracking result. Specifically, we select the candidate with the highest discriminant score as the final target and apply its regression parameters to generate a more precise bounding box.
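The scoring of Equations (12) and (13) then reduces to a few tensor operations, sketched here with placeholder embeddings and RCNN outputs.

```python
import torch

F_hat_t = torch.randn(1, 128)                    # template embedding (placeholder)
F_hat_p = torch.randn(32, 128)                   # candidate embeddings (placeholder)
S_p = torch.rand(32)                             # RCNN classification scores
B_p = torch.rand(32, 4)                          # refined boxes from the RCNN regressor

D_p = torch.exp(-((F_hat_p - F_hat_t) ** 2).sum(dim=1))   # Eq. (12): embedding similarity
kappa = D_p * S_p                                          # Eq. (13): discriminant scores
best = int(torch.argmax(kappa))
tracked_box = B_p[best]                          # highest-scoring candidate is the result
```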

2.4. Loss Function

In this study, we train the proposed model end-to-end using a multi-task loss function, including STG-RPN and ID-RCNN losses, denoted as
$L_{total} = L_{STG\text{-}RPN} + L_{ID\text{-}RCNN}$.  (14)
For the STG-RPN part, there are three learning objectives, i.e., spatial location prediction, anchor shape prediction, and RPN prediction for anchor boxes. The joint loss is calculated as
$L_{STG\text{-}RPN} = \lambda_1 L_{loc} + \lambda_2 L_{shape} + L_{RPN}$,  (15)
where $\lambda_1$ and $\lambda_2$ denote the weighting factors, $L_{loc}$ represents the spatial location loss, $L_{shape}$ indicates the anchor shape loss, and $L_{RPN}$ denotes the traditional RPN loss, which includes a binary cross-entropy loss for anchor classification and a smooth L1 loss for anchor regression.
Considering the small size of UAV targets in infrared images, we adopt focal loss [30] as the spatial location loss to predict the anchor location. The focal loss is expressed as
$L_{loc}(p, p^*) = -\frac{1}{N_p} \sum_{i=1}^{N_p} \left[ \beta\, p_i^* (1 - p_i)^{\gamma} \log(p_i) + (1 - \beta)(1 - p_i^*)\, p_i^{\gamma} \log(1 - p_i) \right]$,  (16)
where $N_p$ denotes the total number of pixels in the location map; $p_i$ indicates the predicted probability for spatial localization; $p_i^*$ denotes the corresponding ground-truth label generated as in [26]; and $\beta$ and $\gamma$ are adjustment factors, set to $\beta = 0.25$ and $\gamma = 2$ in this study.
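A direct implementation sketch of Equation (16) with β = 0.25 and γ = 2 follows; the clamping constant is added only for numerical stability and the input sizes are placeholders.

```python
import torch


def location_focal_loss(p, p_star, beta=0.25, gamma=2.0, eps=1e-6):
    """p, p_star: (N,) predicted probabilities and 0/1 labels over all map pixels (Eq. (16))."""
    p = p.clamp(eps, 1.0 - eps)
    pos = beta * p_star * (1.0 - p) ** gamma * torch.log(p)
    neg = (1.0 - beta) * (1.0 - p_star) * p ** gamma * torch.log(1.0 - p)
    return -(pos + neg).mean()


loss = location_focal_loss(torch.rand(5120), torch.randint(0, 2, (5120,)).float())
```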
The matching strategy in [26] is used to associate each anchor with the ground truth, and the bounded IOU loss [31] is used for anchor shape prediction:
$L_{shape}(w, h, w^*, h^*) = L_1\left(1 - \min\left(\frac{w}{w^*}, \frac{w^*}{w}\right)\right) + L_1\left(1 - \min\left(\frac{h}{h^*}, \frac{h^*}{h}\right)\right)$,  (17)
where $(w, h)$ represents the predicted anchor shape, $(w^*, h^*)$ denotes the matched ground truth, and $L_1$ represents the smooth L1 loss.
The ID-RCNN part comprises two losses: metric loss and RCNN loss. The overall loss is calculated as
$L_{ID\text{-}RCNN} = \lambda_3 L_{metric} + L_{RCNN}$,  (18)
where $\lambda_3$ is the balancing factor; $L_{metric}$ indicates the metric loss; and $L_{RCNN}$ denotes the RCNN loss, which is composed of a binary cross-entropy loss and a smooth L1 loss for candidate classification and regression, respectively.
Regarding the metric loss, the traditional triplet loss [32] only considers the relative distance between positive and negative samples, without constraining the absolute distance between positive samples. As a result, the embedding distances between positive samples and the template target are not compact enough, which disturbs the tracking results. Therefore, a triplet loss with an absolute distance constraint [33] is employed as our metric loss. Here, the template target is set as the anchor sample, while candidate targets with IOUs greater than 0.5 and less than 0.35 are labeled as positive samples (P) and negative samples (N), respectively. Thus, we construct the triplet sample set as $T = \left\{ \left(X_o, X_+^i, X_-^j\right) \mid i \in P, j \in N \right\}$. The constrained metric loss can be computed as
$L_{metric}(X) = \frac{1}{K_p} \sum_{i=1}^{K_p} \left[ \log\left(1 + \exp\left(D(X_o^i, X_+^i) - D(X_o^i, X_-^i) + m\right)\right) + \nu \max\left(0, D(X_o^i, X_+^i) - n\right) \right]$,  (19)
where $K_p$ denotes the total number of triplet samples, $X_o$ denotes the anchor sample, $X_+$ denotes the positive sample, $X_-$ denotes the negative sample, $D$ represents the distance function (the Euclidean distance in this study), $m$ indicates the margin between positive and negative samples, $n$ indicates the distance threshold between positive samples, and $\nu$ denotes a balance parameter.
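Equation (19) can be sketched as follows, using the Euclidean distance and the parameter values from Section 3.2 (m = n = 0.35, ν = 1); the batch size and embedding dimension are placeholders.

```python
import torch


def constrained_triplet_loss(anchor, pos, neg, m=0.35, n=0.35, nu=1.0):
    """anchor, pos, neg: (K, d) embeddings of the template, positive, and negative candidates."""
    d_ap = torch.norm(anchor - pos, dim=1)              # D(X_o, X_+)
    d_an = torch.norm(anchor - neg, dim=1)              # D(X_o, X_-)
    relative = torch.log1p(torch.exp(d_ap - d_an + m))  # soft relative-margin term
    absolute = nu * torch.clamp(d_ap - n, min=0.0)      # absolute constraint on positive pairs
    return (relative + absolute).mean()


loss = constrained_triplet_loss(torch.randn(16, 128), torch.randn(16, 128),
                                torch.randn(16, 128))
```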

2.5. Tracking Process

Based on the above description of the key modules of our STFTrack, the main tracking flow is presented in Algorithm 1.
Algorithm 1: STFTrack: Spatio-temporal-focused Siamese network for infrared UAV tracking.
1: Input: Tracking sequence of length L, $\{I_1, I_2, \ldots, I_L\}$; initial target bounding box $B_1 = (x_c^1, y_c^1, w^1, h^1)$.
   Output: Target bounding box for each frame, $\{B_f \mid f = 1, 2, \ldots, L\}$.
2: for f = 1 to L do
3:   if f == 1 then
4:     Construct multi-scale template image features;
5:     Extract multi-scale template target features using ROIAlign;
6:     Initialize the Kalman filter.
7:   else
8:     Compute the multi-scale search feature map;
9:     Compute the target location map using Equations (2), (6), and (7), and set anchor points in the region above threshold t;
10:    Predict the anchor shape at each anchor point and obtain focused anchor boxes;
11:    Obtain the adapted features using Equation (9);
12:    Using the focused anchor boxes and adapted features, obtain ROIs with candidate features through the RPN;
13:    Calculate the modulated features using Equation (10);
14:    Using the modulated features, obtain the classification scores and regression parameters of each candidate through the RCNN;
15:    Calculate the similarity between each candidate and the template target using Equation (12);
16:    Obtain the discriminant scores using Equation (13);
17:    Choose the candidate with the highest discriminant score as the tracking target and output the refined bounding box $B_f = (x_c^f, y_c^f, w^f, h^f)$;
18:    Update the Kalman filter.
19: end for

3. Experiments

3.1. Datasets and Evaluation Metrics

3.1.1. Datasets

We evaluate our proposed approach on two datasets: Anti-UAV [34] and LSOTB-TIR [35]. The Anti-UAV dataset contains over 580,000 manually annotated boxes, covering 318 high-quality infrared UAV flight videos, each about 1000 frames in length. It presents a realistic and dynamic tracking environment with challenging attributes such as thermal crossover, similar distractors, and frequent disappearance and reappearance. We evaluate the performance of the proposed method for infrared UAV tracking on the Anti-UAV dataset. The LSOTB-TIR dataset contains 1400 infrared video sequences with 47 different target types and more than 600,000 frames. It is a large and diverse general infrared target tracking dataset with 12 challenging attributes, including non-rigid deformation, scale change, and motion blur. We use the LSOTB-TIR training set to pre-train the proposed method to improve its adaptation to infrared images, and its test set to validate the tracking performance of the approach on generic infrared targets.

3.1.2. Evaluation Metrics

Tracking stability is a crucial factor that significantly impacts the efficacy of visual tracking methods. This term encompasses both the precision of tracking results and the robustness to environmental interferences and occlusions. Improving tracking stability is essential for ensuring reliable and accurate tracking performance. In our study, we use the one-pass evaluation (OPE) test mode with three quantitative indicators: precision, success rate, and average overlap accuracy. The precision is defined as the percentage of frames in which the center location error (CLE) between the predicted target and the ground truth is less than a certain threshold. The precision can be calculated as
$precision = \frac{1}{M} \sum_{t=1}^{M} \delta\left(C_t < T_c\right)$,  (20)
where $M$ denotes the total number of video frames, $C_t$ represents the CLE between the $t$-th predicted target and the ground truth, $T_c$ denotes the threshold, and $\delta$ is an indicator function: $\delta(C_t < T_c) = 1$ when $C_t < T_c$, and $\delta(C_t < T_c) = 0$ otherwise. The CLE is defined as
$CLE = \left\| L_p - L_g \right\|_2$,  (21)
where $\|\cdot\|_2$ denotes the L2-norm, and $L_p$ and $L_g$ represent the center positions of the predicted target and the ground truth, respectively.
The success rate is the proportion of successful frames with an IOU greater than the set threshold between the predicted target box and the ground truth. The success rate can be calculated as
$success = \frac{1}{M} \sum_{t=1}^{M} \delta\left(IOU_t > T_o\right)$,  (22)
where $IOU_t$ represents the IOU between the $t$-th predicted box and its corresponding ground truth, and $T_o$ denotes the set threshold.
To quantify the proposed approach, we calculate the precision and success rate with different thresholds to generate the precision plot and success rate plot. We obtain the AUC (area under curve) of the success rate plot and record the precision at the 20-pixel threshold as representative metrics. In addition, we adopt the average overlap accuracy defined in [34] to evaluate the long-term tracking performance of trackers. The average overlap accuracy is defined as
$accuracy = \frac{1}{M} \sum_{t=1}^{M} \left[ IOU_t \times \delta\left(v_t > 0\right) + p_t \times \left(1 - \delta\left(v_t > 0\right)\right) \right]$,  (23)
where $v_t$ is the ground-truth visibility flag of the target and $\delta$ is an indicator function: when the target is visible, $v_t = 1$ and $\delta(v_t > 0) = 1$; when the target is invisible, $v_t = 0$ and $\delta(v_t > 0) = 0$. $p_t$ denotes the prediction indicator: when the prediction is empty, $p_t = 1$; otherwise, $p_t = 0$.
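For reference, the three metrics of Equations (20)-(23) can be computed from per-frame records as in the following sketch; the input arrays are placeholders.

```python
import numpy as np


def precision(cle, t_c=20.0):
    return float(np.mean(cle < t_c))                     # Eq. (20)


def success(iou, t_o=0.5):
    return float(np.mean(iou > t_o))                     # Eq. (22)


def average_overlap_accuracy(iou, visible, pred_empty):
    visible = visible.astype(float)
    return float(np.mean(iou * visible + pred_empty * (1.0 - visible)))   # Eq. (23)


cle = np.random.rand(1000) * 40                          # per-frame center location errors (Eq. (21))
iou = np.random.rand(1000)                               # per-frame IOU with the ground truth
visible = np.random.rand(1000) > 0.1                     # ground-truth visibility flags v_t
pred_empty = (np.random.rand(1000) < 0.1).astype(float)  # 1 when the tracker predicts no target
print(precision(cle), success(iou), average_overlap_accuracy(iou, visible, pred_empty))
```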

3.2. Implementation Details

Experimental platform: We conduct algorithm training and evaluation on the same high-performance workstation. The workstation specifications are:
  • CPU: Intel Core i9-9940X@3.30 GHz
  • GPU: 4× NVIDIA RTX 2080Ti graphics cards, each with 12 GB GDDR6 memory
  • Storage: 1 TB solid-state drive
  • Memory: 64 GB DDR4
  • Operating system: Windows 10
  • Programming environment: Python 3.7 and PyTorch 1.10
Parameters. Our STFTrack is built on GlobalTrack and uses its trained weights for the backbone and FPN. We adopt ResNet-50 as the backbone. The input image size is set to 640 × 512. During training, the STG-RPN generates 512 candidate targets for ID-RCNN training. The weights of the losses are set to $\lambda_1 = 1$, $\lambda_2 = 0.1$, and $\lambda_3 = 0.1$. The metric loss parameters are set to $m = n = 0.35$ and $\nu = 1$. For testing, the STG-RPN generates 32 high-quality candidate targets, and the ID-RCNN outputs the highest-scoring target. The threshold for determining the existence of a UAV is set to 0.1, and the location probability threshold is set to 0.1. Additional parameter settings can be found in [20,26].
Training details. For model training, we use SGD to optimize our model on four NVIDIA RTX 2080 Ti GPUs. The momentum and weight decay are set to 0.95 and 1 × 10−5, respectively. The batch size on each GPU is set to two. Within a batch, the template image remains the same across all image pairs. Image pairs are augmented in the same way as in GlobalTrack. We first train the STG-RPN on LSOTB-TIR for 32 epochs with a learning rate of 1 × 10−3; at epochs 16 and 28, the learning rate is decayed by a factor of 0.1. Then, the entire model is trained on the Anti-UAV dataset with the same number of epochs and the same learning rate schedule.
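The optimizer schedule described above corresponds to the following PyTorch sketch; the model object is a placeholder and the training step itself is omitted.

```python
import torch

model = torch.nn.Linear(10, 10)                          # placeholder for the STFTrack model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                            momentum=0.95, weight_decay=1e-5)
# Decay the learning rate by 0.1 at epochs 16 and 28 over 32 epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[16, 28], gamma=0.1)

for epoch in range(32):
    # ... forward/backward passes over LSOTB-TIR or Anti-UAV batches would go here ...
    optimizer.step()          # after back-propagating the multi-task loss (Eq. (14))
    scheduler.step()
```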

3.3. Quantitative Evaluation

We compare STFTrack with eight advanced tracking methods, including GlobalTrack, SiamRCNN, STARK [36], SiamRPN++LT [10], ATOM [9], SiamFC, ECO [37], and DSST [38]. For a fair comparison, each tracker uses default parameter settings and public trained weights, and is trained to converge on the Anti-UAV dataset on our platform.
The precision plots and success rate plots for the above nine trackers on the Anti-UAV dataset are shown in Figure 5. It can be observed intuitively that long-term trackers significantly outperform short-term trackers in terms of the precision and success rate. This is because long-term trackers, with their ability to perform re-capture, are better suited for the infrared UAV tracking task. In addition, among these long-term trackers, STFTrack achieves both the highest precision and success rate. Table 1 shows the detailed comparison results of different tracking methods in terms of the precision, success rate and average overlap accuracy. STFTrack exhibits considerable improvement in performance metrics, achieving a precision of 91.2%, a success rate of 66.6%, and an average overlap accuracy of 66.7%. Specifically, STFTrack yields improvements of 2.3%, 2.7%, and 3.5% over the baseline GlobalTrack, and 1.2%, 1.3%, and 1.9% over SiamRCNN, respectively. Moreover, with fewer anchor boxes and candidates used during testing, STFTrack runs at 12.4 FPS, which is a 5.3 FPS increase over SiamRCNN. These results demonstrate that STFTrack has achieved significant improvements in both speed and accuracy for infrared UAV tracking.

3.4. Attribute-Based Evaluation

To comprehensively evaluate the proposed method, we conduct extended experiments on six challenging attributes, namely out-of-view (OV), thermal crossover (TC), scale variation (SV), object occlusion (OC), fast motion (FM), and low resolution (LR), using the Anti-UAV dataset. Precision plots and success rate plots for each tracker on these six attributes are presented in Figure 6 and Figure 7, respectively. STFTrack obtains a competitive precision and success rate on all attributes relative to SiamRCNN, especially on SV and TC, with precision gains of 3.8% and 2.5% and success rate gains of 3.8% and 2.4%, respectively. This demonstrates that, with the help of the spatio-temporal focusing mechanism, STFTrack possesses a strong ability to adapt to target scale changes and to discriminate the infrared UAV from cluttered infrared backgrounds. Furthermore, Figure 8 reports the average overlap accuracy of each tracker on the different attributes. STFTrack ranks first in average overlap accuracy for all attributes and encloses the largest polygon area, indicating a superior performance over the other trackers.
In summary, the comprehensive rankings of the algorithms in terms of precision, success rate, and average overlap accuracy on different challenge attributes demonstrate that STFTrack can better address various challenges in infrared UAV tracking. In particular, the proposed method shows significant performance advantages when coping with target scale changes and thermal crossover interferences.

3.5. Qualitative Evaluation

To visually verify the superiority of the proposed STFTrack, we present the tracking results and IOU-frame plots of the nine tracking methods mentioned above in several challenging scenarios in Figure 9.
Figure 9a displays the comparison results of each tracker in a fast motion scenario. When the UAV undergoes a large displacement between consecutive frames, e.g., from frame 53 to frame 56, only SiamRCNN, GlobalTrack, STARK, and STFTrack are able to track the target, owing to their large search ranges. Among these trackers, STFTrack exhibits the highest success rate (the proportion of frames with IOU > 0.5) and the highest mean IOU over the whole sequence. This is attributed to its guided anchoring strategy, which is particularly effective in handling target scale changes.
Figure 9b demonstrates the performance of each tracker in scenarios involving occlusion and similar distractors. In this sequence, the infrared UAV disappears at frame 39 due to occlusion and reappears at frame 102. Moreover, at frame 438, other airborne interfering objects appear. The tracking results reveal that STFTrack recaptures the UAV quickly and accurately after it reappears following the occlusion at frame 102, whereas the other trackers exhibit suboptimal performance in such situations. In particular, while the infrared UAV is absent from frame 39 to frame 101, STFTrack makes a reliable judgment about the existence of the UAV, whereas the other tracking methods perform poorly. Moreover, in scenarios where other aerial objects appear in the background, e.g., frame 438, SiamRPN++LT and GlobalTrack tend to incorrectly track the interfering objects due to their inadequate discrimination capabilities. In contrast, STFTrack utilizes a two-level spatio-temporal-focused mechanism that leverages spatio-temporal cues to resist interference from distractors, which effectively mitigates this problem and leads to superior tracking results.
Figure 9c,d show the tracking results under complex thermal crossover interference. In this scenario, the contrast between the UAV and the background is substantially reduced, posing a significant challenge to the discriminative ability of trackers. Upon observing the tracking effect of these two video sequences, we can find that almost all the trackers fail to track the UAV when subjected to thermal crossover. However, STFTrack exhibits outstanding accuracy and stability under such challenging scenarios. These results effectively verify that STFTrack has a stronger discriminatory ability than other trackers.
In summary, the proposed STFTrack is capable of handling various challenging tracking scenarios such as fast motion, similar distractors, and thermal crossover. Compared with other algorithms, STFTrack is better suited for infrared UAV tracking tasks.

3.6. Ablation Analysis

To investigate the contributions of each component of STFTrack, we conduct ablation analysis on the critical sub-modules, target location prediction, and metric loss on the Anti-UAV testset. By removing and replacing the essential parts of the baseline model, different variants are constructed. We adopt the precision, success rate, and average overlap accuracy as the evaluation metrics.
Table 2 shows the ablation results of the STG-RPN module and the ID-RCNN module. Each row of the table represents a set of model variants. Specifically, the QG-RPN and QG-RCNN are the original modules from GlobalTrack, while both the QG-RPN and STG-RPN represent the single-stage models that directly adopt the highest-scoring candidate of RPN as the tracking result.
Effects of STG-RPN. By comparing the evaluation metrics of variant 1 with variant 2, variant 3 with variant 5, and variant 4 with variant 6, we can observe that the variants based on the STG-RPN exhibited a greater performance improvement than those based on the QG-RPN. Furthermore, our analysis revealed that the STG-RPN module outperformed the QG-RPN module with notable improvements of 3.3%, 3.7%, and 3.9% in terms of precision, success rate, and average overlap accuracy, respectively. These results provide compelling evidence that our spatio-temporal information-guided strategy is highly effective at enhancing performance by suppressing distractions in the Anti-UAV scenario.
Effects of ID-RCNN. By comparing variant 3 with variant 4 and variant 5 with variant 6, we can find that the variants with the ID-RCNN outperformed those with the QG-RCNN, particularly in terms of the success rate and average overlap accuracy. These results suggest that the ID-RCNN performs better than the QG-RCNN by further enhancing the discriminability of the model through metric learning.
Effects of target location prediction. The ablation results of the target location prediction strategies are presented in Table 3, which compares the performance of three strategies, namely spatial location prediction, temporal location prediction, and spatio-temporal fusion location prediction. The results demonstrate that the spatio-temporal fusion strategy outperforms the other two, with improvements of 2.6%, 2.1%, and 1.9% in the three evaluation metrics, respectively, compared to spatial location prediction. The temporal location prediction strategy performs the weakest among the three. This is because temporal location prediction is essentially a local strategy that struggles to handle the sudden movements and disappearance–reappearance of infrared UAVs, while spatial location prediction is a global strategy that is vulnerable to thermal crossover and similar object interference. Spatio-temporal fusion location prediction integrates the advantages of both the local and global search, such as resistance to background interference and target recapture, making it the most suitable method for infrared UAV tracking.
Effects of DML. Table 4 provides a comparison of different metric loss functions, including no metric loss, naive triplet loss, hard triplet loss, and constrained triplet loss. The results demonstrate that utilizing triplet loss can improve the tracking performance, with constrained triplet loss exhibiting the greatest improvement. In particular, constrained triplet loss outperforms no metric loss by 1.3%, 1.2%, and 1.4% in the three evaluation metrics.

3.7. Tracking Generalizability Evaluation

In order to assess the generalization of STFTrack to general infrared targets, we conduct a comparison study with nine other tracking methods, including GlobalTrack, ECO-deep-TIR, ECO_TIR, MDNet [39], ATOM, SiamRPN++, SiamMask [40], Siamese-FC-TIR, and MMNet [41], on the LSOTB-TIR test set. The tracking performance of each tracker is illustrated in Figure 10. Based on these comparison results, it is evident that GlobalTrack, which performs exceptionally well in Anti-UAV tracking, exhibits poor performance on LSOTB-TIR. This is due to the relatively smooth target motion and the wider variety of targets in LSOTB-TIR, which demand a more discriminative tracker with less emphasis on re-detection and more focus on local target search; in such cases, short-term trackers are more effective. Nevertheless, STFTrack outperforms the other nine trackers with a precision of 77.2% and a success rate of 63.4%, achieving performance comparable to the best short-term tracker, i.e., ECO-deep-TIR. Additionally, compared to the baseline algorithm, i.e., GlobalTrack, STFTrack achieves a 6% improvement in precision and a 4.5% improvement in success rate. These results suggest that STFTrack's global-to-local focusing mechanism applies to both long-term and short-term infrared target tracking, substantially enhancing tracking performance.

4. Conclusions

In this study, we propose a novel spatio-temporal-focused Siamese network for infrared UAV tracking. It seamlessly integrates spatio-temporal constraints into a Siamese-based global tracking framework, employing a novel two-level global-to-local focusing strategy. At the first level, we constrain the search region by implementing a soft region selection mechanism using the proposed STG-RPN. This mechanism utilizes template features and motion characteristics to focus on the target location and generate high-quality candidate targets. At the second level, we focus on refining the candidate targets by introducing the ID-RCNN. It further magnifies the discriminability between the target UAV and similar distractors, thereby improving target discrimination accuracy. Experiments on the Anti-UAV dataset demonstrate that our proposed STFTrack achieves 89.5% precision, 64.9% success rate, 65.6% average overlap accuracy, and 12.4 FPS. Compared to other tracking methods, STFTrack exhibits superior performance. STFTrack also demonstrates robust tracking capability in complex scenarios such as rapid motion, thermal crossover, and similar distractors. Experiments on the LSOTB-TIR dataset show that STFTrack achieves 77.2% precision and a 63.4% success rate, comparable to the best short-term tracker. This indicates that STFTrack applies to both long-term and short-term infrared target tracking. In conclusion, STFTrack deftly integrates the anti-background-interference capability of the local search and the re-detection capability of the global search, enhancing the precision and stability of infrared target tracking.
Despite achieving promising results, our proposed STFTrack still suffers from two limitations. First, as a Siamese-network-based tracking method, STFTrack cannot adaptively update the template features over time, leading to error accumulation in long-term tracking scenarios. Second, the maneuver detection method based on the pseudo-innovation used in STFTrack is not robust enough against camera jitter, which may result in inaccurate predictions of the target motion state. In future work, we will first explore lightweight template feature updating strategies that adaptively integrate target appearance features at different times, to enhance the algorithm's adaptability to target shape variations. We then plan to introduce an image stabilization module to reduce the interference of camera jitter, further enhancing tracking stability. Finally, we will investigate the reproduction of other advanced tracking methods for a more in-depth study of infrared UAV tracking.

Author Contributions

Conceptualization, J.X. and X.Y.; data curation, X.X. and W.X.; formal analysis, X.X.; funding acquisition, J.X. and R.L.; investigation, J.X. and X.Y.; Methodology, X.X.; project administration, J.X.; resources, R.L.; software, X.X.; supervision, J.X., X.Y. and R.L.; validation, X.X.; visualization, X.X. and W.X.; writing—original draft, X.X.; writing—review and editing, X.X., R.L. and J.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Natural Science Foundation of China under Grants 62176263 and 62276274, Science Foundation for Distinguished Youth of Shaanxi Province under Grant 2021JC-35.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Fan, J.; Yang, X.; Lu, R.; Xie, X.; Li, W. Design and Implementation of Intelligent Inspection and Alarm Flight System for Epidemic Prevention. Drones 2021, 5, 68. [Google Scholar] [CrossRef]
  2. Filkin, T.; Sliusar, N.; Ritzkowski, M.; Huber-Humer, M. Unmanned Aerial Vehicles for Operational Monitoring of Landfills. Drones 2021, 5, 125. [Google Scholar] [CrossRef]
  3. Svanström, F.; Alonso-Fernandez, F.; Englund, C. Drone Detection and Tracking in Real-Time by Fusion of Different Sensing Modalities. Drones 2022, 6, 317. [Google Scholar] [CrossRef]
  4. Dewangan, V.; Saxena, A.; Thakur, R.; Tripathi, S. Application of Image Processing Techniques for UAV Detection Using Deep Learning and Distance-Wise Analysis. Drones 2023, 7, 174. [Google Scholar] [CrossRef]
  5. Luo, J.; Wang, Z. A Review of Development and Application of UAV Detection and Counter Technology. Control Decis. 2022, 37, 530–544. [Google Scholar]
  6. Bertinetto, L.; Valmadre, J.; Henriques, J.; Vedaldi, A.; Torr, P. Fully-convolutional Siamese Networks for Object Tracking. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 850–865. [Google Scholar]
  7. Li, B.; Yan, J.; Wu, W.; Zhu, Z.; Hu, X. High Performance Visual Tracking with Siamese Region Proposal Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8971–8980. [Google Scholar]
  8. Danelljan, M.; Bhat, G.; Khan, F.S.; Felsberg, M. Atom: Accurate Tracking by Overlap Maximization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 4660–4669. [Google Scholar]
  9. Bhat, G.; Danelljan, M.; Gool, L.; Timofte, R. Learning Discriminative Model Prediction for Tracking. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6182–6191. [Google Scholar]
  10. Li, B.; Wu, W.; Wang, Q.; Zhang, F.; Xing, J.; Yan, J. SiamRPN++: Evolution of Siamese Visual Tracking with Very Deep Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 4277–4286.
  11. Fan, H.; Ling, H. SANet: Structure-Aware Network for Visual Tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 42–49.
  12. Wang, C.; Shi, Z.; Meng, L.; Wang, J.; Wang, T.; Gao, Q.; Wang, E. Anti-Occlusion UAV Tracking Algorithm with a Low-Altitude Complex Background by Integrating Attention Mechanism. Drones 2022, 6, 149.
  13. Bhat, G.; Danelljan, M.; Van Gool, L.; Timofte, R. Know Your Surroundings: Exploiting Scene Information for Object Tracking. In Proceedings of the European Conference on Computer Vision, Online, 23–28 August 2020; pp. 205–221.
  14. Zhang, H.; Li, X.; Zhu, B.; Zhang, Y. Two-Stage Object Tracking Method Based on Siamese Neural Network. Infrared Laser Eng. 2021, 50, 20200491-1–20200481-12.
  15. Sun, L.; Zhang, J.; Yang, Z.; Fan, B. A Motion-Aware Siamese Framework for Unmanned Aerial Vehicle Tracking. Drones 2023, 7, 153.
  16. Kalal, Z.; Mikolajczyk, K.; Matas, J. Tracking-Learning-Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 1409–1422.
  17. Yan, B.; Zhao, H.; Wang, D.; Lu, H.; Yang, X. ‘Skimming-Perusal’ Tracking: A Framework for Real-Time and Robust Long-Term Tracking. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2385–2393.
  18. Dai, K.; Zhang, Y.; Wang, D.; Li, J.; Lu, H.; Yang, X. High-Performance Long-Term Tracking with Meta-Updater. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 6298–6307.
  19. Zhao, J.; Zhang, X.; Zhang, P. A Unified Approach for Tracking UAVs in Infrared. In Proceedings of the IEEE International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 1213–1222.
  20. Huang, L.; Zhao, X.; Huang, K. GlobalTrack: A Simple and Strong Baseline for Long-Term Tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 11037–11044.
  21. Voigtlaender, P.; Luiten, J.; Torr, P.; Leibe, B. Siam R-CNN: Visual Tracking by Re-Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 6578–6588.
  22. Fang, H.; Wang, X.; Liao, Z.; Chang, Y.; Yan, L. A Real-Time Anti-Distractor Infrared UAV Tracker with Channel Feature Refinement Module. In Proceedings of the IEEE International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 1240–1248.
  23. Chen, J.; Huang, B.; Li, J.; Wang, Y.; Ren, M.; Xu, T. Learning Spatio-Temporal Attention Based Siamese Network for Tracking UAVs in the Wild. Remote Sens. 2022, 14, 1797.
  24. Shi, X.; Zhang, Y.; Shi, Z.; Zhang, Y. GASiam: Graph Attention Based Siamese Tracker for Infrared Anti-UAV. In Proceedings of the International Conference on Computer Vision, Image and Deep Learning & International Conference on Computer Engineering and Applications (CVIDL & ICCEA), Changchun, China, 20–22 May 2022; pp. 986–993.
  25. Lin, T.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
  26. Wang, J.; Chen, K.; Yang, S.; Loy, C.; Lin, D. Region Proposal by Guided Anchoring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 2965–2974.
  27. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
  28. Song, G.; Liu, Y.; Jiang, M.; Wang, Y.; Yan, J.; Leng, B. Beyond Trade-Off: Accelerate FCN-Based Face Detector with Higher Accuracy. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7756–7764.
  29. Cakir, F.; He, K.; Xia, X.; Kulis, B.; Sclaroff, S. Deep Metric Learning to Rank. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 1861–1870.
  30. Lin, T.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 318–327.
  31. Tychsen-Smith, L.; Petersson, L. Improving Object Localization with Fitness NMS and Bounded IoU Loss. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6877–6885.
  32. Cheng, D.; Gong, Y.; Zhou, S.; Wang, J.; Zheng, N. Person Re-Identification by Multi-Channel Parts-Based CNN with Improved Triplet Loss Function. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1335–1344.
  33. Schroff, F.; Kalenichenko, D.; Philbin, J. FaceNet: A Unified Embedding for Face Recognition and Clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 815–823.
  34. Jiang, N.; Wang, K.; Peng, X.; Yu, X.; Han, Z. Anti-UAV: A Large Multi-Modal Benchmark for UAV Tracking. IEEE Trans. Multimed. 2023, 25, 486–500.
  35. Liu, Q.; Li, X.; He, Z.; Li, C.; Li, J.; Zhou, Z.; Yuan, D.; Li, J.; Yang, K.; Fan, N.; et al. LSOTB-TIR: A Large-Scale High-Diversity Thermal Infrared Object Tracking Benchmark. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 3847–3856.
  36. Yan, B.; Peng, H.; Fu, J.; Wang, D.; Lu, H. Learning Spatio-Temporal Transformer for Visual Tracking. In Proceedings of the IEEE International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10448–10457.
  37. Zolfaghari, M.; Singh, K.; Brox, T. ECO: Efficient Convolutional Network for Online Video Understanding. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 713–730.
  38. Danelljan, M.; Häger, G.; Khan, F.; Felsberg, M. Discriminative Scale Space Tracking. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1561–1575.
  39. Nam, H.; Han, B. Learning Multi-Domain Convolutional Neural Networks for Visual Tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4293–4302.
  40. Wang, Q.; Zhang, L.; Bertinetto, L.; Hu, W.; Torr, P. Fast Online Object Tracking and Segmentation: A Unifying Approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 1328–1338.
  41. Liu, Q.; Li, X.; He, Z.; Fan, N.; Yuan, D.; Liu, W.; Liang, Y. Multi-Task Driven Feature Models for Thermal Infrared Tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 11604–11611.
Figure 1. The overall framework of a spatio-temporal-focused infrared UAV tracking method. The framework consists of three major components: a feature pyramid network (FPN)-based Siamese network for extracting deep features from template images and search images; a spatio-temporal information-guided RPN (STG-RPN) for generating high-quality proposals by focusing on the target search area; and an instance discrimination RCNN (ID-RCNN) for accurately detecting the target UAV from the candidates.
Figure 2. The overview of STG-RPN. First, the location probability distribution of the target across the entire image is estimated using spatio-temporal information, and anchor points are set in the high-score region. Next, the anchor shape at each anchor point is predicted under the guidance of the template target features. Then, feature adaptation is employed to optimize the features according to the predicted anchor shapes. Finally, the anchor boxes in the focused region are classified and regressed to generate a series of spatio-temporally similar candidates.
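To make the pipeline above concrete, the following is a minimal PyTorch sketch of template-guided anchor prediction in the spirit of guided anchoring [26]. The module name GuidedAnchorHead, the channel sizes, the channel-wise modulation used as a stand-in for depth-wise correlation, and the plain convolution used in place of deformable feature adaptation are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidedAnchorHead(nn.Module):
    """Illustrative template-guided anchor head (hypothetical layer sizes)."""
    def __init__(self, channels=256):
        super().__init__()
        self.loc_head = nn.Conv2d(channels, 1, 1)     # target location probability map
        self.shape_head = nn.Conv2d(channels, 2, 1)   # per-point (dw, dh) anchor shape prediction
        self.feat_adapt = nn.Conv2d(channels, channels, 3, padding=1)  # stand-in for deformable adaptation

    def forward(self, search_feat, template_feat, loc_thresh=0.5):
        # Modulate the search feature with a pooled template descriptor
        # (a simple stand-in for depth-wise cross-correlation).
        kernel = F.adaptive_avg_pool2d(template_feat, 1)
        guided = search_feat * kernel
        loc_map = torch.sigmoid(self.loc_head(guided))    # where anchor points should be placed
        shape_map = self.shape_head(guided)               # predicted anchor shapes at each point
        adapted = self.feat_adapt(guided)                 # features adapted for classification/regression
        anchor_mask = loc_map > loc_thresh                # keep only anchors in the focused region
        return loc_map, shape_map, adapted, anchor_mask

# Usage sketch
head = GuidedAnchorHead()
search = torch.randn(1, 256, 32, 32)
template = torch.randn(1, 256, 8, 8)
loc_map, shape_map, adapted, mask = head(search, template)
```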
Figure 3. Example of target location prediction. (a) Input image with the ground truth marked by the yellow box; (b) spatial location map and its threshold map; (c) temporal location map; (d) target location map and its threshold map. Comparing the threshold maps of the spatial location map and the target location map shows that the search area becomes more focused on the target region after being constrained by the temporal location prediction.
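As a companion to the caption above, the short NumPy sketch below illustrates one plausible way to constrain a spatial (appearance-based) location map with a temporal (motion-based) prior and to threshold the result into a focused search region. The Gaussian motion prior, the elementwise fusion, and the threshold value are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def temporal_prior(h, w, prev_cx, prev_cy, sigma=20.0):
    """Hypothetical Gaussian motion prior centred on the previous target position."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - prev_cx) ** 2 + (ys - prev_cy) ** 2) / (2 * sigma ** 2))

def fuse_location_maps(spatial_map, temporal_map, thresh=0.5):
    """Constrain the spatial map with the temporal prior, then threshold it."""
    fused = spatial_map * temporal_map          # assumed elementwise fusion
    fused = fused / (fused.max() + 1e-8)        # normalise to [0, 1]
    return fused, fused > thresh                # target location map and focused-region mask

# Usage sketch
spatial = np.random.rand(64, 64)                # e.g., a correlation response map
temporal = temporal_prior(64, 64, prev_cx=30, prev_cy=28)
target_map, region_mask = fuse_location_maps(spatial, temporal)
```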
Figure 4. The structure of the ID-RCNN. The framework comprises two main branches: the QG-RCNN and the DML branch. The former first performs feature modulation to capture the correlation between the template target and the candidates; the classical RCNN is then applied to classify and regress the proposals according to the modulated features. The latter discriminates easily confused candidates by learning an embedding mapping function, which allows the similarity between the template target and each candidate to be measured in an appropriate embedding space. Finally, the decision results of the two branches are fused by elementwise multiplication, and the candidate with the highest score is selected as the tracking result for the current frame. ⊙ denotes the Hadamard product.
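The fusion step described in the caption can be sketched as follows. The function select_target, the exp(-distance) mapping from embedding distance to similarity, and the embedding dimension are assumptions introduced for illustration; only the elementwise fusion of the two branch scores and the arg-max selection follow the description above.

```python
import torch

def select_target(cls_scores, template_emb, candidate_embs):
    """Fuse QG-RCNN confidence with metric-learning similarity and pick the best candidate.

    cls_scores:     (N,)   classification scores for N candidates
    template_emb:   (D,)   embedding of the template target
    candidate_embs: (N, D) embeddings of the candidates
    """
    dists = torch.cdist(candidate_embs, template_emb.unsqueeze(0)).squeeze(1)  # (N,)
    sim = torch.exp(-dists)            # assumed mapping from distance to similarity in (0, 1]
    fused = cls_scores * sim           # elementwise (Hadamard) fusion of the two branches
    return torch.argmax(fused), fused

# Usage sketch
scores = torch.rand(5)
t_emb = torch.randn(128)
c_embs = torch.randn(5, 128)
best_idx, fused_scores = select_target(scores, t_emb, c_embs)
```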
Figure 5. Precision plots and success rate plots of our STFTrack and state-of-the-art trackers on the Anti-UAV dataset.
Figure 6. Precision plots of our STFTrack and state-of-the-art trackers on different attributes.
Figure 7. Success rate plots of our STFTrack and state-of-the-art trackers on different attributes.
Figure 8. Average overlap accuracy of our STFTrack and state-of-the-art trackers on different attributes. STFTrack encloses the largest area, which indicates superior performance over the other trackers.
Figure 9. Visualization of the tracking results of our STFTrack and state-of-the-art trackers in different challenging scenarios for infrared UAV tracking. Local tracking results are zoomed in and shown as yellow patches in the image corners for better visualization. The black lines in the IoU–frame plots indicate IoU = 0.5. (a) 20190925_193610_1_3. (b) 20190925_134301_1_6. (c) 20190925_134301_1_5. (d) 20190925_124000_1_9.
Figure 10. Precision plots and success rate plots of our STFTrack and state-of-the-art trackers on the LSOTB-TIR test set.
Table 1. Comparison results of our STFTrack and state-of-the-art trackers on the Anti-UAV dataset. “↑” indicates that a larger value means better performance.
| Method | Precision ↑ | Success ↑ | Accuracy ↑ | FPS ↑ |
| --- | --- | --- | --- | --- |
| DSST | 0.490 | 0.349 | 0.354 | 31.2 |
| SiamFC | 0.510 | 0.369 | 0.375 | 60.2 |
| ECO | 0.618 | 0.437 | 0.444 | 7.5 |
| ATOM | 0.711 | 0.484 | 0.490 | 28.7 |
| SiamRPN++LT | 0.756 | 0.501 | 0.509 | 26.3 |
| STARK | 0.843 | 0.588 | 0.607 | 33.5 |
| GlobalTrack | 0.889 | 0.639 | 0.642 | 9.2 |
| Siam-RCNN | 0.900 | 0.653 | 0.658 | 6.1 |
| Ours | 0.912 | 0.666 | 0.677 | 12.4 |
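For readers unfamiliar with the metrics in Table 1, the sketch below shows a simplified, OTB-style computation of per-sequence precision, success, and overlap accuracy. The 20-pixel center-error threshold, the IoU threshold of 0.5, and the use of mean IoU as "accuracy" are simplifying assumptions made for illustration; the Anti-UAV benchmark's exact average overlap accuracy (which also handles frames where the target is absent) and the area-under-curve success score are defined in [34].

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as (x, y, w, h)."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def evaluate(pred_boxes, gt_boxes, center_thresh=20.0, iou_thresh=0.5):
    """Simplified per-sequence precision / success / accuracy (assumed thresholds)."""
    ious, cerrs = [], []
    for p, g in zip(pred_boxes, gt_boxes):
        ious.append(iou(p, g))
        pc = (p[0] + p[2] / 2, p[1] + p[3] / 2)
        gc = (g[0] + g[2] / 2, g[1] + g[3] / 2)
        cerrs.append(np.hypot(pc[0] - gc[0], pc[1] - gc[1]))
    precision = np.mean(np.array(cerrs) <= center_thresh)  # fraction of frames within 20 px
    success = np.mean(np.array(ious) >= iou_thresh)        # fraction above IoU 0.5 (stand-in for AUC)
    accuracy = np.mean(ious)                               # simplified: mean overlap
    return precision, success, accuracy
```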
Table 2. Comparison of different variants of the sub-modules. “↑” indicates that a larger value means better performance. “√” means that the element is selected.
| Num. | QG-RPN | STG-RPN | QG-RCNN | ID-RCNN | Precision ↑ | Success ↑ | Accuracy ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 |  |  |  |  | 0.838 | 0.586 | 0.594 |
| 2 |  |  |  |  | 0.865 | 0.623 | 0.631 |
| 3 |  |  |  |  | 0.889 | 0.635 | 0.642 |
| 4 |  |  |  |  | 0.887 | 0.643 | 0.653 |
| 5 |  |  |  |  | 0.903 | 0.654 | 0.663 |
| 6 |  |  |  |  | 0.916 | 0.666 | 0.677 |
Table 3. Comparison of different target location strategies. “↑” indicates that a larger value means better performance. “√” means that the element is selected.
| Spatial Location Prediction | Temporal Location Prediction | Precision ↑ | Success ↑ | Accuracy ↑ |
| --- | --- | --- | --- | --- |
|  |  | 0.890 | 0.645 | 0.658 |
|  |  | 0.723 | 0.484 | 0.491 |
|  |  | 0.916 | 0.666 | 0.677 |
Table 4. Comparison of different metric loss functions. “↑” indicates that a larger value means better performance.
| Loss Function | Precision ↑ | Success ↑ | Accuracy ↑ |
| --- | --- | --- | --- |
| No metric loss | 0.903 | 0.654 | 0.663 |
| Naive triplet loss | 0.907 | 0.658 | 0.667 |
| Hard triplet loss | 0.913 | 0.662 | 0.674 |
| Constrained triplet loss | 0.916 | 0.666 | 0.677 |
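To clarify the loss variants compared in Table 4, the sketch below gives the standard ("naive") triplet loss [33] and a batch-hard variant in the spirit of hard example mining; the margin value and the exact batch-hard formulation are assumptions for illustration, and the constrained triplet loss proposed in the paper is not reproduced here.

```python
import torch
import torch.nn.functional as F

def naive_triplet_loss(anchor, positive, negative, margin=0.3):
    """Standard triplet loss over given (anchor, positive, negative) embeddings of shape (B, D)."""
    d_ap = F.pairwise_distance(anchor, positive)
    d_an = F.pairwise_distance(anchor, negative)
    return F.relu(d_ap - d_an + margin).mean()

def hard_triplet_loss(anchor, positives, negatives, margin=0.3):
    """Batch-hard variant: hardest positive and hardest negative per anchor (assumed formulation).

    anchor: (B, D)   positives: (B, P, D)   negatives: (B, N, D)
    """
    d_ap = torch.cdist(anchor.unsqueeze(1), positives).squeeze(1)  # (B, P)
    d_an = torch.cdist(anchor.unsqueeze(1), negatives).squeeze(1)  # (B, N)
    hardest_pos = d_ap.max(dim=1).values                           # farthest positive
    hardest_neg = d_an.min(dim=1).values                           # closest negative
    return F.relu(hardest_pos - hardest_neg + margin).mean()

# Usage sketch
a = torch.randn(8, 128)
loss_naive = naive_triplet_loss(a, torch.randn(8, 128), torch.randn(8, 128))
loss_hard = hard_triplet_loss(a, torch.randn(8, 4, 128), torch.randn(8, 16, 128))
```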