Article

Oriented Object Detection Based on Foreground Feature Enhancement in Remote Sensing Images

1 Key Laboratory for Information Science of Electromagnetic Waves (MoE), Fudan University, Shanghai 200433, China
2 Image and Intelligence Laboratory, School of Information Science and Technology, Fudan University, Shanghai 200433, China
* Author to whom correspondence should be addressed.
Remote Sens. 2022, 14(24), 6226; https://doi.org/10.3390/rs14246226
Submission received: 5 November 2022 / Revised: 2 December 2022 / Accepted: 6 December 2022 / Published: 8 December 2022
(This article belongs to the Section Remote Sensing Image Processing)

Abstract

Oriented object detection is a fundamental and challenging task in remote sensing image analysis and has received much attention in recent years. Optical remote sensing images often have more complex background information than natural images, and the number of annotated samples varies across categories. To enhance the difference between foreground and background, current one-stage object detection algorithms attempt to exploit focal loss to balance the foreground and background weights, thus making the network focus more on the foreground. However, current one-stage object detectors still face two main challenges: (1) the detection network pays little attention to the foreground and does not make full use of the foreground information; (2) the distinction of similar object categories has not attracted attention. To address these challenges, this paper presents a foreground feature enhancement method for one-stage object detection. The proposed method mainly includes two important components: a keypoint attention module (KAM) and a prototype contrastive learning module (PCLM). The KAM is used to enhance the features of the foreground part of the image and suppress the features of the background part, and the PCLM is utilized to enhance the discrimination between foreground categories and reduce the confusion of samples between different categories. Furthermore, the proposed method designs and adopts an equalized modulation focal loss (EMFL) to optimize the training process of the model and increase the loss weight of the foreground in the later stages of training. Experimental results on the publicly available DOTA and HRSC2016 datasets show that our method exhibits state-of-the-art performance.

1. Introduction

Object detection in remote sensing images (RSIs) aims to localize and recognize remote sensing objects, which plays an important role in a wide scope of applications, such as intelligent monitoring, precision agriculture, and geographic information systems. In the past few years, due to the rapid development of deep learning, object detection methods based on convolutional neural networks have emerged [1,2,3], and the methods of object detection in remote sensing images have also been greatly developed.
Object detection in remote sensing images is usually divided into two tasks: horizontal object detection and oriented object detection. Compared with natural images, objects in remote sensing images are usually densely arranged and have arbitrary orientations. As a result, using horizontal box annotations may introduce background noise into remote sensing object detection, leading to an unclear definition of labels. Moreover, a rotated box annotation measures the size of an object more accurately than a horizontal box annotation. Thus, in practice, oriented object detection methods have more practical significance than horizontal object detection methods, since rotated box annotations are better suited to the remote sensing object detection task.
Most oriented object detection methods have been migrated from the horizontal object detection methods, but these horizontal object detection methods are designed based on natural images. Accordingly, we cannot simply obtain the oriented object detection methods by extending the horizontal object detection methods. Moreover, compared with natural images, remote sensing images often have the following unique characteristics: (1) the objects are usually densely arranged; (2) the objects of remote sensing images are seriously disturbed by noise, and some images have color distortion; (3) the background information of remote sensing images is complex; (4) some categories show a high degree of similarity between object categories, while some categories show great differences. Figure 1 shows some examples of the remote sensing images.
Due to the above characteristics of remote sensing images, a large number of oriented object detection methods specifically used for remote sensing images have emerged. Depending on whether the object detector can directly extract features with neural networks to predict the object classification and location, the current object detection methods can be divided into two categories: one-stage detection methods and two-stage detection methods.
The first stage of the two-stage object detection methods generates candidate boxes, and the second stage then classifies and regresses each candidate box. Such algorithms are slow to train and infer because they require multiple runs of the localization and classification process. Early two-stage detection methods use anchors with multiple angles and multiple aspect ratios to detect oriented objects [4]. However, as the number of preset angles increases, so does the number of anchors, resulting in a rapid increase in computational cost and making model training more difficult. RoI Transformer [5] converts horizontal proposals into oriented proposals via an RRoI Learner, and then extracts features from the oriented proposals for subsequent classification and regression. This method predicts the angle value with the network instead of presetting it, which greatly reduces the number of anchors and thus the computation time. ReDet [6] introduces rotation-invariant convolution into the whole model based on the RoI Transformer [5], extracting rotation-invariant features via a rotation-invariant RoI alignment operation and improving the classification and regression of the final boxes. Oriented R-CNN [7] replaces the RRoI Learner in RoI Transformer [5] with a simpler oriented RPN, making the model architecture simpler and more efficient. In general, the two-stage detection methods have higher detection accuracy and recall than the one-stage detection methods, but their more complex architecture requires a large amount of time for both training and inference.
In contrast, one-stage object detection methods predict all bounding boxes by passing the image through the network only once. R3Det [8] is a refined one-stage oriented detection method which adjusts horizontal anchors with a feature refinement module to obtain rotated box results. SCRDet++ [9] detects small, cluttered, and rotated objects by instance-level feature denoising and rotation loss smoothing, whether applied to a one-stage or a two-stage detector. S2ANet [10] is a one-stage object detection method consisting of a feature alignment module (FAM) and an oriented detection module (ODM). The FAM generates high-quality anchors with an anchor refinement network and adaptively aligns the convolutional features according to the anchor boxes with a novel alignment convolution. The ODM first adopts active rotating filters to encode the orientation information and then produces orientation-sensitive and orientation-invariant features to alleviate the inconsistency between classification score and localization accuracy. IENet [11] designs a branch interaction module with a self-attention mechanism that combines the features of the classification and regression branches to improve the discrimination between foreground and background. GWD [12], KLD [13], and ProbIoU [14] represent boxes as 2D Gaussian distributions and use distance measures between them as the loss, which better allows the network to learn the differences between foreground and background. PIoU [15] designs an IoU loss function based on pixel statistics in the rotated box that adequately distinguishes foreground pixels from background pixels. BBAVectors [16] and PolarDet [17] define the rotated box with boundary-aware vectors and polar coordinates, respectively, making the detector more sensitive to the boundaries between foreground and background. CenterRot [18] uses a variable convolution fusion of multiscale features to effectively improve the detection accuracy of the detector. Generally speaking, the one-stage detection methods are less accurate than the two-stage detection methods, but their model architecture is very simple, making both training and inference very fast.
Analyzing the above two types of object detection methods, we consider whether we can obtain a one-stage object detection method with a detection accuracy comparable to the two-stage detection methods, and with a high detection speed. According to the above introduction on one-stage detection methods, the current one-stage detection methods still face the following two challenges.
First, the detection network pays little attention to the foreground and does not make full use of the foreground information. General one-stage object detection methods produce a large number of negative samples, leading to a serious imbalance between positive and negative samples. To solve this problem, most methods use focal loss [19] to train the network, but focal loss [19] pays more attention to simple samples in the early stage of training and to difficult samples in the late stage, so a large amount of positive-sample information that the model should reuse is ignored.
Second, how to improve the discrimination of foreground features of similar categories has not been addressed. The two-stage object detection methods only consider how to reduce the number of negative samples through first-stage screening and do not consider the confusion between similar categories, while the one-stage object detection methods consider neither how to reduce the number of negative samples nor how to handle highly similar categories.
To address the above challenges, we propose a foreground feature enhancement method for one-stage anchor-free oriented detection. The method mainly includes three parts: a keypoint attention module (KAM), a prototype contrastive learning module (PCLM), and an equalized modulation focal loss (EMFL). The KAM draws on the idea of semantic segmentation: we design a new branch whose supervision is a mask composed of several elliptic Gaussian distributions, which can be obtained from the original rotated-box annotations. Using this mask as a label trains a spatial attention map, which is used to enhance the original features and increase the attention paid to the foreground. The PCLM draws on the idea of contrastive learning: in feature space, it pushes the sample features of different categories as far apart as possible and pulls the sample features of the same category as close as possible, so that the possibility of confusion between similar categories is reduced as much as possible. As for the EMFL, it takes into account that focal loss [19] only addresses the weight balance between positive and negative samples while ignoring the unbalanced numbers of samples across categories. Moreover, considering that most positive samples tend to become simple samples late in training and, therefore, receive lower weights than negative samples, we design a dynamic loss function so that the model pays more attention to informative positive samples in the later stages of training.
The main contributions of this paper can be summarized as follows:
(1)
A novel one-stage anchor-free oriented object detection method is presented, which mainly includes two modules (KAM and PCLM). The KAM is utilized to enhance the features of the foreground part of the image and suppress the features of the background part, enabling the one-stage detector to better distinguish foreground from background. The PCLM is applied to the classification branch to make the foreground features of different categories as orthogonal as possible in the feature space, in order to enhance the discrimination of samples of different categories and reduce the confusion between them.
(2)
Furthermore, the EMFL is proposed to improve the learning process of positive and negative samples. Compared with the original focal loss [19], the constructed EMFL has a more dynamic training process, allowing the network to fully learn the information of positive samples in the early, middle and late stages, and also to balance for the learning of different categories of samples.
The rest of this article is organized as follows: Section 2 briefly reviews the related works. In Section 3, the proposed foreground feature enhancement method is described in detail. In Section 4, the experiments on the publicly available DOTA datasets and HRSC2016 datasets are conducted, which demonstrate the superiority of the proposed method. Finally, Section 5 draws the conclusions of this paper.

2. Related Works

2.1. Anchor-Based Oriented Object Detection

Unlike the horizontal object detection task, oriented object detection relies on the oriented bounding box (OBB) to capture objects in any direction. Anchor-based oriented object detection methods require a manual preset of a series of standard boxes, called anchors, for boundary regression and refinement. Current oriented object detection methods are generally extensions of horizontal object detection methods. R2PN [20] introduces the orientation parameters of the rotated box into the RPN to form a rotated RPN, and the network also uses a rotated RoI pooling operation to refine the rotated box parameters. R-DFPN [21] uses the feature pyramid network (FPN) [22] combined with multiscale features to improve detection performance. Based on the DFPN backbone, the work in [23] further proposes an adaptive RoI alignment method for the rotated box regression of the two-stage method. RRD [24] encodes the rotation information using active rotating filters. Gliding Vertex [25] slides the vertices of the horizontal bounding box to capture the oriented bounding box. RoI Transformer [5] learns spatial transformation parameters via an RRoI Learner, transforms the horizontal proposal into a rotated proposal, and then uses the features in the rotated proposal for subsequent classification and regression. ReDet [6], based on the RoI Transformer [5], adds a rotation-equivariant network to the detector to extract rotation-equivariant features, which accurately predicts the orientation and leads to a large reduction in model size. Oriented R-CNN [7] proposes an oriented RPN that replaces the RRoI Learner in RoI Transformer [5] and can be used to generate high-quality rotated boxes. R3Det [8] proposes the feature refinement module (FRM), which encodes the positional information of the current refined bounding box into the corresponding feature point through pixel-by-pixel feature interpolation to realize feature reconstruction and alignment. SCRDet++ [9] improves the accuracy of small object detection by performing instance-level feature denoising. S2ANet [10] proposes a novel alignment convolution to alleviate, in a fully convolutional manner, the misalignment between axis-aligned convolutional features and arbitrarily oriented objects. Overall, an anchor-based detector first densely spreads a large number of anchors over the feature map and then regresses the offsets between the bounding boxes and the anchors.

2.2. Anchor-Free Oriented Object Detection

Anchor-free oriented object detection methods do not require a manual preset of anchors and generally obtain the final bounding box by predicting keypoints. The first one-stage anchor-free oriented object detector, IENet [11], is based on the one-stage anchor-free fully convolutional detector FCOS [3] and designs an interactive embranchment module with a self-attention mechanism to fuse features from the classification and regression branches. Axis Learning [26] detects an arbitrarily oriented object by predicting the axis of the object, a line connecting the head and tail of the object, with the object width perpendicular to the axis. By predicting objects directly at the pixel level of the feature map, this method avoids setting many anchor-related hyperparameters and is computationally efficient. PIoU [15] points out that when detecting oriented objects, the conventional smooth L1 loss focuses more on reducing angular errors than on the global IoU, especially for large aspect ratios. Therefore, the pixel IoU (PIoU) loss is proposed, using pixel-level sampling to optimize the IoU and greatly improving the detection performance on objects with large aspect ratios. P-RSDet [27] replaces the Cartesian coordinate representation of the bounding box with polar coordinates, so that bounding box regression is performed by predicting the center point, the polar radius, and two polar angles; furthermore, a polar ring area loss is introduced to represent the geometric constraint between the polar radius and the polar angles. BBAVectors [16] builds on CenterNet [28] and, instead of regressing the width and height of a horizontal box, regresses boundary-aware vectors from the center point to the four boundaries. In DAFNe [29], the authors introduce an orientation-aware generalization of the center-ness function for arbitrarily oriented bounding boxes to reduce low-quality predictions, as well as a center-to-corner bounding box prediction strategy to improve object localization accuracy. Anchor-free object detectors have better robustness than anchor-based detectors.

2.3. Contrastive Learning in Object Detection

Contrastive learning [30,31,32] has been widely used in the field of self-supervised learning. Most of these methods use instance discrimination as an auxiliary task to pretrain the network, which is then fine-tuned for different downstream tasks (e.g., classification, object detection, and segmentation). The purpose of contrastive learning is to minimize the distance between positive pairs (i.e., two different augmented views of the same image) and to push away negative pairs. Specifically, DetCo [33] combines contrastive learning of global images with local patches on multi-level features, and then transfers the learned model to an object detection task. PixPro [34] introduces pixel-level auxiliary tasks for learning dense feature representations, which are friendly to dense prediction tasks (e.g., object detection and segmentation). To reduce the dependence on noisy pseudo labels and improve tolerance to them, Self-Tuning [35] proposes a pseudo group contrast (PGC) mechanism to address the challenge of confirmation bias in self-training. The above contrastive learning methods have not been used to address the discrimination of similar object categories.

3. Proposed Method

In order to increase the attention of the one-stage oriented object detectors to the foreground and the ability to deal with similar categories of samples, an Oriented Object Detector based on Foreground Feature Enhancement (O2DFFE) is proposed and it mainly includes three parts: KAM, PCLM, and EMFL. The overall architecture of the network is shown in Figure 2. In this Section, we first present the initial network architecture and then the three innovations of O2DFFE.

3.1. Initial Network Architecture

Considering the large-scale variation in objects in remote sensing images, it is difficult to select a suitable set of parameters to generate a series of suitable anchors. Hence, the anchor-free method is adopted in this paper. Our initial network model is based on a one-stage anchor-free oriented object detection model: BBAVectors [16].
The input image is $I \in \mathbb{R}^{3 \times H \times W}$. We obtain P2 by Formula (1), where P2 is a multi-scale feature that contains both the location information of low-level features and the semantic information of high-level features. The resolution of the P2 feature is not much lower than that of the original image, which is very suitable for our keypoint-based object detection method. Then, the P2 feature is fed into the KAM to obtain the enhanced feature map A2, and finally we decode A2 using the detection head to obtain four sets of parameters: Heatmap, Size, Offset, and Angle.
$$P_2 = F(I) \in \mathbb{R}^{N \times \frac{H}{s} \times \frac{W}{s}},$$
$$A_2 = K(P_2) \in \mathbb{R}^{N \times \frac{H}{s} \times \frac{W}{s}},$$
$$H = C_{1,1}(C_{3,3}(A_2)) \in \mathbb{R}^{C \times \frac{H}{s} \times \frac{W}{s}},$$
$$S = C_{3,3}(C_{3,3}(A_2)) \in \mathbb{R}^{10 \times \frac{H}{s} \times \frac{W}{s}},$$
$$O = C_{1,1}(C_{3,3}(A_2)) \in \mathbb{R}^{2 \times \frac{H}{s} \times \frac{W}{s}},$$
$$A = C_{1,1}(C_{3,3}(A_2)) \in \mathbb{R}^{1 \times \frac{H}{s} \times \frac{W}{s}},$$
where $F$ denotes the feature extraction network composed of backbone and neck, $K$ denotes the KAM, $C_{1,1}$ denotes the $1 \times 1$ convolution operation, $C_{3,3}$ denotes the $3 \times 3$ convolution operation, $s$ is the downsampling factor, $N$ is the number of channels, and $C$ is the number of object categories, depending on the dataset.
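To make the decoding step concrete, the following is a minimal PyTorch sketch of a detection head with the four branches defined above. The module and parameter names (e.g., DetectionHead, num_classes, the intermediate channel width) are illustrative assumptions rather than the authors' code.

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Sketch of the four decoding branches (Heatmap, Size, Offset, Angle).

    Each branch applies a 3x3 conv on the KAM-enhanced feature A2, followed by a
    1x1 conv (the Size branch uses a second 3x3 conv, matching Formula (1))."""

    def __init__(self, in_channels: int = 256, num_classes: int = 15):
        super().__init__()

        def branch(out_channels, last_kernel=1):
            pad = last_kernel // 2
            return nn.Sequential(
                nn.Conv2d(in_channels, in_channels, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_channels, out_channels, last_kernel, padding=pad),
            )

        self.heatmap = branch(num_classes)     # C x H/s x W/s
        self.size = branch(10, last_kernel=3)  # [tl, bl, br, tr, w, h]
        self.offset = branch(2)                # center-point offset
        self.angle = branch(1)                 # horizontal vs. rotated flag

    def forward(self, a2: torch.Tensor):
        return {
            "heatmap": torch.sigmoid(self.heatmap(a2)),  # confidences in (0, 1)
            "size": self.size(a2),
            "offset": self.offset(a2),
            "angle": torch.sigmoid(self.angle(a2)),
        }

# Usage: outputs = DetectionHead()(torch.randn(1, 256, 152, 152))
```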
We use Heatmap to represent the position information of each object. Specifically, $H \in \mathbb{R}^{C \times \frac{H}{s} \times \frac{W}{s}}$ means that $H$ has $C$ channels, one per object category in the dataset. The value of Heatmap at each position represents the confidence that this position is the center of an object of that category. Since a sigmoid activation function is added to the last layer of the network output, each value in Heatmap is guaranteed to lie in (0, 1). Assuming that the center of a certain bounding box is $c = (c_x, c_y)$, according to [36], we can place a two-dimensional Gaussian distribution at the center point position to generate the final ground truth $\hat{H}$, specifically,
$$\hat{H}_{xyc} = \exp\left(-\frac{(h_x - c_x)^2 + (h_y - c_y)^2}{2\sigma^2}\right)$$
where $x$, $y$ and $c$ index the three dimensions: abscissa, ordinate and category, respectively. Moreover, in the Heatmap branch, the features after the $3 \times 3$ convolution operation are fed into the PCLM so that the features of different categories are separated as far as possible (preferably orthogonal) in the feature space. This benefits the training of the Heatmap branch and reduces the probability of confusing different categories of objects as much as possible.
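As an illustration, the following sketch renders the Gaussian ground-truth heatmap $\hat{H}$ for a set of object centers. The radius-to-$\sigma$ rule ($\sigma \approx$ radius/3, as in CornerNet-style implementations) is an assumption here, not a value specified by the authors.

```python
import numpy as np

def gaussian_heatmap(centers, radii, num_classes, height, width):
    """Render Hhat[c, y, x] = exp(-((x-cx)^2 + (y-cy)^2) / (2*sigma^2)) per object.

    centers: list of (cx, cy, class_id) in feature-map coordinates.
    radii:   per-object radius used to set sigma (sigma = radius / 3, an assumption).
    """
    heatmap = np.zeros((num_classes, height, width), dtype=np.float32)
    ys, xs = np.mgrid[0:height, 0:width]
    for (cx, cy, cls), radius in zip(centers, radii):
        sigma = max(radius / 3.0, 1.0)
        gauss = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
        # Keep the element-wise maximum when Gaussians of the same class overlap.
        heatmap[cls] = np.maximum(heatmap[cls], gauss)
    return heatmap

# Example: one "plane" (class 0) centered at (40, 60) on a 152 x 152 feature map.
hm = gaussian_heatmap([(40, 60, 0)], [8], num_classes=15, height=152, width=152)
```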
We use Size to represent the size information of each object. Specifically, according to [16], it is defined as
$$\hat{S} = [tl,\ bl,\ br,\ tr,\ w,\ h] \in \mathbb{R}^{10 \times \frac{H}{s} \times \frac{W}{s}}.$$
As shown in Figure 2, $tl$, $bl$, $br$ and $tr$ are the four vectors formed by the central point of the object and the midpoints of the four edges of the annotation box, also called boundary-aware vectors; $w$ and $h$ are the width and height of the minimum horizontal enclosing rectangle of the annotation box; and $t$, $l$, $b$ and $r$ are the top, left, bottom and right vertices of the annotation box, respectively. When the detected object is close to a horizontal box, $tl$, $bl$, $br$ and $tr$ cannot be defined accurately; in this boundary case, only the two parameters $w$ and $h$ are needed to represent the horizontal box. Therefore, we also need a parameter indicating whether an object is rotated, namely Angle. For the Size branch, smooth L1 loss is used to train the model, specifically as follows
$$L_s = \frac{1}{N}\sum_{i=1}^{N} \mathrm{SmoothL1}\left(s_i - \hat{s}_i\right)$$
where
$$\mathrm{SmoothL1}(x) = \begin{cases} 0.5 x^2, & \text{if } |x| < 1 \\ |x| - 0.5, & \text{if } |x| \ge 1. \end{cases}$$
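For reference, the Size branch loss can be computed directly with PyTorch's built-in smooth L1 (its default beta of 1 matches the definition above); masking out non-center locations is an implementation detail assumed here, not taken from the authors' code.

```python
import torch
import torch.nn.functional as F

def size_loss(pred, target, center_mask):
    """Smooth L1 over the 10 size parameters, averaged over the positive (center) locations.

    pred, target: (B, 10, H, W) tensors; center_mask: (B, 1, H, W) float mask, 1 at object centers.
    """
    num_pos = center_mask.sum().clamp(min=1.0)
    loss = F.smooth_l1_loss(pred, target, reduction="none", beta=1.0)
    return (loss * center_mask).sum() / num_pos
```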
We use the Offset parameter to represent the offset of the central point of the object. Since the size of the feature map is $\frac{1}{s}$ of the original image, the quantization of the center point introduces some error, and the Offset parameter is needed to compensate for this loss of accuracy in the Heatmap parameter. Specifically, for $O \in \mathbb{R}^{2 \times \frac{H}{s} \times \frac{W}{s}}$, assuming that the central point coordinates of the object before quantization are $c = (c_x, c_y)$, the ground truth of the Offset parameter is
$$\hat{O} = \left(\frac{c_x}{s} - \left\lfloor \frac{c_x}{s} \right\rfloor,\ \frac{c_y}{s} - \left\lfloor \frac{c_y}{s} \right\rfloor\right).$$
For the Offset branch, Smooth L1 Loss is utilized to train the model, specifically as follows
$$L_o = \frac{1}{N}\sum_{i=1}^{N} \mathrm{SmoothL1}\left(o_i - \hat{o}_i\right).$$
We use the Angle parameter to indicate whether the object has a horizontal or a rotated annotation box. This parameter is mainly used in conjunction with the Size parameter. Specifically,
$$\hat{A} = \begin{cases} 1, & \mathrm{IoU}(B, B') < 0.95 \\ 0, & \mathrm{IoU}(B, B') \ge 0.95. \end{cases}$$
As shown in Figure 3, $B$ is the annotation box and $B'$ is the minimum horizontal enclosing rectangular box of $B$. For the Angle branch, binary cross entropy loss is adopted to train the model, specifically as follows
$$L_a = -\frac{1}{N}\sum_{i=1}^{N}\left(\hat{\alpha}_i \log(\alpha_i) + (1 - \hat{\alpha}_i)\log(1 - \alpha_i)\right).$$
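A small sketch of how the Offset and Angle ground truths can be derived from a rotated box $(x, y, w, h, \theta)$. Since the rotated box lies inside its minimum horizontal enclosing rectangle, the IoU in the Angle rule reduces to an area ratio; the helper names are illustrative.

```python
import numpy as np

def offset_target(cx, cy, s):
    """Offset GT: fractional part of the center coordinates after downsampling by s."""
    return (cx / s - np.floor(cx / s), cy / s - np.floor(cy / s))

def angle_target(w, h, theta, thresh=0.95):
    """Angle GT: 1 if the rotated box differs noticeably from its minimum horizontal
    enclosing rectangle (IoU < thresh), else 0. Because the rotated box is contained
    in that rectangle, IoU = (w * h) / (W_enclose * H_enclose)."""
    c, s = np.abs(np.cos(theta)), np.abs(np.sin(theta))
    w_enc = w * c + h * s   # width of the minimum horizontal enclosing rectangle
    h_enc = w * s + h * c   # height of the enclosing rectangle
    iou = (w * h) / (w_enc * h_enc)
    return 1.0 if iou < thresh else 0.0

# Example: a 100 x 20 box rotated by 30 degrees is clearly "rotated" (label 1).
print(offset_target(123.0, 77.0, s=4), angle_target(100, 20, np.radians(30)))
```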

3.2. Keypoint Attention Module

Considering the low foreground information utilization of one-stage oriented object detection methods, the KAM is proposed, as shown in Figure 4. It consists of a channel attention module (shown in Figure 5) and a spatial attention module (shown in Figure 6). The channel attention module reconstructs the original features by reweighting them along the channel dimension, which allows the model to pay attention to the more important channels. In the spatial attention module, the original input features are first fed into three dilated blocks to obtain the temp feature, where a dilated block is composed of a series of convolutions and residual structures containing dilated convolution, which effectively enlarges the receptive field and extracts spatial information more effectively [37]. Then, the temp feature is fed into a $1 \times 1$ convolution to obtain the Elliptic Gauss map. The Elliptic Gauss map is constrained by a soft label, which is an $\frac{H}{s} \times \frac{W}{s}$ matrix, as shown in the right part of Figure 7; the matrix is set to 1 at the center of each object and gradually decreases from 1 to 0 from the center of the object to its boundary, indicating the distance from the center. At the same time, the value at a certain point also indicates the probability that the point is the center of an object. Therefore, a soft label is a two-dimensional matrix formed by adding elliptic Gaussian distributions at several different positions. Compared with a hard 0/1 label, a soft label is more suitable for our keypoint-based object detection method: the keypoints of an object should have a larger weight than the other points, rather than all points sharing the same weight. Specifically, a rotated box labeled as $(x, y, w, h, \theta)$ is converted into a two-dimensional Gaussian distribution $N(\mu, \Sigma)$,
$$\mu = (x, y)^T,$$
$$\Sigma^{1/2} = R \Lambda R^T = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} w/2 & 0 \\ 0 & h/2 \end{pmatrix} \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix}.$$
The soft label $\hat{a}$ can then be obtained by using the following formula,
$$f_i(X) = \frac{1}{2\pi |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(X - \mu)^T \Sigma^{-1} (X - \mu)\right),$$
$$\hat{a} = \sum_{i=1}^{N} f_i(X)$$
where $N$ denotes the number of annotation boxes in an image. Assuming that the predicted Elliptic Gauss map is $a$, smooth L1 loss is used to train this branch, specifically as follows
$$L_{att} = \mathrm{SmoothL1}(a - \hat{a}).$$
Finally, we multiply Temp Feature and Input Feature element by element to obtain the keypoint-enhanced feature Output Feature. The keypoint we use is the center point of the object.
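The soft label construction can be sketched as follows: each rotated box is converted to a 2D Gaussian via $\mu$ and $\Sigma^{1/2} = R\Lambda R^T$, evaluated on the feature-map grid, and the per-box maps are accumulated. Peak normalization to 1 is added here to match the description that the label equals 1 at each center; treat that, and the function names, as assumptions about the implementation.

```python
import numpy as np

def box_to_gaussian(x, y, w, h, theta):
    """Convert a rotated box (x, y, w, h, theta) to (mu, Sigma) of a 2D Gaussian."""
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    S = np.diag([w / 2.0, h / 2.0])
    sigma_half = R @ S @ R.T          # Sigma^(1/2) = R Lambda R^T
    return np.array([x, y]), sigma_half @ sigma_half

def soft_label(boxes, height, width):
    """Sum of elliptic Gaussians, one per rotated box, peak-normalized to 1."""
    ys, xs = np.mgrid[0:height, 0:width]
    grid = np.stack([xs, ys], axis=-1).astype(np.float32)   # (H, W, 2)
    label = np.zeros((height, width), dtype=np.float32)
    for (x, y, w, h, theta) in boxes:
        mu, sigma = box_to_gaussian(x, y, w, h, theta)
        inv = np.linalg.inv(sigma)
        d = grid - mu
        maha = np.einsum("hwi,ij,hwj->hw", d, inv, d)        # (X-mu)^T Sigma^-1 (X-mu)
        label += np.exp(-0.5 * maha)                         # value 1 at the box center
    return np.clip(label, 0.0, 1.0)

# Example: one 60 x 20 box at (76, 76) rotated by 45 degrees on a 152 x 152 map.
mask = soft_label([(76, 76, 60, 20, np.radians(45))], 152, 152)
```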

3.3. Prototype Contrastive Learning Module

To increase the distance between similar foreground features in feature space, the PCLM is proposed; the top part of Figure 2 shows the PCLM. The feature $A_2 \in \mathbb{R}^{N \times \frac{H}{s} \times \frac{W}{s}}$ is obtained as described in Section 3.1. In the Heatmap branch, the $A_2$ feature is fed into a $3 \times 3$ convolution layer to obtain the classification feature $C$:
$$C = C_{3,3}(A_2) \in \mathbb{R}^{N \times \frac{H}{s} \times \frac{W}{s}}.$$
Based on the feature $C$, we sample the features corresponding to different categories of objects within the same batch during the training phase, and then calculate the sample center of each category, called the prototype. Since the ground truth $\hat{H}$ of the Heatmap is not a binary 0/1 tensor, but we only need the vectors at the object centers to calculate the prototypes, we first binarize $\hat{H}$,
$$\hat{H}_{xyc} = \begin{cases} 0, & \text{if } \hat{H}_{xyc} \neq 1 \\ 1, & \text{if } \hat{H}_{xyc} = 1. \end{cases}$$
Then, we calculate the prototype of each category $c_i$ by using the following formula,
$$f_{c_i} = \ell_2\left(\sum_{x, y} \hat{H}_{x y c_i}\, C\right)$$
where the sum of the feature vectors at the positions corresponding to all sample centers of each category is calculated, followed by $\ell_2$ normalization.
Prototype contrastive loss is adopted to constrain the prototypes $f_{c_i}$,
$$L_{pcl} = -\frac{1}{C}\sum_{i=1}^{C} \log\left(\frac{\exp\left(f_{c_i} \cdot f_{c_i} / \tau\right)}{Z_i}\right),$$
$$Z_i = \sum_{k=1}^{C} \exp\left(f_{c_i} \cdot f_{c_k} / \tau\right),$$
where we adopt the inner product of two prototype vectors to measure their similarity in the feature space.
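A minimal PyTorch sketch of the prototype contrastive loss, assuming the binarized heatmap $\hat{H}$ and the classification feature $C$ defined above are available per image; the temperature value and the handling of absent classes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def prototype_contrastive_loss(feat, h_bin, tau=0.1):
    """Prototype contrastive loss L_pcl.

    feat:  (N, H, W) classification feature from the 3x3 conv of the Heatmap branch.
    h_bin: (num_classes, H, W) binarized heatmap (1 only at object centers).
    """
    # Prototype of each class: l2-normalized sum of the center feature vectors.
    protos = torch.einsum("chw,nhw->cn", h_bin, feat)        # (num_classes, N)
    protos = F.normalize(protos, dim=1)

    present = h_bin.flatten(1).sum(dim=1) > 0                # classes with centers in the batch
    logits = protos @ protos.t() / tau                       # pairwise similarities / tau
    # -log( exp(f_ci . f_ci / tau) / sum_k exp(f_ci . f_ck / tau) )
    loss = -torch.diag(F.log_softmax(logits, dim=1))
    return loss[present].mean() if present.any() else logits.sum() * 0.0

# Usage: loss = prototype_contrastive_loss(torch.randn(256, 152, 152), h_bin)
```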

3.4. Equalized Modulation Focal Loss

On the Heatmap branch, in order to reduce the influence of positive and negative sample (foreground and background) imbalance on model training, the EMFL is proposed, which is an improved version of variant focal loss. The specific loss function is shown below,
$$L_h = -\frac{1}{N}\sum_{c}\sum_{i} \begin{cases} (1 - p_i)^{\gamma_c (1 - p_i)} \log(p_i), & \text{if } \hat{p}_i = 1 \\ (1 - \hat{p}_i)^{\beta}\, p_i^{\gamma_c} \log(1 - p_i), & \text{if } \hat{p}_i < 1 \end{cases}$$
where $N$ is the number of positive samples, $p_i$ represents the confidence value predicted by the network, and $\hat{p}_i$ represents the ground truth of the Heatmap. $\gamma_c$ is a variable related only to the number of samples of each object category. To compensate for categories that suffer from sample scarcity during training, we use $\gamma_c$ for balancing: for a category with a larger number of samples we choose a larger $\gamma_c$, and for a category with a smaller number of samples we choose a smaller $\gamma_c$. Focal loss [19] reduces the weight of simple samples and makes the model focus more on the information of difficult samples. However, when the model reaches the later stage of training, most positive samples become simple samples, whereas the negative samples are relatively irregular and it is difficult for the model to assign them high classification scores. Therefore, we need to improve the utilization of positive sample information in the later stage of training. For the positive sample part, we add a modulation factor $1 - p_i$ to the exponent, which makes the loss behave more like focal loss [19] in the early training period and more like cross entropy loss in the late training period, increasing the proportion of positive samples late in training; for negative samples, we keep the same form as the variant focal loss, using $(1 - \hat{p}_i)^{\beta}$ to reduce the weight of negative samples around the sample centers. In the experiments presented in this paper, $\gamma_c \in [2, 3]$ is selected, with $\beta = 4$. Figure 8 shows a comparison of several loss functions.
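The following is a sketch of the EMFL under the reading above (positive-sample exponent $\gamma_c(1-p)$, CenterNet-style negative term); the clamping epsilon and tensor shapes are implementation details assumed here.

```python
import torch

def emfl(pred, target, gamma_c, beta=4.0, eps=1e-6):
    """Equalized modulation focal loss on the Heatmap branch.

    pred:    (C, H, W) predicted confidences in (0, 1).
    target:  (C, H, W) Gaussian ground-truth heatmap (equal to 1 exactly at centers).
    gamma_c: (C,) per-class exponent in [2, 3], larger for classes with more samples.
    """
    pred = pred.clamp(eps, 1.0 - eps)
    gamma = gamma_c.view(-1, 1, 1)
    pos = target.eq(1.0).float()
    neg = 1.0 - pos

    # Positive term: (1 - p)^(gamma_c * (1 - p)) * log(p) -- behaves like focal loss
    # early in training (p small) and like cross entropy late in training (p -> 1).
    pos_loss = ((1.0 - pred) ** (gamma * (1.0 - pred))) * torch.log(pred) * pos
    # Negative term: (1 - target)^beta * p^gamma_c * log(1 - p), as in the variant focal loss.
    neg_loss = ((1.0 - target) ** beta) * (pred ** gamma) * torch.log(1.0 - pred) * neg

    num_pos = pos.sum().clamp(min=1.0)
    return -(pos_loss.sum() + neg_loss.sum()) / num_pos
```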
Finally, the total loss function is
$$L = L_h + L_s + L_o + L_a + L_{att} + L_{pcl}.$$

4. Experimental Results and Analysis

4.1. Datasets

The performance of O2DFFE is evaluated on the public datasets: DOTA1.0, DOTA1.5, and the HRSC2016 datasets.
(1)
DOTA: DOTA [38] is a large-scale dataset for remote sensing object detection, with images collected from different sensors and platforms. DOTA1.0 contains 2806 aerial images with various scales, orientations, and object shapes. Image sizes range from 800 × 800 to 4000 × 4000 pixels, with 188,282 instances in 15 common categories: plane (PL), baseball diamond (BD), bridge (BR), ground track field (GTF), small vehicle (SV), large vehicle (LV), ship (SH), tennis court (TC), basketball court (BC), storage tank (ST), soccer-ball field (SBF), roundabout (RA), harbor (HA), swimming pool (SP), and helicopter (HC). DOTA1.5 adds a container crane (CC) class and instances smaller than 10 pixels to version 1.0, and contains 402,089 instances in total. Compared to DOTA1.0, DOTA1.5 is more challenging but also more stable during training. In this paper, we use the training and validation sets for training and the test set for testing. All images were cropped into 608 × 608 patches with a gap of 250 (see the cropping sketch after this list). The multiscale parameters for both DOTA1.0 and DOTA1.5 are {0.8, 1.0, 1.2}. During training, we also used random flipping and random rotation. During testing, the detection results of the cropped images were merged into the final results using non-maximum suppression (NMS) with an IoU threshold of 0.1, and we use multiscale testing as well as flip testing.
(2)
HRSC2016: HRSC2016 [39] is a challenging ship detection dataset annotated with oriented bounding boxes, containing 1061 aerial images ranging in size from 300 × 300 to 1500 × 900. In total, 436, 181, and 444 images are included in the training, validation, and test sets, respectively. We used the training and validation sets for training and the test set for testing. All images were resized to 608 × 608 without changing the aspect ratio. Random flipping and random rotation were used during training.
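Below is a sketch of the sliding-window cropping used to split large DOTA images into 608 × 608 patches. It interprets the 250-pixel gap as the overlap between adjacent patches (stride = 608 − 250 = 358); this interpretation and the function names are assumptions, not taken from the authors' code.

```python
def crop_positions(length, patch=608, gap=250):
    """Start coordinates of patches along one axis, with `gap` pixels of overlap
    between neighbors (stride = patch - gap); the last patch is clamped to the border."""
    stride = patch - gap
    starts = list(range(0, max(length - patch, 0) + 1, stride))
    if starts[-1] + patch < length:          # make sure the image border is covered
        starts.append(length - patch)
    return starts

def crop_image(image, patch=608, gap=250):
    """Yield (x0, y0, patch_array) tuples for a numpy image of shape (H, W, 3)."""
    h, w = image.shape[:2]
    for y0 in crop_positions(h, patch, gap):
        for x0 in crop_positions(w, patch, gap):
            yield x0, y0, image[y0:y0 + patch, x0:x0 + patch]
```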

4.2. Experimental Setup

O2DFFE is implemented with the PyTorch framework and utilizes ResNet [40] as the backbone. The backbone weights were pretrained on the ImageNet dataset, and the rest of the network is initialized with the PyTorch defaults. We trained for approximately 50 epochs on the DOTA dataset and about 100 epochs on the HRSC2016 dataset. We used the Adam optimizer to train the DOTA model, with an initial learning rate (LR) of 0.000125. The network was trained on 8 GeForce RTX 3090 GPUs with a batch size of 64 and tested with a single GeForce RTX 3090 GPU.

4.3. Comparisons with Other Methods

The final evaluation metric of our experiments is mAP, the mean of the per-class AP values, where AP is calculated as $AP = \int_0^1 P \, dR$, with $P$ and $R$ representing precision and recall, respectively.
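For reference, a short sketch of computing AP as the area under the precision-recall curve from ranked detections; it uses the all-point interpolation form, which is one common convention and an assumption here.

```python
import numpy as np

def average_precision(scores, is_true_positive, num_gt):
    """AP = integral of precision over recall, computed from detections ranked by score.

    scores:            (D,) confidence of each detection of one class.
    is_true_positive:  (D,) 1 if the detection matches an unmatched ground truth, else 0.
    num_gt:            number of ground-truth objects of this class.
    """
    order = np.argsort(-np.asarray(scores))
    tp = np.asarray(is_true_positive, dtype=np.float64)[order]
    cum_tp = np.cumsum(tp)
    recall = cum_tp / max(num_gt, 1)
    precision = cum_tp / np.arange(1, len(tp) + 1)
    # Make precision monotonically non-increasing, then integrate over recall steps.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    r = np.concatenate(([0.0], recall))
    p = np.concatenate(([precision[0] if len(precision) else 0.0], precision))
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))

# mAP is then the mean of the per-class APs.
```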
(1) Results on DOTA1.0: O2DFFE was tested on the DOTA1.0 OBB task and compared with other state-of-the-art methods, as shown in Table 1. O2DFFE reaches 77.44 mAP with ResNet101 under single-scale test conditions, exceeding all one-stage and two-stage detection methods. When we adopt multiscale training and test conditions, the mAP reaches 79.93. Although there is still a gap compared with the best two-stage method, Oriented R-CNN [7], it is only 0.59 mAP. Considering that multiscale training and testing already respond well to the confusion between foreground and background, our O2DFFE method performs better in single-scale testing, while the improvement brought by adding multiscale methods is less pronounced than for other methods. Under multiscale training and testing conditions, the model has better detection ability for objects at different scales and, therefore, better discrimination between foreground and background compared to single-scale training and testing.
(2) Results on DOTA1.5: O2DFFE was tested on the DOTA1.5 OBB task and compared with other state-of-the-art methods, as shown in Table 2. O2DFFE achieves an mAP of 69.12 under single-scale test conditions when the backbone is ResNet101, surpassing the other two methods in the table. When we adopt multiscale training and multiscale testing, the mAP reaches 73.79. Although it does not match the two-stage ReDet [6], for reasons similar to those discussed for the DOTA1.0 dataset, our approach still reaches a very good level.
(3) Results on HRSC2016: since the HRSC2016 dataset has only one category, we apply only two of the submodules of O2DFFE, namely KAM and EMFL, when conducting experiments on this dataset. O2DFFE was tested on the HRSC2016 OBB task and compared with other state-of-the-art methods, as shown in Table 3. O2DFFE achieves an mAP of 89.23 when the backbone is ResNet101, and an mAP of 90.54 under multiscale training and test conditions, exceeding all one-stage and two-stage methods.

4.4. Ablation Studies

A series of experiments are performed on the DOTA1.0 dataset to evaluate the effectiveness of the proposed method. We use ResNet50 as the backbone. We only train and test the model on a single scale, without a multiscale, to reduce the interference of the scale change to our method. We use the DOTA1.0 training set for training and the validation set for testing.
As shown in Table 4, the baseline mAP for O2DFFE is 69.22. After replacing the original focal loss with EMFL, the mAP becomes 70.20, an increase of 0.98. When KAMh is added, the mAP changes to 70.48, an increase of 0.28, while with KAMs the mAP changes to 71.26, an increase of 1.06. Since the soft label performs better than the hard label, KAMs is used in all our subsequent experiments. When the PCLM is added, the mAP changes to 72.10, an increase of 0.84. Through the combination of these modules, O2DFFE achieves a very significant performance improvement. Moreover, the three modules do not significantly increase the number of parameters of the model, and the FPS is not significantly reduced; that is, we can keep the model size and inference speed basically unchanged while effectively improving the detection accuracy. This improves the versatility and robustness of our method and makes it easier to transplant into other oriented object detection methods.
To provide a detailed analysis of the effect of each module on the baseline model, we plot the normalized confusion matrix for each ablation experiment on the DOTA1.0 dataset in Figure 9, where a confidence threshold of 0.5 was applied. Comparing Figure 9a,b, it can be seen that the detection accuracy of almost all classes is improved after the introduction of EMFL. Compared with FL, EMFL not only assigns different $\gamma_c$ parameters to different categories, which alleviates the tendency of categories with many samples to dominate training, but also dynamically adjusts the proportion of positive and negative samples in the early and late stages of training; therefore, the accuracy of almost all classes improves. Comparing Figure 9a,c, it can be observed that the confusion between the foreground and background classes decreases after the introduction of the KAM. With the KAM, the keypoint of each object obtains stronger features relative to other points, so in the subsequent decoding part it is easier to judge whether each point belongs to the foreground or the background; this effectively reduces the confusion between foreground and background, and the final detection accuracy is also improved. Comparing Figure 9a,d, it can be seen that the confusion between foreground classes decreases after the introduction of the PCLM. With the PCLM, the features of different categories are separated as far as possible in the feature space, or, equivalently, different channels of the feature map respond differently to different categories; therefore, in the decoding stage, the confusion between similar categories is reduced as much as possible, even if they were originally close in the feature space.

4.5. Visual Analysis

To visually show the effectiveness of the O2DFFE model, Figure 10 presents the visual results of O2DFFE on some DOTA1.0 test set images, from which it can be seen that the O2DFFE model achieves good detection results for objects of different categories. From the detection results, our model performs very well on some small objects and can accurately detect densely arranged objects, such as ships and vehicles. For some similar categories, such as helicopter and plane, our model has good resistance to interference. Further, for some images with complex backgrounds, our model makes a good distinction between foreground and background.
To demonstrate the ability of the PCLM to distinguish between similar categories, t-SNE [41] was used to examine the degree of separation of samples from different categories in the feature space, as shown in Figure 11. Figure 11a shows the t-SNE results of the baseline model, and Figure 11b shows the t-SNE results of the baseline model plus the PCLM. It can be seen that, due to the introduction of the PCLM, samples of different classes are further apart in the feature space. Figure 11 circles the feature distributions of the PL and HC classes, and the interclass distance between them is calculated. It can be seen that applying the PCLM increases this interclass distance, thus improving the classification accuracy.
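The t-SNE inspection in Figure 11 can be reproduced with a short script along the following lines, assuming sampled center features and their class labels are available; the library calls are standard scikit-learn/matplotlib, and the interclass distance is computed between class mean vectors as an illustrative choice.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(features, labels, out_path="tsne.png"):
    """Project sampled foreground features to 2D with t-SNE and color points by class."""
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
    plt.figure(figsize=(6, 6))
    for cls in np.unique(labels):
        pts = emb[labels == cls]
        plt.scatter(pts[:, 0], pts[:, 1], s=4, label=str(cls))
    plt.legend(markerscale=3, fontsize=6)
    plt.savefig(out_path, dpi=200)

def interclass_distance(features, labels, cls_a, cls_b):
    """Distance between the mean feature vectors of two classes (e.g., PL vs. HC)."""
    mu_a = features[labels == cls_a].mean(axis=0)
    mu_b = features[labels == cls_b].mean(axis=0)
    return float(np.linalg.norm(mu_a - mu_b))
```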
Finally, we also show some examples of detection failures on the DOTA1.0 dataset in Figure 12, with the top row showing the detection results and the bottom row showing the ground truth. According to our analysis, the case in Figure 12a appears because the real annotation box contains a large proportion of background and the detection of the keypoints is not accurate enough; in this case, we should be able to improve the results by defining the keypoint of the object more strictly, because the central point of the object is sometimes not its keypoint. Figure 12b appears because the objects are too small and were not annotated, yet the model still detects these small objects. Figure 12c,d show that some backgrounds very similar to a foreground category are still mis-detected; in this case, we could consider adding global spatial attention to the model or using a Transformer backbone [42], so that the model can give more attention to global information and extract more informative foreground features, thus excluding backgrounds that are similar to the foreground.

4.6. Discussion

In this study, we propose a foreground feature enhancement method that can be used for one-stage object detection. We use our model for the oriented object detection task and can achieve better performance than other methods. At a deep level, the KAM module in our method is a global spatial attention mechanism, PCLM belongs to the category of self-supervised learning, and EMFL belongs to the category of equalization strategy. Our method is highly scalable and can be used for other one-stage object detectors.

5. Conclusions

This paper has presented a foreground feature enhancement method for one-stage anchor-free oriented object detectors, which is more robust and easier to migrate to other models than anchor-based methods. The O2DFFE method mainly consists of three parts. The first part is the designed KAM, which is used to enhance the features of the foreground portion of the image and weaken the features of the background portion. The second part is the designed PCLM, which is applied on the classification branch and makes the foreground features of different classes as orthogonal as possible in the feature space; it is utilized to enhance the discrimination of samples between different categories, thereby reducing the confusion of samples between different categories. The third part is the constructed EMFL, which is adopted to improve the learning process of the model for positive and negative samples. The results on the DOTA and HRSC2016 datasets show that the proposed method outperforms existing methods and can achieve detection accuracy comparable to two-stage object detection methods while maintaining a fast computation speed.
In future work, we may try to replace the backbone of the proposed method with a Transformer [42] to build a new object detection framework that makes fuller use of the foreground information, so that we can obtain features with global spatial attention.

Author Contributions

Conceptualization, P.L., X.W. and B.W.; methodology, P.L. and B.W.; software, P.L.; validation, P.L., X.W. and B.W.; formal analysis, P.L. and B.W.; investigation, P.L.; writing—original draft preparation, P.L. and B.W.; writing—review and editing, P.L.; supervision, B.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 61971141.

Data Availability Statement

The data presented in this study are available on reasonable request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar]
  2. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  3. Tian, Z.; Shen, C.; Chen, H.; He, T. Fcos: Fully Convolutional One-Stage Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9627–9636. [Google Scholar]
  4. Liu, L.; Pan, Z.; Lei, B. Learning a Rotation Invariant Detector with Rotatable Bounding Box. arXiv 2017, arXiv:1711.09405. [Google Scholar]
  5. Ding, J.; Xue, N.; Long, Y.; Xia, G.S.; Lu, Q. Learning RoI Transformer for Oriented Object Detection in Aerial Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2849–2858. [Google Scholar]
  6. Han, J.; Ding, J.; Xue, N.; Xia, G.S. Redet: A Rotation-Equivariant Detector for Aerial Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 2786–2795. [Google Scholar]
  7. Xie, X.; Cheng, G.; Wang, J.; Yao, X.; Han, J. Oriented R-CNN for Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 3520–3529. [Google Scholar]
  8. Yang, X.; Yan, J.; Feng, Z.; He, T. R3det: Refined Single-Stage Detector with Feature Refinement for Rotating Object. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; Volume 35, pp. 3163–3171. [Google Scholar]
  9. Yang, X.; Yan, J.; Liao, W.; Yang, X.; Tang, J.; He, T. Scrdet++: Detecting Small, Cluttered and Rotated Objects via Instance-Level Feature Denoising and Rotation Loss Smoothing. IEEE Trans. Pattern Anal. Mach. Intell. 2022. Early Access. [Google Scholar]
  10. Han, J.; Ding, J.; Li, J.; Xia, G.S. Align Deep Features for Oriented Object Detection. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–11. [Google Scholar]
  11. Lin, Y.; Feng, P.; Guan, J.; Wang, W.; Chambers, J. IENet: Interacting Embranchment One Stage Anchor Free Detector for Orientation Aerial Object Detection. arXiv 2019, arXiv:1912.00969. [Google Scholar]
  12. Yang, X.; Yan, J.; Ming, Q.; Wang, W.; Zhang, X.; Tian, Q. Rethinking Rotated Object Detection with Gaussian Wasserstein Distance Loss. In Proceedings of the International Conference on Machine Learning, Chongqing, China, 9–11 July 2021; pp. 11830–11841. [Google Scholar]
  13. Yang, X.; Yang, X.; Yang, J.; Ming, Q.; Wang, W.; Tian, Q.; Yan, J. Learning High-Precision Bounding Box for Rotated Object Detection via Kullback-Leibler Divergence. Adv. Neural Inf. Process. Syst. 2021, 34, 18381–18394. [Google Scholar]
  14. Llerena, J.M.; Zeni, L.F.; Kristen, L.N.; Jung, C. Gaussian Bounding Boxes and Probabilistic Intersection-over-Union for Object Detection. arXiv 2021, arXiv:2106.06072. [Google Scholar]
  15. Chen, Z.; Chen, K.; Lin, W.; See, J.; Yu, H.; Ke, Y.; Yang, C. Piou Loss: Towards Accurate Oriented Object Detection in Complex Environments. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 195–211. [Google Scholar]
  16. Yi, J.; Wu, P.; Liu, B.; Huang, Q.; Qu, H.; Metaxas, D. Oriented Object Detection in Aerial Images with Box Boundary-Aware Vectors. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 2150–2159. [Google Scholar]
  17. Zhao, P.; Qu, Z.; Bu, Y.; Tan, W.; Guan, Q. Polardet: A Fast, More Precise Detector for Rotated Target in Aerial Images. Int. J. Remote Sens. 2021, 42, 5831–5861. [Google Scholar]
  18. Wang, J.; Yang, L.; Li, F. Predicting Arbitrary-Oriented Objects as Points in Remote Sensing Images. Remote Sens. 2021, 13, 3731. [Google Scholar]
  19. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  20. Zhang, Z.; Guo, W.; Zhu, S.; Yu, W. Toward Arbitrary-Oriented Ship Detection with Rotated Region Proposal and Discrimination Networks. IEEE Geosci. Remote Sens. Lett. 2018, 15, 1745–1749. [Google Scholar]
  21. Yang, X.; Sun, H.; Fu, K.; Yang, J.; Sun, X.; Yan, M.; Guo, Z. Automatic Ship Detection in Remote Sensing Images from Google Earth of Complex Scenes Based on Multiscale Rotation Dense Feature Pyramid Networks. Remote Sens. 2018, 10, 132. [Google Scholar]
  22. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  23. Yang, X.; Sun, H.; Sun, X.; Yan, M.; Guo, Z.; Fu, K. Position Detection and Direction Prediction for Arbitrary-Oriented Ships via Multitask Rotation Region Convolutional Neural Network. IEEE Access 2018, 6, 50839–50849. [Google Scholar]
  24. Liao, M.; Zhu, Z.; Shi, B.; Xia, G.S.; Bai, X. Rotation-Sensitive Regression for Oriented Scene Text Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5909–5918. [Google Scholar]
  25. Xu, Y.; Fu, M.; Wang, Q.; Wang, Y.; Chen, K.; Xia, G.S.; Bai, X. Gliding Vertex on the Horizontal Bounding Box for Multi-Oriented Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 1452–1459. [Google Scholar]
  26. Xiao, Z.; Qian, L.; Shao, W.; Tan, X.; Wang, K. Axis Learning for Orientated Objects Detection in Aerial Images. Remote Sens. 2020, 12, 908. [Google Scholar]
  27. Zhou, L.; Wei, H.; Li, H.; Zhao, W.; Zhang, Y.; Zhang, Y. Arbitrary-Oriented Object Detection in Remote Sensing Images Based on Polar Coordinates. IEEE Access 2020, 8, 223373–223384. [Google Scholar]
  28. Zhou, X.; Wang, D.; Krähenbühl, P. Objects as Points. arXiv 2019, arXiv:1904.07850. [Google Scholar]
  29. Lang, S.; Ventola, F.; Kersting, K. DAFNe: A One-Stage Anchor-Free Approach for Oriented Object Detection. arXiv 2021, arXiv:2109.06148. [Google Scholar]
  30. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum Contrast for Unsupervised Visual Representation Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9729–9738. [Google Scholar]
  31. Chen, X.; Fan, H.; Girshick, R.; He, K. Improved Baselines with Momentum Contrastive Learning. arXiv 2020, arXiv:2003.04297. [Google Scholar]
  32. Zhu, R.; Zhao, B.; Liu, J.; Sun, Z.; Chen, C.W. Improving Contrastive Learning by Visualizing Feature Transformation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10306–10315. [Google Scholar]
  33. Xie, E.; Ding, J.; Wang, W.; Zhan, X.; Xu, H.; Sun, P.; Luo, P. Detco: Unsupervised Contrastive Learning for Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 8392–8401. [Google Scholar]
  34. Xie, Z.; Lin, Y.; Zhang, Z.; Cao, Y.; Lin, S.; Hu, H. Propagate Yourself: Exploring Pixel-Level Consistency for Unsupervised Visual Representation Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 16684–16693. [Google Scholar]
  35. Wang, X.; Gao, J.; Long, M.; Wang, J. Self-Tuning for Data-Efficient Deep Learning. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 10738–10748. [Google Scholar]
  36. Law, H.; Deng, J. Cornernet: Detecting Objects as Paired Keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 734–750. [Google Scholar]
  37. Chen, Q.; Wang, Y.; Yang, T.; Zhang, X.; Cheng, J.; Sun, J. You Only Look One-Level Feature. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 13039–13048. [Google Scholar]
  38. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A Large-Scale Dataset for Object Detection in Aerial Images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3974–3983. [Google Scholar]
  39. Liu, Z.; Yuan, L.; Weng, L.; Yang, Y. A High Resolution Optical Satellite Image Dataset for Ship Recognition and Some New Baselines. In Proceedings of the International Conference on Pattern Recognition Applications and Methods, Porto, Portugal, 24–26 February 2017; SciTePress: Setúbal, Portugal, 2017; Volume 2, pp. 324–331. [Google Scholar]
  40. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  41. Van der Maaten, L.; Hinton, G. Visualizing Data Using T-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
  42. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Polosukhin, I. Attention Is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
Figure 1. Remote sensing images with (a) densely packed vehicles, (b) color distortion, cloud and fog noise, (c) complex background, and (d) high similarity between helicopters and planes.
Figure 2. The overall architecture of the O2DFFE detection network.
Figure 3. The definition of the Size parameter.
Figure 4. Illustration of the KAM.
Figure 5. Channel attention module.
Figure 6. Spatial attention module.
Figure 7. Input image and soft label.
Figure 8. Comparison of the losses (focal loss: $L = -(1-p)^{\gamma}\log(p)$; EMFL: $L = -(1-p)^{\gamma(1-p)}\log(p)$; cross entropy loss: $L = -\log(p)$).
Figure 9. Normalized confusion matrix for ablation experiment on the DOTA1.0 dataset. (a) Baseline. (b) Baseline + EMFL. (c) Baseline + KAM. (d) Baseline + PCLM.
Figure 10. O2DFFE visualization results on the DOTA1.0 OBB task.
Figure 11. Visualization using t-SNE on DOTA1.0 dataset. (a) Baseline. (b) Baseline + PCLM.
Figure 12. Examples of the O2DFFE model detection failures on the DOTA1.0 dataset. (a) Label definition is inaccurate. (b) Small objects are detected. (c,d) Confusion resulting from complex backgrounds.
Table 1. Evaluation results on DOTA1.0 datasets. R50 denotes ResNet50. R101 denotes ResNet101. R152 denotes ResNet152. H104 denotes Hourglass104. ReR50 denotes ReResNet50. * denotes multiscale training and multiscale testing. Bold and underlined data denotes best performance. Bold data denotes second best performance.
Methods | Backbone | PL | BD | BR | GTF | SV | LV | SH | TC | BC | ST | SBF | RA | HA | SP | HC | mAP
Two-stage:
RoI-Trans [5] | R101 | 88.6 | 78.5 | 43.4 | 75.9 | 68.8 | 73.6 | 83.5 | 90.7 | 77.2 | 81.4 | 58.3 | 53.5 | 62.8 | 58.9 | 47.6 | 69.56
ReDet [6] | ReR50 | 88.7 | 82.6 | 53.9 | 74.0 | 78.1 | 84.0 | 88.0 | 90.8 | 87.7 | 85.7 | 61.7 | 60.3 | 75.9 | 68.0 | 63.5 | 76.25
Oriented R-CNN [7] | R101 | 88.8 | 83.4 | 55.2 | 76.9 | 74.2 | 82.1 | 87.5 | 90.9 | 85.5 | 85.3 | 65.5 | 66.8 | 74.3 | 70.1 | 57.2 | 76.28
Oriented R-CNN * [7] | R101 | 90.2 | 84.7 | 62.0 | 80.4 | 79.0 | 85.0 | 88.5 | 90.8 | 87.2 | 87.9 | 72.2 | 70.0 | 82.9 | 78.4 | 68.0 | 80.52
One-stage:
PIoU [15] | DLA-34 | 80.9 | 69.7 | 24.1 | 60.2 | 38.3 | 64.4 | 64.8 | 90.9 | 77.2 | 70.4 | 46.5 | 37.1 | 57.1 | 61.9 | 64.0 | 60.50
IENet [11] | R101 | 88.1 | 71.3 | 34.2 | 51.7 | 63.7 | 65.6 | 71.6 | 90.1 | 71.0 | 73.6 | 37.6 | 41.5 | 48.0 | 60.5 | 49.5 | 61.24
ProbIoU [14] | R50 | 89.0 | 72.1 | 46.9 | 62.2 | 75.7 | 74.7 | 86.6 | 89.5 | 78.3 | 83.1 | 55.8 | 64.0 | 65.5 | 65.4 | 46.2 | 70.04
R3Det [8] | R101 | 88.7 | 83.0 | 50.9 | 67.2 | 76.2 | 80.3 | 86.7 | 90.7 | 84.6 | 83.2 | 61.9 | 61.3 | 66.9 | 70.6 | 53.9 | 73.79
SCRDet++ [9] | R152 | 89.2 | 83.3 | 50.9 | 68.1 | 71.6 | 80.2 | 78.5 | 90.8 | 86.0 | 84.0 | 65.9 | 60.8 | 68.8 | 71.3 | 66.2 | 74.41
S2ANet [10] | R50 | 89.1 | 82.8 | 48.3 | 71.1 | 78.1 | 78.3 | 87.2 | 90.8 | 84.9 | 85.6 | 60.3 | 62.6 | 65.2 | 69.1 | 57.9 | 74.12
CenterRot [18] | R152 | 89.6 | 81.4 | 51.1 | 68.8 | 78.7 | 81.4 | 87.2 | 90.8 | 80.3 | 84.2 | 56.1 | 64.2 | 75.8 | 74.6 | 56.5 | 74.75
PolarDet [17] | R101 | 89.7 | 87.0 | 45.3 | 63.3 | 78.4 | 76.6 | 87.1 | 90.7 | 80.5 | 85.8 | 60.9 | 67.9 | 68.2 | 74.6 | 68.6 | 75.02
KLD [13] | R50 | 88.9 | 83.7 | 50.1 | 68.7 | 78.2 | 76.0 | 84.5 | 89.4 | 86.1 | 85.2 | 63.1 | 60.9 | 75.0 | 71.5 | 67.4 | 75.28
BBAV [16] | R101 | 88.6 | 84.0 | 52.1 | 69.5 | 78.2 | 80.4 | 88.0 | 80.9 | 87.2 | 86.3 | 56.1 | 65.6 | 67.1 | 72.0 | 63.9 | 75.36
DAFNe [29] | R101 | 89.4 | 86.2 | 53.7 | 60.5 | 82.0 | 81.1 | 88.6 | 90.3 | 83.8 | 87.2 | 53.9 | 69.3 | 75.6 | 81.2 | 70.8 | 76.95
O2DFFE (Ours) | R50 | 89.7 | 87.0 | 53.8 | 63.2 | 82.0 | 85.4 | 88.5 | 90.6 | 89.1 | 86.7 | 64.8 | 63.4 | 75.7 | 73.7 | 65.2 | 77.30
O2DFFE (Ours) | R101 | 89.8 | 87.0 | 54.0 | 63.3 | 81.5 | 85.5 | 88.5 | 90.8 | 89.1 | 86.6 | 65.0 | 63.4 | 76.4 | 74.0 | 67.0 | 77.44
O2DFFE * (Ours) | R101 | 90.2 | 83.8 | 57.3 | 80.9 | 79.3 | 84.3 | 88.6 | 90.8 | 85.7 | 86.8 | 70.6 | 70.6 | 79.2 | 76.0 | 74.4 | 79.93
Table 2. Evaluation results on DOTA1.5 datasets. R101 denotes ResNet101. ReR50 denotes ReResNet50. * denotes multiscale training and multiscale testing.
Methods | Backbone | PL | BD | BR | GTF | SV | LV | SH | TC | BC | ST | SBF | RA | HA | SP | HC | CC | mAP
Two-stage:
ReDet [6] | ReR50 | 79.2 | 82.8 | 51.9 | 71.4 | 52.3 | 75.7 | 80.9 | 90.8 | 75.8 | 68.6 | 49.2 | 72.3 | 73.3 | 70.5 | 63.3 | 11.5 | 66.86
ReDet * [6] | ReR50 | 88.5 | 86.4 | 61.2 | 81.2 | 67.6 | 83.6 | 90.0 | 90.8 | 84.3 | 75.3 | 71.4 | 72.6 | 78.3 | 74.7 | 76.1 | 46.9 | 76.80
One-stage:
DAFNe [29] | R101 | 80.6 | 86.3 | 52.1 | 62.8 | 67.0 | 76.7 | 88.9 | 90.8 | 77.2 | 83.4 | 51.7 | 74.0 | 75.9 | 75.7 | 72.4 | 34.8 | 71.99
O2DFFE (Ours) | R101 | 83.8 | 80.6 | 49.3 | 71.6 | 61.7 | 76.7 | 85.3 | 86.8 | 80.1 | 80.2 | 59.5 | 9.0 | 69.2 | 72.1 | 65.6 | 13.7 | 69.12
O2DFFE * (Ours) | R101 | 80.8 | 83.8 | 53.3 | 76.2 | 66.8 | 82.5 | 89.6 | 90.8 | 80.1 | 84.2 | 61.7 | 72.9 | 76.2 | 75.2 | 70.0 | 35.8 | 73.79
Table 3. Evaluation results on HRSC2016 datasets. R50 denotes ResNet50. R101 denotes ResNet101. ReR50 denotes ReResNet50. * denotes multiscale training and multiscale testing. Bold and underlined data denotes best performance. Bold data denotes second best performance.
Methods | Backbone | mAP
Two-stage:
RoI-Trans [5] | R101 | 86.20
ReDet [6] | ReR50 | 90.46
Oriented R-CNN [7] | R101 | 90.50
One-stage:
IENet [11] | R101 | 75.01
ProbIoU [14] | R50 | 87.09
BBAV [16] | R101 | 88.60
R3Det [8] | R101 | 89.26
DAFNe [29] | R50 | 89.76
S2ANet [10] | R50 | 90.17
CenterRot [18] | R50 | 90.20
PolarDet [17] | R101 | 90.46
O2DFFE (Ours) | R101 | 89.23
O2DFFE * (Ours) | R101 | 90.54
Table 4. Results of ablation studies on DOTA datasets. KAMh means KAM with hard label and KAMs means KAM with soft label.
Method | FL | EMFL | KAMh | KAMs | PCLM | mAP | Params | FPS
O2DFFE | ✓ |   |   |   |   | 69.22 | 611.02M | 15.68
O2DFFE |   | ✓ |   |   |   | 70.20 | 611.02M | 15.57
O2DFFE |   | ✓ | ✓ |   |   | 70.48 | 644.54M | 14.82
O2DFFE |   | ✓ |   | ✓ |   | 71.26 | 644.54M | 14.84
O2DFFE |   | ✓ |   | ✓ | ✓ | 72.10 | 644.54M | 14.73