1. Introduction
Ship detection is a widely studied topic in remote sensing. The precise identification of ships in remote sensing images is a crucial task in the field of target recognition. Typically, ships are located offshore or near coasts, so the surrounding imagery tends to be highly similar. Accurately determining the size, position, and orientation of ships therefore remains a daunting challenge, owing to the intricate nature of remote sensing scenes and the varying sizes of ships. With the development of imaging hardware, remote sensing images have attained higher resolution, so ship detection based on remote sensing images has been widely studied in various fields such as marine supervision [
1,
2,
3], port traffic flow [
4,
5], ship reconnaissance and statistics [
6,
7], etc.
For traditional ship detection methods [
1,
8], the common problems are poor robustness, a high missed-detection rate in complex scenes, and low detection accuracy. With the development of deep learning in various applications, deep learning-based methods [
9,
10,
11] can provide automated, high-precision results for target detection in remote sensing images. Due to the specificity of remote sensing, some open problems remain: (1) the bounding box may contain much background in the selected area and cannot accurately represent the position and direction of ships [
12]; (2) false detection because of the small and densely distributed ships [
6]; (3) difficulty of recognizing multi-scale ships [
13]. Aiming at these problems, some anchor-based methods have been developed [
2,
13,
14]. However, to ensure satisfactory detection accuracy, these methods usually require the manual design of pre-selected boxes according to the actual scene, hindering their applications in practice because of a large number of hyper-parameters and high computational complexity.
To solve the problems of anchor-based detection methods, anchor-free methods have been further proposed [
6,
15,
16,
17], which do not require anchor parameters and can directly predict the class and location information of objects. Thus, these methods avoid anchor-related hyper-parameters and reduce computation, and have gradually attracted extensive attention from researchers. The CenterNet models [
6,
9,
16] have been built based on the center point of the object as a positive sample point to represent an oriented object, as shown in
Figure 1a. However, representing a target by only a single point ignores the shape characteristics of the oriented target. As a result, these methods fail to balance the numbers of positive and negative samples well, which also causes some positive locations to be misjudged as negative samples.
Aiming at the misjudgment problem of balancing the number of positive and negative samples, dense points-based methods are further proposed to increase the number of positive samples to achieve better detection performance. Zinelli et al. [
18] proposed a dense point method to represent and predict oriented objects, thus optimizing the imbalance of positive and negative samples to some extent. Nevertheless, the number of dense points varies with the scale of the object, as shown in
Figure 1b. Specifically, small objects usually contain fewer dense points than large ones, so the contribution of small objects to the loss function is small and easily ignored by the optimizer during training. Moreover, positive sample points at different locations affect the prediction results of the same object differently: the closer a sample point is to the object center, the richer the object features it extracts and the greater its impact on the prediction. However, these methods cannot suppress background and noise interference well, which may assign the same confidence to feature points at the target centroid and at its edges, thus producing worse prediction results after non-maximum suppression (NMS), as shown in
Figure 2. Additionally, these methods do not take the shape of the target into account, which may cause some negative locations to be misallocated as positive samples, such as the two ends of a ship.
To tackle the misallocation problem, some methods shrink the object bounding box to obtain the core region and take the core region as a positive location [
18,
19], while the other regions of the object bounding box are the transition from the positive location to the negative location. In this way, the problem of mislabeling can be alleviated to a certain extent. However, it fails to reflect the shape and direction characteristics of the oriented ships. Gaussian rotation heatmaps [
6,
20] have been used as the supervisory information to distinguish the positive and negative positions of oriented ships. Ref. [
20] proposed a target detection method by adopting a Gaussian rotation heatmap as prior information and an adaptive weight-adjustment mechanism (OWAM) algorithm to weight the positive and negative samples at different positions. While these methods address the issue of mislabeling and incorporate the shape and direction of oriented ships, they only assign positive and negative labels through a continuous two-dimensional function represented by a Gaussian heatmap. This approach has a crucial limitation, as the Gaussian heatmap cannot be utilized as the confidence output in the prediction stage. Additionally, the contribution of positive samples from different locations to the object is neglected, leading to the possibility of positive samples at the edge of the target location being overlooked and increasing the impact of noise on accurate target detection.
We aim to solve the imbalance between the numbers of positive and negative samples, better suppress background and noise interference, and simultaneously improve both the robustness to multi-scale ships in remote sensing and the capability of the trained model. To this end, the multi-scale dense-point rotation Gaussian heatmap (MDP-RGH) method is proposed. The MDP-RGH is a discrete two-dimensional function based on a Gaussian heatmap that operates on dense points, allowing multi-scale oriented ships to be modeled according to their shape and direction. The positive samples are weighted using the MDP-RGH so that they follow a rotated Gaussian distribution. The Gaussian heatmap confidence is used to predict whether a location is a positive or negative sample, thereby improving detection and reducing network computation. On this basis, a new anchor-free oriented ship detector (AF-OSD) is constructed using the MDP-RGH method to detect multi-scale oriented ships. Additionally, a multi-task object size adaptive loss (OSALoss) function is designed to address the training imbalance caused by varying ship sizes. The weight of this function is determined by both the object area and the density of dense points, leading to improved ship detection accuracy. The contributions of this work can be summarized as follows.
An oriented ship model based on MDP-RGH is proposed, which can balance the number of positive and negative samples, suppress the interference of negative samples such as background and noise in the image, and improve the training accuracy.
An AF-OSD based on MDP-RGH is designed to achieve a better prediction for oriented ships with multi-scale attributes.
A multi-task OSALoss function is constructed to further overcome the training imbalance problem caused by different ship sizes to improve the detection quality and performance of the whole model for multi-scale ships.
The rest of the paper is organized as follows. In
Section 2, the ship model based on MDP-RGH is elaborated in detail.
Section 3 describes the AF-OSD based on the MDP-RGH ship model and details the network.
Section 4 reports the hyper-parameter settings.
Section 5 shows the experimental results and analysis.
Section 6 discusses the experimental results. Finally,
Section 7 is the conclusion.
The base symbol of a variable indicates its class of meaning, e.g., $(x,y)$ for coordinate variables, $\mathbf{G}$ for ground truth, $\mathbf{F}$ for output convolutional features, and $\mathbf{M}$ for masks. The subscript of a variable qualifies it, indicating that the variable belongs to the range denoted by the subscript, and each next-level subscript qualifies the previous-level subscript.
2. The Oriented Ship Model Based on MDP-RGH
To address the imbalance between the numbers of positive and negative samples, predict the target using Gaussian heatmap confidence, and account for the contribution of positive samples at different locations, a new ship model, MDP-RGH, is proposed. MDP-RGH suppresses background and noise interference and effectively describes the shape and direction characteristics of multi-scale ships, as shown in
Figure 3. The model of an oriented ship is built in three steps: (1) dividing the image region; (2) obtaining multi-scale dense points by down-sampling the image; (3) weighting the dense points with a rotation Gaussian heatmap.
2.1. Dividing the Image Region
As the shape of the ship is similar to a shuttle [
6], if all the areas in the object bounding box are divided into object regions, part of the background pixels will be included in the object region.
To fit the shape and direction characteristics of the ship and coordinate with the subsequent rotation Gaussian heatmap, an image region division method is designed, as shown in
Figure 3. Specifically, the shrink rotation ellipse regions are delineated as the object regions. The other regions inside the object bounding box are regarded as ignorable regions, while regions not inside any object bounding box are regarded as background regions. In this case, we only need to determine the object regions and the ignorable regions. More specifically, the image region is divided in three steps.
2.1.1. Coordinate Transformation for the Oriented Ship
To facilitate the subsequent calculation, we transform the rotating ship into an upright ship. Therefore, as shown in
Figure 4, coordinate transformation is performed. Specifically, we transform the points
$\left[\begin{array}{c}{x}_{\mathrm{pixel}}\\ {y}_{\mathrm{pixel}}\end{array}\right]$ in the original pixel coordinate system
$\mathbf{XOY}$ to the new coordinate system
${\mathbf{X}}^{\prime}{\mathbf{O}}^{\prime}{\mathbf{Y}}^{\prime}$ with the center point of the oriented ship being set as the origin, the long axis of the ship (the line between the center points of the bow and stern) is set as the
${Y}^{\prime}$ axis, and the short axis is set as the
${X}^{\prime}$ axis. The coordinates of pixel points in the coordinate system
${\mathbf{X}}^{\prime}{\mathbf{O}}^{\prime}{\mathbf{Y}}^{\prime}$ can be expressed as
where
$\left[\begin{array}{c}{x}_{\mathrm{c}}\\ {y}_{\mathrm{c}}\end{array}\right]$ means the coordinates of the center point of the oriented ship in the coordinate system
$\mathbf{XOY}$.
$\alpha $ represents the counterclockwise angle between the positive half-axis of the oriented ship (the direction from the center point of the ship to the center point of the bow is positive) and the positive
Y axis of the pixel coordinate system
$\mathbf{XOY}$.
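In standard form (a sketch consistent with the definitions above; the sign convention of the original Equation (1) may differ), this transformation is a translation by the ship center followed by a rotation through $\alpha$:

```latex
\left[\begin{array}{c} x^{\prime}_{\mathrm{pixel}} \\ y^{\prime}_{\mathrm{pixel}} \end{array}\right]
=
\left[\begin{array}{cc} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{array}\right]
\left[\begin{array}{c} x_{\mathrm{pixel}} - x_{\mathrm{c}} \\ y_{\mathrm{pixel}} - y_{\mathrm{c}} \end{array}\right]
```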
2.1.2. Creating a Shrink Rotation Ellipse Equation for the Oriented Ship
We create a shrink rotation ellipse equation for the oriented ship to determine the region to which each pixel belongs. The ellipse of the shrink rotation ship bounding box in the coordinate system
${\mathbf{X}}^{\prime}{\mathbf{O}}^{\prime}{\mathbf{Y}}^{\prime}$ is a standard ellipse. Specifically, the ellipse equation is written as
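Assuming $w$ and $h$ denote the width and length of the ship bounding box, and $\xi$ the scale (shrink) factor introduced later, the shrink rotation ellipse takes the standard form

```latex
\frac{{x^{\prime}}^{2}}{\left(\xi w/2\right)^{2}} + \frac{{y^{\prime}}^{2}}{\left(\xi h/2\right)^{2}} = 1
```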
2.1.3. Identifying the Region to Which the Pixels in the Image Belong
After the above two steps, we can determine whether the pixel point
$\left[\begin{array}{c}{x}_{\mathrm{pixel}}\\ {y}_{\mathrm{pixel}}\end{array}\right]$ belongs to the object region
${\mathrm{area}}_{\mathrm{ship}}$ or the ignorable region
${\mathrm{area}}_{\mathrm{ignore}}$ as
Particularly, a pixel point belongs to the background region when it does not belong to any ship’s object or ignorable region.
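A sketch of this membership rule, using the primed coordinates of the pixel and the shrink ellipse semi-axes (the paper's exact notation in Equation (3) may differ), is

```latex
\left[\begin{array}{c}x_{\mathrm{pixel}}\\ y_{\mathrm{pixel}}\end{array}\right] \in
\begin{cases}
{\mathrm{area}}_{\mathrm{ship}}, & \text{if } \dfrac{{x^{\prime}}^{2}}{(\xi w/2)^{2}} + \dfrac{{y^{\prime}}^{2}}{(\xi h/2)^{2}} \le 1,\\[2mm]
{\mathrm{area}}_{\mathrm{ignore}}, & \text{otherwise, if the pixel lies inside the object bounding box.}
\end{cases}
```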
2.2. Multi-Scale Dense Points by Down-Sampling
Considering the multi-scale attributes of different ships, as shown in
Figure 5, a multi-scale dense-point sampling method is further proposed to balance the number of positive samples of ships of different sizes. In particular, dense points can also reduce the amount of model computation since not all points need to be involved.
Specifically, according to the different scales of the ships, the image is down-sampled by different factors to obtain low-resolution images. We down-sample the image by factors of $s$ = 4, 8, and 16, yielding three scales. Then, each low-resolution image is mapped back to the original image to obtain a three-scale dense point matrix. The coordinates of the dense points can be described as
where $x_{\mathrm{denp}_{s,ij}}$ and $y_{\mathrm{denp}_{s,ij}}$ denote the horizontal and vertical coordinates, in the original image, of the sample point $\mathrm{denp}_{s,ij}$ in the $j$-th row and $i$-th column of the dense point matrix obtained after down-sampling by a factor of $s$, respectively.
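As a concrete sketch of this mapping (a hypothetical helper; the cell-center convention $s\,i + \lfloor s/2 \rfloor$ is one common choice and the paper's exact offset may differ):

```python
def dense_point_grid(height, width, s):
    """Map each cell of the s-times down-sampled feature map back to a
    dense point in the original image. Cell (j, i) (j-th row, i-th
    column) maps to the cell-center pixel (s*i + s//2, s*j + s//2).
    The cell-center offset is an assumption, not the paper's formula.
    """
    return [[(s * i + s // 2, s * j + s // 2)
             for i in range(width // s)]
            for j in range(height // s)]
```

For an 800 × 800 input, the three strides $s$ = 4, 8, 16 yield 200 × 200, 100 × 100, and 50 × 50 dense point matrices, respectively.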
Thus, small-scale dense points are used to represent large ships, and large-scale dense points are used to represent small ships, which balances the numbers of positive samples of large, medium, and small ships to a certain extent. Here, large-scale, medium-scale, and small-scale ships can be clustered with the
k-means clustering algorithm. To better delineate the effective area of the image, dense points in the background region are defined as negative sample points, dense points in the ignorable region are ignorable dense points, and dense points in the object region are positive sample points, as shown in
Figure 6a–c, respectively. Particularly, the ignorable dense points do not participate in calculating the loss function when training the model.
2.3. Weighting the Dense Points with Rotation Gaussian Heatmap
It has been shown that the closer a sample point is to the center of the ship, the richer the ship features extracted during model inference, yielding more accurate detection results. Therefore, Gaussian weighting is performed on the dense points within the object region, where dense points at different positions carry different degrees of importance for the ship. In this way, the multi-scale dense-point rotation Gaussian heatmap is obtained, as shown in
Figure 6.
Specifically, the calculation method of the rotation Gaussian heatmap value of the dense points in the ship region at each scale can be represented as
where $g(\cdot)$ denotes the rotation Gaussian heatmap function; $g^{\prime}(\cdot)$ denotes the general Gaussian heatmap function; $x^{\prime}_{\mathrm{denp}_{s,ij}}$ and $y^{\prime}_{\mathrm{denp}_{s,ij}}$ denote the horizontal and vertical coordinates of the dense point in the coordinate system $\mathbf{X}^{\prime}\mathbf{O}^{\prime}\mathbf{Y}^{\prime}$, respectively; $\mathrm{Exp}(\cdot)$ is the exponential function; and $\sigma_{w}$ and $\sigma_{h}$ are parameters related to the width and height of the ship, respectively.
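Under the usual anisotropic Gaussian form (a sketch consistent with the symbol definitions above; the paper's Equation (5) may differ in detail), the heatmap value of a dense point inside the object region is

```latex
g\left(\mathrm{denp}_{s,ij}\right) =
\mathrm{Exp}\left(-\left(
\frac{{x^{\prime}}^{2}_{\mathrm{denp}_{s,ij}}}{2\sigma_{w}^{2}}
+ \frac{{y^{\prime}}^{2}_{\mathrm{denp}_{s,ij}}}{2\sigma_{h}^{2}}
\right)\right)
```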
To determine the values of
${\sigma}_{w}$ and
${\sigma}_{h}$, a hyper-parameter
${g}_{\mathrm{init}}\in (0,1)$ is introduced, which represents the value of the rotation Gaussian heatmap when the dense point is located on the boundary of the shrink rotation ellipse. Taking two points on the shrink rotation ellipse boundary, the values of
${\sigma}_{w}$ and
${\sigma}_{h}$ can be obtained by combining the initial Gaussian heatmap value
${g}_{\mathrm{init}}$, scale factor
$\xi $, (1), (2), and (5). For example, we take the two vertices of this scaled rotation ellipse (e.g., the two points A and B in
Figure 4) to determine the values of
${\sigma}_{w}$ and
${\sigma}_{h}$. At this point, the horizontal and vertical coordinates of these two points in the coordinate system
${\mathbf{X}}^{\prime}{\mathbf{O}}^{\prime}{\mathbf{Y}}^{\prime}$ can be calculated by (1) and (2). Substituting ${g}_{\mathrm{init}}$, A, and B into (5) yields
then the values of
${\sigma}_{w}$ and
${\sigma}_{h}$ can be obtained.
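With the vertices taken as $A=(\xi w/2,\,0)$ and $B=(0,\,\xi h/2)$ (an assumption consistent with the shrink ellipse), setting the heatmap value at each vertex to $g_{\mathrm{init}}$ gives

```latex
g_{\mathrm{init}} = \mathrm{Exp}\left(-\frac{(\xi w/2)^{2}}{2\sigma_{w}^{2}}\right)
\;\Rightarrow\;
\sigma_{w} = \frac{\xi w}{2\sqrt{2\ln\left(1/g_{\mathrm{init}}\right)}},
\qquad
\sigma_{h} = \frac{\xi h}{2\sqrt{2\ln\left(1/g_{\mathrm{init}}\right)}}
```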
In particular, the rotation Gaussian heatmap value of the dense points located in the ignorable region or background region is 0. After that, the Gaussian rotation heatmap of the ships with three scales can be generated in
Figure 6d–f, respectively.
Based on the above designs, an oriented ship model based on MDP-RGH is obtained, which better balances the influence of the positive and negative samples of the training dataset and reduces the influence of noise and background on ship recognition. The positive samples near the edge of the ship are made to conform to the rotated Gaussian distribution, avoiding noise interference. The closer a positive sample is to the edge, the lower its value, so our positive samples are soft: a higher positive sample score means that the sample is more representative of the ship. Thus, the MDP-RGH improves the robustness for targets of different sizes and the computational speed of the proposed model. Next, we introduce the oriented ship detection algorithm based on MDP-RGH.
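The three modeling steps can be summarized in a single sketch (hypothetical function and parameter names; the shrink factor `xi` and boundary value `g_init` stand in for the hyper-parameters $\xi$ and $g_{\mathrm{init}}$, and the rotation sign convention is assumed):

```python
import math

def rotated_gaussian_heatmap(x_pixel, y_pixel, xc, yc, w, h, alpha,
                             xi=0.8, g_init=0.1):
    """Sketch of the MDP-RGH weight for one dense point.

    Returns 0 for points outside the shrunk rotated ellipse
    (ignorable or background regions), and a soft rotated-Gaussian
    weight in (0, 1] for points inside the object region.
    """
    # Step 1: transform the pixel into the ship-centered rotated frame.
    dx, dy = x_pixel - xc, y_pixel - yc
    x_prime = math.cos(alpha) * dx - math.sin(alpha) * dy
    y_prime = math.sin(alpha) * dx + math.cos(alpha) * dy

    # Step 2: membership test against the shrunk ellipse.
    a, b = xi * w / 2.0, xi * h / 2.0   # semi-axes after shrinking
    if (x_prime / a) ** 2 + (y_prime / b) ** 2 > 1.0:
        return 0.0  # ignorable or background region

    # Step 3: rotated Gaussian weight, with sigma chosen so the
    # value at the ellipse vertices equals g_init.
    k = 2.0 * math.log(1.0 / g_init)
    sigma_w = xi * w / (2.0 * math.sqrt(k))
    sigma_h = xi * h / (2.0 * math.sqrt(k))
    return math.exp(-(x_prime ** 2 / (2 * sigma_w ** 2)
                      + y_prime ** 2 / (2 * sigma_h ** 2)))
```

The weight is 1 at the ship center and decays smoothly to `g_init` at the shrink ellipse boundary, which is the "soft" positive-sample behavior described above.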
4. Experimental Conditions
In this section, experiments on two public-oriented ship datasets are conducted to quantitatively and qualitatively evaluate the proposed AF-OSD. A series of experiments are conducted using the DOTA ship dataset [
28] and HRSC2016 dataset [
29]. Next, we will introduce the experimental conditions, including experimental platforms, datasets, evaluation metrics, and implementation details.
4.1. Experimental Platforms
All the experiments are implemented on a desktop computer with an Intel Core i7-9700 CPU, 32 GB of memory, and a single NVIDIA GeForce RTX 3090 with 24 GB GPU memory.
4.2. Dataset
4.2.1. DOTA Ship Dataset
DOTA is a large-scale aerial image dataset containing 2806 images, with 15 categories labeled with oriented boxes and image sizes ranging from
$800\times 800$ to
$1600\times 1600$. We select 434 images containing ships (37,028 ships in total), of which 90% are randomly selected as the training set and the remaining 10% as the validation set. The size distribution of the ships is not uniform, and large-scale ships are few. Therefore, we augment the large ships in the dataset (rotation and flip augmentation); the size distributions of the original dataset and the enhanced dataset are shown in
Figure 11a, and the rotation angle distribution is shown in
Figure 11b.
4.2.2. HRSC2016 Dataset
HRSC2016 is a remote sensing image dataset containing ship targets labeled as arbitrary orientations. It consists of 1061 images (436 images are the training set, 181 images are the validation set, and 444 images are the test set), whose spatial sizes range from
$300\times 300$ to
$1500\times 900$. We use the training set to train the model and the validation set to test the model. The size and rotation angle distributions of ships are illustrated in
Figure 12, and the ship sizes and angles are distributed fairly uniformly. Therefore, we do not target ships of any particular scale when performing augmentation (flip augmentation).
4.3. Evaluation Metrics
In this article, the metrics widely used in object detection are adopted to measure detection performance, i.e., precision, recall, and average precision (AP), with the IoU threshold set to 0.5. Moreover, we use the number of parameters (Params) and the average running time to evaluate the complexity and speed of the model, respectively.
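For reference, the two counting-based metrics reduce to the standard definitions (a generic sketch, not code from the paper; a detection counts as a true positive when it matches a ground-truth ship with IoU ≥ 0.5):

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive, and
    false-negative counts at a fixed IoU threshold (0.5 here)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

AP is then the area under the precision-recall curve traced as the confidence threshold varies.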
4.4. Implementation Details
Due to the large size of the original input images, we cut them before network training. Specifically, the input resolution is set to $800\times 800$. The original image and the corresponding labels are first preprocessed, where each cut image is uniformly $800\times 800$ in size (if an image is smaller than $800\times 800$, padding is applied). To avoid an object being cut into two halves and lost, an overlapping area of 200 pixels is used in the experiments. Even so, there are cases where only part of an object lies in the cropped image.
If this happens, we compute the ratio of the area of the target on the cropped image to the area of the target itself. If this ratio is greater than a set threshold (taken as 0.6), the label information of the object is kept. The criterion for keeping or discarding a target on the cropped image is represented as
where
$\mathbf{A}(\cdot)$ denotes the area function and
${\mathrm{box}}_{\mathrm{Object}}\cap {\mathrm{box}}_{\mathrm{Crop}}$ means the intersection of the target box and the crop-image box. If the target is to be retained, its coordinates on the cut image are calculated as
where
${p}^{O*}$ is the new coordinate of the target frame in the cropped picture,
${p}_{i}^{O}$ denotes the
i-th coordinate point of the target frame in the original picture, and
${p}_{1}^{\mathrm{Crop}}$ denotes the first coordinate point of the cropped picture frame in the original pictures. The image cutting schematic is shown in
Figure 13.
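The two crop-time operations can be sketched as follows (hypothetical helper names; the 0.6 threshold is the one stated above, and the corner-point translation mirrors the coordinate shift just described):

```python
def keep_object(obj_area_in_crop, obj_area, threshold=0.6):
    """Keep a label only if the visible fraction of the object inside
    the cropped tile exceeds the threshold (0.6 in this paper)."""
    return obj_area_in_crop / obj_area > threshold

def shift_to_crop(points, crop_origin):
    """Translate object-box corner points from original-image
    coordinates into the cropped tile's coordinate frame;
    crop_origin is the first corner point of the crop window."""
    ox, oy = crop_origin
    return [(x - ox, y - oy) for (x, y) in points]
```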
During the training process, the Adam optimizer is used to optimize the weight parameters of our AF-OSD, and the initial learning rate is set to 0.001 with an exponential decay of 5%. The batch size is set to 10, and all networks are trained for 200 epochs.
During the testing process, since the images of HRSC2016 are not very large, we resize each image and then pad it to 800 × 800 before inputting it to the model. For the test images of DOTA, we use the same processing as for the training data. The confidence threshold ${score}_{t}$ is set to 0.1 and the NMS threshold to 0.6.
6. Discussion
In this paper, to solve the sample imbalance problem and suppress the interference of negative samples such as background and noise, the oriented ship is modeled via the proposed MDP-RGH according to its shape and direction to generate ship labels with more accurate information, so that the contribution of positive samples at different positions is taken into account when judging the target. Additionally, we designed an end-to-end anchor-free oriented ship detector (AF-OSD) based on MDP-RGH and validated its detection performance.
For the necessity study of the network modules, the ablation experiment includes three parts: the multi-scale structure, the MDP-RGH-based label confidence, and the MOSA loss. The results show that the framework with a multi-scale structure better extracts the features of oriented ships at different scales, and that the multi-scale detection head enhances the robustness of the model; the multi-scale structure of the oriented ship detection network is therefore a reasonable design. We use MDP-RGH to weight the positive samples so that they follow a rotated Gaussian distribution, with lower values closer to the edge; the positive samples of our method are thus soft. As the ablation experiment shows, the output can use the Gaussian heatmap confidence to predict whether a location is a negative or a positive sample, and a higher positive sample score means that the sample is more representative of the ship. Therefore, the MDP-RGH confidence better suppresses the interference of negative samples such as background and noise in the image, improving detection performance. The comparison experiments on the improved loss function show that the weighting better resolves the loss imbalance between large and small ships: the larger the target, the more positive sample points it contains; the smaller the target, the fewer. The ablation experiment shows that the weighting adaptively resolves the sample imbalance that easily arises during training.
Based on the test results on the DOTA ship dataset and the HRSC2016 dataset, one can conclude that the proposed AF-OSD achieves the best target recognition performance on remote sensing images with multi-scale ships, complex scenes, and positive-negative sample imbalance. The AF-OSD features high accuracy, few network model parameters, and high robustness. Owing to the complexity of airborne remote sensing scenes, strong background and noise interference, positive-negative sample imbalance, and multiple ship scales, ship detection is a key and challenging task in remote sensing, and the proposed method addresses these challenges well. In summary, the proposed AF-OSD has a low number of parameters together with higher model accuracy, detection accuracy, and computational speed.
Our main research objects are ship targets. Most ships in the HRSC2016 and DOTA ship datasets are narrow at both ends and wide in the middle, and their external contours are similar to rectangles. Therefore, the proposed method can be extended to the detection of remote sensing targets in any orientation whose external contours are similar to rectangles (such as vehicles and stadium-like targets), and may even be applicable to symmetrical targets. However, AF-OSD is not an optimal solution for irregular and asymmetrical remote sensing targets (such as port-like targets), since the Gaussian heatmap matches such targets poorly. The detection of such targets will therefore be developed in subsequent work. Moreover, to improve recognition speed and detection accuracy by decoupling scale and task, channel and spatial attention mechanisms were considered in the design. However, the results showed that including the attention mechanism actually decreased the detection accuracy of the network, an aspect that requires further research in future work.