Article

SuperDet: An Efficient Single-Shot Network for Vehicle Detection in Remote Sensing Images

1 The College of Information Science and Technology, Dalian Maritime University, Dalian 116026, China
2 State Key Laboratory of Robotics, Shenyang Institute of Automation, Chinese Academy of Sciences (CAS), Shenyang 110016, China
* Author to whom correspondence should be addressed.
Electronics 2023, 12(6), 1312; https://doi.org/10.3390/electronics12061312
Submission received: 9 February 2023 / Revised: 3 March 2023 / Accepted: 7 March 2023 / Published: 9 March 2023

Abstract: Vehicle detection in remote sensing images plays an important role in a wide range of applications. However, it remains a challenging task because the vehicles are small. In this paper, we propose an efficient single-shot detector, called SuperDet, which combines a super resolution algorithm with a deep convolutional neural network (DCNN)-based object detector. SuperDet consists of two interconnected modules: the super resolution module and the vehicle detection module. The super resolution module recovers a high resolution remote sensing image from its low resolution counterpart; with this module, the small vehicles gain resolution, which helps their detection. Taking the higher resolution image as input, the vehicle detection module extracts features and predicts the location and category of each vehicle. We use a multi-task loss function to train the network end-to-end. To assess the detection performance of SuperDet, we conducted experiments comparing it with classical object detectors on both the VEDAI and DOTA datasets. The experimental results indicate that SuperDet outperforms the other detectors for vehicle detection in remote sensing images.

1. Introduction

Interest in vehicle detection in remote sensing images has been increasing because of its wide range of applications [1,2], such as military security, traffic monitoring [3,4,5] and intelligent transportation [6,7]. However, vehicle detection in remote sensing images remains a challenging task for many reasons, such as the small size of vehicles and the complex environment around them.
The task of vehicle detection is to localize and classify vehicles in remote sensing images. Traditional vehicle detection approaches are mostly based on hand-crafted features and sliding-window search [8,9,10]. These approaches usually require heavy computation, and their representation ability is limited. More recently, with the development of deep learning, deep convolutional neural network (DCNN)-based [11] object detectors have made significant advances and are receiving increasing attention, such as SSD [12], YOLO [13], YOLO V3 [14], R-CNN [15], Fast R-CNN [16] and Faster R-CNN [17]. This is because the features extracted by a DCNN have more powerful representation and stronger characterization ability.
Vehicles in remote sensing images differ from the targets of general object detection tasks. They are usually small and inconspicuous, which makes them hard to spot. Additionally, because they are photographed from a distance in the air, vehicles in remote sensing images invariably lose their “faces”, as illustrated in Figure 1. Furthermore, the environment around the vehicles is particularly complex, and the visual appearance of vehicles and some other objects may be similar.
For the target detection task, resolution is very important, especially for small targets. If the resolution of the vehicles in remote sensing images can be increased, their features become clearer and the detection performance improves accordingly. Therefore, we recover a high resolution remote sensing image from its low resolution counterpart to improve vehicle detection in remote sensing images.
In this paper, a novel single-shot detector called SuperDet is introduced, which obtains a better speed-to-accuracy trade-off for vehicle detection in remote sensing images. It combines a super resolution algorithm with a DCNN-based object detector. SuperDet contains two interconnected modules: the super resolution module (SRM) and the vehicle detection module (VDM). The super resolution module recovers a high resolution remote sensing image from its low resolution counterpart; with this module, the resolution of the vehicles is increased, which is helpful for their detection. Taking the higher resolution image as input, the vehicle detection module extracts features and then predicts the location and category of each vehicle. To add more contextual information around the vehicles, we design a context expansion module in VDM. In addition, a multi-task loss function is utilized to train SuperDet end-to-end. Lastly, we conduct comparative experiments on both the DOTA and VEDAI datasets to assess the detection performance of SuperDet. The results indicate that SuperDet outperforms other vehicle detectors.
The main contributions of this paper are as follows:
(1) We propose a novel single-shot detector, SuperDet, which combines a super resolution algorithm with a DCNN-based object detector.
(2) We design SRM and VDM, which recover a high resolution remote sensing image from its low resolution counterpart and perform vehicle detection, respectively. In VDM, we introduce a context expansion module to increase the context information around the vehicles.
(3) We conduct comparative experiments on both the DOTA and VEDAI datasets to assess the efficiency of SuperDet. The experimental results indicate that SuperDet outperforms other vehicle detectors.
The remainder of this paper is arranged as follows. Section 2 covers the related work. Section 3 discusses the proposed method in detail. Section 4 describes the experimental findings. Section 5 concludes the paper.

2. Related Work

2.1. Object Detection

With the development of computer technology and deep learning, DCNNs have been widely used in computer vision tasks such as object tracking, object detection and image segmentation. The features extracted by a DCNN have more powerful representation and stronger characterization ability, which helps the network learn the deep semantic information of the objects.
DCNN-based object detectors fall into two broad categories. The first is the two-stage object detector, such as R-CNN [15], Fast R-CNN [16], Faster R-CNN [17], Mask R-CNN [18] and Cascade R-CNN [19]. R-CNN uses Selective Search [20] to generate a large number of candidate proposals and outputs the detection results through a detection network. Fast R-CNN designs Region of Interest Pooling to merge the features from different regions of the feature map. Faster R-CNN introduces a region proposal network to reduce low-quality candidates and enhance the detector’s effectiveness. Mask R-CNN expands the task from detection to instance segmentation. Cascade R-CNN enhances R-CNN with a multi-stage extension to boost the detection accuracy.
To improve efficiency, many researchers have introduced one-stage object detectors, such as SSD [12], YOLO [13], YOLO9000 [21], YOLO V3 [14], You Only Look Twice [22], CenterNet [23] and CornerNet [24]. Unlike two-stage detectors, one-stage detectors output the target category and location directly. SSD detects objects of different sizes with a feature pyramid structure and proposes an anchor-box generation strategy. The YOLO series divides the image into grids and performs detection in each grid. YOLO9000 uses a backbone known as Darknet19 to extract features, and K-means clustering is utilized to produce the sizes of the anchor boxes. YOLO V3 makes use of the Feature Pyramid Network and detects targets at multiple scales. CenterNet and CornerNet propose anchor-free methods that locate objects through their key points. One-stage object detectors are faster than two-stage object detectors, but their detection accuracy still falls short of that of two-stage detectors.

2.2. Vehicle Detection in Remote Sensing Images

Most vehicle detectors for remote sensing images are improvements on the above-mentioned general object detectors. Inspired by these DCNN-based object detection approaches, many vehicle detectors for remote sensing images have been proposed. In [25], the authors improved Faster R-CNN and incorporated domain-specific knowledge to detect vehicles in remote sensing images. In [26], the authors introduced a fully convolutional regression network to alleviate the mapping problem. In [27], the authors designed a region-based DCNN and utilized hard negative example mining to detect vehicles; to test large-scale images, the images are cropped into small blocks, and after testing, the detection results of all blocks are stitched together. In [28], the authors used a multi-task learning residual fully convolutional network to perform vehicle instance segmentation. In [29], the authors proposed AVDNet for small vehicle detection in aerial images; AVDNet uses ConvRes blocks to preserve small object features. In [30], the authors introduced cascaded CNNs, which combine two independent CNNs: one produces vehicle-like regions, and the other accurately judges the position and category of the target. In [31], a detection framework that appropriately handles the rotation equivariance inherent to any aerial image task was developed based on Faster R-CNN. In [32], the authors improved the non-maximum suppression (NMS) method and introduced dual-NMS, which exploits the density of the generated bounding boxes and effectively reduces the false detection rate; in addition, a correlation network (CorrNet) was designed with a correlation calculation layer and a dilated convolution guidance structure, which improves the network’s feature extraction capability. The method in [33] focuses on small vehicle detection in remote sensing images: the authors improve YOLO V3 and design a depthwise-separable attention-guided network (DAGN) by introducing depthwise-separable convolutions, which effectively accelerates the model, while the attention block helps the network distinguish important features from inconsequential ones; [33] also improves the candidate-merging strategy and the loss function. The authors of [34] proposed ensemble deep-transfer learning based on Faster R-CNN, which achieves superior total accuracy. In [35], the authors utilized AlexNet to classify vehicle targets and Faster R-CNN to perform vehicle detection. Although the above-mentioned vehicle detectors have achieved great performance, most of them are inappropriate for real-time scenarios. Furthermore, they have not addressed the low resolution and inconspicuous features of vehicle targets, and the detection accuracy needs to be further improved. According to the characteristics of vehicles in remote sensing images, we designed an SRM to recover the details of a vehicle from its low resolution image, and our proposed detector achieves a better speed-to-accuracy tradeoff. The advantages and disadvantages of the above-mentioned detectors are summarized in Table 1.

3. The Proposed Method

As summarized in Table 1, most existing vehicle detectors for remote sensing images are inappropriate for real-time scenarios and have not addressed the low resolution and inconspicuous features of vehicle targets. To address these problems, this section introduces the design of SuperDet in detail, including the architecture of SuperDet and the multi-task loss function.

3.1. The Architecture Design of SuperDet

As depicted in Figure 2, SuperDet includes three parts: Data Preprocessing, SRM and VDM. Data Preprocessing performs data augmentation and supplies labels for SRM and VDM. SRM recovers a high resolution remote sensing image from its low resolution counterpart, and VDM performs vehicle detection.

3.1.1. Data Preprocessing

For training: SuperDet aims to enhance the detection of vehicles in remote sensing images by utilizing a super resolution algorithm. To recover a high resolution image from its low resolution counterpart, we need to take the high resolution image as the label when training SuperDet. As in most DCNN-based object detectors, we use random color shifting, horizontal flipping, random affine transformation and random cropping to augment the high resolution images, aiming to enhance the learning ability of the network. Suppose that $H$ and $H^{*}$ represent the original high resolution image and the high resolution image with data augmentation, respectively. We down-sample $H^{*}$ to the low resolution image $L$. We take $H^{*}$ as the label of the super resolution module and $L$ as the input of SuperDet. SuperDet can then be trained in an end-to-end way, as shown in the sketch below.
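For concreteness, the following is a minimal PyTorch sketch of this preprocessing step, not the authors' released code; the function name and the augmentation parameters are our assumptions.

```python
import torch.nn.functional as F
import torchvision.transforms as T

# A sketch of the training-time preprocessing: augment the high resolution
# image to obtain H*, then 2x down-sample it to obtain the network input L.
augment = T.Compose([
    T.ColorJitter(brightness=0.2, saturation=0.2, hue=0.05),  # random color shifting
    T.RandomHorizontalFlip(),                                 # horizontal flipping
    T.RandomAffine(degrees=5, translate=(0.1, 0.1)),          # random affine
    T.RandomCrop(1024, pad_if_needed=True),                   # random cropping
])

def make_training_pair(hr_image):
    """PIL image H -> (L, H*): the 2x down-sampled input and the SRM label."""
    hr_label = T.ToTensor()(augment(hr_image)).unsqueeze(0)   # H*, 1 x 3 x 1024 x 1024
    lr_input = F.interpolate(hr_label, scale_factor=0.5,
                             mode="bilinear", align_corners=False)  # L, 1 x 3 x 512 x 512
    return lr_input, hr_label
```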
For testing: After training, SuperDet is able to recover a high resolution image from its low resolution counterpart. Therefore, in the testing phase, we only need to feed low resolution images as inputs to detect the vehicles in remote sensing images.

3.1.2. Super Resolution Module

The vehicles in remote sensing images are usually small and inconspicuous, which makes them hard to spot. A higher resolution makes the features of these vehicles clearer, which is helpful for their detection. Therefore, we propose SRM to recover a high resolution image from its low resolution counterpart.
As shown in Figure 2, five convolutions are used to extract the features first. We denote a convolution layer as $Conv(I, O, F)$, where $I$, $O$ and $F$ represent the number of input channels, the number of output channels and the filter size, respectively. Then, a pixel shuffle layer reconstructs the high resolution output. As shown in Figure 3, the input feature map of the pixel shuffle layer is divided channel-wise into four parts (red, orange, yellow and green), and each part is resettled adjacently to reconstruct a high resolution image. With the pixel shuffle layer, the feature maps are transformed from $H \times W \times r^2 C$ to $rH \times rW \times C$ ($r = 2$), which guarantees the reconstruction of the three RGB channels, where $W$, $H$ and $C$ denote the width, height and number of channels, and $r$ indicates the scaling ratio of the feature maps. To enhance feature reuse, we add the output of the pixel shuffle layer to the up-sampled low resolution image through a residual connection [36]. The SRM can be represented as follows:
$O = U(I) + PS(Convs(I))$ (1)
where $O$ and $I$ denote the output high resolution image and the input low resolution image, $U(\cdot)$ and $Convs(\cdot)$ represent the up-sampling operation and the feature extraction, and $PS(\cdot)$ denotes the pixel shuffle operation.
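To make the module concrete, here is a minimal PyTorch sketch of Equation (1). The intermediate channel width and the activations are our assumptions, since the paper does not list them.

```python
import torch.nn as nn

class SuperResolutionModule(nn.Module):
    """Sketch of SRM: five convolutions, a pixel shuffle layer (r = 2),
    and a residual connection to the up-sampled low resolution input."""
    def __init__(self, channels=64):
        super().__init__()
        self.convs = nn.Sequential(               # Convs(I): feature extraction
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 3 * 2 ** 2, 3, padding=1),  # r^2 * C output channels
        )
        self.pixel_shuffle = nn.PixelShuffle(2)   # PS: H x W x r^2C -> rH x rW x C
        self.upsample = nn.Upsample(scale_factor=2, mode="bilinear",
                                    align_corners=False)  # U(.)

    def forward(self, x):                         # O = U(I) + PS(Convs(I))
        return self.upsample(x) + self.pixel_shuffle(self.convs(x))
```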

3.1.3. Vehicle Detection Module

The high resolution image serves as the input for VDM, which predicts the category and location of the vehicles in remote sensing images.
Backbone: Darknet53 [14] is used as the foundational design for the backbone of SuperDet because it is straightforward and effective. A scale matching strategy [37] is utilized to choose the appropriate scales for vehicle detection. Both 8× and 16× down-sampled feature maps are selected as two scales and we truncate the convolutions prior to the 32× down-sampled layer in Darknet53 for vehicle detection based on the size of vehicles in the remote sensing images.
Context expansion module: The environment around the vehicles is particularly complex. The visual appearance of some vehicles and the background may be similar. Therefore, the context information around the vehicles is vital to recognize the vehicles.
To increase the context information, we use dilated convolution to design a context expansion module. As depicted in Figure 4, we utilize four convolutions with different dilation rates and kernel sizes to expand the receptive field: kernel size 1 × 1; kernel size 3 × 3 with dilation rate 1; kernel size 3 × 3 with dilation rate 2; and kernel size 4 × 4 with dilation rate 2. Compared with a single 3 × 3 convolution layer, we examine each location four times with different receptive fields. The information extracted from these various scopes contains both global and detailed information about the current region. Finally, to reflect the different contributions of the output features with different receptive fields, each output feature map is multiplied by a weight $W_i$, which is learned adaptively. Feature fusion is performed by adding the features with different receptive fields, which can be calculated by (2):
$F_{fusion} = \sum_{i=0}^{3} W_i F_i$ (2)
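A minimal PyTorch sketch of this module, under our reading of Figure 4, is given below; the padding values (chosen to preserve the spatial size) and the per-branch scalar form of $W_i$ are assumptions on our part.

```python
import torch
import torch.nn as nn

class ContextExpansionModule(nn.Module):
    """Sketch of the context expansion module: four parallel convolutions
    with the kernel sizes and dilation rates listed above, each weighted
    by a learnable scalar W_i and summed (Equation (2))."""
    def __init__(self, channels):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, 1),                         # 1x1
            nn.Conv2d(channels, channels, 3, padding=1, dilation=1),  # 3x3, rate 1
            nn.Conv2d(channels, channels, 3, padding=2, dilation=2),  # 3x3, rate 2
            nn.Conv2d(channels, channels, 4, padding=3, dilation=2),  # 4x4, rate 2
        ])
        self.weights = nn.Parameter(torch.ones(4))  # W_i, learned adaptively

    def forward(self, x):
        # weighted sum of the four receptive-field branches
        return sum(w * branch(x) for w, branch in zip(self.weights, self.branches))
```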
The low level features contain finer-grained information, and the high level features contain more meaningful semantic information. Therefore, for the 8× scale, the 16× down-sampled feature maps are up-sampled and concatenated channel-wise with the 8× down-sampled feature maps to perform feature fusion. Similarly, for the 16× scale, the 8× down-sampled feature maps are down-sampled and concatenated channel-wise with the 16× down-sampled feature maps, as depicted in Figure 2.

3.2. Multi-Task Loss Function

SRM loss: SuperDet’s loss function consists of two parts: the loss for SRM and the loss for VDM. For the former, we use the pixel-wise mean square error (MSE):
$Loss_{SRM} = \frac{1}{4HWC} \sum_{c=1}^{C} \sum_{x=1}^{2H} \sum_{y=1}^{2W} \left( H^{*}_{x,y,c} - S_{x,y,c} \right)^2$ (3)
where $H \times W \times C$ denotes the size of the low resolution image. With the super resolution module, the low resolution image is transformed into a high resolution image of size $2H \times 2W \times C$. $S$ indicates the output of SRM, and $H^{*}$ denotes the high resolution image with data augmentation.
VDM loss: The high resolution image is then passed to VDM, which outputs the category and position of each vehicle. The Generalized Intersection over Union loss (GIOU loss) [38] is utilized to regress the positions of the vehicles:
$GIOU(B^{GT}, B^{P}) = \frac{|B^{GT} \cap B^{P}|}{|B^{GT} \cup B^{P}|} - \frac{|B \setminus (B^{GT} \cup B^{P})|}{|B|}$ (4)
where $B^{P}$ represents the predicted box and $B^{GT}$ denotes the ground truth box. $B$ indicates the smallest enclosing convex region of $B^{GT}$ and $B^{P}$. The GIOU loss can then be calculated as follows:
$Loss_{giou} = 1 - GIOU(B^{GT}, B^{P})$ (5)
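For reference, a minimal PyTorch implementation of Equations (4) and (5) for axis-aligned boxes might look as follows; the (x1, y1, x2, y2) box encoding is our assumption.

```python
import torch

def giou(box_gt, box_p):
    """GIOU for axis-aligned boxes (x1, y1, x2, y2); shapes (N, 4)."""
    # intersection |B_gt ∩ B_p|
    lt = torch.max(box_gt[:, :2], box_p[:, :2])
    rb = torch.min(box_gt[:, 2:], box_p[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    # union |B_gt ∪ B_p|
    area_gt = (box_gt[:, 2] - box_gt[:, 0]) * (box_gt[:, 3] - box_gt[:, 1])
    area_p = (box_p[:, 2] - box_p[:, 0]) * (box_p[:, 3] - box_p[:, 1])
    union = area_gt + area_p - inter
    # smallest enclosing box B
    lt_c = torch.min(box_gt[:, :2], box_p[:, :2])
    rb_c = torch.max(box_gt[:, 2:], box_p[:, 2:])
    area_c = (rb_c - lt_c)[:, 0] * (rb_c - lt_c)[:, 1]
    return inter / union - (area_c - union) / area_c   # Equation (4)

def giou_loss(box_gt, box_p):
    return 1.0 - giou(box_gt, box_p)                   # Equation (5)
```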
To address the imbalance between negative and positive samples and to focus more on hard samples, the vehicle confidence is regressed using Focal loss [39]:
$Loss_{conf} = \begin{cases} -\alpha (1 - y^{P})^{\gamma} \log y^{P}, & y^{GT} = 1 \\ -(1 - \alpha) (y^{P})^{\gamma} \log (1 - y^{P}), & y^{GT} = 0 \end{cases}$ (6)
where $y^{GT}$ indicates the ground truth confidence and $y^{P}$ represents the predicted confidence. $\gamma$ is set to 2 and $\alpha$ is set to 0.25.
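A direct PyTorch transcription of Equation (6) is sketched below; the clamping epsilon is our addition for numerical stability.

```python
import torch

def focal_loss(y_p, y_gt, alpha=0.25, gamma=2.0, eps=1e-7):
    """Equation (6): y_p are predicted confidences in (0, 1), y_gt are 0/1 labels."""
    y_p = y_p.clamp(eps, 1.0 - eps)
    pos = -alpha * (1 - y_p) ** gamma * torch.log(y_p)        # y_gt = 1
    neg = -(1 - alpha) * y_p ** gamma * torch.log(1 - y_p)    # y_gt = 0
    return torch.where(y_gt == 1, pos, neg).mean()
```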
The classification of the vehicles is regressed using the Binary Cross Entropy loss:
$Loss_{cls} = -C^{GT} \log C^{P} - (1 - C^{GT}) \log (1 - C^{P})$ (7)
where $C^{GT}$ and $C^{P}$ indicate the ground truth category and the predicted category, respectively.
Then, the loss of VDM can be obtained by (8):
$Loss_{VDM} = \sum_{n=0}^{1} \left( Loss_{giou}^{n} + Loss_{conf}^{n} + Loss_{cls}^{n} \right)$ (8)
where n indicates the network scale.
At last, the total loss of SuperDet can be calculated by the following:
$Loss_{SuperDet} = Loss_{VDM} + Loss_{SRM}$ (9)
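Putting the pieces together, a sketch of the total objective might look as follows, reusing the giou_loss and focal_loss sketches above; the output names and the per-scale tuple layout are hypothetical stand-ins for SuperDet's actual outputs. Note that F.mse_loss with mean reduction over the 2H × 2W × C output equals the 1/(4HWC) normalization in Equation (3).

```python
import torch.nn.functional as F

def superdet_loss(sr_out, hr_label, preds_per_scale, targets_per_scale):
    loss_srm = F.mse_loss(sr_out, hr_label)          # Equation (3), pixel-wise MSE
    loss_vdm = 0.0
    for (boxes, conf, cls), (boxes_gt, conf_gt, cls_gt) in zip(
            preds_per_scale, targets_per_scale):     # n = 0, 1 (8x and 16x scales)
        loss_vdm = (loss_vdm
                    + giou_loss(boxes_gt, boxes).mean()        # Equation (5)
                    + focal_loss(conf, conf_gt)                # Equation (6)
                    + F.binary_cross_entropy(cls, cls_gt))     # Equation (7)
    return loss_vdm + loss_srm                       # Equations (8) and (9)
```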

4. Experiments

In this section, we will introduce the experiments in detail to verify the efficiency of the proposed SuperDet, including the experimental setup, ablation study and the comparative experiments on both DOTA and VEDAI datasets.

4.1. Experimental Setup

Datasets: Both the VEDAI [40] and DOTA [41] datasets are used for the comparative experiments to assess SuperDet’s performance.
The VEDAI dataset is designed for vehicle detection in aerial imagery. All images were captured from the same distance above the ground. The initial large aerial images, which contain a variety of vehicles and backgrounds, were divided into 1024 × 1024 patches, and these are down-sampled to 512 × 512 to obtain low resolution images. There are nine classes of vehicles in the VEDAI dataset. We take the 1024 × 1024 images as high resolution images in the training phase and the 512 × 512 images as low resolution images for testing.
DOTA is a large-scale dataset for detection tasks in aerial images. Each image is approximately 4000 × 4000 pixels and includes objects with a wide range of scales, shapes and orientations. There are 15 common object categories in the DOTA dataset. Because of the diverse image sizes in the DOTA dataset, we crop the images to 512 × 512 as high resolution images and down-sample them to 256 × 256 as low resolution images. In addition, the single class ‘small-vehicle’ is selected as the detection target. The training set contains 713 images and the testing set contains 135 images.
Training details: Each model is trained with stochastic gradient descent (SGD) on a single NVIDIA GeForce GTX TITAN X. The experiments are carried out with PyTorch 0.4.1. The model is trained with a cosine learning rate schedule [42], with the learning rate decaying from 1 × 10−4 to 1 × 10−6. We set the momentum to 0.9 and the weight decay to 0.0005.
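These settings correspond roughly to the following PyTorch snippet; the model stand-in and the 150-epoch horizon (inferred from the loss curve discussed in Section 4.3) are our assumptions.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))  # stand-in for SuperDet
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=150, eta_min=1e-6)  # cosine decay 1e-4 -> 1e-6 [42]

for epoch in range(150):
    # ... one training epoch over the dataset ...
    scheduler.step()
```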
Evaluation Metrics: Detection performance and speed are evaluated using the mean average precision (mAP) and frames per second (FPS) metrics. Following the common practice for object detection, a prediction is considered correct when its Intersection over Union (IOU) with the ground truth is greater than 0.5.

4.2. Ablation Study

To assess the effectiveness of each design in SuperDet, an ablation study is conducted on the VEDAI dataset, as described in Table 2. Baseline denotes the backbone of the network and contains two detection scales, 8× and 16×.
As shown in Table 2, the baseline achieves 70.0% mAP at 53.9 FPS. With the context expansion module, the context information around the vehicles is increased, which helps improve the detection performance: the mAP rises from 70.0% to 72.1% while the detection speed remains at 47.7 FPS. With feature fusion, the mAP is further enhanced by 1.4%, because fusing low level and high level features helps to categorize and locate the vehicles. With SRM, the details of the vehicles can be recovered from the low resolution image; the mAP is boosted from 73.5% to 77.6% and the detection speed remains at 29.3 FPS, which demonstrates that SRM is useful for vehicle detection in remote sensing images. Although each design in SuperDet adds computation and detection time, the detection performance is clearly boosted, and SuperDet remains suitable for real-time vehicle detection in remote sensing images.

4.3. Experiment on VEDAI

Comparative results between SuperDet and classical detectors on the VEDAI dataset are described in Table 3 and Table 4. We use both one-stage and two-stage detectors in the comparative experiments, including Faster RCNN, Faster RER-CNN, YOLO V2, SSD, YOLO V3, CorrNet and DAGN. It should be noted that Faster RER-CNN and DAGN are designed specifically for vehicle detection in remote sensing images.
The loss curve of SuperDet is depicted in Figure 5. The loss stabilizes after about 150 epochs.
Table 3 shows the mAP and FPS of the different classical detectors, and Table 4 shows the AP for each class. The two-stage detectors achieve higher mAP than the one-stage detectors, but their detection speed is slower. SuperDet achieves 77.6% mAP at 29.3 FPS on the VEDAI dataset, outperforming the other detection methods. Its mAP is 6.8% and 10.5% higher than those of Faster RER-CNN and DAGN, respectively, while its detection speed is comparable to DAGN and ten times faster than Faster RER-CNN.
To assess SuperDet’s detection capability qualitatively, Figure 6 visualizes the detection results of DAGN and SuperDet on the VEDAI dataset. False detections and missed detections are marked with red boxes. Compared with the ground truth, DAGN clearly misses some vehicles and misclassifies others. SuperDet effectively alleviates these problems by recovering the details of the vehicles from the low resolution image. The results demonstrate that SuperDet performs well in different scenes, such as the desert scene (first column), road scene (second column) and farmland scene (third column).

4.4. Experiment on DOTA Dataset

Comparative experiments among different detectors are also conducted on the DOTA dataset; the results are shown in Table 5. To further validate the performance of SuperDet, YOLO V2, Tiny-YOLO V3 and YOLO V3 take high resolution images of size 512 × 512 as input, while SuperDet takes low resolution images of size 256 × 256. SuperDet achieves 91.3% mAP at 95.3 FPS on the DOTA dataset, outperforming the other detection methods. Compared with YOLO V2, Tiny-YOLO V3, YOLO V3 and ARFFDet, the mAP is boosted by 18.6%, 18.1%, 3.0% and 2.3%, respectively, which demonstrates that SRM can effectively reconstruct high resolution images and provide high quality inputs for VDM. Although ARFFDet is faster than SuperDet, it does not address the low resolution and inconspicuous features of vehicle targets, which affects its detection accuracy. With the SRM recovering the details of the vehicles from the low resolution images, SuperDet remains suitable for real-time vehicle detection in remote sensing images.
Similarly, Figure 7 visualizes the detection results of ARFFDet and SuperDet on the DOTA dataset. False detections and missed detections are marked with red boxes. Compared with the ground truth, ARFFDet clearly misses some vehicles. SuperDet effectively alleviates this problem by recovering the details of the vehicles from the low resolution image. The results demonstrate that SuperDet performs well for small vehicle detection in remote sensing images, even when the environment is complicated and the vehicles are dense.

5. Conclusions

In this paper, we introduce SuperDet to detect vehicles in remote sensing images. SuperDet is a novel single-shot detector consisting of two interconnected parts, the SRM and the VDM, which combine a super resolution algorithm with a DCNN-based object detector. The SRM recovers a high resolution remote sensing image from its low resolution counterpart, and the VDM performs vehicle detection. SuperDet can be trained and tested end-to-end with a multi-task loss. An ablation study was conducted to evaluate the contribution of each design in SuperDet. To assess the detection efficiency of SuperDet, comparative experiments were conducted on both the VEDAI and DOTA datasets between SuperDet and classical object detectors. The qualitative and quantitative results indicate that SuperDet outperforms the other vehicle detectors in terms of the speed-to-accuracy tradeoff. SuperDet achieves 77.6% mAP at 29.3 FPS with a 512 × 512 input size on the VEDAI dataset and 91.3% mAP at 95.3 FPS with a 256 × 256 input size on the DOTA dataset, making it suitable for real-time vehicle detection in remote sensing images. Because introducing the SRM adds computation, the model needs to be further simplified, and we are currently exploring ways to simplify the whole model. We hope this paper provides new ideas for vehicle and small target detection in remote sensing images.

Author Contributions

M.J. set up the experimental environment, designed the algorithms, carried out the experiments, analyzed the data, and wrote the paper. B.N., S.J. and Z.L. reviewed and amended the paper as research supervisors and reviewers. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by National Natural Science Foundation of China grant 62201114 and the Fundamental Research Funds for the Central Universities grant 3132022242.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Nassim, A.; Haikel, A.; Yakoub, B.; Bilel, B.; Naif, A.; Mansour, Z. Deep learning approach for car detection in UAV imagery. Remote Sens. 2017, 9, 312.
  2. Nicolas, A.; Le, S.B.; Sébastien, L. Segment-before-detect: Vehicle detection and classification through semantic segmentation of aerial images. Remote Sens. 2017, 9, 368.
  3. Zhou, Y.; Liu, L.; Shao, L.; Mellor, M. DAVE: A Unified Framework for Fast Vehicle Detection and Annotation. arXiv 2016, arXiv:1607.04564.
  4. Wang, L.; Lu, Y.; Wang, H.; Zheng, Y.; Ye, H.; Xue, X. Evolving Boxes for Fast Vehicle Detection. arXiv 2017, arXiv:1702.00254.
  5. Mattyus, G.; Liu, K. Fast multiclass vehicle detection on aerial images. IEEE Geosci. Remote Sens. Lett. 2015, 12, 1938–1942.
  6. Mou, L.; Zhu, X. Spatiotemporal scene interpretation of space videos via deep neural network and tracklet analysis. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium, Beijing, China, 10–15 July 2016.
  7. Kopsiaftis, G.; Karantzalos, K. Vehicle detection and traffic density monitoring from very high resolution satellite video data. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium, Milan, Italy, 26–31 July 2015.
  8. Cheng, H.Y.; Weng, C.C.; Chen, Y.Y. Vehicle detection in aerial surveillance using dynamic Bayesian networks. IEEE Trans. Image Process. 2011, 21, 2152–2159.
  9. Wen, S.; Wen, Y.; Gang, L.; Jie, L. Car detection from high-resolution aerial imagery using multiple features. In Proceedings of the Geoscience and Remote Sensing Symposium (IGARSS), Munich, Germany, 22–27 July 2012; pp. 4379–4382.
  10. Chen, Z.; Wang, C.; Wen, C.; Teng, X.; Chen, Y.; Guan, H.; Luo, H.; Cao, L.; Li, J. Vehicle Detection in High-Resolution Aerial Images via Sparse Representation and Superpixels. IEEE Trans. Geosci. Remote Sens. 2015, 54, 103–116.
  11. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90.
  12. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single shot multibox detector. arXiv 2016, arXiv:1512.02325.
  13. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 779–788.
  14. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
  15. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
  16. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 11–18 December 2015; pp. 1440–1448.
  17. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
  18. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 18–22 June 2018; p. 1.
  19. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018.
  20. van de Sande, K.E.A.; Uijlings, J.R.R.; Gevers, T.; Smeulders, A.W.M. Segmentation as selective search for object recognition. In Proceedings of the IEEE International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 1879–1886.
  21. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525.
  22. Etten, A.V. You only look twice: Rapid multi-scale object detection in satellite imagery. arXiv 2018, arXiv:1805.09512.
  23. Zhou, X.; Wang, D.; Krähenbühl, P. Objects as points. arXiv 2019, arXiv:1904.07850.
  24. Law, H.; Deng, J. CornerNet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 765–781.
  25. Ji, H.; Gao, Z.; Mei, T.; Li, Y. Improved Faster R-CNN with multiscale feature fusion and homography augmentation for vehicle detection in remote sensing images. IEEE Geosci. Remote Sens. Lett. 2019, 16, 1761–1765.
  26. Tayara, H.; Soo, K.G.; Chong, K.T. Vehicle detection and counting in high-resolution aerial images using convolutional regression neural network. IEEE Access 2017, 6, 2220–2230.
  27. Tang, T.; Zhou, S.; Deng, Z.; Zou, H.; Lei, L. Vehicle detection in aerial images based on region convolutional neural networks and hard negative example mining. Sensors 2017, 17, 336.
  28. Mou, L.; Zhu, X.X. Vehicle instance segmentation from aerial image and video using a multi-task learning residual fully convolutional network. arXiv 2018, arXiv:1805.10485.
  29. Mandal, M.; Shah, M.; Meena, P.; Devi, S.; Vipparthi, S.K. AVDNet: A Small-Sized Vehicle Detection Network for Aerial Visual Data. IEEE Geosci. Remote Sens. Lett. 2019, 17, 494–498.
  30. Zhong, J.; Lei, T.; Yao, G. Robust vehicle detection in aerial images based on cascaded convolutional neural networks. Sensors 2017, 17, 2720.
  31. Du Terrail, J.O.; Jurie, F. Faster RER-CNN: Application to the detection of vehicles in aerial images. arXiv 2018, arXiv:1809.07628.
  32. Lin, Z.; Wu, Q.; Fu, S.; Wang, S.; Kong, Y. Dual-NMS: A method for autonomously removing false detection boxes from aerial image object detection results. Sensors 2019, 19, 4691.
  33. Zhang, Z.; Liu, Y.; Liu, T.; Lin, Z.; Wang, S. DAGN: A Real-Time UAV Remote Sensing Image Vehicle Detection Framework. IEEE Geosci. Remote Sens. Lett. 2019, 17, 1884–1888.
  34. Darehnaei, Z.G.; Fatemi, S.; Mirhassani, S.M.; Fouladian, M. Ensemble deep learning using Faster R-CNN and genetic algorithm for vehicle detection in UAV images. IETE J. Res. 2021, 1–10.
  35. Tan, Q.; Ling, J.; Hu, J.; Qin, X.; Hu, J. Vehicle Detection in High Resolution Satellite Remote Sensing Images Based on Deep Learning. IEEE Access 2020, 8, 153394–153402.
  36. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778.
  37. Ju, M.; Luo, J.; Liu, G.; Luo, H. A real-time small target detection network. Signal Image Video Process. 2021, 15, 1265–1273.
  38. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 658–666.
  39. Lin, T.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 42, 2999–3007.
  40. Razakarivony, S.; Jurie, F. Vehicle detection in aerial imagery: A small target detection benchmark. J. Vis. Commun. Image Represent. 2016, 34, 187–203.
  41. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A Large-scale Dataset for Object Detection in Aerial Images. arXiv 2018, arXiv:1711.10398.
  42. Loshchilov, I.; Hutter, F. SGDR: Stochastic gradient descent with warm restarts. arXiv 2016, arXiv:1608.03983.
Figure 1. Examples of the vehicles in remote sensing images.
Figure 2. The architecture of SuperDet. SuperDet consists of three parts, namely, Data preprocessing, Super resolution module and Vehicle detection module.
Figure 3. Pixel Shuffle Layer.
Figure 4. Context expansion module. The context expansion module contains four convolutions with different dilation rates and kernel sizes to expand the receptive field, i.e., kernel size 1 × 1, kernel size 3 × 3 with dilation rate 1, kernel size 3 × 3 with dilation rate 2 and kernel size 4 × 4 with dilation rate 2.
Figure 5. Loss curve for SuperDet.
Figure 6. Visual comparative results of SuperDet on the VEDAI dataset. First row: detection results of DAGN. Second row: detection results of SuperDet. Third row: Ground truth.
Figure 7. Visual comparative results of SuperDet on the DOTA dataset. First row: detection results of ARFFDet. Second row: detection results of SuperDet. Third row: Ground truth.
Table 1. Comparison of different vehicle detectors.

| Detector | Advantage | Disadvantage |
| --- | --- | --- |
| [25] | Applies homography data augmentation to improve the generalization ability | Not fit for real-time vehicle detection |
| HRPN | 1. Designs a combination of hierarchical feature maps; 2. Reduces false detections by negative example mining | Not fit for real-time vehicle detection |
| ResFCN | Designed for vehicle instance segmentation | Not fit for real-time vehicle detection |
| AVDNet | Alleviates the problem of vanishing features for smaller objects by introducing ConvRes blocks at multiple scales | Not fit for multi-scale vehicle detection |
| Cascaded CNNs | Improves performance by combining two independent convolutional neural networks | Not fit for real-time vehicle detection |
| CorrNet | Removes false detection boxes using dual non-maximum suppression; improves the extraction of overall features with a dilated convolution guidance structure | Not fit for real-time vehicle detection |
| Faster RER-CNN | Handles the rotation equivariance inherent to any aerial image task | Not fit for real-time vehicle detection |
| DAGN | Accelerates the model by introducing depthwise-separable convolutions | Does not address the low resolution of vehicles in remote sensing images |
| [34] | Improves performance by ensemble deep-transfer learning | Not fit for real-time vehicle detection |
Table 2. Effectiveness of each design on the VEDAI dataset.

| Baseline | With Context Expansion Module | With Feature Fusion | With SRM | mAP | FPS |
| --- | --- | --- | --- | --- | --- |
| ✓ | | | | 70.0% | 53.9 |
| ✓ | ✓ | | | 72.1% | 47.7 |
| ✓ | ✓ | ✓ | | 73.5% | 41.3 |
| ✓ | ✓ | ✓ | ✓ | 77.6% | 29.3 |
Table 3. Comparative results of different detectors on the VEDAI dataset.

| Detector | SSD | Faster RCNN | YOLO V2 | YOLO V3 | CorrNet | Faster RER-CNN | DAGN | SuperDet |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Input | 512 × 512 | 512 × 512 | 512 × 512 | 512 × 512 | 512 × 512 | 512 × 512 | 512 × 512 | 512 × 512 |
| FPS | 22.0 | 6.3 | 55.2 | 31.5 | 10.0 | 2.7 | 25.1 | 29.3 |
| mAP (%) | 46.1 | 67.0 | 50.3 | 61.6 | 68.05 | 70.8 | 67.1 | 77.6 |
Table 4. Performance comparisons for each class on the VEDAI dataset: AP (%).

| Detector | Car | Pickup | Camping | Truck | Boat | Van | Other | Tractor | Plane | mAP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Faster RER-CNN | 80.2 | 77.0 | 72.8 | 72.3 | 65.3 | 74.1 | 51.0 | 67.8 | 77.4 | 70.8 |
| DAGN | 81.3 | 66.1 | 72.0 | 32.9 | 41.5 | 78.6 | 53.9 | 85.3 | 78.2 | 67.1 |
| SuperDet | 88.6 | 82.4 | 77.3 | 70.0 | 64.1 | 93.8 | 47.2 | 77.9 | 99.8 | 77.6 |
Table 5. Comparative results of each detector on the DOTA dataset.

| Detector | YOLO V2 | Tiny-YOLO V3 | YOLO V3 | ARFFDet | SuperDet |
| --- | --- | --- | --- | --- | --- |
| Input | 512 × 512 | 512 × 512 | 512 × 512 | 256 × 256 | 256 × 256 |
| FPS | 58.3 | 76.4 | 14.7 | 130.0 | 95.3 |
| mAP (%) | 72.7 | 73.2 | 88.3 | 89.0 | 91.3 |