Article

Few-Shot Object Detection in Remote Sensing Imagery via Fuse Context Dependencies and Global Features

State Key Laboratory of Information Engineering in Surveying Mapping and Remote Sensing, Wuhan University, Wuhan 430079, China
* Author to whom correspondence should be addressed.
Remote Sens. 2023, 15(14), 3462; https://doi.org/10.3390/rs15143462
Submission received: 11 May 2023 / Revised: 19 June 2023 / Accepted: 6 July 2023 / Published: 8 July 2023
(This article belongs to the Section Remote Sensing Image Processing)

Abstract

The rapid development of Earth observation technology has promoted the continuous accumulation of remote sensing images. However, a large number of remote sensing images still lack manual object annotations, which limits the use of strongly supervised deep learning object detection methods, as they lack generalization ability for unseen object categories. Considering these problems, this study proposes a few-shot remote sensing image object detection method that integrates context dependencies and global features. Starting from a model trained on the base classes, the method fine-tunes with a small number of annotated samples to enhance the detection of new object classes. The proposed method consists of three main modules: the meta-feature extractor (ME), the reweighting module (RM), and the feature fusion module (FFM). These modules are used, respectively, to enhance the context dependencies of the query set features, to improve the global features of the annotated support set, and to fuse the query set and support set features. The meta-feature extractor is built on an optimized YOLOv5 framework, while the reweighting module for support set feature extraction is based on a simple convolutional neural network (CNN) and is preceded by foreground feature enhancement of the support set in the preprocessing stage. The method achieved favorable results on the two benchmark datasets NWPU VHR-10 and DIOR; compared with the comparison methods, it achieved the best performance in object detection of both the base classes and the novel classes.

1. Introduction

Object detection supports further decision-making analysis by obtaining the location and category of objects in the region of interest (ROI) of an image [1,2,3,4]. It is one of the important tasks in satellite image understanding and plays an indispensable role in military decision making, urban management, environmental monitoring, disaster detection, public safety, etc. [5,6,7,8,9]. Moreover, detection results can also be used to track moving objects, which is of great significance for analyzing the state and movement patterns of those objects [10,11,12]. Strongly supervised object detection in remote sensing images is now relatively mature and has produced a large number of excellent research results. However, for some specific research fields, such as disaster data and military damage data, it is extremely difficult to collect enough data to build a sample library that yields excellent results [13,14,15,16]. Therefore, how to perform object detection on limited remote sensing images with only a small number of manually annotated samples is a topic of great research significance.
Object detection in remote sensing images has matured over decades of development, from early methods based on geometric principles, through methods based on low-level image features (color, texture, etc.) and traditional machine learning, to current deep learning methods [17,18,19]. These approaches have gradually moved away from the complicated steps of manually designed features toward less manual assistance and greater automation. The current mainstream deep learning methods include (1) two-stage detection models, such as R-CNN [20], Fast R-CNN [21], and Faster R-CNN [22], which usually have high detection accuracy but slow speed, and (2) single-stage detection models, such as the YOLO series [23] and SSD [24], which are usually faster but relatively less accurate.
Additionally, some graph-based methods [25,26] are used to enhance the context information of remote sensing images and thereby improve object detection performance. Examples include building an embedded graph attention network using latent spaces and semantic relationships among objects [27], constructing semantic graphs whose nodes carry word embeddings of object category labels [28], and combining semantic graphs with a graph aggregation network to update object node weights and thus pay greater attention to the rotation angle and scale of the object [29].
These relatively mature models have been deployed in application projects to solve practical problems and improve production efficiency. However, remote sensing images differ from general natural images in that they have more complex spectral, scene, and multiscale characteristics. Therefore, applying these strongly supervised network models to object detection in remote sensing images still faces many challenges [30,31]. The major challenges are how to detect objects when only very few samples are available, and how to adapt a trained model to infer new categories of objects.
To solve these practical problems, few-shot object detection for remote sensing images has been proposed. Few-shot object detection aims to extend recognition and detection to new categories by training on known categories. It learns from the perspective of the model, absorbing the characteristics of the annotated base classes; its essence is to realize knowledge transfer through fine-tuning and finally detect new categories. Current few-shot object detection methods mainly include those based on metric learning, meta-learning, data augmentation, and multimodal data [32,33,34,35]. The methods based on data augmentation and multimodal data are directly driven by data, but their generalization is weak, making it difficult to detect objects in other scenarios. The methods based on metric learning and meta-learning start from the perspective of model learning, absorb the characteristics of the data to form a model, and then identify objects. However, such model-based methods often suffer from incomplete features, as well as insufficient feature fusion and model learning ability, making detection and recognition of both base and novel classes inaccurate.
To address the above problems, this study proposes a few-shot remote sensing image object detection approach that integrates the global features of the support set (S) and the enhanced context dependencies of the query set (Q). The method includes two stages: base-class training and novel-class fine-tuning. The proposed network contains two feature extraction modules and one feature fusion module: the meta-feature extractor (ME) for query set feature extraction, the reweighting module (RM) for support set feature extraction, and the feature fusion module (FFM) for fusing the two kinds of features. The ME uses an optimized YOLOv5 [36,37] framework as the backbone and focuses on the contextual dependencies of query set objects through graph convolution. The RM obtains the global features of the support set; the FFM then fuses them with the enhanced query set features from the ME, and the fused output is sent to the prediction layer of the model.
In summary, the main contributions of this study are as follows:
  • This study innovatively proposes a few-shot remote sensing image object detection method based on meta-learning, which integrates the global features of the S and the enhanced contextual dependencies of the Q, improving the final feature representation ability and object detection performance.
  • In the ME structure, a feature representation structure that takes into account the contextual long-distance dependencies of the Q was constructed, focusing on the regional similarity of the query features, and optimizing the encoding performance of the query features.
  • In the RM, the global feature pyramid extraction (GFPE) module is constructed to enhance the global feature representation of the S. Simultaneously, a new fusion module of query meta-features and support features is designed, which enhances the salient representation ability after feature fusion.

2. Related Work

As a key research topic in the computer vision domain, object detection technology is of great significance for understanding the high-level semantic information of images. In recent years, with the continuous development of artificial intelligence (AI) technology, a large number of object detection methods based on deep learning have been proposed and applied in pedestrian detection, automatic driving, aerospace object detection, etc. [38,39,40].

2.1. General Object Detection

In the general field, object detection methods are divided into two categories: two-stage and one-stage. The earliest two-stage method, R-CNN, originated in 2014 and came to public attention with its combined region proposal + CNN + support vector machine (SVM) pipeline. It first generates candidate regions and then judges whether each candidate region contains an object to be detected, as well as the specific category of that object. Inspired by this idea, representative two-stage methods such as SPP-Net [41], Fast R-CNN, and Faster R-CNN appeared one after another. To meet specific application requirements, a series of improvements on these two-stage approaches were proposed. For example, in face detection, Sun et al. [42] addressed the shortcomings of Faster R-CNN by combining improvement strategies such as feature concatenation, multiscale training, and joint pretraining, achieving better face detection performance. Similarly, to solve the problem of poor vehicle detection caused by large changes in vehicle size, severe occlusion, or truncated vehicles in the image, Hoanh et al. [43] replaced the backbone of Faster R-CNN with MobileNet and replaced the non-maximum suppression (NMS) algorithm with soft-NMS, realizing fast and accurate vehicle detection. The loss function also has an important impact on the model; hence, some researchers proposed IoU-balanced loss functions to improve positioning accuracy without sacrificing efficiency, and experiments confirmed that the proposed loss function indeed improves the performance of Faster R-CNN [44]. Two-stage object detection treats deciding whether each candidate is an object or background as a classification problem, which yields high detection accuracy. In contrast, the YOLO series, as a representative one-stage family, is fast: it reformulates object detection as a regression problem, directly predicting objects and their bounding box attributes from image pixels [45].
In intelligent or assisted driving, to reduce the interference of irrelevant scene information, a real-time detection framework called increase–decrease YOLO (ID-YOLO) was used to simulate the driver's selective attention mechanism, paying more attention to the objects in the driving scene while suppressing irrelevant ones [46]. To address the problem that small objects are easily missed, Luo et al. [47] built a context matrix in place of the original classification matrix of YOLOv3 and replaced the NMS module with a context filtering algorithm, obtaining a better detection effect with an average accuracy of over 90%. In addition, for practical deployment, Wang et al. [48] leveraged the speed of the YOLOv4 network and proposed a Trident-YOLO framework that reduces the parameter count of the original modules and significantly improves object detection performance, which has proven valuable on mobile devices with limited computing power.
The above object detection algorithms have played an important role in different general fields and have achieved good results in practical applications. However, object detection in remote sensing images still faces many challenges, including multiscale objects [49], dense objects [50], and arbitrary orientations [51]. Scale, as the most prominent characteristic of remote sensing object detection, is the primary concern and directly affects detection performance. Addressing the limited receptive field of the feature pyramid network (FPN) during feature extraction, Dong et al. [52] proposed a gated context-aware module (G-CAM) that adaptively perceives multiscale contextual information to enhance object detection performance. Similarly, because Faster R-CNN matches the scale variability of different objects in remote sensing images poorly and localizes small objects weakly, an enhanced feature extractor was proposed for Faster R-CNN so that small and dense objects elicit a stronger response [13]. Moreover, Zhang et al. [53] proposed a scale-adaptive proposal network (SAPNet), which consists of multilayer RPN and detection subnetworks that generate multiscale object proposals; the results showed that SAPNet significantly improves the accuracy of multi-object detection in remote sensing images. The arbitrary orientation of objects in remote sensing images often leads to missed detections when objects are annotated with horizontal bounding boxes (HBBs). Therefore, oriented bounding box (OBB) annotation with angle information has been adopted. Combining high-level semantic information with low-level features can improve multiscale representation for remote sensing images, and an effective OBB detector can localize objects in any direction more accurately, ultimately yielding excellent detection capability [14,54]. Beyond arbitrary orientations, Li et al. [55] focused on the blurred appearance of similar objects (for example, roads and bridges) and achieved a high average accuracy (87%) with a dual-channel feature fusion network that learns local and contextual information about objects. Wang et al. [56] refined object features by modeling the rotation angle and used the Bhattacharyya distance between Gaussian distributions as a loss function to strengthen learning and convergence; experimental results on multiple datasets showed that the proposed network performs excellently. In terms of specific applications, more detailed object detection studies can be reviewed. Sun et al. [57] applied object detection to composite objects (sewage treatment plants, golf courses, airports, etc.) and proposed a unified part-based convolutional neural network (PBNet) specifically for detecting them; the network treats a composite object as a set of parts and incorporates individual part features into contextual information to improve overall detection performance. Li et al. [58] proposed a CNN-based fine-grained ship detection network, which improved the efficiency and accuracy of fine-grained ship detection in large-format remote sensing images. Furthermore, to reduce the reliance of existing detection methods on instance-level labels, Chen et al. [59] proposed a weakly supervised object detection network based on collaborative learning; the method introduces a joint pooling module and optimizes the candidate proposals and object distribution characteristics, achieving good performance.

2.2. Few-Shot Object Detection

Current deep learning object detection technology relies too much on annotated samples, which hinders the industrialization of AI. In recent years, researchers have paid increasing attention to learning effective models from a small amount of annotated data. Few-shot learning is a solution proposed for the problem of scarce annotated samples. At present, research follows two paradigms: one based on meta-learning and one based on transfer learning [35,60,61,62,63].
Object detection based on meta-learning uses a meta-learner to extract meta-knowledge from the query set and support set, and then detects new classes by adapting that meta-knowledge. In remote sensing images, few-shot object detection generally focuses on multiscale feature enhancement and on the relationship between support features and query features. Specifically, a feature attention highlight module was proposed that greatly promotes the detection performance of few-shot objects in remote sensing images [64]. Similarly, Zhang et al. [65] proposed a few-shot class-feature enhancement approach to address the large intra-class gap that causes confusion in recognition results. Other researchers focused on the connection between query features and support features. To make full use of the information provided by the support set, Wang et al. [66] proposed a diversity measurement module to measure diversity information, obtain more meta-feature knowledge, and strengthen the connection between support and query features. Zhang et al. [67] proposed a few-shot remote sensing detection method with self-adaptive global similarity and a two-way foreground stimulator, which alleviates the spatial similarity and asymmetry problems between support and query features. Zhu et al. [68] used meta-learning to integrate Faster R-CNN with a reweighting module and fine-tuned the reweighting module and the last layer of Faster R-CNN in the meta-test stage, achieving good few-shot detection performance. The large scale and orientation changes of objects in remote sensing images are major difficulties in few-shot object detection; integrating FPNs and using prototype features to enhance query features are new ideas for this problem. The TINet proposed by Liu et al. explicitly aligns query and support features while ensuring geometric invariance, gaining performance while maintaining the same inference speed [69]. In addition, some researchers explored lightweight models, which relate directly to practical deployment. Li et al. [70] proposed a meta-learning-based few-shot object detection method (FSODM) that uses YOLOv3-tiny as the main architecture and a multiscale lightweight detection framework, achieving excellent results on both benchmark datasets. Building on the FSODM infrastructure, Zhou et al. [71] made lightweight improvements for SAR image object detection, changed the meta-feature extractor to a lighter DarknetS model, and considered the correlation and saliency between different classes of the support set; the new model achieved excellent performance on SAR images. Furthermore, detecting important time-sensitive objects such as ships and airplanes requires more refined type identification, so Zhang et al. [72] proposed a few-shot multiclass ship detection algorithm with an attention mechanism and multi-relation detectors. Moreover, because remote sensing images have complex background features from which fine-grained features are difficult to extract, Liu et al. [73] proposed an improved task sampling strategy for optimizing the object distribution, achieving excellent performance in fine-grained few-shot airplane detection.
The two-stage fine-tuning approach (TFA) is a common method for few-shot object detection based on transfer learning, and most variants improve upon Faster R-CNN. In general-domain few-shot detection, sample scarcity often makes it difficult for models such as Faster R-CNN to extract effective features, resulting in low detection accuracy. Therefore, many researchers have combined transfer learning methods to solve the few-shot detection problem. Combining transfer learning with an added attention mechanism and an improved loss function on top of the Faster R-CNN network can significantly improve the generalization ability of few-shot object detection [74]. Similarly, Zhou et al. [75] proposed a few-shot airplane detection method with a feature scale selection pyramid for the small foreground size and low contrast of satellite video data; the method takes full advantage of the two-stage fine-tuning strategy and the characteristics of satellite video. In addition, the low-shot regime suffers from a lack of intra-class variation, while the high-shot regime suffers from misalignment between the learned distribution and the real distribution, both of which lead to poor detection of new categories; therefore, the knowledge learned by the model can be enhanced and transferred to guide new-category detection, improving the competitiveness of the model [76]. Regarding the efficiency of few-shot object detection, some researchers proposed an efficient pretraining transfer framework without computational increments, mainly by designing a knowledge-inherited initializer to reweight the box classifier, which effectively facilitates knowledge transfer and speeds up adaptation [77]. Recent studies have shown that text data are beneficial for few-shot object detection: Lu et al. [73] proposed an FSOD method guided by text-modal knowledge, which uses text-modal knowledge to summarize the universality and uniqueness of each category, greatly reducing confusion among new categories.

3. Methods

3.1. Overall Framework

Inspired by [70], a few-shot remote sensing image object detection framework based on meta-learning with a similar structure is proposed in this study. The method fuses the global features of S with the enhanced context dependencies of Q. The whole framework includes three important modules (ME, RM, and FFM) and the final bounding box prediction layer, as shown in Figure 1. The framework is trained in two stages: base training and fine-tuning. Base training learns meta-knowledge from the datasets of visible (base) categories; the fine-tuning stage transfers the meta-knowledge learned in base training to train on the unseen (novel) category datasets.
The input of the proposed framework includes Q and S. The Q can be regarded as test samples for evaluating the performance of the task and carries no annotations; the S serves as training samples and includes images with annotations for individual instances. Few-shot object detection is also called the N-way K-shot detection task, where N and K indicate that the support set S contains N categories with K instance annotations per category. In addition, the Q generally contains Nq images drawn from the same category set C as the S. In this study, we use the annotations in the support set S as an auxiliary for frequency-sensitive foreground enhancement (FE) processing to incorporate frequency knowledge into the feature learning of the support set S.
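For concreteness, the following Python sketch shows how an N-way K-shot episode could be assembled from an annotated image pool; the data structure, function name, and sampling details are illustrative and not taken from the authors' implementation.

```python
import random
from collections import defaultdict

def sample_episode(annotations, n_ways=3, k_shot=5, n_query=4, seed=None):
    """Assemble one N-way K-shot episode (illustrative helper, not the paper's code).

    `annotations` maps an image id to a list of (category, bbox) tuples.
    Returns a support dict {category: K annotated image ids} and a list of
    query image ids drawn from the same categories but kept unannotated.
    """
    rng = random.Random(seed)

    # Index images by the categories they contain.
    by_category = defaultdict(set)
    for image_id, instances in annotations.items():
        for category, _bbox in instances:
            by_category[category].add(image_id)

    # Pick N categories, then K annotated support images per category.
    categories = rng.sample(sorted(by_category), n_ways)
    support = {c: rng.sample(sorted(by_category[c]), k_shot) for c in categories}

    # Query images come from the same categories but exclude support images.
    used = {i for ids in support.values() for i in ids}
    pool = sorted({i for c in categories for i in by_category[c]} - used)
    query = rng.sample(pool, min(n_query, len(pool)))
    return support, query
```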
Specifically, as shown in Figure 2, we first use the annotation information of the support set S to generate a mask (Figure 2b) and then use the mask to crop the annotated part of the support image (Figure 2a), forming an image containing only the foreground object; secondly, this study constructs a frequency extractor, shown in Figure 2, to reduce the noise of the input image. The frequency extractor uses the discrete Fourier transform [78] to obtain a frequency representation of the two-dimensional (2D) image data f(w, h) of size W × H, as given below.
$$F(k, l) = \sum_{w=0}^{W-1} \sum_{h=0}^{H-1} f(w, h)\, e^{-2\pi i\, kw/W}\, e^{-2\pi i\, lh/H},$$ (1)
where k ∈ [0, W − 1] and l ∈ [0, H − 1] represent the coordinates of the two-dimensional image, respectively. Meanwhile, i in the 2D Fourier transform is the imaginary unit.
To better represent the frequencies of different bands, the Cartesian representation is first converted to polar coordinates (r and θ), as shown in Equation (2), where AI(r) represents the average intensity of the 2D image signal at radial distance r. Meanwhile, to enhance the difference between the high-frequency bands of the image, the average intensity AI(r) is taken as input and passed through a high-pass filter (Fhp), finally obtaining the one-dimensional spectral vector VI of the input image I, as shown in Equation (3).
$$F(r, \theta) = F(k, l)\ \text{with}\ r = \sqrt{k^2 + l^2},\ \theta = \arctan\frac{l}{k}, \qquad AI(r) = \frac{1}{2\pi} \int_{0}^{2\pi} F(r, \theta)\, d\theta.$$ (2)
$$V_I = F_{hp}\big(AI(r)\big), \qquad F_{hp}(x) = \begin{cases} x, & r > r_\tau \\ 0, & \text{otherwise,} \end{cases}$$ (3)
where $r_\tau$ is the threshold radius of the high-pass filter. $V_I$ has dimension $\mathbb{R}^{Class \times 1}$ and is reshaped to $\mathbb{R}^{Class \times 1 \times 1 \times 1}$, followed by matrix multiplication with the original support set feature $S_i \in \mathbb{R}^{Class \times 3 \times W \times H}$; the output feature map is shown in Figure 2c. Here, Class represents the number of visible categories contained in the support set S. $V_I$ easily distinguishes whether an image contains noise in the high-frequency bands and can therefore be used for noise reduction. Lastly, the object-only image obtained in the first step of Figure 2 is merged with the image shown in Figure 2c to obtain the final output image in Figure 2d. Figure 2d suppresses the background noise outside the object region, which benefits the enhancement of object features, and it serves as the final input of the RM.
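As a concrete illustration of Equations (1)–(3), the NumPy sketch below computes the spectral vector of a single support image; the per-class reshaping and the mask-based merging of Figure 2 are omitted, and the threshold radius is an assumed value.

```python
import numpy as np

def spectral_vector(image_gray, r_tau=16):
    """Sketch of the frequency extractor (Equations (1)-(3)): 2D DFT, radially
    averaged intensity AI(r), and a high-pass filter keeping only bands with
    r > r_tau. The image is assumed single-channel; r_tau is illustrative."""
    H, W = image_gray.shape
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(image_gray)))   # |F(k, l)|, Eq. (1)

    # Radial distance of every frequency bin from the spectrum centre (polar r).
    ky, kx = np.indices((H, W))
    r = np.sqrt((kx - W / 2) ** 2 + (ky - H / 2) ** 2).astype(int)

    # AI(r): average intensity over each ring of constant radius, Eq. (2).
    radial_sum = np.bincount(r.ravel(), weights=spectrum.ravel())
    radial_count = np.maximum(np.bincount(r.ravel()), 1)
    ai = radial_sum / radial_count

    # F_hp: zero out the low-frequency bands, keep r > r_tau, Eq. (3).
    return np.where(np.arange(ai.size) > r_tau, ai, 0.0)
```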

3.2. Meta-Feature Extractor

ME is used for the multiscale representation of the features of Q. The same type of object in a remote sensing image usually appears at different scales, so adding a multiscale feature representation structure benefits the detection of objects of different sizes. The ME module designed in this study is optimized with YOLOv5 as the backbone; its overall structure is shown in Figure 3, including the Backbone, PANet, and Output parts. The original YOLOv5 backbone, CSPDarkNet, was replaced with the more lightweight G-GhostNet [79] module, which reduces network parameters while improving detection accuracy. G-GhostNet achieves a balance between accuracy and GPU latency by deleting some activation layers of C-GhostNet [80]. In addition, to obtain the context dependencies of each scale of object features, and inspired by [81], a graph convolutional unit (GCU) is introduced. This unit strengthens the connections between similar pixels of each scale feature in the Output layer and deepens the differences between dissimilar pixels.
In this study, the specific settings of G-GhostNet are shown in Table 1. The G-Ghost Bottleneck structure in the Backbone layer in Figure 3 corresponds to the specific operations shown in Stage L1, Stage L2, and Stage L3 in Table 1, respectively. The output feature $Y_n$ of each stage is the fusion of the complex feature $Y_n^c \in \mathbb{R}^{(1-\mu)c \times W \times H}$ obtained by the Block operation and the cheap feature $Y_n^g \in \mathbb{R}^{\mu c \times W \times H}$ obtained by the Cheap operation. The value range of μ is 0 ≤ μ ≤ 1; here, μ is taken as 0.5. The specific expression is shown in Equation (4), where $Y_n^c$ is generally the feature obtained after the residual operation, and $Y_n^g$ is generally the feature obtained after a 1 × 1 or 3 × 3 convolution.
$$Y_n = \left[ Y_n^c,\ Y_n^g \right].$$ (4)
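The following PyTorch sketch illustrates the idea behind Equation (4): a G-Ghost-style stage splits its output channels between a "complex" branch of stacked blocks and a "cheap" 1 × 1 branch, then concatenates them. The layer sizes and block depths are placeholders, not the exact configuration of Table 1.

```python
import torch
import torch.nn as nn

class GGhostStage(nn.Module):
    """Sketch of a G-Ghost-style stage (Eq. 4): part of the output channels are
    "complex" features from stacked blocks (Y_n^c), the rest are "cheap"
    features from a 1x1 convolution (Y_n^g), and the two are concatenated."""

    def __init__(self, c_in, c_out, mu=0.5):
        super().__init__()
        c_cheap = int(mu * c_out)          # Y_n^g channels
        c_complex = c_out - c_cheap        # Y_n^c channels
        self.blocks = nn.Sequential(       # stand-in for the residual Block branch
            nn.Conv2d(c_in, c_complex, 3, padding=1, bias=False),
            nn.BatchNorm2d(c_complex),
            nn.ReLU(inplace=True),
            nn.Conv2d(c_complex, c_complex, 3, padding=1, bias=False),
            nn.BatchNorm2d(c_complex),
        )
        self.cheap = nn.Sequential(        # cheap 1x1 branch
            nn.Conv2d(c_in, c_cheap, 1, bias=False),
            nn.BatchNorm2d(c_cheap),
        )

    def forward(self, x):
        y_c = self.blocks(x)               # complex features Y_n^c
        y_g = self.cheap(x)                # cheap features Y_n^g
        return torch.cat([y_c, y_g], dim=1)  # Y_n = [Y_n^c, Y_n^g]
```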
In addition, in Figure 3, the multiscale features obtained by the 1 × 1 convolution of the output layer lack contextual long-range dependence. Therefore, the GCU is established by using the graph structure to strengthen the dependency between adjacent pixels. The structure of GCU is shown in Figure 4, and the specific calculation method is shown in Algorithm 1.
First, the feature X of the output layer after 1 × 1 convolution is reprojected spatially, and the K-means [82] algorithm is used to initialize $W \in \mathbb{R}^{c_1 \times V}$ and $Variance \in \mathbb{R}^{c_1 \times V}$. Here, $c_1$ represents the channel size of the input feature X, and V represents the number of regions into which the feature map X is divided, which is also the number of nodes in the graph structure. In Step 1 of Algorithm 1, the calculation formulas for any element of the probability matrix $Q \in \mathbb{R}^{H \times W \times V}$ and the encoding matrix $Z \in \mathbb{R}^{c_1 \times V}$ are shown in Equations (5) and (6). In order to achieve multiscale contextual long-range dependence, this study sets V to 4 and 8, respectively. Therefore, the output feature of GCUi (i = 1, 2, 3) in Figure 3 is the fusion of three features: the original feature X, the feature $\tilde{X}_{V4}$ when V = 4, and the feature $\tilde{X}_{V8}$ when V = 8.
$$q_{ij}^{k} = \frac{\exp\!\left(-\left\| (x_{ij} - w_k)/\sigma_k \right\|_2^2 / 2\right)}{\sum_{k}\exp\!\left(-\left\| (x_{ij} - w_k)/\sigma_k \right\|_2^2 / 2\right)},$$ (5)
$$z_k = \frac{z_k'}{\left\| z_k' \right\|_2}, \qquad z_k' = \frac{1}{\sum_{ij} q_{ij}^{k}} \sum_{ij} q_{ij}^{k}\, \frac{x_{ij} - w_k}{\sigma_k},$$ (6)
where $q_{ij}^{k}$ is the probability of the pixel belonging to region (node) k, k ∈ (1, V), $x_{ij}$ represents the feature of the pixel in the i-th row and j-th column of the two-dimensional feature map, and $w_k$ represents the feature vector of the k-th node. In addition, $\sigma_k$ is the variance over all dimensions of node k, normalized to (0, 1) by the Sigmoid function.
Algorithm 1 Graph convolutional unit (GCU) algorithm
1: Input: a feature map X
2: Output: the enhanced context-dependent feature representation $\tilde{X}$
3: while in the training (test) stage do
4:   // Step 1: project X into the graph structure to obtain the probability matrix Q, the encoding matrix Z, and the adjacency matrix A
5:   initialize W, Variance ← KMeans(n_clusters = V)(X)
6:   probability matrix Q ← f($x_{ij}$, $w_k$)    (Equation (5))
     encoding matrix Z ← f($q_{ij}^{k} \in Q$, $w_k \in W$, $\sigma_k \in$ Variance)    (Equation (6))
     adjacency matrix A ← $Z^{T} Z$
7:   // Step 2: graph convolution, where $W_g \in \mathbb{R}^{c_1 \times c_2}$ is a random weight matrix
8:   $\tilde{Z}$ ← f(A $Z^{T}$ $W_g$)
9:   // Step 3: reverse reprojection
10:  $\tilde{X}$ = Q $\tilde{Z}^{T}$
11: end while
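The following PyTorch sketch illustrates the three steps of Algorithm 1. For brevity, the K-means initialization of the node centres is replaced by learnable parameters, $W_g$ is taken to be square, and the fusion with the input feature is a simple residual addition; these are assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphConvUnit(nn.Module):
    """Sketch of the GCU (Algorithm 1): soft-assign pixels to V graph nodes,
    encode each node, run one graph convolution over the node adjacency, and
    reproject the node features back onto the pixel grid."""

    def __init__(self, channels, num_nodes):
        super().__init__()
        self.w = nn.Parameter(torch.randn(num_nodes, channels))      # node centres W
        self.sigma = nn.Parameter(torch.zeros(num_nodes, channels))  # node variances
        self.wg = nn.Parameter(torch.randn(channels, channels))      # graph weights W_g

    def forward(self, x):                        # x: (B, C, H, W)
        b, c, h, w = x.shape
        feats = x.flatten(2).transpose(1, 2)     # (B, HW, C)

        # Step 1: soft assignment Q (Eq. 5) and node encoding Z (Eq. 6).
        sigma = torch.sigmoid(self.sigma)                       # keep variances in (0, 1)
        diff = (feats.unsqueeze(2) - self.w) / sigma            # (B, HW, V, C)
        q = F.softmax(-0.5 * diff.pow(2).sum(-1), dim=-1)       # (B, HW, V)
        z = torch.einsum('bnv,bnvc->bvc', q, diff)
        z = z / (q.sum(1).unsqueeze(-1) + 1e-6)                 # weighted mean per node
        z = F.normalize(z, dim=-1)                              # L2-normalise each node

        # Step 2: adjacency A and one graph convolution with W_g.
        a = z @ z.transpose(1, 2)                                # (B, V, V)
        z_new = F.relu(a @ z @ self.wg)                          # (B, V, C)

        # Step 3: reverse reprojection back onto the pixel grid.
        out = (q @ z_new).transpose(1, 2).reshape(b, c, h, w)
        return x + out                                           # residual fusion (assumed)
```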

3.3. Reweighting Module

The RM is used to extract meta-feature knowledge from the S and to guide the localization and recognition of objects in each image of the Q. The input of the RM is the reshaped form of $V_I$ mentioned in Equation (3) (Section 3.1). The specific structure of the RM is shown in Figure 5. Figure 5a shows the overall structure of the RM: after multiple convolution and pooling operations, the features are input into the GFPE shown in Figure 1, inspired by [83], and finally three multiscale feature weights of S are obtained ($R_0 \in \mathbb{R}^{W/16 \times H/16 \times 512}$, $R_1 \in \mathbb{R}^{W/32 \times H/32 \times 1024}$, and $R_2 \in \mathbb{R}^{W/8 \times H/8 \times 256}$). Figure 5b shows the global feature extraction (GFE) block in GFPE. The block obtains an attention weight for each position of the feature map (such as $F \in \mathbb{R}^{W_1 \times H_1 \times C_1}$ in Figure 5b) and then combines all features by weighted averaging to generate the global contextual feature. Its calculation formula is shown below.
$$GF_{ctx} = \sum_{i=1}^{N_p} \frac{e^{W_g F_i}}{\sum_{m=1}^{N_p} e^{W_g F_m}}\, F_i,$$ (7)
where $F \in \mathbb{R}^{W_1 \times H_1 \times C_1} = \{F_i\}_{i=1}^{N_p}$ is the input feature map of the GFE block, $F_i$ is any feature position in F, $N_p = W_1 \times H_1$ is the number of positions in the feature map, and Z (Equation (8)) is the output feature map of the GFE block. For global feature extraction, $W_g$ (a 1 × 1 convolution kernel) first compresses the feature map to one channel, the result is reshaped to 1 × 1 × $H_1 W_1$, and the Softmax function is applied to $W_g F_i$ to obtain the attention weights. Finally, the weight $e^{W_g F_i} / \sum_{m=1}^{N_p} e^{W_g F_m}$ is applied to aggregate every feature $F_i$ into the global context feature $GF_{ctx}$.
To establish the interdependence between channels, the global context features  G F c t x  are weighted and reassigned. The redistribution calculation formula is as follows:
$$Z = W_{t2}\, \mathrm{ReLU}\!\left(\mathrm{LN}\!\left(W_{t1}\, GF_{ctx}\right)\right),$$ (8)
where $W_{t1}$ (a 1 × 1 convolution kernel) reduces the channel dimensionality of the feature map, followed by LayerNorm for feature normalization and ReLU for nonlinear activation; finally, $W_{t2}$ (a 1 × 1 convolution kernel) increases the channel dimension again, restoring the feature channel size $C_1$, and the feature map Z is output.
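A PyTorch sketch of the GFE block described by Equations (7) and (8) is given below. The reduction ratio of the bottleneck and the residual addition of Z back onto the input (in the style of global-context blocks) are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class GlobalFeatureExtraction(nn.Module):
    """Sketch of the GFE block (Eqs. 7-8): softmax attention pooling over all
    spatial positions gives a global context vector GF_ctx, which is passed
    through a 1x1 bottleneck transform (W_t1, LN, ReLU, W_t2) to produce Z."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.wg = nn.Conv2d(channels, 1, kernel_size=1)          # W_g: C -> 1 attention logits
        hidden = max(channels // reduction, 1)
        self.transform = nn.Sequential(                          # W_t1, LayerNorm, ReLU, W_t2
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.LayerNorm([hidden, 1, 1]),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1),
        )

    def forward(self, f):                                        # f: (B, C, H, W)
        b, c, h, w = f.shape
        attn = torch.softmax(self.wg(f).view(b, 1, h * w), dim=-1)       # softmax weights
        context = torch.bmm(f.view(b, c, h * w), attn.transpose(1, 2))   # GF_ctx: (B, C, 1)
        z = self.transform(context.view(b, c, 1, 1))             # Eq. 8 bottleneck output Z
        return f + z   # broadcast fusion with the input (an assumption; Eq. 8 defines Z only)
```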

3.4. Feature Fusion Module

To transfer the meta-knowledge of the RM to the ME, and inspired by omni-dimensional (OD) convolution [84], we propose a dynamic FFM, shown in Figure 6, which fuses the output feature maps of the two modules. FFM introduces a multidimensional attention mechanism through a parallel strategy to learn more flexible attention over the four dimensions of the convolution kernel space. FFM can be expressed by Equations (9) and (10).
$$y = \left( \alpha_{w1}\, \alpha_{f1}\, \alpha_{c1}\, \alpha_{s1}\, W_1 + \cdots + \alpha_{wn}\, \alpha_{fn}\, \alpha_{cn}\, \alpha_{sn}\, W_n \right) \times F_r,$$ (9)
$$F_f = \mathrm{F.conv2d}\!\left(F_m,\ y\right),$$ (10)
where $F_r$ represents the feature map output by the RM ($R_0$, $R_1$, or $R_2$), $\alpha_{wi}$ represents the attention scalar of the convolution kernel $W_i$, and $\alpha_{si} \in \mathbb{R}^{k \times k}$, $\alpha_{ci} \in \mathbb{R}^{c_{in}}$, and $\alpha_{fi} \in \mathbb{R}^{c_{out}}$ represent the three newly introduced attentions, which act along the spatial dimension, the input channel dimension, and the output channel dimension, respectively. These four attentions are calculated with the multi-head attention module $\pi_i(x)$ [85,86]. By progressively multiplying the convolution kernels $W_i$ along the position, channel, filter, and kernel dimensions with the different attentions, the convolution operation strengthens the differences of the input feature map in each dimension and captures richer meta-knowledge contextual information. F.conv2d represents a two-dimensional convolution operation with the resulting weights, and $F_m$ represents the feature map generated by the ME.
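The following simplified PyTorch sketch conveys the spirit of Equations (9) and (10): a bank of candidate kernels is aggregated with kernel-, filter-, channel-, and spatial-wise attentions and then used as dynamic weights to convolve the ME feature. The multi-head attention module $\pi_i(x)$ is replaced here by a single linear layer, and treating $F_r$ as a channel-wise scale after global pooling is an assumption, since the exact modulation is not spelled out in the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusionModule(nn.Module):
    """Simplified sketch of the FFM (Eqs. 9-10): attentions over n candidate
    kernels aggregate a dynamic kernel, the RM feature F_r modulates it, and
    the result convolves the ME feature F_m."""

    def __init__(self, channels, n_kernels=4, k=3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_kernels, channels, channels, k, k) * 0.02)
        self.attn = nn.Linear(channels, n_kernels + channels + channels + k * k)
        self.n, self.c, self.k = n_kernels, channels, k

    def forward(self, f_m, f_r):                       # both: (B, C, H, W)
        pooled = f_r.mean(dim=(0, 2, 3))               # (C,) global descriptor of F_r
        a = self.attn(pooled)
        a_w = torch.softmax(a[: self.n], dim=0)                        # kernel attention
        a_f = torch.sigmoid(a[self.n: self.n + self.c])                # filter attention
        a_c = torch.sigmoid(a[self.n + self.c: self.n + 2 * self.c])   # input-channel attention
        a_s = torch.sigmoid(a[self.n + 2 * self.c:]).view(self.k, self.k)  # spatial attention

        # Eq. 9: attention-weighted sum of the kernel bank, modulated by F_r's descriptor.
        w = (a_w.view(-1, 1, 1, 1, 1) * self.weight).sum(0)            # (C_out, C_in, k, k)
        w = w * a_f.view(-1, 1, 1, 1) * a_c.view(1, -1, 1, 1) * a_s
        w = w * torch.sigmoid(pooled).view(-1, 1, 1, 1)                # F_r modulation (assumed)

        # Eq. 10: convolve the ME feature map with the dynamic kernel.
        return F.conv2d(f_m, w, padding=self.k // 2)
```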

3.5. Loss Function

A good loss function design is essential for obtaining excellent few-shot remote sensing object detection performance. In this study, we construct the loss between the predicted bounding box and the ground-truth bounding box from the localization and classification parts of object detection, so that training gradually converges. First, for the localization loss, we adopt the mean squared error loss to penalize the deviation between the predicted and true localization. Its calculation formula is shown below.
$$L_{loc} = \frac{1}{N_{pos}} \sum_{pos} \sum_{c} \left( \mathrm{coord}_t^{c} - \mathrm{coord}_p^{c} \right)^2.$$ (11)
Similar to [70], pos indicates that only the loss of positive anchor boxes is considered, while the loss of negative anchor boxes is ignored here. In practice, we choose the localization thresholds empirically; for example, if the IoU between the predicted anchor box and the ground-truth box is greater than 0.7, the anchor is considered positive, and if it is less than 0.3, the anchor is considered negative. In addition, c is a coordinate selector that selects one of the four coordinate representations {x, y, w, h} of a specific bounding box.
Since the standard IoU loss does not take into account the direction mismatch between the ground-truth and predicted anchor boxes, which leads to slow convergence and low efficiency, the SIoU loss function [87] is adopted as the constraint between the predicted and ground-truth anchor boxes.
$$L_{box} = 1 - IoU + \frac{\Delta + \Omega}{2},$$ (12)
where  Δ  represents distance cost,  Ω  represents shape cost, and  Δ  is a cost function redefined based on angle cost.
In addition, the confidence loss $L_{obj}$ requires attention in object detection; it focuses on the possibility that an object exists in a certain area. In this study, the confidence loss is a binary cross-entropy loss that comprehensively considers the objectness loss ($L_o$) and the non-objectness loss ($L_{noobj}$). Its calculation formula is shown below.
$$L_{obj} = w_{obj} L_{o} + w_{noobj} L_{noobj} = -w_{obj} \frac{1}{N_{pos}} \sum_{pos} \log P_o - w_{noobj} \frac{1}{N_{neg}} \sum_{neg} \log\left(1 - P_o\right),$$ (13)
where $P_o$ represents the possibility of containing an object. To balance $L_o$ and $L_{noobj}$, $w_{obj}$ and $w_{noobj}$ are weights, considering that in practice there are generally more negative boxes than positive boxes.
The cross-entropy loss function is used as the object classification loss, and the calculation method is shown in Equations (14) and (15).
$$R_c = \frac{e^{r_p^{c}}}{\sum_{c=1}^{N} e^{r_p^{c}}},$$ (14)
$$L_{cls} = -\frac{1}{N_{pos}} \sum_{pos} \sum_{c} y_c \log R_c,$$ (15)
where $y_c$ represents the true category annotation of category c, and $R_c$ represents the classification probability of belonging to category c; $R_c$ is normalized by the softmax function. $r_p^{c}$ (c = 1, 2, 3, …, N) represents the prediction of the anchor box for one of the N categories.
Ultimately, the loss function constraints in the entire few-shot object detection are shown in Equation (16).
$$L = L_{loc} + L_{obj} + L_{cls}.$$ (16)
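A minimal PyTorch sketch of the combined loss in Equation (16) is shown below, assuming anchors have already been matched to positives and negatives by the IoU thresholds described above; the SIoU box term of Equation (12) is omitted, and the loss weights are illustrative.

```python
import torch
import torch.nn.functional as F

def detection_loss(pred_boxes, gt_boxes, pred_obj, obj_target, pred_cls, cls_target,
                   pos_mask, w_obj=1.0, w_noobj=0.5):
    """Sketch of the total loss (Eq. 16). `pos_mask` is a boolean tensor marking
    positive anchors; everything else is treated as negative. Only the MSE
    localisation, BCE objectness, and CE classification parts are shown."""
    neg_mask = ~pos_mask

    # Eq. 11: mean-squared error on (x, y, w, h) of positive anchors only.
    l_loc = F.mse_loss(pred_boxes[pos_mask], gt_boxes[pos_mask])

    # Eq. 13: binary cross-entropy objectness, weighted for the pos/neg imbalance.
    l_obj = (w_obj * F.binary_cross_entropy_with_logits(pred_obj[pos_mask],
                                                        obj_target[pos_mask])
             + w_noobj * F.binary_cross_entropy_with_logits(pred_obj[neg_mask],
                                                            obj_target[neg_mask]))

    # Eqs. 14-15: softmax cross-entropy over categories for positive anchors.
    l_cls = F.cross_entropy(pred_cls[pos_mask], cls_target[pos_mask])

    return l_loc + l_obj + l_cls           # Eq. 16
```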

4. Experiment and Results

4.1. Datasets and Evaluation Metrics

To verify the effectiveness of the proposed method, experiments are conducted on two datasets: NWPU VHR-10 [88] and DIOR [2]. The specific dataset details are given in Table 2.
NWPU VHR-10: This dataset contains 800 remote sensing images collected from Google Earth and the ISPRS Vaihingen dataset. It is divided into two subsets: 650 images that each contain at least one manually labeled object instance, and 150 images that contain no object instances. The dataset covers 10 object categories. Following the classification standard of most studies, the base classes are ship, storage tank, basketball court, ground track field, harbor, bridge, and vehicle, and the novel classes are airplane, baseball diamond, and tennis court. In this study, few-shot experiments evaluate the proposed method under the 3-, 5-, and 10-shot settings.
DIOR: This large-scale benchmark dataset contains 23,463 images sourced mainly from Google Earth, with 20 annotated categories and 192,472 instances in total. Each image is 800 × 800 pixels, with spatial resolutions ranging from 0.5 to 30 m. Following the common classification standard, the base classes are airport, basketball court, bridge, chimney, dam, expressway service area, expressway toll station, golf course, ground track field, harbor, overpass, ship, stadium, storage tank, and vehicle; the five novel classes are airplane, baseball field, tennis court, train station, and windmill. In this study, few-shot experiments evaluate the proposed method under the 3-, 5-, and 10-shot settings.
To verify the performance of the proposed method and the advanced comparison methods, we use mean average precision (mAP) as the evaluation index, and its calculation process is shown in Equations (17)–(20).
$$P = \frac{TP}{TP + FP},$$ (17)
$$R = \frac{TP}{TP + FN},$$ (18)
$$AP_c = \int_{0}^{1} P_c(R)\, dR,$$ (19)
$$mAP_k = \frac{1}{k} \sum_{c=1}^{k} AP_c,$$ (20)
where TP denotes the number of true positives, FP denotes the number of false positives, and FN denotes the number of false negatives. The AP of a certain category c is calculated by Equation (19), and $mAP_k$ is the mean of the AP values over the k categories evaluated, reported for each k-shot setting.
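The following NumPy sketch shows how the per-class AP of Equation (19) and the mAP of Equation (20) could be computed from ranked detections; the matching of detections to ground truth at a chosen IoU threshold is assumed to have been done beforehand.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """Sketch of per-class AP (Eqs. 17-19): rank detections by confidence,
    accumulate precision/recall, and integrate the precision envelope over
    recall. `is_tp` is a boolean array marking detections already matched to
    ground truth at the chosen IoU threshold."""
    order = np.argsort(-scores)
    tp = np.cumsum(is_tp[order])
    fp = np.cumsum(~is_tp[order])
    recall = tp / max(num_gt, 1)                         # Eq. 18
    precision = tp / np.maximum(tp + fp, 1)              # Eq. 17

    # Integrate P(R) dR using the monotone precision envelope.
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))       # Eq. 19

def mean_average_precision(per_class_ap):
    """Eq. 20: mAP is the mean of the per-class AP values."""
    return float(np.mean(per_class_ap))
```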

4.2. Experiment Settings

This study uses an RTX 3090 graphics processing unit (24 GB) for training and prediction of the proposed network model, as well as for the comparative experiments. Because image sizes are not uniform, each image was resized to 512 × 512 pixels before being fed to the network. The mini-batch size was set to eight, the initial learning rate to 1 × 10−4, and the weight decay to 5 × 10−5. For base training on the NWPU VHR-10 dataset, the proposed model was trained for 12k iterations, and the 3-shot, 5-shot, and 10-shot experiments ran for 2k, 2.5k, and 4k iterations, respectively. For base training on the DIOR dataset, the proposed model was trained for 8k iterations, and the 3-shot, 5-shot, and 10-shot experiments ran for 2k, 3k, and 4k iterations, respectively.

4.3. Results on the NWPU VHR 10 Dataset

Table 3 presents the quantitative evaluation results on the NWPU VHR-10 dataset for different state-of-the-art FSOD networks, including YOLOv5 [89], Faster R-CNN (ResNet101) [22], FSRW [90], FSODM [70], TFA [91], PAMS-Det [92], G-FSDet [93], CIR-FSD [94], SAGS-TFS [67], TINet [69], and the proposed network.
As shown in Table 3, the model proposed in this study achieved excellent mAP scores for both the base class and the novel class predictions. In base class detection, the mAP reached 92%, indicating that the proposed network achieves satisfactory results when a large number of samples is available. For novel class detection, we performed experiments under the 3-shot, 5-shot, and 10-shot settings, and the mAP reached 0.56, 0.69, and 0.72, respectively. Compared with several state-of-the-art few-shot object detectors, our model also has advantages. For example, compared with SAGS-TFS, our model achieved a 5% improvement in mAP in the 3-shot setting and a 3% improvement in the 5-shot setting. As the number of object instance annotations increased, the mAP of the model tended to stabilize; under the 10-shot setting, our model and SAGS-TFS achieved the same result of 0.72. Compared with TINet, our model achieved the same performance in the 3-shot and 10-shot settings and a slight advantage in the 5-shot setting. In contrast, the fully supervised YOLOv5 and Faster R-CNN (ResNet101) achieved poor results as the training samples were gradually reduced. When compared with meta-learning-based models of the same type, such as FSRW and FSODM, our method also achieved outstanding results; in the few-shot settings, the proposed model outperformed FSODM by 19%, 16%, and 7% in mAP. TFA, PAMS-Det, G-FSDet, and CIR-FSD are all two-stage fine-tuning methods. These models generally achieve better results in the base class, while their novel class results are slightly lower; this is because, with the introduction of novel class samples, these models still do not perfectly rebalance the foreground and background proportions.
Figure 7 shows the qualitative visualization results of the proposed method under the 10-shot setting for the base classes (light-blue background) and novel classes (light-yellow background) of the NWPU VHR-10 dataset. Combined with the quantitative analysis in Table 3, our model performed excellently on the base classes, achieving an mAP above 0.90, and Figure 7 accordingly shows good base class visualizations. In particular, full-coverage detection was realized for the ground track field, with an AP of 100%, showing that our model detects large-scale objects best. For small and densely distributed objects such as ships and vehicles, the proposed model also showed good detection performance. The detection of novel classes likewise achieved the best visualization when compared with some state-of-the-art models. For example, it can be seen from Table 3 that FSODM missed a large number of objects in novel class detection, while two-stage fine-tuning networks such as CIR-FSD had a higher probability of missing tennis courts. The meta-learning-based model proposed here considers the enhancement of context and global information, which strengthens the distinction between the foreground and background of tennis courts and thus improves their detection accuracy.

4.4. Results on the DIOR Dataset

Table 4 shows the quantitative evaluation results on the DIOR dataset for different state-of-the-art FSOD networks. Together, Table 4 and Figure 8 present the quantitative results and qualitative visualizations of several of the most advanced few-shot object detection models and of the network proposed in this study on the DIOR dataset.
As shown in Table 4 and Figure 8, compared with the current state-of-the-art models, our proposed model achieved the highest mAP values and the most accurate visual detection results for both the base classes and the novel classes; the base class detection mAP reached 0.74, the same level as the CIR-FSD model. For the novel classes, we performed experiments under the 5-shot, 10-shot, and 20-shot settings, and the mAP values reached 0.36, 0.41, and 0.47, respectively. As the number of annotations increased, the performance improved, in line with the principle of supervised learning. Compared with the similar few-shot method FSRW, the proposed method improved the base class mAP by 24%; under the 5-shot, 10-shot, and 20-shot settings, the novel class detection results improved by 14%, 13%, and 13% in mAP, respectively. In addition, for the G-FSDet network, since its prediction categories in the few-shot experiments differ from the experimental settings of this study and the other network models, we took its maximum mAP across the different prediction categories as the comparison accuracy. In general, the detection accuracy of G-FSDet on both base and novel classes was better, but objects were still missed when very few samples were available.
Figure 8 shows the qualitative visualization results of the proposed model for the base classes (light-blue background) and for the 20-shot novel class setting (light-yellow background). According to the visualization results, our model pays more attention to reducing the missed detection rate of dense objects in few-shot detection on the DIOR dataset and detects better than models such as FSODM, TFA, PAMS-Det, G-FSDet, CIR-FSD, SAGS-TFS, and TINet. For example, in the dense object detection of airplanes, baseball fields, and tennis courts (Figure 8), only a small number of objects were missed across the entire prediction set.

5. Discussion

5.1. Ablation Experiment

To further examine the specific role and rationality of the proposed network in few-shot remote sensing object detection, we analyze the effectiveness of its important modules. The ablation experiments are based on the NWPU VHR-10 dataset. The baseline is a combined framework of the ME based on YOLOv5 and the RM based on simple CNN feature extraction; the FE, GFPE, GCU, and FFM modules were then added in turn. The specific quantitative evaluation is shown in Table 5.
Effectiveness of FE: The proposed FE module aims to enhance the expression of foreground features in images of visible categories. The output of the FE module is used as the input of the RM to guide enhanced object feature representation. As shown in Figure 9a, we visualized the feature maps of the middle layer before and after adding the FE module under the 10-shot setting (all feature maps in Figure 9 are based on the 10-shot setting). It can be observed from Figure 9a that, without the FE module, the shallow shape features of both the foreground object and the background were obvious, but foreground and background were not expressed with differential saliency. With the FE module, the feature expression of the background was weakened while that of the foreground was enhanced. Table 5 shows that adding the FE module increased the mAP by 4%, 3%, and 2% under the 3-shot, 5-shot, and 10-shot settings, respectively, which directly shows that the introduced FE module is effective for few-shot object detection.
Effectiveness of GFPE: The role of GFPE is to extract the multiscale features of the S; its output feature maps serve as the weight features that the FFM fuses with the output of the ME. As shown in Figure 9b, after adding the GFPE module, the deep semantic information of the object features was expressed more prominently. From the details in the red rectangle, the feature map without GFPE exhibited multi-object adhesion; for example, the boundaries between the harbors in Figure 9b were not completely distinguished. After adding GFPE, the harbor features became more specific, and the difference between harbor and background became more prominent. Table 5 likewise shows that adding the GFPE module increased the mAP under the three experimental settings by 12%, 6%, and 4%, reflecting its importance.
Effectiveness of GCU: As an important structure in the ME, the GCU module guides the enhancement of the context-dependent information of the meta-features of the Q. By setting multiple graph convolution node counts, such as V = 4 and V = 8, and fusing them with the original features, the meta-features of the Q fuse multiscale context information. It can be observed from Figure 9c that the meta-features guided by the GCU module focus more on the expression of object semantic information. The first row shows the ship features before and after introducing the GCU module; after its introduction, the semantic information of the ship becomes more obvious, as shown in the red rectangle details. In the second row, showing airplane features, the airplane was difficult to distinguish from the background before the GCU module was introduced; afterwards, the semantic information of the aircraft becomes more obvious, and objects can be well distinguished from other objects and the background, as shown in the red rectangle details. Combined with the quantitative indicators in Table 5, the accuracy was 12%, 6%, and 4% higher than without the GCU module under the 3-shot, 5-shot, and 10-shot settings, respectively. This fully demonstrates the role of the GCU module and is consistent with the information reflected in the heat maps after adding it.
Effectiveness of FFM: The function of FFM is to aggregate the multiscale features generated by the ME and the multiscale weight features generated by the RM into fused features for final object detection. From the middle-layer representation of the Q features in Figure 9d, the most important difference after adding the FFM module is that FFM focuses more on characterizing object features, strengthening their difference from the background information and enhancing their representation. As shown in the red rectangle details in Figure 9d, the object feature representation was weaker before the FFM module was added. Combined with the quantitative results in Table 5, FFM enhances the characteristics of the Q: with small numbers of sample annotations, the accuracy increased by 4%, 5%, and 3% under the 3-shot, 5-shot, and 10-shot settings, respectively.

5.2. Comprehensive Analysis

The detection ability for base and novel classes in special cases: To show that the network proposed in this study can comprehensively detect base and novel classes, we selected samples from the NWPU VHR-10 dataset that contain multiple categories. Figure 10 shows the combinations ground track field + basketball court + baseball diamond, bridge + baseball diamond, basketball court + tennis court, and harbor + tennis court, all of which are base class + novel class combinations. Detecting these combinations is extremely challenging; the first three images show successful cases, while the fourth is a typical failure case, mainly due to missed and false detections of the harbor class. The main reasons for the imperfect detection in the fourth image are as follows: (1) most of the training images of the harbor class show harbors full of ships, while the test set contains harbors that are not full of ships or have none; (2) the scale gap between harbor instances is large, so a particularly large harbor is detected as multiple harbors.
In addition, there were some detection failures in the NWPU VHR-10 dataset, as shown in Figure 11: (1) failures caused by inter-class similarity, e.g., the subtle differences between some vehicles and airplanes, or between airplanes and ships; (2) failures caused by overly dense object distributions, e.g., an increased probability of repeated or missed detections due to the closely packed distribution of various courts; (3) a small number of missed detections caused by noise, e.g., vehicles missed because of tree occlusion or shadow. Overall, our proposed method shows a certain degree of scene adaptability and generalization.
Model complexity and inference time: The complexity and inference time of a model are important factors in evaluating performance. Therefore, this subsection compares the complexity and inference time of the proposed model with other representative state-of-the-art few-shot object detection models, as shown in Table 6. Table 6 shows that the FSODM network has the highest complexity, with 81.25 M parameters, followed by the model proposed in this study with 65.01 M. Meanwhile, we calculated the computational complexity in the test phase with a batch size of 4 under the same standard as the other papers; the computational complexity of our model is at a medium level, reaching 197.89 GFLOPs.
Although the parameter count of the proposed model is relatively high, its inference time per image is the lowest at 0.11 s, showing that the proposed network achieves optimal performance at the cost of a small number of additional parameters. TFA, G-FSDet, CIR-FSD, and SAGS-TFS are two-stage fine-tuning approaches; although these models have slightly fewer parameters than our proposed model, their single-image inference time is higher, which can be attributed to the lower architectural complexity of our model during inference. In particular, the G-FSDet model has a complexity of 60.19 M parameters after the reparameterization technique and 74.08 M before it. In summary, the proposed model achieves superior few-shot object detection performance.

6. Conclusions

This study proposed a few-shot remote sensing image object detection method that fuses the context dependencies of the query set with the global features of the support set. It consists of three main structures: the ME, the RM, and the FFM. The RM passes the multiscale global features of the support set to the query branch as meta-knowledge; the ME uses an optimized YOLOv5 model as its baseline and enhances context dependencies with the multiscale GCU module; and the FFM combines the query-set meta-features with the meta-knowledge from the RM so that the query features are expressed more prominently, yielding the best detection performance. Experiments on the NWPU VHR-10 and DIOR datasets showed that the proposed model outperformed several advanced few-shot object detection models and generalized well across different remote sensing images.
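As a purely conceptual illustration of the ME, RM, and FFM flow summarized above, the sketch below expresses one common form of meta-feature reweighting and fusion: query meta-features are reweighted channel-wise by per-class support vectors and then passed through a 1 × 1 fusion convolution. The tensor shapes, the channel-wise multiplication, and the fusion layer are assumptions for illustration and do not reproduce the exact FFM described in this paper.

```python
import torch
import torch.nn as nn


class NaiveFeatureFusion(nn.Module):
    """Channel-wise reweighting of query meta-features by per-class support
    vectors, followed by a 1x1 fusion convolution (illustrative only)."""

    def __init__(self, channels: int):
        super().__init__()
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, query_feat: torch.Tensor, class_vectors: torch.Tensor):
        # query_feat:    (B, C, H, W)  meta-features from the extractor
        # class_vectors: (N, C)        one global vector per support class
        outputs = []
        for w in class_vectors:                      # loop over N classes
            reweighted = query_feat * w.view(1, -1, 1, 1)
            outputs.append(self.fuse(reweighted))    # class-specific features
        return torch.stack(outputs, dim=1)           # (B, N, C, H, W)
```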
However, the failure cases in the comparative and comprehensive experiments in Section 5 show that the proposed model still has room for improvement. In future work, we will pay more attention to the detection of objects with large scale differences and to densely distributed small objects.

Author Contributions

Conceptualization, H.S. and B.W.; methodology, B.W. and G.M.; writing—original draft preparation, B.W., Y.Z. (Yuan Zhou), and H.Z.; writing—review and editing, B.W., Y.Z. (Yongxian Zhang), and G.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Guangxi Science and Technology Major Project, grant number AA22068072.

Data Availability Statement

The experiments were conducted on publicly available datasets, which can be accessed as described in the corresponding published papers.

Conflicts of Interest

The authors declare no conflict of interest.

Figure 1. The framework of the proposed network. This network contains three important modules: meta-feature extractor, reweighting module, and feature fusion module. The meta-feature extractor uses optimized YOLOv5 as the backbone to extract the meta-features of the query set. The reweighting module uses a global feature pyramid extractor to extract multiscale global features of the support set, and the fusion features through the feature fusion module operation are used as the input (F0, F1, and F2) of the prediction layer, which is eventually used to detect the object.
Figure 2. Schematic diagram of foreground enhancement (FE) processing. (a) The original image from the support set S; (b) the generated mask; (c) the image obtained by the frequency extractor; (d) the final output obtained by fusing (b) and (c). The mask of the support set S is used to crop the foreground (i.e., the ground track field), and the frequency extractor performs a frequency transformation that removes some noise and enhances the object feature expression.
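As a rough sketch of this preprocessing step, the snippet below masks an annotated object and applies a simple FFT low-pass filter before fusing the two results. The grayscale input, the box format, and the low-pass cut-off are assumptions; it only approximates the frequency extractor illustrated in Figure 2.

```python
import numpy as np


def foreground_enhance(image: np.ndarray, box, keep_ratio: float = 0.25):
    """Illustrative foreground enhancement: mask the annotated object,
    suppress high-frequency noise with an FFT low-pass filter, then fuse.
    image: (H, W) grayscale array in [0, 1]; box: (x1, y1, x2, y2) in pixels."""
    h, w = image.shape
    x1, y1, x2, y2 = box
    mask = np.zeros_like(image)
    mask[y1:y2, x1:x2] = 1.0                          # (b) foreground mask

    spectrum = np.fft.fftshift(np.fft.fft2(image))
    yy, xx = np.ogrid[:h, :w]
    radius = keep_ratio * min(h, w) / 2
    low_pass = ((yy - h / 2) ** 2 + (xx - w / 2) ** 2) <= radius ** 2
    filtered = np.real(np.fft.ifft2(np.fft.ifftshift(spectrum * low_pass)))  # (c)

    return mask * filtered                            # (d) fused output
```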
Figure 3. Meta-feature extractor module. The overall structure is an optimization of YOLOv5. G-GhostNet is used as the backbone structure, and PANet is used as the neck layer, with the multiscale GCUs structure as the output layer.
Figure 4. Schematic diagram of graph convolutional unit.
Figure 5. Reweighting module: (a) the overall structure of the reweighting module; (b) the GE block in the global feature pyramid extractor (denoted GFPE in Figure 1). The output of each layer is annotated with its feature dimensions, e.g., W1 × H1 × C1 denotes a feature map of width W1, height H1, and C1 channels; ⊗ denotes matrix multiplication.
Figure 6. Schematic diagram of feature fusion module (FFM).
Figure 7. Visual detection results on the NWPU VHR-10 dataset under the base-class setting and the 20-shot setting. Results with a light-blue background show the base classes, and results with a light-yellow background show the novel classes.
Figure 8. Visual detection results on the DIOR dataset under the base-class setting and the 20-shot setting. Results with a light-blue background show the base classes, and results with a light-yellow background show the novel classes.
Figure 9. Effectiveness analysis of FE/GFPE/GCU/FFM. (a–d) show the middle-layer heat maps before and after adding the different modules. The red rectangles highlight the saliency changes contributed by each module.
Figure 10. Comprehensive visual display of base + novel class object detection results. The ground track field + basketball court + baseball diamond, bridge + baseball diamond, basketball court + tennis court, and harbor + tennis court combinations are shown, respectively.
Figure 11. Failed detection cases, marked with red ovals. The cases correspond to detection failures caused by inter-class similarity, false detections due to dense distribution, and missed detections caused by tree occlusion, respectively.
Table 1. The overall framework of the G-GhostNet used in this study. Block denotes a residual bottleneck, the Cheap operation is typically a 1 × 1 or 3 × 3 convolution used to obtain cheap features, and Output denotes the size and channel number of the output feature map at each stage.

| Stage | Output | Operator | Stride |
| Stem | 256 × 256 × 32 | Conv 3 × 3 | 2 |
| L0 | 128 × 128 × 128 | Block | 2 |
|    | 128 × 128 × 128 | Block × 1 Cheap | 1 |
|    | 128 × 128 × 128 | Concat | 1 |
| L1 | 64 × 64 × 256 | Block | 2 |
|    | 64 × 64 × 256 | Block × 1 Cheap | 1 |
|    | 64 × 64 × 256 | Concat | 1 |
| L2 | 32 × 32 × 512 | Block | 2 |
|    | 32 × 32 × 512 | Block × 5 Cheap | 1 |
|    | 32 × 32 × 512 | Concat | 1 |
| L3 | 16 × 16 × 1024 | Block | 2 |
|    | 16 × 16 × 1024 | Block × 5 Cheap | 1 |
|    | 16 × 16 × 1024 | Concat | 1 |
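The split into a Block path and a Cheap path in Table 1 follows the G-Ghost idea of generating part of a stage's output with inexpensive operations. The sketch below is a simplified, assumed realisation of one such stage (for brevity the cheap branch is computed from the stage input, and the channel split, block definition, and downsampling choices are illustrative rather than the exact configuration used in the paper).

```python
import torch
import torch.nn as nn


class GGhostStage(nn.Module):
    """Simplified G-Ghost-style stage: an expensive block path and a cheap
    1x1-conv path, concatenated along the channel dimension."""

    def __init__(self, in_ch, out_ch, num_blocks, cheap_ratio=0.5, stride=2):
        super().__init__()
        cheap_ch = int(out_ch * cheap_ratio)
        main_ch = out_ch - cheap_ch
        blocks = [nn.Sequential(
            nn.Conv2d(in_ch, main_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(main_ch), nn.ReLU(inplace=True))]
        for _ in range(num_blocks - 1):
            blocks.append(nn.Sequential(
                nn.Conv2d(main_ch, main_ch, 3, padding=1, bias=False),
                nn.BatchNorm2d(main_ch), nn.ReLU(inplace=True)))
        self.main_path = nn.Sequential(*blocks)
        self.cheap_path = nn.Sequential(              # cheap features
            nn.AvgPool2d(stride) if stride > 1 else nn.Identity(),
            nn.Conv2d(in_ch, cheap_ch, 1, bias=False),
            nn.BatchNorm2d(cheap_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        return torch.cat([self.main_path(x), self.cheap_path(x)], dim=1)
```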
Table 2. Details of the experimental datasets.

| Datasets | Data Collection | Image Numbers | Types | Resolution | Image Size | Base Classes : Novel Classes |
| NWPU VHR-10 | Google Earth/ISPRS Vaihingen dataset | 800 | 10 | 0.5–2.0 m | Long side: 500–1200 pixels | 7:3 |
| DIOR | Google Earth | 23,463 | 20 | 0.5–30 m | 800 × 800 pixels | 15:5 |
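Under the base/novel split in Table 2, fine-tuning uses only K annotated instances per novel class. The snippet below sketches one straightforward way to draw such a K-shot support set from an annotation list; the `annotations` structure and its field names are hypothetical and only illustrate the sampling step.

```python
import random
from collections import defaultdict


def build_k_shot_support(annotations, novel_classes, k=20, seed=0):
    """Sample at most k annotated object instances for each novel class.
    annotations: iterable of dicts like {"image": str, "cls": str, "bbox": [...]}."""
    random.seed(seed)
    per_class = defaultdict(list)
    for ann in annotations:
        if ann["cls"] in novel_classes:
            per_class[ann["cls"]].append(ann)
    support = []
    for cls, anns in per_class.items():
        random.shuffle(anns)
        support.extend(anns[:k])                 # keep only k shots per class
    return support
```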
Table 3. Comparisons among different FSOD networks on the NWPU VHR-10 dataset.

Base Class Results
| Class | YOLOv5 | Faster R-CNN (ResNet101) | FSRW | FSODM | TFA | PAMS-Det | G-FSDet | CIR-FSD | SAGS-TFS | TINet | Ours |
| Ship | 0.80 | 0.88 | 0.77 | 0.72 | 0.86 | 0.88 | - | 0.91 | - | - | 0.91 |
| Storage tank | 0.52 | 0.49 | 0.80 | 0.71 | 0.89 | 0.89 | - | 0.88 | - | - | 0.90 |
| Basketball court | 0.58 | 0.56 | 0.51 | 0.72 | 0.89 | 0.90 | - | 0.91 | - | - | 0.91 |
| Ground track field | 0.99 | 1.00 | 0.94 | 0.91 | 0.99 | 0.99 | - | 0.99 | - | - | 1.00 |
| Harbor | 0.67 | 0.66 | 0.86 | 0.87 | 0.84 | 0.84 | - | 0.80 | - | - | 0.90 |
| Bridge | 0.56 | 0.57 | 0.77 | 0.76 | 0.78 | 0.80 | - | 0.87 | - | - | 0.90 |
| Vehicle | 0.70 | 0.74 | 0.38 | 0.76 | 0.87 | 0.89 | - | 0.89 | - | - | 0.90 |
| Mean | 0.69 | 0.70 | 0.76 | 0.78 | 0.87 | 0.88 | 0.89 | 0.89 | - | - | 0.92 |

Novel Class Results
| Class | Shot | YOLOv5 | Faster R-CNN (ResNet101) | FSRW | FSODM | TFA | PAMS-Det | G-FSDet | CIR-FSD | SAGS-TFS | TINet | Ours |
| Airplane | 3 | 0.06 | 0.09 | 0.13 | 0.15 | 0.12 | 0.21 | - | 0.52 | 0.35 | - | 0.54 |
| Airplane | 5 | 0.10 | 0.19 | 0.24 | 0.58 | 0.51 | 0.55 | - | 0.67 | 0.64 | - | 0.69 |
| Airplane | 10 | 0.18 | 0.20 | 0.20 | 0.60 | 0.60 | 0.61 | - | 0.71 | 0.66 | - | 0.68 |
| Baseball diamond | 3 | 0.14 | 0.19 | 0.12 | 0.57 | 0.61 | 0.76 | - | 0.79 | 0.76 | - | 0.80 |
| Baseball diamond | 5 | 0.20 | 0.23 | 0.39 | 0.84 | 0.78 | 0.88 | - | 0.88 | 0.82 | - | 0.87 |
| Baseball diamond | 10 | 0.28 | 0.35 | 0.74 | 0.88 | 0.85 | 0.88 | - | 0.88 | 0.87 | - | 0.89 |
| Tennis court | 3 | 0.12 | 0.12 | 0.11 | 0.25 | 0.13 | 0.16 | - | 0.31 | 0.43 | - | 0.35 |
| Tennis court | 5 | 0.15 | 0.17 | 0.11 | 0.16 | 0.19 | 0.20 | - | 0.37 | 0.52 | - | 0.52 |
| Tennis court | 10 | 0.15 | 0.17 | 0.26 | 0.48 | 0.49 | 0.50 | - | 0.53 | 0.64 | - | 0.62 |
| Mean | 3 | 0.11 | 0.13 | 0.12 | 0.32 | 0.29 | 0.37 | 0.49 | 0.54 | 0.51 | 0.56 | 0.56 |
| Mean | 5 | 0.15 | 0.20 | 0.24 | 0.53 | 0.49 | 0.55 | 0.56 | 0.64 | 0.66 | 0.64 | 0.69 |
| Mean | 10 | 0.20 | 0.24 | 0.40 | 0.65 | 0.65 | 0.66 | 0.72 | 0.70 | 0.72 | 0.72 | 0.72 |
Table 4. Comparisons among different FSOD networks on the DIOR dataset.

Base Class Results
| Class | YOLOv5 | Faster R-CNN (ResNet101) | FSRW | FSODM | TFA | PAMS-Det | G-FSDet | CIR-FSD | SAGS-TFS | TINet | Ours |
| Airport | 0.59 | 0.73 | 0.59 | 0.63 | 0.76 | 0.78 | - | 0.87 | - | - | 0.85 |
| Basketball court | 0.71 | 0.69 | 0.74 | 0.80 | 0.78 | 0.79 | - | 0.88 | - | - | 0.88 |
| Bridge | 0.26 | 0.26 | 0.29 | 0.32 | 0.52 | 0.52 | - | 0.55 | - | - | 0.56 |
| Chimney | 0.68 | 0.72 | 0.70 | 0.72 | 0.66 | 0.69 | - | 0.79 | - | - | 0.80 |
| Dam | 0.40 | 0.57 | 0.52 | 0.45 | 0.54 | 0.55 | - | 0.72 | - | - | 0.75 |
| Expressway service area | 0.55 | 0.59 | 0.63 | 0.63 | 0.66 | 0.67 | - | 0.86 | - | - | 0.87 |
| Expressway toll station | 0.45 | 0.45 | 0.48 | 0.60 | 0.60 | 0.62 | - | 0.78 | - | - | 0.69 |
| Golf course | 0.60 | 0.68 | 0.61 | 0.61 | 0.79 | 0.81 | - | 0.84 | - | - | 0.87 |
| Ground track field | 0.65 | 0.65 | 0.54 | 0.61 | 0.77 | 0.78 | - | 0.83 | - | - | 0.85 |
| Harbor | 0.31 | 0.31 | 0.52 | 0.43 | 0.50 | 0.50 | - | 0.57 | - | - | 0.56 |
| Overpass | 0.46 | 0.45 | 0.49 | 0.46 | 0.50 | 0.51 | - | 0.64 | - | - | 0.65 |
| Ship | 0.10 | 0.10 | 0.33 | 0.50 | 0.66 | 0.67 | - | 0.72 | - | - | 0.75 |
| Stadium | 0.65 | 0.67 | 0.52 | 0.45 | 0.75 | 0.76 | - | 0.77 | - | - | 0.81 |
| Storage tank | 0.21 | 0.21 | 0.26 | 0.43 | 0.55 | 0.57 | - | 0.70 | - | - | 0.72 |
| Vehicle | 0.17 | 0.19 | 0.29 | 0.39 | 0.52 | 0.54 | - | 0.56 | - | - | 0.54 |
| Mean | 0.45 | 0.48 | 0.50 | 0.54 | 0.63 | 0.65 | 0.71 | 0.74 | - | - | 0.74 |

Novel Class Results
| Class | Shot | YOLOv5 | Faster R-CNN (ResNet101) | FSRW | FSODM | TFA | PAMS-Det | G-FSDet | CIR-FSD | SAGS-TFS | TINet | Ours |
| Airplane | 5 | 0.02 | 0.03 | 0.09 | 0.09 | 0.13 | 0.14 | - | 0.20 | 0.17 | - | 0.21 |
| Airplane | 10 | 0.08 | 0.09 | 0.15 | 0.16 | 0.17 | 0.17 | - | 0.20 | 0.24 | - | 0.23 |
| Airplane | 20 | 0.09 | 0.09 | 0.19 | 0.22 | 0.24 | 0.25 | - | 0.27 | 0.33 | - | 0.35 |
| Baseball field | 5 | 0.09 | 0.09 | 0.33 | 0.27 | 0.51 | 0.54 | - | 0.50 | 0.50 | - | 0.49 |
| Baseball field | 10 | 0.27 | 0.31 | 0.45 | 0.46 | 0.53 | 0.55 | - | 0.55 | 0.51 | - | 0.51 |
| Baseball field | 20 | 0.30 | 0.35 | 0.52 | 0.50 | 0.56 | 0.58 | - | 0.62 | 0.53 | - | 0.64 |
| Tennis court | 5 | 0.10 | 0.12 | 0.47 | 0.57 | 0.24 | 0.24 | - | 0.50 | 0.62 | - | 0.63 |
| Tennis court | 10 | 0.12 | 0.13 | 0.54 | 0.60 | 0.41 | 0.41 | - | 0.50 | 0.63 | - | 0.66 |
| Tennis court | 20 | 0.20 | 0.21 | 0.55 | 0.66 | 0.50 | 0.50 | - | 0.55 | 0.63 | - | 0.68 |
| Train station | 5 | 0.00 | 0.00 | 0.09 | 0.11 | 0.13 | 0.17 | - | 0.24 | 0.17 | - | 0.18 |
| Train station | 10 | 0.00 | 0.02 | 0.07 | 0.14 | 0.15 | 0.17 | - | 0.23 | 0.22 | - | 0.25 |
| Train station | 20 | 0.02 | 0.04 | 0.18 | 0.16 | 0.21 | 0.23 | - | 0.28 | 0.28 | - | 0.30 |
| Windmill | 5 | 0.01 | 0.01 | 0.13 | 0.19 | 0.25 | 0.31 | - | 0.20 | 0.23 | - | 0.31 |
| Windmill | 10 | 0.10 | 0.12 | 0.18 | 0.24 | 0.30 | 0.34 | - | 0.36 | 0.24 | - | 0.40 |
| Windmill | 20 | 0.12 | 0.21 | 0.26 | 0.29 | 0.33 | 0.36 | - | 0.37 | 0.35 | - | 0.39 |
| Mean | 5 | 0.04 | 0.05 | 0.22 | 0.25 | 0.25 | 0.28 | 0.31 | 0.33 | 0.34 | 0.29 | 0.36 |
| Mean | 10 | 0.11 | 0.13 | 0.28 | 0.32 | 0.31 | 0.33 | 0.37 | 0.38 | 0.37 | 0.38 | 0.41 |
| Mean | 20 | 0.15 | 0.18 | 0.34 | 0.36 | 0.37 | 0.38 | 0.40 | 0.43 | 0.42 | 0.43 | 0.47 |
Table 5. Ablation experiment on the NWPU VHR-10 dataset.

| Baseline | FE | GFPE | GCU | FFM | mAP (3-Shot) | mAP (5-Shot) | mAP (10-Shot) |
| ✓ |   |   |   |   | 0.32 | 0.51 | 0.62 |
| ✓ | ✓ |   |   |   | 0.36 | 0.54 | 0.64 |
| ✓ | ✓ | ✓ |   |   | 0.48 | 0.60 | 0.68 |
| ✓ | ✓ | ✓ | ✓ |   | 0.52 | 0.64 | 0.69 |
| ✓ | ✓ | ✓ | ✓ | ✓ | 0.56 | 0.69 | 0.72 |
Table 6. Model complexity and inference time comparison between our network and other FSOD approaches.

| Model | FSODM | TFA | G-FSDet | CIR-FSD | SAGS-TFS | TINet | Ours |
| Params (M) | 81.25 | 58.21 | 74.08/60.19 | 63.35 | 49.58 | - | 65.01 |
| FLOPs (G) | 216.38 | 154.70 | - | 158.98 | 327.17 | - | 197.89 |
| Time (s per image) | 0.15 | 0.28 | - | 0.35 | 0.13 | 0.25 | 0.11 |