Article

Text Semantic Fusion Relation Graph Reasoning for Few-Shot Object Detection on Remote Sensing Images

1 Institute of Optics and Electronics, Chinese Academy of Sciences, Chengdu 610209, China
2 University of Chinese Academy of Sciences, Beijing 100049, China
3 School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 101408, China
4 School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
* Author to whom correspondence should be addressed.
Remote Sens. 2023, 15(5), 1187; https://doi.org/10.3390/rs15051187
Submission received: 9 January 2023 / Revised: 13 February 2023 / Accepted: 16 February 2023 / Published: 21 February 2023
(This article belongs to the Section Remote Sensing Image Processing)

Abstract

Most object detection methods for remote sensing images depend on a large amount of high-quality labeled training data. However, due to the slow acquisition cycle of remote sensing images and the difficulty of labeling, many types of data samples are scarce. This makes few-shot object detection an urgent and necessary research problem. In this paper, we introduce a remote sensing few-shot object detection method based on text semantic fusion relation graph reasoning (TSF-RGR), which learns various types of relationships from common sense knowledge in an end-to-end manner, thereby empowering the detector to reason over all classes. Specifically, based on the region proposals provided by the basic detection network, we first build a corpus containing a large number of text language descriptions, such as object attributes and relations, which are used to encode the corresponding common sense embeddings for each region. Then, graph structures are constructed between regions to propagate and learn key spatial and semantic relationships. Finally, a joint relation reasoning module is proposed to actively enhance the reliability and robustness of few-shot object feature representation by focusing on the degree of influence of different relations. Our TSF-RGR is lightweight and easy to expand, and it can incorporate any form of common sense information. Extensive experiments show that the introduced text information delivers excellent performance gains for the baseline model. Compared with other few-shot detectors, the proposed method achieves state-of-the-art performance under different shot settings and obtains highly competitive results on two benchmark datasets (NWPU VHR-10 and DIOR).

Graphical Abstract

1. Introduction

With the continuous development of earth observation technology, more high-quality and high-resolution remote sensing images (RSI) can be generated. This has also promoted research and practical applications of object detection in this field, such as disaster monitoring [1], military reconnaissance [2], urban planning [3], and change detection [4]. Recently, numerous remote sensing object detection methods [5,6,7,8] with excellent performance have been developed owing to the strong learning ability of deep learning (including convolutional neural networks and transformers [9]). These algorithms are well suited to solving the challenging problems of remote sensing objects with large scale variations and arbitrary orientations. However, most of the models are highly dependent on the size of the training data. When the number of samples is sufficient, the algorithms can achieve reliable detection results. On the contrary, for some complex remote sensing scenes and rare high-value objects, acquiring a sufficient number of images incurs substantial acquisition and annotation costs. The scarcity of sample data seriously degrades detection performance and can even cause the model to fail.
In recent years, few-shot object detection (FSOD) has attracted extensive attention. FSOD aims to learn knowledge from the base classes with rich data and novel classes with scarce data so as to achieve more accurate detection of all classes. In general, few-shot object detection in RSI is built on the basis of general detection frameworks (such as YOLO [10], Faster R-CNN [11], etc.) by implementing transfer learning and meta learning methods. The training is then performed on some common remote sensing detection datasets, such as NWPU VHR-10 [12] and DIOR [13]. Meta-learning-based approaches [14] try to define multiple small batches of few-shot tasks in the source domain to continuously learn new tasks quickly from what has been learned. The methods based on transfer learning [15] directly learn knowledge from the source domain with rich annotations and transfer it to the target domain with only a few samples. Compared with the former, these types of methods show more powerful performance advantages.
When recognizing novel objects in an image, humans draw on previously acquired knowledge about the appearance and interrelationships of objects to infer the correct class, without needing to depend on the number of images they have observed. In other words, constructing common sense reasoning helps make more accurate identifications regardless of the availability of image data. For example, as shown in Figure 1, assuming that airplane and ship are novel classes, if we have sufficient prior common sense knowledge (e.g., associated linguistic descriptions), we can infer to a large extent that an object is more likely to be an airplane or a ship via the identification of an airport or a port. Current studies on common sense reasoning can be summarized as learning different forms of common sense knowledge and constructing different types of relational representations. These works have been used successfully in large-scale object detection and few-shot recognition. However, there is little exploration in remote sensing few-shot object detection, and performing simple transfer fusion on existing detection pipelines instead introduces more noise and less efficient detection. Therefore, our work aims to explore how text data associated with novel classes can provide more robust feature support for detection tasks under limited image data conditions. Meanwhile, a graph-based reasoning model for RSI is developed, which can not only effectively fuse various types of common sense knowledge, but also learn the semantic and spatial relationships between classes to improve the performance of few-shot methods.
In this paper, we propose text semantic fusion relation graph reasoning for a few-shot detection network. The network is composed of three submodules: a text semantic encoding module, a relational graph learning module, and a joint relational reasoning module. To obtain all classes of common sense knowledge expressions, our approach first constructs a proposal region-to-region graph structure. The text semantic encoding module is responsible for learning word embeddings of different classes from a text corpus containing rich object information descriptions. Furthermore, the initialized feature expressions of the graph nodes are generated and guided by the classification scores of previous classifiers. Then, the learning of relational graphs is achieved by computing similarity-based semantic relational edges and Gaussian kernel based spatial relational edges with the help of a gated graph neural network (GGNN) [16]. Finally, joint relational reasoning is executed to perform an adaptive fusion of visual features of region proposals and learned relational knowledge, thus empowering the detector to capture and reason about prior common sense knowledge. With this design approach, our TSF-RGR not only enhances the strong correlation between different candidate regions, but also effectively alleviates the problem of scarcity of visual information in few-shot classes.
The main contributions of our work are summarized as follows:
1. We build a corpus consisting of a large amount of natural language for each of the NWPU VHR-10 and DIOR datasets, containing descriptions of semantic information about the appearance, attributes, and potential relationships of all classes. Additional common sense guidance for scarce novel classes is provided by encoding the word vectors of the classes in the corpus and aligning the features in the text and image domains. To the best of our knowledge, this is the first work to explore this approach for few-shot object detection on RSI;
2. A relation graph learning module is designed, where the graph nodes are the region proposals and the edges encode the semantic and spatial relations that exist between them. Through the information interaction of the GGNN, the potential semantic and spatial information of each proposal is learned while enhancing the association between base and novel classes;
3. A joint relation reasoning module is designed to actively combine visual and relational features to simulate the reasoning process. This reduces the dependence of the model on image data and provides more discriminative features for novel classes. The whole model is more interpretable while improving the performance of few-shot detection;
4. Our TSF-RGR achieves consistent performance on both public benchmark datasets. It outperforms the leading methods in the same domain over a wide range of shot settings, especially in the case of extremely sparse shots.
The rest of this paper is organized as follows. The related work is outlined in Section 2 and some preparatory knowledge is briefly presented in Section 3. The proposed few-shot method is presented in Section 4. The specific experimental details and results are discussed in Section 5. Finally, our work is summarized in Section 6.

2. Related Work

2.1. Object Detection in Remote Sensing Images

As a fundamental task in image analysis, object detection is crucial in the remote sensing community [7,17,18]. However, remote sensing scenes are composed of complex and cluttered backgrounds and objects of various classes and sizes, which severely affect the performance of detectors [5,18,19]. Therefore, researchers have focused on these challenges and proposed numerous object detection methods under the deep learning paradigm, for example, RSADet (remote sensing spatial adaptation detector) [8], CF2PN (cross-scale feature fusion pyramid network) [20], FSoD-Net (full-scale object detection network) [21], SODNet (single-stage small object detection network) [22], and CFE-HA (combining feature enhancement and hybrid attention) [23]. To avoid the effects of a complex and cluttered background, RSADet adopted deformable convolutions to handle different object shape variations. A convolutional network model with an adaptive attention fusion mechanism (AAFM) [6] was proposed using parallel spatial and channel attention; in this mechanism, the distinguishability of the semantic information of the different feature maps is generated using convolutions and different pooling operations. To address the multi-scale problem, CF2PN obtains sufficiently comprehensive semantic information by utilizing thinning U-shaped modules and a cross-scale fusion module. FSoD-Net applies a multiscale enhancement network in the backbone and keeps the cascaded regression layers scale-invariant, while CFE-HA applies dilated convolutions to multilayer features together with an attention mechanism between pixels and channels. To solve the small object problem, SODNet proposes an efficient spatially parallel convolution scheme combined with multi-scale fusion. Several recent methods [24,25] focus on designing detectors based on the transformer paradigm to obtain a wider range of perceptual capabilities. Despite their impressive results, all of the above algorithms are designed for large amounts of labeled training data. When the labeled data are inaccurate or training data are lacking, the detection performance of these algorithms drops dramatically. Such problems are common in real-world applications because samples of some special classes are difficult to obtain and require expert knowledge for accurate labeling. Therefore, object detection algorithms based on few-shot learning have started to emerge.

2.2. Few-Shot Object Detection

Recently, few-shot object detection [15,26,27,28] has received increasing attention from researchers. In natural scene images, Karlinsky et al. [26] proposed RepMet to introduce distance metric learning to simultaneously learn the network backbone, embedding space and class distribution. Wang et al. [27] proposed a simple fine-tuning training strategy to obtain better detection performances for novel classes by fine-tuning only the last layer of the detector. Sun et al. [29] significantly reduced the problem of novel class misclassification by designing a contrast prediction code. Fan et al. [30] proposed implementing few-shot detection based on the similarity between the few-shot support set and the query set. Qiao et al. [15] proposed decoupling between multiple components of the Faster R-CNN to improve the classification performance of novel classes with sparse data. Kaul et al. [28] proposed to generate a large number of pseudo-labels for new classes to solve the problem of insufficient training samples. Inspired by transformer, Han et al. [31] designed a cross-transformer to aggregate the key information between the support set and the query set. Bulat et al. [32] proposed a few-shot detection transformer (FS-DETR) that can process an arbitrary number of novel classes of data simultaneously without the need for fine-tuning. The unique characteristics of remote sensing scene images led to the failure of some methods in direct applications. In response, Cheng et al. [33] focused on analyzing the main challenges faced and proposed prototype-CNN to provide class-aware prototype guidance for detectors. Li et al. [34] proposed a feature reweighting model based on meta-learning. Wolf et al. [35] introduced a double head predictor to weigh the classification weights between the base class and the novel class. Huang et al. [36] proposed sharing attention maps in different training stages to solve the problem of large variations in the size of remote sensing objects, while Wang et al. [37] effectively identified objects in complex backgrounds by capturing contextual information between different sensory domains. Zhou et al. [38] proposed context-driven pixel aggregation and feature aggregation to capture more semantic information. However, all of the above methods only rely on image information and do not explore knowledge reasoning.

2.3. Knowledge Reasoning

Knowledge reasoning has been extensively studied and applied in numerous fields (image classification [39], recognition [40], zero-shot learning [41], etc.). The broad body of work can be classified into implicit and explicit knowledge construction. Implicit knowledge construction mainly considers encoding various types of relations directly in the network. Both Hu et al. [42] and Xu et al. [43] learned key semantic spatial relations directly inside images by constructing relational modules. Marino et al. [44] learned associated object classes as knowledge graphs to improve classification performance. Mou et al. [45] proposed to exploit global relations between arbitrary spatial locations and feature graphs to enhance the feature representation of objects. Although these approaches have shown great performance improvements, they depend heavily on the amount of available data. For this reason, other works have started to explore the introduction of more types of explicit knowledge to complement the lack of visual information. Zhu et al. [46] improved the adaptation to novel classes by learning semantic embeddings from a text corpus and aligning them with image features. Gu et al. [47] considered joint learning of feature embeddings from text and visual data to better reduce the variability between domains. Some recent work [48,49,50,51] suggests constructing prior knowledge as graphical structures to guide the dissemination of information. In addition, Chen et al. [52] proposed a variational reasoning framework to address the problem of missing links in knowledge graphs. Li et al. [53] transferred knowledge by encoding a tree structure of semantic relationships between classes. In contrast, the embedding and learning of knowledge graphs have rarely been explored in few-shot detection for remote sensing images.

3. Preliminary Knowledge

3.1. Model Formulation

The training set for a general object detection task usually consists of a single dataset $D_{base} = \{(x_i^{base}, y_i^{base})\}$, which contains a large number of images of base-class $C_{base}$ instances, where $x_i^{base}$ denotes the images with base-class instances and $y_i^{base}$ denotes the labels and bounding box annotations of the base class. For the few-shot object detection task, the training set $D_{train} = D_{base} \cup D_{novel}$ includes both $D_{base}$, with a large number of labeled samples, and a novel-class dataset $D_{novel} = \{(x_i^{novel}, y_i^{novel})\}$ with only a few labeled samples, where the two subsets do not overlap in terms of categories, i.e., $C_{base} \cap C_{novel} = \varnothing$. Based on the Faster R-CNN detection framework, we build our TSF-RGR model by first training the base detector on $D_{base}$ and then, in the K-shot phase, fine-tuning it on a balanced dataset containing K instances of each novel class and of the randomly sampled base classes. For the final testing phase, the test set $D_{test}$ contains all classes.
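As a toy illustration of this setting, the sketch below builds a balanced K-shot fine-tuning subset from instance-level annotations; the (image, label) annotation format and the sampling routine are assumptions made for illustration, not the authors' actual data pipeline.

```python
import random
from collections import defaultdict

def build_kshot_set(annotations, classes, k, seed=0):
    """Sample a balanced K-shot fine-tuning set: K labeled instances per class
    (novel classes plus the randomly sampled base classes).
    annotations: list of (image_id, class_label) pairs -- a simplified, assumed
    format; real detection annotations also carry bounding boxes."""
    random.seed(seed)
    per_class = defaultdict(list)
    for image_id, label in annotations:
        per_class[label].append((image_id, label))
    subset = []
    for c in classes:
        # assumes at least k annotated instances exist for every class c
        subset.extend(random.sample(per_class[c], k))
    return subset
```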

3.2. Gated Graph Neural Networks

To facilitate the understanding of the proposed model, we briefly introduce the gated graph neural network (GGNN), as shown in Figure 2. The basic idea is to iteratively learn feature representations of arbitrarily structured data by passing information between graph nodes. For each iteration t, the inputs are the hidden states $h_v^{t-1}$ of the nodes from the previous iteration and the corresponding adjacency matrix $A$; in the first iteration ($t = 1$), the hidden states $h_v^{1}$ are initialized with the node input features $f_v$, with the insufficient dimensions padded with extra zeros. Then, the GGNN uses a mechanism similar to the gated recurrent unit (GRU) [54] to output the final node features $o_v$ after T iterations. The basic iterative process is

$$
\begin{aligned}
h_v^{1} &= \left[ f_v, 0 \right] \\
a_v^{t} &= A_v^{T} \left[ h_1^{t-1}, \ldots, h_{|\mathcal{V}|}^{t-1} \right] + b \\
z_v^{t} &= \sigma\!\left( W^{z} a_v^{t} + U^{z} h_v^{t-1} \right) \\
r_v^{t} &= \sigma\!\left( W^{r} a_v^{t} + U^{r} h_v^{t-1} \right) \\
\tilde{h}_v^{t} &= \tanh\!\left( W a_v^{t} + U \left( r_v^{t} \odot h_v^{t-1} \right) \right) \\
h_v^{t} &= \left( 1 - z_v^{t} \right) \odot h_v^{t-1} + z_v^{t} \odot \tilde{h}_v^{t} \\
o_v &= g\!\left( h_v^{T}, f_v \right)
\end{aligned}
$$

where $A_v$ is the adjacency matrix associated with node v, $\sigma(\cdot)$ and $\tanh(\cdot)$ are the logistic sigmoid function and the hyperbolic tangent function, respectively, and $g(\cdot)$ is a fully connected layer. $\odot$ denotes element-wise multiplication, and $W^{z}$, $W^{r}$, $W$, $U^{z}$, $U^{r}$, $U$, and $b$ are learnable weight parameters. For the convenience of subsequent expressions, we simplify Equation (1) and write it as $o_v = GGNN(f_v, A)$.
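For concreteness, the sketch below implements the gated propagation of Equation (1) in PyTorch; the number of iterations, the zero-padding simplification, and the form of $g(\cdot)$ are illustrative assumptions rather than the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class GGNNSketch(nn.Module):
    """Minimal GGNN propagation: GRU-style gated updates over an adjacency matrix."""
    def __init__(self, dim, num_steps=2):
        super().__init__()
        self.num_steps = num_steps
        self.W_z, self.U_z = nn.Linear(dim, dim), nn.Linear(dim, dim, bias=False)
        self.W_r, self.U_r = nn.Linear(dim, dim), nn.Linear(dim, dim, bias=False)
        self.W, self.U = nn.Linear(dim, dim), nn.Linear(dim, dim, bias=False)
        self.out = nn.Linear(2 * dim, dim)                 # g(h_v^T, f_v)

    def forward(self, f, A):
        # f: (n, dim) initial node features; A: (n, n) adjacency matrix
        h = f                                              # zero-padding step omitted for simplicity
        for _ in range(self.num_steps):
            a = A @ h                                      # aggregate neighbour messages
            z = torch.sigmoid(self.W_z(a) + self.U_z(h))   # update gate
            r = torch.sigmoid(self.W_r(a) + self.U_r(h))   # reset gate
            h_tilde = torch.tanh(self.W(a) + self.U(r * h))
            h = (1 - z) * h + z * h_tilde                  # gated state update
        return self.out(torch.cat([h, f], dim=-1))         # o_v = g(h_v^T, f_v)
```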

4. Proposed Method

In this section, we first briefly describe the process of building the relation graph model. Then, each module of the TSF-RGR model is described in detail, and the overall framework is shown in Figure 3. The proposed model consists of three main components: text semantic encoding (TSE), relation graph learning (RGL), and joint relation reasoning (JRR). Specifically, TSE encodes word embeddings for all class labels. RGL learns the relation features for each region proposal by constructing different relation types (semantic and spatial relation) with the help of GGNN. JRR aggregates relational features and visual information into the classification and bounding box regression layers to enhance the feature representation of few-shot classes in order to obtain better detection results. Finally, a two-stage fine-tuning training process is introduced.
Our proposed method uses Faster R-CNN as the base detector. The input image generates a certain number of region proposals after passing through the backbone feature extractor and the region proposal network (RPN). Considering these regions as graph nodes, we construct a region-to-region directed graph $G = (\mathcal{V}, \mathcal{E})$, where the set of nodes $\mathcal{V} = \{v_1, v_2, \ldots, v_n\}$ represents the set of all region proposals and n is the number of proposals. $\varepsilon_{ij} \in \mathcal{E}$ is a directed edge from node $v_i$ to node $v_j$, which represents the various types of relations (semantic and spatial) between regions. We believe that two types of objects have different degrees of influence on each other, i.e., $\varepsilon_{ij} \neq \varepsilon_{ji}$. For example, when detecting two types of objects, airport and airplane, the degree of influence of the airport on the airplane is significantly greater than the influence of the airplane on the airport. After that, the different relation graphs are fed into the GGNN to learn the corresponding relation features.

4.1. Text Semantic Encoding

Most recent graph-based methods only propagate the visual features between region proposals on the image. However, in the few-shot task, the lack of visual information about novel classes leads to information loss during graph propagation, and the model cannot learn any useful knowledge. To address this problem, a text corpus containing a large number of class information descriptions is constructed. The additional information provided by the corpus can compensate for the lack of visual features, since we consider the same class to be uniform in the distribution of the feature space, whether it is described by text, image, or video. Specifically, we first retrieve the label information of each class (including novel classes) separately in a search engine. Meanwhile, a large amount of textual language describing the relevant classes and inter-class relations (e.g., 'a ship is usually parked near the harbor') is collected to build the corpus. This process is easy to carry out and has a lower cost than image acquisition. Then, using the corpus as input, the category labels in the corpus are encoded into word vectors using a word embedding model commonly used in natural language processing (e.g., GloVe [55]). The text semantic embedding $\{c_i\}_{i=1}^{N}$ of each class label is obtained, where $c_i$ is a d-dimensional feature vector ($d = 32$ in the experiments) and N is the number of classes contained in the dataset. Furthermore, during end-to-end training, the output of the RPN yields the visual feature of each region proposal through RoIAlign. These features are fed to the detector classification layer to estimate a classification score vector $(s_1, s_2, \ldots, s_N)$ for each region, where the scalar score $s_i$ represents the confidence that the region proposal belongs to class i. Based on these score weights, the initial feature of each region node is obtained as a weighted mapping of the text semantics of potentially relevant classes, i.e., $f_v = \sum_{i=1}^{N} s_i c_i$. Note that when the number of classes is too large, a blind weighted sum not only increases the computational effort, but redundant knowledge may also be misleading. Consequently, for the score vector of each region, we only keep the p largest values ($p = 4$ in NWPU VHR-10, $p = 8$ in DIOR) and set the rest to 0. This ensures that the introduced text semantic knowledge is highly relevant.
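As a minimal sketch of this weighted mapping with top-p truncation, assuming the classification scores and the class word embeddings are already available as tensors:

```python
import torch

def init_node_features(scores, class_embeddings, top_p=4):
    """f_v = sum_i s_i * c_i, keeping only the top-p class scores per region.
    scores: (n, N) classification scores for n region proposals over N classes.
    class_embeddings: (N, d) word embeddings of the class labels (e.g., GloVe).
    top_p=4 corresponds to the NWPU VHR-10 setting described above."""
    vals, idx = scores.topk(top_p, dim=1)                   # keep the p largest scores
    sparse = torch.zeros_like(scores).scatter_(1, idx, vals)
    return sparse @ class_embeddings                        # (n, d) initial node features
```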

4.2. Relation Graph Learning

After obtaining the features of all region nodes, the purpose of RGL is to extract feature information highly relevant to the region proposals from the text encoding; its structure is shown in Figure 4. First, it outputs the relational features of the proposals for subsequent reasoning; second, it achieves the alignment between the text and image domains. Typically, we can easily describe verbally the relationships that exist between two classes of objects in an image and their positional relationships, for example, that two specific classes are similar in appearance, or that one class is on top of the other. Nevertheless, we cannot convey this relationship information to the algorithm directly, and no object detection dataset provides relationship annotations. Fortunately, we can encode the relationships into feature vectors without introducing any additional annotations, and we can learn them as scores between 0 and 1 that indicate the strength of the relationship between objects of two given classes. All scores are integrated to obtain the adjacency matrix $A$, which is required for graph learning. In this paper, we define two types of relational adjacency matrices: the semantic adjacency matrix $A^{sem}$ and the spatial adjacency matrix $A^{spa}$.
Semantic adjacency matrix. We model the semantic relationships in terms of vector projections, representing the degree of influence of region $v_i$ on region $v_j$ in terms of visual similarity. Each element of $A^{sem}$ can be calculated as follows:

$$\varepsilon_{ij}^{sem} = e^{-\left( \|m_i\| - \frac{dot(m_i,\, m_j)}{\|m_i\|} \right)}$$

In Equation (2), $m_i$ is the visual feature of region $v_i$ extracted from the backbone, $dot(\cdot,\cdot)$ is the vector dot product, and $\|\cdot\|$ is the 2-norm of the vector. It can be seen that the more similar two regions are, the closer $\varepsilon_{ij}^{sem}$ is to 1; otherwise, it approaches 0. This accords with our expectation.
Spatial adjacency matrix. For the description of spatial relations, we first construct a three-dimensional vector $s_{ij}$, which comprises the Euclidean distance between the centroids of regions $v_i$ and $v_j$, the angle between them, and the intersection over union (IoU) of the regions, formulated as

$$s_{ij} = \left[\, dis(v_i, v_j),\; ang(v_i, v_j),\; 1 - IoU(v_i, v_j) \,\right], \quad \text{where} \;\; dis(v_i, v_j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}, \;\; ang(v_i, v_j) = \arctan\frac{y_i - y_j}{x_i - x_j}$$
where $(x_i, y_i)$ and $(x_j, y_j)$ denote the coordinates of the centroids of regions $v_i$ and $v_j$, respectively. The $1 - IoU$ operation is used to maintain the consistency of the metric. We then encode the spatial relationship between regions with a Gaussian kernel function, given by

$$\varepsilon_{ij}^{spa} = e^{-\frac{\|s_{ij}\|^2}{2\sigma^2}}$$

where $\|\cdot\|$ denotes the norm of the vector and $\sigma$ controls the range of action of the function; in practical experiments, we take $\sigma = 1$ so that the value domain of the function lies within $[0, 1]$. However, in each training step, several hundred region proposals are generated, and not all of them are spatially relevant; regions that are far away from each other are useless for learning relations. Therefore, for region $v_i$, we only focus on the p most relevant regions and ignore the others. Meanwhile, this sparsity constraint also ensures that the inference speed of the detection network is not affected.
Finally, the semantic relation features $f_v^{sem}$ and spatial relation features $f_v^{spa}$ of each region node are learned by the GGNN, which is expressed as

$$f_v^{sem} = GGNN(f_v, A^{sem}), \qquad f_v^{spa} = GGNN(f_v, A^{spa})$$
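The sketch below shows one way the two adjacency matrices could be assembled from proposal features and boxes, following the definitions above; the tensor layouts, the use of atan2 in place of arctan, and the top-p sparsification are assumptions made for illustration.

```python
import torch

def semantic_adjacency(m):
    """Projection-based semantic edges: close to 1 for visually similar regions.
    m: (n, dim) visual features of the region proposals (assumed layout)."""
    norm = m.norm(dim=1, keepdim=True)                      # ||m_i||
    proj = (m @ m.t()) / norm                               # dot(m_i, m_j) / ||m_i||
    return torch.exp(-(norm - proj))                        # epsilon_ij^sem

def spatial_adjacency(centers, iou, sigma=1.0, top_p=8):
    """Gaussian-kernel spatial edges on [dis, ang, 1 - IoU], keeping only the
    top-p most relevant regions per row (the sparsity constraint).
    centers: (n, 2) proposal centroids; iou: (n, n) pairwise IoU (precomputed)."""
    dx = centers[:, 0:1] - centers[:, 0:1].t()
    dy = centers[:, 1:2] - centers[:, 1:2].t()
    dis = torch.sqrt(dx ** 2 + dy ** 2)
    ang = torch.atan2(dy, dx)                               # numerically safe arctan
    s = torch.stack([dis, ang, 1.0 - iou], dim=-1)          # (n, n, 3) relation vectors
    A = torch.exp(-s.norm(dim=-1) ** 2 / (2 * sigma ** 2))
    vals, idx = A.topk(top_p, dim=1)
    return torch.zeros_like(A).scatter_(1, idx, vals)
```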

4.3. Joint Relation Reasoning

Most existing work simply concatenates or adds the learned features to the original features for subsequent classification and bounding box regression. In contrast, we believe that focusing on more discriminative relational knowledge is particularly important in the few-shot setting. Different relational knowledge contributes to different degrees when inferring the same object, and simply stacking more knowledge does not necessarily improve detection. Therefore, we propose a joint relation reasoning unit with a gating mechanism that selectively retains critical knowledge under the joint guidance of the original features and the relational knowledge, leading to better detection. It is also able to reduce the gap between the image and text domains when introducing textual semantic features for novel classes. The specific formulation is
$$f = m_v \oplus l\!\left( w_1 \cdot g(m_v, f_v^{sem}),\; w_2 \cdot g(m_v, f_v^{spa}) \right)$$

where $l(\cdot,\cdot)$ and $g(\cdot,\cdot)$ are fully connected layers responsible for filtering the information, and $w_1$ and $w_2$ are two learnable weight parameters that determine the importance of the different relation knowledge. $\oplus$ denotes the element-wise sum. Ultimately, the region feature $f$ is a 1024-dimensional feature vector that is used for the final classification and bounding box regression.
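A minimal sketch of this gated fusion, assuming a 1024-d visual feature and 32-d relation features; the internal layer shapes of $g(\cdot,\cdot)$ and $l(\cdot,\cdot)$ are illustrative assumptions:

```python
import torch
import torch.nn as nn

class JointRelationReasoning(nn.Module):
    """f = m_v (+) l(w1 * g(m_v, f_sem), w2 * g(m_v, f_spa)), as described above."""
    def __init__(self, vis_dim=1024, rel_dim=32):
        super().__init__()
        self.g_sem = nn.Sequential(nn.Linear(vis_dim + rel_dim, vis_dim), nn.ReLU())
        self.g_spa = nn.Sequential(nn.Linear(vis_dim + rel_dim, vis_dim), nn.ReLU())
        self.l = nn.Linear(2 * vis_dim, vis_dim)
        self.w1 = nn.Parameter(torch.ones(1))               # learnable relation weights
        self.w2 = nn.Parameter(torch.ones(1))

    def forward(self, m_v, f_sem, f_spa):
        g1 = self.w1 * self.g_sem(torch.cat([m_v, f_sem], dim=-1))
        g2 = self.w2 * self.g_spa(torch.cat([m_v, f_spa], dim=-1))
        return m_v + self.l(torch.cat([g1, g2], dim=-1))    # 1024-d fused region feature
```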

4.4. Two-Stage Fine-Tuning

In the second fine-tuning stage, we randomly initialize the weight parameters for the novel classes and freeze the backbone feature extractor, fine-tuning the parameters of the RPN and the relation graph reasoning parts. Unlike other fine-tuning methods [27], we argue that remote sensing images have more complex backgrounds and larger variations in object size. When the features of a novel class are completely different from the base class features, a frozen RPN easily filters out novel-class objects as background or misclassifies them as base classes. Therefore, an unfrozen RPN helps the network learn richer region proposals for the novel classes. On the other hand, in the base training phase, we have already learned a robust fusion pattern between text information and visual features using the base classes with rich labeled data. After adding the sparse novel-class data and the relevant word embedding knowledge of the novel classes in the second stage, although the original graph structure cannot provide effective knowledge reasoning for the novel classes, the fusion pattern does not need to be relearned; only fine-tuning is required to quickly fuse the critical semantic information that enhances the feature representation of the novel classes. The advantage of the graph structure is evident here: for each novel class, we do not need to relearn the knowledge graphs of all classes from the first stage, which would introduce a substantial amount of repetitive learning. Instead, we simply add the knowledge of the novel classes to the base-class knowledge graph. In other words, we do not need to retrain a new model; we only need to perform quick fine-tuning in the second stage.
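As a minimal sketch of the second-stage parameter handling, freezing the backbone while keeping the RPN and relation-graph components trainable; the module names are assumptions about the detector layout, not the authors' exact implementation:

```python
def prepare_for_finetuning(model):
    """Freeze the backbone feature extractor; keep the RPN, the relation graph
    modules, and the detection heads trainable for the K-shot fine-tuning stage."""
    for p in model.backbone.parameters():
        p.requires_grad = False                 # frozen feature extractor
    for module in (model.rpn, model.relation_head, model.roi_heads):
        for p in module.parameters():
            p.requires_grad = True              # unfrozen RPN + relation reasoning
    return [p for p in model.parameters() if p.requires_grad]
```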

5. Experiments and Discussion

5.1. Experimental Settings

5.1.1. Datasets

In the experimental phase, we comprehensively evaluated the performance of the proposed method on two publicly available benchmark remote sensing datasets: NWPU VHR-10 and DIOR.
NWPU VHR-10 is an open and very high-resolution optical remote sensing image dataset. The dataset contains 800 images acquired from Google Earth and ISPRS Vaihingen, of which 150 are negative image sets without any annotated objects and 650 are positive image sets with at least 1 annotated object. These images range in size from 500 to 1200 pixels on the long side and in spatial resolution from 0.08 m to 2.0 m. A total of 3651 annotated objects are classified into 10 classes, including airplane (PL), baseball diamond (BD), basketball court (BC), bridge (BR), ground track field (GTF), harbor (HA), ship (SH), storage tank (ST), tennis court (TC), and vehicle (VE). All instances are labeled with horizontal bounding boxes (HBB).
DIOR is a large-scale remote sensing image dataset for object detection. The dataset consists of 23,463 images collected from Google Earth covering different weather and different seasons, of which 5862 are training images, 5863 are evaluation images, and 11,738 are test images. These images are all 800 × 800 pixels in size with spatial resolutions ranging from 0.5 m to 30 m. The 192,472 instances labeled by horizontal bounding boxes are classified into 20 classes: airplane (PL), airport (PO), baseball field (BF), basketball court (BC), bridge (BR), chimney (CH), dam (DA), expressway service area (ESA), expressway toll station (ETS), golf course (GC), ground track field (GTF), harbor (HA), overpass (OP), ship (SH), stadium (SD), storage tank (ST), tennis court (TC), train station (TS), vehicle (VE), and wind mill (WM).

5.1.2. Data Preprocessing

To evaluate the detection performance of the proposed TSF-RGR model under few-shot detection, we follow the experimental setup of Ref. [36]. Specifically, three classes (airplane, baseball diamond, and tennis court) are treated as novel classes in NWPU VHR-10, and the remaining classes are treated as base classes. We first crop the images in the dataset so that the long edge is 1024 pixels while keeping the original aspect ratio unchanged. Then, a training set containing only base-class object images is selected for base training, and images with only K labeled ground-truth instances per base class and novel class are randomly selected for fine-tuning, where K is set to 3, 5, and 10. Finally, the proposed method is tested on a test set containing all classes. Similarly, in DIOR, the novel classes consist of five classes (airplane, baseball field, tennis court, train station, and wind mill). The original 800 × 800 pixel image size is used for training, and K is set to 5, 10, and 20 in the fine-tuning phase. Eventually, the performance of the model is tested on the evaluation set.

5.1.3. Network Setup

We use Faster R-CNN with FPN as the baseline and ResNet-101 pretrained on ImageNet as the backbone. Our TSF-RGR model is implemented in PyTorch [56] and trained on an NVIDIA TITAN Xp GPU. In addition, we optimize the proposed model at each stage using SGD with a weight decay of 0.0001 and a momentum of 0.9. In the base training phase, we train for 30,000 and 200,000 iterations on NWPU VHR-10 and DIOR, respectively, with the initial learning rate set to 0.001 and reduced twice (by a factor of 0.1) at iterations (18,000, 28,000) and (128,000, 188,000), respectively. In the fine-tuning phase, the same initial learning rate is maintained and 3000 and 8000 iterations are trained on the two datasets, respectively. To reduce randomness and obtain fairer results, we repeat the fine-tuning step with 10 different instance partitions.

5.1.4. Evaluation Metrics

To quantitatively evaluate the performance of various algorithms on novel-class detection, we used evaluation metrics that are widely used for few-shot detection tasks: average precision (AP) and mean average precision (mAP). AP denotes the average of precision over different recall rates for a single class, i.e., the area under the precision-recall curve (PRC). Precision and recall are calculated as follows:

$$Precision = \frac{TP}{TP + FP}, \qquad Recall = \frac{TP}{TP + FN}$$

where TP, FP, and FN denote the number of true positives, false positives, and false negatives, respectively. Thus, precision measures the detection accuracy of the model and recall measures the detection completeness of the model. The other metric, mAP, is the mean of the model's AP over all novel classes, calculated as

$$mAP = \frac{1}{C} \sum_{i=1}^{C} AP_i$$

where C denotes the number of novel classes. Typically, the higher the mAP value, the better the performance.
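For reference, a small sketch of these metrics; the per-class AP values are assumed to be precomputed from the precision-recall curves:

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true positive, false positive, and false negative counts."""
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall

def mean_average_precision(ap_per_class):
    """mAP: mean of the per-class AP values over the novel classes,
    e.g. {'airplane': 0.81, 'baseball field': 0.82, ...} (illustrative values)."""
    return sum(ap_per_class.values()) / len(ap_per_class)
```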

5.2. Performance Comparison

To demonstrate the outstanding performance of the proposed TSF-RGR model, we compared it with state-of-the-art few-shot object detection models in RSI, including FSODM [34], PAMS-Det [57], OAF [58], CIR-FSD [37], and SAM-BFS [36]. Moreover, we made a comparison with the original Faster R-CNN [11] without fine-tuning. Note that we directly cite the experimental results reported in their papers, while all experimental parameters are kept consistent to maintain fairness. In addition, since SAM-BFS does not provide AP values for each class and OAF did not report detection results under the 20-shot setting in DIOR, we omit these results in the tables.
Results on NWPU VHR-10. Table 1 shows the detection accuracy of the proposed TSF-RGR with the comparison method on three novel classes. The results show that our method achieved the optimum performance for different shots settings. Compared with the existing state-of-the-art method SAM-BFS, mAP is 10 % higher under the 3-shot setting, 4 % higher under 5-shot, and 2 % higher under 10-shot. For fewer shot settings, our method demonstrates more prominent superiority. Meanwhile, we also observed that the Faster R-CNN has a poor mAP due to an overly small training sample, resulting in almost no detection of any novel class instances. This indicates that the introduction of common sense knowledge effectively improves the feature representation of novel classes when visual features are severely underdeveloped, as we will further explain in Section 5.3. In addition, the detection rate of each novel class was improved, and the AP of baseball diamond even reached 94 % under 10-shot settings, which is close to that of advanced traditional object detection algorithms. Figure 5 shows the visual detection results of our method on all classes, and it can be seen that our method achieves better localization and still maintains a good detection accuracy on the base classes.
Results on DIOR. To further validate the advantages of the proposed TSF-RGR, we conducted similar experiments on DIOR. Table 2 shows the quantitative experimental results of our method and the comparison method under different shots setting. Our method achieves 42 % , 49 % , and 54 % of mAP under 5-shot, 10-shot, and 20-shot settings, respectively, outperforming the other methods. Compared with SAM-BFS, it improved the mAP by 4 % , 2 % , and 3 % , respectively, also demonstrating again the effectiveness of our method at fewer shot settings. However, although our method achieves better performance on the airplane and baseball field classes, achieving 81 % and 82 % AP at 20-shot, respectively, our method does not show better detection performance for the train station and wind mill classes. The possible reason for this result is that such objects are more likely to be confused with the background and similar objects, and the computation of semantic and spatial relations amplifies such problems, leading to detector failure. Figure 6 shows the visual detection results of the proposed TSF-RGR on all classes. When faced with dense and small objects, both novel and base classes are well identified and accurately localized by our method using reasoning about the relationships.
Results on base classes. To fully illustrate the comprehensive performance of the proposed TSF-RGR, we also test the detection accuracy of the model on the base classes. Table 3 and Table 4 show the results of our method compared with other methods on NWPU VHR-10 and DIOR, respectively. Similarly, it can be seen that our method achieves the highest average detection precision on both datasets. Compared with CIR-FSD, a detection model under the same shot settings, the mAP is improved by 3% and 2%, respectively, which proves the effectiveness and robustness of our method on the base classes. It is also easy to find that the model performs better on NWPU VHR-10 than on DIOR for both the base and the novel classes, which may be because DIOR has a larger amount of data and more classes, distracting the attention of the model. This also indicates that simply increasing the number of samples to improve performance in few-shot detection has significant limitations.

5.3. Ablation Study

In this section, we explore in depth the contribution of each component of the proposed method to the detector. Specifically, we apply the semantic relation features and the spatial relation features to the baseline separately to verify the validity of the different relations. All ablation results are reported in Table 5 and Table 6. Compared with the baseline, semantic relations (SemR) improve the performance of the detector on NWPU VHR-10 by 11%, 8%, and 5% under the 3-shot, 5-shot, and 10-shot settings, respectively, and spatial relations (SpaR) improve it by 4%, 1%, and 3%, respectively. On DIOR, under the 5-, 10-, and 20-shot settings, semantic relations bring improvements of 4%, 2%, and 5%, and spatial relations bring improvements of 3%, 3%, and 4%, respectively. Thus, each introduced relation performs better than the baseline, and performance is optimal after relation fusion. We also find that semantic relations improve detection performance much more than spatial relations; the reason is that semantic relations focus on the variability between features, while spatial relations are more concerned with the error of bounding box localization.
In addition, with respect to the aggregation of relational and visual features, we use direct concatenation (Concat) and addition (Add) to verify the rationality of joint relation reasoning (JRR). As shown in Table 7, neither simple concatenation nor addition results in better performance for the detector, and the inference ability decreases significantly at fewer shot settings. In particular, under the 3-shot setting, the addition method is 12% lower than JRR and the concatenation method is 11% lower. We consider that in feature aggregation, both concatenation and addition simply fuse all feature information together without considering the high correlation and the differences between the information, and some redundant and noisy information is mixed in. In contrast, the JRR method fully takes into account the degree of influence of different relations on class reasoning in order to achieve optimal feature aggregation.

5.4. Discussion

Contribution of knowledge reasoning: To further examine the improvements introduced by the proposed TSF-RGR after performing knowledge reasoning, we performed a qualitative analysis with the help of the detection analysis tool proposed by Hoiem et al. [59]. Meanwhile, we also defined two groups of similar classes in the NWPU VHR-10 dataset based on the appearance and usage of the classes: sports fields (baseball diamond, basketball court, ground track field, and tennis court) and long strips (bridge, harbor, ship, and vehicle). As shown in Figure 7, we discuss the top-ranked false positive analysis reports of the proposed TSF-RGR and the baseline on the novel classes. The false positives include confusion of positive samples with the background (BG), confusion with similar classes (Sim), confusion with other NWPU VHR-10 classes (Oth), and poor localization (Loc). The results show that our method improves correct detection (Cor), while false confusion with the background and other classes is significantly reduced: for example, the confusion of baseball diamond with other sports fields was reduced by 8.4%, and the confusion of the tennis court with other sports fields was reduced by 8.9%, which is very considerable. This is a sufficient indication that good knowledge reasoning effectively enhances the discriminability of region proposals.
Contribution of cycle times: In the relation graph learning phase, the GGNN needs to pass information between nodes adequately through loop iterations. In this section, we explore the impact of the number of cycles on detection performance. Specifically, we set different cycle times (2, 3, and 4) to perform ablation experiments on NWPU VHR-10. The results are shown in Table 8, where the model performance is optimal after two cycles. As the number of cycles increases, the performance decreases at different shot settings, particularly at lower shots, with a serious loss of detection accuracy. In addition, the increase in cycle times also makes inference gradually slower. Analyzing the reason for the performance degradation, it may be that the graph structure causes the nodes to tend toward homogeneity during the process of continuous information transfer; i.e., all nodes receive the same information. This redundant information can cause the classifier to fail. Therefore, a suitable setting for the number of cycles is necessary during training.
Sensitivity and impact of novel class sizes: To understand in more detail the sensitivity and impact of the proposed TSF-RGR with respect to novel class sizes under different shot settings, we again used the analysis tool in Ref. [59]. As shown in Figure 8, after statistically ranking the bounding box area (BBox Area) based on the ground truth of similar objects, we classified the object sizes as extra small (XS: 10%), small (S: 20%), medium (M: 40%), large (L: 20%), and extra large (XL: 10%). It can easily be found that regardless of the number of class training samples, our model shows strong detection performance on objects of moderate size in each class, and the model maintains excellent detection rates even for tiny objects. For the extra small and extra large objects, however, the detection performance becomes extremely poor as the number of shots decreases. In particular, for the airplane class, the AP of extra small airplanes remains at 88% under 10-shot, but it is only 15% under 3-shot. This is understandable. On the one hand, when the number of samples available for training is too small, they cannot cover the appearance differences of the same type of objects across the entire dataset. As a result, the detector is unable to learn a more comprehensive feature representation even with the guidance of knowledge. On the other hand, the detection of tiny objects is still a pressing difficulty in the current object detection field, inspiring us to explore and design targeted solutions in subsequent studies.
Contribution of lightweight design: To evaluate the contribution of the proposed method in terms of lightweight design, we use the number of model parameters, floating point operations (FLOPs), and inference time as evaluation metrics. Table 9 shows the results of the proposed method compared with the baseline. Compared to the baseline, the added modules increase the FLOPs by 4.02 G and the parameters by only 6.31 M, while the inference time is not greatly affected. This shows that the proposed method can deliver superior performance benefits (see Table 6) with a tiny increase in parameters, which is a considerable improvement. Therefore, it can be used as a plug-and-play module that can be flexibly embedded into other models to bring effective gains.

6. Conclusions

In this work, to alleviate the problem of under-representation of the visual features of objects caused by limited remote sensing image data, we propose an effective text semantic fusion relation graph reasoning method for few-shot object detection on remote sensing images. Specifically, we first construct a corpus of information descriptions about the classes and feed them to the detector as prior knowledge by means of word embeddings, which provides additional information support for the novel classes. Moreover, the designed relation graph learning (RGL) and joint relation reasoning (JRR) modules build graph structures from region proposals to learn the possible semantic and spatial relations among them. Through adaptive joint reasoning with the original visual features, the detector is empowered to accurately capture and learn novel class concepts. With the help of the two-stage fine-tuning method, our model demonstrates robust performance on two public benchmark remote sensing datasets. Compared with previous methods, the proposed TSF-RGR achieves the best detection results over a wide range of shot settings. We hope that this approach of joint reasoning with the help of multi-source knowledge (texts, images, and videos) can bring more significant solutions for few-shot object detection on remote sensing images, and even for other remote sensing applications. For future work, in view of the remaining problems of the poor detection rate for tiny objects and the large variability of features among different source-domain knowledge, we will try to design more suitable spatial mapping functions and more robust methods for scale variation in subsequent research.

Author Contributions

Conceptualization, T.L. and S.Z.; methodology, S.Z.; validation, S.Z., F.S. and X.H.; resources, T.L.; data curation, S.Z., Y.L., and X.L.; writing—original draft preparation, S.Z.; writing—review and editing, S.Z., F.S., X.L., X.H., and T.L.; supervision, T.L. and P.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

All experiments were evaluated on publicly available datasets. The datasets can be accessed by referring to the corresponding published papers.

Acknowledgments

The authors would like to thank Bo Sun, Gong Cheng, and Ke Li for providing the implementation source codes and the image datasets.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
RSI  Remote sensing images
FSOD  Few-shot object detection
GGNN  Gated graph neural network
GRU  Gated recurrent unit
TSF-RGR  Text semantic fusion relation graph reasoning
TSE  Text semantic encoding
RGL  Relation graph learning
JRR  Joint relation reasoning
AP  Average precision
mAP  Mean average precision

References

  1. Quan, Y.; Zhong, X.; Feng, W.; Dauphin, G.; Gao, L.; Xing, M. A Novel Feature Extension Method for the Forest Disaster Monitoring Using Multispectral Data. Remote Sens. 2020, 12, 2261. [Google Scholar] [CrossRef]
  2. Shimoni, M.; Haelterman, R.; Perneel, C. Hypersectral Imaging for Military and Security Applications: Combining Myriad Processing and Sensing Techniques. IEEE Geosci. Remote Sens. Mag. 2019, 7, 101–117. [Google Scholar] [CrossRef]
  3. Wellmann, T.; Lausch, A.; Andersson, E.; Knapp, S.; Cortinovis, C.; Jache, J.; Scheuer, S.; Kremer, P.; Mascarenhas, A.; Kraemer, R.; et al. Remote sensing in urban planning: Contributions towards ecologically sound policies? Landsc. Urban Plan. 2020, 204, 103921. [Google Scholar] [CrossRef]
  4. Song, F.; Zhang, S.; Lei, T.; Song, Y.; Peng, Z. MSTDSNet-CD: Multiscale Swin Transformer and Deeply Supervised Network for Change Detection of the Fast-Growing Urban Regions. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  5. Wang, Y.; Bashir, S.M.A.; Khan, M.; Ullah, Q.; Wang, R.; Song, Y.; Guo, Z.; Niu, Y. Remote sensing image super-resolution and object detection: Benchmark and state of the art. Expert Syst. Appl. 2022, 197, 116793. [Google Scholar] [CrossRef]
  6. Ye, Y.; Ren, X.; Zhu, B.; Tang, T.; Tan, X.; Gui, Y.; Yao, Q. An Adaptive Attention Fusion Mechanism Convolutional Network for Object Detection in Remote Sensing Images. Remote Sens. 2022, 14, 516. [Google Scholar] [CrossRef]
  7. Ma, W.; Li, N.; Zhu, H.; Jiao, L.; Tang, X.; Guo, Y.; Hou, B. Feature Split–Merge–Enhancement Network for Remote Sensing Object Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–17. [Google Scholar] [CrossRef]
  8. Yu, D.; Ji, S. A New Spatial-Oriented Object Detection Framework for Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–16. [Google Scholar] [CrossRef]
  9. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.u.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  10. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar] [CrossRef] [Green Version]
  11. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [Green Version]
  12. Cheng, G.; Han, J. A survey on object detection in optical remote sensing images. ISPRS J. Photogramm. Remote Sens. 2016, 117, 11–28. [Google Scholar] [CrossRef] [Green Version]
  13. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307. [Google Scholar] [CrossRef]
  14. Xiao, Y.; Marlet, R. Few-Shot Object Detection and Viewpoint Estimation for Objects in the Wild. In Computer Vision—ECCV 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 192–210. [Google Scholar] [CrossRef]
  15. Qiao, L.; Zhao, Y.; Li, Z.; Qiu, X.; Wu, J.; Zhang, C. DeFRCN: Decoupled Faster R-CNN for Few-Shot Object Detection. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021. [Google Scholar] [CrossRef]
  16. Ruiz, L.; Gama, F.; Ribeiro, A. Gated Graph Convolutional Recurrent Neural Networks. In Proceedings of the 2019 27th European Signal Processing Conference (EUSIPCO), Coruna, Spain, 2–6 September 2019. [Google Scholar] [CrossRef] [Green Version]
  17. Li, Z.; Wang, Y.; Zhang, N.; Zhang, Y.; Zhao, Z.; Xu, D.; Ben, G.; Gao, Y. Deep Learning-Based Object Detection Techniques for Remote Sensing Images: A Survey. Remote Sens. 2022, 14, 2385. [Google Scholar] [CrossRef]
  18. Sun, X.; Wang, P.; Yan, Z.; Xu, F.; Wang, R.; Diao, W.; Chen, J.; Li, J.; Feng, Y.; Xu, T.; et al. FAIR1M: A benchmark dataset for fine-grained object recognition in high-resolution remote sensing imagery. ISPRS J. Photogramm. Remote Sens. 2022, 184, 116–130. [Google Scholar] [CrossRef]
  19. Cheng, G.; He, M.; Hong, H.; Yao, X.; Qian, X.; Guo, L. Guiding Clean Features for Object Detection in Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  20. Huang, W.; Li, G.; Chen, Q.; Ju, M.; Qu, J. CF2PN: A Cross-Scale Feature Fusion Pyramid Network Based Remote Sensing Target Detection. Remote Sens. 2021, 13, 847. [Google Scholar] [CrossRef]
  21. Wang, G.; Zhuang, Y.; Chen, H.; Liu, X.; Zhang, T.; Li, L.; Dong, S.; Sang, Q. FSoD-Net: Full-Scale Object Detection From Optical Remote Sensing Imagery. IEEE Trans. Geosci. Remote. Sens. 2022, 60, 1–18. [Google Scholar] [CrossRef]
  22. Qi, G.; Zhang, Y.; Wang, K.; Mazur, N.; Liu, Y.; Malaviya, D. Small Object Detection Method Based on Adaptive Spatial Parallel Convolution and Fast Multi-Scale Fusion. Remote Sens. 2022, 14, 420. [Google Scholar] [CrossRef]
  23. Zheng, J.; Wang, T.; Zhang, Z.; Wang, H. Object Detection in Remote Sensing Images by Combining Feature Enhancement and Hybrid Attention. Appl. Sci. 2022, 12, 6237. [Google Scholar] [CrossRef]
  24. Xu, X.; Feng, Z.; Cao, C.; Li, M.; Wu, J.; Wu, Z.; Shang, Y.; Ye, S. An Improved Swin Transformer-Based Model for Remote Sensing Object Detection and Instance Segmentation. Remote Sens. 2021, 13, 4779. [Google Scholar] [CrossRef]
  25. Li, Q.; Chen, Y.; Zeng, Y. Transformer with Transfer CNN for Remote-Sensing-Image Object Detection. Remote Sens. 2022, 14, 984. [Google Scholar] [CrossRef]
  26. Karlinsky, L.; Shtok, J.; Harary, S.; Schwartz, E.; Aides, A.; Feris, R.; Giryes, R.; Bronstein, A.M. RepMet: Representative-Based Metric Learning for Classification and Few-Shot Object Detection. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar] [CrossRef]
  27. Wang, X.; Huang, T.E.; Darrell, T.; Gonzalez, J.E.; Yu, F. Frustratingly simple few-shot object detection. arXiv 2020, arXiv:2003.06957. [Google Scholar] [CrossRef]
  28. Kaul, P.; Xie, W.; Zisserman, A. Label, Verify, Correct: A Simple Few Shot Object Detection Method. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 14237–14247. [Google Scholar] [CrossRef]
  29. Sun, B.; Li, B.; Cai, S.; Yuan, Y.; Zhang, C. FSCE: Few-Shot Object Detection via Contrastive Proposal Encoding. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual Conference, 19–25 June 2021. [Google Scholar] [CrossRef]
  30. Fan, Q.; Zhuo, W.; Tang, C.K.; Tai, Y.W. Few-Shot Object Detection With Attention-RPN and Multi-Relation Detector. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar] [CrossRef]
  31. Han, G.; Ma, J.; Huang, S.; Chen, L.; Chang, S.F. Few-Shot Object Detection with Fully Cross-Transformer. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar] [CrossRef]
  32. Bulat, A.; Guerrero, R.; Martinez, B.; Tzimiropoulos, G. FS-DETR: Few-Shot DEtection TRansformer with prompting and without re-training. arXiv 2022, arXiv:2210.04845. [Google Scholar] [CrossRef]
  33. Cheng, G.; Yan, B.; Shi, P.; Li, K.; Yao, X.; Guo, L.; Han, J. Prototype-CNN for Few-Shot Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–10. [Google Scholar] [CrossRef]
  34. Li, X.; Deng, J.; Fang, Y. Few-Shot Object Detection on Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–14. [Google Scholar] [CrossRef]
  35. Wolf, S.; Meier, J.; Sommer, L.; Beyerer, J. Double Head Predictor based Few-Shot Object Detection for Aerial Imagery. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, BC, Canada, 11–17 October 2021. [Google Scholar] [CrossRef]
  36. Huang, X.; He, B.; Tong, M.; Wang, D.; He, C. Few-Shot Object Detection on Remote Sensing Images via Shared Attention Module and Balanced Fine-Tuning Strategy. Remote Sens. 2021, 13, 3816. [Google Scholar] [CrossRef]
  37. Wang, Y.; Xu, C.; Liu, C.; Li, Z. Context Information Refinement for Few-Shot Object Detection in Remote Sensing Images. Remote Sens. 2022, 14, 3255. [Google Scholar] [CrossRef]
  38. Zhou, Y.; Hu, H.; Zhao, J.; Zhu, H.; Yao, R.; Du, W.L. Few-Shot Object Detection via Context-Aware Aggregation for Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  39. Liu, Y.; Sheng, L.; Shao, J.; Yan, J.; Xiang, S.; Pan, C. Multi-Label Image Classification via Knowledge Distillation from Weakly-Supervised Detection. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea, 22–26 October 2018. [Google Scholar] [CrossRef]
  40. Chen, R.; Chen, T.; Hui, X.; Wu, H.; Li, G.; Lin, L. Knowledge Graph Transfer Network for Few-Shot Recognition. Proc. AAAI Conf. Artif. Intell. 2020, 34, 10575–10582. [Google Scholar] [CrossRef]
  41. Lee, C.W.; Fang, W.; Yeh, C.K.; Wang, Y.C.F. Multi-label Zero-Shot Learning with Structured Knowledge Graphs. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar] [CrossRef]
  42. Hu, H.; Gu, J.; Zhang, Z.; Dai, J.; Wei, Y. Relation Networks for Object Detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar] [CrossRef]
  43. Xu, H.; Jiang, C.; Liang, X.; Li, Z. Spatial-Aware Graph Relation Network for Large-Scale Object Detection. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar] [CrossRef]
  44. Marino, K.; Salakhutdinov, R.; Gupta, A. The More You Know: Using Knowledge Graphs for Image Classification. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar] [CrossRef]
  45. Mou, L.; Hua, Y.; Zhu, X.X. A Relation-Augmented Fully Convolutional Network for Semantic Segmentation in Aerial Scenes. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar] [CrossRef]
  46. Zhu, C.; Chen, F.; Ahmed, U.; Shen, Z.; Savvides, M. Semantic Relation Reasoning for Shot-Stable Few-Shot Object Detection. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021. [Google Scholar] [CrossRef]
  47. Gu, X.; Lin, T.Y.; Kuo, W.; Cui, Y. Zero-shot detection via vision and language knowledge distillation. arXiv 2021, arXiv:2104.13921. [Google Scholar] [CrossRef]
  48. Xu, H.; Fang, L.; Liang, X.; Kang, W.; Li, Z. Universal-RCNN: Universal Object Detector via Transferable Graph R-CNN. Proc. AAAI Conf. Artif. Intell. 2020, 34, 12492–12499. [Google Scholar] [CrossRef]
  49. Zhang, S.; Song, F.; Lei, T.; Jiang, P.; Liu, G. MKLM: A multiknowledge learning module for object detection in remote sensing images. Int. J. Remote Sens. 2022, 43, 2244–2267. [Google Scholar] [CrossRef]
  50. Kim, G.; Jung, H.G.; Lee, S.W. Few-Shot Object Detection via Knowledge Transfer. In Proceedings of the 2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Toronto, ON, Canada, 11–14 October 2020. [Google Scholar] [CrossRef]
  51. Shu, X.; Liu, R.; Xu, J. A Semantic Relation Graph Reasoning Network for Object Detection. In Proceedings of the 2021 IEEE 10th Data Driven Control and Learning Systems Conference (DDCLS), Suzhou, China, 14–16 May 2021; pp. 1309–1314. [Google Scholar] [CrossRef]
  52. Chen, W.; Xiong, W.; Yan, X.; Wang, W.Y. Variational Knowledge Graph Reasoning. arXiv 2018, arXiv:1803.06581. [Google Scholar] [CrossRef]
  53. Li, A.; Luo, T.; Lu, Z.; Xiang, T.; Wang, L. Large-Scale Few-Shot Learning: Knowledge Transfer With Class Hierarchy. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar] [CrossRef]
  54. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv 2014, arXiv:1412.3555. [Google Scholar] [CrossRef]
  55. Pennington, J.; Socher, R.; Manning, C. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543. [Google Scholar] [CrossRef]
  56. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Curran Associates, Inc.: New York, NY, USA, 2019; Volume 32. [Google Scholar]
  57. Zhao, Z.; Tang, P.; Zhao, L.; Zhang, Z. Few-Shot Object Detection of Remote Sensing Images via Two-Stage Fine-Tuning. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  58. Zhang, Z.; Hao, J.; Pan, C.; Ji, G. Oriented Feature Augmentation for Few-Shot Object Detection in Remote Sensing Images. In Proceedings of the 2021 IEEE International Conference on Computer Science, Electronic Information Engineering and Intelligent Control Technology (CEI), Fuzhou, China, 24–26 September 2021. [Google Scholar] [CrossRef]
  59. Hoiem, D.; Chodpathumwan, Y.; Dai, Q. Diagnosing Error in Object Detectors. In Computer Vision – ECCV 2012; Springer: Berlin/Heidelberg, Germany, 2012; pp. 340–353. [Google Scholar] [CrossRef]
Figure 1. Example of common sense knowledge reasoning. After acquiring certain common sense knowledge, people can infer other object classes by identifying a related object; for example, an airplane can be inferred from an airport and a ship from a harbor.
Figure 2. The illustration of the GGNN.
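For readers unfamiliar with gated graph propagation, the following is a minimal, illustrative PyTorch sketch of GGNN-style message passing over region nodes. The hidden size, number of propagation steps, and adjacency normalization are assumptions made for illustration, not the exact configuration used in TSF-RGR.

```python
# Minimal GGNN-style propagation sketch (illustrative only).
import torch
import torch.nn as nn

class GGNNSketch(nn.Module):
    def __init__(self, hidden_dim: int = 256, steps: int = 2):
        super().__init__()
        self.steps = steps                       # cf. the "cycle times" studied in Table 8
        self.msg = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.gru = nn.GRUCell(hidden_dim, hidden_dim)   # gated update, cf. GRU [54]

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h:   (N, hidden_dim) node states, e.g., region-proposal features
        # adj: (N, N) row-normalized relation weights between regions
        for _ in range(self.steps):
            m = adj @ self.msg(h)                # aggregate messages from neighboring nodes
            h = self.gru(m, h)                   # gated update of node states
        return h

# Toy usage: 8 region nodes with 256-dim states and a row-normalized relation matrix.
h = GGNNSketch()(torch.randn(8, 256), torch.softmax(torch.randn(8, 8), dim=-1))
```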
Figure 3. Overview architecture of the TSF-RGR. After the RPN generates a large number of region proposals, different graph structures are constructed under the guidance of textual semantic information to learn the semantic and spatial relationships between regions. A joint relation reasoning module is then applied to aggregate the visual and relational features of the regions, improving few-shot object detection performance.
Figure 4. Detailed flowchart of the relation graph learning.
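As a rough companion to Figure 4, the sketch below shows one simple way to instantiate the two relation graphs over region proposals: a semantic graph from pairwise similarity of text embeddings (e.g., GloVe vectors [55]) and a spatial graph from box-center distances. It is only an illustrative stand-in; the paper's graphs are built from a corpus of attribute and relation descriptions, and the similarity measures and normalization used here are assumptions.

```python
# Illustrative construction of semantic and spatial relation graphs over regions.
import torch
import torch.nn.functional as F

def semantic_adjacency(word_vecs: torch.Tensor) -> torch.Tensor:
    # word_vecs: (N, d) text embedding assigned to each region (assumed precomputed)
    sim = F.cosine_similarity(word_vecs.unsqueeze(1), word_vecs.unsqueeze(0), dim=-1)
    return torch.softmax(sim, dim=-1)            # row-normalized semantic relations

def spatial_adjacency(boxes: torch.Tensor, sigma: float = 0.5) -> torch.Tensor:
    # boxes: (N, 4) region proposals in (x1, y1, x2, y2) format
    centers = torch.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                           (boxes[:, 1] + boxes[:, 3]) / 2], dim=-1)
    dist = torch.cdist(centers, centers)         # pairwise center distances
    dist = dist / (dist.max() + 1e-6)            # scale-normalize to [0, 1]
    return torch.softmax(-dist / sigma, dim=-1)  # nearer regions receive larger weights
```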
Figure 5. Visualization results of TSF-RGR on NWPU VHR-10. Note that instances of different classes are marked with different colors.
Figure 6. Visualization results of TSF-RGR on DIOR. Note that instances of different classes are marked with different colors.
Figure 7. Analysis of the top-ranked false positives. Detectors are trained with 10 shots. Cor indicates the proportion of correct detections; the remainder are false detections, comprising confusion with background (BG), confusion with similar instances (Sim), confusion with other NWPU VHR-10 instances (Oth), and poor localization (Loc).
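The error categories in Figure 7 follow the diagnostic protocol of Hoiem et al. [59]. The snippet below is a rough Python sketch of how a single detection can be binned; the 0.5/0.1 IoU thresholds are taken from the commonly used form of that protocol, and the grouping of "similar" classes is dataset-specific and assumed here.

```python
# Rough sketch of Hoiem-style error categorization [59] behind Figure 7.
def categorize_detection(iou_same: float, iou_sim: float, iou_other: float) -> str:
    """iou_*: best IoU of this detection with ground-truth boxes of the predicted
    class, of classes deemed similar to it, and of all remaining classes."""
    if iou_same >= 0.5:
        return "Cor"   # correct detection
    if iou_same >= 0.1:
        return "Loc"   # right class, poor localization
    if iou_sim >= 0.1:
        return "Sim"   # confusion with a similar class
    if iou_other >= 0.1:
        return "Oth"   # confusion with another dataset class
    return "BG"        # fires on background
```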
Figure 8. Sensitivity and impact of BBox Area on NWPU VHR-10. The horizontal axis indicates the BBox Area: extra small (XS), small (S), medium (M), large (L), and extra large (XL). The vertical axis indicates $AP_N$ [59]. The red line represents $AP_N$ with standard error bars. The black dashed line indicates the overall $AP_N$ for each class.
Table 1. Comparison of the experimental results of novel classes on NWPU VHR-10. Bold denotes the best performance.

Method             | Shot | Airplane | Baseball Diamond | Tennis Court | mAP
Faster R-CNN [11]  | 3    | 0.09     | 0.19             | 0.12         | 0.13
                   | 5    | 0.19     | 0.23             | 0.17         | 0.20
                   | 10   | 0.20     | 0.35             | 0.17         | 0.24
FSODM [34]         | 3    | 0.15     | 0.57             | 0.25         | 0.32
                   | 5    | 0.58     | 0.84             | 0.16         | 0.53
                   | 10   | 0.60     | 0.88             | 0.48         | 0.65
PAMS-Det [57]      | 3    | 0.21     | 0.76             | 0.16         | 0.37
                   | 5    | 0.55     | 0.88             | 0.20         | 0.55
                   | 10   | 0.61     | 0.88             | 0.50         | 0.66
OAF [58]           | 3    | 0.18     | 0.87             | 0.24         | 0.43
                   | 5    | 0.28     | 0.89             | 0.65         | 0.60
                   | 10   | 0.43     | 0.89             | 0.68         | 0.67
CIR-FSD [37]       | 3    | 0.52     | 0.79             | 0.31         | 0.54
                   | 5    | 0.67     | 0.88             | 0.37         | 0.64
                   | 10   | 0.71     | 0.88             | 0.53         | 0.70
SAM-BFS [36]       | 3    | -        | -                | -            | 0.47
                   | 5    | -        | -                | -            | 0.62
                   | 10   | -        | -                | -            | 0.75
TSF-RGR (ours)     | 3    | 0.57     | 0.79             | 0.35         | 0.57
                   | 5    | 0.71     | 0.85             | 0.42         | 0.66
                   | 10   | 0.78     | 0.94             | 0.60         | 0.77
Table 2. Comparison of the experimental results of novel classes on DIOR. Bold denotes the best performance.

Method             | Shot | Airplane | Baseball Field | Tennis Court | Train Station | Wind Mill | mAP
Faster R-CNN [11]  | 5    | 0.03     | 0.09           | 0.12         | 0.01          | 0.01      | 0.05
                   | 10   | 0.09     | 0.31           | 0.13         | 0.02          | 0.12      | 0.13
                   | 20   | 0.09     | 0.35           | 0.21         | 0.04          | 0.21      | 0.18
FSODM [34]         | 5    | 0.09     | 0.27           | 0.57         | 0.11          | 0.19      | 0.25
                   | 10   | 0.16     | 0.46           | 0.60         | 0.14          | 0.24      | 0.32
                   | 20   | 0.22     | 0.50           | 0.66         | 0.16          | 0.29      | 0.36
PAMS-Det [57]      | 5    | 0.14     | 0.54           | 0.24         | 0.17          | 0.31      | 0.28
                   | 10   | 0.17     | 0.55           | 0.41         | 0.17          | 0.34      | 0.33
                   | 20   | 0.25     | 0.58           | 0.50         | 0.23          | 0.36      | 0.38
OAF [58]           | 5    | 0.26     | 0.60           | 0.69         | 0.25          | 0.09      | 0.38
                   | 10   | 0.30     | 0.63           | 0.69         | 0.31          | 0.11      | 0.41
                   | 20   | -        | -              | -            | -             | -         | -
CIR-FSD [37]       | 5    | 0.20     | 0.50           | 0.50         | 0.24          | 0.20      | 0.33
                   | 10   | 0.20     | 0.55           | 0.50         | 0.23          | 0.36      | 0.38
                   | 20   | 0.27     | 0.62           | 0.55         | 0.28          | 0.37      | 0.43
SAM-BFS [36]       | 5    | -        | -              | -            | -             | -         | 0.38
                   | 10   | -        | -              | -            | -             | -         | 0.47
                   | 20   | -        | -              | -            | -             | -         | 0.51
TSF-RGR (ours)     | 5    | 0.58     | 0.79           | 0.58         | 0.11          | 0.06      | 0.42
                   | 10   | 0.72     | 0.79           | 0.67         | 0.08          | 0.19      | 0.49
                   | 20   | 0.81     | 0.82           | 0.78         | 0.15          | 0.18      | 0.54
Table 3. Comparison of the experimental results of base classes on NWPU VHR-10. Bold denotes the best performance.

Class              | Faster R-CNN [11] | FSODM [34] | PAMS-Det [57] | CIR-FSD [37] | TSF-RGR (Ours)
basketball court   | 0.56              | 0.72       | 0.90          | 0.91         | 0.96
bridge             | 0.57              | 0.76       | 0.80          | 0.97         | 0.84
ground track field | 1.00              | 0.91       | 0.99          | 0.99         | 0.96
harbor             | 0.66              | 0.87       | 0.84          | 0.80         | 0.92
ship               | 0.88              | 0.72       | 0.88          | 0.91         | 0.94
storage tank       | 0.49              | 0.71       | 0.89          | 0.88         | 0.93
vehicle            | 0.74              | 0.76       | 0.89          | 0.89         | 0.92
mAP                | 0.70              | 0.87       | 0.88          | 0.89         | 0.92
Table 4. Comparison of the experimental results of base classes on DIOR. Bold denotes the best performance.

Class                    | Faster R-CNN [11] | FSODM [34] | PAMS-Det [57] | CIR-FSD [37] | TSF-RGR (Ours)
airport                  | 0.73              | 0.63       | 0.78          | 0.87         | 0.86
basketball court         | 0.69              | 0.80       | 0.79          | 0.88         | 0.90
bridge                   | 0.26              | 0.32       | 0.52          | 0.55         | 0.52
chimney                  | 0.72              | 0.72       | 0.69          | 0.79         | 0.79
dam                      | 0.57              | 0.45       | 0.55          | 0.72         | 0.70
expressway service area  | 0.59              | 0.63       | 0.67          | 0.86         | 0.88
expressway toll station  | 0.45              | 0.60       | 0.62          | 0.78         | 0.79
golf course              | 0.68              | 0.61       | 0.81          | 0.84         | 0.84
ground track field       | 0.65              | 0.61       | 0.78          | 0.83         | 0.84
harbor                   | 0.31              | 0.43       | 0.50          | 0.57         | 0.62
overpass                 | 0.45              | 0.46       | 0.51          | 0.64         | 0.65
ship                     | 0.10              | 0.50       | 0.67          | 0.72         | 0.89
stadium                  | 0.67              | 0.45       | 0.76          | 0.77         | 0.79
storage tank             | 0.24              | 0.43       | 0.57          | 0.70         | 0.79
vehicle                  | 0.19              | 0.39       | 0.54          | 0.56         | 0.55
mAP                      | 0.48              | 0.54       | 0.65          | 0.74         | 0.76
Table 5. Ablation study on NWPU VHR-10. Bold denotes the best performance.

Baseline | SemR | SpaR | mAP (3-Shot) | mAP (5-Shot) | mAP (10-Shot)
✓        | -    | -    | 0.42         | 0.56         | 0.69
✓        | ✓    | -    | 0.53         | 0.64         | 0.74
✓        | -    | ✓    | 0.46         | 0.57         | 0.72
✓        | ✓    | ✓    | 0.57         | 0.66         | 0.77
Table 6. Ablation study on DIOR. Bold denotes the best performance.

Baseline | SemR | SpaR | mAP (5-Shot) | mAP (10-Shot) | mAP (20-Shot)
✓        | -    | -    | 0.37         | 0.44          | 0.47
✓        | ✓    | -    | 0.41         | 0.46          | 0.52
✓        | -    | ✓    | 0.40         | 0.47          | 0.51
✓        | ✓    | ✓    | 0.42         | 0.49          | 0.54
Table 7. The performance on NWPU VHR-10 using different ensemble methods. Bold denotes the best performance.

Ensemble Method | mAP (3-Shot) | mAP (5-Shot) | mAP (10-Shot)
Concat          | 0.46         | 0.56         | 0.71
Add             | 0.45         | 0.57         | 0.74
JRR             | 0.57         | 0.66         | 0.77
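To make the comparison in Table 7 concrete, the sketch below contrasts the two naive fusions (Concat, Add) with a learned gated fusion. The gate is only a stand-in for the joint relation reasoning (JRR) module: it illustrates why a learned weighting of visual and relation features can outperform fixed combinations, and does not reproduce the paper's exact design.

```python
# Sketch of the three fusion options compared in Table 7.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Stand-in for a learned joint fusion of visual and relation features."""
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, visual: torch.Tensor, relation: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([visual, relation], dim=-1))  # per-channel trust in each source
        return g * visual + (1 - g) * relation

def fuse(visual, relation, mode="jrr", jrr=None):
    if mode == "concat":
        return torch.cat([visual, relation], dim=-1)  # "Concat" row: fixed concatenation
    if mode == "add":
        return visual + relation                      # "Add" row: fixed element-wise sum
    return jrr(visual, relation)                      # "JRR" row: learned fusion stand-in
```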
Table 8. The effect of cycle times on NWPU VHR-10. Bold denotes the best performance.

Cycle Times | Inference Time | mAP (3-Shot) | mAP (5-Shot) | mAP (10-Shot)
2           | 0.34           | 0.57         | 0.66         | 0.77
3           | 0.34           | 0.49         | 0.59         | 0.73
4           | 0.35           | 0.46         | 0.55         | 0.73
Table 9. Comparison of the model parameters and FLOPs on DIOR.

Method          | FLOPs (G) | Params (M) | Inference Time (s)
Baseline        | 38.94     | 14.60      | 0.38
TSF-RGR (ours)  | 42.96     | 20.91      | 0.42
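For reference, the "Params (M)" column in Table 9 can be reproduced for any PyTorch [56] model with the short helper below; FLOPs additionally require a profiler that traces a forward pass at the evaluation resolution, which is not shown here. The backbone in the usage comment is only a stand-in, not the TSF-RGR detector itself.

```python
import torch

def count_params_millions(model: torch.nn.Module) -> float:
    # Sum over all registered parameters of the model, reported in millions.
    return sum(p.numel() for p in model.parameters()) / 1e6

# Example with a stand-in backbone:
# import torchvision
# print(count_params_millions(torchvision.models.resnet50()))  # roughly 25.6 M
```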
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
