Peer-Review Record

MSCNet: A Multilevel Stacked Context Network for Oriented Object Detection in Optical Remote Sensing Images

Remote Sens. 2022, 14(20), 5066; https://doi.org/10.3390/rs14205066
by Rui Zhang 1,2, Xinxin Zhang 1,2,*, Yuchao Zheng 1,2, Dahan Wang 1,2 and Lizhong Hua 1,2
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 9 August 2022 / Revised: 28 September 2022 / Accepted: 3 October 2022 / Published: 11 October 2022

Round 1

Reviewer 1 Report

Review of paper 1883129 “MSCNet: A Multi-Level Stacked Context Network for Oriented Object Detection in Optical Remote Sensing Images.”

The authors propose a new network to enhance target detection accuracy, taking into account the semantic relationship between different objects and their context. They also suggest using a Gaussian distribution to represent the bounding boxes traditionally used in object recognition tasks.

Looking at the numbers in your results, there is a slight improvement in comparison with existing methods: gains of 1.02 mAP on the DOTA database and 0.22 AP on HRSC2016, which could be very important in some applications. To improve the paper, I propose the points explained below.

As you are using a hierarchical approach, please compare your work with similar methods, for example “Generalized Zero-Shot Vehicle Detection in Remote Sensing Imagery via Coarse-to-Fine Framework”, Hong Chen et al., IJCAI 2019.

As the whole model seems to be very complex, please introduce some measure of the computation time for processing an image, so that the reader can judge whether your method is worth using, given an mAP better by only 0.15 relative to the base approach.

Explain Figures 5 and 7 better, so that the reader can see what is a correct detection and what is an error in these images. Also, in Table 1 there are columns where your proposed net (MSCNet) does not obtain the best numbers, for example column HC. Please highlight in bold the best numbers, even if the model is not yours.

Also, highlight the importance of the GWD loss function, as it is one of your principal contributions, and add a graph showing its convergence when running your approach. A better description of the proposed loss function may also help.

Rename the titles of your sections according to their content, for example “3. Experiments and analysis” and “4. Experiments”.

Define all abbreviations that you use, for example, RPN, etc.

Several writing errors make it difficult to understand the main ideas of your article. Please correct them; some of them are listed below. Please also check the use of the articles “the” and “a” in English.

 

Andthen; is imbedded to the; definition of five; angleparameter; evaluate proposed; be carefully adjust; sematic information; coefficients us used; of proposing rotating proposals; object-ness; different angels as; 2806 aerial image; and it reduce; , SH and HA gets an; and 0.69 AP comparted;…

Author Response

Dear Reviewer:

Thank you for your letter and for the reviewers’ comments concerning our manuscript entitled “MSCNet: A Multi-Level Stacked Context Network for Oriented Object Detection in Optical Remote Sensing Images” (remotesensing-1883129). The comments are all valuable and very helpful for revising and improving our paper, and they provide important guidance for our research. We have studied the comments carefully and made corrections that we hope will meet with your approval. The revisions are marked on the paper.

Author Response File: Author Response.docx

Reviewer 2 Report

1. Your statement needs clarification: “it is difficult to model the relationship with convolution between long distance”.

-> Express it more simply (what is “convolution between long distance”?).

-> What about the stride parameter in a CNN layer? It can control the size of the local neighborhood.

-> This is exactly a reason for using multi-layer networks: every next layer observes not only higher semantics but also more global relations.

-> For feature extraction and region proposal you use a Faster R-CNN-like network (the R3Det?) as a baseline. How, then, do you want to improve the “convolution between” far-distant objects?

 

2. The Introduction should be completely rewritten. In its current form it looks like an extended abstract or a communication manuscript. The Introduction should concentrate on the problem description and the motivation for your work, give general references to state-of-the-art methods, and finally explain what you want to improve in existing solutions and how.

 

3. It is strange to place Figure 1, containing some anonymous solution structure, in the Introduction; nothing is explained there. The structure is a pipeline of three blocks. Apparently it represents a Faster R-CNN-like network (the R3Det?), but this is not explained sufficiently clearly. Every C2-C5 layer should feed the corresponding F2-F5 layer, but there is only one connection, C5-F5, which is called MSSC. The first block is an RPN implemented with the FPN technique, the second block is some selection of RoIs, and the third is a classification and regression network. Figure 1 requires a better explanation if it is already placed in the Introduction.

 

4. For example, you say “a multi-level stacked semantic capture module (MSSC Module) is imbedded to the network”, but there is only one level (C5-F5) connected by MSSC, not C2-C5 with F2-F5?

 

5. You have not explained the terms “horizontal object detection task” and “rotation target object detection task” before their first use.

 

6. You place on the same level the “OpenCV representation and long-side representation”, although the first is a library with many data representations while the second is an undefined abstract term. Again, it is unusual, and should be avoided, to give technical details, like unexplained mathematical symbols, in the Introduction.

 

7. The following is unclear: “the length and width exchange and angle discontinuity will be caused in the process of bounding box regression, andthen the discontinuity and quadrangle problem will be caused [...]”
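The width/height exchange mentioned in the quoted sentence can be made concrete. Below is a minimal sketch (not the authors' code) of one common "long-side" normalization for a five-parameter oriented box; the paper's exact angle convention may differ:

```python
import math

def to_long_side(cx, cy, w, h, theta_deg):
    """Normalize an oriented box so that w is always the longer side
    and theta_deg is the angle of that side, folded into [-90, 90).

    Two parameterizations of the same rectangle, e.g. (w=2, h=4, theta=-30)
    and (w=4, h=2, theta=60), map to a single canonical form, which is
    what removes the width/height-exchange ambiguity during regression."""
    if w < h:
        w, h = h, w
        theta_deg += 90.0
    theta_deg = (theta_deg + 90.0) % 180.0 - 90.0  # fold into [-90, 90)
    return cx, cy, w, h, theta_deg
```

For instance, `to_long_side(0, 0, 2, 4, -30)` and `to_long_side(0, 0, 4, 2, 60)` both yield `(0, 0, 4, 2, 60.0)`, illustrating why an unnormalized regression target is discontinuous at the swap point.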

 

8. The abbreviations RPN and FPN are understood in the context of semantic image segmentation technology as Region Proposal Network and Feature Pyramid Network, but both abbreviations should first be given in full in the Introduction.

 

9. Can you clarify the statement “location of the intersection of the vertices of the oriented bounding box and the minimum enclosing rectangle”? Is an oriented bounding box a rotated minimum enclosing rectangle? What is “the intersection of vertices”, and how can points intersect each other? Generally, is it important to explain this representation of the oriented bounding box (by six parameters, including two for rotation) rather than the typically used one?

 

10. The purpose of the MSSC module is unclear to me. You add an additional convolutional layer with several fixed kernels of size 3x3 that perform a dilation operation. Why should this be particularly helpful in oriented object detection? The results presented in the experimental section show only marginal improvement under comparable conditions (the same input image resolution).
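If the fixed 3x3 kernels mentioned above are in fact dilated (atrous) convolutions (an assumption here; the exact rates are what question 11 asks about), stacking them grows the receptive field quickly, which is one standard way to capture longer-range context. A minimal sketch of the receptive-field arithmetic:

```python
def stacked_receptive_field(kernel=3, dilations=(1, 2, 4)):
    """Receptive field (stride 1) of a stack of dilated convolutions:
    each layer with dilation d widens the field by (kernel - 1) * d."""
    rf = 1
    for d in dilations:
        rf += (kernel - 1) * d
    return rf

# Three plain 3x3 layers see a 7x7 window; with dilations 1, 2, 4
# the same three layers see 15x15, i.e. far more context per layer.
```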

 

11. What are the “Rates” (Figure 2), also called “scale coefficients” in the text? Do you mean “stride”?

 

12. Is there any difference between the FPN network with “Adaptive RoI Assignment” and the implementation of region proposal in the baseline method (R3Det?)?

 

13. In Table 1, please add the resolution of the input images for every object detector. There is an improvement over your baseline detector, but you never mention the name of your baseline; possibly I have missed it (R3Det). Please provide detailed information on the differences between R3Det and your solution. Also give a convincing explanation of why your new module is responsible for this gain, rather than a higher input image resolution (if any) or a different backbone network (R101 vs. R50).

 

14. There is confusion resulting from Table 2. You say that the improvement is 0.69%, but the result of the apparent baseline R3Det (as shown in Table 2) is lower by 1.39%. The Oriented R-CNN seems to perform better than your baseline and only 0.15% lower than yours. But there are further improvements in this line, Fast- and Faster-RCNN, which should probably give even better results?

 

15. The ablation study now takes Faster R-CNN as a baseline and not the oriented R3Det. What is the reason for this?

 

16. Was the result for Faster R-CNN given in Table 3 achieved for oriented object detection? It is strange that it performs worse than the R-CNN in Table 2.

 

 

Author Response

Dear Reviewer:

Thank you for your letter and for the reviewers’ comments concerning our manuscript entitled “MSCNet: A Multi-Level Stacked Context Network for Oriented Object Detection in Optical Remote Sensing Images” (remotesensing-1883129). The comments are all valuable and very helpful for revising and improving our paper, and they provide important guidance for our research. We have studied the comments carefully and made corrections that we hope will meet with your approval. The revisions are marked on the paper.

Author Response File: Author Response.docx

Reviewer 3 Report

1. Language: The language used in the paper is understandable; however, there are some grammatical errors and awkward sentences. It is recommended to improve the quality of the language. Further, it is recommended to fully define all acronyms/abbreviations at first use.

2. Introduction: One of the main topics (if not the main topic) of the paper is how to exploit features at multiple levels. However, the Introduction lacks references to, and a description of, previous work on this topic. Even the basic concept of Feature Pyramid Networks is suddenly mentioned without a reference or description.

The Introduction devotes two subsections to anchor-free and anchor-based object detection methods. It is not clear how this description relates to the specific topics treated in this paper.

3. Description of the methods: The greatest weakness of this paper lies in the description of the methods. The descriptions are sometimes vague, and details are hard to follow. More specific comments related to this issue are:

a. In Figures 1-4, label what the different entities represent directly in the figure, and refer to these labels from the text or caption. There are many poorly labeled or unlabeled entities and undefined acronyms in these figures. For example, in Figure 2, what is the input, what is the output, and what is meant by GAP?

b. Consider adding a figure to support the description of the "RPN head with oriented bounding box" in Subsection 3.2, and relate Equation (1) to this figure.

c. A new loss function (GWD loss) is proposed as a solution to the weakness of the IoU loss applied to oriented object detection. However, an explanation of how this new loss (GWD) avoids the weaknesses of IoU is missing. It is recommended to explain this or give a reference with more details.

d. In lines 185-187, t_i and t_i* are defined. It is, however, unclear how a 5-D vector is transformed into a 2-D Gaussian distribution; taken literally, the sentence is meaningless. It is recommended to explain this in more detail or give a reference that does.

e. In Figure 4, what is C? Is an arrow missing to the pink boxes? What happens in the layer before the last layer (i.e., where two arrows meet)?
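Regarding points 3c and 3d, the box-to-Gaussian transformation and the GWD can be written down compactly. The sketch below (pure Python, not the authors' code) converts a five-parameter box (cx, cy, w, h, theta) into a 2-D Gaussian with mean (cx, cy) and covariance R diag((w/2)^2, (h/2)^2) R^T, then evaluates the squared 2-Wasserstein distance between two such Gaussians:

```python
import math

def box_to_gaussian(cx, cy, w, h, theta):
    """Oriented box (angle in radians) -> 2-D Gaussian:
    mean mu = (cx, cy), covariance Sigma = R diag((w/2)^2, (h/2)^2) R^T."""
    c, s = math.cos(theta), math.sin(theta)
    a, b = (w / 2.0) ** 2, (h / 2.0) ** 2
    return (cx, cy), [[c * c * a + s * s * b, c * s * (a - b)],
                      [c * s * (a - b), s * s * a + c * c * b]]

def sqrtm2(m):
    """Principal square root of a symmetric positive-definite 2x2 matrix:
    sqrt(M) = (M + sqrt(det M) I) / sqrt(tr M + 2 sqrt(det M))."""
    s = math.sqrt(m[0][0] * m[1][1] - m[0][1] * m[1][0])
    t = math.sqrt(m[0][0] + m[1][1] + 2.0 * s)
    return [[(m[0][0] + s) / t, m[0][1] / t],
            [m[1][0] / t, (m[1][1] + s) / t]]

def matmul2(a, b):
    return [[a[0][0]*b[0][0] + a[0][1]*b[1][0], a[0][0]*b[0][1] + a[0][1]*b[1][1]],
            [a[1][0]*b[0][0] + a[1][1]*b[1][0], a[1][0]*b[0][1] + a[1][1]*b[1][1]]]

def gwd2(box1, box2):
    """Squared 2-Wasserstein distance between the Gaussians of two boxes:
    ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1^1/2 S2 S1^1/2)^1/2)."""
    (mu1, s1), (mu2, s2) = box_to_gaussian(*box1), box_to_gaussian(*box2)
    r1 = sqrtm2(s1)
    cross = sqrtm2(matmul2(matmul2(r1, s2), r1))
    loc = (mu1[0] - mu2[0]) ** 2 + (mu1[1] - mu2[1]) ** 2
    tr = s1[0][0] + s1[1][1] + s2[0][0] + s2[1][1] - 2.0 * (cross[0][0] + cross[1][1])
    return loc + tr
```

Because a square rotated by 90 degrees yields the same Gaussian, the distance vanishes exactly where a parameter-wise loss would report a large angle error, which is the usual argument for GWD over IoU-based losses on oriented boxes.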

 

4. Results: Clearly define what the baseline is in each of the experiments, and mark the baseline in Tables 1 and 2. Indicate how statistically significant the results are (i.e., is each experiment conducted only once, or several times?).

5. Conclusions: What is the shortcoming of MSSC and ARA used together?

Author Response

Dear Reviewer:

Thank you for your letter and for the reviewers’ comments concerning our manuscript entitled “MSCNet: A Multi-Level Stacked Context Network for Oriented Object Detection in Optical Remote Sensing Images” (remotesensing-1883129). The comments are all valuable and very helpful for revising and improving our paper, and they provide important guidance for our research. We have studied the comments carefully and made corrections that we hope will meet with your approval. The revisions are marked on the paper.

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report

This is the second version of a paper already submitted for review to the journal Remote Sensing.

I have read the new version of the article, and I observe that some of the suggestions I made have been incorporated into the writing.

Nonetheless, I suggest reading the article carefully and looking for poorly written sentences; for example, “we use the RPN of proposing rotating proposal.”

Please correct them. The caption of  Fig. 1 can be improved.

Also, the argument about the superiority of the GWD loss function compared with SmoothL1 looks rather subjective. Is it possible to support it with some numbers?
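One cheap way to put numbers on this comparison, under the usual argument for GWD: a parameter-wise SmoothL1 loss penalizes an angle residual even when the two boxes cover the same region. A hypothetical illustration (not taken from the paper):

```python
import math

def smooth_l1(x, beta=1.0):
    """Standard SmoothL1 (Huber) penalty on a single residual."""
    return 0.5 * x * x / beta if abs(x) < beta else abs(x) - 0.5 * beta

# A square box rotated by 90 degrees describes exactly the same region,
# yet an element-wise loss on the angle parameter reports a large error:
angle_penalty = smooth_l1(math.pi / 2)  # ~1.07 for physically identical boxes
```

Pairing such a case with the corresponding GWD value (which is zero for identical regions) would make the superiority claim quantitative rather than subjective.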

The number of writing errors has decreased, yet there are several typos in the article; here are two. Please correct them.

images. additionally

 

in long-side representation represent

Author Response

Dear Editors and Reviewers:

Thank you for your letter and for the reviewers’ comments concerning our manuscript entitled “MSCNet: A Multilevel Stacked Context Network for Oriented Object Detection in Optical Remote Sensing Images” (remotesensing-1883129). The comments are all valuable and very helpful for revising and improving our paper, and they provide important guidance for our research. We have studied the comments carefully and made corrections that we hope will meet with your approval. The revisions are marked on the paper.

Author Response File: Author Response.docx

Reviewer 2 Report

Thank you for your response. The approach is now clearly presented. I can recommend the publication of your revised manuscript.

Author Response

Dear Editors and Reviewers:

Thank you for your letter recommending our manuscript entitled “MSCNet: A Multilevel Stacked Context Network for Oriented Object Detection in Optical Remote Sensing Images” (remotesensing-1883129).
