Article
Peer-Review Record

Faster and Lightweight: An Improved YOLOv5 Object Detector for Remote Sensing Images

Remote Sens. 2023, 15(20), 4974; https://doi.org/10.3390/rs15204974
by Jiarui Zhang, Zhihua Chen *, Guoxu Yan, Yi Wang and Bo Hu
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Submission received: 25 August 2023 / Revised: 8 October 2023 / Accepted: 12 October 2023 / Published: 15 October 2023

Round 1

Reviewer 1 Report

This paper proposes an improved YOLOv5 object detector. The reviewer's comments are as follows.

1)      In the Introduction section, for the statement "Within computer vision, tasks can be classified into image classification [1], object detection [2], and image segmentation [3]", please provide references such as:

[1] Hyperspectral image classification using a spectral–spatial random walker method. International Journal of Remote Sensing, 40(10), 3948-3967.

[2] A comprehensive survey of oriented object detection in remote sensing images. Expert Systems with Applications, 119960.

[3] Implicit Ray-Transformers for Multi-View Remote Sensing Image Segmentation. IEEE Transactions on Geoscience and Remote Sensing.

 

2)      The explanations of some compared methods, such as CANet [86], TRD [87], FSoD-Net [88], MDCT [89], CF2PN [90], and MSA-YOLO [91], have not been included in the related works.

 

3)      YOLOv5 applies non-maximum suppression (NMS) to remove redundant bounding box predictions and filter out low-confidence detections. The confidence threshold and NMS overlap threshold are important parameters to consider. If possible, please discuss them in your proposed method.

4)      If possible, please provide the website addresses of the datasets for readers.

5)      In the conclusion, please provide the limitations and future work of the proposed method.

 

Moderate editing of English language required

Author Response

Dear Reviewer:

Thank you for your comments concerning our manuscript entitled "Faster and Lightweight: An Improved YOLOv5 Object Detector for Remote Sensing Images". These comments are all valuable and extremely helpful for revising and improving our paper, and they provide important guidance for our research. We have studied the comments carefully and have made corrections that we hope meet with your approval. Revised portions are marked in yellow in the paper. The main corrections in the paper and the responses to the comments are as follows:

Reviewer 1:

  1. Response to comment: In the Introduction section, for the statement "Within computer vision, tasks can be classified into image classification [1], object detection [2], and image segmentation [3]", please provide references such as

Response: Thank you for your valuable suggestion. As you suggested, we have inserted the corresponding references at lines 51-52 of our manuscript.

  2. Response to comment: The explanation of some compared methods such as CANet [86], TRD [87], FSoD-Net [88], MDCT [89], CF2PN [90], MSA-YOLO [91] have not been inserted in the related works.

Response: Thank you for your valuable suggestion. We apologize for not providing an explanation of the comparative methods. The reason is that both the comparative methods provided and the method proposed in this paper are similar one-stage detection approaches, with the use of attention mechanisms in the comparative methods. Therefore, we have collectively explained these methods in lines 749-753 of the manuscript.

  3. Response to comment: YOLOv5 applies non-maximum suppression (NMS) to remove redundant bounding box predictions and filter out low-confidence detections. The confidence threshold and NMS overlap threshold are important parameters to consider. If it is possible, please explain about it in your proposed method.

Response: We appreciate your insightful suggestion regarding the explanation of the NMS method, which we consider a valuable complement to our article's content. We have therefore explained the threshold settings and the NMS method in lines 734-743 of the manuscript. Our experimental IoU threshold and confidence threshold were set to 0.6 and 0.25, respectively. Furthermore, we analyzed the reasons behind the lower detection accuracy observed under the specified threshold values.
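For readers unfamiliar with the procedure, greedy NMS with these thresholds can be sketched in plain Python as follows (a generic illustration of the standard algorithm, not our actual implementation; boxes are given as (x1, y1, x2, y2)):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(detections, conf_thresh=0.25, iou_thresh=0.6):
    """Greedy NMS: drop low-confidence boxes, then suppress overlaps."""
    # 1. Filter out detections below the confidence threshold.
    dets = [d for d in detections if d["score"] >= conf_thresh]
    # 2. Visit boxes in order of descending confidence; keep a box only
    #    if it does not overlap an already-kept box above the IoU threshold.
    dets.sort(key=lambda d: d["score"], reverse=True)
    keep = []
    for d in dets:
        if all(iou(d["box"], k["box"]) < iou_thresh for k in keep):
            keep.append(d)
    return keep
```

Raising the IoU threshold keeps more mutually overlapping boxes, while raising the confidence threshold trades recall for precision, which is why these two values matter for dense remote sensing scenes.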

  4. Response to comment: If it is possible, please provide the website address of datasets for readers.

Response: Thank you for your valuable suggestion. We apologize for not including the specific web address of the dataset in our manuscript. As the experimental dataset we used is a publicly available remote sensing image dataset, readers can easily access it through the relevant references [97], [98], [99] provided in our article or on platforms like GitHub or the internet.

  5. Response to comment: In the conclusion, please provide the limitations and future work of the proposed method.

Response: Thank you for your valuable suggestion. In accordance with your feedback, we have elucidated the limitations of our method in the current field of remote sensing object detection in lines 901-903 of the article. Furthermore, we have provided insights into the future direction of our work. Additionally, in lines 859-873, we have offered some guidance on the potential application of the proposed method at present.

Special thanks to you for your good comments.

Author Response File: Author Response.docx

Reviewer 2 Report

The manuscript entitled “Faster and Lightweight: An Improved YOLOv5 Object Detector for Remote Sensing Images” proposed a new model based on YOLOv5 that can improve object detection of remote sensing images. It is an interesting work, and the manuscript is well-organized and innovative. However, some issues still should be addressed.

1.       There are several existing schemes for remote sensing image object detection, but what is the gap? The gap needs to be clarified.

2.       It would be better to have a Graphical Abstract. A well-drawn graphical abstract promotes readers' understanding.

3.       It is suggested that the author should add a paragraph clarifying the structure of the manuscript at the end of the Introduction.

4.       Introduction must be enriched by recent published articles; I could recommend you to read and integrated the following articles. https://doi.org/10.3390/land12081602; https://doi.org/10.3390/land12040831; https://doi.org/10.1071/MF22167; https://doi.org/10.3390/rs14102385; https://doi.org/10.1016/j.eswa.2022.116793; https://doi.org/10.1016/j.isprsjprs.2021.12.004.

 

5.       Please clearly highlight how your work advances the field from the present state of knowledge, and provide a clear justification for your work. The impact or advancement of the work can also appear in the conclusion.

Author Response

Dear Reviewer:

Thank you for your comments concerning our manuscript entitled "Faster and Lightweight: An Improved YOLOv5 Object Detector for Remote Sensing Images". These comments are all valuable and extremely helpful for revising and improving our paper, and they provide important guidance for our research. We have studied the comments carefully and have made corrections that we hope meet with your approval. Revised portions are marked in yellow in the paper. The main corrections in the paper and the responses to the comments are as follows:

  • Response to comment: 1)There are some schemes for remote sensing image object detection algorithm, but what is the gap? The gap needs to be clarified.

Response: Thank you for your valuable suggestion. We apologize for not providing an explanation of the existing gaps in current remote sensing image object detection methods. In response to your suggestion, we have analyzed the issues associated with remote sensing object detection methods in lines 69-100 of the revised manuscript. Additionally, in lines 238-246, we have addressed the shortcomings of advanced remote sensing object detection methods introduced in the article, explaining the gaps that exist. The aim of this study is to achieve a delicate balance between lightweight design and detection performance. However, most of the current methods only focus on improving one aspect, limiting the overall progress. Therefore, our focus is to address the primary challenge of achieving lightweight effects while ensuring detection performance in remote sensing object detection methods.

  • Response to comment: 2)It would be better to have a Graphical Abstract. A well-drawn graphical abstract promotes readers' understanding.

Response: Thank you for your suggestion. Regarding the graphical abstract, Figure 2 in the manuscript, the structural diagram of the model, already helps readers understand the main work of the article. We therefore believe Figure 2 can serve as the graphical abstract and did not add a new one in the revised manuscript. If you feel Figure 2 is unsuitable for this purpose, we will draw a dedicated graphical abstract in a future draft.

  • Response to comment: 3)It is suggested that the author should add a paragraph clarifying the structure of the manuscript at the end of the Introduction.

Response: Thank you for your valuable suggestion. Following your suggestion, we have added a new paragraph on the structure of the article in lines 174-181 of the manuscript, so that readers can quickly get an overview of the content of each chapter.

  • Response to comment: 4)Introduction must be enriched by recent published articles;

Response: Thank you for your valuable suggestion. We deeply apologize for the insufficient utilization of current relevant literature in the introduction. With reference to the provided citations, we conducted extensive research and incorporated recent references on remote sensing imagery into the introduction. This is particularly evident in the analysis of challenges in remote sensing image object detection in lines 69-92 and the discussion on the YOLO model in lines 107-130.

  • Response to comment: 5)Please clearly highlight how your work advances the field from the present state of knowledge and you should provide a clear justification for your work. The impact or advancement of the work can also appear in the conclusion.

Response: Thank you for your valuable suggestion. We deeply regret our failure to adequately justify the innovative aspects of our proposed method in the field of remote sensing object detection in previous versions. Therefore, in the revised manuscript, we have provided a renewed explanation of the key improvement modules in our approach, and a brief description of our innovations is given in lines 160-176 and in the conclusion.

Regarding the LADH-Head detection head, we address the issue of conflicting regression and classification tasks that arise from the coupled head used in YOLOv5, leading to decreased detection accuracy. In lines 314-337, we explain why we replace the coupled head with a decoupled head. However, introducing a standard decoupled head significantly increases model parameters, as readers can understand from Table 7 in the ablation experiments in Chapter 6. To meet the requirements of real-time performance and lightweight design, we clarify the principle of using depthwise separable convolutions instead of standard convolutions in lines 351-369. Furthermore, we provide an explanation of how to replace the original standard convolutional units in the decoupled head in subsequent sections.
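As a generic illustration of why this substitution shrinks the head (our arithmetic with hypothetical channel counts, not figures from the paper): a standard k × k convolution with C_in input and C_out output channels stores k·k·C_in·C_out weights, while a depthwise separable replacement stores k·k·C_in depthwise weights plus C_in·C_out pointwise weights:

```python
def conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def dwsep_params(k, c_in, c_out):
    """Depthwise k x k filter per input channel, then a 1 x 1
    pointwise convolution that mixes channels."""
    return k * k * c_in + c_in * c_out

# Example: one 3x3 layer with 256 -> 256 channels.
std = conv_params(3, 256, 256)   # 589,824 weights
sep = dwsep_params(3, 256, 256)  #  67,840 weights, roughly 8.7x fewer
```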

For the C3 module employing attention mechanisms, the original feature extraction network in YOLOv5 struggles to achieve satisfactory detection performance for small objects with significant orientation and scale variations in complex backgrounds of remote sensing images. Hence, we consider incorporating attention mechanisms into the C3 module. In lines 390-405 of the revised manuscript, we elucidate the principle of channel attention mechanism and its drawbacks when applied to remote sensing image object detection. Consequently, we propose the use of coordinate attention mechanisms to enhance detection performance. Lines 406-460 explain the principles and advantages of the coordinate attention mechanism in detecting small objects in remote sensing images, providing justification for our work.
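The step that distinguishes coordinate attention from plain channel attention is pooling along each spatial axis separately, so position along height and width is preserved rather than collapsed into a single scalar as in SE-style global average pooling. A minimal single-channel sketch of that pooling step (illustrative only, not the paper's implementation):

```python
def coordinate_pools(feature):
    """Average-pool one H x W channel along width and along height.

    Returns (h_pool, w_pool): h_pool[i] summarizes row i and w_pool[j]
    summarizes column j. Coordinate attention feeds these 1-D,
    direction-aware descriptors through small convolutions to build
    attention weights; global average pooling would instead reduce the
    whole map to one number, discarding all positional information.
    """
    h = len(feature)
    w = len(feature[0])
    h_pool = [sum(row) / w for row in feature]
    w_pool = [sum(feature[i][j] for i in range(h)) / h for j in range(w)]
    return h_pool, w_pool
```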

As for FasterConv, it represents a novel approach in lightweight unit design. In lines 484-507, we explain the thought process and rationale behind the design of lightweight modules, analyzing the pros and cons of existing lightweight modules compared to our proposed approach. Since the design requirements of lightweight modules are universally applicable in the field of object detection, we did not elaborate on the development process specific to remote sensing in the paper. Our focus lies primarily in effectively integrating the lightweight design principles of FasterConv with the YOLOv5 model. Lines 508-537 delve into the analysis of the primary component unit, PConv, within FasterConv, explaining why PConv effectively reduces model parameters and GFLOPs. Lastly, lines 538-556 emphasize the architectural design of FasterConv in YOLOv5.
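The saving PConv offers can be quantified generically (our arithmetic, not numbers from the paper): convolving only a fraction r of the channels (commonly r = 1/4) and passing the remainder through untouched scales the convolution's multiply-accumulates by r² relative to a full convolution over the same channel count:

```python
def full_conv_flops(k, c, h, w):
    """Multiply-accumulates of a k x k convolution over c -> c channels
    on an h x w feature map (stride 1, padding preserved)."""
    return k * k * c * c * h * w

def pconv_flops(k, c, h, w, r=0.25):
    """PConv convolves only r*c channels; the rest are identity-passed."""
    cp = int(c * r)
    return k * k * cp * cp * h * w

full = full_conv_flops(3, 256, 40, 40)
part = pconv_flops(3, 256, 40, 40)
# With r = 1/4, PConv needs (1/4)^2 = 1/16 of the full conv's FLOPs.
assert part * 16 == full
```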

Special thanks for your kind review.

Reviewer 3 Report

Dear Authors,

 

Congratulations on your successful research and notable results. The manuscript presents a novel modification of the YOLOv5 algorithm for satellite and aerial images. This is an extremely important topic, as there are currently various remote sensing applications being developed that require near-real-time performance and accurate results. The manuscript is well-structured and easy to follow. Here are my suggestions for possible improvements:

 

1) It would be interesting to show the standard deviation (std) for a better understanding of the significance of the achieved model's improvement, because for different random seeds the models' performance might vary slightly and the local optimum might differ. I see that it might be computationally expensive to conduct additional training for all algorithms; therefore, I suggest computing the std just for your model and the original YOLOv5, if possible.

 

2) Table 4. I suggest transposing the table so that the methods are located in the first column, the same as in Table 3 (and all other tables).

 

3) Table 3. I suggest changing «mAP» in the first row to «average» (score) or something similar, because for the other listed classes such as «AL», «AT», etc. in this row, AP values are also reported. You can try to improve it. You can also consider adding bold font for the best result in each column (for each class).

 

4) Figure 10. Please add an image with ground truth labels. It will visually highlight the benefit of the proposed approach.

 

5) Line 13: it should be a lowercase letter for «Firstly».

 

6) Line 193: Please rephrase «This enhancement enhances»

 

7) Line 603: I suggest starting this sentence as a new paragraph: «The Satellite Imagery Multivehicle Dataset (SIMD)».

 

8) 4.2 Datasets: Please add spatial resolution of images in DOTA and SIMD datasets.

 

9) In the discussion section, I suggest adding some speculation on how the proposed approach can be further used in remote sensing applications and which crucial real-life cases require such an approach. For instance, you can refer to these recent papers that address remote sensing monitoring tasks and might benefit from your model:

 

https://doi.org/10.3390/rs14030620 The proposed approach is essential for real-time traffic monitoring in urban areas

https://doi.org/10.3390/rs15184463 The improved YOLOv5 model can be integrated into the flood monitoring system for real-time detection of damaged buildings

 

You can also discuss how YOLOv7 or YOLOv8 models can benefit from the proposed improvements.

Author Response

Dear Reviewer:

Thank you for your comments concerning our manuscript entitled "Faster and Lightweight: An Improved YOLOv5 Object Detector for Remote Sensing Images". These comments are all valuable and extremely helpful for revising and improving our paper, and they provide important guidance for our research. We have studied the comments carefully and have made corrections that we hope meet with your approval. Revised portions are marked in yellow in the paper. The main corrections in the paper and the responses to the comments are as follows:

  • Response to comment: 1) It would be interesting to show the standard deviation (std) for a better understanding of the significance of the achieved model's improvement, because for different random seeds the models' performance might vary slightly and the local optimum might differ. I see that it might be computationally expensive to conduct additional training for all algorithms; therefore, I suggest computing the std just for your model and the original YOLOv5, if possible.

Response: We appreciate your valuable suggestion. The influence of randomness on the model's performance is indeed present to a certain extent. However, our experiments primarily focus on the impact of model lightweighting on detection performance, that is, on how to maintain detection performance while reducing model parameters to facilitate deployment on mobile devices, rather than on the effect of randomness, whose relevance to the theme of the paper is rather limited. In our outlook, we provide insights into the implications of model randomness. We apologize that we did not calculate the "std" evaluation metric.

  • Response to comment: 2)Table 4. I suggest to transpose the table that methods are located in the first column the same as in Table 3 (and all other tables).

Response: Thank you for your valuable suggestion. We have revised Table 4 as you suggested.

  • Response to comment: 3) Table 3. I suggest changing «mAP» in the first row to «average» (score) or something similar, because for the other listed classes such as «AL», «AT», etc. in this row, AP values are also reported. You can try to improve it. You can also consider adding bold font for the best result in each column (for each class).

Response: Thank you for your valuable suggestion. We apologize for the lack of clarity in the first column of Table 3 in our paper. In Table 3, we referenced the methodology from Table 5 in the article found at https://doi.org/10.3390/rs15092429. Regarding the "mAP" entry, it represents the average precision across all categories. The data in each subsequent column corresponds to the precision of each individual category, denoted as "AP," rather than the calculation of average precision. Therefore, we also provided an explanation for the data of the following 20 categories (AP) in line 745 of our paper.
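To make the table's convention concrete: each class column reports that class's AP, and the "mAP" entry is the unweighted mean of those per-class values. A minimal sketch with hypothetical AP values (placeholders, not results from the paper):

```python
def mean_average_precision(ap_per_class):
    """mAP is the unweighted mean of the per-class average precisions."""
    return sum(ap_per_class.values()) / len(ap_per_class)

# Hypothetical per-class APs; the keys mirror the table's abbreviations
# but the numbers are illustrative only.
aps = {"AL": 0.90, "AT": 0.80, "GC": 0.70}
print(round(mean_average_precision(aps), 2))  # 0.8
```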

  • Response to comment: 4)Figure 10. Please add image with ground truth labels. It will visually highlight the benefit of the proposed approach.

Response: Thank you for your valuable suggestion. We have made correction according to your comments.

  • Response to comment: 5) Line 13: it should be small letter for “Firstly” and 6) Line 193: Please rephrase “This enhancement enhances” and 7) Line 603: I suggest to start this sentence as a new paragraph “The Satellite Imagery Multivehicle Dataset (SIMD)”

Response: Thank you for your valuable suggestions. Following suggestions 5), 6) and 7), we have made changes at line 13, line 234, and line 694 respectively; the portions highlighted in yellow in the revised version mark the improved text.

  • Response to comment:  8) 4.2 Datasets: Please add spatial resolution of images in DOTA and SIMD datasets.

Response: Thank you very much for your suggestion. We neglected the spatial resolution of these datasets in the original manuscript; we have added the spatial resolution of the DOTA dataset at line 688 and that of the SIMD dataset at lines 694-695 of the revised version.

  • Response to comment: 9) In discussion section, I suggest to add some speculations on how the proposed approach can be further used in remote sensing applications and which crucial real-life cases require such approach. 

Response: Thank you for your valuable suggestion. Your ninth suggestion is of significant value to our article. Since our discussion chapter mainly covers the analytical validation of the effectiveness of individual modules through ablation experiments, we did not analyze real-world applications of the proposed method much in the original article. In the revised version, we have added a brief analysis of the use of the proposed method in remote sensing image applications in lines 859-873, and we have also discussed ideas for future improvement of the proposed method in the conclusion, in lines 901-913.

Special thanks to you for your good comments.

Reviewer 4 Report

Dear Authors,

My comments on this manuscript are listed as below:

1. The Introduction part is difficult to read and understand. In my point of view, a separate subsection for the YOLO project could be written which outlines the history of the project and the information between lines 88 and 140.

2. When I zoom in on Figures 1, 2, 3, 4 and 5, the images get blurry. Could you please provide images of higher quality?

3. Between lines 572 and 582, it is stated that the DIOR dataset was trained for 300 epochs and the other datasets were trained for 150 epochs. Could you please explain how the numbers of training epochs were chosen, and why they differ between datasets?

4. In Table 3 some abbreviations were provided for target objects; for example, AT stands for airports and GC stands for golf courses. However, there is no information on the meanings of AL, BF, BC, C, D, GTF, HB, O, S, SD, ST, TS and V.

5. In line 607 it is stated that the SIMD dataset is used for detecting stationary targets. However, in line 681 it is stated that the proposed algorithm is used for detecting dynamic small objects. Can you please clarify whether the SIMD dataset is used for detection of stationary or dynamic objects?

6. For visual comparison, can you please create figures for the SIMD and DOTA datasets as the one shown in Figure 10?

Author Response

Dear Reviewer:

Thank you for your comments concerning our manuscript entitled "Faster and Lightweight: An Improved YOLOv5 Object Detector for Remote Sensing Images". These comments are all valuable and extremely helpful for revising and improving our paper, and they provide important guidance for our research. We have studied the comments carefully and have made corrections that we hope meet with your approval. Revised portions are marked in yellow in the paper. The main corrections in the paper and the responses to the comments are as follows:

  • Response to comment: 1)Introduction part is difficult to read and understand. In my point of view a separate subsection for the YOLO project can be written which outlines history of the project and the information between lines 88 and 140.

Response: Thank you for your valuable suggestion. Based on your suggestions, we have improved the narrative of the YOLO section in the manuscript. In lines 107-130 of the manuscript, we give an overview of the development of YOLO and explain the design ideas of the YOLOv5 model.

  • Response to comment: 2) When I zoom in on Figures 1, 2, 3, 4 and 5, the images get blurry. Could you please provide images of higher quality?

Response: Thank you for your valuable suggestion. Based on your comments, we have replaced Figures 1, 2, 3, 4 and 5 with clearer graphics. The blurring of the previous images was likely caused by quality loss during image compression.

  • Response to comment: 3)Between lines 572 and 582, it is stated that DIOR dataset was trained for 300 epochs and other datasets were trained for 150 epochs. Could you please explain how the number of training epochs were chosen? And why the number of training epochs are different for test datasets?

Response: Thank you for your valuable suggestion. Based on your suggestion we explain the choice of training epochs: during the experimental process, we observed a linear overfitting trend in both the DOTA and SIMD datasets after surpassing 200 epochs. Furthermore, during training, the improvement in model accuracy becomes limited at approximately 150 epochs. Hence, for training on the DOTA and SIMD datasets, we opted to conduct training for 150 epochs, utilizing a batch size of 8.

  • Response to comment: 4)In table 3 some abbreviations were provided for target objects. Such as, AT stands for airports and GC stands for golf courses. However, there is no information on meanings of AL, BF, BC, C, D, GTF, HB, O, S, SD, ST, TS and V.

Response: Thank you for your suggestion, in response to which we have provided additional clarification of the abbreviations in Table 3 in lines 674-678 of the manuscript.

  • Response to comment: 5) In line 607 it is stated that the SIMD dataset is used for detecting stationary targets. However, in line 681 it is stated that the proposed algorithm is used for detecting dynamic small objects. Can you please clarify whether the SIMD dataset is used for detection of stationary or dynamic objects?

Response: Thank you for your valuable suggestion. We sincerely apologize for our oversight. At line 681, due to a lapse in attention, an error occurred in the translation. The SIMD dataset mentioned in line 601 is intended for dynamic remote sensing small object detection, rather than static small object detection.

  • Response to comment: 6)For visual comparison, can you please create figures for the SIMD and DOTA datasets as the one shown in Figure 10?

Response: Thank you for your valuable suggestion. Our emphasis in validating on the DOTA and SIMD datasets lies primarily in assessing whether the model can achieve results on different datasets similar to those on the DIOR dataset. Consequently, we did not include visualizations of the final detection results for them in the paper. Moreover, since the visualization results for both datasets are similar to those for the DIOR dataset, our paper focuses solely on comparing the DIOR results. However, in the revised manuscript, Figure 10 now includes the ground truth data from the DIOR dataset as evidence of the model's effectiveness.

Special thanks to you for your good comments.
