Article
Peer-Review Record

Exploration of Vehicle Target Detection Method Based on Lightweight YOLOv5 Fusion Background Modeling

Appl. Sci. 2023, 13(7), 4088; https://doi.org/10.3390/app13074088
by Qian Zhao 1, Wenyue Ma 1,*, Chao Zheng 2 and Lu Li 2
Submission received: 17 February 2023 / Revised: 9 March 2023 / Accepted: 9 March 2023 / Published: 23 March 2023
(This article belongs to the Topic Peaceful and Secure Cities)

Round 1

Reviewer 1 Report

This is an interesting manuscript which proposes a detection model based on an improved YOLOv5 to solve the problem of missed and false detection during rush hours. The proposed model improves the inference ability and accuracy compared to other models. Some questions should be addressed before publication.

1. Page 16, Figure 9: the markers 1-5 cannot be found in Figure 9. It is important to show the differences.

2. Page 17, Figure 10: which one shows the improved results? It should be indicated in Figure 10.

Author Response

We sincerely appreciate your hard work and responsibility. Your suggestions are valuable and helpful for revising and improving our paper. According to your suggestions, we have made the following revisions to the manuscript.

Point 1: Page 16, Figure 9: the markers 1-5 cannot be found in Figure 9. It is important to show the differences.

Response 1: This was my negligence. I used PowerPoint to annotate Figure 9 and marked the differences between the results detected by the two models, but the markings were lost when the figure was imported into Word from Visio. I apologize for this and have replaced the image with the properly marked version.

Point 2: Page 17 Figure 10: which one is the improved results? It should be indicated in the figure 10.

Response 2: For Figure 10, I tested two datasets. The first row shows the images I selected from the UA-DETRAC dataset for five different scenarios. The second row displays the results from the original model, while the third row shows the optimized model's results. The fourth row shows the corresponding images for the five scenarios in my own custom dataset, and the remaining two rows follow the same format accordingly.

Author Response File: Author Response.docx

Reviewer 2 Report

This manuscript reports a target detection method based on background modeling. The method can be useful and could be applied to solve some practical problems. Thus, it can be accepted after the following issues are resolved.

1. What are the differences between Figures 4, 5, and 6? The authors should explain them clearly in the main text.

2. How much data was used for training? What is the rationale for this training, and what is the novelty?

3. What is the mechanism that makes the model size small compared with other models?

4. The authors can cite more references to solidify the background, such as: Sample balancing of curves for lens distortion modeling and decoupled camera calibration. Optics Communications, 2023, 129221.

Author Response

Response to Reviewer 2 Comments

We sincerely appreciate your hard work and responsibility. Your suggestions are valuable and helpful for revising and improving our paper. According to your suggestions, we have made the following revisions to the manuscript.

Point 1: What are the differences between Figures 4, 5, and 6? The authors should explain them clearly in the main text.

Response 1: Figures 4, 5, and 6 respectively show the integration of the CA attention module at three points in the network. First, CA is fused with the Backbone's Concat, which improves detection accuracy by reconstructing channel and coordinate attention in the multi-scale target feature maps extracted by the Backbone. Second, CA is integrated with the Neck network's Concat during multi-scale fusion to improve detection across multiple target sizes. Finally, CA is fused with the Head network to perform channel and coordinate reconstruction of the feature map before prediction, thereby improving prediction accuracy.
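For reference, the CA block described above can be sketched in PyTorch as follows (a minimal implementation after Hou et al., CVPR 2021; the reduction ratio and activation are representative choices, not necessarily those used in our model):

```python
import torch
import torch.nn as nn

class CoordAtt(nn.Module):
    """Coordinate Attention (after Hou et al., CVPR 2021): global pooling is
    factorized into two 1-D pools along height and width, so the resulting
    attention encodes both channel and positional information."""

    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)  # the paper uses h-swish; ReLU keeps the sketch simple
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        x_h = x.mean(dim=3, keepdim=True)                      # (n, c, h, 1): pool along W
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)  # (n, c, w, 1): pool along H
        y = self.act(self.bn1(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                      # (n, c, h, 1) attention
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # (n, c, 1, w) attention
        return x * a_h * a_w
```

The same module can be dropped in before any Concat in the Backbone, Neck, or Head, which is why the three figures look structurally similar.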

Point 2: How much data was used for training? What is the rationale for this training, and what is the novelty?

Response 2: There are a total of 4526 images. Coming from a background in traditional algorithms, I innovatively improved the ViBe background modeling algorithm as a pre-processing step for the YOLO model, which yields complete vehicle contours and facilitates the extraction of vehicle appearance features by the backbone. Attached is my latest interpretability figure, but I cannot publish it in this journal as I intend to submit another paper. Sorry for the inconvenience.
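To make the pre-processing idea concrete, below is a minimal five-frame differencing sketch in NumPy (one common formulation; our full FFD-ViBe pipeline with MD-SILBP texture operators is more involved and is not reproduced here):

```python
import numpy as np

def five_frame_diff(frames, thresh=25):
    """Minimal five-frame differencing sketch. frames: five consecutive
    grayscale frames as uint8 NumPy arrays of the same shape. Returns a
    binary motion mask (uint8, 0 or 255) for the middle frame."""
    f1, f2, f3, f4, f5 = [f.astype(np.int16) for f in frames]
    # binarized absolute differences of the middle frame against its neighbors
    d31 = np.abs(f3 - f1) > thresh
    d32 = np.abs(f3 - f2) > thresh
    d34 = np.abs(f3 - f4) > thresh
    d35 = np.abs(f3 - f5) > thresh
    # AND the symmetric pairs, then OR, to suppress noise while keeping motion
    mask = (d31 & d35) | (d32 & d34)
    return mask.astype(np.uint8) * 255
```

The resulting mask roughly outlines moving vehicles; in the paper this is combined with ViBe's per-pixel background samples to recover complete contours.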

Point 3: What is the mechanism that makes the model size small compared with other models?

Response 3: The main reasons for the smaller size of the YOLOv5 model are as follows:

  • Network Architecture: YOLOv5 uses a lightweight backbone network (CSPDarknet53) with fewer layers than previous YOLO versions, making the model faster and smaller. In addition, it incorporates the GhostNet model to reduce parameter redundancy (a minimal sketch follows this list).
  • Model pruning: YOLOv5 utilizes automatic model pruning techniques to remove unnecessary weights from the network, further reducing its size.
  • Quantization: YOLOv5 uses low-precision representations of parameters (8-bit integers) instead of full-precision floating-point numbers, which reduces the model size without compromising accuracy too much.

  • Efficient implementation: YOLOv5 is optimized for both CPU and GPU, allowing it to leverage hardware acceleration.
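As a concrete illustration of the GhostNet idea referenced in the first bullet, here is a minimal PyTorch sketch of a Ghost module (after Han et al., CVPR 2020; the channel split and kernel sizes are representative assumptions, not our exact configuration):

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Ghost module (after Han et al., CVPR 2020): an ordinary convolution
    produces half of the output channels, and a cheap depthwise convolution
    synthesizes the other half, roughly halving parameters and FLOPs."""

    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 1,
                 dw_size: int = 3, stride: int = 1):
        super().__init__()
        init_ch = out_ch // 2  # assumes an even number of output channels
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, init_ch, kernel_size, stride, kernel_size // 2, bias=False),
            nn.BatchNorm2d(init_ch),
            nn.ReLU(inplace=True),
        )
        self.cheap = nn.Sequential(  # depthwise "cheap operation"
            nn.Conv2d(init_ch, init_ch, dw_size, 1, dw_size // 2,
                      groups=init_ch, bias=False),
            nn.BatchNorm2d(init_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)
```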

Point 4: The authors can cite more references to solidify the background, such as: Sample balancing of curves for lens distortion modeling and decoupled camera calibration. Optics Communications, 2023, 129221.

Response 4: Thank you; I have strengthened the research background. Here is the revised version:

The background modeling algorithm proposed by Wang employs interframe differencing and a Gaussian mixture model, while also utilizing the Harris corner detection method to accurately detect moving objects. Although the algorithm exhibits impressive detection accuracy and real-time performance, its practical application still requires further development [5]. Yin introduced a method for enhancing vehicle camera monitoring systems that relies on fast background modeling and adaptive moving-target detection. The core idea of this method is to constantly update the background model to enable efficient detection of moving targets within real-time video streams. Although the approach shows promising results, its detection outputs might not suffice to support follow-up research; further research and refinement are necessary to enhance its detection capabilities and maximize its potential in vehicle monitoring applications [6].

Lu et al. [18] proposed a novel approach for real-time multiple-vehicle detection and tracking from a moving platform. The method integrates vehicle detection with vehicle tracking and utilizes convolutional neural networks (CNNs) and deep reinforcement learning for improved accuracy and robustness. Meng et al. [19] proposed a tracking-by-detection approach for vehicle tracking in unmanned aerial vehicle (UAV) videos. The proposed method combines deep learning techniques with motion cues for detecting and tracking vehicles in challenging aerial scenarios.

5. Wang, J.; Zhang, Q. "Background Modeling and Moving Object Detection Based on Frame Difference and Gaussian Mixture Model," Electronic Design Engineering, vol. 27, no. 14, pp. 16-19, 2019.

6. Yin, Y.; Yang, T.; Wang, Y. "Adaptive Background Modeling and Detection Method for Moving Objects from Car Cameras," Optoelectronics·Laser, vol. 30, no. 9, pp. 864-872, 2019.

18. Lu, X.; Ling, H. "Real-Time Multiple-Vehicle Detection and Tracking from a Moving Platform," IEEE Transactions on Intelligent Transportation Systems, vol. 18, no. 12, pp. 3473-3483, Jul. 2017. [CrossRef]

19. Meng, D.; Xiang, S.; Pan, C. "Tracking-by-Detection of Vehicles in UAV Videos Using Deep Learning and Motion Cues," Computer Vision and Image Understanding, vol. 198, p. 103015, Nov. 2020. [CrossRef]

Author Response File: Author Response.docx

Reviewer 3 Report

This article proposes a lightweight detection model based on an improved YOLOv5 to address missed and false detections caused by traffic congestion in surveillance-video vehicle detection.

- Please address the issues I have noted in your submitted document.

- Please review your article for grammar, sentence construction, and translated technical terms. I have highlighted some of them as a reference.

Comments for author File: Comments.pdf

Author Response

Response to Reviewer 3 Comments

We sincerely appreciate your hard work and responsibility. Your suggestions are valuable and helpful for revising and improving our paper. According to your suggestions, we have made the following revisions to the manuscript.

Response 1: Due to the explosive increase in per capita vehicle ownership in China brought about by the continuous development of the economy and society, many negative impacts have arisen, making it necessary to establish a smart city system that has rapidly developing vehicle detection technology as its data acquisition system. This paper proposes a lightweight detection model based on an improved version of YOLOv5 to address the problem of missed and false detections caused by occlusion during rush-hour vehicle detection in surveillance videos. The proposed model replaces the BottleneckCSP structure with the GhostNet structure and prunes the network model to speed up inference. Additionally, the Coordinate Attention Mechanism (CA) is introduced to enhance the network's feature extraction and improve its detection and recognition ability. Distance-IoU Non-Maximum Suppression (DIoU-NMS) replaces Non-Maximum Suppression (NMS) to address false detections and omissions when detecting congested targets. Lastly, the combination of the five-frame differential method with ViBe (FFD-ViBe) and MD-SILBP operators is used to enhance the model's extraction of vehicle contour features. The experimental results show that the proposed model outperforms the original model in terms of the number of parameters, inference ability, and accuracy when applied to both the expanded UA-DETRAC and a self-built dataset. Thus, this method has significant industrial value in intelligent traffic systems and can effectively improve vehicle detection indicators in traffic monitoring scenarios.

Response 2: Vehicle detection, being the primary task of intelligent transportation systems, plays a crucial role in determining the efficacy of the entire intelligent system.

Response 3: There are numerous superfluous and redundant neurons within the bottleneck feature extraction network. Replacing the CSP structure with the GhostNet structure significantly decreases both model parameters and computational cost. Furthermore, substituting the SPPF structure, with its serial three-layer max pooling, for the conventional parallel three-layer max pooling operation further enhances the model's inference capability.
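For illustration, here is a minimal sketch of the SPPF block described above (as popularized by YOLOv5; the production block also wraps each 1x1 convolution in BatchNorm and SiLU):

```python
import torch
import torch.nn as nn

class SPPF(nn.Module):
    """SPPF: three *serial* 5x5 max-pools reproduce the receptive fields of
    the older parallel 5/9/13 SPP at lower computational cost."""

    def __init__(self, c_in: int, c_out: int, k: int = 5):
        super().__init__()
        c_mid = c_in // 2
        self.cv1 = nn.Conv2d(c_in, c_mid, kernel_size=1)     # real block adds BN + SiLU
        self.cv2 = nn.Conv2d(c_mid * 4, c_out, kernel_size=1)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.cv1(x)
        y1 = self.pool(x)    # effective 5x5 window
        y2 = self.pool(y1)   # effective 9x9 window
        y3 = self.pool(y2)   # effective 13x13 window
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))
```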

Response 4: To tackle vehicle occlusion, DIoU-NMS is employed to accelerate the regression of a suitable target box when detecting vehicles.
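For clarity, the suppression rule can be sketched in plain PyTorch as follows (a minimal illustration of DIoU-NMS after Zheng et al., AAAI 2020, not the exact routine in our code): a box is suppressed only when its IoU with a higher-scoring box, minus the normalized center-point distance, exceeds the threshold, so heavily occluded but distinct vehicles are kept.

```python
import torch

def diou_nms(boxes: torch.Tensor, scores: torch.Tensor, iou_thres: float = 0.5):
    """DIoU-NMS sketch. boxes: (N, 4) as (x1, y1, x2, y2); scores: (N,).
    Returns indices of the kept boxes."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0]
        keep.append(i.item())
        if order.numel() == 1:
            break
        rest = order[1:]
        # IoU between the current box and the remaining ones
        xx1 = torch.max(boxes[i, 0], boxes[rest, 0])
        yy1 = torch.max(boxes[i, 1], boxes[rest, 1])
        xx2 = torch.min(boxes[i, 2], boxes[rest, 2])
        yy2 = torch.min(boxes[i, 3], boxes[rest, 3])
        inter = (xx2 - xx1).clamp(min=0) * (yy2 - yy1).clamp(min=0)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        # squared center distance, normalized by the enclosing box diagonal
        cx_i = (boxes[i, 0] + boxes[i, 2]) / 2
        cy_i = (boxes[i, 1] + boxes[i, 3]) / 2
        cx_r = (boxes[rest, 0] + boxes[rest, 2]) / 2
        cy_r = (boxes[rest, 1] + boxes[rest, 3]) / 2
        ex1 = torch.min(boxes[i, 0], boxes[rest, 0])
        ey1 = torch.min(boxes[i, 1], boxes[rest, 1])
        ex2 = torch.max(boxes[i, 2], boxes[rest, 2])
        ey2 = torch.max(boxes[i, 3], boxes[rest, 3])
        diag2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
        dist2 = (cx_i - cx_r) ** 2 + (cy_i - cy_r) ** 2
        diou = iou - dist2 / diag2.clamp(min=1e-9)
        order = rest[diou <= iou_thres]  # suppress only when DIoU exceeds the threshold
    return torch.tensor(keep)
```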

Response 5: In this experiment, the extended dataset was partitioned into training and testing sets at an 8:2 ratio. A sequence of ablation experiments was conducted to validate the effectiveness of the three proposed enhancements: the GhostNet, DIoU-NMS, and CA modules were incrementally incorporated to analyze their benefits. The experiments were conducted without pre-trained weights, and the parameters were uniformly configured as prescribed in Table 3. The experimental outcomes are presented in Table 6.
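For reference, the 8:2 partition can be reproduced with a small helper like the following (a hypothetical sketch; the authors' actual tooling is not shown):

```python
import random

def split_dataset(image_paths, train_ratio=0.8, seed=0):
    """Shuffle image paths with a fixed seed and split them 8:2 into
    training and testing lists."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    cut = int(len(paths) * train_ratio)
    return paths[:cut], paths[cut:]
```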

Response 6: In the final results chart, AP@0.5 is calculated using the formula that I provided; Frame Time represents the time taken to process each image frame; FPS is the number of images processed per unit time; FLOPs represent the computational complexity of the model; and Model Size refers to the size of the model's weights.
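As an illustration of how Frame Time and FPS can be measured, here is a generic benchmarking sketch (a hypothetical helper, not the authors' actual measurement script; on GPU one would additionally call torch.cuda.synchronize() around the timed loop):

```python
import time
import torch

@torch.no_grad()
def measure_frame_time(model, input_size=(1, 3, 640, 640), n_warmup=10, n_runs=100):
    """Estimate per-frame latency (seconds) and FPS for a detection model."""
    model.eval()
    x = torch.randn(*input_size)
    for _ in range(n_warmup):   # warm-up runs to stabilize timings
        model(x)
    start = time.perf_counter()
    for _ in range(n_runs):
        model(x)
    frame_time = (time.perf_counter() - start) / n_runs
    return frame_time, 1.0 / frame_time
```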

The remaining changes are just modifications to a few words, which I've only made in the body of the text.

Author Response File: Author Response.docx

Reviewer 4 Report


- The words 'Research on' should not be included in the title of the manuscript; other suitable words may be used instead, if necessary.
- For terms not used at least twice in the abstract, acronyms should not be introduced in the abstract; such acronyms should instead be defined in the main text.
- A detailed discussion of and references for the YOLOv5 model can be added.
- The literature review can be strengthened further.
- Consistency in parameter symbols/representation should be maintained (e.g., all italics) as per MDPI requirements.
- The authors can highlight the novelty of the research work carried out by them.
- Any trade-offs/limitations can be indicated.

Author Response

Response to Reviewer 4 Comments

We sincerely appreciate your hard work and responsibility. Your suggestions are valuable and helpful for revising and improving our paper. According to your suggestions, we have made the following revisions to the manuscript.

Point 1: The words 'Research on' should not be included in the title of the manuscript; other suitable words may be used instead, if necessary.

Response 1: The title has been revised to: Exploration of Vehicle Target Detection Method Based on Lightweight YOLOv5 Fusion Background Modeling.

Point 2: For terms not used at least twice in the abstract, acronyms should not be introduced in the abstract; such acronyms should instead be defined in the main text.

Response 2: Due to the explosive increase in per capita vehicle ownership in China brought about by the continuous development of the economy and society, many negative impacts have arisen, making it necessary to establish a smart city system that has rapidly developing vehicle detection technology as its data acquisition system. This paper proposes a lightweight detection model based on an improved version of YOLOv5 to address the problem of missed and false detections caused by occlusion during rush-hour vehicle detection in surveillance videos. The proposed model replaces the BottleneckCSP structure with the GhostNet structure and prunes the network model to speed up inference. Additionally, the Coordinate Attention Mechanism is introduced to enhance the network's feature extraction and improve its detection and recognition ability. Distance-IoU Non-Maximum Suppression replaces Non-Maximum Suppression to address false detections and omissions when detecting congested targets. Lastly, the combination of the five-frame differential method with ViBe and MD-SILBP operators is used to enhance the model's extraction of vehicle contour features. The experimental results show that the proposed model outperforms the original model in terms of the number of parameters, inference ability, and accuracy when applied to both the expanded UA-DETRAC and a self-built dataset. Thus, this method has significant industrial value in intelligent traffic systems and can effectively improve vehicle detection indicators in traffic monitoring scenarios.

Point 3: A detailed discussion of and references for the YOLOv5 model can be added.

Response 3: YOLOv5 is the latest version of the YOLO series; it implements a lighter model while employing new design methods to ensure detection accuracy. The YOLOv5 model, based on the PyTorch framework, is extremely lightweight, reducing the model size by nearly 90% compared with YOLOv4. At the same time, YOLOv5 offers faster detection while maintaining accuracy, making it more advantageous in practical applications. YOLOv5 includes four versions: S, M, L, and X, with the S version having the fastest detection speed; the network depth of the other versions increases in order. In this chapter, YOLOv5s is used to implement vehicle detection.

The YOLOv5 network structure, shown in Figure 1, is mainly divided into four parts: Input, Backbone, Neck, and Head. Input: preprocesses the image, including calculating adaptive anchor boxes, padding to unify the size of input images, and using Mosaic data augmentation to enrich the features of the dataset. Backbone: responsible for feature extraction and abstraction of input images. The backbone network usually consists of multiple convolutional layers, which progressively transform the original image into more abstract, higher-level semantic features for subsequent processing and classification tasks. Due to the characteristics of convolutional neural networks, the backbone network can learn local and global information of images and perform feature extraction and abstraction based on this information. Neck: used to fuse feature maps at different levels to obtain more comprehensive feature information. Feature maps processed by the Neck network can be better input to the Head layer for target classification and position determination, thus improving the accuracy and robustness of target detection. Head: mainly responsible for using the features extracted by the Backbone, after compression and fusion in the Neck, for classification and prediction.
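For readers who want to reproduce the baseline, YOLOv5s can be loaded directly from the official Ultralytics repository via torch.hub (the image path below is a placeholder; the first call downloads and caches the weights):

```python
import torch

# Load the pretrained YOLOv5s model from the official Ultralytics repo.
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)

# Run detection on a traffic image; YOLOv5 handles letterbox resizing,
# inference, and NMS internally.
results = model('traffic.jpg')   # 'traffic.jpg' is a placeholder path
results.print()                  # per-class counts and timing summary
boxes = results.xyxy[0]          # (N, 6) tensor: x1, y1, x2, y2, conf, class
```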

Point 4: The literature review can be strengthened further.

Response 4: The background modeling algorithm proposed by Wang employs interframe differencing and a Gaussian mixture model, while also utilizing the Harris corner detection method to accurately detect moving objects. Although the algorithm exhibits impressive detection accuracy and real-time performance, its practical application still requires further development [5]. Yin introduced a method for enhancing vehicle camera monitoring systems that relies on fast background modeling and adaptive moving-target detection. The core idea of this method is to constantly update the background model to enable efficient detection of moving targets within real-time video streams. Although the approach shows promising results, its detection outputs might not suffice to support follow-up research; further research and refinement are necessary to enhance its detection capabilities and maximize its potential in vehicle monitoring applications [6].

Lu et al. [18] proposed a novel approach for real-time multiple-vehicle detection and tracking from a moving platform. The method integrates vehicle detection with vehicle tracking and utilizes convolutional neural networks (CNNs) and deep reinforcement learning for improved accuracy and robustness. Meng et al. [19] proposed a tracking-by-detection approach for vehicle tracking in unmanned aerial vehicle (UAV) videos. The proposed method combines deep learning techniques with motion cues for detecting and tracking vehicles in challenging aerial scenarios.

5. Wang, J.; Zhang, Q. "Background Modeling and Moving Object Detection Based on Frame Difference and Gaussian Mixture Model," Electronic Design Engineering, vol. 27, no. 14, pp. 16-19, 2019.

6. Yin, Y.; Yang, T.; Wang, Y. "Adaptive Background Modeling and Detection Method for Moving Objects from Car Cameras," Optoelectronics·Laser, vol. 30, no. 9, pp. 864-872, 2019.

18. Lu, X.; Ling, H. "Real-Time Multiple-Vehicle Detection and Tracking from a Moving Platform," IEEE Transactions on Intelligent Transportation Systems, vol. 18, no. 12, pp. 3473-3483, Jul. 2017. [CrossRef]

19. Meng, D.; Xiang, S.; Pan, C. "Tracking-by-Detection of Vehicles in UAV Videos Using Deep Learning and Motion Cues," Computer Vision and Image Understanding, vol. 198, p. 103015, Nov. 2020. [CrossRef]

Point 5: Consistency in parameter symbols/representation should be maintained (e.g., all italics) as per MDPI requirements.

Response 5: I have already formatted all parameters in italics in the paper, following the writing guidelines of MDPI.

Point 6: The authors can highlight the novelty of the research work carried out by them.

Point 7: Any trade-offs/limitations can be indicated.

Responses 6 and 7: I am sorry, but I do not understand the last two comments and am unclear on what you are trying to convey. Could you please rephrase them?

Author Response File: Author Response.docx
