Peer-Review Record

CNN-Based Crosswalk Pedestrian Situation Recognition System Using Mask-R-CNN and CDA

Appl. Sci. 2023, 13(7), 4291; https://doi.org/10.3390/app13074291

by Sac Lee ¹, Jaemin Hwang ¹, Junbeom Kim ¹ and Jinho Han ²,*

Reviewer 1: Wen Zheng

Reviewer 2: Anonymous

Reviewer 3: Man Li

Submission received: 27 February 2023 / Revised: 16 March 2023 / Accepted: 27 March 2023 / Published: 28 March 2023

(This article belongs to the Section Computing and Artificial Intelligence)

Round 1

Reviewer 1 Report

I read the manuscript entitled ‘CNN-Based Crosswalk Pedestrian Situation Recognition System Using Mask-R-CNN and CDA’ with interest. The article tests an approach that uses Mask R-CNN and CDA. The trained CNN classifies a situation as dangerous or safe based on the locations of pedestrians and pedestrian crossings. However, I found several presentation problems, some details are missing from the manuscript, and some shortcomings mean it does not meet the journal's innovation requirements, so I recommend rejection.

1. I recommend giving the full name of each abbreviation when it first appears in the article.

2. Lines 28-29: the first sentence of the introduction is not coherent or logical. I suggest adding the significance of the study to the first paragraph.

3. Line 37: present the shortcomings of previous research and lead progressively into the subsequent related work. Highlight what improvements this paper makes to address those shortcomings.

4. Please indicate the source of the data set used in the article and cite it where applicable.

5. After the models were trained using box plots in subsection 4.2, the accuracy of both models decreased. What is the significance of this?

6. Line 225 mentions that this paper developed its own CDA algorithm to process the images. Please add a comparison test with other pedestrian crossing detection algorithms to verify the accuracy of this paper's algorithm.

Comments for author File: Comments.pdf

Author Response

Response to Reviewer 1 Comments

We appreciate your accurate and professional comments. We revised our paper according to your points, as described below. We also had the paper proofread by a native English-speaking faculty member to improve the language.

Point 1: I recommend giving the full name of each abbreviation when it first appears in the article.

Response 1: We now give the full name of each abbreviation when it first appears.

Point 2: Lines 28-29: the first sentence of the introduction is not coherent or logical. I suggest adding the significance of the study to the first paragraph.

Response 2: We revised the opening of the introduction as follows.

Researchers initially used CNN [1] to classify objects in images. Later, they found a way to differentiate and detect each object in the image. In addition, they studied object relationships and situation recognition between detected objects. Our research performs situation recognition between objects by detecting pedestrians and crosswalks in images. Through deep learning training, our system can distinguish whether a pedestrian in a crosswalk is safe from the driver's point of view. Many researchers have studied object detection and object relationships. Hu et al. [2] devised the object relationship module using original and geometric weights to understand the dependence between objects for object detection [3]. Redmon et al. [4, 5] used YOLO (You Only Look Once) for object detection.

Point 3: Line 37: present the shortcomings of previous research and lead progressively into the subsequent related work. Highlight what improvements this paper makes to address those shortcomings.

Response 3: We added Table 1 and described the improvements made in our paper in Section 2.

Point 4: Please indicate the source of the data set used in the article and cite it where applicable.

Response 4: We added a reference to the source of our dataset, which is available on GitHub, and mentioned the reference in Section 4 as follows.

The source of our dataset is available on GitHub [34].

Reference,

[34] The source of our dataset is available on GitHub: https://github.com/toast-ceo/CNN-Based-Crosswalk-Pedestrian-Situation-Recognition-System-Using-Mask-R-CNN-and-CDA

Point 5: After the models were trained using box plots in subsection 4.2, the accuracy of both models decreased. What is the significance of this?

Response 5: We added Table 3 and explained the significance of using box images in subsection 4.2 as follows.

Table 3 presents the accuracy (%) of Experiments I and II tested with ResNet50 and Xception. These results reveal that training CNNs with images created by coloring simple shapes is about as accurate as the method used in Experiment I: with ResNet50, the two experiments differ by just 0.7 in accuracy (%). If an AI can be trained with box images for a specific purpose, as in our case, training datasets can be prepared with less time and cost.

Point 6: Line 225 mentions that this paper developed its own CDA algorithm to process the images. Please add a comparison test with other pedestrian crossing detection algorithms to verify the accuracy of this paper's algorithm.

Response 6: We added Table 1 and explained in Section 2 why we needed to develop our CDA, as follows.

When we tried to detect the crosswalk with YOLO, we obtained only a simple box shape rather than the exact crosswalk shape. Also, Mask R-CNN could not reliably detect crosswalks when road conditions left the crossing lines invisible. We therefore developed our CDA, which can draw the crosswalk shape even if some parts of the zebra crossing lines are erased.

Author Response File: Author Response.docx

Reviewer 2 Report

1. How do the authors deal with the influence of shadows on the image?

2. For the pedestrians and objects in the image, how is the outline of travelers correctly classified, as in the original figure in Fig. 6?

3. In the first part, the authors should add an introduction to pedestrian posture recognition using sensors to improve the readability of the paper, for example:

Shi, L.-F.; Liu, Z.-Y.; Zhou, K.-J.; Shi, Y.; Jing, X. Novel Deep Learning Network for Gait Recognition Using Multimodal Inertial Sensors. Sensors, 2023, 23, 849.

4. The information in reference [31] is incorrect. It should be:

M. A. Malbog, "MASK R-CNN for Pedestrian Crosswalk Detection and Instance Segmentation," 2019 IEEE 6th International Conference on Engineering Technologies and Applied Sciences (ICETAS), Kuala Lumpur, Malaysia, 2019, pp. 1-5, doi: 10.1109/ICETAS48360.2019.9117217.

Author Response

Response to Reviewer 2 Comments

We appreciate your kind and professional comments. We revised our paper according to your points, as described below.

Point 1: How do the authors deal with the influence of shadows on the image?

Response 1: We added an article to the references and noted in subsection 3.1 that Mask R-CNN can distinguish an object's shadow from the object, as follows.

Bakr et al. [33] showed that Mask R-CNN could distinguish an object's shadow from an object with 98.09% accuracy.

Reference,

[33] Bakr, H.; Hamad, A.; Amin, K. Mask R-CNN for moving shadow detection and segmentation. IJCI. International Journal of Computers and Information 2021, 8, 1, 1-18.

Point 2: For the pedestrians and objects in the image, how is the outline of travelers correctly classified, as in the original figure in Fig. 6?

Response 2: We added Figure 2 to describe the Mask R-CNN framework for instance segmentation, and explained in subsection 3.1 how Mask R-CNN correctly classifies the outline of a pedestrian, as follows.

Mask R-CNN is a segmentation model that localizes objects at the pixel level. It uses RoIAlign to distinguish objects pixel by pixel in an image. The RoIAlign technique prevents the loss of object location information and yields accurate feature maps. Figure 2 describes the Mask R-CNN framework for instance segmentation.
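To illustrate why RoIAlign preserves location information, the following is a minimal sketch of its core idea: sampling the feature map at fractional coordinates with bilinear interpolation instead of rounding the region boundaries to whole cells. It is not the authors' implementation and not the API of any library; the function names, the region format `(y_start, x_start, y_end, x_end)`, and the toy feature map are assumptions made for illustration.

```python
import math


def bilinear_sample(feature, y, x):
    """Interpolate a 2-D feature map (list of rows) at a fractional (y, x)."""
    h, w = len(feature), len(feature[0])
    # Clamp so that the four neighbouring cells stay inside the map.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(feature, roi, out_size=2, samples=2):
    """Average-pool a fractional region into out_size x out_size bins.

    Hypothetical helper: roi is (y_start, x_start, y_end, x_end) in
    feature-map coordinates and is never rounded to integer cells.
    """
    y_start, x_start, y_end, x_end = roi
    bin_h = (y_end - y_start) / out_size
    bin_w = (x_end - x_start) / out_size
    pooled = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            total = 0.0
            for sy in range(samples):
                for sx in range(samples):
                    # Regularly spaced sample points inside the bin.
                    y = y_start + (i + (sy + 0.5) / samples) * bin_h
                    x = x_start + (j + (sx + 0.5) / samples) * bin_w
                    total += bilinear_sample(feature, y, x)
            row.append(total / (samples * samples))
        pooled.append(row)
    return pooled


# Toy 4x4 feature map whose value at (row, col) is 4 * row + col.
feature_map = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
pooled = roi_align(feature_map, roi=(0.5, 0.5, 2.5, 2.5))
print(pooled)
```

Because the region boundaries `0.5` and `2.5` are used as given, a small shift of the region changes the pooled values smoothly, whereas rounding the boundaries first would snap the region to whole cells and lose sub-cell position.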

Point 3: In the first part, the authors should add an introduction to pedestrian posture recognition using sensors to improve the readability of the paper, for example:

Shi, L.-F.; Liu, Z.-Y.; Zhou, K.-J.; Shi, Y.; Jing, X. Novel Deep Learning Network for Gait Recognition Using Multimodal Inertial Sensors. Sensors, 2023, 23, 849.

Response 3: We added the article to the references and wrote the following in Section 1.

Furthermore, Shi et al. [9] recently suggested a new gait recognition system with a deep learning network. They used multimodal inertial sensors for their system.

Reference,

[9] Shi, L.-F.; Liu, Z.-Y.; Zhou, K.-J.; Shi, Y.; Jing, X. Novel Deep Learning Network for Gait Recognition Using Multimodal Inertial Sensors. Sensors, 2023, 23, 849.

Point 4: The information in reference [31] is incorrect. It should be:

M. A. Malbog, "MASK R-CNN for Pedestrian Crosswalk Detection and Instance Segmentation," 2019 IEEE 6th International Conference on Engineering Technologies and Applied Sciences (ICETAS), Kuala Lumpur, Malaysia, 2019, pp. 1-5, doi: 10.1109/ICETAS48360.2019.9117217.

Response 4: We revised the reference as follows.

Reference,

[32] Malbog, M.A. MASK R-CNN for Pedestrian Crosswalk Detection and Instance Segmentation. 2019 IEEE 6th International Conference on Engineering Technologies and Applied Sciences (ICETAS), Kuala Lumpur, Malaysia, Xplore 2019, pp. 1–5, doi: 10.1109/ICETAS48360.2019.9117217.

Author Response File: Author Response.docx

Reviewer 3 Report

Please see the attached file

Comments for author File: Comments.pdf

Author Response

Response to Reviewer 3 Comments

We appreciate your kind and professional comments. We revised our paper according to your points, as described below.

(Figures are included in the attached Word file.)

Point 1: Lines 124 to 126 describe evaluating the trained CNN with test images that differ from the training images, yet the test images are created by the same process as the training images, through stages 1, 2, and 3. What is the difference between the test images and the training images, given that both come from the same process? In other words, how can the test images evaluate the performance?

Response 1: Our research aims to create a system that can identify whether pedestrians are safe. For this system, we developed stages 1 to 3 as the process that converts the original image. The stage 1 to 3 transformation is therefore the key contribution of our study. Without it, the test accuracy is 70%; with it, the accuracy increases to 98%.

We therefore trained the CNN with data created through stages 1 to 3 and then tested it on data created by the same process. We added Table 3 to clarify the details of the experiments.

Point 2: Line 206 and Figure 4c show the image process in experiment 2. Why are there different situations, IN and OUTSIDE, for images with the same shape? How do you explain that a pedestrian positioned in front of or behind the crosswalk marking represents two different situations?

Response 2: The two figures indicate whether the pedestrian's foot is inside or outside the crosswalk, based on the human shape. Please refer to the picture in the attached Word file.

Point 3: In lines 207-211 and Figure 6, does the camera angle differ from one image to another? Did you take this into account in your proposed system? How do you explain it?

Response 3: We created the image data with varied camera angles to account for such special situations.

Even in this case, the crosswalk was marked in black and the person in red, so there was no problem training and testing the CNN. Please refer to the picture in the attached Word file.

Point 4: What is the procedure, in terms of input and output, of each method (Mask R-CNN, etc.)? See line 131.

Response 4: The input and output of each stage are shown in the figures below.

[Figures: Stage 1, Stage 2, Stage 3 (see the attached Word file)]

Point 5: Did you compare your proposed method with other existing methods in terms of computation cost? Is Mask R-CNN by itself not capable of giving approximately the same results as your findings? Are there any comparisons?

Response 5: We added Table 1 to compare our system with related work and discussed it in Section 2, as follows.

Table 1 compares previous studies on crosswalk pedestrian situation recognition with our study on five items and shows actual detected sample images. The five items are: accuracy of detecting a pedestrian and a crosswalk, use of deep learning, crosswalk detection, the car driver's point of view, and the method of detecting the crosswalk. Our system showed the highest accuracy and satisfied all other items, outperforming the systems presented in previous research. When we tried to detect the crosswalk with YOLO, we obtained only a simple box shape rather than the exact crosswalk shape. Also, Mask R-CNN could not reliably detect crosswalks when road conditions left the crossing lines invisible. We therefore developed our CDA, which can draw the crosswalk shape even if some parts of the zebra crossing lines are erased.

Point 6: Lines 164-170 (training using CNN) mention that you proceeded by combining the Mask R-CNN and CDA images, without specifying the combination process. How did you make this possible? What did you use?

Response 6: We used Python and OpenCV to combine the two images output by stages 1 and 2. OpenCV's "hconcat" function was one of the useful tools for this.
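As a rough illustration of the combination step mentioned in Response 6, the sketch below joins two images side by side with OpenCV's `hconcat`. The image size, the drawn regions, and the assumption that one image holds the pedestrian (in red) and the other the crosswalk (in black) are hypothetical; the response does not specify how the authors' pipeline arranges its outputs. If OpenCV is not installed, the sketch falls back to `numpy.hstack`, which gives the same result for arrays of equal height.

```python
import numpy as np

try:
    import cv2
    hconcat = cv2.hconcat
except ImportError:  # fallback so the sketch still runs without OpenCV
    hconcat = np.hstack

HEIGHT, WIDTH = 64, 64  # hypothetical image size

# Hypothetical pedestrian image: white canvas, person region in red (BGR order).
person_img = np.full((HEIGHT, WIDTH, 3), 255, dtype=np.uint8)
person_img[20:50, 28:36] = (0, 0, 255)

# Hypothetical crosswalk image: white canvas, crosswalk region in black.
crosswalk_img = np.full((HEIGHT, WIDTH, 3), 255, dtype=np.uint8)
crosswalk_img[40:60, 8:56] = (0, 0, 0)

# hconcat needs inputs with the same number of rows and the same dtype.
combined = hconcat([person_img, crosswalk_img])
print(combined.shape)  # (64, 128, 3)
```

The result keeps the height of the inputs and sums their widths, so a classifier that receives `combined` sees both images in a single input.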

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report

The authors have addressed the issues raised during the first review round well; the study design is better and the presentation of the results is clearer. The manuscript is acceptable after minor checks and revisions.

Reviewer 3 Report

Please check your paper for English language editing, and ensure that you follow the Editorial Formatting Comments listed in this letter. Congratulations.
