Article
Peer-Review Record

Sensor Fusion with Deep Learning for Autonomous Classification and Management of Aquatic Invasive Plant Species

by Jackson E. Perrin 1, Shaphan R. Jernigan 1, Jacob D. Thayer 2, Andrew W. Howell 3, James K. Leary 2 and Gregory D. Buckner 1,*
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3:
Submission received: 21 May 2022 / Revised: 22 June 2022 / Accepted: 24 June 2022 / Published: 28 June 2022

Round 1

Reviewer 1 Report

The paper presents a novel approach for automated identification of aquatic invasive plants by using a multi-sensor classification system in SAVs.
The paper is very well written and targets an interesting practical application.

I have the following minor concerns:
- Chapter "Neural Network Training": the architecture is not completely clear. The author state that the fully connected layers were replaced. By what layers?
- The hyper parameters of the training, especially the learning rate, are not given.
- Chapter "Multi-Sensor Classification": an additional approach would be to fuse the information in a fully connected layers, which is also very common. It might be interessting to read why this approach was not considered here.
- Figure 5/6/7/8: the figure descriptions should explain if this are results from training or test set
- The dataset seems to be very unbalanced (OTHER has much more samples than HYDR or NONE). Therefore the accuracy is not a good performance measure. The authors should consider actions for the unbalanced dataset (like for example minority class oversampling, etc.)
- The numerical results given as recall for the classes could better be given additionally in a table for all the approaches, which would improve the comparability.

Author Response

The authors of Robotics-1758476 (“Sensor Fusion with Deep Learning for Autonomous Classification and Management of Aquatic Invasive Plant Species”) would like to thank Reviewer 1 for the thorough evaluation of our manuscript and the constructive feedback. We have carefully considered each reviewer comment, and have addressed each as summarized below:

Comment: The paper presents a novel approach for automated identification of aquatic invasive plants by using a multi-sensor classification system in SAVs. The paper is very well written and targets an interesting practical application.

Response: The authors appreciate these supportive comments.

Comment: I have the following minor concerns: Chapter "Neural Network Training": the architecture is not completely clear. The authors state that the fully connected layers were replaced. By what layers?

Response: We have clarified the architecture description in the revised manuscript. It now reads “A ResNet-50 model with weights pre-trained on the popular ImageNet dataset was selected as the basis for transfer learning. The convolutional weights were loaded, while the final multi-layer perceptron layer that performs classification on the extracted features was replaced. The pre-trained multi-layer perceptron layer was replaced with a feature classification pipeline consisting of 2D average pooling, feature flattening, fully connected layers of dimensions 1000x10 and 10x3, and a final output Softmax layer.”
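
For concreteness, a minimal PyTorch sketch of this kind of head replacement is shown below. The framework, the ReLU activations, and the assumed 2048-to-1000 bridging layer (needed so ResNet-50's pooled 2048-dimensional features line up with the described 1000x10 layer) are illustrative assumptions, not details taken from the manuscript.

```python
# Illustrative sketch only; the manuscript's exact implementation is not reproduced here.
import torch.nn as nn
from torchvision import models

# Load ResNet-50 with ImageNet-pretrained convolutional weights.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

# Replace the pre-trained classification head. ResNet-50 already applies 2D average
# pooling and flattening before this layer, yielding a 2048-dimensional feature vector.
backbone.fc = nn.Sequential(
    nn.Linear(2048, 1000),   # assumed bridge so the described 1000x10 layer fits
    nn.ReLU(),
    nn.Linear(1000, 10),     # 1000x10 fully connected layer
    nn.ReLU(),
    nn.Linear(10, 3),        # 10x3 layer -> three classes (HYDR, NONE, OTHER)
    nn.Softmax(dim=1),       # final output Softmax layer
)
```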

Comment: The hyperparameters of the training, especially the learning rate, are not given.

Response: Excellent observation. We have clarified the training parameters in the revised manuscript. It now reads “Training was conducted over 100 epochs, using the Adam optimizer with a learning rate of 0.001, with early stopping and dropout to lessen overfitting.” Also in this section, we have added “Of the total image pairs generated, 15% were allocated for testing. The remaining 85% of image pairs were partitioned into 80% for training and 20% for validation.”
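
A minimal sketch of the described data split and optimizer configuration, assuming a PyTorch workflow (function and variable names are placeholders; early stopping and dropout are omitted for brevity):

```python
import torch
from torch.utils.data import Dataset, random_split

def split_and_configure(dataset: Dataset, model: torch.nn.Module):
    """15% held out for testing; the remaining 85% split 80/20 into training and validation."""
    n_total = len(dataset)                      # paired aerial RGB / hydroacoustic images
    n_test = int(0.15 * n_total)
    n_val = int(0.20 * (n_total - n_test))
    n_train = n_total - n_test - n_val
    train_set, val_set, test_set = random_split(dataset, [n_train, n_val, n_test])

    # Adam optimizer with the stated learning rate of 0.001.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    return train_set, val_set, test_set, optimizer
```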

Comment: Chapter "Multi-Sensor Classification": an additional approach would be to fuse the information in a fully connected layers, which is also very common. It might be interesting to read why this approach was not considered here.

Response: This is another good point. We have addressed it with the following sentences in the revised manuscript: “A second technique, stacking, also involves training multiple networks on subsets of the training data. In a stacking approach, multiple base models are trained, after which a fully connected network learns how to fuse the outputs of these base models. However, this approach requires that a set of validation data be withheld to train the fully connected output model [11]. Because of this, stacking, like bagging, requires larger datasets than were available.”
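
For illustration only, a minimal sketch of the stacking idea described above (an approach the response notes was not adopted); it assumes the fused input is the concatenated class probabilities of the base models:

```python
import torch
import torch.nn as nn

class StackingFusion(nn.Module):
    """Fully connected meta-model that learns to fuse base-model outputs.

    It must be trained on validation data withheld from base-model training.
    """
    def __init__(self, n_base_models: int, n_classes: int = 3):
        super().__init__()
        self.fuse = nn.Linear(n_base_models * n_classes, n_classes)

    def forward(self, base_outputs: list) -> torch.Tensor:
        # base_outputs: one (batch, n_classes) probability tensor per base model.
        return self.fuse(torch.cat(base_outputs, dim=1))
```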

Comment: Figures 5/6/7/8: the figure descriptions should explain whether these are results from the training or the test set.

Response: The figure captions have been revised to indicate that each of the results apply to test images.

Comment: The dataset seems to be very unbalanced (OTHER has many more samples than HYDR or NONE). Therefore, accuracy is not a good performance measure. The authors should consider actions for the unbalanced dataset (for example, minority class oversampling).

Response: This is another good suggestion by Reviewer 1. We have addressed this issue of imbalanced datasets in the revised manuscript: "The datasets generated for both the aerial RGB images and the hydroacoustic images were unbalanced, with a disproportionate number of images from the “OTHER” class. Using standard log losses for training, the network would likely tend to overpredict the “OTHER” class to minimize training loss. In response to similar unbalanced dataset issues in computer vision applications, Lin et al. introduced the concept of focal loss, a modification to the standard log loss formulation, FL(p_t) = -(1 - p_t)^γ log(p_t), where p_t is the predicted probability of the true class [9]. This modification provides a simple method for addressing class imbalance, as the loss contribution of each image in the training set decreases as the prediction confidence increases. Thus, as the network learns to confidently predict classes that make up larger portions of the training set, it becomes increasingly difficult to decrease training loss by increasing accuracy for those classes. For training purposes, focal loss was implemented with γ = 5."
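
A minimal sketch of this focal loss, assuming a PyTorch implementation with γ = 5 as stated above (the manuscript's own code is not reproduced here):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor, gamma: float = 5.0) -> torch.Tensor:
    """FL(p_t) = -(1 - p_t)^gamma * log(p_t), averaged over the batch."""
    log_probs = F.log_softmax(logits, dim=1)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)  # log p_t of the true class
    pt = log_pt.exp()
    return (-(1.0 - pt) ** gamma * log_pt).mean()
```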

Comment: The numerical results given as recall for the classes would be better presented additionally in a table for all the approaches, which would improve comparability.

Response: This is a very good recommendation. Table 1 has been revised to include precision, recall, and overall accuracy for each of the approaches.
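
For readers reproducing such a table, a short sketch of how per-class precision, recall, and overall accuracy can be computed from a confusion matrix (class names follow the paper; the values passed in are not the paper's results):

```python
import numpy as np

def summarize(conf: np.ndarray, classes=("HYDR", "NONE", "OTHER")) -> None:
    """conf[i, j] = number of test images of true class i predicted as class j."""
    for i, name in enumerate(classes):
        recall = conf[i, i] / conf[i, :].sum()
        precision = conf[i, i] / conf[:, i].sum()
        print(f"{name}: precision = {precision:.3f}, recall = {recall:.3f}")
    print(f"overall accuracy = {np.trace(conf) / conf.sum():.3f}")
```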

Reviewer 2 Report

In this paper, a multi-sensor classification algorithm is developed for the classification of subaquatic vegetation using transfer learning to train models that are generalizable under different depth, turbidity, and lighting conditions. The paper also introduced a new dataset, obtained with a drone equipped with suitable cameras. Several fusion methods have been tested to benefit from the multi-sensor data. Empirical experiments were performed to demonstrate that the method is effective. Experimental results support that transfer learning leads to improved performance.
The paper reads well and is on an interesting topic with practical benefits. I have the following comments to be incorporated before publication.

1. Please perform ablative experiments by performing classification using single-modality data to demonstrate that fusion is indeed effective.

2. In addition to ResNet, I would like to see results using another network architecture. This will allow us to study the effectiveness of the approach on other models.

3. Continual learning is a subset of transfer learning that focuses on the automation of exploring agents. I think it deserves to be included in the Introduction section with the following works cited as relevant papers:
a. Ashfahani, A. and Pratama, M., 2019, May. Autonomous deep learning: Continual learning approach for dynamic environments. In Proceedings of the 2019 SIAM International Conference on Data Mining (pp. 666-674). Society for Industrial and Applied Mathematics.
b. Rostami, M., Kolouri, S., Pilly, P. and McClelland, J., 2020, April. Generative continual concept learning. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, No. 04, pp. 5545-5552).
c. Jha, S., Schiemer, M., Zambonelli, F. and Ye, J., 2021. Continual learning in sensor-based human activity recognition: An empirical benchmark analysis. Information Sciences, 575, pp. 1-21.
d. Li, D., Liu, S., Gao, F. and Sun, X., 2020. Continual learning classification method with new labeled data based on the artificial immune system. Applied Soft Computing, 94, p. 106423.

4. Please run your experiments several times and report both the average performance and the standard deviation to make more informative comparison possible.

5. Please release the code on a public domain such as GitHub so other researchers can benefit from this work.

Author Response

The authors of Robotics-1758476 (“Sensor Fusion with Deep Learning for Autonomous Classification and Management of Aquatic Invasive Plant Species”) would like to thank Reviewer 2 for the thorough evaluation of our manuscript and the constructive feedback. We have carefully considered each reviewer comment, and have addressed each as summarized below:

Comment: Please perform ablative experiments by performing classification using single-modality data to demonstrate that fusion is indeed effective.   

Reply: This is a good point. Please note that single-modality results were, in fact, included earlier; however, we admit our initial draft was not clear in this regard. Thus, we have revised Table 1 to include results for all single-modality and ensemble approaches.

Comment: In addition to ResNet, I would like to see results using another network architecture. This will allow us to study the effectiveness of the approach on other models.

Reply: In response to this recommendation, we trained a DenseNet architecture for comparison to our ResNet. The following text in the revised manuscript details this comparison: “To assess performance with another network architecture, network training and testing were repeated for the six approaches above with the DenseNet model. Identical training and test data sets and methods were used for both the ResNet and DenseNet models. The DenseNet architecture yielded lower or nearly equal (within 0.3%) overall accuracy for all but the average ensemble approach (DenseNet average overall accuracy 69.0% vs. 70.9% for ResNet). In five out of six approaches, the DenseNet model produced lower classifications of HYDR (average 14.6% fewer), and higher classifications of NONE (average 8.3% higher) and OTHER (average 16.3% higher) than the ResNet model.”
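
A hedged sketch of this kind of backbone swap, assuming torchvision and a DenseNet-121 variant (the response does not specify which DenseNet depth was used); the replaced classification head mirrors the ResNet sketch earlier in this record:

```python
import torch.nn as nn
from torchvision import models

densenet = models.densenet121(weights=models.DenseNet121_Weights.IMAGENET1K_V1)
in_features = densenet.classifier.in_features   # 1024 for DenseNet-121

# Replace the pre-trained classifier with the same style of head used for the ResNet model;
# identical training/test data and procedures would then be reused for both backbones.
densenet.classifier = nn.Sequential(
    nn.Linear(in_features, 10),
    nn.ReLU(),
    nn.Linear(10, 3),
    nn.Softmax(dim=1),
)
```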

Comment: Continual learning is a subset of transfer learning that focuses on the automation of exploring agents. I think it deserves to be included in the Introduction section with the following work cited as relevant papers:   a. Ashfahani, A. and Pratama, M., 2019, May. Autonomous deep learning: Continual learning approach for dynamic environments. In Proceedings of the 2019 SIAM International Conference on Data Mining (pp. 666-674). Society for Industrial and Applied Mathematics.   b. Rostami, M., Kolouri, S., Pilly, P. and McClelland, J., 2020, April. Generative continual concept learning. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, No. 04, pp. 5545-5552).   c. Jha, S., Schiemer, M., Zambonelli, F. and Ye, J., 2021.  Continual learning in sensor-based human activity recognition: An empirical benchmark analysis. Information Sciences, 575, pp.1-21.   d. Li, D., Liu, S., Gao, F. and Sun, X., 2020. Continual learning classification method with new labeled data based on the artificial immune system. Applied Soft Computing, 94, p.106423. 

Reply: Thank you for this insight. We have amended the Introduction section to include discussion of continual learning and incorporated the suggested references into the draft.

Comment: Please run your experiments several times and report both the average performance and the standard deviation to make more informative comparison possible.

Reply: This is a good recommendation. Please note that the training data does include datasets from two separate lakes, Lake Mann and Lake Sampson, with data collection repeated monthly five and seven times, respectively, for each of these lakes. It was not practical to report results individually; thus, they have been lumped into a single output.

Comment: Please release the code on a public domain such as GitHub so other researchers can benefit from this work.

Reply: We appreciate this comment, and will be happy to post this code on GitHub.

 

Reviewer 3 Report

This manuscript presents a multi-sensor classification method for subaquatic vegetation images using ConvNets based on the ResNet-50 backbone. The experimentally obtained results using originally developed datasets revealed that the classification accuracy of the proposed method was greater than that of the Monte Carlo dropout ensemble learning-based method. Nevertheless, the questions below remain subjects that must be addressed in this manuscript.

 

1. Based on the previous work [7] using the AlexNet backbone, the ResNet-50 backbone was employed in this method. Although the ResNet backbone offers excellent performance, it is somewhat outdated. Thanks to active deep learning research, deeper and higher-performing backbones have already been proposed, such as Inception, InceptionResNet, MobileNet, DenseNet, NASNet, ResNeXt, SENet, and SE-ResNeXt. Furthermore, transformer-based backbones using attention mechanisms have been applied to solve numerous computer vision problems. Unfortunately, such state-of-the-art backbones have not been considered for the proposed method.

 

2. The dataset was split into 85% training images and 15% test images. Did you split the training images into a training subset and a validation subset?

 

3. Transfer learning is applied across objects from different categories. Your dataset and the dataset provided by reference [6] are in the same domain. Therefore, your implementation would belong to domain adaptation.

 

4. What are the white areas on the left images in Figure 4?

 

5. Please show loss curves. It helps readers to understand the training process of the evaluation experiment.

 

6. The accuracy of the comparison methods is summarized in Table 1. However, the accuracy of the proposed method is scattered throughout the text.

 

7. Different confusion matrices are also scattered with the same caption. It is recommended that they be integrated into each experiment. In addition, color bars scaled to the number of samples in the cells of each matrix are required.

 

8. MATLAB is just a commercial SDK. The repetitive descriptions of this tool and its functions are too verbose for an academic paper.

 

9. The conclusion of this manuscript is unclear because it is included in the discussion section.

 

10. Did you achieve real-time classification?

Author Response

The authors of Robotics-1758476 (“Sensor Fusion with Deep Learning for Autonomous Classification and Management of Aquatic Invasive Plant Species”) would like to thank Reviewer 3 for the thorough evaluation of our manuscript and the constructive feedback. We have carefully considered each reviewer comment, and have addressed each as summarized below:

Comment: Based on the previous work [7] using the AlexNet backbone, the ResNet-50 backbone was employed in this method. Although the ResNet backbone offers excellent performance, it is somewhat outdated. Thanks to active deep learning research, deeper and higher-performing backbones have already been proposed, such as Inception, InceptionResNet, MobileNet, DenseNet, NASNet, ResNeXt, SENet, and SE-ResNeXt. Furthermore, transformer-based backbones using attention mechanisms have been applied to solve numerous computer vision problems. Unfortunately, such state-of-the-art backbones have not been considered for the proposed method.

Reply: In response to this recommendation, we trained a DenseNet architecture for comparison to our ResNet. The following text in the revised manuscript details this comparison: “To assess performance with another network architecture, network training and testing were repeated for the six approaches above with the DenseNet model. Identical training and test data sets and methods were used for both the ResNet and DenseNet models. The DenseNet architecture yielded lower or nearly equal (within 0.3%) overall accuracy for all but the average ensemble approach (DenseNet average overall accuracy 69.0% vs. 70.9% for ResNet). In five out of six approaches, the DenseNet model produced lower classifications of HYDR (average 14.6% fewer), and higher classifications of NONE (average 8.3% higher) and OTHER (average 16.3% higher) than the ResNet model.”

Comment: The dataset was split into 85% training images and 15% test images. Did you split the training images into a training subset and a validation subset?

Reply: The training images were, in fact, split into a training subset and a validation subset; however, this was not specifically stated in the text. Verbiage has been added to the Neural Network Training section to clarify.

Comment: Transfer learning is applied across objects from different categories. Your dataset and the dataset provided by reference [6] are in the same domain. Therefore, your implementation would belong to domain adaptation.

Reply: This is an interesting point. Reference [6] refers to the DeepWeeds project, which uses RGB images of Australian plants on land. Our work, of course, involves aquatic vegetation, which constitutes a substantially different imaging domain.

Comment: What are the white areas on the left images in Figure 4?

Reply: These solid white areas in acoustic images represent regions in which acoustic data is not present due to automatic depth ranging of the sonar unit. This explanation has been added to the figure caption to clarify.

Comment: Please show loss curves. It helps readers to understand the training process of the evaluation experiment.

Reply: This is an excellent recommendation. In response we have included a new figure (Figure 4 in our revised manuscript), presenting training and validation loss curves for RGB and hydroacoustic neural network training.

Comment: The accuracy of the comparison methods is summarized in Table 1. However, the accuracy of the proposed method is scattered throughout the text.

Reply: We agree that our original presentation of data was lacking, and we have improved it in the revised manuscript. Specifically, Table 1 has been revised to include precision, recall, and overall accuracy for each of the approaches.

Comment: Different confusion matrices are also scattered with the same caption. It is recommended that they be integrated into each experiment. In addition, color bars scaled to the number of samples in the cells of each matrix are required.

Reply: We now realize that several of our original captions were not copied over properly when placing material into the MDPI template; these captions have been corrected. Also, color bars have been added to all confusion matrix figures in response to this recommendation.
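
A minimal matplotlib sketch of a confusion-matrix figure with a color bar of the kind added in the revision; the class labels follow the paper, while the cell counts are placeholders rather than actual results:

```python
import matplotlib.pyplot as plt
import numpy as np

conf = np.array([[50, 5, 10], [4, 30, 6], [8, 7, 120]])   # placeholder counts only
labels = ["HYDR", "NONE", "OTHER"]

fig, ax = plt.subplots()
im = ax.imshow(conf, cmap="Blues")
ax.set_xticks(range(len(labels)))
ax.set_xticklabels(labels)
ax.set_yticks(range(len(labels)))
ax.set_yticklabels(labels)
ax.set_xlabel("Predicted class")
ax.set_ylabel("True class")
fig.colorbar(im, ax=ax, label="Number of test images")    # color bar scaled to cell counts
plt.show()
```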

Comment: MATLAB is just a commercial SDK. The repetitive descriptions of this tool and its functions are too verbose for an academic paper.

Reply: We agree with this assessment, and have removed references to specific MATLAB functions from the draft.

Comment: The conclusion of this manuscript is unclear because it is included in the discussion section.

Reply: We have addressed this concern by renaming the final section to be Discussion and Conclusions.

Comment: Did you achieve real-time classification?

Reply: This is a keen observation. Although our prior draft failed to provide further details, we have added text to the Discussion and Conclusions section indicating the current state of real-time classification: “As of this publication, real-time classification utilizing the sensor fusion concept of this paper is currently under development. This methodology utilizes a Windows laptop running MATLAB and Python scripts for data processing and classification, and a wireless SD card (Toshiba FlashAir, Toshiba, Tokyo, Japan) to transmit sonar log data from the sonar head unit to the laptop. UAV-based RGB images are captured asynchronously, prior to classification, and copied to the laptop hard drive. Tests to date indicate that sonar data collection and processing and sensor fusion classification can be performed in real time; however, transmission of sonar log data via the wireless SD card is currently inconsistent.”

Round 2

Reviewer 2 Report

The authors have addressed my concerns convincingly. 

Author Response

The authors would like to thank Reviewer 2 for the thoroughness and promptness of both reviews.

Reviewer 3 Report

The revised manuscript has been adequately revised based on my questions and comments. The addition of comparison results between ResNet and DenseNet improves the reliability of this study focused on SAV classification. Advances in computer hardware will enable real-time processing. As a minor question, what is the parameter N in the abstract?

Author Response

The authors would like to thank Reviewer 3 for the thoroughness and promptness of both reviews. Regarding the parameter "N" in the abstract, it refers to the number of trials performed and compared. We will clarify this in the final submission to make it more evident.
