Article
Peer-Review Record

Hand Gesture Recognition Using Automatic Feature Extraction and Deep Learning Algorithms with Memory

Big Data Cogn. Comput. 2023, 7(2), 102; https://doi.org/10.3390/bdcc7020102
by Rubén E. Nogales * and Marco E. Benalcázar
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 23 March 2023 / Revised: 5 May 2023 / Accepted: 10 May 2023 / Published: 23 May 2023

Round 1

Reviewer 1 Report

This paper presents hand gesture recognition from Leap Motion Controller data. For the paper to be publishable, I suggest taking the following comments into account:

1. There is a typo in Figure 2 caption.

2. Line 166 has an incomplete sentence "Figure 3."

3. What is the sample time? In line 167 it says the participant has 5 seconds to perform a gesture, while in line 176 it says the dataset has 70 time frames at a sampling frequency of 70 Hz, which corresponds to 1 second.

4. Table 1 is never referenced in the text.

5. I understood from the experimental design that convolution is performed over the flattened feature-time domain. If that is so, the local filters look for local correspondences between dimensions, not along the time domain, which is counter-intuitive.
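To illustrate this comment, here is a minimal sketch of the two data layouts. The 70-frame-by-30-feature shape is an assumption, chosen so that flattening yields the 2100 values mentioned later in this review; it is not taken from the authors' code.

```python
import numpy as np

# Hypothetical shapes: 70 time frames, 30 features per frame.
T, F = 70, 30
x = np.random.rand(T, F)

# Flattened layout: one long vector of length T*F. A 1-D convolution
# window sliding over this axis mixes neighbouring feature entries
# within a frame, not the same feature across time.
flat = x.reshape(-1)   # shape (2100,)

# Channel layout: keep time as the spatial axis and treat each feature
# as a channel, so a kernel of width k spans k consecutive time steps.
channels = x.T         # shape (F, T) = (channels, time)

print(flat.shape, channels.shape)
```

A kernel sliding along the last axis of `channels` compares the same features at neighbouring time steps, whereas a kernel sliding over `flat` correlates unrelated dimensions inside a single frame.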

6. Figure 6 is missing an arrow in the first block. The average pooling in the second block shows a two-dimensional stride that differs from the one given in the text. What is the "Pixel Layer"?

7. How are the model architecture and hyperparameters selected? What is the motivation for using two different activation functions on hidden layers of the ANN classifier?

8. The parameters in lines 277-279 are not explained properly. What are LearnRateDropFactor = 1 and LearnRateDropPeriod = 1? These seem like variable names related to some specific piece of code. Does this drop factor mean the learning rate is not changing?
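For context, LearnRateDropFactor and LearnRateDropPeriod appear to be parameters of MATLAB's trainingOptions for a piecewise-constant learning-rate schedule. A minimal Python sketch of such a schedule (the initial rate is a hypothetical value) shows that a drop factor of 1 keeps the rate constant:

```python
# Sketch of a step learning-rate schedule; the names mirror MATLAB's
# trainingOptions parameters, and the values are hypothetical.
initial_lr = 0.01
drop_factor = 1    # rate is multiplied by this every drop period
drop_period = 1    # period length, in epochs

def lr_at_epoch(epoch):
    # With drop_factor = 1 the product never changes.
    return initial_lr * drop_factor ** (epoch // drop_period)

rates = [lr_at_epoch(e) for e in range(5)]
print(rates)  # constant: [0.01, 0.01, 0.01, 0.01, 0.01]
```

Under this reading, setting the drop factor to 1 means the schedule degenerates to a fixed learning rate, which is what the comment is asking the authors to confirm.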

9. Why does the augmentation increase the data to only 3 times the original number of samples? Would it not be possible to implement online augmentation to generate non-repeating samples?

10. I don't understand the sentence on line 282 and the caption of Table 2: what is recognition performance, and how is it different from testing?

11. Where does the figure of "586 predictors" in line 289 come from? The average pooling has stride 1, so there should not be any subsampling in the spatial dimension.

12. There is an inconsistency in the figure numbering: Figures 7 and 8 are missing.

13. Does "dropout layer is configured with a regularization factor of 1e-4" mean that the dropping probability is 0.0001? If so, does it have any effect on the model at such a low value?
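Assuming the quoted factor is indeed the Bernoulli drop probability (the interpretation this comment asks about), a quick Monte Carlo sketch with hypothetical numbers shows how little p = 1e-4 removes:

```python
import random

# Simulate dropout masks at p = 1e-4: count the fraction of
# activations that survive out of 100,000 draws.
random.seed(0)
p = 1e-4
n = 100_000
kept = sum(1 for _ in range(n) if random.random() >= p)
print(kept / n)  # ~0.9999 of activations survive
```

At that rate, roughly one activation in ten thousand is zeroed, so at training time the layer behaves almost as an identity and its regularizing effect is negligible.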

14. The BiLSTM model described is a hybrid of a CNN and an LSTM, which is not emphasized in the text. This hybrid architecture has more convolutional layers and parameters. Why are the CNN parts not kept the same, to show the contribution of the LSTM module?

15. What is the motivation for training a separate ANN classifier on top of neural network features? Should it not perform the same if the extra hidden layer were added directly to the full model?

16. Are the models capable of running in real-time mode, as is usually required for gesture recognition? There is no mention of computational complexity or run-time analysis.

17. The related work should describe studies related to the proposed work, not the steps the authors took to find them.

18. The "manual" features are described very vaguely, without details or references to works where they are explained more fully.

19. The authors in [16] used only the KNN classifier as opposed to what is said in line 99.

20. The novelty aspect with respect to the existing literature is not clearly emphasized.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

This article presents hand gesture recognition algorithms using deep learning. The main objective of the article is to evaluate manual and automatic feature extraction for the hand gesture recognition problem.

The paper lacks a dataset description section, and it is not clear how many sample videos are used for model training and evaluation. There are also no details about model training, such as which algorithms are used to train the presented CNN model. In addition, I do not see any outstanding novelty in this research work.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

1. The work looks interesting, but I would like to figure out what is new in this research.

2. As most of the algorithms used already exist, what are the new findings of this research? A justification is needed.

3. Most of the figures are not clear (e.g., Figures 3 and 4).

4. Based on what parameters are the authors trying to validate their work?

5. Only accuracy is considered; there should be at least two parameters for comparison.

6. The proposed work should be compared with traditional algorithms to check its performance.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

I would like to thank the authors for their effort in replying to the comments. However, some of the previously raised problems remain unresolved.

3. The question referred to the sample length (5 seconds in line 167 vs. 1 second in line 176) rather than to the choice of sampling rate.

5. This flattened representation, however, does not have the locality property that convolution is designed to exploit. In time-domain data, the different features are normally included as separate channels and the time dimension is preserved. Why is this not done in this work?

6. Figure 6 was corrected, but the same errors remain in Figure 7.

10. The only explanation of recognition is that "the algorithm must be able to return the time at which the gesture is performed", but the manuscript does not explain how the model predicts this time or how that prediction is evaluated.

11. All the convolution and pooling layers use stride 1 according to the figures and text, so the dimensionality should not be reduced from 2100. Moreover, the last convolution layer has 8 filters, and 586 is not divisible by 8.

16. The BiLSTM model has more convolution layers with a higher number of filters and an additional LSTM block, yet according to the run-time analysis it has a shorter run-time than the smaller CNN model.

17. The related work is now more focused on the literature, yet it still describes the search procedure in line 73.

18. The added paragraph lists a set of methods used in another work; it does not explain the feature extractors used in this work and only creates more confusion.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

My comments are addressed properly in the revised manuscript.

Author Response

Dear reviewer, thank you for your comments.

Round 3

Reviewer 1 Report

11. It would be clearer if the model diagrams in Figures 7 and 8 showed the shape of the data tensor (including the channel dimension) at the input and after each convolution and pooling operation.

18. This paragraph, regardless of the use of windows, is written in a way that leaves it unclear what is used in THIS work, i.e., whether the described features are used here in the same way or not. It also says 17 features are used, but only 6 are named.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 4

Reviewer 1 Report

The main issues previously raised have been addressed and corrected.
