Article
Peer-Review Record

3D-CNN-Based Fused Feature Maps with LSTM Applied to Action Recognition

Future Internet 2019, 11(2), 42; https://doi.org/10.3390/fi11020042
by Sheeraz Arif *, Jing Wang *, Tehseen Ul Hassan and Zesong Fei
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 20 December 2018 / Revised: 6 February 2019 / Accepted: 6 February 2019 / Published: 13 February 2019
(This article belongs to the Special Issue Innovative Topologies and Algorithms for Neural Networks)

Round 1

Reviewer 1 Report

This paper proposed a new framework using deep convolutional networks for action recognition. The authors used a 3D-CNN and an LSTM together. Usually, to detect the dynamics of a video, works in this area employ motion information to obtain temporal information. The results were promising, but I have some comments to improve this manuscript:


1) On p. 4, F should be F_i in Figure 1. I think this scheme accumulates motion data as frames go by, but why did the authors use this kind of information? This needs to be explained in more detail.


2) The proposed structure of the 3D-CNN was explained well. However, the relationship between the CNN and the LSTM is difficult to understand. The authors should describe this in more detail.


3) On p. 13, in the comparison with TSN [17], there was no gain or benefit. It would be better to state what advantage the proposed method has over TSN [17].


4) I think the authors should add some recent works, such as:

- deepGesture: Deep Learning-Based Gesture Recognition Scheme Using Motion Sensors, Displays (Elsevier), Vol. 55, pp. 38-45, Dec. 2018. DOI: 10.1016/j.displa.2018.08.001.

- Fight Detection in Hockey Videos Using Deep Network, Journal of Multimedia Information System (KMMS), Vol. 4, No. 4, pp. 225-232, December 2017.

Author Response

Point-1

1) On p. 4, F should be F_i in Figure 1. I think this scheme accumulates motion data as frames go by, but why did the authors use this kind of information? This needs to be explained in more detail.

Action Taken

The figure has been updated, and a proper explanation has been added on page 4 (in bold blue colour) in the revised manuscript, as follows:

Our proposed model is very simple to implement and can be trained by increasing the training video length iteratively. Mainly, it helps solve the problem of videos of various lengths: it produces map representations of the same form from all videos and integrates the temporal information into a map without losing the discriminative information of the videos. Another advantage of this method is that we can extract a constant number of video frames per second, which improves the generalization performance of the network.
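The constant-rate frame extraction mentioned above can be illustrated with a minimal sketch (this is an illustration only, not the authors' actual implementation; the function name and parameters are hypothetical):

```python
def sample_frame_indices(n_frames, native_fps, sample_fps):
    """Pick frame indices at a constant rate of `sample_fps` frames per
    second from a clip of `n_frames` recorded at `native_fps`, so that
    videos of different lengths yield proportionally sized samples."""
    step = native_fps / sample_fps          # frames between two samples
    count = int(n_frames / step)            # total samples for this clip
    return [min(int(round(i * step)), n_frames - 1) for i in range(count)]

# A 10-second clip at 30 fps, sampled at 2 frames per second:
indices = sample_frame_indices(n_frames=300, native_fps=30, sample_fps=2)
# 20 indices, evenly spaced 15 frames apart: 0, 15, 30, ..., 285
```

Because the sampling rate is fixed, a 5-second clip would yield half as many frames as a 10-second clip, keeping the temporal density of the sampled frames constant across videos of different lengths.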

Point-2

The proposed structure of the 3D-CNN was explained well. However, the relationship between the CNN and the LSTM is difficult to understand. The authors should describe this in more detail.

Action Taken

The relationship between the CNN and the LSTM, and the effect of combining them, has been described in the revised manuscript on pages 3-4 (in bold blue text), as follows:

The combination of CNN and RNN provides an effective representation of long-term motion and models the sequential data, each point of which has a temporal relationship with adjacent points. The LSTM, the most widely used recurrent neural network (RNN), uses the extracted C3D features as input and models more robust longer-range features. While the C3D network can encode local temporal features within each video unit, it cannot model dependencies across the multiple units of a video sequence. We thus introduce an LSTM to capture the global sequence dependencies of the input video and to cue on motion information.
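As a rough illustration of how per-unit C3D features can be passed through an LSTM to capture longer-range dependencies, the following NumPy sketch runs a single LSTM cell over a sequence of placeholder clip-level feature vectors. All dimensions, names, and the random stand-in features are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step: x is a clip-level feature vector (e.g. a C3D
    descriptor), (h, c) the recurrent state. Gate order in the stacked
    parameters W, U, b is input, forget, output, candidate."""
    H = h.size
    z = W @ x + U @ h + b
    i, f, o = sigmoid(z[:H]), sigmoid(z[H:2*H]), sigmoid(z[2*H:3*H])
    g = np.tanh(z[3*H:])
    c_new = f * c + i * g              # update the cell memory
    h_new = o * np.tanh(c_new)         # gated hidden state
    return h_new, c_new

rng = np.random.default_rng(0)
D, H, T = 4096, 256, 8                 # feature dim, hidden dim, #video units
W = rng.normal(scale=0.01, size=(4 * H, D))
U = rng.normal(scale=0.01, size=(4 * H, H))
b = np.zeros(4 * H)

h = c = np.zeros(H)
for t in range(T):                     # one step per video unit
    x_t = rng.normal(size=D)           # stand-in for a C3D feature vector
    h, c = lstm_step(x_t, h, c, W, U, b)

# The final hidden state summarises the whole sequence and could feed a
# softmax classifier over action classes.
print(h.shape)
```

The key point the sketch makes concrete: the C3D features describe each unit locally, while the recurrent state (h, c) carries information across units, which is what lets the combined model capture the global sequence dependencies.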

Point-3

On p. 13, in the comparison with TSN [17], there was no gain or benefit. It would be better to state what advantage the proposed method has over TSN [17].

Action Taken

A text passage has been included on pages 13-14 (in bold blue colour), as follows:

We can explain the decrease in prediction rate by the fact that this dataset contains action classes with cluttered backgrounds and illumination changes, and TSN is pre-trained on the large-scale ImageNet dataset, which provides large size and diversity, whereas our approach is based on C3D, which is pre-trained on the UCF101 dataset. However, our method outperforms 3D conv-iDT by 0.9% and the TSN method by 0.7% on the HMDB51 dataset, showing a higher recognition rate on a small-scale dataset. The likely reason for this higher recognition accuracy is that our model is a hybrid deep learning model: the introduced LSTM works well temporally by capturing long-term dependencies and boosts the recognition accuracy for the complex action categories in HMDB51.

Point-4

I think the authors should add some recent works, such as:

deepGesture: Deep Learning-Based Gesture Recognition Scheme Using Motion Sensors, Displays (Elsevier), Vol. 55, pp. 38-45, Dec. 2018. DOI: 10.1016/j.displa.2018.08.001.

The above recent work has been added as reference [49].






Reviewer 2 Report

The authors have satisfactorily responded to all my questions and made the necessary changes to the manuscript. I recommend the manuscript for publication. 

Author Response

Dear Sir/Madam,

I hope you are keeping good health. I am extremely thankful for your recommendation for acceptance. There are no suggestions for minor or major revision from your side.


Thanks and Regards,

Sheeraz Arif


Reviewer 3 Report

This is an interesting and important topic. The overall presentation of the manuscript looks good to me. I recommend acceptance.

Author Response

Dear Sir/Madam,

I hope you are keeping good health. I am extremely thankful for your recommendation for acceptance. There are no suggestions for minor or major revision from your side.


Thanks and Regards,

Sheeraz Arif

This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.


Round 1

Reviewer 1 Report

The authors propose a method for action recognition on video sequences, based on a CNN feature extraction scheme and an LSTM module that encodes the temporal evolution of the visual information.

Although the abstract is almost decent, there are numerous grammatical mistakes throughout the manuscript, which make it really difficult to follow the rationale. For example, there are at least seven mistakes of grammatical number and articles on the first page alone. There are also significant mistakes that affect the meaning, such as the last word of page 3, which probably should be "spatial" instead of "temporal".

Apart from the linguistic aspects, the paper lacks a comprehensive description of the proposed technique. Important aspects, such as what exactly the motion maps are (are they the output feature maps or something else?) and the details of how they are formed, are only vaguely described, and the reader is never sure what exactly the authors are suggesting. For example, what is the symbol in eq. 1? What are the topology and architectural details of the utilized CNN (number of kernels, layers, connectivity, etc.)? Why are maps of both size 4x4 and 7x7 mentioned on page 3? Why do the output maps have a temporal dimension of 2? Most importantly, what exactly are the spatial and temporal streams mentioned on page 5? Given the level of clarity of the method's description, it is almost impossible to gain even a general idea of how the video signal is handled.

Furthermore, even though the presented experimental results seem promising, it is very difficult to assess the level of performance improvement offered by the presented method, let alone its exact innovative aspects and overall contribution. The methods used for comparison in Table 3 were not presented or mentioned at any point in the introductory segments (prior art or related work). It is mandatory to briefly present at least the most related methods (i.e. the hybrid ones) and highlight the differentiation of the presented technique. This would enable the authors to pinpoint the important aspects of their method responsible for the presented improvements, and to draw some meaningful conclusions instead of generic, non-informative phrases.

To summarize, the paper is not suitable for publication in its current form. It needs a complete rewrite, with help from a more competent English speaker, so that it can even be reviewed in a comprehensive and constructive manner. Since the results seem promising, I encourage the authors to redesign the presentation of their method and to describe all its aspects and novelty with clarity in a future submission.


Reviewer 2 Report

This is a well-presented and well-constructed manuscript, covering some interesting and important scientific aspects of this field of research. The results are compelling and well discussed in the manuscript. I would recommend this paper for publication in Future Internet. However, some minor improvements are required before publication:

1- The first two paragraphs in the introduction are under-referenced. Please revise the manuscript accordingly.

2- It would be better if the authors moved Algorithm 1 to the end of the manuscript, or provided it as supplementary information.
