Article

A Deep Bidirectional LSTM Model Enhanced by Transfer-Learning-Based Feature Extraction for Dynamic Human Activity Recognition

School of Computer Science and Engineering, University of Aizu, Aizuwakamatsu 965-8580, Japan
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(2), 603; https://doi.org/10.3390/app14020603
Submission received: 24 November 2023 / Revised: 7 December 2023 / Accepted: 8 January 2024 / Published: 10 January 2024
(This article belongs to the Special Issue State-of-the-Art of Computer Vision and Pattern Recognition)

Abstract

Dynamic human activity recognition (HAR) is a domain of study that is currently receiving considerable attention within the fields of computer vision and pattern recognition. The growing need for artificial-intelligence (AI)-driven systems to evaluate human behaviour and bolster security underscores the timeliness of this research. Despite the strides made by numerous researchers in developing dynamic HAR frameworks utilizing diverse pre-trained architectures for feature extraction and classification, persisting challenges include suboptimal performance accuracy and the computational intricacies inherent in existing systems. These challenges arise due to the vast video-based datasets and the inherent similarity in the data. To address these challenges, we propose an innovative, dynamic HAR technique employing a deep-learning-based, deep bidirectional long short-term memory (Deep BiLSTM) model facilitated by a pre-trained transfer-learning-based feature-extraction approach. Our approach begins with the utilization of Convolutional Neural Network (CNN) models, specifically MobileNetV2, for extracting deep-level features from video frames. Subsequently, these features are fed into an optimized deep bidirectional long short-term memory (Deep BiLSTM) network to discern dependencies and process data, enabling optimal predictions. During the testing phase, an iterative fine-tuning procedure is introduced to update the hyperparameters of the trained model, ensuring adaptability to varying scenarios. The proposed model’s efficacy was rigorously evaluated using three benchmark datasets, namely UCF11, UCF Sports, and JHMDB, achieving notable accuracies of 99.20%, 93.30%, and 76.30%, respectively. This high recognition accuracy substantiates the superiority of our proposed model, signaling a promising advancement in the domain of activity recognition.

1. Introduction

Human activity recognition (HAR) plays a crucial role in diverse real-life domains, including intelligent video surveillance, abnormal activity detection, action-based video retrieval, semantic video recognition, and healthcare patient monitoring [1,2]. Action recognition, especially from online data streams derived from visual sensors in surveillance, website videos, and social media feeds, holds the potential to detect anomalies, identify fraudulent activities, and discern abnormal situations [1]. In video analysis, human actions are discerned through the movement of distinct body parts, such as hands and legs. Unlike static images, a single frame cannot fully convey the essence of an action [3]. For example, the initial pose of jumping for a headshot in football and that of skipping rope may appear similar; the distinction between these actions becomes apparent only when they are observed across a sequence of frames, capturing the nuanced movements and interactions of the human body with its surroundings [4,5].
In non-stationary data streams, the effectiveness of a model trained on previous data diminishes when faced with variations in new data. Adaptability to the new data distribution is crucial for addressing this challenge, requiring diversity in a non-stationary environment [6]. Lobo et al. framed this as an optimization problem, employing a bio-inspired algorithm to address drift heterogeneity and achieve diversity through a self-learning optimization technique [7]. Krawczyk et al. introduced a novel approach, modifying the weighted one-class SVM to enhance its analysis of non-stationary streaming data [8]. They emphasized the adaptability of a one-class classifier’s decision boundary to new data streams, aided by a forgetting mechanism, which facilitates the re-learning of parameters. In later work, Krawczyk proposed an efficient ensemble learning technique that iteratively adjusts Naïve Bayes classifier weights, enabling smooth adaptability to the current stream situation without an external drift detector [9]. Abdallah et al. conducted a comprehensive survey on activity recognition in online data stream mining [10].
However, accurately recognizing human actions in real time from online surveillance data streams remains challenging due to factors such as high-dimensional features, variations in viewpoint, motion, cluttered backgrounds, occlusion, and diverse illumination conditions [11,12,13]. To address these challenges, the domain of HAR has witnessed the extensive use of handcrafted local feature descriptors over the past decade [14,15]. The drawbacks of handcrafted-feature-extraction mechanisms lie in their rigid engineering, their representation of only low-level semantics in visual data, and the high complexity associated with both extraction and classification. Consequently, researchers have shifted towards the exploration of automatic-feature-learning methods [15]. Automatic-feature-learning methods, particularly those based on neural networks, offer a more-adaptive approach: they can directly extract features from raw inputs based on the learned weights and biases of the network. Convolutional Neural Networks (CNNs) exemplify this approach, employing a hierarchical learning process in which the initial layers capture local features from visual data, while the final layers extract global features representing high-level semantics.
In recent developments, researchers have been exploring various techniques, including CNN-based methods, transfer learning, and Transformers, to enhance performance accuracy in sequence learning for action recognition. Notably, BT-LSTM, as employed by [16], achieved an accuracy of 85.3% on the UCF11 dataset. Similarly, the KFDI [17], Dilated CNN+BiLSTM+RB [18], local–global features+QSVM [19], ViT-ReT [20], 3DCNN [21], and deep autoencoder+CNN [22] methods yielded accuracy rates of 79.40%, 89.00%, 82.60%, 85.20%, 96.20%, and 92.40%, respectively, for the UCF11 dataset. Despite these advancements, existing systems struggle to achieve satisfactory accuracy on the UCF Sports and JHMDB benchmark datasets [15,18,23,24,25,26,27,28,29]. The primary reason for the lower performance on these benchmark action recognition datasets is the lack of spatiotemporal information for a more-comprehensive understanding. To address these challenges and improve accuracy, we propose a deep-learning-based feature-extraction and classification approach for dynamic human action recognition.
To address the aforementioned challenges, our study makes substantial contributions in the following key areas:
  • Feature extraction: Leveraging the power of MobileNetV2, we employed an effective feature-extraction methodology to capture intricate patterns from the dataset. This ensures the model receives rich and informative input, enhancing its ability to discern nuanced details crucial for dynamic human activity recognition.
  • Feature refining and classification: Our approach introduces a specialized variant of long short-term memory (LSTM) known as Deep BiLSTM. This model excels in refining the extracted features and leveraging their temporal dependencies for accurate classification. Utilizing Deep BiLSTM represents a novel and powerful technique in dynamic human activity recognition.
  • Extensive experiments: We conducted comprehensive experiments to validate the efficacy of our proposed model. The model underwent rigorous testing using three benchmark human action datasets—UCF11, UCF Sport, and JHMDB. This extensive evaluation showcases the model’s versatility and effectiveness across diverse scenarios, providing robust evidence of its real-world applicability.
Through these strategic contributions, our study aimed to transcend the limitations of conventional handcrafted approaches. By advancing state-of-the-art feature-extraction, -refining, and -classification techniques, we contribute to developing more-robust and -adaptive human activity recognition (HAR) systems, thus fostering progress in this critical domain.
We organize the rest of the paper as follows: Section 2 reviews the related literature, Section 3 describes the architecture of the proposed system, Section 4 presents the experimental evaluation and performance, and Section 5 concludes the paper.

2. Related Work

Action recognition is one of the major research fields of computer vision and is closely related to gesture recognition [30,31]. Many researchers have employed numerous methods to design action-recognition systems, including AI-based approaches, handcrafted features with conventional machine learning algorithms, and various deep learning methodologies [32,33,34,35,36]. Most existing HAR systems have been developed with handcrafted-feature-extraction techniques and traditional machine learning algorithms.
Traditional approaches to action recognition in machine learning typically follow a three-phase methodology. The first phase involves extracting features using handcrafted descriptors, followed by encoding these features using a specific algorithm. The final phase entails classifying the encoded features using a suitable machine learning algorithm [37,38]. Within computer vision, two primary types of feature-extraction techniques are prevalent: local-feature-based and global-feature-based approaches. Local-feature-based methods describe features as independent patches, interest points, and gesture information that align with learned cues for a given task. In contrast, global features are represented by the area of interest, often described through background subtraction and tracking techniques [39,40]. In the realm of conventional machine learning, researchers have devised effective systems for action recognition by leveraging handcrafted features. Notable examples include VLAD [41] and BOW [42], which contribute to the rich landscape of methodologies aiming to enhance the recognition accuracy of actions within diverse datasets. This established paradigm showcases the versatility of machine learning in processing and interpreting visual data, particularly in the context of recognizing actions or movements.
Handcrafted feature extractors are often tailored for specific datasets, making them domain-specific and unsuitable for general-purpose feature learning [43]. The specificity of these extractors restricts their applicability beyond the datasets for which they were designed. Recognizing the need for efficiency, some researchers have employed keyframe-based strategies to reduce the processing time in their systems [44]. For instance, Yasin et al. [45] introduced a key technique for action recognition in video sequences involving keyframe selection and its application to HAR. Similarly, Zhao et al. [46] proposed multi-feature fusion-based HAR using keyframes through conventional machine learning methods. While conventional machine learning algorithms have seen significant success in the past decade, they are constrained by human cognition and encounter challenges such as time consumption, labor intensiveness, and the complexity of feature engineering selection [47]. Recognizing these limitations in handcrafted-based HAR, researchers are increasingly turning to deep learning to devise efficient and innovative techniques for advanced video-based HAR systems. This shift toward deep learning reflects a quest for more-automated and -adaptive approaches that can overcome the limitations posed by traditional handcrafted features.
Diverging from the conventional three-step machine learning architecture, deep learning brings forth a contemporary end-to-end structure, enabling the simultaneous learning and representation of high-level discriminative visual features during classification [48]. Notably, prevalent end-to-end CNN architectures dynamically adjust their parameters based on the data, utilizing convolutional operations to learn optimal features [49]. While CNN-based approaches excel in learning features from 2D data, they may not be directly applicable to 3D data representations. To address this, some researchers have introduced methods that employ 3D filters instead of 2D ones when learning features from video frames [44,50]. The resulting models have demonstrated superior performance in video analysis tasks, including action recognition, object tracking, and video retrieval, surpassing the capabilities of 2D CNNs and handcrafted-feature-based methods. This shift towards 3D feature learning exemplifies the adaptability of deep learning architectures to different types of visual data and their effectiveness in capturing intricate patterns within dynamic sequences.
To address the challenges of extracting motion cues from repetitive video frames, Simonyan et al. [51] introduced a two-stream architecture for video action recognition, leveraging CNNs. Wensel et al. [20] proposed a method called “ViT-ReT”, which combines Vision Transformer (ViT) and Recurrent Transformer neural networks for HAR in video data, leveraging the Transformer networks for understanding temporal and spatial features in video sequences; their model produced a reported accuracy of 85.20% on the JHMDB dataset. Feichtenhofer et al. [52] proposed a method that considers both spatial and motion features to recognize actions, emphasizing the significance of temporal information through an evaluation on the UCF101 dataset. Tu et al. [28] employed a multi-stream CNN model to identify human-related regions, enabling the recognition of multiple actions in a video, and reported an accuracy of 71.17% on the JHMDB dataset. Other studies, such as that of [25], explored the use of two LSTM-based fused networks to capture spatiotemporal cues in video sequences and achieved 92.2% accuracy with the UCF Sports dataset. Hybrid techniques, as presented in [27], focus on learning multiple features for action recognition through feature fusion [53]; the authors reported 69.0% accuracy for the UCF11 dataset with their model. While these deep learning approaches excel in capturing short-term temporal information, they may fall short for long-term sequences. The advent of LSTM networks, an advanced version of Recurrent Neural Networks (RNNs), addresses this limitation by encoding long-term dependencies; the authors of [23] achieved 93.2% accuracy with their recurrent model. Currently, LSTM networks find widespread use for classifying long sequential data in various domains, including action recognition, speech processing, and weather prediction. Recognizing the challenges posed by redundancy in big data, Xu et al. [54] developed a deep-learning-based method to mitigate redundancy in large datasets and incorporated reinforcement learning for resource allocation based on IoT content-centric principles [55]. This holistic exploration showcases the versatility of deep learning approaches in capturing temporal dependencies and managing big data intricacies across diverse applications.
In recent years, researchers have also used the BiLSTM model to recognize dynamic actions, achieving 85.30% accuracy with the UCF11 dataset [16]. Similarly, the KFDI [17], Dilated CNN+BiLSTM+RB [18], local–global features+QSVM [19], ViT-ReT [20], 3DCNN [21], and deep autoencoder+CNN [22] methods yielded accuracy rates of 79.40%, 89.00%, 82.60%, 85.20%, 96.20%, and 92.40%, respectively, for the UCF11 dataset. Despite these advancements, existing systems struggle to achieve satisfactory accuracy for the UCF Sports dataset: Jaouedi et al. employed a GMM+KF+GRNN model and achieved 89.01% accuracy [24], while Muhammad et al. employed a CNN+BiLSTM+RB model and achieved 89.00% accuracy [18]. Many researchers have likewise evaluated their models on the challenging JHMDB dataset, aiming to reach an accuracy high enough for a deployable system. Ramasinghe et al. employed a deep learning architecture and achieved 67.24% accuracy [26], Gammulle et al. employed a two-stream LSTM and achieved 52.70% accuracy [25], and Yang et al. employed various deep learning models to improve the performance on dynamic actions and achieved 65.00% accuracy with the JHMDB dataset [15]. The primary reason for the lower performance on these benchmark action recognition datasets is the lack of spatiotemporal information for a more-comprehensive understanding. To address these challenges and improve accuracy, we propose a deep-learning-based feature-extraction and -classification approach for dynamic human action recognition.

3. Proposed Method

Figure 1 outlines our model architecture, which couples MobileNetV2 for feature extraction with a Deep BiLSTM framework for dynamic human action recognition. MobileNetV2 was chosen for its ability to extract robust action features from diverse and detailed video frames while keeping the computational cost of processing large numbers of pixel values manageable. A distinctive strength of the architecture lies in its dense connections, in which each feature value communicates with every neuron of the following layer; this intrinsic capability helps preserve discriminative information and supports reliable action recognition. The efficiency of the model is further improved by convolution with a 3 × 3 MaxPooling filter together with batch normalization and dropout layers, which refine the extracted features. The resulting multidimensional feature maps are converted to a one-dimensional format by a flatten layer so that they can be used by the subsequent dense layer, and the ReLU and softmax activations were chosen to balance computational efficiency and prediction accuracy. Because training a CNN from scratch is resource-intensive, we capitalized on the efficiency of the pre-trained MobileNetV2 model, which both reduces the computational burden and harnesses the proven capabilities of MobileNetV2 in our specific study context.
Our methodology then applies a novel Deep BiLSTM classification model to refine the hierarchical features extracted by MobileNetV2. This hierarchical refinement enhances discriminative power by modeling the sequential dependencies in action sequences: the Deep BiLSTM captures long-term temporal dependencies, adapts to varying temporal contexts, and thus remains robust across different action scenarios. Configurable numbers of layers and neurons provide additional flexibility, and iterative fine-tuning improves responsiveness to real-world variations.
Algorithm 1 summarizes every step of the proposed methodology. In Step 1, a preprocessing technique converts each video into a sequence of frames and assembles the frames into a data frame. In Step 2, the pre-trained MobileNetV2 model extracts features from the data frame; using a pre-trained backbone in this multi-stage deep learning pipeline keeps the computational complexity stable. In Step 3, the Deep BiLSTM model refines the features and performs the classification. The model is then trained over multiple passes equal to the number of epochs, and the batch-wise predictions for the training and testing cases are combined to calculate the accuracy score. In Step 4, the model is trained and validation results are generated. In Step 5, the results are reported in terms of accuracy. In Step 6, real-time testing is included, in which a user can supply any activity video and the model outputs the recognized activity name in English.
Algorithm 1 Transfer-Learning-Feature-Based HAR
Input: List of videos (dynamic human activity dataset) and corresponding ground-truth labels (groundTruthLabels)
Output: Prediction for each individual video
Step 1: Preprocessing
while Video ∈ InputVideos do
  SequenceOfFrames ← VideoToImageConversion(Video)
  PreprocessedVideos.append(SequenceOfFrames)
Step 2: Feature extraction with pre-trained models
  VideoFeatures ← loadModels(MobileNetV2)
Step 3: Classification
  ClassModels ← DeepBiLSTM(VideoFeatures)
Step 4: Generate predictions
while i ≤ NumEpochs do
  // For training
  while BatchVideoFeatures ∈ NumberBatchTraining do
    PredictedClass ← ClassModels(BatchVideoFeatures)
    Loss ← Criterion(PredictedClass, groundTruthLabels)
    Update the loss: Loss.backward(), Optimizer.Step()
  // For testing
  while BatchVideoFeatures ∈ NumberBatchTesting do
    PredictedClass ← ClassModels(BatchVideoFeatures)
    Output ← PerformanceMatrix(PredictedClass, groundTruthLabels)
Step 5: Display results
   print(“Performance:”, accuracy)
Step 6: Real-time testing
   HumanActivityTestVideo ← Recorded(Video)
   PredictedActivityName ← ClassModels(HumanActivityTestVideo)
   print(“Predicted:”, PredictedActivityName)
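As a concrete illustration of Step 1, the following Python sketch shows one way the video-to-frame conversion could be implemented with OpenCV. The sequence length, frame size, and function name are illustrative assumptions and are not taken from the paper; MobileNetV2-specific pixel scaling is deferred to the feature-extraction stage.

```python
import cv2
import numpy as np

# Assumed preprocessing settings (not specified in the paper).
SEQUENCE_LENGTH = 20      # frames sampled per video
FRAME_SIZE = (224, 224)   # MobileNetV2 input resolution

def video_to_image_conversion(video_path):
    """Sample a fixed-length sequence of RGB frames from one video (Step 1)."""
    capture = cv2.VideoCapture(video_path)
    total_frames = int(capture.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(total_frames // SEQUENCE_LENGTH, 1)
    frames = []
    for i in range(SEQUENCE_LENGTH):
        capture.set(cv2.CAP_PROP_POS_FRAMES, i * step)
        ok, frame = capture.read()
        if not ok:
            break
        frame = cv2.resize(frame, FRAME_SIZE)
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    capture.release()
    # Pad short videos by repeating the last frame so every sample has the same shape.
    while frames and len(frames) < SEQUENCE_LENGTH:
        frames.append(frames[-1])
    return np.stack(frames).astype(np.float32)  # (SEQUENCE_LENGTH, 224, 224, 3)

# Usage: PreprocessedVideos = [video_to_image_conversion(p) for p in video_paths]
```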

3.1. Feature Extraction with MobileNetV2

MobileNet, employed for feature extraction in our framework, is a compact, low-latency, and energy-efficient CNN model designed to enable deep neural networks on mobile devices [56]. Utilizing depthwise separable convolutions, it replaces traditional convolutions by splitting the operation into depthwise and pointwise convolutions, significantly reducing the number of parameters and optimizing the model size and computational cost. MobileNetV2’s user-friendly design notably allows users to adjust the network’s width and input dimensions, catering to diverse computational needs; in our context, this facilitates feature extraction on both low-power and larger systems. The enhanced MobileNetV2 introduces linear bottlenecks and skip connections, contributing to improved performance while maintaining accuracy [56]. In our framework, the softmax classification unit is subsequently replaced by the LSTM-based model for our specific prediction objective.
Figure 2 demonstrates the working procedure of the MobileNetV2 architecture, whose efficient design revolves around inverted residual blocks. Each block consists of a depthwise convolutional layer followed by a pointwise convolutional layer, also known as the projection layer. The idea behind this design is to achieve a harmonious balance between computational efficiency and model performance. The utilization of linear bottlenecks and ReLU-6 activations in MobileNetV2 ensures streamlined operations while preserving the representational power of the model. By employing an initial 3 × 3 convolutional layer for feature extraction and global average pooling for information aggregation, MobileNetV2 generates classification probabilities using dense layers. The adaptability of MobileNetV2, which allows for customization through a width multiplier and resolution adjustments, positions it as an ideal choice for real-time computer vision tasks on devices with limited computational resources. Using this model, we produced effective features from the dynamic human action datasets.
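To make this feature-extraction stage concrete, the minimal Keras sketch below uses MobileNetV2 with ImageNet weights, a 224 × 224 input, and global average pooling over the final convolutional map, which yields a 1280-dimensional descriptor per frame. The exact extraction layer and weight-freezing strategy used in our experiments are not restated here, so this configuration should be read as an illustrative assumption.

```python
import numpy as np
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input

# Frozen MobileNetV2 backbone: ImageNet weights, classifier head removed,
# global average pooling reduces each frame to a 1280-dimensional feature vector.
backbone = MobileNetV2(weights="imagenet", include_top=False,
                       pooling="avg", input_shape=(224, 224, 3))
backbone.trainable = False

def extract_video_features(frames):
    """frames: array of shape (sequence_length, 224, 224, 3) holding raw RGB frames."""
    x = preprocess_input(frames.astype("float32"))  # scales pixel values to [-1, 1]
    return backbone.predict(x, verbose=0)           # shape: (sequence_length, 1280)
```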

3.2. Classification with Deep BiLSTM

However, because extracting human activity recognition (HAR) features solely with the MobileNetV2 CNN [56] is insufficient for capturing temporal structure, we leveraged the power of Recurrent Neural Networks (RNNs), specifically the long short-term memory (LSTM) architecture. While RNNs exhibit proficiency in extracting temporal knowledge, the inherent challenge of vanishing and exploding gradients limits their effectiveness over extended durations, as demonstrated by Bengio et al. [57].
To address this limitation, we adopted the specialized LSTM variant, Deep BiLSTM [58], an extension of BiLSTM [58,59,60]. This architecture excels in scrutinizing sequential spatial features within the local environment, capturing sustained patterns in spatial–temporal features. Figure 3 illustrates the detailed architecture of the Deep BiLSTM model, emphasizing its unique ability to comprehensively analyze and classify human actions based on the features extracted by MobileNetV2 [61].
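Before detailing the gate equations below, the following Keras sketch illustrates one plausible realization of such a stacked (deep) bidirectional LSTM classifier operating on the per-frame MobileNetV2 features. The number of layers, hidden units, dropout rate, and optimizer are illustrative assumptions rather than values taken from this paper.

```python
from tensorflow.keras import layers, models

def build_deep_bilstm(sequence_length=20, feature_dim=1280, num_classes=11):
    """Stacked bidirectional LSTM classifier over a sequence of frame features."""
    model = models.Sequential([
        layers.Input(shape=(sequence_length, feature_dim)),
        # First BiLSTM layer returns the full sequence so a second layer can be stacked.
        layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
        layers.Dropout(0.3),
        # Second BiLSTM layer summarizes the whole sequence into a single vector.
        layers.Bidirectional(layers.LSTM(64)),
        layers.Dropout(0.3),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Example: an 11-class model for UCF11.
model = build_deep_bilstm(num_classes=11)
```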
The Deep BiLSTM model is built from LSTM units [60,61,62], each of which employs three gates, namely the input gate, forget gate, and output gate. An LSTM unit can be precisely described as follows:

i_t = σ(Y_i · x_t + W_i · h_{t−1} + b_i)        (1)

where Y and W denote the weight matrices, b signifies the bias term, i_t represents the input gate at time t, · denotes matrix multiplication, σ denotes the sigmoid function, x_t signifies the input data at time t, and h_{t−1} signifies the output of the preceding LSTM unit. The input gate is crucial in identifying the specific information from the preceding unit that necessitates modification.

f_t = σ(Y_f · x_t + W_f · h_{t−1} + b_f)        (2)

where f_t denotes the forget gate, which computes the significance of the incoming information and forgets old information.

c̃_t = tanh(Y_c · x_t + W_c · h_{t−1} + b_c)        (3)

c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t        (4)

where the candidate state c̃_t is determined through the hyperbolic tangent activation function, as illustrated in Equation (3). Subsequently, the current cell state c_t is evaluated as given in Equation (4), where ⊙ represents point-to-point (element-wise) multiplication.

g_t = σ(Y_g · x_t + W_g · h_{t−1} + b_g)        (5)

h_t = g_t ⊙ tanh(c_t)        (6)

The output gate g_t is calculated in Equation (5), and h_t represents the output of the LSTM unit, as given in Equation (6). The baseline LSTM model predicts the current human activity based only on past data, so certain information may be lost if the data are evaluated in a unidirectional manner. The Deep BiLSTM is therefore made of two LSTM layers that operate in the forward and backward directions, as depicted in Figure 4. The output layer v_t of the Deep BiLSTM is formulated as follows [59]:

v_t = h_t^→ ⊕ h_t^←        (7)

The forward and backward outcomes of the LSTM units are represented by h_t^→ and h_t^←, and ⊕ denotes their combination, which forms the output v_t. The fundamental concept underlying the RNN is that, rather than feeding the complete set of input data to the network at once, the data are introduced one element at a time, effectively incorporating the temporal variable as well. Given an array of input values, the first value is fed into the network to obtain a corresponding output; for the subsequent output, the next input is fed alongside the preceding output.
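To make Equations (1)–(7) concrete, the NumPy sketch below steps a single LSTM cell through its gate computations and then forms the bidirectional output by concatenating the forward and backward hidden states. The dimensions and random weights are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid = 1280, 64                 # illustrative feature and hidden sizes

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM unit following Equations (1)-(6)."""
    Y_i, W_i, b_i, Y_f, W_f, b_f, Y_c, W_c, b_c, Y_g, W_g, b_g = p
    i_t = sigmoid(Y_i @ x_t + W_i @ h_prev + b_i)        # input gate, Eq. (1)
    f_t = sigmoid(Y_f @ x_t + W_f @ h_prev + b_f)        # forget gate, Eq. (2)
    c_tilde = np.tanh(Y_c @ x_t + W_c @ h_prev + b_c)    # candidate state, Eq. (3)
    c_t = f_t * c_prev + i_t * c_tilde                   # cell state, Eq. (4)
    g_t = sigmoid(Y_g @ x_t + W_g @ h_prev + b_g)        # output gate, Eq. (5)
    h_t = g_t * np.tanh(c_t)                             # hidden output, Eq. (6)
    return h_t, c_t

def random_params():
    p = []
    for _ in range(4):                                   # gates: i, f, candidate c, g
        p.append(rng.standard_normal((d_hid, d_in)) * 0.01)   # Y (input weights)
        p.append(rng.standard_normal((d_hid, d_hid)) * 0.01)  # W (recurrent weights)
        p.append(np.zeros(d_hid))                             # b (bias)
    return p

def run_direction(sequence, params):
    h, c, outputs = np.zeros(d_hid), np.zeros(d_hid), []
    for x_t in sequence:
        h, c = lstm_step(x_t, h, c, params)
        outputs.append(h)
    return np.stack(outputs)

sequence = rng.standard_normal((20, d_in))                    # one video's frame features
h_fwd = run_direction(sequence, random_params())              # forward pass
h_bwd = run_direction(sequence[::-1], random_params())[::-1]  # backward pass, re-aligned

v = np.concatenate([h_fwd, h_bwd], axis=1)  # Eq. (7): combination of both directions
print(v.shape)                              # (20, 128)
```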

4. Experimental Evaluation and Performance

In this section, we evaluate the proposed approach on benchmark datasets that are commonly used to assess action recognition: UCF11 or YouTube Actions [63], UCF Sports [64], and JHMDB [65]. Samples of different actions from each dataset are depicted in Figure 5.

4.1. Dataset

In this study, we used three human action datasets for the experiments. Many datasets are available, but we consider these three to be among the most challenging, and improving recognition performance on them requires addressing their specific difficulties.

4.1.1. YouTube Actions or UCF11 Dataset

The YouTube Actions dataset is one of the most-challenging and complex datasets due to the collection of action samples at low resolution, utilizing both moving and stationary cameras at varying scales. Furthermore, the presence of cluttered backgrounds, alterations in illumination, and changes in viewpoint further contribute to the complexity of the dataset. This dataset contains 11 distinct categories of physical movements obtained from various sporting activities: “Basketball”, “Biking”, “Diving”, “GolfSwing”, “HorseRiding”, “SoccerJuggling”, “Swing”, “TennisSwing”, “TrampolineJumping”, “VolleyballSpiking”, and “WalkingDog”. These actions were captured through video recordings involving 25 individuals, with four samples for each action, while additional videos were collected from YouTube [63,66].

4.1.2. Joint-Annotated Human Motion Data Base (JHMDB) Dataset

The JHMDB dataset comprises 21 distinct categories of activities, including catching, climbing stairs, clapping, hair brushing, baseball swinging, gun shooting, jumping, and more. The dataset is characterized by a substantial collection of 923 videos featuring diverse actions, making it a challenging recognition task. Owing to these challenges and difficulties, performance on the JHMDB dataset is lower than on the other datasets; nevertheless, the recognition rate achieved is effective compared to SOTA approaches [65,67].

4.1.3. UCF Sports Dataset

The UCF Sports dataset contains 150 video sequences with a resolution of 720 × 480, gathered from diverse sporting activities such as diving, horse riding, golf swings, skateboarding, and weightlifting. These sports-related action videos were collected from various broadcast outlets, including the BBC and ESPN, and are commonly transmitted via television channels. The videos represent authentic and unscripted actions captured from different perspectives and across a wide array of scenes [68,69,70].

4.2. Experimental Setup

The hardware architecture is very important for efficient model training. Deep learning models rely on high-level interface libraries, such as TensorFlow [71], that execute the computationally demanding operations. In our scenario, NVIDIA hardware was selected because of its exceptional software support for deep-learning-specific computations. All evaluation outcomes were obtained using an NVIDIA GeForce RTX 3090 graphics card. The proposed neural network architecture was implemented in a Python environment using the Keras and TensorFlow frameworks. The process involved a deep learning ConvNet toolbox for extracting CNN features with MobileNetV2 and neural network libraries for the Deep BiLSTM.
The proposed approach was evaluated using two distinct metrics, accuracy and the confusion matrix. The outcomes for each dataset, along with the corresponding comparisons against the state-of-the-art (SOTA), are provided below.
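As an illustration of how these two metrics might be computed from the collected predictions, the short scikit-learn sketch below derives the overall accuracy, the confusion matrix, and per-class accuracies; the variable names are placeholders rather than names from the authors' code.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

def evaluate(y_true, y_prob, class_names):
    """y_true: ground-truth class indices; y_prob: softmax outputs over all test samples."""
    y_pred = np.argmax(y_prob, axis=1)
    acc = accuracy_score(y_true, y_pred)
    cm = confusion_matrix(y_true, y_pred)
    # Per-class accuracy is the diagonal of the row-normalized confusion matrix.
    per_class = cm.diagonal() / cm.sum(axis=1)
    print(f"Overall accuracy: {acc:.2%}")
    for name, score in zip(class_names, per_class):
        print(f"  {name}: {score:.2%}")
    return acc, cm
```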

4.3. Performance Accuracy and SOTA Comparison with UCF11 or YouTube Actions Dataset

The quantitative comparison of the proposed model and SOTA methods for the UCF11 or YouTube Actions dataset is shown in Table 1.
The comparative methods included hierarchical clustering [14], single-stream CNN [72], BT-LSTM [16], KFDI [17], dilated CNN+BiLSTM+RB [18], local–global features + QSVM [19], the 3DCNN [21], the deep autoencoder with CNN [22], and ViT-ReT-based models [20], which attained accuracy rates of 89.7%, 93.1%, 85.3%, 79.4%, 89.0%, 82.6%, 96.2%, 92.4%, and 89.20%, respectively. Our proposed method outperformed these approaches by nearly 3%, leading to improved results. The accuracy and loss curves for the YouTube Actions dataset are depicted in Figure 6. Furthermore, we present the difference between the correct and predicted labels of the UCF11 dataset: the confusion matrix of the UCF11 or YouTube Actions dataset is depicted in Figure 7. This matrix provides an overview of the accuracy at the class level. The “basketball”, “biking”, “golf swing”, “horse riding”, “soccer juggling”, “tennis swing”, “trampoline jumping”, and “volleyball spiking” classes achieved remarkable accuracy near 100%. The diving, swing, and walking dog classes showed lower accuracies of 96%, 97%, and 96%, respectively. It is worth noting that certain action classes, such as volleyball spiking and basketball shooting, exhibited similar motions, yet the occurrence of classification errors between them remained minimal.

4.4. Performance Accuracy and SOTA Comparison with UCF Sports Dataset

The quantitative comparison of the proposed model with SOTA methods using the UCF Sports dataset is shown in Table 2. The QST-CNN-LSTM [23], GMM + KF + GRNN [24], two-stream LSTM [25], and dilated CNN+BiLSTM+RB [18] methods achieved accuracy rates of 93.2%, 89.1%, 89.0%, and 92.2%, respectively. The proposed model achieved effective results for the UCF Sports dataset, reaching nearly 100% for all classes except the running class, where the model confused running with the diving and kicking classes because of the limited contextual differences between them. The confusion matrix and the accuracy and loss curves for the UCF Sports dataset can be found in Figure 8 and Figure 9, respectively. Overall, the experiments confirm that the proposed method is effectively targeted toward the HAR task.

4.5. Performance Accuracy and SOTA Comparison with JHMDB Dataset

The quantitative comparative results for the JHMDB dataset and other methodologies are given in Table 3. We evaluated our proposed model on the challenging JHMDB dataset to assess its performance in identifying demanding actions such as brushing, catching, clapping, golfing, jumping, kicking, pouring, pulling, shooting a ball, shooting a bow, and throwing. Our model demonstrated a recognition accuracy exceeding 90% for actions such as catching, sitting, playing basketball, and running. The model accomplished an impressive overall classification accuracy of 76.3% for the challenging actions in the JHMDB dataset, substantiating the significance and effectiveness of the model. Figure 10 shows the accuracy and loss curves.

5. Conclusions

Recognizing diverse actions in surveillance video data, including human activity recognition (HAR), heavily relies on spatiotemporal features. In this study, we introduced a specialized long short-term memory (LSTM) framework, specifically Deep BiLSTM, tailored for human action recognition. Leveraging Convolutional Neural Networks (CNNs), notably MobileNetV2, we extracted salient features from video frames, which were then fed into the Deep BiLSTM network to capture temporal information. The bidirectional nature of Deep BiLSTM facilitated improved feature learning by considering both forward and backward temporal dependencies, and the softmax activation function enhanced the prediction accuracy of human actions in videos. Rigorous evaluations on the benchmark datasets (UCF11, UCF Sports, JHMDB) yielded impressive recognition accuracies of 99.2%, 93.3%, and 76.3%, respectively, substantiating the efficacy of our proposed model and marking a significant advancement in the field of activity recognition. Although the model achieved impressive accuracies on these benchmarks, future work may explore advanced temporal context fusion and privacy-preserving techniques to enhance the model’s temporal understanding and address privacy concerns in surveillance applications.

Author Contributions

Conceptualization, N.H. and A.S.M.M.; methodology, N.H., A.S.M.M. and J.S.; software, N.H., A.S.M.M. and J.S.; validation, N.H. and A.S.M.M.; formal analysis, N.H., A.S.M.M. and J.S.; investigation, N.H. and A.S.M.M.; resources, N.H., A.S.M.M. and J.S.; data curation, N.H. and A.S.M.M.; writing—original draft preparation, N.H. and A.S.M.M.; writing—review and editing, J.S.; visualization, N.H., A.S.M.M. and J.S.; supervision, J.S.; project administration, N.H., A.S.M.M. and J.S.; funding acquisition, J.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Competitive Research Fund of The University of Aizu, Japan.

Data Availability Statement

The UCF11 data presented in this study are openly available at https://www.crcv.ucf.edu/data/UCF_YouTube_Action.php. The UCF Sports data presented in this study are openly available at https://www.crcv.ucf.edu/data/UCF_YouTube_Action.php. The JHMDB data presented in this study are openly available at http://jhmdb.is.tue.mpg.de/.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

LSTM: Long short-term memory
BiLSTM: Bidirectional long short-term memory
CNN: Convolutional Neural Network
GMM: Gaussian Mixture Model
DD-Net: Double-feature double-motion network
GAN: Generative Adversarial Network
SOTA: State-of-the-art
MobileNetV2: Mobile Network Variant 2
ReLU: Rectified Linear Unit
DCNN: Dilated Convolutional Neural Network
GRNN: General regression neural network
KFDI: Key Frames Dynamic Image
QST: Quaternion Spatial–Temporal
RNNs: Recurrent Neural Networks
BT-LSTM: Block-term long short-term memory
KF: Kalman filter

References

  1. Luo, S.; Yang, H.; Wang, C.; Che, X.; Meinel, C. Action recognition in surveillance video using ConvNets and motion history image. In Proceedings of the International Conference on Artificial Neural Networks, Barcelona, Spain, 6–9 September 2016; Springer: Berlin/Heidelberg, Germany, 2016. [Google Scholar]
  2. Egawa, R.; Miah, A.S.M.; Hirooka, K.; Tomioka, Y.; Shin, J. Dynamic Fall Detection Using Graph-Based Spatial Temporal Convolution and Attention Network. Electronics 2023, 12, 3234. [Google Scholar] [CrossRef]
  3. Liu, Y.; Cui, J.; Zhao, H.; Zha, H. Fusion of low-and high-dimensional approaches by trackers sampling for generic human motion tracking. In Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012), Tsukuba, Japan, 11–15 November 2012. [Google Scholar]
  4. Ullah, A.; Ahmad, J.; Muhammad, K.; Sajjad, M.; Baik, S.W. Action recognition in video sequences using deep Bi-directional LSTM with CNN features. IEEE Access 2017, 6, 1155–1166. [Google Scholar] [CrossRef]
  5. Ullah, A.; Muhammad, K.; Del Ser, J.; Baik, S.W.; de Albuquerque, V.H.C. Activity recognition using temporal optical flow convolutional features and multi-layer LSTM. IEEE Trans. Ind. Electron. 2018, 66, 9692–9702. [Google Scholar] [CrossRef]
  6. Lobo, J.L.; Del Ser, J.; Bilbao, M.N.; Perfecto, C.; Salcedo-Sanz, S. DRED: An evolutionary diversity generation method for concept drift adaptation in online learning environments. Appl. Soft Comput. 2018, 68, 693–709. [Google Scholar] [CrossRef]
  7. Lobo, J.L.; Del Ser, J.; Villar-Rodriguez, E.; Bilbao, M.N.; Salcedo-Sanz, S. On the creation of diverse ensembles for nonstationary environments using Bio-inspired heuristics. In Proceedings of the International Conference on Harmony Search Algorithm, Bilbao, Spain, 22–24 February 2017; Springer: Singapore, 2017. [Google Scholar]
  8. Krawczyk, B.; Woźniak, M. One-class classifiers with incremental learning and forgetting for data streams with concept drift. Soft Comput. 2015, 19, 3387–3400. [Google Scholar] [CrossRef]
  9. Krawczyk, B. Active and adaptive ensemble learning for online activity recognition from data streams. Knowl.-Based Syst. 2017, 138, 69–78. [Google Scholar] [CrossRef]
  10. Abdallah, Z.S.; Gaber, M.M.; Srinivasan, B.; Krishnaswamy, S. Activity recognition with evolving data streams: A review. ACM Comput. Surv. 2018, 51, 71. [Google Scholar] [CrossRef]
  11. Wang, Y.; Mori, G. Hidden part models for human action recognition: Probabilistic versus max margin. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 33, 1310–1323. [Google Scholar] [CrossRef]
  12. Liu, Y.; Nie, L.; Han, L.; Zhang, L.; Rosenblum, D.S. Action2Activity: Recognizing complex activities from sensor data. In Proceedings of the IJCAI, Buenos Aires, Argentina, 25–31 July 2015. [Google Scholar]
  13. Chang, X.; Yu, Y.L.; Yang, Y.; Xing, E.P. Semantic pooling for complex event analysis in untrimmed videos. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1617–1632. [Google Scholar] [CrossRef]
  14. Liu, A.A.; Su, Y.T.; Nie, W.Z.; Kankanhalli, M. Hierarchical clustering multi-task learning for joint human action grouping and recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 102–114. [Google Scholar] [CrossRef] [PubMed]
  15. Yang, F.; Wu, Y.; Sakti, S.; Nakamura, S. Make skeleton-based action recognition model smaller, faster and better. In Proceedings of the ACM Multimedia Asia, Beijing China, 15–18 December 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 1–6. [Google Scholar]
  16. Ye, J.; Wang, L.; Li, G.; Chen, D.; Zhe, S.; Chu, X.; Xu, Z. Learning compact recurrent neural networks with block-term tensor decomposition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 9378–9387. [Google Scholar]
  17. Riahi, M.; Eslami, M.; Safavi, S.H.; Torkamani Azar, F. Human activity recognition using improved dynamic image. IET Image Process. 2020, 14, 3223–3231. [Google Scholar] [CrossRef]
  18. Muhammad, K.; Ullah, A.; Imran, A.S.; Sajjad, M.; Kiran, M.S.; Sannino, G.; de Albuquerque, V.H.C. Human action recognition using attention based LSTM network with dilated CNN features. Future Gener. Comput. Syst. 2021, 125, 820–830. [Google Scholar] [CrossRef]
  19. Al-Obaidi, S.; Al-Khafaji, H.; Abhayaratne, C. Making sense of neuromorphic event data for human action recognition. IEEE Access 2021, 9, 82686–82700. [Google Scholar] [CrossRef]
  20. Wensel, J.; Ullah, H.; Munir, A. ViT-ReT: Vision and Recurrent Transformer Neural Networks for Human Activity Recognition in Videos. IEEE Access 2023, 11, 72227–72249. [Google Scholar] [CrossRef]
  21. Vrskova, R.; Hudec, R.; Kamencay, P.; Sykora, P. Human activity classification using the 3DCNN architecture. Appl. Sci. 2022, 12, 931. [Google Scholar] [CrossRef]
  22. Ullah, A.; Muhammad, K.; Haq, I.U.; Baik, S.W. Action recognition using optimized deep autoencoder and CNN for surveillance data streams of non-stationary environments. Future Gener. Comput. Syst. 2019, 96, 386–397. [Google Scholar] [CrossRef]
  23. Meng, B.; Liu, X.; Wang, X. Human action recognition based on quaternion spatial–temporal convolutional neural network and LSTM in RGB videos. Multimed. Tools Appl. 2018, 77, 26901–26918. [Google Scholar] [CrossRef]
  24. Jaouedi, N.; Boujnah, N.; Bouhlel, M.S. A new hybrid deep learning model for human action recognition. J. King Saud Univ.-Comput. Inf. Sci. 2020, 32, 447–453. [Google Scholar] [CrossRef]
  25. Gammulle, H.; Denman, S.; Sridharan, S.; Fookes, C. Two stream lstm: A deep fusion framework for human action recognition. In Proceedings of the 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), Santa Rosa, CA, USA, 24–31 March 2017. [Google Scholar]
  26. Ramasinghe, S.; Rajasegaran, J.; Jayasundara, V.; Ranasinghe, K.; Rodrigo, R.; Pasqual, A.A. Combined static and motion features for deep-networks-based activity recognition in videos. IEEE Trans. Circuits Syst. Video Technol. 2017, 29, 2693–2707. [Google Scholar] [CrossRef]
  27. Ijjina, E.; Mohan, C. Hybrid deep neural network model for human action recognition. Appl. Soft Comput. 2016, 46, 936–952. [Google Scholar] [CrossRef]
  28. Tu, Z. Multi-stream CNN: Learning representations based on human-related regions for action recognition. Pattern Recognit. 2018, 79, 32–43. [Google Scholar] [CrossRef]
  29. Sahoo, S.P.; Ari, S.; Mahapatra, K.; Mohanty, S.P. HAR-depth: A novel framework for human action recognition using sequential learning and depth estimated history images. IEEE Trans. Emerg. Top. Comput. Intell. 2020, 5, 813–825. [Google Scholar] [CrossRef]
  30. Miah, A.S.M.; Shin, J.; Hasan, M.A.M.; Rahim, M.A. BenSignNet: Bengali Sign Language Alphabet Recognition Using Concatenated Segmentation and Convolutional Neural Network. Appl. Sci. 2022, 12, 3933. [Google Scholar] [CrossRef]
  31. Miah, A.S.M.; Hasan, M.A.M.; Shin, J. Dynamic Hand Gesture Recognition using Multi-Branch Attention Based Graph and General Deep Learning Model. IEEE Access 2023, 11, 4703–4716. [Google Scholar] [CrossRef]
  32. Wu, D.; Sharma, N.; Blumenstein, M. Recent advances in video-based human action recognition using deep learning: A review. In Proceedings of the 2017 International Joint Conference on Neural Networks (IJCNN), Anchorage, AK, USA, 14–19 May 2017; pp. 2865–2872. [Google Scholar]
  33. Miah, A.S.M.; Hasan, M.A.M.; Shin, J.; Okuyama, Y.; Tomioka, Y. Multistage Spatial Attention-Based Neural Network for Hand Gesture Recognition. Computers 2023, 12, 13. [Google Scholar] [CrossRef]
  34. Miah, A.S.M.; Shin, J.; Abu Saleh Musa, M.; Rahim, M.A.; Okuyama, Y. Rotation, Translation and Scale Invariant Sign Word Recognition Using Deep Learning. Comput. Syst. Sci. Eng. 2023, 44, 2521–2536. [Google Scholar] [CrossRef]
  35. Shin, J.; Musa Miah, A.S.; Hasan, M.A.M.; Hirooka, K.; Suzuki, K.; Lee, H.S.; Jang, S.W. Korean Sign Language Recognition Using Transformer-Based Deep Neural Network. Appl. Sci. 2023, 13, 3029. [Google Scholar] [CrossRef]
  36. Rahim, M.A.; Miah, A.S.M.; Sayeed, A.; Shin, J. Hand gesture recognition based on optimal segmentation in human-computer interaction. In Proceedings of the 2020 3rd IEEE International Conference on Knowledge Innovation and Invention (ICKII), Kaohsiung, Taiwan, 21–23 August 2020; pp. 163–166. [Google Scholar]
  37. Antar, A.D.; Ahmed, M.; Ahad, M.A.R. Challenges in sensor-based human activity recognition and a comparative analysis of benchmark datasets: A review. In Proceedings of the 2019 Joint 8th International Conference on Informatics, Electronics & Vision (ICIEV) and 2019 3rd International Conference on Imaging, Vision & Pattern Recognition (icIVPR), Spokane, WA, USA, 30 May–2 June 2019; pp. 134–139. [Google Scholar]
  38. Ullah, S.; Bhatti, N.; Qasim, T.; Hassan, N.; Zia, M. Weakly-supervised action localization based on seed superpixels. Multimed. Tools Appl. 2021, 80, 6203–6220. [Google Scholar] [CrossRef]
  39. Hsueh, Y.L.; Lie, W.N.; Guo, G.Y. Human behavior recognition from multiview videos. Inf. Sci. 2020, 517, 275–296. [Google Scholar] [CrossRef]
  40. Elhoseny, M.; Abdelaziz, A.; Salama, A.S.; Riad, A.M.; Muhammad, K.; Sangaiah, A.K. A hybrid model of internet of things and cloud computing to manage big data in health services applications. Future Gener. Comput. Syst. 2018, 86, 1383–1394. [Google Scholar] [CrossRef]
  41. Kwon, H.; Kim, Y.; Lee, J.S.; Cho, M. First person action recognition via two-stream convnet with long-term fusion pooling. Pattern Recognit. Lett. 2018, 112, 161–167. [Google Scholar] [CrossRef]
  42. Zhen, X.; Shao, L. Action recognition via spatio-temporal local features: A comprehensive study. Image Vis. Comput. 2016, 50, 1–13. [Google Scholar] [CrossRef]
  43. Saghafi, B.; Rajan, D. Human action recognition using pose-based discriminant embedding. Signal Process. Image Commun. 2012, 27, 96–111. [Google Scholar] [CrossRef]
  44. Lee, T.; Yoon, J.C.; Lee, I.K. Motion sickness prediction in stereoscopic videos using 3D convolutional neural networks. IEEE Trans. Vis. Comput. Graph. 2019, 25, 1919–1927. [Google Scholar] [CrossRef]
  45. Yasin, H.; Hussain, M.; Weber, A. Keys for action: An efficient keyframe-based approach for 3D action recognition using a deep neural network. Sensors 2020, 20, 2226. [Google Scholar] [CrossRef]
  46. Zhao, Y.; Guo, H.; Gao, L.; Wang, H.; Zheng, J.; Zhang, K.; Zheng, Y. Multi-feature fusion action recognition based on keyframes. In Proceedings of the 2019 Seventh International Conference on Advanced Cloud and Big Data (CBD), Suzhou, China, 21–22 September 2019. [Google Scholar]
  47. Wei, X.S.; Wang, P.; Liu, L.; Shen, C.; Wu, J. Piecewise classifier mappings: Learning fine-grained learners for novel categories with few examples. IEEE Trans. Image Process. 2019, 28, 6116–6125. [Google Scholar] [CrossRef] [PubMed]
  48. Garcia-Garcia, A.; Orts-Escolano, S.; Oprea, S.; Villena-Martinez, V.; Martinez-Gonzalez, P.; Garcia-Rodriguez, J. A survey on deep learning techniques for image and video semantic segmentation. Appl. Soft Comput. 2018, 70, 41–65. [Google Scholar] [CrossRef]
  49. Schmidhuber, J. Deep learning in neural networks: An overview. Neural Netw. 2015, 61, 85–117. [Google Scholar] [CrossRef] [PubMed]
  50. Khan, S.U.; Haq, I.U.; Rho, S.; Baik, S.W.; Lee, M.Y. Cover the violence: A novel deep-learning-based approach towards violence-detection in movies. Appl. Sci. 2019, 9, 4963. [Google Scholar] [CrossRef]
  51. Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014. [Google Scholar]
  52. Feichtenhofer, C.; Pinz, A.; Zisserman, A. Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  53. Patel, C.I.; Garg, S.; Zaveri, T.; Banerjee, A.; Patel, R. Human action recognition using fusion of features for unconstrained video sequences. Comput. Electr. Eng. 2018, 70, 284–301. [Google Scholar] [CrossRef]
  54. Xu, C.; Wang, K.; Sun, Y.; Guo, S.; Zomaya, A.Y. Redundancy avoidance for big data in data centers: A conventional neural network approach. IEEE Trans. Netw. Sci. Eng. 2018, 7, 104–114. [Google Scholar] [CrossRef]
  55. He, X.; Wang, K.; Huang, H.; Miyazaki, T.; Wang, Y.; Guo, S. Green resource allocation based on deep reinforcement learning in content-centric IoT. IEEE Trans. Emerg. Top. Comput. 2018, 8, 781–796. [Google Scholar] [CrossRef]
  56. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  57. Bengio, Y.; Simard, P.; Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. 1994, 5, 157–166. [Google Scholar] [CrossRef] [PubMed]
  58. Sreelakshmi, K.; Rafeeque, P.C.; Sreetha, S.; Gayathri, E.S. Deep Bi-Directional LSTM Network for Query Intent Detection. Procedia Comput. Sci. 2018, 143, 939–946. [Google Scholar] [CrossRef]
  59. Radman, A.; Suandi, S.A. BiLSTM regression model for face sketch synthesis using sequential patterns. Neural Comput. Appl. 2021, 33, 12689–12702. [Google Scholar] [CrossRef]
  60. Tatsunami, Y.; Taki, M. Sequencer: Deep lstm for image classification. Adv. Neural Inf. Process. Syst. 2022, 35, 38204–38217. [Google Scholar]
  61. Mekruksavanich, S.; Jitpattanakul, A. Lstm networks using smartphone data for sensor-based human activity recognition in smart homes. Sensors 2021, 21, 1636. [Google Scholar] [CrossRef]
  62. Hochreiter, S.; Schmidhuber, J. Long Short-term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  63. Liu, J.; Luo, J.; Shah, M. Recognizing realistic actions from videos “in the wild”. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 1996–2003. [Google Scholar]
  64. Shao, L.; Zhen, X.; Tao, D.; Li, X. Spatio-temporal Laplacian pyramid coding for action recognition. IEEE Trans. Cybern. 2013, 44, 817–827. [Google Scholar] [CrossRef]
  65. Jhuang, H.; Gall, J.; Zuffi, S.; Schmid, C.; Black, M.J. Towards understanding action recognition. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 3192–3199. [Google Scholar]
  66. Liu, J.; Yang, Y.; Shah, M. Learning semantic visual vocabularies using diffusion distance. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 461–468. [Google Scholar]
  67. Shin, J.; Miah, A.S.M.; Suzuki, K.; Hirooka, K.; Hasan, M.A.M. Dynamic Korean Sign Language Recognition Using Pose Estimation Based and Attention-based Neural Network. IEEE Access. 2023, 11, 143501–143513. [Google Scholar] [CrossRef]
  68. Rodriguez, M. Spatio-temporal maximum average correlation height templates in action recognition and video summarization. Doctor Thesis, University of Central Florida, Orlando, FL, USA, 2010. [Google Scholar]
  69. Soomro, K.; Zamir, A.R. Action recognition in realistic sports videos. In Computer Vision in Sports; Springer: Berlin/Heidelberg, Germany, 2015; pp. 181–208. [Google Scholar]
  70. Rodriguez, M.D.; Ahmed, J.; Shah, M. Action mach a spatio-temporal maximum average correlation height filter for action recognition. In Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008; pp. 1–8. [Google Scholar]
  71. Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv 2016, arXiv:1603.04467. [Google Scholar]
  72. Ramasinghe, S.; Rodrigo, R. Action recognition by single stream convolutional neural networks: An approach using combined motion and static information. In Proceedings of the 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), Kuala Lumpur, Malaysia, 3–6 November 2015; pp. 101–105. [Google Scholar]
Figure 1. Proposed method architecture.
Figure 2. Architecture of MobileNetV2.
Figure 3. Architectures of LSTM [61].
Figure 4. Framework diagram of Deep BiLSTM.
Figure 5. Visual depictions of sample action classes from the UCF11, UCF Sports, and J-HMDB datasets.
Figure 6. Accuracy and loss curve of the proposed model for UCF11 dataset.
Figure 7. The confusion matrices attained through the utilization of the proposed model on the UCF11 dataset concerning the true and predicted labels.
Figure 8. The confusion matrices attained through the utilization of the proposed model on the UCF Sports dataset concerning the true and predicted labels.
Figure 9. Accuracy and loss curve of the proposed model for UCF Sports dataset.
Figure 10. Accuracy and loss curve of the proposed model for JHMDB dataset.
Table 1. Comparison analysis of the proposed model with SOTA methods by using the UCF11 or YouTube Actions dataset.

Methods | Accuracy (%)
Hierarchical clustering [14] | 89.7
Single-stream CNN [72] | 93.1
BT-LSTM [16] | 85.3
KFDI [17] | 79.4
Dilated CNN+BiLSTM+RB [18] | 89.01
Local–global features+QSVM [19] | 82.6
3DCNN [21] | 85.2
Deep autoencoder+CNN [22] | 96.2
ViT-ReT [20] | 92.4
Two-stream LSTM [25] | 89.20
Proposed method | 99.2
Table 2. Comparison analysis of the proposed model with SOTA methods by using the UCF Sports dataset.

Methods | Accuracy (%)
QST-CNN-LSTM [23] | 93.2
GMM + KF + GRNN [24] | 89.01
Dilated CNN+BiLSTM+RB [18] | 92.63
Two-stream LSTM [25] | 92.2
Proposed method | 93.3
Table 3. Comparison analysis of the proposed model with SOTA methods by using the JHMDB dataset.

Methods | Accuracy (%)
Deep networks [26] | 67.24
Hybrid deep neural network [27] | 69.0
Multi-stream CNN [28] | 71.17
DD-Net [15] | 65.0
HAR-Depth [29] | 73.1
Two-stream LSTM [25] | 52.70
Proposed method | 76.3
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
